Annotation of large protein databases

CPAS Forum (Inactive)
Annotation of large protein databases weiclav  2012-01-03 23:16
Status: Closed
 
Dear all,

I have encountered problems annotating large databases in LabKey Server (the whole NCBInr database in FASTA format, about 8 GB; UniRef100 in FASTA and XML format, about 8 GB and 22 GB, respectively). When processing a database in FASTA format, LabKey Server at least reports a Java heap space error ("java.lang.OutOfMemoryError: Java heap space"). When I try to annotate a database in XML format, I get no error report at all (nothing in labkey.log or labkey-errors.log) and the processing simply stops (the CPU usage of tomcat5.exe drops to zero and the RAM used by the process also goes down). I was able to reproduce this problem on several computers.

I tried increasing the Java heap space and "MaxPermSize" up to 1400 MB and 256 MB, respectively, but without any luck: the behaviour stayed the same. I recently moved to 64-bit Java and tried to increase the heap even more. 3 GB was not sufficient (for UniRef100.fasta I got the same Java heap space error as with 32-bit Java), but a 6 GB heap proved to be sufficient without any other changes, e.g. to "MaxPermSize" (at least for UniRef100.fasta; I have not tried the other large databases yet).
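
For completeness, I am setting these through the standard JVM flags in the Tomcat service configuration, e.g.:

    -Xmx1400m -XX:MaxPermSize=256m    (32-bit tests)
    -Xmx6g                            (the 64-bit run that finally worked)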

From the labkey.log file I found out that for UniRef100.fasta, LabKey Server tries to process 20,000 sequences per batch. On the other hand, when I annotated smaller databases (e.g. UniProt.fasta(.xml)), the annotation algorithm took 5,000 sequences per batch...

Here are my questions:
1) Is it possible to modify the number of sequences processed within a single batch?
2) Is the number of sequences per batch somehow connected to the total number of sequences in the database, or to something else?
3) Is it possible, by modifying some other parameter of the JVM or LabKey Server, to get to a state where e.g. 2 GB of Java heap space would be enough for an 8 GB database?
4) Is there any other place where I could find a log entry for the problems that occurred while annotating a large database in XML format? (I am not sure whether that problem was also in the Java heap size...)

Thanks in advance for your kind responses!

Regards,
David

PS: Right now, while the annotation continues, the tomcat5.exe process uses about 2.6 GB of RAM; to finish the first batch it needed more, about 4 GB. So maybe the number of sequences per batch does not matter and some other initial step of the database annotation causes the problem? However, about 4.4 GB is still used by the Java heap according to the "Memory usage" page...
 
 
jeckels responded:  2012-01-04 10:05
Hi David,

Before digging into your questions, do you have the full error message from when your server is running out of memory? It would look something like this:

java.lang.OutOfMemoryError: Java heap space
    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
    at java.lang.StringBuffer.<init>(StringBuffer.java:91)
    at org.labkey.api.reader.AbstractTabLoader.readFields(AbstractTabLoader.java:207)
    at org.labkey.api.reader.AbstractTabLoader.access$200(AbstractTabLoader.java:55)
    at org.labkey.api.reader.AbstractTabLoader$TabLoaderIterator.hasNext(AbstractTabLoader.java:495)
    at org.labkey.api.reader.DataLoader.load(DataLoader.java:354)
    at org.labkey.study.model.StudyManager.parseData(StudyManager.java:2219)
    at org.labkey.study.model.StudyManager.importDatasetData(StudyManager.java:2300)
    at org.labkey.study.pipeline.DatasetImportRunnable.run(DatasetImportRunnable.java:163)

That will help us track down exactly where it's running out of memory, and therefore give better information about possible workarounds.

Thanks,
Josh
 
weiclav responded:  2012-01-05 09:04
Hi Josh,

thanks for your answer!

Enclosed please find the errors I collected. I did not realize the reported errors differ, so I am now unable to say which database was being annotated or how much Java heap space was allocated in each case, sorry :(. I am sure they were all encountered using 64-bit Java. Do you see anything that could explain any of the behaviour mentioned above?

Btw: I will run a few tests to report the errors properly (hopefully including the heap dump files). What tests would you recommend so that you get as much info as possible? (I will try loading different databases - NCBI.fasta and UniRef100.fasta; 32-bit and 64-bit Java; a Java heap of 1024 MB and 2048 MB for 32-bit and 64-bit, respectively. I will reinstall LabKey Server prior to each test.)

Thanks again for your response!

David
 
weiclav responded:  2012-01-06 12:09
Hi once more,

I performed a few tests and here are the error reports. (I also have memory dump files, but only for 64-bit Java. I could try to upload them if that would be of any help; if you prefer, I could repeat the tests with 32-bit Java as well...)

So far I have completed three tests; the conditions are stated before each error report. I hope this helps. I will update the post once I finish the last test planned for now.

Please, let me know if you need any other info...

Thanks!
David

EDIT: Added results for Test 4 - the error starts with "java.lang.OutOfMemoryError: GC overhead limit exceeded", which is something different than in the three previous tests...
 
jeckels responded:  2012-01-06 15:54
Hi David,

Thanks for the additional info. I've taken a look and opened this bug to track this problem:

https://www.labkey.org/issues/home/Developer/issues/details.view?issueId=13768

The problem with loading the FASTA file is that the server runs out of memory when checking whether there are any duplicate protein names in the file. It's only a rough estimate, but I'd guess that with a 64-bit VM and a heap of 3 or 4 GB you could get past this particular issue. I can't think of any parameters to tweak that would let you get by with the current code and a smaller heap, though. It's possible for us to use a different data structure to check for duplicates, which should greatly reduce the memory usage, and that's what I suggest in the bug; see the sketch below.
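
To give a rough idea of the kind of change I have in mind, here is a sketch (not the actual FastaDbLoader code; the class name and the two-hash scheme are made up for illustration). Rather than holding every protein name in memory as a String, the loader could keep one compact hash per name:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch only: scan a FASTA file for duplicate protein names while
    // keeping an 8-byte hash per name instead of the full String.
    public class FastaDuplicateCheck
    {
        public static void main(String[] args) throws IOException
        {
            Set<Long> seen = new HashSet<Long>();
            BufferedReader reader = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = reader.readLine()) != null)
            {
                if (!line.startsWith(">"))
                    continue; // only header lines carry names
                // The name is the first whitespace-delimited token of the header
                String name = line.substring(1).trim().split("\\s+")[0];
                // Combine two independent 32-bit hashes into one long. A rare
                // collision would show up as a false duplicate, so real hits
                // should be confirmed with a second pass.
                long hash = ((long) name.hashCode() << 32) | (altHash(name) & 0xFFFFFFFFL);
                if (!seen.add(hash))
                    System.out.println("Possible duplicate: " + name);
            }
            reader.close();
        }

        // Same shape as String.hashCode() but a different multiplier, so the
        // two hashes are unlikely to collide at the same time.
        private static int altHash(String s)
        {
            int h = 0;
            for (int i = 0; i < s.length(); i++)
                h = 37 * h + s.charAt(i);
            return h;
        }
    }

A boxed HashSet<Long> still carries per-entry overhead, so a primitive long set (e.g. from a library such as Trove) would reduce memory further, but even this version avoids keeping millions of name Strings alive at once.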

As for the XML import, what do you see under Admin->Manage Site->Admin Console->Protein Databases under the "Protein Annotations Loaded" header? Is there an entry for the XML file? How many records did it process, and what was the final state?

Thanks,
Josh
 
weiclav responded:  2012-01-08 02:40
Hi Josh,

there was no record under "Protein Annotations Loaded"... The CPU was busy with the tomcat5.exe process after the annotation started, but after approximately 5 minutes the CPU utilization by Tomcat dropped to zero. I was not able to find any entry in labkey.log or the error logs. It looked as if I had never even tried to annotate the database... Interestingly, at least to me, the RAM utilization by Tomcat was relatively low - approximately 200 MB.

I did more testing and the behaviour was the same for 32-bit and 64-bit Java with the Java heap set to 1 GB and 2 GB, respectively. Again, this database in XML form is over 22 GB (uniref100.xml, version from 12-12-2011). The test with the uniprot_sprot.xml database (about 5 GB) from the same release went, or at least started, as expected in both scenarios (32-bit and 64-bit). (I did not let the annotation process finish... However, it had processed about 100,000 sequences out of more than 500,000. My experience is that once LabKey Server shows in the log file that it has successfully processed the first batch of sequences, it finishes the job :).)

Enclosed please find labkey-action-stats.tsv and labkey-query-stats.tsv, which changed during the attempt to annotate the uniref100.xml database with 64-bit Java and a 2 GB Java heap. They were the only things I saw changing during the annotation process. The files were copied after the CPU utilization dropped.

Next, I attach files containing the "currently running threads" at different stages - before the annotation started, after it started, before it stopped, and after the annotation process "stopped". I am not familiar with thread dumps, so I cannot see anything in them myself :). Again, this is for 64-bit Java with a 2 GB Java heap. (The threads were monitored during the second attempt to annotate the uniref100.xml database; I did not restart the server after the first attempt.)

Hope this helps...

Is there any other place where I can see the progress of the annotation?

Thanks again for your time!

David
 
jeckels responded:  2012-01-09 17:15
Hi David,

I've made the change for 12.1 that lets us validate the NCBInr database in FASTA format (9.2 GB for the version that I found) with a 2 GB heap. I haven't let it run the full import yet, but at least it gets past the error that you're seeing.

From the thread dumps that you attached, the thread (JobThread-4.1 in your files) has clearly stopped working on the import.

I'd suggest changing your log level to collect more information:

Shut down Tomcat.
Edit <LABKEY_HOME>/labkeyWebapp/WEB-INF/classes/log4j.xml, and change:

    <logger name="org.labkey">
        <level value="info"/>
    </logger>

to:

    <logger name="org.labkey">
        <level value="debug"/>
    </logger>

Then restart your server and you should get additional messages in labkey.log.

Thanks,
Josh
 
weiclav responded:  2012-01-09 23:33
Hi Josh,

thanks for solving "the problem" with the NCBI database annotation. I will try it out once the new version is released.

The debug log file is attached. I copied out only the part concerning the annotation process. If you also need the log from the server startup etc., please let me know.

Thanks,
David
 
jeckels responded:  2012-01-11 17:58
Hi David,

From your log, my guess is that the XML file isn't in the same format that we're expecting. I'm running low on drive space right now so I can't check the UniRef100 file itself at the moment. I'll have a secondary drive in a couple of days, so I'll inspect it at that point and respond with an update.

Thanks,
Josh
 
weiclav responded:  2012-01-12 01:00
Hi Josh,

thanks for the info. I am looking forward to your results.

Could you please look at the XML schemas on the uniprot.org site? I found some differences between the XML schemas used for the SwissProt and TrEMBL databases and the one used for the UniRef databases. Does LabKey account for both types of XML schema?

LabKey Server is able to annotate the SwissProt database, so I will also try to annotate the TrEMBL database and post the results...

Thanks again!
David
 
weiclav responded:  2012-01-12 13:03
Hi again,

OK, so annotation of the TrEMBL database seems to work in LabKey... Please see the attached log file from the annotation process. I did not use the whole TrEMBL database, "only" about 1/4 of it (about 23 GB), and let LabKey Server annotate it. LabKey Server starts the process as expected, and the database is also present in the respective table in Protein Database Admin. I did not let the process finish.

So maybe the problem lies with the XML schema in the case of UniRef100?

Maybe I will also try to load UniRef90 and UniRef50 to see whether they show the same problem...

Please let me know if you have some info!

Thanks.

David
 
weiclav responded:  2012-01-13 10:16
I can confirm that the same problem probably also affects UniRef90 and UniRef50 (the logs look similar to those observed for UniRef100).

David
 
jeckels responded:  2012-01-13 14:01
Hi David,

Thanks for your patience and your investigations. With my new storage online, I've confirmed that the UniRef XML files are in a different format than we currently know how to parse.
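
For reference, the two formats differ right from the root element. Heavily simplified (and from memory, so treat the details as approximate), a UniProt XML entry looks like:

    <uniprot xmlns="http://uniprot.org/uniprot">
        <entry dataset="Swiss-Prot">
            <accession>P12345</accession>
            ...
        </entry>
    </uniprot>

while UniRef wraps everything in cluster entries whose members live under different elements:

    <UniRef100 xmlns="http://uniprot.org/uniref">
        <entry id="UniRef100_P12345">
            <representativeMember>
                ...
            </representativeMember>
        </entry>
    </UniRef100>

so a parser written against the UniProt schema never finds the elements it expects.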

There are multiple changes we could make:

First, we could give better feedback if you try to load a file that isn't of the expected format. There is already a bug open on this:

https://www.labkey.org/issues/home/Developer/issues/details.view?issueId=11428

Second, we could support parsing the UniRef format in addition to the UniProt format. I'll open a bug to track this, but we don't have it prioritized for inclusion in any particular release at this point.

https://www.labkey.org/issues/home/Developer/issues/details.view?issueId=13810

Thanks,
Josh
 
weiclav responded:  2012-01-22 13:29
Hi Josh,

thanks for the info.

I tried to finish the import of the UniRef100 database in FASTA format and was not successful. Below is the log report of the error.

Thanks,
David

....
INFO FastaDbLoader 2012-01-20 01:57:32,005 AnnotLoader1 : Importing FASTA file sequences: 52% complete
ERROR AnnotationUploadManager 2012-01-20 01:57:38,279 AnnotLoader1 : Failed to parse file d:/LabKey Server/files/databases/UniRef100/uniref100_2011-12-12.fasta
java.lang.StringIndexOutOfBoundsException: String index out of range: 0
    at java.lang.String.charAt(Unknown Source)
    at org.labkey.ms2.protein.organism.GuessOrgByParsing.guess(GuessOrgByParsing.java:118)
    at org.labkey.ms2.protein.FastaDbLoader.preProcessSequences(FastaDbLoader.java:261)
    at org.labkey.ms2.protein.FastaDbLoader.processMouthful(FastaDbLoader.java:537)
    at org.labkey.ms2.protein.FastaDbLoader.parse(FastaDbLoader.java:722)
    at org.labkey.ms2.protein.FastaDbLoader.parseFile(FastaDbLoader.java:177)
    at org.labkey.ms2.protein.AnnotationUploadManager$AnnotationLoadJob.run(AnnotationUploadManager.java:134)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
...
 
jeckels responded:  2012-01-24 09:11
Hi David,

Sorry for the trouble. It looks like there are one or more proteins in the FASTA whose description contains "[]". The parsing code incorrectly expects there to be something between the brackets. It's a simple fix, and I've made it for 12.1. Until then, you could potentially do a simple Find/Replace on the FASTA to insert some character between the brackets; see the sketch below.
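
Since the file is far too large for most editors, a streaming replace may be easier than a regular Find/Replace. Here is a minimal sketch (the file names are placeholders and "[unknown]" is just an arbitrary filler string):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    // Sketch: rewrite "[]" as "[unknown]" on FASTA header lines only,
    // streaming so the multi-GB file never has to fit in memory.
    public class FixEmptyBrackets
    {
        public static void main(String[] args) throws IOException
        {
            BufferedReader in = new BufferedReader(new FileReader("uniref100.fasta"));
            PrintWriter out = new PrintWriter(new FileWriter("uniref100_fixed.fasta"));
            String line;
            while ((line = in.readLine()) != null)
            {
                if (line.startsWith(">")) // sequence lines stay untouched
                    line = line.replace("[]", "[unknown]");
                out.println(line);
            }
            out.close();
            in.close();
        }
    }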

Thanks,
Josh
 
weiclav responded:  2012-01-25 12:43
Hi Josh,

thanks for the quick response and the fix. I will try the find/replace workaround now and see whether the annotation finishes.

Thanks,
David