jeckels responded:
2012-01-04 10:05
Hi David,
Before digging into your questions, do you have the full error message from when your server is running out of memory? It would look something like this:
java.lang.OutOfMemoryError: Java heap space
    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
    at java.lang.StringBuffer.<init>(StringBuffer.java:91)
    at org.labkey.api.reader.AbstractTabLoader.readFields(AbstractTabLoader.java:207)
    at org.labkey.api.reader.AbstractTabLoader.access$200(AbstractTabLoader.java:55)
    at org.labkey.api.reader.AbstractTabLoader$TabLoaderIterator.hasNext(AbstractTabLoader.java:495)
    at org.labkey.api.reader.DataLoader.load(DataLoader.java:354)
    at org.labkey.study.model.StudyManager.parseData(StudyManager.java:2219)
    at org.labkey.study.model.StudyManager.importDatasetData(StudyManager.java:2300)
    at org.labkey.study.pipeline.DatasetImportRunnable.run(DatasetImportRunnable.java:163)
That will help us track down exactly where it's running out of memory, and therefore give better information about possible workarounds.
Thanks,
Josh
weiclav responded:
2012-01-05 09:04
Hi Josh,
thanks for your answer!
Enclosed please find the errors I found. I did not realize there were differences in the reported errors, so I cannot say now which database was being annotated or how much Java heap space was allocated, sorry :(. I am sure they were all encountered using 64-bit Java. Do you see anything that could explain any of the mentioned behaviour?
Btw: I will try a few tests to report the errors correctly (hopefully including the heap dump files). What tests would you recommend so you get as much info as possible? (I will try to load different databases - NCBI.fasta, UniRef100.fasta; 32-bit and 64-bit Java; Java heap sizes of 1024 MB and 2048 MB for 32-bit and 64-bit, respectively. I will reinstall the LabKey server prior to each test.)
Thanks again for your response!
David
|
|
weiclav responded:
2012-01-06 12:09
Hi once more,
I performed a few tests and here are the error reports. (I also have memory dump files, but only for 64-bit Java. I could try to upload them if that would be of any help; if you would prefer, I could repeat the tests with 32-bit Java as well...)
So far I have completed three tests; the conditions are noted before each error report. I hope it helps. I will update the post once I finish the last test I have planned for now.
Please let me know if you need any other info...
Thanks!
David
EDIT: added results for Test4 - the error starts with "java.lang.OutOfMemoryError: GC overhead limit exceeded", which is something different than in the three previous tests...
jeckels responded:
2012-01-06 15:54
Hi David,
Thanks for the additional info. I've taken a look and opened this bug to track this problem:
https://www.labkey.org/issues/home/Developer/issues/details.view?issueId=13768
The problem with loading the FASTA file is that the server runs out of memory while checking whether there are any duplicate protein names in the file. It's only a rough guess, but with a 64-bit VM and a heap of 3 or 4 GB you could probably get past this particular issue. I can't think of any parameters to tweak that would let you get by with the current code and a smaller heap, though. It is possible for us to use a different data structure to check for duplicates, which should greatly reduce the memory usage, and that is what I suggest in the bug.
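For reference, a minimal sketch of raising the Tomcat heap along those lines - CATALINA_OPTS and the JVM flags are standard Tomcat/HotSpot conventions rather than LabKey-specific settings, and the 4 GB figure is just the rough guess above:

```shell
# Raise Tomcat's maximum heap to 4 GB and write a heap dump automatically
# if an OutOfMemoryError still occurs. Set this before starting Tomcat
# (e.g. in bin/setenv.sh); the exact mechanism depends on your installation.
export CATALINA_OPTS="$CATALINA_OPTS -Xmx4g -XX:+HeapDumpOnOutOfMemoryError"
```

When Tomcat runs as a Windows service, the equivalent settings are usually made through the Tomcat service configuration dialog rather than an environment variable.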
As for the XML import, what do you see under Admin->Manage Site->Admin Console->Protein Databases under the "Protein Annotations Loaded" header? Is there an entry for the XML file? How many records did it process, and what was the final state?
Thanks,
Josh
weiclav responded:
2012-01-08 02:40
Hi Josh,
there was no record under "Protein Annotations Loaded"... The CPU was busy with the tomcat5.exe process after the annotation started, but after approximately 5 minutes the CPU utilization by Tomcat went down to zero. I was not able to find any entry in labkey.log or the error logs. It looked like I had not even tried to annotate the database... Interestingly, at least to me, the RAM utilization by Tomcat was relatively low - approximately 200 MB.
I did more testing, and this behaviour was the same for 32-bit and 64-bit Java with the Java heap space set to 1 GB and 2 GB, respectively. I repeat that this database in XML is over 22 GB (uniref100.xml, version from 12-12-2011). The test with the uniprot_sprot.xml database (about 5 GB) from the same release went, or at least started, as expected in both scenarios (32-bit and 64-bit). (I did not let the annotation process finish... However, it annotated about 100,000 sequences out of more than 500,000. My experience is that once LabKey Server shows in the log file that it successfully processed the first batch of sequences, it finishes the job :).)
Enclosed please find labkey-action-stats.tsv and labkey-query-stats.tsv, which changed during the attempt to annotate the uniref100.xml database with 64-bit Java and a 2 GB Java heap. They were the only things I saw changing during the annotation process. The files were copied after the CPU utilization dropped.
Next, I have attached files containing the "currently running threads" at different states - before the annotation started, after it started, before the annotation stopped, and after the annotation process "stopped". I am not familiar with thread descriptions, so I cannot see anything meaningful there :). Again, this is for 64-bit Java with a 2 GB Java heap. (The threads were monitored during the second try to annotate the uniref100.xml database. I did not restart the server after the first try.)
Hope this helps...
Is there any other place where I can see the progress of the annotation?
Thanks again for your time!
David
jeckels responded:
2012-01-09 17:15
Hi David,
I've made the change for 12.1 that lets us validate the NCBInr database in FASTA format (9.2GB for the version that I found) with a 2GB heap. I haven't let it try doing the full import yet, but at least it's getting past the error that you're seeing.
From the thread dumps that you attached, the thread (JobThread-4.1 in your files) has clearly stopped working on the import.
I'd suggest changing your log level to collect more information:
Shut down Tomcat.
Edit <LABKEY_HOME>/labkeyWebapp/WEB-INF/classes/log4j.xml, and change:
<logger name="org.labkey">
<level value="info"/>
</logger>
to:
<logger name="org.labkey">
<level value="debug"/>
</logger>
Then restart your server and you should get additional messages in labkey.log.
Thanks,
Josh
weiclav responded:
2012-01-09 23:33
Hi Josh,
thanks for solving "the problem" with the NCBI database annotation. I will try it out once the new version is out.
The debug log file is attached. I copied out only the part regarding the annotation process. If you also need the log for the server startup etc., please let me know.
Thanks,
David
jeckels responded:
2012-01-11 17:58
Hi David,
From your log, my guess is that the XML file isn't in the format that we're expecting. I'm running low on drive space right now, so I can't check the UniRef100 file itself at the moment. I'll have a secondary drive in a couple of days, so I'll inspect the file then and respond with an update.
Thanks,
Josh
weiclav responded:
2012-01-12 01:00
Hi Josh,
thanks for the info. I am looking forward to your results.
Did you look at the XML schemas on the uniprot.org site? I found some differences between the XML schemas used for the SwissProt and TrEMBL databases and the one used for the UniRef databases. Does LabKey account for both types of XML schemas?
LabKey Server is able to annotate the SwissProt database, so I will also try to annotate the TrEMBL database and post the results...
Thanks again!
David
weiclav responded:
2012-01-12 13:03
Hi again,
ok, so annotation of the TrEMBL database seems to be OK in LabKey... Please see the log file from the annotation process. I did not use the whole TrEMBL database, "only" about 1/4 of it (about 23 GB), and let LabKey Server annotate it. LabKey Server starts the process as expected, and the database is also present in the respective table in Protein Database Admin. I did not let the process finish.
So maybe there is something wrong with the XML schema in the case of UniRef100?
Maybe I will also try to load UniRef90 and UniRef50 to see whether there is the same problem...
Please let me know if you have any info!
Thanks.
David
weiclav responded:
2012-01-13 10:16
I can confirm that there is probably the same problem with UniRef90 and UniRef50 (the logs look similar to those observed for UniRef100).
David
jeckels responded:
2012-01-13 14:01
weiclav responded:
2012-01-22 13:29
Hi Josh,
thanks for the info.
I tried to finish the import of the uniref100 database in FASTA format and was not successful. Below is the log report of the error.
Thanks,
David
....
INFO FastaDbLoader 2012-01-20 01:57:32,005 AnnotLoader1 : Importing FASTA file sequences: 52% complete
ERROR AnnotationUploadManager 2012-01-20 01:57:38,279 AnnotLoader1 : Failed to parse file d:/LabKey Server/files/databases/UniRef100/uniref100_2011-12-12.fasta
java.lang.StringIndexOutOfBoundsException: String index out of range: 0
    at java.lang.String.charAt(Unknown Source)
    at org.labkey.ms2.protein.organism.GuessOrgByParsing.guess(GuessOrgByParsing.java:118)
    at org.labkey.ms2.protein.FastaDbLoader.preProcessSequences(FastaDbLoader.java:261)
    at org.labkey.ms2.protein.FastaDbLoader.processMouthful(FastaDbLoader.java:537)
    at org.labkey.ms2.protein.FastaDbLoader.parse(FastaDbLoader.java:722)
    at org.labkey.ms2.protein.FastaDbLoader.parseFile(FastaDbLoader.java:177)
    at org.labkey.ms2.protein.AnnotationUploadManager$AnnotationLoadJob.run(AnnotationUploadManager.java:134)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
...
jeckels responded:
2012-01-24 09:11
Hi David,
Sorry for the trouble. It looks like one or more proteins in the FASTA contain "[]" in the description. The parsing code incorrectly expects there to be something between the brackets. It's a simple fix, and I've made it for 12.1. Until then, you could potentially do a simple find/replace on the FASTA to insert some character between the brackets.
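As a minimal sketch of that find/replace workaround - the placeholder text and filenames here are examples only, not anything LabKey requires:

```shell
# Insert a placeholder between the empty brackets so the organism-guessing
# code sees a non-empty string. Filenames are illustrative; any non-empty
# placeholder between the brackets should do.
sed 's/\[\]/[unknown]/g' uniref100_2011-12-12.fasta > uniref100_fixed.fasta
```

For a file this large, a streaming tool like sed avoids loading the whole FASTA into memory, which an editor-based find/replace might not.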
Thanks,
Josh
weiclav responded:
2012-01-25 12:43
Hi Josh,
thanks for the quick response and the fix. I will try the find/replace workaround now and see if the annotation finishes.
Thanks,
David