escaping special characters in FASTA for XML - pep.xml

CPAS Forum (Inactive)
ksdoctor  2010-02-12 11:04
Status: Closed
 
Hi,

We are using LabKey 9.3 on a Linux system.

When running Mascot (2.1.03) remotely, some data uploads fail when the results include proteins with special characters in their FASTA header lines.

I have installed cgi/labkeydbmgmt.pl on our Mascot server as per your instructions.

[BTW, there is no Mascot 2.1.3 that I know of; the current Linux version is 2.1.03 and the current Windows version is 2.2.x. Please correct your "mascot setup" documentation. I wasted half an hour on that alone.]

LabKey runs the Mascot queries very well, thanks! But loading the results sometimes fails with this error:

11 Feb 2010 22:15:27,918 INFO : Starting to import spectra from /home/labkey/Masc_Larry/Sample3746_SCXf14_IMACe_lcmsms_1.mzXML
11 Feb 2010 22:15:28,024 INFO : Importing MS/MS results is 32% complete
11 Feb 2010 22:15:28,380 INFO : Importing MS/MS results is 33% complete
11 Feb 2010 22:15:28,889 INFO : Importing MS/MS results is 34% complete
11 Feb 2010 22:15:29,178 ERROR: XMLStreamException in hasNext()
com.ctc.wstx.exc.WstxParsingException: Unexpected close tag </search_hit>; expected </psi>.
 at [row,col {unknown-source}]: [258980,12]
        at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:605)
        at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:461)
        at com.ctc.wstx.sr.BasicStreamReader.reportWrongEndElem(BasicStreamReader.java:3256)

(and stops after a few of these error lines).

I have found this to be solely due to special characters in the FASTA files; replacing those characters immediately fixes the problem.
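
For anyone who wants to locate the affected entries before running a search, here is a minimal sketch that scans FASTA header lines for characters that are special in XML. The function name `find_risky_headers` and the second sample entry are hypothetical, made up for illustration; this is not part of LabKey, Mascot, or the TPP.

```python
import re

# Characters that need escaping inside a double-quoted XML attribute value.
XML_SPECIAL = re.compile(r'[&<>"]')


def find_risky_headers(fasta_lines):
    """Yield (line_number, header) for FASTA headers with XML-special characters.

    The leading '>' that marks a FASTA header line is not counted.
    """
    for number, line in enumerate(fasta_lines, start=1):
        if line.startswith(">") and XML_SPECIAL.search(line[1:]):
            yield number, line.rstrip("\n")


# First header is the IPI entry from this thread; the second entry is a
# hypothetical placeholder with nothing that needs escaping.
sample = [
    ">IPI:IPI00465120.3|TREMBL:P78550;Q6I955 Tax_Id=9606 Gene_Symbol=- 3<beta>-HSD <psi>1 protein\n",
    "MGWSCLVTGAGGFPGQR\n",
    ">EXAMPLE|plain description (hypothetical entry)\n",
    "ACDEFGHIK\n",
]

risky = list(find_risky_headers(sample))
for number, header in risky:
    print(number, header)
```

To scan a real database, pass an open file object instead of the sample list, since iterating over a file yields its lines.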

Here are the cases where LabKey fails to write the imported data into XML properly:
IPI human:
>IPI:IPI00431197.3| .... intron 4&9 variant
>IPI:IPI00465120.3|TREMBL:P78550;Q6I955 Tax_Id=9606 Gene_Symbol=- 3<beta>-HSD <psi>1 protein
>IPI:IPI00816409.1|TREMBL:A0N5T0 Tax_Id=9606 Gene_Symbol=- V<gamma>1 protein (Fragment)
>IPI:IPI00816761.1|TREMBL:Q16366 Tax_Id=9606 Gene_Symbol=CREB1 <alpha>CREB-1 protein (Fragment)

The first fails because of the ampersand followed by a number.
The second fails (and I expect the others do too) because <psi> is read as an XML start tag that has no matching end tag.
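
Both cases come down to the description being written into an XML attribute without escaping. As an illustration only, here is a sketch of the escaping rule using Python's standard library. The helper `search_hit_fragment` is a hypothetical name and the element is reduced to one attribute; this is not the code that LabKey or the converter actually uses.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def search_hit_fragment(protein_descr):
    """Build a one-attribute <search_hit/> element (hypothetical helper).

    quoteattr() escapes &, < and > and adds the surrounding quotes,
    so the description can hold any of the characters from the headers above.
    """
    return "<search_hit protein_descr=%s/>" % quoteattr(protein_descr)


# Descriptions taken from the IPI headers in this post.
descriptions = [
    "intron 4&9 variant",
    "Tax_Id=9606 Gene_Symbol=- 3<beta>-HSD <psi>1 protein",
]

fragments = [search_hit_fragment(d) for d in descriptions]
for fragment in fragments:
    # Parsing the fragment back confirms it is well-formed and that the
    # original description survives the round trip unchanged.
    print(fragment, "->", ET.fromstring(fragment).get("protein_descr"))
```

With escaping in place, the ampersand is written as `&amp;` and the angle brackets as `&lt;` and `&gt;`, so a reader sees one attribute value rather than stray tags.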

They are unusual characters in the FASTA files, so I expect you just missed them when testing the otherwise excellent app.

Can you fix this?

thanks,
Kutbuddin
 
 
ksdoctor responded:  2010-02-12 11:27
Here is a snippet of the all.pep.xml file generated by LabKey where the error lies:

<search_hit hit_rank="1" peptide="MGWSCLVTGAGGFPGQR" peptide_prev_aa="M" peptide_next_aa="I" protein="IPI:IPI00465120.3|TREMBL:P78550;Q6I955" num_tot_proteins="1"
num_matched_ions="9" tot_num_ions="32" calc_neutral_pep_mass="1957.9678" massdiff="-1.1054" num_tol_term="1" num_missed_cleavages="0" is_rejected="0"
protein_descr="Tax_Id=9606 Gene_Symbol=- 3<beta>-HSD <psi>1 protein">
<beta>
<psi>
<beta>
<psi>
<beta>
<psi>
<modification_info modified_peptide="M[147]GWS[167]CLVT[181]GAGGFPGQR">
<mod_aminoacid_mass position="1" mass="147.195505"/>
<mod_aminoacid_mass position="4" mass="167.057204"/>
<mod_aminoacid_mass position="5" mass="161.179000"/>
<mod_aminoacid_mass position="8" mass="181.083804"/>
</modification_info>
<search_score name="ionscore" value="15.65"/>
<search_score name="identityscore" value="44.71"/>
<search_score name="star" value="0"/>
<search_score name="homologyscore" value="27.42"/>
<search_score name="expect" value="40.28"/>
<analysis_result analysis="peptideprophet">
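
A pep.xml file can also be checked for well-formedness outside LabKey. The following is a rough sketch using Python's built-in parser; `first_xml_error` is a hypothetical helper name, and the two test strings are cut-down versions of the search_hit element above. Python's parser may report a different position and message than the Woodstox parser in the log, since the two stop at different points in the malformed input.

```python
import xml.etree.ElementTree as ET


def first_xml_error(xml_text):
    """Return (line, column) of the first well-formedness error, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return err.position
    return None


# Cut-down element with the raw description, as in the snippet above.
broken = '<search_hit protein_descr="3<beta>-HSD <psi>1 protein"></search_hit>'
# The same element with the angle brackets escaped.
fixed = '<search_hit protein_descr="3&lt;beta&gt;-HSD &lt;psi&gt;1 protein"></search_hit>'

print("broken:", first_xml_error(broken))
print("fixed:", first_xml_error(fixed))
```

For a whole file, `ET.parse(path)` raises the same `ParseError`, so the same `position` attribute can be used to find the offending line.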
 
jeckels responded:  2010-02-17 13:44
Hi Kutbuddin,

I updated the Mascot documentation to correctly refer to version 2.1.03.

Thanks for the bug report. This is a problem with the Mascot2XML utility from the TPP that we use to convert the Mascot .dat results to the pepXML file format. I found and fixed the bug. I'm attaching an updated Windows build of the utility, and a patch file you can use to build a new executable on other platforms. I'll submit the patch to the TPP folks, but I'm not sure what release it will go into.

Thanks,
Josh