FASTA parsing and Gene name

CPAS Forum (Inactive)

richard wilson

2013-07-10 22:32

Status: Closed

Dear Labkey,
I'm using the mouse Complete Proteome Set as my FASTA, which contains a mixture of SwissProt and TrEMBL entries. It would be a massive help if the FASTA file were parsed to recognise GN= as the text that specifies the gene name, which could then be viewed in the MS dashboard. As the system is currently set up, very few entries appear under the heading 'Best gene name'.
Can this be achieved?
Thanks
Richard

jeckels responded:	2013-07-19 13:39
Hi Richard, It's possible that the gene names are already available, but aren't being stored in the "Best gene name" field. If you customize your view, find the "First gene name" field that's part of the protein/sequence data. Does that show what you expect? If you can't find it, can you give additional details about what part of the user interface you're looking at? I can then give more specific guidance on how to pull in the "First gene name" field. If you can find the field, and it doesn't have the value you're looking for, can you reply with a few representative entries from the FASTA file you're using? We do attempt to parse out "GN=" values, but small differences in FASTA headers can affect whether we identify the right pieces. Thanks, Josh

richard wilson responded:	2013-07-21 18:38
Assigned To: jeckels
Hi Josh, Thanks for your response. In Protein Prophet I selected "Firstgenename" using Pick Protein Columns, so the fields selected are: GroupNumber, GroupProbability, PctSpectrumIds, Protein, SequenceMass, Peptides, UniquePeptides, AACoverage, BestName, Description, FirstGeneName.

However, when I return to view the protein data, there is still no column or entries for FirstGeneName. This is what is displayed for the first entry in the list of proteins:

    Group:        3
    Prob:         1.0000
    Spectrum Ids: 49.78%
    Protein:      sp|P51942|MATN1_MOUSE
    SequenceMas:  54421.41916
    PP Peps:      59
    PP Unique:    17
    AACoverage:   42.80%
    BestName:     Q80VN5_MOUSE
    Description:  matrilin 1, cartilage matrix protein 1 [Mus musculus]

So I'm not sure what is happening...

The FASTA entries in the mouse Complete Proteome Set are all either tr| or sp| and look like this:

    >tr|A0A504|A0A504_MOUSE MCG116182, isoform CRA_b OS=Mus musculus GN=Smim17 PE=4 SV=1

or:

    >sp|A0AUP1|CC112_MOUSE Coiled-coil domain-containing protein 112 OS=Mus musculus GN=Ccdc112 PE=2 SV=2

Hope this provides enough information, Many thanks Richard
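For readers following along, here is a minimal sketch of how a gene name could be pulled out of headers like the two quoted above. It is not LabKey's parser; the function name `parse_header` and the returned dictionary keys are made up for illustration, and it assumes the identifier uses plain `|` separators and that the gene name contains no whitespace.

```python
import re

# Assumption: the gene name is the run of non-whitespace characters after "GN=".
GN_PATTERN = re.compile(r"\bGN=(\S+)")


def parse_header(header):
    """Split a UniProt-style FASTA header into its identifier parts and gene name.

    Hypothetical helper for illustration only; not part of LabKey Server.
    """
    line = header.lstrip(">").strip()
    tokens = line.split(None, 1)
    identifier = tokens[0] if tokens else ""
    # Identifier looks like "sp|A0AUP1|CC112_MOUSE"; pad so short identifiers do not fail.
    parts = identifier.split("|")
    db, accession, entry_name = (parts + [None, None, None])[:3]
    match = GN_PATTERN.search(line)
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "gene_name": match.group(1) if match else None,
    }


if __name__ == "__main__":
    example = (
        ">sp|A0AUP1|CC112_MOUSE Coiled-coil domain-containing protein 112 "
        "OS=Mus musculus GN=Ccdc112 PE=2 SV=2"
    )
    print(parse_header(example))
```

Headers without a GN= field simply come back with `gene_name` set to `None`, which makes it easy to count how many entries in a FASTA file would be left without a gene name.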

richard wilson responded:	2013-08-26 21:22
Hi Josh, I'm still keen to resolve the issue around parsing the gene names into best gene name or one of the other viewable fields. Have you had any further thoughts on how to tackle this? It would be fantastic for many of our users working with mouse and human "complete proteome set" databases. Many thanks Richard

jeckels responded:	2013-08-29 14:15
Hi Richard, Thanks for checking back in. Sorry for the long delay. It sounds like you may be using an older version of the view that does not support FirstGeneName and some other columns. Try choosing "Standard" from the grouping drop down and clicking Go. Then, using the Views menu above the peptide list, choose ProteinProphet. Then, click Views->Customize View. Scroll down to find the "ProteinProphet Data" node. Expand it, and expand "Protein Group" and "First Protein". Select the "Best Gene Name", "First Gene Name", and any other columns that might be of interest. Click Save, and call it "ProteinProphet with gene name" or similar. Does that show you what you'd expect to see? Thanks, Josh

richard wilson responded:	2013-08-29 20:55
Hi Josh, Thank you for the response and guidance. I have followed your suggestions, but unfortunately the entries for First protein Best gene name and First gene name are still unpopulated. Could it be that I need to upgrade the labkey server to a more recent version? Thanks again for all your help, Regards Richard
Labkey Screenshot.docx

jeckels responded:	2013-08-30 13:15
Hi Richard, I did some more investigation. It turns out that we aren't recognizing "GN=" as a prefix for gene names. Instead, the current code (which has been in place for some time) is expecting "Gene_Symbol=". I've made a change to our parsing so that we'll correctly detect "GN=" in the next version, 13.3. In the meantime, you could create a copy of the FASTA file that uses Gene_Symbol instead and import a data file that uses it. I believe that will be sufficient to populate the FirstGeneName column, even for searches that don't use the modified FASTA file. From your screenshot, it looks like you're running an older version of the server. I'd encourage you to upgrade sometime soon (maybe with 13.3 later this year) as we do continue to improve our FASTA header parsing from release to release. Thanks, Josh
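The workaround described above (a copy of the FASTA file with Gene_Symbol= in place of GN=) could be scripted roughly as follows. This is only a sketch: the function names and the file names in the usage note are hypothetical, and it assumes GN= always follows whitespace in a header line, as in the entries quoted earlier in this thread. Sequence lines are copied through untouched.

```python
import re

# Only rewrite "GN=" when it starts a field, i.e. is preceded by whitespace.
GN_FIELD = re.compile(r"(?<=\s)GN=")


def rewrite_header(line):
    """Return the line with GN= replaced by Gene_Symbol= if it is a FASTA header."""
    if line.startswith(">"):
        return GN_FIELD.sub("Gene_Symbol=", line)
    return line  # sequence lines are left as-is


def rewrite_fasta(src_path, dst_path):
    """Write a copy of src_path to dst_path with rewritten header lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(rewrite_header(line))
```

A call such as `rewrite_fasta("mouse_cps.fasta", "mouse_cps_gene_symbol.fasta")` (both file names are placeholders) leaves the original file intact, so it can be restored once the server is upgraded to a version that recognises GN= directly.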

richard wilson responded:	2013-09-02 16:45
Hi Josh, I'm sorry... the saga continues. I've replaced GN= with Gene_Symbol= and the parsing recognises the gene names. However, I repeated a search using the updated FASTA file and the search engine still doesn't see the gene entries. See screenshots attached. I'm sure we're very close to an answer! BW Richard
Labkey Screenshot_2.docx

richard wilson responded:	2013-09-10 18:52
Hi Josh, Good news, the gene names are now displayed (using the Protein (Legacy) view). Not only that, it makes sample run comparisons a lot more straightforward when normalising protein groups (i.e. only one entry per gene name). Many thanks Richard
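The "one entry per gene name" normalisation mentioned above could look something like the sketch below when working with exported rows outside the server. Everything here is an assumption for illustration: the row keys (`protein`, `gene_name`, `group_probability`), the rule of keeping the highest-probability row per gene, and the fallback to the protein identifier when no gene name is present. It does not describe how LabKey itself compares runs.

```python
def collapse_by_gene(rows):
    """Keep one row per gene name, choosing the row with the highest group probability.

    Rows with no gene name are keyed by their protein identifier instead,
    so they are kept rather than silently dropped.
    """
    best = {}
    for row in rows:
        key = row.get("gene_name") or row["protein"]
        current = best.get(key)
        if current is None or row["group_probability"] > current["group_probability"]:
            best[key] = row
    return best


if __name__ == "__main__":
    # Hypothetical rows, not taken from a real run.
    rows = [
        {"protein": "sp|A0AUP1|CC112_MOUSE", "gene_name": "Ccdc112", "group_probability": 0.95},
        {"protein": "tr|HYPOTHETICAL|ENTRY_MOUSE", "gene_name": "Ccdc112", "group_probability": 0.80},
        {"protein": "tr|A0A504|A0A504_MOUSE", "gene_name": None, "group_probability": 0.99},
    ]
    for key, row in collapse_by_gene(rows).items():
        print(key, row["protein"], row["group_probability"])
```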
