FASTA parsing and Gene name

CPAS Forum (Inactive)
FASTA parsing and Gene name richard wilson  2013-07-10 22:32
Status: Closed
 
Dear Labkey,
I'm using the mouse Complete Proteome Set as my FASTA, which contains a mixture of SwissProt and TrEMBL entries. It would be a massive help if the FASTA file were parsed to recognise GN= as the text that specifies the gene name, which could then be viewed in the MS dashboard. As the system is currently set up, very few entries appear under the heading 'Best gene name'.
Can this be achieved?
Thanks
Richard
 
 
jeckels responded:  2013-07-19 13:39
Hi Richard,

It's possible that the gene names are already available, but aren't being stored in the "Best gene name" field. If you customize your view, find the "First gene name" field that's part of the protein/sequence data. Does that show what you expect?

If you can't find it, can you give additional details about what part of the user interface you're looking at? I can then give more specific guidance on how to pull in the "First gene name" field.

If you can find the field, and it doesn't have the value you're looking for, can you reply with a few representative entries from the FASTA file you're using? We do attempt to parse out "GN=" values, but small differences in FASTA headers can make a difference in our being able to identify the right pieces.

Thanks,
Josh
 
richard wilson responded:  2013-07-21 18:38
Assigned To: jeckels
Hi Josh,

Thanks for your response.

In ProteinProphet I selected "FirstGeneName" using Pick Protein Columns. So the fields selected are: GroupNumber, GroupProbability, PctSpectrumIds, Protein, SequenceMass, Peptides, UniquePeptides, AACoverage, BestName, Description, FirstGeneName
However, when I return to view the protein data, there is still no column or entries for FirstGeneName. This is what is displayed for the first entry in the list of proteins:

Group  Prob    Spectrum Ids  Protein                SequenceMass  PP Peps  PP Unique  AACoverage  BestName
3      1.0000  49.78%        sp|P51942|MATN1_MOUSE  54421.41916   59       17         42.80%      Q80VN5_MOUSE

Description
matrilin 1, cartilage matrix protein 1 [Mus musculus]

So I'm not sure what is happening...


The FASTA entries in the mouse Complete Proteome Set are all either tr| or sp| and look like this:

>tr|A0A504|A0A504_MOUSE MCG116182, isoform CRA_b OS=Mus musculus GN=Smim17 PE=4 SV=1
or:
>sp|A0AUP1|CC112_MOUSE Coiled-coil domain-containing protein 112 OS=Mus musculus GN=Ccdc112 PE=2 SV=2
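
For readers following along, here is a minimal Python sketch of how the gene name could be pulled out of headers like the two above. It is not LabKey's parser; the function name `gene_name` and the regular expression are illustrative assumptions, and it only handles a `GN=` tag followed by a single whitespace-free token.

```python
import re

# Hypothetical pattern (not LabKey's code): "GN=" followed by one
# whitespace-free token, as in UniProt-style FASTA headers.
GENE_PATTERN = re.compile(r"\bGN=(\S+)")


def gene_name(header):
    """Return the gene name found in a FASTA header line, or None if absent."""
    match = GENE_PATTERN.search(header)
    return match.group(1) if match else None


headers = [
    ">tr|A0A504|A0A504_MOUSE MCG116182, isoform CRA_b OS=Mus musculus GN=Smim17 PE=4 SV=1",
    ">sp|A0AUP1|CC112_MOUSE Coiled-coil domain-containing protein 112 OS=Mus musculus GN=Ccdc112 PE=2 SV=2",
]

# Extract the gene name from each example header.
names = [gene_name(h) for h in headers]
print(names)
```

Headers without a `GN=` tag give `None`, which roughly corresponds to an empty gene name column.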

Hope this provides enough information,

Many thanks

Richard
 
richard wilson responded:  2013-08-26 21:22
Hi Josh,

I'm still keen to resolve the issue around parsing the gene names into best gene name or one of the other viewable fields. Have you had any further thoughts on how to tackle this? It would be fantastic for many of our users working with mouse and human "complete proteome set" databases.

Many thanks

Richard
 
jeckels responded:  2013-08-29 14:15
Hi Richard,

Thanks for checking back in. Sorry for the long delay.

It sounds like you may be using an older version of the view that does not support FirstGeneName and some other columns.

Try choosing "Standard" from the grouping drop-down and clicking Go. Then, using the Views menu above the peptide list, choose ProteinProphet. Then, click Views->Customize View. Scroll down to find the "ProteinProphet Data" node. Expand it, and expand "Protein Group" and "First Protein". Select the "Best Gene Name", "First Gene Name", and any other columns that might be of interest. Click Save, and call it "ProteinProphet with gene name" or similar.

Does that show you what you'd expect to see?

Thanks,
Josh
 
richard wilson responded:  2013-08-29 20:55
Hi Josh,

Thank you for the response and guidance. I have followed your suggestions, but unfortunately the entries for First protein Best gene name and First gene name are still unpopulated.

Could it be that I need to upgrade the labkey server to a more recent version?

Thanks again for all your help,

Regards

Richard
 
jeckels responded:  2013-08-30 13:15
Hi Richard,

I did some more investigation. It turns out that we aren't recognizing the "GN=" as a prefix for gene names. Instead, the current code (which has been in place for some time) is expecting "Gene_Symbol=" instead.

I've made a change to our parsing so that we'll correctly detect the "GN=" in the next version, 13.3.

In the meantime, you could create a copy of the FASTA file that uses Gene_Symbol instead and import a data file that uses it. I believe that will be sufficient to populate the FirstGeneName column, even for searches that don't use the modified FASTA file.
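
The FASTA copy described above could be produced with a short script along these lines. This is only a sketch: the function names and file paths are hypothetical, and it assumes that header lines start with ">" and that the `GN=` tag is always preceded by whitespace, as in UniProt-style headers.

```python
import re

# Match "GN=" only when preceded by whitespace, so that other text
# ending in "GN" inside a description is left alone.
GN_TAG = re.compile(r"(?<=\s)GN=")


def rewrite_header(line):
    """Swap GN= for Gene_Symbol= on header lines; return sequence lines unchanged."""
    if line.startswith(">"):
        return GN_TAG.sub("Gene_Symbol=", line)
    return line


def rewrite_fasta(src_path, dst_path):
    """Write a copy of the FASTA at src_path with rewritten headers to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(rewrite_header(line))


example = ">sp|A0AUP1|CC112_MOUSE Coiled-coil domain-containing protein 112 OS=Mus musculus GN=Ccdc112 PE=2 SV=2\n"
print(rewrite_header(example), end="")
```

Calling `rewrite_fasta("mouse.fasta", "mouse_gene_symbol.fasta")` (placeholder file names) would leave the original file untouched and write the modified copy alongside it.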

From your screenshot, it looks like you're running an older version of the server. I'd encourage you to upgrade sometime soon (maybe with 13.3 later this year) as we do continue to improve our FASTA header parsing from release to release.

Thanks,
Josh
 
richard wilson responded:  2013-09-02 16:45
Hi Josh,

I'm sorry... the saga continues. I've replaced GN= with Gene_Symbol= and the parsing recognises the gene names. However, I repeated a search using the updated FASTA file and the search engine still doesn't see the gene entries. See screenshots attached.

I'm sure we're very close to an answer!

BW Richard
 
richard wilson responded:  2013-09-10 18:52
Hi Josh,

Good news, the gene names are now displayed (using the Protein (Legacy) view). And, not only that, it makes sample run comparisons a lot more straightforward when normalising protein groups (i.e. only one entry per gene name).

Many thanks

Richard