
Cannot control dataset key when copying assay to study

Study Forum (Inactive)
Cannot control dataset key when copying assay to study Andy Straw  2014-03-03 12:23
Status: Closed
 
As far as I can tell, there is no way to control/define the key of a dataset that is created when you copy an assay to study. The dataset created by copy-to-study always seems to have RowId as an additional key column (in addition to ParticipantId and SequenceNum). I understand that this is the safest thing for LabKey code to do, since it can't know if ParticipantId+SequenceNum is unique, and can't know what other field might be the additional key column. However, often for our assays, ParticipantId+SequenceNum IS unique, or there's an additional column we could use to make the dataset rows unique (e.g., a stimulation condition).

Using RowId as an additional key column means that we have to write queries to join datasets that were copied to study from different assays, rather than joining them with Customize View, as we can with clinical and demographic datasets - even when the data from those assays have the same keys.

Am I missing something, or is there any way to work around this? Or is it reasonable to add a feature to copy-to-study that would allow the user to define the additional key column (or indicate that one is not needed)? Thanks for any info/suggestions you can provide.

Andy
 
 
jeckels responded:  2014-03-05 22:36
Hi Andy,

You're correct. We use the RowId as the extra key because we don't know the nature of the assay data. This approach ensures that we can always copy the rows to the study dataset (since the same PTID/SequenceNum may contain many rows, one for each analyte/gene/protein/etc). As you note, this has the drawback of not being able to leverage the auto-joins.

We've talked about adding extra assay configuration to specify what column (if any) to use as the extra key, but this hasn't yet been prioritized.

In the meantime, you might consider creating a query snapshot dataset. Effectively, you'd do a SELECT * FROM ASSAY_DATASET. This will create an (automatic) duplicate copy of the data, but in a dataset that doesn't have an extra key column, allowing you to auto-join.

https://www.labkey.org/wiki/home/Documentation/page.view?name=querySnapshot

Thanks,
Josh
 
Andy Straw responded:  2014-03-06 11:53
Thanks for the suggestion re: Query Snapshots. Unfortunately, they don't seem to work properly with the specific datasets we're dealing with right now, that is, datasets created by copying the FCSAnalyses assay to study. FCSAnalyses is the assay into which FlowJo workspace stats are imported via the Flow module. The problems seem to be caused by the column names typical of this type of assay: they have special characters that are not handled well; they are longer than 63 characters; and they are nested properties (i.e., Statistic.*).

Here is what I'm seeing.

I have a dataset copied to study from FCSAnalyses with the following stat column names:

1. Statistic.Count
2. Statistic.FSC-A, SSC-A SUBSET/SINGLETS:Count
3. Statistic.FSC-A, SSC-A SUBSET/SINGLETS/CD235-L-D-(LIVE)/CD3+CD14-/CD4+:Count
4. Statistic.FSC-A, SSC-A SUBSET/SINGLETS/CD235-L-D-(LIVE)/CD3+CD14-/CD4+:Freq_Of_Parent
(There are more, but these will illustrate the problems)

This dataset's key includes RowId; I want to create a Query Snapshot of it whose key does NOT include RowId. Before we go there, here's more info about the source dataset.

According to Customize View, the field key for #2 is "Statistic.FSC-A$C SSC-A SUBSET$SSINGLETS:Count", that is, the comma is replaced by $C and each slash by $S.

If I hover over column header for #2, I see "Statistic/FSC-A$C SSC-A SUBSET$SSINGLETS:Count", same as field key, except slash versus dot after "Statistic".

If I write a query against this table, and SELECT Statistic."FSC-A, SSC-A SUBSET/SINGLETS:Count"; that parses fine, executes fine, and produces the results I expect.

If I write a query against this table, and SELECT Statistic."FSC-A$C SSC-A SUBSET$SSINGLETS:Count"; that parses fine, executes without error, but produces blank values.

If I write a query against this table, and SELECT "Statistic.FSC-A$C SSC-A SUBSET$SSINGLETS:Count"; that fails to parse: "Could not resolve column: Statistic$PFSC-A$DC SSC-A SUBSET$DSSINGLETS:Count"

Something is wrong here, but now that I know what naming format to use in my queries, I'm fine. Not sure if this is related to problems below.
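For what it's worth, the behavior above can be reproduced with a minimal sketch. The escape pairs ($C for comma, $S for slash, $P for period, $D for dollar sign) and the double-escaping effect are inferred purely from the observations in this thread, not from any LabKey documentation, and `encode_part` is an invented helper name:

```python
# Escape pairs inferred from the field keys observed above (not from any
# official spec): ',' -> $C, '/' -> $S, '.' -> $P, '$' -> $D.
ESCAPES = {',': '$C', '/': '$S', '.': '$P', '$': '$D'}

def encode_part(name: str) -> str:
    """Escape one field-key part in a single left-to-right pass."""
    return ''.join(ESCAPES.get(ch, ch) for ch in name)

# One level of escaping, as shown by Customize View for column #2:
encoded = encode_part("FSC-A, SSC-A SUBSET/SINGLETS:Count")
# -> "FSC-A$C SSC-A SUBSET$SSINGLETS:Count"

# Escaping an already-escaped key a second time turns each '$' into '$D',
# producing the "$P"/"$DC"/"$DS" forms seen in the parse error:
double = encode_part("Statistic." + encoded)
# -> "Statistic$PFSC-A$DC SSC-A SUBSET$DSSINGLETS:Count"
```

The "Could not resolve column" message in the last query above matches this double-escaped form exactly, which suggests the quoted name was escaped a second time rather than recognized as an already-escaped field key.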

Now I try to create a Query Snapshot of this dataset.

(From what I saw when creating Query Snapshots on OTHER datasets that have column names that are short and without special characters, the SQL generated seems to select ALL columns of the source dataset, but the default definition of the generated dataset contains only the columns in the current view. Is that correct?)

If I include both columns #3 and #4 in my view before attempting to create the snapshot, I get an error when I click on the "Edit Dataset Definition" button:
ERROR: column "statistic.fsc-a$c ssc-a subset$ssinglets$scd235-l-d-(live)$scd1" of relation "c128d6196_flow1_snapshot" already exists

The problem seems to be the length of the column names. LabKey has truncated the names of columns #3 and #4 to 63 characters, making them identical, which obviously is not okay for a dataset definition. I cancel out of creating the snapshot.
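The 63-character behavior is consistent with PostgreSQL's default identifier limit (63 bytes), which would explain the truncation. A quick sketch for spotting in advance which column names will collide after truncation (`truncation_collisions` is an invented helper, not a LabKey API):

```python
from collections import defaultdict

def truncation_collisions(names, limit=63):
    """Group column names that become identical after truncation to `limit`."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:limit]].append(name)
    return [cols for cols in groups.values() if len(cols) > 1]

cols = [
    "Statistic.Count",
    "Statistic.FSC-A, SSC-A SUBSET/SINGLETS:Count",
    "Statistic.FSC-A, SSC-A SUBSET/SINGLETS/CD235-L-D-(LIVE)/CD3+CD14-/CD4+:Count",
    "Statistic.FSC-A, SSC-A SUBSET/SINGLETS/CD235-L-D-(LIVE)/CD3+CD14-/CD4+:Freq_Of_Parent",
]
# Columns #3 and #4 differ only in the suffix after ':', which falls past
# character 63, so they collide; #1 and #2 are short enough to stay unique.
```

(The database error actually shows the escaped, lowercased form of the name, which shares an even longer common prefix, but the effect is the same.)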

If I remove column #4 from the view of the source dataset and try again, I can create the Query Snapshot, but it doesn't work correctly: column #1 (which has no special characters) has the correct values, but columns #2 and #3 (which have special characters) are blank. So I Edit Snapshot. To my surprise, the SQL displayed selects NONE of the Statistic columns - not even Statistic.Count, which is displaying correct values! The dataset definition includes the other two Statistic columns I had in the view of the original dataset, but they're not in the SQL either, and display blank when I view the snapshot.

Back in the view of this weird snapshot, I see the following:

According to Customize View, the field key for #2 is "Statistic.FSC-A$C SSC-A SUBSET$SSINGLETS:Count", same as for the original dataset. At least, it looks that way.

However, if I hover over column header for #2, I see "Statistic$PFSC-A$DC SSC-A SUBSET$DSSINGLETS:Count" - Note "$P" instead of period; "$DC" instead of dollar-C; "$DS" instead of dollar-S. Why is that? Does this matter?

If I write a query against this snapshot table, and SELECT Statistic."Count", I get a parse error: "Could not resolve column Statistic/Count." But if I SELECT "Statistic.Count" (note difference in where quotes are), that parses, and executes fine, and produces what I expect for that column.

If I write a query against this snapshot table, and SELECT "Statistic.FSC-A$C SSC-A SUBSET$SSINGLETS:Count"; that parses fine, executes without error, but produces blank results.

So, I can't tell whether the problem is with how special characters are being handled, or due to the "nested" nature of the statistic fields, or something else. Is the dataset definition of the snapshot okay, but the SQL used to populate the snapshot from the source wrong? What exactly are the rules about special characters, the dollar-sign escapes, etc.? And will this ever be able to handle column names longer than 63 characters? My source dataset has longer column names that are nested and have special characters, so it's not an inherent limitation of datasets, right?

Please help. Thanks.

Andy
 
Andy Straw responded:  2014-03-07 12:10
After some more experimenting, I have learned more about how Query Snapshot seems to work (what fields are captured, how nested properties are handled). I have also figured out what's going wrong with special characters. I do not yet see a way to work around the 63-character limitation on field name length.

It seems that a certain set of fields from the original dataset is grabbed by the snapshot SQL, regardless of whether I include them in the dataset definition or not. In my example, the fields Name, Run, Flag, and CompensationMatrix (along with ParticipantId and SequenceNum, of course) are always included in the query, whether or not they were in the view from which I created the snapshot. I can create the snapshot without them in the view, and they're not in the dataset definition or the resulting snapshot view; but if I edit the snapshot, I can add them to the dataset definition, update the snapshot, and then they appear in the snapshot view.

However, for "composite" fields that came from the assay, like Run and Statistic, if I want a component property like Run.Name, Run.RowId, or Statistic.anything in the snapshot, I must include that component property explicitly in the view from which I create the snapshot. Then it appears in the dataset definition and in the snapshot view. If I only include Run in the original view from which the snapshot is created, I cannot later add Run.Name to the snapshot. Not sure whether this is specific to the Flow FCSAnalyses assay being copied to study, or is the case for any assay copied to study.

Unfortunately, we have dozens or even hundreds of Statistic.* columns in the original dataset, so manually adding them all to the view before creating the snapshot isn't very feasible; I suppose we could use some JavaScript to define a view that contains all of them.

Regarding special characters: When I create the snapshot, if I edit the dataset definition by un-escaping the field names, they now work. In my example above, the default generated field name in the snapshot dataset definition for column #2 is: "Statistic.FSC-A$C SSC-A SUBSET$SSINGLETS:Count"; if I leave that, the snapshot creates without error, but the column has blank values. But if during creation of the snapshot, I change that to "Statistic.FSC-A, SSC-A SUBSET/SINGLETS:Count", then the column ends up with the values I expect. Unfortunately, again, the large number of Statistic.* columns in the original dataset makes manually editing their names not feasible. I believe we can use a combination of "Export Fields", a small edit script, and "Import Fields" instead - though I haven't tried this yet.
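A sketch of what that small edit script might look like, run over the field names between "Export Fields" and "Import Fields". The un-escape pairs are inferred from this thread, not from LabKey documentation, and `decode_part` is an invented name:

```python
import re

# Reverse the escape pairs observed in this thread: $C -> ',', $S -> '/',
# $P -> '.', $D -> '$'. A single regex pass undoes exactly one level of
# escaping, so a double-escaped "$DC" becomes "$C", not ",".
UNESCAPES = {'C': ',', 'S': '/', 'P': '.', 'D': '$'}

def decode_part(name: str) -> str:
    """Undo one level of field-key escaping."""
    return re.sub(r'\$([CSPD])', lambda m: UNESCAPES[m.group(1)], name)

fixed = decode_part("Statistic.FSC-A$C SSC-A SUBSET$SSINGLETS:Count")
# -> "Statistic.FSC-A, SSC-A SUBSET/SINGLETS:Count"
```

Applied to every generated field name before Import Fields, this would perform the same rename I did by hand for column #2 above.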

But all of this is for naught if we can't find a way to deal with the long field names, and I have not figured out a workaround. There seem to be two distinct issues:

1. As noted above, if I try to include two columns with somewhat long names that are identical in their first 63 characters (like #3 and #4 above), I get an error about a duplicate column.
2. If I include just ONE field that is very long, I get a different error: "PropertyURI cannot exceed 200 characters, but was 225 characters long. Please use a shorter field name". The odd thing is that the field name in this case was closer to 110 characters (I'm not sure whether the count includes escaping or not): Statistic.FSC-A, SSC-A SUBSET/SINGLETS/CD235-L-D-(LIVE)/CD3+CD14-/CD4+/CD28+CD27+(EXCD27HI)/CD45RA+CCR7+:Count

Again - any help would be appreciated.

Andy