BLIZAAR Bio-data data set

This is the biological data in text form. A Neo4j version will be forthcoming. The raw text data can be downloaded from here: blizaarbiodataset.zip

Dataset in a nutshell

We measured, at several time points, contigs (each quantifies a gene), spots (each quantifies one or more proteins; when a spot contains several, we cannot say how each of them is changing) and metabolites. A contig corresponds to proteins. We measured some genes, but not all the corresponding proteins (many are missing for technical reasons: genes are easier to measure than proteins). We measured some proteins, but not all the corresponding genes (though nearly all). Metabolites are produced from one another by enzymes (special proteins). Genes do not really interact directly; they do so through their proteins. Proteins do interact physically. Metabolites interact, but implicitly, through a metabolite. Each gene can produce more than one protein, which is why a gene is sometimes referred to by its code followed by .1, .2, etc., indicating the version of the gene that leads to a specific protein (these proteins are usually very similar, but can do different things). For the same reason, some data refers to the gene code without the isoform (data applicable to all isoforms) and some to the code with the isoform (data specific to that isoform). This is particularly true for the gene ontology data.

At this point we would only have nodes, but no edges. To get relationships we used the STRING database, which is the main database of protein-protein (and therefore also gene-gene) interactions. Each edge is drawn based on several kinds of evidence, each with its own score: experimental, coexpression (similar behavior across several publicly available experiments), text mining (appearing in the same sentence), pathway (participating in the same known biological network), plus a combined score. Scores range from 0 to 1000.
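As an illustration of how the 0 to 1000 scores can be used, here is a minimal sketch that keeps only the edges whose combined score reaches a threshold. The Edge type, the filter_edges function, the sample ids and values, and the cut-off of 700 are all assumptions made for this example; they are not part of the dataset.

```python
from typing import Iterable, List, NamedTuple


class Edge(NamedTuple):
    """Hypothetical in-memory form of one STRING edge."""
    source: str
    target: str
    combined_score: int  # scores range from 0 to 1000


def filter_edges(edges: Iterable[Edge], min_score: int = 700) -> List[Edge]:
    """Keep only the edges whose combined score is at least min_score."""
    if not 0 <= min_score <= 1000:
        raise ValueError("min_score must be between 0 and 1000")
    return [edge for edge in edges if edge.combined_score >= min_score]


# Made-up edges, for illustration only.
edges = [
    Edge("AT1G01010.1", "AT1G01020.1", 912),
    Edge("AT1G01010.1", "AT1G01030.1", 350),
]
strong = filter_edges(edges, min_score=700)
print(strong)
```

The same filter could be applied to any of the per-evidence scores instead of the combined one.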

STITCH is a twin database for metabolites, which also includes enzyme interactions (enzymes are special proteins that act on metabolites).

Data files

IN THE NODE FILES, THE FIRST COLUMN IS THE ID. IT CORRESPONDS TO THE IDS IN THE EDGE FILES, WHERE THE FIRST AND SECOND COLUMNS ARE THE IDS OF THE CONNECTED NODES
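The convention above can be checked with a short sketch like the following, which reads the ids from a node file and reports the edges whose endpoints are not among them. It assumes tab-separated files without a header row, which may not match the actual files; the function names and the in-memory sample data are hypothetical.

```python
import csv
import io


def read_node_ids(handle, delimiter="\t"):
    """Return the set of ids found in the first column of a node file."""
    return {row[0] for row in csv.reader(handle, delimiter=delimiter) if row}


def read_edges(handle, delimiter="\t"):
    """Return (source_id, target_id) pairs from the first two columns of an edge file."""
    reader = csv.reader(handle, delimiter=delimiter)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


def dangling_edges(edges, node_ids):
    """Return the edges that have at least one endpoint missing from node_ids."""
    return [(a, b) for a, b in edges if a not in node_ids or b not in node_ids]


# Made-up in-memory files standing in for a node file and an edge file.
nodes_file = io.StringIO("n1\tprotein A\nn2\tprotein B\n")
edges_file = io.StringIO("n1\tn2\t900\nn1\tn3\t500\n")

node_ids = read_node_ids(nodes_file)
edges = read_edges(edges_file)
missing = dangling_edges(edges, node_ids)
print(missing)
```

With real files, the StringIO objects would be replaced by open(path, newline="") handles, and the delimiter adjusted if the files are not tab-separated.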

allStringProteins.txt all proteins in STRING, including those we have not (yet) measured; ideally, all known proteins in A. thaliana (our model organism). Please note that once the ids have been loaded, a second identifier must be added by copying the id with the isoform (xxxxxx.y) to an id without the isoform (xxxxxx).
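The isoform-free identifier can be derived with a sketch along these lines. It assumes the isoform is always a trailing dot followed by digits, as in the xxxxxx.y pattern above; the function name and the sample ids are chosen for illustration only.

```python
import re

# Matches a trailing ".1", ".2", ... at the end of an id (assumed isoform format).
_ISOFORM_SUFFIX = re.compile(r"\.\d+$")


def strip_isoform(protein_id: str) -> str:
    """Return the id without its isoform suffix; ids without a suffix are unchanged."""
    return _ISOFORM_SUFFIX.sub("", protein_id)


# Example ids following the xxxxxx.y pattern.
for protein_id in ["AT1G01010.1", "AT1G01010.2", "AT1G01010"]:
    print(protein_id, strip_isoform(protein_id))
```

Storing both forms on each node makes it possible to join isoform-specific data and data that applies to all isoforms of a gene.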

allProteinInteractionsStringv10 all protein interactions in STRING

CANCANProteinsIds4Neo4J.txt contains the data related to proteins identified in one or more spots

CANCANMetabolites4Neo4J.txt has the data about all the metabolites identified in the metabolic analyses

CANCANTranscrData4Neo4J-OnlyIds these are the genes we measured, but only the ids.

CANCANMetabolicInteractionsSTITCH.txt edges between proteins (enzymes) and metabolites, and also between proteins, according to the STITCH database.

Note: if you load the following two files, which are richer but larger, you can ignore CANCANMetabolicInteractionsSTITCH.txt.

stitch.3702.protein_chemical.links.detailed.v5.0_cids.txt edges between proteins (enzymes) and metabolites, and also between proteins, according to the STITCH database.

stitch.chemical_chemical.links.detailed.v5.0_cids.txt edges between metabolites, according to the STITCH database.

In the GoForNeo4J folder:

GONodesSimplified data related to the ontological class

GOEdgesIsoformOnlyC.txt
GOEdgesIsoformOnlyF.txt
GOEdgesIsoformOnlyP.txt
GOEdgesNoIsoformOnlyC.txt
GOEdgesNoIsoformOnlyF.txt
GOEdgesNoIsoformOnlyP.txt

These files contain the edges between a gene/protein and an ontological class, either valid in general (NoIsoform) or only for a specific isoform (Isoform). C stands for cellular component, F for function, P for biological process.
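To show how the general and the isoform-specific annotations fit together, the following sketch collects the ontology classes that apply to one isoform by merging both sources. It assumes each edge file yields (gene or protein id, ontology class id) pairs and that the isoform is a trailing dot followed by digits; the function names and the sample pairs are illustrative and not taken from the dataset.

```python
import re
from collections import defaultdict


def build_index(pairs):
    """Group ontology class ids by gene/protein id."""
    index = defaultdict(set)
    for node_id, go_id in pairs:
        index[node_id].add(go_id)
    return index


def go_terms_for(isoform_id, general, specific):
    """Union of the classes valid for the whole gene and those valid for this isoform."""
    base_id = re.sub(r"\.\d+$", "", isoform_id)  # drop the assumed isoform suffix
    return general.get(base_id, set()) | specific.get(isoform_id, set())


# Illustrative pairs standing in for a NoIsoform file and an Isoform file.
general = build_index([("AT1G01010", "GO:0005634")])
specific = build_index([("AT1G01010.1", "GO:0003700")])

print(sorted(go_terms_for("AT1G01010.1", general, specific)))
print(sorted(go_terms_for("AT1G01010.2", general, specific)))
```

In this sketch an isoform with no specific annotation still inherits the classes recorded for the gene as a whole.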

Note:

The following files, in the “Confidential data” folder, contain not yet published data and shall not be distributed outside the project (e.g. to students).

CANCANTranscrData4Neo4J.txt these are the genes we measured at 4 time points (6, 9, 15 and 20 days), with some statistical values and averages over the replicates.

CANCANSpots.txt these are the values we measured in spots (a spot ideally contains, and allows us to quantify, one protein; if it contains more, the values cannot be disentangled and are therefore of little use for any single one of the contained proteins).

CANCANspot-contig-edges.txt contains the edges between a spot and the contained proteins
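Since the values of a spot containing several proteins cannot be disentangled, it can be useful to flag such spots before analysis. The sketch below does so from a list of spot-to-protein edges; it assumes the first column holds the spot id and the second the protein id, and the function names and sample ids are hypothetical.

```python
from collections import defaultdict


def proteins_per_spot(edges):
    """Map each spot id to the set of protein ids it contains."""
    spots = defaultdict(set)
    for spot_id, protein_id in edges:
        spots[spot_id].add(protein_id)
    return dict(spots)


def ambiguous_spots(edges):
    """Return, sorted, the spots that contain more than one protein."""
    return sorted(
        spot_id
        for spot_id, proteins in proteins_per_spot(edges).items()
        if len(proteins) > 1
    )


# Made-up edges: spot2 contains two proteins, so its values are ambiguous.
edges = [("spot1", "P1"), ("spot2", "P1"), ("spot2", "P2")]
print(ambiguous_spots(edges))
```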