PLAZA Data Warehouse

PLAZA Data Warehouse functionality is only supported from PLAZA 3.0 onwards.

Overview

PLAZA instances can also function as a data warehouse, providing external tools (e.g. Galaxy) with parsed data from PLAZA in a consistent format. This allows these external tools to rely on PLAZA as a central resource for data content, trusting that the provided data follows the FAIR data principles:

  1. Findable: The data and meta-data can be found per PLAZA instance, and in the general PLAZA overview website.
  2. Accessible: Meta-data can be accessed in a standardized way.
  3. Interoperable: The data is provided in standardized formats, allowing it to be used and integrated directly by other tools.
  4. Reusable: The data and meta-data can easily be replicated and/or combined in different settings.

PLAZA Warehouse API

All data is directly accessible on the FTP server(s), and is also browsable and findable through the download section of each PLAZA instance.

To let other web services automatically find and download the reference data that makes up a PLAZA instance, each PLAZA instance offers a number of non-JWT API calls (in contrast to the API described here). These API calls return a JSON object containing the information available on the FTP server per species for that instance.

Instance API call(s)

Per PLAZA instance the following API call(s) are available:

  1. /api/get_species_data
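
A minimal sketch of how a client might call this endpoint, using only the Python standard library. The function names and the timeout value are illustrative assumptions, and the base URL of an instance has to be supplied by the caller. The layout of the returned JSON is not specified in this section, so the sketch simply returns the decoded object without interpreting it.

```python
import json
from urllib.request import urlopen


def species_data_url(instance_base_url: str) -> str:
    """Build the /api/get_species_data URL for one PLAZA instance."""
    return instance_base_url.rstrip("/") + "/api/get_species_data"


def fetch_species_data(instance_base_url: str, timeout: float = 30.0):
    """Fetch and decode the JSON object returned by the instance.

    No token is sent, since the call is described as a non-JWT call.
    """
    with urlopen(species_data_url(instance_base_url), timeout=timeout) as response:
        return json.load(response)
```

The URL construction is kept separate from the network call so that it can be checked without contacting a server.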

Global API call(s)

The global PLAZA overview website (https://bioinformatics.psb.ugent.be/plaza/) also offers some API call(s) to retrieve data from a set of PLAZA instances all at once. This is important for the following reasons:

  1. The same version (genome assembly + gene annotation) of a species can be present in multiple PLAZA instances. In those cases one would not want to have this redundant information present when retrieving all genome information from multiple PLAZA instances.
  2. One species can have different versions (different genome assembly and/or gene annotation) in different PLAZA instances. Thus, when retrieving all genome data for a given species, it is important to make the distinction between these versions while also indicating that the species itself is the same.

The following API call(s) are available at the PLAZA overview website:

  1. /api/get_species_data_links
    • Example: https://bioinformatics.psb.ugent.be/plaza/api/get_species_data_links
    • This call searches all available PLAZA instances and returns those for which the /api/get_species_data API call is available.
      Alongside each PLAZA instance name, the URL of its API call is provided.
      As such, this API call does NOT return any species data or meta-data itself, but rather information about the PLAZA instances and the location of their API calls.
  2. /api/get_species_data
  3. /api/defined_instances and /api/available_instances
    • Example: https://bioinformatics.psb.ugent.be/plaza/api/available_instances
    • These API calls return an overview of the PLAZA instances registered in the overview database.
    • The difference between the two is that the defined call returns the data exactly as present in the database, whereas the available call additionally checks whether the HTTP headers of each instance indicate that it is currently running and available. The defined API call is therefore very fast, while the available API call can be much slower.
    • Additional options can be provided to both API calls, in order to filter the returned instances, using standard query format parameters:
      • since : Minimum year selection. Instances from that year onward are returned.
      • host : Host server selection. Instances running on that server are returned.
      • version : Exact software version selection. Instances running that software version are returned.
      • minimum_version : Minimum software version selection. Instances running from that software version onwards are returned.
      • archived : Archive status selection. Selects instances by whether or not they are archived.
      • species : Species presence selection. Returns only those instances for which the common-names contain the given species name. Can be slow because multiple instances might need to be queried.
      Examples:
      https://bioinformatics.psb.ugent.be/plaza/api/available_instances/?since=2015 → Only the instances from 2015 onwards
      https://bioinformatics.psb.ugent.be/plaza/api/available_instances/?minimum_version=4&species=triticum → Only the instances with software version 4 (or higher) containing species with name 'triticum'
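
The filter parameters above can be combined programmatically. The following is a small sketch that only builds the request URL; the helper name `instances_url` is an assumption made for illustration, while the parameter names and the base URL are taken from the list and examples above.

```python
from urllib.parse import urlencode

OVERVIEW_API = "https://bioinformatics.psb.ugent.be/plaza/api"

# Filter names as listed in this section.
ALLOWED_FILTERS = {"since", "host", "version", "minimum_version", "archived", "species"}


def instances_url(call: str = "available_instances", **filters) -> str:
    """Build the URL for the defined_instances or available_instances call."""
    if call not in ("defined_instances", "available_instances"):
        raise ValueError(f"unknown API call: {call}")
    unknown = set(filters) - ALLOWED_FILTERS
    if unknown:
        raise ValueError(f"unknown filter(s): {sorted(unknown)}")
    url = f"{OVERVIEW_API}/{call}"
    if filters:
        # Standard query-string encoding, e.g. ?minimum_version=4&species=triticum
        url += "/?" + urlencode(filters)
    return url
```

For instance, `instances_url(minimum_version=4, species="triticum")` reproduces the second example URL above. Using `call="defined_instances"` with the same filters avoids the per-instance availability checks and should therefore respond faster.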

PLAZA Warehouse Meta-Data

The meta-data per species can be used to uniquely identify each species. The following information is provided in this meta-data:

  • PLAZA species identifier: Internal identifier used by the PLAZA system (e.g. ath)
  • Common Name : The scientific name used for the species (e.g. Arabidopsis thaliana)
  • Eco-type : The eco-type for the species (e.g. COL-0, but undefined for most species)
  • Taxonomy ID : NCBI taxonomy identifier (e.g. 3702)

Additional information about the release of the gene and genome annotation can also be provided:

  • Version : Gene and genome annotation version (e.g. TAIR10)
  • Pubmed ID : Publication identifier for the gene/genome release (e.g. 11130711)
  • URL : The URL from which the genome annotation was downloaded into the PLAZA platform (e.g. http://www.arabidopsis.org)
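
As an illustration of how this meta-data could be used to tell species and versions apart across instances, the sketch below models one record and groups the distinct versions per NCBI taxonomy ID. The class name, attribute names, and the grouping function are assumptions chosen for this example; the actual JSON keys returned by the API are not documented in this section.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpeciesMetadata:
    """One species entry; attribute names are hypothetical."""
    plaza_id: str                   # e.g. "ath"
    common_name: str                # e.g. "Arabidopsis thaliana"
    taxonomy_id: int                # e.g. 3702
    version: Optional[str] = None   # e.g. "TAIR10"
    eco_type: Optional[str] = None  # undefined for most species
    pubmed_id: Optional[str] = None
    url: Optional[str] = None


def group_versions(records):
    """Map each taxonomy ID to the sorted list of distinct versions seen."""
    versions = defaultdict(set)
    for record in records:
        versions[record.taxonomy_id].add(record.version)
    return {
        taxonomy_id: sorted(found, key=lambda v: v or "")
        for taxonomy_id, found in versions.items()
    }
```

With this grouping, the same species version reported by several instances collapses into a single entry, while different versions of one species stay distinct under the same taxonomy ID.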

PLAZA Warehouse Data

The following data types are offered per PLAZA instance:

GFF3 files

GFF3 (Generic Feature Format v3) is one of the most common ways used in bioinformatics to exchange annotations, feature files, etc. Multiple tools can make direct use of GFF3 files, and they can also be easily converted into BED format if so required.
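
The conversion to BED mentioned above mostly comes down to a coordinate shift: GFF3 uses 1-based, inclusive coordinates, whereas BED uses 0-based, half-open coordinates. The sketch below shows this for a single line. The function name is an assumption, and taking the BED name from the `ID` attribute is one reasonable choice rather than a PLAZA convention.

```python
def gff3_line_to_bed(line: str):
    """Convert one GFF3 feature line to a BED6 tuple; None for non-feature lines."""
    if not line.strip() or line.startswith("#"):
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return None
    seqid, _source, feature_type, start, end, _score, strand, _phase, attributes = columns
    # Use the ID attribute as the BED name, falling back to the feature type.
    name = feature_type
    for field in attributes.split(";"):
        if field.startswith("ID="):
            name = field[3:]
            break
    # Start shifts by one (1-based -> 0-based); end stays (inclusive -> exclusive).
    # The GFF3 score is not mapped; BED score is set to "0".
    return (seqid, int(start) - 1, int(end), name, "0", strand)
```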

Despite its widespread use, the format (especially the last column) is only loosely defined (see here for more information). Within the PLAZA platform, all GFF files are provided in the same format style.

Per Locus
  • Transcript sets: All Transcripts (a), Selected Transcript (b)
  • Feature sets: All Features (c), Exon Features (d)

  a. All isoforms given by the original data provider.
  b. Normally the longest transcript per locus, except for Arabidopsis thaliana, where the default transcript per locus is provided by TAIR/Araport.
  c. All the GFF features. For coding genes these are: gene, mRNA, exon, five_prime_UTR, three_prime_UTR, CDS, intron.
  d. Only exon (and parental) features: gene, mRNA, exon.
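
To make the relation between the two feature sets concrete, the sketch below derives an exon-only view from a file containing all features by keeping only gene, mRNA, and exon lines. This is an illustration of the distinction, not a description of how PLAZA generates its files; the function name is an assumption.

```python
EXON_FEATURE_TYPES = {"gene", "mRNA", "exon"}


def keep_exon_features(gff3_lines):
    """Yield comment/directive lines plus gene, mRNA and exon feature lines."""
    for line in gff3_lines:
        if line.startswith("#"):
            yield line
            continue
        columns = line.split("\t")
        # Column 3 of a GFF3 line holds the feature type.
        if len(columns) >= 3 and columns[2] in EXON_FEATURE_TYPES:
            yield line
```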

CSV files

CSV files are tab- or semicolon-separated files in which each gene occupies a single line, in contrast to the GFF3 files.

Per Locus
  • Transcript sets: All Transcripts (a), Selected Transcript (b)
  • Content: Annotation

  a. All isoforms given by the original data provider.
  b. Normally the longest transcript per locus, except for Arabidopsis thaliana, where the default transcript per locus is provided by TAIR/Araport.
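
Because the separator can be either a tab or a semicolon, a reader has to pick the delimiter before parsing. The sketch below does so by inspecting the first line; this heuristic and the function name are assumptions, and the column layout of the files is not described here, so rows are returned as plain lists of strings.

```python
import csv
import io


def read_annotation_rows(text: str):
    """Parse tab- or semicolon-separated text into a list of rows."""
    first_line = text.split("\n", 1)[0]
    # Prefer tab if the first line contains one; otherwise assume semicolon.
    delimiter = "\t" if "\t" in first_line else ";"
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))
```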

FASTA files

FASTA files contain sequence information per data-type.

Per Locus
  • Transcript sets: All Transcripts (a), Selected Transcript (b)
  • Sequence types: CDS, Transcript, Proteome

  a. All isoforms given by the original data provider.
  b. Normally the longest transcript per locus, except for Arabidopsis thaliana, where the default transcript per locus is provided by TAIR/Araport.

Genome
  • Chromosome
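
A minimal sketch of reading such a FASTA file without external libraries. It assumes the standard FASTA layout (a header line starting with ">" followed by one or more sequence lines) and uses the first whitespace-delimited token of the header as the record name; the exact header format of the PLAZA files is not described here.

```python
def read_fasta(lines):
    """Return a dict mapping record name to its concatenated sequence."""
    parts_by_name = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            parts_by_name[current] = []
        elif current is not None:
            # Sequence lines belonging to the most recent header.
            parts_by_name[current].append(line)
    return {name: "".join(parts) for name, parts in parts_by_name.items()}
```

The function accepts any iterable of lines, so an open file handle can be passed directly.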