(5) HOW TO DOWNLOAD THE DATABASE?
(6) HOW TO ASSESS THE CONFIDENCE ON AN ANALYSIS UNIT
(7) AMNIS: ALGORITHM FOR THE MAXIMIZATION OF THE NUMBER OF INFORMATIVE SITES IN THE ALIGNMENT
(8) DPDB DATA MODEL
(1) THE SEARCH INTERFACE
The DPDB Search Tool is the web interface that allows you to retrieve both secondary information (diversity measures) and related primary information (sequences and references) from the DPDB database. The search options are explained below for General, Comparative and Graphic searches, as well as for performing quick searches by DPDB or GenBank accession numbers.
Here you can select from the lists the set of organisms and/or genes you want to include in your results. Click the "List" buttons next to the boxes and select the items from the list; then click on "Add selected organisms/genes" at the end of the page. If you leave the boxes empty, all organisms/genes will be included.
Here you can define threshold values for the different parameters of nucleotide diversity, linkage disequilibrium and codon bias. The parameters are distributed into four categories:
Each category can accept two different values related by a boolean operator (and, or, not). You can use both or only one, or leave them empty or 0 as defaults. Be aware that only D (linkage disequilibrium) and Tajima's D (polymorphism) can accept negative values. Note that synonymous and non-synonymous polymorphisms and codon bias estimates are calculated on CDS (coding regions) only, and linkage disequilibrium only on exons and introns.
Please refer to http://pda.uab.es/pda/pda_help.asp#param for a more extensive explanation of the parameters and their references.
Filter for degree of confidence on the polymorphic set
We assess the confidence of each polymorphic set taking into account the quality of each alignment and the source of the sequences.
In this part of the form you can define other advanced options for your search:
The different polymorphic sets are displayed in a table showing the basic parameters of each analysis. From this table you can (see the figure below):
Here you can select from the lists the set of organisms you want to include in your results. Click the "Sp" button to select species or the "Tax" button to select taxonomic groups; the latter list is expandable (this feature works on some browsers). In any list click on "Add selected organisms" at the end of the page. You must select at least one organism or taxonomic group from which you want all available estimates to be averaged.
Select the diversity parameters
Here you can select the diversity parameters you want to include in the comparison. The parameters are distributed into three categories:
Note that synonymous and non-synonymous polymorphism and codon bias estimates are calculated on CDSs (coding regions) only.
Filter for degree of confidence on the polymorphic set
Please refer to the same section in the General Search help.
In this part of the form you can define other advanced options for your search:
The results are presented in a table as shown in the figure:
Tajima's D is a special case in this table. The numbers of Tajima's tests shown in the table are those that gave significant values at the 95% confidence level (e.g. a row showing 3 - 2 means that 3 tests gave a significantly negative Tajima's D, and 2 gave a significantly positive Tajima's D).
Please refer to the same section in the General Search help.
Select one parameter from one list. The distribution of this parameter will be displayed in the results.
Filter for degree of confidence on the polymorphic set
Please refer to the same section in the General Search help.
In this part of the form you can define other advanced options for your search:
The results are shown as a histogram (frequency distribution), as in the figure below. You can retrieve all analysis units for each class by clicking on the frequency range at the left, the histogram bar, or the count number at the right.
d. SEARCH BY DPDB OR GENBANK ACCESSION NUMBERS
At the top of the General Search (section "Search by Id"), enter any DPDB accession (e.g. SET000033 for polymorphic sets, DPpol000025 for analysis units, DPseq001739 for sequences) or GenBank accession (e.g. AF175215 for sequence accession numbers, AF175215.1 for sequence versions, 6002968 for sequence GIs) and click the 'Go' button. You will retrieve all related analysis units from DPDB.
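If you want to check an identifier before pasting it into the "Search by Id" box, the formats listed above can be recognised with a few regular expressions. The following is a minimal sketch: the patterns are inferred only from the examples on this page, and the function name `classify_identifier` is hypothetical, so the server may accept or reject identifiers differently.

```python
import re

# Patterns inferred from the examples in this help page. They are
# assumptions, not the validation rules used by the DPDB server.
ID_PATTERNS = [
    ("DPDB polymorphic set", re.compile(r"SET\d{6}")),
    ("DPDB analysis unit", re.compile(r"DPpol\d{6}")),
    ("DPDB sequence", re.compile(r"DPseq\d{6}")),
    ("GenBank sequence version", re.compile(r"[A-Z]{1,2}\d{5,6}\.\d+")),
    ("GenBank accession number", re.compile(r"[A-Z]{1,2}\d{5,6}")),
    ("GenBank GI", re.compile(r"\d+")),
]


def classify_identifier(text):
    """Return a label for the identifier, or None if no pattern matches."""
    candidate = text.strip()
    for label, pattern in ID_PATTERNS:
        # fullmatch ensures the whole string has the expected shape.
        if pattern.fullmatch(candidate):
            return label
    return None


if __name__ == "__main__":
    for example in ["SET000033", "DPpol000025", "AF175215.1", "6002968"]:
        print(example, "->", classify_identifier(example))
```

Anything that returns `None` does not look like one of the documented identifier types and is probably mistyped.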
(2) ANALYSIS SECTION
This section provides a collection of programs for sequence analysis.
(a) SEQUENCE COMPARISON:
CLUSTALW: Multiple Sequences Alignment
The ClustalW software is available with default parameters optimized for the alignment of Drosophila polymorphic sequences (as manually checked). ClustalW is a multiple sequence alignment program. It aligns different sequences avoiding gaps as much as possible, depending on the parameter values chosen. It can also construct phylogenetic trees. See the Clustal help for more information.
The Blast package is implemented to search for homologous sequences in the primary DPDB database and in the genome of Drosophila melanogaster. BLAST stands for Basic Local Alignment Search Tool. It calculates similarity between biological sequences and produces local alignments, in which only a portion of each sequence needs to be aligned. It uses statistical theory to determine whether a match might have occurred by chance. Several BLAST output formats are available: pairwise report, query-anchored report with identities, query-anchored report without identities and XML output. See the Blast help for more information.
Jalview is a multiple sequence alignment viewer and editor. Alignments can be divided into subfamilies using a tree or by hand. Conservation can then be calculated using physico-chemical properties within subfamilies or across the whole alignment. Principal component analysis can also be used as an alternative way of clustering the sequences. An SRS server can be used to fetch and display the sequence features and any PDB structures listed. See the Jalview help for more information.
(b) NUCLEOTIDE DIVERSITY ANALYSIS:
SNPs - Graphic: Analysis of nucleotide diversity in Sliding Windows
This is a web module that estimates several measures of DNA sequence polymorphism and can perform these analyses with the sliding-window method, producing graphic representations. Aligned DNA sequences are supplied as input in FASTA format. The output is a web page, kept on the server for 24 hours, where results are displayed as text and graphs. See the SNPs-Graphic help for more information.
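To illustrate what a sliding-window analysis computes, here is a minimal sketch that estimates nucleotide diversity (the mean number of pairwise differences per site) in successive windows of an alignment. It is not the SNPs-Graphic implementation: the function names are hypothetical, columns containing anything other than A, C, G or T are simply skipped (an assumption), and only one of the several measures offered by the module is shown.

```python
from itertools import combinations


def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def nucleotide_diversity(seqs):
    """Mean pairwise differences per site over columns made only of A/C/G/T."""
    columns = [col for col in zip(*seqs) if set(col) <= set("ACGT")]
    if not columns or len(seqs) < 2:
        return 0.0
    pairs = list(combinations(range(len(seqs)), 2))
    diffs = sum(col[i] != col[j] for col in columns for i, j in pairs)
    return diffs / (len(pairs) * len(columns))


def sliding_windows(seqs, window, step):
    """Yield (start, diversity) for each full window along the alignment."""
    length = len(seqs[0])
    for start in range(0, length - window + 1, step):
        yield start, nucleotide_diversity([s[start:start + window] for s in seqs])


if __name__ == "__main__":
    fasta = ">s1\nAAAA\n>s2\nAAAT\n>s3\nAATT\n"
    sequences = [seq for _, seq in parse_fasta(fasta)]
    for start, pi in sliding_windows(sequences, window=2, step=2):
        print(start, round(pi, 3))
```

Plotting the yielded values against the window start position gives the kind of graph the module displays.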
PDA, "Pipeline Diversity Analysis", is a collection of programs and modules, mainly written in Perl, that can automatically:
PDA has a user-friendly, web-based interface where the user can select the sequences to be analyzed and the parameters to be used. Sequences can be retrieved from either GenBank or the DPDB database as a list of accession numbers or a set of organisms and/or genes. Low quality sequences coming from large-scale sequencing projects (i.e. working draft), which contain most of the missing data, will be excluded from the analysis. Alternatively, sequences can be entered manually in Fasta or GenBank formats. All sequences will be grouped by organism and gene, and groups will be aligned using the ClustalW algorithm. Afterwards, different analyses of polymorphism in synonymous and non-synonymous sites, linkage disequilibrium and codon bias will be performed. See the PDA help for more information.
(3) STATISTICS
The DPDB Statistics Section shows the contents of the database and includes tabular and graphic information on the secondary and primary database: number of polymorphic sets and analysis units available classified by functional regions, species, genes, GO categories (17), quality of alignments, confidence of data source, total number of sequences and references, etc. All the information, tables and graphs are updated on a daily basis, after the database itself is updated. The number of polymorphic sets analyzed can also be viewed through the Drosophila Phylogeny graph. Categories are based on the phylogeny of the NCBI taxonomy browser. All entries are ordered alphabetically, and the links in each item lead to the next lower level or, at the lowest level, to a direct search of the database. To display your own graphs, please use the Graphical Search tool.
(4) LINKS
This section offers a selected collection of web addresses, especially related to the study of nucleotide polymorphism and bioinformatics. These are distributed in different categories:
(5) HOW TO DOWNLOAD THE DATABASE?
The database can be freely downloaded using our Downloads page. It contains a compressed gzip copy of the MySQL database (dpdb.contents.gz). Download the file and load it into a new database in your MySQL server, as follows:
Note that you must do it from an account with privileges to create a new database in your MySQL server.
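A minimal sketch of one way to script this load is shown here. It assumes that the downloaded file is a gzip-compressed SQL dump and that the `mysql` command-line client is installed and on your PATH. The database name `dpdb`, the user name and both helper functions are hypothetical placeholders, so adapt them to your own server.

```python
import gzip
import shutil
import subprocess


def build_load_commands(database, user):
    """Return the two mysql client invocations: create the database, then load into it."""
    # -p makes the client prompt for the password instead of taking it on the command line.
    create = ["mysql", "-u", user, "-p", "-e", f"CREATE DATABASE {database}"]
    load = ["mysql", "-u", user, "-p", database]
    return create, load


def load_dump(dump_path, database="dpdb", user="root"):
    """Create `database` and stream the decompressed dump into the mysql client."""
    create, load = build_load_commands(database, user)
    subprocess.run(create, check=True)
    with gzip.open(dump_path, "rb") as dump:
        # Decompress in Python and pipe the SQL text to the client's stdin.
        with subprocess.Popen(load, stdin=subprocess.PIPE) as client:
            shutil.copyfileobj(dump, client.stdin)
    if client.returncode != 0:
        raise subprocess.CalledProcessError(client.returncode, load)


if __name__ == "__main__":
    load_dump("dpdb.contents.gz")
```

The dump is decompressed on the fly, so no uncompressed copy is written to disk. You will be prompted for the password once for each of the two commands.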
(6) HOW TO ASSESS THE CONFIDENCE ON AN ANALYSIS UNIT
The results stored in the Drosophila Polymorphism Database are obtained by an automatic process of analysis using PDA (Casillas & Barbadilla 2004) (http://pda.uab.es). We strongly recommend that users of this database follow these steps to assess the confidence on any analysis unit.
1. Review the parameters about the QUALITY OF THE ALIGNMENT (Figure 1a, Figure 2a): To assess the quality of an alignment we used three criteria: the number of sequences included in the alignment, the percentage of gaps or ambiguous bases within the alignment, and the percentage of difference between the shortest and the longest sequences. For each criterion three qualitative categories were defined: low quality, medium quality and high quality:
2. Check the ALIGNMENT and the DND TREE FILE (Figure 2b): The ALIGNMENT (generated with MUSCLE) is given in CLUSTAL, FASTA and JALVIEW formats. JALVIEW is recommended because it lets you view the alignment in color, edit it manually, export it in different formats, etc. However, if you just want to download the alignment to use it in another program, we recommend downloading the FASTA file. You can open the DND Tree File as text, but if TREEVIEW is installed on your computer you will be able to view it graphically.
3. Review the parameters about the QUALITY OF THE DATA SOURCE (Figure 1b): The following four criteria were used to determine if the study had a polymorphism goal:
Two values are assigned to each criterion: true (meets the requirement) or false (does not meet the requirement).
4. Review the ORIGIN OF THE SEQUENCES (Figure 2c): In the main results page, three parameters are given when available in the GenBank annotations: the country, strain and population variant of each sequence. For a complete description of the sequences, you can follow the links to the DPDB, GenBank, EMBL and FlyBase databases.
5. Check the RESULTS OF THE ANALYSES (Figure 2d): Check the results of polymorphism, linkage disequilibrium and codon bias, especially when they show extreme values. In those cases, the program may have grouped together sequences from different origins, or maybe the alignment is poor.
6. REANALYZE THE DATA if needed: Two programs are available to reanalyze your data from the results:
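If you download an alignment and want to recompute the three alignment-quality criteria from step 1 yourself, a minimal sketch follows. It only computes the raw values and assigns no low/medium/high category. The exact definitions are assumptions: any character other than A, C, G or T is counted as a gap or ambiguous base, and the length difference is expressed relative to the longest sequence. The function name is hypothetical.

```python
def alignment_quality_metrics(aligned):
    """Compute the three raw criteria for a list of equal-length aligned sequences."""
    if not aligned:
        raise ValueError("alignment is empty")
    seqs = [s.upper() for s in aligned]
    total_cells = sum(len(s) for s in seqs)
    # Assumption: anything other than A, C, G or T is a gap or ambiguous base.
    unclear = sum(1 for s in seqs for base in s if base not in "ACGT")
    # Sequence length is measured without gap characters.
    lengths = [len(s.replace("-", "")) for s in seqs]
    longest, shortest = max(lengths), min(lengths)
    return {
        "n_sequences": len(seqs),
        "pct_gaps_or_ambiguous": 100.0 * unclear / total_cells if total_cells else 0.0,
        # Assumption: difference expressed relative to the longest sequence.
        "pct_length_difference": 100.0 * (longest - shortest) / longest if longest else 0.0,
    }


if __name__ == "__main__":
    print(alignment_quality_metrics(["ACGT", "AC--", "ANGT"]))
```

Comparing these values with the ones reported for the analysis unit is a quick way to confirm that you are looking at the same alignment.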
(7) AMNIS: ALGORITHM FOR THE MAXIMIZATION OF THE NUMBER OF INFORMATIVE SITES IN THE ALIGNMENT
After the grouping and alignment of sequences, a further step is taken before estimating the polymorphism parameters. It is referred to here as AMNIS (Algorithm for the Maximization of the Number of Informative Sites):
Example:
In this example, the first four sequences would be assigned to group 1, and the last four sequences to group 2. The number of informative sites (without gaps) using the first four sequences (group 1) is:
Informative sites group 1 = 42 non-gapped positions * 4 sequences = 168
Using the cumulative set of sequences of group 1 + 2, we have more sequences, but fewer non-gapped positions:
Informative sites group 1+2 = 7 non-gapped positions * 8 sequences = 56
Therefore, we obtain more informative sites by using only the four long sequences and discarding the short ones than by using the complete set of eight sequences. DPDB would show the alignment with all the sequences, but would use only the four long sequences to calculate the polymorphism estimates (n = 4 in the results). To distinguish the sequences that were used in the analyses from those that were discarded, DPDB uses a color code:
● for sequences that were included in the estimates, and
You can find this information in the REPORT for each analysis unit.
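The arithmetic of the example above can be reproduced with a short sketch. This is a simplified illustration, not the PDA implementation: it adds sequences one at a time from longest to shortest instead of working with length groups, and it keeps the cumulative subset with the largest product of non-gapped positions and number of sequences. The function names are hypothetical.

```python
def ungapped_columns(seqs):
    """Count alignment columns in which none of the sequences has a gap."""
    return sum(1 for column in zip(*seqs) if "-" not in column)


def amnis(aligned):
    """Return (kept_sequences, informative_sites) for the best cumulative subset."""
    # Longest sequences first (length measured without gaps).
    ranked = sorted(aligned, key=lambda s: len(s.replace("-", "")), reverse=True)
    best_k, best_score = 0, 0
    for k in range(1, len(ranked) + 1):
        # Informative sites = non-gapped positions * number of sequences.
        score = ungapped_columns(ranked[:k]) * k
        if score > best_score:
            best_k, best_score = k, score
    return ranked[:best_k], best_score


if __name__ == "__main__":
    long_seq = "A" * 42
    short_seq = "A" * 7 + "-" * 35
    kept, sites = amnis([long_seq] * 4 + [short_seq] * 4)
    print(len(kept), sites)  # prints "4 168"
```

With four sequences of 42 non-gapped positions and four that cover only 7 of them, the sketch keeps the four long sequences (168 informative sites) rather than all eight (56), matching the example.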
(8) DPDB DATA MODEL
No standard data model exists for the storage and representation of haplotypic data with associated diversity estimates, so we have defined a new data model for the secondary database. It is based on two basic units: the POLYMORPHIC SETS (each group of sequences belonging to the same gene and species) and the ANALYSIS UNITS (or ALIGNMENTS) (different subgroups from the corresponding polymorphic sets, according to the functional region (gene, CDSs, exons, etc.) and the percentage of homology between sequence pairs). All subsequent diversity data are estimated and annotated in different joined tables in a MySQL database, related by index tables. Storing diversity estimates in a database makes them permanently available and allows the re-analysis of all or part of the sequences. The database content is updated daily, and records are assigned unique and permanent DPDB identification numbers to facilitate cross-database referencing. Each new item is assigned a unique, increasing DPDB identifier: a six-digit number preceded by the string SET for polymorphic sets, by DPpol for analysis units, by DPseq for individual sequences, and by DPref for references. The database can be freely downloaded via our Downloads page (see section (5) of this help) as a compressed gzip file, with the following structure of related tables:
The database contains two copies of each red table. The second copy is labeled with _old and contains all the older results of the analyses that have been reanalyzed (the newest results are stored in the first copy of the table). This makes it possible to trace the history of a polymorphic set or analysis unit, including all the previous results with the date on which they were obtained. DPDB is a PDA-related database, since it is built using this analytic pipeline. PDA follows these steps to estimate diversity from sequences in GenBank: