Title of the dataset:
Origin and evolution of the cannabinoid oxidocyclase gene family

Creators:
Robin van Velzen
Michael Eric Schranz

Related publication:
Origin and evolution of the cannabinoid oxidocyclase gene family

Description:
Sequence alignments and associated phylogenetic gene trees of Berberine bridge enzyme gene family protein sequences and cannabinoid oxidocyclase gene family DNA sequences.

Keywords:
Cannabis sativa L.
Gene family analysis
Phylogenetic analysis
DNA sequence alignment
Protein sequence alignment
Berberine-bridge enzyme
Evolution

This dataset contains the following files:
Fig.2_alignment.fasta
Fig.2_tree.newick
Fig.3_alignment.fasta
Fig.3_tree.newick
Fig.S1_alignment.fasta
Fig.S1_tree.newick
Fig.S2_alignment.fasta
Fig.S2_tree.newick
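
The file names listed above follow one regular pattern (figure label, data type, data format), so they can be split mechanically. The following is a minimal Python sketch, not part of the dataset; the function name and the regular expression are illustrative assumptions derived only from the eight names in this list.

```python
import re

# Assumed pattern, inferred from names such as "Fig.2_alignment.fasta"
# and "Fig.S1_tree.newick"; it is not defined by the dataset itself.
NAME_PATTERN = re.compile(r"^Fig\.(S?\d+)_(alignment|tree)\.(fasta|newick)$")


def parse_dataset_filename(name):
    """Split a dataset file name into figure label, data type, and format."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    figure, data_type, data_format = match.groups()
    return {"figure": figure, "data_type": data_type, "format": data_format}


print(parse_dataset_filename("Fig.S1_tree.newick"))
# {'figure': 'S1', 'data_type': 'tree', 'format': 'newick'}
```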

Explanation of variables: 
File names refer to the corresponding figures in the manuscript, the data type (alignment or tree), and the data format (.fasta or .newick).
Data associated with Fig. 2 are based on selected Eurosid Berberine bridge enzyme gene family protein sequences.
Data associated with Fig. 3 are based on cannabinoid oxidocyclase gene family nucleotide sequences from whole-genome assemblies of Cannabis sativa cultivars ‘CBDRx’, ‘Finola’, ‘Jamaican Lion’, and a wild plant from Jilong, Tibet.
Data associated with Fig. S1 are based on cannabinoid oxidocyclase gene family nucleotide sequences from GenBank accessions.
Data associated with Fig. S2 are based on cannabinoid oxidocyclase gene family nucleotide sequences from whole-genome assemblies of Cannabis sativa cultivars ‘Cannatonic’, ‘Chemdog91’, ‘Jamaican Lion’ (father), ‘LA confidential’, and ‘Pineapple Banana Bubble Kush’ (PBBK).
Sequence names include GenBank accession numbers.
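
The alignment files can be read without extra software. The following is a minimal Python sketch, assuming standard FASTA layout (a ">" header line followed by one or more sequence lines); the function names and the toy records are hypothetical and are not taken from the dataset. It also checks that all sequences have the same length, as expected for an alignment.

```python
def read_fasta(lines):
    """Map each sequence name to its (possibly gapped) sequence string."""
    records = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]  # header text without the leading ">"
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequences may wrap over several lines
    return {n: "".join(parts) for n, parts in records.items()}


def is_aligned(records):
    """Return True if every sequence has the same number of columns."""
    return len({len(seq) for seq in records.values()}) <= 1


# Toy input with made-up names and sequences, for illustration only.
example = [">seqA_toy", "ATG-C", "GT", ">seqB_toy", "ATGAC", "GT"]
records = read_fasta(example)
print(records)
print(is_aligned(records))

# To read a file from this dataset instead:
# with open("Fig.3_alignment.fasta") as handle:
#     records = read_fasta(handle)
```

Note that this sketch keeps only the last record if two headers are identical.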

Methods, materials and software:
Multiple sequence alignments were generated with MAFFT v7.450 using automatic selection of the appropriate algorithm, a gap open penalty of 1.26, and an offset value of 0.123. The BLOSUM62 and 100PAM scoring matrices were used for the protein and nucleotide sequence data sets, respectively.
Optimal models of sequence evolution, as determined using ModelTest-NG v0.1.5 on XSEDE via the CIPRES gateway, were WAG+I+G4 for the protein data set and GTR+I+G4 for all nucleotide data sets. Gene trees were reconstructed in a Bayesian framework using MrBayes v3.2.6 as implemented in Geneious Prime, with a chain length of 2.2 million generations, sampling every 1000th generation, and 4 heated chains with a temperature of 0.2. The first 200,000 generations were discarded as burn-in.
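
As a rough check of what these chain settings imply, the following Python sketch computes the approximate number of samples per run. It is plain arithmetic on the values stated above; the variable names are illustrative, and the counts ignore the initial state at generation 0, which some programs record as an extra sample.

```python
# Values taken from the methods text above.
CHAIN_LENGTH = 2_200_000        # generations
SAMPLE_EVERY = 1_000            # one sample per 1000 generations
BURNIN_GENERATIONS = 200_000    # generations discarded as burn-in

samples_total = CHAIN_LENGTH // SAMPLE_EVERY            # 2200
samples_discarded = BURNIN_GENERATIONS // SAMPLE_EVERY  # 200
samples_retained = samples_total - samples_discarded    # 2000
burnin_fraction = BURNIN_GENERATIONS / CHAIN_LENGTH     # about 0.091

print(samples_total, samples_discarded, samples_retained)
print(f"{burnin_fraction:.1%}")
```

Under these assumptions, about 2000 samples remain after burn-in, which corresponds to discarding roughly 9% of the chain.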

This dataset is published under the CC BY-SA (Attribution-ShareAlike) license.
This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator.
If you remix, adapt, or build upon the material, you must license the modified material under identical terms.