Pankegg_make_db.py
pankegg_make_db.py is Pankegg’s data parser. It builds the SQL database that the web application later uses.
Input file format
To create the SQL database, you will need a CSV file listing all your samples and their corresponding input directories or files.
The expected CSV format uses the following header:
Sample name,Annotation_dir,classification_dir,Checkm2_dir
Each subsequent line represents a sample, and should look like one of the following examples:
- For Sourmash classification:
SAMPLE1,/Path/to/SAMPLE1/bin_annotation/*.annotations.tsv,/Path/to/SAMPLE1/sourmash/*,/Path/to/SAMPLE1/checkm2_dir/quality_report.tsv
SAMPLE2,/Path/to/SAMPLE2/bin_annotation/*.annotations.tsv,/Path/to/SAMPLE2/sourmash/*,/Path/to/SAMPLE2/checkm2_dir/quality_report.tsv
- For GTDB-TK classification (when using the --gtdbtk flag):
SAMPLE1,/Path/to/SAMPLE1/bin_annotation/*.annotations.tsv,/Path/to/SAMPLE1/gtdb_results/*.summary.tsv,/Path/to/SAMPLE1/checkm2_dir/quality_report.tsv
SAMPLE2,/Path/to/SAMPLE2/bin_annotation/*.annotations.tsv,/Path/to/SAMPLE2/gtdb_results/*.summary.tsv,/Path/to/SAMPLE2/checkm2_dir/quality_report.tsv
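Before building the database, it can help to check that the CSV is well formed. The following is a minimal Python sketch and not part of Pankegg: the function name check_samples_csv is hypothetical, and it assumes that every path column is either a plain path or a glob pattern that should match at least one existing file.

```python
import csv
import glob

EXPECTED_HEADER = ["Sample name", "Annotation_dir", "classification_dir", "Checkm2_dir"]


def check_samples_csv(csv_path):
    """Return a list of human-readable problems found in the samples CSV."""
    problems = []
    with open(csv_path, newline="") as handle:
        rows = list(csv.reader(handle))
    if not rows or rows[0] != EXPECTED_HEADER:
        problems.append("header does not match the expected column names")
        return problems
    for line_no, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(EXPECTED_HEADER):
            problems.append(f"line {line_no}: expected 4 columns, found {len(row)}")
            continue
        sample, *patterns = row
        for column, pattern in zip(EXPECTED_HEADER[1:], patterns):
            # glob.glob returns an empty list when nothing matches;
            # for a plain path it returns [path] if the path exists.
            if not glob.glob(pattern):
                problems.append(
                    f"line {line_no} ({sample}): {column} matches no file: {pattern}"
                )
    return problems
```

An empty list means the header and all patterns look usable, for example `check_samples_csv("samples.csv") == []`.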
Create the input file
You can create this input CSV file automatically with a bash loop, depending on how your data are organized:
If your results are grouped by sample:
Dir/
├── SAMPLEID1/
│   ├── bin_annotation/
│   ├── sourmash/
│   ├── gtdb_results/
│   └── checkm2_dir/
└── SAMPLEID2/
    ├── bin_annotation/
    ├── sourmash/
    ├── gtdb_results/
    └── checkm2_dir/
Run the following script to include Sourmash classification:
echo "Sample name,Annotation_dir,classification_dir,Checkm2_dir" > samples.csv
for sample in Dir/*; do
    # Skip anything in Dir/ that is not a sample directory
    [ -d "$sample" ] || continue
    name=$(basename "$sample")
    # The quotes keep * unexpanded, so the glob pattern itself is written to the CSV
    ann_path="$sample/bin_annotation/*.annotations.tsv"
    class_path="$sample/sourmash/*"
    checkm2_path="$sample/checkm2_dir/quality_report.tsv"
    echo "$name,$ann_path,$class_path,$checkm2_path" >> samples.csv
done
Run the following script to include GTDB-TK classification:
echo "Sample name,Annotation_dir,classification_dir,Checkm2_dir" > samples.csv
for sample in Dir/*; do
    # Skip anything in Dir/ that is not a sample directory
    [ -d "$sample" ] || continue
    name=$(basename "$sample")
    # The quotes keep * unexpanded, so the glob pattern itself is written to the CSV
    ann_path="$sample/bin_annotation/*.annotations.tsv"
    class_path="$sample/gtdb_results/*.summary.tsv"
    checkm2_path="$sample/checkm2_dir/quality_report.tsv"
    echo "$name,$ann_path,$class_path,$checkm2_path" >> samples.csv
done
If your results are grouped by tool, then sample:
Dir/
├── bin_annotation/
│   ├── SAMPLEID1/
│   ├── SAMPLEID2/
│   └── ...
├── sourmash/
│   ├── SAMPLEID1/
│   ├── SAMPLEID2/
│   └── ...
├── gtdb_results/
│   ├── SAMPLEID1/
│   ├── SAMPLEID2/
│   └── ...
└── checkm2_dir/
    ├── SAMPLEID1/
    ├── SAMPLEID2/
    └── ...
Run the following script to include Sourmash classification:
echo "Sample name,Annotation_dir,classification_dir,Checkm2_dir" > samples.csv
for sample in Dir/bin_annotation/*; do
    # Skip anything that is not a sample directory
    [ -d "$sample" ] || continue
    name=$(basename "$sample")
    # The quotes keep * unexpanded, so the glob pattern itself is written to the CSV
    ann_path="Dir/bin_annotation/$name/*.annotations.tsv"
    class_path="Dir/sourmash/$name/*"
    checkm2_path="Dir/checkm2_dir/$name/quality_report.tsv"
    echo "$name,$ann_path,$class_path,$checkm2_path" >> samples.csv
done
Run the following script to include GTDB-TK classification:
echo "Sample name,Annotation_dir,classification_dir,Checkm2_dir" > samples.csv
for sample in Dir/bin_annotation/*; do
    # Skip anything that is not a sample directory
    [ -d "$sample" ] || continue
    name=$(basename "$sample")
    # The quotes keep * unexpanded, so the glob pattern itself is written to the CSV
    ann_path="Dir/bin_annotation/$name/*.annotations.tsv"
    class_path="Dir/gtdb_results/$name/*.summary.tsv"
    checkm2_path="Dir/checkm2_dir/$name/quality_report.tsv"
    echo "$name,$ann_path,$class_path,$checkm2_path" >> samples.csv
done
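If you prefer Python to bash, the same CSV can be produced for either directory layout with a short helper. This is only a sketch under stated assumptions: write_samples_csv and its layout parameter are hypothetical names, not part of Pankegg, and the sample list is taken from the subdirectories of Dir/ (grouped by sample) or Dir/bin_annotation/ (grouped by tool), as in the bash loops.

```python
import csv
from pathlib import Path

HEADER = ["Sample name", "Annotation_dir", "classification_dir", "Checkm2_dir"]


def write_samples_csv(root, out_csv, layout="by_sample", gtdbtk=False):
    """Write the samples CSV for the results tree under `root`.

    layout="by_sample" expects root/SAMPLE/TOOL/, layout="by_tool" expects
    root/TOOL/SAMPLE/. Returns the list of sample names written.
    """
    if layout not in ("by_sample", "by_tool"):
        raise ValueError("layout must be 'by_sample' or 'by_tool'")
    root = Path(root)
    # Classification directory and file pattern depend on the classifier used
    class_dir, class_glob = ("gtdb_results", "*.summary.tsv") if gtdbtk else ("sourmash", "*")
    sample_parent = root if layout == "by_sample" else root / "bin_annotation"
    samples = sorted(p.name for p in sample_parent.iterdir() if p.is_dir())

    def tool_path(tool, name, leaf):
        base = root / name / tool if layout == "by_sample" else root / tool / name
        # The glob pattern is written as text; it is not expanded here
        return str(base / leaf)

    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(HEADER)
        for name in samples:
            writer.writerow([
                name,
                tool_path("bin_annotation", name, "*.annotations.tsv"),
                tool_path(class_dir, name, class_glob),
                tool_path("checkm2_dir", name, "quality_report.tsv"),
            ])
    return samples
```

For example, `write_samples_csv("Dir", "samples.csv", layout="by_tool", gtdbtk=True)` mirrors the last bash loop above.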
Parameters Make DB
-i, --input
Path to the CSV file listing all samples and their input files/directories (required).
-o, --output
Name of the output database file (without the extension). The default is pankegg. The .db extension is automatically added to the output name.
--output_dir
Directory the database will be written to. The default is ./db_output.
--gtdbtk
Use this flag if your classification files were generated with GTDB-TK instead of Sourmash.
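To show how the parameters fit together, here is a small sketch that assembles the command line from Python. The helper build_make_db_command is a hypothetical name introduced for illustration; it only uses the flags and defaults listed above and assumes pankegg_make_db.py is in the current directory.

```python
import sys


def build_make_db_command(input_csv, output="pankegg", output_dir="./db_output", gtdbtk=False):
    """Return the argument list for one pankegg_make_db.py run."""
    command = [
        sys.executable, "pankegg_make_db.py",
        "-i", input_csv,             # required samples CSV
        "-o", output,                # database name, .db is appended by the script
        "--output_dir", output_dir,  # directory the database is written to
    ]
    if gtdbtk:
        command.append("--gtdbtk")   # classification files come from GTDB-TK
    return command


# To actually run it:
#   import subprocess
#   subprocess.run(build_make_db_command("samples.csv"), check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.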
Output Make DB
The output is an SQLite database (*.db) that can be opened with tools like sqlite3, but is best used with pankegg_app.py for interactive browsing.
The database contains the following tables:
taxonomy
bin
map
kegg
bin_map_kegg
bin_map
map_kegg
bin_extra
bin_extra_kegg
sample
Each table stores specific information related to bins, pathways, taxonomy, and annotation results for easy querying and visualization.
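A quick way to confirm that these tables exist and are populated is to query the database with Python's built-in sqlite3 module. The sketch below makes no assumption about the columns of each table; the function name table_row_counts is hypothetical, and the path in the usage note assumes the default output name and directory.

```python
import sqlite3


def table_row_counts(db_path):
    """Map each user table in the SQLite file to its number of rows."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist
    connection = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        return {
            name: connection.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        connection.close()
```

With the defaults, `table_row_counts("db_output/pankegg.db")` should list the table names given above as keys.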
Run with test data
To verify your installation and familiarize yourself with Pankegg, you can run a test using provided data. Download the example archive, unzip it, and generate test databases using the included CSV files:
Download the test data archive from OSF:
wget https://osf.io/download/5v3zc/ -O pankegg_test_data.zip
OR
curl -L -o pankegg_test_data.zip https://osf.io/download/5v3zc/
Unzip the archive:
unzip pankegg_test_data.zip
This will create a directory called pankegg_test_data.
Create a test database for Sourmash classification:
python pankegg_make_db.py -i pankegg_test_data/sourmash_example.csv -o test_sourmash --output_dir pankegg_test_data
Create a test database for GTDB-TK classification:
python pankegg_make_db.py -i pankegg_test_data/gtdbtk_example.csv -o test_gtdbtk --output_dir pankegg_test_data --gtdbtk
After running these commands, you should find the newly created test_sourmash.db and test_gtdbtk.db inside the pankegg_test_data directory, alongside the existing sourmash_example.db and gtdbtk_example.db files.
Each newly generated database should be identical to its existing counterpart, so you can check that pankegg_make_db.py works correctly by comparing the two files.
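One possible way to do this comparison from Python is sketched below. It first checks whether the two files are byte-for-byte identical and, if they are not, falls back to comparing the rows of every table regardless of row order. The fallback is an assumption about what "identical" should mean when the files differ only in storage details; compare_databases is a hypothetical helper, not part of Pankegg.

```python
import filecmp
import sqlite3


def _table_contents(db_path):
    """Return {table name: sorted list of rows} for every user table."""
    connection = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            )
        ]
        # Sorting by repr gives a stable order even when rows contain NULLs
        return {
            name: sorted(connection.execute(f'SELECT * FROM "{name}"').fetchall(), key=repr)
            for name in names
        }
    finally:
        connection.close()


def compare_databases(path_a, path_b):
    """Return 'identical files', 'identical contents', or 'different'."""
    if filecmp.cmp(path_a, path_b, shallow=False):
        return "identical files"
    if _table_contents(path_a) == _table_contents(path_b):
        return "identical contents"
    return "different"
```

For example, `compare_databases("pankegg_test_data/test_sourmash.db", "pankegg_test_data/sourmash_example.db")` compares the Sourmash pair, and the same call with the two GTDB-TK files compares the other.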
Troubleshooting
For more details or troubleshooting, please consult the Reporting Bugs & Contributing section.