Pankegg_make_db.py


Pankegg’s data parser that constructs the SQL database that is later used in the web application.

Input file format

To create the SQL database, you will need a CSV file listing all your samples and their corresponding input directories or files.
The expected CSV format uses the following header:

Sample name,Annotation_dir,classification_dir,Checkm2_dir

Each subsequent line represents a sample, and should look like one of the following examples:

  • For Sourmash classification:
    SAMPLE1,/Path/to/SAMPLE1/bin_annotation/*.annotations.tsv,/Path/to/SAMPLE1/sourmash/*,/Path/to/SAMPLE1/checkm2_dir/quality_report.tsv
    SAMPLE2,/Path/to/SAMPLE2/bin_annotation/*.annotations.tsv,/Path/to/SAMPLE2/sourmash/*,/Path/to/SAMPLE2/checkm2_dir/quality_report.tsv
  • For GTDB-TK classification (when using the --gtdbtk flag):
    SAMPLE1,/Path/to/SAMPLE1/bin_annotation/*.annotations.tsv,/Path/to/SAMPLE1/gtdb_results/*.summary.tsv,/Path/to/SAMPLE1/checkm2_dir/quality_report.tsv
    SAMPLE2,/Path/to/SAMPLE2/bin_annotation/*.annotations.tsv,/Path/to/SAMPLE2/gtdb_results/*.summary.tsv,/Path/to/SAMPLE2/checkm2_dir/quality_report.tsv

Create the input file

You can create this input CSV file automatically with a bash loop, depending on how your data are organized:

If your results are grouped by sample:

Dir/
├── SAMPLEID1/
   ├── bin_annotation/
   ├── sourmash/
   ├── gtdb_results/
   └── checkm2_dir/
└── SAMPLEID2/
    ├── bin_annotation/
    ├── sourmash/
    ├── gtdb_results/
    └── checkm2_dir/

Run the following script to include Sourmash annotation:

echo "Sample name,Annotation_dir,classification_dir,Checkm2_dir" > samples.csv
for sample in Dir/*; do
    name=$(basename "$sample");
    ann_path="$sample/bin_annotation/*.annotations.tsv";
    class_path="$sample/sourmash/*";
    checkm2_path="$sample/checkm2_dir/quality_report.tsv";
    echo "$name,$ann_path,$class_path,$checkm2_path" >> samples.csv;
done

Run the following script to include GTDBTK annotation:

echo "Sample name,Annotation_dir,classification_dir,Checkm2_dir" > samples.csv
for sample in Dir/*; do
   name=$(basename "$sample");
   ann_path="$sample/bin_annotation/*.annotations.tsv";
   class_path="$sample/gtdb_results/*.summary.tsv";
   checkm2_path="$sample/checkm2_dir/quality_report.tsv";
   echo "$name,$ann_path,$class_path,$checkm2_path" >> samples.csv;
done

If your results are grouped by tool, then sample:

Dir/
├── bin_annotation/
   ├── SAMPLEID1/
   ├── SAMPLEID2/
   └── ...
├── sourmash/
   ├── SAMPLEID1/
   ├── SAMPLEID2/
   └── ...
├── gtdb_results/
   ├── SAMPLEID1/
   ├── SAMPLEID2/
   └── ...
└── checkm2_dir/
   ├── SAMPLEID1/
   ├── SAMPLEID2/
   └── ...

Run the following script to include Sourmash annotation:

echo "Sample name,Annotation_dir,classification_dir,Checkm2_dir" > samples.csv
for sample in Dir/bin_annotation/*; do
    name=$(basename "$sample");
    ann_path="Dir/bin_annotation/$name/*.annotations.tsv";
    class_path="Dir/sourmash/$name/*";
    checkm2_path="Dir/checkm2_dir/$name/quality_report.tsv";
    echo "$name,$ann_path,$class_path,$checkm2_path" >> samples.csv;
done

Run the following script to include GTDBTK annotation:

echo "Sample name,Annotation_dir,classification_dir,Checkm2_dir" > samples.csv
for sample in Dir/bin_annotation/*; do
    name=$(basename "$sample");
    ann_path="Dir/bin_annotation/$name/*.annotations.tsv";
    class_path="Dir/gtdb_results/$name/*.summary.tsv";
    checkm2_path="Dir/checkm2_dir/$name/quality_report.tsv";
    echo "$name,$ann_path,$class_path,$checkm2_path" >> samples.csv;
done

Parameters Make DB

  • -i, --input
    Path to the CSV file listing all samples and their input files/directories (required).

  • -o, --output
    Name of the output database file (without the extension). The default is pankegg. The .db extension is automatically added to the output name.

  • --output_dir
    Directory the database will be written to. The default is ./db_output.

  • --gtdbtk
    Use this flag if your classification files were generated with GTDB-TK instead of Sourmash.

Output Make DB

The output is an SQLite database (*.db) that can be opened with tools like sqlite3, but is best used with pankegg_app.py for interactive browsing.

The database contains the following tables:

  • taxonomy
  • bin
  • map
  • kegg
  • bin_map_kegg
  • bin_map
  • map_kegg
  • bin_extra
  • bin_extra_kegg
  • sample

Each table stores specific information related to bins, pathways, taxonomy, and annotation results for easy querying and visualization.

Run with Testdata

To verify your installation and familiarize yourself with Pankegg, you can run a test using provided data. Download the example archive, unzip it, and generate test databases using the included CSV files:

Download the test data archive from OSF:

wget https://osf.io/download/5v3zc/ -O pankegg_test_data.zip

OR

curl -L -o pankegg_test_data.zip  https://osf.io/download/5v3zc/

Unzip the archive:

This will create a directory called pankegg_test_data.

unzip pankegg_test_data.zip

Create a test database for Sourmash classification:

python pankegg_make_db.py -i pankegg_test_data/sourmash_example.csv -o test_sourmash --output_dir pankegg_test_data

Create a test database for GTDB-TK classification:

python pankegg_make_db.py -i pankegg_test_data/gtdbtk_example.csv -o test_gtdbtk --output_dir pankegg_test_data --gtdbtk

After running these commands, you should find newly created test_sourmash.db and test_gtdbtk.db inside the pankegg_test_data directory. This is in addition to the already existing sourmash_example.db and gtdbtk_example.db files in the same directory.

The existing and newly generated respective databases should be identical, so the validity of pankegg_make_db.py can be tested by comparing the two files.

Troubleshooting

For more details or troubleshooting, please consult the Reporting Bugs & Contributing section.