Skip to content

Estimating distances

Ali Osman Berk Şapcı edited this page Nov 24, 2025 · 1 revision

Estimating the distance between individual query sequences (e.g., reads) to reference genomes is one of the two main functionalities krepp offers. These distances are akin to the number of mismatches that one would get by alignment. The subcommand dist is in the following general form:

krepp dist -i /path/to/index -q /path/to/query.fastq --num-threads 4

where -q is the path (or URL) of a FASTA/Q file containing query sequences, and -i is the path of the directory storing the index.

By default, krepp simply outputs distances to stdout and writes log messages to stderr. You can write the results into a file using the -o option (see krepp dist --help).

The default mode is one-to-many (--multi --no-filter), meaning that krepp will report all reference hits and corresponding distances for each read, regardless of the distances (e.g., roughly up to ~75% sequence identity). If you use the --filter flag, krepp will only retain reference hits with distances statistically indistinguishable from the closest hit. For instance, if there is one hit with a 1% distance, krepp would not output another reference with % 15% distance. However, if the closest were 12%, krepp will most likely report 15%, but not 25% etc. krepp decides this using its likelihood model. If --no-multi is given, krepp will report a single reference with the minimum distance (ties are broken randomly). You can also cap the maximum distance to be reported via --dist-max.

The output is in a tab-separated format where the first column stands for the sequence ID, the second column is the ID of the reference matching, and the third column is the distance estimate. The first line (before the header) is commented with # to store the version information and the command used.

An example output would start as shown below:

# software: krepp       version: v0.6.0 invocation :krepp dist -i index_toy/ -q query_toy.fq
SEQ_ID  REFERENCE_NAME  DIST
||61435-4122    G000341695      0.0933
||61435-4949    G000341695      0.0505
||61435-4949    G001889305      0.1571
||61435-4949    G000830905      0.1044
||61435-4949    G001610775      0.0505
||61435-4949    G000025025      0.0933
||61435-4949    G000741845      0.0326
||61435-317     G000341695      0.0060

Lastly, krepp also offers an option to summarize all hits across reads in the given input (e.g., a metagenomic sample). This can be achieved using the --summarize option. If this is given, instead of individual read mappings, a count table will be output. These counts are weighted read counts. Each read gets a total count value of 1, which will be shared across all matching references (i.e., if there are $n$ reference hits, each gets $1/n$). The two-column table would look like:

# software: krepp       version: v0.6.0 invocation :krepp dist -i index_toy/ -q query_toy.fq --summarize
REFERENCE_NAME  WEIGHTED_COUNT  SEQUENCE_ABUNDANCE
G000741845      28.3333 0.3047
G000341695      38.3333 0.4122
G001610775      26.3333 0.2832

Clone this wiki locally