-
Notifications
You must be signed in to change notification settings - Fork 3
Estimating distances
Estimating the distance between individual query sequences (e.g., reads) to reference genomes is one of the two main functionalities krepp offers. These distances are akin to the number of mismatches that one would get by alignment. The subcommand dist is in the following general form:
krepp dist -i /path/to/index -q /path/to/query.fastq --num-threads 4where -q is the path (or URL) of a FASTA/Q file containing query sequences, and -i is the path of the directory storing the index.
By default, krepp simply outputs distances to stdout and writes log messages to stderr. You can write the results into a file using the -o option (see krepp dist --help).
The default mode is one-to-many (--multi --no-filter), meaning that krepp will report all reference hits and corresponding distances for each read, regardless of the distances (e.g., roughly up to ~75% sequence identity). If you use the --filter flag, krepp will only retain reference hits with distances statistically indistinguishable from the closest hit. For instance, if there is one hit with a 1% distance, krepp would not output another reference with % 15% distance. However, if the closest were 12%, krepp will most likely report 15%, but not 25% etc. krepp decides this using its likelihood model. If --no-multi is given, krepp will report a single reference with the minimum distance (ties are broken randomly). You can also cap the maximum distance to be reported via --dist-max.
The output is in a tab-separated format where the first column stands for the sequence ID, the second column is the ID of the reference matching, and the third column is the distance estimate. The first line (before the header) is commented with # to store the version information and the command used.
An example output would start as shown below:
# software: krepp version: v0.6.0 invocation :krepp dist -i index_toy/ -q query_toy.fq
SEQ_ID REFERENCE_NAME DIST
||61435-4122 G000341695 0.0933
||61435-4949 G000341695 0.0505
||61435-4949 G001889305 0.1571
||61435-4949 G000830905 0.1044
||61435-4949 G001610775 0.0505
||61435-4949 G000025025 0.0933
||61435-4949 G000741845 0.0326
||61435-317 G000341695 0.0060
Lastly, krepp also offers an option to summarize all hits across reads in the given input (e.g., a metagenomic sample). This can be achieved using the --summarize option. If this is given, instead of individual read mappings, a count table will be output. These counts are weighted read counts. Each read gets a total count value of 1, which will be shared across all matching references (i.e., if there are
# software: krepp version: v0.6.0 invocation :krepp dist -i index_toy/ -q query_toy.fq --summarize
REFERENCE_NAME WEIGHTED_COUNT SEQUENCE_ABUNDANCE
G000741845 28.3333 0.3047
G000341695 38.3333 0.4122
G001610775 26.3333 0.2832