API

Welcome to the API documentation. Below is the documentation for the code that makes up Clade-o-matic.

main

cladeomatic.main.print_usage_and_exit()

This method prints brief usage instructions of Clade-o-matic to the command line

create

genotype

cladeomatic.genotype.convert_features_to_mutations(feature_lookup, snp_profile)

This method converts the mutations or features in the scheme to the snp base mutation and sets this on a data dictionary with the snp state.

Parameters:
  • feature_lookup (dict) – The mutations or features in the scheme data dictionary

  • snp_profile (dict) – The snp profile for a given genotype

Returns:

A dictionary of the snp ids and if they belong to the alt or ref for the genotype

Return type:

dict

cladeomatic.genotype.get_snp_profiles(valid_positions, vcf_file)

This method retrieves both the SNP and sample profiles from the VCF file, ensures the SNPs are in valid positions and adds the sample SNP profile to the data dictionary.

Parameters:
  • valid_positions (list) – The list of integers indicating the valid SNP positions for this sample set

  • vcf_file (str) – The file path to vcf or tsv snp data files

Returns:

The dictionary of the snp data

Return type:

dict

cladeomatic.genotype.is_valid_file(filename)

A helper method to determine if the file path leads to a valid file. For the file to be considered valid it must exist and have the minimum file size.

Parameters:

filename (str) – The path to the file requiring validation

Returns:

True if this is a valid file, False if not

Return type:

bool
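
The check described above can be sketched as follows. This is a minimal illustration, not the module's code: the function name and the min_size parameter are hypothetical, and the minimum size that Clade-o-matic actually enforces is not stated in this documentation.

```python
import os


def is_valid_file_sketch(filename, min_size=1):
    """Return True if `filename` is an existing regular file of at least `min_size` bytes.

    `min_size` is a hypothetical threshold chosen for illustration only.
    """
    return os.path.isfile(filename) and os.path.getsize(filename) >= min_size
```

With the default threshold, an empty file and a missing path are both rejected.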

cladeomatic.genotype.parse_args()

Argument Parsing method.

A function to parse the command line arguments passed at initialization of Clade-o-matic, format these arguments, and return help prompts to the user shell when specified.

Returns:

The arguments and their user specifications, the usage help prompts and the correct formatting for the incoming argument (str, int, etc.)

Return type:

ArgumentParser object

cladeomatic.genotype.parse_scheme_features(scheme_file)

This method parses the passed snp scheme file and creates a scheme and feature data set for downstream processing. The scheme data dictionary consists of the snp scheme data: the snp position, the snp base and whether the genotype is positive, partial and/or allowed. The feature data dictionary contains the number of positions, the number of mutations, the total features and the mutation data for searching. This feature data dictionary also includes the features of each genotype: the mutation lookup id, whether the snp was a ref or alt state, the base and the base position.

Parameters:

scheme_file (str) – The file path to the snp scheme file to read in

Returns:

The constructed features data dictionary, with the scheme data dictionary inside

Return type:

dict

cladeomatic.genotype.parse_scheme_genotypes(scheme_file)

This method parses the snp scheme file to construct a scheme data dictionary for further downstream processing.

Parameters:

scheme_file (str) – The file path to the snp scheme file for parsing

Returns:

The scheme data dictionary

Return type:

dict

cladeomatic.genotype.run()

The main method to read the command line arguments and create the genotype call file. This method reads the scheme, variant, sample and genotype metadata files, and constructs various processing data dictionaries to ultimately call the genotypes and provide a measure of quality control of those calls for the samples passed.

Notes

Refer to https://www.ray.io for more information about the Ray instances used in this module.

cladeomatic.genotype.write_genotype_calls(header, scheme_name, outfile, genotype_results, sample_metadata, genotype_meta, scheme_data, sample_variants, min_positions=1)

This method writes the genotype calls determined through previous methods to a file. Please refer to the file examples/small_test/cladeomatic/cladeomatic-genotype.calls.txt for more information.

Parameters:
  • header (str) – The header for the output file

  • scheme_name (str) – The name for the scheme

  • outfile (str) – The output file path

  • genotype_results (dict) – The dictionary of the genotype calls from call_genotypes()

  • sample_metadata (dict) – The sample metadata data dictionary

  • genotype_meta (dict) – The genotype metadata data dictionary

  • scheme_data (dict) – The snp scheme data dictionary

  • sample_variants (dict) – The sample variants data dictionary

  • min_positions (int) – The minimum number of . Default of 1.

benchmark

cladeomatic.benchmark.convert_genotype_report(data_dict, predicted_field_name, submitted_field_name)

A method that takes in the genotype call data dictionary and the predicted and submitted field names, and converts them to a data dictionary consisting of the sample ids for the genotypes, the predicted genotype ids, and the submitted genotype ids.

Parameters:
  • data_dict (dict) – The genotype call dictionary to parse

  • predicted_field_name (str) – The name of the predicted genotype field within the genotype call file

  • submitted_field_name (str) – The name of the submitted genotype field within the genotype call file

Returns:

A dictionary of sample_ids, the predicted genotype calls and the submitted genotype calls

Return type:

dict

cladeomatic.benchmark.filter_genotypes_exclude(labels, name)

This method takes the dictionary of sample_ids, the predicted genotype calls and the submitted genotype calls, and filters the dictionary for the genotype name. If the genotype name exists in either the predicted or the submitted genotype call the sample is excluded from addition to the data dictionary.

Parameters:
  • labels (dict) – A dictionary of sample_ids, the predicted genotype calls and the submitted genotype calls

  • name (str) – The name of the genotype to filter by

Returns:

The genotype name-filtered dictionary of sample_ids, the predicted genotype calls and the submitted genotype calls

Return type:

dict

cladeomatic.benchmark.filter_genotypes_include(labels, name)

This method takes the dictionary of sample_ids, the predicted genotype calls and the submitted genotype calls, and filters the dictionary for the genotype name. If the genotype name exists in either the predicted or the submitted genotype call the sample will be added to the dictionary and the inclusive dictionary is returned.

Parameters:
  • labels (dict) – A dictionary of sample_ids, the predicted genotype calls and the submitted genotype calls

  • name (str) – The name of the genotype to filter by

Returns:

The genotype name-filtered dictionary of sample_ids, the predicted genotype calls and the submitted genotype calls

Return type:

dict
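
The two filters above differ only in whether matching samples are kept or dropped, which the following sketch illustrates. The layout of labels (three parallel lists under the keys sample_ids, predicted and submitted), the exact-match comparison, and the function name are assumptions made for this example. They are not taken from the module.

```python
def filter_labels_sketch(labels, name, include=True):
    """Keep (include=True) or drop (include=False) samples whose predicted or
    submitted genotype equals `name`.

    The parallel-list layout of `labels` is an assumption for illustration.
    """
    filtered = {"sample_ids": [], "predicted": [], "submitted": []}
    rows = zip(labels["sample_ids"], labels["predicted"], labels["submitted"])
    for sample_id, predicted, submitted in rows:
        matches = name in (predicted, submitted)  # exact match on either call
        if matches == include:
            filtered["sample_ids"].append(sample_id)
            filtered["predicted"].append(predicted)
            filtered["submitted"].append(submitted)
    return filtered
```

Calling the sketch with include=True mirrors filter_genotypes_include(), and include=False mirrors filter_genotypes_exclude().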

cladeomatic.benchmark.get_genotype_counts(data_dict, field_name='genotype')

This method counts the number of unique genotypes that appear in the data dictionary passed.

Parameters:
  • data_dict (dict) – The data dictionary that contains genotypes for counting

  • field_name (str) – The name of the field within the data dictionary from which to parse and count genotypes. Default is ‘genotype’.

Returns:

The dictionary of genotypes and their counts

Return type:

dict
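
A counting step of this kind can be sketched with collections.Counter. The sketch assumes that data_dict is keyed by sample id and that each value is a record containing field_name. That layout and the function name are assumptions, not documented behaviour.

```python
from collections import Counter


def count_genotypes_sketch(data_dict, field_name="genotype"):
    """Count how many records carry each genotype value.

    Assumes `data_dict` maps sample ids to records (dicts) that contain
    `field_name`; this layout is hypothetical.
    """
    return dict(Counter(record[field_name] for record in data_dict.values()))
```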

cladeomatic.benchmark.get_problem_bases(profile, rule)

This method identifies which snps exist in the variants data dictionary but not in the genotype rules or schema dictionary for a specific sample id.

Parameters:
  • profile (dict) – The variants data dictionary for a specific sample id

  • rule (dict) – The scheme data dictionary for a specific sample id

Returns:

A data dictionary of the snp positions and bases that are not found in both the variants and scheme dictionaries for the sample passed

Return type:

dict

cladeomatic.benchmark.parse_args()

Argument Parsing method.

A function to parse the command line arguments passed at initialization of Clade-o-matic, format these arguments, and return help prompts to the user shell when specified.

Returns:

The arguments and their user specifications, the usage help prompts and the correct formatting for the incoming argument (str, int, etc.)

Return type:

ArgumentParser object

cladeomatic.benchmark.parse_genotype_report(header, file)

A helper method to read in the genotype call file for downstream use.

Parameters:
  • header (list) – The list of strings for the genotype call report header

  • file (str) – The path to the genotype call report file to parse

Returns:

A dictionary of the parsed genotype call report file

Return type:

dict

cladeomatic.benchmark.run()

The main method to read the command line arguments and create a file for the F1 scores and, if required, a file that records per sample any sites that are responsible for the genotype not being called. This method reads the genotype call file, the variant call file and the kmer inclusive scheme file to verify the genotypes and determine if the submitted and predicted genotypes are a match. The results are written to the output files examples/small_test/cladeomatic/benchmark/cladeomatic-scheme.scores.txt and examples/small_test/cladeomatic/benchmark/cladeomatic-sample.results.txt.
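
For readers unfamiliar with the F1 score, the sketch below computes it per genotype as 2·TP / (2·TP + FP + FN), treating the submitted genotype as the truth. Whether the benchmark module scores genotypes in exactly this way is not stated here. The function name and the list-based inputs are assumptions for illustration.

```python
def f1_per_genotype_sketch(submitted, predicted):
    """Return an F1 score for every genotype label seen in either list.

    `submitted` is treated as the truth and `predicted` as the call; both
    are equal-length lists ordered by sample. Hypothetical helper.
    """
    pairs = list(zip(submitted, predicted))
    scores = {}
    for genotype in set(submitted) | set(predicted):
        tp = sum(1 for s, p in pairs if s == genotype and p == genotype)
        fp = sum(1 for s, p in pairs if s != genotype and p == genotype)
        fn = sum(1 for s, p in pairs if s == genotype and p != genotype)
        denom = 2 * tp + fp + fn
        scores[genotype] = (2 * tp / denom) if denom else 0.0
    return scores
```

For submitted calls A, A, B, B and predicted calls A, B, B, B, genotype A scores 2/3 and genotype B scores 0.8.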

cladeomatic.benchmark.write_scheme_scores(header, file, scheme_name, scores)

This method writes the scheme scores determined to a file. Please refer to the file examples/small_test/cladeomatic/benchmark/cladeomatic-scheme.scores.txt for more information.

Parameters:
  • header (str) – The header string for the output file

  • file (str) – The path to the output file

  • scheme_name (str) – The name of the scheme

  • scores (dict) – The data dictionary of the scores to be written out

cladeomatic.benchmark.write_updated_genotype_report(header, file, scheme_name, variants, data_dict, genotype_rules, predicted_field_name, submitted_field_name)

This method writes the genotype sample results file with the sample_id, scheme name, submitted genotype, predicted genotype, if the genotypes are a match for each other, and if not what are the problem snp positions and bases (features). Please refer to the file examples/small_test/cladeomatic/benchmark/cladeomatic-sample.results.txt for more information.

Parameters:
  • header (str) – The header string for the output file

  • file (str) – The file path for the output file

  • scheme_name (str) – The name for the scheme

  • variants (dict) – The data dictionary for the variant call file

  • data_dict (dict) – The data dictionary for the called genotypes

  • genotype_rules (dict) – The data dictionary for the genotype scheme/rules

  • predicted_field_name (str) – The name for the predicted genotype field name

  • submitted_field_name (str) – The name for the submitted genotype field name

namer

cladeomatic.namer.parse_args()

Argument Parsing method.

A function to parse the command line arguments passed at initialization of Clade-o-matic, format these arguments, and return help prompts to the user shell when specified.

Returns:

The arguments and their user specifications, the usage help prompts and the correct formatting for the incoming argument (str, int, etc.)

Return type:

ArgumentParser object

cladeomatic.namer.rename(lookup, queries)

This method takes in the naming data dictionary and the positive or partial genotypes list to create a new list of names for the genotypes if the names are the same in both the naming dictionary and the positive or partial genotypes list.

Parameters:
  • lookup (dict) – The naming data dictionary

  • queries (list) – The list of positive genotypes

Returns:

The list of names that are the same between the naming dictionary and the genotypes

Return type:

list
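
One possible reading of this lookup is sketched below: each query genotype found in the naming dictionary is replaced by its mapped name, and queries without an entry are skipped. Both the old-name to new-name layout of lookup and the decision to skip unmatched queries are assumptions. The description above does not settle either point.

```python
def rename_sketch(lookup, queries):
    """Map each query genotype to its name in `lookup`, skipping unknown ones.

    Assumes `lookup` maps existing genotype names to new names (hypothetical).
    """
    return [lookup[name] for name in queries if name in lookup]
```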

cladeomatic.namer.run()

This method takes in the data dictionaries for the scheme and names and renames the genotypes accordingly.

clades

class cladeomatic.clades.clade_worker(vcf, metadata_dict, dist_mat_file, groups, ref_seq, mode, perform_compression=True, delim='.', min_snp_count=1, max_snps=-1, max_states=6, min_members=1, min_inter_clade_dist=1, num_threads=1, max_snp_resolution_thresh=0, method='average', rcor_thresh=0.4, min_perc=1)

The clade worker class provides methods to take the user provided input and cluster the data, create and compress clades, and further summarize the SNP variants within the samples.

__init__(vcf, metadata_dict, dist_mat_file, groups, ref_seq, mode, perform_compression=True, delim='.', min_snp_count=1, max_snps=-1, max_states=6, min_members=1, min_inter_clade_dist=1, num_threads=1, max_snp_resolution_thresh=0, method='average', rcor_thresh=0.4, min_perc=1)

The instantiation of the clade_worker class.

Parameters:
  • vcf (string) – The file path to the VCF file for clade processing

  • metadata_dict (dict) – The dictionary of the metadata for the samples

  • dist_mat_file (str) – The file path to the distance matrix file created by clade-o-matic

  • groups (dict) – The dictionary containing the grouping of the data

  • ref_seq (str) – The reference sequence

  • mode (str) – A string to denote either ‘tree’ or ‘group’ mode for processing

  • perform_compression (bool) – A boolean to denote if tree compression should occur. Default is True.

  • delim (str) – A string to denote the file/sample delimiter. Default is ‘.’

  • min_snp_count (int) – The minimum number of unique snps required for a clade to be valid. Default is 1.

  • max_snps (int) – The maximum number of snps required for the definition of a unique genotype. Default is -1.

  • max_states (int) – The maximum number of states for a position [A,T,C,G,N,-]. Default is 6.

  • min_members (int) – The minimum number of members required for a clade to be considered valid. Default is 1.

  • min_inter_clade_dist (int) – The minimum inter-clade distance. Default is 1.

  • num_threads (int) – The number of threads desired for Ray processing. Default is 1.

  • max_snp_resolution_thresh (int) – The maximum snp resolution for the clade. The maximum number of members a clade is required to have to be considered valid. Default is 0.

  • method (str) – Method of clustering according to scipy.cluster.hierarchy.linkage. Default is ‘average’.

  • rcor_thresh (float) – The correlation coefficient threshold. Default is 0.4.

  • min_perc (float) – Minimum percentage of clade members to be positive for a kmer to be valid. Default is 1.

as_range(values)

A helper method to determine the trough ranges through the histogram data passed.

Parameters:

values (list) – A list of integers to find the ranges for

Returns:

A list of tuples with the start and end index of each range

Return type:

list
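
Collapsing a list of integers into (start, end) tuples can be sketched as below. The sketch assumes that a range means a run of consecutive integers. That interpretation and the function name are assumptions, not confirmed by the description above.

```python
def consecutive_ranges_sketch(values):
    """Group integers into (start, end) tuples of consecutive runs.

    Duplicates are ignored and the input does not need to be sorted.
    """
    ranges = []
    for value in sorted(set(values)):
        if ranges and value == ranges[-1][1] + 1:
            # Extend the current run by one.
            ranges[-1] = (ranges[-1][0], value)
        else:
            # Start a new run.
            ranges.append((value, value))
    return ranges
```

For the input [1, 2, 3, 7, 8, 10] the sketch returns [(1, 3), (7, 8), (10, 10)].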

calc_metadata_counts(sample_set)

A helper method to count the metadata values in the sample set passed. This method loops through the field identifiers of the metadata and counts the total values for each field value.

Example: the location field has the values Europe and North America. This method will count the number of samples that correspond to Europe and North America (Europe 10, North America 15).

Parameters:

sample_set (set) – A set for the samples to tabulate metadata counts

Returns:

A dictionary for the metadata field id, value and the value counts

Return type:

dict
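
The tabulation in the example above can be sketched as follows. The metadata layout (sample id mapped to a dictionary of field and value) and the function name are assumptions. Inside the class the metadata comes from the metadata_dict passed at instantiation rather than from an argument.

```python
from collections import Counter, defaultdict


def metadata_counts_sketch(metadata, sample_set):
    """Count, per metadata field, how many samples in `sample_set` have each value.

    `metadata` is assumed to map sample ids to {field: value} dicts (hypothetical).
    """
    counts = defaultdict(Counter)
    for sample_id in sample_set:
        for field, value in metadata.get(sample_id, {}).items():
            counts[field][value] += 1
    return {field: dict(values) for field, values in counts.items()}
```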

calc_node_associations_groups()

A method to loop through the metadata counts to calculate Fisher’s exact test for the node associations in the clades already determined.

calc_node_distances()

This method reads the distance matrix file previously created and sets the node’s closest distance, the closest sample distances, the average distances within the clade, total comparisons, and total distance for each node in the clade data dictionary.

check_nodes()

A method to check the validity of the various nodes in the clade data. If the number of nodes is below the min_member_count or the min_snp_count, the node flags ‘is_valid’ and ‘is_selected’ are set to False.

clade_snp_association()

A helper method to loop through the clade and find the associated snp names for each clade.

Returns:

A dictionary for the snp name and associated clade node id

Return type:

dict

clade_snp_count()

A method to aggregate the counts for the number of clade node members based on the clade id in the clade data dictionary.

Returns:

A dictionary of the counts for the clade node members

Return type:

dict

compress_heirarchy()

A method to compress the generated hierarchy based on the node bins previously determined and the genotypes. This method also removes nodes based on if the nodes on the same branch belong to the same bins. Valid nodes are those nodes which are terminal and distinct.

compression_cleanup()

A method to clean up the clade following compression by removing nodes whose distance is below the threshold default or the threshold as denoted by the user input.

create_cluster_membership_lookup()

A method that loops through the sample map dictionary and should the sample id from the map match the cluster determined in the perform_clustering() method, the node ids and sample map indexes are added to the nodes dictionary.

Returns:

A dictionary with the sample map indexes and the node ids clustered

Return type:

dict

dist_based_nomenclature()

A method to find the genotype identifiers from clustered data and format these identifiers based on the distances found and the delimiter set upon clade_worker instantiation.

distance_node_binning()

A method to add the clade nodes to the bins as per the partition distances determined in the partition_distances() method.

find_dist_troughs(values)

Uses the SciPy method scipy.signal.find_peaks to locate the troughs (reverse peaks) in the histogram data passed.

Parameters:

values (list) – The list of the histogram data values to find the troughs for

Returns:

  • troughs (numpy.array) – An array of the troughs

  • properties (dict) – a dictionary of the properties

Notes

Please refer to the external methods scipy.signal.find_peaks and numpy.array for more thorough documentation.
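
The idea of finding troughs as reverse peaks can be illustrated without SciPy: negate the values and look for peaks. The sketch below is a plain-Python stand-in that only detects strict local extrema. It is not the scipy.signal.find_peaks call used by this method, it does not handle flat plateaus, and it returns no properties dictionary. Both function names are hypothetical.

```python
def local_peaks_sketch(values):
    """Indexes of interior points strictly higher than both neighbours."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    ]


def local_troughs_sketch(values):
    """Troughs are the peaks of the negated signal."""
    return local_peaks_sketch([-v for v in values])
```

For the histogram values [5, 3, 4, 1, 6, 2, 7] the sketch reports troughs at indexes 1, 3 and 5.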

fit_clusters_to_groupings()

A method to map the clusters created to the nodes previously identified.

Returns:

A dictionary for the mapping of nodes to clusters

Return type:

dict

fix_root()

A helper method to reset the root of the clade with the snp contained in the largest conserved sequence.

generate_genotypes()

A method to generate the genotypes data dictionary. This method creates this set from the group data dictionary and extracts the tree node and genotype from the sample map contained in the group data dictionary.

Returns:

A dictionary with the genotypes consisting of the tree node identifiers and genotype identifiers

Return type:

dict

genotype_lookup(sample_list)

A helper method to retrieve the genotype from the sample id passed

Parameters:

sample_list (list) – A list of sample identifiers

Returns:

The dictionary of the genotypes corresponding to the list of sample ids

Return type:

dict

get_all_nodes()

Set the list of nodes as the valid nodes from the group data dictionary.

get_bifurcating_nodes()

A method to split nodes with more than one genotype node identifier, the first is the parent and the second is the child node. It then sets the results in the bifurcating node dictionary.

get_clade_distances()

This method loops through the clade data nodes for the average distances within the clades and places them in a list.

Returns:

A list of floats for the average distances within the clades

Return type:

list

get_close_nodes()

A helper method to find the nodes whose distances are smaller than the minimum inter-clade distance.

Returns:

A set of the invalid nodes that violate the distance constraint

Return type:

set

get_conserved_ranges()

A method to compile a list of sequence ranges for the variant positions data list.

Returns:

A list of the conserved ranges for the variant positions

Return type:

list

get_genotype_snp_rules()

Getter method to retrieve the genotype_snp_rules data dictionary.

Returns:

The genotype_snp_rules data dictionary

Return type:

dict

get_inter_clade_distances()

This helper method loops through the nodes of the clade data dictionary and creates a list of the closest clade distances for each node.

Returns:

A list of the closest clade distances

Return type:

list

get_largest_conserved_pos()

A method to loop through the conserved sequence ranges determined through the method get_conserved_ranges() and find the longest sequence and the position for the snp contained therein.

Returns:

The position for the snp within the largest conserved sequence

Return type:

int

get_node_member_counts()

A helper method to count the number of distinct genotypes in the group data sample map and set that to the node counts class variable.

get_selected_nodes()

A helper method to retrieve the nodes that correspond to the ‘is selected’ flag within the clade data dictionary.

Returns:

The nodes that correspond to the ‘is selected’ flag in the clade data

Return type:

set

get_selected_positions()

A helper method to retrieve the list of valid nodes within the clade data dictionary.

Returns:

A sorted list of integers for the positions of valid nodes

Return type:

list

get_terminal_nodes()

A helper method to find the terminal nodes in the genotypes set generated in the generate_genotypes() method.

Returns:

The set of terminal nodes from the genotypes set generated

Return type:

set

get_valid_nodes()

A helper method to loop through all the nodes to find the nodes flagged as valid in the clade data set.

Returns:

A set of all the valid nodes in the clade data

Return type:

set

get_variant_positions()

A method to retrieve the variant positions from the snp data dictionary.

Returns:

An integer list of the variant positions

Return type:

list

identify_dist_ranges()

A method to identify the troughs within the histogram data. This method uses the helper method find_dist_troughs() to accomplish this task.

Returns:

A list of tuples for the range of the troughs in the histogram data

Return type:

list

init_clade_data()

A method to initialize the clade data dictionary and all the variables contained within for each item in the node list data set.

partition_distances()

A helper method to determine the bins and partitions required for the clades based on the average distances between these clades.

Returns:

A list of the average distances (partitions) for the clade for the calculated bins

Return type:

list

perform_clustering()

A method to read the distance matrix file created, along with the distance thresholds, and perform the clustering through the SciPy hierarchy methods.

Notes

See scipy.cluster.hierarchy.linkage and scipy.cluster.hierarchy.fcluster for more thorough documentation.

populate_clade_data()

The method for populating clade data dictionary with the canonical snps from the snp data set. Uses the clade id, the position and variant base of the snp.

prune_snps()

A method to prune the nodes in the clade based on the max_snps constraint. If there are more snp entries for the clade node, then some are removed randomly.
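
Random pruning down to a cap can be sketched as below. The sketch assumes that a negative max_snps (such as the documented default of -1) means no limit. That interpretation, the seed parameter, and the function name are assumptions for illustration, and the real method operates on the clade data dictionary rather than on a bare list.

```python
import random


def prune_positions_sketch(positions, max_snps, seed=None):
    """Return at most `max_snps` positions, dropping the surplus at random.

    A negative `max_snps` is treated as "no limit" (an assumption).
    `seed` is a hypothetical parameter that makes the choice reproducible.
    """
    ordered = sorted(positions)
    if max_snps < 0 or len(ordered) <= max_snps:
        return ordered
    rng = random.Random(seed)
    return sorted(rng.sample(ordered, max_snps))
```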

remove_snps(positions)

This function removes snps from the valid positions variable in this clade data dictionary.

Parameters:

positions (list) – A list of integer positions for the snps to be removed

set_genotype_snp_rules()

This method sets the genotype and snp rules for what is a positive and partial genotype within the genotype_snp associated data dictionary, based on the minimum percentage of clade members that must be positive for a kmer to be valid.

set_genotype_snp_states()

This method sets the genotype data, genotype and base counts, the snp position, and in the genotype data for the genotype_snp_data data dictionary.

set_invalid_nodes(invalid_nodes)

A helper method to loop through all the invalid nodes in the set and set these nodes as invalid in the clade data dictionary.

Parameters:

invalid_nodes (set) – The set of invalid nodes to mark as invalid

set_valid_nodes(valid_nodes)

Set the valid nodes in the group data set.

Parameters:

valid_nodes (set) – The valid nodes to set in the group data set

snp_based_filter()

A helper method to filter the snps based on if they are valid based on the number of max states for the snp positions and the minimum member count. If there are greater than the max states or fewer than the minimum member count, the snp is marked as invalid.

summarize_snps()

A method to loop through the snps identified and determine if the snp is canonical and/or valid. It also creates a list of variant positions for downstream processing.

temporal_signal()

A method to calculate Spearman’s and Pearson’s correlation coefficients from the metadata for the year and the clade distances previously determined. If the Spearman or Pearson coefficient is larger than the pre-determined threshold, the temporal flag is set to ‘True’ in the clade data for the node.

update()

A method to call the helper methods to update the clades - get_valid_nodes(), set_valid_nodes(), generate_genotypes().

workflow()

The workflow method calls all the helper methods of this class to process the data according to both the input and the flags set by the user to determine clade memberships, validate SNPs, and compress the resulting clade unless otherwise specified by the user. Please refer to the file examples/small_test/cladeomatic/cladeomatic-clades.info.txt for more information.

constants

A module that holds a number of constants for the various functions clade-o-matic performs in the creation of schemes and data analysis.

kmers

class cladeomatic.kmers.kmer_worker(ref_sequence, msa_file, result_dir, prefix, klen, genotype_map, genotype_snp_rules, max_ambig=0, min_perc=1, target_positions=[], num_threads=1)

The kmer_worker class.

The kmer_worker class contains several class variables for use in the creation of the kmer lists required for downstream processes in the creation of the schemes and data analysis files.

__init__(ref_sequence, msa_file, result_dir, prefix, klen, genotype_map, genotype_snp_rules, max_ambig=0, min_perc=1, target_positions=[], num_threads=1)

The instantiation of the kmer_worker class.

Parameters:
  • ref_sequence (str) – The reference sequence for kmer selection

  • msa_file (str) – The file path to the fasta file with the snps substitutions (multiple sequence alignment)

  • result_dir (str) – The file path to the results directory

  • prefix (str) – The prefix for the clade-o-matic produced output files

  • klen (int) – The length of the kmers

  • genotype_map (dict) – The map of the sample identifiers and genotypes

  • genotype_snp_rules (dict) – The dictionary groupings of the positions, base variants, positive and partial genotypes for the snps, that make up the processing rules

  • max_ambig (int) – The maximum number of ambiguous bases that can be contained in a kmer sequence. Default for this class is 0.

  • min_perc (float) – The minimum percentage of clade members to be positive for a kmer to be considered valid. Default for this class is 1.

  • target_positions (list) – The integer list of the target snp positions. Default for this class is an empty list.

  • num_threads (int) – The number of threads for the Ray instance. Default for this class is 1.

Notes

Refer to https://www.ray.io for more information about the Ray instances used in this module

calc_consensus_seq()

A method to determine the consensus sequence for the fasta passed by looping through the reference sequence.

Returns:

The consensus sequence for the bases in the sequence fasta passed

Return type:

str
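
A majority-vote consensus over aligned sequences can be sketched as follows. The sketch takes the sequences as a list of equal-length strings, whereas the class reads them from the fasta file passed at instantiation. The function name and the tie-breaking behaviour (the base seen first in a column wins a tie) are assumptions, not the module's documented rules.

```python
from collections import Counter


def consensus_sequence_sketch(aligned_seqs):
    """Return the most common base at each column of an alignment.

    `aligned_seqs` is a list of equal-length strings (hypothetical input).
    On a tie, the base encountered first in that column is chosen.
    """
    if not aligned_seqs:
        return ""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned_seqs):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)
```

For the alignment ACGT, ACGA, TCGA the sketch returns ACGA.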

confirm_kmer_specificity()

A method to further filter invalid kmers from the extracted kmer dictionary. This method ensures the target positions have the valid bases and exist in the kmer scheme data dictionary. Kmers that do not adhere to these rules are flagged as invalid.

construct_ruleset()

A method to construct and populate the kmer rule set dictionary of the positive genotypes (the kmers that match or exceed the minimum percentage of clade members to be positive for a kmer to be valid) and the partial genotypes which are less than the minimum percentage of clade members.

create_biohansel_kmers()

A method to create the kmer data dictionary for use with the BioHansel scheme. This method creates the positive and negative kmers for use in the BioHansel scheme, while adhering to the same kmer rules as the cladeomatic.create.create_scheme() method.

Returns:

A dictionary for the kmers to be used by the BioHansel scheme

Return type:

dict

Notes

Refer to https://github.com/phac-nml/biohansel for more thorough BioHansel documentation

extract_kmers()

This method creates the dictionary of all the possible kmers for the snp variants found in the sequence alignments. These kmers are found through the looping of the snp positions in the sequence and ‘frame-shifting’ the kmer sequence through all possibilities for the snp, as long as there are no ambiguous bases nor gaps in the kmer sequence and the kmer is the specified length.
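
The frame-shifting described above can be sketched for a single sequence and a single snp position. The sketch uses 0-based positions and treats any character outside A, C, G and T as ambiguous or a gap. Both choices, and the function name, are assumptions for illustration only.

```python
def kmers_covering_position_sketch(seq, pos, klen, valid_bases="ACGT"):
    """Return (start, kmer) pairs for every window of length `klen` that
    contains 0-based position `pos` and holds only `valid_bases`.
    """
    kmers = []
    first_start = max(0, pos - klen + 1)     # snp is the last base of the window
    last_start = min(pos, len(seq) - klen)   # snp is the first base of the window
    for start in range(first_start, last_start + 1):
        kmer = seq[start:start + klen]
        if all(base in valid_bases for base in kmer):
            kmers.append((start, kmer))
    return kmers
```

For the sequence ACGTNACGT, position 2 and a kmer length of 3, the windows ACG and CGT are kept and GTN is rejected because it contains an ambiguous base.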

find_invalid_kmers()

A method to loop through the extracted kmer dictionary, retrieve the flagged invalid kmers and add their indexes to a set of just invalid kmers. This is a helper method used by other methods for kmer filtering.

find_variant_positions()

Method to find the sequence variants and their positions in the msa_base_counts dictionary and assign the positions of the bases to the variant_positions dictionary.

flag_invalid_kmers(invalid_kmers)

This method takes the set of invalid kmer indexes and flags the kmers as invalid in the extracted kmer dictionary.

Parameters:

invalid_kmers (set) – The set of integers for the indexes of the invalid kmers that require flagging in the extracted kmers dictionary

get_genotype_snp_states()

A method to find the valid SNPs in the genotype membership dictionary previously constructed.

get_kseq_by_index(index)

A helper method to return the kmer sequence for the kmer index passed from the extracted kmers dictionary.

Parameters:

index (int) – The index for the desired kmer sequence

Returns:

The kmer sequence if it is found in the extracted kmer dictionary, but returns an empty string if not found

Return type:

str

get_optimal_kmer_position()

A method to determine the best kmer start positions for the variant/snp positions found in the find_variant_positions() method. It sets these kmer start positions in a list for further processing.

get_pos_without_kmer()

This method compiles a dictionary of missing kmer indexes as identifiers and the snp base they flank.

Returns:

A dictionary of the missing kmers and the snp bases these kmers flank.

Return type:

dict

init_kmer_scheme_data()

A method to initialize the kmer scheme through the list of target positions. Adds blank entries to the kmer scheme dictionary.

init_msa_base_counts()

Method to initialize the msa (multiple sequence alignment) base counts dictionary msa_base_counts with the number of each base substitution for the length of the reference sequence set to zero.

This method takes in the list of all possible kmers for the sequences provided from the method extract_kmers() and writes them to a file for later processing using the cladeomatic.utils.kmerSearch.SeqSearchController() method.

pop_msa_base_counts()

Method to count or populate the msa_base_counts dictionary initialized in init_msa_base_counts() for the bases in the fasta file passed. For the sequence in the fasta, the snp position is referenced and the counter is updated for each base substitution that occurs in that position.
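
The initialize-then-populate pattern for the base counts, and the follow-on step of locating variant positions from them, can be sketched as below. The dictionary layout (position mapped to a per-base counter), the alphabet, the rule that a position is variant when more than one of A, C, G, T is observed, and both function names are assumptions made for this example.

```python
def msa_base_counts_sketch(aligned_seqs, alphabet="ACGTN-"):
    """Count each base at each 0-based alignment column.

    Every position starts with all counts at zero, then each sequence
    increments the counter for the base it carries at that position.
    """
    length = len(aligned_seqs[0]) if aligned_seqs else 0
    counts = {pos: {base: 0 for base in alphabet} for pos in range(length)}
    for seq in aligned_seqs:
        for pos, base in enumerate(seq.upper()[:length]):
            if base in counts[pos]:
                counts[pos][base] += 1
    return counts


def variant_positions_sketch(counts, bases="ACGT"):
    """Positions where more than one unambiguous base was observed."""
    return [
        pos
        for pos, column in counts.items()
        if sum(1 for base in bases if column[base] > 0) > 1
    ]
```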

populate_int_base_kmer_lookup()

A method to instantiate the base kmer lookup dictionary with a sequential index and the sequence of the kmer.

populate_kmer_scheme_data()

A method to populate the kmer scheme dictionary with the processed kmers of the scheme data and extracted kmers dictionaries. This method removes the invalid kmers, removes the kmers with invalid bases, and the kmers with invalid positions in comparison to the scheme data.

process_kmer_results()

This method processes all the kmers within the search file clade-o-matic created in the perform_kmer_search() method. It filters the kmers and finds the list of valid kmers for the sequence variants by ensuring the kmer sequence id exists in the genotype map and there are no duplicate kmers in the search file.

refine_rules()

A method to filter and refine the kmer selection rules. Missing data can cause kmers to all be partial when there isn’t an alternative kmer available for a genotype. This filters the rules to assign kmers to be positive for a genotype when there is only one kmer base state present for it.

remove_empty_base_states()

A clean up method to remove the records with empty or invalid bases from the kmer scheme data dictionary.

remove_invalid_kmers_from_scheme()

A method to remove the invalid kmers from the scheme data dictionary.

remove_redundant_kmers()

A method to remove all but the best kmer sequences from the kmer scheme data dictionary. It revisits the kmer rule sets to find the best representative kmer, then updates the kmer scheme data dictionary with the results of the filtering.

remove_scheme_pos(pos_to_remove)

A helper method to remove an entry in the kmer scheme data dictionary based on the position passed.

Parameters:

pos_to_remove (int) – The index of the position to be removed

workflow()

The workflow method to call all the methods within this class to create the kmer lists used for schema analysis and processing.

snps

cladeomatic.snps.process_snp(chrom, pos, base, all_samples, snp_members, ambig_members, group_membership, is_ref)

A method to process all the inputs to create a dictionary entry of SNP data. Please refer to the file examples/small_test/cladeomatic/cladeomatic-snps.info.txt for more information.

Parameters:
  • chrom (str) – The chromosome to which the SNP belongs

  • pos (int) – The location of the SNP in the sequence

  • base (str) – The nucleotide base of the SNP

  • all_samples (set) – The set of identifiers for the samples

  • snp_members (set) – The identifiers for the SNPs

  • ambig_members (set) – The collections of ambiguously called nucleotides

  • group_membership (dict) – The dictionary for the node ids and their clade/group memberships

  • is_ref (bool) – True if this is the reference sequence nucleotide

Returns:

A dictionary of the snp and associated information for further processing

Return type:

dict

cladeomatic.snps.snp_search_controller(group_data, vcf_file, n_threads=1)

A method to search for the lists of snps in the VCF file and group data to match them to create a complete snp dictionary. Please refer to the file examples/small_test/cladeomatic/cladeomatic-snps.info.txt for more information.

Parameters:
  • group_data (dict) – A dictionary of all the SNPs and their clade/group membership

  • vcf_file (str) – The file path to the user VCF file containing the variants of interest

  • n_threads (int) – The number of threads to use in the Ray method call

Returns:

The data dictionary for the snps as identified from the input and grouped by chromosome, position and base

Return type:

dict

Notes

Refer to https://www.ray.io for more information about the Ray instances used in this module.

visualize

writers

cladeomatic.writers.print_params(params, outfile)

A method to create the parameter file. This is the list of all the command line parameters and their user input values if supplied. Please refer to the sample file examples/small_test/cladeomatic/cladeomatic-params.log for more information.

Parameters:
  • params (argparse.Namespace object) – The Namespace object that holds the parameters and their values

  • outfile (str) – The file path for the parameters output file

cladeomatic.writers.write_genotypes(genotypes, outfile, header='sample_id\tgenotype\n')

Accepts a list of sample genotype hierarchies and writes them to a file. Please refer to the sample files examples/small_test/cladeomatic/cladeomatic-genotypes.distance.txt and examples/small_test/cladeomatic/cladeomatic-genotypes.selected.txt for more information.

Parameters:
  • genotypes (dict) – The dictionary of genotypes for writing

  • outfile (str) – The output file path
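A minimal sketch of this kind of writer is shown below, assuming the genotypes are held in a dictionary keyed by sample id and that each genotype is a dot-delimited hierarchy string. The function name write_genotypes_sketch, the sample ids, and the genotype strings are hypothetical; only the default header is taken from the signature above.

```python
import os
import tempfile


def write_genotypes_sketch(genotypes, outfile, header="sample_id\tgenotype\n"):
    """Write the header, then one tab-separated sample/genotype line per entry."""
    with open(outfile, "w") as fh:
        fh.write(header)
        for sample_id, genotype in genotypes.items():
            fh.write(f"{sample_id}\t{genotype}\n")


# Write to a temporary directory and read the result back.
out_path = os.path.join(tempfile.mkdtemp(), "genotypes.txt")
write_genotypes_sketch({"sampleA": "1.2", "sampleB": "1.3.1"}, out_path)
with open(out_path) as fh:
    written = fh.read()
print(written)
```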

cladeomatic.writers.write_kmers(kmers, out_file)

A method to create and write a kmer file indicating the index of the kmer, the kmer sequence, the snp position and base value, and the alignment start and end.

Parameters:
  • kmers (dict) – The dictionary of kmers to be written

  • out_file (str) – The path to the output file

cladeomatic.writers.write_node_report(clade_data, outfile)

Writes the clades data produced by the cladeomatic.clades.clade_worker class to a TSV file. Please refer to the sample file examples/small_test/cladeomatic/cladeomatic-clades.info.txt for more information.

Parameters:
  • clade_data (dict) – The clade data for writing to the file

  • outfile (str) – The output file path

cladeomatic.writers.write_scheme(header, scheme, outfile)

A method to write either a kmer or snp scheme to a file. Please refer to the sample files examples/small_test/cladeomatic/cladeomatic-kmer.scheme.txt and examples/small_test/cladeomatic/cladeomatic-snp.scheme.txt for more information.

Parameters:
  • header (str) – The string representing the header text

  • scheme (dict) – A dictionary of the kmer or snp scheme data to write out

  • outfile (str) – The output file path

cladeomatic.writers.write_snp_report(snp_data, outfile)

Accepts snp_data data dictionary structure and writes the snp details to a file. Please refer to the sample file examples/small_test/cladeomatic/cladeomatic-snps.info.txt for more information.

Parameters:
  • snp_data (dict) – A dictionary for the snp data in the format {[chrom_id]: {[position]: {[base]: dict()}}}

  • outfile (str) – The file output path
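To make the nested snp_data layout concrete, the sketch below walks a small dictionary of that shape and flattens it into rows. The helper name flatten_snp_data and the is_ref key inside the innermost dictionaries are assumptions for illustration; the real per-base dictionaries may hold different fields.

```python
def flatten_snp_data(snp_data):
    """Yield (chrom, pos, base, info) tuples from the nested dictionary."""
    for chrom, positions in snp_data.items():
        for pos in sorted(positions):
            for base, info in positions[pos].items():
                yield chrom, pos, base, info


# Toy data in the {chrom_id: {position: {base: dict()}}} layout.
snp_data = {
    "chr1": {
        10: {"A": {"is_ref": True}, "T": {"is_ref": False}},
        5: {"G": {"is_ref": True}},
    }
}
rows = list(flatten_snp_data(snp_data))
for row in rows:
    print(row)
```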

utils.__init__

cladeomatic.utils.__init__.calc_AMI(category_1, category_2)

Calculates the adjusted mutual info score between two clusterings.

Parameters:
  • category_1 (list) – A list of int values for cluster 1

  • category_2 (list) – A list of int values for cluster 2

Returns:

The value of the AMI score for the two clusters

Return type:

float

cladeomatic.utils.__init__.calc_ARI(category_1, category_2)

Calculates the adjusted rand score between two clusterings.

Parameters:
  • category_1 (list) – A list of int values for cluster 1

  • category_2 (list) – A list of int values for cluster 2

Returns:

The value of the ARI score for the two clusters

Return type:

float

cladeomatic.utils.__init__.calc_shanon_entropy(value_list)

This method calculates the Shannon entropy value for the list of numbers passed.

Parameters:

value_list (list) – The list of values to use for the calculation

Returns:

The calculated Shannon entropy or -1 if there are no values in the passed list

Return type:

float
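For reference, one common way to compute Shannon entropy with the documented -1 fallback is sketched below. It assumes the list items are treated as category labels and that the logarithm is base 2; the actual function may weight or normalize the values differently, and the name shannon_entropy_sketch is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy_sketch(value_list):
    """Shannon entropy (in bits) of the value frequencies, or -1 for an empty list."""
    total = len(value_list)
    if total == 0:
        return -1
    counts = Counter(value_list)
    # H = -sum(p * log2(p)) over the observed frequency p of each distinct value.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(shannon_entropy_sketch([1, 1, 2, 2]))
```

Two equally frequent values give an entropy of one bit, while a list holding a single repeated value gives zero.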

cladeomatic.utils.__init__.init_console_logger(lvl)

This method initializes the logger for the console.

Parameters:

lvl (int) – The desired logging level: 0, 1, 2, or 3

Returns:

The logging object

Return type:

logging

cladeomatic.utils.__init__.parse_metadata(file, column='sample_id')

Parses the metadata file into a dictionary with sample ids as the keys. Will dynamically add the other columns and values to the dictionary.

Parameters:

file (str) – The path to the tsv metadata file with ‘sample_id’ as a mandatory column

Returns:

A dictionary of the parsed metadata values organized with the sample id as a key

Return type:

dict
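The shape of the returned dictionary can be illustrated with a short sketch built on the standard csv module. It reads from a file handle rather than a path so that it runs without any files on disk, and it drops the id column from the inner dictionaries; both choices, like the name parse_metadata_sketch and the example columns, are assumptions rather than the function's documented behaviour.

```python
import csv
import io


def parse_metadata_sketch(handle, column="sample_id"):
    """Map each value of `column` to a dict of the remaining columns."""
    metadata = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        sample_id = row.pop(column)
        metadata[sample_id] = row
    return metadata


# Toy TSV content; the country and year columns are made up for the example.
tsv = "sample_id\tcountry\tyear\nS1\tCanada\t2020\nS2\tPeru\t2019\n"
metadata = parse_metadata_sketch(io.StringIO(tsv))
print(metadata)
```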

cladeomatic.utils.__init__.run_command(command)

This method runs the passed command on the shell command line.

Parameters:

command (str) – The command for the command line

Returns:

stdout, stderr: the standard out and error messages returned by the command line

Return type:

bytes, bytes
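A typical way to obtain a (stdout, stderr) pair of bytes from a shell command is with subprocess.Popen, as in the sketch below. This is an assumed implementation pattern rather than the function's confirmed source, and run_command_sketch is a hypothetical name.

```python
import subprocess


def run_command_sketch(command):
    """Run `command` through the shell and return (stdout, stderr) as bytes."""
    proc = subprocess.Popen(
        command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    stdout, stderr = proc.communicate()
    return stdout, stderr


stdout, stderr = run_command_sketch("echo hello")
print(stdout, stderr)
```

Because the command string is handed to the shell, it should only ever be built from trusted input.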

utils.jellyfish

utils.kmerSearch

cladeomatic.utils.kmerSearch.SeqSearchController(seqKmers, fasta_file, out_dir, prefix, n_threads=1)

This method takes in the list of all possible kmers for the sequences provided and writes them to a temporary processing file for downstream searches. This method employs the use of Ray for processing.

Parameters:
  • seqKmers (dict) – A dictionary with the index as keys and kmer sequences as values

  • fasta_file (str) – The file path to the fasta file containing the sequence

  • out_dir (str) – The filepath to the temporary output file for the kmers

  • prefix (str) – The prefix for the output file name

  • n_threads (int) – The number of threads to be used in the Ray process, default is 1

Returns:

A list of strings for the paths to the temporary processing files

Return type:

list

Notes

Refer to https://www.ray.io for more information about the Ray instances used in this module.

cladeomatic.utils.kmerSearch.expand_degenerate_bases(seq)

List all possible kmers for a scheme given a degenerate base.

Parameters:

seq (str) – The string for the scheme_kmers from SNV scheme fasta file

Returns:

List of all possible kmers given a degenerate base or not

Return type:

list
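Expanding a degenerate sequence amounts to taking the Cartesian product of the bases allowed at each position. The sketch below shows this with the standard IUPAC nucleotide codes; the function name expand_degenerate_sketch is hypothetical and the real method may handle unknown characters or ordering differently.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate_sketch(seq):
    """Return every concrete sequence encoded by a degenerate sequence."""
    choices = [IUPAC[base] for base in seq.upper()]
    return ["".join(combo) for combo in product(*choices)]


print(expand_degenerate_sketch("ARC"))
```

A sequence without degenerate bases expands to a single-item list, while each N multiplies the number of results by four, so long runs of N grow quickly.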

cladeomatic.utils.kmerSearch.init_automaton_dict(seqs)

Initialize an Aho-Corasick Automaton with the kmers found in the passed sequence dictionary. The Automaton is loaded with each kmer and its reverse complement.

Parameters:

seqs (dict) – The dictionary of kmer sequences and their indexes

Returns:

Aho-Corasick Automaton with kmers loaded

Notes

Please refer to the pyahocorasick project, specifically the method add_word for this method.

cladeomatic.utils.kmerSearch.revcomp(s)

This method creates the reverse complement nucleotide sequence for the sequence passed using the str.translate method.

Parameters:

s (str) – The nucleotide sequence to find the reverse complement for

Returns:

The reverse complement of the passed nucleotide sequence

Return type:

str
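The str.translate approach mentioned above can be sketched in a few lines. The translation table here covers only the four unambiguous bases in both cases, which is an assumption; the actual method may also map ambiguity codes. The name revcomp_sketch is hypothetical.

```python
# Translation table mapping each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp_sketch(s):
    """Complement every base with str.translate, then reverse the string."""
    return s.translate(COMPLEMENT)[::-1]


print(revcomp_sketch("AACG"))
```

Applying the function twice returns the original sequence, which is a convenient sanity check.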

utils.phylo_tree

utils.seqdata

cladeomatic.utils.seqdata.calc_homopolymers(seq)

The method calculates the longest homopolymer (the sequence of consecutive identical bases) in the sequence or sequence fragment passed.

Parameters:

seq (str) – The sequence or sequence fragment in which to find the longest homopolymer

Returns:

The longest homopolymer length

Return type:

int
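One compact way to find the longest run of identical bases is itertools.groupby, shown in the sketch below. This illustrates the idea only; the function name longest_homopolymer_sketch and the choice to return 0 for an empty sequence are assumptions.

```python
from itertools import groupby


def longest_homopolymer_sketch(seq):
    """Return the length of the longest run of identical consecutive characters."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)


print(longest_homopolymer_sketch("AACCCGT"))
```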

cladeomatic.utils.seqdata.calc_md5(string)

Method to encode the MD5 hash for the input string.

Parameters:

string (str) – The string to compute the MD5 hash for

Returns:

The md5 hash generated

Return type:

hash
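With the standard library, an MD5 digest of a string is usually computed through hashlib, as sketched below. Whether the actual method returns the hash object or its hexadecimal digest is not stated above; this sketch assumes the hexadecimal string and UTF-8 encoding, and md5_sketch is a hypothetical name.

```python
import hashlib


def md5_sketch(string):
    """Return the hexadecimal MD5 digest of a text string."""
    return hashlib.md5(string.encode("utf-8")).hexdigest()


print(md5_sketch("ACGT"))
```

Hashing sequences this way gives a short, fixed-length key that can be used to detect identical sequences without comparing them in full.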

cladeomatic.utils.seqdata.create_aln_pos_from_unalign_pos_lookup(aln_seq)

This method creates a list of integers for the positions of the bases in the unaligned sequence derived from the aligned sequence passed to the method.

Parameters:

aln_seq (str) – The alignment sequence to process

Returns:

A list of integers for the positions found

Return type:

list
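One plausible reading of this lookup is a list whose i-th entry is the alignment column of the i-th unaligned (non-gap) base. The sketch below implements that reading with 0-based positions and "-" as the gap character; both are assumptions, as is the name aln_pos_lookup_sketch, and the actual method may index differently.

```python
def aln_pos_lookup_sketch(aln_seq, gap="-"):
    """Return, for each unaligned base, its index in the aligned sequence."""
    return [i for i, base in enumerate(aln_seq) if base != gap]


# The four bases A, C, G, T sit in alignment columns 0, 3, 4 and 6.
print(aln_pos_lookup_sketch("A--CG-T"))
```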

cladeomatic.utils.seqdata.create_pseudoseqs_from_vcf(ref_id, ref_seq, vcf_file, outfile)

This method creates the pseudo full sequences of the variants found in the VCF file.

Parameters:
  • ref_id (str) – The reference sequence identifier

  • ref_seq (str) – The reference sequence to alter

  • vcf_file (str) – The path to the VCF file

  • outfile (str) – The path to the pseudo variant outfile

cladeomatic.utils.seqdata.gb_to_fasta_dict(gbk_file)

Reads a GenBank formatted sequence file and creates a dictionary of sequences with the sequence id as keys.

Parameters:

gbk_file (str) – The path to a GenBank formatted sequence file

Returns:

A dictionary of sequences indexed by sequence id

Return type:

dict

cladeomatic.utils.seqdata.generate_non_gap_position_lookup(seq)

Creates a list of positions which correspond to the position of that base in a gapless sequence.

Parameters:

seq (str) – The sequence to process for gaps

Returns:

An int list of the positions of the gaps

Return type:

list

cladeomatic.utils.seqdata.get_variants(vcf_file)

This method reads the incoming VCF file and returns a dictionary of the sample variant bases and their locations.

Parameters:

vcf_file (str) – The file path to the VCF file for variant discovery

Returns:

A dictionary of the sample variants with the node id as key

Return type:

dict

cladeomatic.utils.seqdata.parse_reference_gbk(gbk_file)

Method to parse the GenBank reference file, clean the strings, and return the reference features of interest.

Parameters:

gbk_file (str) – The file path to the reference genbank format file with sequence annotations

Returns:

A dictionary of all the reference features

Return type:

dict

cladeomatic.utils.seqdata.read_fasta_dict(fasta_file)

Reads the fasta file from the passed file path and formats the input to a dictionary of sequences.

Parameters:

fasta_file (str) – The path to the fasta file to read

Returns:

A dictionary of sequences indexed by sequence id

Return type:

dict
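The resulting dictionary layout can be illustrated with a small pure-Python parser. The sketch reads from a file handle instead of a path so that it is self-contained, and it takes the first whitespace-delimited token of each header line as the sequence id; both are assumptions, and read_fasta_sketch is a hypothetical name rather than the library's implementation.

```python
import io


def read_fasta_sketch(handle):
    """Parse FASTA text into {sequence id: sequence}."""
    seqs = {}
    seq_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the id is the first whitespace-delimited token.
            seq_id = line[1:].split()[0]
            seqs[seq_id] = []
        elif seq_id is not None:
            seqs[seq_id].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {name: "".join(parts) for name, parts in seqs.items()}


fasta = io.StringIO(">ref description\nACGT\nTTGA\n>s1\nACGA\n")
sequences = read_fasta_sketch(fasta)
print(sequences)
```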

cladeomatic.utils.seqdata.revcomp(s)

Helper method to create the reverse complement nucleotide sequence for the sequence passed.

Parameters:

s (str) – The nucleotide sequence to parse

Returns:

The reverse complement of the passed nucleotide sequence

Return type:

str

utils.snpdists

cladeomatic.utils.snpdists.run_snpdists(aln_file, out_file, num_threads=1)

Runs the command line snp-dists application which calculates the SNP distances from a fasta file.

Parameters:
  • aln_file (str) – The path to the fasta sequence file

  • out_file (str) – The path to the output file

  • num_threads (int) – The number of processing threads to use. Default is 1.

Returns:

Stderr - the console error messages

Return type:

tuple

Notes

Please refer to https://github.com/tseemann/snp-dists for more documentation

utils.vcfhelper

class cladeomatic.utils.vcfhelper.vcfReader(file)

The vcfReader class is meant to be a helper class to read and process VCF files.

check_file(file)

A helper method to verify that a file exists and is not empty.

Parameters:

file (str) – The path to the file

Returns:

True if the file both exists and is not empty

Return type:

bool

next_row()

A method to return the next row of the VCF file.

Returns:

The next row of the file

Return type:

vcfReader object

process_row()

A method to process the output of a row for the VCF file.

Returns:

A dictionary of the VCF file data with the headers as keys for the sample data. Returns None if there is no next row in the file.

Return type:

dict
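The row-by-row access pattern described for this class can be sketched as follows. This is a simplified stand-in, not the vcfReader source: it reads from a file handle, skips the meta lines, uses the #CHROM line as the column names, and leaves out the file check. The class name VcfReaderSketch and the toy VCF content are assumptions.

```python
import io


class VcfReaderSketch:
    """Iterate over the data rows of VCF text, keyed by the #CHROM header line."""

    def __init__(self, handle):
        self.handle = handle
        self.header = []
        for line in self.handle:
            # Skip '##' meta lines; the '#CHROM' line names the columns.
            if line.startswith("#CHROM"):
                self.header = line.lstrip("#").rstrip("\n").split("\t")
                break

    def process_row(self):
        """Return the next row as a dict, or None when the file is exhausted."""
        line = next(self.handle, None)
        if line is None or not line.strip():
            return None
        return dict(zip(self.header, line.rstrip("\n").split("\t")))


vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n"
    "1\t10\t.\tA\tT\t.\t.\t.\tGT\t1\n"
)
reader = VcfReaderSketch(io.StringIO(vcf_text))
row = reader.process_row()
end = reader.process_row()
print(row["POS"], row["REF"], row["ALT"], row["S1"], end)
```

A caller would typically loop on process_row() until it returns None, handling one variant at a time without loading the whole file into memory.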

utils.visualization

cladeomatic.utils.visualization.create_dist_histo(data, outfile)

A method to create a histogram from the data dictionary passed. This method also writes the histogram to a file.

Parameters:
  • data (dict) – The data to be plotted

  • outfile (str) – The output figure file path

cladeomatic.utils.visualization.plot_bar(x, y)

A method to plot a bar chart with the x and y objects passed.

Parameters:
  • x (obj) – The list or a dictionary for the x-axis objects

  • y (obj) – The list or a dictionary for the y-axis objects

Returns:

The plotly figure produced

Return type:

Figure

Notes

Please refer to https://plotly.com/python/ for more documentation