API
Welcome to the API documentation. Below is the documentation for the code that makes up Clade-o-matic.
main
- cladeomatic.main.print_usage_and_exit()
This method prints brief usage instructions for Clade-o-matic to the command line.
create
genotype
- cladeomatic.genotype.convert_features_to_mutations(feature_lookup, snp_profile)
This method converts the mutations or features in the scheme to the snp base mutation and sets this on a data dictionary with the snp state.
- Parameters:
feature_lookup (dict) – The mutations or features in the scheme data dictionary
snp_profile (dict) – The snp profile for a given genotype
- Returns:
A dictionary of the snp ids and if they belong to the alt or ref for the genotype
- Return type:
dict
- cladeomatic.genotype.get_snp_profiles(valid_positions, vcf_file)
This method retrieves both the SNP and sample profiles from the VCF file, ensures the SNPs are in valid positions and adds the sample SNP profile to the data dictionary.
- Parameters:
valid_positions (list) – The list of integers indicating the valid SNP positions for this sample set
vcf_file (str) – The file path to vcf or tsv snp data files
- Returns:
The dictionary of the snp data
- Return type:
dict
- cladeomatic.genotype.is_valid_file(filename)
A helper method to determine if the file path leads to a valid file. For the file to be considered valid it must exist and have the minimum file size.
- Parameters:
filename (str) – The path to the file requiring validation
- Returns:
True if this is a valid file, False if not
- Return type:
bool
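The check described above can be sketched as follows. This is only an illustration: the name `is_valid_file_sketch` and the `MIN_FILE_SIZE` value are assumptions, since the actual minimum size used by Clade-o-matic is not stated here.

```python
import os
import tempfile

# Hypothetical threshold; the real minimum size is defined by Clade-o-matic.
MIN_FILE_SIZE = 32


def is_valid_file_sketch(filename, min_size=MIN_FILE_SIZE):
    """Return True if `filename` is an existing file of at least `min_size` bytes."""
    return os.path.isfile(filename) and os.path.getsize(filename) >= min_size


# Demonstration with a throwaway file of 64 bytes.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scheme.txt")
    with open(path, "w") as handle:
        handle.write("x" * 64)
    big_enough = is_valid_file_sketch(path)
    too_small = is_valid_file_sketch(path, min_size=128)
    missing = is_valid_file_sketch(os.path.join(tmp, "absent.txt"))
```

Checking existence before size avoids an `OSError` from `os.path.getsize` on a missing path.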
- cladeomatic.genotype.parse_args()
Argument Parsing method.
A function to parse the command line arguments passed at initialization of Clade-o-matic, format these arguments, and return help prompts to the user shell when specified.
- Returns:
The arguments and their user specifications, the usage help prompts and the correct formatting for the incoming argument (str, int, etc.)
- Return type:
ArgumentParser object
- cladeomatic.genotype.parse_scheme_features(scheme_file)
This method parses the passed snp scheme file and creates a scheme and feature data set for downstream processing. The scheme data dictionary holds, for each snp, the position, the base and whether the genotype is positive, partial and/or allowed. The feature data dictionary contains the number of positions, the number of mutations, the total features and the mutation data for searching. It also includes the features of each genotype: the mutation lookup id, whether the snp is in the ref or alt state, the base and the base position.
- Parameters:
scheme_file (str) – The file path to the snp scheme file to read in
- Returns:
The constructed features data dictionary, with the scheme data dictionary inside
- Return type:
dict
- cladeomatic.genotype.parse_scheme_genotypes(scheme_file)
This method parses the snp scheme file to construct a scheme data dictionary for further downstream processing.
- Parameters:
scheme_file (str) – The file path to the snp scheme file for parsing
- Returns:
The scheme data dictionary
- Return type:
dict
- cladeomatic.genotype.run()
The main method, which reads the command line arguments and creates the genotype call file. This method reads the scheme, variant, sample and genotype metadata files, and constructs various processing data dictionaries to ultimately call the genotypes and provide a measure of quality control of those calls for the samples passed.
Notes
Refer to https://www.ray.io for more information about the Ray instances used in this module.
- cladeomatic.genotype.write_genotype_calls(header, scheme_name, outfile, genotype_results, sample_metadata, genotype_meta, scheme_data, sample_variants, min_positions=1)
This method writes the genotype calls determined through previous methods to a file. Please refer to the file examples/small_test/cladeomatic/cladeomatic-genotype.calls.txt for more information.
- Parameters:
header (str) – The header for the output file
scheme_name (str) – The name for the scheme
outfile (str) – The output file path
genotype_results (dict) – The dictionary of the genotype calls from call_genotypes()
sample_metadata (dict) – The sample metadata data dictionary
genotype_meta (dict) – The genotype metadata data dictionary
scheme_data (dict) – The snp scheme data dictionary
sample_variants (dict) – The sample variants data dictionary
min_positions (int) – The minimum number of . Default of 1.
benchmark
- cladeomatic.benchmark.convert_genotype_report(data_dict, predicted_field_name, submitted_field_name)
A method to take in the genotype call data dictionary, the predicted and submitted field names and converts these to a data dictionary consisting of the sample ids for the genotypes, the predicted genotype, and submitted genotype ids.
- Parameters:
data_dict (dict) – The genotype call dictionary to parse
predicted_field_name (str) – The name of the predicted genotype field within the genotype call file
submitted_field_name (str) – The name of the submitted genotype field within the genotype call file
- Returns:
A dictionary of sample_ids, the predicted genotype calls and the submitted genotype calls
- Return type:
dict
- cladeomatic.benchmark.filter_genotypes_exclude(labels, name)
This method takes the dictionary of sample_ids, the predicted genotype calls and the submitted genotype calls, and filters the dictionary for the genotype name. If the genotype name exists in either the predicted or the submitted genotype call the sample is excluded from addition to the data dictionary.
- Parameters:
labels (dict) – A dictionary of sample_ids, the predicted genotype calls and the submitted genotype calls
name (str) – The name of the genotype to filter by
- Returns:
The genotype name-filtered dictionary of sample_ids, the predicted genotype calls and the submitted genotype calls
- Return type:
dict
- cladeomatic.benchmark.filter_genotypes_include(labels, name)
This method takes the dictionary of sample_ids, the predicted genotype calls and the submitted genotype calls, and filters the dictionary for the genotype name. If the genotype name exists in either the predicted or the submitted genotype call the sample will be added to the dictionary and the inclusive dictionary is returned.
- Parameters:
labels (dict) – A dictionary of sample_ids, the predicted genotype calls and the submitted genotype calls
name (str) – The name of the genotype to filter by
- Returns:
The genotype name-filtered dictionary of sample_ids, the predicted genotype calls and the submitted genotype calls
- Return type:
dict
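The include and exclude filters above differ only in which samples they keep, which a single sketch can show. The layout of `labels`, the key names `predicted` and `submitted`, and the use of exact string equality are assumptions made for illustration, not the library's confirmed behaviour.

```python
def filter_labels(labels, name, include=True):
    """Keep (include=True) or drop (include=False) samples whose predicted
    or submitted genotype equals `name`."""
    filtered = {}
    for sample_id, calls in labels.items():
        matches = name in (calls["predicted"], calls["submitted"])
        if matches == include:
            filtered[sample_id] = calls
    return filtered


# Hypothetical layout: sample id -> predicted and submitted genotype calls.
labels = {
    "S1": {"predicted": "1.2", "submitted": "1.2"},
    "S2": {"predicted": "1.3", "submitted": "1.2"},
    "S3": {"predicted": "2.1", "submitted": "2.1"},
}
included = filter_labels(labels, "1.2", include=True)
excluded = filter_labels(labels, "1.2", include=False)
```

Here `included` holds S1 and S2, because either call matching is enough, while `excluded` holds only S3.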
- cladeomatic.benchmark.get_genotype_counts(data_dict, field_name='genotype')
This method counts the number of unique genotypes that appear in the data dictionary passed.
- Parameters:
data_dict (dict) – The data dictionary that contains genotypes for counting
field_name (str) – The name of the field within the data dictionary for which to parse and count genotypes for. Default is ‘genotype’.
- Returns:
The dictionary of genotypes and their counts
- Return type:
dict
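A minimal sketch of this kind of tally, assuming `data_dict` maps sample ids to records that each carry the genotype under `field_name`. That layout and the name `count_genotypes` are assumptions for illustration.

```python
from collections import Counter


def count_genotypes(data_dict, field_name="genotype"):
    """Count how often each genotype value appears under `field_name`."""
    return dict(Counter(record[field_name] for record in data_dict.values()))


# Hypothetical layout: sample id -> record holding a genotype field.
calls = {
    "S1": {"genotype": "1.2"},
    "S2": {"genotype": "1.2"},
    "S3": {"genotype": "2.1"},
}
counts = count_genotypes(calls)
```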
- cladeomatic.benchmark.get_problem_bases(profile, rule)
This method identifies which snps exist in the variants data dictionary but not in the genotype rules or schema dictionary for a specific sample id.
- Parameters:
profile (dict) – The variants data dictionary for a specific sample id
rule (dict) – The scheme data dictionary for a specific sample id
- Returns:
A data dictionary of the snp positions and bases that are not found in both the variants and scheme dictionaries for the sample passed
- Return type:
dict
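One way to picture this comparison is shown below. It assumes `profile` maps positions to the observed base and `rule` maps positions to the set of bases the genotype allows. Both layouts, the function name, and the choice to ignore positions absent from the rule are hypothetical.

```python
def problem_bases(profile, rule):
    """Return {position: base} for observed bases the rule does not allow."""
    problems = {}
    for pos, base in profile.items():
        allowed = rule.get(pos)
        # Positions absent from the rule are ignored in this sketch.
        if allowed is not None and base not in allowed:
            problems[pos] = base
    return problems


# Hypothetical layouts: observed base per position, allowed bases per position.
profile = {10: "A", 25: "T", 40: "G"}
rule = {10: {"A"}, 25: {"C"}}
conflicts = problem_bases(profile, rule)
```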
- cladeomatic.benchmark.parse_args()
Argument Parsing method.
A function to parse the command line arguments passed at initialization of Clade-o-matic, format these arguments, and return help prompts to the user shell when specified.
- Returns:
The arguments and their user specifications, the usage help prompts and the correct formatting for the incoming argument (str, int, etc.)
- Return type:
ArgumentParser object
- cladeomatic.benchmark.parse_genotype_report(header, file)
A helper method to read in the genotype call file for downstream use.
- Parameters:
header (list) – The list of strings for the genotype call report header
file (str) – The path to the genotype call report file to parse
- Returns:
A dictionary of the parsed genotype call report file
- Return type:
dict
- cladeomatic.benchmark.run()
The main method, which reads the command line arguments and creates a file for the F1 scores and, if required, a file that records per sample any sites that are responsible for the genotype not being called. This method reads the genotype call file, the variant call file and the kmer inclusive scheme file to verify the genotypes and determine if the submitted and predicted genotypes are a match. The results are written to the output files examples/small_test/cladeomatic/benchmark/cladeomatic-scheme.scores.txt and examples/small_test/cladeomatic/benchmark/cladeomatic-sample.results.txt.
- cladeomatic.benchmark.write_scheme_scores(header, file, scheme_name, scores)
This method writes the scheme scores determined to a file. Please refer to the file examples/small_test/cladeomatic/benchmark/cladeomatic-scheme.scores.txt for more information.
- Parameters:
header (str) – The header string for the output file
file (str) – The path to the output file
scheme_name (str) – The name of the scheme
scores (dict) – The data dictionary of the scores to be written out
- cladeomatic.benchmark.write_updated_genotype_report(header, file, scheme_name, variants, data_dict, genotype_rules, predicted_field_name, submitted_field_name)
This method writes the genotype sample results file with the sample_id, scheme name, submitted genotype, predicted genotype, whether the two genotypes match and, if they do not, the problem snp positions and bases (features). Please refer to the file examples/small_test/cladeomatic/benchmark/cladeomatic-sample.results.txt for more information.
- Parameters:
header (str) – The header string for the output file
file (str) – The file path for the output file
scheme_name (str) – The name for the scheme
variants (dict) – The data dictionary for the variant call file
data_dict (dict) – The data dictionary for the called genotypes
genotype_rules (dict) – The data dictionary for the genotype scheme/rules
predicted_field_name (str) – The name for the predicted genotype field name
submitted_field_name (str) – The name for the submitted genotype field name
namer
- cladeomatic.namer.parse_args()
Argument Parsing method.
A function to parse the command line arguments passed at initialization of Clade-o-matic, format these arguments, and return help prompts to the user shell when specified.
- Returns:
The arguments and their user specifications, the usage help prompts and the correct formatting for the incoming argument (str, int, etc.)
- Return type:
ArgumentParser object
- cladeomatic.namer.rename(lookup, queries)
This method takes in the naming data dictionary and the positive or partial genotypes list to create a new list of names for the genotypes if the names are the same in both the naming dictionary and the positive or partial genotypes list.
- Parameters:
lookup (dict) – The naming data dictionary
queries (list) – The list of positive genotypes
- Returns:
The list of names that are the same between the naming dictionary and the genotypes
- Return type:
list
- cladeomatic.namer.run()
This method takes in the data dictionaries for the scheme and names and renames the genotypes accordingly.
clades
- class cladeomatic.clades.clade_worker(vcf, metadata_dict, dist_mat_file, groups, ref_seq, mode, perform_compression=True, delim='.', min_snp_count=1, max_snps=-1, max_states=6, min_members=1, min_inter_clade_dist=1, num_threads=1, max_snp_resolution_thresh=0, method='average', rcor_thresh=0.4, min_perc=1)
The clade worker class provides methods to take the user provided input and cluster the data, create and compress clades, and further summarize the SNP variants within the samples.
- __init__(vcf, metadata_dict, dist_mat_file, groups, ref_seq, mode, perform_compression=True, delim='.', min_snp_count=1, max_snps=-1, max_states=6, min_members=1, min_inter_clade_dist=1, num_threads=1, max_snp_resolution_thresh=0, method='average', rcor_thresh=0.4, min_perc=1)
The instantiation of the clade_worker class.
- Parameters:
vcf (str) – The file path to the VCF file for clade processing
metadata_dict (dict) – The dictionary of the metadata for the samples
dist_mat_file (str) – The file path to the distance matrix file created by clade-o-matic
groups (dict) – The dictionary containing the grouping of the data
ref_seq (str) – The reference sequence
mode (str) – A string to denote either ‘tree’ or ‘group’ mode for processing
perform_compression (bool) – A boolean to denote if tree compression should occur. Default is True.
delim (str) – A string to denote the file/sample delimiter. Default is ‘.’
min_snp_count (int) – The minimum number of unique snps required for a clade to be valid. Default is 1.
max_snps (int) – The maximum number of snps required for the definition of a unique genotype. Default is -1.
max_states (int) – The maximum number of states for a position [A,T,C,G,N,-]. Default is 6.
min_members (int) – The minimum number of members required for a clade to be considered valid. Default is 1.
min_inter_clade_dist (int) – The minimum inter-clade distance. Default is 1.
num_threads (int) – The number of threads desired for Ray processing. Default is 1.
max_snp_resolution_thresh (int) – The maximum snp resolution for the clade. The maximum number of members a clade is required to have to be considered valid. Default is 0.
method (str) – Method of clustering according to scipy.cluster.hierarchy.linkage. Default is ‘average’.
rcor_thresh (float) – The correlation coefficient threshold. Default is 0.4.
min_perc (float) – Minimum percentage of clade members to be positive for a kmer to be valid. Default is 1.
- as_range(values)
A helper method to determine the trough ranges through the histogram data passed.
- Parameters:
values (list) – A list of integers to find the ranges for
- Returns:
A list of tuples with the start and end index of each range
- Return type:
list
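Collapsing a list of integers into ranges can be sketched as below. The sketch assumes the input is sorted and that each tuple reports the first and last value of a run of consecutive integers. The name `consecutive_ranges` is hypothetical and this is not necessarily how `as_range` is implemented.

```python
from itertools import groupby


def consecutive_ranges(values):
    """Collapse sorted integers into (start, end) tuples of consecutive runs."""
    ranges = []
    # Consecutive integers share the same value-minus-index offset.
    for _, run in groupby(enumerate(values), key=lambda pair: pair[1] - pair[0]):
        members = [value for _, value in run]
        ranges.append((members[0], members[-1]))
    return ranges


runs = consecutive_ranges([3, 4, 5, 9, 10, 14])
```

For this input the runs are (3, 5), (9, 10) and (14, 14). An isolated value becomes a range whose start and end are equal.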
- calc_metadata_counts(sample_set)
A helper method to count the metadata values in the sample set passed. This method loops through the field identifiers of the metadata and counts the total values for each field value.
Example: the location field has the values Europe and North America. This method will count the number of samples that correspond to Europe and North America (Europe 10, North America 15).
- Parameters:
sample_set (set) – A set for the samples to tabulate metadata counts
- Returns:
A dictionary for the metadata field id, value and the value counts
- Return type:
dict
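The tabulation in the example above can be sketched like this, assuming the metadata is a dictionary of sample id to field/value pairs. In the class the metadata comes from `metadata_dict`. Passing it as an argument, and the function name, are choices made only for this illustration.

```python
from collections import Counter, defaultdict


def metadata_counts(metadata, sample_set):
    """Tabulate, per metadata field, how many samples carry each value."""
    counts = defaultdict(Counter)
    for sample_id in sample_set:
        for field, value in metadata[sample_id].items():
            counts[field][value] += 1
    return {field: dict(values) for field, values in counts.items()}


# Hypothetical metadata layout: sample id -> {field: value}.
metadata = {
    "S1": {"location": "Europe", "year": "2019"},
    "S2": {"location": "North America", "year": "2019"},
    "S3": {"location": "Europe", "year": "2020"},
}
tally = metadata_counts(metadata, {"S1", "S2", "S3"})
```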
- calc_node_associations_groups()
A method to loop through the metadata counts to calculate Fisher’s exact test for the node associations in the clades already determined.
- calc_node_distances()
This method reads the distance matrix file previously created and sets the node’s closest distance, the closest sample distances, the average distances within the clade, total comparisons, and total distance for each node in the clade data dictionary.
- check_nodes()
A method to check the validity of the various nodes in the clade data. If the number of nodes is below the min_member_count or the min_snp_count, the node flags ‘is_valid’ and ‘is_selected’ are set to False.
- clade_snp_association()
A helper method to loop through the clade and find the associated snp names for each clade.
- Returns:
A dictionary for the snp name and associated clade node id
- Return type:
dict
- clade_snp_count()
A method to aggregate the counts for the number of clade node members based on the clade id in the clade data dictionary.
- Returns:
A dictionary of the counts for the clade node members
- Return type:
dict
- compress_heirarchy()
A method to compress the generated hierarchy based on the node bins previously determined and the genotypes. This method also removes nodes based on if the nodes on the same branch belong to the same bins. Valid nodes are those nodes which are terminal and distinct.
- compression_cleanup()
A method to clean up the clade following compression by removing nodes whose distance is below the threshold default or the threshold as denoted by the user input.
- create_cluster_membership_lookup()
A method that loops through the sample map dictionary and, should the sample id from the map match the cluster determined in the perform_clustering() method, adds the node ids and sample map indexes to the nodes dictionary.
- Returns:
A dictionary with the sample map indexes and the node ids clustered
- Return type:
dict
- dist_based_nomenclature()
A method to find the genotype identifiers from clustered data and format these identifiers based on the distances found and the delimiter set upon clade_worker instantiation.
- distance_node_binning()
A method to add the clade nodes to the bins as per the partition distances determined in the partition_distances() method.
- find_dist_troughs(values)
Uses the SciPy method scipy.signal.find_peaks to locate the troughs (reverse peaks) in the histogram data passed.
- Parameters:
values (list) – The list of the histogram data values to find the troughs for
- Returns:
troughs (numpy.array) – An array of the troughs
properties (dict) – a dictionary of the properties
Notes
Please refer to the external methods scipy.signal.find_peaks and numpy.array for more thorough documentation.
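Finding troughs with a peak finder usually means negating the data first, so that each local minimum becomes a local maximum. The sketch below shows that idea with `scipy.signal.find_peaks` called with default arguments. The function name `find_troughs` is hypothetical, and any extra arguments Clade-o-matic passes to `find_peaks` are not shown here.

```python
import numpy as np
from scipy.signal import find_peaks


def find_troughs(values):
    """Locate local minima by running find_peaks on the negated data."""
    troughs, properties = find_peaks(-np.asarray(values, dtype=float))
    return troughs, properties


histogram = [5, 3, 1, 3, 5, 2, 0, 2, 5]
troughs, properties = find_troughs(histogram)
```

For this histogram the troughs sit at indexes 2 and 6. The first and last elements are never reported, because `find_peaks` needs a neighbour on both sides.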
- fit_clusters_to_groupings()
A method to map the clusters created to the nodes previously identified.
- Returns:
A dictionary for the mapping of nodes to clusters
- Return type:
dict
- fix_root()
A helper method to reset the root of the clade with the snp contained in the largest conserved sequence.
- generate_genotypes()
A method to generate the genotypes data dictionary. This method creates this set from the group data dictionary and extracts the tree node and genotype from the sample map contained in the group data dictionary.
- Returns:
A dictionary with the genotypes consisting of the tree node identifiers and genotype identifiers
- Return type:
dict
- genotype_lookup(sample_list)
A helper method to retrieve the genotype from the sample id passed
- Parameters:
sample_list (list) – A list of sample identifiers
- Returns:
The dictionary of the genotypes corresponding to the list of sample ids
- Return type:
dict
- get_all_nodes()
Set the list of nodes as the valid nodes from the group data dictionary.
- get_bifurcating_nodes()
A method to split nodes with more than one genotype node identifier; the first is the parent and the second is the child node. It then sets the results in the bifurcating node dictionary.
- get_clade_distances()
This method loops through the clade data nodes for the average distances within the clades and places them in a list.
- Returns:
A list of floats for the average distances within the clades
- Return type:
list
- get_close_nodes()
This helper method finds the nodes whose distances are smaller than the minimum inter-clade distance.
- Returns:
A set of the invalid nodes that violate the distance constraint
- Return type:
set
- get_conserved_ranges()
A method to compile a list of sequence ranges for the variant positions data list.
- Returns:
A list of the conserved ranges for the variant positions
- Return type:
list
- get_genotype_snp_rules()
Getter method to retrieve the genotype_snp_rules data dictionary.
- Returns:
The genotype_snp_rules data dictionary
- Return type:
dict
- get_inter_clade_distances()
This helper method loops through the nodes of the clade data dictionary and creates a list of the closest clade distances for each node.
- Returns:
A list of the closest clade distances
- Return type:
list
- get_largest_conserved_pos()
A method to loop through the conserved sequence ranges determined through the method get_conserved_ranges() and find the longest sequence and the position for the snp contained therein.
- Returns:
The position for the snp within the largest conserved sequence
- Return type:
int
- get_node_member_counts()
A helper method to count the number of distinct genotypes in the group data sample map and set that to the node counts class variable.
- get_selected_nodes()
A helper method to retrieve the nodes that correspond to the ‘is selected’ flag within the clade data dictionary.
- Returns:
The nodes that correspond to the ‘is selected’ flag in the clade data
- Return type:
set
- get_selected_positions()
A helper method to retrieve the list of valid nodes within the clade data dictionary.
- Returns:
A sorted list of integers for the positions of valid nodes
- Return type:
list
- get_terminal_nodes()
A helper method to find the terminal nodes in the genotypes set generated in the generate_genotypes() method.
- Returns:
The set of terminal nodes from the genotypes set generated
- Return type:
set
- get_valid_nodes()
A helper method to loop through all the nodes to find the nodes flagged as valid in the clade data set.
- Returns:
A set of all the valid nodes in the clade data
- Return type:
set
- get_variant_positions()
A method to retrieve the variant positions from the snp data dictionary.
- Returns:
An integer list of the variant positions
- Return type:
list
- identify_dist_ranges()
A method to identify the troughs within the histogram data. This method uses the helper method find_dist_troughs() to accomplish this task.
- Returns:
A list of tuples for the range of the troughs in the histogram data
- Return type:
list
- init_clade_data()
A method to initialize the clade data dictionary and all the variables contained within for each item in the node list data set.
- partition_distances()
A helper method to determine the bins and partitions required for the clades based on the average distances between these clades.
- Returns:
A list of the average distances (partitions) for the clade for the calculated bins
- Return type:
list
- perform_clustering()
A method to read the distance matrix file created, along with the distance thresholds, and perform the clustering through the SciPy hierarchy methods.
Notes
See scipy.cluster.hierarchy.linkage and scipy.cluster.hierarchy.fcluster for more thorough documentation.
- populate_clade_data()
The method for populating clade data dictionary with the canonical snps from the snp data set. Uses the clade id, the position and variant base of the snp.
- prune_snps()
A method to prune the nodes in the clade based on the max_snps constraint. If there are more snp entries for the clade node, then some are removed randomly.
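Random pruning down to a cap can be sketched as follows. Treating a negative `max_snps` (the documented default is -1) as "no limit" is an assumption, as are the function name and the optional `seed` argument, which is there only to make the illustration repeatable.

```python
import random


def prune_to_max(snps, max_snps, seed=None):
    """Randomly keep at most `max_snps` entries; a negative cap disables pruning."""
    if max_snps < 0 or len(snps) <= max_snps:
        return list(snps)
    rng = random.Random(seed)
    return sorted(rng.sample(list(snps), max_snps))


positions = [12, 48, 97, 150, 203]
kept = prune_to_max(positions, max_snps=3, seed=1)
untouched = prune_to_max(positions, max_snps=-1)
```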
- remove_snps(positions)
This function removes snps from the valid positions variable in this clade data dictionary.
- Parameters:
positions (list) – A list of integer positions for the snps to be removed
- set_genotype_snp_rules()
This method sets the genotype and snp rules for what is a positive and partial genotype within the genotype_snp associated data dictionary, based on the minimum percentage of clade members that need to be positive for a kmer to be valid.
- set_genotype_snp_states()
This method sets the genotype data, the genotype and base counts, and the snp position for the genotype_snp_data data dictionary.
- set_invalid_nodes(invalid_nodes)
A helper method to loop through all the invalid nodes in the set and set these nodes as invalid in the clade data dictionary.
- Parameters:
invalid_nodes (set) – The set of invalid nodes to mark as invalid
- set_valid_nodes(valid_nodes)
Set the valid nodes in the group data set.
- Parameters:
valid_nodes (set) – The valid nodes to set in the group data set
- snp_based_filter()
A helper method to filter the snps by validity, using the maximum number of states for the snp positions and the minimum member count. If a snp has more than the max states or fewer than the minimum member count, it is marked as invalid.
- summarize_snps()
A method to loop through the snps identified and determine if the snp is canonical and/or valid. It also creates a list of variant positions for downstream processing.
- temporal_signal()
A method to calculate the Spearman and Pearson correlation coefficients from the metadata for the year and the calculated clade distances previously determined. If the Spearman or Pearson coefficient is larger than the pre-determined threshold, the temporal flag is set to ‘True’ in the clade data for the node.
- update()
A method to call the helper methods that update the clades: get_valid_nodes(), set_valid_nodes(), generate_genotypes().
- workflow()
The workflow method calls all the helper methods of this class to process the data according to both the input and the flags set by the user to determine clade memberships, validate SNPs, and compress the resulting clade unless otherwise specified by the user. Please refer to the file examples/small_test/cladeomatic/cladeomatic-clades.info.txt for more information.
constants
A module that holds a number of constants for the various functions clade-o-matic performs in the creation of schemes and data analysis.
kmers
- class cladeomatic.kmers.kmer_worker(ref_sequence, msa_file, result_dir, prefix, klen, genotype_map, genotype_snp_rules, max_ambig=0, min_perc=1, target_positions=[], num_threads=1)
The kmer_worker class.
The kmer_worker class contains several class variables for use in the creation of the kmer lists required for downstream processes in the creation of the schemes and data analysis files.
- __init__(ref_sequence, msa_file, result_dir, prefix, klen, genotype_map, genotype_snp_rules, max_ambig=0, min_perc=1, target_positions=[], num_threads=1)
The instantiation of the kmer_worker class.
- Parameters:
ref_sequence (str) – The reference sequence for kmer selection
msa_file (str) – The file path to the fasta file with the snps substitutions (multiple sequence alignment)
result_dir (str) – The file path to the results directory
prefix (str) – The prefix for the clade-o-matic produced output files
klen (int) – The length of the kmers
genotype_map (dict) – The map of the sample identifiers and genotypes
genotype_snp_rules (dict) – The dictionary groupings of the positions, base variants, positive and partial genotypes for the snps, that make up the processing rules
max_ambig (int) – The maximum number of ambiguous bases that can be contained in a kmer sequence. Default for this class is 0.
min_perc (float) – The minimum percentage of clade members to be positive for a kmer to be considered valid. Default for this class is 1.
target_positions (list) – The integer list of the target snp positions. Default for this class is an empty list.
num_threads (int) – The number of threads for the Ray instance. Default for this class is 1.
Notes
Refer to https://www.ray.io for more information about the Ray instances used in this module
- calc_consensus_seq()
A method to determine the consensus sequence for the fasta passed by looping through the reference sequence.
- Returns:
The consensus sequence for the bases in the sequence fasta passed
- Return type:
str
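A consensus of this kind is commonly the most frequent base in each alignment column. The sketch below takes the aligned sequences directly as a list of equal-length strings. That input form and the function name are assumptions, and how the class breaks ties or handles gaps is not documented here.

```python
from collections import Counter


def consensus_sequence(aligned_seqs):
    """Return the most frequent base in every column of the alignment."""
    consensus = []
    for column in zip(*aligned_seqs):
        base, _ = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


alignment = ["ACGTA", "ACGTT", "ACCTA"]
consensus = consensus_sequence(alignment)
```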
- confirm_kmer_specificity()
A method to further filter invalid kmers from the extracted kmer dictionary. This method ensures the target positions have the valid bases and exist in the kmer scheme data dictionary. Kmers that do not adhere to these rules are flagged as invalid.
- construct_ruleset()
A method to construct and populate the kmer rule set dictionary of the positive genotypes (the kmers that match or exceed the minimum percentage of clade members to be positive for a kmer to be valid) and the partial genotypes which are less than the minimum percentage of clade members.
- create_biohansel_kmers()
A method to create the kmer data dictionary for use with the BioHansel scheme. This method creates the positive and negative kmers for use in the BioHansel scheme, while adhering to the same kmer rules as the cladeomatic.create.create_scheme() method.
- Returns:
A dictionary for the kmers to be used by the BioHansel scheme
- Return type:
dict
Notes
Refer to https://github.com/phac-nml/biohansel for more thorough BioHansel documentation
- extract_kmers()
This method creates the dictionary of all the possible kmers for the snp variants found in the sequence alignments. These kmers are found through the looping of the snp positions in the sequence and ‘frame-shifting’ the kmer sequence through all possibilities for the snp, as long as there are no ambiguous bases nor gaps in the kmer sequence and the kmer is the specified length.
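The 'frame-shifting' described above amounts to sliding a window of length klen across every start position that still covers the snp. A minimal sketch for one sequence and one position follows, assuming 0-based positions and an unambiguous alphabet of A, C, G and T. The function name and the returned layout are hypothetical.

```python
def kmers_covering(seq, pos, klen, valid_bases="ACGT"):
    """Return {start: kmer} for every window of length `klen` that covers
    0-based position `pos` and holds only unambiguous bases."""
    kmers = {}
    first = max(0, pos - klen + 1)
    last = min(pos, len(seq) - klen)
    for start in range(first, last + 1):
        kmer = seq[start:start + klen]
        # Windows with ambiguous bases (N) or gaps (-) are skipped.
        if all(base in valid_bases for base in kmer):
            kmers[start] = kmer
    return kmers


sequence = "ACGTNACGTAC"
windows = kmers_covering(sequence, pos=6, klen=4)
```

Four windows cover position 6, but the two that start at indexes 3 and 4 include the N at index 4, so only the windows starting at 5 and 6 are kept.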
- find_invalid_kmers()
A method to loop through the extracted kmer dictionary, retrieve the flagged invalid kmers and add their indexes to a set of just invalid kmers. This is a helper method used by other methods for kmer filtering.
- find_variant_positions()
Method to find the sequence variants and their positions in the msa_base_counts dictionary and assign the positions of the bases to the variant_positions dictionary.
- flag_invalid_kmers(invalid_kmers)
This method takes the set of invalid kmer indexes and flags the kmers as invalid in the extracted kmer dictionary.
- Parameters:
invalid_kmers (set) – The set of integers for the indexes of the invalid kmers that require flagging in the extracted kmers dictionary
- get_genotype_snp_states()
A method to find the valid SNPs in the genotype membership dictionary previously constructed.
- get_kseq_by_index(index)
A helper method to return the kmer sequence for the kmer index passed from the extracted kmers dictionary.
- Parameters:
index (int) – The index for the desired kmer sequence
- Returns:
The kmer sequence if it is found in the extracted kmer dictionary, but returns an empty string if not found
- Return type:
str
- get_optimal_kmer_position()
A method to determine the best kmer start positions for the variant/snp positions found in the find_variant_positions() method. It sets these kmer start positions in a list for further processing.
- get_pos_without_kmer()
This method compiles a dictionary of missing kmer indexes as identifiers and the snp base they flank.
- Returns:
A dictionary of the missing kmers and the snp bases these kmers flank.
- Return type:
dict
- init_kmer_scheme_data()
A method to initialize the kmer scheme through the list of target positions. Adds blank entries to the kmer scheme dictionary.
- init_msa_base_counts()
Method to initialize the msa (multiple sequence alignment) base counts dictionary msa_base_counts with the number of each base substitution for the length of the reference sequence set to zero.
- perform_kmer_search()
This method takes in the list of all possible kmers for the sequences provided from the method extract_kmers() and writes them to a file for later processing using the cladeomatic.utils.kmerSearch.SeqSearchController() method.
- pop_msa_base_counts()
Method to count or populate the msa_base_counts dictionary initialized in init_msa_base_counts() for the bases in the fasta file passed. For the sequence in the fasta, the snp position is referenced and the counter is updated for each base substitution that occurs in that position.
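The initialise-then-populate pattern for per-position base counts can be sketched as below. The six states follow the [A,T,C,G,N,-] list given for max_states. The dictionary layout, the function names, the in-memory list of sequences used instead of a fasta file, and counting unexpected characters as N are all assumptions.

```python
BASES = ("A", "T", "C", "G", "N", "-")


def init_base_counts(ref_len):
    """Create one zeroed counter per reference position."""
    return {pos: {base: 0 for base in BASES} for pos in range(ref_len)}


def populate_base_counts(counts, aligned_seqs):
    """Add each sequence's base at every position to the counters."""
    for seq in aligned_seqs:
        for pos, base in enumerate(seq.upper()):
            # Anything outside the expected alphabet is tallied as N here.
            key = base if base in counts[pos] else "N"
            counts[pos][key] += 1
    return counts


base_counts = populate_base_counts(init_base_counts(4), ["ACGT", "ACGA", "AC-T"])
```

A position with more than one non-zero state, such as index 3 here, is the kind of site a later step could report as variant.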
- populate_int_base_kmer_lookup()
A method to instantiate the base kmer lookup dictionary with a sequential index and the sequence of the kmer.
- populate_kmer_scheme_data()
A method to populate the kmer scheme dictionary with the processed kmers of the scheme data and extracted kmers dictionaries. This method removes the invalid kmers, removes the kmers with invalid bases, and the kmers with invalid positions in comparison to the scheme data.
- process_kmer_results()
This method processes all the kmers within the search file clade-o-matic created in the perform_kmer_search() method. It filters the kmers to find the list of valid kmers for the sequence variants by ensuring the kmer sequence id exists in the genotype map and there are no duplicate kmers in the search file.
- refine_rules()
A method to filter and refine the kmer selection rules. Missing data can cause kmers to all be partial when there isn’t an alternative kmer available for a genotype. This filters the rules to assign kmers to be positive for a genotype when there is only one kmer base state present for it.
- remove_empty_base_states()
A clean up method to remove the records with empty or invalid bases from the kmer scheme data dictionary.
- remove_invalid_kmers_from_scheme()
A method to remove the invalid kmers from the scheme data dictionary.
- remove_redundant_kmers()
A method to remove all but the best kmer sequences from the kmer scheme data dictionary. This is accomplished by revisiting the kmer rule sets to find the best representative kmer and then updating the kmer scheme data dictionary with the results of the filtering.
- remove_scheme_pos(pos_to_remove)
A helper method to remove an entry in the kmer scheme data dictionary based on the position passed.
- Parameters:
pos_to_remove (int) – The index of the position to be removed
- workflow()
The workflow method to call all the methods within this class to create the kmer lists used for schema analysis and processing.
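The base-count bookkeeping described for init_msa_base_counts() and pop_msa_base_counts() can be pictured with a minimal sketch. The function names, the dictionary layout, and the set of tracked characters below are assumptions made for illustration; they are not taken from the class itself.

```python
def init_base_counts(ref_len, bases="ATCGN-"):
    """Return one zeroed counter per alignment position (hypothetical layout)."""
    return {pos: {base: 0 for base in bases} for pos in range(ref_len)}


def populate_base_counts(counts, seqs):
    """Tally the base observed at every position of each aligned sequence."""
    for seq in seqs.values():
        for pos, base in enumerate(seq.upper()):
            # Ignore positions or characters that were not initialized.
            if pos in counts and base in counts[pos]:
                counts[pos][base] += 1
    return counts


# Two toy aligned sequences that differ only at position 2.
counts = populate_base_counts(init_base_counts(4), {"s1": "ACGT", "s2": "ACTT"})
```

In this sketch, a position where more than one base has a non-zero count is a candidate variable site.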
snps
- cladeomatic.snps.process_snp(chrom, pos, base, all_samples, snp_members, ambig_members, group_membership, is_ref)
A method to process all the inputs to create a dictionary entry of SNP data. Please refer to the file examples/small_test/cladeomatic/cladeomatic-snps.info.txt for more information.
- Parameters:
chrom (str) – The chromosome to which the SNP belongs
pos (int) – The location of the SNP in the sequence
base (str) – The nucleotide base of the SNP
all_samples (set) – The set of identifiers for the samples
snp_members (set) – The identifiers for the SNPs
ambig_members (set) – The collections of ambiguously called nucleotides
group_membership (dict) – The dictionary for the node ids and their clade/group memberships
is_ref (bool) – True if this is the reference sequence nucleotide
- Returns:
A dictionary of the snp and associated information for further processing
- Return type:
dict
- cladeomatic.snps.snp_search_controller(group_data, vcf_file, n_threads=1)
A method to search for the lists of snps in the VCF file and group data to match them to create a complete snp dictionary. Please refer to the file examples/small_test/cladeomatic/cladeomatic-snps.info.txt for more information.
- Parameters:
group_data (dict) – A dictionary of all the SNPs and their clade/group membership
vcf_file (str) – The file path to the user VCF file containing the variants of interest
n_threads (int) – The number of threads to use in the Ray method call
- Returns:
The data dictionary for the snps as identified from the input and grouped by chromosome, position and base
- Return type:
dict
Notes
Refer to https://www.ray.io for more information about the Ray instances used in this module.
visualize
writers
- cladeomatic.writers.print_params(params, outfile)
A method to create the parameter file. This is the list of all the command line parameters and their user input values if supplied. Please refer to the sample file examples/small_test/cladeomatic/cladeomatic-params.log for more information.
- Parameters:
params (argparse.Namespace object) – The Namespace object that holds the parameters and their values
outfile (str) – The file path for the parameters output file
- cladeomatic.writers.write_genotypes(genotypes, outfile, header='sample_id\tgenotype\n')
Accepts a list of sample genotype hierarchies and writes them to a file. Please refer to the sample files examples/small_test/cladeomatic/cladeomatic-genotypes.distance.txt and examples/small_test/cladeomatic/cladeomatic-genotypes.selected.txt for more information.
- Parameters:
genotypes (dict) – The dictionary of genotypes for writing
outfile (str) – The output file path
- cladeomatic.writers.write_kmers(kmers, out_file)
A method to create and write a kmer file indicating the index of the kmer, the kmer sequence, the snp position and base value, and the alignment start and end.
- Parameters:
kmers (dict) – The dictionary of kmers to be written
out_file (str) – The path to the output file
- cladeomatic.writers.write_node_report(clade_data, outfile)
Writes the clades data produced by the cladeomatic.clades.clade_worker class to a TSV file. Please refer to the sample file examples/small_test/cladeomatic/cladeomatic-clades.info.txt for more information.
- Parameters:
clade_data (dict) – The clade data for writing to the file
outfile (str) – The output file path
- cladeomatic.writers.write_scheme(header, scheme, outfile)
A method to write either a kmer or snp scheme to a file. Please refer to the sample files examples/small_test/cladeomatic/cladeomatic-kmer.scheme.txt and examples/small_test/cladeomatic/cladeomatic-snp.scheme.txt for more information.
- Parameters:
header (str) – The string representing the header text
scheme (dict) – A dictionary of the kmer or snp scheme data to write out
outfile (str) – The output file path
- cladeomatic.writers.write_snp_report(snp_data, outfile)
Accepts snp_data data dictionary structure and writes the snp details to a file. Please refer to the sample file examples/small_test/cladeomatic/cladeomatic-snps.info.txt for more information.
- Parameters:
snp_data (dict) – A dictionary for the snp data in the format {[chrom_id]: {[position]: {[base]: dict()}}}
outfile (str) – The file output path
utils.__init__
- cladeomatic.utils.__init__.calc_AMI(category_1, category_2)
Calculates the adjusted mutual info score between two clusterings.
- Parameters:
category_1 (list) – A list of int values for cluster 1
category_2 (list) – A list of int values for cluster 2
- Returns:
The value of the AMI score for the two clusters
- Return type:
float
- cladeomatic.utils.__init__.calc_ARI(category_1, category_2)
Calculates the adjusted rand score between two clusterings.
- Parameters:
category_1 (list) – A list of int values for cluster 1
category_2 (list) – A list of int values for cluster 2
- Returns:
The value of the ARI score for the two clusters
- Return type:
float
- cladeomatic.utils.__init__.calc_shanon_entropy(value_list)
This method calculates the Shannon entropy value for the list of numbers passed.
- Parameters:
value_list (list) – The list of values to use for the calculation
- Returns:
The calculated Shannon entropy or -1 if there are no values in the passed list
- Return type:
float
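As an illustration of the calculation described above, here is a minimal sketch of Shannon entropy over a list of counts. It assumes the values are non-negative counts and that the logarithm is base 2; neither detail is stated in the documentation, and the function name is hypothetical.

```python
import math


def shannon_entropy(values):
    """Shannon entropy (in bits) of a list of counts; -1 if there is nothing to score."""
    total = sum(values)
    if not values or total == 0:
        return -1
    entropy = 0.0
    for value in values:
        if value > 0:
            p = value / total  # Relative frequency of this category.
            entropy -= p * math.log2(p)
    return entropy
```

Two equally frequent categories give an entropy of one bit, while a single category gives zero.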
- cladeomatic.utils.__init__.init_console_logger(lvl)
This method initializes the logger for the console.
- Parameters:
lvl (int) – The level of logging desired (0, 1, 2, or 3)
- Returns:
The logging object
- Return type:
logging
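A console logger of this kind could be set up roughly as follows. The mapping from the 0 to 3 values onto the standard logging levels is an assumption made for this sketch, as are the function name and the message format.

```python
import logging

# Hypothetical mapping of the 0-3 verbosity values to standard logging levels.
LEVELS = {0: logging.ERROR, 1: logging.WARNING, 2: logging.INFO, 3: logging.DEBUG}


def make_console_logger(lvl, name="example"):
    """Return a logger that writes to the console at the requested verbosity."""
    logger = logging.getLogger(name)
    logger.setLevel(LEVELS.get(lvl, logging.DEBUG))
    if not logger.handlers:  # Avoid attaching duplicate handlers on repeated calls.
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```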
- cladeomatic.utils.__init__.parse_metadata(file, column='sample_id')
Parses the metadata file into a dictionary with sample ids as the keys. Will dynamically add the other columns and values to the dictionary.
- Parameters:
file (str) – The path to tsv metadata file with ‘sample_id’ as a mandatory column
- Returns:
A dictionary of the parsed metadata values organized with the sample id as a key
- Return type:
dict
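The shape of the returned dictionary can be illustrated with a small sketch built on the standard csv module. For brevity it reads from an open handle rather than a file path, and it removes the key column from each row; both choices, along with the function name and the example columns, are assumptions of this example.

```python
import csv
import io


def parse_metadata_sketch(handle, column="sample_id"):
    """Index the rows of a tab-separated table by the value found in `column`."""
    metadata = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        key = row.pop(column)  # Remaining columns are kept as they appear.
        metadata[key] = row
    return metadata


# Toy metadata table with two extra, made-up columns.
example = io.StringIO("sample_id\tcountry\tyear\nS1\tCA\t2020\nS2\tUS\t2021\n")
parsed = parse_metadata_sketch(example)
```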
- cladeomatic.utils.__init__.run_command(command)
This method runs the passed command on the shell command line.
- Parameters:
command (str) – The command for the command line
- Returns:
stdout, stderr: the standard out and error messages returned by the command line
- Return type:
bytes, bytes
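A wrapper with this contract can be written with the standard subprocess module. The sketch below is one possible way to obtain the documented (stdout, stderr) pair of bytes; it is not necessarily how the package does it, and the function name is hypothetical.

```python
import subprocess


def run_command_sketch(command):
    """Run `command` through the shell and return (stdout, stderr) as bytes."""
    proc = subprocess.Popen(
        command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    stdout, stderr = proc.communicate()  # Waits for the process to finish.
    return stdout, stderr
```

Because the command string is interpreted by the shell, a helper like this should only be given commands built from trusted input.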
utils.jellyfish
utils.kmerSearch
- cladeomatic.utils.kmerSearch.SeqSearchController(seqKmers, fasta_file, out_dir, prefix, n_threads=1)
This method takes in the list of all possible kmers for the sequences provided and writes them to a temporary processing file for downstream searches. This method employs the use of Ray for processing.
- Parameters:
seqKmers (dict) – A dictionary with the index as keys and kmer sequences as values
fasta_file (str) – The file path to the fasta file containing the sequence
out_dir (str) – The filepath to the temporary output file for the kmers
prefix (str) – The prefix for the output file name
n_threads (int) – The number of threads to be used in the Ray process, default is 1
- Returns:
A list of strings for the paths to the temporary processing files
- Return type:
list
Notes
Refer to https://www.ray.io for more information about the Ray instances used in this module.
- cladeomatic.utils.kmerSearch.expand_degenerate_bases(seq)
List all possible kmers for a scheme given a degenerate base.
- Parameters:
seq (str) – The string for the scheme_kmers from SNV scheme fasta file
- Returns:
List of all possible kmers given a degenerate base or not
- Return type:
list
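Expanding degenerate bases amounts to taking the Cartesian product of the bases each IUPAC code stands for. The sketch below shows the idea with itertools.product; the function name and the lookup table name are hypothetical, and the handling of characters outside the IUPAC nucleotide codes is left out.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(seq):
    """List every unambiguous sequence encoded by a possibly degenerate one."""
    choices = [IUPAC[base] for base in seq.upper()]
    return ["".join(combo) for combo in product(*choices)]
```

A sequence without degenerate bases comes back as a single-element list, which matches the "degenerate base or not" wording above.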
- cladeomatic.utils.kmerSearch.init_automaton_dict(seqs)
Initialize an Aho-Corasick Automaton with the kmers found in the passed sequence dictionary. The Automaton is loaded with each kmer and its reverse complement.
- Parameters:
seqs (dict) – The dictionary of kmer sequences and their indexes
- Returns:
Aho-Corasick Automaton with kmers loaded
Notes
Please refer to the pyahocorasick project, specifically the method add_word for this method.
- cladeomatic.utils.kmerSearch.revcomp(s)
This method creates the reverse complement nucleotide sequence for the sequence passed using the str.translate method.
- Parameters:
s (str) – The nucleotide sequence to find the reverse complement for
- Returns:
The reverse complement of the passed nucleotide sequence
- Return type:
str
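A reverse complement built on str.translate can be sketched in a few lines. The translation table below covers only upper and lower case A, C, G, and T; how the package treats other characters is not documented here, so the sketch leaves them unchanged.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp_sketch(s):
    """Complement every base, then reverse the string."""
    return s.translate(COMPLEMENT)[::-1]
```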
utils.phylo_tree
utils.seqdata
- cladeomatic.utils.seqdata.calc_homopolymers(seq)
The method calculates the longest homopolymer (the sequence of consecutive identical bases) in the sequence or sequence fragment passed.
- Parameters:
seq (str) – The sequence or sequence fragment in which to find the longest homopolymer
- Returns:
The longest homopolymer length
- Return type:
int
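One compact way to find the longest run of identical bases is itertools.groupby, which groups consecutive equal characters. This is a sketch of the idea with a hypothetical function name, not the package's implementation; returning 0 for an empty sequence is an assumption.

```python
from itertools import groupby


def longest_homopolymer(seq):
    """Length of the longest run of consecutive identical characters."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)
```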
- cladeomatic.utils.seqdata.calc_md5(string)
Method to compute the MD5 hash of the input string.
- Parameters:
string (str) – The string to compute the MD5 hash for
- Returns:
The md5 hash generated
- Return type:
hash
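Hashing a sequence string with the standard hashlib module looks roughly like this. The sketch assumes UTF-8 encoding and returns the hexadecimal digest; whether the package returns a digest string or a hash object is not clear from the description, and the function name is hypothetical.

```python
import hashlib


def md5_hex(string):
    """Return the hexadecimal MD5 digest of a text string."""
    return hashlib.md5(string.encode("utf-8")).hexdigest()
```

Identical sequences always produce the same digest, which makes such a hash convenient as a compact key for detecting duplicates.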
- cladeomatic.utils.seqdata.create_aln_pos_from_unalign_pos_lookup(aln_seq)
This method creates a list of integers for the positions of the bases in the unaligned sequence derived from the aligned sequence passed to the method.
- Parameters:
aln_seq (str) – The alignment sequence to process
- Returns:
A list of integers for the positions found
- Return type:
list
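One reading of this lookup is a list in which index i holds the alignment column of the i-th base of the unaligned sequence. The sketch below follows that interpretation, which is an assumption, and treats "-" as the only gap character.

```python
def aln_pos_from_unaligned(aln_seq, gap="-"):
    """Alignment columns of the non-gap characters, in order of appearance."""
    return [col for col, base in enumerate(aln_seq) if base != gap]
```

Under this interpretation, the aligned sequence "A-CG--T" contains four bases, located at alignment columns 0, 2, 3, and 6.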
- cladeomatic.utils.seqdata.create_pseudoseqs_from_vcf(ref_id, ref_seq, vcf_file, outfile)
This method creates the pseudo full sequences of the variants found in the VCF file.
- Parameters:
ref_id (str) – The reference sequence identifier
ref_seq (str) – The reference sequence to alter
vcf_file (str) – The path to the VCF file
outfile (str) – The path to the pseudo variant outfile
- cladeomatic.utils.seqdata.gb_to_fasta_dict(gbk_file)
Reads a GenBank formatted sequence file and creates a dictionary of sequences with the sequence id as keys.
- Parameters:
gbk_file (str) – The path to a GenBank formatted sequence file
- Returns:
A dictionary of sequences indexed by sequence id
- Return type:
dict
- cladeomatic.utils.seqdata.generate_non_gap_position_lookup(seq)
Creates a list of positions which correspond to the position of that base in a gapless sequence.
- Parameters:
seq (str) – The sequence to process for gaps
- Returns:
An int list of the positions of the gaps
- Return type:
list
- cladeomatic.utils.seqdata.get_variants(vcf_file)
This method reads the incoming VCF file and returns a dictionary of the sample variant bases and their locations.
- Parameters:
vcf_file (str) – The file path to the VCF file for variant discovery
- Returns:
A dictionary of the sample variants with the node id as key
- Return type:
dict
- cladeomatic.utils.seqdata.parse_reference_gbk(gbk_file)
Method to parse the GenBank reference file, clean the strings, and return the reference features of interest.
- Parameters:
gbk_file (str) – The file path to the reference genbank format file with sequence annotations
- Returns:
A dictionary of all the reference features
- Return type:
dict
- cladeomatic.utils.seqdata.read_fasta_dict(fasta_file)
Reads the fasta file from the passed file path and formats the input to a dictionary of sequences.
- Parameters:
fasta_file (str) – The path to fasta file to read
- Returns:
A dictionary of sequences indexed by sequence id
- Return type:
dict
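The dictionary described above can be produced with a short hand-written parser. This sketch reads from an open handle instead of a path and uses the first whitespace-delimited token of each header line as the sequence id; both are assumptions of the example, and the function name is hypothetical.

```python
import io


def read_fasta_sketch(handle):
    """Parse FASTA text into a dictionary of sequence id -> sequence string."""
    chunks = {}
    seq_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            seq_id = header[0] if header else ""
            chunks[seq_id] = []
        elif seq_id is not None:
            chunks[seq_id].append(line)  # Sequences may span several lines.
    return {key: "".join(parts) for key, parts in chunks.items()}


records = read_fasta_sketch(io.StringIO(">s1 sample one\nAC\nGT\n>s2\nTT\n"))
```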
- cladeomatic.utils.seqdata.revcomp(s)
Helper method to create the reverse complement nucleotide sequence for the sequence passed.
- Parameters:
s (str) – The nucleotide sequence to parse
- Returns:
The reverse complement of the passed nucleotide sequence
- Return type:
str
utils.snpdists
- cladeomatic.utils.snpdists.run_snpdists(aln_file, out_file, num_threads=1)
Runs the command line snp-dists application which calculates the SNP distances from a fasta file.
- Parameters:
aln_file (str) – The path to the fasta sequence file
out_file (str) – The path to the output file
num_threads (int) – The number of processing threads to use. Default is 1.
- Returns:
Stderr - the console error messages
- Return type:
tuple
Notes
Please refer to https://github.com/tseemann/snp-dists for more documentation
utils.vcfhelper
- class cladeomatic.utils.vcfhelper.vcfReader(file)
The vcfReader class is meant to be a helper class to read and process VCF files.
- check_file(file)
A helper method to verify that a file exists and is not empty.
- Parameters:
file (str) – The path to the file
- Returns:
True if the file both exists and is not empty
- Return type:
bool
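A check of this kind needs only two calls from the standard os.path module. The sketch below is a standalone function with a hypothetical name, shown to illustrate the "exists and is not empty" rule rather than the method itself.

```python
import os


def file_is_usable(path):
    """True only if `path` is an existing regular file with at least one byte."""
    return os.path.isfile(path) and os.path.getsize(path) > 0
```

The size is only queried when the file exists, so a missing path yields False instead of raising an error.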
- next_row()
A method to return the next row of the VCF file.
- Returns:
The next row of the file
- Return type:
vcfReader object
- process_row()
A method to process the output of a row for the VCF file.
- Returns:
A dictionary of the VCF file data with the headers as keys for the sample data. Returns None if there is no next row in the file.
- Return type:
dict
utils.visualization
- cladeomatic.utils.visualization.create_dist_histo(data, outfile)
A method to create a histogram from the data dictionary passed. This method also writes the histogram to a file.
- Parameters:
data (dict) – The data to be plotted
outfile (str) – The output figure file path
- cladeomatic.utils.visualization.plot_bar(x, y)
A method to plot a bar chart with the x and y objects passed.
- Parameters:
x (obj) – The list or a dictionary for the x-axis objects
y (obj) – The list or a dictionary for the y-axis objects
- Returns:
The plotly figure produced
- Return type:
Figure
Notes
Please refer to https://plotly.com/python/ for more documentation