RepeatFinder#

class pytantan.RepeatFinder#

A repeat finder using the Tantan method.

alphabet#

The alphabet to use for encoding and decoding sequences.

Type:

Alphabet

scoring_matrix#

The scoring matrix used to derive the likelihood ratio matrix, if any.

Type:

ScoringMatrix or None

likelihood_matrix#

The likelihood ratio matrix used for scoring letter pairs in tandem repeats.

Type:

LikelihoodMatrix

__init__(scoring_matrix, *, repeat_start=0.005, repeat_end=0.05, decay=0.9, protein=False)#

Create a new repeat finder.

Parameters:
  • scoring_matrix (ScoringMatrix) – The scoring matrix to use for scoring sequence alignments.

  • repeat_start (float) – The probability of a repeat starting per position.

  • repeat_end (float) – The probability of a repeat ending per position.

  • decay (float) – The probability decay per period.

  • protein (bool) – Set to True to treat the input sequence as a protein sequence.

get_probabilities(sequence)#

Get the probabilities of being a repeat for each sequence position.

Parameters:

sequence (str or byte-like object) – The sequence to compute probabilities for.

Returns:

array – A float array containing the repeat probability per sequence position.

mask_repeats(sequence, threshold=0.5, mask=None)#

Mask regions predicted as repeats in the given sequence.

Parameters:
  • sequence (str or byte-like object) – The sequence containing the repeats to mask.

  • threshold (float) – The probability threshold above which to mask sequence characters.

  • mask (str or None) – A single mask character to use for masking positions. If None given, masking uses the lowercase letters of the original sequence.

Returns:

str – The input sequence with repeat regions masked.

Example

>>> matrix = ScoringMatrix(
...     [[ 1, -1, -1, -1],
...      [-1,  1, -1, -1],
...      [-1, -1,  1, -1],
...      [-1, -1, -1,  1]],
...     alphabet="ACGT",
... )
>>> tantan = RepeatFinder(matrix)
>>> tantan.mask_repeats("ATTATTATTATTATT")
'ATTattattattatt'
>>> tantan.mask_repeats("ATTATTATTATTATT", mask='N')
'ATTNNNNNNNNNNNN'