dna

Class to represent DNA sequences.

Attributes

NUCLEOBASES

PAD_TOKEN

Classes

DNA

Module Contents

dna.NUCLEOBASES = ('A', 'C', 'T', 'G')
dna.PAD_TOKEN = '0'
class dna.DNA(proxy_fmt='onehot-np', **kwargs)

Bases: gflownet.envs.sequences.base.SequenceBase

Parameters:

proxy_fmt (str) –

Specifies the proxy format. Options:
  • onehot: One-hot encoding

  • letters: The nucleobases as a list of strings

  • np or numpy: output a numpy array; applies to the onehot case

  • torch or tensor: output a torch tensor; applies to the onehot case

states2proxy_onehot(states)

Prepares a batch of states in “environment format” for a proxy model: states are one-hot encoded. If numpy is True (default), the output is converted into a numpy array, otherwise it remains a torch tensor.

Example, with max_length = 5:
  • Sequence (tokens): ACGC

  • state: [1, 2, 4, 2, 0]

  • policy format:
    [0, 1, 0, 0, 0, (A)
     0, 0, 1, 0, 0, (C)
     0, 0, 0, 0, 1, (G)
     0, 0, 1, 0, 0, (C)
     1, 0, 0, 0, 0] (PAD)
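The example above can be reproduced with a small library-free sketch. It is not the gflownet implementation: the helper names tokens_to_state and one_hot_states are hypothetical, index 0 is assumed to be reserved for padding (as the example suggests), and each state is assumed to be flattened into a single row. The real method returns a numpy array or a torch tensor rather than nested lists.

```python
# Copied from the module attribute documented above.
NUCLEOBASES = ("A", "C", "T", "G")
# Assumption: index 0 encodes padding, so nucleobases start at index 1.
PAD_INDEX = 0


def tokens_to_state(tokens, max_length):
    """Map a string of nucleobases to a padded list of integer indices."""
    if len(tokens) > max_length:
        raise ValueError("sequence is longer than max_length")
    indices = [NUCLEOBASES.index(token) + 1 for token in tokens]
    return indices + [PAD_INDEX] * (max_length - len(tokens))


def one_hot_states(states):
    """One-hot encode a batch of states, flattening each state into one row."""
    n_classes = len(NUCLEOBASES) + 1  # four nucleobases plus padding
    batch = []
    for state in states:
        row = []
        for index in state:
            # One block of n_classes values per position, with a single 1.
            row.extend(1 if k == index else 0 for k in range(n_classes))
        batch.append(row)
    return batch


state = tokens_to_state("ACGC", max_length=5)
encoded = one_hot_states([state])
```

Under these assumptions, state matches the integer state shown above, and encoded holds one row of max_length × 5 = 25 values per state.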

Parameters:

states (tensor) – A batch of states in environment format, either as a list of states or as a single tensor.

Returns:

The one-hot encoding of all the states in the batch, as a numpy array by default or as a torch tensor otherwise.

Return type:

Union[torchtyping.TensorType[batch, policy_input_dim], numpy.typing.NDArray]