dna

Class to represent DNA sequences.

Attributes

`NUCLEOBASES`
`PAD_TOKEN`

Classes

DNA

Module Contents

dna.NUCLEOBASES = ('A', 'C', 'T', 'G')

dna.PAD_TOKEN = '0'
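As a rough illustration of how these two attributes relate to a state vector, the sketch below encodes a token sequence into integer indices. The convention used (index 0 for padding, nucleobases numbered from 1 in the order of NUCLEOBASES) is an assumption inferred from the example in states2proxy_onehot, and the names TOKEN_TO_INDEX and encode are hypothetical helpers, not part of the module.

```python
NUCLEOBASES = ("A", "C", "T", "G")
PAD_TOKEN = "0"

# Assumed convention: padding maps to index 0, nucleobases to 1..4
# in the order they appear in NUCLEOBASES.
TOKEN_TO_INDEX = {PAD_TOKEN: 0}
TOKEN_TO_INDEX.update({base: i + 1 for i, base in enumerate(NUCLEOBASES)})


def encode(sequence, max_length):
    """Hypothetical helper: map a token string to a padded list of indices."""
    if len(sequence) > max_length:
        raise ValueError("sequence is longer than max_length")
    # Right-pad with PAD_TOKEN up to max_length, then look up each token.
    padded = sequence + PAD_TOKEN * (max_length - len(sequence))
    return [TOKEN_TO_INDEX[token] for token in padded]


state = encode("ACGC", max_length=5)
print(state)  # [1, 2, 4, 2, 0]
```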

class dna.DNA(proxy_fmt='onehot-np', **kwargs)

Bases: gflownet.envs.sequences.base.SequenceBase

Parameters:

proxy_fmt (str) –

Specifies the proxy format. Options:

onehot: One-hot encoding
letters: The nucleobases as a list of strings
np or numpy: Output as a numpy array, for the onehot case
torch or tensor: Output as a torch tensor, for the onehot case

states2proxy_onehot(states)

Prepares a batch of states in “environment format” for a proxy model: states are one-hot encoded. If numpy is True (default), the output is converted into a numpy array, otherwise it remains a torch tensor.

Example, with max_length = 5:

Sequence (tokens): ACGC
state: [1, 2, 4, 2, 0]
policy format:

[0, 1, 0, 0, 0, (A)
 0, 0, 1, 0, 0, (C)
 0, 0, 0, 0, 1, (G)
 0, 0, 1, 0, 0, (C)
 1, 0, 0, 0, 0] (PAD)

Parameters: states (tensor) – A batch of states in environment format, either as a list of states or as a single tensor.
Returns: The one-hot encoding of all the states in the batch, as a numpy array or a torch tensor.
Return type: Union[torchtyping.TensorType[batch, policy_input_dim], numpy.typing.NDArray]
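The encoding in the example above can be reproduced with a minimal sketch. This is not the library's implementation: one_hot_flat and one_hot_batch are hypothetical names, plain Python lists stand in for the numpy array or torch tensor that the method returns, and the vocabulary size of 5 (padding plus four nucleobases) is an assumption taken from the example.

```python
def one_hot_flat(state, n_tokens=5):
    """Hypothetical helper: one-hot encode one state and flatten the result.

    Each index in `state` becomes a row of length `n_tokens` with a single 1
    at that index; the rows are concatenated in order.
    """
    flat = []
    for index in state:
        row = [0] * n_tokens
        row[index] = 1
        flat.extend(row)
    return flat


def one_hot_batch(states, n_tokens=5):
    """Hypothetical helper: encode every state in a batch."""
    return [one_hot_flat(state, n_tokens) for state in states]


# State for the sequence ACGC with max_length = 5 (last position is padding).
encoded = one_hot_flat([1, 2, 4, 2, 0])
print(len(encoded))  # 25
```

Each state of length max_length yields max_length * n_tokens entries, so a batch of states of length 5 produces rows of 25 values each.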