I am trying to use fastai to predict RNA from DNA sequences.
My current bottleneck is getting the DataBlock API to work with my DNA sequence data.
I have a score for each sequence, and each sequence is 500 characters (ACGT) long. I am using the following function to transform each DNA sequence into a kind of image-like array:
import numpy as np

def one_hot_encode(seq):
    """
    Given a DNA sequence, return its one-hot encoding
    as an array of shape (len(seq), 4), columns ordered A, C, G, T
    """
    # Make sure seq has only allowed bases
    allowed = set("ACTGNactgn")
    if not set(seq).issubset(allowed):
        invalid = set(seq) - allowed
        raise ValueError(f"Sequence contains chars not in allowed DNA alphabet (ACGTN): {invalid}")
    # Dictionary returning one-hot encoding for each nucleotide
    nuc_d = {'A': [1.0, 0.0, 0.0, 0.0],
             'C': [0.0, 1.0, 0.0, 0.0],
             'G': [0.0, 0.0, 1.0, 0.0],
             'T': [0.0, 0.0, 0.0, 1.0],
             'N': [0.0, 0.0, 0.0, 0.0],
             }
    # Create array from nucleotide sequence
    # (upper-case first, since lower-case bases are allowed but nuc_d has only upper-case keys)
    vec = np.array([nuc_d[x] for x in seq.upper()])
    return vec
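As a quick sanity check of what the encoding produces, here is a minimal sketch. It assumes NumPy is installed; `encode` and `NUC` are hypothetical compact stand-ins for the function above, and the transpose at the end is only my guess at the layout a 1D convolutional model would want.

```python
import numpy as np

# Hypothetical compact stand-in for one_hot_encode, same A/C/G/T column order
NUC = {'A': [1.0, 0.0, 0.0, 0.0],
       'C': [0.0, 1.0, 0.0, 0.0],
       'G': [0.0, 0.0, 1.0, 0.0],
       'T': [0.0, 0.0, 0.0, 1.0],
       'N': [0.0, 0.0, 0.0, 0.0]}

def encode(seq):
    """One-hot encode a DNA string into a (len(seq), 4) float array."""
    return np.array([NUC[base] for base in seq.upper()])

vec = encode("ACGTN")
print(vec.shape)        # one row per base, one column per A/C/G/T
print(vec.sum(axis=1))  # known bases sum to 1, N sums to 0

# Assumption: a channels-first layout, i.e. (4, length), as float32
channels_first = vec.T.astype(np.float32)
print(channels_first.shape)
```

For my real data each sequence would therefore come out as a (500, 4) array, or (4, 500) after the transpose.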
However, I have no clue how to set up the DataBlock and DataLoaders for it. I would be grateful for any tips.