From what I can tell, process calls process_one. Here is where I’m seeing that:
In data_block.py, the PreProcessor class:
    class PreProcessor():
        "Basic class for a processor that will be applied to items at the end of the data block API."
        def __init__(self, ds:Collection=None): self.ref_ds = ds
        def process_one(self, item:Any): return item
        def process(self, ds:Collection): ds.items = array([self.process_one(item) for item in ds.items])
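To see the intended pattern in isolation, here is a self-contained sketch: a restated base class (using np.array in place of fastai's array helper) plus a hypothetical LowerCaseProcessor subclass and a minimal stand-in for an ItemList. None of these names beyond PreProcessor come from fastai.

```python
import numpy as np
from typing import Any, Collection

# Re-statement of the base class for illustration (fastai's version uses
# its own `array` helper; np.array is close enough here).
class PreProcessor():
    def __init__(self, ds:Collection=None): self.ref_ds = ds
    def process_one(self, item:Any): return item
    def process(self, ds:Collection):
        ds.items = np.array([self.process_one(item) for item in ds.items])

# Hypothetical subclass: only process_one is overridden, so the inherited
# process routes every item through it.
class LowerCaseProcessor(PreProcessor):
    def process_one(self, item): return item.lower()

# Minimal stand-in for an ItemList: anything with an `.items` attribute works.
class FakeItemList:
    def __init__(self, items): self.items = items

il = FakeItemList(["Hello", "World"])
proc = LowerCaseProcessor()
proc.process_one("Test")   # try the action on a single item first
proc.process(il)           # then run it over the whole list
print(il.items)            # ['hello' 'world']
```

This is the "define process_one, inherit process" usage; the subclasses below show it is not the only way.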
So this PreProcessor class is meant to be subclassed by any preprocessors being built: you would define your action in process_one, then call process after testing process_one on a single instance of whatever item contains. That's how it seems it would be used, but looking at an actual preprocessor, TokenizeProcessor, it isn't working quite like that. Here is the source for the TokenizeProcessor class:
    class TokenizeProcessor(PreProcessor):
        "`PreProcessor` that tokenizes the texts in `ds`."
        def __init__(self, ds:ItemList=None, tokenizer:Tokenizer=None, chunksize:int=10000,
                     mark_fields:bool=False, include_bos:bool=True, include_eos:bool=False):
            self.tokenizer,self.chunksize,self.mark_fields = ifnone(tokenizer, Tokenizer()),chunksize,mark_fields
            self.include_bos, self.include_eos = include_bos, include_eos
        def process_one(self, item):
            return self.tokenizer._process_all_1(_join_texts([item], self.mark_fields, self.include_bos, self.include_eos))[0]
        def process(self, ds):
            ds.items = _join_texts(ds.items, self.mark_fields, self.include_bos, self.include_eos)
            tokens = []
            for i in progress_bar(range(0,len(ds),self.chunksize), leave=False):
                tokens += self.tokenizer.process_all(ds.items[i:i+self.chunksize])
            ds.items = tokens
So it seems like there are a few ways to use those classes. Here is another text preprocessor, NumericalizeProcessor:
    class NumericalizeProcessor(PreProcessor):
        "`PreProcessor` that numericalizes the tokens in `ds`."
        def __init__(self, ds:ItemList=None, vocab:Vocab=None, max_vocab:int=60000, min_freq:int=3):
            vocab = ifnone(vocab, ds.vocab if ds is not None else None)
            self.vocab,self.max_vocab,self.min_freq = vocab,max_vocab,min_freq
        def process_one(self,item): return np.array(self.vocab.numericalize(item), dtype=np.int64)
        def process(self, ds):
            if self.vocab is None: self.vocab = Vocab.create(ds.items, self.max_vocab, self.min_freq)
            ds.vocab = self.vocab
            super().process(ds)
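This one shows a third pattern: process does some whole-dataset setup (building the vocab), then hands the per-item work back to the base class. A self-contained sketch, with a hypothetical FakeVocab standing in for fastai's Vocab:

```python
import numpy as np

class PreProcessor():
    def process_one(self, item): return item
    def process(self, ds):
        ds.items = np.array([self.process_one(i) for i in ds.items])

# Hypothetical vocab with a numericalize method, standing in for fastai's Vocab.
class FakeVocab:
    def __init__(self, itos): self.stoi = {s:i for i,s in enumerate(itos)}
    def numericalize(self, tokens): return [self.stoi[t] for t in tokens]

class Numericalize(PreProcessor):
    def __init__(self, vocab): self.vocab = vocab
    def process_one(self, item):
        return np.array(self.vocab.numericalize(item), dtype=np.int64)
    def process(self, ds):
        # Whole-dataset setup would happen here (e.g. building the vocab);
        # then the per-item work is delegated back to the base class,
        # which calls process_one on each item.
        super().process(ds)

class FakeItemList:
    def __init__(self, items): self.items = items

ds = FakeItemList([["a","b"], ["b","a"]])
Numericalize(FakeVocab(["a","b"])).process(ds)
print(ds.items.tolist())   # [[0, 1], [1, 0]]
```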
So TokenizeProcessor never routes items through process_one at all: its process bypasses it and feeds chunks of ds.items straight to tokenizer.process_all. NumericalizeProcessor, by contrast, does use process_one, but only indirectly: its process first builds the vocab if needed, then calls super().process(ds), which applies process_one to each item just like the base class. Interesting to look through that, and hopefully this may be helpful to somebody in the future (maybe even myself; that happens every once in a while).