For the Data Block API's "PreProcessor" class, when is "process_one" called and when is "process" called?

I'm building my own custom ItemBase and ItemList classes and trying to figure out what triggers a call to "process_one" or "process" in any PreProcessor subclasses defined in my ItemList.

I see that building a LabelList at least triggers process (e.g., ll = ils.label_from_df(dep_var) causes the preprocessors to run over the entire dataset/ItemList), but I'm not sure where else, or how else, process can be called.

So, where do process and process_one get called from, and what are the legit ways they can be used?
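For reference, here's the kind of pipeline I mean, assuming a recent fastai v1 (df, 'my_col', and dep_var are placeholders for my own DataFrame and columns):

from fastai.data_block import ItemList

ils = ItemList.from_df(df, path='.', cols='my_col')  # no preprocessing happens here
ils = ils.split_by_rand_pct(0.2)                     # nor here
ll = ils.label_from_df(dep_var)                      # <-- here the processors run over the whole list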

From what I can tell, process calls process_one. Here is where I’m seeing that:

In data_block.py, the PreProcessor class:

class PreProcessor():
    "Basic class for a processor that will be applied to items at the end of the data block API."
    def __init__(self, ds:Collection=None):  self.ref_ds = ds
    def process_one(self, item:Any):         return item
    def process(self, ds:Collection):        ds.items = array([self.process_one(item) for item in ds.items])
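
To make that concrete, here's a minimal, self-contained sketch of that relationship. LowercaseProcessor and FakeItemList are my own stand-in names, not fastai classes, and I've copied the base class in so the snippet runs without fastai:

import numpy as np

class PreProcessor():
    "Copy of the base class above so this runs standalone."
    def __init__(self, ds=None): self.ref_ds = ds
    def process_one(self, item): return item
    def process(self, ds): ds.items = np.array([self.process_one(item) for item in ds.items])

class LowercaseProcessor(PreProcessor):
    "Hypothetical processor: lowercases each text item."
    def process_one(self, item): return item.lower()

class FakeItemList():
    "Stand-in for an ItemList: just something with an `items` attribute."
    def __init__(self, items): self.items = items

il = FakeItemList(['Hello World', 'FOO bar'])
proc = LowercaseProcessor()
proc.process_one(il.items[0])   # test on one item first -> 'hello world'
proc.process(il)                # then run it over the whole list
print(il.items)                 # ['hello world' 'foo bar']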

So this PreProcessor class is meant to be subclassed by any preprocessors being built: you would probably define your action in process_one, test process_one on a single instance of whatever items contains (as in the sketch above), and then call process to run it over everything. That's how it seems like it would be used, but looking at an actual preprocessor, TokenizeProcessor, it isn't working quite like that. Here is the source for the TokenizeProcessor class:

class TokenizeProcessor(PreProcessor):
    "`PreProcessor` that tokenizes the texts in `ds`."
    def __init__(self, ds:ItemList=None, tokenizer:Tokenizer=None, chunksize:int=10000, 
                 mark_fields:bool=False, include_bos:bool=True, include_eos:bool=False):
        self.tokenizer,self.chunksize,self.mark_fields = ifnone(tokenizer, Tokenizer()),chunksize,mark_fields
        self.include_bos, self.include_eos = include_bos, include_eos

    def process_one(self, item):
        return self.tokenizer._process_all_1(_join_texts([item], self.mark_fields, self.include_bos, self.include_eos))[0]

    def process(self, ds):
        ds.items = _join_texts(ds.items, self.mark_fields, self.include_bos, self.include_eos)
        tokens = []
        for i in progress_bar(range(0,len(ds),self.chunksize), leave=False):
            tokens += self.tokenizer.process_all(ds.items[i:i+self.chunksize])
        ds.items = tokens
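
Note that this process never calls process_one at all: it joins and tokenizes the whole dataset in chunks so the tokenizer's process_all can work in bulk (and, from what I can tell, in parallel), which is much faster than going item by item. The same pattern in miniature, reusing the stand-in PreProcessor from the earlier snippet (WhitespaceTokenizeProcessor is a made-up name):

class WhitespaceTokenizeProcessor(PreProcessor):
    "Hypothetical: tokenize on whitespace."
    def process_one(self, item):
        # one raw text in, one list of tokens out (the single-item path)
        return item.split()
    def process(self, ds):
        # deliberately does NOT loop over process_one: the whole dataset is
        # handled in one pass, which is where fastai's version adds chunking
        # and multiprocessing
        ds.items = [item.split() for item in ds.items]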

So it seems like there are a few ways to use those classes. Here is another text preprocessor, NumericalizeProcessor:

class NumericalizeProcessor(PreProcessor):
    "`PreProcessor` that numericalizes the tokens in `ds`."
    def __init__(self, ds:ItemList=None, vocab:Vocab=None, max_vocab:int=60000, min_freq:int=3):
        vocab = ifnone(vocab, ds.vocab if ds is not None else None)
        self.vocab,self.max_vocab,self.min_freq = vocab,max_vocab,min_freq

    def process_one(self,item): return np.array(self.vocab.numericalize(item), dtype=np.int64)
    def process(self, ds):
        if self.vocab is None: self.vocab = Vocab.create(ds.items, self.max_vocab, self.min_freq)
        ds.vocab = self.vocab
        super().process(ds)
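
This one shows a third pattern: process first computes state (the vocab) from the whole dataset, then hands the per-item work back to the base class's loop over process_one via super().process(ds). In miniature, again reusing the stand-in PreProcessor (VocabCountProcessor is a made-up name):

class VocabCountProcessor(PreProcessor):
    "Hypothetical: map token lists to integer ids, building the mapping from the full dataset first."
    def __init__(self, ds=None, stoi=None): self.stoi = stoi
    def process_one(self, item):
        return [self.stoi.get(tok, 0) for tok in item]
    def process(self, ds):
        if self.stoi is None:
            # state comes from the *whole* dataset...
            vocab = sorted({tok for item in ds.items for tok in item})
            self.stoi = {tok: i + 1 for i, tok in enumerate(vocab)}
        # ...then the base class loop applies process_one item by item
        # (with the numpy stand-in base, ragged id lists may need dtype=object)
        super().process(ds)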

So TokenizeProcessor never passes items into process_one from process at all, while NumericalizeProcessor does eventually fall back to the base class's loop over process_one once the vocab exists. Interesting to look through, and hopefully this will be helpful to somebody in the future (maybe even myself, that happens every once in a while :slight_smile: )
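
As for the original question of where these get called from: from my reading of fastai v1's data_block.py (worth double-checking against your version), process runs once per dataset when the labelled LabelLists are built (which is why label_from_df triggers it), while process_one is used at inference time on a single item; learn.predict goes through LabelList.set_item, which calls x.process_one. Both are dispatched through the ItemList, roughly like this (paraphrased from memory, with _listify standing in for fastai's listify so the sketch is self-contained):

def _listify(p):
    "Stand-in for fastai's listify."
    if p is None: return []
    return p if isinstance(p, (list, tuple)) else [p]

class SketchItemList():
    "Paraphrase of how fastai v1's ItemList dispatches to its processors."
    def __init__(self, items, processor=None):
        self.items, self.processor = items, processor

    def process(self, processor=None):
        # the whole-dataset path: what building the LabelList triggers
        if processor is not None: self.processor = processor
        for p in _listify(self.processor): p.process(self)
        return self

    def process_one(self, item, processor=None):
        # the single-item path: used at inference (e.g., learn.predict)
        if processor is not None: self.processor = processor
        for p in _listify(self.processor): item = p.process_one(item)
        return item

The practical upshot: even if you override process for speed, keep process_one correct for a single raw item, or single-item prediction will break.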
