How to turn off *ALL* threading / multi-processing?

Is there a way to make sure no threading or multi-processing occurs in fastai methods? My VSCode debugger often drops its connection to a running program that uses multi-processing. For this reason, I’d like a way to turn it off everywhere when desired.

In DataLoaders, I know you can set num_workers=0, which does the trick.

Other functions continue to use multi-processing, though, e.g. tokenize_df [0], which is used when doing setups on a TextDataLoader. Here, the num_workers argument does not flow through to this function, and it spawns a number of child processes equal to the number of CPUs on your machine.

This behavior gets set in fastcore at load time so I’m not seeing an easy way to override it.
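The “load time” problem is Python’s default-argument evaluation: a signature like `n_workers=defaults.cpus` captures the value when the module is imported, so reassigning `defaults.cpus` afterwards doesn’t change defaults that are already bound. A minimal stdlib illustration (the `defaults` namespace and `tokenize` function here are stand-ins for fastcore/fastai, not their real objects):

```python
import os
from types import SimpleNamespace

# stand-in for fastcore's module-level defaults namespace
defaults = SimpleNamespace(cpus=min(16, os.cpu_count()))

def tokenize(items, n_workers=defaults.cpus):  # default captured at def time
    return n_workers

defaults.cpus = 0              # a later override does NOT change the bound default
print(tokenize([]))            # still the original CPU count
print(tokenize([], n_workers=defaults.cpus))  # 0 only when passed explicitly
```

This is why setting things before importing fastai (as in the affinity trick below the fold) works, while overriding after import does not.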

Am I missing an easy trick someone else has discovered?

[0]:

```python
import os
os.sched_setaffinity(0, (0,))
from fastai import ...
```

I’ve found that starting your .py with this (before importing any fastai methods) gets n_workers=1.
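This works because fastcore counts CPUs via the process’s scheduling affinity where the platform supports it, so pinning the process to one core makes that count 1. A sketch of the mechanism (the `num_cpus` helper below mirrors fastcore’s approach; pinning to the first *allowed* core rather than core 0 avoids errors in CPU-restricted containers):

```python
import os

def num_cpus():
    # fastcore-style CPU count: respects process affinity where supported (Linux)
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:  # platforms without sched_getaffinity (e.g. macOS)
        return os.cpu_count()

# pin this process to the first core it is allowed to run on (Linux only)
os.sched_setaffinity(0, {min(os.sched_getaffinity(0))})
print(num_cpus())  # 1, so fastai now defaults to n_workers=1
```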

Good: this makes VSCode’s debugger much more reliable. (I think because the debugger now has a CPU core to itself to monitor the process being debugged?)

Bad: there’s still no way to get n_workers=0, so we’re still spawning one child process, and some code will follow different execution paths, e.g. fastcore.utils.parallel_gen().
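The “different execution paths” come from the n_workers branch: fastcore-style parallel helpers run a plain in-process loop when n_workers is 0 and only spawn a worker pool otherwise. A minimal sketch of that pattern (`parallel_map` is my own illustrative name, not fastcore’s API):

```python
from concurrent.futures import ProcessPoolExecutor

def parallel_map(f, items, n_workers=0):
    if n_workers == 0:
        # sequential path: no child processes, debugger-friendly
        return [f(x) for x in items]
    # parallel path: spawns up to n_workers child processes
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        return list(ex.map(f, items))

print(parallel_map(abs, [-3, 1, -2], n_workers=0))  # [3, 1, 2], no forking
```

With n_workers=1 you get the pool path (one child process) rather than the truly sequential one, which is why the affinity trick alone can’t fully flatten execution.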


I solved this problem by vendoring a modified version of fastai with a rewritten copy of tokenize_df that doesn’t use parallel_tokenize. So this is not an “easy trick”, but it worked in my situation; I am now able to use fastai from within a Celery worker process.

My modified version of tokenize_df looks like this:

```python
# Imports assume fastai v2's module layout; adjust to match your vendored copy.
from collections import Counter
from fastcore.basics import ifnone, defaults
from fastcore.foundation import L
from fastai.text.core import WordTokenizer, TokenizeWithRules, _join_texts

def tokenize_df(
    df,
    text_cols,
    n_workers=defaults.cpus,  # kept for signature compatibility; unused here
    rules=None,
    mark_fields=None,
    tok=None,
    tok_text_col="text",
):
    "Tokenize texts in `df[text_cols]` serially (no child processes) and store them in `df[tok_text_col]`"
    text_cols = [df.columns[c] if isinstance(c, int) else c for c in L(text_cols)]
    # mark_fields defaults to False if there is one column of texts, True if there are multiple
    if mark_fields is None:
        mark_fields = len(text_cols) > 1
    rules = L(ifnone(rules, defaults.text_proc_rules.copy()))
    texts = _join_texts(df[text_cols], mark_fields=mark_fields)
    if tok is None:
        tok = WordTokenizer()
    if hasattr(tok, "setup"):
        tok.setup(texts, rules)
    outputs = L(TokenizeWithRules(tok=tok, rules=rules)(texts))

    other_cols = df.columns[~df.columns.isin(text_cols)]
    res = df[other_cols].copy()
    res[tok_text_col] = outputs
    res[f"{tok_text_col}_length"] = [len(o) for o in outputs]
    return res, Counter(outputs.concat())
```