Speeding up text interpretation

hafidz · August 26, 2019, 7:52am

Currently, each of my predictions, which also retrieves the attention scores from the text classifier (via TextClassificationInterpretation), takes about 20s per call on average.

Has anyone else managed to get this to run faster?

For context, I need each call to return the predicted label, its probability, and the attention scores. I don’t think 20s per prediction is acceptable for production use, so I’m wondering what my options are.

Thanks in advance.

krash · September 3, 2019, 11:11am

Text interpretation need not be done for every prediction. Could you share the code snippet? That might help.
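
One way to read this suggestion is to run the prediction on every request and compute attention only when an explanation is actually asked for. A minimal sketch of that idea follows; `classify`, `predict_fn` and `attention_fn` are hypothetical names, standing in for whatever wraps the learner's prediction call and interp.intrinsic_attention in your own setup.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def classify(text: str,
             predict_fn: Callable[[str], Tuple[str, float]],
             attention_fn: Callable[[str], Sequence[float]],
             explain: bool = False) -> Tuple[str, float, Optional[List[float]]]:
    """Return (label, probability, attention scores or None).

    predict_fn and attention_fn are placeholders (assumptions), not fastai APIs.
    The attention step only runs when explain=True.
    """
    label, prob = predict_fn(text)
    attn = list(attention_fn(text)) if explain else None
    return label, prob, attn
```

Whether this helps depends on how often the attention scores are really needed; if every response must include them, it only moves the cost around.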

@hafidz How did you obtain the text attention scores and their corresponding mapping? I am getting an index-out-of-bounds error when I compare the output of interp.intrinsic_attention():

txt = " Valar Morghulis - All men must serve"
x = interp.intrinsic_attention(txt)

This returns a tuple of the processed text and a tensor of attention scores:

(Text xxbos xxmaj xxunk xxmaj xxunk - xxmaj all men must serve, tensor([0.5078, 0.2553, 0.4090, 0.0895, 0.3499, 0.2155, 0.1322, 0.4119, 0.6209, 0.8082, 1.0000], device='cuda:0'))

x[0] is a Text object and x[1] is a tensor. Let’s say I want to get the top 5 attention words.
How do we tokenize the Text object?

At first I thought I would use spaCy’s tokenizer, but there was a length mismatch between the words and their scores. After going through the fastai library, I found that show_intrinsic_attention uses split() to tokenize, but even this leads to a length mismatch for me:
show_piece_attn(text.text.split(), to_np(attn), **kwargs)
I am not sure how to proceed.
Can anyone help me?

Thanks!
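
For the output shown above, a whitespace split of the processed text does line up with the scores (11 tokens, 11 scores), whereas the raw input string only has 7 whitespace-separated words. So one possible cause of the mismatch is tokenizing the raw `txt` instead of the returned text. A small check, using values copied from the post; `align` is a hypothetical helper:

```python
# Values copied from the output above. In practice they would come from
# x[0].text and x[1].tolist() (assumption: x is the tuple returned above).
raw_txt = " Valar Morghulis - All men must serve"
processed = "xxbos xxmaj xxunk xxmaj xxunk - xxmaj all men must serve"
scores = [0.5078, 0.2553, 0.4090, 0.0895, 0.3499, 0.2155,
          0.1322, 0.4119, 0.6209, 0.8082, 1.0000]

tokens = processed.split()  # whitespace split of the processed text


def align(tokens, scores):
    """Pair each token with its score, failing loudly on a length mismatch."""
    if len(tokens) != len(scores):
        raise ValueError(f"{len(tokens)} tokens vs {len(scores)} scores")
    return list(zip(tokens, scores))


pairs = align(tokens, scores)
print(len(raw_txt.split()), len(tokens), len(scores))  # 7 11 11
```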

hafidz · September 11, 2019, 6:54am

Hi @krash. Here’s an example. In my case I’ve removed the special tokens, since I don’t really care about mapping them.

from fastai.text import *  # fastai v1

# `model` is assumed to be the trained text classifier Learner,
# `input_object` the text to interpret
interp = TextClassificationInterpretation.from_learner(model)
int_att = interp.intrinsic_attention(input_object)
rev_str = int_att[0].text.split(' ')     # processed tokens, special tokens included
scores = [i.item() for i in int_att[1]]  # one attention score per token

# remove special tokens and their weights
special_token = ['xxunk', 'xxpad', 'xxbos', 'xxeos',
                 'xxfld', 'xxmaj', 'xxup', 'xxrep', 'xxwrep']
paired_scores = [[tok, score] for tok, score in zip(rev_str, scores)
                 if tok not in special_token]

# Example output:
# [
#     ['really', 0.4153802692890167],
#     ['awesome', 0.09405671805143356],
# ]
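
To get the top 5 attention words asked about earlier in the thread, the filtered pairs can be ranked by score. A rough sketch, assuming the `[token, score]` layout above; `top_k_tokens` is a hypothetical helper and the sample pairs are the non-special tokens from the "Valar Morghulis" example:

```python
import heapq


def top_k_tokens(paired_scores, k=5):
    """Return the k [token, score] pairs with the highest scores, best first."""
    return heapq.nlargest(k, paired_scores, key=lambda pair: pair[1])


# Non-special tokens and scores taken from the example output in the thread
pairs = [['-', 0.2155], ['all', 0.4119], ['men', 0.6209],
         ['must', 0.8082], ['serve', 1.0]]

print(top_k_tokens(pairs, k=3))
# [['serve', 1.0], ['must', 0.8082], ['men', 0.6209]]
```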