Hello everyone!
I am trying to run the lda2vec algorithm on my own data, based on the code from lda2vec. I have a CSV file with 3 columns (idTweet, textTweet, dateTweet), and when I run the preprocess.py module I get this error:
n_words 21
[ 0 1 2 … 4324 4326 4325]
n_stories 4327
Traceback (most recent call last):
File "preprocess.py", line 74, in <module>
flattened, features_flat = corpus.compact_to_flat(pruned,*feature_arrs)
File "/usr/local/lib/python3.6/dist-packages/lda2vec-0.1-py3.6.egg/lda2vec/corpus.py", line 422, in compact_to_flat
IndexError: invalid index to scalar variable.
How can I fix this error?
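For reference, the CSV has the three columns mentioned above; the rows below are made-up placeholders just to show the shape, not my real data:

idTweet,textTweet,dateTweet
1001,"this is an example tweet",2021-03-15
1002,"another example tweet",2021-04-02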
Here is my code:
from lda2vec import preprocess, Corpus
import numpy as np
import pandas as pd
import logging
import pickle
import os.path
import spacy
from spacy.attrs import LOWER
logging.basicConfig()
max_length = 250 # Limit of 250 words per comment
min_author_comments = 50 # Exclude authors with fewer comments
nrows = None # Number of rows of file to read; None reads in full file
nlp = spacy.load("en_core_web_sm")
#fn = "/content/drive/MyDrive/lda2vec-master/examples/hacker_news/data/mars_avril.csv"
features = []
# Convert to unicode (spaCy only works with unicode)
features = pd.read_csv("/content/drive/MyDrive/lda2vec-master/examples/hacker_news/data/mars_avril.csv", encoding='utf8', nrows=nrows)
# Convert all integer arrays to int32
for col, dtype in zip(features.columns, features.dtypes):
if dtype is np.dtype('int64'):
features[col] = features[col].astype('int32')
# Tokenize the texts
# If this fails it's likely spacy. Install a recent spacy version.
# Only the most recent versions have tokenization of noun phrases
# I'm using SHA dfd1a1d3a24b4ef5904975268c1bbb13ae1a32ff
# Also try running python -m spacy.en.download all --force
texts = features.pop('textTweet').values
tokens, vocab = preprocess.tokenize(str(texts), max_length, skip=-2, attr=LOWER, merge=False, nlp=nlp)
del texts
# Make a ranked list of rare vs frequent words
corpus = Corpus()
corpus.update_word_count(tokens)
corpus.finalize()
# The tokenization uses spaCy indices, and so may have gaps
# between indices for words that aren't present in our dataset.
# This builds a new compact index
compact = corpus.to_compact(tokens)
# Remove extremely rare words
pruned = corpus.filter_count(compact, min_count=10)
# Words tend to have power law frequency, so selectively
# downsample the most prevalent words
clean = corpus.subsample_frequent(pruned)
print("n_words", np.unique(clean).max())
# Extract numpy arrays over the fields we want covered by topics
# Convert to categorical variables
#author_counts = features['dateTweet'].value_counts()
#to_remove = author_counts[author_counts < min_author_comments].index
#mask = features['dateTweet'].isin(to_remove).values
#author_name = features['dateTweet'].values.copy()
#author_name[mask] = 'infrequent_author'
#features['dateTweet'] = author_name
#authors = pd.Categorical(features['dateTweet'])
#author_id = authors.codes
#author_name = authors.categories
story_id = pd.Categorical(features['idTweet']).codes
# Chop timestamps into days
print(story_id)
#story_time = pd.to_datetime(features['dateTweet'], unit='s')
#days_since = (story_time - story_time.min()) / pd.Timedelta('1 day')
#time_id = days_since.astype('int32')
features['story_id_codes'] = story_id
#features['author_id_codes'] = story_id
#features['time_id_codes'] = time_id
#print("n_authors", author_id.max())
print("n_stories", story_id.max())
#print("n_times", time_id.max())
# Extract outcome supervised features
#ranking = features['comment_ranking'].values
#score = features['story_comment_count'].values
# Now flatten a 2D array of document per row and word position
# per column to a 1D array of words. This will also remove skips
# and OoV words
feature_arrs = (story_id)
flattened, features_flat = corpus.compact_to_flat(pruned,*feature_arrs)
# Flattened feature arrays
story_id_f = features_flat
# Save the data
pickle.dump(corpus, open('corpus', 'wb'), protocol=2)
pickle.dump(vocab, open('vocab', 'wb'), protocol=2)
features.to_pickle('features.pd')
data = dict(flattened=flattened, story_id=story_id_f)
np.savez('data', **data)
np.save(open('tokens', 'wb'), tokens)
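In case it is relevant, and with the caveat that I do not know the internals of corpus.compact_to_flat: I noticed that feature_arrs = (story_id) is not a one-element tuple, it is just the array itself, so *feature_arrs unpacks it element by element. Here is a minimal sketch of the difference (the small array below is made up):

import numpy as np

story_id = np.array([0, 1, 2], dtype=np.int32)

not_a_tuple = (story_id)   # parentheses alone: this is still just the ndarray
one_tuple = (story_id,)    # trailing comma: a one-element tuple holding the array

print(type(not_a_tuple))   # <class 'numpy.ndarray'>
print(type(one_tuple))     # <class 'tuple'>

# f(*not_a_tuple) passes each element as a separate numpy scalar argument,
# and indexing a numpy scalar gives the same message as my traceback:
try:
    np.int32(5)[0]
except IndexError as e:
    print(e)               # invalid index to scalar variable.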