Hello everyone!
I am trying to run the lda2vec algorithm on my own data, based on the code from lda2vec. I have a CSV file with 3 columns (idTweet, textTweet, dateTweet), and when I run the preprocess.py module I get this error:
n_words 21
[ 0 1 2 … 4324 4326 4325]
n_stories 4327
Traceback (most recent call last):
File "preprocess.py", line 74, in <module>
flattened, features_flat = corpus.compact_to_flat(pruned,*feature_arrs)
File "/usr/local/lib/python3.6/dist-packages/lda2vec-0.1-py3.6.egg/lda2vec/corpus.py", line 422, in compact_to_flat
IndexError: invalid index to scalar variable.
How can I fix this error?
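For reference, the CSV has the three columns mentioned above; the rows below are made-up placeholders just to show the shape, not my real data:

idTweet,textTweet,dateTweet
1001,"this is an example tweet",2021-03-15
1002,"another example tweet",2021-04-02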
Here is my code:
from lda2vec import preprocess, Corpus
import numpy as np
import pandas as pd
import logging
import pickle
import os.path
import spacy
from spacy.attrs import LOWER
logging.basicConfig()
max_length = 250 # Limit of 250 words per comment
min_author_comments = 50 # Exclude authors with fewer comments
nrows = None # Number of rows of file to read; None reads in full file
nlp = spacy.load("en_core_web_sm")
#fn = "/content/drive/MyDrive/lda2vec-master/examples/hacker_news/data/mars_avril.csv"
features = []
# Convert to unicode (spaCy only works with unicode)
features = pd.read_csv("/content/drive/MyDrive/lda2vec-master/examples/hacker_news/data/mars_avril.csv", encoding='utf8', nrows=nrows)
# Convert all integer arrays to int32
for col, dtype in zip(features.columns, features.dtypes):
if dtype is np.dtype('int64'):
features[col] = features[col].astype('int32')
# Tokenize the texts
# If this fails it's likely spacy. Install a recent spacy version.
# Only the most recent versions have tokenization of noun phrases
# I'm using SHA dfd1a1d3a24b4ef5904975268c1bbb13ae1a32ff
# Also try running python -m spacy.en.download all --force
texts = features.pop('textTweet').values
tokens, vocab = preprocess.tokenize(str(texts), max_length, skip=-2, attr=LOWER, merge=False, nlp=nlp)
del texts
# Make a ranked list of rare vs frequent words
corpus = Corpus()
corpus.update_word_count(tokens)
corpus.finalize()
# The tokenization uses spaCy indices, and so may have gaps
# between indices for words that aren't present in our dataset.
# This builds a new compact index
compact = corpus.to_compact(tokens)
# Remove extremely rare words
pruned = corpus.filter_count(compact, min_count=10)
# Words tend to have power law frequency, so selectively
# downsample the most prevalent words
clean = corpus.subsample_frequent(pruned)
print("n_words", np.unique(clean).max())
# Extract numpy arrays over the fields we want covered by topics
# Convert to categorical variables
#author_counts = features['dateTweet'].value_counts()
#to_remove = author_counts[author_counts < min_author_comments].index
#mask = features['dateTweet'].isin(to_remove).values
#author_name = features['dateTweet'].values.copy()
#author_name[mask] = 'infrequent_author'
#features['dateTweet'] = author_name
#authors = pd.Categorical(features['dateTweet'])
#author_id = authors.codes
#author_name = authors.categories
story_id = pd.Categorical(features['idTweet']).codes
# Chop timestamps into days
print(story_id)
#story_time = pd.to_datetime(features['dateTweet'], unit='s')
#days_since = (story_time - story_time.min()) / pd.Timedelta('1 day')
#time_id = days_since.astype('int32')
features['story_id_codes'] = story_id
#features['author_id_codes'] = story_id
#features['time_id_codes'] = time_id
#print("n_authors", author_id.max())
print("n_stories", story_id.max())
#print("n_times", time_id.max())
# Extract outcome supervised features
#ranking = features['comment_ranking'].values
#score = features['story_comment_count'].values
# Now flatten a 2D array of document per row and word position
# per column to a 1D array of words. This will also remove skips
# and OoV words
feature_arrs = (story_id)
flattened, features_flat = corpus.compact_to_flat(pruned,*feature_arrs)
# Flattened feature arrays
story_id_f = features_flat
# Save the data
pickle.dump(corpus, open('corpus', 'wb'), protocol=2)
pickle.dump(vocab, open('vocab', 'wb'), protocol=2)
features.to_pickle('features.pd')
data = dict(flattened=flattened, story_id=story_id_f)
np.savez('data', **data)
np.save(open('tokens', 'wb'), tokens)
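In case it is relevant, and with the caveat that I do not know the internals of corpus.compact_to_flat: I noticed that feature_arrs = (story_id) is not a one-element tuple, it is just the array itself, so *feature_arrs unpacks it element by element. Here is a minimal sketch of the difference (the small array below is made up):

import numpy as np

story_id = np.array([0, 1, 2], dtype=np.int32)

not_a_tuple = (story_id)   # parentheses alone: this is still just the ndarray
one_tuple = (story_id,)    # trailing comma: a one-element tuple holding the array

print(type(not_a_tuple))   # <class 'numpy.ndarray'>
print(type(one_tuple))     # <class 'tuple'>

# f(*not_a_tuple) passes each element as a separate numpy scalar argument,
# and indexing a numpy scalar gives the same message as my traceback:
try:
    np.int32(5)[0]
except IndexError as e:
    print(e)               # invalid index to scalar variable.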