I’m generally following the Kaggle notebook “Iterate like a grandmaster!” to build a model where we input a phrase and it returns the song(s) of a specific artist with the same general meaning/context.
I’ve found a CSV dataset with song name and lyric columns; the lyric column contains the full lyrics of each song.
At one point I’m doing:
inps = "Unnamed: 0","Artist","Album", "Year", "Date"
tok_ds = ds.map(tok_func, batched=True, remove_columns=inps+('inputs'))
And I’m getting the error:
TypeError: can only concatenate tuple (not "str") to tuple
Do you think I should do any kind of preprocessing on that big lyrics column?
What can I do about this error?
Thanks and I appreciate any comments.
Hello,
I see where you’re heading! It looks like there’s an issue with how the input columns are specified.
To address the error TypeError: can only concatenate tuple (not "str") to tuple, you need to make sure that you’re concatenating tuples correctly. The error comes from trying to concatenate a string ('inputs') to a tuple (inps). To fix this, you can convert 'inputs' into a tuple before concatenating.
Here’s how you can modify your code:
inps = ("Unnamed: 0","Artist","Album", "Year", "Date")
tok_ds = ds.map(tok_func, batched=True, remove_columns=inps + ('inputs',))
Notice how I wrapped 'inputs' in parentheses with a trailing comma to make it a one-element tuple: ('inputs',). This way, the concatenation will work correctly.
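If you’d rather avoid the tuple arithmetic altogether, as far as I know the remove_columns argument in Hugging Face Datasets also accepts a plain list of column names, so a list works just as well. A minimal sketch, reusing the ds and tok_func from your code:

# Sketch: pass remove_columns as a list instead of a tuple
inps = ["Unnamed: 0", "Artist", "Album", "Year", "Date"]
tok_ds = ds.map(tok_func, batched=True, remove_columns=inps + ["inputs"])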
Regarding preprocessing the lyrics column, it’s often helpful to clean and preprocess text data before using it in your model. You can consider the following preprocessing steps:
- Lowercasing: Convert all text to lowercase.
- Removing punctuation: Remove or replace punctuation marks.
- Tokenization: Split text into individual words or tokens.
- Stopwords removal: Remove common words that may not be useful for your analysis (e.g., “and”, “the”).
- Lemmatization: Convert words to their base form (e.g., “running” to “run”).
Here’s a simple example using Python’s nltk library for text preprocessing:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
def preprocess_lyrics(lyrics):
    # Lowercase
    lyrics = lyrics.lower()
    # Remove punctuation
    lyrics = lyrics.translate(str.maketrans("", "", string.punctuation))
    # Tokenize
    tokens = word_tokenize(lyrics)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return " ".join(tokens)
You can then apply this preprocessing function to your lyrics column.
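For example, if your dataset is a Hugging Face Dataset and the lyrics column is called "Lyric" (that column name is just an assumption, so adjust it to whatever your CSV actually uses), you could clean every row before tokenizing:

# Sketch: clean the lyrics column before running tok_func
# "Lyric" is an assumed column name - replace it with your actual one
ds = ds.map(lambda row: {"Lyric": preprocess_lyrics(row["Lyric"])})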
I hope this helps!
Best regards,
Dora