How to organize data between pandas, numpy, and torch?

gai · March 1, 2019, 9:16am

Lately I’ve been trying to build NNs using just PyTorch, and I often run into incompatible data types, e.g. a pandas Series and a torch tensor that cannot be multiplied together. I looked for documentation on how best to handle this but haven’t found anything so far. Does anyone know of guidelines on how to organize data between these three libraries, what to avoid, and why? Thanks!

gai · March 1, 2019, 9:34am

Examples:

import numpy as np
import pandas as pd
import torch

# Mixing pandas and numpy is ok
df = pd.DataFrame({"a": [1]}); df.a * np.array([2])
# -> 0    2  Name: a, dtype: int64

# Multiplying that result by a torch tensor fails
df = pd.DataFrame({"a": [1]}); df.a * np.array([2]) * torch.Tensor([1])
# TypeError: mul(): argument 'other' (position 1) must be Tensor, not numpy.ndarray

# This works, although the previous error message suggests it shouldn't
# (max() returns a single numpy scalar here, not an array)
df = pd.DataFrame({"a": [1]}); max(df.a * np.array([2])) * torch.Tensor([1])
# -> tensor([2.])

Pomo · March 2, 2019, 6:06am

Hi Sven,

I don’t think there is a definitive guideline to be found. Each coder develops their own practices through experience, experimentation, and studying code examples.

For myself (a beginner), I split things up roughly like this:

Pandas: table lookups, CSV import and export, data summaries.
numpy: rich math functions, powerful array indexing and operations, and it works directly with matplotlib for plotting.
PyTorch: fast and specialized for machine learning, with a more limited set of math operations. You generally have to explicitly convert data into and out of PyTorch tensors. Also, inside the NN itself, operations must be done in PyTorch for automatic gradient calculation to work.
Python: base types, list comprehensions, control structures. I prefer to convert data out of Python lists and tuples as soon as possible, because array-style work is difficult with them.
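
To illustrate the point about gradients, here is a minimal sketch (the variable names are made up for this example). It shows that autograd only tracks operations done on tensors, and that a value computed in numpy comes back without any gradient history. Note that calling .numpy() directly on a tensor that requires grad raises an error, which is why detach() is needed first.

```python
import torch

w = torch.tensor([2.0, 3.0], requires_grad=True)  # hypothetical "weights"
x = torch.tensor([1.0, 4.0])                      # hypothetical input

# Staying inside PyTorch: autograd records every operation.
loss = (w * x).sum()
loss.backward()
grad = w.grad.clone()  # d(loss)/dw equals x

# Leaving PyTorch: detach() cuts the graph before converting to numpy.
w_np = w.detach().numpy()
loss_np = (w_np * x.numpy()).sum()  # plain numpy scalar, no gradient attached

# Converting back gives a tensor with no connection to w.
back = torch.tensor(loss_np)
tracks_grad = back.requires_grad

print(grad, tracks_grad)
```

So any step that feeds into the loss should stay in torch; numpy and pandas are better kept for the data preparation before and the analysis after.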

Each of these libraries has explicit conversions to the other types, and many functions will automatically accept anything “array-like” from a related library. You have to experiment or read the docs to find out exactly what works.
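
As a small illustration of that implicit "array-like" handling, here are two cases that I believe work without any explicit conversion (a sketch with made-up values, and the tensor is assumed to live on the CPU and not require grad):

```python
import numpy as np
import pandas as pd
import torch

s = pd.Series([1.0, 4.0, 9.0])
t = torch.tensor([1.0, 4.0, 9.0])

root = np.sqrt(s)       # numpy ufunc applied to a pandas Series
as_arr = np.asarray(t)  # numpy treating a CPU tensor as array-like

print(type(root))    # still a pandas Series
print(type(as_arr))  # a numpy ndarray
```

The reverse direction is less forgiving, which matches the error in the question: torch operations generally want real tensors.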

I’d suggest learning all the to/from conversion functions for each area, and using the Python type() function often to discover exactly what goes on inside any expression.
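
For example, here is a rough sketch of the round trip applied to the failing expression from the question. The choice of float32 is my assumption (it is the default dtype for PyTorch layers), not something the question requires.

```python
import numpy as np
import pandas as pd
import torch

df = pd.DataFrame({"a": [1]})

product = df["a"] * np.array([2])         # pandas Series (int64)
arr = product.to_numpy(dtype=np.float32)  # pandas -> numpy
t = torch.from_numpy(arr)                 # numpy -> torch (shares memory with arr)
result = t * torch.Tensor([1])            # tensor * tensor now works

back_arr = result.numpy()                 # torch -> numpy
back_series = pd.Series(back_arr)         # numpy -> pandas
back_list = result.tolist()               # torch -> plain Python list

# type() shows what each step actually produced
for obj in (product, arr, t, result, back_arr, back_series, back_list):
    print(type(obj))
```

Converting once at the boundary (pandas to numpy to torch before training, and back afterwards) tends to be easier to debug than mixing the three types inside one expression.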

HTH & good luck!