Using TruncatedSVD in PySpark

Hi, I have a large list of Book-Titles / User-IDs and I’m trying to use TruncatedSVD for dimensionality reduction. I’m following these steps:

  1. Define the feature columns.
  2. Assemble the features into a vector column.
  3. Apply TruncatedSVD.

But it looks like PySpark doesn’t have TruncatedSVD built in, and using a similar approach with PCA throws an OutOfMemoryError, even on a T4 GPU instance.
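Roughly what I’m running is below (column names are placeholders for my real ones, and the data is assumed to already be in a DataFrame `df`):

```python
from pyspark.ml.feature import VectorAssembler, PCA

# 1. Define the feature columns (placeholder names)
feature_cols = ["user_id_idx", "title_idx", "rating"]

# 2. Assemble them into a single vector column
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled_df = assembler.transform(df)

# 3. There is no TruncatedSVD in pyspark.ml, so I tried PCA instead,
#    which is where the OutOfMemoryError shows up
pca = PCA(k=50, inputCol="features", outputCol="pca_features")
reduced_df = pca.fit(assembled_df).transform(assembled_df)
```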

This approach works fine without PySpark, so I was wondering if anyone has run into this issue before and how they handled dimensionality reduction in PySpark.

Any tips or references would be really helpful. Thanks a lot!

Hello,
Since TruncatedSVD isn’t built into PySpark, there are a few ways to handle dimensionality reduction:

  1. Use random projection as a lower-memory alternative to SVD.
  2. Run TruncatedSVD outside PySpark with scikit-learn and re-import the results.
  3. Use ChiSqSelector to cut down the number of features before applying PCA or SVD.
  4. Optimize PCA by increasing Spark memory settings and repartitioning the data.

Rough sketches for each option follow below.
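For option 1, PySpark doesn’t ship a plain RandomProjection transformer (only BucketedRandomProjectionLSH, which is for LSH), but you can sketch a Gaussian random projection yourself by multiplying each feature vector with a fixed random matrix. Here `spark`, `assembled_df`, the `features` column, and the dimensions are assumptions:

```python
import numpy as np
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

orig_dim = 10_000   # assumed width of the assembled "features" vector
k = 50              # target dimensionality

# Fixed Gaussian projection matrix, shared with the executors via broadcast
rng = np.random.default_rng(42)
proj_bc = spark.sparkContext.broadcast(
    rng.normal(0.0, 1.0 / np.sqrt(k), size=(orig_dim, k))
)

@udf(returnType=VectorUDT())
def random_project(v):
    # Project each row vector down to k dimensions
    return Vectors.dense(v.toArray() @ proj_bc.value)

reduced_df = assembled_df.withColumn("rp_features", random_project("features"))
```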
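For option 2, one way (assuming the assembled feature matrix fits in driver memory, or you work on a sample, and that there is an `id` column to join back on) is to pull the vectors into NumPy, run scikit-learn’s TruncatedSVD, and put the reduced rows back into a Spark DataFrame:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from pyspark.ml.linalg import Vectors

# Collect the assembled features to the driver (or sample first if too large)
rows = assembled_df.select("id", "features").collect()
ids = [r["id"] for r in rows]
X = np.array([r["features"].toArray() for r in rows])

# Truncated SVD with scikit-learn
svd = TruncatedSVD(n_components=50, random_state=42)
X_reduced = svd.fit_transform(X)

# Re-import the reduced vectors as a Spark DataFrame
reduced_df = spark.createDataFrame(
    [(i, Vectors.dense(x)) for i, x in zip(ids, X_reduced)],
    ["id", "svd_features"],
)
```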
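For option 3, keep in mind that ChiSqSelector ranks features by a chi-squared test against a label column, so it mainly fits if you have something like a rating or class to select against. A sketch, assuming `assembled_df` has a categorical `label` column:

```python
from pyspark.ml.feature import ChiSqSelector

selector = ChiSqSelector(
    numTopFeatures=500,          # keep the 500 most relevant features
    featuresCol="features",
    labelCol="label",            # assumed categorical label column
    outputCol="selected_features",
)
selected_df = selector.fit(assembled_df).transform(assembled_df)

# PCA / SVD can then run on the much smaller "selected_features" column
```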
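For option 4, a minimal sketch of bumping memory and repartitioning before PCA (the sizes and partition count are placeholders and depend on your cluster; note that driver memory usually has to be set before the driver JVM starts, e.g. via spark-submit):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA

spark = (
    SparkSession.builder
    .appName("book-svd")
    .config("spark.driver.memory", "8g")        # placeholder sizes
    .config("spark.executor.memory", "8g")
    .config("spark.driver.maxResultSize", "4g")
    .getOrCreate()
)

# Spread the rows over more partitions so no single task holds too much data
assembled_df = assembled_df.repartition(200)

pca = PCA(k=50, inputCol="features", outputCol="pca_features")
reduced_df = pca.fit(assembled_df).transform(assembled_df)
```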
Best Regards
esther598


Thank you so much!