Using TruncatedSVD in PySpark

Hi, I have a large list of Book-Titles / User-IDs and I’m trying to use TruncatedSVD for dimensionality reduction. I’m following these steps:

  1. Define the feature columns.
  2. Assemble the features into a vector column.
  3. Apply TruncatedSVD.

But it looks like PySpark doesn’t have TruncatedSVD built in, and using a similar approach with PCA throws an OutOfMemoryError, even on a T4 GPU instance.
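Roughly what I’m running is below (column names are placeholders for my real ones, and the data is assumed to already be in a DataFrame `df`):

```python
from pyspark.ml.feature import VectorAssembler, PCA

# 1. Define the feature columns (placeholder names)
feature_cols = ["user_id_idx", "title_idx", "rating"]

# 2. Assemble them into a single vector column
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled_df = assembler.transform(df)

# 3. There is no TruncatedSVD in pyspark.ml, so I tried PCA instead,
#    which is where the OutOfMemoryError shows up
pca = PCA(k=50, inputCol="features", outputCol="pca_features")
reduced_df = pca.fit(assembled_df).transform(assembled_df)
```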

This approach works fine without PySpark, so I was wondering if anyone has run into this issue before and how they handled dimensionality reduction in PySpark.

Any tips or references would be really helpful. Thanks a lot!

Hello,
Since TruncatedSVD isn’t built into PySpark, there are a few ways to handle dimensionality reduction:

  1. Use random projection as a lower-memory alternative to SVD.
  2. Run TruncatedSVD outside PySpark with scikit-learn and re-import the results.
  3. Use ChiSqSelector to cut down the number of features before applying PCA or SVD.
  4. Optimize PCA by increasing Spark memory settings and repartitioning the data.

Rough sketches for each option follow below.
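For option 1, PySpark doesn’t ship a plain RandomProjection transformer (only BucketedRandomProjectionLSH, which is for LSH), but you can sketch a Gaussian random projection yourself by multiplying each feature vector with a fixed random matrix. Here `spark`, `assembled_df`, the `features` column, and the dimensions are assumptions:

```python
import numpy as np
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

orig_dim = 10_000   # assumed width of the assembled "features" vector
k = 50              # target dimensionality

# Fixed Gaussian projection matrix, shared with the executors via broadcast
rng = np.random.default_rng(42)
proj_bc = spark.sparkContext.broadcast(
    rng.normal(0.0, 1.0 / np.sqrt(k), size=(orig_dim, k))
)

@udf(returnType=VectorUDT())
def random_project(v):
    # Project each row vector down to k dimensions
    return Vectors.dense(v.toArray() @ proj_bc.value)

reduced_df = assembled_df.withColumn("rp_features", random_project("features"))
```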
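For option 2, one way (assuming the assembled feature matrix fits in driver memory, or you work on a sample, and that there is an `id` column to join back on) is to pull the vectors into NumPy, run scikit-learn’s TruncatedSVD, and put the reduced rows back into a Spark DataFrame:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from pyspark.ml.linalg import Vectors

# Collect the assembled features to the driver (or sample first if too large)
rows = assembled_df.select("id", "features").collect()
ids = [r["id"] for r in rows]
X = np.array([r["features"].toArray() for r in rows])

# Truncated SVD with scikit-learn
svd = TruncatedSVD(n_components=50, random_state=42)
X_reduced = svd.fit_transform(X)

# Re-import the reduced vectors as a Spark DataFrame
reduced_df = spark.createDataFrame(
    [(i, Vectors.dense(x)) for i, x in zip(ids, X_reduced)],
    ["id", "svd_features"],
)
```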
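For option 3, keep in mind that ChiSqSelector ranks features by a chi-squared test against a label column, so it mainly fits if you have something like a rating or class to select against. A sketch, assuming `assembled_df` has a categorical `label` column:

```python
from pyspark.ml.feature import ChiSqSelector

selector = ChiSqSelector(
    numTopFeatures=500,          # keep the 500 most relevant features
    featuresCol="features",
    labelCol="label",            # assumed categorical label column
    outputCol="selected_features",
)
selected_df = selector.fit(assembled_df).transform(assembled_df)

# PCA / SVD can then run on the much smaller "selected_features" column
```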
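For option 4, a minimal sketch of bumping memory and repartitioning before PCA (the sizes and partition count are placeholders and depend on your cluster; note that driver memory usually has to be set before the driver JVM starts, e.g. via spark-submit):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA

spark = (
    SparkSession.builder
    .appName("book-svd")
    .config("spark.driver.memory", "8g")        # placeholder sizes
    .config("spark.executor.memory", "8g")
    .config("spark.driver.maxResultSize", "4g")
    .getOrCreate()
)

# Spread the rows over more partitions so no single task holds too much data
assembled_df = assembled_df.repartition(200)

pca = PCA(k=50, inputCol="features", outputCol="pca_features")
reduced_df = pca.fit(assembled_df).transform(assembled_df)
```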
Best Regards
esther598


Thank you so much!