I got quite different timings for S4TF. Did you exclude the first iteration of the S4TF matmul? It includes the compilation step (see this thread).
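To illustrate what I mean by excluding the first iteration, here is a minimal Python sketch of a timing loop with a discarded warm-up call. This is not the code I used for the numbers in this post; `benchmark` and `make_fake_matmul` are hypothetical names, and the fake workload just sleeps on its first call to stand in for a one-time compilation step.

```python
import statistics
import time


def benchmark(fn, loops=1000, warmup=1):
    """Time fn() over `loops` calls after discarding `warmup` untimed calls."""
    for _ in range(warmup):
        fn()  # untimed: absorbs one-time costs such as compilation
    samples = []
    for _ in range(loops):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    # Mean and sample standard deviation, both in seconds
    return statistics.mean(samples), statistics.stdev(samples)


def make_fake_matmul():
    """Hypothetical stand-in for an op whose first call also compiles."""
    state = {"compiled": False, "calls": 0}

    def fake_matmul():
        state["calls"] += 1
        if not state["compiled"]:
            time.sleep(0.05)  # pretend one-time compilation cost
            state["compiled"] = True
        return sum(i * i for i in range(1000))

    return fake_matmul, state


if __name__ == "__main__":
    fn, _ = make_fake_matmul()
    mean, std = benchmark(fn, loops=1000, warmup=1)
    print(f"1000 loops Mean: {mean * 1e3} ms, Std Dev: {std * 1e6} µs")
```

With `warmup=0` the slow first call lands in the samples and inflates both the mean and the standard deviation, which is the effect I suspect in your numbers.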
The timings I got on a 6-core Intel processor (CPU) were as follows:
- Numpy -
5.63 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
- PyTorch -
2.21 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
- S4TF -
1000 loops Mean: 3.878070619 ms, Std Dev: 223.7475553704346 µs
GPU (1080 Ti):
- S4TF -
1000 loops Mean: 32.084067 µs. Std Dev: 10.078865044860505 µs
- PyTorch -
202 µs ± 73.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)