I might be able to provide some context here.
In some sense what would be ideal with feature selection is for the number of features to go down, but for accuracy to go up (or, at least, not go down). However, this is tricky.
My expectation (although I would be happy to be proven wrong) is that straightforward feature selection is unlikely to work well. (This is despite the fact that, as you have seen, the distribution of weights in the learned classifier looks promising.)
This is covered to some extent in section 4.3.1 of the paper. If you look at figure 5, basically what it shows is that as the number of kernels goes up, accuracy also goes up (and vice versa). The difference stops being statistically significant after about 10K kernels. However, the difference between, say, 100 kernels and 1K kernels is statistically significant, not because 1K kernels produce radically higher accuracy than 100 kernels, but rather because 1K kernels produce consistently higher accuracy. This could be a very small increase in accuracy over a majority of datasets, and indeed this is basically what you see with any increase in the number of kernels: a small but consistent increase in accuracy. The other aspect of this, as you have seen, is that variance increases as the number of kernels goes down (any two sets of 100 kernels are likely to be less similar to each other than any two sets of 10K kernels, or two sets of 100K kernels, are to each other).
So why do we use 10K kernels (producing 20K features)? Because 10K kernels are consistently more accurate than < 10K kernels, because 10K kernels produce relatively low variance, because more kernels don’t make that much more difference in terms of either accuracy or variance (diminishing returns; more formally, this is more or less the point where the difference is no longer statistically significant), because 10K is a round number, and because the classifier (the ridge regression classifier or logistic / softmax regression) can handle 20K features easily (and, for the ridge regression classifier, even with a small number of training examples).
Nonetheless, if compute time is critical, you can use fewer kernels. What you can’t see in figure 5 is that, even with 100 kernels, ROCKET ranks somewhere in the middle of the ‘second pack’ of classifiers (roughly similar performance to ProximityForest). And, with 100 kernels, you should be able to train and test the whole UCR archive in about 2 minutes, and you can see from our scalability experiments that, for a small hit in accuracy, you can learn from > 1 million time series in about 1 minute.
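For reference, here is a minimal sketch of what using fewer kernels looks like in practice, assuming the sktime implementation of ROCKET (`Rocket`) together with scikit-learn’s `RidgeClassifierCV`; the dataset loader and import paths are illustrative and may differ between sktime versions:

```python
# Sketch: ROCKET with a small number of kernels for faster training.
# Assumes sktime's Rocket transformer and a UCR dataset bundled with sktime;
# exact import paths / loaders may differ depending on your sktime version.
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sktime.datasets import load_arrow_head
from sktime.transformations.panel.rocket import Rocket

X_train, y_train = load_arrow_head(split="train", return_X_y=True)
X_test, y_test = load_arrow_head(split="test", return_X_y=True)

# 100 kernels -> 200 features (ppv + max per kernel)
rocket = Rocket(num_kernels=100, random_state=0)
X_train_transform = rocket.fit_transform(X_train)

classifier = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10))
classifier.fit(X_train_transform, y_train)

X_test_transform = rocket.transform(X_test)
print(classifier.score(X_test_transform, y_test))
```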
Loosely speaking, for problems where ROCKET works well, even, e.g., 100 kernels should produce pretty good accuracy. 1K kernels will produce consistently higher accuracy (but the actual increase in accuracy is likely to be relatively small). Same for 10K over 1K, etc. (Obviously, you start running into other problems as the number of kernels keeps increasing.)
The bottom line is that you are probably going to get similar results simply by using fewer kernels in the first place, rather than generating more kernels and then doing feature selection.
There is also a bigger picture. Feature selection takes time. Not necessarily much time, but it depends on what you are doing. Convolutional kernels are proven feature detectors, but the potential ‘space’ of all kernels (even just in terms of weights, let alone in terms of arrangement or architecture) is very large. The typical way of wading through this space is by learning the kernel weights, possibly combined with some kind of architecture search, or by using a proven architecture such as ResNet or InceptionTime.
But there is another approach, i.e., the approach taken by ROCKET, which is—speaking fairly loosely—to simply generate lots of kernels which, in combination, provide good coverage of the space of all kernels (or all useful kernels).
Feature selection fits somewhere on the continuum between ‘completely random’ and fully learned kernels in an established architecture, or an architecture found through some kind of architecture search. At some point, it will almost certainly be more beneficial to simply spend time learning the kernels and performing some kind of architecture search, rather than hunting through randomly generated kernels. (Note also the possible ‘correlation effect’ observed by @MadeUpMasters in his summary, above, which might work against feature selection for random kernels.)
Having said all that, scikit-learn (just as an example) has a number of feature selection methods which can be used directly with the ROCKET features, and which may prove useful at least for some problems.
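As a rough sketch (not a recommendation), one way to plug such a method in between the ROCKET transform and the classifier, again assuming the sktime `Rocket` transformer: `SelectFromModel` here keeps the features whose absolute ridge weights are above average, but any of the other scikit-learn selectors could be substituted.

```python
# Sketch: scikit-learn feature selection on the ROCKET features.
# SelectFromModel / RidgeClassifierCV / make_pipeline are standard scikit-learn;
# the overall pipeline is illustrative, not a recommended configuration.
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import make_pipeline
from sktime.transformations.panel.rocket import Rocket

pipeline = make_pipeline(
    Rocket(num_kernels=10_000, random_state=0),  # 10K kernels -> 20K features
    SelectFromModel(  # keep features with above-average absolute weight
        RidgeClassifierCV(alphas=np.logspace(-3, 3, 10))
    ),
    RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)),  # refit on the selected features
)

pipeline.fit(X_train, y_train)  # X_train / y_train as in the earlier example
print(pipeline.score(X_test, y_test))
```

Whether something like this actually beats simply using fewer kernels in the first place is, per the above, an open question for any given dataset.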
Best,
Angus.