I was trying to reproduce the plots of the embeddings that @jeremy showed and I found that t-SNE has several parameters that can really affect the final visualization. I was wondering if you have some insights about the impact of different parameters and/or good practices. I attach my code (sorry. t’s quite nasty), for you to play too
This first chart looks nice. What kind of t-SNE parameters you used to produce it ?
Perplexities = [2,4,5,10,25] (default is 30)
All the others are the scikit learn defaults:
early_exaggeration=12.0, learning_rate=200.0, n_iter=1000, n_iter_without_progress=300, min_grad_norm=1e-07, metric=’euclidean’, init=’random’, verbose=0, random_state=None, method=’barnes_hut’, angle=0.5
I played mainly with perplexity, early_exaggeration, learning_rate and angle.
First chart clearly displays 3 clusters and this kind of transformation one can effectively feed into kmeans.
I’ll see if I can find my tSNE notes / sources to confirm, but I seem to recall recommended Perplexities being in the 10-50 range (and having my most success in the 15-30 range, but this may be data specific).
I just googled around a bit and found this post which seems pretty on point to your questions:
I this case specifically you do have a ground truth. It is Germany’s states own topology.
Therefore in order to tune your TSNE representation you could conceivably calculate the discrepancy between your dimensionality-reduced embedding representations and the actual topology/map of Germany.
In pseudocode it would like something along these lines:
def calc_pair_wise_distances(topo): return [(from, to, dist) for i in topo]
def topology_ss(emb_topo, actual_topo):
pw_dist_emb = calc_pair_wise_distances(emb_topo)
pw_dist_actual = calc_pair_wise_distances(actual_topo)
squared_dist = [ (i[0][3] - i[1][3])**2 for i in zip(pw_dist_topo, pw_dist_actual)]
return squared_dist.sum()
Calculate that for all TSNE parameters and pick the combination with the lowest sum of squares