Hi,
I have a working prototype for a possible SparseDataset and LinearSparse for creating datasets (mostly for tabular data) from scipy.sparse using a custom sparse_collate_fn. Would this be something desired or worth adding as a feature?
I have read a lot of issues on the PyTorch forums and came up with the ideas above for handling sparse datasets.
Some use cases (why would someone want a sparse dataset when we have embeddings?):
1) Extracting features from CNN models like ResNet: if you have n different models, the input becomes an n*2048-sized vector with a lot of zeros, since it comes after a ReLU.
2) Text features which are non-sequential and of variable length. Here we can use BOW/TF-IDF; of course you can work around it with more sophisticated models like LSTMs or Transformers, but BOW/TF-IDF will most of the time give you a strong and fast baseline (see the small sketch after this list).
3) Will think of more…
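As a quick illustration of case 2 (just a sketch, not part of the prototype): sklearn's TfidfVectorizer already produces a scipy.sparse matrix, which is exactly the kind of data the proposed SparseDataset would hold.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "sparse tensors for tabular data",
    "bag of words gives a strong baseline",
    "tf-idf features are naturally sparse",
]

# fit_transform returns a scipy.sparse CSR matrix, not a dense array
X = TfidfVectorizer().fit_transform(corpus)
print(type(X), X.shape)
print(X.nnz, "non-zeros out of", X.shape[0] * X.shape[1])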
Why a custom collate_fn? Because default_collate will fail on scipy.sparse inputs. The way I do it is to store the scipy.sparse data inside the dataset and stack the sampled rows in the collate function:
import numpy as np
import scipy.sparse
import torch


def sparse_collate_fn(dataset):
    """
    dataset: [self.dataset[i] for i in indices], each item an (x, y) pair
    where x is a scipy.sparse row and y is its label
    """
    x, y = list(zip(*dataset))
    # stack the sampled sparse rows into one sparse matrix for the whole batch
    sparse_stacked = scipy.sparse.vstack(x)
    # device=0 builds the sparse batch directly on GPU 0
    torch_sparse_stacked = torch_from_scipysparse(
        sparse_stacked, size=sparse_stacked.shape, device=0)
    torch_y = torch.FloatTensor(np.asarray(y))
    return torch_sparse_stacked, torch_y


def torch_from_scipysparse(sparse_matrix, *args, **kwargs):
    """sparse_matrix: scipy sparse matrix"""
    sparse_matrix = sparse_matrix.tocoo(copy=False)
    row, col, values = sparse_matrix.row, sparse_matrix.col, sparse_matrix.data
    # COO indices as a 2 x nnz LongTensor, values as a FloatTensor
    i = torch.LongTensor(np.vstack([row, col]))
    v = torch.FloatTensor(values)
    return torch.sparse_coo_tensor(i, v, *args, **kwargs)
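For completeness, here is a minimal sketch of what the SparseDataset side could look like (my assumption of the class shape; the post only names it): it keeps the scipy.sparse matrix as-is and returns one sparse row plus its label per index, so sparse_collate_fn can vstack them.

from torch.utils.data import Dataset, DataLoader


class SparseDataset(Dataset):
    """Minimal sketch: hold a scipy.sparse matrix X and a label array y."""
    def __init__(self, X, y):
        self.X = X.tocsr()      # CSR supports row indexing, shape (n_samples, n_features)
        self.y = np.asarray(y)

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        # indexing a CSR matrix with an int returns a 1 x n_features sparse row
        return self.X[idx], self.y[idx]


# usage: plug the custom collate_fn into a regular DataLoader
# loader = DataLoader(SparseDataset(X, y), batch_size=64,
#                     shuffle=True, collate_fn=sparse_collate_fn)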
Why a custom LinearSparse? Because at the moment autograd supports only some sparse ops, which can be found here: https://github.com/pytorch/pytorch/issues/9674. This is an open issue, and I bet in the near future they will support all modules. But if you create an nn.Linear layer, move it to, say, device:0, and call backward(), it will fail. The reason, at least as far as I remember from the forums, is that intermediate variables are created when you move variables, and backward fails on them? (maybe someone can explain further). So LinearSparse allows us to move the module to the GPU as we create it; it uses torch.mm and is pretty similar to nn.Linear:
import math

import torch
import torch.nn as nn
from torch.nn import Parameter


class LinearSparse(nn.Module):
    def __init__(self, in_features, out_features, bias=True, **kwargs):
        """Extra kwargs (e.g. device, dtype) are forwarded to weight and bias construction."""
        super(LinearSparse, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        # weight is stored as (in_features, out_features) so forward can use torch.mm(input, weight)
        self.weight = Parameter(torch.rand(in_features, out_features, **kwargs))
        if bias:
            self.bias = Parameter(torch.rand(out_features, **kwargs))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self):
        # fan-in is size(0) here because the weight is stored transposed relative to nn.Linear
        stdv = 1. / math.sqrt(self.weight.size(0))
        self.weight.data.uniform_(-stdv, stdv)
        if self.bias is not None:
            self.bias.data.uniform_(-stdv, stdv)

    def forward(self, input):
        """input: sparse (S), weight: dense (D)"""
        output = torch.mm(input, self.weight)
        if self.bias is not None:
            output = output + self.bias
        return output

    def extra_repr(self):
        return 'in_features={}, out_features={}, bias={}'.format(
            self.in_features, self.out_features, self.bias is not None
        )
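A quick usage sketch (not from the prototype; CPU here so it runs anywhere, while passing device='cuda:0' to the constructor is what builds the parameters directly on the GPU):

# a tiny sparse input batch, like what sparse_collate_fn would return
i = torch.LongTensor([[0, 0, 1], [1, 3, 2]])   # 2 x nnz COO indices
v = torch.FloatTensor([0.5, 1.0, 2.0])
x = torch.sparse_coo_tensor(i, v, size=(2, 4))

layer = LinearSparse(4, 3)      # or LinearSparse(4, 3, device='cuda:0') to build on GPU
out = layer(x)                  # torch.mm(sparse, dense) -> dense output of shape (2, 3)
out.sum().backward()            # gradients flow to the dense weight and bias
print(out.shape, layer.weight.grad.shape)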