I want to build a vocabulary based on WordPiece units instead of whole words. Can anyone explain the process of building a WordPiece vocabulary from a set of sentences, or point me to a library that can do this?
Thanks
You might want to try SentencePiece.
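To give a feel for what a subword vocabulary builder does under the hood, here is a minimal pure-Python sketch. It uses BPE-style merging (repeatedly joining the most frequent adjacent symbol pair), which is one of the algorithms SentencePiece implements; it is not the exact WordPiece training procedure, which picks merges by a likelihood criterion. The function name `build_subword_vocab` and its parameters are made up for illustration, and a real library will be far faster and handle whitespace and rare characters properly.

```python
from collections import Counter


def build_subword_vocab(sentences, num_merges):
    """Toy BPE-style vocabulary builder (hypothetical helper, not a library API)."""
    # Count each whitespace-separated word, stored as a tuple of characters.
    word_counts = Counter()
    for sentence in sentences:
        for word in sentence.lower().split():
            word_counts[tuple(word)] += 1

    # The initial vocabulary is the set of single characters seen in the data.
    vocab = {symbol for word in word_counts for symbol in word}

    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for word, count in word_counts.items():
            for left, right in zip(word, word[1:]):
                pair_counts[(left, right)] += count
        if not pair_counts:
            break  # every word is already a single symbol

        # Pick the most frequent pair and add its concatenation to the vocabulary.
        (left, right), _pair_count = pair_counts.most_common(1)[0]
        merged = left + right
        vocab.add(merged)

        # Rewrite every word with that pair replaced by the merged symbol.
        new_counts = Counter()
        for word, count in word_counts.items():
            symbols, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == left and word[i + 1] == right:
                    symbols.append(merged)
                    i += 2
                else:
                    symbols.append(word[i])
                    i += 1
            new_counts[tuple(symbols)] += count
        word_counts = new_counts

    return vocab
```

Calling `build_subword_vocab(["low lower lowest", "low low"], num_merges=2)` returns the seven single characters plus two merged symbols, one of which is `low`, since that word dominates the toy corpus.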
Here is one example of creating WordPiece tokens:
# coding=utf-8
"""BERT finetuning runner."""
from __future__ import absolute_import, division, print_function
import argparse
import collections
import logging
import os
import random
import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data.sampler import RandomSampler, SequentialSampler
from tqdm import tqdm, trange
This file has been truncated.
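Since the preview above cuts off before any tokenization logic, here is a small sketch of how a word is usually split into WordPiece tokens once a vocabulary exists: greedy longest-match-first, with `##` marking pieces that continue a word, as in BERT's vocabulary files. The function name `wordpiece_tokenize`, the `unk_token` default, and the tiny vocabulary in the usage note are assumptions for illustration, not code taken from the truncated file.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word (illustrative sketch)."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix of the remainder is in the vocabulary: give up on the word.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens
```

With the vocabulary `{"un", "##aff", "##able"}`, `wordpiece_tokenize("unaffable", vocab)` gives `["un", "##aff", "##able"]`, and a word with no matching pieces comes back as `["[UNK]"]`.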
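From what I can tell, SentencePiece is not the exact WordPiece implementation that was used to train BERT, but it implements closely related subword algorithms (BPE and the unigram language model) and can be used to build a comparable vocabulary.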
Here is a paper that you might find interesting:
https://www.aclweb.org/anthology/D18-2012