Creating your own dataset from Google Images
Purpose of this notebook
by: Francisco Ingham and Jeremy Howard. Inspired by Adrian Rosebrock
In this tutorial we will see how to easily create an image dataset through Google Images. Note: You will have to repeat these steps for any new category you want to Google (e.g. once for dogs and once for cats).
The only library we need
from fastai.vision import *
Get a list of URLs
How to search precisely
Search and scroll
Go to Google Images and search for the images you are interested in. The more specific you are in your Google Search, the better the results and the less manual pruning you will have to do.
Scroll down until you’ve seen all the images you want to download, or until you see a button that says ‘Show more results’. All the images you scrolled past are now available to download. To get more, click on the button, and continue scrolling. The maximum number of images Google Images shows is 700.
It is a good idea to put things you want to exclude into the search query, for instance if you are searching for the Eurasian wolf, “canis lupus lupus”, it might be a good idea to exclude other variants:
"canis lupus lupus" -dog -arctos -familiaris -baileyi -occidentalis
You can also limit your results to show only photos by clicking on Tools and selecting Photos from the Type dropdown.
How to save the image URLs
Download into file
Now you must run some Javascript code in your browser which will save the URLs of all the images you want for your dataset.
Press Ctrl+Shift+J on Windows/Linux or Cmd+Opt+J on Mac, and a small window, the JavaScript 'Console', will appear. That is where you will paste the JavaScript commands.
You will need to get the urls of each of the images. You can do this by running the following commands:
urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el=>JSON.parse(el.textContent).ou);
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
Create directory and upload urls file into your server
Choose an appropriate name for your labeled images. You can run these steps multiple times to create different labels.
One category, one folder, one urls file
folder = 'black'
file = 'urls_black.txt'
folder = 'teddys'
file = 'urls_teddys.txt'
folder = 'grizzly'
file = 'urls_grizzly.txt'
You will need to run this cell once for each category.
Create the subfolder
path = Path('data/bears')
dest = path/folder
dest.mkdir(parents=True, exist_ok=True)
Look inside the folder
path.ls()
[PosixPath('data/bears/urls_teddy.txt'),
PosixPath('data/bears/black'),
PosixPath('data/bears/urls_grizzly.txt'),
PosixPath('data/bears/urls_black.txt')]
Finally, upload your urls file. You just need to press ‘Upload’ in your working directory and select your file, then click ‘Upload’ for each of the displayed files.
Download images
How to download the images and set a cap on how many are downloaded
Now you will need to download your images from their respective urls.
fast.ai has a function that allows you to do just that. You just have to specify the filename of the urls file as well as the destination folder, and this function will download and save all images that can be opened; images that can't be opened will not be saved.
Let’s download our images! Notice you can choose a maximum number of images to be downloaded. In this case we will not download all the urls.
You will need to run this line once for every category.
classes = ['teddys','grizzly','black']
download_images(path/file, dest, max_pics=200)
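If you'd rather not edit and re-run the cells above by hand, here is a minimal convenience sketch (an assumption, not part of the original notebook) that loops over every category, assuming the urls files follow the urls_<folder>.txt naming used above:
# Hypothetical convenience loop: download each category in turn,
# assuming each category's urls file is named urls_<folder>.txt.
for folder in classes:
    dest = path/folder
    dest.mkdir(parents=True, exist_ok=True)
    download_images(path/f'urls_{folder}.txt', dest, max_pics=200)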
How to handle download problems
# If you have problems downloading, try `max_workers=0` to see exceptions:
download_images(path/file, dest, max_pics=20, max_workers=0)
How to remove images that can't be opened
Then we can remove any images that can’t be opened:
for c in classes:
    print(c)
    verify_images(path/c, delete=True, max_size=500)
View data
Create a DataBunch from a folder
np.random.seed(42)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2,
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
Create a DataBunch with the help of a CSV file
# If you already cleaned your data, run this cell instead of the one before
# np.random.seed(42)
# data = ImageDataBunch.from_csv(".", folder=".", valid_pct=0.2, csv_labels='cleaned.csv',
# ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
Good! Let’s take a look at some of our pictures then.
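For example, fastai's standard show_batch call displays a grid of images from the DataBunch (the rows/figsize values here are just illustrative):
data.show_batch(rows=3, figsize=(7,8))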
View the classes
data.classes
['black', 'grizzly', 'teddys']
View the classes and the sizes of the training and validation sets
data.classes, data.c, len(data.train_ds), len(data.valid_ds)
(['black', 'grizzly', 'teddys'], 3, 448, 111)
Train model
Create a CNN model based on Resnet34
learn = create_cnn(data, models.resnet34, metrics=error_rate)
Train for 4 epochs with default parameters
learn.fit_one_cycle(4)
Save the model
learn.save('stage-1')
Unfreeze the model
learn.unfreeze()
Find the optimal learning rate for the current stage
learn.lr_find()
Plot the loss against the learning rate
learn.recorder.plot()
Train for 2 epochs with a learning rate range
learn.fit_one_cycle(2, max_lr=slice(3e-5,3e-4))
Save the model
learn.save('stage-2')
Interpretation
Load the model
learn.load('stage-2');
Create a classification interpreter
interp = ClassificationInterpretation.from_learner(learn)
Plot the confusion matrix
interp.plot_confusion_matrix()
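To inspect the individual images the model is most wrong about, ClassificationInterpretation also provides plot_top_losses (a standard fastai method, not shown in the original cells; the arguments here are just illustrative):
interp.plot_top_losses(9, figsize=(15,11))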
Cleaning Up
Invoke the widget
Some of our top losses aren't due to bad performance by our model: there are images in our data set that shouldn't be there (some high losses come from mislabeled images). Using the `ImageCleaner` widget from `fastai.widgets`, we can prune our top losses, removing photos that don't belong.
from fastai.widgets import *
How to get the dataset and the indexes of the top-loss images
First we need to get the file paths from our `top_losses`. We can do this with `.from_toplosses`. We then feed the top-loss indexes and the corresponding dataset to `ImageCleaner`.
Notice that the widget will not delete images directly from disk; it will create a new csv file, `cleaned.csv`, from which you can create a new ImageDataBunch with the corrected labels to continue training your model.
ds, idxs = DatasetFormatter().from_toplosses(learn, ds_type=DatasetType.Valid)
Use `ImageCleaner` to display these images so they can be pruned
ImageCleaner(ds, idxs, path)
'No images to show :)'
Flag photos for deletion by clicking 'Delete'. Then click 'Next Batch' to delete the flagged photos and keep the rest in that row. `ImageCleaner` will show you a new row of images until there are no more to show; in this case, the widget will show you images until there are none left from `top_losses`.
Get the dataset and the indexes of near-duplicate images
You can also find duplicates in your dataset and delete them! To do this, you need to run `.from_similars` to get the potential duplicates' ids and then run `ImageCleaner` with `duplicates=True`. The API works in a similar way as with misclassified images: just choose the ones you want to delete and click 'Next Batch' until there are no more images left.
ds, idxs = DatasetFormatter().from_similars(learn, ds_type=DatasetType.Valid)
Remove the duplicate images
ImageCleaner(ds, idxs, path, duplicates=True)
'No images to show :)'
Remember to recreate your ImageDataBunch from your `cleaned.csv` (which excludes the removed images) to include the changes you made in your data!
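A minimal sketch of that recreation, assuming the widget wrote cleaned.csv under `path` (this mirrors the commented from_csv cell earlier):
np.random.seed(42)
data = ImageDataBunch.from_csv(path, folder=".", valid_pct=0.2, csv_labels='cleaned.csv',
        ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)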
Putting your model in production (building a web app)
Exporting the model for production
First things first, let's export the content of our Learner object for production:
learn.export()
This will create a file named 'export.pkl' in the directory where we were working, containing everything we need to deploy our model (the model, the weights, but also some metadata like the classes or the transforms/normalization used).
Run the model on CPU
You probably want to use CPU for inference, except at massive scale (and you almost certainly don't need to train in real time). If you don't have a GPU, that happens automatically. You can test your model on CPU like so:
defaults.device = torch.device('cpu')
Open an image
img = open_image(path/'black'/'00000021.jpg')
img
How to build the model from export.pkl
We create our Learner in the production environment like this; just make sure that `path` contains the file 'export.pkl' from before.
learn = load_learner(path)
How to predict with the model (returns the predicted class, the class index, and the output values)
pred_class,pred_idx,outputs = learn.predict(img)
pred_class
Category black
The core Starlette code
So you might create a route something like this (thanks to Simon Willison for the structure of this code):
from io import BytesIO
from starlette.applications import Starlette
from starlette.responses import JSONResponse

app = Starlette()

@app.route("/classify-url", methods=["GET"])
async def classify_url(request):
    # Fetch the image, run it through the exported Learner, and return
    # the per-class predictions sorted from most to least likely.
    bytes = await get_bytes(request.query_params["url"])
    img = open_image(BytesIO(bytes))
    _, _, losses = learn.predict(img)
    return JSONResponse({
        "predictions": sorted(
            zip(learn.data.classes, map(float, losses)),
            key=lambda p: p[1],
            reverse=True
        )
    })
(This example is for the Starlette web app toolkit.)
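Note that get_bytes is not defined in the snippet above; here is a minimal sketch, assuming aiohttp as the async HTTP client (any async client would do):
import aiohttp

async def get_bytes(url):
    # Fetch the raw bytes at the given url (e.g. an image to classify).
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.read()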
Things that can go wrong
Most of the time we only need to tune the number of epochs and the learning rate
- Most of the time things will train fine with the defaults
- There’s not much you really need to tune (despite what you’ve heard!)
- The most likely things you'll need to tune are:
  - Learning rate
  - Number of epochs
What happens if the learning rate is too high
Learning rate (LR) too high
learn = create_cnn(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(1, max_lr=0.5)
Total time: 00:13
epoch train_loss valid_loss error_rate
1 12.220007 1144188288.000000 0.765957 (00:13)
What happens if the learning rate is too low
Learning rate (LR) too low
learn = create_cnn(data, models.resnet34, metrics=error_rate)
Previously we had this result:
Total time: 00:57
epoch train_loss valid_loss error_rate
1 1.030236 0.179226 0.028369 (00:14)
2 0.561508 0.055464 0.014184 (00:13)
3 0.396103 0.053801 0.014184 (00:13)
4 0.316883 0.050197 0.021277 (00:15)
learn.fit_one_cycle(5, max_lr=1e-5)
Total time: 01:07
epoch train_loss valid_loss error_rate
1 1.349151 1.062807 0.609929 (00:13)
2 1.373262 1.045115 0.546099 (00:13)
3 1.346169 1.006288 0.468085 (00:13)
4 1.334486 0.978713 0.453901 (00:13)
5 1.320978 0.978108 0.446809 (00:13)
learn.recorder.plot_losses()
As well as taking a really long time, the model is getting too many looks at each image, so it may overfit.
What happens with too few epochs
Too few epochs
learn = create_cnn(data, models.resnet34, metrics=error_rate, pretrained=False)
learn.fit_one_cycle(1)
Total time: 00:14
epoch train_loss valid_loss error_rate
1 0.602823 0.119616 0.049645 (00:14)
What happens with too many epochs
Too many epochs
np.random.seed(42)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.9, bs=32,
        ds_tfms=get_transforms(do_flip=False, max_rotate=0, max_zoom=1,
                               max_lighting=0, max_warp=0),
        size=224, num_workers=4).normalize(imagenet_stats)
learn = create_cnn(data, models.resnet50, metrics=error_rate, ps=0, wd=0)
learn.unfreeze()
learn.fit_one_cycle(40, slice(1e-6,1e-4))
Total time: 06:39
epoch train_loss valid_loss error_rate
1 1.513021 1.041628 0.507326 (00:13)
2 1.290093 0.994758 0.443223 (00:09)
3 1.185764 0.936145 0.410256 (00:09)
4 1.117229 0.838402 0.322344 (00:09)
5 1.022635 0.734872 0.252747 (00:09)
6 0.951374 0.627288 0.192308 (00:10)
7 0.916111 0.558621 0.184982 (00:09)
8 0.839068 0.503755 0.177656 (00:09)
9 0.749610 0.433475 0.144689 (00:09)
10 0.678583 0.367560 0.124542 (00:09)
11 0.615280 0.327029 0.100733 (00:10)
12 0.558776 0.298989 0.095238 (00:09)
13 0.518109 0.266998 0.084249 (00:09)
14 0.476290 0.257858 0.084249 (00:09)
15 0.436865 0.227299 0.067766 (00:09)
16 0.457189 0.236593 0.078755 (00:10)
17 0.420905 0.240185 0.080586 (00:10)
18 0.395686 0.255465 0.082418 (00:09)
19 0.373232 0.263469 0.080586 (00:09)
20 0.348988 0.258300 0.080586 (00:10)
21 0.324616 0.261346 0.080586 (00:09)
22 0.311310 0.236431 0.071429 (00:09)
23 0.328342 0.245841 0.069597 (00:10)
24 0.306411 0.235111 0.064103 (00:10)
25 0.289134 0.227465 0.069597 (00:09)
26 0.284814 0.226022 0.064103 (00:09)
27 0.268398 0.222791 0.067766 (00:09)
28 0.255431 0.227751 0.073260 (00:10)
29 0.240742 0.235949 0.071429 (00:09)
30 0.227140 0.225221 0.075092 (00:09)
31 0.213877 0.214789 0.069597 (00:09)
32 0.201631 0.209382 0.062271 (00:10)
33 0.189988 0.210684 0.065934 (00:09)
34 0.181293 0.214666 0.073260 (00:09)
35 0.184095 0.222575 0.073260 (00:09)
36 0.194615 0.229198 0.076923 (00:10)
37 0.186165 0.218206 0.075092 (00:09)
38 0.176623 0.207198 0.062271 (00:10)
39 0.166854 0.207256 0.065934 (00:10)
40 0.162692 0.206044 0.062271 (00:09)