Lesson 5 covered sentiment analysis, which is supervised, meaning the labels were already there.
In real-world datasets the labels will not be present, and we have to generate them somehow.
What methods can we use to generate these labels, and how accurate are they for practical purposes?
Frankly, most people just use something like Mechanical Turk to present sentences to workers and have them labeled. There are some active learning approaches, e.g. http://www.sciencedirect.com/science/article/pii/S0306457314000296 . However, I haven’t studied these myself yet.
If the documents/sentences are sensitive, then the Mechanical Turk method would not be feasible.
Yes - in medicine we paid medical experts to do this.
Agreed with @jeremy. Data annotation is tricky and strongly dependent on the domain and use case. It is one of the hardest problems in ML right now. I have seen a lot of issues with Mechanical Turk, and I would instead hire domain experts or at least semi-experts for annotation (a semi-expert could be someone who is keen on learning the domain characteristics and, after some training, can provide usable annotations). The crucial issue is how you design the annotation process: what the positive/negative examples will be and whether the annotations will be repeatable. You must also monitor inter-annotator agreement (inter-rater reliability) and correct the annotation process if something goes wrong.
I would advise looking into the book Natural Language Annotation for Machine Learning: http://shop.oreilly.com/product/0636920020578.do
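To make the inter-annotator agreement point concrete, here is a minimal sketch of Cohen's kappa for two annotators, which is one common way to measure it (not necessarily the measure the book recommends). The function name `cohen_kappa` and the example labels are made up for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both pick the same label
    # independently, given each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical labels from two annotators on the same six sentences.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

A kappa near 1 means the annotators agree far more often than chance would predict, while a value near 0 means their agreement is about what chance alone would give, which is usually a sign that the annotation guidelines need work. If you already use scikit-learn, `sklearn.metrics.cohen_kappa_score` computes the same statistic.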