Categories: python, python-3.x, nlp, spacy, ner

Train Spacy NER on Indian Names

3 answers

I am trying to customize spaCy's NER to identify Indian names. I am following this guide, https://spacy.io/usage/training, and using this dataset: https://gist.githubusercontent.com/mbejda/9b93c7545c9dd93060bd/raw/b582593330765df3ccaae6f641f8cddc16f1e879/Indian-Female-Names.csv

According to the example code, I am supposed to provide training data in the following format:

# Entity spans are (start, end, label) character offsets; the end offset is exclusive.
TRAIN_DATA = [
    ('Shivani', {
        'entities': [(0, 7, 'PERSON')]
    }),
    ('Isha ', {
        'entities': [(0, 4, 'PERSON')]
    })
]

How do I provide training data to spaCy for ~12000 names? Manually specifying each entity would be a chore. Is there a tool available to tag all the names?

All answers to this question, which has the identifier 49483971

The best answer:

You are missing the point of training an NLP library on custom names. The training data has to be a list of entries, each consisting of a sentence together with the character positions of the name(s) in it. Please review the training data example again: you need to supply a full sentence, not just a name.

spaCy is not meant to be a gazetteer-matching tool. You are likely better off generating around 100 sentences that use some of these names and then training spaCy on those annotated sentences. You can add more full-sentence examples as needed to increase accuracy. spaCy's native NER for names is robust and does not need 12000 examples.

@ak_35's answer below provides examples of how to provide training sentences with the location of names labeled.
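The suggestion above, generating sentences that use some of the names, can be automated with a few templates. The following is a minimal sketch using only the standard library; the templates, the `make_example` and `build_train_data` helpers, and the two sample names are all hypothetical and only illustrate how the offsets can be computed instead of typed by hand.

```python
import random

# Hypothetical sentence templates; "{name}" marks where a name is substituted.
TEMPLATES = [
    "{name} lives in Chennai.",
    "Did you talk to {name} yesterday?",
    "I met {name} at the office.",
]


def make_example(template, name, label="PERSON"):
    """Build one (text, annotations) pair with end-exclusive character offsets."""
    start = template.index("{name}")           # where the name will begin
    text = template.replace("{name}", name, 1)
    end = start + len(name)                    # exclusive end offset
    return (text, {"entities": [(start, end, label)]})


def build_train_data(names, n_examples, seed=0):
    """Pair randomly chosen names with randomly chosen templates."""
    rng = random.Random(seed)
    return [
        make_example(rng.choice(TEMPLATES), rng.choice(names))
        for _ in range(n_examples)
    ]


names = ["Shivani", "Isha"]  # in practice, a sample from the name list
TRAIN_DATA = build_train_data(names, 5)
```

Because each span is derived from the template, `text[start:end]` always equals the inserted name. Real sentences are still preferable where available, since a handful of templates gives the model very little contextual variety.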

Your current format for TRAIN_DATA will not give you good results. spaCy needs full sentences with the entity offsets marked, in the format shown below:

# Entity spans are (start, end, label) character offsets; the end offset is exclusive.
TRAIN_DATA = [
    ('Shivani lives in chennai', {
        'entities': [(0, 7, 'PERSON')]
    }),
    ('Did you talk to Shivani yesterday', {
        'entities': [(16, 23, 'PERSON')]
    }),
    ('Isha bought a new phone', {
        'entities': [(0, 4, 'PERSON')]
    })
]

See the spaCy training documentation for details. As for automating the annotation of 12000 entries, there are tools that can help you annotate your data quickly. You can use Prodigy (from the same developers as spaCy), but it is a paid product. If you give up on NER, pattern matching might also work well for you. If you just need to find known names in a document, it would be faster and, done right, more accurate too.
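As a rough illustration of the pattern-matching alternative, here is a minimal sketch that looks up a fixed list of names with a regular expression. It uses Python's standard `re` module rather than spaCy's own rule-based matchers, and the `build_name_pattern` and `find_names` helpers are hypothetical names chosen for this example.

```python
import re


def build_name_pattern(names):
    """Compile one regex that matches any of the given names as whole words."""
    cleaned = {n.strip() for n in names if n.strip()}
    # Longest names first, so a longer name wins over a shorter prefix of it.
    ordered = sorted(cleaned, key=len, reverse=True)
    alternation = "|".join(re.escape(n) for n in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")


def find_names(text, pattern, label="PERSON"):
    """Return (start, end, label) spans with end-exclusive offsets."""
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]


pattern = build_name_pattern(["Shivani", "Isha"])
print(find_names("Did Isha talk to Shivani yesterday?", pattern))
# [(4, 8, 'PERSON'), (17, 24, 'PERSON')]
```

The word boundaries keep "Isha" from matching inside a longer word such as "Ishaan". The same spans could also be used to pre-annotate real sentences as training data, under the assumption that every occurrence of a listed name really is a person.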

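If you're trying to work out the character offsets of each name, it's quite simple. Assuming name holds one line of the CSV with the name in the first column, the span is:

(0, len(name.split(sep=',')[0]))

The end offset is exclusive, so no -1 is needed.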
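The same calculation can be applied to the whole file. The sketch below assumes the name is the first column of each CSV row and that stray whitespace should be stripped; the `name_spans` helper and the sample rows are hypothetical. spaCy treats the end offset as exclusive, so a quick sanity check is that `name[start:end]` returns the full name.

```python
import csv
import io


def name_spans(csv_text, skip_header=False):
    """Return (name, (start, end)) pairs for the first column of each row."""
    rows = csv.reader(io.StringIO(csv_text))
    if skip_header:
        next(rows, None)  # drop a header row if the file has one
    spans = []
    for row in rows:
        if not row:
            continue
        name = row[0].strip()  # remove stray spaces such as in 'Isha '
        if not name:
            continue
        start, end = 0, len(name)  # end offset is exclusive
        assert name[start:end] == name
        spans.append((name, (start, end)))
    return spans


# Made-up rows for illustration; the extra columns are placeholders.
sample = "shivani,f\nisha ,f\n"
print(name_spans(sample))
# [('shivani', (0, 7)), ('isha', (0, 4))]
```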
