Better named-entity recognition and similarity using spaCy

1 answer

I've been trying out spaCy for a small side-project, and had a few questions & concerns.

I noticed that spaCy's named-entity recognition results (with its largest en_vectors_web_lg model) don't seem to be as accurate as those of the Google Cloud Natural Language API [1]. Google's API is able to extract more entities, more accurately, most likely because their model is even larger. So, is there a way to improve spaCy's NER results, either by using a different model or through some other technique?

Secondly, Google's API also returns Wikipedia article links for relevant entities. Is this possible with spaCy too, or using some other technique on top of spaCy's NER results?

Thirdly, I noticed that spaCy has a similarity() method [2] that uses GloVe word vectors. But being new to it, I'm not sure of the best way to repeatedly compare every document against every other document in a set (say 5,000-10,000 text documents of under 500 characters each) in order to generate buckets of similar documents.

Any suggestions or tips would be appreciated.

Many thanks!


[1] https://cloud.google.com/natural-language/

[2] https://spacy.io/usage/vectors-similarity

All answers to this question, which has the identifier 52473653

The best answer:

...So, is there a way to improve spaCy's NER?

It is possible to train spaCy's model further to improve its NER. You can use a GoldParse object to train it. https://spacy.io/usage/training
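As a rough illustration of the first step, here is a minimal sketch of preparing NER training data in the character-offset format, (text, {"entities": [(start, end, label)]}), that spaCy's training documentation uses. The helper make_example and the two sample sentences are hypothetical and only for illustration; the sketch stops at data preparation and does not run any training itself.

```python
from typing import Dict, List, Tuple

# One training item: (text, {"entities": [(start_char, end_char, label), ...]})
Annotation = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]


def make_example(text: str, mentions: List[Tuple[str, str]]) -> Annotation:
    """Build one training tuple from (surface_string, label) pairs.

    Hypothetical helper, not part of spaCy. Each surface string is located
    by its first occurrence after the end of the previous mention, so the
    mentions must be listed in the order they appear in the text.
    """
    entities = []
    cursor = 0
    for surface, label in mentions:
        start = text.find(surface, cursor)
        if start == -1:
            raise ValueError(f"{surface!r} not found in {text!r}")
        end = start + len(surface)
        entities.append((start, end, label))
        cursor = end
    return text, {"entities": entities}


# Illustrative data only; real training needs many more examples.
TRAIN_DATA = [
    make_example("Uber blew through $1 million a week", [("Uber", "ORG")]),
    make_example("Google rebrands its business apps", [("Google", "ORG")]),
]
```

Computing the offsets from the text, instead of typing them by hand, avoids off-by-one errors in the spans. How the tuples are then fed to the model depends on the spaCy version you have installed, so check the training page linked above for the matching loop.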

Secondly, Google's API also returns Wikipedia article links for relevant entities. Is this possible with spaCy too, or using some other technique on top of spaCy's NER results?

I have not seen anyone trying this feature with spaCy.

Thirdly, I noticed that spaCy has a similarity() method [2] that uses GloVe word vectors...

I think this is a clustering problem, and it will not be solved using spaCy's similarity() alone. For clustering, I would highly recommend going through the following link: http://brandonrose.org/clustering
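To make the "buckets of similar documents" idea concrete, here is a minimal sketch of greedy threshold bucketing over document vectors using cosine similarity. It is a deliberately simple stand-in, not the approach from the linked tutorial, and the function names, the 0.8 threshold, and the toy 2-D vectors are all assumptions for illustration. With spaCy, each vector would presumably come from nlp(text).vector.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def bucket(vectors: Sequence[Sequence[float]], threshold: float = 0.8) -> List[List[int]]:
    """Greedily group vector indices into buckets.

    Each vector joins the first bucket whose first member (its representative)
    is at least `threshold` similar to it; otherwise it starts a new bucket.
    """
    buckets: List[List[int]] = []
    for i, vec in enumerate(vectors):
        for members in buckets:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            buckets.append([i])
    return buckets


# Toy 2-D vectors standing in for real document vectors.
toy_vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
buckets = bucket(toy_vectors, threshold=0.8)
print(buckets)  # [[0, 1], [2, 3]]
```

This compares each document only against bucket representatives, not against every other document, so it stays manageable for a few thousand short texts. The result depends on input order and on the threshold, which is one reason a proper clustering algorithm is usually the better choice.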
