Categories: java, linux, stanford-nlp, eof

Error When Loading NER Classifier - Unexpected end of ZLIB input stream

1 answer

I have trained a custom NER model with Stanford-NER. I created a properties file and used the -serverProperties argument with the java command to start my server and load my custom NER model (directions I followed from another question of mine, seen here). However, when the server attempts to load my custom model, it fails with this error: java.io.EOFException: Unexpected end of ZLIB input stream

The stderr.log output with error is as follows:

[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - setting default constituency parser
[main] INFO CoreNLP - warning: cannot find edu/stanford/nlp/models/srparser/englishSR.ser.gz
[main] INFO CoreNLP - using: edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz instead
[main] INFO CoreNLP - to use shift reduce parser download English models jar from:
[main] INFO CoreNLP - http://stanfordnlp.github.io/CoreNLP/download.html
[main] INFO CoreNLP -     Threads: 4
[main] INFO CoreNLP - Liveness server started at /0.0.0.0:9000
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0.0.0.0:80
[pool-1-thread-3] INFO CoreNLP - [/127.0.0.1:35546] API call w/annotators tokenize,ssplit,pos,lemma,depparse,natlog,ner,openie
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-1-thread-3] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model file: edu/stanford/nlp/models/parser/nndep/english_UD.gz ...
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 99996, Elapsed Time: 12.297 (s)
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [13.6 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator natlog
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
java.io.EOFException: Unexpected end of ZLIB input stream
    at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2620)
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2636)
    at java.io.ObjectInputStream$BlockDataInputStream.readDoubles(ObjectInputStream.java:3333)
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1920)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529)
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
    at edu.stanford.nlp.ie.crf.CRFClassifier.loadClassifier(CRFClassifier.java:2650)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1462)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1494)
    at edu.stanford.nlp.ie.crf.CRFClassifier.getClassifier(CRFClassifier.java:2963)
    at edu.stanford.nlp.ie.ClassifierCombiner.loadClassifierFromPath(ClassifierCombiner.java:282)
    at edu.stanford.nlp.ie.ClassifierCombiner.loadClassifiers(ClassifierCombiner.java:266)
    at edu.stanford.nlp.ie.ClassifierCombiner.<init>(ClassifierCombiner.java:141)
    at edu.stanford.nlp.ie.NERClassifierCombiner.<init>(NERClassifierCombiner.java:128)
    at edu.stanford.nlp.pipeline.AnnotatorImplementations.ner(AnnotatorImplementations.java:121)
    at edu.stanford.nlp.pipeline.AnnotatorFactories$6.create(AnnotatorFactories.java:273)
    at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:152)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:451)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:154)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:145)
    at edu.stanford.nlp.pipeline.StanfordCoreNLPServer.mkStanfordCoreNLP(StanfordCoreNLPServer.java:273)
    at edu.stanford.nlp.pipeline.StanfordCoreNLPServer.access$500(StanfordCoreNLPServer.java:50)
    at edu.stanford.nlp.pipeline.StanfordCoreNLPServer$CoreNLPHandler.handle(StanfordCoreNLPServer.java:583)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
    at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:82)
    at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:675)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
    at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:647)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

I have googled this error, and most of what I found relates to a Java issue from 2007-2010 where an EOFException is "arbitrarily" thrown. This information is from here.

"When using gzip (via new Deflater(Deflater.BEST_COMPRESSION, true)), for some files, an EOFException is thrown at the end of inflating. Although the file is correct, the bug is that the EOFException is thrown inconsistently. For some files it is thrown, for others it is not."

Answers to other people's questions about this error state that you have to close the output streams for the gzip...? I am not entirely sure what that means, and I don't know how I would apply that advice, since Stanford-NER is the software creating the gzip file for me.
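For context, the "close your output streams" advice applies to code that *writes* gzip data: if a GZIPOutputStream is never closed, the compressed trailer is never flushed, and readers later fail with exactly this kind of unexpected EOF. A minimal standalone sketch of the correct pattern (this is not Stanford code, just an illustration of the general point):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCloseDemo {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("demo", ".gz");

        // try-with-resources guarantees close(), which finishes the deflater
        // and writes the gzip trailer; skipping close() is what produces
        // files that later fail with "Unexpected end of ZLIB input stream".
        try (GZIPOutputStream out = new GZIPOutputStream(Files.newOutputStream(file))) {
            out.write("hello gzip".getBytes("UTF-8"));
        }

        // Read it back to confirm the stream is complete and well-formed.
        try (GZIPInputStream in = new GZIPInputStream(Files.newInputStream(file))) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) buf.write(chunk, 0, n);
            System.out.println(buf.toString("UTF-8")); // prints "hello gzip"
        }
        Files.delete(file);
    }
}
```

Since Stanford-NER serializes the model itself, this is nothing you can fix from the outside; it just explains what the advice in those other answers means.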

Question: What actions can I take to eliminate this error? I am hoping this has happened to others in the past. I am also looking for feedback from @StanfordNLPHelp as to whether similar issues have arisen before and whether something is being done, or has been done, to the CoreNLP software to eliminate this issue. If there is a solution from CoreNLP, what files do I need to change, where are these files located within the CoreNLP framework, and what changes do I need to make?

ADDED INFO (PER @StanfordNLPHelp comments):

My model was trained using the directions found here. To train the model I used a TSV, as outlined in the directions, which contained text from around 90 documents. I know this is not a substantial amount of data to train with, but we are just in the testing phase and will improve the model as we acquire more data.

With this TSV file and the Stanford-NER software I ran the command below.

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop

I then had my model built and was even able to load it and successfully tag a larger corpus of text with the NER GUI that comes with the Stanford-NER software.

While troubleshooting why I was unable to get the model to work, I also attempted to update my server.properties file with the file path to the "3 class model" that comes standard with CoreNLP. Again it failed with the same error.

The fact that both my custom model and the 3 class model work in the Stanford-NER software but fail to load here makes me believe my custom model is not the issue, and that there is some problem with how the CoreNLP software loads these models through the -serverProperties argument. Or it could be something I am completely unaware of.

The properties file I used to train my NER model was similar to the one in the directions, with the train file and the output file name changed. It looks like this:

# location of the training file
trainFile = custom-model-trainingfile.tsv
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it smaller, and faster to load
serializeTo = custome-ner-model.ser.gz

# structure of your training file; this tells the classifier that
# the word is in column 0 and the correct answer is in column 1
map = word=0,answer=1

# This specifies the order of the CRF: order 1 means that features
# apply at most to a class pair of previous class and current class
# or current class and next class.
maxLeft=1

# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC

My server.properties file contained only one line:

ner.model = /path/to/custom_model.ser.gz

I also added /path/to/custom_model to the $CLASSPATH variable in the start-up script, changing the line CLASSPATH="$CLASSPATH:$JAR" to CLASSPATH="$CLASSPATH:$JAR:/path/to/custom_model.ser.gz". I am not sure whether this step is necessary, because the ZLIB error occurs first; I just wanted to include it for completeness.

I attempted to "gunzip" my custom model with the command gunzip custom_model.ser.gz and got an error similar to the one I get when trying to load the model: gzip: custom_model.ser.gz: unexpected end of file
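The gunzip failure is a strong hint that the file on disk is simply truncated (for example by an interrupted copy or training run), rather than there being a bug in how CoreNLP loads it. Truncation alone reproduces exactly the server's exception, as this standalone sketch (nothing Stanford-specific) shows:

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TruncatedGzipDemo {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("model", ".gz");

        // Write a valid gzip file with an incompressible payload, then chop
        // off its last 100 bytes to simulate a truncated copy or download.
        byte[] data = new byte[10000];
        new Random(42).nextBytes(data);
        try (GZIPOutputStream out = new GZIPOutputStream(Files.newOutputStream(file))) {
            out.write(data);
        }
        byte[] bytes = Files.readAllBytes(file);
        Files.write(file, Arrays.copyOf(bytes, bytes.length - 100));

        // Reading the truncated file fails the same way the server does.
        try (GZIPInputStream in = new GZIPInputStream(Files.newInputStream(file))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) { /* drain */ }
            System.out.println("read ok");
        } catch (EOFException e) {
            System.out.println(e.getMessage()); // "Unexpected end of ZLIB input stream"
        }
        Files.delete(file);
    }
}
```

If the model file behaves like this, re-copying it from the machine where training ran (or re-running training) is the likely fix.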

All answers to this question (question ID 44006916):

The best answer:

I'm assuming you downloaded Stanford CoreNLP 3.7.0 and have a folder somewhere called stanford-corenlp-full-2016-10-31. For the sake of this example, let's assume it's in /Users/stanfordnlphelp/stanford-corenlp-full-2016-10-31 (change this to match your setup).

Also, just to clarify: when you run a Java program, it looks in the CLASSPATH for compiled code and resources. A common way to set the CLASSPATH is to set the CLASSPATH environment variable with the export command.

Typically Java compiled code and resources are stored in jar files.

If you look at stanford-corenlp-full-2016-10-31 you'll see a bunch of .jar files. One of them is called stanford-corenlp-3.7.0-models.jar. You can look at what's inside a jar file with this command: jar tf stanford-corenlp-3.7.0-models.jar.

You'll notice when you look inside that file that there are (among others) various NER models. For instance, you should see this file:

edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz 

in the models jar.

So a reasonable way for us to get things working is to run the server and tell it to only load 1 model (since by default it will load 3).

  1. Run these commands in one window (in the same directory as the file ner-server.properties):

    export CLASSPATH=/Users/stanfordnlphelp/stanford-corenlp-full-2016-10-31/*:
    java -Xmx12g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 -serverProperties ner-server.properties

with ner-server.properties being a 2-line file with these 2 lines:

annotators = tokenize,ssplit,pos,lemma,ner
ner.model = edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz

The export command above is putting EVERY jar in that directory on the CLASSPATH. That is what the * means. So stanford-corenlp-3.7.0-models.jar should be on the CLASSPATH. Thus when the Java code runs, it will be able to find edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz.
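To make the resource-lookup mechanism concrete, here is a small standalone sketch (not Stanford code): CoreNLP resolves model paths as classpath resources, and the demo below performs the same kind of lookup using a resource that exists on every JVM, since the Stanford models jar may not be present when you try it.

```java
public class ClasspathLookupDemo {
    public static void main(String[] args) {
        // CoreNLP resolves paths like
        // "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz"
        // by searching the CLASSPATH for a matching resource. The same lookup
        // is demonstrated here with "java/lang/String.class", which is
        // guaranteed to be findable on any JVM.
        java.net.URL url = ClassLoader.getSystemResource("java/lang/String.class");
        System.out.println(url != null ? "found" : "missing");
    }
}
```

With the models jar on the CLASSPATH, the equivalent call for the NER model path would likewise return a non-null URL, which is why the wildcard export above is enough for the server to find the bundled models.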

  2. In a different Terminal window, issue this command:

    wget --post-data 'Joe Smith lives in Hawaii.' 'localhost:9000/?properties={"outputFormat":"json"}' -O - 

When this runs, you should see in the first window (where the server is running) that only this model is loading: edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz.

Note that if you removed ner.model from your properties file and repeated these steps, 3 models would load instead of 1.

Please let me know if that all works or not.

Let's assume I made an NER model called custom_model.ser.gz, and that file is what Stanford CoreNLP output after the training process. Let's say I put it in the folder /Users/stanfordnlphelp/.

If steps 1 and 2 worked, you should be able to alter ner-server.properties to this:

annotators = tokenize,ssplit,pos,lemma,ner
ner.model = /Users/stanfordnlphelp/custom_model.ser.gz

And when you do the same thing, it will show your custom model loading. There should not be any kind of gzip issue. If you are still having one, please let me know what kind of system you are running this on (Mac OS X, Unix, Windows, etc.).

And to confirm: you said that you have run your custom NER model with the standalone Stanford NER software, right? If so, that suggests the model file itself is fine.
