Concurrent processing using Stanford CoreNLP (3.5.2)

Question

Jun 05, 2015, 11:44 PM

I'm running into a concurrency problem when annotating multiple sentences simultaneously. It's unclear to me whether I'm doing something wrong or whether there is a bug in CoreNLP.

My goal is to annotate sentences with the pipeline "tokenize, ssplit, pos, lemma, ner, parse, dcoref" using several threads running in parallel. Each thread allocates its own instance of StanfordCoreNLP and then uses it for the annotation.

The problem is that at some point an exception is thrown:
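To make the intended pattern concrete, here is a minimal sketch of one pipeline instance per worker thread. `FakePipeline` is a hypothetical stand-in for StanfordCoreNLP so that the snippet runs without the models, and `annotateAll` and `CREATED` are illustrative names that do not appear in my actual test. Note that the full reproducer later in this post is cruder and builds a new StanfordCoreNLP inside every task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerThreadPipelineSketch {

    // Hypothetical stand-in for StanfordCoreNLP (assumption, not the real API).
    static class FakePipeline {
        String annotate(String text) {
            return text.toUpperCase();
        }
    }

    // Records every instance that gets created, only to make the sketch checkable.
    static final Set<FakePipeline> CREATED = ConcurrentHashMap.newKeySet();

    // Each worker thread lazily builds its own instance on first use.
    static final ThreadLocal<FakePipeline> PIPELINE = ThreadLocal.withInitial(() -> {
        FakePipeline pipeline = new FakePipeline();
        CREATED.add(pipeline);
        return pipeline;
    });

    static List<String> annotateAll(List<String> sentences, int threads) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String sentence : sentences) {
                // The task only ever touches the instance owned by its own thread.
                futures.add(executor.submit(() -> PIPELINE.get().annotate(sentence)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                // get() rethrows any exception from the task as an ExecutionException.
                results.add(future.get());
            }
            return results;
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> sentences = new ArrayList<>();
        sentences.add("first sentence");
        sentences.add("second sentence");
        System.out.println(annotateAll(sentences, 2));
        System.out.println("Instances created: " + CREATED.size());
    }
}
```

Using `Future.get()` instead of `execute` also means a failure in any task surfaces in the calling thread rather than only on the error stream.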

java.util.ConcurrentModificationException
	at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
	at java.util.ArrayList$Itr.next(ArrayList.java:851)
	at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:463)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
	at edu.stanford.nlp.trees.GrammaticalStructure.<init>(GrammaticalStructure.java:201)
	at edu.stanford.nlp.trees.EnglishGrammaticalStructure.<init>(EnglishGrammaticalStructure.java:89)
	at edu.stanford.nlp.semgraph.SemanticGraphFactory.makeFromTree(SemanticGraphFactory.java:139)
	at edu.stanford.nlp.pipeline.DeterministicCorefAnnotator.annotate(DeterministicCorefAnnotator.java:89)
	at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:68)
	at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:412)
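
For reference, the top three frames of the trace are the generic fail-fast iterator check of `ArrayList`, reached through a `Collections.unmodifiableCollection` view. The sketch below reproduces that check in isolation, in a single thread. It only illustrates the mechanism and is not CoreNLP code; the class and method names are made up. If the same mechanism applies here, some list would be getting modified while another thread iterates over it, but I have not confirmed which one.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.List;

class ComodificationDemo {

    // Returns true if iterating the read-only view fails with a
    // ConcurrentModificationException after the backing list is changed.
    static boolean triggersCme() {
        List<String> backing = new ArrayList<>();
        backing.add("a");
        backing.add("b");
        backing.add("c");

        // The view is read-only, but it still reflects changes to the backing list.
        Collection<String> view = Collections.unmodifiableCollection(backing);

        try {
            for (String item : view) {
                if (item.equals("a")) {
                    // Structural modification while an iterator is active.
                    backing.add("d");
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("CME thrown: " + triggersCme()); // prints "CME thrown: true"
    }
}
```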

I'm attaching example code of an application that reproduces the problem in about 20 seconds on my Core i3 370M laptop (Win 7 64-bit, Java 1.8.0.45 64-bit). The app reads an XML file from the Recognizing Textual Entailment (RTE) corpora and then parses all sentences simultaneously using standard Java concurrency classes. The path to a local RTE XML file must be given as a command-line argument. In my tests I used the publicly available XML file here: http://www.nist.gov/tac/data/RTE/RTE3-DEV-FINAL.tar.g

package semante.parser.stanford.server;

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlAttribute;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class StanfordMultiThreadingTest {

	@XmlRootElement(name = "entailment-corpus")
	@XmlAccessorType (XmlAccessType.FIELD)
	public static class Corpus {
		@XmlElement(name = "pair")
		private List<Pair> pairList = new ArrayList<Pair>();

		public void addPair(Pair p) {pairList.add(p);}
		public List<Pair> getPairList() {return pairList;}
	}

	@XmlRootElement(name="pair")
	public static class Pair {

		@XmlAttribute(name = "id")
		String id;

		@XmlAttribute(name = "entailment")
		String entailment;

		@XmlElement(name = "t")
		String t;

		@XmlElement(name = "h")
		String h;

		private Pair() {}

		public Pair(int id, boolean entailment, String t, String h) {
			this();
			this.id = Integer.toString(id);
			this.entailment = entailment ? "YES" : "NO";
			this.t = t;
			this.h = h;
		}

		public String getId() {return id;}
		public String getEntailment() {return entailment;}
		public String getT() {return t;}
		public String getH() {return h;}
	}
	
	// output stream that discards everything written to it
	static class NullStream extends OutputStream {
		@Override
		public void write(int b) {}
	}

	private Corpus corpus;
	private Unmarshaller unmarshaller;
	private ExecutorService executor;

	public StanfordMultiThreadingTest() throws Exception {
		JAXBContext jaxbCtx = JAXBContext.newInstance(Pair.class, Corpus.class);
		unmarshaller = jaxbCtx.createUnmarshaller();
		executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
	}

	public void readXML(String fileName) throws Exception {
		System.out.println("Reading XML - Started");
		corpus = (Corpus) unmarshaller.unmarshal(new InputStreamReader(new FileInputStream(fileName), StandardCharsets.UTF_8));
		System.out.println("Reading XML - Ended");
	}

	public void parseSentences() throws Exception {
		System.out.println("Parsing - Started");

		// turn pairs into a list of sentences
		List<String> sentences = new ArrayList<String>();
		for (Pair pair : corpus.getPairList()) {
			sentences.add(pair.getT());
			sentences.add(pair.getH());
		}

		// prepare the properties
		final Properties props = new Properties();
		props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");

		// first run is long since models are loaded
		new StanfordCoreNLP(props);

		// to avoid the CoreNLP initialization prints (e.g. "Adding annotation pos")
		final PrintStream nullPrintStream = new PrintStream(new NullStream());
		PrintStream err = System.err;
		System.setErr(nullPrintStream);

		int totalCount = sentences.size();
		AtomicInteger counter = new AtomicInteger(0);

		// use java concurrency to parallelize the parsing
		for (String sentence : sentences) {
			executor.execute(new Runnable() {
				@Override
				public void run() {
					try {
						StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
						Annotation annotation = new Annotation(sentence);
						pipeline.annotate(annotation);
						if (counter.incrementAndGet() % 20 == 0) {
							System.out.println("Done: " + String.format("%.2f", counter.get()*100/(double)totalCount));
						};
					} catch (Exception e) {
						System.setErr(err);
						e.printStackTrace();
						System.setErr(nullPrintStream);
						executor.shutdownNow();
					}
				}
			});
		}
		executor.shutdown();
		
		System.out.println("Waiting for parsing to end.");
		executor.awaitTermination(10, TimeUnit.MINUTES);
		// restore the original error stream once the wait is over
		System.setErr(err);

		System.out.println("Parsing - Ended");
	}

	public static void main(String[] args) throws Exception {
		StanfordMultiThreadingTest smtt = new StanfordMultiThreadingTest();
		smtt.readXML(args[0]);
		smtt.parseSentences();
	}

}

While trying to find background information, I came across answers by Christopher Manning and Gabor Angeli from Stanford indicating that current versions of Stanford CoreNLP should be thread-safe. However, a recent bug report on CoreNLP version 3.4.1 describes a concurrency problem. As mentioned in the title, I am using version 3.5.2.

It's unclear to me whether the problem I'm facing is due to a bug or due to a mistake in the way I use the package. I'd appreciate it if someone more knowledgeable could shed some light on this. I hope the sample code is useful for reproducing the problem. Thanks!