Lazy parsing with Stanford CoreNLP to get the sentiment of specific sentences only

Question

Jun 08, 2015, 06:40 PM

parsing performance java sentiment-analysis stanford-nlp

I am looking for ways to optimize the performance of my Stanford CoreNLP sentiment pipeline. I want to get the sentiment of sentences, but only of those sentences that contain specific keywords given as input.

I have tried two approaches.

Approach 1: StanfordCoreNLP pipeline annotating the entire text with sentiment

I defined a pipeline of annotators: tokenize, ssplit, parse, sentiment. I ran it on the entire article, then looked for keywords in each sentence and, if any were present, ran a method that returns the value for that keyword. I was not satisfied, though, because the processing takes a couple of seconds.

This is the code:

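import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

List<String> keywords = ...;
String text = ...;
// sentence index -> predicted sentiment class
Map<Integer,Integer> sentenceSentiment = new HashMap<>();

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
props.setProperty("parse.maxlen", "20");
props.setProperty("tokenize.options", "untokenizable=noneDelete");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation annotation = pipeline.process(text); // takes 2 seconds!!!!
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
for (int i = 0; i < sentences.size(); i++) {
    CoreMap sentence = sentences.get(i);
    if (sentenceContainsKeywords(sentence, keywords)) {
        int sentiment = RNNCoreAnnotations.getPredictedClass(sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class));
        sentenceSentiment.put(i, sentiment); // key is the sentence index, matching Map<Integer,Integer>
    }
}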
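
The snippet above calls sentenceContainsKeywords without showing it. A minimal sketch of such a helper might look like the following. It is an assumption, not the original implementation: it works on the sentence's plain text (which could be taken from the CoreMap, for example via CoreAnnotations.TextAnnotation), ignores case, and uses substring matching, so "screen" would also match "screenshot". The class name KeywordFilter and the sample keywords are made up for illustration.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper: the original post does not show its own implementation.
final class KeywordFilter {
    private KeywordFilter() {}

    /** Returns true if the sentence text contains at least one keyword, ignoring case. */
    static boolean sentenceContainsKeywords(String sentenceText, List<String> keywords) {
        String haystack = sentenceText.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            // Skip empty keywords, which would otherwise match every sentence.
            if (!keyword.isEmpty() && haystack.contains(keyword.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> keywords = List.of("battery", "screen"); // sample keywords, assumed
        System.out.println(sentenceContainsKeywords("The Battery lasts two days.", keywords)); // prints true
        System.out.println(sentenceContainsKeywords("Shipping was slow.", keywords));          // prints false
    }
}
```

Because this check only needs the output of tokenize and ssplit, it can run before any parsing is done.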

Approach 2: StanfordCoreNLP pipeline annotating the entire text with sentences only, separate annotators run on the sentences of interest

Because of the weak performance of the first solution, I defined a second one. I defined a pipeline with the annotators tokenize and ssplit. I looked for keywords in each sentence and, if any were present, created an annotation for that sentence only and ran the remaining annotators on it: ParserAnnotator, BinarizerAnnotator and SentimentAnnotator.

The results were really unsatisfactory because of ParserAnnotator, even though I initialized it with the same properties. Sometimes it took even longer than the entire pipeline run on a document in Approach 1.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.BinarizerAnnotator;
import edu.stanford.nlp.pipeline.ParserAnnotator;
import edu.stanford.nlp.pipeline.SentimentAnnotator;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

List<String> keywords = ...;
String text = ...;
// sentence index -> predicted sentiment class
Map<Integer,Integer> sentenceSentiment = new HashMap<>();

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit"); // parsing, sentiment removed
props.setProperty("parse.maxlen", "20");
props.setProperty("tokenize.options", "untokenizable=noneDelete");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// annotators to be run only on the sentences of interest
ParserAnnotator parserAnnotator = new ParserAnnotator("pa", props);
BinarizerAnnotator binarizerAnnotator = new BinarizerAnnotator("ba", props);
SentimentAnnotator sentimentAnnotator = new SentimentAnnotator("sa", props);

Annotation annotation = pipeline.process(text); // takes <100 ms
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
for (int i = 0; i < sentences.size(); i++) {
    CoreMap sentence = sentences.get(i);
    if (sentenceContainsKeywords(sentence, keywords)) {
        // wrap the single sentence in its own Annotation so the annotators can run on it
        List<CoreMap> listWithSentence = new ArrayList<CoreMap>();
        listWithSentence.add(sentence);
        Annotation sentenceAnnotation = new Annotation(listWithSentence);

        parserAnnotator.annotate(sentenceAnnotation); // takes 50 ms up to 2 seconds!!!!
        binarizerAnnotator.annotate(sentenceAnnotation);
        sentimentAnnotator.annotate(sentenceAnnotation);
        sentence = sentenceAnnotation.get(CoreAnnotations.SentencesAnnotation.class).get(0);

        int sentiment = RNNCoreAnnotations.getPredictedClass(sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class));
        sentenceSentiment.put(i, sentiment); // key is the sentence index, matching Map<Integer,Integer>
    }
}
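
The timings in the comments above were presumably taken by hand. To compare the stages more systematically, a small accumulator along the following lines could be used. This is only a sketch with made-up names (StageTimer, record, print); it is not part of CoreNLP. Note that the first call to a stage can include one-time costs such as JIT warm-up, so per-call averages over several sentences are usually more informative than a single measurement.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration: accumulates wall-clock time per named stage.
final class StageTimer {
    // stage name -> {number of calls, total elapsed nanoseconds}
    private final Map<String, long[]> stats = new LinkedHashMap<>();

    /** Runs the work and adds its elapsed time to the given stage, even if it throws. */
    void record(String stage, Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            long elapsed = System.nanoTime() - start;
            long[] s = stats.computeIfAbsent(stage, k -> new long[2]);
            s[0]++;
            s[1] += elapsed;
        }
    }

    long calls(String stage) {
        long[] s = stats.get(stage);
        return s == null ? 0 : s[0];
    }

    long totalNanos(String stage) {
        long[] s = stats.get(stage);
        return s == null ? 0 : s[1];
    }

    /** Prints one line per stage in the order the stages were first recorded. */
    void print() {
        for (Map.Entry<String, long[]> e : stats.entrySet()) {
            long[] s = e.getValue();
            System.out.printf("%-10s calls=%d total=%.3f ms%n", e.getKey(), s[0], s[1] / 1e6);
        }
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        for (int i = 0; i < 3; i++) {
            // Stand-in workload; in the loop above this would wrap an annotate(...) call.
            timer.record("split", () -> "one two three".split(" "));
        }
        timer.print();
    }
}
```

In the loop above, a call such as timer.record("parse", () -> parserAnnotator.annotate(sentenceAnnotation)) would attribute the time to the parser alone.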

Questions

I wonder why parsing in CoreNLP is not "lazy"? (In my example, that would mean: performed only when the sentiment of a sentence is requested.) Is it for performance reasons?

How can a parser take almost as long for one sentence as it does for an entire article (my article had 7 sentences)? Is it possible to configure it so that it works faster?
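
To make clear what "lazy" means in the first question, here is a minimal sketch of the behaviour I have in mind, written in plain Java with no CoreNLP calls. The names LazyValue, LazyDemo and expensiveSentiment are made up, and expensiveSentiment is only a stand-in for the parse and sentiment steps. The point is that the expensive work runs at most once per sentence, and only for sentences whose sentiment is actually requested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Computes a value on first access and caches it; later calls reuse the cached result. */
final class LazyValue<T> implements Supplier<T> {
    private Supplier<T> compute; // released after the first use
    private T value;
    private boolean done;

    LazyValue(Supplier<T> compute) {
        this.compute = compute;
    }

    @Override
    public synchronized T get() {
        if (!done) {
            value = compute.get();
            done = true;
            compute = null;
        }
        return value;
    }
}

final class LazyDemo {
    static int parseCalls = 0;

    // Hypothetical stand-in for the expensive parse + sentiment steps.
    static int expensiveSentiment(String sentence) {
        parseCalls++;
        return sentence.length() % 5;
    }

    public static void main(String[] args) {
        List<String> sentences = List.of("First sentence.", "Second one.", "Third.");
        List<LazyValue<Integer>> sentiments = new ArrayList<>();
        for (String s : sentences) {
            sentiments.add(new LazyValue<>(() -> expensiveSentiment(s))); // nothing computed yet
        }
        sentiments.get(1).get(); // computes the second sentence only
        sentiments.get(1).get(); // served from the cache
        System.out.println(parseCalls); // prints 1
    }
}
```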