Как написать правильный файл декодирования на основе заданного .proto, чтение из .pb

Question

Apr 09, 2015, 09:02 AM

Как написать правильный файл декодирования на основе заданного .proto, чтение из .pb

Основываясь на ответе на этовопрос Я думаю, что я предоставил свой файл .pb с "неисправным декодером".

Это данные, которые я пытаюсь декодировать.

На основеListPeople.java пример, приведенный вУчебная документация по JavaЯ попытался написать что-то похожее, чтобы начать собирать эти данные, я написал это:

import cc.refectorie.proj.relation.protobuf.DocumentProtos.Document;
import cc.refectorie.proj.relation.protobuf.DocumentProtos.Document.Sentence;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintStream;


public class ListDocument
{
    // Iterates though all people in the AddressBook and prints info about them.
    static void Print(Document document)
    {
        for ( Sentence sentence: document.getSentencesList() )
        {
            for(int i=0; i < sentence.getTokensCount(); i++)
            {
                System.out.println(" getTokens(" + i + ": " + sentence.getTokens(i) );
            }
        }
    }

    // Main function:  Reads the entire address book from a file and prints all
    //   the information inside.
    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage:  ListPeople ADDRESS_BOOK_FILE");
            System.exit(-1);
        }

        // Read the existing address book.
        Document addressBook =
                Document.parseFrom(new FileInputStream(args[0]));

        Print(addressBook);
    }
}

Но когда я запускаю это, я получаю эту ошибку

Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.
    at com.google.protobuf.InvalidProtocolBufferException.invalidEndTag(InvalidProtocolBufferException.java:94)
    at com.google.protobuf.CodedInputStream.checkLastTagWas(CodedInputStream.java:174)
    at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:194)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:210)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:215)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
    at cc.refectorie.proj.relation.protobuf.DocumentProtos$Document.parseFrom(DocumentProtos.java:4770)
    at ListDocument.main(ListDocument.java:40)

поэтому, как я уже сказал выше, я думаю, что это связано со мной, а не с определением декодера. Есть ли способ посмотреть на файл .proto, который я пытаюсь использовать, и найти способ просто прочитать все эти данные?

Есть ли способ посмотреть на этот файл .proto и посмотреть, что я делаю не так?

Это первые несколько строк файла, которые я хочу прочитать:

Ü
&/guid/9202a8c04000641f8000000003221072&/guid/9202a8c04000641f80000000004cfd50NA"Ö

S/m/vinci8/data1/riedel/projects/relation/kb/nyt1/docstore/2007-joint/1850511.xml.pb„€€€øÿÿÿÿƒ€€€øÿÿÿÿ"PERSON->PERSON"'inverse_false|PERSON|on bass and|PERSON"/inverse_false|with|PERSON|on bass and|PERSON|on"7inverse_false|, with|PERSON|on bass and|PERSON|on drums"$inverse_false|PERSON|IN NN CC|PERSON",inverse_false|with|PERSON|IN NN CC|PERSON|on"4inverse_false|, with|PERSON|IN NN CC|PERSON|on drums"`str:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"]str:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Rstr:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON"Adep:[NMOD]->|PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON"dir:->|PERSON|->-><-<-<-|PERSON"Sstr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"Adep:PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Pstr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Adep:PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Estr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON*ŒThe occasion was suitably exceptional : a reunion of the 1970s-era Sam Rivers Trio , with Dave Holland on bass and Barry Altschul on drums ."¬
S/m/vinci8/data1/riedel/projects/relation/kb/nyt1/docstore/2007-joint/1849689.xml.pb†€€€øÿÿÿÿ…€€€øÿÿÿÿ"PERSON->PERSON"'inverse_false|PERSON|on bass and|PERSON"/inverse_false|with|PERSON|on bass and|PERSON|on"7inverse_false|, with|PERSON|on bass and|PERSON|on drums"$inverse_false|PERSON|IN NN CC|PERSON",inverse_false|with|PERSON|IN NN CC|PERSON|on"4inverse_false|, with|PERSON|IN NN CC|PERSON|on drums"cstr:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"`str:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Ustr:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON"Cdep:[NMOD]->|PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON"dir:->|PERSON|->-><-<-<-|PERSON"Vstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"Cdep:PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Sstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Cdep:PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Hstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON*ÊTonight he brings his energies and expertise to the Miller Theater for the festival 's thrilling finale : a reunion of the 1970s Sam Rivers Trio , with Dave Holland on bass and Barry Altschul on drums .â
&/guid/9202a8c04000641f80000000004cfd50&/guid/9202a8c04000641f8000000003221072NA"Ù

РЕДАКТИРОВАТЬ

Это файл, который другой исследователь использовал для разбора этих файлов, поэтому мне сказали, возможно ли, что я мог бы использовать это?

package edu.stanford.nlp.kbp.slotfilling.multir;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.zip.GZIPInputStream;

import edu.stanford.nlp.kbp.slotfilling.classify.MultiLabelDataset;
import edu.stanford.nlp.kbp.slotfilling.common.Log;
import edu.stanford.nlp.kbp.slotfilling.multir.DocumentProtos.Relation;
import edu.stanford.nlp.stats.ClassicCounter;
import edu.stanford.nlp.stats.Counter;
import edu.stanford.nlp.util.ErasureUtils;
import edu.stanford.nlp.util.HashIndex;
import edu.stanford.nlp.util.Index;

/**
 * Converts Hoffmann's data in protobuf format to our MultiLabelDataset
 * @author Mihai
 *
 */
public class ProtobufToMultiLabelDataset {
  static class RelationAndMentions {
    String arg1;
    String arg2;
    Set<String> posLabels;
    Set<String> negLabels;
    List<Mention> mentions;

    public RelationAndMentions(String types, String a1, String a2) {
      arg1 = a1;
      arg2 = a2;
      String [] rels = types.split(",");
      posLabels = new HashSet<String>();
      for(String r: rels){
        if(! r.equals("NA")) posLabels.add(r.trim());
      }
      negLabels = new HashSet<String>(); // will be populated later
      mentions = new ArrayList<Mention>();
    }
  };

  static class Mention {
    List<String> features;
    public Mention(List<String> feats) {
      features = feats;
    }
  }

    public static void main(String[] args) throws Exception {
      String input = args[0];

      InputStream is = new GZIPInputStream(
        new BufferedInputStream
        (new FileInputStream(input)));

      toMultiLabelDataset(is);
      is.close();
    }

    public static MultiLabelDataset<String, String> toMultiLabelDataset(InputStream is) throws IOException {
      List<RelationAndMentions> relations = toRelations(is, true);
      MultiLabelDataset<String, String> dataset = toDataset(relations);
      return dataset;
    }

    public static void toDatums(InputStream is,
        List<List<Collection<String>>> relationFeatures,
        List<Set<String>> labels) throws IOException {
      List<RelationAndMentions> relations = toRelations(is, false);
      toDatums(relations, relationFeatures, labels);
    }

    private static void toDatums(List<RelationAndMentions> relations,
        List<List<Collection<String>>> relationFeatures,
      List<Set<String>> labels) {
    for(RelationAndMentions rel: relations) {
      labels.add(rel.posLabels);
      List<Collection<String>> mentionFeatures = new ArrayList<Collection<String>>();
      for(int i = 0; i < rel.mentions.size(); i ++){
        mentionFeatures.add(rel.mentions.get(i).features);
      }
      relationFeatures.add(mentionFeatures);
    }
    assert(labels.size() == relationFeatures.size());
    }

    public static List<RelationAndMentions> toRelations(InputStream is, boolean generateNegativeLabels) throws IOException {
      //
      // Parse the protobuf
      //
    // all relations are stored here
    List<RelationAndMentions> relations = new ArrayList<RelationAndMentions>();
    // all known relations (without NIL)
    Set<String> relTypes = new HashSet<String>();
    Map<String, Map<String, Set<String>>> knownRelationsPerEntity =
      new HashMap<String, Map<String,Set<String>>>();
    Counter<Integer> labelCountHisto = new ClassicCounter<Integer>();
    Relation r = null;
    while ((r = Relation.parseDelimitedFrom(is)) != null) {
      RelationAndMentions relation = new RelationAndMentions(
          r.getRelType(), r.getSourceGuid(), r.getDestGuid());
      labelCountHisto.incrementCount(relation.posLabels.size());
      relTypes.addAll(relation.posLabels);
      relations.add(relation);

      for(int i = 0; i < r.getMentionCount(); i ++) {
        DocumentProtos.Relation.RelationMentionRef mention = r.getMention(i);
        // String s = mention.getSentence();
        relation.mentions.add(new Mention(mention.getFeatureList()));
      }

      for(String l: relation.posLabels) {
        addKnownRelation(relation.arg1, relation.arg2, l, knownRelationsPerEntity);
      }
    }
    Log.severe("Loaded " + relations.size() + " relations.");
    Log.severe("Found " + relTypes.size() + " relation types: " + relTypes);
    Log.severe("Label count histogram: " + labelCountHisto);

    Counter<Integer> slotCountHisto = new ClassicCounter<Integer>();
    for(String e: knownRelationsPerEntity.keySet()) {
      slotCountHisto.incrementCount(knownRelationsPerEntity.get(e).size());
    }
    Log.severe("Slot count histogram: " + slotCountHisto);
    int negativesWithKnownPositivesCount = 0, totalNegatives = 0;
    for(RelationAndMentions rel: relations) {
      if(rel.posLabels.size() == 0) {
        if(knownRelationsPerEntity.get(rel.arg1) != null &&
           knownRelationsPerEntity.get(rel.arg1).size() > 0) {
          negativesWithKnownPositivesCount ++;
        }
        totalNegatives ++;
      }
    }
    Log.severe("Found " + negativesWithKnownPositivesCount + "/" + totalNegatives +
        " negative examples with at least one known relation for arg1.");

    Counter<Integer> mentionCountHisto = new ClassicCounter<Integer>();
    for(RelationAndMentions rel: relations) {
      mentionCountHisto.incrementCount(rel.mentions.size());
      if(rel.mentions.size() > 100)
        Log.fine("Large relation: " + rel.mentions.size() + "\t" + rel.posLabels);
    }
    Log.severe("Mention count histogram: " + mentionCountHisto);

    //
    // Detect the known negatives for each source entity
    //
    if(generateNegativeLabels) {
      for(RelationAndMentions rel: relations) {
        Set<String> negatives = new HashSet<String>(relTypes);
        negatives.removeAll(rel.posLabels);
        rel.negLabels = negatives;
      }
    }

    return relations;
    }

    private static MultiLabelDataset<String, String> toDataset(List<RelationAndMentions> relations) {
    int [][][] data = new int[relations.size()][][];
    Index<String> featureIndex = new HashIndex<String>();
    Index<String> labelIndex = new HashIndex<String>();
    Set<Integer> [] posLabels = ErasureUtils.<Set<Integer> []>uncheckedCast(new Set[relations.size()]);
    Set<Integer> [] negLabels = ErasureUtils.<Set<Integer> []>uncheckedCast(new Set[relations.size()]);

    int offset = 0, posCount = 0;
    for(RelationAndMentions rel: relations) {
      Set<Integer> pos = new HashSet<Integer>();
      Set<Integer> neg = new HashSet<Integer>();
      for(String l: rel.posLabels) {
        pos.add(labelIndex.indexOf(l, true));
      }
      for(String l: rel.negLabels) {
        neg.add(labelIndex.indexOf(l, true));
      }
      posLabels[offset] = pos;
      negLabels[offset] = neg;
      int [][] group = new int[rel.mentions.size()][];
      for(int i = 0; i < rel.mentions.size(); i ++){
        List<String> sfeats = rel.mentions.get(i).features;
        int [] features = new int[sfeats.size()];
        for(int j = 0; j < sfeats.size(); j ++) {
          features[j] = featureIndex.indexOf(sfeats.get(j), true);
        }
        group[i] = features;
      }
      data[offset] = group;
      posCount += posLabels[offset].size();
      offset ++;
    }

    Log.severe("Creating a dataset with " + data.length + " datums, out of which " + posCount + " are positive.");
    MultiLabelDataset<String, String> dataset = new MultiLabelDataset<String, String>(
        data, featureIndex, labelIndex, posLabels, negLabels);
    return dataset;
    }

    private static void addKnownRelation(String arg1, String arg2, String label,
        Map<String, Map<String, Set<String>>> knownRelationsPerEntity) {
      Map<String, Set<String>> myRels = knownRelationsPerEntity.get(arg1);
      if(myRels == null) {
        myRels = new HashMap<String, Set<String>>();
        knownRelationsPerEntity.put(arg1, myRels);
      }
      Set<String> mySlots = myRels.get(label);
      if(mySlots == null) {
        mySlots = new HashSet<String>();
        myRels.put(label, mySlots);
      }
      mySlots.add(arg2);
    }
}

Как написать правильный файл декодирования на основе заданного .proto, чтение из .pb

Ответы на вопрос(2)

Ваш ответ на вопрос

Популярные вопросы

Вы очень активны! Это здорово!

Как написать правильный файл декодирования на основе заданного .proto, чтение из .pb

Ответы на вопрос(2)

Ваш ответ на вопрос

Популярные вопросы