lumberjack


Commits

List of commits on branch master.
  • aa23c10700643546e684da2060c1189b2d502c8b: Fix clippy warnings (sebpuetz, committed 5 years ago)
  • 4d7ea7a1019e9a0f14a49ddee350ddf90a75c662: Fix resetting of NT spans. (sebpuetz, committed 6 years ago)
  • c773f373dad9e265e38c56c5a0b4a37bcfd9cf2e: Add secondary edges. (sebpuetz, committed 6 years ago)
  • 4b254d328113b419d6aaaad357d7b4f2b429439f: Add NegraWriter. (sebpuetz, committed 6 years ago)
  • 97f76432f5ed7ae0824615d5ad87554e6ac2e82c: Rework NegraReader. (sebpuetz, committed 6 years ago)
  • b2b811ee73d906a9dec66d74124c37a3ee776b3b: Distinguish missing features and features without values. (sebpuetz, committed 6 years ago)

README

lumberjack

Read and process constituency trees in various formats.

Install:

  • From crates.io:
cargo install lumberjack-utils
  • From GitHub:
cargo install --git https://github.com/sebpuetz/lumberjack

Usage as standalone:

  • Convert a treebank in NEGRA export 4 format to bracketed TueBa V2 format, projectivizing the trees:
lumberjack-conversion --input_file treebank.negra --input_format negra \
    --output_format tueba --output_file treebank.tueba --projectivize
  • Retain only the root node, NPs and PPs, and print to simple bracketed format:
echo "NP PP" > filter_set.txt
lumberjack-conversion --input_file treebank.simple --input_format simple \
    --output_format simple --output_file treebank.filtered \
    --filter filter_set.txt
  • Convert a treebank from simple bracketed format to CONLLX format and annotate the parent tags of terminals as features:
lumberjack-conversion --input_file treebank.simple --input_format simple \
    --output_format conllx --output_file treebank.conll --parent parent
  • Apply several modifications in the following order:
  1. Reattach all terminals whose part-of-speech tag starts with $ to the root node.
  2. Remove all nonterminals except the root, Ss, NPs, PPs and VPs.
  3. Assign unique identifiers to terminals based on the closest S.
  4. Insert nodes with the label "label" above terminals that aren't dominated by an NP or PP.
  5. Annotate the label of the parent node on terminals.
  6. Print to CONLLX format with annotations.
echo "S VP NP PP" > filter_set.txt
echo "NP PP" > insert_set.txt
echo "S" > id_set.txt
lumberjack-conversion --input_file treebank.simple --input_format simple \
    --output_format conllx --filter filter_set.txt \
    --insertion_set insert_set.txt --insertion_label label \
    --id_set id_set.txt --reattach '$' \
    --parent parent --output_file treebank.conllx
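To make the filtering and parent-annotation steps above more concrete, here is a minimal sketch on a toy tree type. It uses only the standard library and is not lumberjack's implementation: `Node`, `t`, `nt`, `filter_tree` and `parent_labels` are hypothetical names introduced for illustration, and the sketch assumes that children of a removed nonterminal are reattached to the nearest surviving ancestor.

```rust
use std::collections::HashSet;

/// Toy tree type for illustration only; this is NOT lumberjack's `Tree`.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Terminal { form: String, pos: String },
    NonTerminal { label: String, children: Vec<Node> },
}

/// Hypothetical helper: builds a terminal node.
fn t(form: &str, pos: &str) -> Node {
    Node::Terminal { form: form.to_string(), pos: pos.to_string() }
}

/// Hypothetical helper: builds a nonterminal node.
fn nt(label: &str, children: Vec<Node>) -> Node {
    Node::NonTerminal { label: label.to_string(), children }
}

/// Drops every nonterminal whose label is not in `keep` and splices its
/// (already filtered) children into the nearest surviving ancestor.
fn filter_children(children: Vec<Node>, keep: &HashSet<&str>) -> Vec<Node> {
    let mut out = Vec::new();
    for child in children {
        match child {
            Node::Terminal { .. } => out.push(child),
            Node::NonTerminal { label, children } => {
                let kept = filter_children(children, keep);
                if keep.contains(label.as_str()) {
                    out.push(Node::NonTerminal { label, children: kept });
                } else {
                    out.extend(kept);
                }
            }
        }
    }
    out
}

/// Filters all nonterminals below the root; the root itself is always kept.
fn filter_tree(root: Node, keep: &HashSet<&str>) -> Node {
    match root {
        Node::NonTerminal { label, children } => Node::NonTerminal {
            label,
            children: filter_children(children, keep),
        },
        terminal => terminal,
    }
}

/// Collects `(form, parent label)` pairs for all terminals, left to right.
/// A terminal without a parent gets an empty label.
fn parent_labels(node: &Node, parent: Option<&str>, out: &mut Vec<(String, String)>) {
    match node {
        Node::Terminal { form, .. } => {
            out.push((form.clone(), parent.unwrap_or("").to_string()));
        }
        Node::NonTerminal { label, children } => {
            for child in children {
                parent_labels(child, Some(label.as_str()), out);
            }
        }
    }
}
```

With a keep set of NP and PP, a VP directly below the root is removed and its terminals end up attached to the root, so `parent_labels` reports the root label for them.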

Usage as Rust library:

  • Read and projectivize trees from NEGRA format and print to simple bracketed format:
use std::fs::File;
use std::io::BufReader;

use lumberjack::io::{NegraReader, PTBFormat};
use lumberjack::Projectivize;

fn print_negra(path: &str) {
    let file = File::open(path).unwrap();
    let reader = NegraReader::new(BufReader::new(file));
    for tree in reader {
        let mut tree = tree.unwrap();
        tree.projectivize();
        println!("{}", PTBFormat::Simple.tree_to_string(&tree).unwrap());
    }
}
  • Filter nonterminal nodes from trees in a treebank and print to simple bracketed format:
use lumberjack::{io::PTBFormat, Tree, TreeOps, util::LabelSet};

fn filter_nodes(iter: impl Iterator<Item=Tree>, set: LabelSet) {
    for mut tree in iter {
        tree.filter_nonterminals(|tree, nt| set.matches(tree[nt].label())).unwrap();
        println!("{}", PTBFormat::Simple.tree_to_string(&tree).unwrap());
    }
}
  • Convert a treebank in simple bracketed format to CONLLX with constituency structure encoded in the features field:
use conllx::graph::Sentence;
use lumberjack::io::Encode;
use lumberjack::{Tree, TreeOps, UnaryChains};

fn to_conllx(iter: impl Iterator<Item=Tree>) {
    for mut tree in iter {
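        // Collapse unary chains before encoding the tree.
        tree.collapse_unary_chains().unwrap();
        // Encode the constituency structure as features on the terminals.
        tree.annotate_absolute().unwrap();
        println!("{}", Sentence::from(&tree));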
    }
}
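The last example collapses unary chains before encoding. The following standard-library sketch illustrates the general idea on a toy tree and prints the result in a simple bracketed form. It is an assumption-laden illustration, not lumberjack's algorithm: `Node`, `t`, `nt`, `collapse_unary` and `to_bracketed` are hypothetical names, and the sketch assumes that a nonterminal with exactly one nonterminal child is merged with that child by joining the two labels with a delimiter.

```rust
/// Toy tree type for illustration only; this is NOT lumberjack's `Tree`.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Terminal { form: String, pos: String },
    NonTerminal { label: String, children: Vec<Node> },
}

/// Hypothetical helper: builds a terminal node.
fn t(form: &str, pos: &str) -> Node {
    Node::Terminal { form: form.to_string(), pos: pos.to_string() }
}

/// Hypothetical helper: builds a nonterminal node.
fn nt(label: &str, children: Vec<Node>) -> Node {
    Node::NonTerminal { label: label.to_string(), children }
}

/// Merges every nonterminal that has exactly one nonterminal child with that
/// child, joining both labels with `delim` (for example `NP` over `NX`
/// becomes `NP_NX` when `delim` is "_").
fn collapse_unary(node: Node, delim: &str) -> Node {
    match node {
        Node::NonTerminal { label, mut children } => {
            let only_child_is_nt =
                children.len() == 1 && matches!(children[0], Node::NonTerminal { .. });
            if only_child_is_nt {
                if let Some(Node::NonTerminal { label: child_label, children: grandchildren }) =
                    children.pop()
                {
                    let merged = Node::NonTerminal {
                        label: format!("{}{}{}", label, delim, child_label),
                        children: grandchildren,
                    };
                    // Recurse on the merged node so longer chains collapse fully.
                    return collapse_unary(merged, delim);
                }
                unreachable!("the only child was checked to be a nonterminal");
            }
            Node::NonTerminal {
                label,
                children: children
                    .into_iter()
                    .map(|child| collapse_unary(child, delim))
                    .collect(),
            }
        }
        terminal => terminal,
    }
}

/// Renders the tree as `(LABEL child child ...)` with terminals as `(POS form)`.
fn to_bracketed(node: &Node) -> String {
    match node {
        Node::Terminal { form, pos } => format!("({} {})", pos, form),
        Node::NonTerminal { label, children } => {
            let inner: Vec<String> = children.iter().map(to_bracketed).collect();
            format!("({} {})", label, inner.join(" "))
        }
    }
}
```

A nonterminal whose single child is a terminal is left unchanged in this sketch, so only chains of nonterminals are merged.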