spacy-span-analyzer

Commits

List of commits on branch master.
05fdeba1eddd09efbf446bddc24a13a0c5b32648 (unverified): Add archive note, ljvmiranda921 committed 3 years ago

596da89225f5b63cdecec5a302c5aabfdc74b9e5 (unverified): add fix for missing keys, pmbaumgartner committed 3 years ago

60a42afee85701b11fd458d4df87d1028ae9b142 (unverified): add textacy, pmbaumgartner committed 3 years ago

5ab51129a750317cf65d167822093195ebb0f2b2 (unverified): Bump version to 0.3.0, ljvmiranda921 committed 3 years ago

6314aea405bd9ce797679e4890cbb8723eee67d3 (verified): Add experiments for Nested NER (#17), ljvmiranda921 committed 3 years ago

0e929598acffede6132713f4df99c29f09c64de1 (unverified): Add QoL option to save metrics to JSON, ljvmiranda921 committed 3 years ago

README

The README file for this repository.

💫 This library is now integrated into spaCy v3.4 as part of the debug data command!

spacy-span-analyzer

A simple tool to analyze the Spans in your dataset. It's tightly integrated with spaCy, so you can easily incorporate it into existing NLP pipelines. This is also a reproduction of Papay et al.'s work on Dissecting Span Identification Tasks with Performance Prediction (EMNLP 2020).

⏳ Install

Using pip:

pip install spacy-span-analyzer

Directly from source (I highly recommend running this within a virtual environment):

git clone git@github.com:ljvmiranda921/spacy-span-analyzer.git
cd spacy-span-analyzer
pip install .

⏯ Usage

You can use the Span Analyzer as a command-line tool:

spacy-span-analyzer ./path/to/dataset.spacy

Or as an imported library:

import spacy
from spacy.tokens import DocBin
from spacy_span_analyzer import SpanAnalyzer

nlp = spacy.blank("en")  # or any Language model

# Ensure that your dataset is a DocBin
doc_bin = DocBin().from_disk("./path/to/data.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))

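# Run SpanAnalyzer and get span characteristics
analyze = SpanAnalyzer(docs)
analyze.frequency                 # how often each span type occurs
analyze.length                    # span length, measured in tokens
analyze.span_distinctiveness      # how tokens inside spans differ from the corpus overall
analyze.boundary_distinctiveness  # how tokens at span boundaries differ from the corpus overall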

Inputs are expected to be a list of spaCy Docs or a DocBin (if you're using the command-line tool).
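If your data isn't in DocBin form yet, here is a minimal sketch of how you might produce a .spacy file for the command-line tool. The sentence, the labels, and the temporary output path are made up for illustration; only the DocBin calls come from spaCy itself.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")

# Hypothetical example document with two overlapping spans.
# Tokens: New(0) York(1) City(2) is(3) large(4) .(5)
doc = nlp("New York City is large.")
doc.spans["sc"] = [
    Span(doc, 0, 3, label="GPE"),  # "New York City"
    Span(doc, 0, 2, label="GPE"),  # "New York"
]

# Serialize to disk; this path is what you'd pass to the CLI.
out_path = Path(tempfile.mkdtemp()) / "data.spacy"
DocBin(docs=[doc]).to_disk(out_path)

# Read it back to confirm the span groups survived the round trip.
reloaded = list(DocBin().from_disk(out_path).get_docs(nlp.vocab))
span_texts = [(span.text, span.label_) for span in reloaded[0].spans["sc"]]
print(span_texts)
```

The list returned by get_docs is also the form the SpanAnalyzer class takes when you use it as a library.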

Working with Spans

In spaCy, you'd want to store your Spans in the doc.spans property, under a particular spans_key (sc by default). Unlike the doc.ents property, doc.spans allows overlapping spans. This is especially useful for downstream tasks like Span Categorization.
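To make the difference concrete, here is a small sketch with a made-up sentence and labels. It stores two overlapping spans under doc.spans, then tries to assign the same two spans to doc.ents, which spaCy rejects.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# Tokens: Welcome(0) to(1) the(2) Bank(3) of(4) China(5) .(6)
doc = nlp("Welcome to the Bank of China.")
org = Span(doc, 3, 6, label="ORG")  # "Bank of China"
gpe = Span(doc, 5, 6, label="GPE")  # "China", nested inside the ORG span

# doc.spans accepts overlapping spans without complaint
doc.spans["sc"] = [org, gpe]

# doc.ents requires non-overlapping spans, so this assignment fails
try:
    doc.ents = [org, gpe]
    ents_accepted = True
except ValueError:
    ents_accepted = False

print(len(doc.spans["sc"]), ents_accepted)
```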

A common way to do this is to use char_span to define a slice from your Doc:

doc = nlp(text)
spans = []
for annotation in annotations:
    span = doc.char_span(
        annotation["start"],
        annotation["end"],
        label=annotation["label"],
    )
    # char_span returns None when the offsets don't align with token boundaries
    if span is not None:
        spans.append(span)

# Put all spans under a spans_key
doc.spans["sc"] = spans

You can also achieve the same thing by using set_ents or by creating a SpanGroup.
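For the SpanGroup route, a minimal sketch might look like the following. The sentence and labels are placeholders; the point is only that an explicitly built SpanGroup can be assigned to the same spans_key.

```python
import spacy
from spacy.tokens import Span, SpanGroup

nlp = spacy.blank("en")

# Tokens: Welcome(0) to(1) the(2) Bank(3) of(4) China(5) .(6)
doc = nlp("Welcome to the Bank of China.")

# Build the group explicitly, then attach it under the spans_key
group = SpanGroup(
    doc,
    name="sc",
    spans=[Span(doc, 3, 6, label="ORG"), Span(doc, 5, 6, label="GPE")],
)
doc.spans["sc"] = group

labels = [span.label_ for span in doc.spans["sc"]]
print(labels)
```

Assigning a plain list to doc.spans["sc"] is shorter. Building the SpanGroup yourself is mainly useful when you want to manipulate the group before attaching it.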