Commits

List of commits on branch main.

4438cc8fe4c5806cc4710e950e6a26eb09a9cc4a
add licensing info
jjramak committed 2 years ago

df10966f508a60b7ab6bd7a9e29a6a91aab7ccd2
add model and simulator code.
jjramak committed 2 years ago

42a0dc2b08972ef6b108a4b0eeb5080088ff533c
add folder and readme for mapping-aware model.
jjramak committed 2 years ago

4dac11fd82e3db53d9fdef393f8a770768bd80aa
Add missing function generate_labeler_confusion_matrix in simulator.py
committed 4 years ago

fc9399d845647dc940eabe422cc759f746d52bcb
Adding example with different labelers have different confusion matrices
committed 4 years ago

d4b0b6711741c5d6c2be2da6d808454e4a9d5f1c
Initial commit
facebook-github-bot committed 4 years ago

CLARA: Confidence of Labels and Raters

An implementation of the Gibbs sampler for the model (together with simulators to generate synthetic data) used in the paper "CLARA: Confidence of Labels and Raters" (KDD'20).

@inproceedings{clara-kdd-20,
    author = {Viet-An Nguyen and Peibei Shi and Jagdish Ramakrishnan and Udi Weinsberg and Henry C. Lin and Steve Metz and Neil Chandra and Jane Jing and Dimitris Kalimeris},
    title = {{CLARA: Confidence of Labels and Raters}},
    booktitle = {Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’20)},
    year = {2020},
}

Simulating Data

Generate data without classifier scores

We can generate a dataset of 1000 items with true prevalence theta = [0.8, 0.2], in which all labelers share the same confusion matrix psi = [[0.9, 0.1], [0.05, 0.95]], as follows:

import numpy as np
from simulator import generate_dataset_tiebreaking
df = generate_dataset_tiebreaking(
    dataset_id=0,
    theta=np.array([0.8, 0.2]),
    psi=np.array([[0.9, 0.1], [0.05, 0.95]]),
    num_items=1000,
)

The simulated data will look like:

dataset  id     labelers   ratings    true_rating
0        0_995  [0, 0]     [0, 0]     0
0        0_996  [0, 0]     [0, 0]     0
0        0_997  [0, 0, 0]  [0, 1, 0]  0
0        0_998  [0, 0]     [0, 0]     0
0        0_999  [0, 0]     [0, 0]     0
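
To make the roles of theta and psi concrete, here is a minimal sketch of the kind of generative process these parameters suggest. It is not the repository's simulator: the function name simulate_items is hypothetical, and the tie-breaking rule (two ratings per item, plus a third only when the first two disagree) is an assumption inferred from the sample rows above, where only the item with a disagreement has three ratings.

```python
import numpy as np


def simulate_items(theta, psi, num_items, seed=0):
    """Hypothetical stand-in for the simulator, for illustration only.

    Draws each true label from theta, then draws ratings from the row of
    psi that corresponds to that true label.
    """
    rng = np.random.default_rng(seed)
    num_classes = len(theta)
    rows = []
    for _ in range(num_items):
        true_rating = int(rng.choice(num_classes, p=theta))
        # Two initial ratings, each drawn from psi[true_rating].
        ratings = [int(r) for r in rng.choice(num_classes, size=2, p=psi[true_rating])]
        # Assumed tie-breaking rule: request a third rating only on disagreement.
        if ratings[0] != ratings[1]:
            ratings.append(int(rng.choice(num_classes, p=psi[true_rating])))
        rows.append((true_rating, ratings))
    return rows


rows = simulate_items(
    theta=np.array([0.8, 0.2]),
    psi=np.array([[0.9, 0.1], [0.05, 0.95]]),
    num_items=1000,
)
share_of_class_0 = sum(1 for truth, _ in rows if truth == 0) / len(rows)
```

With these parameters, share_of_class_0 should land close to theta[0] = 0.8, and most items should carry only two ratings because the two labelers usually agree.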

Generate data with classifier scores

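The same parameters can be passed to generate_dataset_tiebreaking_with_scores to generate a dataset with classifier scores:

import numpy as np
from simulator import generate_dataset_tiebreaking_with_scores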
df = generate_dataset_tiebreaking_with_scores(
    dataset_id=1,
    theta=np.array([0.8, 0.2]),
    psi=np.array([[0.9, 0.1], [0.05, 0.95]]),
    num_items=1000,
)

Using CLARA

Fit the model

To fit a CLARA model with a single confusion matrix shared across all labelers:

model = ClaraGibbs(burn_in=2000, num_samples=1000, sample_lag=3)
model.fit(A=1, R=2, ratings=np.array(df.ratings))
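
The README does not spell out how burn_in, num_samples and sample_lag interact. The sketch here shows one common convention for Gibbs samplers: discard the first burn_in iterations, then keep every sample_lag-th iteration until num_samples draws have been collected. Whether ClaraGibbs follows exactly this schedule is an assumption, and the helper kept_iterations is hypothetical.

```python
def kept_iterations(burn_in, num_samples, sample_lag):
    """Return the 0-based iteration indices retained under the assumed schedule.

    Assumption: iterations 0 .. burn_in - 1 are discarded, and afterwards the
    last iteration of every block of sample_lag iterations is kept.
    """
    first_kept = burn_in + sample_lag - 1
    stop = burn_in + num_samples * sample_lag
    return list(range(first_kept, stop, sample_lag))


kept = kept_iterations(burn_in=2000, num_samples=1000, sample_lag=3)
total_iterations = kept[-1] + 1
```

Under this convention the settings used above would run 5000 iterations in total and average over 1000 thinned draws.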

Estimate the prevalence

model.get_prevalence()
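
As a sanity check on the model's estimate, it can help to compare it with a naive majority-vote baseline computed directly from the ratings column. The sketch here is not part of CLARA. It assumes the ratings column holds one list of integer ratings per item, as in the simulated data above, and the helper name majority_vote_prevalence is hypothetical. The toy frame reuses the five sample rows shown earlier.

```python
import numpy as np
import pandas as pd


def majority_vote_prevalence(ratings_column, num_classes):
    """Naive baseline: take each item's majority rating, then count the classes."""
    votes = [
        int(np.bincount(ratings, minlength=num_classes).argmax())
        for ratings in ratings_column
    ]
    return np.bincount(votes, minlength=num_classes) / len(votes)


# The five sample rows from the simulated data shown earlier.
toy_df = pd.DataFrame(
    {
        "id": ["0_995", "0_996", "0_997", "0_998", "0_999"],
        "ratings": [[0, 0], [0, 0], [0, 1, 0], [0, 0], [0, 0]],
    }
)
prevalence = majority_vote_prevalence(toy_df.ratings, num_classes=2)
```

Unlike the model, this baseline ignores labeler error rates, so the two estimates can differ when labelers are noisy.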

Estimate the confusion matrix

model.get_confusion_matrix(labeler_id=0)
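
On simulated data the true labels are known, so the estimated confusion matrix can be compared with an empirical one tallied from the true_rating and ratings columns. The sketch here assumes that rows of psi index the true label and columns index the observed rating, which matches the rows of psi summing to 1 in the example above but is not stated in the README. The helper name empirical_confusion_matrix is hypothetical.

```python
import numpy as np


def empirical_confusion_matrix(true_ratings, ratings_column, num_classes):
    """Tally observed ratings per true label and normalize each row."""
    counts = np.zeros((num_classes, num_classes))
    for truth, ratings in zip(true_ratings, ratings_column):
        for rating in ratings:
            counts[truth, rating] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # Rows with no observations are left as zeros instead of dividing by zero.
    return np.divide(
        counts, row_totals, out=np.zeros_like(counts), where=row_totals > 0
    )


# The five sample rows from the simulated data shown earlier.
true_ratings = [0, 0, 0, 0, 0]
ratings_column = [[0, 0], [0, 0], [0, 1, 0], [0, 0], [0, 0]]
estimated_psi = empirical_confusion_matrix(true_ratings, ratings_column, num_classes=2)
```

On these five rows only the first row of the matrix is informed by data (10 of 11 ratings agree with the true label); on the full 1000-item dataset both rows should approach the psi used for simulation.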

Installation

Installation Requirements

  • Python >= 3.6
  • numpy
  • pandas
  • scipy

License

You may find out more about the license here.