What the new paper already has:
sentiment analysis (SST-2) 2
linguistic acceptability (CoLA) 1
question answering NLI (QNLI) 7
Winograd schema challenge (WNLI) 9
textual entailment (RTE) 8
Full list of GLUE (General Language Understanding Evaluation) benchmark tasks — the standard suite of 9 tasks
🧩 Single-Sentence Tasks
- CoLA (Corpus of Linguistic Acceptability)
- ~~Task: Determine whether a sentence is grammatically acceptable.~~
- ~~Labels: Acceptable / Unacceptable~~
- ~~Metric: Matthews correlation coefficient~~
- SST-2 (Stanford Sentiment Treebank)
- ~~Task: Sentiment classification of movie reviews.~~
- ~~Labels: Positive / Negative~~
- ~~Metric: Accuracy~~
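CoLA is scored with the Matthews correlation coefficient and SST-2 with plain accuracy. A minimal pure-Python sketch of both metrics for binary 0/1 labels (function names are illustrative, not from any particular library):

```python
def matthews_corrcoef(y_true, y_pred):
    """MCC over binary 0/1 labels, as used for CoLA."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    # Convention: return 0 when any confusion-matrix margin is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def accuracy(y_true, y_pred):
    """Fraction of correct predictions, as used for SST-2."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Unlike accuracy, MCC stays near 0 for a majority-class classifier, which is why GLUE uses it on CoLA's skewed label distribution.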
🔁 Similarity and Paraphrase Tasks
- MRPC (Microsoft Research Paraphrase Corpus)
- Task: Determine if two sentences are paraphrases.
- Labels: Paraphrase / Not paraphrase
- Metrics: Accuracy, F1
- QQP (Quora Question Pairs)
- Task: Detect if two Quora questions have the same meaning.
- Labels: Duplicate / Not duplicate
- Metrics: Accuracy, F1
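MRPC and QQP both report F1 alongside accuracy because their positive classes are imbalanced. A minimal pure-Python sketch of binary F1 (the name mirrors the usual convention but is illustrative here):

```python
def f1_score(y_true, y_pred):
    """Binary F1 over 0/1 labels: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```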
- STS-B (Semantic Textual Similarity Benchmark)
- Task: Predict the semantic similarity score between two sentences (scale 0–5).
- Metric: Pearson/Spearman correlation
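STS-B is the only regression task in GLUE: predictions are real-valued scores, compared to gold scores by Pearson and Spearman correlation. A minimal sketch of both (no tie handling in the rank step; names are illustrative):

```python
def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson computed on ranks (ties not handled)."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0.0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))
```

Spearman is less sensitive to the scale of the predictions, which matters because models often compress or stretch the 0–5 range.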
🧠 Inference Tasks
- MNLI (Multi-Genre Natural Language Inference)
- Task: Determine if a hypothesis is entailment, contradiction, or neutral relative to a premise.
- Labels: Entailment / Neutral / Contradiction
- Metrics: Accuracy (matched/mismatched)
- QNLI (Question Natural Language Inference)
- ~~Task: Determine if a context sentence contains the answer to a question.~~
- ~~Labels: Entailment / Not entailment~~
- ~~Metric: Accuracy~~
- RTE (Recognizing Textual Entailment)
- ~~Task: Binary entailment task combining data from RTE1, RTE2, RTE3, and RTE5.~~
- ~~Labels: Entailment / Not entailment~~
- ~~Metric: Accuracy~~
- WNLI (Winograd NLI)
- ~~Task: Resolve coreference ambiguity (Winograd Schema Challenge format).~~
- ~~Labels: Entailment / Not entailment~~
- ~~Metric: Accuracy~~
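The nine tasks above can be condensed into a small lookup table. A hedged sketch with illustrative names (not an official GLUE API), showing how each task's label set determines the output width of a model's task head:

```python
# Label sets and primary metrics for the 9 GLUE tasks, mirroring the list above.
GLUE_TASKS = {
    "CoLA":  {"labels": ["unacceptable", "acceptable"],             "metric": "matthews"},
    "SST-2": {"labels": ["negative", "positive"],                   "metric": "accuracy"},
    "MRPC":  {"labels": ["not_paraphrase", "paraphrase"],           "metric": "acc/F1"},
    "QQP":   {"labels": ["not_duplicate", "duplicate"],             "metric": "acc/F1"},
    "STS-B": {"labels": None,                                       "metric": "pearson/spearman"},  # regression, 0-5
    "MNLI":  {"labels": ["entailment", "neutral", "contradiction"], "metric": "accuracy"},
    "QNLI":  {"labels": ["entailment", "not_entailment"],           "metric": "accuracy"},
    "RTE":   {"labels": ["entailment", "not_entailment"],           "metric": "accuracy"},
    "WNLI":  {"labels": ["entailment", "not_entailment"],           "metric": "accuracy"},
}

def num_outputs(task):
    """Task-head width: one logit per label, or a single scalar for regression."""
    labels = GLUE_TASKS[task]["labels"]
    return 1 if labels is None else len(labels)
```

Only MNLI needs a three-way head; STS-B needs a single regression output; every other task is binary.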