What the new paper already has:
sentiment analysis (SST-2) 2
linguistic acceptability (CoLA) 1
question answering NLI (QNLI) 7
Winograd schema challenge (WNLI) 9
textual entailment (RTE) 8
Full list of GLUE (General Language Understanding Evaluation) benchmark tasks — the standard suite of 9 tasks
🧩 Single-Sentence Tasks
- CoLA (Corpus of Linguistic Acceptability)
- ~~Task: Determine whether a sentence is grammatically acceptable.~~
- ~~Labels: Acceptable / Unacceptable~~
- ~~Metric: Matthews correlation coefficient~~
- SST-2 (Stanford Sentiment Treebank)
- ~~Task: Sentiment classification of movie reviews.~~
- ~~Labels: Positive / Negative~~
- ~~Metric: Accuracy~~
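CoLA is scored with the Matthews correlation coefficient and SST-2 with plain accuracy. A minimal pure-Python sketch of both metrics for binary 0/1 labels (function names are illustrative, not from any particular library):

```python
def matthews_corrcoef(y_true, y_pred):
    """MCC over binary 0/1 labels, as used for CoLA."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    # Convention: return 0 when any confusion-matrix margin is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def accuracy(y_true, y_pred):
    """Fraction of correct predictions, as used for SST-2."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Unlike accuracy, MCC stays near 0 for a majority-class classifier, which is why GLUE uses it on CoLA's skewed label distribution.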
🔁 Similarity and Paraphrase Tasks
- MRPC (Microsoft Research Paraphrase Corpus)
- Task: Determine if two sentences are paraphrases.
- Labels: Paraphrase / Not paraphrase
- Metrics: Accuracy, F1
- QQP (Quora Question Pairs)
- Task: Detect if two Quora questions have the same meaning.
- Labels: Duplicate / Not duplicate
- Metrics: Accuracy, F1
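MRPC and QQP both report F1 alongside accuracy because their positive classes are imbalanced. A minimal pure-Python sketch of binary F1 (the name mirrors the usual convention but is illustrative here):

```python
def f1_score(y_true, y_pred):
    """Binary F1 over 0/1 labels: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```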
- STS-B (Semantic Textual Similarity Benchmark)
- Task: Predict the semantic similarity score between two sentences (scale 0–5).
- Metric: Pearson/Spearman correlation
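STS-B is the only regression task in GLUE: predictions are real-valued scores, compared to gold scores by Pearson and Spearman correlation. A minimal sketch of both (no tie handling in the rank step; names are illustrative):

```python
def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson computed on ranks (ties not handled)."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0.0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))
```

Spearman is less sensitive to the scale of the predictions, which matters because models often compress or stretch the 0–5 range.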
🧠 Inference Tasks
- MNLI (Multi-Genre Natural Language Inference)
- Task: Determine if a hypothesis is entailment, contradiction, or neutral relative to a premise.
- Labels: Entailment / Neutral / Contradiction
- Metrics: Accuracy (matched/mismatched)
- QNLI (Question Natural Language Inference)
- ~~Task: Determine if a context sentence contains the answer to a question.~~
- ~~Labels: Entailment / Not entailment~~
- ~~Metric: Accuracy~~
- RTE (Recognizing Textual Entailment)
- ~~Task: Binary entailment task combining data from RTE1, RTE2, RTE3, and RTE5.~~
- ~~Labels: Entailment / Not entailment~~
- ~~Metric: Accuracy~~
- WNLI (Winograd NLI)
- ~~Task: Resolve coreference ambiguity (Winograd Schema Challenge format).~~
- ~~Labels: Entailment / Not entailment~~
- ~~Metric: Accuracy~~
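The nine tasks above can be condensed into a small lookup table. A hedged sketch with illustrative names (not an official GLUE API), showing how each task's label set determines the output width of a model's task head:

```python
# Label sets and primary metrics for the 9 GLUE tasks, mirroring the list above.
GLUE_TASKS = {
    "CoLA":  {"labels": ["unacceptable", "acceptable"],             "metric": "matthews"},
    "SST-2": {"labels": ["negative", "positive"],                   "metric": "accuracy"},
    "MRPC":  {"labels": ["not_paraphrase", "paraphrase"],           "metric": "acc/F1"},
    "QQP":   {"labels": ["not_duplicate", "duplicate"],             "metric": "acc/F1"},
    "STS-B": {"labels": None,                                       "metric": "pearson/spearman"},  # regression, 0-5
    "MNLI":  {"labels": ["entailment", "neutral", "contradiction"], "metric": "accuracy"},
    "QNLI":  {"labels": ["entailment", "not_entailment"],           "metric": "accuracy"},
    "RTE":   {"labels": ["entailment", "not_entailment"],           "metric": "accuracy"},
    "WNLI":  {"labels": ["entailment", "not_entailment"],           "metric": "accuracy"},
}

def num_outputs(task):
    """Task-head width: one logit per label, or a single scalar for regression."""
    labels = GLUE_TASKS[task]["labels"]
    return 1 if labels is None else len(labels)
```

Only MNLI needs a three-way head; STS-B needs a single regression output; every other task is binary.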