Our task was to create a system that examines scientific claims and determines whether they are supported or contradicted by any of the abstracts in a given corpus. The pipeline works in four steps (a sketch of its overall interface follows the list):

  1. The system takes a scientific claim as input.
  2. It identifies all relevant abstracts from a large corpus.
  3. It labels each claim–abstract pair as SUPPORT or CONTRADICT, depending on whether the relevant abstract supports or contradicts the claim.
  4. It selects the sentences from the abstract that justify the assigned label.
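A minimal sketch of the interface this pipeline implies, assuming the corpus maps abstract IDs to lists of sentences; all names below are illustrative and not taken from our code:

```python
from dataclasses import dataclass

@dataclass
class EvidenceDecision:
    abstract_id: str           # which abstract the decision refers to
    label: str                 # "SUPPORT" or "CONTRADICT"
    sentence_ids: list[int]    # indices of the sentences that justify the label

def verify_claim(claim: str, corpus: dict[str, list[str]]) -> list[EvidenceDecision]:
    """Steps 2-4: retrieve the relevant abstracts, label each claim-abstract
    pair, and pick the evidence sentences. The retrieval and comparison steps
    are sketched in the Methods section below."""
    raise NotImplementedError
```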

Plan

  • Preprocessing of data
  • Search for a method to extract relevant abstracts (TF-IDF algorithm)
  • Extraction of relevant abstracts from the corpus
  • Search for a machine learning technique suitable for sentence comparison (Sentence-BERT)
  • Application of Sentence-BERT to solve the task
  • Verification of the system’s accuracy
  • Training the system with the obtained training set
  • Testing the system with test data
  • Optimizing the system

Methods

TF-IDF: term frequency–inverse document frequency

TF-IDF assigns each term a weight that reflects how important the term is to a document within a collection of documents. We used this weighting to rank the abstracts and return the k most relevant ones for each claim.

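A minimal sketch of this retrieval step using scikit-learn; the function name and the value of k are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_abstracts(claim: str, abstracts: list[str], k: int = 3) -> list[int]:
    """Rank abstracts by TF-IDF cosine similarity to the claim and
    return the indices of the k most relevant ones."""
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(abstracts)       # one TF-IDF vector per abstract
    claim_vec = vectorizer.transform([claim])              # the claim in the same vector space
    scores = cosine_similarity(claim_vec, doc_matrix)[0]   # similarity of the claim to every abstract
    return scores.argsort()[::-1][:k].tolist()             # indices of the k best-matching abstracts
```

The abstracts returned here are the candidates passed on to the Sentence-BERT stage.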

Sentence-BERT

Sentence-BERT (SBERT) is a modification of BERT that uses siamese and triplet network structures to produce sentence embeddings that can be compared with cosine similarity. This significantly reduces the time needed to find similar sentences compared with running BERT on every sentence pair, while maintaining similar accuracy. We used Sentence-BERT both for labeling the claims and for selecting the evidence sentences.

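To illustrate the comparison step, a sketch using the sentence-transformers library; the claim and abstract sentences are made-up examples, and scoring individual abstract sentences against the claim is one way the evidence selection can be done:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("stsb-mpnet-base-v2")   # one of the pretrained models listed below

claim = "Example claim about a clinical finding."                        # illustrative input
abstract_sentences = [
    "We enrolled 120 patients in a randomized trial.",                   # illustrative abstract
    "The treatment group showed a significant reduction in symptoms.",
]

# Encode the claim and every abstract sentence into fixed-size embeddings,
# then score each sentence by cosine similarity to the claim.
claim_emb = model.encode(claim, convert_to_tensor=True)
sent_embs = model.encode(abstract_sentences, convert_to_tensor=True)
scores = util.cos_sim(claim_emb, sent_embs)[0]       # one similarity value per abstract sentence

best = int(scores.argmax())
print(abstract_sentences[best], float(scores[best]))
```

The similarity values produced this way are what the thresholds described below are applied to.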

Pretrained models

We selected three pretrained models for experimentation. The best-performing one according to the SBERT documentation is based on MPNet and was trained on NLI and the STSb (Semantic Textual Similarity benchmark) datasets. In addition, we selected two others, one based on XLM-R and the other on DistilRoBERTa, both trained on paraphrase data.

| Model Name | Base Model | Pooling | Training Data | STSb Performance (Higher = Better) |
| --- | --- | --- | --- | --- |
| stsb-mpnet-base-v2 | mpnet-base | Mean Pooling | NLI+STSb | 88.57 |
| paraphrase-xlm-r-multilingual-v1 | XLM-R | Mean Pooling | Paraphrase Data | 83.50 |
| paraphrase-distilroberta-base-v1 | distilroberta-base | Mean Pooling | Paraphrase Data | 81.81 |
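All three models can be loaded by name with sentence-transformers; a small sketch for comparing them on the same pair (the claim and sentence below are made up):

```python
from sentence_transformers import SentenceTransformer, util

MODEL_NAMES = [
    "stsb-mpnet-base-v2",
    "paraphrase-xlm-r-multilingual-v1",
    "paraphrase-distilroberta-base-v1",
]

claim = "Example claim text."                        # illustrative
sentence = "Example abstract sentence to compare."   # illustrative

# Score the same claim/sentence pair with each pretrained model.
for name in MODEL_NAMES:
    model = SentenceTransformer(name)
    embeddings = model.encode([claim, sentence], convert_to_tensor=True)
    print(name, float(util.cos_sim(embeddings[0], embeddings[1])))
```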

Threshold Selection

We determined the label thresholds from the cosine-similarity values: by inspecting the typical similarity ranges of SUPPORT and CONTRADICT cases for each claim, we defined two cut-off values, a lower and an upper threshold, and then refined them based on the results. A similarity below the lower threshold is treated as NOT ENOUGH INFO, a value between the two thresholds as CONTRADICT, and a value above the upper threshold as SUPPORT (a sketch of this rule follows the table below).

| NOT ENOUGH INFO | CONTRADICT | SUPPORT |
| --- | --- | --- |
| x < 0.45 | 0.45 < x < 0.60 | 0.60 < x |
| x < 0.47 | 0.47 < x < 0.58 | 0.58 < x |
| x < 0.50 | 0.50 < x < 0.58 | 0.58 < x |
| x < 0.52 | 0.52 < x < 0.58 | 0.58 < x |
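These thresholds translate directly into a decision rule. A sketch, where x is the Sentence-BERT similarity score and the defaults correspond to the 0.52/0.58 row above (the function name is illustrative):

```python
def label_from_similarity(x: float, lower: float = 0.52, upper: float = 0.58) -> str:
    """Map a similarity score to a label using the two thresholds."""
    if x < lower:
        return "NOT ENOUGH INFO"
    if x < upper:
        return "CONTRADICT"
    return "SUPPORT"
```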

Results

We first tried the three pretrained models on 25 claims using the previously determined thresholds, and then applied the best-performing model to all claims. For evaluation, we checked the results against the training set and the corpus.

Evaluation criteria:

  • Empty Evidence: claims for which our system returned no evidence / claims that actually have no evidence
  • Non-empty Evidence: claims that actually have evidence
  • True SUPPORT: claims we labeled SUPPORT that are actually SUPPORT / all claims that are actually SUPPORT
  • True CONTRADICT: claims we labeled CONTRADICT that are actually CONTRADICT / all claims that are actually CONTRADICT
  • False SUPPORT: claims we labeled SUPPORT that are actually CONTRADICT / all claims that are actually CONTRADICT
  • False CONTRADICT: claims we labeled CONTRADICT that are actually SUPPORT / all claims that are actually SUPPORT
  • Correctly Predicted: correctly predicted claims / all tested claims
  • Accuracy: accuracy as we calculate it; unlike the official task description, we do not require an exact match but also accept a prediction that is contained in the training-set evidence (a sketch of this calculation follows the list).
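A sketch of the last two criteria, under the assumption that predictions and gold evidence are stored as sets of (abstract_id, label) pairs per claim; the containment check reflects our relaxed matching rule, and the handling of empty predictions is an assumption:

```python
def is_correct(predicted: set, gold: set) -> bool:
    """Relaxed match: a prediction counts as correct if every predicted
    (abstract_id, label) pair is contained in the gold evidence; an empty
    prediction is correct only when the gold evidence is empty too."""
    if not predicted:
        return not gold
    return predicted <= gold

def accuracy(predictions: dict, gold: dict) -> float:
    """Correctly predicted claims divided by the number of tested claims."""
    correct = sum(is_correct(predictions[cid], gold.get(cid, set())) for cid in predictions)
    return correct / len(predictions)
```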

stsb-mpnet-base-v2

| Thresholds | 0.45/0.60 | 0.47/0.58 | 0.50/0.58 | 0.52/0.58 |
| --- | --- | --- | --- | --- |
| Empty Evidence | 5/17 | 6/17 | 9/17 | 11/17 |
| Non-empty Evidence | 8 | 8 | 8 | 8 |
| True SUPPORT | 1/2 | 1/2 | 1/2 | 1/1 |
| False CONTRADICT | 1/2 | 1/2 | 1/2 | 0/1 |
| True CONTRADICT | 1/1 | 1/1 | 1/1 | 1/1 |
| False SUPPORT | 0/1 | 0/1 | 0/1 | 0/1 |
| Correctly Predicted | 7/25 | 8/25 | 11/25 | 13/25 |
| Accuracy | 0.28 | 0.32 | 0.44 | 0.52 |

paraphrase-xlm-r-multilingual-v1

| Thresholds | 0.45/0.60 | 0.47/0.58 | 0.50/0.58 | 0.52/0.58 |
| --- | --- | --- | --- | --- |
| Empty Evidence | 3/17 | 3/17 | 4/17 | 5/17 |
| Non-empty Evidence | 8 | 8 | 8 | 8 |
| True SUPPORT | 1/2 | 1/2 | 1/2 | 1/2 |
| False CONTRADICT | 1/2 | 1/2 | 1/2 | 1/2 |
| True CONTRADICT | 2/3 | 2/4 | 1/3 | 1/3 |
| False SUPPORT | 1/3 | 2/4 | 2/3 | 2/3 |
| Correctly Predicted | 6/25 | 6/25 | 6/25 | 7/25 |
| Accuracy | 0.24 | 0.24 | 0.24 | 0.28 |

paraphrase-distilroberta-base-v1

| Thresholds | 0.45/0.60 | 0.47/0.58 | 0.50/0.58 | 0.52/0.58 |
| --- | --- | --- | --- | --- |
| Empty Evidence | 5/17 | 5/17 | 7/17 | 8/17 |
| Non-empty Evidence | 8 | 8 | 8 | 8 |
| True SUPPORT | 1/2 | 1/2 | 1/2 | 1/2 |
| False CONTRADICT | 1/2 | 1/2 | 1/2 | 1/2 |
| True CONTRADICT | 2/2 | 2/2 | 2/2 | 2/2 |
| False SUPPORT | 0/2 | 0/2 | 0/2 | 0/2 |
| Correctly Predicted | 8/25 | 8/25 | 10/25 | 11/25 |
| Accuracy | 0.32 | 0.32 | 0.40 | 0.44 |

Results on the Full Dataset

Based on all these results, the stsb-mpnet-base-v2 pretrained model achieved the best performance with the 0.52/0.58 thresholds, so we applied this model to the entire dataset.

| Thresholds | 0.52/0.58 |
| --- | --- |
| Empty Evidence | 126/304 |
| Non-empty Evidence | 505 |
| True SUPPORT | 356/631 |
| False CONTRADICT | 275/631 |
| True CONTRADICT | 143/323 |
| False SUPPORT | 180/323 |
| Correctly Predicted | 625/809 |
| Accuracy | 0.77 |

Challenges:

Our main challenge was the deadline: we did not have time to train our own model and evaluate it afterwards. Designing the evaluation was also difficult, and although it is not perfect, the models chosen for comparison and the iterative refinement of the thresholds proved to be appropriate.