์œค์šฉ์„  Paper Reading Assistant 2ยถ

0. Remindยถ

  • ๋…ผ๋ฌธ๊ณผ ๊ด€๋ จ๋œ ์งˆ๋ฌธ์„ ์ƒ์„ฑํ•˜๊ณ  ๋‚˜์˜ ๋‹ต๋ณ€์„ ์ฑ„์ ํ•  ์ˆ˜ ์žˆ๋Š” ์‹œ์Šคํ…œ ๊ฐœ๋ฐœ

  • method

    1. Download paper: Arxiv์—์„œ ๋…ผ๋ฌธ ์ •๋ณด์™€ pdf ๋‹ค์šด๋กœ๋“œ

    2. Preprocess paper: pdf์—์„œ ๋…ผ๋ฌธ text๋ฅผ ์ถ”์ถœ ํ›„ ์ „์ฒ˜๋ฆฌ

    3. Generate questions: ๋…ผ๋ฌธ์„ ์ฝ๊ณ  ๊ด€๋ จ๋œ ์งˆ๋ฌธ ์ƒ์„ฑ

    4. Answer questions: ์งˆ๋ฌธ์— ๋Œ€ํ•œ pseudo ์ •๋‹ต ์ƒ์„ฑ

    5. Evaluate user answer: pseudo ์ •๋‹ต์„ ๊ธฐ์ค€์œผ๋กœ ์‚ฌ๋žŒ์˜ ๋‹ต๋ณ€ ์ฑ„์ 

1. Problemsยถ

  • ์ค‘์š”ํ•˜์ง€ ์•Š์€ ํ…์ŠคํŠธ๊ฐ€ ๋งŽ๋‹ค. (reference, table caption ๋“ฑ) โ†’ PDF layout analysis

  • ์ฑ„์ ์— ์‚ฌ์šฉํ•˜๋Š” BERTScore ๋ชจ๋ธ์˜ ํฌ๊ธฐ๊ฐ€ ํฌ๋‹ค. โ†’ Model Distillation

2. PDF layout analysisยถ

2.1. Introductionยถ

  • ๋…ผ๋ฌธ์€ ๋Œ€์ฒด๋กœ ์„œ๋ก , ๊ด€๋ จ ์—ฐ๊ตฌ, ๋ฐฉ๋ฒ•๋ก , ์‹คํ—˜, ๊ฒฐ๋ก ๊ณผ ๊ฐ™์€ ๊ตฌ์กฐ๋กœ ๊ตฌ์„ฑ๋˜์–ดย ์žˆ๋‹ค.

  • ํ•˜์ง€๋งŒ PDF๋กœ ๋ฐฐํฌ๋œ ๋…ผ๋ฌธ์—์„œ ์œ„์™€ ๊ฐ™์€ ๊ตฌ์กฐ๋ฅผ ์ž๋™์œผ๋กœ ํŒŒ์•…ํ•˜๋Š” ๊ฒƒ์€ ์‰ฝ์ง€ ์•Š๋‹ค.

  • ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์–‘ํ•œ Document-Image Understanding ๋ชจ๋ธ๋“ค์ด ์ œ์•ˆ๋˜์—ˆ๋‹ค.

  • ๊ทธ ์ค‘ย VIsualย LAyout (VILA) ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋…ผ๋ฌธ์˜ ๊ตฌ์กฐ๋ฅผ ์ถ”์ถœํ•˜๋Š” ๊ณผ์ •์„ ์ง„ํ–‰ํ•ด๋ณด์•˜๋‹ค.

2.2. VILA: Visual LAyoutยถ

  • VILA: Improving structured content extraction from scientific PDFs using visual layout groups

  • ๋ฌธ์„œ์˜ ๊ตฌ์กฐ๋ฅผ ์ธ์‹ํ•˜๋Š” Document Layout Analysis ๋ฌธ์ œ๋Š” ์ฃผ๋กœ token classification (NLP-centric)์ด๋‚˜ object detection (vision-centric) ๋ฌธ์ œ๋กœ ์น˜ํ™˜ํ•˜์—ฌ ํ•ด๊ฒฐํ•œ๋‹ค.

    • VILA์˜ ๊ฒฝ์šฐ token classification ๋ฐฉ์‹์„ ์‚ฌ์šฉํ–ˆ๋‹ค.

  • VILA๋Š” line ๋˜๋Š” block๊ณผ ๊ฐ™์€ visual group ๋‚ด์˜ token๋“ค์€ ๊ฐ™์€ ๋ผ๋ฒจ์„ ๊ฐ€์ง„๋‹ค๋Š” โ€œgroup uniformity assumptionโ€์„ ๊ฐ•์กฐํ•œ๋‹ค.

  • group uniformity assumption๋ฅผ ๋”ฐ๋ฅด๊ธฐ ์œ„ํ•ด ๋‘ ๊ฐ€์ง€ ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค.

    • I-VILA: group ์‚ฌ์ด์— speical token [BLK] ์ž…๋ ฅ

    • H-VILA: group ๋ณ„๋กœ self-attention ํ›„ group representation ์ถ”์ถœํ•˜์—ฌ group ๊ฐ„ self-attention ์ง„ํ–‰

  • ์„ฑ๋Šฅ์€ I-VILA๊ฐ€ ๋›ฐ์–ด๋‚˜์ง€๋งŒ, ํšจ์œจ์„ฑ์€ H-VILA๊ฐ€ ๋” ์ข‹๋‹ค.

Illustration of the H-VILA [1]
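As a reading aid, here is a minimal sketch of the I-VILA idea, not the official implementation (the repository ships its own LayoutIndicatorPDFPredictor, imported in 2.3.1): flatten the visual groups into one token sequence and insert a [BLK] indicator at every group boundary before token classification. The bert-base-uncased tokenizer and the build_ivila_input helper are assumptions made for illustration.

# Minimal I-VILA-style input construction (illustrative sketch, not the official code).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": ["[BLK]"]})

def build_ivila_input(groups):
    """Flatten visual groups into one word sequence with [BLK] at group boundaries."""
    words = []
    for i, group in enumerate(groups):
        if i > 0:
            words.append("[BLK]")  # indicator token marking the start of a new visual group
        words.extend(group)
    # is_split_into_words=True keeps word boundaries, so per-word labels can be aligned later
    return tokenizer(words, is_split_into_words=True, return_tensors="pt")

encoding = build_ivila_input([
    ["1", "Introduction"],                             # a section-heading group
    ["Deep", "learning", "has", "advanced", "NLP."],   # a paragraph group
])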

2.3. Codeยถ

  • VILA๋ฅผ Docbank ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ fine-tuningํ•œ ๋ชจ๋ธ ์‚ฌ์šฉ [2]

  • VILA official repository์—์„œ ๊ณต๊ฐœํ•œ ์ฝ”๋“œ์™€ ๋ชจ๋ธ ์‚ฌ์šฉ

2.3.1. Setup

clone repository

!git clone https://github.com/allenai/vila.git
import sys
sys.path.append('vila/src')

import libraries

import layoutparser as lp
from collections import defaultdict

from utils import download_paper
from vila.pdftools.pdf_extractor import PDFExtractor
from vila.predictors import HierarchicalPDFPredictor, LayoutIndicatorPDFPredictor

2.3.2. Modules

  • predict: pdf๋ฅผ ์ฝ์€ ๋’ค token classification ์ง„ํ–‰

  • construct_token_groups: ๊ฐ™์€ class๋กœ ์˜ˆ์ธก๋œ token๋ผ๋ฆฌ ๊ทธ๋ฃนํ™”

  • join_group_text: ๊ฐ™์€ group์˜ token๋“ค์„ ํ•˜๋‚˜์˜ text๋กœ ๋ฌถ์Œ

    • token์˜ bbox๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋„์–ด์“ฐ๊ธฐ ์œ ๋ฌด ๊ฒฐ์ •

  • construct_section_groups: section(์„œ๋ก , ๋ณธ๋ก , ๊ฒฐ๋ก  ๋“ฑ)๊ณผ section์— ํ•ด๋‹นํ•˜๋Š” paragraph ์ถ”์ถœ

def predict(pdf_path, pdf_extractor, vision_model, layout_model):
    page_tokens, page_images = pdf_extractor.load_tokens_and_image(pdf_path)

    pred_tokens = []
    for page_token, page_image in zip(page_tokens, page_images):
        blocks = vision_model.detect(page_image)
        page_token.annotate(blocks=blocks)
        pdf_data = page_token.to_pagedata().to_dict()
        pred_tokens += layout_model.predict(pdf_data, page_token.page_size)

    return pred_tokens
    

def construct_token_groups(pred_tokens):
    groups, group, group_type, prev_bbox = [], [], None, None
    
    for token in pred_tokens:
        if group_type is None:
            is_continued = True
            
        elif token.type == group_type:
            if group_type == 'section':
                # a section heading continues only if the token sits on the same line (same bottom-y coordinate)
                is_continued = abs(prev_bbox[3] - token.coordinates[3]) < 1.
            else:
                is_continued = True

        else:
            is_continued = False

        
        # print(token.text, token.type, is_continued)
        group_type = token.type
        prev_bbox = token.coordinates
        if is_continued:
            group.append(token)
        
        else:
            groups.append(group)
            group = [token]
    
    if group:
        groups.append(group)

    return groups

def join_group_text(group):
    text = ''
    prev_bbox = None
    for token in group:
        if not text:
            text += token.text
    
        else:        
            # insert a space only if there is a horizontal gap between the previous
            # token's right edge and the current token's left edge
            if abs(prev_bbox[2] - token.coordinates[0]) > 2:
                text += ' ' + token.text
    
            else:
                text += token.text
    
        prev_bbox = token.coordinates
    return text

def construct_section_groups(token_groups):
    section_groups = defaultdict(list)

    section = None
    for group in token_groups:
        group_type = group[0].type
        group_text = join_group_text(group)
        
        if group_type == 'section':
            section = group_text
            section_groups[section]
    
        elif group_type == 'paragraph' and section is not None:
            section_groups[section].append(group_text)

    section_groups = {k: ' '.join(v) for k,v in section_groups.items()}
    return section_groups

2.3.3. Run

prepare models

pdf_extractor = PDFExtractor("pdfplumber")
vision_model = lp.EfficientDetLayoutModel("lp://PubLayNet") 
layout_model = HierarchicalPDFPredictor.from_pretrained("allenai/hvila-row-layoutlm-finetuned-docbank")

inference

pdf_path = '2307.03170v1.pdf'
pred_tokens = predict(pdf_path, pdf_extractor, vision_model, layout_model)
token_groups = construct_token_groups(pred_tokens)
section_groups = construct_section_groups(token_groups)

2.3.4. Results

section ๋ชฉ๋ก

sections = list(section_groups.keys())
print(sections)

(output: the list of extracted section titles, including '6 Limitations and future work')

section text

print(section_groups['6 Limitations and future work'])

(output: the text of the '6 Limitations and future work' section)

3. BERTScore Distillationยถ

3.1. Introductionยถ

  • BERTScore๋Š” pretrained language model์„ ์‚ฌ์šฉํ•˜์—ฌ ๋‘ ๋ฌธ์žฅ์˜ ์œ ์‚ฌ๋„๋ฅผ ์ธก์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ์ฃผ๋กœ ๋ฒˆ์—ญ, ์š”์•ฝ ๋“ฑ ๋ฌธ์žฅ ์ƒ์„ฑ ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉํ•œ๋‹ค [3].

  • language model์˜ ํฌ๊ธฐ๊ฐ€ ํด์ˆ˜๋ก BERTScore์™€ Human evalution์˜ ์ƒ๊ด€ ๊ด€๊ณ„๊ฐ€ ํฐ ๊ฒฝํ–ฅ์ด ์žˆ๋‹ค.

  • ํ•˜์ง€๋งŒ ํฐ ๋ชจ๋ธ์€ ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ์‹ค์‹œ๊ฐ„์œผ๋กœ ์‚ฌ์šฉ๋˜๊ธฐ ์–ด๋ ต๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ๋‹ค.

  • ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ณ ์ž Knowledge distillation์„ ํ†ตํ•ด ์ž‘์€ ๋ชจ๋ธ์ด ํฐ ๋ชจ๋ธ์˜ BERTScore๋ฅผ ๋”ฐ๋ผํ•˜๋„๋ก ํ•™์Šต์‹œ์ผฐ๋‹ค.

  • ๊ฒฐ๊ณผ ๋ชจ๋ธ: yongsun-yoon/minilmv2-bertscore-distilled

๋ชจ๋ธ๋ณ„ BERTScore์™€ Human evaluation๊ณผ์˜ ์ƒ๊ด€๊ด€๊ณ„ [4]

๋ชจ๋ธ๋ณ„ BERTScore์™€ Human evaluation๊ณผ์˜ ์ƒ๊ด€๊ด€๊ณ„ [4]

3.2. Setupยถ

import math
import wandb
import easydict
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm

import torch
import torch.nn.functional as F

import huggingface_hub
from bert_score import BERTScorer
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel

cfg = easydict.EasyDict(
    device = 'cuda:0',
    student_name = 'nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large',
    teacher_name = 'microsoft/deberta-large-mnli',
    teacher_layer_idx = 18,
    lr = 5e-5,
    batch_size = 8,
    num_epochs = 5
)

3.3. Dataยถ

  • ์–ด๋А์ •๋„ ์œ ์‚ฌ์„ฑ์ด ์žˆ๋Š” ๋ฌธ์žฅ์Œ์„ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด GLUE MNLI ๋ฐ์ดํ„ฐ์…‹ ์„ ํƒ

class Dataset(torch.utils.data.Dataset):
    def __init__(self, data, text1_key, text2_key):
        self.data = data
        self.text1_key = text1_key
        self.text2_key = text2_key

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        text1 = item[self.text1_key]
        text2 = item[self.text2_key]
        return text1, text2

    def collate_fn(self, batch):
        texts1, texts2 = zip(*batch)
        return texts1, texts2

    def get_dataloader(self, batch_size, shuffle):
        return torch.utils.data.DataLoader(self, batch_size=batch_size, shuffle=shuffle, collate_fn=self.collate_fn)

data = load_dataset('glue', 'mnli')

train_data = data['train']
train_data = train_data.train_test_split(train_size=80000)['train']
train_dataset = Dataset(train_data, 'premise', 'hypothesis')
train_loader = train_dataset.get_dataloader(cfg.batch_size, True)

test_data = data['validation_mismatched'].train_test_split(test_size=4000)['test']
test_dataset = Dataset(test_data, 'premise', 'hypothesis')
test_loader = test_dataset.get_dataloader(cfg.batch_size, False)

3.4. Modelยถ

teacher_tokenizer = AutoTokenizer.from_pretrained(cfg.teacher_name)
teacher_model = AutoModel.from_pretrained(cfg.teacher_name)
_ = teacher_model.eval().requires_grad_(False).to(cfg.device)

student_tokenizer = AutoTokenizer.from_pretrained(cfg.student_name)
student_model = AutoModel.from_pretrained(cfg.student_name)
_ = student_model.train().to(cfg.device)
optimizer = torch.optim.Adam(student_model.parameters(), lr=cfg.lr)

3.5. Trainยถ

  • ๋‘ ๋ฌธ์žฅ์˜ cross attention score๋ฅผ ๊ณ„์‚ฐํ•œ ๋’ค teacher์˜ attention ๋ถ„ํฌ๋ฅผ student๊ฐ€ ๋”ฐ๋ผํ•˜๋„๋ก ํ•™์Šต

  • loss function์œผ๋กœ ๋‘ ๋ถ„ํฌ๊ฐ„์˜ ์ฐจ์ด๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” kl divergence๋ฅผ ์‚ฌ์šฉ

  • ์ด๋•Œ teacher model๊ณผ student model์˜ tokenizer๊ฐ€ ๋‹ค๋ฅผ ๊ฒฝ์šฐ token ๋‹จ์œ„์˜ ๋น„๊ต๊ฐ€ ๋ถˆ๊ฐ€๋Šฅํ•˜๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด token์„ word ๋‹จ์œ„๋กœ ๋ณ€ํ™˜ํ–ˆ๋‹ค.

def get_word_embeds(model, tokenizer, texts, layer_idx=-1, max_length=384):
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=max_length, return_tensors='pt').to(model.device)
    outputs = model(**inputs, output_hidden_states=True)
    
    num_texts = inputs.input_ids.size(0)
    token_embeds = outputs.hidden_states[layer_idx]

    batch_word_embeds = []
    for i in range(num_texts):
        text_word_embeds = []

        j = 0
        while True:
            token_span = inputs.word_to_tokens(i, j)
            if token_span is None: break

            word_embed = token_embeds[i][token_span.start:token_span.end].mean(dim=0)
            text_word_embeds.append(word_embed)
            j += 1

        text_word_embeds = torch.stack(text_word_embeds, dim=0).unsqueeze(0) # (1, seq_length, hidden_dim)
        batch_word_embeds.append(text_word_embeds) 

    return batch_word_embeds

def kl_div_loss(s, t, temperature):
    if len(s.size()) != 2:
        s = s.view(-1, s.size(-1))
        t = t.view(-1, t.size(-1))

    s = F.log_softmax(s / temperature, dim=-1)
    t = F.softmax(t / temperature, dim=-1)
    return F.kl_div(s, t, reduction='batchmean') * (temperature ** 2)

def transpose_for_scores(h, num_heads):
    batch_size, seq_length, dim = h.size()
    head_size = dim // num_heads
    h = h.view(batch_size, seq_length, num_heads, head_size)
    return h.permute(0, 2, 1, 3) # (batch, num_heads, seq_length, head_size)

def attention(h1, h2, num_heads, attention_mask=None):
    # assert h1.size() == h2.size()
    head_size = h1.size(-1) // num_heads
    h1 = transpose_for_scores(h1, num_heads) # (batch, num_heads, seq_length, head_size)
    h2 = transpose_for_scores(h2, num_heads) # (batch, num_heads, seq_length, head_size)

    attn = torch.matmul(h1, h2.transpose(-1, -2)) # (batch_size, num_heads, seq_length, seq_length)
    attn = attn / math.sqrt(head_size)
    if attention_mask is not None:
        attention_mask = attention_mask[:, None, None, :]
        attention_mask = (1 - attention_mask) * -10000.0
        attn = attn + attention_mask

    return attn

def train_epoch(
    teacher_model, teacher_tokenizer, 
    student_model, student_tokenizer,
    train_loader,
    teacher_layer_idx,
):
    
    student_model.train()
    pbar = tqdm(train_loader)
    for texts1, texts2 in pbar:
        teacher_embeds1 = get_word_embeds(teacher_model, teacher_tokenizer, texts1, layer_idx=teacher_layer_idx)
        teacher_embeds2 = get_word_embeds(teacher_model, teacher_tokenizer, texts2, layer_idx=teacher_layer_idx)
        
        student_embeds1 = get_word_embeds(student_model, student_tokenizer, texts1, layer_idx=-1)
        student_embeds2 = get_word_embeds(student_model, student_tokenizer, texts2, layer_idx=-1)
    
        teacher_scores1 = [attention(e1, e2, 1) for e1, e2 in zip(teacher_embeds1, teacher_embeds2)]
        student_scores1 = [attention(e1, e2, 1) for e1, e2 in zip(student_embeds1, student_embeds2)]
        loss1 = torch.stack([kl_div_loss(ts, ss, temperature=1.) for ts, ss in zip(teacher_scores1, student_scores1)]).mean()
        
        teacher_scores2 = [attention(e2, e1, 1) for e1, e2 in zip(teacher_embeds1, teacher_embeds2)]
        student_scores2 = [attention(e2, e1, 1) for e1, e2 in zip(student_embeds1, student_embeds2)]
        loss2 = torch.stack([kl_div_loss(ts, ss, temperature=1.) for ts, ss in zip(teacher_scores2, student_scores2)]).mean()
        
        loss = (loss1 + loss2) * 0.5
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        log = {'loss': loss.item(), 'loss1': loss1.item(), 'loss2': loss2.item()}
        wandb.log(log)
        pbar.set_postfix(log)

def test_epoch(
    teacher_model, teacher_tokenizer, 
    student_model, student_tokenizer,
    test_loader,
    teacher_layer_idx,
):
    student_model.eval()
    test_loss, num_data = 0, 0
    for texts1, texts2 in test_loader:
        with torch.no_grad():
            teacher_embeds1 = get_word_embeds(teacher_model, teacher_tokenizer, texts1, layer_idx=teacher_layer_idx)
            teacher_embeds2 = get_word_embeds(teacher_model, teacher_tokenizer, texts2, layer_idx=teacher_layer_idx)
            
            student_embeds1 = get_word_embeds(student_model, student_tokenizer, texts1, layer_idx=-1)
            student_embeds2 = get_word_embeds(student_model, student_tokenizer, texts2, layer_idx=-1)
    
        teacher_scores1 = [attention(e1, e2, 1) for e1, e2 in zip(teacher_embeds1, teacher_embeds2)]
        student_scores1 = [attention(e1, e2, 1) for e1, e2 in zip(student_embeds1, student_embeds2)]
        loss1 = torch.stack([kl_div_loss(ts, ss, temperature=1.) for ts, ss in zip(teacher_scores1, student_scores1)]).mean()
        
        teacher_scores2 = [attention(e2, e1, 1) for e1, e2 in zip(teacher_embeds1, teacher_embeds2)]
        student_scores2 = [attention(e2, e1, 1) for e1, e2 in zip(student_embeds1, student_embeds2)]
        loss2 = torch.stack([kl_div_loss(ts, ss, temperature=1.) for ts, ss in zip(teacher_scores2, student_scores2)]).mean()
        
        loss = (loss1 + loss2) * 0.5
        batch_size = len(texts1)
        test_loss += loss.item() * batch_size
        num_data += batch_size

    test_loss /= num_data
    return test_loss

wandb.init(project='bert-score-distillation')

best_loss = 1e10
for ep in range(cfg.num_epochs):
    train_epoch(teacher_model, teacher_tokenizer, student_model, student_tokenizer, train_loader, cfg.teacher_layer_idx)
    test_loss = test_epoch(teacher_model, teacher_tokenizer, student_model, student_tokenizer, test_loader, cfg.teacher_layer_idx)

    print(f'ep {ep:02d} | loss {test_loss:.3f}')
    if test_loss < best_loss:
        student_model.save_pretrained('checkpoint')
        student_tokenizer.save_pretrained('checkpoint')
        best_loss = test_loss
        wandb.log({'test_loss': test_loss})

(W&B chart: training loss)

3.6. Evaluateยถ

  • ํ•™์Šต ๊ฒฐ๊ณผ teacher model๊ณผ์˜ BERTScore ์ƒ๊ด€๊ด€๊ณ„๊ฐ€ 0.806์—์„œ 0.936์œผ๋กœ ํ–ฅ์ƒ

def calculate_score(scorer, loader):
    scores = []
    for texts1, texts2 in tqdm(loader):
        P, R, F = scorer.score(texts1, texts2)
        scores += F.tolist()
    return scores

teacher_scorer = BERTScorer(model_type=cfg.teacher_name, num_layers=cfg.teacher_layer_idx)
student_scorer = BERTScorer(model_type=cfg.student_name, num_layers=6)
distilled_student_scorer = BERTScorer(model_type='checkpoint', num_layers=6)

teacher_scores = calculate_score(teacher_scorer, test_loader)
student_scores = calculate_score(student_scorer, test_loader)
distilled_scores = calculate_score(distilled_student_scorer, test_loader)

scores = pd.DataFrame({'teacher': teacher_scores, 'student': student_scores, 'distilled': distilled_scores})
scores.corr().round(3)

(output: pairwise correlation table; the distilled student's correlation with the teacher is 0.936 vs. 0.806 before distillation)

  • The scatterplots also show that, after distillation, the student follows the teacher's BERTScore much more closely (a plotting sketch is given below).
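A possible way to produce those scatterplots (a sketch assuming the scores DataFrame built above; seaborn and matplotlib are imported in 3.2):

# Scatterplots of student vs. teacher BERTScore, before and after distillation.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
sns.scatterplot(data=scores, x='teacher', y='student', ax=axes[0])
sns.scatterplot(data=scores, x='teacher', y='distilled', ax=axes[1])
axes[0].set_title('student (before distillation)')
axes[1].set_title('distilled student')
plt.show()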


Referenceยถ

[1] Shen, Z., Lo, K., Wang, L. L., Kuehl, B., Weld, D. S., & Downey, D. (2022). VILA: Improving structured content extraction from scientific PDFs using visual layout groups. Transactions of the Association for Computational Linguistics, 10, 376–392.

[2] Li, M., Xu, Y., Cui, L., Huang, S., Wei, F., Li, Z., & Zhou, M. (2020). DocBank: A benchmark dataset for document layout analysis.ย arXiv preprint arXiv:2006.01038.

[3] Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.

[4] https://github.com/Tiiiger/bert_score