윤용선 Paper Reading Assistant

1. Introduction

  • A tool is needed for reading papers properly

  • Develop a system that generates questions about a paper and grades my answers

2. Method

  1. Download paper: download the paper metadata and pdf from arXiv

  2. Preprocess paper: extract the paper text from the pdf, then preprocess it

  3. Generate questions: read the paper and generate related questions

  4. Answer questions: generate a pseudo answer for each question

  5. Evaluate user answer: grade the human's answer against the pseudo answer

2.1. Download paper

  • Download the paper metadata and pdf file for the arxiv id entered by the user

  • Uses the arxiv library

import os

import arxiv

def download_paper(arxiv_id):
    # look up a single paper by its arXiv id
    search = arxiv.Search(id_list = [arxiv_id])
    result = next(search.results())

    # make sure the target directory exists, then download the pdf only once
    os.makedirs('pdfs', exist_ok=True)
    file_path = f'pdfs/{arxiv_id}.pdf'
    if not os.path.exists(file_path):
        result.download_pdf(dirpath='pdfs', filename=f'{arxiv_id}.pdf')

    authors = ', '.join([a.name for a in result.authors])
    paper = {'title': result.title, 'arxiv_id': arxiv_id, 'authors': authors, 'abstract': result.summary, 'file_path': file_path}
    return paper

def print_paper(paper):
    print('title: ' + paper['title'] + '\n')
    print('url: ' + 'https://arxiv.org/abs/' + paper['arxiv_id'] + '\n')
    print('authors: ' + paper['authors'] + '\n')
    print('abstract: ' + paper['abstract'])

arxiv_id = '2305.13298'
paper = download_paper(arxiv_id)
print_paper(paper)


2.2. Preprocess paper

  • Extract text from the pdf file using the fitz library

  • Split the text by a fixed length to build chunks

  • Compute an embedding for each chunk and store it in an index

  • Uses the langchain library

    • RecursiveCharacterTextSplitter: splits text into pieces of a given length

    • HuggingFaceEmbeddings: Text embedding wrapper

    • FAISS: Faiss vector index wrapper

import re

def clean_text(text):
    # collapse newlines and runs of whitespace into single spaces
    text = text.replace('\n', ' ')
    text = re.sub(r'\s+', ' ', text)
    return text

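import fitz  # PyMuPDF

def extract_text(pdf_path):
    doc = fitz.open(pdf_path)
    num_pages = doc.page_count

    # extract and clean the plain text of each page
    text = []
    for p in range(num_pages):
        page = doc[p]
        page_text = page.get_text('text')
        page_text = clean_text(page_text)
        text.append(page_text)

    doc.close()
    # pages are joined with newlines, the only newlines left after cleaning
    text = '\n'.join(text)
    return text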

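# import paths assume the langchain release in use when this was written
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

def split_text(text, chunk_size=1024):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=100)
    chunks = text_splitter.split_text(text)
    # keep only nearly full-size chunks; shorter fragments are discarded
    chunks = [c for c in chunks if len(c) > chunk_size * 0.9]
    return chunks

text = extract_text(paper['file_path'])
chunks = split_text(text)

# embed every chunk and store the vectors in a FAISS index
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = FAISS.from_texts(chunks, embeddings)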
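
The chunking step above can be illustrated without any library. The following is a minimal sketch, assuming plain fixed-size character windows; RecursiveCharacterTextSplitter itself prefers to break on separators such as paragraphs and spaces, so its chunk boundaries will differ. The names `split_with_overlap` and `drop_short_chunks` are hypothetical and not part of the project.

```python
def split_with_overlap(text, chunk_size=1024, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap (simplified)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


def drop_short_chunks(chunks, chunk_size=1024, min_ratio=0.9):
    """Keep only chunks longer than min_ratio * chunk_size, mirroring split_text."""
    return [c for c in chunks if len(c) > chunk_size * min_ratio]


sample = "abcdefghij" * 25  # 250 characters of dummy text
chunks = split_with_overlap(sample, chunk_size=100, chunk_overlap=20)
kept = drop_short_chunks(chunks, chunk_size=100)
print([len(c) for c in chunks], len(kept))
```

Consecutive chunks share `chunk_overlap` characters, so a sentence cut at a boundary still appears whole in one of them. The length filter drops the short trailing chunk, which is often an incomplete fragment.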

2.3. Generate questions

  • Generate questions that can be answered from a chunk

  • Chunk selection

    • Some chunks are unrelated to the paper's content, such as references and acknowledgements

    • To address this, select only the 5 chunks most similar to the paper's abstract

  • Question generation

    • Generate 3 questions per chunk

    • Uses langchain's ChatOpenAI module

  • Question selection

    • Many similar questions are generated, so a step that selects distinct questions is needed

    • Select questions using Maximal Marginal Relevance (MMR), a technique used in extractive summarization
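
The selection rule can be written out as follows. This is a reading of the MMR class listed later in this section rather than a separate specification: the query q is the mean of all question embeddings, S is the set of already selected questions, and the similarity is an angular similarity derived from the cosine.

```latex
\mathrm{sim}(a, b) = 1 - \frac{1}{\pi} \arccos\!\left( \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} \right)

\mathrm{MMR}(s) = \lambda \, \mathrm{sim}(s, q) \; - \; (1 - \lambda) \max_{s_j \in S} \mathrm{sim}(s, s_j)
```

The second term is taken as 0 while S is empty. At each step the remaining question with the highest MMR score is added to S, until k questions are selected; the code uses k = 3 and λ = 0.5.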

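import re

# import path assumes the langchain release in use when this was written
from langchain.schema import SystemMessage, HumanMessage

def clean_questions(questions):
    # strip the leading "1.", "2.", ... numbering and drop empty lines
    lines = [re.sub(r'^\s*\d+\.\s*', '', q).strip() for q in questions.split('\n')]
    return [q for q in lines if q]

def generate_questions(chat, context):
    system_content = (
        'Generate three questions based on the paragraph. '
        'All questions should be answerable using the information provided in the paragraph.'
    )

    messages = [
        SystemMessage(content=system_content),
        HumanMessage(content=context)
    ]

    # run_chat is a helper that sends the messages and returns the reply text
    questions = run_chat(chat, messages)
    return clean_questions(questions)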
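
The `run_chat` helper is used throughout but its definition is not shown. A minimal sketch of what it might look like follows, assuming the chat object can be called with a list of messages and returns an object with a `.content` attribute. The retry loop and the `max_retries` and `wait_seconds` parameters are additions for illustration, not the author's implementation.

```python
import time


def run_chat(chat, messages, max_retries=3, wait_seconds=1.0):
    """Send messages to a chat model and return the reply text (hypothetical helper)."""
    last_error = None
    for attempt in range(max_retries):
        try:
            response = chat(messages)  # assumed: callable returning an object with .content
            return response.content.strip()
        except Exception as error:  # e.g. a transient API failure
            last_error = error
            if attempt < max_retries - 1:
                time.sleep(wait_seconds * (attempt + 1))  # wait a little longer each time
    raise RuntimeError(f"chat call failed after {max_retries} attempts") from last_error
```

Keeping the model call behind one function means the prompt-building functions do not need to change if the chat interface does.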

import numpy as np

class MMR(object):
    def __init__(self, k, _lambda):
        self.k = k              # number of items to select
        self._lambda = _lambda  # trade-off between relevance and diversity
    
    def get_similarity(self, s1, s2):
        """ Get angular similarity between vectors

        Params:
        s1 (np.array): 1d sentence embedding (d,)
        s2 (np.array): 1d sentence embedding (d,)
        
        Returns:
        sim (float): angular similarity in [0, 1]
        """

        cossim = np.dot(s1, s2) / (np.linalg.norm(s1) * np.linalg.norm(s2))
        # clip guards against rounding slightly outside [-1, 1], where arccos returns nan
        sim = 1 - np.arccos(np.clip(cossim, -1.0, 1.0)) / np.pi
        return sim
    
    def get_similarity_with_matrix(self, s, m):
        """Get angular similarity between vector and matrix

        Params:
        s (np.array): 1d sentence embedding (d,)
        m (np.array): 2d sentences' embedding (n, d)

        Returns:
        sim (np.array): similarity (n,)
        """

        cossim = np.dot(m, s) / (np.linalg.norm(s) * np.linalg.norm(m, axis=1))
        sim = 1 - np.arccos(np.clip(cossim, -1.0, 1.0)) / np.pi
        return sim
    
    def get_mmr_score(self, s, q, selected):
        """Get MMR (Maximal Marginal Relevance) score of a sentence

        Params:
        s (np.array): sentence embedding (d,)
        q (np.array): query embedding (d,)
        selected (np.array): embedding of selected sentences (m, d)

        Returns:
        mmr_score (float)
        """

        relevance = self._lambda * self.get_similarity(s, q)
        if selected.shape[0] > 0:
            negative_diversity = (1 - self._lambda) * np.max(self.get_similarity_with_matrix(s, selected))
        else:
            negative_diversity = 0
        return relevance - negative_diversity

    def summarize(self, embedding):
        selected = [False] * len(embedding)
        k = min(self.k, len(embedding))  # cannot select more items than exist

        query = np.mean(embedding, axis=0) # (d,) centroid of all embeddings
        while np.sum(selected) < k:
            selected_embedding = embedding[selected]
            remaining_idx = [idx for idx, i in enumerate(selected) if not i]
            mmr_score = [self.get_mmr_score(embedding[i], query, selected_embedding) for i in remaining_idx]
            best_idx = remaining_idx[int(np.argmax(mmr_score))]
            selected[best_idx] = True

        selected = np.where(selected)[0].tolist()
        return selected

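# import path assumes the langchain release in use when this was written
from langchain.chat_models import ChatOpenAI

# pick the 5 chunks most similar to the abstract
contexts = db.similarity_search(paper['abstract'], k=5)
contexts = [c.page_content for c in contexts]

# OPENAI_API_KEY is assumed to be defined elsewhere
chat = ChatOpenAI(openai_api_key=OPENAI_API_KEY)
questions = []
for ctx in contexts:
    questions += generate_questions(chat, ctx)

question_embeds = embeddings.embed_documents(questions)
question_embeds = np.array(question_embeds)

# keep 3 questions that are relevant yet mutually distinct
mmr = MMR(k=3, _lambda=0.5)
question_idxs = mmr.summarize(question_embeds)
selected_questions = [questions[i] for i in question_idxs]
selected_questions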

Example generated questions

What is DIFFUSIONNER and how does it formulate named entity recognition?
How does the model generate entity boundaries during inference?
What are the advantages of DIFFUSIONNER over previous models?

2.4. Answer questions

  • Generate an answer for each generated question

  • Retrieve the chunk most similar to the question and pass it to ChatGPT together with the question
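
Retrieval by similarity, used here and for the abstract-based chunk selection in 2.3, amounts to ranking chunk embeddings against a query embedding. The sketch below shows the idea with hand-made 2-d vectors. It assumes cosine similarity as the ranking score, which may differ from the distance the FAISS index actually uses, and the `top_k` helper is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=1):
    """Return the indices of the k chunks most similar to the query, best first."""
    order = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return order[:k]


# Toy 2-d "embeddings"; real ones come from the embedding model.
chunk_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
question_vec = [0.1, 0.9]
best = top_k(question_vec, chunk_vecs, k=1)
best_two = top_k(question_vec, chunk_vecs, k=2)
print(best, best_two)
```

With k=1, as in the answering loop, only the single closest chunk is handed to the model as context, so an answer that spans several chunks may be missed.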

def answer_question(chat, context, question):
    system_content = (
        "Answer the question as truthfully as possible using the provided context. "
        "The answer should be one line."
    )

    user_content = f'Context:\n{context}\nQuestion:\n{question}'

    messages = [
        SystemMessage(content=system_content),
        HumanMessage(content=user_content)
    ]

    return run_chat(chat, messages)

qnas = []
for q in selected_questions:
    ctx = db.similarity_search(q, k=1)[0].page_content
    ans = answer_question(chat, ctx, q)
    qnas.append((q, ans))

Example answers

Q: What is DIFFUSIONNER and how does it formulate named entity recognition?
A: DIFFUSIONNER formulates named entity recognition task as a boundary-denoising diffusion process and generates named entities from noisy spans.
evidence: we propose DIFFUSIONNER, which formulates the named entity recognition task as a boundary-denoising diffusion process and thus generates named entities from noisy spans.

Q: How does the model generate entity boundaries during inference?
A: The model predicts entity boundaries at the word level using max-pooling to aggregate subwords into word representations.
evidence: Entity boundaries are predicted at the word level, and we use max-pooling to aggregate subwords into word representations

Q: What are the advantages of DIFFUSIONNER over previous models?
A: DIFFUSIONNER can achieve better performance while maintaining a faster inference speed with minimal parameter scale compared to previous generation-based models.
evidence: we find that DIFFUSIONNER could achieve better performance while maintaining a faster inference speed with minimal parameter scale.

2.5. Evaluate user answer

  • syntactic evaluation: evaluates English grammar and phrasing

    • Uses ChatGPT to generate a corrected sentence

  • semantic evaluation: evaluates the semantic similarity between the pseudo answer and the predicted answer

    • Uses BERTScore

def edit_english(chat, text):
    system_content = (
        'You are an English spelling corrector and improver. '
        'User will give you an English text and you will answer the corrected and improved version of the text. ' 
        'Reply only the corrected and improved text, do not write explanations. ' 
        'If the text is perfect write "The text is perfect." '
        f'The text is "{text}"'
    )

    messages = [SystemMessage(content=system_content)]
    return run_chat(chat, messages)

from bert_score import BERTScorer

bert_scorer = BERTScorer(model_type='microsoft/deberta-base-mnli')
question, answer = qnas[0]  # each entry of qnas is a (question, answer) pair
print('Question:', question)

user_answer = 'DiffusionNER is a named entity recognition model which formulates the NER task as a boundary denoising diffusion process.'

syntactic_evaluation = edit_english(chat, user_answer)
semantic_evaluation = bert_scorer.score([user_answer], [answer])[2][0].item() # score returns (P, R, F1); take F1
semantic_evaluation = round(semantic_evaluation * 100, 2)
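
To make the semantic score less of a black box, the core of BERTScore can be sketched as greedy matching between token embeddings. The example below is a toy version with hand-made 2-d vectors: the real metric uses contextual embeddings from the chosen model and offers optional idf weighting and baseline rescaling, all omitted here. The `greedy_bertscore` name is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def greedy_bertscore(cand_vecs, ref_vecs):
    """Precision, recall and F1 from token embeddings via greedy max-cosine matching."""
    # precision: each candidate token is matched to its most similar reference token
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # recall: each reference token is matched to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy token embeddings: the candidate covers two of the three reference tokens exactly.
cand = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p, r, f1 = greedy_bertscore(cand, ref)
print(round(p, 4), round(r, 4), round(f1, 4))
```

Precision is high when everything the user wrote is supported by the pseudo answer, and recall is high when the user covered everything in it. The code above reports F1, which balances the two.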

Example result

[Figure: result example]

3. Future works

  • Component improvements

    • common questions

    • prompt engineering

    • evaluation algorithm

    • distillation to open-source llm

  • UI development