Practical Paper Reading Assistant
1. Introduction
A tool is needed for reading papers properly.
Goal: build a system that generates questions about a paper and scores my answers to them.
2. Method
Download paper: download the paper's metadata and pdf from arXiv
Preprocess paper: extract the paper's text from the pdf, then preprocess it
Generate questions: read the paper and generate related questions
Answer questions: generate a pseudo ground-truth answer for each question
Evaluate user answer: score the user's answer against the pseudo ground-truth answer
2.1. Download paper
Download the metadata and pdf file of the paper matching the arxiv id entered by the user.
Uses the arxiv library.
import os

import arxiv

def download_paper(arxiv_id):
    # Look up the paper's metadata on arXiv by its id.
    search = arxiv.Search(id_list=[arxiv_id])
    result = next(search.results())
    file_path = f'pdfs/{arxiv_id}.pdf'
    # Download the pdf only if it is not already cached locally.
    if not os.path.exists(file_path):
        os.makedirs('pdfs', exist_ok=True)
        result.download_pdf(dirpath='pdfs', filename=f'{arxiv_id}.pdf')
    authors = ', '.join([a.name for a in result.authors])
    paper = {'title': result.title, 'arxiv_id': arxiv_id, 'authors': authors, 'abstract': result.summary, 'file_path': file_path}
    return paper

def print_paper(paper):
    print('title: ' + paper['title'] + '\n')
    print('url: ' + 'https://arxiv.org/abs/' + paper['arxiv_id'] + '\n')
    print('authors: ' + paper['authors'] + '\n')
    print('abstract: ' + paper['abstract'])

arxiv_id = '2305.13298'
paper = download_paper(arxiv_id)
print_paper(paper)
2.2. Preprocess paper
Extract text from the pdf file with the fitz library (PyMuPDF).
Split the text into chunks of a fixed length.
Extract an embedding for each chunk and store it in an index.
Uses the langchain library:
RecursiveCharacterTextSplitter: splits text into pieces of a given length
HuggingFaceEmbeddings: text embedding wrapper
FAISS: Faiss vector index wrapper
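import re

import fitz  # PyMuPDF
# Import paths below match langchain releases from around the time of writing;
# newer releases may expose these classes from different packages.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

def clean_text(text):
    # Collapse newlines and repeated whitespace into single spaces.
    text = text.replace('\n', ' ')
    text = re.sub(r'\s+', ' ', text)
    return text

def extract_text(pdf_path):
    # Read the pdf page by page and join the cleaned page texts.
    doc = fitz.open(pdf_path)
    num_pages = doc.page_count
    text = []
    for p in range(num_pages):
        page = doc[p]
        page_text = page.get_text('text')
        page_text = clean_text(page_text)
        text.append(page_text)
    doc.close()
    text = '\n'.join(text)
    return text

def split_text(text, chunk_size=1024):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=100)
    chunks = text_splitter.split_text(text)
    # Keep only chunks that are close to the full chunk size.
    chunks = [c for c in chunks if len(c) > chunk_size * 0.9]
    return chunks

text = extract_text(paper['file_path'])
chunks = split_text(text)
# Embed each chunk and store the vectors in a Faiss index.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = FAISS.from_texts(chunks, embeddings)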
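To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-window splitting with overlap. It is a simplification and not what RecursiveCharacterTextSplitter does internally (that class splits on separators recursively). The function name split_with_overlap and the small sizes are made up for illustration.

```python
def split_with_overlap(text, chunk_size=20, chunk_overlap=5):
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz" * 2  # 52 characters of toy text
chunks = split_with_overlap(sample, chunk_size=20, chunk_overlap=5)

# Same kind of length filter as in split_text: drop chunks much shorter than chunk_size.
kept = [c for c in chunks if len(c) > 20 * 0.9]
print(len(chunks), len(kept))
```

Consecutive chunks share their boundary characters, so a sentence cut at a chunk edge still appears whole in one of the two neighbours. The length filter then discards the short tail chunk.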
2.3. Generate questions
Generate questions that can be answered by looking at a chunk.
Chunk selection
Some chunks, such as references and acknowledgements, are unrelated to the paper's content.
To handle this, only the 5 chunks most similar to the paper's abstract are selected.
Question generation
3 questions are generated per chunk.
Uses the ChatOpenAI module from langchain.
Question selection
Many of the generated questions are similar to each other, so a step that selects mutually different questions is needed.
Questions are selected with Maximal Marginal Relevance (MMR), which is used in extractive summarization.
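It may help to read the MMR class in this section as a formula. The score below is transcribed from that code. The symbols are names chosen here for illustration: q is the query embedding (the mean of all question embeddings), S is the set of questions already selected, and λ is the `_lambda` parameter.

```latex
\mathrm{MMR}(s) = \lambda \, \mathrm{sim}(s, q) - (1 - \lambda) \max_{t \in S} \mathrm{sim}(s, t),
\qquad
\mathrm{sim}(a, b) = 1 - \frac{1}{\pi} \arccos\!\left( \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} \right)
```

The max term is taken as 0 while S is empty. With λ = 0.5, as in the code, relevance to the query and redundancy with already selected questions carry equal weight.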
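from langchain.schema import HumanMessage, SystemMessage

def clean_questions(questions):
    # Strip the leading "1.", "2.", ... numbering from each generated line.
    return [q.replace(f'{i+1}.', '').strip() for i, q in enumerate(questions.split('\n'))]

def generate_questions(chat, context):
    system_content = (
        'Generate three questions based on the paragraph. '
        'All questions should be answerable using the information provided in the paragraph.'
    )
    messages = [
        SystemMessage(content=system_content),
        HumanMessage(content=context)
    ]
    # run_chat is a helper that is not shown in this write-up; it sends the
    # messages to the chat model and returns the reply text.
    questions = run_chat(chat, messages)
    return clean_questions(questions)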
import numpy as np

class MMR(object):
    def __init__(self, k, _lambda):
        self.k = k  # number of items to select
        self._lambda = _lambda  # weight of relevance against diversity

    def get_similarity(self, s1, s2):
        """Get angular similarity (derived from cosine similarity) between vectors
        Params:
            s1 (np.array): 1d sentence embedding (d,)
            s2 (np.array): 1d sentence embedding (d,)
        Returns:
            sim (float): angular similarity in [0, 1]
        """
        cossim = np.dot(s1, s2) / (np.linalg.norm(s1) * np.linalg.norm(s2))
        # Clip so that rounding error cannot push the value outside arccos' domain.
        sim = 1 - np.arccos(np.clip(cossim, -1.0, 1.0)) / np.pi
        return sim

    def get_similarity_with_matrix(self, s, m):
        """Get angular similarity between a vector and each row of a matrix
        Params:
            s (np.array): 1d sentence embedding (d,)
            m (np.array): 2d sentences' embedding (n, d)
        Returns:
            sim (np.array): similarity (n,)
        """
        cossim = np.dot(m, s) / (np.linalg.norm(s) * np.linalg.norm(m, axis=1))
        sim = 1 - np.arccos(np.clip(cossim, -1.0, 1.0)) / np.pi
        return sim

    def get_mmr_score(self, s, q, selected):
        """Get MMR (Maximal Marginal Relevance) score of a sentence
        Params:
            s (np.array): sentence embedding (d,)
            q (np.array): query embedding (d,)
            selected (np.array): embedding of selected sentences (m, d)
        Returns:
            mmr_score (float)
        """
        relevance = self._lambda * self.get_similarity(s, q)
        if selected.shape[0] > 0:
            negative_diversity = (1 - self._lambda) * np.max(self.get_similarity_with_matrix(s, selected))
        else:
            negative_diversity = 0
        return relevance - negative_diversity

    def summarize(self, embedding):
        selected = [False] * len(embedding)
        query = np.mean(embedding, axis=0)  # (d,)
        k = min(self.k, len(embedding))  # cannot select more items than exist
        while np.sum(selected) < k:
            selected_embedding = embedding[selected]
            remaining_idx = [idx for idx, i in enumerate(selected) if not i]
            mmr_score = [self.get_mmr_score(embedding[i], query, selected_embedding) for i in remaining_idx]
            best_idx = remaining_idx[np.argsort(mmr_score)[-1]]
            selected[best_idx] = True
        # Return the selected indices in ascending order.
        selected = np.where(selected)[0].tolist()
        return selected
from langchain.chat_models import ChatOpenAI

# Keep only the 5 chunks most similar to the abstract.
contexts = db.similarity_search(paper['abstract'], k=5)
contexts = [c.page_content for c in contexts]
# OPENAI_API_KEY must be defined beforehand.
chat = ChatOpenAI(openai_api_key=OPENAI_API_KEY)
questions = []
for ctx in contexts:
    questions += generate_questions(chat, ctx)
# Embed the questions and pick 3 mutually different ones with MMR.
question_embeds = embeddings.embed_documents(questions)
question_embeds = np.array(question_embeds)
mmr = MMR(k=3, _lambda=0.5)
question_idxs = mmr.summarize(question_embeds)
selected_questions = [questions[i] for i in question_idxs]
selected_questions
Examples of generated questions
What is DIFFUSIONNER and how does it formulate named entity recognition?
How does the model generate entity boundaries during inference?
What are the advantages of DIFFUSIONNER over previous models?
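To see why MMR helps with near-duplicate questions, here is a small self-contained sketch on made-up 2-d vectors. It follows the same scoring idea as the MMR class in this section but is not that class: mmr_select and angular_sim are illustrative names, and the indices are returned in selection order rather than sorted.

```python
import numpy as np


def angular_sim(a, b):
    """Angular similarity in [0, 1], derived from cosine similarity."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi


def mmr_select(embeds, k, lam):
    """Greedily pick k rows, trading relevance to the mean against redundancy."""
    query = embeds.mean(axis=0)
    selected = []
    while len(selected) < min(k, len(embeds)):
        best_idx, best_score = None, -np.inf
        for i in range(len(embeds)):
            if i in selected:
                continue
            relevance = lam * angular_sim(embeds[i], query)
            # Highest similarity to anything already picked (0 if nothing picked yet).
            redundancy = max(
                (angular_sim(embeds[i], embeds[j]) for j in selected), default=0.0
            )
            score = relevance - (1 - lam) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected


# Rows 0 and 1 point in almost the same direction; row 2 is very different.
toy = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
picked_balanced = mmr_select(toy, k=2, lam=0.5)
picked_relevance_only = mmr_select(toy, k=2, lam=1.0)
print(picked_balanced, picked_relevance_only)
```

With lam=0.5 the second pick skips the near-duplicate row and takes the different one. With lam=1.0 the redundancy term vanishes and both near-duplicates are selected.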
2.4. Answer questions
Generate an answer for each generated question.
Retrieve the chunk most similar to the question, then pass both to ChatGPT.
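The retrieval step can be pictured as a nearest-neighbour lookup over chunk embeddings. The sketch below uses plain NumPy with Euclidean distance on made-up 2-d vectors. The name top_k_l2 is hypothetical. Which distance the Faiss index behind db.similarity_search uses depends on how the index was built; Euclidean distance is assumed here.

```python
import numpy as np


def top_k_l2(query, index_vectors, k=1):
    """Return indices of the k rows closest to query by Euclidean distance."""
    dists = np.linalg.norm(index_vectors - query, axis=1)
    return np.argsort(dists)[:k].tolist()


# Toy "chunk embeddings" and a toy "question embedding".
chunk_vectors = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
question_vector = np.array([0.9, 1.2])

nearest = top_k_l2(question_vector, chunk_vectors, k=1)
nearest_two = top_k_l2(question_vector, chunk_vectors, k=2)
print(nearest, nearest_two)
```

With k=1, as in the answering loop, only the single closest chunk is passed to the model as context.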
def answer_question(chat, context, question):
    system_content = (
        "Answer the question as truthfully as possible using the provided context. "
        "The answer should be one line."
    )
    user_content = f'Context:\n{context}\nQuestion:\n{question}'
    messages = [
        SystemMessage(content=system_content),
        HumanMessage(content=user_content)
    ]
    return run_chat(chat, messages)

qnas = []
for q in selected_questions:
    # Use the single chunk most similar to the question as context.
    ctx = db.similarity_search(q, k=1)[0].page_content
    ans = answer_question(chat, ctx, q)
    qnas.append((q, ans))
Examples of answers
Q: What is DIFFUSIONNER and how does it formulate named entity recognition?
A: DIFFUSIONNER formulates named entity recognition task as a boundary-denoising diffusion process and generates named entities from noisy spans.
evidence: we propose DIFFUSIONNER, which formulates the named entity recognition task as a boundary-denoising diffusion process and thus generates named entities from noisy spans.
Q: How does the model generate entity boundaries during inference?
A: The model predicts entity boundaries at the word level using max-pooling to aggregate subwords into word representations.
evidence: Entity boundaries are predicted at the word level, and we use max-pooling to aggregate subwords into word representations
Q: What are the advantages of DIFFUSIONNER over previous models?
A: DIFFUSIONNER can achieve better performance while maintaining a faster inference speed with minimal parameter scale compared to previous generation-based models.
evidence: we find that DIFFUSIONNER could achieve better performance while maintaining a faster inference speed with minimal parameter scale.
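The run_chat helper used by the functions in this write-up is not shown. A minimal sketch of what such a helper could look like follows. It assumes the chat object can be called on a list of messages and returns an object with a content attribute, which is how langchain chat models behaved around the time of writing (newer releases favour an invoke method). FakeReply and fake_chat are stand-ins so the sketch runs without an API key.

```python
from dataclasses import dataclass


def run_chat(chat, messages):
    """Hypothetical helper: call a chat model and return the reply text."""
    response = chat(messages)  # assumed call style, see the note above
    return response.content.strip()


@dataclass
class FakeReply:
    """Stand-in for the message object a chat model returns."""
    content: str


def fake_chat(messages):
    # Offline stand-in for ChatOpenAI: reports how many messages it received.
    return FakeReply(content=f" {len(messages)} messages received ")


reply = run_chat(fake_chat, ["system prompt", "user prompt"])
print(reply)
```

Keeping the model call behind one small function means the prompt-building code does not need to change if the chat client's interface does.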
2.5. Evaluate user answer
syntactic evaluation: evaluates English grammar and phrasing
Uses ChatGPT to generate a corrected sentence
semantic evaluation: evaluates the semantic similarity between the pseudo ground-truth answer and the user's answer
Uses BERTScore
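For intuition about the semantic score, the core of BERTScore is a greedy match between token embeddings of the two sentences: precision averages each candidate token's best match in the reference, recall does the reverse, and F1 combines them. The sketch below shows only that matching step on made-up 2-d vectors. The real library obtains contextual embeddings from a transformer and offers options such as idf weighting, which are left out here. greedy_match_f1 is an illustrative name.

```python
import numpy as np


def greedy_match_f1(cand, ref):
    """Greedy-matching precision, recall and F1 over token embedding matrices."""
    # Normalise rows so that dot products are cosine similarities.
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    sim = cand @ ref.T  # (n_candidate_tokens, n_reference_tokens)
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy token embeddings: the reference has one token the candidate only partly covers.
cand_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
ref_tokens = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

p, r, f1 = greedy_match_f1(cand_tokens, ref_tokens)
print(round(p, 4), round(r, 4), round(f1, 4))
```

In this toy case every candidate token has an exact match, so precision is perfect, while recall drops because one reference token is only partly covered. The evaluation code in this section reports the F1 component.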
from bert_score import BERTScorer

def edit_english(chat, text):
    system_content = (
        'You are an English spelling corrector and improver. '
        'User will give you an English text and you will answer the corrected and improved version of the text. '
        'Reply only the corrected and improved text, do not write explanations. '
        'If the text is perfect write "The text is perfect." '
        f'The text is "{text}"'
    )
    messages = [SystemMessage(content=system_content)]
    return run_chat(chat, messages)

bert_scorer = BERTScorer(model_type='microsoft/deberta-base-mnli')
# Take the question and pseudo ground-truth answer; ignore any extra fields.
question, answer = qnas[0][:2]
print('Question:', question)
user_answer = 'DiffusionNER is a named entity recognition model which formulates the NER task as a boundary denoising diffusion process.'
# Syntactic evaluation: ask the chat model for a corrected version of the answer.
syntactic_evaluation = edit_english(chat, user_answer)
# Semantic evaluation: score() returns (P, R, F1); take F1 of the first pair.
semantic_evaluation = bert_scorer.score([user_answer], [answer])[2][0].item()
semantic_evaluation = round(semantic_evaluation * 100, 2)
Example results
3. Future work
Component improvements
Common questions
prompt engineering
evaluation algorithm
distillation to open-source LLM
UI development