๋ฐ•์ง„์Šฌ - Project Moogeulยถ

1. ํ”„๋กœ์ ํŠธ ์š”์•ฝยถ


๋ชฉ์ ยถ

  • ๊ธ€์— ๋‚ดํฌ๋œ ๊ฐ์ •์„ ์ถ”์ถœํ•˜๊ณ  ๊ทธ ๊ฐ์ •์„ ํ†ตํ•ด ๋˜๋Œ์•„ ๋ณผ ์ˆ˜ ์žˆ๋Š” ์„œ๋น„์Šค ์ƒ์„ฑ

  • ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์™€ ๊ด€๋ จ๋œ ์—ฌ๋Ÿฌ๊ฐ€์ง€ ํƒœ์Šคํฌ๋ฅผ ๊ฒฝํ—˜ํ•˜๊ณ , ๊ทธ ์ค‘ ํ•ด๋‹น ์„œ๋น„์Šค์— ์ ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๋ถ€๋ถ„์„ ์ ์šฉ

๊ธฐ๋Šฅยถ

  • ๊ธ€์„ ์“ฐ๊ณ  ์‹ถ์–ดํ•˜๋Š” ์‚ฌ๋žŒ๋“ค์—๊ฒŒ ๊ธ€์„ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ๋Š” ๊ณต๊ฐ„์„ ์ œ๊ณต

  • ์—์„ธ์ด ํ•œํŽธ์ด ์™„๋ฃŒ๋˜๋ฉด, ์ž‘์„ฑ์ž์˜ ๊ฐ์ •์„ ๋ถ„์„ํ•˜๊ณ  ๊ทธ ๊ฐ์ •๊ณผ ๊ด€๋ จ๋œ ๋‹จ์–ด ํ†ต๊ณ„๋ฅผ ์ œ๊ณต

๊ธฐ๋Œ€ํšจ๊ณผยถ

  • ์‚ฌ์šฉ์ž๋Š” ์ž๊ธฐ ์ž์‹ ์˜ ๋‹จ์–ด ์‚ฌ์šฉ ํ–‰ํƒœ๋ฅผ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ๋‹ค

  • ์ €์ž ๋ณ„ ๋‹จ์–ด-๊ฐ์ • ์‚ฌ์šฉ ํ–‰ํƒœ ๋น„๊ต๋ฅผ ํ†ตํ•ด ํŠน์ง•์„ ์ฐพ์•„๋‚ผ ์ˆ˜ ์žˆ๋‹ค

2. ๊ฒฐ๊ณผยถ

Project Moogeul - a Hugging Face Space by seriouspark

์„œ๋น„์Šค ๊ตฌ์กฐยถ

ํ”„๋กœ์„ธ์Šค

๋‚ด์šฉ

ํ•„์š” ๋ฐ์ดํ„ฐ์…‹

ํ•„์š” ๋ชจ๋ธ๋ง

๊ธฐํƒ€ ํ•„์š”ํ•ญ๋ชฉ

1. ๋‹จ์–ด ์ž…๋ ฅ ์‹œ ์—์„ธ์ด 1ํŽธ์„ ์“ธ ์ˆ˜ ์žˆ๋Š” โ€˜๊ธ€์“ฐ๊ธฐโ€™ ๊ณต๊ฐ„ ์ œ๊ณต

๋„ค์ด๋ฒ„ ํ•œ๊ตญ์–ด ์‚ฌ์ „

-

streamlit ๋Œ€์‹œ๋ณด๋“œ

2. ์—์„ธ์ด ๋‚ด ๋ฌธ์žฅ ๋ถ„๋ฅ˜

ํ•œ๊ตญ์–ด ๊ฐ์ •๋ถ„์„ ์ž๋ฃŒ 58000์—ฌ๊ฑด

xlm-roberta

-

3. ๋ฌธ์žฅ ๋ณ„ ๊ฐ์ • ๋ผ๋ฒจ ๋ฐ˜ํ™˜

ํ•œ๊ตญ์–ด ๊ฐ์ •๋ถ„์„ ์ž๋ฃŒ 58000์—ฌ๊ฑด + ๋ผ๋ฒจ ๋‹จ์ˆœํ™” (60๊ฐœ โ†’ 6๊ฐœ)

Bert Classifier

4. ๋ฌธ์žฅ ๋‚ด ๋ช…์‚ฌ, ํ˜•์šฉ์‚ฌ๋ฅผ konlpy ํ™œ์šฉํ•˜์—ฌ ์ถ”์ถœ

ํ•œ๊ตญ์–ด ๊ฐ์ •๋ถ„์„ ์ž๋ฃŒ 58000์—ฌ๊ฑด

konlpy Kkma

huggingface - pos tagger ๊ฒ€ํ† 

5. ๋ช…์‚ฌ, ํ˜•์šฉ์‚ฌ์™€ ๊ฐ์ • ๋ผ๋ฒจ์„ pair ๋กœ ๋งŒ๋“ค์–ด ๋นˆ๋„ ์ง‘๊ณ„

-

-

-

6. ํ•ด๋‹น ๋นˆ๋„ ๊ธฐ๋ฐ˜์˜ ๋ฆฌ๋ทฐ ์ œ๊ณต (์ €์ž & ์—์„ธ์ด๋ฆฌ์ŠคํŠธ ์ˆ˜์ง‘)

์นผ๋Ÿผ ์ˆ˜์ง‘

(์€์œ , ์ •์ดํ˜„, ๋“€๋‚˜, ์ด ๊ฑด)

-

selenium / request / BeutifulSoup

3. ์„œ๋น„์Šค ์‚ฌ์šฉ ํ”„๋กœ์„ธ์Šคยถ

1. ๊ธ€์“ฐ๊ธฐยถ

  • ๋„ค์ด๋ฒ„ ์‚ฌ์ „์œผ๋กœ๋ถ€ํ„ฐ ๋ฐ›์€ response ๋ฅผ parsing

  • ์œ ์ € ๋‹จ์–ด ์ž…๋ ฅ โ†’ ์‚ฌ์ „ ์† ์œ ์‚ฌ๋‹จ์–ด ๋ฆฌ์ŠคํŠธ ๋ฐ˜ํ™˜

  • ์‚ฌ์ „ ์† ์œ ์‚ฌ๋‹จ์–ด ์ž…๋ ฅ โ†’ ์‚ฌ์ „ ์† ์œ ์‚ฌ๋œป ๋ฆฌ์ŠคํŠธ ๋ฐ˜ํ™˜

2.๊ธ€ ๋ถ„์„ํ•˜๊ธฐยถ

  • QA๋ชจ๋ธ ํ™œ์šฉํ•ด ๋ฌธ์žฅ โ†’ ๊ฐ์ • ๊ตฌ

  • SentenceTransformer ํ™œ์šฉํ•ด ๊ฐ์ • ๊ตฌ โ†’ ์ž„๋ฒ ๋”ฉ

  • ๋ถ„๋ฅ˜๋ชจ๋ธ ํ™œ์šฉํ•ด (์ž„๋ฒ ๋”ฉ - ๋ผ๋ฒจ) ํ•™์Šต

  • roberta ํ™œ์šฉํ•ด ๋ช…์‚ฌ,ํ˜•์šฉ์‚ฌ ์ถ”์ถœ
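A minimal orchestration sketch of the first three steps, reusing the model names tested below; the glue code itself is an assumption, not the project source.

from transformers import pipeline
from sentence_transformers import SentenceTransformer

qa = pipeline('question-answering',
              model='monologg/koelectra-base-v2-finetuned-korquad')
encoder = SentenceTransformer('jhgan/ko-sroberta-multitask')

def analyze_sentence(sentence, classifier):
    # 1) sentence -> emotion phrase via extractive QA
    phrase = qa(question='what is the person feeling?', context=sentence)['answer']
    # 2) emotion phrase -> sentence embedding
    embedding = encoder.encode([phrase])
    # 3) embedding -> emotion label, using an already-fitted classifier
    return phrase, classifier.predict(embedding)[0]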

4. ํ…Œ์ŠคํŠธ ํžˆ์Šคํ† ๋ฆฌยถ

1. QA๋ชจ๋ธยถ

from transformers import pipeline

model_name = 'AlexKay/xlm-roberta-large-qa-multilingual-finedtuned-ru'
question = 'what is the person feeling?'
context = '슬퍼 아주 슬프고 힘들어'
question_answerer = pipeline(task='question-answering', model=model_name)
answer = question_answerer(question=question, context=context)

print(answer)

{โ€˜scoreโ€™: 0.5014625191688538, โ€˜startโ€™: 0, โ€˜endโ€™: 13, โ€˜answerโ€™: โ€˜์Šฌํผ ์•„์ฃผ ์Šฌํ”„๊ณ  ํž˜๋“ค์–ดโ€™}

  • xlm-roberta-large-qa model architecture

    • xlm: cross-lingual language model

      Reference 1

      • BERT pretrained with a multilingual objective is called a cross-lingual language model (XLM)

      • XLM is pretrained on both monolingual and parallel datasets

      • A parallel dataset consists of language-pair texts (the same content in two different languages)

      • Uses BPE with a vocabulary shared across all languages

      • Pretraining strategies (see the masking sketch after this list)

        • Causal language modeling (CLM): predict the probability of the current word given the preceding words

        • Masked language modeling (MLM): randomly mask 15% of the tokens, then predict the masked tokens (80% replaced with [MASK], 10% replaced with a random word, 10% left unchanged)

        • Translation language modeling (TLM): uses parallel cross-lingual data consisting of the same text in two different languages

      • XLM-RoBERTa: to cover low-resource languages where parallel datasets are hard to obtain, it is trained with MLM only and does not use TLM
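A toy illustration of the 80/10/10 MLM masking rule above (an exposition sketch, not actual training code):

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    out, labels = list(tokens), [None] * len(tokens)   # None = not a prediction target
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:                # select ~15% of positions
            labels[i] = tok                            # model must recover this token
            r = random.random()
            if r < 0.8:
                out[i] = '[MASK]'                      # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = random.choice(vocab)          # 10%: replace with a random word
            # remaining 10%: leave the token unchanged
    return out, labels

print(mask_tokens('나는 오늘 아주 슬프고 힘들어'.split(), vocab=['기쁘고', '화나고']))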

    • qa ๋ชจ๋ธ

      ์ฐธ๊ณ ์ž๋ฃŒ ๋…ธํŠธ๋ถ

      • ํ•™์Šต ์‹œ QG(question generation) ๊ณผ QA(Question Answer) ๋ถ€๋ถ„์œผ๋กœ ๋‚˜๋‰จ

    • config.json

      {
        "_name_or_path": "AlexKay/xlm-roberta-large-qa-multilingual-finedtuned-ru",
        "architectures": [
          "XLMRobertaForQuestionAnswering"
        ],
        "attention_probs_dropout_prob": 0.1,
        "bos_token_id": 0,
        "eos_token_id": 2,
        "gradient_checkpointing": false,
        "hidden_act": "gelu",
        "hidden_dropout_prob": 0.1,
        "hidden_size": 1024,
        "initializer_range": 0.02,
        "intermediate_size": 4096,
        "language": "english",
        "layer_norm_eps": 1e-05,
        "max_position_embeddings": 514,
        "model_type": "xlm-roberta",
        "name": "XLMRoberta",
        "num_attention_heads": 16,
        "num_hidden_layers": 24,
        "output_past": true,
        "pad_token_id": 1,
        "position_embedding_type": "absolute",
        "transformers_version": "4.6.1",
        "type_vocab_size": 1,
        "use_cache": true,
        "vocab_size": 250002
      }
      
    • RoBERTa

      • ํŠธ๋žœ์Šคํฌ๋จธ ๋ชจ๋ธ: RoBERTa๋Š” BERT์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ํŠธ๋žœ์Šคํฌ๋จธ ๋ชจ๋ธ

      • ์–‘๋ฐฉํ–ฅ ์ปจํ…์ŠคํŠธ: RoBERTa๋Š” ๋ฌธ์žฅ์˜ ์–‘๋ฐฉํ–ฅ ์ปจํ…์ŠคํŠธ๋ฅผ ๊ณ ๋ ค

      • ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹๊ณผ ๊ธด ํŠธ๋ ˆ์ด๋‹: RoBERTa๋Š” BERT๋ณด๋‹ค ๋” ๋งŽ์€ ๋ฐ์ดํ„ฐ์™€ ๋” ๊ธด ํŠธ๋ ˆ์ด๋‹ ์‹œ๊ฐ„์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ํ›ˆ๋ จ

      • BERT์˜ ํŠธ๋ ˆ์ด๋‹ ๊ณผ์ •์— ํฌํ•จ๋œ NSP ํƒœ์Šคํฌ๋ฅผ RoBERTa๋Š” ์ œ๊ฑฐ


  • ํ•™์Šต ๋ฐ์ดํ„ฐ์…‹

    • Fine tuned on English and Russian QA datasets

model_name = 'monologg/koelectra-base-v2-finetuned-korquad'
question = 'what is the person feeling?'
context = '슬퍼 아주 슬프고 힘들어'
question_answerer = pipeline(task='question-answering', model=model_name)
answer = question_answerer(question=question, context=context)

print(answer)

{โ€˜scoreโ€™: 0.6014181971549988, โ€˜startโ€™: 6, โ€˜endโ€™: 13, โ€˜answerโ€™: โ€˜์Šฌํ”„๊ณ  ํž˜๋“ค์–ดโ€™}

  • koelectra-base model architecture


    ์ฐธ๊ณ ์ž๋ฃŒ1, ๋…ผ๋ฌธ

    • ELECTRA

      • Model released by the Google Research team in 2020

      • Efficiently Learning an Encoder that Classifies Token Replacements Accurately

      • BERT requires a large amount of compute

        • Because only 15% of a sentence is masked, only 15% of the tokens actually provide a learning signal

      • Instead of masking the input, ELECTRA corrupts it by replacing tokens with plausible alternatives sampled from a small generator network

      • Instead of training a model to predict the original identity of the corrupted tokens, a discriminator checks whether each token of the corrupted input was replaced by a generator sample

        • So the objective differs: distinguishing original vs. replaced tokens rather than recovering the original token

    ⇒ Achieves performance comparable to RoBERTa while using less than 1/4 of the compute (a toy illustration follows)

  • ํ•™์Šต ๋ฐ์ดํ„ฐ์…‹ : ์ฐธ๊ณ ๋งํฌ

    • SKT์˜ KoBERT

    • TwoBlock AI์˜ HanBERT

    • ETRI์˜ KorBERT

    โ†’ ํ•œ์ž, ์ผ๋ถ€ ํŠน์ˆ˜๋ฌธ์ž ์ œ๊ฑฐ / ํ•œ๊ตญ์–ด ๋ฌธ์žฅ ๋ถ„๋ฆฌ๊ธฐ (kss) ์‚ฌ์šฉ / ๋‰ด์Šค ๊ด€๋ จ ๋ฌธ์žฅ์€ ์ œ๊ฑฐ (๋ฌด๋‹จ์ „์žฌ, (์„œ์šธ=๋‰ด์Šค1) ๋“ฑ ํฌํ•จ๋˜๋ฉด ๋ฌด์กฐ๊ฑด ์ œ์™ธ)

  • ์ตœ์ข… ๊ฒฐ๊ณผ

| input sentence | score | start | end | answer |
|---|---|---|---|---|
| 일은 왜 해도 해도 끝이 없을까? 화가 난다. | 0.9913754463195801 | 19 | 24 | 화가 난다 |
| 이번 달에 또 급여가 깎였어! 물가는 오르는데 월급만 자꾸 깎이니까 너무 화가 나. | 0.5683395862579346 | 41 | 45 | 화가 나 |
| 회사에 신입이 들어왔는데 말투가 거슬려. 그런 애를 매일 봐야 한다고 생각하니까 스트레스 받아. | 0.9996705651283264 | 45 | 49 | 스트레스 |
| 직장에서 막내라는 이유로 나에게만 온갖 심부름을 시켜. 일도 많은 데 정말 분하고 섭섭해. | 0.8939215540885925 | 42 | 49 | 분하고 섭섭해 |
| 얼마 전 입사한 신입사원이 나를 무시하는 것 같아서 너무 화가 나. | 0.5234862565994263 | 32 | 34 | 화가 |
| 직장에 다니고 있지만 시간만 버리는 거 같아. 진지하게 진로에 대한 고민이 생겨. | 0.9997361898422241 | 31 | 41 | 진로에 대한 고민이 |
| 성인인데도 진로를 아직도 못 정했다고 부모님이 노여워하셔. 나도 섭섭해. | 0.9988294839859009 | 36 | 39 | 섭섭해 |
| 퇴사한 지 얼마 안 됐지만 천천히 직장을 구해보려고. | 0.5484525561332703 | 19 | 28 | 직장을 구해보려고 |
| 졸업반이라서 취업을 생각해야 하는데 지금 너무 느긋해서 이래도 되나 싶어. | 0.9842100739479065 | 7 | 15 | 취업을 생각해야 |
| 요즘 직장생활이 너무 편하고 좋은 것 같아! | 0.1027943417429924 | 3 | 8 | 직장생활이 |
| 취업해야 할 나이인데 취업하고 싶지가 않아. | 0.10440643876791 | 7 | 11 | 나이인데 |
| 면접에서 부모님 직업에 대한 질문이 들어왔어. | 0.9965717792510986 | 5 | 12 | 부모님 직업에 |
| 큰일이야. 부장님께 결재받아야 하는 서류가 사라졌어. 한 시간 뒤에 제출해야 하는데 어디로 갔지? | 0.07094824314117432 | 0 | 5 | 큰일이야. |
| 나 얼마 전에 면접 본 회사에서 면접 합격했다고 연락받았었는데 오늘 다시 입사 취소 통보받아서 당혹스러워. | 0.998587429523468 | 53 | 58 | 당혹스러워 |
| 길을 가다가 우연히 마주친 동네 아주머니께서 취업했냐고 물어보셔서 당황했어. | 0.9999895095825195 | 37 | 41 | 당황했어 |
| 어제 합격 통보를 받은 회사에서 문자를 잘못 발송했다고 연락이 왔어. 너무 당혹스럽고 속상해. | 0.8316713571548462 | 42 | 51 | 당혹스럽고 속상해 |
| 나 오늘 첫 출근 했는데 너무 당황스러웠어! | 0.9923190474510193 | 17 | 23 | 당황스러웠어 |
| 이번에 직장을 이직했는데 글쎄 만나고 싶지 않은 사람을 만나서 아주 당황스럽더라고. | 0.4635336995124817 | 38 | 45 | 당황스럽더라고 |

2. ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธยถ

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
tokenizer.encode('당혹스럽고 속상해')

  • Encoded token IDs (10 values)

    ([101, 9067, 119438, 12605, 118867, 11664, 9449, 14871, 14523, 102])

  • tokenizer


์ฐธ๊ณ ์ž๋ฃŒ

from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('jhgan/ko-sroberta-multitask')
sentences = ['๋‹นํ˜น์Šค๋Ÿฝ๊ณ  ์†์ƒํ•ด',]
embeddings = encoder.encode(sentences)
print(embeddings)
  • ์ž„๋ฒ ๋”ฉ ๊ฐ’ (1, 768) (์ค‘๋žต)

    [[-0.8137736 -0.37767226 โ€ฆ -0.4278595 -0.4228025 ]]

  • SentenceTransformer reference


SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
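Usage sketch: the same encoder can score how close two emotion phrases are via cosine similarity (illustrative phrases, not from the project's dataset):

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer('jhgan/ko-sroberta-multitask')
a, b = encoder.encode(['당혹스럽고 속상해', '당황스러웠어'])
print(util.cos_sim(a, b))   # higher = closer in the embedding space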

3. ๋ถ„๋ฅ˜๋ชจ๋ธยถ

  • ํ‰๊ฐ€๊ธฐ์ค€ accuracy (from sklearn.metrics import accuracy_score)

  • baseline : 0.17 (label 6๊ฐœ ์ค‘ 1๊ฐœ ์ž„์˜ ์„ ํƒ๋  ๋น„์œจ, 1/6)

#1์ฐจ
class BertClassifier(nn.Module):

  def __init__(self, dropout = 0.3):
    super(BertClassifier, self).__init__()

    self.bert= BertModel.from_pretrained('bert-base-multilingual-cased')
    self.dropout = nn.Dropout(dropout)
    self.linear = nn.Linear(768, 6)
    self.relu = nn.ReLU()

  def forward(self, input_id, mask):
    _, pooled_output = self.bert(input_ids = input_id, attention_mask = mask, return_dict = False)
    dropout_output = self.dropout(pooled_output)
    linear_output = self.linear(dropout_output)
    final_layer= self.relu(linear_output)

    return final_layer

# 2์ฐจ
model = AutoModelForSequenceClassification.from_pretrained('bert-base-multilingual-cased',  )

accuracy 55.7%

  • bert ๋ชจ๋ธ

    • ๋‹ค์ค‘ ์–ธ์–ด ๋ชจ๋ธ์— dropout / linear / relu ๋ฅผ ์ถ”๊ฐ€ํ•œ ํ•จ์ˆ˜

    • epoch = 2 : ์ ์ • ์ˆ˜์ค€ (train / test accuracy 0.55~0.57)

    • epoch = 10 : ๊ณผ์ ํ•ฉ (train accuracy 0.98 / test accuracy = 0.56)

      โ†’ epoch = 2 ์—์„œ ํ•™์Šตํ•œ ์ˆ˜์ค€๊ณผ epoch 10์—์„œ ํ•™์Šตํ•œ ๋ฐ์ดํ„ฐ ํŒจํ„ด์ด ํฌ๊ฒŒ ๋‹ค๋ฅด์ง€ ์•Š์Œ


from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X_train_features, X_test_features, X_train_label, X_test_label = train_test_split(
    X, y, test_size=0.2, stratify=y)

# *_top50: the same features reduced to the 50 strongest ones
# (the feature-selection step is not shown in the original)
model = RandomForestClassifier()
model.fit(X_train_features_top50, X_train_label)
prediction = model.predict(X_test_features_top50)
score = accuracy_score(X_test_label, prediction)
print('top50 accuracy: ', score)  # accuracy dropped further

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import GridSearchCV
from tqdm import tqdm

models = [
    RandomForestClassifier(),
    LogisticRegression(max_iter = 5000),
    SVC()
]

grid_searches = []
for model in models:
  grid_search = GridSearchCV(model, param_grid = {}, cv = 5)
  grid_searches.append(grid_search)

for grid_search in tqdm(grid_searches):
  grid_search.fit(X_train_feature, X_train_label)

best_models = []
for grid_search in grid_searches:
  best_model = grid_search.best_estimator_
  best_models.append(best_model)  # fixed: was best_model.append(best_model)

# VotingClassifier expects (name, estimator) pairs
ensemble_model = VotingClassifier([(type(m).__name__, m) for m in best_models])
ensemble_model.fit(X_train_feature, X_train_label)
predictions = ensemble_model.predict(X_test_feature)
accuracy = accuracy_score(X_test_label, predictions)

accuracy 60.3%


4. ๋ช…์‚ฌ/ํ˜•์šฉ์‚ฌ ์ถ”์ถœยถ

from konlpy.tag import Okt

okt = Okt()

def get_noun(text):
  noun_list = [k for k, v  in okt.pos(text) if (v == 'Noun' and len(k) > 1)]
  return noun_list
def get_adj(text):
  adj_list = [k for k, v  in okt.pos(text) if (v == 'Adjective') and (len(k) > 1)]
  return adj_list
def get_verb(text):
  verb_list = [k for k, v  in okt.pos(text) if (v == 'Verb') and (len(k) > 1)]
  return verb_list

text = '์–ด์ œ ํ•ฉ๊ฒฉ ํ†ต๋ณด๋ฅผ ๋ฐ›์€ ํšŒ์‚ฌ์—์„œ ๋ฌธ์ž๋ฅผ ์ž˜๋ชป ๋ฐœ์†กํ–ˆ๋‹ค๊ณ  ์—ฐ๋ฝ์ด ์™”์–ด. ๋„ˆ๋ฌด ๋‹นํ˜น์Šค๋Ÿฝ๊ณ  ์†์ƒํ•ด.'

get_noun(text)
get_adj(text)
get_verb(text)

get_noun: ['์–ด์ œ', 'ํ•ฉ๊ฒฉ', 'ํ†ต๋ณด', 'ํšŒ์‚ฌ', '๋ฌธ์ž', '์ž˜๋ชป', '๋ฐœ์†ก', '์—ฐ๋ฝ']

get_adj: ['๋‹นํ˜น์Šค๋Ÿฝ๊ณ ', '์†์ƒํ•ด']

get_verb: ['๋ฐ›์€', 'ํ–ˆ๋‹ค๊ณ ', '์™”์–ด']

from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TokenClassificationPipeline)

tokenizer = AutoTokenizer.from_pretrained("KoichiYasuoka/roberta-large-korean-upos")
posmodel = AutoModelForTokenClassification.from_pretrained("KoichiYasuoka/roberta-large-korean-upos")

# Universal POS tagging pipeline; 'simple' aggregation merges subword pieces
pipeline = TokenClassificationPipeline(tokenizer=tokenizer,
                                       model=posmodel,
                                       aggregation_strategy="simple",
                                       task='token-classification')
nlp = lambda x: [(x[t["start"]:t["end"]], t["entity_group"]) for t in pipeline(x)]
nlp(text)

# result
[('์–ด์ œ ํ•ฉ๊ฒฉ', 'NOUN'),
 ('ํ†ต๋ณด๋ฅผ', 'NOUN'),
 ('๋ฐ›์€', 'VERB'),
 ('ํšŒ์‚ฌ์—์„œ', 'ADV'),
 ('๋ฌธ์ž๋ฅผ', 'NOUN'),
 ('์ž˜๋ชป', 'ADV'),
 ('๋ฐœ์†กํ–ˆ๋‹ค๊ณ ', 'VERB'),
 ('์—ฐ๋ฝ์ด', 'NOUN'),
 ('์™”์–ด', 'VERB'),
 ('.', 'PUNCT'),
 ('๋„ˆ๋ฌด', 'ADV'),
 ('๋‹นํ˜น์Šค๋Ÿฝ๊ณ ', 'CCONJ'),
 ('์†์ƒํ•ด', 'VERB'),
 ('.', 'PUNCT')]

4. ๋ถ„์„ ๋ฆฌํฌํŠธยถ

1. ์ €์ž ๋ถ„์„ ๊ฐ€์ด๋“œ๋ผ์ธยถ

  • ์ด ๋ฐœํ–‰ ๊ธ€ ์ˆ˜

    • ๊ธ€ ๋‹น ๋ฌธ์žฅ ์ˆ˜

      1. ๊ธ€ ์† ๋ฌธ์žฅ ๊ฐฏ์ˆ˜

      2. ๊ธ€ ์† ๋‹จ์–ด ๊ฐฏ์ˆ˜

        ๋ช…์‚ฌ ์ˆ˜ / ํ˜•์šฉ์‚ฌ ์ˆ˜

      3. ๋ฌธ์žฅ ๊ธฐ์ค€ ์ตœ๊ณ  ๊ฐ์ •

      4. ๋‹จ์–ด ๊ธฐ์ค€ ์ตœ๊ณ  ๊ฐ์ •

      5. ๋‹จ์–ด ๋ณ„ ๊ฐ์ • ๊ฐฏ์ˆ˜

    • ๊ธ€ ๋ณ„ ๊ฐ์ • / ๋‹จ์–ด (๋ช…์‚ฌ, ํ˜•์šฉ์‚ฌ)

      • ๊ฐ์ • 1๊ฑด๋‹น ๋‹จ์–ด ์œ ๋‹ˆํฌ ์ˆ˜ : ๊ฐ€์žฅ ๋‹ค์ฑ„๋กœ์šด ๋‹จ์–ด๋ฅผ ์‚ฌ์šฉํ•œ ๊ฐ์ •์€ ?

      • ๊ฐ์ • 1๊ฑด ๋‹น ์ตœ๋‹ค ๋‹จ์–ด : ๊ทธ ๊ฐ์ •์„ ๋Œ€ํ‘œํ•˜๋Š” ๋‹จ์–ด๋Š”? / ์–ด๋–ค ๋‹จ์–ด๋ฅผ ์“ธ ๋•Œ ๊ทธ ๊ฐ์ •์ด ๋งŽ์ด ์˜ฌ๋ผ์™”์„๊นŒ?

      • ๋‹จ์–ด 1๊ฑด๋‹น ์œ ๋‹ˆํฌ ๊ฐ์ • : ๊ฐ€์žฅ ๋ณต์žกํ•œ ๊ฐ์ •์„ ๋งŒ๋“ค์–ด๋‚ธ ๋‹จ์–ด๋Š”?

      • ๋‹จ์–ด 1๊ฑด๋‹น ์ตœ๋‹ค ๊ฐ์ • : ๊ทธ ๋‹จ์–ด๋ฅผ ์“ธ ๋•Œ ์–ด๋–ค ๊ฐ์ •์ด ๋งŽ์ด ์˜ฌ๋ผ์™”์„๊นŒ?

2. ์˜ˆ์‹œยถ

  • ์—์„ธ์ด์ŠคํŠธ <์€์œ > / ์ด 17๊ฐœ์˜ ์—์„ธ์ด ์ˆ˜์ง‘ (์ค‘๋žต)

    # ์ œ๋ชฉ : ์‚ฌ๋ž‘์— ๋น ์ง€์ง€ ์•Š๋Š” ํ•œ ์‚ฌ๋ž‘์€ ์—†๋‹ค
    
    ์˜ํ™” <๋‚˜์˜ ์‚ฌ๋ž‘, ๊ทธ๋ฆฌ์Šค>์˜ ํ•œ ์žฅ๋ฉด
    
    ํ•œ ์‚ฌ๋žŒ์—๊ฒŒ ๋‹ค๊ฐ€์˜ค๋Š” ์‚ฌ๋ž‘์˜ ๊ธฐํšŒ์— ๊ด€์‹ฌ์ด ๋งŽ๋‹ค....
    "์‚ฌ๋ž‘์— ๋น ์ง€์ง€ ์•Š๋Š” ํ•œ ์‚ฌ๋ž‘์€ ์—†๋‹ค. "(151์ชฝ) ์‚ฌ๋ž‘์€ ํŠน๋ณ„ํ•œ ์ง€์‹์ด๋‚˜ ๊ธฐ์ˆ ์ด ํ•„์š”์น˜ ์•Š๋‹ค๋Š” ์ ์—์„œ ์‰ฝ๊ณ , ์ž๊ธฐ๋ฅผ ๋‚ด๋ ค๋†“์•„์•ผ ํ•œ๋‹ค๋Š” ์ ์—์„œ ์–ด๋ ต๋‹ค.
    ๊ทธ๋Ÿฌ๋‹ˆ ์‚ฌ๋ž‘์„ ์–ผ๋งˆ๋‚˜ ํ•ด๋ณด์•˜๋А๋ƒ๋Š” ์งˆ๋ฌธ์€ ์ด๋ ‡๊ฒŒ ๋ฐ”๊ฟ€ ์ˆ˜๋„ ์žˆ๋‹ค. ๋‹น์‹ ์€ ๋‹ค๋ฅธ ์กด์žฌ๊ฐ€ ๋˜์–ด๋ณด์•˜๋А๋ƒ. ์™œ ์‚ฌ๋ž‘์ด ํ•„์š”ํ•˜๋ƒ๊ณ  ๋ฌป๋Š”๋‹ค๋ฉด,
    ๋น„ํ™œ์„ฑํ™”๋œ ์ž์•„์˜ ํ™œ์„ฑํ™”๊ฐ€ ์•”์šธํ•œ ํ˜„์‹ค์— ์ˆจ๊ตฌ๋ฉ์„ ์—ด์–ด์ฃผ๊ธฐ ๋•Œ๋ฌธ์ด๋ผ๊ณ  ๋‹ตํ•˜๊ฒ ๋‹ค. ์กด์žฌ์˜ ๋“ฑ์ด ์ผœ์ง€๋Š” ์ˆœ๊ฐ„ ์‚ฌ๋ž‘์€ ์†์‚ญ์ธ๋‹ค. โ€œ์‚ถ์„ ๋ถ™๋“ค๊ณ  ์ตœ์„ ์„ ๋‹คํ•ด์š”. โ€(123์ชฝ)
    
    
  • ์ด ๋ฐœํ–‰ ๊ธ€ ์ˆ˜ : 17๊ฑด

  • 70.35 ๋ฌธ์žฅ / 1๊ธ€

  • ๊ธ€ ์† ๋ฌธ์žฅ ๊ฐœ์ˆ˜

| title | sentence count |
|---|---|
| ‘불쌍한 아이’ 만드는 ‘이상한 어른들’ | 53 |
| 글쓰기는 나와 친해지는 일 | 62 |
| 나를 아프게 하는 착한 사람들 | 65 |
| 다정한 얼굴을 완성하는 법 | 65 |
| 딸에 대하여, 실은 엄마에 대하여 | 68 |
| 마침내 사는 법을 배우다 | 64 |
| 만국의 싱글 레이디스여, 버텨주오! | 69 |
| 문명의 편리가 누군가의 죽음에 빚지고 있음을 | 87 |
| 사랑에 빠지지 않는 한 사랑은 없다 | 63 |
| 성폭력 가해자에게 편지를 보냈다 | 83 |
| 슬픔을 공부해야 하는 이유 | 70 |
| 알려주지 않으면 그 이유를 모르시겠어요? | 67 |
| 우리는 왜 살수록 빚쟁이가 되는가 | 73 |
| 울더라도 정확하게 말하는 것 | 77 |
| 인공자궁을 생각함 | 79 |
| 친구 같은 엄마와 딸이라는 환상 | 80 |
| 하찮은 만남들에 대한 예의 | 88 |

  • ๊ธ€ ์† ๋‹จ์–ด ๊ฐœ์ˆ˜


  • ๋ฌธ์žฅ ๊ธฐ์ค€ ์ตœ๊ณ  ๊ฐ์ •

    • ์—์„ธ์ด๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” ๋ฌธ์žฅ์˜ ๊ฐ์ • ๋ผ๋ฒจ์„ ์ง‘๊ณ„

    • โ€˜๊ธ€์“ฐ๊ธฐ๋Š” ๋‚˜์™€ ์นœํ•ด์ง€๋Š”์ผโ€™ ์ด๋ผ๋Š” ์—์„ธ์ด์—์„œ๋Š” [๋ถˆ์•ˆ] ์ด ๊ฐ€์žฅ ๋†’์œผ๋ฉฐ [๊ธฐ์จ] ๊ณผ [๋ถ„๋…ธ] ๊ฐ€ ๊ทธ ๋‹ค์Œ ๊ฐ

    โ†’ ๋ฌธ์žฅ ๋ถ„๋ฅ˜ ๋ชจ๋ธ์˜ ์ •ํ™•๋„ ์ƒ์Šน์‹œ, ํ•ด๋‹น ๋ฐฉ์‹์œผ๋กœ ์—์„ธ์ด ๋ณ„ โ€˜์ฃผ์š” ๊ฐ์ •โ€™ ์„ ์‰ฝ๊ฒŒ ๊ตฌ๋ถ„ ๋ฐ ๋น„๊ตํ•  ์ˆ˜ ์žˆ์Œ


  • ๋‹จ์–ด ๊ธฐ์ค€ ์ตœ๊ณ  ๊ฐ์ • / ๋‹จ์–ด ๋ณ„ ๊ฐ์ • ๊ฐœ์ˆ˜

    • ๋ช…์‚ฌ์™€ ํ˜•์šฉ์‚ฌ๋ฅผ ํ•ฉ์ณค์„ ๊ฒฝ์šฐ, ์•„๋‹ˆ๋ผ/์—†๋Š” ๋“ฑ์˜ ๋‹จ์–ด๋“ค์ด ์ƒ์œ„์— ์œ„์น˜

    • ๋ช…์‚ฌ๋งŒ ์ถ”์ถœ ์‹œ, 2๊ฐœ ์ด์ƒ์˜ ๊ฐ์ •์ด ๋‹ด๊ธด ๋‹จ์–ด๋Š” ์ฐพ์ง€ ๋ชปํ•จ

    โ†’ ๋” ๋งŽ์€ ์—์„ธ์ด๋ฅผ ์ˆ˜์ง‘ / lemmatized ๋œ ๋‹จ์–ด๋ฅผ ์‚ฌ์šฉํ•ด ์˜๋ฏธ ๊ธฐ์ค€์œผ๋กœ ์žฌ๊ตฌ์„ฑ์ด ๊ฐ€๋Šฅํ•ด๋ณด์ž„

[๋ช…์‚ฌ + ํ˜•์šฉ์‚ฌ]


  • ๊ฐ€์žฅ ๋‹ค์ฑ„๋กœ์šด ๋‹จ์–ด๋ฅผ ์‚ฌ์šฉํ•œ ๊ฐ์ •์€ ? (๊ฐ์ • ๋ณ„ ๋ฌธ์žฅ 1๊ฑด๋‹น ๋‹จ์–ด)

    • [๋ถˆ์•ˆ] ๋‹จ์–ด ์ข…๋ฅ˜๊ฐ€ 954๊ฑด์œผ๋กœ ๊ฐ€์žฅ ๋งŽ์Œ

    • ๋ฌธ์žฅ ์ˆ˜๋กœ ๋‚˜๋ˆ ๋ณด์•˜์„ ๋•Œ, [๋‹นํ™ฉ] ์˜ ๊ฐ์ •์ด 3.24๊ฑด์œผ๋กœ ๋ฌธ์žฅ 1๊ฑด์—์„œ 3๊ฐœ ์ด์ƒ์˜ ๋‹จ์–ด๋“ค์ด ์ถ”์ถœ๋˜์–ด ๋งคํ•‘

| emotion | vocab_cnt | sentence_cnt | vocab_per_sentence |
|---|---|---|---|
| 불안 (anxiety) | 954 | 327 | 2.917431 |
| 슬픔 (sadness) | 681 | 245 | 2.779592 |
| 분노 (anger) | 736 | 260 | 2.830769 |
| 기쁨 (joy) | 541 | 197 | 2.746193 |
| 당황 (embarrassment) | 318 | 98 | 3.244898 |
| 상처 (hurt) | 272 | 86 | 3.162791 |

  • ๊ทธ ๊ฐ์ •์„ ๋Œ€ํ‘œํ•˜๋Š” ๋‹จ์–ด๋Š”? / ์–ด๋–ค ๋‹จ์–ด๋ฅผ ์“ธ ๋•Œ ๊ทธ ๊ฐ์ •์ด ๋งŽ์ด ์˜ฌ๋ผ์™”์„๊นŒ? (๊ฐ์ • 1๊ฑด ๋‹น ์ตœ๋‹ค ๋‹จ์–ด)

    • [๋‹นํ™ฉ] ์˜ ๊ฒฝ์šฐ ๋ถ€๋„๋Ÿฌ์šด, ์ด์ƒํ•œ ์ด๋ผ๋Š” ๋‹จ์–ด๊ฐ€ ์ƒ์œ„์— ์กด์žฌ

    • ๊ทธ ์™ธ์˜ ๊ฐ์ •์˜ ๊ฒฝ์šฐ, ๋‹จ์–ด๋กœ ํŠน์„ฑ์„ ์ฐพ๊ธฐ ์–ด๋ ค์›€

    โ†’ ๋นˆ๋„๊ฐ€ ๋†’์€ ๋‹จ์–ด๋“ค์ด ์ƒ์œ„์— ์กด์žฌํ•˜์—ฌ, ํ•ด๋‹น ๋ฐฉ์‹์œผ๋กœ ์ถ”์ถœ ๋’ค tfidf ๋“ฑ์˜ ๋นˆ๋„ ๊ธฐ๋ฐ˜ ์Šค์ฝ”์–ด๋กœ ๋‹จ์–ด ์ถ”๊ฐ€ ์ •๋ ฌ์ด ๊ฐ€๋Šฅํ•ด๋ณด์ž„


  • ๊ฐ€์žฅ ๋ณต์žกํ•œ ๊ฐ์ •์„ ๋งŒ๋“ค์–ด๋‚ธ ๋‹จ์–ด๋Š”? (๋‹จ์–ด 1๊ฑด๋‹น ์œ ๋‹ˆํฌ ๊ฐ์ •)

    • ์žˆ๋Š”, ์—†๋Š”, ์žˆ๋Š”, ์žˆ๋‹ค ๋“ฑ์ด ๊ฐ€์žฅ ๋งŽ์ด ๋“ฑ์žฅํ•˜์˜€๊ณ , ๊ทธ์— ๋”ฐ๋ฅธ ๊ฐ์ •์ข…๋ฅ˜๋„ ๊ฐ€์žฅ ๋งŽ์Œ

    • ๋‹จ์–ด ๋งฅ๋ฝ์— ๋”ฐ๋ผ ์–ด๋А ๋ถ€๋ถ„์—๋‚˜ ์‰ฝ๊ฒŒ ์ ์šฉ๋  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ํ•ด๋‹น ๊ฐ’๋“ค์ด ๋ชจ๋“  ๊ฐ์ •์„ ๊ฐ€์ง€๊ณ  ์žˆ์Œ

    โ†’ ๋ณต์žกํ•œ ๊ฐ์ •์˜ ๊ธฐ์ค€์„ ๋‹ค์‹œ ์ œ์‹œํ•˜๊ณ  (์˜ˆ: ์ƒ์ถฉ๋˜๋Š” ๊ฐ์ • ) ๊ทธ์— ๋”ฐ๋ฅธ โ€˜๋ณต์žกํ•œ ๊ฐ์ •โ€™ ๊ณผ โ€˜๋‹จ์–ดโ€™ ์กฐํ•ฉ์„ ์ฐพ์•„๋ณผ ์ˆ˜ ์žˆ์Œ


  • ๊ทธ ๋‹จ์–ด๋ฅผ ์“ธ ๋•Œ ์–ด๋–ค ๊ฐ์ •์ด ๋งŽ์ด ์˜ฌ๋ผ์™”์„๊นŒ? (๋‹จ์–ด 1๊ฑด๋‹น ์ตœ๋‹ค ๊ฐ์ •)

    • [์ข‹์•„ํ•˜๋Š”] ๋‹จ์–ด์˜ ๊ฒฝ์šฐ, ๋ถˆ์•ˆ :2 , ๊ธฐ์จ : 1 , ์ƒ์ฒ˜ : 1

    • [๋ถ€๋„๋Ÿฌ์šด] ๋‹จ์–ด์˜ ๊ฒฝ์šฐ, ๋‹นํ™ฉ : 4

    • [๋‚˜์•ฝํ•œ] ๋‹จ์–ด์˜ ๊ฒฝ์šฐ ๊ธฐ์จ : 1
