Backup of huggingface/trl SFTTrainer (No. 2)

text = f"以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい\n\n### 指示:\n{example['instruction'][i]}\n\n### 応答:\n{example['output'][i]}<|endoftext|>"

English version, with question, context, and answer fields.
Note that the context is intended to hold reference information, such as chunks retrieved for RAG.

text = f"Please answer the question based on the given context.\n\n### question\n{example['question'][i]}\n\n### context\n{example['context'][i]}\n\n### answer\n{example['answer'][i]}<|endoftext|>"

Define the prompt-formatting function passed to the formatting_func argument of SFTTrainer (it appends the eos_token string to the end of each example).

print(tokenizer.eos_token)
# '<|endoftext|>'

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい\n\n### 指示:\n{example['instruction'][i]}\n\n### 応答:\n{example['output'][i]}<|endoftext|>"
        output_texts.append(text)
    return output_texts
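
As a quick check, the function can be exercised on a hand-made batch. The sketch below is a variant that takes the EOS string as a parameter instead of hardcoding it. The name `build_prompts`, the `eos_token` parameter, and the sample rows are illustrative assumptions, not part of the original.

```python
def build_prompts(example, eos_token="<|endoftext|>"):
    """Hypothetical variant of formatting_prompts_func with a configurable EOS string."""
    output_texts = []
    # `example` is a batch: each key maps to a list of column values.
    for instruction, output in zip(example["instruction"], example["output"]):
        text = (
            "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい"
            f"\n\n### 指示:\n{instruction}\n\n### 応答:\n{output}{eos_token}"
        )
        output_texts.append(text)
    return output_texts


# Made-up rows in the batched (dict of lists) layout.
batch = {
    "instruction": ["1+1は?", "日本の首都は?"],
    "output": ["2です。", "東京です。"],
}
prompts = build_prompts(batch)
print(prompts[0])
```

In practice `tokenizer.eos_token` could be passed as `eos_token`, so the template keeps working if the tokenizer changes.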

Loss computation

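Training uses prompts that contain both the instruction and the response, but there are two ways to compute the loss.
- Either every token is a loss target (the whole sequence is learned as the correct output), or only the response portion is a loss target (the instruction is supplied purely as conditioning).
- The latter focuses training on the model's ability to understand an instruction and generate a response, so it is considered more effective, and computing the loss on the response portion only is common in much prior work.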
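
The difference between the two options can be sketched in plain Python. This is a toy illustration under stated assumptions, not the trainer's implementation: the per-token probabilities and token ids are made up, and `mean_nll` is a hypothetical helper that averages the negative log-likelihood only over positions whose label is not -100.

```python
import math

IGNORE_INDEX = -100  # label value that is skipped by the loss


def mean_nll(token_probs, labels):
    """Average negative log-likelihood over positions whose label is not IGNORE_INDEX.

    token_probs[i] is the probability the model assigned to the correct token at position i.
    """
    kept = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Toy sequence: 3 instruction tokens followed by 2 response tokens (made-up numbers).
token_probs = [0.9, 0.8, 0.9, 0.5, 0.25]
full_labels = [11, 12, 13, 21, 22]                    # every token is a target
response_only_labels = [IGNORE_INDEX] * 3 + [21, 22]  # instruction tokens masked out

loss_all = mean_nll(token_probs, full_labels)
loss_response = mean_nll(token_probs, response_only_labels)
print(f"all tokens: {loss_all:.4f}, response only: {loss_response:.4f}")
```

With full labels, the easy-to-predict instruction tokens dilute the average; with response-only labels, the loss reflects the response tokens alone.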

Define the DataCollatorForCompletionOnlyLM passed to the data_collator argument of SFTTrainer (using it requires packing=False).

template

response_template is required.

response_template = "### 応答:\n" # for the English format: "### answer\n"

instruction_template is needed for multi-turn dialogue data (it is not needed for single-turn question-and-answer data).
```
instruction_template = "### 指示:\n" # for the English format: "### question\n"
```

Only the tokens that follow response_template are kept in labels.
(All other positions get the label -100, which PyTorch's CrossEntropyLoss ignores.)

from trl import DataCollatorForCompletionOnlyLM

# response_template is required
response_template = "### 応答:\n"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
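
What the labels look like after masking can be sketched on plain lists of token ids. This is a simplified illustration, not TRL's implementation: `mask_before_response` is a hypothetical helper, the token ids are made up, and the sketch masks up to the first occurrence of the template.

```python
IGNORE_INDEX = -100  # label value ignored by the loss


def mask_before_response(input_ids, response_ids):
    """Return labels equal to input_ids, with everything up to and including
    the first occurrence of response_ids replaced by IGNORE_INDEX."""
    n = len(response_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_ids:
            end = start + n
            return [IGNORE_INDEX] * end + input_ids[end:]
    # Template not found: mask the whole example so it adds nothing to the loss.
    return [IGNORE_INDEX] * len(input_ids)


# Toy ids: [instruction tokens..., response template (90, 91), response tokens..., EOS (2)]
input_ids = [5, 6, 7, 90, 91, 8, 9, 2]
labels = mask_before_response(input_ids, response_ids=[90, 91])
print(labels)
```

Because the match is made on token ids, the template string has to tokenize the same way inside the full prompt as it does on its own. Stray spaces around the template in the formatted text are a common reason for the match to fail.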

SFTTrainer configuration

formatting_func builds the single-turn question-and-answer strings.
data_collator restricts the loss to the response portion.
packing=False means the examples are processed without packing.

from transformers import TrainingArguments
from trl import SFTTrainer

# SFTTrainer accepts TrainingArguments.
# If none is given, the TrainingArguments defaults are used.
args = TrainingArguments(
    output_dir='./output',
    num_train_epochs=2,
    gradient_accumulation_steps=8,
    per_device_train_batch_size=8,
    save_strategy="no",
    logging_steps=20,
    lr_scheduler_type="constant",
    save_total_limit=1,
    fp16=True,
)

# If data_collator is not specified, DataCollatorForLanguageModeling is used with mlm=False, as shown below.
# In other words, the model is trained as an ordinary causal LM.
# if data_collator is None:
#     data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
# With packing=False (the default), either dataset_text_field or formatting_func must be specified.
trainer = SFTTrainer(
    model,
    args=args,
    train_dataset=dolly_train_dataset,
    formatting_func=formatting_prompts_func,
    max_seq_length=1024,
    data_collator=collator,
)
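
For reference, the batch settings above imply the following effective batch size. The arithmetic below is a rough sketch: the device count and dataset size are assumptions for illustration only, and the exact step count reported by the trainer depends on how it rounds the last partial batch.

```python
import math

# Values taken from the TrainingArguments above.
per_device_train_batch_size = 8
gradient_accumulation_steps = 8
num_train_epochs = 2

# Assumptions for illustration only: a single GPU and a 15,000-example training set.
num_devices = 1
num_examples = 15_000

# Number of examples that contribute to each optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Approximate number of optimizer updates per epoch and in total.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(effective_batch_size, steps_per_epoch, total_steps)
```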