This notebook requires pinning sacrebleu to version 1.5.1; see [SacreBLEU update · Issue #2737 · huggingface/datasets](https://github.com/huggingface/datasets/issues/2737) for details.
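For reference, a minimal environment setup for this notebook could look like the cell below (the package list is inferred from the imports used later; only the sacrebleu pin comes from the issue above):

# Run once in a notebook cell
!pip install datasets transformers sentencepiece sacrebleu==1.5.1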
This task uses the WMT16 (Romanian-English) dataset.
Loading the data
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("wmt16", "ro-en")
metric = load_metric("sacrebleu")
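To take a quick look at the loaded data, you can index into a split directly (a small sketch; the second print produces the sample shown below):

# Inspect the available splits and the first training example
print(raw_datasets)
print(raw_datasets["train"][0])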
A single example looks like this:
{'translation': {'en': 'Membership of Parliament: see Minutes', 'ro': 'Componenţa Parlamentului: a se vedea procesul-verbal'}}
You can see that each English (en) sentence is paired with its Romanian (ro) translation.
As before, scores are computed with the metric.compute(predictions=fake_preds, references=fake_labels) API.
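As a quick sanity check of the metric, you can call it on made-up strings (the two sentences below are illustrative only; note that sacrebleu expects each reference to be a list, since a sentence may have several acceptable translations):

fake_preds = ["hello there", "general kenobi"]
fake_labels = [["hello there"], ["general kenobi"]]
metric.compute(predictions=fake_preds, references=fake_labels)
# identical predictions and references give a BLEU score of 100.0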
Preprocessing the data
We need to convert the input texts into tokens and then into the token IDs the model expects, using the tokenizer that matches our pretrained checkpoint:
from transformers import AutoTokenizer

# requires `sentencepiece`: pip install sentencepiece
# model_checkpoint is assumed to have been set earlier in the notebook,
# e.g. model_checkpoint = "Helsinki-NLP/opus-mt-en-ro"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
Taking the mBART model as an example, we need to set the source and target languages correctly. If you are translating a different language pair, please look up the corresponding language codes in the model's documentation. We can set the source and target languages like this:
Copy if "mbart" in model_checkpoint :
tokenizer . src_lang = "en-XX"
tokenizer . tgt_lang = "ro-RO"
Note: to prepare the translation targets for the model, we tokenize them inside the as_target_tokenizer context manager, which makes sure the targets get the special tokens they need:
with tokenizer.as_target_tokenizer():
    print(tokenizer("Hello, this one sentence!"))
    model_input = tokenizer("Hello, this one sentence!")
    tokens = tokenizer.convert_ids_to_tokens(model_input['input_ids'])
    # print the tokens to see the special tokens
    print('tokens: {}'.format(tokens))
{'input_ids': [10334, 1204, 3, 15, 8915, 27, 452, 59, 29579, 581, 23, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
tokens: ['▁Hel', 'lo', ',', '▁', 'this', '▁o', 'ne', '▁se', 'nten', 'ce', '!', '</s>']
If you are using one of the T5 checkpoints, the inputs need a special prefix. T5 uses task-specific prefixes to tell the model which task to perform; for this task the prefix looks like this:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "translate English to Romanian: "
else:
    prefix = ""
Now we can put everything together into a preprocessing function. When tokenizing we pass truncation=True so that sentences longer than the maximum length are truncated; shorter sentences are not padded here, since the data collator will later pad each batch to the length of its longest example.
Putting it all together:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "ro"

def preprocess_function(examples):
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
The preprocessing function above can handle a single example as well as a batch of examples. When given multiple examples, each returned field contains a list with one entry per example.
preprocess_function(raw_datasets['train'][:2])
{'input_ids': [[393, 4462, 14, 1137, 53, 216, 28636, 0], [24385, 14, 28636, 14, 4646, 4622, 53, 216, 28636, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[42140, 494, 1750, 53, 8, 59, 903, 3543, 9, 15202, 0], [36199, 6612, 9, 15202, 122, 568, 35788, 21549, 53, 8, 59, 903, 3543, 9, 15202, 0]]}
We use the map function to apply the preprocessing to every example in all splits of the dataset.
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
Fine-tuning the transformer model
Now that the data is ready, we can download and load the pretrained model and fine-tune it. Since this is a seq2seq task, we use the AutoModelForSeq2SeqLM class. Like the tokenizer, the from_pretrained method downloads and loads the model, and caches it so that it is not downloaded again.
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
Since our fine-tuning task is machine translation and we load a pretrained seq2seq model, no mismatched weights are discarded here, unlike in the previous question-answering task where loading the model dropped some pretrained parameters and initialized new ones.
batch_size = 16

args = Seq2SeqTrainingArguments(
    "test-translation",
    evaluation_strategy="epoch",  # run evaluation at the end of every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,  # the dataset is large and `Seq2SeqTrainer` keeps saving checkpoints, so keep at most 3 of them
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=False,
)
Finally, we need a data collator to turn the preprocessed examples into padded batches that can be fed to the model.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
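To see what the collator produces, you can run it on a couple of tokenized examples (a small sketch; the key names follow the tokenized dataset built above, and padding labels with -100 is the default behavior of DataCollatorForSeq2Seq):

# Collate two training examples into one padded batch
features = [
    {k: tokenized_datasets["train"][i][k] for k in ("input_ids", "attention_mask", "labels")}
    for i in range(2)
]
batch = data_collator(features)
print(batch["labels"])  # shorter label sequences are padded with -100 so the loss ignores those positions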
One last thing before we set up the Seq2SeqTrainer: we need to define how the model will be evaluated. We use the metric loaded earlier, and apply some post-processing to the predictions before passing them to the metric:
import numpy as np

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
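As a quick check that compute_metrics behaves as expected, you can feed it a tiny fake batch in which predictions and labels are identical (the Romanian sentence below is only an illustration; identical predictions and labels should yield a BLEU of 100):

fake_ids = np.array(tokenizer("a se vedea procesul-verbal")["input_ids"])[None, :]
compute_metrics((fake_ids, fake_ids))
# expected: {'bleu': 100.0, 'gen_len': ...}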
Finally, pass the training arguments, the data, and the model to the Seq2SeqTrainer:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
Call the train method to start fine-tuning.
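For completeness, the call itself looks like this (checkpoints and logs go to the "test-translation" output directory configured above; a full epoch on WMT16 ro-en takes a while):

trainer.train()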