
Commit b31849a

louismartin authored and facebook-github-bot committed
Camembert model and code (#904)
Summary: Checked locally that everything works fine. The model is uploaded to fbaipublicfiles. I fixed a few inconsistencies in the BPE encoding along the way, e.g. related to #1306.

Pull Request resolved: fairinternal/fairseq-py#904
Reviewed By: ngoyal2707
Differential Revision: D18418345
Pulled By: louismartin
fbshipit-source-id: 53acb4d021581968d70430ee9babee07d6573c17
1 parent a92bcda commit b31849a

File tree: 5 files changed (+89, -4 lines)


README.md

Lines changed: 1 addition & 0 deletions

@@ -6,6 +6,7 @@ modeling and other text generation tasks.
 
 ### What's New:
 
+- November 2019: [CamemBERT model and code released](examples/camembert/README.md)
 - November 2019: [BART model and code released](examples/bart/README.md)
 - November 2019: [XLM-R models and code released](examples/xlmr/README.md)
 - September 2019: [Nonautoregressive translation code released](examples/nonautoregressive_translation/README.md)

examples/camembert/README.md

Lines changed: 56 additions & 0 deletions

@@ -0,0 +1,56 @@
+# CamemBERT: a French BERT
+
+## Introduction
+
+CamemBERT is a pretrained language model based on RoBERTa, trained on 138GB of French text.
+
+## Pre-trained models
+
+Model | #params | vocab size | Download
+---|---|---|---
+`CamemBERT` | 110M | 32k | [camembert.v0.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/camembert.v0.tar.gz)
+
+## Example usage
+
+##### Load CamemBERT from torch.hub (PyTorch >= 1.1):
+```python
+import torch
+camembert = torch.hub.load('pytorch/fairseq', 'camembert.v0')
+camembert.eval()  # disable dropout (or leave in train mode to finetune)
+```
+
+##### Load CamemBERT (for PyTorch 1.0 or custom models):
+```bash
+# Download the CamemBERT model archive
+wget https://dl.fbaipublicfiles.com/fairseq/models/camembert.v0.tar.gz
+tar -xzvf camembert.v0.tar.gz
+```
+
+```python
+# Load the model in fairseq
+from fairseq.models.roberta import CamembertModel
+camembert = CamembertModel.from_pretrained('/path/to/camembert.v0')
+camembert.eval()  # disable dropout (or leave in train mode to finetune)
+```
+
+##### Filling masks:
+```python
+masked_line = 'Le camembert est <mask> :)'
+camembert.fill_mask(masked_line, topk=3)
+# [('Le camembert est délicieux :)', 0.4909118115901947, ' délicieux'),
+#  ('Le camembert est excellent :)', 0.10556942224502563, ' excellent'),
+#  ('Le camembert est succulent :)', 0.03453322499990463, ' succulent')]
+```
+
+##### Extract features from CamemBERT:
+```python
+# Extract the last layer's features
+line = "J'aime le camembert!"
+tokens = camembert.encode(line)
+last_layer_features = camembert.extract_features(tokens)
+assert last_layer_features.size() == torch.Size([1, 10, 768])
+
+# Extract all layers' features (layer 0 is the embedding layer)
+all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
+assert len(all_layers) == 13
+assert torch.all(all_layers[-1] == last_layer_features)
+```
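A side note on the shape assertions in the feature-extraction block above: `encode` wraps the sentencepiece pieces in `<s>`/`</s>`, which is where the sequence length of 10 in `torch.Size([1, 10, 768])` comes from. A small sketch, assuming this model version tokenizes the example line into 8 pieces (piece counts can vary across versions):

```python
# Sketch: the 1-D id tensor behind torch.Size([1, 10, 768]) -- assumed to be
# 8 sentencepiece pieces plus the <s> and </s> specials for this model version.
tokens = camembert.encode("J'aime le camembert!")
print(tokens.size())             # expected: torch.Size([10])
print(camembert.decode(tokens))  # expected round-trip: "J'aime le camembert!"
```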

examples/roberta/README.md

Lines changed: 2 additions & 1 deletion

@@ -8,7 +8,8 @@ RoBERTa iterates on BERT's pretraining procedure, including training the model l
 
 ### What's New:
 
-- November 2019: Multilingual encoder (XLM-RoBERTa) is available [XLM-R](https://github.com/pytorch/fairseq/examples/xlmr).
+- November 2019: French model (CamemBERT) is available: [CamemBERT](https://github.com/pytorch/fairseq/tree/master/examples/camembert).
+- November 2019: Multilingual encoder (XLM-RoBERTa) is available: [XLM-R](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).
 - September 2019: TensorFlow and TPU support via the [transformers library](https://github.com/huggingface/transformers).
 - August 2019: RoBERTa is now supported in the [pytorch-transformers library](https://github.com/huggingface/pytorch-transformers).
 - August 2019: Added [tutorial for finetuning on WinoGrande](https://github.com/pytorch/fairseq/tree/master/examples/roberta/wsc#roberta-training-on-winogrande-dataset).

fairseq/models/roberta/hub_interface.py

Lines changed: 7 additions & 3 deletions

@@ -58,7 +58,7 @@ def encode(self, sentence: str, *addl_sentences, no_separator=False) -> torch.Lo
         for s in addl_sentences:
             bpe_sentence += (' </s>' if not no_separator else '')
             bpe_sentence += ' ' + self.bpe.encode(s) + ' </s>'
-        tokens = self.task.source_dictionary.encode_line(bpe_sentence, append_eos=False)
+        tokens = self.task.source_dictionary.encode_line(bpe_sentence, append_eos=False, add_if_not_exist=False)
         return tokens.long()
 
     def decode(self, tokens: torch.LongTensor):
@@ -146,8 +146,9 @@ def fill_mask(self, masked_input: str, topk: int = 5):
             [self.bpe.encode(text_span.rstrip()) for text_span in text_spans]
         ).strip()
         tokens = self.task.source_dictionary.encode_line(
-            '<s> ' + text_spans_bpe,
-            append_eos=True,
+            '<s> ' + text_spans_bpe + ' </s>',
+            append_eos=False,
+            add_if_not_exist=False,
         )
 
         masked_index = (tokens == self.task.mask_idx).nonzero()
@@ -168,6 +169,9 @@ def fill_mask(self, masked_input: str, topk: int = 5):
         topk_filled_outputs = []
         for index, predicted_token_bpe in enumerate(topk_predicted_token_bpe.split(' ')):
             predicted_token = self.bpe.decode(predicted_token_bpe)
+            # Quick hack to fix https://github.com/pytorch/fairseq/issues/1306
+            if predicted_token_bpe.startswith('\u2581'):
+                predicted_token = ' ' + predicted_token
             if " {0}".format(masked_token) in masked_input:
                 topk_filled_outputs.append((
                     masked_input.replace(
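For context on the hunks above (a reading of the change, not an authoritative statement of fairseq internals): passing `add_if_not_exist=False` keeps `encode_line` from growing the dictionary when it meets an out-of-vocabulary piece at inference time, and spelling out `' </s>'` in the string makes `fill_mask` tokenize its end sentinel the same way `encode` does. The `'\u2581'` check addresses the sentencepiece word-boundary marker; here is a minimal standalone sketch of the inconsistency it patches (illustrative strings only, not fairseq code):

```python
# sentencepiece marks a word-initial piece with '\u2581' ("▁"). Decoding a
# single predicted piece in isolation strips that marker, and with it the
# leading space it stands for -- the inconsistency behind issue #1306.
piece = '\u2581délicieux'                        # predicted piece for ' délicieux'
decoded = piece.replace('\u2581', ' ').lstrip()  # naive decode -> 'délicieux'

# The hunk above re-attaches the space whenever the marker is present:
if piece.startswith('\u2581'):
    decoded = ' ' + decoded
assert decoded == ' délicieux'
```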

fairseq/models/roberta/model.py

Lines changed: 23 additions & 0 deletions

@@ -218,6 +218,29 @@ def from_pretrained(cls, model_name_or_path, checkpoint_file='model.pt', data_na
         return RobertaHubInterface(x['args'], x['task'], x['models'][0])
 
 
+@register_model('camembert')
+class CamembertModel(RobertaModel):
+    @classmethod
+    def hub_models(cls):
+        return {
+            'camembert.v0': 'http://dl.fbaipublicfiles.com/fairseq/models/camembert.v0.tar.gz',
+        }
+
+    @classmethod
+    def from_pretrained(cls, model_name_or_path, checkpoint_file='model.pt', data_name_or_path='.', bpe='sentencepiece', **kwargs):
+        from fairseq import hub_utils
+        x = hub_utils.from_pretrained(
+            model_name_or_path,
+            checkpoint_file,
+            data_name_or_path,
+            archive_map=cls.hub_models(),
+            bpe=bpe,
+            load_checkpoint_heads=True,
+            **kwargs,
+        )
+        return RobertaHubInterface(x['args'], x['task'], x['models'][0])
+
+
 class RobertaLMHead(nn.Module):
     """Head for masked language modeling."""
 
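To tie the registration back to the README examples, a short usage sketch (the local path is a placeholder): subclassing `RobertaModel` with `bpe='sentencepiece'` as the default and a `hub_models()` entry is what lets both loading paths below work without extra BPE flags.

```python
import torch
from fairseq.models.roberta import CamembertModel

# Via torch.hub: 'camembert.v0' is resolved through CamembertModel.hub_models()
# and the archive is fetched from fbaipublicfiles automatically.
camembert = torch.hub.load('pytorch/fairseq', 'camembert.v0')

# Via a local copy: '/path/to/camembert.v0' is a placeholder for the directory
# where camembert.v0.tar.gz was extracted; sentencepiece BPE is the default.
camembert = CamembertModel.from_pretrained('/path/to/camembert.v0')
camembert.eval()  # disable dropout (or leave in train mode to finetune)
```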
