Commit 5546055 · Parent(s): 0e9f43f

Update README.md (#11)

- Update README.md (b84905490c2d4796d85a42590c88a95af94cd5e2)

Co-authored-by: Jesse <[email protected]>

README.md CHANGED
````diff
@@ -21,21 +21,21 @@ the Hugging Face team.
 ## Model description
 
 BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it
-was pretrained on the raw texts only, with no humans […]
+was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of
 publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it
 was pretrained with two objectives:
 
 - Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input then run
   the entire masked sentence through the model and has to predict the masked words. This is different from traditional
   recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like
-  GPT which internally […]
+  GPT which internally masks the future tokens. It allows the model to learn a bidirectional representation of the
   sentence.
 - Next sentence prediction (NSP): the models concatenates two masked sentences as inputs during pretraining. Sometimes
   they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to
   predict if the two sentences were following each other or not.
 
 This way, the model learns an inner representation of the English language that can then be used to extract features
-useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
+useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard
 classifier using the features produced by the BERT model as inputs.
 
 ## Model variations
@@ -43,7 +43,7 @@ classifier using the features produced by the BERT model as inputs.
 BERT has originally been released in base and large variations, for cased and uncased input text. The uncased models also strips out an accent markers.
 Chinese and multilingual uncased and cased versions followed shortly after.
 Modified preprocessing with whole word masking has replaced subpiece masking in a following work, with the release of two models.
-Other 24 smaller models are released […]
+Other 24 smaller models are released afterward.
 
 The detailed release history can be found on the [google-research/bert readme](https://github.com/google-research/bert/blob/master/README.md) on github.
 
@@ -62,7 +62,7 @@ The detailed release history can be found on the [google-research/bert readme](h
 
 You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
 be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for
-fine-tuned versions […]
+fine-tuned versions of a task that interests you.
 
 Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
 to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
@@ -195,7 +195,7 @@ then of the form:
 [CLS] Sentence A [SEP] Sentence B [SEP]
 ```
 
-With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus and in
+With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
 the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a
 consecutive span of text usually longer than a single sentence. The only constrain is that the result with the two
 "sentences" has a combined length of less than 512 tokens.
````
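The masked language modeling objective described in the diff above can be tried directly with the `fill-mask` pipeline from `transformers`. A minimal sketch, assuming a recent `transformers` install with PyTorch; `bert-base-uncased` is used here only as a stand-in checkpoint name, so substitute the checkpoint this card actually belongs to:

```python
from transformers import pipeline

# Load a fill-mask pipeline backed by a pretrained BERT checkpoint.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the token hidden behind [MASK], using context from both sides of it.
predictions = unmasker("Paris is the [MASK] of France.")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```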

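The next sentence prediction objective and the `[CLS] Sentence A [SEP] Sentence B [SEP]` format can likewise be exercised with `BertForNextSentencePrediction`. A rough sketch under the same assumptions (checkpoint name is illustrative, and the two example sentences are made up):

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "The man went to the store."
sentence_b = "He bought a gallon of milk."

# Passing a sentence pair makes the tokenizer build [CLS] Sentence A [SEP] Sentence B [SEP].
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
print(tokenizer.decode(inputs["input_ids"][0]))

with torch.no_grad():
    logits = model(**inputs).logits

# Index 0 scores "sentence B follows sentence A", index 1 scores "sentence B is random".
print(logits.softmax(dim=-1))
```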

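For the feature-extraction use mentioned in the card ("train a standard classifier using the features produced by the BERT model as inputs"), the hidden states can be pulled out with `AutoModel` and fed to any ordinary classifier. A minimal sketch under the same assumptions; pooling on the `[CLS]` token is one common choice, not the only one:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

texts = ["I loved this movie.", "The plot made no sense."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One sentence-level feature vector per input: the hidden state of the [CLS] token.
features = outputs.last_hidden_state[:, 0]  # shape (2, 768) for a base-size model
print(features.shape)

# These vectors can then serve as inputs to a standard classifier
# (logistic regression, a small MLP, etc.) trained on labeled sentences.
```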