Update README.md

README.md CHANGED

@@ -27,7 +27,6 @@ To develop our WizardCoder model, we begin by adapting the Evol-Instruct method

## Comparing WizardCoder with the Closed-Source Models.

-The SOTA LLMs for code generation, such as GPT4, Claude, and Bard, are predominantly closed-source. Acquiring access to the APIs of these models proves challenging. In this study, we adopt an alternative approach by retrieving the scores for HumanEval and HumanEval+ from the [LLM-Humaneval-Benchmarks](https://github.com/my-other-github-account/llm-humaneval-benchmarks). Notably, all the mentioned models generate code solutions for each problem utilizing a single attempt, and the resulting pass rate percentage is reported. Our **WizardCoder** generates answers using greedy decoding.

🔥 The following figure shows that our **WizardCoder attains the third position in this benchmark**, surpassing Claude-Plus (59.8 vs. 53.0) and Bard (59.8 vs. 44.5). Notably, our model exhibits a substantially smaller size compared to these models.

@@ -35,9 +34,11 @@ The SOTA LLMs for code generation, such as GPT4, Claude, and Bard, are predomina

<a ><img src="https://raw.githubusercontent.com/nlpxucan/WizardLM/main/WizardCoder/imgs/pass1.png" alt="WizardCoder" style="width: 86%; min-width: 300px; display: block; margin: auto;"></a>
</p>

+❗**Note: In this study, we copy the scores for HumanEval and HumanEval+ from the [LLM-Humaneval-Benchmarks](https://github.com/my-other-github-account/llm-humaneval-benchmarks). All the mentioned models generate a code solution for each problem in a **single attempt**, and the resulting pass-rate percentage is reported. Our **WizardCoder** generates answers with greedy decoding and is tested with the same [code](https://github.com/evalplus/evalplus).**
+
## Comparing WizardCoder with the Open-Source Models.

-The following table 
+The following table clearly demonstrates that our **WizardCoder** exhibits a substantial performance advantage over all the open-source models. ❗**If you are confused by the different scores of our model (57.3 and 59.8), please check the Notes below the table.**

| Model            | HumanEval Pass@1 | MBPP Pass@1 |

@@ -56,7 +57,10 @@ The following table conducts a comprehensive comparison of our **WizardCoder** w

| WizardLM-30B 1.0     | 37.8             | --          |
| WizardCoder-15B 1.0  | **57.3**         | **51.8**    |
-
+
+❗**Note: The StarCoder result on MBPP is our reproduction.**
+
+❗**Note: The above table conducts a comprehensive comparison of our **WizardCoder** with other models on the HumanEval and MBPP benchmarks. We adhere to the approach outlined in previous studies by generating **20 samples** for each problem to estimate the pass@1 score, and we evaluate with the same [code](https://github.com/openai/human-eval/tree/master). The scores of GPT4 and GPT3.5 reported by [OpenAI](https://openai.com/research/gpt-4) are 67.0 and 48.1 (these may be early versions of GPT4 and GPT3.5).**

## Call for Feedback

We welcome everyone to use professional and difficult instructions to evaluate WizardCoder and to show us examples of poor performance, along with your suggestions, in the [issue discussion](https://github.com/nlpxucan/WizardLM/issues) area. We are currently focusing on improving Evol-Instruct and hope to address the existing weaknesses and issues in the next version of WizardCoder. After that, we will open-source the code and pipeline of the up-to-date Evol-Instruct algorithm and work with you to improve it together.
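
The note in the closed-source comparison above describes a single-attempt, greedy-decoding setting for WizardCoder. As a rough illustration of that setting, here is a minimal sketch of single-attempt greedy generation with Hugging Face `transformers`; the checkpoint id, prompt, and generation length are assumptions for illustration, not taken from this change.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint id and prompt, for illustration only.
model_id = "WizardLM/WizardCoder-15B-V1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Write a Python function that returns the n-th Fibonacci number."
inputs = tokenizer(prompt, return_tensors="pt")

# do_sample=False selects greedy decoding: one deterministic completion per problem,
# matching the single-attempt setting described in the note.
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```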
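
The note under the open-source table says pass@1 is estimated from 20 samples per problem using the HumanEval evaluation code. The sketch below shows the standard unbiased pass@k estimator (Chen et al., 2021) that such evaluations rely on; the per-problem pass counts are made-up numbers for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: with n samples per problem, of which c pass the
    unit tests, the probability that at least one of k randomly drawn samples
    passes is 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical pass counts for five problems, out of n = 20 samples each.
n = 20
pass_counts = [20, 13, 0, 7, 20]
score = 100 * sum(pass_at_k(n, c, k=1) for c in pass_counts) / len(pass_counts)
print(f"pass@1 = {score:.1f}%")  # for k = 1 the estimator reduces to the mean of c / n
```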

