Commit 3daf3b4 (1 parent: acd4009) by moreover: first commit
.ipynb_checkpoints/README-checkpoint.md ADDED
---
title: ExecEval
sdk: docker
suggested_hardware: cpu-basic
pinned: false
---


# ExecEval

A distributed, extensible, secure solution for evaluating machine-generated code with unit tests in multiple programming languages.

This repository is part of our ongoing effort to build a large-scale execution-based evaluation benchmark, published as [xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval](https://arxiv.org/abs/2303.03004). If you use this tool, please consider citing the paper.

```
@misc{khan2023xcodeeval,
      title={xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval},
      author={Mohammad Abdullah Matin Khan and M Saiful Bari and Xuan Long Do and Weishi Wang and Md Rizwan Parvez and Shafiq Joty},
      year={2023},
      eprint={2303.03004},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

Part of this work was submitted in partial fulfillment of the requirements for the Master of Science degree in Computer Science and Applications at the Islamic University of Technology by Muhammad Abdullah Matin Khan. (The thesis or project report will be added upon publication.)

```
@misc{khan2024xcodeeval,
      title={Development of a Code Search Engine Using Natural Language Processing Techniques},
      author={Mohammad Abdullah Matin Khan},
      year={2024},
      publication={Journal of Engineering and Technology (JET)},
      url=TBA
}
```

## Dependencies:

- [docker-ce](https://docs.docker.com/engine/install/)

## Steps (assuming dependencies are satisfied):

1. Clone this [ExecEval repository](https://github.com/ntunlp/ExecEval).
2. `cd ExecEval`
3. `docker build . -t exec-eval:1.0`
4. `docker run -it -p x:y -e NUM_WORKERS=67 exec-eval:1.0`. This exposes port `y` (default `5000`) as `http://localhost:y` on the local machine, while port `x` is used inside the Docker container and can be set via the environment variable `GUNICORN_PORT`. `NUM_WORKERS` is an environment variable giving the number of parallel execution-engine workers. It is recommended not to use all CPUs: if the CPU reaches 100% load, it can slow down code execution unpredictably, and some CPUs should be kept free for the evaluation script. A valid example for a machine with fewer CPUs: `docker run -it -p 5000:5000 -e NUM_WORKERS=5 exec-eval:1.0`

### Expected outcome:

An HTTP server should be running on port `y` (default `5000`) that can execute code in parallel and return its output.

## Some helpful definitions:

### Definition of ExtendedUnittest:

```py
@dataclass
class ExtendedUnittest:
    input: str
    output: list[str] = field(default_factory=list)
    result: str | None = None
    exec_outcome: ExecOutcome | None = None
```

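Entries of `unittest_db` typically carry more key-value pairs than `input` and `output`, and the extras should be skipped when building unit tests (this caveat is repeated in the `JobData` description below). A minimal sketch, assuming a plain dict-shaped `unittest_db` with made-up extra fields:

```python
# Hypothetical unittest_db entry; "hidden" and "weight" stand in for the
# extra key-value pairs that must be skipped.
unittest_db = {
    "some_src_uid": [
        {"input": "1 2\n", "output": ["3"], "hidden": True, "weight": 1.0},
        {"input": "5 7\n", "output": ["12"], "hidden": False, "weight": 1.0},
    ]
}

def to_extended_unittests(raw_tests: list[dict]) -> list[dict]:
    """Keep only the fields ExtendedUnittest expects."""
    return [{"input": t["input"], "output": t["output"]} for t in raw_tests]

tests = to_extended_unittests(unittest_db["some_src_uid"])
print(tests[0])  # {'input': '1 2\n', 'output': ['3']}
```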
### Definition of ExecOutcome:

```py
class ExecOutcome(Enum):
    PASSED = "PASSED"  # code executes and the output matches the expected output
    WRONG_ANSWER = "WRONG_ANSWER"  # code executes and the output does NOT match the expected output
    TIME_LIMIT_EXCEEDED = "TIME_LIMIT_EXCEEDED"  # code did not finish in time; the output is ignored in this case
    RUNTIME_ERROR = "RUNTIME_ERROR"  # code failed to execute (crashed)
    COMPILATION_ERROR = "COMPILATION_ERROR"  # code failed to compile
    MEMORY_LIMIT_EXCEEDED = "MEMORY_LIMIT_EXCEEDED"  # code exceeded the memory limit during execution
```

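As a toy illustration of the first two outcomes (my own sketch, not ExecEval's actual matching logic), a run can be classified by comparing its trimmed output against the accepted outputs of a unit test:

```python
from enum import Enum

class ExecOutcome(Enum):
    PASSED = "PASSED"
    WRONG_ANSWER = "WRONG_ANSWER"

def classify(actual: str, expected: list[str]) -> ExecOutcome:
    # A test passes when the trimmed output equals any accepted expected output.
    if actual.strip() in (e.strip() for e in expected):
        return ExecOutcome.PASSED
    return ExecOutcome.WRONG_ANSWER

print(classify("3\n", ["3"]).value)  # PASSED
print(classify("4\n", ["3"]).value)  # WRONG_ANSWER
```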
### Definition of ResourceLimits:

For a detailed description of each attribute, see the [man page of getrlimit](https://man7.org/linux/man-pages/man2/getrlimit.2.html).

```py
class ResourceLimits:
    core: int = 0  # RLIMIT_CORE
    data: int = -1  # RLIMIT_DATA
    # nice: int = 20  # RLIMIT_NICE
    fsize: int = 0  # RLIMIT_FSIZE
    sigpending: int = 0  # RLIMIT_SIGPENDING
    # memlock: int = -1  # RLIMIT_MEMLOCK
    rss: int = -1  # RLIMIT_RSS
    nofile: int = 4  # RLIMIT_NOFILE
    msgqueue: int = 0  # RLIMIT_MSGQUEUE
    rtprio: int = 0  # RLIMIT_RTPRIO
    stack: int = -1  # RLIMIT_STACK
    cpu: int = 2  # RLIMIT_CPU, CPU time, in seconds
    nproc: int = 1  # RLIMIT_NPROC
    _as: int = 2 * 1024**3  # RLIMIT_AS, set to 2 GB by default
    locks: int = 0  # RLIMIT_LOCKS
    # rttime: int = 2  # RLIMIT_RTTIME, timeout for real-time tasks
```

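These fields mirror the Linux `RLIMIT_*` constants. As a side note, Python's standard `resource` module exposes the same limits, so the current process's limits can be inspected directly (an illustration only; ExecEval applies limits via `prlimit`):

```python
import resource

# Read a few of the RLIMIT_* values mirrored by the ResourceLimits fields.
# getrlimit returns a (soft, hard) pair; resource.RLIM_INFINITY means unlimited.
for name in ("RLIMIT_CORE", "RLIMIT_CPU", "RLIMIT_NOFILE", "RLIMIT_AS"):
    soft, hard = resource.getrlimit(getattr(resource, name))
    print(f"{name}: soft={soft} hard={hard}")
```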
## API endpoints:

### API to execute code:

- End point: /api/execute_code
- Method: POST
- Content-type: application/json
- Post request json format:

```py
# json of a dict of this dataclass
class JobData:
    language: str  # language of the code to be executed, usually found in sample["lang"]
    source_code: str  # source code, usually found in sample["source_code"]
    unittests: list[ExtendedUnittest]  # unit tests, usually found in unittest_db[sample["src_uid"]], which contains more key-value pairs than input and output, so skip the extras
    compile_cmd: str | None = None  # compiler program, e.g. gcc, g++, clang++, go, rustc, javac
    compile_flags: str | None = None  # flags passed during compilation, e.g. "-std=c++11 -lm -static ..."
    execute_cmd: str | None = None  # executor program (mainly the interpreter for interpreted languages), e.g. python2, pypy2, ruby, php
    execute_flags: str | None = None  # flags passed to the executor program, e.g. "-o -nologo", "-W ignore"
    limits: ResourceLimits = field(default_factory=ResourceLimits)  # resource limits
    block_network: bool = True  # block network access for codes executed by ExecEval (True is safer)
    stop_on_first_fail: bool = True  # stop executing a code if a unit test fails (True for faster execution)
    use_sanitizer: bool = False  # kept to allow some codes of xCodeEval (e.g. MS C++) to execute on Linux while testing ExecEval with xCodeEval test data (False should be ok)
```

- Response json format: ExtendedUnittest

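A hedged sketch of calling this endpoint with Python's standard library (the server URL follows the setup steps above; the sample payload values are made up, and `JobData` fields with defaults are omitted):

```python
import json
from urllib import request

# Hypothetical sample; in practice these values come from xCodeEval data.
payload = {
    "language": "Python 3",
    "source_code": "a, b = map(int, input().split())\nprint(a + b)",
    "unittests": [{"input": "1 2\n", "output": ["3"]}],
}

req = request.Request(
    "http://localhost:5000/api/execute_code",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Sending is left to the reader, e.g.:
#   with request.urlopen(req) as resp:
#       results = json.load(resp)  # unit tests with result/exec_outcome filled in
print(req.full_url, req.get_method())
```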
### API to get list of runtimes available:

- End point: /api/all_runtimes
- Method: GET
- Content-type: application/json
- Response format:

```json
[
    {
        "compile_cmd": "gcc", // program to compile with
        "compile_flags": "-fno-optimize-sibling-calls -w -fno-strict-aliasing -DONLINE_JUDGE -include limits.h -fno-asm -s -O2 -DONLINE_JUDGE -include math.h -static -lm", // default compiler flags
        "execute_cmd": "",
        "execute_flags": "",
        "has_sanitizer": true,
        "is_compiled": true,
        "runtime_name": "GNU C",
        "timelimit_factor": 1
    },
    {
        "compile_cmd": "python3",
        "compile_flags": "-W ignore -m py_compile",
        "execute_cmd": "python3", // program to execute with
        "execute_flags": "-W ignore -OO -s -S", // flags to execute with
        "has_sanitizer": false, // whether a sanitizer is implemented in execution_engine/settings.py
        "is_compiled": true, // true if there is a compile cmd
        "runtime_name": "Python 3", // name that must match the language passed to the execute-code API
        "timelimit_factor": 3 // a multiplier on the allowed execution time, since some languages are slower than others
    }
    // etc.
]
```

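The `timelimit_factor` scales a base time limit per runtime. A small sketch using the example entries above (the 2-second base is an assumption on my part, matching the default `cpu` limit in `ResourceLimits`):

```python
# Example entries mirroring the /api/all_runtimes response above.
runtimes = [
    {"runtime_name": "GNU C", "timelimit_factor": 1},
    {"runtime_name": "Python 3", "timelimit_factor": 3},
]

BASE_TIME_LIMIT_S = 2  # assumed base limit, matching the default cpu (RLIMIT_CPU) of 2 s

def effective_time_limit(language: str) -> int:
    # runtime_name must match the language string passed to /api/execute_code.
    rt = next(r for r in runtimes if r["runtime_name"] == language)
    return BASE_TIME_LIMIT_S * rt["timelimit_factor"]

print(effective_time_limit("Python 3"))  # 6
```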
## Evaluation

### pass@k

Check the `eval_scripts` directory. The dependencies are listed in `requirements.txt`; run `pip install -r eval_scripts/requirements.txt`. The entry point is `eval_passk.py`; run `python eval_scripts/eval_passk.py --help` for a description of the arguments.

#### Example of most typical usage:

```sh
python eval_scripts/eval_passk.py $path_to_samples_to_evaluate --k "1,2,5,10" --n_workers 129 --limits_by_lang_cfg_file eval_scripts/limits_by_lang.yaml --unittest_file $path_to_unittest_db_file --execeval_url "http://localhost:5000" --use_sanitizer 0
```

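For reference, pass@k is usually computed with the unbiased estimator of Chen et al. (2021), `1 - C(n-c, k) / C(n, k)` for `n` samples of which `c` pass. A minimal sketch of the standard formula (my own illustration, not necessarily the exact code in `eval_passk.py`):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k draws
    (without replacement) from n samples is among the c correct ones."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=5, c=1, k=1))  # ~0.2
print(pass_at_k(n=5, c=1, k=5))  # 1.0
```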
## **IMPORTANT**

- The pip dependencies needed to run the evaluation script are listed in `eval_scripts/requirements.txt`.
- Sanitize functions are available in `execution_engine/settings.py`.
- Default compiler and execution flags are available in `execution_engine/config.yaml`.
- Default resource limits for all supported languages are available in `eval_scripts/limits_by_lang.yaml`.
- The machine-generated codes to be executed should be a list of JSON objects with the following key-value pairs present to work properly:
  - source_code: the code to be executed.
  - lang: the language/runtime `ExecEval` should use for execution.
  - src_uid: the unique id used to retrieve unit tests from unittest_db.
  - task_id: a unique id assigned by the machine/model trainer to represent the task being solved. For example, **program synthesis** should have `task_id` equal to `src_uid`, whereas **code translation** can have `task_id` equal to the index of the test sample for which the code was generated.
- Be extra careful with the files used to run the scripts. For the most part, following the provided files (i.e. `unittest_db` from **xCodeEval** and the other files from **ExecEval**) should be fine, except for the file with the machine-generated codes.

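The required key-value pairs above can be checked before submitting a file for evaluation. A minimal sketch (my own helper, not part of ExecEval):

```python
REQUIRED_KEYS = {"source_code", "lang", "src_uid", "task_id"}

def find_invalid_samples(samples: list[dict]) -> list[int]:
    """Return indices of samples that are missing any required key."""
    return [i for i, s in enumerate(samples) if not REQUIRED_KEYS <= s.keys()]

# Hypothetical samples: the second one is missing "task_id".
samples = [
    {"source_code": "print(1)", "lang": "Python 3", "src_uid": "u1", "task_id": "u1"},
    {"source_code": "print(2)", "lang": "Python 3", "src_uid": "u2"},
]
print(find_invalid_samples(samples))  # [1]
```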
## Security measures:

- Use a separate unprivileged user for each worker to limit access to different resources.
- Use `prlimit` to limit the resources available during execution.
- Use `seccomp` to block socket syscalls (this can easily be extended to an arbitrary syscall blocker, with the caveat that some languages require certain syscalls to execute code).
- Thus arbitrary resource usage is restricted.
- Compilation is not as secure as execution, under the assumption that code would need to find a vulnerability in the compiler to exploit this point. (This part is not tested.)

README.md CHANGED: the same YAML front matter block shown above was prepended to the existing README.md, which already contained the `# ExecEval` title and description.