Dataset:
py_ast
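This dataset contains parsed ASTs that were used to train and evaluate the DeepSyn tool. The Python programs were collected from GitHub repositories by removing duplicate files, removing project forks (copies of another existing repository), and keeping only programs that parse and whose AST has at most 30,000 nodes. The authors also attempted to remove obfuscated files.
Code representation, unsupervised learning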
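The "parses and has at most 30,000 nodes" filter described above can be sketched roughly as follows. This is only an illustration, not the authors' pipeline: it uses the Python 3 standard-library `ast` module as a stand-in parser, and the helper names `count_nodes` and `keep_program` are hypothetical. Node counts from `ast.walk` include context nodes such as `Store` and `Load`, so they will not match the dataset's own node counts exactly.

```python
import ast

MAX_NODES = 30_000  # node limit stated in the dataset description


def count_nodes(tree: ast.AST) -> int:
    """Count every node reachable from the root, including the root itself."""
    return sum(1 for _ in ast.walk(tree))


def keep_program(source: str, max_nodes: int = MAX_NODES) -> bool:
    """Return True if the source parses and its AST is small enough.

    Assumption: Python 3's parser stands in for whatever parser was
    used to build the dataset.
    """
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        # Programs that do not parse are dropped.
        return False
    return count_nodes(tree) <= max_nodes


if __name__ == "__main__":
    print(keep_program("x = 7\nprint(x + 1)"))  # parses and is small
    print(keep_program("def f(:"))              # syntax error
```

Deduplication and fork removal are repository-level steps and are not covered by this sketch.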
Python
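A typical data point contains the parsed AST of a Python program. The main key is ast, which stores the AST of each program. Each node carries the following fields: type (the node type), children (a non-empty list enumerating the node's children, if it has any), and value (the node's hard-coded value if it has one, otherwise "N/A"). For example:
[ {"type":"Module","children":[1,4]},{"type":"Assign","children":[2,3]},{"type":"NameStore","value":"x"},{"type":"Num","value":"7"}, {"type":"Print","children":[5]}, {"type":"BinOpAdd","children":[6,7]}, {"type":"NameLoad","value":"x"}, {"type":"Num","value":"1"} ]
The data is split into training and test sets. The final split sizes are as follows: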
| | train | validation |
|---|---|---|
| py_ast examples | 100000 | 50000 |
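To make the node format concrete, here is a minimal sketch that walks the example AST shown earlier and prints it as an indented tree. It assumes that the integers in `children` are indices into the same flat list, which is what the example suggests but is not stated explicitly. The function name `render` is hypothetical.

```python
import json

# The example data point from the description, as a flat list of nodes.
EXAMPLE = json.loads("""
[ {"type":"Module","children":[1,4]},
  {"type":"Assign","children":[2,3]},
  {"type":"NameStore","value":"x"},
  {"type":"Num","value":"7"},
  {"type":"Print","children":[5]},
  {"type":"BinOpAdd","children":[6,7]},
  {"type":"NameLoad","value":"x"},
  {"type":"Num","value":"1"} ]
""")


def render(nodes, index=0, depth=0):
    """Return the subtree rooted at nodes[index] as a list of indented lines.

    Assumption: each entry in "children" is an index into `nodes`.
    """
    node = nodes[index]
    label = node["type"]
    if "value" in node:
        label += f" = {node['value']}"
    lines = ["  " * depth + label]
    # Nodes without a "children" key are treated as leaves.
    for child in node.get("children", []):
        lines.extend(render(nodes, child, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(render(EXAMPLE)))
```

Under that reading, the example corresponds to the two statements `x = 7` and `print x + 1`.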
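[More Information Needed]
[More Information Needed]
[More Information Needed]
Who are the source language producers? [More Information Needed]
[More Information Needed]
Who are the annotators? [More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]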
Raychev, V., Bielik, P., and Vechev, M.
MIT, BSD and Apache
@InProceedings{OOPSLA ’16, ACM,title = {Probabilistic Model for Code with Decision Trees.},authors={Raychev, V., Bielik, P., and Vechev, M.},year={2016}}
@inproceedings{10.1145/2983990.2984041,
author = {Raychev, Veselin and Bielik, Pavol and Vechev, Martin},
title = {Probabilistic Model for Code with Decision Trees},
year = {2016},
isbn = {9781450344449},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2983990.2984041},
doi = {10.1145/2983990.2984041},
booktitle = {Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications},
pages = {731–747},
numpages = {17},
keywords = {Code Completion, Decision Trees, Probabilistic Models of Code},
location = {Amsterdam, Netherlands},
series = {OOPSLA 2016}
}
Thanks to @reshinthadithyan for adding this dataset.