CodeFuseEval - A Multi-Task Evaluation Benchmark for Code LLMs
Background
2023 has been a year of across-the-board development for large language models: as of July, 130 LLMs had been released in China and another 138 abroad, a genuine "contention of a hundred models". In this landscape, capability evaluation has become an indispensable step of model development. Unlike the output of traditional software systems, a model's output is generated by prediction and carries substantial uncertainty: the model can produce previously unseen knowledge from what it has learned. Such emergent knowledge demonstrates reasoning and generalization ability, but it also introduces risk, and once a model is embedded in a conversational product and faces arbitrary user questions, emergence is triggered all the more easily. Detecting these newly emergent capabilities in time, and ensuring that model output is helpful, harmless, and honest, are major challenges for LLM evaluation today. An evaluation benchmark for code LLMs therefore has to span multiple task types and dimensions.
What Is Code LLM Evaluation?
1. What Code LLM Evaluation Covers
2. How Code LLMs Are Evaluated
What Is CodeFuseEval?
1. Code Capability Benchmarks
2. The CodeFuseEval Benchmark
GitHub: https://github.com/codefuse-ai/codefuse-evaluation
ModelScope: https://modelscope.cn/datasets/codefuse-ai/CodeFuseEval
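To get started, the dataset can be pulled directly from the ModelScope hub. A minimal sketch, assuming the modelscope SDK is installed (pip install modelscope); the exact subset and split names vary by task, so consult the dataset card for the actual configurations:

```python
from modelscope.msdatasets import MsDataset

# Fetch CodeFuseEval from the ModelScope hub. Subset/split arguments are
# omitted here and differ per task; the dataset card at
# https://modelscope.cn/datasets/codefuse-ai/CodeFuseEval lists them.
ds = MsDataset.load('CodeFuseEval', namespace='codefuse-ai')
print(ds)
```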
Building the Evaluation Datasets
The evaluation data are assembled from three sources:

- Open-source benchmarks: we survey and adopt benchmarks published by recognized organizations, such as HumanEval-X, WikiSQL, and CoNaLa, so that results can be compared horizontally with the rest of the industry. Before an open-source set is adopted, it must be checked for contamination (leakage into training data) and for labeling errors.
- Crowdsourced annotated data: targeting the intended scenarios and users of code LLMs, we build evaluation sets from code-domain knowledge and crowd-testing feedback. Crowd tests with experts in different programming languages and with whitelisted users collect real, practical, and diverse data that fills gaps in the open-source sets, such as Chinese-language scenarios and computer-science coursework, and establish scenario-specific baselines for vertical comparison across model versions.
- Held-out splits: a test set is split off from the full prepared dataset by ratio, with the ratio adjusted to the size of the original dataset.

A sample record looks like this (here, a Java-to-Python code-translation task built from an MBPP problem):
{
"task_id": "Python/177",
"prompt": "import java.io.*;\nimport java.lang.*;\nimport java.util.*;\nimport java.math.*;\n\n\nclass ProdSquare {\n    /**\n     * Write a Java function to check whether the given number can be represented by product of two squares or not.\n     *\n     * > prodSquare(25)\n     * false\n     * > prodSquare(30)\n     * false\n     * > prodSquare(16)\n     * true\n     */\n    public static Boolean prodSquare(int n) {\n        int a = 1;\n        int b = 1;\n        for (int i = 1; i <= n; i++) {\n            if (a * i < 0) {\n                b = b * i;\n            } else {\n                a = a * i;\n            }\n        }\n        return b == 1;\n    }\n}",
"canonical_solution": "def prod_Square(n):\r\n    for i in range(2,(n) + 1):\r\n        if (i*i < (n+1)):\r\n            for j in range(2,n + 1):\r\n                if ((i*i*j*j) == n):\r\n                    return True;\r\n    return False;",
"test": ["assert prod_Square(25) == False", "assert prod_Square(30) == False", "assert prod_Square(16) == True"],
"desc_en": "Write a python function to check whether the given number can be represented by product of two squares or not.",
"Difficulty": "mbpp",
"desc_cn": "写一个函数来检查给定的数字是否可以用两个正方形的乘积来表示。"
}
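Functional correctness for a record like this is established by executing candidate code against the record's unit tests. Below is a minimal sketch of that check; the file name is hypothetical, and a production harness (such as the HumanEval-style executors this benchmark builds on) runs each candidate in a sandboxed subprocess with a timeout rather than a bare exec:

```python
import json

def passes_tests(record: dict, candidate: str) -> bool:
    """Return True iff `candidate` (source code defining the required
    function, e.g. prod_Square) satisfies every assert in record["test"]."""
    env: dict = {}
    try:
        exec(candidate, env)          # define the function under test
        for assertion in record["test"]:
            exec(assertion, env)      # e.g. assert prod_Square(16) == True
    except Exception:                 # failed assert, syntax error, etc.
        return False
    return True

# Sanity-check the reference solutions themselves (hypothetical file name).
with open("codefuseeval_samples.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(record["task_id"], passes_tests(record, record["canonical_solution"]))
```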
Evaluation Execution Framework
3. Example Evaluation Results
Computing the Metrics
| Metric   | What it measures       |
|----------|------------------------|
| pass@k   | functional correctness |
| BLEURT   | semantic similarity    |
| BLEU     | token-level similarity |
| CodeBLEU | syntactic similarity   |
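pass@k is computed with the unbiased estimator from the Codex paper (first entry in the references): sample n candidates per task, count the c that pass all tests, and estimate the probability that at least one of k randomly drawn candidates passes. A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), evaluated in a
    numerically stable product form.

    n: candidates sampled per task
    c: candidates that passed all unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per task, 37 of them correct
print(pass_at_k(n=200, c=37, k=10))  # ~0.88
```

The similarity metrics, by contrast, come from reference implementations rather than test execution, e.g. the sacrebleu package for BLEU and the CodeXGLUE repository for CodeBLEU.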
Visualizing the Results
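One straightforward presentation is a per-language bar chart for each metric. A minimal matplotlib sketch; the numbers are placeholders for illustration, not actual CodeFuseEval results:

```python
import matplotlib.pyplot as plt

# Placeholder scores for illustration only -- not real evaluation results.
languages = ["Python", "Java", "JavaScript", "C++", "Go"]
pass_at_1 = [0.35, 0.30, 0.28, 0.22, 0.25]

plt.bar(languages, pass_at_1)
plt.ylabel("pass@1")
plt.title("pass@1 by language (placeholder data)")
plt.tight_layout()
plt.show()
```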
4. Outlook
References
Evaluating Large Language Models Trained on Code (Codex; introduces the HumanEval benchmark). https://arxiv.org/pdf/2107.03374.pdf
CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X (HumanEval-X evaluates multilingual models against hand-written solutions in C++, Java, JavaScript, and Go). https://arxiv.org/pdf/2303.17568.pdf
Program Synthesis with Large Language Models (Google Research; introduces the Mostly Basic Programming Problems (MBPP) dataset: 974 tasks, 500 of them for testing, designed to be solvable by entry-level programmers). https://arxiv.org/pdf/2108.07732.pdf
CodeTrans: an encoder-decoder transformer model for tasks in the software engineering domain; BLEU is its metric for code comment generation. https://arxiv.org/pdf/2104.02443.pdf
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. https://arxiv.org/pdf/2102.04664.pdf
Holistic Evaluation of Language Models (HELM). https://arxiv.org/pdf/2211.09110.pdf
Seq2SQL: Generating Structured Queries from Natural Language Using Reinforcement Learning (introduces WikiSQL: 80,654 hand-annotated questions and SQL queries across 24,241 Wikipedia tables). https://arxiv.org/abs/1709.00103