[1] Codex Hacks HackerRank: Memorization Issues and a Framework for Code Synthesis Evaluation, https://arxiv.org/pdf/2212.02684.pdf
[2] Evaluating the Performance of Large Language Models on GAOKAO Benchmark, https://arxiv.org/pdf/2305.12474.pdf
[3] Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation, https://arxiv.org/pdf/2305.01210.pdf
[4] Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks, https://arxiv.org/pdf/2305.14201.pdf
[5] Multi-Task Instruction Tuning of LLaMa for Specific Scenarios: A Preliminary Study on Writing Assistance, https://arxiv.org/pdf/2305.13225.pdf
[6] MAYBE ONLY 0.5% DATA IS NEEDED: A PRELIMINARY EXPLORATION OF LOW TRAINING DATA INSTRUCTION TUNING, https://arxiv.org/pdf/2305.09246.pdf
[7] Chain-of-thought prompting for responding to in-depth dialogue questions with LLM, https://arxiv.org/pdf/2305.11792.pdf
[8] CREATOR: Disentangling Abstract and Concrete Reasonings of Large Language Models through Tool Creation, https://arxiv.org/pdf/2305.14318.pdf
[9] Tree of Thoughts: Deliberate Problem Solving with Large Language Models, https://arxiv.org/pdf/2305.10601.pdf
[10] Think Outside the Code: Brainstorming Boosts Large Language Models in Code Generation, https://arxiv.org/pdf/2305.10679.pdf
[11] Chain-of-Symbol Prompting Elicits Planning in Large Langauge Models, https://arxiv.org/pdf/2305.10276.pdf
[12] Text Classification via Large Language Models, https://arxiv.org/pdf/2305.08377.pdf
[13] Knowledge Rumination for Pre-trained Language Models, https://arxiv.org/pdf/2305.08732.pdf
[14] Unlocking Temporal Question Answering for Large Language Models Using Code Execution, https://arxiv.org/pdf/2305.15014.pdf