
CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale

Paper Dataset

💡 Updates & News

  • [01/05/2025]: Our paper has been accepted by ICML 2025!
  • [25/02/2025] 📄 Our paper has been released on arXiv today!

📝 Contents

  • 💾 CODESYNC
  • 🚀 Execution
  • 🤗 Contributing
  • 👍 Acknowledgement
  • ⭐ Citation

💾 CODESYNC

⚠️ Work in progress: we expect to finish the program by June.

CodeSync is a data engine that automatically generates training sets and benchmarks to assess how well LLMs can synchronize with version-specific APIs.

CodeSync consists of four key steps:

  • Real-Time API Update Tracking: tracks and collects API updates by comparing legacy and target versions of libraries (see the sketch after this list).
  • Real-World API Invocation Retrieval: crawls real-world code and locates valid invocations of the tracked APIs.
  • Legacy-Updated API Invocation Synthesis: leverages LLMs to synthesize new API invocation statements from the legacy and updated signatures, respectively, and reorganizes the results into metadata.
  • CodeSyncBench Constructor: generates a comprehensive benchmark from the metadata.
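
For illustration, here is a minimal, hypothetical sketch of how the tracking step might work: dump every public function signature in a library, run the dump once per installed version, and diff the two outputs offline to recover added, removed, or re-signed APIs. The function name `dump_signatures` and the overall approach are assumptions for illustration, not the repository's actual code.

```python
# A minimal sketch (not the authors' implementation) of step 1:
# dump public signatures so that dumps from two installed library
# versions can be diffed offline.
import importlib
import inspect
import json

def dump_signatures(module_name: str) -> dict[str, str]:
    """Map each public callable in `module_name` to its signature string."""
    module = importlib.import_module(module_name)
    sigs = {}
    for name, obj in inspect.getmembers(module, callable):
        if name.startswith("_"):
            continue  # skip private helpers
        try:
            sigs[f"{module_name}.{name}"] = str(inspect.signature(obj))
        except (ValueError, TypeError):
            pass  # some builtins expose no introspectable signature
    return sigs

if __name__ == "__main__":
    # Run once per installed library version, then diff the JSON dumps.
    print(json.dumps(dump_signatures("json"), indent=2))
```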

For more details, please refer to our paper.

The implementation of CodeSync can be found in DataProcessor.
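
To make the metadata concrete, the following is a hypothetical record shape for one legacy-to-updated API pair. The field names, the `mylib.Model.fit` API, and all values are invented for illustration; the real schema is defined in DataProcessor and may differ.

```python
# Hypothetical shape of one metadata record produced by step 3; the
# actual schema lives in DataProcessor. All names and values are invented.
from dataclasses import dataclass

@dataclass
class APIUpdateRecord:
    library: str             # library under study, e.g. "mylib"
    api: str                 # fully qualified API name
    legacy_signature: str    # signature in the legacy library version
    updated_signature: str   # signature in the target library version
    legacy_invocation: str   # LLM-synthesized call for the legacy signature
    updated_invocation: str  # LLM-synthesized call for the updated signature

record = APIUpdateRecord(
    library="mylib",
    api="mylib.Model.fit",
    legacy_signature="(self, data, epochs=10)",
    updated_signature="(self, data, *, epochs=10, device=None)",
    legacy_invocation="model.fit(train_data, 5)",
    updated_invocation="model.fit(train_data, epochs=5, device='cpu')",
)
```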

🚀 Execution

You can run CodeSync via the provided bash script:

bash codesync.sh --crawling --filter --synthesis --benchmark

or via the Python script:

python pipeline.py --crawling True --filter True --synthesis True --benchmark True
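
For reference, here is a minimal sketch of the command-line interface that the invocation above implies. The boolean parsing and stage dispatch are assumptions for illustration, not the repository's actual pipeline.py.

```python
# A minimal sketch of a pipeline.py entry point matching the command
# above; flag handling here is an assumption, not the actual code.
import argparse

def str2bool(value: str) -> bool:
    """Parse 'True'/'False'-style flag values passed on the command line."""
    return value.lower() in {"true", "1", "yes"}

def main() -> None:
    parser = argparse.ArgumentParser(description="Run the CodeSync pipeline.")
    for stage in ("crawling", "filter", "synthesis", "benchmark"):
        parser.add_argument(f"--{stage}", type=str2bool, default=False,
                            help=f"Enable the {stage} stage.")
    args = parser.parse_args()
    print(vars(args))  # each enabled stage would be dispatched here

if __name__ == "__main__":
    main()
```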

🤗 Contributing

Contributions to this project are welcome. Please consider the following ways to contribute:

  • Reporting issues
  • Proposing new features or improvements
  • Benchmarking other mainstream LLMs

👍 Acknowledgement

Many thanks to Zhaoyang Chu and Zhengxiang Cheng for their invaluable efforts on this project!

We also thank these great projects:

  • HumanEval is a widely used Python dataset for evaluating code generation.
  • LLaMA-Factory is a reliable framework for tuning models.
  • BigCode-Evaluation is an excellent framework for evaluating code generation models.

⭐ Citation

@misc{wang2025codesync,
      title={CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale}, 
      author={Chenlong Wang and Zhaoyang Chu and Zhengxiang Cheng and Xuyi Yang and Kaiyue Qiu and Yao Wan and Zhou Zhao and Xuanhua Shi and Dongping Chen},
      year={2025},
      eprint={2502.16645},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.16645}, 
}
