PyTorch Day China 2025
In-person | June 7, 2025
Learn more on our website

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered to participate in the sessions. If you have not registered but would like to join us, please visit the BAAI Conference webpage.

Please note: This schedule is automatically displayed in China Standard Time (UTC+08:00). To see the schedule in your preferred timezone, please select from the drop-down located at the bottom of the menu to the right.

IMPORTANT NOTE: Timing of sessions and room locations are subject to change.
Venue: TBA
Saturday, June 7
 

09:00 CST

Keynote: Welcome & Opening Remarks - Matt White, The Linux Foundation
Saturday June 7, 2025 09:00 - 09:20 CST
Speakers
Matt White

GM of AI, Executive Director, PyTorch, Linux Foundation
Matt White is the Executive Director of the PyTorch Foundation and GM of AI at the Linux Foundation. He is also the Director of the Generative AI Commons. Matt has nearly 30 years of experience in applied research and standards in AI and data in telecom, media and gaming industries...

09:20 CST

Keynote: Running Large Models on Any AI Chip: PyTorch + Open-Source Stack (FlagOS) for Architecture-Free Deployment - Yonghua Lin, BAAI
Saturday June 7, 2025 09:20 - 09:40 CST
Speakers
Yonghua Lin

Vice President and Chief Engineer, Beijing Academy of Artificial Intelligence (BAAI)
Yonghua Lin serves as the Vice President and Chief Engineer at the Beijing Academy of Artificial Intelligence (BAAI). She oversees key research directions including general technologies for large-scale AI models, AI system research, open-source initiatives, and industrial ecosystem...

09:40 CST

Diving in Hugging Face Hub; Share Your Model Weights on the #1 AI Hub, Home of 700k+ PyTorch Models - Tiezhen Wang, Hugging Face
Saturday June 7, 2025 09:40 - 10:00 CST
Hugging Face Hub is a premier platform for hosting and sharing machine learning models. It offers a comprehensive suite of features—from model discovery and hosting to deployment and collaboration—that enhance the accessibility and impact of your work. Join this session to explore how these capabilities can streamline your workflow and amplify the reach of your research.
Speakers
Tiezhen Wang

Engineer, Hugging Face
Tiezhen Wang is an Engineer at Hugging Face, specializing in LLMs, open-source AI ecosystems, and cross-cultural AI development. Prior to joining Hugging Face, Tiezhen was a core developer of TensorFlow Lite Micro, an open-source machine learning inference framework designed for embedded...

10:00 CST

verl: An Open Source Large Scale LLM RL Framework for Agentic Tasks - Yuxuan Tong, Bytedance
Saturday June 7, 2025 10:00 - 10:20 CST
Recent advances in reinforcement learning have significantly boosted the reasoning capabilities of LLMs. Models such as OpenAI o3 and DeepSeek R1 demonstrate impressive performance on STEM and coding tasks. Yet training such models requires complex infrastructure.
In this talk, we present verl (https://github.com/volcengine/verl), a comprehensive framework that uses the HybridFlow programming abstraction to achieve both the flexibility to implement various algorithms and high performance. verl has been adopted by various universities and companies for RL training, and has received contributions from 100+ community members.
Through this talk, the audience will gain: i) a basic understanding of various RL algorithms, including GRPO; ii) best practices for implementing tool calling and multi-turn rollout for agentic tasks, as well as vision-language model reasoning; iii) the latest large-scale performance optimization techniques for RL with MoE models such as DeepSeek V3.
Speakers
Yuxuan Tong

Researcher, Bytedance
Yuxuan is a student in the Department of Computer Science and Technology at Tsinghua University and a core contributor to the verl project. Yuxuan led the infrastructure and contributed to the algorithm of DAPO, an open-source advanced LLM reinforcement learning recipe at scale, as a member...

10:20 CST

PyTorch in China: Community Growth, Localization, and Interaction - Zesheng Zong, Huawei
Saturday June 7, 2025 10:20 - 10:30 CST
Chinese PyTorch Community Overview & Resources
Provide a comprehensive introduction to the Chinese PyTorch community and its resources.

Community Events & Future Vision
Discuss plans to scale the community, drive technical innovation, and position PyTorch as the go-to framework for AI development in China.

Localized Tutorials & Documentation
Address the challenge of limited access to translated materials for PyTorch 2.x. Present ongoing efforts to translate official documentation and tutorials to empower Chinese learners.
Speakers
Zesheng Zong

Senior Software Engineer, Huawei
Currently working to give Chinese users easier access to PyTorch resources and to create a friendly user experience for beginners.

10:30 CST

Break
Saturday June 7, 2025 10:30 - 10:50 CST

11:10 CST

torch.accelerator: A Unified, Device-Agnostic Runtime API for Stream-Based Accelerators - Yu Guangye, Intel
Saturday June 7, 2025 11:10 - 11:30 CST
Motivation
PyTorch supports a wide range of acceleration hardware beyond CPUs, including CUDA, XPU, MPS, NPU, HPU, and more. Its architecture allows new backend integration through two key components: ATen operators and device runtime.
While ATen operators are device-agnostic, the runtime remains device-specific, relying on APIs like torch.cuda and torch.xpu. This fragmentation complicates writing portable, hardware-agnostic code across PyTorch and its ecosystem.
To address this challenge, we propose torch.accelerator: a unified, device-agnostic runtime API for stream-based accelerators.
Design
An Accelerator refers to a device that collaborates with the CPU to accelerate computation, typically via asynchronous execution using Stream and Event for synchronization. Our design assumes a single active accelerator per host.
The torch.accelerator API provides a consistent interface for device and stream management, with backend support integrated via the existing DeviceGuardImplInterface registration mechanism.
Further Work
We are actively working on a unified device memory API. This will streamline libraries, models, and unit tests.
Reference
https://github.com/pytorch/pytorch/pull/132204
Speakers
Yu Guangye

AI Framework Engineer, Intel
An AI framework engineer at Intel, dedicated to supporting Intel GPUs in PyTorch and advancing the generalization and improvement of PyTorch and its ecosystem. With deep experience in SYCL, XPU backend integration, and performance optimization across heterogeneous platforms, I’ve...

11:30 CST

vLLM: Easy, Fast, and Cheap LLM Serving for Everyone - Kaichao You, Tsinghua University
Saturday June 7, 2025 11:30 - 11:50 CST
vLLM is a fast and easy-to-use library for LLM inference and serving. In this talk, I will briefly introduce the evolution of the vLLM project, the open-source community behind it, and highlight some features that are interesting to many users.
Speakers
Kaichao You

Student, Tsinghua University
Kaichao You is a fifth year Ph.D. student from Tsinghua University. He is working on the vLLM project, a high-throughput and memory-efficient inference and serving engine for LLMs. He is also an open-source contributor to PyTorch/Triton.

11:50 CST

A torch.fx Based Compression Toolkit Empowered by torch_musa - Fan Mo, Moore Threads
Saturday June 7, 2025 11:50 - 12:00 CST
An introduction to torch_musa, built by Moore Threads to provide PyTorch backend support for the MUSA architecture. On top of its customized support for quantization, we built a lightweight model compression toolkit named NeuroTrim, which can quantize a variety of deep learning models and run them on native PyTorch.
We expanded the graph-tracing capability of PyTorch by integrating the FX graph, TorchInductor, and ExportedIR, and added many lowered operators in torch_musa, which can be easily used for debugging and then exported to ONNX or other formats for custom accelerator hardware.
Speakers
Fan Mo

Machine Learning Engineer, Moore Threads
Fan Mo received an M.S. degree from Shanghai Jiao Tong University in 2020 and is a machine learning engineer at Moore Threads. His current work focuses on AI infrastructure, and he is the principal maintainer of torch_musa, an open-source project that provides PyTorch backend support...

14:00 CST

Efficient Training of Video Generation Foundation Model at ByteDance - Xiaonan Nie, ByteDance Seed & Yong Li, Bytedance
Saturday June 7, 2025 14:00 - 14:20 CST
Video generation models have emerged as powerful tools, enabling compelling multimedia content creation across various applications. At ByteDance, we tackled the substantial challenge of efficiently scaling large-scale video generation model training to thousands of GPUs using PyTorch's robust software ecosystem.

In this talk, we will present several key innovations enabled by PyTorch to significantly enhance training efficiency:
1. Efficient Kernel Fusion with PyTorch Compile for optimized GPU computations.
2. Large-Model Training enabled by PyTorch's Checkpointing and Offloading techniques to address memory constraints.
3. Scalable Distributed Training combining Sequence Parallelism (SP) and Hybrid Sharding Data Parallelism (HSDP) for near-linear GPU scalability.
4. Load-Balancing Strategy dynamically distributing workloads for variable-length videos to eliminate performance bottlenecks.

Our session will delve into the practical methodologies, share crucial insights gained, and provide actionable best practices, empowering attendees to optimize their own large-scale model training workflows using PyTorch.
Speakers
Xiaonan Nie

Research Scientist in Machine Learning System, ByteDance Seed
Xiaonan Nie is currently a research scientist in MLSys at ByteDance, within the TopSeed Program. He received his Ph.D from Peking University in 2024, supervised by Prof. Bin Cui. His research focuses on optimizing the training of deep learning models at large scale. He has published...
Yong Li

Research Scientist in MLSys, ByteDance
At ByteDance since 2023.

14:20 CST

torch.compile Practice and Optimization in Different Scenarios - Yichen Yan, Alibaba Cloud
Saturday June 7, 2025 14:20 - 14:40 CST
TorchDynamo is the core compilation optimization component in PyTorch 2.0. It can optimize dynamic graphs with only minor code modifications and falls back to the default eager implementation when a construct is unsupported, which greatly lowers the barrier to enabling compilation optimization.

However, in practice, due to the diversity of AI workloads, there is still a certain threshold for enabling TorchDynamo. To address these issues, we have optimized torch.compile to make it easier for users with different models (LLM/others) and workloads (inference/training) to benefit from compilation optimization.
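As a minimal, hypothetical illustration of the fallback behavior described above (not code from the talk), a function with data-dependent Python control flow still works under `torch.compile`: Dynamo inserts a graph break and falls back to eager execution for the unsupported part.

```python
import torch

def f(x):
    # Data-dependent control flow triggers a graph break: Dynamo compiles
    # the tensor ops around it and falls back to plain Python here.
    if x.sum() > 0:
        return torch.sin(x)
    return torch.cos(x)

# backend="eager" exercises TorchDynamo's capture/fallback machinery
# without requiring Inductor's code generation.
compiled_f = torch.compile(f, backend="eager")

x = torch.ones(4)
# Results match eager execution; only how the code runs changes.
assert torch.allclose(compiled_f(x), f(x))
```

The `backend="eager"` option is just for demonstration; in practice the default Inductor backend is used for actual speedups.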
Speakers
Yichen Yan

Senior Engineer, Alibaba Cloud
Yichen Yan now works as a senior software engineer at Alibaba, focusing on the optimization of runtimes (CPython, Java, Node.js) and machine learning frameworks.

14:40 CST

PyTorch in Production: Boosting LLM Training and Inferencing on Ascend NPU - Jiawei Li, Huawei Technologies Co., Ltd.
Saturday June 7, 2025 14:40 - 15:00 CST
Deploying PyTorch in large-scale heterogeneous computing poses major challenges in stability, usability, performance, and quality.

This session will showcase best practices using real-world Huawei Ascend application scenarios and community contributions, demonstrating PyTorch and ecosystem components in large-scale heterogeneous computing.

Through Ascend-optimized technologies and architecture improvements in PyTorch, we have greatly enhanced computing performance in training and inference, meeting production standards.

We’ve proposed and implemented features for heterogeneous compute and accelerator quality standards in PyTorch, driving PyTorch’s adoption in heterogeneous compute across ecosystem projects such as DeepSpeed and vLLM.

Finally, we will share real user stories from large-scale training and distributed inference, presenting best practices on the Ascend platform and helping the community address challenges in diverse compute scenarios.
Speakers
Jiawei Li

Staff Engineer, Huawei Technologies Co., Ltd.
6+ years of experience in open source; worked on OpenStack development in the openEuler community and the Arm ecosystem. Currently contributing to the AI ecosystem. ONNX Runtime Ascend support author; PyTorch collaborator; PyTorch 2023 nominee.

15:00 CST

Galvatron: An Automatic Distributed Training System for Efficient Large-Scale Transformer Training - Xinyi Liu & Fangcheng Fu, Peking University
Saturday June 7, 2025 15:00 - 15:20 CST
Galvatron is a PyTorch-native, open-source framework for the efficient distributed training of large-scale Transformer models, with specialized optimizations for automatic hybrid parallelism strategies. Given a Transformer model, Galvatron first employs the PyTorch profiler to analyze the model's execution workload characteristics, creating a precise cost model. Then, Galvatron uses decision trees and dynamic programming to automatically deduce the best combination of parallelism dimensions for each model layer, covering data, tensor, pipeline, sharded data, and sequence parallelism, as well as recomputation. Finally, Galvatron leverages PyTorch features like FSDP and checkpointing, which integrate seamlessly with various accelerators such as NVIDIA GPUs and Ascend NPUs, to deploy and train the model. As an open-source project with comprehensive documentation, Galvatron is designed to be user-friendly, enabling easy integration with minimal code changes. Collaborations with both academia and industry, such as BAAI, Huawei, and ByteDance, highlight its practical applications and superior efficiency compared to existing frameworks. Discover more at https://github.com/PKU-DAIR/Hetu-Galvatron.
Speakers
Xinyi Liu

PhD Student, Peking University
Xinyi Liu is a Ph.D. student at the School of Computer Science, Peking University, and a member of the DAIR lab, which is led by Professor Bin Cui. His research is centered on distributed deep learning systems and the infrastructure for Large Language Models (LLMs). Currently, his...
Fangcheng Fu

Boya Postdoctoral Researcher, Peking University
Fangcheng Fu is currently a Boya Postdoctoral Researcher at the School of CS, Peking University, and a recipient of the China National Postdoctoral Program for Innovative Talent. Before that, he received his Bachelor's and Ph.D. degrees in computer science from Peking University in...

15:20 CST

Break
Saturday June 7, 2025 15:20 - 15:40 CST

15:40 CST

Intel's PyTorch Journey: Open Source Optimization Makes AI More Accessible - Mingfei Ma, Intel Asia-Pacific Research & Development Ltd.
Saturday June 7, 2025 15:40 - 16:00 CST
PyTorch is one of the most popular frameworks for deep learning and machine learning, and Intel has been a long-term contributor and advocate for the PyTorch community. In this talk, we will share our experience contributing to the PyTorch core framework and its ecosystem. We will show how to make AI applications more accessible through improvements in hardware computing power and the optimization of open-source software. We will introduce the latest progress on Intel GPU support in PyTorch, and a low-cost deployment solution for the full 671B-parameter DeepSeek R1 on Intel Xeon CPUs. We will also introduce some PyTorch ecosystem projects we have contributed to, such as Hugging Face, vLLM, and SGLang. Finally, we will discuss our future plans and visions as we continue to work with the PyTorch Foundation to drive deep learning and machine learning in a better direction.
Speakers
Mingfei Ma

AI Frameworks Engineer, Intel Asia-Pacific Research & Development Ltd.
Mingfei Ma is a senior deep learning software engineer at Intel. He is also the maintainer of the CPU performance module in PyTorch. Mingfei holds a Master's degree from Harbin Institute of Technology, where he majored in Control Science and Technology. Mingfei has 12 years of experience...

16:00 CST

FlagTree: Unified AI Compiler for Diverse AI Chips - Chunlei Men, Beijing Academy of Artificial Intelligence aka Zhiyuan Institute
Saturday June 7, 2025 16:00 - 16:20 CST
In the past two years, the Triton language has become the industry's second most popular AI operator development language after CUDA C, thanks in large part to its open-source compiler ecosystem, though a gap in popularity with CUDA C remains.
However, the compilers that currently support the Triton language in the mainline community still fail to cover multiple hardware architectures. As a result, each hardware manufacturer has to maintain its own version of the Triton compiler separately, which creates new difficulties for upper-layer users, such as inconsistent compiler versions and functional differences.
To address this issue, we have created the FlagTree open-source project and ecosystem community for the Triton language. We are committed to building an open-source, unified AI compiler for a variety of AI chips. By unifying the multi-backend compiler, we aim to build an ecosystem that can support other chip hardware architectures (such as GPGPU, DSA, and RISC-V AI), enabling cross-platform operation.
Speakers
Chunlei Men

R&D Manager, Beijing Academy of Artificial Intelligence aka Zhiyuan Institute
Men Chunlei is the R&D Manager and Senior Engineer at the Beijing Academy of Artificial Intelligence (BAAI). He is responsible for the research on intelligent computing power scheduling platforms, and AI compilers. He has been granted 13 invention patents. He successively served as...

16:40 CST

SGLang: An Efficient Open-Source Framework for Large-Scale LLM Serving - Liangsheng Yin, Shanghai Jiao Tong University / LMSYS
Saturday June 7, 2025 16:40 - 17:00 CST
SGLang is an open-source Large Language Model (LLM) inference system that is highly efficient and widely adopted by many companies like xAI, Nvidia and AMD. In this session, I will introduce some key features of SGLang, including the design and implementation of PD disaggregation, large-scale expert parallelism and data parallelism for DeepSeek models, hierarchical KV cache offloading, and highly efficient speculative decoding. I will also share some insights into the future development of the SGLang community.
Speakers
Liangsheng Yin

Student / Developer, Shanghai Jiao Tong University / LMSYS
He is an undergraduate student at Shanghai Jiao Tong University and one of the earliest core developers of SGLang, a popular open-source inference engine with 15K+ GitHub stars and 20K+ monthly downloads. SGLang is used by xAI (Grok 3), Microsoft Azure (DeepSeek R1), NVIDIA, AMD...
Saturday June 7, 2025 16:40 - 17:00 CST
TBA
 