Event Preview | Academy of Artificial Intelligence, TileRT, Tencent, Huawei, and Zhiyuan Innovation Join Forces to Explore Multi-Level Collaborative Optimization

12 hours ago

From the sweltering heat of Beijing to the frigid winter of Shanghai, the Meet AI Compiler technical salon, hosted by HyperAI, has accompanied the AI compilation ecosystem for three years. In these three years, we have witnessed countless engineers and researchers sharing cutting-edge findings and exchanging technical perspectives, jointly driving the continuous evolution of compilation technology in the era of large models, constantly pushing the boundaries of performance optimization, heterogeneous adaptation, and engineering implementation.

Technology keeps iterating, and our efforts to connect with cutting-edge innovation have never stopped. On August 1st, the ninth Meet AI Compiler technical salon will set sail again in Beijing! For this edition, we have invited experts from the Academy of Artificial Intelligence, the TileRT team, Tencent, Huawei Ascend, and Zhiyuan Innovation. They will give in-depth talks on FlagTree language extensions, TileRT ultra-low latency inference, FalconGEMM operator optimization, AscendNPU IR open-source co-construction, and application practices for embodied intelligence, presenting a picture of how AI compilers are evolving collaboratively at multiple levels: language expression, operator computation, inference execution, and scenario application.

As always, seats are limited, so act fast! Grab your seats now, and we'll see you there!

Event Details

⏰ Time: August 1st (Saturday) 13:30-17:30

📍 Location: Multifunctional Hall, 5th Floor, Building 12, Zhongguancun Entrepreneurship Street, Haidian District, Beijing

👬 Number of participants: 150 (Limited seating available, please register as soon as possible)

🙌🏻 Registration link: https://hdxu.cn/1KkIr

Scan the QR code and add the note "AI Compiler" to join the event group:

Guests and Agenda

Session 1: Guest speakers

Topic: FlagTree: Triton-TLE Language Extensions, Tile IR Backend, and Compiler Optimization Practices

Abstract: This talk has three parts. The first introduces the challenges Triton faces and how TLE (Triton Language Extensions) progressively exposes hardware details through three levels of language extensions, striking a better balance between portability, maintainability, and performance. The second focuses on the engineering practice of integrating Tile IR into the Triton compiler FlagTree, and shows how this new compiler backend further expands the performance optimization space for Triton operators. The third systematically analyzes key compiler optimization techniques such as layout optimization and instruction reordering, demonstrating a complete compilation optimization path for cross-chip high-performance operators.

In this talk, you will learn:

1. How TLE controls on-chip memory, expresses distributed and producer-consumer models, and inlines vendor-native languages.

2. How the TLE and Tile IR backends raise the performance ceiling of key Triton operators.

3. How compiler optimization techniques reduce data layout transformation overhead, improve instruction execution efficiency, and further unleash the performance of Triton operators.

Topic: TileRT: Speed is Intelligence – Computational Exploration and Co-design for Ultra-Low Latency Large Model Inference

Abstract: As large models reach trillions of parameters and enter the agentic era, extreme inference speed has become key to supporting complex task flows and fully unleashing a model's potential. However, when systems try to push latency limits further, traditional system architectures and execution bottlenecks often become insurmountable obstacles.

This talk introduces the latest explorations of TileRT and demonstrates how to build an ultra-low latency software stack for large model computing, covering AI compilers, runtime architecture evolution, and model-system co-design.

In this talk, you will learn:

1. Speed is Intelligence: why "speed" is gradually becoming a key metric for large model inference in the agentic era.

2. System Architecture Exploration: the architectural evolution of TileRT, using GLM-5 as an example to discuss how refactoring the underlying computation scheduling can significantly improve inference performance.

3. Model-System Co-design and Production Practice: how joint design of models and systems can break through the 1000 TPS speed bottleneck in single-batch inference for trillion-parameter models.

Topic: FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication

Abstract: Matrix multiplication (GEMM) is the computational core of large model training and inference. However, as model sizes grow exponentially, implementations of the standard O(N³) algorithm are already approaching the physical peak of the hardware. How to keep extracting performance once chip computing power is saturated has become a key question for large model infrastructure, which makes it important to understand the principles, value, and engineering challenges of low-complexity matrix multiplication. One path the mathematical community has explored for more than 50 years is low-complexity matrix multiplication (LCMA, such as Strassen and AlphaTensor): trading more memory accesses and additions for fewer multiplications, and thus "breaking through" the hardware peak in an equivalent sense. However, three major engineering challenges (memory access bloat, algorithm selection, and cross-platform portability) have long kept it at the theoretical level.

This talk introduces the FalconGEMM project, which systematically brings LCMA from papers into a production-grade software stack at three levels: compiler-automated code generation, memory access optimization through group-parallel fusion, and algorithm selection based on performance models. According to the team, it comprehensively outperforms top official libraries on various GPU/CPU platforms and on real large model workloads.
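To make the "fewer multiplications for more additions" trade concrete, here is a minimal sketch of the textbook Strassen recursion in pure Python. It is not FalconGEMM's implementation; the function names (`naive`, `strassen`, `split`) are hypothetical, and the sketch assumes square integer matrices whose size is a power of two. It counts scalar multiplications so the two approaches can be compared.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split(m: Matrix) -> Tuple[Matrix, Matrix, Matrix, Matrix]:
    """Split a square matrix into its four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def naive(a: Matrix, b: Matrix) -> Tuple[Matrix, int]:
    """Standard O(N^3) product; returns (result, scalar multiplications)."""
    n = len(a)
    c = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    return c, n ** 3


def strassen(a: Matrix, b: Matrix) -> Tuple[Matrix, int]:
    """Strassen product: 7 recursive products per level instead of 8."""
    if len(a) == 1:
        return [[a[0][0] * b[0][0]]], 1
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    products = [
        strassen(add(a11, a22), add(b11, b22)),  # M1
        strassen(add(a21, a22), b11),            # M2
        strassen(a11, sub(b12, b22)),            # M3
        strassen(a22, sub(b21, b11)),            # M4
        strassen(add(a11, a12), b22),            # M5
        strassen(sub(a21, a11), add(b11, b12)),  # M6
        strassen(sub(a12, a22), add(b21, b22)),  # M7
    ]
    m1, m2, m3, m4, m5, m6, m7 = [p for p, _ in products]
    mults = sum(count for _, count in products)
    # Recombine the seven products using additions and subtractions only.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom, mults


if __name__ == "__main__":
    a = [[i * 4 + j for j in range(4)] for i in range(4)]
    b = [[(i + 1) * (j + 2) for j in range(4)] for i in range(4)]
    c_naive, naive_mults = naive(a, b)
    c_fast, fast_mults = strassen(a, b)
    assert c_fast == c_naive
    print(naive_mults, fast_mults)  # 64 49
```

For a 4×4 input the recursion uses 49 scalar multiplications instead of 64, at the cost of many extra additions and temporary matrices. That extra data movement is the "memory access bloat" mentioned above, which is why a practical library has to decide per problem size whether the trade pays off.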

In this talk, you will learn:

1. The principles, value, and engineering challenges of low-complexity matrix multiplication as a way to break through the performance ceiling once operator optimization reaches its limit.

2. FalconGEMM's technical solutions and cross-platform practices.

Topic: AscendNPU IR: An Open-Source Compilation Platform Supporting Multi-Language Integration with Ascend

Abstract: AscendNPU IR, the Ascend compiler component, has been fully open-sourced. As the MLIR-based access layer that connects third-party programming frameworks to Ascend, it provides flexible integration, complete expressiveness, and Ascend-friendly compilation optimizations, and it supports multiple front-end DSLs to improve the performance of Ascend operators.

In this talk, you will learn:

1. The overall technical architecture and design philosophy of AscendNPU IR.

2. The new AscendNPU IR features added for Ascend 950.

3. The AscendNPU IR community-building activities and how to participate.

Topic: A General-Purpose AI Compiler for Embodied Intelligence

Abstract: This talk introduces a general-purpose compiler for embodied intelligence and multimodal large models. It focuses on capturing, exporting, grouping, compiling, deploying at runtime, and optimizing the performance of complete algorithm pipelines, and it addresses key issues in edge delivery, stable operation, cross-framework adaptation, and engineering at scale for robot models.

In this talk, you will learn:

1. The core challenges that distinguish embodied intelligence model deployment from traditional model deployment, including the engineering complexity and maintenance costs introduced by multiple models, multiple frameworks, and multi-stage pipelines.

2. How a general-purpose compiler can capture the complete algorithm flow through dynamic tracing and organize modules such as preprocessing, the VLA model, the LLM, and post-processing into a compilable and deployable DAG template.

3. How grouped compilation and a unified runtime architecture support different backends, leveraging the strengths of various chips while maintaining a unified delivery chain.

4. The interface paradigm between the embodied-domain compiler and the distribution platform.

Organizers and partners

HyperAI (hyper.ai) is an internationally leading artificial intelligence and high-performance computing community. It aims to help developers and enthusiasts in the global data science and artificial intelligence industry learn, understand, and practice by providing services such as industry news reports, accelerated dataset downloads, online tutorial demonstrations, popular model performance evaluations, cutting-edge paper recommendations, interpretations of high-value results, and a consolidated calendar of top conferences, building the future of artificial intelligence together with the community.

Visit the official website: https://hyper.ai/

OpenBayes Bayesian Computing is a leading high-performance computing service provider in China. By grafting classic software ecosystems and machine learning models onto new-generation heterogeneous chips, it provides industrial enterprises and university research teams with faster and easier-to-use data science computing products. Its products have been adopted in dozens of large industrial scenarios and by leading scientific research institutes.

Visit the official website: https://openbayes.com/

The MLC.AI community was established in June 2022. Chen Tianqi, the main inventor of Apache TVM and a well-known young scholar in the field of machine learning, led the team to launch the MLC online course, which systematically introduced the key elements and core concepts of machine learning compilation.

In November 2022, through the joint efforts of MLC.AI community volunteers, the first complete Chinese documentation for TVM was launched and hosted on the HyperAI official website, giving developers in China who are interested in machine learning compilation the most basic resource for accessing and learning a new technology: documentation.

MLC Online Courses: https://mlc.ai/

TVM Chinese Documentation: https://tvm.hyper.ai/

Event venue support

The venue for this event was provided by the Administrative Committee of Zhongguancun Science City and Beijing Zhongguancun Science City Innovation Development Co., Ltd.

Event registration: Scan the QR code to go to the event registration page

Scan the QR code and add the note "AI Compiler" to join the event group

Given the limited space at the venue, we have opened only 150 seats. We recommend that you register as soon as possible to secure your place.

See you there on August 1st, from 13:30 to 17:30!
