Transformer-Based Object Detection Architectures for Autonomous Driving Perception: A Comprehensive Review

Muhammad Hamza

doi:XX.XXX/IJARIIT-V12I3-1176

This paper is published in Volume-12, Issue-3, 2026

Paper Details

Area

Artificial Intelligence

Author

Muhammad Hamza

Org/Univ

Qassim University, Saudi Arabia

Pub. Date

25 May, 2026

Paper ID

V12I3-1176

Publisher

IJARIIT

Edition

Volume-12, Issue-3, 2026

Keywords

Autonomous Driving, Object Detection, Detection Transformers (DETR), Vision Transformers (ViT), Real-Time Perception, Edge AI, Multimodal Fusion.

Citations

IEEE
Muhammad Hamza. Transformer-Based Object Detection Architectures for Autonomous Driving Perception: A Comprehensive Review, International Journal of Advance Research, Ideas and Innovations in Technology, www.IJARIIT.com.

APA
Muhammad Hamza (2026). Transformer-Based Object Detection Architectures for Autonomous Driving Perception: A Comprehensive Review. International Journal of Advance Research, Ideas and Innovations in Technology, 12(3). www.IJARIIT.com.

MLA
Muhammad Hamza. "Transformer-Based Object Detection Architectures for Autonomous Driving Perception: A Comprehensive Review." International Journal of Advance Research, Ideas and Innovations in Technology 12.3 (2026). www.IJARIIT.com.


Abstract

Perception is one of the most important components of autonomous vehicles and intelligent transportation systems, and reliably balancing high detection precision against real-time computational efficiency remains an open problem. Deep learning has proven highly accurate in controlled settings, but deployment remains difficult: high latency and substantial memory overhead often stand in the way of moving from CNN-based solutions to end-to-end transformer detectors. This review systematically analyzes recent developments in transformer-based detection architectures, consolidating transformer- and CNN-based detection architectures from 2024 to 2026. Our analysis reveals a clear gap between theoretical sophistication and real-world deployability on edge hardware. The literature also shows a clear disconnect between 2D camera-based detection approaches and 3D multimodal fusion approaches. The review identifies critical research dimensions that the current state of the art does not adequately address, including small object detection in dense urban environments and robust inference under challenging weather conditions. By mapping these interrelated gaps, it provides a structured path forward toward lightweight, accurate, and robust transformer detectors that can be deployed standalone in the field.
