This paper is published in Volume-12, Issue-3, 2026
Area
Artificial Intelligence
Author
Muhammad Hamza
Org/Univ
Qassim University, Saudi Arabia
Pub. Date
25 May, 2026
Paper ID
V12I3-1176
Publisher
Keywords
Autonomous Driving, Object Detection, Detection Transformers (DETR), Vision Transformers (ViT), Real-Time Perception, Edge AI, Multimodal Fusion.

Citations

IEEE
Muhammad Hamza. Transformer-Based Object Detection Architectures for Autonomous Driving Perception: A Comprehensive Review, International Journal of Advance Research, Ideas and Innovations in Technology, www.IJARIIT.com.

APA
Muhammad Hamza (2026). Transformer-Based Object Detection Architectures for Autonomous Driving Perception: A Comprehensive Review. International Journal of Advance Research, Ideas and Innovations in Technology, 12(3) www.IJARIIT.com.

MLA
Muhammad Hamza. "Transformer-Based Object Detection Architectures for Autonomous Driving Perception: A Comprehensive Review." International Journal of Advance Research, Ideas and Innovations in Technology 12.3 (2026). www.IJARIIT.com.

Abstract

Perception is one of the most important components of autonomous vehicles in intelligent transportation systems, and reliably balancing high detection precision against real-time computational efficiency remains an open problem. Deep learning has proven very accurate in controlled settings, but high latency and substantial memory overhead often make deployment a challenge, both for CNN-based solutions and for end-to-end transformer solutions. This review systematically analyzes recent developments in transformer-based detection architectures, consolidating transformer- and CNN-based detection architectures from 2024 to 2026. Our analysis shows a clear gap between theoretical sophistication and real-world deployability on edge hardware. In addition, the literature shows a clear disconnect between 2D camera-based detection approaches and 3D multimodal fusion approaches. The review identifies critical research dimensions that the current state of the art does not adequately address, including small object detection in dense urban environments and robust inference under challenging weather conditions. By mapping these interrelated gaps, the review provides a structured path forward toward lightweight, accurate, and robust transformer detectors that can be deployed on their own in the field.