A Scientific and Systems-Level Perspective on Modern Artificial Intelligence
Artificial intelligence evolves at a remarkable pace, and each new breakthrough tends to generate the same recurring narrative: that whatever came before is now obsolete. In recent years, the rise of Transformer-based architectures, Large Language Models, and multimodal foundation models has led many to question the relevance of classical neural networks such as Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs). This perception, however, overlooks the nuanced way in which progress in machine learning actually occurs.
Classical architectures have not been replaced; rather, they have become specialized, embedded, and indispensable components of modern AI systems. Understanding their continuing relevance requires moving beyond benchmark hype and examining neural networks from scientific, architectural, and systems-engineering perspectives. The question is therefore not whether CNNs or MLPs are obsolete, but whether they are being evaluated in the correct context.
Multi-Layer Perceptrons (MLPs) represent the most basic form of neural computation. Mathematically, an MLP is a sequence of affine transformations followed by nonlinear activation functions, capable of approximating arbitrary continuous functions under mild conditions. This theoretical property, formalized by the Universal Approximation Theorem, establishes MLPs as powerful function approximators. However, expressive power alone does not guarantee efficiency or generalization. MLPs make almost no assumptions about the structure of the data, which is both their greatest strength and their primary limitation.
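To make this concrete, here is a minimal sketch of such a network, assuming PyTorch; the layer widths and the ten-class output are illustrative placeholders rather than values taken from any particular system.

```python
import torch
import torch.nn as nn

# A minimal MLP: alternating affine transformations (nn.Linear) and
# nonlinear activations, ending in a task-specific output layer.
class MLP(nn.Module):
    def __init__(self, in_dim=32, hidden_dim=64, out_dim=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),   # affine transformation
            nn.ReLU(),                       # nonlinearity
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),  # e.g. class logits
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
x = torch.randn(8, 32)           # batch of 8 feature vectors
logits = model(x)                # shape: (8, 10)
```

Each nn.Linear is an affine map, and the interleaved activations supply the nonlinearity that the Universal Approximation Theorem relies on.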
In domains where data is structured, low-dimensional, and already encoded in meaningful features, such as finance, healthcare records, industrial telemetry, or business analytics, this lack of inductive bias becomes an advantage. MLPs often outperform more complex architectures in these settings because they introduce less variance, require fewer samples to generalize, and are computationally efficient. For this reason, MLPs remain widely used in production systems, particularly for tabular data. Far from disappearing, they are also deeply embedded within modern architectures. Every Transformer block contains large feed-forward networks that are, in essence, MLPs. Classification heads, regression layers, reinforcement learning policies, and Mixture-of-Experts (MoE) components all rely on MLPs as fundamental building blocks. Declaring MLPs obsolete would therefore imply declaring most modern AI architectures obsolete as well.
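The claim that every Transformer block contains an MLP can be seen directly in code. The sketch below, again assuming PyTorch, shows a position-wise feed-forward sub-layer with illustrative dimensions (d_model=512, d_ff=2048); structurally it is just a two-layer MLP applied independently at every token position.

```python
import torch
import torch.nn as nn

# The position-wise feed-forward sub-layer of a Transformer block is,
# structurally, a two-layer MLP applied at every token position.
class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):                # x: (batch, seq_len, d_model)
        return self.ff(x)

tokens = torch.randn(2, 16, 512)         # 2 sequences of 16 token embeddings
out = FeedForward()(tokens)              # same shape as the input
```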
Convolutional Neural Networks (CNNs) were introduced to address a different limitation of early neural models: the inability to exploit spatial structure. Images, videos, and other grid-like data exhibit strong local correlations and translational regularities. CNNs encode these assumptions directly through local receptive fields, weight sharing, and hierarchical feature extraction. This architectural bias allows CNNs to learn visual representations efficiently, using far fewer parameters than fully connected networks. From a biological and computational perspective, CNNs are remarkably well aligned with human visual processing. Early layers detect edges and simple patterns, intermediate layers capture textures and shapes, and deeper layers represent object-level semantics. This hierarchical organization enables CNNs to generalize well even when training data is limited. As a result, CNNs remain the backbone of countless real-world systems, including autonomous driving perception stacks, medical imaging diagnostics, satellite image analysis, industrial inspection pipelines, and mobile vision applications.
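A small convolutional stack, sketched below under the assumption of PyTorch and 32x32 RGB inputs, makes these biases explicit: each 3x3 kernel has a local receptive field, its weights are shared across all spatial positions, and stacking convolution and pooling layers produces the hierarchy described above.

```python
import torch
import torch.nn as nn

# A small CNN: local receptive fields (3x3 kernels), weight sharing across
# positions, and a hierarchy from edges to object-level features.
class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # early layer: edges
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # mid layer: textures, shapes
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.head = nn.Linear(32 * 8 * 8, num_classes)    # MLP classification head

    def forward(self, x):                  # x: (batch, 3, 32, 32)
        h = self.features(x)
        return self.head(h.flatten(1))

logits = SmallCNN()(torch.randn(4, 3, 32, 32))   # shape: (4, 10)
```

Note that the classification head is itself an MLP, which again illustrates how the classical building blocks nest inside one another.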
The emergence of Vision Transformers has led to renewed debate about the future of CNNs. Vision Transformers replace convolutional inductive bias with global self-attention, allowing every image patch to interact with every other patch. While this approach achieves impressive results on massive datasets, it comes at the cost of increased computational complexity and reduced data efficiency. In practice, CNNs often outperform Vision Transformers in small- and medium-scale regimes, especially when latency, energy consumption, and robustness matter. Many state-of-the-art vision systems now adopt hybrid designs, using CNNs for efficient feature extraction and Transformers for high-level reasoning. CNNs are not obsolete; they are efficient specialists integrated into larger systems.
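One way such a hybrid can be wired together is sketched below, assuming PyTorch: a convolutional stem downsamples the image into a grid of feature vectors, which a standard Transformer encoder then treats as tokens for global reasoning. This is an illustrative composition, not the design of any specific published model.

```python
import torch
import torch.nn as nn

# Hybrid design: CNN stem for efficient local feature extraction,
# Transformer encoder for global reasoning over the resulting tokens.
class HybridBackbone(nn.Module):
    def __init__(self, d_model=128):
        super().__init__()
        self.stem = nn.Sequential(                                   # 224x224 -> 14x14 grid
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),     # 112x112
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),    # 56x56
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),  # 28x28
            nn.Conv2d(d_model, d_model, 3, stride=2, padding=1),     # 14x14
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):                          # x: (batch, 3, 224, 224)
        h = self.stem(x)                           # (batch, d_model, 14, 14)
        tokens = h.flatten(2).transpose(1, 2)      # (batch, 196, d_model)
        return self.encoder(tokens)                # globally contextualized tokens

out = HybridBackbone()(torch.randn(1, 3, 224, 224))   # (1, 196, 128)
```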
Sequential data introduces yet another structural challenge. Language, speech, sensor streams, and time series depend on temporal order. Recurrent Neural Networks (RNNs) were designed to model this dependency by maintaining a hidden state that evolves over time. Vanilla RNNs, while conceptually elegant, suffer from optimization difficulties that limit their ability to capture long-range dependencies. Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) addressed these issues by introducing gating mechanisms that regulate information flow, enabling stable learning over long sequences. Although Transformers have largely replaced recurrent models in large-scale natural language processing, LSTMs and GRUs remain highly relevant in domains where streaming inference, low latency, or limited data are critical. In many real-time systems, such as speech recognition on edge devices or industrial time-series monitoring, recurrent models offer a better trade-off between performance and efficiency than attention-based architectures.
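The streaming advantage is easy to see in code. In the hedged sketch below, assuming PyTorch, a GRU cell carries a fixed-size hidden state from one time step to the next, so each new observation is processed in constant time without revisiting the full history; the anomaly-score head is a made-up example of a downstream task.

```python
import torch
import torch.nn as nn

# Streaming inference with a GRU: the hidden state summarizes the past,
# so each incoming sample is processed in constant time per step.
gru = nn.GRUCell(input_size=8, hidden_size=32)
head = nn.Linear(32, 1)                       # e.g. an anomaly score per step

h = torch.zeros(1, 32)                        # initial hidden state
for _ in range(100):                          # stand-in for a live sensor stream
    x_t = torch.randn(1, 8)                   # one new observation
    h = gru(x_t, h)                           # gated update of the hidden state
    score = head(h)                           # prediction for this time step
```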
Transformers themselves represent a powerful but often misunderstood paradigm. By relying on self-attention, Transformers can model global dependencies without recurrence, enabling massive parallelization and scalability. This makes them ideal for large datasets and general-purpose modeling, which is why they form the foundation of modern language and multimodal models. Beyond Transformers, the modern AI landscape has grown to include several specialized architectures and training paradigms. Large Language Models (LLMs) are Transformer-based architectures trained on massive text corpora, often extended to multimodal capabilities. Mixture-of-Experts (MoE) architectures distribute computation across specialized subnetworks, improving scalability while retaining high performance. Vision-Language Models (VLMs) integrate visual and textual inputs, combining CNNs or Vision Transformers with MLPs and attention layers to achieve multimodal reasoning. Sequence Learning Models (SLMs) continue to process temporal data efficiently, building on recurrent or Transformer-based foundations, while Masked Language Models (MLMs) employ objectives in which certain tokens are hidden and predicted, as exemplified by BERT. Segment Anything Models (SAMs) demonstrate universal image segmentation capabilities using Transformer-based image encoders, while lighter variants pair CNN or hybrid backbones with the same prompt-driven design. Even less standardized terms, such as Latent Concept Models (LCMs) or Language-Aware Models (LAMs), highlight ongoing efforts to adapt classical components to novel data representations and task-specific biases.
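The mechanism shared by all of these Transformer-based models is compact enough to show directly. The following sketch, assuming PyTorch, implements single-head scaled dot-product self-attention, in which every position attends to every other position; production models add multiple heads, masking, and residual connections around it.

```python
import math
import torch
import torch.nn as nn

# Single-head scaled dot-product self-attention:
# every position in the sequence attends to every other position.
class SelfAttention(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        weights = scores.softmax(dim=-1)           # global pairwise interactions
        return weights @ v

out = SelfAttention()(torch.randn(2, 10, 64))      # (2, 10, 64)
```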
Beyond grids and sequences lies relational data, which is best modeled as graphs. Graph Neural Networks (GNNs) introduce a message-passing framework that allows node representations to evolve based on local neighborhood structure. This inductive bias is irreplaceable for tasks involving social networks, molecular structures, recommendation systems, and knowledge graphs. No amount of attention or convolution can fully substitute for architectures explicitly designed to operate on graphs. GNNs are therefore not alternatives to CNNs or Transformers, but complementary tools tailored to relational reasoning.
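A minimal message-passing layer, sketched below in plain PyTorch without any graph library, illustrates the idea: each node aggregates transformed features from its neighbors, as defined by an adjacency matrix, and combines them with its own representation. The three-node graph is a toy example.

```python
import torch
import torch.nn as nn

# One round of message passing: each node aggregates messages from its
# neighbors (given by the adjacency matrix) and updates its representation.
class MessagePassingLayer(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.message = nn.Linear(dim, dim)       # transform neighbor features
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, h, adj):                   # h: (num_nodes, dim), adj: (n, n)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = (adj @ self.message(h)) / deg      # mean over neighbors
        return torch.relu(self.update(torch.cat([h, agg], dim=-1)))

adj = torch.tensor([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])  # 3-node toy graph
h = torch.randn(3, 16)
h = MessagePassingLayer()(h, adj)                # updated node embeddings
```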
Generative modeling further illustrates the cumulative nature of AI progress. Autoencoders (AEs) and Variational Autoencoders (VAEs) introduced principled approaches to representation learning and probabilistic modeling. While newer techniques such as diffusion models have achieved superior generative quality, they rely heavily on classical components. Most diffusion models use convolutional U-Net architectures augmented with attention layers, reinforcing the idea that innovation builds on existing foundations rather than discarding them.
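As a reminder of how classical these components look in practice, here is a hedged sketch of a small variational autoencoder, assuming PyTorch and a flattened 28x28 input: an MLP encoder and decoder, with the reparameterization trick providing a differentiable sampling step in between.

```python
import torch
import torch.nn as nn

# A small VAE: an MLP encoder maps inputs to a latent Gaussian, the
# reparameterization trick keeps sampling differentiable, and an MLP
# decoder reconstructs the input.
class VAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        return self.decoder(z), mu, logvar

recon, mu, logvar = VAE()(torch.randn(4, 784))
```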
At the systems level, modern AI increasingly relies on modularity. Mixture-of-Experts architectures exemplify this: a gating mechanism routes each input to a small subset of specialized experts, so model capacity can grow without a proportional increase in computation per input. These experts are not novel primitives but combinations of Transformers, CNNs, and MLPs. Even cutting-edge Large Language Models depend on classical neural components at their core.
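The routing idea can be sketched as follows, assuming PyTorch and MLP experts; top-1 routing is used here purely for simplicity, and the dimensions are placeholders.

```python
import torch
import torch.nn as nn

# Mixture-of-Experts with top-1 routing: a gating network picks one expert
# (here a small MLP) per input, so only a fraction of the model's
# parameters are evaluated for any given example.
class MoELayer(nn.Module):
    def __init__(self, dim=32, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: (batch, dim)
        weights = self.gate(x).softmax(dim=-1)   # routing probabilities
        best = weights.argmax(dim=-1)            # top-1 expert per input
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = best == i
            if mask.any():
                out[mask] = expert(x[mask]) * weights[mask, i].unsqueeze(-1)
        return out

out = MoELayer()(torch.randn(8, 32))
```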
From a scientific perspective, the concept of architectural obsolescence is misguided. Neural networks encode assumptions about the structure of data. When those assumptions align with the problem domain, performance and efficiency follow. When they do not, even the most advanced model will struggle. Progress in artificial intelligence is therefore best understood as specialization and integration, not replacement. Classical networks, modern architectures, and contemporary paradigms such as LLMs, MoEs, VLMs, SLMs, MLMs, and SAMs each occupy a distinct and essential niche. Modern AI systems succeed not because one architecture dominates, but because multiple architectures are combined intelligently.
LLM – Large Language Model
What it is: A neural network, usually Transformer-based, trained on massive text corpora.
What it does: Generates and understands text, answers questions, translates languages, summarizes, and can even reason over long documents.
Relation to classical models: Uses MLPs in feed-forward layers and attention mechanisms. Transformers are built on classical building blocks.
LCM – Latent Concept Model
What it is: A model designed to extract and represent hidden (latent) concepts from data, often in an interpretable or structured latent space.
What it does: Learns abstract features or concepts that may not be directly observable, useful in tasks like recommendation systems or multimodal reasoning.
Relation to classical models: Often uses MLPs, autoencoders, or CNNs to extract latent features.
LAM – Language-Aware Model
What it is: A model that incorporates language understanding into other tasks, e.g., reasoning over text plus another modality.
What it does: Enhances models in vision, graphs, or multimodal tasks by integrating textual context.
Relation to classical models: Combines MLPs, CNNs, or Transformers, depending on the primary input type.
MoE – Mixture-of-Experts
What it is: A modular architecture with multiple “expert” subnetworks, where only a subset is active for each input.
What it does: Improves scalability and efficiency, allowing very large models without fully computing every parameter for each input.
Relation to classical models: Each “expert” can be an MLP, CNN, or Transformer block; MoE is a system-level design, not a new primitive.
VLM – Vision-Language Model
What it is: A model that combines visual input (images/video) with textual input.
What it does: Can answer questions about images, generate captions, or reason across modalities.
Relation to classical models: Uses CNNs or Vision Transformers for images, MLPs in classification heads, and attention layers for multimodal reasoning.
SLM – Sequence Learning Model
What it is: A model designed specifically for temporal or sequential data (time series, language, sensor streams).
What it does: Learns patterns over sequences for prediction, anomaly detection, or control tasks.
Relation to classical models: May use RNNs, LSTMs, GRUs, or Transformers depending on scale and latency requirements.
MLM – Masked Language Model
What it is: A model trained to predict missing tokens in text, e.g., BERT.
What it does: Learns contextual embeddings and deep language understanding.
Relation to classical models: Transformer-based, with MLPs in feed-forward layers for token prediction.
SAM – Segment Anything Model
What it is: A universal image segmentation model that can identify objects in images with minimal guidance.
What it does: Can segment any object in an image and supports zero-shot segmentation for unseen classes.
Relation to classical models: The original model pairs a Vision Transformer image encoder with a Transformer-based mask decoder; efficiency-oriented variants substitute CNN or hybrid backbones for feature extraction.
