A Scientific and Systems-Level Perspective on Modern Artificial Intelligence
Artificial intelligence evolves at a remarkable pace, and each new breakthrough tends to generate the same recurring narrative: that whatever came before is now obsolete. In recent years, the rise of Transformer-based architectures, Large Language Models, and multimodal foundation models has led many to question the relevance of classical neural networks such as Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs). This perception, however, overlooks the nuanced way in which progress in machine learning actually occurs.
Classical architectures have not been replaced; rather, they have become specialized, embedded, and indispensable components of modern AI systems. Understanding their continuing relevance requires moving beyond benchmark hype and examining neural networks from scientific, architectural, and systems-engineering perspectives. The question is therefore not whether CNNs or MLPs are obsolete, but whether they are being evaluated in the correct context.
Multi-Layer Perceptrons (MLPs) represent the most basic form of neural computation. Mathematically, an MLP is a sequence of affine transformations followed by nonlinear activation functions, capable of approximating arbitrary continuous functions under mild conditions. This theoretical property, formalized by the Universal Approximation Theorem, establishes MLPs as powerful function approximators. However, expressive power alone does not guarantee efficiency or generalization. MLPs make almost no assumptions about the structure of the data, which is both their greatest strength and their primary limitation.
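To make this concrete, here is a minimal sketch of such a network, assuming PyTorch; the layer widths and the ten-class output are illustrative placeholders rather than values taken from any particular system.

```python
import torch
import torch.nn as nn

# A minimal MLP: alternating affine transformations (nn.Linear) and
# nonlinear activations, ending in a task-specific output layer.
class MLP(nn.Module):
    def __init__(self, in_dim=32, hidden_dim=64, out_dim=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),   # affine transformation
            nn.ReLU(),                       # nonlinearity
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),  # e.g. class logits
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
x = torch.randn(8, 32)           # batch of 8 feature vectors
logits = model(x)                # shape: (8, 10)
```

Each nn.Linear is an affine map, and the interleaved activations supply the nonlinearity that the Universal Approximation Theorem relies on.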
In domains where data is structured, low-dimensional, and already encoded in meaningful features, such as finance, healthcare records, industrial telemetry, or business analytics, this lack of inductive bias becomes an advantage. MLPs often outperform more complex architectures in these settings because they introduce less variance, require fewer samples to generalize, and are computationally efficient. For this reason, MLPs remain widely used in production systems, particularly for tabular data. Far from disappearing, they are also deeply embedded within modern architectures. Every Transformer block contains large feed-forward networks that are, in essence, MLPs. Classification heads, regression layers, reinforcement learning policies, and Mixture-of-Experts (MoE) components all rely on MLPs as fundamental building blocks. Declaring MLPs obsolete would therefore imply declaring most modern AI architectures obsolete as well.
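The claim that every Transformer block contains an MLP can be seen directly in code. The sketch below, again assuming PyTorch, shows a position-wise feed-forward sub-layer with illustrative dimensions (d_model=512, d_ff=2048); structurally it is just a two-layer MLP applied independently at every token position.

```python
import torch
import torch.nn as nn

# The position-wise feed-forward sub-layer of a Transformer block is,
# structurally, a two-layer MLP applied at every token position.
class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):                # x: (batch, seq_len, d_model)
        return self.ff(x)

tokens = torch.randn(2, 16, 512)         # 2 sequences of 16 token embeddings
out = FeedForward()(tokens)              # same shape as the input
```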
Convolutional Neural Networks (CNNs) were introduced to address a different limitation of early neural models: the inability to exploit spatial structure. Images, videos, and other grid-like data exhibit strong local correlations and translational regularities. CNNs encode these assumptions directly through local receptive fields, weight sharing, and hierarchical feature extraction. This architectural bias allows CNNs to learn visual representations efficiently, using far fewer parameters than fully connected networks. From a biological and computational perspective, CNNs are remarkably well aligned with human visual processing. Early layers detect edges and simple patterns, intermediate layers capture textures and shapes, and deeper layers represent object-level semantics. This hierarchical organization enables CNNs to generalize well even when training data is limited. As a result, CNNs remain the backbone of countless real-world systems, including autonomous driving perception stacks, medical imaging diagnostics, satellite image analysis, industrial inspection pipelines, and mobile vision applications.
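A small convolutional stack, sketched below under the assumption of PyTorch and 32x32 RGB inputs, makes these biases explicit: each 3x3 kernel has a local receptive field, its weights are shared across all spatial positions, and stacking convolution and pooling layers produces the hierarchy described above.

```python
import torch
import torch.nn as nn

# A small CNN: local receptive fields (3x3 kernels), weight sharing across
# positions, and a hierarchy from edges to object-level features.
class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # early layer: edges
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # mid layer: textures, shapes
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.head = nn.Linear(32 * 8 * 8, num_classes)    # MLP classification head

    def forward(self, x):                  # x: (batch, 3, 32, 32)
        h = self.features(x)
        return self.head(h.flatten(1))

logits = SmallCNN()(torch.randn(4, 3, 32, 32))   # shape: (4, 10)
```

Note that the classification head is itself an MLP, which again illustrates how the classical building blocks nest inside one another.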
The emergence of Vision Transformers has led to renewed debate about the future of CNNs. Vision Transformers replace convolutional inductive bias with global self-attention, allowing every image patch to interact with every other patch. While this approach achieves impressive results on massive datasets, it comes at the cost of increased computational complexity and reduced data efficiency. In practice, CNNs often outperform Vision Transformers in small- and medium-scale regimes, especially when latency, energy consumption, and robustness matter. Many state-of-the-art vision systems now adopt hybrid designs, using CNNs for efficient feature extraction and Transformers for high-level reasoning. CNNs are not obsolete; they are efficient specialists integrated into larger systems.
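One way such a hybrid can be wired together is sketched below, assuming PyTorch: a convolutional stem downsamples the image into a grid of feature vectors, which a standard Transformer encoder then treats as tokens for global reasoning. This is an illustrative composition, not the design of any specific published model.

```python
import torch
import torch.nn as nn

# Hybrid design: CNN stem for efficient local feature extraction,
# Transformer encoder for global reasoning over the resulting tokens.
class HybridBackbone(nn.Module):
    def __init__(self, d_model=128):
        super().__init__()
        self.stem = nn.Sequential(                                   # 224x224 -> 14x14 grid
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),     # 112x112
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),    # 56x56
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),  # 28x28
            nn.Conv2d(d_model, d_model, 3, stride=2, padding=1),     # 14x14
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):                          # x: (batch, 3, 224, 224)
        h = self.stem(x)                           # (batch, d_model, 14, 14)
        tokens = h.flatten(2).transpose(1, 2)      # (batch, 196, d_model)
        return self.encoder(tokens)                # globally contextualized tokens

out = HybridBackbone()(torch.randn(1, 3, 224, 224))   # (1, 196, 128)
```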
Sequential data introduces yet another structural challenge. Language, speech, sensor streams, and time series depend on temporal order. Recurrent Neural Networks (RNNs) were designed to model this dependency by maintaining a hidden state that evolves over time. Vanilla RNNs, while conceptually elegant, suffer from optimization difficulties that limit their ability to capture long-range dependencies. Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) addressed these issues by introducing gating mechanisms that regulate information flow, enabling stable learning over long sequences. Although Transformers have largely replaced recurrent models in large-scale natural language processing, LSTMs and GRUs remain highly relevant in domains where streaming inference, low latency, or limited data are critical. In many real-time systems, such as speech recognition on edge devices or industrial time-series monitoring, recurrent models offer a better trade-off between performance and efficiency than attention-based architectures.
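The streaming advantage is easy to see in code. In the hedged sketch below, assuming PyTorch, a GRU cell carries a fixed-size hidden state from one time step to the next, so each new observation is processed in constant time without revisiting the full history; the anomaly-score head is a made-up example of a downstream task.

```python
import torch
import torch.nn as nn

# Streaming inference with a GRU: the hidden state summarizes the past,
# so each incoming sample is processed in constant time per step.
gru = nn.GRUCell(input_size=8, hidden_size=32)
head = nn.Linear(32, 1)                       # e.g. an anomaly score per step

h = torch.zeros(1, 32)                        # initial hidden state
for _ in range(100):                          # stand-in for a live sensor stream
    x_t = torch.randn(1, 8)                   # one new observation
    h = gru(x_t, h)                           # gated update of the hidden state
    score = head(h)                           # prediction for this time step
```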
Transformers themselves represent a powerful but often misunderstood paradigm. By relying on self-attention, Transformers can model global dependencies without recurrence, enabling massive parallelization and scalability. This makes them ideal for large datasets and general-purpose modeling, which is why they form the foundation of modern language and multimodal models. Beyond Transformers, the modern AI landscape has grown to include several specialized architectures and training paradigms. Large Language Models (LLMs) are Transformer-based architectures trained on massive text corpora, often extended to multimodal capabilities. Mixture-of-Experts (MoE) architectures distribute computation across specialized subnetworks, improving scalability while retaining high performance. Vision-Language Models (VLMs) integrate visual and textual inputs, combining CNNs or Vision Transformers with MLPs and attention layers to achieve multimodal reasoning. Sequence Learning Models (SLMs) continue to process temporal data efficiently, building on recurrent or Transformer-based foundations, while Masked Language Models (MLMs) employ objectives in which certain tokens are hidden and predicted, as exemplified by BERT. Segment Anything Models (SAMs) demonstrate universal image segmentation capabilities using Transformer-based image encoders, while lighter variants pair CNN or hybrid backbones with the same prompt-driven design. Even less standardized terms, such as Latent Concept Models (LCMs) or Language-Aware Models (LAMs), highlight ongoing efforts to adapt classical components to novel data representations and task-specific biases.
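The mechanism shared by all of these Transformer-based models is compact enough to show directly. The following sketch, assuming PyTorch, implements single-head scaled dot-product self-attention, in which every position attends to every other position; production models add multiple heads, masking, and residual connections around it.

```python
import math
import torch
import torch.nn as nn

# Single-head scaled dot-product self-attention:
# every position in the sequence attends to every other position.
class SelfAttention(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        weights = scores.softmax(dim=-1)           # global pairwise interactions
        return weights @ v

out = SelfAttention()(torch.randn(2, 10, 64))      # (2, 10, 64)
```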
Beyond grids and sequences lies relational data, which is best modeled as graphs. Graph Neural Networks (GNNs) introduce a message-passing framework that allows node representations to evolve based on local neighborhood structure. This inductive bias is irreplaceable for tasks involving social networks, molecular structures, recommendation systems, and knowledge graphs. No amount of attention or convolution can fully substitute for architectures explicitly designed to operate on graphs. GNNs are therefore not alternatives to CNNs or Transformers, but complementary tools tailored to relational reasoning.
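A minimal message-passing layer, sketched below in plain PyTorch without any graph library, illustrates the idea: each node aggregates transformed features from its neighbors, as defined by an adjacency matrix, and combines them with its own representation. The three-node graph is a toy example.

```python
import torch
import torch.nn as nn

# One round of message passing: each node aggregates messages from its
# neighbors (given by the adjacency matrix) and updates its representation.
class MessagePassingLayer(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.message = nn.Linear(dim, dim)       # transform neighbor features
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, h, adj):                   # h: (num_nodes, dim), adj: (n, n)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = (adj @ self.message(h)) / deg      # mean over neighbors
        return torch.relu(self.update(torch.cat([h, agg], dim=-1)))

adj = torch.tensor([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])  # 3-node toy graph
h = torch.randn(3, 16)
h = MessagePassingLayer()(h, adj)                # updated node embeddings
```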
Generative modeling further illustrates the cumulative nature of AI progress. Autoencoders (AEs) and Variational Autoencoders (VAEs) introduced principled approaches to representation learning and probabilistic modeling. While newer techniques such as diffusion models have achieved superior generative quality, they rely heavily on classical components. Most diffusion models use convolutional U-Net architectures augmented with attention layers, reinforcing the idea that innovation builds on existing foundations rather than discarding them.
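As a reminder of how classical these components look in practice, here is a hedged sketch of a small variational autoencoder, assuming PyTorch and a flattened 28x28 input: an MLP encoder and decoder, with the reparameterization trick providing a differentiable sampling step in between.

```python
import torch
import torch.nn as nn

# A small VAE: an MLP encoder maps inputs to a latent Gaussian, the
# reparameterization trick keeps sampling differentiable, and an MLP
# decoder reconstructs the input.
class VAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        return self.decoder(z), mu, logvar

recon, mu, logvar = VAE()(torch.randn(4, 784))
```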
At the systems level, modern AI increasingly relies on modularity. Mixture-of-Experts architectures exemplify this: a gating mechanism routes each input to a small subset of specialized experts, so model capacity can grow without a proportional increase in computation per input. These experts are not novel primitives but combinations of Transformers, CNNs, and MLPs. Even cutting-edge Large Language Models depend on classical neural components at their core.
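The routing idea can be sketched as follows, assuming PyTorch and MLP experts; top-1 routing is used here purely for simplicity, and the dimensions are placeholders.

```python
import torch
import torch.nn as nn

# Mixture-of-Experts with top-1 routing: a gating network picks one expert
# (here a small MLP) per input, so only a fraction of the model's
# parameters are evaluated for any given example.
class MoELayer(nn.Module):
    def __init__(self, dim=32, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: (batch, dim)
        weights = self.gate(x).softmax(dim=-1)   # routing probabilities
        best = weights.argmax(dim=-1)            # top-1 expert per input
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = best == i
            if mask.any():
                out[mask] = expert(x[mask]) * weights[mask, i].unsqueeze(-1)
        return out

out = MoELayer()(torch.randn(8, 32))
```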
From a scientific perspective, the concept of architectural obsolescence is misguided. Neural networks encode assumptions about the structure of data. When those assumptions align with the problem domain, performance and efficiency follow. When they do not, even the most advanced model will struggle. Progress in artificial intelligence is therefore best understood as specialization and integration, not replacement. Classical networks, modern architectures, and contemporary paradigms such as LLMs, MoEs, VLMs, SLMs, MLMs, and SAMs each occupy a distinct and essential niche. Modern AI systems succeed not because one architecture dominates, but because multiple architectures are combined intelligently.
LLM – Large Language Model
What it is: A neural network, usually Transformer-based, trained on massive text corpora.
What it does: Generates and understands text, answers questions, translates languages, summarizes, and can even reason over long documents.
Relation to classical models: Uses MLPs in feed-forward layers and attention mechanisms. Transformers are built on classical building blocks.
LCM – Latent Concept Model
What it is: A model designed to extract and represent hidden (latent) concepts from data, often in an interpretable or structured latent space.
What it does: Learns abstract features or concepts that may not be directly observable, useful in tasks like recommendation systems or multimodal reasoning.
Relation to classical models: Often uses MLPs, autoencoders, or CNNs to extract latent features.
LAM – Language-Aware Model
What it is: A model that incorporates language understanding into other tasks, e.g., reasoning over text plus another modality.
What it does: Enhances models in vision, graphs, or multimodal tasks by integrating textual context.
Relation to classical models: Combines MLPs, CNNs, or Transformers, depending on the primary input type.
MoE – Mixture-of-Experts
What it is: A modular architecture with multiple “expert” subnetworks, where only a subset is active for each input.
What it does: Improves scalability and efficiency, allowing very large models without fully computing every parameter for each input.
Relation to classical models: Each “expert” can be an MLP, CNN, or Transformer block; MoE is a system-level design, not a new primitive.
VLM – Vision-Language Model
What it is: A model that combines visual input (images/video) with textual input.
What it does: Can answer questions about images, generate captions, or reason across modalities.
Relation to classical models: Uses CNNs or Vision Transformers for images, MLPs in classification heads, and attention layers for multimodal reasoning.
SLM – Sequence Learning Model
What it is: A model designed specifically for temporal or sequential data (time series, language, sensor streams).
What it does: Learns patterns over sequences for prediction, anomaly detection, or control tasks.
Relation to classical models: May use RNNs, LSTMs, GRUs, or Transformers depending on scale and latency requirements.
MLM – Masked Language Model
What it is: A model trained to predict missing tokens in text, e.g., BERT.
What it does: Learns contextual embeddings and deep language understanding.
Relation to classical models: Transformer-based, with MLPs in feed-forward layers for token prediction.
SAM – Segment Anything Model
What it is: A universal image segmentation model that can identify objects in images with minimal guidance.
What it does: Can segment any object in an image and supports zero-shot segmentation for unseen classes.
Relation to classical models: The original model pairs a Vision Transformer image encoder with a Transformer-based mask decoder; efficiency-oriented variants substitute CNN or hybrid backbones for feature extraction.
