<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Tips &amp; Tricks Archives - Modular Technology Group</title>
	<atom:link href="https://modtechgroup.com/category/tips-tricks/feed/" rel="self" type="application/rss+xml" />
	<link>https://modtechgroup.com/category/tips-tricks/</link>
	<description></description>
	<lastBuildDate>Wed, 22 Apr 2026 17:50:18 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>The Model That Barely Slows Down: Gemma 4 26B vs Qwen 3.6 35B at Long Context</title>
		<link>https://modtechgroup.com/the-model-that-barely-slows-down-gemma-4-26b-vs-qwen-3-6-35b-at-long-context/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-model-that-barely-slows-down-gemma-4-26b-vs-qwen-3-6-35b-at-long-context</link>
		
		<dc:creator><![CDATA[Arthur]]></dc:creator>
		<pubDate>Wed, 22 Apr 2026 16:40:03 +0000</pubDate>
				<category><![CDATA[AI Workspaces]]></category>
		<category><![CDATA[Tips & Tricks]]></category>
		<category><![CDATA[#AIOps]]></category>
		<category><![CDATA[AIInfrastructure]]></category>
		<category><![CDATA[LocalLLM]]></category>
		<category><![CDATA[privateAI]]></category>
		<category><![CDATA[selfHostedLLM]]></category>
		<guid isPermaLink="false">https://modtechgroup.com/?p=5743</guid>

					<description><![CDATA[<p>We ran Gemma 4 26B and Qwen 3.6 35B-A3B head-to-head on the same server, same quantization, same protocol. Gemma 4 is 3.7× faster at 32k context — and 7.2× faster at 128k. The gap widens with context, and the reason reveals something important about model selection for long-context workloads.</p>
<p>The post <a href="https://modtechgroup.com/the-model-that-barely-slows-down-gemma-4-26b-vs-qwen-3-6-35b-at-long-context/">The Model That Barely Slows Down: Gemma 4 26B vs Qwen 3.6 35B at Long Context</a> appeared first on <a href="https://modtechgroup.com">Modular Technology Group</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-1 fusion-flex-container nonhundred-percent-fullwidth non-hundred-percent-height-scrolling" style="--awb-border-radius-top-left:0px;--awb-border-radius-top-right:0px;--awb-border-radius-bottom-right:0px;--awb-border-radius-bottom-left:0px;--awb-padding-top:40px;--awb-padding-bottom:40px;--awb-flex-wrap:wrap;" ><div class="fusion-builder-row fusion-row fusion-flex-align-items-flex-start fusion-flex-content-wrap" style="max-width:1310.4px;margin-left: calc(-4% / 2 );margin-right: calc(-4% / 2 );"><div class="fusion-layout-column fusion_builder_column fusion-builder-column-0 fusion_builder_column_1_1 1_1 fusion-flex-column" style="--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:20px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;"><div class="fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column"><div class="fusion-text fusion-text-1"><h2>The Model That Barely Slows Down: Gemma 4 26B vs Qwen 3.6 35B at Long Context</h2>
<p><em>Modular Technology Group · April 22, 2026</em></p>
<hr />
<p>I&#8217;ve been thinking a lot about what it means to deploy a model in production versus benchmark it in a controlled setting. Most benchmarks pick short prompts — 1k, 2k tokens — and declare a winner. That&#8217;s fine for answering quick questions. It&#8217;s irrelevant if you&#8217;re building anything real: document analysis, long-thread summarization, multi-turn reasoning agents, whole-repo code review.</p>
<p>So we don&#8217;t benchmark that way.</p>
<p>Two days ago we published numbers for Qwen 3.6 35B-A3B across 32k, 64k, and 128k contexts on our dedicated AI server. Today we ran the same protocol against Google&#8217;s brand-new <strong>Gemma 4 26B</strong> — same hardware, same quantization, same prompts, same three-context sweep.</p>
<p>The headline: <strong>Gemma 4 26B is 3.7× faster than Qwen 3.6 35B-A3B at 32k context. At 128k, it&#8217;s 7.2× faster.</strong></p>
<p>And unlike Qwen 3.6 — which we watched degrade from 26 tokens/sec at 32k to 9 tokens/sec at 128k — Gemma 4 barely moves. It went from 96 to 87 to 65 tokens/sec. The curve is nearly flat. That changes the infrastructure calculus entirely.</p>
<hr />
<h2>The Setup</h2>
<p>Same hardware as Monday&#8217;s Qwen bench. Same server. Same protocol.</p>
<table class="wp-block-table is-style-stripes">
<thead>
<tr>
<th>Platform</th>
<th>Hardware</th>
<th>Engine</th>
<th>Quantization</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>&#8220;Reach&#8221;</strong> — dedicated AI server</td>
<td>2× NVIDIA RTX 4070 Ti, 24 GB VRAM total</td>
<td>Ollama (llama.cpp/GGUF)</td>
<td>Q4_K_M</td>
</tr>
</tbody>
</table>
<p>Models under test:</p>
<ul>
<li><strong>Gemma 4 26B</strong> (Google, MoE A4B — ~4B active parameters per token, 26B total)</li>
<li><strong>Qwen 3.6 35B-A3B</strong> (Alibaba, MoE A3B — ~3B active parameters per token, 36B total)</li>
</ul>
<p>Protocol matches <a href="https://modtechgroup.com/same-ai-model-two-hardware-tiers-and-why-context-length-is-the-hidden-variable/">our April 20 baseline bench</a>:</p>
<ul>
<li>Context windows: <strong>32k, 64k, 128k tokens</strong></li>
<li>Prompt: synthetic filler at <strong>85% of target context budget</strong> — same bytes sent to both models</li>
<li>Completion: <strong>256 tokens</strong>, temperature 0.1</li>
<li>Trials: <strong>3 measured per cell</strong> (+ 1 warm-up discarded per model×context)</li>
<li>Model unloaded between runs — no contamination from the other model&#8217;s KV cache</li>
<li>Explicit <code>num_ctx</code> override on every Ollama request (Ollama silently caps at 4,096 without it — we learned this the hard way and documented it)</li>
</ul>
<p>18 measured runs (plus 6 discarded warm-ups). 0 failures. Variance: under 1% across all trials.</p>
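<p>For readers who want to reproduce this, the shape of one measured trial looks roughly like the sketch below. The endpoint, request options, and response fields (<code>eval_count</code>, <code>eval_duration</code>) are Ollama&#8217;s documented <code>/api/generate</code> API; the function itself is a simplified stand-in for our harness, not the harness itself.</p>
<pre><code>import time
import requests

OLLAMA = "http://localhost:11434/api/generate"

def run_trial(model, prompt, num_ctx):
    """One measured trial: fixed 256-token completion, explicit num_ctx."""
    t0 = time.time()
    r = requests.post(OLLAMA, json={
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_ctx": num_ctx,     # explicit override -- Ollama silently caps at 4,096 without it
            "temperature": 0.1,
            "num_predict": 256,     # fixed completion length
        },
    }, timeout=3600)
    r.raise_for_status()
    d = r.json()
    wall = time.time() - t0
    tok_s = d["eval_count"] / (d["eval_duration"] / 1e9)  # eval_duration is nanoseconds
    print(f"{model} @ {num_ctx}: {tok_s:.1f} tok/s, {wall:.1f}s wall")
    return tok_s
</code></pre>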
<hr />
<h2>The Numbers</h2>
<table class="wp-block-table is-style-stripes">
<thead>
<tr>
<th>Context</th>
<th>Gemma 4 26B</th>
<th>Qwen 3.6 35B-A3B</th>
<th>Gemma advantage</th>
</tr>
</thead>
<tbody>
<tr>
<td>32k</td>
<td><strong>96.4 tok/s</strong> · 3.5s wall</td>
<td>26.3 tok/s · 11.5s wall</td>
<td><strong>3.7×</strong></td>
</tr>
<tr>
<td>64k</td>
<td><strong>86.7 tok/s</strong> · 4.1s wall</td>
<td>19.3 tok/s · 15.7s wall</td>
<td><strong>4.5×</strong></td>
</tr>
<tr>
<td>128k</td>
<td><strong>65.2 tok/s</strong> · 5.9s wall</td>
<td>9.1 tok/s · 31.4s wall</td>
<td><strong>7.2×</strong></td>
</tr>
</tbody>
</table>
<p>Let me frame the wall-clock numbers concretely. A 256-token response — roughly one dense paragraph — takes:</p>
<ul>
<li>Gemma 4 at <strong>any</strong> context: under 6 seconds</li>
<li>Qwen 3.6 at 32k: 11.5 seconds</li>
<li>Qwen 3.6 at 64k: 15.7 seconds</li>
<li>Qwen 3.6 at 128k: <strong>31.4 seconds</strong></li>
</ul>
<p>That&#8217;s the difference between a tool you hold a conversation with and one you fire off while you pour another cup of coffee.</p>
<p><img decoding="async" class="aligncenter size-full" src="https://modtechgroup.com/wp-content/uploads/2026/04/chart-1-gen-throughput-1.png" alt="Generation throughput by context — Gemma 4 vs Qwen 3.6" /></p>
<hr />
<h2>The Surprise Finding: The Architecture Gap Widens With Context</h2>
<p>Here&#8217;s where I want to spend more time, because this isn&#8217;t just a &#8220;new model is faster&#8221; story.</p>
<p>Both models are Mixture-of-Experts. Both use Q4_K_M quantization. Both run on the same two GPUs. At 32k context, the gap is already 3.7×. By 128k, it&#8217;s 7.2×. The gap nearly doubles as the context grows.</p>
<p>Why?</p>
<p><img decoding="async" class="aligncenter size-full" src="https://modtechgroup.com/wp-content/uploads/2026/04/chart-3-degradation-1.png" alt="Throughput degradation curve — how each model handles growing context" /></p>
<p><strong>Qwen 3.6 35B-A3B:</strong></p>
<ul>
<li>36B total parameters, ~3B active per token</li>
<li>At 128k context, generation drops to 9.1 tok/s</li>
<li>Degradation from 32k to 128k: <strong>-65%</strong></li>
</ul>
<p><strong>Gemma 4 26B:</strong></p>
<ul>
<li>26B total parameters, ~4B active per token</li>
<li>At 128k context, generation holds at 65.2 tok/s</li>
<li>Degradation from 32k to 128k: <strong>-32%</strong></li>
</ul>
<p>The KV cache grows linearly with context. At 128k, both models are operating under the same VRAM pressure we documented Monday — memory bandwidth is the bottleneck, not compute. The GPUs are reading enormous amounts of data per generated token.</p>
<p>The difference is the underlying architecture. Gemma 4&#8217;s A4B configuration activates more parameters per token than Qwen 3.6&#8217;s A3B, which would normally suggest higher compute overhead. But the total parameter count is smaller (26B vs 36B), meaning the weight tensors being loaded from VRAM on each generation step are physically smaller. Less data to move per token. Less memory bandwidth consumed per token. The gap widens with context precisely because the bandwidth-bound regime amplifies parameter-count differences.</p>
<p>In short: <strong>at long context, smaller total parameter count beats higher active parameter count</strong> when you&#8217;re memory-bandwidth constrained.</p>
<p>This is the kind of finding that doesn&#8217;t show up in a 2k-token benchmark.</p>
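<p>To make &#8220;the KV cache grows linearly&#8221; concrete, here&#8217;s a back-of-envelope sizing sketch. The layer and head counts below are illustrative placeholders, not either model&#8217;s published configuration; the linear scaling with context is the point:</p>
<pre><code>def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=4, head_dim=128, bytes_per_elem=2):
    """K and V, fp16, per token per layer. All architecture values illustrative."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

for ctx in (32_768, 65_536, 131_072):
    print(f"{ctx:7} tokens: ~{kv_cache_bytes(ctx) / 2**30:.0f} GiB of KV cache")
# 2 GiB at 32k doubles to 4 GiB at 64k and 8 GiB at 128k -- every generated
# token has to stream that cache (plus the weights) across the memory bus.
</code></pre>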
<p><img decoding="async" class="aligncenter size-full" src="https://modtechgroup.com/wp-content/uploads/2026/04/chart-5-speedup.png" alt="Gemma 4 speedup multiplier grows with context" /></p>
<hr />
<h2>What This Means for Infrastructure Selection</h2>
<p>The previous bench taught us that hardware tier matters: dual mid-range GPUs on a dedicated server outperformed an M4 Max laptop by 5.3× at 128k. This bench teaches something different — that <strong>model architecture matters just as much as hardware</strong> for long-context workloads.</p>
<p>A few things I&#8217;m taking away from this:</p>
<p><strong>Context length changes the whole model-selection calculus.</strong> Qwen 3.6 35B-A3B is an excellent model. For reasoning tasks at moderate contexts, it&#8217;s still compelling. But if your workload involves 64k+ prompts — and an increasing number of real workloads do — the throughput differential is severe enough to matter operationally. A 7.2× speed penalty at 128k context isn&#8217;t a marginal difference; it&#8217;s a different class of tool.</p>
<p><strong>Model architecture is an infrastructure decision, not just a capability decision.</strong> When selecting a model for a production deployment, we now explicitly consider the active-parameter count, total parameter count, and their ratio alongside benchmark capability scores. Two MoE models with similar benchmark performance can behave completely differently under sustained long-context load.</p>
<p><strong>The bandwidth-bottleneck pattern generalizes.</strong> We saw on Monday that at 128k context, the GPUs were running at 6&#8211;7% compute utilization with VRAM saturated at 91%. The compute was idle. The memory bus was the choke. Gemma 4 takes advantage of this constraint by keeping its weight tensors smaller — it&#8217;s effectively doing less memory I/O per token, which is exactly what you want when the memory bus is your ceiling.</p>
<p><strong>Smaller isn&#8217;t always slower.</strong> The conventional wisdom is that a 36B model is &#8220;better&#8221; than a 26B model — more parameters, more capacity. For generation throughput under memory-bandwidth constraints, the relationship inverts. Whether Gemma 4 produces better <em>output quality</em> than Qwen 3.6 for a given task is a separate question — one worth benchmarking rigorously — but on pure throughput at long context, the smaller model wins decisively.</p>
<hr />
<h2>An Honest Note on Prompt Eval Telemetry</h2>
<p>In our Qwen benchmark, Ollama reported prompt ingestion speeds of 20k–44k tokens/sec across the three context sizes — a useful data point for pipeline latency estimation.</p>
<p>For Gemma 4, Ollama&#8217;s <code>prompt_eval_duration</code> consistently reported 13&#8211;19ms across all three context windows, implying millions of tokens/sec. This is a KV-cache reuse artifact: the warm-up trial primes the cache, and subsequent trials appear to skip most or all of the ingestion phase. We&#8217;re reporting this honestly rather than publishing the inflated numbers. The wall-clock timing captures the full end-to-end latency accurately; the <code>prompt_eval</code> field in Ollama&#8217;s response for Gemma 4 requires more investigation before we&#8217;d cite it confidently.</p>
<p>What we can say: if Gemma 4 is achieving genuine KV-cache reuse across sequential requests with the same prompt prefix, that&#8217;s actually a meaningful throughput advantage for multi-turn workloads. We&#8217;ll dig into this in a follow-up run with cold-cache isolation.</p>
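<p>For anyone replicating this, the artifact is easy to flag from Ollama&#8217;s own response fields. A minimal sanity check; the plausibility threshold is our heuristic, not anything Ollama defines:</p>
<pre><code>def looks_like_cache_reuse(resp, max_plausible_tok_s=100_000):
    """Flag a trial whose reported ingestion rate is implausibly high.

    resp is the parsed JSON from /api/generate; prompt_eval_count and
    prompt_eval_duration (nanoseconds) are Ollama's documented fields.
    """
    n = resp.get("prompt_eval_count", 0)
    dur_s = resp.get("prompt_eval_duration", 0) / 1e9
    return dur_s == 0 or n / dur_s &gt; max_plausible_tok_s
</code></pre>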
<hr />
<h2>What&#8217;s Next</h2>
<p>Two tests I want to run before I&#8217;m satisfied this benchmark is complete:</p>
<ol>
<li><strong>Cold-cache prompt eval isolation for Gemma 4</strong> — force model reload between every trial to get a clean first-ingestion measurement</li>
<li><strong>Output quality comparison</strong> — throughput advantage is only relevant if the output quality holds up. We&#8217;ll run a structured evaluation comparing Gemma 4 and Qwen 3.6 on legal document analysis and long-form synthesis tasks — the actual workloads our clients care about</li>
</ol>
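<p>The cold-cache step in test 1 is simple to script: Ollama documents that a <code>generate</code> call with <code>keep_alive: 0</code> unloads the model, so forcing a reload between trials looks like this sketch:</p>
<pre><code>import requests

def force_unload(model, endpoint="http://localhost:11434/api/generate"):
    """Unload the model so the next trial starts with a cold KV cache."""
    requests.post(endpoint, json={"model": model, "keep_alive": 0}, timeout=600)

force_unload("gemma4:26b")  # then re-run the trial for a true first-ingestion number
</code></pre>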
<p>The throughput finding is real and significant. Whether Gemma 4 earns its place in the production stack depends on the quality side of the equation.</p>
<hr />
<h2>The Infrastructure View</h2>
<p>For organizations evaluating private AI: the model landscape is moving fast, and the performance characteristics of new models don&#8217;t always fit the pattern of what came before. A model selection decision from six months ago might be suboptimal today — not because the old model got worse, but because the new options are sufficiently different architecturally.</p>
<p>This is part of why we run these benchmarks with our own hardware and real workloads rather than relying on published leaderboard numbers. Leaderboards optimize for benchmark performance. We care about throughput under the memory constraints of actual production hardware, at the context lengths real workloads require.</p>
<p>Modular doesn&#8217;t resell AI. We build, host, and run the infrastructure ourselves — which means we&#8217;re measuring what actually matters to us operationally. These numbers are real because they have to be.</p>
<p>If you&#8217;re working through a private AI infrastructure decision and want to compare notes, I&#8217;m always open to the conversation.</p>
<hr />
<h2>Appendix: Methodology &amp; Caveats</h2>
<p><strong>Models:</strong></p>
<ul>
<li>Gemma 4 26B (Google DeepMind) — Ollama tag <code>gemma4:26b</code>, GGUF Q4_K_M, 17.9 GB, digest <code>5571076f3d70</code></li>
<li>Qwen 3.6 35B-A3B (Alibaba) — Ollama tag <code>qwen3.6:35b-a3b</code>, GGUF Q4_K_M, 23.9 GB, digest <code>07d35212591f</code></li>
</ul>
<p><strong>Hardware:</strong> &#8220;Reach&#8221; dedicated AI server, 2× NVIDIA RTX 4070 Ti (12 GB VRAM each, 24 GB total), Ubuntu 24.04, Ollama v0.11.10</p>
<p><strong>Prompt construction:</strong> Same filler text (140-char repeating unit), calibrated to 85% of target token budget. Tokenizer calibration ran 2026-04-22: Gemma 4 measures 6.76 chars/token, Qwen 3.6 measures 6.81 chars/token — within 1%. Same prompt bytes sent to both models; both reported nearly identical <code>prompt_eval_count</code> (26,639 vs 26,632 at 32k; 53,259 vs 53,252 at 64k; 106,479 vs 106,472 at 128k), confirming tokenizer parity.</p>
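<p>The sizing logic is a few lines. A sketch, with a stand-in for the actual filler unit (the real scripts live in the archive below):</p>
<pre><code>CHARS_PER_TOKEN = 6.76  # Gemma 4 calibration, 2026-04-22 (Qwen 3.6: 6.81)

def build_prompt(target_ctx_tokens, question):
    """Fill 85% of the context window with filler, then append the question."""
    budget_chars = int(target_ctx_tokens * 0.85 * CHARS_PER_TOKEN)
    unit = "lorem ipsum " * 12  # stand-in for the real 140-char repeating unit
    filler = (unit * (budget_chars // len(unit) + 1))[:budget_chars]
    return filler + "\n\n" + question
</code></pre>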
<p><strong>Trial structure:</strong> 1 warm-up trial discarded per model×context cell (model remained loaded for context-cache consistency), then 3 measured trials. Each model was explicitly unloaded before switching to the other, using Ollama&#8217;s <code>keep_alive: 0</code> mechanism to prevent cross-contamination.</p>
<p><strong>Completion:</strong> 256 tokens, <code>temperature=0.1</code>.</p>
<p><strong>Ollama:</strong> Explicit <code>num_ctx</code> override per request. Default silently caps at 4,096 tokens.</p>
<p><strong>Variance:</strong> Under 1% across all trials for both models. Gemma 4: 96.4 / 96.6 / 96.4 at 32k; 65.2 / 65.3 / 65.2 at 128k. Rock solid.</p>
<p><strong>Caveats:</strong></p>
<ul>
<li>Gemma 4 prompt eval timing is not reported due to KV-cache reuse masking cold-start latency in Ollama. Wall-clock timing is accurate.</li>
<li>Both models use Q4_K_M GGUF — same quantization scheme, though the underlying weight distributions differ.</li>
<li>Tests executed on a dedicated server with no competing workloads.</li>
<li>Output quality comparison not included in this benchmark — throughput only.</li>
</ul>
<p><strong>Reproducibility:</strong> All scripts and raw data archived at <code>reports/bench-archive/2026-04-22-gemma4-vs-qwen36/</code>. Benchmark harness is argparse-driven and can be re-run against any Ollama endpoint.</p>
<hr />
<p><em>Cale Hollingsworth is the founder of Modular Technology Group, which builds and hosts private AI workspaces in a FedRAMP data center. He has been advising organizations on infrastructure strategy since 1993.</em></p>
<p><em>#PrivateAI #DataPrivacy #yourdatayourrules</em></p>
</div></div></div></div></div>
<p>The post <a href="https://modtechgroup.com/the-model-that-barely-slows-down-gemma-4-26b-vs-qwen-3-6-35b-at-long-context/">The Model That Barely Slows Down: Gemma 4 26B vs Qwen 3.6 35B at Long Context</a> appeared first on <a href="https://modtechgroup.com">Modular Technology Group</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Same AI Model, Two Hardware Tiers — And Why Context Length Is the Hidden Variable</title>
		<link>https://modtechgroup.com/same-ai-model-two-hardware-tiers-and-why-context-length-is-the-hidden-variable/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=same-ai-model-two-hardware-tiers-and-why-context-length-is-the-hidden-variable</link>
		
		<dc:creator><![CDATA[Arthur]]></dc:creator>
		<pubDate>Tue, 21 Apr 2026 01:59:58 +0000</pubDate>
				<category><![CDATA[AI Workspaces]]></category>
		<category><![CDATA[Privacy]]></category>
		<category><![CDATA[Tips & Tricks]]></category>
		<category><![CDATA[#AIOps]]></category>
		<category><![CDATA[dataSovereignty]]></category>
		<category><![CDATA[LocalLLM]]></category>
		<category><![CDATA[privateAI]]></category>
		<guid isPermaLink="false">https://modtechgroup.com/?p=5732</guid>

					<description><![CDATA[<p>We put Qwen 3.6 35B-A3B on a developer laptop and a dual-GPU server. The speed gap grows from 2.4× to 5.3× as context grows — and the real bottleneck turns out not to be compute.</p>
<p>The post <a href="https://modtechgroup.com/same-ai-model-two-hardware-tiers-and-why-context-length-is-the-hidden-variable/">Same AI Model, Two Hardware Tiers — And Why Context Length Is the Hidden Variable</a> appeared first on <a href="https://modtechgroup.com">Modular Technology Group</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-2 fusion-flex-container nonhundred-percent-fullwidth non-hundred-percent-height-scrolling" style="--awb-border-radius-top-left:0px;--awb-border-radius-top-right:0px;--awb-border-radius-bottom-right:0px;--awb-border-radius-bottom-left:0px;--awb-padding-top:40px;--awb-padding-bottom:40px;--awb-flex-wrap:wrap;" ><div class="fusion-builder-row fusion-row fusion-flex-align-items-flex-start fusion-flex-content-wrap" style="max-width:1310.4px;margin-left: calc(-4% / 2 );margin-right: calc(-4% / 2 );"><div class="fusion-layout-column fusion_builder_column fusion-builder-column-1 fusion_builder_column_1_1 1_1 fusion-flex-column" style="--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:20px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;"><div class="fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column"><div class="fusion-text fusion-text-2"><h2>Same AI Model, Two Hardware Tiers — And Why Context Length Is the Hidden Variable</h2>
<p><em>Modular Technology Group · April 20, 2026</em></p>
<hr />
<p>Ask any AI vendor how fast their stack runs and you&#8217;ll get a single headline number. &#8220;40 tokens per second.&#8221; &#8220;Under a second to first token.&#8221; Impressive — until you realize the benchmark prompt was 200 words long and you&#8217;re planning to feed it a 300-page document.</p>
<p>This week we took <strong>Qwen 3.6 35B-A3B</strong> — a state-of-the-art Mixture-of-Experts model released a few days ago — and pointed it at two very different pieces of hardware. Same model. Same questions. Same quantization tier (4-bit). Only the hardware changed.</p>
<p>The result isn&#8217;t just a horse race. It&#8217;s a quiet lesson in why the specs that matter for AI aren&#8217;t always the specs that get advertised.</p>
<hr />
<h2>Why We Ran This</h2>
<p>At Modular, we route the same model across different infrastructure depending on the workload. A developer laptop handles quick, short-context tasks. A dedicated AI server handles long-document analysis, multi-turn agent reasoning, and anything that needs a big context window.</p>
<p>The question isn&#8217;t &#8220;which is faster.&#8221; A server beats a laptop. That&#8217;s boring.</p>
<p>The real question: <strong>at what context length does routing to the dedicated server become worth it?</strong> Without numbers, every routing decision is a guess. So we measured.</p>
<hr />
<h2>The Setup</h2>
<table class="wp-block-table is-style-stripes">
<thead>
<tr>
<th>Platform</th>
<th>Hardware</th>
<th>Engine</th>
<th>Quantization</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>&#8220;Forge&#8221;</strong> — developer laptop</td>
<td>MacBook Pro M4 Max, 64 GB unified memory</td>
<td>LM Studio (MLX backend)</td>
<td>MLX 4-bit</td>
</tr>
<tr>
<td><strong>&#8220;Reach&#8221;</strong> — dedicated AI server</td>
<td>2× NVIDIA RTX 4070 Ti, 24 GB VRAM total</td>
<td>Ollama v0.11.10 (llama.cpp/GGUF)</td>
<td>Q4_K_M GGUF</td>
</tr>
</tbody>
</table>
<p>We ran the model at three context sizes — <strong>32k, 64k, and 128k tokens</strong> — and measured how long each host took to generate a 256-token response. Three trials per cell. Temperature fixed at 0.1 for near-determinism. Prompt content matched byte-for-byte. Tokenizer output cross-checked. Apples to apples.</p>
<p>For the 128k results, we ran six total trials across two independent sessions to nail the number down.</p>
<hr />
<h2>Result #1: One Host Stays Usable. The Other Doesn&#8217;t.</h2>
<p><img decoding="async" class="aligncenter size-full" src="https://modtechgroup.com/wp-content/uploads/2026/04/chart-3-degradation.png" alt="Throughput degradation as context grows — Forge collapses, Reach holds up" /></p>
<p>At 32k context, both platforms deliver workable performance. The MacBook runs at 10.8 tokens/sec — slower than the dedicated server, but perfectly fine for interactive chat.</p>
<p>Then the context grows.</p>
<p>At 64k, the MacBook drops to <strong>4.8 tokens/sec</strong>. At 128k, it collapses to <strong>1.7 tokens/sec</strong>.</p>
<p>The dedicated server, meanwhile, holds its shape:</p>
<table class="wp-block-table is-style-stripes">
<thead>
<tr>
<th>Context</th>
<th>Forge (MBP)</th>
<th>Reach (Dual GPU)</th>
<th>Reach advantage</th>
</tr>
</thead>
<tbody>
<tr>
<td>32k</td>
<td>10.8 tok/s</td>
<td>26.3 tok/s</td>
<td><strong>2.4× faster</strong></td>
</tr>
<tr>
<td>64k</td>
<td>4.8 tok/s</td>
<td>19.3 tok/s</td>
<td><strong>4.0× faster</strong></td>
</tr>
<tr>
<td>128k</td>
<td>1.7 tok/s</td>
<td>9.0 tok/s</td>
<td><strong>5.3× faster</strong></td>
</tr>
</tbody>
</table>
<p>Notice the pattern: the gap widens with every doubling of context. This isn&#8217;t a flat advantage — it compounds. By the time you&#8217;re at 128k, the kind of window you need for whole-document analysis or agent reasoning, the server is over five times faster than the laptop.</p>
<p><img decoding="async" class="aligncenter size-full" src="https://modtechgroup.com/wp-content/uploads/2026/04/chart-1-gen-throughput.png" alt="Generation throughput at 32k, 64k, and 128k across both hosts" /></p>
<hr />
<h2>Result #2: The Honest Metric Is Wall Time</h2>
<p>Tokens-per-second is abstract. What does this actually feel like to a human waiting for an answer?</p>
<p><img decoding="async" class="aligncenter size-full" src="https://modtechgroup.com/wp-content/uploads/2026/04/chart-2-wall-time.png" alt="Wall-clock time for a 256-token response" /></p>
<p>A <strong>256-token reply</strong> — roughly one solid paragraph — takes:</p>
<table class="wp-block-table is-style-stripes">
<thead>
<tr>
<th>Context</th>
<th>Forge</th>
<th>Reach</th>
</tr>
</thead>
<tbody>
<tr>
<td>32k</td>
<td>23 seconds</td>
<td><strong>12 seconds</strong></td>
</tr>
<tr>
<td>64k</td>
<td>54 seconds</td>
<td><strong>15 seconds</strong></td>
</tr>
<tr>
<td>128k</td>
<td><strong>2 minutes, 32 seconds</strong></td>
<td><strong>31 seconds</strong></td>
</tr>
</tbody>
</table>
<p>That&#8217;s the difference between a tool you can hold a conversation with and a tool you fire off and check back on later.</p>
<hr />
<h2>Result #3: The Bottleneck Isn&#8217;t What You Think</h2>
<p>Here&#8217;s where it gets interesting.</p>
<p>During the 128k runs on the dedicated server, we monitored both GPUs continuously. The VRAM was pegged — <strong>22.2 GB of 24 GB total, 91% saturation</strong>. So the GPUs must have been pegged too, right?</p>
<p>Not even close.</p>
<p><img decoding="async" class="aligncenter size-full" src="https://modtechgroup.com/wp-content/uploads/2026/04/chart-5-gpu-util.png" alt="GPU compute utilization vs VRAM saturation during 128k inference" /></p>
<p>The two GPUs, theoretically capable of hundreds of trillions of operations per second, sat at <strong>6–7% utilization</strong>. They weren&#8217;t waiting for work. They were waiting for <em>memory</em>.</p>
<p>At long context lengths, the model has to read the entire &#8220;KV cache&#8221; — every token it&#8217;s seen so far — to generate each new token. Enormous quantities of data move between VRAM and the compute cores every few milliseconds. The memory bus becomes the choke point long before the math does.</p>
<p>This is the single most important finding in the entire exercise, because it reframes how to evaluate future hardware.</p>
<p><strong>More FLOPS won&#8217;t fix this.</strong> When the question becomes &#8220;should we buy the next card when it drops?&#8221; — the answer starts with its memory bandwidth spec, not its TFLOPS number. That&#8217;s the opposite of what most marketing collateral emphasizes.</p>
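<p>The monitoring behind those numbers needs nothing exotic. A sketch of the one-second poll we ran (the <code>nvidia-smi</code> query fields are standard; the loop is ours):</p>
<pre><code>import subprocess, time

# Poll per-GPU compute utilization and VRAM once a second. Ctrl-C to stop.
while True:
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    print(out.strip())  # one line per GPU: index, util %, MiB used, MiB total
    time.sleep(1)
</code></pre>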
<hr />
<h3>The Same Story, Live From Production Telemetry</h3>
<p><img decoding="async" class="aligncenter size-full" src="https://modtechgroup.com/wp-content/uploads/2026/04/chart-6-grafana-live.jpg" alt="Live Grafana capture during the 128k verification runs" /></p>
<p>This is real production monitoring from our own dashboard during the benchmark — not synthetic charts. Three things worth noticing:</p>
<ul>
<li><strong>Both GPU panels are nearly identical.</strong> Both cards track the same 5–7% load pattern. That&#8217;s tensor parallelism working.</li>
<li><strong>The staircase in &#8220;Total Memory Used.&#8221;</strong> Each step is a single 128k trial committing its KV cache, then holding it. Three trials, three plateaus, climbing toward the 24 GB ceiling.</li>
<li><strong>Compute is flat. Memory is climbing.</strong> The shape of the real data tells the same story as the synthetic chart: this workload lives and dies by memory, not by compute.</li>
</ul>
<p>This is the visibility that separates production AI infrastructure from &#8220;we installed it and hope it works.&#8221;</p>
<hr />
<h2>Result #4: Tensor Parallelism Done Right</h2>
<p>One thing the dedicated server does exceptionally well: split the model cleanly across both GPUs.</p>
<p><img decoding="async" class="aligncenter size-full" src="https://modtechgroup.com/wp-content/uploads/2026/04/chart-4-vram-split.png" alt="Per-GPU VRAM split at 128k — textbook balance" /></p>
<p>At 128k context, the memory load is nearly identical on both cards — <strong>11,101 MiB on GPU 0, 11,117 MiB on GPU 1</strong>. A difference of 16 MiB out of over 11,000. That&#8217;s Ollama&#8217;s tensor-parallel splitter working exactly as designed. No card is bearing extra load. No GPU is OOMing. No spillover to CPU.</p>
<p>Tensor parallelism isn&#8217;t automatic. It requires compatible hardware, deliberate configuration, and a runtime that actually supports it. It&#8217;s also invisible to the end user — which is exactly how it should be.</p>
<hr />
<h2>What This Means for How You Deploy AI</h2>
<p>If you&#8217;re prototyping against 4k-to-16k prompts on a decent laptop, you&#8217;re fine. For a team running real AI applications against real-world documents, the math shifts quickly.</p>
<p>A few honest observations from this data:</p>
<ul>
<li><strong>Context length matters more than model size.</strong> A 35B-parameter model can feel snappy or geological depending entirely on how much context you feed it. Marketing benchmarks rarely mention this.</li>
<li><strong>Hardware choice is a memory problem, not just a compute problem.</strong> Two mid-range GPUs with balanced VRAM can outperform much more expensive single-GPU setups for long-context work.</li>
<li><strong>Consumer hardware has real limits.</strong> M-series Macs are remarkable for the price. But physics is physics. There&#8217;s a reason production AI workloads live on dedicated servers.</li>
<li><strong>Private infrastructure isn&#8217;t only about sovereignty.</strong> It&#8217;s also about having the right hardware for the right context, predictable performance, and the ability to scale without a surprise cloud bill.</li>
</ul>
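<p>And the routing question we opened with now has a number attached. A minimal sketch of the rule this data supports; the endpoints are placeholders, and the threshold is our judgment call from this benchmark, not a universal constant:</p>
<pre><code>LAPTOP = "http://forge.local:1234/v1"  # placeholder LM Studio endpoint
SERVER = "http://reach.local:11434"    # placeholder Ollama endpoint

def pick_backend(prompt_tokens):
    # Below ~32k the laptop stays interactive (10.8 tok/s here); past that
    # the server's lead compounds: 4.0x at 64k, 5.3x at 128k.
    return LAPTOP if prompt_tokens &lt; 32_000 else SERVER
</code></pre>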
<p>At Modular, we deploy private AI infrastructure that gets these details right — matching the model, the quantization, the hardware, and the runtime so answers come back in seconds, not minutes. Data stays private. Costs stay fixed. Performance stays predictable.</p>
<p>Your data, your rules. Your hardware, matched to your workload.</p>
<hr />
<h2>Appendix: Methodology &amp; Caveats</h2>
<p><strong>Model:</strong> Qwen 3.6 35B-A3B (Mixture-of-Experts — 36B total parameters, 3B active per token)</p>
<p><strong>Prompts:</strong> Synthetic filler text sized to 85% of target context, with a single consistent question appended. Byte-identical across both hosts. Tokenizer output verified to match (<code>prompt_tokens</code> reported identically on each side).</p>
<p><strong>Trials:</strong> Three per context-size × host cell for the primary run, plus three more at 128k on the dedicated server in a second independent session. Variance across all six 128k runs: under 2% (8.94&#8211;9.03 tok/s).</p>
<p><strong>Completion target:</strong> 256 tokens, <code>temperature=0.1</code>.</p>
<p><strong>Ollama configuration:</strong> Explicit <code>num_ctx</code> override on every request. Default caps context at 4,096 tokens — enough to silently invalidate every long-context test if you miss it.</p>
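<p>In practice the override is one field in the request body. A minimal example against Ollama&#8217;s documented <code>/api/generate</code> endpoint; the prompt here is a placeholder:</p>
<pre><code>import requests

long_prompt = "..."  # filler sized to 85% of the window, built as described above

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen3.6:35b-a3b",
    "prompt": long_prompt,
    "stream": False,
    "options": {"num_ctx": 131072, "temperature": 0.1},  # the critical line
}, timeout=3600)
</code></pre>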
<p><strong>Caveats:</strong></p>
<ul>
<li>Quantization formats differ (MLX 4-bit vs Q4_K_M GGUF). Both are 4-bit but not bit-identical.</li>
<li>The MacBook was running normal background workloads during the test, not dedicated. A clean bench would improve its numbers modestly but not flip the conclusion.</li>
<li>Single model tested. Different architectures — dense transformers, larger MoEs, specialized coding models — will scale differently.</li>
<li>The 6–7% GPU utilization figure reflects generation phase only. Prompt evaluation phase utilization was much higher, but brief.</li>
</ul>
<p><strong>Raw data and all benchmark scripts:</strong> Available on request. Fully reproducible.</p>
<hr />
<p><em>Modular Technology Group builds and hosts private AI workspaces with open-source components, in a FedRAMP data center. We use what we sell.</em></p>
</div></div></div></div></div>
<p>The post <a href="https://modtechgroup.com/same-ai-model-two-hardware-tiers-and-why-context-length-is-the-hidden-variable/">Same AI Model, Two Hardware Tiers — And Why Context Length Is the Hidden Variable</a> appeared first on <a href="https://modtechgroup.com">Modular Technology Group</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>The AWS Outage That Broke AI: What March&#8217;s Infrastructure Crisis Reveals About Cloud Dependencies</title>
		<link>https://modtechgroup.com/the-aws-outage-that-broke-ai-what-marchs-infrastructure-crisis-reveals-about-cloud-dependencies/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-aws-outage-that-broke-ai-what-marchs-infrastructure-crisis-reveals-about-cloud-dependencies</link>
		
		<dc:creator><![CDATA[Cale Hollingsworth]]></dc:creator>
		<pubDate>Tue, 17 Mar 2026 02:46:43 +0000</pubDate>
				<category><![CDATA[Tips & Tricks]]></category>
		<guid isPermaLink="false">https://modtechgroup.com/?p=5518</guid>

					<description><![CDATA[<p>On March 1st, 2026, a catastrophic failure at Amazon Web Services' data centers in the United Arab Emirates sent shockwaves through the global AI ecosystem. What began as fires and emergency power shutdowns at AWS facilities quickly cascaded into a worldwide infrastructure crisis that exposed an uncomfortable truth: our AI future is built on dangerously [Read more...]</p>
<p>The post <a href="https://modtechgroup.com/the-aws-outage-that-broke-ai-what-marchs-infrastructure-crisis-reveals-about-cloud-dependencies/">The AWS Outage That Broke AI: What March&#8217;s Infrastructure Crisis Reveals About Cloud Dependencies</a> appeared first on <a href="https://modtechgroup.com">Modular Technology Group</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>On March 1st, 2026, a catastrophic failure at Amazon Web Services&#8217; data centers in the United Arab Emirates sent shockwaves through the global AI ecosystem. What began as fires and emergency power shutdowns at AWS facilities quickly cascaded into a worldwide infrastructure crisis that exposed an uncomfortable truth: our AI future is built on dangerously fragile foundations.</p>
<p>The scale of disruption was staggering. More than 84 AWS services went down across the Middle East regions, but the damage didn&#8217;t stop there. The outage triggered a domino effect that took down Anthropic&#8217;s Claude AI, Snowflake&#8217;s data platforms, and dozens of AI-dependent services worldwide. For hours, companies that had entrusted their AI operations to the cloud found themselves completely cut off from the tools that now run their businesses.</p>
<h2>The Hidden Risk of AI Concentration</h2>
<p>This wasn&#8217;t just a regional outage—it was a wake-up call about the dangerous concentration of AI infrastructure. When a single cloud provider controls the computing resources that power the world&#8217;s most critical AI services, a localized disaster becomes a global catastrophe.</p>
<p>Consider the ripple effects:</p>
<ul>
<li><strong>Anthropic&#8217;s Claude AI</strong> became completely inaccessible, leaving thousands of businesses without their primary AI assistant</li>
<li><strong>Snowflake&#8217;s AI-driven analytics</strong> went dark, crippling data operations for Fortune 500 companies</li>
<li><strong>Countless SaaS platforms</strong> that rely on AWS-hosted AI APIs suddenly couldn&#8217;t serve their customers</li>
<li><strong>Development teams</strong> working on AI applications found their entire workflows halted</li>
</ul>
<p>The March 5th follow-up incident was even more telling. Amazon&#8217;s own e-commerce platform experienced a separate 5-6 hour outage directly linked to &#8220;faulty deployments stemming from generative AI-assisted code changes.&#8221; The very AI tools designed to improve reliability had become a source of instability.</p>
<h2>The Illusion of Cloud Reliability</h2>
<p>For years, we&#8217;ve been sold on the cloud&#8217;s promise of &#8220;99.9% uptime&#8221; and infinite scalability. But AI workloads have fundamentally changed the risk equation. Unlike traditional applications that might degrade gracefully during an outage, AI services tend to fail completely. When the models go down, entire business processes grind to a halt.</p>
<p>The March incidents revealed several critical vulnerabilities:</p>
<p><strong>Single Points of Failure:</strong> Despite AWS&#8217;s geographical distribution, the reality is that many AI services still depend on centralized model hosting and API gateways. When these go down, redundancy doesn&#8217;t matter.</p>
<p><strong>Cascade Effects:</strong> Modern AI applications don&#8217;t just use one service—they chain together multiple APIs, models, and data sources. A failure in one component can bring down entire AI workflows.</p>
<p><strong>Vendor Lock-in:</strong> Companies that have deeply integrated with specific AI APIs find themselves unable to quickly switch to alternatives during an outage. The switching costs aren&#8217;t just financial—they&#8217;re architectural.</p>
<h2>The Case for Private AI Infrastructure</h2>
<p>The AWS outage offers a compelling argument for what we call &#8220;AI sovereignty&#8221;—the ability to maintain control over your AI infrastructure regardless of external failures. This doesn&#8217;t mean rejecting cloud services entirely, but rather building AI capabilities that can survive when someone else&#8217;s infrastructure fails.</p>
<p>Private AI workspaces offer several critical advantages that March&#8217;s events highlighted:</p>
<p><strong>Isolation from External Failures:</strong> When your AI models run on dedicated infrastructure, a fire in Dubai doesn&#8217;t shut down your operations in Delaware. Your AI capabilities remain available when your competitors are scrambling.</p>
<p><strong>Model Diversity:</strong> Private deployments can host multiple models from different providers, reducing dependence on any single AI vendor. If one model becomes unavailable, workflows can automatically failover to alternatives.</p>
<p><strong>Predictable Performance:</strong> Shared cloud infrastructure means shared resources. During high-demand periods or outages, AI response times become unpredictable. Private infrastructure delivers consistent performance when you need it most.</p>
<p><strong>Data Gravity:</strong> With private AI workspaces, your data doesn&#8217;t need to travel across the internet to reach your models. This reduces latency, improves reliability, and eliminates another potential failure point.</p>
<h2>Lessons from the March Crisis</h2>
<p>The engineering meeting that Amazon convened on March 10th to address &#8220;service outages connected to generative AI code changes&#8221; hints at a larger problem: our infrastructure wasn&#8217;t designed for the AI age. We&#8217;re retrofitting cloud architectures built for traditional applications to handle AI workloads they were never meant to support.</p>
<p>Smart organizations are learning from these failures and building more resilient AI strategies:</p>
<p><strong>Hybrid Approaches:</strong> Use cloud services for development and experimentation, but maintain private infrastructure for production AI workloads that can&#8217;t afford downtime.</p>
<p><strong>Multi-Provider Strategies:</strong> Don&#8217;t put all your AI eggs in one cloud basket. Distribute critical AI functions across multiple infrastructure providers and deployment models.</p>
<p><strong>Failover Planning:</strong> Design AI workflows that can gracefully degrade or switch to alternative models when primary services become unavailable.</p>
<p><strong>Local AI Capabilities:</strong> For truly critical applications, maintain on-premise or privately hosted AI models that can function independently of external services.</p>
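<p>The failover pattern above is worth making concrete. A minimal sketch, with hypothetical endpoints and no retry logic, just ordered degradation:</p>

<pre class="wp-block-code"><code>import requests

BACKENDS = [
    ("cloud-primary", "https://api.example.com/v1/complete"),      # hypothetical
    ("private-fallback", "http://ai.internal:11434/api/generate"), # hypothetical
]

def complete(prompt):
    for name, url in BACKENDS:
        try:
            r = requests.post(url, json={"prompt": prompt}, timeout=30)
            r.raise_for_status()
            return name, r.json()
        except requests.RequestException:
            continue  # degrade to the next backend instead of failing outright
    raise RuntimeError("all AI backends unavailable")
</code></pre>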
<h2>The Future of AI Infrastructure</h2>
<p>The March 2026 AWS outage won&#8217;t be the last infrastructure crisis in the AI age—it&#8217;s the first of many. As AI becomes more central to business operations, the cost of these failures will only increase. Organizations that learn to build resilience into their AI strategies now will have a competitive advantage when the next crisis hits.</p>
<p>The question isn&#8217;t whether cloud AI services will fail again—it&#8217;s whether your business will be ready when they do. Private AI infrastructure isn&#8217;t about avoiding the cloud; it&#8217;s about ensuring you&#8217;re never at the mercy of someone else&#8217;s infrastructure decisions when your business is on the line.</p>
<p>Because when the fires start burning in someone else&#8217;s data center, you want to be the company that keeps running while your competitors wait for the lights to come back on.</p>
<h2>Building AI Resilience</h2>
<p>The path forward requires a fundamental shift in how we think about AI infrastructure. Instead of treating AI as just another cloud service, we need to recognize it as critical business infrastructure that demands the same reliability standards we apply to power, water, and network connectivity.</p>
<p>This means:</p>
<ul>
<li>Investing in private AI capabilities for core business functions</li>
<li>Designing AI workflows with failure modes in mind</li>
<li>Building redundancy across different infrastructure providers</li>
<li>Maintaining data sovereignty to reduce external dependencies</li>
<li>Training teams to operate in hybrid public/private AI environments</li>
</ul>
<p>The companies that emerge stronger from the next AI infrastructure crisis will be those that learned from March 2026: in the AI age, dependency is vulnerability. True AI strategy isn&#8217;t about finding the best cloud provider—it&#8217;s about building systems that can thrive regardless of who else&#8217;s infrastructure fails.</p>
<p>Your AI capabilities should be as reliable as the business processes they enable. Anything less is a bet you can&#8217;t afford to lose.</p>
<p>The post <a href="https://modtechgroup.com/the-aws-outage-that-broke-ai-what-marchs-infrastructure-crisis-reveals-about-cloud-dependencies/">The AWS Outage That Broke AI: What March&#8217;s Infrastructure Crisis Reveals About Cloud Dependencies</a> appeared first on <a href="https://modtechgroup.com">Modular Technology Group</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Three Security Incidents in Three Weeks: Why Private AI Is No Longer Optional</title>
		<link>https://modtechgroup.com/three-security-incidents-in-three-weeks-why-private-ai-is-no-longer-optional/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=three-security-incidents-in-three-weeks-why-private-ai-is-no-longer-optional</link>
		
		<dc:creator><![CDATA[Cale Hollingsworth]]></dc:creator>
		<pubDate>Sun, 15 Mar 2026 03:22:50 +0000</pubDate>
				<category><![CDATA[Tips & Tricks]]></category>
		<guid isPermaLink="false">https://modtechgroup.com/?p=5514</guid>

					<description><![CDATA[<p>The last few weeks have delivered a masterclass in why trusting your most sensitive data to someone else's cloud is a gamble — and the house is starting to win. Three separate incidents. Three different organizations. One common thread: loss of control over data sent to cloud AI platforms. [Read more...]</p>
<p>The post <a href="https://modtechgroup.com/three-security-incidents-in-three-weeks-why-private-ai-is-no-longer-optional/">Three Security Incidents in Three Weeks: Why Private AI Is No Longer Optional</a> appeared first on <a href="https://modtechgroup.com">Modular Technology Group</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>The last few weeks have delivered a masterclass in why trusting your most sensitive data to someone else&#8217;s cloud is a gamble — and the house is starting to win.</p>



<h2 class="wp-block-heading">What Happened</h2>



<p>Three separate incidents. Three different organizations. One common thread: <strong>loss of control over data sent to cloud AI platforms.</strong></p>



<p><strong>The Pentagon vs. Anthropic.</strong> The Department of Defense designated Anthropic — maker of Claude, one of the most capable AI models on the market — as a national security risk after a dispute over who ultimately controls the model and the data flowing through it. When a defense agency can&#8217;t get comfortable with the control dynamics, that&#8217;s a signal worth paying attention to.</p>



<p><strong>OpenAI&#8217;s vendor breach.</strong> A third-party analytics provider working with OpenAI exposed business customer data. Not through a sophisticated attack — through the kind of supply-chain vulnerability that&#8217;s inevitable when your data passes through multiple hands you&#8217;ve never met.</p>



<p><strong>CISA&#8217;s ChatGPT incident.</strong> The acting director of CISA — the federal agency literally responsible for cybersecurity — accidentally uploaded sensitive government documents to ChatGPT&#8217;s public platform. If the people whose job is protecting data can make this mistake, what about the rest of us?</p>



<h2 class="wp-block-heading">The Real Problem Isn&#8217;t the Headlines</h2>



<p>These aren&#8217;t edge cases. They&#8217;re the natural, predictable result of centralizing sensitive work inside infrastructure you don&#8217;t control.</p>



<p>Every time you send a prompt to a cloud AI service, you&#8217;re trusting:</p>



<ul class="wp-block-list">
<li>That vendor&#8217;s security posture</li>
<li>Their subcontractors&#8217; security posture</li>
<li>Their data retention policies (and whether those policies change tomorrow)</li>
<li>Whatever a court might compel them to preserve or disclose</li>
<li>Whatever a future acquirer might decide to do with the data</li>
</ul>



<p>That&#8217;s a long chain of trust for organizations handling privileged, regulated, or confidential information. And every link in that chain is a potential point of failure.</p>



<h2 class="wp-block-heading">There&#8217;s a Better Way</h2>



<p>At Modular, we built <a href="https://modtechgroup.com/ai-workspaces">Private AI Workspaces</a> specifically to eliminate this chain of dependency.</p>



<p>Your prompts, embeddings, documents, and outputs never leave your environment. There&#8217;s no third-party analytics layer siphoning data to vendors you&#8217;ve never vetted. No silent retention policy buried in terms of service. No competing obligations between your privacy and a government subpoena aimed at your AI provider.</p>



<p>The infrastructure is yours — hosted on your own hardware, or in our FedRAMP-certified data center with full tenant isolation. Either way, the data stays exactly where you put it.</p>



<h3 class="wp-block-heading">What That Looks Like in Practice</h3>



<ul class="wp-block-list">
<li><strong><a href="https://modtechgroup.com/wildcat">Wildcat</a></strong> — Entry-level private AI workspace. Shared infrastructure, isolated data. Perfect for firms getting started with AI who want privacy from day one.</li>
<li><strong><a href="https://modtechgroup.com/panther">Panther</a></strong> — Dedicated LLM server, isolated frontend, private document storage. For teams with real data to protect.</li>
<li><strong><a href="https://modtechgroup.com/grizzly">Grizzly</a></strong> — Fully dedicated hardware with air-gapped options. On-premise if you need it. For organizations where &#8220;good enough&#8221; security isn&#8217;t.</li>
</ul>



<p>All tiers run on model-agnostic infrastructure. You&#8217;re not locked into one AI vendor — you choose the models that work best for your use case, and you can switch whenever you want. Fixed monthly pricing means no surprise bills when your team actually starts using AI the way they should be.</p>



<h2 class="wp-block-heading">The Bottom Line</h2>



<p>Private AI isn&#8217;t a luxury tier. It&#8217;s becoming the baseline for anyone who takes their data seriously.</p>



<p>The organizations that will thrive aren&#8217;t the ones with the flashiest AI tools — they&#8217;re the ones who maintain control over where their data lives, who can access it, and what happens to it tomorrow.</p>



<p>If your organization is rethinking where its AI workloads live, <a href="https://modtechgroup.com/consultation">we&#8217;re happy to compare notes</a>.</p>


<hr class="wp-block-separator" />


<p class="has-small-font-size"><em>Modular Technology Group builds, hosts, and maintains private AI workspaces for organizations that need enterprise-grade capability without sacrificing data sovereignty. <a href="https://modtechgroup.com/contact">Get in touch</a> or <a href="https://modtechgroup.com/consultation">schedule a consultation</a>.</em></p>
<p>The post <a href="https://modtechgroup.com/three-security-incidents-in-three-weeks-why-private-ai-is-no-longer-optional/">Three Security Incidents in Three Weeks: Why Private AI Is No Longer Optional</a> appeared first on <a href="https://modtechgroup.com">Modular Technology Group</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>The AI Workspace Your Clients Would Choose Themselves</title>
		<link>https://modtechgroup.com/the-ai-workspace-your-clients-would-choose-themselves/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-ai-workspace-your-clients-would-choose-themselves</link>
		
		<dc:creator><![CDATA[Cale Hollingsworth]]></dc:creator>
		<pubDate>Mon, 16 Jun 2025 20:00:12 +0000</pubDate>
				<category><![CDATA[AI Workspaces]]></category>
		<category><![CDATA[Tips & Tricks]]></category>
		<guid isPermaLink="false">https://modtechgroup.com/?p=5024</guid>

					<description><![CDATA[<p>Modular AI Workspaces. GenAI that stays in chambers. Your litigators will find facts in minutes. Your compliance team will sleep at night. Your leadership will control costs. And your client data? It stays entirely in your hands. Last week, Cloudflare's outage reminded us just how fragile central systems can be. Days earlier, OpenAI was [Read more...]</p>
<p>The post <a href="https://modtechgroup.com/the-ai-workspace-your-clients-would-choose-themselves/">The AI Workspace Your Clients Would Choose Themselves</a> appeared first on <a href="https://modtechgroup.com">Modular Technology Group</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><div class="fusion-fullwidth fullwidth-box fusion-builder-row-3 fusion-flex-container nonhundred-percent-fullwidth non-hundred-percent-height-scrolling" style="--awb-border-radius-top-left:0px;--awb-border-radius-top-right:0px;--awb-border-radius-bottom-right:0px;--awb-border-radius-bottom-left:0px;--awb-flex-wrap:wrap;" ><div class="fusion-builder-row fusion-row fusion-flex-align-items-flex-start fusion-flex-content-wrap" style="max-width:1310.4px;margin-left: calc(-4% / 2 );margin-right: calc(-4% / 2 );"><div class="fusion-layout-column fusion_builder_column fusion-builder-column-2 fusion_builder_column_1_1 1_1 fusion-flex-column" style="--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;"><div class="fusion-column-wrapper fusion-flex-justify-content-flex-start fusion-content-layout-column"><div class="fusion-text fusion-text-3" style="--awb-text-color:var(--awb-color1);"><blockquote>
<p><span style="font-size: 16px;">Modular AI Workspaces. GenAI that stays in chambers.</span></p>
</blockquote>
<p>Your litigators will find facts in minutes.<br />
Your compliance team will sleep at night.<br />
Your leadership will control costs.</p>
<p>And your client data?<br />
It stays entirely in your hands.</p>
<p>Last week, Cloudflare&#8217;s outage reminded us just how fragile central systems can be. Days earlier, OpenAI was compelled to preserve every chat—including API traffic. If your teams are building strategy, reviewing contracts, or researching sensitive issues with public LLMs, those sessions are now permanent legal records…somewhere in someone else&#8217;s cloud.</p>
<p>At Modular, we believe your most valuable asset, your data, deserves sovereign, private infrastructure. Modular AI Workspaces give law firms and enterprises a secure, closed GenAI environment that lives only in your data center or ours. There is no third-party access, no metadata harvesting, and no surprise subpoenas.</p>
<p>Just fast answers, trusted results, and total control.</p>
<p>Because for high-stakes work, privacy is productivity.</p>
<p><em>#DataSovereignty #PrivateAI #LegalTech #SecureByDesign #ClientConfidentiality #ComplianceFirst #ModularAI</em></p>
</div></div></div></div></div></p>
<p>The post <a href="https://modtechgroup.com/the-ai-workspace-your-clients-would-choose-themselves/">The AI Workspace Your Clients Would Choose Themselves</a> appeared first on <a href="https://modtechgroup.com">Modular Technology Group</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Compare Multiple LLMs Side by Side</title>
		<link>https://modtechgroup.com/compare-multiple-llms-side-by-side/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=compare-multiple-llms-side-by-side</link>
		
		<dc:creator><![CDATA[Cale Hollingsworth]]></dc:creator>
		<pubDate>Sun, 04 May 2025 00:30:55 +0000</pubDate>
				<category><![CDATA[AI Workspaces]]></category>
		<category><![CDATA[Tips & Tricks]]></category>
		<guid isPermaLink="false">https://modtechgroup.com/?p=4993</guid>

					<description><![CDATA[<p>With our AI Workspaces, you can compare multiple LLMs side by side to find the right model for your use case.</p>
<p>The post <a href="https://modtechgroup.com/compare-multiple-llms-side-by-side/">Compare Multiple LLMs Side by Side</a> appeared first on <a href="https://modtechgroup.com">Modular Technology Group</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><div class="fusion-fullwidth fullwidth-box fusion-builder-row-4 fusion-flex-container nonhundred-percent-fullwidth non-hundred-percent-height-scrolling" style="--awb-border-radius-top-left:0px;--awb-border-radius-top-right:0px;--awb-border-radius-bottom-right:0px;--awb-border-radius-bottom-left:0px;--awb-flex-wrap:wrap;" ><div class="fusion-builder-row fusion-row fusion-flex-align-items-flex-start fusion-flex-content-wrap" style="max-width:1310.4px;margin-left: calc(-4% / 2 );margin-right: calc(-4% / 2 );"><div class="fusion-layout-column fusion_builder_column fusion-builder-column-3 fusion_builder_column_1_1 1_1 fusion-flex-column" style="--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;"><div class="fusion-column-wrapper fusion-flex-justify-content-flex-start fusion-content-layout-column"><div class="fusion-text fusion-text-4" style="--awb-text-color:var(--awb-color1);"><p>With our AI Workspaces, you can compare multiple LLMs side by side to find the right model for your use case.</p>
</div><div class="fusion-video fusion-youtube" style="--awb-max-width:600px;--awb-max-height:360px;"><div class="video-shortcode"><lite-youtube videoid="N_8woX8l-Ko" class="landscape" params="wmode=transparent&autoplay=1&amp;enablejsapi=1" title="YouTube video player 1" data-button-label="Play Video" width="600" height="360" data-thumbnail-size="auto" data-no-cookie="on"></lite-youtube></div></div></div></div></div></div></p>
<p>The post <a href="https://modtechgroup.com/compare-multiple-llms-side-by-side/">Compare Multiple LLMs Side by Side</a> appeared first on <a href="https://modtechgroup.com">Modular Technology Group</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
