
For the past 18 months, the CISO’s playbook for generative AI has been relatively simple: control the browser.
Security tooling enforced Cloud Access Security Broker (CASB) policies, blocked or monitored traffic to known AI endpoints, and routed usage through approved gateways. The operating model was clear: if sensitive data leaves the network for an external API call, we can observe it, log it, and stop it. But that model is beginning to break down.
A quiet hardware shift is driving large language model (LLM) usage off the network and onto the endpoint. Call it Shadow AI 2.0, or the era of “bring your own model” (BYOM): employees run capable models locally on laptops, offline, with no API calls and no obvious network signature. The governance conversation is still framed as “data exfiltration to the cloud,” but the more immediate business risk is increasingly ungoverned on-device inference.
When inference occurs locally, traditional data loss prevention (DLP) never sees the interaction. And what security can’t see, it can’t manage.
Why local inference is suddenly practical
Two years ago, running a useful LLM on a work laptop was a niche gimmick. Today it is routine for technical teams.
Three things converged:
- Consumer accelerators got serious: A MacBook Pro with 64GB of unified memory can often run 70B-class quantized models at usable speeds (with practical limits on context length). What once required multi-GPU servers is now feasible on a high-end laptop for many real-world workflows.
- Quantization became widespread: It is now easy to compress models into smaller, faster formats that fit in laptop memory, often with acceptable quality tradeoffs for many tasks.
- Distribution is frictionless: Open-weight models are a single command away, and the ecosystem of tools makes “download → run → chat” trivial.
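A back-of-envelope calculation shows why a 64GB laptop can hold a 70B-class model once quantized. The figures below are rough assumptions for illustration (bits per weight, plus a flat overhead factor), not exact runtime numbers:

```python
# Rough memory estimate for running a quantized LLM locally.
# Assumptions (illustrative): weights dominate memory; add a flat ~10%
# for KV cache and runtime buffers. Real usage varies with context length.

def estimated_memory_gb(params_billions: float, bits_per_weight: float,
                        overhead: float = 0.10) -> float:
    weights_gb = params_billions * (bits_per_weight / 8)  # 1B params at 8 bits ≈ 1 GB
    return weights_gb * (1 + overhead)

print(round(estimated_memory_gb(70, 4), 1))  # 4-bit 70B: ~38.5 GB, fits in 64 GB
print(round(estimated_memory_gb(70, 8), 1))  # 8-bit 70B: ~77.0 GB, does not
```

Quantizing from 8 bits to 4 bits per weight roughly halves the footprint, which is exactly the step that moved 70B-class models from server territory into laptop territory.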
The result: An engineer can pull a multi-GB model artifact, turn off Wi-Fi, and run sensitive workflows locally: review source code, summarize documents, draft customer communications, even perform exploratory analysis on regulated data sets. No outbound packets, no proxy logs, no cloud audit trail.
From a network security perspective, that activity may be indistinguishable from “nothing happened.”
The risk is no longer just that data leaves the company
If the data doesn’t leave the laptop, why should a CISO care?
Because the dominant risks shift from exfiltration to integrity, provenance, and compliance. In practice, local inference creates three kinds of blind spots that most companies have not instrumented.
1. Code and decision contamination (integrity risk)
Local models are often adopted because they are fast, private, and “no approval required.” The downside is that they are often not vetted for the business environment.
A common scenario: A senior developer downloads a community fine-tuned coding model because it posts good benchmark results. They paste in internal authentication logic, payment flows, or infrastructure scripts to “clean them up.” The model returns output that looks competent, compiles, and passes unit tests, but subtly degrades the security posture (weak input validation, insecure defaults, brittle concurrency changes, dependency choices that are not allowed internally). The engineer commits the change.
If that interaction occurred offline, you may have no record of the AI influencing the code path. And when you respond to the resulting incident, you will be investigating the symptom (a vulnerability) without visibility into a key cause (ungoverned model use).
2. Licensing and intellectual property exposure (compliance risk)
Many high-performance models ship with licenses that include commercial-use restrictions, attribution requirements, field-of-use limits, or obligations that may be incompatible with proprietary product development. When employees run models locally, that use can bypass the organization’s normal legal review and procurement process.
If a team uses a non-commercial model to generate production code, documentation, or product behavior, the company may inherit risk that surfaces later during M&A diligence, customer security reviews, or litigation. The difficult part is not only the license terms, but the lack of inventory and traceability. Without a governed model hub or usage log, you may not be able to prove what was used where.
3. Model supply chain exposure (provenance risk)
Local inference also changes the software supply chain problem. Endpoints begin to accumulate large model artifacts and the toolchains that surround them: downloaders, converters, runtimes, plugins, front-end shells, and Python packages.
There is a critical technical nuance here: the file format matters. Newer formats like safetensors are designed to prevent arbitrary code execution, but older pickle-based PyTorch files can execute malicious payloads simply by being loaded. If your developers are pulling unvetted checkpoints from Hugging Face or other repositories, they’re not just downloading data: they could be downloading an exploit.
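To see why pickle-based checkpoints are dangerous, consider this benign, self-contained sketch: Python’s `__reduce__` hook lets a serialized object name any callable to invoke at load time. A real malicious checkpoint would call `os.system` or a downloader instead of the harmless `eval` below:

```python
import pickle

class MaliciousCheckpoint:
    """Stand-in for a pickle-based model file carrying a payload."""
    def __reduce__(self):
        # Whatever callable this returns is invoked during deserialization.
        # Kept benign here; an attacker would invoke os.system or similar.
        return (eval, ("'payload ran during load'",))

blob = pickle.dumps(MaliciousCheckpoint())  # the "checkpoint" as it sits on disk

# Merely loading the artifact executes attacker-controlled code --
# no model method is ever called.
result = pickle.loads(blob)
print(result)  # → payload ran during load
```

This is the mechanism behind the standing warnings about loading untrusted pickle files; safetensors closes the hole by storing only raw tensor data, with no executable deserialization step.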
Security teams have spent decades learning to treat unknown executables as hostile. BYOM requires extending that mindset to model artifacts and the surrounding runtime stack. The biggest organizational gap today is that most companies do not have the equivalent of a software bill of materials for models: provenance, hashes, allowed sources, scanning and lifecycle management.
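A minimal “bill of materials for models” record can start as simply as the sketch below: pin the artifact’s SHA-256, its source, and its license at intake time. The field names and example URL are illustrative, not a standard:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream the file so multi-GB weights never need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def manifest_entry(path: Path, source: str, license_id: str) -> dict:
    return {
        "artifact": path.name,
        "sha256": sha256_of(path),           # provenance anchor
        "source": source,                    # allowed origin, recorded at intake
        "license": license_id,               # confirmed by legal review
        "format": path.suffix.lstrip("."),   # prefer safetensors/gguf over pt
    }

# Demo with a tiny stand-in file (a real entry would point at model weights):
Path("demo.gguf").write_bytes(b"fake model weights")
entry = manifest_entry(Path("demo.gguf"),
                       "https://example.com/models/demo.gguf",
                       "apache-2.0")
print(json.dumps(entry, indent=2))
```

Aggregating these records per endpoint gives you the inventory, hashes, and license trail that the text argues most companies currently lack.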
BYOM Mitigation: Treat model weights as software artifacts
You can’t solve local inference by blocking URLs. You need endpoint-aware controls and a developer experience that makes the safe path the easy path.
Here are three practical approaches:
1. Move governance to the endpoint
Network DLP and CASB are still important for cloud usage, but they are not sufficient for BYOM. Start treating local model usage as an endpoint governance issue by looking for specific signals:
- Inventory and detection: Look for high-fidelity indicators such as .gguf files larger than 2GB, processes such as llama.cpp or Ollama, and local listeners on common default ports such as 11434.
- Process and runtime awareness: Monitor for sustained high GPU/NPU (neural processing unit) utilization from rogue runtimes or unknown local inference servers.
- Device policy: Use mobile device management (MDM) and endpoint detection and response (EDR) policies to control the installation of unapproved runtimes and apply baseline hardening on engineering devices. The point is not to punish experimentation. It is to recover visibility.
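Two of the indicators above can be approximated with a short sweep, sketched below. The 2GB threshold and port 11434 follow the text; in practice this logic would live in your EDR or MDM tooling rather than a standalone script:

```python
import socket
from pathlib import Path

GGUF_MIN_BYTES = 2 * 1024**3  # flag .gguf artifacts of 2 GB or more
OLLAMA_PORT = 11434           # Ollama's default local listener

def find_large_ggufs(root: Path) -> list[Path]:
    """Indicator: large model artifacts sitting on disk."""
    return [p for p in root.rglob("*.gguf")
            if p.is_file() and p.stat().st_size >= GGUF_MIN_BYTES]

def local_listener(port: int, timeout: float = 0.25) -> bool:
    """Indicator: a local inference server accepting connections."""
    with socket.socket() as s:
        s.settimeout(timeout)
        return s.connect_ex(("127.0.0.1", port)) == 0

if __name__ == "__main__":
    # Demo against a throwaway directory (a real sweep would target
    # user home directories and common model cache paths):
    demo = Path("demo_models")
    demo.mkdir(exist_ok=True)
    (demo / "small.gguf").write_bytes(b"tiny")   # below the threshold
    print("flagged:", find_large_ggufs(demo))    # → flagged: []
    print("ollama port open:", local_listener(OLLAMA_PORT))
```

The GPU/NPU-utilization signal is deliberately omitted here: it is platform-specific and best sourced from your EDR agent’s telemetry rather than a portable script.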
2. Provide a paved path: an internal, curated model hub
Shadow AI is often the result of friction. Approved tools are too restrictive, too generic, or too slow to get approved. A better approach is to offer a curated internal catalog that includes:
- Approved models for common tasks (coding, summarization, classification)
- Verified licenses and usage guidance
- Hashed, pinned versions (prioritizing safer formats like safetensors)
- Clear documentation for secure local use, including where sensitive data is and is not allowed. If you want developers to stop scavenging unvetted models off the internet, give them something better.
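On the consumer side, those pinned hashes let a developer (or a pre-run hook) verify an artifact before loading it. The catalog contents below are a hypothetical excerpt for illustration:

```python
import hashlib
from pathlib import Path

# Hypothetical excerpt of the internal catalog: artifact name -> pinned SHA-256.
CATALOG = {
    "demo-model.safetensors":
        "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def verify_artifact(path: Path) -> bool:
    pinned = CATALOG.get(path.name)
    if pinned is None:
        return False  # not in the catalog: not approved for use
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(1 << 20):  # stream; weights can be multi-GB
            h.update(chunk)
    return h.hexdigest() == pinned

# Demo: the pinned hash above is the SHA-256 of the bytes b"test".
Path("demo-model.safetensors").write_bytes(b"test")
print(verify_artifact(Path("demo-model.safetensors")))  # → True
```

An unknown artifact fails closed, which is the behavior you want: the catalog defines the allowed set, not a blocklist.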
3. Update policy language: “cloud services” are no longer enough
Most acceptable use policies talk about SaaS and cloud tools. BYOM requires policy language that explicitly covers:
- Downloading and running model artifacts on corporate endpoints
- Acceptable sources
- License compliance requirements
- Rules for using models with sensitive data
- Retention and logging expectations for local inference tools. The policy doesn’t have to be restrictive. It has to be unambiguous.
The perimeter returns to the device
For a decade we moved security controls “up” to the cloud. Local inference is returning a significant portion of AI activity to the endpoint.
Five signs that shadow AI has moved to the endpoints:
- Large model artifacts: Unexplained storage consumption by .gguf or .pt files.
- Local inference servers: Processes listening on ports such as 11434 (Ollama).
- GPU utilization patterns: Spikes in GPU usage while offline or without a VPN connection.
- Lack of model inventory: Inability to map code outputs to specific model versions.
- License ambiguity: Presence of “non-commercial” model weights in production builds.
Shadow AI 2.0 is not a hypothetical future; it is a predictable consequence of fast hardware, easy distribution, and developer demand. CISOs who focus solely on network controls will miss what happens in the silicon sitting on employees’ desks.
The next phase of AI governance is less about blocking websites and more about controlling artifacts, provenance, and policies at the endpoint, without killing productivity.
Jayachander Reddy Kandakatla is a senior MLOps engineer.





