Open question no. 3: fundamentally insecure systems
Why do we keep trying to address vulnerabilities only through patching?
I think any of us who work in cybersecurity should add the verb ‘to MacGyver’ to our resume’s list of skills. It seems like half of what we do in cybersecurity is layer technology over software and systems that we know are flawed; flaws for which we’ve paid good money. Yet it’s unclear to me that we’re doing much about the root issue, and this is why it’s no. 3 on my list of open questions: what to do about our seemingly hopeless situation.
Many of the items on my list are not challenging because of some technical roadblock. Typically they’re hard because they require moving regulators and our corporate overlords to do the right thing. However, I’ve been avoiding tackling no. 3 because I think it is technically difficult. Though that could simply reflect the limitations of my own knowledge. Any expertise I had in operating system design aged out over twenty years ago. To be sure, there’s also a need lurking in here for more forceful regulations, but let’s unpack the issue a bit.
I take it as axiomatic that every software product is fundamentally flawed. The bulk of OT and IoT is intrinsically insecure and likely unsecurable. None of it is designed within any sort of cohesive, holistic framework. Yet we must meet the obligation of running it securely. How? Is there a philosophy of architecting and operating our IT and data environments securely that can be effective despite the weaknesses in each element? Resilience needs to become more than having redundant systems.
We do a lot to protect systems from being attacked: firewalls (a network ingress policy), IPSs and IDSs, micro-segmentation of networks, and of course host-based protections (generally more firewalling), anti-malware agents, and account protections. These both limit the attack surface of the underlying system and create detection mechanisms for identifiably malicious or anomalous behaviors. Operating systems themselves are designed to isolate accounts and processes, and restrict the ability of malware to further compromise user data and system processes[1].
Yet despite this avalanche of technology, to be on a network is to be compromised. Breaches are the cost of doing business, as inevitable as death and taxes[2]. It’s why there’s an entire market for breach insurance. Many of these breaches are directly tied to the failure of some of these controls. A misconfigured firewall exposes systems; a weak password exposes an account; a lack of effective anti-malware software allows malicious software to run amok in a system. But just as often breaches are the result of defective software.
If my assumption about software vulnerabilities is correct (and it is), what do any of these controls do to address them? None of them do anything to address the vulnerability itself - they work to reduce access to the vulnerability, detect when it’s exploited, or limit the blast radius, i.e., constrain it from spreading. These are all good things, but none of them address the root cause. All we can do about a software vulnerability is wait until it’s uncovered, and then wait the month or more it takes the vendor to develop, test, and release a patch.
The situation we find ourselves in is this: we’re going to deploy thousands or tens of thousands of systems (let’s pretend they’re all centrally managed) knowing that each system is rife with known and unknown flaws, so we then take on the burden of macgyvering a small army of tools that help us, not mitigate the problem, but mitigate the consequences of the inevitable breach. That burden is not just expensive, but cripplingly so. The problem of vulnerabilities is so egregious that most shops have staff and tools dedicated to simply informing them of newly uncovered flaws. Of course, at most universities the majority of systems aren’t centrally managed, but depend on users applying security and application patches at their own convenience.
So this brings me to the core of the open question: is layering all of these tools on top of systems and networks truly the best we can do, or is our inability to address the fundamentally flawed nature of software a consequence of our dependence on this approach? If we are unable to address software flaws directly, whether due to commercial or political inertia, should we rethink our fundamental model for deploying and configuring systems?
The commercial sector seems to have settled on the term of art ‘assumed breach’ for this framing. However, what ‘assumed breach’ means varies by vendor and includes everything from deploying deception networks to a kind of threat assessment[3] to the adoption of a resilience mindset. While all of these are satisfyingly cool, they still fall into the category of preparing for the breach and constraining the blast radius.
As I mentioned in an earlier piece, there’s some overlap between the problem I’m discussing and zero trust models. Zero trust models assume no system or account is trusted until that system or account passes some level of health confidence. At least in theory. Most of the zero trust deployments I’ve seen in higher education are a combination of micro-segmentation, multi-factor authn, and network access controls (NAC). Indeed, many are leveraging NAC as their core health check mechanism. This is great and a step forward - it was as unimaginable a few years ago as MFA deployed across an institution was a decade ago - yet it still sidesteps the core issue. I feel our approach is akin to refusing to vaccinate the population or provide clean drinking water, but putting a doctor and a nurse on every street corner[4].
Software flaws cast a long shadow, a shadow in which most of our work takes place. I’m not naïve - the political and commercial barriers are high - but I do believe that if we shifted the ungodly amounts we’re spending on securing systems within that shadow to the secure development of software and system architecture we’d be living in a radically different world. One where deploying a new system on a network wasn’t the metaphorical equivalent of pulling the pin on a grenade without any idea of how long until it exploded. Of course, no application is a monolith. Deploying even a single-purpose system (imagine a web server) requires a full stack of supporting applications and services. Even if you could purchase a web server of sufficient rigor to serve on a bastion host, the hundreds or thousands of supporting services, drivers, and internal processes would remain flawed in their own right.
However, as practitioners, how do we respond to the current status quo? Where can we look or what can we explore that moves us closer to a solution? Should we extend the zero trust model of shrinking the trust perimeter from the organizational boundary to the host, by extending this into the host itself? I’ve seen at least one firm float the term nano-segmentation for encapsulating almost every workflow, which does seem like a reasonable response to an assumption of breach[5]. Of course, many organizations struggle with simply maintaining a simple inventory of hosts; understanding and developing security rules for workflows within hosts is probably three orders of magnitude more difficult.
No, this model of process- or workflow-level segmentation may be intellectually satisfying yet strikes me as practically impossible to manage outside of highly restricted and limited environments. We truly need an operating system that natively enforces a meaningful level of nano-segmentation.
Similarly, perhaps we should be looking more closely at systems that operate more comprehensively within an encryption envelope[6], for example by using homomorphic encryption. One could imagine a system where a user’s data was stored and processed only while encrypted, with hardened processes to handle the rare decryption[7]. How this would truly work in a contemporary office environment, using tools architected around the sharing and co-editing of data, is beyond my imagination.
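To make the idea of processing data only while encrypted a little more concrete, here is a minimal sketch of the Paillier scheme, a well-known additively homomorphic cryptosystem. It is a toy: the primes are tiny, the function names (keygen, encrypt, decrypt, add_encrypted) are invented for illustration, and it supports only addition of encrypted integers, nothing close to the general-purpose computation an office suite would need. It assumes Python 3.8 or later for the modular inverse via pow.

```python
import math
import secrets


def keygen(p: int, q: int):
    """Toy Paillier key pair from two small primes (illustration only, not secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; this shortcut is valid because g = n + 1
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    """Encrypt integer m (0 <= m < n) under public key n, with generator g = n + 1."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random blinding value in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n: int, priv, c: int) -> int:
    """Recover the plaintext from ciphertext c using the private key (lam, mu)."""
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the plaintexts; no private key is needed."""
    return (c1 * c2) % (n * n)


# The party doing the addition never sees 20, 22, or 42 in the clear.
n, priv = keygen(1789, 1861)
c_total = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
print(decrypt(n, priv, c_total))  # prints 42
```

The point of the sketch is only that the process performing the computation holds no decryption key. Extending that property from addition to arbitrary workloads is where the cost and complexity of fully homomorphic schemes come in.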
Another angle on this challenge may involve what are known as zero-knowledge proofs and related verifiable computation techniques: mathematical methods that allow for the testing of computations performed by others. Essentially these techniques permit you to ask whether the computation performed by another was done correctly. Versions of this are widely used in cryptographic applications and my instinct is that it could prove to be another dimension to building an intrinsically resilient operating system. Could individual processes use these techniques to validate the data they receive from other processes and if they are valid, is that sufficient to attest that the other processes are trustworthy?
Obviously I’m just spitballing here, plucking interesting ideas that if woven together might help create a more robust and resilient system. Plenty of readers with more expertise in these areas will likely email me explaining in depth why I’m all wet here. My suspicion is that there is a pool of research that can get us much closer to what we’re hoping for but is viewed as too much of a paradigm change and/or too difficult for end users, which prevents mass commercial adoption[8].
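As a small illustration of checking someone else's computation without redoing it, here is a sketch of Freivalds' algorithm, a classic probabilistic check that a claimed matrix product is correct. To be clear about the substitution: this is not a zero-knowledge proof and not what any operating system does today; it is simply one of the easiest examples of a verifier doing far less work than the party that produced the result. The function names and the default of 40 rounds are choices made for this example.

```python
import random


def mat_vec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def freivalds_check(a, b, c, rounds=40, rng=None):
    """Probabilistically verify the claim a @ b == c.

    Each round picks a random 0/1 vector r and compares a(b r) with c r,
    which costs O(n^2) instead of recomputing the O(n^3) product.
    A correct c always passes; a wrong c survives a single round with
    probability at most 1/2, so at most 2**-rounds overall.
    """
    rng = rng or random.SystemRandom()
    width = len(c[0])
    for _ in range(rounds):
        r = [rng.randint(0, 1) for _ in range(width)]
        if mat_vec(a, mat_vec(b, r)) != mat_vec(c, r):
            return False  # caught an inconsistency: the claimed result is wrong
    return True


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
claimed = [[19, 22], [43, 50]]  # result reported by an untrusted process
print(freivalds_check(a, b, claimed))
```

Note what the check does and does not establish: it says this particular result is consistent with the inputs, not that the process that produced it is trustworthy in general, which is exactly the gap the question above is probing.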
If we as practitioners are unable to advance the idea of systems resilient to their own intrinsic flaws, what other options do we have? I do think there are several things we can do, though they remain responsive to these intrinsic flaws rather than directly addressing them. None of this is rocket science or particularly novel, but as with so many cybersecurity initiatives we’re simply moving too slowly.
- Accelerate and mature your zero trust initiatives and become much more aggressive about your posture checking of hosts. Simply checking for anti-malware software and that the host firewall is enabled is far too weak a measure. Ensuring unnecessary software has been removed and that administrative accounts require strong MFA should be included. Any changes in privileges should trigger alerts and be logged. Confirm that remote access is configured to only use another bastion host. Essentially, posture checking needs to ensure the system is configured to meet a hardening standard.
- Manage end user systems. At most universities, especially at larger ones, the vast majority of hosts are not centrally managed. Tricky due to the heterogeneity of systems? Sure. Expensive because there’s a large support dimension? Sure. Political hurdles since VIP faculty get to ignore policy? Sure. Essential because it’s not 1975 anymore? Absolutely.
- Aggressively use allowlisting of IPs for remote access. In the old days of international travel (actually only 20 or so years ago) you had to tell your credit card companies you were traveling overseas and would be using your credit card. VPN and authn activity from overseas should also require explicit acknowledgement of intended travel to specific countries. This could become part of your NSPM-33 foreign travel processes (registering an intent to travel to a specific country could trigger location-appropriate cybersecurity briefings).
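The posture-checking and travel-registration ideas in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the HostPosture and TravelNotice records, their field names, and the functions are all hypothetical, and a real deployment would pull this data from an endpoint agent, NAC, or identity provider rather than hand-built objects.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class HostPosture:
    """Hypothetical snapshot of a host, as an endpoint agent might report it."""
    antimalware_running: bool
    firewall_enabled: bool
    unapproved_software: list = field(default_factory=list)
    admins_without_mfa: list = field(default_factory=list)
    remote_access_sources: list = field(default_factory=list)


def posture_failures(host: HostPosture, bastion_hosts: set) -> list:
    """Return every reason the host fails the (assumed) hardening standard."""
    failures = []
    if not host.antimalware_running:
        failures.append("anti-malware agent not running")
    if not host.firewall_enabled:
        failures.append("host firewall disabled")
    if host.unapproved_software:
        failures.append(f"unnecessary software present: {host.unapproved_software}")
    if host.admins_without_mfa:
        failures.append(f"admin accounts without MFA: {host.admins_without_mfa}")
    stray = [s for s in host.remote_access_sources if s not in bastion_hosts]
    if stray:
        failures.append(f"remote access allowed from non-bastion sources: {stray}")
    return failures


@dataclass
class TravelNotice:
    """Hypothetical record of a user's registered intent to travel."""
    user: str
    country: str
    start: date
    end: date


def login_allowed(user, country, on, home_country, notices) -> bool:
    """Allow foreign VPN/authn activity only if it matches a registered trip."""
    if country == home_country:
        return True
    return any(
        n.user == user and n.country == country and n.start <= on <= n.end
        for n in notices
    )


host = HostPosture(True, True, admins_without_mfa=["root"],
                   remote_access_sources=["10.0.0.5", "203.0.113.9"])
print(posture_failures(host, bastion_hosts={"10.0.0.5"}))
```

Returning the full list of failures, rather than a single pass/fail flag, is a deliberate choice here: it gives the help desk something actionable and makes it easier to alert on specific changes such as a new administrative account without MFA.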
As I said, these are all common sense and should be familiar to any cybersecurity practitioner. But we also need to fully acknowledge that these are countermeasures in response to the weak hand we’ve been dealt, that of living with flawed products. Everyone who buys from a major vendor meets with them periodically to, in the words of the vendor, ‘see how it’s going’ (i.e., try to upsell you) - how many of us come to those meetings with a list of known vulnerabilities in their product and ask for compensation or a reduction in price to account for the lost time and productivity of patching? How many of us have tried to get terms related to vulnerabilities, er, software defects, addressed in our contracts? How many CISOs or CTOs have ever even asked the question “how many serious vulnerabilities in the product can we tolerate before we abandon it?” Shouldn’t this be an element of every product risk assessment?
Year after year we see breaches caused by software defects and yet vendors are not held accountable, nor are we prepared to switch vendors. If switching costs are too high, and vendors have no accountability for the quality of their product, we’ve incentivized the status quo.
I think 2400 words is enough so I’ll wrap this up - particularly since I’m low on suggestions. But that’s why I’ve labeled this an open question. It would be great to see a workgroup tackle the issue more directly: how do we architect, build, manage, and deploy systems at scale that reflect their intrinsically flawed nature? Zero trust seems like a great start, but only a half measure.
[1] As an aside, I have to wonder if we are in some way constrained by the observation that operating systems are generally developed from a single architectural model that has to serve both the professional market and the consumer market. I realize it’s not quite black and white, but it seems like limits on what consumers will tolerate bleed over and similarly constrain professional systems architecture. Though I suppose consumer demands on usability have a positive impact on the professional market as well.
[2] Given the embrace of regressive taxation in the US, perhaps we need to replace ‘death and taxes’ with an alternative. ‘Death and fascism’? Or by author, Hunter Thompson: death and drugs; Carrie Nation: death and sobriety; Musk: death and failed businesses; Trump: death and death to everything he touches; Toni Morrison: death and memory; Ernest Hemingway: death and masculinity; Franz Kafka: death and paperwork; Albert Camus: death and absurdity. Now I’ve got to rethink what I put on my business cards.
[3] The image of an armed commando going through the data center on TrustedSec’s website is, well, off-putting. We’re now marketing based on movie memes. Unless you live in a warzone or parts of the rural south, armed men in your data center aren’t even on your all hazards list.
[4] I leave it as an exercise to the reader to identify who is trying to shut down vaccinations, clean water protections, and additionally make medical care more difficult to obtain. No analogy is too absurd for our current world.
[5] https://www.illumio.com/blog/what-is-nano-segmentation. I’ve not looked at this product, nor done much of a market search for similar tech, but Google put it at the top of my search results.
[6] Encryption offers an interesting opportunity in terms of managing compliance requirements. While regulations are often opaque when discussing encryption, I would hope that effectively managed encryption reduces the surface area in scope of compliance requirements. Homomorphic encryption approaches in particular might serve to ‘tamp down’ the compliance fever just as an antibiotic does to a person.
[7] In-memory encryption is always critiqued for being too memory and processor-intensive. Somehow we can afford native AI-dedicated hardware to help me produce images of cute puppies but nothing to enhance system security? TPM was a step in the right direction but clearly isn’t sufficient on its own.
[8] Of course, one can also imagine that anti-malware software evolves with the maturation of AI/ML into a kind of real-time cybersecurity agent, operating independent of any other system processes but with the ability to monitor and react to all other system activity. However, given that modern LLMs often make mistakes of the most basic form, are themselves largely proprietary black boxes, and of unknown software quality, it’s difficult to see this coming to fruition anytime soon. Let’s not even begin to discuss the privacy nightmare that would ensue.


