Password Management at Scale
Securing & Sharing Thousands of Secrets w/ Hundreds of People
We have lots of passwords, keys, and secrets. Thousands of them, for hundreds of customers, thousands of servers, services, and systems. As a large-scale MSP, this is our life.
Yet we still struggle with managing all these secrets, as there are no good methods, just less bad ways. We’ve been trying for nearly a decade.
This article is about these challenges, possible solutions, and ideas, as we think lots of folks have the same problems and some hopefully have better ways, too.
First, what kind of secrets do we have, or have to deal with?
Everything from Linux/Windows users and passwords or ssh keys (e.g. root, ec2-user, sudo), to DB users, AWS/Cloud console users, customer system passwords, CDN management consoles, ALOM/DRAC/IPMI cards, endless SaaS tools, SSL private keys, API keys, VPN secrets, NAS & SAN boxes, and hardware firewalls. Plus a lot more.
And all of that is spread over 100+ locations in many countries, with different OS distributions, versions, and customer security policies. So no SSO or simple integrated system will work.
Second, we’ve done the easy stuff
Of course we have LDAP, personal ssh keys, MFA everywhere we can, Federated & MFA’d AWS IAM, logging, audit, and more, but we still have a huge pile of secrets that must be both protected and shared with dozens of people, 7x24.
Last I looked, there were 10,000 entries in our password manager.
Third, a key issue is how to remove access once employees leave
Changing a huge array of passwords every time someone leaves is not practical. In some cases we can remove access via bastion hosts, VPNs, etc., but in others our goal is merely to make sure departing team members don’t walk out with secrets on their laptops. It’s not great.
So what is the solution?
There are several possible options, from trivial key managers (neither convenient nor secure), to tools like KeePass, which is great but makes it hard to share secrets and hard to cut off departed employees, to various web-oriented ‘secure’ applications.
In a perfect world, we’d have a nice web app that managed this, such as ManageEngine’s Password Manager Pro, which is not bad, but it’s expensive and has its own issues, including usability, complexity, and the question of what happens if it’s compromised.
As a general rule, we have to assume all our systems are compromised, as in the recently-talked-about “zero-trust” networks and systems. We assume our PCs, networks, and core systems can be hacked, even though we work hard to avoid that. As we must protect our customers, this is the only way to think, in our view.
There are three levels of issue to think about:
- Secure Storage of Secrets
- Secure Processing of Secrets
- Secure Display of Secrets
Let’s address each of these, then together as a system.
First, storage, which may be the easiest. There are a few secret storage systems, from just encrypting into MySQL to using Hashicorp’s Vault, which is probably the right solution. Use Vault and its multi-part unsealing key to avoid any single-person dependency or risk vector.
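To make the "multi-part unsealing key" idea concrete, here is a minimal sketch of Shamir's Secret Sharing, the scheme behind Vault's split unseal keys. This is a toy illustration, not Vault's code: the function names, the choice of prime field, and the 3-of-5 split are assumptions, and production implementations differ in the field and encoding they use.

```python
import secrets

# Assumption: a large prime field for readability. 2**521 - 1 is a Mersenne
# prime, comfortably larger than the 16-byte demo key used below.
PRIME = 2**521 - 1


def split_secret(secret: int, shares: int, threshold: int) -> list[tuple[int, int]]:
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range for the field")
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    points = []
    for x in range(1, shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner's rule, highest coefficient first
            y = (y * x + c) % PRIME
        points.append((x, y))
    return points


def recover_secret(points: list[tuple[int, int]]) -> int:
    """Lagrange-interpolate the polynomial at x = 0 to get the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


# Five key holders, any three of whom can unseal together.
key = int.from_bytes(b"unseal-key-demo!", "big")
parts = split_secret(key, shares=5, threshold=3)
```

With this split, `recover_secret(parts[:3])` returns the original key, while any two parts alone interpolate to an unrelated value, which is what removes the single-person dependency.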
But there is a second challenge, which is where and how to connect Vault to the system. The easiest option is to run it in the main web system, either on the App Servers or on a VM just like MySQL, etc.
But this leaves it both vulnerable to hacking from other connected servers and fully exposed to the core app source code, i.e. any code that can talk to Vault can read/write secrets from it. Which leads to our second level, processing.
Second, processing of secrets is perhaps the hardest part, if you assume all our code is compromised. This for sure means not managing most of the secret encryption and storage in the main code.
That means putting Vault on a separate well-secured and guarded VM, in a separate VPC or Subnet, and building a VERY specific, yet simple API in front of it, with just three calls: put, get, and delete secret.
And that functionality can be secured, audited, and throttled, and, importantly, it can both require and utilize extra authentication and/or MFA-like info to function. It’s a very important point of control.
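As a rough sketch of what such a narrow API might look like, the class below exposes only put, get, and delete, and runs every call through one gate that throttles, checks an extra proof, and writes an audit record. Everything here is hypothetical: the names `SecretGateway` and `verify_proof`, the in-memory dict standing in for Vault, and the throttle limits are illustrative assumptions, not a real Vault client.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SecretGateway:
    """Hypothetical three-call API sitting in front of the secret store."""

    def __init__(self, verify_proof: Callable[[str, str], bool],
                 max_calls: int = 5, per_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._store: dict[str, bytes] = {}   # stand-in for Vault (assumption)
        self._verify_proof = verify_proof    # extra auth / MFA-like check
        self._max_calls = max_calls
        self._per_seconds = per_seconds
        self._clock = clock
        self._calls: dict[str, deque] = defaultdict(deque)
        self.audit_log: list[tuple[float, str, str, str, str]] = []

    def _gate(self, user: str, action: str, name: str, proof: str) -> None:
        """Throttle, verify the proof, and audit. Raises unless all checks pass."""
        now = self._clock()
        recent = self._calls[user]
        while recent and now - recent[0] > self._per_seconds:
            recent.popleft()                 # forget attempts outside the window
        if len(recent) >= self._max_calls:
            outcome = "throttled"
        elif not self._verify_proof(user, proof):
            outcome = "denied"
        else:
            outcome = "ok"
        recent.append(now)                   # failed attempts count too
        self.audit_log.append((now, user, action, name, outcome))
        if outcome != "ok":
            raise PermissionError(f"{action} {name}: {outcome}")

    def put_secret(self, user: str, name: str, value: bytes, proof: str) -> None:
        self._gate(user, "put", name, proof)
        self._store[name] = value

    def get_secret(self, user: str, name: str, proof: str) -> bytes:
        self._gate(user, "get", name, proof)
        return self._store[name]             # KeyError if the secret is unknown

    def delete_secret(self, user: str, name: str, proof: str) -> None:
        self._gate(user, "delete", name, proof)
        self._store.pop(name, None)
```

The point of the design is that there is exactly one choke point: the core web app never touches the store directly, so bulk reads show up as throttled entries in the audit log rather than as a silent dump.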
Of course you have to securely backup this Vault VM, plus probably encrypt its EBS/Snapshots, etc. and very carefully limit its in- and out-of-band access.
The basic challenge is how to process a single secret, such as the password for the root user of server abc-web1. We have to type, paste, or API that into the core system, which will send it to the storage API, which will use at least a static key to encrypt it and put it into Vault.
But how do we retrieve it when a user needs it?
First, the core system has to authenticate and authorize the user, which is not difficult. This includes making sure the user should have access to this server and password, keeping in mind a hacker can bypass or fake this, enabling their user to be authorized for some or all secrets.
This can only be managed by an end-to-end encryption channel with the end user, set up in advance, somewhat out-of-band. This can be a separate secret, TOTP MFA, PKI / Smartcards, etc.
This is very challenging for three reasons. First, this has to be done per user, which really complicates the secret extraction and transmission process. And this has to be done on the Vault VM, as the core system can’t be trusted.
Second, the new user enrollment process is messy and open to interception by hackers in the core system. Whatever keying method we use, we need to get those keys to the backend storage system secretly, ideally out-of-band or in a way that’s hard to intercept. And there is no way we want to expose the backend storage system to the user directly via web, API, etc., as that is too dangerous.
Third and most importantly, we need a practical & mathematical way to do this per user, which has lots of complications.
For example, we could use the popular TOTP MFA (Google Authenticator) but this is just an authenticator, not an encryptor, thus even if the backend accepts your MFA, it still has to send the secret in the clear to you, allowing possible interception in the core system, browser, on your PC, and otherwise in transit.
And even if we accept this, TOTP MFA has a 30+ second vulnerability window where a hacker can steal your MFA code and then do anything you are allowed to do for 30 seconds or more, including get all the keys they want. Unless you want an MFA per secret, which is not practical.
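The window is easy to see in code. Below is a small sketch of TOTP as specified in RFC 6238 (HMAC-SHA1, 30-second steps), using only the standard library. The `server_accepts` helper and its one-step drift allowance are assumptions about a typical verifier, not a description of any particular product.

```python
import hashlib
import hmac
import struct


def totp(key: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP: HOTP (RFC 4226) over the number of elapsed time steps."""
    counter = unix_time // step
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                       # dynamic truncation
    binary = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(binary % 10**digits).zfill(digits)


def server_accepts(key: bytes, code: str, unix_time: int,
                   step: int = 30, drift_steps: int = 1) -> bool:
    """Assumed verifier policy: accept the current step and one step either side."""
    return any(
        hmac.compare_digest(code, totp(key, unix_time + d * step, step))
        for d in range(-drift_steps, drift_steps + 1)
    )


key = b"12345678901234567890"        # the RFC 6238 SHA-1 test key
stolen = totp(key, 30)               # code observed at t = 30 s
print(stolen == totp(key, 59))       # True: same code for the whole step
print(server_accepts(key, stolen, 89))   # True: still valid a minute later with drift
print(server_accepts(key, stolen, 120))  # False: finally expired
```

Nothing in the code ties the value to a particular secret or request, so whoever holds it inside the window can replay it for as many retrievals as the backend will serve.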
A better authenticator is challenge-response where the backend provides a challenge of some type, which you type or scan into an app, and get a response that you type in. This is common in secure banking systems, and prevents MFA re-use across secrets, but requires higher-end apps or tokens, and is much slower to use. And still no encryption.
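A minimal sketch of that idea, assuming a per-user key shared between the backend and the user's token or app: the backend issues a one-time random challenge bound to one secret name, and the response is an HMAC over both. The class and function names, and the use of HMAC-SHA256, are illustrative assumptions rather than any specific banking protocol.

```python
import hashlib
import hmac
import secrets


def respond(key: bytes, nonce: bytes, secret_name: str) -> bytes:
    """Token/app side: MAC over the challenge and the secret being requested."""
    return hmac.new(key, nonce + secret_name.encode(), hashlib.sha256).digest()


class ChallengeIssuer:
    """Backend side: one-time challenges bound to a user and a secret name."""

    def __init__(self, user_keys: dict[str, bytes]):
        self._user_keys = user_keys
        self._pending: dict[bytes, tuple[str, str]] = {}

    def issue(self, user: str, secret_name: str) -> bytes:
        if user not in self._user_keys:
            raise KeyError(f"unknown user: {user}")
        nonce = secrets.token_bytes(16)          # fixed length, never reused
        self._pending[nonce] = (user, secret_name)
        return nonce

    def verify(self, user: str, secret_name: str,
               nonce: bytes, response: bytes) -> bool:
        # pop() makes every challenge single-use, even when verification fails.
        if self._pending.pop(nonce, None) != (user, secret_name):
            return False
        expected = respond(self._user_keys[user], nonce, secret_name)
        return hmac.compare_digest(expected, response)


alice_key = secrets.token_bytes(32)
issuer = ChallengeIssuer({"alice": alice_key})
challenge = issuer.issue("alice", "abc-web1/root")
answer = respond(alice_key, challenge, "abc-web1/root")
```

Because the response covers both the nonce and the secret name, an answer captured for abc-web1/root cannot be replayed for the same secret or reused for a different one. As noted above, though, this only authenticates the request; the secret itself still travels unencrypted.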
The advantage of going one step further to PKI, i.e. having the backend encrypt each secret to the user’s public key, is that intermediate hackers, at least in the core system, cannot see or steal the secret, as long as they have no valid private key.
On a practical level this is painful, though, as the end user has to have a way to authenticate in real-time, via some type of challenge-response that involves their private key, usually from a smart-card. This is fairly complicated to implement end-to-end, as we need to involve the backend storage system, the core web app, the browser, the PC/Phone OS, and the Smart Card.
Display & Use
Hackers in the core backend system are one thing, and we try to avoid that, but hackers in the PC and/or Browser are more likely, and more dangerous, as in the end we have to have some way to show the secret to the user. If we just put a text string on the screen or clipboard like “sdfh238&@#” there are ways to steal this and send it off to the hacker underground.
Some high-end banking apps use special Java or ActiveX controls to help avoid this, and a dedicated phone app could also be used (unless the phone is compromised). This is a hard problem, and there are probably intermediate measures that can be taken, such as showing images instead of text, but those are then very hard to cut & paste when needed (and useless for keys).
Another way is to use Virtual Desktops (VDI) for all interaction, to remove any local browser and OS use other than pasting into the ssh terminal or SaaS system.
A key idea here is time-separation, i.e. we can separate the time and path used to enter/store secrets into the system from the time and path used to retrieve them much later. This reduces the vulnerability window in which hacked code can steal secrets.
In addition, of course all this has to be HA as we need access to all these systems 7x24. And trusted people need methods to manage, backup, update, and troubleshoot the system.
Plus add best-practice controls around the infrastructure, for access, SELinux, and various other controls, though making sure they don’t themselves open up avenues of attack.
We’ve worked on this for a long time and still have not built a great solution, and thus we rely on a mix of LDAP, SSO, and KeePass solutions, plus bastion hosts, federated access, and MFA all around.
It’s a very hard problem, especially for heterogeneous and customer-premises systems, one that we continue to work on, while hoping others publish best practices, tools, and ideas.