To ssh, or not to ssh
In today’s avant-garde DevOps world, it’s verboten to ever actually log in to a server to look around, let alone do, or heaven forbid, change anything.
After all, in the world of immutable infrastructure, infrastructure-as-code, and DevOps more generally, why would you ever need or want to do such a thing?
However, in the real world, from very traditional servers and VMs, to the more agile cloud and cloud-native systems, to the advanced CI/CD and infra-as-code shops, ssh use remains a reality.
Why is this so? And what can be done about it?
First, most systems just don’t have the basic structure, processes, or tooling needed to avoid ssh. They are built by hand or from simple images, after which various languages and services are installed. These servers are classic ‘pets’, far from immutable, and thus live a very long time, often through years of changes and upgrades.
These first-generation systems almost certainly outnumber all the rest, and are fraught with well-known problems of drift, maintenance, and reliability, plus a long list of anti-agile, anti-DevOps anti-patterns.
Ops teams often ssh to these machines to do anything and everything, and generally have many higher priorities than thinking about eliminating ssh.
Second generation systems are more agile or ‘devops’-oriented, where code is pushed often, servers are (mostly) retired often, containers might be used, etc. But the infrastructure is still mostly separate from the developers, code, or processes; still lots of ‘pets’ and long-lived ‘cattle-like’ servers.
Ops and Developers often ssh to these machines to troubleshoot and debug, maybe upgrade basic stuff, but also use various tools like Ansible, Puppet, etc. to manage more things, usually still close to the code, such as languages, modules, some services. Jenkins might show its face, too.
Third-generation systems are the latest thing, in full ‘devops’ mode, including infrastructure-as-code, immutable servers, lots of cloud services, containers, and maybe functions-as-a-service.
In theory no one touches anything manually, as everything is a fully-observable grey box. Code pushes every few minutes after fully automated testing plus canaries, etc., and nothing lives more than a few hours or days.
No one would think to ssh into these servers, if they even could. Except when they have to, maybe when no one is looking, to find and fix stuff, make things work, and keep the trains running on time. Just don’t tell anyone.
Which brings us to the ‘100 bitcoin’ question: why do we need to ssh to servers at all? Let us count the reasons, and see how each can be avoided.
First is to install and set up stuff, like adding Ruby, MySQL, or even Linux users and FTP servers. Very common, very messy, very error-prone. Avoid all this with AMIs, Ansible, or Docker.
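As a sketch of that first alternative, a minimal Ansible playbook can replace the ssh-and-install ritual with a repeatable, version-controlled definition. The inventory group, package names, and user below are placeholders, not a prescription:

```yaml
# Hypothetical playbook: define the server's setup once, in code,
# instead of installing Ruby and MySQL by hand over ssh.
- name: Provision app servers
  hosts: appservers        # placeholder inventory group
  become: true
  tasks:
    - name: Install runtime packages
      apt:
        name:
          - ruby-full
          - mysql-server
        state: present
        update_cache: true

    - name: Create application user
      user:
        name: deploy       # example user, adjust to taste
        shell: /bin/bash
```

The same definition can just as well feed an image build (an AMI via Packer, or a Docker image), which is how you inch toward immutable servers rather than hand-tended pets.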
Second is to investigate or debug things when something goes wrong. This might mean looking at logs, checking configs, running strace or debuggers, or running various CLI tools, such as the ‘top’ family, to get high-resolution information on CPU, IRQs, network sockets, etc.
This second set of ssh tasks is very broad & diverse, and each task requires its own process, tooling, and thinking to overcome; many of these will be system- and company-specific. But overcome them you can, starting with increasing observability: push logs, events, and as much else as you can to external services such as Sumo Logic, Honeycomb.io, APM tools, or even your monitoring system.
Beyond that, there are other monitoring systems & CLI-related tools to help eliminate the bulk of other reasons — these mostly focus on abstracting CLI tools, including eventually strace, gdb, and everything else you think you need.
Overall, these observability & monitoring tools may slow you down initially, but they are far more powerful, scalable, and secure, able to answer questions you could never answer from a command prompt. Don’t stop until everything you want to know about your servers is pushed out somewhere, as you’ll be amazed at what that can do for you.
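The “push logs off the box” step above can itself be done without ssh, again sketched here as an Ansible playbook; the collector address and rsyslog drop-in path are illustrative assumptions, not a specific vendor’s setup:

```yaml
# Hypothetical sketch: forward syslog to a central collector so you
# read logs in one place instead of ssh-ing in to tail them.
- name: Ship logs to a central collector
  hosts: all
  become: true
  tasks:
    - name: Forward all syslog traffic over TCP
      copy:
        dest: /etc/rsyslog.d/90-forward.conf
        content: "*.* @@logs.example.internal:514\n"   # '@@' = TCP in rsyslog
      notify: restart rsyslog

  handlers:
    - name: restart rsyslog
      service:
        name: rsyslog
        state: restarted
```

Swap the rsyslog rule for your log shipper of choice; the point is that once this runs everywhere, ‘ssh in and grep the logs’ stops being the default move.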
The third reason to use ssh is to change something when things go wrong, such as updating a configuration file, changing a parameter, restarting services, or even patching code. These are different from the investigation/debugging reasons above, because they are write operations and cannot so easily be done correctly or safely via remote, centralized tools.
This is very challenging, though some people use Ansible, Puppet, etc. for well-defined, semi-ad hoc changes, while some SaaS tools aim to update configs and services in real time. Patching code or making other changes closer to the developers’ world remains challenging, though Docker helps in many cases, as long as the boundaries are well-defined.
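A ‘semi-ad hoc’ change of the kind just described might look like the following Ansible sketch, run once from a central place instead of ssh-ing into each box. The file path, parameter, and service name are made-up examples:

```yaml
# Hypothetical sketch: bump one config value across a fleet and
# restart the service, as code, with an audit trail.
- name: Raise worker count and bounce the service
  hosts: webservers
  become: true
  tasks:
    - name: Update the worker_processes setting
      lineinfile:
        path: /etc/myapp/app.conf
        regexp: '^worker_processes'
        line: 'worker_processes 8'
      notify: restart myapp

  handlers:
    - name: restart myapp
      service:
        name: myapp
        state: restarted
```

It’s still a mutation of a live server, but it is repeatable, reviewable, and logged, which is most of what the hand-typed ssh session lacks.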
Overall, with reasonable planning, virtually all ssh access can be eliminated, and I say that as a guy who has lived in ssh for decades across very diverse systems. I love and hate it at the same time, but a correctly built ssh-free environment is surely the future for all of us.
Get started by first believing it’s possible, then building prioritized lists of what you do in ssh, and then finding alternatives that help you slowly reduce its use.
Alongside other agile and DevOps practices that help evolve your infrastructure, in a few years you won’t even remember where you left your ssh keys.