Top 10 Ops Engineers’ Dream Tools
Things we all wish we had
Operations and DevOps engineers all dream of great tools to manage their systems. They know their lives would be so much easier if these powerful tools existed, tools that could handle all types of systems, on all types of architectures, and in all phases of a system’s lifecycle. This is what engineers dream of:
1) Full-Stack Tools — Engineers need full-stack tools, from the bottom to the top, across all layers, services, and parts of a system.
Today’s tools are far too fragmented, often with a very clear split between clouds, services, and code. Numerous basic tools exist to help deal with AWS, Azure, etc. There are point tools, such as VividCortex for MySQL, and APM tools, like New Relic, for the code. But other than a few basic cross-over monitoring functions, these never even look at, let alone manage, the whole stack: from clouds and data centers, through services like Nginx, Tomcat, and MySQL, to the code, load balancers, and application delivery. When everyone looks at just an isolated piece of the stack, the whole stack suffers, along with the poor engineers.
2) Read and Write — Engineers need tools that can read and write systems.
The vast majority of today’s tools are either read-only or write-only. Most are read-only, which means they are good at monitoring but can rarely do anything; all ‘actions’ happen through ssh, cloud consoles, or some other tool. The write-only tools are limited to orchestration engines like Puppet or Ansible, or build systems like Jenkins, whose input is just the code or configs, never real systems, environments, or situations. A few monitor-response systems, like StackStorm, do better and can, for example, restart services, but they can rarely accomplish more complex updates, configuration, architecture changes, scaling, or other vital tasks. And no one dares to try to sync the read and write systems to a unified model of the system or its environment.
3) Design System — Engineers need a full-stack, visual design system that can forward and reverse-engineer any system, anywhere.
There are no actual design systems available for real systems. Editing an Ansible playbook or building CloudFormation templates is not the same as designing a system. A real design system has a full-stack visual designer and templates that allow re-use. It can set up everything needed for every part of every system at every level, including all the latest technologies, such as Docker, Clusters, Cloud Services, and Micro-services, like Lambda. And a real design system can reverse-engineer an existing system into the designer, allowing an engineer to make changes and then push them back to update the actual system.
4) Build, Sync, and Change — Engineers need tools that can build and update systems of any complexity and scale.
The world lacks full-stack build tools. Today’s ‘automation’ tools are very script-driven and are a far cry from what could be considered full-stack. While they can create things, they really don’t build whole systems in any repeatable way, especially once service configuration, such as for Nginx or MySQL, is included.
Many current tools use AMIs or basic recipes/playbooks from version control but then lack any ability to sync with existing systems, to handle changes and updates, to clone or diff, or to really do more than single, from-scratch pushes.
New tools need to be able to build, sync, and change. And, most importantly, they need full-stack support, from cloud and architecture all the way down to detailed vhost configuration options, MySQL options, and Java GC modes, plus of course any and all cloud services, hardware, physical disk layouts, kernel settings, and physical networks.
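As a rough illustration of what "sync" could mean in practice, the sketch below compares a desired set of settings with what a system actually reports and produces a list of changes. It assumes settings can be flattened into simple key/value pairs; the function name `plan_changes` and all setting names and values are hypothetical, chosen only for this example.

```python
def plan_changes(desired, actual):
    """Compare two flat dicts of settings and return the actions needed
    to bring `actual` in line with `desired`."""
    plan = []
    # Walk the union of keys in sorted order so the plan is deterministic.
    for key in sorted(desired.keys() | actual.keys()):
        if key not in actual:
            plan.append(("create", key, desired[key]))
        elif key not in desired:
            plan.append(("delete", key, actual[key]))
        elif desired[key] != actual[key]:
            plan.append(("update", key, desired[key]))
    return plan


# Hypothetical settings spanning several layers of the stack.
desired = {
    "nginx.worker_processes": "4",
    "mysql.max_connections": "500",
    "jvm.gc": "G1",
}
actual = {
    "nginx.worker_processes": "2",
    "mysql.max_connections": "500",
    "kernel.vm.swappiness": "60",
}

for action in plan_changes(desired, actual):
    print(action)
```

A real tool would also have to read the actual state from live systems and apply each action safely; the sketch covers only the comparison step.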
5) Advanced Monitoring and Alerting — Engineers need advanced, deep monitoring for clusters and services, equipped with alerting, notifications, auto-remediation, and machine learning for predictive alerting based on anomaly detection.
Most people are monitoring with first-generation tools like Nagios, Cacti, or MRTG, second-generation tools like Zabbix, or occasionally third-generation tools like SignalFx or Datadog. Most of these have limited per-service metrics and few advanced features, let alone cluster or service-level support.
Few have sophisticated alerting, muting management, root cause analysis, or easy-to-use troubleshooting tools, nor real integration with slow logs, Java logs, or multi-server events. None have automated repair, alert handling, or expert system help to troubleshoot and fix problems.
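To make "alerting based on anomaly detection" concrete, here is a minimal sketch that flags a metric sample when it deviates strongly from the samples just before it. It uses a simple rolling z-score, which is only one of many possible techniques and far simpler than real machine learning; the function name, window size, threshold, and latency figures are all assumptions made for illustration.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return the indices of samples that lie more than `threshold`
    standard deviations from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as anomalous.
            if samples[i] != mu:
                anomalies.append(i)
        elif abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds, with one obvious spike.
latency_ms = [20, 21, 19, 20, 22, 21, 20, 95, 21, 20]
print(find_anomalies(latency_ms))  # prints [7]
```

Note that the spike inflates the statistics for the following windows, which is one reason production systems tend to use more robust baselines than a plain mean and standard deviation.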
6) Auto-Discovery and Config — Engineers need tools that can auto-discover running services, auto-configure them for monitoring, auto-find the various links between servers and services, and auto-draw the proper diagrams for any system.
While some tools can find running services, none is very complete, nor can they automatically configure running services for monitoring. They can’t write Java JMX configs, change Nginx to allow monitoring, or install required modules, let alone handle multiple instances of Tomcat, Redis, or other services per server.
Furthermore, the vast majority don’t find the links between systems and how they interact, nor can they auto-draw diagrams of how a system is architected or operates, especially if it includes Docker, Cloud Services, or Clusters. All that is left to the poor engineer.
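The link-finding idea can be sketched in a few lines: given observed network connections, group them into a map of which host talks to which service. This is a minimal illustration that assumes the connection data has already been collected somehow and that services listen on their usual default ports; `KNOWN_PORTS`, `build_link_map`, and the host names are hypothetical.

```python
from collections import defaultdict

# Assumed mapping of default ports to service names (illustrative only;
# real systems often run services on non-default ports).
KNOWN_PORTS = {80: "nginx", 3306: "mysql", 6379: "redis", 8080: "tomcat"}


def build_link_map(connections):
    """Turn (source_host, dest_host, dest_port) observations into
    {source_host: ["service@dest_host", ...]} with duplicates removed."""
    links = defaultdict(set)
    for src, dst, port in connections:
        service = KNOWN_PORTS.get(port, f"port-{port}")
        links[src].add(f"{service}@{dst}")
    return {src: sorted(targets) for src, targets in links.items()}


# Hypothetical observations, including one repeated connection.
observed = [
    ("web1", "app1", 8080),
    ("app1", "db1", 3306),
    ("app1", "cache1", 6379),
    ("web1", "app1", 8080),
]

for source, targets in build_link_map(observed).items():
    print(source, "->", targets)
```

The resulting map is the kind of structure a diagramming step could consume; gathering the observations and drawing the diagram are the hard parts this sketch leaves out.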
7) Deep Tech Tools — Engineers need a broad and deep set of service-specific tools with key metrics, unique tools and analyses, troubleshooting expertise and rule systems, dynamic reporting, and specific actions and automation.
There are very few tools that actually help Ops Engineers or offer real support for experts, such as DBAs, SREs, DevOps engineers, and security engineers. When real systems need deep troubleshooting or tuning, very few tools are available to readily help.
Some nice systems, like VividCortex, exist for specific services, but they are expensive and limited to a few individual services. Few tools do any real-time analysis, and none really have views into, or knowledge of, how specific services work and act under heavy loads or faulty conditions.
8) Cloud Tools and Integration — Engineers need full cloud integration, modeling, and reporting, not only for the above-mentioned design & build processes, but also to find, fix, and configure various cloud components and services.
Cloud tools today mostly exist for cost management and reporting, monitoring metrics, and some limited build integrations, such as for Ansible or Terraform. There are a few compliance systems, such as Evident.io, though they don’t integrate with anything else. Very few tools really extend cloud management or even add basic tools, like tagging and searching.
There are no real expert systems for troubleshooting, performance management, architecture wizards, security reporters, DBA tools, etc. to help build and manage these complex systems. New, fully integrated cloud tools should work directly with cloud components and systems: searching and managing tags, managing costs, and providing reports, including compliance, audit, and security reports.
9) Runbooks and Automation — Engineers need valid best-practice runbooks and procedures, both manual and automated, across the full-stack and breadth of systems, services, and situations.
Alerts and problems happen 24/7 and teams need to know what to do, how to solve the problem, and the key steps for getting things back on-line and performing as expected. But few monitoring or management tools have real runbooks, let alone root-cause-analysis or expert systems to help them.
Plus, almost no system can take automated action based on advanced alerting, especially with proper security and protection for systems-under-management. A few systems monitor auto-scaling and a couple can manage it, but auto-action and auto-healing are currently non-existent.
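One way to picture a runbook that is both manual and automated, with some protection for the systems under management, is a lookup from alert type to ordered steps plus an allow-list of steps that are safe to run unattended. The sketch below is purely illustrative: the alert types, step names, and the `plan_response` function are invented for this example and do not come from any existing tool.

```python
# Hypothetical runbook: alert type -> ordered remediation steps.
RUNBOOK = {
    "service_down": ["check_process", "restart_service", "verify_health"],
    "disk_full": ["find_large_files", "rotate_logs"],
}

# Hypothetical allow-list of steps considered safe to run without a human.
SAFE_TO_AUTOMATE = {"check_process", "restart_service", "verify_health", "rotate_logs"}


def plan_response(alert_type):
    """Split a runbook into (automated_steps, manual_steps).

    Steps run automatically only up to the first one that is not on the
    allow-list; everything from that point on is left for an engineer,
    so steps are never executed out of order."""
    steps = RUNBOOK.get(alert_type)
    if steps is None:
        return [], ["escalate_to_on_call"]
    automated = []
    for step in steps:
        if step not in SAFE_TO_AUTOMATE:
            break
        automated.append(step)
    manual = steps[len(automated):]
    return automated, manual


print(plan_response("service_down"))
print(plan_response("disk_full"))
print(plan_response("unknown_alert"))
```

Stopping at the first unsafe step is a deliberately conservative design choice: it trades some automation for the guarantee that a human sees the situation before anything risky happens.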
10) CMDB and Audit — Engineers need deep and powerful CMDBs for parsing, inspecting, and versioning each and every configuration and component on every level.
Configuration Management Databases are important components of ITIL-based IT Management Tools, but are often limited to basic hardware specifications and lists of installed software and versions. Very rarely do they dive more deeply into the real configurations of the OS, Clouds, or Services, such as Apache, Redis, and PostgreSQL. They have weak versioning and history and usually have no idea who made what changes or when.
Generally, current systems cannot compare servers, find configuration drift, or alert when various changes occur. Plus, they cope poorly with dynamic infrastructure, Docker, auto-scaling, micro-services, and other modern stack elements. New CMDBs need to incorporate alerting, differencing, and cloning, along with integration into all the other tools to assist in audits, compliance, monitoring, root-cause analysis, and troubleshooting.
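Comparing servers to find configuration drift can be sketched as follows: for each setting, take the most common value across a group of servers as the baseline and report any server that differs. This is a minimal illustration assuming configurations are already available as flat key/value pairs; the function name `find_drift`, the server names, and the Redis-style settings are hypothetical.

```python
from collections import Counter


def find_drift(configs):
    """configs maps server name -> {setting: value}.

    Returns {setting: {server: value}} listing every server whose value
    differs from the most common value for that setting. A setting that
    is missing on a server is reported with the value None."""
    drift = {}
    all_keys = set().union(*(cfg.keys() for cfg in configs.values()))
    for key in sorted(all_keys):
        values = {server: cfg.get(key) for server, cfg in configs.items()}
        # The most common value serves as the assumed baseline.
        baseline, _count = Counter(values.values()).most_common(1)[0]
        outliers = {s: v for s, v in values.items() if v != baseline}
        if outliers:
            drift[key] = outliers
    return drift


# Hypothetical cache servers that should be configured identically.
configs = {
    "cache1": {"maxmemory": "2gb", "appendonly": "yes"},
    "cache2": {"maxmemory": "2gb", "appendonly": "yes"},
    "cache3": {"maxmemory": "1gb", "appendonly": "yes"},
}

print(find_drift(configs))  # prints {'maxmemory': {'cache3': '1gb'}}
```

Using the majority value as the baseline is an assumption; a CMDB with proper versioning could instead compare each server against a stored, approved configuration.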
On top of the CMDB, engineers also need comprehensive full-stack best-practice audit tools, with recommendations and, ideally, automated correction of most issues across a wide variety of clouds, services, configurations, and systems.
In general, today’s audit systems are overly simplistic and limited to only the cloud layers or specific services, like MySQL (i.e. not full-stack). Real systems are complex and dynamic, and it’s often hard to know if best practices and proper processes are being followed and if systems are properly configured for optimum reliability, performance, and security.
If engineers had their way, these Top 10 dream tools for managing modern systems would be a reality. Unfortunately, none of the current systems exists in the form engineers need or functions at optimal levels, and none is fully integrated the way it should be.