DevOps for Early Stage Startups
Advice for non-technical founders to get going
This is the second article in my series on Software Development & DevOps for Early Stage Startups (see the first article), focused on non-technical founders and what they should do as they get to MVP.
Infrastructure & Deployment
Once you have some working code and an application to deploy, you need some place to actually run it, usually on cloud infrastructure such as AWS, Azure, or Google Cloud (GCP).
This can start simple, but will rapidly get complicated as you get near any sort of production deployment, with a lot of small, and not so small, items that need to come together correctly to deploy a system.
More importantly, once some infrastructure is built, it will change rapidly and often as you grow, find new needs, change elements of your tech stack, etc. This change is usually where the problems start to creep in, with security, cost, and stability usually becoming issues.
Getting Started
Start with a set of cloud accounts that you own (not owned by your 3rd party developers, nor by employees). The exact structure will vary by cloud, but one ‘account’ is probably good to start, with provisions for at least three or four completely separate environments: dev, test, and production, plus maybe staging later.
To get things going, teams often manually create a dev environment to get some VMs, Docker, databases, etc. up and running on day one so they can deploy and test stuff.
That’s fine, but you should move to a separate, and carefully controlled, set of environments as soon as possible, ideally within a month or two, and definitely long before you do production deployments.
As always, give each developer and 3rd party system or API separate users and credentials, so you can later control their access, deactivate when they depart, etc.
You should also set some cost alerts on the account so you have some sense of the money being spent, and so a mistake that suddenly spins up a $5,000–10,000 per month resource doesn’t go unnoticed.
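If you adopt infrastructure-as-code (covered in the next section), cost alerts can be defined right alongside the rest of your infrastructure. A minimal, hypothetical sketch in Terraform for AWS is shown below; the budget name, dollar amounts, and email address are placeholders, not recommendations.

```hcl
# Hypothetical cost alert, assuming AWS and Terraform (all values are placeholders).
resource "aws_budgets_budget" "monthly_cost" {
  name         = "startup-monthly-budget"
  budget_type  = "COST"
  limit_amount = "1000"      # total monthly spend limit, in USD
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80            # alert at 80% of the limit
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["founder@example.com"]  # placeholder address
  }
}
```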
Infrastructure as Code
Modern infrastructure is managed as code, using the infrastructure-as-code (IaC) model, usually with either a cloud’s native system, e.g. CloudFormation on AWS, or an industry-standard cross-cloud tool like the ever-popular Terraform from HashiCorp.
Ideally, one of your DevOps engineers familiar with the tools and the chosen cloud will start to code, build, and test some infrastructure, usually for the dev system first, but sometimes starting with the production system and working backwards (especially if there is a simple manual development system already in use).
If no one on your team knows how to do this, get some outside help to get started, as then it’s usually fairly easy to maintain and make simple changes from there.
If you use Terraform, be sure to use their free Terraform Cloud offering, too, so your critical ‘state’ files are stored remotely (and backed up). This lets multiple engineers update your infrastructure, provides additional security, etc.
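To make this concrete, a minimal, hypothetical Terraform configuration with its state stored in Terraform Cloud might look like the sketch below. The organization, workspace, region, and bucket names are all placeholders.

```hcl
# Minimal sketch: keep Terraform state in Terraform Cloud so the whole team
# shares one source of truth (organization and workspace names are placeholders).
terraform {
  cloud {
    organization = "my-startup"
    workspaces {
      name = "dev"
    }
  }

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# One example resource so the first run creates something visible.
resource "aws_s3_bucket" "app_assets" {
  bucket = "my-startup-dev-assets"  # S3 bucket names must be globally unique
}
```

From here, each new piece of infrastructure (databases, load balancers, DNS, and so on) is added as another resource in the same repository, reviewed and versioned like any other code.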
This process will both end up with a nice, controlled, documented system and force a number of issues to the surface, such as overall security, URL and SSL certificate lists, per-service resource requirements, database options, secrets management, log collection, networking challenges, and monitoring plans.
All of these are best solved early and often, lest they blow up into major challenges just when you want to go live or have important milestones to meet.
Also, making frequent changes to the environments and expanding them for dev, test, and production will force the configurations and code to improve as they deal with assumptions, changes, update issues, etc.
Secrets Management
Managing secrets in cloud environments can be quite challenging. It’s very important to get this right, or at least well secured, or else hackers and others can easily compromise, or even destroy, your entire system.
Note that secrets include all users, passwords, keys, authentication data, plus most critical configuration data such as host names, IP addresses, database names, etc. If you don’t want it broadcast on the Internet, it’s a secret.
Generally, try to use the secrets system your cloud provider has, and find ways to integrate it with your run-time environment. This can be challenging, but most systems can at least use environment variables, which can be connected with secrets managers in a variety of ways.
Whatever you do, never, ever hard-code secrets in your source code (and use tools to alert you if you do), but also keep them out of configuration, deployment, CI/CD, and other files and systems.
Keep them in one place and centrally managed, if possible, or at least spread to only your CI/CD and cloud systems, and keep it simple as complexity is the enemy of security. Regardless, force your developers to have a secrets process and tooling right from day one.
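As one illustration, here is a minimal, hypothetical sketch assuming AWS Secrets Manager and ECS, where a database password is stored as a managed secret and injected into a container as an environment variable at startup. The names, image, and layout are placeholders, and the IAM execution role the task needs to read the secret is omitted for brevity.

```hcl
# Hypothetical sketch: a secret stored in AWS Secrets Manager and injected into a
# container's environment at run time (never written into code or config files).
resource "aws_secretsmanager_secret" "db_password" {
  name = "myapp/dev/db-password"  # illustrative naming convention
}

resource "aws_ecs_task_definition" "api" {
  family                   = "myapp-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "512"
  # An execution role with permission to read the secret is also required (omitted here).

  container_definitions = jsonencode([
    {
      name  = "api"
      image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp-api:latest"  # placeholder image
      secrets = [
        {
          name      = "DB_PASSWORD"                              # env var the app reads
          valueFrom = aws_secretsmanager_secret.db_password.arn  # resolved by ECS at startup
        }
      ]
    }
  ])
}
```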
Note you can also use more sophisticated products like Vault from HashiCorp, but they are often complex to use, manage, and secure correctly, and thus are best left to later phases when you need more powerful tools.
Kubernetes
You will undoubtedly be pushed to use Docker containers and Kubernetes very early on. This may or may not make sense, depending on your team’s experience level and the size or complexity of what you are building.
Docker is very good to use, and use it you should, for reasons covered in the first article in this series. Just be sure you use best practices to build secure and maintainable Docker images.
Generally, you probably don’t need Kubernetes, and if you do need it, you probably should not try to run it yourself. There are many cloud-managed options available now that remove most of the operational burden, though not always the overhead of designing, configuring, and running a fairly complex system. Kubernetes is a real bear to run and manage, so even if you need it, try hard to let someone else run and manage it for you.
A lot also depends on your architecture. For example, if you have a basic web or SaaS application with a JavaScript front-end and a backend consisting mostly of APIs from a single code base, it’s often easier to launch with a few backend containers and a static front-end, a couple of load balancers, and that’s it. Your code can run in various ways, e.g. on VMs, Docker, or various simple container deployment systems such as Amazon Elastic Container Service or Google Cloud Run.
That type of simple deployment will leverage all the benefits of containers and fairly dynamic run-time environments without the complex overhead of Kubernetes, which you can always move to over time as your needs grow.
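As a rough illustration of how little is needed, a single containerized backend on a managed service such as Google Cloud Run can be described in a few lines of Terraform. This is a hypothetical sketch; the service name, region, and image are placeholders.

```hcl
# Hypothetical sketch: one containerized backend on Cloud Run, no Kubernetes cluster to operate.
resource "google_cloud_run_service" "api" {
  name     = "myapp-api"    # placeholder service name
  location = "us-central1"

  template {
    spec {
      containers {
        image = "gcr.io/my-project/myapp-api:latest"  # placeholder image
      }
    }
  }

  traffic {
    percent         = 100
    latest_revision = true
  }
}

# Allow public (unauthenticated) requests so users and load balancers can reach it.
resource "google_cloud_run_service_iam_member" "public" {
  service  = google_cloud_run_service.api.name
  location = google_cloud_run_service.api.location
  role     = "roles/run.invoker"
  member   = "allUsers"
}
```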
Monitoring, Logs, Tracing
Once your system is deployed and has any users, you’ll need to monitor and manage it. This means different things to different people, but you should strive for basic monitoring, log collection, and ideally some distributed tracing (usually much harder to do and often not needed).
Start with the built-in cloud services on your platform of choice, but realize most of these are not very capable, and are often hard to use effectively. It’s usually better to use a full-stack service such as Datadog that can collect and combine these all into one place, especially if you have a very small (or non-existent) team. Just watch the cost, especially on logs, as it can easily reach thousands of dollars per month if you have debug code spewing logs everywhere.
Monitoring
Monitoring can get overly complex and unwieldy as folks like to monitor everything. Start by keeping it simple and focused on key metrics that impact your users or the system. Basic monitoring services will automatically pick up things like out-of-memory or low disk space conditions, and you should add a 3rd party service to monitor your public web app and APIs, as that’s what your users actually see, and you need to know if they are down or broken.
You’ll likely want basic service monitoring such as for MySQL databases, to at least see queries per second and other load or scaling metrics that may be helpful at some point.
Beyond that, some tech stacks, notably Java, need extra monitoring of the JVM, especially heap size and usage, as many developers use inadequate default settings, and it’s easy to overload or exhaust server resources without even knowing it.
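If you do end up on a tool like Datadog, alerts themselves can live in your infrastructure code. The sketch below is a hypothetical example of a JVM heap alert; the metric names come from Datadog’s Java integration, and the thresholds, tags, and notification handle are assumptions, not recommendations (provider configuration with API keys is omitted).

```hcl
# Hypothetical Datadog monitor: alert when JVM heap usage stays above 90% of max
# for 5 minutes. Metric names, tags, and thresholds are illustrative only.
resource "datadog_monitor" "jvm_heap" {
  name    = "High JVM heap usage on {{host.name}}"
  type    = "metric alert"
  message = "JVM heap is nearly exhausted, check for leaks or raise limits. @ops@example.com"

  query = "avg(last_5m):avg:jvm.heap_memory{env:production} / avg:jvm.heap_memory_max{env:production} > 0.9"

  monitor_thresholds {
    warning  = 0.8
    critical = 0.9
  }
}
```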
Logs
Good logging is an art form, but get developers in the habit of writing good logs from the beginning, using JSON-formatted data with useful messages plus context such as the server, user, URL, task, etc. This will really help solve problems quickly, especially since you won’t have much of a support or operations team early on. Logs really are your friend, especially good ones.
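For example, a single structured log entry might look something like this; the exact field names are up to you (these are illustrative), but pick a consistent set and use it everywhere.

```json
{
  "timestamp": "2023-04-05T14:32:10Z",
  "level": "error",
  "service": "api",
  "host": "api-2",
  "env": "production",
  "user_id": "u_12345",
  "request_id": "c0ffee-1234",
  "url": "/api/v1/orders",
  "message": "Order creation failed: payment gateway timeout"
}
```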
Work hard to get logs from your applications into a centralized collection system, as this will help developers and operations teams really understand and troubleshoot what is going on, especially in production when unexpected things happen.
Tracing
More advanced than monitoring or logging, tracing (now often grouped under the broader term observability) is about knowing what’s going on inside your application. This is especially useful when there are many services or other moving parts, which make it very hard to know where and why something failed.
Look at Honeycomb.io as the top tool in this area.
APM, JS Errors & RUM
While monitoring and logging are important for your backend services, today’s modern applications tend to have very complex front-end systems, usually based on JavaScript. Since this code actually runs on the users’ laptops or phones, it’s very hard to get good error reporting or troubleshooting info.
To solve this, a number of tools and technologies have evolved, mostly focused on the end user’s experience and what’s happening in the browser on the user’s device.
These include the broad area of Application Performance Monitoring (APM), JavaScript error reporting (tools like Sentry), and Real User Monitoring (RUM) tools from companies like Datadog. Note RUM has now become end-user experience monitoring (EUEM), which has evolved into digital experience monitoring (DEM).
Many of these services have also merged, with new acronyms, and they can get overly complicated, but try to get basic error reporting, response time tracking, and some form of user screen recording, which can be invaluable for troubleshooting (especially with consumer-level users).
Conclusion
This concludes part 2 of this series (see part 1), having covered some challenging areas where many early-stage startups with non-technical founders make mistakes.
Steve Mushero is a Fractional CTO for Silicon Valley and other startups. He’s been a founder, CEO, CTO, Architect, and consultant over the years, along with writing books, speaking, flying helicopters and more. He’s easy to find on the Internet, including at SteveMushero.com