Building & Deploying a Mid-Sized SaaS System

How & why we build our OpsStack platform & application

Steve Mushero
Oct 13, 2020

Lots of people are building SaaS systems these days, often starting from scratch, hunting for tools & procedures, and making things up as they go.

So I thought I’d outline how and why we do it our way, based on best practices and good tools, while keeping it simple and manageable.

First, a bit about OpsStack: it’s a mid-sized SaaS system for IT & Cloud operations, used to build, monitor, manage, and automate on-line systems at scale (dozens to thousands of servers).

It consists mainly of a large and fairly complex SaaS web application, with various server agents, APIs, schedulers, and lots of moving parts. But our main interest for this article is the core SaaS application, which is quite similar to most B2B SaaS systems people are building today.

Goals

Our goal with the system is to keep it simple and manageable, with basic, but good processes. Our developer team has varied in size from 5–10 core engineers over the last few years, with other folks for QA, project and product management, etc.

Scrum/Kanban

The team is fairly experienced, so while we started out early with hard-core tightly-managed Scrum sprints, this evolved into a much lighter-weight and more efficient Kanban process over time.

Scrum is a good process, and has good guidelines, tools, etc. for getting up and running as a new development shop. However, for our core team of about 10 engineers, it was taking up many hours per week in overhead, even for two-week sprints.

At least a half-day and perhaps up to a day per week was being lost in planning, estimations, retrospectives, and the like. The team could have managed it better, but all that overhead was not really adding value, so after a year or so, we migrated to a Kanban approach with simple stages such as Needs Requirements, Pull Queue, WIP, Needs Testing, etc.

Core Languages

The system is built in PHP, which is less sexy these days, but still probably the most common language for on-line systems, especially when coupled to a modern framework AND when carefully managed to high-quality standards. We have about 500,000 lines of PHP code.

JavaScript is, of course, used throughout the user interface, mostly via React.js which makes up most of our UI, all of which has had a fairly steep learning curve for us. We use JS only in the front-end, and really only in React. So while node.js is used for building, it’s not used in production.

Speaking of JS, we use ‘vanilla’ JS, whatever that is, plus JSX for React.

Frameworks

Everything is built up on Laravel, which we chose after looking carefully at all the options in early 2016. It seems we chose well, as Laravel is now the framework of choice for PHP apps, and we are very happy with it.

However, we have found Laravel can be a challenge for run-of-the-mill PHP developers, who may be used to basic PHP with no frameworks, or working in platforms like Magento, Drupal, or WordPress.

Laravel is often too structured and complex for them; in fact, we’ve found Java experience to be quite helpful as Laravel is much more Java app-like in its approaches, complex inheritance, etc. This is evolving, though, as Laravel and similar frameworks become standard.

Style & Rules

PHP is very strictly controlled in terms of style, rules, and checks. We follow the best-practice PSR standards, plus many of our own, all enforced by our style checkers, including phpcs, phpmd, and phpcpd. We do not use phpstan, but we should.
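As a rough illustration of how the three checkers might be wired into one gate, here is a minimal Python sketch. The source directory, the choice of the PSR12 standard, and the phpmd rulesets are all assumptions made for the example; the article does not state which PSR level or rulesets are actually used.

```python
import subprocess
from typing import Callable, Dict, List, Optional

# Hypothetical source directory; the real project layout is not described.
SOURCE_DIR = "app"


def build_check_commands(source_dir: str = SOURCE_DIR) -> Dict[str, List[str]]:
    """Return one command line per style checker.

    PSR12 and the phpmd rulesets below are illustrative assumptions.
    """
    return {
        "phpcs": ["phpcs", "--standard=PSR12", source_dir],
        "phpmd": ["phpmd", source_dir, "text", "cleancode,codesize,unusedcode"],
        "phpcpd": ["phpcpd", source_dir],
    }


def run_checks(
    commands: Dict[str, List[str]],
    runner: Optional[Callable[[List[str]], int]] = None,
) -> List[str]:
    """Run every checker and return the names of those that exited non-zero."""
    if runner is None:
        # Default runner executes the real tool and reports its exit code.
        runner = lambda cmd: subprocess.run(cmd).returncode
    failed = []
    for name, cmd in commands.items():
        if runner(cmd) != 0:
            failed.append(name)
    return failed
```

A build step could call `run_checks(build_check_commands())` and fail the job whenever the returned list is non-empty. Running all three checkers before reporting, rather than stopping at the first failure, gives the developer the full list of problems in one pass.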

Tools

We use a variety of basic tools including:

  • Pivotal Tracker - Our core feature, story, and developer management platform is Pivotal, where we have thousands of items, history, releases, and much more. It’s far from perfect, but works well for us, though its Kanban and drag-and-drop UI could be better.
  • PhpStorm - Most people are using PhpStorm, which we’ve found very powerful and empowering for developing, debugging, documenting, etc.
  • Gitlab - We generally prefer the much easier-to-use all-in-one approach to git servers from Gitlab. The UI is very usable, and allows a lot of operations for releasing, branching, tagging, and other basics that are easy to use, especially across the 100+ projects we have for this and other things.
  • Selenium - Testing at the UI level is done via Selenium and the phpunit framework.
  • Readme.io - Our end-user documentation is hosted in Readme.io, and directly linked to from inside the application.
  • Sentry.io - Crash and error reporting is done by Sentry.io, which has worked quite well over the years, including with integrations to Pivotal to auto create stories.
  • Jenkins - Building and deployment is all run via a fairly large Jenkins system, since builds and tests take a LOT of RAM each.

Git Use

We are of course big Git fans and use it in fairly traditional ways. We use the feature branch model, so each Pivotal story has a git branch, with the story number and title as part of the branch.
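A branch name built from a story number and title could be produced along these lines. This is only a sketch: the exact naming format, the separator, and the length limit are assumptions, and `branch_name` is a hypothetical helper rather than anything from our tooling.

```python
import re


def branch_name(story_id: int, title: str, max_len: int = 60) -> str:
    """Build a git-friendly branch name from a story number and title.

    The "<id>-<slug>" format and the 60-character cap are assumptions.
    """
    # Lower-case the title and collapse anything non-alphanumeric into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    name = f"{story_id}-{slug}"
    # Truncate long titles, then drop any dash left dangling at the cut.
    return name[:max_len].rstrip("-")
```

Keeping the story number at the front means the branch can always be traced back to its story, even when a long title gets truncated.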

Branches are fairly short lived, usually a few hours or at most a few days, and rarely carry over across releases (and are back-merged if they do).

Our overall conflict rate is extremely low, such that a developer may run into a conflict every few months. This is mostly due to modularity and developers mostly ‘owning’ and working on different areas of the system with strict API and other boundaries in-between.

Releases are done from the trunk as their own branch, and of course any post-merge fixes are done from and on that branch if needed. We do hot patch production from time-to-time for easily updated PHP code, though for JS or deeper PHP issues we’ll do a full branch, build, and release cycle.

CI/CD

We use a multi-stage process where each branch is first created from the trunk, then worked on by the developer. Once it’s ready to merge, a merge request (a pull-request in normal git-speak) is created, which triggers a Jenkins job called ‘quick-check’ that’s not really very quick, as it does a full build and some tests over 45 minutes or so.

Only when that test completes can the merge proceed in gitlab, and depending on developer seniority, may auto-merge, be manually merged by them, or manually merged by an approving senior engineer after final review. This balances speed with safety and sanity checks on new people.
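The merge gate described above can be summarized as a small decision function. The sketch below is illustrative only: the three seniority levels and the mapping from level to merge path are assumptions, since the article says only that the outcome depends on seniority.

```python
from enum import Enum


class Seniority(Enum):
    # Hypothetical levels; the article does not name the actual tiers.
    JUNIOR = 1
    MID = 2
    SENIOR = 3


def merge_action(quick_check_passed: bool, seniority: Seniority) -> str:
    """Decide what happens to a merge request once 'quick-check' has run."""
    if not quick_check_passed:
        # Nothing merges until the Jenkins job completes successfully.
        return "blocked"
    if seniority is Seniority.SENIOR:
        return "auto-merge"
    if seniority is Seniority.MID:
        return "self-merge"
    # Newer developers need an approving senior engineer to merge for them.
    return "senior-review"
```

For example, `merge_action(True, Seniority.JUNIOR)` yields `"senior-review"`, while any request whose check failed is `"blocked"` regardless of who opened it.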

Code Reviews

We have not done as good a job as we should on this. Most review, if any, happens at merge time, with senior engineers reviewing the pending merge request for any issues, using the Gitlab comment and review system for tracking and final resolution.

We occasionally do in-person reviews and really should perform these more often, both as educational opportunities for junior engineers, and as a way to get everyone increasingly on the same page on how all the moving parts work.

Testing & QA

Our testing is not as good as we’d like it to be and is pretty focused on unit and feature testing before release. We also have modest UI coverage via Selenium, though keeping this up to date with the UI changes is a full-time job.

The product owner and/or product managers try hard to test features, changes, and fixes before they are released, though some things really have to be tested in production on real data and systems, especially for bugs related to data, which is a lot of them.

Bugs

We never have any bugs. Well, maybe a few.

They get discovered in testing, through Sentry crash reports, or sometimes by customers. All are entered into Pivotal and assigned priorities as needed. The highest priority bugs need fixing ASAP or before the next release, and some have to be hot-patched, especially if and when a release breaks something important in production, which has happened.

Releases

We release on a weekly basis now, which serves us well, especially given our limited testing resources. We used to release every two weeks, but this created too long a delay between development and our ability to test and deliver to our end-users.

We could release more often, though weekly releases usually mean a new feature or fix is only a few days away, which has been a good balance for us.

The release process is semi-automated, and takes 2–3 hours of clock time, as the builds and tests take 45 minutes and are run at each merging step. We should improve this, though it would take a lot of effort for the many small steps and we have not gotten to it.
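The arithmetic behind that figure is easy to check. The sketch below assumes the only significant cost is the 45-minute build-and-test run repeated at each merging step; the number of steps is a guess, as the article does not state it.

```python
# 45 minutes per build-and-test run, as stated in the article.
BUILD_AND_TEST_MINUTES = 45


def release_clock_minutes(merge_steps: int, manual_minutes_per_step: int = 0) -> int:
    """Estimate release clock time when every merge step reruns build and tests.

    merge_steps and manual_minutes_per_step are assumed inputs for illustration.
    """
    return merge_steps * (BUILD_AND_TEST_MINUTES + manual_minutes_per_step)
```

Under these assumptions, three merging steps come to 135 minutes and four come to 180, which is consistent with the 2–3 hour range. It also shows where the time goes: cutting a step or shortening the build pays off directly.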

Environments

We have development, staging, and production environments, all VM-based on AWS. The development and staging systems are essentially identical to the production system, using older copies of production data, etc. but on smaller scales. We’ve never had issues with these environments.

Since this is an operational system that manages other systems, maintaining ‘managed’ target systems for the development environment has proven challenging, including all the data feeds, cloud environments, 3rd party systems, etc.

Over time, we’ve used the staging system less and less, which is not ideal, but reality has meant developing in the development environment and then deploying from Gitlab to production, usually without issue.

One reason we do this is that the system has become increasingly hard to test without production data and especially all the complex configurations, rules, data processing, machine learning models, alarm logic, etc. that are very hard to push back to development.

We also build & deploy via Docker containers for systems that will be on-site. This is fairly easy as we use a single integrated core codebase, for all the web, monitoring, API, batch, and other processing. So normally different VMs run various parts of the code, but it always deploys as a unit.

This makes it very easy to build a single core Docker container that can provide many different services. The container includes web and app runtimes, schedulers, etc. and each is only started when needed.
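One way a single container can start only the services it needs is to choose them from an environment variable at startup. The sketch below is hypothetical throughout: the `OPSSTACK_ROLES` variable, the role names, and the commands (typical php-fpm and Laravel artisan invocations) are assumptions for illustration, not our actual entrypoint.

```python
import os
from typing import Dict, List, Mapping, Optional

# Hypothetical role-to-command table; names and commands are illustrative.
SERVICES: Dict[str, List[str]] = {
    "web": ["php-fpm", "--nodaemonize"],
    "scheduler": ["php", "artisan", "schedule:work"],
    "queue": ["php", "artisan", "queue:work"],
}


def services_to_start(env: Optional[Mapping[str, str]] = None) -> Dict[str, List[str]]:
    """Return the commands for the roles named in OPSSTACK_ROLES (comma-separated)."""
    env = os.environ if env is None else env
    wanted = [r.strip() for r in env.get("OPSSTACK_ROLES", "").split(",") if r.strip()]
    unknown = [r for r in wanted if r not in SERVICES]
    if unknown:
        # Fail fast on typos rather than silently starting nothing.
        raise ValueError(f"unknown roles: {unknown}")
    return {role: SERVICES[role] for role in wanted}
```

With this shape, the same image could run as a web node with `OPSSTACK_ROLES=web` or as a combined on-site box with `OPSSTACK_ROLES=web,scheduler,queue`, and an actual entrypoint would then launch each returned command under a process supervisor.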

Infrastructure

The core production infrastructure is VMs on AWS. Originally built manually, this has mostly migrated to Terraform, a fairly painful process. We do not build new VMs or make other infrastructure changes for various releases; in fact the infrastructure is quite static other than OS and package upgrades.

This article is about OpsStack, the unified infrastructure operation platform, developed by our team in China and offered there, plus in the USA at Siglos.io.

Steve Mushero

CEO of ChinaNetCloud & Siglos.io — Global Entrepreneur in Shanghai & Silicon Valley