Stable and healthy infrastructure



Every company that decides to build, or has already built, a hosted solution eventually reaches a very important phase: deploying a stable, flexible, and fault-tolerant infrastructure to run its software and serve the business needs. This phase requires answers to some common questions:

  • Cloud or on-premise deployment?

  • Which third-party services will be used (databases, caches, queues, etc.)?

  • What will the configuration and deployment process look like?

  • Which approaches and tools will be used for continuous maintenance and provisioning?

All these questions should be answered as early as possible to avoid issues and pitfalls once real users start using the solution.


Which common issues might you run into if you don't think through each part of your infrastructure? Here is a short list:


The downtime due to the maintenance process

Let's be honest: there is no ideal software in the world, and bugs may appear at any time. Have you ever thought about how you will redeploy your system to deliver fixes, improvements, or just new versions? What if at some point you need a migration, to another solution or database, or perhaps to another cloud provider?

One of our clients, an airline company, told us that they lost about $500K during 30 minutes of downtime while their support team was delivering a new UI version that contained only small improvements. Knowing this made our task a huge challenge, because our responsibility was to migrate their solution from a monolith to microservices, and of course without any downtime.


Performance degradation

Performance is measurable, but how many of the factors in your system that directly affect it do you actually know? For web applications, the most important performance questions are:

- How many people may use the platform at the same time?

- How long is each user waiting for a system response? 

These are easy to understand. The best way to manage and improve them is through targeted optimizations. But what if such issues only happen from time to time, during peak load? A flexible infrastructure will help a lot here and is the best insurance.


Failover and poor availability

This is the real world, and in the real world bad things happen: electricity outages, poor internet connections, server failures, bugs, and bottlenecks. The question is how to deal with all of this. Which strategies do your DevOps and IT teams develop? How fast can you bring everything back to life, and how much will your clients lose in the meantime?


The complicated process of issue investigation

We already mentioned that issues may happen, but in fact they will happen. How fast can you detect the problem, determine the cause, and develop and roll out the fix? What is the quality of the data in your logs and metrics? How do you handle sensitive and identifiable data? Do you have any aggregation and data visualization?


Data loss

Probably the most important point. Data is nowadays the most important part of any business. If your system can lose data, even just a small part of it (through concurrent modification, an unavailable endpoint, or broken communication between services), it is a disaster.


All these issues are just the tip of the iceberg, and if even one of them occurs, it can cause a lot of trouble for your client's business or even kill it. Below we describe a few common approaches that we usually use to prevent such issues and move a solution in the right direction.


The downtime due to the maintenance process

It is all about your continuous development approach. You should think it through in detail, because deployment, even to production, shouldn't be a rare event. Bugs, improvements, custom client requests: all of this is a continuous process, and so is your deployment.


At Jappware we practice the Blue-Green deployment approach and highly recommend it to everyone. With this approach you keep two identical production environments, release to the idle one, and switch traffic over once it is verified, so you avoid downtime and keep your system available during maintenance procedures.
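
As a rough illustration of the Blue-Green idea, here is a minimal Python sketch. It is not Jappware's actual tooling: `Router`, `deploy`, and `is_healthy` are hypothetical stand-ins for a load balancer, a deployment step, and a health check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Router:
    """Hypothetical stand-in for a load balancer that sends all traffic to one environment."""
    live: str = "blue"

    def idle(self) -> str:
        # The environment that currently receives no user traffic.
        return "green" if self.live == "blue" else "blue"


def blue_green_release(
    router: Router,
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> bool:
    """Deploy to the idle environment and switch traffic only if it is healthy.

    Returns True when traffic was switched, False when the old environment
    stays live (the failed release never receives user traffic).
    """
    target = router.idle()
    deploy(target)               # users are still served by router.live
    if not is_healthy(target):   # smoke tests / health checks on the idle side
        return False
    router.live = target         # the switch itself is a single routing change
    return True
```

Because the previous environment stays untouched after the switch, rolling back in this sketch is just one more routing change.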


Migration is another aspect that might require downtime while you make a data copy and switch the infrastructure over to the new setup. To do this without any downtime you need much more than a single solution; you need a strategy. One possible outline is the following:

  • Build a proxy that implements the Canary Deployment process

  • Fix (freeze) the database scope and make a copy

  • Migrate the database and route traffic in two directions (supporting both old and new)

  • Create a commit log to track mutations

  • Create a delta dump of the database to cover anything missed during the main database recreation

  • Apply the commit log over the final database version to replay the changes
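
The commit-log step can be sketched roughly as follows. This is a simplified in-memory model, assuming every write is recorded with an increasing sequence number; `Mutation` and `replay` are illustrative names, not part of any particular database.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Mutation:
    seq: int                 # increasing position in the commit log (assumption)
    key: str
    value: Optional[Any]     # None means the key was deleted


def replay(snapshot: Dict[str, Any], log: List[Mutation], snapshot_seq: int) -> Dict[str, Any]:
    """Apply every mutation recorded after the snapshot was taken, in log order."""
    db = dict(snapshot)      # work on a copy, leave the snapshot untouched
    for m in sorted(log, key=lambda m: m.seq):
        if m.seq <= snapshot_seq:
            continue         # already included in the snapshot
        if m.value is None:
            db.pop(m.key, None)
        else:
            db[m.key] = m.value
    return db
```

Replaying in sequence order matters: if two mutations touch the same key, the later one has to win, just as it did in the old database.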


Performance degradation

The best way to handle such issues is to be "ready", and to become "ready" you should run load and stress tests across the whole system, from the external API to internal service communication and third-party integrations (databases, etc.). Collect metrics and build a data sheet with the test results. Using this information, your DevOps team should build an infrastructure that checks the current system state (request latency, memory and CPU consumption, etc.) and, based on this well-known data, decides when to scale the infrastructure behind the load balancer up or down.
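
A scaling decision of this kind could look roughly like the sketch below. The thresholds in `Limits` are placeholder values; in practice they would come from your own load-test data sheet, and all names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    p95_latency_ms: float
    cpu_percent: float
    memory_percent: float


@dataclass
class Limits:
    # Placeholder numbers; replace them with values measured in load tests.
    max_latency_ms: float = 300.0
    high_cpu: float = 75.0
    high_memory: float = 80.0
    low_cpu: float = 25.0
    low_memory: float = 40.0


def desired_instances(current: int, s: Snapshot, limits: Limits,
                      min_n: int = 2, max_n: int = 10) -> int:
    """Return how many instances should run behind the load balancer."""
    overloaded = (
        s.p95_latency_ms > limits.max_latency_ms
        or s.cpu_percent > limits.high_cpu
        or s.memory_percent > limits.high_memory
    )
    idle = (
        s.p95_latency_ms < limits.max_latency_ms / 2
        and s.cpu_percent < limits.low_cpu
        and s.memory_percent < limits.low_memory
    )
    if overloaded:
        return min(current + 1, max_n)   # scale up on any single pressure signal
    if idle:
        return max(current - 1, min_n)   # scale down only when everything is quiet
    return current
```

Scaling up on any one signal but scaling down only when all signals are low is a deliberately cautious choice: it reduces the chance of removing capacity just before the next peak.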


Failover and poor availability

This is where automated monitoring becomes your best friend. We know many real-world cases where cluster nodes, services, or workers simply stop responding. There is always a concrete reason (memory issues, lack of disk space, CPU usage, etc.), but it still happens, and it can damage your client's business and show up in customer feedback. Automated monitoring can detect such situations and trigger actions to fix them as soon as possible.
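
To make "monitoring that triggers actions" concrete, here is a small sketch of one monitoring pass. The `checks` and `restart` callables are assumptions standing in for real health probes and a real remediation action.

```python
from typing import Callable, Dict, List


def watchdog_pass(
    checks: Dict[str, Callable[[], bool]],
    restart: Callable[[str], None],
    failures: Dict[str, int],
    threshold: int = 3,
) -> List[str]:
    """Run one monitoring pass and restart nodes that failed `threshold` checks in a row."""
    restarted = []
    for node, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False                    # a crashing probe counts as a failed check
        failures[node] = 0 if ok else failures.get(node, 0) + 1
        if failures[node] >= threshold:
            restart(node)                 # the automated remediation action
            failures[node] = 0            # give the node a fresh start
            restarted.append(node)
    return restarted
```

Requiring several consecutive failures before acting keeps a single slow response from triggering an unnecessary restart.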


The second tool is Infrastructure as Code (IaC). Even with an automated way to handle failover and service availability issues, you still need a tool that quickly restores the environment and brings things back to work without wasting time. Think about it. IaC is a very powerful mechanism that can restore an environment within seconds, even when that requires installing third-party tools and making various network and VPC changes.


The complicated process of issue investigation 

Log aggregation and metrics analysis, as well as various performance notifications and limits: it all sounds much easier than it is in the real world. Logs should be structured. Logging everywhere, even with the correct priority level, doesn’t mean that the logs follow any structure. Structure means that you can reconstruct the picture of a use case by reading the logs. Each log message should carry its place in a sequence and the details of the case, and, of course, it should avoid any sensitive information. Log messages should describe the specifics of the use case rather than point out who did what.
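
One possible shape for such a log message is sketched below using Python's standard `logging` and `json` modules. The field names (`event`, `correlation_id`, `step`) and the `SENSITIVE_KEYS` list are assumptions for illustration only.

```python
import json
import logging

# Assumption: adjust this set to whatever counts as sensitive in your system.
SENSITIVE_KEYS = {"email", "password", "card_number"}


def log_event(logger: logging.Logger, event: str, correlation_id: str,
              step: int, **details) -> str:
    """Emit one JSON log line; sensitive fields are masked before they are written."""
    safe = {k: ("***" if k in SENSITIVE_KEYS else v) for k, v in details.items()}
    record = {
        "event": event,                    # what happened in the use case
        "correlation_id": correlation_id,  # ties all lines of one use case together
        "step": step,                      # position of this line in the sequence
        **safe,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

With a shared `correlation_id` and a `step` counter, filtering the aggregated logs by one identifier gives the whole use case in order.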


Metrics are even more important than logs. With metrics in the right places you can automate monitoring and apply rules to manage alerts and notifications. Your metrics should describe the actions, parameters, or status of things that are continuously happening in your system. Metrics should be short and smart. We use the term “Key Things”, and everyone on the team should be able to explain what the key things in the system are: what is critical, what matters, and what can be ignored while other things cause issues. Understanding the system answers all of these questions. Ask your team, and check whether you are comfortable with the answer (the answer should contain a "message" that refers to a "key thing").
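
As a small sketch of rules applied to metrics, the example below ties each alert message to one "key thing". The metric names and thresholds are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    metric: str        # the "key thing" being watched
    threshold: float   # alert when the value goes above this
    message: str       # human-readable meaning of the alert


def evaluate(rules: List[Rule], metrics: Dict[str, float]) -> List[str]:
    """Return one alert string per violated rule or missing metric."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            alerts.append(f"{rule.metric}: no data")   # a silent metric is itself a signal
        elif value > rule.threshold:
            alerts.append(f"{rule.metric}={value}: {rule.message}")
    return alerts
```

For example, a rule such as `Rule("checkout.error_rate", 0.01, "checkout is failing")` produces an alert whose message names the key thing it protects.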


Data loss

This is a very sensitive topic. It is about your understanding of data distribution, partitioning, consistency, and availability. We hope that you know and regularly use the term “CAP theorem”. You may be used to a slightly different term, “ACID”; however, ACID is not about big data, or more precisely, big data processing. ACID is about relational databases (RDBMS), which work well for simpler solutions, while much of the rest belongs to the NoSQL world. The CAP theorem gives you the right mindset, but it does not solve any problem by itself, nor does it explain how to solve one.


Think about your data, and do it continuously. This will answer questions such as:

  • How to store and query it?

  • Which DB should I use?

  • How to optimize reading or writing?

  • How to deal with updates and deletes?

  • etc.


You cannot afford to lose even a single character or digit. Think about replication: use a replication factor of at least 3 and continuously check the consistency of the cluster. Think about distribution and the partitioning key: this will make your cluster more robust and more evenly loaded. The natural order of your data and the clustering key will help you reach exactly the data you need in the preferred order. And think about the consensus algorithm, that is, how your cluster decides which node holds the right result, and whether the result is correct enough for your client to trust it.
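
To illustrate why a replication factor of 3 is a useful minimum, here is a simplified majority-vote read. It is a plain quorum sketch, not a full consensus algorithm such as Raft or Paxos, and the function names are illustrative.

```python
from collections import Counter
from typing import Hashable, List, Optional


def quorum_size(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def quorum_read(replies: List[Hashable], replication_factor: int = 3) -> Optional[Hashable]:
    """Return the value a majority of replicas agree on, or None if there is no majority."""
    if not replies:
        return None
    value, votes = Counter(replies).most_common(1)[0]
    return value if votes >= quorum_size(replication_factor) else None
```

With three replicas a majority is two, so a read can still be answered when one replica is down or stale; with only two replicas, a single disagreement leaves no majority to trust.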


Summary

The client’s business should receive not only a product or solution but also a continuous strategy that moves things forward in the right direction day after day. Modern approaches require attention at every step. There is, or at least should be, no “single point of failure”: there should be a system in which, even if one piece breaks, the remaining parts keep serving the business needs.


Modern software requires an understanding of current, future, and possible business needs. Based on these, the software development process should build CI/CD, monitoring, provisioning, and maintenance strategies into the usual daily workflow so that clients' businesses run non-stop, save or earn money, and stay safe.

© 2020 LLC JAPPWARE 

info@jappware.com   |   Oleksy Novakivs'koho St, 3, Lviv, Ukraine, 79007