Moving to Azure: why we did it & what we learned
As we announced earlier, we are changing our hosting infrastructure to Azure by the end of this year. Read more about the reasons, progress and learnings in this post.
About a year ago, we at HQLabs started discussing how to host the HQ more reliably. One of the most important aspects of hosting an application reliably is the fail-safety of the servers. This is typically achieved by running multiple identical servers behind a load balancer that routes incoming user traffic to them. If one server fails, the load balancer notices it and forwards requests only to the remaining healthy servers.
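The failover behavior described above can be sketched as a minimal round-robin balancer with health checks. This is an illustrative toy, not our actual setup; the server names and the health flag are made up:

```python
import itertools

class Server:
    """Hypothetical server handle with a health flag (illustration only)."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

class RoundRobinBalancer:
    """Route each request to the next healthy server, skipping failed ones."""
    def __init__(self, servers):
        self.servers = servers
        self._cycle = itertools.cycle(servers)

    def route(self):
        # Try each server at most once per request before giving up.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if server.healthy:
                return server.name
        raise RuntimeError("no healthy servers available")

# If one server fails, traffic flows to the remaining ones.
pool = [Server("web-1"), Server("web-2"), Server("web-3")]
lb = RoundRobinBalancer(pool)
pool[1].healthy = False  # simulate a failure of web-2
routed = [lb.route() for _ in range(4)]
# web-2 never receives traffic while it is unhealthy
```

Real load balancers detect failures via periodic health probes rather than a flag, but the routing decision is the same: unhealthy targets are simply skipped.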
This fail-safety applies not only to the servers that run the application, but also to the database server and essentially every other part of the HQ.
We have since been experimenting with different load-balancing scenarios for the HQ application, the database server and the long-running background jobs. We noticed, however, that setting up and continuously running and maintaining such a server architecture requires a lot of time and effort that we think we can invest better elsewhere, like in developing new features.
Cloud platform providers like Azure typically offer load-balanced, scalable and, above all, redundant components such as application and database servers as a service. These services can also scale up automatically during the day when the load increases and scale down at night. Fail-safety comes out of the box. Azure is Microsoft's cloud platform; it is not limited to, but has very strong support for, .NET applications like the HQ.
Over the past years, Azure has grown to over 30 global data centers that always come in pairs per region, with the European centers in Dublin and Amsterdam. Due to data privacy requirements, however, many German companies must have their data hosted within Germany by a German legal entity to avoid the illegal transfer of data to other countries. The Azure data centers operated by Microsoft could not guarantee this, especially since Microsoft is based in the U.S.
At the beginning of this year, Microsoft announced plans to build two data centers in Germany, operated by T-Systems, in order to fulfill the legal requirements of German customers. We joined the trial phase of Azure Germany early on and have been hard at work making the HQ Azure-ready. Microsoft Azure Germany has been live and operational since October, and we plan to migrate most of our customers by the end of the year.
To make use of all the great features of a cloud platform, like scalability and reliability, we needed to make some changes to the HQ.
First of all, the HQ was originally built to run on a single machine and relied on that machine's local state. Temporary files and file uploads, for example, were stored on the server's file system. Now that we run the HQ on distributed servers, that no longer works: the other servers have a different file system and therefore a different state, so a user's request directed to another server might fail because the files are missing.
Azure provides a distributed file storage that we use on all servers to store those files in. This way, all files are directly available on all servers and can be accessed even in such a load-balanced scenario. We are using this distributed storage for temporary uploads, customer-specific CSS files as well as .NET view states.
Speaking of files, there was another issue with the way the HQ previously stored them. To be able to back up files along with the customer data, the HQ held binary files in so-called file streams (SQL Server FILESTREAM) in the database. This is very convenient, but it has drawbacks: the files inflate the database size, which is expensive, and Azure SQL does not support file streams.
Instead, we now store the files in Azure Blob Storage, a fast, reliable, locally redundant and inexpensive storage mechanism for large numbers of large files. So instead of storing a file in the database, we add it to blob storage and save a link in the database. Reworking the file logic took some effort, but we are very satisfied with the result. Azure provides many ways to access blobs, and we have only started to make use of them.
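The resulting pattern, binary content in blob storage and only a link in the database, can be sketched with in-memory stand-ins. The storage account URL, the row layout and the function names are made up for illustration; the real implementation uses the Azure SDK and the HQ's own schema:

```python
import hashlib

# In-memory stand-ins for blob storage and a customer database.
blob_store = {}   # blob name -> bytes
database = {}     # file id -> metadata row containing a blob link

def store_file(file_id, filename, content: bytes):
    """Put the binary content in the blob store; keep only a link in the DB."""
    blob_name = hashlib.sha256(content).hexdigest()
    blob_store[blob_name] = content
    database[file_id] = {
        "filename": filename,
        "blob_name": blob_name,
        # Hypothetical account/container names, for illustration only:
        "blob_link": f"https://example.blob.core.windows.net/files/{blob_name}",
        "size": len(content),
    }

def load_file(file_id) -> bytes:
    """Resolve the DB row to its blob and fetch the content."""
    row = database[file_id]
    return blob_store[row["blob_name"]]

store_file(1, "report.pdf", b"%PDF-1.4 ...")
assert load_file(1) == b"%PDF-1.4 ..."
```

Keeping only a link in the database keeps the database small and cheap, while the blobs themselves live in storage that is built for exactly this workload.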
To separate customer data, the HQ keeps each customer’s data in a different database. This logic is rooted deeply in the HQ and has proven very reliable. On Azure, one would normally pay for each database and get guaranteed performance. However, paying for hundreds of individual databases can become very expensive even if the load on an individual database is rather low.
Azure elastic database pools provide a way to contain many databases that then share the guaranteed load. We then pay for a pool of hundreds of databases where the costs per database are acceptable. Databases can be managed together and moved between pools to offer the best performance level for different loads. Currently, we use multiple pools with different performance characteristics, for example for customer databases and internal tools. The Azure API makes it possible to monitor the pools so that we can take appropriate measures and increase the performance, for example.
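The database-per-customer routing can be sketched as follows. The server and pool names are hypothetical; the point is that the pool only determines resource sharing and billing, while each connection still targets the customer's own database on the logical server:

```python
# Hypothetical routing table: which elastic pool holds which customer database.
POOL_ASSIGNMENTS = {
    "acme": ("hq-sql-server", "pool-standard"),
    "globex": ("hq-sql-server", "pool-premium"),
}

def connection_string(customer: str) -> str:
    """Build a per-customer connection string (database-per-tenant pattern)."""
    server, pool = POOL_ASSIGNMENTS[customer]
    # The pool is invisible at connection time; the application simply
    # connects to the customer's dedicated database.
    return (f"Server=tcp:{server}.database.windows.net,1433;"
            f"Database=hq_{customer};Encrypt=True;")
```

Because the pool assignment is just metadata, a database can be moved to a pool with different performance characteristics without changing the application's connection logic.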
In the end, our hosting infrastructure looks like this: a couple of infrastructure components such as the virtual network that connects to our office, plus some virtual machines running utilities. Our storage consists of a SQL server with several elastic database pools, additional blob storages and the distributed file system. The application, API and background jobs run in a production environment, with a cloned environment for testing and QA.
We have been testing the HQ on Azure for a while and have created the entire infrastructure needed to host, manage and maintain the software. All components have been stable for some time now, and we have run the Azure-hosted HQ in production for more than a month. We are therefore satisfied that we can ensure an operational system for the web application, our API and all other parts of the HQ.
The next step is to move each customer's data and files to Azure. For this, we created a migration routine that first creates a backup of the database and uploads the data from the file streams to blobs. After that, several schema migrations, adjustments and improvements are run on the database. The data is then restored on the Azure SQL server and the database is added to its elastic pool. Finally, all configuration data and settings for each customer are moved from the old database server to the new one. For the actual data migration we used the command-line tools of the SQL Azure Migration Wizard, which worked like a charm.
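The steps above can be sketched as a small, runnable pipeline over in-memory stand-ins. All names are illustrative, the schema-migration step is omitted, and the real routine of course works on SQL backups rather than dictionaries:

```python
class OldDatabase:
    """Stand-in for a customer database on the old server (illustration only)."""
    def __init__(self, customer, rows, filestream_files):
        self.customer = customer
        self.rows = rows                          # table name -> row count
        self.filestream_files = filestream_files  # filename -> bytes

def migrate_customer(old_db, blob_store, azure_sql, pool):
    # 1. Back up the database (here: copy the rows).
    backup = dict(old_db.rows)
    # 2. Upload file-stream content to blob storage, keeping links.
    links = {}
    for name, content in old_db.filestream_files.items():
        blob_store[f"{old_db.customer}/{name}"] = content
        links[name] = f"blob://{old_db.customer}/{name}"
    # 3. Schema migrations and adjustments would run here (omitted).
    # 4. Restore the data on the Azure SQL server ...
    azure_sql[old_db.customer] = {"rows": backup, "file_links": links}
    # 5. ... and add the database to its elastic pool.
    pool.append(old_db.customer)

blob_store, azure_sql, pool = {}, {}, []
old = OldDatabase("acme", {"projects": 12}, {"logo.png": b"\x89PNG"})
migrate_customer(old, blob_store, azure_sql, pool)
```

Running file uploads before the restore means the restored database already contains valid blob links when it goes live in the pool.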
Uploading the files to blob storage is typically the longest task, so the migration time depends mainly on the number and size of files in the customer database. Normally, migrating a customer system to Azure takes about two minutes. The migration happens during a maintenance window at night, as announced to our customers earlier. Some demo systems have already been migrated, and some of our customers agreed to preview the new infrastructure and help us test it. All others will be migrated in the following days and weeks.
Over the past weeks, we have been busy moving all customer systems to our Azure infrastructure. Except for a few DNS propagation issues, the migration has gone smoothly, as expected.
Moving to Azure, or any cloud platform for that matter, takes time. Since the HQ had a few stateful parts, we needed to refactor them for a load-balanced scenario. We wanted the HQ to be a first-class citizen of the cloud, so the stateful components had to not only become cloud-ready but embrace the concepts of a cloud architecture. Changing core components of a large application like the HQ always takes time and requires extensive testing.
Also, our customers needed some time to prepare for the move. Since our application accesses several on-premises systems of our customers, we needed to coordinate with their IT to secure remote access by updating VPN settings and firewalls. Azure supports a range of, but not all, VPN gateways, and we are still looking for a solution that works for every customer.
Finding the right Azure component for each task was not always easy, and we will keep looking for ways to improve all parts of the HQ. On top of that, Azure Germany currently offers only a subset of the features of the global Azure data centers. It is already very powerful, and most of the main components are available in Germany, but some detailed features are still missing, which occasionally prevents us from using a specific component. For example, Azure-hosted web apps cannot yet be integrated into virtual networks, which required some workarounds to send traffic between web apps and machines in the virtual network. And until a few days ago, the Redis Cache-as-a-Service was available only with very limited resources (up to 100 MB, where we need several GB), so we host it ourselves for the time being. We are confident that these features will arrive in the German cloud in the near future, and we will adopt them wherever they improve the HQ.
When we moved the first test systems to Azure, we closely monitored the performance of the load balancer and the application machines behind it. But only when a larger number of production systems were hosted on Azure did we get enough traffic to actually test this scenario. It turned out that the load balancer did not distribute the incoming traffic as equally as expected, which led to high CPU load on one server while the others were almost idle. We investigated further and found that the classic load balancer is not designed to distribute load randomly but according to fixed rules, which do not fit our design. To solve this, we consulted the Azure support team, who recommended a recently released alternative: Application Gateways, the correct way to route load-balanced traffic for web applications hosted on virtual machines on Azure. We set it up together and quickly had very satisfying results.
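A toy simulation shows how fixed routing rules can produce this kind of imbalance: hash- or affinity-based rules send all requests from one source to the same server, so a few heavy sources skew the load, while an even round-robin distribution spreads the same traffic uniformly. The IPs and the hash function are made up for illustration and are not the classic load balancer's actual algorithm:

```python
from collections import Counter

servers = ["web-1", "web-2", "web-3"]

def hash_ip(ip):
    # Toy deterministic hash of the last octet (illustrative only).
    return int(ip.rsplit(".", 1)[1])

def route_by_source(ip):
    """Fixed rule: the same source always hits the same server."""
    return servers[hash_ip(ip) % len(servers)]

# A few clients behind one corporate gateway produce very skewed load ...
requests = ["10.0.0.7"] * 80 + ["10.0.0.8"] * 10 + ["10.0.0.9"] * 10
affinity_load = Counter(route_by_source(ip) for ip in requests)

# ... while round-robin spreads the same requests almost perfectly evenly.
round_robin_load = Counter(servers[i % len(servers)] for i in range(len(requests)))
```

With the fixed rule, one server absorbs 80 of the 100 requests; round-robin keeps the difference between the busiest and idlest server to at most one request.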
From the very beginning, the team at Microsoft has been very helpful and supported us along the way: workshops, on-site and remote technical support, and help with organizational and legal questions regarding the data protection structure. It only makes sense to involve the experts in such a task, and we are very thankful for the good contacts, the great support and such a motivated technology partner.
We are satisfied that we can offer the best hosting solution for the HQ available in Germany right now. We are now looking forward to all the benefits a cloud platform offers us. We made the first big step towards the cloud, and now only the sky is the limit. Stay tuned for more amazing things!
Leave your questions in the comments or tell us your experience.