Recently, I published an article in which I discussed shifting the technology that supports The New York Times Crossword to the Google Cloud Platform and highlighted that doing so allowed us to reduce our costs. I did not get the chance to add that the migration took place over a period in which our traffic more than doubled, and that we completed it without experiencing any downtime.
Even before we got started, we knew we wanted to move away from our LAMP stack, and that its replacement would most likely be written in the Go programming language, leaning on GCP’s abstractions wherever they were applicable. After a lengthy discussion, we settled on a microservice architecture and devised a four-stage plan for transferring public traffic to the new system. We wrote a Request for Comment (RFC) and shared it inside the company to solicit feedback from our Architecture Review Board and other employees. In little time at all, we were ready for Stage 1 and our first set of unexpected challenges.
Stage 1: A Simple Proxy
In the beginning, all we planned to do was add a new, pure proxy layer on Google App Engine (GAE). Because all traffic to nytimes.com is routed through Fastly, we were able to add a rule directing all crossword traffic to a new *.appspot.com domain and proxy it into our legacy AWS stack. This step gave us ownership of all of our traffic, which let us migrate to the new stack one endpoint at a time while monitoring the improvements along the way.
We did, of course, run into problems straight away, but for the first time we also had tooling that let us look into the traffic being sent. We discovered that some of our web clients were unable to access the puzzle, and after investigating we determined the cause was the size limit App Engine places on outbound request headers (16KB). Users who carried a large number of cookies from third-party websites blew past that limit, so their identity was dropped from the proxied request. After a quick change to proxy only the headers and cookies we required, we were back to normal operations.
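For illustration, here is a minimal sketch of that kind of allowlisting proxy in Go, using the standard library’s httputil.ReverseProxy. The legacy origin and the cookie names are placeholder assumptions, not our actual configuration.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// Placeholder values: the real legacy origin and the cookie names the old
// stack needs are assumptions, not the production configuration.
var (
	legacyOrigin, _ = url.Parse("https://legacy-crosswords.example.com")
	allowedCookies  = map[string]bool{"NYT-S": true, "nyt-a": true}
)

func newLegacyProxy() *httputil.ReverseProxy {
	proxy := httputil.NewSingleHostReverseProxy(legacyOrigin)
	base := proxy.Director
	proxy.Director = func(r *http.Request) {
		base(r)
		// Forward only the cookies the legacy stack actually needs, so the
		// outbound headers stay well under App Engine's 16KB limit.
		cookies := r.Cookies()
		r.Header.Del("Cookie")
		for _, c := range cookies {
			if allowedCookies[c.Name] {
				r.AddCookie(c)
			}
		}
	}
	return proxy
}

func main() {
	http.Handle("/", newLegacyProxy())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```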
The next issue arose from the nightly traffic spike we see when the next day’s puzzles are published at 10 p.m. Eastern time. Auto-scaling is one of App Engine’s strengths, yet the system still had trouble scaling up quickly enough for a 10x+ jump that occurs over the space of a few seconds. To get around this, we use an App Engine cron job together with a dedicated endpoint that calls an admin API to adjust the scaling settings for our service just before we anticipate the spike. With these two issues under control, we were ready to move on to the next stage.
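A rough sketch of what that cron-triggered scaling bump can look like, assuming the generated google.golang.org/api/appengine/v1 client. The project, service, and version IDs, the instance count, and the exact scaling field and update mask are placeholder assumptions that depend on the environment, not our production values.

```go
package main

import (
	"context"
	"log"
	"net/http"

	appengine "google.golang.org/api/appengine/v1"
)

// scaleUpHandler is wired to a cron.yaml entry scheduled shortly before the
// 10 p.m. ET puzzle release. All identifiers below are placeholders.
func scaleUpHandler(w http.ResponseWriter, r *http.Request) {
	// App Engine cron requests carry this header; reject anything else.
	if r.Header.Get("X-Appengine-Cron") != "true" {
		http.Error(w, "forbidden", http.StatusForbidden)
		return
	}
	ctx := context.Background()
	svc, err := appengine.NewService(ctx)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	// Raise the idle-instance floor so the spike lands on warm instances.
	patch := &appengine.Version{
		AutomaticScaling: &appengine.AutomaticScaling{MinIdleInstances: 30},
	}
	_, err = svc.Apps.Services.Versions.
		Patch("my-project", "default", "live-version", patch).
		UpdateMask("automaticScaling.minIdleInstances").
		Context(ctx).Do()
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	log.Println("scaling bumped ahead of the evening spike")
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/tasks/scale-up", scaleUpHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```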
Stage 2: Building Endpoints and Syncing Data in Real Time
There was a large amount of data in our existing system, because we track all of our customers’ game progress along with the NYT’s puzzles. To make the switch to the new system as painless as possible, we needed a way to replicate all of that data and keep it synchronized. In the end, we decided to use Google PubSub to send data into our new stack reliably.
We created a hook in our internal admin to publish any changes to puzzle data to our new “puzzles” service, which is responsible for inserting the data into Datastore and clearing any relevant caches.
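As a rough sketch of the publishing side, assuming the cloud.google.com/go/pubsub client: the topic name and message schema below are illustrative, not the real ones.

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"cloud.google.com/go/pubsub"
)

// PuzzleUpdate is an assumed payload shape for illustration; the real
// message schema is not described in the article.
type PuzzleUpdate struct {
	PuzzleID    string `json:"puzzle_id"`
	PublishDate string `json:"publish_date"`
}

// publishPuzzleUpdate sends one update to the topic and waits for the ack.
func publishPuzzleUpdate(ctx context.Context, topic *pubsub.Topic, u PuzzleUpdate) error {
	data, err := json.Marshal(u)
	if err != nil {
		return err
	}
	// Publish is asynchronous; Get blocks until the server acknowledges it.
	res := topic.Publish(ctx, &pubsub.Message{Data: data})
	_, err = res.Get(ctx)
	return err
}

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	topic := client.Topic("puzzle-updates") // placeholder topic name
	update := PuzzleUpdate{PuzzleID: "2017-08-14", PublishDate: "2017-08-14"}
	if err := publishPuzzleUpdate(ctx, topic, update); err != nil {
		log.Fatal(err)
	}
}
```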
For game progress, we took the duct-tape approach: we added a small process, driven by a cron, that queries the old database for new updates and emits them via PubSub to a new “progress” service in App Engine.
We were able to rely on PubSub’s push-style subscriptions and App Engine for the majority of our data, but one use case did not work well with GAE: generating PDFs of our puzzles. Go has a decent PDF creation module, but the custom fonts we needed caused file sizes to balloon to more than 15 megabytes each. To get around this, we had to run the PDF output through a command-line program called Ghostscript, and because we could not do that on App Engine, we added an extra hop to our PubSub flow. A small process running on Google Container Engine (GKE) listens to PubSub, generates the PDF, and publishes the file back out to PubSub; the “puzzles” service then consumes the file and saves it to Google Datastore.
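The Ghostscript hop of that worker might look something like the sketch below. The subscription and topic names are placeholders, the Ghostscript flags are standard compression options, and carrying the PDF bytes directly in the Pub/Sub message is a simplification for illustration.

```go
package main

import (
	"bytes"
	"context"
	"log"
	"os/exec"

	"cloud.google.com/go/pubsub"
)

// shrinkPDF pipes an oversized PDF through Ghostscript's pdfwrite device,
// which re-embeds fonts and recompresses the output.
func shrinkPDF(ctx context.Context, in []byte) ([]byte, error) {
	cmd := exec.CommandContext(ctx, "gs",
		"-sDEVICE=pdfwrite",
		"-dPDFSETTINGS=/ebook",
		"-dNOPAUSE", "-dBATCH", "-dQUIET",
		"-sOutputFile=-", // write the result to stdout
		"-",              // read the input from stdin
	)
	cmd.Stdin = bytes.NewReader(in)
	var out bytes.Buffer
	cmd.Stdout = &out
	if err := cmd.Run(); err != nil {
		return nil, err
	}
	return out.Bytes(), nil
}

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	sub := client.Subscription("pdf-requests") // placeholder subscription
	results := client.Topic("pdf-results")     // placeholder topic

	// Receive blocks, handing each raw PDF to Ghostscript and publishing the
	// compressed result for the "puzzles" service to consume.
	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		small, err := shrinkPDF(ctx, m.Data)
		if err != nil {
			log.Printf("ghostscript failed: %v", err)
			m.Nack()
			return
		}
		if _, err := results.Publish(ctx, &pubsub.Message{Data: small}).Get(ctx); err != nil {
			log.Printf("publish failed: %v", err)
			m.Nack()
			return
		}
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```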
At this point in the process, we learned how to manage costs when doing intensive work in Google Cloud Datastore. The database bills by the number of entity reads and writes, and as we replayed all of our historical gameplay, our user statistics were being flagged for reaggregation almost constantly. Because that reaggregation caused multiple collisions and recalculation failures, the process slowed significantly and we ended up paying an unexpected several thousand dollars over the course of one weekend. Since Datastore supports atomic transactions, we were able to eliminate the need for a locking mechanism when computing statistics, and the next time we replayed all user progress in the new environment, we did so at a far lower cost.
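A minimal sketch of that transactional approach with the cloud.google.com/go/datastore client; the entity kind and fields are illustrative assumptions, not our real statistics schema.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/datastore"
)

// UserStats is an illustrative entity; the real statistics schema is not
// described in the article.
type UserStats struct {
	SolveCount int
	Streak     int
}

// applySolve updates a user's aggregated stats inside a single atomic
// transaction, so concurrent replays of game progress cannot clobber each
// other and no separate locking mechanism is needed. Datastore retries the
// transaction automatically if it collides with another write.
func applySolve(ctx context.Context, client *datastore.Client, userID string) error {
	key := datastore.NameKey("UserStats", userID, nil)
	_, err := client.RunInTransaction(ctx, func(tx *datastore.Transaction) error {
		var stats UserStats
		if err := tx.Get(key, &stats); err != nil && err != datastore.ErrNoSuchEntity {
			return err
		}
		stats.SolveCount++
		stats.Streak++
		_, err := tx.Put(key, &stats)
		return err
	})
	return err
}

func main() {
	ctx := context.Background()
	client, err := datastore.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	if err := applySolve(ctx, client, "user-123"); err != nil {
		log.Fatal(err)
	}
}
```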
Stage 3: Swapping Endpoints One at a Time
As soon as the data began to sync over to the new stack, we started changing the “edge” service to point to our newer implementations, one endpoint at a time. For a while, we were safely swapping over one endpoint a day.
Rewriting existing endpoints for the new stack wasn’t our only duty during this span. We also had to create a new, read-only endpoint for the new iOS home screen. This screen required a combination of data that could be easily cached (such as puzzle metadata) and data that was specific to the player (such as today’s puzzle solve time). Those two kinds of data live in two different services in our new stack, and we needed to merge them. This is the point at which our “edge” service evolved into something more than a simple proxy and gained the ability to merge the data from our two sub-services.
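A simplified sketch of that merging step: the sub-service URLs and payload shapes are assumptions, the fan-out uses golang.org/x/sync/errgroup for brevity, and a real edge service would also forward the caller’s identity to the progress service.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	"golang.org/x/sync/errgroup"
)

// The sub-service URLs below are illustrative assumptions, not real
// internal endpoints.
const (
	puzzlesURL  = "https://puzzles.example.internal/today"
	progressURL = "https://progress.example.internal/today"
)

type homeScreen struct {
	Puzzle   json.RawMessage `json:"puzzle"`   // easily cached metadata
	Progress json.RawMessage `json:"progress"` // per-player data
}

// fetchJSON grabs one sub-service response as raw JSON.
func fetchJSON(url string, out *json.RawMessage) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(out)
}

// homeHandler fans out to the two sub-services in parallel and merges their
// responses into a single payload for the iOS home screen.
func homeHandler(w http.ResponseWriter, r *http.Request) {
	var screen homeScreen
	var g errgroup.Group
	g.Go(func() error { return fetchJSON(puzzlesURL, &screen.Puzzle) })
	g.Go(func() error { return fetchJSON(progressURL, &screen.Progress) })
	if err := g.Wait(); err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	json.NewEncoder(w).Encode(screen)
}

func main() {
	http.HandleFunc("/ios/home", homeHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```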
During this round of development we also re-platformed the endpoints responsible for saving and syncing game progress across multiple devices. This was a key step, as all of the endpoints dealing with user statistics and streaks also had to be transferred. The initial launch of game progress did not go as smoothly as we had hoped.
One endpoint was reporting significantly higher latency than anticipated, and a number of unexpected edge cases emerged. We eventually eliminated an unnecessary query to get rid of the extra latency on the slow endpoint, but the edge cases proved harder to track down. Once again, the observability tooling provided by Google App Engine helped us hunt down the worst of the errors, and after that we were back to smooth sailing.
Stage 4: The Last Piece of the Puzzle
Once the systems surrounding puzzle data and game progress were stable and running solely on Google’s infrastructure, we were able to set our sights on the final component to be rewritten from the legacy platform: the management of users and subscriptions.
Users of the crossword app can purchase a subscription to the service directly through the app store on their own devices. (For instance, an iPhone user can buy a yearly subscription to the New York Times Crossword directly from the iTunes Store.) When they do so, their device receives a receipt, and our games platform uses that receipt to validate the subscription whenever the app is loaded.
Because validating such a receipt is a task that other teams at The New York Times might also need, we decided to build our “purchase-verifier” service with Google Cloud Endpoints. Cloud Endpoints manages authentication and authorization for the service, which makes it possible for another team within the company to request an API key and begin using it. Given an iTunes receipt or a Google Play token, the service tells us whether the purchase is still valid and the date on which it expires. A small “comm” service rounds out the mix, authenticating direct NYT subscribers and acting as an adaptor that translates our existing authorization endpoints to fit the new verification service.
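A sketch of the verifier’s shape, with the request and response types invented for illustration; the store-specific validation calls are stubbed out because they require app credentials, and Cloud Endpoints handles the API-key checks in front of this handler.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// VerifyRequest and VerifyResponse are assumed shapes for illustration; the
// real purchase-verifier API is not documented in the article.
type VerifyRequest struct {
	Platform       string `json:"platform"` // "itunes" or "google_play"
	ReceiptOrToken string `json:"receipt_or_token"`
}

type VerifyResponse struct {
	Valid     bool      `json:"valid"`
	ExpiresAt time.Time `json:"expires_at"`
}

// verifyWithStore would call out to Apple's receipt validation or the Google
// Play Developer API; it is stubbed here because those calls need
// app-specific credentials.
func verifyWithStore(platform, receiptOrToken string) (VerifyResponse, error) {
	// ... platform-specific validation goes here ...
	return VerifyResponse{}, nil
}

// verifyHandler sits behind Cloud Endpoints, which performs the API-key
// checks before requests reach this code.
func verifyHandler(w http.ResponseWriter, r *http.Request) {
	var req VerifyRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	resp, err := verifyWithStore(req.Platform, req.ReceiptOrToken)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	json.NewEncoder(w).Encode(resp)
}

func main() {
	http.HandleFunc("/v1/verify", verifyHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```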
A little over two months ago, the final public endpoint went live on GCP. Since then, we have been aggressively resolving small edge cases and tuning the system for the highest possible efficiency and the lowest possible cost. Thanks to the observability tooling provided by GCP, we can see that it is not unusual for the platform to have a day with success rates of 99.99 percent or higher, along with significantly lower latencies than we previously experienced.
Although we are revamping and rewriting the PHP admin component that manages our system’s assets and feeds so that it runs on App Engine, it is important to note that it is still operating in AWS. Since our next iteration will already be able to read from and write to a Google Cloud SQL instance, we anticipate being entirely independent of AWS within the next few months.