Efforts to gradually improve Tabelog's large-scale legacy system

Hello, this is KyoKazu of the Tabelog System Headquarters. I became its general manager in April of this year. Also, a daughter was born in April :tada:

This entry introduces the gradual improvement of the large-scale legacy system that Tabelog has worked on throughout the year. It is my second Advent Calendar entry, following "[Translation] Migrating to Modular Monolith in Shopify".

How to proceed step by step

Tabelog is in its 15th year of service this year, and 13 years have passed since it moved to Rails. With this much history, it is natural for the system to have become creaky in places, and we first had to decide where and how to tackle the myriad issues.

To begin with, I set out the following premises.

- The work must not adversely affect existing business or development. Rather, I want it to have a positive impact as soon as possible.
- Improving a system with such a long history takes a long time, so we need management and the business side to understand the value of the initiative.
- Initially a specialized team will drive the work, but eventually I want to involve all the engineers who develop on Tabelog.
- To keep improving continuously, service development engineers need to have ownership, and we need a setup in which they can make improvements on their own.
- Proceeding with a specialized team alone takes too much time.
- Given all of this, gradually improving the current system seems better than building a completely new system and migrating to it.

Based on the above idea, we decided to proceed with the following policy.

Roughly speaking, the order is infrastructure improvement, application improvement, and finally microservices if needed.

The reason I decided to work on infrastructure first is that the legacy infrastructure raised both the risk of system changes and the cost of investigation for developers. That produced a conservative attitude toward development, which in turn led to deteriorating code quality. I thought the first step had to be breaking the structure of this **vicious cycle**. On the current infrastructure, changing the application design or refactoring is risky, so even when we knew such work was necessary, it was hard to carry out.

As for microservices, moving to them while the infrastructure architecture and application architecture are not yet in place is extremely risky, and it is unclear whether the expected results would be achieved. We therefore keep microservices in view as an option, but I decided to give them a lower priority.

Now, let's look at the initiatives for the infrastructure and for the application in turn.

Infrastructure

First, we decided to use Kubernetes (K8s) for our infrastructure. However, some people pointed out that K8s has a high learning cost and is updated so quickly that it is hard to keep up with, and I was not sure myself whether K8s was really the right choice, so I decided to adopt GKE in a new business first. A new business makes a technical challenge easy to justify, and a fully managed environment keeps the initial cost low, so it was the best place to start as a PoC.

After running GKE in production for a while, I confirmed its usability and felt that it was working well. As for the frequency of updates, I came to take the positive view that keeping up with the latest ecosystem leads to a technical competitive advantage, and so I decided to adopt K8s.

This is not limited to K8s, but I think that starting small, verifying quickly, gaining an early success, and then expanding widely is an important process in every area. In addition, CNCF's Cloud Native Trail Map and Cloud Native Interactive Landscape were very helpful when considering the infrastructure configuration.

Deploy a little project

At that time, Tabelog used a push-type deployment method based on Capistrano. In push-type deployment, files are distributed from a central host to each server, so deployment time grows with the number of servers. Even after various improvements, a deployment took about 30 minutes, which was too long. K8s will solve this problem, but deployment performance has a large impact on the entire organization and could not be left unaddressed.

Therefore, I decided to switch to pull-type deployment ahead of the K8s migration. We named it the "Deploy a little project". Since it was never intended for long-term operation, we kept Capistrano as the base and, with minimal changes, switched to having the servers pull the deployment image from S3, which cut deployment time to about a tenth. Deployment on K8s is also pull-type, so the assets and knowledge gained here carry over without waste.
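The difference between the two deployment styles can be illustrated with a small back-of-the-envelope model. This is only a sketch under stated assumptions: the function names, the batch size, and the timings are hypothetical and are not Tabelog's actual figures, and the pull model assumes the artifact store (S3 in the text) is not itself the bottleneck.

```python
import math


def push_deploy_seconds(servers: int, copy_seconds: float, parallelism: int) -> float:
    """Push type: a central host copies the release to servers in batches.

    `parallelism` is the (hypothetical) number of servers the central host
    can copy to at the same time, so the total grows with the server count.
    """
    batches = math.ceil(servers / parallelism)
    return batches * copy_seconds


def pull_deploy_seconds(servers: int, fetch_seconds: float) -> float:
    """Pull type: every server fetches the release itself, all at once.

    Assumes the artifact store can serve all servers concurrently, so the
    total stays roughly constant regardless of the server count.
    """
    return fetch_seconds if servers > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    for n in (10, 100, 400):
        push = push_deploy_seconds(n, copy_seconds=30.0, parallelism=10)
        pull = pull_deploy_seconds(n, fetch_seconds=30.0)
        print(f"{n} servers: push={push:.0f}s pull={pull:.0f}s")
```

In this model the push time scales with the number of batches, while the pull time is bounded by a single fetch, which is the property that also carries over to K8s.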

Batch job

We selected Argo CD as the CD tool in order to fully introduce GitOps after the K8s migration. In that process, we decided to use Argo Workflows, from the same Argo project, as the workflow engine for batch jobs. Tabelog currently uses WebSAM, and job settings are changed by sending a request to the SRE team, who then change the settings by hand, which is quite a painful way to operate. With Argo Workflows I would like to achieve IaC and make batch job management self-service. As an aside, I strongly agree with Netflix's idea of the Full Cycle Developer, and I am working toward development teams taking ownership of improving and operating their own products. We would like to actively promote self-service in other areas too, not only batch jobs.
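To make the IaC idea concrete, here is a minimal sketch of what a batch job defined as data might look like. It builds an Argo Workflows `CronWorkflow` manifest as a Python dictionary and prints it as JSON, which could then be committed to Git and reviewed like any other change. The helper name `cron_workflow`, the job name, the image, and the command are all hypothetical; this is not Tabelog's actual configuration.

```python
import json
from typing import Any, Dict, List


def cron_workflow(name: str, schedule: str, image: str, command: List[str]) -> Dict[str, Any]:
    """Build a minimal Argo Workflows CronWorkflow manifest as plain data."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "CronWorkflow",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,           # standard cron expression
            "concurrencyPolicy": "Forbid",  # do not start a run while one is active
            "workflowSpec": {
                "entrypoint": "main",
                "templates": [
                    {
                        "name": "main",
                        "container": {"image": image, "command": command},
                    }
                ],
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical job definition, for illustration only.
    manifest = cron_workflow(
        name="nightly-report",
        schedule="0 3 * * *",
        image="example.com/batch/report:latest",
        command=["bundle", "exec", "rake", "report:nightly"],
    )
    print(json.dumps(manifest, indent=2))
```

Because the job is an ordinary file in a repository, a development team could propose a schedule change through a pull request instead of asking the SRE team to edit settings by hand.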

Observability

For metrics, we use Prometheus / Grafana, which is now the de facto standard. To put the Full Cycle Developer idea into practice, we also want each service development team, not only the SRE team, to be able to work with Grafana dashboards freely, so we plan to introduce Thanos so that the servers answering PromQL queries can be scaled out.

For logging, we have configured logs to be aggregated in GCP Stackdriver Logging. Until now, the operation was a rough one of logging in to production servers over SSH or fetching logs with SCP, so this improvement was very well received.

In addition, we are considering the introduction of Datadog APM. **Observability is a foundation for making data-based decisions and running feedback loops, and we see it as an extremely important area for realizing a data-driven organizational culture.**

Messaging

Tabelog has some middleware that we implemented ourselves, as needed, before the ecosystem matured. Messaging is one example: we implemented our own messaging server, but repairs and failure investigations had come to depend on particular individuals, and this became an issue as the business importance of the messaging infrastructure grew.

Unlike in the past, I now think that in most cases it is better to build on the ecosystem, and that implementing something yourself is warranted only when there are requirements that cannot be met any other way. Therefore, we set an OKR Objective of "make the messaging infrastructure resilient to failures and make follow-up investigation easy", and are proceeding with the replacement with Apache Kafka. We also decided to use Elastic Cloud for searching messages. At Tabelog, the servers that process user traffic run on-premises, but for systems used mainly in-house we are reducing operational costs by actively adopting the cloud.

Information dissemination within the company

Once the rough verification of the new infrastructure was complete and its realization was in sight, three people, @tsukasa_oishi, @weakboson and @yzusa, held an in-house presentation for all Tabelog engineers on two themes: "the future of Tabelog infrastructure" and "becoming a Tabelog of technology". With a wonderful presentation that made full use of the demo skills cultivated in our OKR win sessions [^1], it raised expectations that the environment will become easier for service development engineers to develop in, and it was also an opportunity to return to our starting point: **"We are here to make the people who use Tabelog happy through the power of technology."**

_Picture of people deeply moved by the presentation_

We did not get K8s into production by the end of this year, but we expect most of Tabelog to move to K8s over the next few months.

Application

Next is the application. As I said when describing the overall picture, though, work on the application is the next step and is still at the conceptual stage, so I will write down what I am thinking at the moment.

Architecture technology selection

"Whether to adopt microservices, and (if so) when and how to proceed" is a point that troubles many organizations. "[Translation] Migrating to Modular Monolith in Shopify" introduced the modular monolith. I see it not as a choice between a modular monolith and microservices, but as one form of gradual transition from a monolith to microservices.

As pointed out in Modular Monoliths — A Gateway to Microservices (https://medium.com/design-and-tech-co/modular-monoliths-a-gateway-to-microservices-946f2cbdf382), microservices do not improve your code. Rather, unless the application's domain boundaries, components, and interfaces are properly defined in advance and a highly cohesive, loosely coupled state is achieved, microservices are likely to fail (failure here means a state in which the organization's development productivity does not improve). Tabelog intends to gradually improve the monolithic application before splitting the system into microservices.

A tightly coupled application laden with technical debt is like a messy room. Few people look at a messy room and say "OK, let's divide the room!" The natural approach is to tidy it: throw away what you don't need, clean up the dust and dirt, buy storage, and decide where each thing goes according to its use. I think systems are similar.

It is necessary to keep up with the latest methodologies as trends, but we should consider whether each one is appropriate as a solution to our current problems. In addition, **whether the methodology is over-engineering**, that is, whether it takes more effort than necessary or adds new problems, is an extremely important viewpoint in technology selection.

As a sequel to the translated Shopify article, a blog post called Under Deconstruction: The State of Shopify's Monolith has been published. It seems that Wedge did not deliver the expected effect, and instead a tool called Packwerk, which analyzes only static constant references [^2], has been developed and released as OSS. Packwerk is also introduced in the entry Enforcing Modularity in Rails Apps with Packwerk. The translated article got a bigger response than I expected, so **I'll do my best to translate the sequel too**.
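The general idea of checking references against declared boundaries can be sketched in a few lines. To be clear, this is not Packwerk's implementation, API, or configuration format; it is a simplified illustration in Python, and the package names and the `find_violations` helper are hypothetical. It assumes some earlier step has already extracted, statically, which package refers to which.

```python
from typing import Dict, List, Set, Tuple

Reference = Tuple[str, str]  # (referencing package, referenced package)


def find_violations(
    declared: Dict[str, Set[str]], references: List[Reference]
) -> List[Reference]:
    """Return references that cross a boundary without a declared dependency.

    `declared` maps each package to the packages it is allowed to depend on.
    References inside a single package are always allowed.
    """
    violations = []
    for source, target in references:
        if source == target:
            continue  # same package: not a boundary crossing
        if target not in declared.get(source, set()):
            violations.append((source, target))
    return violations


if __name__ == "__main__":
    # Hypothetical packages inside one monolith.
    declared = {
        "reviews": {"restaurants"},  # reviews may depend on restaurants
        "restaurants": set(),        # restaurants declares no dependencies
    }
    references = [
        ("reviews", "restaurants"),  # declared, allowed
        ("restaurants", "reviews"),  # undeclared, reported
        ("reviews", "reviews"),      # internal, ignored
    ]
    for source, target in find_violations(declared, references):
        print(f"{source} -> {target} is not a declared dependency")
```

A check like this can run in CI, so boundaries inside the monolith are enforced continuously without first splitting the system into separate services.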

The limits of the monolith

The concept of microservices was advocated in 2012, but compared to that time, the ecosystem for realizing cloud native has developed greatly. Its benefits are available not only to distributed systems like microservices but also to monolithic applications. Recently, products that use machine learning to improve test efficiency have also begun to appear, such as Launchable, launched by Jenkins creator Kawaguchi.

Based on Shopify's case and these developments, I think the limits of the monolith arrive less quickly than we used to think. My current feeling is that microservices should be introduced to address organizational scale rather than system scale. Therefore, to design the right timing and architecture for microservices, it is essential to understand the business structure, the organizational structure, and the direction the business should aim for in the medium to long term (in other words, the reverse Conway's law). Ultimately, the ideal may be to be able to flexibly carve out as microservices the areas on which the business should focus most.

Front end

For the front end, we have been working on replacing jQuery with React / TypeScript for some time. This is actively disseminated on the Tabelog Front End Engineer Blog, so please see the blog for details.

By the way, there has been a lot of debate lately about Rails versus the front end, but I personally don't see it as a confrontation. At the application layer, we think the important and difficult issue is moving to a highly cohesive, loosely coupled code base, not whether the processing lives in the front end or the back end. As for modernizing the front end, we believe its main value lies in having the flexibility to adapt to changes in the technical environment and the option to provide better UI/UX to users. The front end has a significant impact on the user experience, so we will continue to focus on it.

Other initiatives

As an organizational initiative, we partially introduced a matrix organization in October and have begun the challenge of building self-organizing, cross-functional teams. In addition, Guild Works is coaching us, as agile coaches, in hypothesis-verification-style agile development. The engineering department continues to operate OKRs, and the practice of setting stretch goals and working backward from them has become widespread. These days I have little to do myself, so all I do in win sessions is brag about my daughter's growth.

For the iOS/Android apps too, we are improving and automating the release process so that we can release every other week, and we are also making progress on re-architecting toward a layered architecture. Also, since we started working from home, many teams have been using Miro for retrospectives and brainstorming. Miro is very convenient, and I personally find it more convenient than real sticky notes. New initiatives have begun in various other areas as well, and we are building a development organization that can deliver value to users and restaurants quickly and stably.

Finally

The food service industry is now in great difficulty because of the new coronavirus. I would like to save as many restaurants as possible with the power of technology and, with Tabelog, create a new standard for the food service industry of the future. Japan's restaurant industry is a culture the country can be proud of worldwide. We believe that protecting that culture and developing it into a new form is the mission that we, as Japan's number one gourmet site, should fulfill in society.

If you are interested in Tabelog or share this vision, please feel free to contact us on Twitter or Facebook. Our battle is just beginning!

[^1]: I wrote about OKR in the 2019 Advent Calendar, so please have a look if you like: "The story that the atmosphere of the club became messed up in 3 months after introducing OKR in the technical department" (Qiita).

[^2]: In Ruby, a class is itself an instance of the Class class, and a class definition expression assigns the Class object to the constant named by the class name.
