19 Aug 2020 · 14 min read

Site Reliability Engineering — How to prepare for the interview

by Douglas Quintanilha

Find out what we're looking for during the selection process of SREs and the steps of each interview

As an ever-growing company, Wildlife is one of the top 10 mobile game developers and publishers in the world, having released more than 60 games that have been played by billions of people around the globe.

To run that many games, we have many systems for game servers, chat rooms, clan features, in-game messaging, matchmaking, authentication, authorization, player support, and more, all of them running on top of our many Kubernetes clusters.

To support this global, interconnected gaming infrastructure, we're building a world-class SRE team that not only helps our engineers deploy highly scalable distributed systems but is also building a strong cloud foundation to support the company's continued growth over the next few years.

SITE RELIABILITY ENGINEERING AT WILDLIFE

As we showed in this post at Wildlife's Tech Blog, the cloud infrastructure that runs our games plays a crucial role in their perceived quality. One of the main objectives of the SRE team is to provide a fantastic experience for our customers.

For internal clients — such as Game Developers, Backend Developers, Data Scientists, Data Engineers, Game Artists, and others — we provide the necessary tools and automation to deploy our games in a fast, secure, and cost-effective way.

For our external clients, the players, we make sure they can play our games anytime, with reliable servers and with low latency connections.

WHAT WE EXPECT FROM AN SRE AT WILDLIFE

To meet the needs of these many different clients, we look for engineers who are well versed in the skills you would expect of an infrastructure engineer: an understanding of how an operating system works, solid knowledge of networking, security, and protocols, and expertise in common Infrastructure as Code tools, such as Terraform, Ansible, and Packer, to provide reliable cloud infrastructure.

In addition to that, we also look for people who are able to:

Develop in at least one programming language;
Read an existing codebase and make contributions to it;
Review pull requests constructively;
Troubleshoot issues that originate from possible bugs in the source code of our many applications.

We understand that in the current cloud-native world, we need to build a lot of tooling around major technologies like Kubernetes or HashiCorp’s Vault. Some examples are Flux Gitlab Controller and Helm Generate, tools developed by our SREs around FluxCD that enhance its capabilities and let it interact with other tools we use internally.

We need engineers who are not only passionate about infrastructure but are also reliable developers, who can use their coding skills to add value to the company by improving existing tools or building new ones from scratch.

As the job title suggests, reliability is an expertise that we expect from every SRE on the team. It's important to know how to monitor systems accurately, how to choose appropriate SLOs and measure each SLI correctly, and how to deploy scalable distributed applications. And in case everything goes south, you need the troubleshooting and incident management skills to handle incidents in a calm, controlled, and pragmatic way.
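To make the SLO and SLI vocabulary concrete, here is a minimal Python sketch. It is an illustration only and does not describe Wildlife's actual tooling. It assumes a request-based availability SLI (good requests divided by total requests) and a hypothetical 99.9% SLO target. The names `SloReport` and `availability_report` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float               # measured fraction of good requests
    budget_remaining: float  # share of the error budget left (negative if overspent)
    slo_met: bool            # True when the SLI is at or above the target


def availability_report(good_requests: int, total_requests: int,
                        slo_target: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO target.

    Hypothetical helper: the inputs would normally come from a monitoring
    system, which is outside the scope of this sketch.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_requests / total_requests

    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = (1 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    budget_remaining = 1 - actual_failures / allowed_failures

    return SloReport(sli=sli,
                     budget_remaining=budget_remaining,
                     slo_met=sli >= slo_target)


if __name__ == "__main__":
    # Assumed numbers: 1,000,000 requests, 500 of them failed, 99.9% target.
    print(availability_report(999_500, 1_000_000, slo_target=0.999))
```

With these assumed numbers, the SLO tolerates about 1,000 failed requests, so 500 failures leave roughly half of the error budget unspent. A negative `budget_remaining` would mean the budget is already overspent for the measurement window.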

We also don’t expect you to know every tool out there or master any specific programming language. Technology, and especially infrastructure, is always changing and evolving, so we firmly believe that if you have the proper fundamentals, good problem-solving skills, and a passion for infrastructure, you will be able to adapt to any stack that we're running at the moment.

Finally, as a team, we're curious: we want to work with great technologies, and we want to test and research better approaches to everything we do. We want to solve problems that only happen at a global scale, and we want to automate everything to keep toil low, so we can focus on the fun stuff!

STEP BY STEP OF OUR SELECTION PROCESS

Our interview process is based on three operating principles:

- A series of structured, clearly defined steps because we believe this makes it more reliable as a predictor of future performance and more inclusive in the sense of not leaving much room for unconscious bias;

- A high bar for both talent and culture fit. Wildlife aims to be the place where every outstanding contributor can achieve their personal best. We know that one of the things that make talented, driven people flourish is to be surrounded by other talented, driven people.

- A fast process, because everyone hates waiting several weeks for feedback. We believe speed is key to providing candidates with a superior experience.

Small details can vary by location, hiring manager, and level of seniority, but overall we invest a lot of effort into making sure our hiring standards are consistent, given how fundamental this is to building a high-performance distributed organization.

1. Recruitment Strategy
Once we’ve defined a new job position, we advertise it on our page and on other job boards that we consider relevant. Of course, we also actively look for professionals in different networks. Once we find someone we wish to bring to the team, we get in touch and invite them to learn more about us and go through the recruiting process. This usually takes place on LinkedIn or GitHub, but other sources, such as referrals from our employees or candidates, are also common.

2. Sourcing and/or screening
This is the first part of the selection process itself. The sourcing consists of us actively searching for ideal candidates for our positions, while the screening is a selection of all the applications/resumes received. Those who are selected (or who we’ve previously contacted) are asked to have an initial phone conversation with us.

3. Pre-interview
This is what we call the first conversation, which takes place over the phone or by videoconference with a recruiter, a team member, or both, depending on the seniority of the position. In this chat we explain the position, the company, and the selection process in more detail, and we also evaluate the candidate’s motivation and passion for what they do. It may also be used to assess some soft and technical skills. These calls usually last about 40 to 60 minutes.

4. Interviews
After the initial conversation, the selected candidates are called in for the interviews with our SRE team, which take place at Wildlife or remotely via video conference. There are three interviews, lasting one and a half hours each — and if onsite, all of them can be done on the same day. As each stage has the potential to rule out a candidate, some may only participate in one or two of them.

What we evaluate in each interview:

Team Interview. The first interview is done with a potential future teammate. You will solve a number of problems that test your incident management and troubleshooting skills, your automation and infrastructure as code skills, and also your coding skills, in collaboration with the interviewer.

Manager Interview. In the second interview, with the team manager, you will talk a little about your career and solve some of the company’s day-to-day system design problems with the interviewer.

Director Interview. The third interview takes place with the head of the SRE Organization or one of our principal SREs. This is the opportunity to describe your career, your side projects, and your professional goals in more detail. Your alignment with the company’s values will be evaluated, as well as some of the work you delivered at previous companies.

After each interview, the candidate is scored against the predefined criteria that we use for each skill evaluated.

5. Offer
After the interviews, everyone involved meets to discuss the candidate's performance and decide whether we’ll extend a job offer. If an offer is approved, you'll be informed quickly. If you cannot respond immediately, we give you some time to take the offer home and think about it, hoping you’ll decide to join our team.

6. You’ve become a Wilder!
And that’s it! Candidates who accept the offer become new Wilders and are welcomed with open arms by our team. In this final stage, we’ll arrange all the details for the starting date, and once you’ve started, the onboarding phase will provide you with all the tools and context necessary to begin helping us tackle our biggest challenges!

HOW TO PREPARE FOR THE SRE INTERVIEW

Generally, in the interviews, we're trying to assess your skills in specific areas. To show them properly, it’s a good idea to sharpen your communication skills. It’s important to present your ideas clearly, in a structured and cohesive way.

We know that candidates can get nervous, and we take that into consideration. Still, try to stay calm, be truthful, and be yourself, because we're not only evaluating your technical skills but also your alignment with our values, and 'We are Candid' is one of them.


On the technical side, you can expect to be tested on the skills we discussed in this post, so it’s a good idea to review them. You can review your infrastructure skills by taking a look at guides like this one or the “Sysadmin interview questions”. Even though we don’t ask direct questions like those, they serve as a list of topics that you can study.

Posts like this one and the “SRE interview preparation guide” on GitHub also contain a good overview of the expected SRE skills and can serve as good guidance. If you’re more of a book person, Site Reliability Engineering and The Site Reliability Workbook are great reads for understanding the fundamentals of the SRE job from a higher-level view.

As for the system design questions, there are really good guides online on how to prepare for them, for example, the System Design Primer, the Crack the System Design Interview, and this small system design course. They are full of examples and tips on how to approach system design interviews, which should be really helpful even if you’re already used to this type of interview.

On coding questions, we don’t ask questions that require deep algorithmic knowledge, so don’t expect to be asked to code the algorithm to invert a binary tree. Because of that, books like “Cracking the Coding Interview” are useful to teach you how to walk through a problem, but not necessarily for the problems themselves.

We do consider algorithmic knowledge useful in our day-to-day job, but our interview questions are much more focused on implementing small features or single functions of a bigger project. You are expected to deliver working code in at least one programming language, following good coding practices.

It’s also not a bad idea to get familiar with our company and its games. You can learn about our values on this page. Try downloading Tennis Clash, Zooba, or Sniper 3D. Look for them in the App Store, Google Play Store, and on our website. You can also read about our history, check some of our other technical blog posts on Medium, or check some of our open-source projects on GitHub.

I hope this post was informative and interesting to you, and that it fulfilled its purpose of showcasing what we expect from an SRE at Wildlife and how our recruiting process works. If you want to join the team, check out our open positions!

