Sr. Site Reliability Engineer

Frame.io

Remote

Sorry, this job was removed at 9:38 a.m. (EST) on Tuesday, November 23, 2021

We’re looking for someone to join our Infrastructure team who can work closely with Backend Services to create more reliable and robust cloud infrastructure as we scale our product.

About Frame.io

Frame.io is changing the future of how videos are made by helping over 1 million creative professionals seamlessly collaborate from all over the world.

We’re backed by Accel, FirstMark, Insight Partners, SignalFire, Jared Leto, and a host of other amazing investors. Our market-leading product is used and loved by companies such as Turner, Disney, NASA, Snapchat, BBC, BuzzFeed, TED, Adobe, Udemy, and many more.

We’re in an exciting period of growth and are always seeking extremely talented and passionate individuals who share our vision for helping visual content creators produce their best work.

About the Role

As a senior member of the Site Reliability Engineering team at Frame.io, you will work to transform and perfect our Kubernetes platform, develop a multi-cloud strategy, reduce infrastructure cost, and make our infrastructure reliable, performant, and competitive. You will have the opportunity to work cross-functionally to transform and maintain monitorable and reliable software systems serving millions of users every day. We’re looking for someone who has deep technical expertise and experience to join a fast-paced, growing team of SREs tackling challenging problems at scale.

Requirements

8+ years of experience in managing cloud infrastructure, including hands-on experience with AWS (or another public cloud), Kubernetes, GitOps, Terraform, Docker, CI/CD
You have worked in multi-cloud environments and developed migration and deployment strategies for them
You have experience in setting up SLAs/SLOs/SLIs for key services and establishing the monitoring around them
You have deep experience in collaborating with engineering teams and developing tools and technologies for them
You have broad knowledge of cloud security and can facilitate close collaboration between our security and infrastructure teams
You are just as passionate about troubleshooting issues with distributed systems at scale as you are about automating, coding, and collaborating to solve problems
You have materially improved the operability of the systems you've run, through monitoring, service level management, lifecycle management, performance tuning, and documentation
You are passionate about reliable, scalable, observable software and have a strong sense of ownership
You have substantial experience with a programming language like Python or Golang
You have good knowledge of a centralized configuration tool like Chef, Puppet, or Ansible
Experience in storage technologies and developing cost-effective storage solutions is a plus

Responsibilities

Be a thought leader on the SRE team, generating new ideas for building next-generation, cost-efficient infrastructure to host Frame.io services
Develop a multi-cloud/storage provider strategy to increase availability and reduce cost
Identify and bridge gaps to ensure Frame.io cloud infrastructure is reliable, scalable and secure
Continue building, maintaining, and improving our Kubernetes and ECS platforms
Run ChaosDays to continuously iterate on how we handle and respond to failure
Ensure our platform's reliability by taking part in our periodic on-call duty
Partner with product & engineering teams on design, development, and capacity planning to ensure Frame.io continues to scale and maximize availability + observability
Ensure sufficient logging, monitoring and alerting strategies around availability, latency and overall system health
Scale systems sustainably through automation, and evolve systems by pushing for changes that improve reliability and velocity
Continuously improve Incident Response policies, procedures, tools, automation, and implementation
Reduce waste in the infrastructure by leading initiatives to cut cost without compromising the reliability and security of cloud systems
Design and implement tools for engineering to interact with the infrastructure and deploy services in an easy fashion
We stay active within the infrastructure + security communities by attending or speaking at industry events like KubeCon and AWS re:Invent, and would love for you to join in if you are interested

Benefits

Competitive salary and equity
Paid parental leave for primary or secondary caregivers
Unlimited PTO and designated Volunteering paid time off
Yearly stipend for learning and development
Medical, Dental, Vision Insurance and OneMedical membership
Flexible Spending Account
Monthly Work from Home Stipend
1 paid company-wide holiday for each month in the calendar year
All-company week-long winter and summer breaks

Our Philosophy

Our philosophy is simple. At Frame.io, we believe that working with people of different backgrounds and perspectives allows us to elevate each other and helps us build a better product for our users.

We’re proud to be an equal opportunity employer, and are committed to providing all employees with a work environment that celebrates individuality and remains free from any form of discrimination and harassment. We base our employment decisions on the needs of our business, job requirements, and applicants' qualifications. In other words, we only care that you’re the best person for the job.
