Senior Site Reliability Engineer
Our mission is to make communities more resilient. We do this by pairing external data with artificial intelligence to identify areas of high risk and prevent catastrophic loss for utilities across the country. We are a team of close-knit engineers, entrepreneurs, and data geeks who obsess over problem-solving, new technologies and making a positive impact in our communities.
We are seeking an experienced Site Reliability Engineer to take charge of our servers, deployments and overall systems.
You will have a passion for the practical side of managing large, complex systems and services, and for planning for maximum uptime using modern tools. Urbint runs a mix of self-hosted services within Google Cloud, most of them managed through Google Container Engine (Kubernetes), and also needs to support on-premises deployments to meet the specific security postures of some clients.
What You'll Do
- Designing High-Availability Systems - ensuring that all of the systems we deploy and depend on are configured to maintain full uptime; planning deployment strategies so that uptime is maintained during upgrades and maintenance; and designing and building out an infrastructure-as-code project.
- Guiding Development Team with Best Practices - working with the Development team to ensure that the software being built will be practical to deploy and maintain.
- Maintaining System and Network Security - managing patches and keeping dependencies up to date; staying informed about zero-day vulnerabilities and other risks that cannot be patched immediately, and devising alternative ways to mitigate them.
- Logging, Metrics and Alerting - managing and organizing an on-call schedule through PagerDuty, connected to metrics and log events. On-call responsibilities will be shared.
What You'll Need
- 5+ years of experience designing and maintaining application systems
- A deep understanding of operating systems and computer architecture
- Strong experience with:
  - Containerization and infrastructure as code (Docker, Kubernetes, Terraform)
  - Queueing & ETL systems (Kafka, RabbitMQ, Airflow)
  - Database systems (Postgres, MongoDB, Elasticsearch)
  - Big data storage and access (Hadoop / HDFS / Hive, S3 / GCS)
  - Logging and metrics (StatsD, Graphite, Grafana, Graylog)
  - Authentication systems (LDAP, Shibboleth)
  - Encryption (SSL, GPG, key and certificate management)
  - Security tools (Tripwire, ClamAV, IDS/IPS systems)
- Solid programming abilities - to help build any glue components between services
- Experience with SOC 2, ISO 27001 and/or other strict, auditable compliance frameworks
- Strong communication and organizational skills
- Experience organizing and growing a team is a huge bonus
What We Offer
- Mission Driven - Some companies use AI to serve better digital ads and trade stocks; we seek to make our communities more resilient.
- Top Compensation - Competitive compensation package.
- Best in Class Medical Coverage - 100% benefits and premiums paid.
- Prime NoHo Location - Our office sits in the heart of NYC’s historic NoHo district and is just minutes away from the B, D, F, M and 6 subway lines.
- Health Perks - Gym reimbursement and Citi Bike membership.
- Strong Culture - Collaborative office focused on teamwork, humility, and hustle.
- Catered lunch on Thursdays, plus a kitchen filled with snacks and drinks.
We're an equal opportunity employer. All applicants will be considered for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.