As a Site Reliability Engineer, you will ensure AMI’s SaaS solutions maintain high availability, the best customer experience, and optimal system uptime. In this role, you will have the opportunity to work on a SaaS platform with cutting-edge technology in the parking industry.
Leverage your expertise in coding, algorithms, complex analysis, enterprise incident coordination, and large-scale system design to triage customer instances and platform issues, and tune resource usage.
Model SRE culture of intellectual curiosity, problem solving, openness, collaboration, reasonable risk taking, and big thinking in a self-directed environment.
Inform stakeholders of service level objectives and the impact on services and cost.
Perform root-cause analysis of complex problems involving multiple integrated systems and services, networks, hardware, and software that relate to scaling and performance.
Set standards for deployments at scale, infrastructure reliability, and scalability.
Influence engineering teams with customer focus towards quick and constructive resolution of conflicts.
Manage service availability and scalability through process, tools, and automation.
Perform post-mortems and optimize incident response processes.
Lead incident response for production incidents; drive investigation, analysis, and troubleshooting to resolve production incidents and systematically drive down detection and mitigation times.
Bring a strong engineering focus to operations, putting your energy into preventing incidents, automation frameworks, self-service infrastructure, logging and metrics, and operational scorecards.
Assist with CI/CD processes to improve cadence.
Identify or utilize existing tools for logging, monitoring, event management, notification, runbook automation, and root cause analysis.
Develop, communicate, and monitor standard processes to promote the long-term health of the platforms.
Participate in security compliance efforts, including drafting and/or reviewing IT policies.
Improve capacity planning, configuration management, and monitoring.
Occasional off-hours, on-call work required.
Additional duties as assigned.
Qualifications: (Skills, Abilities, Knowledge)
2+ years of experience supporting internet-facing production services and distributed
Passion for designing, building, managing, and documenting resilient applications and infrastructures at scale.
Bachelor’s degree or an equivalent combination of education and related work experience.
2+ years hands-on experience with performance monitoring and diagnostic tools.
Excellent written and verbal communication skills.
Advanced knowledge of Linux Administration.
Extensive experience with Git.
Understanding of microservice architectures and the complexities surrounding deployments.
Foundational understanding of security best practices.
Exposure to programming languages such as C#, Java, Python or Go.
Experience with scripting languages such as PowerShell, Bash, or Python.
Troubleshooting experience with Docker containers and Kubernetes.
Knowledge of best practices for running applications in containerized environments, including health checks and rolling update strategies.
Ability to read network packet captures and troubleshoot connectivity issues.
Knowledge of CI/CD pipeline implementation for applications and infrastructure.
Knowledge of Microsoft Azure, AWS, GCP, or similar cloud platforms. Experience with AWS preferred.
Experience using Terraform for infrastructure as code (IaC).
COVID-19 Hiring Update:
We’ve transitioned to a work-from-home model. We are continuing to interview and hire during this time. This role is expected to begin as a remote position. We understand each person’s circumstances may be unique, and we will work with you to explore possible interim options.