Cloud Operations Engineer

Summary of responsibilities

The Cloud Operations Engineer will be part of a remote team that works on commercial, state, and federal projects. This candidate will work closely with existing DevOps and CloudOps teams and take part in daily Scrum sessions. Our CloudOps team is responsible for site reliability, monitoring, automation, and resolving system alerts and customer issues. The team takes the lead in education and in implementing solutions that meet or exceed customers' needs. To us, this means that teams own their automation and monitoring. We focus our efforts on building scalable and reliable infrastructure that keeps our platform running smoothly. Some travel may be required.

• Monitor and resolve application and customer issues in production.
• Support automating the resolution of recurring issues and of issues that currently need manual intervention.
• Identify and implement process improvements that reduce the time to resolve support tickets.
• Create dashboards and solutions to proactively identify issues.
• Reduce human error and increase quality and security through automation.
• Collaborate effectively, using excellent verbal and written communication skills.
• Troubleshoot alerts and escalated issues.
• Engage in and improve services throughout their lifecycle, from deployment through operation and refinement.
• Maintain production environments by measuring and monitoring availability, latency, and overall system health.
• Scale systems sustainably through automation.
• Evolve systems by pushing for changes that improve reliability.
• Practice sustainable incident response and disaster recovery exercises.
• Communicate in real time using Slack and MS Teams.
• Follow infrastructure as code best practices.
• Participate in an on-call rotation to troubleshoot production-impacting issues.
• Create and improve documentation and runbooks.
• Participate in blameless postmortems.

Proficiencies & Skills

• High sense of urgency and drive to resolve issues quickly.
• Expertise in analyzing and troubleshooting containerized workloads and applications.
• Script-first mentality for automation.
• Ability to debug, optimize code, and automate routine tasks.
• Solid Python, shell, Java, and JavaScript knowledge.
• Systematic and creative problem-solving approach, with effective communication.
• Proven track record of supporting multi-AZ, multi-region, N-tier application architectures on public cloud infrastructure.
• Understanding of Unix/Linux operating systems.
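• Understanding of the application golden signals (latency, traffic, errors, and saturation).
• Understanding of dashboarding using methods like USE (utilization, saturation, errors) and RED (rate, errors, duration).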
• Managing cloud-based infrastructure on AWS (preferred), Azure, or GCP.
• Advanced knowledge of infrastructure-as-code tools and best practices.
• Code repository best practices: Git, GitHub, “Git Flow” or other workflows.
• IaaS administration (SDKs and CLI – AWS preferred).
• Building, optimizing, hardening, and troubleshooting new services, tasks, and technology from POC to production.
• Application performance monitoring (APM).
• Experience using PostgreSQL and/or MySQL.
• Experience with Continuous Integration tools like GitHub Actions.
• Knowledge of web and application server management (Nginx, Tomcat, Node.js).
• Experience with Terraform, Ansible, or CloudFormation.
• Experience with AWS technologies such as EC2, ECS, S3, RDS, and CloudWatch.
• Ability to run Docker containers on AWS ECS.

Education & Experience

• Bachelor’s degree in Computer Science or a related technical field, or equivalent experience.
• 4+ years of professional experience in cloud operations and application monitoring.
• AWS and Terraform certifications are a plus.
