AWS, GCP and/or Azure
Demonstrated ability to manage professional-level employees
Demonstrated ability to build and leverage relationships to accelerate delivery
Own team delivery of highly available, secure, and cost-effective container orchestration platforms such as Kubernetes and ECS
Dedication to customer experience and quality
Be a team mentor, lead by example, build trust and solid relationships
Inspiring and energizing when talking to colleagues
Leading, managing and developing a distributed team of Site Reliability Engineers
Designing and implementing processes for rolling out software and security updates to deployments with zero downtime
Empowering and challenging squad members to foster individual development
Providing training for all personnel to ensure the highest level of support and customer satisfaction
Encouraging employees to set up and achieve personal development goals
Improving incident mitigation capabilities
Helping teams improve the resilience of their assets
Owning end-to-end availability, reliability, and performance of our PaaS services on Oracle public cloud
Expertise in problem-solving and analyzing global-scale distributed systems
Mindset of continuous improvement of the service and way of working
Support in creating structural solutions instead of workarounds
Continuous improvement of continuous delivery and software engineering practices
Improve the MTTR (Mean Time To Repair) and MTBSF (Mean Time Between Service Failures) of service-impacting incidents
Clear understanding of the product development cycle, technical requirements and project management
Build and maintain models for growth and capacity planning
Work closely with product development, program management, operational, and engineering peers to develop innovative technical tools and solutions
Manage on-call rotations across continents, using a follow-the-sun model
Customer-centric and empathetic approach to support; thrive in making our customers and users successful
Act as crisis manager and technical lead during all major and mass outages, and offer technical input and feedback to all parties engaged in remediation
Coordinate all recovery efforts to provide rapid resolution to any issue that could be impacting the operational environment
Follow up on improvement actions after high-impact incidents (root cause)
Support in automation of the services (create consumable services)
Create real-time and standardized insights of the production chain for faster incident analysis
Work independently with little or no supervision, as well as within a team
Experience managing high-performing, self-directed teams, especially on large-scale projects with technical deep-dives into code, networking, operating systems and/or storage
The most successful managers here act with drive and urgency, yet always take the time to listen and to do the right thing.
A site reliability engineer can be tasked with building a homegrown tool from scratch to help with weaknesses in software delivery or incident management.
Created bid estimates and briefings for senior engineering staff and customers.
Researched several projects, which led to numerous best-practice revisions for the organization.
Developed training seminars for reliability engineering awareness and identification processes.