Freelance Senior Cloud Engineer - Azure - 6 months - Inside IR35 - Competitive daily rate
Urgent role
Job summary:
Join a managed services team of Site Reliability Engineers (SRE) as part of the Microsoft Azure Managed Experience (AME) team, supporting a £1.2B, 10-year programme delivering cloud supercomputing services to the UK government. You will work directly with publicly and privately available Azure platforms, third-party providers and the Microsoft Product Group to help define new services.
A typical shift covers monitoring dashboards and triaging and coordinating real-time event and incident responses on a 60-petaflops compute platform that holds over 4 exabytes of data and is powered by 100% renewable energy. This is a programme that will redefine the supercomputing model and position the UK as a world leader in weather forecasting.
Key responsibilities:
75% Incident and Service Management:
Provide initial response and platform issue triage for Alerts, Incidents and Service Requests received from the Microsoft customer. Diagnose, investigate and troubleshoot issues on customer workloads hosted in Azure, partnering with internal and external teams across Microsoft and the Met Office. Take remediation or mitigation action on incidents based on the provided technical service guidelines (TSGs). Own incident tracking, triage, mitigation and resolution. Achieve the agreed SLA / OLA / KPI targets.
Own and drive customer-reported issues to resolution in accordance with the SLAs / SLOs / KPIs designed for the customer initiative.
10% Driving efficiency:
Monitor alerting noise, and introduce processes and tooling to reduce noise and increase signal fidelity. Initiate process changes designed to improve efficiency.
Identify patterns in customer-reported issues and drive proactive problem management activities to reduce repeat customer incidents. Partner with peers within the organization to improve tools, processes and customer support. Measure and track detection rates, and analyse missing monitors by comparing customer-reported incidents with alert-based incidents; provide this analysis to the AME team so that those monitors can be implemented.
10% Documentation:
Create real-time scenarios and document case studies. Create and maintain documentation of scenario-based TSGs, operational documents and process documents. Generate the reporting and documentation required to measure KPIs for the initiative, and contribute to stakeholder communications by providing relevant content.
5% Other responsibilities:
Contribute to AME/SRE solutions across the plan, design, develop and maintain stages.
Contribute to security and compliance work to bring the AME initiative into compliance with standards such as GDPR, ISO and SOC.
Essential Skills:
Excellent analytical and troubleshooting skills.
Strong experience in Microsoft Azure Platform (Compute, Storage, Networking etc.).
Good IT and Azure/cloud experience.
Fair experience in enterprise-level support for a large-scale/enterprise customer.
Strong fault analysis/determination and problem-solving skills.
Ability to adapt to a diverse and changing environment.
Ability to work under continuous deadline pressure and handle crisis situations.
Ready to work a 24x7 shift pattern.
Managing/Operating/Troubleshooting experience in Azure using Azure Management Technologies (Azure Monitor, Monitoring Agents, Kusto Query Language, ARM templates, Azure Policies, IaC and deployment models).
Demonstrate strategic thinking, quantitative and analytical skills and collaboration.
Must follow customer/program compliant processes to ensure actions are carried out in a safe and secure manner in managed customer environments.
Communicate and collaborate effectively in English.
Excellent written and verbal communication skills are a must.
Experience with ITIL compliant incident management.
Documenting troubleshooting and problem resolution steps.
Scripting experience with PowerShell.
Nice to have:
Working experience in high availability environment.
Infrastructure as code experience highly desirable (Azure DevOps, ARM etc.).
Adherence to ITIL processes and SLAs, with responsibility for incident management (proactive and reactive), execution of changes / change management, and problem and performance management.
Linux experience (particularly Red Hat Linux) is highly desirable.
Working knowledge of programming concepts.
ITIL or Microsoft certifications.
Experience in performing root cause analysis (RCA).
Patch Management experience in the cloud.
Please send your latest CV to: jake.williams@source-technology.com
I will call you this afternoon as it is an urgent role.