Dellent is a consulting company focused on Information Systems and Telecommunications. Our goal is to help our candidates and consultants take a step forward in their careers through projects that meet their needs and expectations.
We are looking for a Site Reliability Engineer (SRE) for a telco project in Lisbon and Porto.
Role & Responsibilities:
- Build, support and manage the daily activities and the implementation of best practices regarding: Security; Reliability; Testing Automation; CI/CD; Application and Infrastructure Monitoring; Non-Functional Requirements Management and Implementation; Stabilization;
- Participate in the solution definition to ensure its operability;
- Ensure the solution's resilience, acting as a single point of contact (SPOC) within the team;
- Participate in the definition of performance and resilience tests;
- Ensure the solution's observability;
- Define monitoring requirements (e.g. log types);
- Validate performance metrics and monitoring KPIs;
- Challenge the best practices for the CI/CD solution and its evolution;
- Work with stakeholders to fully understand and communicate the Root Cause Analysis and implement the lessons learnt;
- Look at monitoring KPIs & logging efficiency to introduce new tools towards a more reliable solution;
- Drive initiatives to make the solution (and all its components) more reliable – that is, less prone to cause support tickets;
- Work with developers during the software development lifecycle to ensure that developed services are operationalized.
Requirements:
- Familiar with DevOps culture and experienced in application reliability practices;
- Experience with Environments & Infrastructure (Unix/Linux);
- Experience with Cloud (AWS, Oracle, Azure);
- Experience working with Kafka and containers (Docker, Kubernetes);
- Familiar with real-time monitoring solutions & tools (e.g. Kibana, Elasticsearch, AppDynamics, Prometheus, Grafana);
- Skilled in software implementation and configuration to maintain infrastructure and application solutions;
- Skilled in system reliability, quality and automation;
- Keen to measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs and innovating to continually improve;
- Experience providing primary operational support and engineering for multiple large distributed software applications;
- Familiar with complex OSS & BSS solutions across several technological domains (e.g. Online, Automation, QA);
- Experience in business/technical assessments of solution life cycle asset management processes;
- Operations experience in asset reliability risk evaluation;
- Operations experience in problem management processes;
- Operations experience in the collection of solution reliability metrics, communicating to internal and external stakeholders;
- Familiar with the JIRA ticketing tool for operational reporting.
Nice to Have:
- Agile certifications;
- Cloud certifications;
- ITIL v4;
- At least 3 years of experience working on large-scale projects with multiple agile teams.
Personal Traits:
- Ability to adapt to different contexts and teams;
- Great communication and teamwork skills, with a sense of autonomy;
- Motivation for international projects, with availability to travel if required.
Apply:
If the description above sounds like your next professional challenge, please apply here.