We encourage candidates who are able to work on a W2 basis to apply for this position.
Overview: their team supports data movement from several key processes/workflows: cost processing, S3 data, network traffic (VPC Flow Logs), and CloudTrail data (API calls captured as records).
• Their pipelines collect this data, enrich it with human-readable information, and load it into a unified data store in ClickHouse for reporting and visualization purposes (see the first sketch after this list).
• Current project needs are split between net-new development of pipelines and optimization and maintenance of existing ones.
• Pipelines are built in Scala and PySpark, with some moving over to Lambdas (Python-backed) where possible. Net-new work may involve Lambda development (see the second sketch after this list).
• Spark clusters handle the different types of data, spanning both structured and unstructured data sets.
• Data pipelines run on EMR infrastructure; candidates should understand EMR from the perspective of data distribution, scalability, and performance.
• The majority of pipelines are real-time streaming; cost processing is predominantly batch.
• Candidates should have strong experience not only in building pipelines but also in suggesting performance enhancements for them.
• All code is integrated into their CI/CD pipeline, orchestrated by Jenkins
• Monitoring is through CloudWatch, with some Ganglia (nice to have).
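The bullets above describe the core pattern: collect raw records, enrich them, and load them into ClickHouse. Below is a minimal PySpark sketch of that shape, assuming a streaming read of VPC Flow Log records from S3, a join against a static lookup table, and a per-micro-batch JDBC write to ClickHouse. The bucket names, paths, schema fields, table name, and endpoint are hypothetical placeholders, not the team's actual configuration.

```python
# A minimal sketch, not the team's actual pipeline: stream VPC Flow Log
# records from S3, enrich them with a lookup table, and load each
# micro-batch into ClickHouse over JDBC. All names/paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("vpc-flow-enrichment").getOrCreate()

# Assumed shape of the incoming flow-log records.
flow_schema = StructType([
    StructField("srcaddr", StringType()),
    StructField("dstaddr", StringType()),
    StructField("bytes", LongType()),
    StructField("start", LongType()),
])

# Hypothetical static lookup table used for the enrichment step.
owners = spark.read.parquet("s3://example-bucket/lookups/ip_owners/")

flows = (
    spark.readStream
    .schema(flow_schema)
    .json("s3://example-bucket/vpc-flow-logs/")
)

# Enrich raw flow records with owner metadata keyed on source address.
enriched = flows.join(owners, on="srcaddr", how="left")

def load_batch(batch_df, batch_id):
    # Write each micro-batch to ClickHouse; the JDBC URL and driver
    # class are assumptions about the deployment.
    (
        batch_df.write
        .format("jdbc")
        .option("url", "jdbc:clickhouse://clickhouse.internal:8123/metrics")
        .option("dbtable", "vpc_flows_enriched")
        .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
        .mode("append")
        .save()
    )

query = (
    enriched.writeStream
    .foreachBatch(load_batch)
    .option("checkpointLocation", "s3://example-bucket/checkpoints/vpc-flows/")
    .start()
)
query.awaitTermination()
```

foreachBatch is used here because Structured Streaming has no built-in ClickHouse sink; per-batch JDBC appends are a common workaround, and a job of this shape runs unchanged on EMR.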
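For the Lambda-backed rework mentioned above, here is a hedged sketch of what a Python handler might look like: it reacts to S3 object-created events for CloudTrail log files, flattens the records, and hands them to a loader. The bucket layout, field selection, and the forward_to_store helper are hypothetical illustrations, not the team's code.

```python
# Hypothetical Python Lambda: triggered by S3 object-created events for
# CloudTrail log deliveries, flattens the records, forwards them onward.
import gzip
import json

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # CloudTrail delivers gzipped JSON files with a top-level
        # "Records" list of API events.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        events = json.loads(gzip.decompress(body))["Records"]

        # Keep only the fields the downstream store needs (assumed set).
        rows = [
            {
                "event_time": e.get("eventTime"),
                "event_name": e.get("eventName"),
                "user": e.get("userIdentity", {}).get("arn"),
                "source_ip": e.get("sourceIPAddress"),
            }
            for e in events
        ]
        forward_to_store(rows)

def forward_to_store(rows):
    # Placeholder loader: in practice this might batch-insert into
    # ClickHouse, e.g. via its HTTP interface.
    print(f"would load {len(rows)} rows")
```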
Must Have:
• Scala
• PySpark
• Data pipeline engineering and optimization
• AWS (specifically Lambdas and EMR)
• SQL
Nice to Have:
• ClickHouse database experience
• Ganglia