Company: EON Systems, Inc.
Location: Little Ferry
Closing Date: 03/11/2024
Salary: £100 - £125 Per Annum
Hours: Full Time
Type: Permanent
Job Requirements / Description
Competitive salaries, including equity, apply.
This role
As a data engineer, you will be responsible for acquiring, processing, and handling large amounts of complex neuroscientific data. You will build and maintain an end-to-end, cloud-based data pipeline that runs from data capture to delivering processed data to our ML models. You will collaborate closely with the human and animal brain data acquisition teams and the AI engineering team, building the interface between data acquisition and our machine learning models.
Representative projects
- Download neuro datasets from 10+ repositories, format and preprocess them, and store them in an infrastructure accessible for training pipelines.
- Build creative validation and quality-assurance steps into this pipeline that let subject-matter experts (SMEs) judge dataset quality, and later automate this process. Visualize key metrics in dashboards. One potential example: run our smallest neuro foundation model on each dataset, rank the datasets by reconstruction loss, and flag any dataset that was used to train the model and will therefore have an artificially low loss.
- Work with ML engineers to build an API to feed (tokenized) brain data to training runs.
- Download or scrape metadata from the above repositories, extract additional metadata from fields like Description, and impute missing metadata via LLMs.
- Proactively work to determine what other projects would provide value to the ML team and the company.
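To make the quality-assurance example above concrete, here is a minimal sketch of the ranking step. It assumes the per-dataset reconstruction losses have already been computed elsewhere; the names `DatasetScore` and `rank_by_reconstruction_loss`, and the choice to list the highest loss first, are illustrative assumptions and not a description of our actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class DatasetScore:
    """Hypothetical record for one dataset in the QA report."""
    name: str
    reconstruction_loss: float
    seen_in_training: bool  # True means the loss is likely artificially low


def rank_by_reconstruction_loss(losses, training_datasets):
    """Rank datasets by loss (highest first) and flag training-set members.

    losses: dict mapping dataset name -> precomputed reconstruction loss
    training_datasets: set of dataset names the model was trained on
    """
    scores = [
        DatasetScore(name, loss, name in training_datasets)
        for name, loss in losses.items()
    ]
    # Highest loss first, so the datasets most in need of review lead the list.
    return sorted(scores, key=lambda s: s.reconstruction_loss, reverse=True)


if __name__ == "__main__":
    report = rank_by_reconstruction_loss(
        {"dataset_a": 0.1, "dataset_b": 0.9, "dataset_c": 0.4},
        {"dataset_a"},
    )
    for row in report:
        note = " (used in training, loss not comparable)" if row.seen_in_training else ""
        print(f"{row.name}: {row.reconstruction_loss:.2f}{note}")
```

Keeping the training flag next to the loss lets a reviewer discount low losses that only reflect memorized data.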
- Manage the acquisition process of petabytes of online datasets of different types and modalities
- Assess and process unstructured and noisy data sets, requiring intensive cleanup and organization.
- Build a cloud-based data pipeline to streamline massive amounts of data for our ML model applications
- Host and maintain our large cloud-based datasets, ensuring scalability, accessibility and end-to-end functionality at all levels
- Collaborate closely with our Machine Learning (ML) team to facilitate and optimize data pipeline projects.
- Document the data pipeline with clear and comprehensive guides, facilitating easy access and understanding for the ML team and other stakeholders.
- Strong demonstrated experience in handling and preprocessing messy, unstructured datasets, ideally within scientific research environments.
- Demonstrated experience in building software around cloud-based data pipeline infrastructures
- Demonstrated experience in building large data infrastructure for ML applications
- Proficiency in cloud computing platforms, at a minimum AWS, and ideally others
- Good understanding of machine learning concepts and how data preprocessing affects ML model performance.
- Strong background and experience in implementing data validation and cleaning techniques.
- Experience in managing complex projects with a focus on timely delivery of technical solutions.
- Excellent communication skills for effective collaboration with technical and non-technical teams.
- Experience with the following: Kafka, Hadoop, EMR, GCP, Glue, Spark, CloudStack, HDFS, Databricks, SageMaker, etc.
- Experience with database management, ETL processes, and SQL/NoSQL databases.
- Thoughtfulness about policy and epistemics related to the rapidly-changing future of technology
This role may not be a good fit if
- You have predominantly developed data pipelines for business contexts, where data needs less serial and experimental processing than complex scientific datasets do.
- Your experience does not include hands-on work with design choices around dataset acquisition.
- You lack familiarity with fundamental scientific computing techniques, for instance, normalizing by z-score or resampling.
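For reference, the two techniques named in the last item can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the function names are hypothetical, the z-score uses the population standard deviation, and the resampler uses plain linear interpolation. Real downsampling of neural recordings would normally apply a low-pass (anti-aliasing) filter first, which is omitted here.

```python
from statistics import fmean, pstdev


def zscore(samples):
    """Normalize a signal to zero mean and unit (population) standard deviation."""
    mu = fmean(samples)
    sigma = pstdev(samples)
    if sigma == 0:
        # A constant signal has no spread; return zeros instead of dividing by zero.
        return [0.0 for _ in samples]
    return [(v - mu) / sigma for v in samples]


def resample_linear(samples, src_rate, dst_rate):
    """Resample a uniformly sampled signal by linear interpolation.

    No anti-aliasing filter is applied, so this is only a sketch of the idea.
    """
    if len(samples) < 2:
        return list(samples)
    duration = (len(samples) - 1) / src_rate          # seconds covered by the input
    n_out = int(duration * dst_rate + 1e-9) + 1       # output samples within that span
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate                 # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


if __name__ == "__main__":
    print(zscore([1.0, 2.0, 3.0]))
    print(resample_linear([0.0, 1.0, 2.0, 3.0, 4.0], src_rate=4, dst_rate=2))
```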