Data Engineer
Atlanta, GA, United States
Company Introduction
OneSource Regulatory Technology hosts a number of innovative solutions to enhance job performance in the Pharmaceutical space. OSR Technology is looking for an experienced and dedicated data engineer to join our product solutions team!
Job Description
OneSource Regulatory is seeking a full-time contractor with at least 4 years of experience to assist us with ongoing R&D projects.
We are looking for a data engineer to pull data from various sources and perform all the steps needed to clean, normalize, possibly annotate, and finally load the data into databases. The candidate should be able to develop and implement a strategy for testing the integrity of the collected data. This role requires extreme attention to detail to ensure that data quality remains a top priority.
Responsibilities
Well versed in parsing and synthesizing XML and/or JSON documents.
Curating data, which can involve intermediate to advanced web scraping (data may need to be fetched from sources on the Internet via SFTP, FTP, Wget, curl, REST APIs, or GraphQL queries).
Proficiency with the Linux command line and simple tools such as grep, wc, sed, awk, find, ls, cat, and piped commands, and possibly some very light Bash shell scripting and setting up crontab schedules and programs.
Must have basic knowledge of SQL with the following databases: PostgreSQL, MySQL, Google BigQuery
Must have basic knowledge of NoSQL databases such as MongoDB or similar
Familiarity with basic cloud technology such as storage buckets and serverless cloud functions
Must have experience extracting text and images from PDF files
Knowledge of Puppeteer or other automatable web client technologies
Understanding JavaScript, HTML/CSS and HTTP methods (for understanding page structure for web scraping)
Skills
Solid experience with Python and Python libraries such as pandas, requests, etc.
Skill set should match up with required responsibilities listed above
Strong English skills (e.g. grammatical analysis and rhetorical structure)
Team Player
Great communication skills
Bonus Skills
Experience within the Pharmaceutical space
Ability to expose data via C# .NET Core and/or GraphQL
Google Cloud Platform (Cloud Storage buckets, Google Cloud Functions (.NET, Python, Node.js))
Ability to parallelize data manipulation and scraping via Python multithreading, etc.
Python BeautifulSoup
Scrapy
Docker (setting up Kubernetes-style processing if warranted for data scraping, data ingestion, or normalization)
Multithreading concepts