What is data engineering?
Let's split the term "Data Engineering" into two parts and understand each word first. Then we will combine the two words and define Data Engineering.
Data : Information or a message that can be processed, analysed, and used for a variety of purposes is known as data. Data can take different forms, such as numbers, words, emails, PDFs, audio, and video. It can be stored in databases, data warehouses, and data lakes, or even in plain text files, WordPad documents, and spreadsheets.
Engineering : A discipline in which people design, model, develop, and build innovative products, applications, and solutions to solve problems for the benefit of people. Although it is divided into different streams, the goal remains the same in each specialization.
Combining the two, data engineering is the practice of designing, building, and maintaining systems that extract data from various heterogeneous source systems, process it, and store it in a database, data warehouse, or data lake as a single consolidated, centralized store that Data Scientists, Analysts, and business users can use for analysis.
What is a data engineer?
A data engineer is a software professional who builds data pipelines. A pipeline collects data from various source systems (which may use different formats), transforms it, and loads it into a lakehouse or warehouse where Data Scientists can use it. A data pipeline goes through multiple phases, which vary from case to case based on the requirements.
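The collect, transform, and load flow described above can be illustrated with a minimal sketch. It uses only the Python standard library. The sample inputs, the `orders` table, and the function names `extract`, `transform`, and `load` are assumptions made for illustration, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample inputs standing in for two source systems
# that deliver the same kind of record in different formats.
CSV_SOURCE = "order_id,customer,amount\n1,alice,10.50\n2,bob,20.00\n"
JSON_SOURCE = '[{"order_id": 3, "customer": "Carol", "amount": "5.25"}]'


def extract():
    """Collect rows from both sources as plain dictionaries."""
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    rows.extend(json.loads(JSON_SOURCE))
    return rows


def transform(rows):
    """Normalize types and casing so every row has the same shape."""
    return [
        (int(r["order_id"]), str(r["customer"]).strip().title(), float(r["amount"]))
        for r in rows
    ]


def load(records, conn):
    """Write the transformed records into a target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


# An in-memory database is an assumption; a real pipeline would target
# a warehouse or lakehouse instead.
conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
print(total)
```

In a real project each function would typically be replaced by a connector, a Spark job, or a warehouse loader, but the three-step shape usually stays the same.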
Data Engineering pipelines
A generic data pipeline includes the following phases:
- Collection : This is the first step, where data is collected or extracted from various source systems. The data can arrive in various formats and in high volume.
- Cleaning : Clean the data, for example by removing bad records, duplicate records, and junk characters.
- Processing : Process the data with Big Data or cloud computing tools according to business requirements and rules.
- Data Quality Check : Apply data quality rules to the processed data before loading it into the lakehouse or warehouse. Examples include null checks, blank checks, duplicate checks on natural keys, integrity checks, and custom checks as applicable.
- Storage : Load the processed data into the lakehouse or warehouse and make it available to Data Scientists for analysis. Storage could be, for example, HDFS, S3, or database tables.
- Orchestration : The pipeline steps are combined into a job that is scheduled through a scheduler, which runs the job at the configured frequency. Common scheduling tools include Airflow, UC4, Oozie, and Autosys.
- Governance : The stored data needs to be properly maintained and managed, and it must comply with the applicable security guidelines.
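The data quality phase above can be sketched in plain Python. This is a minimal illustration, not a production framework: the sample records, the choice of `customer_id` as the natural key, and the function names are all assumptions.

```python
from collections import Counter

# Hypothetical processed records; "customer_id" is assumed to be the natural key.
records = [
    {"customer_id": "C001", "name": "Alice", "country": "IN"},
    {"customer_id": "C002", "name": "", "country": "US"},
    {"customer_id": "C001", "name": "Alice", "country": "IN"},
    {"customer_id": "C003", "name": "Carol", "country": None},
]


def null_check(rows, column):
    """Return indexes of rows where the column is missing or None."""
    return [i for i, r in enumerate(rows) if r.get(column) is None]


def blank_check(rows, column):
    """Return indexes of rows where the column is an empty or whitespace-only string."""
    return [
        i for i, r in enumerate(rows)
        if isinstance(r.get(column), str) and not r[column].strip()
    ]


def duplicate_check(rows, key):
    """Return natural-key values that appear more than once."""
    counts = Counter(r[key] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def run_quality_checks(rows):
    """Collect all failures in one report so the load step can be gated on it."""
    return {
        "null_country": null_check(rows, "country"),
        "blank_name": blank_check(rows, "name"),
        "duplicate_customer_id": duplicate_check(rows, "customer_id"),
    }


report = run_quality_checks(records)
passed = not any(report.values())  # True only when every check returns no failures
print(report, passed)
```

Returning a report rather than raising on the first failure is a design choice here: it lets the pipeline log every problem in one run and then decide whether to block the storage step.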
Role of a Data Engineer
The role of a data engineer is to extract or collect data from various source systems and make it available to Data Scientists. The pictorial representation below depicts the various roles a Data Engineer performs:
Data engineering career path
Tools and skills required for Data Engineering : Data engineering tools evolve day by day as new technologies emerge. The most widely used tools in industry are listed below:
- Linux/Unix
- Relational and Non-Relational Databases
- Apache Hadoop (Sqoop, Flume, Hive, Pig)
- Apache Hive/Spark, Databricks for processing
- Python/Scala for programming
- Apache Airflow, UC4 for scheduling
- Cloud services from AWS, Azure, GCP
The following skills are good to have if you want to become a Data Engineer and learn data engineering:
- Knowledge of software engineering fundamentals
- Linux/Unix commands and shell scripting for day-to-day work
- Relational and Non-Relational Databases
- SQL
- Data Modelling
- Data warehouse Fundamentals
- Big Data and Hadoop Components: HDFS commands, Sqoop, Hive, Pig, HBase, Spark (Python/Scala)
- Orchestration: Apache Airflow, UC4, Oozie
- CI/CD Pipeline: Jenkins, GitHub, Docker, Kubernetes
- Cloud Computing (AWS): S3, Lambda, DynamoDB, EC2, VPC, IAM, SQS, SNS, Athena, Redshift, RDS
Data Engineer Vs Data Scientist
Below are a few points that differentiate a Data Engineer from a Data Scientist, although the two work together to solve a problem.
- Data Engineers are responsible for collecting data from various heterogeneous systems and loading it into a lakehouse or warehouse, where Data Scientists can use it for analysis and for building models with machine learning algorithms.
- Data Engineers need to be flexible and agile in adopting new technologies to fulfill their responsibilities, whereas Data Scientists focus on specific areas and studies, training models on the data provided by Data Engineers to deliver analysis and results to stakeholders.
- Data Engineers mainly use Python or Scala for programming, whereas Data Scientists use Python or R in their day-to-day work.
FAQ
Question 1 : Do data engineers do coding?
Ans: Yes, data engineers write code as the use case requires. The most commonly used languages are Python and Scala, together with Apache Spark for data processing and computation. Python is the language most engineers choose for data engineering. Knowing how to write SQL queries is a must for Data Engineers.
Question 2 : Is data engineering a good career?
Ans: Yes, data engineering is a good career option, and the demand for data engineers keeps increasing. Data Engineers learn multiple technologies, which gives them a broad and varied skill set. As they gain experience, they become more valuable to organizations that need solutions in this field. So it is a good choice for anyone who wants to grow technically and financially.
Question 3 : Is data engineer an IT role?
Ans: Yes, it is an IT role. Software professionals with a Computer Science background and experience can become Data Engineers. People from other backgrounds need to devote time to learning the fundamentals, roles, tools, and responsibilities of a Data Engineer. Practical hands-on programming with a dummy ETL project can give a fair idea of Data Engineering. Any professional with courage, dedication, and an eagerness to learn can become a data engineer.
Question 4 : Do data engineers need SQL?