Data within organizations, often measured in petabytes, grows rapidly each year. As data is generated, it moves from its raw form to a processed version, and then to the outputs end users need to make better decisions. The big data lifecycle comprises five stages: data ingestion, data staging, data cleansing, data analytics and visualization, and data archiving. All big data passes through this lifecycle. Organizations can use services from multiple vendors at each stage to quickly and cost-effectively prepare, process, analyze, and present data, and so derive more value from it. Simpler analytics, cheaper storage, data visualization, and advanced predictive tools such as machine learning (ML) are all needed to make data-driven decisions and maximize the value of data. The big data lifecycle helps organizations of all sizes establish and optimize a modern data analytics practice.
Data Generation or Source
For the data lifecycle to begin, data must first be generated; otherwise, the subsequent steps cannot be initiated. Data generation occurs whether or not you are aware of it, especially in our increasingly online world. Some of this data is generated by your organization, some by your customers, and some by third parties you may or may not be aware of. Every sale, purchase, hire, communication, and interaction generates data. Given proper attention, this data can lead to powerful insights that allow you to better serve your customers and become more effective in your role.
Common data sources include transaction files, large systems (e.g., CRM and ERP), user-generated data (e.g., clickstream data and log files), sensor data (e.g., from Internet of Things or mobile devices), and databases.
Data Ingestion
Not all of the data generated every day is collected or used. It's up to your data team to identify what information should be captured, the best means of capturing it, and what data is unnecessary or irrelevant to the project at hand.
It's important to note that many organizations take a broad approach to data collection, capturing as much data as possible from each interaction and storing it for potential use. While drawing from this supply is certainly an option, it's always important to start by creating a plan to capture the data you know is critical to your project.
Data ingestion entails moving data from an external source into another location for further analysis; the destination is generally some form of storage or a database. For example, ingestion can involve moving data from an on-premises data center or physical disks to virtual disks in the cloud, accessed over an internet connection. Data ingestion also involves identifying the correct data sources, validating and importing data files from those sources, and sending the data to the desired destination. Sources can include transactions, enterprise-scale systems such as Enterprise Resource Planning (ERP) systems, clickstream data, log files, device or sensor data, and disparate databases. During ingestion, high-value data sources are identified and validated, and their data files are imported and stored at the destination.
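To make this flow concrete, the sketch below shows a minimal ingestion step in Python: it validates a hypothetical local CSV export and then lands the file in cloud object storage for downstream processing. The file paths, bucket name, and required columns are illustrative assumptions, and boto3 (the AWS SDK for Python) is used only as one possible destination client.

# Minimal ingestion sketch (illustrative only): validate a local CSV export
# and move it into cloud object storage for downstream processing.
import csv
import boto3  # AWS SDK for Python; any object-storage client would do

REQUIRED_COLUMNS = {"order_id", "customer_id", "amount"}  # assumed schema

def validate_csv(path):
    """Check that the source file contains the columns we expect."""
    with open(path, newline="") as f:
        header = set(next(csv.reader(f)))
    missing = REQUIRED_COLUMNS - header
    if missing:
        raise ValueError(f"Source file is missing columns: {missing}")

def ingest(path, bucket, key):
    """Validate the file, then send it to the destination (an S3 bucket here)."""
    validate_csv(path)
    s3 = boto3.client("s3")
    s3.upload_file(path, bucket, key)  # raw data lands in the ingestion bucket

if __name__ == "__main__":
    # Hypothetical paths and bucket name, used only for illustration.
    ingest("exports/orders_2024-01-01.csv", "my-ingestion-bucket", "raw/orders/2024-01-01.csv")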
Key questions to consider: What are the volume and velocity of my data?
Typical tools used in this stage include Kafka, Sqoop, Storm, Kinesis, Flume, NiFi, and Gobblin.
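Kafka, the first tool listed above, is commonly used for streaming ingestion. The sketch below uses the kafka-python client to publish clickstream-style events to a topic so that downstream consumers can stage and process them; the broker address, topic name, and event fields are assumptions for illustration.

# Minimal streaming-ingestion sketch with the kafka-python client (illustrative only).
import json
from kafka import KafkaProducer

# Assumed broker address and serializer; adjust for your environment.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

events = [
    {"user_id": 42, "action": "page_view", "page": "/pricing"},
    {"user_id": 42, "action": "click", "element": "signup-button"},
]

for event in events:
    # Each event is serialized to JSON and sent to the assumed ingestion topic.
    producer.send("clickstream-events", value=event)

producer.flush()  # block until all buffered events have been delivered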