Many businesses these days face a problem with data: there is simply too much of it. Data is spread across the organization, coming from many different places, and each source poses its own problem: it is either very complex or its structure is difficult to manage. On top of that, we observe exponential growth of data. The best way to describe Big Data is with the concept of the 3 Vs.
- Volume – huge volumes of data. Companies have so much data that it is difficult to manage and to get insights from. If you have terabytes of data, you need specialized tools to manage it. You can achieve this with HDInsight and its associated toolset.
- Variety – businesses have multiple data sources, coming from various business units and systems spread across the organization. This creates a problem of mixed structured and unstructured data. Imagine that each system provides you with data in a different file format.
- Velocity – to make things even more complex, imagine that data arrives very quickly from various places. How do you manage it? It would be nice if you could process and analyze it in real time. This is why you need to get your hands on the Big Data tools that Azure offers in HDInsight.
You must remember that to take advantage of these tools you need to think more broadly. You need to apply the right working methodology, create a strategy, and align the business to make it a success. There is a great article on how to make better business decisions with predictive analytics; this is something that should be part of your strategy.
Big Data Concepts
- Batch processing – you process data in blocks (all in one go). Very often the data needs some filtering and cleaning first; once you have shaped it, analysis and visualization become much easier. This method works best for historical analysis.
- Stream processing – if you want low-latency, real-time results from your analytics jobs, then streaming is your choice. As data arrives, it is processed and fed straight into analytical tools.
- Machine Learning – when you have access to large volumes of data, you may want to try predictive analytics. Large volumes of good data allow you to apply statistical models such as classification or regression to forecast results, which improves your decision-making process.
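The difference between batch and stream processing can be sketched in plain Python. This is only an illustration: the records and the sum aggregation are invented for the example, and a real workload would run on cluster frameworks rather than a single process.

```python
# Batch vs. stream processing, illustrated with plain Python.
# The "records" list is a stand-in for data arriving from any source.

records = [("sensor-a", 3), ("sensor-b", 5), ("sensor-a", 4), ("sensor-b", 1)]

def batch_total(data):
    """Batch: wait for the whole block of data, then process it in one go."""
    return sum(value for _, value in data)

def stream_totals(data):
    """Stream: update a running result as each record arrives."""
    running = 0
    for _, value in data:
        running += value
        yield running  # a result is available immediately, not only at the end

print(batch_total(records))          # 13 - one answer after all data is seen
print(list(stream_totals(records)))  # [3, 8, 12, 13] - answers as data flows in
```

The batch function cannot produce anything until the whole dataset is available, while the streaming generator yields an updated answer per record; that is the trade-off behind the two approaches.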
What is HDInsight?
HDInsight is a fully managed service that lets you take advantage of Big Data tools. It is a standard Hadoop distribution, created by Hortonworks, running on Azure Virtual Machines.
Managing cluster computing becomes very easy when you have the right tools in hand. Behind the scenes, there are several open-source frameworks ready to use (such as Apache Hadoop, Apache Spark, Apache Hive, Apache LLAP, Apache Kafka, Apache Storm, and Microsoft Machine Learning Server). HDInsight supports workloads such as extract, transform, and load (ETL), data warehousing, machine learning, and IoT.
Azure has all the tools you need to manage your data successfully, and I have already started describing this toolset.
HDInsight is a bit of a hybrid creature: mostly PaaS with some IaaS attributes. It gives you access to the Hadoop family of products, with several cluster types to choose from. We have to remember one big disadvantage of HDInsight, namely that you cannot switch a cluster off. Unlike VMs, which you can deallocate, an HDInsight cluster is always running and always costing you money. The community has been asking for this feature for months, but Microsoft seems to be ignoring it.
We have the option to choose from:
- Hadoop. There are several Hadoop components on HDInsight to choose from.
- HBase. A NoSQL database that runs on top of Hadoop. For details, see the Apache site.
- Storm. Used to process real-time streams of data.
- Spark. In-memory cluster computing for fast, fault-tolerant calculations.
- ML Services (R Server). Machine Learning Server provides R-based analytics on HDInsight and can leverage more than 8,000 R packages.
- Kafka. A message broker used to publish and subscribe to streams of data records.
- Interactive Query. Also called Apache Hive LLAP, used for low-latency analytical processing.
HDInsight uses Azure Storage or Azure Data Lake Storage as its storage layer; you can use Azure Data Lake Storage Gen 1, Azure Data Lake Storage Gen 2, or both. Don’t worry about losing your data when you remove your cluster: cluster and storage are independent of each other, so you can safely delete the cluster without losing your data. When it comes to metadata, HDInsight uses Azure SQL Database to keep the Hadoop configuration metadata. You don’t need to worry about this either; Azure manages it for you automatically.
Hadoop is a set of open-source programs that together provide a system to process data in a cluster. Data is stored in HDFS (the Hadoop Distributed File System) and processed following the MapReduce principles. Another very important framework is YARN (Yet Another Resource Negotiator), which is responsible for managing Hadoop resources.
MapReduce is a very important concept in Big Data, and to take full advantage of this technology you must understand what it is. It is a process for managing data in two phases, both operating on key-value pairs.
First, the Map phase is performed, in which data values are mapped to specific keys. This creates a list of keys and their matching values.
The next step is Reduce, which performs an operation on the values of each key, for example adding or multiplying them. The output is a set of distinct keys with their calculated values.
The basic example below shows this process.
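The classic word-count example can be sketched in plain Python. This mimics the MapReduce flow on a single machine (the input lines are invented for illustration); a real Hadoop job would distribute the same two phases across the cluster.

```python
from itertools import groupby
from operator import itemgetter

lines = ["big data", "big cluster", "data cluster data"]

# Map phase: emit a (key, value) pair for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the pairs by key (Hadoop does this between the two phases).
mapped.sort(key=itemgetter(0))

# Reduce phase: apply an operation (here, addition) to each key's values.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}

print(counts)  # {'big': 2, 'cluster': 2, 'data': 3}
```

The Map step produces the key-value list, the sort stands in for Hadoop's shuffle, and the Reduce step collapses each key's values into a single result, exactly the two phases described above.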
How to start with HDInsight
Provisioning an HDInsight cluster is quite simple. The video below shows how to do it in the Azure portal, but this is not the only way: in Azure, you can create resources not only through the portal but also by scripting, for example with PowerShell or the Azure CLI. The most repeatable approach is to use ARM templates.