Whatever your data-related business problem, there is usually a solution in Azure, with several data storage tools available. As we all know, data is an asset and you should manage it carefully. This applies to every part of the business: each function and each data source should feed into one place where data is collected. Later on, you can add applications that will manage and digest it. Preparing a plan for data storage should be one of your top priorities.
What Azure storage offers is a managed service that allows you to store data that can be used throughout your cloud applications.
The beauty of Azure services is their seamless integration with every part of the Azure cloud. You can access this data by using client libraries, URLs, or the REST API, so at each stage of processing there is a tool that will plug into the appropriate account and consume your data, ultimately providing insights into your operations. By applying machine learning techniques you can take this processing to the next level, creating forecasts or finding patterns in data that are deeply hidden from the human eye.
The main benefit of Azure Storage is the durability of the data. Several redundancy mechanisms make sure your data is safe. Data in your account is always replicated to limit the possibility of loss in case of hardware or software failures. Because your data is copied to multiple physical locations, it is resilient to transient hardware failures and network or power outages.
There are four redundancy options:
- Locally redundant storage (LRS): data is replicated within a single storage scale unit in the region where the storage account was created.
- Zone-redundant storage (ZRS): data is replicated synchronously across three storage clusters in a single region, each physically separated from the others.
- Geo-redundant storage (GRS): data is replicated to a secondary region that is hundreds of miles away from the primary region.
- Read-access geo-redundant storage (RA-GRS): provides read-only access to the data in the secondary location, in addition to geo-replication across two regions.
All data written to storage is encrypted, and you can control access and permissions. Azure Storage supports Azure Active Directory and role-based access control.
- Role-based access control (RBAC) roles can be assigned to Azure AD security principals to authorize resource management.
- You can assign RBAC roles scoped to a subscription, resource group, storage account, or an individual container or queue to a security principal.
- Data can be secured in transit with HTTPS or SMB 3.0, or with client-side encryption.
- OS and data disks can be encrypted.
- Data can be encrypted both in transit and on disks.
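RBAC role assignments are always attached to a scope, which is just an Azure resource ID. As a rough sketch (the subscription GUID, resource group, and account names below are placeholders, not real resources), the scope strings narrow like this:

```python
# Sketch: building RBAC assignment scopes for Azure Storage.
# All names and the subscription GUID are placeholders.

def storage_scope(subscription_id, resource_group, account, container=None):
    """Build the resource ID used as an RBAC assignment scope.

    Scopes narrow from subscription -> resource group -> storage
    account -> individual blob container.
    """
    scope = (f"/subscriptions/{subscription_id}"
             f"/resourceGroups/{resource_group}"
             f"/providers/Microsoft.Storage/storageAccounts/{account}")
    if container:
        scope += f"/blobServices/default/containers/{container}"
    return scope

# Scope covering a single container only:
print(storage_scope("00000000-0000-0000-0000-000000000000",
                    "my-rg", "mystorageacct", "invoices"))
```

A role granted at a narrower scope (one container) does not spill over to the rest of the account, which is why scoping is worth getting right.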
Your storage account can be integrated directly with virtual networks. VNets in Azure support multi-tenant services, and you can extend your private address space to service endpoints via a direct connection. This feature restricts access to your resources to the selected VNets, providing private connectivity; there is no need for gateway devices to set up an endpoint.
Shared Access Signature
SAS provides a mechanism to grant limited access to specific objects in a storage account. You can define this for a particular client, providing access to a single object in your storage. By using SAS, your account key is not exposed to the public; the account key is similar to an account password. Microsoft recommends using Azure Active Directory (Azure AD) for authentication in Blob or Queue storage applications where possible. SAS gives you complete control over what types of access you want to provide a client with:
- You can control the time interval during which the SAS is valid.
- You have full control over what type of access is granted, such as read or write.
- You can define an IP address or range of IP addresses that can access an object in the storage account.
- You can require a specific protocol, such as HTTPS.
You can use SAS to authorize access to objects when copying data from a blob to another blob or a file.
How a SAS works
A SAS is a signed URI that includes a token. The token contains a set of parameters defining what access the client has and to which object. It is generated by a client, for example a web application, that holds the account key. When a client provides a SAS with a request, the service checks the parameters and verifies that the signature is valid. If this succeeds, access is granted.
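The signing step can be illustrated with standard-library crypto. This is a simplified sketch, not the full Azure SAS scheme: the real string-to-sign contains more fields in a documented order, and the account name, key, and blob path below are made up. The shape of the mechanism, though, is exactly this: HMAC-SHA256 over a newline-joined parameter string, keyed with the base64-decoded account key.

```python
import base64
import hashlib
import hmac
from urllib.parse import urlencode

# Placeholder account key (real keys are base64-encoded random bytes).
ACCOUNT_KEY = base64.b64encode(b"not-a-real-key").decode()

def sign_sas(permissions, start, expiry, canonical_resource, account_key):
    """Simplified SAS signing: HMAC-SHA256 over a newline-joined
    string-to-sign, returned base64-encoded."""
    string_to_sign = "\n".join([permissions, start, expiry, canonical_resource])
    key = base64.b64decode(account_key)
    sig = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(sig).decode()

sig = sign_sas("r", "2024-01-01", "2024-01-02",
               "/blob/mystorageacct/invoices/report.pdf", ACCOUNT_KEY)
# The token travels as query parameters appended to the blob URL.
token = urlencode({"sp": "r", "st": "2024-01-01", "se": "2024-01-02", "sig": sig})
print("https://mystorageacct.blob.core.windows.net/invoices/report.pdf?" + token)
```

Because the signature is derived from the key, the service can recompute and verify it without the key ever appearing in the URL.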
Data in Azure Storage is accessible from anywhere in the world over HTTP or HTTPS. You can program against it in many languages: .NET, Java, Node.js, Python, PHP, Ruby, and Go.
Blob storage is optimized to hold massive amounts of unstructured data. It is a fully managed service where files can be persisted and accessed by using URLs, the REST API, or a client library. You can use containers to logically group blobs. Blobs are used throughout Azure services; for example, Virtual Machines store their virtual hard disks as blobs. You can use Blob storage for serving images to your application, streaming videos, or as a backup solution for your files.
There are three blob storage access tiers. You can pick a specific one for your current needs, and you will often use all three to minimize costs and meet your business data retention periods. There is always an option to change from one tier to another, but bear in mind that each tier has different prices. Tiering is only available for General Purpose v2 accounts.
- Hot – for data you access frequently.
- Cool – for data you access from time to time and store for at least 30 days.
- Archive – purely for rarely used archival data. In most business cases you will use it for historical data, such as documentation that the law may require you to keep for 7 or 10 years.
If your application needs low-latency access to data, there is an option for you: data in the premium performance tier is stored on SSD drives. With this option you can access data at Formula 1 speeds 🙂
Data Storage Prices

Pricing is tiered by monthly volume (the first 50 TB per month is billed at one rate), and storage operations are charged separately; check the Azure pricing page for current rates.
The main benefit of Azure Files is the ability to create a network file share, which means that multiple services can share the same files with both read and write access.
It gives you the power to access files from anywhere using a URL, the REST API, or client libraries. This is useful when you migrate existing application workloads from on-premises to Azure.
Shared access signatures provide a secure access token issued to a specific application, valid for an amount of time defined by the user. You can mount the file share to the same drive letter as on-premises, simplifying access for your applications.
The Queue service provides a simple managed interface to push messages into a queue and consume the same messages. Messages are stored as serialized strings. They can be processed asynchronously.
They are stored and retrieved in a First-In First-Out (FIFO) manner by using logical queue actions such as enqueue, dequeue, and peek. A single queue message can be as big as 64 KB.
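The FIFO semantics described above can be sketched with an in-memory stand-in. This is not the Azure Queue client library, just a minimal model of the enqueue/dequeue/peek operations and the 64 KB message-size limit:

```python
from collections import deque

# In-memory sketch of Queue service semantics: FIFO order,
# enqueue/dequeue/peek, and the 64 KB per-message limit.
MAX_MESSAGE_BYTES = 64 * 1024

class SimpleQueue:
    def __init__(self):
        self._messages = deque()

    def enqueue(self, message: str):
        if len(message.encode("utf-8")) > MAX_MESSAGE_BYTES:
            raise ValueError("a single queue message can be at most 64 KB")
        self._messages.append(message)

    def dequeue(self) -> str:
        return self._messages.popleft()   # oldest message first (FIFO)

    def peek(self) -> str:
        return self._messages[0]          # look at the head without removing it

q = SimpleQueue()
q.enqueue("order-1")
q.enqueue("order-2")
print(q.peek())     # order-1  (still in the queue)
print(q.dequeue())  # order-1
print(q.dequeue())  # order-2
```

The real service adds visibility timeouts and at-least-once delivery on top, so consumers must tolerate seeing a message more than once.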
Azure Table storage is a NoSQL store that allows you to keep loosely structured sets of entities: a key/attribute store with a schemaless design, which makes the solution very flexible. This kind of storage is fast (entities are indexed by their keys) and effective even with huge volumes of data (terabytes and more). You can store any number of items, and the service accepts calls from both inside and outside the Azure cloud. Tables are ideal for storing non-relational data.
Azure Cosmos DB provides a Table API for applications that use Azure Table storage. Its main advantages are global distribution and single-digit-millisecond latencies. Tables can be accessed using the OData protocol and LINQ queries.
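The key/attribute model above is easy to picture with a tiny in-memory sketch. Every entity is addressed by a partition key plus a row key, and each entity can carry its own set of attributes (the "schemaless" part). The class and data below are illustrative, not part of any Azure SDK:

```python
# Minimal in-memory model of the Table storage data model:
# entities addressed by (PartitionKey, RowKey), schemaless attributes.

class TableSketch:
    def __init__(self):
        self._partitions = {}

    def upsert(self, partition_key, row_key, **attributes):
        self._partitions.setdefault(partition_key, {})[row_key] = attributes

    def get(self, partition_key, row_key):
        # Point lookups on the two keys are what make Table storage fast.
        return self._partitions[partition_key][row_key]

customers = TableSketch()
# Entities in the same table can have different attributes (schemaless).
customers.upsert("UK", "cust-001", name="Ada", city="London")
customers.upsert("PL", "cust-002", name="Jan", vip=True)
print(customers.get("PL", "cust-002"))
```

Choosing a partition key that spreads load evenly is the main design decision; all entities with the same partition key live and scale together.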
Azure Data Lake Storage
Data Lake Storage is ideal for gathering large volumes of data for large-scale analytics. There is no limit on how much data you can store: you can load anything from kilobytes to petabytes, which makes this solution limitless. In addition, there is no time limit on how long you store your data; anything from now to eternity is your choice.
It is ready for Big Data: you can access it directly from an HDInsight cluster. It has been designed with analytics workloads in mind, and it ticks all the boxes for security, availability, scalability, and reliability. It is fully compatible with the Hadoop Distributed File System, and Azure has thought about the access layer as well, providing a WebHDFS-compatible REST interface for applications. Data Lake is compatible with most open-source components of the Hadoop ecosystem and integrates with other Azure services, so you have full flexibility as to how you use it in your solutions.
There are two types of Data Lake Storage.
Data Lake Storage Gen 1
It is a hyper-scale repository for Big Data analytics workloads. You can capture data of any size or type. Data is distributed in blocks in a file system, which gives you the benefit of organizing data in directories and simplifies setting up security mechanisms. The file system supports atomic operations and metadata operations, which has a direct benefit on performance. It is compatible with most open-source components in the Hadoop ecosystem.
Data Lake Storage Gen 2
This is the newest service, available from February 7, 2019. It is built on top of Azure Storage, using Blob storage as its foundation. It enables both object storage and hierarchical file storage; essentially, Gen2 is a combination of the worlds of blobs and file systems. It provides benefits for analytics workloads, and you will be able to access data in multiple ways depending on your needs.
There are two new features that make this different from blob storage:
- File system – The concept of a container (in Blob storage) is now a file system.
- DFS endpoint – It utilizes the ABFS driver, which is part of Hadoop. This file system driver allows performance and security optimizations.
The Gen2 architecture allows for fast data queries: leveraging partition scans, you can retrieve only a subset of the data. With Gen2, operations through the DFS endpoints are metadata operations, which improves performance when changing file locations or names.
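The two endpoints are visible in how Gen2 data is addressed. A sketch of the URI shapes involved (the account and file-system names are placeholders): the Hadoop ABFS driver uses an `abfss://` URI, while the same object is also reachable over REST on the `dfs` endpoint.

```python
# Sketch of ADLS Gen2 addressing. Account, file-system, and path
# names below are placeholders.

def abfss_uri(file_system, account, path):
    """URI shape the Hadoop ABFS driver uses for ADLS Gen2 (secure scheme)."""
    return f"abfss://{file_system}@{account}.dfs.core.windows.net/{path.lstrip('/')}"

def dfs_url(file_system, account, path):
    """REST URL of the same object on the DFS endpoint."""
    return f"https://{account}.dfs.core.windows.net/{file_system}/{path.lstrip('/')}"

print(abfss_uri("raw", "mydatalake", "sales/2019/01.csv"))
print(dfs_url("raw", "mydatalake", "sales/2019/01.csv"))
```

Analytics engines such as Spark on HDInsight take the `abfss://` form directly as an input path.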
Since this is a second-generation storage service, it will be actively evolving; you can expect more and more features.
SQL Database is a managed Database-as-a-Service platform that you can use to host your SQL objects. It is based on the SQL Server Database Engine. The main benefits are high performance, reliability, and security. It supports data structures like relational data, JSON, spatial, and XML.
To query these datasets you can use the T-SQL language. For speed, Azure added columnstore indexes, which are extremely helpful when querying huge volumes of data. Since it is a managed service, there is no need for your admin team to worry about patching or updating it; that is done automatically by Microsoft.
When you require high performance, it can scale dynamically as the need arises. There are three deployment options available depending on your needs:
- A single database, one instance managed via a SQL Database server.
- An elastic pool, a collection of databases with a shared set of resources.
- Managed instance, which is a collection of system and user databases with a shared set of resources.
What is important is the ability to scale automatically with a workload, meaning no more bottlenecks in applications. One great feature here is sharding.
Sharding divides the data store into partitions (shards). Each shard has the same schema but holds a distinct subset of the data. A shard is a data store in its own right, acting as a storage node.
The sharding logic may be implemented as part of the data access code in the application, or it can be implemented by the data storage system. You can create shards manually or by using the elastic scale automated functionality.
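When the sharding logic lives in the application's data access code, the simplest form is hash-based routing: each key deterministically maps to one of N shards with identical schemas. A sketch (the shard names are hypothetical; real deployments often use range- or lookup-based shard maps instead):

```python
import hashlib

# Illustrative hash-based shard routing. Shard names are hypothetical.
SHARDS = ["sqldb-shard-0", "sqldb-shard-1", "sqldb-shard-2"]

def shard_for(key: str) -> str:
    """Map a sharding key (e.g. a customer ID) to a shard, deterministically."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(SHARDS)
    return SHARDS[index]

# The same key always routes to the same shard, so reads find
# the data that earlier writes placed there.
print(shard_for("customer-42"))
print(shard_for("customer-42") == shard_for("customer-42"))  # True
```

The catch with pure hashing is resharding: changing the shard count remaps most keys, which is why lookup-based shard maps are popular for growing systems.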
What problem it solves
Consider an application that needs to respond in real time to rapidly changing usage. This generally applies to web applications during peak times: when an online store announces a sale and informs customers about it, all of a sudden hundreds of requests may be initiated against your web site.
Azure Cosmos DB allows you to dynamically scale throughput and storage across any number of geographical regions, and it will respond to your queries in milliseconds. All of this is available via APIs. For business, there is a really attractive SLA covering throughput, latency, and availability.
Cosmos DB is a globally distributed database service whose main feature is transparently replicating your data to wherever your users are. So it does not matter where your business is located; your application will respond quickly to people in different countries. This is a great benefit in delivering fast responses to local clients.
For developers, it offers flexibility to work with: by using the API you can create, read, update, and delete data.
This video will show you how to provision SQL Database.
As with many services in Azure, you can manage your resources with code. There are APIs and features allowing you to create, modify, or delete specific items. For SQL servers there are APIs and libraries to create tables and to read or update data. The sample Python script below shows how to read data from a SQL Database and create a table.
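A sketch of such a script using `pyodbc` is below. The server, database, and credentials are placeholders, and actually running the query part requires the `pyodbc` package plus the Microsoft ODBC driver, so the connection is opened inside a function rather than at import time:

```python
# Sketch: connecting to Azure SQL Database with pyodbc, creating a
# table if missing, and reading rows. All names/credentials are
# placeholders -- substitute your own.

def connection_string(server, database, user, password):
    """Build an ODBC connection string for Azure SQL Database."""
    return (
        "Driver={ODBC Driver 17 for SQL Server};"
        f"Server=tcp:{server}.database.windows.net,1433;"
        f"Database={database};Uid={user};Pwd={password};"
        "Encrypt=yes;TrustServerCertificate=no;"
    )

CREATE_TABLE = """
IF OBJECT_ID('dbo.Customers', 'U') IS NULL
CREATE TABLE dbo.Customers (
    Id INT IDENTITY PRIMARY KEY,
    Name NVARCHAR(100) NOT NULL
);
"""

def read_customers(conn_str):
    import pyodbc  # deferred import: the sketch parses without the driver installed
    with pyodbc.connect(conn_str) as conn:
        cursor = conn.cursor()
        cursor.execute(CREATE_TABLE)
        cursor.execute("SELECT Id, Name FROM dbo.Customers;")
        return cursor.fetchall()

print(connection_string("myserver", "mydb", "admin_user", "<password>"))
```

In production you would pull the password from Azure Key Vault or use Azure AD authentication rather than embedding credentials in code.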
I will be updating with more useful scripts…