What are the techniques for storing data?

There are several key techniques for storing data depending on the type, size, frequency of access, and other requirements of the data. The main techniques include file storage, block storage, object storage, databases, data warehouses, and data lakes. Choosing the right storage technique is crucial for building an effective data management and analytics strategy.

What is file storage?

File storage is the most basic way of storing data as files in a hierarchical folder structure. Files are stored and retrieved using a location path and filename. Examples of file storage systems include local hard drives, network file shares, and NAS (Network Attached Storage) devices.

File storage systems are easy to use and provide random access to data. However, they have limits on scalability and lack some advanced data management features of databases. File storage works well for storing office documents, media files, and other self-contained data sets.
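The path-and-filename access model can be sketched with Python's standard library; the directory and file names below are purely illustrative:

```python
from pathlib import Path
import tempfile

# Hypothetical folder hierarchy: files are addressed by path + name.
root = Path(tempfile.mkdtemp())
reports = root / "reports" / "2024"
reports.mkdir(parents=True)

# Store a file, then retrieve it by its location path.
(reports / "q1.txt").write_text("Quarterly revenue: 1.2M")
content = (reports / "q1.txt").read_text()
print(content)  # Quarterly revenue: 1.2M
```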

What is block storage?

Block storage divides data into fixed-size blocks that are stored as separate units. Each block is addressed independently by its offset. Block storage is typically used to provide raw disk volumes that can be formatted with a file system and mounted just like a physical disk drive.

Block storage provides high performance and scalability but lacks native data management capabilities. It is primarily used for fast access to large volumes of data, such as virtual machine disks and database storage. Common block storage solutions include SANs (Storage Area Networks) and cloud block storage services such as Amazon EBS.
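The offset-based addressing can be simulated with an ordinary file standing in for a raw volume; the 512-byte block size is a common but assumed value:

```python
import os
import tempfile

BLOCK_SIZE = 512  # common block sizes are 512 B or 4 KiB

# Simulate a raw volume with an ordinary file; each block is
# addressed by its offset = block_number * BLOCK_SIZE.
fd, volume = tempfile.mkstemp()
os.close(fd)

with open(volume, "r+b") as vol:
    vol.truncate(BLOCK_SIZE * 8)              # an 8-block "disk"
    vol.seek(3 * BLOCK_SIZE)                  # address block 3 independently
    vol.write(b"hello".ljust(BLOCK_SIZE, b"\x00"))

    vol.seek(3 * BLOCK_SIZE)                  # random access: jump straight back
    data = vol.read(BLOCK_SIZE).rstrip(b"\x00")

print(data)  # b'hello'
```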

What is object storage?

Object storage manages data as discrete objects rather than files in a hierarchy or blocks on a disk volume. Objects contain the data content, a unique identifier, and metadata. The objects are stored in a flat namespace that can scale to billions of objects. Object storage uses REST APIs or cloud gateways to access the objects.

Object storage is optimized for storing and retrieving large amounts of unstructured data such as documents, images, audio, video, and backups. It provides massive scalability, but retrieving an individual object has higher latency than block storage. Amazon S3 and Microsoft Azure Blob Storage are popular public cloud object storage services.
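A toy in-memory sketch of the object model — flat keys, attached metadata, whole-object replacement. This is not a client for any real service; all names are invented:

```python
class ObjectStore:
    """Toy object store: a flat namespace mapping keys to
    (data, metadata) pairs -- no folders, no in-place edits."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, **metadata):
        # Objects are immutable: a put replaces the whole object.
        self._objects[key] = (bytes(data), metadata)

    def get(self, key):
        return self._objects[key]

store = ObjectStore()
# The slashes are just characters in the key -- the namespace is flat.
store.put("backups/2024/db.dump", b"...", content_type="application/octet-stream")
data, meta = store.get("backups/2024/db.dump")
print(meta["content_type"])  # application/octet-stream
```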

What are databases?

Databases are software systems designed for creating structured collections of data and enabling storage, querying, updating, administration, and analysis of that data. Relational databases, the most common type, store data in tables with predefined schemas that enforce constraints and relationships between the data.

Databases support declarative query languages such as SQL and allow fine-grained control of operations through transactions. This makes them suitable for applications that require complex transactions and analytics. The main types of databases include relational databases, NoSQL databases, graph databases, time series databases, and object databases.
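A minimal example of schema enforcement and atomic transactions, using Python's built-in SQLite module; the table and account names are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")

# Transfer funds atomically: both updates commit together or not at all.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 40 WHERE name = 'bob'")

row = conn.execute("SELECT balance FROM accounts WHERE name = 'bob'").fetchone()
print(row[0])  # 40
```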

Examples of Databases

  • Relational databases – MySQL, Oracle, MS SQL Server, PostgreSQL
  • NoSQL databases – MongoDB, Cassandra, Couchbase, HBase
  • Graph databases – Neo4j, Amazon Neptune, OrientDB
  • Time series databases – InfluxDB, TimescaleDB, Prometheus
  • Object databases – Objectivity/DB, Versant

What are data warehouses?

Data warehouses provide storage and management capabilities optimized for analytic workloads such as business intelligence (BI), reporting, and online analytical processing (OLAP). Data is typically modeled into star schemas, with fact tables at the center and dimension tables radiating outward to simplify queries.

Data warehouses store aggregated, historical data that originates from transactional systems or other data sources. This enables running complex analytical queries across vast datasets without impacting the operational systems. Data warehouses may utilize columnar storage and compression to optimize query performance.
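A miniature star schema can be demonstrated in SQLite; the tables and figures below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: descriptive attributes
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    -- Fact table: measures keyed by dimension IDs
    CREATE TABLE fact_sales (product_id INTEGER, amount REAL);

    INSERT INTO dim_product VALUES (1, 'books'), (2, 'music');
    INSERT INTO fact_sales VALUES (1, 10.0), (1, 15.0), (2, 7.5);
""")

# Typical OLAP query: join fact to dimension, aggregate by attribute.
rows = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category ORDER BY p.category
""").fetchall()
print(rows)  # [('books', 25.0), ('music', 7.5)]
```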

What are data lakes?

Data lakes are highly scalable repositories that store massive amounts of raw data in native formats until it is needed. The purpose of a data lake is to ingest and store all of an organization’s data from disparate sources “as is” without applying schemas until the data is queried.

Data lakes leverage low-cost object storage to hold vast amounts of unstructured, semi-structured, and structured data. The data can then be processed and analyzed using schema-on-read techniques. Data lakes are suitable for workloads like data discovery, profiling, batch analytics, and machine learning where schema-on-write is impractical.
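Schema-on-read can be sketched in a few lines: raw records are stored verbatim, and a schema is imposed only at query time. The field names and records are hypothetical:

```python
import json

# Raw records land in the lake "as is" -- no schema enforced on write.
raw_lines = [
    '{"user": "alice", "clicks": 3}',
    '{"user": "bob"}',                      # missing field: accepted as-is
    '{"user": "carol", "clicks": "5"}',     # wrong type: accepted as-is
]

# The schema is applied only when the data is read for analysis.
def read_with_schema(line):
    rec = json.loads(line)
    return {"user": str(rec.get("user", "")),
            "clicks": int(rec.get("clicks", 0))}

records = [read_with_schema(line) for line in raw_lines]
total = sum(r["clicks"] for r in records)
print(total)  # 8
```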

How do you choose the right data storage technique?

Choosing the optimal data storage technology depends on many factors including:

  • Type of data – structured, semi-structured, or unstructured
  • Size of data
  • Frequency of access – transactional or analytical
  • Performance requirements – throughput, latency
  • Need for indexing and ability to run queries
  • Scalability needs
  • Data protection and backup requirements
  • Cost considerations
  • Ease of management and integration

Here are some general guidelines on mapping data characteristics to storage technologies:

  • Structured relational data needing ACID transactions – Relational database
  • Semi-structured data with variable schemas – Document database
  • Graph data with complex relationships – Graph database
  • Time series data from IoT devices – Time series database
  • Files and media – File storage
  • Virtual machine disks – Block storage
  • Backups and archives – Object storage
  • All enterprise data for analytics – Data warehouse
  • All raw data needed for processing – Data lake
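The mapping above could be encoded as a simple lookup, shown here as a toy helper; the categories and labels are simplifications, not a complete decision procedure:

```python
def recommend_storage(data_type: str, workload: str) -> str:
    """Toy decision helper mirroring the guidelines above.
    Real selection weighs many more factors (cost, scale, skills)."""
    rules = {
        ("structured", "transactional"): "relational database",
        ("structured", "analytical"): "data warehouse",
        ("semi-structured", "transactional"): "document database",
        ("unstructured", "archival"): "object storage",
        ("unstructured", "analytical"): "data lake",
    }
    return rules.get((data_type, workload), "data lake (catch-all for raw data)")

print(recommend_storage("structured", "transactional"))  # relational database
print(recommend_storage("unstructured", "archival"))     # object storage
```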

What are the benefits of a database?

Some key benefits of using a database include:

  • Data persistence – Data is safely stored and persists beyond application sessions
  • Managed access – Controls for granting, restricting, and revoking data access
  • Data validation – Strict schema enforcement for data consistency
  • Indexing – Improves lookup performance for queries
  • Query capabilities – Use declarative languages like SQL to query and analyze data
  • Scale and concurrency – Designed to handle multiple concurrent transactions and scale read/write load
  • Reliability – Maintain integrity through ACID transactions and backups
  • Structure – Organizes data into related tables with well-defined relationships
  • Reduced data duplication – Normalization minimizes redundant data

What are the downsides of file storage?

Some potential downsides and limitations of file storage include:

  • Limited metadata and organization – Files lack structured data definitions
  • No built-in data validation – No enforcement of constraints or relationships
  • No query capabilities – Cannot execute complex queries like in a DB
  • No transactions – Lack of atomicity, consistency and isolation
  • Scaling issues – Namespace limitations and overhead with large numbers of files
  • Permission management – OS controls lack granularity
  • Versioning challenges – Manual handling of file revisions
  • Replication limitations – Rely on external systems for backup

When should you consider using a data warehouse?

Key situations where using a data warehouse could be beneficial include:

  • You need to analyze large volumes of business data for trends and insights
  • Your users need to generate business reports with custom aggregations
  • You want to offload reporting and analytics workloads from your production systems
  • You need to merge or consolidate data from disparate sources
  • You want to apply structure/schema to unstructured data sources
  • You need long term retention of historical data
  • Your business stakeholders need self-service access to data
  • You have fragmented data in silos that needs centralizing
  • You want to improve performance for analyzing large, complex datasets

What are some alternatives to consider over data lakes?

Some potential alternatives to using a data lake include:

  • Data warehouse – More structure and management capabilities
  • Databases – Better for transactional or operational workloads
  • Data virtualization – Abstracts data without copying it
  • MPP analytic database – Massively parallel processing of large-scale queries
  • Streaming analytics – Analyze data in motion vs. at rest
  • Data hub – Centralizes data with services/policies
  • Master data management – Manages golden records
  • Data catalog – Discovers, catalogs, and classifies data

The right choice depends on the use case, skill set, and types of analytics required. A data lake can still play a role in a modern data architecture alongside other technologies.

What are some key characteristics of object storage systems?

Some major characteristics of object storage include:

  • Uses integrated metadata to store attributes with objects
  • Employs a flat hierarchy instead of complex directory structure
  • Accesses objects through REST APIs and access keys
  • Designed for internet-scale capacity and access
  • Highly durable and available with built-in replication
  • Massively scalable to billions of objects
  • Suited for large amounts of unstructured data
  • Unable to modify objects in place – objects are replaced whole
  • Eventually consistent updates in some systems to optimize for scale
  • Higher latency than block storage due to metadata lookups

How can migrating to a data lake reduce costs compared to a data warehouse?

Some ways migrating to a data lake can reduce costs include:

  • Uses low-cost storage such as S3 instead of expensive SAN/NAS
  • Schema-on-read requires less upfront modeling effort
  • Avoids extract, transform, load (ETL) overhead
  • Minimizes data duplication using a single source of truth
  • Leverages open source analytics tools instead of licensed BI tools
  • Scales storage and compute independently on demand
  • Pay-as-you-go utilization and billing from cloud services
  • Enables scaling down resources when not in use

However, a data warehouse may be more cost effective if workloads require consistent performance, strong governance, and enterprise tooling support.

What are some key things to consider when choosing a database?

Some important considerations when choosing a database include:

  • How the data will be structured – relational vs non-relational models
  • Types of workloads – transactional, analytical, hybrid
  • Performance requirements – latency, throughput, scalability
  • High availability and disaster recovery needs
  • Data volume and growth expectations
  • Complexity of query and indexing needs
  • Consistency requirements – ACID vs BASE
  • Skill sets of teams supporting and interacting with the system
  • Budget constraints and total cost of ownership
  • On-premises, cloud, or hybrid deployment preferences
  • Administrative overhead to operate and manage

Clearly identifying all functional and non-functional requirements is key to selecting the best database technology.

How can you estimate storage capacity requirements for a data warehouse?

Some tips for estimating data warehouse storage capacity include:

  • Identify source systems, tables, and rows to load
  • Get row counts and average row sizes for source tables
  • Estimate growth rates based on historical data
  • Factor in overhead of database indexing, aggregation and projections
  • Consider compression ratios achievable
  • Add capacity buffer for spikes and future growth
  • Calculate total raw capacity required
  • Evaluate results against available storage options and budget
  • Periodically reassess projections vs actual usage

A bottom-up estimation based on data properties yields a more accurate capacity plan. Benchmarking and proofs of concept can further refine projections.
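The steps above can be turned into a worked bottom-up calculation; every input below is an assumed figure for illustration only:

```python
# Illustrative capacity estimate -- all numbers are assumptions.
rows = 500_000_000          # rows to load from source systems
avg_row_bytes = 200         # average row size from source tables
index_overhead = 0.30       # +30% for indexes, aggregates, projections
compression_ratio = 3.0     # columnar compression assumed achievable
growth_buffer = 0.25        # +25% headroom for spikes and future growth

raw = rows * avg_row_bytes
with_overhead = raw * (1 + index_overhead)
compressed = with_overhead / compression_ratio
required = compressed * (1 + growth_buffer)

print(f"{required / 1e9:.1f} GB")  # 54.2 GB
```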

What are some key capabilities of block storage solutions?

Some key capabilities of block storage solutions include:

  • Provides raw block-level storage volumes
  • Volumes attach to servers like local disks (some platforms support multi-attach)
  • Enables random I/O access at block level
  • Allows point-in-time snapshots and cloning
  • Supports thin provisioning of volumes
  • Facilitates creating read-only copies for backup
  • Can replicate volumes to remote sites
  • Scales up to large capacities with high IOPS
  • Delivers consistent performance with flash arrays
  • Integrates with cloud and virtualization platforms

Block storage provides the foundation for delivering shared storage with performance characteristics tailored to demanding workloads.

Conclusion

Choosing the right data storage technology requires thorough analysis of access patterns, performance needs, frequency of operations, data structure, and growth expectations. The various techniques discussed all serve specific purposes.

File storage provides simple data sharing, object storage delivers massively scalable capacity, block storage enables high performance volumes, databases offer structure with transactions and queries, data warehouses facilitate analytics on aggregated data, and data lakes provide centralized storage for mass data processing.

By mapping system requirements to data storage capabilities, enterprises can build flexible and scalable data platforms to securely manage data growth and unlock more value from data assets.