Most big data stacks fall into one of four patterns, covered below. Pick the one that matches how your data moves.
At a Glance
What “big data” really means at the infrastructure level: practical sizing rules, reference node profiles, and operational guardrails for production.
Best for
- Data ingestion and processing pipelines (batch and near real-time)
- Data lake and lakehouse workloads (Parquet, ORC, Iceberg, Delta patterns)
- Fast analytics and observability queries (ClickHouse OLAP)
- Log analytics, event analytics, product analytics, BI extracts
Primary Infrastructure Bottlenecks
- Memory throughput and GC stability during joins, aggregations, and shuffles
- Fast I/O for spill, shuffle, compaction, and merges
- East-west bandwidth for replication, shuffles, and distributed queries
- Storage layout and write amplification control
What "Good" Looks Like
- Jobs finish because the platform is stable, not because you tuned it for weeks
- Query latencies are predictable at peak concurrency
- Costs are driven by hardware choices you control, not by surprise billing events
Pick Your Big Data and Analytics Approach
1. Data Lake Processing with Spark
Use when
- You run heavy ETL with joins, aggregates, window functions, ML feature engineering
- You store raw and curated datasets in columnar files
- You care about cost per TB processed, not per query millisecond
Infrastructure profile
- CPU dense nodes with enough RAM to reduce shuffle spill
- NVMe for local scratch, shuffle, and temporary storage
- Strong network for shuffles and wide joins
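To make the profile concrete, here is a minimal sketch of how it might translate into Spark settings, assuming NVMe scratch mounted at /mnt/nvme and workers in the 32-to-64-core range. Paths, memory figures, and partition counts are illustrative placeholders, not a tuning recommendation.

```python
# Illustrative Spark session for shuffle-heavy ETL on CPU-dense workers.
# Paths, core counts, and memory figures are assumptions -- adjust to your nodes.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("curated-etl")
    # Keep shuffle and spill on local NVMe, not the OS disk or network storage.
    .config("spark.local.dir", "/mnt/nvme/spark-scratch")
    # Enough executor memory to keep most wide joins out of spill.
    .config("spark.executor.cores", "8")
    .config("spark.executor.memory", "48g")
    .config("spark.executor.memoryOverhead", "8g")
    # Let AQE coalesce partitions and handle skewed joins at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # Start around 2-3x total executor cores; tune from shuffle metrics.
    .config("spark.sql.shuffle.partitions", "512")
    .getOrCreate()
)

df = spark.read.parquet("/data/raw/events")           # raw columnar input
curated = df.groupBy("customer_id").count()           # placeholder aggregation
curated.write.mode("overwrite").parquet("/data/curated/events_by_customer")
```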
2. Hadoop Style Storage + Compute
Use when
- You still run HDFS and want local disks with predictable throughput
- You want “data locality” and accept replication overhead
Infrastructure profile
- Many HDDs for capacity and sequential throughput
- NVMe or SSD tier for YARN local dirs and hot working sets
- Network sized for replication and rebalancing
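A quick back-of-envelope check helps size the capacity and network side of this profile. The sketch below estimates raw HDD capacity and the east-west traffic that replication adds to writes; every input is a placeholder.

```python
# Back-of-envelope HDFS sizing: raw disk needed for a usable dataset and the
# east-west traffic that replication adds to every write.
# All inputs are illustrative placeholders.

usable_tb = 500        # data you actually need to keep queryable
replication = 3        # HDFS default replication factor
headroom = 0.25        # free space reserved for rebalancing and temp data

raw_tb = usable_tb * replication / (1 - headroom)
print(f"Raw HDD capacity to provision: {raw_tb:.0f} TB")

ingest_tb_day = 2.0    # new data written per day
# Each written block is copied (replication - 1) more times across the fabric.
replication_tb_day = ingest_tb_day * (replication - 1)
print(f"Replication traffic: ~{replication_tb_day:.1f} TB/day east-west")
```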
3. ClickHouse for Fast Analytics
Use when
- You need fast OLAP queries on large volumes
- You have many concurrent users or dashboards
- You want cost-effective analytics without the platform tax of a full data warehouse
Infrastructure profile
- High I/O throughput and stable latency
- Enough RAM to keep hot parts in the page cache and avoid external aggregation and sorting
- CPU for compression, decompression, and parallel query execution
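One way to express this profile on the ClickHouse side is a MergeTree table ordered for the dominant query pattern, plus per-query limits so heavy dashboards fail fast instead of pushing nodes into external aggregation. The sketch below assumes the clickhouse-driver Python client; host, table, and limit values are placeholders.

```python
# Illustrative ClickHouse setup for an events table.
# Host, table, and limit values are placeholders.
from clickhouse_driver import Client

client = Client(host="ch-node-1")

client.execute("""
    CREATE TABLE IF NOT EXISTS analytics.events
    (
        event_date Date,
        event_time DateTime,
        customer_id UInt64,
        event_type LowCardinality(String),
        value Float64
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(event_date)
    ORDER BY (customer_id, event_time)
""")

rows = client.execute(
    """
    SELECT event_type, count() AS events
    FROM analytics.events
    WHERE event_date >= today() - 7
    GROUP BY event_type
    """,
    settings={
        "max_threads": 16,                    # CPU share for this query
        "max_memory_usage": 20_000_000_000,   # ~20 GB per-query ceiling
    },
)
print(rows)
```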
4. Hybrid Pipeline: Spark + ClickHouse
Use when
- Spark builds clean datasets and aggregates
- ClickHouse serves fast queries and dashboards
- You want separation between heavy batch windows and interactive workloads
Infrastructure profile
- Separate worker pools—one tuned for batch, one for OLAP
- Clear ingestion boundaries and data lifecycle policies
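A sketch of the handoff point between the two pools, assuming Spark publishes daily aggregates as Parquet to a staging path that ClickHouse then ingests (for example with INSERT ... FORMAT Parquet). Paths, column names, and table names are invented for illustration.

```python
# Hybrid handoff sketch: Spark builds the aggregate and writes it as Parquet
# to a staging path; ClickHouse ingests the files on its side.
# Paths and names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-aggregates").getOrCreate()

events = spark.read.parquet("/data/curated/events")

daily = (
    events
    .groupBy("event_date", "customer_id")
    .agg(
        F.count("*").alias("events"),
        F.sum("value").alias("total_value"),
    )
)

# One partition directory per day keeps ClickHouse ingestion incremental.
(
    daily.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("/staging/clickhouse/daily_customer_aggregates")
)
```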
What is Big Data and Analytics?
Big data infrastructure is the combination of compute, storage, and network that can reliably handle:
- High-volume ingestion (events, logs, CDC, files)
- Distributed processing (batch ETL and transformations)
- Analytical serving (fast queries across large datasets)
The hard part is not “running Spark”. The hard part is preventing your platform from turning into a chaos machine when:
- Your ingest rate spikes
- Your shuffle explodes
- A compaction backlog builds up
- Your queries go from 20 to 200 concurrent users
- Nodes fail and replication kicks in
When Should I Use Big Data Infrastructure?
Use this approach if:
- Your analytics data is too large or too expensive to keep in a managed public cloud warehouse
- Your workloads are predictable enough that owning the performance profile matters
- You need to control where the data lives and how it is moved
- You run production pipelines where missed SLAs have real business impact
- You are done with “it depends”—you want a platform you can size, buy, and operate
Skip this approach if:
- Your data volume is small and your query patterns are simple
- You do not have a team that can operate distributed systems
- You need elastic scale to zero and your workloads are truly sporadic
Rule of Thumb Sizing
These numbers are not laws. They are a sane starting point for mixed analytics nodes that run processing and/or OLAP.
Baseline rule of thumb
- RAM: 4 to 8 GB per CPU core
- CPU: 32 to 64 cores per node
- Storage: 2 to 4 TB NVMe plus HDD capacity as needed
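The baseline translates into a quick per-node sanity check. The sketch below simply applies the ratios above to a few assumed core counts; it is a starting point, not a tuning result.

```python
# Apply the baseline ratios to candidate nodes. Numbers in, numbers out.
def size_node(cores: int, ram_per_core_gb: tuple = (4, 8)) -> dict:
    low, high = ram_per_core_gb
    return {
        "cores": cores,
        "ram_gb_min": cores * low,
        "ram_gb_max": cores * high,
        "nvme_tb": (2, 4),          # local scratch / hot data per node
    }

for cores in (32, 48, 64):
    print(size_node(cores))
# e.g. 48 cores -> 192 to 384 GB RAM, 2 to 4 TB NVMe
```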
What changes the sizing fast
You need more RAM when:
- You do wide joins and large group-bys
- You keep large working sets cached
- You have high query concurrency on ClickHouse
You need more NVMe when:
- Shuffles spill to disk
- ClickHouse merges and compactions fall behind
- You run heavy ingestion with large parts and frequent merges
You need more network when:
- Spark shuffle traffic dominates
- HDFS replication and rebalancing become routine
- You do distributed queries across many nodes
What Pain Points Does This Solve?
- Slow pipelines caused by shuffle spill and weak local I/O
- Query latency spikes from saturated disks or merge backlog
- Unpredictable performance due to noisy neighbors
- Cost unpredictability from consumption pricing and data movement
- Operational chaos from running mixed workloads without isolation
- Scaling problems from storage layouts that do not match access patterns
What you take on:
- You must design and operate it like a real platform, not a weekend project
- You need disciplined data lifecycle management, or you will drown in storage
- Mis-sizing is expensive: too little NVMe or bandwidth will punish you every day
- Distributed systems require monitoring and operational maturity

What you get:
- Predictable performance when you size for the real bottlenecks
- Clear cost drivers: CPU, RAM, storage, network, with no mystery line items
- Workload separation: processing and serving can scale independently
- Freedom to choose frameworks and evolve over time
How Do I Connect Big Data to Price?
Big data costs are driven by a few levers. Make them explicit.
1. Compute cost drivers
- Cores needed for parallelism
- CPU architecture efficiency for compression and query execution
- Peak concurrency for interactive workloads
2. Memory cost drivers
- Working set size
- Shuffle behavior and caching strategy
- ClickHouse query patterns, especially joins and aggregations
3. Storage cost drivers
- Data retention period
- Replication factor and redundancy overhead
- Compaction and merge write amplification
4. Network cost drivers
- Replication traffic
- Shuffle traffic
- Cross-node query traffic
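To make the levers explicit, a rough per-cluster cost model can be scripted. Every unit price and quantity below is a hypothetical placeholder to be replaced with your own quotes; the point is which levers exist, not the figures.

```python
# Rough monthly cost model for a fixed cluster. Every unit price here is a
# placeholder -- substitute your actual quotes.
nodes = 8
cores_per_node = 48
ram_gb_per_node = 384
nvme_tb_per_node = 4
hdd_tb_per_node = 48

unit_cost = {            # EUR per unit per month -- hypothetical figures
    "core": 4.0,
    "ram_gb": 0.8,
    "nvme_tb": 25.0,
    "hdd_tb": 5.0,
    "uplink_10g": 50.0,  # per node
}

monthly = nodes * (
    cores_per_node * unit_cost["core"]
    + ram_gb_per_node * unit_cost["ram_gb"]
    + nvme_tb_per_node * unit_cost["nvme_tb"]
    + hdd_tb_per_node * unit_cost["hdd_tb"]
    + unit_cost["uplink_10g"]
)
print(f"Estimated cluster cost: EUR {monthly:,.0f} / month")
```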
How Can I Build Big Data on Worldstream?
Worldstream is an infrastructure provider. The value is in building a stable foundation and giving teams control, without lock-in or vague contracts.
Option A: Bare Metal Cluster for Spark and Hadoop
Use when: You want full control of CPU, RAM, and local disks. You want stable performance and predictable costs.
- Worker pool for Spark and ingestion
- Storage workers if you run HDFS on-prem style
- Separate control plane nodes for masters and coordination
Option B: Separate Pools for Processing and Analytics
Use when: Spark batch windows and ClickHouse queries should not fight each other.
- Spark worker pool sized for throughput and shuffle
- ClickHouse pool sized for query latency and merges
- Shared ingestion layer and shared observability
Option C: Hybrid Storage Design
Use when: You need HDD capacity but also need fast performance for hot data and scratch.
- NVMe or SSD tiers to accelerate larger HDD pools
- Common approach when you need both capacity and performance
What to expect from Worldstream operationally: Worldstream runs its own data centers and its own network, with in-house engineers. Its positioning is explicitly built on predictable spending and clear agreements. That matters for big data, because unpredictability is usually the real enemy.
Performance Targets and Results Guidelines
Targets depend on the workload. These are the metrics that keep you honest.
Ingestion
Track:
- Events per second or MB per second into the platform
- End-to-end lag from source to queryable
- Backpressure events and queue growth
Red flags: Ingestion is “fine” until compaction starts. Lag grows during peak hours and never recovers.
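End-to-end lag is the number that catches this early. A minimal sketch below computes it against ClickHouse, assuming the placeholder events table and host used earlier and the clickhouse-driver client; adapt the query to wherever your data becomes queryable.

```python
# End-to-end ingestion lag: how far behind "now" is the newest event that is
# already queryable. Host and table names are placeholders.
from clickhouse_driver import Client

client = Client(host="ch-node-1")

(lag_seconds,) = client.execute(
    """
    SELECT dateDiff('second', max(event_time), now())
    FROM analytics.events
    WHERE event_date >= today() - 1
    """
)[0]

print(f"End-to-end ingestion lag: {lag_seconds} s")
# Red flag: this number keeps growing through a peak window instead of recovering.
```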
Spark and Batch Processing
Track:
- Shuffle spill ratio and spill volume
- Stage time variance across identical jobs
- Executor GC time and failures
- Skew and straggler tasks
Red flags: Adding cores makes it slower (usually means I/O or skew). Jobs fail only during peak load (resource contention).
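Spill and shuffle volume are visible in Spark's monitoring REST API before they show up as failed jobs. A sketch below, assuming a history server at a placeholder URL and application id; field names follow the stages endpoint.

```python
# Pull spill volumes per stage from Spark's monitoring REST API (driver UI or
# history server). URL and application id are placeholders.
import requests

BASE = "http://spark-history:18080/api/v1"
APP_ID = "app-20240101000000-0001"   # placeholder application id

stages = requests.get(f"{BASE}/applications/{APP_ID}/stages", timeout=10).json()

for s in stages:
    spill_gb = (s.get("memoryBytesSpilled", 0) + s.get("diskBytesSpilled", 0)) / 1e9
    shuffle_gb = s.get("shuffleWriteBytes", 0) / 1e9
    if spill_gb > 0:
        print(
            f"stage {s['stageId']} ({s['status']}): "
            f"spill {spill_gb:.1f} GB, shuffle write {shuffle_gb:.1f} GB"
        )
# Persistent spill on the same stages usually means too little executor memory
# or too few shuffle partitions for the data volume.
```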
ClickHouse Analytics
Track:
- p95 and p99 query latency for key dashboards
- Merge backlog and part counts
- Disk throughput during merges
- Query concurrency and memory usage
Red flags: Merges falling behind (will turn into user-facing latency). Frequent external aggregation or sorting due to memory pressure.
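All of these signals live in ClickHouse system tables. A sketch with placeholder host and thresholds, again assuming the clickhouse-driver client and that the query log is enabled (it is by default).

```python
# Merge backlog and query latency straight from ClickHouse system tables.
# Host is a placeholder.
from clickhouse_driver import Client

client = Client(host="ch-node-1")

# Active part counts per table: a rising count means merges are falling behind.
parts = client.execute("""
    SELECT database, table, count() AS active_parts
    FROM system.parts
    WHERE active
    GROUP BY database, table
    ORDER BY active_parts DESC
    LIMIT 5
""")
for database, table, active_parts in parts:
    print(f"{database}.{table}: {active_parts} active parts")

# p95 latency of finished queries over the last hour.
p95 = client.execute("""
    SELECT quantile(0.95)(query_duration_ms)
    FROM system.query_log
    WHERE type = 'QueryFinish'
      AND event_time >= now() - INTERVAL 1 HOUR
""")[0][0]
print(f"p95 query latency (last hour): {p95:.0f} ms")
```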
Operations, Performance & Risk Management
Capacity Management
- Separate storage growth planning from compute scaling
- Track hot vs cold data—not all TB are equal
- Plan for failure: replication and rebuild traffic must fit in your network budget
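The last point is worth turning into arithmetic before you buy the network: how long a failed node takes to re-protect at the bandwidth you can actually spare. The figures below are placeholders.

```python
# How long does it take to re-replicate a failed node's data, given the share
# of east-west bandwidth you are willing to spend on it? Placeholder inputs.
failed_node_tb = 48          # data held by the failed node
rebuild_share = 0.3          # fraction of the fabric reserved for rebuild
fabric_gbit_s = 25           # per-node east-west bandwidth

effective_gbit_s = fabric_gbit_s * rebuild_share
rebuild_hours = (failed_node_tb * 8_000) / (effective_gbit_s * 3_600)
print(f"Re-protection window: ~{rebuild_hours:.1f} hours")
# If this window is longer than you can tolerate reduced redundancy, you need
# more bandwidth, smaller nodes, or a lower blast radius per node.
```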
Data Lifecycle
- Define retention per dataset
- Define compaction, partitioning, and tiering policies
- Automate deletion—manual retention is not retention
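On the ClickHouse side, "automate deletion" can be as simple as a table-level TTL. A sketch using the placeholder table from earlier; the retention period is an assumption.

```python
# Retention as code: a table-level TTL drops data automatically once it ages
# out, instead of relying on someone remembering to delete partitions.
# Table name and retention period are placeholders.
from clickhouse_driver import Client

client = Client(host="ch-node-1")

client.execute("""
    ALTER TABLE analytics.events
    MODIFY TTL event_date + INTERVAL 180 DAY DELETE
""")
# For a lakehouse layer, the equivalent is a scheduled job that expires old
# partitions or snapshots rather than an ad hoc cleanup.
```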
Backup and Restore
- Decide what “restore” means: full cluster restore, table restore, or just raw data recovery
- Test restores—not once, routinely
Monitoring
Minimum set:
- Disk throughput and disk latency
- Network throughput and packet drops
- Memory pressure and swap behavior
- Queue and lag metrics for ingestion
- Job success rate and runtime variance
Security
- Encrypt in transit—always
- Encrypt at rest when the threat model requires it
- Use least privilege for service accounts
- Separate environments: dev and prod should not share the same cluster
Worldstream Advantage: Our analytics-ready servers are deployed in Dutch data centers with predictable performance, transparent pricing, and local engineering support—ideal for production big data platforms.
Frequently Asked Questions
Do I really need fast local NVMe storage?
If you run Spark shuffles, yes. If you run ClickHouse merges, yes. If you only store cold data on HDFS and rarely compute, it is less critical. But most real platforms compute.