
Big Data and Analytics Infrastructure

Run Hadoop, Spark, and ClickHouse with predictable performance for ingestion, processing, and analytics.

At a Glance

What “big data” really means at the infrastructure level—practical sizing rules, reference node profiles, and operational guardrails for production.

Best for

  • Data ingestion and processing pipelines (batch and near real-time)
  • Data lake and lakehouse workloads (Parquet, ORC, Iceberg, Delta patterns)
  • Fast analytics and observability queries (ClickHouse OLAP)
  • Log analytics, event analytics, product analytics, BI extracts

Primary Infrastructure Bottlenecks

  • Memory throughput and GC stability during joins, aggregations, and shuffles
  • Fast I/O for spill, shuffle, compaction, and merges
  • East-west bandwidth for replication, shuffles, and distributed queries
  • Storage layout and write amplification control

What "Good" Looks Like

  • Jobs finish because the platform is stable, not because you tuned it for weeks
  • Query latencies are predictable at peak concurrency
  • Costs are driven by hardware choices you control, not by surprise billing events

Pick Your Big Data and Analytics Approach

Most stacks fall into one of these four patterns. Pick the one that matches how your data moves.

1. Data Lake Processing with Spark

Use when
  • You run heavy ETL with joins, aggregates, window functions, ML feature engineering
  • You store raw and curated datasets in columnar files
  • You care about cost per TB processed, not per query millisecond
Infrastructure profile
  • CPU-dense nodes with enough RAM to reduce shuffle spill
  • NVMe for local scratch, shuffle, and temporary storage
  • Strong network for shuffles and wide joins
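
A minimal PySpark sketch of how this profile gets used is shown below; the executor sizes, paths, and partition count are assumptions to adapt to your own nodes, not recommendations.

    from pyspark.sql import SparkSession

    # Hypothetical worker: 32 cores, 256 GB RAM, NVMe mounted at /nvme.
    # On YARN, local scratch comes from yarn.nodemanager.local-dirs rather than spark.local.dir.
    spark = (
        SparkSession.builder
        .appName("etl-curation")                           # illustrative job name
        .config("spark.executor.cores", "8")               # four executors per 32-core node
        .config("spark.executor.memory", "48g")            # leave headroom for OS and overhead
        .config("spark.local.dir", "/nvme/spark-scratch")  # shuffle and spill land on local NVMe
        .config("spark.sql.shuffle.partitions", "800")     # size to data volume, then adjust after measuring spill
        .getOrCreate()
    )

    # The kind of work that drives this profile: a wide join plus aggregation over columnar files.
    events = spark.read.parquet("/data/raw/events")        # hypothetical paths
    users = spark.read.parquet("/data/curated/users")
    daily = events.join(users, "user_id").groupBy("user_id", "event_date").count()
    daily.write.mode("overwrite").parquet("/data/curated/daily_counts")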

2. Hadoop-Style Storage + Compute

Use when
  • You still run HDFS and want local disks with predictable throughput
  • You want “data locality” and accept replication overhead
Infrastructure profile
  • Many HDDs for capacity and sequential throughput
  • NVMe or SSD tier for YARN local dirs and hot working sets
  • Network sized for replication and rebalancing

3. ClickHouse for Fast Analytics

Use when
  • You need fast OLAP queries on large volumes
  • You have many concurrent users or dashboards
  • You want cost-effective analytics without paying the platform tax of a full data warehouse
Infrastructure profile
  • High I/O throughput and stable latency
  • Enough RAM to keep hot parts cached and avoid external aggregation and sorting
  • CPU for compression, decompression, and parallel query execution
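
To make the storage side of this profile concrete, here is a minimal MergeTree table sketch, sent through the clickhouse-connect Python client (an assumption about your tooling; the database, columns, and TTL are purely illustrative).

    import clickhouse_connect  # assumption: the clickhouse-connect client library is installed

    client = clickhouse_connect.get_client(host="ch-node-1")  # hypothetical host

    client.command("CREATE DATABASE IF NOT EXISTS analytics")

    # ORDER BY and PARTITION BY decide how much data each query scans
    # and how much merge work the background threads have to do.
    client.command("""
        CREATE TABLE IF NOT EXISTS analytics.page_events
        (
            event_date   Date,
            user_id      UInt64,
            event_type   LowCardinality(String),
            duration_ms  UInt32
        )
        ENGINE = MergeTree
        PARTITION BY toYYYYMM(event_date)
        ORDER BY (event_type, user_id, event_date)
        TTL event_date + INTERVAL 13 MONTH
    """)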

4. Hybrid Pipeline: Spark + ClickHouse

Use when
  • Spark builds clean datasets and aggregates
  • ClickHouse serves fast queries and dashboards
  • You want separation between heavy batch windows and interactive workloads
Infrastructure profile
  • Separate worker pools—one tuned for batch, one for OLAP
  • Clear ingestion boundaries and data lifecycle policies

What is Big Data and Analytics?

Big data infrastructure is the combination of compute, storage, and network that can reliably handle:

  • High-volume ingestion (events, logs, CDC, files)
  • Distributed processing (batch ETL and transformations)
  • Analytical serving (fast queries across large datasets)

The hard part is not “running Spark”. The hard part is preventing your platform from turning into a chaos machine when:

  • Your ingest rate spikes
  • Your shuffle explodes
  • A compaction backlog builds up
  • Your query load jumps from 20 to 200 concurrent users
  • Nodes fail and replication kicks in

When Should I Use Big Data Infrastructure?

Use this approach if:

  • Your analytics data is too large or too expensive to keep in a managed public cloud warehouse
  • Your workloads are predictable enough that owning the performance profile matters
  • You need to control where the data lives and how it is moved
  • You run production pipelines where missed SLAs have real business impact
  • You are done with “it depends”—you want a platform you can size, buy, and operate

Skip this approach if:

  • Your data volume is small and your query patterns are simple
  • You do not have a team that can operate distributed systems
  • You need elastic scale to zero and your workloads are truly sporadic

Rule of Thumb Sizing

These numbers are not laws. They are a sane starting point for mixed analytics nodes that run processing and/or OLAP.

Baseline rule of thumb

  • RAM: 4 to 8 GB per CPU core
  • CPU: 32 to 64 cores per node
  • Storage: 2 to 4 TB NVMe plus HDD capacity as needed

ClickHouse's own hardware guidance discusses memory-to-CPU ratios that map to roughly 4 GB of RAM per vCPU for general-purpose use and 8 GB per vCPU for data-warehouse patterns. Sensibly configured Spark clusters frequently end up in the same ballpark.
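
The arithmetic behind these ratios is simple enough to keep in a script. The sketch below turns them into a per-node starting point; the example figures at the bottom are placeholders rather than measurements.

    import math

    def baseline_node(cores: int, gb_ram_per_core: int = 6, nvme_tb: float = 3.0) -> dict:
        """Starting profile per worker: 4 GB/core for general purpose, 8 GB/core for warehouse-style work."""
        return {"cores": cores, "ram_gb": cores * gb_ram_per_core, "nvme_tb": nvme_tb}

    def nodes_for_hot_data(hot_tb: float, node: dict, replicas: int = 2) -> int:
        """Rough node count so the replicated hot working set fits on NVMe with ~30% free for merges and spill."""
        usable_per_node = node["nvme_tb"] * 0.7
        return max(1, math.ceil(hot_tb * replicas / usable_per_node))

    node = baseline_node(cores=64)                      # 64 cores, 384 GB RAM, 3 TB NVMe
    print(node)
    print(nodes_for_hot_data(hot_tb=20, node=node))     # ~20 nodes for 20 TB of hot data kept twice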

What changes the sizing fast

You need more RAM when:

  • You do wide joins and large group-bys
  • You keep large working sets cached
  • You have high query concurrency on ClickHouse

You need more NVMe when:

  • Shuffles spill to disk
  • ClickHouse merges and compactions fall behind
  • You run heavy ingestion with large parts and frequent merges

You need more network when:

  • Spark shuffle traffic dominates
  • HDFS replication and rebalancing become routine
  • You do distributed queries across many nodes

What Pain Points Does This Solve?

  • Slow pipelines caused by shuffle spill and weak local I/O
  • Query latency spikes from saturated disks or merge backlog
  • Unpredictable performance due to noisy neighbors
  • Cost unpredictability from consumption pricing and data movement
  • Operational chaos from running mixed workloads without isolation
  • Scaling problems from storage layouts that do not match access patterns

Benefits and Trade-offs

What you gain:

  • Predictable performance when you size for the real bottlenecks
  • Clear cost drivers: CPU, RAM, storage, network—no mystery line items
  • Workload separation: processing and serving can scale independently
  • Freedom to choose frameworks and evolve over time

What it demands:

  • You must design and operate it like a real platform—not a weekend project
  • You need disciplined data lifecycle management, or you will drown in storage
  • Mis-sizing is expensive: too little NVMe or bandwidth will punish you every day
  • Distributed systems require monitoring and operational maturity

How Do I Connect Big Data to Price?

Big data costs are driven by a few levers. Make them explicit.

1. Compute cost drivers

  • Cores needed for parallelism
  • CPU architecture efficiency for compression and query execution
  • Peak concurrency for interactive workloads

2. Memory cost drivers

  • Working set size
  • Shuffle behavior and caching strategy
  • ClickHouse query patterns, especially joins and aggregations

3. Storage cost drivers

  • Data retention period
  • Replication factor and redundancy overhead
  • Compaction and merge write amplification

4. Network cost drivers

  • Replication traffic
  • Shuffle traffic
  • Cross-node query traffic
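
A deliberately crude model is usually enough to compare node profiles against these levers. Every price in the sketch below is a placeholder to replace with your own quotes, and network is treated as flat-rate.

    # Placeholder monthly prices per unit; replace with real quotes.
    PRICE = {"core": 4.0, "ram_gb": 0.5, "nvme_tb": 30.0, "hdd_tb": 8.0}

    def monthly_node_cost(cores: int, ram_gb: int, nvme_tb: float, hdd_tb: float = 0.0) -> float:
        """Monthly cost of one worker built from the compute, memory, and storage levers above."""
        return (cores * PRICE["core"] + ram_gb * PRICE["ram_gb"]
                + nvme_tb * PRICE["nvme_tb"] + hdd_tb * PRICE["hdd_tb"])

    def cost_per_tb_processed(node_cost: float, tb_per_month: float) -> float:
        """The number that matters for batch pipelines: cost per TB actually processed."""
        return node_cost / max(tb_per_month, 1.0)

    # Example: the 64-core / 384 GB / 4 TB NVMe profile, processing ~500 TB per month.
    cost = monthly_node_cost(cores=64, ram_gb=384, nvme_tb=4)
    print(round(cost, 2), round(cost_per_tb_processed(cost, 500), 2))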

How Can I Build Big Data on Worldstream?

Worldstream is an infrastructure provider. The value is in building a stable foundation and giving teams control, without lock-in or vague contracts.

Option A: Bare Metal Cluster for Spark and Hadoop

Use when: You want full control of CPU, RAM, and local disks. You want stable performance and predictable costs.

  • Worker pool for Spark and ingestion
  • Storage workers if you run HDFS on-prem style
  • Separate control plane nodes for masters and coordination

Option B: Separate Pools for Processing and Analytics

Use when: Spark batch windows and ClickHouse queries should not fight each other.

  • Spark worker pool sized for throughput and shuffle
  • ClickHouse pool sized for query latency and merges
  • Shared ingestion layer and shared observability

Option C: Hybrid Storage Design

Use when: You need HDD capacity but also need fast performance for hot data and scratch.

  • NVMe or SSD tiers to accelerate larger HDD pools
  • Common approach when you need both capacity and performance

 

What to expect from Worldstream operationally: Worldstream runs its own data centers and network and relies on in-house engineers, with an explicit focus on predictable spending and clear agreements. That matters for big data, because unpredictability is usually the real enemy.

Performance Targets and Results Guidelines

Targets depend on the workload. These are the metrics that keep you honest.

Ingestion

Track:

  • Events per second or MB per second into the platform
  • End-to-end lag from source to queryable
  • Backpressure events and queue growth

Red flags: Ingestion is “fine” until compaction starts. Lag grows during peak hours and never recovers.
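
End-to-end lag is the most telling of these numbers. A minimal way to track it, assuming each record carries its source timestamp (an assumption about your pipeline), is sketched below.

    from datetime import datetime, timezone

    def lag_seconds(source_ts: datetime, queryable_ts: datetime | None = None) -> float:
        """Seconds from the moment an event happened to the moment it became queryable."""
        queryable_ts = queryable_ts or datetime.now(timezone.utc)
        return (queryable_ts - source_ts).total_seconds()

    def lag_recovers(samples: list[float]) -> bool:
        """The red flag is lag that keeps growing through peak hours, not one slow batch."""
        return len(samples) < 2 or samples[-1] <= samples[0]

    # Example: lag sampled every few minutes during a peak window.
    print(lag_recovers([120.0, 180.0, 240.0, 310.0]))   # False: investigate before it compounds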

Spark and Batch Processing

Track:

  • Shuffle spill ratio and spill volume
  • Stage time variance across identical jobs
  • Executor GC time and failures
  • Skew and straggler tasks

Red flags: Adding cores makes jobs slower (usually an I/O bottleneck or data skew). Jobs fail only during peak load (resource contention).
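
Spill ratio and stage-level spill volume can be pulled from the Spark application REST API without extra tooling. The sketch below assumes the driver UI is reachable on its default port and that the field names match your Spark version.

    import requests  # assumption: the Spark UI (or history server) REST API is reachable

    BASE = "http://spark-driver:4040/api/v1"   # hypothetical host; the history server listens on :18080

    def spill_report(app_id: str) -> list[dict]:
        """Per-stage spill volumes; a growing disk-spill-to-shuffle-write ratio means shuffles no longer fit in memory."""
        stages = requests.get(f"{BASE}/applications/{app_id}/stages", timeout=10).json()
        rows = []
        for s in stages:
            shuffle_write = s.get("shuffleWriteBytes", 0)
            rows.append({
                "stage": s.get("stageId"),
                "memory_spilled_gb": s.get("memoryBytesSpilled", 0) / 1e9,
                "disk_spilled_gb": s.get("diskBytesSpilled", 0) / 1e9,
                "spill_ratio": s.get("diskBytesSpilled", 0) / shuffle_write if shuffle_write else 0.0,
            })
        return rows

    apps = requests.get(f"{BASE}/applications", timeout=10).json()
    for row in spill_report(apps[0]["id"]):
        print(row)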

ClickHouse Analytics

Track:

  • p95 and p99 query latency for key dashboards
  • Merge backlog and part counts
  • Disk throughput during merges
  • Query concurrency and memory usage

Red flags: Merges falling behind (will turn into user-facing latency). Frequent external aggregation or sorting due to memory pressure.
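
Part counts and merge backlog can be read directly from ClickHouse's system tables. The sketch below uses the same clickhouse-connect client assumed earlier, and the queries only read metadata.

    import clickhouse_connect  # assumption: same client as in the earlier table example

    client = clickhouse_connect.get_client(host="ch-node-1")  # hypothetical host

    # Active parts per table: a steadily rising count means merges are not keeping up with inserts.
    parts = client.query("""
        SELECT database, table, count() AS active_parts
        FROM system.parts
        WHERE active
        GROUP BY database, table
        ORDER BY active_parts DESC
        LIMIT 10
    """).result_rows

    # Merges currently in flight and how far along they are.
    merges = client.query("SELECT table, elapsed, progress FROM system.merges").result_rows

    for database, table, active_parts in parts:
        print(f"{database}.{table}: {active_parts} active parts")
    print(f"{len(merges)} merges in flight")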

Operations, Performance & Risk Management

Capacity Management

  • Separate storage growth planning from compute scaling
  • Track hot vs cold data—not all TB are equal
  • Plan for failure: replication and rebuild traffic must fit in your network budget

Data Lifecycle

  • Define retention per dataset
  • Define compaction, partitioning, and tiering policies
  • Automate deletion—manual retention is not retention
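
For file-based datasets, automated retention can be as simple as dropping date partitions past their window. The sketch below assumes dt=YYYY-MM-DD directory partitioning, which is an assumption about your layout; for ClickHouse tables, the equivalent lever is a TTL clause on the table.

    import shutil
    from datetime import date, timedelta
    from pathlib import Path

    def drop_expired_partitions(dataset_root: str, retention_days: int, dry_run: bool = True) -> None:
        """Delete dt=YYYY-MM-DD partition directories older than the retention window."""
        cutoff = date.today() - timedelta(days=retention_days)
        for part in sorted(Path(dataset_root).glob("dt=*")):
            try:
                part_date = date.fromisoformat(part.name.split("=", 1)[1])
            except ValueError:
                continue  # skip directories that do not follow the naming convention
            if part_date < cutoff:
                print(("would delete: " if dry_run else "deleting: ") + str(part))
                if not dry_run:
                    shutil.rmtree(part)

    # Dry run first, every time; retention bugs are unrecoverable.
    drop_expired_partitions("/data/curated/page_events", retention_days=90)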

Backup and Restore

  • Decide what “restore” means: full cluster restore, table restore, or just raw data recovery
  • Test restores—not once, routinely

Monitoring

Minimum set:

  • Disk throughput and disk latency
  • Network throughput and packet drops
  • Memory pressure and swap behavior
  • Queue and lag metrics for ingestion
  • Job success rate and runtime variance
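
If nothing else is in place yet, even a node-local sampler covers the first four items on this list. The sketch below uses psutil (an assumption about available tooling) and leaves shipping the numbers to whatever metrics backend you run.

    import time

    import psutil  # assumption: psutil is installed on the node

    def sample_node(interval_s: float = 10.0) -> dict:
        """One sample of disk throughput, network throughput and drops, memory pressure, and swap usage."""
        disk_a, net_a = psutil.disk_io_counters(), psutil.net_io_counters()
        time.sleep(interval_s)
        disk_b, net_b = psutil.disk_io_counters(), psutil.net_io_counters()
        mem, swap = psutil.virtual_memory(), psutil.swap_memory()
        return {
            "disk_read_mb_s": (disk_b.read_bytes - disk_a.read_bytes) / interval_s / 1e6,
            "disk_write_mb_s": (disk_b.write_bytes - disk_a.write_bytes) / interval_s / 1e6,
            "net_recv_mb_s": (net_b.bytes_recv - net_a.bytes_recv) / interval_s / 1e6,
            "net_drops": (net_b.dropin - net_a.dropin) + (net_b.dropout - net_a.dropout),
            "mem_used_pct": mem.percent,
            "swap_used_pct": swap.percent,
        }

    print(sample_node())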

Security

  • Encrypt in transit—always
  • Encrypt at rest when the threat model requires it
  • Use least privilege for service accounts
  • Separate environments: dev and prod should not share the same cluster

 

Worldstream Advantage: Our analytics-ready servers are deployed in Dutch data centers with predictable performance, transparent pricing, and local engineering support—ideal for production big data platforms.

Frequently Asked Questions

Do I really need NVMe?

If you run Spark shuffles, yes. If you run ClickHouse merges, yes. If you only store cold data on HDFS and rarely compute, it is less critical. But most real platforms compute.

Glossary

Big Data Terms Explained

Batch Processing

Jobs that process data in chunks, often scheduled.

CDC (Change Data Capture)

Captures inserts, updates, deletes from databases for downstream processing.

ClickHouse

Columnar OLAP database designed for fast analytical queries.

Compaction / Merge

Background work that rewrites and merges data parts. Critical for ClickHouse performance.

Data Lake

Raw and curated datasets, often in object storage or HDFS, typically in Parquet or ORC.

Data Locality

Running compute close to data. Often discussed in HDFS-based architectures.

Executor (Spark)

Process that runs tasks and holds memory for caching and shuffles.

HDFS

Hadoop Distributed File System. Stores replicated blocks across nodes.

Ingestion

Getting data into the platform. Streaming or batch.

NVMe

Fast local storage. Often used for shuffle, scratch, and hot analytics.

Shuffle (Spark)

Data movement between stages. Often the biggest I/O and network consumer.

Spill

When in-memory operations overflow to disk.

Throughput

How much data can be read or written per second.

Next Steps with Worldstream

  • Define your dominant workload pattern: Spark batch, Hadoop storage, ClickHouse OLAP, or hybrid.
  • Pick one reference node profile as your baseline worker.
  • Run a proof workload. Measure spill, merge backlog, and network saturation.
  • Adjust. Then lock the profile. Consistency beats cleverness.

 

Worldstream’s core promise is “Solid IT. No Surprises.” We position around freedom of choice, clear agreements, and predictable spending—the mindset you want for big data.