Description
Kafka
- Overview: Apache Kafka is a distributed event streaming platform for building real‑time data pipelines and streaming applications.
- Core model: Uses topics, partitions, producers, consumers, and brokers to persist and stream records at high throughput, with ordering guaranteed within each partition.
- Durability and replication: Data is durably stored in partitioned logs and replicated across brokers for fault tolerance and high availability.
- Scalability: Scales horizontally by adding brokers and partitions to increase throughput and parallelism.
- Delivery semantics: Supports at‑least‑once delivery by default; idempotent producers and Exactly‑Once Semantics (EOS) enable deduplicated, transactional writes for end‑to‑end correctness.
- Retention and compaction: Configurable retention policies and log compaction let you keep time‑windowed data or compacted latest‑key state for changelog patterns.
- Low latency and high throughput: Optimized for sequential disk I/O and batching to deliver millisecond latencies at millions of messages per second.
- Consumer groups and parallelism: Consumer groups provide scalable, fault‑tolerant consumption with partition ownership and rebalancing.
- Stream processing: Kafka Streams, the native stream‑processing library, and ecosystem tools such as ksqlDB enable stateful transformations, windowing, joins, and real‑time analytics.
- Ecosystem and connectors: Kafka Connect offers a pluggable framework of source and sink connectors for databases, object stores, and messaging systems.
- Security: TLS encryption, SASL authentication, and ACLs support secure multi‑tenant deployments.
- Operational concerns: Monitoring, partition rebalancing, broker configuration tuning, and careful retention/segment sizing are essential for stable clusters.
- Performance tuning: JVM tuning, network and disk throughput optimization, and producer/consumer batching settings drive production performance.
- Advanced patterns: Exactly‑once processing across producers and stream processors, event sourcing, CQRS, and change data capture (CDC) at scale.
- Resilience strategies: Multi‑datacenter replication (MirrorMaker 2 or Confluent Cluster Linking), idempotent consumers, and backpressure patterns for graceful degradation.
- Testing and observability: Emphasize contract testing, chaos testing, distributed tracing, and metrics for end‑to‑end reliability.
- Experience progression: at 3–5 years, focus on core concepts, producers/consumers, and Connect; at 6–12 years, on tuning, stream processing, and cluster operations; at 13–20 years, on global replication, platform design, governance, and SRE leadership.
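The delivery-semantics bullet above can be sketched as a producer configuration. This is a minimal sketch using librdkafka/confluent-kafka key names; the bootstrap address and `transactional.id` value are placeholder assumptions, not values from this document.

```python
# Sketch of settings enabling idempotent, transactional (exactly-once)
# writes. Keys follow the librdkafka/confluent-kafka naming convention;
# the bootstrap address and transactional.id are placeholders.

def eos_producer_config(bootstrap: str, txn_id: str) -> dict:
    """Build a producer config for deduplicated, transactional writes."""
    return {
        "bootstrap.servers": bootstrap,
        "enable.idempotence": True,   # broker dedupes producer retries
        "acks": "all",                # wait for all in-sync replicas
        "transactional.id": txn_id,   # stable id enables transactions
    }

config = eos_producer_config("localhost:9092", "orders-writer-1")
```

With the confluent-kafka client, a dict like this would be passed to `Producer(config)`, followed by `init_transactions()`, `begin_transaction()`, and `commit_transaction()` around the writes.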
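The log-compaction behavior described above can be illustrated with a small simulation: compaction keeps only the latest record per key, and a null value acts as a tombstone that deletes the key. This is a conceptual sketch, not broker internals.

```python
# Minimal sketch of log compaction: collapse an ordered record log to
# latest-value-per-key; a None value is a tombstone that deletes the key.

def compact(log):
    """log: ordered list of (key, value) records -> latest-per-key dict."""
    state = {}
    for key, value in log:
        if value is None:
            state.pop(key, None)   # tombstone removes the key
        else:
            state[key] = value     # later records shadow earlier ones
    return state

snapshot = compact([("a", 1), ("b", 2), ("a", 3), ("b", None)])
# snapshot == {"a": 3}: "a" keeps its latest value, "b" was tombstoned
```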
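The consumer-group bullet can be made concrete with a sketch of range-style partition assignment, where partitions are split into contiguous chunks and any remainder goes to the first consumers. This mirrors the idea behind Kafka's range assignor, simplified to a single topic.

```python
# Sketch of range-style partition assignment within a consumer group.

def range_assign(partitions, consumers):
    """Split partitions contiguously; earlier consumers get the extras."""
    consumers = sorted(consumers)          # assignment is order-sensitive
    base, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, member in enumerate(consumers):
        count = base + (1 if i < extra else 0)
        assignment[member] = partitions[start:start + count]
        start += count
    return assignment

# 6 partitions over 4 consumers -> chunk sizes 2, 2, 1, 1
print(range_assign(list(range(6)), ["c1", "c2", "c3", "c4"]))
```

On a rebalance (a consumer joining or leaving), the group recomputes an assignment like this so every partition has exactly one owner within the group.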
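The windowing mentioned under stream processing can be sketched as a tumbling-window count, the kind of stateful aggregation Kafka Streams performs. Timestamps here are assumed to be epoch seconds and the window size fixed; real Streams topologies also handle out-of-order data and state stores.

```python
# Sketch of a tumbling-window count over (timestamp, key) events.
from collections import defaultdict

def tumbling_counts(events, window_secs):
    """Return {(window_start, key): count} for fixed-size tumbling windows."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_secs)   # align to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(1, "click"), (4, "click"), (11, "click"), (12, "view")]
print(tumbling_counts(events, 10))
# window [0,10) sees two clicks; window [10,20) sees one click, one view
```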
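The security bullet can likewise be sketched as client settings, again in librdkafka key style; the hostname, credentials, and CA path below are placeholder assumptions.

```python
# Sketch of TLS + SASL client settings; all concrete values are placeholders.

def secure_client_config():
    return {
        "bootstrap.servers": "broker.example.com:9093",
        "security.protocol": "SASL_SSL",          # TLS transport + SASL auth
        "sasl.mechanism": "SCRAM-SHA-512",
        "sasl.username": "app-user",              # placeholder credential
        "sasl.password": "change-me",             # placeholder credential
        "ssl.ca.location": "/etc/kafka/ca.pem",   # assumed CA cert path
    }
```

Authorization on top of this is handled broker-side with ACLs bound to the authenticated principal.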
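The idempotent-consumer pattern from the resilience bullet can be sketched by keying processing on (partition, offset), so a redelivered record after a retry produces no duplicate side effects. This is an illustrative in-memory version; production implementations persist the seen-set alongside the side effects.

```python
# Sketch of an idempotent consumer: dedupe by (partition, offset).

class IdempotentConsumer:
    def __init__(self):
        self.seen = set()      # processed (partition, offset) pairs
        self.processed = []    # stand-in for real side effects

    def handle(self, partition, offset, value):
        if (partition, offset) in self.seen:
            return False       # duplicate delivery: skip side effects
        self.seen.add((partition, offset))
        self.processed.append(value)
        return True

c = IdempotentConsumer()
c.handle(0, 42, "order-1")
c.handle(0, 42, "order-1")     # redelivery after an at-least-once retry
# c.processed == ["order-1"]: the duplicate was ignored
```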