Skip to main content

Cohestra

Open-source deployment management for Apache Flink on any Kubernetes cluster.

Cohestra is an Apache 2.0-licensed control plane that manages the full lifecycle of stateful Apache Flink deployments. It replaces proprietary managed services like AWS Managed Service for Apache Flink (MSF) with an open, portable solution that runs on any Kubernetes cluster — EKS, GKE, AKS, or on-prem.

Why Cohestra?

Pain PointManaged ServicesCohestra
Vendor lock-inLocked to one cloudAny Kubernetes cluster
Flink version lagMonths behind OSSDay-one support for Flink 2.x
Opaque operationsBlack-box deploysFull operation history via Temporal
Limited autoscalingBasic KPU scalingCustom autoscaler SDK
CostPer-KPU pricingJust your Kubernetes nodes
State ownershipProvider-managed S3You own checkpoints and savepoints

Key Features

  • Controlled Rollouts — Savepoint-gated deployments with automatic health checks and rollback on failure
  • Custom Autoscaling — Replace the Flink Operator autoscaler with your own logic using the SDK (Kafka lag, CPU, custom metrics)
  • Multi-language SDKs — Python, Go, and Java clients for programmatic control
  • Durable Operation History — Every deploy, scale, rollback, and savepoint tracked in Temporal workflows
  • Cluster Freeze — Namespace-level mutation freeze during incidents
  • GitOps Ready — API-driven, idempotent operations with Idempotency-Key headers

Architecture at a Glance

Cohestra uses the actor model implemented via long-running Temporal workflows — the same pattern used by Netflix to orchestrate 12,000+ Flink clusters.

SDK / CLI / CI → Cohestra API Server → Temporal Server

DeploymentActor (long-running)
↙ ↘
RolloutWorkflow SavepointWorkflow

Activities → Flink Kubernetes Operator → Flink Jobs

Each Flink deployment gets a dedicated DeploymentActor workflow that serializes all operations, maintains version history, and orchestrates child workflows for rollouts and savepoints.

Quick Install

helm repo add cohestra https://cohestra-project.github.io/charts
helm install cohestra cohestra/cohestra \
--namespace cohestra-system --create-namespace \
--set temporal.enabled=true

Then register and deploy your first Flink job — see Getting Started.