"If a worker wants to do his job well, he must first sharpen his tools." - Confucius, "The Analects of Confucius. Lu Linggong"
Front page > Programming > Tiered Storage in Kafka - Summary from Uber&#s Technology Blog

Tiered Storage in Kafka - Summary from Uber's Technology Blog

Published on 2024-08-17


Uber's technology blog published an article, "Introduction to Kafka Tiered Storage at Uber," describing how to maximize data retention with fewer Kafka brokers and less memory. This allows longer message retention times across various business applications.

A common solution is to integrate external storage manually and periodically synchronize data to the external system. However, this involves significant development and maintenance effort: deciding how to store the data, setting the synchronization frequency, triggering the sync process, fetching data back, and indexing it for retrieval.

Therefore, Uber proposed a solution that encapsulates the external-storage logic, making it plug-and-play with simple configuration. The feature was developed in collaboration with the Apache Kafka community as KIP-405 (Tiered Storage); it shipped as an early-access feature in Kafka 3.6 and continues to mature in subsequent releases.
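
For illustration, enabling the feature uses the configuration names defined by KIP-405. A minimal sketch might look like the following, where the storage-manager class is a hypothetical placeholder for whichever plug-in is used (the metadata manager shown is Kafka's built-in default):

    # broker configuration (server.properties), per KIP-405
    remote.log.storage.system.enable=true
    # hypothetical plug-in implementation; any RemoteStorageManager works here
    remote.log.storage.manager.class.name=com.example.S3RemoteStorageManager
    remote.log.metadata.manager.class.name=org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager

    # per-topic configuration: offload to remote storage, keep 1 hour locally
    remote.storage.enable=true
    local.retention.ms=3600000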

Scenario

It is important to understand that Kafka is an append-only message queue with very high throughput. Kafka stores logs on the broker's local disk, and users can configure the retention time or the maximum log size. At my previous company (Lenovo), we used Flink to consume data continuously. A large volume of data would push Kafka past its disk limit, leading to write failures and business errors. To reduce costs, instead of deploying more machines, we could only shorten the retention time.
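
Retention is a per-topic setting; shortening it with Kafka's stock kafka-configs.sh tool looks roughly like this (the topic name orders is hypothetical; 259200000 ms is three days, and 1073741824 bytes caps each partition at 1 GiB):

    kafka-configs.sh --bootstrap-server localhost:9092 \
      --entity-type topics --entity-name orders \
      --alter --add-config retention.ms=259200000,retention.bytes=1073741824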

Additionally, if each company were to develop its own system to save older data to external storage, it would involve a huge amount of development work. There would also be numerous issues related to synchronization and data consistency.

Solution

The essence is to extend the broker with remote log management and remote storage management. Three components are involved:

RemoteLogManager: Manages the lifecycle of remote log segments, including copying, cleaning, and fetching.

RemoteStorageManager: Manages actions on remote log segments, including copying, fetching, and deleting.

RemoteLogMetadataManager: Manages the metadata lifecycle for remote log segments with strong consistency. The metadata for a remote segment includes its start and end offsets, timestamps, producer state snapshots, and leader epoch checkpoints; tracking it ensures the system knows where each segment starts and ends, along with the other information needed for data retrieval and management.

Among them, RemoteLogManager acts as the control component: it reads data directly from the broker's local disk and is also responsible for fetching data back from remote storage. RemoteStorageManager is the component that actually operates on the data, and RemoteLogMetadataManager manages the metadata.
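
As a rough sketch of the plug-in surface this implies, the interface below uses the method names from KIP-405 but with simplified, hypothetical signatures; the real interface lives in org.apache.kafka.server.log.remote.storage and differs in detail:

    import java.io.Closeable;
    import java.io.InputStream;

    // Simplified stand-ins for the real metadata/payload classes, which also
    // carry timestamps, producer state snapshots, and leader epoch checkpoints.
    record RemoteLogSegmentMetadata(long startOffset, long endOffset) {}
    record LogSegmentData(byte[] segmentBytes) {}

    // Sketch of the pluggable remote-storage surface (not the exact Kafka API).
    interface RemoteStorageManager extends Closeable {
        // Copy a finished local segment (with its indexes) to remote storage.
        void copyLogSegmentData(RemoteLogSegmentMetadata metadata, LogSegmentData data);
        // Stream segment bytes back, starting at a byte position in the segment.
        InputStream fetchLogSegment(RemoteLogSegmentMetadata metadata, int startPosition);
        // Remove a segment that expired or whose topic was deleted.
        void deleteLogSegmentData(RemoteLogSegmentMetadata metadata);
    }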

Summary of the Three Actions in Kafka Tiered Storage

  1. Copying Segments to Remote Storage
    A log segment is eligible for copying to remote storage if its end offset (the offset of the last message in the segment) is less than the partition's last stable offset. (Last stable offset, LSO: the highest offset for which all prior messages have been fully acknowledged by all in-sync replicas, ensuring no data loss.) RemoteStorageManager handles the copying of log segments along with their associated indexes, timestamps, producer snapshots, and leader epoch cache. (See the sketch after this list.)

  2. Cleaning up of Remote Segments
    A dedicated thread pool cleans up remote data at regular intervals by computing the set of eligible segments; this differs from the asynchronous cleanup of local log segments. When a topic is deleted, its remote log segments are cleaned up asynchronously, so the cleanup neither blocks the delete operation nor the recreation of a topic with the same name.

  3. Fetching Segments from Remote Storage
    RemoteLogManager determines the target remote segment for the desired offset and leader epoch by looking it up in the metadata store through RemoteLogMetadataManager. It then uses RemoteStorageManager to locate the position within the segment and start fetching the desired data.
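
To make the three actions concrete, here is a minimal, self-contained Java sketch of the control flow. Every type in it is a simplified stand-in invented for illustration; the real KIP-405 classes carry far more state (indexes, epochs, snapshots) and perform these steps asynchronously:

    import java.util.*;

    public class TieredStorageSketch {

        // Simplified stand-ins for local segments and remote-segment metadata.
        record Segment(long startOffset, long endOffset, byte[] data) {}
        record RemoteSegmentMetadata(long startOffset, long endOffset,
                                     long copiedAtMs, String remoteId) {}

        // Stand-in for RemoteStorageManager: owns the bytes in remote storage.
        static class RemoteStore {
            final Map<String, byte[]> blobs = new HashMap<>();
            void copy(String id, byte[] data) { blobs.put(id, data); }
            byte[] fetch(String id)           { return blobs.get(id); }
            void delete(String id)            { blobs.remove(id); }
        }

        // Stand-in for RemoteLogMetadataManager: tracks which offsets live where.
        static class MetadataStore {
            final List<RemoteSegmentMetadata> entries = new ArrayList<>();
            Optional<RemoteSegmentMetadata> findByOffset(long offset) {
                return entries.stream()
                        .filter(m -> m.startOffset() <= offset && offset <= m.endOffset())
                        .findFirst();
            }
        }

        // 1. Copying: a segment is eligible once its end offset is below the LSO.
        static void copyEligibleSegments(List<Segment> localSegments, long lastStableOffset,
                                         RemoteStore store, MetadataStore meta) {
            for (Segment s : localSegments) {
                if (s.endOffset() < lastStableOffset) {
                    String id = "segment-" + s.startOffset();
                    store.copy(id, s.data());
                    meta.entries.add(new RemoteSegmentMetadata(
                            s.startOffset(), s.endOffset(), System.currentTimeMillis(), id));
                }
            }
        }

        // 2. Cleanup: a periodic task deletes remote segments older than retention.
        static void cleanupExpired(long retentionMs, RemoteStore store, MetadataStore meta) {
            long now = System.currentTimeMillis();
            meta.entries.removeIf(m -> {
                boolean expired = now - m.copiedAtMs() > retentionMs;
                if (expired) store.delete(m.remoteId());
                return expired;
            });
        }

        // 3. Fetching: resolve the target segment via metadata, then read remotely.
        static byte[] fetch(long offset, RemoteStore store, MetadataStore meta) {
            return meta.findByOffset(offset)
                       .map(m -> store.fetch(m.remoteId()))
                       .orElseThrow(() -> new NoSuchElementException(
                               "offset " + offset + " is not in remote storage"));
        }
    }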

This article is reproduced from: https://dev.to/bochaoli95/tiered-storage-in-kafka-summary-from-ubers-technology-blog-40cg