OceanBase Database Overall Architecture

OceanBase Database adopts a shared-nothing architecture in which all nodes are peers: each node has its own SQL engine, storage engine, and transaction engine. The database runs on clusters of commodity servers and provides core features such as high scalability, high availability, high performance, low cost, and strong compatibility with mainstream databases.

Architecture

An OceanBase database cluster consists of multiple nodes. The nodes are grouped into several availability zones (Zones), and each node belongs to exactly one availability zone. An availability zone is a logical concept: it represents a group of nodes with similar hardware availability within the cluster, and its meaning depends on the deployment mode. For example, when the entire cluster is deployed within a single data center (IDC), the nodes of one availability zone may sit in the same rack or behind the same switch; when the cluster spans multiple data centers, each availability zone can correspond to one data center. Each availability zone has two attributes, IDC and region, which describe the data center the zone belongs to and the region where that data center is located; the region generally refers to the city where the IDC sits. These attributes should reflect the actual deployment so that the cluster's automatic disaster recovery and optimization strategies work well. Depending on the high availability requirements of the business, an OceanBase cluster can be deployed in various modes; see High Availability Architecture Overview.
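
The relationship between regions, IDCs, availability zones, and nodes can be made concrete with a small data model. The following Python sketch is illustrative only; the zone names, addresses, and IDC/region labels are hypothetical and this is not an OceanBase API.

```python
from dataclasses import dataclass, field

@dataclass
class Zone:
    name: str     # availability zone name, e.g. "z1"
    idc: str      # data center the zone's nodes live in
    region: str   # region (typically the city) the IDC belongs to
    nodes: list = field(default_factory=list)  # node addresses in this zone

# A three-zone cluster spread over three IDCs in two regions (hypothetical values).
cluster = [
    Zone("z1", idc="idc1", region="hangzhou", nodes=["10.0.1.1", "10.0.1.2"]),
    Zone("z2", idc="idc2", region="hangzhou", nodes=["10.0.2.1", "10.0.2.2"]),
    Zone("z3", idc="idc3", region="shanghai", nodes=["10.0.3.1", "10.0.3.2"]),
]

# Each node belongs to exactly one zone; a zone's IDC/region attributes should
# mirror the real deployment so disaster-recovery strategies can rely on them.
for zone in cluster:
    for node in zone.nodes:
        print(f"{node} -> zone={zone.name}, idc={zone.idc}, region={zone.region}")
```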

In OceanBase database, the data of a table can be horizontally split into multiple shards according to partitioning rules. Each shard is called a table partition, or simply a partition, and a row of data belongs to exactly one partition. The partitioning rules are specified by the user when creating a table; hash, range, and list partitioning are supported, as well as two-level (composite) partitioning. For example, an order table in a trading system can first be partitioned by user ID into several primary (level-one) partitions, and each primary partition can then be split by month into several secondary (level-two) partitions. For such a table, each level-two subpartition is a physical partition, while the level-one partition is only a logical concept. The partitions of a table can be distributed across multiple nodes within an availability zone. Each physical partition has a storage-layer object called a Tablet, which stores ordered data records.
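
To make the two-level partitioning example concrete, the sketch below shows how a row could be mapped to exactly one physical partition: a hash of the user ID selects the primary partition, and the order month selects the secondary partition. This is an illustration of the concept only; the hash function, partition counts, and partition names are hypothetical and are not OceanBase's internal routing logic.

```python
import hashlib

PRIMARY_PARTITIONS = 4  # number of hash partitions on user_id (hypothetical)
MONTH_RANGES = ["m01", "m02", "m03"]  # range subpartitions by month (hypothetical)

def route(user_id: int, order_month: int) -> str:
    """Return the single physical partition a row belongs to."""
    # Level 1: hash partition chosen by user_id.
    h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    primary = h % PRIMARY_PARTITIONS
    # Level 2: range subpartition chosen by the order month (1-based).
    secondary = MONTH_RANGES[min(order_month - 1, len(MONTH_RANGES) - 1)]
    # Only the combination of both levels names a physical partition (a Tablet).
    return f"p{primary}_{secondary}"

print(route(user_id=42, order_month=2))  # every row maps to exactly one partition
```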

When a user modifies the records in a Tablet, to ensure data persistence, redo logs (REDO) need to be recorded in the log stream (Log Stream) corresponding to the Tablet. Each log stream serves multiple Tablets on its node. To protect data and ensure uninterrupted service in case of node failure, each log stream and its associated Tablets have multiple replicas. Generally, these replicas are distributed across multiple availability zones. Among the replicas, there is only one replica that accepts modification operations, called the leader replica, while the others are called follower replicas. The consistency of data between the leader and follower replicas is achieved through a distributed consensus protocol based on Multi-Paxos. When the node where the leader replica is located fails, one of the follower replicas is elected as the new leader replica to continue providing services.
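
The commit rule behind Multi-Paxos-style replication — a redo log record is durable once a majority of the log stream's replicas have persisted it — can be sketched as follows. This is a simplified illustration of majority acknowledgement only, not OceanBase's log stream implementation; the replica layout and acknowledgement model are assumptions made for the example.

```python
class Replica:
    """A replica of a log stream; persists redo records when healthy."""
    def __init__(self, name, healthy=True):
        self.name, self.healthy, self.log = name, healthy, []

    def persist(self, record):
        if self.healthy:
            self.log.append(record)
        return self.healthy

def replicate_redo(record: str, replicas: list) -> bool:
    """Append a redo record on the leader and treat it as committed once a
    majority of the replicas (leader included) have persisted it."""
    acks = sum(1 for replica in replicas if replica.persist(record))
    majority = len(replicas) // 2 + 1
    return acks >= majority  # committed only with a majority of acks

# Three replicas across three zones; the commit still succeeds if one zone is down.
replicas = [Replica("z1-leader"), Replica("z2"), Replica("z3", healthy=False)]
print(replicate_redo("redo: update row in tablet 1001", replicas))  # True: 2 of 3 acks
```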

Each node in the cluster runs an observer service process, which contains multiple operating system threads. All nodes are functionally equivalent. Each observer process is responsible for accessing the partition data on its own node and for parsing and executing the SQL statements routed to it. The observer processes communicate with each other over TCP/IP. Each process also listens for connection requests from external applications, establishes connections and database sessions, and provides database services. For more information about the observer service process, see Thread Introduction.

To simplify the management of large-scale deployments of multiple business databases and to reduce resource costs, OceanBase database provides a distinctive multi-tenancy feature. Within an OceanBase cluster, multiple isolated database “instances”, called tenants, can be created. From the application's perspective, each tenant is an independent database. Each tenant can run in either MySQL or Oracle compatibility mode. After connecting to a MySQL tenant, users can create users and databases within the tenant, much as they would in a standalone MySQL database. Similarly, after connecting to an Oracle tenant, users can create schemas and manage roles within the tenant, much as they would in a standalone Oracle database. When a new cluster is initialized, a special tenant named “sys”, the system tenant, is created. The system tenant stores the cluster's metadata and runs in MySQL compatibility mode.
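
Because a MySQL tenant speaks the MySQL protocol, an application can connect to it with an ordinary MySQL client or driver. The sketch below uses PyMySQL; the host, port, tenant name, and the "user@tenant" login form are assumptions for illustration — the exact connection details depend on your deployment (for example, whether you connect directly to an observer or through a proxy).

```python
import pymysql  # a standard MySQL driver; an OceanBase MySQL tenant accepts the MySQL protocol

# Hypothetical connection details: adjust the host, port, and the "user@tenant"
# form (e.g. "root@test_tenant") to match your own cluster and access path.
conn = pymysql.connect(
    host="10.0.1.1",
    port=2881,                # MySQL protocol port of the observer (value shown is an assumption)
    user="root@test_tenant",  # a user within the tenant, qualified by the tenant name
    password="******",
    database="test",
)
with conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone())
conn.close()
```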

Feature Applicability

OceanBase Database Community Edition provides only the MySQL mode.

To isolate resources among tenants, each observer process can host multiple virtual containers called resource units (UNITs), each belonging to a different tenant. The resource units of a tenant on multiple nodes together form that tenant's resource pool. A resource unit specifies resources such as CPU and memory.
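
The hierarchy described above — a tenant owns a resource pool, which is made up of resource units placed on individual nodes — can be summarized with a small data model. This is only a conceptual sketch; the unit sizes and node addresses are hypothetical, and this is not an OceanBase management API.

```python
from dataclasses import dataclass

@dataclass
class ResourceUnit:
    tenant: str       # tenant the unit belongs to
    node: str         # node hosting this unit
    cpu_cores: float  # CPU allocated to the tenant on this node
    memory_gb: float  # memory allocated to the tenant on this node

# One tenant's resource pool: one unit per node, spread across the cluster (hypothetical sizes).
pool = [
    ResourceUnit("tenant_a", "10.0.1.1", cpu_cores=4, memory_gb=16),
    ResourceUnit("tenant_a", "10.0.2.1", cpu_cores=4, memory_gb=16),
    ResourceUnit("tenant_a", "10.0.3.1", cpu_cores=4, memory_gb=16),
]

total_cpu = sum(u.cpu_cores for u in pool)
total_mem = sum(u.memory_gb for u in pool)
print(f"tenant_a pool: {total_cpu} cores, {total_mem} GB across {len(pool)} nodes")
```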

To shield application programs from the details of partitioning and replica distribution inside the OceanBase database, and to make accessing a distributed database as simple as accessing a single-node database, OceanBase Database provides a proxy service, OceanBase Database Proxy (ODP, also known as obproxy), which routes each request from the application to an appropriate node for execution.
