The previous article, "Data Distribution Design for Multiple IDCs (Part 1)" (《多IDC的數(shù)據(jù)分布設(shè)計(jì)(一)》), introduced several ways to implement data consistency across multiple IDCs. Unfortunately, although there are plenty of distributed products today, almost no open-source product is specifically optimized for multi-IDC deployments. This article analyzes the advantages and drawbacks of each approach from a practical angle.
Background: latency differences
Jeff Dean has listed the latency differences between different kinds of data access:
Numbers Everyone Should Know
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lock/unlock 25 ns
Main memory reference 100 ns
Compress 1K bytes with Zippy 3,000 ns
Send 2K bytes over 1 Gbps network 20,000 ns
Read 1 MB sequentially from memory 250,000 ns
Round trip within same datacenter 500,000 ns
Disk seek 10,000,000 ns
Read 1 MB sequentially from disk 20,000,000 ns
Send packet CA->Netherlands->CA 150,000,000 ns
These numbers provide essential guidance for designing a multi-IDC data access strategy: they let us judge whether a given architecture can actually deliver high concurrency and low latency.
In fact, the list is a useful reference for anyone building networked or distributed applications: data should be placed, according to how often it is accessed, where the latency is lowest.
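To make the gap concrete, here is a minimal sketch (plain Python; the variable names and the comparison are mine, the constants are from Jeff Dean's list) contrasting an in-datacenter round trip with a cross-region one, the latter standing in for inter-IDC latency:

```python
# Rough latency budgeting with the numbers above (all values in nanoseconds).
NS = {
    "dc_round_trip": 500_000,                # round trip within the same datacenter
    "cross_region_round_trip": 150_000_000,  # send packet CA -> Netherlands -> CA
}

def to_ms(ns):
    return ns / 1_000_000

local = NS["dc_round_trip"]
remote = NS["cross_region_round_trip"]
print(f"in-DC round trip        ~{to_ms(local):.1f} ms")
print(f"cross-region round trip ~{to_ms(remote):.0f} ms ({remote // local}x slower)")
```

A single synchronous hop across regions costs on the order of hundreds of local round trips, which is why every design discussed below tries to keep the synchronous path inside one IDC.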
1. 2PC/3PC/Paxos
As mentioned in the previous article, 2PC/3PC have clear drawbacks compared to Paxos and are best kept out of production, so they are not discussed further here.
In CAP terms, Paxos chooses consistency and partition tolerance, sacrificing availability. It can provide strongly consistent replication across multiple IDCs.
Drawbacks of Paxos:
- It requires a fast, stable network between IDCs.
- In a group of 2f+1 nodes, at least f+1 nodes must complete the transaction for it to succeed (see the sketch below).
- Throughput is low, so it is unsuitable for high request volumes; this is why most distributed storage products do not use Paxos directly to replicate data.
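To illustrate the quorum-size point, here is a minimal sketch (plain Python, names are mine) of majority sizing for a 2f+1-node Paxos group; across IDCs, each of those f+1 acknowledgements may involve a wide-area round trip:

```python
# Majority-quorum sizing for a Paxos group of 2f+1 nodes.
def quorum_size(f):
    """Acceptors that must accept a value before it is chosen: a majority of 2f+1."""
    n = 2 * f + 1
    return n // 2 + 1   # equals f + 1

for f in (1, 2):
    n = 2 * f + 1
    print(f"{n} nodes: tolerates {f} failures, needs {quorum_size(f)} acks per decision")
```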
2. Dynamo
The Dynamo paper does not specifically discuss whether the algorithm suits a multi-IDC deployment; there is only a brief mention:
In essence, the preference list of a key is constructed such that the storage nodes are spread across multiple data centers. These datacenters are connected through high speed network links. This scheme of replicating across multiple datacenters allows us to handle entire data center failures without a data outage.
As this passage shows, the precondition is "high speed network links", which may not hold for typical inter-IDC links in China. So what happens when the network between IDCs is unstable?
In a quorum scheme, high availability requires spreading replicas across multiple datacenters. With two datacenters and NRW = 3/2/2, a single-datacenter failure can leave two of the three replicas in the failed datacenter, making the data unavailable; a safer deployment is NRW = 5/3/3 across three datacenters. Most requests then need responses from nodes in two datacenters to succeed, and given inter-IDC bandwidth and latency, performance inevitably suffers.
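The following minimal sketch (illustrative placements and names only, not any product's API) checks whether a given replica placement can still assemble a quorum after losing an entire IDC, using the NRW = 3/2/2 and NRW = 5/3/3 examples above:

```python
# Can a quorum still be formed after losing one whole IDC?
def survives_idc_loss(placement, quorum):
    """placement: one IDC name per replica; quorum: replicas needed, i.e. max(R, W)."""
    for down in set(placement):
        alive = sum(1 for idc in placement if idc != down)
        if alive < quorum:
            return False
    return True

# NRW = 3/2/2 across two IDCs: two of the three replicas may share an IDC.
print(survives_idc_loss(["A", "A", "B"], quorum=2))             # False
# NRW = 5/3/3 across three IDCs: any single IDC holds at most two replicas.
print(survives_idc_loss(["A", "A", "B", "B", "C"], quorum=3))   # True
```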
In Dynamo's quorum scheme, every read and write is handled by a coordinator, chosen as follows:
A node handling a read or write operation is known as the coordinator. Typically, this is the first among the top N nodes in the preference list. If the requests are received through a load balancer, requests to access a key may be routed to any random node in the ring. In this scenario, the node that receives the request will not coordinate it if the node is not in the top N of the requested key's preference list. Instead, that node will forward the request to the first among the top N nodes in the preference list.
If the Dynamo protocol is followed strictly, the coordinator must be the first of the top N nodes, so with three datacenters roughly 2/3 of requests would be forwarded to a coordinator in a remote datacenter, increasing latency. If coordinator selection is optimized so that the client picks, from the top N nodes of the preference list, one located in its own datacenter, latency drops somewhat, but the chance that different clients choose different coordinators for the same key grows, and with it the probability of data conflicts.
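Here is a minimal sketch of the two coordinator-selection policies just described; the node names, IDC names, and the idc_of map are hypothetical:

```python
# Coordinator selection: strict Dynamo vs. a local-first optimization.
def strict_coordinator(preference_list):
    """Dynamo as written: the first of the top-N nodes coordinates the request."""
    return preference_list[0]

def local_first_coordinator(preference_list, n, local_idc, idc_of):
    """Optimization: prefer a top-N node in the caller's own IDC to avoid a
    cross-IDC forward, at the cost that different clients may pick different
    coordinators for the same key, making write conflicts more likely."""
    top_n = preference_list[:n]
    for node in top_n:
        if idc_of[node] == local_idc:
            return node
    return top_n[0]

idc_of = {"n1": "A", "n2": "B", "n3": "C"}
prefs = ["n1", "n2", "n3"]
print(strict_coordinator(prefs))                       # n1, possibly in a remote IDC
print(local_first_coordinator(prefs, 3, "B", idc_of))  # n2, local to IDC B
```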
At the same time, failure detection becomes messier with multiple datacenters. Dynamo does not use a consistent, cluster-wide failure view to decide that a node is down; each node judges for itself:
Failure detection in Dynamo is used to avoid attempts to communicate with unreachable peers during get() and put() operations and when transferring partitions and hinted replicas. For the purpose of avoiding failed attempts at communication, a purely local notion of failure detection is entirely sufficient: node A may consider node B failed if node B does not respond to node A's messages (even if B is responsive to node C's messages).
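A purely local failure view like the one quoted above can be as simple as a per-peer timeout table; the sketch below (my own illustration, not Dynamo's code) marks a peer as failed based only on the observing node's own recent replies, which is exactly why two nodes can disagree about whether a third is alive:

```python
import time

# A purely local failure view: node A considers B failed if B stops answering A,
# regardless of what other nodes observe.
class LocalFailureDetector:
    def __init__(self, timeout_s=1.0):
        self.timeout_s = timeout_s
        self.last_reply = {}          # peer -> timestamp of its last successful reply

    def record_reply(self, peer):
        self.last_reply[peer] = time.monotonic()

    def is_alive(self, peer):
        last = self.last_reply.get(peer)
        return last is not None and (time.monotonic() - last) < self.timeout_s

detector = LocalFailureDetector(timeout_s=0.5)
detector.record_reply("node-B")
print(detector.is_alive("node-B"))   # True right after a reply
time.sleep(0.6)
print(detector.is_alive("node-B"))   # False once replies stop arriving
```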
The recently very popular Cassandra can essentially be regarded as an open-source Dynamo clone; in the Facebook Inbox Search deployment it runs on 150 nodes spread across data centers on the US east and west coasts:
The system(Facebook Inbox Search) currently stores about 50+TB of data on a 150 node cluster, which is spread out between east and west coast data centers.
Although its JIRA contains a proposal, CASSANDRA-492, about a "Data Center Quorum", Cassandra as a whole has no particular optimization for IDCs; its paper [5] says:
Data center failures happen due to power outages, cooling failures, network failures, and natural disasters. Cassandra is configured such that each row is replicated across multiple data centers. In essence, the preference list of a key is constructed such that the storage nodes are spread across multiple datacenters. These datacenters are connected through high speed network links. This scheme of replicating across multiple datacenters allows us to handle entire data center failures without any outage.
This is almost identical to the wording in the Dynamo paper.
3. PNUTS
PNUTS is currently the most promising approach to multi-IDC data synchronization; most of its algorithms were designed with multiple IDCs in mind.
PNUTS is designed primarily for web applications rather than offline data analysis (in contrast to Hadoop/HBase).
- Yahoo!'s data is mostly user-centric: typically key-value records keyed by username.
- Analysis of access patterns shows that 85% of user writes tend to originate from the same IDC.
Based on these characteristics, Yahoo!'s PNUTS works as follows:
- Record-level mastering: each record is assigned one IDC as its master, and all writes must go through that master; records in the same table (tablet) may have different masters.
- The master replicates its data to other IDCs asynchronously via Yahoo! Message Broker (YMB).
- Master placement is flexible and can change dynamically based on where recent writes originate: if an IDC receives a write for a record whose master is remote, it forwards the request, and once such remote-origin writes exceed three, the local IDC is made the record's master.
- Every modification of a record carries a version number (per-record timeline consistency); the master and YMB guarantee ordering during replication. (A sketch of this scheme follows the list.)
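The following minimal sketch puts record-level mastering and per-record versioning together; the class and the in-memory queue standing in for YMB are my own illustration, not PNUTS code:

```python
from collections import deque

# PNUTS-style record-level mastering, heavily simplified.
class RecordStore:
    def __init__(self, idc):
        self.idc = idc
        self.records = {}     # key -> (version, value)
        self.master = {}      # key -> master IDC for that record
        self.ymb = deque()    # outbound replication log, consumed asynchronously

    def write(self, key, value):
        # All writes for a record must go through that record's master IDC.
        if self.master.setdefault(key, self.idc) != self.idc:
            raise RuntimeError("not the master for this record: forward the request")
        version = self.records.get(key, (0, None))[0] + 1
        self.records[key] = (version, value)
        # Publishing to the broker is the commit point; other IDCs receive the
        # update later, in version order (per-record timeline consistency).
        self.ymb.append((key, version, value))
        return version

store = RecordStore("IDC-A")
print(store.write("alice", {"status": "hello"}))   # version 1
print(store.write("alice", {"status": "hi"}))      # version 2
```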
Yahoo!'s PNUTS can in practice be understood as a master-master scheme.
Consistency: since every record is modified through its master and the master then replicates to the other IDCs, all IDCs eventually converge to the same data (eventual consistency).
Availability:
- Every IDC holds a local copy of each record, so an application can, by policy, serve the local copy or fetch the latest version.
- A local write is considered successful as soon as it is committed to the local YMB.
- The failure of any single IDC does not block access.
Other advantages mentioned in the paper:
hosted, notifications, flexible schemas, ordered records, secondary indexes, lowish latency, strong consistency on a single record, scalability, high write rates, reliability, and range queries over a small set of records.
In short, PNUTS fits the geographic replication model well.
- A write succeeds once the record is published to the local YMB, avoiding the latency Dynamo pays waiting for multiple data centers to respond.
- If a record's master is in a remote IDC, the request must be forwarded there, but because mastership can migrate, forwarding is relatively rare (see the sketch below).
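Here is a minimal sketch of that master-relocation heuristic; the threshold of three consecutive remote-origin writes follows the description above, and everything else is an illustrative assumption:

```python
# Move a record's master once writes keep arriving from the same non-master IDC.
class MasterPlacement:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.master = {}   # key -> master IDC
        self.streak = {}   # key -> (remote IDC, consecutive writes from it)

    def on_write(self, key, origin_idc):
        current = self.master.setdefault(key, origin_idc)
        if origin_idc == current:
            self.streak[key] = (None, 0)          # local write, reset the streak
            return current
        last_idc, count = self.streak.get(key, (None, 0))
        count = count + 1 if last_idc == origin_idc else 1
        if count >= self.threshold:
            self.master[key] = origin_idc         # migrate mastership to the writer's IDC
            self.streak[key] = (None, 0)
        else:
            self.streak[key] = (origin_idc, count)
        return self.master[key]

placement = MasterPlacement()
placement.on_write("alice", "IDC-A")              # master starts at IDC-A
for _ in range(3):
    placement.on_write("alice", "IDC-B")          # three writes forwarded from IDC-B
print(placement.master["alice"])                  # IDC-B
```

With write locality like Yahoo!'s reported 85%, most records quickly settle on a master in the IDC that actually writes them, so forwarding stays the exception.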
In the extreme case where a record's master becomes unavailable, the implementation seems somewhat questionable; readers may judge for themselves:
Under normal operation, if the master copy of a record fails, our system has protocols to fail over to another replica. However, if there are major outages, e.g. the entire region that had the master copy for a record becomes unreachable, updates cannot continue at another replica without potentially violating record-timeline consistency. We will allow applications to indicate, per-table, whether they want updates to continue in the presence of major outages, potentially branching the record timeline. If so, we will provide automatic conflict resolution and notifications thereof. The application will also be able to choose from several conflict resolution policies: e.g., discarding one branch, or merging updates from branches, etc.
Preliminary conclusions
Low-bandwidth networks between IDCs:
PNUTS-style record-level mastering works best.
High-bandwidth, low-latency networks (1 Gbps, latency < 50 ms):
1. Use Dynamo-style quorums with vector clocks for eventual consistency.
2. Use Paxos for strong consistency.
Afterword
This article took a long time from first draft to publication. There is still no mature, widely accepted solution for multi-IDC data access, the literature is sparse, and few people have both the interest and an environment to experiment in, so it is hard to form mature, independent conclusions quickly. What is collected here is a set of tentative ideas, and since I was never fully satisfied with their depth, the article sat unfinished. As work has been busy lately and I cannot dig deeper for now, I am publishing it as-is in the hope of prompting discussion, and I welcome anyone interested in this topic to get in touch and exchange ideas.
Resources
[1] Ryan Barrett, Transactions Across Datacenters
[2] Jeff Dean, Designs, Lessons and Advice from Building Large Distributed Systems (PDF)
[3] PNUTS: Yahoo!'s Hosted Data Serving Platform (PDF)
[4] Thoughts on Yahoo's PNUTS distributed database
[5] Cassandra – A Decentralized Structured Storage System (PDF)
[6] An overview of and reflections on Yahoo!'s distributed data platform PNUTS (in Chinese)