


    Published: 2012/9/16 0:54:36


    Background: latency differences

    Jeff Dean has listed how much latency differs between kinds of data access:

    Numbers Everyone Should Know
    L1 cache reference                           0.5 ns
    Branch mispredict                            5 ns
    L2 cache reference                           7 ns
    Mutex lock/unlock                           25 ns
    Main memory reference                      100 ns
    Compress 1K bytes with Zippy             3,000 ns
    Send 2K bytes over 1 Gbps network       20,000 ns
    Read 1 MB sequentially from memory     250,000 ns
    Round trip within same datacenter      500,000 ns
    Disk seek                           10,000,000 ns
    Read 1 MB sequentially from disk    20,000,000 ns
    Send packet CA->Netherlands->CA    150,000,000 ns
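To make the gap concrete, the figures can be compared directly. The snippet below is only a small arithmetic sketch: the dictionary keys and the `ratio` helper are names invented here, and the values are copied from the table above.

```python
# Latencies in nanoseconds, copied from the table above.
# The key names are made up for this sketch.
LATENCY_NS = {
    "main_memory_reference": 100,
    "round_trip_same_datacenter": 500_000,
    "disk_seek": 10_000_000,
    "packet_ca_netherlands_ca": 150_000_000,
}


def ratio(slow: str, fast: str) -> float:
    """Return how many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


# A cross-continent round trip versus a round trip inside one datacenter.
print(ratio("packet_ca_netherlands_ca", "round_trip_same_datacenter"))  # 300.0
```

A round trip between distant datacenters costs about 300 times as much as one inside a single datacenter, which is the gap every multi-IDC design below has to work around.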


    1. 2PC/3PC/Paxos model

    Paxos picks "Consistency, Partition" from the CAP theorem and has to sacrifice availability. It can provide strongly consistent replication across multiple IDCs.


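    • It needs a fast and stable network between IDCs.
    • In a network of 2f+1 nodes, f+1 nodes must complete the transaction for it to succeed.
    • Throughput is low, so it does not suit high request volumes. This is why most distributed storage products do not use the Paxos algorithm directly to synchronize data.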
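The 2f+1 / f+1 relationship is plain majority arithmetic. The sketch below only illustrates that arithmetic; it is not a Paxos implementation, and the function names are invented for this example.

```python
def majority(num_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return num_nodes // 2 + 1


def tolerated_failures(num_nodes: int) -> int:
    """Nodes that may fail while a majority can still be reached."""
    return num_nodes - majority(num_nodes)


for f in (1, 2, 3):
    n = 2 * f + 1
    # With 2f+1 nodes, f+1 must respond and f may be down.
    print(n, majority(n), tolerated_failures(n))
```

With 5 nodes spread over several IDCs, for instance, every transaction waits for 3 nodes, so at least some of the responses usually have to cross an inter-IDC link.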

    2. Dynamo model


    In essence, the preference list of a key is constructed such that the storage nodes are spread across multiple data centers. These datacenters are connected through high speed network links. This scheme of replicating across multiple datacenters allows us to handle entire data center failures without a data outage.

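    As the passage above shows, the precondition is "high speed network links", which may not hold in China. What happens if the network between IDCs is unstable?

    In the quorum algorithm, data has to be spread across multiple datacenters if high availability matters. With two datacenters and, for example, NRW=322, a single datacenter failure can take out 2 of the 3 replicas and leave the data unavailable. A suitable deployment is therefore NRW=533, which needs 3 datacenters. Most requests then succeed only after nodes in 2 datacenters have responded, and given inter-IDC bandwidth and latency, performance will naturally be poor.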
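The NRW reasoning can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the replica placements `[2, 1]` and `[2, 2, 1]` (replicas per datacenter) are assumed here, since the text gives only N, R and W, and both function names are invented for the example.

```python
def survives_any_single_dc_failure(placement, r, w):
    """placement lists replicas per datacenter, e.g. [2, 1].

    Returns True if, whichever single datacenter is lost, enough
    replicas remain to satisfy both the read and write quorum.
    """
    total = sum(placement)
    return all(total - lost >= max(r, w) for lost in placement)


def min_dcs_for_quorum(placement, quorum):
    """Fewest datacenters whose replicas add up to `quorum`."""
    reached = 0
    for count, replicas in enumerate(sorted(placement, reverse=True), start=1):
        reached += replicas
        if reached >= quorum:
            return count
    raise ValueError("quorum larger than total replica count")


# NRW=322 over two datacenters: losing the one with 2 replicas breaks quorum.
print(survives_any_single_dc_failure([2, 1], r=2, w=2))     # False
# NRW=533 over three datacenters: any single loss leaves at least 3 replicas.
print(survives_any_single_dc_failure([2, 2, 1], r=3, w=3))  # True
# But no single datacenter holds 3 replicas, so a quorum spans 2 datacenters.
print(min_dcs_for_quorum([2, 2, 1], quorum=3))              # 2
```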


    A node handling a read or write operation is known as the
    coordinator. Typically, this is the first among the top N nodes in
    the preference list. If the requests are received through a load
    balancer, requests to access a key may be routed to any random
    node in the ring. In this scenario, the node that receives the
    request will not coordinate it if the node is not in the top N of the
    requested key’s preference list. Instead, that node will forward the
    request to the first among the top N nodes in the preference list.

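    If the Dynamo protocol is followed strictly, the coordinator must be the first of the N nodes, so with 3 datacenters 2/3 of all requests have to be forwarded to a coordinator in a remote datacenter, which increases latency. Coordinator selection can be optimized by letting the client pick, from the top N nodes in the preference list, a node in its local datacenter as coordinator. That lowers latency to some extent, but the same key is then more likely to be coordinated by different nodes, which raises the probability of data conflicts.

    Failure detection also easily becomes confused in a multi-datacenter setup. Dynamo does not use a consistent failure view to decide that a node has failed; each node judges on its own.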
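The trade-off between the two coordinator policies can be sketched as follows. Everything here is illustrative: the preference list is modeled as `(node, datacenter)` pairs, and the node names, datacenter names and function names are hypothetical rather than taken from Dynamo.

```python
def strict_coordinator(preference_list):
    """Strict policy: always the first of the top N nodes."""
    return preference_list[0]


def local_coordinator(preference_list, local_dc):
    """Optimized policy: first top-N node in the client's datacenter,
    falling back to the strict choice if there is none."""
    for node, dc in preference_list:
        if dc == local_dc:
            return (node, dc)
    return preference_list[0]


# Hypothetical top-3 preference list for one key, one node per datacenter.
prefs = [("n1", "dc-a"), ("n2", "dc-b"), ("n3", "dc-c")]
client_dcs = ["dc-a", "dc-b", "dc-c"]

# Strict: clients in 2 of the 3 datacenters must forward to a remote node.
remote = sum(1 for dc in client_dcs if strict_coordinator(prefs)[1] != dc)
print(remote)  # 2

# Optimized: nothing is forwarded, but one key now has 3 coordinators.
coordinators = {local_coordinator(prefs, dc)[0] for dc in client_dcs}
print(sorted(coordinators))  # ['n1', 'n2', 'n3']
```

Having several coordinators accept writes for the same key is exactly what makes conflicting versions more likely.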

    Failure detection in Dynamo is used to avoid attempts to
    communicate with unreachable peers during get() and put()
    operations and when transferring partitions and hinted replicas.
    For the purpose of avoiding failed attempts at communication, a
    purely local notion of failure detection is entirely sufficient: node
    A may consider node B failed if node B does not respond to node
    A’s messages (even if B is responsive to node C’s messages).

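    Cassandra, which has become very popular recently, can basically be seen as an open-source Dynamo clone. In the Facebook Inbox Search project it is deployed on 150 nodes spread across datacenters on the US east and west coasts.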

    The system(Facebook Inbox Search) currently stores about 50+TB of data on a 150 node cluster, which is spread out between east and west coast data centers.

    Although its JIRA has a proposal, CASSANDRA-492, about "Data Center Quorum", Cassandra as a whole has no particular optimization for multiple IDCs. Its paper [5] says:

    Data center failures happen due to power outages, cooling
    failures, network failures, and natural disasters. Cassandra
    is configured such that each row is replicated across multiple
    data centers. In essence, the preference list of a key is
    constructed such that the storage nodes are spread across
    multiple datacenters. These datacenters are connected through
    high speed network links. This scheme of replicating across
    multiple datacenters allows us to handle entire data center
    failures without any outage.


    3. PNUTS model



    • Yahoo!'s data is mostly user-related: typically key-value data keyed by the user's username.
    • Analysis of access patterns shows that 85% of a user's modifications usually come from the same IDC.


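    • Record-level master: each record chooses one IDC as its master, and every modification must go through that master. Different records can have different masters, even within the same table (tablet).
    • Data on the master is replicated to the other IDCs asynchronously through Yahoo! Message Broker (YMB).
    • Master selection is flexible: the master IDC can change dynamically depending on where the latest modifications come from. For example, an IDC receives a user's modification request but is not the master, so it forwards the request to the remote master; once such remote modifications exceed 3, the local IDC is made the master.
    • Every modification of a record carries a version number (per-record timeline consistency), and the master and YMB guarantee ordering during replication.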
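The record-level mastering rules above can be sketched in a few lines. This is a toy model, not PNUTS code: the `Record` class, the `write` function and the choice to count only consecutive forwarded writes from the same IDC are assumptions made for illustration. The threshold of 3 comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

FORWARD_THRESHOLD = 3  # from the text: master moves once forwarded writes exceed 3


@dataclass
class Record:
    master_idc: str
    value: Optional[str] = None
    version: int = 0                      # per-record timeline version
    forwarded_from: Optional[str] = None  # IDC of the current forwarding streak
    forwarded_count: int = 0


def write(record: Record, origin_idc: str, value: str) -> bool:
    """Apply a write through the record's master.

    Returns True if the write had to be forwarded to a remote master.
    """
    forwarded = origin_idc != record.master_idc
    if not forwarded:
        record.forwarded_from, record.forwarded_count = None, 0
    elif record.forwarded_from == origin_idc:
        record.forwarded_count += 1
    else:
        record.forwarded_from, record.forwarded_count = origin_idc, 1

    # The master orders all writes, so each one gets the next version.
    record.value = value
    record.version += 1

    # Assumed policy: hand mastership to the IDC that keeps forwarding.
    if forwarded and record.forwarded_count > FORWARD_THRESHOLD:
        record.master_idc = origin_idc
        record.forwarded_from, record.forwarded_count = None, 0
    return forwarded


rec = Record(master_idc="east")
results = [write(rec, "west", f"v{i}") for i in range(6)]
print(results)                      # [True, True, True, True, False, False]
print(rec.master_idc, rec.version)  # west 6
```

After the fourth forwarded write the master moves to "west", and later writes from that IDC are handled locally.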

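    Consistency: because every record is modified through its master, which then replicates to the other IDCs, the data in all IDCs is eventually consistent.

    • Since every IDC holds a local copy of each record, the application can return either the local cached copy or the latest version, depending on its policy.
    • A local modification counts as successful as soon as it is committed to YMB.
    • The failure of any single IDC does not affect access.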


    hosted, notifications, flexible schemas, ordered records, secondary indexes, lowish latency, strong consistency on a single record, scalability, high write rates, reliability, and range queries over a small set of records.

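    In short, PNUTS fits the geographic replication model well.

    • A record counts as written once it is published to the local YMB, which avoids the latency the Dynamo approach incurs while waiting for several data centers to respond.
    • If the master is remote, the request has to be forwarded there, but thanks to the master migration policy this is relatively rare.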
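A rough latency comparison of the two write paths is sketched below. The numbers are assumptions for illustration only: 0.5 ms for a replica in the coordinator's own datacenter (the same-datacenter round trip from the table at the top), 30 ms for a replica in another IDC, and a 2/2/1 replica layout with W=3. The function name is invented here.

```python
def quorum_write_latency_ms(replica_rtts_ms, w):
    """Latency of a write that returns once the w fastest replicas ack."""
    return sorted(replica_rtts_ms)[w - 1]


# Assumed layout: 2 replicas in the coordinator's datacenter, 3 in other IDCs.
LOCAL_RTT_MS = 0.5    # same-datacenter round trip, from the latency table
REMOTE_RTT_MS = 30.0  # assumed inter-IDC round trip
rtts = [LOCAL_RTT_MS, LOCAL_RTT_MS, REMOTE_RTT_MS, REMOTE_RTT_MS, REMOTE_RTT_MS]

# Dynamo-style W=3: the third ack has to come from a remote IDC.
print(quorum_write_latency_ms(rtts, w=3))  # 30.0

# PNUTS-style, master is local: the write returns after a local publish.
print(LOCAL_RTT_MS)                        # 0.5
```

The absolute values matter less than the shape: a quorum write is bounded by an inter-IDC round trip, while a write to a local master is not.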


    Under normal operation, if the master copy of a record fails, our system has protocols to fail over to another replica. However, if there are major outages, e.g. the entire region that had the master copy for a record becomes unreachable, updates cannot continue at another replica without potentially violating record-timeline consistency. We will allow applications to indicate, per-table, whether they want updates to continue in the presence of major outages, potentially branching the record timeline. If so, we will provide automatic conflict resolution and notifications thereof. The application will also be able to choose from several conflict resolution policies: e.g., discarding one branch, or merging updates from branches, etc.


    The PNUTS record-level mastering model is the best fit.
    (1Gbps, Latency < 50ms)
    1. Use Dynamo quorum and the vector clock algorithm for eventual consistency
    2. Use Paxos for strong consistency
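For readers unfamiliar with vector clocks, the comparison step that detects conflicting versions can be sketched as below. This is a generic illustration, not Dynamo's code: clocks are modeled as plain dictionaries from a node name to a counter, and the function and node names are invented for the example.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` has seen every update recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    """Classify two versions by their vector clocks."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a_newer"
    if descends(b, a):
        return "b_newer"
    return "conflict"  # concurrent updates; needs reconciliation


# One coordinator updated the key twice: a clear ordering.
print(compare({"n1": 2}, {"n1": 1}))                    # a_newer
# Two different coordinators updated the same base version: a conflict.
print(compare({"n1": 1, "n2": 1}, {"n1": 1, "n3": 1}))  # conflict
```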




    1. Ryan Barrett, Transactions Across Datacenters
    2. Jeff Dean, Designs, Lessons and Advice from Building Large Distributed Systems (PDF)
    3. PNUTS: Yahoo!’s Hosted Data Serving Platform (PDF)
    4. Thoughts on Yahoo’s PNUTS distributed database
    5. Cassandra – A Decentralized Structured Storage System (PDF)
    6. An Introduction to and Thoughts on PNUTS, Yahoo!'s Distributed Data Platform
