Illustration Image

Cassandra.Link

The best knowledge base on Apache Cassandra®

Helping platform leaders, architects, engineers, and operators build scalable real time data platforms.

11/1/2020

Reading time:7 min

Transaction Management on Cassandra

by Scalar, Inc.

Transaction Management on Cassandra SlideShare Explore You Successfully reported this slideshow.Transaction Management on CassandraUpcoming SlideShareLoading in …5× 2 Comments 5 Likes Statistics Notes Leela Krishna Kandrakota , Data Science Solutions Architect Yuya Ma'emichi , Software Developer at Tech company t8kobayashi TetsuyaNihonmatsu Wataru Fukatsu , CEO/COO at Scalar Inc. No DownloadsNo notes for slide 1. Transaction Management on Cassandra10 Sep, 2019 at ApacheCon NA 2019Hiroyuki YamadaCTO/CEO at Scalar, Inc.1 2. © 2019 Scalar, inc.Who am I ?• Hiroyuki Yamada– Passionate about Database Systems and Distributed Systems– Ph.D. in Computer Science, the University of Tokyo– IIS the University of Tokyo, Yahoo! Japan, IBM Japan2 3. © 2019 Scalar, inc.Agenda• What is Scalar DB• Transaction Management on Cassandra• Benchmark and Verification Results3 4. © 2019 Scalar, inc.Maybe You Don’t Need ACID Transactions• ACID transactions are heavy– Especially when data is distributed• One of other solutions:– Make operations idempotent and retry them if atomicity isnot required4 5. © 2019 Scalar, inc.What is Scalar DB• A library that makes non-ACID distributed databases ACID-compliant– Cassandra is the first supported distributed database5https://github.com/scalar-labs/scalardbTransaction management, Recovery management, Java API 6. © 2019 Scalar, inc.System Architecture with Scalar DB and Cassandra6Cassandra nodesAchieves one-copySerializable TransactionsWebApplicationsClientprogramsScalar DBDataStaxJava DriverEnd UsersHTTPCommandexecutionKey key = new Key(new TextValue(“id”, ”1”));Result result =db.get(new Get(key));// do something with resultPut put = new Put(key).(…);db.put(put)Scalar DB 7. © 2019 Scalar, inc.Key Characteristics• Non-invasive approach– Any modifications to the underlying database are not required• High availability– Available as long as quorum of replicas are up– C* high availability is fully sustained by the client-coordinatedapproach• Horizontal scalability– Throughput scales linearly– C* high scalability is fully sustained by the client-coordinatedapproach• Strong Consistency– Replicas updated by transactions are always consistent and up-to-date7 8. © 2019 Scalar, inc.Transaction Management on Cassandra - Introduction• Based on Cherry Garcia protocol [ICDE’15]– Requires minimum set of features such as linearizable conditionalupdate and the ability to store metadata• Scalar DB is one of the applications of the protocol– Use LWT for Linearizability– Manage transaction metadata in user record space• Implement enhancements– Protocol correction– No use of TrueTime API– Serializable support (SI is the default isolation level)8 9. © 2019 Scalar, inc.Transaction Metadata Management• WAL (Write-Ahead Logging) records are distributed9Application data Transaction metadataAfter image Before imageApplication data(Before)Transaction metadata(Before)Status Version TxIDStatus(before)Version(before)TxID(before)TxID Status Other metadataStatus Recordin coordinatortableUser/ApplicationRecordin user tablesApplication data(managed by users)Transaction metadata(managed by Scalar DB) 10. © 2019 Scalar, inc.Transaction Protocol - Overview• Optimistic concurrency control• Similar to 2 phase commit protocol– Prepare phase: prepare records– Commit phase 1: commit status– This is where a transaction is regarded as committed oraborted in normal cases– (Commit phase 2: commit records)• Lazy recovery– Uncommitted records will be rollforwarded or rollbacked based onthe status of a transaction when the records are read10 11. © 2019 Scalar, inc.Transaction Protocol By Examples – Prepare Phase11Client1Client1’s memory spaceCassandraReadAtomic conditionalupdate (LWT)Update only ifthe versionis the version I read Fail due tothe condition mismatchUserID Balance Status Version1 100 C 5TxIDXXX2 100 C 4YYY1 80 P 6Tx12 120 P 5Tx1Tx1: Transfer 20 from 1 to 2Client2UserID Balance Status Version1 100 C 5Client2’s memory spaceTx2: Transfer 10 from 1 to 2TxIDXXX2 100 C 4YYY1 90 P 6Tx22 110 P 5Tx2UserID Balance Status Version1 100 C 5TxIDXXX2 100 C 4YYY 12. © 2019 Scalar, inc.Transaction Protocol By Examples – Commit Phase 112CassandraUserID Balance Status Version1 80 P 6TxIDTx12 120 P 5Tx1StatusCTxIDXXXCYYYAZZZCTx1Atomic conditionalupdate (LWT)Update ifthe TxIDdoes not existClient1 withTx1 13. © 2019 Scalar, inc.Transaction Protocol By Examples – Commit Phase 213CassandraUserID Balance Status Version1 80 C 6TxIDTx12 120 C 5Tx1StatusCTxIDXXXCYYYAZZZCTx1Atomic conditionalupdate (LWT)Update status ifthe record isprepared by the TxIDClient1 withTx1 14. © 2019 Scalar, inc.Failure Handling by Examples• If TX1 fails before prepare phase– Just clear the memory space for TX1• If TX1 fails after prepare phase and before commit phase 1 (nostatus is written in Status table)– Another transaction (TX3) reads the records and notices that the recordsare prepared and there is no status for it– TX3 tries to abort TX1 (TX3 tries to write ABORTED to Status with TX1’s TXIDand rolls back the records)– TX1 might be on it’s way to commit status, but only one can win, not both• If TX1 fails (right) after commit phase 1– Another transaction (TX3) tries to commit the records (rollforward) onbehalf of TX1 when TX3 reads the same records as TX1– TX1 might be on it’s way to commit records, but only one can win, not both14 15. © 2019 Scalar, inc.Benchmark Results15Workload2 (Evidence)Workload1 (Payment)Each node: i3.4xlarge (16 vCPUs, 122 GB RAM, 1900 GB NVMe SSD * 2), RF: 3• Achieved 90 % scalability in 100-node cluster(Compared to the Ideal TPS based on the performance of 3-node cluster) 16. © 2019 Scalar, inc.Verification Results• Scalar DB has been heavily tested with Jepsen and ourdestructive tools– Note that Jepsen tests are created and conducted by Scalar• It has passed both tests for a long time• See https://github.com/scalar-labs/scalar-jepsen for more detail16JepsenPassed 17. © 2019 Scalar, inc.Other Contributions for Apache Cassandra from Scalar• GroupCommitlogService – Yuji Ito– Group multiple commitlog writes at once– CASSANDRA-13530• Jepsen tests for Cassandra – Yuji Ito, Craig Pastro– Maintain with the latest Jepsen– Rewrite with Alia clojure driver– https://github.com/scalar-labs/scalar-jepsen• Cassy– A simple and integrated backup tool– Just released under Apache 2– https://github.com/scalar-labs/cassy17 18. © 2019 Scalar, inc.Cassy: A simple and integrated backup tool18• Required to take a transactionally consistent backup 19. © 2019 Scalar, inc.Future Work• DataStax driver 4.x support• Cassandra 4.x support– Hopefully nothing needs to be done• Other C* compatible databases integration– Scylla DB (waiting for LWT)– Cosmos DB (waiting for LWT)• (HBase adapter)19 20. © 2019 Scalar, inc.Questions ?20 21. © 2019 Scalar, inc.Optimization• Prepare in deterministic order– => First prepare always wins21TX1: updating K1 and K2 and K3 TX2: updating K2 and K3 and K4H: Consistent hashingK1 K2 K3 K2 K3 K4Always prepare in this order !!!(otherwise, all Txs might abort.Ex. TX1 prepares K1,K2 and TX2 prepares K4,K3 in this order)H(K2) H(K1) H(K3) H(K4) 22. © 2019 Scalar, inc.Snapshot Isolation• Strong isolation level but weaker than Serializable– Similar to “MVCC”– Oracle’s most strict isolation level (it’s called “Serializable”)• Read only sees a snapshot (=> non blocking reads)• Mostly strong enough but there are still some anomalies22 23. © 2019 Scalar, inc.Anomalies in Snapshot Isolation• Write Skew, Read-Only Transaction• Write skew example:– Account balances: X and Y (assume family account)– Initial state: X=70, Y=80– Constraint: X + Y > 0– TX1: X = X – 100, TX2: Y = Y - 100– H: R1(X0, 70) R2(X0, 70) R1(Y0, 80) R2(Y0, 80)W1(X1, −30)C1 W2(Y2,−20)C22370 80X0 Y070-100 -> -30 80X YTX170 80-100 -> -20X YTX2Update succeeds without conflictin Snapshot IsolationOk for theconstraintOk for theconstraintUpdate XUpdate Y-30 -20 24. © 2019 Scalar, inc.Serializable Support• Convert all reads into writes (writing the same value) in atransaction24 25. © 2019 Scalar, inc.Protocol Correction• Commit records by non-atomically– => NO!!!• Someone else have already did it and have started andprepared a new transaction25TX1TX2Start andPrepare ACommitStatusCommit A (without knowing A with TX1 is alreadycommitted, and A is overwritten by a new TX2)Read A and commit Aon behalf of TX1Start andPrepare ACommitStatusCommit A Recommended Making Cassandra more capable, faster, and more reliable (at ApacheCon@Home 2...Scalar, Inc. Scalar IST のご紹介Scalar, Inc. Scalar DB: A library that makes non-ACID databases ACID-compliantScalar, Inc. 個人データ連携から見えるSociety5.0~法令対応に向けた技術的な活用事例について~Scalar, Inc. 事業者間・対個人におけるデータの信頼性と透明性の担保によるデジタライゼーションの推進Scalar, Inc. What to Upload to SlideShareSlideShare Customer Code: Creating a Company Customers LoveHubSpot About Blog Terms Privacy Copyright × Public clipboards featuring this slideNo public clipboards found for this slideSelect another clipboard ×Looks like you’ve clipped this slide to already.Create a clipboardYou just clipped your first slide! Clipping is a handy way to collect important slides you want to go back to later. Now customize the name of a clipboard to store your clips. Description Visibility Others can see my Clipboard

Illustration Image
Transaction Management on Cassandra

Successfully reported this slideshow.

Transaction Management on Cassandra
Transaction Management on Cassandra
10 Sep, 2019 at ApacheCon NA 2019
Hiroyuki Yamada
CTO/CEO at Scalar, Inc.
1
© 2019 Scalar, inc.
Who am I ?
• Hiroyuki Yamada
– Passionate about Database Systems and Distributed Systems
– Ph.D. in Co...
© 2019 Scalar, inc.
Agenda
• What is Scalar DB
• Transaction Management on Cassandra
• Benchmark and Verification Results
3
© 2019 Scalar, inc.
Maybe You Don’t Need ACID Transactions
• ACID transactions are heavy
– Especially when data is distrib...
© 2019 Scalar, inc.
What is Scalar DB
• A library that makes non-ACID distributed databases ACID-compliant
– Cassandra is ...
© 2019 Scalar, inc.
System Architecture with Scalar DB and Cassandra
6
Cassandra nodes
Achieves one-copy
Serializable Tran...
© 2019 Scalar, inc.
Key Characteristics
• Non-invasive approach
– Any modifications to the underlying database are not req...
© 2019 Scalar, inc.
Transaction Management on Cassandra - Introduction
• Based on Cherry Garcia protocol [ICDE’15]
– Requi...
© 2019 Scalar, inc.
Transaction Metadata Management
• WAL (Write-Ahead Logging) records are distributed
9
Application data...
© 2019 Scalar, inc.
Transaction Protocol - Overview
• Optimistic concurrency control
• Similar to 2 phase commit protocol
...
© 2019 Scalar, inc.
Transaction Protocol By Examples – Prepare Phase
11
Client1
Client1’s memory space
Cassandra
Read
Atom...
© 2019 Scalar, inc.
Transaction Protocol By Examples – Commit Phase 1
12
Cassandra
UserID Balance Status Version
1 80 P 6
...
© 2019 Scalar, inc.
Transaction Protocol By Examples – Commit Phase 2
13
Cassandra
UserID Balance Status Version
1 80 C 6
...
© 2019 Scalar, inc.
Failure Handling by Examples
• If TX1 fails before prepare phase
– Just clear the memory space for TX1...
© 2019 Scalar, inc.
Benchmark Results
15
Workload2 (Evidence)Workload1 (Payment)
Each node: i3.4xlarge (16 vCPUs, 122 GB R...
© 2019 Scalar, inc.
Verification Results
• Scalar DB has been heavily tested with Jepsen and our
destructive tools
– Note ...
© 2019 Scalar, inc.
Other Contributions for Apache Cassandra from Scalar
• GroupCommitlogService – Yuji Ito
– Group multip...
© 2019 Scalar, inc.
Cassy: A simple and integrated backup tool
18
• Required to take a transactionally consistent backup
© 2019 Scalar, inc.
Future Work
• DataStax driver 4.x support
• Cassandra 4.x support
– Hopefully nothing needs to be done...
© 2019 Scalar, inc.
Questions ?
20
© 2019 Scalar, inc.
Optimization
• Prepare in deterministic order
– => First prepare always wins
21
TX1: updating K1 and K...
© 2019 Scalar, inc.
Snapshot Isolation
• Strong isolation level but weaker than Serializable
– Similar to “MVCC”
– Oracle’...
© 2019 Scalar, inc.
Anomalies in Snapshot Isolation
• Write Skew, Read-Only Transaction
• Write skew example:
– Account ba...
© 2019 Scalar, inc.
Serializable Support
• Convert all reads into writes (writing the same value) in a
transaction
24
© 2019 Scalar, inc.
Protocol Correction
• Commit records by non-atomically
– => NO!!!
• Someone else have already did it a...

Upcoming SlideShare

Loading in …5

×

  1. 1. Transaction Management on Cassandra 10 Sep, 2019 at ApacheCon NA 2019 Hiroyuki Yamada CTO/CEO at Scalar, Inc. 1
  2. 2. © 2019 Scalar, inc. Who am I ? • Hiroyuki Yamada – Passionate about Database Systems and Distributed Systems – Ph.D. in Computer Science, the University of Tokyo – IIS the University of Tokyo, Yahoo! Japan, IBM Japan 2
  3. 3. © 2019 Scalar, inc. Agenda • What is Scalar DB • Transaction Management on Cassandra • Benchmark and Verification Results 3
  4. 4. © 2019 Scalar, inc. Maybe You Don’t Need ACID Transactions • ACID transactions are heavy – Especially when data is distributed • One of other solutions: – Make operations idempotent and retry them if atomicity is not required 4
  5. 5. © 2019 Scalar, inc. What is Scalar DB • A library that makes non-ACID distributed databases ACID-compliant – Cassandra is the first supported distributed database 5 https://github.com/scalar-labs/scalardb Transaction management, Recovery management, Java API
  6. 6. © 2019 Scalar, inc. System Architecture with Scalar DB and Cassandra 6 Cassandra nodes Achieves one-copy Serializable Transactions Web Applications Client programs Scalar DB DataStax Java Driver End Users HTTP Command execution Key key = new Key( new TextValue(“id”, ”1”)); Result result = db.get(new Get(key)); // do something with result Put put = new Put(key).(…); db.put(put) Scalar DB
  7. 7. © 2019 Scalar, inc. Key Characteristics • Non-invasive approach – Any modifications to the underlying database are not required • High availability – Available as long as quorum of replicas are up – C* high availability is fully sustained by the client-coordinated approach • Horizontal scalability – Throughput scales linearly – C* high scalability is fully sustained by the client-coordinated approach • Strong Consistency – Replicas updated by transactions are always consistent and up-to-date 7
  8. 8. © 2019 Scalar, inc. Transaction Management on Cassandra - Introduction • Based on Cherry Garcia protocol [ICDE’15] – Requires minimum set of features such as linearizable conditional update and the ability to store metadata • Scalar DB is one of the applications of the protocol – Use LWT for Linearizability – Manage transaction metadata in user record space • Implement enhancements – Protocol correction – No use of TrueTime API – Serializable support (SI is the default isolation level) 8
  9. 9. © 2019 Scalar, inc. Transaction Metadata Management • WAL (Write-Ahead Logging) records are distributed 9 Application data Transaction metadata After image Before image Application data (Before) Transaction metadata (Before) Status Version TxID Status (before) Version (before) TxID (before) TxID Status Other metadata Status Record in coordinator table User/Application Record in user tables Application data (managed by users) Transaction metadata (managed by Scalar DB)
  10. 10. © 2019 Scalar, inc. Transaction Protocol - Overview • Optimistic concurrency control • Similar to 2 phase commit protocol – Prepare phase: prepare records – Commit phase 1: commit status – This is where a transaction is regarded as committed or aborted in normal cases – (Commit phase 2: commit records) • Lazy recovery – Uncommitted records will be rollforwarded or rollbacked based on the status of a transaction when the records are read 10
  11. 11. © 2019 Scalar, inc. Transaction Protocol By Examples – Prepare Phase 11 Client1 Client1’s memory space Cassandra Read Atomic conditional update (LWT) Update only if the version is the version I read Fail due to the condition mismatch UserID Balance Status Version 1 100 C 5 TxID XXX 2 100 C 4YYY 1 80 P 6Tx1 2 120 P 5Tx1 Tx1: Transfer 20 from 1 to 2 Client2 UserID Balance Status Version 1 100 C 5 Client2’s memory space Tx2: Transfer 10 from 1 to 2 TxID XXX 2 100 C 4YYY 1 90 P 6Tx2 2 110 P 5Tx2 UserID Balance Status Version 1 100 C 5 TxID XXX 2 100 C 4YYY
  12. 12. © 2019 Scalar, inc. Transaction Protocol By Examples – Commit Phase 1 12 Cassandra UserID Balance Status Version 1 80 P 6 TxID Tx1 2 120 P 5Tx1 Status C TxID XXX CYYY AZZZ CTx1 Atomic conditional update (LWT) Update if the TxID does not exist Client1 with Tx1
  13. 13. © 2019 Scalar, inc. Transaction Protocol By Examples – Commit Phase 2 13 Cassandra UserID Balance Status Version 1 80 C 6 TxID Tx1 2 120 C 5Tx1 Status C TxID XXX CYYY AZZZ CTx1 Atomic conditional update (LWT) Update status if the record is prepared by the TxID Client1 with Tx1
  14. 14. © 2019 Scalar, inc. Failure Handling by Examples • If TX1 fails before prepare phase – Just clear the memory space for TX1 • If TX1 fails after prepare phase and before commit phase 1 (no status is written in Status table) – Another transaction (TX3) reads the records and notices that the records are prepared and there is no status for it – TX3 tries to abort TX1 (TX3 tries to write ABORTED to Status with TX1’s TXID and rolls back the records) – TX1 might be on it’s way to commit status, but only one can win, not both • If TX1 fails (right) after commit phase 1 – Another transaction (TX3) tries to commit the records (rollforward) on behalf of TX1 when TX3 reads the same records as TX1 – TX1 might be on it’s way to commit records, but only one can win, not both 14
  15. 15. © 2019 Scalar, inc. Benchmark Results 15 Workload2 (Evidence)Workload1 (Payment) Each node: i3.4xlarge (16 vCPUs, 122 GB RAM, 1900 GB NVMe SSD * 2), RF: 3 • Achieved 90 % scalability in 100-node cluster (Compared to the Ideal TPS based on the performance of 3-node cluster)
  16. 16. © 2019 Scalar, inc. Verification Results • Scalar DB has been heavily tested with Jepsen and our destructive tools – Note that Jepsen tests are created and conducted by Scalar • It has passed both tests for a long time • See https://github.com/scalar-labs/scalar-jepsen for more detail 16 Jepsen Passed
  17. 17. © 2019 Scalar, inc. Other Contributions for Apache Cassandra from Scalar • GroupCommitlogService – Yuji Ito – Group multiple commitlog writes at once – CASSANDRA-13530 • Jepsen tests for Cassandra – Yuji Ito, Craig Pastro – Maintain with the latest Jepsen – Rewrite with Alia clojure driver – https://github.com/scalar-labs/scalar-jepsen • Cassy – A simple and integrated backup tool – Just released under Apache 2 – https://github.com/scalar-labs/cassy 17
  18. 18. © 2019 Scalar, inc. Cassy: A simple and integrated backup tool 18 • Required to take a transactionally consistent backup
  19. 19. © 2019 Scalar, inc. Future Work • DataStax driver 4.x support • Cassandra 4.x support – Hopefully nothing needs to be done • Other C* compatible databases integration – Scylla DB (waiting for LWT) – Cosmos DB (waiting for LWT) • (HBase adapter) 19
  20. 20. © 2019 Scalar, inc. Questions ? 20
  21. 21. © 2019 Scalar, inc. Optimization • Prepare in deterministic order – => First prepare always wins 21 TX1: updating K1 and K2 and K3 TX2: updating K2 and K3 and K4 H: Consistent hashing K1 K2 K3 K2 K3 K4 Always prepare in this order !!! (otherwise, all Txs might abort. Ex. TX1 prepares K1,K2 and TX2 prepares K4,K3 in this order) H(K2) H(K1) H(K3) H(K4)
  22. 22. © 2019 Scalar, inc. Snapshot Isolation • Strong isolation level but weaker than Serializable – Similar to “MVCC” – Oracle’s most strict isolation level (it’s called “Serializable”) • Read only sees a snapshot (=> non blocking reads) • Mostly strong enough but there are still some anomalies 22
  23. 23. © 2019 Scalar, inc. Anomalies in Snapshot Isolation • Write Skew, Read-Only Transaction • Write skew example: – Account balances: X and Y (assume family account) – Initial state: X=70, Y=80 – Constraint: X + Y > 0 – TX1: X = X – 100, TX2: Y = Y - 100 – H: R1(X0, 70) R2(X0, 70) R1(Y0, 80) R2(Y0, 80)W1(X1, −30)C1 W2(Y2, −20)C2 23 70 80 X0 Y0 70-100 -> -30 80 X Y TX1 70 80-100 -> -20 X YTX2 Update succeeds without conflict in Snapshot Isolation Ok for the constraint Ok for the constraint Update X Update Y -30 -20
  24. 24. © 2019 Scalar, inc. Serializable Support • Convert all reads into writes (writing the same value) in a transaction 24
  25. 25. © 2019 Scalar, inc. Protocol Correction • Commit records by non-atomically – => NO!!! • Someone else have already did it and have started and prepared a new transaction 25 TX1 TX2 Start and Prepare A Commit Status Commit A (without knowing A with TX1 is already committed, and A is overwritten by a new TX2) Read A and commit A on behalf of TX1 Start and Prepare A Commit Status Commit A

×

Related Articles

data.processing
cassandra
nvidia

GitHub - fversaci/cassandra-dali-plugin: Cassandra plugin for NVIDIA DALI

fversaci

8/17/2023

cassandra
prisma

Checkout Planet Cassandra

Claim Your Free Planet Cassandra Contributor T-shirt!

Make your contribution and score a FREE Planet Cassandra Contributor T-Shirt! 
We value our incredible Cassandra community, and we want to express our gratitude by sending an exclusive Planet Cassandra Contributor T-Shirt you can wear with pride.

Join Our Newsletter!

Sign up below to receive email updates and see what's going on with our company

Explore Related Topics

AllKafkaSparkScyllaSStableKubernetesApiGithubGraphQl

Explore Further

cassandra