Cassandra - Technology watch

Learn what Cassandra is in less than 5 minutes!
Sunday, November 21, 2021

What is Cassandra?

Apache Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

Cassandra offers support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.

Cassandra was designed to implement a combination of Amazon’s Dynamo distributed storage and replication techniques combined with Google’s Bigtable data and storage engine model.

Wikipedia

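Cassandra stores data as key-value pairs organized into wide rows, and it is eventually consistent: replicas may briefly disagree after a write but converge over time. Cassandra cannot do joins or subqueries, so it favors denormalized data, with tables designed around the queries they must serve.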

Cassandra is an open source tool with 6.9K GitHub stars and 3K GitHub forks. Here’s a link to Cassandra’s open source repository on GitHub.

Cassandra is used by many large companies, such as Uber, Facebook, Netflix, Instagram, Spotify and Reddit.

How it looks

Here is the documentation of CQL commands

Keyspace

First, we create a keyspace called “movies”. A keyspace is the top-level namespace in Cassandra: it groups tables and defines how their data is replicated.

CREATE KEYSPACE movies WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
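
SimpleStrategy places replicas without regard to datacenter layout, so it is mainly suited to single-datacenter or test clusters. For the multi-datacenter deployments mentioned earlier, a keyspace would typically use NetworkTopologyStrategy instead. Here is a minimal sketch. The keyspace name movies_multi_dc and the datacenter names dc1 and dc2 are assumptions, and the datacenter names must match those your cluster actually reports:

```cql
-- Hypothetical names: movies_multi_dc, dc1, dc2.
-- Keeps three replicas in dc1 and two replicas in dc2.
CREATE KEYSPACE IF NOT EXISTS movies_multi_dc
WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2};
```

The rest of this article keeps using the simpler “movies” keyspace.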

Create table

We want to create a table that stores movie records with this structure:

{
    "_id": 1,
    "title": "toy story",
    "release_year": 1995,
    "duration": 81,
    "genre": "animation",
    "actors": ["tom hanks", "tim allen", "don rickles", "jim varney"],
    "director": "john lasseter",
    "imdbrating": 8.3,
    "imdbvotes": 611558,
    "author": {
        "scenario": "joss whedon",
        "original story": "john lasseter"
    }
}

Here is how the table looks in the Cassandra Query Language (CQL), a close relative of SQL:

-- "_id" is the partition key; release_year and genre are clustering columns.
CREATE TABLE movies.movies (
    "_id" bigint,
    title text,
    release_year int,
    duration int,
    genre text,
    actors list<text>,
    director text,
    imdbrating float,
    imdbvotes int,
    author map<text, text>,
    PRIMARY KEY ("_id", release_year, genre)
) WITH CLUSTERING ORDER BY (release_year ASC, genre ASC);

Insert into database

To insert a row, the query looks like an SQL statement:

INSERT INTO movies.movies ("_id", title, release_year, duration, genre, actors, director, imdbrating, imdbvotes, author)
VALUES (1, 'toy story', 1995, 81, 'animation', ['tom hanks', 'tim allen', 'don rickles', 'jim varney'], 'john lasseter', 8.3, 611558, {'scenario': 'joss whedon', 'original story': 'john lasseter'});
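
Since the record we started from is already JSON, CQL can also accept it almost verbatim. The sketch below assumes Cassandra 2.2 or later, where INSERT ... JSON is available. It targets the same primary key as the statement above, so running both simply overwrites the same row:

```cql
-- Same Toy Story record, passed as a single JSON string literal.
INSERT INTO movies.movies JSON '{"_id": 1, "title": "toy story", "release_year": 1995, "duration": 81, "genre": "animation", "actors": ["tom hanks", "tim allen", "don rickles", "jim varney"], "director": "john lasseter", "imdbrating": 8.3, "imdbvotes": 611558, "author": {"scenario": "joss whedon", "original story": "john lasseter"}}';
```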

Select data

To filter on a column that is not part of the primary key, we first need to create a secondary index on that column.

For example, to select by title:

CREATE INDEX index_title ON movies.movies (title); /* only needed once */
SELECT * FROM movies.movies WHERE title = 'toy story';

Here is another example, this time selecting by actor:

CREATE INDEX index_actors ON movies.movies (actors);
SELECT * FROM movies.movies WHERE actors CONTAINS 'tom hanks';
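
The indexes are needed only because title and actors are not part of the primary key. By contrast, here is a minimal sketch of lookups that go through the primary key of the table defined earlier and therefore need no index (the selected columns are just an illustration):

```cql
-- Lookup by partition key: no secondary index required.
SELECT title, director, imdbrating
FROM movies.movies
WHERE "_id" = 1;

-- Narrow further with the clustering columns, in their declared order.
SELECT title
FROM movies.movies
WHERE "_id" = 1 AND release_year = 1995 AND genre = 'animation';
```

Queries like these are routed straight to the replicas that own the partition, which is why Cassandra tables are usually designed around the lookups they must serve.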

Conclusion

Cassandra is a distributed database built for high performance and high availability. It scales easily, replicates data across nodes, and can be deployed across multiple datacenters. It is well suited to managing very large volumes of data, such as telemetry or other big-data workloads.

