Client layer with Thrift and Avro

Thrift

The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi, and other languages.

Thrift includes a complete stack for creating clients and servers. The top layer is code generated from the Thrift definition file: from this file, Thrift generates client and processor code. User-defined data structures, unlike built-in types, are also emitted as generated code. The protocol and transport layers are part of the runtime library, so with Thrift it is possible to define a service and then change the protocol and transport without recompiling the generated code. Besides the client part, Thrift includes server infrastructure to tie protocols and transports together, such as blocking, non-blocking, and multi-threaded servers. The underlying I/O part of the stack is implemented separately for each language.
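
To make the layering concrete, below is a minimal Java sketch of standing up the client side of the stack against Cassandra's Thrift interface (the host, port, and use of a framed transport are assumptions for the example). Swapping the transport or protocol object changes those layers without touching the generated Cassandra.Client code:

import org.apache.cassandra.thrift.Cassandra;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class ThriftStackDemo {
    public static void main(String[] args) throws Exception {
        // Transport layer: a TCP socket wrapped in framing
        TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        // Protocol layer: binary encoding; could be swapped (e.g. for
        // TJSONProtocol) without regenerating any client code
        TBinaryProtocol protocol = new TBinaryProtocol(transport);
        // Generated-code layer: the client produced by the Thrift compiler
        Cassandra.Client client = new Cassandra.Client(protocol);
        transport.open();
        // ... issue calls through client here ...
        transport.close();
    }
}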

Thrift versus CQL

Querying – In CQL you can query Cassandra and get data back in a couple of lines (using the JDBC driver):

String query = "SELECT * FROM message;";
PreparedStatement statement = con.prepareStatement(query);
ResultSet result = statement.executeQuery();

With Thrift-based APIs it is a bit more involved (example using Astyanax):

OperationResult<ColumnList<String>> result =
    keyspace.prepareQuery(mail /* specify column family structure */)
            .getKey("lyuben@1363115059")
            .execute();
ColumnList<String> columns = result.getResult();

Performance – Based on the benchmarks carried out by Acunu, Thrift (RPC) is slightly ahead of CQL in query performance, but this advantage only yields a significant benefit in situations where high throughput is a key requirement.

Installing Thrift
Thrift is a software framework for scalable cross-language services development. Cassandra supports Thrift, thereby allowing integration across multiple programming languages and platforms.

As part of this guide we’ll use Thrift over PHP. First, we need to install Boost.
cd /usr/ports/devel/boost
make all
make install

We need automake and autoconf:
cd /usr/ports/devel/automake110
make all
make install
cd /usr/ports/devel/autoconf262
make all
make install

Now, assuming you meet Thrift requirements, we can proceed.
cd /usr/tmp/
fetch "http://apache.raffsoftware.com/incubator/thrift/0.2.0-incubating/thrift-0.2.0-incubating.tar.gz"
tar xvfz thrift-0.2.0-incubating.tar.gz
cd thrift-0.2.0
./bootstrap.sh
./configure --with-boost=/usr/local
make
make install

If you’re on FreeBSD, you can simply install Thrift from the FreeBSD ports:
cd /usr/ports/devel/thrift
make all
make install

Interfacing with Cassandra
Creating a record in Cassandra using Thrift:

/* Insert some data into the Standard1 column family from the default config */

// Keyspace specified in storage-conf.xml
$keyspace = 'Keyspace1';

// Reference to a specific user id
$keyUserId = "1";

// Construct the column path that we are adding information into
$columnPath = new cassandra_ColumnPath();
$columnPath->column_family = 'Standard1';
$columnPath->super_column = null;
$columnPath->column = 'email';

// Timestamp for the update
$timestamp = time();

// Write at consistency level ONE ($client is assumed to be an open
// Thrift Cassandra client connection)
$consistency_level = cassandra_ConsistencyLevel::ONE;

// Add the value to be written to the table, keyed by user key and path
$value = 'foo.bar@example.com';
$client->insert($keyspace, $keyUserId, $columnPath, $value, $timestamp, $consistency_level);
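
For comparison, here is a rough sketch of the same insert through the Thrift-generated Java client, assuming the Cassandra 0.6-era Thrift API, in which insert() takes the keyspace, row key, column path, raw value bytes, timestamp, and consistency level directly:

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnPath;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class ThriftInsertDemo {
    public static void main(String[] args) throws Exception {
        TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();

        // Column path: column family Standard1, column name "email"
        ColumnPath path = new ColumnPath("Standard1");
        path.setColumn("email".getBytes("UTF-8"));

        // Write the value for row key "1" at consistency level ONE
        client.insert("Keyspace1", "1", path,
                "foo.bar@example.com".getBytes("UTF-8"),
                System.currentTimeMillis(), ConsistencyLevel.ONE);

        transport.close();
    }
}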


Avro

Apache Avro™ is a data serialization system: a remote procedure call and serialization framework developed within Apache's Hadoop project. It uses JSON to define data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.

It is similar to Thrift, but does not require running a code-generation program when a schema changes (unless desired for statically-typed languages).

Avro provides:

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files, nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.
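
Here is a minimal Java sketch of this using Avro's generic API (the record and field names are invented for illustration): the schema is plain JSON, no classes are generated, and the writer's schema is supplied again when decoding the untagged bytes:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroRoundTripDemo {
    public static void main(String[] args) throws Exception {
        // Schema defined in JSON; no code generation involved
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Message\","
          + "\"fields\":[{\"name\":\"email\",\"type\":\"string\"}]}");

        // Build a record dynamically against that schema
        GenericRecord record = new GenericData.Record(schema);
        record.put("email", "foo.bar@example.com");

        // Serialize: the binary output carries no per-field tags
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();

        // Deserialize: the writer's schema must be present at read time
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord decoded = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(decoded.get("email"));
    }
}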

When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.
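
A short sketch of the container file behavior in the same vein (the file name and schema are again invented): the writer embeds the schema in the file header, so a later reader needs no schema of its own:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroFileDemo {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Message\","
          + "\"fields\":[{\"name\":\"email\",\"type\":\"string\"}]}");

        // Write: the schema is stored in the header of messages.avro
        GenericRecord rec = new GenericData.Record(schema);
        rec.put("email", "foo.bar@example.com");
        DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
            new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("messages.avro"));
        writer.append(rec);
        writer.close();

        // Read: no schema is passed in; it is taken from the file itself
        DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
            new File("messages.avro"), new GenericDatumReader<GenericRecord>());
        for (GenericRecord r : reader) {
            System.out.println(r.get("email"));
        }
        reader.close();
    }
}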

When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since client and server both have the other's full schema, correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved.

Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.
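
For instance, a richer record type for the hypothetical message data used in the sketches above is declared as an ordinary JSON document:

{
  "type": "record",
  "name": "Message",
  "namespace": "example.mail",
  "fields": [
    {"name": "key",       "type": "string"},
    {"name": "email",     "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}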

Comparison with other systems
Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

  • Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
  • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
  • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
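
The following Java sketch illustrates the last point (the schemas are again invented for the example): data written with an old schema is decoded against a newer one that adds a field, and resolution happens by field name, with the new field taking its declared default:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaEvolutionDemo {
    public static void main(String[] args) throws Exception {
        // The schema the data was originally written with
        Schema writerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Message\",\"fields\":["
          + "{\"name\":\"email\",\"type\":\"string\"}]}");
        // A newer reader schema that adds a field with a default value
        Schema readerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Message\",\"fields\":["
          + "{\"name\":\"email\",\"type\":\"string\"},"
          + "{\"name\":\"subject\",\"type\":\"string\",\"default\":\"none\"}]}");

        // Encode a record with the writer's schema
        GenericRecord rec = new GenericData.Record(writerSchema);
        rec.put("email", "foo.bar@example.com");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(rec, enc);
        enc.flush();

        // Decode with both schemas: fields are matched by name, and the
        // "subject" field missing from the data takes its default
        GenericDatumReader<GenericRecord> reader =
            new GenericDatumReader<GenericRecord>(writerSchema, readerSchema);
        BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord decoded = reader.read(null, dec);
        System.out.println(decoded.get("subject")); // prints "none"
    }
}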