Quick introduction to Hypertable's Thrift API using Python
The purpose of this post is to provide a quick introduction to Hypertable’s Thrift API. We’ll start from the basics, then gradually cover more advanced, performance-related topics. My environment is Ubuntu 10.04 with the Hypertable binary package for 0.9.3.4. Like all great open source projects, Hypertable is changing fast, so I’ll do my best to link not only to the latest online documentation but also to the source code, in case the documentation lags behind. Before reading this post, please make sure you’ve read the getting started guide and are familiar with HQL.
Minimal bookmark manager: set, get via HQL and hypertable shell
Let’s use Hypertable to build a minimal personal bookmark manager, which is essentially a map from descriptions like “work” to links like “tribalytic.com”.
$ ht shell
hypertable> create table bookmarks (url);
  Elapsed time: 0.03 s
hypertable> INSERT INTO bookmarks VALUES
         -> ("homepage", "url", "http://alexdong.com");
  Elapsed time: 0.00 s
  Avg value size: 19.00 bytes
  Total cells: 1
  Throughput: 773.40 cells/s
  Resends: 0
hypertable> INSERT INTO bookmarks VALUES ("blog", "url", "http://notes.alexdong.com");
  Elapsed time: 0.00 s
  Avg value size: 25.00 bytes
  Total cells: 1
  Throughput: 913.24 cells/s
  Resends: 0
hypertable> SELECT * from bookmarks;
blog url http://notes.alexdong.com
homepage url http://alexdong.com
  Elapsed time: 0.00 s
  Avg value size: 22.00 bytes
  Avg key size: 9.50 bytes
  Throughput: 198738.17 bytes/s
  Total cells: 2
  Throughput: 6309.15 cells/s
hypertable> exit
Install
Before diving into the code, make sure the Python dev environment is set up properly:
Add /opt/hypertable/current/lib/py to your PYTHONPATH. The hyperthrift package is in /opt/hypertable/current/lib/py/gen-py/. So in order to use it as from hyperthrift.gen.ttypes import *, you’ll need to move it into the PYTHONPATH directory we just specified, like this: cd /opt/hypertable/current/lib/py/; sudo mv gen-py/hyperthrift ./
Install thrift from the cheese shop: sudo easy_install thrift
Basic: HQL exec and query
Now, let’s add a couple of records using the Thrift API.
$ cat basic.py
#!/usr/bin/env python
from hypertable.thriftclient import *
from hyperthrift.gen.ttypes import *

client = ThriftClient("localhost", 38080)
client.hql_exec("""INSERT INTO bookmarks VALUES ("work", "url", "http://tribalytic.com") """)
print client.hql_query('SELECT * FROM bookmarks REVS 1')
Now, if you run basic.py, you should see something like this:
$ python basic.py
HqlResult(mutator=None, cells=[
  Cell(value='http://notes.alexdong.com', \
    key=Key(column_family='url', column_qualifier='', \
    timestamp=1280100808495279001L, flag=255, row='blog', \
    revision=1280100808495279001L)),
  Cell(value='http://alexdong.com', \
    key=Key(column_family='url', column_qualifier='', \
    timestamp=1280100783276423001L, flag=255, row='homepage', \
    revision=1280100783276423001L)),
  Cell(value='http://tribalytic.com', \
    key=Key(column_family='url', column_qualifier='', \
    timestamp=1280101746911510002L, flag=255, row='work', \
    revision=1280101746911510002L))],
  results=None, scanner=None)
Please notice the REVS 1 in the hql_query call. It specifies how many revisions of each cell we want to get back from Hypertable. Without it, if you run basic.py a few times, you’ll see the same data show up several times with different revision numbers. The list of all options in the HQL language can be found here; the source code can be found here: src/cc/Hypertable/Lib/HqlParser.h: struct Parser: grammar.
HqlResult.cells contains the results we’re interested in. We’ll cover scanner and mutator later. For now, the key takeaway is that the results we get from Hypertable are “deserialized” into a list of Cell objects, each of which is essentially a Key and its value.
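As a small illustration of that structure, the sketch below flattens a list of cells into a plain dict. The Key and Cell namedtuples are stand-ins I made up so the snippet runs without a Hypertable server; they only mirror a few of the fields visible in the output above. The helper name cells_to_bookmarks is also hypothetical. With a real client you would pass HqlResult.cells instead of sample_cells.

```python
from collections import namedtuple

# Stand-ins for the generated Thrift types (assumption: only a few of the
# fields shown in the HqlResult output are modelled here).
Key = namedtuple("Key", "row column_family column_qualifier")
Cell = namedtuple("Cell", "key value")


def cells_to_bookmarks(cells):
    """Map each row key to its value; a later cell for the same row wins."""
    bookmarks = {}
    for cell in cells:
        bookmarks[cell.key.row] = cell.value
    return bookmarks


sample_cells = [
    Cell(Key("blog", "url", ""), "http://notes.alexdong.com"),
    Cell(Key("homepage", "url", ""), "http://alexdong.com"),
]
bookmarks = cells_to_bookmarks(sample_cells)
```

Because a later cell overwrites an earlier one for the same row, this only makes sense for queries that return one revision per cell.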
Introducing Mutator
For most performance-sensitive applications, we want to buffer insertions and “commit” them in a batch, similar to Unix’s fsync or SQL’s COMMIT. Hypertable provides two buffered input/output channels: mutator and scanner. Let’s take a look at mutator first. To make it easy to understand, you can think of ThriftClient as MySQL’s Connection and mutator as a cursor. In order to make changes to the table, we need to call ThriftClient.flush_mutator to “commit” the changes to the server.
$ cat mutator.py
#!/usr/bin/env python
from hypertable.thriftclient import *
from hyperthrift.gen.ttypes import *

client = ThriftClient("localhost", 38080)
mutator = client.open_mutator('bookmarks', 0, 0)
client.set_cell(mutator, \
    Cell(Key('twitter', 'url', None), 'http://twitter.com/alexdong'))
client.set_cell(mutator, \
    Cell(Key('github', 'url', None), 'http://github.com/alexdong'))
client.flush_mutator(mutator)
print client.hql_query('SELECT * FROM bookmarks WHERE row = "twitter" REVS 1')

$ python mutator.py
HqlResult(mutator=None, \
  cells=[
    Cell(
      value='http://twitter.com/alexdong',
      key=Key(column_family='url',
        column_qualifier='',
        timestamp=1280107204496367002L,
        flag=255,
        row='twitter',
        revision=1280107204496367002L)
    )],
  results=None, scanner=None)
The Key object specifies the “row key”, “column family” and “column qualifier”. Since we are not using a column qualifier here, we leave the third argument blank, i.e. None. The Cell contains a Key and the value.
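The open/set/flush pattern can be wrapped in a small helper. This is only a sketch: RecordingClient is a hypothetical fake that records calls so the snippet runs without a server, the Key and Cell namedtuples are stand-ins for the generated Thrift types, and save_bookmarks is a name I made up. The only client methods used are the three from mutator.py above.

```python
from collections import namedtuple

# Stand-ins for the generated Thrift types, so the sketch runs offline.
Key = namedtuple("Key", "row column_family column_qualifier")
Cell = namedtuple("Cell", "key value")


class RecordingClient(object):
    """Hypothetical fake that records calls instead of talking to a server."""

    def __init__(self):
        self.buffered = []   # cells set but not yet flushed
        self.committed = []  # cells "persisted" by flush_mutator

    def open_mutator(self, table, flags, flush_interval):
        # Parameter names are assumptions; mutator.py passes ('bookmarks', 0, 0).
        return 1  # a dummy mutator handle

    def set_cell(self, mutator, cell):
        self.buffered.append(cell)

    def flush_mutator(self, mutator):
        self.committed.extend(self.buffered)
        self.buffered = []


def save_bookmarks(client, bookmarks, table="bookmarks"):
    """Buffer one cell per bookmark, then commit them with a single flush."""
    mutator = client.open_mutator(table, 0, 0)
    for name, url in sorted(bookmarks.items()):
        client.set_cell(mutator, Cell(Key(name, "url", None), url))
    client.flush_mutator(mutator)


client = RecordingClient()
save_bookmarks(client, {"twitter": "http://twitter.com/alexdong",
                        "github": "http://github.com/alexdong"})
```

With the real ThriftClient in place of the fake, the point is the same: many set_cell calls, one flush_mutator.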
Asynchronous updates: put_cell family
set_cell is a family of synchronous operations. Hypertable also offers a set of asynchronous functions called put_cell. As explained in the API documentation, “Open a shared periodic mutator which causes cells to be written asyncronously. Users beware: calling this method merely writes cells to a local buffer and does not guarantee that the cells have been persisted. If you want guaranteed durability, use the open_mutator+set_cells* interface instead.” The following code puts two more bookmarks into the table by using put_cells_as_arrays.
$ cat put_cells.py
#!/usr/bin/env python
import time
from hypertable.thriftclient import *
from hyperthrift.gen.ttypes import *

client = ThriftClient("localhost", 38080)
cells = [
    ['delicious', 'url', "", 'http://delicious.com/dongxun'],
    ['bit.ly', 'url', "", 'http://bit.ly/u/alexdong']
]
client.put_cells_as_arrays("bookmarks", \
    MutateSpec("bookmark_app", 1000, 2),cells)
# sleep for 2 seconds to wait the async operation to finish
time.sleep(2)
for cell in client.hql_query('SELECT * FROM bookmarks REVS 1').cells:
    print cell.key.row, "->", cell.value

$ python put_cells.py
bit.ly -> http://bit.ly/u/alexdong
blog -> http://notes.alexdong.com
delicious -> http://delicious.com/dongxun
github -> http://github.com/alexdong
homepage -> http://alexdong.com
twitter -> http://twitter.com/alexdong
work -> http://tribalytic.com
In the set_cell example, we use two objects, Cell and Key, to represent a cell. Given the object construction/destruction overhead, some tests have shown that replacing the Cell and Key objects with arrays yields about a 3x increase in read performance. Using the CellAsArray type, each cell can be represented as a ["row_key", "column_family", "column_qualifier", "value", "timestamp"] array. The “as_array” variants feel similar to MySQL’s DictCursor vs. the standard Cursor.
Also please notice that instead of using None in the “column_qualifier” field, we use an empty string "". Otherwise, you’ll receive an error message like this:
% python put_cells.py
Traceback (most recent call last):
  File "put_cells.py", line 12, in <module>
    cells)
  File "/opt/hypertable/current/lib/py/hyperthrift/gen/ClientService.py", line 1189, in put_cells_as_arrays
    self.send_put_cells_as_arrays(tablename, mutate_spec, cells)
  File "/opt/hypertable/current/lib/py/hyperthrift/gen/ClientService.py", line 1198, in send_put_cells_as_arrays
    args.write(self._oprot)
  File "/opt/hypertable/current/lib/py/hyperthrift/gen/ClientService.py", line 4966, in write
    oprot.writeString(iter150)
  File "/usr/local/lib/python2.6/dist-packages/Thrift-0.2.0-py2.6-linux-i686.egg/thrift/protocol/TBinaryProtocol.py", line 122, in writeString
    self.writeI32(len(str))
TypeError: object of type 'NoneType' has no len()
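One way to avoid this mistake is to build the arrays through a tiny helper that replaces None with an empty string. This is a minimal sketch; as_cell_array is a hypothetical name of mine, not part of the Hypertable API, and it only covers the four fields used in put_cells.py (no timestamp).

```python
def as_cell_array(row, column_family, value, column_qualifier=None):
    """Build a [row, column_family, column_qualifier, value] list.

    None is replaced with "" because the Thrift binary protocol calls
    len() on every string it writes, which fails on None.
    """
    fields = [row, column_family, column_qualifier, value]
    return ["" if field is None else field for field in fields]


cells = [
    as_cell_array("delicious", "url", "http://delicious.com/dongxun"),
    as_cell_array("bit.ly", "url", "http://bit.ly/u/alexdong"),
]
```

The resulting cells list has the same shape as the one passed to put_cells_as_arrays in put_cells.py.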
Also, if you’re using put_cells_as_arrays in your test code, make sure you call refresh_mutator to get a fresh shared mutator for each test.
High performance reading: Scanner and next_row_as_arrays iterator
Now that we’ve covered the ground of buffered asynchronous writing, let’s take a look at high performance reading using Scanner. To those who are familiar with MySQL, the concept of using scanner is quite similar to the SSCursor. Instead of reading all the records into client side memory, there is a server-side cursor that’s “streaming” the result set to client side.
$ cat scanner.py
#!/usr/bin/env python
import time
from hypertable.thriftclient import *
from hyperthrift.gen.ttypes import *

client = ThriftClient("localhost", 38080)
r = client.hql_exec2("""SELECT url FROM bookmarks WHERE "delicious" <= ROW <= "homepage" REVS 1""", 0, 1)
scanner = r.scanner
while True:
    cells = client.next_row_as_arrays(scanner)
    if not len(cells):
        break
    print cells
client.close_scanner(scanner)

$ python scanner.py
['delicious', 'url', '', 'http://delicious.com/dongxun', '1280110063691009002']
['github', 'url', '', 'http://github.com/alexdong', '1280107204496367001']
['homepage', 'url', '', 'http://alexdong.com', '1280100783276423001']
hql_exec2, like hql_exec, takes three parameters, the third of which is “unbuffered”. When this value is True, the method returns a scanner we can use to iterate through the results. In the example above, we call next_row_as_arrays to read one row at a time.
After fiddling around with the code in src/cc/Hypertable/Lib/HqlInterpreter.cc for a while, it seems that next_row_as_arrays is not the optimal way either. We don’t want to pay the setup/teardown cost just to read one row. Instead, we want to read enough data per call to justify the underlying cost. Hypertable offers another API called next_cells_as_arrays. The number of cells to read per call is specified by the ThriftBroker.NextThreshold field in the configuration file. As of this writing, there is no way to specify a scanner-specific value. Here is the code to replace the call to next_row_as_arrays.
    cells = client.next_cells_as_arrays(scanner)
    if not len(cells):
        break
    print cells
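If you would rather not repeat the while loop everywhere, the batched read can be hidden behind a generator. The sketch below is one possible shape, not something shipped with Hypertable: iter_cells is a hypothetical helper, and FakeScannerClient is a made-up stand-in that serves canned batches so the snippet runs without a server. It relies only on next_cells_as_arrays returning an empty list when the scanner is exhausted, as in the loop above.

```python
class FakeScannerClient(object):
    """Hypothetical stand-in that serves canned batches of cell arrays."""

    def __init__(self, batches):
        self._batches = list(batches)
        self.closed = False

    def next_cells_as_arrays(self, scanner):
        # An empty list signals that the scanner is exhausted.
        return self._batches.pop(0) if self._batches else []

    def close_scanner(self, scanner):
        self.closed = True


def iter_cells(client, scanner):
    """Yield cells one at a time while fetching them in batches."""
    try:
        while True:
            cells = client.next_cells_as_arrays(scanner)
            if not cells:
                break
            for cell in cells:
                yield cell
    finally:
        # Runs when the generator is exhausted or closed early.
        client.close_scanner(scanner)


client = FakeScannerClient([
    [["delicious", "url", "", "http://delicious.com/dongxun"],
     ["github", "url", "", "http://github.com/alexdong"]],
    [["homepage", "url", "", "http://alexdong.com"]],
])
rows = [cell[0] for cell in iter_cells(client, scanner=1)]
```

With the real client, you would pass the ThriftClient instance and the r.scanner value from scanner.py instead of the fake and the dummy handle.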
Hope this helps. Please leave a comment if you have any questions.