Update for Thursday, 10 March 2011

Virtuoso, deadlocks, prefixing SPARQL and SPARUL queries

Lately Virtuoso has been greeting me with 40001 Error SR172: Transaction deadlocked. These happen during heavy concurrent writes to the database. hwilliams on the forum suggested trying log_enable(2) to force autocommit of transactions. So I’d want to prefix my SPARUL queries with the following:

DEFINE sql:log-enable 2 

My code is in Python, and it uses SuRF as an ORM in front of Virtuoso’s triple store. So I need to somehow tell SuRF to add that DEFINE clause in front of each SPARUL query it generates and executes. Luckily, there’s the virtuoso_protocol SuRF plugin by Christoph Burgmer that supports exactly this for SPARQL queries. After some minor patching it supports prefixing SPARUL queries as well. So in my Python code, the store and session initialization looks something like:

# config_items have been read from .ini file previously
store = surf.Store(**config_items)
store.writer.define = "sql:log-enable 2"
session = surf.Session(store, {})

Autocommit only makes sense for queries that modify data, so I’m not putting anything in store.reader.define. But I could put something there as well, for example, a directive to turn on inferencing:

store.reader.define = 'input:inference "uri_of_my_inference_rules"'

And yes, the log_enable(2) thing seems to help: I haven’t encountered deadlocks since turning it on.
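What the plugin effectively does with store.writer.define can be sketched as simple string prepending. This is an assumption about the implementation, not the plugin’s actual code, and the statement text here is made up:

```python
def add_defines(query, defines):
    """Prefix a SPARQL/SPARUL query with DEFINE pragmas.

    A guess at what the patched virtuoso_protocol plugin does with
    store.writer.define; the real plugin may build the string differently.
    """
    prefix = "".join("DEFINE %s\n" % define for define in defines)
    return prefix + query

query = add_defines("INSERT DATA { <a> <b> <c> }", ["sql:log-enable 2"])
print(query)
# DEFINE sql:log-enable 2
# INSERT DATA { <a> <b> <c> }
```

The nice part is that the rest of the code never sees the vendor-specific prefix; it only deals with plain queries.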

Caching → Complexity → Doom

Resisted the urge to solve things by the usual “let’s just add more caching”. Caching, sharing the cache between processes and machines, and invalidating it all add tons of complexity. Complexity destroys the fun of hacking and makes the brain hurt. There are alternatives to caching, like avoiding work altogether, or grouping work so that the cache can have a scope of one function or one HTTP request, or just going shopping. Yay.
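The request-scoped cache idea can be sketched like this: the caller creates a plain dict at the start of one request and throws it away at the end, so there is nothing to share between processes and nothing to invalidate. The function name and the “work” are made up for illustration:

```python
def expensive_lookup(key, cache=None):
    """Do some work, optionally memoized in a caller-supplied dict."""
    if cache is not None and key in cache:
        return cache[key]
    value = key.upper()  # stand-in for the actual expensive work
    if cache is not None:
        cache[key] = value
    return value

# One request: create a cache, use it, discard it.
request_cache = {}
expensive_lookup("john", request_cache)  # computed
expensive_lookup("john", request_cache)  # served from the dict
# request_cache goes out of scope here; nothing to invalidate
```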

hg and git and time machines

I’ve been getting along with Bitbucket fine, just pushing and pulling changes from and to my lonely repositories, but as soon as I needed to push a change upstream or merge some stuff back, I’d get confused by merges, conflicts, and multiple heads. Seriously scary stuff, like I need to operate a time machine and I don’t know what any of the knobs do. Actually, when you think about it, for a code artifact living in a VCS, the VCS is a lot like a time machine, with forks being parallel universes… Anyway, I read some chapters from the Mercurial book, and the multiple-heads thing is now much clearer to me. Time well spent.

Other stuff

Also some wrestling with tests. It’s complexity again: as soon as your test suite gets sufficiently big and diverse, it’s a PITA to keep it up and running, and an extreme PITA to get it running in new environments.

Been also looking into server monitoring services. You know, all good things come to an end, production servers included. Started by registering at Was it up?, a service that promises to send me an email in case the server doesn’t respond to HTTP requests.


Update for Monday, 14 February 2011

  • Happy to see SPARQLWrapper 1.4.2 released, enables rdflib 3.0 goodness in SuRF
  • Was putting together a shell script to resize user-uploaded pictures to a few standard sizes with ImageMagick. Picked up a few useful shell tips:
    • set -e causes the script to exit on the first error (non-zero exit code from a command). Failing early is good
    • Get the absolute, canonical path of a file or directory: CANONICAL=$(readlink -e "$SOMEPATH")
  • It’s cold around here!

“keywords must be strings”

Bugfixing! Here’s an interesting one:

def some_function(**kwargs): 
    pass

some_function(**{u"this_is_unicode_keyword":"value"})

Passing a unicode keyword to a function works in Python 2.6.6 (and probably above), but doesn’t work in Python 2.6.2 (and probably below): it throws TypeError: some_function() keywords must be strings. Of course it took some digging to get from a generic “something went wrong somewhere” error report to this specific cause. Turns out there was at least one affected place in SuRF, which is fixed now.
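One generic way to guard against this on old interpreters is to coerce the keys before unpacking. str_keys is a hypothetical helper for illustration, not SuRF’s actual fix:

```python
def str_keys(kwargs):
    """Coerce dictionary keys to str before **-unpacking.

    Hypothetical workaround for the "keywords must be strings" TypeError
    raised by old Python versions when keys are unicode.
    """
    return dict((str(key), value) for key, value in kwargs.items())

def some_function(**kwargs):
    return kwargs

print(some_function(**str_keys({u"this_is_unicode_keyword": "value"})))
```

On modern Pythons this is a no-op behaviorally, which makes it a safe compatibility shim.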

SuRF update

Relatively low activity since the last update, but some interesting stuff nevertheless. There hasn’t been a release for a while, but, hey, you can install right from SVN with pip!

pip install http://surfrdf.googlecode.com/svn/trunk#egg=SuRF
pip install http://surfrdf.googlecode.com/svn/trunk/plugins/surf.sparql_protocol#egg=surf.sparql_protocol

So, the updates:

  • Python 2.4 compatibility, based on a patch by Saša Vinčić. This mostly had to do with replacing a = b if c else d constructs with the and-or trick.
  • Good Relations namespace added to default namespaces in surf.ns module.
  • Added dictionary-style attribute access. This comes in handy in two situations:

    • When properties contain special characters that are not allowed in Python attribute names
    • When iterating over a set of attributes, or when attribute names are determined dynamically. You can always use Python’s built-in getattr(), but this looks nicer.

    Demo:

    >>> for attr in ["name", "surname"]: print john["foaf_%s" % attr].first
    John
    Smith
    	
    # URIRefs are also accepted as dictionary keys:
    >>> for attr in ["name", "surname"]: print john[surf.ns.FOAF[attr]].first
    John
    Smith    
    
  • SomeClass.get_by(ns_predicate = ....) now accepts lists and other iterable objects as values. So, for example,

    johns_and_joes = FoafPerson.get_by(foaf_name = ["John", "Joe"])
    

    You can even use SuRF queries as arguments since they are iterable:

    persons_that_know_any_john_or_joe = \
        FoafPerson.get_by(foaf_knows = johns_and_joes)
    

    In this case, SuRF will first evaluate the inner query, and then use its values to form a bunch of UNION clauses in the outer query. So beware: you can easily get a long and inefficient query. SuRF could of course be made super-clever and combine both queries into a more sophisticated one.

  • … bugfixes, optimizations
  • Inference Rules & Reasoning with Virtuoso

    OpenLink Virtuoso supports some inferencing, but requires you to prefix SPARQL queries with special syntax to enable it. So how do you prefix SPARQL queries generated by SuRF with vendor-specific stuff? One way to do this is to create a new reader/writer plugin which subclasses and extends the sparql_protocol plugin. An example of this is on the SuRF mailing list. There is a link to the experimental surf.sparql_protocol_virtuoso plugin, which adds inferencing support.
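The UNION expansion that get_by() does for iterable values (described above) can be sketched as follows. This is illustrative only: SuRF’s query builder produces its own text, but the shape is the same, one graph pattern per candidate value, joined with UNION:

```python
def union_query(var, predicate_uri, values):
    """Expand candidate values into a UNION of graph patterns.

    Illustrative sketch, not SuRF's actual query builder output.
    """
    patterns = ['{ %s <%s> "%s" }' % (var, predicate_uri, value)
                for value in values]
    return "SELECT %s WHERE { %s }" % (var, " UNION ".join(patterns))

print(union_query("?s", "http://xmlns.com/foaf/0.1/name", ["John", "Joe"]))
# SELECT ?s WHERE { { ?s <http://xmlns.com/foaf/0.1/name> "John" } UNION { ?s <http://xmlns.com/foaf/0.1/name> "Joe" } }
```

With many values this grows linearly, which is exactly why long inner-query results make the outer query inefficient.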

What’s been happening with SuRF

There was a SuRF 1.1.0 release on January 20, 2010, and it brought the following:

  • Class mapping based on rdf:type of resources and “single” descriptor. I’ve already blogged about it here: Extending SuRF resource classes.
  • MIN, MAX, AVG functions, UNION groups in query builder. For situations when the querying capabilities of SuRF resources are not enough, you’ll sometimes resort to writing SPARQL/SPARUL queries by hand and interpreting their results yourself. SuRF can help here a bit: instantiate a Query object, call its methods to add things like FROM and WHERE clauses, and SuRF will translate it to a string representation upon execution. Building queries this way can result in cleaner code than doing lots of string concatenation. The query builder doesn’t yet support all SPARQL/SPARUL syntax features; we’re extending it as we go along. Starting from v1.1.0 it can build queries containing aggregate functions, and the SVN version can do unions:
    import surf
    from surf.query import a, select
    from surf.query.translator.sparql import SparqlTranslator
    
    # get session here...
    
    query  = select("min(?price)", "max(?price)")
    query.union(("?s", surf.ns.SURF.price, "?price"), 
                 ("?s", surf.ns.SURF.discount_price, "?price"))
    
    # Now either execute the query
    result = session.default_store.execute(query)    
    
    # ... or translate it to string and look at it:
    print SparqlTranslator(query).translate()
    # prints: u'SELECT  min(?price) max(?price)  WHERE { {  ?s <http://code.google.com/p/surfrdf/price> ?price  } UNION {  ?s <http://code.google.com/p/surfrdf/discount_price> ?price  } }'
    
  • get_by() accepts resource instances as values. To demonstrate:
    Person = session.get_class(surf.ns.FOAF.Person)
    mary = Person("http://mary.example.com/me")
    
    # Using URIRef as argument value--
    persons_who_know_mary = Person.get_by(foaf_knows = mary.subject)
    
    # And now, use resource 'mary' in get_by() directly--
    persons_who_know_mary = Person.get_by(foaf_knows = mary)
    
    # BTW a similar effect can also be achieved with
    # inverse attributes:
    persons_who_know_mary = mary.is_foaf_knows_of
    
    
  • ResourceValue supports the “in” keyword. ResourceValue is the class that represents resource attributes. It tries to mimic Python lists, and it got better at this by supporting “if john in mary.foaf_knows: ...” syntax.
  • Fixed multiple bugs in the Sesame2 plugin. Due to lack of manpower in SuRF development there are a few dusty corners in the codebase, like the Sesame2 plugin. Well, it got a bit better!
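The idea behind the “in” support can be sketched with tiny stand-in classes. These are not SuRF’s actual ResourceValue internals, just an illustration of membership testing that matches both resource objects and their subject URIs:

```python
class Resource(object):
    """Minimal stand-in for a SuRF resource: just holds a subject URI."""
    def __init__(self, subject):
        self.subject = subject

class ResourceList(list):
    """Stand-in for ResourceValue: `in` matches resources and their URIs."""
    def __contains__(self, item):
        for element in self:
            if element is item:
                return True
            if getattr(element, "subject", None) == item:
                return True
        return False

mary = Resource("http://mary.net/me")
knows = ResourceList([mary])
print(mary in knows)                   # True: same resource object
print("http://mary.net/me" in knows)   # True: matched by subject URI
```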

And here are more recent developments in SVN trunk, not yet available in a released version:

  • HTTP 1.1 keep-alive support in the sparql_protocol plugin. The keep-alive feature lets you reuse a single connection for several requests. This is especially important on Windows systems when doing many requests in a short period of time. Without keep-alive, the client creates a connection for each request, and each connection occupies a port on the client machine for 120 seconds or so. Due to the default port configuration on Windows, after a few thousand requests the system runs out of free ports and requests start to fail. Keep-alive solves this. To use it, you’ll need to upgrade SPARQLWrapper to version 1.4.1 (released today!) and supply the argument use_keepalive = True when creating the SuRF store.
  • Fixed a memory leak when eager-loading resources (using the .full() modifier). All eager-loaded resources were incorrectly marked as dirty/unsaved and added to the pool of dirty resources. In intensive or long-running processes this pool slowly grows, memory consumption grows, and all SuRF operations gradually become slower because Python has to manipulate a monster-sized set of dirty resources.
  • Significantly increased the efficiency of updates/deletes in sparql_protocol plugin:
    • store.update(), store.save(), store.remove() methods now accept multiple resources, and in case of sparql_protocol, these updates are performed in one or two queries. Previously, a separate SPARUL query would be issued for each resource.
    • OpenLink Virtuoso’s SPARQL endpoint supports multiple SPARUL queries in one request. SuRF can be instructed to utilize this feature by initializing the store with the “combine_queries = True” parameter.
    • Queries that delete resources now specify graph in WHERE clause. This greatly speeds up deletes on OpenLink Virtuoso with default indexes.
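The batching idea behind combine_queries can be sketched as joining statements with semicolons, since Virtuoso’s endpoint accepts several SPARUL statements in one request body. The statement texts are made up, and this is an illustration of the idea, not SuRF’s actual serialization:

```python
def combine_sparul(queries):
    """Join several SPARUL statements into one request body.

    Sketch only: strips any trailing separator from each statement, then
    joins them with semicolons so the batch travels in a single request.
    """
    return ";\n".join(query.rstrip("; \n") for query in queries)

batch = combine_sparul([
    "INSERT DATA { <a> <b> <c> }",
    "DELETE DATA { <x> <y> <z> }",
])
print(batch)
```

The win is fewer HTTP round-trips: one request for a whole batch of updates instead of one per resource.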

Locating and eliminating performance bottlenecks is fun! More fixes and improvements to come!

Extending SuRF resource classes

One cool feature SuRF has had for a long time, but which I didn’t know about, is the ability to extend resource classes depending on their rdf:type. By default, resources in SuRF are all instances of surf.Resource. So they all have common methods like save() and dynamic attributes in “prefix_predicate” form. What if you wanted some logic that operates on just one kind of resource? The object-oriented way would be to store this logic inside the class, i.e., instead of

do_stuff(resource)

you would rather have

resource.do_stuff()

It turns out you can do that with SuRF. Write your own class that implements a do_stuff method (or adds some extra attributes or properties), then put it in session.mapping:

session.mapping[surf.ns.EXAMPLE_NAMESPACE.some_type] = MyClass

From then on all resources of type example_namespace:some_type will also be subclasses of MyClass. Here’s a complete example that extends foaf:Person type resources with method get_friends_count:

import surf

class MyPerson(object):
    """ Some custom logic for foaf:Person resources. """

    def get_friends_count(self):
        return len(self.foaf_knows)

session = surf.Session(surf.Store(reader = "rdflib", writer = "rdflib"))
session.mapping[surf.ns.FOAF.Person] = MyPerson

# Now let's test the mapping
john = session.get_resource("http://example/john", surf.ns.FOAF.Person)

# Is `john` an instance of surf.Resource? 
print isinstance(john, surf.Resource)	
# outputs: True

# Is `john` an instance of MyPerson?
print isinstance(john, MyPerson)
# outputs: True

# Try the custom `get_friends_count` method:
print john.get_friends_count()
# outputs: 0

Attribute aliases with properties, descriptors

Being able to customize SuRF resource classes allows for many nifty things. For example, you can add a short-named property to your class that works as an alias for a longer but frequently used attribute. If you expect the attribute to always have just one value, you can encapsulate that in the property code as well:

class MyPerson(object):

    @property
    def name(self):
        return self.foaf_name.first

If you define several such properties, you’ll start to see code duplication. Following the DRY principle, we can replace this code with descriptors (which is, by the way, the standard way things are done in RDFAlchemy).

First, a fragment from surf/util.py:

class single(object):
    """ Descriptor for easy access to attributes with single value. """
    
    def __init__(self, attr):
        if isinstance(attr, URIRef):
            attr = rdf2attr(attr, True)
        self.attr = attr
    
    def __get__(self, obj, type = None):
        return getattr(obj, self.attr).first
    
    def __set__(self, obj, value):
        setattr(obj, self.attr, value) 

    def __delete__(self, obj):
        setattr(obj, self.attr, []) 

In addition to read access, this descriptor also supports setting and deleting the value. Here’s how it would be used:

from surf.util import single

class MyPerson(object):
    name = single("foaf_name")
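To see the descriptor work outside SuRF, here is a self-contained sketch. FirstList is a tiny stand-in for ResourceValue, and the __set__/__delete__ methods wrap values in it explicitly, where SuRF’s resource machinery would do the wrapping itself:

```python
class FirstList(list):
    """Tiny stand-in for SuRF's ResourceValue: a list with a .first."""
    @property
    def first(self):
        return self[0] if self else None

class single(object):
    """Same idea as surf.util.single, adapted for plain objects."""
    def __init__(self, attr):
        self.attr = attr
    def __get__(self, obj, type=None):
        return getattr(obj, self.attr).first
    def __set__(self, obj, value):
        setattr(obj, self.attr, FirstList([value]))
    def __delete__(self, obj):
        setattr(obj, self.attr, FirstList())

class Person(object):
    name = single("foaf_name")
    def __init__(self):
        self.foaf_name = FirstList(["John"])

person = Person()
print(person.name)      # John
person.name = "Johnny"
print(person.name)      # Johnny
del person.name
print(person.name)      # None
```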

More complex example: implementing rdf:Bag

There was a question on the SuRF mailing list recently: does SuRF support RDF containers, Bag and Seq? Currently the answer is, unfortunately, no, or at least not very well. But customizable classes can help us here.

Bag and Seq are basically conventions for modeling unordered (Bag) and ordered (Seq) lists in RDF. Here’s an example of a Bag in RDF/XML representation, from W3Schools:

<?xml version="1.0"?>

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:cd="http://www.recshop.fake/cd#">

<rdf:Description
rdf:about="http://www.recshop.fake/cd/Beatles">
  <cd:artist>
    <rdf:Bag>
      <rdf:li>John</rdf:li>
      <rdf:li>Paul</rdf:li>
      <rdf:li>George</rdf:li>
      <rdf:li>Ringo</rdf:li>
    </rdf:Bag>
  </cd:artist>
</rdf:Description>

</rdf:RDF>

Here’s the same data in N-Triples notation:

_:genid1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> .
_:genid1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1> "John" .
_:genid1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2> "Paul" .
_:genid1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3> "George" .
_:genid1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#_4> "Ringo" .
<http://www.recshop.fake/cd/Beatles> <http://www.recshop.fake/cd#artist> _:genid1 .

So we see that there is a resource (a blank node in this case) of type rdf:Bag, with predicates rdf:_1, rdf:_2, … containing the bag items. Let’s write a class that extends resources of type rdf:Bag and provides an iterator over these predicates:

class MyBag(object):
    def iterator(self):
        # First, load everything that's known about this resource.
        self.load()

        # Now bag items are available in attributes
        # "rdf__1", "rdf__2", ...
        # We can either generate attribute names
        # and use the getattr function, or we can generate
        # predicate URIs and look into the self.rdf_direct
        # dictionary directly. I chose the latter.
        i = 0
        while True:
            i += 1
            predicate_uri = surf.ns.RDF["_%d" % i]
            if predicate_uri not in self.rdf_direct:
                return
            yield self.rdf_direct[predicate_uri]

    def __iter__(self):
        return self.iterator()

session.mapping[surf.ns.RDF.Bag] = MyBag

And here’s the iterator in action:

cd = session.get_resource("http://www.recshop.fake/cd/Beatles", surf.ns.OWL.Thing) 
artists = cd.cd_artist.first 

for name in artists: 
    print name 
# prints: 
# [rdflib.Literal(u'John')] 
# [rdflib.Literal(u'Paul')] 
# [rdflib.Literal(u'George')] 
# [rdflib.Literal(u'Ringo')] 

This is of course just proof-of-concept code, and some essential functionality is missing. For example, it would be nice for bags and sequences to also support modification and element access by index, not just iteration. However, we see that the ability to customize resource classes provides a nice and non-intrusive way to implement such features.
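Index access along the same lines could look like this. It’s a toy model of the numbered-predicate convention with no SuRF dependency: the `triples` dict stands in for a real resource’s self.rdf_direct, and PlainBag is a made-up name:

```python
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

class PlainBag(object):
    """Toy model of the rdf:_1, rdf:_2, ... convention with index access."""
    def __init__(self, triples):
        self.triples = triples   # stands in for self.rdf_direct

    def __iter__(self):
        i = 0
        while True:
            i += 1
            predicate = RDF_NS + "_%d" % i
            if predicate not in self.triples:
                return
            yield self.triples[predicate]

    def __getitem__(self, index):
        # Element access by index, 0-based like Python lists.
        predicate = RDF_NS + "_%d" % (index + 1)
        try:
            return self.triples[predicate]
        except KeyError:
            raise IndexError(index)

bag = PlainBag({RDF_NS + "_1": "John", RDF_NS + "_2": "Paul"})
print(list(bag))   # ['John', 'Paul']
print(bag[1])      # Paul
```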

ResourceValue membership testing

A super-tiny SuRF update for today:

import surf
from surf.rdf import URIRef

session = surf.Session(surf.Store())
Person = session.get_class(surf.ns.FOAF.Person)

john_uri = "http://john.com/me"
mary_uri = "http://mary.net/me"

john = Person(john_uri)
mary = Person(mary_uri)

# When setting attribute value, SuRF will wrap it into
# ResourceValue, a list-subclassing object with some 
# extra features.
john.foaf_knows = mary 

print type(john.foaf_knows)
# outputs: <class 'surf.resource.value.ResourceValue'>

print john.foaf_knows
# outputs: [<surf.session.Foaf object at 0x90b746c>]

# Now the new bit:

# Is SuRF resource `mary` in john.foaf_knows?
print mary in john.foaf_knows
# outputs: True 
  
# Is URIRef `mary_uri` in john.foaf_knows?
print URIRef(mary_uri) in john.foaf_knows
# outputs: True 

The new bit is last few lines of the example–list membership testing. So a fairly small thing, but it didn’t work correctly before and it’s these small things that build up the overall experience. Being more pythonic, having better integration with Python idioms is one of the areas where SuRF can improve a lot. For example there’s also the issue about transparently mapping Literals to Python built-in datatypes on the waiting list.