
A Better Sharded Counter
My current AppEngine project was crying out for some counters to track site-wide instances of various models, and I recalled watching Brett Slatkin’s video about building highly-scalable web apps. A few seconds of quality Google time later, and I had the ShardCounter classes ready to go. The same code is available in several locations:
http://code.google.com/p/google-app-engine-samples/source/browse/trunk/sharded-counters/generalcounter.py

One thing that quickly caught my attention is that these classes only support incrementing, and while that makes sense for something like a primitive visit counter, it didn’t handle my needs very well at all. My initial attempt to simply copy the increment function and change the critical += to a -= was naive and doomed to failure, but a little tinkering with the way that counts are recorded gave me a nice working solution that completely preserves the desirable performance characteristics of this approach.

Here’s the code that I came up with. Please feel free to use it in your own projects.
from google.appengine.api import memcache
from google.appengine.ext import db
import random

# This code unabashedly stolen from Google
# http://code.google.com/appengine/articles/sharding_counters.html#counter_python

class GeneralCounterShardConfig(db.Model):
    """Tracks the number of shards for each named counter."""
    name = db.StringProperty(required=True)
    num_shards = db.IntegerProperty(required=True, default=20)


class GeneralCounterShard(db.Model):
    """Shards for each named counter"""
    name = db.StringProperty(required=True)
    "The name of the counter."
    
    plus = db.IntegerProperty(required=True, default=0)
    "The number of times that the counter has been incremented."
    
    minus = db.IntegerProperty(required=True, default=0)
    "The number of times that the counter has been decremented."


def get_count(name):
    """Retrieve the value for a given sharded counter.

    Parameters:
      name - The name of the counter
    """
    total = memcache.get(name)
    if total is None:
        total = 0
        for counter in GeneralCounterShard.all().filter('name = ', name):
            total += counter.plus
            total -= counter.minus
        memcache.add(name, str(total), 60)
    # A cached value comes back as a string, so always hand back an int.
    return int(total)


def increment(name):
    """Increment the value for a given sharded counter.

    Parameters:
      name - The name of the counter
    """
    config = GeneralCounterShardConfig.get_or_insert(name, name=name)
    def txn():
        index = random.randint(0, config.num_shards - 1)
        shard_name = name + str(index)
        counter = GeneralCounterShard.get_by_key_name(shard_name)
        if counter is None:
            counter = GeneralCounterShard(key_name=shard_name, name=name)
        counter.plus += 1
        counter.put()
    db.run_in_transaction(txn)
    memcache.incr(name)

def decrement(name):
    """Decrement the value for a given sharded counter.

    Parameters:
      name - The name of the counter
    """
    config = GeneralCounterShardConfig.get_or_insert(name, name=name)
    def txn():
        index = random.randint(0, config.num_shards - 1)
        shard_name = name + str(index)
        counter = GeneralCounterShard.get_by_key_name(shard_name)
        if counter is None:
            counter = GeneralCounterShard(key_name=shard_name, name=name)
        counter.minus += 1
        counter.put()
    db.run_in_transaction(txn)
    memcache.decr(name)

def increase_shards(name, num):
    """Increase the number of shards for a given sharded counter.
    Will never decrease the number of shards.

    Parameters:
      name - The name of the counter
      num - How many shards to use

    """
    config = GeneralCounterShardConfig.get_or_insert(name, name=name)
    def txn():
        if config.num_shards < num:
            config.num_shards = num
            config.put()
    db.run_in_transaction(txn)
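
To make the plus/minus bookkeeping concrete, here is a minimal in-memory sketch of the same idea, with no Datastore or memcache involved. The ShardedCounterSketch class and its method names are hypothetical stand-ins for illustration, not part of the AppEngine API. Each shard keeps two tallies that only ever grow, and the count is the sum of the plus tallies less the sum of the minus tallies.

```python
import random


class ShardedCounterSketch(object):
    """In-memory stand-in for the sharded counter (hypothetical, for illustration)."""

    def __init__(self, num_shards=20):
        # Each shard keeps two tallies that only ever grow.
        self.shards = [{"plus": 0, "minus": 0} for _ in range(num_shards)]

    def increment(self):
        # Pick a shard at random, as the Datastore version does.
        random.choice(self.shards)["plus"] += 1

    def decrement(self):
        random.choice(self.shards)["minus"] += 1

    def get_count(self):
        # The total is every increment less every decrement, across all shards.
        return sum(s["plus"] - s["minus"] for s in self.shards)


counter = ShardedCounterSketch(num_shards=5)
for _ in range(10):
    counter.increment()
for _ in range(3):
    counter.decrement()
print(counter.get_count())  # prints 7: 10 increments less 3 decrements
```

Which shard receives each update does not matter, so the random spread that keeps write contention low has no effect on the total.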
0 comments

A Pattern for RESTful URLs

I recently decided that I didn't like the way that URLs on the blog were formatted.  For example, the link to show the entry before this is:

/showpost?id=aglhZGFtY2Jsb2dyLQsSCUJsb2dJbmRleCIJYWRhbWNibG9nDAsSBFBvc3QiC2FkYW1jYmxvZzE2DA

and that is bad on a number of levels.  First, the post-specific data is the AppEngine Datastore ID of the entity that holds the post.  While it is usefully unique and a quick index to the data, it is also terribly ugly and utterly unhelpful to either human readers or search engines.  It needs to be a slug.  That's well and good, as I have been working on a Sluggable mixin class to go along with the two other tools in my CMS belt, Taggable and Commentable.  I'll write more about Sluggable when it is ready to be released.

Secondly, the ID is passed in to the showpost handler as a GET parameter, and I'd rather have it be more RESTful, something like:

/showpost/acts_as_urlnameable-instructions

or even

/showpost/aglhZGFtY2Jsb2dyLQsSCUJsb2dJbmRleCIJYWRhbWNibG9nDAsSBFBvc3QiC2FkYW1jYmxvZzE2DA

since I don't have Sluggable ready.  Now, it occurred to me that it would be reasonably easy to change the code up to have the RESTful-style URLs, but then I would be breaking any existing links to posts.  So, I needed to be able to switch over to the new style while keeping the old style available.  I came up with a pretty decent approach, I think.

The first step is that I needed to change the mapping in the WSGIApplication setup.  You'll notice that I have removed all of the other mappings for the sake of brevity, but it used to look like this:

def main():
    application = webapp.WSGIApplication(
        [
         ('/showpost', ShowPost)
        ])
    wsgiref.handlers.CGIHandler().run(application)

In order to handle the RESTful pattern, I changed to a regular expression:

def main():
    application = webapp.WSGIApplication(
        [
         (r'^/showpost{1}(/.*)?', ShowPost)
        ])
    wsgiref.handlers.CGIHandler().run(application)

That will match both the desired new format and the must-be-tolerated old format.  Now that the mapping is set up to call the correct function, I have to go about modifying the ShowPost function.  This is how it looks:
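
As a quick check of that claim, here is a small sketch that runs the same pattern through Python's re module. The match_showpost helper is hypothetical, and the trailing $ is my own addition, on the assumption that the framework anchors the pattern at the end of the path. Note that the query string of an old-style URL (?id=...) is not part of the path, so the old format arrives as a bare /showpost.

```python
import re

# The pattern from the mapping above, with an explicit end anchor added
# (an assumption about how the framework applies it).
SHOWPOST_PATTERN = re.compile(r'^/showpost{1}(/.*)?$')


def match_showpost(path):
    """Return (matched, captured_group) for a request path."""
    match = SHOWPOST_PATTERN.match(path)
    if match is None:
        return (False, None)
    return (True, match.group(1))


# Old style: the id travels in the query string, so the path is bare.
print(match_showpost('/showpost'))
# New style: the id is the last path element.
print(match_showpost('/showpost/acts_as_urlnameable-instructions'))
# An unrelated path that merely starts the same way is rejected.
print(match_showpost('/showposts'))
```

The captured group keeps its leading slash, and as far as I can tell it is handed to the handler's get method as a positional argument, which would explain why the revised ShowPost.get further down accepts *args.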

class ShowPost(SmartHandler):
    def get(self):
        from post import Post
       
        postid = self.request.get('id')
        if postid is not None and len(postid) > 0:
            try:
                post = Post.get(postid)
                # ... remainder of the handler omitted

Not bad, but modifying it to account for the new format while keeping the old format will be ugly, and I'll end up repeating the code in any other request-handling methods, so I'm going to abstract it a bit and put it into SmartHandler, the customized version of RequestHandler that I use.  I added the following instance method to the SmartHandler class:

def expects_request_id(self, *look_for):
    "Searches the request Uri for an embedded resource id."
       
    # First preference is to find it in the request Uri.  Assumption is
    # that it is the last element in a multi-element path.  Empty elements
    # are meaningless, so drop them.
    cleaned_path_parts = [part for part in self.request.path.split("/") if part]
       
    found_id = None
       
    if len(cleaned_path_parts) > 1:
        found_id = cleaned_path_parts[-1]
    else:
        # There is only one element in the path, so we will look
        # for id info in the GET & POST arguments.  Candidate argument
        # names are passed in through *look_for
        for each_arg_name in look_for:
            if each_arg_name in self.request.arguments():
                found_id = self.request.get(each_arg_name)
                break
               
    if found_id is None:
        raise NoIDFound
    else:
        self.requested_id = found_id
           
    return found_id
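
The lookup order is easier to see with the request object taken out of the picture. The following is a rough standalone sketch of the same logic: find_request_id is a hypothetical free function, arguments is a plain dict standing in for the request's GET and POST arguments, and the NoIDFound definition is my assumption about what the exception looks like, since it is not shown in this post.

```python
class NoIDFound(Exception):
    """Raised when neither the path nor the arguments carry an id."""


def find_request_id(path, arguments, *look_for):
    """Return the id embedded in path, else the first matching argument."""
    # Drop the empty strings that split() yields around slashes.
    parts = [part for part in path.split("/") if part]
    if len(parts) > 1:
        # Multi-element path: the last element is taken to be the id.
        return parts[-1]
    # Single-element path: fall back to the candidate argument names, in order.
    for name in look_for:
        if name in arguments:
            return arguments[name]
    raise NoIDFound()


print(find_request_id("/showpost/my-slug", {}, "id"))                  # prints my-slug
print(find_request_id("/showpost", {"id": "abc123"}, "id", "postid"))  # prints abc123
```

A path id always wins over an argument, so a request that carries both is resolved by the path.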

And I call it in ShowPost like this:

class ShowPost(SmartHandler):
    def get(self, *args):
        from post import Post
       
        try:
            self.expects_request_id("id")
       
            try:
                post = Post.get(self.requested_id)
                #
                # code snipped for brevity
                #
            except db.BadKeyError:
                # Render an error page here..."Sorry, but the post that you requested isn't there."
                pass
        except NoIDFound:
            # Render an error page: "Sorry, but when requesting a post, you have to specify the id of the Post."
            pass

You can see that expects_request_id has a declarative feel to it, and it seamlessly allows me to handle new-style and old-style URLs.  It assumes that any request id information is the last element in a multi-element path, and if it is a single-element path, it looks for a URL parameter that we pass in.  In this case, the parameter name is id, but it could be any string, and it could even be many different strings:

self.expects_request_id("id", "postid", "post")

would allow me to honor many different parameters.

I hope that this pattern and this code are useful.  I'll be happy to answer any questions about it, and I'm always deeply grateful for any suggestions and comments.

0 comments

Acts_as_urlnameable Instructions

A few posts ago, I promised to share my fail-proof instructions for installing and integrating the Ruby on Rails plugin called acts_as_urlnameable.  Here is what I learned.  Since I put this together, I have discovered some more issues that I need to find workarounds for, and I'll post about those later.  For now, here's how you get it working for you in the vast majority of cases.

1. Install plugin:

  • script/plugin install http://code.helicoid.net/svn/rails/plugins/acts_as_urlnameable/  will make a static copy of the source for you, or
  • script/plugin install -x http://code.helicoid.net/svn/rails/plugins/acts_as_urlnameable/ will pull it in as an svn:externals reference, so it keeps tracking the SVN repository

2. Add acts_as_urlnameable to environment.rb if needed.  If you define config.plugins, add urlnameable there.

3. For each model that will be urlnameable, add acts_as_urlnameable:

class Foo < ActiveRecord::Base
  acts_as_urlnameable :nameable_field
end

 

4. Add to each Model an override of to_param.  This implementation differs from the suggested ones by continuing to provide the default numeric id for records that haven't been urlnameified yet.  This should help you to avoid breaking existing functionality when adding this to an existing website:

def to_param
  if urlname and urlname.length > 0
    urlname
  else
    id.to_s
  end
end

 

5. Add a new method, smart_find.  Again, this approach allows you to have mixed numeric and urlnamed ids.  This will save a good deal of time when converting an existing Rails application to use acts_as_urlnameable.  There are plenty of places in code that you won't care about having pretty, legible ids, like in HIDDEN form fields.  This reduces the total exposure to your code:

def self.smart_find(id)
  if id.to_s =~ /\A\d+\z/
    # We got a regular, old int id, so look it up as usual.
    # Checking that every character is a digit keeps a urlname such as
    # "3-little-pigs" from being mistaken for the numeric id 3.
    Foo.find(id)
  else
    # We got a string, a urlname id
    Foo.find_by_urlname(id)
  end
end

6. Add a migration to add the table and apply to existing rows:

class AddUrlnamesTable < ActiveRecord::Migration # :nodoc:

  def self.up
    create_table 'urlnames' do |t|
      t.column 'nameable_type',     :string
      t.column 'nameable_id',       :integer
      t.column 'name',              :string
    end
   
    # For each Model to which acts_as_urlnameable will apply
    # and which has existing rows, add a loop like the following;
    # simply resaving each record will add the urlname data.
    for each_foo in Foo.find(:all)
      each_foo.save
    end
  end
 
  def self.down
    drop_table 'urlnames'
  end
end

7. In each controller that references one of the now-acts_as_urlnameable Models, change references to Foo.find to Foo.smart_find

8. In all of your views, check for links that use the pattern :id => @foo.id and replace them with :id => @foo.  This will allow the to_param override to intelligently choose which id to provide.

 


That's the basic idea, and that should be enough to get anyone going with acts_as_urlnameable.  During the course of integrating it into My Kids Library, I have discovered a number of circumstances that require some pretty sophisticated customization, and I will detail those in future posts.  Until then, I am happy to answer any questions posted in the Comments section.

0 comments

Open Source as a Roadside Picnic

One of the many changes that Jessamyn suggested for MyKidsLibrary is that the URLs should be composed of meaningful text rather than just numbers.  In addition to being more human-friendly, it is, apparently, an important search engine optimization technique.

Ruby on Rails likes to construct URLs that end with a numeric identifier that is used to look up a specific record in the database.  It is an efficient, effective solution, and the software engineer side of me never considered why you'd have it be otherwise.  I have come to think of URLs as being things that are as effectively meaningless and worthless to my brain as printouts of UNIX coredumps.  I click on links, I bookmark pages, I never pay the slightest attention to URLs.  I use tools -- browsers, bookmarking services -- to work with URLs just as I use tools to write software.

Once I decided to go about making the change, I set out to find who else had already done this work.  The Rails ecosystem is vast and densely populated; I knew that there was but a very tiny chance that I'd actually have to start from scratch.  Sure enough, a little work on Google revealed that there were many candidate solutions.  I picked one that looked solid and set about integrating it into my project.

I'm not new to this; I have been a working, salary-earning software engineer for nearly two decades, so I should have been prepared for the documentation to suck.  The documentation always sucks.  The last time that I read really good, comprehensive documentation was when I was writing code for a VMS system, and I sat right next to the big orange wall.  At least, I remember it being good; it's all so long ago that I might be remembering it in a somewhat nostalgic light.

I had to figure out a lot of things that weren't mentioned in the documentation, and while that's not the worst thing, it is still frustrating to see a useful, well-put-together package that stops just short of being perfect.  And, really, they all do.

Open Source is invaluable, but in many respects, it reminds me of a Roadside Picnic.

And just to prove that I'm not a hypocritical dick, my next post will include extensive, failproof instructions for configuring and using the wonderful Rails plugin acts_as_urlnameable.

3 comments

Paginating Records in Google AppEngine

3/3/2001: I now consider the information in this post to be obsolete. More useful and up-to-date advice is available in the post Do Not Reinvent the Pagination Wheel.

 

In creating this blogging software, I have had to come to grips with finding a way to paginate content.  It's a relatively trivial exercise under most circumstances; it is a well-understood pattern, and it is actually built in to some of the popular frameworks.  AppEngine is a little different, and the nature of the Datastore actually makes it rather challenging to implement efficient, useful paging.  I've come up with a solution that I think makes for a good balance of functionality and AppEngine-friendliness.

The code and techniques included here are Open Source.  I do hope that if you choose to use this code in your own project that you'll comment here to share your feedback, suggestions and experiences.  Sharing means caring, guys.  For real.

This Paginator class depends on the Model that it will be paginating having an 'index' field, a unique value that is ordered with respect to how the pagination will occur.  For instance, here is the model definition for this blog's Comment entity:

class Comment(db.Model):
    """A Model for storing comments associated with another entity."""
    author = db.StringProperty(required=True, verbose_name="Author")
    "A text representation of the user who wrote the comment."
   
    body = db.TextProperty(required=True, verbose_name="Comment")
    "The text of the comment."
   
    added = db.DateTimeProperty(auto_now_add=True, verbose_name="Date Added")
    "The date that the comment was added, or created."

    index = db.IntegerProperty(required=True, default=0)
    "The index of the comment in the collection of comments for the parent entity."

Here, index increases every time a new comment is added; in fact, it mirrors added, always increasing.  However, index will always be unique.  It might not always be contiguous however, as a Comment can be deleted.  This function adds comments to the parent entity.  You can see how index is maintained:

def add_comment(self, author, body):
    "Add a new comment to this entity.  Returns the new comment object."
    new_comment = None
    def add_comment_txn():
        new_comment = Comment(parent=self, author=author, body=body, index=self.comment_index)
        new_comment.put()
        self.comment_index += 1
        self.comment_count += 1
        self.put()
           
        return new_comment
    new_comment = db.run_in_transaction(add_comment_txn)
       
    memcache.delete(self._comments_cache_key())
    # Invalidate the cached collection of records, so it will be regenerated
    # and re-loaded with the new record in it.
       
    return new_comment

Paginator comes into play in the function that gets a page of comments when the blog is requested to show a post:

def get_comments(self, index=0, count=5):
    "Return the comments attached to this entity."
    comments_paginator = Paginator(count, 'index')
    comments = comments_paginator.get_page(db.Query(Comment).ancestor(self), index, True)
           
    return comments

The only perhaps slightly non-obvious part is index.  Where does it come from?  How do I know which index to ask for?  Is index the page number?  The answer to those questions is a little bit of a chicken-and-egg situation.  You provide Paginator's get_page method with an index from a previous call, usually the next_index or prev_index value.  Usually, you'll get those values the first time by calling get_page with an index of None.  That will tell it to get the very first page of results, and then you will have access to the prev_index, next_index and curr_index values that can be fed back in to it.  The Paginator always looks for indexes relative to what is passed in, so if the requested index doesn't exist -- because it was deleted between calls -- it'll find the next one in the order.
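
To see how those index values chain from one call to the next, here is a minimal in-memory sketch of the ascending case. It is not the Paginator itself: this get_page is a hypothetical free function that works on a plain sorted list of index values in place of a Datastore query, and it returns the previous and next indexes as a tuple rather than as attributes on the list.

```python
def get_page(indexes, start_index, page_size):
    """Return (page, prev_index, next_index) for a sorted list of indexes.

    indexes stands in for the entities' index values; gaps are allowed.
    """
    if start_index is None:
        candidates = list(indexes)
    else:
        candidates = [i for i in indexes if i >= start_index]
    # Fetch one extra item to learn where the next page starts.
    window = candidates[:page_size + 1]
    next_index = window[-1] if len(window) > page_size else None
    page = window[:page_size]
    # Walk backwards from start_index to find the start of the previous page.
    if start_index is None:
        earlier = []
    else:
        earlier = [i for i in reversed(indexes) if i < start_index]
    previous = earlier[:page_size]
    prev_index = previous[-1] if previous else None
    return page, prev_index, next_index


indexes = [0, 1, 2, 4, 5, 7, 8]  # 3 and 6 were deleted
first, prev_index, next_index = get_page(indexes, None, 3)
print(first, prev_index, next_index)   # first page; there is no previous page
second, prev_index, next_index = get_page(indexes, next_index, 3)
print(second, prev_index, next_index)  # second page, reached via next_index
# Asking for the deleted index 3 lands on the next one in the order.
print(get_page(indexes, 3, 3)[0])
```

The extra item fetched past the end of the page is what supplies next_index, and the backwards walk supplies prev_index, so the caller never has to compute a page number.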

So, that should give you a pretty good idea of how the Paginator works.  Please post any questions or suggestions as a comment, and I'll see them and address them as best as I am able.  Here, then, is the actual Paginator code:

#Copyright 2008 Adam A. Crossland
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
#http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.

from google.appengine.ext import db
import copy

class PaginatedList(list):
    """An extended normal Python list with three additional properties used for
    pagination purposes:
    prev_index - the starting index of the previous page of entities;
    next_index - the starting index of the next page of entities;
    curr_index - the starting index of the current page of entities
    """
    def __init__(self, *args, **kw):
        list.__init__(self, *args, **kw)
        self.prev_index = None
        "The starting index of the previous page of entities"
        self.next_index = None
        "The starting index of the next page of entities"
        self.curr_index = None
        "The starting index of the current page of entities"
   
class Paginator:
    "A class that supports pagination of AppEngine Datastore entities."
    def __init__(self, page_size, index_field):
        self.page_size = page_size
        "The number of entities that constitute a 'page'"
        self.index_field = index_field
        "The name of the field in the Model that is an orderable index"

    def get_page(self, query=None, start_index=None, ascending=True):
        """Takes a normal AppEngine Query and returns paginated results.
        query - a Datastore Query object.  It must not have an order clause.
        start_index - the index of the first record in the desired page.  If the
            index is not known, or the first page is needed, None should be
            passed.
        ascending - True if the index column is to be ordered ascending; False
            should be passed for descending ordering.
        """
       
        fetched = None
       
        # I need to make a copy of the query, as once I use it to get the main
        # collection of desired records, I will not be able to re-use it to get
        # the next or prev collection.
        query_copy = copy.deepcopy(query)
       
        if ascending:
            # First, I will grab the requested page of entities and determine
            # the index for the next page
            filter_on = self.index_field + " >="
            fetched = PaginatedList(query.filter(filter_on, start_index).order(self.index_field).fetch(self.page_size + 1))
            if len(fetched) > 0:
                # The first row that we get back is the real index.
                fetched.curr_index = getattr(fetched[0], self.index_field)
            if len(fetched) > self.page_size:
                # We fetched one more record than we actually need.  That is the
                # index of the first record of the next page.  Record it, and
                # delete the extra record from our collection.
                fetched.next_index = getattr(fetched[-1], self.index_field)
                del(fetched[-1])
            # Now, I will try to determine the index of the previous page
            filter_on = self.index_field + " <"
            previous_page = query_copy.filter(filter_on, start_index).order("-" + self.index_field).fetch(self.page_size)
            if len(previous_page) > 0:
                # The last record is the first record in the previous page.
                # Record it.
                fetched.prev_index = getattr(previous_page[-1], self.index_field)
        else:
            # Follow the same logical pattern as for ascending, but reverse
            # the polarity of the neutron flow
            filter_on = self.index_field + " <="
            fetched = PaginatedList(query.filter(filter_on, start_index).order("-" + self.index_field).fetch(self.page_size + 1))
            if len(fetched) > 0:
                # The first row that we get back is the real index.
                fetched.curr_index = getattr(fetched[0], self.index_field)
            if len(fetched) > self.page_size:
                # We fetched one more record than we actually need.  That is the
                # index of the first record of the next page.  Record it, and
                # delete the extra record from our collection.
                fetched.next_index = getattr(fetched[-1], self.index_field)
                del(fetched[-1])
            # Determine index of previous page
            filter_on = self.index_field + " >"
            previous_page = query_copy.filter(filter_on, start_index).order(self.index_field).fetch(self.page_size)
            if len(previous_page) > 0:
                # The last record is the first record in the previous page.
                # Record it.
                fetched.prev_index = getattr(previous_page[-1], self.index_field)
               
        return fetched