Python generators – saving time and memory

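Python version 2.2 introduced support for generator functions (in 2.2 they had to be enabled with from __future__ import generators; from 2.3 on they are always available). A generator function returns something that behaves a bit like a list: an object that produces its items one at a time. You can think of a generator as a function that gets to return more than once; each time it returns, it produces the next item in the sequence. Here is a simple example of a generator: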

def mygen() :
  print "calling generator"
  for x in [1, 2, 3, 4] :
    print "yielding next value"
    yield x

for item in mygen() :
  print item

This example prints the following:

calling generator
yielding next value
1
yielding next value
2
yielding next value
3
yielding next value
4

Instead of return, we used the yield statement inside our generator. yield returns an item from the generator, but marks that point in the code so that processing continues from there when we go back to the generator to fetch the next item in the pseudo-list. Notice that the generator is only called once, and that its body does not start running until the for loop asks for the first item; from then on, the yield points are interleaved with the print statements in the calling code. Each time the generator needs to produce a new value, it picks up from the previous yield point. When the generator reaches the end of the function, no more values are produced. A bare return statement ends the generator early in the same way, but in Python 2 a generator cannot return a value with return.
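To see the pause-and-resume behavior more directly, a generator can be stepped by hand with the built-in next() function (available in Python 2.6 and later). The following is only a minimal sketch; countdown and the events list are names invented for this illustration:

```python
def countdown(n, events):
    # Nothing in this body runs until the first value is requested.
    events.append("started")
    while n > 0:
        yield n                    # pause here and hand n to the caller
        events.append("resumed")   # runs when the next value is requested
        n -= 1
    # Falling off the end (or a bare return) finishes the generator.

events = []
gen = countdown(2, events)
events_before = list(events)       # still empty: calling the function ran none of its body

first = next(gen)                  # runs the body up to the first yield
second = next(gen)                 # resumes after the yield, loops, yields again

try:
    next(gen)                      # no values left, so StopIteration is raised
    finished = False
except StopIteration:
    finished = True
```

A for loop does exactly this stepping for you and stops quietly when StopIteration is raised.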

Let’s look at some code where the use of generators might help us. The following code instantiates a list of objects and then creates HTML to display them. We’ll assume the existence of a database API to execute queries and retrieve results:

def get_objects() :
  result = []
  query = db_execute("...")
  row = query.fetchrow()
  while row is not None :
    result.append(MyObject(row))  # Build object from DB, append to result
    row = query.fetchrow()
  return result
. . .
for obj in get_objects() :   # "obj" avoids shadowing the built-in name "object"
  print obj.getHTML()

This code creates the entire list of objects before it can print any of them. There are two problems with this. First, no output appears until every object has been created, so the user’s browser has no partial results to display in the meantime. Second, all of the objects must be held in memory at the same time; if there are many objects, this can cause significant memory overhead.

Generators allow us to attack both of these problems. Because a generator produces items one at a time, on demand, we don’t have to wait to construct all of the objects in the list before we use the first one. And once we are done using an object, Python is free to reclaim it (in CPython, typically as soon as the loop moves on to the next item). Here’s how we might rewrite the above code to use generators:

def get_objects() :
  query = db_execute("...")
  row = query.fetchrow()
  while row is not None :
    yield MyObject(row)    # Build object from DB, yield to caller
    row = query.fetchrow()
. . .
for obj in get_objects() :   # "obj" avoids shadowing the built-in name "object"
  print obj.getHTML()

With a small change we have reduced our code’s memory footprint: the objects no longer need to be held in memory all at the same time. And we now produce the output for each object as we create it, without waiting for all the objects to be instantiated first. This is a significant improvement!
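The difference in ordering can be observed with a small self-contained sketch. FakeQuery, build, and the log list are hypothetical stand-ins invented here for the database API and MyObject assumed above; they are not part of any real library:

```python
class FakeQuery(object):
    """Hypothetical stand-in for the database query object assumed above."""
    def __init__(self, rows):
        self._rows = list(rows)

    def fetchrow(self):
        # Return the next row, or None when the rows run out.
        if self._rows:
            return self._rows.pop(0)
        return None

log = []   # records the order in which things happen

def build(row):
    # Stand-in for MyObject(row); records when each object is built.
    log.append("build %s" % row)
    return row

def get_objects_eager(query):
    result = []
    row = query.fetchrow()
    while row is not None:
        result.append(build(row))
        row = query.fetchrow()
    return result

def get_objects_lazy(query):
    row = query.fetchrow()
    while row is not None:
        yield build(row)
        row = query.fetchrow()

for obj in get_objects_eager(FakeQuery(["a", "b"])):
    log.append("use %s" % obj)
eager_log = list(log)   # ["build a", "build b", "use a", "use b"]
del log[:]

for obj in get_objects_lazy(FakeQuery(["a", "b"])):
    log.append("use %s" % obj)
lazy_log = list(log)    # ["build a", "use a", "build b", "use b"]
```

In the eager version every object is built before the first one is used; in the generator version building and using alternate.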

We can also build a chain of generators. Let’s assume that we need to modify the code above to optionally display only those objects that have an “active” flag set. Ordinarily, if we use Python 2’s built-in filter function to accomplish this, it builds the entire filtered list all at once, which means pulling every object out of the generator first. But if we use a generator to perform the filtering, then we can still keep our optimizations since we are only ever creating and filtering one object at a time. Here’s an example:

my_objs = get_objects()         # This returns a generator object

if display_active_only :
  def active_filter(objects) :  # A filtering generator
    for obj in objects :
      if obj.active :
        yield obj

  my_objs = active_filter(my_objs)

for obj in my_objs :
  print obj.getHTML()
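In Python 2.4 and later, a simple filter like this can also be written as a generator expression, which is just as lazy as the hand-written filtering generator. This is a minimal sketch; Item, make_items, and the created list are hypothetical stand-ins for MyObject and get_objects():

```python
class Item(object):
    # Hypothetical stand-in for MyObject
    def __init__(self, name, active):
        self.name = name
        self.active = active

created = []   # records which items have been built so far

def make_items():
    # Stand-in for get_objects(): yields items one at a time
    for name, active in [("a", True), ("b", False), ("c", True)]:
        created.append(name)
        yield Item(name, active)

# A generator expression: filters lazily, one item at a time.
active_items = (item for item in make_items() if item.active)

created_before = list(created)        # still empty: nothing has been pulled yet
first = next(active_items)            # builds "a" only
created_after_first = list(created)
rest = [item.name for item in active_items]   # builds "b" (skipped) and "c"
```

The named generator function is still the better fit when the filtering logic is more than a single condition.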

A generator function doesn’t produce a real list; instead, it produces a generator object, which is a kind of iterator: it hands out its items one at a time and supports neither len() nor indexing. You can’t write either of the following statements for a generator or iterator:

print len(get_objects())
print get_objects()[3]
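Both statements fail with a TypeError. A small sketch that shows this, using a hypothetical three_items generator in place of get_objects():

```python
def three_items():
    # Hypothetical stand-in for get_objects()
    for value in ["a", "b", "c"]:
        yield value

errors = []

try:
    len(three_items())       # a generator has no length
except TypeError:
    errors.append("len")

try:
    three_items()[1]         # a generator cannot be indexed
except TypeError:
    errors.append("index")
```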

The for statement knows how to traverse a generator, and that will probably be sufficient for your needs. Perhaps you can get the count of objects by other means, such as executing an SQL COUNT query. If you absolutely need to access a generator as a list, you can coerce it to a list as follows:

objlist = list(get_objects())

But be aware that this gives up all of the advantages that we’ve discussed here, since every object returned by the generator is created up front and stored in the list. If you find yourself needing to do this, consider restructuring your code so that it consumes the items one at a time. Or perhaps generators aren’t the right solution for your particular problem.
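When only one item or a count is needed, the standard library can sometimes stand in for the list. The sketch below uses itertools.islice to fetch a single position and a generator expression to count; numbers and the produced list are hypothetical names standing in for get_objects():

```python
from itertools import islice

produced = []   # records every value the generator has produced

def numbers():
    # Hypothetical stand-in for get_objects()
    for n in range(10):
        produced.append(n)
        yield n

# Fetch the item at position 3, as get_objects()[3] would on a real list.
fourth = next(islice(numbers(), 3, 4))
produced_for_fourth = list(produced)   # only items 0 through 3 were produced

# Count the items without storing them.
count = sum(1 for _ in numbers())
```

Note that both approaches still run the generator up to the point they need, and a generator can only be traversed once, so each call to numbers() here starts a fresh pass.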

Written by Scott Moonen

February 1, 2008 at 10:22 am