在Python中建構緩存

發佈於2024-11-03

Building a cache in Python

Caching. Useful stuff. If you're not familiar with it, it's a way to keep data around in memory (or on disk) for fast retrieval. Think of querying a database for some information. Rather than do that every time an application asks for data, we can do it once and keep the result in a cache. Subsequent calls for the data will return the copy from cache instead of making a database query. In theory this improves the performance of your application.

Let's build a simple cache for use in Python programs.

Cache API

I'll start by creating a new module called simplecache, and defining a class Cache in it. I'll not implement anything yet, I just want to define the API my cache will use.

class Cache:
    """ A simple caching class that works with an in-memory or file based
    cache. """


    def __init__(self, filename=None):
        """ Construct a new in-memory or file based cache."""
        pass


    def update(self, key, item, ttl=60):
        """ Add or update an item in the cache using the supplied key. Optional
        ttl specifies how many seconds the item will live in the cache for. """
        pass


    def get(self, key):
        """ Get an item from the cache using the specified key. """
        pass


    def remove(self, key):
        """ Remove an item from the cache using the specified key. """
        pass


    def purge(self, all=False):
        """ Remove expired items from the cache, or all items if flag is set. """
        pass


    def close(self):
        """ Close the underlying connection used by the cache. """
        pass

So far so good. We can tell the cache to create a new in-memory or file based cache via the __init__ method. We can add items to the cache using update - which will overwrite the item if it already exists. We can get an item using a key get. Finally we can remove items by key remove, or empty the cache of expired items using purge (which optionally allows purging all items).

Where to cache the data?

So where is this class going to cache data? Sqlite ships with the Python standard library, and is ideal for this sort of thing. In fact one of the suggested use cases for Sqlite is caching. It allows us to create an in-memory or file based SQL database, which is both our use cases covered. Let's design a SQL table that can hold our cached data.

CREATE TABLE 'Cache' (
    'Key' TEXT NOT NULL UNIQUE,
    'Item' BLOB NOT NULL,
    'CreatedOn' TEXT NOT NULL,
    'TimeToLive' TEXT NOT NULL,
    PRIMARY KEY("Key"))
);

To break this down, we have a table called Cache that has four fields. The Key is a string, and will act as a unique primary key on the table. Next up our Item field is a blob of binary data - I'm thinking here that we will serialise objects added to the cache before saving them in the database. The last two fields are used to determine the lifetime of the item in the cache - CreatedOn is the timestamp of when the item was added, and TimeToLive is the length of time we need to hang on to the item for.

Constructing the cache

Let's start by importing the Sqlite library into our module.

import sqlite3

Then we need to turn our attention to the __init__ method. We have two scenarios to support: one where we are given a filename, and one where we are not.

def __init__(self, filename=None):
    """ Construct a new in-memory or file based cache."""
    if filename is None:
        self._connection = sqlite3.connect(":memory:")
    else:
        self._connection = sqlite3.connect(filename)
    self._create_schema()

We can open a connection and keep it around for the lifetime of the class. Therefore we set the self._connection property to hold the connection instance, before calling an internal method _create_schema (more on that in a little while).

Bootstrapping the schema

If we are using an in-memory database, then our schema will not exist yet. However for file based databases that may not be the case. This may be an existing file that has already been setup with our schema. Let's see the code that handles this process.

def _create_schema(self):
    table_name = "Cache"
    cursor = self._connection.cursor()
    result = cursor.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' and name = ?",
        (table_name,))
    cache_exists = result.fetchone() is not None
    if cache_exists:
        return    
    sql = """
        CREATE TABLE 'Cache' (
        'Key' TEXT NOT NULL UNIQUE,
        'Item' BLOB NOT NULL,
        'CreatedOn' TEXT NOT NULL,
        'TimeToLive' TEXT NOT NULL,
        PRIMARY KEY('Key'))
    """
    cursor.execute(sql)
    cursor.close()

First we define our table name and open a cursor to perform some database operations. Next we check for the existence of our table in the master table. If it's there we exit the method, otherwise we execute a CREATE TABLE... statement to build our table in the database.

We can now instantiate our Cache class with either an in-memory or file based database to cache objects in.

Adding to the cache

Let us now turn our attention to adding items to the cache. Recall our Update method from our API will be responsible for this. If a key already exists in the cache, we are going to replace its entry with whatever is supplied.

def update(self, key, item, ttl=60):
    """ Add or update an item in the cache using the supplied key. Optional
    ttl specifies how many seconds the item will live in the cache for. """
    sql = "SELECT Key FROM 'Cache' WHERE Key = ?"
    cursor = self._connection.cursor()
    result = cursor.execute(sql, (key,))
    row = result.fetchone()
    if row is not None:
        sql = "DELETE FROM 'Cache' WHERE Key = ?"
        cursor.execute(sql, (key,))
        connection.commit()
    sql = "INSERT INTO 'Cache' values(?, ?, datetime(), ?)"
    pickled_item = pickle.dumps(item)
    cursor.execute(sql, (key, pickled_item, ttl))
    self._connection.commit()
    cursor.close()

First we get the in-memory database connection. We then test for the existence of the the supplied key value in the cache. If it exists, it is removed from the cache. Next we pickle the supplied item value, turning it into a binary blob. We then insert the key, picked item, and ttl into the cache.

Getting data from the cache

Our behaviour for getting an item from the cache is mostly straight forward. We look for an entry with the specified key and unpickle the associated item. The complication in this process arises when the ttl value has expired. That is, the date of creation plus the ttl are less than the current time. If this is the case, then the item has expired and should be removed from the cache.

There is a philosophical issue with this method. There is a school of thought that says methods should either return a value (read) or perform a mutation of data (write). Here we are potentially doing both. We are deliberately introducing a side effect (deleting an expired item). I think it is OK in this case, but other programmers might argue otherwise.

def get(self, key):
    """ Get an item from the cache using the specified key. """
    sql = "SELECT Item, CreatedOn, TimeToLive FROM 'Cache' WHERE Key = ?"
    cursor = self._connection.cursor()
    result = cursor.execute(sql, (key,))
    row = result.fetchone()
    if row is None:
        return
    item = pickle.loads(row[0])
    expiry_date = datetime.datetime.fromisoformat(row[1])   datetime.timedelta(seconds=int(row[2]))
    now = datetime.datetime.now()
    if expiry_date 




  
  
  Removing and purging items in the cache


Removing an item is simple - in fact we have already done it (twice) in our other two methods. That's an ideal candidate for refactoring (which we will look at later on). For now, we will implement the method directly.


def remove(self, key):
    """ Remove an item from the cache using the specified key. """
    sql = "DELETE FROM 'Cache' WHERE Key = ?"
    cursor = self._connection.cursor()
    cursor.execute(sql, (key,))
    self._connection.commit()
    cursor.close()




Purging items is a little more complex. We have two scenarios to support - purging all items and purging only expired items. Let's see how we can achieve this.


def purge(self, all=False):
    """ Remove expired items from the cache, or all items if flag is set. """
    cursor = self._connection.cursor()
    if all:
        sql = "DELETE FROM 'Cache'"
        cursor.execute(sql)
        self._connection.commit()
    else:
        sql = "SELECT Key, CreatedOn, TimeToLive from 'Cache'"
        for row in cursor.execute(sql):
            expiry_date = datetime.datetime.fromisoformat(row[1])   datetime.timedelta(seconds=int(row[2]))
            now = datetime.datetime.now()
            if expiry_date 



Deleting all is simple enough. We simply run SQL to delete everything in our Cache table. For the expired only items we need to loop through each row, compute the expiry date, and determine if it should be deleted. Again, this latter piece of code has been repeated from one of our other methods (get in this case). Another candidate for refactoring.


  
  
  Refactoring


We have a working cache implementation that satisfies our original API specification. There is however some repeated code, which we can factor out into their own methods. Lets start with the delete logic that is present in get, update, remove, and purge methods. These instances can all be replaced with a call to the following new method.


def _remove_item(self, key, cursor):
    sql = "DELETE FROM 'Cache' WHERE Key = ?"
    cursor.execute(sql, (key,))
    cursor.connection.commit()




We can see this has a big impact on our code. Four other methods are now calling the one common _remove_item method. Next let's take a look at the expiry date checking code.


def _item_has_expired(self, created, ttl):
    expiry_date = datetime.datetime.fromisoformat(created)   datetime.timedelta(seconds=int(ttl))
    now = datetime.datetime.now()
    return expiry_date 



Great. That's two more places we have reduced code repetition in.


  
  
  Thread safety with locks


We are almost done. For this to be a robust class, we need to ensure we are thread safe. Caches can often surface as singleton instances in an application, so thread safety is important. We will achieve this by using locks around the destructive cache operations. This is how our whole class looks with locking added. Note the with blocks around the add and delete operations. These ensure the lock is released even if something goes wrong.


#! /usr/bin/env python3


import datetime
import pickle
import sqlite3
import threading


class Cache:
    """ A simple caching class that works with an in-memory or file based
    cache. """

    _lock = threading.Lock()

    def __init__(self, filename=None):
        """ Construct a new in-memory or file based cache."""
        if filename is None:
            self._connection = sqlite3.connect(":memory:")
        else:
            self._connection = sqlite3.connect(filename)
        self._create_schema()


    def _create_schema(self):
        table_name = "Cache"
        cursor = self._connection.cursor()
        result = cursor.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' and name = ?",
            (table_name,))
        cache_exists = result.fetchone() is not None
        if cache_exists:
            return    
        sql = """
            CREATE TABLE 'Cache' (
            'Key' TEXT NOT NULL UNIQUE,
            'Item' BLOB NOT NULL,
            'CreatedOn' TEXT NOT NULL,
            'TimeToLive' TEXT NOT NULL,
            PRIMARY KEY('Key'))
        """
        cursor.execute(sql)
        cursor.close()


    def update(self, key, item, ttl=60):
        """ Add or update an item in the cache using the supplied key. Optional
        ttl specifies how many seconds the item will live in the cache for. """
        sql = "SELECT Key FROM 'Cache' WHERE Key = ?"
        cursor = self._connection.cursor()
        result = cursor.execute(sql, (key,))
        row = result.fetchone()
        with self.__class__._lock:
            if row is not None:
                self._remove_item(key, cursor)
            sql = "INSERT INTO 'Cache' values(?, ?, datetime(), ?)"
            pickled_item = pickle.dumps(item)
            cursor.execute(sql, (key, pickled_item, ttl))
            self._connection.commit()
        cursor.close()


    def _remove_item(self, key, cursor):
        sql = "DELETE FROM 'Cache' WHERE Key = ?"
        cursor.execute(sql, (key,))
        cursor.connection.commit()


    def get(self, key):
        """ Get an item from the cache using the specified key. """
        sql = "SELECT Item, CreatedOn, TimeToLive FROM 'Cache' WHERE Key = ?"
        cursor = self._connection.cursor()
        result = cursor.execute(sql, (key,))
        row = result.fetchone()
        if row is None:
            return
        item = pickle.loads(row[0])
        if self._item_has_expired(row[1], row[2]):
            with self.__class__._lock:
                self._remove_item(key, cursor)
            item = None
        cursor.close()
        return item


    def _item_has_expired(self, created, ttl):
        expiry_date = datetime.datetime.fromisoformat(created)   datetime.timedelta(seconds=int(ttl))
        now = datetime.datetime.now()
        return expiry_date 




  
  
  Testing the cache


Time to test our cache. We can do this be spinning up an interactive session as follows.


python -i simplecache.py




Now we can new up an in-memory cache and test our methods.


>>> c = Cache()
>>> c.update("key", "some value")
>>> c.update("key2", [1, 2, 3], 300)
>>> c.get("key")
'some vlaue'
>>> c.remove("key")
>>> c.purge()
>>> c.get("key2")
[1, 2, 3]
>>> c.purge(True)
>>> c.get("key2")
>>> c.close()
>>>





  
  
  Exercises for the reader


Write a suite of unit tests for the Cache class. How easy is it to test? Do you need to make any changes to accommodate testing?
Make the time to live a sliding window instead of a fixed time. That is, whenever an item is retrieved from the cache, its time to live value starts over.
Add a method to write the contents of the cache out to the screen.