Monday, November 29, 2010

Persistence Patterns

Persistence Patterns
Patterns in Practice
Persistence Patterns
Jeremy Miller
Data access is a popular subject among developers. No doubt you've heard plenty of opinions on specific data access technologies and persistence frameworks, but what's the best way to consume these tools in your project? What criteria should you use to select the right tool for your project? What do you need to know conceptually about these tools before you use? What if you've got too much time on your hands and want to write your own persistence tool—what do you need to know?
Unsurprisingly, the answer to all of these questions is to examine the underlying design patterns of persistence.
Domain Models
When you think about how you are going to structure and represent the business logic in your system, you have a couple major choices. In this article, I'm largely assuming that you have chosen the domain model approach to organizing business logic into entity objects.
From the formal description, a domain model is an object model of the domain that incorporates both behavior and data.
For example, my current project involves Customer Relationship Management. (CRM) We have entity objects for Case and User and Solution that contain both data and implement business rules involving that data. A domain model can range from an anemic model that is simply a set of data structures to a very rich model that jealously guards the raw data behind a narrow interface (hardcore Domain-Driven Development). Where your domain model falls in this range is largely a matter of how complicated the business logic in your system really is and how prevalent reporting or data entry are in your system requirements.
An adequate discussion of the Domain Model pattern is beyond the scope of this article. I would strongly recommend reading Chapter 2 of Martin Fowler's Patterns of Enterprise Application Architecture book for a great discussion of the major design patterns for organizing business logic.
Before getting started, let's review the two main ways to perceive the role of the database and data access code in your system:
  • The database is the keystone of the application and a business asset. The data access code and even the application or service code are simply mechanisms to connect the database with the outside world.
  • The business objects in the middle tier and the user interface or service layer are the application, and the database is a means to reliably persist the state of the business objects between sessions.
Personally, I feel like the first, datacentric viewpoint is well understood in the .NET community, so I'd like to focus on the second viewpoint. In the second viewpoint, you're typically working with entity objects in the middle tier. One way or another, you're probably doing Object/Relational Mapping (O/RM) to map data from the business entities to the database tables and vice versa. You might be doing it by hand, but more likely with tools of some sort. Those tools are great and can potentially save a lot of development time, but there are some issues you should be aware of, and it's always nice to understand how a tool works beneath the covers. Hopefully, studying the patterns in this article will help on both accounts.
I'm a big proponent of Agile practices like Test Driven Development, Behavior Driven Development, Lean Programming, and Continuous Design. I say this just to be clear that I have a very specific bias in regards to the patterns I'll be discussing in this article, and it will probably show.

Mapping Objects to Databases
I've decided to model my system in the middle tier with entity objects that have identity, data, and related behavior. My business logic will be implemented either in these entity objects or in domain services that use these entity objects. Great, but how do you seamlessly move data back and forth between the database and the entity objects?
In the case of a relational database, you need to move the fields and properties of our objects to tables and fields in the database. You can write this code completely by hand, writing out separate INSERT, UPDATE, SELECT, and DELETE SQL statements, but you'll quickly realize that you're repeating yourself in the code quite a bit. Namely, you're repeatedly specifying that the data in an object property or field should be stored into a particular column in a database table.
This is where an Object/Relational Mapper (O/RM) steps in. When you use an O/RM, you simply create a mapping of the object properties to the database table and let the O/RM tool use that metadata to figure out what the SQL statements should be and how to move data from the object to the SQL statements.
Let's get concrete and look at a very basic sample. My current project has an Address class that looks like this:
public class Address 
  public long Id { get; set; }
  public string Address1 { get; set; }
  public string Address2 { get; set; }
  public string City { get; set; }
  public string StateOrProvince { get; set; }
  public string Country { get; set; }
  public string PostalCode { get; set; }
  public string TimeZone { get; set; }
When I set up the mapping for the Address class, I need to specify which table the Address class maps to, how the Address objects will be identified (the primary key), and which properties map to which database tables.
For this example, I'm using the Fluent NHibernate tool for mapping Address.
I'm purposely doing the mapping in a longhand manner to show all the details. (In real usage, I employ convention over configuration to eliminate much of the repetitiveness.) Here's the code:
public class AddressMap : ClassMap<Address> 
  public AddressMap() 
    UseIdentityForKey(x => x.Id, "id");

    Map(x => x.Address1).TheColumnNameIs("address1");
    Map(x => x.Address2).TheColumnNameIs("address2");
    Map(x => x.City).TheColumnNameIs("city");
    Map(x => x.StateOrProvince).TheColumnNameIs("province_code");
    Map(x => x.Country).TheColumnNameIs("country_name");
    Map(x => x.PostalCode).TheColumnNameIs("postal_code");
Now that you've made a mapping of the object model to the database model, something has to actually execute the mapping, and you roughly have two choices: Active Record or Data Mapper.

Active Record
When choosing a persistence strategy, the first decision you need to make is where to place the responsibility for carrying out the mapping. You have two very different options: you can either make each entity class itself responsible for the mapping, or you can use a completely separate class to do the mapping to the database.
The first option is known as the Active Record pattern: an object that wraps a row in a database table or view, encapsulates the database access, and adds domain logic on that data. An Active Record approach puts persistence methods directly onto the entity object. In this case, the Address class would probably have methods like Save, Update, and Delete, as well as a static Load method that queries the database for an Address object.
The typical usage of the Active Record pattern is to essentially make a strongly typed class wrapper around a single row in a database table. As such, the Active Record classes are generally an exact mirror of the database structure (this will vary between tools). Many Active Record implementations will generate the entity objects directly from the database structure.
The most famous Active Record implementation is the ActiveRecord tool that comes as part of the Ruby on Rails Web development framework (which is usable in .NET programming with IronRuby; see "Getting Started With IronRuby And RSpec, Part 1"). If I were using ActiveRecord, I would first create a table named addresses in the database. Then, if all I care about is having access to the address data and methods for finding, saving, and inserting address data, the entire Address class would look exactly like this Ruby class:
class Address < ActiveRecord::Base
The Ruby implementation is using metaprogramming to dynamically create fields and query methods that match the database table named addresses (pluralized form of the class name Address) at runtime.
With a few exceptions, .NET implementations of the Active Record pattern generally work by using code generation to create .NET classes that map directly to database tables. The benefit of this approach is that it is very easy to keep the database and object model synchronized.
Lean Programming
Lean Programming teaches you to eliminate wasted effort in development projects by favoring "pull" design over "push" design. This means infrastructure concerns like persistence should only be designed and built to satisfy the needs of business requirements (pulled on demand) instead of building the data access layer code that you think the application will need later (pushed).
In my experience, this means you should develop the system incrementally by developing vertically—by building one feature at a time instead of building the system one horizontal layer at a time. By working this way you can be sure that infrastructure development is no more than what you absolutely need by tying all infrastructure code to a feature being built into the system. When you work horizontally by building the data access layer or the database before you write the user interface, service, or business logic layer, you risk the following:
  • Not getting adequate feedback early from project stakeholders because there is no working functionality to demonstrate
  • Writing unnecessary infrastructure code for features that might not actually get built
  • Building the wrong data access code because you don't understand the business requirement early on, or the requirements change before the other layers are created
Pull design also means that your choice of data access strategy should be determined by the needs of the application. Given a choice, I would choose an O/RM for a system where I was modeling the business logic in a domain model. For a reporting application, I would bypass persistence tools altogether in favor of just using SQL and datasets. For a system that is mostly data entry, I might choose an Active Record tool. The point is that the data access and persistence is shaped by the needs of its consumers.

Data Mapper
Nice as the Active Record pattern may be, it is often valuable to have an object structure that varies from the database model. You may want to map multiple classes to the same database table. You might have a legacy database schema that doesn't fit the shape of the objects that you would want to express in some business logic (a common occurrence for me on my last several projects).
This is where I would choose the Data Mapper pattern: a layer of mapper objects that move data between objects and a database while keeping them independent of each other and the mapper classes themselves. The Data Mapper pattern strips most of the responsibility of persistence from the entity objects in favor of classes external to the entities. With a Data Mapper pattern, you access, query, and save entity objects with some kind of repository (discussed later in this article).
How do you choose between the two? My personal bias is very strongly towards the Data Mapper solution. That aside, Active Record is best for systems with simpler domain logic, CRUD-intensive applications (create, read, update and delete, that is), and situations where the domain model doesn't need to diverge much from the database structure. Active Record may be more comfortable for many .NET development teams because it implies a datacentric way of working that is more common for .NET teams and, frankly, much better supported by .NET itself. I would characterize most persistence tools in the .NET space as Active Record tools.
On the other hand, Data Mapper is more appropriate for systems with complex domain logic where the shape of the domain model will diverge considerably from the database model. Data Mapper also decouples your domain model classes from the persistence store. That might be important for cases where you need to reuse the domain model with different database engines, schemas, or even different storage mechanisms altogether.
Most important to me as an Agile practitioner, the Data Mapper approach allows me to design the object model independently of the database with Test Driven Development more efficiently than an Active Record solution could. This is important to teams that want to take an incremental approach to design because it allows a team to work through the design in the object model first where refactoring tools and unit testing techniques are generally more effective, then create the database model later when the object model is stable. The Data Mapper pattern is also more applicable to the Domain Driven Design architectural approach that is gaining in popularity in the .NET community.

Using a Repository
When I was growing up as a developer in the Windows DNA days, I lived in mortal fear of forgetting to close an ADO Connection object in my code. On my first big enterprise project, I coded directly against ADO. Every time I needed to request or save data to the database, I had to do the following:
  1. Find the connection string from some sort of configuration
  2. Open a new connection to the database
  3. Create a command object or a recordset object
  4. Execute the SQL statement or stored procedure
  5. Close the connection to release the database resources
  6. And oh yeah, put some adequate error handling around the data access code
The first problem was the absurd amount of repetitive code I was writing. The second problem was that I had to write the code correctly every single time, because forgetting to close a single connection could, and unfortunately did, drive a mission-critical system offline under a heavy load. Nobody wanted to be the guy whose code collapsed out of poor database connection hygiene, but it still happened too frequently.
In my next project, my coworker and I got smarter. We implemented a single class that would do all the setup, teardown, and error handling code of the ADO Connection object. Any other class in the system could interact with the database by using this central class to invoke stored procedures. We wrote much less code and, better yet, we weren't plagued by problems arising from forgetting to close database connections because we had to get that connection management code right only one time.
What my team did was to create a crude implementation of the Repository pattern that mediates between the domain and data mapping layers using a collection-like interface for accessing domain objects.
Basically, the Repository pattern just means putting a façade over your persistence system so that you can shield the rest of your application code from having to know how persistence works. My current project is using a Repository with a public interface that looks like the one in Figure 1.
public interface IRepository 
  // Find an entity by its primary key
  // We assume and enforce that every Entity
  // is identified by an "Id" property of 
  // type long
  T Find<T>(long id) where T : Entity;

  // Query for a specific type of Entity
  // with Linq expressions.  More on this later
  IQueryable<T> Query<T>();
  IQueryable<T> Query<T>(Expression<Func<T, bool>> where);

  // Basic operations on an Entity
  void Delete(object target);
  void Save(object target);
  void Insert(object target);

  T[] GetAll<T>();
When some class in the system needs to access an entity object, it can simply use IRepository to fetch that entity by its ID, or query for a list of entity objects with a LINQ expression.
When I use the concrete class of IRepository, which uses the NHibernate library for O/RM, I'm pulling a connection string from memory, loading the mapping definitions from an assembly, building the NHibernate SessionFactory (once and only once because it's a big performance hit), and wrapping the low-level ISession interface from NHibernate. With some help from the Repository class, NHibernate manages database connection lifecycle issues.
Whew. That's a lot of stuff going on behind the scenes. It's a good thing that I've swept all of the direct interaction with NHibernate behind the IRepository interface. I don't have to know all that bootstrapping NHibernate stuff just to load, save, and query objects. Even better, since every class depends on the abstract IRepository interface for data access and persistence, I can slide in an InMemoryRepository implementation of IRepository that uses LINQ to Objects internally to stub the database during testing.

Identity Map
Let's look at a common scenario. Inside a single logical transaction in some sort of shipping system, you have two completely different classes that work independently, but both classes will need to retrieve the same Customer entity during the transaction. Ideally, you want only a single Customer object inside a single transaction for each logical Customer so that each object is working off of consistent data.
In persistence tooling, preventing duplicate logical references is the job of the Identity Map pattern. As stated by Martin Fowler, an Identity Map ensures that each object gets loaded only once by keeping every loaded object in a map and looks up objects using the map when referring to them.
For the Customer class, you might build a naïve implementation of Identity Map like the one in Figure 2. I'm purposely leaving out thread locking here just to simplify the code. A real implementation would require adequate thread safety measures.
public class CustomerRepository 
  public IDictionary<long, Customer> _customers = 
    new Dictionary<long, Customer>();

  public Customer FindCustomer(long id) 
    if (_customers.ContainsKey(id)) 
      return _customers[id];

    var customer = findFromDatabase(id);
    _customers.Add(id, customer);

    return customer;

  private Customer findFromDatabase(long id) 
    throw new System.NotImplementedException();
In this example, a Customer object is identified by its ID. When you request an instance of Customer by ID, the CustomerRepository first checks an internal dictionary to see whether it has that particular Customer. If so, it returns the existing Customer object. Otherwise, CustomerRepository will fetch the data from the database, build a new Customer, store that Customer object in its dictionary for later requests, and return the new Customer object.
Fortunately, you generally won't write this code by hand because any mature persistence tool should include this feature. You do need to be aware that this is happening behind the scenes and scope your persistence support objects accordingly. Many teams will use the lifecycle management feature of an Inversion of Control tool (StructureMap, Windsor, Ninject, and others) to ensure that all classes in a single HTTP request or thread are using the same underlying Identity Map. The Unit of Work pattern is another way to manage a single Identity Map across multiple classes in the same logical transaction.
Just to illustrate this pattern farther, Figure 3 shows an example of how an Identity Map works. The code is written against the architecture of my current project. The instances of the IRepository interface shown in the code below wrap a single NHibernate ISession, which in turn implements the Identity Map pattern. When I run this test the output is this:
1 passed, 0 failed, 0 skipped, took 5.86 seconds.
public void try_out_the_identity_map() 
  // All I'm doing here is getting a fully formed "Repository"
  // from an IoC container and letting an IoC tool bootstrap 
  // NHibernate offstage.
  IRepository repository = ObjectFactory.GetInstance<IRepository>();

  // Find the Address object where Id == 1
  var address1 = repository.Find<Address>(1);

  // Find the Address object where Id == 1 from the same Repository
  var address2 = repository.Find<Address>(1);

  // Requesting the same identified Address object (Id == 1) inside the 
  // same Repository / Identity Map should return the exact same
  // object

  // Now, let's create a completely new Repository that has a 
  // totally different Identity Map
  IRepository secondRepository = ObjectFactory.GetInstance<IRepository>();

  // Nothing up my sleeve...

  var addressFromSecondRepository = secondRepository.Find<Address>(1);

  // and this is a completely different Address object, even though
  // it's loaded from the same database with the same Id

Lazy And Eager Loading
One of the best things about using a persistence tool is the ability to load a root object (Invoice, perhaps), then navigate directly to its children (InvoiceLineItem) and related objects just by using properties of the parent class. However, sooner or later you're going to have to care about the performance of your application. Fetching an entire object graph when you may need only the top-level object most of the time or can forgo parts of the object graph isn't efficient.
That's OK. In that case, you can use the Lazy Loading pattern in which you defer initializing the object until just before it's needed.
Let's put this in more concrete terms. Say that you have a class named Customer that references an Address object:
public class Customer : DomainEntity 
  // The "virtual" modifier is important.  Without it,
  // Lazy Loading can't work
  public virtual Address HomeAddress { get; set; }
In most use cases involving the Customer object, the code never needs the Customer.HomeAddress property. In that case, you could set up the database mapping to make the Customer.HomeAddress property lazy loaded like in this Fluent NHibernate mapping:
public class CustomerMap : DomainMap<Customer> 
  public CustomerMap() 
    // "References" sets up a Many to One
    // relationship from Customer to Address
    References(x => x.HomeAddress)
      .LazyLoad() // This marks the property as "Lazy Loaded"
With Lazy Loading turned on, the Customer object is fetched without the Address data. However, as soon as any caller tries to access the Customer.HomeAddress property for the first time, that data will be transparently loaded.
Do note the virtual modifier on the Customer.HomeAddress property. Not every persistence tool does this, but NHibernate implements lazy loaded properties by creating a dynamic subclass of Customer that overrides the HomeAddress property to make it lazy loaded. The HomeAddress property needs to be marked as virtual in order to allow a subclass to override the property.
Of course, there are other times when you request an object and you know that you will most likely need its children at the same time. In this case, you will probably opt for Eager Loading and have the children data loaded at the same time as the parent. Many persistence tools will have some sort of ability to optimize Eager Loading scenarios to fetch a hierarchy of data in a single database round-trip. If you need the Customer.HomeAddress data most of the time you use a Customer object, then you would be better off doing Eager Loading to get the Customer and Address data at the same time.
At this point, I should repeat the old maxim that the only way to reliably tune an application for performance is to use an empirical measurement of performance with a profiler.

Virtual Proxy Pattern
Lazy Loading is often implemented by using a virtual proxy object that looks just like the real object to be loaded later. Let's say that the domain model includes a class named CustomerRepresentative that references a list of Customer objects. Part of that class is shown in Figure 4.
public class CustomerRepresentative 
  // I can create a CustomerRepresentative directly
  // and use it with a normal List
  public CustomerRepresentative() 
    Customers = new List<Customer>();

  // Or I can pass an IList into it.
  public CustomerRepresentative(IList<Customer> customers) 
    Customers = customers;

  // It's not best practice to expose a collection
  // like I'm doing here, but it makes the sample
  // code simpler ;-)
  public IList<Customer> Customers { get; set; }
There are many times when the system uses an instance of CustomerRepresentative without needing the list of Customers. In that case, you could simply construct a CustomerRepresentative with a virtual proxy object that looks like an IList<Customer> object and use that virtual proxy class without making any changes to CustomerRepresentative whatsoever. That virtual proxy class might look something like Figure 5. A CustomerRepresentative object could then be created with Lazy Loading as shown in Figure 6.
public class VirtualProxyList<T> : IList<T> 
  private readonly Func<IList<T>> _fetcher;
  private IList<T> _innerList;
  private readonly object _locker = new object();

  // One way or another, VirtualProxyList needs to 
  // find the real list.  Let's just cheat and say
  // that something else will pass it a closure
  // that can find the real List
  public VirtualProxyList(Func<IList<T>> fetcher) 
    _fetcher = fetcher;

  // The first call to 
  private IList<T> inner 
      if (_innerList == null) 
        lock (_locker) 
          if (_innerList == null) 
            _innerList = _fetcher();

      return _innerList;

  IEnumerator IEnumerable.GetEnumerator() 
    return inner.GetEnumerator();

  public IEnumerator<T> GetEnumerator() 
    return inner.GetEnumerator();

  public void Add(T item) 

  // and the rest of the IList<T> implementation

public class CustomerRepresentativeRepository 
  private readonly ICustomerRepository _customers;

  public CustomerRepresentativeRepository(
      ICustomerRepository customers) 
    _customers = customers;

  // This method will "find" a CustomerRepresentative, and
  // set up the Virtual Proxy for the Customers
  public CustomerRepresentative Find(long id) 
    var representative = findRepresentative(id);
    representative.Customers = 
      new VirtualProxyList<Customer>(() => 

    return representative;
Like most of the patterns in this article, the virtual proxy isn't something that you're likely to write by hand, but be aware that it's there in the background of your persistence tool.

Taking the Next Step
When I began programming with Windows DNA technologies, I probably spent over half of my time working with raw ADO code. Today, persistence coding and infrastructure is a very small percentage of my team's time. So what changed over the years? We use persistence tools and these design patterns to eliminate so much of the repetitive coding we used to do.
Most of these patterns were taken from Martin Fowler's book, Patterns of Enterprise Application Architecture. I highly recommend reading this book if you have anything to do with writing enterprise applications. Due to length limitations, I was unable to cover some other important patterns like Unit of Work, Specifications, and Persistence Ignorance. In addition, there are quite a number of ways to use the Repository pattern and design considerations (a single generic repository versus specific, "narrow" repository classes, whether a repository should even expose "Save" methods, etc.) I would urge you to research these topics as well, and I may write a follow-up article to continue this discussion.
Lastly, I'd like to say that the .NET ecosystem is richer than just Entity Framework and LINQ to SQL. I'm happy with NHibernate as a Data Mapper that knows relatively little about persistence. SubSonic is a popular Active Record implementation for .NET programming. iBatis.Net is fantastic for existing databases or occasions when you want full control over authoring the SQL statements. LLBLGen Pro is a very mature tool with unique querying abilities. Many of the other tools also have the ability to use LINQ queries.

Send your questions and comments to

Jeremy Miller, a Microsoft MVP for C#, is also the author of the open-source StructureMap ( tool for Dependency Injection with .NET and the forthcoming StoryTeller ( tool for supercharged FIT testing in .NET. Visit his blog, "The Shade Tree Developer," part of the CodeBetter site.

No comments:

Post a Comment