In-Depth
Use the declarative programming model to enable type-safe data access in ADO.NET.
Technology Toolbox: VB.NET, SQL Server 2005
ADO.NET is extraordinarily useful, but it's not without its flaws. One of its biggest annoyances is that it includes far too much repetitive code. Fortunately, you can eliminate a lot of the repetitive code that goes along with ADO.NET by applying the declarative model to the data-access tier.
One of the reasons for this repetitive code is that ADO.NET gives you only weakly typed access to your data. Whether you read directly from IDataReader implementations or populate DataSet objects, you're probably doing a lot of casting in your data-access tier. It's also long-winded. If your data-access strategy favors the use of stored procedures, you know how much highly repetitive code it takes to call one. LINQ promises to address most of the problems with "old-style" data access, but it's not quite here yet. In the meantime, you have to mitigate ADO.NET's shortcomings using features built into the .NET Framework and the coding language of your choice.
One of the neatest features of modern type systems is reflection -- that is, the capability of code to inspect itself. Although the .NET Framework didn't invent reflection, it has taken it to an innovative new level with the addition of extensible metadata. This is the declarative programming model in which a type publishes metadata by declaring attributes on itself and its members. Another component then consumes these attributes and performs some appropriate action. The power of the declarative model is that once you complete and test the attribute consumer, you can invoke its behavior with no additional executable code. No executable code means fewer bugs and reduced testing expense.
When you use the declarative model, you represent stored procedure calls as instances of .NET classes, and those classes' member fields represent parameters, return values, and resultsets. Consider this stored procedure, which you might find in the Northwind database that ships with Microsoft SQL Server:
CREATE PROCEDURE STDA.CustomersDelete
    @CustomerID NCHAR(5)
AS
    DELETE FROM Customers WHERE CustomerID = @CustomerID
A class representing this stored procedure might look like this:
<StoredProcedure("STDA.CustomersDelete")> _
Public Class CustomersDelete
    <Parameter()> _
    Public CustomerID As String
End Class
To call the CustomersDelete stored procedure, you create an instance of the CustomersDelete class, set its CustomerID field, and pass it to a component I call a "data service," which uses StoredProcedureAttribute and ParameterAttribute to construct and execute a concrete database command against a SQL Server database.
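The two attributes themselves aren't shown in the listings. A minimal sketch of what they might look like follows; the Name, Direction, and Size members are assumptions inferred from how the attributes are used in this article:

Imports System.Data

<AttributeUsage(AttributeTargets.Class)> _
Public Class StoredProcedureAttribute
    Inherits Attribute
    Public ReadOnly Name As String
    Public Sub New(ByVal Name As String)
        Me.Name = Name
    End Sub
End Class

<AttributeUsage(AttributeTargets.Field)> _
Public Class ParameterAttribute
    Inherits Attribute
    ' Input by default; variable-length output parameters must also set Size.
    Public Direction As ParameterDirection = ParameterDirection.Input
    Public Size As Integer
End Class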
Improve Reflection Performance
The first challenge you face when writing the data service is reflection. Reflection is notoriously slow, especially where custom attributes are concerned. According to my crude testing, finding a DisplayNameAttribute declared on a public property (by calling GetCustomAttributes() on its PropertyInfo) is 19 times slower than querying that property's data type (PropertyInfo.PropertyType), which itself is a whopping 70 times slower than assigning it a value. If you cache those attributes in a Dictionary instead, you reduce the attribute retrieval to a hashtable lookup, improving its speed by a factor of (again, by my crude testing) 50. This tremendous speed improvement comes at the cost of increased memory usage, so it's a good idea to estimate the amount of RAM it will require. Suppose the service runs on a server and caches information on 1,000 stored procedures, half of which are fetches (which require more memory, because a fetch caches a row type in addition to its call type), and each cached type consumes 5 KB of memory. That works out to a little more than 7 MB -- quite a reasonable cost for a 50-fold performance increase.
You can cache the attributes in instances of the ReflectionCache class, with one instance holding the attributes for one type. You can force all creations of ReflectionCache objects through a static CreateInstance method, in a manner similar to the Singleton pattern. The logic is simple: If the cache already contains an item that represents the specified type, return it. If not, create a new one, add it to the cache, and return it (see Listing 1). The only peculiarity is that the method checks for an existing instance using Dictionary's ContainsKey method, and it also catches an ArgumentException in case the attempt to add it fails, presumably because an item with the specified key already exists. This is not overly cautious. Although it seems unlikely, it's possible in multithreaded scenarios to call ContainsKey, have it return False, attempt to add the item, and find it already there. You could also handle this situation using a SyncLock block:
SyncLock _Instances
    If Not _Instances.ContainsKey(ForType) Then
        _Instances.Add(ForType, New ReflectionCache(ForType))
    End If
    Return _Instances(ForType)
End SyncLock
Only one thread can execute the code inside the SyncLock block at once. This would prevent duplicate Add calls, but it would also heavily impact performance in multithreaded applications, such as those running on a Web server. While one thread is executing the ReflectionCache constructor -- slow code -- all other calls must wait in line. That's a hefty price to pay to prevent a rare occurrence. It's better to simply let the error happen and ignore it.
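Listing 1 contains the real CreateInstance; a bare-bones sketch of the approach described here -- check the cache, add on a miss, and swallow the rare ArgumentException caused by a racing thread -- might look like this (the constructor body is only a placeholder):

Imports System.Collections.Generic

Public Class ReflectionCache
    Private Shared _Instances As New Dictionary(Of Type, ReflectionCache)

    Private Sub New(ByVal ForType As Type)
        ' Walk ForType's fields once here and store their custom attributes
        ' (StoredProcedureAttribute, ParameterAttribute, and so on).
    End Sub

    Public Shared Function CreateInstance(ByVal ForType As Type) _
        As ReflectionCache
        If Not _Instances.ContainsKey(ForType) Then
            Try
                _Instances.Add(ForType, New ReflectionCache(ForType))
            Catch ex As ArgumentException
                ' Another thread added this type between the ContainsKey
                ' check and the Add call; the cached instance already exists.
            End Try
        End If
        Return _Instances(ForType)
    End Function
End Class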
Getting and setting these fields' values is a different challenge. You can always count on data coming out of a database to throw you curveballs, so flexibility is the key here. In my first attempt, the data service called FieldInfo's GetValue and SetValue methods directly. I quickly realized that wouldn't do. There were just too many possible curveballs: What do you do with nulls? What about XML columns? Do you read them as strings? XmlDocument objects? XmlReader objects? The answer, I found, was to move the get and set code from the data service into the call and row objects. A base class -- DbObject -- can provide default implementations of the get and set procedures, which you can then override in a derived class when a back-door solution is required. DbObject handles the rule, and its inheritors handle the exceptions.
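The article's listings don't show DbObject itself, but the idea reduces to something like this sketch; the GetFieldValue and SetFieldValue names are placeholders of my own:

Imports System.Reflection

Public MustInherit Class DbObject
    ' Default rule: copy values straight through, translating DBNull to
    ' Nothing on the way in and back to DBNull on the way out. Derived call
    ' and row classes override these methods when a column needs special
    ' handling (XML columns, unusual conversions, and so on).
    Protected Friend Overridable Sub SetFieldValue( _
        ByVal Field As FieldInfo, ByVal Value As Object)
        If Value Is DBNull.Value Then
            Field.SetValue(Me, Nothing)
        Else
            Field.SetValue(Me, Value)
        End If
    End Sub

    Protected Friend Overridable Function GetFieldValue( _
        ByVal Field As FieldInfo) As Object
        Dim Value As Object = Field.GetValue(Me)
        If Value Is Nothing Then Return DBNull.Value
        Return Value
    End Function
End Class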
A data service can potentially represent any underlying data store -- a relational database, a Web service, or a message queue, just to name a few -- so abstract its members into an interface:
Public Interface IDataService
    Inherits IDisposable
    Sub Abort()
    Sub BeginTransaction()
    Sub Commit()
    Property ConnectionName() As String
    Sub Execute(ByVal Command As CallObject)
End Interface
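The SQL Server implementation of this interface isn't reproduced here, but the heart of Execute amounts to reading the attributes and mapping fields to SqlParameters. Here's a trimmed, hypothetical sketch written as a standalone helper; it reflects directly instead of going through ReflectionCache, and it skips transactions, resultsets, and error handling:

Imports System.Data
Imports System.Data.SqlClient
Imports System.Reflection

Public Module DataServiceSketch
    ' Builds and runs a SqlCommand from a call object's attributes. The call
    ' object is typed As Object here to keep the sketch standalone.
    Public Sub ExecuteCall(ByVal Connection As SqlConnection, _
        ByVal Command As Object)

        Dim CallType As Type = Command.GetType()
        Dim ProcAttr As StoredProcedureAttribute = CType( _
            CallType.GetCustomAttributes( _
            GetType(StoredProcedureAttribute), False)(0), _
            StoredProcedureAttribute)

        Using Cmd As New SqlCommand(ProcAttr.Name, Connection)
            Cmd.CommandType = CommandType.StoredProcedure

            ' Each field marked with ParameterAttribute becomes a SqlParameter.
            For Each Field As FieldInfo In CallType.GetFields()
                Dim Attrs() As Object = Field.GetCustomAttributes( _
                    GetType(ParameterAttribute), False)
                If Attrs.Length = 0 Then Continue For
                Dim ParmAttr As ParameterAttribute = _
                    CType(Attrs(0), ParameterAttribute)

                Dim Value As Object = Field.GetValue(Command)
                If Value Is Nothing Then Value = DBNull.Value
                Dim Parm As SqlParameter = _
                    Cmd.Parameters.AddWithValue("@" & Field.Name, Value)
                Parm.Direction = ParmAttr.Direction
                If ParmAttr.Size > 0 Then Parm.Size = ParmAttr.Size
            Next

            Cmd.ExecuteNonQuery()

            ' Copy output and return values back into the call object's fields.
            For Each Parm As SqlParameter In Cmd.Parameters
                If Parm.Direction = ParameterDirection.Input Then Continue For
                Dim Field As FieldInfo = _
                    CallType.GetField(Parm.ParameterName.TrimStart("@"c))
                If Field Is Nothing Then Continue For
                If Parm.Value Is DBNull.Value Then
                    Field.SetValue(Command, Nothing)
                Else
                    Field.SetValue(Command, Parm.Value)
                End If
            Next
        End Using
    End Sub
End Module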
Declare Singleton Fetches
One of the most common data-access scenarios is the singleton fetch, in which you need to retrieve only a single object. You typically know its primary key, most often as a result of a previous fetch. The preferred technique in this case is to return the row values in output parameters rather than incurring the overhead of a row set for only a single row. Consider the Customers table in Microsoft's Northwind database. A singleton fetch stored procedure for this table returns exactly one row (see Listing 2). You can call the stored procedure by submitting a CallObject (see Listing 3).
Note that all the string-typed output parameter members specify a size. Input parameters and output parameters of fixed-length data types (such as INT) don't have to provide the Size property, but variable-length output parameters do. When you pass a string in an input parameter, the provider can infer the length from the value you supply; as long as the value fits the parameter, the call succeeds. Output parameters, however, don't necessarily pass a value in, so the provider has no way to guess the parameter's size.
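For illustration, a singleton-fetch call object in the spirit of Listing 3 might look like this, trimmed to a few columns. The Direction and Size syntax assumes the attribute sketch shown earlier; 40 and 30 are the actual lengths of those Northwind columns:

Imports System.Data

<StoredProcedure("STDA.CustomersFetch")> _
Public Class CustomersFetch
    <Parameter()> _
    Public CustomerID As String
    <Parameter(Direction:=ParameterDirection.Output, Size:=40)> _
    Public CompanyName As String
    <Parameter(Direction:=ParameterDirection.Output, Size:=30)> _
    Public ContactName As String
End Class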
The CallObject to update the fetched Customer object turns out to be similar. The only difference is the direction of the parameters. All the parameters that are declared as outputs in the fetch are inputs in the update (see Listing 4). In this case, the primary key is an in parameter, because the Customers table has a character-type key. If it were an integer type, you would make it in/out instead.
To reduce the number of stored procedure objects in your databases -- which can grow large quickly -- combine the object's INSERT and UPDATE statements into a single logical update. Most IDENTITY columns are defined to start at a value of 1, so reserve -1 as a special value to indicate "This object doesn't exist yet." When the procedure gets a -1 for the key, it does an INSERT and returns @@IDENTITY. On subsequent calls, you pass that value back in; because it's a positive integer, the procedure performs an UPDATE instead. You can't use that trick with Customers, because its primary key is a character type, so you have to do an IF EXISTS test to determine which statement you need to run (see Listing 5).
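Listing 5 holds the actual procedure; the IF EXISTS pattern it relies on amounts to something like this sketch, trimmed to a single updatable column for brevity:

CREATE PROCEDURE STDA.CustomersUpdate
    @CustomerID NCHAR(5),
    @CompanyName NVARCHAR(40)
AS
    IF EXISTS (SELECT * FROM Customers WHERE CustomerID = @CustomerID)
        UPDATE Customers
        SET CompanyName = @CompanyName
        WHERE CustomerID = @CustomerID
    ELSE
        INSERT INTO Customers (CustomerID, CompanyName)
        VALUES (@CustomerID, @CompanyName)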
In most applications, of course, the primary key for a singleton fetch won't come from your long-term memory. You'll choose it from the results of a query or search. In this situation, the number of rows being fetched is small, so reading and displaying them all is a viable approach. Continuing with the Northwind customer editor example, this stored procedure fetches an overview from the Customers table:
CREATE PROCEDURE STDA.CustomersOverviewFetch
AS
    SELECT CustomerID, CompanyName, City, Region, Country
    FROM Customers
A CallObject and a RowObject receive its resultset. When the data service executes CustomersOverviewFetch, it returns the resultset in the CallObject's Rows field, which the CustomerList form reads (see Listing 6 and Listing 7).
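Purely for illustration -- the class, form, and grid names here are assumptions standing in for Listings 6 and 7, and SqlDataService is a hypothetical concrete IDataService -- the form-side code might look like this:

Private Sub LoadCustomerOverview()
    Dim Fetch As New CustomersOverviewFetch()
    Using Svc As IDataService = New SqlDataService()
        Svc.ConnectionName = "Northwind"
        Svc.Execute(Fetch)
    End Using

    ' Bind the returned rows to the form's grid.
    For Each Row As CustomersOverviewRow In Fetch.Rows
        CustomerGrid.Rows.Add(Row.CustomerID, Row.CompanyName, Row.City)
    Next
End Sub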
Use a Paging Scheme
This is an elegant technique for small resultsets, but clearly it won't work for a billion rows. Consider using a paging scheme to process every row or almost every row in a large table. Assume the Northwind Orders table is really big. Rather than selecting the entire table with SELECT * FROM Orders, you can use SQL Server's TOP specifier to return a manageable number of rows, starting at a particular page:
SELECT TOP 100 * FROM Orders
WHERE OrderID > @OrderID
ORDER BY OrderID
Each time you finish processing a batch of rows, you pass the largest OrderID in the set back into this query. When it yields no rows, you've reached the end of the table.
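Assuming the query above is wrapped in a stored procedure with a matching call object -- OrdersPageFetch, OrdersPageRow, and SqlDataService are hypothetical names -- the processing loop might look like this:

Private Sub ProcessAllOrders()
    Dim LastOrderID As Integer = 0
    Using Svc As IDataService = New SqlDataService()
        Svc.ConnectionName = "Northwind"
        Do
            Dim Page As New OrdersPageFetch()
            Page.OrderID = LastOrderID
            Svc.Execute(Page)
            If Page.Rows.Count = 0 Then Exit Do   ' no rows left: end of table
            For Each Row As OrdersPageRow In Page.Rows
                ' Process the order here, then remember where the batch ended.
                LastOrderID = Row.OrderID
            Next
        Loop
    End Using
End Sub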
Be careful with this technique, because it's sensitive to optimization. Given the assumed size of the table, it's imperative that the column in the WHERE and ORDER BY clauses be indexed. A clustered index is strongly favored here, as it allows the storage engine to seek directly to the specified starting value and read only until the requested number of rows is returned. The batch size can also impact performance. A small value requires more trips to the server; a large one taxes memory. As always, it's best to return only the columns you need.
If you page by a column with a clustered index and use the right batch size, the performance of this technique can approach that of a plain-old data reader. It will never match the data reader's performance, though. If your code is truly performance-critical, this solution isn't for you.
Besides strongly typing command arguments and resultset columns, representing database commands and resultsets as .NET classes has another advantage: You can serialize them. Serializing call and row objects allows you to store them, making intermittently connected applications a breeze. You can drop them into a message queue for asynchronous processing, and logging data-tier errors is a simple matter of writing the commands to a table. Declarative programming is a perfect fit for data-access code.
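For example, assuming the call classes are marked <Serializable()>, a small helper like this hypothetical one turns any call object into bytes you can queue, store, or log:

Imports System.IO
Imports System.Runtime.Serialization.Formatters.Binary

Public Module CallObjectSerialization
    ' Serialize a call or row object so it can be queued, stored, or logged.
    Public Function ToBytes(ByVal Command As Object) As Byte()
        Dim Formatter As New BinaryFormatter()
        Using Stream As New MemoryStream()
            Formatter.Serialize(Stream, Command)
            Return Stream.ToArray()
        End Using
    End Function
End Module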
About the Author
John Cronan is Lead Scientist at Devia Software, LLC.