Friday, February 22, 2008

Abstract vs. Opaque data types

So, the problem boils down to the ancient abstract vs. opaque data type debate, which was ultimately responsible for the emacs/xemacs split, and which is one major reason why the Lisp/Python/Ruby folks don't understand the Java/C# folks. I'm jotting down ideas to clarify the issues in my own mind, and on the off chance that folks may find this useful if I unlock this post.

First, definitions. An abstract data type is a plain old data structure - a dict or list or integer - along with functions to operate on it as a domain object. You can still reach in and twiddle its innards as a normal hash or list, though, and as far as the programming language is concerned, it's just a built-in data type. An opaque data type is a special class with methods to manipulate it, and the language either prevents you from accessing the guts (Java or C++) or strongly discourages it (Python's underscore-prefixed _members).
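A minimal sketch of the distinction in Python (the names here are illustrative, not from any real codebase):

```python
# Abstract: a plain dict plus free functions. Nothing stops you from
# poking at the innards directly.
def make_point(x, y):
    return {'x': x, 'y': y}

def translate(point, dx, dy):
    return {'x': point['x'] + dx, 'y': point['y'] + dy}

p = make_point(1, 2)
p['x'] = 10   # reaching inside is fair game; it's just a dict

# Opaque: a class whose underscored attributes signal "hands off" --
# only the methods are the interface.
class Point:
    def __init__(self, x, y):
        self._x = x
        self._y = y

    def translate(self, dx, dy):
        return Point(self._x + dx, self._y + dy)

q = Point(1, 2).translate(3, 4)
```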

Some languages - like Arc, JavaScript or pure Scheme - provide only abstract data types, and you need to fake opaque types with conventions. Other languages - like Java - provide only opaque types, and the basic concrete data types are special cases of that. The languages I tend to use - like Python - provide both, and it's up to you which is appropriate.

When prototyping, I've almost always found abstract types to be better (or sometimes even concrete types). This is because they don't need to be declared: the interface to an abstract data type is whatever functions you provide to manipulate it, and if you're missing something, you can just use it as a dict or literal. This lets you change things around very quickly, which is incredibly important when you don't really know what you're doing. They also tend to be less code.

In production code, I've found opaque types to be generally better, because they provide additional contractual guarantees that are really important when you're building stuff on top of these classes. You don't want the interface to change with every revision, because it'll break everything you've built on top. Moreover, because the interface is stable (and presumably tested), you can treat the type as a solid black box, which reduces the complexity you need to keep inside your head at once.
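The black-box payoff is easiest to see in a toy sketch (again with made-up names): callers depend only on the public interface, so the internals can change without breaking anything built on top.

```python
class Rect:
    def __init__(self, x, y, w, h):
        # current internal representation: corner plus size
        self._x, self._y, self._w, self._h = x, y, w, h

    @property
    def area(self):
        return self._w * self._h

# Code that only touches .area keeps working even if the internals
# later switch to, say, a pair of corner points.
r = Rect(0, 0, 3, 4)
```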

Unfortunately, GameClay is currently in that awkward stage where I don't yet know what I'm doing (in the sense of having detailed interface specifications for each object), and yet I need stronger specification guarantees on the base types in order to move forward. I've sorta got a hybrid architecture now, where classes wrap raw data structures and provide accessors. However, the problem there is that I've gotta remember whether I'm dealing with raw unwrapped structures or wrapped structures with utility methods.

The cleanest solution in terms of remembering stuff is to bite the bullet, convert the raw JSON structure to language objects when read in, and have all accessors return language objects. Then I'd need a method to convert it back to JSON data structures. We can assume that anywhere within the system, once the objects have been constructed, it's all objects, and they all have the appropriate utility methods.

One pitfall I ran into when I thought of doing this yesterday was that the conversion is somewhat lossy. For example, expressions are represented by strings in the props structure, but get parsed into an internal data structure. Printing them back out loses all whitespace and parenthesization. I suppose I could store the initial prop for expressions, and just omit accessors to change parts of the expression (all our data structures are immutable anyway). If it's changed by code, it'll have to be changed all at once. I don't think I have any other places where the representation is lossy.
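Storing the initial prop alongside the parsed form would look roughly like this hypothetical sketch (the real parser is stubbed out with a split):

```python
class Expression:
    def __init__(self, source):
        self._source = source              # original prop, kept verbatim
        self._parsed = source.split('+')   # stand-in for the real parser

    def to_props(self):
        # print back the stored source, so whitespace and
        # parenthesization survive the round-trip; works because there
        # are no accessors that change parts of the expression
        return self._source

e = Expression('(a + b)')
```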

Another problem is that this is pretty significant code bloat. That's unfortunate, but I'm not sure it's avoidable. Currently, we're using a decorator to reach inside the props structure and return the appropriate part, but this isn't really correct: it doesn't wrap the props structure with the appropriate class. If we were to make it correct, we'd need conversion logic, and conversion logic is probably simpler when it's all in one place.
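A hypothetical sketch of the current shortcut (prop_accessor and Behavior are invented names), showing why it isn't really correct:

```python
def prop_accessor(key):
    # builds a read-only accessor that reaches into the raw props dict
    def getter(self):
        return self._props[key]   # raw dict/list comes back unwrapped
    return property(getter)

class Behavior:
    def __init__(self, props):
        self._props = props

    actions = prop_accessor('actions')

b = Behavior({'actions': [{'type': 'move'}]})
# b.actions is a bare list of dicts, not a list of wrapped Action
# objects -- so callers are back to dealing with raw structures
```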

A third downside is that we need certain validation state to properly validate objects, and we need to validate before we can safely convert. The easiest way to handle this is probably to pass the state into the constructor along with the props data structure; the constructor throws an exception if the props are invalid. This also fixes my uneasiness about having some constructs throw in the constructor (eg. parsing expressions, actions) while others don't throw until validate is called. One downside is that validate can't be called standalone, but that shouldn't be necessary: if the props are invalid, you shouldn't be able to create the object in the first place, and if you call a setter with an invalid value, it should fail in the setter. Since objects are immutable, setters create new objects, so they can just re-use the validation from the constructors; we do, however, need to save a copy of the validation state so that mutators can invoke the constructor.
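A hypothetical sketch of that scheme (Timer and its interval check are invented for illustration): validate in the constructor, keep the object immutable, and have "mutators" re-run the constructor with the saved validation state so validation happens exactly once, in one place.

```python
class Timer:
    def __init__(self, props, validation_state):
        # validation happens here, so an invalid Timer can never exist
        if props['interval'] <= 0:
            raise ValueError('interval must be positive')
        self._props = props
        self._state = validation_state   # saved so mutators can reuse it

    def with_interval(self, interval):
        # "mutator" returns a new, freshly validated object instead of
        # modifying this one
        new_props = dict(self._props, interval=interval)
        return Timer(new_props, self._state)

t = Timer({'interval': 10}, validation_state={})
t2 = t.with_interval(20)
# t.with_interval(-1) would raise ValueError from the constructor
```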
