Safe Serialization Under Mutual Suspicion/Introducing Data E

=Lessons of Deconstruction Serialization=

The body of each chapter is presented using running code examples, with explanation detailed enough that you should be able to follow the code. Readers who wish only to understand the general lessons, and how they may be applied to more conventional serialization systems, should see the "Lessons of..." and "Corresponding Concepts in Conventional Serialization" sections at the beginning and end of each chapter.

The "Deconstruction Serialization" chapter defines our serialization format, Data-E, by subsetting a programming language, E. This move is inessential, but it allows us to more easily understand what's happening.

Many existing systems (Lisps, Smalltalk) print an object as an expression that, if evaluated, would reconstruct that object. We can understand the data by reusing a subset of our code reading skills. When these printing systems do so reliably, they are often grown into serialization systems, in which the depiction actually is the program it appears to be. We can then understand the semantics of the serialized form by reusing a subset of our understanding of the semantics of the programming language.

(Serialization in Mozart creates the equivalent of a compiled module, providing the semantic correspondence but not the visibility. Perhaps these can be decompiled.)

In these terms, Java has three very different depiction systems.
 * The printed form of a Java object is simply its response to toString. While these are often in the form of expressions that would reconstruct the object, often they're not, and there's no general convention.
 * Java Object Serialization Streams, or JOSS [ref JOSS], defines a complex opaque binary format which few know how to parse, and which is rarely made visible. JOSS is implemented using special powers (native methods for violating encapsulation) not available to other objects, making it too dangerous to use under mutual suspicion. However, JOSS also has an extensive and well thought out set of hooks for customizing serialization and unserialization. The logic of these hooks strikes the balance we need between flexibility and security. The corresponding hooks in Data-E are based on lessons learned from JOSS.
 * In apparent reaction to the maintenance problems created by opacity and private access, Java now has an additional serialization framework, the XMLEncoder [ref XMLEncoder]. The XMLEncoder writes depictions which are semantically identical to Java programs. This format combines the semantics of Java with the readability of XML. Like Data-E, this system serializes only by interacting with the public protocol of the original objects, and unserializes by evaluating the depiction-as-program, which again uses only public protocols to perform the reconstruction. Unfortunately, the XMLEncoder is designed around the Java Beans conventions, which conflate access with initialization, rendering its design useless for our present purposes.

To produce a depiction of an object graph, a serializer must somehow obtain a representation of each object adequate for calculating an overall depiction. In the printing frameworks, like Java's toString, the representation offered is a depiction -- it is already only bits, and each object is responsible for traversing the portion of the subgraph rooted in itself. This is maximally flexible -- it allows an object to claim anything it likes. Used for serialization, this flexibility has fatal security problems, as explained in the next chapter.

In our approach, the serializer must obtain for each object a portrayal -- a representation of that object in terms of live references to other objects. The serializer's traversal proceeds as it obtains a portrayal for each of these "components", etc. A serializer can simply ask an object to provide a portrayal of itself -- the object's self portrait -- or, as we will see, it can derive or substitute its own portrayal of the object as an expression of some other policy objective. However it obtains these portrayals, the resulting depiction is a literal picture only of the graph of these portrayals, rather than the actual graph of objects being serialized.

=Concrete Embodiment in E=

Although the ideas in this paper should be applicable to any object-capability language and many serialization formats, for concreteness we present, as previously mentioned, their implementation in E as applied to Data-E. When an example is shown as an E command line session, like

 ? 2 + 3
 # value: 5

then the example also doubles as an executable regression test. By updocing the page containing the examples, you can see whether the system behaves as shown by the example. If you installed E at, for example, "c:/Program Files/erights.org" and placed "c:/Program Files/erights.org/scripts" on your <tt>PATH</tt>, then in a directory containing the chapters of this paper, at a shell prompt you can type:

(Though please see "Security Considerations" before running this or any other Updoc script.) Each of the dots is a test case that successfully passed, like the above "<tt>2 + 3</tt>". As you read, if you are curious about how a variation of an example would behave, make a copy of the page, edit appropriately, and updoc it. Or try the examples interactively at a <tt>rune</tt> command-line:

The <tt>rune</tt> program starts an E read-eval-print loop. The "?" is the E prompt. To exit <tt>rune</tt>, type the end-of-file character: Control-D or Control-Z depending on your shell.

In order to have a common point of reference, this paper assumes a basic prior knowledge of Java. JOSS (Java's Object Serialization Streams) [ref JOSS] is occasionally used for comparison, so a prior knowledge of it will help, but is not required.

We assume only that prior knowledge of E explained in the Ode ("Capability-based Financial Instruments") [ref Ode] and that explained in the next section on "E's URI Expressions". E syntactically resembles other C tradition object languages, such as Java, C++, C#, and Python. When the meaning of E code isn't covered by the Ode or the next section, and isn't obvious by analogy with Java, we will explain as we go. For more on E, please see [ref erights.org, Walnut].

<a name="uri-exprsns"></a>

E's URI Expressions
An unfamiliar bit of E syntax, needed to understand the examples in this paper, is the URI expression, written as a URI string between angle brackets.

 ? def f := &lt;file:/foo/bar&gt;
 # value: &lt;file:c:/foo/bar&gt;

The protocol name to the left of the URI's colon (<tt>file</tt> here, but any identifier) is transformed (mangled) into the name of the variable (<tt>file__uriGetter</tt>) whose value is asked to retrieve the named resource. The characters between the colon and the close angle bracket (which must be legal characters for a URI body) become the literal resource name to be looked up (<tt>"/foo/bar"</tt>). So the above session is a shorthand for the equivalent:

 ? def f := file__uriGetter.get("/foo/bar")
 # value: &lt;file:c:/foo/bar&gt;

Besides accessing the kind of resources normally accessed by URLs, the URI expression is also used as E's module import mechanism. The typical form of import in E is

 def name := &lt;import:fully-qualified-name&gt;

where fully-qualified-name is the full name, including package prefix, of an E module or a safe Java class. (In order to make the extensive Java libraries available in E without sacrificing capability security, we must tame them -- determine what subset of their public interface is consistent with capability security principles. As part of this taming process, we declare certain Java classes to be "safe", and therefore generally importable.) Because fully qualified names can be long, they are often factored as follows:

 def &lt;packageName&gt; := &lt;import:package-prefix.*&gt;
 ...
 def name := &lt;packageName:rest-of-path&gt;

The package subtree rooted at <tt>"org.erights.e.elib"</tt> is provided as a built-in convenience, as if we had already executed

 def &lt;elib&gt; := &lt;import:org.erights.e.elib.*&gt;

As a result,

 ? def deSubgraphKit := &lt;elib:serial.deSubgraphKit&gt;
 # value: &lt;deSubgraphKit&gt;

is equivalent to

 ? def deSubgraphKit := &lt;import:org.erights.e.elib.serial.deSubgraphKit&gt;
 # value: &lt;deSubgraphKit&gt;

Such a package subtree root gives access only to that subtree of the original package tree. Similarly, as will be explained in Manipulating Authority at the Exits, a directory can be used as a uriGetter in order to give access only to a subtree of the file system, as retrieved by names relative to that directory.
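For readers more at home outside E, the mangling rule for URI expressions can be imitated in a few lines of Python. This is only an illustrative sketch: <tt>expand_uri_expr</tt>, <tt>eval_uri_expr</tt>, <tt>TableGetter</tt>, and <tt>PrefixGetter</tt> are hypothetical names invented here, and the lookup table stands in for E's real import machinery.

```python
import re

# Hypothetical sketch: "<scheme:body>" stands for scope["scheme__uriGetter"].get("body").
_URI_EXPR = re.compile(r"^<([A-Za-z_][A-Za-z0-9_]*):(.*)>$")


def expand_uri_expr(expr):
    """Return the (mangled getter name, resource name) pair for a URI expression."""
    match = _URI_EXPR.match(expr)
    if match is None:
        raise ValueError(f"not a URI expression: {expr!r}")
    scheme, body = match.groups()
    return f"{scheme}__uriGetter", body


def eval_uri_expr(expr, scope):
    """Look up the mangled getter in the scope and ask it for the resource."""
    getter_name, body = expand_uri_expr(expr)
    return scope[getter_name].get(body)


class TableGetter:
    """Stand-in for a root getter; a dict plays the role of the module store."""

    def __init__(self, table):
        self.table = table

    def get(self, name):
        return self.table[name]


class PrefixGetter:
    """Stand-in for a subtree root: it can only reach names under its prefix."""

    def __init__(self, parent, prefix):
        self.parent = parent
        self.prefix = prefix

    def get(self, name):
        return self.parent.get(self.prefix + name)


import_getter = TableGetter(
    {"org.erights.e.elib.serial.deSubgraphKit": "<deSubgraphKit>"}
)
scope = {
    "import__uriGetter": import_getter,
    "elib__uriGetter": PrefixGetter(import_getter, "org.erights.e.elib."),
}

short_form = eval_uri_expr("<elib:serial.deSubgraphKit>", scope)
long_form = eval_uri_expr("<import:org.erights.e.elib.serial.deSubgraphKit>", scope)
```

Both lookups reach the same entry, while the <tt>elib</tt> getter by construction cannot name anything outside its own prefix.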


=Previews of Data-E Serialization=

In the Data-E System, we compose a serializer or unserializer from a pair of Data-E Kits, such as the <tt>deSubgraphKit</tt> already imported above. Each kit knows how to recognize and build a given kind of Data-E representation. The <tt>deSubgraphKit</tt> is special, in that the representation it traffics in is subgraphs of actual objects. All the other representations are depictions of subgraphs expressed in the Data-E "language". For example, the <tt>deSrcKit</tt> traffics in representations of Data-E as source strings, written in the Data-E subset of the E source language. In this document, to make clear when we're looking at Data-E rather than E source, all Data-E source strings are shown prefixed with <tt>"de: "</tt>.

 ? def deSrcKit := &lt;elib:serial.deSrcKit&gt;
 # value: &lt;deSrcKit&gt;

To serialize is to recognize a subgraph and to build a depiction.

 ? "de: " + deSubgraphKit.recognize([false, 3], deSrcKit.makeBuilder())
 # value: "de: def t__0 := [false, def t__2 := 3]"

Each <tt>recognize(..)</tt> method takes two arguments -- the input representation to recognize (here, the subgraph to traverse), and a builder to call as parts of the representation are recognized, in order to build the output representation (here, a Data-E source string). Each kit provides a <tt>recognize(..)</tt> method for accepting its form of representation as input, and a <tt>makeBuilder()</tt> method for making a builder to build its form of representation as output. The <tt>recognize(..)</tt> method returns its argument builder's final output.

The <tt>deASTKit</tt> manages another depiction of Data-E programs: as Abstract Syntax Trees, as explained in Appendix A: The Data-E Manual. For present purposes, we care only about one feature of this kit, that it simplifies the expression a bit during building; for example, by removing unnecessary temporary variables. Interposing it between the <tt>deSubgraphKit</tt> and the <tt>deSrcKit</tt>, we build our first expository serialize function, <tt>serialize_a</tt>. (All functions so named are for expository purposes only. See Appendix A for a guide to realistic usage.)

 ? def deASTKit := &lt;elib:serial.deASTKit&gt;
 # value: &lt;deASTKit&gt;

 ? def serialize_a(root) :String {
 &gt;     def ast := deSubgraphKit.recognize(root, deASTKit.makeBuilder())
 &gt;     "de: " + deASTKit.recognize(ast, deSrcKit.makeBuilder())
 &gt; }
 # value: &lt;serialize_a&gt;

 ? serialize_a([false, 3])
 # value: "de: [false, 3]"

To unserialize is to recognize a depiction and to build a subgraph. The matching expository unserialize function follows. Parameters in E are actually patterns, of which the most common is a simple variable name, which defines the variable and binds it to the specimen (the argument passed in). Here we see a pattern for stripping off the additional prefix we added above. This pattern matches a string beginning with <tt>"de: "</tt>. If the specimen is such a string, then it matches the rest of the string against the pattern following the <tt>"@"</tt> -- in this case a simple variable name which is bound to the remainder of the string.

 ? def unserialize_a(`de: @src`) :any {
 &gt;     deSrcKit.recognize(src, deSubgraphKit.makeBuilder())
 &gt; }
 # value: &lt;unserialize_a&gt;

 ? unserialize_a("de: [false, 3]")
 # value: [false, 3]

The result of evaluating the depiction-as-expression is a reconstruction of the original subgraph. The case shown above stays clear of all the problematic cases for serialization: All the objects in the graph being serialized here (a boolean, an integer, and a list) are transparent -- they willingly divulge all their state to their clients through their public protocol. All the objects participating in the above serialization -- the serialize function and the objects being depicted -- seek only literal accuracy; and likewise during the above unserialization. We call this unproblematic starting case literal realism for transparent subgraphs.
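The pairing of a recognizer with an interchangeable builder can be sketched outside E as follows. This Python fragment is only an analogy under stated assumptions: <tt>recognize_subgraph</tt>, <tt>SrcBuilder</tt>, and <tt>TreeBuilder</tt> are hypothetical names, the builder protocol has just three invented methods, and no temporary variables are generated.

```python
import json


class SrcBuilder:
    """Builds a source-string depiction (loosely analogous to a deSrcKit builder)."""

    def build_literal(self, value):
        return json.dumps(value)

    def build_name(self, name):
        return name

    def build_list(self, parts):
        return "[" + ", ".join(parts) + "]"


class TreeBuilder:
    """Builds a nested-tuple tree from the very same recognizer calls."""

    def build_literal(self, value):
        return ("literal", value)

    def build_name(self, name):
        return ("name", name)

    def build_list(self, parts):
        return ("list", parts)


def recognize_subgraph(obj, builder):
    """Traverse a live subgraph, reporting each part to the builder."""
    # Booleans and None are reported as names, not literals (bool is checked
    # before int because bool is a subclass of int in Python).
    if obj is True:
        return builder.build_name("true")
    if obj is False:
        return builder.build_name("false")
    if obj is None:
        return builder.build_name("null")
    if isinstance(obj, (int, str)):
        return builder.build_literal(obj)
    if isinstance(obj, list):
        return builder.build_list([recognize_subgraph(x, builder) for x in obj])
    # Anything else is treated as encapsulated.
    raise ValueError(f"Can't uneval {obj!r}")


depiction = "de: " + recognize_subgraph([False, 3], SrcBuilder())
tree = recognize_subgraph([False, 3], TreeBuilder())
```

The recognizer never learns which representation is being produced; swapping the builder is enough to change the output form.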

A Proper Failure
Objects defined in E are encapsulated by default. If we make an encapsulated object and try to serialize it:

 ? def capsule {}
 # value: &lt;capsule&gt;

 ? serialize_a(capsule)
 # problem: Can't uneval &lt;capsule&gt;

we find our simple serialize function fails, as it must. Usually this failure will be the desired behavior. When it isn't, we have several alternatives.

Unconditional Transparency

 ? def iAmFive {
 &gt;     to __optUncall() :__Portrayal {
 &gt;         [2, "add", [3]]
 &gt;     }
 &gt; }
 # value: &lt;iAmFive&gt;

 ? serialize_a(iAmFive)
 # value: "de: 2.add(3)"

 ? unserialize_a("de: 2.add(3)")
 # value: 5

 ? unserialize_a("de: 2 + 3")
 # value: 5

The object <tt>iAmFive</tt> is unconditionally transparent, but is not literally, or even usefully, realistic. When we reconstruct an object according to its self-portrait, the result can be quite different from the original.

Named Exit Points
Another approach to dealing with problematic objects is to avoid serializing them, by making references to them into named exit points. The traversal of a subgraph stops whenever it encounters such references, writing out named exit points instead (the jigsaw plugs shown on Figure 2: The Three Faces). We define matching <tt>serialize</tt> and <tt>unserialize</tt> functions customized to treat references to <tt>capsule</tt> as an exit point. To create a custom <tt>serialize</tt> function, we first create a custom unscope -- a table mapping from exit references to names. This is most conveniently done by modifying the default unscope table -- the one used implicitly in the previous examples.

 ? def unscope_b := deSubgraphKit.getDefaultUnscope().diverge()
 ? unscope_b[capsule] := "foo"
 # value: "foo"

 ? def recognizer_b := deSubgraphKit.makeRecognizer(null, unscope_b)
 # value: &lt;unevaler&gt;

 ? def serialize_b(root) :String {
 &gt;     def ast := recognizer_b.recognize(root, deASTKit.makeBuilder())
 &gt;     "de: " + deASTKit.recognize(ast, deSrcKit.makeBuilder())
 &gt; }
 # value: &lt;serialize_b&gt;

 ? serialize_b([capsule, 3])
 # value: "de: [foo, 3]"

As we see, in the Data-E expression produced by serialization, a named exit reference becomes a free variable reference. Data-E unserialization is expression evaluation -- that subset of E expression evaluation applicable to the Data-E subset of E. The inverse of the unscope is therefore just the conventional notion of a scope (or environment), a mapping from variable names to values. On reconstruction, the named exit points will be reconnected to these values.

 ? def scope_b := deSubgraphKit.getDefaultScope().diverge()
 ? scope_b["foo"] := def newCapsule {}
 # value: &lt;newCapsule&gt;

 ? def unserialize_b(`de: @src`) :any {
 &gt;     # Just a way of saying eval(src, scope_b)
 &gt;     deSrcKit.recognize(src, deSubgraphKit.makeBuilder(scope_b))
 &gt; }
 # value: &lt;unserialize_b&gt;

 ? unserialize_b("de: [foo, 3]")
 # value: [&lt;newCapsule&gt;, 3]

Each set of exit names together with a mutual understanding about what they may be bound to forms a unique micro-standard data format. Serializers and unserializers must agree on such a micro-standard, and so often come in matched pairs, as above. This agreement still leaves room for separate customization on each side, as with our unserializer's choice to bind a somewhat different object to the name <tt>"foo"</tt> in the scope, perhaps to adapt to a difference in the unserializer's context.
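The relationship between an unscope and a scope can be imitated in Python as a rough sketch. The names <tt>uneval</tt>, <tt>evaluate</tt>, and <tt>Capsule</tt> are hypothetical; the unscope is keyed by object identity, and the evaluator accepts only integers, names, and list brackets, which is far narrower than real Data-E.

```python
import ast

# Assumed default tables: null, false and true are ordinary names, not literals.
DEFAULT_UNSCOPE = {id(None): "null", id(False): "false", id(True): "true"}
DEFAULT_SCOPE = {"null": None, "false": False, "true": True}


def uneval(obj, unscope):
    """Write an expression for obj, stopping at anything named in the unscope."""
    name = unscope.get(id(obj))
    if name is not None:
        return name  # exit point: emit a free variable reference
    if isinstance(obj, int):
        return repr(obj)
    if isinstance(obj, list):
        return "[" + ", ".join(uneval(x, unscope) for x in obj) + "]"
    raise ValueError(f"Can't uneval {obj!r}")


def evaluate(src, scope):
    """Evaluate the tiny expression subset, resolving free names in the scope."""

    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.Name):
            return scope[node.id]
        if isinstance(node, ast.List):
            return [walk(element) for element in node.elts]
        raise ValueError("expression is outside the depiction subset")

    return walk(ast.parse(src, mode="eval").body)


class Capsule:
    """Stand-in for an encapsulated object."""


capsule = Capsule()
new_capsule = Capsule()

unscope_b = dict(DEFAULT_UNSCOPE)
unscope_b[id(capsule)] = "foo"
scope_b = dict(DEFAULT_SCOPE)
scope_b["foo"] = new_capsule

depiction = uneval([capsule, 3, False], unscope_b)
rebuilt = evaluate(depiction, scope_b)
```

As in the session above, the two sides agree only on the name <tt>foo</tt>; what it is bound to on reconstruction is the unserializer's own choice.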

The Gordian Surgeon
Since the serialize function, unserialize function, scope, unscope, and (as we will see) uncallers list are often manipulated together, as above, we introduce the Gordian Surgeon -- an object that knows how to manipulate and wield these five tools in a coordinated fashion to, in effect, cut a subgraph from a donor context, freeze it, thaw it, and transplant it into a recipient graph. The following session is like that above, except that the same <tt>capsule</tt> is used for serialization and unserialization -- as this is the common pattern the surgeon makes convenient. The <tt>addExit(..)</tt> below adds the value-name association to the unscope and adds the inverse name-value association to the scope.

 ? def makeSurgeon := &lt;elib:serial.makeSurgeon&gt;
 ? def surgeon := makeSurgeon.withSrcKit("de: ").diverge()

 ? surgeon.serialize([capsule, 3])
 # problem: Can't uneval &lt;capsule&gt;

 ? surgeon.addExit(capsule, "foo")

 ? surgeon.serialize([capsule, 3])
 # value: "de: [foo, 3]"

 ? surgeon.unserialize("de: [foo, 3]")
 # value: [&lt;capsule&gt;, 3]

Counting "Generations"
Our first realistic example uses both self-portraits (with unconditional transparency) and named exit points:

 ? def makeGenerationCounter(count :int) :any {
 &gt;     def generationCounter {
 &gt;
 &gt;         /**
 &gt;          * Make my successor with the next larger count
 &gt;          */
 &gt;         to __optUncall() :__Portrayal {
 &gt;             [makeGenerationCounter, "run", [count+1]]
 &gt;         }
 &gt;
 &gt;         /** similar purpose as Java's .toString */
 &gt;         to __printOn(out :TextWriter) :void {
 &gt;             out.print(`&lt;gen $count&gt;`)
 &gt;         }
 &gt;     }
 &gt; }
 # value: &lt;makeGenerationCounter&gt;

 ? def genCounter := makeGenerationCounter(0)
 # value: &lt;gen 0&gt;

A <tt>generationCounter</tt> is an object with a single instance variable, <tt>count</tt>. The <tt>makeGenerationCounter</tt> function acts like a constructor -- it makes new <tt>generationCounter</tt> instances. Above, we make the <tt>genCounter</tt> instance with a count of zero. Each <tt>generationCounter</tt> uses its <tt>__optUncall</tt> method above to "misrepresent" itself as something that would be made by calling <tt>makeGenerationCounter</tt> with a <tt>count</tt> value one greater than its own.

However, this portrayal enables us to serialize a <tt>generationCounter</tt> only if we can serialize <tt>makeGenerationCounter</tt>, which is just as encapsulated as our earlier <tt>capsule</tt>. Since it is stateless, it is plausible to "serialize" it by not serializing it -- by making it into a named exit point.

 ? surgeon.addExit(makeGenerationCounter, "makeGenerationCounter")

 ? surgeon.serialize(genCounter)
 # value: "de: makeGenerationCounter(1)"

 ? surgeon.unserialize("de: makeGenerationCounter(1)")
 # value: &lt;gen 1&gt;

A reconstructed <tt>generationCounter</tt> has a count one greater than its original, thereby accumulating a count of the number of serialize / unserialize cycles it has been through since it was born. Rather than seeing the <tt>__optUncall</tt> "misrepresentation" as a problem to prohibit, this example shows how this representational freedom is an opportunity.
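The same idea can be mimicked in Python as a rough sketch. Here <tt>opt_uncall</tt>, <tt>reconstruct</tt>, and <tt>round_trip</tt> are hypothetical names, a portrayal is modeled as a (receiver, verb, arguments) triple, Python's <tt>__call__</tt> plays the role of E's <tt>run</tt>, and the textual depiction step is skipped entirely.

```python
class GenerationCounter:
    """Counts how many portray/reconstruct cycles its lineage has been through."""

    def __init__(self, count):
        self.count = count

    def opt_uncall(self):
        # Self-portrait: "I am what you get by calling my maker with count + 1".
        return (make_generation_counter, "__call__", [self.count + 1])

    def __repr__(self):
        return f"<gen {self.count}>"


def make_generation_counter(count):
    """Plays the role of the stateless maker that would become a named exit."""
    return GenerationCounter(count)


def reconstruct(portrayal):
    """Perform the call that the portrayal describes."""
    receiver, verb, args = portrayal
    return getattr(receiver, verb)(*args)


def round_trip(obj):
    """One serialize / unserialize cycle, without the intermediate text."""
    return reconstruct(obj.opt_uncall())


gen_counter = make_generation_counter(0)
once = round_trip(gen_counter)
twice = round_trip(once)
```

Each round trip yields a new object whose count is one higher, while the original is left unchanged.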

=Unserialization as Evaluation=

(This is approximately an abridged presentation of the Unserialization as Evaluation section of Appendix A: The Data-E Manual.)

As shown above, unserialization can be thought of, or even implemented as, expression evaluation [ref Rees, XMLEncoder]. A depiction is an expression in some programming language, the unserializer is the eval function, the exit references to be reconnected are free variable references, the values to reconnect them to come from the scope (i.e., environment) provided to eval, and the root of the reconstructed subgraph is the value the expression evaluates to. Serialization is the logically inverse process, in which an uneval function is applied to a root and an unscope, and writes out an expression that, were it evaluated in the corresponding scope, would reconstruct the subgraph.

Data-E is the subset of E used for depicting a subgraph as an expression. Ignoring precedence, it consists of the following productions:

E Syntactic Shorthands generated by <tt>deSrcKit</tt>
Since we use <tt>deSrcKit</tt> to build the depictions we present for expository purposes, we need to know the shorthands it builds, which are a subset of the shorthands recognized and expanded by E. Going the other way, all E syntactic shorthands, including those below, are recognized by <tt>deSrcKit</tt>, since it uses the E parser to parse and expand its input.

Any valid E expression that expands only into the above Data-E primitives is a valid Data-E expression with the same meaning. Likewise any valid Data-E expression is a valid E expression with the same meaning.

Using several cases together:

 ? def root := [1, root, 1, &lt;import:java.lang.makeStringBuffer&gt;]
 # value: [1, &lt;***CYCLE***&gt;, 1, &lt;makeStringBuffer&gt;]

 ? def depiction := surgeon.serialize(root)
 # value: "de: def t__0 := [def t__2 := 1,
 #                          t__0,
 #                          t__2,
 #                          &lt;import:java.lang.makeStringBuffer&gt;]"

 ? surgeon.unserialize(depiction)
 # value: [1, &lt;***CYCLE***&gt;, 1, &lt;makeStringBuffer&gt;]

The depiction is shown in the middle following the "de: ". It is written in Data-E and has the following meaning:
 * The value of the E expression <tt>&lt;import:java.lang.makeStringBuffer&gt;</tt> serializes as the Data-E expression <tt>import__uriGetter.get("java.lang.makeStringBuffer")</tt>, as will be explained in the next chapter. The <tt>deSrcKit</tt> shows this expression using the URI shorthand, which in this case looks like the original E.
 * The <tt>def t__2 := 1</tt> is a non-cyclic instance of the <tt>defexpr</tt> production, since <tt>t__2</tt> is not used on its right hand side, even though <tt>t__2</tt> is used later in the serialization.
 * The <tt>def t__0 := ...</tt> is a cyclic instance of the <tt>defexpr</tt> production, since <tt>t__0</tt> is used on its right hand side, expressing a cyclic data structure.

For those familiar with Java, Data-E should be mostly familiar, but with a few important differences:


 * In E, a variable definition is an expression. Like assignment, the value of a definition expression is the value of the right hand side.


 * <tt>null</tt>, <tt>false</tt>, and <tt>true</tt> are not keywords in E, but rather are variable names in E's universal scope and in Data-E's default scope and unscope. This means an expression can count on them having their normal values, so these don't need to be literals. The "<tt>false</tt>" in the first example of Previews of Data-E Serialization above was a variable reference, not a literal, just as it is in E.


 * Using only the <tt>literal</tt>, <tt>varName</tt>, and <tt>call</tt> productions, we can write Data-E expressions that will evaluate to new tree structures whose leaves are reattached exit points.


 * In E, a variable is in scope starting from its defining occurrence, left-to-right, until the close-curly that closes the scope box (lexical contour) in which it is defined, and not counting regions of nested scope boxes where it is shadowed by a definition of the same name. In Data-E, since there are no constructs that introduce scope boxes (i.e., no constructs with curly brackets), every variable is in scope from its defining occurrence until the end of the depiction as a whole.


 * With the <tt>tempName</tt> and non-cyclic <tt>defexpr</tt> productions, we can use Data-E to represent DAGs. For those values that are multiply referenced, we can write out the sub-expression for calculating this value at the first position it needs to appear, as the right hand side of a <tt>define</tt>, capturing its value in a temp variable (<tt> def t__2 := 1</tt>). Everywhere else this value is needed, we use <tt>tempName</tt> to reuse this captured value (<tt>t__2</tt>).


 * With a cyclic <tt>defexpr</tt>, we can use Data-E to represent graphs. Unlike other block structured languages, even when the name being defined on the left is used on the right, E still holds strictly to the left-to-right rule. This may seem strange, since <tt>defexpr</tt> must execute right-to-left -- the expression on the right must be evaluated to a value before the variable on the left can be defined to hold this value. When this not yet defined but in-scope variable name is used within the expression on the right, what does it evaluate to before its actual value has been determined? The answer is an unresolved promise, similar to a logic variable or a future. An unresolved promise is an object reference whose designation has not yet been determined. At the moment the above list is created by evaluating the "<tt>[..]</tt>" expression on the right, <tt>t__0</tt> is still an unresolved promise. Once the expression on the right has evaluated to a value, then the promise bound to the variable on the left is resolved to be this value. Once a promise is resolved, it becomes like any normal reference to the object it designates. This value is also the value of the <tt>defexpr</tt> as a whole. By the time <tt>def t__0 := ...</tt> finishes evaluating, the promise within the list becomes a direct reference to the list itself.
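
The evaluation order of a cyclic <tt>defexpr</tt> can be imitated in Python as a loose sketch. <tt>Promise</tt> and <tt>define_cyclic</tt> are hypothetical names. Python has no transparent reference forwarding, so this sketch patches the top-level list slots once the promise resolves, whereas in E the promise itself becomes the reference; only lists that mention the promise directly are handled.

```python
class Promise:
    """Placeholder for a value whose designation is not yet determined."""

    _UNRESOLVED = object()

    def __init__(self):
        self._value = Promise._UNRESOLVED

    def resolve(self, value):
        if self._value is not Promise._UNRESOLVED:
            raise RuntimeError("promise already resolved")
        self._value = value

    def get(self):
        # Using an unresolved promise fails at once instead of exposing a
        # half-initialized object.
        if self._value is Promise._UNRESOLVED:
            raise RuntimeError("promise not yet resolved")
        return self._value


def define_cyclic(make_value):
    """Model `def name := expr`, where expr may mention name.

    make_value is called with the promise that stands for the name while
    the right hand side is still being evaluated.
    """
    promise = Promise()              # the name is in scope before it has a value
    value = make_value(promise)      # evaluate the right hand side
    promise.resolve(value)           # the name now designates the value
    for index, item in enumerate(value):
        if item is promise:
            value[index] = value     # stand-in for the promise becoming direct
    return value


observations = []


def right_hand_side(t0):
    try:
        t0.get()                     # too early: nothing to observe yet
    except RuntimeError:
        observations.append("fail-stop")
    return [1, t0, 1]


root = define_cyclic(right_hand_side)
```

Code that touches the promise during construction gets an immediate error, which is the fail-stop behavior the list item above relies on.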

By using promises to reconstruct cycles, we safely avoid a host of hazards. Most other serialization systems [ref JOSS, XMLEncoder, BOSS, ...] are defined in systems without any kind of delayed references (promises, logic variables, futures), but which still allow user-defined unserialize-time behavior by the objects being unserialized. In any such system, when unserializing a cycle, user-defined behavior may interact with graph neighbors that are not yet fully initialized. Before an object is fully initialized, it may be easily confused. In a conventional system this is only a minor source of bugs. But in a graph of mutually suspicious objects this would be a major opportunity for an adversary. By referring to an object only with promises until it is fully initialized, we make such bugs safely fail-stop, enabling an adversary in the graph only to mount a denial-of-unserialization attack, which it can trivially do anyway, and which we make no effort or claim to prevent.

Data-E is a true subset of E, both syntactically and semantically. Since E is a secure object-capability language, the security issues surrounding evaluation of E programs are already well understood. By using a subset of E in this way, we get to leverage this understanding.