The ACSR has been improving the organization of generated TMA data for several years [8–10]. A sample TMA export illustrates the TMA DES format [See TA00-050.xml in Additional file 1]. Compare this to an improved version for the same TMA during the discussion below [See TA00-050_recordered.xml in Additional file 1].
Designing the DTD
The XML document type definition (DTD) for the TMA DES must represent the rules given in the specification and the 80 CDEs defined in the associated Tissue MicroArray Common Data Elements document. This DTD [See tmades.dtd in Additional file 1] accommodates the following special circumstances:
• The TMA DES permits locally needed data element (LDE) definitions to extend those defined in it. A thorough DTD must specify every element. When such elements are added, definitions must be added to the DTD locally. The next section, Extending the DTD, explains this further.
• The order of elements within a parent element is never constrained, yet the presence of at least one of certain CDEs is mandated. The DTD language cannot always express a required number of elements when their order is unconstrained, because the DTD must be deterministic, i.e. a parser looking at successive elements must have only one path forward through the rules. The section called Mandating CDEs discusses this further.
• The TMA DES specifies a hierarchical nesting arrangement for the CDEs defined in it, but it does not define specific values for them. The content of elements is therefore specified as parsed character data (#PCDATA).
Extending the DTD
Our DTD provides for extensions to the specification to add data elements not in the specification. We made DTD extensions to implement those needed at the Mid-Region ACSR, which serve as examples of how to implement the TMA DES.
Our design uses a single external DTD file with selectable modes. It contains the definitions that implement the API-proposed TMA DES together with our proposed improvements and additions to it. We defined three place-holder entities (block_LOCAL-TAGS, core_LOCAL-TAGS, slide_LOCAL-TAGS) that can be redefined in the internal portion to extend the specification with locally defined elements (LDEs) [See myLDEs.dtd in Additional file 1]. We illustrate how to place these definitions in a separate file and include it in multiple XML files to reduce maintenance.
Suppose that age and gender are tracked for cores in TMA blocks at an institution. There are no CDEs in the specification to hold this data. These internal LDE and entity definitions can be shared by several TMA blocks because they are placed in a file called myCoreTags.dtd:
<!ELEMENT core_AGE (#PCDATA)>
<!ELEMENT core_GENDER (#PCDATA)>
<!ENTITY % core_LOCAL-TAGS "(core_AGE | core_GENDER)">
This external DTD file fragment defines core_LOCAL-TAGS as a place-holder:
<!ELEMENT core_array-id (#PCDATA)>
<!ELEMENT place-holder (#PCDATA)>
<!ENTITY % core_LOCAL-TAGS "(place-holder)">
<!ELEMENT core (core_array-id, (%core_STD-TAGS; |
%core_LOCAL-TAGS;)*)>
Multiple XML files, as in the following example, can reference the public external DTD file as well as the locally shared internal DTD file, which eliminates the maintenance of multiple copies. (Usually a public URI is specified: http://www.acsr.mid-region.org/tma/tmades.dtd. The file name is used here to simplify the example.)
<!DOCTYPE histo SYSTEM "tmades.dtd" [
<!ENTITY % noLDEs 'IGNORE' >
<!ENTITY % myCoreTagFile SYSTEM "myCoreTags.dtd">
%myCoreTagFile;
]>
When an entity is declared more than once, the first declaration encountered is used, and the internal subset is read before the external DTD. The external definition of core_LOCAL-TAGS as a place-holder is thus ignored when an internal one is provided. A similar arrangement is shown in Figure 2.
In this arrangement, the TMA DES export (and style) DTD is always tmades.dtd, the TMA LDE DTD could be myCoreTags.dtd, and the TMA DES export XML would then contain the document type declaration shown above.
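As a minimal sketch of what this extension permits, a core element in an exported file could then carry the two locally defined elements alongside the standard ones. The identifier, age and gender values below are invented for illustration and do not come from any real export.

```xml
<!-- Hypothetical core fragment; all values are made up for illustration. -->
<core>
  <!-- core_array-id comes first, as in the core content model above -->
  <core_array-id>A1</core_array-id>
  <!-- LDEs admitted through the redefined core_LOCAL-TAGS entity -->
  <core_AGE>54</core_AGE>
  <core_GENDER>F</core_GENDER>
</core>
```

Without the internal redefinition, a validating parser should reject core_AGE and core_GENDER, assuming they are not among the standard core tags, because the external DTD admits only the place-holder in their position.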
Mandating CDEs
In the following examples, assume there is an entity called header_unlimited_items that contains all of the elements in the header except filename as follows:
<!ENTITY % header_unlimited_items "(
Title |
Creator |
Subject |
Keywords |
Description |
Publisher |
Contributer |
Date |
Resource_Type |
Format |
Resource_Identifier |
Source |
Language |
Relation |
Coverage |
Rights_Management)">
Here is an example of a problem situation. There is a list of CDEs that may appear inside the header CDE. The filename CDE is in this list and has a maximum occurrence of one. All other CDEs in this list have a maximum occurrence that is unlimited. The order of elements is not limited and no CDEs are required. Here is a straightforward way to describe this in a DTD:
<!ELEMENT header (
(%header_unlimited_items;)*,
filename?,
(%header_unlimited_items;)*)>
Stated in English: a header is zero or many of the non-filename elements, followed by a single optional filename, followed by perhaps more non-filename elements. Although this faithfully describes our situation, a parser will decide that this definition is non-deterministic because for some input there are multiple paths through the definition (e.g. when there are two Title elements and no others, the term before the filename, the term after it, or both can be used to match them).
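The ambiguity can be seen with a small instance. The header below is a hypothetical fragment with invented title text, shown only to make the two competing paths concrete.

```xml
<!-- Hypothetical header; the title text is invented. -->
<header>
  <!-- Each Title could match the (...)* term before the optional filename
       or the (...)* term after it, so the parser has no single path. -->
  <Title>First title</Title>
  <Title>Second title</Title>
</header>
```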
The following definition faithfully represents the allowable header contents and is deterministic but is not as easy to understand:
<!ELEMENT header (
(filename, ((%header_unlimited_items;)*)) |
(((%header_unlimited_items;)+),
((filename, ((%header_unlimited_items;)*)))?))>
Stated in English: a header is either:
• a filename followed by zero or many of the non-filename elements or
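• this sequence: one or more of the non-filename elements, optionally followed by a filename and then zero or many of the non-filename elements.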
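As an illustration of the deterministic definition, the following hypothetical header takes the second branch. All element values are invented.

```xml
<!-- Hypothetical header; all values are invented. -->
<header>
  <!-- one or more non-filename elements ... -->
  <Title>Example array</Title>
  <!-- ... then the single optional filename ... -->
  <filename>example.xml</filename>
  <!-- ... then zero or many further non-filename elements -->
  <Creator>Example creator</Creator>
  <Title>Alternate title</Title>
</header>
```

Adding a second filename anywhere in this header should make it invalid, since neither branch of the definition admits more than one.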
If the language is changed slightly to say that any number of every header element is allowed, a simple definition is possible:
<!ELEMENT header ((filename | %header_unlimited_items;)*)>
Likewise, if a variety of numbers of header elements is necessary (with some restricted differently than others) but it is acceptable to mandate an order, a simple definition is possible:
<!ELEMENT header (filename, Title*, Creator+, Subject?, ...)>
There are other elements that have a situation similar to the header:
• A tma is required to have one or more headers and one or more blocks but the order is unconstrained.
• A block is required to have one or more slide elements, one or more core elements and optionally any number of (non-parent) block elements in an unconstrained order. (In fact, the requirement for at least one slide and one core may be only for the entire file and not per block. A very convoluted DTD would be required to support a per file restriction.)
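The same branching technique used for the header can be applied to the tma rule. The declaration below is a sketch written for this discussion, not a copy of the declaration in tmades.dtd, and it assumes that header and block are the only children of tma.

```xml
<!-- Sketch only, not taken from tmades.dtd.
     Branch 1 starts with a header and must later reach a block;
     branch 2 starts with a block and must later reach a header.
     The first child element selects the branch, so no lookahead is needed. -->
<!ELEMENT tma (
  (header+, block, (header | block)*) |
  (block+, header, (header | block)*))>
```

Together the two branches accept every sequence that contains at least one header and at least one block, in any order. The block rule would need a larger set of branches because it mandates two kinds of children and allows a third.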
A DTD that facilitates validation can also communicate the details of what is acceptable in the XML application language. It is desirable to allow as much flexibility as possible so as not to restrict the style that TMA users might like to employ. Simplifications that make the language easier to understand are also desirable, and at times the two goals must be traded off against each other.
Improving the DTD
While the DTD implements the specification as initially proposed, some improvements were added for conditional use based on the suggestions in [10]. The external DTD can be used in one of two modes by defining one as INCLUDE and the other as IGNORE:
unimproved (default) mode – Enforces the TMA DES rules as proposed, notably:
• Allows multiple header elements within a tma which can be interspersed with block elements within that tma. Ambiguous situations are possible: As each header can have a filename element, what would it mean to have multiple filenames for a single file? Should we associate a certain header with a certain block? How could we tell which?
• Allows at most one filename element to be anywhere within header, if present. The DTD is more difficult to encode and understand unless the filename must be first (or must be an attribute).
• None of these identifiers are required: filename in header, block_identifier in block, slide_identifier in slide, or core_array-id in core. This leaves no certain way to refer to files, blocks, slides and cores. The absence of core_array-id leaves no certain way to locate the core in the array.
• Enforces that at least one block, one slide, and one core element are required. These elements may be empty and serve no purpose. For example, when a TMA block has been constructed but no slides have yet been made from it, and its data is exported to another institution, a slide element must appear in the export although no slide exists. Enforcing the presence of these elements does not ensure better use of the specification or that data is actually provided.
improved mode – Enforces the TMA DES rules with the following changes:
• Enforces that there can be at most a single header element within a tma which must precede all block elements within that tma.
• Enforces that filename is the first element within header, if present.
• Enforces that a single identifier is present as the first element within each parent (block_identifier in block, slide_identifier in slide, core_array-id in core). Human readability is improved if an identifier is at the beginning of each element.
• No block, slide, or core elements are required. It is expected that every block, slide and core for which data is to be provided will have a corresponding element containing that data. This expectation does improve usage of the specification.
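One way such mode switching can be written is with DTD conditional sections controlled by parameter entities. The fragment below is a sketch under stated assumptions: the entity names improved and unimproved are hypothetical and may differ from those in tmades.dtd, and header_unlimited_items is assumed to be defined as in the Mandating CDEs section.

```xml
<!-- Sketch of an external DTD fragment. The entity names "improved" and
     "unimproved" are assumptions, not necessarily those in tmades.dtd. -->
<!ENTITY % unimproved "INCLUDE">  <!-- default mode -->
<!ENTITY % improved "IGNORE">

<![%unimproved;[
<!-- filename may appear anywhere in the header, at most once -->
<!ELEMENT header (
  (filename, (%header_unlimited_items;)*) |
  ((%header_unlimited_items;)+,
   (filename, (%header_unlimited_items;)*)?))>
]]>

<![%improved;[
<!-- filename, if present, must come first -->
<!ELEMENT header (filename?, (%header_unlimited_items;)*)>
]]>
```

Under these assumptions, an XML file would select the improved mode by declaring improved as INCLUDE and unimproved as IGNORE in its internal subset. Because the first declaration of an entity is the one used, the internal values override the defaults in the external file.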
A program, the BrowseTMA Reorder XSLT script [see BrowseTMAReorder.xsl in Additional file 1], was added to BrowseTMA. It converts unimproved TMA DES XML data to improved data by reordering elements and, if needed, adding consecutively numbered identifiers. Several batch and JavaScript scripts are used to invoke the XSLT script [see BrowseTMAReorder.bat in Additional file 1], repair the XML file [see fixordered.js in Additional file 1] and invoke the Microsoft XSLT Parser [see xsltTest.js in Additional file 1].
The same DTD also contains the TMA style definitions. By default they are ignored; defining addStyles as INCLUDE will cause them to be available.
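A document type declaration that enables the style definitions might look like the following sketch. It reuses the histo root element and the tmades.dtd file name from the earlier example.

```xml
<!-- Internal subset that switches on the style definitions;
     the file name is used instead of a public URI to keep the example short. -->
<!DOCTYPE histo SYSTEM "tmades.dtd" [
  <!ENTITY % addStyles "INCLUDE">
]>
```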