Proposed XML Format For Aida 1D and 2D Histograms

Note:

This document provides a very brief overview of XML and details a proposed XML format for Aida 1D and 2D histograms, which when properly implemented would allow Aida histogram data to be easily transferred between applications.

This proposal was put forth by Tony Johnson and Paul Spence of the Stanford Linear Accelerator Center. Please send any comments to project-AIDA-dev@cern.ch.

Brief Overview of XML

The Extensible Markup Language (XML) is a document processing standard developed by the World Wide Web Consortium (W3C) that allows you to define your own document markup. An XML-compliant application that validates XML content needs to process two files: the XML document and the Document Type Definition (DTD).

An XML document file contains the document data, which is tagged with meaningful XML elements, some of which may contain attributes. An element with one attribute is marked with the form <ElementNameTag attribute1 = "attributevalue" > Element Body </ElementNameTag>. 

The DTD file specifies the rules for how the XML document elements, attributes, and other data are defined and logically related in an XML-compliant document. The DTD declares each element, its parent/child relationship to other elements, and the element's attributes. Comments in a DTD are written as <!-- This is a DTD comment -->.
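As a small illustration of the element form described above, the following Python sketch reads one element and reports its tag, attributes, and body. It uses the standard library module xml.etree.ElementTree, which is a non-validating parser, so the DTD is not consulted here. The element and attribute names are just the placeholders from the text, and the helper name describe is made up for this example.

```python
import xml.etree.ElementTree as ET

# One element with a single attribute, in the form shown in the overview.
# ElementNameTag and attribute1 are placeholder names, not part of any DTD.
SAMPLE = '<ElementNameTag attribute1="attributevalue"> Element Body </ElementNameTag>'


def describe(xml_text):
    """Return (tag, attributes, stripped body text) of the root element."""
    root = ET.fromstring(xml_text)
    return root.tag, dict(root.attrib), (root.text or "").strip()


if __name__ == "__main__":
    print(describe(SAMPLE))
```

Because this parser does not read the DTD, attribute defaults declared there (such as a default axis of x) would have to be supplied by the reading code itself.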

Proposed Aida XML Format

The sections below contain an example 1D and an example 2D histogram XML document that meet the proposed DTD specification, followed by the proposed DTD itself. It is recommended that you look through these files to understand the proposed XML data structure.

You may get a copy of the sample documents and the DTD by following these links and viewing the source: 1D Histogram XML Document, 2D Histogram XML Document, Proposed DTD.

1D Histogram XML Document Example

This is a small example of an Aida 1D histogram, with fixed bin widths, stored as an XML document.

<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE aida SYSTEM "aida.dtd">
<aida>
<histogram1d title="AIDA 1D Histogram">
<axis min="-3.0" max="3.0" numberOfBins="5">
</axis>
<outOfRangeData1d underflow="10" overflow="11"/>
<statistics>
<statistic name="Entries" value="9979"/>
<statistic name="EquivalentBinEntries" value="10000.0"/>
<statistic name="InRangeBinHeights" value="9979.0"/>
<statistic name="OutOfRangeBinHeights" value="21.0"/>
<statistic name="Mean" value="0.01065112392423203"/>
<statistic name="RMS" value="0.9923976726138446"/>
</statistics>
<data>
0,340.0,18.439088914585774,17.6,340
1,2364.0,48.620983124572874,22.0,2364
2,4533.0,67.32755750805164,45.76,4533
3,2393.0,48.91829923454004,12.3,2393
4,349.0,18.681541692269406,17.5,349
</data>
</histogram1d>
</aida>
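The following Python sketch shows one way a reader might pull the axis, out of range, and statistics information from a 1D histogram like the one above. It is an illustration under stated assumptions, not a reference implementation: the function names are invented for this example, the sample is an abbreviated copy of the document above (XML declaration, DOCTYPE, and data left out), and an axis without a variableWidthBins child is assumed to be divided into numberOfBins equal-width bins.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the 1D example above; the XML declaration, DOCTYPE,
# some statistics and the data element are left out to keep the sketch short.
SAMPLE_1D = """<aida>
<histogram1d title="AIDA 1D Histogram">
<axis min="-3.0" max="3.0" numberOfBins="5">
</axis>
<outOfRangeData1d underflow="10" overflow="11"/>
<statistics>
<statistic name="Entries" value="9979"/>
<statistic name="Mean" value="0.01065112392423203"/>
</statistics>
</histogram1d>
</aida>"""


def fixed_bin_edges(axis):
    """Return numberOfBins + 1 equally spaced edges between min and max.

    Assumption: an axis without variableWidthBins has equal-width bins.
    """
    low = float(axis.get("min"))
    high = float(axis.get("max"))
    nbins = int(axis.get("numberOfBins"))
    width = (high - low) / nbins
    return [low + i * width for i in range(nbins + 1)]


def read_statistics(hist):
    """Map each statistic name to its value, converted to float."""
    return {s.get("name"): float(s.get("value")) for s in hist.iter("statistic")}


def read_out_of_range_1d(hist):
    """Return (underflow, overflow), or None if the optional element is absent."""
    node = hist.find("outOfRangeData1d")
    if node is None:
        return None
    return float(node.get("underflow")), float(node.get("overflow"))


if __name__ == "__main__":
    hist = ET.fromstring(SAMPLE_1D).find("histogram1d")
    print(hist.get("title"))
    print(fixed_bin_edges(hist.find("axis")))
    print(read_statistics(hist))
    print(read_out_of_range_1d(hist))
```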

2D Histogram XML Document Example

This is a small example of an Aida 2D histogram stored as an XML document, with both axes set to variable width bins, and the bincontents listed as bin, height, error, entries.

<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE aida SYSTEM "aida.dtd">
<aida>
<histogram2d title="AIDA 2D Histogram">
<axis axis="x" min="-3.0" max="3.0" numberOfBins="5">
<variableWidthBins>
-3.0
-2.5
-0.7000000000000001
1.0
2.7
3.0
</variableWidthBins>
</axis>
<axis axis="y" min="-3.0" max="3.0" numberOfBins="5">
<variableWidthBins>
-3.0
-2.5
-0.7000000000000001
1.0
2.7
3.0
</variableWidthBins>
</axis>
<outOfRangeData2d>
0
0
2
5
2
0
0
0
4
3
1
0
0
0
3
11
3
0
0
0
2
5
5
0
</outOfRangeData2d>
<statistics>
<statistic name="EquivalentBinEntries" value="10000.0"/>
<statistic name="InRangeBinHeights" value="9954.0"/>
<statistic name="OutOfRangeBinHeights" value="46.0"/>
<statistic name="MeanX" value="0.004555075276661586"/>
<statistic name="RmsX" value="0.06733740836923806"/>
<statistic name="MeanY" value="0.021416261229759048"/>
<statistic name="RmsY" value="0.14476741686131506"/>
</statistics>
<bincontents order="xy">
bin, height, error, entries
</bincontents>
<data>
0,0.0,0.0,0
1,14.0,3.7416573867739413,14
2,37.0,6.082762530298219,37
3,8.0,2.8284271247461903,8
4,1.0,1.0,1
5,9.0,3.0,9
6,534.0,23.108440016582687,534
7,1371.0,37.027017163147235,1371
8,385.0,19.621416870348583,385
9,3.0,1.7320508075688772,3
10,28.0,5.291502622129181,28
11,1365.0,36.945906403822335,1365
12,3647.0,60.390396587537,3647
13,945.0,30.740852297878796,945
14,17.0,4.123105625617661,17
15,11.0,3.3166247903554,11
16,368.0,19.183326093250876,368
17,946.0,30.757112998459398,946
18,250.0,15.811388300841896,250
19,2.0,1.4142135623730951,2
20,0.0,0.0,0
21,3.0,1.7320508075688772,3
22,8.0,2.8284271247461903,8
23,2.0,1.4142135623730951,2
24,0.0,0.0,0
</data>
</histogram2d>
</aida>
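The bincontents element names the columns of each data row, so a reader can pair the two. The Python sketch below illustrates this on an abbreviated fragment of the 2D example above (only one axis and the first three data rows are kept, so the fragment itself would not validate against the DTD). The function names are invented for this example, and treating an "ignore" column as one to be skipped is an assumption about its intended meaning.

```python
import xml.etree.ElementTree as ET

# Abbreviated fragment of the 2D example above (one axis, three data rows).
SAMPLE_2D = """<histogram2d title="AIDA 2D Histogram">
<axis axis="x" min="-3.0" max="3.0" numberOfBins="5">
<variableWidthBins>
-3.0
-2.5
-0.7000000000000001
1.0
2.7
3.0
</variableWidthBins>
</axis>
<bincontents order="xy">
bin, height, error, entries
</bincontents>
<data>
0,0.0,0.0,0
1,14.0,3.7416573867739413,14
2,37.0,6.082762530298219,37
</data>
</histogram2d>"""


def variable_edges(axis):
    """Return the listed bin edges, or None for an axis without them."""
    node = axis.find("variableWidthBins")
    if node is None:
        return None
    return [float(token) for token in node.text.split()]


def read_rows(hist):
    """Return (column names, list of row dicts) for one histogram element."""
    names = [n.strip() for n in hist.find("bincontents").text.split(",")]
    rows = []
    for line in hist.find("data").text.splitlines():
        if not line.strip():
            continue  # skip the blank lines next to the opening and closing tags
        values = line.split(",")
        if len(values) != len(names):
            raise ValueError("row does not match bincontents: %r" % line)
        # Assumption: a column named "ignore" is present in the row but unused.
        rows.append({n: float(v) for n, v in zip(names, values) if n != "ignore"})
    return names, rows


if __name__ == "__main__":
    hist = ET.fromstring(SAMPLE_2D)
    print(variable_edges(hist.find("axis")))
    print(read_rows(hist))
```

A reader built this way can also check that a variable width axis lists numberOfBins + 1 edges before accepting the file.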

Proposed DTD

This is the DTD file which specifies the rules used to make the preceding XML documents. The DTD details the precise proposed data structure for Aida 1D and 2D histograms.

<?xml version="1.0" encoding="ISO-8859-1" ?> 
<!--This is the document type definition file for hep.aida histogram xml files-->


<!--
The aida element parents 1 or more histogram1d and histogram2d elements. It serves the same purpose
as, for example, the opening and closing HTML elements do in HTML: to identify a file as
conforming to the appropriate Document Type Definition (DTD). The aida element has no attributes.
-->
<!ELEMENT aida (histogram1d|histogram2d)+>

<!--
The histogram1d element is parent to an axis, an outOfRangeData1d, a statistics, a bincontents, and a data element, which together contain
all of the available information pertaining to 1d histograms. The outOfRangeData1d and statistics elements are optional.
The histogram1d element has the histogram's title as its only attribute.
-->
<!ELEMENT histogram1d
(axis,outOfRangeData1d?, statistics?, bincontents, data)
>

<!ATTLIST histogram1d
title CDATA #REQUIRED 
>

<!-- 
The histogram2d element is parent to two axis elements, an outOfRangeData2d, a statistics, a bincontents, and a data element, which together contain
all of the available information pertaining to 2d histograms. Two axis elements are required for 2d histograms. The axis elements should always be listed as the first child elements of histogram2d in the xml file so that the number of bins and bounds can be determined before the bin data is parsed. The outOfRangeData2d and statistics elements are optional. The histogram2d element has the histogram's
title as its only attribute.
-->
<!ELEMENT histogram2d
((axis, axis), outOfRangeData2d?, statistics?, bincontents, data)
>

<!ATTLIST histogram2d
title CDATA #REQUIRED
>

<!--
The axis element is parent to the optional element variableWidthBins, and has
the following attributes, which are used to define the axis:

axis = which axis does this apply to? (currently x or y), default of 
x is assumed if the attribute is not found in the xml file.
min = the smallest value on the axis.
max = the largest value on the axis.
numberOfBins = number of bins to divide the axis into (integer).
-->
<!ELEMENT axis (variableWidthBins?)>
<!ATTLIST axis
axis CDATA "x"
min CDATA #REQUIRED
max CDATA #REQUIRED
numberOfBins CDATA #REQUIRED

>

<!--
The variableWidthBins element details the bin edges for an axis containing variable width bins.
This element should only be used if the histogram was created with variable width bins. Between
the element's opening and closing tags should be a list of numbers, one per line, which
define the bin edges used. The list should include the min and max values of the axis.
-->
<!ELEMENT variableWidthBins (#PCDATA)>

<!--
The outOfRangeData1d element has no children. It has two attributes, overflow and underflow, which are used to record
the number of entries in the underflow and overflow bins for 1D histograms. 
-->
<!ELEMENT outOfRangeData1d EMPTY>

<!ATTLIST outOfRangeData1d
underflow CDATA #REQUIRED
overflow CDATA #REQUIRED
>

<!--
The outOfRangeData2d element has no children or attributes. Between the element's opening and closing tags
should be a list of numbers, one per line, specifying the number of entries in the out of range
bins for 2D histograms. The out of range data should be entered in a counterclockwise direction starting
from the bottom left corner, as shown in the example diagram below:

          <<<<<<
        | xxxxxx ^       0 = in range bin data
        | x0000x ^       x = out of range bin data
        e x0000x ^
        n x0000x ^
        d xxxxxx ^
start:    >>>>>>

-->
<!ELEMENT outOfRangeData2d (#PCDATA)>

<!--
The statistics element has no attributes and acts as parent to any number of statistic
elements.
-->
<!ELEMENT statistics (statistic*)>

<!--
The statistic element has no children. Its two attributes, name and value, define the statistic.
The value attribute should be castable to a double type. 
-->
<!ELEMENT statistic EMPTY>
<!ATTLIST statistic
name CDATA #REQUIRED
value CDATA #REQUIRED
>

<!-- The bincontents element is used to describe the bin data supplied for the histogram. It has no children and a single attribute, 'order', which can have the values "xy" or "yz". This attribute is only necessary for 2d histograms, as it specifies the order in which the bin data is listed. Between the element's opening and closing tags should be a comma separated list of strings that describe each number that is supplied for the bin data. The acceptable string values are bin, binx, biny, binz, height, error, pluserror, minuserror, entries, ignore, x, y, z. The string values x, y, z, bin, binx, biny, binz are used to specify which bin the data belongs to. All other strings relate to the actual bin data. The parsing of the data between the opening and closing tags is left to the file reader. -->

<!ELEMENT bincontents  (#PCDATA)>
<!ATTLIST bincontents order (xy|yz)  #IMPLIED>


<!--The data element has no children but instead contains #PCDATA (which is essentially unstructured character data).
This data is the bin data... and there is a LOT of it. A typical 2D histogram has around 2000 bins, and a typical 2D scatterplot has around
20,000 points. In theory we could make each bin or point an XML element, but this would be a Very Bad Idea for these reasons:

1. It would waste a ton of space. Although XML is not terse, this would be excessive even by XML standards. Visualize this repeated 2000 or
20,000 times: <bin contents="0.0"/> That's a lot of characters when we only need three ("0.0").

2. Many implementations (such as the one in JAS) will parse plotML files using the DOM. Having a tree with 2000 or 20,000 nodes (one for each
bin or point) is absurd.

So we will define our own format of this data rather than use XML to do the job for us. Yes, this means that aida xml parsers have to do the extremely trivial job of parsing this data themselves.

Format Rules:

1) The data for each bin is ALWAYS on its own line.

2) The bin data is a row of numbers with comma delimiters. It is left up to the user to make sure that the order of the numbers matches the order specified by the bincontents element. For example, if the bincontents element lists the following between its opening and closing tags: bin, height, pluserror, minuserror, entries, then each row of bin data should look like ####, ####, ####, ####, ####.

3) The data SHALL NOT be tabbed in to make it look nicely formatted. This, although easy to do, would be a very inefficient use of storage. So it
is NOT the responsibility of implementations to strip leading whitespace.

4) The bin data for 1d histograms should be ordered with the data for the bin at the axis' lowest edge entered first
and the data for the bin at the axis' upper edge entered last.

5) The bin data for 2d histograms should be ordered (as shown in the diagram below) by starting from the x axis' lowest edge bin
and proceeding from the y axis' lower edge bin to the y axis' upper edge bin, then moving to the next
lowest bin on the x axis and proceeding back up the y axis.

     | ^  ^  ^  ^
     | ^  ^  ^  ^
   y | ^  ^  ^  ^
     | ^  ^  ^  ^
     | ^  ^  ^  ^
     |_^__^__^__^__
start:   x  -->


<!ELEMENT data (#PCDATA)>
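The two ordering conventions described in the DTD comments above (format rule 5 for 2D bin data, and the counterclockwise listing for outOfRangeData2d) can be sketched in Python as follows. This is one reading of the comments, not a definitive implementation: the function names are invented, the coordinates -1 and nx (or ny) are used here to label the underflow and overflow positions, and the way the four corners are assigned to the sides of the walk is an assumption, since the diagram does not spell it out.

```python
def bin_to_xy(index, ny):
    """Convert a linear 2D bin number to (ix, iy).

    Follows format rule 5: all y bins of the lowest x bin come first,
    then all y bins of the next x bin, and so on.
    """
    return divmod(index, ny)


def out_of_range_cells(nx, ny):
    """List out of range cells counterclockwise from the bottom left corner.

    In range bins have 0 <= ix < nx and 0 <= iy < ny; out of range cells
    use -1 (underflow) or nx / ny (overflow) as a coordinate.
    """
    cells = [(ix, -1) for ix in range(-1, nx + 1)]        # bottom row, left to right
    cells += [(nx, iy) for iy in range(0, ny + 1)]        # right column, going up
    cells += [(ix, ny) for ix in range(nx - 1, -2, -1)]   # top row, right to left
    cells += [(-1, iy) for iy in range(ny - 1, -1, -1)]   # left column, going down
    return cells


if __name__ == "__main__":
    print(bin_to_xy(12, 5))
    print(out_of_range_cells(5, 5))
```

For the 5 by 5 example histogram this walk visits 24 cells, which matches the 24 numbers listed in the outOfRangeData2d element of the 2D example.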