Geographic Data Resources

Formats for Geographic Data

Geographic data formats are the vehicles that we use for encoding and exchanging observations and ideas about places and things that happen in space. At their foundation, geodata provide containers for storing ordered references that relate to entities and events -- references such as Names, Dates, Numbers and Locations. The other critical role of data formats in research and scholarship is as a means of engaging with a variety of tools that create associations among the entities represented in data.

One of the more interesting aspects of human evolution has been the continuing improvement in our ability to encode and exchange information: from the establishment of Latitude and Longitude as a means of encoding references to earth location more than 2000 years ago, to more recently established formats for storing geographically referenced images, such as JPEG, or the means for sensing and encoding four-dimensional time series information. There is a steady churn of innovation in geographic data formats. Because research involves gathering data from many sources of different vintages, and exchanging these data among a wide variety of specialized tools, it is important to understand the trade-offs that one makes when working with different data formats.


As we look more deeply into these structures, we will see that in each of the formal categories of data structure (Tabular, Vector and Raster) there are many choices in terms of the technical implementation of the way that they are encoded. In choosing one encoding scheme over another, we often trade one virtue for another. For example, choosing an old data format that is stable and easily exchanged may mean foregoing some useful features that are available in newer formats that are in active development.

Several more considerations may come into play when evaluating a particular data format, among them ease of use, semantic capacity, interoperability, stability, and openness.

There is so much to say about the relationships and the trade-offs involved with these considerations that the best way I can think of to communicate it all is by telling a story that explains some of the origins of the problem of data encoding and exchange and some of the work that has got us where we are. The same illustration will show us where the continuing innovation of data encoding and exchange standards is taking us. Or perhaps more precisely -- where we are taking it!


Numbers and Text to Tables, to Feature Classes

A Tale of Semantic Capacity and Interoperability

Some of the earliest known (prehistoric) means of encoding and exchanging geographic information involved simple narrative explanation, in tales that were told and repeated or embossed on clay tablets. Records of observations for more systematic navigational use sometimes involved knots on a piece of string, as with a Kamal, or arrangements of sticks and shells, as created by Pacific islanders to represent the spatial relationships of islands, including wave patterns.

The Value of Written Encodings

Consider the great leap forward that happened when Hipparchus established a convention of referring to Latitude and Longitude using angles. With this innovation, people could easily jot down their observations, save them for later, and share them with others. The written encoding scheme is easier to duplicate faithfully. And it is also more hygienic than knots on a Kamal string!

Ease of duplication and exchange is one of the important factors determining the overall utility of a means of encoding. The ability to attach time references to events according to commonly recognized calendars is another such breakthrough. The power of spatial and temporal referencing systems is their ability to carry meaning (Semantic Capacity). An interesting aspect of these systems is that spatially and temporally referenced information from different sources may be put together to create meaningful patterns and relationships. To some degree, this interoperability of data requires that the people making these inferences, or their tools, have some understanding of the domain of the references, such as Latitude, Longitude and Time. For example, we know that to be precise, Latitude and Longitude references must be elaborated with an Earth Model, and time references must specify a time zone.
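Both of those requirements can be made explicit in code. The sketch below uses the pyproj library and Python's standard datetime module (neither is prescribed by anything above; they are simply convenient illustrations) to show a coordinate reference system that names its earth model and a timestamp that names its time zone:

    from datetime import datetime, timezone
    from pyproj import CRS

    # A latitude/longitude pair is only fully defined relative to an earth model (datum).
    wgs84 = CRS.from_epsg(4326)   # EPSG:4326 -- latitude/longitude on the WGS 84 datum
    print(wgs84.datum)

    # Likewise, a timestamp is only unambiguous once its time zone is stated.
    observed = datetime(1999, 6, 1, 9, 30, tzinfo=timezone.utc)
    print(observed.isoformat())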

We will jump into this story with the advent of computers, skipping over the whole epoch of telegraphs, Morse code and teletypes. We will pick up the story with the development of an early encoding for the characters that can be typed on a normal American typewriter: ASCII, the American Standard Code for Information Interchange. ASCII is an early example of encoding schemes that translate the characters used in text into ones and zeros that could be conveyed as pulses on electrical wires between teletype machines. Later this also proved useful for exchanging text between computers. ASCII used clusters of 7 bits to express 128 characters. In time it has been replaced by a system called Unicode, which can use as many as 21 bits and different character mappings, so that it is in effect indefinitely extensible. In this case, the increase in the number of bits used to convey a character deepened the semantic capacity of the encoding and exchange format.
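A minimal illustration of that difference in capacity, sketched in Python (the particular characters chosen are arbitrary):

    # ASCII maps each of 128 characters onto a 7-bit pattern.
    print(ord('A'), format(ord('A'), '07b'))      # 65 '1000001'

    # Unicode assigns code points well beyond those 128; an encoding such as
    # UTF-8 spends additional bytes to carry them.
    for ch in ('A', 'é', '日'):
        print(ch, hex(ord(ch)), ch.encode('utf-8'))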

The ability to easily pass useful messages to one another is a truly momentous aspect of the evolution of our species. What is more interesting is that we are very much on the frontier of this development, with a lot of very clear problems left to work out! To take this story forward we will look at how a humble text file can carry a representation of all of the businesses in Somerville, Massachusetts in 1999. We will take this text file through a series of transformations, up a ladder of more sophisticated encodings, to see how the utility of this model of businesses can be enhanced and exploited in various tools. Each of these transformations will involve specific trade-offs in terms of Ease of Use, Semantic Capacity, and Interoperability.

Note: The following section is under construction. It is intended to accompany an actual in-class demonstration.

Introducing Our Plain Text File

Click here for an example comma-delimited text file. Notice that the first row contains values that define the names of the column headings. The values are separated (or delimited) by commas. The succeeding rows are sequences of values that would fill in the cells of a table. We will have more to say about tables as representations of classes of features, each of which is distinguished by values for a fixed set of attributes. You can see that some of these values are spatial references: Street Addresses, Zip Codes, Town Names, Latitude, Longitude. You may not spot the dates at first, but these are the values, e.g. "9812", that fall under the column heading "Since".
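The file itself is linked above rather than reproduced here, but a fragment of a comma-delimited file of this sort might look something like the following (the column names other than Since and Zip_5, and the rows themselves, are invented for illustration):

    Name,Address,Town,Zip_5,Since,Latitude,Longitude
    Acme Hardware,123 Broadway,Somerville,02145,9812,42.3954,-71.1003
    Union Sq Diner,45 Somerville Ave,Somerville,02143,9506,42.3798,-71.0963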

As a plain text file, this information may be easily transferred from one computer to another and can be opened in a very wide variety of tools, some of which will recognize the comma-delimited values and will create a table of references that may be operated on systematically. To demonstrate this interoperability of plain text files, we will open the text file in ArcMap using the Add Data button. Then we can right-click it and choose Open to open it in the table viewer, and we can choose Properties to examine the properties that ArcMap has assigned to the table's columns and values.

I'll list a few of the things that happen in this exchange of plain text to ArcMap. Several of the columns have been appropriately assigned to text or numeric data types. ArcMap actually makes guesses about these based on the contents it sees as it scans the first several values in each column. ArcMap gets this wrong, however, when it interprets the Zip_5 column near the far right-hand side of the table. Zip codes are made of numeric characters (numerals), so ArcMap has decided to translate them as numbers. As a result, the leading zeros in Somerville's zip codes have been thrown away. Everyone knows that a leading zero in a number is meaningless. But zip codes are not numbers! The dates in the column Since are an interesting case. These are also treated as numbers. Adding and subtracting the dates as they are would not yield any useful result, but sorting them will put the rows into a rough chronological order.
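This type-guessing behavior is not peculiar to ArcMap, and neither is the remedy of declaring the intended data type explicitly. The sketch below shows the same problem and fix using the pandas library, assuming the text file has been saved as businesses.csv (the file name is an assumption):

    import pandas as pd

    # Left to guess, the reader decides Zip_5 is numeric, so Somerville's
    # leading zeros (e.g. "02145") are silently dropped.
    guessed = pd.read_csv("businesses.csv")
    print(guessed["Zip_5"].head())

    # Declaring the columns as text preserves the values exactly as written.
    explicit = pd.read_csv("businesses.csv", dtype={"Zip_5": str, "Since": str})
    print(explicit["Zip_5"].head())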

As you can see, some of the meaning that was encoded in our comma-delimited text file has been lost in the translation to ArcMap. This situation is less than optimal. We will soon see a more robust way of encoding various logical data types for more reliable exchange as we transform this text table, first into an Events Layer in ArcMap by right-clicking it and choosing Display XY Data, and then by exporting this as a shape file feature class by right-clicking and choosing Data > Export Data.
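Outside of ArcMap, the same transformation from a table of coordinates to a feature class can be scripted. Here is a rough sketch using the geopandas library; the column names Latitude and Longitude and the file names are assumptions about the example data, not something fixed by the format:

    import pandas as pd
    import geopandas as gpd

    df = pd.read_csv("businesses.csv", dtype={"Zip_5": str})

    # Build point geometries from the coordinate columns and declare the
    # earth model (WGS 84) so the spatial references are unambiguous.
    gdf = gpd.GeoDataFrame(
        df,
        geometry=gpd.points_from_xy(df["Longitude"], df["Latitude"]),
        crs="EPSG:4326",
    )

    # Export the events as a shape file feature class.
    gdf.to_file("businesses.shp")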

There are lots of things we can do with a shape file in ArcMap that we cannot do with a plain text file. For example, we can add new columns of various numeric types, as well as temporal references and, of course, the very useful spatial references that can be Points, Lines or Polygons. In class we could demonstrate how each of these sorts of references has its own logic that we can use to explore associations between the businesses represented in the table in terms of their locations, or similarities in terms of their dates or Standard Industrial Classification codes (in a future workshop!). Shape files are an amazing thing; Hipparchus of Rhodes, the inventor of Latitude and Longitude, and Claudius Ptolemaeus, the inventor of map projections, would definitely applaud this means of encoding.

In class we will look at how a shape file is actually a complex of files. The tabular information is stored in a DBase format file, a very early way of encoding tables invented in the 1970s by a company that has since gone defunct. One of the advantages of DBase as an information vehicle -- owing to the fact that it effectively has no owner -- is that the format has not changed for decades. DBase does what it does, and will continue to do so. The recipe for reading and writing a DBase file from scratch is well known and open source. The fact that the DBase format is stable and open source makes it a very interoperable format with much better semantic carrying capacity than plain text files. You can trust DBase as a container for your precious observations. DBase has become a de facto standard for the encoding and exchange of tables.

At this point in the live demonstration, we will create a polygon graphic in ArcMap and then create a shape file from it using the Convert Graphics to Features option in the Draw toolbar. We will then use this shape file to explore spatial relationships with the businesses shape file.
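A scripted stand-in for that kind of exploration might look like the sketch below, again using geopandas; the study-area polygon coordinates are made up for illustration:

    import geopandas as gpd
    from shapely.geometry import Polygon

    businesses = gpd.read_file("businesses.shp")

    # A hand-drawn study area, stood in for here by an invented rectangle (lon/lat order).
    study_area = gpd.GeoDataFrame(
        geometry=[Polygon([(-71.11, 42.38), (-71.09, 42.38), (-71.09, 42.40), (-71.11, 42.40)])],
        crs=businesses.crs,
    )

    # Which businesses fall inside the study area?
    inside = gpd.sjoin(businesses, study_area, predicate="within")
    print(len(inside), "businesses inside the study area")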

To simplify what is undoubtedly a much richer story: in the early 1990s ESRI (the creators of ArcMap) extended DBase by adding extensions that link geometric objects (stored in the .shp file) with the attribute information stored in the DBase file. This is accomplished by an index file (.shx). Later in the 1990s they added another extension (.prj) that carries information about the projection of the coordinates in the .shp file. More recently, a shape file can also be accompanied by an .xml file that contains related metadata. To emulate the success of DBase, ESRI has published all of the recipes for reading and writing the shapefile extensions and has promised not to make changes to these. As a result, shape files are considered a stable standard, and many content creators and tool makers have put a lot of value into creating content in shape files.
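You can see this bundle of sidecar files for yourself by listing everything that shares the base name; for example (assuming the hypothetical businesses shape file created above):

    import glob

    # A "shape file" is really several files sharing one base name.
    for path in sorted(glob.glob("businesses.*")):
        print(path)   # e.g. businesses.dbf, businesses.prj, businesses.shp, businesses.shx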

Stability has its drawbacks, however. For one thing, there are aspects of DBase files and shape files that did not seem like a problem back in the day, including the fact that DBase and shapefile datasets can't be larger than 2 gigabytes, and field names can't be longer than 10 characters. And because a shape file is made of all of these separate files, it can easily become corrupted through mishandling. The stability that makes these formats default standards turns out to be a barrier to innovation.

So, while ESRI locked in the capabilities of shapefiles, they commenced work on new containers for storing tables that hold geometric features and a variety of attribute types. Their first attempt at this was called the Personal Geodatabase. This format was an extension of the Microsoft Access database file format, .mdb. After a while, ESRI decided to create a brand new container called the File Geodatabase. File geodatabases have tons of useful features, too numerous to mention here, but you can read about all of them in this blurb from ESRI.

Storing your feature classes in a File Geodatabase makes a lot of sense -- when you are planning on using them or exchanging them strictly within the world of ESRI software. If you are concerned about interoperability with other tools, then you ought to think twice about relying on File Geodatabases, and even on features such as long column names. The reason is that at the time of this writing (2014) the recipe for creating a file geodatabase is proprietary, which means that ESRI is the only company that makes tools that read and write this format. In a sense, this is a good thing, since it gives them maximum freedom to innovate. Nevertheless, users of GIS and creators of GIS data and tools need to look elsewhere for a means of stable, interoperable encoding and exchange (beyond shape files).

The example of the DBase, shapefile and geodatabase formats seems to indicate that encoding and exchange formats are stable and interoperable only when they are frozen and can no longer be extended. It may also appear that proprietary control of formats, while fostering innovation, is not compatible with interoperability. How can humanity have stable, open source standards that can evolve? For an answer to this, see the Open Geospatial Consortium. The OGC is an international standards organization that brings together content creators, tool makers and users to think about how to develop stable, open file formats for information encoding and exchange. The OGC has just (last January) approved a standard for a GeoPackage format that seems to have many of the features of a geodatabase but is open.
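Although GeoPackage itself is just a specification, open source tools built on GDAL can already read and write it. As a sketch, using geopandas and reusing the hypothetical businesses layer from earlier (the output file and layer names are invented):

    import geopandas as gpd

    gdf = gpd.read_file("businesses.shp")

    # GeoPackage is a single SQLite-based file that can hold multiple layers,
    # long field names, and datasets larger than the shapefile's 2 GB limit.
    gdf.to_file("somerville.gpkg", layer="businesses_1999", driver="GPKG")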


Tables

We have discussed how tables serve as a container for records about entities that can be distinguished by their attributes. We will discuss how this simple construction can lead to really interesting models later in the term. For now we will consider some of the pros and cons of different ways of authoring and exchanging tables. For more information, see Overview of Tables and Attribute Information from the ESRI Online Help.


Vector Data Formats

With the exception of the popular CAD formats, .dwg and .dxf, the vector formats supported in GIS are extensions of the tabular types discussed above. In effect, Points, Lines and Polygons merely extend the range of datatypes and associated logic traditionally offered in tables. The ISO standards for logical datatypes were extended in the mid-1990s to handle spatial data: points, lines, polygons, and surfaces. In the world of ArcMap, individual datasets representing collections of objects that all have the same type of feature are known as Feature Classes.
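To make the idea of spatial datatypes concrete, here is a sketch using the shapely library (the coordinates are arbitrary). Each geometry type carries its own logic, and each can be serialized as well-known text for exchange:

    from shapely.geometry import Point, LineString, Polygon

    pt = Point(-71.10, 42.395)
    ln = LineString([(-71.10, 42.39), (-71.09, 42.40)])
    pg = Polygon([(-71.11, 42.39), (-71.09, 42.39), (-71.10, 42.41)])

    print(pt.wkt)            # well-known text serialization, e.g. POINT (-71.1 42.395)
    print(ln.length)         # length, in coordinate units
    print(pg.contains(pt))   # a point-in-polygon test: True for these coordinates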


Raster Image Formats

While tabular data structures allow us to distinguish and form associations among different classes of discrete entities, raster images provide containers for representations of locations. Locations are identified by cells, or pixels, that can be associated with attributes. From the perspective of GIS, there are a couple of important aspects of rasters:
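As a minimal sketch of the raster idea (using numpy; the origin coordinates and cell size are invented), each cell holds an attribute value, and a simple georeferencing rule maps row and column indices to locations on the ground:

    import numpy as np

    # A raster is a grid of cells; each cell's value is an attribute of a location.
    elevation = np.zeros((4, 5))     # 4 rows by 5 columns of cells
    elevation[2, 3] = 17.5           # an attribute value for one cell

    # Georeferencing: an upper-left origin and a cell size map indices to coordinates.
    x_origin, y_origin, cell_size = 325000.0, 4693000.0, 30.0
    row, col = 2, 3
    x_center = x_origin + (col + 0.5) * cell_size
    y_center = y_origin - (row + 0.5) * cell_size
    print(x_center, y_center, elevation[row, col])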


Styling and Portrayal

Most data encodings used for vector GIS data do not contain a built-in way of specifying how features will be symbolized and labeled. The best argument for keeping such styling, or portrayal, information separate from the data encoding is that there are any number of ways that a person may want to style data. Furthermore, the specific options for doing this will vary depending on what sort of tools and delivery mechanisms may be involved. Therefore, it makes sense to keep this sort of information separate.
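One way to see this is that the very same encoded features can be portrayed any number of ways at display time. A small sketch (geopandas and matplotlib are illustrative choices here, not something the data formats require):

    import geopandas as gpd
    import matplotlib.pyplot as plt

    businesses = gpd.read_file("businesses.shp")

    fig, (ax1, ax2) = plt.subplots(1, 2)

    # Two portrayals of one and the same dataset:
    businesses.plot(ax=ax1, color="black", markersize=3)                   # plain dots
    businesses.plot(ax=ax2, column="Since", cmap="viridis", legend=True)   # shaded by a column
    plt.show()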


Data Models and Schema

Typical GIS databases that we collect from sources are fairly elemental, consisting of discrete collections of vector features or individual raster layers. However, it is also useful to consider how complexes of feature classes and rasters can be organized to make data models that are coherent in terms of the relationships among features, and potentially also engaged with rasters. Higher-order data collections that operate this way are thought of as Schema, in the sense of their abstract organization, or as Data Models when they are implemented and used. An advantage of thinking of schema in this way is that toolkits may be developed that draw inferences and perform experiments involving the constitution of elements and the relationships among them. Data models are discussed in more detail on the page Modeling for Decision Support and Scholarship.

Most schema that we make are relatively ad hoc. However, a very important movement in the field of GIS and other information management endeavors is to develop elaborate schema within organizations of people who will then be better able to exchange very deep information and tools. A hallmark of this movement is the use of XML (Extensible Markup Language), which is a sort of meta-schema. The development of XML schemas by communities of interest is driving a revolution in collaborative information models that can be systematically exchanged, known as the Semantic Web. A branch of the open source movement, these efforts usually involve cross-disciplinary collaborations in which participants understand that the development of a shared language will enhance their niche in the ever more diverse information ecology. There are several non-profit collaborations that are very active in developing very useful schema. For example, see the General Transit Feed Specification or CityGML.
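To give a flavor of what one of these community schemas looks like in practice: the General Transit Feed Specification defines a set of plain comma-delimited text files with agreed-upon column names, so that any agency's transit data can be read by any conforming tool. A stops.txt file, for instance, begins something like this (the rows shown are invented for illustration):

    stop_id,stop_name,stop_lat,stop_lon
    S001,Example Square Station,42.3954,-71.1003
    S002,Another Avenue at Main St,42.3898,-71.0927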