Geographic Data Resources
Formats for Geographic Data
Geographic data formats are the vehicles we use for encoding and exchanging observations and ideas about places and about things that happen in space. At their foundation, geodata formats provide containers for storing ordered references to entities and events -- references such as names, dates, numbers, and locations. The other critical role of data formats in research and scholarship is as a means of engaging with a variety of tools that create associations among the entities represented in data.
One of the more interesting aspects of human evolution has been the continuing improvement in our ability to encode and exchange information: from the establishment of latitude and longitude as a means of encoding references to earth locations more than 2,000 years ago, to more recently established formats for storing geographically referenced images (such as JPEG), to the means of sensing and encoding four-dimensional time-series information. There is a steady churn of innovation in geographic data formats. Because research involves gathering data of different vintages from many sources, and exchanging these data among a wide variety of specialized tools, it is important to understand the trade-offs one makes when working with different data formats.
- Understanding GIS Data, Referencing Systems and Metadata discusses a critical mindset for judging the quality and fitness of data for use in particular situations.
- ArcMap 10 Help on Data Formats Supported in ArcMap
As we look more deeply into these structures, we will see that within each of the formal categories of data structure -- tabular, vector, and raster -- there are many choices of technical implementation for the way data are encoded. In choosing one encoding scheme over another, we often trade one virtue for another. For example, choosing an old data format that is stable and easily exchanged may mean foregoing useful features available in newer formats that are in active development.
Here are several more considerations that may come into play when evaluating a particular data format.
- Ease of Handling: Are data represented as simple files that you can manage on your computer? Or do the data require special software tools and internet-based services?
- Semantic Capacity: Does the data format have a capacity to represent meanings and logical capabilities that meet your modeling needs?
- Stability & Interoperability: How confident are you that the format in question will be readable by a variety of tools for as long as you feel that your data may be of interest?
There is so much to say about the relationships and the trade-offs involved with these considerations that the best way I can think of to communicate it all is by telling a story that explains some of the origins of the problem of data encoding and exchange and some of the work that has got us where we are. The same illustration will show us where the continuing innovation of data encoding and exchange standards is taking us. Or perhaps more precisely -- where we are taking it!
Numbers and Text to Tables, to Feature Classes
A Tale of Semantic Capacity and Interoperability
Some of the earliest (prehistoric) known means of encoding and exchanging geographic information involved simple narrative explanation in tales that are told and repeated or embossed on clay tablets. Records of observations for more systematic navigational use sometimes involved knots on a piece of string as with a Kamal, or arrangements of sticks and shells as created by Polynesian islanders to represent spatial relationships of islands including wave patterns.
The value of Written Encodings
Consider the great leap forward that happened when Hipparchus established the convention of referring to latitude and longitude using angles. With this innovation, people could easily jot down their observations, save them for later, and share them with others. A written encoding scheme is easier to duplicate faithfully. It is also more hygienic than knots on a kamal string!
Ease of duplication and exchange is one of the important factors determining the overall utility of a means of encoding. The ability to attach time references to events according to commonly recognized calendars is another such breakthrough. The power of spatial and temporal referencing systems is their ability to carry meaning (semantic capacity). An interesting aspect of these systems is that spatially and temporally referenced information from different sources may be put together to reveal meaningful patterns and relationships. To some degree, this interoperability of data requires that the people making these inferences, or their tools, have some understanding of the domain of the references. For example, we know that to be precise, latitude and longitude references must be elaborated with an earth model, and time references must specify a time zone.
We will jump into this story at the advent of computers, skipping over the whole epoch of telegraphs, Morse code, and teletypes. We pick up with the development of an early encoding for the characters that can be typed on a normal American typewriter: ASCII, the American Standard Code for Information Interchange. ASCII is an early example of an encoding scheme that translates the characters used in text into ones and zeros that could be conveyed as pulses on electrical wires between teletype machines. Later this also proved useful for exchanging text between computers. ASCII used clusters of 7 bits to express 128 characters. In time it has been replaced by a system called Unicode, which can use as many as 21 bits and different character mappings, so that it is in effect indefinitely extensible. In this case, increasing the number of bits used to convey a character deepened the semantic capacity of the encoding and exchange format.
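The jump from ASCII's 7 bits to Unicode's larger code space is easy to see in practice. This sketch (in Python, whose strings are Unicode) shows a plain ASCII name fitting one byte per character, while a character outside ASCII's 128-character range takes two bytes in UTF-8 and cannot be encoded as ASCII at all:

```python
# ASCII covers 128 characters in 7 bits; UTF-8 (a Unicode encoding)
# extends this so that characters outside that range take multiple bytes.

ascii_bytes = "Somerville".encode("ascii")  # every character fits in one byte
utf8_bytes = "Zürich".encode("utf-8")       # 'ü' needs two bytes in UTF-8

print(len("Somerville"), len(ascii_bytes))  # 10 characters -> 10 bytes
print(len("Zürich"), len(utf8_bytes))       # 6 characters -> 7 bytes

# Encoding 'ü' as ASCII fails, because it is outside the 128 characters
# that ASCII can represent:
try:
    "Zürich".encode("ascii")
except UnicodeEncodeError as err:
    print("not representable in ASCII:", err.reason)
```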
The ability of animals to easily pass useful messages to each other is a truly momentous aspect of the evolution of our species. What is more interesting is that we are very much on the frontier of this development, with a lot of very clear problems left to work out! To take this story forward we will look at how a humble text file can carry a representation of all of the businesses in Somerville, Massachusetts in 1999. We will take this text file through a series of transformations, up a ladder of more sophisticated encodings, to see how the utility of this model of businesses can be enhanced and exploited in various tools. Each of these transformations involves specific trade-offs in terms of ease of use, semantic capacity, and interoperability.
Note: The following section is under construction. It is intended to accompany an actual in-class demonstration.
Introducing Our Plain Text File
Click here for an example comma-delimited text file. Notice that the first row of values defines the names of the column headings. The values are separated (or delimited) by commas. The succeeding rows are sequences of values that would fill in the cells of a table. We will have more to say about tables as representations of classes of features, each of which is distinguished by values for a fixed set of attributes. You can see that some of these values are spatial references: street addresses, zip codes, town names, latitude, longitude. You may not spot the date references at first; these are the values, e.g. "9812", that fall under the column heading "Since".
As a plain text file, this information may be easily transferred from one computer to another and can be opened in a very wide variety of tools, some of which will recognize the comma-delimited values and create a table of references that may be operated on systematically. To demonstrate this interoperability of plain text files, we will open the text file in ArcMap using the Add Data button. Then we can right-click it and choose Open to view it in the table viewer, and choose Properties to examine the properties that ArcMap has assigned to the table's columns and values.
I'll list a few of the things that happen in this exchange of plain text to ArcMap. Several of the columns have been appropriately assigned text or numeric data types. ArcMap actually guesses these based on the first several values it scans in each column. ArcMap gets this wrong, however, when it interprets the Zip_5 column near the far right-hand side of the table. Zip codes are made of numeric characters (numerals), so ArcMap has decided to translate them as numbers. As a result, the leading zeros in Somerville's zip codes have been thrown away. Everyone knows that a leading zero in a number is meaningless. But zip codes are not numbers! The dates in the column Since are an interesting case. These are also treated as numbers, and adding or subtracting the dates as they stand would not yield any useful result. However, sorting them will put the rows into a rough chronological order.
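The zip-code problem is easy to reproduce outside of ArcMap. The sketch below uses a few hypothetical rows modeled on the Somerville table (the real file will differ) to show how a "looks numeric, so cast it" rule destroys the leading zero, while treating the column as text preserves it:

```python
import csv
import io

# Hypothetical rows modeled on the Somerville business table.
raw = io.StringIO(
    "Name,Since,Zip_5\n"
    "Acme Hardware,9812,02143\n"
    "Union Sq Diner,9905,02144\n"
)

rows = list(csv.DictReader(raw))

# What a type-guessing reader effectively does: "looks numeric, so cast it".
as_number = int(rows[0]["Zip_5"])  # leading zero is gone
as_text = rows[0]["Zip_5"]         # meaning preserved

print(as_number)  # 2143
print(as_text)    # 02143

# Sorting the "Since" values numerically gives a rough chronological order,
# but arithmetic on them (e.g. 9905 - 9812) is not a meaningful duration.
print(sorted(int(r["Since"]) for r in rows))
```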
As you can see, some of the meaning encoded in our comma-delimited text file has been lost in the translation to ArcMap. This situation is less than optimal. We will soon see a more robust way of encoding various logical data types for more reliable exchange, as we transform this text table first into an Events Layer in ArcMap by right-clicking it and choosing Display XY Data, and then export it as a shapefile feature class by right-clicking and choosing Data > Export Data.
There are lots of things we can do with a shapefile in ArcMap that we cannot do with a plain text file. For example, we can add new columns of various numeric types, as well as temporal references and, of course, the very useful spatial references that can be points, lines, or polygons. In class we could demonstrate how each of these sorts of references has its own logic that we can use to explore associations among the businesses represented in the table in terms of their locations, or their similarities in dates or Standard Industrial Classification codes (in a future workshop!). Shapefiles are an amazing thing: Hipparchus of Rhodes, the inventor of latitude and longitude, and Claudius Ptolemaeus, the inventor of map projections, would definitely applaud this means of encoding.
In class we will look at how a shapefile is actually a complex of files. The tabular information is stored in a DBase-format file, a very early way of encoding tables invented in the 1970s by a company that has since gone defunct. One of the advantages of DBase as an information vehicle -- owing to the fact that it has no owner -- is that the format has not changed for decades. DBase does what it does, and will continue to do so. The recipe for reading and writing a DBase file from scratch is well known and openly published. The fact that the DBase format is stable and open makes it a very interoperable format with much better semantic carrying capacity than plain text files. You can trust DBase as a container for your precious observations. DBase has become a de facto standard for the encoding and exchange of tables.
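To illustrate how open that recipe is, here is a sketch that parses the fixed fields at the front of a DBase file's header: a version byte, a last-update date, a 32-bit record count, and the header and record lengths, all little-endian. The sample bytes are hand-built for the demonstration rather than taken from a real file:

```python
import struct

def read_dbf_header(raw: bytes):
    """Parse the fixed fields at the start of a DBase (.dbf) file header."""
    version, yy, mm, dd, n_records, header_len, record_len = struct.unpack(
        "<4BIHH", raw[:12]
    )
    return {
        "version": version,
        "last_update": (1900 + yy, mm, dd),   # stored as YY MM DD bytes
        "records": n_records,
        "header_length": header_len,
        "record_length": record_len,
    }

# A hand-built example: dBase III (0x03), updated 1999-12-01, 2,500 records.
sample = struct.pack("<4BIHH", 0x03, 99, 12, 1, 2500, 609, 120) + b"\x00" * 20
print(read_dbf_header(sample))
```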
At this point in the live demonstration, we will create a polygon graphic in ArcMap and then create a shapefile from it using the Convert Graphics to Features option on the Draw toolbar. We will then use this shapefile to explore spatial relationships with the businesses shapefile.
To simplify what is undoubtedly a much richer story: in the early 1990s ESRI (the creators of ArcMap) extended DBase by adding extensions that link geometric objects (stored in the .shp file) with the attribute information stored in the DBase file. This linkage is accomplished by an index file (.shx). Later in the 1990s they added another extension (.prj) that carries information about the projection of the coordinates in the .shp file. More recently, a shapefile can be accompanied by an .xml file that contains related metadata. To emulate the success of DBase, ESRI has published all of the recipes for reading and writing the shapefile extensions and has promised not to change them. As a result, shapefiles are considered a stable standard, and many content creators and tool makers have invested a lot of value in creating content in shapefiles.
Stability has its drawbacks, however. There are aspects of DBase files and shapefiles that did not seem like a problem back in the day, including the fact that DBase and shapefile datasets can't be larger than 2 gigabytes and field names can't be longer than 10 characters. And because a shapefile consists of several separate files, it can easily become corrupt through mishandling. The stability that makes these formats default standards turns out to be a barrier to innovation.
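That fragility is easy to guard against. The sketch below (a hypothetical helper, not an ESRI tool) checks whether the required members of the shapefile complex (.shp, .shx, .dbf) are all present alongside one another, and notes the optional .prj and .xml companions:

```python
import tempfile
from pathlib import Path

def check_shapefile(shp_path):
    """Report missing required members and present optional companions
    of a shapefile, which is really a complex of sibling files."""
    base = Path(shp_path).with_suffix("")
    required = [".shp", ".shx", ".dbf"]
    optional = [".prj", ".xml"]
    missing = [ext for ext in required if not base.with_suffix(ext).exists()]
    extras = [ext for ext in optional if base.with_suffix(ext).exists()]
    return missing, extras

# Demonstration: a shapefile that lost its .dbf in transit.
with tempfile.TemporaryDirectory() as d:
    for ext in (".shp", ".shx"):
        (Path(d) / ("businesses" + ext)).touch()
    missing, extras = check_shapefile(Path(d) / "businesses.shp")
    print("missing required parts:", missing)   # ['.dbf']
    print("optional parts present:", extras)    # []
```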
So, while ESRI locked in the capabilities of shapefiles, they commenced work on new containers for storing tables that hold geometric features and a variety of attribute types. Their first attempt was called the Personal Geodatabase, an extension of the Microsoft Access database file format, .mdb. After a while, ESRI decided to create a brand-new container called the File Geodatabase. File geodatabases have tons of useful features too numerous to mention here, but you can read about all of them in this blurb from ESRI. Storing your feature classes in a File Geodatabase makes a lot of sense when you are planning on using them or exchanging them strictly within the world of ESRI software. If you are concerned about interoperability with other tools, though, you ought to think twice about relying on File Geodatabases, and even on features such as long column names. The reason is that at the time of this writing (2014) the recipe for creating a file geodatabase is proprietary, which means that ESRI is the only company that makes tools that read and write this format. In a sense this is a good thing, since it gives ESRI maximum freedom to innovate. Nevertheless, users of GIS and creators of GIS data and tools need to look elsewhere (beyond shapefiles) for a means of stable, interoperable encoding and exchange.
The examples of the DBase, shapefile, and geodatabase formats seem to indicate that encoding and exchange formats are stable and interoperable only when they are frozen and can no longer be extended. It may also appear that proprietary control of formats, while fostering innovation, is not compatible with interoperability. How can humanity have stable, open standards that can evolve? For an answer, see the Open Geospatial Consortium. The OGC is an international standards organization that brings together content creators, tool makers, and users to develop stable, open file formats for information encoding and exchange. The OGC has just (last January) approved a standard for a GeoPackage format that seems to have many of the features of a geodatabase but is open.
We have discussed how tables serve as containers for records about entities that can be distinguished by their attributes. We will discuss how this simple construction can lead to really interesting models later in the term. For now we will consider some of the pros and cons of different ways of authoring and exchanging tables. For more information, see Overview of Tables and Attribute Information in the ESRI Online Help.
- Plain Text, Comma-Delimited Files. The ultimate in simple, open data formats. Text files are very shallow in their ability to assign logical types to specific data fields, but software like ArcGIS will try to guess whether a specific column of text refers to text, numbers, or dates. Sometimes the software gets it wrong, such as treating zip codes as numbers. Click here for an example comma-delimited text file.
- DBase Format. DBase tables are an open format for exchanging tabular information that offers a deeper capacity to assign specific logical data types to fields. One drawback of DBase tables is that column names can't be more than 10 characters. DBase is a static specification, which makes it stable and usable in many tools, yet it seems to be being abandoned by Microsoft.
- Excel Worksheets. Excel is a wonderful tool that offers a very deep ability to represent tabular information. The fact that one column may be specified as a dynamic function of other columns is an example of this depth. A drawback of Excel is that it is much more complex than text or DBase as a file format, and it should not be expected to be transferrable to a lot of other applications -- especially as the format may change at any time at the whim of its owner.
- Desktop-Oriented Database Formats. For example, Microsoft Access is a tool that lets you make very complex arrangements of tables. In the filesystem, the whole complex of tables is represented as a single file. These database applications offer advantages in terms of the size of tables that may be supported. To get inside one of these things, however, a software developer must pay licensing fees to Microsoft for the special programs involved.
- Enterprise-Scale, Server-Oriented Databases. In an organization where many people and applications may need to access, and even update, the latest version of a table, we find server-based database management systems like Oracle, MySQL, PostgreSQL, Microsoft SQL Server, and others. These relational database management systems (RDBMS) can handle (theoretically) unlimited amounts of data and serve it very efficiently to applications that make requests according to standardized protocols. These tools can also manage passwords and permissions and grant privileges to particular users. It is worth noting that the International Organization for Standardization (ISO) has specific standards for the fundamental logical data types that must be supported in a relational database management system, and for the protocols these systems must support for exchanging data with the applications that request it. Even if the technical format used to store the data on disk is completely incomprehensible, these systems are considered open and stable to the degree that they support the ISO specification. For archival purposes, dumps must be made in open-format exchange files.
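A small taste of that standardization, sketched with Python's built-in SQLite engine: SQLite is a single-file database rather than an enterprise server, but it accepts the same ISO-style SQL type declarations that the big RDBMS products do, which is what lets application code move among them. Note how declaring the zip code as fixed-width text sidesteps the leading-zero problem we saw earlier:

```python
import sqlite3

# An in-memory database standing in for any standards-conformant RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE business (
        name      VARCHAR(50),
        since     DATE,
        zip_5     CHAR(5),            -- fixed-width text: leading zeros survive
        latitude  DOUBLE PRECISION,
        longitude DOUBLE PRECISION
    )"""
)
conn.execute(
    "INSERT INTO business VALUES (?, ?, ?, ?, ?)",
    ("Acme Hardware", "1998-12-01", "02143", 42.3876, -71.0995),
)
row = conn.execute("SELECT zip_5, latitude FROM business").fetchone()
print(row)  # ('02143', 42.3876)
```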
Vector Data Formats
With the exception of the popular CAD formats, .dwg and .dxf, the vector formats supported in GIS are extensions of the tabular types discussed above. In effect, points, lines, and polygons merely extend the range of data types and associated logic traditionally offered in tables. The ISO standards for logical data types were extended in the mid-1990s to handle spatial data: points, lines, polygons, and surfaces. In the world of ArcMap, individual datasets representing collections of objects of the same feature type are known as Feature Classes.
- Text Files. By including numeric fields for X and Y coordinates, you can associate references to point entities in a text file. Note that relating these coordinates to places on the planet requires metadata that specifies the spatial referencing system employed.
- DWG and DXF Formats. These are proprietary formats of the Autodesk corporation; though they are considered open and documented, there is nothing preventing Autodesk from changing them. The ability to attach semantic information to geometry in these formats is very limited and fixed, which leads people to develop elaborate schemes that distinguish among entities according to their layer and color. These formats are also very flexible in their ability to mix all sorts of geometry types in a single dataset, which can make it difficult to create models that function predictably from a GIS perspective. ArcGIS can read DWG and DXF directly, and can export to CAD-format files and attach georeferencing metadata to them. See CAD Data Support in ArcGIS and related links. Also Converting GIS Data to CAD.
- Shapefiles. Shapefiles are more or less an extension of the DBase data format to handle spatial data types. The shapefile format is owned by ESRI, but it is openly documented and considered a stable exchange format, for now. Shapefiles can carry their spatial referencing information in an associated .prj file; however, shapefiles created before about 2003 will not have a .prj file and will therefore not cooperate with ArcGIS on-the-fly projection.
- File-Based and Personal Geodatabases. ArcGIS extended the Microsoft Access data format (.mdb files) to handle spatial data types. This provides the flexibility to have long field names, large datasets, and complex relationships among tables. One really useful thing about geodatabases is that the geometric properties of features, i.e. their area and perimeter, are automatically updated when the geometry is changed. In version 9.0, ESRI introduced its own File Geodatabase format, which is not tied to the Microsoft database format.
- Enterprise-Scale Geodatabases. An enterprise-scale geodatabase is a means of using server-based relational database management tools, like Oracle or PostgreSQL, to store feature classes. The nature of relational database management systems is such that these datasets have no limit in terms of size.
- Web Feature Services (WFS). A means of sending styled points, lines, and polygons to web clients (not supported in simple web browsers). Open protocols allow GIS operations to be performed on these features. A Transactional Web Feature Service (WFS-T) even allows clients to edit features through the web. Web Feature Services are stable, community-based standards of the OGC.
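Returning to the text-file case at the top of this list: the sketch below (with hypothetical rows) turns X/Y columns into point references written in the OGC's well-known text (WKT) notation. As noted, the numbers alone are meaningless without metadata naming the spatial referencing system:

```python
import csv
import io

# Hypothetical rows with longitude/latitude columns. WKT ("POINT(x y)")
# is the OGC's standard text notation for geometries; note the X (longitude)
# coordinate comes first.
raw = io.StringIO(
    "name,longitude,latitude\n"
    "Acme Hardware,-71.0995,42.3876\n"
    "Union Sq Diner,-71.0946,42.3797\n"
)

points = [
    (row["name"], f"POINT({float(row['longitude'])} {float(row['latitude'])})")
    for row in csv.DictReader(raw)
]
for name, wkt in points:
    print(name, wkt)

# Without metadata naming the spatial referencing system (here, presumably
# WGS84 latitude/longitude), these coordinates are just numbers.
```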
Raster Image Formats
While tabular data structures allow us to distinguish and form associations among different classes of discrete entities, raster images provide containers for representations of locations. Locations are identified by cells, or pixels, that can be associated with attributes. From a GIS perspective, there are a few important aspects of rasters:
- Do they carry georeferencing information linking the cell locations to specific coordinates, and thus a relationship with places on the globe? ArcMap can georeference any sort of image using its own scheme of world files and aux.xml files that can be associated with an image, but some image formats contain their own internal georeferencing information.
- What is the bit depth supported by the attribute referencing system?
- Does the raster format support compression, and how lossy is it?
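The world files mentioned above are themselves a nice example of a simple, open encoding: six plain-text numbers defining an affine transformation from pixel position to world coordinates. A sketch, using a made-up world file for a north-up image with 30-meter pixels:

```python
def pixel_to_world(world_file_text, col, row):
    """Apply the six-parameter affine from an ESRI-style 'world file'.

    The six lines are: x pixel size (A), rotation terms (D, B), negative
    y pixel size (E), and the x/y coordinates of the center of the
    upper-left pixel (C, F).
    """
    a, d, b, e, c, f = (float(line) for line in world_file_text.split())
    x = a * col + b * row + c
    y = d * col + e * row + f
    return x, y

# A hypothetical world file: 30 m pixels, no rotation, upper-left pixel
# centered at (500015, 4700985) in some projected coordinate system.
tfw = "30.0\n0.0\n0.0\n-30.0\n500015.0\n4700985.0\n"
print(pixel_to_world(tfw, 0, 0))    # (500015.0, 4700985.0)
print(pixel_to_world(tfw, 10, 10))  # (500315.0, 4700685.0)
```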
- Bitmap Images. Each cell can have one of two values: 1 or 0, black or white.
- Color-Mapped Images. 8-bit GIF and PNG images typically support 256 distinct values, which can be mapped to specific colors or to transparency.
- 8-Bit Gray Scale Images Each pixel is assigned an attribute from 0-255, representing a gradient.
- True-Color, Multi-Band Images. TIFF is a format that integrates three channels of 8-bit intensity, useful for representing mixtures of red, green, and blue. Together, these three 8-bit channels support 16,777,216 distinct combinations of color. There is also a GeoTIFF format, which supports internal georeferencing.
- Compressed Image Formats. JPEG is an example of a means of compressing image information. These images have a 24-bit depth, but depending on the compression, the actual assignment of data to pixels, particularly around the edges of things, is not necessarily predictable, even if it looks right.
- Wavelet Compression. MrSID (.sid) and JPEG2000 files use a progressive compression method that responds to how closely you are zoomed in. The compression is much better than plain JPEG, and both formats carry internal georeferencing capabilities -- though these are not always used.
- Deeper Raster Formats When you have a raster GIS dataset that requires a continuous range of attributes that extends beyond 256 values in a single channel, you need a deeper raster structure. These are provided by the proprietary ArcInfo GRID format or the simpler and more open Imagine IMG format.
- Image Map Services (WMS). A means of sending georeferenced images to a web browser. These images may be pre-tiled or composed on the fly -- effectively making a map on the server, taking an image of it, and sending it to the browser. Google Maps is an example of an image map service. The Open Geospatial Consortium has a very stable and popular specification for these, known as the Web Map Service (WMS). These may be viewed in ordinary web browsers.
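The bit depths in the list above translate directly into the number of distinct values a cell can carry:

```python
# Distinct values representable at each bit depth mentioned above.
depths = {
    "bitmap": 1,                    # black or white
    "color-mapped / grayscale": 8,  # 256 entries or gray levels
    "true-color (3 x 8-bit)": 24,   # RGB combinations
}
for name, bits in depths.items():
    print(f"{name}: 2**{bits} = {2 ** bits:,} distinct values")
```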
Styling and Portrayal
Most data encodings used for vector GIS data do not contain a built-in way of specifying how features will be symbolized and labeled. The best argument for keeping such styling, or portrayal, information separate from the data encoding is that there are any number of ways a person may want to style data. Furthermore, the specific options for doing so will vary depending on what sorts of tools and delivery mechanisms are involved. Therefore, it makes sense to keep this sort of information separate.
ArcGIS Layer Files: Layer files are a proprietary format that permits very detailed associations between vector features and symbolic treatments, including line colors, line weights, point symbols, polygon fills, and automatically generated labels. A layer file can also be used to override the default coloring of images.
Styled Layer Descriptors (.sld): SLD is an open standard for layer styling that accomplishes what ESRI .lyr files do. This standard is governed by the Open Geospatial Consortium.
Data Models and Schema
Typical GIS datasets that we collect from sources are fairly elemental, consisting of discrete collections of vector features or individual raster layers. However, it is also useful to consider how complexes of feature classes and rasters can be organized into data models that are coherent in terms of the relationships among features, and potentially also engaged with rasters. Higher-order data collections that operate this way are thought of as schema, in the sense of their abstract organization, or as data models when they are implemented and used. An advantage of thinking of schema in this way is that toolkits may be developed that draw inferences and perform experiments involving the constitution of elements and the relationships among them. Data models are discussed in more detail on the page Modeling for Decision Support and Scholarship.
Most schema that we make are relatively ad hoc. However, a very important movement in the field of GIS, and in other information-management endeavors, is to develop elaborate schema with organizations of people who will then be better able to exchange very deep information and tools. A hallmark of this movement is the use of XML (Extensible Markup Language), which is a sort of meta-schema. The development of XML schema by communities of interest is driving a revolution in collaborative information models that can be systematically exchanged, known as the Semantic Web. A branch of the open-source movement, these efforts usually involve cross-disciplinary collaborations in which participants understand that the development of a shared language will enhance their niche in the ever-more-diverse information ecology. There are several non-profit collaborations that are very active in developing very useful schema. For example, see the General Transit Feed Specification or CityGML.