User:SK53/NHD Upload

A bit more on NHD Uploads

Sub-basin progress & plans

Currently working through the 6 Colorado Headwater basins (14010001..6), with Blue River WaterBodies, Flowlines and Lines imported. Starting to import all WaterBody data for the other 5 basins in one sequence of uploads. SK53 21:26, 2 September 2009 (UTC). Data import completed around end Sept 2009. Data tidy-up proceeding tile-by-tile at level 11 (see below).

  • 14010001. All NHD import complete. No conflicts noticed. Joining water bodies, waterways etc. in progress.
  • 14010002. All NHD data imported, with a small number of conflicts. The major one is Dillon Reservoir, which is not yet resolved (both original OSM & NHD data are currently in the database). Sections of river near Keystone already existed but were very incomplete and had various errors. These were deleted and replaced by NHD data, which, whilst only slightly better, was complete and accurate with respect to flow direction.
  • 14010003. All NHD import complete. Joining water bodies, waterways etc. in progress.
  • 14010004. All NHD import complete. Joining water bodies, waterways etc. in progress.
  • 14010005. All NHD import complete. Joining water bodies, waterways etc. in progress. Some conflict with Colorado riverbank already imported from NHD (with a lot of untagged segments).
  • 14010006. All NHD import complete. Joining water bodies, waterways etc. in progress. Some conflict with Colorado riverbank already imported from NHD (with a lot of untagged segments).

Process

With Blue River done I'm trying to work out a fairly routine process. The major problem I encountered was a bug in bulk_upload.py running on Windows, which left lots of lonely nodes all over central Colorado. The result was that I had to do quite a lot of manual reverting, being unwilling to try the Perl script. It helps if ways (not nodes) are tagged with some kind of upload sequence to facilitate this type of backing out.

Now, I'm using xapi queries to analyse an area before performing uploads. Together with a special Kosmos rule file I can quickly visualise any likely conflicts. I still have to decide what to do about conflicts: e.g., manually mark duplicated ways in the NHD data so that they can be found quickly at upload time.

Having now done most of the import, this is the process I envisage using in the future:

  • Download NHD data for all subbasins before start of any data manipulation.
  • Determine bbox for the basin.
  • Run xapi queries for major features which might cause conflict.
  • Prepare all waterbody & area data (lakes, ponds, marshes, riverbanks etc) first.
  • Check for conflicts so as to plan how to resolve 'em (e.g., remove from import, load and then merge etc).
  • Import all waterbody & area data
  • Prepare flowline data by subbasin.
  • Check for conflicts (less likely than for areas)
  • Import by subbasin
  • Once each subbasin is fully imported join ways together (see below)
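
The bbox and xapi steps above can be sketched roughly as follows. This is only an illustration: the coordinates are made up, the helper names are hypothetical, and the request path follows the xapi predicate style (`element[key=value][bbox=left,bottom,right,top]`) as I understand it. No server host is assumed.

```python
from typing import Iterable, Tuple

# (left, bottom, right, top) in decimal degrees
BBox = Tuple[float, float, float, float]


def bbox_of(points: Iterable[Tuple[float, float]], pad: float = 0.0) -> BBox:
    """Return the bounding box of (lon, lat) points, optionally padded."""
    pts = list(points)
    if not pts:
        raise ValueError("no points given")
    lons = [p[0] for p in pts]
    lats = [p[1] for p in pts]
    return (min(lons) - pad, min(lats) - pad, max(lons) + pad, max(lats) + pad)


def xapi_path(element: str, key: str, value: str, bbox: BBox) -> str:
    """Build an xapi-style request path: a tag predicate plus a bbox predicate."""
    left, bottom, right, top = bbox
    return (
        f"/api/0.6/{element}[{key}={value}]"
        f"[bbox={left:.4f},{bottom:.4f},{right:.4f},{top:.4f}]"
    )


# Hypothetical corner points of a basin extent (not real NHD values).
basin_bbox = bbox_of([(-106.5, 39.3), (-105.8, 39.9)])
# Existing water areas are the features most likely to conflict with the import.
water_query = xapi_path("way", "natural", "water", basin_bbox)
```

One query per feature type likely to conflict (natural=water, waterway=riverbank, waterway=* and so on) keeps each result small enough to inspect visually.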

Problems

Problems experienced so far include:

  • GDAL Python bindings caused failures with some of the shp2osm routines. It is important to ensure that these bindings work: it's easy to disrupt a working environment.
  • Bug in the area shp2osm script for feature xxxxx.
  • bulk_upload.py failing mid-upload creating lots of orphan nodes. I probably need to understand how to use this a bit better.
  • Nodes from the Flowline data and from the Area and WaterBody data are slightly displaced relative to each other. This makes tidying the data much slower than I hoped.
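
Orphan nodes left by a failed upload can be listed from a downloaded .osm file before cleaning up. The following is a minimal sketch, assuming the standard OSM XML layout (`node`, `way`/`nd`, `relation`/`member`); the function name and the sample data are hypothetical, and this is not how bulk_upload.py itself works.

```python
import xml.etree.ElementTree as ET


def orphan_node_ids(osm_xml: str) -> list:
    """Return ids of nodes that carry no tags and are not used by any way or relation."""
    root = ET.fromstring(osm_xml)
    # Nodes referenced from ways.
    referenced = {nd.get("ref") for way in root.iter("way") for nd in way.iter("nd")}
    # Nodes referenced as relation members.
    referenced |= {
        m.get("ref")
        for rel in root.iter("relation")
        for m in rel.iter("member")
        if m.get("type") == "node"
    }
    return [
        n.get("id")
        for n in root.iter("node")
        if n.find("tag") is None and n.get("id") not in referenced
    ]


# Made-up sample: node 3 is untagged and unused; node 4 is unused but tagged.
SAMPLE = """<osm version="0.6">
  <node id="1" lat="39.5" lon="-106.0"/>
  <node id="2" lat="39.6" lon="-106.1"/>
  <node id="3" lat="39.7" lon="-106.2"/>
  <node id="4" lat="39.8" lon="-106.3"><tag k="natural" v="peak"/></node>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="waterway" v="stream"/></way>
</osm>"""

orphans = orphan_node_ids(SAMPLE)
```

A node is only a safe deletion candidate if the downloaded extract contains every way that could reference it, so the check is best run on a complete download of the area.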

Merging the data sets

After data has been imported it looks OK, but the data needs to be tidied up. First any bad data from failed or duplicate uploads needs removing, then waterways need joining together, and finally these need joining to riverbanks and lakes/ponds. I use JOSM with the validator plugin. For NHD 1401 (Upper Colorado) a level 11 tile typically has about 40000 nodes once the NHD data has been imported, so I just maintain a list of tile co-ordinates and work through them, usually from W->E alternating N->S and S->N so that I always work with an adjacent tile. It's a good idea NOT to change data outside the download area to avoid the time-consuming need for conflict resolution. Here's what I do:

  • Validate the whole set of data
  • Delete any single untagged nodes (orphans) resulting from failed data imports.
  • Upload changes in a changeset
  • Select attribution=NHD & waterway=stream (to avoid trampling on other data)
  • Validate this set, and bulk fix all duplicate nodes.
  • Upload changes in a changeset. Failure to do this can result in conflicts in the next stages.
  • Select attribution=NHD & waterway=canal (as above)
  • Validate this set, and bulk fix all duplicate nodes.
  • Upload changes in a changeset (optional)
  • Select attribution=NHD & man_made=pipeline (as above)
  • Validate this set, and bulk fix all duplicate nodes.
  • Upload changes in a changeset (optional)
  • Select attribution=NHD (as above)
  • Validate this set, and manually fix duplicate nodes. This joins canals to streams, canals to pipelines etc, when appropriate.
  • Upload changes in a changeset
  • Select attribution=NHD and add source=NHD (different for closed ways from open ways)
  • Validate this set, and manually fix duplicate and close nodes.
  • Upload changes in a changeset
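
The tile list I work through can be generated rather than kept by hand. This sketch assumes that "level 11" means zoom 11 in the usual slippy-map tile scheme (an assumption on my part), and the function names are hypothetical. Columns run W->E, and each column alternates N->S and S->N so that consecutive tiles are always adjacent.

```python
import math


def tile_xy(lon: float, lat: float, zoom: int = 11):
    """Slippy-map tile (x, y) containing a lon/lat point; y grows southwards."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y


def serpentine(x_min: int, x_max: int, y_min: int, y_max: int):
    """List tiles column by column W->E, alternating N->S and S->N."""
    order = []
    for i, x in enumerate(range(x_min, x_max + 1)):
        ys = list(range(y_min, y_max + 1))  # north to south
        if i % 2 == 1:
            ys.reverse()  # every second column runs south to north
        order.extend((x, y) for y in ys)
    return order


# A 3 x 2 block of tiles, numbered from 0 purely for illustration.
work_list = serpentine(0, 2, 0, 1)
```

Feeding the two corners of a basin bbox through `tile_xy` gives the `x_min..x_max` and `y_min..y_max` ranges to pass to `serpentine`.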

Throughout the process any conflicts with existing data may well become noticeable.

These historical problems with bulk_upload.py were substantially eliminated by various versions of ogr2osm (originally written by Ivan Sanchez).

Other Issues

  • GNIS data. There appear to be at least two conflicting imports of GNIS data for the area of central Colorado. Many reservoirs have two nodes, often with different names and GNIS identifiers, but clearly applying to the same water body when checked against NHD and aerial photos. Ideally all the GNIS stuff would be merged onto the NHD closed area, but for now I am leaving them alone.
  • Intermittent waterways. These are not separately tagged from other waterways.
  • Dried-out reservoirs. I have imported a number of reservoirs which clearly have not held substantial quantities of water for some time (judging by scrub development in the aerial photos). Some of these are recorded as important wildlife sites, so I presume they are still substantially wet in spring and autumn.
  • Irrigation canals. The default mapping of NHD class CanalDrain is to waterway=canal. In many cases this is at best misleading, but I'm yet to be fully convinced that waterway=drain is any better. Many of these are difficult to see on aerial photography and may be partially covered.

Working Tags

Errors in uploads, potential duplicate data etc. showed the need to readily identify each data set during the upload process. In the end I used a key SK53:bulk_upload=* with values which reflected the NHD file in use. This key is automatically deleted from ways which are touched during an edit, so the total number of ways carrying it will gradually decline (provided people are interested in the waterways in the Colorado basin).
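
Adding the working tag to every way in a converted file before upload might look like the sketch below. It assumes the standard OSM XML layout; the function name, the sample data and the tag value `NHDFlowline_14010002` are hypothetical illustrations, not the values actually used in the import.

```python
import xml.etree.ElementTree as ET

WORKING_KEY = "SK53:bulk_upload"


def tag_ways(osm_xml: str, value: str) -> str:
    """Set WORKING_KEY=value on every way (nodes are left untouched)."""
    root = ET.fromstring(osm_xml)
    for way in root.iter("way"):
        existing = [t for t in way.findall("tag") if t.get("k") == WORKING_KEY]
        if existing:
            existing[0].set("v", value)  # overwrite rather than duplicate the key
        else:
            ET.SubElement(way, "tag", {"k": WORKING_KEY, "v": value})
    return ET.tostring(root, encoding="unicode")


# Made-up one-way file standing in for shp2osm output.
CONVERTED = """<osm version="0.6">
  <node id="-1" lat="39.5" lon="-106.0"/>
  <node id="-2" lat="39.6" lon="-106.1"/>
  <way id="-10"><nd ref="-1"/><nd ref="-2"/><tag k="waterway" v="stream"/></way>
</osm>"""

tagged = tag_ways(CONVERTED, "NHDFlowline_14010002")
```

With a distinct value per source file, everything from one failed upload can be selected in a single search and backed out together.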