Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRAN plan #4

Closed
mdsumner opened this issue Aug 26, 2017 · 11 comments
Closed

CRAN plan #4

mdsumner opened this issue Aug 26, 2017 · 11 comments

Comments

@mdsumner
Copy link
Member

mdsumner commented Aug 26, 2017

I think we've learnt enough to put out a bare bones release. The key C++ functionality to wrap is these, either as slightly overlapping implementations or as building blocks:

  • open and read from data source (return a minimal summary)
  • execute SQL against data source, or pass in an index for features to read
  • read the attributes only
  • read the geometry only (in an optional format)
  • read the geometry bounds only

Focus on just these features as a minimal core since we can build higher-level collections and descriptions of multiple data sources from these components

These functions

  • vapour_read_attributes - a list of column vectors
  • vapour_read_geometry - a list of native blobs
  • vapour_read_geometry_text - a vector of JSON, WKT, KML, or GML
  • vapour_read_extent - a list of extent vectors

Each function is standalone, opening and closing the data source, defaulting to the 0-th layer and only accepting a layer index, optionally executing a SQL string, and ignoring the layer index if sql is given.

@mpadge
Copy link
Member

mpadge commented Sep 4, 2017

I really like this minimal structure and think it should be largely maintained as is, so these are just a couple of thoughts as minor modifications:

  • Implement non-SQL sub-selection for simple attribute/feature extraction, via additional args naming attributes or features to extract. This will probably best be done along with next point:
  • Wrap Rcpp functions inside pure-R functions because this is generally much safer. Extra args from above could then be converted to corresponding SQL statements in R prior to submitting to Rcpp
  • Check for examples for which GDAL properly returns named features - there must surely be some? - and implement a vapour_read_features function that just returns the names.
  • Enable a chunk option in all functions to read only a limited number of attributes or geometries. This will be the "killer app" because it will enable proper scaling to read and process files way beyond memory size, which sf (in it's current implementation) will not be able to do. Obviously ultimately handled also by conversion within R into simple SQL statements.
  • (Becuase I ain't compiling at the mo, so just a lazy thought...) Check matching of parellel attribute and geometry extractions. In cases where features are named, all NULL entries can then be safely removed, making for much more compact attribute lists. Alternatively, name the list items with sequential ints so they can be properly re-indexed anyway. This might be a good idea because many operations in R will automatically remove NULL list items anyway, potentially causing issues in matching attributes to geometries.
  • This last point might require one additional function, which I suspect would be useful anyway, to easily enable attributes and geometries to be properly matched.

@mdsumner
Copy link
Member Author

rhub checks

✔  checking whether package 'vapour' can be installed
N  checking installed package size
     installed size is 78.6Mb
     sub-directories of 1Mb or more:
       gdal   3.9Mb
       libs  68.7Mb
       proj   5.0Mb
✔  checking package directory
✔  checking DESCRIPTION meta-information
W  checking top-level files
   Conversion of 'README.md' failed:
   pandoc.exe: Could not fetch http://badges.herokuapp.com/travis/hypertidy/vapour?branch=master&env=BUILD_NAME=trusty_clang&label=trusty_clang


✔  ** running examples for arch 'i386'
N  ** running examples for arch 'x64'
   Examples with CPU or elapsed time > 5s
                             user system elapsed
   vapour_read_geometry_text 5.89   0.04    5.94
   vapour_read_feature_what  5.78   0.00    5.91

@mdsumner
Copy link
Member Author

@mpadge btw totally with you on these points, now much more capable of applying them too - when I can get to it! Appreciate them as always

@mdsumner
Copy link
Member Author

Winbuild checks

Possibly mis-spelled words in DESCRIPTION:
  Geospatial (2:34, 13:17)
  geospatial (13:88)

Found the following (possibly) invalid URLs:
  URL: https://github.com/mdsumner/RGDASQL
    From: README.md
    Status: Error
    Message: libcurl error code 35:
      	error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
* checking package namespace information ... OK

@mpadge
Copy link
Member

mpadge commented Apr 25, 2018

See here for libcurl error - you can ignore that. On all my CRAN submissions I just state, "R CMD check generates notes on R-oldrelease regarding (possibly) invalid URLs promted by libcurl error code 35, which Jeroen Ooms has fixed on all later releases."

@mdsumner
Copy link
Member Author

Oh sweet, thanks! It's actually typo as well in "RGDALSQL", though - fixed now

@mdsumner
Copy link
Member Author

roaming TODO

  • [x ] license, MIT?
  • check coverage of attribute types, check raw, logical, factor/character, blob, etc
  • can we support WKB or EWKB, as simple option? (yes, should be a read_geometry_what option
  • is a NULL for any geometry an acceptable empty option? (revisit the EP issue about this, surely it's not needed to be so achey) , though see padgham's matching notes above
  • need some handling for sense if FID used in SQL, maybe build a list of which drivers are 0 or 1 based for the docs - belongs elsewhere given vapour philosophy
  • will this package include the DBI interface? NO
  • no high level packages, this is GDAL functionality
  • revisit padgham's list above

@mdsumner
Copy link
Member Author

mdsumner commented May 10, 2018

@mpadge no code but I just re-read your "only returns names" note above, and it's quite like the FID thing, OGRSQL uses 1 or 0 for the special FID virtual column, depending on the driver - but the OSM driver gives a literally arbitrary FID, the actual entity ID.

So, good idea - we just have a vapour_read_FID (or _names) function so there's a reference list for subsequent queries. I'm just putting a note-to-self: can we SELECT that in a useful way? Is that the best/only way to do it?

The danger is mapping that set to the data source which might have changed the FIDS (surely not the named-FIDS), but I think that's a clear "not vapour's problem" situation - maybe there's a way to specify "this really is a name" that GDAL can guarantee, but I doubt it works outside of named sources (like OSM). (Is there really no name, generally? Is that what's ultimately broken in GIS, yes ...)

mdsumner added a commit that referenced this issue May 10, 2018
@mdsumner
Copy link
Member Author

Regarding the "return by name" function, this works:

file <- "list_locality_postcode_meander_valley.tab"
mvfile <- system.file(file.path("extdata/tab", file), package="vapour")

vapour_read_attributes(mvfile, sql = "SELECT POSTCODE FROM list_locality_postcode_meander_valley WHERE FID IN (1, 23, 45)")

Where those FID names can be gotten with

vapour_read_names(mvfile)

I'm a bit reluctant to wrap this here because it means another level over the source, layer, sql, format, attributes, extent - and that might mean a redesign is needed, but it can be achieved now with just R and strings so seems to be enough. It does work with OSM, but I haven't explored real sources extensively

Obviously this might be a problem for a very long list of names!

osmfile <- system.file("extdata/osm/myosmfile.osm", package = "vapour")
vapour_read_attributes(osmfile, 
                       sql = sprintf("SELECT * FROM %s WHERE FID IN (%s)", 
                                     vapour_layer_names(osmfile)[1L], 
                                     paste(vapour_read_names(osmfile), collapse = ",")))

@mpadge I'm pretty sure this will suit, but in case you have concerns.

@mdsumner
Copy link
Member Author

mdsumner commented May 24, 2018

I'm happy that all the above is covered now, but it's still worth re-reading the comments for details of what else to do.

We def want before CRAN

  • convert layer = integer to layer = character, allow user to use either (nearly forgot - this is now easy given that all functions are R wrapped)
  • review help pages
  • fix the vignette, it's the original readme without any polish
  • write more tests

We maybe want

  • non-sql filtering, allow user to give basic row-numbers, or a set of attributes to check, anything we can expose that doesn't require SQL (although might use SQL)
  • extent filtering for features, so that a graphics device can ask for its data (can do indirectly with read extent)

@mdsumner
Copy link
Member Author

mdsumner commented Jun 6, 2018

I'm cool with this now, will play around for a while and push to CRAN some time soon.

@mdsumner mdsumner closed this as completed Jun 6, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants