DataFrames 0.11+ compat via IndirectArrays #1090

andreasnoack · 2018-01-22T12:28:37Z

@tlnagy Here is my version of the DataFrames 0.11+ compatible version. This version passes locally (see caveat below) with this version of IndirectArrays. I decided that the wrapping behavior of CategoricalArrays was too inconvenient to get working with the current setup because it is implicitly assumed in most functions that "categorical" arrays return the same kind of values as non-categorical arrays. Maybe by restructuring the code to rely more on dispatch to smaller specialized methods it might be possible to use CategoricalArrays internally, but I'm not sure it is worth it. Basically, I don't think it is useful to have the pool handy for the use cases here in Gadfly.

Although the tests pass, it requires that Julia runs with --compilecache=no. Otherwise, it triggers what appears to be completely spurious ambiguity errors that we'd need @vtjnash or @JeffBezanson to comment on. It is hard for me to narrow down the issue but it could be related to module/using/include. E.g. I can trigger/detrigger the issue by including a file where all lines are commented. I've also been able to trigger/detrigger it with a foo() = 1 definition in the middle of the statistics.jl file.

pfitzseb · 2018-01-31T15:52:34Z

src/Gadfly.jl

@@ -1121,7 +1123,7 @@ end

 import Juno: Juno, @render, media, Media, Hiccup

-media(Plot, Media.Plot)
+# media(Plot, Media.Plot)


This will break Juno's plot rendering.

andreasnoack · 2018-02-01T10:11:20Z

So the issue mentioned above is fixed by JuliaLang/julia#25832. I've tried to cherry-pick the fix on release-0.6 branch and it works. This means that getting Gadfly working with the new Missing based data stack requires a new 0.6.x release which currently is not on the roadmap because of resource constraints. Hence there is a risk that we'll only get updated Gadfly supported after 0.7/1.0 is out. As a consequence, I've started the process of getting this branch working on Julia master.

tlnagy · 2018-02-01T20:44:19Z

Hence there is a risk that we'll only get updated Gadfly supported after 0.7/1.0 is out.

I think we're getting pretty close to a 0.7 release, no? I think that at this point it's been several months since the new DataFrames came out and the damage is likely done (people either pinned DataFrames or dropped Gadfly) so I wouldn't mind waiting till 0.7 came out.

As a consequence, I've started the process of getting this branch working on Julia master.

Awesome!

bjarthur · 2018-02-02T12:18:16Z

@andreasnoack could you add --compilecache=no to this PR in travis.yml? if it passes with that i'd suggest we merge. could also add a temporary warning to the README/docs about needing to run with that flag if you use DataFrames.

andreasnoack · 2018-02-02T12:36:05Z

I'd need to finish up JuliaArrays/IndirectArrays.jl#12 first but then I might try to add the flag here.

bjarthur · 2018-02-04T19:53:02Z

i tried checking out this PR, the PR in IndirectArrays, and using --compilecache=no. works great except test/testscripts/boxplot.jl. throws a DimensionMismatch error. can you confirm?

andreasnoack · 2018-02-06T21:47:40Z

I'm able to reproduce that error. I thought I had them all working. Will take a look.

andreasnoack · 2018-02-06T21:59:21Z

Oh. Now I know what is going on. I had added what I thought was a harmless deprecation fix. Should work now.

Btw. There'll be some issues with the testing when we transition to 0.7 because 0.7 doesn't produce the same random Int64s as 0.6 for the same seeds. Hence the uuids embedded in the produced images will be different so we'd need a way to handle that in the regression testing.
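One possible way to handle this, sketched here and not taken from the PR: normalize the embedded ids before comparing a cached image with a freshly generated one. The id format (`img-` followed by eight hex digits) and the names `ID_PATTERN`, `normalize_ids` and `same_image` are assumptions made for illustration.

```julia
# Assumption: embedded ids look like "img-" followed by 8 hex digits.
# The real id format in the generated images may differ.
const ID_PATTERN = r"img-[0-9a-f]{8}"

# Replace every id with a fixed placeholder so that seed-dependent
# differences disappear from the comparison.
normalize_ids(svg::AbstractString) = replace(svg, ID_PATTERN => "img-XXXXXXXX")

# Two images count as equal if they match after id normalization.
same_image(a::AbstractString, b::AbstractString) =
    normalize_ids(a) == normalize_ids(b)

cached = "<g id=\"img-8d6fd0d5-1\"><path d=\"M0,0\"/></g>"
genned = "<g id=\"img-1a2b3c4d-1\"><path d=\"M0,0\"/></g>"
same_image(cached, genned)
```

Collapsing every id to one placeholder ignores which element refers to which id, so this would only catch differences outside the ids themselves.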

bjarthur · 2018-02-07T12:01:15Z

can we use a different seed to reproduce the old behavior?

andreasnoack · 2018-02-07T12:17:02Z

Unfortunately, I don't think that is possible. See also JuliaLang/julia#25058 (comment)

kleinschmidt · 2018-02-08T01:36:58Z

It looks like there's an issue somewhere here with levels/sorting. I'm getting a mismatch between group labels and the actual data plotted using Geom.subplot_grid.

current master (which is correct):

this PR (with @andreasnoack PR on IndirectArrays.jl as well):

nalimilan · 2018-02-08T09:21:10Z

It looks like there's an issue somewhere here with levels/sorting. I'm getting a mismatch between group labels and the actual data plotted using Geom.subplot_grid.

I don't really understand what's going on in the figure you showed, but if that's related to the order of levels then it might be due to the fact that IndirectArray does not support a custom ordering of levels? I've just realized that it could be a significant problem: with this PR, is the custom ordering of levels preserved when plotting for CategoricalArray variables? That's one of the main advantages of this type, and PooledDataArray supported it.

nalimilan · 2018-02-08T09:16:50Z

src/scale.jl

+discretize_make_pda(values::CategoricalArray, ::Void) = discretize_make_pda(values)
+function discretize_make_pda(values::CategoricalArray{T}, levels::Vector) where {T}
+    index = map!(t -> ismissing(t) ?
+                    findfirst(ismissing, levels) :


Really weird indentation here.

It is the first four space aligned position after the (. Arguments are aligned to the ( but this splits a ternary instead of two arguments so it should be indented. The call is not super easy to read as is but the indentation makes it easier.

AFAICT there are only three spaces, not four (assuming the count starts after (, i.e. starts with the t). More importantly, what bugs me is that the second operand isn't aligned with the first one.

Also, and more importantly, why is get needed here? Shouldn't be AFAICT. That's also going to be very very slow. Better do something like:

mapping = coalesce.(indexin(CategoricalArrays.index(values.pool), levels), 0)
pushfirst!(mapping, coalesce(findfirst(ismissing, levels), 0))
mapping .+= 1
index = [mapping[x+1] for x in values.refs]

EDIT: I'm not sure what should happen for values which are not in levels: throw an error?

The same code should also work when levels is a CategoricalVector (i.e. no need for the special method below).

Could also drop the "pda" terminology.
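The lookup-table idea in this suggestion can be tried without CategoricalArrays. The sketch below uses plain vectors as a stand-in: `pool` holds the distinct values and `refs` holds 1-based indices into it, with 0 marking a missing entry. The name `remap_refs` and this layout are assumptions for illustration. On Julia 1.x `indexin` returns `nothing` for unmatched values, so `something` is used where the suggestion wrote `coalesce`. The `.+= 1` shift is left out and unmatched values raise an error, which is only one possible answer to the open question in the EDIT.

```julia
# Hypothetical stand-in for a categorical pool: `pool` holds the distinct
# values, `refs` holds 1-based indices into `pool`, and 0 marks a missing entry.
function remap_refs(pool::Vector, refs::Vector{<:Integer}, levels::Vector)
    # Position of each pool value in `levels`, or 0 if it is absent.
    mapping = something.(indexin(pool, levels), 0)
    # Slot for ref 0 (missing): the position of `missing` in `levels`, or 0.
    pushfirst!(mapping, something(findfirst(ismissing, levels), 0))
    # One table lookup per element instead of one search per element.
    index = [mapping[r + 1] for r in refs]
    any(iszero, index) && throw(ArgumentError("values not in levels encountered"))
    return index
end

pool = ["a", "b", "c"]
refs = [1, 3, 0, 2]                      # "a", "c", missing, "b"
levels = ["c", "a", "b", missing]        # custom order, with a missing slot
index = remap_refs(pool, refs, levels)
```

The search over `levels` happens once per pool entry, so the cost of `indexin` does not grow with the length of the data.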

andreasnoack · 2018-02-08T10:09:43Z

I'll have to take a closer look. Do you have a minimal example to reproduce this? My guess is that I'd have to be more careful when converting a CategoricalVector to an indirect Vector such that the values end up in the right order.

nalimilan · 2018-02-08T10:28:30Z

Sorry, I don't use Gadfly, but I guess something like this should do if you set a custom non-alphabetical order for countries (e.g. grouped by regions). Note that it may already be working, I just happened to think about it when reading the description of the issue above.

kleinschmidt · 2018-02-08T14:32:23Z

@nalimilan the problem with the figure I showed is that the labels on the facets (\alpha=0.01, 0.1, 1, 10) are wrong. with the PR they've been circshifted by one, while the facets are still plotted in the same order (just the labels have changed). The values that are used to group the facets are a data frame column that's an Array{String}.

I will look into this more later and figure out a minimal example but don't have time right at the moment.

bjarthur · 2018-02-11T23:11:15Z

i merged a PR which requires some conflicts to be resolved with this one. sorry, but they are small. good thing is that regression testing is even easier now!

andreasnoack · 2018-02-12T10:49:56Z

@kleinschmidt I've pushed a possible fix for this. Would be great if you could try it out.

@bjarthur No problem. I've rebased.

bjarthur · 2018-02-12T12:13:43Z

before we merge this, would it make sense to alter REQUIRE to put a ceiling on DataFrames of 0.11.4? if so, i'd say let's do that ASAP.

also, why were breaking changes in DataFrames made on a point release? shouldn't they have bumped the minor version?

andreasnoack · 2018-02-12T12:29:39Z

would it make sense to alter REQUIRE to put a ceiling on DataFrames of 0.11.4?

I'm not sure I follow. Why would you put a ceiling there? We could try to put a ceiling at the next breaking version but DataFrames doesn't follow SEMVER (yet) so it is hard to know if that will be 0.14, 0.17, or 1.0.

The breaking change in DataFrames happened at 0.11.0 but there have been some bug fixes after that. I think I decided to use 0.11.4 because it has a fix for the deprecated readtable function.

I'll try to finish the IndirectArrays PR soon such that we can have this merged.

bjarthur · 2018-02-12T13:22:06Z

the motivation to upperbound DataFrames now is to fix the master branch of Gadfly now and give you time to finish without hurrying. i assumed the breaking change was 0.11.4 since that's what this PR specifies in its REQUIRE.


andreasnoack · 2018-03-28T17:56:07Z

The previous test error was because of an improvement in RData. Hopefully, tests will now pass.

tlnagy · 2018-03-29T00:02:54Z

REQUIRE

@@ -16,3 +16,10 @@ Loess
 Showoff 0.0.3
 StatsBase
 Juno
+IndirectArrays 0.4.1
+Missings
+# Gadfly doesn't use WeakRefString directly but because of a but in Julia 0.6.2, Gadfly


bug not but

tlnagy · 2018-03-29T00:36:18Z

Old behavior:

a = DataFrame(diff = PooledDataArray(Float64[1,2,3,3,3,4,3,2]))
plot(a, x="diff", Geom.histogram)

New behavior:

a = DataFrame(diff = CategoricalArray(Float64[1,2,3,3,3,4,3,2]))
plot(a, x="diff", Geom.histogram)

I think the new behavior makes sense with the transition from PooledDataArrays to CategoricalArrays. The latter is what I would expect for categorical data. And you can get the old behavior back if you just use an Array:

a = DataFrame(diff = Float64[1,2,3,3,3,4,3,2])
plot(a, x="diff", Geom.histogram)

tlnagy · 2018-03-29T00:37:49Z

I'm okay with merging since tests are passing and there aren't any regressions. @bjarthur, thoughts?

nalimilan · 2018-03-29T08:57:10Z

src/scale.jl

@@ -173,6 +174,9 @@ function apply_scale(scale::ContinuousScale,
    for (aes, data) in zip(aess, datas)
        for var in scale.vars
            vals = getfield(data, var)
+            if vals isa CategoricalArray


Is there a strong reason to throw an error here? I fully agree that by default CategoricalArray should be treated as categorical even if it contains numeric values (that was the rationale for moving from PooledDataArray), but if the user explicitly requests a continuous scale, why not accept it? Or is it because the code fails later?

is there a strong reason to throw an error here?

Yes. It won't work. The CategoricalValues will cause failures sooner or later so I thought an informative error would be more helpful. The specific error in the example I tried is a missing isfinite(::CategoricalValue). As mentioned previously, the core of this PR is to prevent CategoricalValues from escaping to Compose. In the discrete case, all data vectors go through discretize where CategoricalArrays are converted to IndirectArrays to avoid CategoricalValues, but in the continuous case there isn't a similar function we can use for making sure all CategoricalArrays are converted.

that was the rationale for moving from PooledDataArray

I don't really understand why it wouldn't be exactly the same thing for PooledDataArrays. I'd also expect them to be treated as categorical.

OK. That can always be improved later anyway if somebody needs it. Thanks for finishing this porting work!

I don't really understand why it wouldn't be exactly the same thing for PooledDataArrays. I'd also expect them to be treated as categorical.

Yeah, but there had always been an ambiguity, as PDAs were also used as an efficient storage type (e.g. they supported math operations). Now the distinction should be clearer, if only because of the CategoricalArray name.
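The failure mode and the guard discussed in this thread can be illustrated without CategoricalArrays. In the sketch below, `WrappedValue` is a hypothetical stand-in for CategoricalValue and `check_continuous` is an invented name. The diff above tests `vals isa CategoricalArray`; checking the element type, as done here, is a simplification.

```julia
# Hypothetical stand-in for CategoricalValue: it wraps a value but
# defines no numeric methods such as isfinite.
struct WrappedValue{T}
    value::T
end

# Reject wrapped data up front with a readable message instead of
# letting a MethodError surface later in the pipeline.
function check_continuous(vals::AbstractVector)
    if eltype(vals) <: WrappedValue
        throw(ArgumentError("continuous scales cannot be applied to " *
                            "categorical data; pass a plain numeric vector"))
    end
    return vals
end

plain = [1.0, 2.0, 3.0]
wrapped = [WrappedValue(1.0), WrappedValue(2.0)]
check_continuous(plain)        # passes the data through unchanged
```

Calling `isfinite` on a `WrappedValue` raises a MethodError, which is the kind of late failure the informative error is meant to replace.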

bjarthur · 2018-03-30T13:07:11Z

subplot_grid and subplot_grid_free_axis still show regression differences for me.

andreasnoack · 2018-03-30T13:31:59Z

subplot_grid and subplot_grid_free_axis still show regression differences for me.

I get the plots below. Indeed the years have flipped but I think the new version is the correct one.

julia> barley = dataset("lattice", "barley");

julia> levels!(barley[:Year], ["1931", "1932"]);

julia> barley[(barley[:Year] .== "1931") .& (barley[:Site] .== "Morris") .& (barley[:Variety] .== "No. 475"),:]
1×4 DataFrames.DataFrame
│ Row │ Yield │ Variety │ Year │ Site   │
├─────┼───────┼─────────┼──────┼────────┤
│ 1   │ 22.6  │ No. 475 │ 1931 │ Morris │

julia> barley[(barley[:Year] .== "1932") .& (barley[:Site] .== "Morris") .& (barley[:Variety] .== "No. 475"),:]
1×4 DataFrames.DataFrame
│ Row │ Yield   │ Variety │ Year │ Site   │
├─────┼─────────┼─────────┼──────┼────────┤
│ 1   │ 44.2333 │ No. 475 │ 1932 │ Morris │

cached

genned

bjarthur · 2018-03-30T14:08:40Z

agreed! i've tagged gadfly and compose. as soon as that's merged into METADATA i'll merge this.

bjarthur · 2018-04-01T15:35:54Z

@kleinschmidt @cecileane have you two had a chance to checkout Gadfly master since this PR merged? just want to make sure all is okay before we tag a release. thanks!

kleinschmidt · 2018-04-01T16:44:16Z

I’ll have a look later today.

cecileane · 2018-04-01T18:51:17Z

Thanks @bjarthur! Yes I checked out Gadfly, and I first ran into a bunch of problems (see this issue). I didn't figure out why, but things are now working (on a different computer) after checking out the master branch of Compose 👍 . After you guys publish a new version of Gadfly (and Compose?), I will add Gadfly back into PhyloPlots.

cecileane · 2018-04-03T19:53:10Z

well, my earlier comment was using julia v0.6.1. Things don't work with julia v0.6.2. At compilation, I get the same error as mentioned here, with a very cryptic stacktrace:

ERROR: LoadError: MethodError: no method matching convert(::Type{AssertionError}, ::String)
Closest candidates are:
  convert(::Type{Any}, ::ANY) at essentials.jl:28
  convert(::Type{T}, ::T) where T at essentials.jl:29

and final error:

MethodError(Core.Inference.convert, (AssertionError, "invalid age range update"), 0x0000000000000ac6)

This is using the master version of Gadfly and Compose, and adding back "using Gadfly" in PhyloPlots.jl. Not even including the file that uses Gadfly functions. Any idea welcome!

andreasnoack · 2018-04-03T20:21:23Z

@cecileane Thanks for trying to test this. I think you might be hitting JuliaLang/julia#21653. It shouldn't be related directly to the recent changes in Gadfly.

kleinschmidt · 2018-04-03T20:35:36Z

All is well for my use case! Thanks for working so hard on this everyone!

cecileane · 2018-04-03T20:57:10Z

yes indeed, thanks @andreasnoack. It looks like the same kind of error that @mauro3 showed, and it does seem sensitive to using lots of modules together.

tlnagy · 2018-04-04T02:16:19Z

Is DataArrays still needed in REQUIRE?

Gadfly.jl/REQUIRE

Line 8 in e36d55c

DataArrays

bjarthur · 2018-04-06T17:56:11Z

so just to clarify, with this now merged into master and IndirectArrays modified, does one need to use the --compilecache=no flag when using Gadfly?

andreasnoack · 2018-04-06T18:14:21Z

does one need to use the --compilecache=no flag when using Gadfly?

No. Things should work and the flag is not present in .travis.yml.

The only regression that I'm aware of is #1130. I think I know how to fix it but I'm off for a week of vacation on Sunday so I might not get around to fixing it before that. It is mainly a matter of changing the part of

Gadfly.jl/src/misc.jl

Lines 408 to 414 in 12327dd

    
function discretize_make_ia(values::CategoricalArray{T}, levels::Vector) where {T}
    mapping = coalesce.(indexin(CategoricalArrays.index(values.pool), levels), 0)
    unshift!(mapping, coalesce(findfirst(ismissing, levels), 0))
    index = [mapping[x+1] for x in values.refs]
    any(iszero, index) && throw(ArgumentError("values not in levels encountered"))
    return IndirectArray(index, convert(Vector{T},levels))
end

that throws, so that it instead converts the zeros in index to n+1 and adds missing to levels. However,

Gadfly.jl/src/misc.jl

Lines 397 to 398 in 12327dd

    
discretize_make_ia(values::AbstractVector, levels) =
    IndirectArray(Array{UInt8}(indexin(values, levels)), levels)

should probably also be changed similarly in case the input is e.g. a Vector{String}.
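A minimal sketch of the change described above, for the plain-vector case and using only Base: values that do not occur in `levels` are sent to index n+1, and `missing` is appended to the levels. The name `indices_with_missing_slot` is hypothetical, and this is not the fix that was later applied in Gadfly. On Julia 1.x `indexin` returns `nothing` for unmatched values, hence `something`.

```julia
# Map each value to its position in `levels`. Values that are not found
# go to a new final slot (index n+1) that holds `missing`.
function indices_with_missing_slot(values::AbstractVector, levels::AbstractVector)
    index = something.(indexin(values, levels), 0)
    # Widen the element type so that `missing` can be stored if needed.
    newlevels = Vector{Union{Missing, eltype(levels)}}(levels)
    if any(iszero, index)
        n = length(levels)
        index[index .== 0] .= n + 1
        push!(newlevels, missing)
    end
    return index, newlevels
end

index, newlevels = indices_with_missing_slot(["b", "z", "a"], ["a", "b"])
```

Here "z" is not among the levels, so it ends up in the appended `missing` slot instead of raising an ArgumentError.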

andreasnoack force-pushed the anj/indirect branch from ea27bfd to ad42de7 Compare January 22, 2018 12:29

nalimilan mentioned this pull request Jan 26, 2018

Port dependencies to Missing and CategoricalArray JuliaData/DataFrames.jl#1232

Closed

pfitzseb reviewed Jan 31, 2018


andreasnoack force-pushed the anj/indirect branch from ad42de7 to 1980e9c Compare January 31, 2018 16:14

JeffBezanson mentioned this pull request Jan 31, 2018

fix a case of non-transitivity in method specificity JuliaLang/julia#25832

Merged

This was referenced Feb 1, 2018

[WIP] work towards updating to DataFrames v0.11+ framework #1088

Closed

Dataframes v0.11 is out #1065

Closed

bjarthur mentioned this pull request Feb 2, 2018

add example with a discrete scale and perserved order #1095

Merged

andreasnoack force-pushed the anj/indirect branch from 1980e9c to bc815d1 Compare February 6, 2018 21:55

nalimilan reviewed Feb 8, 2018


andreasnoack force-pushed the anj/indirect branch from bc815d1 to a10501b Compare February 12, 2018 10:45

andreasnoack added 5 commits March 28, 2018 19:31

remove nonzero_length

488f867

Make sure missing values are removed in color scales

3aca0ca

Change the way that classify_data(CategoricalArray) becomes :categorical

001d6d5

Add informative error message for continuous scales applied to CategoricalArrays

441db6e

Only convert dates in timeseries_year_2 for old versions of RData

98f97ac

andreasnoack force-pushed the anj/indirect branch from 905db90 to 98f97ac Compare March 28, 2018 17:55

tlnagy reviewed Mar 29, 2018


tlnagy approved these changes Mar 29, 2018


Fix typo in REQUIRE

af8418d

nalimilan reviewed Mar 29, 2018


bjarthur merged commit c778af5 into GiovineItalia:master Mar 30, 2018

andreasnoack deleted the anj/indirect branch March 30, 2018 16:06

bjarthur mentioned this pull request Apr 23, 2018

fix problem with colors of different types #1135

Merged

tlnagy mentioned this pull request Apr 25, 2018

make Geom.label consider point size #1139

Merged
