write_weather: don't transform into DataFrame for selecting columns #19

Open
VEZY opened this issue Jan 16, 2023 · 0 comments
Labels
enhancement New feature or request

VEZY commented Jan 16, 2023

Depending on DataFrames is risky for us: it releases often, so keeping up with it adds maintenance work that we would rather avoid (although DataFrames is fairly stable by now).

We could implement column selection ourselves, or use the `select` provided by TableOperations.jl.
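To make the first option concrete, here is a minimal sketch of column selection that relies only on the Tables.jl interface. The helper name `select_columns` and the stand-in table `tbl` are hypothetical and not part of PlantMeteo; the sketch assumes the input exposes its columns through `Tables.columns`, and it has not been benchmarked against a TimeStepTable.

```julia
using Tables

# Hypothetical helper (not part of PlantMeteo): keep only the requested
# columns that exist in the table, without going through a DataFrame.
function select_columns(table, names)
    cols = Tables.columns(table)
    available = Tables.columnnames(cols)
    # Silently drop requested names that are not in the table:
    kept = Tuple(n for n in names if n in available)
    # A NamedTuple of vectors is itself a valid Tables.jl column table:
    return NamedTuple{kept}(Tuple(Tables.getcolumn(cols, n) for n in kept))
end

# Stand-in data, assumed for illustration only:
tbl = (T=[25.0, 26.0], Rh=[0.6, 0.5], Wind=[1.0, 1.2])
sel = select_columns(tbl, [:T, :Wind, :not_a_column])
```

For an input that is already column-oriented, such as the NamedTuple above, the selected columns are the same arrays as in the input, so nothing is copied. Whether that also holds for a TimeStepTable depends on how its `Tables.columns` method is implemented.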

One oddity: `select` from TableOperations.jl makes many copies, so it ends up no faster than converting the TimeStepTable into a DataFrame and selecting from the resulting DataFrame. We need to find where those copies happen.

Tested with the following code:

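# Tables is needed for Tables.istable / Tables.getcolumn, BenchmarkTools for @benchmark
using PlantMeteo, Dates, Tables, TableOperations, DataFrames, BenchmarkTools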

file = joinpath(dirname(dirname(pathof(PlantMeteo))),"test","data","meteo.csv")
w = read_weather(
    file,
    :temperature => :T,
    :relativeHumidity => (x -> x ./100) => :Rh,
    :wind => :Wind,
    :atmosphereCO2_ppm => :Cₐ,
    date_format = DateFormat("yyyy/mm/dd")
)

function select_weather_tableop(w, select=setdiff(propertynames(w), PlantMeteo.ATMOSPHERE_COMPUTED))

    Tables.istable(w) || throw(ArgumentError("The weather data must be interfaced with `Tables.jl`."))

    if select !== nothing
        select_ = [select...]
        for var in select
            # check if the variables are in the table, if not remove them from the selection:
            if !hasproperty(w, var)
                popat!(select_, findfirst(select_ .== var))
            else
                # Remove variables with all values at Inf (default value, we don't need to write it)
                if all(Tables.getcolumn(w, var) .== Inf)
                    popat!(select_, findfirst(select_ .== var))
                end
            end
        end
    end

    # select the variables and return:
    return TableOperations.select(w, select_...)
    # return w |> TableOperations.select(select_...) |> Tables.columntable
end

function select_weather_df(w, select=setdiff(propertynames(w), PlantMeteo.ATMOSPHERE_COMPUTED))

    Tables.istable(w) || throw(ArgumentError("The weather data must be interfaced with `Tables.jl`."))

    if select !== nothing
        select_ = [select...]
        for var in select
            # check if the variables are in the table, if not remove them from the selection:
            if !hasproperty(w, var)
                popat!(select_, findfirst(select_ .== var))
            else
                # Remove variables with all values at Inf (default value, we don't need to write it)
                if all(Tables.getcolumn(w, var) .== Inf)
                    popat!(select_, findfirst(select_ .== var))
                end
            end
        end
    end

    # select the variables and return:
    return DataFrames.select(DataFrames.DataFrame(w), select_)
end


df_tableop = select_weather_tableop(w) # 17.875 ms Memory estimate: 3.49 MiB, allocs estimate: 65415.
w_df = DataFrames.DataFrame(w)
df_df = select_weather_df(w) # 18.633 ms Memory estimate: 3.59 MiB, allocs estimate: 67114.

@benchmark select_weather_tableop($w) # 17.978 ms Memory estimate: 3.49 MiB, allocs estimate: 65415.
@benchmark select_weather_df($w) # 18.604 ms Memory estimate: 3.59 MiB, allocs estimate: 67114.
@benchmark select_weather_df($w_df) # 24.334 μs Memory estimate: 18.82 KiB, allocs estimate: 251.

Results:

  • using TableOperations on a TimeStepTable takes 18 ms (very slow!).
  • using DataFrames on a TimeStepTable takes 19 ms (also very slow, but expected since we convert to a DataFrame first).
  • using DataFrames on a DataFrame takes 24 μs, which is fast, and is what TableOperations should match.
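
One possible way to narrow down where the copies come from is to measure allocations stage by stage with Base's `@allocated`. The sketch below uses a stand-in row-oriented table (a vector of NamedTuples), not the real TimeStepTable, and the function names `materialize` and `pick` are hypothetical. It only illustrates the measuring approach and says nothing about what the actual cause is.

```julia
using Tables

# Stand-in for a row-oriented table (assumption: not the real TimeStepTable).
rows = [(T=20.0 + i, Rh=0.5, Wind=1.0) for i in 1:10_000]

# Wrap each stage in a function so that `@allocated` measures one call at a time.
materialize(t) = Tables.columns(t)  # builds columns from the rows
pick(cols) = (T=Tables.getcolumn(cols, :T), Wind=Tables.getcolumn(cols, :Wind))

# Warm-up calls, so that compilation is not counted in the measurements:
cols = materialize(rows)
pick(cols)

bytes_materialize = @allocated materialize(rows)
bytes_pick = @allocated pick(cols)

println("materialize: ", bytes_materialize, " bytes; pick: ", bytes_pick, " bytes")
```

Running the same two measurements on `w` would show whether the cost sits in building the columns or in the selection itself.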
@VEZY VEZY added the enhancement New feature or request label Jan 17, 2023