
alternative (easier?) way to define datasets #11

Open · wants to merge 6 commits into master
Conversation

richardjgowers (Member, author):

@orbeckst what do you think about this? I was looking at how to write tests for this package, and with every single dataset implementing how it should be downloaded, there was a lot of room for potential errors. Instead, if we define a template for how a dataset should look, we can keep the logic of how it is retrieved in a centralised location.

It takes ideas from how we auto-register Readers in MDAnalysis, so Datasets are added to a list of available datasets once defined.
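For a rough idea, here is a minimal sketch of that registration pattern, assuming a module-level registry dict and an __init_subclass__ hook (the names are illustrative, not necessarily the PR's actual identifiers):

DATASETS = {}

class Dataset:
    """Template describing a downloadable dataset."""
    NAME = None         # short key the dataset is looked up by
    DESCRIPTION = None  # e.g. an .rst file with the long description

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.NAME is not None:
            # every concrete subclass adds itself to the central
            # registry the moment it is defined
            DATASETS[cls.NAME] = cls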

return records

class ADK_Equilibrium(Dataset):
    NAME = "adk_equilibrium"
    DESCRIPTION = "adk_equilibrium.rst"
richardjgowers (author):

One downside I can see is that we've lost the short description that the generating function had.

orbeckst (Member):

Don't want to lose the description and don't want to lose the docs...

richardjgowers (author):

So currently it looks like:

>>> print(ds.__doc__)
AdK 1us equilibrium trajectory (without water)

    Attributes
    ----------
    topology : filename
         Filename of the topology file
    trajectory : filename
         Filename of the trajectory file
    DESCR : string
         Description of the trajectory.

So it's mostly still there.

super().__init__(**contents)


def fetch(dataset, data_home=None, download_if_missing=True):
richardjgowers (author):

This allows MDADATA.fetch('adk_equilibrium'), rather than a separate function for each dataset
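A sketch of how the body of that generic fetch could dispatch through the registry from the earlier sketch (the argument handling is an assumption, not the PR's actual code):

def fetch(dataset, data_home=None, download_if_missing=True):
    # look the dataset class up in the central registry
    try:
        cls = DATASETS[dataset]
    except KeyError:
        raise ValueError("unknown dataset {!r}; available: {}".format(
            dataset, ", ".join(sorted(DATASETS))))
    # assumes each Dataset subclass handles downloading in __init__
    return cls(data_home=data_home, download_if_missing=download_if_missing)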

orbeckst (Member):

There's a reason for explicit functions: tab completion and introspection. (sklearn does it and it works really well; much better than having to know the name of the dataset.)

I'd like to keep explicit functions, both for ease of use and for the same "look and feel" as sklearn.datasets (as well as getting docs!)

We can have a generic mechanism and generate the fetch_* functions.
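Generating them could be as simple as the following sketch, which keeps per-dataset docstrings so help() and tab completion still work (names illustrative):

def _make_fetcher(name, cls):
    # closure so each generated function remembers its dataset name
    def fetcher(data_home=None, download_if_missing=True):
        return fetch(name, data_home=data_home,
                     download_if_missing=download_if_missing)
    fetcher.__name__ = "fetch_{}".format(name)
    fetcher.__doc__ = cls.__doc__  # keep the dataset's docstring
    return fetcher

# expose an explicit fetch_<name> function per registered dataset
for _name, _cls in DATASETS.items():
    globals()["fetch_{}".format(_name)] = _make_fetcher(_name, _cls)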

orbeckst (Member):

Yes, that's nice for this case. But did you look at some of the other accessors, like fetch_adk_transitions_DIMS, where we get a tar file and unpack it? We might be able to reduce our requirements to these two types of cases.
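For the archive case, the unpack step could look roughly like this sketch (assuming one downloaded tarball per dataset; the helper name is made up):

import os
import tarfile

def _unpack_archive(archive_path, data_home):
    # tarfile auto-detects gzip/bz2/xz compression with mode "r:*"
    with tarfile.open(archive_path, mode="r:*") as tar:
        tar.extractall(path=data_home)
    # keep only the unpacked files, not the archive itself
    os.remove(archive_path)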

richardjgowers (author):

We could put the compression/other info into the RemoteFileMetaData object.
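For example (a sketch only: sklearn's RemoteFileMetadata is a plain (filename, url, checksum) namedtuple, and the extra field and values below are hypothetical):

from collections import namedtuple

RemoteFileMetadata = namedtuple(
    "RemoteFileMetadata",
    ["filename", "url", "checksum", "uncompress"])

ADK_TRANSITIONS_DIMS = RemoteFileMetadata(
    filename="adk_transitions_DIMS.tar.gz",
    url="https://example.org/adk_transitions_DIMS.tar.gz",  # placeholder
    checksum="<sha256 of the archive>",
    uncompress="tar.gz",  # hypothetical hint for the fetch logic
)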

richardjgowers (author):

OK, yeah, the namespace is nice; we could implement the static functions as:

def fetch_adk():
    return base.fetch('adk')

orbeckst (Member):

RemoteFileMetadata is verbatim from sklearn. Might be useful to keep it that way and really keep it simple.

If anything, we should build data structures that contain RemoteFileMetadata instances and map remote files to local ones. Have a look at the transitions dataset to see what else we have.

Finally, have a look at sklearn.datasets (and the outstanding docs) to see the variance. I think one reason for copy&paste code is that ultimately each dataset in the wild might have slightly different requirements. Still, that's not to say that we can't try to get a bit of order in ;-).
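With RemoteFileMetadata kept verbatim, the wrapping structure could be a plain mapping whose keys become the Bunch attributes, as in this sketch (filenames, URLs, and checksums are placeholders):

from collections import namedtuple

# verbatim sklearn-style record
RemoteFileMetadata = namedtuple(
    "RemoteFileMetadata", ["filename", "url", "checksum"])

ARCHIVE = {
    "topology": RemoteFileMetadata(
        filename="adk.psf",
        url="https://example.org/adk.psf",
        checksum="<sha256>"),
    "trajectory": RemoteFileMetadata(
        filename="adk.dcd",
        url="https://example.org/adk.dcd",
        checksum="<sha256>"),
}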


Returns
-------
dataset : dict-like object with the following attributes:
richardjgowers (author):

And another thing lost is the clear description of what the returned Bunch will have...

orbeckst (Member):

I don't want to lose this...

# cannot check its checksum. Instead we download individual files
# separately. The keys of this dict are also going to be the keys in the
# Bunch that is returned.
ARCHIVE = {
orbeckst (Member):

If we normalize all of this then we might be able to just put all these data into JSON or YAML files.

richardjgowers (author):

Sure, this is essentially py-son at this point.
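A quick sketch of how such a JSON definition could be turned back into a registered class, reusing the Dataset base and registry from the earlier sketch (field names illustrative):

import json

ADK_JSON = """{
  "name": "adk_equilibrium",
  "description": "adk_equilibrium.rst"
}"""

record = json.loads(ADK_JSON)

# building the class with type() still triggers __init_subclass__,
# so the dataset registers itself exactly as a hand-written one would
type(record["name"], (Dataset,), {
    "NAME": record["name"],
    "DESCRIPTION": record["description"],
})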

orbeckst (Member):

It's a good idea to have a template for the most common types of access:

  • separate files
  • archives that need to be unpacked

I still want to keep explicit accessor functions because of ease of use, docs, and keeping similarity to sklearn.datasets.

We still need to allow other code that does not fit into the general templates.

orbeckst (Member):

Might be worthwhile reviving this PR... if someone wants to look into it.
