Ideas for more accurate categorization, saving open files / projects, getting metadata without implementing new watchers #504

phiresky · 2020-11-03T11:36:06Z

I have searched the issues of this repo and believe that this is not a duplicate (searched for filename, accuracy, title, etc.).
I have searched the documentation and believe that my question is not covered.

I was working on a project similar to ActivityWatch a while ago, and though I've kinda abandoned it for now, I implemented some ideas that I think would benefit ActivityWatch as well to improve the accuracy of tracking:

1. Save the command line the program was run with. If you run a program from a file manager etc, it is typically run with the to-open filename as the first argument. This means that by storing the cmdline, you can accurately get the actual filename of pdf files in a pdf viewer, image in an image viewer, video file in video player, audio players, text editors (libreoffice), gimp etc. This is even more useful when you are able to use the directory name of the file for further "bucketing", since the directory structure says a lot about what the file is about.

   Example: Currently if I have my pdf viewer open, ActivityWatch only logs for example "Evince - Annual Report". It only shows the title of the pdf, and that it was a pdf. If you knew the filename, you could tell that the filepath was /home/x/projects/university/2020/some-topic-name/report.pdf, which is much more valuable. Also e.g. all movies are in a separate folder from tv shows, so you can categorize whether you spend more time with movies or shows.

   Implementation: On X11, you can easily use _NET_WM_PID to get the PID (on all window managers I know of, e.g. Gnome, KDE, i3), then use system info (/proc/<pid>/cmdline) to get the cmdline of that PID. Example code here.

   On Windows, this should also be possible I think: https://github.com/phiresky/track-pc-usage-rs/blob/master/src/capture/winwins.rs#L115
2. Save the lsof of processes. This is separate and more janky than (1), and I see it's already been discussed here: https://forum.activitywatch.net/t/log-the-path-of-open-files-for-categorization/487
3. Parse structured data from window titles. In many programs, you can adjust what the displayed title is. This is a really easy trick to squeeze more useful data out of programs.
   1. For browsers, there are simple add-ons that add the full URL to the title: firefox, chrome. This can replace aw-watcher-web in its current form by just matching the URL in the window title with a regex, although aw-watcher-web could also get more other useful information in the future (e.g. the creator of a video on youtube)
   2. For shells, you can set the title using precmd / preexec to include the working directory, user (e.g. root), and command currently being run. In my case I simply add this as a JSON object to the title, which is then matched and parsed on the watcher side. The exact code is here: https://github.com/phiresky/track-pc-usage-rs#data-sources-setup This way I can track which project I was working on (via the cwd) as well as retrieve the full history of only that shell session, since I store the session id.
   3. For IDEs such as VSCode, you can add the project name and file name to the title, so you can tell which project the user was working on. This is a much easier though somewhat less flexible alternative to e.g. https://forum.activitywatch.net/t/bucket-and-event-design-fo-vs-code-extension/120 . You set the config window.title to ${dirty}${activeEditorShort}${separator}${rootName}${separator}🛤sd🠚proj=${rootPath}🙰file=${activeEditorMedium}🠘 VSCode. This appends a machine-parseable token to the end of the title. The reason I didn't just add a JSON object here and instead used the format 🛤[sd for software development🠚key=value<🙰>key=value<🙰>key=value🠘] is that VSCode as well as some other programs don't have a "JSON-Escape" functionality, so if the project name or filename included a " or a , the JSON would be broken, while filenames containing unusual unicode symbols like 🛤, 🙰 and 🠚 are much less likely. Hacky, I know, but even using JSON should be fine for 99% of cases.
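To make the token format above concrete, here is a rough sketch of how a watcher could pull the key=value pairs back out of a window title. The function name and the sample title are made up for illustration; only the delimiter characters come from the window.title setting described above.

```python
import re

# Matches 🛤<kind>🠚<key=value pairs separated by 🙰>🠘 anywhere in a title.
TOKEN_RE = re.compile(r"🛤(?P<kind>[^🠚]*)🠚(?P<body>.*?)🠘")


def parse_title_token(title):
    """Return (kind, fields) for the first token in the title, or None.

    Hypothetical helper: splits the token body on 🙰 and each pair on the
    first "=", so values may themselves contain "=".
    """
    match = TOKEN_RE.search(title)
    if match is None:
        return None
    fields = {}
    for pair in match.group("body").split("🙰"):
        key, sep, value = pair.partition("=")
        if sep:  # skip fragments without "="
            fields[key] = value
    return match.group("kind"), fields
```

A title without a token simply yields None, so the parser can be run on every window title and the result stored next to the raw title.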
Currently it looks like aw-watcher-window really only saves the window title and "appname" to the db. Imo, saving this little metadata is a bad idea, since if the app-name matchers change over time all old data collected is "worthless". I see there's been multiple issues here in the past about adding more browsers etc, which with the current model cannot be changed retroactively.
External data sources could be used to get software category etc. For example, you can detect the debian package that contains a program using dpkg -S /usr/bin/firefox-developer-edition. That will give you a unique name such as firefox. You can then look this up in wikidata using a query such as select ?software where archlinux_package = "vlc". Then you can use the instance_of relation to get which software_categorys that software is a part of: https://www.wikidata.org/wiki/Q171477 VLC is a "video player". This way you would use existing open data and also enable users to improve it easily. Note that this also requires storing more metadata than just your "appname".
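As a rough sketch of the first step (binary path to package name), assuming a Debian-like system where dpkg -S prints lines of the form "package: /path". The helper names are hypothetical, and the Wikidata lookup is left out on purpose:

```python
import subprocess


def parse_dpkg_search(output):
    """Extract package names from the text printed by `dpkg -S <path>`.

    Handles "pkg: /path", "pkg1, pkg2: /path" and multi-arch names such as
    "libc6:amd64: /path" (the architecture suffix is dropped). Diversion
    lines are skipped.
    """
    packages = []
    for line in output.splitlines():
        owner, sep, _path = line.partition(": ")
        if not sep or owner.startswith("diversion "):
            continue
        for name in owner.split(", "):
            packages.append(name.split(":")[0])
    return packages


def packages_for_binary(path):
    """Run dpkg -S for a path; returns [] if dpkg reports no owner."""
    result = subprocess.run(
        ["dpkg", "-S", path], capture_output=True, text=True, check=False
    )
    return parse_dpkg_search(result.stdout)
```

The package name would then be the key for the external lookup, stored alongside the appname rather than instead of it.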


ErikBjare · 2020-11-03T13:30:44Z

Nice project! Good to see you're also on the Rust track.

From the outset, I just want to say that over time the ActivityWatch approach has diverged a fair bit from arbtt. Many of the ideas are the same, but for the sake of UX, development speed, and the somewhat unique maintenance burden of building cross-platform stuff there have been a lot of tradeoffs along the way.

Many of your points are good/valid, but since we all work on projects like this with different goals/requirements/workflows in mind, we all have a bias towards solutions that fit our specific problem. Feedback from our users over the years has guided us to solutions that are often a "one size fits all" that generalizes well, instead of an individually tailored solution (which can still be accomplished by forking).

> Save the command line the program was run with.

We used to do this a long time ago. I can't remember why we removed it, but I suppose I decided on a tradeoff between space/speed and detail, it might also simply have been reliability/cross-platform issues I wasn't keen on resolving.

It's trivial to fork aw-watcher-window and add this yourself, if you really want it.

Making your PDF viewer (I use Okular) show the file path worked well enough for me. But for some applications that don't offer that option (like VLC, which only shows the name and not the path) it's obviously not as easy.

> Save the lsof of processes.

As @johan-bjareholt mentioned in that forum link, it's way too much output to store by default. I'd do this separately from ActivityWatch if I was really interested in this approach (but it seems messy).

> Parse structured data from window titles.

The general idea about trying to get informative window titles is good (and would probably make a good addition to the docs, similar to how it's done for the arbtt docs).

The idea of adding JSON to the window title is neat, but seems messy and not something I expect most users to be interested in.

> although aw-watcher-web could also get more other useful information in the future (e.g. the creator of a video on youtube)

It also currently tracks things like whether a tab is private or audible (playing sound), which is quite helpful (although not yet used for analysis in the web UI).

> Imo, saving this little metadata is a bad idea, since if the app-name matchers change over time all old data collected is "worthless". I see there's been multiple issues here in the past about adding more browsers etc...

The old data collected wouldn't be worthless, but yes you'd need some minor changes to the classification rules, although I wouldn't really consider it an issue. App names have proven to be rather consistent, and the actual work needed to maintain the browser mappings has been minimal (despite the several opened issues). Other solutions similar to what you propose have been considered, but have ultimately proven more complicated (considering the cross-platform nature of AW).

> ...which with the current model cannot be changed retroactively.

Not sure what you mean by this, changing it retroactively works just fine.

> External data sources could be used to get software category etc...

Those are some very platform-specific scenarios that make them difficult to generalize. I've been down the rabbit hole of trying to use Wikidata for things like this, but I think it's better suited for external analysis than making it a part of ActivityWatch itself (to avoid scope creep).

Unrelated to the things you mentioned, I'm curious about https://github.com/phiresky/sqlite-zstd. In the long-term, I'd be very interested in a solution like this (I've probably mentioned compressing the database somewhere in the issues).

However, our users haven't really complained about database size, and aren't really expected to (with normal use). As an example, my DB (with a lot of data) is still less than 500MB, which is reduced even further by filesystem-level lz4 compression. So not at all a priority, but still of interest.

phiresky · 2020-11-04T13:31:00Z

> ActivityWatch approach has diverged a fair bit from arbtt

I didn't actually realize activitywatch was related to arbtt :)

> We used to do this a long time ago. I can't remember why we removed it, but I suppose I decided on a tradeoff between space/speed and detail, it might also simply have been reliability/cross-platform issues I wasn't keen on resolving.

Mh, I can't really imagine saving the cmdline being much more expensive than just storing the executable. I could imagine that it doesn't work trivially on Windows though.
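For reference, the Linux side of this is small. A minimal sketch, assuming the PID has already been obtained (e.g. from _NET_WM_PID); the function names are made up, and this is not what my tool or aw-watcher-window actually does:

```python
import os
from pathlib import Path


def parse_cmdline(raw):
    """Split the NUL-separated bytes of /proc/<pid>/cmdline into arguments."""
    if raw.endswith(b"\0"):
        raw = raw[:-1]  # the kernel terminates the last argument with NUL
    if not raw:
        return []  # e.g. kernel threads have an empty cmdline
    return [part.decode("utf-8", errors="replace") for part in raw.split(b"\0")]


def read_cmdline(pid):
    """Read and split the command line of a running process (Linux only)."""
    return parse_cmdline(Path(f"/proc/{pid}/cmdline").read_bytes())


def guess_opened_files(argv):
    """Return the non-option arguments after argv[0] that name existing files.

    Heuristic only: relative paths are resolved against the watcher's cwd,
    not the cwd of the watched process.
    """
    return [a for a in argv[1:] if not a.startswith("-") and os.path.isfile(a)]
```

Storage-wise this is one extra short list of strings per event, which is why I'd expect the cost to be close to that of storing the executable name.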

> The idea of adding JSON to the window title is neat, but seems messy and not something I expect most users to be interested in.

Can't really judge this one. For me the window titles aren't actually visible anywhere, so I don't care. But, if you append the data at the very end I don't think it's much of a problem, since long and cut off Window titles are pretty common (e.g. Browser title is almost always cut off).

> App names have proven to be rather consistent

Looks like I didn't look closely enough at the code: I thought that what was stored in the DB was the app-category and not the app-name. But it is the app-name, so retroactively changing the mapping app-name -> app-category is possible. At least as long as the appname is sufficient to identify the program. I see it's using window-class on X11... I remember a few years ago that was not enough since many dialogs had generic window classes, as well as many java programs just had the window class Awt-Window or similar?
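To spell out why storing the app-name keeps this retroactive: the category is computed from the stored events at query time, so changing the rules re-categorizes old data too. A toy sketch with a made-up event shape and rule format (not ActivityWatch's actual query or classification code):

```python
import re
from collections import defaultdict


def categorize(event, rules):
    """Return the first category whose pattern matches the app name or title."""
    text = f"{event['app']} {event['title']}"
    for category, pattern in rules:
        if pattern.search(text):
            return category
    return "Uncategorized"


def time_per_category(events, rules):
    """Sum event durations (in seconds) per category under the given rules."""
    totals = defaultdict(float)
    for event in events:
        totals[categorize(event, rules)] += event["duration"]
    return dict(totals)
```

Running time_per_category again with an extended rule list changes the result for old events without touching the stored rows. It only breaks down where the stored app-name itself is too generic, like the Awt-Window case.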

> better suited for external analysis than making it a part of ActivityWatch itself (to avoid scope creep).

Yep, that's the main problem with my (incomplete) tool. I don't want to compromise so the scope gets too large.

> Unrelated to the things you mentioned, I'm curious about https://github.com/phiresky/sqlite-zstd.

Since my approach to storing data is very different from AW's, it's basically necessary for me: I store as much information as possible, in as raw a format as possible, in the events db. These raw events are then mapped into a generic event later. E.g. an X11 event contains tons of information about every single program that was open at a time:

[image]

which is then mapped into a general extracted info per event:

[image]

I even store all programs, not just the focused one, since a focused web browser that is on rustlang.org plus an open IDE with Rust code probably means the browsing is related to that software project.

That means a single watcher event is 20kB, which totals to 5GB per year.

That's ridiculous, but it's necessary to stay with "my philosophy" to store the data as raw as possible. Trying to squeeze the data as tiny as possible IMO makes it very easy to make mistakes (transform data in a lossy way) and to remove data that might turn out to be useful later, and it makes migrations harder. By moving the choice of which data is useful to the analysis step, you leave yourself and others more options on what to do with it, especially stuff you might not even have thought of. The raw data is large, but most of that is redundancy and not real information.

If you have a huge sqlite db whose rows contain only 16 distinct JSON objects repeated over and over, sqlite-zstd should in theory be able to reduce that to only one byte per row. zstd does the heavy lifting here by creating a single dictionary that includes all the important commonalities between rows. sqlite-zstd already works well; the issue is that it's not flexible enough to be used in production (how do you decide when to retrain the compression dictionary?). It also needs more performance tests to see how much overhead decompressing rows causes (the decompression itself is >1GB/s, so it is more likely to improve performance by reading less data from disk than to make it worse; it's just the initialization time of the decompressor that might be bad).
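The shared-dictionary effect can be shown with the Python standard library alone. Note the swap: this uses zlib's preset dictionary (zdict) as a stand-in for a trained zstd dictionary, and it has nothing to do with the sqlite-zstd API. The row layout in make_row is invented for the example.

```python
import json
import zlib


def make_row(timestamp):
    """Build a made-up, highly redundant JSON event row."""
    event = {
        "timestamp": timestamp,
        "windows": [
            {"class": "firefox", "title": "Rust Programming Language", "focused": True},
            {"class": "code", "title": "main.rs - myproject", "focused": False},
        ],
    }
    return json.dumps(event, sort_keys=True).encode("utf-8")


def compress_row(row, dictionary=b""):
    """Compress one row on its own, optionally against a shared dictionary."""
    if dictionary:
        compressor = zlib.compressobj(level=9, zdict=dictionary)
    else:
        compressor = zlib.compressobj(level=9)
    return compressor.compress(row) + compressor.flush()


def decompress_row(blob, dictionary=b""):
    """Inverse of compress_row; must be given the same dictionary."""
    if dictionary:
        decompressor = zlib.decompressobj(zdict=dictionary)
    else:
        decompressor = zlib.decompressobj()
    return decompressor.decompress(blob) + decompressor.flush()
```

Using an earlier row as the dictionary, a later row that differs only in its timestamp compresses to a small fraction of what it needs when compressed in isolation, because almost everything becomes a back-reference into the dictionary.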

My approach here would probably be to split the database into monthly files, then when each month is over apply the transparent compression to the past month. Since opening and closing sqlite dbs is very cheap, this should scale fine (e.g. with ATTACH DATABASE). This way you don't have to think about retraining the dictionaries etc. Something similar should even work without an sqlite extension, since you could zip the monthly sqlite dbs, and just load them read-only and fully into memory when historical data is requested.
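A small sketch of the monthly-files idea with plain sqlite3 and ATTACH DATABASE. The table layout, file names and helper names are all invented for the example, and compression is left out:

```python
import sqlite3
import tempfile
from pathlib import Path


def create_month_db(path, rows):
    """Create one monthly database file holding (timestamp, data) rows."""
    conn = sqlite3.connect(str(path))
    conn.execute("CREATE TABLE events (timestamp TEXT, data TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.commit()
    conn.close()


def open_months(paths):
    """Attach each monthly file to one connection as month_0, month_1, ..."""
    conn = sqlite3.connect(":memory:")
    for index, path in enumerate(paths):
        conn.execute(f"ATTACH DATABASE ? AS month_{index}", (str(path),))
    return conn


def count_events(conn, month_count):
    """Count events across all attached months with a UNION ALL query."""
    union = " UNION ALL ".join(
        f"SELECT timestamp FROM month_{i}.events" for i in range(month_count)
    )
    return conn.execute(f"SELECT COUNT(*) FROM ({union})").fetchone()[0]


def demo():
    """Build two monthly files in a temp dir and query them together."""
    with tempfile.TemporaryDirectory() as tmp:
        october = Path(tmp) / "2020-10.sqlite"
        november = Path(tmp) / "2020-11.sqlite"
        create_month_db(october, [("2020-10-01T08:00:00Z", "{}"), ("2020-10-02T08:00:00Z", "{}")])
        create_month_db(november, [("2020-11-01T08:00:00Z", "{}")])
        conn = open_months([october, november])
        try:
            return count_events(conn, 2)
        finally:
            conn.close()
```

One caveat: SQLite limits the number of simultaneously attached databases (10 by default), so a query over many months would have to attach and detach files in batches.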

I wanted to give you instructions to test sqlite-zstd on your db, but sadly it's based on an unmerged PR that's missing unchecked_transaction, so it doesn't currently compile.

stale · 2021-05-04T05:32:36Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label May 4, 2021

stale bot closed this as completed May 19, 2021
