Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functionality to notify systemd on olad startup and config reload #1444

Open
wants to merge 25 commits into
base: master
Choose a base branch
from

Conversation

shenghaoyang
Copy link
Contributor

@shenghaoyang shenghaoyang commented Jul 12, 2018

  • This helps users who run olad under the management of systemd to
    order their services after olad startup, (i.e. starting up a daemon
    that depends on the olad RPC socket being available), avoiding
    the need for a manual delay.

  • If anyone's versed in autotools magic, please help check the
    changes to configure.ac. I've only dabbled with autotools one
    other time in my life, and I'm not sure if what I did was exactly correct.

  • Tested on my local machine with a unit file running the olad binary
    as a notifying service. Reload messages tested as well. And I do
    see the systemd generated reload notifications bounding the
    output from olad during the reload process.

  • Not sure how we go about writing test cases for this, though.
    Seems a bit pathological to formally test notification functionality,
    but if the CI environment has systemd and libsystemd installed,
    we could cook up a test for the notification functionality using
    systemd-run - if the notification didn't go through, then
    systemd-run should notify on an error starting the service.

Thanks for reviewing.

Shenghao

This helps users who run olad under the management of systemd to
order their services after olad startup, (i.e. starting up a daemon
that depends on the olad RPC socket being available), avoiding
the need for a manual delay.
Copy link
Member

@peternewman peternewman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

configure.ac Outdated
@@ -387,6 +387,13 @@ esac
have_libftd2xx="no"
AM_CONDITIONAL([HAVE_LIBFTD2XX], [test "x$have_libftd2xx" = xyes])

# libsystemd
AC_SEARCH_LIBS([sd_notify], [systemd], [have_libsystemd="yes"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you see have_libftd2xx above, you should really initialise it to no before you use it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. I'll follow the examples from the above location.

What do you think about making linking against libsystemd a command line option to the build system? I'm thinking build systems that have libsystemd may silently cause built binaries to link against libsystemd, and this could lead to a bit of a dependency problem, unless they're aware of this new change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it may make sense, see e.g. https://github.com/OpenLightingProject/ola/blob/master/configure.ac#L494-L518 .

I was wondering, but at least it's only libsystemd, it doesn't require systemd itself.

#ifdef HAVE_LIBSYSTEMD
// Return value is intentionally not checked for both calls.
// See return value section under sd_notify(3).
sd_notify(0, "RELOADING=1\nSTATUS=Reloading plugins\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having read the man page, I understand why it's not checked, but for user sanity, we should add some logging. Probably both to signify sd_notify has been run/the return code and particularly a warning if it failed.

I'd probably suggest OLA_WARN if we get <0 return code, and then maybe OLA_INFO for 0 and OLA_DEBUG for >0, but that may be a bit excessive. Likewise others may argue those log levels should all be reduced by one...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the system is too old to support RELOADING? I assume they just get discarded as they're plain text?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the system is too old to support RELOADING? I assume they just get discarded
as they're plain text?

sd_notify(3) mentions this: The string may contain any kind of variable assignments, but the following shall be considered well-known.

I think that does indeed imply that if the service manager doesn't recognize the variable being assigned to, it'll either log that somewhere or ignore that assignment, so we shouldn't have a problem with old systems.

I understand why it's not checked, but for user sanity, we should add some logging. Probably
both to signify sd_notify has been run/the return code and particularly a warning if it failed.

If that's the case, would a message at startup do? libsystemd provides capabilities for determining whether a binary has been setup under the control of systemd, and that could be used to indicate that notification messages are being used. As for the failures, though, we could log those as well, but I think conditionally - only when being run under systemd

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scratch that, there's no library function to determine whether the service has been run under systemd, but the call shouldn't return a negative value if the socket was not passed, anyway. I'll log failures and log from main to indicate whether notifications are being passed.

OLA_INFO << "Reloading plugins";
StopPlugins();
m_plugin_manager->LoadAll();
#ifdef HAVE_LIBSYSTEMD
sd_notify(0, "READY=1\nSTATUS=Plugin reload complete\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, some logging to help the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

olad/Olad.cpp Outdated
#if HAVE_LIBSYSTEMD
// Return value is intentionally not checked. See return value section
// under sd_notify(3).
sd_notify(0, "READY=1\nSTATUS=Startup complete\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, a bit of logging please.

Copy link
Contributor Author

@shenghaoyang shenghaoyang Jul 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.. I don't think we need it for this one, though. It could be useful to tell the user that we're not running under systemd, or that we have failed to access the notification socket due to a configuration error.

However, if olad was launched as a systemd service of type notify (which is the intended use case): then systemd will report that the start of the service failed, anyway, which should alert the user to the error. There also shouldn't be an error accesing the notification socket, because systemd implicitly allows services with type=notify to access the notification socket.

A failure looks like this:

● olad.service - ola daemon
   Loaded: loaded (/home/shenghao/.config/systemd/user/olad.service; static; vendor preset: enabled)
   Active: failed (Result: timeout) since Sat 2018-07-14 11:43:42 +08; 9s ago
  Process: 25379 ExecStart=/home/shenghao/Projects/ola/olad/olad (code=exited, status=0/SUCCESS)
 Main PID: 25379 (code=exited, status=0/SUCCESS)

Jul 14 11:42:13 shenghao-crbook15 olad[25379]: *** WARNING *** Please fix your application to use the native API of Avahi!
Jul 14 11:42:13 shenghao-crbook15 olad[25379]: *** WARNING *** For more information see <http://0pointer.de/blog/projects/avahi-compat.html>
Jul 14 11:42:13 shenghao-crbook15 lt-olad[25379]: *** WARNING *** For more information see <http://0pointer.de/blog/projects/avahi-compat.html>
Jul 14 11:42:13 shenghao-crbook15 olad[25379]: common/io/IOUtils.cpp:39: open(/dev/dmx0): No such file or directory
Jul 14 11:42:13 shenghao-crbook15 olad[25379]: plugins/opendmx/OpenDmxPlugin.cpp:81: Could not open /dev/dmx0 No such file or directory
Jul 14 11:42:13 shenghao-crbook15 olad[25379]: common/io/IOUtils.cpp:39: open(/dev/kldmx0): No such file or directory
Jul 14 11:42:13 shenghao-crbook15 olad[25379]: plugins/karate/KaratePlugin.cpp:80: Could not open /dev/kldmx0 No such file or directory
Jul 14 11:43:42 shenghao-crbook15 systemd[3896]: olad.service: Start operation timed out. Terminating.
Jul 14 11:43:42 shenghao-crbook15 systemd[3896]: olad.service: Failed with result 'timeout'.
Jul 14 11:43:42 shenghao-crbook15 systemd[3896]: Failed to start ola daemon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's those edge cases that can be frustrating to troubleshoot though. If I forget to put type=notify, I want olad to tell me the socket isn't available, so I know to tweak my systemd config. Likewise in that timeout case, I'd like to know olad did/didn't send the message so I can troubleshoot ifdef type stuff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've added this in commit a5e57fe. A success/fail message everytime may clog up the logs if we want to use notify() a little more, though....

@peternewman peternewman added this to the 0.11.0 milestone Jul 14, 2018
@shenghaoyang
Copy link
Contributor Author

shenghaoyang commented Jul 14, 2018

Also should we add MAINPID https://www.freedesktop.org/software/systemd/man/sd_notify.html#MAINPID=%E2%80%A6 like the extended example here https://www.freedesktop.org/software/systemd/man/sd_notify.html#Examples

MAINPID is probably only useful if we run olad with --daemon and have it fork itself. However, when running under systemd, that's probably not required, and we can rely on systemd to fork us off and keep us happily tracked, which is what it's good at, anyway. Unless olad requires some special magic while forking, we shouldn't really need to bother with this. MAINPID without forking is considered to be the first process that is created when executing the olad binary.

If olad is going to adopt the creation of multiple processes, though, then we'll need to set this MAINPID accordingly - but I don't think this is the case...

Finally, should we add STOPPING when triggered via the web UI? https://github.com/OpenLightingProject/ola/blob/master/olad/OlaServer.h#L126

Hmm... this could be useful, if you do foresee problems terminating the server through the web UI. The user can be notified that we're stopping - and if we get stuck - he'll probably have a better idea why. But I think this place is a bit too low in the call chain, though, since the SIGTERM handler also calls m_ss->Terminate();. Is there another site where this message could be placed higher up?

@shenghaoyang
Copy link
Contributor Author

shenghaoyang commented Jul 14, 2018

@peternewman is there a way to add some CFLAGS when compiling the sources that need to be linked with libsystemd? I did see some CFLAG variables set, but those were for protobuf and testing.

On a side note, I could stick a systemd unit file along as well...

Changes from v1:

- Systemd support is now switched in via a flag to the configure script.
- Failures to notify systemd are logged, but only when the notification
  socket is actually passed.
- If the socket is not passed, a warning is displayed.
- Added a utility library for the sd_notify calls.
shenghaoyang and others added 2 commits July 14, 2018 17:13
- License error in both files, LGPL -> GPL
- Wrong usage of @code in Systemd.h, should be @c
Copy link
Member

@peternewman peternewman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, a few more comments


namespace ola {

int notify_systemd(int unset_environment, const char *state) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We go CamelCase for our method names.

return rtn;
}

bool notify_available() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again

if (rtn < 0) {
char buf[1024];
OLA_WARN << "Error sending notification to systemd: " <<
strerror_r(-rtn, buf, sizeof(buf));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'll only print the error code from strerror_r currently. Looking at https://linux.die.net/man/3/strerror_r it seems you should be able to just do << strerror(-rtn).

Can you also move the second << to the next line and align with the one above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I think you should add a configure check for strerror_r (see https://github.com/OpenLightingProject/ola/blob/master/configure.ac#L185 and e.g. https://github.com/OpenLightingProject/ola/blob/master/common/base/Credentials.cpp#L342 ), it's the first time it's being used in the codebase and we run on a variety of weird platforms. See also some similar fun and discussion with readdir_r #1055 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'll only print the error code from strerror_r currently. Looking at https://linux.die.net/man/3/strerror_r it seems you should be able to just do << strerror(-rtn).

I saw that OLA was compiling with GNU extensions - in that case strerror_r returns a pointer to either an impelemtation-allocated array or a pointer to the buffer that's passed in. We always have to use the returned pointer to reference the error message. I could mangle the defines a little bit and get the POSIX compatible one, but I'd like to confirm the usage of GNU extensions - I dug through the configure file and found loads of references to std=gnu++XX instead of std=c++XX.

Yeah, I'll be able to do a simple strerror() call, but I'm a little concerned that the other threads could be calling into it at the same time, because (I think) some of the plugins do spawn threads for their own use - I see a bunch of threads under htop for olad. Unless they don't ever call strerror(), then it's unsafe to call strerror() at this site.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, in which case can we just have a comment for our future selves if we switch away from the GNU extensions in future.

Well currently multiple plugins call strerror (and some do indeed spawn their own threads, at least one call is in a spawned thread), so we may have that issue already.

We should probably be moving to strerror_r throughout, but equally I wouldn't want this one usage to break the build on machines that don't have it, or by adding it to configure stop it building on those OSes.

Copy link
Contributor Author

@shenghaoyang shenghaoyang Jul 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, in which case can we just have a comment for our future selves if we switch away from the GNU extensions in future.

Sure, alright.

We should probably be moving to strerror_r throughout, but equally I wouldn't want this one usage to break the build on machines that don't have it, or by adding it to configure stop it building on those OSes.

Any modern system with systemd should have this - deliberately using strerror() feels pretty icky. Maybe we can take a page from nginx: they call strerror() in a section where it is known that no other thread can call strerror() and caches that for use.

Have a look at their implementation: https://github.com/nginx/nginx/blob/f8a9d528df92c7634088e575e5c3d63a1d4ab8ea/src/os/unix/ngx_errno.c

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 469361d

configure.ac Outdated
AC_CHECK_HEADER([systemd/sd-daemon.h], [], [have_libsystemd="no"])])

AS_IF([test "x$have_libsystemd" = xyes],
[AC_DEFINE([HAVE_LIBSYSTEMD], [1], [define if systemd support wanted])])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I'd probably say this is if systemd support present (or present and wanted).

configure.ac Outdated
# systemd support
AC_ARG_ENABLE(
[systemd],
[AS_HELP_STRING([--enable-systemd], [Enable systemd support])])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want to negate this, so it's turned on by default where available (or no-one will realise it exists/use it). See e.g. https://github.com/OpenLightingProject/ola/blob/master/configure.ac#L573-L589 and don't forget to update the AS_IF below too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eek.. an automatic dependency feels a little weird to me, unless we put it out for people to know. If we put the existence of this option in NEWS (which is probably where it will end up in) and you suspect:

no-one will realise it exists/use it

then, if we put in a warning that libsystemd is going to become an automatic dependency, I kinda also think that

no-one will realise it exists/use it

until bug reports come in regarding libsystemd not being found on non-build systems...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, we have the best of both worlds with our plugins, a literal auto mode, so by default if the bits are there, you get your plugin and if they aren't you don't, but you can also force it on (so it complains if dependencies are missing) or force it off (so it doesn't use it even if present).

You may be right about bug reports, I guess I'd tend towards the nudge theory, if we turn it on, then people who read the docs will turn it off easily, those who don't will ideally search or file a bug/ask on the mailing list and there's a workaround.

If we leave it off, we should at least enable it in our Debian build config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, alright. Let's default it to enabled. We're going to be playing devil's advocate (to some people out there) and silently promote systemd adoption...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit 83807d4

olad/Makefile.mk Outdated
@@ -42,6 +42,11 @@ ola_server_sources += olad/HttpServerActions.cpp \
ola_server_additional_libs += common/http/libolahttp.la
endif

if HAVE_LIBSYSTEMD
ola_server_sources += olad/NotifySystemd.cpp olad/Systemd.h
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally match .h and .cpp filenames.

olad/Systemd.h Outdated
/*
* @brief Notify systemd about daemon state changes.
*
* This function logs on failures to queue notifications, but only if the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SPaG "logs failures to queue" or "logs on failure to queue"

@@ -472,9 +476,15 @@ bool OlaServer::InternalNewConnection(
}

void OlaServer::ReloadPluginsInternal() {
#ifdef HAVE_LIBSYSTEMD
ola::notify_systemd(0, "RELOADING=1\nSTATUS=Reloading plugins\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the socket have gone away by the time this gets fired though? Potentially minutes or hours/days later?

My previous comment was more along the lines of if I'm trying to troubleshoot why my reload isn't getting to systemd, I'd want to see (at info/debug level) the positive return code from calling sd_notify, so I know it's the systemd end I need to look at, and not OLA, or vice versa.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the socket have gone away by the time this gets fired though? Potentially minutes or hours/days later?

Well, I'm not sure that this could even happen, unless someone removes the environment variable. The socket is actually passed as a socket address (path name or name in abstract namespace, a-la unix domain sockets), and sd_notify establishes a new socket sending data to that address, closing it on every return.

If systemd did somehow... fail, I think the whole system would be going up in flames, by then. But if it did happen and nobody is listening on the socket address, then the sendmsg() call in sd_notify() will fail and the error will be logged through OLA_WARN.

The only way for the notify_systemd call to fail silently is the removal of the NOTIFY_SOCKET environment variable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, fair enough.

shenghaoyang and others added 3 commits July 16, 2018 12:37
- Scenarios in which HAVE_LIBSYSTEMD is defined is made clearer.
- NotifySystemd.cpp and Systemd.h modified to match existing code
  style.
- NotifySystemd.cpp renamed so header and implementation filenames
  match.
- Call sites for functions declared in Systemd.h modified to use
  new names.
- Added error messages to be output whenever dependent functionality
  / files are not present, to avoid silent failures.
- Added additional check for strerror_r's presence to guard against
  ancient systems.
olad/Systemd.h Outdated
* @brief Tests whether the systemd notification socket is available.
* @return @c true if the socket is available, @c false if not.
*/
bool NotifyAvailable();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would probably be clearer if this had Systemd in it's name, perhaps SystemdNotifyAvailable for this function and then rename the other one to SystemdNotify. It's then a closer match to the sd_notify it's wrapping.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit b7afd38

- NotifySystemd() and NotifyAvailable() modified to better
  match the naming style of the functions provided by
  libsystemd.

- Call sites modified to use new names.

- Log messages emitted to inform the user about the
  presence of the notification socket has been changed
  to make it clear that systemd only provides the application
  with the address of the notification socket. Systemd does
  not have the process inherit a file descriptor to use
  when sending notifications.
- Most system functions return an integer value indicating success
  or failure, and the usage of strerror_r may seem that we are writing
  that value out to the log.

- However, olad is compiled using GNU extensions, which means that the
  return value from strerror_r is a pointer to the description
  for the particular error value passed in.

- This is a little bit unconventional, and the comment was added
  to clarify that this is only the case because of a non-standard
  GNU extension.
- This is to ensure that compilation will not be dependent
  on GNU extensions being enabled.

- A new source file has to be added, because one cannot simply
  undef _GNU_SOURCE in Systemd.cpp - it will cause the a failure
  to compile - something in the logging framework depends
  on GNU extensions.
Copy link
Member

@peternewman peternewman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments

* @param unset_environment whether to unset the notification socket environment
* variable so child processes cannot utilize it.
* @param state state block to pass to systemd.
* @return value returned from @ref sd_notify()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is actually the right Doxygen command, but I can't see a specific one either:
https://www.stack.nl/~dimitri/doxygen/manual/commands.html#cmdref

olad/Systemd.cpp Outdated
if (rtn < 0) {
char buf[1024];
OLA_WARN << "Error sending notification to systemd: ";
if (ola::Strerror_r(-rtn, buf, sizeof(buf))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this if should all be within the function, rather than repeating it multiple times in each bit of code.

Also I think because you're doing multiple OLA_WARNs and it's a macro, I believe this would log multiple lines. Whereas a function returning a string would avoid those issues.

/*
* @brief XSI-compliant version of @ref strerror_r()

* See strerror(3) for more details.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@see is supposed to link to some symbol, though, AFAIK. Not sure what doxygen will do when it tries to link to some man page...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It says "Starts a paragraph where one or more cross-references to classes, functions, methods, variables, files or URL may be specified." So an URL isn't a symbol, although I don't know how it decides if something is a URL or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I guess I could put a link to a manual page in man7.

olad/Olad.cpp Outdated
@@ -171,6 +175,18 @@ int main(int argc, char *argv[]) {
ola::NewCallback(olad->GetOlaServer(), &ola::OlaServer::ReloadPlugins));
#endif // _WIN32

#if HAVE_LIBSYSTEMD
if (ola::SystemdNotifyAvailable()) {
OLA_INFO << "Systemd notification socket address present. "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably wants to be a comma not a full stop (then adjust the case on the line below.

olad/Olad.cpp Outdated
OLA_INFO << "Systemd notification socket address present. "
<< "Sending notifications.";
} else {
OLA_WARN << "Systemd notification socket address not present. "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again

* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
*
* Strerror_r.cpp
* Definition of strerror_r that is XSI-compliant.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally slightly wary about the future maintainability of this, but I guess a pile of ifdef's to switch behaviour based on the detected version might not be any better.

Where did this come from, is there a link for credit/future reference?

Copy link
Contributor Author

@shenghaoyang shenghaoyang Jul 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The macros came from strerror(3).

@@ -0,0 +1,49 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if these should be in include/ola and common/ola for future use by other bits of the codebase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do that, but I'd have to add strerror_r to the main list of functions checked. Probably not a problem?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember, are they mandatory functions, or still optional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well AC_CHECK_FUNCS wouldn't terminate if something is not found, so it's alright.

- Strerorr_r wrapper could be useful to elements outside of olad
Copy link
Member

@peternewman peternewman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple more comments


* See strerror(3) for more details.
*/
int Strerror_r(int errnum, char* buf, size_t buflen);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this StrError_R or something, to match ola::strings::StrNCopy and others.

olad/Systemd.cpp Outdated
if (rtn < 0) {
char buf[1024];
if (ola::Strerror_r(-rtn, buf, sizeof(buf))) {
OLA_WARN << "Error sending notification to systemd: errno = " << -rtn;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've now just had to duplicate the "Error sending notification to systemd" text. I still feel the whole return a string or return "errno" and a value should be in the function, as we'll always want that functionality. Or if you want to do a direct wrap of strerror_r, we should also ship a helper function which does what we actually want each time (and then calls the wrapped strerror_r).

Otherwise we have to keep writing this code:

    char buf[1024];
    if (ola::Strerror_r(-rtn, buf, sizeof(buf))) {
      OLA_WARN << "Error sending notification to systemd: errno = " << -rtn;
    } else {
      OLA_WARN << "Error sending notification to systemd: " << buf;
    }

Whereas we should just be able to write something like this (or whatever the function gets called) and get the same output as the above:
OLA_WARN << "Error sending notification to systemd: " << ola::StrError_rHelper(-rtn);

shenghaoyang and others added 3 commits August 9, 2018 22:16
- New wrapper function returns a C++ standard library string,
  that can be easily passed to one of the logging streams
  for output, as opposed to requiring the caller to discriminate,
  based on the return value, what to pass to the logging stream.

- Added a constant that exposes the internal buffer size of the
  wrapper, in case this is required.

- Naming for the wrapper functions and their source and header
  files have been changed to reflect convention.
- In StrError_R(): Assume the happy path where the buffer
  is sufficient to hold the error message, first, instead
  of preemptively filling the buffer with the fallback
  numerical code. We rely on the fallback behaviour only
  when the buffer has been deemed to be too small.

- In StrError_R(): Properly handle negative return values
  from snprintf(), instead of casting it away without
  considering the possibility of the value being
  negative.

- Doxygen: generation of documentation produced warnings
  due to linking against strerror_r(), a symbol that
  wasn't found by doxygen because string.h wasn't included
  in StrError_R.h used to generate doc output. Including
  it wasn't necessary, so the references were dropped
  and fixed-width text is used to refer to the function
  instead.
Copy link
Member

@peternewman peternewman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more comments, I think it's very nearly there now thanks.

// of digits in the position of the most significant base 10 digit.
// (i.e. 0-2 representable in an 8-bit unsigned integer at the
// hundredth's place instead of 0-9 because an u8 is clamped to
// [0, 255])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just add a note on the end of this line to explicitly clarify that digits10 will therefore only return 2 for uint8. It was only when I read the docs that your comment made sense.

out.append("<error>");
// Ugly cast required to avoid compiler warning when comparing
// numbers of different signedness.
} else if (static_cast<unsigned int>(r) >= sizeof(errs)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would casting the sizeof to an int be better? Although there are downsides either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... that could be done, but r is known to be non-negative from this if onwards, but I guess this would be moot if we were to shift to using an ostringstream.

// - 1 additional unit of storage just in case a negative errnum is
// passed, and somehow StrError_R_XSI() failed on that.
char errs[std::numeric_limits<int>::digits10 + 3];
int r(snprintf(errs, sizeof(errs), "%d", errnum));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our style is int r = snprintf(...);

} else if (static_cast<unsigned int>(r) >= sizeof(errs)) {
out.append("<truncated>");
} else {
out.append(errs);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this have a resize too?

configure.ac Outdated
AC_MSG_WARN([disabling systemd support: can't find systemd/sd-daemon.h])])
AS_IF([test "x$ac_cv_func_strerror_r" = "xno"],
[have_libsystemd="no"
AC_MSG_WARN([disabling systemd support: can't find strerror_r])])])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WARNs mean configure will continue. It might be nice to have a final bit of logic where if "x$enable_systemd" != "xyes" and "x$have_libsystemd"=="xno" it throws an error. So if you've explicitly added --enable-systemd to your configure, it bombs out, but if you've done nothing, it just warns and carries on.

/**
* @brief XSI-compliant version of \c strerror_r()
*
* See https://linux.die.net/man/3/strerror for more details.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned, Doxygen explicitly mentions URLs with @see, so I think doing that and dropping the superflous words around it would make better use of their special styling.

*
* @param errnum error value to generate the textual description for.
* @return Textual description of the error value. If the description is
* trucnated due to an insufficient buffer length, the description will
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SPaG truncated

// Pre-allocate to prevent re-allocation.
std::string out(StrError_R_BufSize, '\0');
if (StrError_R_XSI(errnum, &(out[0]), out.size())) {
out.assign("errno = ");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming a stringstream doesn't allow access to a buffer to use it throughout this code? However what's the hit, aside from a bit of memory, with using an ostringstream in the error case, and then appending that to out, see e.g. https://github.com/OpenLightingProject/ola/blob/master/common/base/Version.cpp#L43-L47 . Which would completely avoid all the edge case handling you've had to write.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be done, just didn't feel like dragging the whole setup associated with the stream into something as trivial as representing an integer as text - really looking forward to the availability of to_string as we (hopefully) move towards C++11 in the future.

* of this buffer minus one, then the output of @ref StrError_R() will only
* include the numerical error value provided.
*/
extern const int StrError_R_BufSize;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually static const, is there a reason you need extern?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we static const here - then every compilation unit that includes this header file would have their own copy of StrError_R_BufSize. extern here just lets me allocate storage for it in that one unit and have the other units referencing this variable link to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could #define stuff.... but I'd want to avoid that route - maybe future refactoring to constexpr or inline constexpr in future?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One for @nomis52 then as to why we've gone for static const over extern const?


namespace ola {

const int StrError_R_BufSize(1024);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again our style is ...BufSize = 1024;

- Formatting, style, and spelling:
  - Fixes for wrong spelling in StrError_R.h
  - Documentation special command used to refer
    to manual page for strerror_r() instead of merely
    mentioning the URL for the manual page.
  - Style changes to initializers for variables defined
    in StrError_R.cpp, in order to match conventions.

- Refactored StrError_R:
  - Switched to using ostringstream to convert error
    value to a human-readable number, avoiding complex
    error handling at the cost of longer execution time,
    but this path should not be taken for the most part,
    so any effect should be minimal.
- Configure script will now check if systemd support has been explicitly
  enabled when it is unable to enable it.
  - Provides user feedback when his explicit choice fails, instead of
    failing silently with a warning message that is probably hidden
    in the mountain of messages the configure script generates.
@shenghaoyang
Copy link
Contributor Author

Got most of them done for now - when ola moves over to C++11 we can clean up a lot more.

// Pre-allocate to prevent re-allocation.
std::string out(StrError_R_BufSize, '\0');
if (StrError_R_XSI(errnum, &(out[0]), out.size())) {
std::ostringstream errs("errno = ");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I realised after I said this, I should just get you to use our existing function to do this ola::strings::IntToString :
http://docs.openlighting.org/ola/doc/latest/namespaceola_1_1strings.html#a1c6637c4498627e405cf8c95cef677c7

@shenghaoyang
Copy link
Contributor Author

shenghaoyang commented Aug 12, 2018 via email

@peternewman
Copy link
Member

Sorry @shenghaoyang I didn't think of it when I was commenting; TBH I'm used to more grown-up languages where you don't have to hand code such functions!

@shenghaoyang
Copy link
Contributor Author

Sorry @shenghaoyang I didn't think of it when I was commenting; TBH I'm used to more grown-up languages where you don't have to hand code such functions!

Well... C++11 is more grown up than C++03 :). There's so much that can be torn out when that's switched to.

@peternewman
Copy link
Member

I was impressed to see the + string concatenation was in C++98!

Copy link
Member

@peternewman peternewman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a couple of minor comments.

Would you mind adding an example systemd file that uses this functionality too please?

@@ -234,6 +234,9 @@ AM_CONDITIONAL(HAVE_EPOLL, test "${ax_cv_have_epoll}" = "yes")
AC_CHECK_FUNCS([kqueue])
AM_CONDITIONAL(HAVE_KQUEUE, test "${ac_cv_func_kqueue}" = "yes")

# strerror_r
AM_CONDITIONAL([HAVE_STRERROR_R], [test "x$ac_cv_func_strerror_r" = "xyes"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can drop the x's and square brackets here, looking at HAVE_KQUEUE just above, which will have had broader OS testing.

* of this buffer minus one, then the output of @ref StrError_R() will only
* include the numerical error value provided.
*/
extern const int StrError_R_BufSize;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One for @nomis52 then as to why we've gone for static const over extern const?

@shenghaoyang
Copy link
Contributor Author

Would you mind adding an example systemd file that uses this functionality too please?

Sure. I can use the unit file used for testing. What sort of systemd version are we targeting? Arch is around systemd 239, and I think some of the features I'm using might not be available in mainstream distributions - a quick look on the Ubuntu package search service shows versions 204 - 237 in use.

@peternewman
Copy link
Member

I'd probably aim at Debian Stretch, as that's where our deb package will go first, which is on 232:
https://packages.debian.org/stretch/systemd

I guess you could add the latest stuff in another file somewhere or an issue or something (or comments in the file)? Assuming the features aren't backwards compatible/fail safe.

But there are Arch packages too, so I don't know how best we offer it to people like them too?

@shenghaoyang
Copy link
Contributor Author

shenghaoyang commented Aug 20, 2018

I'd probably aim at Debian Stretch, as that's where our deb package will go first, which is on 232:
https://packages.debian.org/stretch/systemd

Okay, Debian probably has some document on packaging items with systemd unit files. I'll go take a look.

I guess you could add the latest stuff in another file somewhere or an issue or something (or comments in the file)? Assuming the features aren't backwards compatible/fail safe.

I was intending to use DynamicUser to have olad run as an unprivileged user - seems to me that 232 doesn't really support that. I'll take a look at the current launching mechanism and write a unit file around that, using some of systemd's features.

But there are Arch packages too, so I don't know how best we offer it to people like them too?

IIRC the Arch package is on the AUR... Maybe we could selectively install the unit file based on the version?

I was actually curious about the uaccess mechanisms I saw in a couple udev scripts. How does olad used this to allow itself access to devices? When I wrote the unit file for testing I had systemd add the dynamically created user to a group that had access to those devices.

On a side note, writing a unit file opens up a whole can of worms. If you really want to go full ham on the sandboxing and "container-like" options that systemd offers it's probably going to spawn a long debate. I'll just enable a few simple options and users that are concerned could probably drop in an override file and use that to add options instead.

@peternewman
Copy link
Member

Indeed it will, e.g. https://wiki.debian.org/Teams/pkg-systemd/Packaging

There's probably something in Debian Policy too.

You shouldn't need any real privileges, aside from some permissions on the devices for USB dongles etc. That sounds about right, I don't know much about Arch/AUR at all TBH.

The udev uaccess stuff came in here:
ce5d6db
From:
c679560
Which is from:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=840389

But that's about all I really know about that bit too...

I think there's enough of a can of worms between init and systemd anyway, despite different bits of systemd. I don't think the systemd file needs to be completely gold plated, just lint clean for Debian/compliant with their minimum requirements and to allow people to actually make use of the functionality you've added in without them really needing to think for themselves!

shenghaoyang and others added 2 commits August 25, 2018 19:00
Since most items are tagged using "uaccess", this is the most
easy way to run the olad daemon, without creating additional
system users.

Note that DynamicUsers can be used in future
along with a SupplementaryGroups=plugdev in order to implement
an olad service that can be spawned off the main systemd process
in charge of the whole system. Since this option is not supported
in 232, that has not been enabled here.
@shenghaoyang
Copy link
Contributor Author

shenghaoyang commented Aug 25, 2018

[RFC] I've added in a systemd unit file for a user instance of systemd. Try copying the unit file from systemd/olad.service to ~/.config/systemd/user/ and invoke:

  • systemd --user daemon-reload
  • systemd --user start olad

When you test this.

Note that if your olad executable is located in a different path, you might need to modify
ExecStart=

Do try testing these few things:

  • Starting olad: systemctl --user start olad
  • Reloading olad configuration: systemctl --user reload olad
  • Restarting olad: systemctl restart olad
  • Whether olad correctly reads the configuration files.
  • Whether olad is accessible through its network sockets.
  • Whether olad generally works.

Hopefully the options I've chosen are compatible across most systems.

The unit file for running olad in system mode should be added later. I'm inkling towards
using DynamicUser, but unfortunately that's not available in older versions.

Note that the user who is running this service MUST have a physical seat on
the system, or be in plugdev, due to the way the udev rules are setup.
(The uaccess tag is also one of the reasons I've decided to introduce a user
instance service first, since a system service would require the creation of
plugdev, which may not be feasible depending on the context. This user instance
service should work across most systems without antiquated versions of systemd.)

If anyone has problems, please post:

  • journalctl --user -eu olad

and

  • systemctl --user status olad

If everything works out we can copy the unit file during install.

Once this whole shebang has been completed the next step is to make olad socket activatable!

Shenghao


A side note regarding the user instance of systemd: this is an instance of systemd that is started up per-user in order to manage the user's services, etc. It activates default.target, which is wanted by the olad service, and hence can be used to autostart olad.

Copy link
Member

@peternewman peternewman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of comments from my vague understanding of systemd.

Does systemd allow you to depend on networks being up? I think I've been round this loop before with it, but I thought I'd ask.

@@ -0,0 +1,30 @@
[Unit]
Description=Ola Daemon (User mode)
Documentation=https://github.com/OpenLightingProject/ola
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This says you can have multiple documentation ( https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Documentation= ), so I'd also suggest:
http://docs.openlighting.org/ola/conf/
https://www.openlighting.org/ola/

Probably both listed before GitHub.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problems. Any other suggestions? I probably ought to write some kind of documentation article showing how this systemd integration is accomplished as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I can immediately think of. Yeah some docs to put on the website or wiki would be good.

I guess also the man pages:
https://docs.openlighting.org/ola/man/

Actually reading the page I linked to, it says "It is a good idea to first reference documentation that explains what the unit's purpose is, followed by how it is configured, followed by any other related documentation.", so it should probably be the website, then the man pages, then conf, then GitHub.

SystemCallArchitectures=native

# Directory specifications
ConfigurationDirectory="~/.ola/"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This varies on Debian when installed via a deb package, it says you can list multiple, but relative:
https://www.freedesktop.org/software/systemd/man/systemd.exec.html#RuntimeDirectory=

So I don't know if it's worth adding /var/lib/ola/conf, or explicitly setting the --config-dir option in olad to the config directory you're listing? Although as your description says user mode, using ~ probably makes sense.

Copy link
Contributor Author

@shenghaoyang shenghaoyang Sep 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I chose not to add that in. Since olad appears to default to the user configuration directory, it'd be safe to leave it as it is. If we're developing a unit file for a system wide daemon, then some modification would be required (as fitting distribution-specific group and user settings).

@shenghaoyang
Copy link
Contributor Author

shenghaoyang commented Sep 9, 2018

Does systemd allow you to depend on networks being up? I think I've been round this loop before with it, but I thought I'd ask.

Yes, you can, but this is a bit of a mixed bag. There are targets you can add dependencies on to wait for the network, but the loopback interface should be initialized by systemd directly before it brings up any units. Since the problems I've seen are from olad trying to listen on loopback before that is initialized (I think I've seen it in one of the issues), I don't think that's gonna be an issue.

You can take a look at some of the targets that can be added as dependencies for startup:
https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/, but I'm not sure if they apply if we simply have a dependency on the loopback interface (for RPC). If we're listening on the wildcard address (for plugins), then we shouldn't even have to worry about other network interfaces except loopback. If we're not, however, then it's another can of worms.....

EDIT: I've noticed the build to be failing quite a bit due to the placement of the files in the /systemd directory. Is there another more appropriate location to place these files?

@peternewman
Copy link
Member

Does systemd allow you to depend on networks being up? I think I've been round this loop before with it, but I thought I'd ask.

Yes, you can, but this is a bit of a mixed bag. There are targets you can add dependencies on to wait for the network, but the loopback interface should be initialized by systemd directly before it brings up any units. Since the problems I've seen are from olad trying to listen on loopback before that is initialized (I think I've seen it in one of the issues), I don't think that's gonna be an issue.

You can take a look at some of the targets that can be added as dependencies for startup:
https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/, but I'm not sure if they apply if we simply have a dependency on the loopback interface (for RPC). If we're listening on the wildcard address (for plugins), then we shouldn't even have to worry about other network interfaces except loopback. If we're not, however, then it's another can of worms.....

Some of the plugins (and indeed bits of the core), want our IP for various reasons. Firing off a SIGHUP will reload the plugins and should resolve most such issues.

EDIT: I've noticed the build to be failing quite a bit due to the placement of the files in the /systemd directory. Is there another more appropriate location to place these files?

I assume you mean this: https://travis-ci.org/OpenLightingProject/ola/jobs/427037785#L7207

You need to either add the file to a Makefile (add one in systemd folder), or add it to the ignore list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants