
Provide a reliable lightweight monitor with notifications #45

Open
chimp1984 opened this issue Oct 18, 2020 · 7 comments

Labels
has:approval bisq.wiki/Project_management#Approval has:budget bisq.wiki/Project_management#Budgeting to:Improve Reliability

Comments

@chimp1984

chimp1984 commented Oct 18, 2020

This is a Bisq Network project. Please familiarize yourself with the project management process.

Description

A lightweight and stable monitor for seed nodes to detect if a seed node is not in the expected state. Instead of requesting the data itself (which is pretty heavy), we request an inventory list with the number of p2p data objects as well as other relevant data. We also include some performance- and load-specific data.
When a seed node is not in the expected state, the monitor sends an alert/notification to the operator.

Rationale

We have 2 monitor projects [1], [2] (as well as a ping monitor used by @wiz) for monitoring seed nodes. But neither provides the quality needed for notifications, as they produce too many false positives. Those monitor projects also request all the data, causing heavy load for the seed nodes and the monitor itself, which is probably the main reason why they are not as stable as they should be.

Criteria for delivery

Notifications for real problems are the main goal we need to achieve.

It is already implemented in a basic form. See below for the open tasks we want to add.
Current URL: http://46.101.179.224/
Source: https://github.com/chimp1984/bisq/tree/add-InventoryMonitor-module

Measures of success

Very low rate of false positives for notifications. It cannot be excluded that some alerts are caused by non-critical reasons, due to the nature of the data and uncertainties in blockchain-related behaviour. See the data type and priority descriptions below for more context.

Risks

I don't see any risk here. On the contrary, the current monitors cause heavy load and thereby risk for seed nodes. Once this project is completed we can shut down http://104.248.88.175 and maybe consider removing the p2p data requests from https://monitor.bisq.network to reduce seed node load.

Tasks

  • Rethink the file name strategy for JSON files. JSON files are currently written with the timestamp in ms as file name. A better approach for dealing with historical data would be to use a global persisted counter and use that as the file name.
  • Add checks for data deviations (get the average over all seeds per request and compare how far each individual seed is from that average; maybe use past requests as well for certain data?). Apply warning/alert levels (see the sketch after this list).
  • Add notification via Keybase for alerts. First use a new custom channel to not spam ops while developing. Once the false positive rate is low enough, point it to the ops channel.
  • Add a web app reading the JSON data and displaying recent request results.
  • Add a sub view (on top) with compressed warnings/alerts info. Should be empty most of the time (e.g. "all seeds are ok").
  • Add support for displaying historical data. Showing the warnings/alerts summary seems to be the most important part.
  • Add a button to zoom into a request cycle to see detailed data.
  • Remove the HTTP server from the Java app once it is not needed anymore.
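
A minimal sketch of the deviation check mentioned in the task list above, assuming the per-seed inventory is available as a Map<String, String> (as in the example JSON further down). The class and enum names are made up for illustration, not existing monitor code:

    import java.util.*;

    enum Level { OK, WARN, ALERT }

    class DeviationCheck {
        // inventoryBySeed: seed address -> inventory map (string keys/values as in the example JSON)
        static Map<String, Level> check(Map<String, Map<String, String>> inventoryBySeed,
                                        String key, double warnPct, double alertPct) {
            // Average value of the given key over all seeds in this request cycle
            double avg = inventoryBySeed.values().stream()
                    .mapToDouble(inv -> Double.parseDouble(inv.get(key)))
                    .average()
                    .orElse(0);
            Map<String, Level> result = new HashMap<>();
            inventoryBySeed.forEach((seed, inv) -> {
                double value = Double.parseDouble(inv.get(key));
                double deviationPct = avg == 0 ? 0 : Math.abs(value - avg) / avg * 100;
                result.put(seed, deviationPct > alertPct ? Level.ALERT
                               : deviationPct > warnPct ? Level.WARN
                               : Level.OK);
            });
            return result;
        }
    }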

Estimates

I will delay all my compensation requests until Bisq is more profitable, so from my side no budget is needed.
@jmacxx : 500 USD (#45 (comment))

Notes (following is copied over from original post)

High level concept:

Instead of requesting the data and then checking whether all seeds deliver the expected data, we add a new request message and the seeds tell us how many objects per data type they have (as well as some other data). This reduces the load from 8 MB (we had to exclude the largest data as it would have been much more otherwise) to a few kB. It also does not require the monitor to run the full Bisq code base, only a Tor node, and it only needs to understand messages which contain no domain-specific dependencies, so it is very lightweight for both the monitor and the seeds.

Goal:

The goal is to get a reliable monitor which can be used for alerting operators if a seed is not in the state it should be.
To achieve that we try to be lightweight and keep things as simple as possible. Flexibility to add new metric types is a goal as well. The UI should provide a quick overview, so that at a glance one can see whether all is ok or whether there are issues with any seed.

With the https://monitor.bisq.network/ project that goal was never met, as the false positive rate and the instability of the monitor made it impossible to use for that purpose.

This project is not aiming to compete with the feature richness and sophisticated UI of https://monitor.bisq.network. It is intended for devs and operators, not for users; though it is public and users can see it as well, it is not a goal to make it user-friendly for people who are not familiar with the context.

Current state:

Currently we write HTML data and serve it via a simple HTTP server inside the monitor app (in Java). In parallel we write JSON data for each response. The data is a hash map of string keys and string values, to stay flexible for future changes and updates. Type conversion from string to integer or long needs to be done per key type. Flexibility is preferred here over type safety.
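
A minimal sketch of what the per-key type conversion could look like, assuming the inventory is a plain Map<String, String>; the helper names are hypothetical:

    import java.util.Map;

    class InventoryParser {
        // Missing or malformed values fall back to a default instead of failing the whole request.
        static long asLong(Map<String, String> inventory, String key, long defaultValue) {
            try {
                return Long.parseLong(inventory.get(key));
            } catch (NumberFormatException | NullPointerException e) {
                return defaultValue;
            }
        }

        static double asDouble(Map<String, String> inventory, String key, double defaultValue) {
            try {
                return Double.parseDouble(inventory.get(key));
            } catch (NumberFormatException | NullPointerException e) {
                return defaultValue;
            }
        }
    }

    // Usage: long height = InventoryParser.asLong(inventory, "daoStateChainHeight", -1);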

Example json:

"requestStartTime": 1602948574936,
  "responseTime": 1602948576319,
  "inventory": {
    "blindVoteHash": "3f3b46ecd254e6d3739f8ef76ca1b2e5db92dc19",
    "BlindVotePayload": "316",
    "proposalHash": "dad7456b93944c10f93325b1f78817a92d579ee9",
    "usedMemory": "890",
    "sentMessagesPerSec": "2.24",
    "TempProposalPayload": "66",
    "numConnections": "30",
    "MailboxStoragePayload": "585",
    "AccountAgeWitness": "64253",
    "jvmStartTime": "1602932539817",
    "TradeStatistics3": "76325",
    "receivedMessagesPerSec": "12.86",
    "numBsqBlocks": "81436",
    "daoStateHash": "3fbc3417575aa125c191d69d4ee00b25910d44a2",
    "RefundAgent": "1",
    "Filter": "2",
    "sentData": "639.861 MB",
    "ProposalPayload": "514",
    "receivedData": "939.049 MB",
    "Mediator": "3",
    "Alert": "1",
    "OfferPayload": "437",
    "SignedWitness": "4588",
    "daoStateChainHeight": "653182"

Currently there are 7 seed nodes updated to provide this data, and we request it every 5 minutes.

Priorities per data types

Prio 1:

blindVoteHash and proposalHash need to be the same for all seed nodes at the blocks when those get set (I need to look up when that is, and it would be good to add those blocks to the hash map).
daoStateHash needs to be the same for all seeds at the same block. It changes with each block.
If any of those data do not match, it's a severe failure and the operator needs to be alerted (see the sketch below).

Deviation of numBsqBlocks and daoStateChainHeight must be in a low range. It is super rare that > 3 blocks are created in a very short time, so I would suggest treating a deviation of > 3 blocks as an alert. It could still be a valid case, but a lookup in a block explorer will resolve that for the ops.
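
A minimal sketch of the daoStateHash consistency check described above, grouping seeds by the block height they report and flagging heights where the hashes differ; the class and method names are hypothetical:

    import java.util.*;
    import java.util.stream.Collectors;

    class DaoStateHashCheck {
        // seedToInventory: seed address -> inventory map
        static List<String> findConflictingHeights(Map<String, Map<String, String>> seedToInventory) {
            Map<String, Set<String>> hashesByHeight = new HashMap<>();
            seedToInventory.values().forEach(inv ->
                    hashesByHeight
                            .computeIfAbsent(inv.get("daoStateChainHeight"), k -> new HashSet<>())
                            .add(inv.get("daoStateHash")));
            // Heights where seeds at the same block report different hashes -> severe failure / alert
            return hashesByHeight.entrySet().stream()
                    .filter(e -> e.getValue().size() > 1)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }
    }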

Prio 2:

Mediator and RefundAgent must not be 0 (though RefundAgent theoretically could be). If one of them is 0, it's a severe error.

Prio 3:

Mediator and RefundAgent should be the same most of the time. Only when a mediator revokes or gets added might there be a difference, as some seeds might get it earlier than others. Those events are very rare.
The same is true for Filter and Alert. They should be the same most of the time; only when new ones get published is a deviation expected, and even then it is rather rare.

Prio 4:

ProposalPayload, BlindVotePayload and TempProposalPayload should be the same most of the time. Here it's a bit more complex, as after a certain block those data cannot be added anymore in a way that is valid for the DAO, though technically they can still be added. I would suggest accepting a low level of deviation (e.g. 10%), but showing some color if the data is not the same, as identical data is the 95% case.

Prio 5:

SignedWitness, AccountAgeWitness, MailboxStoragePayload, TradeStatistics3: Those get added all the time but at a low pace. Deviations of < 10% are normal; > 30% should be considered an error.

Prio 6:

OfferPayload gets added and removed all the time. If a big market maker goes online/offline, it is expected that 100 offers or more differ. We have about 300-550 offers. As far as I have observed, it is rare that the deviation is > 100. I would suggest: deviation < 10% is normal; 10-30% is a light warning but can still be a valid case; 30-50% should give a severe warning but still no alert; > 50% should send an alert to the op (see the sketch below).
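
A minimal sketch of the suggested OfferPayload thresholds above, with hypothetical names:

    enum OfferSeverity { OK, LIGHT_WARNING, SEVERE_WARNING, ALERT }

    class OfferDeviation {
        // < 10% normal, 10-30% light warning, 30-50% severe warning, > 50% alert to the op
        static OfferSeverity classify(double seedOffers, double averageOffers) {
            if (averageOffers == 0) return OfferSeverity.OK;
            double deviationPct = Math.abs(seedOffers - averageOffers) / averageOffers * 100;
            if (deviationPct > 50) return OfferSeverity.ALERT;
            if (deviationPct > 30) return OfferSeverity.SEVERE_WARNING;
            if (deviationPct > 10) return OfferSeverity.LIGHT_WARNING;
            return OfferSeverity.OK;
        }
    }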

Others:

  • jvmStartTime: seeds restart once a day; if the uptime is > 1 day and 2 hours, send an alert (see the sketch after this list).
  • usedMemory: So far 500 MB - 1 GB seems to be normal. If > 1 GB, send a warning to the op.

  • numConnections: depends on maxConnections set by the op (we should probably add the maxConnections param; currently they use 30 but they could use different values per seed). If numConnections > 2 x maxConnections, send an alert.

  • sentMessagesPerSec, receivedMessagesPerSec, sentData, receivedData: Let's observe normal values for a bit and then add alerts if the deviation gets larger than usual. Also, recent changes in the P2P network should lower receivedMessagesPerSec; it still needs to get as low as sentMessagesPerSec, which might take a while until most users have updated.
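
A minimal sketch of the uptime, memory and connection checks from the list above (thresholds taken from the list; class and method names are hypothetical):

    class SeedHealthChecks {
        static boolean uptimeTooLong(long jvmStartTimeMs, long nowMs) {
            long maxUptimeMs = 26L * 60 * 60 * 1000;       // 1 day + 2 hours
            return nowMs - jvmStartTimeMs > maxUptimeMs;   // seeds are expected to restart once a day
        }

        static boolean memoryTooHigh(long usedMemoryMb) {
            return usedMemoryMb > 1024;                    // > 1 GB -> warning to op
        }

        static boolean tooManyConnections(int numConnections, int maxConnections) {
            return numConnections > 2 * maxConnections;    // alert case
        }
    }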

[1] https://monitor.bisq.network/
[2] http://104.248.88.175/

@chimp1984 chimp1984 added a:proposal bisq.wiki/Project_management#Proposal needs:triage bisq.wiki/Project_management#Triage labels Oct 18, 2020
@chimp1984
Author

Replaces bisq-network/bisq#4665

@wiz wiz added has:approval bisq.wiki/Project_management#Approval has:budget bisq.wiki/Project_management#Budgeting and removed a:proposal bisq.wiki/Project_management#Proposal needs:triage bisq.wiki/Project_management#Triage labels Oct 18, 2020
@ghost

ghost commented Oct 18, 2020

Rough estimate $500.

@wiz wiz assigned ghost Oct 18, 2020
@wiz
Member

wiz commented Oct 18, 2020

As per the Bisq Project management guidelines, this project is approved for ops budget allocation and has been assigned to @jmacxx.

@ghost

ghost commented Oct 25, 2020

FYI this is what I am working on. Please let me know if anything is wrong.

For Wiz' alerting infrastructure:

  1. Provide a JSON file that contains details only of CURRENT ACTIVE ALERTS. If there are no alerts it should be empty. Format should include timestamp, seednode id, field triggering the alert, value of field, rule name that triggered the alert.
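
A minimal sketch of what writing that active-alerts file could look like, assuming Gson is used for serialization; the Alert class, the writer class and the file path are placeholders, and only the listed fields (timestamp, seednode id, triggering field, value, rule name) come from the description above:

    import com.google.gson.GsonBuilder;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    class ActiveAlertsWriter {
        static class Alert {
            long timestamp;
            String seedNodeId;
            String field;      // e.g. "daoStateChainHeight"
            String value;      // value that triggered the rule
            String ruleName;   // e.g. "blockHeightDeviation"
        }

        static void write(List<Alert> activeAlerts, String path) throws java.io.IOException {
            // An empty list serializes to "[]", matching "if there are no alerts it should be empty"
            String json = new GsonBuilder().setPrettyPrinting().create().toJson(activeAlerts);
            Files.write(Paths.get(path), json.getBytes());
        }
    }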

For use by a web client UI:

  1. Provide JSON files of the latest raw data queried from seed nodes. This is already available as a WIP at http://46.101.179.224:8082/seednode_json and is essentially the same as described in "Example json" for all seednodes.
  2. Provide a JSON file of data flagged by the serverside Java analytics: ERROR, WARN, or INFO. Should include the tag names (same as in [2]), alert level and timestamp. This will be used by the UI to highlight relevant data elements.
  3. Provide an aggregated JSON view of the current seednode data, i.e. values averaged across all seednodes. I originally implemented this in the prototype GUI but it would make more sense to do it at the server and leave the GUI just to display, no logic. Identical tag names as [2].
  4. Provide a JSON file listing alerts that happened in the past (with timestamps). A historical log for display in GUI. Same format as [1] but historical.

Alerts

The highest priority according to Wiz is that there should be no instances of false alerts. To achieve that we need to define upfront exactly what the alerting criteria are (clarify the list posted in the OP). Here follows a proposed list based on the original spec:

Hopefully @wiz will have some suggestions as to which in this list are important, and which are not necessarily so important.

  • A seednode is unreachable for > 3 minutes
  • A seednode has uptime of > 26 hours
  • A seednode's memory usage is over 1.5 GB
  • A seednode's numConnections > 2x the seednode's maxConnections
  • A seednode's Offer count is < 50% of the 24 hour moving average
  • A seednode's Mediator count < 1
  • A seednode's RefundAgent count < 1
  • Across all seednodes, if there is any deviation of +/- 3 in numBsqBlocks
  • Across all seednodes, if there is any deviation of +/- 3 in daoStateChainHeight
  • Across all seednodes, if there is any deviation > 30% in SignedWitness
  • Across all seednodes, if there is any deviation > 30% in AccountAgeWitness
  • Across all seednodes, if there is any deviation > 30% in MailboxStoragePayload
  • Across all seednodes, if there is any deviation > 30% in TradeStatistics3
  • Across all seednodes, daoStateHash is not consistent among seeds reporting the same daoStateChainHeight. This seems a bit tricky and may need some thought. [see below] difficult
  • Across all seednodes, Mediator count should be consistent [see below] difficult
  • Across all seednodes, RefundAgent count should be consistent [see below] difficult
  • Across all seednodes, Filter count should be consistent [see below] difficult
  • Across all seednodes, Alert count should be consistent [see below] difficult

[NB] difficult due to timing differences between when seednodes get updated/queried. The operations are not atomic, so values can differ for a while. These cases need some thought on how to implement them without generating false alerts.

@chimp1984
Author

@jmacxx
While trying to fine-tune http://46.101.179.224/ I saw that it is not trivial to avoid false positives.

One problem is that when, for instance, a new mediator gets published, some nodes might get it earlier than others if the request happens just around the publishing time. But we cannot loosen the criteria, as normally a diff of 1 is an alert case.

One option is to do repeated requests if an alert/warning is triggered. But that needs to be done for all nodes, as it can be that the node which caused the alert has the fresh data and the others are behind (e.g. a new block arrived), so only if we repeat all requests will we find out whether it was a false positive.

Maybe the easiest way to find out whether that really helps is to decrease the request interval and apply more tolerant thresholds to the number of alerts/warnings. But that needs to be done per parameter, as for instance a new mediator is a very rare event, while a new block happens every 10 minutes, so we can expect more alerts for blocks. A rough sketch of the repeat-request idea follows below.
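
A very rough sketch of the repeat-request idea: an alert is only forwarded if it is reproduced by a second full request cycle against all seeds. All names here are hypothetical, and this is only one possible way to filter timing-related false positives:

    import java.util.List;
    import java.util.function.Supplier;
    import java.util.stream.Collectors;

    class AlertConfirmer {
        // requestAllSeeds re-runs a full request cycle against every seed and returns the alerts it produced
        static List<String> confirm(List<String> initialAlerts, Supplier<List<String>> requestAllSeeds) {
            if (initialAlerts.isEmpty())
                return initialAlerts;
            List<String> secondRun = requestAllSeeds.get();
            // Only alerts reproduced in the repeated cycle are forwarded; timing-related
            // false positives (e.g. a block arriving between requests) drop out.
            return initialAlerts.stream()
                    .filter(secondRun::contains)
                    .collect(Collectors.toList());
        }
    }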

@chimp1984
Author

After more thought about the problem, I think we need a secondary data set as a kind of overlay to interpret an alert/warning correctly. E.g. if we know the times when new blocks were published, we can filter out all the alerts/warnings which happened around those events. For other data like the number of offers it is more difficult, as we don't have a more reliable primary source than the seed nodes themselves. We could still run the monitor as a full p2p node, thus receiving the data from the network independently and using that as a reference. It can only be applied to past data, as for the most recent data we still don't know whether in the next moments an event happens which would make the alert a false positive (e.g. the number of offers spiked from 350 to 500 in a few minutes). Not trivial how to deal with all that ;-(

@ghost

ghost commented Nov 15, 2020

Progress:
