Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nats-js-kv service registry only knows about one instance of a service #9535

Closed
wkloucek opened this issue Jul 4, 2024 · 12 comments · Fixed by #9620
Closed

nats-js-kv service registry only knows about one instance of a service #9535

wkloucek opened this issue Jul 4, 2024 · 12 comments · Fixed by #9620

Comments

@wkloucek
Copy link
Contributor

wkloucek commented Jul 4, 2024

Describe the bug

nats-js-kv service registry only knows about one instance of a service

Steps to reproduce

  1. run https://github.com/owncloud/ocis-charts/tree/main/deployments/ocis-nats in Kubernetes
  2. look at the gateway deployment: kubectl -n ocis get deployment gateway, it'll have only one replica (=1 pod)
    NAME      READY   UP-TO-DATE   AVAILABLE   AGE
    gateway   1/1     1            1           5m32s
    
  3. look into the service-registry entry of the gateway service: kubectl exec -n ocis-nats deployments/nats-box -- nats kv get service-registry ONSXE5TJMNSS24TFM5UXG5DSPFPWG33NFZXXO3TDNRXXKZBOMFYGSLTHMF2GK53BPE====== --raw | jq .data | tr -d '"' | base64 --decode | jq .:
    {
     "name": "com.owncloud.api.gateway",
     "version": "5.0.5",
     "metadata": null,
     "endpoints": [],
     "nodes": [
       {
         "metadata": {
           "protocol": "grpc",
           "registry": "cache",
           "server": "grpc",
           "transport": "grpc"
         },
         "id": "com.owncloud.api.gateway-a476b641-1a02-44af-9a56-ad992e1a5376",
         "address": "10.244.4.205:9142"
       }
     ]

}

6. check where the IP from address originates, `kubectl get pods,svc -n ocis -o wide | grep 10.244.4.205`, acknowledge that it's the IP of the only gateway pod
7. scale the gateway service to 10 replicas: `kubectl -n ocis scale deployment/gateway --replicas=10`
8. look at the gateway deployment: `kubectl -n ocis get deployment gateway`, it'll have 10 replicas (=10 pods)

NAME READY UP-TO-DATE AVAILABLE AGE
gateway 10/10 10 10 12m

9. look into the service-registry entry of the gateway service: `kubectl exec -n ocis-nats deployments/nats-box -- nats kv get service-registry ONSXE5TJMNSS24TFM5UXG5DSPFPWG33NFZXXO3TDNRXXKZBOMFYGSLTHMF2GK53BPE====== --raw | jq .data | tr -d '"' | base64 --decode | jq .`:
```json
{
 "name": "com.owncloud.api.gateway",
 "version": "5.0.5",
 "metadata": null,
 "endpoints": [],
 "nodes": [
   {
     "metadata": {
       "protocol": "grpc",
       "registry": "cache",
       "server": "grpc",
       "transport": "grpc"
     },
     "id": "com.owncloud.api.gateway-a476b641-1a02-44af-9a56-ad992e1a5376",
     "address": "10.244.4.205:9142"
   }
 ]
}
  1. run kubectl -n ocis exec deployments/gateway -- ocis gateway version
Version: 5.0.5
Compiled: 2024-05-22 00:00:00 +0000 UTC

+---------+-------------------+---------------------------------------------------------------+
| Version |      Address      |                              Id                               |
+---------+-------------------+---------------------------------------------------------------+
| 5.0.5   | 10.244.4.205:9142 | com.owncloud.api.gateway-a476b641-1a02-44af-9a56-ad992e1a5376 |
+---------+-------------------+---------------------------------------------------------------+

Expected behavior

I'd expect to see 10 gateway services.

Actual behavior

I only see the first gateway service.

Setup

Using the ocis-nats deployment example of the oCIS Helm Chart.

oCIS version: owncloud/ocis:5.0.5

Additional context

@micbar
Copy link
Contributor

micbar commented Jul 4, 2024

Regression in 5.0.5? We had no changes on the stable branch.

@wkloucek
Copy link
Contributor Author

wkloucek commented Jul 4, 2024

Regression in 5.0.5? We had no changes on the stable branch.

I found it in 5.0.5
but we didn't do loadtests for bugfix releases from what I know, so we might not have catched this since 5.0.0 ?

@wkloucek
Copy link
Contributor Author

wkloucek commented Jul 5, 2024

5.0.0 is also already affected

EDIT: from what I remember, nats-js-kv service registry was only introduced with 5.0.0, therefore this might have been the case from the beginning

@butonic
Copy link
Member

butonic commented Jul 15, 2024

The natsjsregistry implementation just replaces the existing entry:

// Register adds a service to the registry
func (n *storeregistry) Register(s *registry.Service, _ ...registry.RegisterOption) error {
	n.lock.RLock()
	defer n.lock.RUnlock()

	if s == nil {
		return errors.New("wont store nil service")
	}
	b, err := json.Marshal(s)
	if err != nil {
		return err
	}
	return n.store.Write(&store.Record{
		Key:    s.Name,
		Value:  b,
		Expiry: n.expiry,
	})
}

The memory implementation takes into account node name and version to actually add a node.

@wkloucek
Copy link
Contributor Author

wkloucek commented Jul 15, 2024

The natsjsregistry implementation just replaces the existing entry:

Appending sounds also hard in this case, because the expiry is per key. Actually each service instance would need to add it's own key with a instance dependent suffix!? And the service discovery would work by looking at the key prefix!?

@butonic
Copy link
Member

butonic commented Jul 15, 2024

That indeed is what the memory implementation does. Pasting here for comparison:

func (m *Registry) Register(s *registry.Service, opts ...registry.RegisterOption) error {
	m.Lock()
	defer m.Unlock()
	log := m.options.Logger

	var options registry.RegisterOptions
	for _, o := range opts {
		o(&options)
	}

	r := serviceToRecord(s, options.TTL)

	if _, ok := m.records[s.Name]; !ok {
		m.records[s.Name] = make(map[string]*record)
	}

	if _, ok := m.records[s.Name][s.Version]; !ok {
		m.records[s.Name][s.Version] = r
		log.Logf(logger.DebugLevel, "Registry added new service: %s, version: %s", s.Name, s.Version)
		go m.sendEvent(&registry.Result{Action: "update", Service: s})
		return nil
	}

	addedNodes := false
	for _, n := range s.Nodes {
		if _, ok := m.records[s.Name][s.Version].Nodes[n.Id]; !ok {
			addedNodes = true
			metadata := make(map[string]string)
			for k, v := range n.Metadata {
				metadata[k] = v
				m.records[s.Name][s.Version].Nodes[n.Id] = &node{
					Node: &registry.Node{
						Id:       n.Id,
						Address:  n.Address,
						Metadata: metadata,
					},
					TTL:      options.TTL,
					LastSeen: time.Now(),
				}
			}
		}
	}

	if addedNodes {
		log.Logf(logger.DebugLevel, "Registry added new node to service: %s, version: %s", s.Name, s.Version)
		go m.sendEvent(&registry.Result{Action: "update", Service: s})
		return nil
	}

	// refresh TTL and timestamp
	for _, n := range s.Nodes {
		log.Logf(logger.DebugLevel, "Updated registration for service: %s, version: %s", s.Name, s.Version)
		m.records[s.Name][s.Version].Nodes[n.Id].TTL = options.TTL
		m.records[s.Name][s.Version].Nodes[n.Id].LastSeen = time.Now()
	}

	return nil
}

@kobergj
Copy link
Collaborator

kobergj commented Jul 15, 2024

Appending sounds also hard in this case, because the expiry is per key. Actually each service instance would need to add it's own key with a instance dependent suffix!? And the service discovery would work by looking at the key prefix!?

This is the correct way imho. GetService should List with prefix, then return a random entry.

@wkloucek
Copy link
Contributor Author

This is the correct way imho. GetService should List with prefix, then return a random entry.

Wouldn't it be all entries? Otherwise you need a roundtrip to the registry, if you call next() ?

@kobergj
Copy link
Collaborator

kobergj commented Jul 15, 2024

Wouldn't it be all entries? Otherwise you need a roundtrip to the registry, if you call next() ?

Don't get you. Shouldn't GetService return only one specific service?

@wkloucek
Copy link
Contributor Author

What I meant and probably you too:

The service client should cache the state of the registry for a particular service. It may include more than one registered service instance. Thus all registered service instances should be cached. This means we have exactly one request to the registry every seconds. There may be exceptions when no one of the service instances is reachable (early invalidation??) or the cache is empty.

@kobergj
Copy link
Collaborator

kobergj commented Jul 15, 2024

Ah, you are about caching. Sure we can add a cache into the natsjsregistry package to avoid running to nats too many times. 👍

@butonic
Copy link
Member

butonic commented Jul 15, 2024

no ... the registries are wrapped in a cache registry which uses the watch function to invalidate entries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants