csr_matrices #122

Open
kwchurch opened this issue Jun 28, 2022 · 9 comments · Fixed by #124
Labels
enhancement New feature or request

Comments


kwchurch commented Jun 28, 2022

I have a large csr_matrix in npz format. I'd like to use it as input as is, but it doesn't have an IDs field.

I added this to graph.py (but it doesn't work):

if 'IDs' in raw:
    self.set_node_ids(raw["IDs"].tolist())
else:
    # added by kwc: no IDs field, so fall back to implicit IDs 0..N-1
    self.set_node_ids(np.arange(raw["shape"][0]).tolist())

I created edg2npz.py with this:

import numpy as np
import scipy.sparse
import sys

dtype=bool
if sys.argv[2] == "int":
    dtype=int

X=[]
Y=[]

# Read whitespace-separated edges (source target ...) from stdin.
for line in sys.stdin:
    fields = line.rstrip().split()
    if len(fields) >= 2:
        x, y = fields[0:2]
        X.append(int(x))
        Y.append(int(y))

X = np.array(X, dtype=np.int32)
Y = np.array(Y, dtype=np.int32)
N = 1+max(np.max(X), np.max(Y))
V = np.ones(len(X), dtype=bool)

M = scipy.sparse.csr_matrix((V, (X, Y)), dtype=dtype, shape=(N,N))

scipy.sparse.save_npz(sys.argv[1], M)

I called it with:

python edg2npz.py demo/karate.bool.npz bool < demo/karate.edg 
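To see what such a file actually contains, the following sketch builds a tiny boolean CSR matrix the same way, saves it with scipy.sparse.save_npz, and lists the stored fields and their dtypes. The toy graph and the file name are made up for illustration; the point is only that SciPy writes no IDs field and uses its own index dtypes.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Toy 3-node directed cycle (illustrative data, not the karate graph).
rows = np.array([0, 1, 2], dtype=np.int32)
cols = np.array([1, 2, 0], dtype=np.int32)
vals = np.ones(len(rows), dtype=bool)
M = scipy.sparse.csr_matrix((vals, (rows, cols)), shape=(3, 3))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy.bool.npz")  # hypothetical file name
    scipy.sparse.save_npz(path, M)
    with np.load(path) as raw:
        keys = sorted(raw.files)
        dtypes = {k: raw[k].dtype for k in ("data", "indices", "indptr")}

print(keys)    # fields written by save_npz
print(dtypes)  # dtypes a loader would have to deal with
```

Any loader that expects an IDs field, or particular dtypes for data, indices, and indptr, has to handle these differences itself.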

Unfortunately, I can't use this kind of csr_matrix...

I can write out my matrix to text and then run pecanpy on that, but my matrix is very large and it will take a long time to write it out and read it back. My matrix has N = 300M nodes and E=2B nonzero edges.

 pecanpy --input demo/karate.bool.npz --output demo/karate.int.emb --mode SparseOTF
init pecanpy: p = 1, q = 1, workers = 1, verbose = False, extend = False, gamma = 0, random_state = None
WARNING: when p = 1 and q = 1 with unweighted graph, highly recommend using the FirstOrderUnweighted over SparseOTF. The runtime could be improved greatly with improved  memory usage.
Took 00:00:00.02 to load Graph
Took 00:00:00.00 to pre-compute transition probabilities
Traceback (most recent call last):
  File "/home/k.church/venv/gft/bin/pecanpy", line 8, in <module>
    sys.exit(main())
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/cli.py", line 333, in main
    walks = simulate_walks(args, g)
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/wrappers.py", line 18, in wrapper
    result = func(*args, **kwargs)
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/cli.py", line 320, in simulate_walks
    return g.simulate_walks(args.num_walks, args.walk_length)
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/pecanpy.py", line 153, in simulate_walks
    walk_idx_mat = self._random_walks(
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/numba/core/dispatcher.py", line 468, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/numba/core/dispatcher.py", line 409, in error_rewrite
    raise e.with_traceback(None)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<built-in function itruediv>) found for signature:

 >>> itruediv(array(bool, 1d, C), Literal[int](1))

There are 6 candidate implementations:

  • Of which 2 did not match due to:
    Overload in function 'NumpyRulesInplaceArrayOperator.generic': File: numba/core/typing/npydecl.py: Line 244.
    With argument(s): '(array(bool, 1d, C), int64)':
    Rejected as the implementation raised a specific error:
    AttributeError: 'NoneType' object has no attribute 'args'
    raised from /home/k.church/venv/gft/lib/python3.8/site-packages/numba/core/typing/npydecl.py:255
  • Of which 2 did not match due to:
    Operator Overload in function 'itruediv': File: unknown: Line unknown.
    With argument(s): '(array(bool, 1d, C), int64)':
    No match for registered cases:
    • (int64, int64) -> float64
    • (int64, uint64) -> float64
    • (uint64, int64) -> float64
    • (uint64, uint64) -> float64
    • (float32, float32) -> float32
    • (float64, float64) -> float64
    • (complex64, complex64) -> complex64
    • (complex128, complex128) -> complex128
  • Of which 2 did not match due to:
    Overload of function 'itruediv': File: numba/core/typing/npdatetime.py: Line 94.
    With argument(s): '(array(bool, 1d, C), int64)':
    No match.
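The failing signature above is an in-place true division of a bool array by an integer. As a rough analogy in plain NumPy (not Numba, which rejects the operation at compile time rather than at run time), the same operation fails on a bool array and works once the data are cast to a float type. The array below is made up for illustration.

```python
import numpy as np

bool_weights = np.ones(4, dtype=bool)
try:
    # True division produces floats, which cannot be stored back into a bool array.
    bool_weights /= 1
    inplace_ok = True
except TypeError as err:
    inplace_ok = False
    print("bool array:", type(err).__name__)

# After casting to float32, in-place division (e.g. normalizing weights) works.
float_weights = bool_weights.astype(np.float32)
float_weights /= float_weights.sum()
print(float_weights)
```

This suggests the problem lies with the dtype of the data field rather than with the matrix structure itself.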
@RemyLau RemyLau added the enhancement New feature or request label Jun 29, 2022
Contributor

RemyLau commented Jun 29, 2022

Hi @kwchurch, thank you for the detailed dev log! I slightly edited the formatting to improve readability. At first glance, this looks to me like an incompatible dtype issue. More specifically, the csr used by PecanPy uses uint32 for both the indices and indptr fields, rather than int32 as used by scipy.sparse.csr. Similarly, PecanPy uses float32 instead of float64 for the data field in the csr object.

I think the most straightforward way to resolve the type issue is to enforce the desired types (i.e., float32 for data; uint32 for indices and indptr) at loading time:

self.data = raw["data"]
if self.data is None:
    raise ValueError("Adjacency matrix data not found.")
elif not weighted:
    self.data[:] = 1.0  # overwrite edge weights with constant
self.indptr = raw["indptr"]
self.indices = raw["indices"]

I will first try to reproduce the error here using the example script you provided, and then see if my proposed solution actually fixes the issue.

As we also discussed, I will add the option of implicitly assigning node IDs if they are not found in the .csr.npz file. It will require a "soft confirmation" from the user that the implicit assignment is desired: a warning message about the implicit assignment is printed unless a specific flag (e.g., --implicit_node_ids) is set.
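A minimal sketch of that "soft confirmation" behavior might look as follows. The helper name resolve_node_ids and its implicit_ids parameter are hypothetical and only illustrate the idea; this is not PecanPy's actual implementation.

```python
import warnings

import numpy as np


def resolve_node_ids(raw, implicit_ids=False):
    """Return node IDs from a loaded npz mapping (hypothetical helper)."""
    if "IDs" in raw:
        return raw["IDs"].tolist()
    if not implicit_ids:
        # Soft confirmation: warn unless the caller opted in explicitly.
        warnings.warn("No 'IDs' field found; assigning node IDs 0..N-1 implicitly.")
    num_nodes = int(raw["shape"][0])
    return np.arange(num_nodes).tolist()


# A plain dict stands in for the object returned by np.load.
toy_raw = {"shape": np.array([3, 3])}
print(resolve_node_ids(toy_raw, implicit_ids=True))
```

Called without implicit_ids=True, the same function returns the same IDs but emits a warning first.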

@RemyLau RemyLau linked a pull request Jun 29, 2022 that will close this issue
Contributor

RemyLau commented Jun 29, 2022

Hi @kwchurch, I've created a new branch (see #124) implementing my suggestions above (explicit dtype setting and implicit node IDs setting). The scipy csr karate test case works fine on my end.

  • I will do more testing and write a unit test for this later today or tomorrow.

In the meantime, if you would like to give the new changes a try and let me know if this resolves your issue, that would be great. You can run it as before using

pecanpy --input demo/karate.bool.npz --output demo/karate.int.emb --mode SparseOTF

which will warn you about the implicit node IDs setting. To suppress that, you can set the --implicit_ids flag:

pecanpy --input demo/karate.bool.npz --output demo/karate.int.emb --mode SparseOTF --implicit_ids

Author

kwchurch commented Jun 29, 2022 via email

Contributor

RemyLau commented Jun 29, 2022

@kwchurch yes, it is doing that now:

self.indptr = raw["indptr"].astype(np.uint32)
self.indices = raw["indices"].astype(np.uint32)
self.data = raw["data"].astype(np.float32)
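For very large matrices it may be worth checking that the index values fit into uint32 before casting, since astype does not check values by default. The sketch below applies the same three casts to a toy SciPy matrix; the helper to_uint32 is hypothetical and not part of PecanPy.

```python
import numpy as np
import scipy.sparse

UINT32_MAX = int(np.iinfo(np.uint32).max)  # 4294967295


def to_uint32(arr, name):
    """Cast an index array to uint32 after checking that every value fits."""
    if arr.size and (int(arr.min()) < 0 or int(arr.max()) > UINT32_MAX):
        raise ValueError(f"{name} has values outside the uint32 range")
    # copy=False only avoids a copy when the dtype already matches.
    return arr.astype(np.uint32, copy=False)


# Toy 3-node directed cycle stored as a boolean CSR matrix.
M = scipy.sparse.csr_matrix(
    (np.ones(3, dtype=bool), (np.array([0, 1, 2]), np.array([1, 2, 0]))),
    shape=(3, 3),
)
indptr = to_uint32(M.indptr, "indptr")
indices = to_uint32(M.indices, "indices")
data = M.data.astype(np.float32, copy=False)
print(indptr.dtype, indices.dtype, data.dtype)
```

Note that casting from int32 to uint32 or from bool to float32 creates a new array. With about 2B nonzero edges, each 4-byte array of that length takes roughly 8 GB, so the conversion temporarily needs extra memory.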

Author

kwchurch commented Jun 29, 2022 via email

Author

kwchurch commented Jun 29, 2022 via email

Contributor

RemyLau commented Jun 29, 2022

@kwchurch it is ready to be tried out, but it is not on the main branch. You'll need to check out the scipy-csr branch, where you will find the new changes.

@RemyLau RemyLau closed this as completed Jun 29, 2022
@RemyLau RemyLau reopened this Jul 1, 2022
Contributor

RemyLau commented Jul 1, 2022

Hi @kwchurch, I have completed some more testing and merged the new feature (implicit IDs) back to the main branch (see 2d58132). Let me know if you get a chance to test and see if this works in your case.

Author

kwchurch commented Oct 11, 2022 via email
