Performance improvements (#12)
* lines iter added

* rearranged more libs

* line iter

* fixed tiny bug in line processor and UTs

* support is-messy cli arg which falls back to bytes iter.

* python cli updated to support messy json

* type fix, benchmark has code to run async

* updated readme

* windows executable

* updated version and regen'd docs

* simplified the code

---------

Co-authored-by: Salaah Amin <[email protected]>
Salaah01 and Salaah01 committed Jun 28, 2023
1 parent 18f7645 commit c052d2b
Showing 79 changed files with 3,947 additions and 241 deletions.
3 changes: 1 addition & 2 deletions .gitignore
Expand Up @@ -210,5 +210,4 @@ Cargo.lock
# End of https://www.toptal.com/developers/gitignore/api/rust,python,visualstudiocode

tmp.*
sample_data/324mb_sample.json
sample_data/32mb_sample.json
sample_data/*_sample.json
44 changes: 29 additions & 15 deletions README.md
Expand Up @@ -9,20 +9,19 @@
- [Why not Just Use Python's `json` Library?](#why-not-just-use-pythons-json-library)
- [Functionality](#functionality)
- [Benchmarks](#benchmarks)
- [32MB JSON file](#32mb-json-file)
- [324MB JSON file](#324mb-json-file)
- [Installation](#installation)
- [Usage](#usage)
- [Iterating over a JSON file](#iterating-over-a-json-file)
- [Iterating over a JSON file asynchronously](#iterating-over-a-json-file-asynchronously)
- [Poorly Formatted JSON](#poorly-formatted-json)
- [Under the Hood](#under-the-hood)


## Introduction

JSON Lineage is a tool that allows you to convert JSON to JSONL (JSON Lines) format as well as iteratively parse JSON where the JSON contains a list of objects.

The underlying program is written in Rust and is built to feed one JSON object at a time to the parser. This allows for the parsing of very large JSON files that would otherwise not fit into memory.
The underlying program is written in Rust and is built to feed one JSON object at a time to the parser. This allows for the parsing of very large JSON files that would otherwise not fit into memory. In addition to saving memory, this program is capable of parsing JSON files faster than the built-in Python JSON parser as the file size increases.

Additionally, this project contains adapters for easy integration into other programming languages. Currently, there is only a Python adapter, but more are planned.

Expand Down Expand Up @@ -60,21 +59,20 @@ For information on how to use the CLI, run: `python -m json_lineage --help`.

The following benchmarks were run to compare the performance of Python's JSON parser and JSON Lineage. These results should help you decide when Python's JSON parser is sufficient and when you should use JSON Lineage.

##### 32MB JSON file
In a nutshell, when working with very small JSON files, Python's JSON parser is faster. However, as the size of the JSON file increases, JSON Lineage becomes faster. Additionally, JSON Lineage uses significantly less memory than Python's JSON parser.

| Library | Time (s) | Memory (MB) |
| -------------- | -------- | ----------- |
| `json` | 0.166 | 158.99 |
| `json_lineage` | 1.01 | 0.52 |
| Size (MB) | `json` Time (s) | `json_lineage` Time (s) | `json` Memory (MB) | `json_lineage` Memory (MB) |
| --------- | --------------- | ----------------------- | ------------------ | -------------------------- |
| 0.05 | 0.0002 | 0.0010 | 0.25 | 0.25 |
| 0.1 | 0.0004 | 0.0009 | 0.53 | 0.25 |
| 5 | 0.02 | 0.01 | 25.47 | 0.52 |
| 32 | 0.166 | 1.10 | 158.99 | 0.77 |
| 324 | 1.66 | 0.99 | 1580.46 | 0.92 |

##### 324MB JSON file

| Library | Time (s) | Memory (MB) |
| -------------- | -------- | ----------- |
| `json` | 1.66 | 1580.46 |
| `json_lineage` | 10.06 | 0.71 |

![Benchmark of difference in time as file size grows](/docs/benchmark/benchmark-time-diff.jpg)

![Benchmark of difference in memory as file size grows](/docs/benchmark/benchmark-memory-diff.jpg)

#### Installation

Expand All @@ -91,7 +89,6 @@ from json_lineage import load

jsonl_iter = load("path/to/file.json")


for obj in jsonl_iter:
    do_something(obj)
```
Expand Down Expand Up @@ -122,6 +119,23 @@ async def main():
asyncio.run(main())
```

##### Poorly Formatted JSON

By default, the program assumes that the JSON file is well formatted. If it is not, pass `messy=True` to either the sync or async load function:

```python
from json_lineage import load

jsonl_iter = load("path/to/file.json", messy=True)


for obj in jsonl_iter:
    do_something(obj)
```

This produces the same output as the default mode, but the file is parsed differently: the `messy` option is slower, but it can parse JSON files that are not well formatted.

If you are using the CLI, then you can use the `--messy` flag to achieve the same result.
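
The difference between the two modes can be illustrated with a small pure-Python sketch. This is not the Rust implementation; it only assumes, based on the commit message ("lines iter" versus "bytes iter") and the CLI help text, that the default mode relies on one object per line while the messy mode scans the input regardless of layout. The function names `iter_by_line` and `iter_by_scan` are hypothetical.

```python
import json


def iter_by_line(text):
    """Fast path: assumes every array element sits on its own line."""
    for line in text.splitlines():
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # skip the array brackets and blank lines
        yield json.loads(line)


def iter_by_scan(text):
    """Slower path: walks the text position by position, so layout is irrelevant."""
    decoder = json.JSONDecoder()
    idx = text.index("[") + 1
    while True:
        # Skip whitespace and the commas between elements.
        while idx < len(text) and text[idx] in " \t\r\n,":
            idx += 1
        if idx >= len(text) or text[idx] == "]":
            return
        # raw_decode parses one value starting at idx and reports where it ended.
        obj, idx = decoder.raw_decode(text, idx)
        yield obj


tidy = '[\n{"a": 1},\n{"a": 2}\n]'
messy = '[{"a": 1}, {"a": 2}]'
```

On `tidy`, both functions yield the two objects. On `messy`, only `iter_by_scan` does; `iter_by_line` sees a single line and yields the whole list as one item, which is the kind of input the `messy` option is meant to handle.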

## Under the Hood

Expand Down
14 changes: 12 additions & 2 deletions adapters/python/benchmark.py
Expand Up @@ -5,19 +5,20 @@
file.
"""

import asyncio
import json
import os
import resource
import timeit

from json_lineage import load
from json_lineage import aload, load

FP = os.path.join(
    os.path.dirname(os.path.realpath(__file__)),
    "..",
    "..",
    "sample_data",
    "32mb_sample.json",
    "50kb_sample.json",
)


Expand All @@ -31,6 +32,15 @@ def using_python_lib():
i


async def using_rust_lib_async():
    async for i in aload(FP):
        i


def async_main():
    asyncio.run(using_rust_lib_async())


def benchmark(fn):
    print(f"{'BENCHMARKING:'.ljust(15)}{fn.__name__}")
    start_mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
Expand Down
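
The visible part of `benchmark` reads peak memory through `resource.getrusage`. The rest of the function is not shown in this diff, so the following is only a minimal sketch of how a time-and-memory measurement along these lines might look. `measure` and `build_list` are hypothetical names that do not appear in `benchmark.py`. Note that `ru_maxrss` is a high-water mark reported in kilobytes on Linux and in bytes on macOS, and that the `resource` module is not available on Windows.

```python
import resource
import time


def measure(fn):
    """Run fn once and return (elapsed seconds, growth in peak RSS)."""
    start_mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    end_mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss never decreases during a process, so the difference is the
    # amount by which fn pushed the peak upwards (0 if it stayed below it).
    return elapsed, end_mem - start_mem


def build_list():
    """Hypothetical workload standing in for a parser run."""
    return [str(i) for i in range(100_000)]
```

Calling `measure(build_list)` returns the two figures as a tuple. Because the memory figure is a peak for the whole process, each candidate is best measured in a fresh process so that an earlier run does not mask a later one.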
Binary file modified adapters/python/json_lineage/bin/jsonl_converter
Binary file not shown.
Binary file modified adapters/python/json_lineage/bin/jsonl_converter.exe
Binary file not shown.
16 changes: 12 additions & 4 deletions adapters/python/json_lineage/bin_interface.py
Expand Up @@ -25,9 +25,10 @@ def get_bin_path() -> str:
class BaseBinaryReader:
    """Base class for the `BinaryReader` and `AsyncBinaryReader` classes."""

    def __init__(self, filepath: str):
    def __init__(self, filepath: str, messy: bool = False):
        self.bin_path = get_bin_path()
        self.file_path = filepath
        self.messy = messy
        self._proc: _t.Optional[
            _t.Union[subprocess.Popen, asyncio.subprocess.Process]
        ] = None
Expand All @@ -38,6 +39,13 @@ def __repr__(self) -> str:
            f"file_path={self.file_path}>"
        )

    def bin_args(self) -> _t.List[str]:
        """Return the arguments to pass to the binary."""
        bin_args = [self.bin_path, self.file_path]
        if self.messy:
            bin_args.append("--messy")
        return bin_args

    def kill_subprocess_proc(self) -> None:
        """Kill the subprocess process."""
        if self._proc is None:
Expand All @@ -64,7 +72,7 @@ def __iter__(self):
    def popen(self) -> subprocess.Popen:
        """Run the binary and return a Popen object."""
        self._proc = subprocess.Popen(
            [self.bin_path, self.file_path],
            self.bin_args(),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
Expand Down Expand Up @@ -111,9 +119,9 @@ class AsyncBinaryReader(BaseBinaryReader):

    async def popen(self) -> asyncio.subprocess.Process:
        """Run the binary and return a Popen object."""

        self._proc = await asyncio.create_subprocess_exec(
            self.bin_path,
            self.file_path,
            *self.bin_args(),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
Expand Down
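
The change above routes both readers through one argument list: `subprocess.Popen` receives the list itself, while `asyncio.create_subprocess_exec` receives it unpacked. A rough, self-contained sketch of that pattern follows. It does not call the real binary; `ECHO` is a hypothetical stand-in that uses the Python interpreter to echo its arguments, and `build_args`, `run_sync` and `run_async` are illustrative names only.

```python
import asyncio
import subprocess
import sys

# Stand-in for the real binary: a one-liner that prints the arguments it received.
ECHO = [sys.executable, "-c", "import sys; print(' '.join(sys.argv[1:]))"]


def build_args(file_path, messy=False):
    """Build the command line once so sync and async callers stay in step."""
    args = [*ECHO, file_path]
    if messy:
        args.append("--messy")
    return args


def run_sync(file_path, messy=False):
    # Popen takes the whole argument list as a single sequence.
    proc = subprocess.Popen(
        build_args(file_path, messy),
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    out, _ = proc.communicate()
    return out.strip()


async def run_async(file_path, messy=False):
    # create_subprocess_exec takes the program and its arguments as
    # separate positional parameters, hence the unpacking.
    proc = await asyncio.create_subprocess_exec(
        *build_args(file_path, messy),
        stdout=asyncio.subprocess.PIPE,
    )
    out, _ = await proc.communicate()
    return out.decode().strip()
```

Keeping the list in one place means a new flag only has to be added once, which appears to be the motivation for introducing `bin_args` in this commit.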
14 changes: 13 additions & 1 deletion adapters/python/json_lineage/cli.py
Expand Up @@ -18,6 +18,17 @@ def parse_args() -> argparse.Namespace:
        type=str,
        help="Path to the JSON file to read.",
    )
    parser.add_argument(
        "--messy",
        "-m",
        action="store_true",
        help=(
            "Indicates that the JSON file may not be well formatted. For "
            "example, the file may contain multiple JSON objects on a "
            "single line. Note: this option is considerably slower than "
            "the default option."
        ),
    )
    parser.add_argument(
        "--output-file",
        "-o",
Expand Down Expand Up @@ -46,7 +57,8 @@ def main() -> None:
    module.
    """
    args = parse_args()
    reader = BinaryReader(args.filepath)
    reader = BinaryReader(args.filepath, args.messy)

    if args.output_file:
        write_lines(reader, args.output_file)
    else:
Expand Down
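
To see how a `store_true` flag such as `--messy` behaves, here is a cut-down, self-contained parser. It is a sketch rather than the real `cli.py`: only the `filepath` positional and the `--messy`/`-m` option are reproduced, the help strings are shortened, and the `argv` parameter is an addition made so the parser can be exercised without a real command line.

```python
import argparse


def parse_args(argv=None):
    """Parse a reduced version of the CLI's arguments (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of the CLI parser.")
    parser.add_argument(
        "filepath",
        type=str,
        help="Path to the JSON file to read.",
    )
    parser.add_argument(
        "--messy",
        "-m",
        action="store_true",  # absent -> False, present -> True
        help="Treat the input as possibly not well formatted (slower).",
    )
    return parser.parse_args(argv)
```

With this definition, `parse_args(["data.json"]).messy` is `False`, and passing either `--messy` or `-m` sets it to `True`, which is the value `main` forwards to `BinaryReader`.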
8 changes: 4 additions & 4 deletions adapters/python/json_lineage/public.py
Expand Up @@ -11,11 +11,11 @@
]


def load(fp: str) -> BinaryReader:
def load(fp: str, *, messy: bool = False) -> BinaryReader:
    """Return a `BinaryReader` object for the given file path."""
    return BinaryReader(fp)
    return BinaryReader(fp, messy)


def aload(fp: str) -> AsyncBinaryReader:
def aload(fp: str, *, messy: bool = False) -> AsyncBinaryReader:
    """Return an `AsyncBinaryReader` object for the given file path."""
    return AsyncBinaryReader(fp)
    return AsyncBinaryReader(fp, messy)
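
The bare `*` in the new signatures makes `messy` keyword-only. A small sketch of what that means for callers is shown below; `Reader` is a hypothetical stand-in for `BinaryReader` so that the example runs without the package installed.

```python
class Reader:
    """Hypothetical stand-in for BinaryReader."""

    def __init__(self, filepath, messy=False):
        self.file_path = filepath
        self.messy = messy


def load(fp, *, messy=False):
    """Mirror of the new public signature: messy must be passed by name."""
    return Reader(fp, messy)


def accepts_positional_messy():
    """Return True if load("f.json", True) is accepted, False if rejected."""
    try:
        load("f.json", True)
    except TypeError:
        return False
    return True
```

Here `load("f.json", messy=True)` works, while `load("f.json", True)` raises `TypeError`, so a stray boolean at the call site cannot be mistaken for another argument. The reader classes themselves still accept `messy` positionally, which is how `cli.py` passes it.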
2 changes: 1 addition & 1 deletion adapters/python/setup.cfg
@@ -1,6 +1,6 @@
[metadata]
name = json-lineage
version = 0.1.0
version = 0.2.0
description = Library to parse JSON files iteratively without loading the whole file into memory
long_description = file: README.md
long_description_content_type = text/markdown
Expand Down
17 changes: 17 additions & 0 deletions adapters/python/tests/test_bin_interface.py
Expand Up @@ -8,6 +8,7 @@
from unittest.mock import patch

from json_lineage import bin_interface
from json_lineage.bin_interface import get_bin_path
from json_lineage.exceptions import BinaryExecutionException

from .helpers import SAMPLE_DATA_PATH
Expand Down Expand Up @@ -76,6 +77,22 @@ def test_kill_subprocess_proc_closes_files_and_terminates_proc(self):
time.sleep(0.01)
self.assertIsNotNone(proc.poll())

    def test_bin_args_with_just_filename(self):
        """Test that the `bin_args` method returns the correct arguments when
        only the filename is passed.
        """
        reader = bin_interface.BinaryReader("filename")
        self.assertEqual(reader.bin_args(), [get_bin_path(), "filename"])

    def test_bin_with_messy_opt(self):
        """Test that the `bin_args` method returns the correct arguments when
        the `messy` option is passed.
        """
        reader = bin_interface.BinaryReader("filename", messy=True)
        self.assertEqual(
            reader.bin_args(), [get_bin_path(), "filename", "--messy"]
        )


class TestBinaryReader(ReaderInstanceMixin, TestCase):
    """Tests for the `BinaryReader` class."""
Expand Down
Binary file added docs/benchmark/benchmark-memory-diff.jpg
Binary file added docs/benchmark/benchmark-time-diff.jpg

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion docs/cargo/implementors/core/marker/trait.Freeze.js

2 changes: 1 addition & 1 deletion docs/cargo/implementors/core/marker/trait.Send.js
