multithreaded-image-downloads

Note: If you are looking for a simple solution for multi-threaded downloads, try aria2c:

sed -E 's/([^,]*),(.*)/\2\n  out=\1.jpg/' <path-to-csv> | aria2c -i -
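To see what that one-liner feeds to aria2c, here is the same sed expression applied to a single sample line. Note that `\n` in the replacement is a GNU sed extension; BSD/macOS sed would need a literal newline instead:

```shell
# Given a CSV line "<label>,<url>", emit an aria2c input-file entry:
#   <url>
#     out=<label>.jpg
printf 'fat_cat,http://farm1.staticflickr.com/1/1053148_4114c598f2.jpg\n' \
  | sed -E 's/([^,]*),(.*)/\2\n  out=\1.jpg/'
```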

Use this package to perform a multi-threaded download of a list of file URLs contained in a TSV file whose lines are formatted as `<label>\t<url>\n` (fields separated by a single tab). Each URL is downloaded into a directory named after its label.

Installation

python setup.py install

Input file sample:

In test_URLs.tsv (each line):

  `<label>\t<url>\n`

Example:

      fat_cat         http://farm1.staticflickr.com/1/1053148_4114c598f2.jpg
      fat_cat         http://farm2.staticflickr.com/1246/1061116668_a7e80ff2e8.jpg
      colorful_bird   http://farm1.staticflickr.com/34/100197289_ffc66e727e.jpg
      colorful_bird   http://farm2.staticflickr.com/1438/1271854268_d051bdd585.jpg
      barking_dog     http://farm1.staticflickr.com/41/103187370_7db6b95089.jpg
      barking_dog     http://farm1.staticflickr.com/45/107867809_57412c5cb4.jpg
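A line-parsing helper for this format might look like the sketch below. The name `parse_tsv_line` is hypothetical; the actual helper used in `tests/test_basic.py` may differ:

```python
def parse_tsv_line(line):
    """Split a '<label>\\t<url>' line into a (label_dir, file_url) tuple.

    Surrounding whitespace on either field is stripped; malformed lines
    raise ValueError so the caller can skip or log them.
    """
    parts = line.strip().split('\t')
    if len(parts) != 2:
        raise ValueError(f"expected '<label>\\t<url>', got: {line!r}")
    label, url = (p.strip() for p in parts)
    return label, url

print(parse_tsv_line("fat_cat\thttp://farm1.staticflickr.com/1/1053148_4114c598f2.jpg"))
```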

Pseudocode usage:

(complete example in tests/test_basic.py)

  import queue

  from multithreaded_image_download import URLDownloaderThread

  # define a shared queue and a pool of worker threads
  num_threads, queue_size = 8, 32
  q = queue.Queue(maxsize=queue_size)
  threads = [URLDownloaderThread(q) for _ in range(num_threads)]
  for t in threads: t.start()  # each thread's run() loop dequeues and processes queue items
  # open a TSV file formatted like the example above
  with open('tsv_file', 'r') as f:
    for line in f:
      # parse each line into an item = (label_dir, file_url)
      item = preprocess_line(line.strip().split('\t'))
      # enqueue the item so a worker thread can process it
      q.put(item)
  q.join()  # block until task_done() has been called once per enqueued item
  for _ in range(num_threads): q.put(None)  # one sentinel per worker so each leaves its run() loop
  for t in threads: t.join()  # wait for all threads to terminate

More info

This package can also serve as a skeleton for threads that dequeue and process items from a shared Python queue. Multithreaded image downloading is provided as an example, and an extended, commented usage walkthrough is in tests/test_basic.py. Run the test script with `make test` from a terminal (tested on macOS).
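The general skeleton, independent of image downloading, can be sketched with only the standard library. `Worker` here is a hypothetical stand-in for `URLDownloaderThread`; it applies an arbitrary `process` callable to each dequeued item and exits on a `None` sentinel:

```python
import queue
import threading

class Worker(threading.Thread):
    """Dequeue items from a shared queue until a None sentinel arrives."""

    def __init__(self, q, process):
        super().__init__()
        self.q = q
        self.process = process  # callable applied to each dequeued item

    def run(self):
        while True:
            item = self.q.get()
            if item is None:           # sentinel: leave the run() loop
                self.q.task_done()
                break
            try:
                self.process(item)
            finally:
                self.q.task_done()     # always mark the item as handled

results = []
q = queue.Queue()
workers = [Worker(q, results.append) for _ in range(4)]
for w in workers:
    w.start()
for n in range(10):
    q.put(n)
q.join()                    # wait until every enqueued item is processed
for _ in workers:           # one sentinel per worker thread
    q.put(None)
for w in workers:
    w.join()
print(sorted(results))      # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Pushing one `None` per worker is what lets every thread terminate cleanly: each sentinel is consumed by exactly one thread, which breaks out of its loop. (`list.append` is safe to call from multiple threads in CPython; a real worker would do I/O instead.)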
