API reference¶

crawler¶

Crawler base class

class icrawler.crawler.Crawler(feeder_cls=<class 'icrawler.feeder.Feeder'>, parser_cls=<class 'icrawler.parser.Parser'>, downloader_cls=<class 'icrawler.downloader.Downloader'>, feeder_threads=1, parser_threads=1, downloader_threads=1, storage={'backend': 'FileSystem', 'root_dir': 'images'}, log_level=20, extra_feeder_args=None, extra_parser_args=None, extra_downloader_args=None)[source]¶

Base class for crawlers

session¶

A Session object.

Type:	Session

feeder¶

A Feeder object.

Type:	Feeder

parser¶

A Parser object.

Type:	Parser

downloader¶

A Downloader object.

Type:	Downloader

signal¶

A Signal object shared by all components, used for communication among threads

Type:	Signal

logger¶

A Logger object used for logging

Type:	Logger

crawl(feeder_kwargs=None, parser_kwargs=None, downloader_kwargs=None)[source]¶

Start crawling

This method starts the feeder, parser and downloader, and waits until all threads exit.

Parameters:

feeder_kwargs (dict, optional) – Arguments to be passed to `feeder.start()`
parser_kwargs (dict, optional) – Arguments to be passed to `parser.start()`
downloader_kwargs (dict, optional) – Arguments to be passed to `downloader.start()`
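As a rough illustration of the start-then-wait behaviour described above, the following sketch wires three stages together with standard-library queues and threads. It does not use icrawler itself: `run_pipeline`, the `STOP` sentinel and the fake `file_url` values are hypothetical stand-ins, and the real crawler coordinates shutdown through its Signal object rather than a sentinel.

```python
import queue
import threading


def run_pipeline(page_urls):
    """Toy feeder -> parser -> downloader pipeline (not icrawler code)."""
    url_queue = queue.Queue()   # feeder output, parser input
    task_queue = queue.Queue()  # parser output, downloader input
    downloaded = []
    STOP = object()             # hypothetical sentinel marking end of input

    def feeder():
        for url in page_urls:
            url_queue.put(url)
        url_queue.put(STOP)

    def parser():
        while True:
            url = url_queue.get()
            if url is STOP:
                task_queue.put(STOP)
                break
            # A real parser would download the page and extract file urls.
            task_queue.put({"file_url": url + "/image.jpg"})

    def downloader():
        while True:
            task = task_queue.get()
            if task is STOP:
                break
            # A real downloader would fetch and store the file.
            downloaded.append(task["file_url"])

    threads = [threading.Thread(target=f) for f in (feeder, parser, downloader)]
    for t in threads:
        t.start()
    # Like crawl(), block until every stage has exited.
    for t in threads:
        t.join()
    return downloaded
```

With one thread per stage the output order matches the input order; with several threads per stage, as the `*_threads` arguments allow, no ordering should be assumed.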

init_signal()[source]¶

Init signal

Three signals are added: feeder_exited, parser_exited and reach_max_num.

set_logger(log_level=20)[source]¶: Configure the logger with log_level.

set_proxy_pool(pool=None)[source]¶

Construct a proxy pool

By default no proxy is used.

Parameters:	pool (ProxyPool, optional) – a `ProxyPool` object

set_session(headers=None)[source]¶

Init session with default or custom headers

Parameters:	headers – A dict of headers. Defaults to None, in which case the default headers are used to initialize the session.

set_storage(storage)[source]¶

Set storage backend for downloader

For the full list of supported storage backends, see storage.

Parameters:	storage (dict or BaseStorage) – storage backend configuration or instance

feeder¶

class icrawler.feeder.Feeder(thread_num, signal, session)[source]¶

Bases: icrawler.utils.thread_pool.ThreadPool

Base class for feeder.

A thread pool of feeder threads, in charge of feeding urls to parsers.

thread_num¶

An integer indicating the number of threads.

Type:	int

global_signal¶

A Signal object for communication among all threads.

Type:	Signal

out_queue¶

A queue connected with parsers’ inputs, storing page urls.

Type:	Queue

session¶

A session object.

Type:	Session

logger¶

A logging.Logger object used for logging.

Type:	Logger

workers¶

A list storing all the threading.Thread objects of the feeder.

Type:	list

lock¶

A Lock instance shared by all feeder threads.

Type:	Lock

feed(**kwargs)[source]¶

Feed urls.

This method should be implemented by users.

worker_exec(**kwargs)[source]¶: Target function of workers

class icrawler.feeder.SimpleSEFeeder(thread_num, signal, session)[source]¶

Bases: icrawler.feeder.Feeder

Simple search-engine-like feeder.

feed(url_template, keyword, offset, max_num, page_step)[source]¶

Feed urls once

Parameters:

url_template – A string with parameters replaced with “{}”.
keyword – A string indicating the searching keyword.
offset – An integer indicating the starting index.
max_num – An integer indicating the max number of images to be crawled.
page_step – An integer added to offset after each iteration.
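The parameters above suggest a simple paging loop. The sketch below is one plausible reading, not the library's code: it assumes the template has exactly two “{}” placeholders, filled with the keyword and then the current index, and `build_page_urls` is a hypothetical helper name.

```python
def build_page_urls(url_template, keyword, offset, max_num, page_step):
    """Return the page urls a search-engine-like feeder might emit.

    Assumption: url_template contains two "{}" placeholders, the first for
    the keyword and the second for the starting index of each page.
    """
    return [
        url_template.format(keyword, start)
        for start in range(offset, offset + max_num, page_step)
    ]


urls = build_page_urls(
    "https://example.com/search?q={}&first={}", "cat", 0, 50, 20
)
```

Here `urls` holds three entries, with `first` set to 0, 20 and 40. A real feeder would put each url into its out_queue instead of returning a list.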

class icrawler.feeder.UrlListFeeder(thread_num, signal, session)[source]¶

Bases: icrawler.feeder.Feeder

URL list feeder, which feeds a list of URLs.

feed(url_list, offset=0, max_num=0)[source]¶

Feed urls.

This method should be implemented by users.

parser¶

class icrawler.parser.Parser(thread_num, signal, session)[source]¶

Bases: icrawler.utils.thread_pool.ThreadPool

Base class for parser.

A thread pool of parser threads, in charge of downloading and parsing pages, extracting file urls and putting them into the input queue of the downloader.

global_signal¶: A Signal object for cross-module communication.

session¶: A requests.Session object.

logger¶: A logging.Logger object used for logging.

threads¶: A list storing all the threading.Thread objects of the parser.

thread_num¶: An integer indicating the number of threads.

lock¶: A threading.Lock object.

parse(response, **kwargs)[source]¶

Parse a page and extract image urls, then put them into task_queue.

This method should be overridden by users.

Example:

>>> task = {}
>>> self.output(task)

worker_exec(queue_timeout=2, req_timeout=5, max_retry=3, **kwargs)[source]¶

Target method of workers.

First download the page, then call the parse() method. A parser thread will exit in either of the following cases:

All feeder threads have exited and the url_queue is empty.
Downloaded image number has reached required number.

Parameters:

queue_timeout (int) – Timeout of getting urls from `url_queue`.
req_timeout (int) – Timeout of making requests for downloading pages.
max_retry (int) – Max retry times if the request fails.
**kwargs – Arguments to be passed to the `parse()` method.
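The two exit conditions above can be sketched as a polling loop. This is a simplified, single-threaded illustration: `parser_loop` is a hypothetical name, a plain dict stands in for the shared Signal object, and page downloading is replaced by recording the url.

```python
import queue


def parser_loop(url_queue, signals, queue_timeout=0.01):
    """Consume urls until one of the two documented exit conditions holds."""
    handled = []
    while True:
        # Exit case 2: enough images have been downloaded already.
        if signals.get("reach_max_num"):
            break
        try:
            url = url_queue.get(timeout=queue_timeout)
        except queue.Empty:
            # Exit case 1: the queue is empty and no feeder can refill it.
            if signals.get("feeder_exited"):
                break
            continue
        handled.append(url)  # a real worker would download and parse here
    return handled
```

The timeout matters because an empty queue alone does not mean the work is done: a feeder may still be running, so the worker has to wake up periodically and re-check the signals.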

downloader¶

class icrawler.downloader.Downloader(thread_num, signal, session, storage)[source]¶

Bases: icrawler.utils.thread_pool.ThreadPool

Base class for downloader.

A thread pool of downloader threads, in charge of downloading files and saving them in the corresponding paths.

task_queue¶

A queue storing image downloading tasks, connecting Parser and Downloader.

Type:	CachedQueue

signal¶

A Signal object shared by all components.

Type:	Signal

session¶

A session object.

Type:	Session

logger¶: A logging.Logger object used for logging.

workers¶

A list of downloader threads.

Type:	list

thread_num¶

The number of downloader threads.

Type:	int

lock¶

A threading.Lock object.

Type:	Lock

storage¶

The storage backend.

Type:	BaseStorage

clear_status()[source]¶: Reset fetched_num to 0.

download(task, default_ext, timeout=5, max_retry=3, overwrite=False, **kwargs)[source]¶

Download the image and save it to the corresponding path.

Parameters:

task (dict) – The task dict obtained from `task_queue`.
timeout (int) – Timeout of making requests for downloading images.
max_retry (int) – Maximum number of retries if the request fails.
**kwargs – Reserved arguments for overriding.

get_filename(task, default_ext)[source]¶

Set the path where the image will be saved.

The default strategy is to use an increasing 6-digit number as the filename. You can override this method if you want to set custom naming rules. The file extension is kept if it can be obtained from the url, otherwise default_ext is used as extension.

Parameters:	task (dict) – The task dict got from `task_queue`.

Returns:	Filename with extension.
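The default naming strategy described above can be approximated in a few lines. This is a sketch rather than the library's implementation: `make_filename` is a hypothetical helper, and it assumes the running file index is passed in and the url is available as a plain string.

```python
import os
from urllib.parse import urlparse


def make_filename(file_idx, file_url, default_ext):
    """Build a 6-digit, zero-padded filename, keeping the url's extension."""
    path = urlparse(file_url).path            # drops any query string
    ext = os.path.splitext(path)[1].lstrip(".")
    if not ext:
        ext = default_ext                     # fall back when the url has none
    return "{:06d}.{}".format(file_idx, ext)
```

A custom naming rule, for example one based on a hash of the url, would replace the body of such a function while keeping the same inputs.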

process_meta(task)[source]¶

Process some meta data of the images.

Override this method to do more than just download the image, such as saving annotations.

Parameters:	task (dict) – The task dict got from task_queue. This method will make use of fields other than `file_url` in the dict.

reach_max_num()[source]¶

Check whether the number of downloaded images has reached the maximum.

Returns:	True if the number of downloaded images has reached the maximum.
Return type:	bool

set_file_idx_offset(file_idx_offset=0)[source]¶

Set offset of file index.

Parameters:	file_idx_offset – It can be either an integer or ‘auto’. If set to an integer, the filename will start from `file_idx_offset` + 1. If set to `'auto'`, the filename will start from existing max file index plus 1.
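The two accepted forms of `file_idx_offset` can be sketched as follows. `resolve_file_idx_offset` is a hypothetical helper, and `max_existing_idx` stands for whatever the storage backend reports as its largest existing file index.

```python
def resolve_file_idx_offset(file_idx_offset, max_existing_idx):
    """Return the numeric offset; the first new file is named offset + 1."""
    if isinstance(file_idx_offset, int):
        return file_idx_offset
    if file_idx_offset == "auto":
        # Continue numbering after the files that already exist.
        return max_existing_idx
    raise ValueError("file_idx_offset must be an integer or 'auto'")
```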

worker_exec(max_num, default_ext='', queue_timeout=5, req_timeout=5, **kwargs)[source]¶

Target method of workers.

Get task from task_queue and then download files and process meta data. A downloader thread will exit in either of the following cases:

All parser threads have exited and the task_queue is empty.
Downloaded image number has reached required number (max_num).

Parameters:

queue_timeout (int) – Timeout of getting tasks from `task_queue`.
req_timeout (int) – Timeout of making requests for downloading pages.
**kwargs – Arguments passed to the `download()` method.

class icrawler.downloader.ImageDownloader(thread_num, signal, session, storage)[source]¶

Bases: icrawler.downloader.Downloader

Downloader specialized for images.

get_filename(task, default_ext)[source]¶

Set the path where the image will be saved.

The default strategy is to use an increasing 6-digit number as the filename. You can override this method if you want to set custom naming rules. The file extension is kept if it can be obtained from the url, otherwise default_ext is used as extension.

Parameters:	task (dict) – The task dict got from `task_queue`.

Returns:	Filename with extension.

keep_file(task, response, min_size=None, max_size=None)[source]¶

Decide whether to keep the image

Compare image size with min_size and max_size to decide.

Parameters:

response (Response) – response of requests.
min_size (tuple or None) – minimum size of required images.
max_size (tuple or None) – maximum size of required images.

Returns:	whether to keep the image.
Return type:	bool
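A size filter of this kind might look like the sketch below. It is written under assumptions that the reference does not confirm: sizes are taken to be `(width, height)` tuples compared component by component, and the image size is passed in directly instead of being decoded from the response. `size_within` is a hypothetical name.

```python
def size_within(size, min_size=None, max_size=None):
    """Return True if size lies within the optional lower and upper bounds.

    Assumption: all sizes are (width, height) tuples and both dimensions
    must satisfy each bound. The library's own comparison may differ.
    """
    width, height = size
    if min_size is not None and (width < min_size[0] or height < min_size[1]):
        return False
    if max_size is not None and (width > max_size[0] or height > max_size[1]):
        return False
    return True
```

Passing None for a bound disables that side of the check, which matches the `tuple or None` types given for `min_size` and `max_size`.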

worker_exec(max_num, default_ext='jpg', queue_timeout=5, req_timeout=5, **kwargs)[source]¶

Target method of workers.

Get task from task_queue and then download files and process meta data. A downloader thread will exit in either of the following cases:

All parser threads have exited and the task_queue is empty.
Downloaded image number has reached required number(max_num).

Parameters:	queue_timeout (int) – Timeout of getting tasks from `task_queue`. req_timeout (int) – Timeout of making requests for downloading pages. **kwargs – Arguments passed to the `download()` method.

storage¶

class icrawler.storage.BaseStorage[source]¶

Bases: object

Base class of backend storage

exists(id)[source]¶

Check the existence of some data

Parameters:	id (str) – unique id of the data in the storage
Returns:	whether the data exists
Return type:	bool

max_file_idx()[source]¶

Get the max existing file index

Returns:	the max index
Return type:	int

write(id, data)[source]¶

Abstract interface of writing data

Parameters:

id (str) – unique id of the data in the storage.
data (bytes or str) – data to be stored.
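To make the three-method interface concrete, here is a dict-backed stand-in with the same method names. It deliberately does not subclass `BaseStorage`, so it runs without icrawler installed; a real backend would inherit from it. `InMemoryStorage` is a hypothetical class, and the rule used by `max_file_idx` (leading digits of ids such as `000003.jpg`, or 0 when there are none) is an assumption.

```python
import re


class InMemoryStorage:
    """Keeps data in a dict; mirrors the exists/write/max_file_idx interface."""

    def __init__(self):
        self._data = {}

    def write(self, id, data):
        self._data[id] = data

    def exists(self, id):
        return id in self._data

    def max_file_idx(self):
        # Assumption: numbered files look like "000003.jpg".
        max_idx = 0
        for name in self._data:
            match = re.match(r"(\d+)\.", name)
            if match:
                max_idx = max(max_idx, int(match.group(1)))
        return max_idx
```

Because `set_storage` is documented as accepting either a configuration dict or a `BaseStorage` instance, a backend written this way against the real base class could be handed to a crawler directly.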

class icrawler.storage.FileSystem(root_dir)[source]¶

Bases: icrawler.storage.base.BaseStorage

Use filesystem as storage backend.

The id is filename and data is stored as text files or binary files.

exists(id)[source]¶

Check the existence of some data

Parameters:	id (str) – unique id of the data in the storage
Returns:	whether the data exists
Return type:	bool

max_file_idx()[source]¶

Get the max existing file index

Returns:	the max index
Return type:	int

write(id, data)[source]¶

Abstract interface of writing data

Parameters:

id (str) – unique id of the data in the storage.
data (bytes or str) – data to be stored.

class icrawler.storage.GoogleStorage(root_dir)[source]¶

Bases: icrawler.storage.base.BaseStorage

Google Storage backend.

The id is filename and data is stored as text files or binary files. The root_dir is the bucket address such as gs://<your_bucket>/<your_directory>.

exists(id)[source]¶

Check the existence of some data

Parameters:	id (str) – unique id of the data in the storage
Returns:	whether the data exists
Return type:	bool

max_file_idx()[source]¶

Get the max existing file index

Returns:	the max index
Return type:	int

write(id, data)[source]¶

Abstract interface of writing data

Parameters:

id (str) – unique id of the data in the storage.
data (bytes or str) – data to be stored.

utils¶

class icrawler.utils.CachedQueue(*args, **kwargs)[source]¶

Bases: Queue.Queue, object

Queue with cache

This queue is used in ThreadPool. It enables the parser and downloader to check whether a page url or task has been seen or processed before.

_cache¶

cache, elements are stored as keys of it.

Type:	OrderedDict

cache_capacity¶

maximum size of cache.

Type:	int

is_duplicated(item)[source]¶

Check whether the item is already in the cache

If the item has not been seen before, hash it and put it into the cache; otherwise report the item as duplicated. When the cache size exceeds its capacity, the earliest items in the cache are discarded.

Parameters:	item (object) – The item to be checked and stored in cache. It must be immutable or a list/dict.
Returns:	Whether the item has been in cache.
Return type:	bool
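The check-then-store behaviour with eviction of the oldest entries can be sketched with an `OrderedDict`. This is an illustration, not the library's code: `SeenCache` is a hypothetical class, and serialising lists and dicts with `json.dumps` is just one way to obtain a hashable key.

```python
import json
from collections import OrderedDict


class SeenCache:
    """Bounded 'have I seen this?' cache that evicts the earliest entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._cache = OrderedDict()  # keys are the seen items, values unused

    def is_duplicated(self, item):
        # Lists and dicts are unhashable, so serialise them to a stable string.
        if isinstance(item, (list, dict)):
            key = json.dumps(item, sort_keys=True)
        else:
            key = item
        if key in self._cache:
            return True
        self._cache[key] = True
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # drop the earliest entry
        return False
```

One consequence of the bounded cache is that an item evicted long ago is treated as new again, so the capacity trades memory for deduplication accuracy.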

put(item, block=True, timeout=None, dup_callback=None)[source]¶: Put an item to queue if it is not duplicated.

put_nowait(item, dup_callback=None)[source]¶

Put an item into the queue without blocking.

Only enqueue the item if a free slot is immediately available. Otherwise raise the Full exception.

class icrawler.utils.Proxy(addr=None, protocol='http', weight=1.0, last_checked=None)[source]¶

Bases: object

Proxy class

addr¶

A string with IP and port, for example ‘123.123.123.123:8080’

Type:	str

protocol¶

‘http’ or ‘https’

Type:	str

weight¶

A floating-point number indicating the probability of being selected. The weight is based on the connection time and stability.

Type:	float

last_checked¶

A UNIX timestamp indicating when the proxy was checked

Type:	time

format()[source]¶

Return the proxy compatible with requests.Session parameters

Returns:	A dict like {‘http’: ‘123.123.123.123:8080’}
Return type:	dict

to_dict()[source]¶

Convert detailed proxy info into a dict

Returns:	A dict with four keys: `addr`, `protocol`, `weight` and `last_checked`
Return type:	dict
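The shapes returned by `format()` and `to_dict()` can be shown with a small stand-in class. `SimpleProxy` is hypothetical and only mirrors the documented attributes; defaulting `last_checked` to the current time is an assumption.

```python
import time


class SimpleProxy:
    """Minimal stand-in carrying the four documented proxy attributes."""

    def __init__(self, addr=None, protocol="http", weight=1.0, last_checked=None):
        self.addr = addr
        self.protocol = protocol
        self.weight = weight
        # Assumption: an unchecked proxy is stamped with the current time.
        self.last_checked = int(time.time()) if last_checked is None else last_checked

    def format(self):
        # Shape accepted by the `proxies` argument of requests.
        return {self.protocol: self.addr}

    def to_dict(self):
        return {
            "addr": self.addr,
            "protocol": self.protocol,
            "weight": self.weight,
            "last_checked": self.last_checked,
        }
```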

class icrawler.utils.ProxyPool(filename=None)[source]¶

Bases: object

Proxy pool class

ProxyPool provides friendly APIs to manage proxies.

idx¶

Index for http proxy list and https proxy list.

Type:	dict

test_url¶

A dict containing two urls. When testing whether a proxy is valid, test_url[‘http’] or test_url[‘https’] is used according to the protocol.

Type:	dict

proxies¶

All the http and https proxies.

Type:	dict

addr_list¶

Address of proxies.

Type:	dict

dec_ratio¶

When decreasing the weight of some proxy, its weight is multiplied with dec_ratio.

Type:	float

inc_ratio¶

Similar to dec_ratio but used for increasing weights. Defaults to the reciprocal of dec_ratio.

Type:	float

weight_thr¶

The minimum weight of a valid proxy, if the weight of a proxy is lower than weight_thr, it will be removed.

Type:	float

logger¶

A logging.Logger object used for logging.

Type:	Logger

add_proxy(proxy)[source]¶

Add a valid proxy into pool

You must call the add_proxy method to add a proxy to the pool instead of modifying the proxies variable directly.

decrease_weight(proxy)[source]¶: Decrease the weight of a proxy by multiplying it by dec_ratio

default_scan(region='mainland', expected_num=20, val_thr_num=4, queue_timeout=3, val_timeout=5, out_file='proxies.json', src_files=None)[source]¶

Default scan method, to simplify the usage of scan method.

It will register the following scan functions:

1. scan_file
2. scan_cnproxy (if region is mainland)
3. scan_free_proxy_list (if region is overseas)
4. scan_ip84
5. scan_mimiip

After scanning, all the proxy info will be saved in out_file.

Parameters:

region – Either ‘mainland’ or ‘overseas’
expected_num – An integer indicating the expected number of proxies. If this value is set too high, the scanning process may take a long time to finish.
val_thr_num – Number of threads used for validating proxies.
queue_timeout – An integer indicating the timeout for getting a candidate proxy from the queue.
val_timeout – An integer indicating the timeout when connecting the test url using a candidate proxy.
out_file – the file name of the output file saving all the proxy info
src_files – A list of file names to scan

get_next(protocol='http', format=False, policy='loop')[source]¶

Get the next proxy

Parameters:

protocol (str) – ‘http’ or ‘https’. (default ‘http’)
format (bool) – Whether to format the proxy. (default False)
policy (str) – Either ‘loop’ or ‘random’, indicating the policy of getting the next proxy. If set to ‘loop’, will return proxies in turn, otherwise will return a proxy randomly.

Returns:

If format is true, the formatted proxy, which is compatible with requests.Session parameters; otherwise a Proxy object.

Return type:

Proxy or dict
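The difference between the two policies can be sketched as below. `ProxyRing` is a hypothetical class that stores plain address strings. Using the weights as sampling weights for the ‘random’ policy is an assumption based on the description of `Proxy.weight`; the reference itself only says a proxy is returned randomly.

```python
import random


class ProxyRing:
    """Chooses addresses in turn ('loop') or by weighted sampling ('random')."""

    def __init__(self, addrs, weights=None):
        self.addrs = list(addrs)
        self.weights = list(weights) if weights is not None else [1.0] * len(self.addrs)
        self.idx = 0  # position of the next address for the 'loop' policy

    def get_next(self, policy="loop"):
        if not self.addrs:
            return None
        if policy == "loop":
            addr = self.addrs[self.idx % len(self.addrs)]
            self.idx = (self.idx + 1) % len(self.addrs)
            return addr
        if policy == "random":
            # Assumption: heavier proxies are proportionally more likely.
            return random.choices(self.addrs, weights=self.weights, k=1)[0]
        raise ValueError("policy must be 'loop' or 'random'")
```

The ‘loop’ policy spreads requests evenly and is deterministic, while ‘random’ lets better-performing proxies carry more of the traffic.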

increase_weight(proxy)[source]¶: Increase the weight of a proxy by multiplying it by inc_ratio
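How dec_ratio, inc_ratio and weight_thr interact can be sketched with a plain dict of weights. `adjust_weight` is a hypothetical helper, the default values 0.9 and 0.2 are made up for illustration, and capping the weight at 1.0 is an assumption.

```python
def adjust_weight(weights, addr, success, dec_ratio=0.9, weight_thr=0.2):
    """Scale the weight of addr up or down; drop it once below weight_thr."""
    inc_ratio = 1 / dec_ratio  # documented default: reciprocal of dec_ratio
    if success:
        # Assumption: weights never grow beyond the initial 1.0.
        weights[addr] = min(1.0, weights[addr] * inc_ratio)
    else:
        weights[addr] *= dec_ratio
        if weights[addr] < weight_thr:
            del weights[addr]  # too unreliable, remove from the pool
```

Because the two ratios are reciprocals, one failure followed by one success returns a proxy to its previous weight, so only a run of failures pushes it below the threshold.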

is_valid(addr, protocol='http', timeout=5)[source]¶

Check if a proxy is valid

Parameters:

addr – A string in the form of ‘ip:port’
protocol – Either ‘http’ or ‘https’, different test urls will be used according to protocol.
timeout – An integer indicating the timeout of connecting the test url.

Returns:

If the proxy is valid, returns {‘valid’: True, ‘response_time’: xx}; otherwise returns {‘valid’: False, ‘msg’: ‘xxxxxx’}.

Return type:

dict

load(filename)[source]¶: Load proxies from file

proxy_num(protocol=None)[source]¶

Get the number of proxies in the pool

Parameters:	protocol (str, optional) – ‘http’ or ‘https’ or None. (default None)
Returns:	If protocol is None, return the total number of proxies, otherwise, return the number of proxies of corresponding protocol.

remove_proxy(proxy)[source]¶: Remove a proxy out of the pool

save(filename)[source]¶: Save proxies to file

scan(proxy_scanner, expected_num=20, val_thr_num=4, queue_timeout=3, val_timeout=5, out_file='proxies.json')[source]¶

Scan and validate proxies

First call the scan method of proxy_scanner, then use multiple threads to validate the candidate proxies.

Parameters:

proxy_scanner – A ProxyScanner object.
expected_num – Max number of valid proxies to be scanned.
val_thr_num – Number of threads used for validating proxies.
queue_timeout – Timeout for getting a proxy from the queue.
val_timeout – An integer passed to is_valid as argument timeout.
out_file – A string or None. If not None, the proxies will be saved into out_file.

validate(proxy_scanner, expected_num=20, queue_timeout=3, val_timeout=5)[source]¶

Target function of validation threads

Parameters:

proxy_scanner – A ProxyScanner object.
expected_num – Max number of valid proxies to be scanned.
queue_timeout – Timeout for getting a proxy from the queue.
val_timeout – An integer passed to is_valid as argument timeout.

class icrawler.utils.ProxyScanner[source]¶

Proxy scanner class

ProxyScanner focuses on scanning proxy lists from different sources.

proxy_queue¶: The queue for storing proxies.

scan_funcs¶: Name of functions to be used in scan method.

scan_kwargs¶: Arguments of functions

scan_threads¶: A list of threading.Thread objects.

logger¶: A logging.Logger object used for logging.

is_scanning()[source]¶: Return whether at least one scanning thread is alive

register_func(func_name, func_kwargs)[source]¶

Register a scan function

Parameters:

func_name – The function name of a scan function.
func_kwargs – A dict containing arguments of the scan function.

scan()[source]¶: Start a thread for each registered scan function to scan proxy lists

scan_cnproxy()[source]¶: Scan candidate (mainland) proxies from http://cn-proxy.com

scan_file(src_file)[source]¶: Scan candidate proxies from an existing file

scan_free_proxy_list()[source]¶: Scan candidate (overseas) proxies from http://free-proxy-list.net

scan_ip84(region='mainland', page=1)[source]¶

Scan candidate proxies from http://ip84.com

Parameters:

region – Either ‘mainland’ or ‘overseas’.
page – An integer indicating how many pages to be scanned.

scan_mimiip(region='mainland', page=1)[source]¶

Scan candidate proxies from http://mimiip.com

Parameters:

region – Either ‘mainland’ or ‘overseas’.
page – An integer indicating how many pages to be scanned.

class icrawler.utils.Session(proxy_pool)[source]¶

Bases: requests.sessions.Session

get(url, **kwargs)[source]¶

Sends a GET request. Returns Response object.

Parameters:

url – URL for the new `Request` object.
**kwargs – Optional arguments that `request` takes.

Return type:	requests.Response

post(url, data=None, json=None, **kwargs)[source]¶

Sends a POST request. Returns Response object.

Parameters:

url – URL for the new `Request` object.
data – (optional) Dictionary, list of tuples, bytes, or file-like object to send in the body of the `Request`.
json – (optional) json to send in the body of the `Request`.
**kwargs – Optional arguments that `request` takes.

Return type:	requests.Response

class icrawler.utils.Signal[source]¶

Bases: object

Signal class

Provides an interface for setting and getting globally shared variables (signals).

signals¶: A dict of all signal names and values.

init_status¶: The initial values of all signals.

get(name)[source]¶

Get a signal value by its name.

Parameters:	name – a string indicating the signal name.
Returns:	Value of the signal or None if the name is invalid.

names()[source]¶: Return all the signal names

reset()[source]¶: Reset signals with their initial values

set(**signals)[source]¶

Set signals.

Parameters:	signals – A dict (key-value pairs) of all signals. For example {‘signal1’: True, ‘signal2’: 10}
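The get, set, reset and names methods can be mimicked with two dicts. `SimpleSignal` below is a hypothetical stand-in, not the library class, and it assumes that the first value given to a signal becomes its initial value for `reset()`.

```python
class SimpleSignal:
    """Holds named flags plus the initial value each one was created with."""

    def __init__(self):
        self._signals = {}
        self._init_status = {}

    def set(self, **signals):
        for name, value in signals.items():
            # Assumption: the first value seen is remembered for reset().
            if name not in self._signals:
                self._init_status[name] = value
            self._signals[name] = value

    def get(self, name):
        # Unknown names yield None rather than raising.
        return self._signals.get(name)

    def reset(self):
        self._signals = dict(self._init_status)

    def names(self):
        return list(self._signals)


signal = SimpleSignal()
signal.set(feeder_exited=False, parser_exited=False, reach_max_num=False)
signal.set(feeder_exited=True)  # e.g. set by the last feeder thread
```

Since `set` takes keyword arguments, the dict from the example above would be passed as `signal.set(**{'signal1': True, 'signal2': 10})`.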

class icrawler.utils.ThreadPool(thread_num, in_queue=None, out_queue=None, name=None)[source]¶

Bases: object

Simple implementation of a thread pool

This is the base class of Feeder, Parser and Downloader. It incorporates two FIFO queues and a number of “workers”, namely threads. All threads share the two queues. After a thread starts, it watches in_queue; once the queue is not empty, it gets a task from the queue, processes it, and puts the output into out_queue.

Note

This class is not designed as a generic thread pool, but works specifically for crawler components.

name¶

thread pool name.

Type:	str

thread_num¶

number of available threads.

Type:	int

in_queue¶

input queue of tasks.

Type:	Queue

out_queue¶

output queue of finished tasks.

Type:	Queue

workers¶

a list of working threads.

Type:	list

lock¶

thread lock.

Type:	Lock

logger¶

standard python logger.

Type:	Logger

connect(component)[source]¶

Connect two ThreadPools.

The in_queue of the second pool will be set as the out_queue of the current pool, thus all the output will be input to the second pool.

Parameters:	component (ThreadPool) – the ThreadPool to be connected.
Returns:	the modified second ThreadPool.
Return type:	ThreadPool
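The queue sharing performed by `connect()` can be sketched without any threads. `MiniPool` is a hypothetical class that keeps only the two queues and the connect logic described above.

```python
import queue


class MiniPool:
    """Holds an input and an output queue, like one pipeline stage."""

    def __init__(self, name, in_queue=None, out_queue=None):
        self.name = name
        self.in_queue = in_queue if in_queue is not None else queue.Queue()
        self.out_queue = out_queue if out_queue is not None else queue.Queue()

    def connect(self, component):
        # The next stage reads from the very queue this stage writes to.
        component.in_queue = self.out_queue
        return component


feeder = MiniPool("feeder")
parser = MiniPool("parser")
downloader = MiniPool("downloader")

# Returning the second pool allows the stages to be chained in one line.
feeder.connect(parser).connect(downloader)
```

After this call, anything put into `feeder.out_queue` is available from `parser.in_queue`, because both names refer to the same queue object.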