API reference
crawler
Crawler base class
-
class
icrawler.crawler.
Crawler
(feeder_cls=<class 'icrawler.feeder.Feeder'>, parser_cls=<class 'icrawler.parser.Parser'>, downloader_cls=<class 'icrawler.downloader.Downloader'>, feeder_threads=1, parser_threads=1, downloader_threads=1, storage={'backend': 'FileSystem', 'root_dir': 'images'}, log_level=20, extra_feeder_args=None, extra_parser_args=None, extra_downloader_args=None)[source]¶ Base class for crawlers
-
downloader
¶ A Downloader object.
Type: Downloader
-
logger
¶ A Logger object used for logging
Type: Logger
-
crawl
(feeder_kwargs=None, parser_kwargs=None, downloader_kwargs=None)[source]¶ Start crawling
This method starts the feeder, parser and downloader, then waits until all threads exit.
Parameters:
- feeder_kwargs (dict, optional) – Arguments to be passed to feeder.start()
- parser_kwargs (dict, optional) – Arguments to be passed to parser.start()
- downloader_kwargs (dict, optional) – Arguments to be passed to downloader.start()
-
init_signal
()[source]¶ Init signal
Three signals are added: feeder_exited, parser_exited and reach_max_num.
-
set_proxy_pool
(pool=None)[source]¶ Construct a proxy pool
By default no proxy is used.
Parameters: pool (ProxyPool, optional) – a ProxyPool object
-
set_session
(headers=None)[source]¶ Init session with default or custom headers
Parameters: headers – A dict of headers (default None, thus using the default header to init the session)
-
set_storage
(storage)[source]¶ Set storage backend for downloader
For the full list of supported storage backends, see storage.
Parameters: storage (dict or BaseStorage) – storage backend configuration or instance
-
feeder
-
class
icrawler.feeder.
Feeder
(thread_num, signal, session)[source]¶ Bases:
icrawler.utils.thread_pool.ThreadPool
Base class for feeder.
A thread pool of feeder threads, in charge of feeding urls to parsers.
-
thread_num
¶ An integer indicating the number of threads.
Type: int
-
out_queue
¶ A queue connected with parsers’ inputs, storing page urls.
Type: Queue
-
logger
¶ A logging.Logger object used for logging.
Type: Logger
-
workers
¶ A list storing all the threading.Thread objects of the feeder.
Type: list
-
lock
A Lock instance shared by all feeder threads.
Type: Lock
-
-
class
icrawler.feeder.
SimpleSEFeeder
(thread_num, signal, session)[source]¶ Bases:
icrawler.feeder.Feeder
Simple search engine like Feeder
-
feed
(url_template, keyword, offset, max_num, page_step)[source]¶ Feed urls once
Parameters: - url_template – A string with parameters replaced with “{}”.
- keyword – A string indicating the searching keyword.
- offset – An integer indicating the starting index.
- max_num – An integer indicating the max number of images to be crawled.
- page_step – An integer added to offset after each iteration.
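How the feed() parameters interact can be sketched in plain Python. This is an illustration only, not the library's code: the helper name `build_page_urls` is hypothetical, it returns a list instead of putting urls into a queue, and it assumes the template has two "{}" placeholders filled with the keyword and the current offset, in that order.

```python
def build_page_urls(url_template, keyword, offset, max_num, page_step):
    """Return the page urls a search-engine-style feeder might enqueue."""
    urls = []
    # Advance the offset by page_step until max_num items are covered.
    for start in range(offset, offset + max_num, page_step):
        urls.append(url_template.format(keyword, start))
    return urls


# Hypothetical search endpoint, used only for illustration.
urls = build_page_urls(
    "https://example.com/search?q={}&start={}", "cat", 0, 50, 20
)
```

With max_num=50 and page_step=20, the sketch produces one url for each of the offsets 0, 20 and 40.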
-
-
class
icrawler.feeder.
UrlListFeeder
(thread_num, signal, session)[source]¶ Bases:
icrawler.feeder.Feeder
URL list feeder, which feeds a list of urls.
parser
-
class
icrawler.parser.
Parser
(thread_num, signal, session)[source]¶ Bases:
icrawler.utils.thread_pool.ThreadPool
Base class for parser.
A thread pool of parser threads, in charge of downloading and parsing pages, extracting file urls and putting them into the input queue of the downloader.
-
global_signal
¶ A Signal object for cross-module communication.
-
session
¶ A requests.Session object.
-
logger
¶ A logging.Logger object used for logging.
-
threads
¶ A list storing all the threading.Thread objects of the parser.
-
thread_num
¶ An integer indicating the number of threads.
-
lock
¶ A threading.Lock object.
-
parse
(response, **kwargs) Parse a page and extract image urls, then put them into task_queue.
This method should be overridden by users.
Example:
>>> task = {}
>>> self.output(task)
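The extraction step that an overridden parse() performs might look like the following standalone sketch. It uses only the standard library; the names `ImgSrcCollector` and `extract_tasks` are hypothetical, and the `file_url` key is assumed to be the field the downloader reads from each task dict. In a real Parser subclass, each resulting dict would be handed to self.output(task) rather than returned.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags without a src attribute
                self.sources.append(src)


def extract_tasks(html_text):
    """Turn page HTML into a list of task dicts, one per image url."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    return [{"file_url": src} for src in collector.sources]


tasks = extract_tasks('<p><img src="http://example.com/a.jpg"><img alt="x"></p>')
```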
-
worker_exec
(queue_timeout=2, req_timeout=5, max_retry=3, **kwargs)[source]¶ Target method of workers.
First download the page, then call the parse() method. A parser thread will exit in either of the following cases:
- All feeder threads have exited and the url_queue is empty.
- The number of downloaded images has reached the required number.
Parameters:
- queue_timeout (int) – Timeout for getting urls from url_queue.
- req_timeout (int) – Timeout for requests made when downloading pages.
- max_retry (int) – Maximum number of retries if the request fails.
- **kwargs – Arguments to be passed to the parse() method.
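The two exit conditions of a parser thread can be illustrated with a small standard-library sketch. It is not the library's implementation: `threading.Event` objects stand in for the feeder_exited and reach_max_num signals, and appending to a list stands in for downloading the page and calling parse().

```python
import queue
import threading


def worker(url_queue, feeder_exited, reached_max, handled, queue_timeout=0.1):
    """Consume urls until one of the two exit conditions holds."""
    while True:
        if reached_max.is_set():
            break  # exit case: enough images have been downloaded
        try:
            url = url_queue.get(timeout=queue_timeout)
        except queue.Empty:
            if feeder_exited.is_set():
                break  # exit case: feeders are done and the queue is drained
            continue  # feeders still running, keep waiting
        handled.append(url)  # stand-in for download + parse()
        url_queue.task_done()


url_queue = queue.Queue()
for i in range(3):
    url_queue.put("http://example.com/page/{}".format(i))
feeder_exited = threading.Event()
reached_max = threading.Event()
handled = []

t = threading.Thread(
    target=worker, args=(url_queue, feeder_exited, reached_max, handled)
)
t.start()
feeder_exited.set()  # pretend all feeder threads have exited
t.join(timeout=5)
```

The queue timeout matters here: without it, a worker blocked on an empty queue would never get the chance to notice that the feeders have exited.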
-
downloader
-
class
icrawler.downloader.
Downloader
(thread_num, signal, session, storage)[source]¶ Bases:
icrawler.utils.thread_pool.ThreadPool
Base class for downloader.
A thread pool of downloader threads, in charge of downloading files and saving them in the corresponding paths.
-
task_queue
A queue storing image downloading tasks, connecting Parser and Downloader.
Type: CachedQueue
-
logger
¶ A logging.Logger object used for logging.
-
workers
¶ A list of downloader threads.
Type: list
-
thread_num
¶ The number of downloader threads.
Type: int
-
lock
¶ A threading.Lock object.
Type: Lock
-
storage
¶ storage backend.
Type: BaseStorage
-
download
(task, default_ext, timeout=5, max_retry=3, overwrite=False, **kwargs)[source]¶ Download the image and save it to the corresponding path.
Parameters:
- task (dict) – The task dict obtained from task_queue.
- timeout (int) – Timeout for requests made when downloading images.
- max_retry (int) – Maximum number of retries if the request fails.
- **kwargs – Reserved arguments for overriding.
-
get_filename
(task, default_ext)[source]¶ Set the path where the image will be saved.
The default strategy is to use an increasing 6-digit number as the filename. You can override this method if you want to set custom naming rules. The file extension is kept if it can be obtained from the url, otherwise default_ext is used as the extension.
Parameters: task (dict) – The task dict obtained from task_queue.
Output: Filename with extension.
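The default naming strategy described for get_filename can be sketched as a standalone function. This is a simplified illustration: the name `make_filename` and its explicit `file_idx` argument are hypothetical, and the real method may apply additional checks to the extension it finds.

```python
import os
from urllib.parse import urlparse


def make_filename(file_url, file_idx, default_ext):
    """Build a zero-padded 6-digit filename, keeping the url's extension."""
    path = urlparse(file_url).path  # drops the query string and fragment
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if not ext:
        ext = default_ext  # the url carries no usable extension
    return "{:06d}.{}".format(file_idx, ext)


name_with_ext = make_filename("http://example.com/a/b/photo.PNG?x=1", 7, "jpg")
name_fallback = make_filename("http://example.com/image", 12, "jpg")
```

A subclass that overrides get_filename would replace this logic with its own naming rule, for example one based on a field of the task dict.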
-
process_meta
(task)[source]¶ Process some meta data of the images.
This method should be overridden by users if wanting to do more things other than just downloading the image, such as saving annotations.
Parameters: task (dict) – The task dict obtained from task_queue. This method will make use of fields other than file_url in the dict.
-
reach_max_num
()[source]¶ Check if downloaded images reached max num.
Returns: Whether the number of downloaded images has reached max num.
Return type: bool
-
set_file_idx_offset
(file_idx_offset=0)[source]¶ Set offset of file index.
Parameters: file_idx_offset – Either an integer or ‘auto’. If set to an integer, filenames will start from file_idx_offset + 1. If set to ‘auto’, filenames will start from the existing max file index plus 1.
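The difference between an integer offset and 'auto' can be sketched as follows. The helper `resolve_offset` is hypothetical and takes a list of existing filenames instead of querying a storage backend; it assumes existing files follow the numeric naming scheme and ignores any that do not.

```python
import os


def resolve_offset(file_idx_offset, existing_names):
    """Return the offset such that the next file index is offset + 1."""
    if file_idx_offset == "auto":
        indices = []
        for name in existing_names:
            stem = os.path.splitext(name)[0]
            if stem.isdigit():  # ignore files that are not numbered
                indices.append(int(stem))
        return max(indices) if indices else 0
    if isinstance(file_idx_offset, int):
        return file_idx_offset
    raise ValueError('file_idx_offset must be an integer or "auto"')


auto_offset = resolve_offset("auto", ["000001.jpg", "000015.png", "notes.txt"])
fixed_offset = resolve_offset(100, [])
```

Under these assumptions, the next file written after `auto_offset` would be numbered 000016, so a resumed crawl does not overwrite earlier downloads.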
-
worker_exec
(max_num, default_ext='', queue_timeout=5, req_timeout=5, **kwargs)[source]¶ Target method of workers.
Get a task from task_queue, then download files and process meta data. A downloader thread will exit in either of the following cases:
- All parser threads have exited and the task_queue is empty.
- The number of downloaded images has reached the required number (max_num).
Parameters:
- queue_timeout (int) – Timeout for getting tasks from task_queue.
- req_timeout (int) – Timeout for requests made when downloading pages.
- **kwargs – Arguments passed to the download() method.
-
-
class
icrawler.downloader.
ImageDownloader
(thread_num, signal, session, storage)[source]¶ Bases:
icrawler.downloader.Downloader
Downloader specified for images.
-
get_filename
(task, default_ext)[source]¶ Set the path where the image will be saved.
The default strategy is to use an increasing 6-digit number as the filename. You can override this method if you want to set custom naming rules. The file extension is kept if it can be obtained from the url, otherwise default_ext is used as the extension.
Parameters: task (dict) – The task dict obtained from task_queue.
Output: Filename with extension.
-
keep_file
(task, response, min_size=None, max_size=None)[source]¶ Decide whether to keep the image
Compare image size with min_size and max_size to decide.
Parameters:
- response (Response) – response of requests.
- min_size (tuple or None) – minimum size of required images.
- max_size (tuple or None) – maximum size of required images.
Returns: whether to keep the image.
Return type: bool
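The size comparison behind keep_file might be sketched like this. It is an assumption-laden illustration: the function `size_within` is hypothetical, sizes are taken to be (width, height) tuples, and reading the actual image size from the response content (which needs an imaging library) is left out.

```python
def size_within(img_size, min_size=None, max_size=None):
    """Return True if (width, height) lies inside both optional bounds."""
    width, height = img_size
    # A bound of None means "no limit on this side".
    if min_size is not None and (width < min_size[0] or height < min_size[1]):
        return False
    if max_size is not None and (width > max_size[0] or height > max_size[1]):
        return False
    return True
```

For instance, `size_within((640, 480), min_size=(200, 200))` accepts the image, while `size_within((640, 480), max_size=(600, 600))` rejects it because the width exceeds the limit.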
-
worker_exec
(max_num, default_ext='jpg', queue_timeout=5, req_timeout=5, **kwargs)[source]¶ Target method of workers.
Get a task from task_queue, then download files and process meta data. A downloader thread will exit in either of the following cases:
- All parser threads have exited and the task_queue is empty.
- The number of downloaded images has reached the required number (max_num).
Parameters:
- queue_timeout (int) – Timeout for getting tasks from task_queue.
- req_timeout (int) – Timeout for requests made when downloading pages.
- **kwargs – Arguments passed to the download() method.
-
storage
-
class
icrawler.storage.
BaseStorage
[source]¶ Bases:
object
Base class of backend storage
-
class
icrawler.storage.
FileSystem
(root_dir)[source]¶ Bases:
icrawler.storage.base.BaseStorage
Use filesystem as storage backend.
The id is filename and data is stored as text files or binary files.
-
class
icrawler.storage.
GoogleStorage
(root_dir)[source]¶ Bases:
icrawler.storage.base.BaseStorage
Google Storage backend.
The id is filename and data is stored as text files or binary files. The root_dir is the bucket address such as gs://<your_bucket>/<your_directory>.
utils
-
class
icrawler.utils.
CachedQueue
(*args, **kwargs) Bases: Queue.Queue, object
Queue with cache
This queue is used in ThreadPool. It enables the parser and downloader to check whether a page url or task has been seen or processed before.
-
_cache
¶ cache, elements are stored as keys of it.
Type: OrderedDict
-
cache_capacity
¶ maximum size of cache.
Type: int
-
is_duplicated
(item)[source]¶ Check whether the item has been in the cache
If the item has not been seen before, it is hashed and put into the cache; otherwise the item is reported as duplicated. When the cache size exceeds its capacity, the earliest items in the cache are discarded.
Parameters: item (object) – The item to be checked and stored in cache. It must be immutable or a list/dict.
Returns: Whether the item has been in cache.
Return type: bool
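The caching behaviour described for is_duplicated can be sketched with an OrderedDict. The class `SeenCache` is a hypothetical stand-in, not the library's CachedQueue, and serialising lists and dicts with `json.dumps` is one possible way to make them hashable, assumed here for illustration.

```python
import json
from collections import OrderedDict


class SeenCache:
    """Bounded record of items that have already been seen."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._cache = OrderedDict()

    def is_duplicated(self, item):
        # Lists and dicts are unhashable, so serialise them to a stable string.
        if isinstance(item, (list, dict)):
            key = json.dumps(item, sort_keys=True)
        else:
            key = item
        if key in self._cache:
            return True
        self._cache[key] = True
        # Evict the oldest entries once the cache grows past its capacity.
        while len(self._cache) > self.capacity:
            self._cache.popitem(last=False)
        return False
```

Because old entries are evicted, an item seen long ago can be reported as new again; the capacity trades memory for how far back duplicates are detected.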
-
-
class
icrawler.utils.
Proxy
(addr=None, protocol='http', weight=1.0, last_checked=None)[source]¶ Bases:
object
Proxy class
-
addr
¶ A string with IP and port, for example ‘123.123.123.123:8080’
Type: str
-
protocol
¶ ‘http’ or ‘https’
Type: str
-
weight
A floating-point number indicating the probability of being selected. The weight is based on the connection time and stability.
Type: float
-
last_checked
¶ A UNIX timestamp indicating when the proxy was checked
Type: time
-
-
class
icrawler.utils.
ProxyPool
(filename=None)[source]¶ Bases:
object
Proxy pool class
ProxyPool provides friendly apis to manage proxies.
-
idx
¶ Index for http proxy list and https proxy list.
Type: dict
-
test_url
A dict containing two urls. When testing whether a proxy is valid, test_url[‘http’] or test_url[‘https’] is used according to the protocol.
Type: dict
-
proxies
¶ All the http and https proxies.
Type: dict
-
addr_list
¶ Address of proxies.
Type: dict
-
dec_ratio
¶ When decreasing the weight of some proxy, its weight is multiplied with dec_ratio.
Type: float
-
inc_ratio
Similar to dec_ratio but used for increasing weights. Defaults to the reciprocal of dec_ratio.
Type: float
-
weight_thr
The minimum weight of a valid proxy. If the weight of a proxy falls below weight_thr, the proxy is removed.
Type: float
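How dec_ratio, inc_ratio and weight_thr work together can be sketched as a single function. The name `adjust_weight`, the default values 0.9 and 0.2, and the cap at 1.0 are all assumptions made for this illustration rather than documented behaviour.

```python
def adjust_weight(weight, success, dec_ratio=0.9, inc_ratio=None, weight_thr=0.2):
    """Return the new weight, or None if the proxy should be removed."""
    if inc_ratio is None:
        inc_ratio = 1 / dec_ratio  # reciprocal default, as described above
    new_weight = weight * (inc_ratio if success else dec_ratio)
    new_weight = min(new_weight, 1.0)  # assumption: weights are capped at 1.0
    # A proxy whose weight drops below the threshold is no longer valid.
    return new_weight if new_weight >= weight_thr else None
```

With these illustrative values, a proxy at weight 0.5 drops to about 0.45 after one failure, and a proxy at 0.21 falls below the threshold and is dropped.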
-
logger
¶ A logging.Logger object used for logging.
Type: Logger
-
add_proxy
(proxy)[source]¶ Add a valid proxy into pool
You must call the add_proxy method to add a proxy to the pool instead of operating on the proxies variable directly.
-
default_scan
(region='mainland', expected_num=20, val_thr_num=4, queue_timeout=3, val_timeout=5, out_file='proxies.json', src_files=None)[source]¶ Default scan method, to simplify the usage of scan method.
It registers the following scan functions:
1. scan_file
2. scan_cnproxy (if region is mainland)
3. scan_free_proxy_list (if region is overseas)
4. scan_ip84
5. scan_mimiip
After scanning, all the proxy info will be saved in out_file.
Parameters: - region – Either ‘mainland’ or ‘overseas’
- expected_num – An integer indicating the expected number of proxies. If this argument is set too large, the scanning process may take a long time to finish.
- val_thr_num – Number of threads used for validating proxies.
- queue_timeout – An integer indicating the timeout for getting a candidate proxy from the queue.
- val_timeout – An integer indicating the timeout when connecting the test url using a candidate proxy.
- out_file – the file name of the output file saving all the proxy info
- src_files – A list of file names to scan
-
get_next
(protocol='http', format=False, policy='loop')[source]¶ Get the next proxy
Parameters: - protocol (str) – ‘http’ or ‘https’. (default ‘http’)
- format (bool) – Whether to format the proxy. (default False)
- policy (str) – Either ‘loop’ or ‘random’, indicating the policy of getting the next proxy. If set to ‘loop’, will return proxies in turn, otherwise will return a proxy randomly.
Returns: If format is true, the formatted proxy, which is compatible with requests.Session parameters; otherwise a Proxy object.
Return type: Proxy or dict
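The 'loop' and 'random' policies, and the formatted return value, can be illustrated with a simplified stand-in. The class `ProxyRing` is hypothetical: it stores plain address strings in a single list, ignores weights, and assumes the formatted result is the `{protocol: url}` mapping that requests accepts as its `proxies` argument.

```python
import random


class ProxyRing:
    """Hand out proxy addresses in turn ('loop') or at random ('random')."""

    def __init__(self, addrs):
        self.addrs = list(addrs)
        self.idx = 0

    def get_next(self, protocol="http", format=False, policy="loop"):
        if policy == "loop":
            addr = self.addrs[self.idx]
            # Wrap around so the proxies are returned in turn.
            self.idx = (self.idx + 1) % len(self.addrs)
        else:
            addr = random.choice(self.addrs)
        if format:
            # Mapping shape accepted by the `proxies` argument of requests.
            return {protocol: "{}://{}".format(protocol, addr)}
        return addr


ring = ProxyRing(["1.1.1.1:80", "2.2.2.2:8080"])
first = ring.get_next()
second = ring.get_next()
third = ring.get_next()
formatted = ring.get_next(format=True)
```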
-
is_valid
(addr, protocol='http', timeout=5)[source]¶ Check if a proxy is valid
Parameters: - addr – A string in the form of ‘ip:port’
- protocol – Either ‘http’ or ‘https’, different test urls will be used according to protocol.
- timeout – An integer indicating the timeout of connecting the test url.
Returns: If the proxy is valid, returns {‘valid’: True, ‘response_time’: xx}; otherwise returns {‘valid’: False, ‘msg’: ‘xxxxxx’}.
Return type: dict
-
proxy_num
(protocol=None)[source]¶ Get the number of proxies in the pool
Parameters: protocol (str, optional) – ‘http’ or ‘https’ or None. (default None)
Returns: If protocol is None, return the total number of proxies, otherwise, return the number of proxies of corresponding protocol.
-
scan
(proxy_scanner, expected_num=20, val_thr_num=4, queue_timeout=3, val_timeout=5, out_file='proxies.json')[source]¶ Scan and validate proxies
First, call the scan method of proxy_scanner, then use multiple threads to validate the scanned proxies.
Parameters: - proxy_scanner – A ProxyScanner object.
- expected_num – Max number of valid proxies to be scanned.
- val_thr_num – Number of threads used for validating proxies.
- queue_timeout – Timeout for getting a proxy from the queue.
- val_timeout – An integer passed to is_valid as argument timeout.
- out_file – A string or None. If not None, the proxies will be saved into out_file.
-
validate
(proxy_scanner, expected_num=20, queue_timeout=3, val_timeout=5)[source]¶ Target function of validation threads
Parameters: - proxy_scanner – A ProxyScanner object.
- expected_num – Max number of valid proxies to be scanned.
- queue_timeout – Timeout for getting a proxy from the queue.
- val_timeout – An integer passed to is_valid as argument timeout.
-
-
class
icrawler.utils.
ProxyScanner
[source]¶ Proxy scanner class
ProxyScanner focuses on scanning proxy lists from different sources.
-
proxy_queue
¶ The queue for storing proxies.
-
scan_funcs
¶ Name of functions to be used in scan method.
-
scan_kwargs
¶ Arguments of functions
-
scan_threads
A list of threading.Thread objects.
-
logger
¶ A logging.Logger object used for logging.
-
register_func
(func_name, func_kwargs)[source]¶ Register a scan function
Parameters: - func_name – The function name of a scan function.
- func_kwargs – A dict containing arguments of the scan function.
-
scan_cnproxy
()[source]¶ Scan candidate (mainland) proxies from http://cn-proxy.com
-
scan_free_proxy_list
()[source]¶ Scan candidate (overseas) proxies from http://free-proxy-list.net
-
scan_ip84
(region='mainland', page=1)[source]¶ Scan candidate proxies from http://ip84.com
Parameters: - region – Either ‘mainland’ or ‘overseas’.
- page – An integer indicating how many pages to be scanned.
-
scan_mimiip
(region='mainland', page=1)[source]¶ Scan candidate proxies from http://mimiip.com
Parameters: - region – Either ‘mainland’ or ‘overseas’.
- page – An integer indicating how many pages to be scanned.
-
-
class
icrawler.utils.
Session
(proxy_pool)[source]¶ Bases:
requests.sessions.Session
-
get
(url, **kwargs) Sends a GET request. Returns a Response object.
Parameters:
- url – URL for the new Request object.
- **kwargs – Optional arguments that request takes.
Return type: requests.Response
-
post
(url, data=None, json=None, **kwargs) Sends a POST request. Returns a Response object.
Parameters:
- url – URL for the new Request object.
- data – (optional) Dictionary, list of tuples, bytes, or file-like object to send in the body of the Request.
- json – (optional) json to send in the body of the Request.
- **kwargs – Optional arguments that request takes.
Return type: requests.Response
-
-
class
icrawler.utils.
Signal
[source]¶ Bases:
object
Signal class
Provides interfaces for setting and getting globally shared variables (signals).
-
signals
¶ A dict of all signal names and values.
-
init_status
¶ The initial values of all signals.
-
-
class
icrawler.utils.
ThreadPool
(thread_num, in_queue=None, out_queue=None, name=None)[source]¶ Bases:
object
Simple implementation of a thread pool
This is the base class of Feeder, Parser and Downloader. It incorporates two FIFO queues and a number of “workers”, namely threads. All threads share the two queues. After a thread starts, it watches the in_queue; once the queue is not empty, it gets a task from the queue, processes it, and puts the output into out_queue.
Note
This class is not designed as a generic thread pool, but works specifically for crawler components.
-
name
¶ thread pool name.
Type: str
-
thread_num
¶ number of available threads.
Type: int
-
in_queue
¶ input queue of tasks.
Type: Queue
-
out_queue
¶ output queue of finished tasks.
Type: Queue
-
workers
¶ a list of working threads.
Type: list
-
lock
¶ thread lock.
Type: Lock
-
logger
¶ standard python logger.
Type: Logger
-
connect
(component)[source]¶ Connect two ThreadPools.
The in_queue of the second pool will be set to the out_queue of the current pool, so all output of the current pool becomes input to the second pool.
Parameters: component (ThreadPool) – the ThreadPool to be connected.
Returns: the modified second ThreadPool.
Return type: ThreadPool
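The queue sharing that connect() sets up can be shown with a toy class. `MiniPool` is a hypothetical stand-in without any worker threads; it only demonstrates that the two pools end up holding the same queue object, and that returning the second pool allows chaining.

```python
import queue


class MiniPool:
    """Toy component holding an input queue and an output queue."""

    def __init__(self, name):
        self.name = name
        self.in_queue = queue.Queue()
        self.out_queue = queue.Queue()

    def connect(self, component):
        # Share one queue object: our output is the next component's input.
        component.in_queue = self.out_queue
        return component


feeder = MiniPool("feeder")
parser = MiniPool("parser")
downloader = MiniPool("downloader")
# connect() returns the second pool, so calls can be chained.
feeder.connect(parser).connect(downloader)
feeder.out_queue.put("http://example.com/page/1")
```

After these calls, anything the feeder puts into its out_queue is immediately visible on the parser's in_queue, since both names refer to the same queue.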
-