How to use Python Scrapy to scrape a website, with examples
In this post, we will learn how to use Python Scrapy by scraping the This Week In Rust website.
You can use another website you want instead.
I will assume you already have experience with Python.
It will also be helpful to spend a few hours reading the Scrapy documentation.
Table of Contents
- Set up a Python development environment
- Inspect the website
- Write Python Scrapy code
You can skip 1. if you already have a Scrapy development environment ready.
1. Set up a Python development environment
We will start by setting up a Scrapy development environment with a virtual environment and pip. Use this command.
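For example, with Python 3's built-in venv module (virtualenv works the same way if you prefer it):

```shell
# Create an isolated Python environment in a directory named "scrapy".
python3 -m venv scrapy
```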
It will create a structure similar to this on your machine, in a directory named scrapy.
We don't have to care about the other files; our only interest is the bin/activate file used by virtualenv. We should activate the Python development environment with it.
You will have more Scrapy projects later, and making an alias for activating the environment will save you time. Open your ~/.bashrc file.
Then, include code similar to this.
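A sketch of the alias; the path below is only an example and should point at wherever you created the environment:

```shell
# ~/.bashrc — activate the Scrapy virtualenv with a single command.
# Replace the path with the one on your machine (find it with $pwd).
alias usescrapy="source /home/youraccount/Desktop/code/scrapy/bin/activate"
```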
You should replace /home/youraccount/Desktop/code/ with the equivalent path on your machine, which you can find with the $pwd command. Then, run $source ~/.bashrc, and you can activate this Python development environment with $usescrapy whenever you want.
Type $usescrapy and then $pip install ipython scrapy. It will install the minimal dependencies needed to use Python Scrapy.
If you want to reuse the exact same packages later, use these commands.
- $pip freeze > requirements.txt to extract the list of them.
- $pip install -r requirements.txt to install them later.
2. Inspect the website
I hope you have already visited This Week In Rust or the other website you want to crawl.
Refer to the process used here, along with the official Scrapy Tutorial, and apply it later to a website you want to scrape.
I will assume you already know how to use your browser's inspector and are familiar with CSS and HTML.
The purpose of This Week In Rust is to give you useful links relevant to Rust for each week.
Its homepage lists links to the recent issues.
When you visit each of them, you will see lists of links for blog posts, crates (packages in Rust), calls for participation, events, jobs, etc.
Go back to its homepage, open your browser inspector with CTRL+SHIFT+I, and look at how its HTML is structured. You can see that it is just a simple static website with a CSS framework.
Inspect This Week In Rust
Collecting those links to follow will be our main job on this page. They will be the entry points to the pages with the target information that we will scrape.
Visit one of them. When you inspect the jobs part, and the other parts you want to scrape, you will see that they are structured similar to this.
Our main target is the href attribute, which holds the job links, together with the a tags' text, which holds the job titles. Each a tag is wrapped in an li, and their parent element is a ul.
You can also see that each ul is preceded by an h1 or h2 tag with an id. Knowing how the HTML tags are organized around the data we want to scrape will help you test the Scrapy code we will write in the next part.
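As an illustration, the markup looks roughly like this (a simplified sketch based on the structure described above, not the site's exact HTML):

```html
<h2 id="rust-jobs">Rust Jobs</h2>
<ul>
  <li><a href="https://example.com/rust-job">Rust Engineer at Example Corp</a></li>
  <li><a href="https://example.com/another-rust-job">Backend Developer (Rust)</a></li>
</ul>
```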
3. Write Python Scrapy code
We set up the development environment and gathered the information we need in the previous parts. What is left is to write the Python code for Scrapy.
Before that, use the shell command from the Scrapy CLI, for example $scrapy shell https://this-week-in-rust.org/, to test how the Scrapy program will see the webpage.
Use another website you want to scrape if you have one. Then, the console will enter IPython mode with information similar to this.
With This Week In Rust, there will be no problem because it is just a normal static website.
You can play in the Scrapy shell with request, response, etc. For example, try response.body or response.css("title::text").get(). Then, exit it with quit() and start your Scrapy project.
Use $scrapy startproject notification rust.
It will automatically generate a Scrapy project in a folder named rust with the project name notification, and will show a message similar to this in your console.
You can use $scrapy startproject -h for more information.
Follow the instruction.
Then, use a command similar to $scrapy genspider this_week_in_rust this-week-in-rust.org/.
It should have created spiders/this_week_in_rust.py on your machine. Then, we will write the code for the spider (this_week_in_rust.py).
Edit it similar to this.
We just converted the information we gathered in the previous part into Python code with Scrapy.
1. We extract the publication page links to follow with CSS selectors. div.custom-xs-text-left helps select the href part of the a tags.
We extract all the links to follow, so we use getall().
Then, we define what to do with them in the parse_post_and_jobs callback function.
2. This is the payload of all these processes. We extract the date of the publication, the total number of jobs, the titles, and the other important data about the Rust jobs to make the information useful. Then, we turn it into JSON format with the Python API.
You can see the pattern: only the id part, such as #news-blog-posts or #rust-jobs, differs, while the rest repeats.
You can easily include events, calls for participation, etc. from the website if you want to scrape other parts.
3. We return the data we want to use here.
Your code will be different from this if you used another website, but the main process to find what you want will be similar.
- Get the links to follow to reach the payload webpages.
- Extract the information you want from each page.
Test that it works with $scrapy crawl this_week_in_rust -o rust_jobs.json.
Then, you can verify that the result has a structure similar to this.
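For example, with illustrative field names and values (yours will depend on what your spider yields):

```json
[
  {
    "date": "2020-06-10",
    "total_jobs": 1,
    "jobs": [
      {"title": "Rust Engineer at Example Corp", "link": "https://example.com/rust-job"}
    ]
  }
]
```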
It may not be ordered well by date. Therefore, make a Python file similar to this if you want.
Use it with $python sort_json.py rust_jobs.json and it will order the JSON file by date.
You can comment out or remove sort_json.py from your Scrapy project if you do not need it when you use this project later.
In this post, we learned how to use Python Scrapy. If you followed this post well, all you will need later is to use $scrapy genspider and edit the Python file (spider) it generates.
I hope this post is helpful for other Rust developers who wait for This Week In Rust every week, and also for people who want to learn Python Scrapy.