How can I iterate through the pages of a website using Python? - python

I'm new to software development, and I'm not sure how to go about this. I want to visit every page of a website and grab a specific bit of data from each one. My problem is that I don't know how to iterate through all of the existing pages without knowing the individual URLs ahead of time. For example, I want to visit every page whose URL starts with
"http://stackoverflow.com/questions/"
Is there a way to compile a list and then iterate through it, or is it possible to do this without creating a giant list of URLs?

Try Scrapy.
It handles all of the crawling for you and lets you focus on processing the data, not extracting it. Instead of copy-pasting the code already in the tutorial, I'll leave it to you to read it.

To grab a specific bit of data from a website you could use a web-scraping tool, e.g. Scrapy.
If the required data is generated by JavaScript, then you might need a browser-automation tool such as Selenium WebDriver and to implement the crawling of the links by hand.

For example, you can build the URLs with a simple for loop, like this:
def web_iterate():
    base_link = "http://stackoverflow.com/questions/"
    for i in range(24):
        print(base_link + str(i))

web_iterate()
The output will be:
http://stackoverflow.com/questions/0
http://stackoverflow.com/questions/1
http://stackoverflow.com/questions/2
...
http://stackoverflow.com/questions/23
It's just an example: pass in the number of questions and do with the URLs whatever you want.

Related

Web scraping for dummies (or not)

GOAL
Extract data from a web page... automatically.
The data are on this page... Careful, it's in French...
MY HARD WAY, manually
I choose the data I want by clicking the desired fields on the left side ('CHOISIR DES INDICATEURS' = choose indicators).
Then I select 'Tableau' (= table) to get a data table.
Then I click 'Action' on the right side, then 'Exporter' (= export).
I choose the format I want (e.g. CSV) and hit 'Executer' (= execute) to download the file.
WHAT I TRIED
I tried to automate this process, but it feels like an impossible task for me. I inspected the page's network exchanges to see whether there is an underlying server I could send a simple JSON request to.
I mainly work with python and frameworks like BS4 or scrapy.
I have only a little data to extract, so I can easily do it manually. This question is thus purely for my own knowledge, to see whether it is possible to scrape a page like that.
I would appreciate it if you could share your skills!
Thank you,
It is possible. Check this tutorial for details; it walks through scraping a website with a worked example:
https://realpython.com/beautiful-soup-web-scraper-python/#scraping-the-monster-job-site
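If the export request can't be replayed directly, a fallback is to parse whatever HTML table the page renders. Here is a minimal standard-library sketch of that idea (the tutorial above uses Beautiful Soup instead, and the sample HTML below is a placeholder, not the actual French site's markup):

```python
from html.parser import HTMLParser

class TableCellParser(HTMLParser):
    """Collects the text of every <td> cell; a bare-bones stand-in
    for the richer selectors Beautiful Soup provides."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # keep only non-empty text that appears inside a cell
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

parser = TableCellParser()
parser.feed("<table><tr><td>2019</td><td>41.3</td></tr></table>")
print(parser.cells)  # ['2019', '41.3']
```

For a real page you would feed the parser the downloaded HTML and then group `cells` back into rows yourself.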

Python script that utilizes a single search bar out of several on a website

I have a list of 230 crystal structure space groups (strings). I'd like to write a python script to extract files, for each group, from http://rruff.geo.arizona.edu/AMS/amcsd.php.
I'd like the script to iteratively search for each space group in the "Cell Parameters and Symmetry" search option, and then download one of the files for some structure (say, the first one).
An example of my list looks something like spaceGroups = ["A-1","A2","A2/a","A2/m","..."]. The search format for, say, group 1 looks like sg=A-1, and the results look like http://rruff.geo.arizona.edu/AMS/result.php.
First I'd like to know if this is even possible, and if so, where to start?
Sure, it's possible. The "clean" way is to create a crawler to make requests, download and save the files.
You can use scrapy (https://docs.scrapy.org/en/latest/) for the crawler and Fiddler (https://www.telerik.com/fiddler) to see what requests you need to recreate inside your spider.
In essence, you will use your list of space groups to generate requests to the form on that page. After each request you will parse the response, collect the IDs/download URLs, and follow on to subsequent pages (to collect all IDs/download URLs). Finally, you will download the files.
If you don't want to use scrapy you can make your own logic with requests (https://requests.readthedocs.io/en/latest/user/quickstart/), but scrapy would download everything faster and has a lot of features to help you.
Perusing that page, it seems you only need the IDs for each crystal; the actual download URLs are simple.
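A sketch of the first step, building one search request per space group with only the standard library. The sg parameter comes from the example URL in the question; any additional form fields the site requires are assumptions you would confirm in Fiddler or the browser's network tab:

```python
from urllib.parse import urlencode

BASE = "http://rruff.geo.arizona.edu/AMS/result.php"

space_groups = ["A-1", "A2", "A2/a", "A2/m"]

def search_url(space_group):
    # 'sg' is taken from the question's example (sg=A-1); the site may
    # expect more fields, which the recorded requests will reveal
    return BASE + "?" + urlencode({"sg": space_group})

for sg in space_groups:
    print(search_url(sg))
```

Each printed URL can then be fetched (with urllib, requests, or a Scrapy spider) and its response parsed for the IDs/download links.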

How can I exclude the list of web-pages from google-search results?

"minus" sign doesn't fit because the list consists of ~2000 entries.
I'm just beginner in python, so, please explain as to 5-year old, if possible
Thank you much in advance!
Presumably you are fetching the Google search results from a Python program. So you can exclude the web pages in your list in your Python program as you read the results, instead of trying to make Google do it for you. You can use a functional programming technique like calling filter for this.
Ideally you would do this by comparing the URLs of the links, but if you were willing to sacrifice accuracy you could do it by comparing the titles of the links instead, if you only had titles in your list and not URLs. But URLs are definitely better for this purpose.
So you could parse the Google search results using a library like Beautiful Soup, extract the URLs of the links, and filter out (using filter) the ones equal to any of the URLs on your list (you could define a function with def that checks whether a given URL is on your list). Be careful, though: sometimes Google search-result links go via a Google redirect URL, used for ranking purposes.
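A minimal sketch of that filtering step, with placeholder URLs standing in for the parsed search results and for the ~2000-entry exclusion list:

```python
# placeholder data: in practice these come from your exclusion list
# and from parsing the search results page
excluded = {
    "https://example.com/spam",
    "https://example.com/ads",
}

def is_allowed(url):
    """Return True if the URL is not on the exclusion list."""
    return url not in excluded

results = [
    "https://example.com/good",
    "https://example.com/spam",
    "https://example.org/other",
]

kept = list(filter(is_allowed, results))
print(kept)  # ['https://example.com/good', 'https://example.org/other']
```

Using a set for the exclusion list keeps each membership check fast even with thousands of entries.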

How to scrape a website and all its directories from the one link?

Sorry if this is not a valid question; I personally feel it borders on the edge.
Assuming the website involved has given full permission
How could I download the ENTIRE contents (HTML) of that website using a Python data scraper? By entire contents I mean not only the current page you are on, but any other directory that branches off that main website. E.g.
Using the link:
https://www.dogs.com
could I pull info from:
https://www.dogs.com/about-us
and any other directory attached to the "https://www.dogs.com/"
(I have no idea if dogs.com is a real website or not, it's just an example)
I have already made a scraper that pulls info from a single given link (nothing further than that), but I want to improve it so I don't have to supply heaps of links. I understand I can use an API, but if this is possible I would rather do it this way. Cheers!
While there is Scrapy to do it professionally, you can use requests to fetch the URL's data and bs4 to parse the HTML and look into it. It's also easier for a beginner, I guess.
Whichever way you go, you need a starting point; then you just follow the links in that page, and then the links within those pages.
You might need to check whether each URL links to another website or is still within the targeted website. Find the pages one by one and scrape them.
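A standard-library sketch of the two pieces described above: collecting the links on a page, and checking that a link stays on the target site (dogs.com here is the question's placeholder, and the HTML is a made-up sample):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Gathers absolute URLs from every <a href=...> on a page."""
    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # resolve relative links like /about-us against the page URL
                    self.links.append(urljoin(self.page_url, value))

def same_site(url, root):
    """True if url points at the same host as the site being crawled."""
    return urlparse(url).netloc == urlparse(root).netloc

collector = LinkCollector("https://www.dogs.com")
collector.feed('<a href="/about-us">About</a> <a href="https://other.com/x">Out</a>')
internal = [u for u in collector.links if same_site(u, "https://www.dogs.com")]
print(internal)  # ['https://www.dogs.com/about-us']
```

A full crawler would fetch each URL in `internal`, feed the HTML back through a new collector, and keep a visited set so no page is downloaded twice.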

How to extract all the url's from a website?

I am writing a program in Python to extract all the URLs from a given website: all the URLs from the site, not just from a single page.
Since I suppose I'm not the first one who wants to do this, I was wondering whether there is a ready-made solution or whether I have to write the code myself.
It's not going to be easy, but a decent starting point would be to look into these two libraries:
urllib
BeautifulSoup
I didn't see any ready-made scripts that do this in a quick Google search.
Using the scrapy framework makes this almost trivial.
The time-consuming part would be learning how to use Scrapy. Their tutorials are great, though, and shouldn't take you that long.
http://doc.scrapy.org/en/latest/intro/tutorial.html
Creating a solution that others can use is one of the joys of being part of a programming community. If a scraper doesn't exist, you can create one that everyone can use to get all the links from a site!
The given answers are what I would have suggested (+1).
But if you really want to do something quick and simple, and you're on a *NIX platform, try this:
lynx -dump YOUR_URL | grep http
where YOUR_URL is the URL you want to check. This should get you all the links you want (except links that are not fully written out).
You first have to download the page's HTML content using a package like urllib or requests.
After that, you can use Beautiful Soup to extract the URLs. In fact, their tutorial shows how to extract all links enclosed in <a> elements as a specific example:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
If you also want to find links not enclosed in <a> elements, you may have to write something more complex on your own.
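For that case, a crude regex over the raw HTML can pick up http(s) URLs wherever they appear, at the cost of some false positives; the pattern here is a simple heuristic, not a full URL grammar:

```python
import re

# stops at whitespace, quotes and angle brackets -- good enough for a first pass
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def find_bare_urls(html):
    """Return every http(s) URL found anywhere in the raw HTML text."""
    return URL_RE.findall(html)

sample = 'plain text http://example.com/elsie and <a href="http://example.com/lacie">'
print(find_bare_urls(sample))  # ['http://example.com/elsie', 'http://example.com/lacie']
```

This will also match URLs inside attributes and scripts, so you may want to deduplicate against the `<a>`-based results.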
EDIT: I also just came across two Scrapy link extractor classes that were created specifically for this task:
http://doc.scrapy.org/en/latest/topics/link-extractors.html
