How to get simple information through a crawler - Python

I am trying to make a simple crawler that scrapes this page, https://en.wikipedia.org/wiki/Web_scraping, and then extracts the 19 links from its "See also" section. That part works, but I am also trying to extract the first paragraph from each of those 19 links, and this is where it stops working: I get the same paragraph from the first page instead of one from each link. This is what I have so far. I know there might be better options for doing this, but I want to stick to BeautifulSoup and simple Python code.
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Web_scraping'
data = requests.get('https://en.wikipedia.org/wiki/Web_scraping').text
soup = BeautifulSoup(data, 'html.parser')

def visit():
    try:
        p = soup.p
        print(p.get_text())
    except AttributeError:
        print('<p> Tag was not found')

links_todo = []

links = soup.find('div', {'class': 'div-col'}).find_all('a')
for link in links:
    if 'href' in link.attrs:
        links_todo.append(urljoin(url, link.attrs['href']))

while links_todo:
    url_to_visit = links_todo.pop()
    print('Now visiting:', url_to_visit)
    visit()
Example of the first print:
Now visiting: https://en.wikipedia.org/wiki/OpenSocial
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
The intended behavior is that it prints the first paragraph for every new link it visits, not the same paragraph from the first page each time. What do I need to do to fix this, or any tips on what I am missing? I am fairly new to Python, so I am still learning the concepts as I work on things.

At the top of your code you define data and soup. Both are tied to https://en.wikipedia.org/wiki/Web_scraping.
Every time you call visit(), you print from soup, and soup never changes.
You need to pass the url to visit(), e.g. visit(url_to_visit). The visit function should accept the url as an argument, then visit the page using requests, and create a new soup from the returned data, then print the first paragraph.
Edited to add code explaining my original answer:
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

start_url = 'https://en.wikipedia.org/wiki/Web_scraping'
# Renamed this to start_url to make it clear that this is the source page
data = requests.get(start_url).text
soup = BeautifulSoup(data, 'html.parser')

def visit(new_url):  # function now accepts a url as an argument
    try:
        new_data = requests.get(new_url).text  # retrieve the html from the new url
        new_soup = BeautifulSoup(new_data, 'html.parser')  # parse the retrieved html
        p = new_soup.p
        print(p.get_text())
    except AttributeError:
        print('<p> Tag was not found')

links_todo = []

links = soup.find('div', {'class': 'div-col'}).find_all('a')
for link in links:
    if 'href' in link.attrs:
        links_todo.append(urljoin(start_url, link.attrs['href']))

while links_todo:
    url_to_visit = links_todo.pop()
    print('Now visiting:', url_to_visit)
    visit(url_to_visit)  # here's where we pass each url to the visit() function

Related

Web scraping in Python (BeautifulSoup)

I am trying to web scrape and am currently stuck on how I should continue with the code. I am trying to write a script that scrapes the first 80 Yelp! reviews. Since there are only 20 reviews per page, I am also stuck on figuring out how to create a loop to change the webpage to the next 20 reviews.
from bs4 import BeautifulSoup
import requests
import time

all_reviews = ''

def get_description(pullman):
    url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
    # get webpage data from url
    response = requests.get(url)
    # sleep for 2 seconds
    time.sleep(2)
    # get html document from web page data
    html_doc = response.text
    # parser
    soup = BeautifulSoup(html_doc, "lxml")
    page_title = soup.title.text
    # get a tag's content based on class
    p_tag = soup.find_all('p', class_='lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_')[0]
    # return the text within the tag
    return p_tag.text
General notes/tips:
Use the "Inspect" tool on pages you want to scrape.
As for your question, it's also going to work much nicer if you fetch the website once, parse it with BeautifulSoup, and then pass the soup object into your functions - visit once, parse as many times as you want. You won't be blacklisted by websites as often this way. An example structure is below.
url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
# get webpage data from url
response = requests.get(url)
#sleep for 2 seconds
time.sleep(2)
# get html document from web page data
html_doc = response.text
# parser
soup = BeautifulSoup(html_doc, "lxml")
get_description(soup)
get_reviews(soup)
If you inspect the page, each review appears as a copy of a template. If you take each review as an individual object and parse it, you can get the reviews you are looking for. The review template has the class: lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT
As for pagination, the pagination numbers are contained in a template with class="lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 border-color--default__373c0__2oFDT nowrap__373c0__1_N1j"
The individual page number links are contained within <a href> tags, so just write a for loop to iterate over the links.
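A minimal sketch of that approach with requests and BeautifulSoup, assuming the generated class strings above are still what Yelp serves (they change frequently, so re-check them in the inspector):

import requests
from bs4 import BeautifulSoup

url = 'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# each review is an <li> stamped from the template; class string copied from above
review_class = 'lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT'
for review in soup.find_all('li', class_=review_class):
    comment = review.find('p')  # the comment text sits in a <p> inside each review
    if comment:
        print(comment.text)

# the pagination container holds the page-number links; class string copied from above
pagination_class = 'lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 border-color--default__373c0__2oFDT nowrap__373c0__1_N1j'
pagination = soup.find('div', class_=pagination_class)
if pagination:
    for a in pagination.find_all('a', href=True):
        print(a['href'])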
To get the next page, you're going to have to follow the "Next" link. The problem here is that the link is just the same as before plus #. Open the Inspector [Ctrl-Shift-I in Chrome, Firefox] and switch to the Network tab, then click the next button; you'll see a request to something like:
https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed?rl=en&sort_by=relevance_desc&q=&start=40
which looks something like:
{"reviews": [{"comment": {"text": "Such a great experience every time you come into this place...
This is JSON. The only problem is that you'll need to fool Yelp's servers into thinking you're browsing the website by sending the same request headers back to them; otherwise you get different data that doesn't look like comments. You can see those headers in Chrome's dev tools under the Network tab when you select the request.
My usual approach is to copy-paste the headers not prefixed with a colon (ignore :authority, etc.) directly into a triple-quoted string called raw_headers, then run
headers = dict([[h.partition(':')[0], h.partition(':')[2]] for h in raw_headers.split('\n')])
over them, and pass them as an argument to requests with:
requests.get(url, headers=headers)
Some of the headers won't be necessary, cookies might expire, and all sorts of other issues might arise but this at least gives you a fighting chance.
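Putting those pieces together, a rough sketch (the header values below are placeholders for whatever your browser actually sent, and the reviews -> comment -> text layout is taken from the snippet above, so verify both against the live response):

import requests

# paste the request headers from the network tab here, skipping the ones that start
# with a colon (:authority, etc.); these values are placeholders, not real headers
raw_headers = """user-agent: Mozilla/5.0
accept: application/json
accept-language: en-US,en;q=0.9"""

headers = dict([[h.partition(':')[0], h.partition(':')[2].strip()] for h in raw_headers.split('\n')])

feed_url = 'https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed?rl=en&sort_by=relevance_desc&q=&start=40'
data = requests.get(feed_url, headers=headers).json()

# each entry should expose the comment text under reviews -> comment -> text
for review in data.get('reviews', []):
    print(review['comment']['text'])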

Python scraping website with flight tickets

I am trying to extract information about the prices of flight tickets with a Python script.
I would like to parse all the prices (such as the "121" at the bottom of the element tree in the browser's inspector). I have constructed a simple script, and my problem is that I am not sure how to get the right parts from the code behind the page's "inspect element". My code is below:
import urllib3
from bs4 import BeautifulSoup as BS

http = urllib3.PoolManager()
URL = "https://greatescape.co/?datesType=oneway&dateRangeType=exact&departDate=2019-08-19&origin=EAP&originType=city&continent=europe&flightType=3&city=WAW"
response = http.request('GET', URL)
soup = BS(response.data, "html.parser")

body = soup.find('body')
__next = body.find('div', {'id': '__next'})
ui_container = __next.find('div', {'class': 'ui-container'})
bottom_container_root = ui_container.find('div', {'class': 'bottom-container-root'})
print(bottom_container_root)
The problem is that I am stuck at the level of ui-container: bottom_container_root comes back empty, even though it is a direct child of ui-container. Could someone please let me know how to parse this tree properly?
I have no experience in web scraping, but as it happens it is one step in a bigger workflow I am building.
.find_next_siblings and .next_element can be useful in navigating through containers.
Here is some example usage:
from bs4 import BeautifulSoup

html = open("small.html").read()
soup = BeautifulSoup(html, "html.parser")

print(soup.head.next_element)
print(soup.head.next_element.next_element)
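And a small sketch of find_next_siblings(), which the answer mentions but does not show, again assuming a locally saved page like small.html:

from bs4 import BeautifulSoup

html = open("small.html").read()
soup = BeautifulSoup(html, "html.parser")

first_div = soup.find('div')
if first_div:
    # every <div> that follows the first one at the same level of the tree
    for sibling in first_div.find_next_siblings('div'):
        print(sibling.get('class'))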

'Treating a list of items like a single item' error: how to find links within each link already scraped

I am writing a python code to scrape the pdfs of meetings off this website: https://www.gmcameetings.co.uk
The pdf links are within links, which are also within links. I have the first set of links off the page above, then I need to scrape links within the new urls.
When I do this I get the following error:
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
This is my code so far, which all works fine and has been checked in a Jupyter notebook:
# importing libraries and defining
import requests
import urllib.request
import time
from bs4 import BeautifulSoup as bs

# set url
url = "https://www.gmcameetings.co.uk/"

# grab html
r = requests.get(url)
page = r.text
soup = bs(page, 'lxml')

# creating folder to store pdfs - if it doesn't exist, create a separate folder
folder_location = r'E:\Internship\WORK'

# getting all meeting hrefs off the url
meeting_links = soup.find_all('a', href='TRUE')
for link in meeting_links:
    print(link['href'])
    if link['href'].find('/meetings/') > 1:
        print("Meeting!")
This is the line that then receives the error:
second_links = meeting_links.find_all('a', href='TRUE')
I have tried find() as Python suggests, but that doesn't work either. I understand that it can't treat meeting_links as a single item.
So basically, how do you search for links within each element of the new variable (meeting_links)?
I already have code to get the pdfs once I have the second set of urls, which seems to work fine, but I obviously need to get these first.
Hopefully this makes sense and I've explained it OK - I only properly started using Python on Monday, so I'm a complete beginner.
To get all meeting links, try:
from bs4 import BeautifulSoup as bs
import requests

# set url
url = "https://www.gmcameetings.co.uk/"

# grab html
r = requests.get(url)
page = r.text
soup = bs(page, 'lxml')

# Scrape to find all links
all_links = soup.find_all('a', href=True)

# Loop through links to find those containing '/meetings/'
meeting_links = []
for link in all_links:
    href = link['href']
    if '/meetings/' in href:
        meeting_links.append(href)

print(meeting_links)
The .find() function that you use in your original code is specific to beautiful soup objects. To find a substring within a string, just use native Python: 'a' in 'abcd'.
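From there, a rough sketch of the second level, continuing from the code above (whether the documents are linked directly as .pdf files is an assumption, so check one of the meeting pages in the inspector first):

from urllib.parse import urljoin

second_links = []
for meeting_href in meeting_links:
    meeting_url = urljoin(url, meeting_href)  # handle relative hrefs just in case
    meeting_page = bs(requests.get(meeting_url).text, 'lxml')
    # collect every PDF link on the meeting page
    for a in meeting_page.find_all('a', href=True):
        if a['href'].lower().endswith('.pdf'):
            second_links.append(urljoin(meeting_url, a['href']))

print(second_links)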
Hope that helps!

Scraping with Python. Can't get wanted data

I am trying to scrape a website, but I have encountered a problem: when I try to scrape the data, the HTML looks different in the browser's inspector from what I get in Python. I am trying to scrape election results from http://edition.cnn.com/election/results/states/arizona/house/01. I used the script below to check the HTML of the webpage, and I noticed the difference: the classes that I need, like section-wrapper, are not there.
import requests
from bs4 import BeautifulSoup

page = requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
Does anyone know what the problem is?
http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json
This site uses JavaScript to fetch its data; you can check the url above.
You can find this url in Chrome dev tools. There are many requests listed there, so check them out:
Chrome >> F12 >> Network tab >> F5 (refresh page) >> double-click the .json url >> open in a new tab
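A minimal sketch of pulling that feed directly (the structure of the JSON isn't shown here, so inspect the response before relying on any particular keys):

import requests

json_url = 'http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json'
data = requests.get(json_url).json()

# look at the top-level structure first to see how the county results are organized
print(list(data.keys()) if isinstance(data, dict) else type(data))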
import requests
from bs4 import BeautifulSoup

page = requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, "lxml")

# you can try all sorts of tags here; I used class "ad" and class "ec-placeholder"
g_data = soup.find_all("div", {"class": "ec-placeholder"})
h_data = soup.find_all("div", {"class": "ad"})

for item in g_data:
    print(item)
    # print('\n')
# for item in h_data:
#     print(item)

Website Scraping Specific Forms

For an extracurricular school project, I'm learning how to scrape a website. As you can see from the code below, I am able to scrape a form called 'elqFormRow' off one page.
How would one go about scraping all occurrences of 'elqFormRow' across the whole website? I'd like to return the URL where each form was located into a list, but I'm running into trouble while doing so because I don't know how.
import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

for div in soup.find_all('div', class_='elqFormRow'):
    print(div.text.strip())
You can grab the URLs from a page and follow them to (presumably) scrape the whole site. Something like this, which will require a little massaging depending on where you want to start and what pages you want:
import bs4 as bs
import requests

domain = "engage.hpe.com"
initial_url = 'http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage'

# get urls to scrape
text = requests.get(initial_url).text
initial_soup = bs.BeautifulSoup(text, 'lxml')
tags = initial_soup.find_all('a', href=True)

urls = []
for tag in tags:
    if domain in tag['href']:
        urls.append(tag['href'])
urls.append(initial_url)
print(urls)

# function to grab your info
def scrape_desired_info(url):
    out = []
    text = requests.get(url).text
    soup = bs.BeautifulSoup(text, 'lxml')
    for div in soup.find_all('div', class_='elqFormRow'):
        out.append(div.text.strip())
    return out

info = [scrape_desired_info(url) for url in urls if domain in url]
urllib stinks, use requests. If you need to go multiple levels down in the site, put the URL-finding section in a function and call it X number of times, where X is the number of levels of links you want to traverse.
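A rough sketch of that multi-level idea, reusing the imports and names from the code above (the depth limit and same-domain check are assumptions; add delays and a page cap before pointing it at a real site):

def find_links(page_url, domain):
    # return all same-domain hrefs found on one page
    soup = bs.BeautifulSoup(requests.get(page_url).text, 'lxml')
    return [a['href'] for a in soup.find_all('a', href=True) if domain in a['href']]

def crawl(start_url, domain, levels):
    seen = {start_url}
    frontier = [start_url]
    for _ in range(levels):  # X levels of links, as described above
        next_frontier = []
        for page in frontier:
            for link in find_links(page, domain):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return seen

urls = crawl(initial_url, domain, levels=2)
info = [scrape_desired_info(url) for url in urls]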
Scrape responsibly. Try not to get into a sorcerer's apprentice situation where you're hitting the site over and over in a loop, or following links external to the site. In general, I'd also not put in the question the page you want to scrape.
