I started learning Python today, so it is not a surprise that I am struggling with some basics. I am trying to parse data from a school website for a project, and I managed to parse the first page. However, the results are paginated across multiple pages.
I have an idea of how to go about it: loop through the URLs, since I know the URL format, but I have no idea how to proceed. I figured it would be better to somehow search for the "next" button and run the function if it is there; if not, stop the function.
I would appreciate any help I can get.
import requests
from bs4 import BeautifulSoup
url = "http://www.myschoolwebsite.com/1"
#url2 = "http://www.myschoolwebsite.com/2"
r = requests.get(url)
soup = BeautifulSoup(r.content,'lxml')
g_data = soup.find_all('ul', {"class": "searchResults"})
for item in g_data:
    for li in item.findAll('li'):
        for resultnameh2 in li.findAll('h2'):
            for resultname in resultnameh2.findAll('a'):
                print(resultname.text)
        for resultAddress in li.findAll('p', {"class": "resultAddress"}):
            print(resultAddress.text.replace('Get directions', '').strip())
        for resultContact in li.findAll('ul', {"class": "resultContact"}):
            for resultNumber in resultContact.findAll('a', {"class": "resultMainNumber"}):
                print(resultNumber.text)
First, you can assume a maximum number of pages for the directory (if you know the URL pattern). I am assuming the URL is of the form http://base_url/page. Then you can write this:
base_url = 'http://www.myschoolwebsite.com'
total_pages = 100
def parse_content(r):
    soup = BeautifulSoup(r.content, 'lxml')
    g_data = soup.find_all('ul', {"class": "searchResults"})
    for item in g_data:
        for li in item.findAll('li'):
            for resultnameh2 in li.findAll('h2'):
                for resultname in resultnameh2.findAll('a'):
                    print(resultname.text)
            for resultAddress in li.findAll('p', {"class": "resultAddress"}):
                print(resultAddress.text.replace('Get directions', '').strip())
            for resultContact in li.findAll('ul', {"class": "resultContact"}):
                for resultNumber in resultContact.findAll('a', {"class": "resultMainNumber"}):
                    print(resultNumber.text)

for page in range(1, total_pages + 1):
    response = requests.get(base_url + '/' + str(page))
    if response.status_code != 200:
        break
    parse_content(response)
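If you would rather follow the idea from the question and stop when there is no "next" button, here is a minimal sketch. The class name "next" for the pager link is an assumption about the site's markup, so adjust the selector to whatever the page actually uses:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'http://www.myschoolwebsite.com/1'
while url:
    r = requests.get(url)
    parse_content(r)  # reuse the parsing function defined above
    soup = BeautifulSoup(r.content, 'lxml')
    # look for a "next" pager link; the class name here is a guess
    next_link = soup.find('a', class_='next')
    url = urljoin(url, next_link['href']) if next_link else None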
I would build a list of all the URLs and loop through it, or, if there is a clear pattern, write a regex to match that pattern.
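For example, a minimal sketch of that approach, reusing the base_url, total_pages, and parse_content names from the answer above:
urls = [base_url + '/' + str(page) for page in range(1, total_pages + 1)]
for url in urls:
    response = requests.get(url)
    if response.status_code == 200:
        parse_content(response)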
I'm trying to collect all the links in the RSS feed of this Google News page using Beautiful Soup. I'm probably overcomplicating it, but I can't seem to do it with this loop, which iterates through a list of search terms I want to scrape from Google News.
import re
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

for t in terms:
    raw_url = "https://news.google.com/rss/search?q=" + t + "&hl=en-US&gl=US&ceid=US%3Aen"
    url = raw_url.replace(" ", "-")
    req = Request(url)
    html_page = urlopen(req)
    soup = BeautifulSoup(html_page, "lxml")
    links = []
    links.append(re.findall("href=[\"\'](.*?)[\"\']", str(html_page), flags=0))
    print(links)
The list comes up empty every time. My regex is probably off...
Any ideas?
Let BeautifulSoup help you by extracting all of the <item> tags, but because the parser treats <link> as a self-closing tag, the URL is not kept inside it and you need to do the rest by hand. This does what you want, I think.
from bs4 import BeautifulSoup
import requests
terms = ['abercrombie']
for t in terms:
    url = f"https://news.google.com/rss/search?q={t}&hl=en-US&gl=US&ceid=US%3Aen"
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.text, "lxml")
    for item in soup.find_all("item"):
        # the URL sits between the empty <link/> tag and the <guid> tag,
        # so slice it out of the item's string form
        link = str(item)
        i = link.find("<link/>")
        j = link.find("<guid")
        print(link[i + 7:j])
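As an aside, if the lxml XML parser is available, you can sidestep the string slicing by asking BeautifulSoup to parse the feed as XML, which keeps the URL inside the <link> tag. A sketch of that variant (not verified against every feed):
from bs4 import BeautifulSoup
import requests

terms = ['abercrombie']
for t in terms:
    url = f"https://news.google.com/rss/search?q={t}&hl=en-US&gl=US&ceid=US%3Aen"
    response = requests.get(url)
    # "xml" tells BeautifulSoup to use an XML parser, so <link> is not
    # treated as a self-closing HTML tag and its text is preserved
    soup = BeautifulSoup(response.text, "xml")
    for item in soup.find_all("item"):
        print(item.link.text)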
I'm making a web crawler for a recipe website, and I would like to get the link for a recipe and then use that link to get the ingredients. I am able to do that, but only by manually entering the link of the recipe. Is there a way to get the link and then use it to look up the ingredients? I will also take any suggestions on how to make this code better!
import requests
from bs4 import BeautifulSoup

def trade_spider():
    url = 'https://tasty.co/topic/best-vegetarian'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'lxml')
    for link in soup.find_all('a', {'class': 'feed-item analyt-internal-link-subunit'}):
        test = link.get('href')
        print(test)

def ingredient_spider():
    url1 = 'https://tasty.co/recipe/peanut-butter-keto-cookies'
    source_code1 = requests.get(url1)
    new_text = source_code1.text
    soup1 = BeautifulSoup(new_text, 'lxml')
    for ingredients in soup1.find_all("li", {"class": "ingredient xs-mb1 xs-mt0"}):
        print(ingredients.text)
To do this, ensure that the output of your function is set to return rather than print (to understand the difference, try reading the top answer on this post: What is the formal difference between "print" and "return"?)
You can then either store the output of the function in a variable or pass it directly into the next function.
For example
x = trade_spider()
or
newFunction(trade_spider())
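For instance, a minimal sketch of trade_spider rewritten to return its links instead of printing them (the class name is copied from the question and not re-checked against the live page):
import requests
from bs4 import BeautifulSoup

def trade_spider():
    url = 'https://tasty.co/topic/best-vegetarian'
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    links = []
    for link in soup.find_all('a', {'class': 'feed-item analyt-internal-link-subunit'}):
        links.append(link.get('href'))
    return links  # return the list so the caller can use it

recipe_links = trade_spider()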
You need to call the ingredient_spider function for every link you get from your recipe.
Using your example, it would look like this:
def trade_spider():
    url = 'https://tasty.co/topic/best-vegetarian'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'lxml')
    for link in soup.find_all('a', {'class': 'feed-item analyt-internal-link-subunit'}):
        test = link.get('href')
        ingredient_spider(test)

def ingredient_spider(url):
    source_code1 = requests.get(url)  # receive url from trade_spider function
    new_text = source_code1.text
    soup1 = BeautifulSoup(new_text, 'lxml')
    for ingredients in soup1.find_all("li", {"class": "ingredient xs-mb1 xs-mt0"}):
        print(ingredients.text)
For each link you get from test = link.get('href'), you call the function ingredient_spider(), passing the test variable as an argument.
I am honestly not sure if I understand correctly what you are asking, but if I do, you could go with something like this:
first, create a list of URLs
second, create a function that processes a single URL
last, create a worker that works through that list piece by piece
def first():
    URLs = []
    ...
    for link in soup.find_all('a', {'class': 'feed-item analyt-internal-link-subunit'}):
        URLs.append(link.get('href'))
    return URLs

def second(url):
    source_code1 = requests.get(url)
    new_text = source_code1.text
    soup1 = BeautifulSoup(new_text, 'lxml')
    # collect every ingredient on the page rather than returning after the first one
    return [ingredients.text for ingredients in
            soup1.find_all("li", {"class": "ingredient xs-mb1 xs-mt0"})]

def third(URL_LIST):
    for URL in URL_LIST:
        tmp = second(URL)
        print(tmp)

URL_LIST = first()
third(URL_LIST)
I have copies of this very code that I am trying to run, and every time I copy it line by line it isn't working right. I am more than frustrated and can't seem to figure out where it is going wrong. What I am trying to do is go to a website and scrape the different ratings pages, which are labelled A, B, C, etc. Then I go to each page to pull the total number of pages it is using. I am trying to scrape the <span class='letter-pages'> elements with href='/ratings/A/1' and so on. What am I doing wrong?
import requests
from bs4 import BeautifulSoup
url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
hrefs = []
ratings = []
ks = []
pages_scrape = []
for href in soup.findAll('a'):
    if 'href' in href.attrs:
        hrefs.append(href.attrs['href'])

for good_ratings in hrefs:
    if good_ratings.startswith('/ratings/'):
        ratings.append(url[:-9] + good_ratings)
    # elif good_ratings.startswith('/401k'):
    #     ks.append(url[:-9] + good_ratings)

del ratings[0]
del ratings[27:]
print(ratings)

for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    for href in soup.find('span', class_='letter-pages'):
        # Not working here
        pages_scrape.append(href.attrs['href'])
        # Will print all the anchor tags with hrefs if I remove the above comment.
        print(href)
You are trying to get the href prematurely: you are extracting the attribute directly from the span tag, which merely contains nested a tags, rather than from a list of a tags.
for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    span = soup.find('span', class_='letter-pages')
    for a in span.find_all('a'):
        href = a.get('href')
        pages_scrape.append(href)
I didn't test this on all pages, but it worked for the first one. You pointed out that on some of the pages the content wasn't getting scraped, which is due to the span search returning None. To get around this you can do something like:
for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    span = soup.find('span', class_='letter-pages')
    if span:
        for a in span.find_all('a'):
            href = a.get('href')
            pages_scrape.append(href)
            print(href)
    else:
        # page is a Response object, so report the URL instead
        print('span.letter-pages not found on ' + each_rating)
Depending on your use case you might want to do something different, but this will indicate to you which pages don't match your scraping model and need to be manually investigated.
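For instance, one small variation on the loop above collects the failing URLs so they can be inspected later (the failed_pages name is just for illustration):
failed_pages = []  # URLs where the letter-pages span was missing

for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    span = soup.find('span', class_='letter-pages')
    if span:
        pages_scrape.extend(a.get('href') for a in span.find_all('a'))
    else:
        failed_pages.append(each_rating)

print('Pages to investigate manually:', failed_pages)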
You probably meant to do find_all instead of find -- so change
for href in soup.find('span', class_='letter-pages'):
to
for href in soup.find_all('span', class_='letter-pages'):
You want to be iterating over a list of tags, not a single tag. find gives you a single tag object; when you iterate over a single tag, you get its children (NavigableString and Tag objects), not the matched tags. find_all gives you the list of tag objects you want.
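A quick way to see the difference, using toy markup rather than the real page:
from bs4 import BeautifulSoup

html = '<span class="letter-pages"><a href="/ratings/A/1">1</a><a href="/ratings/A/2">2</a></span>'
soup = BeautifulSoup(html, 'html.parser')

print(type(soup.find('span', class_='letter-pages')))      # a single Tag
print(type(soup.find_all('span', class_='letter-pages')))  # a ResultSet (a list of Tags)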
When I run this code it gives me an empty list. I'm new to web scraping, so I don't know what I'm doing wrong.
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=laptop'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
container = soup.findAll('li', {'class': 's-result-item celwidget '})
#btw the space is also there in the html code
print(container)
results:
[]
What I tried is to grab the HTML code from the site and soup through the li tags where all the information is stored, so I can print out all the information in a for loop.
Also, if someone wants to explain how to use BeautifulSoup, we can always talk.
Thank you guys.
The request most likely comes back without the result list because Amazon serves a different page when no browser-like User-Agent header is sent, so add one first. Working code that grabs product and price could then look something like this.
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=laptop'
r = requests.get(url, headers={'User-Agent': 'Mozilla Firefox'})
soup = BeautifulSoup(r.text, 'html.parser')
container = soup.findAll('li', {'class': 's-result-item celwidget '})
for cont in container:
    h2 = cont.h2.text.strip()
    # Amazon lists prices in two ways. If one fails, use the other
    try:
        currency = cont.find('sup', {'class': 'sx-price-currency'}).text.strip()
        price = currency + cont.find('span', {'class': 'sx-price-whole'}).text.strip()
    except:
        price = cont.find('span', {'class': 'a-size-base a-color-base'})
    print('Product: {}, Price: {}'.format(h2, price))
Let me know if that helps you further...
I'm trying to scrape a list of URLs from the European Parliament's Legislative Observatory. I do not type in any search keyword, in order to get all links to documents (currently 13172). I can easily scrape the first 10 results displayed on the website using the code below. However, I want all the links, so that I would not need to somehow press the next-page button. Please let me know if you know of a way to achieve this.
import requests, bs4, re
# main url of the Legislative Observatory's search site
url_main = 'http://www.europarl.europa.eu/oeil/search/search.do?searchTab=y'
# function gets a list of links to the procedures
def links_to_procedures(url_main):
    # requesting html code from the main search site of the Legislative Observatory
    response = requests.get(url_main)
    soup = bs4.BeautifulSoup(response.text)  # loading text into Beautiful Soup
    links = [a.attrs.get('href') for a in soup.select('div.procedure_title a')]  # getting a list of links of the procedure title
    return links
print(links_to_procedures(url_main))
You can follow the pagination by specifying the page GET parameter.
First, get the results count, then calculate the number of pages to process by dividing the count by the results per page. Then iterate over the pages one by one and collect the links:
import math
import re

from bs4 import BeautifulSoup
import requests

response = requests.get('http://www.europarl.europa.eu/oeil/search/search.do?searchTab=y')
soup = BeautifulSoup(response.content)

# get the results count
num_results = soup.find('span', class_=re.compile('resultNum')).text
num_results = int(re.search(r'(\d+)', num_results).group(1))
print("Results found: " + str(num_results))

results_per_page = 50
base_url = "http://www.europarl.europa.eu/oeil/search/result.do?page={page}&rows=%s&sort=d&searchTab=y&sortTab=y&x=1411566719001" % results_per_page

links = []
num_pages = math.ceil(num_results / results_per_page)  # round up so the last partial page is included
for page in range(1, num_pages + 1):
    print("Current page: " + str(page))

    url = base_url.format(page=page)
    response = requests.get(url)

    soup = BeautifulSoup(response.content)
    links += [a.attrs.get('href') for a in soup.select('div.procedure_title a')]

print(links)