For an extra curricular school project, I'm learning how to scrape a website. As you can see by the code below, I am able to scrape a form called, 'elqFormRow' off of one page.
How would one go about scraping all occurrences of the 'elqFormRow' on the whole website? I'd like to return the URL of where that form was located into a list, but am running into trouble while doing so because I don't know how lol.
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
for div in soup.find_all('div', class_='elqFormRow'):
print(div.text.strip())
You can grab the URLs from a page and follow them to (presumably) scrape the whole site. Something like this, which will require a little massaging depending on where you want to start and what pages you want:
import bs4 as bs
import requests
domain = "engage.hpe.com"
initial_url = 'http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage'
# get urls to scrape
text = requests.get(initial_url).text
initial_soup = bs.BeautifulSoup(text, 'lxml')
tags = initial_soup.findAll('a', href=True)
urls = []
for tag in tags:
if domain in tag:
urls.append(tag['href'])
urls.append(initial_url)
print(urls)
# function to grab your info
def scrape_desired_info(url):
out = []
text = requests.get(url).text
soup = bs.BeautifulSoup(text, 'lxml')
for div in soup.find_all('div', class_='elqFormRow'):
out.append(div.text.strip())
return out
info = [scrape_desired_info(url) for url in urls if domain in url]
URLlib stinks, use requests. If you need to go multiple levels down in the site put the URL finding section in a function and call it X number of times, where X is the number of levels of links you want to traverse.
Scrape responsibly. Try not to get into a sorcerer's apprentice situation where you're hitting the site over and over in a loop, or following links external to the site. In general, I'd also not put in the question the page you want to scrape.
Related
I am trying to make a simple crawler that scrapes through this https://en.wikipedia.org/wiki/Web_scraping page, then proceeds to extract the 19 links from the See About section. This I manage to do, however I am also trying to extract the first paragraph from each of those 19 links and this is where it stops "working". I get the same paragraph from the first page and not from each one. This is what I have so far. I know there might be better options for doing this but i want to stick to BeautifulSoup and simple python code.
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/Web_scraping'
data = requests.get('https://en.wikipedia.org/wiki/Web_scraping').text
soup = BeautifulSoup(data, 'html.parser')
def visit():
try:
p = soup.p
print(p.get_text())
except AttributeError:
print('<p> Tag was not found')
links_todo = []
links = soup.find('div', {'class': 'div-col'}).find_all('a')
for link in links:
if 'href' in link.attrs:
links_todo.append(urljoin(url, link.attrs['href']))
while links_todo:
url_to_visit = links_todo.pop()
print('Now visiting:', url_to_visit)
visit()
Example of the first print
Now visiting: https://en.wikipedia.org/wiki/OpenSocial
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Intended function should be that it prints the first paragraph for every new link printed, not the same paragraph from the first link. What do I need to do in order to fix this? Or any tips on what I am missing. I am fairly new to python so I am still learning the concepts as I work on things.
At the top of your code you define data and soup. Both are tied to https://en.wikipedia.org/wiki/Web_scraping.
Every time you call visit(), you print from soup, and soup never changes.
You need to pass the url to visit(), e.g. visit(url_to_visit). The visit function should accept the url as an argument, then visit the page using requests, and create a new soup from the returned data, then print the first paragraph.
Edited to add code explaining my original answer:
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
start_url = 'https://en.wikipedia.org/wiki/Web_scraping'
# Renamed this to start_url to make it clear that this is the source page
data = requests.get(start_url).text
soup = BeautifulSoup(data, 'html.parser')
def visit(new_url): # function now accepts a url as an argument
try:
new_data = requests.get(new_url).text # retrieve the text from the url
new_soup = BeautifulSoup(new_data, 'html.parser') # process the retrieved html in beautiful soup
p = new_soup.p
print(p.get_text())
except AttributeError:
print('<p> Tag was not found')
links_todo = []
links = soup.find('div', {'class': 'div-col'}).find_all('a')
for link in links:
if 'href' in link.attrs:
links_todo.append(urljoin(start_url, link.attrs['href']))
while links_todo:
url_to_visit = links_todo.pop()
print('Now visiting:', url_to_visit)
visit(url_to_visit) # here's where we pass each line to the visit() function
Here is the website I am to scrape the number of reviews
So here i want to extract number 272 but it returns None everytime .
I have to use BeautifulSoup.
I tried-
sources = requests.get('https://www.thebodyshop.com/en-us/body/body-butter/olive-body-butter/p/p000016')
soup = BeautifulSoup(sources.content, 'lxml')
x = soup.find('div', {'class': 'columns five product-info'}).find('div')
print(x)
output - empty tag
I want to go inside that tag further.
The number of reviews is dynamically retrieved from an url you can find in network tab. You can simply extract from response.text with regex. The endpoint is part of a defined ajax handler.
You can find a lot of the API instructions in one of the js files: https://thebodyshop-usa.ugc.bazaarvoice.com/static/6097redes-en_us/bvapi.js
For example:
You can trace back through a whole lot of jquery if you really want.
tl;dr; I think you need only add the product_id to a constant string.
import requests, re
from bs4 import BeautifulSoup as bs
p = re.compile(r'"numReviews":(\d+),')
ids = ['p000627']
with requests.Session() as s:
for product_id in ids:
r = s.get(f'https://thebodyshop-usa.ugc.bazaarvoice.com/6097redes-en_us/{product_id}/reviews.djs?format=embeddedhtml')
p = re.compile(r'"numReviews":(\d+),')
print(int(p.findall(r.text)[0]))
i just started programming.
I have the task to extract data from a HTML page to Excel.
Using Python 3.7.
My Problem is, that i have a website, whith more urls inside.
Behind these urls again more urls.
I need the data behind the third url.
My first Problem would be, how i can dictate the programm to choose only specific links from an ul rather then every ul on the page?
from bs4 import BeautifulSoup
import urllib
import requests
import re
page = urllib.request.urlopen("file").read()
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())
for link in soup.find_all("a", href=re.compile("katalog_")):
links= link.get("href")
if "katalog" in links:
for link in soup.find_all("a", href=re.compile("alle_")):
links = link.get("href")
print(soup.get_text())
There are many ways, one is to use "find_all" and try to be specific on the tags like "a" just like you did. If that's the only option, then use regular expression with your output. You can refer to this thread: Python BeautifulSoup Extract specific URLs. Also please show us either the link, or html structure of the links you want to extract. We would like to see the differences between the URLs.
PS: Sorry I can't make comments because of <50 reputation or I would have.
Updated answer based on understanding:
from bs4 import BeautifulSoup
import urllib
import requests
page = urllib.request.urlopen("https://www.bsi.bund.de/DE/Themen/ITGrundschutz/ITGrundschutzKompendium/itgrundschutzKompendium_node.html").read()
soup = BeautifulSoup(page, "html.parser")
for firstlink in soup.find_all("a",{"class":"RichTextIntLink NavNode"}):
firstlinks = firstlink.get("href")
if "bausteine" in firstlinks:
bausteinelinks = "https://www.bsi.bund.de/" + str(firstlinks.split(';')[0])
response = urllib.request.urlopen(bausteinelinks).read()
soup = BeautifulSoup(response, 'html.parser')
secondlink = "https://www.bsi.bund.de/" + str(((soup.find("a",{"class":"RichTextIntLink Basepage"})["href"]).split(';'))[0])
res = urllib.request.urlopen(secondlink).read()
soup = BeautifulSoup(res, 'html.parser')
listoftext = soup.find_all("div",{"id":"content"})
for text in listoftext:
print (text.text)
I try to build a list of urls out of an online-forum. It is necessary to use BeautifulSoup in my case. The goal is a list of URLs containing every page of a thread, e.g.
[http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches.html,
http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches-2.html,
http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches-3.html]
What works is this:
#import modules
import requests
from bs4 import BeautifulSoup
#define main-url
url = 'http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches.html'
#create a list of urls
urls=[url]
#load url
page = requests.get(url)
#parse it using BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')
#search for the url of the next page
nextpage = soup.find("a", ["pag next"]).get('href')
#append the urls of the next page to the list of urls
urls.append(nextpage)
print(urls)
When I try to build a if loop for the next pages as follows it will not work. Why?
for url in urls:
urls.append(soup.find("a", ["pag next"]).get('href'))
("a", ["pag next"]).get('href') identifies the url for the next page)
It's not an option to use the pagination of the url because there will be many other threads to crawl.
I'm using Jupyter Notebook server 5.7.4 on a macbook pro
Python 3.7.1
IPython 7.2.0
I know about the existence of this post. For my beginners knowledge the code is written too complicated but maybe your experience allows you to apply it on my use case.
The url pattern for pagination is always consistent with this site, therefore you don't need to make requests to grab page urls. Instead you can parse the the text in the button that says "Page 1 of 10" and construct the page urls after knowing the final page number.
import re
import requests
from bs4 import BeautifulSoup
thread_url = "http://forum.pcgames.de/stellt-euch-vor/9331721-update-im-out-bitches.html"
r = requests.get(thread_url)
soup = BeautifulSoup(r.content, 'lxml')
pattern = re.compile(r'Seite\s\d+\svon\s(\d+)', re.I)
pages = soup.find('a', text=pattern).text.strip()
pages = int(pattern.match(pages).group(1))
page_urls = [f"{thread_url[:-5]}-{p}.html" for p in range(1, pages + 1)]
for url in page_urls:
print(url)
You are looping through the urls and adding it to itself so the size of the list continues increasing indefinitely.
You are adding each url from urls to url - do you see the problem? urls continues to grow... it also already has each url that you are iterating through inside of it. Do you mean to call a function that executes the prior code on each next url and then adds it to the list?
I'm trying to scrape links with contextual information from the following page: https://www.reddit.com/r/anime/wiki/discussion_archive/2018. I'm able to get the links just fine using BS4 using Python, but having year, season, titles, and episodes associated to the links is ideal. The desired output would look like this:
I've started with the code below, but don't know how to loop through the code to capture things in sections for each season/title:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
link = 'https://www.reddit.com/r/anime/wiki/discussion_archive/2018'
request_2018 = session.get(link, headers={'User-agent': 'Chrome'})
soup = BeautifulSoup(request_2018.content, 'lxml')
data_table = soup.find('div', class_='md wiki')
Is this something that's doable with BS4? Thanks for your help!
EDIT
criteria = {'class':'md wiki'} # so it can reuse later
data_soup = soup.find('div', criteria)
titles = data_soup.find_all('strong')
tables = data_soup.find_all('table')
Try following:
titles = soup.find('div', {'class':'md wiki'}).find_all('strong')
data_tables = soup.find('div', {'class':'md wiki'}).find_all('table')
Better put the second argument of find into a dict and find_all will return all elements which match your search.