I am trying my hand at webscraping using BeautifulSoup.
I had posted this before here, but I was not very clear as to what I wanted, so it only partially answers my issue.
How do I extract only the content from this webpage
I want to extract the content from the webpage and then extract all the links from that output. Can someone please help me understand where I am going wrong?
This is what I have after updating my previous code with the answer provided in the link above.
import urllib3
from bs4 import BeautifulSoup

# Define the content to retrieve (webpage's URL)
quote_page = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'
# Retrieve the page
http = urllib3.PoolManager()
r = http.request('GET', quote_page)
if r.status == 200:
    page = r.data
    print(f'Type of Variable "page": {page.__class__.__name__}')
    print(f'Page Retrieved. Request Status: {r.status}, Page Size: {len(page)}')
else:
    print(f'Some problem occurred. Request status: {r.status}')
# Convert the stream of bytes into a BeautifulSoup representation
soup = BeautifulSoup(page, 'html.parser')
print(f'Type of variable "soup": {soup.__class__.__name__}')
# Check the content
print(f'{soup.prettify()[:1000]}')
# Check the HTML's title
print(f'Title tag: {soup.title}')
print(f'Title text: {soup.title.string}')
# Find the main content
article_tag = 'p'
articles = soup.find_all(article_tag)
print(f'Type of the variable "articles": {articles.__class__.__name__}')
for p in articles:
    print(p.text)
I then used the code below to get all the links, but I get an error:
# Find the links in the text
# identify the type of tag to retrieve
tag = 'a'
# create a list with the links from the `<a>` tag
tag_list = [t.get('href') for t in articles.find_all(tag)]
tag_list
That is because articles is a ResultSet returned by soup.find_all(article_tag), which you can check with type(articles).
To achieve your goal, you have to iterate over articles first, so simply add an additional for-loop to your list comprehension:
[t.get('href') for article in articles for t in article.find_all(tag)]
In addition, you may want to use a set comprehension to avoid duplicates, and also concatenate relative paths with the base URL:
list(set(t.get('href') if t.get('href').startswith('http') else 'https://bigbangtheory.fandom.com'+t.get('href') for article in articles for t in article.find_all(tag)))
Output:
['https://bigbangtheory.fandom.com/wiki/The_Killer_Robot_Instability',
'https://bigbangtheory.fandom.com/wiki/Rajesh_Koothrappali',
'https://bigbangtheory.fandom.com/wiki/Bernadette_Rostenkowski-Wolowitz',
'https://bigbangtheory.fandom.com/wiki/The_Valentino_Submergence',
'https://bigbangtheory.fandom.com/wiki/The_Beta_Test_Initiation',
'https://bigbangtheory.fandom.com/wiki/Season_2',
'https://bigbangtheory.fandom.com/wiki/Dr._Pemberton',...]
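As an alternative to concatenating the base URL by hand, here is a minimal sketch using urllib.parse.urljoin with the same soup and articles variables as above (the base_url name is just for illustration):

from urllib.parse import urljoin

base_url = 'https://bigbangtheory.fandom.com'
# urljoin leaves absolute hrefs untouched and resolves relative ones against base_url
tag_list = list({urljoin(base_url, t.get('href'))
                 for article in articles
                 for t in article.find_all('a')
                 if t.get('href')})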
I am very new and need your help. I want to write a script I can generalize for web scraping. So far I have the code below, but it keeps giving me a blank output file. I would like to be able to easily modify this code to work on any website and eventually make the search strings a little more complex. For now, I have CNN as a general page and "mccarthy" as the search term, because I figure there are certainly articles mentioning him right now. Can you help me get this to work?
# Begin code
import requests
from bs4 import BeautifulSoup
import docx

# Set the search parameters
search_term = 'mccarthy'  # Set the search term
start_date = '2023-01-04'  # Set the start date (format: YYYY-MM-DD)
end_date = '2023-01-05'  # Set the end date (format: YYYY-MM-DD)
website = 'https://www.cnn.com'  # Set the website to search

document = open('testfile.docx', 'w')  # Open the existing Word document

# Initialize the list of articles and the page number
articles = []
page_number = 1

# Set the base URL for the search API
search_url = f'{website}/search'
# Set the base URL for the article page
article_base_url = f'{website}/article/'

while articles or page_number == 1:
    # Send a request to the search API
    response = requests.get(search_url, params={'q': search_term, 'from': start_date, 'to': end_date, 'page': page_number})
    # Check if the response is in JSON format
    if response.headers['Content-Type'] == 'application/json':
        # Load the JSON data
        data = response.json()
        # Get the list of articles from the JSON data
        articles = data['articles']
    else:
        # Parse the HTML content of the search results page
        soup = BeautifulSoup(response.text, 'html.parser')
        # Find all articles on the search results page
        articles = soup.find_all('article')
    # Loop through the articles
    for article in articles:
        # Find the link element
        link_element = article.find('a', class_='title')
        # Extract the link from the link element
        link = link_element['href']
        # Check if the link is a relative URL
        if link.startswith('/'):
            # If the link is relative, convert it to an absolute URL
            link = f'{website}{link}'
        # Add the link to the document
        document.add_paragraph(link)
    # Increment the page number
    page_number += 1

# Save the document
document.close()
I have tried numerous iterations, but I have deleted them all, so I cannot really post any here. This keeps giving me a blank output file.
This won't solve the main issue, but here are a couple of things to fix:
https://edition.cnn.com/search?q=&from=0&size=10&page=1&sort=newest&types=all&section=
Looking at the CNN search page URL, we can see that the from parameter does not refer to a date but to a number, i.e. if from=5, it will only show the 5th article onwards. Therefore you can remove 'from' and 'to' from your request params.
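For example, the request could then look something like this (keeping the names from the question; the size value is just an assumption based on the URL above):

response = requests.get(search_url, params={'q': search_term, 'size': 10, 'page': page_number})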
articles = soup.find_all('article')
This is returning an empty list because there are no <article> tags within the HTML page. Inspecting the CNN HTML we see that the URLs you are looking for are within <div class="card container__item container__item--type- __item __item--type- "> tags so I would change this line to soup.find_all('div', class_="card container__item container__item--type- __item __item--type- ")
document = open('testfile.docx','w') # Open the existing Word document
You've imported the docx module but are not using it. Word documents (which require extra data for formatting) should be opened like this document = Document(). For reference, here are the docx docs: https://python-docx.readthedocs.io/en/latest/
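For instance, a minimal sketch of creating and saving a Word document with python-docx (the filename and example link are just placeholders):

from docx import Document

document = Document()  # create a new, empty document in memory
document.add_paragraph('https://www.cnn.com/example-link')  # one paragraph per link
document.save('testfile.docx')  # write the .docx file to disk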
while articles or page_number == 1:
I don't think this line is needed.
The main issue seems to be that this page requires JavaScript to be run to render the content. Using requests.get() by itself won't do this. You'll need to use a library such as Requests-HTML. I tried doing this but the articles still don't render, so I'm not sure.
I'm stuck again with my first attempts in web scraping with python.
import requests
from bs4 import BeautifulSoup

url = link
page = requests.get(url)
soup = BeautifulSoup(page.content, features="lxml")
checkout_link = []
links = soup.find_all("a")
for url in soup.find_all('a'):
    if url.get('href') == None:
        pass
    elif len(url.get('href')) >= 200:
        checklist += 10
        for search in links:
            if "checkout" in search.get("href"):
                checkout_link = search.get("href")
            else:
                pass
    else:
        pass
So this is my code right now. The parsing of all links works fine (I want this part to check how many links are available in total, and I thought it would be a good way to do both in a single request; correct me if I'm attempting this the wrong way). Even when I search for the checkout link and print it, I get the correct link reference printed, but I can't find a way to store it in checkout_link to use it further on. I want to make a request to this specific checkout URL afterwards.
You need to append it to the list
checkout_link.append(search.get("href"))
Consider doing the href filtering via an attribute selector with the * (contains) operator:
soup.select_one("[href*=checkout]")['href']
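Putting both suggestions together, a minimal sketch (assuming the soup and requests objects from the question; checkout_links and checkout_page are names made up for illustration):

# collect every href that contains "checkout"
checkout_links = [a['href'] for a in soup.select('a[href*=checkout]')]
if checkout_links:
    # follow the first checkout URL with a new request
    checkout_page = requests.get(checkout_links[0])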
I've been working on a project that takes an input of a url and creates a map of the page connections on a website.
The way I was approaching this was to scrape the page for links, then create a page object to hold the href of the page and a list of all the child links on that page. Once I have the data pulled from all the pages on the site I would pass it to a graphing function like matplotlib or plotly in order to get a graphical representation of the connections between pages on a website.
This is my code so far:
from urllib.request import urlopen
import urllib.error
from bs4 import BeautifulSoup, SoupStrainer

# Object to hold page href and child links on page
class Page:
    def __init__(self, href, links):
        self.href = href
        self.children = links

    def getHref(self):
        return self.href

    def getChildren(self):
        return self.children

# Method to get an array of all hrefs on a page
def getPages(url):
    allLinks = []
    try:
        # Combine the starting url and the new href
        page = urlopen('{}{}'.format(startPage, url))
        for link in BeautifulSoup(page, 'html.parser', parse_only=SoupStrainer('a')):
            try:
                if 'href' in link.attrs:
                    allLinks.append(link)
            except AttributeError:
                # If there is no href, skip the link
                continue
        # Return an array of all the links on the page
        return allLinks
    # Catch pages that can't be opened
    except urllib.error.HTTPError:
        print('Could not open {}{}'.format(startPage, url))

# Get starting page url from user
startPage = input('Enter a URL: ')
page = urlopen(startPage)

# Sets to hold unique hrefs and page objects
pages = set()
pageObj = set()

for link in BeautifulSoup(page, 'html.parser', parse_only=SoupStrainer('a')):
    try:
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                newPage = link.attrs['href']
                pages.add(newPage)
                # Get the child links on this page
                pageChildren = getPages(newPage)
                # Create a new page object, add to set of page objects
                pageObj.add(Page(newPage, pageChildren))
    except AttributeError:
        print('{} has an attribute error.'.format(link))
        continue
Would Scrapy be better for what I'm trying to do?
What library would work best for displaying the connections?
How do I fix the getPages function to correctly combine the user-inputted url with the hrefs pulled from the page? If I'm at 'https://en.wikipedia.org/wiki/Main_Page', I'll get 'Could not open https://en.wikipedia.org/wiki/Main_Page/wiki/English_language'. I think I need to combine from the end of the .org/ and drop the /wiki/Main_Page but I don't know the best way to do this.
This is my first real project so any pointers on how I could improve my logic are appreciated.
That's a nice idea for a first project!
Would Scrapy be better for what I'm trying to do?
There are numerous advantages that a Scrapy version of your project would have over your current version. The advantage you would feel immediately is the speed at which your requests are made. However, it may take you a while to get used to the structure of Scrapy projects.
How do I fix the getPages function to correctly combine the user-inputted url with the hrefs pulled from the page? If I'm at 'https://en.wikipedia.org/wiki/Main_Page', I'll get 'Could not open https://en.wikipedia.org/wiki/Main_Page/wiki/English_language'. I think I need to combine from the end of the .org/ and drop the /wiki/Main_Page but I don't know the best way to do this.
You can achieve this using urllib.parse.urljoin(startPage, relativeHref). Most of the links you're going to find are relative links, which you can then convert to absolute links using the urljoin function.
In your code you would change newPage = link.attrs['href'] to newPage = urllib.parse.urljoin(startPage, link.attrs['href']) and page = urlopen('{}{}'.format(startPage, url)) to page = urlopen(url).
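For example (the Wikipedia URLs are just the ones from the question):

from urllib.parse import urljoin

print(urljoin('https://en.wikipedia.org/wiki/Main_Page', '/wiki/English_language'))
# https://en.wikipedia.org/wiki/English_language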
Here are a couple of examples as to where you can change your code slightly for some benefits.
Instead of for link in BeautifulSoup(page, 'html.parser', parse_only=SoupStrainer('a')): you can use BeautifulSoup's find_all() function like this for link in BeautifulSoup(page, 'html.parser').find_all('a', href=True):. This way all your links are already guaranteed to have an href.
In order to prevent links on the same page from occurring twice, you should change allLinks = [] to be a set instead.
This is up to preference, but since Python 3.6 there is another syntax called "f-Strings" for referencing variables in strings. You could change print('{} has an attribute error.'.format(link)) to print(f'{link} has an attribute error.') for example.
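Pulling those suggestions together, getPages could look roughly like this (an untested sketch that keeps the names from your code and resolves each href inside the function):

from urllib.parse import urljoin
from urllib.request import urlopen
import urllib.error
from bs4 import BeautifulSoup, SoupStrainer

def getPages(url):
    allLinks = set()  # a set avoids duplicate hrefs on the same page
    try:
        page = urlopen(url)  # url is expected to be absolute here
        soup = BeautifulSoup(page, 'html.parser', parse_only=SoupStrainer('a'))
        for link in soup.find_all('a', href=True):
            # resolve relative hrefs against the page they were found on
            allLinks.add(urljoin(url, link['href']))
        return allLinks
    except urllib.error.HTTPError:
        print(f'Could not open {url}')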
Here is the website from which I am trying to scrape the number of reviews.
So here I want to extract the number 272, but it returns None every time.
I have to use BeautifulSoup.
I tried:
import requests
from bs4 import BeautifulSoup

sources = requests.get('https://www.thebodyshop.com/en-us/body/body-butter/olive-body-butter/p/p000016')
soup = BeautifulSoup(sources.content, 'lxml')
x = soup.find('div', {'class': 'columns five product-info'}).find('div')
print(x)
Output: an empty tag.
I want to go inside that tag further.
The number of reviews is dynamically retrieved from a URL you can find in the network tab. You can simply extract it from response.text with a regex. The endpoint is part of a defined AJAX handler.
You can find a lot of the API instructions in one of the js files: https://thebodyshop-usa.ugc.bazaarvoice.com/static/6097redes-en_us/bvapi.js
For example:
You can trace back through a whole lot of jQuery if you really want.
tl;dr: I think you only need to add the product_id to a constant string.
import requests, re
from bs4 import BeautifulSoup as bs

p = re.compile(r'"numReviews":(\d+),')
ids = ['p000627']
with requests.Session() as s:
    for product_id in ids:
        r = s.get(f'https://thebodyshop-usa.ugc.bazaarvoice.com/6097redes-en_us/{product_id}/reviews.djs?format=embeddedhtml')
        print(int(p.findall(r.text)[0]))
I would like to thank the user Pythonista for giving me this very useful code a few months back that solved my problem. However, I'm still confused about how the code functions, due to my lack of knowledge of HTML and the BeautifulSoup library.
What role does the specific_messages data structure play in this program?
I'm also confused about how the code saves the various posts.
And how does it check the user of each post?
import requests, pprint
from bs4 import BeautifulSoup as BS

url = "https://forums.spacebattles.com/threads/the-wizard-of-woah-and-the-impossible-methods-of-necromancy.337233/"
r = requests.get(url)
soup = BS(r.content, "html.parser")

# To find all posts from a specific user; everything below this is for all posts
specific_messages = soup.findAll('li', {'data-author': 'The Wizard of Woah!'})

# To find every post from every user
posts = {}
message_container = soup.find('ol', {'id': 'messageList'})
messages = message_container.findAll('li', recursive=0)
for message in messages:
    author = message['data-author']
    # or don't encode to utf-8, simply for printing in shell
    content = message.find('div', {'class': 'messageContent'}).text.strip().encode("utf-8")
    if author in posts:
        posts[author].append(content)
    else:
        posts[author] = [content]
pprint.pprint(posts)
specific_messages = soup.findAll('li', {'data-author': 'The Wizard of Woah!'})
soup is the BeautifulSoup object used to parse the HTML.
findAll() is a method that finds all tags matching the parameters you pass.
li is the tag that needs to be found.
data-author is the HTML attribute that will be searched for inside those tags.
The Wizard of Woah! is the author name.
So basically that line searches for all li tags whose data-author attribute is 'The Wizard of Woah!'.
findAll() returns multiple matches (a ResultSet), so you need to loop through it to handle each one; in the main loop each post's content is appended to a list stored per author in the posts dictionary.
That's all.
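For example, a minimal sketch of using specific_messages the same way the main loop uses messages (same soup and class names as the code above):

# loop over only that author's posts and print their text
for message in specific_messages:
    content = message.find('div', {'class': 'messageContent'}).text.strip()
    print(content)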