I am very new and need your help. I want to write a script I can generalize for web scraping. So far I have the code below, but it keeps giving me a blank output file. I would like to be able to easily modify this code to work on other websites and eventually make the search strings a little more complex. For now, I am using CNN as a general page and "mccarthy" as the search term, because I figure there are certainly articles mentioning him right now. Can you help me get this to work?
#Begin Code
import requests
from bs4 import BeautifulSoup
import docx

# Set the search parameters
search_term = 'mccarthy' # Set the search term
start_date = '2023-01-04' # Set the start date (format: YYYY-MM-DD)
end_date = '2023-01-05' # Set the end date (format: YYYY-MM-DD)
website = 'https://www.cnn.com' # Set the website to search

document = open('testfile.docx','w') # Open the existing Word document

# Initialize the list of articles and the page number
articles = []
page_number = 1

# Set the base URL for the search API
search_url = f'{website}/search'
# Set the base URL for the article page
article_base_url = f'{website}/article/'

while articles or page_number == 1:
    # Send a request to the search API
    response = requests.get(search_url, params={'q': search_term, 'from': start_date, 'to': end_date, 'page': page_number})
    # Check if the response is in JSON format
    if response.headers['Content-Type'] == 'application/json':
        # Load the JSON data
        data = response.json()
        # Get the list of articles from the JSON data
        articles = data['articles']
    else:
        # Parse the HTML content of the search results page
        soup = BeautifulSoup(response.text, 'html.parser')
        # Find all articles on the search results page
        articles = soup.find_all('article')
    # Loop through the articles
    for article in articles:
        # Find the link element
        link_element = article.find('a', class_='title')
        # Extract the link from the link element
        link = link_element['href']
        # Check if the link is a relative URL
        if link.startswith('/'):
            # If the link is relative, convert it to an absolute URL
            link = f'{website}{link}'
        # Add the link to the document
        document.add_paragraph(link)
    # Increment the page number
    page_number += 1

# Save the document
document.close()
I have tried numerous iterations, but I have deleted them all, so I cannot really post any here. Every version keeps giving me a blank output file.
This won't solve the main issue, but here are a couple of things to fix:
https://edition.cnn.com/search?q=&from=0&size=10&page=1&sort=newest&types=all&section=
Looking at the CNN search page URL, we can see that the from parameter is not referring to a date but a number instead, i.e. if from=5, it will only show the 5th article onwards. Therefore you can remove 'from' and 'to' from your request params.
articles = soup.find_all('article')
This is returning an empty list because there are no <article> tags within the HTML page. Inspecting the CNN HTML we see that the URLs you are looking for are within <div class="card container__item container__item--type- __item __item--type- "> tags so I would change this line to soup.find_all('div', class_="card container__item container__item--type- __item __item--type- ")
document = open('testfile.docx','w') # Open the existing Word document
You've imported the docx module but are not using it. Word documents (which require extra data for formatting) should be opened like this document = Document(). For reference, here are the docx docs: https://python-docx.readthedocs.io/en/latest/
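For example, a minimal python-docx sketch (the filename and link below are just placeholders):
from docx import Document

# Create a new, empty Word document in memory
document = Document()

# Add each scraped link as its own paragraph
document.add_paragraph('https://www.cnn.com/example-article')

# Write the document to disk; python-docx takes care of the .docx formatting
document.save('testfile.docx')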
while articles or page_number == 1:
I don't think this line is needed.
The main issue seems to be that this page requires JavaScript to run in order to render the content. Using requests.get() by itself won't do this. You'll need to use a library such as Requests-HTML. I tried doing this but the articles still didn't render, so I'm not sure.
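For reference, what I tried with Requests-HTML looked roughly like the sketch below; treat it as a starting point only, since the selector and parameters are assumptions and, as noted, the CNN results still didn't render for me:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.cnn.com/search', params={'q': 'mccarthy', 'page': 1})

# render() downloads Chromium on first use and executes the page's JavaScript
r.html.render(sleep=2)

# 'article' is an assumption -- adjust the selector to whatever the rendered markup actually uses
for element in r.html.find('article'):
    print(element.absolute_links)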
I am trying my hand at webscraping using BeautifulSoup.
I had posted this before here, but I was not very clear as to what I wanted, so it only partially answers my issue.
How do I extract only the content from this webpage
I want to extract the content from the webpage and then extract all the links from the output. Please can someone help me understand where I am going wrong.
This is what I have after updating my previous code with the answer provided in the link above.
# The snippet below needs urllib3 and BeautifulSoup imported to run on its own
import urllib3
from bs4 import BeautifulSoup

# Define the content to retrieve (webpage's URL)
quote_page = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'
# Retrieve the page
http = urllib3.PoolManager()
r = http.request('GET', quote_page)
if r.status == 200:
    page = r.data
    print(f'Type of Variable "page": {page.__class__.__name__}')
    print(f'Page Retrieved. Request Status: {r.status}, Page Size: {len(page)}')
else:
    print(f'Some problem occurred. Request status: {r.status}')
# Convert the stream of bytes into a BeautifulSoup representation
soup = BeautifulSoup(page, 'html.parser')
print(f'Type of variable "soup": {soup.__class__.__name__}')
# Check the content
print(f'{soup.prettify()[:1000]}')
# Check the HTML's Title
print(f'Title tag: {soup.title}')
print(f'Title text: {soup.title.string}')
# Find the main content
article_tag = 'p'
articles = soup.find_all(article_tag)
print(f'Type of the variable "articles": {articles.__class__.__name__}')
for p in articles:
    print(p.text)
I then used the code below to get all the links, but I get an error:
# Find the links in the text
# identify the type of tag to retrieve
tag = 'a'
# create a list with the links from the `<a>` tag
tag_list = [t.get('href') for t in articles.find_all(tag)]
tag_list
That is because articles is a ResultSet returned by soup.find_all(article_tag), which you can check with type(articles).
To achieve your goal you have to iterate over articles first, so simply add an additional for-loop to your list comprehension:
[t.get('href') for article in articles for t in article.find_all(tag)]
In addition, you may want to use a set comprehension to avoid duplicates and also concatenate relative paths with the base URL:
list(set(t.get('href') if t.get('href').startswith('http') else 'https://bigbangtheory.fandom.com'+t.get('href') for article in articles for t in article.find_all(tag)))
Output:
['https://bigbangtheory.fandom.com/wiki/The_Killer_Robot_Instability',
'https://bigbangtheory.fandom.com/wiki/Rajesh_Koothrappali',
'https://bigbangtheory.fandom.com/wiki/Bernadette_Rostenkowski-Wolowitz',
'https://bigbangtheory.fandom.com/wiki/The_Valentino_Submergence',
'https://bigbangtheory.fandom.com/wiki/The_Beta_Test_Initiation',
'https://bigbangtheory.fandom.com/wiki/Season_2',
'https://bigbangtheory.fandom.com/wiki/Dr._Pemberton',...]
Ok so I'm working on a self-directed term project for my college programming course. My plan is to scrape different parts of the Overwatch League website for stats etc., save them in a database, and then pull from that database with a Discord bot. However, I'm running into issues with the website itself. Here's a screenshot of the HTML for the standings page.
As you can see it's quite convoluted and hard to navigate, with the repeated div and body tags, and I'm pretty sure it's dynamically created. My prof recommended I find a way to isolate the rank title on the top of the table, then access the parent element and iterate through the siblings to pull data such as the team name, position, etc. into a dictionary for now. I haven't been able to find anything online that helps me; most websites don't provide enough information or are out of date.
Here's what I have so far:
from bs4 import BeautifulSoup
import requests
import link
import re
import pprint
url = 'https://overwatchleague.com/en-us/standings'
response = requests.get(url).text
page = BeautifulSoup(response, features='html.parser')
# for stat in page.find(string=re.compile("rank")):
# statObject = {
# 'standing' : stat.find(string=re.compile, attrs={'class' : 'standings-table-v2styles__TableCellContent-sc-3q1or9-6 jxEkss'}).text.encode('utf-8')
# }
# print(page.find_all('span', re.compile("rank")))
# for tag in page.find_all(re.compile("rank")):
# print(tag.name)
print(page.find(string=re.compile('rank')))
"""
# locate branch with the rank header,
# move up to the parent branch
# iterate through all the siblings and
# save the data to objects
"""
The comments are all failed attempts and all return nothing. The only line not commented out returns a massive JSON blob with a lot of unnecessary information, which does include what I want to parse out and use for my project. I've linked it as a Google Doc and highlighted what I'm looking to grab.
I'm not really sure how else to approach this at this point. I've considered using Selenium; however, I lack knowledge of JavaScript, so I'm trying to avoid it if possible. Even if you could comment with some advice on how else to approach this I would greatly appreciate it.
Thank you
As you have noticed, your data is in JSON format. It is embedded in a script tag directly in the page, so it's easy to get it using BeautifulSoup. Then you need to parse the JSON to extract all the tables (corresponding to the 3 tabs):
import requests
from bs4 import BeautifulSoup
import json
url = 'https://overwatchleague.com/en-us/standings'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
script = soup.find("script",{"id":"__NEXT_DATA__"})
data = json.loads(script.text)
tabs = [
    i.get("standings")["tabs"]
    for i in data["props"]["pageProps"]["blocks"]
    if i.get("standings") is not None
]
result = [
    {i["title"]: i["tables"][0]["teams"]}
    for i in tabs[0]
]
print(json.dumps(result, indent=4, sort_keys=True))
The above code gives you a list of dictionaries; each key is the title of one of the 3 tabs and the value is that tab's table data.
I am trying to web-scrape and am currently stuck on how I should continue with the code. I am trying to write code that scrapes the first 80 Yelp reviews. Since there are only 20 reviews per page, I am also stuck on figuring out how to create a loop to move to the next 20 reviews.
from bs4 import BeautifulSoup
import requests
import time

all_reviews = ''

def get_description(pullman):
    url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
    # get webpage data from url
    response = requests.get(url)
    # sleep for 2 seconds
    time.sleep(2)
    # get html document from web page data
    html_doc = response.text
    # parser
    soup = BeautifulSoup(html_doc, "lxml")
    page_title = soup.title.text
    # get a tag content based on class
    p_tag = soup.find_all('p', class_='lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_')[0]
    # print the text within the tag
    return p_tag.text
General notes/tips:
Use the "Inspect" tool on pages you want to scrape.
As for your question, it's also going to work much nicer if you visit the website once, parse the response into a BeautifulSoup object, and then pass that soup object to your functions: visit once, parse as many times as you want. You won't be blacklisted by websites as often this way. An example structure is below.
url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
# get webpage data from url
response = requests.get(url)
#sleep for 2 seconds
time.sleep(2)
# get html document from web page data
html_doc = response.text
# parser
soup = BeautifulSoup(html_doc, "lxml")
get_description(soup)
get_reviews(soup)
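For instance, a sketch of what get_description might look like once it accepts the soup object instead of fetching the page itself (the class name is copied from the question and is auto-generated, so it may change):
def get_description(soup):
    # pick the first comment paragraph by the (generated) class name from the question
    p_tag = soup.find_all('p', class_='lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_')[0]
    # return the text within the tag
    return p_tag.text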
If you inspect the page, each review appears as a copy of a template. If you take each review as an individual object and parse it, you can get the reviews you are looking for. The review template has the class: lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT
As for pagination, the pagination numbers are contained in a template with class="lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 border-color--default__373c0__2oFDT nowrap__373c0__1_N1j"
The individual page number links are contained within <a href> tags, so just write a for loop to iterate over the links, as in the sketch below.
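A minimal sketch of that idea, assuming the class names above are still what Yelp serves (they are auto-generated and change often, so treat them as placeholders):
from bs4 import BeautifulSoup
import requests

url = 'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Each review is one copy of the review template identified by this class string
review_class = ('lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 '
                'border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT')
for review in soup.find_all('li', class_=review_class):
    p = review.find('p')  # assumes the comment text sits in a <p> inside the template
    if p:
        print(p.text)

# The pagination block holds the page-number links as <a href> tags
pagination_class = ('lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 '
                    'border-color--default__373c0__2oFDT nowrap__373c0__1_N1j')
pagination = soup.find('div', class_=pagination_class)
if pagination:
    page_links = [a.get('href') for a in pagination.find_all('a') if a.get('href')]
    print(page_links)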
To get the next page, you're going to have to follow the "Next" link. The problem here is that the link is just the same as before plus #. Open the Inspector [Ctrl-Shift-I in Chrome, Firefox], switch to the Network tab, then click the next button; you'll see a request to something like:
https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed?rl=en&sort_by=relevance_desc&q=&start=40
which looks something like:
{"reviews": [{"comment": {"text": "Such a great experience every time you come into this place...
This is JSON. The only problem is that you'll need to fool Yelp's servers into thinking you're browsing the website by sending the same headers your browser sends; otherwise you get different data that doesn't look like comments.
They look like this in Chrome
My usual approach is to copy-paste the headers not prefixed with a colon (ignore :authority, etc) directly into a triple-quoted string called raw_headers, then run
headers = dict([[h.partition(':')[0], h.partition(':')[2]] for h in raw_headers.split('\n')])
over them, and pass them as an argument to requests with:
requests.get(url, headers=headers)
Some of the headers won't be necessary, cookies might expire, and all sorts of other issues might arise but this at least gives you a fighting chance.
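Putting that together, a rough sketch; the header names and values below are placeholders, so copy the real ones from your own browser session, and the JSON structure is taken from the snippet above:
import requests

# Paste the headers copied from the Network tab here (these values are placeholders)
raw_headers = """user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
accept: application/json
accept-language: en-US,en;q=0.9"""

# Split each "name: value" line into a dict entry, same approach as above
headers = dict([[h.partition(':')[0], h.partition(':')[2]] for h in raw_headers.split('\n')])

url = ('https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed'
       '?rl=en&sort_by=relevance_desc&q=&start=40')
response = requests.get(url, headers=headers)

# If the headers are accepted, the review feed comes back as JSON
data = response.json()
print(data['reviews'][0]['comment']['text'])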
I would like to thank the user Pythonista for giving me this very useful code a few months back that solved my problem. However, I'm still confused about how the code functions, due to my lack of knowledge of HTML and the BeautifulSoup library.
I'm confused about what part the specific_messages data structure plays in this program.
I'm also confused about how the code saves the various posts,
and how it checks the author of each post.
import requests, pprint
from bs4 import BeautifulSoup as BS

url = "https://forums.spacebattles.com/threads/the-wizard-of-woah-and-the-impossible-methods-of-necromancy.337233/"
r = requests.get(url)
soup = BS(r.content, "html.parser")

# To find all posts from a specific user; everything below this is for all posts
specific_messages = soup.findAll('li', {'data-author': 'The Wizard of Woah!'})

# To find every post from every user
posts = {}
message_container = soup.find('ol', {'id': 'messageList'})
messages = message_container.findAll('li', recursive=0)
for message in messages:
    author = message['data-author']
    # or don't encode to utf-8, simply for printing in shell
    content = message.find('div', {'class': 'messageContent'}).text.strip().encode("utf-8")
    if author in posts:
        posts[author].append(content)
    else:
        posts[author] = [content]
pprint.pprint(posts)
specific_messages = soup.findAll('li', {'data-author': 'The Wizard of Woah!'})
soup is the BeautifulSoup object that is used to parse the HTML.
findAll() is a function that finds every tag matching the parameters you pass to it within the HTML.
li is the tag that needs to be found.
data-author is the HTML attribute that will be searched for inside those tags.
The Wizard of Woah! is the author name.
So basically that line is searching for all the li tags whose data-author attribute is The Wizard of Woah!.
findAll returns multiple matches (a ResultSet), so you need to loop through it to get each post; in the code above, each post's content is appended to a list under its author in the posts dictionary.
That's all.
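As a small illustration of the specific_messages part, which the snippet above defines but never loops over, here is a sketch that prints only that author's posts (the tag and class names are copied from the code above and assume the forum markup hasn't changed):
import requests
from bs4 import BeautifulSoup as BS

url = "https://forums.spacebattles.com/threads/the-wizard-of-woah-and-the-impossible-methods-of-necromancy.337233/"
soup = BS(requests.get(url).content, "html.parser")

# ResultSet of every <li> post whose data-author attribute is this specific user
specific_messages = soup.findAll('li', {'data-author': 'The Wizard of Woah!'})

# Iterate the ResultSet just like the 'messages' loop above, but only for this author
for message in specific_messages:
    content = message.find('div', {'class': 'messageContent'}).text.strip()
    print(content)
    print('-' * 40)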
I have a simple scraping task whose pagination efficiency I would like to improve, and I would like to append results across pages so that I can write the scraping output to a common/single file.
The current task is scraping municipal laws for the city of São Paulo, iterating over the first 10 pages. I would like to find a way to determine the total number of pages for pagination, and have the script automatically cycle through all pages, similar in spirit to this: Handling pagination in lxml.
The xpaths for the pagination links are too poorly defined at the moment for me to understand how to do this effectively. For instance, on the first or last page (1 or 1608), there are only three li nodes, while on page 1605 there are six nodes.
/html/body/div/section/ul[2]/li/a
How may I efficiently account for this pagination; making the determination of pages in an automated way rather than manually, and how can I properly specify the xpaths to cycle through all the appropriate pages, without duplicates?
The existing code is as follows:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from lxml import html

base_url = "http://www.leismunicipais.com.br/legislacao-municipal/5298/leis-de-sao-paulo?q=&page=%d&types=o"

for url in [base_url % i for i in xrange(10)]:
    page = requests.get(url)
    tree = html.fromstring(page.text)
    # This will create a list of titles:
    titles = tree.xpath('/html/body/div/section/ul/li/a/strong/text()')
    # This will create a list of descriptions:
    desc = tree.xpath('/html/body/div/section/ul/li/a/text()')
    # This will create a list of URLs (note @href, not #href, for the attribute)
    urls = tree.xpath('/html/body/div/section/ul/li/a/@href')
    print 'Titles: ', titles
    print 'Description: ', desc
    print 'URL: ', urls
Secondarily, how can I compile/append these results and write them out to JSON, SQL, etc? I prefer JSON due to familiarity, but am rather ambivalent about how to do this at the moment.
You'll need to examine the data layout of your page/site. Each site is different. Look for 'pagination' or 'next' or some slider. Extract the details/count and use that in your loop, along the lines of the sketch below.
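A hedged sketch of that idea for this site, assuming the last numeric pagination link holds the total page count; the URL is taken from the question and the xpath is only an approximation of the /html/body/div/section/ul[2]/li/a path mentioned above, so it will likely need adjusting:
import requests
from lxml import html

base_url = "http://www.leismunicipais.com.br/legislacao-municipal/5298/leis-de-sao-paulo?q=&page={0}&types=o"

first_page = html.fromstring(requests.get(base_url.format(1)).text)

# Collect the text of the pagination links and keep only the numeric labels
page_labels = first_page.xpath('//section/ul[2]/li/a/text()')
page_numbers = [int(t) for t in page_labels if t.strip().isdigit()]
total_pages = max(page_numbers) if page_numbers else 1
print(total_pages)

for page in range(1, total_pages + 1):
    tree = html.fromstring(requests.get(base_url.format(page)).text)
    # ... extract titles, descriptions and URLs here as in the original loop ...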
Import the json library; it has a json.dump function you can use to write the collected results to a single file, as sketched below.
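For example, a minimal sketch of appending the scraped fields to one list and writing it out with json.dump (the field names and sample values are only illustrative):
import json

# One dict per law; in the real script these lists come from the scraping loop
titles = ['Decreto 57998/2017']
descriptions = ['Example description']
urls = ['http://leismunicipa.is']

results = []
for title, description, link in zip(titles, descriptions, urls):
    results.append({'title': title, 'description': description, 'url': link})

# Write everything to a single JSON file at the end instead of printing per page
with open('leis_sao_paulo.json', 'w') as f:
    json.dump(results, f, ensure_ascii=False, indent=2)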
Although I couldn't understand your problem completely, this code should help invigorate a new attempt. The code is compatible with Python 3 and later versions.
import requests
from lxml import html

result = {}
base_url = "https://leismunicipais.com.br/legislacao-municipal/5298/leis-de-sao-paulo?q=&page={0}&types=28&types=5"

for url in [base_url.format(i) for i in range(1, 3)]:
    tree = html.fromstring(requests.get(url).text)
    for title in tree.cssselect(".item-result"):
        try:
            name = ' '.join(title.cssselect(".title a")[0].text.split())
        except Exception:
            name = ""
        try:
            url = ' '.join(title.cssselect(".domain")[0].text.split())
        except Exception:
            url = ""
        result[name] = url
print(result)
Partial output:
{'Decreto 57998/2017': 'http://leismunicipa.is', 'Decreto 58009/2017': 'http://leismunicipa.is'}