How do I scrape specific text from different Wikipedia pages with BeautifulSoup? - python

For a little personal project, I would like to scrape the episode summaries that Wikipedia has for TV series:
for example, I started with this page, Andor.
I wrote this script, and it seems to do what I want:
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Specify url of the web page
source = urlopen('https://en.wikipedia.org/wiki/Obi-Wan_Kenobi_(TV_series)').read()
# Make a soup
soup = BeautifulSoup(source,'lxml')
print(set([text.parent.name for text in soup.find_all(text=True)]))
tab = soup.find("table",{"class":"wikitable plainrowheaders wikiepisodetable"})
spans = tab.find_all('td')
# tds with actual text
x = [i for i in range(4,len(spans),5)]
tds = [i for i in spans if spans.index(i) in x]
text = ''
for paragraph in tds:
    text += paragraph.text
#cleaning a bit
import re
text = re.sub(r'\[.*?\]+', '', text)
text = text.replace('\n', '')
text
The problem is that this does not work in other cases:
The Big Bang Theory page
Here, you have to go to a main page for the episode list, and then there is a separate page for each season.
Another, different example is:
Loki
Here, the link to the episode summaries is on the same page as the main article, but you still have to go through another page to reach them.
I would like to know if there is a way to write a script that handles all these cases in a simple way. Or is there a simpler route (maybe, instead of scraping, a Wikipedia database that can be queried for the same information)?

You don't need to scrape Wikipedia, because there is already a client library for it:
pip install wikipedia
Detailed documentation:
https://wikipedia.readthedocs.io/en/latest/code.html#api
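For illustration, here is a minimal sketch of how that library could be used for the episode-summary case above (hedged: the page title is just an example, and the "Episodes" section name may not match a given article's heading exactly):
import wikipedia

# Example page title; any series article should work the same way
page = wikipedia.page("Obi-Wan Kenobi (TV series)")
print(page.title)    # article title
print(page.url)      # canonical URL
print(page.summary)  # plain-text summary of the lead section

# Full plain text of the article; episode summaries can be sliced out of this.
# page.section("Episodes") returns just that section's text, or None if the
# heading does not match exactly.
text = page.content
print(text[:500])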

Related

How to get simple information through a crawler

I am trying to make a simple crawler that scrapes through this https://en.wikipedia.org/wiki/Web_scraping page, then proceeds to extract the 19 links from the See About section. This I manage to do; however, I am also trying to extract the first paragraph from each of those 19 links, and this is where it stops "working". I get the same paragraph from the first page and not from each one. This is what I have so far. I know there might be better options for doing this, but I want to stick to BeautifulSoup and simple Python code.
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/Web_scraping'
data = requests.get('https://en.wikipedia.org/wiki/Web_scraping').text
soup = BeautifulSoup(data, 'html.parser')
def visit():
    try:
        p = soup.p
        print(p.get_text())
    except AttributeError:
        print('<p> Tag was not found')

links_todo = []
links = soup.find('div', {'class': 'div-col'}).find_all('a')
for link in links:
    if 'href' in link.attrs:
        links_todo.append(urljoin(url, link.attrs['href']))

while links_todo:
    url_to_visit = links_todo.pop()
    print('Now visiting:', url_to_visit)
    visit()
Example of the first print
Now visiting: https://en.wikipedia.org/wiki/OpenSocial
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
The intended behaviour is that it prints the first paragraph of every new link it visits, not the same paragraph from the first link. What do I need to do in order to fix this? Or any tips on what I am missing? I am fairly new to Python, so I am still learning the concepts as I work on things.
At the top of your code you define data and soup. Both are tied to https://en.wikipedia.org/wiki/Web_scraping.
Every time you call visit(), you print from soup, and soup never changes.
You need to pass the url to visit(), e.g. visit(url_to_visit). The visit function should accept the url as an argument, visit the page using requests, create a new soup from the returned data, and then print the first paragraph.
Edited to add code explaining my original answer:
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
start_url = 'https://en.wikipedia.org/wiki/Web_scraping'
# Renamed this to start_url to make it clear that this is the source page
data = requests.get(start_url).text
soup = BeautifulSoup(data, 'html.parser')
def visit(new_url):  # the function now accepts a URL as an argument
    try:
        new_data = requests.get(new_url).text  # retrieve the HTML from the URL
        new_soup = BeautifulSoup(new_data, 'html.parser')  # parse the retrieved HTML with Beautiful Soup
        p = new_soup.p
        print(p.get_text())
    except AttributeError:
        print('<p> Tag was not found')

links_todo = []
links = soup.find('div', {'class': 'div-col'}).find_all('a')
for link in links:
    if 'href' in link.attrs:
        links_todo.append(urljoin(start_url, link.attrs['href']))

while links_todo:
    url_to_visit = links_todo.pop()
    print('Now visiting:', url_to_visit)
    visit(url_to_visit)  # here's where we pass each URL to the visit() function

How do I use BS4 to pull similar information from different places in the DOM hierarchy?

I'm trying to scrape information from a series of pages like these two:
https://www.nysenate.gov/legislation/bills/2019/s240
https://www.nysenate.gov/legislation/bills/2019/s8450
What I want to do is build a scraper that can pull down the text of "See Assembly Version of this Bill". In the two links listed above the classes are the same, but on one page it's the only occurrence of that class, while on the other it's the third.
I'm trying to make something like this work:
assembly_version = soup.select_one(".bill-amendment-detail content active > dd")
print(assembly_version)
But I keep getting None
Any thoughts?
url = "https://www.nysenate.gov/legislation/bills/2019/s11"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
assembly_version = soup.find(class_="c-block c-bill-section c-bill--details").find("a").text.strip()
print(assembly_version)
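If you would rather stay with select_one, the original selector fails because the class names need to be chained with dots rather than spaces. A hedged sketch using the class string from the answer above (selector not verified against the live page):
assembly_version = soup.select_one(".c-block.c-bill-section.c-bill--details a")
if assembly_version is not None:
    print(assembly_version.text.strip())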

Web scraping in Python (BeautifulSoup)

I am trying to scrape a website and am currently stuck on how I should continue with the code. I am trying to create a script that scrapes the first 80 Yelp! reviews. Since there are only 20 reviews per page, I am also stuck on figuring out how to create a loop that moves on to the next 20 reviews.
from bs4 import BeautifulSoup
import requests
import time
all_reviews = ''
def get_description(pullman):
    url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
    # get webpage data from url
    response = requests.get(url)
    # sleep for 2 seconds
    time.sleep(2)
    # get html document from web page data
    html_doc = response.text
    # parser
    soup = BeautifulSoup(html_doc, "lxml")
    page_title = soup.title.text
    # get a tag's content based on its class
    p_tag = soup.find_all('p', class_='lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_')[0]
    # return the text within the tag
    return p_tag.text
General notes/tips:
Use the "Inspect" tool on pages you want to scrape.
As for your question, it's also going to work much more nicely if you visit the website once, parse it into a BeautifulSoup object, and then use that soup object in your functions - visit once, parse as many times as you want. You won't be blacklisted by websites as often this way. An example structure is below.
url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
# get webpage data from url
response = requests.get(url)
#sleep for 2 seconds
time.sleep(2)
# get html document from web page data
html_doc = response.text
# parser
soup = BeautifulSoup(html_doc, "lxml")
get_description(soup)
get_reviews(soup)
If you inspect the page, each review appears as a copy of a template. If you take each review as an individual object and parse it, you can get the reviews you are looking for. The review template has the class: lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT
As for pagination, the pagination numbers are contained in a template with class="lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 border-color--default__373c0__2oFDT nowrap__373c0__1_N1j"
The individual page-number links are contained within <a href> tags, so just write a for loop to iterate over the links, as sketched below.
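A minimal sketch of that loop, assuming the auto-generated class string quoted above still matches the pagination container (Yelp regenerates these class names, so treat them as placeholders):
pagination = soup.find(
    'div',
    class_='lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 '
           'border-color--default__373c0__2oFDT nowrap__373c0__1_N1j'
)
if pagination is not None:
    for a in pagination.find_all('a', href=True):
        print(a['href'])  # collect or follow each page link here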
To get the next page, you're going to have to follow the "Next" link. The problem here is that the link is just the same as before plus #. Open the Inspector [Ctrl-Shift-I in Chrome, Firefox], switch to the Network tab, then click the Next button; you'll see a request to something like:
https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed?rl=en&sort_by=relevance_desc&q=&start=40
which looks something like:
{"reviews": [{"comment": {"text": "Such a great experience every time you come into this place...
This is JSON. The only problem is that you'll need to fool Yelp's servers into thinking you're browsing the website by sending them your browser's headers; otherwise you get different data that doesn't look like comments.
You can see those headers in Chrome's Network tab for that request.
My usual approach is to copy-paste the headers not prefixed with a colon (ignore :authority, etc.) directly into a triple-quoted string called raw_headers, then run
headers = dict([[h.partition(':')[0], h.partition(':')[2]] for h in raw_headers.split('\n')])
over them, and pass them as an argument to requests with:
requests.get(url, headers=headers)
Some of the headers won't be necessary, cookies might expire, and all sorts of other issues might arise but this at least gives you a fighting chance.
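Putting those pieces together, a hedged sketch (the header values below are placeholders you would replace with your own copied browser headers; the URL and the JSON field names are the ones shown above, and cookies/tokens go stale quickly):
import requests

raw_headers = """accept: application/json
accept-language: en-US,en;q=0.9
user-agent: Mozilla/5.0 (paste your real browser headers here)"""

headers = dict(
    (h.partition(':')[0].strip(), h.partition(':')[2].strip())
    for h in raw_headers.split('\n')
)

url = ('https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed'
       '?rl=en&sort_by=relevance_desc&q=&start=40')
response = requests.get(url, headers=headers)
data = response.json()  # parsed JSON, assuming the request succeeded
for review in data.get('reviews', []):
    print(review['comment']['text'])  # field names as shown in the sample above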

Beautiful Soup not Scraping all the visible website Data (Python 3)

My issue is that I'm trying to scrape a bunch of different websites to find all the visible text and download it to a .txt file -- unfortunately I'm not getting all of the text that is actually visible on these websites. I have posted a working example of my code below:
import requests
from bs4 import BeautifulSoup
from collections import Counter
urls = ['https://www304.americanexpress.com/credit-card/compare']
with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        website = requests.get(url)
        soup = BeautifulSoup(website.content)
        text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
        for item in text:
            print(item, file=outfile)
If you test out this code, all you get is the following data --
Ratings & Reviews for this card are currently not available
Ratings & Reviews for this card are currently not available
Ratings & Reviews for this card are currently not available
All users of our online services subject to Privacy Statement and agree to be bound by Terms of etc...
How exactly do I get the rest of the visible data on this page? Based on my research, I'm pretty sure it has to do with my parameters in soup.findAll('p'), but I don't know what to use instead to get the rest of the data.
Instead of searching for paragraphs, get the .text from the body:
print(soup.body.text, file=outfile)
If you want to avoid the contents of script tags being written to the results, you can find all tags at the top level (note recursive=False) and join their text:
print(''.join([element.text for element in soup.body.find_all(lambda tag: tag.name != 'script', recursive=False)]))
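For completeness, a hedged sketch that drops those two suggestions into the question's original loop (same URL and output file as above):
import requests
from bs4 import BeautifulSoup

urls = ['https://www304.americanexpress.com/credit-card/compare']
with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        # Join the text of every top-level tag in <body>, skipping <script> tags
        visible = ''.join(
            element.get_text()
            for element in soup.body.find_all(
                lambda tag: tag.name != 'script', recursive=False)
        )
        print(visible, file=outfile)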

How do I draw out specific data from an opened url in Python using urllib2?

I'm new to Python and am playing around with making a very basic web crawler. For instance, I have made a simple function to load a page that shows the high scores for an online game. So I am able to get the source code of the html page, but I need to draw specific numbers from that page. For instance, the webpage looks like this:
http://hiscore.runescape.com/hiscorepersonal.ws?user1=bigdrizzle13
where 'bigdrizzle13' is the unique part of the link. The numbers on that page need to be drawn out and returned. Essentially, I want to build a program where all I would have to do is type in 'bigdrizzle13' and it would output those numbers.
As another poster mentioned, BeautifulSoup is a wonderful tool for this job.
Here's the entire, ostentatiously-commented program. It could use a lot of error tolerance, but as long as you enter a valid username, it will pull all the scores from the corresponding web page.
I tried to comment as well as I could. If you're fresh to BeautifulSoup, I highly recommend working through my example with the BeautifulSoup documentation handy.
The whole program...
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
import sys
URL = "http://hiscore.runescape.com/hiscorepersonal.ws?user1=" + sys.argv[1]
# Grab page html, create BeautifulSoup object
html = urlopen(URL).read()
soup = BeautifulSoup(html)
# Grab the <table id="mini_player"> element
scores = soup.find('table', {'id':'mini_player'})
# Get a list of all the <tr>s in the table, skip the header row
rows = scores.findAll('tr')[1:]
# Helper function to return concatenation of all character data in an element
def parse_string(el):
    text = ''.join(el.findAll(text=True))
    return text.strip()

for row in rows:
    # Get all the text from the <td>s
    data = map(parse_string, row.findAll('td'))
    # Skip the first td, which is an image
    data = data[1:]
    # Do something with the data...
    print data
And here's a test run.
> test.py bigdrizzle13
[u'Overall', u'87,417', u'1,784', u'78,772,017']
[u'Attack', u'140,903', u'88', u'4,509,031']
[u'Defence', u'123,057', u'85', u'3,449,751']
[u'Strength', u'325,883', u'84', u'3,057,628']
[u'Hitpoints', u'245,982', u'85', u'3,571,420']
[u'Ranged', u'583,645', u'71', u'856,428']
[u'Prayer', u'227,853', u'62', u'357,847']
[u'Magic', u'368,201', u'75', u'1,264,042']
[u'Cooking', u'34,754', u'99', u'13,192,745']
[u'Woodcutting', u'50,080', u'93', u'7,751,265']
[u'Fletching', u'53,269', u'99', u'13,051,939']
[u'Fishing', u'5,195', u'99', u'14,512,569']
[u'Firemaking', u'46,398', u'88', u'4,677,933']
[u'Crafting', u'328,268', u'62', u'343,143']
[u'Smithing', u'39,898', u'77', u'1,561,493']
[u'Mining', u'31,584', u'85', u'3,331,051']
[u'Herblore', u'247,149', u'52', u'135,215']
[u'Agility', u'225,869', u'60', u'276,753']
[u'Thieving', u'292,638', u'56', u'193,037']
[u'Slayer', u'113,245', u'73', u'998,607']
[u'Farming', u'204,608', u'51', u'115,507']
[u'Runecraft', u'38,369', u'71', u'880,789']
[u'Hunter', u'384,920', u'53', u'139,030']
[u'Construction', u'232,379', u'52', u'125,708']
[u'Summoning', u'87,236', u'64', u'419,086']
Voila :)
You can use Beautiful Soup to parse the HTML.
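If you would rather use a Python 3 stack, here is a rough, hedged equivalent of the answer above using requests and bs4 (assuming the page still serves the same mini_player table):
import sys
import requests
from bs4 import BeautifulSoup

url = 'http://hiscore.runescape.com/hiscorepersonal.ws?user1=' + sys.argv[1]
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Same structure as above: <table id="mini_player">, with a header row to skip
table = soup.find('table', {'id': 'mini_player'})
for row in table.find_all('tr')[1:]:
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    print(cells[1:])  # the first cell is an image, as noted above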
