Beautiful Soup not Scraping all the visible website Data (Python 3)

My issue is that I'm trying to scrape a bunch of different websites to find all visible text to download to a .txt file -- unfortunately I'm not getting all the possible text I can from these websites. I have posted a working example of my code below:
import requests
from bs4 import BeautifulSoup
from collections import Counter

urls = ['https://www304.americanexpress.com/credit-card/compare']

with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        website = requests.get(url)
        soup = BeautifulSoup(website.content)
        text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
        for item in text:
            print(item, file=outfile)
If you test out this code, all you get is the following data --
Ratings & Reviews for this card are currently not available
Ratings & Reviews for this card are currently not available
Ratings & Reviews for this card are currently not available
All users of our online services subject to Privacy Statement and agree to be bound by Terms of etc...
How exactly do I get the rest of the visible data on this page? Based on my research, I'm pretty sure it has to do with my soup.findAll('p') parameters, but I don't know what to add in instead to get the rest of the data.

Instead of searching for paragraphs, get the .text from the body:
print(soup.body.text, file=outfile)
If you want to avoid script tag contents being written to the results, you can find all top-level tags except script (note recursive=False) and join their text:
print(''.join([element.text for element in soup.body.find_all(lambda tag: tag.name != 'script', recursive=False)]))
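Putting the two ideas together with your original loop, a minimal sketch (dropping style contents as well, which is an assumption about what counts as visible text) could look like this:

import requests
from bs4 import BeautifulSoup

urls = ['https://www304.americanexpress.com/credit-card/compare']

with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        # remove script/style elements so only visible text remains
        for tag in soup(['script', 'style']):
            tag.decompose()
        print(soup.body.get_text(separator='\n', strip=True), file=outfile)

Note that anything the page renders with JavaScript will still be missing, since requests only fetches the static HTML.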

Related

How do I scrape specific text from different wikipedia pages with beautifulsoup?

For a little personal project, I would like to scrape the episode summaries on Wikipedia for TV series;
for example, I started with this page: Andor.
I wrote this script and it seems to do what I would like:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Specify url of the web page
source = urlopen('https://en.wikipedia.org/wiki/Obi-Wan_Kenobi_(TV_series)').read()

# Make a soup
soup = BeautifulSoup(source, 'lxml')
print(set([text.parent.name for text in soup.find_all(text=True)]))

tab = soup.find("table", {"class": "wikitable plainrowheaders wikiepisodetable"})
spans = tab.find_all('td')

# tds with actual text
x = [i for i in range(4, len(spans), 5)]
tds = [i for i in spans if spans.index(i) in x]

text = ''
for paragraph in tds:
    text += paragraph.text

# cleaning a bit
text = re.sub(r'\[.*?\]+', '', text)
text = text.replace('\n', '')
text
The problem is that this does not work in other cases:
The Big Bang Theory page
Here, you have to go to a main page for the episode list, and then there is a separate page for each season.
Another, different example is:
Loki
Here, the link to the episode summaries is on the same page as the main article, but you still have to pass through another page to reach the summaries.
I would like to know if there is a way to write a script that handles all of these cases in a simple way, or whether there is a simpler approach altogether (maybe instead of scraping there is a Wikipedia database that can be accessed to get the same information).
You don't need to scrape Wikipedia, because there is already a client library:
pip install wikipedia
detailed documentation:
https://wikipedia.readthedocs.io/en/latest/code.html#api
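For example, a minimal sketch using that library (the page title here is just an illustration):

import wikipedia

# fetch the article by title and read its plain-text content
page = wikipedia.page("Obi-Wan Kenobi (TV series)")
print(page.summary)           # lead section of the article
print(page.content[:1000])    # full plain-text article, first 1000 characters

The library returns the whole article as plain text, so you would still need to locate the episode summaries within it (or request the per-season list pages by their own titles).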

How to Use Python to Iterate Through A Basic Website To Create List of URLs and then Print The Text of Each

I would like to use Python to scrape all links on the Civil Procedure URL of the Montana Code Annotated, as well as all pages linked on that page, and eventually capture the substantive text at the last link. The problem is that the base URL links to Chapters that also have URLs to Parts. And the Parts URLs have links to the text I want. So this is a "three deep" URL structure with a URL naming convention that does not use a sequential ending, like 1,2,3,4,etc.
I am new to Python, so I broke this down into steps.
FIRST, I used this to extract the text from a single URL with substantive text (i.e., three levels deep):
import requests
from bs4 import BeautifulSoup

url = 'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0010/0250-0190-0010-0010.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
href_elem = soup.find('div', class_='mca-content mca-toc')

with open("Rsync_Test.txt", "w") as f:
    print(href_elem.text, "PAGE_END", file=f)
SECOND, I created a list of URLs and exported it to a .txt file:
import os
import urllib.request
from bs4 import BeautifulSoup

html_page = urllib.request.urlopen("http://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/sections_index.html")
soup = BeautifulSoup(html_page, "html.parser")
url_base = "https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/"

for link in soup.findAll('a'):
    print(url_base + link.get('href')[2:])

os.chdir("/home/rsync/Downloads/")
with open("All_URLs.txt", "w") as f:
    for link in soup.findAll('a'):
        print(url_base + link.get('href')[2:], file=f)
THIRD, I tried to scrape the text from the resulting URL list:
import os
import requests
from bs4 import BeautifulSoup

url_lst = ['https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0010/0250-0190-0010-0010.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0020/0250-0190-0010-0020.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0030/0250-0190-0010-0030.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0040/0250-0190-0010-0040.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0050/0250-0190-0010-0050.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0060/0250-0190-0010-0060.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0070/0250-0190-0010-0070.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0080/0250-0190-0010-0080.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0090/0250-0190-0010-0090.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0100/0250-0190-0010-0100.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0110/0250-0190-0010-0110.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0120/0250-0190-0010-0120.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0130/0250-0190-0010-0130.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0140/0250-0190-0010-0140.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0150/0250-0190-0010-0150.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0160/0250-0190-0010-0160.html'
           ]

for link in url_lst:
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    href_elem = soup.find('div', class_='mca-content mca-toc')

for link in url_lst:
    with open("Rsync_Test.txt", "w") as f:
        print(href_elem.text, "PAGE_END", file=f)
My plan was to put it all together into a single script (after figuring out how to extract URLs that are three levels deep from the base URL). But the third script just iterates over itself without printing separate pages for each URL, leaving only the text from the last URL in the file.
Any tips on how to fix the third script so it scrapes and prints the text from all 16 URLs produced by the second script would be welcome, as would ideas on how to pull this together into something less convoluted.
You are iterating through url_lst twice.
Assuming you want the text of each href written to a file: remove the duplicated for loop, save the results into a list of scraped data, then write that list to a file in its own for loop.
import os
import requests
from bs4 import BeautifulSoup

url_lst = ['https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0010/0250-0190-0010-0010.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0020/0250-0190-0010-0020.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0030/0250-0190-0010-0030.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0040/0250-0190-0010-0040.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0050/0250-0190-0010-0050.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0060/0250-0190-0010-0060.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0070/0250-0190-0010-0070.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0080/0250-0190-0010-0080.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0090/0250-0190-0010-0090.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0100/0250-0190-0010-0100.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0110/0250-0190-0010-0110.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0120/0250-0190-0010-0120.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0130/0250-0190-0010-0130.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0140/0250-0190-0010-0140.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0150/0250-0190-0010-0150.html',
           'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0160/0250-0190-0010-0160.html'
           ]

new_url_list = []
for link in url_lst:
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    href_elem = soup.find('div', class_='mca-content mca-toc')
    new_url_list.append(href_elem.text)

MyFile = open('output.txt', 'w', encoding='utf-8')
for link in new_url_list:
    MyFile.write(link)
MyFile.close()
This will output a file like this
Montana Code Annotated 2019
TITLE 25. CIVIL PROCEDURE
CHAPTER 19. UNIFORM DISTRICT COURT RULES
Part 1. Rules
Form Of Papers Presented For Filing
Rule 1 - Form of Papers Presented for Filing.
(a) Papers Defined. The word "papers" as used in this Rule includes all documents and copies except exhibits and records on appeal from lower courts.
(b) Pagination, Printing, Etc. All papers shall be:
(1) Typewritten, printed or equivalent;
(2) Clear and permanent;
(3) Equally legible to printing;
(4) Of type not smaller than pica;
(5) Only on standard quality opaque, unglazed, recycled paper, 8 1/2" x 11" in size.
(6) Printed one side only, except copies of briefs may be printed on both sides. The original brief shall be printed on one side.
(7) Lines unnumbered or numbered consecutively from the top;
(8) Spaced one and one-half or double;
(9) Page numbered consecutively at the bottom; and
(10) Bound firmly at the top. Matters such as property descriptions or direct quotes may be single spaced. Extraneous documents not in the above format and not readily conformable may be filed in their original form and length.
(c) Format. The first page of all papers shall conform to the following:
And so on until rule 16 in the data.
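To pull the pieces together, one option is to combine the second and third scripts so the URL list is built from the index page and scraped in a single pass. This is only a sketch, assuming the sections_index.html page links every section with a relative href the way the second script expects:

import requests
from bs4 import BeautifulSoup

index_url = "https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/sections_index.html"
url_base = "https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/"

# collect the section URLs from the index page (same idea as the second script)
index_soup = BeautifulSoup(requests.get(index_url).content, "html.parser")
section_urls = [url_base + a.get("href")[2:] for a in index_soup.find_all("a") if a.get("href")]

# scrape each section and write everything to one file, marking page breaks
with open("Rsync_Test.txt", "w", encoding="utf-8") as f:
    for link in section_urls:
        soup = BeautifulSoup(requests.get(link).content, "html.parser")
        href_elem = soup.find("div", class_="mca-content mca-toc")
        if href_elem is not None:
            print(href_elem.text, "PAGE_END", file=f)

The same pattern could be repeated one level up (title page, then chapter pages, then part pages) to cover the full three-deep structure.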

Beautiful Soup web scraping complex html for data

Ok so I'm working on a self-directed term project for my college programming course. My plan is to scrape different parts of the overwatch league website for stats etc, save them in a db and then pull from that db with a discord bot. However, I'm running into issues with the website itself. Here's a screenshot of the html for the standings page.
As you can see, it's quite convoluted and hard to navigate, with the repeated div and body tags, and I'm pretty sure it's dynamically created. My prof recommended I find a way to isolate the rank title at the top of the table, then access the parent element and iterate through its siblings to pull data such as the team name, position, etc. into a dictionary for now. I haven't been able to find anything online that helps; most websites don't provide enough information or are out of date.
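In general BeautifulSoup terms, that kind of navigation would look roughly like the sketch below; the tag names and the 'rank' text match are placeholders, not the real Overwatch League markup:

import re
import requests
from bs4 import BeautifulSoup

page = BeautifulSoup(requests.get('https://overwatchleague.com/en-us/standings').text, 'html.parser')

# hypothetical: find the header cell containing the rank title
rank_header = page.find('th', string=re.compile('rank', re.I))
if rank_header is not None:
    header_row = rank_header.find_parent('tr')
    # walk the following sibling rows and collect their cell text
    for row in header_row.find_next_siblings('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        print(cells)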
Here's what I have so far:
from bs4 import BeautifulSoup
import requests
import link
import re
import pprint

url = 'https://overwatchleague.com/en-us/standings'
response = requests.get(url).text
page = BeautifulSoup(response, features='html.parser')

# for stat in page.find(string=re.compile("rank")):
#     statObject = {
#         'standing' : stat.find(string=re.compile, attrs={'class' : 'standings-table-v2styles__TableCellContent-sc-3q1or9-6 jxEkss'}).text.encode('utf-8')
#     }

# print(page.find_all('span', re.compile("rank")))

# for tag in page.find_all(re.compile("rank")):
#     print(tag.name)

print(page.find(string=re.compile('rank')))

"""
# locate branch with the rank header,
# move up to the parent branch
# iterate through all the siblings and
# save the data to objects
"""
The comments are all failed attempts that return nothing. The only line not commented out returns a massive JSON blob with a lot of unnecessary information, which does include what I want to parse out and use for my project. I've linked it as a Google Doc and highlighted what I'm looking to grab.
I'm not really sure how else to approach this at this point. I've considered using Selenium, but I lack knowledge of JavaScript, so I'm trying to avoid it if possible. Even some advice on how else to approach this would be greatly appreciated.
Thank you
As you have noticed, your data is in JSON format. It is embedded in a script tag directly in the page, so it's easy to get with BeautifulSoup. You then need to parse the JSON to extract all the tables (corresponding to the 3 tabs):
import requests
from bs4 import BeautifulSoup
import json

url = 'https://overwatchleague.com/en-us/standings'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

script = soup.find("script", {"id": "__NEXT_DATA__"})
data = json.loads(script.text)

tabs = [
    i.get("standings")["tabs"]
    for i in data["props"]["pageProps"]["blocks"]
    if i.get("standings") is not None
]

result = [
    {i["title"]: i["tables"][0]["teams"]}
    for i in tabs[0]
]

print(json.dumps(result, indent=4, sort_keys=True))
The above code gives you a list of dictionaries: the keys are the titles of the 3 tabs and the values are the corresponding table data.
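For example, to list just the tab titles and team counts from that result (a small usage sketch, assuming each teams value is a list):

# each entry in result is a one-key dict: {tab title: list of team rows}
for entry in result:
    for title, teams in entry.items():
        print(title, '-', len(teams), 'teams')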

How do I filter out .mp3 links using beautifulsoup from (possibly) broken html? (JSON)

I want to build small tool to help a family member download podcasts off a site.
In order to get the links to the files I first need to filter them out (with bs4 + python3).
The files are on this website (Estonian): Download Page ("Laadi alla" = "Download").
So far my code is as follows:
(most of it is from examples on stackoverflow)
from bs4 import BeautifulSoup
import urllib.request
import re
url = urllib.request.urlopen("http://vikerraadio.err.ee/listing/mystiline_venemaa#?page=1&pagesize=902&phrase=&from=&to=&path=mystiline_venemaa&showAll")
content = url.read()
soup = BeautifulSoup(content, "lxml")
links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.mp3'))]
print ("Links:", links)
Unfortunately I always get only two results.
Output:
Links: ['http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3', 'http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3']
These are not the ones I want.
My best guess is that the page has somewhat broken html and bs4 / the parser is not able to find anything else.
I've tried different parsers, with no change in the result.
Maybe I'm doing something else wrong too.
My goal is to have the individual links in a list for example.
I'll filter out any duplicates / unwanted entries later myself.
Just a quick note, just in case: This is a public radio and all the content is legally hosted.
My new code is:
for link in soup.find_all('d2p1:DownloadUrl'):
    print(link.text)
I am very unsure if the tag is selected correctly.
None of the examples listed in this question are actually working. See the answer below for working code.
Please be aware that the listings on the page are served through an API. So instead of requesting the HTML page, I suggest you request the API link, which returns 200 .mp3 links.
Please follow the steps below:
Request the API link, not the HTML page link.
Check the response; it's JSON, so extract the fields you need.
Help your family, any time :)
Solution
import requests, json
from bs4 import BeautifulSoup

myurl = 'http://vikerraadio.err.ee/api/listing/bypath?path=mystiline_venemaa&page=1&pagesize=200&phrase=&from=&to=&showAll=false'
r = requests.get(myurl)
abc = json.loads(r.text)

all_mp3 = {}
for lstngs in abc['ListItems']:
    for asd in lstngs['Podcasts']:
        all_mp3[asd['DownloadUrl']] = lstngs['Header']

all_mp3
all_mp3 is what you need: a dictionary with the download URLs as keys and the episode names as values.
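Since the end goal is downloading the podcasts, here is a small follow-up sketch (the directory name and the idea of deriving file names from the URLs are assumptions):

import os
import requests

download_dir = 'podcasts'
os.makedirs(download_dir, exist_ok=True)

for url in all_mp3:
    # use the last part of the URL as the local file name
    filename = os.path.join(download_dir, url.rsplit('/', 1)[-1])
    response = requests.get(url)
    with open(filename, 'wb') as f:
        f.write(response.content)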

How do I draw out specific data from an opened url in Python using urllib2?

I'm new to Python and am playing around with making a very basic web crawler. For instance, I have made a simple function to load a page that shows the high scores for an online game. So I am able to get the source code of the html page, but I need to draw specific numbers from that page. For instance, the webpage looks like this:
http://hiscore.runescape.com/hiscorepersonal.ws?user1=bigdrizzle13
where 'bigdrizzle13' is the unique part of the link. The numbers on that page need to be drawn out and returned. Essentially, I want to build a program that all I would have to do is type in 'bigdrizzle13' and it could output those numbers.
As another poster mentioned, BeautifulSoup is a wonderful tool for this job.
Here's the entire, ostentatiously-commented program. It could use a lot of error tolerance, but as long as you enter a valid username, it will pull all the scores from the corresponding web page.
I tried to comment as well as I could. If you're fresh to BeautifulSoup, I highly recommend working through my example with the BeautifulSoup documentation handy.
The whole program...
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
import sys

URL = "http://hiscore.runescape.com/hiscorepersonal.ws?user1=" + sys.argv[1]

# Grab page html, create BeautifulSoup object
html = urlopen(URL).read()
soup = BeautifulSoup(html)

# Grab the <table id="mini_player"> element
scores = soup.find('table', {'id': 'mini_player'})

# Get a list of all the <tr>s in the table, skip the header row
rows = scores.findAll('tr')[1:]

# Helper function to return concatenation of all character data in an element
def parse_string(el):
    text = ''.join(el.findAll(text=True))
    return text.strip()

for row in rows:
    # Get all the text from the <td>s
    data = map(parse_string, row.findAll('td'))

    # Skip the first td, which is an image
    data = data[1:]

    # Do something with the data...
    print data
And here's a test run.
> test.py bigdrizzle13
[u'Overall', u'87,417', u'1,784', u'78,772,017']
[u'Attack', u'140,903', u'88', u'4,509,031']
[u'Defence', u'123,057', u'85', u'3,449,751']
[u'Strength', u'325,883', u'84', u'3,057,628']
[u'Hitpoints', u'245,982', u'85', u'3,571,420']
[u'Ranged', u'583,645', u'71', u'856,428']
[u'Prayer', u'227,853', u'62', u'357,847']
[u'Magic', u'368,201', u'75', u'1,264,042']
[u'Cooking', u'34,754', u'99', u'13,192,745']
[u'Woodcutting', u'50,080', u'93', u'7,751,265']
[u'Fletching', u'53,269', u'99', u'13,051,939']
[u'Fishing', u'5,195', u'99', u'14,512,569']
[u'Firemaking', u'46,398', u'88', u'4,677,933']
[u'Crafting', u'328,268', u'62', u'343,143']
[u'Smithing', u'39,898', u'77', u'1,561,493']
[u'Mining', u'31,584', u'85', u'3,331,051']
[u'Herblore', u'247,149', u'52', u'135,215']
[u'Agility', u'225,869', u'60', u'276,753']
[u'Thieving', u'292,638', u'56', u'193,037']
[u'Slayer', u'113,245', u'73', u'998,607']
[u'Farming', u'204,608', u'51', u'115,507']
[u'Runecraft', u'38,369', u'71', u'880,789']
[u'Hunter', u'384,920', u'53', u'139,030']
[u'Construction', u'232,379', u'52', u'125,708']
[u'Summoning', u'87,236', u'64', u'419,086']
Voila :)
You can use Beautiful Soup to parse the HTML.
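On Python 3, the same idea would look roughly like this (a minimal sketch, assuming requests and bs4 are installed and that the page still uses the mini_player table described above):

import sys
import requests
from bs4 import BeautifulSoup

url = 'http://hiscore.runescape.com/hiscorepersonal.ws?user1=' + sys.argv[1]
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

table = soup.find('table', {'id': 'mini_player'})
for row in table.find_all('tr')[1:]:    # skip the header row
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    print(cells[1:])                    # skip the first td, which is an image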
