HTML Scraping the website with duplicated div class name - python

I am currently working on scraping Baka-Updates (mangaupdates.com).
However, the div class names are duplicated: the same class is used for every field.
My goal is CSV or JSON output, so I would like to use the text in [sCat] as the column names and the text in [sContent] as the values to store.
Is there a way to scrape this kind of website?
Thanks,
Sample
https://www.mangaupdates.com/series.html?id=75363
from lxml import html
import requests
page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)
# Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]/text()')
# Get the actual data
sContent = tree.xpath('//div[@class="sContent"]/text()')
print('sCat: ', sCat)
print('sContent: ', sContent)
I tried this, but could not find anything that works.
@Jasper Nichol M Fabella

I edited your code to collect the text of each element. Maybe it will help.
from lxml import html
import requests
page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)
# print(page.content)
# Get the elements that hold the column names
sCat = tree.xpath('//div[@class="sCat"]')
# Get the elements that hold the actual data
sContent = tree.xpath('//div[@class="sContent"]')
print('sCat: ', len(sCat))
print('sContent: ', len(sContent))
json_dict = {}
# Pair each heading with its content; itertext() also picks up text in nested tags
for i in range(0, len(sCat)):
    sCat_text = ''.join(sCat[i].itertext())
    sContent_text = ''.join(sContent[i].itertext())
    json_dict[sCat_text] = sContent_text
print(json_dict)
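This prints the number of sCat and sContent elements, followed by a dictionary that maps each sCat heading to its sContent text.
Hope it helps.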
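Since the question asks for CSV output, a dictionary like json_dict can be written out with the standard library. This is only a minimal sketch: the keys and values below are made-up placeholders, not data taken from the site, and the text is written to an in-memory buffer instead of a file.

```python
import csv
import io

# Hypothetical stand-in for json_dict; real keys and values come from the page.
json_dict = {"Type": "Manga", "Year": "2010"}

# The sCat headings become the header row, the sContent texts become one data row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(json_dict))
writer.writeheader()
writer.writerow(json_dict)
csv_text = buffer.getvalue()

print(csv_text)
```

To write a real file, open it with newline="" and pass the file object to csv.DictWriter in place of the buffer. Scraped headings often carry stray whitespace, so calling strip() on them first is probably a good idea.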

You can use XPath expressions to build a path to exactly what you want to scrape.

Here is an example with requests and lxml library:
from lxml import html
import requests
r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
tree = html.fromstring(r.content)
sCat = [i.text_content().strip() for i in tree.xpath('//div[@class="sCat"]')]
sContent = [i.text_content().strip() for i in tree.xpath('//div[@class="sContent"]')]
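The two lists line up by position, so they can be paired into one record and dumped as JSON. A small sketch, using placeholder lists in place of the scraped ones (the values are invented for illustration):

```python
import json

# Placeholder lists standing in for the sCat and sContent results.
sCat = ["Type", "Year", "Genre"]
sContent = ["Manga", "2010", "Action"]

# zip() silently stops at the shorter list, so check the lengths first.
if len(sCat) != len(sContent):
    raise ValueError("heading and content counts differ")

record = dict(zip(sCat, sContent))
print(json.dumps(record, indent=2))
```

If the same heading appeared twice on a page, the later value would overwrite the earlier one in the dictionary, so this assumes the headings are unique.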

What are you using to scrape?
If you are using BeautifulSoup, you can search the page for every element with a given class using the find_all method and iterate through the results. Use the special "class_" keyword argument, because class is a reserved word in Python.
Something like:
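import bs4
import requests
page = requests.get('https://www.mangaupdates.com/series.html?id=75363')
soup = bs4.BeautifulSoup(page.content, 'html.parser')
# find_all returns every <div> whose class list includes "sCat"
headings = soup.find_all('div', class_='sCat')
# do rest of your logic work here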
Edit: I was typing on my mobile from a cached page before you made the edits, so I didn't see the changes. I see now that you are using the raw lxml library to parse. Yes, that's faster, but I am not too familiar with it, as I've only used raw lxml for one project. I think you can chain two search methods to get something equivalent.

Related

Scraping webpage using Python and Requests Package

I am looking to scrape a certain number from a website.
When inspecting in chrome I see the following div I want to pull:
<div class="sc-18nh1jk-0 bTfoun css-1p6fq9y">2472.38</div>
This class name looks weird to me. Here is the simple code I use to try to pull the '2472.38' number:
from lxml import html
import requests
r = requests.get('MYWEBSITE')
tree = html.fromstring(r.content)
CurrentPrice = tree.xpath('//div[@class="sc-18nh1jk-0 bTfoun css-1p6fq9y"]')
print(CurrentPrice)
output is: []
Any suggestions? Thanks ahead of time!
It would have been nice if you had provided the website's URL, but I think the page you are trying to scrape uses generated class names, which means the class is dynamic and may not stay the same.
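If only part of a generated class name is stable, one option is to match on that part and ignore the rest of the attribute. The sketch below uses only the standard library and the div from the question. Treating the "sc-" prefix as the stable part is an assumption; which part (if any) stays fixed has to be checked on the real site.

```python
from html.parser import HTMLParser


class PrefixClassText(HTMLParser):
    """Collect the text of <div> elements that have a class token starting with a prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.depth = 0   # greater than 0 while inside a matching <div>
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested <div> inside a match
            return
        classes = (dict(attrs).get("class") or "").split()
        if any(token.startswith(self.prefix) for token in classes):
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


# Markup copied from the question; the "sc-" prefix is an assumption.
sample = '<div class="sc-18nh1jk-0 bTfoun css-1p6fq9y">2472.38</div>'
parser = PrefixClassText("sc-")
parser.feed(sample)
parser.close()
print(parser.texts)
```

With lxml the same idea can be written as the XPath //div[contains(@class, "sc-")], which does a plain substring match on the attribute. Neither approach helps if the number is filled in by JavaScript after the page loads, because then it is not in the HTML that requests receives.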

Beautiful Soup web scraping complex html for data

Ok so I'm working on a self-directed term project for my college programming course. My plan is to scrape different parts of the overwatch league website for stats etc, save them in a db and then pull from that db with a discord bot. However, I'm running into issues with the website itself. Here's a screenshot of the html for the standings page.
As you can see it's quite convoluted and hard to navigate with the repeated div and body tags and I'm pretty sure it's dynamically created. My prof recommended I find a way to isolate the rank title on the top of the table and then access the parent line and then iterate through the siblings to pull the data such as the team name, position etc into a dictionary for now. I haven't been able to find anything online that helps me, most websites don't provide enough information or are out of date.
Here's what I have so far:
from bs4 import BeautifulSoup
import requests
import link
import re
import pprint
url = 'https://overwatchleague.com/en-us/standings'
response = requests.get(url).text
page = BeautifulSoup(response, features='html.parser')
# for stat in page.find(string=re.compile("rank")):
# statObject = {
# 'standing' : stat.find(string=re.compile, attrs={'class' : 'standings-table-v2styles__TableCellContent-sc-3q1or9-6 jxEkss'}).text.encode('utf-8')
# }
# print(page.find_all('span', re.compile("rank")))
# for tag in page.find_all(re.compile("rank")):
# print(tag.name)
print(page.find(string=re.compile('rank')))
"""
# locate branch with the rank header,
# move up to the parent branch
# iterate through all the siblings and
# save the data to objects
"""
The commented-out lines are all failed attempts, and all of them return nothing. The only line that is not commented out returns a massive JSON with a lot of unnecessary information, which does include what I want to parse out and use for my project. I've linked it as a Google Doc and highlighted what I'm looking to grab.
I'm not really sure how else to approach this at this point. I've considered using selenium however I lack knowledge of javascript so I'm trying to avoid it if possible. Even if you could comment with some advice on how else to approach this I would greatly appreciate it.
Thank you
As you have noticed, your data is in JSON format. It is embedded in a script tag directly in the page, so it's easy to get with BeautifulSoup. Then you need to parse the JSON to extract all the tables (corresponding to the 3 tabs):
import requests
from bs4 import BeautifulSoup
import json
url = 'https://overwatchleague.com/en-us/standings'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
script = soup.find("script",{"id":"__NEXT_DATA__"})
data = json.loads(script.text)
tabs = [
    i.get("standings")["tabs"]
    for i in data["props"]["pageProps"]["blocks"]
    if i.get("standings") is not None
]
result = [
    { i["title"] : i["tables"][0]["teams"] }
    for i in tabs[0]
]
print(json.dumps(result, indent=4, sort_keys=True))
The code above gives you a list of dictionaries, one per tab: each key is the title of a tab and each value is that tab's table data.
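To make the nested lookups easier to follow, here is the same walk on a tiny made-up payload. The payload mimics only the keys used in the answer. The tab titles, team names, and the extra "hero" block are invented for illustration, and the real data is far larger.

```python
import json

# Made-up payload that mimics only the keys the answer relies on.
sample_json = """
{"props": {"pageProps": {"blocks": [
    {"hero": {}},
    {"standings": {"tabs": [
        {"title": "Tab A", "tables": [{"teams": [{"name": "Team 1"}]}]},
        {"title": "Tab B", "tables": [{"teams": []}]}
    ]}}
]}}}
"""
data = json.loads(sample_json)

# Blocks without a "standings" key are skipped by the filter.
tabs = [
    block["standings"]["tabs"]
    for block in data["props"]["pageProps"]["blocks"]
    if block.get("standings") is not None
]

# One {title: teams} dictionary per tab, as in the answer.
result = [{tab["title"]: tab["tables"][0]["teams"]} for tab in tabs[0]]
print(result)
```

If the site changes its page structure, the chained lookups raise KeyError or IndexError, so wrapping them in a try block is worth considering for a long-running bot.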

How do I filter out .mp3 links using beautifulsoup from (possibly) broken html? (JSON)

I want to build a small tool to help a family member download podcasts off a site.
In order to get the links to the files I first need to filter them out (with bs4 + python3).
The files are on this website (Estonian): Download Page "Laadi alla" = "Download"
So far my code is as follows:
(most of it is from examples on stackoverflow)
from bs4 import BeautifulSoup
import urllib.request
import re
url = urllib.request.urlopen("http://vikerraadio.err.ee/listing/mystiline_venemaa#?page=1&pagesize=902&phrase=&from=&to=&path=mystiline_venemaa&showAll")
content = url.read()
soup = BeautifulSoup(content, "lxml")
links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.mp3'))]
print ("Links:", links)
Unfortunately I always get only two results.
Output:
Links: ['http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3', 'http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3']
These are not the ones I want.
My best guess is that the page has somewhat broken html and bs4 / the parser is not able to find anything else.
I've tried different parsers, with no change in the result.
Maybe I'm doing something else wrong too.
My goal is to have the individual links in a list for example.
I'll filter out any duplicates / unwanted entries later myself.
Just a quick note, just in case: This is a public radio and all the content is legally hosted.
My new code is:
for link in soup.find_all('d2p1:DownloadUrl'):
    print(link.text)
I am very unsure if the tag is selected correctly.
None of the examples listed in this question are actually working. See the answer below for working code.
Please be aware that the listings on the page are served through an API. So instead of requesting the HTML page, I suggest you request the API link, which returns 200 .mp3 links.
Please follow the below steps:
Request the API link, not the HTML page link
Check the response; it's JSON, so extract the fields you need
Help your Family, All Time :)
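One plausible reason the HTML request shows none of the listing, which is not stated above: in the question's URL every parameter comes after the "#", and that part (the fragment) is not sent to the server. A quick check with the standard library, using the URL from the question:

```python
from urllib.parse import parse_qs, urlsplit

page_url = ("http://vikerraadio.err.ee/listing/mystiline_venemaa"
            "#?page=1&pagesize=902&phrase=&from=&to=&path=mystiline_venemaa&showAll")

parts = urlsplit(page_url)
print(parts.path)      # what the server is actually asked for
print(parts.query)     # empty, because the "?" comes after the "#"
print(parts.fragment)  # stays on the client side

# The fragment looks like a query string, so it can be parsed the same way.
params = parse_qs(parts.fragment.lstrip("?"), keep_blank_values=True)
print(params["path"], params["pagesize"])
```

These are the same parameter names that appear in the API link, which suggests the page's own script reads them and calls the API itself.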
Solution
import requests, json
myurl = 'http://vikerraadio.err.ee/api/listing/bypath?path=mystiline_venemaa&page=1&pagesize=200&phrase=&from=&to=&showAll=false'
r = requests.get(myurl)
abc = json.loads(r.text)
all_mp3 = {}
# Each list item can hold several podcasts; map every download URL to the item's header
for lstngs in abc['ListItems']:
    for asd in lstngs['Podcasts']:
        all_mp3[asd['DownloadUrl']] = lstngs['Header']
print(all_mp3)
all_mp3 is what you need. all_mp3 is a dictionary with download urls as keys and mp3 names as the values.

Accessing web table using Python - NIST website

I am trying to access a table from the NIST website here:
http://physics.nist.gov/PhysRefData/XrayMassCoef/tab3.html
Assume that I click the element zinc. I would like to retrieve the information for Energy, u/p and u[en]/p into 3 columns of a table using python 2.7.
I am beginning to learn BeautifulSoup and Mechanize. However, I am finding it hard to identify a clear pattern in the HTML code relating to the table on this site.
What I am looking for is some way to do something like this:
import mechanize
from bs4 import BeautifulSoup
# Create a browser object first; it provides open()
mech = mechanize.Browser()
page = mech.open("http://physics.nist.gov/PhysRefData/XrayMassCoef/tab3.html")
html = page.read()
soup = BeautifulSoup(html)
My thought was to try:
table = soup.find("table",...)
The ... above would be some identifier. I can't find a clear identifier on the NIST website above.
How would I be able to import this table using python 2.7?
EDIT: Is it possible to put these 3 columns in a table?
If I understood you correctly, try this:
from bs4 import BeautifulSoup
import requests
respond = requests.get("http://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z30.html")
soup = BeautifulSoup(respond.text)
l = soup.find('table').find('table').find_all('tr')
# Skip the first three rows, which hold the table headers
for i in range(3, len(l)):
    print(l[i].get_text())
Edit:
Another way: get the ASCII column and put its rows into the list l:
from bs4 import BeautifulSoup
import requests
respond = requests.get("http://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z30.html")
soup = BeautifulSoup(respond.text)
l = soup.find('table').find('pre').get_text()[145:].split("\n")
print(l)
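To get the three columns into a table, each text row can be split on whitespace and converted to floats. A rough sketch, assuming the rows look like three numbers per line, sometimes with a short label in front. The rows below are invented for illustration and are not NIST values.

```python
# Illustrative rows only; these numbers are made up, not NIST data.
rows = [
    "1.00000E-03  1.500E+03  1.400E+03",
    "",
    "K 2.00000E-03  2.500E+02  2.400E+02",
    "3.00000E-03  8.000E+01  7.500E+01",
]

table = []
for row in rows:
    fields = row.split()
    if len(fields) < 3:
        continue  # skip blank or malformed lines
    # Take the last three fields so that a leading label (assumed possible) is ignored.
    energy, mu_rho, mu_en_rho = (float(x) for x in fields[-3:])
    table.append((energy, mu_rho, mu_en_rho))

for energy, mu_rho, mu_en_rho in table:
    print("%12.5e %12.3e %12.3e" % (energy, mu_rho, mu_en_rho))
```

The names energy, mu_rho and mu_en_rho are just labels chosen here for the Energy, u/p and u[en]/p columns. The snippet runs unchanged on Python 2.7 and Python 3.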

Downloading files from multiple websites.

This is my first Python project so it is very basic and rudimentary.
I often have to clean off viruses for friends and the free programs that I use are updated often. Instead of manually downloading each program, I was trying to create a simple way to automate the process. Since I am also trying to learn python I thought it would be a good opportunity to practice.
Questions:
I have to find the .exe file with some of the links. I can find the correct URL, but I get an error when it tries to download.
Is there a way to add all of the links into a list, and then create a function to go through the list and run the function on each url? I've Google'd quite a bit and I just cannot seem to make it work. Maybe I am not thinking in the right direction?
import urllib, urllib2, re, os
from BeautifulSoup import BeautifulSoup
# Website List
sas = 'http://cdn.superantispyware.com/SUPERAntiSpyware.exe'
tds = 'http://support.kaspersky.com/downloads/utils/tdsskiller.exe'
mbam = 'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1'
tr = 'http://www.simplysup.com/tremover/download.html'
urllist = [sas, tr, tds, tr]
urrllist2 = []
# Find exe files to download
match = re.compile('\.exe')
data = urllib2.urlopen(urllist)
page = BeautifulSoup(data)
# Check links
#def findexe():
for link in page.findAll('a'):
    try:
        href = link['href']
        if re.search(match, href):
            urllist2.append(href)
    except KeyError:
        pass
os.chdir(r"C:\_VirusFixes")
urllib.urlretrieve(urllist2, os.path.basename(urllist2))
As you can see, I have left the function commented out as I cannot get it to work correctly.
Should I abandon the list and just download them individually? I was trying to be efficient.
Any suggestions or if you could point me in the right direction, it would be most appreciated.
In addition to mikez302's answer, here's a slightly more readable way to write your code:
import os
import re
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
websites = [
    'http://cdn.superantispyware.com/SUPERAntiSpyware.exe',
    'http://support.kaspersky.com/downloads/utils/tdsskiller.exe',
    'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1',
    'http://www.simplysup.com/tremover/download.html',
]
download_links = []
for url in websites:
    connection = urllib2.urlopen(url)
    soup = BeautifulSoup(connection)
    connection.close()
    # Collect every link whose href ends in .exe
    for link in soup.findAll('a', {'href': re.compile(r'\.exe$')}):
        download_links.append(link['href'])
for url in download_links:
    # urlretrieve takes the URL and the full destination file name
    urllib.urlretrieve(url, os.path.join(r'C:\_VirusFixes', os.path.basename(url)))
urllib2.urlopen is a function for accessing a single URL. If you want to access multiple ones, you should loop over the list. You should do something like this:
for url in urllist:
    data = urllib2.urlopen(url)
    page = BeautifulSoup(data)
    # Check links
    for link in page.findAll('a'):
        try:
            href = link['href']
            if re.search(match, href):
                urllist2.append(href)
        except KeyError:
            pass
os.chdir(r"C:\_VirusFixes")
# urlretrieve also takes a single URL, so loop over the collected links
for href in urllist2:
    urllib.urlretrieve(href, os.path.basename(href))
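One thing that may also cause the download error mentioned in the question: an href taken from a page is often relative, and a relative link cannot be downloaded until it is resolved against the page URL. The sketch below is written for Python 3, where urljoin lives in urllib.parse (in Python 2 it is in the urlparse module). The href values are hypothetical examples, not links taken from the real pages.

```python
import posixpath
import re
from urllib.parse import urljoin, urlsplit

# Hypothetical hrefs as they might appear on a download page.
page_url = "http://www.simplysup.com/tremover/download.html"
hrefs = ["files/tool.exe", "/downloads/setup.exe?mirror=1", "about.html",
         "http://cdn.example.com/other.exe"]

exe_pattern = re.compile(r"\.exe$", re.IGNORECASE)

downloads = []
for href in hrefs:
    absolute = urljoin(page_url, href)   # resolve relative links
    path = urlsplit(absolute).path       # drop any ?query part
    if exe_pattern.search(path):
        downloads.append((absolute, posixpath.basename(path)))

for url, filename in downloads:
    print(url, "->", filename)
```

Each (url, filename) pair can then be passed to the download call one at a time.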
The code above didn't work for me. In my case it was because the pages assemble their links through a script instead of including them in the HTML. When I ran into that problem I used the following code, which is just a scraper:
import os
import re
import urllib
import urllib2
from bs4 import BeautifulSoup
url = ''
connection = urllib2.urlopen(url)
soup = BeautifulSoup(connection) # Everything the same up to here
regex = r'(.+?)\.zip' # Here we insert the pattern we are looking for
pattern = re.compile(regex)
link = re.findall(pattern, str(soup)) # This finds all the .zip (.exe) in the text
# When it finds all the .zip, it usually comes back with a lot of undesirable
# text, luckily the file name is almost always separated by a space from the
# rest of the text which is why we do the split
link = [i.split(' ')[-1] for i in link]
os.chdir(r"F:\Documents")
# This is the filepath where I want to save everything I download
for i in link:
    # Remember that the text we found doesn't include the .zip (or .exe in your case) so we want to reestablish that.
    # This assumes each match is a full URL; the file is saved under its base name.
    urllib.urlretrieve(i + ".zip", filename=os.path.basename(i) + ".zip")
This is not as efficient as the code in the previous answers, but it will work for almost any site.
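A tighter pattern can cut down on the clean-up after the match. The following is a sketch in Python 3, under the assumption that the links appear in the page text as full http or https URLs; the sample text is made up.

```python
import re

# Made-up page text; real pages will differ.
text = ('var a = "http://example.com/files/tool-1.2.zip"; '
        'see also /local/readme.zipx and '
        '<a href="https://example.com/b.zip">b</a>')

# An http(s) URL with no whitespace, quotes, or angle brackets, ending in .zip.
pattern = re.compile(r"https?://[^\s\"'<>]+?\.zip\b")
links = pattern.findall(text)
print(links)
```

Because the dot is escaped and the character class excludes spaces and quotes, each match is already a clean URL, so the split on spaces is not needed. Relative links are not matched by this pattern.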
