Python web scraping - NoneType failure - broken HTML? - python

I've got a problem with my parsing script in Python. I've already tried it on another page (Yahoo Finance) and it worked fine. On Morningstar, however, it's not working.
I get an error in the terminal telling me the table variable is a NoneType object. I guess it has to do with the structure of the Morningstar site, but I'm not sure. Maybe someone can tell me what went wrong.
Or is it not possible to use my simple script because of the structure of the Morningstar site?
A simple CSV export directly from Morningstar is not a solution, because I would like to use the script for other sites which don't have this functionality.
import requests
import csv
from bs4 import BeautifulSoup
from lxml import html

url = 'http://financials.morningstar.com/ratios/r.html?t=SBUX&region=USA&culture=en_US'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)

table = soup.find('table', attrs={'class': 'r_table1 text2'})
print table.prettify()  # debugging

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll(['th', 'td']):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

print list_of_rows  # debugging

outfile = open("./test.csv", "wb")
writer = csv.writer(outfile)
writer.writerows(list_of_rows)

The table is loaded dynamically with a separate XHR call to an endpoint that returns a JSONP response. Simulate that request, extract the JSON string from the JSONP wrapper, load it with json, pull the HTML out of the componentData key, and parse it with BeautifulSoup:
import json
import re
import requests
from bs4 import BeautifulSoup

# make a request
url = 'http://financials.morningstar.com/financials/getFinancePart.html?&callback=jsonp1450279445504&t=XNAS:SBUX&region=usa&culture=en-US&cur=&order=asc&_=1450279445578'
response = requests.get(url)

# strip the JSONP callback wrapper and extract the HTML under "componentData"
data = json.loads(re.sub(r'([a-zA-Z_0-9\.]*\()|(\);?$)', '', response.text))["componentData"]

# parse the HTML
soup = BeautifulSoup(data, "html.parser")
table = soup.find('table', attrs={'class': 'r_table1 text2'})
print(table.prettify())
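From there, the asker's original row-extraction loop carries over almost unchanged. A minimal sketch (Python 3; assumes the table parsed above) that writes the table out to test.csv:

import csv

list_of_rows = []
for row in table.find_all('tr'):
    # collect header and data cells alike, with surrounding whitespace stripped
    cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
    list_of_rows.append(cells)

with open('test.csv', 'w', newline='') as outfile:
    csv.writer(outfile).writerows(list_of_rows)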

Related

Extracting link from soup python

I'm trying to make an app that gets the source links on Bandcamp, but I'm kinda stuck. Is there a way to get the source link with BeautifulSoup?
The link I'm trying to get: https://vine.bandcamp.com/album/another-light
The data is within the <script> tags in JSON format, so use BeautifulSoup to get the <script> tag; the data you are after is in its data-tralbum attribute.
Once you get that, have json read it in, then just iterate through the JSON structure:
from bs4 import BeautifulSoup
import requests
import json

url = 'https://vine.bandcamp.com/album/another-light'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# the fifth <script> tag carries the JSON payload in its data-tralbum attribute
script = str(soup.find_all('script')[4]['data-tralbum'])
jsonData = json.loads(script)

trackinfo = jsonData['trackinfo']
links = []
for each in trackinfo:
    links.append(each['file']['mp3-128'])
Output:
print(links)
['https://t4.bcbits.com/stream/efbba461835eff472bd04a2f9e9910a9/mp3-128/1761020287?p=0&ts=1638288735&t=8ae6343808036ab513cd5436ea009e5d0de784e4&token=1638288735_9139d56ec86f2d44b83a11f3eed8caf7075d6039', 'https://t4.bcbits.com/stream/3e5ef92e6d83e853958ed01955c95f5f/mp3-128/1256475880?p=0&ts=1638288735&t=745a6c701cf1c5772489da5467f6cae5d3622818&token=1638288735_7e86a32c635ba92e0b8320ef56a457d988286cff', 'https://t4.bcbits.com/stream/bbb49d4a72cb80feaf759ec7890abbb6/mp3-128/3439518541?p=0&ts=1638288735&t=dcc7ef7d1d7823e227339fb3243385089478ebe7&token=1638288735_5db36a29c58ea038828d7b34b67e13bd80597dd8', 'https://t4.bcbits.com/stream/8c8a69959337f6f4809f6491c2822b45/mp3-128/1330130896?p=0&ts=1638288735&t=d108dac84dfaac901a546c5fcf5064240cca376b&token=1638288735_8d9151aa82e7a00042025f924660dd3a093c2f74', 'https://t4.bcbits.com/stream/4d4253633405f204d7b1c101379a73be/mp3-128/2478242466?p=0&ts=1638288735&t=a8cd539d0ce8ff417f9b69740070870ed9a182a5&token=1638288735_ad8b5e93c8ffef6623615ce82a6754678fa67b67', 'https://t4.bcbits.com/stream/6c4feee38e289aea76080e9ddc997fa5/mp3-128/2243532902?p=0&ts=1638288735&t=83417c3aba0cef0f969f93bac5165e582f24a588&token=1638288735_c1d9d43b4e10cc6d02c822de90eda3a52c382df2', 'https://t4.bcbits.com/stream/a24dc5dad7b619d47b006e26084ff38f/mp3-128/3054008347?p=0&ts=1638288735&t=4563c326a272c9f5b8462fef1d082e46fac7f605&token=1638288735_55978e7edbe0410ff745913224b8740becad59d5', 'https://t4.bcbits.com/stream/6221790d7f55d3b1f006bd5fac5458fe/mp3-128/1500140939?p=0&ts=1638288735&t=9ecc210c53af05f4034ee00cd1a96a043312a4a7&token=1638288735_0f2faba41da8952f841669513d04bdaaae35a629', 'https://t4.bcbits.com/stream/030506909569626a0d2d7d182b61c691/mp3-128/1707615013?p=0&ts=1638288735&t=c8dcbb2c491789928f5cb6ef8b755df999cb58b8&token=1638288735_b278ba825129ae1b5588b47d5cda345ef2db4e58', 'https://t4.bcbits.com/stream/d1ae0cbc281fc81ddd91f3a3e3d80973/mp3-128/2808772965?p=0&ts=1638288735&t=1080ff51fc40bb5b7afb3a2460f3209cbda549e3&token=1638288735_c93249c847acba5cf23521fa745e05b426a5ba05', 'https://t4.bcbits.com/stream/1b9d50f8210bdc3cf4d2e33986f319ae/mp3-128/2751220220?p=0&ts=1638288735&t=9f24f06dfc5c8a06f24f28664438a6f1a75a038c&token=1638288735_f3a98a20b3c344dc5a37a602a41572d5fe8539c1', 'https://t4.bcbits.com/stream/203cd15629ba03e3249f850d5e1ac42e/mp3-128/4188265472?p=0&ts=1638288735&t=4b4bc2f2194c63a1d3b957e3dd6046bd764c272a&token=1638288735_53a70e7d83ce8c2800baeaf92a5c19db4e146e3f', 'https://t4.bcbits.com/stream/c63b5c9ca090b233e675974c7e7ee4b2/mp3-128/258670123?p=0&ts=1638288735&t=a81ae9dc33dea2b2660d13dbbec93dbcb06e6b63&token=1638288735_446d0ae442cbbadbceb342fe4f7b69d0fbab2928', 'https://t4.bcbits.com/stream/2e824d3c643658c8e9e24b548bc8cb0b/mp3-128/2332945345?p=0&ts=1638288735&t=5bdf0264b9ffe4616d920c55f5081744bf0822d4&token=1638288735_872191bb67a3438ef0fd1ce7e8a9e5ca09e6c37e']
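Indexing soup.find_all('script')[4] by position is brittle if Bandcamp reorders its scripts; a slightly more defensive sketch (same assumption that the payload sits in a data-tralbum attribute) selects the tag by that attribute instead:

# match whichever <script> tag carries the data-tralbum attribute
script_tag = soup.find('script', attrs={'data-tralbum': True})
jsonData = json.loads(script_tag['data-tralbum'])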

Why am I unable to web-scrape URL from a hyperlink in this website?

I tried to extract the URL from a hyperlink on this website: https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/
I used the following Python code:

import requests
from bs4 import BeautifulSoup

url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
print(soup.prettify())

links = soup.find_all('a')
for link in links:
    if "href" in link.attrs:
        print(str(link.attrs['href']) + "\n")

The problem is that this code does not return any URLs.
I want to get all of the URLs that are shown on the page.
You are unable to parse it because the data is dynamically loaded: the HTML that eventually appears on the page doesn't exist in the source code you download. JavaScript later parses the window.__SITE variable and renders the data from there.
However, we can replicate this in Python. After downloading the page:
import requests
url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)
You can use re (regex) to extract the encoded page source:
import re
encoded_data = re.search(r"window\.__SITE=\"(.*)\"", req.text).groups()[0]
Afterwards, you can use urllib to URL-decode the text, and json to parse the JSON string data:
from urllib.parse import unquote
from json import loads
json_data = loads(unquote(encoded_data))
You can then parse the JSON tree to get to the HTML source data:
html_src = json_data["site"]["data"]["values"]["layout"]["sections"][1]["rows"][0]["cards"][0]["component"]["settings"]["markdown"]
At that point, you can use your own code to parse the HTML:
soup = BeautifulSoup(html_src, 'html.parser')
print(soup.prettify())
links = soup.find_all('a')
for link in links:
    if "href" in link.attrs:
        print(str(link.attrs['href']) + "\n")
If you put it all together, here's the final script:
import requests
import re
from urllib.parse import unquote
from json import loads
from bs4 import BeautifulSoup

# Download URL
url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)

# Get encoded JSON from HTML source
encoded_data = re.search(r"window\.__SITE=\"(.*)\"", req.text).groups()[0]

# Decode and load as dictionary
json_data = loads(unquote(encoded_data))

# Get the HTML source code for the links
html_src = json_data["site"]["data"]["values"]["layout"]["sections"][1]["rows"][0]["cards"][0]["component"]["settings"]["markdown"]

# Parse it using BeautifulSoup
soup = BeautifulSoup(html_src, 'html.parser')
print(soup.prettify())

# Get links
links = soup.find_all('a')

# For each link...
for link in links:
    if "href" in link.attrs:
        print(str(link.attrs['href']) + "\n")
The links are generated dynamically by JavaScript, and the data can be found in the structure below.

<script id="site-injection">
window.__SITE="your data is here"
</script>

So you need to grab this script element and parse the value of window.__SITE.
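A minimal sketch of that approach, assuming the tag keeps the id="site-injection" shown above and holds its payload as plain text:

import json
import requests
from urllib.parse import unquote
from bs4 import BeautifulSoup

req = requests.get("https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/")
soup = BeautifulSoup(req.text, 'html.parser')

# grab the injected script and cut the quoted value out of window.__SITE="..."
script = soup.find('script', id='site-injection').string
encoded = script.split('window.__SITE="', 1)[1].rsplit('"', 1)[0]
site = json.loads(unquote(encoded))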

How to access specific table shown in inspect using Python and BeautifulSoup for web scraping

I am working on web scraping using Python and BeautifulSoup. My goal is to pull member data from https://thehia.org/directory?&tab=1. There are around 1685 records.
When I view the page source in Chrome, I cannot find the table; it seems the data is pulled in dynamically. But when I use Chrome's inspect option, I can find the "membersTable" table in the div that I need.
How can I use BeautifulSoup to access the membersTable that I can see in the inspector?
You can mimic the POST request the page makes for its content, then use hjson to handle the unquoted keys in the string pulled out of the response:
import requests, hjson
import pandas as pd

data = {'formId': '3721260'}
r = requests.post('https://thehia.org/Sys/MemberDirectory/LoadMembers', data=data)

data = hjson.loads(r.text.replace('while(1); ', ''))
total = data['TotalCount']
structure = data['JsonStructure']
members = hjson.loads(structure)

df = pd.DataFrame([[member[k][0]['v'] for k in member.keys()] for member in members['members'][0]],
                  columns=['Organisation', 'City', 'State', 'Country'])
print(df)
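If you want the rows in a CSV file rather than printed, pandas can write the frame directly (the file name is just an example):

df.to_csv('members.csv', index=False)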
Try this one:

import requests
from bs4 import BeautifulSoup

url = "https://thehia.org/directory?&tab=1"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')

table = soup.find('table', attrs={'class': 'membersTable'})
row_list = []
for row in table.findAll('tr', {'class': ['normal']}):
    data = []
    for cell in row.findAll('td'):
        data.append(cell.text)
    row_list.append(data)
print(row_list)

Trying to print all TR elements and all TD elements from a web page

I am playing around with the script below, trying to get it to write all TR elements and all TD elements from a web page into a CSV file. For some unknown reason, I'm getting no data at all in the CSV file.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

url = "https://my_url"
page = requests.get(url)
pagetext = page.text
soup = BeautifulSoup(pagetext, 'html.parser')

file = open("C:/my_path/test.csv", 'w')
for row in soup.find_all('tr'):
    for col in row.find_all('td'):
        print(col.text)
I am using Python 3.6.
Your url is not a real website address, so requests won't be able to fetch anything. You just need to fix the URL and try again.
I have fixed the code so that you can finish it; as written it only writes the raw cell text, without commas or row breaks, so you will still need to format the CSV output yourself.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

url = "https://www.w3schools.com/html/html_tables.asp"
page = requests.get(url)
pagetext = page.text
soup = BeautifulSoup(pagetext, 'html.parser')

file = open("C:/Test/test2.csv", 'w')
for row in soup.find_all('tr'):
    for col in row.find_all('td'):
        info = col.text
        print(info)
        file.write(info)
file.close()
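The csv import in that script is never used, and file.write(info) writes the cell text with no commas or line breaks. A sketch of the same scrape that produces properly separated rows with csv.writer (same page; the output path is illustrative):

import csv
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.w3schools.com/html/html_tables.asp")
soup = BeautifulSoup(page.text, 'html.parser')

with open("C:/Test/test2.csv", 'w', newline='') as f:
    writer = csv.writer(f)
    for row in soup.find_all('tr'):
        # keep header cells too so the CSV has column names
        cells = [col.get_text(strip=True) for col in row.find_all(['th', 'td'])]
        if cells:
            writer.writerow(cells)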

Python BeautifulSoup find_all() only returns first table from html instead of all tables?

I'm trying to use BeautifulSoup to get the tables from this website: https://www.basketball-reference.com/players/b/bryanko01.html
My code is as follows:
from bs4 import BeautifulSoup
from urllib.request import urlopen
f = open("testhtml.txt", 'w')
url = "https://www.basketball-reference.com/players/b/bryanko01.html"
html = urlopen(url)
bs = BeautifulSoup(html, "html5lib")
totals = [s.encode('utf-8') for s in bs.find_all("table")]
print(len(totals)) # prints 1
f.write(bs.prettify().encode('utf-8'))
f.close()
I write to a file to look at the raw HTML, and there are multiple tables (with table tags), but for some reason my call to find_all("table") only returns one table.
Please let me know if you have any thoughts as to what I may be doing wrong.
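A likely explanation, offered as an assumption about how the site serves its markup: basketball-reference.com ships every table after the first inside HTML comments (they are revealed later by JavaScript), and BeautifulSoup does not search inside comments. A sketch that re-parses the comment contents to recover the hidden tables:

from urllib.request import urlopen
from bs4 import BeautifulSoup, Comment

url = "https://www.basketball-reference.com/players/b/bryanko01.html"
bs = BeautifulSoup(urlopen(url), "html5lib")

# tables visible in the plain HTML
tables = bs.find_all("table")

# pull each HTML comment out and parse it as its own document
for comment in bs.find_all(string=lambda text: isinstance(text, Comment)):
    tables.extend(BeautifulSoup(comment, "html5lib").find_all("table"))

print(len(tables))  # now includes the commented-out tables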
