I have the following code:
import requests
import json
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/quote/SPY'
result = requests.get(url)
c = result.content
html = BeautifulSoup(c, 'html.parser')
scripts = html.find_all('script')
sl = []
for s in scripts:
    sl.append(s)
s = sl[-3]
s = s.contents
s = str(s)
s = s[119:-16]
s = json.dumps(s)
json_data = json.loads(s)
When I check the data type of json_data, I get a string. I am assuming there are some text encoding errors in the JSON data that prevent it from being recognized as a JSON object.
However, when I dump the data into a file and paste it into an online JSON parser, the parser reads the JSON properly and recognizes keys and values.
How can I fix this so that I can properly access the data within the JSON object?
Change [119:-16] to [112:-12] and you will get the JSON as a dictionary:
import requests
from bs4 import BeautifulSoup
import json
url = 'https://finance.yahoo.com/quote/SPY'
result = requests.get(url)
html = BeautifulSoup(result.content, 'html.parser')
script = html.find_all('script')[-3].text
data = script[112:-12]
json_data = json.loads(data)
print(type(json_data))
#print(json_data)
print(json_data.keys())
print(json_data['context'].keys())
print(json_data['context']['dispatcher']['stores']['PageStore']['currentPageName'])
Result:
<class 'dict'>
dict_keys(['context', 'plugins'])
dict_keys(['dispatcher', 'options', 'plugins'])
quote
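A side note on why the original version ended up with a string: calling json.dumps on a str that already contains JSON just wraps it in a JSON string literal, so the json.loads that follows decodes it right back to the same str. A minimal illustration:

import json

s = '{"a": 1}'                    # a str that already contains JSON
print(json.loads(json.dumps(s))) # '{"a": 1}' (still a str)
print(json.loads(s))             # {'a': 1} (a dict)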
Related
I am trying to read data from multiple URLs, convert each JSON dataset to a dataframe, and save each dataframe in tabular format, like CSV. I am testing this code.
import requests
url = 'https://www.chsli.org/sites/default/files/transparency/111888924_GoodSamaritanHospitalMedicalCenter_standardcharges.json'
r = requests.get(url)
data = r.json()
url = 'https://www.northwell.edu/sites/northwell.edu/files/Northwell_Health_Machine_Readable_File.json'
r = requests.get(url)
data = r.json()
url = 'https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json'
r = requests.get(url)
data = r.json()
url = 'https://www.kaleidahealth.org/general-information/330005_Kaleida-Health_StandardCharges.json'
r = requests.get(url)
data = r.json()
url = 'https://www.mskcc.org/teaser/standard-charges-nyc.json'
r = requests.get(url)
data = r.json()
That code seems to read each URL fine. I guess I'm stuck on how to standardize the process of converting multiple JSON data sources into dataframes and saving each dataframe as a CSV. I tested this code.
import pandas as pd
import requests
import json
url = 'https://www.northwell.edu/sites/northwell.edu/files/Northwell_Health_Machine_Readable_File.json'
r = requests.get(url)
data = r.json()
df = pd.json_normalize(data)
df.to_csv(r'C:\Users\ryans\Desktop\northwell.csv')
url = 'https://www.chsli.org/sites/default/files/transparency/111888924_GoodSamaritanHospitalMedicalCenter_standardcharges.json'
r = requests.get(url)
data = r.json()
df = pd.json_normalize(data)
df.to_csv(r'C:\Users\ryans\Desktop\chsli.csv')
That seems to save data in two CSVs, and each one has many, many columns and just a few rows of data. I'm not sure why this happens. Somehow, it seems like pd.json_normalize is NOT converting the JSON into a tabular shape. Any thoughts?
Also, I'd like to parse the URL to include it in the name of the CSV that is saved. So, this 'https://www.northwell.edu/' becomes this 'C:\Users\ryans\Desktop\northwell.csv' and this 'https://www.chsli.org/' becomes this 'C:\Users\ryans\Desktop\chsli.csv'.
For the JSON decoding:
The problem is that each URL has its own data format.
For example, with https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json, the JSON data is inside the data field.
import requests
import json
import pandas as pd
url = 'https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json'
r = requests.get(url)
data = r.json()
data = pd.DataFrame(data['data'], columns=data['columns'], index=data['index'])
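That data/columns/index layout matches pandas' "split" orientation, so, assuming the file stays in that shape, pd.read_json with orient='split' should build the same dataframe in one step:

import pandas as pd

url = 'https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json'
df = pd.read_json(url, orient='split')  # expects the data/columns/index layout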
For the URL parsing:
urls = ['https://www.chsli.org/sites/default/files/transparency/111888924_GoodSamaritanHospitalMedicalCenter_standardcharges.json',
'https://www.northwell.edu/sites/northwell.edu/files/Northwell_Health_Machine_Readable_File.json',
'https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json',
'https://www.kaleidahealth.org/general-information/330005_Kaleida-Health_StandardCharges.json',
'https://www.mskcc.org/teaser/standard-charges-nyc.json']
for u in urls:
    print('C:\\Users\\ryans\\Desktop\\' + u.split('.')[1] + '.csv')
Output:
C:\Users\ryans\Desktop\chsli.csv
C:\Users\ryans\Desktop\northwell.csv
C:\Users\ryans\Desktop\montefiorehealthsystem.csv
C:\Users\ryans\Desktop\kaleidahealth.csv
C:\Users\ryans\Desktop\mskcc.csv
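Putting the two pieces together, a sketch, assuming the Montefiore file is the only one in the data/columns/index layout and the others can go through json_normalize, could look like this:

import requests
import pandas as pd

urls = ['https://www.chsli.org/sites/default/files/transparency/111888924_GoodSamaritanHospitalMedicalCenter_standardcharges.json',
        'https://www.northwell.edu/sites/northwell.edu/files/Northwell_Health_Machine_Readable_File.json',
        'https://www.montefiorehealthsystem.org/documents/charges/462931956_newrochelle_Standard_Charges.json',
        'https://www.kaleidahealth.org/general-information/330005_Kaleida-Health_StandardCharges.json',
        'https://www.mskcc.org/teaser/standard-charges-nyc.json']

for u in urls:
    data = requests.get(u).json()
    if isinstance(data, dict) and {'data', 'columns', 'index'} <= data.keys():
        # pandas "split" layout, as in the Montefiore file
        df = pd.DataFrame(data['data'], columns=data['columns'], index=data['index'])
    else:
        df = pd.json_normalize(data)
    name = u.split('.')[1]  # 'chsli', 'northwell', ...
    df.to_csv('C:\\Users\\ryans\\Desktop\\' + name + '.csv')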
Below is the code. I am trying to scrape the data and push it to Elasticsearch.
import re
import time
import requests
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch

es_client = Elasticsearch(['http://localhost:9200'])

#drop_index = es_client.indices.create(index='blog-sysadmins', ignore=400)
create_index = es_client.indices.delete(index='blog-sysadmins', ignore=[400, 404])

def urlparser(title, url):
    # scrape title
    p = {}
    post = title
    page = requests.get(post).content
    soup = BeautifulSoup(page, 'lxml')
    title_name = soup.title.string
    # scrape tags
    tag_names = []
    desc = soup.findAll(attrs={"property": "article:tag"})
    for x in range(len(desc)):
        tag_names.append(desc[x-1]['content'].encode('utf-8'))
    print(tag_names)
    # payload for elasticsearch
    doc = {
        'date': time.strftime("%Y-%m-%d"),
        'title': title_name,
        'tags': tag_names,
        'url': url
    }
    # ingest payload into elasticsearch
    res = es_client.index(index="blog-sysadmins", doc_type="docs", body=doc)
    time.sleep(0.5)

sitemap_feed = 'https://sysadmins.co.za/sitemap-posts.xml'
page = requests.get(sitemap_feed)
sitemap_index = BeautifulSoup(page.content, 'html.parser')
urlss = [element.text for element in sitemap_index.findAll('loc')]
urls = urlss[0:2]
print('urls', urls)

for x in urls:
    urlparser(x, x)
my error:
SerializationError: ({'date': '2020-07-04', 'title': 'Persistent Storage with OpenEBS on Kubernetes', 'tags': [b'Cassandra', b'Kubernetes', b'Civo', b'Storage'], 'url': 'http://sysadmins.co.za/persistent-storage-with-openebs-on-kubernetes/'}, TypeError("Unable to serialize b'Cassandra' (type: <class 'bytes'>)",))
The JSON serialization error appears when you try to index data that is not a primitive datatype of JavaScript, the language from which JSON was derived. It is a JSON error, not an Elasticsearch one. JSON accepts only a limited set of datatypes inside itself. In your case, the tags field has a bytes datatype, as shown in your error stack:
TypeError("Unable to serialize b'Cassandra' (type: <class 'bytes'>)
To solve your problem you should simply cast your tags content to string. So just change this line:
tag_names.append(desc[x-1]['content'].encode('utf-8'))
to:
tag_names.append(str(desc[x-1]['content']))
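Equivalently, since BeautifulSoup already returns attribute values as str in Python 3, you could simply drop the .encode('utf-8') call:

tag_names.append(desc[x-1]['content'])  # already a str, so it serializes cleanly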
I am trying to parse through an html page using beautiful soup. Specifically, I am looking at this very large array called "g_rgTopCurators" that can be summarized below:
g_rgTopCurators =
[{\"curator_description\":\"Awesome and sometimes overlooked indie games
curated by the orlygift.com team\",
\"last_curation_date\":1538400354,
\"discussion_url\":null,
\"rgTagLineLocalizations\":[],
\"broadcasters\":[],
\"broadcasters_info_available\":1,
\"bFollowed\":null,
\"m_rgAppRecommendations\":
[{ \"appid\":495600,
\"clanid\":9254464,
\"link_url\":\"https:\\\/\\\/www.orlygift.com\\\/games\\\/asteroid-fight\",
\"link_text\":\"\",
\"blurb\":\"Overall, we found Asteroid Fight to be a cool space game. If you want to manage a base and also handle asteroids, this is the right game for you. It\\u2019s definitely fun, unique and it has its own twist.\",
\"time_recommended\":1538400354,
\"comment_count\":0,
\"upvote_count\":0,
\"accountid_creator\":10142231,
\"recommendation_state\":0,
\"received_compensation\":0,
\"received_for_free\":1},
{other app with same params as above},
{other app},
{other app}
],
\"m_rgCreatedApps\":[],
\"m_strCreatorVanityURL\":\"\",
\"m_nCreatorPartnerID\":0,
\"clanID\":\"9254464\",
\"name\":\"Orlygift\",
\"communityLink\":\"https:\\\/\\\/steamcommunity.com\\\/groups\\\/orlygift\",
\"strAvatarHash\":\"839146c7ccac8ee3646059e3af616cb7691e1440\",
\"link\":\"https:\\\/\\\/store.steampowered.com\\\/curator\\\/9254464-Orlygift\\\/\",
\"youtube\":null,
\"facebook_page\":null,
\"twitch\":null,
\"twitter\":null,
\"total_reviews\":50,
\"total_followers\":38665,
\"total_recommended\":50,
\"total_not_recommended\":0,
\"total_informative\":0
},
{another curator},
{another curator}
];
I am trying to figure out how to properly use soup.select() to get every \"name\" for every curator in this large array.
soup = bs4.BeautifulSoup(data["results_html"], "html.parser")
curators = soup.select(" ??? ")
As the response is JSON containing HTML, which in turn contains a script element with more JSON, my first approach was this:
import requests
import json
from bs4 import BeautifulSoup
url="https://store.steampowered.com/curators/ajaxgetcurators/render?start=0&count=50"
response = requests.get(url, headers = {"Accept": "application/json"})
loaded_response = response.json() # Get the JSON response containing the HTML containing the required JSON.
results_html = loaded_response['results_html'] # Get the HTML from the JSON
soup = BeautifulSoup(results_html, 'html.parser')
text = soup.find_all('script')[1].text # Get the script element from the HTML.
# Get the JSON in the HTML script element
jn = json.loads(text[text.index("var g_rgTopCurators = ")+ len("var g_rgTopCurators = "):text.index("var fnCreateCapsule")].strip().rstrip(';'))
for i in jn:  # iterate through the parsed JSON
    print(i['name'])
Outputs:
Cynical Brit Gaming
PC Gamer
Just Good PC Games
...
WGN Chat
Bloody Disgusting Official
Orlygift
There is a quicker way of doing it: get the response body as bytes, decode and unescape it, then go straight to the desired JSON with string manipulation:
import requests
import json
url="https://store.steampowered.com/curators/ajaxgetcurators/render?start=0&count=50"
response = requests.get(url, headers = {"Accept": "application/json"})
text = response.content.decode("unicode_escape") # decode and unescape the response body bytes
# find the JSON
jn = json.loads(text[text.index("var g_rgTopCurators = ")+ len("var g_rgTopCurators = "):text.index("var fnCreateCapsule")].strip().rstrip(';'))
for i in jn:  # iterate through the parsed JSON
    print(i['name'])
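A regex can also stand in for the two index() calls when pulling out the array. A sketch of the same extraction, assuming the variable declaration always ends with ];:

import re
import json
import requests

url = "https://store.steampowered.com/curators/ajaxgetcurators/render?start=0&count=50"
response = requests.get(url, headers={"Accept": "application/json"})
text = response.content.decode("unicode_escape")

match = re.search(r"var g_rgTopCurators = (\[.*?\]);", text, re.DOTALL)
if match:
    for curator in json.loads(match.group(1)):
        print(curator["name"])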
TypeError: POST data should be bytes, an iterable of bytes, or a file object. It cannot be of type str.
import json
import urllib.request as req
from urllib.parse import urlencode
url = "https://apiurl.example/search/"
payload = {"SearchString":"mysearch"}
response = req.urlopen(url, urlencode(payload))
data = response.read()
print(data.decode("utf-8"))
What am I doing wrong? There is nothing wrong with the URL or "payload", as I tried them in the API's online interface. Before I added the urlencode and the utf-8 decode, I got an error saying: "TypeError: can't concat str to bytes". At some point it returned an empty list, but I don't remember what I did then. Anyway, it should return some data as mentioned. Thanks for your time.
I've never used requests that way. Here's an example of how I've done it, checking the result code and decoding the JSON if it was successful:
import json
import requests

action_url = "https://apiurl.example/search/"

# Prepare the headers
header_dict = {}
header_dict['Content-Type'] = 'application/json'

# make the URL request
result = requests.get(action_url, headers=header_dict)
status_code = result.status_code
if status_code == requests.codes.ok:
    records = json.loads(result.content)
    print('Success. Records:')
    print(records)
else:
    print('ERROR. Status: {0}'.format(status_code))
    print('headers: {0}'.format(header_dict))
    print('action_url: {0}'.format(action_url))
    # Show the error messages.
    print(result.text)
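If the API actually expects the payload in a POST body, the requests equivalent would be along these lines (a sketch; the URL and field name come from the question, and the json= form assumes the API accepts a JSON body):

import requests

payload = {"SearchString": "mysearch"}
result = requests.post("https://apiurl.example/search/", json=payload)  # sends the payload as a JSON body
print(result.json())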
I figured it out now.
import urllib.request
import urllib.parse
url = "https://apiurl.example/search"
search_input = input("Search ")
payload = {"SearchString":search_input}
params = urllib.parse.urlencode(payload)
params = params.encode('utf-8')
f = urllib.request.urlopen(url, params)
output = f.read()
print(output)
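The key difference from the failing version is the .encode('utf-8') step: urlencode returns a str, while urlopen's data argument requires bytes:

from urllib.parse import urlencode

params = urlencode({"SearchString": "mysearch"})
print(type(params))                  # <class 'str'>
print(type(params.encode('utf-8'))) # <class 'bytes'>, which is what urlopen's data argument needs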
I am trying to extract tables from an HTML document using the xpath module in Python. If I print the downloaded HTML, I see the full DOM as it should be. However, when I use xpath.get, it gives me a tbody section, but not the one I want, and certainly not the only one that should be there. Here is the script.
import requests
from webscraping import download, xpath

D = download.Download()
url = 'http://labs.mementoweb.org/timemap/json/http://www.awebsiteimscraping.com'
r = requests.get(url)
data = []
mementos = r.json()['mementos']['list']
for memento in mementos:
    data.append(D.get(memento['uri']))
# print(xpath.get(data[10], '//table'))
print(type(data[0]))
# print(data[10])
print(len(data))
I'm new to this, so I don't know if it matters, but the type of each element in data is str.
Convert each element of data from str to dict using json.loads().
Try this:
import requests
import json
from webscraping import download, xpath

D = download.Download()
url = 'http://labs.mementoweb.org/timemap/json/http://www.awebsiteimscraping.com'
r = requests.get(url)
data = []
mementos = r.json()['mementos']['list']
for memento in mementos:
    data.append(D.get(memento['uri']))
# print(xpath.get(data[10], '//table'))
print(type(data[0]))
# print(data[10])
print(len(data))

# json.loads() works on a single string, so convert each element in turn
json_data = [json.loads(d) for d in data]
print(type(json_data[0]))
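One caveat on top of this: the mementos are archived web pages, so if any of them return plain HTML rather than JSON, json.loads will raise an error. A guarded loop (a sketch, not part of the original answer) keeps only the responses that actually parse:

import json

parsed = []
for d in data:
    try:
        parsed.append(json.loads(d))  # keep responses that are valid JSON
    except ValueError:                # plain HTML pages won't parse
        pass
print(len(parsed))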