Extract json data in web page using pd.read_json()? - python

Trying to extract the table from this page "https://www.hkex.com.hk/Market-Data/Statistics/Consolidated-Reports/Monthly-Bulletin?sc_lang=en#select1=0&select2=28". By inspect/network function of chorme, the data request link is "https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?_=1574650413485". This links looks like json format when access directly. However, the codes using this link does not work.
My codes:
import pandas as pd
url="https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?_=1574650413485"
df = pd.read_json(url)
print(df.info(verbose=True))
print(df)
also tried:
url="https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?"

You can try downloading the json first and then convert it back to DataFrame
import pandas as pd
url='https://www.hkex.com.hk/eng/stat/smstat/mthbull/rpt_turnover_short_selling_current_month_1910.json?_=1574650413485'
import urllib.request, json
with urllib.request.urlopen(url) as r:
data = json.loads(r.read().decode())
df = pd.DataFrame(data['tables'][0]['body'])
columns = [item['text'] for item in data['tables'][0]['header']]
row_count = max(df['row'])
new_df = pd.DataFrame(df.text.values.reshape((row_count,-1)),columns = columns)

Related

Convert json data into dataframe

I am unable to flatten the json data from this API into a dataframe. I have tried using json_normalize but I gives a NotImplemented error. Can someone help me with it? I need the columns: stationId, start, timestep, temperature where there are several values for temperature and rest of the columns should have same values.
import requests
import json
import pandas as pd
response_API = requests.get('https://dwd.api.proxy.bund.dev/v30/stationOverviewExtended?stationIds=10865,G005')
print(response_API.status_code)
data = response_API.text
json.loads(data)
df= ?
You can do it many ways, but your current approach should use json() instead of text
import requests
import json
import pandas as pd
response_API = requests.get('https://dwd.api.proxy.bund.dev/v30/stationOverviewExtended?stationIds=10865,G005')
print(response_API.status_code)
data = response_API.json() <--- it should be json()
print(data)
OR directly read json to df from the URL using read_json()
df = pd.read_json("https://dwd.api.proxy.bund.dev/v30/stationOverviewExtended?stationIds=10865,G005")
print(df)
Edit:
import requests
import json
import pandas as pd
response_API = requests.get('https://dwd.api.proxy.bund.dev/v30/stationOverviewExtended?stationIds=10865,G005')
print(response_API.status_code)
data = response_API.json()
result = []
for station, value in data.items():
for forecast, val in value.items():
if forecast in ['forecast1', 'forecast2']:
result.append(val)
df = pd.DataFrame(result)
print(df)

Web scraping golf data from ESPN. I am receiving 3 ouputs of the same table and only want 1. How can I limit this?

I am new to python and am stuck. I cant figure out how to only output one of the tables given. In the output, it gives the desired table, but three versions of them. The first two are awfully formatted, and the last table is the table desired.
I have tried running a for loop and counting to only print the third table.
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
for df in dfs:
print(df[0:])
Just use index to print the table.
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
print(dfs[2])
OR
print(dfs[-1])
OR If you want to use loop then try that.
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
for df in range(len(dfs)):
if df==2:
print(dfs[df])

How to make a DataFrame from the nested JSON dictionary

I am trying to make a DataFrame with all values from this address: https://www.ebi.ac.uk/pdbe/api/pisa/interfacecomponent/3gcb/0/1/energetics. But The DataFrame I get is very messy and it doesnt provide all the information contained in the JSON dictionary. I am using this code but the result is bad:
import numpy as np
import pandas as pd
import requests
import json
url = 'https://www.ebi.ac.uk/pdbe/api/pisa/interfacecomponent/3gcb/0/1/energetics'
JSONContent = requests.get(url).json()
content = json.dumps(JSONContent, indent = 4, sort_keys=True)
data = json.loads(content)
df = pd.io.json.json_normalize(data)
print df
Can someone help please?

Python loop through URL's

I'm attempting take dates from a dataframe and loop through them within URL's. I've managed to print the URL's (1st code), but when I attempt to turn the URL's json into a dataframe (2nd code) I get this response.
AttributeError: 'str' object has no attribute 'json'
#1st code
import requests
import pandas as pd
df = pd.read_csv('NBADates.csv')
df.to_dict('series')
for row in df.loc[ : ,"Date"]:
url = url_template.format(row=row)
print(url)
Any ideas on what I'm doing wrong?
#2nd code
import requests
import csv
import pandas as pd
url_template = "https://stats.nba.com/stats/leaguedashptstats?College=&Conference=&Country=&DateFrom={row}&DateTo={row}&Division=&DraftPick=&DraftYear=&GameScope=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=Totals&PlayerExperience=&PlayerOrTeam=Player&PlayerPosition=&PtMeasureType=SpeedDistance&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="
df = pd.read_csv('NBADates.csv')
df.to_dict('series')
for row in df.loc[ : ,"Date"]:
url = url_template.format(row=row)
stats = url.json()['resultSets'][0]['rowSet']
headers = url.json()['resultSets'][0]['headers']
stats_df = pd.DataFrame(stats, columns=headers)
# Append to the big dataframe
lineup_df = lineup_df.append(stats_df, ignore_index=True)
lineup_df.to_csv("Stats.csv")
I think you forgot to request the URL. you should send a request and if the response is a json, you should parse it

Converting html table to a pandas dataframe

I have been trying to import a html table from a website and to convert it into a pandas DataFrame. This is my code:
import pandas as pd
table = pd.read_html("http://www.sharesansar.com/c/today-share-price.html")
dfs = pd.DataFrame(data = table)
print dfs
It just displays this:
0 S.No ...
But if I do;
for df in dfs:
print df
It outputs the table..
How can I use pd.Dataframe to scrape the table?
HTML table on the given url is javascript rendered. pd.read_html() doesn't supports javascript rendered pages. You can try with dryscrape like so:
import pandas as pd
import dryscrape
s = dryscrape.Session()
s.visit("http://www.sharesansar.com/c/today-share-price.html")
df = pd.read_html(s.body())[5]
df.head()
Output:

Categories