I am new to Python and am trying to pull a table from a wiki page into a pandas DataFrame. I am using wikipediaapi to retrieve the URL for the page. (Is there a way to pull the table directly using the API instead of pandas?) Also note that I am trying to use the method described here.
Below is my current code:
#packages
import wikipediaapi as wpa
import pandas as pd
#get page url
wiki_page = wpa.Wikipedia('en')
page_py = wiki_page.page('Python_(programming_language)')
page_url = page_py.fullurl
#print(page_url)
page_tables = pd.read_html(page_url)[1]
#print(page_tables)
sum_table_df = pd.DataFrame(data=page_tables)
print("page_tables type: ", type(page_tables))
print("sum_table_df type: ", type(sum_table_df))
print(sum_table_df)
Output:
page_tables type: <class 'pandas.core.frame.DataFrame'>
sum_table_df type: <class 'pandas.core.frame.DataFrame'>
My problem is that the output is not printed the way a normal DataFrame would be (with table formatting). I am not sure what I have done incorrectly. The red underlines mark the headers for each column.
If you are using Jupyter Notebook and want your data frame to be formatted as a table, do not use the print() function to print your data frame. Just run a cell containing the name of your data frame, like this:
In [1]: sum_table_df
Out[1]: your formatted data frame
I am new to using XPath and am confused about how to retrieve text that is wrapped in a link (I do not need the href itself).
URL I am using:
https://en.wikipedia.org/wiki/2022_Major_League_Soccer_season
HTML looks like below
I want to extract the text "Gonzalo Pineda", and the code I have written is this:
import requests
from lxml import html
t = requests.get("https://en.wikipedia.org/wiki/2022_Major_League_Soccer_season")
dom_tree = html.fromstring(t.content)
coach = dom_tree.xpath('//table[contains(@class, "wikitable")][2]//tbody/tr/td[2]/text()')
This is the second table on the Wikipedia page, which is why I wrote //table[contains(@class, "wikitable")][2].
Then I wrote //tbody/tr/td[2]/text(), just following the HTML structure. However, the code returns only whitespace strings (spaces and "\n") without the actual text.
After searching Stack Overflow I tried changing the code to a[contains(@href, "wiki")], but the output was an empty list ([]).
I would really appreciate it if you could help me with this!
Edit:
I can't copy-paste HTML because I just got that from inspect tool (F12).
The HTML line can be found in "Personnel and sponsorships" table second column. Thank you!
This is an approach you can take to extract the table. The extra pandas import is there just to see how the table data comes out.
import requests
from lxml import html
import pandas as pd  # used only to store and inspect the table data
t = requests.get("https://en.wikipedia.org/wiki/2022_Major_League_Soccer_season")
dom_tree = html.fromstring(t.content)
#coach = updated_dom_tree.xpath('//table[contains(@class, "wikitable")][2]//tbody/tr/td[2]/text()')
table_data=[]
table_header=[]
table=dom_tree.xpath("//table")[2]
# the cells of the first row are the column headers
for e_h in table.findall(".//tr")[0]:
    table_header.append(e_h.text_content().strip())
table_data={0:[],1:[],2:[],3:[],4:[]}
for e_tr in table.findall(".//tr")[1:]:
    col_no=0
    for e_td in e_tr.findall(".//td"):
        if e_td.text_content().strip():
            table_data[col_no].append(e_td.text_content().strip())
        else:
            table_data[col_no].append('N/A')  # handle empty cells
        if e_td.get('colspan') == '2':  # e.g. team Colorado Rapids, where one td spans two columns
            col_no+=1
            table_data[col_no].append('N/A')
        else:
            col_no+=1
# replace the numeric keys with the header names
for i in range(0,5):
    table_data[table_header[i]] = table_data.pop(i)
df=pd.DataFrame(table_data)
If you just want the Head coach column, then table_data['Head coach'] will give you those values.
If I understand you correctly, what you are looking for is
dom_tree.xpath('(//table[contains(@class, "wikitable")]//span[@class="flagicon"])[1]//following-sibling::span//a/text()')[0]
Output should be
'Gonzalo Pineda'
EDIT:
To get the name of all head coaches, try:
for coach in dom_tree.xpath('(//table[contains(@class, "wikitable")])[2]//tbody//td[2]'):
    print(coach.xpath('.//a/text()')[0])
Another way to do it - using pandas
import pandas as pd
tab = dom_tree.xpath('(//table[contains(@class, "wikitable")])[2]')[0]
pd.read_html(html.tostring(tab))[0].loc[:, 'Head coach']
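To see why td[2]/text() in the question comes back as whitespace while the a/text() variants work, the effect can be reproduced on a small fragment. This is only a sketch: the snippet below is made up to imitate the shape of one table row (a flag span followed by a linked name), and the real page's markup may differ.

```python
from lxml import html

# Hypothetical fragment imitating one row of the table; not copied from the page
snippet = """
<table class="wikitable">
  <tbody>
    <tr>
      <td>Atlanta United FC</td>
      <td>
        <span class="flagicon"><img alt="Mexico"/></span>
        <span><a href="/wiki/Gonzalo_Pineda">Gonzalo Pineda</a></span>
      </td>
    </tr>
  </tbody>
</table>
"""

tree = html.fromstring(snippet)

# text() selects only the text nodes that are direct children of the td,
# which here are just the line breaks and indentation between the spans
direct_text = tree.xpath('.//td[2]/text()')

# Descending into the link reaches the text node that holds the name
link_text = tree.xpath('.//td[2]//a/text()')

# text_content() gathers the text of all descendants of the cell
cell_text = tree.xpath('.//td[2]')[0].text_content().strip()

print(direct_text)
print(link_text)
print(cell_text)
```

The first list contains only whitespace strings, which matches what the question describes; the other two expressions reach the name because it lives inside the nested a element.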
I'm trying to extract nutritional information using the Nutritionix API/database in Python. I was able to get a successful query and placed it into a pandas dataframe. However, I'm a bit confused though because the resulting json claims that there are several thousand 'hits' for my query but at most 10 are ever returned. For instance, when I query for Garbanzo, the json file says that there are 513 total_hits, but only 10 are actually returned. Does anyone know what is causing this? The code I'm using is below.
import requests
import json
import pandas as pd
from nutritionix import Nutritionix
nix_apikey = ''
nix_appid = ''
nix = Nutritionix(app_id = nix_appid, api_key = nix_apikey)
results = nix.search('Garbanzo').json()
df = pd.json_normalize(results, record_path = ['hits'])
I'm not including my api_key or app_id for obvious reasons. Here's a link to the Nutritionix API: https://github.com/leetrout/python-nutritionix
Thanks for any suggestions!
I need to get data from this website.
It is possible to request information about parcels using a URL pattern, e.g. https://uldk.gugik.gov.pl/?request=GetParcelById&id=260403_4.0001.186/2.
The result for this example will look like this:
0
0103000020840800000100000005000000CBB2062D8C6F224110297D382512144128979BC870702241200E9D7C57161441CFC255973F702241C05EAADB7D161441C7AF26C2606F2241A0AD0EFB67121441CBB2062D8C6F224110297D3825121441
This is WKB format with information about the geometry of the parcel.
The problem is:
I have an Excel spreadsheet with hundreds of parcel ids. How can I get each id from the Excel file, make a request as described at the beginning, and write the result to a file (for example, to Excel)?
Use the xlrd library to read the Excel file and process the parcel ids.
For each parcel id you can access the URL and extract the required information. The following code does this for the given URL:
import requests
r = requests.get('https://uldk.gugik.gov.pl/?request=GetParcelById&id=260403_4.0001.186/2')
result = str(r.content, 'utf-8').split()
# ['0', '0103000020840800000100000005000000CBB2062D8C6F224110297D382512144128979BC870702241200E9D7C57161441CFC255973F702241C05EAADB7D161441C7AF26C2606F2241A0AD0EFB67121441CBB2062D8C6F224110297D3825121441']
As you have several hundred of those ids, I'd write a function to do exactly this job:
import requests
def get_parcel_info(parcel_id):
    url = f'https://uldk.gugik.gov.pl/?request=GetParcelById&id={parcel_id}'
    r = requests.get(url)
    return str(r.content, 'utf-8').split()
get_parcel_info('260403_4.0001.186/2')
# ['0', '0103000020840800000100000005000000CBB2062D8C6F224110297D382512144128979BC870702241200E9D7C57161441CFC255973F702241C05EAADB7D161441C7AF26C2606F2241A0AD0EFB67121441CBB2062D8C6F224110297D3825121441']
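To cover all the ids from the spreadsheet, a function like the one above can be called in a loop and the results collected into rows. The following is a minimal sketch, not a tested client for this service: parse_response, collect_parcels, rows_to_csv and the injected fetch callable are hypothetical names, and the assumption that the first token is a status value and the second is the WKB string comes only from the sample response shown in the question.

```python
import csv
import io


def parse_response(text):
    """Split a response body into (status, wkb); missing parts become ''."""
    parts = text.split()
    status = parts[0] if len(parts) > 0 else ""
    wkb = parts[1] if len(parts) > 1 else ""
    return status, wkb


def collect_parcels(parcel_ids, fetch):
    """Call fetch(parcel_id) for every id and return one dict per parcel."""
    rows = []
    for parcel_id in parcel_ids:
        status, wkb = parse_response(fetch(parcel_id))
        rows.append({"parcel_id": parcel_id, "status": status, "wkb": wkb})
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text, which Excel can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["parcel_id", "status", "wkb"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in for the real HTTP request so the sketch runs offline (made-up WKB value)
def fake_fetch(parcel_id):
    return "0\n0103000020"


rows = collect_parcels(["260403_4.0001.186/2"], fake_fetch)
csv_text = rows_to_csv(rows)
print(csv_text)
```

In real use, fetch would be a function that performs the request and returns the response text. The ids could be read with pandas.read_excel, and pandas.DataFrame(rows).to_excel(...) could write an .xlsx file instead of CSV, provided an Excel engine such as openpyxl is installed.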
I have been trying to add the JSON data from this API to a pandas data frame. Here is the code I have tried:
url = 'https://api.covid19api.com/summary'
df = pd.read_json(url)
print(df.head())
When running this code, I receive the following error:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
Any advice on this would be helpful. Thanks in advance.
Hi Matt, and welcome to SO. Whenever you work with JSON, it's better to first get the data and have a look at it. In your particular case, the Global key is structured differently from the entries in Countries, which is why you get that error.
import urllib.request
import json
import pandas as pd
url = 'https://api.covid19api.com/summary'
response = urllib.request.urlopen(url)
# the following is the data you should explore
data = json.loads(response.read())
df = pd.DataFrame(data["Countries"])
The JSON has a few top-level elements ('Global', 'Countries' and 'Date'), so it makes sense to split it up into separate dataframes, which is not easy to do using pandas.read_json().
import requests
import pandas as pd
url = 'https://api.covid19api.com/summary'
r = requests.get(url)
data = r.json()
# 'Global' is a single dict of values, so wrap it in a list to get a one-row frame
global_data = pd.DataFrame([data['Global']])
countries = pd.DataFrame(data['Countries'])
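The splitting can be tried offline on a small dictionary. This is a sketch under assumptions: the sample payload and its field names are made up to imitate the three top-level keys mentioned above, and are not taken from a real response.

```python
import pandas as pd

# Hypothetical payload with the same top-level keys as the summary response
sample = {
    "Global": {"NewConfirmed": 100, "TotalConfirmed": 1000},
    "Countries": [
        {"Country": "A", "TotalConfirmed": 600},
        {"Country": "B", "TotalConfirmed": 400},
    ],
    "Date": "2020-04-05T06:37:00Z",
}

# A dict of plain scalars has no row index, so pandas refuses to build a frame
try:
    pd.DataFrame(sample["Global"])
    scalar_dict_failed = False
except ValueError:
    scalar_dict_failed = True

# Wrapping the dict in a list yields a one-row frame
global_df = pd.DataFrame([sample["Global"]])

# A list of dicts becomes one row per dict
countries_df = pd.DataFrame(sample["Countries"])

print(global_df)
print(countries_df)
```

Each top-level key gets its own frame, so the differently shaped parts never have to share one table.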
I'm trying to get some stats from the NBA stats page. I'm following this tutorial-idea
https://towardsdatascience.com/using-python-pandas-and-plotly-to-generate-nba-shot-charts-e28f873a99cb
The basic idea is to put the data into a CSV file.
So I tried this code to get the data from the NBA website, fetching the JSON and then converting it to a CSV:
import requests
import json
import pandas as pd
from pandas import DataFrame as df
import urllib.request
shot_data_url_start="https://stats.nba.com/events/?flag=3&CFID=33&CFPARAMS=2017-18&PlayerID="
player_id="202695"
shot_data_url_end="&ContextMeasure=FGA&Season=2017-18&section=player&sct=plot"
def shoy_chart(player_id):
    full_url = shot_data_url_start + str(player_id) + shot_data_url_end
    json = requests.get(full_url, headers=headers).json()
    return(json)
data = json['resultSets'][0]['rowSets']
columns = json['resultSets'][0]['headers']
df = pd.DataFrame.from_records(data, columns=columns)
And this is the error that notebook shows to me:
TypeError Traceback (most recent call last)
<ipython-input-42-a3452c3a4fc8> in <module>
18
19
---> 20 data = json['resultSets'][0]['rowSets']
21 columns = json['resultSets'][0]['headers']
22
TypeError: 'module' object is not subscriptable
Can anyone help me, or does anyone know another way to get the data into a .csv or Excel file?
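When imported with import json, the name json refers to the json module of the Python standard library. The json you assign inside your function is local to that function, so at the top level json is still the module, and a module cannot be indexed. Avoid reusing that name: if you store the function's result in a variable called something else, such as response_json, and index that instead, this part of your code will work.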
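The error can be reproduced without any network access. The sketch below uses a made-up JSON string whose keys (resultSets, headers, rowSet) mirror the names used elsewhere in this thread; the values are invented.

```python
import json

# Hypothetical response body; only the key names follow the thread
payload_text = '{"resultSets": [{"headers": ["A", "B"], "rowSet": [[1, 2], [3, 4]]}]}'

# Indexing the module itself reproduces the TypeError from the traceback
try:
    json["resultSets"]
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

# Parsing into a differently named variable keeps the module usable
response_json = json.loads(payload_text)
first_result = response_json["resultSets"][0]
columns = first_result["headers"]
rows = first_result["rowSet"]

print(error_message)
print(columns, rows)
```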
Regarding the rest of the code, the page https://stats.nba.com/events/ doesn't return any JSON text; it is a regular web page with images, menus, a video player, etc. If you want to access the API that returns the shots in JSON format, you will have to use the https://stats.nba.com/stats/shotchartdetail endpoint (with the right query string). This API endpoint is mentioned in the tutorial, in the "Chrome XHR tab and resulting json linked by url" image.
Ok I've changed the code like this:
import requests
import json
import pandas as pd
from pandas import DataFrame as df
import urllib.request
def shot_chart(player_id):
    full_url = "https://stats.nba.com/stats/shotchartdetail?AheadBehind=&CFID=33&CFPARAMS=2017-18&ClutchTime=&Conference=&ContextFilter=&ContextMeasure=FGA&DateFrom=&DateTo=&Division=&EndPeriod=10&EndRange=28800&GROUP_ID=&GameEventID=&GameID=&GameSegment=&GroupID=&GroupMode=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&Month=0&OnOff=&OpponentTeamID=0&Outcome=&PORound=0&Period=0&PlayerID=202695&PlayerID1=&PlayerID2=&PlayerID3=&PlayerID4=&PlayerID5=&PlayerPosition=&PointDiff=&Position=&RangeType=0&RookieYear=&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StartPeriod=1&StartRange=0&StarterBench=&TeamID=0&VsConference=&VsDivision=&VsPlayerID1=&VsPlayerID2=&VsPlayerID3=&VsPlayerID4=&VsPlayerID5=&VsTeamID="
    response_json = requests.get(full_url, headers=headers)
    return(response_json)
data = response_json['resultSets'][0]['rowSets']
columns = response_json['resultSets'][0]['headers']
df = pd.DataFrame.from_records(data, columns=columns)
import requests
import json
import pandas as pd
from pandas import DataFrame as df
import urllib.request
shot_data_url_start="https://stats.nba.com/stats/shotchartdetail?AheadBehind=&CFID=33&CFPARAMS=2019-20&ClutchTime=&Conference=&ContextFilter=&ContextMeasure=FGA&DateFrom=&DateTo=&Division=&EndPeriod=10&EndRange=28800&GROUP_ID=&GameEventID=&GameID=&GameSegment=&GroupID=&GroupMode=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&Month=0&OnOff=&OpponentTeamID=0&Outcome=&PORound=0&Period=0&PlayerID="
player_id="202330"
shot_data_url_end="&PlayerID1=&PlayerID2=&PlayerID3=&PlayerID4=&PlayerID5=&PlayerPosition=&PointDiff=&Position=&RangeType=0&RookieYear=&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StartPeriod=1&StartRange=0&StarterBench=&TeamID=0&VsConference=&VsDivision=&VsPlayerID1=&VsPlayerID2=&VsPlayerID3=&VsPlayerID4=&VsPlayerID5=&VsTeamID="
def shot_chart(player_id):
    full_url = shot_data_url_start + str(player_id) + shot_data_url_end
    response_json = requests.get(full_url).json()
    return(response_json)
data = response_json['resultSets'][0]['rowSets']
columns = response_json['resultSets'][0]['headers']
df = pd.DataFrame.from_records(data, columns=columns)
shot_chart("202330")
What is going on now? The notebook is stuck right now.
Try this out
import requests
import pandas as pd
shot_data_url_start = "https://stats.nba.com/stats/shotchartdetail?AheadBehind=&CFID=33&CFPARAMS=2017-18&ClutchTime=&Conference=&ContextFilter=&ContextMeasure=FGA&DateFrom=&DateTo=&Division=&EndPeriod=10&EndRange=28800&GROUP_ID=&GameEventID=&GameID=&GameSegment=&GroupID=&GroupMode=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&Month=0&OnOff=&OpponentTeamID=0&Outcome=&PORound=0&Period=0&PlayerID="
player_id = "204001"
shot_data_url_end = "&PlayerID1=&PlayerID2=&PlayerID3=&PlayerID4=&PlayerID5=&PlayerPosition=&PointDiff=&Position=&RangeType=0&RookieYear=&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StartPeriod=1&StartRange=0&StarterBench=&TeamID=0&VsConference=&VsDivision=&VsPlayerID1=&VsPlayerID2=&VsPlayerID3=&VsPlayerID4=&VsPlayerID5=&VsTeamID="
def get_shot_data(player_id):
    full_url = shot_data_url_start + player_id + shot_data_url_end
    data = requests.get(
        full_url,
        headers = {
            "User-Agent": "PostmanRuntime/7.4.0"
        }
    )
    return data.json()
shot_results = get_shot_data(player_id)
result_sets = shot_results['resultSets']
first_result_set = result_sets[0]
row_set = first_result_set['rowSet']
set_headers = first_result_set['headers']
df = pd.DataFrame.from_records(row_set, columns=set_headers)
I see how you got confused with that Medium post. You were missing the headers, and the URL for the NBA API wasn't right. That's what @pierre was trying to say in his response. If you reread the post you were following, you'll see that the author said he had to dig into dev tools to find the actual URL to use in order to grab the JSON.
Edit: Forgot to mention that when I didn't pass a User-Agent in the headers, the request would time out. If you don't pass that in, you won't get a successful response.
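Since requests.get waits indefinitely by default, a request that the server never answers can leave a notebook cell hanging. Below is a hedged sketch of the same call with an explicit timeout and with the query string built from a dictionary. build_shot_url and fetch_shots are hypothetical helper names, and only a handful of the query parameters from the URL above are included, so the endpoint may well reject this shortened form.

```python
from urllib.parse import urlencode

import requests

BASE_URL = "https://stats.nba.com/stats/shotchartdetail"


def build_shot_url(player_id, season):
    """Build the request URL from a dict instead of a hand-concatenated string."""
    # Assumption: a subset of the parameters seen in the full URL, for brevity
    params = {
        "ContextMeasure": "FGA",
        "LeagueID": "00",
        "PlayerID": player_id,
        "Season": season,
        "SeasonType": "Regular Season",
        "TeamID": 0,
    }
    return BASE_URL + "?" + urlencode(params)


def fetch_shots(player_id, season, timeout=10):
    """Send the request with a User-Agent header and give up after `timeout` seconds."""
    response = requests.get(
        build_shot_url(player_id, season),
        headers={"User-Agent": "PostmanRuntime/7.4.0"},
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json()


# Building the URL needs no network access
url = build_shot_url("204001", "2017-18")
print(url)
```

With the timeout set, a stalled request raises an exception instead of blocking the notebook, which makes it easier to tell a missing header apart from a slow connection.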