Parse/split URLs in a pandas dataframe using urllib

Parse/split URLs in a pandas dataframe using urllib - python

I'm trying to split URLs and put the fragments in a dataframe. I found this thread pythonic way to parse/split URLs in a pandas dataframe and try to apply it, but for some reason it gives me an error.
I am under Python 3.x so I used the following:
import pandas
import urllib
urls = ['https://www.google.com/something','https://mail.google.com/anohtersomething', 'https://www.amazon.com/yetanotherthing']
df['protocol'],df['domain'],df['path'],df['query'],df['fragment'] = zip(*df['url'].map(urllib.parse.urlsplit))
I get an error saying KeyError: 'urls', not sure what it means.
If someone could help would be great. Thanks.

The example you used assumes that the links are in a dataframe. Here's the correct solution:
import urllib
import pandas as pd
df = pd.DataFrame()
urls = ['https://www.google.com/something','https://mail.google.com/anohtersomething', 'https://www.amazon.com/yetanotherthing']
df['protocol'],df['domain'],df['path'],df['query'],df['fragment'] = zip(*[urllib.parse.urlsplit(x) for x in urls])
Result
protocol domain path query fragment
0 https www.google.com /something
1 https mail.google.com /anohtersomething
2 https www.amazon.com /yetanotherthing

Related

Nutritionix API only returns first 10 results?

I'm trying to extract nutritional information using the Nutritionix API/database in Python. I was able to get a successful query and placed it into a pandas dataframe. However, I'm a bit confused though because the resulting json claims that there are several thousand 'hits' for my query but at most 10 are ever returned. For instance, when I query for Garbanzo, the json file says that there are 513 total_hits, but only 10 are actually returned. Does anyone know what is causing this? The code I'm using is below.
import requests
import json
import pandas as pd
from nutritionix import Nutritionix
nix_apikey = ''
nix_appid = ''
nix = Nutritionix(app_id = nix_appid, api_key = nix_apikey)
results = nix.search('Garbanzo').json()
df = pd.json_normalize(results, record_path = ['hits'])
I'm not including the my api_key or app_id for obvious reasons. Here's a link to the Nutritionix API: https://github.com/leetrout/python-nutritionix
Thanks for any suggestions!

What is the most effective method for adding nested data to a pandas dataframe?

I have been trying to add the JSON data from this API to a pandas data frame. Here is the code I have tried:
url = 'https://api.covid19api.com/summary'
df = pd.read_json(url)
print(df.head())
When running this code, I receive the following error:
ValueError: Mixing dicts with non-Series may lead to ambiguous
ordering.
Any advice on this would be helpful. Thanks in advance.

Hi Matt and welcome on SO. Whenever you work with json it's better to first get the data and have a look at it. In your particular case the key Global is different from the ones in Countries that's why you get that error
import urllib.request
import json
import pandas as pd
url = 'https://api.covid19api.com/summary'
response = urllib.request.urlopen(url)
# the following is the data you should explore
data = json.loads(response.read())
df = pd.DataFrame(data["Countries"])

The JSON has a couple of elements ('Global', 'Countries' and 'Date'), so it would make sense to split it up into separate dataframes, which is not easy to do using pandas.read_json().
import requests
url = 'https://api.covid19api.com/summary'
r = requests.get(url)
data = r.json()
global_data = pd.DataFrame(data['Global'])
countries = pd.DataFrame(data['Countries'])

Json problems to csv

I'm trying to get some stats from the NBA stats page. I'm following this tutorial-idea
https://towardsdatascience.com/using-python-pandas-and-plotly-to-generate-nba-shot-charts-e28f873a99cb
The basic idea is put the data into a csv file.
So I try this code, to get the data from the nba web, trying to get the json file and the convert it to a csv:
import requests
import json
import pandas as pd
from pandas import DataFrame as df
import urllib.request
shot_data_url_start="https://stats.nba.com/events/?flag=3&CFID=33&CFPARAMS=2017-18&PlayerID="
player_id="202695"
shot_data_url_end="&ContextMeasure=FGA&Season=2017-18&section=player&sct=plot"
def shoy_chart(player_id):
full_url = shot_data_url_start + str(player_id) + shot_data_url_end
json = requests.get(full_url, headers=headers).json()
return(json)
data = json['resultSets'][0]['rowSets']
columns = json['resultSets'][0]['headers']
df = pd.DataFrame.from_records(data, columns=columns)
And this is the error that notebook shows to me:
TypeError Traceback (most recent call last)
<ipython-input-42-a3452c3a4fc8> in <module>
18
19
---> 20 data = json['resultSets'][0]['rowSets']
21 columns = json['resultSets'][0]['headers']
22
TypeError: 'module' object is not subscriptable
Anyone can help me, or know another way to get the data into a .csv or excel file?

When imported with import json, the name json is referring to the JSON module of the Python standard library. You cannot use it as a regular variable name. If you rename your variable to something else such as response_json, this part of your code will work.
Regarding the rest of the code, the page https://stats.nba.com/events/ doesn't return any JSON text, it is a regular web page with images, menus, a video player, etc... If you want to access the API that returns the shots in JSON format, you will have to use the https://stats.nba.com/stats/shotchartdetail (with the right query string). This API endpoint is mentioned in the tutorial, in the "Chrome XHR tab and resulting json linked by url" image.

Ok I've changed the code like this:
import requests
import json
import pandas as pd
from pandas import DataFrame as df
import urllib.request
def shot_chart(player_id):
full_url = "https://stats.nba.com/stats/shotchartdetail?AheadBehind=&CFID=33&CFPARAMS=2017-18&ClutchTime=&Conference=&ContextFilter=&ContextMeasure=FGA&DateFrom=&DateTo=&Division=&EndPeriod=10&EndRange=28800&GROUP_ID=&GameEventID=&GameID=&GameSegment=&GroupID=&GroupMode=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&Month=0&OnOff=&OpponentTeamID=0&Outcome=&PORound=0&Period=0&PlayerID=202695&PlayerID1=&PlayerID2=&PlayerID3=&PlayerID4=&PlayerID5=&PlayerPosition=&PointDiff=&Position=&RangeType=0&RookieYear=&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StartPeriod=1&StartRange=0&StarterBench=&TeamID=0&VsConference=&VsDivision=&VsPlayerID1=&VsPlayerID2=&VsPlayerID3=&VsPlayerID4=&VsPlayerID5=&VsTeamID="
response_json = requests.get(full_url, headers=headers)
return(response_json)
data = response_json['resultSets'][0]['rowSets']
columns = response_json['resultSets'][0]['headers']
df = pd.DataFrame.from_records(data, columns=columns)

import requests
import json
import pandas as pd
from pandas import DataFrame as df
import urllib.request
shot_data_url_start="https://stats.nba.com/stats/shotchartdetail?AheadBehind=&CFID=33&CFPARAMS=2019-20&ClutchTime=&Conference=&ContextFilter=&ContextMeasure=FGA&DateFrom=&DateTo=&Division=&EndPeriod=10&EndRange=28800&GROUP_ID=&GameEventID=&GameID=&GameSegment=&GroupID=&GroupMode=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&Month=0&OnOff=&OpponentTeamID=0&Outcome=&PORound=0&Period=0&PlayerID="
player_id="202330"
shot_data_url_end="&PlayerID1=&PlayerID2=&PlayerID3=&PlayerID4=&PlayerID5=&PlayerPosition=&PointDiff=&Position=&RangeType=0&RookieYear=&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StartPeriod=1&StartRange=0&StarterBench=&TeamID=0&VsConference=&VsDivision=&VsPlayerID1=&VsPlayerID2=&VsPlayerID3=&VsPlayerID4=&VsPlayerID5=&VsTeamID="
def shot_chart(player_id):
full_url = shot_data_url_start + str(player_id) + shot_data_url_end
response_json = requests.get(full_url).json()
return(response_json)
data = response_json['resultSets'][0]['rowSets']
columns = response_json['resultSets'][0]['headers']
df = pd.DataFrame.from_records(data, columns=columns)
shot_chart("202330")
What is going on now? the notebook is tucked right know

Try this out
import pandas as pd
from pandas import DataFrame as df
shot_data_url_start = "https://stats.nba.com/stats/shotchartdetail?AheadBehind=&CFID=33&CFPARAMS=2017-18&ClutchTime=&Conference=&ContextFilter=&ContextMeasure=FGA&DateFrom=&DateTo=&Division=&EndPeriod=10&EndRange=28800&GROUP_ID=&GameEventID=&GameID=&GameSegment=&GroupID=&GroupMode=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&Month=0&OnOff=&OpponentTeamID=0&Outcome=&PORound=0&Period=0&PlayerID="
player_id = "204001"
shot_data_url_end = "&PlayerID1=&PlayerID2=&PlayerID3=&PlayerID4=&PlayerID5=&PlayerPosition=&PointDiff=&Position=&RangeType=0&RookieYear=&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StartPeriod=1&StartRange=0&StarterBench=&TeamID=0&VsConference=&VsDivision=&VsPlayerID1=&VsPlayerID2=&VsPlayerID3=&VsPlayerID4=&VsPlayerID5=&VsTeamID="
def get_shot_data(player_id):
full_url = shot_data_url_start + player_id + shot_data_url_end
data = requests.get(
full_url,
headers = {
"User-Agent": "PostmanRuntime/7.4.0"
}
)
return data.json()
shot_results = get_shot_data(player_id)
result_sets = shot_results['resultSets']
first_result_set = result_sets[0]
row_set = first_result_set['rowSet']
set_headers = first_result_set['headers']
df = pd.DataFrame.from_records(row_set, columns=set_headers)
I see how you got confused with that medium post. You were missing the headers and the url for the NBA api wasn't right. That's what #pierre was trying to say in his response. The url you're using isn't right. If you reread that post you were following, you'll see that the author said he had to dig in to dev tools in order to find that actual url to use in order to grab the JSON.
Edit: Forgot to mention that when I didn't pass a User-Agent in the headers, the request would timeout. If you don't pass that in, you won't get a successful response.

Read csv from url

I have the following csv url which works correctly if simply pasted into a browser:
http://www.google.com/finance/historical?q=JSE%3AMTN&startdate=Nov 1, 2011&enddate=Nov 30, 2011&output=csv
However I can't seem to download the csv using pandas. I get the error:
urllib.error.HTTPERROR: HTTP ERROR 400: Bad Request
Code:
import pandas as pd
def main():
url = 'http://www.google.com/finance/historical?q=JSE%3AMTN&startdate=Nov 1, 2011&enddate=Nov 30, 2011&output=csv'
df = pd.read_csv(url)
print(df)
Please could someone point me in the right direction.

That URL is not properly encoded. Your browser automagically replaces the spaces ' ' by '%20', the underlying urllib request from the python standard library doesn't do that. Replace all spaces by '%20' and you are fine.
Also, if you are using pandas 0.16 you can skip all of this since support for Google Finance data is built in now (see http://pandas.pydata.org/pandas-docs/stable/remote_data.html#remote-data-google):
import pandas.io.data as web
df = web.DataReader("F", 'JSE:MTN', "2011-11-01", "2011-11-30")

Given a URL to a text file, what is the simplest way to read the contents of the text file?

In Python, when given the URL for a text file, what is the simplest way to access the contents off the text file and print the contents of the file out locally line-by-line without saving a local copy of the text file?
TargetURL=http://www.myhost.com/SomeFile.txt
#read the file
#print first line
#print second line
#etc

Edit 09/2016: In Python 3 and up use urllib.request instead of urllib2
Actually the simplest way is:
import urllib2 # the lib that handles the url stuff
data = urllib2.urlopen(target_url) # it's a file like object and works just like a file
for line in data: # files are iterable
print line
You don't even need "readlines", as Will suggested. You could even shorten it to: *
import urllib2
for line in urllib2.urlopen(target_url):
print line
But remember in Python, readability matters.
However, this is the simplest way but not the safe way because most of the time with network programming, you don't know if the amount of data to expect will be respected. So you'd generally better read a fixed and reasonable amount of data, something you know to be enough for the data you expect but will prevent your script from been flooded:
import urllib2
data = urllib2.urlopen("http://www.google.com").read(20000) # read only 20 000 chars
data = data.split("\n") # then split it into lines
for line in data:
print line
* Second example in Python 3:
import urllib.request # the lib that handles the url stuff
for line in urllib.request.urlopen(target_url):
print(line.decode('utf-8')) #utf-8 or iso8859-1 or whatever the page encoding scheme is

I'm a newbie to Python and the offhand comment about Python 3 in the accepted solution was confusing. For posterity, the code to do this in Python 3 is
import urllib.request
data = urllib.request.urlopen(target_url)
for line in data:
...
or alternatively
from urllib.request import urlopen
data = urlopen(target_url)
Note that just import urllib does not work.

The requests library has a simpler interface and works with both Python 2 and 3.
import requests
response = requests.get(target_url)
data = response.text

There's really no need to read line-by-line. You can get the whole thing like this:
import urllib
txt = urllib.urlopen(target_url).read()

import urllib2
for line in urllib2.urlopen("http://www.myhost.com/SomeFile.txt"):
print line

Another way in Python 3 is to use the urllib3 package.
import urllib3
http = urllib3.PoolManager()
response = http.request('GET', target_url)
data = response.data.decode('utf-8')
This can be a better option than urllib since urllib3 boasts having
Thread safety.
Connection pooling.
Client-side SSL/TLS verification.
File uploads with multipart encoding.
Helpers for retrying requests and dealing with HTTP redirects.
Support for gzip and deflate encoding.
Proxy support for HTTP and SOCKS.
100% test coverage.

import urllib2
f = urllib2.urlopen(target_url)
for l in f.readlines():
print l

For me, none of the above responses worked straight ahead. Instead, I had to do the following (Python 3):
from urllib.request import urlopen
data = urlopen("[your url goes here]").read().decode('utf-8')
# Do what you need to do with the data.

requests package works really well for simple ui
as #Andrew Mao suggested
import requests
response = requests.get('http://lib.stat.cmu.edu/datasets/boston')
data = response.text
for i, line in enumerate(data.split('\n')):
print(f'{i} {line}')
o/p:
0 The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
1 prices and the demand for clean air', J. Environ. Economics & Management,
2 vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
3 ...', Wiley, 1980. N.B. Various transformations are used in the table on
4 pages 244-261 of the latter.
5
6 Variables in order:
Checkout kaggle notebook on how to extract dataset/dataframe from URL

I do think requests is the best option. Also note the possibility of setting encoding manually.
import requests
response = requests.get("http://www.gutenberg.org/files/10/10-0.txt")
# response.encoding = "utf-8"
hehe = response.text

Just updating here the solution suggested by #ken-kinder for Python 2 to work with Python 3:
import urllib
urllib.request.urlopen(target_url).read()

You can use this, as well for simple methodology:
import requests
url_res = requests.get(url= "http://www.myhost.com/SomeFile.txt")
with open(filename + ".txt", "wb") as file:
file.write(url_res.content)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parse/split URLs in a pandas dataframe using urllib - python

Related

Nutritionix API only returns first 10 results?

What is the most effective method for adding nested data to a pandas dataframe?

Json problems to csv

Read csv from url

Given a URL to a text file, what is the simplest way to read the contents of the text file?

Categories

Resources