So I am trying to get data from NFLfastR and my R equivalent code is:
data <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.rds'))
data
I have previously tried pyreadr module as well but that did not work for me. Currently I am using rpy2 module to make it work. Here is the code I am trying:
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
import os
os.environ["R_HOME"] = r"C:\Program Files\R\R-3.6.3"
os.environ["PATH"] = r"C:\Program Files\R\R-3.6.3\bin\x64" + ";" + os.environ["PATH"]
pandas2ri.activate()
readRDS = robjects.r['readRDS']
df = readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.rds'))
df = pandas2ri.ri2py(df)
Rds and Rdata files are difficult to read in other languages than R as the format although open is undocumented. Therefore there are not many options on how to read them in python. One is what you propose. Another is to use pyreadr, but you have to download the file to disk first as pyreadr cannot read directly from an url:
import pyreadr
from urllib.request import urlopen
link="https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.rds"
response = urlopen(link)
content = response.read()
fhandle = open( 'play_by_play_2019.rds', 'wb')
fhandle.write(content)
fhandle.close()
result = pyreadr.read_r("play_by_play_2019.rds")
print(result.keys())
EDIT
pyreadr 0.3.7 includes now a function to download files:
import pyreadr
url = "https://github.com/hadley/nycflights13/blob/master/data/airlines.rda?raw=true"
dst_path = "/some/path/on/disk/airlines.rda"
res = pyreadr.read_r(pyreadr.download_file(url, dst_path), dst_path)
In R, unlike Python, you do not have to qualify every function with its package source unless you face name conflicts. Additionally in R, there is no built-in method. Every function you call resides in a package. But R ships with default packages such as utils, base, stats for routine methods.
Specifically, your working R code calls two functions from base package as shown with double colon aliases:
nfl_url <- 'https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.rds'
data <- base::readRDS(base::url(NFL))
data
Consequently, you need to run the analogous procedure in Python's rpy2 by explicitly importing base package:
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
base = importr("base")
nfl_url <- 'https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.rds'
r_df <- base.readRDS(base.url(nfl_url))
pandas2ri.activate()
py_df = pandas2ri.ri2py(r_df)
Well if you just want to read nflFastR data, you can directly read it in python as follows:
import pandas as pd
pd.read_csv('https://github.com/guga31bb/nflfastR-data/blob/master/data/' \
'play_by_play_2019.csv.gz?raw=True',
compression='gzip', low_memory=False)
But as of now, there is not a way to do this via python. Its hard enough to read an on-premises (.rds) file while reading from a url is something I never saw implemented. So you have to download the file on-premises then you can have read it directly using pyreadr package or rpy2 (if you have R installed) as you mentioned.
Related
So an update, I found my compile issue was that I needed to change my notebook to a py file and choosing save as doesn't do that. So I had to run a different script turn my notebook to a py file. And part of my exe issue was I was using the fopen command that apparently isn't useable when compiled into a exe. So I redid the code to what is above. But now I get a write error when trying to run the script. I can not find anything on write functions with os is there somewhere else I should look?
Original code:
import requests
import json
import pandas as pd
import csv
from pathlib import Path
response = requests.get('url', headers={'CERT': 'cert'}, stream=True).json()
json2 = json.dumps(response)
f = open('data.json', 'r+')
f.write(json2)
f.close()
Path altered code:
import requests
import json
import pandas as pd
import csv
from pathlib import Path
response = requests.get('url', headers={'CERT': 'cert'}, stream=True).json()
json2 = json.dumps(response)
filename = 'data.json'
if '_MEIPASS2' in os.environ:
filename = os.path.join(os.environ['_MEIPASS2'], filename)
fd = open(filename, 'r+')
fd.write(json2)
fd.close()
The changes to the code allowed me to get past the fopen issue but created a write issue. Any ideas?
If you want to write to a file, you have to open it as writable.
fd = open(filename, 'wb')
Although I don't know why you're opening it in binary if you're writing text.
I am doing a data science project.
I am using google notebook for my job
My dataset is residing at here which I want to access directly at python Notebook.
I am using following line of code to get out of it.
df = pd.read_csv('link')
But Command line is throwing an error like below
What should I do?
Its difficult to answer exactly as there lack of data but here you go for this kind of request..
you have to import ZipFile & urlopen in order to get data from url and extract the data from Zip and the use the csv file for pandas processings.
from zipfile import ZipFile
from urllib.request import urlopen
import pandas as pd
import os
URL = 'https://he-s3.s3.amazonaws.com/media/hackathon/hdfc-bank-ml-hiring-challenge/application-scorecard-for-customers/05d2b4ea-c-Dataset.zip'
# open and save the zip file onto computer
url = urlopen(URL)
output = open('05d2b4ea-c-Dataset.zip', 'wb') # note the flag: "wb"
output.write(url.read())
output.close()
# read the zip file as a pandas dataframe
df = pd.read_csv('05d2b4ea-c-Dataset.zip') zip files
# if keeping on disk the zip file is not wanted, then:
os.remove(zipName) # remove the copy of the zipfile on disk
Use urllib module to download into memory the zip file which returns a file-like object that you can read(), pass it to ZipFile(standard package).
Since here there are multiple files like
['test_data/AggregateData_Test.csv', 'test_data/TransactionData_Test.csv', 'train_data/AggregateData_Train.csv', 'train_data/Column_Descriptions.xlsx', 'train_data/sample_submission.csv', 'train_data/TransactionData_Train.csv']
Load it to a dict of dataframes with filename as the key. Altogether the code will be.
from urllib.request import urlopen
from zipfile import ZipFile
from io import BytesIO
zip_in_memory = urlopen("https://he-s3.s3.amazonaws.com/media/hackathon/hdfc-bank-ml-hiring-challenge/application-scorecard-for-customers/05d2b4ea-c-Dataset.zip").read()
z = ZipFile(BytesIO(zip_in_memory))
dict_of_dfs = {file.filename: pd.read_csv(z.open(file.filename))\
for file in z.infolist()\
if file.filename.endswith('.csv')}
Now you can access dataframes of each csv like dict_of_dfs['test_data/AggregateData_Test.csv'].
Ofcourse all of this is unnecessary if you will just download the zip from the link and pass it as a zipfile.
I've been going through the Q&A on this site, for an answer to my question. However, I'm a beginner and I find it difficult to understand some of the solutions. I need a very basic solution.
Could someone please explain a simple solution to 'Downloading a file through http' and 'Saving it to disk, in Windows', to me?
I'm not sure how to use shutil and os modules, either.
The file I want to download is under 500 MB and is an .gz archive file.If someone can explain how to extract the archive and utilise the files in it also, that would be great!
Here's a partial solution, that I wrote from various answers combined:
import requests
import os
import shutil
global dump
def download_file():
global dump
url = "http://randomsite.com/file.gz"
file = requests.get(url, stream=True)
dump = file.raw
def save_file():
global dump
location = os.path.abspath("D:\folder\file.gz")
with open("file.gz", 'wb') as location:
shutil.copyfileobj(dump, location)
del dump
Could someone point out errors (beginner level) and explain any easier methods to do this?
A clean way to download a file is:
import urllib
testfile = urllib.URLopener()
testfile.retrieve("http://randomsite.com/file.gz", "file.gz")
This downloads a file from a website and names it file.gz. This is one of my favorite solutions, from Downloading a picture via urllib and python.
This example uses the urllib library, and it will directly retrieve the file form a source.
For Python3+ URLopener is deprecated.
And when used you will get error as below:
url_opener = urllib.URLopener() AttributeError: module 'urllib' has no
attribute 'URLopener'
So, try:
import urllib.request
urllib.request.urlretrieve(url, filename)
As mentioned here:
import urllib
urllib.urlretrieve ("http://randomsite.com/file.gz", "file.gz")
EDIT: If you still want to use requests, take a look at this question or this one.
Four methods using wget, urllib and request.
#!/usr/bin/python
import requests
from StringIO import StringIO
from PIL import Image
import profile as profile
import urllib
import wget
url = 'https://tinypng.com/images/social/website.jpg'
def testRequest():
image_name = 'test1.jpg'
r = requests.get(url, stream=True)
with open(image_name, 'wb') as f:
for chunk in r.iter_content():
f.write(chunk)
def testRequest2():
image_name = 'test2.jpg'
r = requests.get(url)
i = Image.open(StringIO(r.content))
i.save(image_name)
def testUrllib():
image_name = 'test3.jpg'
testfile = urllib.URLopener()
testfile.retrieve(url, image_name)
def testwget():
image_name = 'test4.jpg'
wget.download(url, image_name)
if __name__ == '__main__':
profile.run('testRequest()')
profile.run('testRequest2()')
profile.run('testUrllib()')
profile.run('testwget()')
testRequest - 4469882 function calls (4469842 primitive calls) in 20.236 seconds
testRequest2 - 8580 function calls (8574 primitive calls) in 0.072 seconds
testUrllib - 3810 function calls (3775 primitive calls) in 0.036 seconds
testwget - 3489 function calls in 0.020 seconds
I use wget.
Simple and good library if you want to example?
import wget
file_url = 'http://johndoe.com/download.zip'
file_name = wget.download(file_url)
wget module support python 2 and python 3 versions
Exotic Windows Solution
import subprocess
subprocess.run("powershell Invoke-WebRequest {} -OutFile {}".format(your_url, filename), shell=True)
import urllib.request
urllib.request.urlretrieve("https://raw.githubusercontent.com/dnishimoto/python-deep-learning/master/list%20iterators%20and%20generators.ipynb", "test.ipynb")
downloads a single raw juypter notebook to file.
For text files, you can use:
import requests
url = 'https://WEBSITE.com'
req = requests.get(url)
path = "C:\\YOUR\\FILE.html"
with open(path, 'wb') as f:
f.write(req.content)
I started down this path because ESXi's wget is not compiled with SSL and I wanted to download an OVA from a vendor's website directly onto the ESXi host which is on the other side of the world.
I had to disable the firewall(lazy)/enable https out by editing the rules(proper)
created the python script:
import ssl
import shutil
import tempfile
import urllib.request
context = ssl._create_unverified_context()
dlurl='https://somesite/path/whatever'
with urllib.request.urlopen(durl, context=context) as response:
with open("file.ova", 'wb') as tmp_file:
shutil.copyfileobj(response, tmp_file)
ESXi libraries are kind of paired down but the open source weasel installer seemed to use urllib for https... so it inspired me to go down this path
Another clean way to save the file is this:
import csv
import urllib
urllib.retrieve("your url goes here" , "output.csv")
Is there a way to read a SPSS (.sav) file into a data-frame without downloading a 3rd party library (I then need to save it as a Stata (.dta) file?
I have looked up and tried:
import from rpy2.robjects import pandas2ri, r
filename = 'small test data.sav'
w = r('foreign::read.spss("%s", to.data.frame=TRUE)' % filename)
df = pandas2ri.ri2py(w)
df.head()
but I get a ImportError: DLL load failed: The specified procedure could not be found as I don't have the library.
I've been going through the Q&A on this site, for an answer to my question. However, I'm a beginner and I find it difficult to understand some of the solutions. I need a very basic solution.
Could someone please explain a simple solution to 'Downloading a file through http' and 'Saving it to disk, in Windows', to me?
I'm not sure how to use shutil and os modules, either.
The file I want to download is under 500 MB and is an .gz archive file.If someone can explain how to extract the archive and utilise the files in it also, that would be great!
Here's a partial solution, that I wrote from various answers combined:
import requests
import os
import shutil
global dump
def download_file():
global dump
url = "http://randomsite.com/file.gz"
file = requests.get(url, stream=True)
dump = file.raw
def save_file():
global dump
location = os.path.abspath("D:\folder\file.gz")
with open("file.gz", 'wb') as location:
shutil.copyfileobj(dump, location)
del dump
Could someone point out errors (beginner level) and explain any easier methods to do this?
A clean way to download a file is:
import urllib
testfile = urllib.URLopener()
testfile.retrieve("http://randomsite.com/file.gz", "file.gz")
This downloads a file from a website and names it file.gz. This is one of my favorite solutions, from Downloading a picture via urllib and python.
This example uses the urllib library, and it will directly retrieve the file form a source.
For Python3+ URLopener is deprecated.
And when used you will get error as below:
url_opener = urllib.URLopener() AttributeError: module 'urllib' has no
attribute 'URLopener'
So, try:
import urllib.request
urllib.request.urlretrieve(url, filename)
As mentioned here:
import urllib
urllib.urlretrieve ("http://randomsite.com/file.gz", "file.gz")
EDIT: If you still want to use requests, take a look at this question or this one.
Four methods using wget, urllib and request.
#!/usr/bin/python
import requests
from StringIO import StringIO
from PIL import Image
import profile as profile
import urllib
import wget
url = 'https://tinypng.com/images/social/website.jpg'
def testRequest():
image_name = 'test1.jpg'
r = requests.get(url, stream=True)
with open(image_name, 'wb') as f:
for chunk in r.iter_content():
f.write(chunk)
def testRequest2():
image_name = 'test2.jpg'
r = requests.get(url)
i = Image.open(StringIO(r.content))
i.save(image_name)
def testUrllib():
image_name = 'test3.jpg'
testfile = urllib.URLopener()
testfile.retrieve(url, image_name)
def testwget():
image_name = 'test4.jpg'
wget.download(url, image_name)
if __name__ == '__main__':
profile.run('testRequest()')
profile.run('testRequest2()')
profile.run('testUrllib()')
profile.run('testwget()')
testRequest - 4469882 function calls (4469842 primitive calls) in 20.236 seconds
testRequest2 - 8580 function calls (8574 primitive calls) in 0.072 seconds
testUrllib - 3810 function calls (3775 primitive calls) in 0.036 seconds
testwget - 3489 function calls in 0.020 seconds
I use wget.
Simple and good library if you want to example?
import wget
file_url = 'http://johndoe.com/download.zip'
file_name = wget.download(file_url)
wget module support python 2 and python 3 versions
Exotic Windows Solution
import subprocess
subprocess.run("powershell Invoke-WebRequest {} -OutFile {}".format(your_url, filename), shell=True)
import urllib.request
urllib.request.urlretrieve("https://raw.githubusercontent.com/dnishimoto/python-deep-learning/master/list%20iterators%20and%20generators.ipynb", "test.ipynb")
downloads a single raw juypter notebook to file.
For text files, you can use:
import requests
url = 'https://WEBSITE.com'
req = requests.get(url)
path = "C:\\YOUR\\FILE.html"
with open(path, 'wb') as f:
f.write(req.content)
I started down this path because ESXi's wget is not compiled with SSL and I wanted to download an OVA from a vendor's website directly onto the ESXi host which is on the other side of the world.
I had to disable the firewall(lazy)/enable https out by editing the rules(proper)
created the python script:
import ssl
import shutil
import tempfile
import urllib.request
context = ssl._create_unverified_context()
dlurl='https://somesite/path/whatever'
with urllib.request.urlopen(durl, context=context) as response:
with open("file.ova", 'wb') as tmp_file:
shutil.copyfileobj(response, tmp_file)
ESXi libraries are kind of paired down but the open source weasel installer seemed to use urllib for https... so it inspired me to go down this path
Another clean way to save the file is this:
import csv
import urllib
urllib.retrieve("your url goes here" , "output.csv")