Is there a way to read an SPSS (.sav) file into a dataframe without downloading a 3rd-party library? (I then need to save it as a Stata (.dta) file.)
I have looked up and tried:
from rpy2.robjects import pandas2ri, r
filename = 'small test data.sav'
w = r('foreign::read.spss("%s", to.data.frame=TRUE)' % filename)
df = pandas2ri.ri2py(w)
df.head()
but I get an ImportError: DLL load failed: The specified procedure could not be found, as I don't have the library.
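For reference, if a pandas-based route is acceptable, a minimal sketch of the .sav-to-.dta round trip could look like the following (note that pandas.read_spss delegates to the third-party pyreadstat package under the hood, so it does not fully avoid extra dependencies; the filename is the one from the question):
import pandas as pd
# read the SPSS file into a DataFrame (pandas uses pyreadstat internally for this)
df = pd.read_spss('small test data.sav')
# write the DataFrame back out as a Stata file
df.to_stata('small test data.dta')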
So I am trying to get data from nflfastR, and my equivalent R code is:
data <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.rds'))
data
I have previously tried the pyreadr module as well, but that did not work for me. Currently I am using the rpy2 module to make it work. Here is the code I am trying:
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
import os
os.environ["R_HOME"] = r"C:\Program Files\R\R-3.6.3"
os.environ["PATH"] = r"C:\Program Files\R\R-3.6.3\bin\x64" + ";" + os.environ["PATH"]
pandas2ri.activate()
readRDS = robjects.r['readRDS']
df = readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.rds'))
df = pandas2ri.ri2py(df)
Rds and Rdata files are difficult to read in languages other than R because the format, although open, is undocumented. Therefore there are not many options for reading them in Python. One is what you propose. Another is to use pyreadr, but you have to download the file to disk first, as pyreadr cannot read directly from a URL:
import pyreadr
from urllib.request import urlopen

link = "https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.rds"
response = urlopen(link)
content = response.read()

# save the downloaded bytes to disk, then read the .rds file with pyreadr
with open('play_by_play_2019.rds', 'wb') as fhandle:
    fhandle.write(content)

result = pyreadr.read_r("play_by_play_2019.rds")
print(result.keys())
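Since an .rds file holds a single, unnamed object, pyreadr stores the dataframe under the None key of the returned dictionary:
# the .rds object has no name, so pyreadr keys it with None
df = result[None]
df.head()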
EDIT
pyreadr 0.3.7 now includes a function to download files:
import pyreadr
url = "https://github.com/hadley/nycflights13/blob/master/data/airlines.rda?raw=true"
dst_path = "/some/path/on/disk/airlines.rda"
res = pyreadr.read_r(pyreadr.download_file(url, dst_path))
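For .rda files the stored objects keep their names, so in this example the dataframe would be retrieved by name from the result dictionary (assuming the object inside airlines.rda is called "airlines", as in the nycflights13 package):
# 'airlines' is the name of the object stored inside airlines.rda
df = res["airlines"]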
In R, unlike Python, you do not have to qualify every function with its package source unless you face name conflicts. Additionally, R has no truly built-in functions: every function you call resides in a package, but R ships with default packages such as utils, base, and stats for routine methods.
Specifically, your working R code calls two functions from the base package, shown here with double-colon qualifiers:
nfl_url <- 'https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.rds'
data <- base::readRDS(base::url(nfl_url))
data
Consequently, you need to run the analogous procedure in Python's rpy2 by explicitly importing the base package:
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
base = importr("base")
nfl_url = 'https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.rds'
r_df = base.readRDS(base.url(nfl_url))
pandas2ri.activate()
py_df = pandas2ri.ri2py(r_df)
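If you are on rpy2 3.x, pandas2ri.ri2py has been removed and the conversion goes through an explicit converter instead; a rough sketch, reusing the r_df object from above:
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

# convert the R data.frame to a pandas DataFrame under an explicit converter context
with localconverter(robjects.default_converter + pandas2ri.converter):
    py_df = robjects.conversion.rpy2py(r_df)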
Well, if you just want to read nflfastR data, you can read it directly in Python as follows:
import pandas as pd
df = pd.read_csv('https://github.com/guga31bb/nflfastR-data/blob/master/data/'
                 'play_by_play_2019.csv.gz?raw=True',
                 compression='gzip', low_memory=False)
But as of now, there is no way to do this directly in Python. It is hard enough to read a local (.rds) file, and reading one from a URL is something I have never seen implemented. So you have to download the file locally; then you can read it using the pyreadr package or rpy2 (if you have R installed), as you mentioned.
I ran the following code, as well as different versions of the same code, but I keep running into this specific error: **ModuleNotFoundError: No module named 'pandas.core.internals.managers'; 'pandas.core.internals' is not a package**
Please see code below:
import pickle
import pandas as pd

pickle_off = open(r"C:\Users\database.pkl", "rb")
df = pd.read_pickle(pickle_off)
Let me know if this works:
import pickle
file_name = 'data.p'
with open(file_name, mode='rb') as f:
    data = pickle.load(f)
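Applied to the path from the question, that would be, for example:
import pickle

# open the pickle in binary mode and load it directly
with open(r"C:\Users\database.pkl", mode='rb') as f:
    df = pickle.load(f)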
[UPDATE - 5/15/2020 - I got this code and the entire flow working with the parquet file format. However, I would still be interested in the approach using CSV.]
I am trying to upload a CSV file from a local machine to ADLS Gen 2 storage using the command below. This works fine, but the resulting CSV file in ADLS is one continuous string of text with no newline character to separate the rows. This CSV file cannot be loaded into Azure Synapse as-is using PolyBase.
Input CSV -
"col1","col2","col3"
"NJ","1","1/3/2020"
"NY","1","1/4/2020"
...
The output CSV I get looks like this -
"col1","col2","col3""NJ","1","1/3/2020""NY","1","1/4/2020"...
How do I make sure my final CSV has a newline character after each row? There are a few hundred thousand records in each CSV.
import os, uuid, sys
from azure.storage.filedatalake import DataLakeServiceClient
from azure.core._match_conditions import MatchConditions
from azure.storage.filedatalake._models import ContentSettings
try:
    global service_client

    service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
        "https", "<storage-account>"), credential="<secret>")

    file_system_client = service_client.get_file_system_client(file_system="import")
    dest_directory_client = file_system_client.get_directory_client("Destination")

    f = open("file-path/cashreceipts.csv", 'r')
    dest_file_client = dest_directory_client.create_file("cashreceipts.csv")
    file_contents = f.read()
    dest_file_client.upload_data(file_contents, overwrite=True)
    f.close()
except Exception as e:
    print(e)
I tried this approach as well -
dest_file_client.append_data(data=file_contents, offset=0, length=len(file_contents))
dest_file_client.flush_data(len(file_contents))
I am referring to the Microsoft documentation here which describes the approach for a text file - https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-python
For some reason, when I attempt to read an HDF file from S3 using the pandas.read_hdf() method, I get a FileNotFoundError when I pass an s3 URL. The file definitely exists, and I have tried the pandas.read_csv() method with a CSV file in the same S3 directory, which works. Is there something else I need to be doing? Here's the code:
import boto3
import h5py
import s3fs
import pandas as pd
csvDataframe = pd.read_csv('s3://BUCKET_NAME/FILE_NAME.csv')
print("Csv data:")
print(csvDataframe)
dataframe = pd.read_hdf('s3://BUCKET_NAME/FILE_NAME.h5', key='df')
print("Hdf data:")
print(dataframe)
Here is the error:
FileNotFoundError: File s3://BUCKET_NAME/FILE_NAME.h5 does not exist
In the actual code, BUCKET_NAME and FILE_NAME are replaced with their actual strings.
Please make sure the file extension is .h5.
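If the extension is correct and the error persists, a common workaround is to download the object to a local file first, since the HDF5 reader generally needs a seekable local file. A minimal sketch with boto3, reusing the placeholder bucket and file names from the question:
import boto3
import pandas as pd

# download the HDF file from S3 to a local path, then read it with pandas
s3 = boto3.client('s3')
s3.download_file('BUCKET_NAME', 'FILE_NAME.h5', 'FILE_NAME.h5')

dataframe = pd.read_hdf('FILE_NAME.h5', key='df')
print(dataframe)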
I am doing a data science project.
I am using a Google notebook for this job.
My dataset resides here, and I want to access it directly from the Python notebook.
I am using the following line of code to read it:
df = pd.read_csv('link')
But the command is throwing an error like the one below.
What should I do?
It's difficult to answer exactly since details are lacking, but here is an approach for this kind of request.
You have to import ZipFile and urlopen in order to fetch the data from the URL, extract it from the zip, and then use the CSV files for pandas processing.
from zipfile import ZipFile
from urllib.request import urlopen
import pandas as pd
import os

URL = 'https://he-s3.s3.amazonaws.com/media/hackathon/hdfc-bank-ml-hiring-challenge/application-scorecard-for-customers/05d2b4ea-c-Dataset.zip'
zip_name = '05d2b4ea-c-Dataset.zip'

# open the url and save the zip file onto the computer
url = urlopen(URL)
output = open(zip_name, 'wb')  # note the flag: "wb"
output.write(url.read())
output.close()

# the archive contains several files, so open the member you need and read it as a pandas dataframe
with ZipFile(zip_name) as z:
    df = pd.read_csv(z.open('train_data/TransactionData_Train.csv'))

# if keeping the zip file on disk is not wanted, then:
os.remove(zip_name)  # remove the copy of the zipfile on disk
Use the urllib module to download the zip file into memory; urlopen returns a file-like object whose read() output you can wrap in BytesIO and pass to ZipFile (a standard-library package).
Since the archive contains multiple files, such as
['test_data/AggregateData_Test.csv', 'test_data/TransactionData_Test.csv', 'train_data/AggregateData_Train.csv', 'train_data/Column_Descriptions.xlsx', 'train_data/sample_submission.csv', 'train_data/TransactionData_Train.csv']
load it into a dict of dataframes with the filename as the key. Altogether the code will be:
from urllib.request import urlopen
from zipfile import ZipFile
from io import BytesIO
import pandas as pd

zip_in_memory = urlopen("https://he-s3.s3.amazonaws.com/media/hackathon/hdfc-bank-ml-hiring-challenge/application-scorecard-for-customers/05d2b4ea-c-Dataset.zip").read()
z = ZipFile(BytesIO(zip_in_memory))

# build a dict mapping each csv's filename to its dataframe
dict_of_dfs = {file.filename: pd.read_csv(z.open(file.filename))
               for file in z.infolist()
               if file.filename.endswith('.csv')}
Now you can access the dataframe for each CSV like dict_of_dfs['test_data/AggregateData_Test.csv'].
Of course, all of this is unnecessary if you just download the zip from the link yourself and pass the local file to ZipFile.
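For example, assuming the archive was already saved locally under the same name as above, the in-memory download step simply drops away:
from zipfile import ZipFile
import pandas as pd

# open the local copy of the archive instead of downloading it
z = ZipFile('05d2b4ea-c-Dataset.zip')
dict_of_dfs = {file.filename: pd.read_csv(z.open(file.filename))
               for file in z.infolist()
               if file.filename.endswith('.csv')}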