I have been trying for a while to save a pandas DataFrame to an HDF5 file. I have tried various approaches, e.g. df.to_hdf, but to no avail. I am running this in a Python virtual environment; even without the virtual environment I get the same error. The following script produces the error below:
''' This script reads in a pickled dictionary, converts it to a pandas
DataFrame and then saves it to an HDF file. The arguments are the
file names of the pickle files.
'''
import numpy as np
import pandas as pd
import pickle
import sys
# read in filename arguments
for fn in sys.argv[1:]:
    print 'converting file %s to hdf format...' % fn
    fl = open(fn, 'r')
    data = pickle.load(fl)
    fl.close()
    frame = pd.DataFrame(data)
    fnn = fn.split('.')[0] + '.h5'
    store = pd.HDFStore(fnn)
    store.put([fn.split('.')[0]], frame)
    store.close()
    frame = 0
    data = 0
The error is:
$ ./p_to_hdf.py LUT_*.p
converting file LUT_0.p to hdf format...
Traceback (most recent call last):
File "./p_to_hdf.py", line 22, in <module>
store = pd.HDFStore(fnn)
File "/usr/lib/python2.7/site-packages/pandas/io/pytables.py", line 270, in __init__
raise Exception('HDFStore requires PyTables')
Exception: HDFStore requires PyTables
pip list shows both pandas and tables are installed, at their latest versions:
pandas (0.16.2)
tables (3.2.0)
The solution had nothing to do with the code, but with how the virtual environment was sourced. The correct way is to use . venv/bin/activate instead of source ~/venv/bin/activate. Now which python shows the python installed under ~/venv/bin/python and the code runs correctly.
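A quick way to confirm that the right interpreter is active and that it can actually see PyTables is a small check like this (assuming the virtual environment lives under ~/venv):

import sys
print(sys.executable)    # should point at ~/venv/bin/python when the venv is active

import tables            # raises ImportError if PyTables is not installed in this environment
print(tables.__version__)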
I need to open a gzipped file that has a parquet file with some data inside it. I am having trouble trying to print/read what is inside the file. I tried the following:
import gzip
with gzip.open("myFile.parquet.gzip", "rb") as f:
    data = f.read()
This does not seem to work, as I get an error saying that my file is not a gz file. Thanks!
You can use the read_parquet function from the pandas module:
Install pandas and pyarrow:
pip install pandas pyarrow
Use read_parquet, which returns a DataFrame:
import pandas as pd
data = pd.read_parquet("myFile.parquet.gzip")
print(data.count()) # example of operation on the returned DataFrame
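Note that despite the .gzip suffix, such files are usually ordinary parquet files whose columns are gzip-compressed, which read_parquet handles transparently. If the file really were a gzip archive wrapping a parquet file, a sketch along these lines should work (assuming pyarrow is installed):

import gzip
import io
import pandas as pd

# decompress the gzip wrapper first, then hand the raw parquet bytes to pandas
with gzip.open("myFile.parquet.gzip", "rb") as f:
    data = pd.read_parquet(io.BytesIO(f.read()))
print(data.head())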
My code is:
import pandas as pd
df = pd.read_csv('Project_Wind_Data.csv', usecols=['U100', 'V100'])
with open('Project_Wind_Data.csv', "r") as csvfile:
I am trying to access certain columns within the CSV file. I receive an error message saying that the data file does not exist.
My data is in the following form:
This must be a trivial issue, but help would be much appreciated.
If your csv file is in the same working directory as your .py file, you can read it directly:
import pandas as pd
df = pd.read_csv('Project_Wind_Data.csv', usecols=['U100', 'V100'])
If the file is in another directory, replace 'Project_Wind_Data.csv' with the full path to the file, e.g. C:/Users/Documents/file.csv.
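If it is unclear which directory the relative path is resolved against, a quick check like this (purely diagnostic) can help:

import os

# any relative filename passed to read_csv is resolved against this directory
print(os.getcwd())
print(os.path.exists('Project_Wind_Data.csv'))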
I'm trying to deploy a training script on Google Cloud ML. Of course, I've uploaded my datasets (CSV files) in a bucket on GCS.
I used to import my data with read_csv from pandas, but it doesn't seem to work with a GCS path.
How should I proceed (I would like to keep using pandas)?
import pandas as pd
data = pd.read_csv("gs://bucket/folder/file.csv")
Output:
ERROR 2018-02-01 18:43:34 +0100 master-replica-0 IOError: File gs://bucket/folder/file.csv does not exist
You will need to use file_io from tensorflow.python.lib.io to do that, as demonstrated below:
from tensorflow.python.lib.io import file_io
from pandas.compat import StringIO
from pandas import read_csv

# read a csv file from google cloud storage
def read_data(gcs_path):
    file_stream = file_io.FileIO(gcs_path, mode='r')
    csv_data = read_csv(StringIO(file_stream.read()))
    return csv_data
Now call the above function:
gcs_path = 'gs://bucket/folder/file.csv' # change path according to your bucket, folder and path
df = read_data(gcs_path)
# print(df.head())  # displays the first 5 rows by default
Pandas does not have native GCS support. There are two alternatives:
1. Copy the file to the VM using the gsutil CLI.
2. Use the TensorFlow file_io library to open the file and pass the file object to pd.read_csv(). Please refer to the detailed answer above.
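For example, a minimal sketch of option 2, passing the GCS file object straight to pandas (bucket and file name taken from the question):

import pandas as pd
from tensorflow.python.lib.io import file_io

# open the object on GCS and let pandas parse it like a local file
with file_io.FileIO('gs://bucket/folder/file.csv', mode='r') as f:
    df = pd.read_csv(f)
print(df.shape)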
You could also use Dask to extract and then load the data into, let's say, a Jupyter Notebook running on GCP.
Make sure you have Dask installed:
conda install dask #conda
pip install dask[complete] #pip
import dask.dataframe as dd #Import
dataframe = dd.read_csv('gs://bucket/datafile.csv') #Read CSV data
dataframe2 = dd.read_csv('gs://bucket/path/*.csv') #Read multiple CSV files
This is all you need to load the data.
You can now filter and manipulate the data with pandas syntax:
dataframe['z'] = dataframe.x + dataframe.y #Lazy column operation; nothing is computed yet
dataframe_pd = dataframe.compute() #Materialize the result as an in-memory pandas DataFrame
Just started learning python and trying to read a CSV file with pandas.
import pandas as pd
df = pd.read_csv(os.path.join(os.path.dirname(__file__), "C:\\Anaconda\\SPY.csv"))
But I get the error:
File data\SPY.csv does not exist
I tried with both one and two / and \, and with ' instead of ".
This is the path: C:\Anaconda\SPY.csv
(This is a file from Yahoo Finance. I first tried to call Yahoo directly but was unable to, so instead I just downloaded the file and saved it as a CSV.)
The error occurs because you are joining your current working directory (which is named "data") with the path, but your file is actually in "Anaconda".
Try a simple
import pandas as pd
df = pd.read_csv("C:\\Anaconda\\SPY.csv")
If you really want to use os.path.join, this should do:
import pandas as pd
import os
path = os.path.join("C:\\", "Anaconda", "SPY.csv")  # note the separator after the drive letter; "C:" alone is drive-relative
df = pd.read_csv(path)
Also, if your SPY.csv file is in the same directory as your Python file, you should replace the path with a simple SPY.csv
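If you keep SPY.csv next to the script, a sketch like this resolves it relative to the script itself, so the current working directory no longer matters:

import os
import pandas as pd

# build the path from the script's own location rather than the working directory
path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "SPY.csv")
df = pd.read_csv(path)
print(df.head())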
I am quite sure that my ARFF files are correct; to check, I downloaded different files from the web and successfully opened them in Weka.
But I want to use my data in Python, so I typed:
import arff
data = arff.load('file_path','rb')
It always returns an error message: Invalid layout of the ARFF file, at line 1.
Why does this happen and what should I do to make it right?
If you change your code as shown below, it will work:
import arff
data = arff.load(open('file_path'))
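If you then want the data in pandas, a sketch like this converts the result to a DataFrame (assuming the liac-arff package, whose load returns a dict with 'data' and 'attributes' keys):

import arff
import pandas as pd

with open('file_path') as f:
    data = arff.load(f)

# 'attributes' is a list of (name, type) pairs; use the names as column labels
df = pd.DataFrame(data['data'], columns=[name for name, _ in data['attributes']])
print(df.head())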
Using scipy, we can also load ARFF data in Python:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('dataset.arff')  # returns a (data, meta) tuple
df = pd.DataFrame(data[0])
df.head()