I've used pd.concat(list_of_dataframes) in the past with no trouble, but I'm currently running into a problem.
I have a set of pickled dataframes, and I put them in a list like this:
pickle_frames = [pickle.load(open(pickle_file, 'rb')) for pickle_file in pickles_list]
edit: also tried this in a for loop instead of the comprehension like so, just in case, but with the same result:
pickle_frames = []
for pickle_file in pickles_list:
    this_pickle = pickle.load(open(pickle_file, 'rb'))
    pickle_frames.append(this_pickle)
edit: I also tried casting the loaded pickles as numpy arrays like so, again with the same result:
pickle_frames = [np.array(pickle.load(open(pickle_file, 'rb'))) for pickle_file in pickles_list]
Then I try to concatenate:
df = pd.concat(pickle_frames, keys=pickles_list)
And get this error:
TypeError: cannot concatenate a non-NDFrame object
I've tested the list of frames and it looks fine; type(pickle_frames) returns list and type(pickle_frames[0]) returns pandas.core.frame.DataFrame ... I can load and perform other DataFrame operations on pickle_frames[i] for any i.
Any ideas as to why concat isn't recognizing the loaded, previously pickled dataframes, when they seem to be perfectly good?
=======================
Full code:
import pickle, os
import pandas as pd
import numpy as np
path = os.getcwd()
pickles_list = [f for f in os.listdir(path) if f.endswith('.p')]
pickle_frames = [pd.DataFrame(pickle.load(open(pickle_file, 'rb'))) for pickle_file in pickles_list]
df = pd.concat(pickle_frames, keys=pickles_list)
So it turns out that one of the frames was not properly formatted (I included it accidentally from an earlier batch of pickles). The type was still pandas.core.frame.DataFrame though, so I'm still not sure why I got this exact error. Thanks for your question, @mdurant, it helped lead me to find the problem.
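For anyone who hits the same error: since type() on each element looked fine here, a per-frame smoke test is more telling than a type check. A diagnostic sketch along these lines (variable names taken from the code above) would have pinpointed the offending frame:
import pandas as pd

# Concatenate each frame by itself; the malformed one raises on its own.
for source, frame in zip(pickles_list, pickle_frames):
    try:
        pd.concat([frame, frame])
    except TypeError as exc:
        print(source, '->', exc)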
Related
I'm looping through a directory to read a series of csv files into a single pandas dataframe. One of the csv files is throwing an error. I can work through them one by one to figure out which file is causing it, but I assume there must be some way to build in error handling that prints out the file causing the issue; I'm just not sure how to implement something like that.
Any advice appreciated, code below:
import os
import glob
import pandas as pd
path = r'C:\Users\PATH\TestCSVs'
all_files = glob.glob(os.path.join(path, "*.csv"))
master_df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
print(master_df.shape)
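One way to get the error handling asked for above is to wrap each read in a try/except that names the file before re-raising; a minimal sketch, reusing the path and pattern from the snippet above:
import os
import glob
import pandas as pd

path = r'C:\Users\PATH\TestCSVs'
all_files = glob.glob(os.path.join(path, "*.csv"))

frames = []
for f in all_files:
    try:
        frames.append(pd.read_csv(f))
    except Exception as exc:
        print(f"failed on {f}: {exc}")  # name the offending file
        raise  # then let the original failure propagate

master_df = pd.concat(frames, ignore_index=True)
print(master_df.shape)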
It works fine for simple files but not with more complex ones.
My files are not corrupted, and they are in the right directory.
I tried it with simple generated files (1,2,3,4... a,b,c,d...).
I put it on GitHub tonight so you can run the code and see the files.
import os
import glob
import pandas as pd
def concatenate(indir='./files/', outfile='./all.csv'):
    os.chdir(indir)
    fileList = glob.glob('*.CSV')
    dfList = []
    # colnames = ['Time', 'Number', 'Reaction', 'Code', 'Message', 'date']
    print(len(fileList))
    for filename in fileList:
        print(filename)
        df = pd.read_csv(filename, header=0)
        dfList.append(df)
    # print(dfList)
    concatDf = pd.concat(dfList, axis=0)
    # concatDf.columns = colnames
    concatDf.to_csv(outfile, index=None)

concatenate()
Error
Unable to open parsers.pyx: Unable to read file (Error: File not found
(/Users/alf4/Documents/vs_code/files/pandas/_libs/parsers.pyx)).
But it only appears after reading more than two files.
Complex ones? Do you mean bigger csv files?
Instead of appending data to an empty list and then concatenating it back into a dataframe, we can do it in a single step: take an empty dataframe (df1) and keep appending df to df1 inside the loop,
df1 = df1.append(df)
and then write it out at the end:
df1.to_csv(outfile, index=None)
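Spelled out, that loop would look like the sketch below. One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the list-plus-concat pattern above is the one that still works.
import glob
import pandas as pd

df1 = pd.DataFrame()  # start empty and grow it file by file
for filename in glob.glob('*.CSV'):
    df = pd.read_csv(filename, header=0)
    df1 = df1.append(df)  # pre-2.0 pandas only; removed in pandas 2.0

df1.to_csv('./all.csv', index=None)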
I am sorry for this question being off topic; it seems not to be a code problem after all.
It seems that the pandas installation is broken. I put the code on repl.it to share it here, and there it works. At the moment I am trying to repair the Python and pandas installation.
Many thanks to the people in the comments for their help.
I am reading a few XLS files via
import os
import pandas as pd
path = r'pathtofolder'
files = os.listdir(path=path)
dataframes = {}
for file in files:
    filepath = path + '\\' + file
    if filepath[-3:] == 'xls':
        print(file)
        dataframes[file] = pd.read_excel(filepath)
For some reason, however, I can't access the dataframes inside the dictionary, as .head() doesn't seem to work:
for file, dataframe in dataframes.items():
    dataframe.head()
This code doesn't seem to do anything in Jupyter. However, when I call type() on dataframe, I get pandas.core.frame.DataFrame, so head() should be working, right?
I haven't worked with Python data frames much, but I don't think your for loop will give you any output this way; it's just a loop that ends once the last head() has been computed, without displaying anything. You can simply use print() to see your output:
for file, dataframe in dataframes.items():
    print(dataframe.head())
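If you want the rich table rendering inside Jupyter specifically, IPython's display() gives that back inside a loop; a small sketch:
from IPython.display import display

for file, dataframe in dataframes.items():
    print(file)  # label each table with its source file
    display(dataframe.head())  # rendered like a normal cell output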
Or create a reusable list of dataframe.head() results, as shown below; you can enter the list name in the console any time to view it later. Pardon the code for creating a dictionary of dataframes.
import pandas as pd
from sklearn import datasets
iris = pd.DataFrame(datasets.load_iris().data)
digits = pd.DataFrame(datasets.load_digits().data)
diabetes = pd.DataFrame(datasets.load_diabetes().data)
dataframes={'a':iris,'b':digits,'c':diabetes} #create a dictionary of dataframes
list_heads=[] #create a list of dataframe head()
for i in dataframes:
    list_heads.append(dataframes[i].head())
list_heads
I am trying to load a .mat file for the Street View House Numbers (SVHN) dataset (http://ufldl.stanford.edu/housenumbers/) in Python with the following code:
import h5py
labels_file = './sv/train/digitStruct.mat'
f = h5py.File(labels_file)
struct= f.values()
names = struct[1].values()
print(names[1][1].value)
I get [<HDF5 object reference>] but I need to know the actual string
To get an idea of the data layout you could execute
h5dump ./sv/train/digitStruct.mat
but there are also other methods like visit or visititems.
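For instance, a short sketch using visititems() to walk the whole hierarchy and print what it finds (path as in the question; handy when h5dump is not at hand):
import h5py

def show(name, obj):
    # Print every group/dataset path, plus shape and dtype for datasets.
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)
    else:
        print(name)

with h5py.File('./sv/train/digitStruct.mat', 'r') as f:
    f.visititems(show)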
A good reference that can help you and that seems to have already addressed a very similar problem (if not the same) recently is the following SO post:
h5py, access data in Datasets in SVHN
For example the snippet:
import h5py

def get_name(index, hdf5_data):
    name = hdf5_data['/digitStruct/name']
    print(''.join(chr(v[0]) for v in hdf5_data[name[index][0]]))

labels_file = 'train/digitStruct.mat'
f = h5py.File(labels_file, 'r')
for j in range(33402):
    get_name(j, f)
will print the name of the files. I get for example:
7459.png
7460.png
7461.png
7462.png
7463.png
7464.png
7465.png
You can generalize from here.
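For example, one way to generalize is to return the string instead of printing it, so all the names can be collected into a list; a sketch along those lines (the read_name helper is my own naming, not part of the linked answer):
import h5py

def read_name(index, hdf5_data):
    # Dereference the HDF5 object reference, then decode the character codes.
    ref = hdf5_data['/digitStruct/name'][index][0]
    return ''.join(chr(v[0]) for v in hdf5_data[ref])

with h5py.File('train/digitStruct.mat', 'r') as f:
    count = f['/digitStruct/name'].shape[0]  # 33402 in the train split
    names = [read_name(j, f) for j in range(count)]
print(names[:5])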
I'm trying to write a small Python script to do some data analysis. I have several data files, each one a single column of data. I know how to import each one in Python using numpy.loadtxt, which gives me back an ndarray, but I can't figure out how to concatenate these ndarrays: numpy.concatenate and numpy.append always give me back error messages, even if I try to flatten the arrays first.
Are you aware of a solution?
OK, since you asked for code and data details: here is what my data files look like:
1.4533423
1.3709900
1.7832323
...
Just a column of floats. I have no problem importing a single file using:
data = numpy.loadtxt("data_filename")
My code trying to concatenate the arrays looks like this now (after trying numpy.concatenate and numpy.append, I'm now trying numpy.insert):
data = numpy.zeros(0)  # an empty first array that will be extended by each file
for filename in sys.argv[1:]:
    temp = numpy.loadtxt(filename)
    numpy.insert(data, numpy.arange(len(temp), temp))
I'm passing the filenames when running my script with:
./my_script.py ALL_THE_DATAFILES
And the error message I get is:
TypeError: only length-1 arrays can be converted to Python scalars
numpy.concatenate will definitely be a valid choice; without sample data, sample code, and the corresponding error messages we cannot help further.
An alternative would be numpy.r_ (note that numpy.s_ is a slicing helper rather than a concatenation tool).
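As a minimal illustration on two small 1-D arrays, where both spellings agree:
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

print(np.concatenate([a, b]))  # [1. 2. 3. 4.]
print(np.r_[a, b])  # same result via the index trick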
EDIT
This code snippet:
import sys
import numpy as np
filenames = sys.argv[1:]
arrays = [np.loadtxt(filename) for filename in filenames]
final_array = np.concatenate(arrays, axis=0)
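As an aside on the original traceback: in numpy.insert(data, numpy.arange(len(temp), temp)) the misplaced parenthesis feeds the whole temp array to arange as its stop argument, which is presumably what raises "only length-1 arrays can be converted to Python scalars". Keeping the incremental style instead, a sketch of a corrected loop:
import sys
import numpy as np

data = np.zeros(0)  # 1-D accumulator
for filename in sys.argv[1:]:
    temp = np.loadtxt(filename)
    data = np.concatenate([data, temp])  # grow the array file by file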