zip rows of pandas DataFrame with list/array of values - python

My current code is
from numpy import *

def buildRealDataObject(x):
    loc = array(x[0])
    trueClass = x[1]
    evid = ones(len(loc))
    evid[isnan(loc)] = 0
    loc[isnan(loc)] = 0
    return DataObject(location=loc, trueClass=trueClass, evidence=evid)

if trueClasses is None:
    trueClasses = zeros(len(dataset), dtype=int8).tolist()

realObjects = list(map(lambda x: buildRealDataObject(x), zip(dataset, trueClasses)))
and it is working. What I expect is to create, for each row of the DataFrame dataset combined with the corresponding entry of trueClasses, one realObject. I am not really sure why it works, though, because if I run list(zip(dataset, trueClasses)) I just get something like [(0, 0.0), (1, 0.0)]. The two columns of dataset are called 0 and 1. So my first question is: why is this working, and what is happening here?
However, I think this might still be wrong on some level, because it might only work due to "clever implicit transformation" on the pandas side. Also, for the line evid[isnan(loc)] = 0 I now get the error:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
How should I rewrite this code instead?

Currently the zip works on columns instead of rows. Use one of the methods from Pandas convert dataframe to array of tuples to make the zip work on rows instead of columns. For example, substitute
zip(dataset, trueClasses)
with
zip(dataset.values, trueClasses)
Considering this post, if you already have l = list(data_train.values) for some reason, then zip(l, eClass) is faster than zip(dataset.values, trueClasses). However, if you don't, the conversion takes too much time to make it worthwhile, at least in my tests.
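To see the column-vs-row difference concretely, here is a minimal sketch (the toy DataFrame and class list are invented for illustration). Iterating a DataFrame yields its column labels, while iterating .values yields the rows as plain float arrays, which is also what np.isnan expects:
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [1.0, np.nan], 1: [3.0, 4.0]})
classes = [0.0, 0.0]

print(list(zip(df, classes)))         # [(0, 0.0), (1, 0.0)] -- column labels
print(list(zip(df.values, classes)))  # [(array([1., 3.]), 0.0), (array([nan, 4.]), 0.0)] -- rows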

Related

Converting pandas DataFrame to numpy array but with an inconsistency

I am running into a weird inconsistency. I recently had to learn the difference between immutable and mutable data types. For my purposes, I need to convert my pandas DataFrame to a NumPy array, apply operations, and convert it back, as I do not wish to alter my input.
So I am converting as follows:
mix = pd.DataFrame(array, columns=columns)

def mix_to_pmix(mix, p_tank):
    columns, mix_in = np.array(mix)  # <---
    mix_in *= p_tank
    previous = 0
    for count, val in enumerate(mix_in):
        mix_in[count] = val + previous
        previous += val
    return pd.DataFrame(mix_in, columns=columns)
This works perfectly fine, but the line
columns, mix_in = np.array(mix)
seems not to behave consistently, as in the following case:
def to_molfrac(mix):
    columns, mix_in = np.array(mix)
    shape = mix_in.shape
    for i in range(shape[0]):
        mix_in[i, :] *= 1 / max(mix_in[i, :])
    for k in range(shape[1] - 1, 0, -1):
        mix_in[:, k] += -mix_in[:, k - 1]
    mix_in = mix_in / mix_in.sum(axis=1)[:, np.newaxis]
    return pd.DataFrame(mix_in, columns=columns)
I receive the error:
ValueError: too many values to unpack (expected 2)
The input of the latter function is the output of the previous function. So it should be the same case.
It's impossible to fully understand the input of to_molfrac and mix_to_pmix without an example.
But pandas objects have a .values attribute which gives you access to the underlying numpy array, so it's probably better to use mix_in = mix.values instead.
columns, values = df.columns, df.values
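To make the shape dependence visible, here is a minimal sketch (toy data, invented for illustration): unpacking np.array(mix) into two names only works when the array happens to have exactly two rows, because iteration runs along axis 0:
import numpy as np
import pandas as pd

two_rows = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]])
a, b = np.array(two_rows)        # fine: two rows unpack into two names

three_rows = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
try:
    a, b = np.array(three_rows)  # three rows cannot unpack into two names
except ValueError as err:
    print(err)                   # too many values to unpack (expected 2)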

Can lambda expressions be used within pandas apply method?

I encountered this lambda expression today and can't understand how it's used:
data["class_size"]["DBN"] = data["class_size"].apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
The line of code doesn't seem to call the lambda function or pass any arguments into it, so I'm confused how it does anything at all. The purpose of this is to take the two columns CSD and SCHOOL CODE and combine the entries in each row into a new column, DBN. So does this lambda expression ever get used?
You're writing your results to the wrong place: data["class_size"]["DBN"] is not the correct way to select the column to write to. You've also selected a single column to use apply with, but you'd want it across the entire dataframe.
data["DBN"] = data.apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
The apply method of a pandas Series takes a function as one of its arguments.
Here is a quick example of it in action:
import pandas as pd

data = {"numbers": range(30)}

def cube(x):
    return x**3

df = pd.DataFrame(data)
df['squares'] = df['numbers'].apply(lambda x: x**2)
df['cubes'] = df['numbers'].apply(cube)
print(df)
gives:
   numbers  squares  cubes
0        0        0      0
1        1        1      1
2        2        4      8
3        3        9     27
4        4       16     64
...
As you can see, either defining a function (like cube) or using a lambda works perfectly well.
As has already been pointed out, if you're having problems with your particular piece of code it's that you have data["class_size"]["DBN"] = ... which is incorrect. I was assuming that was an odd typo because you didn't mention getting a key error, which is what that would result in.
If you're confused about this, consider:
def list_apply(func, mylist):
    newlist = []
    for item in mylist:
        newlist.append(func(item))
    return newlist
This is a (not very efficient) function for applying a function to every item in a list. If you use it with cube as before:
a_list = range(10)
print(list_apply(cube, a_list))
you get:
[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
This is a simplistic example of how the apply function in pandas is implemented. I hope that helps!
Are you using a multi-index dataframe (i.e. there are column hierarchies)? It's hard to tell without seeing your data, but I presume that is the case, since just using data["class_size"].apply() on a normal dataframe would yield a series (meaning the lambda wouldn't be able to find your specified columns, and there would be an error!)
I actually found this answer, which explains the problem of trying to create columns in multi-index dataframes. One confusing thing about multi-index column creation is that you can try to create a column the way you are doing and it will seem to run without any issues, but it won't actually create what you want. Instead, you need to change data["class_size"]["DBN"] = ... to data["class_size", "DBN"] = ... So, in full:
data["class_size","DBN"] = data["class_size"].apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
Of course, if it isn't a multi-index dataframe then this won't help, and you should look towards one of the other answers.
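A minimal sketch of the difference (column names invented to mimic the question's layout):
import pandas as pd

cols = pd.MultiIndex.from_tuples([("class_size", "CSD"),
                                  ("class_size", "SCHOOL CODE")])
data = pd.DataFrame([[1, "M015"], [2, "M019"]], columns=cols)

# data["class_size"]["DBN"] = ...  would assign into a temporary copy
# Tuple indexing writes into the frame itself:
data["class_size", "DBN"] = data["class_size"].apply(
    lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
print(data)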
I think {0:02d} zero-pads the "CSD" value to two digits (it's an integer format, not two decimal places); the {}{} placeholders simply concatenate the two values to form 'DBN'.
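For instance (values invented):
# {0:02d} zero-pads the first argument to two digits; {1} appends the second as-is.
print("{0:02d}{1}".format(1, "M015"))   # -> 01M015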

Should pandas dataframes be nested?

I am creating a python script that drives an old fortran code to locate earthquakes. I want to vary the input parameters to the fortran code in the python script and record the results, as well as the values that produced them, in a dataframe. The results from each run are also convenient to put in a dataframe, leading me to a situation where I have a nested dataframe (i.e. a dataframe assigned to an element of another dataframe). So for example:
import pandas as pd
import numpy as np

def some_operation(row):
    results = np.random.rand(50, 3) * row['p1'] / row['p2']
    res = pd.DataFrame(results, columns=['foo', 'bar', 'rms'])
    return res

# Init master df
df_master = pd.DataFrame(columns=['p1', 'p2', 'results'], index=range(3))
df_master['p1'] = np.random.rand(len(df_master))
df_master['p2'] = np.random.rand(len(df_master))
df_master = df_master.astype(object)  # make sure generic types can be used

# loop over each row, call some_operation and store the results DataFrame
for ind, row in df_master.iterrows():
    df_master.loc[ind, "results"] = some_operation(row)
Which raises this exception:
ValueError: Incompatible indexer with DataFrame
It works as expected, however, if I change the last line to this:
df_master["results"][ind] = some_operation(row)
I have a few questions:
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc., it seems to work fine.
Should the DataFrame be used in this way? I know that dtype object can be ultra slow for sorting and whatnot, but I am really just using the dataframe as a convenient container, because the column/index notation is quite slick. If DataFrames should not be used in this way, is there a similar alternative? I was looking at the Panel class, but I am not sure whether it is the proper solution for my application. I would hate to forge ahead and apply the hack shown above to some code and then have it not supported in future releases of pandas.
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc. it seems to work fine.
This is a strange little corner case of the code. It stems from the fact that if the item being assigned is a DataFrame, loc and ix assume that you want to fill the given indices with the content of the DataFrame. For example:
>>> df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df2 = pd.DataFrame({'a': [100], 'b': [200]})
>>> df1.loc[[0], ['a', 'b']] = df2
>>> df1
     a    b
0  100  200
1    2    5
2    3    6
If this syntax also allowed storing a DataFrame as an object, it's not hard to imagine a situation where the user's intent would be ambiguous, and ambiguity does not make a good API.
Should the DataFrame be used in this way?
As long as you know the performance drawbacks of the method (and it sounds like you do), I think this is a perfectly suitable way to use a DataFrame. For example, I've seen a similar strategy used to store trained scikit-learn estimators during cross-validation across a large grid of parameters (though I can't recall the exact context of this at the moment...)
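If the object-dtype column ever feels too fragile, one alternative (a sketch, not from the original answer, reusing df_master and some_operation from the question above) is to collect the per-run DataFrames in a dict and combine them with pd.concat, which yields a hierarchical index instead of nested objects:
results = {ind: some_operation(row) for ind, row in df_master.iterrows()}
combined = pd.concat(results, names=["run", "obs"])  # MultiIndex: (master row, result row)
print(combined.loc[0].head())                        # the results for master row 0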

Applying the output of a function to two columns using .apply

I'm working on a script that takes in an address and spits out two values: coordinates (as a list) and result (whether the geocoding was successful or not). This works fine, but since the data is returned as a list, I then have to assign new columns based on the indices of that list, which works but produces a warning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy.
EDIT: Just to be clear, I think I understand from that page that I should be using .loc to access the nested values. My question is more along the lines of generating two columns directly from a function as opposed to this workaround of having to dig the information out later.
I'd like to know the correct way to approach problems like these, as I actually have this problem twice in this project.
The actual specifics of the problem aren't important, so here's a simple example of how I've been approaching it:
def geo(address):
    location = geocode(address)
    result = location.result
    coords = location.coords
    return coords, result

df['output'] = df['address'].apply(geo)
Since this yields a nested list in my df column, I then extract it into new columns as such:
df['coordinates'] = None
df['gps_status'] = None

for index, row in df.iterrows():
    df['coordinates'][index] = df['output'][index][0]
    df['gps_status'][index] = df['output'][index][1]
And again, I get the warning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Any advice on the correct way to do this would be appreciated.
Usually you want to avoid iterrows() since it is faster to operate on an entire column at once. You can assign the result from output directly to a new column.
import pandas as pd

def geo(x):
    return x*2, x*3

df = pd.DataFrame({'address': [1, 2, 3]})
output = df['address'].apply(geo)
df['a'] = [x[0] for x in output]
df['b'] = [x[1] for x in output]
gives you
   address  a  b
0        1  2  3
1        2  4  6
2        3  6  9
with no copy warning.
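An equivalent shortcut (a sketch, using the same toy output as above) unpacks the tuples with zip:
df['a'], df['b'] = zip(*output)  # transposes the list of pairs into two columns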
Your function should return a Series:
def geo(address):
    location = geocode(address)
    result = location.result
    coords = location.coords
    return pd.Series([coords, result], index=['coordinates', 'gps_status'])

df[['coordinates', 'gps_status']] = df['address'].apply(geo)
That said, this may be better written as a merge.
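For instance, something along these lines (a sketch, assuming the geo above):
df = df.join(df['address'].apply(geo))  # apply returns a DataFrame here; join merges it back on the index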

pandas SparseDataFrame insertion

I would like to create a pandas SparseDataFrame with dimensions 250,000 x 250,000. In the end, my aim is to come up with a big adjacency matrix.
So far, creating that data frame is no problem:
df = SparseDataFrame(columns=arange(250000), index=arange(250000))
But when I try to update the DataFrame, I run into massive memory/runtime problems:
index = 1000
col = 2000
value = 1
df.set_value(index, col, value)
I checked the source:
def set_value(self, index, col, value):
    """
    Put single value at passed column and index

    Parameters
    ----------
    index : row label
    col : column label
    value : scalar value

    Notes
    -----
    This method *always* returns a new object. It is currently not
    particularly efficient (and potentially very expensive) but is provided
    for API compatibility with DataFrame
    ...
The last sentence describes exactly the problem I am running into. I really would like to keep on using pandas here, but this way it seems totally impossible!
Does someone have an idea how to solve this problem more efficiently?
My next idea is to work with something like nested lists/dicts or so...
Thanks for your help!
Do it this way:
df = pd.SparseDataFrame(columns=np.arange(250000), index=np.arange(250000))

s = df[2000].to_dense()
s[1000] = 1
df[2000] = s

In [11]: df.ix[1000, 2000]
Out[11]: 1.0
So the procedure is to swap out an entire series at a time. The SDF will convert the passed-in series to a SparseSeries. (You can do this yourself to see what they look like, with s.to_sparse().) The SparseDataFrame is basically a dict of these SparseSeries, which themselves are immutable. Sparse handling will see some changes in 0.12 to better support these types of operations (e.g. setting will work efficiently).
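As an aside (not from the original answer): if the pandas sparse structures remain too slow for incrementally building a 250,000 x 250,000 adjacency matrix, a scipy.sparse matrix is a common alternative, since the LIL format supports cheap single-cell writes:
import scipy.sparse as sp

adj = sp.lil_matrix((250000, 250000))  # row-based list-of-lists format, cheap to fill
adj[1000, 2000] = 1
adj_csr = adj.tocsr()                  # convert to CSR for fast arithmetic afterwards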
