I am creating a function. One input of this function will be a panda dataframe and one of its tasks is to do some operation with two variables of this dataframe. These two variables are not fixed and I want to have the freedom to determine them using parameters as inputs of the function fun.
For example, suppose at some moment the variables I want to use are 'var1' and 'var2' (but at another time, I may want to use others two variables). Supose that these variables take values 1,2,3,4 and I want to reduce df doing var1 == 1 and var2 == 1. My functions is like this
def fun(df , var = ['input_var1', 'input_var2'] , val):
df = df.rename(columns={ var[1] : 'aux_var1 ', var[2]:'aux_var2'})
# Other operations
df = df.loc[(df.aux_var1 == val ) & (df.aux_var2 == val )]
# end of operations
# recover
df = df.rename(columns={ 'aux_var1': var[1] ,'aux_var2': var[2]})
return df
When I use the function fun, I have the error
fun(df, var = ['var1','var2'], val = 1)
IndexError: list index out of range
Actually, I want to do other more complex operations and I didn't describe these operations so as not to extend the question. Perhaps the simple example above has a solution that does not need to rename the variables. But maybe this solution doesn't work with the operations I really want to do. So first, I would necessarily like to correct the error when renaming the variables. If you want to give another more elegant solution that doesn't need renaming, I appreciate that too, but I will be very grateful if besides the elegant solution, you offer me the solution about renaming.
Python liste are zero indexed, i.e. the first element index is 0.
Just change the lines:
df = df.rename(columns={ var[1] : 'aux_var1 ', var[2]:'aux_var2'})
df = df.rename(columns={ 'aux_var1': var[1] ,'aux_var2': var[2]})
to
df = df.rename(columns={ var[0] : 'aux_var1 ', var[1]:'aux_var2'})
df = df.rename(columns={ 'aux_var1': var[0] ,'aux_var2': var[1]})
respectively
In this case you are accessing var[2] but a 2-element list in Python has elements 0 and 1. Element 2 does not exist and therefore accessing it is out of range.
As it has been mentioned in other answers, the error you are receiving is due to the 0-indexing of Python lists, i.e. if you wish to access the first element of the list var, you do that by taking the 0 index instead of 1 index: var[0].
However to the topic of renaming, you are able to perform the filtering of pandas dataframe without any column renaming. I can see that you are accessing the column as an attribute of the dataframe, however you are able to achieve the same via utilising the __getitem__ method, which is more commonly used with square brackets, f.e. df[var[0]].
If you wish to have more generality over your function without any renaming happening, I can suggest this:
from functools import reduce
def fun(df , var, val):
_sub = reduce(
lambda x, y: x & (df[y] == val),
var,
pd.Series([True]*df.shape[0])
)
return df[_sub]
This will work with any number of input column variables. Hope this will serve as an inspiration to your more complicated operations you intend to do.
Related
I'm very new to Python and am having a problem trying to execute a very basic task in pandas. I am trying to create a new column (variable) called RACE which is based off of the values in RAC1P_RC1. I have tried every way to recode RACE (loc, apply, lambda), but it will not update its values at all, even when the argument is true. For example, I tried to use the code
def f(x):
if x['RAC1P_RC1'] == 1: return 1
else: return 0
acs['RACE'] = acs.apply(f, axis=1)
And when I look at the dataframe, all cases in RACE have a value of 0, even in cases where RAC1P_RC1 equals 1. There seems to be something very basic I'm missing here, since this is one of the simplest tasks in pandas, and I'm not able to do it. Any help would be appreciated.
Check the datatype of 'RAC1P_RC1' column, make sure it's not an object data type. If its object datatype then the condition (if x['RAC1P_RC1'] == 1) will always return False.
Also, you could use .loc to make the code faster as follows:
mask = (acs['RAC1P_RC1'] == 1)
acs.loc[mask,'RACE'] = 1
acs.loc[~mask,'RACE'] = 0
You can check your condition directly and that will give you a Series of True/False then typecast that Series to int via astype() method and you will get corrosponding binary values:
acs['RACE'] =acs['RAC1P_RC1'].eq(1).astype(int)
OR
you can also use view() method in place of astype() for achieving the same:
acs['RACE'] =acs['RAC1P_RC1'].eq(1).view('i1')
I have a function that aims at printing the sum along a column of a pandas DataFrame after filtering on some rows to be defined ; and the percentage this quantity makes up in the same sum without any filter:
def my_function(df, filter_to_apply, col):
my_sum = np.sum(df[filter_to_apply][col])
print(my_sum)
print(my_sum/np.sum(df[col]))
Now I am wondering if there is any way to have a filter_to_apply that actually doesn't do any filter (i.e. keeps all rows), to keep using my function (that is actually a bit more complex and convenient) even when I don't want any filter.
So, some filter_f1 that would do: df[filter_f1] = df and could be used with other filters: filter_f1 & filter_f2.
One possible answer is: df.index.isin(df.index) but I am wondering if there is anything easier to understand (e.g. I tried to use just True but it didn't work).
A Python slice object, i.e. slice(-1), acts as an object that selects all indexes in a indexable object. So df[slice(-1)] would select all rows in the DataFrame. You can store that in a variable an an initial value which you can further refine in your logic:
filter_to_apply = slice(-1) # initialize to select all rows
... # logic that may set `filter_to_apply` to something more restrictive
my_function(df, filter_to_apply, col)
This is a way to select all rows:
df[range(0, len(df))]
this is also
df[:]
But I haven't figured out a way to pass : as an argument.
Theres a function called loc on pandas that filters rows. You could do something like this:
df2 = df.loc[<Filter here>]
#Filter can be something like df['price']>500 or df['name'] == 'Brian'
#basically something that for each row returns a boolean
total = df2['ColumnToSum'].sum()
I am trying to add columns to a python pandas df using the apply function.
However the number of columns to be added depend on the output of the function
used in the apply function.
example code:
number_of_columns_to_be_added = 2
def add_columns(number_of_columns_to_be_added):
df['n1'],df['n2'] = zip(*df['input'].apply(lambda x : do_something(x, number_of_columns_to_be_added)))
Any idea on how to define the ugly column part (df['n1'], ..., df['n696969']) before the = zip( ... part programatically?
I'm guessing that the output of zip is a tuple, therefore you could try this:
temp = zip(*df['input'].apply(lambda x : do_something(x, number_of_columns_to_be_added)))
for i, value in enumerate(temp, 1):
key = 'n'+str(i)
df[key] = value
temp will hold the all the entries and then you iterate over tempto assign the values to your dict with your specific keys. Hope this matches your original idea.
I encountered this lambda expression today and can't understand how it's used:
data["class_size"]["DBN"] = data["class_size"].apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
The line of code doesn't seem to call the lambda function or pass any arguments into it so I'm confused how it does anything at all. The purpose of this is to take two columns CSD and SCHOOL CODE and combine the entries in each row into a new row, DBN. So does this lambda expression ever get used?
You're writing your results incorrectly to a column. data["class_size"]["DBN"] is not the correct way to select the column to write to. You've also selected a column to use apply with but you'd want that across the entire dataframe.
data["DBN"] = data.apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
the apply method of a pandas Series takes a function as one of its arguments.
here is a quick example of it in action:
import pandas as pd
data = {"numbers":range(30)}
def cube(x):
return x**3
df = pd.DataFrame(data)
df['squares'] = df['numbers'].apply(lambda x: x**2)
df['cubes'] = df['numbers'].apply(cube)
print df
gives:
numbers squares cubes
0 0 0 0
1 1 1 1
2 2 4 8
3 3 9 27
4 4 16 64
...
as you can see, either defining a function (like cube) or using a lambda function works perfectly well.
As has already been pointed out, if you're having problems with your particular piece of code it's that you have data["class_size"]["DBN"] = ... which is incorrect. I was assuming that was an odd typo because you didn't mention getting a key error, which is what that would result in.
if you're confused about this, consider:
def list_apply(func, mylist):
newlist = []
for item in mylist:
newlist.append(func(item))
this is a (not very efficient) function for applying a function to every item in a list. if you used it with cube as before:
a_list = range(10)
print list_apply(cube, a_list)
you get:
[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
this is a simplistic example of how the apply function in pandas is implemented. I hope that helps?
Are you using a multi-index dataframe (i.e. There are column hierarchies)? It's hard to tell without seeing your data, but I'm presuming it is the case, since just using data["class_size"].apply() would yield a series on a normal dataframe (meaning the lambda wouldn't be able to find your columns specified and then there would be an error!)
I actually found this answer which explains the problem of trying to create columns in multi-index dataframes, one confusing things with multi-index column creation is that you can try to create a column like you are doing and it will seem to run without any issues, but won't actually create what you want. Instead, you need to change data["class_size"]["DBN"] = ... to data["class_size", "DBN"] = ... So, in full:
data["class_size","DBN"] = data["class_size"].apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
Of course, if it isn't a mult-index dataframe then this won't help, and you should look towards one of the other answers.
I think 0:02d means 2 decimal place for "CSD" value. {}{} basically places the 2 values together to form 'DBN'.
In Python's Pandas, I am using the Data Frame as such:
drinks = pandas.read_csv(data_url)
Where data_url is a string URL to a CSV file
When indexing the frame for all "light drinkers" where light drinkers is constituted by 1 drink, the following is written:
drinks.light_drinker[drinks.light_drinker == 1]
Is there a more DRY-like way to self-reference the "parent"? I.e. something like:
drinks.light_drinker[self == 1]
You can now use query or assign depending on what you need:
drinks.query('light_drinker == 1')
or to mutate the the df:
df.assign(strong_drinker = lambda x: x.light_drinker + 100)
Old answer
Not at the moment, but an enhancement with your ideas is being discussed here. For simple cases where might be enough. The new API might look like this:
df.set(new_column=lambda self: self.light_drinker*2)
In the most current version of pandas, .where() also accepts a callable!
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html?highlight=where#pandas.DataFrame.where
So, the following is now possible:
drinks.light_drinker.where(lambda x: x == 1)
which is particularly useful in method-chains. However, this will return only the Series (not the DataFrame filtered based on the values in the light_drinker column). This is consistent with your question, but I will elaborate for the other case.
To get a filtered DataFrame, use:
drinks.where(lambda x: x.light_drinker == 1)
Note that this will keep the shape of the self (meaning you will have rows where all entries will be NaN, because the condition failed for the light_drinker value at that index).
If you don't want to preserve the shape of the DataFrame (i.e you wish to drop the NaN rows), use:
drinks.query('light_drinker == 1')
Note that the items in DataFrame.index and DataFrame.columns are placed in the query namespace by default, meaning that you don't have to reference the self.
I don't know of any way to reference parent objects like self or this in Pandas, but perhaps another way of doing what you want which could be considered more DRY is where().
drinks.where(drinks.light_drinker == 1, inplace=True)