Update a dataframe within apply after using groupby - python

I have a pandas dataframe that I want to group on and then update the original dataframe using iterrows and set_value. This doesn't appear to work.
Here is an example.
In [1]: def func(df, n):
   ...:     for i, row in df.iterrows():
   ...:         print("Updating {0} with value {1}".format(i, n))
   ...:         df.set_value(i, 'B', n)

In [2]: df = pd.DataFrame({"A": [1, 2], "B": [0, 0]})

In [3]: df
Out[3]:
   A  B
0  1  0
1  2  0

In [4]: func(df, 1)
Updating 0 with value 1
Updating 1 with value 1

In [5]: df
Out[5]:
   A  B
0  1  1
1  2  1

In [6]: df.groupby('A').apply(lambda df: func(df, 2))
Updating 0 with value 2
Updating 0 with value 2
Updating 1 with value 2

In [7]: df
Out[7]:
   A  B
0  1  1
1  2  1
I was hoping that B would have been updated to 2.
Why isn't this working, and what is the best way to achieve this result?

The way you have things written, you seem to want func(df, n) to modify df in place. But df.groupby('A') (in some sense) creates another set of dataframes, one per group, so using func() as an argument to df.groupby('A').apply() only modifies these newly created dataframes, not the original df. Furthermore, the returned dataframe is a concatenation of the outputs of func() for each group, and since func() returns nothing, that concatenation is empty.
The shortest fix to your problem is to return df at the end of func:
def func(df, n):
    for i, row in df.iterrows():
        print("Updating {0} with value {1}".format(i, n))
        df.set_value(i, 'B', n)
    return df

df = df.groupby('A').apply(lambda df: func(df, 2))
I presume this is not exactly what you had in mind, because you're probably expecting to modify everything in place. If that is your intention, you'd need a combination of a for loop and .loc, but modifying your dataframe with .loc will be computationally expensive if you call .loc many times.
I would also guess that your function to set values depends on a more complicated criterion, but usually you can vectorize things and avoid having to use .iterrows() altogether.
To avoid the XY problem, I'd suggest describing your function in more detail, because chances are that you can get everything done with a few lines incorporating the use of .loc and avoiding the need to iterate through every row in Python. Case in point: df['B'] = 2 (sans a print statement) is a one-liner solution to your problem.
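For instance, if the new value depended on the group key, a single vectorized .loc assignment avoids iterating rows altogether. A minimal sketch, with a hypothetical condition standing in for your real criterion:

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [0, 0]})

# Set B to 2 only where the group key A equals 2; the condition here is
# made up, standing in for whatever criterion you actually have.
df.loc[df['A'] == 2, 'B'] = 2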

This isn't working because you are altering the copied subsets of df delivered by the groupby object's get_group method. You are changing something, just not what you were expecting.
If that weren't reason enough not to do this, notice that there were three print statements for two rows. That's because pandas runs the function on the first group once to test it and infer the output, then again to actually do the work. If you alter things outside the function's scope, you may end up with unintended consequences.
Someone else can provide a better example of how to do it. I just wanted to explain why it wasn't working.
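To make the double evaluation visible, you can count invocations. A minimal sketch (the extra call on the first group occurs on pandas versions of that era; later releases changed this behavior):

import pandas as pd

calls = []

def probe(group):
    # Record the group key each time apply invokes the function.
    calls.append(group['A'].iloc[0])
    return group

df = pd.DataFrame({"A": [1, 2], "B": [0, 0]})
df.groupby('A').apply(probe)
print(calls)  # e.g. [1, 1, 2]: the first group shows up twice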

In some situations, if func() does things based on index, you could modify the original dataframe directly.
Instead of this:
def func(group, n):
    for i, row in group.iterrows():
        print("Updating {0} with value {1}".format(i, n))
        group.set_value(i, 'B', n)
    return group

df.groupby('A').apply(lambda group: func(group, 2))
You could do this:
for key, group in df.groupby('A'):
    n = 2
    for i, row in group.iterrows():
        print("Updating {0} with value {1}".format(i, n))
        df.set_value(i, 'B', n)
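Note that set_value was deprecated in pandas 0.21 and removed in 1.0; on modern pandas the same index-based update can be written with .at. A sketch of the loop above under that assumption:

for key, group in df.groupby('A'):
    n = 2
    for i, row in group.iterrows():
        # .at is the scalar setter that replaces the removed set_value
        df.at[i, 'B'] = n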


Is there a way in python to use one column as index to change element of other string column?

I have a DataFrame with two columns that look something like this:
   A          B
0  '_______'  [2,3]
1  '_______'  [0]
2  '_______'  [1,4,6]
where column A holds a string of 7 underscores and column B contains a numpy array of varying length. My goal is to change column A, using B as indexes, so it looks like this:
   A          B
0  '__23___'  [2,3]
1  '0______'  [0]
2  '_1__4_6'  [1,4,6]
My code seems to work but I keep getting the error:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
I don't understand how to fix this error. My code is:
for i in range(len(df)):
    row = df.iloc[i, :].copy()
    numbers = row['B']
    for j in numbers:
        loop_string = df['A'][i]
        df['A'][i] = loop_string[:j] + str(j) + loop_string[j+1:]
Also, the fact that I need two for loops bothers me; surely this is possible in another, more efficient way. Can anyone help me?
You can use apply to run a custom function on the B column:

df['A'] = df['B'].apply(lambda l: ''.join([str(i) if i in l else '_' for i in range(7)]))
The above does not consider the original value of A but instead creates an entirely new string column.
Result:
   A        B
0  __23___  [2, 3]
1  0______  [0]
2  _1__4_6  [1, 4, 6]
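If the original contents of A mattered (here they are all underscores, so they don't), a row-wise variant could rebuild each string from its existing characters. A minimal sketch:

df['A'] = df.apply(
    # Keep each existing character c unless its position i appears in B.
    lambda row: ''.join(str(i) if i in row['B'] else c
                        for i, c in enumerate(row['A'])),
    axis=1,
)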

Why do any() and pd.any() return different values?

I recently discovered that the built-in function any() doesn't work for pandas dataframes.
import pandas as pd
data = pd.DataFrame([True, False])
print("base: " + str(any(data)))
print("pandas: " + str(data.any()))
Result:
base: False
pandas: 0 True
dtype: bool
Can someone explain the logic behind this behavior?
Iterating over a dataframe means iterating over its column labels, e.g.:
In [3]: df = pd.DataFrame({"col_1": [1, 2], "col_2": [3, 4]})

In [4]: df
   col_1  col_2
0      1      3
1      2      4

In [5]: for i in df:
   ...:     print(i)
col_1
col_2
In your case there is only one column, with the default label 0 (the number 0, not the string '0'), so any(data) is effectively any([0]), which in turn is any([False]), and that evaluates to False.
Looking at docs for any(), it says:
any(iterable)
    Return True if any element of the iterable is true.
    If the iterable is empty, return False. Equivalent to:

def any(iterable):
    for element in iterable:
        if element:
            return True
    return False
If you do:

for element in data:
    print(element)

it will print 0. Also, print(list(data)) gives [0], i.e. a list with one element, 0.
So when you iterate over the dataframe itself (not over its rows), you iterate over the column labels; in this case you get just one label, 0, which is interpreted as False when you do any(data).
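To actually test the values rather than the column labels, reduce with pandas itself. A minimal sketch:

import pandas as pd

data = pd.DataFrame([True, False])

# .any() reduces each column to one bool; a second .any() reduces across
# columns, giving a single Python bool.
print(data.any().any())       # True
print(data.to_numpy().any())  # True, via the underlying numpy array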

How to map one matrix value to another in theano function

I want to implement the following computation as a Theano function:

a = numpy.array([[b_row[dictx[idx]] if idx in dictx else 0
                  for idx in range(len(b_row))]
                 for b_row in b])

where a and b are ndarrays and dictx is a dictionary.
I got the error TensorType does not support iteration
Do I have to use scan, or is there a simpler way?
Thanks!
Since b is of type ndarray, I'll assume every b_row has the same length.
If I understood correctly, the code swaps the order of columns in b according to dictx and pads the non-specified columns with zeros.
The main problem is that Theano doesn't have a dictionary-like data structure (please let me know if there is one).
Because in your example the dictionary keys and values are integers within range(len(b_row)), one way to work around this is to construct a vector that uses indices as keys (if some index should not be contained in the dictionary, make its value -1).
The same idea should apply for mapping elements of a matrix in general, and there are certainly other (better) ways of doing this.
Here is the code.
Numpy:

import numpy

dictx = {0: 1, 1: 2}
b = numpy.asarray([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
a = numpy.array([[b_row[dictx[idx]] if idx in dictx else 0
                  for idx in range(len(b_row))]
                 for b_row in b])
print(a)

Theano:

import numpy
import theano
from theano import tensor

# The shared vector plays the role of the dictionary:
# position = key, entry = value, -1 = key not present.
dictx = theano.shared(numpy.asarray([1, 2, -1]))
b = tensor.matrix()
a = tensor.switch(tensor.eq(dictx, -1), tensor.zeros_like(b), b[:, dictx])
fn = theano.function([b], a)
print(fn(numpy.asarray([[1, 2, 3],
                        [4, 5, 6],
                        [7, 8, 9]])))
They both print:
[[2 3 0]
 [5 6 0]
 [8 9 0]]

Pandas Equivalent of R's which()

Variations of this question have been asked before, but I'm still having trouble understanding how to actually slice a Python series/pandas dataframe based on conditions I'd like to set.
In R, what I'm trying to do is:
df[which(df[,colnumber] > somenumberIchoose),]
The which() function finds indices of row entries in a column in the dataframe which are greater than somenumberIchoose, and returns this as a vector. Then, I slice the dataframe by using these row indices to indicate which rows of the dataframe I would like to look at in the new form.
Is there an equivalent way to do this in python? I've seen references to enumerate, which I don't fully understand after reading the documentation. My sample in order to get the row indices right now looks like this:
indexfuture = [ x.index(), x in enumerate(df['colname']) if x > yesterday]
However, I keep getting an invalid syntax error. I can hack a workaround by looping through the values and doing the search manually, but that seems extremely non-Pythonic and inefficient.
What exactly does enumerate() do? What is the pythonic way of finding indices of values in a vector that fulfill desired parameters?
Note: I'm using Pandas for the dataframes
I may not understand the question clearly, but it looks like the answer is simpler than you think.
Using a pandas DataFrame:

df['colname'] > somenumberIchoose
returns a pandas series with True / False values and the original index of the DataFrame.
Then you can use that boolean series on the original DataFrame and get the subset you are looking for:
df[df['colname'] > somenumberIchoose]
should be enough.
See http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
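If you specifically want the row indices, like R's which(), the boolean mask converts to indices as well. A minimal sketch with made-up data and threshold:

import numpy as np
import pandas as pd

df = pd.DataFrame({'colname': [1, 5, 3, 8]})
somenumberIchoose = 4

mask = df['colname'] > somenumberIchoose
print(df.index[mask])     # the row labels where the condition holds
print(np.where(mask)[0])  # positional indices, closest to R's which()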
From what I know of R, you might be more comfortable working with numpy, a scientific computing package similar to MATLAB.
If you want the indices of an array whose values are divisible by two, the following would work:
import numpy

arr = numpy.arange(10)
truth_table = arr % 2 == 0
indices = numpy.where(truth_table)
values = arr[indices]
It's also easy to work with multi-dimensional arrays
arr2d = arr.reshape(2, 5)
col_index = 0  # the row to examine (undefined in the original snippet)
col_indices = numpy.where(arr2d[col_index] % 2 == 0)
col_values = arr2d[col_index, col_indices]
enumerate() returns an iterator that yields an (index, item) tuple in each iteration, so you can't (and don't need to) call .index() again.
Furthermore, your list comprehension syntax is wrong:
indexfuture = [(index, x) for (index, x) in enumerate(df['colname']) if x > yesterday]
Test case:
>>> [(index, x) for (index, x) in enumerate("abcdef") if x > "c"]
[(3, 'd'), (4, 'e'), (5, 'f')]
Of course, you don't need to unpack the tuple:
>>> [tup for tup in enumerate("abcdef") if tup[1] > "c"]
[(3, 'd'), (4, 'e'), (5, 'f')]
unless you're only interested in the indices, in which case you could do something like
>>> [index for (index, x) in enumerate("abcdef") if x > "c"]
[3, 4, 5]
And if you need to combine conditions, pandas.Series supports element-wise operations between Series (+, -, /, *).
Just multiply the boolean masks:
idx1 = df['lat'] == 49
idx2 = df['lng'] > 15
idx = idx1 * idx2
new_df = df[idx]
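Multiplying boolean Series works, but the idiomatic operator for this is &. A one-line sketch assuming the same df:

# & is element-wise logical AND for boolean Series; the parentheses are
# required because & binds more tightly than the comparisons.
new_df = df[(df['lat'] == 49) & (df['lng'] > 15)]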
Instead of enumerate, I usually just use .iteritems. This saves a .index(). Namely,
[k for k, v in (df['c'] > t).iteritems() if v]
Otherwise, one has to do

df[df['c'] > t].index

This duplicates the dataframe name, which can be very long and painful to type.
A nice simple and neat way of doing this is the following:
SlicedData1 = df[df.colname > somenumber]
This can easily be extended to include other criteria, such as non-numeric data:
SlicedData2 = df[(df.colname1 > somenumber) & (df.colname2 == '24/08/2018')]
And so on...

How can I improve this Pandas DataFrame construction?

I wrote this ugly piece of code. It does the job, but it is not elegant. Any suggestions to improve it?
function(i, j) returns a dict.
pairs = [dict({"i": i, "j": j}.items() + function(i, j).items())
         for i, j in my_iterator]
pairs = pd.DataFrame(pairs).set_index(['i', 'j'])

The dict({}.items() + function(i, j).items()) is supposed to merge both dicts into one, since dict.update() does not return the merged dict.
A favourite trick* for creating and updating a dictionary in a single expression:

dict(i=i, j=j, **function(i, j))

*and the subject of much debate about whether this is actually "valid"...
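Plugged into the original list comprehension, the whole construction becomes two short lines. A minimal sketch with a stand-in for the real function:

import pandas as pd

def function(i, j):
    # Stand-in for the real function, which returns a dict for a given (i, j).
    return {"sum": i + j, "prod": i * j}

my_iterator = [(1, 2), (3, 4)]

pairs = pd.DataFrame([dict(i=i, j=j, **function(i, j)) for i, j in my_iterator])
pairs = pairs.set_index(['i', 'j'])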
Perhaps also worth mentioning the DataFrame from_records method:
In [11]: my_iterator = [(1, 2), (3, 4)]

In [12]: df = pd.DataFrame.from_records(my_iterator, columns=['i', 'j'])

In [13]: df
Out[13]:
   i  j
0  1  2
1  3  4
I suspect there would be a more efficient method by vectorizing your function (but it's hard to say what makes more sense without more specifics of your situation)...
