Pandas tuples groupby aggregation - python

Certain columns of my data frame contain tuples. Whenever I aggregate via groupby, those columns do not appear in the resulting data frame unless explicitly specified.
Example:
df = pd.DataFrame()
df['A'] = [1, 2, 1, 2]
df['B'] = [1, 2, 3, 4]
df['C'] = [(s,) for s in df['B']]  # one-element tuple per row
print(df)
A B C
0 1 1 (1,)
1 2 2 (2,)
2 1 3 (3,)
3 2 4 (4,)
If I aggregate this way, column C does not appear in the result:
print(df.groupby('A').sum())
B
A
1 4
2 6
But if I select it explicitly, it appears as expected:
print(df[['A', 'C']].groupby('A').sum())
C
A
1 (1, 3)
2 (2, 4)
Could you please tell me why the C column doesn't appear in the first case?
I would like it to be included by default.

Column C has object dtype (it holds tuples), and in the pandas version used here a plain sum() silently drops non-numeric "nuisance" columns, so only the numeric column B is aggregated:
import pandas as pd

df = pd.DataFrame()
df['A'] = [1, 2, 1, 2]
df['B'] = [1, 2, 3, 4]
df['C'] = [(s,) for s in df['B']]
df.at[0, 'B'] = 10
print(df)
A B C
0 1 10 (1,)
1 2 2 (2,)
2 1 3 (3,)
3 2 4 (4,)
print(df.groupby('A').sum())
B
A
1 13
2 6
print(df.groupby('A')['B'].sum())
A
1    13
2     6
Name: B, dtype: int64
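If you want C included by default, one workaround (a sketch, assuming tuple concatenation via sum is actually the aggregation you want for C) is to name an aggregation for every column, so nothing gets dropped:

```python
import pandas as pd

df = pd.DataFrame()
df['A'] = [1, 2, 1, 2]
df['B'] = [1, 2, 3, 4]
df['C'] = [(s,) for s in df['B']]

# Spelling out an aggregation per column keeps the object-dtype
# column C from being dropped as a "nuisance" column.
out = df.groupby('A').agg({'B': 'sum', 'C': 'sum'})
print(out)
```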

Related

Pandas DataFrame: Add a list to each cell by iterating over df with new column does not work

I have a DataFrame with columns a and b.
I now want to add a new column c that should contain lists (of different lengths).
df1 = pd.DataFrame({'a':[1,2,3], 'b':[5,6,7]})
new_col_init = [list() for i in range(len(df1))]
df1['c'] = pd.Series(new_col_init,dtype='object')
print(df1)
gives:
a b c
0 1 5 []
1 2 6 []
2 3 7 []
Why am I unable to do the following:
for i in range(len(df1)):
    df1.loc[i,'c'] = [2]*i
This results in ValueError: cannot set using a multi-index selection indexer with a different length than the value.
However this works:
df1['c'] = pd.Series([[2], [2,2], [2,2,2]])
print(df1)
Result:
a b c
0 1 5 [2]
1 2 6 [2, 2]
2 3 7 [2, 2, 2]
Is there a way to assign the lists by iterating with a for-loop? (I have a lot of other stuff that gets already assigned within that loop and I now need to add the new lists)
You can use .at:
for i, idx in enumerate(df1.index, 1):
    df1.at[idx, "c"] = [2] * i
print(df1)
Prints:
a b c
0 1 5 [2]
1 2 6 [2, 2]
2 3 7 [2, 2, 2]
Here is a solution you can try out using Index.map,
df1['c'] = df1.index.map(lambda x: (x + 1) * [2])
a b c
0 1 5 [2]
1 2 6 [2, 2]
2 3 7 [2, 2, 2]
df1.loc[:, 'c'] = [[2]*(i+1) for i in range(len(df1))]
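Putting the pieces together, a minimal self-contained version of the .at approach (the key point being that .at treats the list as a single cell value, while .loc tries to align it element-wise):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})
df1['c'] = pd.Series([[] for _ in range(len(df1))], dtype='object')

# .loc interprets a list value as multiple items to align, so it raises;
# .at assigns the list as one scalar cell value.
for i, idx in enumerate(df1.index, 1):
    df1.at[idx, 'c'] = [2] * i

print(df1['c'].tolist())  # [[2], [2, 2], [2, 2, 2]]
```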

Computed/Reactive column in Pandas?

I would like to emulate an Excel formula in Pandas. I've tried this:
df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [5, 3, 2, 1]})
df['c'] = lambda x : df.a + df.b + 1 # Displays <function <lambda> ..> instead of the result
df['d'] = df.a + df.b + 1 # Static computation
df.a *= 2
df # Result of column c and d not updated :(
a b c d
0 6 5 <function <lambda> at 0x7f2354ddcca0> 9
1 4 3 <function <lambda> at 0x7f2354ddcca0> 6
2 2 2 <function <lambda> at 0x7f2354ddcca0> 4
3 0 1 <function <lambda> at 0x7f2354ddcca0> 2
What I expect is:
df
a b c
0 6 5 12
1 4 3 8
2 2 2 5
3 0 1 2
df.a /= 2
a b c
0 3 5 9
1 2 3 6
2 1 2 4
3 0 1 2
Is this possible to have a computed column dynamically in Pandas?
Maybe this code gives you a step in the right direction:
import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [5, 3, 2, 1]})
c_list2 = list(map(lambda x: x + df.b + 1, list(df.a)))
c_list = []
for i in range(0, 4):
    c_list.append(pd.DataFrame(c_list2[i])["b"][i])
df['c'] = c_list
df['d'] = df.a + df.b  # Static computation
df.a *= 2
df
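If full reactivity is overkill, a lighter-weight sketch (my own variant, not the code above) is to recompute the derived column through a helper function whenever you need it:

```python
import pandas as pd

def with_c(df):
    """Return a copy with the derived column recomputed from a and b."""
    out = df.copy()
    out['c'] = out.a + out.b + 1
    return out

df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [5, 3, 2, 1]})
print(with_c(df))   # c = a + b + 1
df.a *= 2
print(with_c(df))   # c reflects the updated a
```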
Reactivity between columns in a DataFrame does not seem practically feasible. My cellopype package does give you Excel-like reactivity between DataFrames. Here's my take on your question:
pip install cellopype
import pandas as pd
from cellopype import Cell
# define source df and wrap it in a Cell:
df_ab = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [5, 3, 2, 1]})
cell_ab = Cell(recalc=lambda: df_ab.copy())
# define the dependent/reactive Cell (with a single column 'c')
cell_c = Cell(
    recalc=lambda df: pd.DataFrame(df.a + df.b, columns=['c']),
    sources=[cell_ab]
)
# and get its value
print(cell_c.value)
c
0 8
1 5
2 3
3 1
# change source df and recalc its Cell...
df_ab.loc[0,'a']=100
cell_ab.recalc()
# cell_c has been updated in response
print(cell_c.value)
c
0 105
1 5
2 3
3 1
Also see my response to this question.

Finding elements in a pandas dataframe

I have a pandas dataframe which looks like the following:
0 1
0 2
2 3
1 4
What I want to do is the following: if I get 2 as input my code is supposed to search for 2 in the dataframe and when it finds it returns the value of the other column. In the above example my code would return 0 and 3. I know that I can simply look at each row and check if any of the elements is equal to 2 but I was wondering if there is one-liner for such a problem.
UPDATE: None of the columns are index columns.
Thanks
>>> df = pd.DataFrame({'A': [0, 0, 2, 1], 'B': [1,2,3,4]})
>>> df
A B
0 0 1
1 0 2
2 2 3
3 1 4
The following pandas syntax is equivalent to the SQL SELECT B FROM df WHERE A = 2
>>> df[df['A'] == 2]['B']
2 3
Name: B, dtype: int64
There's also pandas.DataFrame.query:
>>> df.query('A == 2')['B']
2 3
Name: B, dtype: int64
You may need this:
n_input = 2
df[(df == n_input).any(axis=1)].stack()[lambda x: x != n_input].unique()
# array([0, 3])
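For completeness, a sketch that searches both columns and returns the value from the other one (assuming a two-column frame like the example):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 2, 1], 'B': [1, 2, 3, 4]})
n = 2

# Rows where A matches contribute their B value, and vice versa.
other = pd.concat([df.loc[df['A'] == n, 'B'], df.loc[df['B'] == n, 'A']])
print(sorted(other.tolist()))  # [0, 3]
```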
df = pd.DataFrame({'A': [0, 0, 2, 1], 'B': [1,2,3,4]})
t = df.loc[lambda df: df['A'] == 2]
t

Change pivot table from Series to DataFrame

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'B': ['X', 'Y', 'Z'] * 3,
'C': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
>>> df
A B C
0 1 X 1
1 1 Y 2
2 1 Z 3
3 2 X 1
4 2 Y 2
5 2 Z 3
6 3 X 1
7 3 Y 2
8 3 Z 3
result = df.pivot_table(index=['B'], values='C', aggfunc=sum)
>>> result
B
X 3
Y 6
Z 9
Name: C, dtype: int64
How can I have the column name for C show up above the sums, and how can I sort result either ascending or descending? Result is a series, not a dataframe, and seems non-sortable.
Python: 2.7.11 and Pandas: 0.17.1
You were very close. Note that the brackets around the values coerce the result into a dataframe instead of a series (i.e. values=['C'] instead of values='C').
result = df.pivot_table(index = ['B'], values=['C'], aggfunc=sum)
>>> result
C
B
X 3
Y 6
Z 9
As result is now a dataframe, you can use sort_values on it:
>>> result.sort_values('C', ascending=False)
C
B
Z 9
Y 6
X 3
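An equivalent route without the bracket trick (a sketch: recent pandas versions already return a DataFrame from pivot_table, so the promotion is guarded) is to promote the Series with to_frame() and then sort:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'B': ['X', 'Y', 'Z'] * 3,
                   'C': [1, 2, 3, 1, 2, 3, 1, 2, 3]})

pivoted = df.pivot_table(index='B', values='C', aggfunc='sum')
# Older pandas returned a Series here; .to_frame() promotes it either way.
frame = pivoted if isinstance(pivoted, pd.DataFrame) else pivoted.to_frame('C')
result = frame.sort_values('C', ascending=False)
print(result)
```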

Get rows based on my given list without revising the order or unique the list

I have a df that looks like the one below. I would like to get rows from the 'D' column based on my list, without changing the order of the list or deduplicating it.
A B C D
0 a b 1 1
1 a b 1 2
2 a b 1 3
3 a b 1 4
4 c d 2 5
5 c d 3 6 #df
My list
l = [4, 2, 6, 4] # my list
df.loc[df['D'].isin(l)].to_csv('output.csv', index = False)
When I use isin() the result is reordered and deduplicated, and df.loc[df['D'] == value] only prints the last match.
A B C D
3 a b 1 4
1 a b 1 2
5 c d 3 6
3 a b 1 4 # desired output
Any good way to do this? Thanks,
A solution without a loop, using merge:
In [26]: pd.DataFrame({'D':l}).merge(df, how='left')
Out[26]:
D A B C
0 4 a b 1
1 2 a b 1
2 6 c d 3
3 4 a b 1
You're going to have to iterate over your list, get filtered copies, and then concat them all together:
l = [4, 2, 6, 4]  # you shouldn't use list = as list is a builtin
cache = {}
masked_dfs = []
for v in l:
    try:
        filtered_df = cache[v]
    except KeyError:
        filtered_df = df[df['D'] == v]
        cache[v] = filtered_df
    masked_dfs.append(filtered_df)
new_df = pd.concat(masked_dfs)
UPDATE: modified my answer to cache answers so that you don't have to do multiple searches for repeats
Just collect the indices of the values you are looking for, put them in a list, and then use that list to slice the data:
import pandas as pd

df = pd.DataFrame({
    'C': [6, 5, 4, 3, 2, 1],
    'D': [1, 2, 3, 4, 5, 6]
})
l = [4, 2, 6, 4]
i_locs = [ind for elem in l for ind in df[df['D'] == elem].index]
df.loc[i_locs]
results in
C D
3 3 4
1 5 2
5 1 6
3 3 4
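Another compact option, assuming the values in D are unique, is to index by D and look up the list directly; .loc preserves the order and the duplicates in the list:

```python
import pandas as pd

df = pd.DataFrame({'C': [6, 5, 4, 3, 2, 1],
                   'D': [1, 2, 3, 4, 5, 6]})
l = [4, 2, 6, 4]

# .loc on the D-index repeats rows in exactly the order given by l
out = df.set_index('D').loc[l].reset_index()
print(out)
```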
