Flatten nested pandas dataframe columns - python

After some aggregation, my dataframe looks something like this
    A      B
       B_min  B_max
0  11      3      6
1  22      1      2
2  33      4      4
How do I make the columns be A, B_min and B_max, without any nesting? Simple and standard. I've tried reindex_axis() and unstack(), but nothing worked.

Here is one way, but I wish there was an in-built way to do this.
import pandas as pd

df = pd.DataFrame({'A': [11, 11, 22, 22, 33, 33],
                   'B': [3, 6, 1, 2, 4, 4]})
g = df.groupby('A', as_index=False).agg({'B': ['min', 'max']})
g.columns = ['_'.join(col).strip() if col[1] else col[0] for col in g.columns.values]
#     A  B_min  B_max
# 0  11      3      6
# 1  22      1      2
# 2  33      4      4
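If you are on pandas 0.25 or newer, named aggregation avoids the MultiIndex entirely, so there is nothing to flatten afterwards. A minimal sketch of that alternative, using the same data:

import pandas as pd

df = pd.DataFrame({'A': [11, 11, 22, 22, 33, 33],
                   'B': [3, 6, 1, 2, 4, 4]})

# Named aggregation (pandas >= 0.25) yields flat column labels directly.
g = df.groupby('A', as_index=False).agg(B_min=('B', 'min'), B_max=('B', 'max'))
print(g)
#     A  B_min  B_max
# 0  11      3      6
# 1  22      1      2
# 2  33      4      4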

Related

Pandas inserting values to new column using pandas apply/map/Applymap

The following is the dataframe,
a b
0 1 3
1 2 4
2 3 5
3 4 6
4 5 7
5 6 8
6 7 9
I want to add a new column, call it sum, which holds the sum of its respective row values.
Expected output
a b sum
0 1 3 4
1 2 4 6
2 3 5 8
3 4 6 10
4 5 7 12
5 6 8 14
6 7 9 16
How do I achieve this using the pandas map, apply, or applymap functions?
My Code
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7],
    'b': [3, 4, 5, 6, 7, 8, 9]
})

def sum(df):
    return df['a'] + df['b']

# Methods I tried
df['sum'] = df.apply(sum(df))
df['sum'] = df[['a', 'b']].map(sum)
df['sum'] = df.apply(lambda x: x['a'] + x['b'])
Note: This is just dummy code. The original code has a function that returns a different output for each individual row, and it isn't as simple as applying a sum. So please use a custom sum function in your examples, so that I can learn from them and apply the same approach to my code.
You can use the pandas sum function like below:
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7], "b": [3, 4, 5, 6, 7, 8, 9]})
df["sum"] = df.sum(axis=1)
print(df)
And if you have to use lambda with apply you can try:
import pandas as pd

def add(a, b):
    return a + b

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7],
    'b': [3, 4, 5, 6, 7, 8, 9]
})
df['sum'] = df.apply(lambda row: add(row['a'], row['b']), axis=1)
print(df)
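For completeness: when the per-row function is just element-wise arithmetic like this, plain column arithmetic avoids apply entirely and is much faster. A minimal sketch on the same data:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [3, 4, 5, 6, 7, 8, 9]})

# Vectorized column arithmetic; no row-wise apply needed.
df['sum'] = df['a'] + df['b']
print(df)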

Python DataFrame: move row values left according to row index

I have a table like this:
import pandas as pd
data = [[20, 15, 10, 5], [20, 15, 10, 5], [20, 15, 10, 5], [20, 15, 10, 5]]
df = pd.DataFrame(data, columns = ['one', 'two', 'three', 'four'])
df
   one  two  three  four
0   20   15     10     5
1   20   15     10     5
2   20   15     10     5
3   20   15     10     5
I want to shift the values in every row to the left according to the row index.
Row 0 stays the same, row 1 shifts left by one position, row 2 shifts left by two positions, and so on.
Desired table should looks like this:
   one  two  three  four
0   20   15     10     5
1   15   10      5     0
2   10    5      0     0
3    5    0      0     0
Thanks for helping me!
Another way is to use a simple loop to shift the values in every row, then use
fillna to replace the resulting NaN values with 0:
for i in range(len(df)):
    df.iloc[i, :] = df.iloc[i, :].shift(-i)
df.fillna(0, inplace=True)
Output:
>>> df
one two three four
0 20 15.0 10.0 5.0
1 15 10.0 5.0 0.0
2 10 5.0 0.0 0.0
3 5 0.0 0.0 0.0
You could use a method that shifts each row left by its index value and fills the tail with 0:
import pandas as pd

def rotate_row(row):
    return pd.Series(row.to_list()[row.name:] + [0] * row.name, index=row.index)

data = [[20, 15, 10, 5], [20, 15, 10, 5], [20, 15, 10, 5], [20, 15, 10, 5]]
df = pd.DataFrame(data, columns=['one', 'two', 'three', 'four'])
df = df.apply(rotate_row, axis=1)
print(df)
one two three four
0 20 15 10 5
1 15 10 5 0
2 10 5 0 0
3 5 0 0 0
An upper right triangle to upper left triangle approach:
Create a mask to grab the upper triangle of the DataFrame using np.triu + np.ones + DataFrame.shape
mask = np.triu(np.ones(df.shape, dtype=bool))
[[ True True True True]
[False True True True]
[False False True True]
[False False False True]]
Grab corresponding values from the values of the DataFrame:
a = df.values[mask]
[20 15 10 5 15 10 5 10 5 5]
Create an np.zeros skeleton with the same dtype as a, fliplr the mask, and assign back:
tri = np.zeros(df.shape, dtype=a.dtype)
tri[np.fliplr(mask)] = a
[[20 15 10 5]
[15 10 5 0]
[10 5 0 0]
[ 5 0 0 0]]
Turn back into a DataFrame:
new_df = pd.DataFrame(tri, columns=df.columns)
new_df:
one two three four
0 20 15 10 5
1 15 10 5 0
2 10 5 0 0
3 5 0 0 0
Complete Working Example:
import numpy as np
import pandas as pd
data = [[20, 15, 10, 5], [20, 15, 10, 5], [20, 15, 10, 5],
        [20, 15, 10, 5]]
df = pd.DataFrame(data, columns=['one', 'two', 'three', 'four'])
mask = np.triu(np.ones(df.shape, dtype=bool))
a = df.values[mask]
tri = np.zeros(df.shape, dtype=a.dtype)
tri[np.fliplr(mask)] = a
new_df = pd.DataFrame(tri, columns=df.columns)
print(new_df)

Creating a new dataframe off of duplicate indexes

I'm working in pandas and I have a dataframe X
idx
0
1
2
3
4
I want to create a new dataframe using the following list of indexes. There are duplicate indexes because I want some rows to repeat.
idx = [0,0,1,2,3,2,4]
My expected output is
idx
0
0
1
2
3
2
4
I can't use
X.iloc[idx]
because of the duplicated indexes.
Code I tried:
d = {'idx': [0,1,3,4]}
df = pd.DataFrame(data=d)
idx = [0,0,1,2,3,2,4]
df.iloc[idx] # errors here with IndexError: indices are out-of-bounds
What you want to do is weird, but here is one way to do it.
import pandas as pd

df = pd.DataFrame({'A': [11, 21, 31],
                   'B': [12, 22, 32],
                   'C': [13, 23, 33]},
                  index=['ONE', 'ONE', 'TWO'])
OUTPUT
      A   B   C
ONE  11  12  13
ONE  21  22  23
TWO  31  32  33
Read: pandas: Rename columns / index names (labels) of DataFrame
Your current dataframe df:
   idx
0    0
1    1
2    3
3    4
Now just use the reindex() method:
idx = [0, 0, 1, 2, 3, 2, 4]
df = df.reindex(idx)
Now if you print df you get:
   idx
0  0.0
0  0.0
1  1.0
2  3.0
3  4.0
2  3.0
4  NaN
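Note that iloc itself accepts repeated positions; the IndexError in the question only occurs because the test frame has four rows while idx references position 4. A minimal sketch, assuming X really has five rows as shown:

import pandas as pd

X = pd.DataFrame({'idx': [0, 1, 2, 3, 4]})
idx = [0, 0, 1, 2, 3, 2, 4]

# iloc happily repeats positions; reset_index gives a clean 0..n-1 index.
out = X.iloc[idx].reset_index(drop=True)
print(out)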

How to make separate lists out of multiple dataframe columns?

Yes, this has been discussed a lot and similar questions have been downvoted multiple times... I still can't figure this one out.
Say I have a dataframe like this:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
I want to end up with four separate lists (a, b, c and d) with the data from each column.
Logically (to me anyway) I would do:
list_of_lst = df.values.T.astype(str).tolist()
for column in df.columns:
    i = 0
    while i < len(df.columns) - 1:
        column = list_of_lst[1]
        i = i + 1
But assigning variable names in a loop is not doable/recommended...
Any suggestions how I can get what I need?
I think the best approach is to create a dictionary of lists with DataFrame.to_dict:
np.random.seed(456)
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
print (df)
A B C D
0 5 9 4 5
1 7 1 8 3
2 5 2 4 2
3 2 8 4 8
4 5 6 0 9
5 8 2 3 6
6 7 0 0 3
7 3 5 6 6
8 3 8 9 6
9 5 1 6 1
d = df.to_dict('list')
print (d['A'])
[5, 7, 5, 2, 5, 8, 7, 3, 3, 5]
If you really want separate A, B, C and D lists:
for k, v in df.to_dict('list').items():
    globals()[k] = v
print (A)
[5, 7, 5, 2, 5, 8, 7, 3, 3, 5]
retList = dict()
for i in df.columns:
    iterator = df[i].tolist()
    retList[i] = iterator
You'd get a dictionary with the column names as keys and the list of each column's values as values.
Modify it into any data structure you want.
retList.values() will give you a list of size 4, with each inner list holding one column's values.
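If you want four standalone names without touching globals(), plain tuple unpacking over the columns also works. A minimal sketch, assuming the same A/B/C/D frame:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

# One list per column, bound to separate names by unpacking.
a, b, c, d = (df[col].tolist() for col in df.columns)
print(len(a), a[:5])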
You can transpose your dataframe and use df.T.values.tolist(). But if you are manipulating numeric arrays afterwards, it's advisable to skip the tolist() part.
df = pd.DataFrame(np.random.randint(0, 100, size=(5, 4)), columns=list('ABCD'))
# A B C D
# 0 17 56 57 31
# 1 3 44 15 0
# 2 94 36 87 30
# 3 44 49 56 76
# 4 29 5 35 24
list_of_lists = df.T.values.tolist()
# [[17, 3, 94, 44, 29],
# [56, 44, 36, 49, 5],
# [57, 15, 87, 56, 35],
# [31, 0, 30, 76, 24]]
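As a follow-up to the note about skipping tolist(): unpacking the transposed values keeps each column as a NumPy array, which is usually what you want for further numeric work. A minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(5, 4)), columns=list('ABCD'))

# Each of a, b, c, d is a NumPy array (one row of the transposed values).
a, b, c, d = df.T.values
print(type(a), a)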

Conditional multiplication of multiple series with another series

I would like to multiply (in place) values in one column of a DataFrame by values in another column, based on a condition in a third column. For example:
data = pd.DataFrame({'a': [1, 33, 56, 79, 2], 'b': [9, 12, 14, 5, 5], 'c': np.arange(5)})
data.loc[data.a > 10, ['a', 'b']] *= data.loc[data.a > 10, 'c']
What I would like this to do is multiply the values of both 'a' and 'b' by the corresponding (same row) value in 'c' based on a condition. However, the above code just results in NaN values in the desired range.
The closest workaround I've found has been to do this:
data.loc[data.a > 10, ['a', 'b']] = (data.loc[data.a > 10, ['a', 'b']].as_matrix().T * data.loc[data.a > 10, 'c']).T
which works, but it seems like there is a better (more Pythonic) way that I'm missing.
You can use the mul(..., axis=0) method:
In [122]: mask = data.a > 10
In [125]: data.loc[mask, ['a','b']] = data.loc[mask, ['a','b']].mul(data.loc[mask, 'c'], 0)
In [126]: data
Out[126]:
a b c
0 1 9 0
1 33 12 1
2 112 28 2
3 237 15 3
4 2 5 4
Here is one alternative to use Series.where() to update values conditionally:
data[['a', 'b']] = data[['a', 'b']].apply(lambda m: m.where(data.a <= 10, m*data.c))
Use update:
data.update(data.query('a > 10')[['a', 'b']].mul(data.query('a > 10').c, 0))
data
Well it seems NumPy could be an alternative here -
arr = data.values
mask = arr[:,0] > 10
arr[mask,:2] *= arr[mask,2,None]
We just extracted the values as an array, which is a view into the dataframe; that lets us work on the array, and the updates are automatically reflected in the dataframe. Here's a sample run to show the progress -
In [507]: data # Input dataframe
Out[507]:
a b c
0 1 9 0
1 33 12 1
2 56 14 2
3 79 5 3
4 2 5 4
Use the proposed code -
In [508]: arr = data.values
In [509]: mask = arr[:,0] > 10
In [510]: arr[mask,:2] *= arr[mask,2,None]
Verify results with dataframe -
In [511]: data
Out[511]:
a b c
0 1 9 0
1 33 12 1
2 112 28 2
3 237 15 3
4 2 5 4
Let's verify in another way that we were indeed working with a view there -
In [512]: np.may_share_memory(data,arr)
Out[512]: True
# %%
import pandas as pd
import numpy as np
data = pd.DataFrame({'a': [1, 33, 56, 79, 2],
                     'b': [9, 12, 14, 5, 5],
                     'c': np.arange(5)})
(data.loc[data.a > 10, ['a', 'b']]
 .T * data.loc[data.a > 10, 'c'])\
 .T.append(data.loc[data.a <= 10, ['a', 'b']])\
 .T.append(data.c).T.sort()
# %%
Out[17]:
a b c
0 1 9 0
1 33 12 1
2 112 28 2
3 237 15 3
4 2 5 4
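DataFrame.append and the no-argument DataFrame.sort used above have been removed from recent pandas, so this chain no longer runs as written. A rough sketch of the same split, multiply and recombine idea with pd.concat and sort_index (the deprecated calls swapped out, everything else assumed unchanged):

import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1, 33, 56, 79, 2],
                     'b': [9, 12, 14, 5, 5],
                     'c': np.arange(5)})

mask = data.a > 10
# Multiply the masked a/b block by c, then stitch the untouched rows back on.
scaled = data.loc[mask, ['a', 'b']].mul(data.loc[mask, 'c'], axis=0)
result = pd.concat([scaled, data.loc[~mask, ['a', 'b']]]).sort_index()
result['c'] = data['c']
print(result)
#      a   b  c
# 0    1   9  0
# 1   33  12  1
# 2  112  28  2
# 3  237  15  3
# 4    2   5  4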
