Python - Pandas, split long column to multiple columns

Given the following DataFrame:
>>> pd.DataFrame(data=[['a',1],['a',2],['b',3],['b',4],['c',5],['c',6],['d',7],['d',8],['d',9],['e',10]],
...              columns=['key','value'])
  key  value
0   a      1
1   a      2
2   b      3
3   b      4
4   c      5
5   c      6
6   d      7
7   d      8
8   d      9
9   e     10
I'm looking for a method that will change the structure based on the key value, like so:
   a  b  c  d   e
0  1  3  5  7  10
1  2  4  6  8  10  <- 10 is duplicated
2  2  4  6  9  10  <- 10 is duplicated
The result has as many rows as the longest group (d in the above example), and the missing values are duplicates of the last available value.

Create a MultiIndex via set_index with a counter column from cumcount, reshape with unstack, replace missing values with the last non-missing ones via ffill, and finally convert all data to integers if necessary:
df = df.set_index([df.groupby('key').cumcount(),'key'])['value'].unstack().ffill().astype(int)
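To see how the chain works, here is a hedged step-by-step breakdown using the same df:
counter = df.groupby('key').cumcount()          # position within each key: 0,1,0,1,0,1,0,1,2,0
tmp = df.set_index([counter, 'key'])['value']   # Series with a (counter, key) MultiIndex
wide = tmp.unstack()                            # keys become columns, the counter becomes the index
wide = wide.ffill()                             # forward fill the NaNs of the shorter groups
df = wide.astype(int)                           # safe once no NaNs remain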
Another solution with custom lambda function:
df = (df.groupby('key')['value']
        .apply(lambda x: pd.Series(x.values))
        .unstack(0)
        .ffill()
        .astype(int))
print (df)
key  a  b  c  d   e
0    1  3  5  7  10
1    2  4  6  8  10
2    2  4  6  9  10

Using pivot with groupby + cumcount (keyword arguments are required by recent pandas versions):
df.assign(key2=df.groupby('key').cumcount()).pivot(index='key2', columns='key', values='value').ffill().astype(int)
Out[214]:
key   a  b  c  d   e
key2
0     1  3  5  7  10
1     2  4  6  8  10
2     2  4  6  9  10
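Here the key2/key pairs are unique by construction, but pivot raises a ValueError if an index/column pair appears more than once; if duplicates were possible in your data, a hedged variant is pivot_table with an explicit aggregation:
df.assign(key2=df.groupby('key').cumcount()).pivot_table(index='key2', columns='key', values='value', aggfunc='first').ffill().astype(int)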

Related

Pandas how to output distinct values in column based on duplicate in another column

Here is an example:
import pandas as pd

df = pd.DataFrame({
    'product': ['1','1','1','2','2','2','3','3','3','4','4','4','5','5','5'],
    'value':   ['a','a','a','a','a','b','a','b','a','b','b','b','a','a','a']
})
   product value
0        1     a
1        1     a
2        1     a
3        2     a
4        2     a
5        2     b
6        3     a
7        3     b
8        3     a
9        4     b
10       4     b
11       4     b
12       5     a
13       5     a
14       5     a
I need to output:
1 a
4 b
5 a
Because for each of these 'product' values, all of the 'value' entries are the same.
I'm sorry for bad English
I think you need this
m = df.groupby('product')['value'].transform('nunique')
df.loc[m == 1].drop_duplicates().reset_index(drop=True)
Output
  product value
0       1     a
1       4     b
2       5     a
Details
df.groupby('product')['value'].transform('nunique') returns a Series like the one below:
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
8 2
9 1
10 1
11 1
12 1
13 1
14 1
where the numbers are the number of unique values in each group. Then we use df.loc to keep only the rows where this value is 1, i.e. the groups with a single unique value.
Then we drop duplicates, since you only need each group and its unique value.
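The same filtering can also be written with GroupBy.filter; a hedged equivalent sketch:
out = (df.groupby('product')
         .filter(lambda g: g['value'].nunique() == 1)  # keep only groups with one unique value
         .drop_duplicates()
         .reset_index(drop=True))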
If I understand your question correctly, this simple code is what you need:
distinct_prod_df = df.drop_duplicates(['product'])
and gives:
   product value
0        1     a
3        2     a
6        3     a
9        4     b
12       5     a
You can try this (the per-group mask returned by apply has one entry per product, so it has to be mapped back onto the rows before indexing):
mask = df.groupby('product')['value'].apply(lambda x: x.nunique() == 1)
df = df[df['product'].map(mask)].drop_duplicates()

How to compare columns of two different data frames and keep the common values

I have two data frames with the same key column but different values; some values appear in both frames and some do not. I want to compare both columns and keep the rows with the common values.
df1:
A  B  C
1  1  1
2  4  6
3  7  9
4  9  0
6  0  1
df2:
A  D  E
1  5  7
5  6  9
2  3  5
7  6  8
3  7  0
This is what I am expecting after comparison
df2:
A  D  E
1  5  7
2  3  5
3  7  0
You can use pd.Index.intersection() to find the matching columns, do an inner merge, and finally reindex() to keep df2.columns:
match = df2.columns.intersection(df1.columns).tolist()  # finds matching cols in both dfs
df2.merge(df1, on=match).reindex(df2.columns, axis=1)   # merge and reindex to df2.columns
   A  D  E
0  1  5  7
1  2  3  5
2  3  7  0
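When only a single key column is shared, a hedged alternative is a plain membership filter with isin, which avoids the merge entirely:
df2[df2['A'].isin(df1['A'])]  # keep df2 rows whose 'A' also occurs in df1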

fastest way to insert multiple rows into a dataframe given a list of indexes (python)

I have a dataframe and I would like to insert rows at specific indexes at the beginning of each group within the dataframe. As an example, let's say I have the following dataframe:
import pandas as pd
df = pd.DataFrame(data=[['A',1,1],['A',2,3],['A',5,4],['B',3,4],['B',2,6],['B',8,4],['C',9,3],['C',3,7],['C',1,9],['D',5,5],['D',8,3],['D',4,7]], columns=['Group','val1','val2'])
I would like to copy the first row of each unique value in the column Group and insert that row at the beginning of each group, growing the dataframe. I can currently achieve this with a for loop, but it is pretty slow because my dataframe is large, so I am looking for a vectorized solution.
I have a list of indexes where I would like to insert the rows.
idxs = [0, 3, 6, 9]
In each iteration of the loop I currently slice the dataframe at one of the idxs into two dataframes, insert the row, and concat them; a sketch of this approach is shown below. My dataframe is very large, so this process has been very slow.
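For reference, a minimal sketch of the slow loop described above (my reconstruction, assuming idxs as defined):
out = df
for offset, idx in enumerate(idxs):
    pos = idx + offset                    # earlier inserts shift later positions down
    row = out.iloc[[pos]]                 # first row of the group, as a one-row frame
    out = pd.concat([out.iloc[:pos], row, out.iloc[pos:]]).reset_index(drop=True)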
The desired result would look like this:
   Group  val1  val2
0      A     1     1
1      A     1     1
2      A     2     3
3      A     5     4
4      B     3     4
5      B     3     4
6      B     2     6
7      B     8     4
8      C     9     3
9      C     9     3
10     C     3     7
11     C     1     9
12     D     5     5
13     D     5     5
14     D     8     3
15     D     4     7
You can do this by grouping on Group, concatenating the first row of each group to the group itself, and then concatenating all of those pieces back together.
Code:
import pandas as pd
df = pd.DataFrame(data=[['A',1,1],['A',2,3],['A',5,4],['B',3,4],['B',2,6],['B',8,4],['C',9,3],['C',3,7],['C',1,9],['D',5,5],['D',8,3],['D',4,7]], columns=['Group','val1','val2'])
df_new = pd.concat([
    pd.concat([grp.iloc[[0], :], grp])
    for key, grp in df.groupby('Group')
])
print(df_new)
Output:
   Group  val1  val2
0      A     1     1
0      A     1     1
1      A     2     3
2      A     5     4
3      B     3     4
3      B     3     4
4      B     2     6
5      B     8     4
6      C     9     3
6      C     9     3
7      C     3     7
8      C     1     9
9      D     5     5
9      D     5     5
10     D     8     3
11     D     4     7
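Note that this keeps the duplicated original index labels; to get the clean 0..15 index shown in the question, reset it afterwards:
df_new = df_new.reset_index(drop=True)  # renumber rows 0..15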

Subtract values from maximum value within groups

Trying to take a df and create a new column that's based on the difference between each Value in a group and that group's max:
Group  Value
A      4
A      6
A      10
B      5
B      8
B      11
End up with a new column "from_max"
from_max
6
4
0
6
3
0
I tried this but got a ValueError:
df['from_max'] = df.groupby(['Group']).apply(lambda x: x['Value'].max() - x['Value'])
Thanks in Advance
Option 1
vectorised groupby + transform
df['from_max'] = df.groupby('Group').Value.transform('max') - df.Value
df
  Group  Value  from_max
0     A      4         6
1     A      6         4
2     A     10         0
3     B      5         6
4     B      8         3
5     B     11         0
Option 2
index aligned subtraction
df['from_max'] = (df.groupby('Group').Value.max() - df.set_index('Group').Value).values
df
  Group  Value  from_max
0     A      4         6
1     A      6         4
2     A     10         0
3     B      5         6
4     B      8         3
5     B     11         0
I think you need GroupBy.transform, which returns a Series the same size as the original DataFrame, so it aligns on assignment (unlike the apply in the question, which is what caused the ValueError):
df['from_max'] = df.groupby(['Group'])['Value'].transform(lambda x: x.max() - x)
Or:
df['from_max'] = df.groupby(['Group'])['Value'].transform(max) - df['Value']
An alternative is Series.map with the aggregated max:
df['from_max'] = df['Group'].map(df.groupby(['Group'])['Value'].max()) - df['Value']
print (df)
  Group  Value  from_max
0     A      4         6
1     A      6         4
2     A     10         0
3     B      5         6
4     B      8         3
5     B     11         0
Using reindex
df['From_Max'] = df.groupby('Group').Value.max().reindex(df.Group).values - df.Value.values
df
Out[579]:
  Group  Value  From_Max
0     A      4         6
1     A      6         4
2     A     10         0
3     B      5         6
4     B      8         3
5     B     11         0
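For completeness, the ValueError in the original attempt happens because groupby.apply returns the result indexed by (Group, original index); dropping the group level lets the assignment align. A hedged sketch, as the exact apply behavior varies a little across pandas versions:
s = df.groupby('Group').apply(lambda x: x['Value'].max() - x['Value'])
df['from_max'] = s.droplevel(0)  # drop the 'Group' level so the index matches df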

pandas delete a cell and shift up the column

I am using pandas with python.
I have a column in which the first value is zero.
There are other zeros in the column as well, but I don't want to delete those.
I want to delete this cell and move the column up by one position.
If it is easier, I can make the first zero an empty cell and then delete it, but I can't find anything that just deletes a specific cell and moves the rest of the column up.
So far I have searched Stack Overflow, Quora, GitHub, etc., but I can't find what I am looking for.
I believe you need to shift first and then replace the last NaN value:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
   A  B  C  D  E  F
0  a  4  7  1  5  a
1  b  5  8  3  3  a
2  c  4  9  5  6  a
3  d  5  4  7  9  b
4  e  5  2  1  2  b
5  f  4  3  0  4  b
If the column contains no NaNs, just use fillna to replace the NaN created by the shift:
df['A'] = df['A'].shift(-1).fillna('AAA')
print (df)
     A  B  C  D  E  F
0    b  4  7  1  5  a
1    c  5  8  3  3  a
2    d  4  9  5  6  a
3    e  5  4  7  9  b
4    f  5  2  1  2  b
5  AAA  4  3  0  4  b
If the column may already contain NaNs, shift and then set the last value with iloc; the get_loc function returns the position of column A:
df['A'] = df['A'].shift(-1)
df.iloc[-1, df.columns.get_loc('A')] = 'AAA'
print (df)
     A  B  C  D  E  F
0    b  4  7  1  5  a
1    c  5  8  3  3  a
2    d  4  9  5  6  a
3    e  5  4  7  9  b
4    f  5  2  1  2  b
5  AAA  4  3  0  4  b
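If you literally want to drop only the first zero from a numeric column and shift the rest up, a hedged sketch along the same lines (assuming a default RangeIndex and a hypothetical column named 'col'; the freed slot at the bottom becomes NaN):
pos = df['col'].eq(0).idxmax()        # index label of the first zero
s = df['col'].drop(pos)               # remove that single cell
df['col'] = s.reset_index(drop=True)  # shift the remainder up; the last row becomes NaN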
