Pandas pivoting and adding a column from CSV with consecutive rows - python

I have consecutive duplicate rows in two columns.
I want to drop the second duplicate row based on [col1, col2] and move its value from another column into a new column.
Example:
Input
col1 col2 col3
X A 1
X A 2
Y A 3
Y A 4
X B 5
X B 6
Z C 7
Z C 8
Output
col1 col2 col3 col4
X A 1 2
Y A 3 4
X B 5 6
Z C 7 8
I found out about pivoting, but I am struggling to understand how to add another column and avoid the extra index; I would like to preserve everything as written in the example.

This is similar to Question 10 here:
(df.assign(col=df.groupby(['col1','col2']).cumcount())
   .pivot_table(index=['col1','col2'], columns='col', values='col3')
   .reset_index()
)
Output:
col col1 col2 0 1
0 X A 1 2
1 X B 5 6
2 Y A 3 4
3 Z C 7 8
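If the exact column names (col3, col4) and the original row order matter, one possible follow-up (a sketch, assuming the pivoted columns come out named 0 and 1) is to rename them and merge back onto the first occurrence of each pair:
import pandas as pd

df = pd.DataFrame({'col1': list('XXYYXXZZ'),
                   'col2': list('AAAABBCC'),
                   'col3': range(1, 9)})

out = (df.assign(col=df.groupby(['col1','col2']).cumcount())
         .pivot_table(index=['col1','col2'], columns='col', values='col3')
         .reset_index())

# rename the pivoted columns (values may come back as floats, since
# pivot_table aggregates with mean by default) and drop the columns-axis name
out = out.rename(columns={0: 'col3', 1: 'col4'}).rename_axis(columns=None)

# restore the original ordering of the (col1, col2) pairs
order = df.drop_duplicates(['col1','col2'])[['col1','col2']]
out = order.merge(out, on=['col1','col2'])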

Related

How to remove duplicate rows with a condition in pandas

i.e.
I want to drop duplicate pairs, using col1 and col2 as the subset, only if the values in col3 are opposites (one negative and one positive). This is similar to the drop_duplicates function, but I want to impose a condition and only remove the first pair (i.e. if there are 3 duplicates, just remove 2 and leave 1).
my dataset (df):
col1 col2 col3
0 1 1 1
1 2 2 2
2 1 1 1
3 3 5 7
4 1 2 -1
5 1 2 1
6 1 2 1
I want:
col1 col2 col3
0 1 1 1
1 2 2 2
2 1 1 1
3 3 5 7
6 1 2 1
Rows 4 and 5 are duplicated in col1 and col2, but the values in col3 are opposites, therefore we remove both. Rows 0 and 2 have duplicate values in col1 and col2, but col3 is the same, so we don't remove those rows.
I've tried using drop_duplicates but realised it wouldn't work, as it only removes all duplicates and doesn't consider anything else.
We can do it with transform:
out = df[df.groupby(['col1','col2']).col3.transform('sum').ne(0) & df.col3.ne(0)]
Out[252]:
col1 col2 col3
0 1 1 1
1 2 2 2
2 1 1 1
3 3 5 7
Recreating the dataset:
import pandas as pd
data = [
    [1, 1, 1],
    [2, 2, 2],
    [1, 1, 1],
    [3, 5, 7],
    [1, 2, -1],
    [1, 2, 1],
    [1, 2, 1],
]
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])
If your data is not massive, you can use iterrows on a subset of the data.
The subset contains all duplicate values after all values have been turned into absolute values.
Next, we check if col3 is negative and if the opposite of col3 is in the duplicate subset.
If so, we drop the row from df.
df_dupes = df[df.abs().duplicated(keep=False)]
df_dupes_list = df_dupes.to_numpy().tolist()
for i, row in df_dupes.iterrows():
    if row.col3 < 0 and [row.col1, row.col2, -row.col3] in df_dupes_list:
        df.drop(labels=i, axis=0, inplace=True)
This code should remove row 4.
In your desired output, you left row 5 for some reason.
If you can explain why you left row 5 but kept row 0, then I can adjust my code to more accurately match your desired output.
I used @Petar Luketina's code here with an adjustment and it worked. However, I would like to use it for a massive dataset (~1 million rows and 43 columns), and this code takes forever:
import numpy as np

df_dupes = df[df['col3'].abs().duplicated(keep=False)]
df_dupes_list = df_dupes.to_numpy().tolist()
for i, row in df_dupes.iterrows():
    if row.col3 < 0 and [row.col1, row.col2, -row.col3] in df_dupes_list:
        print(row.col3)
        try:
            c = np.where((df['col1'] == row.col1) & (df['col2'] == row.col2) &
                         (df['col3'] == -row.col3))[0][0]
            df.drop(labels=[i, df.index.values[c]], axis=0, inplace=True)
        except IndexError:
            pass
I know this is an old question, but for those people interested, here is an alternative that avoids iterating over the rows:
First use a flag to identify the pairs of rows to be removed (a row plus the next row, when col1 and col2 are the same and the col3 values are the negative of each other):
df.loc[(df.col1 == df.col1.shift(1)) & (df.col2 == df.col2.shift(1)) & (df.col3 == -df.col3.shift(1)), 'removeFlag'] = True
df.loc[df.removeFlag.shift(-1) == True, 'removeFlag'] = True
col1 col2 col3 removeFlag
0 1 1 1 NaN
1 2 2 2 NaN
2 1 1 1 NaN
3 3 5 7 NaN
4 1 2 -1 True
5 1 2 1 True
6 1 2 1 NaN
Then use this flag to delete the offending rows:
df = df[~(df.removeFlag == True)]
df.drop(columns=['removeFlag'], inplace=True)
col1 col2 col3
0 1 1 1
1 2 2 2
2 1 1 1
3 3 5 7
6 1 2 1
This approach probably needs a little more refinement if row 6 had been the same as row 4 (i.e. the first half of a repeated identical pair), but you get the idea.
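For reference, a more compact variant of the same shift-based idea (just a sketch under the same assumptions, not part of the original answer) builds the mask without the helper column:
# flag the second row of each adjacent opposing pair ...
second = (df.col1.eq(df.col1.shift()) &
          df.col2.eq(df.col2.shift()) &
          df.col3.eq(-df.col3.shift()))
# ... then include the row immediately before it and drop both
out = df[~(second | second.shift(-1, fill_value=False))]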

Pandas groupby concat ungrouped column into comma separated string

I have the following example df:
col1 col2 col3 doc_no
0 a x f 0
1 a x f 1
2 b x g 2
3 b y g 3
4 c x t 3
5 c y t 4
6 a x f 5
7 d x t 5
8 d x t 6
I want to group by the first 3 columns (col1, col2, col3), concatenate the fourth column (doc_no) into a comma-separated string based on the groupings of the first 3 columns, and also generate a sorted count column for the 3-column grouping (count). Example desired output below (column order doesn't matter):
col1 col2 col3 count doc_no
0 a x f 3 0, 1, 5
1 d x t 2 5, 6
2 b x g 1 2
3 b y g 1 3
4 c x t 1 3
5 c y t 1 4
How would I go about doing this? I used the below line to get just the grouping and the count:
grouped_df = df.groupby(['col1','col2','col3']).size().reset_index(name='count')\
.sort_values(['count'], ascending=False).reset_index()
But I'm not sure how to also get the concatenated doc_no column in the same code line.
Try groupby and agg like so:
(df.groupby(['col1', 'col2', 'col3'])['doc_no']
   .agg(['count', ('doc_no', lambda x: ','.join(map(str, x)))])
   .sort_values('count', ascending=False)
   .reset_index())
col1 col2 col3 count doc_no
0 a x f 3 0,1,5
1 d x t 2 5,6
2 b x g 1 2
3 b y g 1 3
4 c x t 1 3
5 c y t 1 4
agg is simple to use because you can specify a list of reducers to run on a single column.
Let us do
df.doc_no=df.doc_no.astype(str)
s=df.groupby(['col1','col2','col3']).doc_no.agg(['count',','.join]).reset_index()
s
col1 col2 col3 count join
0 a x f 3 0,1,5
1 b x g 1 2
2 b y g 1 3
3 c x t 1 3
4 c y t 1 4
5 d x t 2 5,6
Another way
df2=df.groupby(['col1','col2','col3']).doc_no.agg(doc_no=('doc_no',list)).reset_index()
df2['doc_no']=df2['doc_no'].astype(str).str[1:-1]
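With named aggregation (pandas 0.25+), the count and the joined string can also be produced in one chain with the output columns named directly (a sketch, assuming the example df above with doc_no still numeric):
out = (df.groupby(['col1','col2','col3'], as_index=False)
         .agg(count=('doc_no', 'size'),
              doc_no=('doc_no', lambda x: ', '.join(map(str, x))))
         .sort_values('count', ascending=False)
         .reset_index(drop=True))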

delete pandas dataframe row if every value is equal

If I have a pandas dataframe which has a row containing float values and all the values are equal in the row, how do I delete that row from the dataframe?
Use DataFrame.nunique to count the number of unique values per row, with Series.ne to filter out the uniform rows by boolean indexing:
df1 = df[df.nunique(axis=1).ne(1)]
Or test whether values are not equal to the first column, then check for at least one True per row with DataFrame.any:
df1 = df[df.ne(df.iloc[:, 0], axis=0).any(axis=1)]
EDIT: If you want to remove all rows and all columns with the same values, the solution should be changed to also test the columns, using loc and axis=0:
df = pd.DataFrame({
    'B': [4,4,4,4,4,4],
    'C': [4,4,9,4,2,3],
    'D': [4,4,5,7,1,0],
})
print (df)
B C D
0 4 4 4
1 4 4 4
2 4 9 5
3 4 4 7
4 4 2 1
5 4 3 0
df2 = df.loc[df.nunique(axis=1).ne(1), df.nunique(axis=0).ne(1)]
And for second solution:
df2 = df.loc[df.ne(df.iloc[:, 0], axis=0).any(axis=1), df.ne(df.iloc[0], axis=1).any(axis=0)]
print (df2)
C D
2 9 5
3 4 7
4 2 1
5 3 0
You can use DataFrame.diff over axis=1 (per row):
# Example dataframe:
df = pd.DataFrame({'Col1': [1,2,3],
                   'Col2': [2,2,5],
                   'Col3': [4,2,9]})
Col1 Col2 Col3
0 1 2 4
1 2 2 2 # <-- row with all same values
2 3 5 9
df[df.diff(axis=1).fillna(0).ne(0).any(axis=1)]
Col1 Col2 Col3
0 1 2 4
2 3 5 9
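Note that diff needs values it can subtract, so it can fail on string columns; the nunique approach also handles mixed dtypes (a small sketch with an assumed mixed-type frame):
df = pd.DataFrame({'Col1': ['a', 'b', 3],
                   'Col2': ['a', 'b', 5],
                   'Col3': ['a', 'c', 9]})
# keep only rows where not every value is identical
df[df.nunique(axis=1).ne(1)]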

Pandas: Get mean of different rows when columns are equal

I'm trying to find the mean of values in different rows, grouped by similarities in other columns. Example:
In [14]: pd.DataFrame({'col1':[1,2,1,2], 'col2':['A','C','A','B'], 'col3':[1, 5, 6, 9]})
Out[14]:
col1 col2 col3
0 1 A 1
1 2 C 5
2 1 A 6
3 2 B 9
What I would like is to add a column with the means of col3, for all rows where the combination of col1 and col2 match. Desired output:
Out[14]:
col1 col2 col3 mean
0 1 A 1 3.5
1 2 C 5 5
2 1 A 6 3.5
3 2 B 9 9
I have tried several things with groupby in combination with apply but couldn't get proper results.
It's a transform, my man:
df['mean'] = df.groupby(['col1','col2']).col3.transform('mean')
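Applied to the example frame from the question, this broadcasts each group mean back onto its rows (a quick check, assuming the df shown above):
df = pd.DataFrame({'col1': [1, 2, 1, 2],
                   'col2': ['A', 'C', 'A', 'B'],
                   'col3': [1, 5, 6, 9]})
df['mean'] = df.groupby(['col1','col2']).col3.transform('mean')
#    col1 col2  col3  mean
# 0     1    A     1   3.5
# 1     2    C     5   5.0
# 2     1    A     6   3.5
# 3     2    B     9   9.0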

Add values to one column of a pandas dataframe based on the values in another

Say I have two matrices, an original and a reference:
import pandas as pa
print "Original Data Frame"
# Create a dataframe
oldcols = {'col1':['a','a','b','b'], 'col2':['c','d','c','d'], 'col3':[1,2,3,4]}
a = pa.DataFrame(oldcols)
print "Original Table:"
print a
print "Reference Table:"
b = pa.DataFrame({'col1':['x','x'], 'col2':['c','d'], 'col3':[10,20]})
print b
Where the tables look like this:
Original Data Frame
Original Table:
col1 col2 col3
0 a c 1
1 a d 2
2 b c 3
3 b d 4
Reference Table:
col1 col2 col3
0 x c 10
1 x d 20
Now I want to add to the third column (col3) of the original table (a) the col3 value from the reference table (b), taken from the row where the col2 values of the two tables match. So the first row of table a should have the value 10 added to its third column, because the row of table b where col2 is 'c' has a value of 10 in col3. Make sense? Here's some code that does that:
col3 = []
for ix, row in a.iterrows():
    col3 += [row[2] + b[b['col2'] == row[1]]['col3']]
a['col3'] = col3
print "Output Table:"
print a
Yielding the following output:
Output Table:
col1 col2 col3
0 a c [11]
1 a d [22]
2 b c [13]
3 b d [24]
My question is, is there a more elegant way to do this? Also, the results in 'col3' should not be lists. Solutions using numpy are also welcome.
I did not quite understand your description of what you are trying to do, but the output you have shown can be generated by first merging the two data frames and then applying some simple operations:
>>> df = a.merge(b.filter(['col2', 'col3']), how='left',
left_on='col2', right_on='col2', suffixes=('', '_'))
>>> df
col1 col2 col3 col3_
0 a c 1 10
1 b c 3 10
2 a d 2 20
3 b d 4 20
[4 rows x 4 columns]
>>> df.col3_.fillna(0, inplace=True) # in case there are no matches
>>> df.col3 += df.col3_
>>> df
col1 col2 col3 col3_
0 a c 11 10
1 b c 13 10
2 a d 22 20
3 b d 24 20
[4 rows x 4 columns]
>>> df.drop('col3_', axis=1, inplace=True)
>>> df
col1 col2 col3
0 a c 11
1 b c 13
2 a d 22
3 b d 24
[4 rows x 3 columns]
If values in col2 in b are not unique, then probably you also need something like:
>>> b.groupby('col2', as_index=False)['col3'].aggregate(sum)
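For what it's worth, a shorter alternative (a sketch, assuming col2 in b is unique or has already been aggregated as above) maps the reference values directly onto a:
# look up each row's col2 in the reference table and add the matched col3
a['col3'] = a['col3'] + a['col2'].map(b.set_index('col2')['col3']).fillna(0)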
