input: (CSV file)
name subject internal_1_marks internal_2_marks final_marks
abc python 45 50 47
pqr java 45 46 46
pqr python 40 33 37
xyz java 45 43 49
xyz node 40 30 35
xyz ruby 50 45 47
Expected output: (CSV file)
name subject internal_1_marks internal_2_marks final_marks
abc  python  45 50 47
pqr  java    45 46 46
     python  40 33 37
xyz  java    45 43 49
     node    40 30 35
     ruby    50 45 47
I've tried this:
df = pd.read_csv("student_info.csv")
df.groupby(['name', 'subject']).sum().to_csv("output.csv")
but it's giving duplicates in the first column, as shown below.
name subject internal_1_marks internal_2_marks final_marks
abc python 45 50 47
pqr java 45 46 46
pqr python 40 33 37
xyz java 45 43 49
xyz node 40 30 35
xyz ruby 50 45 47
I need to remove the duplicates in the first column, as shown in the expected output.
Thanks.
A similar answer applies here:
mask = df['name'].duplicated()
df.loc[mask.values,['name']] = ''
  name subject internal_1_marks internal_2_marks final_marks
0 abc  python  45 50 47
1 pqr  java    45 46 46
2      python  40 33 37
3 xyz  java    45 43 49
4      node    40 30 35
5      ruby    50 45 47
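Put together with the groupby from the question, a complete version might look like this (a minimal sketch, assuming the student_info.csv layout shown above):
import pandas as pd

df = pd.read_csv('student_info.csv')
df = df.groupby(['name', 'subject'], as_index=False).sum()

# blank out every repeated name, keeping only its first occurrence
mask = df['name'].duplicated()
df.loc[mask, 'name'] = ''
df.to_csv('output.csv', index=False)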
You can filter the dupes after the groupby:
(df.groupby(['name', 'subject']).sum()
   .reset_index()
   .assign(name=lambda x: x['name'].where(~x['name'].duplicated(), ''))
   .to_csv('filename.csv', index=False))
Also, when reading the file back you can pass index_col so the repeated values become the index:
df = pd.read_csv('test.csv', index_col=[0])
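For instance, after writing the grouped frame, reading it back with both columns as the index yields a MultiIndex, and pandas leaves repeated outer labels blank when printing (a small sketch, assuming the output.csv written above):
out = pd.read_csv('output.csv', index_col=[0, 1])
print(out)
#               internal_1_marks  internal_2_marks  final_marks
# name subject
# abc  python                 45                50           47
# pqr  java                   45                46           46
#      python                 40                33           37
# xyz  java                   45                43           49
#      node                   40                30           35
#      ruby                   50                45           47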
Related
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
11 40 30 20 100 110 5
21 60 70 80 55 57 8
32 12 43 57 87 98 9
41 99 23 45 65 78 12
This is the demo data frame,
Here I want to choose, for each row, the maximum among the three countries (INDIA, GERMANY, US), add that row's Threshold value to the maximum, and write the result back into the dataframe.
For example, for each row:
max(US, INDIA, GERMANY) = max(US, INDIA, GERMANY) + Threshold
After performing this, the dataframe is updated as below:
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
11 40 30 20 105 110 5
21 60 78 80 55 57 8
32 12 43 57 96 98 9
41 111 23 45 65 78 12
I tried to achieve this using a for loop, but it is taking too long to execute:
df_max = df_final[['US','INDIA','GERMANY']].idxmax(axis=1)
for ind in df_final.index:
    column = df_max[ind]
    df_final[column][ind] = df_final[column][ind] + df_final['Threshold'][ind]
Please help me with this. Looking forward to a good solution. Thanks in advance!
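For reference, the demo frame can be rebuilt like this (a sketch of the sample data above, so the snippets below run as-is):
import pandas as pd

df_final = pd.DataFrame({
    'Day': [11, 21, 32, 41],
    'US': [40, 60, 12, 99],
    'INDIA': [30, 70, 43, 23],
    'JAPAN': [20, 80, 57, 45],
    'GERMANY': [100, 55, 87, 65],
    'AUSTRALIA': [110, 57, 98, 78],
    'Threshold': [5, 8, 9, 12],
})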
The first solution compares the maximal value per row with all values of the filtered columns, then multiplies the mask by Threshold and adds it to the original columns:
cols = ['US','INDIA','GERMANY']
df_final[cols] += (df_final[cols].eq(df_final[cols].max(axis=1), axis=0)
.mul(df_final['Threshold'], axis=0))
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 30 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
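To see the mechanics, the intermediate boolean mask marks the per-row maximum among the three columns, and multiplying it by Threshold turns each True into that row's threshold (illustrative, using the df_final built above):
cols = ['US', 'INDIA', 'GERMANY']
mask = df_final[cols].eq(df_final[cols].max(axis=1), axis=0)
print(mask)
#       US  INDIA  GERMANY
# 0  False  False     True
# 1  False   True    False
# 2  False  False     True
# 3   True  False    False
print(mask.mul(df_final['Threshold'], axis=0))
#    US  INDIA  GERMANY
# 0   0      0        5
# 1   0      8        0
# 2   0      0        9
# 3  12      0        0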
Or use numpy: get the winning column names with idxmax, compare them against an array built from the list cols, then multiply and add to the original columns:
cols = ['US','INDIA','GERMANY']
df_final[cols] += ((np.array(cols) == df_final[cols].idxmax(axis=1).to_numpy()[:, None]) *
df_final['Threshold'].to_numpy()[:, None])
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 30 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
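The numpy variant builds the same mask by broadcasting the array of column names against each row's idxmax winner (again on the demo frame):
import numpy as np

winners = df_final[cols].idxmax(axis=1).to_numpy()[:, None]
print(np.array(cols) == winners)
# [[False False  True]
#  [False  True False]
#  [False False  True]
#  [ True False False]]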
The solutions differ if there are multiple maximum values in a row: the first solution adds the threshold to all maxima, the second only to the first maximum.
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 100 20 100 110 5   <- changed data: two maxima of 100
1 21 60 70 80 55 57 8
2 32 12 43 57 87 98 9
3 41 99 23 45 65 78 12
cols = ['US','INDIA','GERMANY']
df_final[cols] += (df_final[cols].eq(df_final[cols].max(axis=1), axis=0)
.mul(df_final['Threshold'], axis=0))
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 105 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
cols = ['US','INDIA','GERMANY']
df_final[cols] += ((np.array(cols) == df_final[cols].idxmax(axis=1).to_numpy()[:, None]) *
df_final['Threshold'].to_numpy()[:, None])
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 105 20 100 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
I've got many text (.txt) files that look like this:
#camiones disponibles
set K:= 1 2 3;
#capacidades de camiones
param Q:=
1 20000
2 15000
3 10000
;
#demanda por tipo de leche
param D:=
26800
11700
2500
;
#costos de transporte
param c[*,*]
: 1000 1 2 3 4 5 6 7
1000 0 35 78 76 98 55 52 37
1 35 0 60 59 91 81 40 13
2 78 60 0 3 37 87 26 48
3 76 59 3 0 36 83 24 47
4 98 91 37 36 0 84 51 78
5 55 81 87 83 84 0 66 74
6 52 40 26 24 51 66 0 28
7 37 13 48 47 78 74 28 0
From my understanding this is an OPL data file. Each one of these text files is an instance, and every one of them has the same variables. I need to read one text file at a time.
I'm trying to get each variable definition into a python variable, such as a numpy array or a pandas data frame.
This data file is not OPL format but AMPL. So what you could do is use AMPL to read the file and then, from AMPL, write that data in any format you need:
https://portal.ampl.com/docs/archive/first-website/BOOK/CHAPTERS/15-display.pdf#page=33
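If you would rather stay in pure Python, the scalar blocks are simple enough to parse by hand. A minimal sketch, assuming the layout shown above (one whitespace-separated row per line between "param NAME :=" and the terminating ";"); note the param c table has an extra header row and would need more handling:
def read_param_block(lines, name):
    # collect the rows of 'param NAME' up to the terminating ';'
    rows, inside = [], False
    for line in lines:
        line = line.strip()
        if line.startswith('param ' + name):
            inside = True
            continue
        if inside:
            if line == ';':
                break
            rows.append(line.split())
    return rows

with open('instance.txt') as fh:   # hypothetical filename
    lines = fh.readlines()

Q = {k: int(v) for k, v in read_param_block(lines, 'Q')}   # {'1': 20000, ...}
D = [int(row[0]) for row in read_param_block(lines, 'D')]  # [26800, 11700, 2500]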
Consider the df below:
              IA1  IA2  IA3
Name Subject
Abc  DS        45   43   34
     DMS       43   23   45
     ADA       32   46   36
Bcd  BA        45   35   37
     EAD       23   45   12
     DS        23   35   43
Cdf  EAD       34   33   23
     ADA       12   34   25
How can I add an empty row after each Name index?
Expected output:
              IA1  IA2  IA3
Name Subject
Abc  DS        45   43   34
     DMS       43   23   45
     ADA       32   46   36

Bcd  BA        45   35   37
     EAD       23   45   12
     DS        23   35   43

Cdf  EAD       34   33   23
     ADA       12   34   25
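For reference, the frame can be rebuilt like this (a sketch of the sample data above, so the answer snippets run as-is):
import pandas as pd

df = pd.DataFrame(
    {'IA1': [45, 43, 32, 45, 23, 23, 34, 12],
     'IA2': [43, 23, 46, 35, 45, 35, 33, 34],
     'IA3': [34, 45, 36, 37, 12, 43, 23, 25]},
    index=pd.MultiIndex.from_tuples(
        [('Abc', 'DS'), ('Abc', 'DMS'), ('Abc', 'ADA'),
         ('Bcd', 'BA'), ('Bcd', 'EAD'), ('Bcd', 'DS'),
         ('Cdf', 'EAD'), ('Cdf', 'ADA')],
        names=['Name', 'Subject']))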
Use a custom function that adds an empty row to each group via GroupBy.apply:
def f(x):
    x.loc[('', ''), :] = ''
    return x
Or:
def f(x):
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    return pd.concat([x, pd.DataFrame('', columns=df.columns,
                                      index=pd.MultiIndex.from_tuples([(x.name, '')]))])

df = df.groupby(level=0, group_keys=False).apply(f)
print (df)
              IA1  IA2  IA3
Name Subject
Abc  DS        45   43   34
     DMS       43   23   45
     ADA       32   46   36

Bcd  BA        45   35   37
     EAD       23   45   12
     DS        23   35   43

Cdf  EAD       34   33   23
     ADA       12   34   25
Another way: build the target index with pd.MultiIndex.from_product and Index.union, sort it, then use df.reindex with fill_value='':
idx = df.index.union(pd.MultiIndex.from_product((df.index.levels[0],[''])),sort=False)
out = df.reindex(sorted(idx,key=lambda x: x[0]),fill_value='')
print(out)
              IA1  IA2  IA3
Name Subject
Abc  DS        45   43   34
     DMS       43   23   45
     ADA       32   46   36

Bcd  BA        45   35   37
     EAD       23   45   12
     DS        23   35   43

Cdf  EAD       34   33   23
     ADA       12   34   25
We pass sort=False to Index.union so the original order is retained; sorting the result on the first tuple element then returns:
sorted(idx,key=lambda x:x[0])
[('Abc', 'DS'),
('Abc', 'DMS'),
('Abc', 'ADA'),
('Abc', ''),
('Bcd', 'BA'),
('Bcd', 'EAD'),
('Bcd', 'DS'),
('Bcd', ''),
('Cdf', 'EAD'),
('Cdf', 'ADA'),
('Cdf', '')]
# reset index
dfn = df.reset_index()
# find the border idx of 'Name', [2, 5, 7]
idx_list = dfn.drop_duplicates('Name', keep='last').index
# use the border idx, create an empty df, and append to the origin df, then sort the index
df_append = pd.DataFrame('', index = idx_list, columns = dfn.columns)
obj = pd.concat([dfn, df_append]).sort_index().set_index(['Name', 'Subject'])
print(obj)
              IA1  IA2  IA3
Name Subject
Abc  DS        45   43   34
     DMS       43   23   45
     ADA       32   46   36

Bcd  BA        45   35   37
     EAD       23   45   12
     DS        23   35   43

Cdf  EAD       34   33   23
     ADA       12   34   25
I have data like this:
A B C D E F
35 1 2 35 25 65
40 5 7 47 57 67
20 1 8 74 58 63
35 1 2 37 28 69
40 5 7 49 58 69
20 1 8 74 58 63
35 1 2 47 29 79
40 5 7 55 77 87
20 1 8 74 58 63
Here we can see that columns A, B, and C have replicas that are repeated in various rows. I want to reorder all the rows so that the replicas end up in consecutive rows, without deleting any of them. The output should look like this:
A B C D E F
35 1 2 35 25 65
35 1 2 37 28 69
35 1 2 47 29 79
40 5 7 47 57 67
40 5 7 49 58 69
40 5 7 55 77 87
20 1 8 74 58 63
20 1 8 74 58 63
20 1 8 74 58 63
When I use pandas.DataFrame.duplicated, it gives me the duplicated rows. How can I keep all the identical rows together using groupby?
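For reference, the sample frame can be rebuilt like this (a sketch of the data above, so the snippet below runs as-is):
import pandas as pd

df = pd.DataFrame(
    [[35, 1, 2, 35, 25, 65], [40, 5, 7, 47, 57, 67], [20, 1, 8, 74, 58, 63],
     [35, 1, 2, 37, 28, 69], [40, 5, 7, 49, 58, 69], [20, 1, 8, 74, 58, 63],
     [35, 1, 2, 47, 29, 79], [40, 5, 7, 55, 77, 87], [20, 1, 8, 74, 58, 63]],
    columns=list('ABCDEF'))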
Here is code that achieves the result you asked for (it requires neither explicit shuffling nor sorting, merely grouping your existing df by columns A, B, C):
df_shuf = pd.concat(g for _, g in df.groupby(['A', 'B', 'C'], sort=False))
print(df_shuf.to_string(index=False))
A B C D E F
35 1 2 35 25 65
35 1 2 37 28 69
35 1 2 47 29 79
40 5 7 47 57 67
40 5 7 49 58 69
40 5 7 55 77 87
20 1 8 74 58 63
20 1 8 74 58 63
20 1 8 74 58 63
Notes:
I couldn't figure out how to do df.reindex in place on the grouped object, but we can get by without it.
You don't need pandas.DataFrame.duplicated, since df.groupby(['A','B','C']) already puts all duplicates in the same group.
df.groupby(..., sort=False) is faster; use it whenever you don't need the groups sorted by key.
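An equivalent formulation avoids concat entirely: number each (A, B, C) combination in order of first appearance with ngroup, then reorder the rows with a stable argsort (a sketch; it produces the same result on the sample data):
import numpy as np

# ngroup() numbers groups 0, 1, 2, ... in order of first appearance when sort=False
ids = df.groupby(['A', 'B', 'C'], sort=False).ngroup()
# a stable sort keeps the original row order within each group
df_shuf = df.iloc[np.argsort(ids.to_numpy(), kind='stable')]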
Below I am using pandas to read my csv file in the following format:
dataframe = pandas.read_csv("test.csv", header=None, usecols=range(2,62), skiprows=1)
dataset = dataframe.values
How can I delete the first value in the very last column in the dataframe and then delete the last row in the dataframe?
Any ideas?
You can shift the last column up to get rid of the first value, then drop the last line.
df.assign(E=df.E.shift(-1)).drop(df.index[-1])
MVCE:
import numpy as np

np.random.seed(123)  # pd.np is removed in modern pandas; seed must be called, not assigned
df = pd.DataFrame(np.random.randint(0, 100, (10, 5)), columns=list('ABCDE'))
Output:
A B C D E
0 91 83 40 17 94
1 61 5 43 87 48
2 3 69 73 15 85
3 99 53 18 95 45
4 67 30 69 91 28
5 25 89 14 39 64
6 54 99 49 44 73
7 70 41 96 51 68
8 36 3 15 94 61
9 51 4 31 39 0
df.assign(E=df.E.shift(-1)).drop(df.index[-1]).astype(int)
Output:
A B C D E
0 91 83 40 17 48
1 61 5 43 87 85
2 3 69 73 15 45
3 99 53 18 95 28
4 67 30 69 91 64
5 25 89 14 39 73
6 54 99 49 44 68
7 70 41 96 51 61
8 36 3 15 94 0
or in two steps:
df[df.columns[-1]] = df[df.columns[-1]].shift(-1)
df = df[:-1]
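Applied to the frame read at the top of the question (integer column labels, since header=None), the same two steps might look like this (a sketch; the shift leaves a NaN in the last row, which the final slice removes):
last = dataframe.columns[-1]
dataframe[last] = dataframe[last].shift(-1)  # discard the first value of the last column
dataframe = dataframe.iloc[:-1]              # drop the now-incomplete last row
dataset = dataframe.values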