Append value/index for each duplicated row within a Pandas Dataframe - python

I have a sorted Dataframe with some duplicated ids and I wanted to make the ids unique by appending the index in which they appear in their duplicated list.
Original df:
id val
1 100
1 526
2 434
3 234
4 657
4 44
4 121
Notice how there are duplicate ids.
This is what I'm hoping for:
id val
1 100
1-1 526
2 434
3 234
4 657
4-1 44
4-2 121
Would also be ok with:
id val
1-0 100
1-1 526
2-0 434
3-0 234
4-0 657
4-1 44
4-2 121

Here's a way to do it:
df2 = df.copy()
df2['id'] = df['id'].astype(str) + '-' + df.groupby('id').cumcount().astype(str)
id val
0 1-0 100
1 1-1 526
2 2-0 434
3 3-0 234
4 4-0 657
5 4-1 44
6 4-2 121

df['id'] = df.groupby('id')['id'].transform(lambda x: ['{}-{}'.format(v, i) if i else v for i, v in enumerate(x)])
print(df)
Prints:
id val
0 1 100
1 1-1 526
2 2 434
3 3 234
4 4 657
5 4-1 44
6 4-2 121
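The two answers can be combined into a short, self-contained sketch. It rebuilds the question's frame inline and assumes the first occurrence of each id should stay unsuffixed; the names `counter` and `ids` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 3, 4, 4, 4],
                   "val": [100, 526, 434, 234, 657, 44, 121]})

# Position of each row within its group of equal ids: 0, 1, 0, 0, 0, 1, 2
counter = df.groupby("id").cumcount()

# Keep the first occurrence as-is and append "-<n>" to the later ones
ids = df["id"].astype(str)
df["id"] = ids.where(counter.eq(0), ids + "-" + counter.astype(str))
print(df)
```

This avoids the Python-level lambda, which may matter on large frames.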


Split columns conditionally on string

I have a data frame with the following shape:
0 1
0 OTT:81 DVBC:398
1 OTT:81 DVBC:474
2 OTT:81 DVBC:474
3 OTT:81 DVBC:454
4 OTT:81 DVBC:443
5 OTT:1 DVBC:254
6 DVBC:151 None
7 OTT:1 DVBC:243
8 OTT:1 DVBC:254
9 DVBC:227 None
I want column 1 to take the value of column 0 when column 0 contains "DVBC".
Then split the values on ":" and fill the empty ones with 0.
The end data frame should look like this
OTT DVBC
0 81 398
1 81 474
2 81 474
3 81 454
4 81 443
5 1 254
6 0 151
7 1 243
8 1 254
9 0 227
I tried to do this, starting with:
if df[0].str.contains("DVBC") is True:
    df[1] = df[0]
But after this the data frame looks the same, and I am not sure why.
My idea is then to pass the values to their respective columns, split on ":" and rename the columns.
How can I implement this?
A universal solution is to split the values on : and pivot: first create a Series with DataFrame.stack, split it with Series.str.split, and finally reshape with DataFrame.pivot (its arguments must be passed by keyword in recent pandas versions):
df = df.stack().str.split(':', expand=True).reset_index()
df = df.pivot(index='level_0', columns=0, values=1).fillna(0).rename_axis(index=None, columns=None)
print (df)
DVBC OTT
0 398 81
1 474 81
2 474 81
3 454 81
4 443 81
5 254 1
6 151 0
7 243 1
8 254 1
9 227 0
Here is one way that should work with any number of columns:
(df
.apply(lambda c: c.str.extract(r':(\d+)', expand=False))
.ffill(axis=1)
.mask(df.replace('None', pd.NA).isnull().shift(-1, axis=1, fill_value=False), 0)
)
output:
OTT DVBC
0 81 398
1 81 474
2 81 474
3 81 454
4 81 443
5 1 254
6 0 151
7 1 243
8 1 254
9 0 227
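As for why the attempt in the question changes nothing: `str.contains` returns a boolean Series, and `series is True` is an identity test that is always False, so the `if` body never runs. A minimal sketch of the row-mask alternative, on a three-row excerpt of the data; the `"OTT:0"` placeholder is an assumption made here for illustration, not part of either answer.

```python
import pandas as pd

df = pd.DataFrame({0: ["OTT:81", "DVBC:151", "OTT:1"],
                   1: ["DVBC:398", None, "DVBC:243"]})

# str.contains returns one boolean per row, not a single True/False
has_dvbc = df[0].str.contains("DVBC")

# Identity comparison of a Series with True is always False
print(has_dvbc is True)

# Use the Series as a row mask instead
df.loc[has_dvbc, 1] = df.loc[has_dvbc, 0]
df.loc[has_dvbc, 0] = "OTT:0"  # hypothetical placeholder for the missing OTT value
print(df)
```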

How to calculate min and max of a column for particular rows?

I have a csv file as following:
0 2 1 1 464 385 171 0:44:4
1 1 2 26 254 444 525 0:56:2
2 3 1 90 525 785 522 0:52:8
3 8 2 3 525 233 555 0:52:8
4 7 1 10 525 433 522 1:52:8
5 9 2 55 525 555 522 1:52:8
6 6 3 3 392 111 232 1:43:4
7 1 4 23 322 191 112 1:43:4
8 1 3 30 322 191 112 1:43:4
9 1 5 2 322 191 112 1:43:4
10 1 3 22 322 191 112 1:43:4
11 1 4 44 322 191 112 1:43:4
12 1 5 1 322 191 112 1:43:4
12 1 4 3 322 191 112 1:43:4
12 1 6 33 322 191 112 1:43:4
12 1 6 1 322 191 112 1:43:4
12 1 5 3 322 191 112 1:43:4
12 1 6 33 322 191 112 1:43:4
.
.
The third column has numbers between 1 and 6. I want to read columns #4 and #5 for all rows, grouped by the value (1 to 6) in the third column, and find the maximum and minimum for each group separately. For example, output like this:
Min for row with 1: 1
Max for row with 1: 90
Min for row with 2: 3
Max for row with 2: 55
and so on
I can plot the figure using the following code. How can I get summary statistics by group? I am looking for several statistics for the same group (mean, min, max, number of rows per group) in one call; is that doable?
import matplotlib.pyplot as plt
import csv
x = []
y = []
with open('mydata.csv','r') as csvfile:
    ap = csv.reader(csvfile, delimiter=',')
    for row in ap:
        x.append(int(row[2]))
        y.append(int(row[7]))
plt.scatter(x, y, color = 'g',s = 4, marker='o')
plt.show()
One easy way would be to use Pandas with read_csv(), .groupby() and .agg():
import pandas as pd
df = pd.read_csv("mydata.csv", header=None)
def min_max_avg(col):
    return (col.min() + col.max()) / 2
result = df[[2, 3, 4]].groupby(2).agg(["min", "max", "mean", min_max_avg])
Result:
3 4
min max mean min_max_avg min max mean min_max_avg
2
1 1 90 33.666667 45.5 464 525 504.666667 494.5
2 3 55 28.000000 29.0 254 525 434.666667 389.5
3 3 30 18.333333 16.5 322 392 345.333333 357.0
4 3 44 23.333333 23.5 322 322 322.000000 322.0
5 1 3 2.000000 2.0 322 322 322.000000 322.0
6 1 33 22.333333 17.0 322 322 322.000000 322.0
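Since the question also asks for the number of rows per group, here is a small sketch that adds "count" to the list of aggregations. It inlines the first six rows of mydata.csv through `io.StringIO` so it runs without the file; the variable names are only illustrative.

```python
import io
import pandas as pd

# First six rows of mydata.csv, inlined so the example is self-contained
csv_text = """0,2,1,1,464,385,171,0:44:4
1,1,2,26,254,444,525,0:56:2
2,3,1,90,525,785,522,0:52:8
3,8,2,3,525,233,555,0:52:8
4,7,1,10,525,433,522,1:52:8
5,9,2,55,525,555,522,1:52:8
"""
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Several statistics of column 3 per value of column 2, in one call
summary = df.groupby(2)[3].agg(["min", "max", "mean", "count"])
print(summary)
```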
If you don't want to use Pandas, you can do it in pure Python with only a little more work. csv.reader yields strings, so the values are converted to int before comparing; otherwise min() and max() would compare them lexicographically:
import csv
data = {}
with open("mydata.csv", "r") as file:
    for row in csv.reader(file):
        dct = data.setdefault(row[2], {})
        for col in (3, 4):
            # convert to int so that min/max compare numerically
            dct.setdefault(col, []).append(int(row[col]))
min_str = "Min for group {} - column {}: {}"
max_str = "Max for group {} - column {}: {}"
for row in data:
    for col in (3, 4):
        print(min_str.format(row, col, min(data[row][col])))
        print(max_str.format(row, col, max(data[row][col])))
Result:
Min for group 1 - column 3: 1
Max for group 1 - column 3: 90
Min for group 1 - column 4: 464
Max for group 1 - column 4: 525
Min for group 2 - column 3: 3
Max for group 2 - column 3: 55
Min for group 2 - column 4: 254
Max for group 2 - column 4: 525
Min for group 3 - column 3: 3
Max for group 3 - column 3: 30
Min for group 3 - column 4: 322
Max for group 3 - column 4: 392
...
mydata.csv:
0,2,1,1,464,385,171,0:44:4
1,1,2,26,254,444,525,0:56:2
2,3,1,90,525,785,522,0:52:8
3,8,2,3,525,233,555,0:52:8
4,7,1,10,525,433,522,1:52:8
5,9,2,55,525,555,522,1:52:8
6,6,3,3,392,111,232,1:43:4
7,1,4,23,322,191,112,1:43:4
8,1,3,30,322,191,112,1:43:4
9,1,5,2,322,191,112,1:43:4
10,1,3,22,322,191,112,1:43:4
11,1,4,44,322,191,112,1:43:4
12,1,5,1,322,191,112,1:43:4
12,1,4,3,322,191,112,1:43:4
12,1,6,33,322,191,112,1:43:4
12,1,6,1,322,191,112,1:43:4
12,1,5,3,322,191,112,1:43:4
12,1,6,33,322,191,112,1:43:4

How to merge multiple sheets and rename column names with the names of the sheet names?

I have the following data. It is all in one excel file.
Sheet name: may2019
Productivity Count
Date : 01-Apr-2020 00:00 to 30-Apr-2020 23:59
Date Type: Finalized Date Modality: All
Name MR DX CT US MG BMD TOTAL
Svetlana 29 275 101 126 5 5 541
Kate 32 652 67 171 1 0 923
Andrew 0 452 0 259 1 0 712
Tom 50 461 61 104 4 0 680
Maya 0 353 0 406 0 0 759
Ben 0 1009 0 143 0 0 1152
Justin 0 2 9 0 1 9 21
Total 111 3204 238 1209 12 14 4788
Sheet Name: June 2020
Productivity Count
Date : 01-Jun-2019 00:00 to 30-Jun-2019 23:59
Date Type: Finalized Date Modality: All
NAme US DX CT MR MG BMD TOTAL
Svetlana 4 0 17 6 0 4 31
Kate 158 526 64 48 1 0 797
Andrew 154 230 0 0 0 0 384
Tom 1 0 19 20 2 8 50
Maya 260 467 0 0 1 1 729
Ben 169 530 59 40 3 0 801
Justin 125 164 0 0 4 0 293
Alvin 0 1 0 0 0 0 1
Total 871 1918 159 114 11 13 3086
I want to merge all the sheets into one sheet and drop the first 3 rows of every sheet. This is the output I am looking for:
Sl.No Name US_jun2019 DX_jun2019 CT_jun2019 MR_jun2019 MG_jun2019 BMD_jun2019 TOTAL_jun2019 MR_may2019 DX_may2019 CT_may2019 US_may2019 MG_may2019 BMD_may2019 TOTAL_may2019
1 Svetlana 4 0 17 6 0 4 31 29 275 101 126 5 5 541
2 Kate 158 526 64 48 1 0 797 32 652 67 171 1 0 923
3 Andrew 154 230 0 0 0 0 384 0 353 0 406 0 0 759
4 Tom 1 0 19 20 2 8 50 0 2 9 0 1 9 21
5 Maya 260 467 0 0 1 1 729 0 1009 0 143 0 0 1152
6 Ben 169 530 59 40 3 0 801 50 461 61 104 4 0 680
7 Justin 125 164 0 0 4 0 293 0 452 0 259 1 0 712
8 Alvin 0 1 0 0 0 0 1 #N/A #N/A #N/A #N/A #N/A #N/A #N/A
I tried the following code, but the output is not the one I am looking for.
df=pd.concat(df,sort=False)
df= df.drop(df.index[[0,1]])
df=df.rename(columns=df.iloc[0])
df= df.drop(df.index[[0]])
df=df.drop(['Sl.No'], axis = 1)
print(df)
First, read both Excel sheets.
>>> df1 = pd.read_excel('path/to/excel/file.xlsx', sheet_name="may2019")
>>> df2 = pd.read_excel('path/to/excel/file.xlsx', sheet_name="jun2019")
Drop the first three rows.
>>> df1.drop(index=range(3), inplace=True)
>>> df2.drop(index=range(3), inplace=True)
Rename the columns to the values in the first remaining row, then drop that row. Labels 0 to 2 are already gone at this point, so drop it by position rather than by the label 0.
>>> df1.rename(columns=dict(zip(df1.columns, df1.iloc[0])), inplace=True)
>>> df1.drop(index=df1.index[0], inplace=True)
>>> df2.rename(columns=dict(zip(df2.columns, df2.iloc[0])), inplace=True)
>>> df2.drop(index=df2.index[0], inplace=True)
Add suffixes to every column except Name.
>>> df1.rename(columns=lambda col_name: col_name if col_name == 'Name' else col_name + '_may2019', inplace=True)
>>> df2.rename(columns=lambda col_name: col_name if col_name == 'Name' else col_name + '_jun2019', inplace=True)
Remove the duplicate Name column from the second DataFrame.
>>> df2.drop(columns=['Name'], inplace=True)
Concatenate both DataFrames. pd.concat returns a new DataFrame and takes no inplace argument; with axis=1 it aligns rows by index label, not by Name.
>>> df = pd.concat([df1, df2], axis=1)
All the code in one place:
import pandas as pd
df1 = pd.read_excel('path/to/excel/file.xlsx', sheet_name="may2019")
df2 = pd.read_excel('path/to/excel/file.xlsx', sheet_name="jun2019")
df1.drop(index=range(3), inplace=True)
df2.drop(index=range(3), inplace=True)
df1.rename(columns=dict(zip(df1.columns, df1.iloc[0])), inplace=True)
df1.drop(index=df1.index[0], inplace=True)
df2.rename(columns=dict(zip(df2.columns, df2.iloc[0])), inplace=True)
df2.drop(index=df2.index[0], inplace=True)
df1.rename(columns=lambda col_name: col_name if col_name == 'Name' else col_name + '_may2019', inplace=True)
df2.rename(columns=lambda col_name: col_name if col_name == 'Name' else col_name + '_jun2019', inplace=True)
df2.drop(columns=['Name'], inplace=True)
df = pd.concat([df1, df2], axis=1)
print(df)
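Because the two sheets do not list the same people (Alvin appears only in June), aligning on Name rather than on row position may be closer to the requested output. A minimal sketch, assuming the header rows have already been cleaned up; the two small frames are hypothetical stand-ins built from a few values in the question, not the real sheets.

```python
import pandas as pd

# Hypothetical stand-ins for the two cleaned sheets (subset of the question's data)
may = pd.DataFrame({"Name": ["Svetlana", "Kate"],
                    "MR": [29, 32], "TOTAL": [541, 923]})
jun = pd.DataFrame({"Name": ["Svetlana", "Kate", "Alvin"],
                    "MR": [6, 48, 0], "TOTAL": [31, 797, 1]})

# Move the join key into the index so add_suffix leaves it untouched
may = may.set_index("Name").add_suffix("_may2019")
jun = jun.set_index("Name").add_suffix("_jun2019")

# The outer join keeps people who appear in only one sheet (NaN elsewhere)
merged = jun.join(may, how="outer").reset_index()
print(merged)
```

People missing from one sheet get NaN in that sheet's columns, which corresponds to the #N/A cells in the desired output.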

Perform operation on columns based on values of another columns in pandas

I have a dataframe
df = pd.DataFrame([["A",1,98,88,"",567,453,545,656,323,756], ["B",1,99,"","",231,232,234,943,474,345], ["C",1,97,67,23,543,458,456,876,935,876], ["B",1,"",79,84,895,237,678,452,545,453], ["A",1,45,"",58,334,778,234,983,858,657], ["C",1,23,55,"",183,565,953,565,234,234]], columns=["id","date","col1","col2","col3","col1_num","col1_deno","col3_num","col3_deno","col2_num","col2_deno"])
I need to set the corresponding _num and _deno values to NaN/blank for each column. For example, make "col1_num" and "col1_deno" NaN/blank in every row where "col1" is blank. Repeat the same process for "col2_num" and "col2_deno" based on "col2", and for "col3_num" and "col3_deno" based on "col3".
Expected Output:
df_out = pd.DataFrame([["A",1,98,88,"",567,453,"","",323,756], ["B",1,99,"","",231,232,"","","",""], ["C",1,97,67,23,543,458,456,876,935,876], ["B",1,"",79,84,"","",678,452,545,453], ["A",1,45,"",58,334,778,234,983,"",""], ["C",1,23,55,"",183,565,"","",234,234]], columns=["id","date","col1","col2","col3","col1_num","col1_deno","col3_num","col3_deno","col2_num","col2_deno"])
How to do it?
Let us try with boolean masking:
# select the columns
c = pd.Index(['col1', 'col2', 'col3'])
# create boolean mask
m = df[c].eq('').to_numpy()
# mask the values in `_num` and `_deno` like columns
df[c + '_num'] = df[c + '_num'].mask(m, '')
df[c + '_deno'] = df[c + '_deno'].mask(m, '')
>>> df
id date col1 col2 col3 col1_num col1_deno col3_num col3_deno col2_num col2_deno
0 A 1 98 88 567 453 323 756
1 B 1 99 231 232
2 C 1 97 67 23 543 458 456 876 935 876
3 B 1 79 84 678 452 545 453
4 A 1 45 58 334 778 234 983
5 C 1 23 55 183 565 234 234
@shubham's answer is simple and to the point, and I believe faster as well; this is just an alternative for cases where you cannot (or do not want to) list all the columns.
Get the list of columns that need to be changed:
cols = [col for col in df if col.startswith('col')]
['col1',
'col2',
'col3',
'col1_num',
'col1_deno',
'col3_num',
'col3_deno',
'col2_num',
'col2_deno']
Create a dictionary pairing col1 to the columns to be changed, same for col2 and so on:
from collections import defaultdict
d = defaultdict(list)
for col in cols:
    if "_" in col:
        d[col.split("_")[0]].append(col)
d
defaultdict(list,
{'col1': ['col1_num', 'col1_deno'],
'col3': ['col3_num', 'col3_deno'],
'col2': ['col2_num', 'col2_deno']})
Iterate through the dict to assign the new values:
for key, val in d.items():
    df.loc[df[key].eq(""), val] = ""
id date col1 col2 col3 col1_num col1_deno col3_num col3_deno col2_num col2_deno
0 A 1 98 88 567 453 323 756
1 B 1 99 231 232
2 C 1 97 67 23 543 458 456 876 935 876
3 B 1 79 84 678 452 545 453
4 A 1 45 58 334 778 234 983
5 C 1 23 55 183 565 234 234
Solution with MultiIndex:
import numpy as np
# first move the columns that are neither tested nor changed into the index
df1 = df.set_index(['id','date'])
cols = df1.columns
# split the column names on _ to build a MultiIndex
df1.columns = df1.columns.str.split('_', expand=True)
# compare the columns without _ (NaN in the second level) with the empty string
m = df1.xs(np.nan, axis=1, level=1).eq('')
# broadcast the mask to all columns
mask = m.reindex(df1.columns, axis=1, level=0)
# apply the mask and restore the original column names
df1 = df1.mask(mask, '').set_axis(cols, axis=1).reset_index()
print (df1)
id date col1 col2 col3 col1_num col1_deno col3_num col3_deno col2_num \
0 A 1 98 88 567 453 323
1 B 1 99 231 232
2 C 1 97 67 23 543 458 456 876 935
3 B 1 79 84 678 452 545
4 A 1 45 58 334 778 234 983
5 C 1 23 55 183 565 234
col2_deno
0 756
1
2 876
3 453
4
5 234
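The question allows NaN as well as blank. A minimal sketch of the NaN variant on a cut-down frame (two base columns, three rows taken from the question's data); it uses `Series.mask`, which fills with NaN when no replacement value is given. The loop variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": [98, "", 45],
                   "col2": ["", 79, 55],
                   "col1_num": [567, 895, 334],
                   "col1_deno": [453, 237, 778],
                   "col2_num": [323, 545, 858],
                   "col2_deno": [756, 453, 657]})

for base in ["col1", "col2"]:
    blank = df[base].eq("")  # rows where the base column is empty
    for suffix in ("_num", "_deno"):
        # mask() without a replacement value puts NaN where `blank` is True
        df[base + suffix] = df[base + suffix].mask(blank)
print(df)
```

The masked columns become float, since integer columns cannot hold NaN.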

pandas column values to row values

I have a dataset (171 columns), and when I load it into my dataframe it looks like this:
ANO MNO UJ2010 DJ2010 UF2010 DF2010 UM2010 DM2010 UA2010 DA2010 ...
1 A 113 06/01/2010 129 06/02/2010 143 06/03/2010 209 05/04/2010 ...
2 B 218 06/01/2010 211 06/02/2010 244 06/03/2010 348 05/04/2010 ...
3 C 22 06/01/2010 114 06/02/2010 100 06/03/2010 151 05/04/2010 ...
Now I want to change my dataframe like this way -
ANO MNO Time Unit
1 A 06/01/2010 113
1 A 06/02/2010 129
1 A 06/03/2010 143
2 B 06/01/2010 218
2 B 06/02/2010 211
2 B 06/03/2010 244
3 C 06/01/2010 22
3 C 06/02/2010 114
3 C 06/03/2010 100
....
.....
I tried to use pd.melt, but I think it does not fulfill my purpose. How can I do this?
Use pd.lreshape as a close alternative to pd.melt, after selecting the columns to be grouped under each new header.
pd.lreshape takes a dictionary as its groups parameter: each key becomes a new column header, and the columns listed as its value are stacked under that header. The result is a long-format DataFrame.
Finally, sort the DataFrame by the unused columns so the rows line up.
Then call reset_index(drop=True) to relabel the index with the default integers, discarding the intermediate index.
d = pd.lreshape(df, {"Time": df.filter(regex=r'^D').columns,
"Unit": df.filter(regex=r'^U').columns})
d.sort_values(['ANO', 'MNO']).reset_index(drop=True)
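For a quick check, the same call can be run on a small stand-in frame. This sketch uses two of the question's rows and two month pairs; the name `long_df` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"ANO": [1, 2], "MNO": ["A", "B"],
                   "UJ2010": [113, 218], "DJ2010": ["06/01/2010", "06/01/2010"],
                   "UF2010": [129, 211], "DF2010": ["06/02/2010", "06/02/2010"]})

# Map each new header to the wide columns that should be stacked under it
groups = {"Time": list(df.filter(regex=r"^D").columns),
          "Unit": list(df.filter(regex=r"^U").columns)}

long_df = (pd.lreshape(df, groups)
             .sort_values(["ANO", "MNO"])
             .reset_index(drop=True))
print(long_df)
```

The regex filters pick up every column starting with D or U, so in a wider frame it is worth checking that no unrelated column matches.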
If there's a mismatch in the length of the grouping columns, then:
from itertools import groupby, chain
unused_cols = ['ANO', 'MNO']
cols = df.columns.difference(unused_cols)
# Key function: the part of the column name after the first character.
fnc = lambda x: x[1:]
pref1, pref2 = "D", "U"
# Collect the columns that share the same suffix.
groups = [list(g) for n, g in groupby(sorted(cols, key=fnc), key=fnc)]
# Complete single-element groups with the missing counterpart (D for U, U for D).
fill_missing = [i if len(i)==2 else i +
                [pref1 + i[0][1:] if i[0][0] == pref2 else pref2 + i[0][1:]]
                for i in groups]
# Reindex based on newly obtained column names.
df = df.reindex(columns=unused_cols + list(chain(*fill_missing)))
Then continue with the same pd.lreshape steps as above, this time passing dropna=False.
You can reshape with stack, but first create a MultiIndex in the columns using % and //.
The second level, built by floor division (//) by 2, groups each Unit/Time pair; the first level, built by modulo (%), distinguishes the two members of each pair.
stack then moves the last level (the one created by //) into a new index level, which is not needed and is removed by reset_index(level=2, drop=True).
A final reset_index converts the first and second index levels to columns.
[[1,0]] swaps the two columns to change their order.
df = df.set_index(['ANO','MNO'])
cols = np.arange(len(df.columns))
df.columns = [cols % 2, cols // 2]
print (df)
0 1 0 1 0 1 0 1
0 0 1 1 2 2 3 3
ANO MNO
1 A 113 06/01/2010 129 06/02/2010 143 06/03/2010 209 05/04/2010
2 B 218 06/01/2010 211 06/02/2010 244 06/03/2010 348 05/04/2010
3 C 22 06/01/2010 114 06/02/2010 100 06/03/2010 151 05/04/2010
df = df.stack()[[1,0]].reset_index(level=2, drop=True).reset_index()
df.columns = ['ANO','MNO','Time','Unit']
print (df)
ANO MNO Time Unit
0 1 A 06/01/2010 113
1 1 A 06/02/2010 129
2 1 A 06/03/2010 143
3 1 A 05/04/2010 209
4 2 B 06/01/2010 218
5 2 B 06/02/2010 211
6 2 B 06/03/2010 244
7 2 B 05/04/2010 348
8 3 C 06/01/2010 22
9 3 C 06/02/2010 114
10 3 C 06/03/2010 100
11 3 C 05/04/2010 151
EDIT:
#last column is missing
print (df)
ANO MNO UJ2010 DJ2010 UF2010 DF2010 UM2010 DM2010 UA2010
0 1 A 113 06/01/2010 129 06/02/2010 143 06/03/2010 209
1 2 B 218 06/01/2010 211 06/02/2010 244 06/03/2010 348
2 3 C 22 06/01/2010 114 06/02/2010 100 06/03/2010 151
df = df.set_index(['ANO','MNO'])
#build the MultiIndex from the first character of each column name and the rest of the name
df.columns = [df.columns.str[0], df.columns.str[1:]]
print (df)
U D U D U D U
J2010 J2010 F2010 F2010 M2010 M2010 A2010
ANO MNO
1 A 113 06/01/2010 129 06/02/2010 143 06/03/2010 209
2 B 218 06/01/2010 211 06/02/2010 244 06/03/2010 348
3 C 22 06/01/2010 114 06/02/2010 100 06/03/2010 151
#stack fills in the missing combinations with NaN
df = df.stack().reset_index(level=2, drop=True).reset_index()
df.columns = ['ANO','MNO','Time','Unit']
print (df)
ANO MNO Time Unit
0 1 A NaN 209
1 1 A 06/02/2010 129
2 1 A 06/01/2010 113
3 1 A 06/03/2010 143
4 2 B NaN 348
5 2 B 06/02/2010 211
6 2 B 06/01/2010 218
7 2 B 06/03/2010 244
8 3 C NaN 151
9 3 C 06/02/2010 114
10 3 C 06/01/2010 22
11 3 C 06/03/2010 100
You can use iloc with pd.concat for this. The idea is simple: select each relevant pair of columns with iloc, then concatenate the pieces vertically, one after another:
def rename(sub_df):
    sub_df.columns = ["ANO", "MNO", "Time", "Unit"]
    return sub_df
pd.concat([rename(df.iloc[:, [0, 1, x+1, x]])
           for x in range(2, df.shape[1], 2)])
ANO MNO Time Unit
0 1 A 06/01/2010 113
1 2 B 06/01/2010 218
2 3 C 06/01/2010 22
0 1 A 06/02/2010 129
1 2 B 06/02/2010 211
2 3 C 06/02/2010 114
0 1 A 06/03/2010 143
1 2 B 06/03/2010 244
2 3 C 06/03/2010 100
0 1 A 05/04/2010 209
1 2 B 05/04/2010 348
2 3 C 05/04/2010 151
