pandas column values to row values - python

I have a dataset (171 columns), and when I load it into a DataFrame it looks like this:
ANO MNO UJ2010 DJ2010 UF2010 DF2010 UM2010 DM2010 UA2010 DA2010 ...
1 A 113 06/01/2010 129 06/02/2010 143 06/03/2010 209 05/04/2010 ...
2 B 218 06/01/2010 211 06/02/2010 244 06/03/2010 348 05/04/2010 ...
3 C 22 06/01/2010 114 06/02/2010 100 06/03/2010 151 05/04/2010 ...
Now I want to reshape my DataFrame like this:
ANO MNO Time Unit
1 A 06/01/2010 113
1 A 06/02/2010 129
1 A 06/03/2010 143
2 B 06/01/2010 218
2 B 06/02/2010 211
2 B 06/03/2010 244
3 C 06/01/2010 22
3 C 06/02/2010 114
3 C 06/03/2010 100
....
.....
I tried to use pd.melt, but I don't think it fulfils my purpose. How can I do this?

Use pd.lreshape as a close alternative to pd.melt, after filtering the columns to be grouped under each new header.
pd.lreshape takes a dictionary as its groups parameter: each key becomes a new column name, and the values from all the columns listed under that key are collected into that single column. The result is a long-format DataFrame.
Finally, sort the DataFrame by the unused (identifier) columns so that the rows for each ANO/MNO pair sit together.
Then call reset_index(drop=True) to discard the intermediate index and relabel the rows with the default integer index.
d = pd.lreshape(df, {"Time": df.filter(regex=r'^D').columns,
                     "Unit": df.filter(regex=r'^U').columns})
d.sort_values(['ANO', 'MNO']).reset_index(drop=True)
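As a minimal, self-contained sketch of the pd.lreshape call above: the two-pair frame below is a hypothetical cut-down version of the question's data, and the column lists are built with plain startswith checks instead of df.filter. Time is added to the sort keys only to make the row order deterministic; it is compared as a string here, which happens to work for these sample dates.

```python
import pandas as pd

# Hypothetical sample in the same layout as the question (two U/D pairs only).
df = pd.DataFrame({
    "ANO": [1, 2],
    "MNO": ["A", "B"],
    "UJ2010": [113, 218],
    "DJ2010": ["06/01/2010", "06/01/2010"],
    "UF2010": [129, 211],
    "DF2010": ["06/02/2010", "06/02/2010"],
})

# Each key becomes a new column; its list names the wide columns to collect.
groups = {
    "Time": [c for c in df.columns if c.startswith("D")],
    "Unit": [c for c in df.columns if c.startswith("U")],
}

long_df = (pd.lreshape(df, groups)
           .sort_values(["ANO", "MNO", "Time"])
           .reset_index(drop=True))
print(long_df)
```

Both lists must have the same length and the same pair order, since pd.lreshape matches the columns by position.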
If there's a mismatch in the length of the grouping columns, then:
from itertools import groupby, chain
unused_cols = ['ANO', 'MNO']
cols = df.columns.difference(unused_cols)
# Group the column names by everything after the first character.
fnc = lambda x: x[1:]
pref1, pref2 = "D", "U"
# Collect the names that share the same suffix.
groups = [list(g) for n, g in groupby(sorted(cols, key=fnc), key=fnc)]
# Pad any single-element group with its counterpart under the other prefix.
fill_missing = [i if len(i)==2 else i +
                [pref1 + i[0][1:] if i[0][0] == pref2 else pref2 + i[0][1:]]
                for i in groups]
# Reindex based on newly obtained column names.
df = df.reindex(columns=unused_cols + list(chain(*fill_missing)))
Then continue with the same pd.lreshape steps as above, this time passing dropna=False.
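To illustrate what dropna=False changes, here is a small sketch under the assumption that exactly one D column (DF2010) is missing. The sample frame is hypothetical, and the missing column is added by hand with reindex rather than by the groupby-based helper above.

```python
import pandas as pd

# Hypothetical sample: UF2010 has no matching DF2010 column.
df = pd.DataFrame({
    "ANO": [1, 2],
    "MNO": ["A", "B"],
    "UJ2010": [113, 218],
    "DJ2010": ["06/01/2010", "06/01/2010"],
    "UF2010": [129, 211],
})

# Add the missing counterpart as an all-NaN column so both groups line up.
df = df.reindex(columns=list(df.columns) + ["DF2010"])

groups = {"Time": ["DJ2010", "DF2010"], "Unit": ["UJ2010", "UF2010"]}

kept = pd.lreshape(df, groups, dropna=False)  # rows with NaN Time are kept
dropped = pd.lreshape(df, groups)             # default dropna=True removes them

print(kept)
print(len(kept), len(dropped))
```

With the default dropna=True, the rows whose Time is missing are silently discarded, which is why the flag matters once the column lists have been padded.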

You can reshape with stack, but first create a MultiIndex in the columns using % and //.
Floor division (//) by 2 gives each Unit/Time pair a shared number in the second level of the MultiIndex, and the modulo (%) by 2 distinguishes the two members of each pair in the first level.
stack then moves the last level (the one created by //) into the index as a new level, which is not needed, so it is removed with reset_index(level=2, drop=True).
The final reset_index converts the first and second index levels back into columns.
[[1,0]] swaps the two columns to change their order.
import numpy as np
df = df.set_index(['ANO','MNO'])
cols = np.arange(len(df.columns))
df.columns = [cols % 2, cols // 2]
print (df)
0 1 0 1 0 1 0 1
0 0 1 1 2 2 3 3
ANO MNO
1 A 113 06/01/2010 129 06/02/2010 143 06/03/2010 209 05/04/2010
2 B 218 06/01/2010 211 06/02/2010 244 06/03/2010 348 05/04/2010
3 C 22 06/01/2010 114 06/02/2010 100 06/03/2010 151 05/04/2010
df = df.stack()[[1,0]].reset_index(level=2, drop=True).reset_index()
df.columns = ['ANO','MNO','Time','Unit']
print (df)
ANO MNO Time Unit
0 1 A 06/01/2010 113
1 1 A 06/02/2010 129
2 1 A 06/03/2010 143
3 1 A 05/04/2010 209
4 2 B 06/01/2010 218
5 2 B 06/02/2010 211
6 2 B 06/03/2010 244
7 2 B 05/04/2010 348
8 3 C 06/01/2010 22
9 3 C 06/02/2010 114
10 3 C 06/03/2010 100
11 3 C 05/04/2010 151
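The % and // trick used for the column MultiIndex can be checked in isolation. This sketch assumes six value columns (three Unit/Time pairs); the variable names are illustrative only.

```python
import numpy as np

cols = np.arange(6)        # positions of six alternating Unit/Time columns

pair_member = cols % 2     # 0 for the Unit column, 1 for the Time column
pair_number = cols // 2    # same number for both columns of a pair

print(pair_member.tolist())  # [0, 1, 0, 1, 0, 1]
print(pair_number.tolist())  # [0, 0, 1, 1, 2, 2]
```

This only works when the columns strictly alternate Unit, Time, Unit, Time; the EDIT below covers the case where that assumption breaks.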
EDIT:
#last column is missing
print (df)
ANO MNO UJ2010 DJ2010 UF2010 DF2010 UM2010 DM2010 UA2010
0 1 A 113 06/01/2010 129 06/02/2010 143 06/03/2010 209
1 2 B 218 06/01/2010 211 06/02/2010 244 06/03/2010 348
2 3 C 22 06/01/2010 114 06/02/2010 100 06/03/2010 151
df = df.set_index(['ANO','MNO'])
# build the MultiIndex from the first character of each column name and the rest of the name
df.columns = [df.columns.str[0], df.columns.str[1:]]
print (df)
U D U D U D U
J2010 J2010 F2010 F2010 M2010 M2010 A2010
ANO MNO
1 A 113 06/01/2010 129 06/02/2010 143 06/03/2010 209
2 B 218 06/01/2010 211 06/02/2010 244 06/03/2010 348
3 C 22 06/01/2010 114 06/02/2010 100 06/03/2010 151
# stack fills the missing half of a pair with NaN
df = df.stack().reset_index(level=2, drop=True).reset_index()
df.columns = ['ANO','MNO','Time','Unit']
print (df)
ANO MNO Time Unit
0 1 A NaN 209
1 1 A 06/02/2010 129
2 1 A 06/01/2010 113
3 1 A 06/03/2010 143
4 2 B NaN 348
5 2 B 06/02/2010 211
6 2 B 06/01/2010 218
7 2 B 06/03/2010 244
8 3 C NaN 151
9 3 C 06/02/2010 114
10 3 C 06/01/2010 22
11 3 C 06/03/2010 100

You can use iloc with pd.concat for this. The solution is simple - just stack all relevant columns (which are selected via iloc) vertically one after another and concatenate them:
def rename(sub_df):
    sub_df.columns = ["ANO", "MNO", "Time", "Unit"]
    return sub_df
pd.concat([rename(df.iloc[:, [0, 1, x+1, x]])
           for x in range(2, df.shape[1], 2)])
ANO MNO Time Unit
0 1 A 06/01/2010 113
1 2 B 06/01/2010 218
2 3 C 06/01/2010 22
0 1 A 06/02/2010 129
1 2 B 06/02/2010 211
2 3 C 06/02/2010 114
0 1 A 06/03/2010 143
1 2 B 06/03/2010 244
2 3 C 06/03/2010 100
0 1 A 05/04/2010 209
1 2 B 05/04/2010 348
2 3 C 05/04/2010 151
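The output above keeps the original row labels (0, 1, 2 repeated) and is ordered pair by pair. A possible variant, sketched on a hypothetical two-pair frame, uses set_axis to rename each slice without mutating it, passes ignore_index=True to pd.concat, and sorts so that each ANO/MNO block is contiguous, as in the question's desired output.

```python
import pandas as pd

# Hypothetical sample in the same layout as the question (two U/D pairs only).
df = pd.DataFrame({
    "ANO": [1, 2],
    "MNO": ["A", "B"],
    "UJ2010": [113, 218],
    "DJ2010": ["06/01/2010", "06/01/2010"],
    "UF2010": [129, 211],
    "DF2010": ["06/02/2010", "06/02/2010"],
})

# One slice per (Unit, Time) pair, with the Time column placed before Unit.
pieces = [
    df.iloc[:, [0, 1, x + 1, x]].set_axis(["ANO", "MNO", "Time", "Unit"], axis=1)
    for x in range(2, df.shape[1], 2)
]

out = (pd.concat(pieces, ignore_index=True)
       .sort_values(["ANO", "MNO", "Time"])
       .reset_index(drop=True))
print(out)
```

Like the original, this relies on the columns being laid out as strict Unit/Time pairs after the two identifier columns.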


Split columns conditionally on string

I have a data frame with the following shape:
0 1
0 OTT:81 DVBC:398
1 OTT:81 DVBC:474
2 OTT:81 DVBC:474
3 OTT:81 DVBC:454
4 OTT:81 DVBC:443
5 OTT:1 DVBC:254
6 DVBC:151 None
7 OTT:1 DVBC:243
8 OTT:1 DVBC:254
9 DVBC:227 None
I want column 1 to take the value of column 0 wherever column 0 contains "DVBC".
Then split the values on ":" and fill the empty ones with 0.
The final data frame should look like this:
OTT DVBC
0 81 398
1 81 474
2 81 474
3 81 454
4 81 443
5 1 254
6 0 151
7 1 243
8 1 254
9 0 227
I tried to start with:
if df[0].str.contains("DVBC") is True:
    df[1] = df[0]
But after this the data frame looks the same, and I'm not sure why.
My idea is then to move the values to their respective columns, split by ":", and rename the columns.
How can I implement this?
A general solution for splitting the values by : and pivoting: first create a Series with DataFrame.stack, split it with Series.str.split, and finally reshape with DataFrame.pivot:
df = df.stack().str.split(':', expand=True).reset_index()
df = df.pivot(index='level_0', columns=0, values=1).fillna(0).rename_axis(index=None, columns=None)
print (df)
DVBC OTT
0 398 81
1 474 81
2 474 81
3 454 81
4 443 81
5 254 1
6 151 0
7 243 1
8 254 1
9 227 0
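The pivoted values above are still strings (with an integer 0 filled in for the gaps). A self-contained sketch of the same stack/split/pivot idea that also converts the result to integers might look as follows; the two-row frame is hypothetical, and the explicit dropna() and pd.to_numeric steps are additions that are not part of the answer above.

```python
import pandas as pd

# Hypothetical two-row sample in the question's layout.
df = pd.DataFrame({0: ["OTT:81", "DVBC:151"], 1: ["DVBC:398", None]})

# Long form: one row per non-missing cell, split into (name, value).
long_df = (df.stack()
             .dropna()
             .str.split(":", expand=True)
             .reset_index())

# Wide form: one column per name, one row per original row.
wide = long_df.pivot(index="level_0", columns=0, values=1)

# Convert the string values to numbers before filling the gaps with 0.
wide = (wide.apply(pd.to_numeric)
            .fillna(0)
            .astype(int)
            .rename_axis(index=None, columns=None))
print(wide)
```

The columns come out in alphabetical order (DVBC, OTT); reorder with wide[["OTT", "DVBC"]] if the order in the question matters.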
Here is one way that should work with any number of columns:
(df
 .apply(lambda c: c.str.extract(r':(\d+)', expand=False))
 .ffill(axis=1)
 .mask(df.replace('None', pd.NA).isnull().shift(-1, axis=1, fill_value=False), 0)
)
output:
OTT DVBC
0 81 398
1 81 474
2 81 474
3 81 454
4 81 443
5 1 254
6 0 151
7 1 243
8 1 254
9 0 227
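On why the original attempt left the frame unchanged: str.contains returns a boolean Series, and a Series is never the object True, so the `is True` test is always False and the assignment never runs. A row-wise version with a boolean mask could be sketched like this (the two-row frame is hypothetical):

```python
import pandas as pd

# Hypothetical two-row sample in the question's layout.
df = pd.DataFrame({0: ["OTT:81", "DVBC:151"], 1: ["DVBC:398", None]})

mask = df[0].str.contains("DVBC")   # boolean Series, one entry per row
check = mask is True                # identity test against True: always False

# Move the DVBC value into column 1 only on the rows where the mask holds.
df.loc[mask, 1] = df.loc[mask, 0]
df.loc[mask, 0] = None
print(df)
```

After this step column 0 holds only OTT entries (or missing values) and column 1 only DVBC entries, so each column can be split on ":" separately.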

How to calculate min and max of a column for particular rows?

I have a csv file as following:
0 2 1 1 464 385 171 0:44:4
1 1 2 26 254 444 525 0:56:2
2 3 1 90 525 785 522 0:52:8
3 8 2 3 525 233 555 0:52:8
4 7 1 10 525 433 522 1:52:8
5 9 2 55 525 555 522 1:52:8
6 6 3 3 392 111 232 1:43:4
7 1 4 23 322 191 112 1:43:4
8 1 3 30 322 191 112 1:43:4
9 1 5 2 322 191 112 1:43:4
10 1 3 22 322 191 112 1:43:4
11 1 4 44 322 191 112 1:43:4
12 1 5 1 322 191 112 1:43:4
12 1 4 3 322 191 112 1:43:4
12 1 6 33 322 191 112 1:43:4
12 1 6 1 322 191 112 1:43:4
12 1 5 3 322 191 112 1:43:4
12 1 6 33 322 191 112 1:43:4
.
.
The third column contains numbers from 1 to 6. I want to read columns #4 and #5 for all rows, group them by the number (1 to 6) in the third column, and find the maximum and minimum for each of those groups separately. For example, output like this:
Min for row with 1: 1
Max for row with 1: 90
Min for row with 2: 3
Max for row with 2: 55
and so on
I can plot the data using the following code. How can I get summary statistics by group? I'm looking for multiple statistics per group (mean, min, max, count) in one call. Is that doable?
import matplotlib.pyplot as plt
import csv
x= []
y= []
with open('mydata.csv','r') as csvfile:
    ap = csv.reader(csvfile, delimiter=',')
    for row in ap:
        x.append(int(row[2]))
        y.append(int(row[7]))
plt.scatter(x, y, color = 'g',s = 4, marker='o')
plt.show()
One easy way would be to use Pandas with read_csv(), .groupby() and .agg():
import pandas as pd
df = pd.read_csv("mydata.csv", header=None)
def min_max_avg(col):
    return (col.min() + col.max()) / 2
result = df[[2, 3, 4]].groupby(2).agg(["min", "max", "mean", min_max_avg])
Result:
3 4
min max mean min_max_avg min max mean min_max_avg
2
1 1 90 33.666667 45.5 464 525 504.666667 494.5
2 3 55 28.000000 29.0 254 525 434.666667 389.5
3 3 30 18.333333 16.5 322 392 345.333333 357.0
4 3 44 23.333333 23.5 322 322 322.000000 322.0
5 1 3 2.000000 2.0 322 322 322.000000 322.0
6 1 33 22.333333 17.0 322 322 322.000000 322.0
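To get lines in the format the question asks for, the aggregated frame can be iterated directly. This sketch reads the first six rows of mydata.csv from an in-memory string (so it runs without the file) and looks only at column 3; the variable names are illustrative.

```python
import io
import pandas as pd

# First six rows of mydata.csv, inlined so the example is self-contained.
csv_text = """0,2,1,1,464,385,171,0:44:4
1,1,2,26,254,444,525,0:56:2
2,3,1,90,525,785,522,0:52:8
3,8,2,3,525,233,555,0:52:8
4,7,1,10,525,433,522,1:52:8
5,9,2,55,525,555,522,1:52:8
"""

df = pd.read_csv(io.StringIO(csv_text), header=None)

# Group by the third column (label 2) and summarise the fourth (label 3).
stats = df.groupby(2)[3].agg(["min", "max"])

for group, row in stats.iterrows():
    print(f"Min for row with {group}: {row['min']}")
    print(f"Max for row with {group}: {row['max']}")
```

Adding "mean" or "count" to the list passed to agg extends the summary in the same call.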
If you don't like that you could do it with pure Python, it's only a little bit more work:
import csv
data = {}
with open("mydata.csv", "r") as file:
    for row in csv.reader(file):
        dct = data.setdefault(row[2], {})
        for col in (3, 4):
            # convert to int so min/max compare numbers, not strings
            dct.setdefault(col, []).append(int(row[col]))
min_str = "Min for group {} - column {}: {}"
max_str = "Max for group {} - column {}: {}"
for row in data:
    for col in (3, 4):
        print(min_str.format(row, col, min(data[row][col])))
        print(max_str.format(row, col, max(data[row][col])))
Result:
Min for group 1 - column 3: 1
Max for group 1 - column 3: 90
Min for group 1 - column 4: 464
Max for group 1 - column 4: 525
Min for group 2 - column 3: 3
Max for group 2 - column 3: 55
Min for group 2 - column 4: 254
Max for group 2 - column 4: 525
Min for group 3 - column 3: 3
Max for group 3 - column 3: 30
Min for group 3 - column 4: 322
Max for group 3 - column 4: 392
...
mydata.csv:
0,2,1,1,464,385,171,0:44:4
1,1,2,26,254,444,525,0:56:2
2,3,1,90,525,785,522,0:52:8
3,8,2,3,525,233,555,0:52:8
4,7,1,10,525,433,522,1:52:8
5,9,2,55,525,555,522,1:52:8
6,6,3,3,392,111,232,1:43:4
7,1,4,23,322,191,112,1:43:4
8,1,3,30,322,191,112,1:43:4
9,1,5,2,322,191,112,1:43:4
10,1,3,22,322,191,112,1:43:4
11,1,4,44,322,191,112,1:43:4
12,1,5,1,322,191,112,1:43:4
12,1,4,3,322,191,112,1:43:4
12,1,6,33,322,191,112,1:43:4
12,1,6,1,322,191,112,1:43:4
12,1,5,3,322,191,112,1:43:4
12,1,6,33,322,191,112,1:43:4

Append value/index for each duplicated row within a Pandas Dataframe

I have a sorted Dataframe with some duplicated ids and I wanted to make the ids unique by appending the index in which they appear in their duplicated list.
Original df:
id val
1 100
1 526
2 434
3 234
4 657
4 44
4 121
Notice how there are duplicate ids.
This is what I'm hoping for:
id val
1 100
1-1 526
2 434
3 234
4 657
4-1 44
4-2 121
Would also be ok with:
id val
1-0 100
1-1 526
2-0 434
3-0 234
4-0 657
4-1 44
4-2 121
Here's one way to do it:
df2 = df.copy()
df2['id'] = df['id'].astype(str) + '-' + df.groupby('id').cumcount().astype(str)
id val
0 1-0 100
1 1-1 526
2 2-0 434
3 3-0 234
4 4-0 657
5 4-1 44
6 4-2 121
To leave the first occurrence of each id unchanged, you can use transform instead:
df['id'] = df.groupby('id')['id'].transform(lambda x: ['{}-{}'.format(v, i) if i else v for i, v in enumerate(x)])
print(df)
Prints:
id val
0 1 100
1 1-1 526
2 2 434
3 3 234
4 4 657
5 4-1 44
6 4-2 121
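The same "suffix only on repeats" result can also be sketched without a Python-level lambda, by combining cumcount with Series.where. This is an alternative formulation, not either answer's exact code; note that it turns every id into a string, including the unsuffixed ones.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 3, 4, 4, 4],
    "val": [100, 526, 434, 234, 657, 44, 121],
})

# Position of each row within its id group: 0 for the first occurrence.
n = df.groupby("id").cumcount()

# "-<n>" for repeats, empty string for the first occurrence.
suffix = ("-" + n.astype(str)).where(n > 0, "")

df["id"] = df["id"].astype(str) + suffix
print(df)
```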

use result of a function applied to groupby for calculation on the original df

I have some data that looks like the DataFrame df shown below.
I am trying to first calculate the mean angle for each group using the function mean_angle. The calculated mean angle is then used for another per-group calculation using the function fun.
import pandas as pd
import numpy as np
# generate sample data
a = np.array([1,2,3,4]).repeat(4)
x1 = 90 + np.random.randint(-15, 15, size=a.size//2 - 2 )
x2 = 270 + np.random.randint(-50, 50, size=a.size//2 + 2 )
b = np.concatenate((x1, x2))
np.random.shuffle(b)
df = pd.DataFrame({'a':a, 'b':b})
The returned dataframe is printed below.
a b
0 1 295
1 1 78
2 1 280
3 1 94
4 2 308
5 2 227
6 2 96
7 2 299
8 3 248
9 3 288
10 3 81
11 3 78
12 4 103
13 4 265
14 4 309
15 4 229
My functions are mean_angle and fun:
def mean_angle(deg):
    deg = np.deg2rad(deg)
    deg = deg[~np.isnan(deg)]
    S = np.sum(np.sin(deg))
    C = np.sum(np.cos(deg))
    mu = np.arctan2(S,C)
    mu = np.rad2deg(mu)
    if mu <0:
        mu = 360 + mu
    return mu
def fun(x, mu):
    return np.where(abs(mu - x) < 45, x, np.where(x+180<360, x+180, x-180))
What I have tried:
mu = df.groupby(['a'])['b'].apply(mean_angle)
df2 = df.groupby(['a'])['b'].apply(fun, args = (mu,)) #this function should be element wise
I know it is totally wrong but I could not come up with a better way.
The desired output is something like this, where mu is the mean_angle of each group:
a b c
0 1 295 np.where(abs(mu - 295) < 45, 295, np.where(295 +180<360, 295 +180, 295 -180))
1 1 78 np.where(abs(mu - 78) < 45, 78, np.where(78 +180<360, 78 +180, 78 -180))
2 1 280 np.where(abs(mu - 280) < 45, 280, np.where(280 +180<360, 280 +180, 280 -180))
3 1 94 ...
4 2 308 ...
5 2 227 .
6 2 96 .
7 2 299 .
8 3 248 .
9 3 288 .
10 3 81 .
11 3 78 .
12 4 103 .
13 4 265 .
14 4 309 .
15 4 229 .
Any help is appreciated
You don't need your second function, just pass the necessary columns to np.where(). So creating your dataframe in the same manner and not modifying your mean_angle function, we have the following sample dataframe:
a b
0 1 228
1 1 291
2 1 84
3 1 226
4 2 266
5 2 311
6 2 82
7 2 274
8 3 79
9 3 250
10 3 222
11 3 88
12 4 80
13 4 291
14 4 100
15 4 293
Then create your c column (containing your mu values) using groupby() and transform(), and finally apply your np.where() logic:
df['c'] = df.groupby(['a'])['b'].transform(mean_angle)
df['c'] = np.where(abs(df['c'] - df['b']) < 45, df['b'], np.where(df['b']+180<360, df['b']+180, df['b']-180))
Yields:
a b c
0 1 228 228
1 1 291 111
2 1 84 264
3 1 226 226
4 2 266 266
5 2 311 311
6 2 82 262
7 2 274 274
8 3 79 259
9 3 250 70
10 3 222 42
11 3 88 268
12 4 80 260
13 4 291 111
14 4 100 280
15 4 293 113
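Because the sample data above is random, the numbers cannot be reproduced exactly. A deterministic sketch of the same transform-then-np.where pattern, using hand-picked angles whose group means are easy to verify (90 degrees for group 1, 170 degrees for group 2), could look like this. The data and the compact rewrite of mean_angle are illustrative assumptions; mu is kept in its own variable instead of being overwritten in column c.

```python
import numpy as np
import pandas as pd

def mean_angle(deg):
    """Circular mean of angles in degrees, mapped into [0, 360)."""
    rad = np.deg2rad(deg.dropna())
    mu = np.rad2deg(np.arctan2(np.sin(rad).sum(), np.cos(rad).sum()))
    return mu + 360 if mu < 0 else mu

# Hand-picked sample: group 1 averages to 90 degrees, group 2 to 170 degrees.
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2, 2, 2],
    "b": [90, 90, 270, 180, 200, 160, 20],
})

# transform broadcasts each group's mean angle back onto that group's rows.
mu = df.groupby("a")["b"].transform(mean_angle)

# Same rule as fun(): keep b if it is within 45 degrees of mu, otherwise rotate by 180.
flipped = np.where(df["b"] + 180 < 360, df["b"] + 180, df["b"] - 180)
df["c"] = np.where((mu - df["b"]).abs() < 45, df["b"], flipped)
print(df)
```

As in the original fun, the distance to mu is a plain absolute difference, so angles on either side of the 0/360 boundary are treated as far apart.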

performing math on dataframe variables after groupby in pandas and bringing results back to original dataframe

First the data:
df
City Date Sex Weight
0 A 6/12/2015 M 185
1 A 6/12/2015 F 120
2 A 7/12/2015 M 210
3 A 7/12/2015 F 105
4 B 6/12/2015 M 225
5 B 6/12/2015 F 155
6 B 6/19/2015 M 167
7 B 6/19/2015 F 121
I am trying to subtract two weights, male minus female. I am able to group the data and select the weights for each sex, but I am unable to create a new variable "wt_diff" that appears on every row regardless of sex, so that both rows of each city/date group carry the weight difference between the sexes.
I am looking to have this output:
df_new
City Date Sex Weight Wt_Diff
0 A 6/12/2015 M 185 65
1 A 6/12/2015 F 120 65
2 A 7/12/2015 M 210 105
3 A 7/12/2015 F 105 105
4 B 6/12/2015 M 225 70
5 B 6/12/2015 F 155 70
6 B 6/19/2015 M 167 46
7 B 6/19/2015 F 121 46
I can get the weight diffs by using this:
def diffw(df):
    return(np.diff(df.Weight)*-1)
gb = ['Date', 'City']
gb=df.groupby(gb).apply(diffw)
gb
Date City
6/12/2015 A [65]
B [70]
6/19/2015 B [46]
7/12/2015 A [105]
dtype: object
I am just at a loss on how to get the wt_diffs back to the original df on each row.
Many thanks for any help . . .
John
You can use GroupBy.transform:
>>> f = df.groupby(['City', 'Date'])['Weight'].transform
>>> df['Wt_Diff'] = f('max') - f('min')
>>> df
City Date Sex Weight Wt_Diff
0 A 6/12/2015 M 185 65
1 A 6/12/2015 F 120 65
2 A 7/12/2015 M 210 105
3 A 7/12/2015 F 105 105
4 B 6/12/2015 M 225 70
5 B 6/12/2015 F 155 70
6 B 6/19/2015 M 167 46
7 B 6/19/2015 F 121 46
Edit: if max - min does not work, the easiest thing would be to add signed weight column first:
>>> df['+/-Weight'] = df['Weight'].where(df['Sex'] == 'M', -df['Weight'])
>>> df['Wt_Diff'] = df.groupby(['City', 'Date'])['+/-Weight'].transform('sum')
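The signed-weight variant matters when the female weight can exceed the male one, since max - min is never negative. A self-contained sketch follows; the city B weights are hypothetical values chosen to show a negative difference, and the signed weights are kept in a separate Series rather than added as a column.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["A", "A", "B", "B"],
    "Date": ["6/12/2015", "6/12/2015", "6/12/2015", "6/12/2015"],
    "Sex": ["M", "F", "M", "F"],
    "Weight": [185, 120, 100, 120],   # city B values are hypothetical
})

# Keep male weights as-is and negate female weights.
signed = df["Weight"].where(df["Sex"] == "M", -df["Weight"])

# Summing the signed weights per group gives male minus female on every row.
df["Wt_Diff"] = signed.groupby([df["City"], df["Date"]]).transform("sum")
print(df)
```

This assumes exactly one male and one female row per city/date group; with more rows the sum becomes the difference of the group totals.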
