My code:
import pandas as pd
import numpy as np
series = pd.read_csv('o1.csv', header=0)
s1 = series
s2 = series
s1['userID'] = series['userID'] + 5
s1['adID'] = series['adID'] + 3
s2['userID'] = s1['userID'] + 5
s2['adID'] = series['adID'] + 4
r1 = series.append(s1)
r2 = r1.append(s2)
print(r2)
I got something wrong; now the columns of series, s1, and s2 are exactly the same.
Output:
userID gender adID rating
0 11 m 107 50
1 11 m 108 100
2 11 m 109 0
3 12 f 107 50
4 12 f 108 100
5 13 m 109 62
6 13 m 114 28
7 13 m 108 36
8 12 f 109 74
9 12 f 114 100
10 14 m 108 62
11 14 m 109 28
12 15 f 116 50
13 15 f 117 100
0 11 m 107 50
1 11 m 108 100
2 11 m 109 0
I didn't want my series columns to be changed.
Why did it happen?
How can I fix this?
Do I need to use iloc?
IIUC, you need copy if you want a new DataFrame object:
s1 = series.copy()
s2 = series.copy()
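The reason: s1 = series only binds another name to the same object, so writing through s1 writes through series too. A quick check (a minimal sketch with made-up data):
import pandas as pd

series = pd.DataFrame({'userID': [11, 12], 'adID': [107, 108]})
alias = series              # plain assignment: no data is copied
print(alias is series)      # True - both names point at the same object
alias['userID'] += 5        # ...so this also changes series
print(series['userID'].tolist())  # [16, 17]
print(series.copy() is series)    # False - copy() makes an independent object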
Sample:
print(df)
userID gender adID rating
0 11 m 107 50
1 11 m 108 100
2 11 m 109 0
s1 = df.copy()
s2 = df.copy()
s1['userID'] = df['userID'] + 5
s1['adID'] = df['adID'] + 3
s2['userID'] = s1['userID'] + 5
s2['adID'] = df['adID'] + 4
r1 = df.append(s1)
r2 = r1.append(s2)
print(r2)
userID gender adID rating
0 11 m 107 50
1 11 m 108 100
2 11 m 109 0
0 16 m 110 50
1 16 m 111 100
2 16 m 112 0
0 21 m 111 50
1 21 m 112 100
2 21 m 113 0
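Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on recent versions the same concatenation can be written with pd.concat (a minimal sketch using the df, s1, and s2 from the sample above):
r2 = pd.concat([df, s1, s2])
print(r2)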
I want to multiply two columns (A*B) in a DataFrame whose columns are a pd.MultiIndex.
I want to perform this multiplication for each DataX (Data1, Data2, ...) group in columns level=0.
df = pd.DataFrame(data=np.arange(32).reshape(8, 4),
                  columns=pd.MultiIndex.from_product(iterables=[["Data1", "Data2"], ["A", "B"]]))
Data1 Data2
A B A B
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
5 20 21 22 23
6 24 25 26 27
7 28 29 30 31
The result of multiplication should be also a DataFrame with columns=pd.MultiIndex (see below).
Data1 Data2 Data1 Data2
A B A B A*B A*B
0 0 1 2 3 0 6
1 4 5 6 7 20 42
2 8 9 10 11 72 110
3 12 13 14 15 156 210
4 16 17 18 19 272 342
5 20 21 22 23 420 506
6 24 25 26 27 600 702
7 28 29 30 31 812 930
I managed to perform this multiplication by iterating over the columns at level=0, but I am looking for a better way to do it.
for _ in df.columns.get_level_values(level=0).unique().tolist()[:]:
    df[(_, "A*B")] = df[(_, "A")] * df[(_, "B")]
Any suggestions or hints much appreciated!
Thanks
Here is another alternative, using df.prod and df.join:
u = df.prod(axis=1, level=0)
u.columns = pd.MultiIndex.from_product((u.columns, ['*'.join(df.columns.levels[1])]))
out = df.join(u)
Data1 Data2 Data1 Data2
A B A B A*B A*B
0 0 1 2 3 0 6
1 4 5 6 7 20 42
2 8 9 10 11 72 110
3 12 13 14 15 156 210
4 16 17 18 19 272 342
5 20 21 22 23 420 506
6 24 25 26 27 600 702
7 28 29 30 31 812 930
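On newer pandas versions, where the level argument of DataFrame.prod has been removed, one way to get the same per-group product is to transpose, group on the first index level, and transpose back (a sketch under that assumption, using the same df):
# per-group (level=0) row-wise product without prod(level=...)
u = df.T.groupby(level=0).prod().T
u.columns = pd.MultiIndex.from_product((u.columns, ['A*B']))
out = df.join(u)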
Slice out the 'A' and 'B' along the second (inner) level of the columns MultiIndex. Then you can multiply, which will align on the 0th level ('Data1', 'Data2'). We'll then re-create the MultiIndex on the columns and join back:
df1 = df.xs('A', axis=1, level=1).multiply(df.xs('B', axis=1, level=1))
df1.columns = pd.MultiIndex.from_product([df1.columns, ['A*B']])
df = pd.concat([df, df1], axis=1)
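If you prefer the A*B columns interleaved next to their A and B columns rather than appended at the end, a follow-up sort on the column axis does it:
df = df.sort_index(axis=1)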
Here are some timings assuming you have 2 groups (Data1, Data2) and your DataFrame just gets longer. Turns out, the simple loop might be the fastest of them all. (I added some sorting and needed to copy them all so the output is the same).
import perfplot
import pandas as pd
import numpy as np
# Tom
def simple_loop(df):
    for _ in df.columns.get_level_values(level=0).unique().tolist()[:]:
        df[(_, "A*B")] = df[(_, "A")] * df[(_, "B")]
    return df.sort_index(axis=1)

# Roy2012
def mul_with_stack(df):
    df = df.stack(level=0)
    df["A*B"] = df.A * df.B
    return df.stack().swaplevel().unstack(level=[2, 1]).sort_index(axis=1)

# Alollz
def xs_concat(df):
    df1 = df.xs('A', axis=1, level=1).multiply(df.xs('B', axis=1, level=1))
    df1.columns = pd.MultiIndex.from_product([df1.columns, ['A*B']])
    return pd.concat([df, df1], axis=1).sort_index(axis=1)

# anky
def prod_join(df):
    u = df.prod(axis=1, level=0)
    u.columns = pd.MultiIndex.from_product((u.columns, ['*'.join(df.columns.levels[1])]))
    return df.join(u).sort_index(axis=1)

perfplot.show(
    setup=lambda n: pd.DataFrame(data=np.arange(4 * n).reshape(n, 4),
                                 columns=pd.MultiIndex.from_product(iterables=[["Data1", "Data2"], ["A", "B"]])),
    kernels=[
        lambda df: simple_loop(df.copy()),
        lambda df: mul_with_stack(df.copy()),
        lambda df: xs_concat(df.copy()),
        lambda df: prod_join(df.copy())
    ],
    labels=['simple_loop', 'stack_and_multiply', 'xs_concat', 'prod_join'],
    n_range=[2 ** k for k in range(3, 20)],
    equality_check=np.allclose,
    xlabel="len(df)"
)
Here's a way to do it with stack and unstack. The advantage: fully vectorized, no loops, no join operations.
t = df.stack(level=0)
t["A*B"] = t.A * t.B
t = t.stack().swaplevel().unstack(level=[2,1])
The output is:
Data1 Data2
A B A*B A B A*B
0 0 1 0 2 3 6
1 4 5 20 6 7 42
2 8 9 72 10 11 110
3 12 13 156 14 15 210
4 16 17 272 18 19 342
Another alternative here, using prod:
df[("Data1", "A*B")] = df.loc(axis=1)["Data1"].prod(axis=1)
df[("Data2", "A*B")] = df.loc(axis=1)["Data2"].prod(axis=1)
df
Data1 Data2 Data1 Data2
A B A B A*B A*B
0 0 1 2 3 0 6
1 4 5 6 7 20 42
2 8 9 10 11 72 110
3 12 13 14 15 156 210
4 16 17 18 19 272 342
5 20 21 22 23 420 506
6 24 25 26 27 600 702
7 28 29 30 31 812 930
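If there are more DataX groups than the two hard-coded above, the same prod call generalizes with a loop over the unique level-0 labels (a sketch):
# one A*B column per level-0 group
for grp in df.columns.get_level_values(0).unique():
    df[(grp, "A*B")] = df.loc(axis=1)[grp].prod(axis=1)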
I have a df as shown below
df:
Id gender age salary
1 m 27 100
2 m 26 100000
3 m 57 180
4 f 27 150
5 m 57 200
6 f 29 100
7 m 47 130
8 f 27 140
9 m 37 100
10 f 43 2000
From the above, I would like to replace any value greater than the 80th-percentile value with the 80th-percentile value.
Expected output:
Id gender age salary
1 m 27 100
2 m 26 560
3 m 57 180
4 f 27 150
5 m 57 200
6 f 29 100
7 m 47 130
8 f 27 140
9 m 37 100
10 f 43 560
Let's try:
quantiles = df.salary.quantile(0.8)
df.loc[df.salary > quantiles, 'salary'] = quantiles
Output (I can't quite get 200 as the 0.8 quantile, though):
Id gender age salary
0 1 m 27 100.0
1 2 m 26 560.0
2 3 m 57 180.0
3 4 f 27 150.0
4 5 m 57 200.0
5 6 f 29 100.0
6 7 m 47 130.0
7 8 f 27 140.0
8 9 m 37 100.0
9 10 f 43 560.0
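An equivalent one-liner, assuming the same df, is Series.clip with an upper bound:
# cap salaries at the 80th-percentile value
df['salary'] = df['salary'].clip(upper=df['salary'].quantile(0.8))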
In case you want to fill within gender:
quantiles = df.groupby('gender')['salary'].transform('quantile', q=0.8)
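and then cap with the same masking as before; the .loc assignment aligns the per-row quantiles on the index (df['salary'].clip(upper=quantiles) would work as well):
df.loc[df.salary > quantiles, 'salary'] = quantiles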
Output:
Id gender age salary
0 1 m 27 100
1 2 m 26 200
2 3 m 57 180
3 4 f 27 150
4 5 m 57 200
5 6 f 29 100
6 7 m 47 130
7 8 f 27 140
8 9 m 37 100
9 10 f 43 890
I have a dataset in the following format. It has 48 columns and about 200000 rows.
slot1,slot2,slot3,slot4,slot5,slot6...,slot45,slot46,slot47,slot48
1,2,3,4,5,6,7,......,45,46,47,48
3.5,5.2,2,5.6,...............
I want to reshape this dataset to something like the below, where N is less than 48 (maybe 24 or 12, etc.); the column headers don't matter.
when N = 4
slotNew1,slotNew2,slotNew3,slotNew4
1,2,3,4
5,6,7,8
......
45,46,47,48
3.5,5.2,2,5.6
............
I could read row by row, split each row, and append to a new DataFrame, but that is very inefficient. Is there a more efficient, faster way to do this?
You may try this:
N = 4
df_new = pd.DataFrame(df_original.values.reshape(-1, N))
df_new.columns = ['slotNew{:}'.format(i + 1) for i in range(N)]
The code extracts the data into a numpy.ndarray, reshapes it, and creates a new DataFrame of the desired dimension. Note that reshape requires the total number of values to be a multiple of N, which holds here because N divides 48.
Example:
import numpy as np
import pandas as pd
df0 = pd.DataFrame(np.arange(48 * 3).reshape(-1, 48))
df0.columns = ['slot{:}'.format(i + 1) for i in range(48)]
print(df0)
# slot1 slot2 slot3 slot4 ... slot45 slot46 slot47 slot48
# 0 0 1 2 3 ... 44 45 46 47
# 1 48 49 50 51 ... 92 93 94 95
# 2 96 97 98 99 ... 140 141 142 143
#
# [3 rows x 48 columns]
N = 4
df = pd.DataFrame(df0.values.reshape(-1, N))
df.columns = ['slotNew{:}'.format(i + 1) for i in range(N)]
print(df.head())
# slotNew1 slotNew2 slotNew3 slotNew4
# 0 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
# 3 12 13 14 15
# 4 16 17 18 19
Another approach
N = 4
df1 = df0.stack().reset_index()
df1['i'] = df1['level_1'].str.replace('slot', '').astype(int) // N
df1['j'] = df1['level_1'].str.replace('slot', '').astype(int) % N
df1['i'] -= (df1['j'] == 0) - df1['level_0'] * 48 / N
df1['j'] += (df1['j'] == 0) * N
df1['j'] = 'slotNew' + df1['j'].astype(str)
df1 = df1[['i', 'j', 0]]
df = df1.pivot(index='i', columns='j', values=0)
Use pandas.explode after making chunks. Given df:
import numpy as np
import pandas as pd
df = pd.DataFrame([np.arange(1, 49)], columns=['slot%s' % i for i in range(1, 49)])
print(df)
slot1 slot2 slot3 slot4 slot5 slot6 slot7 slot8 slot9 slot10 ... \
0 1 2 3 4 5 6 7 8 9 10 ...
slot39 slot40 slot41 slot42 slot43 slot44 slot45 slot46 slot47 \
0 39 40 41 42 43 44 45 46 47
slot48
0 48
Using chunks to divide:
def chunks(l, n):
    """Yield successive n-sized chunks from l, padding the last chunk with NaN.
    Source: https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
    """
    n_items = len(l)
    if n_items % n:
        n_pads = n - n_items % n
    else:
        n_pads = 0
    l = l + [np.nan for _ in range(n_pads)]
    for i in range(0, len(l), n):
        yield l[i:i + n]
N = 4
new_df = pd.DataFrame(list(df.apply(lambda x: list(chunks(list(x), N)), 1).explode()))
print(new_df)
Output:
0 1 2 3
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
4 17 18 19 20
...
An advantage of this approach over numpy.reshape is that it handles the case where N is not a factor of the number of columns; the missing entries are padded with NaN (which also upcasts those columns to float):
N = 7
new_df = pd.DataFrame(list(df.apply(lambda x: list(chunks(list(x), N)), 1).explode()))
print(new_df)
Output:
0 1 2 3 4 5 6
0 1 2 3 4 5 6 7.0
1 8 9 10 11 12 13 14.0
2 15 16 17 18 19 20 21.0
3 22 23 24 25 26 27 28.0
4 29 30 31 32 33 34 35.0
5 36 37 38 39 40 41 42.0
6 43 44 45 46 47 48 NaN
I have some data that looks like the df shown below.
I am trying to first calculate the mean angle for each group using the function mean_angle. The calculated mean angle is then used for another per-group calculation using the function fun.
import pandas as pd
import numpy as np
# generate sample data
a = np.array([1,2,3,4]).repeat(4)
x1 = 90 + np.random.randint(-15, 15, size=a.size//2 - 2 )
x2 = 270 + np.random.randint(-50, 50, size=a.size//2 + 2 )
b = np.concatenate((x1, x2))
np.random.shuffle(b)
df = pd.DataFrame({'a':a, 'b':b})
The returned dataframe is printed below.
a b
0 1 295
1 1 78
2 1 280
3 1 94
4 2 308
5 2 227
6 2 96
7 2 299
8 3 248
9 3 288
10 3 81
11 3 78
12 4 103
13 4 265
14 4 309
15 4 229
My functions are mean_angle and fun:
def mean_angle(deg):
    deg = np.deg2rad(deg)
    deg = deg[~np.isnan(deg)]
    S = np.sum(np.sin(deg))
    C = np.sum(np.cos(deg))
    mu = np.arctan2(S, C)
    mu = np.rad2deg(mu)
    if mu < 0:
        mu = 360 + mu
    return mu

def fun(x, mu):
    return np.where(abs(mu - x) < 45, x, np.where(x + 180 < 360, x + 180, x - 180))
What I have tried:
mu = df.groupby(['a'])['b'].apply(mean_angle)
df2 = df.groupby(['a'])['b'].apply(fun, args=(mu,))  # this function should be element-wise
I know it is totally wrong but I could not come up with a better way.
The desired output is something like this, where mu is the mean_angle per group:
a b c
0 1 295 np.where(abs(mu - 295) < 45, 295, np.where(295 +180<360, 295 +180, 295 -180))
1 1 78 np.where(abs(mu - 78) < 45, 78, np.where(78 +180<360, 78 +180, 78 -180))
2 1 280 np.where(abs(mu - 280) < 45, 280, np.where(280 +180<360, 280 +180, 280 -180))
3 1 94 ...
4 2 308 ...
5 2 227 .
6 2 96 .
7 2 299 .
8 3 248 .
9 3 288 .
10 3 81 .
11 3 78 .
12 4 103 .
13 4 265 .
14 4 309 .
15 4 229 .
Any help is appreciated
You don't need your second function; just pass the necessary columns to np.where(). Creating your DataFrame in the same manner and leaving your mean_angle function unmodified, we have the following sample DataFrame:
a b
0 1 228
1 1 291
2 1 84
3 1 226
4 2 266
5 2 311
6 2 82
7 2 274
8 3 79
9 3 250
10 3 222
11 3 88
12 4 80
13 4 291
14 4 100
15 4 293
Then create your c column (containing your mu values) using groupby() and transform(), and finally apply your np.where() logic:
df['c'] = df.groupby(['a'])['b'].transform(mean_angle)
df['c'] = np.where(abs(df['c'] - df['b']) < 45, df['b'], np.where(df['b'] + 180 < 360, df['b'] + 180, df['b'] - 180))
Yields:
a b c
0 1 228 228
1 1 291 111
2 1 84 264
3 1 226 226
4 2 266 266
5 2 311 311
6 2 82 262
7 2 274 274
8 3 79 259
9 3 250 70
10 3 222 42
11 3 88 268
12 4 80 260
13 4 291 111
14 4 100 280
15 4 293 113
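Alternatively, if you want to keep the original fun, it is already element-wise thanks to np.where, so you can call it once on whole columns after broadcasting the per-group mean with transform (a sketch using the question's functions):
mu = df.groupby('a')['b'].transform(mean_angle)
df['c'] = fun(df['b'], mu)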
I want to expand my DataFrame by duplicating its rows in a regular pattern.
import pandas as pd
import numpy as np
def expandData(data, timeStep=2, sampleLen=5):
    dataEp = pd.DataFrame()
    for epoch in range(int(len(data) / sampleLen)):
        dataSample = data.iloc[epoch * sampleLen:(epoch + 1) * sampleLen, :]
        for num in range(int(sampleLen - timeStep + 1)):
            tempDf = dataSample.iloc[num:timeStep + num, :]
            dataEp = pd.concat([dataEp, tempDf], axis=0)
    return dataEp
df = pd.DataFrame({'a': list(np.arange(5)) + list(np.arange(15, 20)),
                   'other': list(np.arange(100, 110))})
dfEp = expandData(df, 3, 5)
Output:
df
a other
0 0 100
1 1 101
2 2 102
3 3 103
4 4 104
5 15 105
6 16 106
7 17 107
8 18 108
9 19 109
dfEp
a other
0 0 100
1 1 101
2 2 102
1 1 101
2 2 102
3 3 103
2 2 102
3 3 103
4 4 104
5 15 105
6 16 106
7 17 107
6 16 106
7 17 107
8 18 108
7 17 107
8 18 108
9 19 109
Expected:
I expect a better way of achieving this with good performance; if the DataFrame has many rows, such as 40 thousand, my code runs for about 20 minutes.
Edit:
Actually, I want to repeat a small sequence of size timeStep, and I have changed expandData(df, 2, 5) to expandData(df, 3, 5).
If your a values are evenly spaced, you can test for breaks in the series and then replicate the rows that are within each consecutive series according to this answer:
df = pd.DataFrame({'a': list(np.arange(5)) + list(np.arange(15, 20)),
                   'other': list(np.arange(100, 110))})
# equally spaced rows have value zero, start/stop rows do not
df["start/stop"] = df.a.diff().shift(-1) - df.a.diff()
# repeat rows with value zero in the new column
repeat = [2 if val == 0 else 1 for val in df["start/stop"]]
df = df.loc[np.repeat(df.index.values, repeat)]
print(df)
print(df)
Sample output:
a other start/stop
0 0 100 NaN
1 1 101 0.0
1 1 101 0.0
2 2 102 0.0
2 2 102 0.0
3 3 103 0.0
3 3 103 0.0
4 4 104 10.0
5 15 105 -10.0
6 16 106 0.0
6 16 106 0.0
7 17 107 0.0
7 17 107 0.0
8 18 108 0.0
8 18 108 0.0
9 19 109 NaN
If it is just about the epoch length (you do not specify the rules clearly), then it is even simpler:
df = pd.DataFrame({'a': list(np.arange(5)) + list(np.arange(15, 20)),
                   'other': list(np.arange(100, 110))})
sampleLen = 5
repeat = np.repeat([2], sampleLen)
repeat[0] = repeat[-1] = 1
repeat = np.tile(repeat, len(df) // sampleLen)
df = df.loc[np.repeat(df.index.values, repeat)]
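For the performance concern in the question, the windowing in expandData can also be expressed as pure index arithmetic and a single .iloc call, which avoids the repeated pd.concat entirely. A minimal sketch of an equivalent (expand_fast is a hypothetical name; it assumes len(data) is a multiple of sampleLen):
import numpy as np
import pandas as pd

def expand_fast(data, timeStep=3, sampleLen=5):
    # first row of every length-timeStep window inside every epoch
    nEpochs = len(data) // sampleLen
    starts = (np.arange(nEpochs)[:, None] * sampleLen
              + np.arange(sampleLen - timeStep + 1)).ravel()
    # expand every start into a full window and fetch all rows in one call
    idx = (starts[:, None] + np.arange(timeStep)).ravel()
    return data.iloc[idx]

df = pd.DataFrame({'a': list(np.arange(5)) + list(np.arange(15, 20)),
                   'other': list(np.arange(100, 110))})
dfEp = expand_fast(df, 3, 5)  # same rows as expandData(df, 3, 5)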