I want to find the indexes where a new range of 100 values begins.
In the case below, since the first value is 0, the next index is that of the first value above 100 (index 7).
At index 7 the value is 104, so the next index is that of the first value above 204 (index 15).
At index 15 the value is 205, so the next index would be that of the first value above 305 (n/a).
Therefore the output would be [0, 7, 15].
0 0
1 0
2 4
3 10
4 30
5 65
6 92
7 104
8 108
9 109
10 123
11 132
12 153
13 160
14 190
15 205
16 207
17 210
18 240
19 254
20 254
21 254
22 263
23 273
24 280
25 293
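For reference, the stated rule can also be implemented directly with a short loop (a minimal sketch, assuming the values above sit in column b of a dataframe df, as in the first answer below):
starts = [0]
threshold = df['b'].iloc[0] + 100
for i, v in enumerate(df['b']):
    if v > threshold:
        # a new range begins at the first value above the previous start + 100
        starts.append(i)
        threshold = v + 100
print(starts)  # [0, 7, 15]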
You can use zfill to pad the numbers to three digits and take the first (hundreds) digit as a group key; note this assumes the values stay below 1000:
# zero-pad to three characters and keep the hundreds digit
df['grp'] = df['b'].astype(str).str.zfill(3).str[0]
print(df)
a b grp
0 0 0 0
1 1 0 0
2 2 4 0
3 3 10 0
4 4 30 0
5 5 65 0
6 6 92 0
7 7 104 1
8 8 108 1
9 9 109 1
10 10 123 1
11 11 132 1
12 12 153 1
13 13 160 1
14 14 190 1
15 15 205 2
# get first row from each group
ix = df.groupby('grp').first()['a'].to_numpy()
print(ix)
array([ 0, 7, 15])
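Equivalently (my note, not part of the original answer), the first digit of a three-digit zero-filled number is just its hundreds digit, so integer division avoids the string round-trip and keeps working for values of 1000 and above:
df['grp'] = df['b'] // 100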
For sorted data, we can use searchsorted -
In [98]: df.head()
Out[98]:
A
0 0
1 0
2 4
3 10
4 30
In [143]: df.A.searchsorted(np.arange(0,df.A.iloc[-1],100))
Out[143]: array([ 0, 7, 15])
If you need positions in terms of the dataframe/series index, index df.index with the result -
In [101]: df.index[_]
Out[101]: Int64Index([0, 7, 15], dtype='int64')
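One caveat (my note): np.arange excludes its stop value, so if the last value in A landed exactly on a multiple of 100, the final boundary would be dropped. Extending the stop by one guards against that:
df.A.searchsorted(np.arange(0, df.A.iloc[-1] + 1, 100))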
I have a dataframe like
pd.DataFrame({'i': [ 3, 4, 12, 25, 44, 45, 52, 53, 65, 66]
, 't': range(1,11)
, 'v': range(0,100)[::10]}
)
i.e.
i t v
0 3 1 0
1 4 2 10
2 12 3 20
3 25 4 30
4 44 5 40
5 45 6 50
6 52 7 60
7 53 8 70
8 65 9 80
9 66 10 90
I would like to sum the value in column v with the value in the next row whenever i increases by 1, and otherwise do nothing.
One can assume that at most two consecutive rows need to be summed; the last row might thus be ambiguous, depending on whether it is summed or not.
The resulting dataframe should look like:
i t v
0 3 1 10
2 12 3 20
3 25 4 30
4 44 5 90
6 52 7 130
8 65 9 170
Obviously I could loop over the dataframe using .iterrows(), but there must be a smarter solution.
I tried various combinations of shift, diff and groupby, though I cannot see a way to do it...
A common technique is to identify the blocks with cumsum on diff:
blocks = df['i'].diff().ne(1).cumsum()
df.groupby(blocks, as_index=False).agg({'i':'first','t':'first', 'v':'sum'})
Output:
i t v
0 3 1 10
1 12 3 20
2 25 4 30
3 44 5 90
4 52 7 130
5 65 9 170
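To see why this works, inspect the intermediate blocks series: a row whose i is exactly 1 more than the previous row's gets the same label, so each label marks one run of consecutive values to aggregate:
print(blocks.tolist())
# [1, 1, 2, 3, 4, 4, 5, 5, 6, 6]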
Let us try
out = df.groupby(df['i'].diff().ne(1).cumsum()).agg({'i':'first','t':'first','v':'sum'})
Out[11]:
i t v
i
1 3 1 10
2 12 3 20
3 25 4 30
4 44 5 90
5 52 7 130
6 65 9 170
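Here the block labels end up as the index; if you prefer a flat 0..n index as in the previous answer, append a reset:
out = out.reset_index(drop=True)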
I have a dataframe df1:
Time Delta_time
0 0 NaN
1 15 15
2 18 3
3 30 12
4 45 15
5 64 19
6 80 16
7 82 2
8 100 18
9 120 20
where Delta_time is the difference between adjacent values in the Time column. I have another dataframe df2 that has time values numbering from 0 to 120 (121 rows) and another column called 'Short_gap'.
How do I set the value of Short_gap to 1 for all Time values that fall inside a gap whose Delta_time is smaller than 5? For example, the Short_gap column should have the value 1 for Time = 15, 16, 17, 18, since Delta_time = 3 < 5.
Edit: Currently, df2 looks like this.
Time Short_gap
0 0 0
1 1 0
2 2 0
3 3 0
... ... ...
118 118 0
119 119 0
120 120 0
The expected output for df2 is
Time Short_gap
0 0 0
1 1 0
2 2 0
... ... ...
13 13 0
14 14 0
15 15 1
16 16 1
17 17 1
18 18 1
19 19 0
20 20 0
... ... ...
78 78 0
79 79 0
80 80 1
81 81 1
82 82 1
83 83 0
84 84 0
... ... ...
119 119 0
120 120 0
Try:
# flag short gaps, then repeat each flag once per covered time step (repeat counts must be integers)
t = df1['Delta_time'].shift(-1)
df2 = ((t < 5).repeat(t.fillna(1).astype(int)).astype(int).reset_index(drop=True)
.to_frame(name='Short_gap').rename_axis('Time').reset_index())
print(df2.head(20))
print('...')
print(df2.loc[78:84])
Output:
Time Short_gap
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 0
9 9 0
10 10 0
11 11 0
12 12 0
13 13 0
14 14 0
15 15 1
16 16 1
17 17 1
18 18 0
19 19 0
...
Time Short_gap
78 78 0
79 79 0
80 80 1
81 81 1
82 82 0
83 83 0
84 84 0
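For comparison (my sketch, not part of the original answer), the same logic can be written as an explicit loop over consecutive time pairs in df1; like the repeat-based solution, it marks each short gap from its start up to, but not including, its end:
df2['Short_gap'] = 0
for start, end, delta in zip(df1['Time'], df1['Time'].shift(-1), df1['Delta_time'].shift(-1)):
    if delta < 5:  # NaN at the last row compares False, so it is skipped
        df2.loc[(df2['Time'] >= start) & (df2['Time'] < end), 'Short_gap'] = 1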
id numbers
1 {'105': 1, '65': 11, '75': 0, '85': 51, '95': 0}
2 {'105': 1, '65': 11, '75': 0, '85': 50, '95': 0}
3 {'105': 1, '65': 11, '75': 0, '85': 51, '95': 0}
4 {}
5 {}
6 {}
7 {'75 cm': 7, '85 cm': 52, '95 cm': 10}
8 {'75 cm': 51, '85 cm': 114, '95 cm': 10}
9 {'75 cm': 9, '85 cm': 60, '95 cm': 10}
This is the current table.
I know how to turn the dict into columns and rows (keys as columns, values as rows), but what I am looking for is for each key and value to be rows with their own column headers:
test = pd.concat([df.drop(['numbers'], axis=1).sort_values(['id']),
df['numbers'].apply(pd.Series)], axis=1)
test2 = test.melt(id_vars=['id'],
var_name="name",
value_name="nameN").fillna(0)
I'm trying to get each key and value in the dictionary to be rows:
id name nameN
1 105 1
1 65 11
1 75 0
1 85 51
1 95 0
You should use comprehensions to build the data for a new DataFrame. If you can just drop the ids where numbers is an empty dictionary, you can do:
test = pd.DataFrame([[x['id'], k, v] for _, x in df.iterrows()
for k,v in x['numbers'].items()], columns=['id', 'name', 'nameN'])
to get:
id name nameN
0 1 105 1
1 1 65 11
2 1 75 0
3 1 85 51
4 1 95 0
5 2 105 1
6 2 65 11
7 2 75 0
8 2 85 50
9 2 95 0
10 3 105 1
11 3 65 11
12 3 75 0
13 3 85 51
14 3 95 0
15 7 75 cm 7
16 7 85 cm 52
17 7 95 cm 10
18 8 75 cm 51
19 8 85 cm 114
20 8 95 cm 10
21 9 75 cm 9
22 9 85 cm 60
23 9 95 cm 10
If you want a line with a specific value when numbers is empty:
test2 = pd.DataFrame([i for lst in [[[x['id'], '', '']] if x['numbers'] == {}
else [[x['id'], k, v] for k,v in x['numbers'].items()]
for _, x in df.iterrows()] for i in lst],
columns=['id', 'name', 'nameN']).sort_values('id').reset_index(drop=True)
giving:
id name nameN
0 1 105 1
1 1 65 11
2 1 75 0
3 1 85 51
4 1 95 0
5 2 105 1
6 2 65 11
7 2 75 0
8 2 85 50
9 2 95 0
10 3 95 0
11 3 75 0
12 3 85 51
13 3 105 1
14 3 65 11
15 4
16 5
17 6
18 7 75 cm 7
19 7 85 cm 52
20 7 95 cm 10
21 8 75 cm 51
22 8 85 cm 114
23 8 95 cm 10
24 9 85 cm 60
25 9 75 cm 9
26 9 95 cm 10
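An alternative sketch (my note, assuming pandas 0.25+ for Series.explode) that likewise drops the empty-dict ids:
s = df.set_index('id')['numbers'].apply(lambda d: list(d.items()))
out = s.explode().dropna()  # empty dicts become NaN and are removed
out = pd.DataFrame(out.tolist(), index=out.index,
                   columns=['name', 'nameN']).reset_index()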
I have some data in a dataframe df, which looks as shown below.
I am trying to first calculate the mean angle for each group using the function mean_angle. The calculated mean angle is then used to do another per-group calculation using the function fun.
import pandas as pd
import numpy as np
# generate sample data
a = np.array([1,2,3,4]).repeat(4)
x1 = 90 + np.random.randint(-15, 15, size=a.size//2 - 2 )
x2 = 270 + np.random.randint(-50, 50, size=a.size//2 + 2 )
b = np.concatenate((x1, x2))
np.random.shuffle(b)
df = pd.DataFrame({'a':a, 'b':b})
The returned dataframe is printed below.
a b
0 1 295
1 1 78
2 1 280
3 1 94
4 2 308
5 2 227
6 2 96
7 2 299
8 3 248
9 3 288
10 3 81
11 3 78
12 4 103
13 4 265
14 4 309
15 4 229
My functions mean_angle and fun are:
def mean_angle(deg):
    deg = np.deg2rad(deg)
    deg = deg[~np.isnan(deg)]
    S = np.sum(np.sin(deg))
    C = np.sum(np.cos(deg))
    mu = np.arctan2(S, C)
    mu = np.rad2deg(mu)
    if mu < 0:
        mu = 360 + mu
    return mu

def fun(x, mu):
    return np.where(abs(mu - x) < 45, x, np.where(x + 180 < 360, x + 180, x - 180))
What I have tried:
mu = df.groupby(['a'])['b'].apply(mean_angle)
df2 = df.groupby(['a'])['b'].apply(fun, args=(mu,))  # this function should be element-wise
I know it is totally wrong but I could not come up with a better way.
The desired output is something like this, where mu is the mean_angle of each group:
a b c
0 1 295 np.where(abs(mu - 295) < 45, 295, np.where(295 +180<360, 295 +180, 295 -180))
1 1 78 np.where(abs(mu - 78) < 45, 78, np.where(78 +180<360, 78 +180, 78 -180))
2 1 280 np.where(abs(mu - 280) < 45, 280, np.where(280 +180<360, 280 +180, 280 -180))
3 1 94 ...
4 2 308 ...
5 2 227 .
6 2 96 .
7 2 299 .
8 3 248 .
9 3 288 .
10 3 81 .
11 3 78 .
12 4 103 .
13 4 265 .
14 4 309 .
15 4 229 .
Any help is appreciated
You don't need your second function; just pass the necessary columns to np.where(). Creating your dataframe in the same manner and not modifying your mean_angle function, we have the following sample dataframe:
a b
0 1 228
1 1 291
2 1 84
3 1 226
4 2 266
5 2 311
6 2 82
7 2 274
8 3 79
9 3 250
10 3 222
11 3 88
12 4 80
13 4 291
14 4 100
15 4 293
Then create your c column (containing your mu values) using groupby() and transform(), and finally apply your np.where() logic:
df['c'] = df.groupby(['a'])['b'].transform(mean_angle)
df['c'] = np.where(abs(df['c'] - df['b']) < 45, df['b'], np.where(df['b']+180<360, df['b']+180, df['b']-180))
Yields:
a b c
0 1 228 228
1 1 291 111
2 1 84 264
3 1 226 226
4 2 266 266
5 2 311 311
6 2 82 262
7 2 274 274
8 3 79 259
9 3 250 70
10 3 222 42
11 3 88 268
12 4 80 260
13 4 291 111
14 4 100 280
15 4 293 113
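Note that neither snippet seeds NumPy's random generator, which is why the sample b values differ between the question and this answer. For reproducible runs, fix the seed before drawing x1 and x2:
np.random.seed(0)  # any fixed seed makes the sampled angles repeatable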
I want to expand my dataframe by duplicating rows at regular intervals.
import pandas as pd
import numpy as np
def expandData(data, timeStep=2, sampleLen=5):
    dataEp = pd.DataFrame()
    for epoch in range(int(len(data)/sampleLen)):
        dataSample = data.iloc[epoch*sampleLen:(epoch+1)*sampleLen, :]
        for num in range(int(sampleLen - timeStep + 1)):
            tempDf = dataSample.iloc[num:timeStep+num, :]
            dataEp = pd.concat([dataEp, tempDf], axis=0)
    return dataEp
df = pd.DataFrame({'a':list(np.arange(5))+list(np.arange(15,20)),
'other':list(np.arange(100,110))})
dfEp = expandData(df, 3, 5)
Output:
df
a other
0 0 100
1 1 101
2 2 102
3 3 103
4 4 104
5 15 105
6 16 106
7 17 107
8 18 108
9 19 109
dfEp
a other
0 0 100
1 1 101
2 2 102
1 1 101
2 2 102
3 3 103
2 2 102
3 3 103
4 4 104
5 15 105
6 16 106
7 17 107
6 16 106
7 17 107
8 18 108
7 17 107
8 18 108
9 19 109
Expected:
I expect a better way of achieving this with good performance; if the dataframe has a large number of rows, such as 40 thousand, my code runs for about 20 minutes.
Edit:
Actually, I want to repeat a small sequence of length timeStep, and I have changed expandData(df, 2, 5) to expandData(df, 3, 5).
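Since performance is the concern, here is a vectorized sketch (my addition, with a hypothetical helper expand_fast) that reproduces the windowed dfEp above using pure index arithmetic, assuming the row count is a multiple of sampleLen:
import numpy as np

def expand_fast(data, timeStep=3, sampleLen=5):
    # window start offsets within one sample, one row per window: shape (wins, timeStep)
    wins = sampleLen - timeStep + 1
    win = np.arange(wins)[:, None] + np.arange(timeStep)[None, :]
    # shift the window grid by each sample's base offset, then flatten in order
    nEpochs = len(data) // sampleLen
    idx = (np.arange(nEpochs)[:, None, None] * sampleLen + win[None]).ravel()
    return data.iloc[idx]

dfEp = expand_fast(df, 3, 5)  # same rows, in the same order, as expandData(df, 3, 5)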
If your a values are evenly spaced, you can test for breaks in the series and then replicate the rows that are within each consecutive series according to this answer:
df = pd.DataFrame({'a':list(np.arange(5))+list(np.arange(15,20)),
'other':list(np.arange(100,110))})
# equally spaced rows get value zero; start/stop rows do not
df["start/stop"] = df.a.diff().shift(-1) - df.a.diff()
# repeat rows with value zero in the new column
repeat = [2 if val == 0 else 1 for val in df["start/stop"]]
df = df.loc[np.repeat(df.index.values, repeat)]
print(df)
Sample output:
a other start/stop
0 0 100 NaN
1 1 101 0.0
1 1 101 0.0
2 2 102 0.0
2 2 102 0.0
3 3 103 0.0
3 3 103 0.0
4 4 104 10.0
5 15 105 -10.0
6 16 106 0.0
6 16 106 0.0
7 17 107 0.0
7 17 107 0.0
8 18 108 0.0
8 18 108 0.0
9 19 109 NaN
If it is just about the epoch length (you do not clearly specify the rules), then it is even simpler:
df = pd.DataFrame({'a':list(np.arange(5))+list(np.arange(15,20)),
'other':list(np.arange(100,110))})
sampleLen = 5
repeat = np.repeat([2], sampleLen)
repeat[0] = repeat[-1] = 1
repeat = np.tile(repeat, len(df)//sampleLen)
df = df.loc[np.repeat(df.index.values, repeat)]
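Running print(df) afterwards shows every interior row of each 5-row block duplicated:
    a  other
0   0    100
1   1    101
1   1    101
2   2    102
2   2    102
3   3    103
3   3    103
4   4    104
5  15    105
6  16    106
6  16    106
7  17    107
7  17    107
8  18    108
8  18    108
9  19    109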