How to interpolate data and angles with pandas - Python

I have a simple dataframe df that contains three columns:
Time: expressed in seconds
A: a set of values that can vary from -inf to +inf
B: a set of angles (degrees), ranging between 0 and 359
Here is the dataframe:
df = pd.DataFrame({'Time':[0,12,23,25,44,50], 'A':[5,7,9,8,11,6], 'B':[300,358,4,10,2,350]})
And it looks like this:
Time A B
0 0 5 300
1 12 7 358
2 23 9 4
3 25 8 10
4 44 11 2
5 50 6 350
My idea is to interpolate the data from 0 to 50 seconds, and I was able to achieve that with the following lines of code:
y = pd.DataFrame({'Time':list(range(df['Time'].iloc[0], df['Time'].iloc[-1]))})
df = pd.merge(left=y, right=df, on='Time', how='left').interpolate()
Problem: even though column A is interpolated correctly, column B is wrong, because the interpolation does not account for the wrap-around of angles at 360 degrees! Here is an example:
Time A B
12 12 7.000000 358.000000
13 13 7.181818 325.818182
14 14 7.363636 293.636364
15 15 7.545455 261.454545
16 16 7.727273 229.272727
17 17 7.909091 197.090909
18 18 8.090909 164.909091
19 19 8.272727 132.727273
20 20 8.454545 100.545455
21 21 8.636364 68.363636
22 22 8.818182 36.181818
23 23 9.000000 4.000000
Question: can you suggest a smart and efficient way to solve this issue, so that the angles are interpolated correctly across the 0/360 degree boundary?

You should be able to use the method described in this question for the angle column:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Time':[0,12,23,25,44,50], 'A':[5,7,9,8,11,6], 'B':[300,358,4,10,2,350]})
# unwrap the angles: convert to radians, remove the 0/360 jumps, convert back
df['B'] = np.rad2deg(np.unwrap(np.deg2rad(df['B'])))
# build the full time index and linearly interpolate the gaps
y = pd.DataFrame({'Time':list(range(df['Time'].iloc[0], df['Time'].iloc[-1]))})
df = pd.merge(left=y, right=df, on='Time', how='left').interpolate()
# map the unwrapped angles back into [0, 360)
df['B'] %= 360
print(df)
Output:
Time A B
0 0 5.000000 300.000000
1 1 5.166667 304.833333
2 2 5.333333 309.666667
3 3 5.500000 314.500000
4 4 5.666667 319.333333
5 5 5.833333 324.166667
6 6 6.000000 329.000000
7 7 6.166667 333.833333
8 8 6.333333 338.666667
9 9 6.500000 343.500000
10 10 6.666667 348.333333
11 11 6.833333 353.166667
12 12 7.000000 358.000000
13 13 7.181818 358.545455
14 14 7.363636 359.090909
15 15 7.545455 359.636364
16 16 7.727273 0.181818
17 17 7.909091 0.727273
18 18 8.090909 1.272727
19 19 8.272727 1.818182
20 20 8.454545 2.363636
21 21 8.636364 2.909091
22 22 8.818182 3.454545
23 23 9.000000 4.000000
24 24 8.500000 7.000000
25 25 8.000000 10.000000
26 26 8.157895 9.578947
27 27 8.315789 9.157895
28 28 8.473684 8.736842
29 29 8.631579 8.315789
30 30 8.789474 7.894737
31 31 8.947368 7.473684
32 32 9.105263 7.052632
33 33 9.263158 6.631579
34 34 9.421053 6.210526
35 35 9.578947 5.789474
36 36 9.736842 5.368421
37 37 9.894737 4.947368
38 38 10.052632 4.526316
39 39 10.210526 4.105263
40 40 10.368421 3.684211
41 41 10.526316 3.263158
42 42 10.684211 2.842105
43 43 10.842105 2.421053
44 44 11.000000 2.000000
45 45 11.000000 2.000000
46 46 11.000000 2.000000
47 47 11.000000 2.000000
48 48 11.000000 2.000000
49 49 11.000000 2.000000
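As an alternative, you could sidestep unwrap entirely by interpolating the sine and cosine components and recovering the angle with arctan2. Here is a minimal sketch (note the intermediate values differ slightly from the unwrap approach, because component-wise interpolation does not advance at a constant angular rate):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Time':[0,12,23,25,44,50], 'A':[5,7,9,8,11,6], 'B':[300,358,4,10,2,350]})
# represent each angle by its sine and cosine so there is no wrap-around to handle
rad = np.deg2rad(df['B'])
df['B_sin'] = np.sin(rad)
df['B_cos'] = np.cos(rad)
y = pd.DataFrame({'Time':list(range(df['Time'].iloc[0], df['Time'].iloc[-1]))})
out = pd.merge(left=y, right=df, on='Time', how='left').interpolate()
# recover the angle and map it back into [0, 360)
out['B'] = np.rad2deg(np.arctan2(out['B_sin'], out['B_cos'])) % 360
out = out.drop(['B_sin', 'B_cos'], axis=1)
print(out)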

Related

Compute annual rate using a DataFrame and pct_change()

I have a column inside a DataFrame that I want to use in order to perform the operation:
step = 3
n = step / 12  # 0.25
t1 = step - 1
pd.DataFrame(100*((df[t1+step::step]['Column'].values / df[t1:-t1:step]['Column'].values)**(1/n) - 1))
A possible set of values for the column of interest could be:
>>> df['Column']
0 NaN
1 NaN
2 7469.5
3 NaN
4 NaN
5 7537.9
6 NaN
7 NaN
8 7655.2
9 NaN
10 NaN
11 7712.6
12 NaN
13 NaN
14 7784.1
15 NaN
16 NaN
17 7819.8
18 NaN
19 NaN
20 7898.6
21 NaN
22 NaN
23 7939.5
24 NaN
25 NaN
26 7995.0
27 NaN
28 NaN
29 8084.7
...
So df[t1+step::step]['Column'] would give us:
>>> df[5::3]['Column']
5 7537.9
8 7655.2
11 7712.6
14 7784.1
17 7819.8
20 7898.6
23 7939.5
26 7995.0
29 8084.7
32 8158.0
35 8292.7
38 8339.3
41 8449.5
44 8498.3
47 8610.9
50 8697.7
53 8766.1
56 8831.5
59 8850.2
62 8947.1
65 8981.7
68 8983.9
71 8907.4
74 8865.6
77 8934.4
80 8977.3
83 9016.4
86 9123.0
89 9223.5
92 9313.2
...
And lastly df[t1:-t1:step]['Column']:
>>> df[2:-2:3]['Column']
2 7469.5
5 7537.9
8 7655.2
11 7712.6
14 7784.1
17 7819.8
20 7898.6
23 7939.5
26 7995.0
29 8084.7
32 8158.0
35 8292.7
38 8339.3
41 8449.5
44 8498.3
47 8610.9
50 8697.7
53 8766.1
56 8831.5
59 8850.2
62 8947.1
65 8981.7
68 8983.9
71 8907.4
74 8865.6
77 8934.4
80 8977.3
83 9016.4
86 9123.0
89 9223.5
...
With these values what we expect is the following output:
>>> pd.DataFrame(100*((df[5::3]['Column'].values / df[2:-2:3]['Column'].values)**4 -1))
0 3.713517
1 6.371352
2 3.033171
3 3.760103
4 1.847168
5 4.092131
6 2.087397
7 2.825602
8 4.563898
9 3.676223
10 6.769944
11 2.266778
12 5.391516
13 2.330287
14 5.406150
15 4.093476
16 3.182961
17 3.017786
18 0.849662
19 4.452016
20 1.555866
21 0.098013
22 -3.362834
23 -1.863919
24 3.140454
25 1.934544
26 1.753587
27 4.813692
28 4.479794
29 3.947179
Since this reminds me a lot of what pct_change() does, I was wondering if I could achieve the same result by doing something like:
>>> df['Column'].pct_change(periods=step)**(1/n) * 100
So far I have been getting incorrect outputs, though. Is it possible to use pct_change() and achieve the same result?
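For what it's worth, here is a minimal sketch of how pct_change could reproduce the computation, assuming df is the frame above: dropping the NaN rows makes consecutive entries exactly one step apart, and since pct_change returns ratio - 1, adding 1 back before raising to 1/n recovers the annualized rate.
step = 3
n = step / 12                 # 0.25, so 1/n == 4
# keep only the observed values; they are one 'step' apart after dropna
s = df['Column'].dropna()
annual = 100 * ((1 + s.pct_change()) ** (1 / n) - 1)
print(annual.dropna().reset_index(drop=True))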

Choosing the larger probability from a specific indexID

I have a database as follows:
indexID matchID order userClean Probability
0 0 1 0 clean 35
1 0 2 1 clean 75
2 0 2 2 clean 25
5 3 4 5 clean 40
6 3 5 6 clean 85
9 4 5 9 clean 74
12 6 7 12 clean 23
13 6 8 13 clean 72
14 7 8 14 clean 85
15 9 10 15 clean 76
16 10 11 16 clean 91
19 13 14 19 clean 27
23 13 17 23 clean 10
28 13 18 28 clean 71
32 20 21 32 clean 97
33 20 22 33 clean 30
What I want to do is, for each repeated indexID, choose the entry with the higher probability, mark it as clean, and mark the other as dirty.
The output should look something like this:
indexID matchID order userClean Probability
0 0 1 0 dirty 35
1 0 2 1 clean 75
2 0 2 2 dirty 25
5 3 4 5 dirty 40
6 3 5 6 clean 85
9 4 5 9 clean 74
12 6 7 12 dirty 23
13 6 8 13 clean 72
14 7 8 14 clean 85
15 9 10 15 clean 76
16 10 11 16 clean 91
19 13 14 19 dirty 27
23 13 17 23 dirty 10
28 13 18 28 clean 71
32 20 21 32 clean 97
33 20 22 33 dirty 30
If you need a pandas solution, create a boolean mask by comparing the Probability column via Series.ne (!=) against the per-group maximum from transform; transform is used because a Series of the same size as df is needed:
# rows whose Probability is not their group's maximum
mask = df['Probability'].ne(df.groupby('indexID')['Probability'].transform('max'))
df.loc[mask, 'userClean'] = 'dirty'
print(df)
indexID matchID order userClean Probability
0 0 1 0 dirty 35
1 0 2 1 clean 75
2 0 2 2 dirty 25
5 3 4 5 dirty 40
6 3 5 6 clean 85
9 4 5 9 clean 74
12 6 7 12 dirty 23
13 6 8 13 clean 72
14 7 8 14 clean 85
15 9 10 15 clean 76
16 10 11 16 clean 91
19 13 14 19 dirty 27
23 13 17 23 dirty 10
28 13 18 28 clean 71
32 20 21 32 clean 97
33 20 22 33 dirty 30
Detail:
print(df.groupby('indexID')['Probability'].transform('max'))
0 75
1 75
2 75
5 85
6 85
9 74
12 72
13 72
14 85
15 76
16 91
19 71
23 71
28 71
32 97
33 97
Name: Probability, dtype: int64
If you instead want to compare each Probability against the overall mean with gt (>):
mask = df['Probability'].gt(df['Probability'].mean())
df.loc[mask, 'userClean'] = 'dirty'
print(df)
indexID matchID order userClean Probability
0 0 1 0 clean 35
1 0 2 1 dirty 75
2 0 2 2 clean 25
5 3 4 5 clean 40
6 3 5 6 dirty 85
9 4 5 9 dirty 74
12 6 7 12 clean 23
13 6 8 13 dirty 72
14 7 8 14 dirty 85
15 9 10 15 dirty 76
16 10 11 16 dirty 91
19 13 14 19 clean 27
23 13 17 23 clean 10
28 13 18 28 dirty 71
32 20 21 32 dirty 97
33 20 22 33 clean 30
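As a side note, an equivalent sketch for the per-group maximum uses idxmax (this assumes ties should resolve to the first row, which is what idxmax does):
# reset everything to dirty, then mark each group's highest-probability row clean
df['userClean'] = 'dirty'
df.loc[df.groupby('indexID')['Probability'].idxmax(), 'userClean'] = 'clean'
print(df)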

Getting combinations of elements from different pandas rows

Assume that I have a dataframe like this:
Date Artist percent_gray percent_blue percent_black percent_red
33 Leonardo 22 33 36 46
45 Leonardo 23 47 23 14
46 Leonardo 13 34 33 12
23 Michelangelo 28 19 38 25
25 Michelangelo 24 56 55 13
26 Michelangelo 21 22 45 13
13 Titian 24 17 23 22
16 Titian 45 43 44 13
19 Titian 17 45 56 13
24 Raphael 34 34 34 45
27 Raphael 31 22 25 67
I want to get the maximum color difference between different pictures by the same artist. I can also compare percent_gray with percent_blue, e.g. for Leonardo the biggest difference is percent_red (date: 46) - percent_blue (date: 45) = 12 - 47 = -35. I want to see how this evolves over time, so I only want to compare newer pictures of the same artist with the older ones (in this case I can compare the third picture with the first and second ones, and the second picture only with the first one) and get the maximum differences. So the dataframe should look like:
Date Artist max_d
33 Leonardo NaN
45 Leonardo -32
46 Leonardo -35
23 Michelangelo NaN
25 Michelangelo 37
26 Michelangelo -43
13 Titian NaN
16 Titian 28
19 Titian 43
24 Raphael NaN
27 Raphael 33
I think I have to use groupby but couldn't manage to get the output I want.
You can use:
# sort first (real data may not be ordered)
df = df.sort_values(['Artist', 'Date'])
# row-wise min and max across the color columns
mi = df.iloc[:, 2:].min(axis=1)
ma = df.iloc[:, 2:].max(axis=1)
# the previous picture's max and min within each artist
ma1 = ma.groupby(df['Artist']).shift()
mi1 = mi.groupby(df['Artist']).shift()
# candidate differences in both directions
mad1 = mi - ma1
mad2 = ma - mi1
# keep whichever difference is larger in absolute value
df['max_d'] = np.where(mad1.abs() > mad2.abs(), mad1, mad2)
print(df)
Date Artist percent_gray percent_blue percent_black \
0 33 Leonardo 22 33 36
1 45 Leonardo 23 47 23
2 46 Leonardo 13 34 33
3 23 Michelangelo 28 19 38
4 25 Michelangelo 24 56 55
5 26 Michelangelo 21 22 45
6 13 Titian 24 17 23
7 16 Titian 45 43 44
8 19 Titian 17 45 56
9 24 Raphael 34 34 34
10 27 Raphael 31 22 25
percent_red max_d
0 46 NaN
1 14 -32.0
2 12 -35.0
3 25 NaN
4 13 37.0
5 13 -43.0
6 22 NaN
7 13 28.0
8 13 43.0
9 45 NaN
10 67 33.0
Explanation (with new columns):
# get min and max per row
df['min'] = df.iloc[:,2:].min(axis=1)
df['max'] = df.iloc[:,2:].max(axis=1)
# get shifted min and max by Artist
df['max1'] = df.groupby('Artist')['max'].shift()
df['min1'] = df.groupby('Artist')['min'].shift()
# get the differences
df['max_d1'] = df['min'] - df['max1']
df['max_d2'] = df['max'] - df['min1']
# if/else on absolute values
df['max_d'] = np.where(df['max_d1'].abs() > df['max_d2'].abs(), df['max_d1'], df['max_d2'])
print(df)
percent_red min max max1 min1 max_d1 max_d2 max_d
0 46 22 46 NaN NaN NaN NaN NaN
1 14 14 47 46.0 22.0 -32.0 25.0 -32.0
2 12 12 34 47.0 14.0 -35.0 20.0 -35.0
3 25 19 38 NaN NaN NaN NaN NaN
4 13 13 56 38.0 19.0 -25.0 37.0 37.0
5 13 13 45 56.0 13.0 -43.0 32.0 -43.0
6 22 17 24 NaN NaN NaN NaN NaN
7 13 13 45 24.0 17.0 -11.0 28.0 28.0
8 13 13 56 45.0 13.0 -32.0 43.0 43.0
9 45 34 45 NaN NaN NaN NaN NaN
10 67 22 67 45.0 34.0 -23.0 33.0 33.0
And if you used the second, explanatory solution, drop the helper columns afterwards:
df = df.drop(['min','max','max1','min1','max_d1', 'max_d2'], axis=1)
print(df)
Date Artist percent_gray percent_blue percent_black \
0 33 Leonardo 22 33 36
1 45 Leonardo 23 47 23
2 46 Leonardo 13 34 33
3 23 Michelangelo 28 19 38
4 25 Michelangelo 24 56 55
5 26 Michelangelo 21 22 45
6 13 Titian 24 17 23
7 16 Titian 45 43 44
8 19 Titian 17 45 56
9 24 Raphael 34 34 34
10 27 Raphael 31 22 25
percent_red max_d
0 46 NaN
1 14 -32.0
2 12 -35.0
3 25 NaN
4 13 37.0
5 13 -43.0
6 22 NaN
7 13 28.0
8 13 43.0
9 45 NaN
10 67 33.0
How about a custom apply function? Does this work?
import itertools
import pandas

p = pandas.read_csv('Artits.tsv', sep=r'\s+')

def max_any_color(cols):
    grey = []
    blue = []
    black = []
    red = []
    for row in cols.iterrows():
        grey.append(row[1]['percent_gray'])
        blue.append(row[1]['percent_blue'])
        black.append(row[1]['percent_black'])
        red.append(row[1]['percent_red'])
    # largest absolute difference for every cross-color pair
    gb = max([abs(a[0] - a[1]) for a in itertools.product(grey, blue)])
    gblack = max([abs(a[0] - a[1]) for a in itertools.product(grey, black)])
    gr = max([abs(a[0] - a[1]) for a in itertools.product(grey, red)])
    bb = max([abs(a[0] - a[1]) for a in itertools.product(blue, black)])
    br = max([abs(a[0] - a[1]) for a in itertools.product(blue, red)])
    blackr = max([abs(a[0] - a[1]) for a in itertools.product(black, red)])
    l = [gb, gblack, gr, bb, br, blackr]
    c = ['grey/blue', 'grey/black', 'grey/red', 'blue/black', 'blue/red', 'black/red']
    max_ = max(l)
    between_colors_index = l.index(max_)
    return c[between_colors_index], max_

p.groupby('Artist').apply(lambda x: max_any_color(x))
Output:
Leonardo (blue/red, 35)
Michelangelo (blue/red, 43)
Raphael (blue/red, 45)
Titian (black/red, 43)
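For this sample data the same numbers fall out of a much shorter groupby, though only as a rough sketch: it matches the pairwise version only when the global maximum and minimum land in different color columns, as they happen to here.
colors = ['percent_gray', 'percent_blue', 'percent_black', 'percent_red']
grouped = p.groupby('Artist')[colors]
# overall spread per artist: largest value anywhere minus smallest value anywhere
spread = grouped.max().max(axis=1) - grouped.min().min(axis=1)
print(spread)  # Leonardo 35, Michelangelo 43, Raphael 45, Titian 43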

How to plot multiple lines as histograms per group from a pandas DataFrame

I am trying to look at 'time of day' effects on my users on a week over week basis to get a quick visual take on how consistent time of day trends are. So as a first start I've used this:
df[df['week'] < 10][['realLocalTime', 'week']].hist(by = 'week', bins = 24, figsize = (15, 15))
This produces a grid of per-week histograms (the resulting plot is not shown here).
This is a nice easy start, but what I would really like is to represent the histogram as a line plot, and overlay all the lines, one for each week on the same plot. Is there a way to do this?
I have a bit more experience with ggplot, where I would just do this by mapping a factor level to color. Is there a similarly easy way to do this with pandas and/or matplotlib?
Here's what my data looks like:
realLocalTime week
1 12 10
2 12 10
3 12 10
4 12 10
5 13 5
6 17 5
7 17 5
8 6 6
9 17 5
10 20 6
11 18 5
12 18 5
13 19 6
14 21 6
15 21 6
16 14 6
17 6 6
18 0 6
19 21 5
20 17 6
21 23 6
22 22 6
23 22 6
24 17 6
25 22 5
26 13 6
27 23 6
28 22 5
29 21 6
30 17 6
... ... ...
70 14 5
71 9 5
72 19 6
73 19 6
74 21 6
75 20 5
76 20 5
77 21 5
78 15 6
79 22 6
80 23 6
81 15 6
82 12 6
83 7 6
84 9 6
85 8 6
86 22 6
87 22 6
88 22 6
89 8 5
90 8 5
91 8 5
92 9 5
93 7 5
94 22 5
95 8 6
96 10 6
97 0 6
98 22 5
99 14 6
Maybe you can simply use crosstab to compute the number of elements per week and plot it:
import pandas as pd

# Test data
d = {'realLocalTime': ['12', '14', '14', '12', '13', '17', '14', '17'],
     'week': ['10', '10', '10', '10', '5', '5', '6', '6']}
df = pd.DataFrame(d)
ax = pd.crosstab(df['realLocalTime'], df['week']).plot()
Or use groupby and value_counts:
df.groupby('week').realLocalTime.value_counts().unstack(0).fillna(0).plot()
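For reference, here is a labeled version of the same idea (a sketch; it assumes realLocalTime holds hour-of-day values in the real data):
import matplotlib.pyplot as plt

# one line per week: count how often each hour of day occurs
counts = df.groupby('week')['realLocalTime'].value_counts().unstack(0).fillna(0)
ax = counts.sort_index().plot()
ax.set_xlabel('hour of day')
ax.set_ylabel('count')
plt.show()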

Sample size from mean() on groupby object

Is there a way to record the sample size when calling the mean() method of a groupby object?
Consider the following dataframe:
In [16]: df
Out[16]:
formation phi sw
0 nio 14 47
1 nio 10 16
2 nio 12 12
3 nio 19 82
4 nio 23 43
5 fthays 24 19
6 codell 23 5
7 codell 24 45
8 codell 9 11
9 graneros 26 11
10 graneros 15 45
11 graneros 12 16
12 dkot 11 79
It's easy enough to compute the mean across each of these formations using the mean() method of the groupby object:
In [17]: df.groupby(['formation']).mean()
Out[17]:
phi sw
formation
codell 18.666667 20.333333
dkot 11.000000 79.000000
fthays 24.000000 19.000000
graneros 17.666667 24.000000
nio 15.600000 40.000000
But I'd like to know if there's a way to add a column for the sample size. So my desired output would be something like:
phi sw n
formation
codell 18.666667 20.333333 3
dkot 11.000000 79.000000 1
fthays 24.000000 19.000000 1
graneros 17.666667 24.000000 3
nio 15.600000 40.000000 5
You can do this by using the aggregate function, passing the mean and size functions as arguments.
>>> import numpy as np
>>> df.groupby(['formation']).aggregate([np.mean, np.size])
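On newer pandas versions (0.25+), named aggregation would produce exactly the column layout asked for; a minimal sketch:
>>> df.groupby('formation').agg(phi=('phi', 'mean'), sw=('sw', 'mean'), n=('phi', 'size'))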
