Pandas - Add columns based on grouping of other columns - python

I want to add three additional columns using pandas and Python. I'm not sure how to add columns based on rows that have the same groupId value.
min_avg: the lowest avg value among rows with the same groupId
max_avg: the highest avg value among rows with the same groupId
group_avg: the mean of the avg values among rows with the same groupId
I'm not entirely sure where to begin with this one.
I have this:
avg groupId
0 25.5 1016
1 26.7 1048
2 25.8 1016
3 53.5 1048
4 29.3 1064
5 27.7 1016
and my goal is this:
avg groupId min_avg max_avg group_avg
0 25.5 1016 25.5 27.7 26.3
1 26.7 1048 26.7 53.5 40.1
2 25.8 1016 25.5 27.7 26.3
3 53.5 1048 26.7 53.5 40.1
4 29.3 1064 29.3 29.3 29.3
5 27.7 1016 25.5 27.7 26.3

We can do a merge with groupby + describe:
df=df.merge(df.groupby('groupId').avg.describe()[['mean','min','max']].reset_index(),how='left')
Out[25]:
avg groupId mean min max
0 25.5 1016 26.333333 25.5 27.7
1 26.7 1048 40.100000 26.7 53.5
2 25.8 1016 26.333333 25.5 27.7
3 53.5 1048 40.100000 26.7 53.5
4 29.3 1064 29.300000 29.3 29.3
5 27.7 1016 26.333333 25.5 27.7

The describe method, as given in YOBEN_S's solution, will compute more statistics than are required, including count, std, and the quartiles.
We can avoid this by using the agg method.
df.merge(df.groupby('groupId')['avg'].agg([min, max, 'mean']), on='groupId')
# output
avg groupId min max mean
0 25.5 1016 25.5 27.7 26.333333
1 26.7 1048 26.7 53.5 40.100000
2 25.8 1016 25.5 27.7 26.333333
3 53.5 1048 26.7 53.5 40.100000
4 29.3 1064 29.3 29.3 29.300000
5 27.7 1016 25.5 27.7 26.333333
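If the exact column names from the question (min_avg, max_avg, group_avg) are wanted, a named aggregation can produce them directly; a minimal sketch, assuming pandas 0.25 or later (the variable name out is just illustrative):
out = df.merge(
    df.groupby('groupId')['avg']
      .agg(min_avg='min', max_avg='max', group_avg='mean')
      .reset_index(),
    on='groupId', how='left',
)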
Speed Comparison
Approach 1
%%timeit -n 1000
df.merge(df.groupby('groupId').avg.describe()[['mean','min','max']].reset_index(),how='left')
9.6 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Approach 2
%%timeit -n 1000
df.merge(df.groupby('groupId')['avg'].agg([min, max, 'mean']), on='groupId')
3.42 ms ± 74.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Approach 3
Additionally, we can get a slight speedup by converting df.merge to df.join.
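The exact join call was not shown; presumably it looks something like this (a sketch, reusing the agg from Approach 2 — DataFrame.join matches df's 'groupId' column against the aggregated frame's index):
df.join(df.groupby('groupId')['avg'].agg([min, max, 'mean']), on='groupId')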
2.96 ms ± 29.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Related

In dataframe, how to speed up recognizing rows that have more than 5 consecutive previous values with same sign?

I have a dataframe like this.
val consecutive
0 0.0001 0.0
1 0.0008 0.0
2 -0.0001 0.0
3 0.0005 0.0
4 0.0008 0.0
5 0.0002 0.0
6 0.0012 0.0
7 0.0012 1.0
8 0.0007 1.0
9 0.0004 1.0
10 0.0002 1.0
11 0.0000 0.0
12 0.0015 0.0
13 -0.0005 0.0
14 -0.0003 0.0
15 0.0001 0.0
16 0.0001 0.0
17 0.0003 0.0
18 -0.0003 0.0
19 -0.0001 0.0
20 0.0000 0.0
21 0.0000 0.0
22 -0.0008 0.0
23 -0.0008 0.0
24 -0.0001 0.0
25 -0.0006 0.0
26 -0.0010 1.0
27 0.0002 0.0
28 -0.0003 0.0
29 -0.0008 0.0
30 -0.0010 0.0
31 -0.0003 0.0
32 -0.0005 1.0
33 -0.0012 1.0
34 -0.0002 1.0
35 0.0000 0.0
36 -0.0018 0.0
37 -0.0009 0.0
38 -0.0007 0.0
39 0.0000 0.0
40 -0.0011 0.0
41 -0.0006 0.0
42 -0.0010 0.0
43 -0.0015 0.0
44 -0.0012 1.0
45 -0.0011 1.0
46 -0.0010 1.0
47 -0.0014 1.0
48 -0.0011 1.0
49 -0.0017 1.0
50 -0.0015 1.0
51 -0.0010 1.0
52 -0.0014 1.0
53 -0.0012 1.0
54 -0.0004 1.0
55 -0.0007 1.0
56 -0.0011 1.0
57 -0.0008 1.0
58 -0.0006 1.0
59 0.0002 0.0
The column 'consecutive' is what I want to compute. It is 1 when the current row is part of a run of 5 or more consecutive values with the same sign (either positive or negative, counting the row itself).
What I've tried is:
df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: np.all(arr > 0) or np.all(arr < 0), raw=True
).replace(np.nan, 0)
But it's too slow for a large dataset.
Do you have any idea how to speed it up?
One option is to avoid the use of apply() altogether.
The main idea is to create 2 'helper' columns:
sign: a boolean Series indicating whether the value is non-negative (True) or negative (False)
id: groups identical consecutive occurrences together
Finally, we can groupby the id and use the cumulative count to isolate the rows which have 4 or more previous rows with the same sign (i.e. all rows that are part of a run of 5 or more same-sign values).
# Setup test dataset
import pandas as pd
import numpy as np
vals = np.random.randn(20000)
df = pd.DataFrame({'val': vals})
# Create the helper columns
sign = df['val'] >= 0
df['id'] = sign.ne(sign.shift()).cumsum()
# Count the ids and set flag to True if the cumcount is above our desired value
df['consecutive'] = df.groupby('id').cumcount() >= 4
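If the 0.0/1.0 float format from the question's expected output is needed, the boolean flag can simply be cast (a small follow-up, not part of the benchmark below):
df['consecutive'] = (df.groupby('id').cumcount() >= 4).astype(float)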
Benchmarking
On my system I get the following benchmarks:
sign = df['val'] >= 0
# 92 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
df['id'] = sign.ne(sign.shift()).cumsum()
# 1.06 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df['consecutive'] = df.groupby('id').cumcount() >= 4
# 3.36 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Thus in total we get an average runtime of: 4.51 ms
For reference, your solution and @Emma's solution ran on my system in, respectively:
# 287 ms ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 121 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Not sure whether this is fast enough for your data size, but using min/max seems faster.
With 20k rows,
df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: np.all(arr > 0) or np.all(arr < 0), raw=True
)
# 144 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

df['consecutive'] = df['val'].rolling(5).apply(
    lambda arr: (arr.min() > 0 or arr.max() < 0), raw=True
)
# 57.1 ms ± 85.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
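A further variant (a sketch, not from the original answers) drops apply() entirely and uses the rolling min/max aggregations directly; it reproduces the same 0.0/1.0 output as the question's code, since the NaN comparisons for the first 4 rows evaluate to False:
roll = df['val'].rolling(5)
df['consecutive'] = ((roll.min() > 0) | (roll.max() < 0)).astype(float)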

Finding the 3 largest numbers in an array [duplicate]

I have this data file and I have to find the 3 largest numbers it contains
24.7 25.7 30.6 47.5 62.9 68.5 73.7 67.9 61.1 48.5 39.6 20.0
16.1 19.1 24.2 45.4 61.3 66.5 72.1 68.4 60.2 50.9 37.4 31.1
10.4 21.6 37.4 44.7 53.2 68.0 73.7 68.2 60.7 50.2 37.2 24.6
21.5 14.7 35.0 48.3 54.0 68.2 69.6 65.7 60.8 49.1 33.2 26.0
19.1 20.6 40.2 50.0 55.3 67.7 70.7 70.3 60.6 50.7 35.8 20.7
14.0 24.1 29.4 46.6 58.6 62.2 72.1 71.7 61.9 47.6 34.2 20.4
8.4 19.0 31.4 48.7 61.6 68.1 72.2 70.6 62.5 52.7 36.7 23.8
11.2 20.0 29.6 47.7 55.8 73.2 68.0 67.1 64.9 57.1 37.6 27.7
13.4 17.2 30.8 43.7 62.3 66.4 70.2 71.6 62.1 46.0 32.7 17.3
22.5 25.7 42.3 45.2 55.5 68.9 72.3 72.3 62.5 55.6 38.0 20.4
17.6 20.5 34.2 49.2 54.8 63.8 74.0 67.1 57.7 50.8 36.8 25.5
20.4 19.6 24.6 41.3 61.8 68.5 72.0 71.1 57.3 52.5 40.6 26.2
Therefore I have written the following code, but it only searches the first row of numbers instead of the entire list. Can anyone help me find the error?
def three_highest_temps(f):
    file = open(f, "r")
    largest = 0
    second_largest = 0
    third_largest = 0
    temp = []
    for line in file:
        temps = line.split()
        for i in temps:
            if i > largest:
                largest = i
            elif largest > i > second_largest:
                second_largest = i
            elif second_largest > i > third_largest:
                third_largest = i
        return largest, second_largest, third_largest

print(three_highest_temps("data5.txt"))
Your data contains floating-point numbers, not integers.
You can use sorted:
>>> data = '''24.7 25.7 30.6 47.5 62.9 68.5 73.7 67.9 61.1 48.5 39.6 20.0
... 16.1 19.1 24.2 45.4 61.3 66.5 72.1 68.4 60.2 50.9 37.4 31.1
... 10.4 21.6 37.4 44.7 53.2 68.0 73.7 68.2 60.7 50.2 37.2 24.6
... 21.5 14.7 35.0 48.3 54.0 68.2 69.6 65.7 60.8 49.1 33.2 26.0
... 19.1 20.6 40.2 50.0 55.3 67.7 70.7 70.3 60.6 50.7 35.8 20.7
... 14.0 24.1 29.4 46.6 58.6 62.2 72.1 71.7 61.9 47.6 34.2 20.4
... 8.4 19.0 31.4 48.7 61.6 68.1 72.2 70.6 62.5 52.7 36.7 23.8
... 11.2 20.0 29.6 47.7 55.8 73.2 68.0 67.1 64.9 57.1 37.6 27.7
... 13.4 17.2 30.8 43.7 62.3 66.4 70.2 71.6 62.1 46.0 32.7 17.3
... 22.5 25.7 42.3 45.2 55.5 68.9 72.3 72.3 62.5 55.6 38.0 20.4
... 17.6 20.5 34.2 49.2 54.8 63.8 74.0 67.1 57.7 50.8 36.8 25.5
... 20.4 19.6 24.6 41.3 61.8 68.5 72.0 71.1 57.3 52.5 40.6 26.2
... '''
>>> sorted(map(float, data.split()), reverse=True)[:3]
[74.0, 73.7, 73.7]
If you want integer results (in Python 3, map() returns an iterator, so wrap it in list() to see the values):
>>> temps = sorted(map(float, data.split()), reverse=True)[:3]
>>> map(int, temps)
[74, 73, 73]
You only get the max elements for the first line because you return at the end of the first iteration of the outer loop. You should de-indent the return statement.
Sorting the data and picking the first 3 elements runs in O(n log n):
data = [float(v) for line in file for v in line.split()]
sorted(data, reverse=True)[:3]
It is perfectly fine for 144 elements.
You can also get the answer in linear time using a heapq
import heapq
heapq.nlargest(3, data)
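Putting those two pieces together, a self-contained version of the question's function might look like this (a sketch that keeps the function and file names from the question):
import heapq

def three_highest_temps(f):
    # Read every whitespace-separated value in the file as a float,
    # then return the three largest.
    with open(f) as file:
        data = [float(v) for line in file for v in line.split()]
    return heapq.nlargest(3, data)

print(three_highest_temps("data5.txt"))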
Your return statement is inside the for loop. Once return is reached, the function terminates, so the loop never gets into a second iteration. Move the return outside the loop by reducing indentation.
for line in file:
    temps = line.split()
    for i in temps:
        if i > largest:
            largest = i
        elif largest > i > second_largest:
            second_largest = i
        elif second_largest > i > third_largest:
            third_largest = i
    return largest, second_largest, third_largest
In addition, your comparisons won't work, because line.split() returns a list of strings, not floats. (As has been pointed out, your data consists of floats, not ints. I'm assuming the task is to find the largest float.) So let's convert the strings using float()
Your code still won't be correct, though, because when you find a new largest value, you completely discard the old one. Instead you should now consider it the second largest known value. Same rule applies for second to third largest.
for line in file:
    temps = line.split()
    for temp_string in temps:
        i = float(temp_string)
        if i > largest:
            third_largest = second_largest
            second_largest = largest
            largest = i
        elif largest > i > second_largest:
            third_largest = second_largest
            second_largest = i
        elif second_largest > i > third_largest:
            third_largest = i
return largest, second_largest, third_largest
Now there is one last issue:
You overlook cases where i is identical with one of the largest values. In such a case i > largest would be false, but so would largest > i. You could change either of these comparisons to >= to fix this.
Instead, let us simplify the if clauses by considering that the elif conditions are only considered after all previous conditions were already found to be false. When we reach the first elif, we already know that i can not be larger than largest, so it suffices to compare it to second largest. The same goes for the second elif.
for line in file:
    temps = line.split()
    for temp_string in temps:
        i = float(temp_string)
        if i > largest:
            third_largest = second_largest
            second_largest = largest
            largest = i
        elif i > second_largest:
            third_largest = second_largest
            second_largest = i
        elif i > third_largest:
            third_largest = i
return largest, second_largest, third_largest
This way we avoid accidentally filtering out the i == largest and i == second_largest edge cases.
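For completeness, a self-contained version of the corrected function could look like this (a sketch; initialising to negative infinity is an addition beyond the original answer, so the logic also works if all values are negative):
def three_highest_temps(f):
    # Start at -inf so even all-negative data produces correct results.
    largest = second_largest = third_largest = float('-inf')
    with open(f) as file:
        for line in file:
            for temp_string in line.split():
                i = float(temp_string)
                if i > largest:
                    third_largest = second_largest
                    second_largest = largest
                    largest = i
                elif i > second_largest:
                    third_largest = second_largest
                    second_largest = i
                elif i > third_largest:
                    third_largest = i
    return largest, second_largest, third_largest

print(three_highest_temps("data5.txt"))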
Since you are dealing with a file, a fast and NumPythonic approach is to load the file as an array, sort it, and take the last 3 items:
import numpy as np

with open('filename') as f:
    array = np.genfromtxt(f).ravel()
array.sort()
print(array[-3:])
[ 73.7 73.7 74. ]

sum of row in the same columns in pandas

I have a dataframe something like this:
d1 d2 d3 d4
780 37.0 21.4 122840.0
784 38.1 21.4 122860.0
846 38.1 21.4 122880.0
843 38.0 21.5 122900.0
820 36.3 22.9 133220.0
819 36.3 22.9 133240.0
819 36.4 22.9 133260.0
820 36.3 22.9 133280.0
822 36.4 22.9 133300.0
how do I get the sum of consecutive values in the same column, as a new column in the dataframe?
For example:
d1 d2 d3 d4 d5
780 37.0 21.4 122840.0 1564
784 38.1 21.4 122860.0 1630
846 38.1 21.4 122880.0 1689
I want a new column with the sum of d1[i] + d1[i+1]. I know about .sum() in pandas, but I can't use it to sum consecutive rows within the same column.
Your question is not fully clear to me, but I think what you mean to do is:
df['d5'] = df['d1'] + df['d1'].shift(-1)
Now you have to decide what you want to happen for the last element of the series.
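For example, one choice (an assumption about the desired behaviour, not stated in the question) is to treat the missing neighbour as 0 and keep the column as integers:
df['d5'] = (df['d1'] + df['d1'].shift(-1)).fillna(0).astype(int)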
Check with rolling
df['d5'] = df['d1'].rolling(2 ,min_periods=1).sum()
df
Out[321]:
d1 d2 d3 d4 d5
0 780 37.0 21.4 122840.0 780.0
1 784 38.1 21.4 122860.0 1564.0
2 846 38.1 21.4 122880.0 1630.0
3 843 38.0 21.5 122900.0 1689.0
4 820 36.3 22.9 133220.0 1663.0
5 819 36.3 22.9 133240.0 1639.0
6 819 36.4 22.9 133260.0 1638.0
7 820 36.3 22.9 133280.0 1639.0
8 822 36.4 22.9 133300.0 1642.0
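Note that this rolling sum is offset by one row compared to the expected output in the question (1564 appears at row 1 rather than row 0). If that exact alignment is wanted, the result can be shifted back, e.g.:
df['d5'] = df['d1'].rolling(2, min_periods=1).sum().shift(-1)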

How to group daily time series data into smaller dataframes of weeks

I have a dataframe that looks like this:
open high low close weekday
time
2011-11-29 2.55 2.98 2.54 2.75 1
2011-11-30 2.75 3.09 2.73 2.97 2
2011-12-01 2.97 3.14 2.93 3.06 3
2011-12-02 3.06 3.14 3.03 3.12 4
2011-12-03 3.12 3.13 2.75 2.79 5
2011-12-04 2.79 2.90 2.61 2.83 6
2011-12-05 2.83 2.93 2.78 2.88 0
2011-12-06 2.88 3.05 2.87 3.03 1
2011-12-07 3.03 3.08 2.93 2.99 2
2011-12-08 2.99 3.01 2.88 2.98 3
2011-12-09 2.98 3.04 2.93 2.97 4
2011-12-10 2.97 3.13 2.93 3.05 5
2011-12-11 3.05 3.38 2.99 3.25 6
The weekday column refers to 0 = Monday, ..., 6 = Sunday.
I want to make groups of smaller dataframes only containing the data for Friday, Saturday, Sunday and Monday. So one subset would look like this:
2011-12-02 3.06 3.14 3.03 3.12 4
2011-12-03 3.12 3.13 2.75 2.79 5
2011-12-04 2.79 2.90 2.61 2.83 6
2011-12-05 2.83 2.93 2.78 2.88 0
filter before drop_duplicates
df[df.weekday.isin([4,5,6,0])].drop_duplicates('weekday')
Out[10]:
open high low close weekday
2011-12-02 3.06 3.14 3.03 3.12 4
2011-12-03 3.12 3.13 2.75 2.79 5
2011-12-04 2.79 2.90 2.61 2.83 6
2011-12-05 2.83 2.93 2.78 2.88 0
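If the goal is to get every Friday-to-Monday block as its own smaller dataframe (rather than only the first one), one possible follow-up sketch is to start a new group at each Friday and collect the pieces:
mask = df['weekday'].isin([4, 5, 6, 0])
week_id = (df['weekday'] == 4).cumsum()  # a new group starts at each Friday
weekend_frames = [g for _, g in df[mask].groupby(week_id[mask])]
# Any Sat/Sun/Mon rows before the first Friday would fall into group 0
# and may need to be dropped.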

NaN values in columns in Python

I have a data set which is created based on another data set. In my new data frame some columns have NaN values. I want to take the log of each column. However, I need all the rows, even the ones that have NaN values. What should I do with the NaN values before applying the log? For example, consider the following data set:
a b c
1 2 3
4 5 6
7 nan 8
9 nan nan
I do not want to drop the rows with NaN values; I need them for applying the log. For example, I still need the values 7 and 8 from the rows that contain a NaN.
Thanks.
Having NaN won't affect log, since it is computed for each individual cell. What's more, np.log will operate on a pd.DataFrame and return a pd.DataFrame:
np.log(df)
a b c
0 0.000000 0.693147 1.098612
1 1.386294 1.609438 1.791759
2 1.945910 NaN 2.079442
3 2.197225 NaN NaN
Notice the difference in timing
%timeit np.log(df)
%timeit pd.DataFrame(np.log(df.values), df.index, df.columns)
%timeit df.applymap(np.log)
134 µs ± 5.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
107 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
835 µs ± 12.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Response to @IanS
Notice the subok=True parameter in the documentation.
It controls whether the original type is preserved. If we set it to False:
np.log(df, subok=False)
array([[ 0. , 0.69314718, 1.09861229],
[ 1.38629436, 1.60943791, 1.79175947],
[ 1.94591015, nan, 2.07944154],
[ 2.19722458, nan, nan]])
