I have a pandas dataframe which looks like this:
Concentr 1 Concentr 2 Time
0 25.4 0.48 00:01:00
1 26.5 0.49 00:02:00
2 25.2 0.52 00:03:00
3 23.7 0.49 00:04:00
4 23.8 0.55 00:05:00
5 24.6 0.53 00:06:00
6 26.3 0.57 00:07:00
7 27.1 0.59 00:08:00
8 28.8 0.56 00:09:00
9 23.9 0.54 00:10:00
10 25.6 0.49 00:11:00
11 27.5 0.56 00:12:00
12 26.3 0.55 00:13:00
13 25.3 0.54 00:14:00
and I want to keep the max value of Concentr 1 for every 5-minute interval, along with the time it occurred and the value of Concentr 2 at that time. So, for the previous example I would like to have:
Concentr 1 Concentr 2 Time
0 26.5 0.49 00:02:00
1 28.8 0.56 00:09:00
2 27.5 0.56 00:12:00
My current approach would be i) to create an auxiliary variable with an ID for each 5-minute interval, e.g. 00:00 to 00:05 would be interval 1, 00:05 to 00:10 would be interval 2, etc., ii) use the interval variable in a groupby to get the max Concentr 1 per interval, and iii) merge back to the initial df using both the interval variable and Concentr 1, thus identifying the corresponding time.
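In code, steps i)-iii) would look roughly like this (a sketch, assuming pandas is imported as pd; note that a tie on the max would produce duplicate rows in the merge):
df['Time'] = pd.to_datetime(df['Time'])                            # make Time datetime-like
df['interval'] = df['Time'].dt.floor('5T')                         # i) 5-minute bucket id
peaks = df.groupby('interval', as_index=False)['Concentr 1'].max() # ii) max Concentr 1 per bucket
result = peaks.merge(df, on=['interval', 'Concentr 1'])            # iii) recover Time and Concentr 2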
I would like to ask if there is a better / more efficient / more elegant way to do it.
Thank you very much for any help.
You can do a regular resample / groupby, and use the idxmax method to get the desired row for each group. Then use that to index your original data:
>>> df.loc[df.resample('5T', on='Time')['Concentr 1'].idxmax()]
Concentr 1 Concentr 2 Time
1 26.5 0.49 2021-10-09 00:02:00
8 28.8 0.56 2021-10-09 00:09:00
11 27.5 0.56 2021-10-09 00:12:00
This assumes your 'Time' column is datetime-like, which I did with pd.to_datetime. You can convert the time column back with strftime. So in full:
df['Time'] = pd.to_datetime(df['Time'])
result = df.loc[df.resample('5T', on='Time')['Concentr 1'].idxmax()]
result['Time'] = result['Time'].dt.strftime('%H:%M:%S')
Giving:
Concentr 1 Concentr 2 Time
1 26.5 0.49 00:02:00
8 28.8 0.56 00:09:00
11 27.5 0.56 00:12:00
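One version note: on pandas 2.2 and later the 'T' alias is deprecated in favor of 'min', so the same call would be spelled:
result = df.loc[df.resample('5min', on='Time')['Concentr 1'].idxmax()]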
df = df.set_index('Time')
# label of the row holding the 'Concentr 1' maximum within each 5-minute bin
idx = df.resample('5T')['Concentr 1'].idxmax()
df = df.loc[idx]
Then you would probably need to reset_index() if you do not wish Time to be your index.
You can also use this:
group every n=5 rows and filter the original df based on the index of the max "Concentr 1" within each group:
df = df[df.index.isin(df.groupby(df.index // 5)["Concentr 1"].idxmax())]
print(df)
Output:
Concentr 1 Concentr 2 Time
1 26.5 0.49 00:02:00
8 28.8 0.56 00:09:00
11 27.5 0.56 00:12:00
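One caveat worth adding: this version groups purely by row position, so it matches the time-based approaches only while the data really has one row per minute and a default RangeIndex. A quick sanity check (a sketch; it assumes 'Time' was converted with pd.to_datetime) could be:
# verify consecutive readings are exactly one minute apart
assert (df['Time'].diff().dropna() == pd.Timedelta(minutes=1)).all()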
I am getting this error <bound method NDFrame.head of .
It is not showing my data frame properly, what should I do?
My code is basic, here it is:
import pandas as pd
df = pd.read_csv("/Users/shloak/Desktop/Pandas/Avacado/avocado.csv")
albany_df = df[ df['region'] == "Albany"]
albany_df.head
This is my output
<bound method NDFrame.head of Unnamed: 0 Date AveragePrice Total Volume 4046 4225 \
0 0 2015-12-27 1.33 64236.62 1036.74 54454.85
1 1 2015-12-20 1.35 54876.98 674.28 44638.81
2 2 2015-12-13 0.93 118220.22 794.70 109149.67
3 3 2015-12-06 1.08 78992.15 1132.00 71976.41
4 4 2015-11-29 1.28 51039.60 941.48 43838.39
... ... ... ... ... ... ...
17608 7 2018-02-04 1.52 4124.96 118.38 420.36
17609 8 2018-01-28 1.32 6987.56 433.66 374.96
17610 9 2018-01-21 1.54 3346.54 14.67 253.01
17611 10 2018-01-14 1.47 4140.95 7.30 301.87
17612 11 2018-01-07 1.54 4816.90 43.51 412.17
4770 Total Bags Small Bags Large Bags XLarge Bags type \
0 48.16 8696.87 8603.62 93.25 0.0 conventional
1 58.33 9505.56 9408.07 97.49 0.0 conventional
2 130.50 8145.35 8042.21 103.14 0.0 conventional
3 72.58 5811.16 5677.40 133.76 0.0 conventional
4 75.78 6183.95 5986.26 197.69 0.0 conventional
... ... ... ... ... ... ...
17608 0.00 3586.22 3586.22 0.00 0.0 organic
17609 0.00 6178.94 6178.94 0.00 0.0 organic
17610 0.00 3078.86 3078.86 0.00 0.0 organic
17611 0.00 3831.78 3831.78 0.00 0.0 organic
17612 0.00 4361.22 4357.89 3.33 0.0 organic
year region
0 2015 Albany
1 2015 Albany
2 2015 Albany
3 2015 Albany
4 2015 Albany
... ... ...
17608 2018 Albany
17609 2018 Albany
17610 2018 Albany
17611 2018 Albany
17612 2018 Albany
[338 rows x 14 columns]>
What is the reason for this? I have Python 3.9 and pandas 1.1.3.
head is a method; you need to call it, like this: albany_df.head().
Right now you are not getting an error: you are printing the method itself instead of the result of calling it.
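For completeness, a minimal corrected version of the snippet from the question (same file path):
import pandas as pd

df = pd.read_csv("/Users/shloak/Desktop/Pandas/Avacado/avocado.csv")
albany_df = df[df['region'] == "Albany"]
print(albany_df.head())  # the parentheses call the method; print shows the result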
I have a dataframe that contains cell phone minutes usage, logged by date of call and duration.
It looks like this (30 row sample):
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
I want to group by user_id AND call_date with the ultimate goal of calculating the number of minutes used per month over the course of the year, per user.
I thought I could accomplish this by using:
calls.groupby(['user_id','call_date'])['duration'].sum()
but the results aren't what I expected:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-08-14 13.86
2018-08-16 23.46
2018-08-17 8.11
2018-08-18 1.74
2018-08-19 10.73
2018-08-20 7.32
2018-08-21 0.00
2018-08-23 8.50
2018-08-24 8.63
2018-08-25 35.39
2018-08-27 10.57
2018-08-28 19.91
2018-08-29 0.54
2018-08-31 22.38
2018-09-01 7.53
2018-09-02 10.27
2018-09-03 30.66
2018-09-04 0.00
2018-09-05 9.09
2018-09-06 10.06
I'd hoped that it would be grouped like: user_id 1000, all calls for Jan with duration summed, all calls for Feb with duration summed, etc.
I am really new to Python and programming in general, and am not sure what my next step should be to get these grouped by user_id and month of the year.
Thanks in advance for any insight you can offer.
Regards,
Jared
Something is not quite right in your setup. First of all, both of your tables are the same, so I am not sure if this is a cut-and-paste error or something else. Here is what I do with your data. Load it up like so; note we explicitly convert call_date to datetime:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(
"""
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
"""), delim_whitespace = True, index_col=0)
df['call_date'] = pd.to_datetime(df['call_date'])
Then using
df.groupby(['user_id','call_date'])['duration'].sum()
does the expected grouping by user and by each date:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-09-06 10.06
2018-09-21 5.75
2018-09-30 14.78
2018-10-12 1.00
2018-10-17 15.83
2018-10-27 0.98
2018-10-28 5.90
2018-11-09 1.00
2018-11-15 30.00
2018-11-17 2.45
2018-11-19 2.40
2018-12-04 7.19
2018-12-05 0.00
2018-12-13 6.27
2018-12-24 0.00
If you want to group by month, as you seem to suggest, you can use the Grouper functionality:
df.groupby(['user_id',pd.Grouper(key='call_date', freq='1M')])['duration'].sum()
which produces
user_id call_date
1000 2018-12-31 116.83
1001 2018-09-30 30.59
2018-10-31 23.71
2018-11-30 35.85
2018-12-31 13.46
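If you would rather have labels like 2018-12 than month-end timestamps, a small variation (a sketch using a monthly Period as the second grouping key instead of Grouper) is:
# group by user and calendar month, labeled as a Period such as 2018-12
monthly = df.groupby(['user_id', df['call_date'].dt.to_period('M')])['duration'].sum()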
Let me know if you are getting different results from following these steps.
I have two dataframes, namely df1 and df2. I want to compute the column New_Amount_Dollar in df2. Basically, df1 holds historical currency data, and I want to perform a date-wise operation, given Currency and Amount_Dollar from df2, to calculate the values for the New_Amount_Dollar column in df2.
e.g. in df2 the first currency is AUD for Date = '01-01-2019', so I want to calculate the New_Amount_Dollar value such that
New_Amount_Dollar = Amount_Dollar / AUD value from df1
i.e. New_Amount_Dollar = 19298 / 98 = 196.91
Another example: in df2 the third currency is COP for Date = '03-01-2019', so I want to calculate the New_Amount_Dollar value such that
New_Amount_Dollar = Amount_Dollar / COP value from df1
i.e. New_Amount_Dollar = 5000 / 0.043 = 116279.06
import pandas as pd
data1 = {'Date':['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019','05-01-2019'],
'AUD':[98, 98.5, 99, 99.5, 97],
'BWP':[30,31,33,32,31],
'CAD':[0.02,0.0192,0.0196,0.0196,0.0192],
'BND':[0.99,0.952,0.970,0.980,0.970],
'COP':[0.05,0.047,0.043,0.047,0.045]}
df1 = pd.DataFrame(data1)
data2 = {'Date':['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019','05-01-2019'],
'Currency':['AUD','AUD','COP','CAD','BND'],
'Amount_Dollar':[19298, 19210, 5000, 200, 2300],
'New_Amount_Dollar':[0,0,0,0,0]
}
df2 = pd.DataFrame(data2)
df1
Date AUD BWP CAD BND COP
0 01-01-2019 98.0 30 0.0200 0.990 0.050
1 02-01-2019 98.5 31 0.0192 0.952 0.047
2 03-01-2019 99.0 33 0.0196 0.970 0.043
3 04-01-2019 99.5 32 0.0196 0.980 0.047
4 05-01-2019 97.0 31 0.0192 0.970 0.045
df2
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 0
1 02-01-2019 AUD 19210 0
2 03-01-2019 COP 5000 0
3 04-01-2019 CAD 200 0
4 05-01-2019 BND 2300 0
Expected Result
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 196.91
1 02-01-2019 AUD 19210 195.02
2 03-01-2019 COP 5000 116279.06
3 04-01-2019 CAD 200 10204.08
4 05-01-2019 BND 2300 2371.13
Use DataFrame.set_index followed by DataFrame.lookup to pull the matching rates out as an array, then divide the Amount_Dollar column by it:
arr = df1.set_index('Date').lookup(df2['Date'], df2['Currency'])
df2['New_Amount_Dollar'] = df2['Amount_Dollar'] / arr
print (df2)
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 196.918367
1 02-01-2019 AUD 19210 195.025381
2 03-01-2019 COP 5000 116279.069767
3 04-01-2019 CAD 200 10204.081633
4 05-01-2019 BND 2300 2371.134021
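One caveat to flag: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on current versions a replacement is needed. A minimal sketch using positional indexing into the underlying array:
rates = df1.set_index('Date')
# positional row/column index for each (Date, Currency) pair
rows = rates.index.get_indexer(df2['Date'])
cols = rates.columns.get_indexer(df2['Currency'])
df2['New_Amount_Dollar'] = df2['Amount_Dollar'] / rates.to_numpy()[rows, cols]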
But if the dates do not match, first upsample df1 to a daily frequency with DataFrame.asfreq and forward-fill:
import pandas as pd
data1 = {'Date':['01-01-2019', '02-01-2019', '03-01-2019',
'04-01-2019','05-01-2019','08-01-2019'],
'AUD':[98, 98.5, 99, 99.5, 97,100],
'BWP':[30,31,33,32,31,20],
'CAD':[0.02,0.0192,0.0196,0.0196,0.0192,0.2],
'BND':[0.99,0.952,0.970,0.980,0.970,.23],
'COP':[0.05,0.047,0.043,0.047,0.045,0.023]}
df1 = pd.DataFrame(data1)
data2 = {'Date':['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019','07-01-2019'],
'Currency':['AUD','AUD','COP','CAD','BND'],
'Amount_Dollar':[19298, 19210, 5000, 200, 2300],
'New_Amount_Dollar':[0,0,0,0,0]
}
df2 = pd.DataFrame(data2)
print (df1)
Date AUD BWP CAD BND COP
0 01-01-2019 98.0 30 0.0200 0.990 0.050
1 02-01-2019 98.5 31 0.0192 0.952 0.047
2 03-01-2019 99.0 33 0.0196 0.970 0.043
3 04-01-2019 99.5 32 0.0196 0.980 0.047
4 05-01-2019 97.0 31 0.0192 0.970 0.045
5 08-01-2019 100.0 20 0.2000 0.230 0.023
print (df2)
Date Currency Amount_Dollar New_Amount_Dollar
0 01-01-2019 AUD 19298 0
1 02-01-2019 AUD 19210 0
2 03-01-2019 COP 5000 0
3 04-01-2019 CAD 200 0
4 07-01-2019 BND 2300 0
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date'] = pd.to_datetime(df2['Date'], dayfirst=True)
print (df1.set_index('Date').asfreq('D', method='ffill'))
AUD BWP CAD BND COP
Date
2019-01-01 98.0 30 0.0200 0.990 0.050
2019-01-02 98.5 31 0.0192 0.952 0.047
2019-01-03 99.0 33 0.0196 0.970 0.043
2019-01-04 99.5 32 0.0196 0.980 0.047
2019-01-05 97.0 31 0.0192 0.970 0.045
2019-01-06 97.0 31 0.0192 0.970 0.045
2019-01-07 97.0 31 0.0192 0.970 0.045
2019-01-08 100.0 20 0.2000 0.230 0.023
arr = df1.set_index('Date').asfreq('D', method='ffill').lookup(df2['Date'], df2['Currency'])
df2['New_Amount_Dollar'] = df2['Amount_Dollar'] / arr
print (df2)
Date Currency Amount_Dollar New_Amount_Dollar
0 2019-01-01 AUD 19298 196.918367
1 2019-01-02 AUD 19210 195.025381
2 2019-01-03 COP 5000 116279.069767
3 2019-01-04 CAD 200 10204.081633
4 2019-01-07 BND 2300 2371.134021
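The same removal caveat applies here: on pandas 2.x, replace the final lookup call with the get_indexer sketch above, applied to the asfreq-filled frame.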
I am learning data frames and trying out different graphs. I have a data set of video games and am trying to plot a graph which shows years on the x-axis and net sales on the y-axis, with one line per video game genre. I have grouped the data but am facing issues displaying it. Below is what I have tried:
import pandas as pd
%matplotlib inline
from matplotlib.pyplot import hist
df = pd.read_csv('VideoGames.csv')
s = df.groupby(['Genre','Year_of_Release']).agg(sum)['Global_Sales']
print(s)
The data is grouped properly as shown below:
Genre Year_of_Release
Action 1980.0 0.34
1981.0 14.84
1982.0 6.52
1983.0 2.86
1984.0 1.85
1985.0 3.52
1986.0 13.74
1987.0 1.12
1988.0 1.75
1989.0 4.64
1990.0 6.39
1991.0 6.76
1992.0 3.83
1993.0 1.81
1994.0 1.55
1995.0 3.57
1996.0 20.58
1997.0 27.58
1998.0 39.44
1999.0 27.77
2000.0 34.04
2001.0 59.39
2002.0 86.76
2003.0 67.93
2004.0 76.25
2005.0 85.53
2006.0 66.13
2007.0 104.97
2008.0 135.01
2009.0 137.66
...
Sports 2013.0 41.23
2014.0 45.10
2015.0 40.90
2016.0 23.53
Strategy 1991.0 0.94
1992.0 0.37
1993.0 0.81
1994.0 3.56
1995.0 6.51
1996.0 5.61
1997.0 7.71
1998.0 13.46
1999.0 18.45
2000.0 8.52
2001.0 7.55
2002.0 5.56
2003.0 7.99
2004.0 7.16
2005.0 5.31
2006.0 4.22
2007.0 9.26
2008.0 11.55
2009.0 12.36
2010.0 13.77
2011.0 8.84
2012.0 3.27
2013.0 6.09
2014.0 0.99
2015.0 1.84
2016.0 1.15
Name: Global_Sales, dtype: float64
Please advise how I can plot the graphs for all the genres in one diagram. Thank you.
In pandas plotting, the index is plotted as the x-axis and every column is drawn as a separate line, so you just need to unstack the series into a data frame with Genre as the columns:
ax = s.unstack('Genre').plot(kind="line")
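For a slightly more polished figure, a sketch with axis labels (the figure size and legend placement are illustrative choices, not requirements):
import matplotlib.pyplot as plt

ax = s.unstack('Genre').plot(kind="line", figsize=(12, 6))
ax.set_xlabel('Year of release')
ax.set_ylabel('Global sales')
ax.legend(title='Genre', bbox_to_anchor=(1.02, 1), loc='upper left')
plt.tight_layout()
plt.show()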