I have two types of text files which I need to read into a pandas DataFrame. I have a problem with the datetimes and the separator.
File A:
2009,7,1,3,101,13.03,89.33,0.6,287.69,0
2009,7,1,6,102,19.3,55,1,288.67,0
2009,7,1,9,103,22.33,39.67,1,289.6,0
2009,7,1,12,104,21.97,41,1,295.68,0
File B:
2019 9 1 3.00 101 14.02 92.08 2.62 174.77 0.109
2019 9 1 6.00 102 13.79 92.86 2.79 179.29 0.046
2019 9 1 9.00 103 13.81 92.60 2.73 178.94 0.070
2019 9 1 12.00 104 13.31 95.20 2.91 179.38 0.015
fileA.txt has no extra spaces; fileB.txt has one extra space at the beginning of each line. I can read each of these as follows, and the results are correct:
>>> import pandas as pd
>>> from datetime import datetime as dtdt
>>> par3 = lambda x: dtdt.strptime(x, '%Y %m %d %H')
>>> par4 = lambda x: dtdt.strptime(x, '%Y %m %d %H.%M')
>>> df3=pd.read_csv('fileA.txt',header=None,parse_dates={'Date': [0,1,2,3]}, date_parser=par3, index_col='Date')
>>> df3
4 5 6 7 8 9
Date
2009-07-01 03:00:00 101 13.03 89.33 0.6 287.69 0
2009-07-01 06:00:00 102 19.30 55.00 1.0 288.67 0
2009-07-01 09:00:00 103 22.33 39.67 1.0 289.60 0
2009-07-01 12:00:00 104 21.97 41.00 1.0 295.68 0
>>> dg3=pd.read_csv('fileB.txt',sep='\s+',engine='python',header=None,parse_dates={'Date': [0,1,2,3]}, date_parser=par4, index_col='Date')
>>> dg3
4 5 6 7 8 9
Date
2019-09-01 03:00:00 101 14.02 92.08 2.62 174.77 0.109
2019-09-01 06:00:00 102 13.79 92.86 2.79 179.29 0.046
2019-09-01 09:00:00 103 13.81 92.60 2.73 178.94 0.070
2019-09-01 12:00:00 104 13.31 95.20 2.91 179.38 0.015
Question: how do I read in both of these types with the same command? The only way I can think of is to first open the file and read the first line to deduce the hour format (column 3) and the separator, but that feels like a non-pythonic way.
Also, if the hour reading is a float such as 3.75, it would be OK to round it to the nearest integer and just set the minute reading to 0.
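One possible way to avoid sniffing the first line (a sketch only; read_any is a hypothetical helper, not a pandas function) is to normalise each line before handing it to read_csv, so both formats become plain whitespace-separated text, and to build the index by hand so float hours can be rounded:
import io
from datetime import datetime as dtdt
import pandas as pd
def read_any(path):
    # Strip leading/trailing whitespace and turn commas into spaces,
    # so fileA and fileB both become whitespace-separated text.
    with open(path) as fh:
        text = '\n'.join(line.strip().replace(',', ' ')
                         for line in fh if line.strip())
    df = pd.read_csv(io.StringIO(text), sep=r'\s+', header=None)
    # Columns 0-3 are year, month, day, hour; the hour may be a float
    # like 3.00 or 3.75, which we round to the nearest whole hour.
    df.index = pd.Index(
        [dtdt(int(y), int(m), int(d), int(round(float(h))))
         for y, m, d, h in df.iloc[:, :4].itertuples(index=False)],
        name='Date')
    return df.drop(columns=[0, 1, 2, 3])
df3 = read_any('fileA.txt')
dg3 = read_any('fileB.txt')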
Related
I have a pandas dataframe which looks like this:
Concentr 1 Concentr 2 Time
0 25.4 0.48 00:01:00
1 26.5 0.49 00:02:00
2 25.2 0.52 00:03:00
3 23.7 0.49 00:04:00
4 23.8 0.55 00:05:00
5 24.6 0.53 00:06:00
6 26.3 0.57 00:07:00
7 27.1 0.59 00:08:00
8 28.8 0.56 00:09:00
9 23.9 0.54 00:10:00
10 25.6 0.49 00:11:00
11 27.5 0.56 00:12:00
12 26.3 0.55 00:13:00
13 25.3 0.54 00:14:00
and I want to keep the max value of Concentr 1 in every 5-minute interval, along with the time it occurred and the value of Concentr 2 at that time. So, for the previous example I would like to have:
Concentr 1 Concentr 2 Time
0 26.5 0.49 00:02:00
1 28.8 0.56 00:09:00
2 27.5 0.56 00:12:00
My current approach would be: i) create an auxiliary variable with an ID for each 5-minute interval, e.g. 00:00 to 00:05 is interval 1, 00:05 to 00:10 is interval 2, etc.; ii) use the interval variable in a groupby to get the max Concentr 1 per interval; and iii) merge back to the initial df using both the interval variable and Concentr 1, thus identifying the corresponding time.
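For concreteness, that three-step approach might look roughly like this (a sketch, assuming the Time column has already been converted with pd.to_datetime):
df['interval'] = df['Time'].dt.floor('5T')                    # i) interval ID
m = df.groupby('interval')['Concentr 1'].max().reset_index()  # ii) max per interval
result = df.merge(m, on=['interval', 'Concentr 1'])           # iii) merge back (ties would duplicate rows)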
I would like to ask if there is a better / more efficient / more elegant way to do it.
Thank you very much for any help.
You can do a regular resample / groupby, and use the idxmax method to get the desired row for each group. Then use that to index your original data:
>>> df.loc[df.resample('5T', on='Time')['Concentr 1'].idxmax()]
Concentr 1 Concentr 2 Time
1 26.5 0.49 2021-10-09 00:02:00
8 28.8 0.56 2021-10-09 00:09:00
11 27.5 0.56 2021-10-09 00:12:00
This is assuming your 'Time' column is datetime-like, which I did with pd.to_datetime. You can convert the time column back with strftime. So in full:
df['Time'] = pd.to_datetime(df['Time'])
result = df.loc[df.resample('5T', on='Time')['Concentr 1'].idxmax()]
result['Time'] = result['Time'].dt.strftime('%H:%M:%S')
Giving:
Concentr 1 Concentr 2 Time
1 26.5 0.49 00:02:00
8 28.8 0.56 00:09:00
11 27.5 0.56 00:12:00
Another option:
df = df.set_index('Time')
# idxmax gives the timestamp of each 5-minute interval's maximum
idx = df.resample('5T')['Concentr 1'].idxmax()
df = df.loc[idx]
Then you would probably need to reset_index() if you do not wish Time to be your index.
You can also use this:
Group every n=5 rows and filter the original df based on the index of the max of "Concentr 1" (this relies on one row per minute and a default RangeIndex):
df = df[df.index.isin(df.groupby(df.index // 5)["Concentr 1"].idxmax())]
print(df)
Output:
Concentr 1 Concentr 2 Time
1 26.5 0.49 00:02:00
8 28.8 0.56 00:09:00
11 27.5 0.56 00:12:00
I have a dataframe that contains cell phone minute usage, logged by date of call and duration.
It looks like this (30 row sample):
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
I want to group by user_id AND call_date with the ultimate goal of calculating the number of minutes used per month over the course of the year, per user.
I thought I could accomplish this by using:
calls.groupby(['user_id','call_date'])['duration'].sum()
but the results aren't what I expected:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-08-14 13.86
2018-08-16 23.46
2018-08-17 8.11
2018-08-18 1.74
2018-08-19 10.73
2018-08-20 7.32
2018-08-21 0.00
2018-08-23 8.50
2018-08-24 8.63
2018-08-25 35.39
2018-08-27 10.57
2018-08-28 19.91
2018-08-29 0.54
2018-08-31 22.38
2018-09-01 7.53
2018-09-02 10.27
2018-09-03 30.66
2018-09-04 0.00
2018-09-05 9.09
2018-09-06 10.06
I'd hoped that it would be grouped like: user_id 1000, all calls for Jan with duration summed, all calls for Feb with duration summed, etc.
I am really new to Python and programming in general, and I am not sure what my next step should be to get these grouped by user_id and month of the year.
Thanks in advance for any insight you can offer.
Regards,
Jared
Something is not quite right in your setup. First of all, both of your tables are the same, so I am not sure if this is a cut-and-paste error or something else. Here is what I do with your data. Load it up like so; note we explicitly convert call_date to datetime:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(
"""
id user_id call_date duration
0 1000_93 1000 2018-12-27 8.52
1 1000_145 1000 2018-12-27 13.66
2 1000_247 1000 2018-12-27 14.48
3 1000_309 1000 2018-12-28 5.76
4 1000_380 1000 2018-12-30 4.22
5 1000_388 1000 2018-12-31 2.20
6 1000_510 1000 2018-12-27 5.75
7 1000_521 1000 2018-12-28 14.18
8 1000_530 1000 2018-12-28 5.77
9 1000_544 1000 2018-12-26 4.40
10 1000_693 1000 2018-12-31 4.31
11 1000_705 1000 2018-12-31 12.78
12 1000_735 1000 2018-12-29 1.70
13 1000_778 1000 2018-12-28 3.29
14 1000_826 1000 2018-12-26 9.96
15 1000_842 1000 2018-12-27 5.85
16 1001_0 1001 2018-09-06 10.06
17 1001_1 1001 2018-10-12 1.00
18 1001_2 1001 2018-10-17 15.83
19 1001_4 1001 2018-12-05 0.00
20 1001_5 1001 2018-12-13 6.27
21 1001_6 1001 2018-12-04 7.19
22 1001_8 1001 2018-11-17 2.45
23 1001_9 1001 2018-11-19 2.40
24 1001_11 1001 2018-11-09 1.00
25 1001_13 1001 2018-12-24 0.00
26 1001_19 1001 2018-11-15 30.00
27 1001_20 1001 2018-09-21 5.75
28 1001_23 1001 2018-10-27 0.98
29 1001_26 1001 2018-10-28 5.90
30 1001_29 1001 2018-09-30 14.78
"""), delim_whitespace = True, index_col=0)
df['call_date'] = pd.to_datetime(df['call_date'])
Then using
df.groupby(['user_id','call_date'])['duration'].sum()
does the expected grouping by user and by each date:
user_id call_date
1000 2018-12-26 14.36
2018-12-27 48.26
2018-12-28 29.00
2018-12-29 1.70
2018-12-30 4.22
2018-12-31 19.29
1001 2018-09-06 10.06
2018-09-21 5.75
2018-09-30 14.78
2018-10-12 1.00
2018-10-17 15.83
2018-10-27 0.98
2018-10-28 5.90
2018-11-09 1.00
2018-11-15 30.00
2018-11-17 2.45
2018-11-19 2.40
2018-12-04 7.19
2018-12-05 0.00
2018-12-13 6.27
2018-12-24 0.00
If you want to group by month, as you seem to suggest, you can use the Grouper functionality:
df.groupby(['user_id',pd.Grouper(key='call_date', freq='1M')])['duration'].sum()
which produces
user_id call_date
1000 2018-12-31 116.83
1001 2018-09-30 30.59
2018-10-31 23.71
2018-11-30 35.85
2018-12-31 13.46
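An equivalent sketch, if you would rather see a year-month label than an end-of-month date, is to group on a month Period instead of a Grouper:
df.groupby(['user_id', df['call_date'].dt.to_period('M')])['duration'].sum()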
Let me know if you are getting different results from following these steps.
I have two data frames. One has rows for every five minutes in a day:
df
TIMESTAMP TEMP
1 2011-06-01 00:05:00 24.5
200 2011-06-01 16:40:00 32.0
1000 2011-06-04 11:20:00 30.2
5000 2011-06-18 08:40:00 28.4
10000 2011-07-05 17:20:00 39.4
15000 2011-07-23 02:00:00 29.3
20000 2011-08-09 10:40:00 29.5
30656 2011-09-15 10:40:00 13.8
I have another dataframe that ranks the days
ranked
TEMP DATE RANK
62 43.3 2011-08-02 1.0
63 43.1 2011-08-03 2.0
65 43.1 2011-08-05 3.0
38 43.0 2011-07-09 4.0
66 42.8 2011-08-06 5.0
64 42.5 2011-08-04 6.0
84 42.2 2011-08-24 7.0
56 42.1 2011-07-27 8.0
61 42.1 2011-08-01 9.0
68 42.0 2011-08-08 10.0
Both the columns TIMESTAMP and DATE are datetime datatypes (dtype returns dtype('M8[ns]')).
What I want to do is add a RANK column to df, where each row gets the corresponding day's rank from ranked, based on its TIMESTAMP (so within a day, all the 5-minute timesteps share the same rank).
So, the final result would look something like this:
df
TIMESTAMP TEMP RANK
1 2011-06-01 00:05:00 24.5 98.0
200 2011-06-01 16:40:00 32.0 98.0
1000 2011-06-04 11:20:00 30.2 96.0
5000 2011-06-18 08:40:00 28.4 50.0
10000 2011-07-05 17:20:00 39.4 9.0
15000 2011-07-23 02:00:00 29.3 45.0
20000 2011-08-09 10:40:00 29.5 40.0
30656 2011-09-15 10:40:00 13.8 100.0
What I have done so far:
# Separate the date and times.
df['DATE'] = df['TIMESTAMP'].dt.normalize()
df['TIME'] = df['TIMESTAMP'].dt.time
df = df[['DATE', 'TIME', 'TEMP']]
df['RANK'] = 0
for index, row in df.iterrows():
df.loc[index, 'RANK'] = ranked[ranked['DATE']==row['DATE']]['RANK'].values
But I think I am going in a very wrong direction because this takes ages to complete.
How do I improve this code?
IIUC, you can play with the indexes to match the values (normalize keeps the dtype datetime64, so the index aligns with ranked's DATE):
df = df.set_index(df.TIMESTAMP.dt.normalize())\
.assign(RANK=ranked.set_index('DATE').RANK)\
.set_index(df.index)
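A more explicit alternative (a sketch along the same lines) is to map each row's day onto the rank table:
# Look up each timestamp's day in ranked via map; days missing from
# ranked simply get NaN.
df['RANK'] = df['TIMESTAMP'].dt.normalize().map(ranked.set_index('DATE')['RANK'])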
One year's data is shown as follows:
datetime data
2008-01-01 00:00:00 0.044
2008-01-01 00:30:00 0.031
2008-01-01 01:00:00 -0.25
.....
2008-01-31 23:00:00 0.036
2008-01-31 23:30:00 0.42
2008-01-02 00:00:00 0.078
2008-01-02 00:30:00 0.008
2008-01-02 01:00:00 0.09
2008-01-02 01:30:00 0.054
.....
2008-12-31 22:00:00 0.55
2008-12-31 22:30:00 0.05
2008-12-31 23:00:00 0.08
2008-12-31 23:30:00 0.033
There is a value every half-hour. I want the sum of all values in each day, converting the data to 365 rows of daily values.
year day sum values
2008 1 *
2008 2 *
...
2008 364 *
2008 365 *
You can use dt.year + dt.dayofyear with groupby and aggregate sum:
df = df.groupby([df['datetime'].dt.year, df['datetime'].dt.dayofyear]).sum()
print (df)
data
datetime datetime
2008 1 -0.175
2 0.230
31 0.456
366 0.713
And if you need a DataFrame, you can convert the index to columns and set the column names with rename_axis + reset_index:
df = (df.groupby([df['datetime'].dt.year, df['datetime'].dt.dayofyear])['data']
        .sum()
        .rename_axis(('year', 'dayofyear'))
        .reset_index())
print (df)
year dayofyear data
0 2008 1 -0.175
1 2008 2 0.230
2 2008 31 0.456
3 2008 366 0.713
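If a daily DatetimeIndex is acceptable instead of the (year, dayofyear) pair, a resample sketch gives the same daily sums:
# Assumes 'datetime' is a datetime64 column.
df.set_index('datetime')['data'].resample('D').sum()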
I have a dataframe named DateUnique made of all the unique dates (in datetime or string format) that are present in my other dataframe named A.
>>> print(A)
'dateLivraisonDemande' 'abscisse' 'BaseASDébut' 'BaseATDébut'
0 2015-05-27 2004-01-10 05:00:00 05:00:00
1 2015-05-27 2004-02-10 18:30:00 22:30:00
2 2015-05-27 2004-01-20 23:40:00 19:30:00
3 2015-05-27 2004-03-10 12:05:00 06:00:00
4 2015-05-27 2004-01-10 23:15:00 13:10:00
5 2015-05-27 2004-02-10 18:00:00 13:45:00
6 2015-05-27 2004-01-20 02:05:00 19:15:00
7 2015-05-27 2004-03-20 08:00:00 07:45:00
8 2015-05-29 2004-01-01 18:45:00 21:00:00
9 2015-05-27 2004-02-15 04:20:00 07:30:00
10 2015-04-10 2004-01-20 13:50:00 15:30:00
And:
>>> print(DateUnique)
1 1899-12-30
2 1900-01-01
3 2004-03-10
4 2004-03-20
5 2004-01-20
6 2015-05-29
7 2015-04-10
8 2015-05-27
9 2004-02-15
10 2004-02-10
How can I get the name of the columns that contain each date?
Maybe with something similar to this:
# input:
If row == '2015-04-10':
print(df.name_Of_Column([0]))
# output:
'dateLivraisonDemande'
You can make a function that returns the appropriate column. Use the vectorized isin function, and then check if any value is True.
df = pd.DataFrame({'dateLivraisonDemande': ['2015-05-27']*7 + ['2015-05-27', '2015-05-29', '2015-04-10'],
'abscisse': ['2004-02-10', '2004-01-20', '2004-03-10', '2004-01-10',
'2004-02-10', '2004-01-20', '2004-03-10', '2004-01-10',
'2004-02-15', '2004-01-20']})
DateUnique = pd.Series(['1899-12-30', '1900-01-01', '2004-03-10', '2004-03-20',
'2004-01-20', '2015-05-29', '2015-04-10', '2015-05-27',
'2004-02-15', '2004-02-10'])
def return_date_columns(date_input):
if df["dateLivraisonDemande"].isin([date_input]).any():
return "dateLivraisonDemande"
if df["abscisse"].isin([date_input]).any():
return "abscisse"
>>> DateUnique.apply(return_date_columns)
0 None
1 None
2 abscisse
3 None
4 abscisse
5 dateLivraisonDemande
6 dateLivraisonDemande
7 dateLivraisonDemande
8 abscisse
9 abscisse
dtype: object
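If the real frame has more date-like columns than these two, a sketch that scans every column instead of hard-coding the names might look like this (columns_containing is a hypothetical helper):
def columns_containing(date_input):
    # Return every column in which the date appears, or None.
    hits = [col for col in df.columns if df[col].isin([date_input]).any()]
    return hits or None
DateUnique.apply(columns_containing)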