A B C D yearweek
0 245 95 60 30 2014-48
1 245 15 70 25 2014-49
2 150 275 385 175 2014-50
3 100 260 170 335 2014-51
4 580 925 535 2590 2015-02
5 630 126 485 2115 2015-03
6 425 90 905 1085 2015-04
7 210 670 655 945 2015-05
The last column contains the year along with the week number. Is it possible to convert this to a datetime column with pd.to_datetime?
I've tried:
pd.to_datetime(df.yearweek, format='%Y-%U')
0 2014-01-01
1 2014-01-01
2 2014-01-01
3 2014-01-01
4 2015-01-01
5 2015-01-01
6 2015-01-01
7 2015-01-01
Name: yearweek, dtype: datetime64[ns]
But that output is incorrect, even though I believe %U is the format string for week number. What am I missing here?
You need another parameter to specify the day of the week, because strptime ignores the week directive unless a weekday is also given - check this:
df = pd.to_datetime(df.yearweek.add('-0'), format='%Y-%W-%w')
print (df)
0 2014-12-07
1 2014-12-14
2 2014-12-21
3 2014-12-28
4 2015-01-18
5 2015-01-25
6 2015-02-01
7 2015-02-08
Name: yearweek, dtype: datetime64[ns]
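To see what is happening (a minimal sketch; %U counts weeks starting from Sunday, %W from Monday, and the appended '-0' parsed by %w pins each date to the Sunday of its week):

import pandas as pd

s = pd.Series(['2014-48', '2014-49', '2015-02'])

# Without a weekday the week number is ignored, exactly as in the
# question: every value collapses to January 1st of its year
print(pd.to_datetime(s, format='%Y-%W'))

# With the weekday appended, the week number takes effect
print(pd.to_datetime(s.add('-0'), format='%Y-%W-%w'))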
I want to filter out a specific value (9999) that appears many times in a subset of my dataset. This is what I have done so far, but I'm not sure how to filter out all the 9999 values.
import pandas as pd
import statistics
df=pd.read_csv('Area(2).txt',delimiter='\t')
Initially, this is what part of my dataset for 30 days (containing 600+ values) looks like; I'm just showing the first two rows here.
No Date Time Rand Col Value
0 2161 1 4 1991 0:00 181 1 9999
1 2162 1 4 1991 1:00 181 2 9999
Now I wanted to select the range of numbers under the column "Value" between 23-25 April. So I did the following:
df5=df.iloc[528:602,5]
print(df5)
The range of values I get for 23-25 April looks like this:
528 9999
529 9999
530 9999
531 9999
532 9999
597 9999
598 9999
599 9999
600 9999
601 9999
Name: Value, Length: 74, dtype: int64
I want to filter out all 9999 values from this subset. I have tried a number of ways to get rid of these values, but I keep getting IndexError: positional indexers are out-of-bounds, so I am unable to remove the 9999s and do further work such as finding the variance and standard deviation of the selected subset.
If this helps, I also tried to filter out 9999 in the beginning and it looked like this:
df2=df[df.Value!=9999]
print(df2)
No Date Time Rand Col Value
6 2167 1 4 1991 6:00 181 7 152
7 2168 1 4 1991 7:00 181 8 178
8 2169 1 4 1991 8:00 181 9 239
9 2170 1 4 1991 9:00 181 10 296
10 2171 1 4 1991 10:00 181 11 337
.. ... ... ... ... ... ...
638 2799 27 4 1991 14:00 234 3 193
639 2800 27 4 1991 15:00 234 4 162
640 2801 27 4 1991 16:00 234 5 144
641 2802 27 4 1991 17:00 234 6 151
642 2803 27 4 1991 18:00 234 7 210
[351 rows x 6 columns]
Then I tried to obtain the range of values between 23 April and 25 April as shown below, but I always get IndexError: positional indexers are out-of-bounds:
df6=df2.iloc[528:602,5]
print(df6)
How can I properly filter out the value I mentioned and obtain the subset of the dataset that I need?
Given:
No Date Time Rand Col Value
0 2161 1 4 1991 0:00 181 1 9999
1 2162 1 4 1991 1:00 181 2 9999
2 2167 1 4 1991 6:00 181 7 152
3 2168 1 4 1991 7:00 181 8 178
4 2169 1 4 1991 8:00 181 9 239
5 2170 1 4 1991 9:00 181 10 296
6 2171 1 4 1991 10:00 181 11 337
7 2799 27 4 1991 14:00 234 3 193
8 2800 27 4 1991 15:00 234 4 162
9 2801 27 4 1991 16:00 234 5 144
10 2802 27 4 1991 17:00 234 6 151
11 2803 27 4 1991 18:00 234 7 210
Your IndexError comes from using positional slicing on the filtered frame: after df[df.Value!=9999] only 351 rows remain, so positions 528:602 no longer exist. Selecting by date label avoids this entirely. First, let's make a proper datetime index:
# Your dates are pretty scuffed: "day month year" with no zero padding,
# so rearrange them into "month-day-year" strings to_datetime can parse
df.index = pd.to_datetime(df.Date.str.split().apply(lambda x: f'{x[1].zfill(2)}-{x[0].zfill(2)}-{x[2]}') + ' ' + df.Time)
df.drop(['Date', 'Time'], axis=1, inplace=True)
This gives:
No Rand Col Value
1991-04-01 00:00:00 2161 181 1 9999
1991-04-01 01:00:00 2162 181 2 9999
1991-04-01 06:00:00 2167 181 7 152
1991-04-01 07:00:00 2168 181 8 178
1991-04-01 08:00:00 2169 181 9 239
1991-04-01 09:00:00 2170 181 10 296
1991-04-01 10:00:00 2171 181 11 337
1991-04-27 14:00:00 2799 234 3 193
1991-04-27 15:00:00 2800 234 4 162
1991-04-27 16:00:00 2801 234 5 144
1991-04-27 17:00:00 2802 234 6 151
1991-04-27 18:00:00 2803 234 7 210
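As an aside, the same index can be built without rearranging the string by handing to_datetime an explicit day-month-year format (a sketch, assuming Date really is "day month year" as above):

# Equivalent parse: let the format string handle the field order
df.index = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d %m %Y %H:%M')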
Then, we can easily fulfill your conditions (replace the dates with your own desired range).
df[df.Value.ne(9999)].loc['1991-04-01':'1991-04-01']
# df[df.Value.ne(9999)].loc['1991-04-23':'1991-04-25']
Output:
No Rand Col Value
1991-04-01 06:00:00 2167 181 7 152
1991-04-01 07:00:00 2168 181 8 178
1991-04-01 08:00:00 2169 181 9 239
1991-04-01 09:00:00 2170 181 10 296
1991-04-01 10:00:00 2171 181 11 337
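From there, the statistics mentioned in the question are one-liners (a sketch, using the 23-25 April range you wanted):

subset = df[df.Value.ne(9999)].loc['1991-04-23':'1991-04-25', 'Value']
print(subset.var(), subset.std())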
Here is my DF
Start-Time Running-Time Speed-Avg HR-Avg
0 2016-12-18 10:8:14 0:24:2 20 138
1 2016-12-18 10:8:14 0:24:2 20 138
2 2016-12-23 8:52:36 0:31:19 16 134
3 2016-12-23 8:52:36 0:31:19 16 134
4 2016-12-25 8:0:51 0:30:10 50 135
5 2016-12-25 8:0:51 0:30:10 50 135
6 2016-12-26 8:41:26 0:10:1 27 116
7 2016-12-26 8:41:26 0:10:1 27 116
8 2017-1-7 11:16:9 0:26:15 22 124
9 2017-1-7 11:16:9 0:26:15 22 124
10 2017-1-10 19:2:54 0:53:51 5 142
11 2017-1-10 19:2:54 0:53:51 5 142
and I have been trying to format this column in H:M:S format
using
timeDF = (pd.to_datetime(cleanDF['Running-Time'], format='%H:%M:%S'))
but I have been getting this error: ValueError: time data ' 0:24:2' does not match format '%H:%M:%S' (match)
Thank you in advance.
The problem is the leading whitespace in the values (note the space in ' 0:24:2'), so you need str.strip. Or, if you create the DataFrame from a file with read_csv, add the parameter skipinitialspace=True so the whitespace is stripped at read time:
cleanDF = pd.read_csv(file, skipinitialspace=True)
timeDF = (pd.to_datetime(cleanDF['Running-Time'].str.strip(), format='%H:%M:%S'))
print (timeDF)
0 1900-01-01 00:24:02
1 1900-01-01 00:24:02
2 1900-01-01 00:31:19
3 1900-01-01 00:31:19
4 1900-01-01 00:30:10
5 1900-01-01 00:30:10
6 1900-01-01 00:10:01
7 1900-01-01 00:10:01
8 1900-01-01 00:26:15
9 1900-01-01 00:26:15
10 1900-01-01 00:53:51
11 1900-01-01 00:53:51
Name: Running-Time, dtype: datetime64[ns]
But it may be better to convert the values to timedeltas with to_timedelta:
timeDF=(pd.to_timedelta(cleanDF['Running-Time'].str.strip()))
print (timeDF)
0 00:24:02
1 00:24:02
2 00:31:19
3 00:31:19
4 00:30:10
5 00:30:10
6 00:10:01
7 00:10:01
8 00:26:15
9 00:26:15
10 00:53:51
11 00:53:51
Name: Running-Time, dtype: timedelta64[ns]
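One advantage of timedeltas: they convert directly to numbers when you need them for statistics (a small sketch, assuming the timeDF just computed):

# total_seconds gives float seconds; divide for minutes
minutes = timeDF.dt.total_seconds() / 60
print(minutes.head())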
Get the value of each quarter from cumulative income statement reports with pandas
Is there any way to do this with python/pandas?
I have an example dataset like the one below
(please suppose this company's fiscal year runs from Jan to Dec).
qend revenue profit
2015-03-31 2,453 298
2015-06-30 5,076 520
2015-09-30 8,486 668
2015-12-31 16,724 820
2016-03-31 1,880 413
2016-06-30 3,989 568
2016-09-30 7,895 621
2016-12-31 16,621 816
I want to know how much revenue and profit this company earns in each quarter.
But the report only shows the numbers cumulatively.
Q1 is fine as it is, but for Q2-Q4 I have to take the difference from the previous quarter.
These are my expected results:
qend revenue profit mycomment
2015-03-31 2,453 298 copy from Q1
2015-06-30 2,623 222 delta of Q1 and Q2
2015-09-30 3,410 148 delta of Q2 and Q3
2015-12-31 8,238 152 delta of Q3 and Q4
2016-03-31 1,880 413 copy from Q1
2016-06-30 2,109 155 delta of Q1 and Q2
2016-09-30 3,906 53 delta of Q2 and Q3
2016-12-31 8,726 195 delta of Q3 and Q4
The difficulty is that it is not simply a delta from the last row: each Q1 needs no delta, while Q2-Q4 do.
If there is no easy way in pandas, I'll code it in plain Python.
I think you need quarter to identify the Q1 rows, and then pick either the reported value or the diff by condition:
m = df['qend'].dt.quarter == 1
df['diff_profit'] = np.where(m, df['profit'], df['profit'].diff())
#same as
#df['diff_profit'] = df['profit'].where(m, df['profit'].diff())
print (df)
qend revenue profit diff_profit
0 2015-03-31 2,453 298 298.0
1 2015-06-30 5,076 520 222.0
2 2015-09-30 8,486 668 148.0
3 2015-12-31 16,724 820 152.0
4 2016-03-31 1,880 413 413.0
5 2016-06-30 3,989 568 155.0
6 2016-09-30 7,895 621 53.0
7 2016-12-31 16,621 816 195.0
Or, writing the difference out explicitly with shift (note the order, current minus previous, otherwise the sign flips):
df['diff_profit'] = np.where(m, df['profit'], df['profit'] - df['profit'].shift())
print (df)
qend revenue profit diff_profit
0 2015-03-31 2,453 298 298.0
1 2015-06-30 5,076 520 222.0
2 2015-09-30 8,486 668 148.0
3 2015-12-31 16,724 820 152.0
4 2016-03-31 1,880 413 413.0
5 2016-06-30 3,989 568 155.0
6 2016-09-30 7,895 621 53.0
7 2016-12-31 16,621 816 195.0
Detail:
print (df['qend'].dt.quarter)
0 1
1 2
2 3
3 4
4 1
5 2
6 3
7 4
Name: qend, dtype: int64
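The same mask extends to both requested columns at once; a sketch, assuming revenue was read as numbers (e.g. read_csv(..., thousands=',')) rather than as '2,453'-style strings:

import numpy as np

m = df['qend'].dt.quarter == 1
for col in ['revenue', 'profit']:
    # Q1 keeps the reported value; later quarters take the delta
    df['diff_' + col] = np.where(m, df[col], df[col].diff())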
I've taken a large data file and managed to use groupby and value_counts to get the dataframe below. However, I want to reshape it so the company is on the left, the months are on top, and each cell holds the number of calls that month (currently the third column).
Here is my code to sort:
data = pd.DataFrame.from_csv('MYDATA.csv')
data[['recvd_dttm','CompanyName']]
data['recvd_dttm'].value_counts()
count = data.groupby(["recvd_dttm","CompanyName"]).size()
df = pd.DataFrame(count)
df.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls')
Here is my output df:
recvd_dttm CompanyName
1/1/2015 11:42 Company 1 1
1/1/2015 14:29 Company 2 1
1/1/2015 8:12 Company 4 1
1/1/2015 9:53 Company 1 1
1/10/2015 11:38 Company 3 1
1/10/2015 11:31 Company 5 1
1/10/2015 12:04 Company 2 1
I want
Company Jan Feb Mar Apr May
Company 1 10 4 45 40 34
Company 2 2 5 56 5 57
Company 3 3 7 71 6 53
Company 4 4 4 38 32 2
Company 5 20 3 3 3 29
I know there is a nifty pivot function for dataframes, documented at http://pandas.pydata.org/pandas-docs/stable/reshaping.html, so I've been trying to use df.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls')
One problem is that the third column doesn't have a name, so I can't use it for values = 'NumberCalls'. The second problem is figuring out how to take the datetime format in my dataframe and make it display by month only.
Edit:
CompanyName is the first column, recvd_dttm is the 15th column. This is my code after some more attempts:
data = pd.DataFrame.from_csv('MYDATA.csv')
data[['recvd_dttm','CompanyName']]
data['recvd_dttm'].value_counts()
RatedCustomerCallers = data['CompanyName'].value_counts()
count = data.groupby(["recvd_dttm","CompanyName"]).size()
df = pd.DataFrame(count).set_index('recvd_dttm').sort_index()
df.index = pd.to_datetime(df.index, format='%m/%d/%Y %H:%M')
result = df.groupby([lambda idx: idx.month, 'CompanyName']).agg({df.columns[1]: sum}).reset_index()
result.columns = ['Month', 'CompanyName', 'NumberCalls']
result.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls')
It throws KeyError: 'recvd_dttm' and never reaches the result line.
You need to aggregate the data before creating the pivot table. (The KeyError happens because after the groupby, recvd_dttm lives in the index rather than in a column, so set_index('recvd_dttm') cannot find it.) If a column has no name, you can either refer to it as df.iloc[:, 1] (the 2nd column) or simply rename it.
import pandas as pd
import numpy as np
# just simulate your data
np.random.seed(0)
dates = np.random.choice(pd.date_range('2015-01-01 00:00:00', '2015-06-30 00:00:00', freq='1h'), 10000)
company = np.random.choice(['company' + x for x in '1 2 3 4 5'.split()], 10000)
df = pd.DataFrame(dict(recvd_dttm=dates, CompanyName=company)).set_index('recvd_dttm').sort_index()
df['C'] = 1
df.columns = ['CompanyName', '']
Out[34]:
CompanyName
recvd_dttm
2015-01-01 00:00:00 company2 1
2015-01-01 00:00:00 company2 1
2015-01-01 00:00:00 company1 1
2015-01-01 00:00:00 company2 1
2015-01-01 01:00:00 company4 1
2015-01-01 01:00:00 company2 1
2015-01-01 01:00:00 company5 1
2015-01-01 03:00:00 company3 1
2015-01-01 03:00:00 company2 1
2015-01-01 03:00:00 company3 1
2015-01-01 04:00:00 company4 1
2015-01-01 04:00:00 company1 1
2015-01-01 04:00:00 company3 1
2015-01-01 05:00:00 company2 1
2015-01-01 06:00:00 company5 1
... ... ..
2015-06-29 19:00:00 company2 1
2015-06-29 19:00:00 company2 1
2015-06-29 19:00:00 company3 1
2015-06-29 19:00:00 company3 1
2015-06-29 19:00:00 company5 1
2015-06-29 19:00:00 company5 1
2015-06-29 20:00:00 company1 1
2015-06-29 20:00:00 company4 1
2015-06-29 22:00:00 company1 1
2015-06-29 22:00:00 company2 1
2015-06-29 22:00:00 company4 1
2015-06-30 00:00:00 company1 1
2015-06-30 00:00:00 company2 1
2015-06-30 00:00:00 company1 1
2015-06-30 00:00:00 company4 1
[10000 rows x 2 columns]
# first group by month and company name, calculate the sum of calls, and reset the index
# since the count column has no name, simply tell pandas it is the 2nd column we want to sum
result = df.groupby([lambda idx: idx.month, 'CompanyName']).agg({df.columns[1]: sum}).reset_index()
# rename the columns
result.columns = ['Month', 'CompanyName', 'counts']
Out[41]:
Month CompanyName counts
0 1 company1 328
1 1 company2 337
2 1 company3 342
3 1 company4 345
4 1 company5 331
5 2 company1 295
6 2 company2 300
7 2 company3 328
8 2 company4 304
9 2 company5 329
10 3 company1 366
11 3 company2 398
12 3 company3 339
13 3 company4 336
14 3 company5 345
15 4 company1 322
16 4 company2 348
17 4 company3 351
18 4 company4 340
19 4 company5 312
20 5 company1 347
21 5 company2 354
22 5 company3 347
23 5 company4 363
24 5 company5 312
25 6 company1 316
26 6 company2 311
27 6 company3 331
28 6 company4 307
29 6 company5 316
# create pivot table
result.pivot(index='CompanyName', columns='Month', values='counts')
Out[44]:
Month 1 2 3 4 5 6
CompanyName
company1 328 295 366 322 347 316
company2 337 300 398 348 354 311
company3 342 328 339 351 347 331
company4 345 304 336 340 363 307
company5 331 329 345 312 312 316
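For what it's worth, once the DatetimeIndex exists, pd.crosstab can build the same table in one step (a sketch against the simulated df above; no dummy count column needed):

# crosstab counts the (company, month) pairs directly
pd.crosstab(df['CompanyName'], df.index.month, rownames=['CompanyName'], colnames=['Month'])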
I have the table below in a Pandas dataframe:
date user_id whole_cost cost1
02/10/2012 00:00:00 1 1790 12
07/10/2012 00:00:00 1 364 15
30/01/2013 00:00:00 1 280 10
02/02/2013 00:00:00 1 259 24
05/03/2013 00:00:00 1 201 39
02/10/2012 00:00:00 3 623 1
07/12/2012 00:00:00 3 90 0
30/01/2013 00:00:00 3 312 90
02/02/2013 00:00:00 5 359 45
05/03/2013 00:00:00 5 301 34
02/02/2013 00:00:00 5 359 1
05/03/2013 00:00:00 5 801 12
..
The table was extracted from a csv file using the following code:
import pandas as pd
newnames = ['date','user_id', 'whole_cost', 'cost1']
df = pd.read_csv('expenses.csv', names = newnames, index_col = 'date')
I have to analyse the profile of my users and for this purpose:
I would like to group the queries by month for each user (there are thousands of users), summing whole_cost over the entire month. For example, user_id=1 has a whole_cost of 1790 on 02/10/2012 and 364 on 07/10/2012, so the new table should have a single entry of 2154 (as the whole cost) on 31/10/2012; every date in the transformed table will be a month end representing the whole month to which it relates.
In 0.14 you'll be able to groupby monthly and another column at the same time:
In [11]: df
Out[11]:
user_id whole_cost cost1
2012-10-02 1 1790 12
2012-10-07 1 364 15
2013-01-30 1 280 10
2013-02-02 1 259 24
2013-03-05 1 201 39
2012-10-02 3 623 1
2012-12-07 3 90 0
2013-01-30 3 312 90
2013-02-02 5 359 45
2013-03-05 5 301 34
2013-02-02 5 359 1
2013-03-05 5 801 12
In [12]: df1 = df.sort_index() # requires sorted DatetimeIndex
In [13]: df1.groupby([pd.TimeGrouper(freq='M'), 'user_id'])['whole_cost'].sum()
Out[13]:
user_id
2012-10-31 1 2154
3 623
2012-12-31 3 90
2013-01-31 1 280
3 312
2013-02-28 1 259
5 718
2013-03-31 1 201
5 1102
Name: whole_cost, dtype: int64
until 0.14 I think you're stuck with doing two groupbys:
In [14]: g = df.groupby('user_id')['whole_cost']
In [15]: g.resample('M', how='sum').dropna()
Out[15]:
user_id
1 2012-10-31 2154
2013-01-31 280
2013-02-28 259
2013-03-31 201
3 2012-10-31 623
2012-12-31 90
2013-01-31 312
5 2013-02-28 718
2013-03-31 1102
dtype: float64
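In modern pandas the how= argument to resample is gone; the same two-step would now be spelled roughly like this sketch:

g = df.groupby('user_id')['whole_cost']
print(g.resample('M').sum())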
With TimeGrouper deprecated, you can replace it with pd.Grouper to get the same results (note that key='date' expects a datetime column named date, not an index):
df.groupby(['user_id', pd.Grouper(key='date', freq='M')]).agg({'whole_cost':sum})
The same pattern also groups by day of week instead of month:
df.groupby(['user_id', df['date'].dt.dayofweek]).agg({'whole_cost':sum})
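Since the frames above keep the date in an unnamed index, here is a small sketch of the reshuffle needed before key='date' can find it (rename_axis/reset_index are assumptions about your frame's layout):

# Move the DatetimeIndex into a proper 'date' column first
out = (df.rename_axis('date')
         .reset_index()
         .groupby(['user_id', pd.Grouper(key='date', freq='M')])['whole_cost']
         .sum())
print(out)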