I have a dataframe with some dates, and associated data for each date, that I am reading in from a CSV file (the file is relatively small, on the order of tens of thousands of rows and ~10 columns):
memid date a b
10000 7/3/2017 221 143
10001 7/4/2017 442 144
10002 7/6/2017 132 145
10003 7/8/2017 742 146
10004 7/10/2017 149 147
I want to add a column, "date_diff", to this dataframe that holds the number of days between each date and the previous date (the rows are always sorted by date):
memid date a b date_diff
10000 7/3/2017 221 143 NaN
10001 7/4/2017 442 144 1
10002 7/6/2017 132 145 2
10003 7/8/2017 742 146 2
10004 7/10/2017 149 147 2
I am having trouble figuring out a good way to create this "date_diff" column as iterating row by row tends to be frowned upon when using pandas/numpy. Is there an easy way to create this column in python/pandas/numpy or is this job better done before the csv is read into my script?
Thanks!
EDIT: Thanks to jpp and Tai for their answers. They cover the original question, but I have a follow-up:
What if my dataset has multiple rows for each date? Is there a way to easily check the difference between each group of dates to produce an output like the example below? Is it easier if there are a set number of rows for each date?
memid date a b date_diff
10000 7/3/2017 221 143 NaN
10001 7/3/2017 442 144 NaN
10002 7/4/2017 132 145 1
10003 7/4/2017 742 146 1
10004 7/6/2017 149 147 2
10005 7/6/2017 457 148 2
Edit to answer OP's new question: what if there are duplicates in the date column?
Setup: create a df_no_dup frame that does not contain duplicate dates
df.date = pd.to_datetime(df.date, infer_datetime_format=True)
df_no_dup = df.drop_duplicates("date").copy()           # one row per unique date
df_no_dup["diff"] = df_no_dup["date"].diff().dt.days    # days between consecutive unique dates
Method 1: merge
df.merge(df_no_dup[["date", "diff"]], on="date", how="left")
memid date a b diff
0 10000 2017-07-03 221 143 NaN
1 10001 2017-07-03 442 144 NaN
2 10002 2017-07-04 132 145 1.0
3 10003 2017-07-04 742 146 1.0
4 10004 2017-07-06 149 147 2.0
5 10005 2017-07-06 457 148 2.0
Method 2: map
df["diff"] = df["date"].map(df_no_dup.set_index("date")["diff"])
Try this.
df.date = pd.to_datetime(df.date, infer_datetime_format=True)
df.date.diff()
0 NaT
1 1 days
2 2 days
3 2 days
4 2 days
Name: date, dtype: timedelta64[ns]
To convert to integers:
import numpy as np
df['diff'] = df['date'].diff() / np.timedelta64(1, 'D')
# memid date a b diff
# 0 10000 2017-07-03 221 143 NaN
# 1 10001 2017-07-04 442 144 1.0
# 2 10002 2017-07-06 132 145 2.0
# 3 10003 2017-07-08 742 146 2.0
# 4 10004 2017-07-10 149 147 2.0
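If you would rather not import numpy just for this, an equivalent spelling (a minor variation, not from the answer above) uses the .dt accessor on the timedelta result:
df['diff'] = df['date'].diff().dt.days   # same integer day counts, NaN in the first row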
I want to filter out a specific value (9999) that appears many times from a subset of my dataset. This is what I have done so far, but I'm not sure how to filter out all the 9999 values.
import pandas as pd
import statistics
df=pd.read_csv('Area(2).txt',delimiter='\t')
Initially, this is what part of my dataset for 30 days (containing 600+ values) looks like; I'm just showing the first two rows here.
No Date Time Rand Col Value
0 2161 1 4 1991 0:00 181 1 9999
1 2162 1 4 1991 1:00 181 2 9999
Now I wanted to select the range of numbers under the "Value" column between 23-25 April, so I did the following:
df5=df.iloc[528:602,5]
print(df5)
The range of values I get for 23-25 April looks like this:
528 9999
529 9999
530 9999
531 9999
532 9999
...
597 9999
598 9999
599 9999
600 9999
601 9999
Name: Value, Length: 74, dtype: int64
I want to filter out all 9999 values from this subset. I have tried a number of ways to get rid of these values, but I keep getting IndexError: positional indexers are out-of-bounds, so I am unable to drop the 9999s and do further work, like finding the variance and standard deviation of the selected subset.
If this helps, I also tried to filter out 9999 in the beginning and it looked like this:
df2=df[df.Value!=9999]
print(df2)
No Date Time Rand Col Value
6 2167 1 4 1991 6:00 181 7 152
7 2168 1 4 1991 7:00 181 8 178
8 2169 1 4 1991 8:00 181 9 239
9 2170 1 4 1991 9:00 181 10 296
10 2171 1 4 1991 10:00 181 11 337
.. ... ... ... ... ... ...
638 2799 27 4 1991 14:00 234 3 193
639 2800 27 4 1991 15:00 234 4 162
640 2801 27 4 1991 16:00 234 5 144
641 2802 27 4 1991 17:00 234 6 151
642 2803 27 4 1991 18:00 234 7 210
[351 rows x 6 columns]
Then I tried to obtain the range of column values between 23 April and 25 April with the code below, but I always get IndexError: positional indexers are out-of-bounds:
df6=df2.iloc[528:602,5]
print(df6)
How can I properly filter out the value I mentioned and obtain the subset of the dataset that I need?
Given:
No Date Time Rand Col Value
0 2161 1 4 1991 0:00 181 1 9999
1 2162 1 4 1991 1:00 181 2 9999
2 2167 1 4 1991 6:00 181 7 152
3 2168 1 4 1991 7:00 181 8 178
4 2169 1 4 1991 8:00 181 9 239
5 2170 1 4 1991 9:00 181 10 296
6 2171 1 4 1991 10:00 181 11 337
7 2799 27 4 1991 14:00 234 3 193
8 2800 27 4 1991 15:00 234 4 162
9 2801 27 4 1991 16:00 234 5 144
10 2802 27 4 1991 17:00 234 6 151
11 2803 27 4 1991 18:00 234 7 210
First, let's make a proper datetime index:
# The dates are in an awkward unpadded 'D M YYYY' format, so some reformatting is needed to make them parseable...
df.index = pd.to_datetime(df.Date.str.split().apply(lambda x: f'{x[1].zfill(2)}-{x[0].zfill(2)}-{x[2]}') + ' ' + df.Time)
df.drop(['Date', 'Time'], axis=1, inplace=True)
This gives:
No Rand Col Value
1991-04-01 00:00:00 2161 181 1 9999
1991-04-01 01:00:00 2162 181 2 9999
1991-04-01 06:00:00 2167 181 7 152
1991-04-01 07:00:00 2168 181 8 178
1991-04-01 08:00:00 2169 181 9 239
1991-04-01 09:00:00 2170 181 10 296
1991-04-01 10:00:00 2171 181 11 337
1991-04-27 14:00:00 2799 234 3 193
1991-04-27 15:00:00 2800 234 4 162
1991-04-27 16:00:00 2801 234 5 144
1991-04-27 17:00:00 2802 234 6 151
1991-04-27 18:00:00 2803 234 7 210
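As a side note, a possibly simpler variation of the parsing step (assuming the Date column is always 'day month year' and Time is 'H:MM') lets pandas handle the unpadded pieces directly instead of rebuilding the string by hand:
df.index = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d %m %Y %H:%M')  # e.g. '1 4 1991 0:00'
df = df.drop(columns=['Date', 'Time'])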
Then, we can easily fulfill your conditions (replace the dates with your own desired range). Selecting by date label also sidesteps the IndexError: after filtering, df2 has only 351 rows, so positional slices like iloc[528:602] point past the end of the frame.
df[df.Value.ne(9999)].loc['1991-04-01':'1991-04-01']
# df[df.Value.ne(9999)].loc['1991-04-23':'1991-04-25']
Output:
No Rand Col Value
1991-04-01 06:00:00 2167 181 7 152
1991-04-01 07:00:00 2168 181 8 178
1991-04-01 08:00:00 2169 181 9 239
1991-04-01 09:00:00 2170 181 10 296
1991-04-01 10:00:00 2171 181 11 337
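From there, the statistics mentioned in the question fall out directly; a small sketch, assuming the 23-25 April range:
subset = df[df.Value.ne(9999)].loc['1991-04-23':'1991-04-25', 'Value']
print(subset.var(), subset.std())   # pandas defaults to the sample (ddof=1) variance and std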
I have a dataset with number of startstation IDS, endstation IDS and the duration of travel for bikes in a city.
The data dates back to 2017, so some of those stations no longer exist.
I have the list of those station IDs. How can I remove rows from the dataframe which either starts or ends at those stations?
For example, if I want to remove StartStation ID = 135, which is in index 4 and 5, what should I do? This extends to a million rows where 135 can be present anywhere.
Bike Id StartStation Id EndStation Id Duration
0 395 573 137.0 660.0
1 12931 399 507.0 420.0
2 7120 399 507.0 420.0
3 1198 599 616.0 300.0
4 10739 135 486.0 1260.0
5 10949 135 486.0 1260.0
6 8831 193 411.0 540.0
7 8778 266 770.0 600.0
8 700 137 294.0 540.0
9 5017 456 39.0 3000.0
10 4359 444 445.0 240.0
11 2801 288 288.0 5340.0
12 9525 265 592.0 300.0
I'll call your list of IDs to remove removed_ids.
df = df.loc[
    (~df['StartStation Id'].isin(removed_ids)) &   # column names as shown in your table
    (~df['EndStation Id'].isin(removed_ids))
]
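For a concrete run, something like this (the IDs in removed_ids are just placeholders for your real list):
removed_ids = [135, 399]                          # hypothetical retired station IDs
mask = df['StartStation Id'].isin(removed_ids) | df['EndStation Id'].isin(removed_ids)
df = df[~mask]                                    # keep only trips that avoid those stations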
A B C D yearweek
0 245 95 60 30 2014-48
1 245 15 70 25 2014-49
2 150 275 385 175 2014-50
3 100 260 170 335 2014-51
4 580 925 535 2590 2015-02
5 630 126 485 2115 2015-03
6 425 90 905 1085 2015-04
7 210 670 655 945 2015-05
The last column contains the year along with the week number. Is it possible to convert this to a datetime column with pd.to_datetime?
I've tried:
pd.to_datetime(df.yearweek, format='%Y-%U')
0 2014-01-01
1 2014-01-01
2 2014-01-01
3 2014-01-01
4 2015-01-01
5 2015-01-01
6 2015-01-01
7 2015-01-01
Name: yearweek, dtype: datetime64[ns]
But that output is incorrect, even though I believe %U is the format code for the week number. What am I missing here?
You need another directive to specify the day of the week; without one, the week number is ignored when the date is built, which is why every row collapsed to January 1. Check this:
df = pd.to_datetime(df.yearweek.add('-0'), format='%Y-%W-%w')
print (df)
0 2014-12-07
1 2014-12-14
2 2014-12-21
3 2014-12-28
4 2015-01-18
5 2015-01-25
6 2015-02-01
7 2015-02-08
Name: yearweek, dtype: datetime64[ns]
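If you'd rather keep the original yearweek strings and store the parsed dates in a new column, a minimal variation of the same idea (anchoring each week on Monday instead of Sunday) is:
df['week_start'] = pd.to_datetime(df['yearweek'] + '-1', format='%Y-%W-%w')  # '-1' with %w means Monday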
I have a dataframe as follows:
A B
zDate
01-JAN-17 100 200
02-JAN-17 111 203
03-JAN-17 NaN 202
04-JAN-17 109 205
05-JAN-17 101 211
06-JAN-17 105 NaN
07-JAN-17 104 NaN
What is the best way to fill the missing values using the last available ones?
Following is the intended result:
A B
zDate
01-JAN-17 100 200
02-JAN-17 111 203
03-JAN-17 111 202
04-JAN-17 109 205
05-JAN-17 101 211
06-JAN-17 105 211
07-JAN-17 104 211
Use the ffill function, which is the same as fillna with method='ffill':
df = df.ffill()
print (df)
A B
zDate
01-JAN-17 100.0 200.0
02-JAN-17 111.0 203.0
03-JAN-17 111.0 202.0
04-JAN-17 109.0 205.0
05-JAN-17 101.0 211.0
06-JAN-17 105.0 211.0
07-JAN-17 104.0 211.0
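A couple of equivalent spellings, in case they are useful (minor variations, not from the answer above):
df = df.fillna(method='ffill')   # older spelling; newer pandas prefers df.ffill()
df['B'] = df['B'].ffill()        # forward-fill just one column
df = df.bfill()                  # the opposite direction: fill from the next valid value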
I have the table below in a Pandas dataframe:
date user_id whole_cost cost1
02/10/2012 00:00:00 1 1790 12
07/10/2012 00:00:00 1 364 15
30/01/2013 00:00:00 1 280 10
02/02/2013 00:00:00 1 259 24
05/03/2013 00:00:00 1 201 39
02/10/2012 00:00:00 3 623 1
07/12/2012 00:00:00 3 90 0
30/01/2013 00:00:00 3 312 90
02/02/2013 00:00:00 5 359 45
05/03/2013 00:00:00 5 301 34
02/02/2013 00:00:00 5 359 1
05/03/2013 00:00:00 5 801 12
..
The table was extracted from a csv file using the following code:
import pandas as pd
newnames = ['date','user_id', 'whole_cost', 'cost1']
df = pd.read_csv('expenses.csv', names=newnames, index_col='date', parse_dates=['date'], dayfirst=True)  # parse the day-first dates so they can be grouped by month
I have to analyse the profile of my users, and for this purpose:
I would like to group queries by month (for each user; there are thousands of them), summing whole_cost over the entire month. For example, if user_id=1 has a whole_cost of 1790 on 02/10/2012 (with cost1 12) and a whole_cost of 364 on 07/10/2012, the new table should have a single entry of 2154 (the total whole_cost) on 31/10/2012; every date in the transformed table will be a month end representing the whole month to which it relates.
In pandas 0.14 you'll be able to group by month and by another column at the same time:
In [11]: df
Out[11]:
user_id whole_cost cost1
2012-10-02 1 1790 12
2012-10-07 1 364 15
2013-01-30 1 280 10
2013-02-02 1 259 24
2013-03-05 1 201 39
2012-10-02 3 623 1
2012-12-07 3 90 0
2013-01-30 3 312 90
2013-02-02 5 359 45
2013-03-05 5 301 34
2013-02-02 5 359 1
2013-03-05 5 801 12
In [12]: df1 = df.sort_index() # requires sorted DatetimeIndex
In [13]: df1.groupby([pd.TimeGrouper(freq='M'), 'user_id'])['whole_cost'].sum()
Out[13]:
user_id
2012-10-31 1 2154
3 623
2012-12-31 3 90
2013-01-31 1 280
3 312
2013-02-28 1 259
5 718
2013-03-31 1 201
5 1102
Name: whole_cost, dtype: int64
Until 0.14, I think you're stuck doing two groupbys:
In [14]: g = df.groupby('user_id')['whole_cost']
In [15]: g.resample('M', how='sum').dropna()
Out[15]:
user_id
1 2012-10-31 2154
2013-01-31 280
2013-02-28 259
2013-03-31 201
3 2012-10-31 623
2012-12-31 90
2013-01-31 312
5 2013-02-28 718
2013-03-31 1102
dtype: float64
With TimeGrouper deprecated, you can replace it with pd.Grouper to get the same results (this assumes date is a regular datetime column; if it is the index, call reset_index() first or use pd.Grouper(level='date', freq='M')):
df.groupby(['user_id', pd.Grouper(key='date', freq='M')]).agg({'whole_cost': 'sum'})
# or, grouping by day of week instead of by month:
df.groupby(['user_id', df['date'].dt.dayofweek]).agg({'whole_cost': 'sum'})
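To get back to a flat "one row per user per month end" table like the one described in the question, a small sketch on top of that (again assuming date is a datetime column):
monthly = (df.groupby(['user_id', pd.Grouper(key='date', freq='M')])['whole_cost']
             .sum()
             .reset_index()
             .rename(columns={'date': 'month_end'}))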