Difference between dates in Pandas dataframe - python
This is related to an earlier question, but now I need to find the difference between dates that are stored as 'YYYY-MM-DD' strings. Essentially, what I need is the difference between values in the count column, normalized by the number of days between each row.
My dataframe is:
date,site,country_code,kind,ID,rank,votes,sessions,avg_score,count
2017-03-20,website1,US,0,84,226,0.0,15.0,3.370812,53.0
2017-03-21,website1,US,0,84,214,0.0,15.0,3.370812,53.0
2017-03-22,website1,US,0,84,226,0.0,16.0,3.370812,53.0
2017-03-23,website1,US,0,84,234,0.0,16.0,3.369048,54.0
2017-03-24,website1,US,0,84,226,0.0,16.0,3.369048,54.0
2017-03-25,website1,US,0,84,212,0.0,16.0,3.369048,54.0
2017-03-27,website1,US,0,84,228,0.0,16.0,3.369048,58.0
2017-02-15,website2,AU,1,91,144,4.0,148.0,4.727272,521.0
2017-02-16,website2,AU,1,91,144,3.0,147.0,4.727272,524.0
2017-02-20,website2,AU,1,91,100,4.0,148.0,4.727272,531.0
2017-02-21,website2,AU,1,91,118,6.0,149.0,4.727272,533.0
2017-02-22,website2,AU,1,91,114,4.0,151.0,4.727272,534.0
And I'd like to find the difference between consecutive dates after grouping by (site, country_code, kind, ID) tuples, so the desired output is:
date,site,country_code,kind,ID,rank,votes,sessions,avg_score,count,day_diff
2017-03-20,website1,US,0,84,226,0.0,15.0,3.370812,0,0
2017-03-21,website1,US,0,84,214,0.0,15.0,3.370812,0,1
2017-03-22,website1,US,0,84,226,0.0,16.0,3.370812,0,1
2017-03-23,website1,US,0,84,234,0.0,16.0,3.369048,0,1
2017-03-24,website1,US,0,84,226,0.0,16.0,3.369048,0,1
2017-03-25,website1,US,0,84,212,0.0,16.0,3.369048,0,1
2017-03-27,website1,US,0,84,228,0.0,16.0,3.369048,4,2
2017-02-15,website2,AU,1,91,144,4.0,148.0,4.727272,0,0
2017-02-16,website2,AU,1,91,144,3.0,147.0,4.727272,3,1
2017-02-20,website2,AU,1,91,100,4.0,148.0,4.727272,7,4
2017-02-21,website2,AU,1,91,118,6.0,149.0,4.727272,3,1
2017-02-22,website2,AU,1,91,114,4.0,151.0,4.727272,1,1
One option would be to convert the date column to pandas datetime using pd.to_datetime() and use the diff function, but that results in values like "x days" of type timedelta64. I'd like to use this difference to find the daily average count, so if this can be accomplished in a single, less painful step, that would work well.
You can use the .dt.days accessor:
In [72]: df['date'] = pd.to_datetime(df['date'])
In [73]: df['day_diff'] = df.groupby(['site','country_code','kind','ID'])['date'] \
.diff().dt.days.fillna(0)
In [74]: df
Out[74]:
date site country_code kind ID rank votes sessions avg_score count day_diff
0 2017-03-20 website1 US 0 84 226 0.0 15.0 3.370812 53.0 0.0
1 2017-03-21 website1 US 0 84 214 0.0 15.0 3.370812 53.0 1.0
2 2017-03-22 website1 US 0 84 226 0.0 16.0 3.370812 53.0 1.0
3 2017-03-23 website1 US 0 84 234 0.0 16.0 3.369048 54.0 1.0
4 2017-03-24 website1 US 0 84 226 0.0 16.0 3.369048 54.0 1.0
5 2017-03-25 website1 US 0 84 212 0.0 16.0 3.369048 54.0 1.0
6 2017-03-27 website1 US 0 84 228 0.0 16.0 3.369048 58.0 2.0
7 2017-02-15 website2 AU 1 91 144 4.0 148.0 4.727272 521.0 0.0
8 2017-02-16 website2 AU 1 91 144 3.0 147.0 4.727272 524.0 1.0
9 2017-02-20 website2 AU 1 91 100 4.0 148.0 4.727272 531.0 4.0
10 2017-02-21 website2 AU 1 91 118 6.0 149.0 4.727272 533.0 1.0
11 2017-02-22 website2 AU 1 91 114 4.0 151.0 4.727272 534.0 1.0
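To address the question's follow-up goal of a daily average count, here is a minimal sketch building on the answer above; it is not from the original answer, and the count_diff and daily_avg_count column names are made up for illustration:

import numpy as np

group_cols = ['site', 'country_code', 'kind', 'ID']

# Change in `count` between consecutive rows within each group
df['count_diff'] = df.groupby(group_cols)['count'].diff().fillna(0)

# Normalize by the number of days elapsed; the first row of each group has
# day_diff == 0, so replace it with NaN to avoid division by zero
df['daily_avg_count'] = df['count_diff'] / df['day_diff'].replace(0, np.nan)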
Related
NaN in single column while importing data from URL
I am trying to import all 9 columns of the popular MPG dataset from UCI from a URL. The problem is that, instead of the string values showing, Carname (the ninth column) is populated with NaN. What is going wrong and how can one fix this? The repository shows that the original dataset has 9 columns, so this should work. From the URL we find that the data looks like

18.0   8   307.0      130.0      3504.      12.0   70  1    "chevrolet chevelle malibu"
15.0   8   350.0      165.0      3693.      11.5   70  1    "buick skylark 320"

with unique string values in Carname, but when we import it as

import pandas as pd

# Import raw dataset from URL
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin', 'Carname']
data = pd.read_csv(url, names=column_names, na_values='?',
                   comment='\t', sep=' ', skipinitialspace=True)
data.head(3)

it yields (with NaN values in Carname):

    MPG  Cylinders  Displacement  Horsepower  Weight  Acceleration  Model Year  Origin  Carname
0  18.0          8         307.0       130.0  3504.0          12.0          70       1      NaN
1  15.0          8         350.0       165.0  3693.0          11.5          70       1      NaN
It’s literally in your read_csv call: comment='\t'. The only tabs in the file are right before the Carname field, which means the way you read the file explicitly ignores that column. You can remove the comment parameter and use the more generic separator \s+ instead, to split on any whitespace (one or more spaces, a tab, etc.):

>>> pd.read_csv(url, names=column_names, na_values='?', sep='\s+')
      MPG  Cylinders  Displacement  Horsepower  Weight  Acceleration  Model Year  Origin                    Carname
0    18.0          8         307.0       130.0  3504.0          12.0          70       1  chevrolet chevelle malibu
1    15.0          8         350.0       165.0  3693.0          11.5          70       1          buick skylark 320
2    18.0          8         318.0       150.0  3436.0          11.0          70       1         plymouth satellite
3    16.0          8         304.0       150.0  3433.0          12.0          70       1              amc rebel sst
4    17.0          8         302.0       140.0  3449.0          10.5          70       1                ford torino
..    ...        ...           ...         ...     ...           ...         ...     ...                        ...
393  27.0          4         140.0        86.0  2790.0          15.6          82       1            ford mustang gl
394  44.0          4          97.0        52.0  2130.0          24.6          82       2                  vw pickup
395  32.0          4         135.0        84.0  2295.0          11.6          82       1              dodge rampage
396  28.0          4         120.0        79.0  2625.0          18.6          82       1                ford ranger
397  31.0          4         119.0        82.0  2720.0          19.4          82       1                 chevy s-10

[398 rows x 9 columns]
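As a hedged aside, not part of the answer above: depending on your pandas version, delim_whitespace=True is documented as equivalent to sep='\s+', although newer releases deprecate it in favour of the regex separator:

data = pd.read_csv(url, names=column_names, na_values='?',
                   delim_whitespace=True)  # equivalent to sep='\s+'; deprecated in recent pandas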
How can I iterate over a pandas dataframe so I can divide specific values based on a condition?
I have a dataframe like below:

           0       1       2  ...      62      63   64
795     89.0    92.0    89.0  ...    74.0    64.0  4.0
575     80.0    75.0    78.0  ...    70.0    68.0  3.0
1119  2694.0  2437.0  2227.0  ...  4004.0  4010.0  6.0
777     90.0    88.0    88.0  ...    71.0    67.0  4.0
506     82.0    73.0    77.0  ...    69.0    64.0  2.0
...      ...     ...     ...  ...     ...     ...  ...
65      84.0    77.0    78.0  ...    78.0    80.0  0.0
1368  4021.0  3999.0  4064.0  ...     1.0  4094.0  8.0
1036    80.0    80.0    79.0  ...    73.0    66.0  5.0
1391  3894.0  3915.0  3973.0  ...     4.0  4090.0  8.0
345     81.0    74.0    75.0  ...    80.0    75.0  1.0

I want to divide all elements over 1000 in this dataframe by 100, so 4021.0 becomes 40.21, et cetera. I've tried something like below:

for cols in df:
    for rows in df[cols]:
        print(df[cols][rows])

I get index out of bound errors. I'm just not sure how to properly iterate the way I'm looking for.
Loops are slow here, so it is better to use a vectorized solution: select the values greater than 1000 and divide them.

df[df.gt(1000)] = df.div(100)

Or use DataFrame.mask:

df = df.mask(df.gt(1000), df.div(100))
print (df)

          0      1      2     62     63   64
795   89.00  92.00  89.00  74.00  64.00  4.0
575   80.00  75.00  78.00  70.00  68.00  3.0
1119  26.94  24.37  22.27  40.04  40.10  6.0
777   90.00  88.00  88.00  71.00  67.00  4.0
506   82.00  73.00  77.00  69.00  64.00  2.0
65    84.00  77.00  78.00  78.00  80.00  0.0
1368  40.21  39.99  40.64   1.00  40.94  8.0
1036  80.00  80.00  79.00  73.00  66.00  5.0
1391  38.94  39.15  39.73   4.00  40.90  8.0
345   81.00  74.00  75.00  80.00  75.00  1.0
You can use the applymap function with a custom function:

def mapper_function(x):
    if x > 1000:
        x = x / 100
    return x

df = df.applymap(mapper_function)
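Another vectorized option, offered as a hedged sketch rather than as part of either answer, pushes the elementwise choice down to NumPy with np.where:

import numpy as np
import pandas as pd

# Keep values <= 1000 as-is, divide the rest by 100
df = pd.DataFrame(np.where(df.gt(1000), df / 100, df),
                  index=df.index, columns=df.columns)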
Improve Performance of Apply Method
I would like to group my df by the variable "cod_id" and then apply this function:

[df.loc[df['dt_op'].between(d, d + pd.Timedelta(days = 7)), 'quantity'].sum() \
    for d in df['dt_op']]

Moving from this df:

print(df)

   dt_op     quantity  cod_id
   20/01/18  1         613
   21/01/18  8         611
   21/01/18  1         613
   ...

To this one:

print(final_df)
n = 7

   dt_op     quantity  product_code  Final_Quantity
   20/01/18  1         613           2
   21/01/18  8         611           8
   25/01/18  1         613           1
   ...

I tried with:

def lookforward(x):
    L = [x.loc[x['dt_op'].between(row.dt_op, row.dt_op + pd.Timedelta(days=7)), \
               'quantity'].sum() for row in x.itertuples(index=False)]
    return pd.Series(L, index=x.index)

s = df.groupby('cod_id').apply(lookforward)
s.index = s.index.droplevel(0)
df['Final_Quantity'] = s

print(df)

       dt_op  quantity  cod_id  Final_Quantity
0 2018-01-20         1     613               2
1 2018-01-21         8     611               8
2 2018-01-21         1     613               1

But it is not an efficient solution, since it is computationally slow. How can I improve its performance? I would accept even new code or a new function that leads to the same result.

EDIT: a subset of the original dataset, with just one product (cod_id == 2), on which I tried to run the code provided by "w-m":

print(df)

    cod_id      dt_op  quantita  final_sum
0        2 2017-01-03         1       54.0
1        2 2017-01-04         1       53.0
2        2 2017-01-13         1       52.0
3        2 2017-01-23         2       51.0
4        2 2017-01-26         1       49.0
5        2 2017-02-03         1       48.0
6        2 2017-02-27         1       47.0
7        2 2017-03-05         1       46.0
8        2 2017-03-15         1       45.0
9        2 2017-03-23         1       44.0
10       2 2017-03-27         2       43.0
11       2 2017-03-31         3       41.0
12       2 2017-04-04         1       38.0
13       2 2017-04-05         1       37.0
14       2 2017-04-15         2       36.0
15       2 2017-04-27         2       34.0
16       2 2017-04-30         1       32.0
17       2 2017-05-16         1       31.0
18       2 2017-05-18         1       30.0
19       2 2017-05-19         1       29.0
20       2 2017-06-03         1       28.0
21       2 2017-06-04         1       27.0
22       2 2017-06-07         1       26.0
23       2 2017-06-13         2       25.0
24       2 2017-06-14         1       23.0
25       2 2017-06-20         1       22.0
26       2 2017-06-22         2       21.0
27       2 2017-06-28         1       19.0
28       2 2017-06-30         1       18.0
29       2 2017-07-03         1       17.0
30       2 2017-07-06         2       16.0
31       2 2017-07-07         1       14.0
32       2 2017-07-13         1       13.0
33       2 2017-07-20         1       12.0
34       2 2017-07-28         1       11.0
35       2 2017-08-06         1       10.0
36       2 2017-08-07         1        9.0
37       2 2017-08-24         1        8.0
38       2 2017-09-06         1        7.0
39       2 2017-09-16         2        6.0
40       2 2017-09-20         1        4.0
41       2 2017-10-07         1        3.0
42       2 2017-11-04         1        2.0
43       2 2017-12-07         1        1.0
Edit 181017: this approach doesn't work due to forward rolling functions on sparse time series not currently being supported by pandas, see the comments.

Using for loops can be a performance killer when doing pandas operations. The for loop around the rows plus their timedelta of 7 days can be replaced with a .rolling("7D"). To get a forward-rolling time delta (current date + 7 days), we reverse the df by date, as shown here. Then no custom function is required anymore, and you can just take .quantity.sum() from the groupby.

quant_sum = df.sort_values("dt_op", ascending=False).groupby("cod_id") \
              .rolling("7D", on="dt_op").quantity.sum()

cod_id  dt_op
611     2018-01-21    8.0
613     2018-01-21    1.0
        2018-01-20    2.0
Name: quantity, dtype: float64

result = df.set_index(["cod_id", "dt_op"])
result["final_sum"] = quant_sum
result.reset_index()

   cod_id      dt_op  quantity  final_sum
0     613 2018-01-20         1        2.0
1     611 2018-01-21         8        8.0
2     613 2018-01-21         1        1.0
Implementing the exact behavior from the question is difficult due to two shortcomings in pandas: neither groupby/rolling/transform nor forward-looking rolling over sparse dates is implemented (see the other answer for more details). This answer attempts to work around both by resampling the data, filling in all days, and then joining the quant_sums back with the original data.

# Create a temporary df with all in-between days filled in with zeros
filled = df.set_index("dt_op").groupby("cod_id") \
           .resample("D").asfreq().fillna(0) \
           .quantity.to_frame()

# Reverse and sum
filled["quant_sum"] = filled.reset_index().set_index("dt_op") \
                            .iloc[::-1] \
                            .groupby("cod_id") \
                            .rolling(7, min_periods=1) \
                            .quantity.sum().astype(int)

# Join with original `df`, dropping the filled days
result = df.set_index(["cod_id", "dt_op"]).join(filled.quant_sum).reset_index()
find duplicates and mark as variant
I'm trying to create a data frame where I add duplicates as variants in a column. To further illustrate my question: I have a pandas dataframe like this:

    Case  ButtonAsInteger
0      1              130
1      1              133
2      1               42
3      2              165
4      2              158
5      2              157
6      3              158
7      3              159
8      3              157
9      4              130
10     4              133
11     4               43
...  ...              ...

I have converted it into this form:

grouped = activity2.groupby(['Case'])
values = grouped['ButtonAsInteger'].agg('sum')
id_df = grouped['ButtonAsInteger'].apply(lambda x: pd.Series(x.values)).unstack(level=-1)

          0      1      2      3      4      5      6      7      8      9
Case
1     130.0  133.0   42.0   52.0   47.0   47.0   32.0   94.0    NaN    NaN
2     165.0  158.0  157.0  141.0  142.0  142.0  142.0  142.0  142.0  147.0
3     158.0  159.0  157.0  147.0  166.0  170.0  169.0  130.0  133.0  133.0
4     130.0  133.0   42.0   52.0   47.0   47.0   32.0   94.0    NaN    NaN

And now I want to find duplicates and mark each duplicate as a variant. So in this example, Case 1 and 4 should get variant 1, like this:

      Variants      0      1      2      3      4      5      6      7      8      9
Case
1            1  130.0  133.0   42.0   52.0   47.0   47.0   32.0   94.0    NaN    NaN
2            2  165.0  158.0  157.0  141.0  142.0  142.0  142.0  142.0  142.0  147.0
3            3  158.0  159.0  157.0  147.0  166.0  170.0  169.0  130.0  133.0  133.0
4            1  130.0  133.0   42.0   52.0   47.0   47.0   32.0   94.0    NaN    NaN

I have already tried this method: https://stackoverflow.com/a/44999009. But it doesn't work on my data frame, and unfortunately I don't know why. It would probably be possible to use a double for loop: for each row, look whether there is a duplicate in the record. Whether this is efficient on a large data set, I don't know. I have also included my grouping procedure above, because perhaps it is possible to work with the duplicates already at that point?
This groups by all columns and returns the group index (+ 1, because zero-based indexing is the default). I think this should be what you want.

id_df['Variant'] = id_df.groupby(
    id_df.columns.values.tolist()).grouper.group_info[0] + 1

The resulting data frame, given input data like yours above:

        0    1    2  Variant
Case
1     130  133   42        1
2     165  158  157        3
3     158  159  157        2
4     130  133   42        1

There could be a syntactically nicer way to access the group index, but I didn't find one.
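As a hedged aside not in the original answer: more recent pandas versions expose this group index directly via GroupBy.ngroup(), which avoids reaching into the internal grouper attribute. Assuming that API is available, the same result should be obtainable with:

# ngroup() numbers each group 0..n-1 in group-key sort order, so add 1 as above
id_df['Variant'] = id_df.groupby(id_df.columns.values.tolist()).ngroup() + 1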
How to assign a values to dataframe's column by comparing values in another dataframe
I have two data frames. One has rows for every five minutes in a day:

df

                 TIMESTAMP  TEMP
1      2011-06-01 00:05:00  24.5
200    2011-06-01 16:40:00  32.0
1000   2011-06-04 11:20:00  30.2
5000   2011-06-18 08:40:00  28.4
10000  2011-07-05 17:20:00  39.4
15000  2011-07-23 02:00:00  29.3
20000  2011-08-09 10:40:00  29.5
30656  2011-09-15 10:40:00  13.8

I have another dataframe that ranks the days:

ranked

    TEMP        DATE  RANK
62  43.3  2011-08-02   1.0
63  43.1  2011-08-03   2.0
65  43.1  2011-08-05   3.0
38  43.0  2011-07-09   4.0
66  42.8  2011-08-06   5.0
64  42.5  2011-08-04   6.0
84  42.2  2011-08-24   7.0
56  42.1  2011-07-27   8.0
61  42.1  2011-08-01   9.0
68  42.0  2011-08-08  10.0

Both the TIMESTAMP and DATE columns are datetime datatypes (dtype returns dtype('M8[ns]')). What I want to do is add a column to df that holds each row's rank, based on its TIMESTAMP and the corresponding day's rank from ranked (so within a day, all the 5-minute timesteps get the same rank). So, the final result would look something like this:

df

                 TIMESTAMP  TEMP   RANK
1      2011-06-01 00:05:00  24.5   98.0
200    2011-06-01 16:40:00  32.0   98.0
1000   2011-06-04 11:20:00  30.2   96.0
5000   2011-06-18 08:40:00  28.4   50.0
10000  2011-07-05 17:20:00  39.4    9.0
15000  2011-07-23 02:00:00  29.3   45.0
20000  2011-08-09 10:40:00  29.5   40.0
30656  2011-09-15 10:40:00  13.8  100.0

What I have done so far:

# Separate the date and times.
df['DATE'] = df['YYYYMMDDHHmm'].dt.normalize()
df['TIME'] = df['YYYYMMDDHHmm'].dt.time
df = df[['DATE', 'TIME', 'TAIR']]

df['RANK'] = 0
for index, row in df.iterrows():
    df.loc[index, 'RANK'] = ranked[ranked['DATE']==row['DATE']]['RANK'].values

But I think I am going in a very wrong direction, because this takes ages to complete. How do I improve this code?
IIUC, you can play with the indexes to match the values:

df = df.set_index(df.TIMESTAMP.dt.date)\
       .assign(RANK=ranked.set_index('DATE').RANK)\
       .set_index(df.index)
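A hedged alternative sketch, not from the answer above: since both columns are already datetime64[ns], the day's rank can also be looked up by normalizing each timestamp to midnight and mapping it against ranked indexed by DATE (column names taken from the question; no new ones introduced):

# Map each row's day (timestamp truncated to midnight) to that day's rank;
# days missing from `ranked` come back as NaN
df['RANK'] = df['TIMESTAMP'].dt.normalize().map(ranked.set_index('DATE')['RANK'])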