I have a dataframe with the following structure (shown here as whitespace-separated values):
day date hour cnt
Friday 9/15/2017 0 3
Friday 9/15/2017 1 5
Friday 9/15/2017 2 8
Friday 9/15/2017 3 6
...........................
Friday 9/15/2017 10
...........................
Saturday 9/16/2017 21 5
Saturday 9/16/2017 22 4
Some of the date values have data for every hour (0-23).
However, some of the date values can have missing hours. In the example, for 9/15/2017 data, there are no records for hour values from 9 to 13. For all these missing records, I need to add a new record with a cnt value (last column) of zero.
How do I achieve this in Python?
Provided you use a pandas.DataFrame, you may use the fillna() method:
df['cnt'] = df['cnt'].fillna(0)
Note that fillna() only fills NaN cells in rows that already exist; it will not create records for the missing hours, so you first need to build those rows (for example via resample or reindex, as in the answers below).
Example:
Consider data:
one two three
a NaN 1.2 -0.355322
c NaN 3.3 0.983801
e 0.01 4 -0.712964
You may fill NaN using fillna():
data.fillna(0)
one two three
a 0 1.2 -0.355322
c 0 3.3 0.983801
e 0.01 4 -0.712964
You can generate a DatetimeIndex and use the resample() method:
# suppose your dataframe is named df
idx = pd.DatetimeIndex(pd.to_datetime(df['date']).add(pd.to_timedelta(df['hour'], unit='h')))
df.index = idx
# resampling hourly inserts the missing hours; their empty bins sum to 0
df_filled = df[['cnt']].resample('1H').sum().fillna(0).astype(int)
df_filled['day'] = df_filled.index.strftime('%A')
df_filled['date'] = df_filled.index.strftime('%-m/%-d/%Y')
df_filled['hour'] = df_filled.index.strftime('%-H')
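If you would rather not go through an aggregation, reindexing against a complete hourly range gives the same zero-filled result. A minimal sketch, assuming df is the original frame with the date, hour and cnt columns shown above:

import pandas as pd

# build an hourly timestamp for each existing row
ts = pd.to_datetime(df['date']) + pd.to_timedelta(df['hour'], unit='h')
# reindex against every hour between the first and last timestamp,
# filling cnt for the missing hours with zero
full_range = pd.date_range(ts.min(), ts.max(), freq='h')
df_filled = df.set_index(ts)[['cnt']].reindex(full_range, fill_value=0)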
Or you can do the pivot-and-unpivot trick (note it only creates rows for hours that appear on at least one date in the data):
df_filled = df.pivot(values='cnt', index='date', columns='hour').fillna(0).unstack()
df_filled = df_filled.reset_index().sort_values(by=['date', 'hour'])
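After unstack() the values column is unnamed and shows up as 0 once the index is reset; assuming you want the original column name back, a small follow-up:

# the unstacked values column is labeled 0 after reset_index(); restore the name
df_filled = df_filled.rename(columns={0: 'cnt'})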
A sample of my dataframe is as follows:
|Date_Closed|Owner|Case_Closed_Count|
|2022-07-19|JH|1|
|2022-07-18|JH|2|
|2022-07-17|JH|5|
|2022-07-19|DT|3|
|2022-07-15|DT|1|
|2022-07-01|DT|1|
|2022-06-30|JW|30|
|2022-06-28|JH|2|
My goal is to get a sum of case count per owner per month, which looks like:
|Month|Owner|Case_Closed_Count|
|2022-07|JH|8|
|2022-07|DT|5|
|2022-06|JW|30|
|2022-06|JH|2|
Here is the code I got so far:
df = pd.to_datetime(df['Date_Closed'])
month = df.Date_Closed.dt.to_period("M")
G = df.groupby(month).agg({'Case_Closed_Count':'sum'})
With the code above, I manage to get the case closed count groupby month, but how do I keep the owner column?
Here is one way to do it:
df['Date_Closed'] = pd.to_datetime(df['Date_Closed'])
df.groupby([df['Date_Closed'].dt.strftime('%Y-%m'), 'Owner'])['Case_Closed_Count'].sum().reset_index()
Date_Closed Owner Case_Closed_Count
0 2022-06 JH 2
1 2022-06 JW 30
2 2022-07 DT 5
3 2022-07 JH 8
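If you prefer a real monthly period over a string label (closer to your original dt.to_period attempt), the same groupby works with to_period; a sketch under the same assumptions:

# group by a monthly Period instead of a 'YYYY-MM' string
out = df.groupby([df['Date_Closed'].dt.to_period('M'), 'Owner'])['Case_Closed_Count'].sum().reset_index()
# match the desired column name from the question
out = out.rename(columns={'Date_Closed': 'Month'})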
I have a DataFrame like this:
Year Month Day Rain (mm)
2021 1 1 15
2021 1 2 NaN
2021 1 3 12
And so on (there are multiple years). I have used the pivot_table function to convert the DataFrame into this:
Year       2021  2020  2019  2018  2017
Month Day
1     1      15
      2     NaN
      3      12
I used:
df = df.pivot_table(index=['Month', 'Day'], columns='Year',
                    values='Rain (mm)', aggfunc='first')
Now I would like to replace all NaN values and also possible -1 values with zeros in every column (by columns I mean years), but I have not been able to do so. I have tried:
df = df.fillna(0)
And also:
df.loc[df['Rain (mm)'] == np.nan, 'Rain (mm)'] = 0
But neither works; there is no error message or exception, the dataframe just remains unchanged. What am I doing wrong? Any advice is highly appreciated.
I think the problem is that your NaN values are actually strings, so fillna cannot replace them; first try converting the values to numeric:
df['Rain (mm)'] = pd.to_numeric(df['Rain (mm)'], errors='coerce')
df = df.pivot_table(index=['Month', 'Day'], columns='Year',
                    values='Rain (mm)', aggfunc='first').fillna(0)
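The question also asks about possible -1 values; assuming those should become zero as well, chain a replace() after the fill:

# also map the possible -1 readings to zero (assumption: -1 marks missing data here)
df = df.replace(-1, 0)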
Let's say I have the following df:
year date_until
1 2010 -
2 2011 30.06.13
3 2011 NaN
4 2015 30.06.18
5 2020 -
I'd like to fill all - and NaNs in the date_until column with 30/06/{year +1}. I tried the following but it uses the whole year column instead of the corresponding value of the specific row:
df['date_until'] = df['date_until'].str.replace('-', f'30/06/{df["year"]+1}')
My final goal is to calculate the difference between year and the year of date_until, so maybe the step above is even unnecessary.
We can use pd.to_datetime with errors='coerce' to turn the faulty dates into NaT, then use dt.year to calculate the difference:
df['date_until'] = pd.to_datetime(df['date_until'], format='%d.%m.%y', errors='coerce')
df['diff_year'] = df['date_until'].dt.year - df['year']
year date_until diff_year
0 2010 NaT NaN
1 2011 2013-06-30 2.0
2 2011 NaT NaN
3 2015 2018-06-30 3.0
4 2020 NaT NaN
For everybody who is trying to replace values just like I wanted to in the first place, here is how you could solve it:
for i in range(len(df)):
    value = df['date_until'].iloc[i]
    # fill both NaN and the '-' placeholder with 30.06.<year + 1>
    if pd.isna(value) or value == '-':
        df.loc[df.index[i], 'date_until'] = f'30.06.{df["year"].iloc[i] + 1}'
But @Erfan's approach is much cleaner.
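For reference, the same replacement can also be vectorized instead of looped; a sketch, assuming date_until still holds the raw strings:

# mark rows where date_until is missing or the '-' placeholder
mask = df['date_until'].isna() | (df['date_until'] == '-')
# build '30.06.<year+1>' per row and assign only where needed
df.loc[mask, 'date_until'] = '30.06.' + (df.loc[mask, 'year'] + 1).astype(str)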
I want to merge two datasets that are indexed by time and id. The problem is, the time is slightly different in each dataset. In one dataset, the time (Monthly) is mid-month, so the 15th of every month. In the other dataset, it is the last business day. This should still be a one-to-one match, but the dates are not exactly the same.
My approach is to shift mid-month dates to business day end-of-month dates.
Data:
import pandas as pd
import numpy as np
from pandas.tseries.offsets import BMonthEnd

dt = pd.date_range('1/1/2011', '12/31/2011', freq='D')
dt = dt[dt.day == 15]
lst = [1,2,3]
idx = pd.MultiIndex.from_product([dt,lst],names=['date','id'])
df = pd.DataFrame(np.random.randn(len(idx)), index=idx)
df.head()
output:
0
date id
2011-01-15 1 -0.598584
2 -0.484455
3 -2.044912
2011-02-15 1 -0.017512
2 0.852843
This is what I want (I removed the performance warning):
In [83]: df.index.levels[0] + BMonthEnd()
Out[83]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
'2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
'2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
dtype='datetime64[ns]', freq='BM')
However, indexes are immutable, so this does not work:
In: df.index.levels[0] = df.index.levels[0] + BMonthEnd()
TypeError: 'FrozenList' does not support mutable operations.
The only solution I've got is to reset_index(), change the dates, then set_index() again:
df.reset_index(inplace=True)
df['date'] = df['date'] + BMonthEnd()
df.set_index(['date','id'], inplace=True)
This gives what I want, but is this the best way? Is there a set_level_values() function (I didn't see it in the API)?
Or maybe I'm taking the wrong approach to the merge. I could merge the dataset with keys df.index.get_level_values(0).year, df.index.get_level_values(0).month and id but this doesn't seem much better.
You can use set_levels in order to set multiindex levels (assign the result back, since the inplace argument was removed in pandas 2.0):
df.index = df.index.set_levels(df.index.levels[0] + pd.tseries.offsets.BMonthEnd(),
                               level='date')
>>> df.head()
0
date id
2011-01-31 1 -1.410646
2 0.642618
3 -0.537930
2011-02-28 1 -0.418943
2 0.983186
You could just build it again:
df.index = pd.MultiIndex.from_arrays(
    [
        df.index.get_level_values(0) + BMonthEnd(),
        df.index.get_level_values(1),
    ])
set_levels implicitly rebuilds the index under the covers. If you have more than two levels, this solution becomes unwieldy, so consider using set_levels for brevity.
Since you want to merge anyway, you can forget about changing the index and use pandas.merge_asof().
Data
df1
0
date id
2011-01-15 1 -0.810581
2 1.177235
3 0.083883
2011-02-15 1 1.217419
2 -0.970804
3 1.262364
2011-03-15 1 -0.026136
2 -0.036250
3 -1.103929
2011-04-15 1 -1.303298
And here is one with the last business day of the month, df2:
0
date id
2011-01-31 1 -0.277675
2 0.086539
3 1.441449
2011-02-28 1 1.330212
2 -0.028398
3 -0.114297
2011-03-31 1 -0.031264
2 -0.787093
3 -0.133088
2011-04-29 1 0.938732
merge
Use df1 as your left DataFrame and choose the merge direction as forward, since the last business day is always after the 15th. Optionally, you can set a tolerance. This is useful when you are missing a month in the right DataFrame: it prevents merging 2011-03-31 onto 2011-02-15 if you are missing data for the last business day of February.
import pandas as pd
pd.merge_asof(df1.reset_index(), df2.reset_index(), by='id', on='date',
              direction='forward', tolerance=pd.Timedelta(days=20)).set_index(['date', 'id'])
Results in
0_x 0_y
date id
2011-01-15 1 -0.810581 -0.277675
2 1.177235 0.086539
3 0.083883 1.441449
2011-02-15 1 1.217419 1.330212
2 -0.970804 -0.028398
3 1.262364 -0.114297
2011-03-15 1 -0.026136 -0.031264
2 -0.036250 -0.787093
3 -1.103929 -0.133088
2011-04-15 1 -1.303298 0.938732
NOTE: Looking for help on an efficient way to do this besides a mega join and then calculating the difference between dates.
I have table1 with a country ID and a date (no duplicates of these values), and I want to summarize table2 (which has country, date, cluster_x and a count variable, where cluster_x is cluster_1, cluster_2, cluster_3) so that table1 gets, for each cluster ID, the summed count from table2 rows whose date falls within the 30 days prior to the date in table1.
I believe this is simple in SQL; how do I do it in pandas?
select a.date,a.country,
sum(case when a.date - b.date between 1 and 30 then b.cluster_1 else 0 end) as cluster1,
sum(case when a.date - b.date between 1 and 30 then b.cluster_2 else 0 end) as cluster2,
sum(case when a.date - b.date between 1 and 30 then b.cluster_3 else 0 end) as cluster3
from table1 a
left outer join table2 b
on a.country=b.country
group by a.date,a.country
EDIT:
Here is a somewhat altered example. Say this is table1, an aggregated data set with date, country, cluster and count. Below it is the "query" dataset (table2). In this case we want to sum the count field from table1 for cluster_1, cluster_2, cluster_3 (there are actually 100 of them) for the matching country id, as long as the date in table1 falls within the 30 days prior.
So for example, the first row of the query dataset has date 2015-02-01 and country 1. In table1, there is only one row within the 30 days prior, and it is for cluster 2 with count 2.
Here is a dump of the two tables in CSV:
date,country,cluster,count
2014-01-30,1,1,1
2015-02-03,1,1,3
2015-01-30,1,2,2
2015-04-15,1,2,5
2015-03-01,2,1,6
2015-07-01,2,2,4
2015-01-31,2,3,8
2015-01-21,2,1,2
2015-01-21,2,1,3
and table2:
date,country
2015-02-01,1
2015-04-21,1
2015-02-21,2
Edit: Oops, wish I had seen that edit about joining before submitting. No problem, I'll leave this up since it was fun practice. Critiques welcome.
Assuming table1 and table2 are located in the same directory as this script as "table1.csv" and "table2.csv", this should work.
I didn't get the same result as your examples with 30 days; I had to bump it to 31 days, but I think the spirit is here:
import pandas as pd
import numpy as np
table1_path = './table1.csv'
table2_path = './table2.csv'
with open(table1_path) as f:
    table1 = pd.read_csv(f)
table1.date = pd.to_datetime(table1.date)

with open(table2_path) as f:
    table2 = pd.read_csv(f)
table2.date = pd.to_datetime(table2.date)
joined = pd.merge(table2, table1, how='outer', on=['country'])
joined['datediff'] = joined.date_x - joined.date_y
filtered = joined[(joined.datediff >= np.timedelta64(1, 'D')) & (joined.datediff <= np.timedelta64(31, 'D'))]
gb_date_x = filtered.groupby(['date_x', 'country', 'cluster'])
summed = pd.DataFrame(gb_date_x['count'].sum())
result = summed.unstack()
result.reset_index(inplace=True)
result.fillna(0, inplace=True)
My test output:
ipdb> table1
date country cluster count
0 2014-01-30 00:00:00 1 1 1
1 2015-02-03 00:00:00 1 1 3
2 2015-01-30 00:00:00 1 2 2
3 2015-04-15 00:00:00 1 2 5
4 2015-03-01 00:00:00 2 1 6
5 2015-07-01 00:00:00 2 2 4
6 2015-01-31 00:00:00 2 3 8
7 2015-01-21 00:00:00 2 1 2
8 2015-01-21 00:00:00 2 1 3
ipdb> table2
date country
0 2015-02-01 00:00:00 1
1 2015-04-21 00:00:00 1
2 2015-02-21 00:00:00 2
...
ipdb> result
date_x country count
cluster 1 2 3
0 2015-02-01 00:00:00 1 0 2 0
1 2015-02-21 00:00:00 2 5 0 8
2 2015-04-21 00:00:00 1 0 5 0
UPDATE:
I don't think it makes much sense to use pandas for processing data that can't fit into memory. Of course there are some tricks to deal with that, but it's painful.
If you want to process your data efficiently, you should use a proper tool for that.
I would recommend having a closer look at Apache Spark SQL, where you can process distributed data on multiple cluster nodes with much more memory, processing power and IO available compared to a single-machine pandas setup.
Alternatively you can try an RDBMS like Oracle DB (very expensive, especially the software licences, and its free version is full of limitations) or a free alternative like PostgreSQL (I can't say much about it, for lack of experience) or MySQL (not as powerful as Oracle; for example, there is no native/clean solution for dynamic pivoting, which you will most probably want to use).
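For illustration, a minimal PySpark sketch of the same 30-days-prior aggregation (hedged: the file names and session setup are assumptions, and the SQL simply mirrors the query from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cluster-window-sum").getOrCreate()
# register the two CSVs as SQL views (paths are assumptions)
spark.read.csv("table1.csv", header=True, inferSchema=True).createOrReplaceTempView("table1")
spark.read.csv("table2.csv", header=True, inferSchema=True).createOrReplaceTempView("table2")

# same 30-days-prior logic as the SQL in the question, executed on the cluster
result = spark.sql("""
    select b.date, b.country, a.cluster, sum(a.`count`) as cnt
    from table2 b
    left join table1 a
      on a.country = b.country
     and datediff(b.date, a.date) between 1 and 30
    group by b.date, b.country, a.cluster
""")
result.show()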
OLD answer:
You can do it this way (explanations are in the comments in the code):
#
# <setup>
#
import pandas as pd
import numpy as np

dates1 = pd.date_range('2016-03-15', '2016-04-15')
dates2 = ['2016-02-01', '2016-05-01', '2016-04-01', '2015-01-01', '2016-03-20']
dates2 = [pd.to_datetime(d) for d in dates2]
countries = ['c1', 'c2', 'c3']
t1 = pd.DataFrame({
    'date': dates1,
    'country': np.random.choice(countries, len(dates1)),
    'cluster': np.random.randint(1, 4, len(dates1)),
    'count': np.random.randint(1, 10, len(dates1)),
})
t2 = pd.DataFrame({'date': np.random.choice(dates2, 10), 'country': np.random.choice(countries, 10)})
#
# </setup>
#

# merge the two DFs by `country`
merged = pd.merge(t1.rename(columns={'date': 'date1'}), t2, on='country')
# keep rows where t2's `date` falls within 30 days after t1's `date1`,
# then drop the helper 'date1' column
merged = merged[(merged.date <= merged.date1 + pd.Timedelta('30days'))
                & (merged.date >= merged.date1)].drop(['date1'], axis=1)
# group `merged` by ['country', 'date', 'cluster'],
# sum up `count` for overlapping dates,
# reset the index,
# pivot: convert `cluster` values to columns,
# taking sums of `count` as values,
# NaN's will be replaced with zeroes,
# and finally reset the index
r = merged.groupby(['country', 'date', 'cluster'])\
          .sum()\
          .reset_index()\
          .pivot_table(index=['country', 'date'],
                       columns='cluster',
                       values='count',
                       aggfunc='sum',
                       fill_value=0)\
          .reset_index()
# rename the numeric columns to 'cluster_N'
rename_cluster_cols = {x: 'cluster_{0}'.format(x) for x in t1.cluster.unique()}
r = r.rename(columns=rename_cluster_cols)
Output (for my datasets):
In [124]: r
Out[124]:
cluster country date cluster_1 cluster_2 cluster_3
0 c1 2016-04-01 8 0 11
1 c2 2016-04-01 0 34 22
2 c3 2016-05-01 4 18 36