How to remove specific day timestamps from a big dataframe - python

I have a big dataframe consisting of 600 days' worth of data. Each day has 100 timestamps. I have a separate list of 30 days whose data I want to remove. How do I remove the data for these 30 days from the dataframe?
I tried a for loop, but it did not work. I know there is a simple method. But I don't know how to implement it.
df #is main dataframe which has many columns and rows. Index is a timestamp.
df['dates'] = df.index.strftime('%Y-%m-%d') # date part of timestamp is sliced and
#a new column is created. Instead of index, I want to use this column for comparing with bad list.
bad_list # it is a list of bad dates
for i in range(0,len(df)):
    for j in range(0,len(bad_list)):
        if str(df['dates'][i])== bad_list[j]:
            df.drop(df[i].index,inplace=True)

You can do the following
df['dates'] = df.index.strftime('%Y-%m-%d')
# badlist must hold strings in the same '%Y-%m-%d' format as the 'dates' column.
newdf = df[~df['dates'].isin(badlist)]
# ~ negates the mask, so only rows whose date is not in the list are kept.
# If Jan 1, 2000 is a bad date, it should appear in the list as '2000-01-01'.
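If you would rather not create a helper column, a minimal sketch of the same idea compares the index directly. The toy frame, the `value` column and the contents of `bad_list` below are made up for illustration; the only assumption carried over from the question is that the index is a DatetimeIndex.

```python
import pandas as pd

# Hypothetical data: 3 days with 4 timestamps each.
idx = pd.date_range("2000-01-01", periods=12, freq=pd.Timedelta(hours=6))
df = pd.DataFrame({"value": range(12)}, index=idx)

# Bad days given as strings; to_datetime turns them into midnight timestamps.
bad_list = ["2000-01-02"]
bad_days = pd.to_datetime(bad_list)

# normalize() floors every timestamp to midnight, so each row is compared by day.
mask = df.index.normalize().isin(bad_days)
newdf = df[~mask]
print(newdf)
```

Because both sides are datetimes here, it does not matter how the bad days were originally formatted, as long as `pd.to_datetime` can parse them.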

You can perform a simple comparison:
>>> import numpy as np
>>> import pandas as pd
>>> from time import time
>>> dates = pd.Series(pd.to_datetime(np.random.randint(int(time()) - 60 * 60 * 24 * 5, int(time()), 12), unit='s'))
>>> dates
0 2019-03-19 05:25:32
1 2019-03-20 00:58:29
2 2019-03-19 01:03:36
3 2019-03-22 11:45:24
4 2019-03-19 08:14:29
5 2019-03-21 10:17:13
6 2019-03-18 09:09:15
7 2019-03-20 00:14:16
8 2019-03-21 19:47:02
9 2019-03-23 06:19:35
10 2019-03-23 05:42:34
11 2019-03-21 11:37:46
>>> start_date = pd.to_datetime('2019-03-20')
>>> end_date = pd.to_datetime('2019-03-22')
>>> dates[(dates > start_date) & (dates < end_date)]
1 2019-03-20 00:58:29
5 2019-03-21 10:17:13
7 2019-03-20 00:14:16
8 2019-03-21 19:47:02
11 2019-03-21 11:37:46
If your source Series is not in datetime format, then you will need to use pd.to_datetime to convert it.

Related

Group by id and calculate variation on sells based on the date

My DataFrame looks like this:
id  date        value
1   2021-07-16  100
2   2021-09-15  20
1   2021-04-10  50
1   2021-08-27  30
2   2021-07-22  15
2   2021-07-22  25
1   2021-06-30  40
3   2021-10-11  150
2   2021-08-03  15
1   2021-07-02  90
I want to groupby the id, and return the difference of total value in a 90-days period.
Specifically, I want the values of last 90 days based on today, and based on 30 days ago.
For example, considering today is 2021-10-13, I would like to get:
the sum of all values per id between 2021-10-13 and 2021-07-15
the sum of all values per id between 2021-09-13 and 2021-06-15
And finally, subtract them to get the variation.
I've already managed to calculate it by creating separate temporary dataframes containing only the dates in those 90-day periods, grouping by id, and then merging these temporary dataframes into a final one.
But I guess there should be an easier or simpler way to do it. I appreciate any help!
Btw, sorry if the explanation was a little messy.
If I understood correctly, you need something like this:
import pandas as pd
import datetime
## Calculation of the dates that we are gonna need.
today = datetime.datetime.now()
delta = datetime.timedelta(days = 120)
# Date of the 120 days ago
hundredTwentyDaysAgo = today - delta
delta = datetime.timedelta(days = 90)
# Date of the 90 days ago
ninetyDaysAgo = today - delta
delta = datetime.timedelta(days = 30)
# Date of the 30 days ago
thirtyDaysAgo = today - delta
## Initializing an example df.
df = pd.DataFrame({"id":[1,2,1,1,2,2,1,3,2,1],
"date": ["2021-07-16", "2021-09-15", "2021-04-10", "2021-08-27", "2021-07-22", "2021-07-22", "2021-06-30", "2021-10-11", "2021-08-03", "2021-07-02"],
"value": [100,20,50,30,15,25,40,150,15,90]})
## Casting date column
df['date'] = pd.to_datetime(df['date']).dt.date
grouped = df.groupby('id')
# Sum of last 90 days per id
ninetySum = grouped.apply(lambda x: x[x['date'] >= ninetyDaysAgo.date()]['value'].sum())
# Sum of last 90 days, starting from 30 days ago per id
hundredTwentySum = grouped.apply(lambda x: x[(x['date'] >= hundredTwentyDaysAgo.date()) & (x['date'] <= thirtyDaysAgo.date())]['value'].sum())
With today taken as 2021-10-13 (the date used in the question), the output is
ninetySum - hundredTwentySum
id
1 -130
2 20
3 150
dtype: int64
You can double check to make sure these are the numbers you wanted by printing ninetySum and hundredTwentySum variables.
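As a possible alternative to the per-group lambdas, here is a sketch that builds one boolean mask per period and sums within each id. It pins `today` to 2021-10-13 so the result is reproducible; that fixed date and the variable names are assumptions for the example, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 1, 1, 2, 2, 1, 3, 2, 1],
    "date": ["2021-07-16", "2021-09-15", "2021-04-10", "2021-08-27", "2021-07-22",
             "2021-07-22", "2021-06-30", "2021-10-11", "2021-08-03", "2021-07-02"],
    "value": [100, 20, 50, 30, 15, 25, 40, 150, 15, 90],
})
df["date"] = pd.to_datetime(df["date"])

# Assumption: "today" is fixed to the date used in the question.
today = pd.Timestamp("2021-10-13")
day = pd.Timedelta(days=1)

# between() is inclusive on both ends by default.
in_recent = df["date"].between(today - 90 * day, today)
in_earlier = df["date"].between(today - 120 * day, today - 30 * day)

# Rows outside a period contribute 0, so every id appears in both sums.
recent_sum = df["value"].where(in_recent, 0).groupby(df["id"]).sum()
earlier_sum = df["value"].where(in_earlier, 0).groupby(df["id"]).sum()

variation = recent_sum - earlier_sum
print(variation)
```

Zeroing out rows with `where` rather than filtering them away keeps ids that have no rows in one of the periods, so the final subtraction needs no merge or fill step.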

parse multiple date format pandas

I'm stuck with the following format:
0 2001-12-25
1 2002-9-27
2 2001-2-24
3 2001-5-3
4 200510
5 20078
What I need is the date in a format %Y-%m
What I tried was
def parse(date):
    if len(date)<=5:
        return "{}-{}".format(date[:4], date[4:5], date[5:])
    else:
        pass
df['Date']= parse(df['Date'])
However, I only succeeded in parsing 20078 to 2007-8; formats like 2001-12-25 came out as None.
So, how can I do it? Thank you!
We can use pd.to_datetime with errors='coerce' to parse the dates in steps,
assuming your column is called date:
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
df['date_fixed'] = s
print(df)
date date_fixed
0 2001-12-25 2001-12-25
1 2002-9-27 2002-09-27
2 2001-2-24 2001-02-24
3 2001-5-3 2001-05-03
4 200510 2005-10-01
5 20078 2007-08-01
In steps,
first we cast the regular datetimes to a new series called s
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 NaT
5 NaT
Name: date, dtype: datetime64[ns]
As you can see, there are two NaT values (null datetimes) in the series; these correspond to your dates that are missing a day.
We then apply pd.to_datetime again with the other format, and use the result to fill the missing values of s:
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 2005-10-01
5 2007-08-01
Then we re-assign the result to your dataframe.
You could use a regex to pull out the year and month, and convert to datetime :
df = pd.read_clipboard("\s{2,}",header=None,names=["Dates"])
pattern = r"(?P<Year>\d{4})[-]*(?P<Month>\d{1,2})"
df['Dates'] = pd.to_datetime([f"{year}-{month}" for year, month in df.Dates.str.extract(pattern).to_numpy()])
print(df)
Dates
0 2001-12-01
1 2002-09-01
2 2001-02-01
3 2001-05-01
4 2005-10-01
5 2007-08-01
Note that pandas automatically sets the day to 1, since only the year and month were supplied.
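If the goal is the %Y-%m text itself rather than a datetime, a small sketch that skips datetime parsing entirely could look like this. It assumes every value starts with a four-digit year, optionally followed by a hyphen, then a one- or two-digit month; the names `dates`, `parts` and `year_month` are illustrative.

```python
import pandas as pd

dates = pd.Series(["2001-12-25", "2002-9-27", "2001-2-24", "2001-5-3", "200510", "20078"])

# Capture the 4-digit year and the 1-2 digit month; the hyphen is optional.
parts = dates.str.extract(r"^(?P<year>\d{4})-?(?P<month>\d{1,2})")

# Zero-pad the month so every value has the same %Y-%m shape.
year_month = parts["year"] + "-" + parts["month"].str.zfill(2)
print(year_month.tolist())
# ['2001-12', '2002-09', '2001-02', '2001-05', '2005-10', '2007-08']
```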

Tricky slicing specifications on business-day datetimeindex

I have a pandas dataframe with a business-day-based DateTimeIndex. For each month that's in the index, I also have a single 'marker' day specified.
Here's a toy version of that dataframe:
# a dataframe with business dates as the index
df = pd.DataFrame(list(range(91)), pd.date_range('2015-04-01', '2015-6-30'), columns=['foo']).resample('B').last()
# each month has an single, arbitrary marker day specified
marker_dates = [df.index[12], df.index[33], df.index[57]]
For each month in the index, I need to calculate the average of the foo column over a specific slice of rows in that month.
There are two different ways I need to be able to specify those slices:
1) m'th day to n'th day.
An example might be the 2nd to 4th business day in that month. So April would be the average of 1 (Apr 2), 4 (Apr 3), and 5 (Apr 6) = 3.33. May would be 33 (May 4), 34 (May 5), 35 (May 6) = 34. I don't count the weekends/holidays that don't occur in the index as days.
2) m'th day before/after the marker date to the n'th day before/after the marker date.
An example might be "the average of the slice from 1 day before the marker date to 1 day after the marker date in each month". E.g., in April, the marker date is Apr 17. Looking at the index, we want the average of Apr 16, Apr 17, and Apr 20.
For Example 1, I had an ugly solution: for each month, I would slice out the rows of that month and then apply df_slice.iloc[m:n].mean().
Whenever I start doing iterative things with pandas, I always suspect I'm doing it wrong. So I imagine there is a cleaner, pythonic/vectorized way to produce this result for all the months.
For Example 2, I don't know a good way to do this slice-averaging based on arbitrary dates across many months.
Use BDay() from pandas.tseries.offsets
import pandas as pd
from pandas.tseries.offsets import BDay
M=2
N=4
start_date = pd.Timestamp(2015,4,1)
end_date = pd.Timestamp(2015,6,30)
df = pd.DataFrame(list(range(91)), pd.date_range('2015-04-01', '2015-6-30'), columns=['foo']).resample('B').last()
# for month starts
marker_dates = pd.date_range(start=start_date, end=end_date, freq='BMS')
# create IntervalIndex
bins = pd.IntervalIndex.from_tuples([ (d + (M-1)*BDay(), d + (N-1)*BDay()) for d in marker_dates ], closed='both')
df.groupby(pd.cut(df.index, bins)).mean()
#[2015-04-02, 2015-04-06] 3.333333
#[2015-05-04, 2015-05-06] 34.000000
#[2015-06-02, 2015-06-04] 63.000000
# any markers
marker_dates = [df.index[12], df.index[33], df.index[57]]
# M Bday before, and N Bday after
bins = pd.IntervalIndex.from_tuples([ (d - M*BDay(), d + N*BDay()) for d in marker_dates ], closed='both')
df.groupby(pd.cut(df.index, bins)).mean()
#[2015-04-15, 2015-04-23] 18.428571
#[2015-05-14, 2015-05-22] 48.000000
#[2015-06-17, 2015-06-25] 81.428571
The most pythonic/vectorized (pandonic?) way to do this might be to use df.rolling and df.shift to generate the window over which you'll take the average, then df.reindex to select the value at the dates you've marked.
For your example (2), this could look like:
df['foo'].rolling(3).mean().shift(-1).reindex(marker_dates)
Out[8]:
2015-04-17 17.333333
2015-05-18 47.000000
2015-06-19 80.333333
Name: foo, dtype: float64
This could be wrapped in a small function:
def window_mean_at_indices(df, indices, begin=-1, end=1):
    return df.rolling(1+end-begin).mean().shift(-end).reindex(indices)
To make it clearer how to apply this to situation (1):
month_starts = pd.date_range(df.index.min(), df.index.max(), freq='BMS')
month_starts
Out[11]: DatetimeIndex(['2015-04-01', '2015-05-01', '2015-06-01'],
dtype='datetime64[ns]', freq='BMS')
window_mean_at_indices(df['foo'], month_starts, begin=1, end=3)
Out[12]:
2015-04-01 3.333333
2015-05-01 34.000000
2015-06-01 63.000000
Freq: BMS, Name: foo, dtype: float64
For your first problem you can use grouper and iloc i.e
low = 2
high= 4
slice_mean = df.groupby(pd.Grouper(level=0,freq='m')).apply(lambda x : x.iloc[low-1:high].mean())
# or df.resample('m').apply(lambda x : x.iloc[low-1:high].mean())
foo
2015-04-30 3.333333
2015-05-31 34.000000
2015-06-30 63.000000
For your second problem you can concat the dates and take the groupby mean per month, i.e.
idx = np.where(df.index.isin(pd.Series(marker_dates)))[0]  # requires: import numpy as np (the pd.np alias has been removed from pandas)
#array([12, 33, 57])
temp = pd.concat([df.iloc[(idx+i)] for i in [-1,0,1]])
foo
2015-04-16 15
2015-05-15 46
2015-06-18 78
2015-04-17 18
2015-05-18 47
2015-06-19 81
2015-04-20 19
2015-05-19 48
2015-06-22 82
# Groupby mean
temp.groupby(pd.Grouper(level=0,freq='m')).mean()
# or temp.resample('m').mean()
foo
2015-04-30 17.333333
2015-05-31 47.000000
2015-06-30 80.333333
dtype: float64
Since the index of the output isn't specified in the question, do let us know what it should be.
Here's what I managed to come up with:
Import pandas and setup the dataframe
import pandas as pd
df = pd.DataFrame(list(range(91)), pd.date_range('2015-04-01', '2015-6-30'), columns=['foo']).resample('B').mean()
Start with a pure list of marker dates, since I'm guessing that's what you're really starting with:
marker_dates = [
    pd.to_datetime('2015-04-17', format='%Y-%m-%d'),
    pd.to_datetime('2015-05-18', format='%Y-%m-%d'),
    pd.to_datetime('2015-06-19', format='%Y-%m-%d')
]
marker_df = pd.DataFrame([], columns=['marker', 'start', 'end', 'avg'])
marker_df['marker'] = marker_dates
For the case where you want to just test ranges, input the start and end manually here instead of calculating it. If you want to change the range you can change the arguments to shift():
marker_df['start'] = df.index.shift(-1)[df.index.isin(marker_df['marker'])]
marker_df['end'] = df.index.shift(1)[df.index.isin(marker_df['marker'])]
Finally, use DataFrame.apply() to do a row-by-row calculation of averages and store it in the avg column:
marker_df['avg'] = marker_df.apply(
    lambda x: df[(x['start'] <= df.index) & (df.index <= x['end'])]['foo'].mean(),
    axis=1
)
Which gives us this result:
marker start end avg
0 2015-04-17 2015-04-16 2015-04-20 17.000000
1 2015-05-18 2015-05-15 2015-05-19 46.666667
2 2015-06-19 2015-06-18 2015-06-22 80.000000
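For the first case (m'th to n'th business day of each month), another option is to number the rows within each month and filter on that number. The sketch below rebuilds the toy frame from the question; the helper name `monthly_slice_mean` and its 1-based, inclusive `m` and `n` arguments are assumptions made for the example.

```python
import pandas as pd

# Toy frame from the question: one row per business day.
df = (
    pd.DataFrame({"foo": range(91)}, index=pd.date_range("2015-04-01", "2015-06-30"))
    .resample("B")
    .last()
)

def monthly_slice_mean(frame, m, n):
    """Mean of 'foo' over the m'th to n'th row (1-based, inclusive) of each month."""
    month = frame.index.to_period("M")
    # cumcount() numbers the rows inside each month starting at 0.
    position = frame.groupby(month).cumcount() + 1
    selected = frame[(position >= m) & (position <= n)]
    return selected.groupby(selected.index.to_period("M"))["foo"].mean()

result = monthly_slice_mean(df, 2, 4)
print(result)
```

Only rows present in the index are counted, so weekends and any other missing days are skipped automatically, which matches the counting rule described in the question.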

Join and sum on subset of rows in a dataframe

I have a pandas dataframe which stores date ranges and some associated columns:
date_start date_end ... lots of other columns ...
1 2016-07-01 2016-07-02
2 2016-07-01 2016-07-03
3 2016-07-01 2016-07-04
4 2016-07-02 2016-07-07
5 2016-07-05 2016-07-06
and another dataframe of Pikachu sightings indexed by date:
pikachu_sightings
date
2016-07-01 2
2016-07-02 4
2016-07-03 6
2016-07-04 8
2016-07-05 10
2016-07-06 12
2016-07-07 14
For each row in the first df I'd like to calculate the sum of pikachu_sightings within that date range (i.e., date_start to date_end) and store that in a new column. So would end up with a df like this (numbers left in for clarity):
date_start date_end total_pikachu_sightings
1 2016-07-01 2016-07-02 2 + 4
2 2016-07-01 2016-07-03 2 + 4 + 6
3 2016-07-01 2016-07-04 2 + 4 + 6 + 8
4 2016-07-02 2016-07-07 4 + 6 + 8 + 10 + 12 + 14
5 2016-07-05 2016-07-06 10 + 12
If I was doing this iteratively I'd iterate over each row in the table of date ranges, select the subset of rows in the table of sightings that match the date range and perform a sum on it - but this is way too slow for my dataset:
for range in ranges.itertuples():
    sightings_in_range = sightings[(sightings.index >= range.date_start) & (sightings.index <= range.date_end)]
    sum_sightings_in_range = sightings_in_range["pikachu_sightings"].sum()
    ranges.set_value(range.Index, 'total_pikachu_sightings', sum_sightings_in_range)
This is my attempt at using pandas, but fails because the length of the two dataframes does not match (and even if they did, there's probably some other flaw in my approach):
range["total_pikachu_sightings"] =
sightings[(sightings.index >= range.date_start) & (sightings.index <= range.date_end)
["pikachu_sightings"].sum()
I'm trying to understand what the general approach/design should look like as I'd like to aggregate with other functions too, sum just seems like the easiest for an example. Sorry if this is an obvious question - I'm new to pandas!
A sketch of a vectorized solution:
Start with a p as in piRSquared's answer.
Make sure the date_ columns have datetime64 dtypes, i.e.:
df['date_start'] = pd.to_datetime(df.date_start)
Then calculate cumulative sums:
psums = p.cumsum()
and
result = psums.asof(df.date_end) - psums.asof(df.date_start)
It's not yet the end, though. asof returns the last good value, so it sometimes will take the exact start date and sometimes not (depending on your data). So, you have to adjust for that. (If the date frequency is day, then probably moving the index of p an hour backwards, e.g. -pd.Timedelta(1, 'h'), and then adding p.asof(df.date_start) might do the trick.)
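One way to avoid the start-date adjustment is to swap asof for positional lookups with searchsorted on the sorted index. This is a different technique from the asof sketch above, shown here on the data from the question; `csum`, `lo` and `hi` are names invented for the example.

```python
import numpy as np
import pandas as pd

# Sightings from the question, as a Series with a sorted DatetimeIndex.
p = pd.Series(
    [2, 4, 6, 8, 10, 12, 14],
    index=pd.date_range("2016-07-01", periods=7),
    name="pikachu_sightings",
)
df = pd.DataFrame({
    "date_start": pd.to_datetime(["2016-07-01", "2016-07-01", "2016-07-01", "2016-07-02", "2016-07-05"]),
    "date_end": pd.to_datetime(["2016-07-02", "2016-07-03", "2016-07-04", "2016-07-07", "2016-07-06"]),
})

# Prefix sums with a leading 0, so csum[k] is the total of the first k sightings.
csum = np.concatenate(([0], p.to_numpy().cumsum()))

# lo: position of the first sighting on or after date_start.
# hi: position just past the last sighting on or before date_end.
lo = p.index.searchsorted(df["date_start"].to_numpy(), side="left")
hi = p.index.searchsorted(df["date_end"].to_numpy(), side="right")

df["total_pikachu_sightings"] = csum[hi] - csum[lo]
print(df)
```

This only works for aggregations that can be expressed as a difference of running totals (sum, count, and from those the mean); for other functions a per-row slice is still needed.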
First make sure that pikachu_sightings has a datetime index and is sorted.
p = pikachu_sightings.squeeze() # force into a series
p.index = pd.to_datetime(p.index)
p = p.sort_index()
Then make sure your date_start and date_end are datetime.
df.date_start = pd.to_datetime(df.date_start)
df.date_end = pd.to_datetime(df.date_end)
Then it's simply
df.apply(lambda x: p[x.date_start:x.date_end].sum(), axis=1)
0 6
1 12
2 20
3 54
4 22
dtype: int64

Simplest way to find the difference between two dates in pandas

I'm trying to find the difference between two dates in a multi index data frame that is the result of a pivot table operation.
The data frame contains three columns. The first is a measurement, the second is the end date, and the third is the start date.
I've been able to add a further multi-index column to the data frame, but only by setting every cell to zero with
Pt["min"]["start_date"] = 0. When I try to subtract the two dates I get a string error, and appending .dt.days to the end of each column results in an error as well.
What is the simplest way to find the difference in days between two dates in a multi index pandas data frame?
You can select Multiindex in columns by tuples and subtract columns:
print (df)
a
meas end start
0 7 2015-04-05 2015-04-01
1 8 2015-04-07 2015-04-02
2 9 2015-04-14 2015-04-04
#if dtypes not datetime
df[('a','end')] = pd.to_datetime(df[('a','end')])
df[('a','start')] = pd.to_datetime(df[('a','start')])
df[('a','diff')] = df[('a','end')] - df[('a','start')]
print (df)
a
meas end start diff
0 7 2015-04-05 2015-04-01 4 days
1 8 2015-04-07 2015-04-02 5 days
2 9 2015-04-14 2015-04-04 10 days
If you need the output in days:
df[('a','diff')] = (df[('a','end')] - df[('a','start')]).dt.days
print (df)
a
meas end start diff
0 7 2015-04-05 2015-04-01 4
1 8 2015-04-07 2015-04-02 5
2 9 2015-04-14 2015-04-04 10
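If the pivot table has several top-level groups, the same subtraction can be done for all of them at once by taking a cross-section of the second column level. The frame below is a made-up stand-in for the pivot result; the level names "max", "min", "end_date" and "start_date" are assumptions based on the question's wording.

```python
import pandas as pd

# Hypothetical pivot-like frame: two top-level groups, each with an end and a start date.
columns = pd.MultiIndex.from_product([["max", "min"], ["end_date", "start_date"]])
pt = pd.DataFrame(
    [
        ["2015-04-09", "2015-04-02", "2015-04-05", "2015-04-01"],
        ["2015-04-14", "2015-04-04", "2015-04-07", "2015-04-02"],
    ],
    columns=columns,
)

# Convert every column from string to datetime64.
pt = pt.apply(pd.to_datetime)

# xs picks one key from the second column level and keeps the top-level names.
ends = pt.xs("end_date", axis=1, level=1)
starts = pt.xs("start_date", axis=1, level=1)

# Subtracting aligns on the top-level names; .dt.days turns timedeltas into integers.
diff_days = (ends - starts).apply(lambda col: col.dt.days)
print(diff_days)
```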
