get subset dataframe by date - python

I have the following dataframe with a Start Date (DD/MM/YYYY) and an Amount:
  Start Date  Amount
1 01/01/2013      20
2 02/05/2007      10
3 01/05/2004      15
4 01/06/2014      20
5 17/08/2008      21
I'd like to create a subset of this dataframe containing only the rows where the day of Start Date is 01:
  Start Date  Amount
1 01/01/2013      20
3 01/05/2004      15
4 01/06/2014      20
I've tried to loop through the table and use the index but couldn't find a suitable way to iterate through a dataframe's rows.

Assuming your dates are already datetime, the following should work. If they are strings you can convert them using to_datetime, so df['Start Date'] = pd.to_datetime(df['Start Date']); you may also need to pass dayfirst=True for DD/MM/YYYY dates. If you imported the data using read_csv you could have done this at the point of import, so df = pd.read_csv('data.csv', parse_dates=[n], dayfirst=True) where n is the 0-based index of the date column; if it was the first column, pass parse_dates=[0].
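For example, both routes side by side (a minimal sketch; 'data.csv' is the file name from above):
import pandas as pd
# Convert an existing string column, treating the dates as DD/MM/YYYY
df['Start Date'] = pd.to_datetime(df['Start Date'], dayfirst=True)
# Or parse at import time, assuming Start Date is the first (index 0) column
df = pd.read_csv('data.csv', parse_dates=[0], dayfirst=True)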
One method could be to apply a lambda to the column and use the returned boolean Series to index against the dataframe:
In [19]:
df[df['Start Date'].apply(lambda x: x.day == 1)]
Out[19]:
Start Date Amount
index
1 2013-01-01 20
3 2004-05-01 15
4 2014-06-01 20
I'm not sure if there is a built-in method that doesn't involve setting the date column to be the index, which would convert the frame into a time series.
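As an aside, a vectorized alternative to apply (assuming the column is already datetime) is the .dt accessor:
df[df['Start Date'].dt.day == 1]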

Related

How to subset rows based on date overlap range efficiently using python pandas?

My data frame has two date type columns: start and end (yyyy-mm-dd).
Here's my data frame:
import pandas as pd
import datetime
data=[["2016-10-17","2017-03-08"],["2014-08-17","2016-09-08"],["2014-01-01","2015-01-01"],["2017-12-20","2019-01-01"]]
df=pd.DataFrame(data,columns=['start','end'])
df['start'] = pd.to_datetime(df['start'], format='%Y-%m-%d')
df['end'] = pd.to_datetime(df['end'], format='%Y-%m-%d')
start end
0 2016-10-17 2017-03-08
1 2014-08-17 2016-09-08
2 2014-01-01 2015-01-01
3 2017-12-20 2019-01-01
And I have a reference start and end date as follows:
ref_start=datetime.date(2015, 9, 20)
ref_end=datetime.date(2017,1,31)
print(ref_start,ref_end)
2015-09-20 2017-01-31
I would like to subset rows where the row's start-to-end range overlaps with the reference start and end dates. The third and fourth rows are not selected since their date ranges do not overlap with the reference date range (2015-09-20 ~ 2017-01-31).
So my desired outcome looks like this:
start end
0 2016-10-17 2017-03-08
1 2014-08-17 2016-09-08
To do that, I was thinking about using the following code, based on Efficient date range overlap calculation in python?:
df[(max(df['start'],ref_start)>min(df['end'],ref_end))]
However, it doesn't work. Is there any way to get the desired outcome efficiently?
A trick I learned early on in my career is what I call "crossing the dates": you compare the start of one range against the end of the other.
# pd.Timestamp can do everything that datetime/date does and some more
ref_start = pd.Timestamp(2015, 9, 20)
ref_end = pd.Timestamp(2017,1,31)
# Compare the start of one range to the end of another and vice-versa
# Made into a separate variable for readability
cond = (ref_start <= df['end']) & (ref_end >= df['start'])
df[cond]
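Two ranges overlap exactly when each one starts on or before the other ends, which is what the two comparisons encode. Applied to the example frame above, df[cond] keeps rows 0 and 1, matching the desired output.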

Create a new column in a dataframe that shows Day of the Week from an already existing dd/mm/yy column? Python

I have a dataframe that contains a column with dates e.g. 24/07/15 etc
Is there a way to create a new column into the dataframe that displays all the days of the week corresponding to the already existing 'Date' column?
I want the output to appear as:
[Date][DayOfTheWeek]
This might work:
If you want day name:
In [1405]: df
Out[1405]:
dates
0 24/07/15
1 25/07/15
2 26/07/15
In [1406]: df['dates'] = pd.to_datetime(df['dates']) # You don't need to specify the format here; pandas infers it (passing dayfirst=True is safer for dd/mm/yy).
In [1408]: df['dow'] = df['dates'].dt.day_name()
In [1409]: df
Out[1409]:
dates dow
0 2015-07-24 Friday
1 2015-07-25 Saturday
2 2015-07-26 Sunday
If you want day number:
In [1410]: df['dow'] = df['dates'].dt.day
In [1411]: df
Out[1411]:
dates dow
0 2015-07-24 24
1 2015-07-25 25
2 2015-07-26 26
I would try the apply function, so something like this:
def extractDayOfWeek(dateString):
    ...
df['DayOfWeek'] = df.apply(lambda x: extractDayOfWeek(x['Date']), axis=1)
The idea is that you map over every row, extract the 'Date' column, and apply your own function to create a new column named 'DayOfWeek'.
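A minimal sketch of what extractDayOfWeek might look like, assuming the dates are dd/mm/yy strings:
from datetime import datetime

def extractDayOfWeek(dateString):
    # Parse a dd/mm/yy string and return the weekday name, e.g. 'Friday'
    return datetime.strptime(dateString, '%d/%m/%y').strftime('%A')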
Depending on the type of your Date column:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%y')
df['weekday'] = df['Date'].dt.dayofweek  # integer day of week, Monday=0 through Sunday=6

Combining Columns in Isoweek Object

I have a pandas dataframe that contains a year and week column:
year week
2018 18
2019 17
2019 17
I'm trying to combine the year and week columns into a new 'isoweek' column using the isoweek library. I can't seem to figure out how to properly loop through the rows to create the object column. If I do something like:
df['isoweek'] = Week(df['year'],df['week'])
isoweek chokes on the vectorization. I've tried creating a basic list and appending it to my dataframe, like so:
obj_list = []
for i in range(500):
    year = df['year'][i]
    week = df['week'][i]
    w = Week(year, week)
    obj_list.append(w)
df['isoweek'] = obj_list
But I end up with a simple tuple in the column.
The goal is to be able to use some of the isoweek library's operations to calculate date differences, like:
df['isoweek'] - 4
>isoweek.Week(2019, 34)
Is it even possible to store an object like this in a dataframe column? If so, how does one go about it?
As an alternative, you can use the built-in method for datetime:
df['week_start'] = pd.to_datetime(df['year'].astype(str), format='%Y') + pd.to_timedelta(df['week'].mul(7).astype(str) + ' days')
# Output:
week year week_start
0 18 2018 2018-05-07
1 17 2019 2019-04-30
2 17 2019 2019-04-30
Calculating time differences is pretty straightforward here:
# Choose 7 weeks
n_weeks = pd.to_timedelta(7, unit='W')
# Adding is simple
df['week_start'] + n_weeks
# Output
0 2018-06-25
1 2019-06-18
2 2019-06-18
For more on this, read: Pandas: How to create a datetime object from Week and Year?
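One caveat: the year-start-plus-multiples-of-7-days arithmetic above is not ISO-week-exact (it gives 2018-05-07 for week 18 of 2018, while the isoweek-based answer below arrives at 2018-04-30). If you want the true ISO Monday without the isoweek library, a minimal standard-library sketch (Python 3.6+ supports the %G/%V/%u ISO directives):
from datetime import datetime

def iso_week_monday(year, week):
    # %G = ISO year, %V = ISO week number, %u = ISO weekday (1 = Monday)
    return datetime.strptime('{} {} 1'.format(year, week), '%G %V %u')

iso_week_monday(2018, 18)  # datetime(2018, 4, 30)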
Potentially you could do this
First, set up the example dataframe
import pandas as pd
from isoweek import Week

df = pd.DataFrame({'year': [2018, 2019, 2019],
                   'week': [18, 17, 17]})
Loop through the dataframe, adding the isoweek to a list
ls_isoweek = []
for row in df.itertuples():
    ls_isoweek.append(Week(row[1], row[2]))
The list looks like this
[isoweek.Week(2018, 18), isoweek.Week(2019, 17), isoweek.Week(2019, 17)]
This list can be accessed thusly
ls_isoweek[0] - 4
Produces this output
isoweek.Week(2018, 14)
However, the list can also be added back to the dataframe if you wish
df['isoweek'] = ls_isoweek
You can then do things like ...
df['isoweek_minus_4'] = df['isoweek'].apply(lambda x: x-4)
Producing an isoweek_minus_4 column containing isoweek.Week(2018, 14), isoweek.Week(2019, 13) and isoweek.Week(2019, 13) alongside the original columns.
A little late, but if anyone else is still looking for a solution of this form as I was, you can use lambda functions along with apply. For the dataframe below (with int64 dtypes):
year week
0 2018 18
1 2019 17
2 2019 17
Now we use isoweek to parse the data appropriately:
from isoweek import Week
df.apply(lambda row : Week(row["year"],row["week"]),axis=1)
This produces the output,
0 (2018, 18)
1 (2019, 17)
2 (2019, 17)
dtype: object
You could also identify the (year, week) pair with a datetime object by combining this approach with this answer: https://stackoverflow.com/a/7687085.
df.apply(lambda row : Week(int(row["year"]),int(row["week"])).monday(),axis=1)
The int casts appear a little redundant there, but pandas by default uses int64, which doesn't appear to work correctly with isoweek. This produces the output:
0 2018-04-30
1 2019-04-22
2 2019-04-22
dtype: object

Python data-frame using pandas

I have a dataset which looks like below
[25/May/2015:23:11:15 000]
[25/May/2015:23:11:15 000]
[25/May/2015:23:11:16 000]
[25/May/2015:23:11:16 000]
Now I have made this into a DF where df[0] has [25/May/2015:23:11:15 and df[1] has 000]. I want to send all the rows which end with the same seconds value to a file. In the above example the timestamps end with 15 and 16 as the seconds, so all rows ending with second 15 go into one file, those ending with 16 into another, and so on.
I have tried the below code
import pandas as pd
data = pd.read_csv('apache-access-log.txt', sep=" ", header=None)
df = pd.DataFrame(data)
print(df[0],df[1].str[-2:])
Converting that column to a datetime would make it easier to work on, e.g.:
df['date'] = pd.to_datetime(df['date'], format='%d/%b/%Y:%H:%M:%S')
Then you can simply iterate over a groupby(), e.g.:
In []:
for k, frame in df.groupby(df['date'].dt.second):
    # frame.to_csv('file{}.csv'.format(k))
    print('{}\n{}\n'.format(k, frame))
Out[]:
15
                 date  value
0 2015-05-25 23:11:15      0
1 2015-05-25 23:11:15      0

16
                 date  value
2 2015-05-25 23:11:16      0
3 2015-05-25 23:11:16      0
You can set your datetime as the index of the dataframe, and then use the loc and to_csv pandas functions. Obviously, as other answers point out, you should convert your date to datetime while reading your dataframe.
Example:
df = df.set_index(['date'])
df.loc['25/05/2015 23:11:15':'25/05/2015 23:11:15'].to_csv('df_data.csv')
Try this:
## Create a new column with the seconds value
df['seconds'] = df.apply(lambda row: row[0].split(":")[3].split(" ")[0], axis=1)
for sec in df['seconds'].unique():
    ## filter by seconds
    print("Result ", df[df['seconds'] == sec])

How Can I Detect Gaps and Consecutive Periods In A Time Series In Pandas

I have a pandas DataFrame that is indexed by date. I would like to select all consecutive gaps by period and all consecutive days by period. How can I do this?
Example of a DataFrame with no columns but a date index:
In [29]: import pandas as pd
In [30]: dates = pd.to_datetime(['2016-09-19 10:23:03', '2016-08-03 10:53:39','2016-09-05 11:11:30', '2016-09-05 11:10:46','2016-09-05 10:53:39'])
In [31]: ts = pd.DataFrame(index=dates)
As you can see, there is a gap between 2016-08-03 and 2016-09-19. How do I detect these so I can create descriptive statistics, e.g. 40 gaps with a median gap duration of "x", etc.? Also, I can see that 2016-09-05 and 2016-09-06 form a two-day range. How can I detect these and also print descriptive stats?
Ideally the result would be returned as another Dataframe in each case since I want use other columns in the Dataframe to groupby.
Pandas has a built-in diff() method which you can use to accomplish this. One benefit is that you can use pandas Series functions like mean() to quickly compute summary statistics on the resulting gaps object.
from datetime import datetime, timedelta
import pandas as pd
# Construct dummy dataframe
dates = pd.to_datetime([
'2016-08-03',
'2016-08-04',
'2016-08-05',
'2016-08-17',
'2016-09-05',
'2016-09-06',
'2016-09-07',
'2016-09-19'])
df = pd.DataFrame(dates, columns=['date'])
# Take the diff of the first column (drop 1st row since it's undefined)
deltas = df['date'].diff()[1:]
# Filter diffs (here days > 1, but could be seconds, hours, etc)
gaps = deltas[deltas > timedelta(days=1)]
# Print results
print(f'{len(gaps)} gaps with average gap duration: {gaps.mean()}')
for i, g in gaps.items():
    gap_start = df['date'][i - 1]
    print(f'Start: {datetime.strftime(gap_start, "%Y-%m-%d")} | '
          f'Duration: {str(g.to_pytimedelta())}')
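Run against the dummy frame above, this prints something like:
3 gaps with average gap duration: 14 days 08:00:00
Start: 2016-08-05 | Duration: 12 days, 0:00:00
Start: 2016-08-17 | Duration: 19 days, 0:00:00
Start: 2016-09-07 | Duration: 12 days, 0:00:00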
here's something to get started:
import numpy as np
df = pd.DataFrame(np.ones(5), columns=['ones'])
df.index = pd.DatetimeIndex(['2016-09-19 10:23:03', '2016-08-03 10:53:39', '2016-09-05 11:11:30', '2016-09-05 11:10:46', '2016-09-06 10:53:39'])
daily_rng = pd.date_range('2016-08-03 00:00:00', periods=48, freq='D')
daily_rng = daily_rng.append(df.index)
daily_rng = sorted(daily_rng)
df = df.reindex(daily_rng).fillna(0)
df = df.astype(int)
df['ones'] = df['ones'].cumsum()
The cumsum() creates a grouping variable on 'ones', partitioning your data at the points you provided. If you print df (or export it to a spreadsheet) it will make sense:
print(df.head())
ones
2016-08-03 00:00:00 0
2016-08-03 10:53:39 1
2016-08-04 00:00:00 1
2016-08-05 00:00:00 1
2016-08-06 00:00:00 1
print(df.tail())
ones
2016-09-16 00:00:00 4
2016-09-17 00:00:00 4
2016-09-18 00:00:00 4
2016-09-19 00:00:00 4
2016-09-19 10:23:03 5
now to complete:
df = df.reset_index()
df = df.groupby('ones').aggregate(first_spotted=('index', 'min'),
                                  gaps=('ones', 'count'))
which gives:
first_spotted gaps
ones
0 2016-08-03 00:00:00 1
1 2016-08-03 10:53:39 34
2 2016-09-05 11:10:46 1
3 2016-09-05 11:11:30 2
4 2016-09-06 10:53:39 14
5 2016-09-19 10:23:03 1
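For the consecutive-days half of the question, the same diff-and-cumsum idea applies: start a new run id whenever the gap to the previous date exceeds one day. A minimal sketch (not from either answer above), assuming daily granularity:
import pandas as pd

dates = pd.Series(pd.to_datetime(['2016-08-03', '2016-08-04', '2016-08-05',
                                  '2016-09-05', '2016-09-06'])).sort_values()
# New run whenever the gap to the previous date exceeds one day
run_id = dates.diff().gt(pd.Timedelta(days=1)).cumsum()
for _, run in dates.groupby(run_id):
    print(run.min().date(), '->', run.max().date(), '({} days)'.format(len(run)))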
