pandas insert new first row and calculate timestamp based on minimum date - python

Hi, I am looking for a more elegant solution than my code. I have a given df which looks like this:
import pandas as pd
from datetime import date, timedelta
from pandas.tseries.offsets import DateOffset

sdate = date(2021,1,31)
edate = date(2021,8,30)
date_range = pd.date_range(sdate, edate - timedelta(days=1), freq='M')
df_test = pd.DataFrame({'Datum': date_range})
I take this df and have to insert a new first row based on the minimum date:
import numpy as np

data_perf_indexed_vv = df_test.copy()
minimum_date = df_test['Datum'].min()
data_perf_indexed_vv = data_perf_indexed_vv.reset_index()
df1 = pd.DataFrame([[np.nan] * len(data_perf_indexed_vv.columns)],
                   columns=data_perf_indexed_vv.columns)
data_perf_indexed_vv = df1.append(data_perf_indexed_vv, ignore_index=True)
data_perf_indexed_vv['Datum'].iloc[0] = minimum_date - DateOffset(months=1)
data_perf_indexed_vv = data_perf_indexed_vv.drop(['index'], axis=1)
Does somebody have a shorter or more elegant solution? Thanks.

Instead of writing such a big second block of code, just make use of:
df_test.loc[len(df_test)+1, 'Datum'] = df_test['Datum'].min() - DateOffset(months=1)
Finally, make use of the sort_values() method:
df_test = df_test.sort_values(by='Datum', ignore_index=True)
Now if you print df_test you will get the desired output:
#output
Datum
0 2020-12-31
1 2021-01-31
2 2021-02-28
3 2021-03-31
4 2021-04-30
5 2021-05-31
6 2021-06-30
7 2021-07-31
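Note that DataFrame.append() (used in the question's code) was removed in pandas 2.0. A minimal sketch of the same prepend-and-sort idea using pd.concat(), with the same df_test as above:
# build the new first row and concatenate it in front of the original frame
new_row = pd.DataFrame({'Datum': [df_test['Datum'].min() - DateOffset(months=1)]})
df_test = pd.concat([new_row, df_test], ignore_index=True)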

Related

Cannot find index of corresponding date in pandas DataFrame

I have the following DataFrame with a Date column,
0 2021-12-13
1 2021-12-10
2 2021-12-09
3 2021-12-08
4 2021-12-07
...
7990 1990-01-08
7991 1990-01-05
7992 1990-01-04
7993 1990-01-03
7994 1990-01-02
I am trying to find the index for a specific date in this DataFrame using the following code,
import datetime as dt
import pandas as pd

# import raw data into DataFrame
df = pd.DataFrame.from_records(data['dataset']['data'])
df.columns = data['dataset']['column_names']
df['Date'] = pd.to_datetime(df['Date'])
# sample date to search for
sample_date = dt.date(2021,12,13)
print(sample_date)
# return index of sample date
date_index = df.index[df['Date'] == sample_date].tolist()
print(date_index)
The output of the program is,
2021-12-13
[]
I can't understand why. I have cast the Date column in the DataFrame to a DateTime and I'm doing a like-for-like comparison.
I have reproduced your DataFrame with a minimal sample. The comparison fails because sample_date is a datetime.date, while the column holds datetime64 values, so every element-wise comparison is False. Comparing against a datetime (or pd.Timestamp) works, as shown below.
import pandas as pd
import datetime as dt
df = pd.DataFrame({'Date':['2021-12-13','2021-12-10','2021-12-09','2021-12-08']})
df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%Y-%m-%d')
sample_date = dt.datetime.strptime('2021-12-13', '%Y-%m-%d')
date_index = df.index[df['Date'] == sample_date].tolist()
print(date_index)
output:
[0]
The searched date is at index 0 of the DataFrame.
Please let me know if this one has any issues
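Alternatively, if you want to keep sample_date as a datetime.date, you can compare it against the date part of the datetime64 column. A minimal sketch on the same sample frame:
import datetime as dt
import pandas as pd

df = pd.DataFrame({'Date': ['2021-12-13', '2021-12-10', '2021-12-09', '2021-12-08']})
df['Date'] = pd.to_datetime(df['Date'])

# .dt.date yields plain datetime.date objects, so a date-to-date comparison matches
sample_date = dt.date(2021, 12, 13)
print(df.index[df['Date'].dt.date == sample_date].tolist())  # [0]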

Convert a string "ddMONyyyy" onto date in Python

I have the following pandas DataFrame df that includes one string column, mydate:
import pandas as pd
df = {'mydate': ['01JAN2009','20FEB2013','13MAR2010','01APR2012', '20MAY2013', '18JUN2018', '10JUL2002', '30AUG2000', '15SEP2001', '30OCT1999',
'04NOV2020', '23DEC1995']}
df = pd.DataFrame(df, columns = ['mydate'])
I need to convert mydate into date type and store it in a new column mydate2.
You could try this:
import pandas as pd
df = {'mydate': ['01JAN2009','20FEB2013','13MAR2010','01APR2012', '20MAY2013', '18JUN2018', '10JUL2002', '30AUG2000', '15SEP2001', '30OCT1999',
'04NOV2020', '23DEC1995']}
df = pd.DataFrame(df, columns = ['mydate'])
df['mydate2']=pd.to_datetime(df['mydate'])
print(df)
Output:
mydate mydate2
0 01JAN2009 2009-01-01
1 20FEB2013 2013-02-20
2 13MAR2010 2010-03-13
3 01APR2012 2012-04-01
4 20MAY2013 2013-05-20
5 18JUN2018 2018-06-18
6 10JUL2002 2002-07-10
7 30AUG2000 2000-08-30
8 15SEP2001 2001-09-15
9 30OCT1999 1999-10-30
10 04NOV2020 2020-11-04
11 23DEC1995 1995-12-23
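pd.to_datetime() infers the 'ddMONyyyy' pattern here, but you can also pass the format explicitly, which avoids inference and is typically faster on large frames; a small variant of the line above:
# %d = day, %b = abbreviated month name (JAN, FEB, ...), %Y = 4-digit year
df['mydate2'] = pd.to_datetime(df['mydate'], format='%d%b%Y')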

Python data-frame using pandas

I have a dataset which looks like below
[25/May/2015:23:11:15 000]
[25/May/2015:23:11:15 000]
[25/May/2015:23:11:16 000]
[25/May/2015:23:11:16 000]
Now I have made this into a DataFrame, where df[0] has [25/May/2015:23:11:15 and df[1] has 000]. I want to send all rows that end with the same second to a file. In the example above they end with 15 and 16 as the seconds, so all rows ending with second 15 go into one file, those ending with 16 into another, and so on.
I have tried the below code
import pandas as pd
data = pd.read_csv('apache-access-log.txt', sep=" ", header=None)
df = pd.DataFrame(data)
print(df[0],df[1].str[-2:])
Converting that column to a datetime would make it easier to work on, e.g.:
# note: %b matches abbreviated month names ('May'), %M is minutes
df['date'] = pd.to_datetime(df['date'], format='%d/%b/%Y:%H:%M:%S')
Then you can simply iterate over a groupby(), e.g.:
In []:
for k, frame in df.groupby(df['date'].dt.second):
    # frame.to_csv('file{}.csv'.format(k))
    print('{}\n{}\n'.format(k, frame))
Out[]:
15
date value
0 2015-05-25 23:11:15 0
1 2015-05-25 23:11:15 0
16
date value
2 2015-05-25 23:11:16 0
3 2015-05-25 23:11:16 0
You can set your datetime as the index for the dataframe, and then use pandas' loc and to_csv functions. Obviously, as other answers point out, you should convert your date to datetime while reading your dataframe.
Example:
df = df.set_index(['date'])
df.loc['2015-05-25 23:11:15':'2015-05-25 23:11:15'].to_csv('df_data.csv')
Try this:
# Create a new column with the seconds value
df['seconds'] = df.apply(lambda row: row[0].split(":")[3].split(" ")[0], axis=1)
for sec in df['seconds'].unique():
    # filter by seconds
    print("Result", df[df['seconds'] == sec])

Pandas updating weekend date to nearest business day

I have a dataframe that currently looks like this:
raw_data = {'AllDate':['2017-04-05','2017-04-06','2017-04-07','2017-04-08','2017-04-09']}
import pandas as pd
df = pd.DataFrame(raw_data,columns=['AllDate'])
print(df)
I would like to add a WeekDate column to this dataframe such that if the date in 'AllDate' falls on a weekend, the 'WeekDate' column holds the date of the Friday before. If the date falls on a weekday, the date should remain the same.
As an example, the resulting DataFrame should look like this:
raw_data = {'AllDate':['2017-04-05','2017-04-06','2017-04-07','2017-04-08','2017-04-09'],'WeekDate':['2017-04-05','2017-04-06','2017-04-07','2017-04-07','2017-04-07']}
import pandas as pd
df = pd.DataFrame(raw_data,columns=['AllDate','WeekDate'])
print(df)
Any ideas how I could achieve this?
This works best (adding to the answer posted by Zhe):
import pandas as pd
from datetime import timedelta
df = pd.DataFrame({'AllDate':['2017-04-05','2017-04-06','2017-04-07','2017-04-08','2017-04-09']})
df['WeekDate'] = [x if x.weekday() not in [5,6] else x - timedelta(days = (x.weekday()-4)) for x in pd.to_datetime(df['AllDate'])]
Try:
import pandas as pd

df = pd.DataFrame({
    'AllDate': ['2017-04-05','2017-04-06','2017-04-07','2017-04-08','2017-04-09']
})
df['WeekDate'] = [
    x if x.weekday() not in [5,6] else None for x in pd.to_datetime(df['AllDate'])
]
print(df.ffill())
Here's perhaps a simpler approach that comes up a lot when dealing with time series, etc. The key is the offset objects available in pandas' tseries module:
df = pd.DataFrame({"AllDate": ["2017-04-01", "2017-04-02", "2017-04-03", "2017-04-04", "2017-04-09"]})
df["AllDate"] = pd.to_datetime(df["AllDate"])
df["PrevBusDate"] = df["AllDate"].apply(pd.tseries.offsets.BusinessDay().rollback)
df.head()
...
>>> AllDate PrevBusDate
0 2017-04-01 2017-03-31
1 2017-04-02 2017-03-31
2 2017-04-03 2017-04-03
3 2017-04-04 2017-04-04
4 2017-04-09 2017-04-07
NB: you don't have to convert the 'AllDate' column if you don't want to. You can simply generate the offsets and work with them however you like, e.g.:
[pd.tseries.offsets.BusinessDay().rollback(d) for d in pd.to_datetime(df["AllDate"])]
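For larger frames, a vectorized variant of the same rollback idea exists in NumPy; a sketch (np.busday_offset treats Saturday and Sunday as non-business days by default, matching the weekday() checks above):
import numpy as np

days = pd.to_datetime(df['AllDate']).values.astype('datetime64[D]')
df['PrevBusDate'] = np.busday_offset(days, 0, roll='backward')  # weekend -> preceding Friday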

How Can I Detect Gaps and Consecutive Periods In A Time Series In Pandas

I have a pandas DataFrame that is indexed by date. I would like to select all consecutive gaps and all runs of consecutive days, by period. How can I do this?
Example of Dataframe with No Columns but a Date Index:
In [29]: import pandas as pd
In [30]: dates = pd.to_datetime(['2016-09-19 10:23:03', '2016-08-03 10:53:39','2016-09-05 11:11:30', '2016-09-05 11:10:46','2016-09-05 10:53:39'])
In [31]: ts = pd.DataFrame(index=dates)
As you can see there is a gap between 2016-08-03 and 2016-09-19. How do I detect these so I can create descriptive statistics, i.e. 40 gaps with a median gap duration of "x", etc.? Also, I can see that 2016-09-05 and 2016-09-06 form a two-day range. How can I detect these and also print descriptive stats?
Ideally the result would be returned as another Dataframe in each case since I want use other columns in the Dataframe to groupby.
Pandas has a built-in method Series.diff() which you can use to accomplish this. One benefit is that you can then use pandas Series functions like mean() to quickly compute summary statistics on the resulting gaps Series:
from datetime import datetime, timedelta
import pandas as pd
# Construct dummy dataframe
dates = pd.to_datetime([
    '2016-08-03',
    '2016-08-04',
    '2016-08-05',
    '2016-08-17',
    '2016-09-05',
    '2016-09-06',
    '2016-09-07',
    '2016-09-19'])
df = pd.DataFrame(dates, columns=['date'])
# Take the diff of the first column (drop 1st row since it's undefined)
deltas = df['date'].diff()[1:]
# Filter diffs (here days > 1, but could be seconds, hours, etc)
gaps = deltas[deltas > timedelta(days=1)]
# Print results
print(f'{len(gaps)} gaps with average gap duration: {gaps.mean()}')
for i, g in gaps.items():
    gap_start = df['date'][i - 1]
    print(f'Start: {datetime.strftime(gap_start, "%Y-%m-%d")} | '
          f'Duration: {str(g.to_pytimedelta())}')
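The question also asks about runs of consecutive days. A common pattern (a sketch, reusing df and the imports above) is to start a new run wherever the gap to the previous date exceeds one day, label the runs with cumsum(), and group on that label:
# a True marks the start of a new run; cumsum() turns the marks into run ids
run_id = (df['date'].diff() > timedelta(days=1)).cumsum()
runs = df.groupby(run_id)['date'].agg(['min', 'max', 'count'])
print(runs)  # one row per consecutive run: start, end, and length in days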
Here's something to get started:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones(5), columns=['ones'])
df.index = pd.DatetimeIndex(['2016-09-19 10:23:03', '2016-08-03 10:53:39', '2016-09-05 11:11:30', '2016-09-05 11:10:46', '2016-09-06 10:53:39'])
daily_rng = pd.date_range('2016-08-03 00:00:00', periods=48, freq='D')
daily_rng = daily_rng.append(df.index)
daily_rng = sorted(daily_rng)
df = df.reindex(daily_rng).fillna(0)
df = df.astype(int)
df['ones'] = df['ones'].cumsum()
The cumsum() creates a grouping variable on 'ones', partitioning your data at the points you provided. If you print df to, say, a spreadsheet, it will make sense:
print(df.head())
ones
2016-08-03 00:00:00 0
2016-08-03 10:53:39 1
2016-08-04 00:00:00 1
2016-08-05 00:00:00 1
2016-08-06 00:00:00 1
print(df.tail())
ones
2016-09-16 00:00:00 4
2016-09-17 00:00:00 4
2016-09-18 00:00:00 4
2016-09-19 00:00:00 4
2016-09-19 10:23:03 5
now to complete:
df = df.reset_index()
df = df.groupby('ones').agg(first_spotted=('index', 'min'), gaps=('index', 'count'))
which gives:
first_spotted gaps
ones
0 2016-08-03 00:00:00 1
1 2016-08-03 10:53:39 34
2 2016-09-05 11:10:46 1
3 2016-09-05 11:11:30 2
4 2016-09-06 10:53:39 14
5 2016-09-19 10:23:03 1
