Below is a sample pandas DataFrame. I am trying to find the difference between the dates in the two rows (using the first row as the base):
PH_number date Type
H09879721 2018-05-01 AccountHolder
H09879731 2018-06-22 AccountHolder
If the difference between the two dates is within 90 days, those two rows should be added to a new pandas DataFrame. The date column is of type object.
How can I do this?
Use .diff():
df.date.diff() <= pd.Timedelta(90, 'd')
0 False
1 True
Name: date, dtype: bool
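Note that .diff() assumes the date column is already datetime64; since the question says it is of type object, here is a minimal runnable sketch that recreates the sample data and converts first:
import pandas as pd

df = pd.DataFrame({
    'PH_number': ['H09879721', 'H09879731'],
    'date': ['2018-05-01', '2018-06-22'],
    'Type': ['AccountHolder', 'AccountHolder'],
})
df['date'] = pd.to_datetime(df['date'])  # object -> datetime64[ns]
print(df.date.diff() <= pd.Timedelta(90, 'd'))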
Convert the date column to datetime64[ns] using pd.to_datetime and then subtract as shown:
df['date'] = pd.to_datetime(df['date'])

# if comparing only with the 1st row
mask = (df['date'] - df.loc[0, 'date']).dt.days <= 90
# alternative: mask = (df['date'] - df.loc[0, 'date']).dt.days.le(90)

# if comparing with the immediately preceding row
mask = df['date'].diff().dt.days <= 90
# alternative: mask = df['date'].diff().dt.days.le(90)

df1 = df.loc[mask, :]  # gives the required rows with all columns
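Applied to the sample data from the question, the first mask keeps both rows, since 2018-06-22 is only 52 days after 2018-05-01:
mask = (df['date'] - df.loc[0, 'date']).dt.days <= 90
df1 = df.loc[mask, :]
print(df1)
   PH_number       date           Type
0  H09879721 2018-05-01  AccountHolder
1  H09879731 2018-06-22  AccountHolder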
Related
I have a dataframe containing a column which holds a list of dates. The number of dates can vary (2+ dates). I was hoping to create a new column containing the number of days between the minimum and maximum dates in each list, and am not entirely sure of the best way to do this. Any help would be greatly appreciated!
import pandas as pd

data = [
    ["Item_1", ["2020-06-01", "2020-06-02", "2020-07-05"]],
    ["Item_2", ["2018-04-15", "2018-04-22"]],
    ["Item_3", ["2016-02-15", "2016-02-22", "2016-03-05", "2016-04-01"]],
]
df = pd.DataFrame(data, columns=["Item_ID", "Dates"])
df
We can Series.explode the Dates column, convert it with to_datetime, then groupby + agg to find the min and max dates per group, take the diff across those two columns, and assign the result back to a new column:
df['Duration'] = (
    # explode lists into a usable Series and convert to datetime
    pd.to_datetime(df['Dates'].explode())
    .groupby(level=0).agg(['min', 'max'])  # get min and max per group
    .diff(axis=1)  # diff across the min/max columns
    .iloc[:, -1]   # keep the resulting difference
)
If the lists are guaranteed to be sorted, we can simply subtract the first value in each list from the last to get the duration, after converting to_datetime:
df['Duration'] = (
    # subtract the first value in each list from the last,
    # after converting each to datetime
    pd.to_datetime(df['Dates'].str[-1]) - pd.to_datetime(df['Dates'].str[0])
)
Both options produce df:
Item_ID Dates Duration
0 Item_1 [2020-06-01, 2020-06-02, 2020-07-05] 34 days
1 Item_2 [2018-04-15, 2018-04-22] 7 days
2 Item_3 [2016-02-15, 2016-02-22, 2016-03-05, 2016-04-01] 46 days
There are many ways.
Option 1: Keep it NumPy, as a one-liner
import numpy as np

df['Lapse'] = df.agg(lambda x: np.ptp(np.array(x['Dates'], dtype='datetime64')), axis=1)
Option 2: Go the long way
Explode
Coerce Dates to datetime
Find the difference of the extremes using np.ptp
df = df.explode('Dates')
df['Dates'] = pd.to_datetime(df['Dates'], format='%Y-%m-%d')
df.groupby('Item_ID').agg(lapse=('Dates', np.ptp), Dates=('Dates', list))
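Note: depending on your pandas version, passing np.ptp straight to .agg may warn or fail (Series.ptp itself was removed in pandas 1.0), so a plain lambda is a safe substitute; a sketch against the same exploded frame:
df.groupby('Item_ID').agg(
    lapse=('Dates', lambda s: s.max() - s.min()),  # peak-to-peak without np.ptp
    Dates=('Dates', list),
)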
I have a csv that looks like this
valid,value
2004-07-21 09:00:00,200
2004-07-21 10:00:00,200
2004-07-21 11:00:00,150
I must set the valid column as index like this
import pandas as pd
df = pd.read_csv('test.csv')
df['valid'] = pd.to_datetime(df['valid']) # convert 'valid' column to pd.datetime objects
df = df.set_index('valid') # set the 'valid' as index
However, I would still like to be able to access the data by integer row position too, like this:
for row_index in range(len(df)):  # I know iterating over a df is not advisable
    print(df.at[row_index])
But I get an error: ValueError: At based indexing on an non-integer index can only have non-integer indexers
I definitely need to keep the valid column as the index. But how can I also print a row given its integer position?
Change selecting by label:
print(df.at[row_index])
to selecting by position, looking up the integer location of the value column for DataFrame.iat:
for row_index in range(len(df)):  # I know iterating over a df is not advisable
    # convert the name of the value column to its integer position
    print(df.iat[row_index, df.columns.get_loc('value')])
    # or select the first column directly by position 0:
    # print(df.iat[row_index, 0])
200
200
150
Or use DataFrame.iloc:
for row_index in range(len(df)):  # I know iterating over a df is not advisable
    print(df.iloc[row_index])
value 200
Name: 2004-07-21 09:00:00, dtype: int64
value 200
Name: 2004-07-21 10:00:00, dtype: int64
value 150
Name: 2004-07-21 11:00:00, dtype: int64
The difference: iat is faster but returns only the scalar value at the intersection of the row and column, while iloc is more general and here returns all the columns as a Series, whose name is the index value and whose index holds the column names.
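If the loop only needs to read values, itertuples avoids the positional lookups altogether and still exposes the datetime index; a sketch against the same df, where Index is the 'valid' timestamp:
for row in df.itertuples():
    print(row.Index, row.value)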
I have attached a photo of how the data is formatted when I print the df in Jupyter; please check that for reference.
I set the DATE column as the index, checked the data type of the index, and converted the index to a datetime index.
import pandas as pd
df = pd.read_csv('UMTMVS.csv', index_col='DATE', parse_dates=True)
df.index = pd.to_datetime(df.index)
I need to print out percent increase in value from Month/Year to Month/Year and percent decrease in value from Month/Year to Month/Year.
The first correction pertains to how you read your DataFrame: when passing parse_dates, you should give a list of columns to be parsed as dates. So this instruction should be changed to:
df = pd.read_csv('UMTMVS.csv', index_col='DATE', parse_dates=['DATE'])
and then the second instruction is not needed.
To find the percent change in the UMTMVS column, use df.UMTMVS.pct_change().
For your data the result is:
DATE
1992-01-01 NaN
1992-02-01 0.110968
1992-03-01 0.073036
1992-04-01 -0.040080
1992-05-01 0.014875
1992-06-01 -0.330455
1992-07-01 0.368293
1992-08-01 0.078386
1992-09-01 0.082884
1992-10-01 -0.030528
1992-11-01 -0.027791
Name: UMTMVS, dtype: float64
You may want to multiply it by 100 to get true percentages.
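For example, a minimal sketch:
pct = df.UMTMVS.pct_change().mul(100).round(2)  # percent change as true percentages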
Please see the data here: screenshot from Google Colab.
I am trying to assign the time 19:00 (7 pm) to all records of the column "Beginn_Zeit". For now I put the float 19.00. I need to convert it to a time format so that I can merge it with the date from the column "Beginn_Datum". Once I have this merged column, I need to copy its value into all records of a different column, "Delta2", that are NaT.
dfd['Beginn'] = pd.to_datetime(dfd['Beginn'], dayfirst=True)
dfd['Ende'] = pd.to_datetime(dfd['Ende'], dayfirst=True)
dfd['Delta2'] = dfd['Ende'] - dfd['Beginn']
dfd.Ende.fillna(dfd.Beginn, inplace=True)
dfd['Beginn_Datum'] = dfd['Beginn'].dt.date
dfd['Beginn_Zeit'] = 19.00
Assign a time object, combine it with the date column, and use the result to fill the NaT values in Delta2:
from datetime import time, datetime
dfd['Beginn_Zeit'] = time(19,0)
# create new column combining date and time
new_col = dfd.apply(lambda row: datetime.combine(row['Beginn_Datum'], row['Beginn_Zeit']), axis=1)
# replace null values in Delta2 with new combined dates
dfd.loc[dfd['Delta2'].isnull(), 'Delta2'] = new_col
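If there are many rows, the row-wise apply can be slow. A vectorized alternative (a sketch, not part of the original answer, assuming Beginn_Datum holds plain dates as produced above) adds a fixed 19-hour offset instead:
new_col = pd.to_datetime(dfd['Beginn_Datum']) + pd.Timedelta(hours=19)
dfd.loc[dfd['Delta2'].isnull(), 'Delta2'] = new_col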
New to pandas.
I have a DataFrame with columns A B C Date1 Date2 D and multiple rows of values. I want to divide it into multiple DataFrames based on quarters, i.e. (Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec), using only the Date1 column values. I tried the following so far:
data_q = data.groupby(pandas.TimeGrouper(freq = '3M'))
The dates are in the form 2009-11-03.
There are a few ways to do this.
First, I would ensure that the Date1 column is a datetime type by checking its .dtype attribute.
e.g. df['Date1'].dtype
If it's not, cast it to datetime using:
df.Date1 = pd.to_datetime(df.Date1)
Add a quarters column for eventual data frame slicing:
df['quarters'] = df.Date1.dt.quarter
Create your data frames:
q1 = df[df.quarters == 1]
q2 = df[df.quarters == 2]
q3 = df[df.quarters == 3]
q4 = df[df.quarters == 4]
So the approach that appears easiest to me is to convert Date1 to your index, then groupby on the quarter.
df2 = df.set_index('Date1')
quardfs = list(df2.groupby(df2.index.quarter))
This will leave you with quardfs, a list of (quarter, DataFrame) pairs.
If you don't want to set Date1 to an index, you can also copy it out of the DataFrame and use it:
quars = pd.DatetimeIndex(df['Date1']).quarter
quardfs = list(df.groupby(quars))
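Either way, the (quarter, DataFrame) pairs are easy to unpack, for example into a dict keyed by quarter number (a sketch):
by_quarter = dict(quardfs)
q1 = by_quarter.get(1)  # rows with Date1 in Jan-Mar, or None if that quarter is absent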