Displaying only the intersection of date range rows in pandas - python

Following from here
import pandas as pd
data = {'date': ['1998-03-01 00:00:01', '2001-04-01 00:00:01', '1998-06-01 00:00:01',
                 '2001-08-01 00:00:01', '2001-05-03 00:00:01', '1994-03-01 00:00:01'],
        'node1': [1, 1, 2, 2, 3, 2],
        'node2': [8, 316, 26, 35, 44, 56],
        'weight': [1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data, columns=['date', 'node1', 'node2', 'weight'])
df['date'] = pd.to_datetime(df['date'])
mask = df.groupby('node1').apply(lambda x: x['date'].dt.year.isin([1998, 1999, 2000]).any())
mask2 = df.groupby('node1').apply(lambda x: x['date'].dt.year.isin([2001, 2002, 2003]).any())
print(df[df['node1'].isin(mask[mask & mask2].index)])
The output I need is the nodes that appear in both year ranges (1998-2000 and 2001-2003), and only the rows that actually fall within those ranges.
Expected Output-
node1 node2 date
1 8 1998-03-01
1 316 2001-04-01
2 26 1998-06-01
2 35 2001-08-01
Right now this code also prints the row 2 56 1994-03-01.

One simple solution is to first remove the dates that fall in neither date range, then apply the masks, i.e.
l1 = [1998,1999,2000]
l2 = [2001,2002,2003]
ndf = df[df['date'].dt.year.isin(l1+l2)]
After getting the ndf:
Option 1: You can go for dual groupby mask based approach i.e
mask = ndf.groupby('node1').apply(lambda x : (x['date'].dt.year.isin(l1)).any())
mask2 = ndf.groupby('node1').apply(lambda x : (x['date'].dt.year.isin(l2)).any())
new = ndf[ndf['node1'].isin(mask[mask & mask2].index)]
Thanks to @Zero.
Option 2: You can go for groupby transform
new = ndf[ndf.groupby('node1')['date'].transform(lambda x: x.dt.year.isin(l1).any() & x.dt.year.isin(l2).any())]
Option 3: groupby filter
new = ndf.groupby('node1').filter(lambda x: x['date'].dt.year.isin(l1).any() & x['date'].dt.year.isin(l2).any())
Output:
date node1 node2 weight
0 1998-03-01 00:00:01 1 8 1
1 2001-04-01 00:00:01 1 316 1
2 1998-06-01 00:00:01 2 26 1
3 2001-08-01 00:00:01 2 35 1
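An equivalent vectorized sketch of the same idea (using a hypothetical `bucket` helper column, not part of the original answers): label each remaining row with the range it falls in, then keep the nodes that carry both labels.

```python
import pandas as pd
import numpy as np

data = {'date': ['1998-03-01', '2001-04-01', '1998-06-01',
                 '2001-08-01', '2001-05-03', '1994-03-01'],
        'node1': [1, 1, 2, 2, 3, 2],
        'node2': [8, 316, 26, 35, 44, 56],
        'weight': [1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])

l1 = [1998, 1999, 2000]
l2 = [2001, 2002, 2003]

# Keep only rows whose year falls in either range
ndf = df[df['date'].dt.year.isin(l1 + l2)].copy()

# Label each row with its range; a node kept must show both labels
ndf['bucket'] = np.where(ndf['date'].dt.year.isin(l1), 'first', 'second')
new = ndf[ndf.groupby('node1')['bucket'].transform('nunique') == 2]
print(new.drop(columns='bucket'))
```

This avoids the per-group lambda entirely, which tends to matter on large frames.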

Related

check if column is blank in pandas dataframe

I have the following CSV file:
A|B|C
1100|8718|2021-11-21
1104|21|
I want to create a dataframe that gives me the date output as follows:
A B C
0 1100 8718 20211121000000
1 1104 21 ""
This means:
if C is empty:
    put double quotes
else:
    format the date as yyyymmddhhmmss (appending zeros for hhmmss)
My code:
df['C'] = np.where(df['C'].empty, df['C'].str.replace('', '""'), df['C'] + '000000')
but it gives me this:
A B C
0 1100 8718 2021-11-21
1 1104 21 0
I have tried another piece of code:
if df['C'].empty:
    df['C'] = df['C'].str.replace('', '""')
else:
    df['C'] = df['C'].str.replace('-', '') + '000000'
OUTPUT:
A B C
0 1100 8718 20211121000000
1 1104 21 0000000
Use dt.strftime:
df = pd.read_csv('data.csv', sep='|', parse_dates=['C'])
df['C'] = df['C'].dt.strftime('%Y%m%d%H%M%S').fillna('""')
print(df)
# Output:
A B C
0 1100 8718 20211121000000
1 1104 21 ""
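For completeness, here is a self-contained version of the same idea, reading the sample data from a string instead of `data.csv`:

```python
import pandas as pd
from io import StringIO

csv = "A|B|C\n1100|8718|2021-11-21\n1104|21|\n"
df = pd.read_csv(StringIO(csv), sep='|', parse_dates=['C'])

# Empty C cells parse as NaT; strftime turns them into NaN,
# which fillna then replaces with the literal double quotes
df['C'] = df['C'].dt.strftime('%Y%m%d%H%M%S').fillna('""')
print(df)
```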
A good way would be to convert the column into datetime using pd.to_datetime with the parameter errors='coerce', and then drop the NaT values.
import pandas as pd
x = pd.DataFrame({
    'one': 20211121000000,
    'two': 'not true',
    'three': '20211230'
}, index=[1])
x.apply(lambda x: pd.to_datetime(x, errors='coerce')).T.dropna()
# Output:
1
one 1970-01-01 05:36:51.121
three 2021-12-30 00:00:00.000

Python Dataframes: Filter a dataframe according to groupby condition

Hi, I have a dataframe like the one below:
ID date
1 01.01.2017
1 01.01.2017
1 01.04.2017
2 01.01.2017
2 01.01.2017
2 01.02.2017
I want to keep the IDs for which the difference between the minimum and maximum date is 3 days. The final dataframe should look like this, since only ID 1 matches the condition:
ID date
1 01.01.2017
1 01.01.2017
1 01.04.2017
Thank you.
You can use GroupBy.filter with a custom lambda function to check whether the difference between the maximum and the minimum date is 3 days:
import datetime
d = datetime.timedelta(days=3)
df.groupby('ID').date.filter(lambda x: (x.max() - x.min()) == d)
ID
1 2017-01-01
1 2017-01-01
1 2017-01-04
Name: date, dtype: datetime64[ns]
You can create a mask and then use it as a filter:
import pandas as pd
# create sample data-frame
data = [[1, '01.01.2017'], [1, '01.01.2017'], [1, '01.04.2017'],
        [2, '01.01.2017'], [2, '01.01.2017'], [2, '01.02.2017']]
df = pd.DataFrame(data=data, columns=['id', 'date'])
df['date'] = pd.to_datetime(df.date)
# create mask
mask = df.groupby('id')['date'].transform(lambda x: (x.max() - x.min()).days == 3)
# filter
result = df[mask]
print(result)
Output
id date
0 1 2017-01-01
1 1 2017-01-01
2 1 2017-01-04
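A fully vectorized variant (an alternative sketch, not taken from either answer) avoids the per-group lambda by computing the span with two built-in `transform` calls:

```python
import pandas as pd

data = [[1, '01.01.2017'], [1, '01.01.2017'], [1, '01.04.2017'],
        [2, '01.01.2017'], [2, '01.01.2017'], [2, '01.02.2017']]
df = pd.DataFrame(data, columns=['id', 'date'])
df['date'] = pd.to_datetime(df['date'], format='%m.%d.%Y')

# Span between the latest and earliest date within each id
span = (df.groupby('id')['date'].transform('max')
        - df.groupby('id')['date'].transform('min'))
result = df[span.dt.days == 3]
print(result)
```

Built-in aggregations like 'max'/'min' dispatch to optimized code paths, so this usually scales better than a Python lambda.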

Python pandas dataframes - transforming 2 columns with date ranges - into monthly columns for each month

I am new to Python and starting to use pandas to replace some processes done in MS Excel.
Below is my problem description:
Initial dataframe:
Contract Id, Start date, End date
12378, '01-01-2018', '15-05-2018'
45679, '10-03-2018', '31-07-2018'
567982, '01-01-2018', '31-12-2020'
Expected output
Contract Id , Start date, End date, Jan-18,Feb-18,Mar-18,Apr-18,May-18...Dec-18
12378, '01-01-2018', '15-05-2018', 1, 1, 1, 1, 1, 0, 0, 0, 0, .....,0
45679, '10-03-2018', '31-07-2018', 0, 0, 1, 1, 1, 1, 1, 0, 0, 0....,0
567982,'01-01-2018', '31-12-2020', 1, 1, 1, 1.........………..., 1, 1, 1
I would like to create a set of new columns with Month Id as column headers and populate them with a flag (1 or 0) if the contract is active during the specified month.
Any help would be highly appreciated. Thank you.
I am also new to pandas. Maybe there is a better method, but here is what I have:
data['S_month'] = data['S'].apply(lambda x: int(x.split('-')[1]))
data['E_month'] = data['E'].apply(lambda x: int(x.split('-')[1]))
months = []
for s_e in data[['S_month', 'E_month']].values:
    month = np.zeros(12)
    month[s_e[0] - 1:s_e[1]] = 1
    months.append(month)
months = pd.DataFrame(months, dtype=int, columns=np.arange(1, 13))
data.join(months)
Or you could skip the first two lines and do this:
months = []
for s_e in data[['S', 'E']].values:
    month = np.zeros(12)
    month[int(s_e[0].split('-')[1]) - 1:int(s_e[1].split('-')[1])] = 1
    months.append(month)
months = pd.DataFrame(months, dtype=int, columns=np.arange(1, 13))
data.join(months)
This approach uses the very rich date functionality in pandas, specifically the PeriodIndex
import pandas as pd
import numpy as np
from io import StringIO
# Sample data (simplified)
df1 = pd.read_csv(StringIO("""
'Contract Id','Start date','End date'
12378,'01-02-2018','15-03-2018'
45679,'10-03-2018','31-05-2018'
567982,'01-01-2018','30-06-2018'
"""), quotechar="'", dayfirst=True, parse_dates=[1,2])
# Establish the month dates as a pandas PeriodIndex, using month end
dates = pd.period_range(df1['Start date'].min(), df1['End date'].max(), freq="M")
# Create a new dataframe of date matches with apply:
# match the start and end dates to the periods using Period comparisons,
# then AND the results elementwise with np.logical_and
data = df1.apply(lambda r: pd.Series(np.logical_and(r[1] <= dates, r[2] >= dates).astype(int)), axis=1)
# Data frame with named month columns
df2 = pd.DataFrame(data)
df2.columns = dates
# Concatenate
result = pd.concat([df1, df2], axis=1)
result
# Contract Id Start date End date 2018-01 2018-02 2018-03 2018-04 2018-05 2018-06
#0 12378 2018-02-01 2018-03-15 0 1 1 0 0 0
#1 45679 2018-03-10 2018-05-31 0 0 1 1 1 0
#2 567982 2018-01-01 2018-06-30 1 1 1 1 1 1
Pandas comes with a lot of built-in date/time handling methods that can be put to good use here.
# SETUP
# -----
import pandas as pd
# Initialize input dataframe
data = [
    [12378, '01-01-2018', '15-05-2018'],
    [45679, '10-03-2018', '31-07-2018'],
    [567982, '01-01-2018', '31-12-2020'],
]
columns = ['Contract Id', 'Start date', 'End date']
df = pd.DataFrame(data, columns=columns)
# SOLUTION
# --------
# Convert strings to datetime objects
df['Start date'] = pd.to_datetime(df['Start date'], format='%d-%m-%Y')
df['End date'] = pd.to_datetime(df['End date'], format='%d-%m-%Y')
# For each month in year 2018 ...
# (note: this compares month numbers only, so it is only safe when the
# flag columns all belong to a single calendar year)
for x in pd.date_range('2018-01', '2018-12', freq='MS'):
    # Create a column with contract-active flags
    df[x.strftime("%b-%y")] = (df['Start date'].dt.month <= x.month) & (x.month <= df['End date'].dt.month)
    # Optional: convert True/False values to 0/1 values
    df[x.strftime("%b-%y")] = df[x.strftime("%b-%y")].astype(int)
which gives as result:
In [1]: df
Out[1]:
Contract Id Start date End date Jan-18 Feb-18 Mar-18 Apr-18 May-18 Jun-18 Jul-18 Aug-18 Sep-18 Oct-18 Nov-18 Dec-18
0 12378 2018-01-01 2018-05-15 1 1 1 1 1 0 0 0 0 0 0 0
1 45679 2018-03-10 2018-07-31 0 0 1 1 1 1 1 0 0 0 0 0
2 567982 2018-01-01 2020-12-31 1 1 1 1 1 1 1 1 1 1 1 1

How to groupby on a column while doing sort on another column?

I have a df:
date amount code id
2018-01-01 50 12 1
2018-02-03 100 12 1
2017-12-30 1 13 2
2017-11-30 2 14 2
I want to group by id so that, within each group, the dates are sorted in ascending or descending order, which lets me do the following:
grouped = df.groupby('id')
a = np.where(grouped['code'].transform('nunique') == 1, 20, 0)
b = np.where(grouped['amount'].transform('max') > 100, 20, 0)
c = np.where(grouped['date'].transform(lambda x: x.diff().dropna().sum()).dt.days < 5, 30, 0)
You can sort the data within each group by using apply and sort_values:
grouped = df.groupby('id').apply(lambda g: g.sort_values('date', ascending=True))
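A cheaper equivalent (a sketch on made-up data matching the question) is to sort the whole frame once before grouping; groupby preserves the row order within each group:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-01', '2018-02-03',
                            '2017-12-30', '2017-11-30']),
    'amount': [50, 100, 1, 2],
    'code': [12, 12, 13, 14],
    'id': [1, 1, 2, 2]})

# One global sort; each group then sees its dates in ascending order
grouped = df.sort_values(['id', 'date']).groupby('id')
print(grouped['date'].first())
```

This also keeps the index flat, whereas the apply approach produces a MultiIndex.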
Adding to the previous answer, if you wish the indexes to remain as they were, you might consider the following:
import pandas as pd
df = {'a':[1,2,3,0,5], 'b':[2,2,3,2,5], 'c':[22,11,11,42,12]}
df = pd.DataFrame(df)
e = (df.groupby(['c','b', 'a']).size()).reset_index()
e = e[['a', 'b', 'c']]
e = e.sort_values(['c','a'])
print(e)

How to combine two pandas dataframes value by value

I have 2 dataframes: players (which only has PlayerID) and dates (which only has Date). I want a new dataframe containing every date for each player. In my case, the players df has about 2600 rows and the dates df has 1100 rows. I used 2 for loops to do this, but it is really slow. Is there a way to do it faster with some function? Thanks.
My loop:
player_elo = pd.DataFrame(columns=['PlayerID', 'Date'])
for row in players.itertuples():
    idx = row.Index
    pl = players.at[idx, 'PlayerID']
    for i in dates.itertuples():
        idd = i.Index
        dt = dates.at[idd, 0]
        new = {'PlayerID': [pl], 'Date': [dt]}
        new = pd.DataFrame(new)
        player_elo = player_elo.append(new)
If you have a key that is repeated for each df, you can come up with the cartesian product you are looking for using pd.merge().
import pandas as pd
players = pd.DataFrame([['A'], ['B'], ['C']], columns=['PlayerID'])
dates = pd.DataFrame([['12/12/2012'],['12/13/2012'],['12/14/2012']], columns=['Date'])
dates['Date'] = pd.to_datetime(dates['Date'])
players['key'] = 1
dates['key'] = 1
print(pd.merge(players, dates,on='key')[['PlayerID', 'Date']])
Output
PlayerID Date
0 A 2012-12-12
1 A 2012-12-13
2 A 2012-12-14
3 B 2012-12-12
4 B 2012-12-13
5 B 2012-12-14
6 C 2012-12-12
7 C 2012-12-13
8 C 2012-12-14
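On pandas 1.2+ the dummy key can be dropped entirely: merge supports how='cross' for exactly this cartesian product. A minimal sketch:

```python
import pandas as pd

players = pd.DataFrame({'PlayerID': ['A', 'B', 'C']})
dates = pd.DataFrame({'Date': pd.to_datetime(['2012-12-12', '2012-12-13'])})

# Cartesian product without a helper column (pandas >= 1.2)
player_elo = players.merge(dates, how='cross')
print(player_elo)
```

With ~2600 players and ~1100 dates this builds the full ~2.9M-row frame in one vectorized call instead of millions of appends.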
