"Dynamic" column selection - python

The problem:
The input table is, say, a merged table of calls and bills, with a TIME column for the call and one column per bill month. The idea is to produce a table that holds the last 3 bills the person paid, counting back from the time of the call, thereby putting the bills in the context of the call.
Example input and output:
# INPUT:
# df
#        TIME  ID  2019-08-01  2019-09-01  2019-10-01  2019-11-01  2019-12-01
#  2019-12-01   1           1           2           3           4           5
#  2019-11-01   2           6           7           8           9          10
#  2019-10-01   3          11          12          13          14          15
# EXPECTED OUTPUT:
# df_context
#        TIME  ID   0   1   2
#  2019-12-01   1   3   4   5
#  2019-11-01   2   7   8   9
#  2019-10-01   3  11  12  13
Example input creation:
import pandas as pd

df = pd.DataFrame({
    'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
    'ID': [1,2,3],
    '2019-08-01': [1,6,11],
    '2019-09-01': [2,7,12],
    '2019-10-01': [3,8,13],
    '2019-11-01': [4,9,14],
    '2019-12-01': [5,10,15],
})
The code I have so far:
# HOW DOES ONE GET THE col_to FOR EVERY ROW?
col_to = df.columns.get_loc(df['TIME'].astype(str).values[0])
col_from = col_to - 3
df_context = pd.DataFrame()
df_context = df_context.append(pd.DataFrame(df.iloc[:, col_from : col_to].values))
df_context["TIME"] = df["TIME"]
cols = df_context.columns.tolist()
df_context = df_context[cols[-1:] + cols[:-1]]
df_context.head()
Output of my code:
#          TIME   0   1   2
# 0  2019-12-01   2   3   4   <- should be 3 4 5
# 1  2019-11-01   7   8   9   <- all good
# 2  2019-10-01  12  13  14   <- should be 11 12 13
What my code seems to lack is a for loop or two around the first two lines to do what I want, but I just can't believe there isn't a better solution than the one I am concocting right now.

I would suggest the following steps so that you can avoid dynamic column selection altogether:
1. Convert the wide table (reference date as columns) to a long table (reference date as rows).
2. Compute the difference in months between the time of the call (TIME) and the reference date.
3. Keep only the rows with a difference >= 0 and < 3.
4. Format the output table (add a running number, pivot it) according to your requirements.
# Initialize dataframe
df = pd.DataFrame({
    'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
    'ID': [1,2,3],
    '2019-08-01': [1,6,11],
    '2019-09-01': [2,7,12],
    '2019-10-01': [3,8,13],
    '2019-11-01': [4,9,14],
    '2019-12-01': [5,10,15],
})
# Convert the wide table to a long table by melting the date columns
# Name the new date column as REF_TIME, and the bill column as BILL
date_cols = ['2019-08-01', '2019-09-01', '2019-10-01', '2019-11-01', '2019-12-01']
df = df.melt(id_vars=['TIME','ID'], value_vars=date_cols, var_name='REF_TIME', value_name='BILL')
# Convert TIME and REF_TIME to datetime type
df['TIME'] = pd.to_datetime(df['TIME'])
df['REF_TIME'] = pd.to_datetime(df['REF_TIME'])
# Find out difference between TIME and REF_TIME
df['TIME_DIFF'] = (df['TIME'] - df['REF_TIME']).dt.days
df['TIME_DIFF'] = (df['TIME_DIFF'] / 30).round()
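# Note: dividing by 30 only approximates a calendar-month difference. An exact
# alternative (a sketch, not part of the original answer) would compare month
# Periods instead:
#   diff = df['TIME'].dt.to_period('M') - df['REF_TIME'].dt.to_period('M')
#   df['TIME_DIFF'] = [d.n for d in diff]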
# Keep only the preceding 3 months (including the month = TIME)
selection = (
    (df['TIME_DIFF'] < 3) &
    (df['TIME_DIFF'] >= 0)
)
# Apply selection, sort the columns and keep only columns needed
df_out = (
    df[selection]
    .sort_values(['TIME','ID','REF_TIME'])
    [['TIME','ID','BILL']]
)
# Add a running number; let's call it BILL_NO
df_out = df_out.assign(BILL_NO = df_out.groupby(['TIME','ID']).cumcount() + 1)
# Pivot the output table to the format needed
df_out = df_out.pivot(index=['ID','TIME'], columns='BILL_NO', values='BILL')
Output:
BILL_NO        1   2   3
ID TIME
1  2019-12-01  3   4   5
2  2019-11-01  7   8   9
3  2019-10-01  11  12  13
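If you need the exact layout of the expected output (columns 0, 1, 2 with TIME and ID as regular columns), a small finishing sketch (not part of the original answer):
df_out = (df_out
          .rename(columns={1: 0, 2: 1, 3: 2})    # relabel BILL_NO 1..3 as 0..2
          .reset_index()[['TIME', 'ID', 0, 1, 2]])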

Here is my (newbie's) solution; it will only work if the dates in the column names are in ascending order:
# Initializing Dataframe
import pandas as pd

df = pd.DataFrame({
    'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
    'ID': [1,2,3],
    '2019-08-01': [1,6,11],
    '2019-09-01': [2,7,12],
    '2019-10-01': [3,8,13],
    '2019-11-01': [4,9,14],
    '2019-12-01': [5,10,15],
})
cols = list(df.columns)
new_df = pd.DataFrame([], columns=["0","1","2"])

# Iterate over rows, select the desired slices and append them to a new DataFrame:
for i in range(len(df)):
    searched_date = df.iloc[i, 0]
    searched_column_index = cols.index(searched_date)
    searched_row = df.iloc[[i], searched_column_index-2 : searched_column_index+1]
    mapping_column_names = {searched_row.columns[0]: "0",
                            searched_row.columns[1]: "1",
                            searched_row.columns[2]: "2"}
    searched_df = searched_row.rename(mapping_column_names, axis=1)
    new_df = pd.concat([new_df, searched_df], ignore_index=True)

new_df = pd.merge(df.iloc[:, 0:2], new_df, left_index=True, right_index=True)
new_df
Output:
         TIME  ID   0   1   2
0  2019-12-01   1   3   4   5
1  2019-11-01   2   7   8   9
2  2019-10-01   3  11  12  13
Anyway, I think @Toukenize's solution is better since it doesn't require iterating.
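For reference, the dynamic selection can also be vectorized with NumPy, keeping the wide format and avoiding the Python loop. A sketch, assuming (as above) that every TIME value appears among the ascending date columns with at least two columns to its left:
import numpy as np

date_cols = ['2019-08-01', '2019-09-01', '2019-10-01', '2019-11-01', '2019-12-01']
vals = df[date_cols].to_numpy()
# column position of each row's TIME within the date columns
col_idx = np.array([date_cols.index(t) for t in df['TIME']])
# offsets -2, -1, 0 pick the last 3 bills ending at the call month
idx = col_idx[:, None] + np.arange(-2, 1)[None, :]
out = np.take_along_axis(vals, idx, axis=1)
df_context = pd.DataFrame(out, columns=[0, 1, 2])
df_context.insert(0, 'ID', df['ID'].to_numpy())
df_context.insert(0, 'TIME', df['TIME'].to_numpy())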

Related

pandas selecting rows for specific time period

I have a pandas dataframe with a date index, like this:
            A  B  C
date
2021-04-22  2  1  3
2021-05-22  3  2  4
2021-06-22  4  3  5
2021-07-22  5  4  6
2021-08-22  6  5  7
I want to create a new dataframe that selects only the rows for the 2 days previous to a given date. So, for example, if I give selected = '2021-08-22', what I need is a new dataframe like the one below:
            A  B  C
date
2021-07-22  5  4  6
2021-08-22  6  5  7
Can someone please help me with this? Many thanks for your help.
You can convert the index to a DatetimeIndex, then use df[start_date : end_date]:
df.index = pd.to_datetime(df.index)
selected = '2021-08-22'
res = df[(pd.to_datetime(selected)-pd.Timedelta(days=2)) : selected]
print(res)
            A  B  C
date
2021-08-22  6  5  7
I'm assuming that you meant months instead of days.
You can use the df.apply method to filter the dataframe rows with a function.
Here is a function that receives the inputs you described and returns the new dataframe:
Working example
from datetime import datetime

def filter_df(df, date, num_months):
    # assumes 'date' is a regular column; call df.reset_index() first if it is the index
    def diff_month(row):
        date1 = datetime.strptime(row["date"], '%Y-%m-%d')
        date2 = datetime.strptime(date, '%Y-%m-%d')
        return (date1.year - date2.year) * 12 + date1.month - date2.month
    return df[df.apply(diff_month, axis=1) > -num_months]

print(filter_df(df, "2021-08-22", 2))
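For larger frames, a loop-free variant of the same month arithmetic (a sketch, assuming the dates sit in a DatetimeIndex as in the question; unlike filter_df it also excludes dates after the selected one):
df.index = pd.to_datetime(df.index)
selected = pd.Timestamp('2021-08-22')
num_months = 2
# whole-month difference between each row and the selected date
diff = (df.index.year - selected.year) * 12 + (df.index.month - selected.month)
res = df[(diff > -num_months) & (diff <= 0)]
print(res)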

Set particular value for a month of data based on column in dataframe

I have a dataframe made up of daily data across a number of columns:
             A   B  C  D
01/01/2020  12   3  2  1
02/01/2020   8  14  5  1
03/01/2020  45   4  1  3
...
31/12/2021   5   1  5  3
The data is generated automatically, but I would like to be able to overwrite data by month or by date.
I understand something like this could reset a single value, but is there any way to do it in bulk, by month or between two certain dates?
df.set_value('C', 'x', 10)
Any help much appreciated!
Create a DatetimeIndex first and then set values with DataFrame.loc; partial string indexing also works here for setting the values of a whole month:
df.index = pd.to_datetime(df.index, dayfirst=True)
df.loc['2020-01-02','C'] = 100
df.loc['2020-01','B'] = 500
df.loc['2020-01-01':'2020-01-02','A'] = 0
#select multiple columns by list
df.loc['2020-01-03':'2021-12-31', ['C','D']] = 1000
print (df)
             A    B     C     D
2020-01-01   0  500     2     1
2020-01-02   0  500   100     1
2020-01-03  45  500  1000  1000
2021-12-31   5    1  1000  1000
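If "by month" means a calendar month across every year in the index (rather than one specific month), a boolean mask over df.index.month is a small sketch of the same idea:
# overwrite column 'B' for every January in the index
df.loc[df.index.month == 1, 'B'] = 500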

How to merge two datasets based on conditions

I'm attempting to merge two datasets in Python based on 3 conditions: the rows have to have the same Longitude, Latitude and month of a specific year. One dataset has about 16k rows and the other about 1.7k.
A simple example of the inputs and expected output is as follows:
>df1
 long  lat        date proximity
    5    8  23/06/2009      Near
    6   10  05/10/2012       Far
    8    6  19/02/2010      Near
    3    4  30/04/2014      Near
    5    8  01/06/2009       Far
>df2
 long  lat        date  mine
    5    8  10/06/2009     1
    8    6  24/02/2010     0
    7    2  19/04/2014     1
    3    4  30/04/2013     1
If any condition is not met, the merged value in "mine" should be 0. How would I merge to get:
 long  lat        date proximity  mine
    5    8  23/06/2009      Near     1
    6   10  05/10/2012       Far     0
    8    6  19/02/2010      Near     0
    3    4  30/04/2014      Near     0
    5    8  01/06/2009       Far     1
The date column is not necessary in the output if that makes it easier.
Here you go:
df1['year-month'] = pd.to_datetime(df1['date'], format='%d/%m/%Y').dt.strftime('%Y/%m')
df2['year-month'] = pd.to_datetime(df2['date'], format='%d/%m/%Y').dt.strftime('%Y/%m')
joined = df1.merge(df2,
                   how='left',
                   on=['long', 'lat', 'year-month'],
                   suffixes=['', '_r']).drop(columns=['date_r', 'year-month'])
joined['mine'] = joined['mine'].fillna(0).astype(int)
print(joined)
Output
   long  lat        date proximity  mine
0     5    8  23/06/2009      Near     1
1     6   10  05/10/2012       Far     0
2     8    6  19/02/2010      Near     0
3     3    4  30/04/2014      Near     0
4     5    8  01/06/2009       Far     1
First extract the month and year from the date column and assign them to a temporary column mon-year. Then use DataFrame.merge to left-merge the dataframes df1 and df2 on long, lat and mon-year, use Series.fillna to fill the NaN values in the mine column with 0, and finally use DataFrame.drop to drop the temporary column mon-year:
df1['mon-year'] = df1['date'].str.extract(r'/(.*)')
df2['mon-year'] = df2['date'].str.extract(r'/(.*)')
# OR we can use pd.to_datetime,
# df1['mon-year'] = pd.to_datetime(df1['date'], format='%d/%m/%Y').dt.strftime('%m-%Y')
# df2['mon-year'] = pd.to_datetime(df2['date'], format='%d/%m/%Y').dt.strftime('%m-%Y')
df3 = df1.merge(
    df2.drop('date', axis=1),
    on=['long', 'lat', 'mon-year'], how='left').drop('mon-year', axis=1)
df3['mine'] = df3['mine'].fillna(0)
Result:
# print(df3)
   long  lat        date proximity  mine
0     5    8  23/06/2009      Near   1.0
1     6   10  05/10/2012       Far   0.0
2     8    6  19/02/2010      Near   0.0
3     3    4  30/04/2014      Near   0.0
4     5    8  01/06/2009       Far   1.0
You could merge using multiple keys as follows:
df_1.merge(df_2, how='left', on=['long', 'lat', 'date'])
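Note that this only matches rows whose full dates are equal; to honour the same-month condition you would still need a month key first, e.g. (a sketch along the lines of the answers above):
df_1['ym'] = pd.to_datetime(df_1['date'], format='%d/%m/%Y').dt.to_period('M')
df_2['ym'] = pd.to_datetime(df_2['date'], format='%d/%m/%Y').dt.to_period('M')
joined = (df_1.merge(df_2.drop(columns='date'), how='left', on=['long', 'lat', 'ym'])
              .drop(columns='ym'))
joined['mine'] = joined['mine'].fillna(0).astype(int)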

Auto join a python dataframe to update it

I would like to perform an auto-join (self-join) on a pandas dataframe to update it.
Here is the situation: I have a first df with three columns,
In, Out and Date. It means that at a specific date the item "Out" is replaced by the item "In".
import pandas as pd
import numpy as np
from datetime import datetime
data = [[1,10,"2017-01-01"],[2,10,"2017-01-01"],[10,11,"2017-06-01"],[4,14,"2017-04-01"],[5,14,"2017-12-01"]]
label = ["Out","In","Date"]
df = pd.DataFrame(data,columns=label)
df['Date'] = pd.to_datetime(df['Date'])
print(df)
   Out  In       Date
0    1  10 2017-01-01
1    2  10 2017-01-01
2   10  11 2017-06-01
3    4  14 2017-04-01
4    5  14 2017-12-01
For example, it means here that as of the first of January 2017, item #1 is replaced by item #10.
The trick is that as of June 2017, this item #10 is itself replaced by item #11, so that #1 becomes #10, which becomes #11.
Now I would like to populate a final table that gives the final relationships up to a certain date.
If date = 2017-08-01, I would get this table:
date = pd.to_datetime("2017-08-01")
data = [[1,11],[2,11],[10,11],[4,14]]
df_final = pd.DataFrame(data,columns=["Out","In"])
print(df_final)
   Out  In
0    1  11
1    2  11
2   10  11
3    4  14
Would you know how to perform such an auto-join?
Thanks,
You can iterate over the rows with .iterrows and use .loc to follow each chain of replacements to its end:
import pandas as pd
import numpy as np
from datetime import datetime
data = [[1,10,"2017-01-01"],[2,10,"2017-01-01"],[10,11,"2017-06-01"],[4,14,"2017-04-01"],[5,14,"2017-12-01"],[11,18,"2017-12-01"]]
label = ["Out","In","Date"]
df = pd.DataFrame(data,columns=label)
df['Date'] = pd.to_datetime(df['Date'])
print(df)
   Out  In       Date
0    1  10 2017-01-01
1    2  10 2017-01-01
2   10  11 2017-06-01
3    4  14 2017-04-01
4    5  14 2017-12-01
5   11  18 2017-12-01
L = []
for row in df.iterrows():
    x = row[1]['Out']
    y = row[1]['In']
    while y in df.Out.values.tolist():
        y = df.loc[df['Out'] == y, 'In'].iloc[0]
    L.append((x, y))

df2 = pd.DataFrame(L, columns=['Out', 'In'])
print(df2)
   Out  In
0    1  18
1    2  18
2   10  18
3    4  14
4    5  14
5   11  18
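Note that the loop above ignores the Date column, so every replacement is applied regardless of when it happened. A sketch that honours the as-of date from the question (assuming each Out value appears at most once):
# keep only replacements that happened on or before the cut-off date
date = pd.to_datetime('2017-08-01')
sub = df[df['Date'] <= date]
mapping = dict(zip(sub['Out'], sub['In']))   # assumes unique Out values
resolved = []
for out, inn in mapping.items():
    # follow the chain of replacements to its end
    while inn in mapping:
        inn = mapping[inn]
    resolved.append((out, inn))
df_final = pd.DataFrame(resolved, columns=['Out', 'In'])
print(df_final)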

Pandas groupby to get an average day

I have a dataframe which is the result of reading a csv. It contains a datetime column and data related to an event. I need to calculate an average day with statistical data per 20 minutes; in the code below I use 'mean' as an example.
Edit:
My data are observations, which means that not all bins have data in them. But these zero counts do have to be taken into account when calculating the mean value: mean = count / #days.
This code works, but is this the way to go? It looks too complicated to me, and I wonder whether I really need a binID and can't simply group by time of day.
import pandas as pd

# Create dataframe
data = {'date': pd.date_range('2017-01-01 00:30:00', freq='10min', periods=282),
        'i/o': ['in', 'out'] * 141}
df = pd.DataFrame(data)
# Add ones
df['move'] = 1
# I did try:
# 1)
# df['time'] = df['date'].dt.time
# df.groupby(['i/o', pd.Grouper(key='time', freq='20min')])
# This failed with groupby, so should I use my own bins then???
# 2)
# Create 20 minutes bins
# df['binID'] = df['date'].dt.hour*3 + df['date'].dt.minute//20
# averageDay = df.groupby(['i/o', 'binID']).agg(['count', 'sum', 'mean'])
#
# Well, bins with zero moves aren't there,
# so 'mean' can't be used, nor can other functions that
# need the number of observations. Resample and reindex then???
# Resample
df2 = df.groupby(['i/o', pd.Grouper(key='date', freq='20min')]).agg('sum')
# Reindex and reset (for binID and groupby)
levels = [['in', 'out'],
          pd.date_range('2017-01-01 00:00:00', freq='20min', periods=144)]
newIndex = pd.MultiIndex.from_product(levels, names=['i/o', 'date'])
df2 = df2.reindex(newIndex, fill_value=0).reset_index()
# Create 20 minutes bins
df2['binID'] = df2['date'].dt.hour*3 + df2['date'].dt.minute//20
# Average day
averageDay2 = df2.groupby(['i/o', 'binID']).agg(['count', 'sum', 'mean'])
print(averageDay2)
IIUC:
In [124]: df.groupby(['i/o', df.date.dt.hour*3 + df.date.dt.minute//20]) \
             .agg(['count','sum','mean'])
Out[124]:
             move
            count sum mean
i/o date
in  0           1   1    1
    1           2   2    1
    2           2   2    1
    3           2   2    1
    4           2   2    1
    5           2   2    1
    6           2   2    1
    7           2   2    1
    8           2   2    1
    9           2   2    1
...           ...  ..  ...
out 62          2   2    1
    63          2   2    1
    64          2   2    1
    65          2   2    1
    66          2   2    1
    67          2   2    1
    68          2   2    1
    69          2   2    1
    70          2   2    1
    71          1   1    1

[144 rows x 3 columns]
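To fold the zero-count concern from the edit into this style, a sketch: reindex the per-bin totals so empty bins contribute zeros, then divide by the number of observed days (mean = count / #days):
# average day where empty 20-minute bins count as zero
n_days = df['date'].dt.normalize().nunique()
bin_id = df['date'].dt.hour * 3 + df['date'].dt.minute // 20
totals = df.groupby(['i/o', bin_id])['move'].sum()
# 72 bins of 20 minutes per day; make every (i/o, bin) pair explicit
full_index = pd.MultiIndex.from_product([['in', 'out'], range(72)], names=['i/o', 'date'])
average_day = totals.reindex(full_index, fill_value=0) / n_days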
