Related
I have a panda dataframe that has values like below. Though in real I am working with lot more columns and historical data
AUD USD JPY EUR
0 0.67 1 140 1.05
I want to iterate over columns to create dataframe with columns AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR and JPYEUR
where for eg AUDUSD is calculated as product of AUD column and USD colum
I tried below
for col in df:
for cols in df:
cf[col+cols]=df[col]*df[cols]
But it generates table with unneccessary values like AUDAUD, USDUSD or duplicate value like AUDUSD and USDAUD. I think if i can somehow set "cols =col+1 till end of df" in second for loop I should be able to resolve the issue. But i don't know how to do that ??
Result i am looking for is a table with below columns and their values
AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR, JPYEUR
You can use itertools.combinations with pandas.Series.mul and pandas.concat.
Try this :
from itertools import combinations
combos = list(combinations(df.columns, 2))
out = pd.concat([df[col[1]].mul(df[col[0]]) for col in combos], axis=1, keys=combos)
out.columns = out.columns.map("".join)
# Output :
print(out)
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
# Used input :
df = pd.DataFrame({'AUD': [0.67], 'USD': [1], 'JPY': [140], 'EUR': [1.05]})
I thought it intuitive that your first approach was to use an inner / outer loop and think this solution works in the same spirit:
# Added a Second Row for testing
df = pd.DataFrame(
{'AUD': [0.67, 0.91], 'USD': [1, 1], 'JPY': [140, 130], 'EUR': [1.05, 1]},
)
# Instantiated the Second DataFrame
cf = pd.DataFrame()
# Call the index of the columns as an integer
for i in range(len(df.columns)):
# Increment the index + 1, so you aren't looking at the same column twice
# Also, limit the range to the length of your columns
for j in range(i+1, len(df.columns)):
print(f'{df.columns[i]}' + f'{df.columns[j]}') # VERIFY
# Create a variable of the column names mashed together
combine = f'{df.columns[i]}' + f'{df.columns[j]}
# Assign the rows to be a product of the mashed column series
cf[combine] = df[df.columns[i]] * df[df.columns[j]]
print(cf) # VERIFY
The console Log looks like this:
AUDUSD
AUDJPY
AUDEUR
USDJPY
USDEUR
JPYEUR
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
1 0.91 118.3 0.9100 130 1.00 130.0
I have a data set with first column is the Date, Second column is the Collaborator and third column is price paid.
I want to get the mean price paid of every Collaborator for the previous month. I want to return a table tha looks like this:
I used some solutions like rolling but i could get only the past X days, not the past month
Pandas has a built-in method .rolling
x = 3 # This is where you define the number of previous entries
df.rolling(x).mean() # Apply the mean
Hence:
df['LastMonthMean'] = df['Price'].rolling(x).mean()
I'm not sure how you want to calculate your mean but hope this helps
I would first add month column and then use groupby and would retrieve the first item
import pandas as pd
df = pd.DataFrame({
'month': [1, 1, 1, 2, 2, 2],
'collaborator': [1, 2, 3, 1, 2, 3],
'price': [100, 200, 300, 400, 500, 600]
})
df.groupby(['collaborator', 'month']).mean()
The rolling() method would have to be applied to the DataFrame grouped by Collaborator to obtain the mean sale price of every collaborator in the previous month.
Because the data would be grouped by and summarised, the number of data points would not match the original dataset, thus not allowing you to easily append the result to the original dataset.
If you use a DatetimeIndex in your DataFrame it will be considered a time series and then you can resample() the data more easily.
I have produced a replicable solution below, based on your initial question in which I resample the data and append the last month's mean to it. Thanks to #akilat90 for the function to generate random dates within a range.
import pandas as pd
import numpy as np
def random_dates(start, end, n=10):
# Function copied from #akilat90
# Available on https://stackoverflow.com/questions/50559078/generating-random-dates-within-a-given-range-in-pandas
start_u = pd.to_datetime(start).value//10**9
end_u = pd.to_datetime(end).value//10**9
return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
size = 1000
index = random_dates(start='2021-01-01', end='2021-06-30', n=size).sort_values()
collaborators = np.random.randint(low=1, high=4, size=size)
prices = np.random.uniform(low=5., high=25., size=size)
data = pd.DataFrame({'Collaborator': collaborators,
'Price': prices}, index=index)
monthly_mean = data.groupby('Collaborator').resample('M')['Price'].mean()
data_final = pd.merge(data, monthly_mean, how='left', left_on=['Collaborator', data.index.month],
right_on=[monthly_mean.index.get_level_values('Collaborator'), monthly_mean.index.get_level_values(1).month + 1])
data_final.index = data.index
data_final = data_final.drop('key_1', axis=1)
data_final.columns = ['Collaborator', 'Price', 'LastMonthMean']
This is the output:
Collaborator Price LastMonthMean
2021-01-31 04:26:16 2 21.838910 NaN
2021-01-31 05:33:04 2 19.164086 NaN
2021-01-31 12:32:44 2 24.949444 NaN
2021-01-31 12:58:02 2 8.907224 NaN
2021-01-31 14:43:07 1 7.446839 NaN
2021-01-31 18:38:11 3 6.565208 NaN
2021-02-01 00:08:25 2 24.520149 15.230642
2021-02-01 09:25:54 2 20.614261 15.230642
2021-02-01 09:59:48 2 10.879633 15.230642
2021-02-02 10:12:51 1 22.134549 14.180087
2021-02-02 17:22:18 2 24.469944 15.230642
As you can see, the records in January 2021, the first month in this time series, do not have a valid Last Month Mean, unlike the records in February.
Hi I'm trying to feature engineer a Patient dataset from movement level to patient level.
Original df looks like this:
Conditions:
1) Create Last Test<n> Change cols - For CaseNo that encounters the Category value 'ICU', take the Test<n> change before 'ICU' value (189-180 for Test1, CaseNo 1), else take the latest Test<n> change (256-266 for Test1, CaseNo 2).
2) Create Test<n>_Pattern cols - For CaseNo that encounters the Category value 'ICU', pivot all the Test<n> values from start till before 'ICU' value. Else pivot all Test<n> values from start to end.
3)Create Last Test<n> Count cols - For CaseNo that encounters the Category value 'ICU', take the last Test<n> value before 'ICU' encounter. Else take the last Test<n> value.
Expected Outcome:
How do I go about this in Python?
Code for df:
df = pd.DataFrame({'CaseNo':[1,1,1,1,2,2,2,2],
'Movement_Sequence_No':[1,2,3,4,1,2,3,4],
'Movement_Start_Date':['2020-02-09 22:17:00','2020-02-10 17:19:41','2020-02-17 08:04:19',
'2020-02-18 11:22:52','2020-02-12 23:00:00','2020-02-24 10:26:35',
'2020-03-03 17:50:00','2020-03-17 08:24:19'],
'Movement_End_Date':['2020-02-10 17:19:41','2020-02-17 08:04:19','2020-02-18 11:22:52',
'2020-02-25 13:55:37','2020-02-24 10:26:35','2020-03-03 17:50:00',
'2222-12-31 23:00:00','2020-03-18 18:50:00'],
'Category':['A','A','ICU','A','B','B','B','B'],
'RequestDate':['2020-02-10 16:00:00','2020-02-16 13:04:20','2020-02-18 07:11:11','2020-02-21 21:30:30',
'2020-02-13 22:00:00','NA','2020-03-15 09:40:00','2020-03-18 15:10:10'],
'Test1':['180','189','190','188','328','NA','266','256'],
'Test2':['20','21','15','10','33','30','28','15'],
'Test3':['55','NA','65','70','58','64','68','58'],
'Age':['65','65','65','65','45','45','45','45']})
Expected Outcome:
df2 = pd.DataFrame({'CaseNo':[1, 2],
'Last Test1 Change':[9, -10],
'Test1 Pattern':['180, 189', '328, 266, 256'],
'Last Test1 Count':[189, 256],
'Last Test2 Change':[1, -13],
'Test2 Pattern':['20, 21', '33, 30, 28, 15'],
'Last Test2 Count':[21, 15],
'Last Test3 Change':[10, -10],
'Test3 Pattern':['55', '58, 64, 68, 58'],
'Last Test3 Count':[55, 58],
'Age':[65, 45]})
I am just gonna show you how to approach your problem in a general way.
For your first condition, you can create a helper index by cumsum to filter out the data after ICU:
df["helper"] = df.groupby("CaseNo")["Category"].transform(lambda d: d.eq("ICU").cumsum())
I am not really sure what n stands for, but if you just want to grab certain amount of data, use groupby and tail:
s = df.loc[df["helper"].eq(0)].groupby("CaseNo").tail(4).filter(regex="CaseNo|Test.*|Age")
print (s)
CaseNo Test1 Test2 Test3 Age
0 1 180.0 20.0 55.0 65
1 1 189.0 21.0 NaN 65
4 2 328.0 33.0 58.0 45
5 2 NaN 30.0 64.0 45
6 2 266.0 28.0 68.0 45
7 2 256.0 15.0 58.0 45
Finally pivot your data:
res = (pd.pivot_table(s, index=["CaseNo", "Age"],
aggfunc=["last", list]).reset_index())
print (res)
CaseNo Age last list
Test1 Test2 Test3 Test1 Test2 Test3
0 1 65 189.0 21.0 55.0 [180.0, 189.0] [20.0, 21.0] [55.0, nan]
1 2 45 256.0 15.0 58.0 [328.0, nan, 266.0, 256.0] [33.0, 30.0, 28.0, 15.0] [58.0, 64.0, 68.0, 58.0]
From here you can work towards your final goal.
try in this way:
df = pd.DataFrame({'CaseNo':[1,1,1,1,2,2,2,2],
'Movement_Sequence_No':[1,2,3,4,1,2,3,4],
'Movement_Start_Date':['2020-02-09 22:17:00','2020-02-10 17:19:41','2020-02-17 08:04:19',
'2020-02-18 11:22:52','2020-02-12 23:00:00','2020-02-24 10:26:35',
'2020-03-03 17:50:00','2020-03-17 08:24:19'],
'Movement_End_Date':['2020-02-10 17:19:41','2020-02-17 08:04:19','2020-02-18 11:22:52',
'2020-02-25 13:55:37','2020-02-24 10:26:35','2020-03-03 17:50:00',
'2222-12-31 23:00:00','2020-03-18 18:50:00'],
'Category':['A','A','ICU','A','B','B','B','B'],
'RequestDate':['2020-02-10 16:00:00','2020-02-16 13:04:20','2020-02-18 07:11:11','2020-02-21 21:30:30',
'2020-02-13 22:00:00','NA','2020-03-15 09:40:00','2020-03-18 15:10:10'],
'Test1':['180','189','190','188','328','NA','266','256'],
'Test2':['20','21','15','10','33','30','28','15'],
'Test3':['55','NA','65','70','58','64','68','58'],
'Age':['65','65','65','65','45','45','45','45']})
# simple data management
df = df.replace('NA', np.nan)
df[['Test1','Test2','Test3','Age']] = df[['Test1','Test2','Test3','Age']].astype(float)
# create empty df to store results
results = pd.DataFrame()
# split original df in groups based on CaseNo
for jj,(j,gr) in enumerate(df.groupby('CaseNo')):
group = gr.copy()
# idenfify the presence of ICU
group['Category'] = (group['Category'].values == 'ICU').cumsum()
# replace NaN value with the next useful value
# this is useful to fill NaN in Test1, Test2, Test3
group_fill = group.fillna(method='bfill')
# select part of df before the first ICU matched
group_fill = group_fill[group_fill.Category == 0]
group = group[group.Category == 0]
# at this point we have two copy of our group df (group and group_fill)
# group contains the raw (inclused NaN) values for a selected CaseNo
# group_fill contains the filled values for a selected CaseNo
# create empty df to store partial results
partial = pd.DataFrame()
# select unique CaseNo
partial['CaseNo'] = group['CaseNo'].unique()
# for loop to make operation on Test1, Test2 and Test3
for i in range(1,4):
# these are simply the operation you required
# NB: 'Last TestN Change' is computed on the group df without NaN
# this is important to avoid errors when the last obsevarion is NaN
# 'TestN Pattern' and 'Last TestN Count' can be computed on the filled group df
partial[f'Last Test{i} Change'] = group_fill[f'Test{i}'].tail(2).diff().tail(1).values
partial[f'Test{i} Pattern'] = [group[f'Test{i}'].dropna().to_list()]
partial[f'Last Test{i} Count'] = group[f'Test{i}'].dropna().tail(1).values
# select unique age
partial['Age'] = group['Age'].unique()
# create correct index for the final results
partial.index = range(jj,jj+1)
# append partial results to final results df
results = results.append(partial)
# print final results df
results
I have two dataframes with different timeseries data (see example below). Whereas Dataframe1 contains multiple daily observations per month, Dataframe2 only contains one observation per month.
What I want to do now is to align the data in Dataframe2 with the last day every month in Dataframe1. The last day per month in Dataframe1 does not necessarily have to be the last day of that respective calendar month.
I'm grateful for every hint how to tackle this problem in an efficient manner (as dataframes can be quite large)
Dataframe1
----------------------------------
date A B
1980-12-31 152.799 209.132
1981-01-01 152.799 209.132
1981-01-02 152.234 209.517
1981-01-05 152.895 211.790
1981-01-06 155.131 214.023
1981-01-07 152.596 213.044
1981-01-08 151.232 211.810
1981-01-09 150.518 210.887
1981-01-12 149.899 210.340
1981-01-13 147.588 207.621
1981-01-14 148.231 208.076
1981-01-15 148.521 208.676
1981-01-16 148.931 209.278
1981-01-19 149.824 210.372
1981-01-20 149.849 210.454
1981-01-21 150.353 211.644
1981-01-22 149.398 210.042
1981-01-23 148.748 208.654
1981-01-26 148.879 208.355
1981-01-27 148.671 208.431
1981-01-28 147.612 207.525
1981-01-29 147.153 206.595
1981-01-30 146.330 205.558
1981-02-02 145.779 206.635
Dataframe2
---------------------------------
date C D
1981-01-13 53.4 56.5
1981-02-15 52.2 60.0
1981-03-15 51.8 58.0
1981-04-14 51.8 59.5
1981-05-16 50.7 58.0
1981-06-15 50.3 59.5
1981-07-15 50.6 53.5
1981-08-17 50.1 44.5
1981-09-12 50.6 38.5
To provide a readable example, I prepared test data as follows:
df1 - A couple of observations from January and February:
date A B
0 1981-01-02 152.234 209.517
1 1981-01-07 152.596 213.044
2 1981-01-13 147.588 207.621
3 1981-01-20 151.232 211.810
4 1981-01-27 150.518 210.887
5 1981-02-05 149.899 210.340
6 1981-02-14 152.895 211.790
7 1981-02-16 155.131 214.023
8 1981-02-21 180.000 200.239
df2 - Your data, also from January and February:
date C D
0 1981-01-13 53.4 56.5
1 1981-02-15 52.2 60.0
Both dataframes have date column of datetime type.
Start from getting the last observation in each month from df1:
res1 = df1.groupby(df1.date.dt.to_period('M')).tail(1)
The result, for my data, is:
date A B
4 1981-01-27 150.518 210.887
8 1981-02-21 180.000 200.239
Then, to join observations, the join must be performed on the
whole month period, not the exact date. To do this, run:
res = pd.merge(res1.assign(month=res1['date'].dt.to_period('M')),
df2.assign(month=df2['date'].dt.to_period('M')),
how='left', on='month', suffixes=('_1', '_2'), )
The result is:
date_1 A B month date_2 C D
0 1981-01-27 150.518 210.887 1981-01 1981-01-13 53.4 56.5
1 1981-02-21 180.000 200.239 1981-02 1981-02-15 52.2 60.0
If you want the merge to include data only for months where there
is at least one observation in both df1 and df2, drop how parameter.
Its default value is inner, which is the correct mode in this case.
When you have a sample dataframe, you can provide code for doing so. Simply select a column as a list (step 1 and 2) and use that list to build the dataframe with code (step 3 and 4).
import pandas as pd
# Step 1: create your dataframe, and print each column as a list, copy-paste into code example below.
df_1 = pd.read_csv('dataset1.csv')
print(list(df_1['date']))
print(list(df_1['A']))
print(list(df_1['B']))
# Step 2: create your dataframe, and print each column as a list, copy-paste into code example below.
df_2 = pd.read_csv('dataset2.csv')
print(list(df_2['date']))
print(list(df_2['C']))
print(list(df_2['D']))
# Step 3: create sample dataframe ... good if you can provide this in your future questions
df_1 = pd.DataFrame({
'date': ['12/31/1980', '1/1/1981', '1/2/1981', '1/5/1981', '1/6/1981',
'1/7/1981', '1/8/1981', '1/9/1981', '1/12/1981', '1/13/1981',
'1/14/1981', '1/15/1981', '1/16/1981', '1/19/1981', '1/20/1981',
'1/21/1981', '1/22/1981', '1/23/1981', '1/26/1981', '1/27/1981',
'1/28/1981', '1/29/1981', '1/30/1981', '2/2/1981'],
'A': [152.799, 152.799, 152.234, 152.895, 155.131,
152.596, 151.232, 150.518, 149.899, 147.588,
148.231, 148.521, 148.931, 149.824, 149.849,
150.353, 149.398, 148.748, 148.879, 148.671,
147.612, 147.153, 146.33, 145.779],
'B': [209.132, 209.132, 209.517, 211.79, 214.023,
213.044, 211.81, 210.887, 210.34, 207.621,
208.076, 208.676, 209.278, 210.372, 210.454,
211.644, 210.042, 208.654, 208.355, 208.431,
207.525, 206.595, 205.558, 206.635]
})
# Step 4: create sample dataframe ... good if you can provide this in your future questions
df_2 = pd.DataFrame({
'date': ['1/13/1981', '2/15/1981', '3/15/1981', '4/14/1981', '5/16/1981',
'6/15/1981', '7/15/1981', '8/17/1981', '9/12/1981'],
'C': [53.4, 52.2, 51.8, 51.8, 50.7, 50.3, 50.6, 50.1, 50.6],
'D': [56.5, 60.0, 58.0, 59.5, 58.0, 59.5, 53.5, 44.5, 38.5]
})
# Step 5: make sure the date field is actually a date, not a string
df_1['date'] = pd.to_datetime(df_1['date']).dt.date
# Step 6: create new colum with year and month
df_1['date_year_month'] = pd.to_datetime(df_1['date']).dt.to_period('M')
# Step 7: create boolean mask that grabs the max date for each year-month
mask_last_day_month = df_1.groupby('date_year_month')['date'].transform(max) == df_1['date']
# Step 8: create new dataframe with only last day of month
df_1_max = df_1.loc[mask_last_day_month]
print('here is dataframe 1 with only last day in the month')
print(df_1_max)
print()
# Step 9: make sure the date field is actually a date, not a string
df_2['date'] = pd.to_datetime(df_2['date']).dt.date
# Step 10: create new colum with year and month
df_2['date_year_month'] = pd.to_datetime(df_2['date']).dt.to_period('M')
print('here is the original dataframe 2')
print(df_2)
print()
I have two dataframes:
df1 - is a pivot table that has totals for both columns and rows, both with default names "All"
df2 - a df I created manually by specifying values and using the same index and column names as are used in the pivot table above. This table does not have totals.
I need to multiply the first dataframe by the values in the second. I expect the totals return NaNs since totals don't exist in the second table.
When I perform multiplication, I get the following error:
ValueError: cannot join with no level specified and no overlapping names
When I try the same on dummy dataframes it works as expected:
import pandas as pd
import numpy as np
table1 = np.matrix([[10, 20, 30, 60],
[50, 60, 70, 180],
[90, 10, 10, 110],
[150, 90, 110, 350]])
df1 = pd.DataFrame(data = table1, index = ['One','Two','Three', 'All'], columns =['A', 'B','C', 'All'] )
print(df1)
table2 = np.matrix([[1.0, 2.0, 3.0],
[5.0, 6.0, 7.0],
[2.0, 1.0, 5.0]])
df2 = pd.DataFrame(data = table2, index = ['One','Two','Three'], columns =['A', 'B','C'] )
print(df2)
df3 = df1*df2
print(df3)
This gives me the following output:
A B C All
One 10 20 30 60
Two 50 60 70 180
Three 90 10 10 110
All 150 90 110 350
A B C
One 1.00 2.00 3.00
Two 5.00 6.00 7.00
Three 2.00 1.00 5.00
A All B C
All nan nan nan nan
One 10.00 nan 40.00 90.00
Three 180.00 nan 10.00 50.00
Two 250.00 nan 360.00 490.00
So, visually, the only difference between df1 and df2 is the presence/absence of the column and row "All".
And I think the only difference between my dummy dataframes and the real ones is that the real df1 was created with pd.pivot_table method:
df1_real = pd.pivot_table(PY, values = ['Annual Pay'], index = ['PAR Rating'],
columns = ['CR Range'], aggfunc = [np.sum], margins = True)
I do need to keep the total as I'm using them in other calculations.
I'm sure there is a workaround but I just really want to understand why the same code works on some dataframes of different sizes but not others. Or maybe an issue is something completely different.
Thank you for reading. I realize it's a very long post..
IIUC,
My Preferred Approach
you can use the mul method in order to pass the fill_value argument. In this case, you'll want a value of 1 (multiplicative identity) to preserve the value from the dataframe in which the value is not missing.
df1.mul(df2, fill_value=1)
A All B C
All 150.0 350.0 90.0 110.0
One 10.0 60.0 40.0 90.0
Three 180.0 110.0 10.0 50.0
Two 250.0 180.0 360.0 490.0
Alternate Approach
You can also embrace the np.nan and use a follow-up combine_first to fill back in the missing bits from df1
(df1 * df2).combine_first(df1)
A All B C
All 150.0 350.0 90.0 110.0
One 10.0 60.0 40.0 90.0
Three 180.0 110.0 10.0 50.0
Two 250.0 180.0 360.0 490.0
I really like Pir 's approach , and here is mine :-)
df1.loc[df2.index,df2.columns]*=df2
df1
Out[293]:
A B C All
One 10.0 40.0 90.0 60
Two 250.0 360.0 490.0 180
Three 180.0 10.0 50.0 110
All 150.0 90.0 110.0 350
#Wen, #piRSquared, thank you for your help. This is what I ended up doing. There is probably a more elegant solution but this worked for me.
Since I was able to multiply two dummy dataframes of different sizes, I reasoned the issue wasn't the size, but the fact that one of the dataframes was created as a pivot table. Somehow in this pivot table, the headers were not recognized, though visually they were there. So, I decided to convert the pivot table to a regular dataframe. Steps I took:
Converted the pivot table to records and then back to dataframe using solution from this thread: pandas pivot table to data frame .
Cleaned up the column headers using solution from the same thread above: pandas pivot table to data frame .
Set my first column as the index following suggestion in this thread: How to remove index from a created Dataframe in Python?
This gave me a dataframe that was visually identical to what I had before but was no longer a pivot table.
I was then able to multiply the two dataframes with no issues. I used approach suggested by #Wen because I like that it preserves the structure.