I have a pandas DataFrame with values like the one below, though in reality I am working with many more columns and historical data.
AUD USD JPY EUR
0 0.67 1 140 1.05
I want to iterate over the columns to create a DataFrame with columns AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR and JPYEUR,
where, for example, AUDUSD is calculated as the product of the AUD column and the USD column.
I tried the following:
for col in df:
    for cols in df:
        cf[col+cols] = df[col]*df[cols]
But it generates a table with unnecessary columns like AUDAUD and USDUSD, and duplicates like AUDUSD and USDAUD. I think that if I could somehow set "cols = col+1 till end of df" in the second for loop, I should be able to resolve the issue, but I don't know how to do that.
The result I am looking for is a table with the columns below and their values:
AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR, JPYEUR
You can use itertools.combinations with pandas.Series.mul and pandas.concat.
Try this:
from itertools import combinations
combos = list(combinations(df.columns, 2))
out = pd.concat([df[col[1]].mul(df[col[0]]) for col in combos], axis=1, keys=combos)
out.columns = out.columns.map("".join)
# Output :
print(out)
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
# Used input :
df = pd.DataFrame({'AUD': [0.67], 'USD': [1], 'JPY': [140], 'EUR': [1.05]})
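If you prefer to skip the renaming step, the same idea can be written as a dict comprehension (a minimal sketch, equivalent to the concat above):
from itertools import combinations

out = pd.DataFrame({f"{a}{b}": df[a] * df[b]
                    for a, b in combinations(df.columns, 2)})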
I thought it intuitive that your first approach was an inner/outer loop, and I think this solution works in the same spirit:
# Added a second row for testing
df = pd.DataFrame(
    {'AUD': [0.67, 0.91], 'USD': [1, 1], 'JPY': [140, 130], 'EUR': [1.05, 1]},
)
# Instantiate the second DataFrame
cf = pd.DataFrame()
# Iterate over the column index positions as integers
for i in range(len(df.columns)):
    # Start the inner index at i + 1, so you aren't looking at the same column twice
    # Also, limit the range to the length of your columns
    for j in range(i + 1, len(df.columns)):
        print(f'{df.columns[i]}' + f'{df.columns[j]}')  # VERIFY
        # Create a variable of the column names mashed together
        combine = f'{df.columns[i]}' + f'{df.columns[j]}'
        # Assign the rows to be a product of the mashed column series
        cf[combine] = df[df.columns[i]] * df[df.columns[j]]
print(cf)  # VERIFY
The console log looks like this:
AUDUSD
AUDJPY
AUDEUR
USDJPY
USDEUR
JPYEUR
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
1 0.91 118.3 0.9100 130 1.00 130.0
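The same inner/outer loop can also be written directly over the column labels with enumerate, which avoids the repeated df.columns[i] lookups (a sketch in the same spirit, not the only way to do it):

cf = pd.DataFrame()
for i, a in enumerate(df.columns):
    for b in df.columns[i + 1:]:
        cf[a + b] = df[a] * df[b]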
Hi, I'm trying to feature-engineer a patient dataset from movement level to patient level.
The original df looks like this (see the code for df further below):
Conditions:
1) Create Last Test<n> Change cols - For a CaseNo that encounters the Category value 'ICU', take the Test<n> change before the 'ICU' value (189-180 for Test1, CaseNo 1); else take the latest Test<n> change (256-266 for Test1, CaseNo 2).
2) Create Test<n> Pattern cols - For a CaseNo that encounters the Category value 'ICU', pivot all the Test<n> values from the start until before the 'ICU' value. Else pivot all Test<n> values from start to end.
3) Create Last Test<n> Count cols - For a CaseNo that encounters the Category value 'ICU', take the last Test<n> value before the 'ICU' encounter. Else take the last Test<n> value.
Expected Outcome:
How do I go about this in Python?
Code for df:
df = pd.DataFrame({'CaseNo':[1,1,1,1,2,2,2,2],
'Movement_Sequence_No':[1,2,3,4,1,2,3,4],
'Movement_Start_Date':['2020-02-09 22:17:00','2020-02-10 17:19:41','2020-02-17 08:04:19',
'2020-02-18 11:22:52','2020-02-12 23:00:00','2020-02-24 10:26:35',
'2020-03-03 17:50:00','2020-03-17 08:24:19'],
'Movement_End_Date':['2020-02-10 17:19:41','2020-02-17 08:04:19','2020-02-18 11:22:52',
'2020-02-25 13:55:37','2020-02-24 10:26:35','2020-03-03 17:50:00',
'2222-12-31 23:00:00','2020-03-18 18:50:00'],
'Category':['A','A','ICU','A','B','B','B','B'],
'RequestDate':['2020-02-10 16:00:00','2020-02-16 13:04:20','2020-02-18 07:11:11','2020-02-21 21:30:30',
'2020-02-13 22:00:00','NA','2020-03-15 09:40:00','2020-03-18 15:10:10'],
'Test1':['180','189','190','188','328','NA','266','256'],
'Test2':['20','21','15','10','33','30','28','15'],
'Test3':['55','NA','65','70','58','64','68','58'],
'Age':['65','65','65','65','45','45','45','45']})
Expected Outcome:
df2 = pd.DataFrame({'CaseNo':[1, 2],
'Last Test1 Change':[9, -10],
'Test1 Pattern':['180, 189', '328, 266, 256'],
'Last Test1 Count':[189, 256],
'Last Test2 Change':[1, -13],
'Test2 Pattern':['20, 21', '33, 30, 28, 15'],
'Last Test2 Count':[21, 15],
'Last Test3 Change':[10, -10],
'Test3 Pattern':['55', '58, 64, 68, 58'],
'Last Test3 Count':[55, 58],
'Age':[65, 45]})
I am just gonna show you how to approach your problem in a general way.
For your first condition, you can create a helper index by cumsum to filter out the data after ICU:
df["helper"] = df.groupby("CaseNo")["Category"].transform(lambda d: d.eq("ICU").cumsum())
I am not really sure what n stands for, but if you just want to grab a certain amount of data, use groupby and tail:
s = df.loc[df["helper"].eq(0)].groupby("CaseNo").tail(4).filter(regex="CaseNo|Test.*|Age")
print (s)
CaseNo Test1 Test2 Test3 Age
0 1 180.0 20.0 55.0 65
1 1 189.0 21.0 NaN 65
4 2 328.0 33.0 58.0 45
5 2 NaN 30.0 64.0 45
6 2 266.0 28.0 68.0 45
7 2 256.0 15.0 58.0 45
Finally pivot your data:
res = (pd.pivot_table(s, index=["CaseNo", "Age"],
aggfunc=["last", list]).reset_index())
print (res)
CaseNo Age last list
Test1 Test2 Test3 Test1 Test2 Test3
0 1 65 189.0 21.0 55.0 [180.0, 189.0] [20.0, 21.0] [55.0, nan]
1 2 45 256.0 15.0 58.0 [328.0, nan, 266.0, 256.0] [33.0, 30.0, 28.0, 15.0] [58.0, 64.0, 68.0, 58.0]
From here you can work towards your final goal.
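For example, here is a rough sketch of flattening the columns and deriving the "Last Test<n> Change" columns from the collected lists, assuming the res layout shown above (how NaN just before the ICU cut-off should be treated may still need adjusting to match your expected output):

# flatten the two-level columns created by the two aggfuncs
res.columns = ["_".join(filter(None, col)) for col in res.columns]

# difference of the last two non-NaN observations in each collected list
for n in (1, 2, 3):
    res[f"Last Test{n} Change"] = res[f"list_Test{n}"].apply(
        lambda vals: pd.Series(vals).dropna().diff().iloc[-1]
    )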
Try it this way:
df = pd.DataFrame({'CaseNo':[1,1,1,1,2,2,2,2],
'Movement_Sequence_No':[1,2,3,4,1,2,3,4],
'Movement_Start_Date':['2020-02-09 22:17:00','2020-02-10 17:19:41','2020-02-17 08:04:19',
'2020-02-18 11:22:52','2020-02-12 23:00:00','2020-02-24 10:26:35',
'2020-03-03 17:50:00','2020-03-17 08:24:19'],
'Movement_End_Date':['2020-02-10 17:19:41','2020-02-17 08:04:19','2020-02-18 11:22:52',
'2020-02-25 13:55:37','2020-02-24 10:26:35','2020-03-03 17:50:00',
'2222-12-31 23:00:00','2020-03-18 18:50:00'],
'Category':['A','A','ICU','A','B','B','B','B'],
'RequestDate':['2020-02-10 16:00:00','2020-02-16 13:04:20','2020-02-18 07:11:11','2020-02-21 21:30:30',
'2020-02-13 22:00:00','NA','2020-03-15 09:40:00','2020-03-18 15:10:10'],
'Test1':['180','189','190','188','328','NA','266','256'],
'Test2':['20','21','15','10','33','30','28','15'],
'Test3':['55','NA','65','70','58','64','68','58'],
'Age':['65','65','65','65','45','45','45','45']})
# simple data management
df = df.replace('NA', np.nan)
df[['Test1','Test2','Test3','Age']] = df[['Test1','Test2','Test3','Age']].astype(float)

# create an empty df to store the results
results = pd.DataFrame()

# split the original df into groups based on CaseNo
for jj, (j, gr) in enumerate(df.groupby('CaseNo')):
    group = gr.copy()
    # identify the presence of ICU
    group['Category'] = (group['Category'].values == 'ICU').cumsum()
    # replace NaN values with the next valid value
    # this is useful to fill NaN in Test1, Test2, Test3
    group_fill = group.fillna(method='bfill')
    # select the part of the df before the first ICU match
    group_fill = group_fill[group_fill.Category == 0]
    group = group[group.Category == 0]
    # at this point we have two copies of our group df (group and group_fill)
    # group contains the raw values (including NaN) for the selected CaseNo
    # group_fill contains the filled values for the selected CaseNo

    # create an empty df to store partial results
    partial = pd.DataFrame()
    # select the unique CaseNo
    partial['CaseNo'] = group['CaseNo'].unique()
    # loop over Test1, Test2 and Test3
    for i in range(1, 4):
        # these are simply the operations you required
        # NB: 'Last TestN Change' is computed on group_fill, the version without NaN;
        # this is important to avoid errors when the last observation is NaN
        # 'TestN Pattern' and 'Last TestN Count' are computed on the raw group df, dropping NaN
        partial[f'Last Test{i} Change'] = group_fill[f'Test{i}'].tail(2).diff().tail(1).values
        partial[f'Test{i} Pattern'] = [group[f'Test{i}'].dropna().to_list()]
        partial[f'Last Test{i} Count'] = group[f'Test{i}'].dropna().tail(1).values
    # select the unique age
    partial['Age'] = group['Age'].unique()
    # create the correct index for the final results
    partial.index = range(jj, jj + 1)
    # append the partial results to the final results df
    results = results.append(partial)

# print the final results df
results
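One caveat: DataFrame.append was removed in pandas 2.0. On newer versions the same loop works if you collect the partial frames in a list and concatenate once at the end; only the collection step changes (sketch):

parts = []
for jj, (j, gr) in enumerate(df.groupby('CaseNo')):
    partial = pd.DataFrame()
    # ... build partial exactly as above ...
    parts.append(partial)
results = pd.concat(parts, ignore_index=True)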
Below is some dummy data that reflects the data I am working with.
import pandas as pd
import numpy as np
from numpy import random
random.seed(30)
# Dummy data that represents a percent change
datelist = pd.date_range(start='1983-01-01', end='1994-01-01', freq='Y')
df1 = pd.DataFrame({"P Change_1": np.random.uniform(low=-0.55528, high=0.0396181, size=(11,)),
"P Change_2": np.random.uniform(low=-0.55528, high=0.0396181, size=(11,))})
#This dataframe contains the rows we want to operate on
df2 = pd.DataFrame({
'Loc1': [None, None, None, None, None, None, None, None, None, None, 2.5415],
'Loc2': [None, None, None, None, None, None, None, None, None, None, 3.2126],})
#Set the datetime index
df1 = df1.set_index(datelist)
df2 = df2.set_index(datelist)
df1:
P Change_1 P Change_2
1984-12-31 -0.172080 -0.231574
1985-12-31 -0.328773 -0.247018
1986-12-31 -0.160834 -0.099079
1987-12-31 -0.457924 0.000266
1988-12-31 0.017374 -0.501916
1989-12-31 -0.349052 -0.438816
1990-12-31 0.034711 0.036164
1991-12-31 -0.415445 -0.415372
1992-12-31 -0.206852 -0.413107
1993-12-31 -0.313341 -0.181030
1994-12-31 -0.474234 -0.118058
df2:
Loc1 Loc2
1984-12-31 NaN NaN
1985-12-31 NaN NaN
1986-12-31 NaN NaN
1987-12-31 NaN NaN
1988-12-31 NaN NaN
1989-12-31 NaN NaN
1990-12-31 NaN NaN
1991-12-31 NaN NaN
1992-12-31 NaN NaN
1993-12-31 NaN NaN
1994-12-31 2.5415 3.2126
DataFrame details:
First off, Loc1 will correspond with P Change_1 and Loc2 corresponds to P Change_2, etc. Looking at Loc1 first, I want to either fill up the DataFrame containing Loc1 and Loc2 with the relevant values or compute a new dataframe that has columns Calc1 and Calc2.
The calculation:
I want to start with the 1994 value of Loc1 and calculate a new value for 1993 by taking Loc1 1993 = Loc1 1994 + (Loc1 1994 * P Change_1 1993). With the values filled in it would be 2.5415 + (-0.313341 * 2.5415), which equals about 1.74514.
This 1.74514 value will replace the NaN value in 1993, and then I want to use that calculated value to get a value for 1992. This means we now compute Loc1 1992 = Loc1 1993 + (Loc1 1993 * P Change_1 1992). I want to carry out this operation row-wise until it reaches the earliest value in the timeseries.
What is the best way to go about implementing this row-wise equation? I hope this makes some sense and any help is greatly appreciated!
df = pd.merge(df1, df2, how='inner', right_index=True, left_index=True)  # merge the dataframes on the date index
df['count'] = range(len(df))  # create a helper column, count, for easier positional operations

# divide the dataframe into two parts: one up to the first non-NaN row (inclusive) and one from it onwards
da1 = df[df['count'] <= df.dropna().iloc[0]['count']]
da2 = df[df['count'] >= df.dropna().iloc[0]['count']]
da1.sort_values(by=['count'], ascending=False, inplace=True)
g = [da1, da2]
num_col = len(df1.columns)
for w in range(len(g)):
    list_of_col = []
    count = 0
    list_of_col = [list() for i in range(len(g[w]))]
    for item, rows in g[w].iterrows():
        n = []
        if count == 0:
            for p in range(1, num_col + 1):
                n.append(rows[f'Loc{p}'])
        else:
            for p in range(1, num_col + 1):
                n.append(list_of_col[count-1][p-1] + list_of_col[count-1][p-1] * rows[f'P Change_{p}'])
        list_of_col[count].extend(n)
        count += 1
    tmp = [list() for i in range(num_col)]
    for d_ in range(num_col):
        for x_ in range(len(list_of_col)):
            tmp[d_].append(list_of_col[x_][d_])
    z1 = []
    z1.extend(tmp)
    for i in range(num_col):
        g[w][f'Loc{i+1}'] = z1[i]
da1.sort_values(by=['count'] ,inplace=True)
final_df = pd.concat([da1, da2[1:]])
calc_df = pd.DataFrame()
for i in range(num_col):
    calc_df[f'Calc{i+1}'] = final_df[f'Loc{i+1}']
print(calc_df)
I have tried to explain all the obscure things I have done in the comments. I have edited my code so that the initial dataframes remain unaffected.
[Edited]: I have edited the code to handle any number of columns in the given dataframe.
[Edited]: If the column names in df1 and df2 are arbitrary, please run this block of code before running the code above. It renames the columns using a list comprehension!
df1.columns = [f'P Change_{i+1}' for i in range(len(df1.columns))]
df2.columns = [f'Loc{i+1}' for i in range(len(df2.columns))]
[EDITED] Perhaps there are better/more elegant ways to do this, but this worked fine for me:
def fill_values(df1, df2, cols1=None, cols2=None):
    if cols1 is None: cols1 = df1.columns
    if cols2 is None: cols2 = df2.columns

    for i in reversed(range(df2.shape[0] - 1)):
        for col1, col2 in zip(cols1, cols2):
            if np.isnan(df2[col2].iloc[i]):
                val = df2[col2].iloc[i+1] + df2[col2].iloc[i+1] * df1[col1].iloc[i]
                df2[col2].iloc[i] = val
    return df1, df2
df1, df2 = fill_values(df1, df2)
print(df2)
Loc1 Loc2
1983-12-31 0.140160 0.136329
1984-12-31 0.169291 0.177413
1985-12-31 0.252212 0.235614
1986-12-31 0.300550 0.261526
1987-12-31 0.554444 0.261457
1988-12-31 0.544976 0.524925
1989-12-31 0.837202 0.935388
1990-12-31 0.809117 0.902741
1991-12-31 1.384158 1.544128
1992-12-31 1.745144 2.631024
1993-12-31 2.541500 3.212600
This assumes that the rows in df1 and df2 correspond perfectly (I'm not querying the index, but only the location). Hope it helps!
Just to be clear, what you need is Loc1[year]=Loc1[next_year] + PChange[year]*Loc1[next_year], right?
The below loop will do what you are looking for, but it just assumes that the number of rows in both df's is always equal, etc. (instead of matching the value in the index). From your description, I think this works for your data.
for i in range(df2.shape[0] - 2, -1, -1):
    df2.loc[df2.index[i], 'Loc1'] = df2['Loc1'].iloc[i+1] + (df1['P Change_1'].iloc[i] * df2['Loc1'].iloc[i+1])
Hope this helps :)
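For reference, because the relation is Loc1[year] = Loc1[next_year] * (1 + P Change_1[year]), the backward fill can also be written as a reverse cumulative product instead of a loop. A minimal sketch, assuming df1 and df2 share the same index and only the last row of df2 is known:

factors = (1 + df1['P Change_1']).iloc[:-1]  # the factor in the last row is never used
back = factors[::-1].cumprod()[::-1]         # product of the factors from each row through the end
df2['Loc1'] = df2['Loc1'].fillna(df2['Loc1'].iloc[-1] * back)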
I have a df called df_world with the following shape:
Cases Death Delta_Cases Delta_Death
Country/Region Date
Brazil 2020-01-22 0.0 0 NaN NaN
2020-01-23 0.0 0 0.0 0.0
2020-01-24 0.0 0 0.0 0.0
2020-01-25 0.0 0 0.0 0.0
2020-01-26 0.0 0 0.0 0.0
... ... ... ...
World 2020-05-12 4261747.0 291942 84245.0 5612.0
2020-05-13 4347018.0 297197 85271.0 5255.0
2020-05-14 4442163.0 302418 95145.0 5221.0
2020-05-15 4542347.0 307666 100184.0 5248.0
2020-05-16 4634068.0 311781 91721.0 4115.0
I'd like to sort the country index by the value of the 'Cases' column on the last recording, i.e. comparing the Cases values on 2020-05-16 for all countries, and return the sorted country list.
I thought about creating another df with only the 2020-05-16 values and then using the df.sort_values() method, but I am sure there has to be a more efficient way.
While I'm at it, I've also tried to select only the countries whose number of cases on 2020-05-16 is above a certain value, and the only way I found to do it was to iterate over the Country index:
for a_country in df_world.index.levels[0]:
    if df_world.loc[(a_country, last_date), 'Cases'] < cut_off_val:
        df_world = df_world.drop(index=a_country)
But it's quite a poor way to do it.
If anyone has an idea on how to improve the efficiency of this code, I'd be very happy.
Thank you :)
You can first group the dataset by "Country/Region", then sort each group by "Date", take the last row, and sort again by "Cases".
Faking some data myself (data types are different but you see my point):
df = pd.DataFrame([['a', 1, 100],
['a', 2, 10],
['b', 2, 55],
['b', 3, 15],
['c', 1, 22],
['c', 3, 80]])
df.columns = ['country', 'date', 'cases']
df = df.set_index(['country', 'date'])
print(df)
# cases
# country date
# a 1 100
# 2 10
# b 2 55
# 3 15
# c 1 22
# 3 80
Then,
# group them by country
grp_by_country = df.groupby(by='country')
# for each group, aggregate by sorting by data and taking the last row (latest date)
latest_per_grp = grp_by_country.agg(lambda x: x.sort_values(by='date').iloc[-1])
# sort again by cases
sorted_by_cases = latest_per_grp.sort_values(by='cases')
print(sorted_by_cases)
# cases
# country
# a 10
# b 15
# c 80
Stay safe!
last_recs = df_world.reset_index().groupby('Country/Region').last()
sorted_countries = last_recs.sort_values('Cases')['Country/Region']
As I don't have your raw data I can't test it, but this should do what you need. All methods are self-explanatory, I believe.
You may need to sort df_world by date first if it isn't already sorted.
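For completeness, the same idea applied directly to df_world, including the cut-off filter from the question (a sketch, assuming the MultiIndex levels are Country/Region and Date, with the dates sorted within each country):

latest = df_world.groupby(level='Country/Region').last()   # last recording per country
sorted_countries = latest.sort_values('Cases').index.tolist()

# keep only countries above the cut-off, without an explicit loop
keep = latest.index[latest['Cases'] >= cut_off_val]
df_world = df_world[df_world.index.get_level_values('Country/Region').isin(keep)]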
My df matrix looks like this:
rating
id 10153337 10183250 10220967 ... 99808270 99816554 99821259
user_id ...
10003869 NaN 8.0 NaN ... NaN NaN NaN
10022889 NaN NaN 3.0 ... NaN 1.0 NaN
I can't get a column that I need because it returns an 'indices out of bounds' error
specificID = ratings_matrix[[99816554]]
...
raise IndexError("indices are out-of-bounds")
IndexError: indices are out-of-bounds
Why is it not searching the values given for columns?
Some runnable code:
ratings = pd.read_json(
''.join(
['{"columns":["id","rating","user_id"],"index":[0,1,2],"data":[[',
'67728134,4,10003869],[57495823,9,10060085],[99816554,1,10022889]]}']
), orient='split')
ratings
ratings.dtypes
ratings_matrix = ratings.pivot_table(index=['user_id'], columns=['id'], values=['rating'])
ratings_matrix.columns.map(type)
ratings_matrix[[67728134]] #here! searches column numbers rather than values
Notice that when you created your pivot, you passed a list to the values parameter:
ratings_matrix = ratings.pivot_table( # |<--- here --->|
index=['user_id'], columns=['id'], values=['rating'])
This told pandas to create a pd.MultiIndex for the columns. That's why you have two levels of columns, with rating on top, in your result.
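A quick way to see the two levels (output based on the sample data in the setup below):

print(ratings_matrix.columns.tolist())
# [('rating', 57495823), ('rating', 67728134), ('rating', 99816554)]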
option 1
use the multiindex
specificID = ratings_matrix[[('rating', 99816554)]]
option 2
don't create the multiindex
ratings_matrix = ratings.pivot_table( # see what I did?
index=['user_id'], columns=['id'], values='rating')
Then
specificID = ratings_matrix[[99816554]]
setup
ratings = pd.read_json(
    ''.join(
        ['{"columns":["id","rating","user_id"],"index":[0,1,2],"data":[[',
         '67728134,4,10003869],[57495823,9,10060085],[99816554,1,10022889]]}']
    ), orient='split'
)
ratings
ratings_matrix = ratings.pivot_table( # |<--- here --->|
index=['user_id'], columns=['id'], values=['rating'])
ratings_matrix[[('rating', 67728134)]]
ratings_matrix = ratings.pivot_table( # see what I did?
index=['user_id'], columns=['id'], values='rating')
ratings_matrix[[67728134]]