I have a dataframe similar to the following (but with hundreds of stocks rather than just A and B), and I do not know in advance how many stocks it will contain. I am trying to divide each stock's Price by the INDEX price matched on the Date column (stock A on 5/15/2020 divided by INDEX on 5/15/2020, then stock A on 5/16/2020 divided by INDEX on 5/16/2020, and so on, then stock B on 5/15/2020 divided by INDEX on 5/15/2020, etc.). I added the answer I want in the DESIRED column but do not know how to compute it.
import pandas as pd

d = {'Stock' : pd.Series(['A', 'A', 'A', 'B', 'B', 'B', 'INDEX', 'INDEX', 'INDEX']),
     'Date' : pd.Series(['5/15/2020', '5/16/2020', '5/17/2020', '5/15/2020',
                         '5/16/2020', '5/17/2020', '5/15/2020', '5/16/2020', '5/17/2020']),
     'Price' : pd.Series([10,20,30,20,40,60,2,5,10]),
     'DESIRED' : pd.Series([5,4,3,10,8,6,1,1,1])}
df = pd.DataFrame(d)
df
Here's a possible solution:
# First we build a dataframe containing only the INDEX rows
df_index = df[df.Stock == 'INDEX']
# and we get rid of those rows in the original dataframe
df = df[df.Stock != 'INDEX']
# now we merge them on Date
df = df.merge(df_index[['Date', 'Price']], on='Date', suffixes=['', '_index'])
# and we simply create the new column
df['hooray!'] = df.Price / df.Price_index
# If you want, you can delete the helper column
# del df['Price_index']
Output:
Stock Date Price DESIRED Price_index hooray!
0 A 5/15/2020 10 5 2 5.0
1 B 5/15/2020 20 10 2 10.0
2 A 5/16/2020 20 4 5 4.0
3 B 5/16/2020 40 8 5 8.0
4 A 5/17/2020 30 3 10 3.0
5 B 5/17/2020 60 6 10 6.0
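As a side note (my addition, a minimal sketch assuming each Date has exactly one INDEX row): starting from the original df, the same division can be done without dropping any rows by mapping each Date to its INDEX price; this also yields 1 for the INDEX rows themselves.
index_price = df[df['Stock'] == 'INDEX'].set_index('Date')['Price']
# 'ratio' is a hypothetical column name
df['ratio'] = df['Price'] / df['Date'].map(index_price)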
This should do the trick:
import pandas as pd

# data (NOTE: I've removed the DESIRED column)
d = {'Stock' : pd.Series(['A', 'A', 'A', 'B', 'B', 'B', 'INDEX', 'INDEX', 'INDEX']),
     'Date' : pd.Series(['5/15/2020', '5/16/2020', '5/17/2020', '5/15/2020',
                         '5/16/2020', '5/17/2020', '5/15/2020', '5/16/2020', '5/17/2020']),
     'Price' : pd.Series([10,20,30,20,40,60,2,5,10])}
# create dataframe
df = pd.DataFrame(d)
# create empty DESIRED column
df['DESIRED'] = ''
# create sub-dataframes for stocks and indices
stocksDf = df.loc[df['Stock'] != 'INDEX'].reset_index(drop=True)
indexDf = df.loc[df['Stock'] == 'INDEX'].reset_index(drop=True)
# loop over the stocks dataframe
for i, row in stocksDf.iterrows():
    # define the needed values
    stock = stocksDf.at[i, 'Stock']
    price = stocksDf.at[i, 'Price']
    date = stocksDf.at[i, 'Date']
    # get the index row matching the stock's date
    matchingIndex = indexDf.loc[indexDf['Date'] == date].reset_index(drop=True)
    mask = (df['Stock'] == stock) & (df['Price'] == price) & (df['Date'] == date)
    if len(matchingIndex) == 0:
        # if no matching index exists, just flag the row
        df.loc[mask, 'DESIRED'] = 'No Matching Index'
    else:
        # if one exists, calculate DESIRED as price of stock / price of index
        indexPrice = matchingIndex.at[0, 'Price']
        df.loc[mask, 'DESIRED'] = price / indexPrice
# for indices, just set DESIRED to 1
df.loc[df['Stock'] == 'INDEX', 'DESIRED'] = 1
print(df)
Stock Date Price DESIRED
0 A 5/15/2020 10 5
1 A 5/16/2020 20 4
2 A 5/17/2020 30 3
3 B 5/15/2020 20 10
4 B 5/16/2020 40 8
5 B 5/17/2020 60 6
6 INDEX 5/15/2020 2 1
7 INDEX 5/16/2020 5 1
8 INDEX 5/17/2020 10 1
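A loop-free sketch of the same logic (my addition, not part of the answer above): a left merge keeps stock dates that have no matching INDEX row, which simply come out as NaN instead of the 'No Matching Index' string.
index_prices = df.loc[df['Stock'] == 'INDEX', ['Date', 'Price']]
merged = df.merge(index_prices, on='Date', how='left', suffixes=('', '_index'))
merged['DESIRED'] = merged['Price'] / merged['Price_index']  # NaN where no index exists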
Let's say that I have a simple Dataframe.
import pandas as pd
data1 = [12,34,'fsdf',678,'','','dfs','','']
df1 = pd.DataFrame(data1, columns= ['Data'])
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4
5
6 dfs
7
8
I want to clear all of the data except the last non-empty value in the column, which I want to keep in the first row. The column can have thousands of rows. So I would like this result:
Data
0 dfs
1
2
3
4
5
6
7
8
And I have to keep the shape of the dataframe, so I can't remove rows.
What are the simplest functions to do that efficiently?
Thank you
Get the index of the last non-empty string value and assign that value to the first row of the column:
s = df1.loc[df1['Data'].iloc[::-1].ne('').idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
If the empty strings are instead missing values:
import numpy as np

data1 = [12,34,'fsdf',678,np.nan,np.nan,'dfs',np.nan,np.nan]
df1 = pd.DataFrame(data1, columns=['Data'])
print(df1)
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4 NaN
5 NaN
6 dfs
7 NaN
8 NaN
s = df1.loc[df1['Data'].iloc[::-1].notna().idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
A simple pandas condition check like this can help:
df1['Data'] = [df1.loc[df1['Data'].ne(""), "Data"].iloc[-1]] + [''] * (len(df1) - 1)
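Unpacked for readability (a sketch of the same idea, my addition): grab the last non-empty value, then rebuild the column as that value followed by empty-string padding.
last_val = df1.loc[df1['Data'].ne(''), 'Data'].iloc[-1]
df1['Data'] = [last_val] + [''] * (len(df1) - 1)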
You can replace '' with NaN using df.replace, then use df.last_valid_index:
val = df1.loc[df1.replace('', np.nan).last_valid_index(), 'Data']
# Below two lines taken from #jezrael's answer
df1.loc[0, 'Data'] = val
df1.loc[1:, 'Data'] = ''
Or
You can use np.full with fill_value set to np.nan here.
val = df1.loc[df1.replace("", np.nan).last_valid_index(), "Data"]
df1 = pd.DataFrame(np.full(df1.shape, np.nan),
                   index=df1.index,
                   columns=df1.columns)
df1.loc[0, "Data"] = val
I have a df with several columns where, based on the value (1-6) in each of those columns, I want to assign a value (0-1) to its corresponding column. I can do it on a column-by-column basis but would like to make it a single function. Below is some example code:
import pandas as pd
df = pd.DataFrame({'col1': [1,3,6,3,5,2], 'col2': [4,5,6,6,1,3], 'col3': [3,6,5,1,1,6],
'colA': [0,0,0,0,0,0], 'colB': [0,0,0,0,0,0], 'colC': [0,0,0,0,0,0]})
(col1 corresponds with colA, col2 with colB, col3 with colC)
This code works on a column by column basis:
df.loc[(df.col1 != 1) & (df.col1 < 6), 'colA'] = (df['colA']+ 1)
But I would like to be able to have a list of columns, so to speak, and have it correspond with another. Something like this (but that actually works):
m = df['col1' : 'col3'] != 1 & df['col1' : 'col3'] < 6
df.loc[m, 'colA' : 'colC'] += 1
Thank You!
The idea is to select both column groups with DataFrame.loc, build the boolean mask, rename the mask's columns to match df2, and finally use DataFrame.add on just those columns:
df1 = df.loc[:, 'col1' : 'col3']
df2 = df.loc[:, 'colA' : 'colC']
d = dict(zip(df1.columns,df2.columns))
df1 = ((df1 != 1) & (df1 < 6)).rename(columns=d)
df[df2.columns] = df[df2.columns].add(df1)
print (df)
col1 col2 col3 colA colB colC
0 1 4 3 0 1 1
1 3 5 6 1 1 0
2 6 6 5 0 0 1
3 3 6 1 1 0 0
4 5 1 1 1 0 0
5 2 3 6 1 1 0
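A terser hedged variant (my addition), close to the syntax the question sketched; the alignment is positional via to_numpy(), so this assumes the two column groups are in matching order:
vals = df.loc[:, 'col1':'col3']
df.loc[:, 'colA':'colC'] += ((vals != 1) & (vals < 6)).to_numpy().astype(int)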
Here's what I would do:
# split up dataframe
sub_df = df.iloc[:,:3]
abc = df.iloc[:,3:]
# make a numpy truth table (values range 1-6, so > 1 is equivalent to != 1 here)
truth_table = (sub_df.to_numpy() > 1) & (sub_df.to_numpy() < 6)
# redefine abc based on numpy truth table
new_abc = pd.DataFrame(truth_table.astype(int), columns=['colA', 'colB', 'colC'])
# join the updated dataframe subgroups
new_df = pd.concat([sub_df, new_abc], axis=1)
I have a df:
date amount code id
2018-01-01 50 12 1
2018-02-03 100 12 1
2017-12-30 1 13 2
2017-11-30 2 14 2
I want to groupby id, with the dates within each group sorted in ascending or descending order, so that I can do the following:
grouped = df.groupby('id')
a = np.where(grouped['code'].transform('nunique') == 1, 20, 0)
b = np.where(grouped['amount'].transform('max') > 100, 20, 0)
c = np.where(grouped['date'].transform(lambda x: x.diff().dropna().sum()).dt.days < 5, 30, 0)
You can sort the data within each group by using apply and sort_values:
grouped = df.groupby('id').apply(lambda g: g.sort_values('date', ascending=True))
Adding to the previous answer, if you wish the indexes to remain as they were, you might consider the following:
import pandas as pd
df = {'a':[1,2,3,0,5], 'b':[2,2,3,2,5], 'c':[22,11,11,42,12]}
df = pd.DataFrame(df)
e = (df.groupby(['c','b', 'a']).size()).reset_index()
e = e[['a', 'b', 'c']]
e = e.sort_values(['c','a'])
print(e)
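One more hedged sketch for the original question (my addition): sorting before grouping also works, since groupby preserves the within-group row order, so each group comes out date-sorted.
grouped = df.sort_values('date').groupby('id')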
I have 2 dataframes - players (only has playerid) and dates (only has date). I want a new dataframe which contains, for each player, each date. In my case, the players df has about 2600 rows and the dates df has 1100 rows. I used 2 for loops to do this, but it is really slow; is there a way to do it faster via some function? thx
my loop:
player_elo = pd.DataFrame(columns=['PlayerID', 'Date'])
for row in players.itertuples():
    pl = players.at[row.Index, 'PlayerID']
    for i in dates.itertuples():
        dt = dates.at[i.Index, 'Date']
        new = pd.DataFrame({'PlayerID': [pl], 'Date': [dt]})
        player_elo = player_elo.append(new)
If you have a key that is repeated for each df, you can come up with the cartesian product you are looking for using pd.merge().
import pandas as pd
players = pd.DataFrame([['A'], ['B'], ['C']], columns=['PlayerID'])
dates = pd.DataFrame([['12/12/2012'],['12/13/2012'],['12/14/2012']], columns=['Date'])
dates['Date'] = pd.to_datetime(dates['Date'])
players['key'] = 1
dates['key'] = 1
print(pd.merge(players, dates,on='key')[['PlayerID', 'Date']])
Output
PlayerID Date
0 A 2012-12-12
1 A 2012-12-13
2 A 2012-12-14
3 B 2012-12-12
4 B 2012-12-13
5 B 2012-12-14
6 C 2012-12-12
7 C 2012-12-13
8 C 2012-12-14
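On pandas 1.2 or newer, the helper key isn't needed; a cross merge produces the same cartesian product directly:
print(pd.merge(players, dates, how='cross')[['PlayerID', 'Date']])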
I am trying to use a loop function to create a matrix of whether a product was seen in a particular week.
Each row in the df (representing a product) has a close_date (the date the product closed) and a week_diff (the number of weeks the product was listed).
import pandas
mydata = [{'subid' : 'A', 'Close_date_wk': 25, 'week_diff':3},
{'subid' : 'B', 'Close_date_wk': 26, 'week_diff':2},
{'subid' : 'C', 'Close_date_wk': 27, 'week_diff':2},]
df = pandas.DataFrame(mydata)
My goal is to see how many alternative products were listed alongside each product in each date range.
I have set up the following loop:
for index, row in df.iterrows():
    i = 0
    max_range = row['Close_date_wk']
    min_range = int(row['Close_date_wk'] - row['week_diff'])
    for i in range(min_range, max_range):
        col_head = 'job_week_' + str(i)
        row[col_head] = 1
Can you please help explain why the "row[col_head] = 1" line neither adds a column nor adds a value to that column for that row?
For example, if:
row A has date range 1,2,3
row B has date range 2,3
row C has date range 3,4,5
then ideally I would like to end up with
row A has 0 alternative products in week 1
1 alternative products in week 2
2 alternative products in week 3
row B has 1 alternative products in week 2
2 alternative products in week 3
etc.
You can't mutate the df using row here to add a new column; you'd either refer to the original df or use .loc, .iloc, or .at (the older .ix is deprecated). Example:
In [29]:
df = pd.DataFrame(columns=list('abc'), data = np.random.randn(5,3))
df
Out[29]:
a b c
0 -1.525011 0.778190 -1.010391
1 0.619824 0.790439 -0.692568
2 1.272323 1.620728 0.192169
3 0.193523 0.070921 1.067544
4 0.057110 -1.007442 1.706704
In [30]:
for index, row in df.iterrows():
    df.loc[index, 'd'] = np.random.randint(0, 10)
df
Out[30]:
a b c d
0 -1.525011 0.778190 -1.010391 9
1 0.619824 0.790439 -0.692568 9
2 1.272323 1.620728 0.192169 1
3 0.193523 0.070921 1.067544 0
4 0.057110 -1.007442 1.706704 9
You can modify existing rows:
In [31]:
# reset the df by slicing
df = df[list('abc')]
for index, row in df.iterrows():
    row['b'] = np.random.randint(0, 10)
df
Out[31]:
a b c
0 -1.525011 8 -1.010391
1 0.619824 2 -0.692568
2 1.272323 8 0.192169
3 0.193523 2 1.067544
4 0.057110 3 1.706704
But adding a new column using row won't work:
In [35]:
df = df[list('abc')]
for index, row in df.iterrows():
    row['d'] = np.random.randint(0, 10)
df
Out[35]:
a b c
0 -1.525011 8 -1.010391
1 0.619824 2 -0.692568
2 1.272323 8 0.192169
3 0.193523 2 1.067544
4 0.057110 3 1.706704
Instead of row[col_head] = 1, please try the line below:
df.at[index, col_head] = 1
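Applied to the loop from the question, a sketch would look like this (writing through df rather than through the row copy; .at supports setting with enlargement, so the new job_week_* columns are created on the fly):
for index, row in df.iterrows():
    min_range = int(row['Close_date_wk'] - row['week_diff'])
    max_range = row['Close_date_wk']
    for i in range(min_range, max_range):
        df.at[index, 'job_week_' + str(i)] = 1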