Merging data frames based on value in row and column name - python

I work with financial data and am trying to merge two pandas data frames.
In the first data frame I have the company name, ticker code, and date.
Date Ticker Company
0 2020-01-15 CHR.CO Chr. Hansen
1 2020-01-15 PNDORA.CO Pandora A/S
In my second df, I have a date column and the closing prices of several stocks on those dates.
Date CHR.CO COLO-B.CO DANSKE.CO PNDORA.CO VWS.CO
0 2020-01-15 89.5 89.5 187.39 54.4 552.0
1 2020-01-16 90 88.0 184.61 55.2 550.0
How can I merge these two data frames so that the closing stock price ends up in the first dataframe?
Here's the desired output:
Date Ticker Company Close_price
0 2020-01-15 CHR.CO Chr. Hansen 89.5
1 2020-01-15 PNDORA.CO Pandora A/S 54.4
Using the line below I can merge the two dataframes on date, but then I also get all the tickers and close prices for all companies.
full = new_df.merge(stocks_close, on = "Date")

Add DataFrame.melt before the merge and specify both columns ["Date", 'Ticker'] in the on parameter:
df = stocks_close.melt(id_vars='Date', var_name='Ticker', value_name='Close_price')
full = new_df.merge(df, on = ["Date",'Ticker'])
print (full)
Date Ticker Company Close_price
0 2020-01-15 CHR.CO Chr. Hansen 89.5
1 2020-01-15 PNDORA.CO Pandora A/S 54.4
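For reference, here is a self-contained version of the above, with the sample frames from the question typed in by hand:
import pandas as pd

new_df = pd.DataFrame({'Date': ['2020-01-15', '2020-01-15'],
                       'Ticker': ['CHR.CO', 'PNDORA.CO'],
                       'Company': ['Chr. Hansen', 'Pandora A/S']})
stocks_close = pd.DataFrame({'Date': ['2020-01-15', '2020-01-16'],
                             'CHR.CO': [89.5, 90.0],
                             'COLO-B.CO': [89.5, 88.0],
                             'DANSKE.CO': [187.39, 184.61],
                             'PNDORA.CO': [54.4, 55.2],
                             'VWS.CO': [552.0, 550.0]})

# wide -> long: one row per (Date, Ticker) pair, so the two-key merge can match
long_prices = stocks_close.melt(id_vars='Date', var_name='Ticker', value_name='Close_price')
full = new_df.merge(long_prices, on=['Date', 'Ticker'])
print(full)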


Join in Pandas dataframe (Auto conversion from date to date-time?)

I tried to use the join function to combine the close prices of all 500 stocks over a 5-year period (2013-02-08 to 2018-02-07), where each column represents a stock and the index of the dataframe is the dates.
But the join function in pandas seems to automatically change the date format (index), rendering every entry in the combined dataframe NaN.
The code to import and preview the file:
import pandas as pd
df= pd.read_csv('all_stocks_5yr.csv')
df.head()
(screenshot of df.head(): https://i.stack.imgur.com/29Wq4.png)
# df.info()
df['Name'].unique().shape  # there are 505 stock names in total
dates = pd.date_range(df['date'].min(), df['date'].max())  # the full date range
Single out the close prices:
close_prices = pd.DataFrame(index=dates)  # make the dates the index
# close_prices.head()
symbols = df['Name'].unique()  # the stock names as an array
So I tried to test the result for each stock using the first 3 stocks:
i = 1
for symbol in symbols:
    df_sym = df[df['Name'] == symbol]
    df_tmp = pd.DataFrame(data=df_sym['close'].to_numpy(), index=df_sym['date'], columns=[symbol])
    print(df_tmp)  # print the temporary dataframes
    i += 1
    if i > 3: break
And the results were as expected: a dataframe indexed by date with only one stock:
AAL
date
2013-02-08 14.75
2013-02-11 14.46
2013-02-12 14.27
2013-02-13 14.66
2013-02-14 13.99
... ...
2018-02-01 53.88
2018-02-02 52.10
2018-02-05 49.76
2018-02-06 51.18
2018-02-07 51.40
[1259 rows x 1 columns]
AAPL
date
2013-02-08 67.8542
2013-02-11 68.5614
2013-02-12 66.8428
2013-02-13 66.7156
2013-02-14 66.6556
... ...
2018-02-01 167.7800
2018-02-02 160.5000
2018-02-05 156.4900
2018-02-06 163.0300
2018-02-07 159.5400
[1259 rows x 1 columns]
...
Now here's the part I find very confusing: I checked what happens when combining the first 3 stock dataframes using the join function, with index 'date':
i = 1
for symbol in symbols:
    df_sym = df[df['Name'] == symbol]
    df_tmp = pd.DataFrame(data=df_sym['close'].to_numpy(), index=df_sym['date'], columns=[symbol])
    close_prices = close_prices.join(df_tmp)
    i += 1
    if i > 3: break
close_prices.head()
(screenshot of close_prices.head(): https://i.stack.imgur.com/MqVDo.png)
Somehow the index changed from date to date-time format, so naturally the join function finds that none of the entries match the new index and puts NaN in every single entry.
What caused the date to change to date-time?
You can use pivot:
df = pd.read_csv('all_stocks_5yr.csv')
out = df.pivot(index='date', columns='Name', values='close')
Output:
>>> out.iloc[:, :8]
Name A AAL AAP AAPL ABBV ABC ABT ACN
date
2013-02-08 45.08 14.75 78.90 67.8542 36.25 46.89 34.41 73.31
2013-02-11 44.60 14.46 78.39 68.5614 35.85 46.76 34.26 73.07
2013-02-12 44.62 14.27 78.60 66.8428 35.42 46.96 34.30 73.37
2013-02-13 44.75 14.66 78.97 66.7156 35.27 46.64 34.46 73.56
2013-02-14 44.58 13.99 78.84 66.6556 36.57 46.77 34.70 73.13
... ... ... ... ... ... ... ... ...
2018-02-01 72.83 53.88 117.29 167.7800 116.34 99.29 62.18 160.46
2018-02-02 71.25 52.10 113.93 160.5000 115.17 96.02 61.69 156.90
2018-02-05 68.22 49.76 109.86 156.4900 109.51 91.90 58.73 151.83
2018-02-06 68.45 51.18 112.20 163.0300 111.20 91.54 58.86 154.69
2018-02-07 68.06 51.40 109.93 159.5400 113.62 94.22 58.67 155.15
[1259 rows x 8 columns]
Source 'all_stocks_5yr.csv': Kaggle
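As for the original question of why the join produced only NaN: close_prices was built with index=dates, a DatetimeIndex created by pd.date_range, while df_tmp is indexed by the raw date strings read from the CSV, so none of the index labels match. A minimal sketch of one possible fix, converting the strings before joining:
# convert the per-stock frame's string index to datetimes so it aligns with close_prices
df_tmp.index = pd.to_datetime(df_tmp.index)
close_prices = close_prices.join(df_tmp)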

Pandas: compute average and standard deviation by clock time

I have a DataFrame like this:
date time value
0 2019-04-18 07:00:10 100.8
1 2019-04-18 07:00:20 95.6
2 2019-04-18 07:00:30 87.6
3 2019-04-18 07:00:40 94.2
The DataFrame contains values recorded every 10 seconds for the entire year 2019. I need to calculate the standard deviation and mean of value for each hour of each date, and create two new columns for them. I first tried separating the hour for each value like:
df["hour"] = df["time"].astype(str).str[:2]
Then I have tried to calculate standard deviation by:
df["std"] = df.groupby("hour").median().index.get_level_values('value').stack().std()
But that doesn't work; could I have some advice on the problem?
We can split the time column on the delimiter :, take the hour component with str[0], and finally group the dataframe on date along with the hour component, aggregating column value with mean and std:
hr = df['time'].str.split(':', n=1).str[0]
df.groupby(['date', hr])['value'].agg(['mean', 'std'])
If you want to broadcast the aggregated values to original dataframe, then we need to use transform instead of agg:
g = df.groupby(['date', df['time'].str.split(':', n=1).str[0]])['value']
df['mean'], df['std'] = g.transform('mean'), g.transform('std')
date time value mean std
0 2019-04-18 07:00:10 100.8 94.55 5.434151
1 2019-04-18 07:00:20 95.6 94.55 5.434151
2 2019-04-18 07:00:30 87.6 94.55 5.434151
3 2019-04-18 07:00:40 94.2 94.55 5.434151
I have synthesized some data. The approach: start by generating a true datetime column, groupby() hour, use describe() to get mean & std, and merge() back to the original data frame.
d = pd.date_range("1-Jan-2019", "28-Feb-2019", freq="10S")
df = pd.DataFrame({"datetime":d, "value":np.random.uniform(70,90,len(d))})
df = df.assign(date=df.datetime.dt.strftime("%Y-%m-%d"),
               time=df.datetime.dt.strftime("%H:%M:%S"))
# create a datetime column - better than manipulating strings
df["datetime"] = pd.to_datetime(df.date + " " + df.time)
# calc mean & std by hour
dfh = (df.groupby(df.datetime.dt.hour, as_index=False)
         .apply(lambda dfa: dfa.describe().T.loc[:, ["mean", "std"]].reset_index(drop=True))
         .droplevel(1)
       )
# merge mean & std by hour back
df.merge(dfh, left_on=df.datetime.dt.hour, right_index=True).drop(columns="key_0")
datetime value mean std
0 2019-01-01 00:00:00 86.014209 80.043364 5.777724
1 2019-01-01 00:00:10 77.241141 80.043364 5.777724
2 2019-01-01 00:00:20 71.650739 80.043364 5.777724
3 2019-01-01 00:00:30 71.066332 80.043364 5.777724
4 2019-01-01 00:00:40 77.203291 80.043364 5.777724
... ... ... ... ...
3144955 2019-12-30 23:59:10 89.577237 80.009751 5.773007
3144956 2019-12-30 23:59:20 82.154883 80.009751 5.773007
3144957 2019-12-30 23:59:30 82.131952 80.009751 5.773007
3144958 2019-12-30 23:59:40 85.346724 80.009751 5.773007
3144959 2019-12-30 23:59:50 78.122761 80.009751 5.773007
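If the statistics are needed per calendar date and hour (as the original question asks) rather than per hour-of-day across the whole period, the same idea can be grouped on both keys; a small sketch reusing the datetime column built above:
# group by (calendar date, hour) and broadcast the mean/std back to every row
g = df.groupby([df.datetime.dt.date, df.datetime.dt.hour])["value"]
df["mean"], df["std"] = g.transform("mean"), g.transform("std")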

Pandas groupby keep rows according to ranking

I have this dataframe:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 S2A-SR
2 2020-03-01 0.657453 L8-SR
3 2020-03-01 0.603989 S2B-SR
4 2020-03-11 0.717264 S2B-SR
5 2020-04-02 0.737118 L8-SR
I would like to group by the date column and keep rows according to a ranking/importance of my choosing from the source column. For example, my ranking is L8-SR>S2B-SR>GP6_r, meaning that for all rows with the same date, keep the row where source==L8-SR; if none contain L8-SR, then keep the row where source==S2B-SR, etc. How can I accomplish that with pandas groupby?
Output should look like this:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 S2A-SR
2 2020-03-01 0.657453 L8-SR
3 2020-03-11 0.717264 S2B-SR
4 2020-04-02 0.737118 L8-SR
Let's try an ordered category dtype with drop_duplicates: once source is an ordered categorical, sorting by ['date', 'source'] puts the highest-priority source first within each date, and drop_duplicates(['date']) then keeps only that first row:
orders = ['L8-SR', 'S2B-SR', 'GP6_r']
df.source = df.source.astype('category')
# assign the result back (set_categories is not in-place); keep sources that are not
# in orders (e.g. 'S2A-SR') as trailing categories instead of letting them become NaN
extra = [c for c in df.source.cat.categories if c not in orders]
df.source = df.source.cat.set_categories(orders + extra, ordered=True)
df.sort_values(['date', 'source']).drop_duplicates(['date'])
Output:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 S2A-SR
2 2020-03-01 0.657453 L8-SR
4 2020-03-11 0.717264 S2B-SR
5 2020-04-02 0.737118 L8-SR
Try the code below for the groupby operation. For ordering after this operation you can use sort_values:
# Import pandas library
import pandas as pd
# Declare a data dictionary containing the data from the table above
pandasdata_dict = {'date':['2020-02-14', '2020-02-15', '2020-03-01', '2020-03-01', '2020-03-11', '2020-04-02'],
'value':[0.438767, 0.422867, 0.657453, 0.603989, 0.717264, 0.737118],
'source':['L8-SR', 'S2A-SR', 'L8-SR', 'S2B-SR', 'S2B-SR', 'L8-SR']}
# Convert above dictionary data to the data frame
df = pd.DataFrame(pandasdata_dict)
# display data frame
df
# Convert date field to datetime
df["date"] = pd.to_datetime(df["date"])
# Once the conversion is done, group the data frame by the date field
df.groupby([df['date'].dt.date])
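One way to continue from that groupby object is to map each source to a numeric rank and keep the best-ranked row per date; a sketch (the rank_map name and the fillna default for unlisted sources are illustrative, not from the question):
# rank the sources; anything not listed (e.g. 'S2A-SR') ranks last
rank_map = {'L8-SR': 0, 'S2B-SR': 1, 'GP6_r': 2}
df['source_rank'] = df['source'].map(rank_map).fillna(len(rank_map))
# keep the row with the best (lowest) rank within each date
best = df.loc[df.groupby(df['date'].dt.date)['source_rank'].idxmin()]
best = best.drop(columns='source_rank').reset_index(drop=True)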

How to retrieve the 3 months from each quarter, hence increasing the df row count by 3 times - pandas, Python

I have quite a silly task but haven't found a way to do it.
I have a huge df; here is the head:
Deal Date Period Name Price Quarter Start Quarter End
0 2011-11-01 2011-Q4 30.76 2011-10-01 2011-12-31 23:59:59.999999999
1 2011-11-01 2012-Q1 30.95 2012-01-01 2012-03-31 23:59:59.999999999
2 2011-11-01 2012-Q2 30.67 2012-04-01 2012-06-30 23:59:59.999999999
3 2011-11-01 2012-Q3 29.87 2012-07-01 2012-09-30 23:59:59.999999999
4 2011-11-01 2012-Q4 29.49 2012-10-01 2012-12-31 23:59:59.999999999
I wish to have an additional column which shows "month", so the above 5 rows will become 15 rows; for example, the initial row 0 will appear three times:
Deal Date Period Name Price Quarter Start Quarter End Month
0 2011-11-01 2011-Q4 30.76 2011-10-01 2011-12-31 23:59:59.999999999 10
1 2011-11-01 2011-Q4 30.76 2011-10-01 2011-12-31 23:59:59.999999999 11
2 2011-11-01 2011-Q4 30.76 2011-10-01 2011-12-31 23:59:59.999999999 12
as these 3 months are included in Q4, and similarly for the rest of the rows.
Is there an easy way to achieve this? Thanks
You can extract the quarter value from the period, then perform pandas.merge with a 12-row dataframe containing the quarter -> month mapping.
Simplified example code:
import pandas as pd
df_test = pd.DataFrame({'quart':[1,2,3,4,1,2], 'val': ['a','b','c','d','e','f']})
df_quart_to_month = pd.DataFrame({'quart':[1,1,1,2,2,2,3,3,3,4,4,4], 'month': [1,2,3,4,5,6,7,8,9,10,11,12]})
df_with_months = df_test.merge(df_quart_to_month, on='quart', how='outer')
If you want to keep the original order:
df_with_months = df_test.reset_index().merge(df_quart_to_month, on='quart', how='outer').set_index('index')
df_sorted = df_with_months.sort_values(['index', 'month'], ascending=[True, True])
Alternatively you could split your dataset into 4 DataFrames based on their quarter, copy each sub-dataframe twice and add the corresponding month. Then concatenate the resulting 12 sub-dataframes together.
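For reference, the merge approach applied to the columns from the question might look like this (a sketch; it assumes the quarter number can be parsed from Period Name values such as '2011-Q4'):
# quarter -> month lookup table
df_quart_to_month = pd.DataFrame({'quart': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
                                  'month': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
# extract the quarter number, e.g. '2011-Q4' -> 4
df['quart'] = df['Period Name'].str.split('-Q').str[1].astype(int)
out = (df.reset_index()
         .merge(df_quart_to_month, on='quart')
         .sort_values(['index', 'month'])      # restore the original row order
         .drop(columns=['index', 'quart'])
         .rename(columns={'month': 'Month'})
         .reset_index(drop=True))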

Sorting values for every level 1 in pandas multiindex

I have a dataframe with a MultiIndex; the first level is a company_ID and the second level is a timestamp. How can I get a rank of all companies depending on their scores, every month?
Score
company_idx timestamp
10006 2010-01-31 69.875394
2010-11-30 73.640693
2010-12-31 73.286248
2011-01-31 73.660052
2011-02-28 74.615564
2011-03-31 73.535187
2011-04-30 72.491390
2012-01-31 72.162768
2012-02-29 61.637952
2012-03-31 59.445419
2012-04-30 25.685615
2012-05-31 8.047693
2012-06-30 58.341200
...
9981 2016-12-31 51.011261
2018-05-31 54.462832
2018-06-30 57.126250
2018-07-31 54.695835
2018-08-31 63.758145
2018-09-30 63.255583
2018-10-31 62.069697
2018-11-30 62.795650
2018-12-31 63.045329
2019-01-31 60.276990
2019-02-28 56.666379
2019-03-31 57.903213
2019-04-30 57.558973
2019-05-31 52.260287
I've tried to do:
df2 = df.sort_index(by='Score', ascending=False)
But it's not getting me what I want.
Would you be able to help? I'm quite new to multilevel dataframes.
Many thanks!
You should swap the index levels to have the month first, then sort by timestamp ascending and Score descending:
df.index = df.index.swaplevel()
df.sort_values(['timestamp', 'Score'], ascending=[True, False], inplace=True)
It does not give an interesting result with your sample values, because only one company has a Score value for any one month.
To extract the values for one month, you can use df.xs(month_value, level=0) that will drop one level in the multi-index, or df.xs(month_value, level=0, drop_level=False) that will keep it.
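If an explicit rank number is also wanted, a short sketch on top of this (assuming the index levels are named company_idx and timestamp as in the printout above):
# rank companies within each month, highest Score first
df['Rank'] = (df.groupby(level='timestamp')['Score']
                .rank(ascending=False, method='dense'))
rank(ascending=False) gives rank 1 to the highest Score in each month; method='dense' keeps the ranks consecutive when scores tie.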
