consolidating duplicate index values into single row (single index) pandas [duplicate] - python

This question already has answers here:
Pandas pivot table with multiple columns at once
(2 answers)
Closed 5 years ago.
I have a df like the one below and am looking to compress duplicate index values into a single row:
            ask   bid
date
2011-01-03  0.32  0.30
2011-01-03  1.03  1.01
2011-01-03  4.16  4.11
and the expected output is (column names are not important for now; I will set them manually):
            ask   bid   ask1  bid1  ask2  bid2
date
2011-01-03  0.32  0.30  1.03  1.01  4.16  4.11

Something like the below can be done to get the output you're looking for:
import pandas as pd
from functools import reduce

df_1 = pd.DataFrame({'date': ['2011-01-03', '2011-01-03', '2011-01-03'],
                     'ask': [0.31, 1.05, 4.17],
                     'bid': [0.40, 1.41, 5.11]})
dfs = list()
while df_1['date'].duplicated().any():
    # keep the first row per date, then remove those rows from df_1
    b = df_1.drop_duplicates(subset='date', keep='first')
    dfs.append(b)
    df_1 = df_1.merge(b, how='outer', on=['date', 'ask', 'bid'], indicator=True)
    df_1 = df_1[df_1['_merge'] == 'left_only']
    del df_1['_merge']
dfs.append(df_1)
df_final = reduce(lambda left, right: pd.merge(left, right, on='date', suffixes=('_1', '_2')), dfs)
Input:
    ask   bid        date
0  0.31  0.40  2011-01-03
1  1.05  1.41  2011-01-03
2  4.17  5.11  2011-01-03
Output:
   ask_1  bid_1        date  ask_2  bid_2   ask   bid
0   0.31    0.4  2011-01-03   1.05   1.41  4.17  5.11
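A shorter route to the same wide layout (a sketch, not from the original thread, assuming the long date/ask/bid frame shown above) is to number the duplicates per date with groupby().cumcount() and then unstack:
import pandas as pd

df = pd.DataFrame({'date': ['2011-01-03'] * 3,
                   'ask': [0.32, 1.03, 4.16],
                   'bid': [0.30, 1.01, 4.11]})
df['n'] = df.groupby('date').cumcount()          # 0, 1, 2 within each date
wide = df.set_index(['date', 'n']).unstack('n')  # columns become (ask/bid, n) pairs
wide = wide.sort_index(axis=1, level=[1, 0])     # interleave: ask, bid, ask1, bid1, ...
wide.columns = [f'{col}{n}' if n else col for col, n in wide.columns]
print(wide)
This produces the ask/bid, ask1/bid1, ask2/bid2 layout from the question in one pass, without the while loop.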

Related

How to loop over unique dates in a pandas dataframe producing new dataframes in each iteration?

I have a dataframe like below and need to create (1) a new dataframe for each unique date and (2) create a new global variable with the date of the new dataframe as the value. This needs to be in a loop.
Using the dataframe below, I need to iterate through 3 new dataframes, one for each date value (202107, 202108, and 202109). This loop occurs within an existing function that then uses the new dataframe and its respective global variable of each iteration in further calculations. For example, the first iteration would yield a new dataframe consisting of the first two rows of the below dataframe and a value for the new global variable of "202107." What is the most straightforward way of doing this?
  Date  Col1  Col2
202107  1.23  6.72
202107  1.56  2.54
202108  1.78  7.54
202108  1.53  7.43
202108  1.58  2.54
202109  1.09  2.43
202109  1.07  5.32
Loop over the results of .groupby:
for _, new_df in df.groupby("Date"):
    print(new_df)
    print("-" * 80)
Prints:
Date Col1 Col2
0 202107 1.23 6.72
1 202107 1.56 2.54
--------------------------------------------------------------------------------
Date Col1 Col2
2 202108 1.78 7.54
3 202108 1.53 7.43
4 202108 1.58 2.54
--------------------------------------------------------------------------------
Date Col1 Col2
5 202109 1.09 2.43
6 202109 1.07 5.32
--------------------------------------------------------------------------------
Then you can store each new_df in a list or a dictionary and use it afterwards.
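For instance, a minimal sketch of that idea (not from the original answer; the key type follows whatever dtype the Date column has):
dfs_by_date = {date: group for date, group in df.groupby("Date")}
print(dfs_by_date[202107])  # the two rows whose Date is 202107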
You can extract the unique date values with the .unique() method, and then store your new dataframes and dates in a dict to access them easily, like this:
unique_dates = init_df.Date.unique()
df_by_date = {
    str(date): init_df[init_df['Date'] == date] for date in unique_dates
}
You then use the dict like this:
for date in unique_dates:
    print(date, ': \n', df_by_date[str(date)])
Output:
202107 :
Date Col1 Col2
0 202107 1.23 6.72
1 202107 1.56 2.54
202108 :
Date Col1 Col2
2 202108 1.78 7.54
3 202108 1.53 7.43
4 202108 1.58 2.54
202109 :
Date Col1 Col2
5 202109 1.09 2.43
6 202109 1.07 5.32

Pandas pivots index as columns using groupby-apply

I have come across some strange behavior in Pandas groupby-apply that I am trying to figure out.
Take the following example dataframe:
import pandas as pd
import numpy as np
index = range(1, 11)
groups = ["A", "B"]
idx = pd.MultiIndex.from_product([index, groups], names = ["index", "group"])
np.random.seed(12)
df = pd.DataFrame({"val": np.random.normal(size=len(idx))}, index=idx).reset_index()
print(df.tail().round(2))
index group val
15 8 B -0.12
16 9 A 1.01
17 9 B -0.91
18 10 A -1.03
19 10 B 1.21
And using this framework (which allows me to execute any arbitrary function within a groupby-apply):
def add_two(x):
    return x + 2

def pd_groupby_apply(df, out_name, in_name, group_name, index_name, function):
    def apply_func(df):
        if index_name is not None:
            df = df.set_index(index_name).sort_index()
        df[out_name] = function(df[in_name].values)
        return df[out_name]
    return df.groupby(group_name).apply(apply_func)
Whenever I call pd_groupby_apply with the following inputs, I get a pivoted DataFrame:
df_out1 = pd_groupby_apply(df=df,
                           out_name="test",
                           in_name="val",
                           group_name="group",
                           index_name="index",
                           function=add_two)
print(df_out1.head().round(2))
index 1 2 3 4 5 6 7 8 9 10
group
A 2.47 2.24 2.75 2.01 1.19 1.40 3.10 3.34 3.01 0.97
B 1.32 0.30 0.47 1.88 4.87 2.47 0.78 1.88 1.09 3.21
However, as soon as my dataframe does not contain full group-index pairs, and I call my pd_groupby_apply function again, I do receive my dataframe back in the way that I want (i.e. not pivoted):
df_notfull = df.iloc[:-1]
df_out2 = pd_groupby_apply(df=df_notfull,
                           out_name="test",
                           in_name="val",
                           group_name="group",
                           index_name="index",
                           function=add_two)
print(df_out2.head().round(2))
group index
A 1 2.47
2 2.24
3 2.75
4 2.01
5 1.19
Why is this? And more importantly, how can I prevent Pandas from pivoting my dataframe when I have full index-group pairs in my dataframe?
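The excerpt does not include an answer, but one common workaround is sketched below (an assumption on my part, not from the thread): when every group returns a Series with an identical index, groupby.apply unstacks the result into a wide frame, so stacking it back restores the long shape in either case. It reuses the df and pd_groupby_apply defined above.
out = pd_groupby_apply(df=df, out_name="test", in_name="val",
                       group_name="group", index_name="index", function=add_two)
if isinstance(out, pd.DataFrame):  # full group-index pairs -> pandas pivoted the result
    out = out.stack()              # back to a (group, index) MultiIndex Series
print(out.head().round(2))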

Extract values based on timestamps diff conditions in pandas

I have a dataset with timestamps and values for each ID. The number of rows, for each ID, is different and I need a double for loop like this:
for ids in IDs:
    for index in Date:
Now, I would like to find the difference between timestamps, for each ID, in these ways:
values within 2 days
values within 7 days
In particular, for each ID:
if, from the first value, within the next 2 days there is an increment of at least 0.3 from the first value
OR
if, from the first value, within the next 7 days there is a value equal to 1.5 times the first value
I store that ID in one dataframe; otherwise I store that ID in another dataframe.
Now, my code is the following:
yesDf = pd.DataFrame()
noDf = pd.DataFrame()
for ids in IDs:
    for index in Date:
        if ((df.iloc[Date - 1]['Date'] - df.iloc[0]['Date']).days <= 2):
            if (df.iloc[index]['Val'] - df.iloc[index - 1]['Val'] >= 0.3):
                yesDf += IDs['ID']
            noDf += IDs['ID']
        if ((df.iloc[Date - 1]['Date'] - df.iloc[0]['Date']).days <= 7):
            if (df.iloc[Date - 1]['Val'] >= df.iloc[index]['Val'] * 1.5):
                yesDf += IDs['ID']
            noDf += IDs['ID']
print(yesDf)
print(noDf)
I get these errors:
TypeError: incompatible type for a datetime/timedelta operation [sub]
and
pandas.errors.NullFrequencyError: Cannot shift with no freq
How can I solve this problem?
Thank you
Edit: my dataframe
Val ID Date
2199 0.90 0000.0 2017-12-26 11:00:01
2201 1.35 0001.0 2017-12-26 11:00:01
63540 0.72 0001.0 2018-08-10 11:53:01
68425 0.86 0001.0 2018-10-14 08:33:01
42444 0.99 0002.0 2018-02-01 09:25:53
41474 1.05 0002.0 2018-04-01 08:00:04
42148 1.19 0002.0 2018-07-01 08:50:00
24291 1.01 0004.0 2017-01-01 08:12:02
For example: for ID 0001.0 the first value is 1.35, and in the next 2 days I don't have an increment of at least 0.3 from the start value, and in the next 7 days I don't have a value 1.5 times the first value, so it goes in the noDf dataframe.
Also the dtypes:
Val float64
ID object
Date datetime64[ns]
Surname object
Name object
dtype: object
Edit: after the modified code, the results are:
Val ID Date Date_diff_cumsum Val_diff
24719 2.08 0118.0 2017-01-15 08:16:05 1.0 0.36
24847 2.17 0118.0 2017-01-16 07:23:04 1.0 0.45
25233 2.45 0118.0 2017-01-17 08:21:03 2.0 0.73
24749 2.95 0118.0 2017-01-18 09:49:09 3.0 1.23
17042 1.78 0129.0 2018-02-05 22:48:17 0.0 0.35
And it is correct. Now I only need to add the single ID into a dataframe
This answer should work assuming you start from the first value of an ID, so the first timestamp.
First, I added the 'Date_diff_cumsum' column, which stores the difference in days between the first date for the ID and the row's date:
df['Date_diff_cumsum'] = df.groupby('ID').Date.diff().dt.days
df['Date_diff_cumsum'] = df.groupby('ID').Date_diff_cumsum.cumsum().fillna(0)
Then, I add the 'Value_diff' column, which is the difference between the first value for an ID and the row's value:
df['Val_diff'] = df.groupby('ID')['Val'].transform(lambda x:x-x.iloc[0])
Here is what I get after adding the columns for your sample DataFrame:
Val ID Date Date_diff_cumsum Val_diff
0 0.90 0.0 2017-12-26 11:00:01 0.0 0.00
1 1.35 1.0 2017-12-26 11:00:01 0.0 0.00
2 0.72 1.0 2018-08-10 11:53:01 227.0 -0.63
3 0.86 1.0 2018-10-14 08:33:01 291.0 -0.49
4 0.99 2.0 2018-02-01 09:25:53 0.0 0.00
5 1.05 2.0 2018-04-01 08:00:04 58.0 0.06
6 1.19 2.0 2018-07-01 08:50:00 149.0 0.20
7 1.01 4.0 2017-01-01 08:12:02 0.0 0.00
And finally, return the rows which satisfy the conditions in your question:
df[((df['Val_diff']>=0.3) & (df['Date_diff_cumsum']<=2)) |
((df['Val'] >= 1.5*(df['Val']-df['Val_diff'])) & (df['Date_diff_cumsum']<=7))]
In this case, it will return no rows.
yesDf = df[((df['Val_diff'] >= 0.3) & (df['Date_diff_cumsum'] <= 2)) |
           ((df['Val'] >= 1.5*(df['Val'] - df['Val_diff'])) & (df['Date_diff_cumsum'] <= 7))].ID.drop_duplicates().to_frame()
noDf = df[~(((df['Val_diff'] >= 0.3) & (df['Date_diff_cumsum'] <= 2)) |
            ((df['Val'] >= 1.5*(df['Val'] - df['Val_diff'])) & (df['Date_diff_cumsum'] <= 7)))].ID.drop_duplicates().to_frame()
yesDf contains the IDs that satisfy the condition, and noDf the ones that don't.
I hope this answers your question!
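A small refactor sketch (not part of the original answer, reusing the Val_diff and Date_diff_cumsum columns built above): naming the combined condition once keeps yesDf and noDf exact complements and avoids the operator-precedence pitfall with ~.
within_2d = (df['Val_diff'] >= 0.3) & (df['Date_diff_cumsum'] <= 2)
within_7d = (df['Val'] >= 1.5 * (df['Val'] - df['Val_diff'])) & (df['Date_diff_cumsum'] <= 7)
mask = within_2d | within_7d            # rows satisfying either condition
yesDf = df.loc[mask, 'ID'].drop_duplicates().to_frame()
noDf = df.loc[~mask, 'ID'].drop_duplicates().to_frame()
Note that, as in the original, an ID can still appear in both frames if only some of its rows satisfy the condition.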

matplotlib US Treasury yield curve

I'm currently trying to build a dataframe consisting of daily US Treasury rates. As you can see, pandas automatically sorts the columns alphabetically, which is clearly not what I want. Here's some of my code; a small example is enough to show the problem I'm having.
import pandas as pd
import quandl
import matplotlib.pyplot as plt

One_Month = quandl.get('FRED/DGS1MO')
# ^^ repeated for all the other rates (Three_Month, One_Year, ...)
Yield_Curve = pd.DataFrame({'1m': One_Month['Value'], '3m': Three_Month['Value'], '1yr': One_Year['Value']})
Yield_Curve.loc['2017-06-22'].plot()
plt.show()
Yield_Curve.tail()
1m 1yr 3m
Date
2017-06-16 0.85 1.21 1.03
2017-06-19 0.85 1.22 1.02
2017-06-20 0.88 1.22 1.01
2017-06-21 0.85 1.22 0.99
2017-06-22 0.80 1.22 0.96
As I said, I only added three rates to the dataframe but obviously the two year, three year, and five year rates will cause a problem as well.
I did some searching and saw this post:
Plotting Treasury Yield Curve, how to overlay two yield curves using matplotlib
While using the code in the last post clearly works, I'd rather be able to keep my current datasets (One_Month, Three_Month....) to do this since I use them for other analyses as well.
Question: Is there a way for me to lock the column order?
Thanks for your help!
If you're looking to define the column ordering, you can use reindex_axis():
df = df.reindex_axis(labels=['1m', '3m', '1yr'], axis=1)
df
1m 3m 1yr
Date
2017-06-16 0.85 1.03 1.21
2017-06-19 0.85 1.02 1.22
2017-06-20 0.88 1.01 1.22
2017-06-21 0.85 0.99 1.22
2017-06-22 0.80 0.96 1.22
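One caveat: reindex_axis was deprecated in later pandas releases and has since been removed, so on a current install the same reordering (a sketch assuming the 1m/3m/1yr frame above) is done with reindex or plain column selection:
df = df.reindex(columns=['1m', '3m', '1yr'])
# or simply
df = df[['1m', '3m', '1yr']]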
With pandas-datareader you can specify the symbols as one list. And in addition to using reindex_axis as suggested by @Andrew L, you can also just pass a list of ordered columns inside double brackets (see the final line below) to specify the column order.
from pandas_datareader.data import DataReader as dr
syms = ['DGS10', 'DGS5', 'DGS2', 'DGS1MO', 'DGS3MO']
yc = dr(syms, 'fred') # could specify start date with start param here
names = dict(zip(syms, ['10yr', '5yr', '2yr', '1m', '3m']))
yc = yc.rename(columns=names)
yc = yc[['1m', '3m', '2yr', '5yr', '10yr']]
print(yc)
1m 3m 2yr 5yr 10yr
DATE
2010-01-01 NaN NaN NaN NaN NaN
2010-01-04 0.05 0.08 1.09 2.65 3.85
2010-01-05 0.03 0.07 1.01 2.56 3.77
2010-01-06 0.03 0.06 1.01 2.60 3.85
2010-01-07 0.02 0.05 1.03 2.62 3.85
... ... ... ... ...
2017-06-16 0.85 1.03 1.32 1.75 2.16
2017-06-19 0.85 1.02 1.36 1.80 2.19
2017-06-20 0.88 1.01 1.36 1.77 2.16
2017-06-21 0.85 0.99 1.36 1.78 2.16
2017-06-22 0.80 0.96 1.34 1.76 2.15
yc.loc['2016-06-01'].plot(label='Jun 1')
yc.loc['2016-06-02'].plot(label='Jun 2')
plt.legend(loc=0)
If you don't want to change the original column order but still need the columns sorted according to finance notation, you can build your own customized column order like below.
fi_col = df.columns.str.extract(r'(\d)(\D+)', expand=True).sort_values([1, 0]).reset_index(drop=True)
fi_col = fi_col[0] + fi_col[1]
print(df[fi_col])
1m 3m 1yr
Date
2017-06-16 0.85 1.03 1.21
2017-06-19 0.85 1.02 1.22
2017-06-20 0.88 1.01 1.22
2017-06-21 0.85 0.99 1.22
2017-06-22 0.80 0.96 1.22
You can also pull all the historical rates directly from the US Treasury's website (updated daily):
from bs4 import BeautifulSoup
import requests
import pandas as pd
soup = BeautifulSoup(requests.get('https://data.treasury.gov/feed.svc/DailyTreasuryYieldCurveRateData').text,'lxml')
table = soup.find_all('m:properties')
tbondvalues = []
for i in table:
    tbondvalues.append([i.find('d:new_date').text[:10],
                        i.find('d:bc_1month').text, i.find('d:bc_2month').text,
                        i.find('d:bc_3month').text, i.find('d:bc_6month').text,
                        i.find('d:bc_1year').text, i.find('d:bc_2year').text,
                        i.find('d:bc_3year').text, i.find('d:bc_5year').text,
                        i.find('d:bc_10year').text, i.find('d:bc_20year').text,
                        i.find('d:bc_30year').text])
ustcurve = pd.DataFrame(tbondvalues,columns=['date','1m','2m','3m','6m','1y','2y','3y','5y','10y','20y','30y'])
ustcurve.iloc[:,1:] = ustcurve.iloc[:,1:].apply(pd.to_numeric)/100
ustcurve['date'] = pd.to_datetime(ustcurve['date'])

Finding duration between events

I want to compute the duration (in weeks) between changes. For example, p is the same for weeks 1, 2, 3 and changes to 1.11 in period 4, so the duration is 3. Currently the duration is computed in a loop ported from R. It works but it is slow. Any suggestion on how to improve this would be greatly appreciated.
raw['duration'] = np.nan
id = raw['unique_id'].unique()
for i in range(0, len(id)):
    pos1 = abs(raw['dp']) > 0
    pos2 = raw['unique_id'] == id[i]
    pos = np.where(pos1 & pos2)[0]
    raw['duration'][pos[0]] = raw['week'][pos[0]] - 1
    for j in range(1, len(pos)):
        raw['duration'][pos[j]] = raw['week'][pos[j]] - raw['week'][pos[j-1]]
The dataframe is raw, and values for a particular unique_id looks like this.
date week p change duration
2006-07-08 27 1.05 -0.07 1
2006-07-15 28 1.05 0.00 NaN
2006-07-22 29 1.05 0.00 NaN
2006-07-29 30 1.11 0.06 3
... ... ... ... ...
2010-06-05 231 1.61 0.09 1
2010-06-12 232 1.63 0.02 1
2010-06-19 233 1.57 -0.06 1
2010-06-26 234 1.41 -0.16 1
2010-07-03 235 1.35 -0.06 1
2010-07-10 236 1.43 0.08 1
2010-07-17 237 1.59 0.16 1
2010-07-24 238 1.59 0.00 NaN
2010-07-31 239 1.59 0.00 NaN
2010-08-07 240 1.59 0.00 NaN
2010-08-14 241 1.59 0.00 NaN
2010-08-21 242 1.61 0.02 5
Computing durations once you have your list in date order is trivial: iterate over the list, keeping track of how long it has been since the last change to p. If the slowness comes from how you get that list, you haven't provided nearly enough info for help with that.
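As a concrete illustration of that description (a sketch of my own, not from the thread; it assumes a single unique_id and the 'week'/'change' columns shown above):
last_change_week = None
durations = []
for week, change in zip(raw['week'], raw['change']):
    if change != 0.0:
        # weeks since the previous change; week - 1 for the very first change
        durations.append(week - last_change_week if last_change_week is not None else week - 1)
        last_change_week = week
    else:
        durations.append(float('nan'))
raw['duration'] = durations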
You can simply get the list of weeks where there is a change, then compute their differences, and finally join those differences back onto your original DataFrame.
weeks = raw.query('change != 0.0')[['week']]
weeks['duration'] = weeks.week.diff()
pd.merge(raw, weeks, on='week', how='left')
raw2 = raw.loc[raw['change'] != 0, ['week', 'unique_id']]
data2 = raw2.groupby('unique_id')
raw2['duration'] = data2['week'].transform(lambda x: x.diff())
raw2.drop('unique_id', axis=1)  # note: not assigned, so this line has no effect
raw = pd.merge(raw, raw2, on=['unique_id', 'week'], how='left')
Thank you all. I modified the suggestion and got this to give the same answer as the complicated loop. For 10,000 observations it is not a whole lot faster, but the code seems more compact.
I set "no change" to NaN because the duration seems to be undefined when no change is made, but zero would work too. With the above code, the NaN is put in automatically by the merge. In any case,
I want to compute statistics for the non-change group separately.
