I have a timeseries dataframe that is similar to:
ts = pd.DataFrame([['Jan 2000','WidgetCo',0.5, 2], ['Jan 2000','GadgetCo',0.3, 3], ['Jan 2000','SnazzyCo',0.2, 4],
['Feb 2000','WidgetCo',0.4, 2], ['Feb 2000','GadgetCo',0.5, 2.5], ['Feb 2000','SnazzyCo',0.1, 4],
], columns=['month','company','share','price'])
Which looks like:
month company share price
0 Jan 2000 WidgetCo 0.5 2.0
1 Jan 2000 GadgetCo 0.3 3.0
2 Jan 2000 SnazzyCo 0.2 4.0
3 Feb 2000 WidgetCo 0.4 2.0
4 Feb 2000 GadgetCo 0.5 2.5
5 Feb 2000 SnazzyCo 0.1 4.0
I can pivot this table like so:
pd.pivot_table(ts,index='month', columns='company')
Which gets me:
share price
company GadgetCo SnazzyCo WidgetCo GadgetCo SnazzyCo WidgetCo
month
Feb 2000 0.5 0.1 0.4 2.5 4 2
Jan 2000 0.3 0.2 0.5 3.0 4 2
This is what I want except that I need to collapse the MultiIndex so that the company is used as a prefix for share and price like so:
WidgetCo_share WidgetCo_price GadgetCo_share GadgetCo_price ...
month
Jan 2000 0.5 2 0.3 3.0
Feb 2000 0.4 2 0.5 2.5
I came up with this function to do just that but it seems like a poor solution:
def pivot_table_to_flat(df, column, index):
    res = df.set_index(index)
    cols = res.drop(column, axis=1).columns.values
    resulting_cols = []
    for prefix in res[column].unique():
        for col in cols:
            new_col_name = prefix + '_' + col
            res[new_col_name] = res[res[column] == prefix][col]
            resulting_cols.append(new_col_name)
    return res[resulting_cols]
pivot_table_to_flat(ts, index='month', column='company')
What is a better way of accomplishing a pivot that results in prefixed column names rather than a MultiIndex?
This seems even simpler:
df.columns = [' '.join(col).strip() for col in df.columns.values]
It takes a df with MultiIndex columns and flattens the column labels, modifying the df in place.
(ref: @andy-haden, "Python Pandas - How to flatten a hierarchical index in columns")
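Note that for the pivot above each column label is a (value, company) tuple, so the plain join gives names like 'share GadgetCo'. A minimal sketch of the same trick, reversing each tuple and joining with an underscore to get the company-first names the question asks for (reusing ts from above):
df = pd.pivot_table(ts, index='month', columns='company')
# Reverse each (value, company) tuple so the company comes first.
df.columns = ['_'.join(col[::-1]) for col in df.columns.values]
print(df.columns.tolist())  # company-first names such as 'GadgetCo_share' and 'WidgetCo_price'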
I figured it out. Using the data on the MultiIndex makes for a pretty clean solution:
def flatten_multi_index(df):
    mi = df.columns
    suffixes, prefixes = mi.levels
    # mi.labels was renamed to mi.codes in newer pandas
    col_names = [prefixes[i_p] + '_' + suffixes[i_s] for (i_s, i_p) in zip(*mi.labels)]
    df.columns = col_names
    return df
flatten_multi_index(pd.pivot_table(ts,index='month', columns='company'))
The above version only handles a two-level MultiIndex, but it could be generalized if needed.
An update (as of early 2017 and pandas 0.19.2): you can use .values on a MultiIndex, so this snippet should flatten MultiIndexes for those in need. The snippet is both too clever and not clever enough: it can handle either the row index or the column names of the DataFrame, but it will blow up if the result of getattr(df, way) isn't actually a MultiIndex.
def flatten_multi(df, way='index'):  # or way='columns'
    assert way in {'index', 'columns'}, "I'm sorry Dave."
    mi = getattr(df, way)
    flat_names = ["_".join(s) for s in mi.values]
    setattr(df, way, flat_names)
    return df
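A quick usage sketch on the pivot from the question (this assumes every label is a string, since "_".join requires that):
flat = flatten_multi(pd.pivot_table(ts, index='month', columns='company'), way='columns')
print(flat.columns.tolist())  # names such as 'share_GadgetCo' and 'price_WidgetCo'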
I have a pandas dataframe that has values like below, though in reality I am working with a lot more columns and historical data:
AUD USD JPY EUR
0 0.67 1 140 1.05
I want to iterate over the columns to create a dataframe with columns AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR and JPYEUR,
where e.g. AUDUSD is calculated as the product of the AUD column and the USD column.
I tried the below:
for col in df:
    for cols in df:
        cf[col+cols] = df[col]*df[cols]
But it generates a table with unnecessary values like AUDAUD and USDUSD, and duplicates like AUDUSD and USDAUD. I think if I could somehow make the second loop start at the column after col and run to the end of df, I would be able to resolve the issue, but I don't know how to do that.
The result I am looking for is a table with the below columns and their values:
AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR, JPYEUR
You can use itertools.combinations with pandas.Series.mul and pandas.concat.
Try this:
from itertools import combinations
combos = list(combinations(df.columns, 2))
out = pd.concat([df[col[1]].mul(df[col[0]]) for col in combos], axis=1, keys=combos)
out.columns = out.columns.map("".join)
# Output :
print(out)
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
# Used input :
df = pd.DataFrame({'AUD': [0.67], 'USD': [1], 'JPY': [140], 'EUR': [1.05]})
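Note that combinations(df.columns, 2) yields each unordered pair exactly once, in the order the columns appear, which is what rules out AUDAUD-style self-products and USDAUD-style duplicates.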
I thought it was intuitive that your first approach used an inner/outer loop, and I think this solution works in the same spirit:
# Added a second row for testing
df = pd.DataFrame(
    {'AUD': [0.67, 0.91], 'USD': [1, 1], 'JPY': [140, 130], 'EUR': [1.05, 1]},
)
# Instantiate the second DataFrame
cf = pd.DataFrame()
# Iterate over the column positions as integers
for i in range(len(df.columns)):
    # Start at i + 1 so you aren't looking at the same column twice,
    # and limit the range to the length of your columns
    for j in range(i+1, len(df.columns)):
        print(f'{df.columns[i]}' + f'{df.columns[j]}')  # VERIFY
        # Create a variable of the column names mashed together
        combine = f'{df.columns[i]}' + f'{df.columns[j]}'
        # Assign the rows to be a product of the mashed column series
        cf[combine] = df[df.columns[i]] * df[df.columns[j]]
print(cf)  # VERIFY
The console log looks like this:
AUDUSD
AUDJPY
AUDEUR
USDJPY
USDEUR
JPYEUR
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
1 0.91 118.3 0.9100 130 1.00 130.0
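This positional, triangular iteration generates exactly the same pairs as itertools.combinations(df.columns, 2) in the answer above, just with explicit indices.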
I have a dataset that looks as follows:
data = {'Year':[2012, 2013, 2012, 2013, 2014, 2013],
'Quarter':[2, 2, 2, 2, 3, 1],
'ID':['CH7744', 'US4652', 'CA47441', 'CH1147', 'DE7487', 'US5174'],
'MC':[3348.22, 8542.55, 11851.2, 15718.1, 29914.7, 8731.78 ],
'PB': [2.74, 0.95, 1.57, 2.13, 0.54, 5.32]}
df = pd.DataFrame(data)
Now what I aim to do is to add a new column "SMB" and calculate it as follows:
Subset the data based on year and quarter, e.g. get all rows where year = 2012 and quarter = 2
Sort the subset by column MC and split it by size into small and big (0.5 quantile)
If the value in MC is below the 0.5 quantile, add "Small" to the newly created column "SMB"; if it is above, add "Big"
Repeat the process for all rows where quarter = 2
For all other rows add np.nan
So the output should look like this:
data = {'Year':[2012, 2013, 2012, 2013, 2014, 2013],
'Quarter':[2, 2, 2, 2, 3, 1],
'ID':['CH7744', 'US4652', 'CA47441', 'CH1147', 'DE7487', 'US5174'],
'MC':[3348.22, 8542.55, 11851.2, 15718.1, 29914.7, 8731.78 ],
'PB': [2.74, 0.95, 1.57, 2.13, 0.54, 5.32],
'SMB': ['Small', 'Small', 'Big', 'Big', np.NaN, np.NaN]}
df = pd.DataFrame(data)
I tried to create a loop, but I was unable to properly merge the result back into the previous dataframe, as I need the other quarters' values for further calculation. Using the code below I sort of achieved what I wanted, but I had to merge the data back into the original dataset. I'm sure there is a much nicer way to achieve this.
# Quantile 0.5 for MC sorting (small & big)
smbQuantile = 0.5
Years = df['Year'].unique()
dataframes_list = []
# Calculate Small and Big and merge back into the dataframe
for i in Years:
    # .copy() avoids SettingWithCopyWarning on the assignments below
    df_temp = df.loc[(df['Year'] == i) & (df['Quarter'] == 2)].copy()
    df_temp['SMB'] = ''
    # Assign factor size based on market cap
    df_temp.loc[df_temp.MC <= df_temp.MC.quantile(smbQuantile), 'SMB'] = 'Small'
    df_temp.loc[df_temp.MC >= df_temp.MC.quantile(smbQuantile), 'SMB'] = 'Big'
    dataframes_list.append(df_temp)
df = pd.concat(dataframes_list)
You can use groupby.rank and groupby.transform('size') combined with numpy.select: rows in the lower half of their (Year, Quarter) group by MC become 'Small', the remaining rows in groups with at least two members become 'Big', and singleton groups (the rows outside quarter 2 here) fall through to NaN:
g = df.groupby(['Year', 'Quarter'])['MC']
df['SMB'] = np.select([g.rank(pct=True).le(0.5),
                       g.transform('size').ge(2)],
                      ['Small', 'Big'], np.nan)
output:
Year Quarter ID MC PB SMB
0 2012 2 CH7744 3348.22 2.74 Small
1 2013 2 US4652 8542.55 0.95 Small
2 2012 2 CA47441 11851.20 1.57 Big
3 2013 2 CH1147 15718.10 2.13 Big
4 2014 3 DE7487 29914.70 0.54 nan
5 2013 1 US5174 8731.78 5.32 nan
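Note that np.select builds a single string array here, so the default np.nan comes back as the string 'nan' (visible in the last two rows above); if you need real missing values, one option is df['SMB'] = df['SMB'].replace('nan', np.nan) afterwards.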
I have two dataframes which contain data collected at two different frequencies.
I want to update the label of df2 to that of df1 if it falls within the duration of an event.
I created a nested for-loop to do it, but it takes a rather long time.
Here is the code I used:
for i in np.arange(len(df1)-1):
    for j in np.arange(len(df2)):
        if (df2.timestamp[j] > df1.timestamp[i]) & (df2.timestamp[j] < (df1.timestamp[i] + df1.duration[i])):
            df2.loc[j, "label"] = df1.loc[i, "label"]
Is there a more efficient way of doing this?
df1 size (367, 4)
df2 size (342423, 9)
short example data:
import numpy as np
import pandas as pd

data1 = {'timestamp': [1,2,3,4,5,6,7,8,9],
         'duration': [0.5,0.3,0.8,0.2,0.4,0.5,0.3,0.7,0.5],
         'label': ['inh','exh','inh','exh','inh','exh','inh','exh','inh']}
df1 = pd.DataFrame(data1, columns=['timestamp','duration','label'])

data2 = {'timestamp': [1,1.5,2,2.5,3,3.5,4,4.5,5,5.5,6,6.5,7,7.5,8,8.5,9,9.5],
         'label': ['plc'] * 18}
df2 = pd.DataFrame(data2, columns=['timestamp','label'])
I would first use merge_asof to select, for each timestamp in df2, the highest timestamp from df1 at or below it. Next, a simple (vectorized) comparison of df2.timestamp against df1.timestamp + df1.duration is enough to select the matching lines.
Code could be:
df1['t2'] = df1['timestamp'].astype('float64') # types of join columns must be the same
temp = pd.merge_asof(df2, df1, left_on='timestamp', right_on='t2')
df2.loc[temp.timestamp_x <= temp.t2 + temp.duration, 'label'] = temp.label_y
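merge_asof performs a backward match by default, so each df2 row is paired with the latest df1 event starting at or before its timestamp; the .loc mask then keeps only the rows that also fall before that event's end.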
It gives for df2:
timestamp label
0 1.0 inh
1 1.5 inh
2 2.0 exh
3 2.5 plc
4 3.0 inh
5 3.5 inh
6 4.0 exh
7 4.5 plc
8 5.0 inh
9 5.5 plc
10 6.0 exh
11 6.5 exh
12 7.0 inh
13 7.5 plc
14 8.0 exh
15 8.5 exh
16 9.0 inh
17 9.5 inh
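A possible alternative (a sketch, assuming the event windows in df1 never overlap, which holds for the example data): build an IntervalIndex from the event windows and map each df2 timestamp straight to its enclosing event:
intervals = pd.IntervalIndex.from_arrays(df1['timestamp'],
                                         df1['timestamp'] + df1['duration'],
                                         closed='both')
# get_indexer returns the position of the enclosing interval, or -1 for no match
idx = intervals.get_indexer(df2['timestamp'])
hit = idx >= 0
df2.loc[hit, 'label'] = df1['label'].to_numpy()[idx[hit]]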
I'm trying to pick all the values from a dataframe column I have and apply them to a mathematical function.
Here's how the data looks:
Year % PM
1 2002 3
2 2002 2.3
I am trying to apply this function:
M = 100000
t = (THE PERCENTAGE FROM THE DATAFRAME)/12
n = 15*12
PM = M*((t*(1+t)**n)/(((1+t)**n)-1))
print(PM)
My goal is to do this for all the rows and append each result to a PM column in the df.
You can just add the formula as a column directly to the DF, creating t_div_12 as a vector from the column as below:
M = 100000
n = 15*12
t_div_12 = df["%"]/12
df["PM"] = M*((t_div_12*(1+t_div_12)**n)/(((1+t_div_12)**n)-1))
First, I would avoid defining named constants that aren't reused in the code. You can apply this function to your dataframe using this code snippet:
dF = pd.DataFrame([[2002, 3], [2002, 2.3]], columns=["Year", "%"])
dF['PM'] = 100000*((dF["%"]/12*(1+dF["%"]/12)**(15*12))/(((1+dF["%"]/12)**(15*12))-1))
It will give you:
Year % PM
0 2002 3.0 25000.000000
1 2002 2.3 19166.666667
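(Note that, as in the question, the percentage is used as-is: t = 3/12 makes (1+t)**180 astronomically large, so the formula collapses to almost exactly M*t, hence the 25000.0 above. If a true percentage rate is intended, presumably t should be (df["%"]/100)/12.)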
Alternatively, map the formula over the column (reusing M and n from above):
df['PM'] = df['%'].map(lambda t: M*(((t/12)*(1+(t/12))**n)/(((1+(t/12))**n)-1)))
I have a MultiIndex dataframe, and I would like to subtract one block of its rows from another. (The screenshots of the frames and their difference are not reproduced here.)
What I tried so far (the dataframe: http://pastebin.com/PydRHxcz):
index = pd.MultiIndex.from_tuples([key for key in dfdict], names=['a','b','c','d'])
dfl = pd.DataFrame([dfdict[key] for key in dfdict], index=index)
dfl.columns = ['data']
dfl.sort_index(inplace=True)  # dfl.sort(inplace=True) in older pandas
d = dfl.unstack(['a','b'])
I can do:
d[0:5] - d[0:5]
And I get zeros for all values.
But If I do:
d[0:5] - d[5:]
I get NaNs for all values. Any ideas how I can perform such an operation?
EDIT:
What works is
dfl.unstack(['a','b'])['data'][5:] - dfl.unstack(['a','b'])['data'][0:5].values
But it feels a bit clumsy
Subtraction aligns on the index, which is why d[0:5] - d[5:] comes back all NaN: the two slices share no row labels. You can instead use loc to select all rows corresponding to one label in the first level, which drops that level and leaves the remaining levels to align, like this:
In [8]: d.loc[0]
Out[8]:
data ...
a 0.17 1.00
b 0 5 10 500 0 5
d
0.0 11.098909 9.223784 8.003650 10.014445 13.231898 10.372040
0.3 14.349606 11.420565 9.053073 10.252542 26.342501 25.219403
0.5 1.336937 2.522929 3.875139 11.161803 3.168935 6.287555
0.7 0.379158 1.061104 2.053024 12.358577 0.678352 2.133887
1.0 0.210244 0.631631 1.457333 15.117805 0.292904 1.053916
So doing the subtraction looks like:
In [11]: d.loc[0] - d.loc[1000]
Out[11]:
data ...
a 0.17 1.00
b 0 5 10 500 0 5
d
0.0 -3.870946 -3.239915 -3.504068 -0.722377 -2.335147 -2.460035
0.3 -65.611418 -42.225811 -25.712668 -1.028758 -65.106473 -44.067692
0.5 -84.494748 -55.186368 -34.184425 -1.619957 -89.356417 -69.008567
0.7 -92.681688 -61.636548 -37.386604 -4.227343 -110.501219 -78.925078
1.0 -101.071683 -61.758741 -37.080222 -3.081782 -103.779698 -80.337487
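If you prefer an explicit cross-section, d.xs(0) selects the same rows as d.loc[0] (dropping the first level), so the subtraction can equivalently be written as d.xs(0) - d.xs(1000).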