Creating DataFrame with Hierarchical Columns - python

What is the easiest way to create a DataFrame with hierarchical columns?
I am currently creating a DataFrame from a dict of names -> Series using:
df = pd.DataFrame(data=serieses)
I would like to use the same column names but add an additional level of hierarchy on the columns. For the time being I want the additional level to have the same value for all columns, say "Estimates".
I am trying the following but that does not seem to work:
pd.DataFrame(data=serieses,columns=pd.MultiIndex.from_tuples([(x, "Estimates") for x in serieses.keys()]))
All I get is a DataFrame with all NaNs.
For example, what I am looking for is roughly:
l1                 Estimates
l2   one  two  one  two  one  two  one  two
r1     1    2    3    4    5    6    7    8
r2   1.1    2    3    4    5    6   71  8.2
where l1 and l2 are the labels for the MultiIndex

This appears to work:
import pandas as pd
data = {'a': [1,2,3,4], 'b': [10,20,30,40],'c': [100,200,300,400]}
df = pd.concat({"Estimates": pd.DataFrame(data)}, axis=1, names=["l1", "l2"])
l1 Estimates
l2 a b c
0 1 10 100
1 2 20 200
2 3 30 300
3 4 40 400
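Applied to the dict of Series from the question (assuming serieses is the name -> Series dict defined there), the same trick is a one-liner:
# wrap the plain DataFrame in a one-key dict; concat adds "Estimates" on top
df = pd.concat({"Estimates": pd.DataFrame(serieses)}, axis=1, names=["l1", "l2"])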

I know the question is really old, but as of pandas 0.19.1 one can use direct dict initialization:
d = {('a','b'):[1,2,3,4], ('a','c'):[5,6,7,8]}
df = pd.DataFrame(d, index=['r1','r2','r3','r4'])
df.columns.names = ('l1','l2')
print(df)
l1 a
l2 b c
r1 1 5
r2 2 6
r3 3 7
r4 4 8
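On any reasonably recent pandas you can also build that tuple-keyed dict straight from the question's dict of Series; a minimal sketch, again assuming serieses as defined in the question:
d = {("Estimates", name): s for name, s in serieses.items()}
df = pd.DataFrame(d)
df.columns.names = ("l1", "l2")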

I'm not sure, but I think a dict as input for your DataFrame and a MultiIndex don't play well together; using an array as input instead makes it work.
I often prefer dicts as input though. One way is to set the columns after creating the df:
import numpy as np
import pandas as pd

data = {'a': [1,2,3,4], 'b': [10,20,30,40], 'c': [100,200,300,400]}
df = pd.DataFrame(np.array(list(data.values())).T, index=['r1','r2','r3','r4'])  # list() needed on Python 3
tups = list(zip(['Estimates']*len(data), data.keys()))
df.columns = pd.MultiIndex.from_tuples(tups, names=['l1','l2'])
l1 Estimates
l2         a   b    c
r1         1  10  100
r2         2  20  200
r3         3  30  300
r4         4  40  400
Or when using an array as input for the df:
data_arr = np.array([[1,2,3,4],[10,20,30,40],[100,200,300,400]])
tups = list(zip(['Estimates']*data_arr.shape[0], ['a','b','c']))
df = pd.DataFrame(data_arr.T, index=['r1','r2','r3','r4'],
                  columns=pd.MultiIndex.from_tuples(tups, names=['l1','l2']))
Which gives the same result.
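As an aside, pd.MultiIndex.from_product is arguably the most direct way to build a constant upper level; a short sketch over the same data dict:
# the product ['Estimates'] x ['a', 'b', 'c'] yields the (l1, l2) tuples
df = pd.DataFrame(data, index=['r1', 'r2', 'r3', 'r4'])
df.columns = pd.MultiIndex.from_product([['Estimates'], list(data)],
                                        names=['l1', 'l2'])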

The solution by Rutger Kassies worked in my case, but I have
more than one column in the "upper level" of the column hierarchy.
Just want to provide what worked for me as an example since it is a more general case.
First, I have data that looks like this:
> df
(A, a) (A, b) (B, a) (B, b)
0 0.00 9.75 0.00 0.00
1 8.85 8.86 35.75 35.50
2 8.51 9.60 66.67 50.70
3 0.03 508.99 56.00 8.58
I would like it to look like this:
> df
A B
a b a b
0 0.00 9.75 0.00 0.00
1 8.85 8.86 35.75 35.50
...
The solution is:
tuples = df.transpose().index
new_columns = pd.MultiIndex.from_tuples(tuples, names=['Upper', 'Lower'])
df.columns = new_columns
This felt counter-intuitive, because in order to set the columns I had to go through the (transposed) index.
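A hedged simplification: the transpose should not be needed, because the tuple labels already live on df.columns, and MultiIndex.from_tuples accepts that Index directly:
# equivalent, without transposing: the tuple labels are already on df.columns
df.columns = pd.MultiIndex.from_tuples(df.columns, names=['Upper', 'Lower'])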

Related

How to decode column value from rare label by matching column names

I have two dataframes, as shown below:
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cdf = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'grade': rng.choice(list('ACD'), size=(5)),
                    'dash': rng.choice(list('PQRS'), size=(5)),
                    'dumeel': rng.choice(list('QWER'), size=(5)),
                    'dumma': rng.choice((1234), size=(5)),
                    'target': rng.choice([0, 1], size=(5))
                    })
tdf = pd.DataFrame({'Id': [1, 1, 1, 1, 3, 3, 3],
                    'feature': ['grade=Rare', 'dash=Q', 'dumma=rare', 'dumeel=R', 'dash=Rare', 'dumma=rare', 'grade=D'],
                    'value': [0.2, 0.45, -0.32, 0.56, 1.3, 1.5, 3.7]})
My objective is to:
a) replace the Rare or rare values in the feature column of the tdf dataframe with the original value from the cdf dataframe;
b) identify that original value using the string before = (in grade=Rare, dumma=rare, etc.); that string is the column name in the cdf dataframe from which the replacement value can be looked up.
I was trying something like the following, but I am not sure how to go on from here:
replace_df = cdf.merge(tdf,how='inner',on='Id')
replace_df ["replaced_feature"] = np.where(((replace_df["feature"].str.contains('rare',regex=True)]) & (replace_df["feature"].str.split('='))])
I have to apply this to big data, where I have millions of rows and more than 1000 replacements to make like this.
I expect my output to look as shown below.
Here is one possible approach using MultiIndex.map to substitute values from cdf into tdf:
s = tdf['feature'].str.split('=')
m = s.str[1].isin(['rare', 'Rare'])  # rows whose value needs decoding
# look up the (Id, column name) pairs in the stacked (long-form) cdf
v = tdf[m].set_index(['Id', s[m].str[0]]).index.map(cdf.set_index('Id').stack())
tdf.loc[m, 'feature'] = s[m].str[0] + '=' + v.astype(str)
print(tdf)
Id feature value
0 1 grade=D 0.20
1 1 dash=Q 0.45
2 1 dumma=1123 -0.32
3 1 dumeel=R 0.56
4 3 dash=P 1.30
5 3 dumma=849 1.50
6 3 grade=D 3.70
An alternative, using a list comprehension plus a mask and a merge:
# list comprehension to find where rare is in the feature col
tdf['feature'] = [x if y.lower() == 'rare' else x + '=' + y for x, y in tdf['feature'].str.split('=')]
# create a mask where feature is in the columns of cdf
mask = tdf['feature'].isin(cdf.columns)
# filter the frame with loc and merge the stacked cdf on the Id and feature columns
tdf.loc[mask, 'feature'] = tdf.loc[mask, 'feature'] + '=' + tdf.loc[mask].merge(
    cdf.set_index('Id').stack().to_frame(),
    right_index=True, left_on=['Id', 'feature'])[0].astype(str)
Id feature value
0 1 grade=D 0.20
1 1 dash=Q 0.45
2 1 dumma=1123 -0.32
3 1 dumeel=R 0.56
4 3 dash=P 1.30
5 3 dumma=849 1.50
6 3 grade=D 3.70
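Given that the question mentions millions of rows, a hedged micro-optimisation that applies to both approaches above is to build the stacked lookup once and reuse it, instead of re-stacking cdf inside each expression:
lookup = cdf.set_index('Id').stack()  # (Id, column name) -> original value, built once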
My feeling is there's no need to look for Rare values.
Extract the column name from tdf to look up in cdf. Then flatten your cdf dataframe to extract the right values:
r = tdf.set_index('Id')['feature'].str.split('=').str[0].str.lower()
tdf['feature'] = r.values + '=' + cdf.set_index('Id').unstack() \
                                     .loc[list(zip(r.values, r.index))] \
                                     .astype(str).values
Output:
>>> tdf
Id feature value
0 1 grade=D 0.20
1 1 dash=Q 0.45
2 1 dumma=1123 -0.32
3 1 dumeel=R 0.56
4 3 dash=P 1.30
5 3 dumma=849 1.50
6 3 grade=A 3.70
>>> r
Id # <- the index is the row of cdf
1 grade # <- the values are the column of cdf
1 dash
1 dumma
1 dumeel
3 dash
3 dumma
3 grade
Name: feature, dtype: object

Using custom function for Pandas Rolling Apply that depends on colname

Using Pandas 1.1.5, I have a test DataFrame like the following:
import numpy as np
import pandas as pd
df = pd.DataFrame({'id': ['a0','a0','a0','a1','a1','a1','a2','a2'],
                   'a': [4,5,6,1,2,3,7,9],
                   'b': [3,4,5,3,2,4,1,3],
                   'c': [7,4,3,8,9,7,4,6],
                   'denom_a': [7,8,9,7,8,9,7,8],
                   'denom_b': [10,11,12,10,11,12,10,11]})
I would like to apply the following custom aggregate function on a rolling window, where the function's calculation depends on the column name, like so:
def custom_func(s, df, colname):
    if 'a' in colname:
        denom = df.loc[s.index, "denom_a"]
        calc = s.sum() / np.max(denom)
    elif 'b' in colname:
        denom = df.loc[s.index, "denom_b"]
        calc = s.sum() / np.max(denom)
    else:
        calc = s.mean()
    return calc

df.groupby('id')\
  .rolling(2, 1)\
  .apply(lambda x: custom_func(x, df, x.name))
This results in TypeError: argument of type 'NoneType' is not iterable because the windowed subsets of each column do not retain the names of the original df columns. That is, x.name being passed in as an argument is in fact passing None rather than a string of the original column name.
Is there some way of making this approach work (say, retaining the column name being acted on with apply and passing that into the function)? Or are there any suggestions for altering it? I consulted the following reference for having the custom function utilize multiple columns within the same window calculation, among others:
https://stackoverflow.com/a/57601839/6464695
I wouldn't be surprised if there's a "better" solution, but I think this could at least be a "good start" (I don't do a whole lot with .rolling(...)).
With this solution, I make two critical assumptions:
All denom_<X> have a corresponding <X> column.
Everything you do with the (<X>, denom_<X>) pairs is the same. (This should be straightforward to customize as needed.)
With that said, I do the .rolling within the function, rather than outside, in part because it seems like .apply(...) on a RollingGroupBy can only work column-wise, which isn't too helpful here (imo).
from typing import List, Tuple

def cust_fn(df: pd.DataFrame, rolling_args: Tuple) -> pd.DataFrame:
    cols = df.columns
    denom_cols = ["id"]  # the whole dataframe is passed, so place identifiers / uncomputable variables here
    for denom_col in cols[cols.str.startswith("denom_")]:
        denom_cols += [denom_col, denom_col.replace("denom_", "")]
        col = denom_cols[-1]  # sugar
        df[f"calc_{col}"] = df[col].rolling(*rolling_args).sum() / df[denom_col].max()
    for col in cols[~cols.isin(denom_cols)]:
        df[f"calc_{col}"] = df[col].rolling(*rolling_args).mean()
    return df
Then the way you'd go about running this is the following (and you get the corresponding output):
>>> df.groupby("id").apply(cust_fn, rolling_args=(2, 1))
id a b c denom_a denom_b calc_a calc_b calc_c
0 a0 4 3 7 7 10 0.444444 0.250000 7.0
1 a0 5 4 4 8 11 1.000000 0.583333 5.5
2 a0 6 5 3 9 12 1.222222 0.750000 3.5
3 a1 1 3 8 7 10 0.111111 0.250000 8.0
4 a1 2 2 9 8 11 0.333333 0.416667 8.5
5 a1 3 4 7 9 12 0.555556 0.500000 8.0
6 a2 7 1 4 7 10 0.875000 0.090909 4.0
7 a2 9 3 6 8 11 2.000000 0.363636 5.0
If you need to dynamically state which non-numeric/uncomputable columns exist, it might make sense to define cust_fn as follows:
def cust_fn(df: pd.DataFrame, rolling_args: Tuple, index_cols: List = []) -> pd.DataFrame:
    cols = df.columns
    denom_cols = index_cols
    # ... the rest is unchanged
Then you would adapt your calling of cust_fn as follows:
>>> df.groupby("id").apply(cust_fn, rolling_args=(2, 1), index_cols=["id"])
Of course, comment on this if you run into issues adapting it to your uses. 🙂
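For completeness, here is a hedged sketch of a workaround that keeps the original custom_func from the question and supplies the column name explicitly by looping over the columns; it assumes (not verified across versions) that each window Series keeps the original df index, so df.loc[s.index, ...] inside custom_func still resolves:
calc = {}
for col in ['a', 'b', 'c']:
    # bind col via a default argument so each lambda sees its own column name
    calc[f'calc_{col}'] = (df.groupby('id')[col]
                             .rolling(2, 1)
                             .apply(lambda s, c=col: custom_func(s, df, c))
                             .reset_index(level=0, drop=True))  # drop the 'id' level
out = df.join(pd.DataFrame(calc))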

How to plot transition among multiple groups in python

I want to plot a transition between multiple groups in Python. Say I have three groups A, B and C at a given datetime x. Now at datetime y > x I want to visualize what % of the elements of A transitioned to group B and what % to C, and similarly for B and C. For now I can assume there is a fixed number of elements. Also, can I extend this to multiple dates, like x < y < z, and visualize the changes?
A sample dataframe of my usecase can be generated using this code
import numpy as np
import pandas as pd

elements = [f'e{i}' for i in range(10)]
x = pd.DataFrame({'element': elements, 'group': np.random.choice(['A', 'B', 'C'], size=10), 'date': pd.to_datetime('2021-04-01')})
y = pd.DataFrame({'element': elements, 'group': np.random.choice(['A', 'B', 'C'], size=10), 'date': pd.to_datetime('2021-04-10')})
df = pd.concat([x, y])  # DataFrame.append is deprecated/removed in newer pandas
Now, from the above dataframe, I want to visualize how the transition between groups A, B and C happened across the two dates.
My main issue is that I don't know which plot to use in Python to visualize this; any leads would be really helpful.
Here's an approach to get what you need, i.e. shift from one date to another:
# pivot the data so dates become columns
s = df.pivot(index='element', columns='date', values='group')
which gives s as:
date 2021-04-01 2021-04-10
element
e0 A A
e1 A C
e2 B B
e3 B B
e4 C C
e5 A C
e6 B B
e7 C A
e8 C A
e9 C A
Next,
# compare the two consecutive dates
pairwise = pd.get_dummies(s.iloc[:, 1]).T @ pd.get_dummies(s.iloc[:, 0])
which gives you pairwise as:
A B C
A 1 0 3
B 0 3 0
C 2 0 1
That means, e.g., the first column says there are 3 A's on the first date: one stays A and two change to C on the second date. Finally, you can easily compute the percentages with
pairwise / pairwise.sum()
Output, which you can visualize with something like sns.heatmap:
A B C
A 0.333333 0.0 0.75
B 0.000000 1.0 0.00
C 0.666667 0.0 0.25
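For instance, a minimal seaborn sketch (assuming seaborn and matplotlib are available, with pairwise as computed above):
import seaborn as sns
import matplotlib.pyplot as plt

# rows are the group on the second date, columns the group on the first date
sns.heatmap(pairwise / pairwise.sum(), annot=True, fmt='.2f', cmap='Blues')
plt.xlabel('group on first date')
plt.ylabel('group on second date')
plt.show()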
As for the extended question: you would have one of these matrices for each pair of consecutive dates (day1, day2), (day2, day3), .... It's up to you to decide how to visualize them.

Create a new dataframe from an old dataframe where the new dataframe contains row-wise average of columns at different locations in the old dataframe

I have a dataframe called "frame" with 16 columns and 201 rows. A screenshot is attached that provides an example dataframe
Please note the screenshot is just an example; the original dataframe is much larger.
I would like to find an efficient way (maybe using a for loop or writing a function) to row-wise average different columns in the dataframe. For instance, to average columns "rep" and "rep1", and columns "repcycle" and "repcycle1" (similarly for set and setcycle), and save the result in a new dataframe containing only the averaged columns.
I have tried writing code using iloc:
newdf= frame[['sample']].copy()
newdf['rep_avg']=frame.iloc[:, [1,5]].mean(axis=1) #average row-wise
newdf['repcycle_avg']=frame.iloc[:, [2,6]].mean(axis=1)
newdf['set_avg']=frame.iloc[:, [3,7]].mean(axis=1) #average row-wise
newdf['setcycle_avg']=frame.iloc[:, [4,8]].mean(axis=1)
newdf.columns = ['S', 'Re', 'Rec', 'Se', 'Sec']
The above code does the job, but it is tedious to note the locations for every column. I would rather like to automate this process since this is repeated for other data files too.
Based on your desire to "automate this process since this is repeated for other data files too", what I can think of is this:
In [1]: frame = pd.read_csv('your path')
The result is shown below; as you can see, what you want to average are columns 1 and 5, 2 and 6, and so on.
Out[1]:
sample rep repcycle set setcycle rep1 repcycle1 set1 setcycle1
0 66 40 4 5 3 40 4 5 3
1 78 20 5 6 3 20 5 6 3
2 90 50 6 9 4 50 6 9 4
3 45 70 7 3 2 70 7 7 2
So we need to create two lists:
In [2]: import numpy as np
        list_1 = np.arange(1,5,1).tolist()
In [3]: list_1
Out[3]: [1, 2, 3, 4]
This is for the first half you want to average: [rep, repcycle, set, setcycle].
In [4]: list_2 = [x+4 for x in list_1]
In [5]: list_2
Out[5]: [5, 6, 7, 8]
This is for the second half you want to average: [rep1, repcycle1, set1, setcycle1].
In [6]: result = pd.concat([frame.iloc[:, [x, y]].mean(axis=1) for x, y in zip(list_1, list_2)], axis=1)
In [7]: result.columns = ['Re', 'Rec', 'Se', 'Sec']
Now you get what you want, and it's automated; all you need to do is change the two lists above.
In [8]: result
Out[8]:
Re Rec Se Sec
0 40.0 4.0 5.0 3.0
1 20.0 5.0 6.0 3.0
2 50.0 6.0 9.0 4.0
3 70.0 7.0 5.0 2.0
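As a hedged variation on the same idea: if the paired columns always share a base name (rep/rep1, repcycle/repcycle1, ...), you can derive the pairing from the column names instead of maintaining position lists. A sketch, assuming frame has a 'sample' column as above:
num = frame.set_index('sample')
base = num.columns.str.replace(r'\d+$', '', regex=True)  # 'rep1' -> 'rep'
result = num.T.groupby(base).mean().T  # row-wise mean of columns sharing a base name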

Pandas group by cumsum of lists - Preparation for lstm

Using the same example from here but just changing the 'A' column to be something that can easily be grouped by:
import pandas as pd
import numpy as np
# Get some time series data
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/timeseries.csv")
df["A"] = pd.Series([1]*3+ [2]*8)
df.head()
whose output now is:
Date A B C D E F G
0 2008-03-18 1 164.93 114.73 26.27 19.21 28.87 63.44
1 2008-03-19 1 164.89 114.75 26.22 19.07 27.76 59.98
2 2008-03-20 1 164.63 115.04 25.78 19.01 27.04 59.61
3 2008-03-25 2 163.92 114.85 27.41 19.61 27.84 59.41
4 2008-03-26 2 163.45 114.84 26.86 19.53 28.02 60.09
5 2008-03-27 2 163.46 115.40 27.09 19.72 28.25 59.62
6 2008-03-28 2 163.22 115.56 27.13 19.63 28.24 58.65
Doing the cumulative sums (code from the linked question) works well when we're assuming it's a single list:
# Put your inputs into a single list
input_cols = ["B", "C"]
df['single_input_vector'] = df[input_cols].apply(tuple, axis=1).apply(list)
# Double-encapsulate list so that you can sum it in the next step and keep time steps as separate elements
df['single_input_vector'] = df.single_input_vector.apply(lambda x: [list(x)])
# Use .cumsum() to include previous row vectors in the current row list of vectors
df['cumulative_input_vectors1'] = df["single_input_vector"].cumsum()
But how do I cumsum the lists in this case, grouped by 'A'? I expected this to work, but it doesn't:
df['cumu'] = df.groupby("A")["single_input_vector"].apply(lambda x: list(x)).cumsum()
Instead of [[164.93, 114.73, 26.27], [164.89, 114.75, 26.... I get some rows filled in and others NaN. This is what I want (cols [B, C] accumulated into groups of col A):
A cumu
0 1 [[164.93,114.73], [164.89,114.75], [164.63,115.04]]
0 2 [[163.92,114.85], [163.45,114.84], [163.46,115.40], [163.22, 115.56]]
Also, how do I do this in an efficient manner? My dataset is quite big (about 2 million rows).
It doesn't look like you're doing an arithmetic sum; it's more like a concat along axis=1.
First, group by and concatenate:
temp_series = df.groupby('A').apply(lambda x: [[a,b] for a, b in zip(x['B'], x['C'])])
0 [[164.93, 114.73], [164.89, 114.75], [164.63, ...
1 [[163.92, 114.85], [163.45, 114.84], [163.46, ...
then convert back to a dataframe
df = temp_series.reset_index().rename(columns={0: 'cumsum'})
In one line
df = df.groupby('A').apply(lambda x: [[a,b] for a, b in zip(x['B'], x['C'])]).reset_index().rename(columns={0: 'cumsum'})
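If what you actually need is the running list per row (each row holding every vector seen so far within its group, as in the question's expected output), here is a hedged sketch with itertools.accumulate, assuming df still carries the single_input_vector column built above (list + list concatenates, so accumulate builds the running list):
from itertools import accumulate

cumu = (df.groupby('A')['single_input_vector']
          .apply(lambda s: pd.Series(list(accumulate(s)), index=s.index)))
df['cumu'] = cumu.reset_index(level=0, drop=True)  # drop the group level if your pandas prepends it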
