Multilevel Slicing Pandas DataFrame - python

I have 3 DataFrames:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(100, 4), index=pd.date_range('1/1/2010', periods=100), columns=["A", "B", "C", "D"]).T.sort_index()
df2 = pd.DataFrame(np.random.randn(100, 4), index=pd.date_range('1/1/2010', periods=100), columns=["A", "B", "C", "D"]).T.sort_index()
df3 = pd.DataFrame(np.random.randn(100, 4), index=pd.date_range('1/1/2010', periods=100), columns=["A", "B", "C", "D"]).T.sort_index()
I concatenate them creating a DataFrame with multi levels:
df_c = pd.concat([df1, df2, df3], axis = 1, keys = ["df1", "df2", "df3"])
Swap levels and sort:
df_c.columns = df_c.columns.swaplevel(0,1)
df_c = df_c.reindex(sorted(df_c.columns), axis=1)  # reindex_axis was removed in later pandas versions
ipdb> df_c
  2010-01-01                      2010-01-02
         df1       df2       df3        df1       df2       df3
A  -0.798407  0.124091  0.271089   0.754759 -0.575769  1.501942
B   0.602091 -0.415828  0.152780   0.530525  0.118447  0.057240
C  -0.440619 -1.074837 -0.618084   0.627520 -1.298814  1.029443
D  -0.242851 -0.738948 -1.312393   0.559021  0.196936 -1.074277
I would like to slice it to get the values for an individual row, but so far I have only managed this level of slicing:
cols = df_c.T.index.get_level_values(0)
ipdb> df_c.xs(cols[0], axis = 1, level = 0)
df1 df2 df3
A -0.798407 0.124091 0.271089
B 0.602091 -0.415828 0.152780
C -0.440619 -1.074837 -0.618084
D -0.242851 -0.738948 -1.312393
The only way I found to get the values for each row is to define a new DataFrame:
slcd_df = df_c.xs(cols[0], axis=1, level=0)
and then select rows using the usual procedure (.loc here; the older .ix accessor has been removed):
ipdb> slcd_df.loc["A", :]
df1 -0.798407
df2 0.124091
df3 0.271089
But I was wondering whether there exists a better (meaning faster and more elegant) way to slice multilevel Dataframes.

You can use pd.IndexSlice:
idx = pd.IndexSlice
sliced = df_c.loc["A", idx["2010-01-01", :]]
print(sliced)
2010-01-01 df1 0.199332
df2 0.887018
df3 -0.346778
Name: A, dtype: float64
Alternatively, you can use slice(None):
print(df_c.loc["A", ("2010-01-01", slice(None))])
2010-01-01 df1 0.199332
df2 0.887018
df3 -0.346778
Name: A, dtype: float64
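The two snippets above run against the random data built earlier; a self-contained sketch with a small fixed frame (same (date, source) column shape, values chosen here so the result is checkable) looks like this:

```python
import pandas as pd
import numpy as np

# Small fixed frame with the same (date, source) column MultiIndex shape
dates = pd.date_range("2010-01-01", periods=2)
cols = pd.MultiIndex.from_product([dates, ["df1", "df2", "df3"]])
df_c = pd.DataFrame(np.arange(24).reshape(4, 6),
                    index=list("ABCD"), columns=cols)

idx = pd.IndexSlice
# Row "A", all three source frames for the first date
sliced = df_c.loc["A", idx[dates[0], :]]
print(sliced)
```

The result is a Series named "A" holding the three source values for that date, without any intermediate DataFrame.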

Related

Insert one dataframe into another in Python

I have the following two DataFrames (each with a 2-level index):
df1 = pd.DataFrame()
df1["Index1"] = ["A", "AA"]
df1["Index2"] = ["B", "BB"]
df1 = df1.set_index(["Index1", "Index2"])
df1["Value1"] = 1
df1["Value2"] = 2
df1
df2 = pd.DataFrame()
df2["Index1"] = ["X", "XX"]
df2["Index2"] = ["Y", "YY"]
df2["Value1"] = 3
df2["Value2"] = 4
df2 = df2.set_index(["Index1", "Index2"])
df2
I would like to create the following DataFrame with 3-level index where the first level indicates from which DataFrame the values are taken. Note all DataFrames have exactly the same columns:
How can I do this in the most automatic way? Ideally I would like to have the following solution:
# start with an empty dataframe
res = pd.DataFrame(index=pd.MultiIndex(levels=[[], [], []],
                                       codes=[[], [], []],
                                       names=["Df number", "Index1", "Index2"]),
                   columns=["Value1", "Value2"])
res = AddDataFrameAtIndex(index="DF1", level=0, dfToInsert=df1)
res = AddDataFrameAtIndex(index="DF2", level=0, dfToInsert=df2)
A possible solution, based on pandas.concat:
pd.concat([df1, df2], keys=['DF1', 'DF2'], names=['DF number'])
Output:
                         Value1  Value2
DF number Index1 Index2
DF1       A      B            1       2
          AA     BB           1       2
DF2       X      Y            3       4
          XX     YY           3       4
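As a self-contained sketch (same frames as the question, built more compactly), the keys=/names= arguments to pd.concat create and label the extra index level directly:

```python
import pandas as pd

# The two 2-level frames from the question
df1 = pd.DataFrame({"Value1": 1, "Value2": 2},
                   index=pd.MultiIndex.from_tuples([("A", "B"), ("AA", "BB")],
                                                   names=["Index1", "Index2"]))
df2 = pd.DataFrame({"Value1": 3, "Value2": 4},
                   index=pd.MultiIndex.from_tuples([("X", "Y"), ("XX", "YY")],
                                                   names=["Index1", "Index2"]))

# keys= adds the outer level, names= labels it; the existing
# index names are kept for the inner levels
res = pd.concat([df1, df2], keys=["DF1", "DF2"], names=["DF number"])
print(res)
```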

Separate dataframe into multiple ones and add them as columns

My dataset looks similar to this (but with a couple more rows):
The aim is to get this:
What I tried to do is:
# Identify names that are in the dataset
names = df['name'].unique().tolist()
# Define dataframe with first name
df1 = pd.DataFrame()
df1 = df[(df == names[0]).any(axis=1)]
df1 = df1.drop(['name'], axis=1)
df1 = df1.rename({'color':'color_'+str(names[0]), 'number':'number_'+str(names[0])}, axis=1)
# Make dataframes with the other names and their corresponding color and number, add them to df1
df_merged = pd.DataFrame()
for i in range(1, len(names)):
    df2 = pd.DataFrame()
    df2 = df[(df == names[i]).any(axis=1)]
    df2 = df2.drop(['name'], axis=1)
    df2 = df2.rename({'color':'color_'+str(names[i]), 'number':'number_'+str(names[i])}, axis=1)
    df_merged = df1.join(df2, lsuffix="_left", rsuffix="_right", how='left')
In the end I get this result for df_merged:
As you can see the columns color_Donald and number_Donald are missing. Does anyone know why and how to improve the code? It seems as if the loop somehow skips or overwrites Donald.
Thanks in advance!
sample df
import pandas as pd
data = {
    'name':   {'2020-01-01 00:00:00': 'Justin', '2020-01-02 00:00:00': 'Justin', '2020-01-03 00:00:00': 'Donald'},
    'color':  {'2020-01-01 00:00:00': 'blue',   '2020-01-02 00:00:00': 'red',    '2020-01-03 00:00:00': 'green'},
    'number': {'2020-01-01 00:00:00': 1,        '2020-01-02 00:00:00': 2,        '2020-01-03 00:00:00': 9},
}
df = pd.DataFrame(data)
print(f"{df}\n")
name color number
2020-01-01 00:00:00 Justin blue 1
2020-01-02 00:00:00 Justin red 2
2020-01-03 00:00:00 Donald green 9
final df
df = (
    df
    .reset_index(names="date")
    .pivot(index="date", columns="name", values=["color", "number"])
    .fillna("")
)
df.columns = ["_".join(x) for x in df.columns.values]
print(df)
                    color_Donald color_Justin number_Donald number_Justin
date
2020-01-01 00:00:00                      blue                           1
2020-01-02 00:00:00                       red                           2
2020-01-03 00:00:00        green                          9
The problem is the line:
df_merged = df1.join(df2, lsuffix="_left", rsuffix="_right", how='left')
Inside the loop, df_merged is reassigned on every iteration to the join of df1 with the current df2, so after the loop it only holds the join of df1 with the last df2, and Donald gets lost along the way.
To fix this, start df_merged from df1 and then, inside the loop, join df_merged (not df1) with each df2. An outer join is needed so that dates appearing only for later names survive; a left join would keep only df1's index and drop Donald's row.
Here is the full code with the changes (not tested):
# Identify names that are in the dataset
names = df['name'].unique().tolist()
# Define dataframe with first name
df1 = df[(df == names[0]).any(axis=1)]
df1 = df1.drop(['name'], axis=1)
df1 = df1.rename({'color':'color_'+str(names[0]), 'number':'number_'+str(names[0])}, axis=1)
# Make dataframes with the other names and their corresponding color and number
df_merged = df1.copy()  # start from df1 rather than an empty frame
for i in range(1, len(names)):
    df2 = df[(df == names[i]).any(axis=1)]
    df2 = df2.drop(['name'], axis=1)
    df2 = df2.rename({'color':'color_'+str(names[i]), 'number':'number_'+str(names[i])}, axis=1)
    # join the current df2 to df_merged:
    df_merged = df_merged.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
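As a runnable check of this approach on the sample df from the accepted answer (frame_for is a small helper introduced here for compactness; an outer join keeps dates seen only for later names):

```python
import pandas as pd

data = {'name':   {'2020-01-01': 'Justin', '2020-01-02': 'Justin', '2020-01-03': 'Donald'},
        'color':  {'2020-01-01': 'blue',   '2020-01-02': 'red',    '2020-01-03': 'green'},
        'number': {'2020-01-01': 1,        '2020-01-02': 2,        '2020-01-03': 9}}
df = pd.DataFrame(data)

names = df['name'].unique().tolist()

def frame_for(name):
    # Rows for one name, with the name column dropped and the
    # remaining columns suffixed with that name
    part = df[df['name'] == name].drop(columns=['name'])
    return part.rename(columns={'color': f'color_{name}', 'number': f'number_{name}'})

# Start from the first name's frame and outer-join the rest
df_merged = frame_for(names[0])
for name in names[1:]:
    df_merged = df_merged.join(frame_for(name), how='outer')

print(df_merged.columns.tolist())
```

The Donald columns (and Donald's date row) now survive the loop.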

How to filter column names from multiindex dataframe for a specific condition?

df1 = pd.DataFrame({
    "empid": [1, 2, 3, 4, 5, 6],
    "empname": ['a', 'b', 'c', 'd', 'e', 'f'],
    "empcity": ['aa', 'bb', 'cc', 'dd', 'ee', 'ff'],
})
df1
df2 = pd.DataFrame({
    "empid": [1, 2, 3, 4, 5, 6],
    "empname": ['a', 'b', 'm', 'd', 'n', 'f'],
    "empcity": ['aa', 'bb', 'cc', 'ddd', 'ee', 'fff'],
})
df2
df_all = pd.concat([df1.set_index('empid'),df2.set_index('empid')],axis='columns',keys=['first','second'])
df_all
df_final = df_all.swaplevel(axis = 'columns')[df1.columns[1:]]
df_final
orig = df1.columns[1:].tolist()
print (orig)
['empname', 'empcity']
df_final = (df_all.stack()
                  .assign(comparions=lambda x: x['first'].eq(x['second']))
                  .unstack()
                  .swaplevel(axis='columns')
                  .reindex(orig, axis=1, level=0))
print(df_final)
How can I filter the list of level-0 column names where comparions is False in df_final? (Consider that there are more than 300 columns like this at level 0.)
First, test whether the comparions values in each level-0 group are all True, using DataFrame.xs with DataFrame.all:
s = df_final.xs('comparions', level=1, axis=1).all()
Then invert the mask to keep the columns with at least one False, and filter the indices:
L = s.index[~s].tolist()
print (L)
['empname', 'empcity']
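End to end, with a smaller pair of frames than the question's (here only empname differs, so only it should be reported), the whole pipeline looks like this; the comparions spelling is kept from the question's code:

```python
import pandas as pd

df1 = pd.DataFrame({"empid": [1, 2], "empname": ["a", "b"], "empcity": ["aa", "bb"]})
df2 = pd.DataFrame({"empid": [1, 2], "empname": ["a", "x"], "empcity": ["aa", "bb"]})

df_all = pd.concat([df1.set_index("empid"), df2.set_index("empid")],
                   axis="columns", keys=["first", "second"])

# stack the field level, compare first vs second, unstack back
df_final = (df_all.stack()
                  .assign(comparions=lambda x: x["first"].eq(x["second"]))
                  .unstack()
                  .swaplevel(axis="columns")
                  .reindex(["empname", "empcity"], axis=1, level=0))

# Keep the level-0 names whose comparions sub-column is not all True
s = df_final.xs("comparions", level=1, axis=1).all()
L = s.index[~s].tolist()
print(L)
```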

Issues with append, merge and join for 3 different dataframe outputs from pandas with 1 index

I have 10000 data points that I'm sorting into a dictionary and then exporting to a CSV using pandas. I'm sorting temperatures, pressures and flow associated with a key. When I do this I get: https://imgur.com/a/aNX7RHf
but I want something like this: https://imgur.com/a/ZxJgPv4
I'm transposing my dataframe so the index can be rows, but in this case I want only 3 rows (1, 2 and 3), with all the data populating those rows.
flow_dictionary = {'200:P1F1': [5.5, 5.5, 5.5]}
pres_dictionary = {'200:PT02': [200, 200, 200],
                   '200:PT03': [200, 200, 200],
                   '200:PT06': [66, 66, 66],
                   '200:PT07': [66, 66, 66]}
temp_dictionary = {'200:TE02': [27, 27, 27],
                   '200:TE03': [79, 79, 79],
                   '200:TE06': [113, 113, 113],
                   '200:TE07': [32, 32, 32]}
df = pd.DataFrame.from_dict(temp_dictionary, orient='index').T
df2 = pd.DataFrame.from_dict(pres_dictionary, orient='index').T
df3 = pd.DataFrame.from_dict(flow_dictionary, orient='index').T
df = df.append(df2, ignore_index=False, sort=True)
df = df.append(df3, ignore_index=False, sort=True)
df.to_csv('processedSegmentedData.csv')
SOLUTION:
df1 = pd.DataFrame.from_dict(temp_dictionary, orient='index').T
df2 = pd.DataFrame.from_dict(pres_dictionary, orient='index').T
df3 = pd.DataFrame.from_dict(flow_dictionary, orient='index').T
df4 = pd.concat([df1,df2,df3], axis=1)
df4.to_csv('processedSegmentedData.csv')
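A quick check of the solution with a trimmed version of the question's dictionaries: concat on axis=1 places the three blocks side by side on the shared 0..2 row index, instead of stacking them vertically as append did.

```python
import pandas as pd

flow_dictionary = {'200:P1F1': [5.5, 5.5, 5.5]}
pres_dictionary = {'200:PT02': [200, 200, 200],
                   '200:PT06': [66, 66, 66]}
temp_dictionary = {'200:TE02': [27, 27, 27],
                   '200:TE07': [32, 32, 32]}

df1 = pd.DataFrame.from_dict(temp_dictionary, orient='index').T
df2 = pd.DataFrame.from_dict(pres_dictionary, orient='index').T
df3 = pd.DataFrame.from_dict(flow_dictionary, orient='index').T

# axis=1 places the blocks side by side instead of stacking them
df4 = pd.concat([df1, df2, df3], axis=1)
print(df4.shape)
```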

Pandas concatenate alternating columns

I have two dataframes as follows:
df2 = pd.DataFrame(np.random.randn(5,2),columns=['A','C'])
df3 = pd.DataFrame(np.random.randn(5,2),columns=['B','D'])
I wish to get the columns in an alternating fashion such that I get the result below:
df4 = pd.DataFrame()
for i in range(len(df2.columns)):
    df4[df2.columns[i]] = df2[df2.columns[i]]
    df4[df3.columns[i]] = df3[df3.columns[i]]
df4
A B C D
0 1.056889 0.494769 0.588765 0.846133
1 1.536102 2.015574 -1.279769 -0.378024
2 -0.097357 -0.886320 0.713624 -1.055808
3 -0.269585 -0.512070 0.755534 0.855884
4 -2.691672 -0.597245 1.023647 0.278428
I think I'm being really inefficient with this solution. What is the more pythonic/pandas way of doing this?
P.S. In my specific case the column names are not A, B, C, D and aren't alphabetically arranged; they're only named this way here so you can see which columns from the two dataframes I want interleaved.
If you need something more dynamic, first zip the column names of both DataFrames and then flatten the pairs:
df5 = pd.concat([df2, df3], axis=1)
print (df5)
A C B D
0 0.874226 -0.764478 1.022128 -1.209092
1 1.411708 -0.395135 -0.223004 0.124689
2 1.515223 -2.184020 0.316079 -0.137779
3 -0.554961 -0.149091 0.179390 -1.109159
4 0.666985 1.879810 0.406585 0.208084
#http://stackoverflow.com/a/10636583/2901002
print (list(sum(zip(df2.columns, df3.columns), ())))
['A', 'B', 'C', 'D']
print (df5[list(sum(zip(df2.columns, df3.columns), ()))])
A B C D
0 0.874226 1.022128 -0.764478 -1.209092
1 1.411708 -0.223004 -0.395135 0.124689
2 1.515223 0.316079 -2.184020 -0.137779
3 -0.554961 0.179390 -0.149091 -1.109159
4 0.666985 0.406585 1.879810 0.208084
How about this?
df4 = pd.concat([df2, df3], axis=1)
Or do they have to be in a specific order? Anyway, you can always reorder them:
df4 = df4[['A','B','C','D']]
And without writing out the columns:
df4 = df4[[item for items in zip(df2.columns, df3.columns) for item in items]]
You could concat and then reindex (reindex_axis was removed in later pandas versions):
df = pd.concat([df2, df3], axis=1)
df.reindex(df.columns[::2].tolist() + df.columns[1::2].tolist(), axis=1)
Append even indices to df2's columns and odd indices to df3's columns, then sort on that new level:
df2_ = df2.T.set_index(np.arange(len(df2.columns)) * 2, append=True).T
df3_ = df3.T.set_index(np.arange(len(df3.columns)) * 2 + 1, append=True).T
df = pd.concat([df2_, df3_], axis=1).sort_index(axis=1, level=1)
df.columns = df.columns.droplevel(1)
df
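For reference, the zip-and-flatten interleave from the first answer condenses to a few lines; fixed data here so the column order is checkable:

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2], 'C': [3, 4]})
df3 = pd.DataFrame({'B': [5, 6], 'D': [7, 8]})

merged = pd.concat([df2, df3], axis=1)  # columns: A C B D
# zip pairs the i-th columns of each frame; sum(..., ()) flattens the tuples
order = list(sum(zip(df2.columns, df3.columns), ()))
result = merged[order]
print(list(result.columns))
```

This works for any column names, as long as both frames have the same number of columns and the i-th columns belong together.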
