How to sort multindex columns in Pandas - python

I am trying to manipulate a bunch of multiindex Pandas array. Each column is a time series with different categorical groupings. I would like to sort through the data and then parse through all the categories and then do some additional data manipulation. Here is a sample code of what I was attempting but didn't work
import pandas as pd
import numpy as np
df=pd.DataFrame({'t': range(1,11)})
df.set_index(['t'],inplace=True)
for num in range(2):
labely = (str(num),'A','y')
labelx = (str(num),'A','x')
labelbx = (str(num),'B','x')
df[labelx]= np.random.randn(10)
df[labelbx]= np.random.randn(10)
df[labely]= np.random.randn(10)+range(1,11)
df.columns = pd.MultiIndex.from_tuples(df.columns, names=['ID','Location','Direction'])
df[('0','A','tot')]=df[('0','A','y')]+df[('0','A','x')]
df.sort_index(level='ID',inplace=True)
df.head()
This doesn't sort. This is the result with the total not being grouped with the other 0 ID and the Locations not being grouped together:
ID 0 ... 1 0
Location A B A ... B A A
Direction x x y ... x y tot
t ...
1 0.430386 -0.121109 0.263314 ... 0.243839 0.313505 0.693700
2 -1.262746 -0.678889 1.289814 ... -0.893230 0.373103 0.027068
3 0.245483 -0.565859 3.766628 ... 0.012933 1.652484 4.012111
4 1.518357 0.447032 5.649877 ... -1.205161 5.513507 7.168233
5 -0.095216 -0.571333 6.794958 ... -0.777933 4.073334 6.699741
I have 2 questions associated with this.
How to I sort the columns so that the organized by each of the
levels
How do I efficiently parse through the dataframe to do
additional data manipulation .
This is some sudo code for the second questions
for id in ID:
for loc in Location:
df[(id,loc,'tot')=df[(id,loc,'x')]+df[(id,loc,'y')]

To sort by columns, as Ian answered axis=1:
df.sort_index(level='ID',axis=1,inplace=True)
To get a list of the tuples of the unique column names to parse, I needed columns.values and then I resorted after calculations.
for id,loc,dir in df.columns.values:
df[(id,loc,'tot')]=(df[(id,loc,'x')]**2+df[(id,loc,'y')]**2)**.5
df.sort_index(level='ID',axis=1,inplace=True)
Since this is basic column calculations I think this method would be efficient.

Related

How to do a double bucle to complete missing information using different dataframe

I am trying to complete missing information in some rows from a column in a dataframe, using another dataframe. I have in the first df(dfPivote), two columns of interest 'Entrega' and 'Transportador' which is the one with missing information. I have a second df (dfTransportadoEntregadoFaltante) with two columns of interest 'EntregaBusqueda' which is the key to my other df, and 'Transportador' with the information missing from the other df. I have the following code, and it is not working. How could I solve this problem?
I would recommend using dataframe operations to fill in missing values. If I've followed your example code correctly, I think you're trying to do something like this:
import pandas as pd
import numpy as np
# Create fake data
# "dfPivote" dataframe with an empty string in the "Transportador" column:
dfPivote = pd.DataFrame({'Entrega':[1,2,3],'Transportador':['a','','c']})
# "dfTransportadoEntregadoFaltante" lookup dataframe
dfTransportadoEntregadoFaltante = pd.DataFrame({'EntregaBusqueda':[1,2,3], 'Transportador':['a','b','c']})
# 1. Replace empty strings in dfPivote['Transportador'] with np.nan values:
dfPivote['Transportador'] = dfPivote['Transportador'].apply(lambda x: np.nan if len(x)==0 else x)
# 2. Merge the two dataframes together on the "Entrega" and "EntregaBusqueda" columns respectively:
df = dfPivote.merge(dfTransportadoEntregadoFaltante, left_on='Entrega', right_on='EntregaBusqueda', how='left')
# Entrega Transportador_x EntregaBusqueda Transportador_y
# 1 a 1 a
# 2 NaN 2 b
# 3 c 3 c
# 3. Fill NaNs in "Transportador_x" column with corresponding values in "Transportador_y" column:
df['Transportador_x'] = df['Transportador_x'].fillna(df['Transportador_y'])
# Entrega Transportador_x EntregaBusqueda Transportador_y
# 1 a 1 a
# 2 b 2 b
# 3 c 3 c

Element-by-element division in pandas dataframe with "/"?

Would be great to understand how this actually work. Perhaps there is something in Python/Pandas that I don't quite understand.
I have a dataframe (price data) and would like to calculate the returns. Rows are the stocks while columns are the dates.
For simplicity, I have created the prices with some random numbers.
import pandas as pd
import numpy as np
df_price = pd.DataFrame(np.random.rand(10,10))
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1]-1
There are two things are find it strange here:
My numerator and denominator are both 10 x 9. Why the output is a 10 x 10 with the first column being nans.
Why the results are all 0 besides the first columns being nans. i.e. why the calculation didn't perform?
Thanks.
When we do the div, we need to consider the index and columns for both df_price[:,1:] and df_price.iloc[:,:-1], matched firstly, so we need to add the .values to remove the index and column match first, then the output will perform what we expected.
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1].values-1
Example
s=pd.Series([2,4,6])
s.iloc[1:]/s.iloc[:-1]
Out[54]:
0 NaN # here the index s.iloc[:-1] included
1 1.0
2 NaN # here the index s.iloc[1:] included
dtype: float64
From above we can say , the pandas object , match the index first , and more like a outer match.

Converting several 2D DataFrames into one 3D DataFrame

I am trying to convert a list of 2d-dataframes into one large dataframe. Lets assume I have the following example, where I create a set of dataframes, each one having the same columns / index:
import pandas as pd
import numpy as np
frames = []
names = []
frame_columns = ['DataPoint1', 'DataPoint2']
for i in range(5):
names.append("DataSet{0}".format(i))
frames.append(pd.DataFrame(np.random.randn(3, 2), columns=frame_columns))
I would like to convert this set of dataframes into one dataframe df which I can access using df['DataSet0']['DataPoint1'].
This dataset would have to have a multi-index consisting of the product of ['DataPoint1', 'DataPoint2'] and the index of the individual dataframes (which is of course the same for all individual frames).
Conversely, the columns would be given as the product of ['Dataset0', ...] and ['DataPoint1', 'DataPoint2'].
In either case, I can create a corresponding MultiIndex and derive an (empty) dataframe based on that:
mux = pd.MultiIndex.from_product([names, frames[0].columns])
frame = pd.DataFrame(index=mux).T
However, I would like to have the contents of the dataframes present rather than having to then add them.
Note that a similar question has been asked here. However, the answers seem to revolve around the Panel class, which is, as of now, deprecated.
Similarly, this thread suggests a join, which is not really what I need.
You can use concat with keys:
total_frame = pd.concat(frames, keys=names)
Output:
DataPoint1 DataPoint2
DataSet0 0 -0.656758 1.776027
1 -0.940759 1.355495
2 0.173670 0.274525
DataSet1 0 -0.744456 -1.057482
1 0.186901 0.806281
2 0.148567 -1.065477
DataSet2 0 -0.980312 -0.487479
1 2.117227 -0.511628
2 0.093718 -0.514379
DataSet3 0 0.046963 -0.563041
1 -0.663800 -1.130751
2 -1.446891 0.879479
DataSet4 0 1.586213 1.552048
1 0.196841 1.933362
2 -0.545256 0.387289
Then you can extract Dataset0 by:
total_frame.loc['DataSet0']
If you really want to use MultiIndex columns instead, you can add axis=1 to concat:
total_frame = pd.concat(frames, axis=1, keys=names)

How to fill pandas dataframes in a loop?

I am trying to build a subset of dataframes from a larger dataframe by searching for a string in the column headings.
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for well in wells:
wellname = well
well = pd.DataFrame()
well_cols = [col for col in cdf.columns if wellname in col]
well = cdf[well_cols]
I am trying to search for the wellname in the cdf dataframe columns and put those columns which contain that wellname into a new dataframe named the wellname.
I am able to build my new sub dataframes but the dataframes come up empty of size (0, 0) while cdf is (21973, 91).
well_cols also populates correctly as a list.
These are some of cdf column headings. Each column has 20k rows of data.
Index(['N1_Inj_Casing_Gas_Valve', 'N1_LT_Stm_Rate', 'N1_ST_Stm_Rate',
'N1_Inj_Casing_Gas_Flow_Rate', 'N1_LT_Stm_Valve', 'N1_ST_Stm_Valve',
'N1_LT_Stm_Pressure', 'N1_ST_Stm_Pressure', 'N1_Bubble_Tube_Pressure',
'N1_Inj_Casing_Gas_Pressure', 'N2_Inj_Casing_Gas_Valve',
'N2_LT_Stm_Rate', 'N2_ST_Stm_Rate', 'N2_Inj_Casing_Gas_Flow_Rate',
'N2_LT_Stm_Valve', 'N2_ST_Stm_Valve', 'N2_LT_Stm_Pressure',
'N2_ST_Stm_Pressure', 'N2_Bubble_Tube_Pressure',
'N2_Inj_Casing_Gas_Pressure', 'N3_Inj_Casing_Gas_Valve',
'N3_LT_Stm_Rate', 'N3_ST_Stm_Rate', 'N3_Inj_Casing_Gas_Flow_Rate',
'N3_LT_Stm_Valve', 'N3_ST_Stm_Valve', 'N3_LT_Stm_Pressure',
I want to create a new dataframe with every heading that contains the "well" IE a new dataframe for all columns & data with column name containing N1, another for N2 etc.
The New dataframes populate correctly when inside the loop but disappear when the loop breaks... a bit of the code output for print(well):
[27884 rows x 10 columns]
N9_Inj_Casing_Gas_Valve ... N9_Inj_Casing_Gas_Pressure
0 74.375000 ... 2485.602364
1 74.520833 ... 2485.346000
2 74.437500 ... 2485.341091
IIUC this should be enough:
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dict={}
for well in wells:
well_cols = [col for col in cdf.columns if well in col]
well_dict[well] = cdf[well_cols]
Dictionaries are usually the way to go if you want to populate something. In this case, then, if you input well_dict['N1'], you'll get your first dataframe, and so on.
The elements of an array are not mutable when iterating over it. That is, here's what it's doing based on your example:
# 1st iteration
well = 'N1' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N1' in name> # assigned by `well = cdf[well_cols]`
# 2nd iteration
well = 'N2' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N2' in name> # assigned by `well = cdf[well_cols]`
...
But at no point did you change the array, or store the new dataframes for that matter (although you would still have the last dataframe stored in well at the end of the iteration).
IMO, it seems like storing the dataframes in a dict would be easier to use:
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dfs = {}
for well in wells:
well_cols = [col for col in cdf.columns if well in col]
well_dfs[well] = cdf[well_cols]
However, if you really want it in a list, you could do something like:
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for ix, well in enumerate(wells):
well_cols = [col for col in cdf.columns if well in col]
wells[ix] = cdf[well_cols]
One way to approach the problem is to use pd.MultiIndex and Groupby.
You can add the construct a MultiIndex composed of well identifier and variable name. If you have df:
N1_a N1_b N2_a N2_b
1 2 2 3 4
2 7 8 9 10
You can use df.columns.str.split('_', expand=True) to parse the well identifer corresponding variable name (i.e. a or b).
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(1)
Which returns:
N1 N2
a b a b
0 2 2 3 4
1 7 8 9 10
Then you can transpose the data frame and groupby the MultiIndex level 0.
grouped = df.T.groupby(level=0)
To return a list of untransposed sub-data frames you can use:
wells = [group.T for _, group in grouped]
where wells[0] is:
N1
a b
0 2 2
1 7 8
and wells[1] is:
N2
a b
0 3 4
1 9 10
The last step is rather unnecessary because the data can be accessed from the grouped object grouped.
All together:
import pandas as pd
from io import StringIO
data = """
N1_a,N1_b,N2_a,N2_b
1,2,2,3,4
2,7,8,9,10
"""
df = pd.read_csv(StringIO(data))
# Parse Column names to add well name to multiindex level
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(1)
# Group by well name
grouped = df.T.groupby(level=0)
#bulist list of sub dataframes
wells = [group.T for _, group in grouped]
Using contains
df[df.columns.str.contains('|'.join(wells))]

Taking a proportion of a dataframe based on column values

I have a Pandas dataframe with ~50,000 rows and I want to randomly select a proportion of rows from that dataframe based on a number of conditions. Specifically, I have a column called 'type of use' and, for each field in that column, I want to select a different proportion of rows.
For instance:
df[df['type of use'] == 'housing'].sample(frac=0.2)
This code returns 20% of all the rows which have 'housing' as their 'type of use'. The problem is I do not know how to do this for the remaining fields in a way that is 'idiomatic'. I also do not know how I could take the result from this sampling to form a new dataframe.
You can make a unique list for all the values in the column by list(df['type of use'].unique()) and iterate like below:
for i in list(df['type of use'].unique()):
print(df[df['type of use'] == i].sample(frac=0.2))
or
i = 0
while i < len(list(df['type of use'].unique())):
df1 = df[(df['type of use']==list(df['type of use'].unique())[i])].sample(frac=0.2)
print(df1.head())
i = i + 1
For storing you can create a dictionary:
dfs = ['df' + str(x) for x in list(df2['type of use'].unique())]
dicdf = dict()
i = 0
while i < len(dfs):
dicdf[dfs[i]] = df[(df['type of use']==list(df2['type of use'].unique())[i])].sample(frac=0.2)
i = i + 1
print(dicdf)
This will print a dictionary of the dataframes.
You can print what you like to see for example for housing sample : print (dicdf['dfhousing'])
Sorry this is coming in 2+ years late, but I think you can do this without iterating, based on help I received to a similar question here. Applying it to your data:
import pandas as pd
import math
percentage_to_flag = 0.2 #I'm assuming you want the same %age for all 'types of use'?
#First, create a new 'helper' dataframe:
random_state = 41 # Change to get different random values.
df_sample = df.groupby("type of use").apply(lambda x: x.sample(n=(math.ceil(percentage_to_flag * len(x))),random_state=random_state))
df_sample = df_sample.reset_index(level=0, drop=True) #may need this to simplify multi-index dataframe
# Now, mark the random sample in a new column in the original dataframe:
df["marked"] = False
df.loc[df_sample.index, "marked"] = True

Categories