MultiIndex (multilevel) column names from Dataframe rows - python

I have a rather messy DataFrame in which I need to use the first 3 rows as multilevel column names.
This is my DataFrame, and I need the rows at index 3, 4 and 5 to become my MultiIndex column names.
For example, 'MINERAL TOTAL' should be level 0 until the next item, and 'TRATAMIENTO (ts)' should be level 1 until 'LEY Cu(%)' comes up.
What I actually need is to emulate what pandas.read_excel does when 'header' is specified with multiple rows.
Please help!
I am trying this, but no luck at all:
pd.DataFrame(data=df.iloc[3:, :].to_numpy(), columns=tuple(df.iloc[:3, :].to_numpy(dtype='str')))

You can pass a list of row indexes to the header argument and pandas will combine them into a MultiIndex.
import pandas as pd
df = pd.read_excel('ExcelFile.xlsx', header=[0,1,2])
By default, pandas reads in the top row as the sole header row. You can pass the header argument to pandas.read_excel() to indicate which rows are to be used as headers. This can be either an int or a list of ints. See the pandas.read_excel() documentation for more information.
As you mentioned, you are unable to use pandas.read_excel(). However, if you already have a DataFrame of the data you need, you can use pandas.MultiIndex.from_arrays(). First you would need to build an array of the header rows, which in your case would look something like:
array = [df.iloc[0].values, df.iloc[1].values, df.iloc[2].values]
df.columns = pd.MultiIndex.from_arrays(array)
The only issue here is this includes the "NaN" values in the new MultiIndex header. To get around this, you could create some function to clean and forward fill the lists that make up the array.
Although not the prettiest, nor the most efficient, this could look something like the following (off the top of my head):
def forward_fill(iterable):
    # forward fill so merged header cells repeat their label instead of NaN
    return pd.Series(iterable).ffill().to_list()

zero = forward_fill(df.iloc[0].to_list())
one = forward_fill(df.iloc[1].to_list())
two = forward_fill(df.iloc[2].to_list())
array = [zero, one, two]
df.columns = pd.MultiIndex.from_arrays(array)
You may also wish to drop the header rows (in this case rows 0, 1 and 2) and reindex the DataFrame.
df.drop(index=[0,1,2], inplace=True)
df.reset_index(drop=True, inplace=True)
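For reference, the whole approach put together on a small made-up frame (the labels below only loosely mirror the question's layout, so treat this as a sketch rather than the exact data) might look like:
import pandas as pd

def forward_fill(iterable):
    # repeat each header label to the right until the next one appears
    return pd.Series(iterable).ffill().to_list()

# toy frame: the first three rows hold the sparse header levels
df = pd.DataFrame([
    ['MINERAL TOTAL', None, None],
    ['TRATAMIENTO (ts)', None, 'LEY Cu(%)'],
    ['t', 's', '%'],
    [100, 200, 1.5],
    [110, 210, 1.6],
])

array = [forward_fill(df.iloc[i]) for i in range(3)]
out = df.iloc[3:].reset_index(drop=True)
out.columns = pd.MultiIndex.from_arrays(array)
print(out)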

Since column labels are also an Index, you can just transpose, forward fill, set the index levels, and transpose back.
df.T.fillna(method='ffill').set_index([3, 4, 5]).T
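A minimal sketch of how that plays out, on a made-up frame whose header levels sit at index 3, 4 and 5 as in the question (values are illustrative):
import pandas as pd

# toy frame: rows 3-5 carry the three header levels, rows 6-7 the data
df = pd.DataFrame(
    [['MINERAL TOTAL', None], ['TRATAMIENTO (ts)', None], ['t', 's'],
     [100, 200], [110, 210]],
    index=[3, 4, 5, 6, 7],
)

out = df.T.fillna(method='ffill').set_index([3, 4, 5]).T
print(out)  # rows 6 and 7 now sit under a three-level column MultiIndex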

Related

How to convert cells into columns in pandas? (python) [duplicate]

The problem is, when I transpose the DataFrame, the header of the transposed DataFrame becomes the numerical Index values and not the values in the "id" column. See the original data below for examples:
Original data that I wanted to transpose (but keep the 0,1,2,... Index intact and change "id" to "id2" in final transposed DataFrame).
DataFrame after I transpose, notice the headers are the Index values and NOT the "id" values (which is what I was expecting and needed)
Logic Flow
First this helped to get rid of the numerical index that got placed as the header: How to stop Pandas adding time to column title after transposing a datetime index?
Then this helped to get rid of the index numbers as the header: Reassigning index in pandas DataFrame
But now my "id" and "index" values got shuffled around for some reason.
How can I fix this so the columns are [id2, 600mpe, au565, ...]?
How can I do this more efficiently?
Here's my code:
DF = pd.read_table(data, sep="\t", index_col=[0]).transpose()  # index_col=[0] keeps the index from becoming its own row during transposition
m, n = DF.shape
DF.reset_index(drop=False, inplace=True)
DF.head()
This didn't help much: Add indexed column to DataFrame with pandas
If I understand your example, what happens is that the transpose takes your actual index (the 0...n sequence) as column headers. First, if you want to preserve the numerical index, you can store it as id2.
DF['id2'] = DF.index
Now if you want id to provide the column headers, you must set it as the index, overriding the default one:
DF.set_index('id',inplace=True)
DF.T
I don't have your data reproduced, but this should give you the values of id across columns.
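On a made-up frame (the id values and column names here are only illustrative), the two steps together look like:
import pandas as pd

DF = pd.DataFrame({'id': ['600mpe', 'au565'], 'geneA': [1, 2], 'geneB': [3, 4]})

DF['id2'] = DF.index          # keep the original 0..n index around
DF.set_index('id', inplace=True)
print(DF.T)                   # columns are now 600mpe and au565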

Read 1st column, 2nd column, and nth column to last column of pandas dataframe

I have a pandas dataframe df.
There are 27 columns in df.
I want to read the 1st, 2nd and 10th to the last columns of df. I can do this with df.iloc[:, [0, 1, 9, 10, 11, ....., 26]], but this is too tedious to type if the dataframe has many columns. What is a more elegant way to read the columns?
I am using python v3.7
If you want to select columns by their numerical index, iloc is the right thing to use. You can use np.arange to add a range of columns (such as from the 10th to the last one).
import pandas as pd
import numpy as np
cols = [0, 1]
cols.extend(np.arange(10, df.shape[1]))
df.iloc[:,cols]
Alternatively, you can use numpy's r_ slicing trick:
df.iloc[:,np.r_[0:2, 10:df.shape[1]]]
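np.r_ simply concatenates the slices into a single index array, for example:
import numpy as np

np.r_[0:2, 10:13]   # array([ 0,  1, 10, 11, 12])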
You can use "list" and "range":
df.iloc[:,[0,1]+list(range(9,27))]
Or numpy way:
df.iloc[:,np.append([0,1],np.arange(9,27))]
If you know the column names, you can try:
df = df[['col1', 'col2', 'coln']]
If you don't know the exact column names, you can try this:
list_of_columns_index = [1,2,3, n]
df = df[[df.columns[i] for i in list_of_columns_index]]
Suppose you know the name of the starting column (the 10th column in your context); assume it is starting_column_name.
Using the column name makes the code more readable and saves you the trouble of counting columns to get to the right one.
num_columns = df.shape[1] # number of columns in dataframe
starting_column = df.columns.get_loc(starting_column_name)
features = df.iloc[:, np.r_[0:2, starting_column:num_columns]]

Why are the original values lost when using the reindex method of pandas on a DataFrame?

This is the original DataFrame tols:
What I wanted: I wanted to convert the above DataFrame into this multi-indexed column DataFrame:
I managed to do it with this piece of code:
# tols : original dataframe
cols = pd.MultiIndex.from_product([['A','B'], ['Y','X'], ['P','Q']])
tols.set_axis(cols, axis=1, inplace=False)
What I tried: I tried to do this with the reindex method, like this:
cols = pd.MultiIndex.from_product([['A','B'], ['Y','X'], ['P','Q']])
tols.reindex(cols, axis='columns')
It resulted in an output like this:
My problem:
As you can see in the output above, all my original numerical values go missing when I employ the reindex method. The documentation clearly states:
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one.
So I don't understand:
Where exactly did I err in employing the reindex method, so that my original values were lost?
How should I have employed the reindex method correctly to get my desired output?
You need to assign new column names; the only requirement is that the number of columns in the original DataFrame matches the length of the MultiIndex:
tols.columns = pd.MultiIndex.from_product([['A','B'],['Y','X'], ['P','Q']])
The problem with DataFrame.reindex here is that pandas looks up the values of cols among the original column names, and because they are not found, the columns are set to missing values.
It is the intended behaviour, from the documentation:
Conform DataFrame to new index with optional filling logic, placing
NA/NaN in locations having no value in the previous index
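A small sketch of the difference, on a throwaway frame with eight columns so the lengths line up (the frame itself is made up):
import numpy as np
import pandas as pd

tols = pd.DataFrame(np.arange(16).reshape(2, 8))   # stand-in for the original frame
cols = pd.MultiIndex.from_product([['A', 'B'], ['Y', 'X'], ['P', 'Q']])

# reindex looks the tuples up among the existing labels 0..7 -> nothing matches, all NaN
print(tols.reindex(cols, axis='columns').isna().all().all())   # True

# assigning just relabels the existing columns, keeping every value
tols.columns = cols
print(tols)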

accessing specific columns of a dataframe, index specified by idxmax()

I have a dataframe row that I would like to access specific columns of. The index for this row is specified from a idxmax command.
idx_interest=(df['colA']==matchingstring).idxmax()
Using this index, I want to access specific columns, namely colB and colD, of the df at index=idx_interest.
df.loc[idx_interest,['colB','colD']].reset_index().values.tolist()
However, doing so gave me the error: cannot perform reduce on flexible type. How do I go about accessing columns of a df at an index given by an idxmax command?
You first need to apply your filter to your dataframe df correctly in order to get idx_interest. If your original dataframe has a MultiIndex, be mindful that this will return a tuple:
idx_interest = df[df['colA']==matchingstring].idxmax()
Now that you have idx_interest, you can limit your dataframe to the columns you want and then use .iloc to specify a row index:
df[['colB','colD']].iloc[idx_interest].values.tolist()
The code you provided above will also work, assuming idx_interest is an int:
df.loc[idx_interest,['colB','colD']].reset_index().values.tolist()

How to add values to a new column in pandas dataframe?

I want to create a new named column in a pandas DataFrame, insert the first value into it, and then add more values to the same column:
Something like:
import pandas
df = pandas.DataFrame()
df['New column'].append('a')
df['New column'].append('b')
df['New column'].append('c')
etc.
How do I do that?
If I understand correctly, you want to append a value to an existing column in a pandas DataFrame. The thing is, with DataFrames you need to maintain a matrix-like shape, so the number of rows must be equal for every column. What you can do is add a column with a default value and then update it row by row:
for index, row in df.iterrows():
    df.at[index, 'new_column'] = new_value
Don't do it, because it's slow:
updating an empty frame a single row at a time. I have seen this method used WAY too much. It is by far the slowest. It is probably commonplace (and reasonably fast for some Python structures), but a DataFrame does a fair number of checks on indexing, so updating a row at a time will always be very slow. Much better to create new structures and concat.
It is better to create a list of data and build the DataFrame with the constructor:
vals = ['a','b','c']
df = pandas.DataFrame({'New column':vals})
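If you really do need to grow the frame piece by piece, collecting the rows first and concatenating once is still far cheaper than updating one row at a time; a rough sketch (the values are placeholders for whatever your loop produces):
import pandas as pd

pieces = []
for value in ['a', 'b', 'c']:                      # whatever produces your values
    pieces.append(pd.DataFrame({'New column': [value]}))

df = pd.concat(pieces, ignore_index=True)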
If you need to add random values to the newly created column, you could also use:
import numpy as np
df['new_column'] = np.random.randint(1, 9, len(df))
