accessing specific columns of a dataframe, index specified by idxmax() - python

I have a dataframe row that I would like to access specific columns of. The index for this row comes from an idxmax() call:
idx_interest=(df['colA']==matchingstring).idxmax()
Using this index, I want to access specific columns, namely colB and colD, of the df at index=idx_interest:
df.loc[idx_interest,['colB','colD']].reset_index().values.tolist()
However, doing so gave me the error: cannot perform reduce on flexible type. How do I go about accessing columns of a df at an index given by an idxmax command?

You first need to apply your filter correctly in order to get idx_interest back. If your original dataframe has a MultiIndex, be mindful that this will return a tuple:
idx_interest = (df['colA'] == matchingstring).idxmax()
Now that you have idx_interest, you can limit your dataframe to the columns you want and then use .iloc[] to specify a row position:
df[['colB','colD']].iloc[idx_interest].values.tolist()
The code you provided above will also work, assuming that idx_interest is an int:
df.loc[idx_interest,['colB','colD']].reset_index().values.tolist()
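For reference, here is a minimal, self-contained sketch of the whole flow; the data and the matchingstring value are invented stand-ins for the question's setup:
import pandas as pd

df = pd.DataFrame({
    'colA': ['x', 'match', 'y'],
    'colB': [1, 2, 3],
    'colC': [4, 5, 6],
    'colD': [7, 8, 9],
})
matchingstring = 'match'  # hypothetical target value

# idxmax() on a boolean Series returns the label of the first True,
# i.e. the first row where colA equals the target string.
idx_interest = (df['colA'] == matchingstring).idxmax()

# .loc with a single label and a column list returns a Series,
# so .tolist() is enough; no reset_index() is needed.
print(df.loc[idx_interest, ['colB', 'colD']].tolist())  # [2, 8]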

Related

Group By and ILOC Errors

I'm getting the following error when trying to group by and sum my dataframe by specific columns.
ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
I've checked other solutions and it's not a double column name header issue.
See df3 below: I want to group by all columns except the last two, which I want to sum().
df's head shows that grouping by the column names directly works fine, but not with iloc, which I understood to be the correct way to pull back the columns I want to group by.
I need to use iloc because the final dataframe will have many more columns.
df.iloc[:,0:3] returns a dataframe, so you are trying to group a dataframe by another dataframe.
But you just need a list of column names.
Try this:
dfs = df3.groupby(list(df3.iloc[:,0:3].columns))[['Churn_Alive_1','Churn_Alive_0']].sum()
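As a sanity check, a small runnable sketch; the first three column names below are invented, and only the two sum targets come from the question:
import pandas as pd

df3 = pd.DataFrame({
    'region':  ['N', 'N', 'S'],   # hypothetical grouping columns
    'product': ['a', 'a', 'b'],
    'segment': ['x', 'x', 'y'],
    'Churn_Alive_1': [1, 2, 3],
    'Churn_Alive_0': [4, 5, 6],
})

# The positional slice yields column *names*, which groupby accepts as a list.
group_cols = list(df3.iloc[:, 0:3].columns)
dfs = df3.groupby(group_cols)[['Churn_Alive_1', 'Churn_Alive_0']].sum()
print(dfs)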

MultiIndex (multilevel) column names from Dataframe rows

I have a rather messy dataframe in which I need to assign the first 3 rows as multilevel column names.
This is my dataframe and I need index 3, 4 and 5 to be my multiindex column names.
For example, 'MINERAL TOTAL' should be the level 0 label until the next item; 'TRATAMIENTO (ts)' should be level 1 until 'LEY Cu(%)' comes up.
What I need actually is try to emulate what pandas.read_excel does when 'header' is specified with multiple rows.
Please help!
I am trying this, but no luck at all:
pd.DataFrame(data=df.iloc[3:, :].to_numpy(), columns=tuple(df.iloc[:3, :].to_numpy(dtype='str')))
You can pass a list of row indexes to the header argument and pandas will combine them into a MultiIndex.
import pandas as pd
df = pd.read_excel('ExcelFile.xlsx', header=[0,1,2])
By default, pandas reads in the top row as the sole header row. You can pass the header argument to pandas.read_excel() to indicate which rows are to be used as headers; this can be either an int or a list of ints. See the pandas.read_excel() documentation for more information.
As you mentioned you are unable to use pandas.read_excel(). However, if you do already have a DataFrame of the data you need, you can use pandas.MultiIndex.from_arrays(). First you would need to specify an array of the header rows which in your case would look something like:
array = [df.iloc[0].values, df.iloc[1].values, df.iloc[2].values]
df.columns = pd.MultiIndex.from_arrays(array)
The only issue here is this includes the "NaN" values in the new MultiIndex header. To get around this, you could create some function to clean and forward fill the lists that make up the array.
Although not the prettiest, nor the most efficient, this could look something like the following (off the top of my head):
def forward_fill(iterable):
    # Forward-fill the NaN gaps left by merged header cells.
    return pd.Series(iterable).ffill().to_list()

zero = forward_fill(df.iloc[0].to_list())
one = forward_fill(df.iloc[1].to_list())
two = forward_fill(df.iloc[2].to_list())

array = [zero, one, two]
df.columns = pd.MultiIndex.from_arrays(array)
You may also wish to drop the header rows (in this case rows 0, 1 and 2) and reset the index of the DataFrame.
df.drop(index=[0,1,2], inplace=True)
df.reset_index(drop=True, inplace=True)
Since columns are also indices, you can just transpose, set index levels, and transpose back.
df.T.ffill().set_index([3, 4, 5]).T
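Putting the pieces above together, here is a minimal sketch of promoting the first three rows to a forward-filled MultiIndex; the frame below is an invented stand-in that borrows two labels from the question:
import pandas as pd
import numpy as np

df = pd.DataFrame([
    ['MINERAL TOTAL', np.nan, np.nan],
    ['TRATAMIENTO (ts)', np.nan, 'LEY Cu(%)'],
    ['a', 'b', 'c'],
    [1, 2, 3],
    [4, 5, 6],
])

# Forward-fill each header row so merged-cell gaps repeat the label,
# build the MultiIndex, then drop the header rows from the body.
levels = [df.iloc[i].ffill().to_list() for i in range(3)]
out = df.iloc[3:].reset_index(drop=True)
out.columns = pd.MultiIndex.from_arrays(levels)
print(out)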

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR; they are merely similar (not equal) in this ID column, and the rest of the columns are different, as is the number of rows.
How can I get the rows from dataframe MR['ID'] that are equal to the dataframe DT['ID']? Knowing that values in 'ID' can appear several times in the same column.
(DT is 1538 rows and MR is 2060 rows.)
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods proposed there (and my goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or if you want to just get a new dataframe of combined records for the same ID you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
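To make the two options concrete, a small sketch with invented toy frames where only the shared 'ID' column matters:
import pandas as pd

MR = pd.DataFrame({'ID': [1, 2, 2, 3], 'mr_val': ['a', 'b', 'c', 'd']})
DT = pd.DataFrame({'ID': [2, 3], 'dt_val': ['x', 'y']})

# Keep MR rows whose ID also appears in DT (repeated IDs in MR survive).
matching_id = MR['ID'].isin(DT['ID'])
print(MR.loc[matching_id, :])   # rows with ID 2, 2, 3

# Or combine the columns of both frames, matched on ID.
print(pd.merge(MR, DT, on='ID'))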

Averaging data of dataframe columns based on redundancy of another column

I want to average the data of one column in a pandas dataframe if the rows share the same 'id', which is stored in another column of the same dataframe. To keep it simple, I have:
and I want:
where it is clear that the elements of the 'nx' and 'ny' columns have been averaged whenever the value of 'nodes' was the same. The column 'maille', on the other hand, has to remain untouched.
I'm trying with groupby but so far couldn't manage to keep the column 'maille' as it is.
Any idea?
Use GroupBy.transform, specifying the column names to aggregate in a list, and assign the result back:
cols = ['nx','ny']
df[cols] = df.groupby('nodes')[cols].transform('mean')
print (df)
Another idea with DataFrame.update:
df.update(df.groupby('nodes')[cols].transform('mean'))
print (df)
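Here is a runnable sketch with invented data, where two rows share nodes=1 and so have their nx/ny values averaged while 'maille' is left alone:
import pandas as pd

df = pd.DataFrame({
    'maille': ['m1', 'm2', 'm3'],
    'nodes':  [1, 1, 2],
    'nx':     [0.0, 2.0, 5.0],
    'ny':     [1.0, 3.0, 7.0],
})

cols = ['nx', 'ny']
# transform('mean') returns a frame aligned to the original rows, so
# assigning it back overwrites nx/ny while leaving 'maille' untouched.
df[cols] = df.groupby('nodes')[cols].transform('mean')
print(df)   # nx -> [1.0, 1.0, 5.0], ny -> [2.0, 2.0, 7.0]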

Subsetting dataframe via a list

I dummified one column in my data frame using get_dummies, but that produced an additional 400 columns. The issue is that I would like to subset the data frame, which now has over 700 columns, to run the operation below:
replace([np.inf, -np.inf], np.nan).dropna()
I tried isolating the new columns generated by get_dummies by storing them in a list, which I initialized as col1.
When I tried to subset the df using
df = df[['var1','var2','var3',[col1] ]]
I got an error message saying "ValueError: setting an array element with a sequence".
Is there a way to go about subsetting the new dummies without having to type them all out when subsetting?
You can use an asterisk to unpack your list in the column selection.
Otherwise you're passing your list in as a sublist inside the column list, so your current method becomes:
df[['var1','var2','var3',['sub1','sub2','sub3']]]
But:
df = df[['var1','var2','var3',*col1]]
is unpacked to
df[['var1','var2','var3','sub1','sub2','sub3']]
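A short end-to-end sketch; the column names and the 'cat' dummy source are invented for illustration:
import pandas as pd

df = pd.DataFrame({'var1': [1, 2], 'var2': [3, 4],
                   'var3': [5, 6], 'cat': ['a', 'b']})
df = pd.concat([df, pd.get_dummies(df['cat'], prefix='cat')], axis=1)

# Collect the generated dummy column names instead of typing them out.
col1 = [c for c in df.columns if c.startswith('cat_')]

# Unpack col1 so the selection is one flat list of column names.
subset = df[['var1', 'var2', 'var3', *col1]]
print(subset.columns.tolist())  # ['var1', 'var2', 'var3', 'cat_a', 'cat_b']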
