Pandas dataframe select row by index and column by name - python

Is there any way to select the row by index (i.e. integer) and column by column name in a pandas data frame?
I tried using loc but it raises an error, and I understand iloc only works with integer positions.
Here are the first rows of the data frame df. I want to select the first row of the column named 'Volume', and I tried df.loc[0, 'Volume'].

Use the get_loc method of the column Index to get the integer location of a column name.
Suppose this dataframe:
>>> df
    A  B  C
10  1  2  3
11  4  5  6
12  7  8  9
You can use .iloc like this:
>>> df.iloc[1, df.columns.get_loc('B')]
5
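As a minimal sketch, the snippet below rebuilds the example frame and shows the get_loc approach next to an alternative that converts the row position into a label for .loc. The variable names by_position and by_label are illustrative only.

```python
import pandas as pd

# Rebuild the example frame: the row labels (10, 11, 12) differ from the positions (0, 1, 2).
df = pd.DataFrame(
    {"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]},
    index=[10, 11, 12],
)

# Option 1: translate the column name into a position, then use .iloc.
by_position = df.iloc[1, df.columns.get_loc("B")]

# Option 2: translate the row position into a label, then use .loc.
by_label = df.loc[df.index[1], "B"]

print(by_position, by_label)
```

Both lookups address the same cell; which one reads better depends on whether the surrounding code works mostly with positions or with labels.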

Related

How to add new column to existing dataframe (no headers) using iloc?

I want to add a third column to an existing dataframe, with the same values as the second column, using the pandas iloc method.
What options do I have?
df = pd.DataFrame([*zip([1,2,3],[4,5,6])])
Here is one way to do it:
df.insert(len(df.columns),     # position of the new column: one past the last column
          len(df.columns),     # name of the new column: here, the same as its position
          value=df.iloc[:, 1]  # values copied from the second (here, last) column
          )
df
   0  1  2
0  1  4  4
1  2  5  5
2  3  6  6
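For comparison, plain column assignment appears to give the same result when the new column simply goes at the end. This is a sketch under that assumption; new_name is an illustrative variable, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame([*zip([1, 2, 3], [4, 5, 6])])

# With no headers the columns are labelled 0, 1, ..., so the next free label
# is the current number of columns.
new_name = len(df.columns)

# Assigning to a label that does not exist yet appends a new column at the end.
df[new_name] = df.iloc[:, 1]

print(df)
```

df.insert remains the better fit when the new column has to land somewhere other than the last position.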

How to create summary of multiple columns from multiple pandas dataframes?

I am trying to check for any loss of data in categorical columns (such as the data for an entire category) after data cleansing. I have 2 series that contain the number of unique values of each categorical column in the dataframes.
Before Data Cleansing
dataframe1.nunique()
Column 1    10
Column 2    20
After Data Cleansing
dataframe2.nunique()
Column 1    10
Column 2    15
Any idea how to get a table in the following format for better presentation? Both dataframes have the same columns, but not the same row count.
Column 1    10    10
Column 2    20    15
You can use the concat() method, where df1 and df2 are the two Series returned by nunique():
df=pd.concat([df1,df2],axis=1)
df.columns=['Unique Value Count_before','Unique Value Count_after']
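Or use the to_frame() and merge() methods (this assumes both Series are named 'Unique Value Count' and their index is named 'Column Name'):
df=df1.to_frame().merge(df2.to_frame(),on='Column Name',suffixes=('_before','_after'))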
Output:
Column Name  Unique Value Count_before  Unique Value Count_after
Column 1                            10                        10
Column 2                            20                        15
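A self-contained sketch of the concat approach follows. The two small frames and their values are made up for illustration, and the keys argument is used here to name the columns in one step instead of assigning df.columns afterwards.

```python
import pandas as pd

# Hypothetical data: same columns, different row counts.
dataframe1 = pd.DataFrame({"Column 1": ["a", "b", "c", "a"],
                           "Column 2": ["x", "y", "z", "w"]})
dataframe2 = pd.DataFrame({"Column 1": ["a", "b", "c"],
                           "Column 2": ["x", "y", "x"]})

# nunique() returns one Series per frame, indexed by column name.
# Concatenating along axis=1 aligns them on that index; keys become the column names.
summary = pd.concat(
    [dataframe1.nunique(), dataframe2.nunique()],
    axis=1,
    keys=["Unique Value Count_before", "Unique Value Count_after"],
)

print(summary)
```

Because the alignment happens on the index, a column present in only one frame would show up with a missing value rather than being dropped.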

Create a multiindex DataFrame from existing delimited column names

I have a pandas DataFrame that looks like the following
            A_value  A_avg  B_value  B_avg
date
2020-01-01        1      2        3      4
2020-02-01        5      6        7      8
and my goal is to create a MultiIndex DataFrame that looks like this:
                A          B
            value  avg  value  avg
date
2020-01-01      1    2      3    4
2020-02-01      5    6      7    8
So the part of the column name before the '_' should be the first level of the column index and the part after it the second level. The first part is unstructured; the second is always one of the same 4 endings.
I tried to solve it with pd.wide_to_long() but I think that is the wrong path, as I don't want to change the df itself. The real df is much larger, so creating it manually is not an option. I'm stuck here and did not find a solution.
You can split the column names on the delimiter with expand=True to create a MultiIndex:
df.columns = df.columns.str.split("_", expand=True)
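The sketch below rebuilds the example frame and applies the same idea. Since the question says the first part of the name is unstructured, it uses rsplit with n=1 so that only the last underscore is treated as the delimiter; that choice is an assumption about the real column names, not something the answer states.

```python
import pandas as pd

df = pd.DataFrame(
    {"A_value": [1, 5], "A_avg": [2, 6], "B_value": [3, 7], "B_avg": [4, 8]},
    index=pd.to_datetime(["2020-01-01", "2020-02-01"]),
)
df.index.name = "date"

# Split each column name once, from the right, and expand the pieces into
# the levels of a MultiIndex. The data itself is left untouched.
df.columns = df.columns.str.rsplit("_", n=1, expand=True)

print(df)
print(df["A"])  # selecting a first-level key returns the 'value' and 'avg' columns
```

With names that contain exactly one underscore, split and rsplit give the same result.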

Pandas: handle one-row dataframes with iloc

I am using a dataframe (df1) inside a loop to store information that I read from another dataframe (df2). df1 can have a different number of rows in every iteration. I store the data row by row using df1.loc[row_number]. This could be an example:
   a  b  c
0  9  2  3
1  8  5  6
2  3  8  9
Then I need to read the value of the first column and the first row, which I perform as
df1['a'].iloc[0]
9
The problem arises when df1 is a one row dataframe:
a    9
b    2
c    3
Name: 0, dtype: int64
It seems that with only one row, pandas stores the dataframe as a pandas series object. Trying to access the value in the same way ( df1['a'].iloc[0] ) I get the error:
AttributeError: 'numpy.int64' object has no attribute 'iloc'
Is there a way to solve this in a general case, with no need to handle the 1-row dataframe separately?
On a DataFrame, df1['a'] is the column 'a'. In the error case df1 is a Series, so there is no column named 'a': df1['a'] is a label lookup that returns the scalar itself (a numpy.int64), which has no .iloc. Try using df1.iloc[0] directly.
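One way to avoid the special case is to make sure the object is always two-dimensional before reading from it. The sketch below is written under that assumption; first_value is a hypothetical helper, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame({"a": [9, 8, 3], "b": [2, 5, 8], "c": [3, 6, 9]})

# df.iloc[0] returns a Series (one row, flattened), while df.iloc[[0]]
# returns a one-row DataFrame, so column access keeps working.
row_as_series = df.iloc[0]
row_as_frame = df.iloc[[0]]


def first_value(obj, column):
    """Return the first value of `column`, for a DataFrame or a single-row Series."""
    if isinstance(obj, pd.Series):
        # Turn the Series back into a one-row DataFrame.
        obj = obj.to_frame().T
    return obj[column].iloc[0]


print(first_value(df, "a"), first_value(row_as_series, "a"), first_value(row_as_frame, "a"))
```

Selecting rows with a list (or a slice) in the first place usually removes the need for the helper altogether.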

Multi Column DDPLY/R function in Pandas/Python

I have the following statement in R
library(plyr)
filteredData <- ddply(data, .(ID1, ID2), businessrule)
I am trying to use Python and Pandas to duplicate the action.
I have tried...
data['judge'] = data.groupby(['ID1','ID2']).apply(lambda x: businessrule(x))
But this raises the error:
incompatible index of inserted column with frame index
The error message can be reproduced with
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(4,3), columns=['ID1', 'ID2', 'val'])
df['new'] = df.groupby(['ID1', 'ID2']).apply(lambda x: x.values.sum())
# TypeError: incompatible index of inserted column with frame index
It is likely that your code raises an error for the same reason this toy example does.
The right-hand side is a Series with a 2-level MultiIndex:
ID1  ID2
0    1       3
3    4      12
6    7      21
9    10     30
dtype: int64
df['new'] = ... tells Pandas to assign this Series to a column in df.
But df has a single-level index:
   ID1  ID2  val
0    0    1    2
1    3    4    5
2    6    7    8
3    9   10   11
Because the single-level index is incompatible with the 2-level MultiIndex, the
assignment fails. In general, assigning the result of groupby/apply to a column
of df is only correct when the columns or levels you group by also happen to be
valid index keys in the original DataFrame, df.
Instead, assign the Series to a new variable, just like what the R code does:
filteredData = data.groupby(['ID1','ID2']).apply(businessrule)
Note that lambda x: businessrule(x) can be replaced with businessrule.
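A runnable sketch of this pattern on the toy frame is shown below. The body of businessrule is a stand-in (the real rule is not given in the question), the 'val' column is selected explicitly so the function only sees non-grouping columns, and the final join is an optional extra step for attaching the per-group result back to the rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["ID1", "ID2", "val"])


def businessrule(group):
    # Hypothetical rule: one number per group. `group` is a sub-DataFrame.
    return group["val"].sum()


# Keep the result in its own variable; it is indexed by (ID1, ID2), not by df's index.
filtered = df.groupby(["ID1", "ID2"])[["val"]].apply(businessrule)

# Optional: attach the per-group result to the original rows by matching
# the ID1/ID2 columns against the levels of the result's MultiIndex.
joined = df.join(filtered.rename("judge"), on=["ID1", "ID2"])

print(filtered)
print(joined)
```

The join works here because the key columns are named explicitly, which is the alignment that direct column assignment cannot infer on its own.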
