I have a dataframe with 247 columns. Many of the column names contain "_id". How do I drop all columns whose names contain "_id"?
This is pretty straightforward. Build a boolean mask of the columns that contain "_id", invert it, and use .loc to keep only the remaining columns.
df = df.loc[:, ~df.columns.str.contains("_id")]
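For example, a quick sanity check on a made-up frame (the column names here are just placeholders):

import pandas as pd

df = pd.DataFrame(columns=["user_id", "order_id", "amount", "date"])
df = df.loc[:, ~df.columns.str.contains("_id")]
print(df.columns.tolist())   # ['amount', 'date']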
Try this:
df.drop(list(df.filter(like='_id')), axis=1, inplace=True)
What this code does is:
It uses df.filter(like='_id') to select every column that has "_id" anywhere in its name, and then drops all of those columns in place.
Let me know if anything is unclear or if you need more help with this.
I have a rather messy dataframe in which I need to use the first 3 rows as multilevel column names.
This is my dataframe and I need index 3, 4 and 5 to be my multiindex column names.
For example, 'MINERAL TOTAL' should be the level 0 until next item; 'TRATAMIENTO (ts)' should be level 1 until 'LEY Cu(%)' comes up.
What I actually need is to emulate what pandas.read_excel does when 'header' is specified with multiple rows.
Please help!
I am trying this, but no luck at all:
pd.DataFrame(data=df.iloc[3:, :].to_numpy(), columns=tuple(df.iloc[:3, :].to_numpy(dtype='str')))
You can pass a list of row indexes to the header argument and pandas will combine them into a MultiIndex.
import pandas as pd
df = pd.read_excel('ExcelFile.xlsx', header=[0,1,2])
By default, pandas reads the top row as the sole header row. The header argument to pandas.read_excel() indicates which rows are to be used as headers; it can be either an int or a list of ints. See the pandas.read_excel() documentation for more information.
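Once read in like this, the columns form a MultiIndex, so a single column is selected with a tuple of labels spanning all three levels (the names below are just placeholders for whatever your real headers are):

df[('LEVEL0_NAME', 'LEVEL1_NAME', 'LEVEL2_NAME')]   # one fully qualified column
df['LEVEL0_NAME']                                    # every sub-column under one level-0 label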
As you mentioned, you are unable to use pandas.read_excel(). However, if you already have a DataFrame of the data you need, you can use pandas.MultiIndex.from_arrays(). First you would need to build an array of the header rows, which in your case would look something like:
array = [df.iloc[0].values, df.iloc[1].values, df.iloc[2].values]
df.columns = pd.MultiIndex.from_arrays(array)
The only issue here is that this includes the NaN values in the new MultiIndex header. To get around this, you could write a small function to clean and forward-fill the lists that make up the array.
Although not the prettiest, nor the most efficient, this could look something like the following (off the top of my head):
def forward_fill(iterable):
    return pd.Series(iterable).ffill().to_list()
zero = forward_fill(df.iloc[0].to_list())
one = forward_fill(df.iloc[1].to_list())
two = forward_fill(df.iloc[2].to_list())
array = [zero, one, two]
df.columns = pd.MultiIndex.from_arrays(array)
You may also wish to drop the header rows (in this case rows 0, 1 and 2) and reindex the DataFrame.
df.drop(index=[0,1,2], inplace=True)
df.reset_index(drop=True, inplace=True)
Since columns are also indices, you can just transpose, set index levels, and transpose back.
df.T.ffill().set_index([3, 4, 5]).T
I have the following dataframe:
And I was wondering how to get:
As you can see, the blue rows are subrows, and the idea is to group them together depending on the name:
I tried:
DFTest= pd.read_excel("XXXXXXXXXXX/Test.xlsx")
DFTest.groupby(['Name'], as_index=False).sum().reset_index(drop=True)
But this deletes the blank rows (0, 1, 2, 5, 6, 7).
How would I group subrows together and keep the blank rows as they are?
This does the job:
import numpy as np
grouped_df = df.groupby("Name", as_index=False)
df_sum = grouped_df.agg(np.sum)
new_df = pd.concat([df[df["Numb2"].isna()], df_sum])
First I compute the sum of the Numb2 values per Name group, and then concatenate that result with the rows that have a NaN value in the Numb2 column.
This dataframe won't be ordered the same as the one in the image you shared, but I don't think that'll be a problem.
But if it is a problem, then use the code below to get the dataframe sorted:
new_df.sort_values(by = "Name")
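Putting it together on some made-up data (I'm guessing at the column names Name, Numb1 and Numb2 from your screenshot, so adjust as needed):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "A", "B", "B"],
    "Numb1": [1, 1, 2, 2],
    "Numb2": [np.nan, 10, np.nan, 20],
})

df_sum = df.groupby("Name", as_index=False).agg(np.sum)
new_df = pd.concat([df[df["Numb2"].isna()], df_sum]).sort_values(by="Name")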
I hope this helped you!
This is the original dataframe tols:
What I wanted: I wanted to convert the above dataframe into this multi-indexed column dataframe:
I managed to do it with this piece of code:
# tols : original dataframe
cols = pd.MultiIndex.from_product([['A','B'], ['Y','X'],
                                   ['P','Q']])
tols.set_axis(cols, axis = 1, inplace = False)
What I tried: I tried to do this with the reindex method, like this:
cols = pd.MultiIndex.from_product([['A','B'],['Y','X'],
['P','Q']])
tols.reindex(cols, axis = 'columns')
It resulted in an output like this:
My problem:
As you can see in the output above, all my original numerical values go missing when employing the reindex method. The documentation page clearly states:
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one.
So I don't understand:
Where did I err in employing the reindex method, such that I lost my original values?
How should I have employed the reindex method correctly to get my desired output?
You need to assign new column names; the only requirement is that the length of the MultiIndex matches the number of columns in the original DataFrame:
tols.columns = pd.MultiIndex.from_product([['A','B'],['Y','X'], ['P','Q']])
The problem with DataFrame.reindex here is that pandas looks up the values of cols among the original column names, and because they are not found, the values are set to missing.
It is the intended behaviour, from the documentation:
Conform DataFrame to new index with optional filling logic, placing
NA/NaN in locations having no value in the previous index
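A small sketch of the difference, on a made-up two-column frame:

import pandas as pd

tols = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
cols = pd.MultiIndex.from_product([['A'], ['X', 'Y']])

tols.reindex(cols, axis='columns')   # all NaN: ('A','X') and ('A','Y') don't exist in the old columns
tols.columns = cols                  # relabels the existing columns, data is kept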
I have a dataframe that I've imported as follows.
df = pd.read_excel("./Data.xlsx", sheet_name="Customer Care", header=None)
I would like to set the first three rows as column headers but can't figure out how to do this. I gave the following a try:
df.columns = df.iloc[0:3,:]
but that doesn't seem to work.
I saw something similar in this answer. But it only applies if all sub columns are going to be named the same way, which is not necessarily the case.
Any recommendations would be appreciated.
df = pd.read_excel(
    "./Data.xlsx",
    sheet_name="Customer Care",
    header=[0,1,2]
)
This tells pandas to read the first three rows of the Excel file as MultiIndex column labels.
If you instead want to load the rows first and then set them as columns afterwards:
# set the first three rows as MultiIndex column labels
df.columns = pd.MultiIndex.from_arrays(df.iloc[0:3].values)
# drop the first three rows (they are now the columns)
df = df.iloc[3:]
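If you also want the row index to start from zero again after slicing off the header rows, reset it:

df = df.reset_index(drop=True)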
I have a task of reading each column of a Cassandra table into a dataframe to perform some operations. I want to feed the data so that, if there are 5 columns in a table, I get:
the first column in the first iteration
the first and second columns in the second iteration, into the same dataframe
and so on.
I need generic code. Has anyone tried something similar? Please help me out with an example.
This will work:
df2 = pd.DataFrame()
for i in range(len(df.columns)):
    df2 = df2.append(df.iloc[:, 0:i+1], sort=True)
Since the column names repeat on each iteration, the appended slices line up on those columns and the rows simply keep accumulating in df2.
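Note that DataFrame.append was removed in pandas 2.0, so on newer versions an equivalent sketch would use pd.concat:

import pandas as pd

df2 = pd.concat(
    [df.iloc[:, 0:i+1] for i in range(len(df.columns))],
    sort=True,
)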
You can extract the column names from the dataframe's schema and then access those particular columns and use them the way you want.
names = df.schema.names
columns = []
for name in names:
    columns.append(name)
# then use df[columns] the way you want
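If df here is a Spark DataFrame (which df.schema suggests), one way to feed a growing prefix of columns into pandas on each iteration could look like the sketch below; process() is just a placeholder for whatever you do with each frame:

names = df.schema.names

for i in range(len(names)):
    subset = df.select(names[:i+1]).toPandas()   # first column, then first two, and so on
    process(subset)                              # placeholder for your own processing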