Convert groupby attributes into specific columns - python

I have a dataframe which I have created out of a groupby operation (see code and output below):
Index = right_asset_df.groupby(['Year', 'Asset_Type']).agg({'Titre': 'count', 'psqm': 'median', 'Superficie': 'median', 'Montant': 'median'}).reset_index()
The output is:
Instead of having Asset_Type as rows, I would like to have it as columns, which means that the new output Index should have one column for each Asset_Type (Appart and Villa).
Here is an example of the output:
As you can see, each Asset_Type from the groupby gets its own column.
How can I do that in Python, please? Thanks

A quick way is to use pivot:
# Pivot with (index, columns, values); recent pandas requires keyword arguments
df = Index.pivot(index='Year', columns='Asset_Type',
                 values=['Titre', 'psqm', 'Superficie', 'Montant']).reset_index()
# Flatten the resulting MultiIndex columns ('Year' gets an empty second level)
df.columns = ['_'.join(col).strip('_') for col in df.columns]
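For reference, a minimal end-to-end sketch with made-up numbers (only the column names come from the question; the values are invented):
import pandas as pd

# Made-up data mirroring the question's columns
right_asset_df = pd.DataFrame({
    'Year': [2020, 2020, 2021, 2021],
    'Asset_Type': ['Appart', 'Villa', 'Appart', 'Villa'],
    'Titre': ['t1', 't2', 't3', 't4'],
    'psqm': [100.0, 200.0, 110.0, 210.0],
    'Superficie': [50, 300, 55, 310],
    'Montant': [5000, 60000, 6050, 65100],
})
Index = right_asset_df.groupby(['Year', 'Asset_Type']).agg(
    {'Titre': 'count', 'psqm': 'median', 'Superficie': 'median', 'Montant': 'median'}
).reset_index()

df = Index.pivot(index='Year', columns='Asset_Type',
                 values=['Titre', 'psqm', 'Superficie', 'Montant']).reset_index()
df.columns = ['_'.join(col).strip('_') for col in df.columns]
print(df.columns.tolist())
# ['Year', 'Titre_Appart', 'Titre_Villa', 'psqm_Appart', 'psqm_Villa', ...]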

Related

swap pandas column to values in another column

I have a pandas dataframe (I got it from an API, so I don't have much control over its structure), similar to this:
I want to have datetime as one column and value as another column. Any hints?
You can use T to transpose the dataframe and then reset_index to turn the current index into a regular column; you may need to rename it from index.
df = df.T.reset_index()   # transpose, then turn the index into a column
df.columns = df.iloc[0]   # promote the first row to the header
df = df[1:]               # drop the now-redundant header row

List of Dataframes, drop Dataframe column (columns have different names) if row contains a special string

What I have is a list of DataFrames.
What is important to note is that the shape of the DataFrames varies between 2 and 7 columns, and the columns are simply numbered from 0 up to the number of columns (e.g. df1 has 5 columns named 0,1,2,3,4; df2 has 4 columns named 0,1,2,3).
What I would like is to check whether any row in a column contains a certain string and, if so, delete that column.
list_dfs1=[df1,df2,df3...df100]
What I have done so far is below, and I get an error that column 5 is not in the axis (it exists only in some of the DataFrames):
for i, df in enumerate(list_dfs1):
    for index, row in df.iterrows():
        if np.where(row.str.contains("DEC")):
            df.drop(index, axis=1)
Any suggestions?
You could try:
for df in list_dfs1:
    # Iterate over a copy of the columns, since we drop from df inside the loop
    for col in list(df.columns):
        # If you are unsure about column types, cast the column as string:
        df[col] = df[col].astype(str)
        # Check if any value in the column contains the string of interest
        if df[col].str.contains("DEC").any():
            df.drop(columns=[col], inplace=True)
If you know that all columns are already of type string, you can skip the df[col] = df[col].astype(str) step.
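For illustration, a minimal sketch on two made-up frames with numbered columns (names and values are hypothetical):
import pandas as pd

df1 = pd.DataFrame({0: ['a', 'b'], 1: ['DEC-2020', 'c'], 2: [1, 2]})
df2 = pd.DataFrame({0: ['x', 'DECEMBER'], 1: [3, 4]})
list_dfs1 = [df1, df2]

for df in list_dfs1:
    for col in list(df.columns):
        df[col] = df[col].astype(str)
        if df[col].str.contains("DEC").any():
            df.drop(columns=[col], inplace=True)

print(list_dfs1[0].columns.tolist())  # [0, 2] -- column 1 held 'DEC-2020'
print(list_dfs1[1].columns.tolist())  # [1]    -- column 0 held 'DECEMBER'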
You can write a custom function that checks whether a column contains the pattern, using pd.Series.str.contains with pd.Series.any:
def func(s):
    # True if any value in the column contains the pattern
    return s.str.contains('DEC').any()

# Keep only the columns for which func is False
list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]
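Applied to the same kind of made-up frame, casting to string first so that .str works on mixed-type columns:
import pandas as pd

def func(s):
    return s.str.contains('DEC').any()

df1 = pd.DataFrame({0: ['a', 'b'], 1: ['DEC-2020', 'c'], 2: [1, 2]})
list_dfs1 = [df1]
list_df = [df.loc[:, ~df.astype(str).apply(func)] for df in list_dfs1]
print(list_df[0].columns.tolist())  # [0, 2] -- column 1 was dropped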
I would take another approach. I would concatenate the list into one data frame and then eliminate the columns where the string is found.
import pandas as pd
df = pd.concat(list_dfs1)
Let us say your condition is to eliminate any column containing the value "DEC":
df.mask(df == "DEC").dropna(axis=1, how="any")
Note that df == "DEC" is an exact match; for a substring test, use str.contains as in the answers above.
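A small sketch of this approach with made-up values (one caveat: if the frames have different numbers of columns, the concat itself introduces NaN, which this dropna would also remove):
import pandas as pd

df1 = pd.DataFrame({0: ['a', 'DEC'], 1: [1, 2]})
df2 = pd.DataFrame({0: ['b', 'c'], 1: [3, 4]})
df = pd.concat([df1, df2])
result = df.mask(df == "DEC").dropna(axis=1, how="any")
print(result.columns.tolist())  # [1] -- column 0 contained an exact "DEC"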

Change specific column order without using columns names in a Python dataframe

My DF has the following columns:
df.columns = ['not-changing1', 'not-changing2', 'changing1', 'changing2', 'changing3', 'changing4']
I want to swap the last 4 columns WITHOUT USING COLUMN NAMES, using their index instead.
So, the final column order would be:
result.columns = ['not-changing1', 'not-changing2', 'changing1', 'changing3', 'changing2', 'changing4']
How do I do that?
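A minimal sketch of one way to do this with purely positional indexing via iloc (the column list below is taken from the question):
import pandas as pd

df = pd.DataFrame(columns=['not-changing1', 'not-changing2', 'changing1',
                           'changing2', 'changing3', 'changing4'])

# Swap positions 3 and 4 (0-based); no column names appear in the reorder
result = df.iloc[:, [0, 1, 2, 4, 3, 5]]
print(result.columns.tolist())
# ['not-changing1', 'not-changing2', 'changing1', 'changing3', 'changing2', 'changing4']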

Pandas merge/flatten dataframe based on specific column

I have the following data sample, and I am trying to flatten it out using pandas; I want to flatten this data over Candidate_Name.
This is my implementation:
df = df.merge(df, on='Candidate_Name')
but I am not getting the desired result. My desired output is as follows. So basically, have all the rows that match Candidate_Name in a single row, where duplicate column names may be suffixed with _x.
I think you need GroupBy.cumcount with DataFrame.unstack, and then to flatten the MultiIndex, keeping the plain name for the first group and appending the counter for the other levels to avoid duplicated column names:
df = df.set_index(['Candidate_Name', df.groupby('Candidate_Name').cumcount()]).unstack()
df.columns = [a if b == 0 else f'{a}_{b}' for a, b in df.columns]
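A runnable sketch with made-up data (all columns other than Candidate_Name are hypothetical):
import pandas as pd

df = pd.DataFrame({'Candidate_Name': ['Alice', 'Alice', 'Bob'],
                   'Skill': ['SQL', 'Python', 'Go'],
                   'Years': [3, 5, 2]})

df = df.set_index(['Candidate_Name',
                   df.groupby('Candidate_Name').cumcount()]).unstack()
df.columns = [a if b == 0 else f'{a}_{b}' for a, b in df.columns]
print(df.columns.tolist())  # ['Skill', 'Skill_1', 'Years', 'Years_1']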

How to keep indexes when summing by columns based on groupby in pandas

I have a dataset where each ID has 6 corresponding rows. I want this dataset grouped by the column ID and aggregated using sum. I wrote this piece of code:
col = [col for col in train.columns if col not in ['Month', 'ID']]
train.groupby('ID')[col].sum().reset_index()
Everything works fine except that I lose the column ID. The unique IDs from my initial database disappeared, and instead I just have ids enumerated from 0 up to the number of rows in the resulting dataset. I want to keep the initial indexes, because I will need to merge this dataset with another one later. How can I deal with this problem? Thanks very much for helping!
P.S.: deleting reset_index() has no effect.
P.S.: You can see two problems in the images. The first image shows the original database: you can see 6 entries for each ID. The second image shows the database resulting from the grouped statement. First problem: the IDs are not the same as in the original table. Second problem: the sum over 6 months for each ID is not correct.
Instead of using reset_index() you can simply use the keyword argument as_index: df.groupby('ID', as_index=False)
This will keep the column ID in the aggregated output, as described in groupby()'s documentation.
as_index : boolean, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
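Applied to the question's own code, this becomes:
col = [col for col in train.columns if col not in ['Month', 'ID']]
train.groupby('ID', as_index=False)[col].sum()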
When you group a data frame by some columns, those columns become your new index.
import pandas as pd
import numpy as np
# Create data
n = 6; m = 3
col_id = np.hstack([['id-'+str(i)] * n for i in range(m)]).reshape(-1, 1)
np.random.shuffle(col_id)
data = np.random.rand(m*n, m)
columns = ['v'+str(i+1) for i in range(m)]
df = pd.DataFrame(data, columns=columns)
df['ID'] = col_id
# Group by ID
print(df.groupby('ID').sum())
Will simply give you
             v1        v2        v3
ID
id-0   2.099219  2.708839  2.766141
id-1   2.554117  2.183166  3.914883
id-2   2.485505  2.739834  2.250873
If you just want the column ID back, you only have to reset_index():
print(df.groupby('ID').sum().reset_index())
which will leave you with
     ID        v1        v2        v3
0  id-0  2.099219  2.708839  2.766141
1  id-1  2.554117  2.183166  3.914883
2  id-2  2.485505  2.739834  2.250873
Note:
groupby will sort the result by the group keys. If you don't want that for any reason, just set sort=False (see also the documentation):
print(df.groupby('ID', sort=False).sum())
