How to reindex the 'multi-groupbyed' dataframe? - python

I have a dataframe with 4 columns: the first 3 are numerical variables that describe the feature of the value in the last column, and the last column contains strings.
I want to merge the last string column by the previous 3 columns through the groupby function. That part works (strings that share the same feature values in the first three columns are merged successfully).
The original dataframe had 1200 rows, and the merged one has 1100. I found that the resulting df is multi-indexed (hierarchical index) and only contains 2 columns, so I tried the reindex method with a generated ascending numerical list. Sadly, it failed.
df1.columns
# [Out] Index(['time', 'column', 'author', 'text'], dtype='object')
series = df1.groupby(['time', 'column', 'author'])['body_text'].sum()  # merge the last column by the first 3 columns
dfx = series.to_frame()  # get the new df
dfx.columns
# [Out] Index(['author', 'text'], dtype='object')
len(dfx)
# [Out] 1100
indexs = list(range(1100))
dfx.reindex(index=indexs)
# [Out] Exception: cannot handle a non-unique multi-index!

Reindex is not necessary here; better to use DataFrame.reset_index, or add the parameter as_index=False to DataFrame.groupby:
dfx = df1.groupby(['time', 'column','author'])['body_text'].sum().reset_index()
Or:
dfx = df1.groupby(['time', 'column','author'], as_index=False)['body_text'].sum()
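A minimal sketch on a toy frame (column names follow the question; the values are invented) confirms the result comes back with a flat, unique RangeIndex:
import pandas as pd

# Toy frame using the question's column names; data is made up
df1 = pd.DataFrame({
    'time':      [1, 1, 2],
    'column':    ['a', 'a', 'b'],
    'author':    ['u1', 'u1', 'u2'],
    'body_text': ['foo ', 'bar', 'baz'],
})

dfx = df1.groupby(['time', 'column', 'author'], as_index=False)['body_text'].sum()
print(dfx.index)  # RangeIndex(start=0, stop=2, step=1) -- flat and unique
print(len(dfx))   # 2, one row per group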

Related

List of Dataframes, drop Dataframe column (columns have different names) if row contains a special string

What I have is a list of Dataframes.
It is important to note that the dataframes differ in shape, between 2 and 7 columns, and the columns are named by position from 0 to the column count (e.g. df1 has 5 columns named 0,1,2,3,4, while df2 has 4 columns named 0,1,2,3).
What I would like is to check whether any row in a column contains a certain string, and if so delete that column.
list_dfs1 = [df1, df2, df3, ..., df100]
What I have done so far is below, and I get an error that column 5 is not in the axis (it is there for some DFs):
for i, df in enumerate(list_dfs1):
    for index, row in df.iterrows():
        if np.where(row.str.contains("DEC")):
            df.drop(index, axis=1)
Any suggestions?
You could try:
for df in list_dfs1:
    for col in list(df.columns):  # iterate over a copy, since columns are dropped inside the loop
        # If you are unsure about column types, cast the column as string:
        df[col] = df[col].astype(str)
        # Check whether the column contains the string of interest
        if df[col].str.contains("DEC").any():
            df.drop(columns=[col], inplace=True)
If you know that all columns are of type string, you don't have to actually do df[col] = df[col].astype(str).
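For a self-contained run, here is a sketch on two invented toy frames standing in for list_dfs1; the cast is done inline here so the original data is not overwritten:
import pandas as pd

dfa = pd.DataFrame({0: ['DEC', 'b'], 1: [1, 2]})
dfb = pd.DataFrame({0: ['x', 'y'], 1: ['DEC-ish', 'z'], 2: [3.0, 4.0]})

for df in [dfa, dfb]:
    for col in list(df.columns):  # copy, since we drop columns mid-loop
        if df[col].astype(str).str.contains("DEC").any():
            df.drop(columns=[col], inplace=True)

print(dfa.columns.tolist(), dfb.columns.tolist())  # [1] [0, 2]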
You can write a custom function that checks whether a column contains the pattern, using pd.Series.str.contains with pd.Series.any:
def func(s):
    return s.str.contains('DEC').any()

list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]
I would take another approach: concatenate the list into one dataframe and then eliminate the columns where the string is found.
import pandas as pd
df = pd.concat(list_dfs1)
Let us say your condition was to eliminate any column with "DEC"
df.mask(df == "DEC").dropna(axis=1, how="any")
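One caveat: df == "DEC" only catches cells that exactly equal "DEC", while the str.contains answers above also catch substrings. A sketch of a contains-based variant on invented frames:
import pandas as pd

dfa = pd.DataFrame({0: ['a', 'DEC-2020'], 1: ['x', 'y']})
dfb = pd.DataFrame({0: ['b', 'c'], 1: ['y', 'z']})

df = pd.concat([dfa, dfb])
# Flag columns where any cell contains the substring "DEC"
contains = df.astype(str).apply(lambda s: s.str.contains("DEC"))
print(df.loc[:, ~contains.any()].columns.tolist())  # [1] -- column 0 held 'DEC-2020'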

Compare columns of different dataframes

I have two DataFrames I would like to merge, but first I would prefer to check whether the one column that exists in both dfs has exactly the same values in each row.
For general merging I tried several solutions; in the comments you can see the resulting shapes:
df = pd.concat([df_b, df_c], axis=1, join='inner') # (245131, 40)
df = pd.concat([df_b, df_c], axis=1).reindex(df_b.index) # (245131, 40)
df = pd.merge(df_b, df_c, on=['client_id'], how='inner') # (420707, 39)
df = pd.concat([df_b, df_c], axis=1) # (245131, 40)
The original df_c is (245131, 14) and df_b is (245131, 26)
From that I assume the column client_id has exactly the same values, since three of the approaches give a shape of 245131 rows.
I would like to compare the client_ids in a new_df. I tried it with .loc, but it did not work out. I also tried df.rename(columns={df.columns[20]: "client_id_1"}, inplace=True), but it renamed both columns.
I tried
df_test = df_c.client_id
df_test.append(df_b.client_id, ignore_index=True)
but I only receive one index and one client_id column, while the shape still says 245131 rows.
If I can be sure the values are exactly the same, should I drop client_id in one df and do the concat/merge after that, so that I get the correct shape of (245131, 39)?
Is there a mangle_dupe_cols option for merge or compare, like the one for read_csv?
Chris, if you wish to check whether 2 columns of 2 separate dataframes are exactly the same, you can try the following:
tuple(df1['col'].values) == tuple(df2['col'].values)
This should return a bool value.
If you want to merge 2 dataframes, make sure all rows in your column of interest have unique values, as duplicates will add extra rows.
Otherwise, use concat if you want to join the dataframes along an axis.
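As an alternative to the tuple comparison, pandas' Series.equals does the same row-by-row check; a minimal sketch on invented frames (in the question these would be df_b and df_c):
import pandas as pd

df_b = pd.DataFrame({'client_id': [1, 2, 3], 'x': ['a', 'b', 'c']})
df_c = pd.DataFrame({'client_id': [1, 2, 3], 'y': [10, 20, 30]})

# Row-by-row equality of the shared column (assumes both frames share the same index order)
if df_b['client_id'].equals(df_c['client_id']):
    # Safe to drop one copy before concatenating side by side
    merged = pd.concat([df_b, df_c.drop(columns='client_id')], axis=1)
    print(merged.shape)  # (3, 3) -- a single client_id column survives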

Change specific column order without using columns names in a Python dataframe

My DF has the following columns:
df.columns = ['not-changing1', 'not-changing2', 'changing1', 'changing2', 'changing3', 'changing4']
I want to swap the last 4 columns WITHOUT USING COLUMN NAMES, using their index positions instead.
So, the final column order would be:
result.columns = ['not-changing1', 'not-changing2', 'changing1', 'changing3', 'changing2', 'changing4']
How do I do that?
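One way, sketched on a placeholder frame: build the new column order as a list of integer positions and select with iloc, so no column name is ever spelled out:
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6]],
                  columns=['not-changing1', 'not-changing2', 'changing1',
                           'changing2', 'changing3', 'changing4'])

order = list(range(df.shape[1]))
order[3], order[4] = order[4], order[3]  # swap positions 3 and 4 by index only
result = df.iloc[:, order]
print(result.columns.tolist())
# ['not-changing1', 'not-changing2', 'changing1', 'changing3', 'changing2', 'changing4']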

Pandas merge/flatten dataframe based on specific column

I have the following data sample that I am trying to flatten out using pandas; I want to flatten this data over Candidate_Name.
This is my implementation,
df = df.merge(df, on='Candidate_Name')
but I am not getting the desired result. My desired output is as follows: basically, put all the rows that match on Candidate_Name into a single row, where duplicate column names may be suffixed with _x.
I think you need GroupBy.cumcount with DataFrame.unstack, then flatten the MultiIndex, keeping the original name for the first level and appending numbers for the other levels to avoid duplicated column names:
df = df.set_index(['Candidate_Name', df.groupby('Candidate_Name').cumcount()]).unstack()
df.columns = [a if b == 0 else f'{a}_{b}' for a, b in df.columns]
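Since the question's sample data did not survive here, a hypothetical run with an invented Skill column shows the shape of the result:
import pandas as pd

df = pd.DataFrame({
    'Candidate_Name': ['A', 'A', 'B'],
    'Skill':          ['sql', 'python', 'java'],
})

df = df.set_index(['Candidate_Name', df.groupby('Candidate_Name').cumcount()]).unstack()
df.columns = [a if b == 0 else f'{a}_{b}' for a, b in df.columns]
print(df.columns.tolist())   # ['Skill', 'Skill_1']
print(df.loc['A'].tolist())  # ['sql', 'python'] -- A's two rows flattened into one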

Pandas: Merge 2 dataframes based on a column's values; for multiple rows containing the same column value, append those to different columns

I have two dataframes, dataframe1 and dataframe2. They both share the same data in a particular column; let's call this column 'share1' and 'share2' for dataframe1 and dataframe2, respectively.
The issue is that there are instances where dataframe1 has only one row in 'share1' with a particular value (let's call it 'c34z'), but dataframe2 has multiple rows with the value 'c34z' in the 'share2' column.
What I would like to do is, in the new merged dataframe, place the extra values in new columns.
So the number of added columns in the new dataframe will be the maximum number of duplicates of any particular value in 'share2'. For rows with only a unique value in 'share2', the rest of the added columns will be blank for that row.
You can use cumcount to create an additional key, then pivot df2:
newdf2 = df2.assign(key=df2.groupby('share2').cumcount(), v=df2.share2).pivot_table(index='share2', columns='key', values='v', aggfunc='first')
After this, reindex the pivoted frame to align it with df1, then concat:
newdf2 = newdf2.reindex(df1.share1)
newdf2.index = df1.index
yourdf = pd.concat([df1, newdf2], axis=1)
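A hypothetical end-to-end run, with frames invented to match the description (here v is taken from a separate value column rather than share2 itself, which pivots the same way):
import pandas as pd

df1 = pd.DataFrame({'share1': ['c34z', 'q9'], 'a': [1, 2]})
df2 = pd.DataFrame({'share2': ['c34z', 'c34z', 'q9'], 'val': ['r1', 'r2', 'r3']})

newdf2 = (df2.assign(key=df2.groupby('share2').cumcount(), v=df2.val)
             .pivot_table(index='share2', columns='key', values='v', aggfunc='first'))
newdf2 = newdf2.reindex(df1.share1)
newdf2.index = df1.index
yourdf = pd.concat([df1, newdf2], axis=1)
print(yourdf)
#   share1  a   0    1
# 0   c34z  1  r1   r2
# 1     q9  2  r3  NaN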
Loading Data:
import pandas as pd
df1 = {'key': ['c34z', 'c34z_2'], 'value': ['x', 'y']}
df2 = {'key': ['c34z', 'c34z_2', 'c34z_2'], 'value': ['c34z_value', 'c34z_2_value', 'c34z_2_value']}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
Convert df2 by grouping and pivoting:
df2_pivot = df2.groupby('key')['value'].apply(lambda s: s.reset_index(drop=True)).unstack().reset_index()
Merge df1 and df2_pivot:
df_merged = pd.merge(df1, df2_pivot, on='key')
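For the sample loaded above, each duplicated key lands in its own numbered column; printing the merged frame should give roughly:
print(df_merged)
#       key value             0             1
# 0    c34z     x    c34z_value           NaN
# 1  c34z_2     y  c34z_2_value  c34z_2_value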
