how to concat specific rows through a pandas dataframe - python

So, I have this situation: there is a dataframe like this:
Number   Description
10001    name 2
1002     name2(pt1)
NaN      name2(pt2)
1003     name3
1004     name4(pt1)
NaN      name4(pt2)
1005     name5
So, I need to concat the name (part1 and part2) together into just one field and then drop the NaN rows, but I have no clue how to do this because the rows do not follow a specific interval pattern.
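For reference, a minimal sketch that reconstructs the example frame (values assumed from the table above):
import numpy as np
import pandas as pd

# assumed reconstruction of the question's data
df = pd.DataFrame({
    'Number': [10001, 1002, np.nan, 1003, 1004, np.nan, 1005],
    'Description': ['name 2', 'name2(pt1)', 'name2(pt2)', 'name3',
                    'name4(pt1)', 'name4(pt2)', 'name5'],
})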

Try a groupby aggregate on a grouping series built from the notna Number values.
Groups are created from:
df['Number'].notna().cumsum()
0 1
1 2
2 2
3 3
4 4
5 4
6 5
Name: Number, dtype: int32
Then aggregate, taking the 'first' Number (since the first value of each group is guaranteed to be notna) and combining the Descriptions with some operation, like join:
new_df = (
    df.groupby(df['Number'].notna().cumsum(), as_index=False)
      .aggregate({'Number': 'first', 'Description': ''.join})
)
new_df:
Number Description
0 10001.0 name 2
1 1002.0 name2(pt1)name2(pt2)
2 1003.0 name3
3 1004.0 name4(pt1)name4(pt2)
4 1005.0 name5
Or comma separated join:
new_df = (
    df.groupby(df['Number'].notna().cumsum(), as_index=False)
      .aggregate({'Number': 'first', 'Description': ','.join})
)
new_df:
Number Description
0 10001.0 name 2
1 1002.0 name2(pt1),name2(pt2)
2 1003.0 name3
3 1004.0 name4(pt1),name4(pt2)
4 1005.0 name5
Or as list:
new_df = (
    df.groupby(df['Number'].notna().cumsum(), as_index=False)
      .aggregate({'Number': 'first', 'Description': list})
)
new_df:
Number Description
0 10001.0 [name 2]
1 1002.0 [name2(pt1), name2(pt2)]
2 1003.0 [name3]
3 1004.0 [name4(pt1), name4(pt2)]
4 1005.0 [name5]
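Note that Number comes back as float, because the NaNs forced the original column to float. If you want integers again, a small follow-up sketch (assuming no Number is missing after the aggregation):
new_df['Number'] = new_df['Number'].astype(int)  # or astype('Int64') to stay NaN-safe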


Summarize rows in pandas dataframe by column value and append specific column values as columns

I have a dataframe as follows with multiple rows per id (maximum 3).
dat = pd.DataFrame({'id':[1,1,1,2,2,3,4,4], 'code': ["A","B","D","B","D","A","A","D"], 'amount':[11,2,5,22,5,32,11,5]})
id code amount
0 1 A 11
1 1 B 2
2 1 D 5
3 2 B 22
4 2 D 5
5 3 A 32
6 4 A 11
7 4 D 5
I want to consolidate the df and have only one row per id so that it looks as follows:
id code1 amount1 code2 amount2 code3 amount3
0 1 A 11 B 2 D 5
1 2 B 22 D 5 NaN NaN
2 3 A 32 NaN NaN NaN NaN
3 4 A 11 D 5 NaN NaN
How can I achieve this in pandas?
Use GroupBy.cumcount to create a counter, reshape with DataFrame.unstack and DataFrame.sort_index, and finally flatten the MultiIndex and convert id back to a column with DataFrame.reset_index:
df = (dat.set_index(['id', dat.groupby('id').cumcount().add(1)])
         .unstack()
         .sort_index(axis=1, level=1, sort_remaining=False))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print(df)
id code1 amount1 code2 amount2 code3 amount3
0 1 A 11.0 B 2.0 D 5.0
1 2 B 22.0 D 5.0 NaN NaN
2 3 A 32.0 NaN NaN NaN NaN
3 4 A 11.0 D 5.0 NaN NaN
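For clarity, this is the per-id counter that becomes the second index level (a sketch; output computed from the dat defined above):
print(dat.groupby('id').cumcount().add(1))
0    1
1    2
2    3
3    1
4    2
5    1
6    1
7    2
dtype: int64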

How to compare 2 non-identical dataframes in python

I have two dataframes with the same column order but different column names and different rows. df2 rows vary from df1 rows.
df1:
   col_id  num   name
0       1    3  linda
1       2    4  James
df2:
   id  no    name
0   1   2  granpa
1   2   6   linda
2   3   7     sam
This is the output I need. It shows rows with the same, OLD and NEW values, so the user can clearly see what changed between the two dataframes:
result col_id num name
0 1 was 3| now 2 was linda| now granpa
1 2 was 4| now 6 was James| now linda
2 was | now 3 was | now 7 was | now sam
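A minimal reconstruction of the two frames (values assumed from the question; 'james' is lower-cased to match the printouts in the answers below):
import pandas as pd

df1 = pd.DataFrame({'col_id': [1, 2], 'num': [3, 4], 'name': ['linda', 'james']})
df2 = pd.DataFrame({'id': [1, 2, 3], 'no': [2, 6, 7], 'name': ['granpa', 'linda', 'sam']})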
Since your goal is just to compare differences, use DataFrame.compare instead of aggregating into strings.
However,
DataFrame.compare can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames
So we just need to align the row/column indexes, either via merge or reindex.
Align via merge
Outer-merge the two dfs:
merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id')
# col_id num name_x id no name_y
# 0 1 3 linda 1 2 granpa
# 1 2 4 james 2 6 linda
# 2 NaN NaN NaN 3 7 sam
Divide the merged frame into left/right frames and align their columns with set_axis:
cols = df1.columns
left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
# col_id num name
# 0 1 3 linda
# 1 2 4 james
# 2 NaN NaN NaN
right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)
# col_id num name
# 0 1 2 granpa
# 1 2 6 linda
# 2 3 7 sam
Compare the aligned left/right frames (use keep_equal=True to also show equal cells):
left.compare(right, keep_shape=True, keep_equal=True)
# col_id num name
# self other self other self other
# 0 1 1 3 2 linda granpa
# 1 2 2 4 6 james linda
# 2 NaN 3 NaN 7 NaN sam
left.compare(right, keep_shape=True)
# col_id num name
# self other self other self other
# 0 NaN NaN 3 2 linda granpa
# 1 NaN NaN 4 6 james linda
# 2 NaN 3 NaN 7 NaN sam
Align via reindex
If you are 100% sure that one df is a subset of the other, then reindex the subsetted rows.
In your example, df1 is a subset of df2, so reindex df1:
(df1.assign(id=df1.col_id)           # copy col_id (we need the original col_id after reindexing)
    .set_index('id')                 # set index to the copied id
    .reindex(df2.id)                 # reindex against df2's id
    .reset_index(drop=True)          # remove the copied id
    .set_axis(df2.columns, axis=1)   # align column names
    .compare(df2, keep_equal=True, keep_shape=True))
# col_id num name
# self other self other self other
# 0 1 1 3 2 linda granpa
# 1 2 2 4 6 james linda
# 2 NaN 3 NaN 7 NaN sam
Nullable integers
Normally int cannot mix with nan, so pandas converts to float. To keep the int values as int (like the examples above):
Ideally we'd convert the int columns to nullable integers with astype('Int64') (capital I).
However, there is currently a comparison bug with Int64, so just use astype(object) for now.
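A sketch of that workaround, applied to the frames from above before the merge/align steps (column names taken from this question):
# cast the int columns to object so NaN never forces a float conversion
df1 = df1.astype({'col_id': object, 'num': object})
df2 = df2.astype({'id': object, 'no': object})
# then redo the merge/align/compare steps above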
If I understand correctly, you want something like this:
new_df = df1.drop(['name', 'num'], axis=1).merge(df2.rename({'id': 'col_id'}, axis=1), how='outer')
Output:
>>> new_df
col_id no name
0 1 2 granpa
1 2 6 linda
2 3 7 sam

How to drop a row in one dataframe if missing value in another dataframe?

I have two DataFrames (example below). I would like to delete any row in df1 whose value appears in df2['Patnum'] where df2['City'] is NaN.
For example: I would want to drop rows 1 and 3 in df1, since they contain '4', and Patnum 4 in df2 has a missing value in df2['City'].
How would I do this?
df1
Citer Citee
0 1 2
1 2 4
2 3 5
3 4 7
df2
Patnum City
0 1 new york
1 2 amsterdam
2 3 copenhagen
3 4 nan
4 5 sydney
expected result:
df1
Citer Citee
0 1 2
1 3 5
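A minimal reconstruction of the inputs (assuming df2['City'] holds a real missing value rather than the string 'nan'):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Citer': [1, 2, 3, 4], 'Citee': [2, 4, 5, 7]})
df2 = pd.DataFrame({'Patnum': [1, 2, 3, 4, 5],
                    'City': ['new york', 'amsterdam', 'copenhagen', np.nan, 'sydney']})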
IIUC, stack, isin and dropna.
The idea is to build a True/False boolean mask based on matches, then drop those rows after we unstack the dataframe.
val = df2[df2['City'].isna()]['Patnum'].values
df3 = df1.stack()[~df1.stack().isin(val)].unstack().dropna(how="any")
Citer Citee
0 1.0 2.0
2 3.0 5.0
Details
df1.stack()[~df1.stack().isin(val)]
0 Citer 1
Citee 2
1 Citer 2
2 Citer 3
Citee 5
3 Citee 7
dtype: int64
print(df1.stack()[~df1.stack().isin(val)].unstack())
Citer Citee
0 1.0 2.0
1 2.0 NaN
2 3.0 5.0
3 NaN 7.0
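As an aside, not the approach above but an alternative sketch: the same filter can be written as a row mask with isin/any, which avoids the stack/unstack round-trip (and the upcast to float):
df3 = df1[~df1.isin(val).any(axis=1)]
   Citer  Citee
0      1      2
2      3      5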

Dropping Columns which have same values but different names

I am currently merging two dataframes where some of their columns are the same, but not all.
df = pd.merge(df_1, df_2, how='inner', on='name' )
This returns:
index name val1_x val2_x val1_y val2_y
0 name1 1 2 1 3
2 name2 12 14 12 34
3 name3 14 3 14 96
But I would like:
index name val1_x val2_x val2_y
0 name1 1 2 3
2 name2 12 14 34
3 name3 14 3 96
How could you get this result? Either with the merge command or after?
--------- Extension - outer merge -------------
With an inner merge,
df = pd.merge(df_1, df_2, how='inner', on='name').T.drop_duplicates().T
works as suggested in the solutions.
However, with an outer merge
df = pd.merge(df_1, df_2, how='outer', on='name')
it does not work, since there are nan values. It returns:
index name val1_x val2_x val1_y val2_y
0 name1 1 2 nan 3
2 name2 12 14 12 34
3 name3 14 3 14 96
But I would like:
index name val1_x val2_x val2_y
0 name1 1 2 3
2 name2 12 14 34
3 name3 14 3 96
How can one achieve this?
Use drop_duplicates
df = pd.merge(df_1, df_2, how='inner', on='name' ).T.drop_duplicates().T
index name val1_x val2_x val2_y
0 0 name1 1 2 3
1 2 name2 12 14 34
2 3 name3 14 3 96
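To see why this works: transposing turns columns into rows, drop_duplicates then removes columns whose values are identical regardless of their names, and the second transpose restores the shape. A self-contained sketch with an assumed reconstruction of the inputs (note the double transpose can upcast mixed dtypes to object):
import pandas as pd

# assumed reconstruction of the inputs behind the merged output shown in the question
df_1 = pd.DataFrame({'name': ['name1', 'name2', 'name3'],
                     'val1': [1, 12, 14], 'val2': [2, 14, 3]})
df_2 = pd.DataFrame({'name': ['name1', 'name2', 'name3'],
                     'val1': [1, 12, 14], 'val2': [3, 34, 96]})

# duplicate columns become duplicate rows, get dropped, then transpose back
df = pd.merge(df_1, df_2, how='inner', on='name').T.drop_duplicates().T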
This is a complicated aggregation, so you can write your own function to resolve the groups. This method will only work for numeric data (datetime and Bool also work); with strings, you'd need to fall back to a much slower row-wise nunique check.
For each group of same-named columns, we check whether the columns are completely duplicated (using np.unique, after filling) and then either return the original group or the deduplicated grouping.
Starting Data
index name val1_x val2_x val1_y val2_y
0 0 name1 1 2 NaN 3
1 2 name2 12 14 12.0 34
2 3 name3 14 3 14.0 96
Code
l = []
for idx, gp in df.groupby(df.columns.str.split('_').str[0], axis=1):
    if any(gp.dtypes == 'O') | (gp.shape[1] == 1):  # can't/don't resolve these types
        l.append(gp)
    else:
        arr = np.unique(gp.ffill(axis=1).bfill(axis=1).to_numpy(), axis=1)
        if arr.shape[1] == 1:  # columns were complete duplicates: keep a single copy
            l.append(pd.DataFrame(index=gp.index, columns=[idx], data=arr))
        else:
            l.append(gp)
df = pd.concat(l, axis=1)
index name val1 val2_x val2_y
0 0 name1 1.0 2 3
1 2 name2 12.0 14 34
2 3 name3 14.0 3 96

Pandas dataframe merge not working as expected with multiple column equality checks

I am trying to merge two DataFrames based on two columns being equal to each other.
Here is the code:
>>> df.merge(df1, how='left', left_on=['Name', 'Age'], right_on=['Name', 'Age'], suffixes=('', '_#'))
Name Age
0 1 2
1 3 4
2 4 5
>>> df
Name Age
0 1 2
1 3 4
0 4 5
>>> df1
Name Age
0 5 6
1 3 4
0 4 7
What I actually expected from the merge was
Name Age Age_#
0 1 2 NaN
1 3 4 4.0
2 4 5 7.0
Why does pandas think that there are three matching rows for this merge?
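For reproducibility, an assumed reconstruction of the two frames (including the duplicated index 0 from the printout):
import pandas as pd

df = pd.DataFrame({'Name': [1, 3, 4], 'Age': [2, 4, 5]}, index=[0, 1, 0])
df1 = pd.DataFrame({'Name': [5, 3, 4], 'Age': [6, 4, 7]}, index=[0, 1, 0])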
So you mean merge on Name, right?
df.merge(df1, how='left', on='Name', suffixes=('', '_#'))
Out[120]:
Name Age Age_#
0 1 2 NaN
1 3 4 4.0
2 4 5 7.0
Use indicator=True to see what your output actually is:
df.merge(df1, how='left', left_on=['Name', 'Age'], right_on=['Name', 'Age'], suffixes=('', '_#'),indicator=True)
Out[121]:
Name Age _merge
0 1 2 left_only
1 3 4 both
2 4 5 left_only
Since your df and df1 have the same columns and all of them were used as merge keys, there are no other columns left to indicate whether each row found a match in df1. And since you used how='left', the default is to show all left rows in the result.
