I have two dataframes with the same column 'Col A' that I want to merge on. However, in df2 the values in Col A are replicated a varying number of times. This replication is important to my problem and I cannot drop it. I want the final dataframe to look like df3, where df1's Col B value is attached (as Col C) to each replication of its Col A key.
df1              df2
Col A  Col B     Col A  Col B
1      v         1      a
2      w         2      b
3      x         2      c
4      y         3      d
                 3      e
                 4      f

df3
Col A  Col B  Col C
1      a      v
2      b      w
2      c      w
3      d      x
3      e      x
4      f      y
Use merge:
df2.merge(df1, on='Col A')
Out:
Col A Col B_x Col B_y
0 1 a v
1 2 b w
2 2 c w
3 3 d x
4 3 e x
5 4 f y
And if necessary, rename afterwards:
df = df2.merge(df1, on='Col A')
df.columns = ['Col A', 'Col B', 'Col C']
For more info, see the pandas documentation on merging and joining.
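Alternatively, the suffixes can be controlled inside the merge call itself, so only one rename is needed afterwards. A small sketch rebuilding the toy frames from the question (the '_df1' suffix is my own naming choice):

```python
import pandas as pd

df1 = pd.DataFrame({'Col A': [1, 2, 3, 4], 'Col B': list('vwxy')})
df2 = pd.DataFrame({'Col A': [1, 2, 2, 3, 3, 4], 'Col B': list('abcdef')})

# suffixes=('', '_df1') keeps df2's 'Col B' name unchanged and
# marks df1's clashing copy, which we then rename to 'Col C'
merged = (df2.merge(df1, on='Col A', suffixes=('', '_df1'))
             .rename(columns={'Col B_df1': 'Col C'}))
print(merged)
```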
I believe you need Series.map with a mapping Series created by set_index:
print (df1.set_index('Col A')['Col B'])
Col A
1 v
2 w
3 x
4 y
Name: Col B, dtype: object
df2['Col C'] = df2['Col A'].map(df1.set_index('Col A')['Col B'])
print (df2)
Col A Col B Col C
0 1 a v
1 2 b w
2 2 c w
3 3 d x
4 3 e x
5 4 f y
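One behavioural difference worth noting (my own addition, not from the answer): keys of df2 that are missing from df1 come out as NaN under map, while the inner merge above would drop those rows entirely. A small sketch with a hypothetical unmatched key 5:

```python
import pandas as pd

df1 = pd.DataFrame({'Col A': [1, 2], 'Col B': ['v', 'w']})
df2 = pd.DataFrame({'Col A': [1, 2, 5], 'Col B': ['a', 'b', 'c']})

# key 5 has no match in df1, so map leaves NaN in that row
# instead of silently dropping it
df2['Col C'] = df2['Col A'].map(df1.set_index('Col A')['Col B'])
print(df2)
```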
I am trying to concatenate two dataframes. I've tried using merge(), join(), concat() in pandas, but none gave me my desired output.
df1:
Index  value
0      a
1      b
2      c
3      d
4      e

df2:
Index  value
1      f
2      g
3      h
4      i
5      j

desired output:
Index  col1  col2
0      a     f
1      b     g
2      c     h
3      d     i
4      e     j
Thanks in advance!
You can just use pd.merge and specify a left join on the index, as follows:
import pandas as pd
df1 = pd.DataFrame(data={'value': list('ABCDE')})
df2 = pd.DataFrame(data={'value': list('FGHIJ')}, index=range(1, 6))
pd.merge(df1.rename(columns={'value': 'col1'}),
         df2.reset_index(drop=True).rename(columns={'value': 'col2'}),
         how='left', left_index=True, right_index=True)
-----------------------------------
col1 col2
0 A F
1 B G
2 C H
3 D I
4 E J
-----------------------------------
Does resetting the index of df2 work for your use case?
pd.concat([df1, df2.reset_index(drop=True)], axis=1) \
  .set_axis(['Col1', 'Col2'], axis=1)
Result
Col1 Col2
0 a f
1 b g
2 c h
3 d i
4 e j
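Rebuilding the question's frames, the concat approach in full as a self-contained sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'value': list('abcde')})
df2 = pd.DataFrame({'value': list('fghij')}, index=range(1, 6))

# drop df2's offset index so the rows align positionally with df1
out = (pd.concat([df1, df2.reset_index(drop=True)], axis=1)
         .set_axis(['Col1', 'Col2'], axis=1))
print(out)
```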
I have a dataframe with a column like this
Col1
1 A, 2 B, 3 C
2 B, 4 C
1 B, 2 C, 4 D
I have used .str.split(',', expand=True); the result is like this:
0 | 1 | 2
1 A | 2 B | 3 C
2 B | 4 C | None
1 B | 2 C | 4 D
What I am trying to achieve is to get this:
Col A| Col B| Col C| Col D
1 A | 2 B | 3 C | None
None | 2 B | 4 C | None
None | 1 B | 2 C | 4 D
I am stuck; how do I get the new columns formatted like that?
Let's try:
# split and explode
s = df['Col1'].str.split(', ').explode()
# create new multi-level index
s.index = pd.MultiIndex.from_arrays([s.index, s.str.split().str[-1].tolist()])
# unstack to reshape
out = s.unstack().add_prefix('Col ')
Details:
# split and explode
0 1 A
0 2 B
0 3 C
1 2 B
1 4 C
2 1 B
2 2 C
2 4 D
Name: Col1, dtype: object
# create new multi-level index
0 A 1 A
B 2 B
C 3 C
1 B 2 B
C 4 C
2 B 1 B
C 2 C
D 4 D
Name: Col1, dtype: object
# unstack to reshape
Col A Col B Col C Col D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
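Put together as a self-contained sketch (the sample frame is rebuilt from the question):

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']})

# split each cell into a list and explode to one value per row
s = df['Col1'].str.split(', ').explode()
# the second index level is the letter, which becomes the column label
s.index = pd.MultiIndex.from_arrays([s.index, s.str.split().str[-1].tolist()])
# unstack moves the letter level into columns, filling gaps with NaN
out = s.unstack().add_prefix('Col ')
print(out)
```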
Most probably there are more general approaches you can use, but this worked for me. Please note that it is based on a lot of assumptions and constraints of your particular example.
test_dict = {'col_1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']}
df = pd.DataFrame(test_dict)
First, we split the df into initial columns:
df2 = df.col_1.str.split(pat=',', expand=True)
Result:
0 1 2
0 1 A 2 B 3 C
1 2 B 4 C None
2 1 B 2 C 4 D
Next (first assumption), we need to ensure that we can later use ' ' as a delimiter to extract the columns. In order to do that, we need to remove all leading and trailing spaces from each string:
func = lambda x: pd.Series([i.strip() for i in x])
df2 = df2.astype(str).apply(func, axis=1)
Next, we need to get the list of unique columns. To do that, we first extract the column name from each cell:
func = lambda x: pd.Series([i.split(' ')[1] for i in x if i != 'None'])
df3 = df2.astype(str).apply(func, axis=1)
Result:
0 1 2
0 A B C
1 B C NaN
2 B C D
Then create a list of unique columns ['A', 'B', 'C', 'D'] that are present in your DataFrame:
columns_list = pd.unique(df3[df3.columns].values.ravel('K'))
columns_list = [x for x in columns_list if not pd.isna(x)]
And create an empty base dataframe with those columns which will be used to assign the corresponding values:
result_df = pd.DataFrame(columns=columns_list)
Once the preparations are done, we can assign column values for each of the rows and use pd.concat to merge them back into one DataFrame:
result_list = []
result_list.append(result_df)  # adding the empty base table to ensure the columns are present

for row in df2.iterrows():
    result_object = {}  # dict that will represent one row of the source DataFrame
    for column in columns_list:
        for value in row[1]:  # row is a (row_index, Series) tuple; we only need the values
            if value != 'None':
                if value.split(' ')[1] == column:  # checking for the correct column to assign
                    result_object[column] = [value]
    result_list.append(pd.DataFrame(result_object))  # adding one single-row frame per source row
Once the list of DataFrames is generated we can use pd.concat to put it together:
final_df = pd.concat(result_list, ignore_index=True) # ignore_index will rebuild the index for the final_df
And the result will be:
A B C D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
I don't think this is the most elegant or efficient way to do it, but it will produce the results you need.
I have the following example df:
col1 col2 col3 doc_no
0 a x f 0
1 a x f 1
2 b x g 2
3 b y g 3
4 c x t 3
5 c y t 4
6 a x f 5
7 d x t 5
8 d x t 6
I want to group by the first three columns (col1, col2, col3), concatenate the fourth column (doc_no) into a comma-separated string per group, and also generate a count column for each grouping, sorted in descending order (count). Example desired output below (column order doesn't matter):
col1 col2 col3 count doc_no
0 a x f 3 0, 1, 5
1 d x t 2 5, 6
2 b x g 1 2
3 b y g 1 3
4 c x t 1 3
5 c y t 1 4
How would I go about doing this? I used the below line to get just the grouping and the count:
grouped_df = df.groupby(['col1','col2','col3']).size().reset_index(name='count')\
.sort_values(['count'], ascending=False).reset_index()
But I'm not sure how to also get the concatenated doc_no column in the same code line.
Try groupby and agg like so:
(df.groupby(['col1', 'col2', 'col3'])['doc_no']
.agg(['count', ('doc_no', lambda x: ','.join(map(str, x)))])
.sort_values('count', ascending=False)
.reset_index())
col1 col2 col3 count doc_no
0 a x f 3 0,1,5
1 d x t 2 5,6
2 b x g 1 2
3 b y g 1 3
4 c x t 1 3
5 c y t 1 4
agg is simple to use because you can specify a list of reducers to run on a single column.
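If you are on pandas 0.25 or newer, the same result can be written with named aggregation on the grouped frame, which avoids the (name, func) tuple syntax. A sketch rebuilding the question's data:

```python
import pandas as pd

df = pd.DataFrame({'col1': list('aabbccadd'),
                   'col2': list('xxxyxyxxx'),
                   'col3': list('ffggttftt'),
                   'doc_no': [0, 1, 2, 3, 3, 4, 5, 5, 6]})

# named aggregation: each keyword becomes an output column
out = (df.groupby(['col1', 'col2', 'col3'])
         .agg(count=('doc_no', 'count'),
              doc_no=('doc_no', lambda x: ', '.join(map(str, x))))
         .sort_values('count', ascending=False)
         .reset_index())
print(out)
```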
Let us do
df.doc_no=df.doc_no.astype(str)
s=df.groupby(['col1','col2','col3']).doc_no.agg(['count',','.join]).reset_index()
s
col1 col2 col3 count join
0 a x f 3 0,1,5
1 b x g 1 2
2 b y g 1 3
3 c x t 1 3
4 c y t 1 4
5 d x t 2 5,6
Another way, starting from the original integer doc_no:
df2 = df.groupby(['col1', 'col2', 'col3']).agg(doc_no=('doc_no', list)).reset_index()
df2['doc_no'] = df2['doc_no'].astype(str).str[1:-1]
Given df1:
A B C
0 a 7 x
1 b 3 x
2 a 5 y
3 b 4 y
4 a 5 z
5 b 3 z
How do I get df2 where, for each value in C of df1, a new column D holds the difference between the df1 values in col B where col A == a and where col A == b:
C D
0 x 4
1 y 1
2 z 2
I'd use a pivot table:
df = df1.pivot_table(columns='A', values='B', index='C')
df2 = pd.DataFrame({'D': df['a'] - df['b']})
The risk in the answer given by @YOBEN_S is that it will fail if b appears before a for a given value of C.
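The pivot approach in full, rebuilding the frame from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'A': list('ababab'),
                    'B': [7, 3, 5, 4, 5, 3],
                    'C': list('xxyyzz')})

# one column per value of A, indexed by C, so the order in which
# a and b appear within each C no longer matters
wide = df1.pivot_table(columns='A', values='B', index='C')
df2 = pd.DataFrame({'D': wide['a'] - wide['b']}).reset_index()
print(df2)
```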
I have two dataframes:
Dataframe A:
Col1 Col2 Value
A X 1
A Y 2
B X 3
B Y 2
C X 5
C Y 4
Dataframe B:
Col1
A
B
C
What I need is to add to Dataframe B one column for each value in Col2 of Dataframe A (in this case, X and Y), and to fill them with the values from the "Value" column after merging the two dataframes on Col1. Like this:
Col1 X Y
A 1 2
B 3 2
C 5 4
Thank you very much for your help!
B['X'] = A.loc[A['Col2'] == 'X', 'Value'].reset_index(drop = True)
B['Y'] = A.loc[A['Col2'] == 'Y', 'Value'].reset_index(drop = True)
Col1 X Y
0 A 1 2
1 B 3 2
2 C 5 4
If you are going to have 100s of distinct values in Col2, you can run the above assignment in a loop, like this:
for t in A['Col2'].unique():
    B[t] = A.loc[A['Col2'] == t, 'Value'].reset_index(drop = True)

B
You get the same output:
Col1 X Y
0 A 1 2
1 B 3 2
2 C 5 4
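The positional reset_index assignment relies on A being sorted by Col1 in the same order as B's rows. A pivot-plus-merge sketch avoids that assumption by aligning on the Col1 key itself:

```python
import pandas as pd

A = pd.DataFrame({'Col1': list('AABBCC'),
                  'Col2': list('XYXYXY'),
                  'Value': [1, 2, 3, 2, 5, 4]})
B = pd.DataFrame({'Col1': list('ABC')})

# spread Col2 values into columns, then merge back into B on Col1
wide = A.pivot(index='Col1', columns='Col2', values='Value').reset_index()
wide.columns.name = None  # drop the leftover 'Col2' axis label
out = B.merge(wide, on='Col1')
print(out)
```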