I have two dataframes with the same column 'Col A' that I want to merge on. However, in df2 the values in Col A are replicated a varying number of times. This replication is important to my problem and I cannot drop it. I want the final dataframe to look like df3, where df1's Col B value is attached (as Col C) to each replication of its Col A key.
df1              df2
Col A  Col B     Col A  Col B
1      v         1      a
2      w         2      b
3      x         2      c
4      y         3      d
                 3      e
                 4      f

df3
Col A  Col B  Col C
1      a      v
2      b      w
2      c      w
3      d      x
3      e      x
4      f      y
Use merge:
df2.merge(df1, on='Col A')
Out:
Col A Col B_x Col B_y
0 1 a v
1 2 b w
2 2 c w
3 3 d x
4 3 e x
5 4 f y
And if necessary, rename afterwards:
df = df2.merge(df1, on='Col A')
df.columns = ['Col A', 'Col B', 'Col C']
For more info, see the pandas documentation on merging and joining.
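Alternatively, the suffixes can be controlled inside the merge call itself, so only one rename is needed afterwards. A small sketch rebuilding the toy frames from the question (the '_df1' suffix is my own naming choice):

```python
import pandas as pd

df1 = pd.DataFrame({'Col A': [1, 2, 3, 4], 'Col B': list('vwxy')})
df2 = pd.DataFrame({'Col A': [1, 2, 2, 3, 3, 4], 'Col B': list('abcdef')})

# suffixes=('', '_df1') keeps df2's 'Col B' name unchanged and
# marks df1's clashing copy, which we then rename to 'Col C'
merged = (df2.merge(df1, on='Col A', suffixes=('', '_df1'))
             .rename(columns={'Col B_df1': 'Col C'}))
print(merged)
```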
I believe you need Series.map with a mapping Series created by set_index:
print (df1.set_index('Col A')['Col B'])
Col A
1 v
2 w
3 x
4 y
Name: Col B, dtype: object
df2['Col C'] = df2['Col A'].map(df1.set_index('Col A')['Col B'])
print (df2)
Col A Col B Col C
0 1 a v
1 2 b w
2 2 c w
3 3 d x
4 3 e x
5 4 f y
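One behavioural difference worth noting (my own addition, not from the answer): keys of df2 that are missing from df1 come out as NaN under map, while the inner merge above would drop those rows entirely. A small sketch with a hypothetical unmatched key 5:

```python
import pandas as pd

df1 = pd.DataFrame({'Col A': [1, 2], 'Col B': ['v', 'w']})
df2 = pd.DataFrame({'Col A': [1, 2, 5], 'Col B': ['a', 'b', 'c']})

# key 5 has no match in df1, so map leaves NaN in that row
# instead of silently dropping it
df2['Col C'] = df2['Col A'].map(df1.set_index('Col A')['Col B'])
print(df2)
```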
I am trying to concatenate two dataframes. I've tried using merge(), join(), concat() in pandas, but none gave me my desired output.
df1:
Index  value
0      a
1      b
2      c
3      d
4      e

df2:
Index  value
1      f
2      g
3      h
4      i
5      j

desired output:
Index  col1  col2
0      a     f
1      b     g
2      c     h
3      d     i
4      e     j
Thanks in advance!
You can just use pd.merge and specify a left join on the index, as follows:
import pandas as pd
df1 = pd.DataFrame(data={'value': list('ABCDE')})
df2 = pd.DataFrame(data={'value': list('FGHIJ')}, index=range(1, 6))
pd.merge(df1.rename(columns={'value': 'col1'}),
         df2.reset_index(drop=True).rename(columns={'value': 'col2'}),
         how='left', left_index=True, right_index=True)
-----------------------------------
col1 col2
0 A F
1 B G
2 C H
3 D I
4 E J
-----------------------------------
Does resetting the index of df2 work for your use case?
pd.concat([df1, df2.reset_index(drop=True)], axis=1) \
  .set_axis(['Col1', 'Col2'], axis=1)
Result
Col1 Col2
0 a f
1 b g
2 c h
3 d i
4 e j
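Rebuilding the question's frames, the concat approach in full as a self-contained sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'value': list('abcde')})
df2 = pd.DataFrame({'value': list('fghij')}, index=range(1, 6))

# drop df2's offset index so the rows align positionally with df1
out = (pd.concat([df1, df2.reset_index(drop=True)], axis=1)
         .set_axis(['Col1', 'Col2'], axis=1))
print(out)
```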
I have a dataframe with a column like this
Col1
1 A, 2 B, 3 C
2 B, 4 C
1 B, 2 C, 4 D
I have used .str.split(',', expand=True); the result is like this:
0 | 1 | 2
1 A | 2 B | 3 C
2 B | 4 C | None
1 B | 2 C | 4 D
What I am trying to achieve is to get this:
Col A| Col B| Col C| Col D
1 A | 2 B | 3 C | None
None | 2 B | 4 C | None
None | 1 B | 2 C | 4 D
I am stuck; how do I get the new columns formatted like that?
Let's try:
# split and explode
s = df['Col1'].str.split(', ').explode()
# create new multi-level index
s.index = pd.MultiIndex.from_arrays([s.index, s.str.split().str[-1].tolist()])
# unstack to reshape
out = s.unstack().add_prefix('Col ')
Details:
# split and explode
0 1 A
0 2 B
0 3 C
1 2 B
1 4 C
2 1 B
2 2 C
2 4 D
Name: Col1, dtype: object
# create new multi-level index
0 A 1 A
B 2 B
C 3 C
1 B 2 B
C 4 C
2 B 1 B
C 2 C
D 4 D
Name: Col1, dtype: object
# unstack to reshape
Col A Col B Col C Col D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
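Put together as a self-contained sketch (the sample frame is rebuilt from the question):

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']})

# split each cell into a list and explode to one value per row
s = df['Col1'].str.split(', ').explode()
# the second index level is the letter, which becomes the column label
s.index = pd.MultiIndex.from_arrays([s.index, s.str.split().str[-1].tolist()])
# unstack moves the letter level into columns, filling gaps with NaN
out = s.unstack().add_prefix('Col ')
print(out)
```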
Most probably there are more general approaches you can use, but this worked for me. Please note that it is based on a lot of assumptions and constraints of your particular example.
test_dict = {'col_1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']}
df = pd.DataFrame(test_dict)
First, we split the df into initial columns:
df2 = df.col_1.str.split(pat=',', expand=True)
Result:
0 1 2
0 1 A 2 B 3 C
1 2 B 4 C None
2 1 B 2 C 4 D
Next (first assumption), we need to ensure that we can later use ' ' as a delimiter to extract the columns. In order to do that, we need to remove all leading and trailing spaces from each string:
func = lambda x: pd.Series([i.strip() for i in x])
df2 = df2.astype(str).apply(func, axis=1)
Next, we need to get the list of unique columns. To do that, we first extract the column name from each cell:
func = lambda x: pd.Series([i.split(' ')[1] for i in x if i != 'None'])
df3 = df2.astype(str).apply(func, axis=1)
Result:
0 1 2
0 A B C
1 B C NaN
2 B C D
Then create a list of unique columns ['A', 'B', 'C', 'D'] that are present in your DataFrame:
columns_list = pd.unique(df3[df3.columns].values.ravel('K'))
columns_list = [x for x in columns_list if not pd.isna(x)]
And create an empty base dataframe with those columns which will be used to assign the corresponding values:
result_df = pd.DataFrame(columns=columns_list)
Once the preparations are done, we can assign column values for each of the rows and use pd.concat to merge them back into one DataFrame:
result_list = []
result_list.append(result_df)  # adding the empty base table to ensure the columns are present

for row in df2.iterrows():
    result_object = {}  # dict that will represent one row of the source DataFrame
    for column in columns_list:
        for value in row[1]:  # row is a (row_index, Series) tuple; we only need the values
            if value != 'None':
                if value.split(' ')[1] == column:  # checking for the correct column to assign
                    result_object[column] = [value]
    result_list.append(pd.DataFrame(result_object))  # adding one single-row frame per source row
Once the list of DataFrames is generated we can use pd.concat to put it together:
final_df = pd.concat(result_list, ignore_index=True) # ignore_index will rebuild the index for the final_df
And the result will be:
A B C D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
I don't think this is the most elegant or efficient way to do it, but it will produce the results you need.
I have the following example df:
col1 col2 col3 doc_no
0 a x f 0
1 a x f 1
2 b x g 2
3 b y g 3
4 c x t 3
5 c y t 4
6 a x f 5
7 d x t 5
8 d x t 6
I want to group by the first three columns (col1, col2, col3), concatenate the fourth column (doc_no) into a comma-separated string per group, and also generate a count column for each grouping, sorted in descending order (count). Example desired output below (column order doesn't matter):
col1 col2 col3 count doc_no
0 a x f 3 0, 1, 5
1 d x t 2 5, 6
2 b x g 1 2
3 b y g 1 3
4 c x t 1 3
5 c y t 1 4
How would I go about doing this? I used the below line to get just the grouping and the count:
grouped_df = df.groupby(['col1','col2','col3']).size().reset_index(name='count')\
.sort_values(['count'], ascending=False).reset_index()
But I'm not sure how to also get the concatenated doc_no column in the same code line.
Try groupby and agg like so:
(df.groupby(['col1', 'col2', 'col3'])['doc_no']
.agg(['count', ('doc_no', lambda x: ','.join(map(str, x)))])
.sort_values('count', ascending=False)
.reset_index())
col1 col2 col3 count doc_no
0 a x f 3 0,1,5
1 d x t 2 5,6
2 b x g 1 2
3 b y g 1 3
4 c x t 1 3
5 c y t 1 4
agg is simple to use because you can specify a list of reducers to run on a single column.
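If you are on pandas 0.25 or newer, the same result can be written with named aggregation on the grouped frame, which avoids the (name, func) tuple syntax. A sketch rebuilding the question's data:

```python
import pandas as pd

df = pd.DataFrame({'col1': list('aabbccadd'),
                   'col2': list('xxxyxyxxx'),
                   'col3': list('ffggttftt'),
                   'doc_no': [0, 1, 2, 3, 3, 4, 5, 5, 6]})

# named aggregation: each keyword becomes an output column
out = (df.groupby(['col1', 'col2', 'col3'])
         .agg(count=('doc_no', 'count'),
              doc_no=('doc_no', lambda x: ', '.join(map(str, x))))
         .sort_values('count', ascending=False)
         .reset_index())
print(out)
```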
Let us do
df.doc_no=df.doc_no.astype(str)
s=df.groupby(['col1','col2','col3']).doc_no.agg(['count',','.join]).reset_index()
s
col1 col2 col3 count join
0 a x f 3 0,1,5
1 b x g 1 2
2 b y g 1 3
3 c x t 1 3
4 c y t 1 4
5 d x t 2 5,6
Another way, starting from the original integer doc_no:
df2 = df.groupby(['col1', 'col2', 'col3']).agg(doc_no=('doc_no', list)).reset_index()
df2['doc_no'] = df2['doc_no'].astype(str).str[1:-1]
Given df1:
A B C
0 a 7 x
1 b 3 x
2 a 5 y
3 b 4 y
4 a 5 z
5 b 3 z
How do I get df2 where, for each value in C of df1, a new column D holds the difference between the df1 values in col B where col A == a and where col A == b:
C D
0 x 4
1 y 1
2 z 2
I'd use a pivot table:
df = df1.pivot_table(columns='A', values='B', index='C')
df2 = pd.DataFrame({'D': df['a'] - df['b']})
The risk in the answer given by @YOBEN_S is that it will fail if b appears before a for a given value of C.
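The pivot approach in full, rebuilding the frame from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'A': list('ababab'),
                    'B': [7, 3, 5, 4, 5, 3],
                    'C': list('xxyyzz')})

# one column per value of A, indexed by C, so the order in which
# a and b appear within each C no longer matters
wide = df1.pivot_table(columns='A', values='B', index='C')
df2 = pd.DataFrame({'D': wide['a'] - wide['b']}).reset_index()
print(df2)
```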
I have two dataframes:
Dataframe A:
Col1 Col2 Value
A X 1
A Y 2
B X 3
B Y 2
C X 5
C Y 4
Dataframe B:
Col1
A
B
C
What I need is to add to Dataframe B one column for each value in Col2 of Dataframe A (in this case, X and Y), and to fill them with the values from the "Value" column after merging the two dataframes on Col1. Like this:
Col1 X Y
A 1 2
B 3 2
C 5 4
Thank you very much for your help!
B['X'] = A.loc[A['Col2'] == 'X', 'Value'].reset_index(drop = True)
B['Y'] = A.loc[A['Col2'] == 'Y', 'Value'].reset_index(drop = True)
Col1 X Y
0 A 1 2
1 B 3 2
2 C 5 4
If you are going to have 100s of distinct values in Col2, you can run the above assignment in a loop, like this:
for t in A['Col2'].unique():
    B[t] = A.loc[A['Col2'] == t, 'Value'].reset_index(drop = True)

B
You get the same output:
Col1 X Y
0 A 1 2
1 B 3 2
2 C 5 4
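The positional reset_index assignment relies on A being sorted by Col1 in the same order as B's rows. A pivot-plus-merge sketch avoids that assumption by aligning on the Col1 key itself:

```python
import pandas as pd

A = pd.DataFrame({'Col1': list('AABBCC'),
                  'Col2': list('XYXYXY'),
                  'Value': [1, 2, 3, 2, 5, 4]})
B = pd.DataFrame({'Col1': list('ABC')})

# spread Col2 values into columns, then merge back into B on Col1
wide = A.pivot(index='Col1', columns='Col2', values='Value').reset_index()
wide.columns.name = None  # drop the leftover 'Col2' axis label
out = B.merge(wide, on='Col1')
print(out)
```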