I have a DataFrame df1:
| A | B | C | D |
-----------------
| 0 | 1 | 3 | 4 |
| 2 | 1 | 8 | 4 |
| 0 | 2 | 3 | 1 |
and a DataFrame df2:
| A | D |
---------
| 2 | 2 |
| 3 | 2 |
| 1 | 9 |
I want to replace columns A and D of df1 with the corresponding columns of df2.
Surely I could do something like
df1['A'] = df2['A']
df1['D'] = df2['D']
But I need a solution for doing this automatically since I have thousands of columns.
You can use combine_first:
df2.combine_first(df1)
# A B C D
#0 2 1.0 3.0 2
#1 3 1.0 8.0 2
#2 1 2.0 3.0 9
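Note that combine_first returns a new DataFrame rather than modifying df1, and the alignment step upcasts B and C to float. A minimal runnable sketch, reconstructing the frames from the question (the astype line is only needed if you want the integer dtypes back):

import pandas as pd

df1 = pd.DataFrame({"A": [0, 2, 0], "B": [1, 1, 2], "C": [3, 8, 3], "D": [4, 4, 1]})
df2 = pd.DataFrame({"A": [2, 3, 1], "D": [2, 2, 9]})

# df2's non-NA values win; columns df2 lacks (B and C) fall back to df1.
out = df2.combine_first(df1)

# Restore the original integer dtypes if needed:
out = out.astype(df1.dtypes.to_dict())
print(out)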
The way to do this is with pd.DataFrame.update
Update will modify a dataframe in place with information in another dataframe.
df1.update(df2)
The advantage of this is that your dtypes in df1 are preserved.
df1
A B C D
0 2 1 3 2
1 3 1 8 2
2 1 2 3 9
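One caveat: update only applies non-NA values from the other frame, so any NaN in df2 leaves the existing df1 value untouched. A small sketch with a hypothetical df2 variant containing a NaN:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [0, 2, 0], "B": [1, 1, 2], "C": [3, 8, 3], "D": [4, 4, 1]})
df2 = pd.DataFrame({"A": [2, np.nan, 1], "D": [2, 2, 9]})

df1.update(df2)
print(df1)  # row 1 of column A keeps df1's original value; NaN never overwrites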
Another way to do this without updating in place would be to use pd.DataFrame.assign and dictionary unpacking on pd.DataFrame.items (iteritems in older pandas; it was removed in pandas 2.0). However, this would also include any additional columns that exist only in df2.
df1.assign(**dict(df2.items()))
A B C D
0 2 1 3 2
1 3 1 8 2
2 1 2 3 9
A simple for loop should suffice:
for c in df2.columns:
    df1[c] = df2[c]
Or, to copy only the columns that already exist in df1:
for col in df1.columns:
    if col in df2.columns:
        df1[col] = df2[col]
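If you'd rather avoid the explicit loop, both variants collapse into one label-based assignment over the shared columns. A sketch, under the same assumption as the loops, namely that the row indexes of df1 and df2 align:

# Assign every column the two frames have in common in one step.
common = df1.columns.intersection(df2.columns)
df1[common] = df2[common]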
Hi, I have a DataFrame which has values like
| ID| Value| comments |
| 1 | a | |
| 2 | b | |
| 3 | a;b;c| |
| 4 | b;c | |
| 5 | d;a;c| |
I need to transfer a and b from Value to comments for all the rows they appear in, so that only values other than a and b remain in the data.
the new df would look like this
| ID| Value| comments |
| 1 | | a |
| 2 | | b |
| 3 | c | a;b |
| 4 | c | b |
| 5 | d;c | a |
Can you give me a direction on where I should look for the answer to this?
(i) Use str.split to split on ';' and explode the "Value" column
(ii) Use boolean indexing to filter the rows where 'a' or 'b' occur, group them by the original index, and join them back with ';' as separator
exploded_series = df['Value'].str.split(';').explode()
mask = exploded_series.isin(['a','b'])
df['comments'] = exploded_series[mask].groupby(level=0).apply(';'.join)
df['Value'] = exploded_series[~mask].groupby(level=0).apply(';'.join)
df = df.fillna('')
Output:
ID Value comments
0 1 a
1 2 b
2 3 c a;b
3 4 c b
4 5 d;c a
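For reference, a self-contained version of the same approach, reconstructing the sample frame from the question; .agg is interchangeable with .apply for this aggregation:

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Value': ['a', 'b', 'a;b;c', 'b;c', 'd;a;c'],
                   'comments': [''] * 5})

exploded_series = df['Value'].str.split(';').explode()
mask = exploded_series.isin(['a', 'b'])
# explode keeps the original row index, so level=0 groups the pieces per row
df['comments'] = exploded_series[mask].groupby(level=0).agg(';'.join)
df['Value'] = exploded_series[~mask].groupby(level=0).agg(';'.join)
print(df.fillna(''))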
Explode your Value column, then label each value with its destination column:
import numpy as np

out = df.assign(Value=df['Value'].str.split(';')).explode('Value')
out['col'] = np.where(out['Value'].isin(['a', 'b']), 'comments', 'Value')
print(out)
# Intermediate output
ID Value comments col
0 1 a NaN comments
1 2 b NaN comments
2 3 a NaN comments
2 3 b NaN comments
2 3 c NaN Value
3 4 b NaN comments
3 4 c NaN Value
4 5 d NaN Value
4 5 a NaN comments
4 5 c NaN Value
Now pivot your dataframe:
out = out.pivot_table(index='ID', columns='col', values='Value', aggfunc=';'.join) \
.fillna('').reset_index().rename_axis(columns=None)
print(out)
# Final output
ID Value comments
0 1 a
1 2 b
2 3 c a;b
3 4 c b
4 5 d;c a
I have 3 dataframes A, B, C:
import numpy as np
import pandas as pd

A = pd.DataFrame({"id": [1, 2],
                  "connected_to_B_id1": ["A", "B"],
                  "connected_to_B_id2": ["B", "C"],
                  "connected_to_B_id3": ["C", np.nan],
                  # an entry can have multiple ids from B
                  })
B = pd.DataFrame({"id": ["A", "B", "C"],
                  "connected_to_C_id1": [1, 1, 2],
                  "connected_to_C_id2": [2, 2, np.nan],
                  # an entry can have multiple ids from C
                  })
C = pd.DataFrame({"id": [1, 2],
                  "name": ["a", "b"],
                  })
# Output should be D:
D = pd.DataFrame({"id_A": [1, 1, 1, 1, 1, 2, 2, 2],
                  "id_B": ["A", "A", "B", "B", "C", "B", "B", "C"],
                  "id_C": [1, 2, 1, 2, 2, 1, 2, 1],
                  "name": ["a", "b", "a", "b", "b", "a", "b", "a"]
                  })
I want to use the IDs stored in the "connected_to_X" columns of each dataframe to create a dataframe, which contains all relationships recorded in the three individual dataframes.
What is the most elegant way to combine the dataframes to A, B and C to D?
Currently I am using dicts, lists and for loops, and it's messy and complicated.
D:
|idx |id_A|id_B|id_C|name|
|---:|--:|--:|--:|--:|
| 0 | 1 | A | 1 | a |
| 1 | 1 | A | 2 | b |
| 2 | 1 | B | 1 | a |
| 3 | 1 | B | 2 | b |
| 4 | 1 | C | 2 | b |
| 5 | 2 | B | 1 | a |
| 6 | 2 | B | 2 | b |
| 7 | 2 | C | 1 | a |
You just need to unpivot A and B, then you can join the tables up.
(A.
melt(id_vars='id').
merge(B.melt(id_vars='id'), left_on = 'value', right_on='id', how='left').
merge(C, left_on = 'value_y', right_on='id').
drop(columns = ['variable_x', 'variable_y', 'value_x']).
sort_values(['id_x', 'id_y']).
reset_index(drop=True).
reset_index()
)
index id_x id_y value_y id name
0 0 1 A 1.0 1 a
1 1 1 A 2.0 2 b
2 2 1 B 1.0 1 a
3 3 1 B 2.0 2 b
4 4 1 C 2.0 2 b
5 5 2 B 1.0 1 a
6 6 2 B 2.0 2 b
7 7 2 C 2.0 2 b
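Note that the last row differs from the D in the question: B's row "C" links only to C id 2, so the expected row (2, C, 1, a) appears inconsistent with the sample data. To match D's column names exactly, a sketch extending the chain above with a rename and cleanup:

D = (A.melt(id_vars='id')
       .merge(B.melt(id_vars='id'), left_on='value', right_on='id', how='left')
       .merge(C, left_on='value_y', right_on='id')
       .drop(columns=['variable_x', 'variable_y', 'value_x', 'id'])
       .rename(columns={'id_x': 'id_A', 'id_y': 'id_B', 'value_y': 'id_C'})
       .astype({'id_C': int})   # safe once the inner merge drops the NaN rows
       .sort_values(['id_A', 'id_B'])
       .reset_index(drop=True))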
Let's suppose I have a dataframe with two columns:
A | B
1 | b
1 | b
1 | a
2 | a
2 | b
3 | b
3 | c
3 | d
I want to get the first occurrence of B for each value of column A. It would be something like:
A | B
1 | b
2 | a
3 | b
then the second occurrence, something like this:
A | B
1 | b
2 | b
3 | c
and then the third occurrence:
A | B
1 | a
2 | NULL
3 | d
Any tips on how to do this?
IIUC, here's one way:
df1 = df.pivot_table(index='A', columns=df.groupby('A').cumcount(), values='B', aggfunc='first')
result = [df1[col].reset_index(name='B') for col in df1.columns]  # this gives you the list of dfs
OUTPUT:
[ A B
0 1 b
1 2 a
2 3 b,
A B
0 1 b
1 2 b
2 3 c,
A B
0 1 a
1 2 NaN
2 3 d]
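An alternative that skips the pivot entirely: filter on groupby('A').cumcount(). The helper name nth_occurrence here is hypothetical. Unlike the pivot approach, groups without an n-th row simply drop out instead of showing NULL:

def nth_occurrence(df, n):
    # rows where B is the n-th (0-based) occurrence within its A group
    return df[df.groupby('A').cumcount() == n]

print(nth_occurrence(df, 2))
#    A  B
# 2  1  a
# 7  3  d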
I got a DataFrame like this. In the values column there is a list of numbers per row, and in the categories column there is a list of categories per row. Values are of type int and categories of type str. Each value in the values column always fits the category at the same position of the list in the categories column. You can think of it as recipes. For example: for the recipe in the first row you need 2 of a, 4 of c, 3 of d and 5 of e.
| values | categories |
| ------ | ---------- |
| [2,4,3,5] | ['a','c','d','e'] |
| [1,6,7] | ['b','c','e'] |
| [3,5] | ['c','f'] |
I need to create a new DataFrame with pandas/Python that takes the distinct categories as columns and fills the rows with the corresponding values, so that it looks like this:
| a | b | c | d | e | f |
| - | - | - | - | - | - |
| 2 | 0 | 4 | 3 | 5 | 0 |
| 0 | 1 | 6 | 0 | 7 | 0 |
| 0 | 0 | 3 | 0 | 0 | 5 |
Thank you for your help.
Another option with explode and pivot:
df.apply(pd.Series.explode).pivot(columns='categories').fillna(0)
Output:
values
categories a b c d e f
0 2 0 4 3 5 0
1 0 1 6 0 7 0
2 0 0 3 0 0 5
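If you'd rather not carry the extra 'values' column level, a small variation on the same idea: pass values= to pivot, then cast back to int and clear the columns name:

out = (df.apply(pd.Series.explode)
         .pivot(columns='categories', values='values')
         .fillna(0)
         .astype(int)
         .rename_axis(columns=None))
print(out)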
Use a list comprehension to build a list of dictionaries, pass it to the DataFrame constructor, and then replace missing values with 0 and sort the column names:
L = [dict(zip(a, b)) for a, b in zip(df['categories'], df['values'])]
df = pd.DataFrame(L, index=df.index).fillna(0).astype(int).sort_index(axis=1)
print (df)
a b c d e f
0 2 0 4 3 5 0
1 0 1 6 0 7 0
2 0 0 3 0 0 5
Another idea is to create a dictionary from all unique, sorted column names and use the {**dict1, **dict2} merge trick:
d = dict.fromkeys(sorted(set([y for x in df['categories'] for y in x])), 0)
L = [{ **d, **dict(zip(a, b))} for a, b in zip(df['categories'], df['values'])]
df = pd.DataFrame(L, index=df.index)
print (df)
a b c d e f
0 2 0 4 3 5 0
1 0 1 6 0 7 0
2 0 0 3 0 0 5
I have a dataframe like this:
| a | b | c |
0 | 0 | 0 | 0 |
1 | 5 | 5 | 5 |
I have a dataframe row (or series) like this:
| a | b | c |
0 | 1 | 2 | 3 |
I want to add the row to the entire dataframe to obtain this:
| a | b | c |
0 | 1 | 2 | 3 |
1 | 6 | 7 | 8 |
Any help is appreciated, thanks.
Use DataFrame.add (or DataFrame.sub for subtraction) and convert the one-row DataFrame to a Series, e.g. with DataFrame.iloc for the first row:
df = df1.add(df2.iloc[0])
#alternative select by row label
#df = df1.add(df2.loc[0])
print (df)
a b c
0 1 2 3
1 6 7 8
Detail:
print (df2.iloc[0])
a 1
b 2
c 3
Name: 0, dtype: int64
You can convert the second dataframe to a NumPy array:
df1 + df2.values
Output:
a b c
0 1 2 3
1 6 7 8
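One caveat: .values strips the labels, so the addition is purely positional. If the row's columns are ordered differently from df1's, the result is silently wrong, whereas the label-aligned .add above stays correct. A quick sketch:

df2_shuffled = df2[['c', 'a', 'b']]
print(df1 + df2_shuffled.values)      # mis-pairs columns by position, no error
print(df1.add(df2_shuffled.iloc[0]))  # aligns on labels, still correct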