I have two Pandas DataFrames; df1 has two columns called A and B and looks like:
df1=pd.DataFrame(data={'A':[10,30], 'B':[5,4]})
A B
10 5
30 4
df2 has two columns B and C, and looks like:
df2=pd.DataFrame(data={'B':[4,7], 'C':[10,20]})
B C
4 10
7 20
I want to modify df1.A based on whether df1.B matches df2.B. If it does, df1.A should be divided by the matching df2.C. That is, with the df1 and df2 above, I want to get:
A B
10 5
3 4
Is there a one-line solution in Python?
This is essentially merge with some manipulation:
(df1.merge(df2, on='B', how='left')
 .assign(C=lambda x: x.C.fillna(1)) # rows without a match get `C` value `1`
.assign(A=lambda x: x.A/x.C) # divide by `C` value
.drop('C', axis=1) # remove the `C` column
)
Output:
A B
0 10.0 5
1 3.0 4
Alternative with map:
d = dict(zip(df2.B, df2.C))
f = lambda x: d.get(x, 1)
df1.assign(A=df1.A / df1.B.map(f))
A B
0 10.0 5
1 3.0 4
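A variant of the same mapping idea, as a sketch: build the lookup directly as a Series from df2 instead of a dict, and let fillna(1) supply the neutral divisor for unmatched rows.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [10, 30], 'B': [5, 4]})
df2 = pd.DataFrame({'B': [4, 7], 'C': [10, 20]})

# Map B -> C through a Series; unmatched rows become NaN, filled with 1
divisor = df1.B.map(df2.set_index('B')['C']).fillna(1)
res = df1.assign(A=df1.A / divisor)
```

This avoids the intermediate dict and lambda, at the cost of one set_index call.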
Related
I have two dataframes,
df1:
hash a b c
ABC 1 2 3
def 5 3 4
Xyz 3 2 -1
df2:
hash v
Xyz 3
def 5
I want to make
df:
hash a b c
ABC 1 2 3 (= as is, because no matching 'ABC' in df2)
def 25 15 20 (= 5*5 3*5 4*5)
Xyz 9 6 -3 (= 3*3 2*3 -1*3)
As shown above, I want a dataframe whose values multiply df1 by df2 wherever their index (or first-column) values match.
Since df2 has only one column (v), all of df1's columns except the first (hash) should be affected.
Is there a neat, Pythonic, pandas way to achieve this?
df1.set_index(['hash']).mul(df2.set_index(['hash'])) and similar attempts don't seem to work.
One approach:
df1 = df1.set_index("hash")
df2 = df2.set_index("hash")["v"]
res = df1.mul(df2, axis=0).combine_first(df1)
print(res)
Output
a b c
hash
ABC 1.0 2.0 3.0
Xyz 9.0 6.0 -3.0
def 25.0 15.0 20.0
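If you'd rather avoid the NaN round-trip (and the resulting float upcast) entirely, a possible sketch is to reindex the multiplier onto df1's index with a fill value of 1, so the integer dtypes survive:

```python
import pandas as pd

df1 = pd.DataFrame({'hash': ['ABC', 'def', 'Xyz'],
                    'a': [1, 5, 3], 'b': [2, 3, 2], 'c': [3, 4, -1]}).set_index('hash')
df2 = pd.DataFrame({'hash': ['Xyz', 'def'], 'v': [3, 5]}).set_index('hash')

# Align v to df1's index; hashes missing from df2 get a neutral multiplier of 1
v = df2['v'].reindex(df1.index, fill_value=1)
res = df1.mul(v, axis=0)
```

Because no NaN is ever produced, no combine_first or fillna step is needed afterwards.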
One Method:
# We'll make this for convenience
cols = ['a', 'b', 'c']
# Merge the DataFrames, keeping everything from df1
df = df1.merge(df2, 'left').fillna(1)
# Convert v back to integers, since fillna made it a float column.
df.v = df.v.astype(int)
# Broadcast the multiplication across axis 0
df[cols] = df[cols].mul(df.v, axis=0)
# Drop the no-longer needed column:
df = df.drop('v', axis=1)
print(df)
Output:
hash a b c
0 ABC 1 2 3
1 def 25 15 20
2 Xyz 9 6 -3
Alternative Method:
# Set indices
df1 = df1.set_index('hash')
df2 = df2.set_index('hash')
# Apply multiplication and fill values
df = (df1.mul(df2.v, axis=0)
.fillna(df1)
.astype(int)
.reset_index())
# Output:
hash a b c
0 ABC 1 2 3
1 Xyz 9 6 -3
2 def 25 15 20
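Another possible sketch uses DataFrame.update: the multiplication produces NaN rows for hashes that are missing from df2, and update overwrites only where the other frame is non-NaN (note that update may upcast the integer columns to float):

```python
import pandas as pd

df1 = pd.DataFrame({'hash': ['ABC', 'def', 'Xyz'],
                    'a': [1, 5, 3], 'b': [2, 3, 2], 'c': [3, 4, -1]}).set_index('hash')
df2 = pd.DataFrame({'hash': ['Xyz', 'def'], 'v': [3, 5]}).set_index('hash')

# Rows with no matching hash come out NaN from mul and are skipped by update
df1.update(df1.mul(df2['v'], axis=0))
```

This modifies df1 in place, which may or may not be what you want.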
The function you are looking for is actually multiply.
Here's how I have done it:
>>> df
hash a b
0 ABC 1 2
1 DEF 5 3
2 XYZ 3 -1
>>> df2
hash v
0 XYZ 4
1 ABC 8
df = df.merge(df2, on='hash', how='left').fillna(1)
>>> df
hash a b v
0 ABC 1 2 8.0
1 DEF 5 3 1.0
2 XYZ 3 -1 4.0
df[['a','b']] = df[['a','b']].multiply(df['v'], axis='index')
>>> df
hash a b v
0 ABC 8.0 16.0 8.0
1 DEF 5.0 3.0 1.0
2 XYZ 12.0 -4.0 4.0
You can actually drop v at the end if you don't need it.
I have a dataset that includes different types of tags; each Tag column holds a comma-separated string of tags.
How can I explode the selected columns at the same time?
id Tag1 Tag2
0 A a,b,c d,e
1 B m,n x
to this:
id Tag1 Tag2
0 A a d
1 A a e
2 A b d
3 A b e
4 A c d
5 A c e
6 B m x
7 B n x
First, split the string values of each Tag column into lists, using DataFrame.apply + Series.str.split. I'm using DataFrame.filter to select only the columns that start with 'Tag'.
Then, use DataFrame.explode in a loop to explode sequentially each Tag column of the df, turning the values of each list into new rows.
tag_cols = df.filter(like='Tag').columns
df[tag_cols] = df[tag_cols].apply(lambda col: col.str.split(','))
for col in tag_cols:
df = df.explode(col, ignore_index=True)
print(df)
Output:
id Tag1 Tag2
0 A a d
1 A a e
2 A b d
3 A b e
4 A c d
5 A c e
6 B m x
7 B n x
Note that using just df.apply(lambda col: col.str.split(',').explode()) won't work in this case because some rows have strings/lists with a different number of elements. Therefore the rows can't be correctly aligned after exploding them, and apply will complain.
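For reference, the same cartesian expansion can be sketched in plain Python with itertools.product, building one output row per Tag1 x Tag2 combination (assumes the two-Tag layout from the question; a more general version would iterate over the tag columns):

```python
import itertools
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B'], 'Tag1': ['a,b,c', 'm,n'], 'Tag2': ['d,e', 'x']})

# One output row per combination of Tag1 x Tag2 within each input row
rows = [
    {'id': r.id, 'Tag1': t1, 'Tag2': t2}
    for r in df.itertuples(index=False)
    for t1, t2 in itertools.product(r.Tag1.split(','), r.Tag2.split(','))
]
out = pd.DataFrame(rows)
```

The sequential-explode loop above is more idiomatic pandas; this version just makes the cartesian-product semantics explicit.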
Consider this dataframe.
df = pd.DataFrame(data={'one': list('abcd'),
'two': list('efgh'),
'three': list('ajha')})
one two three
0 a e a
1 b f j
2 c g h
3 d h a
How can I output all duplicate values and their respective index? The output can look something like this.
id value
0 2 h
1 3 h
2 0 a
3 0 a
4 3 a
Try .melt + .duplicated:
x = df.reset_index().melt("index")
print(
x.loc[x.duplicated(["value"], keep=False), ["index", "value"]]
.reset_index(drop=True)
.rename(columns={"index": "id"})
)
Prints:
id value
0 0 a
1 3 h
2 0 a
3 2 h
4 3 a
We can stack the DataFrame, use Series.loc to keep only the values that are Series.duplicated, then Series.reset_index to convert back to a DataFrame:
new_df = (
df.stack() # Convert to Long Form
.droplevel(-1).rename_axis('id') # Handle MultiIndex
.loc[lambda x: x.duplicated(keep=False)] # Filter Values
.reset_index(name='value') # Make Series a DataFrame
)
new_df:
id value
0 0 a
1 0 a
2 2 h
3 3 h
4 3 a
Here I used melt to reshape and duplicated(keep=False) to select the duplicates:
(df.rename_axis('id')
.reset_index()
.melt(id_vars='id')
.loc[lambda d: d['value'].duplicated(keep=False), ['id','value']]
.sort_values(by='id')
.reset_index(drop=True)
)
Output:
id value
0 0 a
1 0 a
2 2 h
3 3 h
4 3 a
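An equivalent variant, as a sketch: stack, then select values whose overall count exceeds one via value_counts + isin, which does the same job as duplicated(keep=False):

```python
import pandas as pd

df = pd.DataFrame({'one': list('abcd'), 'two': list('efgh'), 'three': list('ajha')})

s = df.stack()
counts = s.value_counts()
dup_vals = counts[counts > 1].index          # values appearing more than once
res = (s[s.isin(dup_vals)]
         .droplevel(-1).rename_axis('id')    # drop the column level, name the index
         .reset_index(name='value'))
```

value_counts also gives you the multiplicities for free, should you need them.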
I have two dataframes:
df1
key value
A 1
B 2
C 2
D 3
df2
key value
C 3
D 3
E 5
F 7
I would like to merge these dataframes by their key and get a dataframe that looks like the one below. That is, I want a single value column (no new columns with suffixes), and where the two values conflict, df2's value should be dropped.
df_merged
key value
A 1
B 2
C 2
D 3
E 5
F 7
How can I do this? Should I rather use join or concat? Thanks a lot!
Use concat with DataFrame.drop_duplicates by column key (it keeps the first occurrence, so df1's values win):
df = pd.concat([df1, df2], ignore_index=True).drop_duplicates('key')
print(df)
key value
0 A 1
1 B 2
2 C 2
3 D 3
6 E 5
7 F 7
Just adding to @jezrael's answer, you could also use groupby with first:
>>> pd.concat([df1, df2], ignore_index=True).groupby('key', as_index=False).first()
key value
0 A 1
1 B 2
2 C 2
3 D 3
4 E 5
5 F 7
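combine_first offers a possible one-liner here as well (a sketch): df1's values take precedence wherever both frames have the key, and df2 fills in the rest. Note the value column may come back as float, since alignment introduces temporary NaNs.

```python
import pandas as pd

df1 = pd.DataFrame({'key': list('ABCD'), 'value': [1, 2, 2, 3]})
df2 = pd.DataFrame({'key': list('CDEF'), 'value': [3, 3, 5, 7]})

# df1 wins on the overlapping keys (C, D); df2 supplies E and F
df = (df1.set_index('key')
         .combine_first(df2.set_index('key'))
         .reset_index())
```

Unlike the concat approaches, this also expresses the "df1 wins" rule directly rather than relying on row order.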
I am trying to get an output where column d is summed across d1 and d2 wherever a, b, c are the same (like a groupby).
For example
d1 = pd.DataFrame([[1,2,3,4]],columns=['a','b','c','d'])
d2 = pd.DataFrame([[1,2,3,4],[2,3,4,5]],columns=['a','b','c','d'])
then I'd like to get an output as
a b c d
0 1 2 3 8
1 2 3 4 5
That is, merge the two data frames and add up column d where a, b, c match.
d1.add(d2) (or radd) sums all columns element-wise by position, which is not what I want.
The result should be a DataFrame that can itself be added to another one in the same way.
Any help is appreciated.
You can use set_index first:
print(d2.set_index(['a','b','c'])
.add(d1.set_index(['a','b','c']), fill_value=0)
.astype(int)
.reset_index())
a b c d
0 1 2 3 8
1 2 3 4 5
Another option: concat, then groupby-sum on the key columns:
df = pd.concat([d1, d2]).groupby(['a', 'b', 'c'], as_index=False).sum()
   a  b  c  d
0  1  2  3  8
1  2  3  4  5