I have two dataframes
df1 = pd.DataFrame([[1,2],[3,4],[5,6],[7,8]], index = ['a','b','c', 'a'], columns = ['d','e'])
d e
a 1 2
b 3 4
c 5 6
a 7 8
df2 = pd.DataFrame([['a', 10],['b',20],['c',30],['f',40]])
0 1
0 a 10
1 b 20
2 c 30
3 f 40
I want each row of df1 multiplied by the factor in df2 corresponding to its index label (e.g. 20 for b), so my output should look like:
d e
a 10 20
b 60 80
c 150 180
a 70 80
Kindly provide a solution that scales, since df1 can be hundreds of rows long; all I could think of was looping through df1.index.
Use set_index and reindex to align df2 with df1, then mul:
In [1150]: df1.mul(df2.set_index(0).reindex(df1.index)[1], axis=0)
Out[1150]:
d e
a 10 20
b 60 80
c 150 180
a 70 80
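Equivalently (a sketch of my own, not part of the original answer), you can build a dict from df2 and map df1's index through it before multiplying row-wise:

import pandas as pd

df1 = pd.DataFrame([[1,2],[3,4],[5,6],[7,8]], index=['a','b','c','a'], columns=['d','e'])
df2 = pd.DataFrame([['a',10],['b',20],['c',30],['f',40]])

# map each (possibly duplicated) index label to its factor, then multiply row-wise
factors = df1.index.map(dict(df2.values))
result = df1.mul(factors, axis=0)

This avoids a Python-level loop entirely, so it scales fine to hundreds of rows.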
Create a mapping and call df.apply:
In [1128]: mapping = dict(df2.values)
In [1129]: df1.apply(lambda x: x * mapping[x.name], 1)
Out[1129]:
d e
a 10 20
b 60 80
c 150 180
a 70 80
IIUC (this assumes import numpy as np):
In [55]: df1 * pd.DataFrame(np.tile(df2[[1]],2), columns=df1.columns, index=df2[0])
Out[55]:
d e
a 10 20
a 70 80
b 60 80
c 150 180
Helper DF:
In [57]: pd.DataFrame(np.tile(df2[[1]],2), columns=df1.columns, index=df2[0])
Out[57]:
d e
0
a 10 10
b 20 20
c 30 30
This is straightforward. You just make sure they have a common axis, and then you can combine them.
Put the lookup column into the index:
df2.set_index(0, inplace=True)
1
0
a 10
b 20
c 30
Now you can put that column into df1 very easily:
df1['multiplying_factor'] = df2[1]
Now you just want to multiply two columns:
df1['final_value'] = df1.e*df1.multiplying_factor
Now df1 looks like:
d e multiplying_factor final_value
a 1 2 10 20
b 3 4 20 80
c 5 6 30 180
a 7 8 10 80
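If you want both d and e scaled, as in the desired output at the top, a small extension (my addition, not part of this answer) multiplies the whole frame by the factor column row-wise:

# scale every column at once instead of one column at a time
scaled = df1[['d', 'e']].mul(df1['multiplying_factor'], axis=0)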
There are other questions on the same topic and they helped but I have an extra twist.
I have a dataframe with multiple values in each (but not all) cells.
df = pd.DataFrame({'a':["10-30-410","20-40-500","25-50"], 'b':["5-8-9","4", "99"]})
           a      b
0  10-30-410  5-8-9
1  20-40-500      4
2      25-50     99
How can I split each cell by the dash "-" and create three new dataframes? Note that not all cells have multiple values, in which case the second and third dataframes get NA or blank (treating these as strings).
So I need df1 to be the first of those values:
    a   b
0  10   5
1  20   4
2  25  99
And df2 would be:
    a  b
0  30  8
1  40
2  50
And likewise for df3:
     a  b
0  410  9
1  500
2
I got df1 with this:
df1 = df.replace(r'(\d+).*(\d+).*(\d+)+', r'\1', regex=True)
But df2 doesn't quite work. I get the second values but also 4 and 99, which should be blank.
df2 = df.replace(r'(\d+)-(\d+).*', r'\2', regex=True)
    a   b
0  30   8
1  40   4   <- should be blank
2  50  99   <- should be blank
Is this the right approach? I'm pretty good on regex but fuzzy with groups. Thank you.
Use str.split + concat + stack to get the data in a more usable format:
new_df = pd.concat(
(df['a'].str.split('-', expand=True),
df['b'].str.split('-', expand=True)),
keys=('a', 'b'),
axis=1
).stack(dropna=False).droplevel(0)
new_df:
a b
0 10 5
1 30 8
2 410 9
0 20 4
1 40 None
2 500 None
0 25 99
1 50 None
2 None None
Expandable option for n cols:
cols = ['a', 'b']
new_df = pd.concat(
(df[c].str.split('-', expand=True) for c in cols),
keys=cols,
axis=1
).stack(dropna=False).droplevel(0)
Then groupby level 0 + reset_index to create a list of dataframes:
dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
dfs:
[ a b
0 10 5
1 20 4
2 25 99,
a b
0 30 8
1 40 None
2 50 None,
a b
0 410 9
1 500 None
2 None None]
Complete Working Example:
import pandas as pd
df = pd.DataFrame({
'a': ["10-30-410", "20-40-500", "25-50"],
'b': ["5-8-9", "4", "99"]
})
cols = ['a', 'b']
new_df = pd.concat(
(df[c].str.split('-', expand=True) for c in cols),
keys=cols,
axis=1
).stack(dropna=False).droplevel(0)
dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
print(dfs)
You can also try with filter:
k = pd.concat((df[c].str.split('-', expand=True).add_prefix(c + '-')
               for c in df.columns), axis=1).fillna('')
df1 = k.filter(like='0')
df2 = k.filter(like='1')
df3 = k.filter(like='2')
NOTE: To strip the digit from the column names, use: k.filter(like='0').rename(columns=lambda x: x.split('-')[0])
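A compact alternative (a sketch, not taken from the answers above): index into the split lists with .str[i], which yields NaN wherever a cell has no i-th piece, matching the desired blanks:

import pandas as pd

df = pd.DataFrame({'a': ["10-30-410", "20-40-500", "25-50"],
                   'b': ["5-8-9", "4", "99"]})

# .str[i] picks the i-th piece of each split list, or NaN when absent;
# chain .fillna('') if you want blanks instead of NaN
dfs = [df.apply(lambda col: col.str.split('-').str[i]) for i in range(3)]
df1, df2, df3 = dfs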
I have two dataframes. Each row in dataframe A has a list of indices corresponding to entries in dataframe B, plus a set of other values. I want to join the two dataframes so that each entry in B gets the other values from the row of A whose index list contains that entry's index.
So far, starting from this answer, I have found a way of extracting the rows in B for the list of indices in each row of A, but only row by row, and I am not sure where to go from there. I am also not sure whether there is a better way to do this dynamically in pandas, since the length of the index lists can vary.
import pandas as pd
import numpy as np
# Inputs
A = pd.DataFrame.from_dict({
"indices": [[0,1],[2,3],[4,5]],
"a1": ["a","b","c"],
"a2": [100,200,300]
})
print(A)
>> indices a1 a2
>> 0 [0, 1] a 100
>> 1 [2, 3] b 200
>> 2 [4, 5] c 300
B = pd.DataFrame.from_dict({
"b": [10,20,30,40,50,60]
})
print(B)
>> b
>> 0 10
>> 1 20
>> 2 30
>> 3 40
>> 4 50
>> 5 60
# This is the desired output
out = pd.DataFrame.from_dict({
"b": [10,20,30,40,50,60],
"a1": ["a","a", "b", "b", "c", "c"],
"a2": [100,100,200,200,300,300]
})
print(out)
>> b a1 a2
>> 0 10 a 100
>> 1 20 a 100
>> 2 30 b 200
>> 3 40 b 200
>> 4 50 c 300
>> 5 60 c 300
If you have pandas >=0.25, you can use explode:
C = A.explode('indices')
This gives:
indices a1 a2
0 0 a 100
0 1 a 100
1 2 b 200
1 3 b 200
2 4 c 300
2 5 c 300
Then do:
output = pd.merge(B, C, left_index = True, right_on = 'indices')
output.index = output.indices.values
output.drop('indices', axis = 1, inplace = True)
Final Output:
b a1 a2
0 10 a 100
1 20 a 100
2 30 b 200
3 40 b 200
4 50 c 300
5 60 c 300
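The same result can be reached a little more directly (a sketch, assuming every index in B appears exactly once across A.indices): explode, cast the exploded keys to int, and join on them:

C = A.explode('indices')
C['indices'] = C['indices'].astype(int)  # explode leaves the column as object dtype
out = B.join(C.set_index('indices'))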
Using pd.merge:
df2 = pd.DataFrame(A.set_index(['a1','a2']).indices)
df = (pd.DataFrame(df2.indices.values.tolist(), index=df2.index)
        .stack().reset_index().drop('level_2', axis=1).set_index(0))
pd.merge(B, df, left_index=True, right_index=True)
Output
b a1 a2
0 10 a 100
1 20 a 100
2 30 b 200
3 40 b 200
4 50 c 300
5 60 c 300
Here you go:
helper = A.indices.apply(pd.Series).stack().reset_index(level=1, drop=True)
A = A.reindex(helper.index).drop(columns=['indices'])
A['indices'] = helper
B = B.merge(A, left_index=True, right_on='indices').drop(columns=['indices']).reset_index(drop=True)
Result:
b a1 a2
0 10 a 100
1 20 a 100
2 30 b 200
3 40 b 200
4 50 c 300
5 60 c 300
You can also use melt instead of stack, but it's more complicated as you must drop columns you don't need:
import pandas as pd
import numpy as np
# Inputs
A = pd.DataFrame.from_dict({
"indices": [[0,1],[2,3],[4,5]],
"a1": ["a","b","c"],
"a2": [100,200,300]
})
B = pd.DataFrame.from_dict({
"b": [10,20,30,40,50,60]
})
AA = pd.concat([A.indices.apply(pd.Series), A], axis=1)
AA.drop(['indices'], axis=1, inplace=True)
print(AA)
0 1 a1 a2
0 0 1 a 100
1 2 3 b 200
2 4 5 c 300
AA = AA.melt(id_vars=['a1', 'a2'], value_name='val').drop(['variable'], axis=1)
print(AA)
a1 a2 val
0 a 100 0
1 b 200 2
2 c 300 4
3 a 100 1
4 b 200 3
5 c 300 5
pd.merge(AA.set_index(['val']), B, left_index=True, right_index=True)
Out[8]:
a1 a2 b
0 a 100 10
2 b 200 30
4 c 300 50
1 a 100 20
3 b 200 40
5 c 300 60
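If you want B's row order and column layout back, a small addition (not in the original answer) is to sort the melted frame on val and join from B instead:

out = B.join(AA.set_index('val').sort_index())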
This solution will handle indices of varying lengths.
A = pd.DataFrame.from_dict({
"indices": [[0,1],[2,3],[4,5]],
"a1": ["a","b","c"],
"a2": [100,200,300]
})
A = A.indices.apply(pd.Series) \
.merge(A, left_index = True, right_index = True) \
.drop(["indices"], axis = 1)\
.melt(id_vars = ['a1', 'a2'], value_name = "index")\
.drop("variable", axis = 1)\
.dropna()
A = A.set_index('index')
B = pd.DataFrame.from_dict({
"b": [10,20,30,40,50,60]
})
B.merge(A, left_index=True, right_index=True)
Final Output:
b a1 a2
0 10 a 100
1 20 a 100
2 30 b 200
3 40 b 200
4 50 c 300
5 60 c 300
I have a dataframe of intervals, each with an associated label. I need to group and aggregate rows that lie within a given distance of each other.
For example, rows whose start/end fall within 3 units of the start/end of other rows are grouped, and their label fields concatenated:
In [16]: df = pd.DataFrame([
...: [ 1, 3,'a'], [ 4,10,'b'],
...: [15,17,'c'], [18,20,'d'],
...: [27,30,'e'], [31,40,'f'], [41,42,'g'],
...: [50,54,'h']],
...: columns=['start', 'end', 'label'])
...:
In [17]: df
Out[17]:
start end label
0 1 3 a
1 4 10 b
2 15 17 c
3 18 20 d
4 27 30 e
5 31 40 f
6 41 42 g
7 50 54 h
Desired output:
In [18]: df_desired = group_by_interval(df)
In [19]: df_desired
Out[19]:
start end label
0 1 10 a b
1 15 20 c d
2 27 42 e f g
3 50 54 h
How can I execute this sort of grouping by interval with a dataframe?
I found one similar SO question here, but it's a little different since I don't know where to cut a priori.
You can create a grouper based on the condition and aggregate:
grouper = ((df['start'] - df['end'].shift()) > 3).cumsum()
df.groupby(grouper).agg({'start': 'first', 'end': 'last', 'label': lambda x: ' '.join(x)})
start end label
0 1 10 a b
1 15 20 c d
2 27 42 e f g
3 50 54 h
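For reference, a self-contained version of the same idea (a sketch; ' '.join can also be passed to agg directly instead of wrapping it in a lambda):

import pandas as pd

df = pd.DataFrame([[1, 3, 'a'], [4, 10, 'b'],
                   [15, 17, 'c'], [18, 20, 'd'],
                   [27, 30, 'e'], [31, 40, 'f'], [41, 42, 'g'],
                   [50, 54, 'h']], columns=['start', 'end', 'label'])

# start a new group whenever the gap to the previous interval's end exceeds 3
grouper = (df['start'] - df['end'].shift() > 3).cumsum()
out = df.groupby(grouper).agg({'start': 'first', 'end': 'last', 'label': ' '.join})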
I would like to join 2 dataframes so that the result is the intersection of the two datasets on the key column.
By doing this:
result = pd.merge(df1,df2,on='key', how='inner')
I will get what I need, but with the extra columns of df2. I only want df1's columns in the result (I do not want to delete them afterwards).
Any ideas?
Thanks,
Here is a generic solution that works for one or for multiple joining (key) columns:
Setup:
In [28]: a = pd.DataFrame({'a':[1,2,3,4], 'b':[10,20,30,40], 'c':list('abcd')})
In [29]: b = pd.DataFrame({'a':[3,4,5,6], 'b':[30,41,51,61], 'c':list('efgh')})
In [30]: a
Out[30]:
a b c
0 1 10 a
1 2 20 b
2 3 30 c
3 4 40 d
In [31]: b
Out[31]:
a b c
0 3 30 e
1 4 41 f
2 5 51 g
3 6 61 h
multiple joining keys:
In [32]: join_cols = ['a','b']
In [33]: a.merge(b[join_cols], on=join_cols)
Out[33]:
a b c
0 3 30 c
single joining key:
In [34]: join_cols = ['a']
In [35]: a.merge(b[join_cols], on=join_cols)
Out[35]:
a b c
0 3 30 c
1 4 40 d
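One caveat worth noting (my addition, not part of the original answer): if b contains duplicate rows in the joining columns, the merge will duplicate the matching rows of a. Deduplicating the key frame first avoids that:

a.merge(b[join_cols].drop_duplicates(), on=join_cols)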
I have a dataframe as listed below:
In []: dff = pd.DataFrame({'A': np.arange(8),
'B': list('aabbbbcc'),
'C':np.random.randint(100,size=8)})
which I have grouped on column B:
In []: grouped = dff.groupby('B')
Now I want to filter dff based on the differences of values within column C: if a row's value in C differs by more than a threshold from every other value in its group, remove that row.
If dff is:
A B C
0 0 a 18
1 1 a 25
2 2 b 56
3 3 b 62
4 4 b 46
5 5 b 56
6 6 c 74
7 7 c 3
Then, a threshold of 10 for C will produce a final table like:
A B C
0 0 a 18
1 1 a 25
2 2 b 56
3 3 b 62
4 4 b 46
5 5 b 56
Here group c (lowercase) is removed, since the difference between its two rows exceeds 10, while group b keeps all its rows because each value is within 10 of another.
I think I'd do the hard work in numpy:
In [11]: a = np.array([2, 3, 14, 15, 54])
In [12]: res = np.abs(a[:, np.newaxis] - a) < 10 # Note: perhaps you want <= 10.
In [13]: np.fill_diagonal(res, False)
In [14]: res.any(0)
Out[14]: array([ True, True, True, True, False], dtype=bool)
You could wrap this in a function:
In [15]: def has_close(a, n=10):
res = np.abs(a[:, np.newaxis] - a) < n
np.fill_diagonal(res, False)
return res.any(0)
In [16]: g = dff.groupby('B', as_index=False)
In [17]: g.apply(lambda x: x[has_close(x.C.values)])
Out[17]:
A B C
0 0 a 18
1 1 a 25
2 2 b 56
3 3 b 62
5 5 b 56
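Putting it all together as a runnable sketch (using <= rather than <, so a pair exactly 10 apart counts as close and all of group b survives, as in the question's expected output):

import numpy as np
import pandas as pd

def has_close(a, n=10):
    # True where an entry has at least one *other* entry within n
    res = np.abs(a[:, np.newaxis] - a) <= n
    np.fill_diagonal(res, False)
    return res.any(0)

dff = pd.DataFrame({'A': range(8),
                    'B': list('aabbbbcc'),
                    'C': [18, 25, 56, 62, 46, 56, 74, 3]})
out = dff.groupby('B', group_keys=False).apply(lambda x: x[has_close(x.C.values)])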