I have a dataframe of intervals, with a label associated with each. I need to group and aggregate rows that lie within a given distance of each other.
For example, rows whose start/end are within 3 units of the start/end of other rows should have their label fields concatenated:
In [16]: df = pd.DataFrame([
...: [ 1, 3,'a'], [ 4,10,'b'],
...: [15,17,'c'], [18,20,'d'],
...: [27,30,'e'], [31,40,'f'], [41,42,'g'],
...: [50,54,'h']],
...: columns=['start', 'end', 'label'])
...:
In [17]: df
Out[17]:
start end label
0 1 3 a
1 4 10 b
2 15 17 c
3 18 20 d
4 27 30 e
5 31 40 f
6 41 42 g
7 50 54 h
Desired output:
In [18]: df_desired = group_by_interval(df)
In [19]: df_desired
Out[19]:
start end label
0 1 10 a b
1 15 20 c d
2 27 42 e f g
3 50 54 h
How can I execute this sort of grouping by interval with a dataframe?
I have found one similar SO question here, but it's a little different since I don't know where to cut a priori.
You can create a grouper based on the condition and aggregate:
# A new group starts wherever the gap from the previous row's end exceeds 3
grouper = ((df['start'] - df['end'].shift()) > 3).cumsum()
df.groupby(grouper).agg({'start': 'first', 'end': 'last', 'label': ' '.join})
start end label
0 1 10 a b
1 15 20 c d
2 27 42 e f g
3 50 54 h
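If you want this wrapped up as the group_by_interval function the question calls, here is a minimal sketch (assuming, as in the example, that the rows are already sorted by start):

import pandas as pd

def group_by_interval(df, gap=3):
    # Rows stay in the same group while each start is within `gap` of the
    # previous row's end; cumsum over the break points numbers the groups.
    grouper = ((df['start'] - df['end'].shift()) > gap).cumsum()
    return (df.groupby(grouper)
              .agg({'start': 'first', 'end': 'last', 'label': ' '.join})
              .reset_index(drop=True))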
Related
There are other questions on the same topic and they helped but I have an extra twist.
I have a dataframe with multiple values in some (but not all) cells.
df = pd.DataFrame({'a':["10-30-410","20-40-500","25-50"], 'b':["5-8-9","4", "99"]})
           a      b
0  10-30-410  5-8-9
1  20-40-500      4
2      25-50     99
How can I split each cell by the dash "-" and create three new dataframes? Note that not all cells have multiple values, in which case the second and third dataframes get NA or blank (treating these as strings).
So I need df1 to be the first of those values:
    a   b
0  10   5
1  20   4
2  25  99
And df2 would be:
    a  b
0  30  8
1  40
2  50
And likewise for df3:
     a  b
0  410  9
1  500
2
I got df1 with this
df1 = df.replace(r'(\d+).*(\d+).*(\d+)+', r'\1', regex=True)
But df2 doesn't quite work. I get the second values but also 4 and 99, which should be blank.
df2 = df.replace(r'(\d+)-(\d+).*', r'\2', regex=True)
    a   b
0  30   8
1  40   4   <- should be blank
2  50  99   <- should be blank
Is this the right approach? I'm pretty good on regex but fuzzy with groups. Thank you.
Use str.split + concat + stack to get the data in a more usable format:
new_df = pd.concat(
    (df['a'].str.split('-', expand=True),
     df['b'].str.split('-', expand=True)),
    keys=('a', 'b'),
    axis=1
).stack(dropna=False).droplevel(0)
new_df:
a b
0 10 5
1 30 8
2 410 9
0 20 4
1 40 None
2 500 None
0 25 99
1 50 None
2 None None
Expandable option for n cols:
cols = ['a', 'b']
new_df = pd.concat(
    (df[c].str.split('-', expand=True) for c in cols),
    keys=cols,
    axis=1
).stack(dropna=False).droplevel(0)
Then groupby level 0 + reset_index to create a list of dataframes:
dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
dfs:
[ a b
0 10 5
1 20 4
2 25 99,
a b
0 30 8
1 40 None
2 50 None,
a b
0 410 9
1 500 None
2 None None]
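Since this data has exactly three split positions, unpacking the list recovers the df1, df2 and df3 from the question:

df1, df2, df3 = dfs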
Complete Working Example:
import pandas as pd

df = pd.DataFrame({
    'a': ["10-30-410", "20-40-500", "25-50"],
    'b': ["5-8-9", "4", "99"]
})

cols = ['a', 'b']
new_df = pd.concat(
    (df[c].str.split('-', expand=True) for c in cols),
    keys=cols,
    axis=1
).stack(dropna=False).droplevel(0)

dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
print(dfs)
You can also try with filter:
k = pd.concat((df[c].str.split('-', expand=True).add_prefix(c + '-')
               for c in df.columns), axis=1).fillna('')
df1 = k.filter(like='0')
df2 = k.filter(like='1')
df3 = k.filter(like='2')
NOTE: To strip the digit from the columns, use: k.filter(like='0').rename(columns=lambda x: x.split('-')[0])
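One caveat, as a sketch: filter(like='0') matches any column name containing a '0', so if a cell ever splits into ten or more parts (producing a column such as a-10), an anchored regex is safer because it only matches the position suffix:

df1 = k.filter(regex='-0$').rename(columns=lambda x: x.rsplit('-', 1)[0])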
I have a "sample.txt" like this.
idx A B C D cat
J 1 2 3 1 x
K 4 5 6 2 x
L 7 8 9 3 y
M 1 2 3 4 y
N 4 5 6 5 z
O 7 8 9 6 z
With this dataset, I want to get sums both across rows and across columns.
Summing the rows by category is not a big deal; I got that result like this:
### MY CODE ###
import pandas as pd

df = pd.read_csv('sample.txt', sep='\t', index_col='idx')
df.info()
df2 = df.groupby('cat').sum()
print(df2)
The result is like this.
A B C D
cat
x 5 7 9 3
y 8 10 12 7
z 11 13 15 11
But I don't know how to write the code to get a result like this
(simply adding columns A and B, and likewise columns C and D):
AB CD
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Could anybody help me write this code?
By the way, I would rather not do it like this
(it looks too clumsy, but if it is the only way, I'll accept it):
df2 = df['A'] + df['B']
df3 = df['C'] + df['D']
df = pd.DataFrame([df2, df3], index=['AB', 'CD']).transpose()
print(df)
When you pass a dictionary or callable to groupby, it gets applied along an axis. Here I specified axis=1, which is the columns.
d = dict(A='AB', B='AB', C='CD', D='CD')
df.groupby(d, axis=1).sum()
Use concat with sum:
df = df.set_index('idx')
df = pd.concat([df[['A', 'B']].sum(axis=1), df[['C', 'D']].sum(axis=1)],
               axis=1, keys=['AB', 'CD'])
print(df)
AB CD
idx
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
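If there were more column groups, the same concat idea extends naturally with a dict mapping new names to column lists (a sketch, assuming the same df):

groups = {'AB': ['A', 'B'], 'CD': ['C', 'D']}
df = pd.concat({name: df[cols].sum(axis=1) for name, cols in groups.items()}, axis=1)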
Does this do what you need? By using axis=1 with DataFrame.apply, you can use the data that you want in a row to construct a new column. Then you can drop the columns that you don't want anymore.
In [1]: import pandas as pd
In [5]: df = pd.DataFrame(columns=['A', 'B', 'C', 'D'], data=[[1, 2, 3, 4], [1, 2, 3, 4]])
In [6]: df
Out[6]:
A B C D
0 1 2 3 4
1 1 2 3 4
In [7]: df['CD'] = df.apply(lambda x: x['C'] + x['D'], axis=1)
In [8]: df
Out[8]:
A B C D CD
0 1 2 3 4 7
1 1 2 3 4 7
In [13]: df.drop(['C', 'D'], axis=1)
Out[13]:
A B CD
0 1 2 7
1 1 2 7
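A design note: for a plain sum, apply with axis=1 is considerably slower than vectorized column arithmetic; the equivalent direct construction is:

df['CD'] = df['C'] + df['D']
df = df.drop(['C', 'D'], axis=1)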
I have two dataframes
df1 = pd.DataFrame([[1,2],[3,4],[5,6],[7,8]], index = ['a','b','c', 'a'], columns = ['d','e'])
d e
a 1 2
b 3 4
c 5 6
a 7 8
df2 = pd.DataFrame([['a', 10],['b',20],['c',30],['f',40]])
0 1
0 a 10
1 b 20
2 c 30
3 f 40
I want each row of df1 to be multiplied by the factor corresponding to its index label in df2 (e.g. 20 for b), so my output should look like:
d e
a 10 20
b 60 80
c 150 180
a 70 80
Kindly provide a solution assuming df1 is hundreds of rows long. I could only think of looping through df1.index.
Use set_index and reindex to align df2 with df1, then mul:
In [1150]: df1.mul(df2.set_index(0).reindex(df1.index)[1], axis=0)
Out[1150]:
d e
a 10 20
b 60 80
c 150 180
a 70 80
Create a mapping and call df.apply:
In [1128]: mapping = dict(df2.values)
In [1129]: df1.apply(lambda x: x * mapping[x.name], axis=1)
Out[1129]:
d e
a 10 20
b 60 80
c 150 180
a 70 80
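If the row-wise apply ever becomes a bottleneck, a vectorized sketch of the same idea maps the factors onto df1's index and multiplies once (duplicate labels such as 'a' are handled positionally):

mapping = dict(df2.values)
factors = df1.index.map(mapping)  # [10, 20, 30, 10]
df1.mul(factors, axis=0)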
IIUC:
In [55]: df1 * pd.DataFrame(np.tile(df2[[1]],2), columns=df1.columns, index=df2[0])
Out[55]:
d e
a 10 20
a 70 80
b 60 80
c 150 180
Helper DF:
In [57]: pd.DataFrame(np.tile(df2[[1]],2), columns=df1.columns, index=df2[0])
Out[57]:
d e
0
a 10 10
b 20 20
c 30 30
This is straightforward. You just make sure they have a common axis, and then you can combine them. Put the lookup column into the index:
df2.set_index(0, inplace=True)
    1
0
a  10
b  20
c  30
Now you can put that column into df1 very easily:
df1['multiplying_factor'] = df2[1]
Now you just want to multiply two columns:
df1['final_value'] = df1.e * df1.multiplying_factor
Now df1 looks like:
d e multiplying_factor final_value
a 1 2 10 20
b 3 4 20 80
c 5 6 30 180
a 7 8 10 80
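Note that this builds final_value from column e only; to scale both d and e by the factor in one step, a sketch:

df1[['d', 'e']].mul(df1['multiplying_factor'], axis=0)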
I would like to join 2 dataframes so that the result is the intersection of the two datasets on the key column.
By doing this:
result = pd.merge(df1,df2,on='key', how='inner')
I will get what I need, but with the extra columns of df2. I only want df1's columns in the result (I do not want to delete them afterwards).
Any ideas?
Thanks,
Here is a generic solution which will work for one or for multiple key (joining) columns:
Setup:
In [28]: a = pd.DataFrame({'a':[1,2,3,4], 'b':[10,20,30,40], 'c':list('abcd')})
In [29]: b = pd.DataFrame({'a':[3,4,5,6], 'b':[30,41,51,61], 'c':list('efgh')})
In [30]: a
Out[30]:
a b c
0 1 10 a
1 2 20 b
2 3 30 c
3 4 40 d
In [31]: b
Out[31]:
a b c
0 3 30 e
1 4 41 f
2 5 51 g
3 6 61 h
multiple joining keys:
In [32]: join_cols = ['a','b']
In [33]: a.merge(b[join_cols], on=join_cols)
Out[33]:
a b c
0 3 30 c
single joining key:
In [34]: join_cols = ['a']
In [35]: a.merge(b[join_cols], on=join_cols)
Out[35]:
a b c
0 3 30 c
1 4 40 d
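For a single key you can also skip the merge entirely with isin, as a sketch; note the semantics differ slightly when b has duplicate key values (merge repeats matching rows of a, isin does not):

a[a['a'].isin(b['a'])]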
I have a dataframe as listed below:
In []: dff = pd.DataFrame({'A': np.arange(8),
   ...:                    'B': list('aabbbbcc'),
   ...:                    'C': np.random.randint(100, size=8)})
which I have grouped based on column B:
In []: grouped = dff.groupby('B')
Now, I want to filter dff based on the difference of values in column 'C' within each group: if a row's C value differs from every other C value in its group by more than a threshold, remove that row.
If dff is:
A B C
0 0 a 18
1 1 a 25
2 2 b 56
3 3 b 62
4 4 b 46
5 5 b 56
6 6 c 74
7 7 c 3
Then, a threshold of 10 for C will produce a final table like:
A B C
0 0 a 18
1 1 a 25
2 2 b 56
3 3 b 62
4 4 b 46
5 5 b 56
Here the group for category c (lowercase) is removed, as the difference between its two rows is greater than 10, but category b keeps all its rows intact, as each is within 10 of another.
I think I'd do the hard work in numpy:
In [10]: import numpy as np

In [11]: a = np.array([2, 3, 14, 15, 54])
In [12]: res = np.abs(a[:, np.newaxis] - a) < 10 # Note: perhaps you want <= 10.
In [13]: np.fill_diagonal(res, False)
In [14]: res.any(0)
Out[14]: array([ True, True, True, True, False], dtype=bool)
You could wrap this in a function:
In [15]: def has_close(a, n=10):
    ...:     res = np.abs(a[:, np.newaxis] - a) < n
    ...:     np.fill_diagonal(res, False)
    ...:     return res.any(0)
In [16]: g = dff.groupby('B', as_index=False)

In [17]: g.apply(lambda x: x[has_close(x.C.values)])
Out[17]:
A B C
0 0 a 18
1 1 a 25
2 2 b 56
3 3 b 62
5 5 b 56
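The same helper also plugs into transform, which yields a boolean mask aligned with the original index (a sketch reusing has_close; if you switch < to <= inside has_close, as the inline comment suggests, row 4 is kept and the result matches the desired table):

mask = dff.groupby('B')['C'].transform(lambda s: has_close(s.values)).astype(bool)
dff[mask]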