I am working with a pandas DataFrame like the one below:
id vals
0 1 11
1 1 5.5
2 1 -2
3 1 8
4 2 3
5 2 4
6 2 19
7 2 20
The above is just a small part of the df. The vals are grouped by id, and there will always be an equal number of vals per id; in the case above, there are 4 values each for id = 1 and id = 2.
What I am trying to achieve is to add the value at index 0 to the value at index 4, then the value at index 1 to the value at index 5, and so on.
The following is the expected df/series, say df2:
total
0 14
1 9.5
2 17
3 28
The real df has hundreds of ids, not just two as above.
groupby() could be used, but I don't see how to get the specific indices from each group.
Please let me know if anything is unclear.
Group by df.index modulo 4 (the number of rows per id) and take the sum of vals. This relies on the default 0..n-1 index:
In [805]: df.groupby(df.index % 4).vals.sum()
Out[805]:
0 14.0
1 9.5
2 17.0
3 28.0
Name: vals, dtype: float64
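If the group size is not known in advance, or the index is not the default one, a variation is to group on each row's position inside its id group instead of on the index modulo. The following is only a sketch under the assumption that the rows of every id appear in the intended order; it uses groupby().cumcount() for the position and rebuilds the sample frame from the question.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 2],
    "vals": [11, 5.5, -2, 8, 3, 4, 19, 20],
})

# Position of each row inside its own id group: 0, 1, 2, 3, 0, 1, 2, 3
pos = df.groupby("id").cumcount()

# Sum the rows that share the same position across all ids
total = df.groupby(pos)["vals"].sum().rename("total")
print(total)
```

This avoids hardcoding the 4, at the cost of one extra groupby.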
Since there are exactly 4 values per ID, we can simply reshape the underlying 1D array into a 2D array with one row per ID and sum along the appropriate axis (axis=0 in this case) -
pd.DataFrame({'total':df.vals.values.reshape(-1,4).sum(0)})
Sample run -
In [192]: df
Out[192]:
id vals
0 1 11.0
1 1 5.5
2 1 -2.0
3 1 8.0
4 2 3.0
5 2 4.0
6 2 19.0
7 2 20.0
In [193]: pd.DataFrame({'total':df.vals.values.reshape(-1,4).sum(0)})
Out[193]:
total
0 14.0
1 9.5
2 17.0
3 28.0
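The reshape approach can also derive the row length from the data rather than hardcoding 4. This is a minimal sketch, assuming the frame is sorted by id and every id has the same number of rows (the question states the latter); the names n_ids and per_id are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 2],
    "vals": [11, 5.5, -2, 8, 3, 4, 19, 20],
})

n_ids = df["id"].nunique()

# Rows per id; the reshape only makes sense if the division is exact
per_id, remainder = divmod(len(df), n_ids)
assert remainder == 0, "ids do not all have the same number of rows"

# One row per id, one column per position; summing over axis 0 adds the ids together
totals = df["vals"].to_numpy().reshape(n_ids, per_id).sum(axis=0)
df2 = pd.DataFrame({"total": totals})
print(df2)
```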
Consider a dataframe which contains several groups of integers:
d = pd.DataFrame({'label': ['a','a','a','a','b','b','b','b'], 'value': [1,2,3,2,7,1,8,9]})
d
label value
0 a 1
1 a 2
2 a 3
3 a 2
4 b 7
5 b 1
6 b 8
7 b 9
Within each of these groups, every integer has to be greater than or equal to the previous one. If that is not the case, it takes on the value of the previous integer. I do the replacement using
s.where(~(s < s.shift()), s.shift())
which works fine for a single series. I can even group the dataframe, and loop through each extracted series:
grouped = d.groupby('label')['value']
for _, s in grouped:
    print(s.where(~(s < s.shift()), s.shift()))
0 1.0
1 2.0
2 3.0
3 3.0
Name: value, dtype: float64
4 7.0
5 7.0
6 8.0
7 9.0
Name: value, dtype: float64
However, how do I now get these values back into my original dataframe?
Or, is there a better way to do this? I don't care for using .groupby and don't consider the for loop a pretty solution either...
IIUC, you can use cummax in the groupby like:
d['val_max'] = d.groupby('label')['value'].cummax()
print (d)
label value val_max
0 a 1 1
1 a 2 2
2 a 3 3
3 a 2 3
4 b 7 7
5 b 1 7
6 b 8 8
7 b 9 9
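The two approaches agree on the sample data but are not the same operation in general: the where/shift expression only looks one row back, while cummax carries the running maximum forward. A small sketch with a made-up series (the values 3, 1, 2 are not from the question) illustrates the difference:

```python
import pandas as pd

# Hypothetical values chosen so that two rows in a row fall below an earlier one
s = pd.Series([3, 1, 2])

# One-step comparison, as in the question: each value is compared to its direct predecessor only
one_step = s.where(~(s < s.shift()), s.shift())

# Running maximum, as in the answer: no value may fall below anything before it
running = s.cummax()

print(one_step.tolist())
print(running.tolist())
```

In the one-step version the last value stays at 2, because it is compared with the original 1 rather than with the replaced 3. If "previous" means the already corrected previous value, cummax matches that reading.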
I am facing the following problem:
I have groups (by ID), and I need to apply the following logic to each of them: locations within a group that are within 3 meters of each other have to be merged into a new "distance group" (the code that creates these groups is shown below). What I want is the number of detections within each distance group, i.e. the length of the group.
This all worked for a single group, but after applying it to the ID groups it gives me an error.
The code is as follows:
def group_nearby_peaks(df, col, cutoff=-3.00):
    """
    This function groups nearby peaks based on location.
    When peaks are within 3 meters from each other they will be added together.
    """
    min_location_between_groups = cutoff
    df = df.sort_values('Location')
    return (
        df.assign(
            location_diff=lambda d: d['Location'].diff(-1).fillna(-9999),
            NOD=lambda d: d[col]
            .groupby(d["location_diff"].shift().lt(min_location_between_groups).cumsum())
            .transform(len)
        )
    )

def find_relative_difference(df, peak_col, difference_col):
    def relative_differences_per_ID(ID_df):
        return (
            spoortak_df.pipe(find_difference_peaks)
            .loc[lambda d: d[peak_col]]
            .pipe(group_nearby_peaks, difference_col)
        )
    return df.groupby('ID').apply(relative_differences_per_ID)
The error I get is the following:
ValueError: No objects to concatenate
With the following example dataframe, I expect this result.
ID Location
0 1 12.0
1 1 14.0
2 1 15.0
3 1 17.5
4 1 25.0
5 1 30.0
6 1 31.0
7 1 34.0
8 1 36.0
9 1 37.0
10 2 8.0
11 2 14.0
12 2 15.0
13 2 17.5
14 2 50.0
15 2 55.0
16 2 58.0
17 2 59.0
18 2 60.0
19 2 70.0
Expected result:
ID Number of detections
0 1 4
1 1 1
2 1 5
3 2 1
4 2 3
5 2 1
6 2 5
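Create a group key s for Locations within 3 meters of each other: take the difference between consecutive Locations within each ID, flag the rows where the gap is greater than 3 meters, and take the cumulative sum of those flags. Each such row starts a new key, while rows within 3 meters of their predecessor repeat the previous key. Finally, group by ID and s and count the rows.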
s = df.groupby('ID').Location.diff().fillna(0).abs().gt(3).cumsum()
df.groupby(['ID',s]).ID.count().reset_index(name='Number of detections').drop(columns='Location')
Out[190]:
ID Number of detections
0 1 4
1 1 1
2 1 5
3 2 1
4 2 3
5 2 1
6 2 4
7 2 1
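For reference, here is a self-contained sketch of the same idea on the sample data from the question. It is a variant rather than the exact code above: it uses size() instead of count(), names the key column "group" (an arbitrary choice), and assumes that sorting by ID and Location is acceptable.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1] * 10 + [2] * 10,
    "Location": [12.0, 14.0, 15.0, 17.5, 25.0, 30.0, 31.0, 34.0, 36.0, 37.0,
                 8.0, 14.0, 15.0, 17.5, 50.0, 55.0, 58.0, 59.0, 60.0, 70.0],
})

# Consecutive differences only make sense on sorted locations
df = df.sort_values(["ID", "Location"])

# True where the gap to the previous location (within the same ID) exceeds 3 meters;
# the first row of each ID has a NaN difference, which compares as False
new_group = df.groupby("ID")["Location"].diff().gt(3)

# The running count of gaps labels the distance groups
group_id = new_group.cumsum().rename("group")

result = (
    df.groupby(["ID", group_id])
    .size()
    .reset_index(name="Number of detections")
    .drop(columns="group")
)
print(result)
```

Note that the location 70.0 in ID 2 is 10 meters away from 60.0, so it ends up in a group of its own; that is why the last two counts are 4 and 1 rather than the single 5 listed in the question's expected result.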
I have a pandas DataFrame of the form:
df
ID_col time_in_hours data_col
1 62.5 4
1 40 3
1 20 3
2 30 1
2 20 5
3 50 6
What I want to do is find the rate of change of data_col, using the time_in_hours column. Specifically,
rate_of_change = (data_col[i+1] - data_col[i]) / abs(time_in_hours[i+1] - time_in_hours[i])
where i is a given row, and the rate_of_change is calculated separately for each ID.
Effectively, I want a new DataFrame of the form:
new_df
ID_col time_in_hours data_col rate_of_change
1 62.5 4 NaN
1 40 3 -0.044
1 20 3 0
2 30 1 NaN
2 20 5 0.4
3 50 6 NaN
How do I go about this?
You can use groupby:
s = df.groupby('ID_col').apply(lambda dft: dft['data_col'].diff() / dft['time_in_hours'].diff().abs())
s.index = s.index.droplevel()
s
returns
0 NaN
1 -0.044444
2 0.000000
3 NaN
4 0.400000
5 NaN
dtype: float64
You can actually get around the groupby + apply given how your DataFrame is sorted. In this case, you can just check if the ID_col is the same as the shifted row.
So calculate the rate of change for everything, and then only assign the values back if they are within a group.
import numpy as np
mask = df.ID_col == df.ID_col.shift(1)
roc = (df.data_col - df.data_col.shift(1))/np.abs(df.time_in_hours - df.time_in_hours.shift(1))
df.loc[mask, 'rate_of_change'] = roc[mask]
Output:
ID_col time_in_hours data_col rate_of_change
0 1 62.5 4 NaN
1 1 40.0 3 -0.044444
2 1 20.0 3 0.000000
3 2 30.0 1 NaN
4 2 20.0 5 0.400000
5 3 50.0 6 NaN
You can use Series.diff inside a groupby:
df.groupby('ID_col').apply(
    lambda x: x['data_col'].diff() / x['time_in_hours'].diff().abs())
ID_col
1 0 NaN
1 -0.044444
2 0.000000
2 3 NaN
4 0.400000
3 5 NaN
dtype: float64
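The same per-ID quotient can also be written without apply, by calling diff() on the grouped columns directly. This is only a sketch built on the sample data from the question; the result keeps the original row index, so it can be assigned straight back as a column.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID_col": [1, 1, 1, 2, 2, 3],
    "time_in_hours": [62.5, 40, 20, 30, 20, 50],
    "data_col": [4, 3, 3, 1, 5, 6],
})

g = df.groupby("ID_col")

# diff() on a grouped column restarts at every ID, so the first row of each ID is NaN
df["rate_of_change"] = g["data_col"].diff() / g["time_in_hours"].diff().abs()
print(df)
```

Unlike the shift-based mask, this does not depend on the rows of one ID being adjacent in the frame.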
I want to insert a pandas dataframe into another pandas dataframe at certain indices.
Let's say we have this dataframe:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
I can then change values at certain indices as follows:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
original_df.iloc[[0,2],[0,1]] = 2
0 1 2
0 2 2 3
1 4 5 6
2 2 2 9
However, if I use the same technique to insert another dataframe, it doesn't work:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df_to_insert = pd.DataFrame([[10,11],[12,13]])
original_df.iloc[[0,2],[0,1]] = df_to_insert
0 1 2
0 10.0 11.0 3.0
1 4.0 5.0 6.0
2 NaN NaN 9.0
I am looking for a way to get the following result:
0 1 2
0 10 11 3
1 4 5 6
2 12 13 9
It seems to me that with the syntax I am using, the values from df_to_insert are looked up by the index labels of their target locations rather than by position. Is there a way for me to avoid this?
When you insert, make sure to convert the df to its values. pandas is index sensitive, which means it will always try to align on the index and column labels during assignment and calculation.
original_df.iloc[[0,2],[0,1]] = df_to_insert.values
original_df
Out[651]:
0 1 2
0 10 11 3
1 4 5 6
2 12 13 9
It does work with an array rather than a df:
original_df.iloc[[0,2],[0,1]] = np.array([[10,11],[12,13]])
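To make the alignment behaviour concrete, the sketch below shows two ways to get the desired result on the data from the question: dropping the labels with to_numpy() (equivalent in effect to .values above), or relabelling the inserted frame so that its index matches the target rows and assigning with .loc. The variable names positional, by_label and relabelled are only illustrative.

```python
import pandas as pd

original_df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df_to_insert = pd.DataFrame([[10, 11], [12, 13]])

# Option 1: strip the labels so the assignment is purely positional
positional = original_df.copy()
positional.iloc[[0, 2], [0, 1]] = df_to_insert.to_numpy()

# Option 2: keep the labels, but make them match the target rows (0 and 2);
# the column labels 0 and 1 already match
relabelled = df_to_insert.set_axis([0, 2], axis=0)
by_label = original_df.copy()
by_label.loc[[0, 2], [0, 1]] = relabelled

print(positional)
print(by_label)
```

Option 1 is the shorter fix; option 2 may be preferable when the inserted frame's labels carry meaning and a mismatch should be visible rather than silently ignored.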
I'm trying to create a total column that sums the numbers from another column based on a third column. I can do this by using .groupby(), but that creates a truncated column, whereas I want a column that is the same length.
My code:
df = pd.DataFrame({'a':[1,2,2,3,3,3], 'b':[1,2,3,4,5,6]})
df['total'] = df.groupby(['a']).sum().reset_index()['b']
My result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 15.0
3 3 4 NaN
4 3 5 NaN
5 3 6 NaN
My desired result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 5.0
3 3 4 15.0
4 3 5 15.0
5 3 6 15.0
...where every row with the same 'a' value gets the same total.
Returning the sum from a groupby operation in pandas produces one row per group, so the result is only as long as the number of unique values in the grouping column. Use transform to produce a column of the same length ("like-indexed") as the original data frame without performing any merges.
df['total'] = df.groupby('a')['b'].transform('sum')
>>> df
a b total
0 1 1 1
1 2 2 5
2 2 3 5
3 3 4 15
4 3 5 15
5 3 6 15
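The difference in shape between the aggregated and the transformed result can be seen side by side in the following sketch, which reuses the frame from the question. The map-based line is an alternative added here for comparison and is not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 3, 3, 3], 'b': [1, 2, 3, 4, 5, 6]})

# Aggregation: one value per unique 'a', indexed by the group keys 1, 2, 3
per_group = df.groupby('a')['b'].sum()

# Transformation: one value per original row, indexed like df
per_row = df.groupby('a')['b'].transform('sum')
df['total'] = per_row

# Alternative: look up each row's group total by its 'a' value
mapped = df['a'].map(per_group)

print(len(per_group), len(per_row))
print(df)
```

The question's attempt failed because reset_index() renumbered the three group totals as rows 0, 1 and 2, so the assignment aligned them with the first three rows of df and left the rest as NaN.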