I have a DataFrame that looks like the below; call this "values":
I would like to create another, call it "sums", in which each cell contains the sum of the corresponding row of "values" from that cell's column to the end. It would look like the below:
I would like to create this without looping through the entire DataFrame, data point by data point. I have been trying with .apply() as seen below, but I keep getting the error: unsupported operand type(s) for +: 'int' and 'datetime.date'
In [26]: values = pandas.DataFrame({0:[96,54,27,28],
1:[55,75,32,37],2:[54,99,36,46],3:[35,77,0,10],4:[62,25,0,25],
5:[0,66,0,89],6:[0,66,0,89],7:[0,0,0,0],8:[0,0,0,0]})
In [28]: sums = values.copy()
In [29]: sums.iloc[:,:] = ''
In [31]: for column in sums:
...: sums[column].apply(sum(values.loc[:,column:]))
...:
Traceback (most recent call last):
File "<ipython-input-31-030442e5005e>", line 2, in <module>
sums[column].apply(sum(values.loc[:,column:]))
File "C:\WinPython64bit\python-3.5.2.amd64\lib\site-packages\pandas\core\series.py", line 2220, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas\src\inference.pyx", line 1088, in pandas.lib.map_infer (pandas\lib.c:63043)
TypeError: 'numpy.int64' object is not callable
In [32]: for column in sums:
...: sums[column] = sum(values.loc[:,column:])
In [33]: sums
Out[33]:
0 1 2 3 4 5 6 7 8
0 36 36 35 33 30 26 21 15 8
1 36 36 35 33 30 26 21 15 8
2 36 36 35 33 30 26 21 15 8
3 36 36 35 33 30 26 21 15 8
Is there a way to do this without looping each point individually?
Without looping, you can reverse your dataframe, cumsum per row, and then re-reverse it:
>>> values.iloc[:,::-1].cumsum(axis=1).iloc[:,::-1]
0 1 2 3 4 5 6 7 8
0 302 206 151 97 62 0 0 0 0
1 462 408 333 234 157 132 66 0 0
2 95 68 36 0 0 0 0 0 0
3 324 296 259 213 203 178 89 0 0
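If you prefer to avoid the double reversal, an equivalent sketch (assuming the same values frame) subtracts the exclusive cumulative sum of each row from its total:
# row total minus everything strictly to the left of each column
row_totals = values.sum(axis=1)
sums = values.cumsum(axis=1).sub(values).rsub(row_totals, axis=0)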
You can use the .cumsum() method to get the cumulative sum. The problem is that it operates from left to right, where you need it from right to left.
So we will reverse your data frame, use cumsum(), then set the columns back into the proper order.
import pandas as pd
values = pd.DataFrame({0:[96,54,27,28],
1:[55,75,32,37],2:[54,99,36,46],3:[35,77,0,10],4:[62,25,0,25],
5:[0,66,0,89],6:[0,66,0,89],7:[0,0,0,0],8:[0,0,0,0]})
values[values.columns[::-1]].cumsum(axis=1).reindex(values.columns, axis=1)  # reindex_axis was removed in pandas 1.0; reindex does the same here
# returns:
0 1 2 3 4 5 6 7 8
0 302 206 151 97 62 0 0 0 0
1 462 408 333 234 157 132 66 0 0
2 95 68 36 0 0 0 0 0 0
3 324 296 259 213 203 178 89 0 0
I have a data frame with the following shape:
0 1
0 OTT:81 DVBC:398
1 OTT:81 DVBC:474
2 OTT:81 DVBC:474
3 OTT:81 DVBC:454
4 OTT:81 DVBC:443
5 OTT:1 DVBC:254
6 DVBC:151 None
7 OTT:1 DVBC:243
8 OTT:1 DVBC:254
9 DVBC:227 None
I want for column 1 to be same as column 0 if column 1 contains "DVBC".
Then split the values on ":" and fill the empty ones with 0.
The end data frame should look like this
OTT DVBC
0 81 398
1 81 474
2 81 474
3 81 454
4 81 443
5 1 254
6 0 151
7 1 243
8 1 254
9 0 227
I try to do this starting with:
if df[0].str.contains("DVBC") is True:
    df[1] = df[0]
But after this the data frame looks the same; I'm not sure why.
My idea after is to pass the values to the respective columns then split by ":" and rename the columns.
How can I implement this?
Universal solution for splitting the values on : and pivoting - first create a Series with DataFrame.stack, split it with Series.str.split (or Series.str.rsplit), and last reshape with DataFrame.pivot:
df = df.stack().str.split(':', expand=True).reset_index()
df = df.pivot(index='level_0', columns=0, values=1).fillna(0).rename_axis(index=None, columns=None)  # pivot arguments are keyword-only in pandas 2.0+
print (df)
DVBC OTT
0 398 81
1 474 81
2 474 81
3 454 81
4 443 81
5 254 1
6 151 0
7 243 1
8 254 1
9 227 0
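Note that pivot sorts the new columns alphabetically, so DVBC comes before OTT; if you want the column order from the question, you can reindex afterwards:
df = df.reindex(columns=['OTT', 'DVBC'])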
Here is one way that should work with any number of columns:
(df
.apply(lambda c: c.str.extract(r':(\d+)', expand=False))
.ffill(axis=1)
.mask(df.replace('None', pd.NA).isnull().shift(-1, axis=1, fill_value=False), 0)
)
output:
OTT DVBC
0 81 398
1 81 474
2 81 474
3 81 454
4 81 443
5 1 254
6 0 151
7 1 243
8 1 254
9 0 227
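As a rough alternative sketch (assuming the raw frame holds the literal string 'None', as displayed above), you could also join the two columns per row and extract named groups in one pass:
combined = df[0].astype(str) + ' ' + df[1].astype(str)
out = (combined
       .str.extract(r'(?:OTT:(?P<OTT>\d+))?\s*(?:DVBC:(?P<DVBC>\d+))?')
       .fillna(0)
       .astype(int))  # missing prefixes become 0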
I've got a simple pandas Series, like this one:
st
0 74
1 91
2 105
3 121
4 136
5 157
The data in this Series is the result of a cumulative sum, so I was wondering if a pandas function could "undo" the process and return a new Series like:
st result
0 74 74
1 91 17
2 105 14
3 121 16
4 136 15
5 157 21
result[0] = st[0], but after that result[i] = st[i] - st[i-1].
It seemed very simple (and maybe I missed a post), but I didn't find anything...
Use Series.diff, replace the first missing value with the original via Series.fillna, and then cast to integers if necessary:
df['res'] = df['st'].diff().fillna(df['st']).astype(int)
print (df)
st res
0 74 74
1 91 17
2 105 14
3 121 16
4 136 15
5 157 21
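Since diff is the inverse of cumsum here, a quick sanity check is that cumulating the result reproduces the original column:
print(df['res'].cumsum().equals(df['st']))  # True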
I want to find the indexes where a new range of 100 values begins.
In the case below, since the first row is 0, the next index would be the next number above 100 (7).
At index 7, the value is 104, so the next index would be next number above 204 (15).
At index 15, the value is 205, so the next index would be the next number above 305 (n/a).
Therefore the output would be [0, 7, 15].
0 0
1 0
2 4
3 10
4 30
5 65
6 92
7 104
8 108
9 109
10 123
11 132
12 153
13 160
14 190
15 205
16 207
17 210
18 240
19 254
20 254
21 254
22 263
23 273
24 280
25 293
You can use zfill to create three-digit strings and take the first character as the group:
# convert number to string
df['grp'] = df['b'].astype(str).str.zfill(3).str[0]
print(df)
a b grp
0 0 0 0
1 1 0 0
2 2 4 0
3 3 10 0
4 4 30 0
5 5 65 0
6 6 92 0
7 7 104 1
8 8 108 1
9 9 109 1
10 10 123 1
11 11 132 1
12 12 153 1
13 13 160 1
14 14 190 1
15 15 205 2
# get first row from each group
ix = df.groupby('grp').first()['a'].to_numpy()
print(ix)
array([ 0, 7, 15])
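Note the zfill trick assumes every value is below 1000 (otherwise the first character is no longer the hundreds digit). Integer division produces the same fixed-width groups without the string round trip:
df['grp'] = df['b'] // 100  # 0-99 -> 0, 100-199 -> 1, 200-299 -> 2, ...
ix = df.groupby('grp').first()['a'].to_numpy()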
For sorted data, we can use searchsorted -
In [98]: df.head()
Out[98]:
A
0 0
1 0
2 4
3 10
4 30
In [143]: df.A.searchsorted(np.arange(0,df.A.iloc[-1],100))
Out[143]: array([ 0, 7, 15])
If you need these positions in terms of the dataframe/series index, look them up in df.index -
In [101]: df.index[_]
Out[101]: Int64Index([0, 7, 15], dtype='int64')
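Both answers use fixed thresholds at 0, 100, 200, ..., which happens to match the example. If you need the literal rule from the question (each new threshold is the value at the previous hit plus 100), a minimal loop sketch would be:
hits = [0]                       # the first row always starts a range
threshold = df.A.iloc[0] + 100
for i, v in enumerate(df.A):
    if v > threshold:            # first value strictly above the threshold
        hits.append(i)
        threshold = v + 100      # next range starts 100 above this value
print(hits)  # [0, 7, 15]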
I have data from an excel sheet I have summarized in a pandas crosstab. I want to categorize the data further by summing related rows.
Here is my crosstab:
class_of_orbit Elliptical GEO LEO MEO All
users
Civil 0 0 36 0 36
Civil/Government 0 0 2 0 2
Commercial 3 99 412 0 514
Government 9 14 38 0 61
Government/Civil 0 0 10 0 10
Government/Commercial 0 2 81 0 83
Government/Military 0 0 1 0 1
Military 9 67 66 0 142
Military/Civil 0 0 2 0 2
Military/Commercial 0 0 0 32 32
All 21 182 648 32 883
I only want 4 groups: civil, government, commercial, and military. If "Government" is in the name, I want to sum all the rows that contain it. If "Military" is in the name, I want to sum those rows into a military row....
What is the best way to do this?
pd.crosstab
Do it from the start
pd.crosstab(df.users.str.split('/').str[0], df.class_of_orbit)
groupby
On top of what you already have. If you pass a callable to groupby it will apply that to the index and use the result to group by.
xtab.groupby(lambda x: x.split('/')[0]).sum()
Elliptical GEO LEO MEO All
All 21 182 648 32 883
Civil 0 0 38 0 38
Commercial 3 99 412 0 514
Government 9 16 130 0 155
Military 9 67 68 32 176
Grouping by the first part of each name yields
df.groupby(df.class_of_orbit.str.split('/').str.get(0)).sum()
Elliptical GEO LEO MEO All
class_of_orbit
All 21 182 648 32 883
Civil 0 0 38 0 38
Commercial 3 99 412 0 514
Government 9 16 130 0 155
Military 9 67 68 32 176
I love Rafael's and piRSquared's answers, but if you want to sum all the rows that contain an instance of the group anywhere in the name, not only where the group is the first part of the name, you could slightly alter piRSquared's answer.
You could define a helper function to check whether a name has a second part, then create a second data frame with the sums of those rows that do have second parts to the name. Then sum this element-wise with the result shown by Rafael and piRSquared. I left out the "All" observation, but it could be calculated easily from the resulting data frame.
Hope this is okay, I'm new around here.
def second_parts_sum(x):
    if len(x.split('/')) > 1:
        return x.split('/')[1]
    else:
        return 'to_be_dropped'
first_parts = xtab.groupby(lambda x: x.split('/')[0]).sum()
second_parts = xtab.groupby(lambda x: second_parts_sum(x)).sum()
first_parts = first_parts[first_parts.index != 'All']
second_parts = second_parts[second_parts.index != 'to_be_dropped']
first_parts + second_parts
Elliptical GEO LEO MEO All
Civil 0 0 50 0 50
Commercial 3 101 493 32 629
Government 9 16 132 0 157
Military 9 67 69 32 177
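An equivalent sketch without the helper function (assuming the crosstab's index is named users, as shown above) splits each name and uses explode so every row is counted in all of its groups:
t = xtab.drop('All').reset_index()
t['users'] = t['users'].str.split('/')
t.explode('users').groupby('users').sum()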
I have some data that looks like the df shown below.
I am trying first to calculate the mean angle for each group using the function mean_angle. The calculated mean angle is then used for another per-group calculation via the function fun.
import pandas as pd
import numpy as np
# generate sample data
a = np.array([1,2,3,4]).repeat(4)
x1 = 90 + np.random.randint(-15, 15, size=a.size//2 - 2 )
x2 = 270 + np.random.randint(-50, 50, size=a.size//2 + 2 )
b = np.concatenate((x1, x2))
np.random.shuffle(b)
df = pd.DataFrame({'a':a, 'b':b})
The returned dataframe is printed below.
a b
0 1 295
1 1 78
2 1 280
3 1 94
4 2 308
5 2 227
6 2 96
7 2 299
8 3 248
9 3 288
10 3 81
11 3 78
12 4 103
13 4 265
14 4 309
15 4 229
My functions are mean_angle and fun
def mean_angle(deg):
    deg = np.deg2rad(deg)
    deg = deg[~np.isnan(deg)]
    S = np.sum(np.sin(deg))
    C = np.sum(np.cos(deg))
    mu = np.arctan2(S, C)
    mu = np.rad2deg(mu)
    if mu < 0:
        mu = 360 + mu
    return mu

def fun(x, mu):
    return np.where(abs(mu - x) < 45, x, np.where(x + 180 < 360, x + 180, x - 180))
What I have tried:
mu = df.groupby(['a'])['b'].apply(mean_angle)
df2 = df.groupby(['a'])['b'].apply(fun, args=(mu,))  # this function should be element-wise
I know it is totally wrong but I could not come up with a better way.
The desired output is something like this, where mu is the mean angle for the group:
a b c
0 1 295 np.where(abs(mu - 295) < 45, 295, np.where(295 +180<360, 295 +180, 295 -180))
1 1 78 np.where(abs(mu - 78) < 45, 78, np.where(78 +180<360, 78 +180, 78 -180))
2 1 280 np.where(abs(mu - 280) < 45, 280, np.where(280 +180<360, 280 +180, 280 -180))
3 1 94 ...
4 2 308 ...
5 2 227 .
6 2 96 .
7 2 299 .
8 3 248 .
9 3 288 .
10 3 81 .
11 3 78 .
12 4 103 .
13 4 265 .
14 4 309 .
15 4 229 .
Any help is appreciated
You don't need your second function; just pass the necessary columns to np.where(). So, creating your dataframe in the same manner and not modifying your mean_angle function, we have the following sample dataframe:
a b
0 1 228
1 1 291
2 1 84
3 1 226
4 2 266
5 2 311
6 2 82
7 2 274
8 3 79
9 3 250
10 3 222
11 3 88
12 4 80
13 4 291
14 4 100
15 4 293
Then create your c column (containing your mu values) using groupby() and transform(), and finally apply your np.where() logic:
df['c'] = df.groupby(['a'])['b'].transform(mean_angle)
df['c'] = np.where(abs(df['c'] - df['b']) < 45, df['b'], np.where(df['b']+180<360, df['b']+180, df['b']-180))
Yields:
a b c
0 1 228 228
1 1 291 111
2 1 84 264
3 1 226 226
4 2 266 266
5 2 311 311
6 2 82 262
7 2 274 274
8 3 79 259
9 3 250 70
10 3 222 42
11 3 88 268
12 4 80 260
13 4 291 111
14 4 100 280
15 4 293 113
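If you would rather not call mean_angle once per group at all, a fully vectorized sketch (same df, same 45-degree rule) builds the circular mean from transformed sine/cosine sums:
rad = np.deg2rad(df['b'])
g = df.assign(s=np.sin(rad), c=np.cos(rad)).groupby('a')
mu = np.rad2deg(np.arctan2(g['s'].transform('sum'), g['c'].transform('sum'))) % 360  # % 360 replaces the if mu < 0 branch
df['c'] = np.where((mu - df['b']).abs() < 45, df['b'],
                   np.where(df['b'] + 180 < 360, df['b'] + 180, df['b'] - 180))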