I have a pandas dataframe pmov with columns SDRFT and DRFT containing float values. Some of the DRFT values are 0.0. When that happens, I want to replace the DRFT value with the SDRFT value. For testing purposes, I've stored the rows where DRFT = 0.0 in dataframe df.
I've tried defining the function:
def SDRFT_is_DRFT(row):
    if row['SDRFT'] == row['DRFT']:
        pass
    elif row['SDRFT'] == 0:
        row['SDRFT'] = row['DRFT']
    elif ['DRFT'] == 0:
        row['DRFT'] = row['SDRFT']
    return row[['SDRFT','DRFT']]
and applying it with: df.apply(SDRFT_is_DRFT, axis=1)
which returns:
In []: df.apply(SDRFT_is_DRFT, axis=1)
Out[]:
SDRFT DRFT
118 29.500000 0.0
144 0.000000 0.0
212 29.166667 0.0
250 21.000000 0.0
308 21.500000 0.0
317 24.500000 0.0
327 11.000000 0.0
334 31.000000 0.0
347 29.500000 0.0
348 35.000000 0.0
Which isn't the outcome I want.
I also tried the function:
def drft_repl(row):
    if row['DRFT'] == 0:
        row['DRFT'] = row['SDRFT']
which appeared to work for df.DRFT = df.apply(drft_repl, axis=1)
but pmov.DRFT = pmov.apply(drft_repl, axis=1) resulted in 100% replacement of DRFT values with SDRFT values, except where the DRFT value was nan.
How can I conditionally replace cell values in one column with values in another column of the same row?
try this:
df.loc[df.DRFT == 0, 'DRFT'] = df.SDRFT
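A minimal runnable sketch of that approach, on a hypothetical three-row frame using the question's column names:

```python
import pandas as pd

# Hypothetical sample mirroring the question's columns.
df = pd.DataFrame({'SDRFT': [29.5, 0.0, 21.0],
                   'DRFT':  [0.0,  0.0, 21.0]})

# Where DRFT is 0, copy that row's SDRFT value; other rows are untouched.
df.loc[df.DRFT == 0, 'DRFT'] = df.SDRFT

print(df)
```

A row where both columns are 0 simply stays 0, since the copied SDRFT value is itself 0.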
I think you can use mask. First the SDRFT column is replaced with values of DRFT where the condition is True, and then the DRFT column is replaced with values of SDRFT:
pmov.SDRFT = pmov.SDRFT.mask(pmov.SDRFT == 0, pmov.DRFT)
pmov.DRFT = pmov.DRFT.mask(pmov.DRFT == 0, pmov.SDRFT)
print(pmov)
SDRFT DRFT
118 29.500000 29.500000
144 0.000000 0.000000
212 29.166667 29.166667
250 21.000000 21.000000
308 21.500000 21.500000
317 24.500000 24.500000
327 11.000000 11.000000
334 31.000000 31.000000
347 29.500000 29.500000
348 35.000000 35.000000
Another solution with loc:
pmov.loc[pmov.SDRFT == 0, 'SDRFT'] = pmov.DRFT
pmov.loc[pmov.DRFT == 0, 'DRFT'] = pmov.SDRFT
print(pmov)
SDRFT DRFT
118 29.500000 29.500000
144 0.000000 0.000000
212 29.166667 29.166667
250 21.000000 21.000000
308 21.500000 21.500000
317 24.500000 24.500000
327 11.000000 11.000000
334 31.000000 31.000000
347 29.500000 29.500000
348 35.000000 35.000000
For better testing, the DataFrame was changed:
print(pmov)
SDRFT DRFT
118 29.5 29.50
144 0.0 5.98
212 0.0 7.30
250 21.0 0.00
308 21.5 0.00
317 0.0 0.00
327 11.0 0.00
334 31.0 0.00
347 29.5 0.00
348 35.0 35.00
pmov.SDRFT = pmov.SDRFT.mask(pmov.SDRFT == 0, pmov.DRFT)
pmov.DRFT = pmov.DRFT.mask(pmov.DRFT == 0, pmov.SDRFT)
print(pmov)
SDRFT DRFT
118 29.50 29.50
144 5.98 5.98
212 7.30 7.30
250 21.00 21.00
308 21.50 21.50
317 0.00 0.00
327 11.00 11.00
334 31.00 31.00
347 29.50 29.50
348 35.00 35.00
pmov.loc[pmov.DRFT == 0, 'DRFT'] = pmov.SDRFT
pmov.loc[pmov.SDRFT == 0, 'SDRFT'] = pmov.DRFT
print(pmov)
SDRFT DRFT
118 29.50 29.50
144 5.98 5.98
212 7.30 7.30
250 21.00 21.00
308 21.50 21.50
317 0.00 0.00
327 11.00 11.00
334 31.00 31.00
347 29.50 29.50
348 35.00 35.00
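A compact, runnable illustration of the order dependence on a hypothetical three-row frame: SDRFT is filled first, so a row where both columns are 0 keeps 0 in both.

```python
import pandas as pd

# Hypothetical frame: each column has a zero the other side can (or cannot) fill.
pmov = pd.DataFrame({'SDRFT': [0.0, 21.0, 0.0],
                     'DRFT':  [5.98, 0.0, 0.0]})

# Series.mask(cond, other) replaces values where cond is True.
pmov.SDRFT = pmov.SDRFT.mask(pmov.SDRFT == 0, pmov.DRFT)
pmov.DRFT = pmov.DRFT.mask(pmov.DRFT == 0, pmov.SDRFT)

print(pmov)
# The last row stays 0.0 in both columns: neither side had a nonzero value to copy.
```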
I have a dataframe called new_df, which basically collates the following data:
Pass Profit Trades MA2
0 69 10526.0 14 119
1 47 10420.0 13 97
2 68 10406.0 14 118
3 50 10376.0 13 100
4 285 10352.0 16 335
... ... ... ... ...
21643 117 -10376.0 14 167
21644 116 -10376.0 14 166
21645 115 -10376.0 14 165
21646 114 -10376.0 14 164
21647 113 -10376.0 14 163
[21648 rows x 4 columns]
and then I can see that 69 shows up 48 times in the Pass column, etc.
#counts the number of times each pass number is listed in pass column
new_df['Pass'].value_counts()
69 48
219 48
184 48
185 48
186 48
..
59 48
16 48
20 48
70 48
113 48
Name: Pass, Length: 451, dtype: int64
Right now I am trying to create a new df called sorted_df. The columns I can't get working are below:
Total Pass - Counts the number of times a unique number in Pass column also has the profit column above 110000
Pass % - Total Pass / Total Weeks
Total Fail - Counts the number of times a unique number in Pass column also has the profit column below 100000
Fail % - Total Fail / Total Weeks
sorted_df = pd.DataFrame(columns=['Pass','Total Profit','Total Weeks','Average per week','Total Pass','Pass %','Total Fail','Fail %','MA2'])
# group the original df by Pass and get first MA2 value of each group
pass_to_ma2 = new_df.groupby('Pass')['MA2'].first()
total_pass = 0
total_fail = 0
for value in new_df['Pass'].unique():
    mask = new_df['Pass'] == value
    pass_value = new_df[mask]
    total_profit = pass_value['Profit'].sum()
    total_weeks = pass_value.shape[0]
    average_per_week = total_profit/total_weeks
    total_pass = pass_value[pass_value['Profit'] > 110000].shape[0]
    pass_percentage = total_pass / total_weeks * 100 if total_weeks > 0 else 0
    total_fail = pass_value[pass_value['Profit'] < 100000].shape[0]
    fail_percentage = total_fail / total_weeks * 100 if total_weeks > 0 else 0
    sorted_df = sorted_df.append({'Pass': value, 'Total Profit': total_profit, 'Total Weeks': total_weeks, 'Average per week': average_per_week, 'In Profit': in_profit, 'Profit %': profit_percentage, 'Total Pass': total_pass, 'Pass %': pass_percentage, 'Total Fail': total_fail, 'Fail %': fail_percentage}, ignore_index=True)
# Add the MA2 value to the sorted_df DataFrame
sorted_df["MA2"] = sorted_df["Pass"].map(pass_to_ma2)
Pass Total Profit Total Weeks Average per week Total Pass Pass % \
0 69.0 505248.0 48.0 10526.0 0.0 0.0
1 47.0 500160.0 48.0 10420.0 0.0 0.0
2 68.0 499488.0 48.0 10406.0 0.0 0.0
3 50.0 498048.0 48.0 10376.0 0.0 0.0
4 285.0 496896.0 48.0 10352.0 0.0 0.0
.. ... ... ... ... ... ...
446 117.0 -498048.0 48.0 -10376.0 0.0 0.0
447 116.0 -498048.0 48.0 -10376.0 0.0 0.0
448 115.0 -498048.0 48.0 -10376.0 0.0 0.0
449 114.0 -498048.0 48.0 -10376.0 0.0 0.0
450 113.0 -498048.0 48.0 -10376.0 0.0 0.0
Total Fail Fail % MA2 In Profit Profit %
0 48.0 100.0 119 0.0 0.0
1 48.0 100.0 97 0.0 0.0
2 48.0 100.0 118 0.0 0.0
3 48.0 100.0 100 0.0 0.0
4 48.0 100.0 335 0.0 0.0
.. ... ... ... ... ...
446 48.0 100.0 167 0.0 0.0
447 48.0 100.0 166 0.0 0.0
448 48.0 100.0 165 0.0 0.0
449 48.0 100.0 164 0.0 0.0
450 48.0 100.0 163 0.0 0.0
[451 rows x 11 columns]
What am I doing wrong?
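One thing worth checking: the thresholds (110000 / 100000) are compared against individual weekly Profit values, which in the posted data are around ±10,400, so no single week can exceed 110,000 and Total Pass comes out 0 everywhere. Separately, the per-group counts can be written with groupby instead of a manual loop; a hypothetical miniature (column names from the question, data invented):

```python
import pandas as pd

# Hypothetical miniature of new_df: two Pass values, two weeks each.
new_df = pd.DataFrame({
    'Pass':   [69, 69, 47, 47],
    'Profit': [120000.0, 90000.0, 10420.0, 10300.0],
})

# The per-group aggregates the loop tries to build, via groupby.
g = new_df.groupby('Pass')['Profit']
sorted_df = pd.DataFrame({
    'Total Profit': g.sum(),
    'Total Weeks':  g.size(),
    'Total Pass':   g.apply(lambda s: (s > 110000).sum()),
    'Total Fail':   g.apply(lambda s: (s < 100000).sum()),
})
sorted_df['Pass %'] = sorted_df['Total Pass'] / sorted_df['Total Weeks'] * 100
sorted_df['Fail %'] = sorted_df['Total Fail'] / sorted_df['Total Weeks'] * 100

print(sorted_df)
```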
I have the following pandas dataframe. There are lots of NaN values (I have skipped most of them to make it look shorter).
0 NaN
...
26 NaN
27 357.0
28 357.0
29 357.0
30 NaN
...
246 NaN
247 357.0
248 357.0
249 357.0
250 NaN
...
303 NaN
304 58.0
305 58.0
306 58.0
307 58.0
308 58.0
309 58.0
310 58.0
311 58.0
312 58.0
313 58.0
314 58.0
315 58.0
316 NaN
...
333 NaN
334 237.0
I would like to filter out all the NaN values and keep only the first value of each non-NaN run (e.g. indices 27-29 hold three values; I would like to keep the value at index 27 and skip the values at 28 and 29). The targeted output should be as follows:
27 357.0
247 357.0
304 58.0
334 237.0
I am not sure how I could keep only the first value. Thanks in advance.
Take only values that aren't NaN where the value before them is NaN:
df = df[df.col1.notna() & df.col1.shift().isna()]
Output:
col1
27 357.0
247 357.0
304 58.0
334 237.0
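A self-contained reproduction of this, on a hypothetical shortened version of the data:

```python
import numpy as np
import pandas as pd

# Hypothetical shortened version of the data: NaN gaps around runs of values.
df = pd.DataFrame(
    {'col1': [np.nan, 357.0, 357.0, np.nan, 58.0, 58.0, 58.0, np.nan, 237.0]},
    index=[26, 27, 28, 29, 30, 31, 32, 33, 34],
)

# Keep rows holding a value whose immediate predecessor row was NaN.
out = df[df.col1.notna() & df.col1.shift().isna()]
print(out)
```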
Assuming all values are greater than 0, we could also do the following (note the frame then holds differences; they match the original values here only because each run is preceded by NaN, which fillna turns into 0):
df = df.fillna(0).diff()
df = df[df.col1.gt(0)]
You can find the breaks in the continuous index and use diff to get each run's first value:
m = (df['col'].dropna()
     .index.to_series()
     .diff().fillna(2).gt(1)
     .reindex(range(df.index.max()+1))
     .fillna(False))
out = df[m]
print(out)
col
27 357.0
247 357.0
304 58.0
334 237.0
Say that I have a df in the following format:
year 2016 2017 2018 2019 2020 min max avg
month
2021-01-01 284 288 311 383 476 284 476 357.4
2021-02-01 301 315 330 388 441 301 441 359.6
2021-03-01 303 331 341 400 475 303 475 375.4
2021-04-01 283 300 339 419 492 283 492 372.6
2021-05-01 287 288 346 420 445 287 445 359.7
2021-06-01 283 292 340 424 446 283 446 359.1
2021-07-01 294 296 360 444 452 294 452 370.3
2021-08-01 294 315 381 445 451 294 451 375.9
2021-09-01 288 331 405 464 459 288 464 385.6
2021-10-01 327 349 424 457 453 327 457 399.1
2021-11-01 316 351 413 469 471 316 471 401.0
2021-12-01 259 329 384 467 465 259 467 375.7
and I would like to get the difference of the 2020 column by using df['delta'] = df['2020'].diff().
This will obviously return NaN for the first value in the column. How can I make it so that it automatically interprets that diff as the difference between the FIRST value of 2020 and the LAST value of 2019?
If you want only for 2020:
df["delta"] = pd.concat([df["2019"], df["2020"]]).diff().tail(len(df))
Prints:
year 2016 2017 2018 2019 2020 min max avg delta
0 2021-01-01 284 288 311 383 476 284 476 357.4 9.0
1 2021-02-01 301 315 330 388 441 301 441 359.6 -35.0
2 2021-03-01 303 331 341 400 475 303 475 375.4 34.0
3 2021-04-01 283 300 339 419 492 283 492 372.6 17.0
4 2021-05-01 287 288 346 420 445 287 445 359.7 -47.0
5 2021-06-01 283 292 340 424 446 283 446 359.1 1.0
6 2021-07-01 294 296 360 444 452 294 452 370.3 6.0
7 2021-08-01 294 315 381 445 451 294 451 375.9 -1.0
8 2021-09-01 288 331 405 464 459 288 464 385.6 8.0
9 2021-10-01 327 349 424 457 453 327 457 399.1 -6.0
10 2021-11-01 316 351 413 469 471 316 471 401.0 18.0
11 2021-12-01 259 329 384 467 465 259 467 375.7 -6.0
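The trick in miniature, on a hypothetical frame holding just the two columns involved: concatenating 2019 under 2020 lets diff() treat last-of-2019 → first-of-2020 as a single step, and tail(len(df)) keeps only the 2020 part.

```python
import pandas as pd

# Hypothetical frame with just the two columns involved.
df = pd.DataFrame({'2019': [383, 388, 400],
                   '2020': [476, 441, 475]})

# Stack 2019 under 2020, diff across the seam, keep only the 2020 rows.
df['delta'] = pd.concat([df['2019'], df['2020']]).diff().tail(len(df))

print(df)
# The first delta is 476 - 400 (first of 2020 minus last of 2019), not NaN.
```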
You can try unstack, then do the diff; notice the first item in 2016 will still be NaN:
out = df.drop(['min','max','avg'], axis=1).unstack().diff().unstack(0)
2016 2017 2018 2019 2020
2021-01-01 NaN 29.0 -18.0 -1.0 9.0
2021-02-01 17.0 27.0 19.0 5.0 -35.0
2021-03-01 2.0 16.0 11.0 12.0 34.0
2021-04-01 -20.0 -31.0 -2.0 19.0 17.0
2021-05-01 4.0 -12.0 7.0 1.0 -47.0
2021-06-01 -4.0 4.0 -6.0 4.0 1.0
2021-07-01 11.0 4.0 20.0 20.0 6.0
2021-08-01 0.0 19.0 21.0 1.0 -1.0
2021-09-01 -6.0 16.0 24.0 19.0 8.0
2021-10-01 39.0 18.0 19.0 -7.0 -6.0
2021-11-01 -11.0 2.0 -11.0 12.0 18.0
2021-12-01 -57.0 -22.0 -29.0 -2.0 -6.0
I have a long df spanning 07:00:00 to 20:00:00 (df1) and a short df containing only fractions of the long one (df2), with identical datetime index values.
I would like to compare the groupsize values of the two data frames.
The datetime index, id, x, and y values should be identical.
How can I do this?
df1:
Out[180]:
date id gs x y
2019-10-09 07:38:22.139 3166 nan 248 233
2019-10-09 07:38:25.259 3166 nan 252 235
2019-10-09 07:38:27.419 3166 nan 253 231
2019-10-09 07:38:30.299 3166 nan 251 232
2019-10-09 07:38:32.379 3166 nan 251 233
2019-10-09 07:38:37.179 3166 nan 228 245
2019-10-09 07:39:49.498 3167 nan 289 253
2019-10-09 07:40:19.099 3168 nan 288 217
2019-10-09 07:40:38.779 3169 nan 278 139
2019-10-09 07:40:39.899 3169 nan 279 183
...
2019-10-09 19:52:53.959 5725 nan 190 180
2019-10-09 19:52:56.439 5725 nan 193 185
2019-10-09 19:52:58.919 5725 nan 204 220
2019-10-09 19:53:06.440 5804 nan 190 198
2019-10-09 19:53:08.919 5804 nan 200 170
2019-10-09 19:53:11.419 5804 nan 265 209
2019-10-09 19:53:16.460 5789 nan 292 218
2019-10-09 19:53:36.460 5806 nan 284 190
2019-10-09 19:54:08.939 5807 nan 404 226
2019-10-09 19:54:23.979 5808 nan 395 131
df2:
Out[181]:
date id gs x y
2019-10-09 11:20:01.418 3479 2.0 353 118.0
2019-10-09 11:20:01.418 3477 2.0 315 92.0
2019-10-09 11:20:01.418 3473 2.0 351 176.0
2019-10-09 11:20:01.418 3476 2.0 318 176.0
2019-10-09 11:20:01.418 3386 0.0 148 255.0
2019-10-09 11:20:01.418 3390 0.0 146 118.0
2019-10-09 11:20:01.418 3447 0.0 469 167.0
2019-10-09 11:20:03.898 3447 0.0 466 169.0
2019-10-09 11:20:03.898 3390 0.0 139 119.0
2019-10-09 11:20:03.898 3477 2.0 316 93.0
Expected output should be a dataframe with columns "date", "id", "x", "y", "gs(df1)", "gs(df2)"
Do a merge where everything is equal, but make sure to reset the index so it is part of the merge condition:
df1_t = df1.reset_index()
df2_t = df2.reset_index()
results = df1_t.merge(df2_t, on=['date', 'id', 'x', 'y'], indicator=True)
print(results)
print(results)
results will contain the rows of df1 that are also in df2.
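To get the expected "gs(df1)" / "gs(df2)" columns, merging on just the key columns with suffixes may be closer to what is wanted; a sketch on hypothetical two-row miniatures of the frames:

```python
import pandas as pd

# Hypothetical miniatures of the two frames; keys match, 'gs' differs.
df1 = pd.DataFrame({'date': ['2019-10-09 11:20:01.418'] * 2,
                    'id': [3479, 3477],
                    'gs': [float('nan'), float('nan')],
                    'x': [353, 315], 'y': [118.0, 92.0]})
df2 = pd.DataFrame({'date': ['2019-10-09 11:20:01.418'] * 2,
                    'id': [3479, 3477],
                    'gs': [2.0, 2.0],
                    'x': [353, 315], 'y': [118.0, 92.0]})

# Merge on the key columns only; suffixes label each frame's gs column.
result = df1.merge(df2, on=['date', 'id', 'x', 'y'],
                   suffixes=('(df1)', '(df2)'))

print(result)
```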
I am following a Lynda tutorial where they use the following code:
import pandas as pd
import seaborn
flights = seaborn.load_dataset('flights')
flights_indexed = flights.set_index(['year','month'])
flights_unstacked = flights_indexed.unstack()
flights_unstacked['passengers','total'] = flights_unstacked.sum(axis=1)
and it works perfectly. However, in my case the code does not run; for the last line I keep getting an error:
TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category
I know in the video they are using Python 2, however I have Python 3 since I am learning for work (which uses Python 3). Most of the differences I have been able to figure out, however I cannot figure out how to create this new column called 'total' with the sums of the passengers.
The root cause of this error message is the categorical nature of the month column:
In [42]: flights.dtypes
Out[42]:
year int64
month category
passengers int64
dtype: object
In [43]: flights.month.cat.categories
Out[43]: Index(['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'], dtype='object')
and you are trying to add a category total - Pandas doesn't like that.
Workaround:
In [45]: flights['month'] = flights.month.cat.add_categories('total')
In [46]: x = flights.pivot(index='year', columns='month', values='passengers')
In [47]: x['total'] = x.sum(1)
In [48]: x
Out[48]:
month January February March April May June July August September October November December total
year
1949 112.0 118.0 132.0 129.0 121.0 135.0 148.0 148.0 136.0 119.0 104.0 118.0 1520.0
1950 115.0 126.0 141.0 135.0 125.0 149.0 170.0 170.0 158.0 133.0 114.0 140.0 1676.0
1951 145.0 150.0 178.0 163.0 172.0 178.0 199.0 199.0 184.0 162.0 146.0 166.0 2042.0
1952 171.0 180.0 193.0 181.0 183.0 218.0 230.0 242.0 209.0 191.0 172.0 194.0 2364.0
1953 196.0 196.0 236.0 235.0 229.0 243.0 264.0 272.0 237.0 211.0 180.0 201.0 2700.0
1954 204.0 188.0 235.0 227.0 234.0 264.0 302.0 293.0 259.0 229.0 203.0 229.0 2867.0
1955 242.0 233.0 267.0 269.0 270.0 315.0 364.0 347.0 312.0 274.0 237.0 278.0 3408.0
1956 284.0 277.0 317.0 313.0 318.0 374.0 413.0 405.0 355.0 306.0 271.0 306.0 3939.0
1957 315.0 301.0 356.0 348.0 355.0 422.0 465.0 467.0 404.0 347.0 305.0 336.0 4421.0
1958 340.0 318.0 362.0 348.0 363.0 435.0 491.0 505.0 404.0 359.0 310.0 337.0 4572.0
1959 360.0 342.0 406.0 396.0 420.0 472.0 548.0 559.0 463.0 407.0 362.0 405.0 5140.0
1960 417.0 391.0 419.0 461.0 472.0 535.0 622.0 606.0 508.0 461.0 390.0 432.0 5714.0
UPDATE: alternatively if you don't want to touch the original DF you can get rid of categorical columns in the flights_unstacked DF:
In [76]: flights_unstacked.columns = \
...: flights_unstacked.columns \
...: .set_levels(flights_unstacked.columns.get_level_values(1).categories,
...: level=1)
...:
In [77]: flights_unstacked['passengers','total'] = flights_unstacked.sum(axis=1)
In [78]: flights_unstacked
Out[78]:
passengers
month January February March April May June July August September October November December total
year
1949 112 118 132 129 121 135 148 148 136 119 104 118 1520
1950 115 126 141 135 125 149 170 170 158 133 114 140 1676
1951 145 150 178 163 172 178 199 199 184 162 146 166 2042
1952 171 180 193 181 183 218 230 242 209 191 172 194 2364
1953 196 196 236 235 229 243 264 272 237 211 180 201 2700
1954 204 188 235 227 234 264 302 293 259 229 203 229 2867
1955 242 233 267 269 270 315 364 347 312 274 237 278 3408
1956 284 277 317 313 318 374 413 405 355 306 271 306 3939
1957 315 301 356 348 355 422 465 467 404 347 305 336 4421
1958 340 318 362 348 363 435 491 505 404 359 310 337 4572
1959 360 342 406 396 420 472 548 559 463 407 362 405 5140
1960 417 391 419 461 472 535 622 606 508 461 390 432 5714