I have a pandas.DataFrame of time series (all columns are cast to float) indexed with a DatetimeIndex (granularity/frequency is about 1 hour) for rows and a MultiIndex for columns. There are missing data within the series (but no missing rows; the frequency is set). I would like to compute an acquisition performance (percentage) by month.
def mapMonth(x):
    return x.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
c = data.groupby(mapMonth).count()
The above code counts values while ignoring NaN, which is what I want. Now I would like to divide this aggregated DataFrame by the expected count.
n = pd.DataFrame(np.full((data.shape[0],), 1, dtype=float), index=data.index).groupby(mapMonth).sum()
This gives me the expected data count by month, but I find the approach rather convoluted.
In any case, I could not manage to divide the DataFrame c by n using:
p = c.div(n, axis=0)
The DataFrames look like this (the counts c first, then the expected counts n):
networkkey RTU
measurandkey NO2
sitekey 41B001 41B004 41B006 41B008 41B011 41MEU1 41N043 41R001 41R002
channelid 280 27 38 55 59 86 103 122 168
2012-01-01 0 728 728 0 728 732 728 728 728
2012-02-01 0 679 678 0 680 686 681 681 679
2012-03-01 0 728 727 0 727 720 726 728 722
2012-04-01 0 705 698 0 702 710 699 705 701
2012-05-01 0 728 728 0 726 728 725 724 680
2012-06-01 0 703 700 0 701 710 705 705 705
2012-07-01 0 728 728 0 728 657 707 728 728
0
2012-01-01 744.0
2012-02-01 696.0
2012-03-01 744.0
2012-04-01 720.0
2012-05-01 744.0
2012-06-01 720.0
2012-07-01 744.0
2012-08-01 744.0
2012-09-01 720.0
2012-10-01 744.0
2012-11-01 720.0
2012-12-01 744.0
I suspect the problem comes from the MultiIndex: c has MultiIndex columns while n has a single column named 0, so the DataFrame-by-DataFrame division also aligns on columns and produces NaN. In any case, I do not find this method straightforward.
Is there a cleaner/cleverer way to compute this aggregate with pandas?
I finally found the size function, which does not ignore NaN. The following code therefore does what I want in a few lines:
# Group data:
g = data.groupby(mapMonth)
# Compute performance:
c = g.count()         # per-column count, NaN excluded
n = g.size()          # rows per group, NaN included
d = c.div(n, axis=0)
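For reference, a minimal sketch of an even more direct route that avoids the helper function by grouping on a monthly pd.Grouper (the data below is only an illustrative stand-in for the real frame, assuming the index is a DatetimeIndex):
import numpy as np
import pandas as pd

# Illustrative hourly data with gaps and a small MultiIndex on the columns.
idx = pd.date_range('2012-01-01', '2012-03-31 23:00', freq='H')
data = pd.DataFrame(np.random.rand(len(idx), 2), index=idx,
                    columns=pd.MultiIndex.from_tuples([('RTU', '41B004'),
                                                       ('RTU', '41B006')]))
data.iloc[5:50, 0] = np.nan  # simulate missing measurements

# Group on calendar months directly: count() skips NaN, size() does not.
g = data.groupby(pd.Grouper(freq='MS'))
performance = g.count().div(g.size(), axis=0)
print(performance)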
Related
I have a multi-index dataframe (df):
contract A B Total
sex Male Female Male Female TotalMale TotalFemale
grade2
B1 948 467 408 835 1356 1302
B2 184 863 515 359 699 1222
B3 241 351 907 360 1148 711
B4 472 175 809 555 1281 730
B5 740 563 606 601 1346 1164
B6 435 780 295 392 730 1172
Total 3020 3199 3540 3102 6560 6301
I am trying to drop all indexes so my output is:
0 1 2 3 4 5
0 948 467 408 835 1356 1302
1 184 863 515 359 699 1222
2 241 351 907 360 1148 711
3 472 175 809 555 1281 730
4 740 563 606 601 1346 1164
5 435 780 295 392 730 1172
6 3020 3199 3540 3102 6560 6301
I have tried:
df = df.reset_index()
and
df = df.reset_index(drop=True)
without success.
Try building a new DataFrame:
df = pd.DataFrame(df.to_numpy())
You can use set_axis for the columns:
df.set_axis(range(df.shape[1]), axis=1).reset_index(drop=True)
If you need to use it in a pipeline, combine it with pipe:
(df
.pipe(lambda d: d.set_axis(range(d.shape[1]), axis=1))
.reset_index(drop=True)
)
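For completeness, here is a minimal, self-contained sketch of the set_axis approach on a small stand-in frame (the data is illustrative, not the OP's):
import pandas as pd

# Small stand-in with a column MultiIndex and a named row index.
cols = pd.MultiIndex.from_product([['A', 'B'], ['Male', 'Female']],
                                  names=['contract', 'sex'])
df = pd.DataFrame([[948, 467, 408, 835],
                   [184, 863, 515, 359]],
                  index=pd.Index(['B1', 'B2'], name='grade2'),
                  columns=cols)

# Replace the column labels with 0..n-1 and drop the row index.
out = df.set_axis(range(df.shape[1]), axis=1).reset_index(drop=True)
print(out)
#      0    1    2    3
# 0  948  467  408  835
# 1  184  863  515  359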
I have a csv file with data such as:
from datetime import datetime
import random
import pandas as pd

df = pd.DataFrame({'date': pd.date_range(datetime.today(), periods=100).tolist(),
                   'country': random.sample(range(1, 101), 100),
                   'amount': random.sample(range(1, 101), 100),
                   'others': random.sample(range(1, 101), 100)})
I wish to sum the column records by week. For example, 2020-05-01 to 2020-05-07 would be one week, so the amount would be summed and totalled for that week. This should continue until the end of the data, and the output I am after is something like:
Example output:
date country amount others
month 5 week 1 100 50 50
month 5 week 2 30 60 60
month 5 week 3 50 70 666
month 5 week 4 60 100 445
I know it's not exactly what you are looking for, but one way of doing it is with pd.Grouper:
In [74]: res = df.set_index('date').groupby(pd.Grouper(freq='W-MON'))[['country','amount','others']].sum().reset_index()
In [75]: res
Out[75]:
date country amount others
0 2020-04-20 257 412 344
1 2020-04-27 392 335 259
2 2020-05-04 294 263 363
3 2020-05-11 350 341 245
4 2020-05-18 394 277 330
5 2020-05-25 398 305 341
6 2020-06-01 398 338 509
7 2020-06-08 324 364 421
8 2020-06-15 435 415 430
9 2020-06-22 431 365 352
10 2020-06-29 330 275 358
11 2020-07-06 326 384 308
12 2020-07-13 368 473 364
13 2020-07-20 278 387 362
14 2020-07-27 75 116 64
In [86]: res['month'] = res['date'].dt.strftime('%b')
In [87]: res['weeknum'] = res['date'].apply(lambda x: x.isocalendar()[1])
In [88]: res.head(10)
Out[88]:
date country amount others month weeknum
0 2020-04-20 257 412 344 Apr 17
1 2020-04-27 392 335 259 Apr 18
2 2020-05-04 294 263 363 May 19
3 2020-05-11 350 341 245 May 20
4 2020-05-18 394 277 330 May 21
5 2020-05-25 398 305 341 May 22
6 2020-06-01 398 338 509 Jun 23
7 2020-06-08 324 364 421 Jun 24
8 2020-06-15 435 415 430 Jun 25
9 2020-06-22 431 365 352 Jun 26
It groups the dates on a weekly frequency. More details can be found here. The week number here is based on the ISO calendar, i.e. counted over the whole year rather than within the month.
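If you really want labels like "month 5 week 1" (week of the month rather than week of the year), one possible follow-up, continuing from res above, is sketched below; the week-of-month definition ((day - 1) // 7 + 1 applied to the weekly bin label) is an assumption, not something specified in the question:
# Derive "month X week Y" style labels from the weekly bin dates.
res['month'] = res['date'].dt.month
res['week_of_month'] = (res['date'].dt.day - 1) // 7 + 1
res['label'] = ('month ' + res['month'].astype(str)
                + ' week ' + res['week_of_month'].astype(str))
print(res[['label', 'country', 'amount', 'others']].head())
On recent pandas versions, res['date'].dt.isocalendar().week is an alternative to the apply call used above for the ISO week number.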
I have this data frame:
ID Date X 123_Var 456_Var 789_Var
A 16-07-19 3 777 250 810
A 17-07-19 9 637 121 529
A 20-07-19 2 295 272 490
A 21-07-19 3 778 600 544
A 22-07-19 6 741 792 907
A 25-07-19 6 435 416 820
A 26-07-19 8 590 455 342
A 27-07-19 6 763 476 753
A 02-08-19 6 717 211 454
A 03-08-19 6 152 442 475
A 05-08-19 6 564 340 302
A 07-08-19 6 105 929 633
A 08-08-19 6 948 366 586
B 07-08-19 4 509 690 406
B 08-08-19 2 413 725 414
B 12-08-19 2 170 702 912
B 13-08-19 3 851 616 477
B 14-08-19 9 475 447 555
B 15-08-19 1 412 403 708
B 17-08-19 2 299 537 321
B 18-08-19 4 310 119 125
I want to show the mean value of the last n days (using the Date column), excluding the value of the current day.
I'm using this code (what should I do to fix it?):
n = 4
cols = list(df.filter(regex='Var').columns)
df = df.set_index('Date')
df[cols] = (df.groupby('ID').rolling(window=f'{n}D')[cols].mean()
.reset_index(0,drop=True).add_suffix(f'_{n}'))
df.reset_index(inplace=True)
Expected result:
ID Date X 123_Var 456_Var 789_Var 123_Var_4 456_Var_4 789_Var_4
A 16-07-19 3 777 250 810 NaN NaN NaN
A 17-07-19 9 637 121 529 777.000000 250.000000 810.0
A 20-07-19 2 295 272 490 707.000000 185.500000 669.5
A 21-07-19 3 778 600 544 466.000000 196.500000 509.5
A 22-07-19 6 741 792 907 536.500000 436.000000 517.0
A 25-07-19 6 435 416 820 759.500000 696.000000 725.5
A 26-07-19 8 590 455 342 588.000000 604.000000 863.5
A 27-07-19 6 763 476 753 512.500000 435.500000 581.0
A 02-08-19 6 717 211 454 NaN NaN NaN
A 03-08-19 6 152 442 475 717.000000 211.000000 454.0
A 05-08-19 6 564 340 302 434.500000 326.500000 464.5
A 07-08-19 6 105 929 633 358.000000 391.000000 388.5
A 08-08-19 6 948 366 586 334.500000 634.500000 467.5
B 07-08-19 4 509 690 406 NaN NaN NaN
B 08-08-19 2 413 725 414 509.000000 690.000000 406.0
B 12-08-19 2 170 702 912 413.000000 725.000000 414.0
B 13-08-19 3 851 616 477 291.500000 713.500000 663.0
B 14-08-19 9 475 447 555 510.500000 659.000000 694.5
B 15-08-19 1 412 403 708 498.666667 588.333333 648.0
B 17-08-19 2 299 537 321 579.333333 488.666667 580.0
B 18-08-19 4 310 119 125 395.333333 462.333333 528.0
Note: dataframe has changed.
I changed unutbu's solution to work with rolling:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
n = 5
cols = df.filter(regex='Var').columns
df = df.set_index('Date')
df_ = df.set_index('ID', append=True).swaplevel(1,0)
# Rolling count and mean over the last n days per ID (both include the current row).
df1 = df.groupby('ID').rolling(window=f'{n}D')[cols].count()
df2 = df.groupby('ID').rolling(window=f'{n}D')[cols].mean()
# Remove the current day's contribution: (count * mean - current) / (count - 1).
df3 = (df1.mul(df2)
          .sub(df_[cols])
          .div(df1[cols].sub(1)).add_suffix(f'_{n}')
       )
df4 = df_.join(df3)
print(df4)
X 123_Var 456_Var 789_Var 123_Var_5 456_Var_5 789_Var_5
ID Date
A 2019-07-16 3 777 250 810 NaN NaN NaN
2019-07-17 9 637 121 529 777.000000 250.000000 810.0
2019-07-20 2 295 272 490 707.000000 185.500000 669.5
2019-07-21 3 778 600 544 466.000000 196.500000 509.5
2019-07-22 6 741 792 907 536.500000 436.000000 517.0
2019-07-25 6 435 416 820 759.500000 696.000000 725.5
2019-07-26 8 590 455 342 588.000000 604.000000 863.5
2019-07-27 6 763 476 753 512.500000 435.500000 581.0
2019-08-02 6 717 211 454 NaN NaN NaN
2019-08-03 6 152 442 475 717.000000 211.000000 454.0
2019-08-05 6 564 340 302 434.500000 326.500000 464.5
2019-08-07 6 105 929 633 358.000000 391.000000 388.5
2019-08-08 6 948 366 586 334.500000 634.500000 467.5
B 2019-08-07 4 509 690 406 NaN NaN NaN
2019-08-08 2 413 725 414 509.000000 690.000000 406.0
2019-08-12 2 170 702 912 413.000000 725.000000 414.0
2019-08-13 3 851 616 477 170.000000 702.000000 912.0
2019-08-14 9 475 447 555 510.500000 659.000000 694.5
2019-08-15 1 412 403 708 498.666667 588.333333 648.0
2019-08-17 2 299 537 321 579.333333 488.666667 580.0
2019-08-18 4 310 119 125 395.333333 462.333333 528.0
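The key step is df3, which relies on the identity mean-without-current = (count * mean - current) / (count - 1). A tiny standalone check of that identity, with made-up numbers:
import pandas as pd

# Mean of the first two values, recovered from the mean of all three.
s = pd.Series([10.0, 20.0, 60.0])
count = s.count()        # 3
mean_all = s.mean()      # 30.0
current = s.iloc[-1]     # 60.0
print((count * mean_all - current) / (count - 1))  # 15.0 == s.iloc[:-1].mean()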
I have a dataframe with some dates, and associated data for each date, that I am reading in from a csv file (the file is relatively small, on the order of 10,000s of rows, and ~10 columns):
memid date a b
10000 7/3/2017 221 143
10001 7/4/2017 442 144
10002 7/6/2017 132 145
10003 7/8/2017 742 146
10004 7/10/2017 149 147
I want to add a column, "date_diff", to this dataframe that calculates the number of days between each date and the most recent previous date (the rows are always sorted by date):
memid date a b date_diff
10000 7/3/2017 221 143 NaN
10001 7/4/2017 442 144 1
10002 7/6/2017 132 145 2
10003 7/8/2017 742 146 2
10004 7/11/2017 149 147 3
I am having trouble figuring out a good way to create this "date_diff" column as iterating row by row tends to be frowned upon when using pandas/numpy. Is there an easy way to create this column in python/pandas/numpy or is this job better done before the csv is read into my script?
Thanks!
EDIT: Thanks to jpp and Tai for their answer. It covers the original question but I have a follow up:
What if my dataset has multiple rows for each date? Is there a way to easily check the difference between each group of dates to produce an output like the example below? Is it easier if there are a set number of rows for each date?
memid date a b date_diff
10000 7/3/2017 221 143 NaN
10001 7/3/2017 442 144 NaN
10002 7/4/2017 132 145 1
10003 7/4/2017 742 146 1
10004 7/6/2017 149 147 2
10005 7/6/2017 457 148 2
Edit to answer OP's new question: what if there are duplicates in the date column?
Set up: create a frame that does not contain duplicate dates:
df.date = pd.to_datetime(df.date, infer_datetime_format=True)
df_no_dup = df.drop_duplicates("date").copy()
df_no_dup["diff"] = df_no_dup["date"].diff().dt.days
Method 1: merge
df.merge(df_no_dup[["date", "diff"]], left_on="date", right_on="date", how="left")
memid date a b diff
0 10000 2017-07-03 221 143 NaN
1 10001 2017-07-03 442 144 NaN
2 10002 2017-07-04 132 145 1.0
3 10003 2017-07-04 742 146 1.0
4 10004 2017-07-06 149 147 2.0
5 10005 2017-07-06 457 148 2.0
Method 2: map
df["diff"] = df["date"].map(df_no_dup.set_index("date")["diff"])
Try this.
df.date = pd.to_datetime(df.date, infer_datetime_format=True)
df.date.diff()
0 NaT
1 1 days
2 2 days
3 2 days
4 2 days
Name: date, dtype: timedelta64[ns]
To convert to integers:
df['diff'] = df['date'].diff() / np.timedelta64(1, 'D')
# memid date a b diff
# 0 10000 2017-07-03 221 143 NaN
# 1 10001 2017-07-04 442 144 1.0
# 2 10002 2017-07-06 132 145 2.0
# 3 10003 2017-07-08 742 146 2.0
# 4 10004 2017-07-10 149 147 2.0
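Putting the pieces together for the duplicate-date case, here is a self-contained sketch of the map approach (the frame below is only a stand-in for the OP's CSV):
import pandas as pd

# Stand-in for the OP's data with repeated dates.
df = pd.DataFrame({'memid': [10000, 10001, 10002, 10003, 10004, 10005],
                   'date': ['7/3/2017', '7/3/2017', '7/4/2017',
                            '7/4/2017', '7/6/2017', '7/6/2017'],
                   'a': [221, 442, 132, 742, 149, 457],
                   'b': [143, 144, 145, 146, 147, 148]})
df['date'] = pd.to_datetime(df['date'])

# Diff over the unique dates only, then broadcast the result back to every row.
df_no_dup = df.drop_duplicates('date').copy()
df_no_dup['diff'] = df_no_dup['date'].diff().dt.days
df['date_diff'] = df['date'].map(df_no_dup.set_index('date')['diff'])
print(df)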
I am trying to concat two dataframes:
DataFrame 1 'AB1'
AB_BH AB_CA
Date
2007-01-05 305 324
2007-01-12 427 435
2007-01-19 481 460
2007-01-26 491 506
2007-02-02 459 503
2007-02-09 459 493
2007-02-16 450 486
DataFrame 2 'ABFluid'
Obj Total Rigs
Date
2007-01-03 312
2007-01-09 412
2007-01-16 446
2007-01-23 468
2007-01-30 456
2007-02-06 465
2007-02-14 456
2007-02-20 435
2007-02-27 440
Using the following code:
rigdata = pd.concat([AB1, ABFluid['Total Rigs']], axis=1)
Which results in this:
AB_BH AB_CA Total Rigs
Date
2007-01-03 NaN NaN 312
2007-01-05 305 324 NaN
2007-01-09 NaN NaN 412
2007-01-12 427 435 NaN
2007-01-16 NaN NaN 446
2007-01-19 481 460 NaN
2007-01-23 NaN NaN 468
2007-01-26 491 506 NaN
But I am looking to force the 'Total Rigs' dataframe to have the same dates as the AB1 frame like this:
AB_BH AB_CA Total Rigs
Date
2007-01-03 305 324 312
2007-01-12 427 435 412
2007-01-19 481 460 446
2007-01-26 491 506 468
That is, aligning them row by row and re-indexing the dates.
Any suggestions?
You could do ABFluid.index = AB1.index before the concat, to make the second DataFrame use the same index as the first (this assumes both frames have the same number of rows, in the same order).
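If overwriting the index feels too blunt, a hedged alternative (not part of the answer above) is to align on the nearest preceding date with merge_asof, which also tolerates the two frames having different lengths. A minimal sketch with stand-ins trimmed from the question's data:
import pandas as pd

# Stand-ins for AB1 and ABFluid, reduced to a few rows from the question.
AB1 = pd.DataFrame({'AB_BH': [305, 427, 481, 491],
                    'AB_CA': [324, 435, 460, 506]},
                   index=pd.to_datetime(['2007-01-05', '2007-01-12',
                                         '2007-01-19', '2007-01-26']))
ABFluid = pd.DataFrame({'Total Rigs': [312, 412, 446, 468]},
                       index=pd.to_datetime(['2007-01-03', '2007-01-09',
                                             '2007-01-16', '2007-01-23']))

# Pair each AB1 row with the most recent ABFluid row at or before its date.
rigdata = pd.merge_asof(AB1, ABFluid,
                        left_index=True, right_index=True,
                        tolerance=pd.Timedelta('7D'))
print(rigdata)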