I want to filter out a specific value (9999) that appears many times in a subset of my dataset. This is what I have done so far, but I'm not sure how to filter out all the 9999 values.
import pandas as pd
import statistics
df = pd.read_csv('Area(2).txt', delimiter='\t')
This is what a part of my dataset for 30 days (containing 600+ values) initially looks like; I'm only showing the first two rows here.
     No      Date  Time  Rand  Col  Value
0  2161  1 4 1991  0:00   181    1   9999
1  2162  1 4 1991  1:00   181    2   9999
Now I wanted to select the range of numbers under the column "Value" between 23-25 April, so I did the following:
df5 = df.iloc[528:602, 5]
print(df5)
The range of values I get for 23-25 April looks like this:
528    9999
529    9999
530    9999
531    9999
532    9999
       ...
597    9999
598    9999
599    9999
600    9999
601    9999
Name: Value, Length: 74, dtype: int64
I want to filter out all the 9999 values from this subset. I have tried a number of ways to get rid of them, but I keep getting IndexError: positional indexers are out-of-bounds, so I am unable to remove the 9999s and do further work like finding the variance and standard deviation of the selected subset.
If this helps, I also tried to filter out 9999 in the beginning and it looked like this:
df2 = df[df.Value != 9999]
print(df2)
       No       Date   Time  Rand  Col  Value
6    2167   1 4 1991   6:00   181    7    152
7    2168   1 4 1991   7:00   181    8    178
8    2169   1 4 1991   8:00   181    9    239
9    2170   1 4 1991   9:00   181   10    296
10   2171   1 4 1991  10:00   181   11    337
..    ...        ...    ...   ...  ...    ...
638  2799  27 4 1991  14:00   234    3    193
639  2800  27 4 1991  15:00   234    4    162
640  2801  27 4 1991  16:00   234    5    144
641  2802  27 4 1991  17:00   234    6    151
642  2803  27 4 1991  18:00   234    7    210

[351 rows x 6 columns]
Then I tried to obtain the range of column values between 23 April and 25 April with the code below, but I always get IndexError: positional indexers are out-of-bounds:
df6 = df2.iloc[528:602, 5]
print(df6)
How can I properly filter out the value I mentioned and obtain the subset of the dataset that I need?
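The IndexError comes from how .iloc works: it selects by position, and once the 9999 rows are dropped, df2 has only 351 rows, so positions 528-601 no longer exist. If you want to keep positional slicing, slice the unfiltered frame first and filter second; a minimal sketch of that order of operations:

df5 = df.iloc[528:602, 5]   # positional slice on the unfiltered frame (column 5 is Value)
df5 = df5[df5 != 9999]      # then drop the 9999 sentinel values

A more robust approach is to index by date instead.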
Given:
      No       Date   Time  Rand  Col  Value
0   2161   1 4 1991   0:00   181    1   9999
1   2162   1 4 1991   1:00   181    2   9999
2   2167   1 4 1991   6:00   181    7    152
3   2168   1 4 1991   7:00   181    8    178
4   2169   1 4 1991   8:00   181    9    239
5   2170   1 4 1991   9:00   181   10    296
6   2171   1 4 1991  10:00   181   11    337
7   2799  27 4 1991  14:00   234    3    193
8   2800  27 4 1991  15:00   234    4    162
9   2801  27 4 1991  16:00   234    5    144
10  2802  27 4 1991  17:00   234    6    151
11  2803  27 4 1991  18:00   234    7    210
First, let's make a proper datetime index:
# Your dates are pretty scuffed ("day month year" with spaces), so some
# reformatting is needed to make them parseable:
df.index = pd.to_datetime(
    df.Date.str.split().apply(lambda x: f'{x[1].zfill(2)}-{x[0].zfill(2)}-{x[2]}')
    + ' ' + df.Time
)
df.drop(['Date', 'Time'], axis=1, inplace=True)
This gives:
                       No  Rand  Col  Value
1991-04-01 00:00:00  2161   181    1   9999
1991-04-01 01:00:00  2162   181    2   9999
1991-04-01 06:00:00  2167   181    7    152
1991-04-01 07:00:00  2168   181    8    178
1991-04-01 08:00:00  2169   181    9    239
1991-04-01 09:00:00  2170   181   10    296
1991-04-01 10:00:00  2171   181   11    337
1991-04-27 14:00:00  2799   234    3    193
1991-04-27 15:00:00  2800   234    4    162
1991-04-27 16:00:00  2801   234    5    144
1991-04-27 17:00:00  2802   234    6    151
1991-04-27 18:00:00  2803   234    7    210
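As an aside, if Date is consistently "day month year", the manual reordering above can be replaced by handing pd.to_datetime an explicit format string (a sketch, assuming that layout holds for every row):

df.index = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d %m %Y %H:%M')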
Then we can easily apply both conditions (replace the dates with your own desired range):
df[df.Value.ne(9999)].loc['1991-04-01':'1991-04-01']
# df[df.Value.ne(9999)].loc['1991-04-23':'1991-04-25']
Output:
                       No  Rand  Col  Value
1991-04-01 06:00:00  2167   181    7    152
1991-04-01 07:00:00  2168   181    8    178
1991-04-01 08:00:00  2169   181    9    239
1991-04-01 09:00:00  2170   181   10    296
1991-04-01 10:00:00  2171   181   11    337
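From there, the variance and standard deviation the question is ultimately after follow directly; note that pandas' .var() and .std() compute the sample (ddof=1) versions by default:

subset = df[df.Value.ne(9999)].loc['1991-04-23':'1991-04-25', 'Value']
print(subset.var(), subset.std())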
Same question as here: group by pandas dataframe and select latest in each group, except instead of the latest date, I would like to get the next upcoming date for each group.
So given a dataframe sorted by date:
id product date
0 220 6647 2020-09-01
1 220 6647 2020-10-03
2 220 6647 2020-12-16
3 826 3380 2020-11-11
4 826 3380 2020-12-09
5 826 3380 2021-05-19
6 901 4555 2020-09-01
7 901 4555 2020-12-01
8 901 4555 2021-11-01
Using today's date (2020-12-01) to determine the next upcoming date, grouping by id or product and selecting the next upcoming date should give:
id product date
2 220 6647 2020-12-16
4 826 3380 2020-12-09
8 901 4555 2021-11-01
Filter the dates first, then drop duplicates:
df[df['date'] > '2020-12-01'].sort_values(['id', 'date']).drop_duplicates('id')
Output:
id product date
2 220 6647 2020-12-16
4 826 3380 2020-12-09
8 901 4555 2021-11-01
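An equivalent sketch using groupby instead of drop_duplicates: filter to the upcoming dates, then pick the row holding the smallest date per id with idxmin (converting date to a real datetime first, in case it is still a string column):

df['date'] = pd.to_datetime(df['date'])
upcoming = df[df['date'] > '2020-12-01']
df.loc[upcoming.groupby('id')['date'].idxmin()]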
How to keep the last group within a group using Pandas?
For example, given the following dataset:
id product date
0 220 6647 2014-09-01
1 220 6647 2014-09-03
2 220 6647 2014-10-16
3 826 6647 2014-11-11
4 826 6647 2014-12-09
5 826 6647 2015-05-19
6 901 4555 2014-09-01
7 901 4555 2014-10-05
8 901 4555 2014-11-01
9 401 4555 2015-01-05
10 401 4555 2015-02-01
how do I get the last id group from each product group succinctly?
id product date
3 826 6647 2014-11-11
4 826 6647 2014-12-09
5 826 6647 2015-05-19
9 401 4555 2015-01-05
10 401 4555 2015-02-01
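One succinct approach (a sketch): broadcast the last id seen in each product group with transform('last'), then keep only the rows whose id matches it:

last_ids = df.groupby('product')['id'].transform('last')
df[df['id'] == last_ids]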
A B C D yearweek
0 245 95 60 30 2014-48
1 245 15 70 25 2014-49
2 150 275 385 175 2014-50
3 100 260 170 335 2014-51
4 580 925 535 2590 2015-02
5 630 126 485 2115 2015-03
6 425 90 905 1085 2015-04
7 210 670 655 945 2015-05
The last column contains the year along with the week number. Is it possible to convert this to a datetime column with pd.to_datetime?
I've tried:
pd.to_datetime(df.yearweek, format='%Y-%U')
0 2014-01-01
1 2014-01-01
2 2014-01-01
3 2014-01-01
4 2015-01-01
5 2015-01-01
6 2015-01-01
7 2015-01-01
Name: yearweek, dtype: datetime64[ns]
But that output is incorrect, even though I believe %U is the format code for the week number. What am I missing here?
You need to specify a day of the week as well: strptime only uses the %U/%W week directives when a weekday directive is also present, which is why every row collapsed to January 1 of its year. Appending '-0' (matched by %w, where 0 is Sunday) anchors each week to its Sunday; note that %W counts Monday-started weeks, while %U would count Sunday-started ones.
s = pd.to_datetime(df.yearweek.add('-0'), format='%Y-%W-%w')
print(s)
0 2014-12-07
1 2014-12-14
2 2014-12-21
3 2014-12-28
4 2015-01-18
5 2015-01-25
6 2015-02-01
7 2015-02-08
Name: yearweek, dtype: datetime64[ns]
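As a variation, since %w numbers Sunday as 0, appending '-1' instead would give the Monday of each week:

pd.to_datetime(df.yearweek.add('-1'), format='%Y-%W-%w')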
I've been searching SO and haven't figured this out yet. Hoping someone can aid this Python newb in solving my problem.
I'm trying to figure out how to write an if/then statement in Python and perform an aggregation based on it. My end goal: if the date is 1/7/2017, use the value in the "fake" column; for all other dates, average the two columns together.
Here is what I have so far:
import pandas as pd
import numpy as np
import datetime

np.random.seed(42)
dte = pd.date_range(start=datetime.date(2017, 1, 1), end=datetime.date(2017, 1, 15))
fake = np.random.randint(15, 100, size=15)
fake2 = np.random.randint(300, 1000, size=15)
so_df = pd.DataFrame({'date': dte,
                      'fake': fake,
                      'fake2': fake2})
so_df['avg'] = so_df[['fake', 'fake2']].mean(axis=1)
so_df.head()
Assuming you have already computed the average column:
so_df['fake'].where(so_df['date'] == '20170107', so_df['avg'])
Out:
0 375.5
1 260.0
2 331.0
3 267.5
4 397.0
5 355.0
6 89.0
7 320.5
8 449.0
9 395.5
10 197.0
11 438.5
12 498.5
13 409.5
14 525.5
Name: fake, dtype: float64
If not, you can replace the column reference with the same calculation:
so_df['fake'].where(so_df['date'] == '20170107', so_df[['fake', 'fake2']].mean(axis=1))
To check for multiple dates, you need to use the element-wise version of the or operator (the pipe: |); otherwise it will raise an error.
so_df['fake'].where((so_df['date'] == '20170107') | (so_df['date'] == '20170109'), so_df['avg'])
The above checks for two dates. In the case of 3 or more, you may want to use isin with a list:
so_df['fake'].where(so_df['date'].isin(['20170107', '20170109', '20170112']), so_df['avg'])
Out[42]:
0 375.5
1 260.0
2 331.0
3 267.5
4 397.0
5 355.0
6 89.0
7 320.5
8 38.0
9 395.5
10 197.0
11 67.0
12 498.5
13 409.5
14 525.5
Name: fake, dtype: float64
Let's use np.where:
so_df['avg'] = np.where(so_df['date'] == pd.to_datetime('2017-01-07'),
                        so_df['fake'],
                        so_df[['fake', 'fake2']].mean(1))
Output:
date fake fake2 avg
0 2017-01-01 66 685 375.5
1 2017-01-02 29 491 260.0
2 2017-01-03 86 576 331.0
3 2017-01-04 75 460 267.5
4 2017-01-05 35 759 397.0
5 2017-01-06 97 613 355.0
6 2017-01-07 89 321 89.0
7 2017-01-08 89 552 320.5
8 2017-01-09 38 860 449.0
9 2017-01-10 17 774 395.5
10 2017-01-11 36 358 197.0
11 2017-01-12 67 810 438.5
12 2017-01-13 16 981 498.5
13 2017-01-14 44 775 409.5
14 2017-01-15 52 999 525.5
One way to do if-else in pandas is with np.where, which takes three arguments: the condition, the value to use where it is true, and the value to use where it is false.
so_df['avg'] = np.where(so_df['date'] == '2017-01-07', so_df['fake'],
                        so_df[['fake', 'fake2']].mean(axis=1))
date fake fake2 avg
0 2017-01-01 66 685 375.5
1 2017-01-02 29 491 260.0
2 2017-01-03 86 576 331.0
3 2017-01-04 75 460 267.5
4 2017-01-05 35 759 397.0
5 2017-01-06 97 613 355.0
6 2017-01-07 89 321 89.0
7 2017-01-08 89 552 320.5
8 2017-01-09 38 860 449.0
9 2017-01-10 17 774 395.5
10 2017-01-11 36 358 197.0
11 2017-01-12 67 810 438.5
12 2017-01-13 16 981 498.5
13 2017-01-14 44 775 409.5
14 2017-01-15 52 999 525.5
We can also use the Series.where() method, leaving the non-matching rows as NaN and then filling them with the row means:
In [141]: so_df['avg'] = so_df['fake'] \
     ...:     .where(so_df['date'].isin(['2017-01-07', '2017-01-09'])) \
     ...:     .fillna(so_df[['fake', 'fake2']].mean(1))
     ...:
In [142]: so_df
Out[142]:
date fake fake2 avg
0 2017-01-01 66 685 375.5
1 2017-01-02 29 491 260.0
2 2017-01-03 86 576 331.0
3 2017-01-04 75 460 267.5
4 2017-01-05 35 759 397.0
5 2017-01-06 97 613 355.0
6 2017-01-07 89 321 89.0
7 2017-01-08 89 552 320.5
8 2017-01-09 38 860 38.0
9 2017-01-10 17 774 395.5
10 2017-01-11 36 358 197.0
11 2017-01-12 67 810 438.5
12 2017-01-13 16 981 498.5
13 2017-01-14 44 775 409.5
14 2017-01-15 52 999 525.5
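If the branching later grows past a single condition, np.select generalizes np.where to several condition/choice pairs. A sketch, not from the answers above; the second date and the use of fake2 here are made up purely to illustrate the shape:

conditions = [so_df['date'] == '2017-01-07',
              so_df['date'] == '2017-01-09']
choices = [so_df['fake'], so_df['fake2']]   # hypothetical per-condition values
so_df['avg'] = np.select(conditions, choices,
                         default=so_df[['fake', 'fake2']].mean(axis=1))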