How do I add an id column to a dataframe? Values between 0 and 100 should get an id of 1, otherwise 2.
time values
2018-03-19 14:31:17.200 1095
2018-03-19 14:31:17.300 2296
2018-03-19 14:31:17.400 2147
2018-03-19 14:31:17.500 309
2018-03-19 14:31:17.600 244
2018-03-19 14:31:17.700 263
2018-03-19 14:31:17.800 548
I think you need numpy.where with a condition created by between (inclusive=True by default):
import numpy as np

df['id'] = np.where(df['values'].between(0, 100), 1, 2)
print (df)
time values id
1 2018-03-19 14:31:17.200 1095 2
2 2018-03-19 14:31:17.300 2296 2
3 2018-03-19 14:31:17.400 2147 2
4 2018-03-19 14:31:17.500 309 2
5 2018-03-19 14:31:17.600 244 2
6 2018-03-19 14:31:17.700 263 2
7 2018-03-19 14:31:17.800 548 2
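Note: in newer pandas (1.3+) the inclusive argument of between takes a string instead of a boolean; a minimal equivalent sketch:
# newer pandas: inclusive is one of 'both', 'neither', 'left', 'right'
df['id'] = np.where(df['values'].between(0, 100, inclusive='both'), 1, 2)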
I am trying to take the diff from the previous row in a dataframe, grouping by the column "group". There are several similar questions, but I can't get this working.
date group value
0 2020-01-01 A 808
1 2020-01-01 B 331
2 2020-01-02 A 612
3 2020-01-02 B 1391
4 2020-01-03 A 234
5 2020-01-04 A 828
6 2020-01-04 B 820
7   2020-01-05     A    1075
8 2020-01-07 B 572
9 2020-01-10 B 736
10 2020-01-10 A 1436
df.sort_values(['group','date'], inplace=True)
df['diff'] = df['value'].diff()
print(df)
date value group diff
1 2020-01-03 234 A NaN
8 2020-01-01 331 B 97.0
2 2020-01-07 572 B 241.0
9 2020-01-02 612 A 40.0
5 2020-01-10 736 B 124.0
17 2020-01-01 808 A 72.0
14 2020-01-04 820 B 12.0
4 2020-01-04 828 A 8.0
18 2020-01-05 1075 A 247.0
7 2020-01-02 1391 B 316.0
10 2020-01-10 1436 A 45.0
This is the result that I need:
date group value diff
0 2020-01-01 A 808 Na
2 2020-01-02 A 612 -196
4 2020-01-03 A 234 -378
5 2020-01-04 A 828 594
7   2020-01-05     A    1075   247
10 2020-01-10 A 1436 361
1 2020-01-01 B 331 Na
3 2020-01-02 B 1391 1060
6 2020-01-04 B 820 -571
8 2020-01-07 B 572 -248
9 2020-01-10 B 736 164
Shift the value column within each group to create a helper column, then subtract it from the original value column to get the difference column.
df.sort_values(['group','date'], ascending=[True,True], inplace=True)
df['shift'] = df.groupby('group')['value'].shift()
df['diff'] = df['value'] - df['shift']
df = df[['date','group','value','diff']]
df
date group value diff
0 2020-01-01 A 808 NaN
2 2020-01-02 A 612 -196.0
4 2020-01-03 A 234 -378.0
5 2020-01-04 A 828 594.0
6 2020-01-05 A 1075 247.0
10 2020-01-10 A 1436 361.0
1 2020-01-01 B 331 NaN
3 2020-01-02 B 1391 1060.0
6 2020-01-04 B 820 -571.0
8 2020-01-07 B 572 -248.0
9 2020-01-10 B 736 164.0
You can use groupby together with diff():
df = df.sort_values('date')
df['diff'] = df.groupby(['group'])['value'].diff()
gives
date group value diff
0 2020-01-01 A 808 NaN
1 2020-01-01 B 331 NaN
2 2020-01-02 A 612 -196.0
3 2020-01-02 B 1391 1060.0
4 2020-01-03 A 234 -378.0
5 2020-01-04 A 828 594.0
6 2020-01-04 B 820 -571.0
7 2020-01-05 A 1075 247.0
8 2020-01-07 B 572 -248.0
10 2020-01-10 A 1436 361.0
9 2020-01-10 B 736 164.0
If you want the dataset ordered as you have it, you can add group to the sort, but it's not necessary for the operation and can be done before or after you compute the differences.
df.sort_values(['group','date'])
date group value diff
0 2020-01-01 A 808 NaN
2 2020-01-02 A 612 -196.0
4 2020-01-03 A 234 -378.0
5 2020-01-04 A 828 594.0
7 2020-01-05 A 1075 247.0
10 2020-01-10 A 1436 361.0
1 2020-01-01 B 331 NaN
3 2020-01-02 B 1391 1060.0
6 2020-01-04 B 820 -571.0
8 2020-01-07 B 572 -248.0
9 2020-01-10 B 736 164.0
My question is based on this thread, where we group values of a pandas dataframe and select the latest (by date) from each group:
id product date
0 220 6647 2014-09-01
1 220 6647 2014-09-03
2 220 6647 2014-10-16
3 826 3380 2014-11-11
4 826 3380 2014-12-09
5 826 3380 2015-05-19
6 901 4555 2014-09-01
7 901 4555 2014-10-05
8 901 4555 2014-11-01
using the following
df.loc[df.groupby('id').date.idxmax()]
Say, however, that I only want to select the latest (by date) from each group of records that fall within +/- 5 days of each other. I.e., after grouping I want to find the latest within the following groups:
0 220 6647 2014-09-01 #because only these two are within +/- 5 days of each other
1 220 6647 2014-09-03
2   220   6647  2014-10-16  #spaced more than 5 days apart from the two records above
3 826 3380 2014-11-11
.....
which yields
id product date
1 220 6647 2014-09-03
2 220 6647 2014-10-16
3 826 3380 2014-11-11
4 826 3380 2014-12-09
5 826 3380 2015-05-19
6 901 4555 2014-09-01
7 901 4555 2014-10-05
8 901 4555 2014-11-01
Dataset with price:
id product date price
0 220 6647 2014-09-01 100 #group 1
1 220 6647 2014-09-03 120 #group 1 --> pick this
2 220 6647 2014-09-05 0 #group 1
3 826 3380 2014-11-11 150 #group 2 --> pick this
4 826 3380 2014-12-09 23 #group 3 --> pick this
5 826 3380 2015-05-12 88 #group 4 --> pick this
6 901 4555 2015-05-15 32 #group 4
7 901 4555 2015-10-05 542 #group 5 --> pick this
8 901 4555 2015-11-01 98 #group 6 --> pick this
I think you need to create groups with apply, using a list comprehension with between, then convert them to numeric group labels with factorize, and finally use your solution with loc + idxmax:
df['date'] = pd.to_datetime(df['date'])
df = df.reset_index(drop=True)
td = pd.Timedelta('5 days')
def f(x):
    x['g'] = [tuple(x.index[x['date'].between(i - td, i + td)]) for i in x['date']]
    return x
df2 = df.groupby('id').apply(f)
df2['g'] = pd.factorize(df2['g'])[0]
print (df2)
id product date price g
0 220 6647 2014-09-01 100 0
1 220 6647 2014-09-03 120 0
2 220 6647 2014-09-05 0 0
3 826 3380 2014-11-11 150 1
4 826 3380 2014-12-09 23 2
5 826 3380 2015-05-12 88 3
6 901 4555 2015-05-15 32 4
7 901 4555 2015-10-05 542 5
8 901 4555 2015-11-01 98 6
df3 = df2.loc[df2.groupby('g')['price'].idxmax()]
print (df3)
id product date price g
1 220 6647 2014-09-03 120 0
3 826 3380 2014-11-11 150 1
4 826 3380 2014-12-09 23 2
5 826 3380 2015-05-12 88 3
6 901 4555 2015-05-15 32 4
7 901 4555 2015-10-05 542 5
8 901 4555 2015-11-01 98 6
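For reference, an alternative sketch that avoids the list comprehension: start a new group whenever the id changes or the gap to the previous date within an id exceeds 5 days. This assumes the frame is sorted by id and date, and it groups by consecutive gaps rather than a full +/- 5 day window (which matches the grouping in the answer above for this data):
# assumption: df is sorted by ['id', 'date'] and df['date'] is datetime64
new_group = (df['id'] != df['id'].shift()) | (df.groupby('id')['date'].diff() > pd.Timedelta('5 days'))
df['g'] = new_group.cumsum()
print(df.loc[df.groupby('g')['price'].idxmax()])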
Or use a two-liner:
df2=pd.to_numeric(df.groupby('id')['date'].diff(-1).astype(str).str[:-25]).abs().fillna(6)
print(df.loc[df2.index[df2>5].tolist()])
Output:
id product date
1 220 6647 2014-09-03
2 220 6647 2014-10-16
3 826 3380 2014-11-11
4 826 3380 2014-12-09
5 826 3380 2015-05-19
6 901 4555 2014-09-01
7 901 4555 2014-10-05
8 901 4555 2014-11-01
So: take the diff, use string slicing to keep only the day count, take the absolute values, keep the rows where the gap is more than 5 days, and use those indexes to select the rows from df.
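If 'date' is already a datetime column, the string slicing can probably be avoided by going through the Timedelta accessor instead; a minimal sketch of the same idea:
# assumption: df['date'] is datetime64, so diff() yields Timedeltas with a .dt.days accessor
gap = df.groupby('id')['date'].diff(-1).abs().dt.days.fillna(6)
print(df.loc[gap.index[gap > 5]])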
I have a DataFrame with more than 200 features; here is part of the dataset to show the problem:
index ID X1 X2 Date1 Y1
0 2 324 634 2016-01-01 NaN
1 2 324 634 2016-01-01 1224.0
3 4 543 843 2017-02-01 654
4 4 543 843 2017-02-01 NaN
5 5 523 843 2015-09-01 NaN
6 5 523 843 2015-09-01 1121.0
7 6 500 897 2015-11-01 NaN
As you can see, the rows are duplicated (in ID, X1, X2 and Date1), and from each pair of rows that are identical in ID, X1, X2 and Date1 I want to remove the one whose Y1 is NaN. So my desired DataFrame should be:
index ID X1 X2 Date1 Y1
1 2 324 634 2016-01-01 1224.0
3 4 543 843 2017-02-01 654
6 5 523 843 2015-09-01 1121.0
7 6 500 897 2015-11-01 NaN
Does any one know, how I can handle it?
Use sort_values on "Y1" to move NaNs to the bottom of the DataFrame, and then use drop_duplicates:
df2 = (df.sort_values('Y1', na_position='last')
.drop_duplicates(['ID', 'X1', 'X2', 'Date1'], keep='first')
.sort_index())
df2
ID X1 X2 Date1 Y1
index
1 2 324 634 2016-01-01 1224.0
3 4 543 843 2017-02-01 654.0
6 5 523 843 2015-09-01 1121.0
7 6 500 897 2015-11-01 NaN
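An alternative sketch, assuming the same column names: groupby + first returns the first non-NaN value within each group by default, so it also keeps the non-NaN Y1 per key (at the cost of resetting the original index):
# first() picks the first non-NaN Y1 per group; NaN remains only if the whole group is NaN
df.groupby(['ID', 'X1', 'X2', 'Date1'], as_index=False).first()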
Just use the drop_duplicates function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
df \
    .sort_values('Y1', ascending=False) \
    .drop_duplicates(subset='ID')
A B C D yearweek
0 245 95 60 30 2014-48
1 245 15 70 25 2014-49
2 150 275 385 175 2014-50
3 100 260 170 335 2014-51
4 580 925 535 2590 2015-02
5 630 126 485 2115 2015-03
6 425 90 905 1085 2015-04
7 210 670 655 945 2015-05
The last column contains the year along with the week number. Is it possible to convert this to a datetime column with pd.to_datetime?
I've tried:
pd.to_datetime(df.yearweek, format='%Y-%U')
0 2014-01-01
1 2014-01-01
2 2014-01-01
3 2014-01-01
4 2015-01-01
5 2015-01-01
6 2015-01-01
7 2015-01-01
Name: yearweek, dtype: datetime64[ns]
But that output is incorrect, even though I believe %U is the format string for the week number. What am I missing here?
You need another directive to specify the day of the week - check this:
df = pd.to_datetime(df.yearweek.add('-0'), format='%Y-%W-%w')
print (df)
0 2014-12-07
1 2014-12-14
2 2014-12-21
3 2014-12-28
4 2015-01-18
5 2015-01-25
6 2015-02-01
7 2015-02-08
Name: yearweek, dtype: datetime64[ns]
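For reference, the '-0' suffix supplies the %w weekday (0 = Sunday); without a weekday, %W and %U are ignored in the date calculation. A minimal check with the standard library:
from datetime import datetime

# '%W' is only used in the calculation when a weekday ('%w') and the year are also given
datetime.strptime('2014-48-0', '%Y-%W-%w')  # -> datetime.datetime(2014, 12, 7, 0, 0)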
I have the table below in a Pandas dataframe:
date user_id whole_cost cost1
02/10/2012 00:00:00 1 1790 12
07/10/2012 00:00:00 1 364 15
30/01/2013 00:00:00 1 280 10
02/02/2013 00:00:00 1 259 24
05/03/2013 00:00:00 1 201 39
02/10/2012 00:00:00 3 623 1
07/12/2012 00:00:00 3 90 0
30/01/2013 00:00:00 3 312 90
02/02/2013 00:00:00 5 359 45
05/03/2013 00:00:00 5 301 34
02/02/2013 00:00:00 5 359 1
05/03/2013 00:00:00 5 801 12
..
The table was extracted from a csv file with the following code:
import pandas as pd
newnames = ['date','user_id', 'whole_cost', 'cost1']
df = pd.read_csv('expenses.csv', names = newnames, index_col = 'date')
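Note: for the month-based grouping below, the date index needs to be parsed as datetimes; a hedged variant of the read, assuming the dates are day-first:
# assumption: dates are in dd/mm/yyyy format, hence dayfirst=True
df = pd.read_csv('expenses.csv', names=newnames, parse_dates=['date'], dayfirst=True, index_col='date')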
I have to analyse the profile of my users, and for this purpose I would like to group the queries by month (for each user - there are thousands of them), summing whole_cost over the entire month. For example, if user_id=1 has a whole_cost of 1790 on 02/10/2012 (with cost1 12) and 364 on 07/10/2012, then the new table should have an entry of 2154 (as the whole_cost) on 31/10/2012; all dates in the transformed table will be month ends, each representing the whole month to which it relates.
In pandas 0.14 you'll be able to group by month and by another column at the same time:
In [11]: df
Out[11]:
user_id whole_cost cost1
2012-10-02 1 1790 12
2012-10-07 1 364 15
2013-01-30 1 280 10
2013-02-02 1 259 24
2013-03-05 1 201 39
2012-10-02 3 623 1
2012-12-07 3 90 0
2013-01-30 3 312 90
2013-02-02 5 359 45
2013-03-05 5 301 34
2013-02-02 5 359 1
2013-03-05 5 801 12
In [12]: df1 = df.sort_index() # requires sorted DatetimeIndex
In [13]: df1.groupby([pd.TimeGrouper(freq='M'), 'user_id'])['whole_cost'].sum()
Out[13]:
user_id
2012-10-31 1 2154
3 623
2012-12-31 3 90
2013-01-31 1 280
3 312
2013-02-28 1 259
5 718
2013-03-31 1 201
5 1102
Name: whole_cost, dtype: int64
Until 0.14 I think you're stuck with doing two groupbys:
In [14]: g = df.groupby('user_id')['whole_cost']
In [15]: g.resample('M', how='sum').dropna()
Out[15]:
user_id
1 2012-10-31 2154
2013-01-31 280
2013-02-28 259
2013-03-31 201
3 2012-10-31 623
2012-12-31 90
2013-01-31 312
5 2013-02-28 718
2013-03-31 1102
dtype: float64
With TimeGrouper deprecated, you can replace it with pd.Grouper to get the same results:
# group by user and month end
df.groupby(['user_id', pd.Grouper(key='date', freq='M')]).agg({'whole_cost': sum})
# or, group by user and day of week
df.groupby(['user_id', df['date'].dt.dayofweek]).agg({'whole_cost': sum})
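For completeness, when the dates are on the index (as in the read_csv above), pd.Grouper can be used without a key; a minimal sketch assuming the index was parsed as datetimes:
# assumption: df has a DatetimeIndex; Grouper without key groups on the index
df.groupby(['user_id', pd.Grouper(freq='M')])['whole_cost'].sum()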