Python Data-wrangling

I have a dataframe in Python below:
print (df)
Date Hour Weight
0 2019-01-01 8 1
1 2019-01-01 16 2
2 2019-01-01 24 6
3 2019-01-02 8 10
4 2019-01-02 16 4
5 2019-01-02 24 12
6 2019-01-03 8 10
7 2019-01-03 16 6
8 2019-01-03 24 5
How can I create a column (New_Col) that returns the value of 'Hour' at the lowest value of 'Weight' for each day? I'm expecting:
Date Hour Weight New_Col
2019-01-01 8 1 8
2019-01-01 16 2 8
2019-01-01 24 6 8
2019-01-02 8 10 16
2019-01-02 16 4 16
2019-01-02 24 12 16
2019-01-03 8 10 24
2019-01-03 16 6 24
2019-01-03 24 5 24

Use GroupBy.transform with DataFrameGroupBy.idxmin, but first set the index to the Hour column, so that idxmin returns the Hour of the minimal Weight in each group:
df['New'] = df.set_index('Hour').groupby('Date')['Weight'].transform('idxmin').values
print (df)
Date Hour Weight New_Col New
0 2019-01-01 8 1 8 8
1 2019-01-01 16 2 8 8
2 2019-01-01 24 6 8 8
3 2019-01-02 8 10 16 16
4 2019-01-02 16 4 16 16
5 2019-01-02 24 12 16 16
6 2019-01-03 8 10 24 24
7 2019-01-03 16 6 24 24
8 2019-01-03 24 5 24 24
Alternative solution:
df['New'] = df['Date'].map(df.set_index('Hour').groupby('Date')['Weight'].idxmin())
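As a self-contained check, here is a minimal sketch that reconstructs the frame from the question and applies the transform('idxmin') approach:

```python
import pandas as pd

# Reconstruction of the example frame from the question.
df = pd.DataFrame({
    'Date': ['2019-01-01'] * 3 + ['2019-01-02'] * 3 + ['2019-01-03'] * 3,
    'Hour': [8, 16, 24] * 3,
    'Weight': [1, 2, 6, 10, 4, 12, 10, 6, 5],
})

# Index by Hour so idxmin returns the Hour label of each day's minimal Weight;
# .values discards the index so the result aligns positionally.
df['New_Col'] = df.set_index('Hour').groupby('Date')['Weight'].transform('idxmin').values

print(df)
```

Each day's minimum Weight (1, 4, 5) occurs at Hours 8, 16 and 24 respectively, which is what New_Col repeats within each day.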

Related

How can I get the difference of the values of the rows in a dataframe (per customer code)?

My dataset has Customer_Code, As_Of_Date and 24 products. The products have a value of 0-1. I sorted the data set by customer code and as_of_date. For each product, I want to subtract the previous row's value from the next row's. The important thing is that each customer must be handled separately, according to their as_of_date.
I tried
df2.set_index('Customer_Code').diff()
and
df2.set_index('As_Of_Date').diff()
and
for i in new["Customer_Code"].unique():
    df14 = df12.set_index('As_Of_Date').diff()
but none of these are correct. My code works for the first customer, but not for the second.
How can I do this?
You didn't share any data, so I made up something you can use; your expected outcome is also missing. For future reference, please do not share data as images. Let's say you have this data:
id date product
0 12 2008-01-01 1
1 12 2008-01-01 2
2 12 2008-01-01 1
3 12 2008-01-02 4
4 12 2008-01-02 5
5 34 2009-01-01 6
6 34 2009-01-01 7
7 34 2009-01-01 84
8 34 2009-01-02 4
9 34 2009-01-02 3
10 34 2009-01-02 3
11 34 2009-01-03 5
12 34 2009-01-03 6
13 34 2009-01-03 8
As I understand it, you want to subtract the previous row's product value from the current row, grouped by id and date (adapt the grouping if yours differs). You can do this:
mask = df.duplicated(['id', 'date'])
df['product_diff'] = np.where(mask, df['product'] - df['product'].shift(1), np.nan)
which returns:
id date product product_diff
0 12 2008-01-01 1 NaN
1 12 2008-01-01 2 1.0
2 12 2008-01-01 1 -1.0
3 12 2008-01-02 4 NaN
4 12 2008-01-02 5 1.0
5 34 2009-01-01 6 NaN
6 34 2009-01-01 7 1.0
7 34 2009-01-01 84 77.0
8 34 2009-01-02 4 NaN
9 34 2009-01-02 3 -1.0
10 34 2009-01-02 3 0.0
11 34 2009-01-03 5 NaN
12 34 2009-01-03 6 1.0
13 34 2009-01-03 8 2.0
or if you want it the other way around:
mask = df.duplicated(['id', 'date'])
df['product_diff'] = np.where(mask, df['product'] - df['product'].shift(-1), np.nan)
which gives:
id date product product_diff
0 12 2008-01-01 1 NaN
1 12 2008-01-01 2 1.0
2 12 2008-01-01 1 -3.0
3 12 2008-01-02 4 NaN
4 12 2008-01-02 5 -1.0
5 34 2009-01-01 6 NaN
6 34 2009-01-01 7 -77.0
7 34 2009-01-01 84 80.0
8 34 2009-01-02 4 NaN
9 34 2009-01-02 3 0.0
10 34 2009-01-02 3 -2.0
11 34 2009-01-03 5 NaN
12 34 2009-01-03 6 -2.0
13 34 2009-01-03 8 NaN
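The duplicated-mask trick above can also be written with GroupBy.diff, which restarts the difference at every (id, date) boundary automatically; a small sketch on a subset of the made-up data:

```python
import pandas as pd

# Subset of the made-up data from the answer above.
df = pd.DataFrame({
    'id':      [12, 12, 12, 12, 34, 34, 34],
    'date':    ['2008-01-01'] * 3 + ['2008-01-02'] + ['2009-01-01'] * 3,
    'product': [1, 2, 1, 4, 6, 7, 84],
})

# diff() within each (id, date) group: the first row of each group is NaN,
# later rows hold the change from the previous row of the same group.
df['product_diff'] = df.groupby(['id', 'date'])['product'].diff()

print(df)
```

This yields the same values as the np.where/duplicated version, without needing the data to be pre-sorted into a particular mask.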

Extract year from pandas datetime column as numeric value with NaN for empty cells instead of NaT

I want to extract the year from a datetime column into a new 'yyyy' column, AND I want the missing values (NaT) to be displayed as NaN, so the datetime dtype of the new column needs to change, I guess, but that's where I'm stuck.
Initial df:
Date ID
0 2016-01-01 12
1 2015-01-01 96
2 NaT 20
3 2018-01-01 73
4 2017-01-01 84
5 NaT 26
6 2013-01-01 87
7 2016-01-01 64
8 2019-01-01 11
9 2014-01-01 34
Desired df:
Date ID yyyy
0 2016-01-01 12 2016
1 2015-01-01 96 2015
2 NaT 20 NaN
3 2018-01-01 73 2018
4 2017-01-01 84 2017
5 NaT 26 NaN
6 2013-01-01 87 2013
7 2016-01-01 64 2016
8 2019-01-01 11 2019
9 2014-01-01 34 2014
Code:
import pandas as pd
import numpy as np

# example df
df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"Date": ['2016-01-01', '2015-01-01', np.nan, '2018-01-01', '2017-01-01', np.nan, '2013-01-01', '2016-01-01', '2019-01-01', '2014-01-01']})

df.ID = pd.to_numeric(df.ID)

df.Date = pd.to_datetime(df.Date)
print(df)
#extraction of year from date
df['yyyy'] = pd.to_datetime(df.Date).dt.strftime('%Y')

#Try to set NaT to NaN or datetime to numeric, PROBLEM: empty cells keep 'NaT'
df.loc[(df['yyyy'].isna()), 'yyyy'] = np.nan

 #(try1)
df.yyyy = df.Date.astype(float)
 #(try2)
df.yyyy = pd.to_numeric(df.Date)
 #(try3)
print(df)
Use Series.dt.year, converting to the nullable integer dtype Int64:
df.Date = pd.to_datetime(df.Date)
df['yyyy'] = df.Date.dt.year.astype('Int64')
print (df)
ID Date yyyy
0 12 2016-01-01 2016
1 96 2015-01-01 2015
2 20 NaT <NA>
3 73 2018-01-01 2018
4 84 2017-01-01 2017
5 26 NaT <NA>
6 87 2013-01-01 2013
7 64 2016-01-01 2016
8 11 2019-01-01 2019
9 34 2014-01-01 2014
Without converting floats to integers:
df['yyyy'] = df.Date.dt.year
print (df)
ID Date yyyy
0 12 2016-01-01 2016.0
1 96 2015-01-01 2015.0
2 20 NaT NaN
3 73 2018-01-01 2018.0
4 84 2017-01-01 2017.0
5 26 NaT NaN
6 87 2013-01-01 2013.0
7 64 2016-01-01 2016.0
8 11 2019-01-01 2019.0
9 34 2014-01-01 2014.0
Your solution converts NaT to the string 'NaT', so it is possible to use replace.
(In recent versions of pandas the replace is no longer necessary; strftime handles NaT correctly.)
df['yyyy'] = pd.to_datetime(df.Date).dt.strftime('%Y').replace('NaT', np.nan)
Isn't it simply:
df['yyyy'] = df.Date.dt.year
Output:
Date ID yyyy
0 2016-01-01 12 2016.0
1 2015-01-01 96 2015.0
2 NaT 20 NaN
3 2018-01-01 73 2018.0
4 2017-01-01 84 2017.0
5 NaT 26 NaN
6 2013-01-01 87 2013.0
7 2016-01-01 64 2016.0
8 2019-01-01 11 2019.0
9 2014-01-01 34 2014.0
For pandas 0.24.2+, you can use Int64 data type for nullable integers:
df['yyyy'] = df.Date.dt.year.astype('Int64')
which gives:
Date ID yyyy
0 2016-01-01 12 2016
1 2015-01-01 96 2015
2 NaT 20 <NA>
3 2018-01-01 73 2018
4 2017-01-01 84 2017
5 NaT 26 <NA>
6 2013-01-01 87 2013
7 2016-01-01 64 2016
8 2019-01-01 11 2019
9 2014-01-01 34 2014

Convert 'Hour of Year' into a datetime

I have time-series data starting from Jan 1st with columns including 'Month', 'Hour of Day' and 'Hour of Year'. I would like to create a datetime column which expresses all this information in the format MM/DD/YYYY HH/MM.
I have tried converting the 'Hour of Year' column to datetime and timedelta objects, however both times I received an error saying that hour must be between 0 and 23. As I have data for the whole year, my column ranges from 1 to 8760.
I expect to get data that looks like this : 1/1/2018 1:00.
Here is a sample of the dataset I am working with:
Month Hour_of_Day Hour_of_Year
1 1 1
1 2 2
1 3 3
1 4 4
1 5 5
1 6 6
1 7 7
1 8 8
1 9 9
1 10 10
1 11 11
1 12 12
1 13 13
1 14 14
1 15 15
1 16 16
1 17 17
1 18 18
1 19 19
1 20 20
1 21 21
1 22 22
1 23 23
1 24 24
1 1 25
1 2 26
1 3 27
1 4 28
1 5 29
1 6 30
1 7 31
1 8 32
1 9 33
1 10 34
1 11 35
1 12 36
1 13 37
1 14 38
1 15 39
1 16 40
1 17 41
1 18 42
1 19 43
1 20 44
1 21 45
1 22 46
1 23 47
1 24 48
1 1 49
1 2 50
1 3 51
1 4 52
1 5 53
1 6 54
1 7 55
1 8 56
1 9 57
1 10 58
1 11 59
1 12 60
1 13 61
pd.to_timedelta is your friend here:
df['ts'] = pd.Timestamp('2018-01-01') + pd.to_timedelta(df.Hour_of_Year, unit='h')
gives:
Month Hour_of_Day Hour_of_Year ts
0 1 1 1 2018-01-01 01:00:00
1 1 2 2 2018-01-01 02:00:00
2 1 3 3 2018-01-01 03:00:00
3 1 4 4 2018-01-01 04:00:00
4 1 5 5 2018-01-01 05:00:00
5 1 6 6 2018-01-01 06:00:00
6 1 7 7 2018-01-01 07:00:00
7 1 8 8 2018-01-01 08:00:00
8 1 9 9 2018-01-01 09:00:00
9 1 10 10 2018-01-01 10:00:00
10 1 11 11 2018-01-01 11:00:00
11 1 12 12 2018-01-01 12:00:00
12 1 13 13 2018-01-01 13:00:00
13 1 14 14 2018-01-01 14:00:00
14 1 15 15 2018-01-01 15:00:00
15 1 16 16 2018-01-01 16:00:00
16 1 17 17 2018-01-01 17:00:00
17 1 18 18 2018-01-01 18:00:00
18 1 19 19 2018-01-01 19:00:00
19 1 20 20 2018-01-01 20:00:00
20 1 21 21 2018-01-01 21:00:00
21 1 22 22 2018-01-01 22:00:00
22 1 23 23 2018-01-01 23:00:00
23 1 24 24 2018-01-02 00:00:00
24 1 1 25 2018-01-02 01:00:00
25 1 2 26 2018-01-02 02:00:00
26 1 3 27 2018-01-02 03:00:00
27 1 4 28 2018-01-02 04:00:00
28 1 5 29 2018-01-02 05:00:00
29 1 6 30 2018-01-02 06:00:00
.. ... ... ... ...
31 1 8 32 2018-01-02 08:00:00
32 1 9 33 2018-01-02 09:00:00
33 1 10 34 2018-01-02 10:00:00
34 1 11 35 2018-01-02 11:00:00
35 1 12 36 2018-01-02 12:00:00
36 1 13 37 2018-01-02 13:00:00
37 1 14 38 2018-01-02 14:00:00
38 1 15 39 2018-01-02 15:00:00
39 1 16 40 2018-01-02 16:00:00
40 1 17 41 2018-01-02 17:00:00
41 1 18 42 2018-01-02 18:00:00
42 1 19 43 2018-01-02 19:00:00
43 1 20 44 2018-01-02 20:00:00
44 1 21 45 2018-01-02 21:00:00
45 1 22 46 2018-01-02 22:00:00
46 1 23 47 2018-01-02 23:00:00
47 1 24 48 2018-01-03 00:00:00
48 1 1 49 2018-01-03 01:00:00
49 1 2 50 2018-01-03 02:00:00
50 1 3 51 2018-01-03 03:00:00
51 1 4 52 2018-01-03 04:00:00
52 1 5 53 2018-01-03 05:00:00
53 1 6 54 2018-01-03 06:00:00
54 1 7 55 2018-01-03 07:00:00
55 1 8 56 2018-01-03 08:00:00
56 1 9 57 2018-01-03 09:00:00
57 1 10 58 2018-01-03 10:00:00
58 1 11 59 2018-01-03 11:00:00
59 1 12 60 2018-01-03 12:00:00
60 1 13 61 2018-01-03 13:00:00
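As a quick sanity check of the timedelta approach, a minimal sketch with a few hand-picked Hour_of_Year values (note how hour 24 rolls over to midnight of January 2nd, and hour 8760 lands on January 1st of the next year):

```python
import pandas as pd

df = pd.DataFrame({'Hour_of_Year': [1, 2, 24, 25, 8760]})

# Anchor at midnight on January 1st and add the hour count as a timedelta.
df['ts'] = pd.Timestamp('2018-01-01') + pd.to_timedelta(df['Hour_of_Year'], unit='h')

print(df)
```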

Combine and Manipulate two columns as Date using PANDAS

I have an csv file and reading it through pandas:
cols = ['DATE(GMT)', 'TIME(GMT)', 'DATASET']
df=pd.read_csv('datasets.csv', usecols=cols)
csv file content are as follows:
DATE(GMT) TIME(GMT) DATASET
05-01-2018 0 10
05-01-2018 1 15
05-01-2018 2 21
05-01-2018 3 9
05-01-2018 4 25
05-01-2018 5 7
... ... ...
05-02-2018 14 65
Now I need to combine 'DATE(GMT)','TIME(GMT)' as a single DateTime column. So that I can have only two columns i.e. DATETIME and DATASET
You can add the parse_dates parameter to read_csv to get a datetime column:
df = pd.read_csv('datasets.csv', usecols=cols, parse_dates=['DATE(GMT)'])
print (df.dtypes)
DATE(GMT) datetime64[ns]
TIME(GMT) int64
DATASET int64
dtype: object
And then add the time column, converted with to_timedelta (the column is already numeric, so no string cast is needed):
df['DATE(GMT)'] += pd.to_timedelta(df.pop('TIME(GMT)'), unit='h')
print (df)
DATE(GMT) DATASET
0 2018-05-01 00:00:00 10
1 2018-05-01 01:00:00 15
2 2018-05-01 02:00:00 21
3 2018-05-01 03:00:00 9
4 2018-05-01 04:00:00 25
5 2018-05-01 05:00:00 7
6 2018-05-02 14:00:00 65
EDIT:
There is a problem if some data are non-numeric:
print (df)
DATE(GMT) TIME(GMT) DATASET
0 05-01-2018 0 10
1 05-01-2018 1 15
2 05-01-2018 2 21
3 05-01-2018 3 9
4 05-01-2018 4 25
5 05-01-2018 s 7
6 05-02-2018 a 65
You can find the offending rows:
print (df[pd.to_numeric(df['TIME(GMT)'], errors='coerce').isnull()])
DATE(GMT) TIME(GMT) DATASET
5 05-01-2018 s 7
6 05-02-2018 a 65
And then, if needed, replace them (along with any missing values) by 0:
df['TIME(GMT)'] = pd.to_numeric(df['TIME(GMT)'], errors='coerce').fillna(0)
print (df)
DATE(GMT) TIME(GMT) DATASET
0 05-01-2018 0.0 10
1 05-01-2018 1.0 15
2 05-01-2018 2.0 21
3 05-01-2018 3.0 9
4 05-01-2018 4.0 25
5 05-01-2018 0.0 7
6 05-02-2018 0.0 65
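A self-contained sketch of the whole pipeline, using an inline CSV in place of datasets.csv (hypothetical data) and coercing the hour column to numeric in one pass:

```python
import io
import pandas as pd

# Inline stand-in for datasets.csv (hypothetical data).
csv = io.StringIO(
    "DATE(GMT),TIME(GMT),DATASET\n"
    "05-01-2018,0,10\n"
    "05-01-2018,1,15\n"
    "05-02-2018,14,65\n"
)

df = pd.read_csv(csv, parse_dates=['DATE(GMT)'])

# Coerce the hour column to numbers (bad values become NaN, then 0),
# and add it to the parsed date as a timedelta.
hours = pd.to_numeric(df.pop('TIME(GMT)'), errors='coerce').fillna(0)
df['DATETIME'] = df.pop('DATE(GMT)') + pd.to_timedelta(hours, unit='h')

print(df[['DATETIME', 'DATASET']])
```

This leaves exactly the two requested columns, DATETIME and DATASET.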

Apply different variables across different date range in Pandas

I have a dataframe with dates and values from column A to H. Also, I have some fixed variables X1=5, X2=6, Y1=7,Y2=8, Z1=9
Date A B C D E F G H
0 2018-01-02 00:00:00 7161 7205 -44 54920 73 7 5 47073
1 2018-01-03 00:00:00 7101 7147 -46 54710 73 6 5 46570
2 2018-01-04 00:00:00 7146 7189 -43 54730 70 7 5 46933
3 2018-01-05 00:00:00 7079 7121 -43 54720 70 6 5 46404
4 2018-01-08 00:00:00 7080 7125 -45 54280 70 6 5 46355
5 2018-01-09 00:00:00 7060 7102 -43 54440 70 6 5 46319
6 2018-01-10 00:00:00 7113 7153 -40 54510 70 7 5 46837
7 2018-01-11 00:00:00 7103 7141 -38 54690 70 7 5 46728
8 2018-01-12 00:00:00 7074 7110 -36 54310 65 6 5 46357
9 2018-01-15 00:00:00 7181 7210 -29 54320 65 6 5 46792
10 2018-01-16 00:00:00 7036 7078 -42 54420 65 6 5 45709
11 2018-01-17 00:00:00 6994 7034 -40 53690 65 6 5 45416
12 2018-01-18 00:00:00 7032 7076 -44 53590 65 6 5 45705
13 2018-01-19 00:00:00 6999 7041 -42 53560 65 6 5 45331
14 2018-01-22 00:00:00 7025 7068 -43 53500 65 6 5 45455
15 2018-01-23 00:00:00 6883 6923 -41 53490 65 6 5 44470
16 2018-01-24 00:00:00 7111 7150 -39 52630 65 6 5 45866
17 2018-01-25 00:00:00 7101 7138 -37 53470 65 6 5 45663
18 2018-01-26 00:00:00 7043 7085 -43 53380 65 6 5 45087
19 2018-01-29 00:00:00 7041 7085 -44 53370 65 6 5 44958
20 2018-01-30 00:00:00 7010 7050 -41 53040 65 6 5 44790
21 2018-01-31 00:00:00 7079 7118 -39 52880 65 6 5 45248
What I want to do is add some simple column-wise calculations to this dataframe, using the values in columns A to H as well as those fixed variables.
The tricky part is that I need to apply different variables to different date ranges.
For example, during 2018-01-01 to 2018-01-10, I wanted to calculate a new column I where the value equals to: (A+B+C)*X1*Y1+Z1;
While during 2018-01-11 to 2018-01-25, the calculation needs to take (A+B+C)*X2*Y1+Z1. Similar to Y1 and Y2 applied to each of their date ranges.
I know this can calculate/create a new column I:
df['I'] = (df['A'] + df['B'] + df['C'])*X1*Y1 + Z1
but I'm not sure how to get the flexibility to use different variables over different date ranges.
You can use np.select to define a value based on a condition:
cond = [df.Date.between('2018-01-01','2018-01-10'), df.Date.between('2018-01-11','2018-01-25')]
values = [(df['A']+df['B']+df['C'])*X1*Y1+Z1, (df['A']+df['B']+df['C'])*X2*Y1+Z1]
# select values depending on the condition
df['I'] = np.select(cond, values)
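A runnable sketch with the fixed variables from the question and a trimmed three-row frame; passing default=np.nan makes rows outside both date ranges explicit instead of silently becoming 0:

```python
import pandas as pd
import numpy as np

# Fixed variables from the question.
X1, X2, Y1, Z1 = 5, 6, 7, 9

# Trimmed, hypothetical frame: one row in each date range, one outside both.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2018-01-02', '2018-01-12', '2018-01-30']),
    'A': [10, 20, 30], 'B': [1, 2, 3], 'C': [0, 0, 0],
})

cond = [df.Date.between('2018-01-01', '2018-01-10'),
        df.Date.between('2018-01-11', '2018-01-25')]
values = [(df['A'] + df['B'] + df['C']) * X1 * Y1 + Z1,
          (df['A'] + df['B'] + df['C']) * X2 * Y1 + Z1]

# Rows matching neither condition fall through to the default.
df['I'] = np.select(cond, values, default=np.nan)

print(df)
```

The first row uses X1 ((10+1+0)*5*7+9 = 394), the second uses X2 ((20+2+0)*6*7+9 = 933), and the last row, outside both ranges, stays NaN.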
