I have a data frame that looks like this:
date score type
2020-01-01 1 a
2020-04-01 0 a
2020-01-01 3 a
2020-04-01 2 a
2020-11-01 3 b
2019-12-01 4 b
2020-01-01 4 b
If I want to rescale the column score from 60 to 100 for each type and each date, I can easily do as follows:
df['score_rescaled'] = df.groupby(['date', 'type'])['score'].apply(lambda x: (40*(x-min(x)))/(max(x)-min(x)) + 60)
However, I would like to rescale the column score from 60 to 100 for each type and each date, using not only the values for that single date but the values for every date up to and including it:
so, for instance, for date 2020-01-01 I want to rescale the values at 2020-01-01 using all the scores from months 2019-12-01 and 2020-01-01,
and for date 2020-04-01 I want to rescale the scores in 2020-04-01 using the scores from months 2019-12-01, 2020-01-01 and 2020-04-01, etc.
How can I do this?
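A minimal sketch of one possible approach (the expanding loop below is my own construction, not from the thread): group by type and, for each date, rescale that date's scores with the min/max taken over all scores up to and including that date.
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-04-01', '2020-01-01',
                            '2020-04-01', '2020-11-01', '2019-12-01',
                            '2020-01-01']),
    'score': [1, 0, 3, 2, 3, 4, 4],
    'type': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
})

def rescale_expanding(g):
    # within one type, rescale each date's scores using the min/max
    # over all scores up to and including that date
    parts = []
    for d in sorted(g['date'].unique()):
        hist = g.loc[g['date'] <= d, 'score']
        lo, hi = hist.min(), hist.max()
        cur = g.loc[g['date'] == d, 'score']
        # as with the single-date version, a date where lo == hi yields NaN
        parts.append(40 * (cur - lo) / (hi - lo) + 60)
    return pd.concat(parts)

df['score_rescaled'] = (df.groupby('type', group_keys=False)
                          .apply(rescale_expanding))
print(df)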
I have a column as below:
date
2019-05-11
2019-11-11
2020-03-01
2021-02-18
How can I create a new column in the same format, but holding the start date of each quarter?
Expected output
date        quarter
2019-05-11  2019-04-01
2019-11-11  2019-10-01
2020-03-01  2020-01-01
2021-02-18  2021-01-01
Thanks
You can use pandas.PeriodIndex:
df['date'] = pd.to_datetime(df['date'])
df['quarter'] = pd.PeriodIndex(df['date'].dt.to_period('Q'), freq='Q').to_timestamp()
# Output :
print(df)
date quarter
0 2019-05-11 2019-04-01
1 2019-11-11 2019-10-01
2 2020-03-01 2020-01-01
3 2021-02-18 2021-01-01
Steps:
Convert your date column to datetime with pd.to_datetime if it is not already.
Convert the dates to quarterly periods with dt.to_period or with PeriodIndex.
Convert the resulting quarter periods to timestamps with to_timestamp to get the start date of each quarter.
Source Code
import pandas as pd
df = pd.DataFrame({"Dates": pd.date_range("01-01-2022", periods=30, freq="24d")})
df["Quarters"] = df["Dates"].dt.to_period("Q").dt.to_timestamp()
print(df.sample(10))
OUTPUT
Dates Quarters
19 2023-04-02 2023-04-01
29 2023-11-28 2023-10-01
26 2023-09-17 2023-07-01
1 2022-01-25 2022-01-01
25 2023-08-24 2023-07-01
22 2023-06-13 2023-04-01
6 2022-05-25 2022-04-01
18 2023-03-09 2023-01-01
12 2022-10-16 2022-10-01
15 2022-12-27 2022-10-01
In this case, a quarter always falls within the same year and starts on day 1, so all there is to calculate is the month.
Since a quarter is 3 months (12 / 4), the quarter start months are 1, 4, 7 and 10.
You can use integer division (//) to achieve this:
n = month
quarter = ( (n-1) // 3 ) * 3 + 1
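As a minimal runnable sketch of that arithmetic (assembling the result via pd.to_datetime on year/month/day parts is my own choice, not from the answer):
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2019-05-11', '2019-11-11',
                                           '2020-03-01', '2021-02-18'])})
month = df['date'].dt.month
quarter_month = (month - 1) // 3 * 3 + 1  # maps 1-12 to 1, 4, 7 or 10
# assemble the quarter start date from its year/month/day parts
df['quarter'] = pd.to_datetime({'year': df['date'].dt.year,
                                'month': quarter_month,
                                'day': 1})
print(df)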
I have two dataframes, df1 and df2. I need to construct an output that finds the date in df2 nearest to each date in df1, while simultaneously matching the ID value in both df1 and df2. The df (Output desired) shown below illustrates what I have tried to explain above!
df1:
ID Date
1 2020-01-01
2 2020-01-03
df2:
ID Date
11 2020-01-11
4 2020-02-03
5 2020-04-02
6 2020-01-05
1 2021-01-13
1 2021-03-03
1 2020-01-30
2 2020-03-31
2 2021-04-01
2 2021-02-02
df (Output desired)
ID Date Closest Date
1 2020-01-01 2020-01-30
2 2020-01-03 2020-03-31
Here's one way to achieve it – assuming that the Date columns' dtype is datetime: First,
df3 = df1[df1.ID.isin(df2.ID)]
will give you
ID Date
0 1 2020-01-01
1 2 2020-01-03
Then
df3['Closest_date'] = df3.apply(lambda row: min(df2[df2.ID.eq(row.ID)].Date,
                                                key=lambda x: abs(x - row.Date)),
                                axis=1)
gets the min of df2.Date, where
df2[df2.ID.eq(row.ID)].Date selects the rows with a matching ID, and
key=lambda x: abs(x - row.Date) tells min to compare by distance to row.Date.
This has to be done row by row, hence axis=1.
Output:
ID Date Closest_date
0 1 2020-01-01 2020-01-30
1 2 2020-01-03 2020-03-31
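A design note, not from the original answer: for larger frames, pd.merge_asof with by='ID' and direction='nearest' does the same nearest-date matching in one vectorized call. A sketch, assuming both Date columns are already datetime (the Closest_date helper column is my own addition, so the matched date survives the merge):
import pandas as pd

# merge_asof requires both frames to be sorted on the merge key
out = pd.merge_asof(df1.sort_values('Date'),
                    df2.assign(Closest_date=df2['Date']).sort_values('Date'),
                    on='Date', by='ID', direction='nearest')
print(out)
Unlike the isin filter above, IDs present in df1 but absent from df2 are kept with NaT rather than dropped.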
I have a dataframe that looks like this:
2020-01-01 10
2020-02-01 5
2020-05-01 2
2020-08-01 7
2020-01-01 00:00:00 0
2020-02-01 00:00:00 0
2020-03-01 00:00:00 0
2020-04-01 00:00:00 0
I want to remove the time from the index and combine rows where the dates are the same; the end result will look like:
2020-01-01 10
2020-02-01 5
2020-03-01 0
2020-04-01 0
2020-05-01 2
2020-06-01 0
2020-07-01 0
2020-08-01 7
etc, etc
Change the index data type and filter out duplicates with .duplicated:
df.index = pd.to_datetime(df.index)
df = df[~df.index.duplicated(keep='first')]
df
Out[1]:
1
0
2020-01-01 10
2020-02-01 5
2020-05-01 2
2020-08-01 7
2020-03-01 0
2020-04-01 0
If you want to sum them together rather than drop the duplicates, then use:
df.index = pd.to_datetime(df.index)
df = df.sum(level=0)
df
Out[2]:
1
0
2020-01-01 10
2020-02-01 5
2020-05-01 2
2020-08-01 7
2020-03-01 0
2020-04-01 0
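A note on newer pandas (my addition, not part of the original answer): sum(level=0) has since been deprecated and removed; the equivalent groupby form, which also keeps the original row order, is:
df = df.groupby(level=0, sort=False).sum()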
If the index content is in string format, you can simply slice:
df.reset_index(inplace=True)  # assuming the column name is 'date'
df["date"] = df["date"].str[:10]  # keep only the date part, dropping the time
df.set_index("date", inplace=True)
If it is in datetime format:
df.reset_index(inplace=True)
df['date'] = pd.to_datetime(df['date']).dt.date
df.set_index("date", inplace=True)
Given this data (reflecting your own) with the string dates and int data in columns (not as index):
dates = ['2020-01-01', '2020-02-01', '2020-05-01', '2020-08-01',
'2020-01-01 00:00:00', '2020-02-01 00:00:00', '2020-03-01 00:00:00',
'2020-04-01 00:00:00']
data = [10,5,2,7,0,0,0,0]
df = pd.DataFrame({'dates':dates, 'data':data})
You can do the following:
df['dates'] = pd.to_datetime(df['dates']).dt.date #convert to datetime and get the date
df = df.groupby('dates').sum().sort_index() # groupby and sort index
Giving:
data
dates
2020-01-01 10
2020-02-01 5
2020-03-01 0
2020-04-01 0
2020-05-01 2
2020-08-01 7
You can replace .sum() with your favorite aggregation method. Also, if you want to impute the missing dates (as in your expected output), you can do:
months = pd.date_range(min(df.index), max(df.index), freq='MS').date
df = df.reindex(months).fillna(0)
Giving:
data
dates
2020-01-01 10.0
2020-02-01 5.0
2020-03-01 0.0
2020-04-01 0.0
2020-05-01 2.0
2020-06-01 0.0
2020-07-01 0.0
2020-08-01 7.0
I have a dataframe with a datetime64[ns] column in the format below, so I have data on an hourly basis:
Datum Values
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-02-28 00:00:00 5
2020-03-01 00:00:00 4
and another table with closing days, also a datetime64[ns] column, but in a day-only format:
Dates
2020-02-28
2020-02-29
....
How can I delete from the first dataframe df all days which occur in the second dataframe Dates? So that df becomes:
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-03-01 00:00:00 4
Use Series.dt.floor to set the times to 0, which makes it possible to filter by Series.isin with an inverted mask in boolean indexing:
df['Datum'] = pd.to_datetime(df['Datum'])
df1['Dates'] = pd.to_datetime(df1['Dates'])
df = df[~df['Datum'].dt.floor('d').isin(df1['Dates'])]
print (df)
Datum Values
0 2020-01-01 00:00:00 1
1 2020-01-01 01:00:00 10
3 2020-03-01 00:00:00 4
EDIT: For a flag column, convert the mask to integers with Series.view or Series.astype:
df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).view('i1')
#alternative
#df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).astype('int')
print (df)
Datum Values flag
0 2020-01-01 00:00:00 1 0
1 2020-01-01 01:00:00 10 0
2 2020-02-28 00:00:00 5 1
3 2020-03-01 00:00:00 4 0
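A small equivalent variant (my addition, not from the original answer): Series.dt.normalize() also zeroes out the times, so the flag can be written as:
df['flag'] = df['Datum'].dt.normalize().isin(df1['Dates']).astype('int')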
Putting your added comment into consideration:
Build one regex string from the dates in df1:
c = "|".join(df1.Dates.values)
Coerce Datum to datetime:
df['Datum'] = pd.to_datetime(df['Datum'])
Extract Datum as a string column Dates:
df.set_index(df['Datum'], inplace=True)
df['Dates'] = df.index.date.astype(str)
Boolean-select the dates present in both frames:
m = df.Dates.str.contains(c)
Mark inclusive dates as 0 and exclusive as 1 (this uses import numpy as np):
df['drop'] = np.where(m, 0, 1)
Reset the index and drop the helper column:
df.reset_index(drop=True).drop(columns=['Dates'])
I have a .csv file with some data. There is only one column in this file, which contains timestamps. I need to organize the data into bins of 30 minutes. This is what my data looks like:
Timestamp
04/01/2019 11:03
05/01/2019 16:30
06/01/2019 13:19
08/01/2019 13:53
09/01/2019 13:43
So in this case, the last two data points would be grouped together in the bin that includes all the data from 13:30 to 14:00.
This is what I have already tried
df = pd.read_csv('book.csv')
df['Timestamp'] = pd.to_datetime(df.Timestamp)
df.groupby(pd.Grouper(key='Timestamp',
freq='30min')).count().dropna()
I am getting around 7000 rows showing all hours for all days with the count next to them, like this:
2019-09-01 03:00:00 0
2019-09-01 03:30:00 0
2019-09-01 04:00:00 0
...
I want to create bins for only the hours that I have in my dataset. I want to see something like this:
Time Count
11:00:00 1
13:00:00 1
13:30:00 2 (we have two data points in this interval)
16:30:00 1
Thanks in advance!
Use groupby.size:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.Timestamp.dt.floor('30min').dt.time.to_frame()\
.groupby('Timestamp').size()\
.reset_index(name='Count')
Or, as per the suggestion by jpp (value_counts orders by count, so sort by time afterwards to match the output below):
df = df.Timestamp.dt.floor('30min').dt.time.value_counts().sort_index().reset_index(name='Count')
print(df)
Timestamp Count
0 11:00:00 1
1 13:00:00 1
2 13:30:00 2
3 16:30:00 1
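A side note on the original attempt (my own explanation, not from the answer): pd.Grouper(freq='30min') creates a bin for every half-hour between the first and last timestamp, date part included, which is why roughly 7000 mostly empty rows appeared; flooring and taking dt.time instead pools the same half-hour across all days. If you did want to stay with Grouper and just hide the empty bins:
# df here is the raw frame with the datetime Timestamp column
counts = df.groupby(pd.Grouper(key='Timestamp', freq='30min')).size()
counts = counts[counts > 0]  # keep only non-empty half-hour bins
print(counts)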