How can I categorize a timestamp column in a DataFrame into a day/night column based on the time of day?
I am trying to do so, but I am unable to produce a new column with the same number of entries.
d_call["time"] = d_call["timestamp"].apply(lambda x: x.time())
d_call["time"].head(1)
0 17:10:52
Name: time, dtype: object
def day_night(name):
    for i in name:
        if i.hour > 17:
            return "night"
        else:
            return "day"
day_night(d_call["time"])
'day'
d_call["Day / Night"]= d_call["time"].apply(lambda x: day_night(x))
I want to get the entire series for the new column, but I am only getting the value for the first index.
You can pull the hour out of the timestamp and assign your category based on it; you can also use other conditions to cover ranges of times. As for why you only get a single value: your day_night loops over the whole Series and returns on the very first element, whereas the function passed to apply should handle one element at a time.
Consider this df:
0 2018-06-18 15:05:52.246
1 2018-05-24 21:44:07.903
2 2018-06-06 21:00:19.635
3 2018-05-24 21:44:37.883
4 2018-05-30 11:19:36.546
5 2018-05-25 11:16:07.969
6 2018-05-24 21:43:35.077
7 2018-06-07 18:39:00.258
Name: modified_at, dtype: datetime64[ns]
df['day/night'] = df.modified_at.apply(lambda x: 'night' if int(x.strftime('%H')) > 19 else 'day')
Out:
0 day
1 night
2 night
3 night
4 day
5 day
6 night
7 day
Name: modified_at, dtype: object
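A vectorized sketch of the same rule (assuming modified_at is already datetime64, as above): compare the hour attribute directly instead of formatting and parsing a string per row.
import numpy as np
# dt.hour gives each timestamp's hour; np.where picks the label element-wise.
df['day/night'] = np.where(df['modified_at'].dt.hour > 19, 'night', 'day')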
I have the code below. My questions:
Why is it assigning week 1 to '2014-12-29' and '2014-1-1'? Why is it not assigning week 53 to '2014-12-29'?
How could I assign a week number that is continuously increasing? I want '2014-12-29' and '2015-1-1' to have week 53 and '2015-1-15' to have week 55, etc.
x = pd.DataFrame(data=['2014-1-1', '2014-12-29', '2015-1-1', '2015-1-15'], columns=['date'])
x['week_number'] = pd.DatetimeIndex(x['date']).week
As far as why the week number is 1 for 12/29/2014 -- see the question I linked to in the comments. For the second part of your question:
January 1, 2014 was a Wednesday. We can take the minimum date of your date column, get its day number, and subtract that from the cumulative day differences:
Solution
# x["date"] = pd.to_datetime(x["date"]) # if not already a datetime column
min_date = x["date"].min().day + 1 # day number of the start date, plus 1
x["weeks_from_start"] = ((x["date"].diff().dt.days.cumsum() - min_date) // 7 + 1).fillna(1).astype(int)
Output:
date weeks_from_start
0 2014-01-01 1
1 2014-12-29 52
2 2015-01-01 52
3 2015-01-15 54
Step by step
The first step is to convert the date column to the datetime type, if you haven't already:
In [3]: x.dtypes
Out[3]:
date object
dtype: object
In [4]: x["date"] = pd.to_datetime(x["date"])
In [5]: x
Out[5]:
date
0 2014-01-01
1 2014-12-29
2 2015-01-01
3 2015-01-15
In [6]: x.dtypes
Out[6]:
date datetime64[ns]
dtype: object
Next, we need to find the minimum of your date column and take its day number plus 1 as the offset we'll subtract below:
In [7]: x["date"].min().day + 1
Out[7]: 2
Next, use the built-in .diff() function to take the differences of adjacent rows:
In [8]: x["date"].diff()
Out[8]:
0 NaT
1 362 days
2 3 days
3 14 days
Name: date, dtype: timedelta64[ns]
Note that we get NaT ("not a time") for the first entry -- that's because the first row has nothing to compare to above it.
The way to interpret these values is that row 1 is 362 days after row 0, and row 2 is 3 days after row 1, etc.
If you take the cumulative sum and subtract the starting offset, you'll get the days since the starting date (here 2014-01-01), shifted as if the Wednesday were day 0 of that first week. The shift compensates for the fact that Wednesday was the middle of its week when we count the number of whole weeks since the start:
In [9]: x["date"].diff().dt.days.cumsum() - min_date
Out[9]:
0 NaN
1 360.0
2 363.0
3 377.0
Name: date, dtype: float64
Now when we take the floor division by 7, we'll get the correct number of weeks since the starting date:
In [10]: (x["date"].diff().dt.days.cumsum() - 2) // 7 + 1
Out[10]:
0 NaN
1 52.0
2 52.0
3 54.0
Name: date, dtype: float64
Note that we add 1 because (I assume) you're counting from 1 -- i.e., 2014-01-01 is week 1 for you, and not week 0.
Finally, the .fillna is just to take care of that NaT (which turned into a NaN when we started doing arithmetic). You use .fillna(value) to fill NaNs with value:
In [11]: ((x["date"].diff().dt.days.cumsum() - 2) // 7 + 1).fillna(1)
Out[11]:
0 1.0
1 52.0
2 52.0
3 54.0
Name: date, dtype: float64
Lastly, use .astype(int) to convert the column from floats to integers.
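For what it's worth, here is a sketch of an equivalent without the .diff()/.cumsum() chain, measuring days directly from the start date; the - 2 offset and the week-numbering convention are the same as above, and .clip stands in for the .fillna on the first row:
# Whole days elapsed since the earliest date (0 for the first row).
days = (x["date"] - x["date"].min()).dt.days
x["weeks_from_start"] = ((days - 2) // 7 + 1).clip(lower=1).astype(int)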
I am working on a time series DataFrame. The df is as follows:
0 2019-01-01 Contact Tuesday False January 04:00:00.118000 1
1 2019-01-01 Contact Tuesday False January 04:00:00.483000 1
2 2019-01-01 Contact Tuesday False January 08:00:00.162000 1
3 2019-01-01 Contact Tuesday False January 08:00:00.426000 1
4 2019-01-01 Contact Tuesday False January 08:00:00.564000 1
To get this df I have done other transformations above, so this is not a direct load.
I am trying to convert the second to last column from 04:00:00.118000 to 04:00:00.
What is the quickest way to achieve this?
If your entries in the second to last column are of type datetime.time, you could use the following:
df[name] = df[name].apply(lambda t: t.replace(microsecond=0))
where name is the name of your second to last column. If they are of type str, then you could use this instead:
df[name] = df[name].apply(lambda t: t.split('.')[0])
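For instance, a quick check of the datetime.time branch (the sample values here are hypothetical):
import datetime
import pandas as pd
s = pd.Series([datetime.time(4, 0, 0, 118000), datetime.time(8, 0, 0, 426000)])
print(s.apply(lambda t: t.replace(microsecond=0)))
# 0    04:00:00
# 1    08:00:00
# dtype: object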
Try this; if your data is of object dtype, it should work.
Sample data mimicking yours:
>>> df
date col1
0 January 04:00:00.118000 1
1 January 04:00:00.483000 1
2 January 08:00:00.162000 1
3 January 08:00:00.426000 1
>>> df.dtypes
date object
col1 int64
dtype: object
Solution
>>> df['date'] = df['date'].str.split(".").str[0]
>>> df
date col1
0 January 04:00:00 1
1 January 04:00:00 1
2 January 08:00:00 1
3 January 08:00:00 1
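An equivalent sketch with a regex, which makes the intent (strip a trailing fractional-seconds part) explicit:
>>> df['date'] = df['date'].str.replace(r'\.\d+$', '', regex=True)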
I have a dataset that contains columns called date, shift, value, and so on. I want to extract the last value for each date and shift from the value column. For example, for each day the result should have two rows, one per shift (day or night), each containing the datetime and the last data point from the value column for that shift.
In this example, I want to extract the 3rd row (because the highest value for 7/14 and the day shift is 3).
I only know how to get the maximum value for each column. I have tried several ways to get this done, but it didn't work for me. I'm new to Python and looking for your help.
Assuming the data is already sorted by date, you could do something like this (or sort by date first, then do this):
df['day'] = df['date'].apply(lambda x: x.date())
df.groupby(['day','shift'])['value'].agg(list).apply(lambda x: x[-1])
This will group the DataFrame by day and shift, make a list of the values in each group, and take the last value of each list.
output:
day shift
2020-07-14 day 3
night 5
Name: value, dtype: int64
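A more direct equivalent, assuming the same sort order, is GroupBy.last, which skips building the intermediate lists:
df.groupby(['day', 'shift'])['value'].last()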
Here's a way to do this that also grabs multiple other columns... I admit it's not the cleanest and there's probably a better way, but it works:
df:
date shift value value2 day
0 2020-07-14 18:58:00 day 1 9 2020-07-14
1 2020-07-14 18:59:00 day 2 8 2020-07-14
2 2020-07-14 18:59:00 day 3 7 2020-07-14
3 2020-07-14 19:00:00 night 4 6 2020-07-14
4 2020-07-14 19:00:00 night 5 5 2020-07-14
cols = ['value', 'value2']
df.groupby(['day','shift'])[cols].agg(list).apply(lambda x: [x[col][-1] for col in cols], axis=1)
output:
day shift
2020-07-14 day [3, 7]
night [5, 5]
dtype: object
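If the groups have no missing values, GroupBy.last over several columns should give the same result in one step:
df.groupby(['day', 'shift'])[['value', 'value2']].last()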
If you need the max instead of the last:
import pandas
data = {"date": ["day1","day1","day1","day1","day1"],
"shift": ["Day","Day","Day","Night","Night"],
"value": [1, 2, 3, 4, 5]
}
df = pandas.DataFrame(data)
df.groupby(["date","shift"]).max()
Output
value
date shift
day1 Day 3
Night 5
Check the pandas docs on DataFrame and the groupby operation for more help: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
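If you need the entire row holding each group's max (a common variant of this pattern), idxmax gives the row labels to select:
df.loc[df.groupby(["date", "shift"])["value"].idxmax()]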
I have a dataframe with a column that contains the date of the first Monday of every week between an arbitrary start date and now. I wish to generate a new column that has 2-week jumps but is the same length as the original column, and so would contain repeated values. For example, this would be the result for the month of October, where the column weekly exists and bi-weekly is the target:
data = {'weekly': ['2018-10-08', '2018-10-15', '2018-10-22', '2018-10-29'],
        'bi-weekly': ['2018-10-08', '2018-10-08', '2018-10-22', '2018-10-22']}
df = pd.DataFrame(data)
At the moment I am stuck with pd.date_range(start, end, freq='14D'), but this does not contain any repeated values, which I need in order to be able to group by them.
IIUC
df.groupby(np.arange(len(df))//2).weekly.transform('first')
Out[487]:
0 2018-10-08
1 2018-10-08
2 2018-10-22
3 2018-10-22
Name: weekly, dtype: datetime64[ns]
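Put together as a self-contained sketch, converting the strings to datetimes and assigning the result back:
import numpy as np
import pandas as pd
df = pd.DataFrame({'weekly': pd.to_datetime(
    ['2018-10-08', '2018-10-15', '2018-10-22', '2018-10-29'])})
# Label consecutive row pairs 0,0,1,1,... and broadcast each pair's first date.
df['bi-weekly'] = df.groupby(np.arange(len(df)) // 2)['weekly'].transform('first')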
I have a pandas dataframe:
id age
001 1 hour
002 2 hours
003 2 days
004 4 days
Age refers to how long the item has been in the database. What I would like to do is derive the date when the item was added to the database.
So if the age column contains the string "hour" or "hours", I want the current date; if not, I want to subtract the number of days from the current date.
The desired output should look like this:
id age insertion_date
001 1 hour 2018-09-18
002 2 hours 2018-09-18
003 2 days 2018-09-16
004 4 days 2018-09-14
I am using Python 2.7, and this is what I have achieved so far.
import pandas as pd
from datetime import date, timedelta
for index, row in df.iterrows():
    age = row["age"]
    if "days" in age:
        # Remove days and convert data type of age column
        df["age"] = df["age"].astype("str").str.replace('[^\d\.]', '')
        # deduct current date by number of days
        df["insertion_date"] = df["age"].astype("int64").apply(lambda x: date.today() - timedelta(x))
    else:
        # print current date
        df["insertion_date"] = date.today()
The output from the code above looks like this:
id age insertion_date
001 1 2018-09-17
002 2 2018-09-16
003 2 2018-09-16
004 4 2018-09-14
The issue with this code is that even when the string "hour" or "hours" is present in the age column, it does not add the current date to the insertion_date column.
I would appreciate it if someone could point out where I went wrong, so I can fix the code to get the desired output: add the current date to the insertion_date column if the string "hour" or "hours" is present in the age column; otherwise, subtract the number of days in the age column from the current date and use that date instead.
You can take Timestamp.floor of today and subtract timedeltas created by to_timedelta, floored to whole days with .dt.floor:
df['new'] = pd.Timestamp.today().floor('D') - pd.to_timedelta(df['age']).dt.floor('D')
print (df)
id age new
0 1 1 hour 2018-09-18
1 2 2 hours 2018-09-18
2 3 2 days 2018-09-16
3 4 4 days 2018-09-14
print (df['new'].dtypes)
datetime64[ns]
Let's do a little timedeltarithmetic:
df['insertion_date'] = (
    pd.to_datetime('today') - pd.to_timedelta(df.age).dt.floor('D')).dt.date
df
id age insertion_date
0 1 1 hour 2018-09-18
1 2 2 hours 2018-09-18
2 3 2 days 2018-09-16
3 4 4 days 2018-09-14
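A self-contained version of the same idea, anchored on a fixed "today" so the output is reproducible (swap in pd.Timestamp.today() for live data):
import pandas as pd
df = pd.DataFrame({'id': ['001', '002', '003', '004'],
                   'age': ['1 hour', '2 hours', '2 days', '4 days']})
today = pd.Timestamp('2018-09-18')
# Hour-sized deltas floor to 0 days, so the 'hour' rows keep today's date.
df['insertion_date'] = (today - pd.to_timedelta(df['age']).dt.floor('D')).dt.date
Note that .dt.date yields Python date objects (object dtype), while the first answer's result stays datetime64[ns].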