I have the code below. My questions:
Why is it assigning week 1 to both '2014-12-29' and '2014-1-1'? Why is it not assigning week 53 to 2014-12-29?
How could I assign a week number that keeps increasing continuously? I want '2014-12-29' and '2015-1-1' to have week 53, '2015-1-15' to have week 55, etc.
x = pd.DataFrame(data=['2014-1-1', '2014-12-29', '2015-1-1', '2015-1-15'], columns=['date'])
x['week_number'] = pd.DatetimeIndex(x['date']).week
As for why the week number is 1 for 12/29/2014, see the question I linked to in the comments. For the second part of your question:
January 1, 2014 was a Wednesday. We can take the minimum date of your date column, get its day number, and subtract that from the cumulative differences:
Solution
# x["date"] = pd.to_datetime(x["date"]) # if not already a datetime column
min_date = x["date"].min() + 1 # + 1 because they're zero-indexed
x["weeks_from_start"] = ((x["date"].diff().dt.days.cumsum() - min_date) // 7 + 1).fillna(1).astype(int)
Output:
date weeks_from_start
0 2014-01-01 1
1 2014-12-29 52
2 2015-01-01 52
3 2015-01-15 54
Step by step
The first step is to convert the date column to the datetime type, if you haven't already:
In [3]: x.dtypes
Out[3]:
date object
dtype: object
In [4]: x["date"] = pd.to_datetime(x["date"])
In [5]: x
Out[5]:
date
0 2014-01-01
1 2014-12-29
2 2015-01-01
3 2015-01-15
In [6]: x.dtypes
Out[6]:
date datetime64[ns]
dtype: object
Next, we need to find the minimum of your date column and take its day number, adding 1, as the offset for the starting date:
In [7]: x["date"].min().day + 1
Out[7]: 2
Next, use the built-in .diff() function to take the differences of adjacent rows:
In [8]: x["date"].diff()
Out[8]:
0 NaT
1 362 days
2 3 days
3 14 days
Name: date, dtype: timedelta64[ns]
Note that we get NaT ("not a time") for the first entry -- that's because the first row has nothing to compare to above it.
The way to interpret these values is that row 1 is 362 days after row 0, and row 2 is 3 days after row 1, etc.
If you take the cumulative sum and subtract that starting day number, you'll get the days since the starting date, in this case 2014-01-01, offset as if the Wednesday were day 0 of that first week. (We need this offset because, when we count the number of weeks since the starting date, we have to compensate for Wednesday falling in the middle of its week.)
In [9]: x["date"].diff().dt.days.cumsum() - min_date
Out[9]:
0 NaN
1 360.0
2 363.0
3 377.0
Name: date, dtype: float64
Now when we take the floor division by 7, we'll get the correct number of weeks since the starting date:
In [10]: (x["date"].diff().dt.days.cumsum() - 2) // 7 + 1
Out[10]:
0 NaN
1 52.0
2 52.0
3 54.0
Name: date, dtype: float64
Note that we add 1 because (I assume) you're counting from 1 -- i.e., 2014-01-01 is week 1 for you, and not week 0.
The .fillna is just to take care of that NaT (which turned into a NaN once we started doing arithmetic). You use .fillna(value) to fill NaNs with value:
In [11]: ((x["date"].diff().dt.days.cumsum() - 2) // 7 + 1).fillna(1)
Out[11]:
0 1.0
1 52.0
2 52.0
3 54.0
Name: date, dtype: float64
Finally, use .astype(int) to convert the column from floats to integers.
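As a side note: if all you need is the number of whole 7-day blocks elapsed since the first date, ignoring calendar week boundaries, a simpler vectorized sketch (same x as above) would be:
x["weeks_from_start"] = (x["date"] - x["date"].min()).dt.days // 7 + 1
# gives 1, 52, 53, 55 for the four dates above (no mid-week offset applied)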
Related
I need to populate NaN values for some columns in one dataframe based on a condition between two data frames.
DF1 has SOL (start of line) and EOL (end of line) columns and DF2 has UTC_TIME for each entry.
For every point in DF2 where the UTC_TIME is >= the SOL and is <= the EOL of each record in the DF1, that row in DF2 must be assigned the LINE, DEVICE and TAPE_FILE.
So, every one of the points will be assigned a LINE, DEVICE and TAPE_FILE based on the SOL/EOL time the UTC_TIME is between in DF1.
I'm trying to use the numpy where function for each column, like this:
df2['DEVICE'] = np.where(df2['UTC_TIME'] >= df1['SOL'] and <= df1['EOL'])
Or using a for loop to iterate through each row:
for point in points:
    if df1['SOL'] >= df2['UTC_TIME'] and df1['EOL'] <= df2['UTC_TIME']:
        return df1['DEVICE']
Try with merge_asof:
# convert to datetime if needed
df1["SOL"] = pd.to_datetime(df1["SOL"])
df1["EOL"] = pd.to_datetime(df1["EOL"])
df2["UTC_TIME"] = pd.to_datetime(df2["UTC_TIME"])
output = pd.merge_asof(df2[["ID", "UTC_TIME"]], df1, left_on="UTC_TIME", right_on="SOL").drop(["SOL", "EOL"], axis=1)
>>> output
ID UTC_TIME LINE DEVICE TAPE_FILE
0 1 2022-04-25 06:50:00 1 Huntec 10
1 2 2022-04-25 07:15:00 2 Teledyne 11
2 3 2022-04-25 10:20:00 3 Huntec 12
3 4 2022-04-25 10:30:00 3 Huntec 12
4 5 2022-04-25 10:50:00 3 Huntec 12
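One caveat: merge_asof matches each UTC_TIME to the closest earlier SOL, but it never checks the EOL bound. If some points can fall outside every SOL/EOL interval, a sketch of an extra masking step (assuming the column names above) would be:
import numpy as np

merged = pd.merge_asof(
    df2[["ID", "UTC_TIME"]].sort_values("UTC_TIME"),
    df1.sort_values("SOL"),
    left_on="UTC_TIME",
    right_on="SOL",
)
# blank out matches whose point lies past the end of the matched line
merged.loc[merged["UTC_TIME"] > merged["EOL"], ["LINE", "DEVICE", "TAPE_FILE"]] = np.nan
output = merged.drop(["SOL", "EOL"], axis=1)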
I have a dataframe with an index column and another column that marks whether or not an event occurred on that day with a 1 or 0.
If an event occurred, it typically happened continuously for a prolonged period of time. The column typically marks whether a recession occurred, so there would likely be 60-180 straight days marked with a 1 before going back to 0.
What I need to do is find the dates that mark the beginning and end of each sequence of 1's.
Here's some quick sample code:
dates = pd.date_range(start='2010-01-01', end='2015-01-01')
nums = np.random.normal(50, 5, 1827)
df = pd.DataFrame(nums, index=dates, columns=['Nums'])
df['Recession'] = np.where((df.index.month == 3) | (df.index.month == 12), 1, 0)
With the example dataframe, the value 1 occurs for the months of March and December, so ideally I'd have a list that reads [2010-03-01, 2010-03-31, 2010-12-01, 2010-12-31, ......, 2014-12-01, 2014-12-31].
I know I could find these values by using a for-loop, but that seems inefficient. I tried using groupby as well, but couldn't find anything that gave the results that I wanted.
Not sure if there's a pandas or numpy method to search an index for the appropriate conditions or not.
Let's try this, using DataFrameGroupBy.idxmin + DataFrameGroupBy.idxmax
# group-by on month, year & aggregate on date
g = (
    df.assign(day=df.index.day)
      .groupby([df.index.month, df.index.year]).day
)
# create mask of max date & min date for each (month, year) combination
mask = df.index.isin(g.idxmin()) | df.index.isin(g.idxmax())
# apply the previous mask with the month filter
df.loc[mask & (df.index.month.isin([3,12])), 'Recession'] = 1
print(df[df['Recession'] == 1])
Nums Recession
2010-03-01 45.698168 1.0
2010-03-31 47.969167 1.0
2010-12-01 49.388595 1.0
2010-12-31 46.689064 1.0
2011-03-01 50.120603 1.0
2011-03-31 58.379980 1.0
2011-12-01 53.745407 1.0
...
I would use diff to find the periods: diff makes it possible to find where the column switches from one state to the other, so the indices found split into two parts, the starts and the ends.
Depending whether the data starts with a recession or not:
locs = (df.Recession.diff().fillna(0) != 0).values.nonzero()[0]
if df.Recession.iloc[0] == 0:
    start = df.index[locs[::2]]
    end = df.index[locs[1::2] - 1]
else:
    start = df.index[locs[1::2]]
    end = df.index[locs[::2] - 1]
If the data already starts in a recession, it is up to you whether to count the first date as a start; the above does not include it.
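For example, to list the recession periods as (start, end) pairs (a small usage sketch, assuming the series starts out of recession so start and end line up):
for s, e in zip(start, end):
    print(s.date(), '->', e.date())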
From what I understand, you need to find the first value in each sequence? If so, we can use groupby with cumsum to label each consecutive group, and cumcount to number the rows within each group.
df["keyGroup"] = (
df.groupby(df["Recession"].ne(df["Recession"].shift()).cumsum()).cumcount() + 1
)
df[df['keyGroup'].eq(1)]
Nums Recession keyGroup
2010-01-01 51.944742 0 1
2010-03-01 54.809271 1 1
2010-04-01 52.632831 0 1
2010-12-01 55.863695 1 1
2011-01-01 52.944778 0 1
2011-03-01 58.164943 1 1
2011-04-01 49.590640 0 1
2011-12-01 47.884919 1 1
2012-01-01 44.128065 0 1
2012-03-01 54.846231 1 1
2012-04-01 51.312064 0 1
2012-12-01 46.091171 1 1
2013-01-01 49.287102 0 1
2013-03-01 54.727874 1 1
2013-04-01 53.163730 0 1
2013-12-01 42.373602 1 1
2014-01-01 43.822791 0 1
2014-03-01 51.203125 1 1
2014-04-01 54.322415 0 1
2014-12-01 44.052536 1 1
2015-01-01 53.438015 0 1
You can call .index to get the values as a DatetimeIndex.
df[df['keyGroup'].eq(1)].index
DatetimeIndex(['2010-01-01', '2010-03-01', '2010-04-01', '2010-12-01',
'2011-01-01', '2011-03-01', '2011-04-01', '2011-12-01',
'2012-01-01', '2012-03-01', '2012-04-01', '2012-12-01',
'2013-01-01', '2013-03-01', '2013-04-01', '2013-12-01',
'2014-01-01', '2014-03-01', '2014-04-01', '2014-12-01',
'2015-01-01'],
dtype='datetime64[ns]', name='date', freq=None)
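The rows above are the first day of every run, runs of 0s as well as runs of 1s. To get both endpoints of just the runs of 1s, the same run labels can be reused; a sketch, assuming the df above:
runs = df['Recession'].ne(df['Recession'].shift()).cumsum()
firsts = df.groupby(runs).head(1)  # first row of every run
lasts = df.groupby(runs).tail(1)   # last row of every run
starts = firsts[firsts['Recession'].eq(1)].index  # recession start dates
ends = lasts[lasts['Recession'].eq(1)].index      # recession end dates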
How can we categorize a timestamp column in a DataFrame into Day and Night on the basis of time?
I am trying to do so, but I am unable to produce a new column with the same number of entries.
d_call["time"] = d_call["timestamp"].apply(lambda x: x.time())
d_call["time"].head(1)
0 17:10:52
Name: time, dtype: object
def day_night(name):
    for i in name:
        if i.hour > 17:
            return "night"
        else:
            return "day"
day_night(d_call["time"])
'day'
d_call["Day / Night"]= d_call["time"].apply(lambda x: day_night(x))
I want to get the result for the entire column, but I am getting only the value for the first index.
You can extract the hour from the timestamp and assign your category based on that hour; you can also use other conditions to handle a range of times.
Consider this df:
0 2018-06-18 15:05:52.246
1 2018-05-24 21:44:07.903
2 2018-06-06 21:00:19.635
3 2018-05-24 21:44:37.883
4 2018-05-30 11:19:36.546
5 2018-05-25 11:16:07.969
6 2018-05-24 21:43:35.077
7 2018-06-07 18:39:00.258
Name: modified_at, dtype: datetime64[ns]
df['day/night'] = df.modified_at.apply(lambda x: 'night' if int(x.strftime('%H')) > 19 else 'day')
Out:
0 day
1 night
2 night
3 night
4 day
5 day
6 night
7 day
Name: modified_at, dtype: object
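A vectorized sketch of the same idea (assuming the same modified_at column and the same 19:00 cutoff), using dt.hour instead of a Python-level apply:
import numpy as np

# dt.hour is already an integer, so no strftime round-trip is needed
df['day/night'] = np.where(df['modified_at'].dt.hour > 19, 'night', 'day')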
I want to convert dates into quarters. I've used:
x['quarter'] = x['date'].dt.quarter
date quarter
0 2013-1-1 1
But it repeats the same quarters for the next year.
date quarter
366 2014-1-1 1
Instead of the 1, I want the quarter to be 5 (expected result):
date quarter
366 2014-1-1 5
...
date quarter
731 2015-1-1 9
You can use a simple mathematical operation: take the quarter within the year and add 4 for every year elapsed since the starting year. For example, 2015-01-01 is quarter 1 of 2015, so 1 + (2015 - 2013) * 4 = 9.
starting_year = 2013
df['quarter'] = df.year.dt.quarter + (df.year.dt.year - starting_year)*4
year quarter
0 2013-01-01 1
0 2014-01-01 5
0 2015-01-01 9
0 2016-01-01 13
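The same idea with the column names from the question (a sketch, assuming x['date'] is already a datetime column and taking the starting year from the data):
starting_year = x['date'].dt.year.min()
x['quarter'] = x['date'].dt.quarter + (x['date'].dt.year - starting_year) * 4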
My data has trips with datetime info, user id for each trip and trip type (single, round, pseudo).
Here's a data sample (pandas dataframe), named All_Data:
HoraDTRetirada idpass type
2016-02-17 15:36:00 39579449489 'single'
2016-02-18 19:13:00 39579449489 'single'
2016-02-26 09:20:00 72986744521 'pseudo'
2016-02-27 12:11:00 72986744521 'round'
2016-02-27 14:55:00 11533148958 'pseudo'
2016-02-28 12:27:00 72986744521 'round'
2016-02-28 16:32:00 72986744521 'round'
I would like to count the number of times each category repeats in a "week of year" by user.
For example, if one event happens on a Monday and the next event happens on the following Thursday for the same user, that makes two events in the same week; however, if one event happens on a Saturday and the next event happens on the following Monday, they happened in different weeks.
The output I am looking for would be in a form like this:
idpass weekofyear type frequency
39579449489 1 'single' 2
72986744521 2 'round' 3
72986744521 2 'pseudo' 1
11533148958 2 'pseudo' 1
Edit: this older question approaches a similar problem, but I don't know how to do it with pandas.
import pandas as pd
data = {"HoraDTRetirada": ["2016-02-17 15:36:00", "2016-02-18 19:13:00", "2016-12-31 09:20:00", "2016-02-28 12:11:00",
"2016-02-28 14:55:00", "2016-02-29 12:27:00", "2016-02-29 16:32:00"],
"idpass": ["39579449489", "39579449489", "72986744521", "72986744521", "11533148958", "72986744521",
"72986744521"],
"type": ["single", "single", "pseudo", "round", "pseudo", "round", "round"]}
df = pd.DataFrame.from_dict(data)
print(df)
df["HoraDTRetirada"] = pd.to_datetime(df['HoraDTRetirada'])
df["week"] = df['HoraDTRetirada'].dt.strftime('%U')
k = df.groupby(["idpass", "week", "type"],as_index=False).count()
print(k)
Output:
HoraDTRetirada idpass type
0 2016-02-17 15:36:00 39579449489 single
1 2016-02-18 19:13:00 39579449489 single
2 2016-12-31 09:20:00 72986744521 pseudo
3 2016-02-28 12:11:00 72986744521 round
4 2016-02-28 14:55:00 11533148958 pseudo
5 2016-02-29 12:27:00 72986744521 round
6 2016-02-29 16:32:00 72986744521 round
idpass week type HoraDTRetirada
0 11533148958 09 pseudo 1
1 39579449489 07 single 2
2 72986744521 09 round 3
3 72986744521 52 pseudo 1
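Note that strftime('%U') returns zero-padded strings ('07', '09'), so the week column above is text, not numbers; a sketch, if integer weeks are preferred:
df["week"] = df['HoraDTRetirada'].dt.strftime('%U').astype(int)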
This is how I got what I was looking for:
Step 1 from the suggested answers was skipped because the timestamps were already in pandas datetime form.
Step 2: create a column for the week of the year:
df['week'] = df['HoraDTRetirada'].dt.strftime('%U')
Step 3: group by user id, type and week, and count the values with size():
df.groupby(['idpass', 'type', 'week']).size()
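To get the exact shape shown in the question, with a named frequency column, the size result can be turned back into a DataFrame; a sketch, assuming the df above:
freq = df.groupby(['idpass', 'week', 'type']).size().reset_index(name='frequency')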
My suggestion would be to do this:
Make sure your timestamp is a pandas datetime and add a frequency column:
df['HoraDTRetirada'] = pd.to_datetime(df['HoraDTRetirada'])
df['freq'] = 1
Group it and count:
res = df.groupby(['idpass', 'type', pd.Grouper(key='HoraDTRetirada', freq='1W')]).count().reset_index()
Convert the time to the week of the year:
res['HoraDTRetirada'] = res['HoraDTRetirada'].apply(lambda x: x.week)
The final result then has one row per (idpass, type, week) combination, with freq holding the count.
EDIT:
You are right: in your case step 3 should come before step 2. If you do that, remember that the groupby changes too, so step 2 becomes:
df['HoraDTRetirada'] = df['HoraDTRetirada'].apply(lambda x: x.week)
and step 3:
res = df.groupby(['idpass', 'type', 'HoraDTRetirada']).count().reset_index()
It's a bit different because the "Hora" variable is no longer a time, just an int representing a week.
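As a side note, on pandas 1.1+ the week extraction can be vectorized instead of applied row by row (a sketch; note that ISO week numbers can differ slightly from strftime('%U') numbering):
# vectorized ISO week of year (pandas >= 1.1)
df['HoraDTRetirada'] = df['HoraDTRetirada'].dt.isocalendar().week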