How to add a column of random numbers based on two conditions? - python

I have a data frame in python containing the following information:
Day Type
Weekday 1
Weekday 2
Weekday 3
Weekday 1
Weekend 2
Weekend 1
I want to add a new column of Weibull random numbers, where each pair of "Day" and "Type" has its own Weibull distribution.
For example, I have tried the following code, but it did not work:
df['Duration'][ (df['Day'] == "Weekend") & (df['Type'] == 1) ] = int(random.weibullvariate(5.6/math.gamma(1+1/6),6))
df['Duration'] = df['Day','Type'].map(lambda x,y: int(random.weibullvariate(5.6/math.gamma(1+1/10),10)) if x == "Weekday" and y == 1 if x == "Weekend" and y == 1 int(random.weibullvariate(5.6/math.gamma(1+1/6),6)))

Define a function that generates the random number you want and apply it to the rows.
import io
import random
import math
import pandas as pd
data = io.StringIO('''\
Day Type
Weekday 1
Weekday 2
Weekday 3
Weekday 1
Weekend 2
Weekend 1
''')
df = pd.read_csv(data, sep=r'\s+')
def duration(row):
    if row['Day'] == 'Weekend' and row['Type'] == 1:
        return int(random.weibullvariate(5.6/math.gamma(1+1/6),6))
    if row['Day'] == 'Weekday' and row['Type'] == 1:
        return int(random.weibullvariate(5.6/math.gamma(1+1/10),10))
    # Any other (Day, Type) pair is not handled here and produces a missing value
df['Duration'] = df.apply(duration, axis=1)
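To cover every (Day, Type) pair without a growing chain of if statements, one option is a lookup table keyed by the pair. The sketch below assumes a hypothetical `SHAPES` dictionary: only the mean 5.6 and the shapes 10 and 6 for Type 1 come from the question, the other shape values are made-up placeholders.

```python
import math
import random

import pandas as pd

MEAN = 5.6  # target mean duration, taken from the question

# Weibull shape per (Day, Type) pair. Only the two Type-1 entries come from
# the question; the remaining values are placeholders for illustration.
SHAPES = {
    ("Weekday", 1): 10,
    ("Weekend", 1): 6,
    ("Weekday", 2): 8,
    ("Weekday", 3): 4,
    ("Weekend", 2): 5,
}


def duration(day, type_):
    shape = SHAPES[(day, int(type_))]
    # Pick the scale so that the mean of the distribution equals MEAN.
    scale = MEAN / math.gamma(1 + 1 / shape)
    return int(random.weibullvariate(scale, shape))


df = pd.DataFrame({
    "Day": ["Weekday", "Weekday", "Weekday", "Weekday", "Weekend", "Weekend"],
    "Type": [1, 2, 3, 1, 2, 1],
})
df["Duration"] = [duration(d, t) for d, t in zip(df["Day"], df["Type"])]
```

A pair that is missing from the dictionary raises a KeyError, which makes forgotten combinations visible instead of silently leaving gaps in the column.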

Related

Alternatives to update rows

I have the following sample data:
      date  value
0  2021/05     50
1  2021/06     60
2  2021/07     70
3  2021/08     80
4  2021/09     90
5  2021/10    100
I want to update the data in the 'date' column, where for example '2021/05' becomes '05/10/2021', '2021/06' becomes '06/12/2021' and so on (I have to choose the new date manually for every row).
Is there a better/more clever way to do it instead of:
for i in df.index:
    if df['date'][i] == '2021/05':
        df['date'][i] = '05/10/2021'
    elif df['date'][i] == '2021/06':
        df['date'][i] = '06/12/2021'
The problem is that more than a hundred rows have to be updated, so the code above would become extremely long.
We can use numpy's select function like so:
import numpy as np
condlist = [df['date'] == '2021/05',
            df['date'] == '2021/06']
choicelist = ['05/10/2021',
              '06/12/2021']
df['date'] = np.select(condlist, choicelist, default=np.nan)
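With a hundred or more replacements, a single dictionary may be easier to maintain than two parallel lists. A minimal sketch, assuming a hypothetical `new_dates` dictionary that you fill in by hand, using `Series.map` and `fillna`:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2021/05", "2021/06", "2021/07"],
    "value": [50, 60, 70],
})

# Hypothetical mapping from old to new values; add one entry per date.
new_dates = {
    "2021/05": "05/10/2021",
    "2021/06": "06/12/2021",
}

# map() gives NaN for keys missing from the dict, so fall back to the
# original value for rows that have no replacement yet.
df["date"] = df["date"].map(new_dates).fillna(df["date"])
```

Here '2021/07' is left unchanged because it has no entry in the dictionary.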
I would use an interactive approach, saving the amended DataFrame to a file at the end:
import pandas as pd
dt = pd.DataFrame({"date":["2021/05", "2021/06", "2021/07", "2021/08", "2021/09", "2021/10"], "value": [50, 60, 70, 80, 90, 100]})
for n, i in enumerate(dt.loc[:,"date"]):
    to_be_parsed = True
    while to_be_parsed:
        day = input("What is the day for {:s}?".format(i))
        date_str = "{:s}/{:0>2s}".format(i, day)
        try:
            dt.loc[n,"date"] = pd.to_datetime(date_str).strftime("%m/%d/%Y")
            to_be_parsed = False
        except ValueError:
            # Keep asking until the input forms a valid date
            print("Invalid date: {:s}. Try again".format(date_str))
output_path = input("Save amended dataframe to path (no input to skip): ")
if len(output_path) > 0:
    dt.to_csv(output_path, index=False)

'Find' sequences of size 3 in a dataframe column using pandas

I have a datetime dataset that contains a column where all 'days off' are identified as 1s and the rest are 0s. I am trying to create a new column that identifies extended periods off, say 3 days or more. I need to either select runs of 3 or more consecutive 1s (e.g. 01110) or discard single 1s (010) and double 1s (0110).
An example dataset follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'days_off': [1,0,1,1,1,0,0,1,1,0,1,0,1,1,1,1,0,1,1,1]})
df['extended_off'] = np.nan
I've tried a convoluted for loop to look at all the conditions. It appears to work except for the first two rows, which I can't iterate over with my solution (below).
Is there a better way, given that you should really avoid looping over a df?
for i in range(2, len(df)):
    if ((df.loc[i-1, 'days_off'] == 0) and (df.loc[i+1, 'days_off'] == 0)): # single holiday (wednesday)
        df.loc[i, 'extended_off'] = 0
    elif ((df.loc[i-1, 'days_off'] == 0) and (df.loc[i+2, 'days_off'] == 0)): # normal weekend (no prior)
        df.loc[i, 'extended_off'] = 0
    elif ((df.loc[i-2, 'days_off'] == 0) and (df.loc[i+1, 'days_off'] == 0)): # normal weekend (no following)
        df.loc[i, 'extended_off'] = 0
    elif df.loc[i, 'days_off'] == 0: # normal working day
        df.loc[i, 'extended_off'] = 0
    else:
        df.loc[i, 'extended_off'] = 1 # 3 or more days_off in a row
Thanks
Maybe this could be useful:
df["new"] = df['days_off'].ne(df['days_off'].shift()).cumsum()
df["counts"] = df.groupby("new")['new'].transform('size')
df['extended_off'] = ((df.counts.ge(3)) & (df.days_off == 1)).astype(int)
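The same run-length idea can be wrapped in a small helper that avoids the intermediate columns. This is only a sketch: the function name `flag_long_runs` and its `min_run` parameter are invented for illustration and are not part of pandas.

```python
import pandas as pd


def flag_long_runs(days_off, min_run=3):
    """Return 1 for rows inside a run of at least `min_run` consecutive 1s, else 0."""
    # A new run starts wherever the value differs from the previous row.
    run_id = days_off.ne(days_off.shift()).cumsum()
    # Length of the run that each row belongs to.
    run_len = days_off.groupby(run_id).transform('size')
    return ((run_len >= min_run) & (days_off == 1)).astype(int)


df = pd.DataFrame({'days_off': [1,0,1,1,1,0,0,1,1,0,1,0,1,1,1,1,0,1,1,1]})
df['extended_off'] = flag_long_runs(df['days_off'])
print(df['extended_off'].tolist())
# [0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1]
```

Because runs are measured as a whole, the first and last rows need no special handling, unlike the index arithmetic in the loop.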

Creating a Pandas dataframe column which is conditional on a function

Say I have some dataframe like below and I create a new column (track_len) which gives the length of the column track_no.
import pandas as pd
df = pd.DataFrame({'item_id': [1,2,3], 'track_no': ['qwerty23', 'poiu2', 'poiuyt5']})
df['track_len'] = df['track_no'].str.len()
df.head()
My Question is:
How do I now create a new column (new_col) which selects a specific subset of the track_no string and outputs that depending on the length of the track number (track_len).
I have tried creating a function that returns the relevant slice of track_no for each track_len condition and then using apply to create the column, but it doesn't work. The code is below:
Tried:
def f(row):
    if row['track_len'] == 8:
        val = row['track_no'].str[0:3]
    elif row['track_len'] == 5:
        val = row['track_no'].str[0:1]
    elif row['track_len'] == 7:
        val = row['track_no'].str[0:2]
    return val
df['new_col'] = df.apply(f, axis=1)
df.head()
Thus the desired output should be (based on string slicing output of f):
Output
{new_col: ['qwe', 'p', 'po']}
If there are alternative better solutions to this problem those would also be appreciated.
Your function works; you just need to remove the .str part inside the if blocks, because each row value is already a plain string:
def f(row):
    if row['track_len'] == 8:
        val = row['track_no'][:3]
    elif row['track_len'] == 5:
        val = row['track_no'][:1]
    elif row['track_len'] == 7:
        val = row['track_no'][:2]
    return val
df['new_col'] = df.apply(f, axis=1)
df.head()
#Output:
   item_id  track_no  track_len new_col
0        1  qwerty23          8     qwe
1        2     poiu2          5       p
2        3   poiuyt5          7      po
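As an alternative to the if/elif chain, the three cases can be written as a lookup from string length to prefix length. A minimal sketch, assuming a hypothetical `prefix_len` dictionary that mirrors the cases in f:

```python
import pandas as pd

df = pd.DataFrame({'item_id': [1, 2, 3],
                   'track_no': ['qwerty23', 'poiu2', 'poiuyt5']})

# Number of leading characters to keep for each track_no length
# (the same three cases as in the function f).
prefix_len = {8: 3, 5: 1, 7: 2}

df['new_col'] = [s[:prefix_len[len(s)]] for s in df['track_no']]
print(df['new_col'].tolist())
# ['qwe', 'p', 'po']
```

A track_no whose length is not in the dictionary raises a KeyError, so unexpected lengths surface immediately.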

Compare two date by month & date Python

I have two date columns that need to be compared: date1 is a list of certain dates and dob is a random date (date of birth). I need to compare month and day under some conditions to create a flag. Sample data:
df_sample = pd.DataFrame({'date1': ('2015-01-15','2015-01-15','2015-03-15','2015-04-15','2015-05-15'),
                          'dob': ('1999-01-25','1987-12-12','1965-03-02','2000-08-02','1992-05-15')})
I created a function based on the conditions below:
def eligible(date1,dob):
    if date1.month - dob.month==0 and date1.day <= dob.day:
        return 'Y'
    elif date1.month - dob.month==1 and date1.day > dob.day:
        return 'Y'
    else:
        return 'N'
I want to apply this function to the original df, which has more than 5M rows, so a for loop is not efficient. Is there any way to achieve this?
The data type is date, not datetime.
I think you need numpy.where with conditions chained by | (or):
df_sample['date1'] = pd.to_datetime(df_sample['date1'])
df_sample['dob'] = pd.to_datetime(df_sample['dob'])
months_diff = df_sample.date1.dt.month - df_sample.dob.dt.month
days_date1 = df_sample.date1.dt.day
days_dob = df_sample.dob.dt.day
m1 = (months_diff==0) & (days_date1 <= days_dob)
m2 = (months_diff==1) & (days_date1 > days_dob)
df_sample['out'] = np.where(m1 | m2 ,'Y','N')
print(df_sample)
       date1        dob out
0 2015-01-15 1999-01-25   Y
1 2015-01-15 1987-12-12   N
2 2015-03-15 1965-03-02   N
3 2015-04-15 2000-08-02   N
4 2015-05-15 1992-05-15   Y
Using datetime is certainly beneficial:
df_sample['dob'] = pd.to_datetime(df_sample['dob'])
df_sample['date1'] = pd.to_datetime(df_sample['date1'])
Once you have it, your formula can be literally applied to all rows:
df_sample['eligible'] = (
    ((df_sample.date1.dt.month == df_sample.dob.dt.month)
     & (df_sample.date1.dt.day <= df_sample.dob.dt.day))
    | ((df_sample.date1.dt.month - df_sample.dob.dt.month == 1)
       & (df_sample.date1.dt.day > df_sample.dob.dt.day))
)
The result is boolean (True/False), but you can easily convert it to "Y"/"N", if you want.
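For the conversion, either `numpy.where` or `Series.map` should do. A small sketch on a stand-alone boolean Series (the values below are simply the expected flags for the five sample rows):

```python
import numpy as np
import pandas as pd

# Stand-in for the boolean 'eligible' column of the five sample rows.
eligible = pd.Series([True, False, False, False, True])

# Option 1: numpy.where picks 'Y' where the mask is True, 'N' elsewhere.
flags_where = pd.Series(np.where(eligible, 'Y', 'N'), index=eligible.index)

# Option 2: map each boolean through a small dictionary.
flags_map = eligible.map({True: 'Y', False: 'N'})

print(flags_where.tolist())
# ['Y', 'N', 'N', 'N', 'Y']
```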

Count number of sundays in current month

How can I get the number of Sundays in the current month in Python?
Anyone got any idea about this?
This gives you the number of Sundays in the current month, as you wanted:
import calendar
from datetime import datetime
In [367]: len([1 for i in calendar.monthcalendar(datetime.now().year,
datetime.now().month) if i[6] != 0])
Out[367]: 4
I happened to need a solution for this but was not satisfied with the solutions here, so I came up with my own:
import calendar
year = 2016
month = 3
day_to_count = calendar.SUNDAY
matrix = calendar.monthcalendar(year,month)
num_days = sum(1 for x in matrix if x[day_to_count] != 0)
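The snippet above generalises naturally to a reusable function. The sketch below uses a made-up name, `count_weekday`, and assumes the calendar module's default setting, in which each week row starts on Monday:

```python
import calendar


def count_weekday(year, month, weekday=calendar.SUNDAY):
    """Count how often `weekday` (0 = Monday ... 6 = Sunday) occurs in a month."""
    weeks = calendar.monthcalendar(year, month)
    # Each row is one week; a 0 marks a day that belongs to a neighbouring month.
    return sum(1 for week in weeks if week[weekday] != 0)


print(count_weekday(2016, 3))                    # Sundays in March 2016 -> 4
print(count_weekday(2012, 10, calendar.MONDAY))  # Mondays in October 2012 -> 5
```

If `calendar.setfirstweekday` has been called elsewhere in the program, the column index no longer matches the weekday constant, so the default is an assumption worth checking.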
I'd do it like this:
import datetime
today = datetime.date.today()
day = datetime.date(today.year, today.month, 1)
single_day = datetime.timedelta(days=1)
sundays = 0
while day.month == today.month:
    if day.weekday() == 6:
        sundays += 1
    day += single_day
print('Sundays:', sundays)
My take (saves having to worry about being in the right month, etc.):
from calendar import weekday, monthrange, SUNDAY
y, m = 2012, 10
# monthrange returns (weekday of the 1st, number of days); only the day count is needed
days = [weekday(y, m, d) for d in range(1, monthrange(y, m)[1] + 1)]
print(days.count(SUNDAY))
Or, as @mgilson has pointed out, you can do away with the list comprehension and wrap it all up in a generator expression:
sum(1 for d in range(1, monthrange(y, m)[1] + 1) if weekday(y, m, d) == SUNDAY)
And I suppose you could throw in a Counter:
from collections import Counter
days = Counter(weekday(y, m, d) for d in range(1, monthrange(y, m)[1] + 1))
print(days[SUNDAY])
Another example using calendar and datetime:
import datetime
import calendar
today = datetime.date.today()
m = today.month
y = today.year
sum(1 for week in calendar.monthcalendar(y,m) if week[-1])
Perhaps a slightly faster way to do it would be:
first_day, month_len = monthrange(y, m)
date_of_first_sun = 1 + 6 - first_day
print(sum(1 for x in range(date_of_first_sun, month_len + 1, 7)))
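One way to gain confidence in this arithmetic is to compare it with a day-by-day count over many months. The function names below are invented for this check:

```python
from calendar import monthrange
from datetime import date


def sundays_arithmetic(y, m):
    # first_day is the weekday of the 1st (0 = Monday), month_len the day count.
    first_day, month_len = monthrange(y, m)
    first_sunday = 1 + 6 - first_day
    return len(range(first_sunday, month_len + 1, 7))


def sundays_brute_force(y, m):
    month_len = monthrange(y, m)[1]
    return sum(1 for d in range(1, month_len + 1)
               if date(y, m, d).weekday() == 6)


# Both approaches should agree for every month in the tested years.
assert all(sundays_arithmetic(y, m) == sundays_brute_force(y, m)
           for y in range(2000, 2031) for m in range(1, 13))
```

The arithmetic version does a constant amount of work per month, whereas the brute-force version inspects every day.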
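You can do this using ISO week numbers:
from datetime import date, timedelta
bom = date.today().replace(day=1) # first day of current month
eom = (date(bom.year, 12, 31) if bom.month == 12 else
       (bom.replace(month=bom.month + 1) - timedelta(days=1))) # last day of current month
_, b_week, b_weekday = bom.isocalendar()
_, e_week, e_weekday = eom.isocalendar()
num_sundays = (e_week - b_week) + (1 if e_weekday == 7 else 0)
In general, for a particular day of the week (1 = Monday, 7 = Sunday), the calculation is:
num_days = ((e_week - b_week) +
            (-1 if b_weekday > day else 0) +
            ( 1 if e_weekday >= day else 0))
Note that this relies on both week numbers belonging to the same ISO year, which is not always the case for January and December: the first days of January can fall in week 52 or 53 of the previous ISO year, and the last days of December can fall in week 1 of the next one.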
import calendar
MONTH = 10
sundays = 0
cal = calendar.Calendar()
for day in cal.itermonthdates(2012, MONTH):
    if day.weekday() == 6 and day.month == MONTH:
        sundays += 1
PAY ATTENTION:
Here is what the Calendar.itermonthdates docs say:
Return an iterator for one month. The iterator will yield datetime.date
values and will always iterate through complete weeks, so it will yield
dates outside the specified month.
That's why day.month == MONTH is needed
If you want the weekdays to be in range 0-6, use day.weekday(),
if you want them to be in range 1-7, use day.isoweekday()
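A quick illustration of the two numbering schemes, using two dates from October 2012:

```python
from datetime import date

sunday = date(2012, 10, 7)   # 7 October 2012 was a Sunday
monday = date(2012, 10, 8)

print(sunday.weekday(), sunday.isoweekday())   # 6 7
print(monday.weekday(), monday.isoweekday())   # 0 1
```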
My solution.
The following was inspired by @Lingson's answer, but I think it uses fewer loops.
import calendar
def get_number_of_weekdays(year: int, month: int) -> list:
    main_calendar = calendar.monthcalendar(year, month)
    number_of_weeks = len(main_calendar)
    number_of_weekdays = []
    for i in range(7):
        number_of_weekday = number_of_weeks
        if main_calendar[0][i] == 0:
            number_of_weekday -= 1
        if main_calendar[-1][i] == 0:
            number_of_weekday -= 1
        number_of_weekdays.append(number_of_weekday)
    # In my application I needed the number of each weekday, so the whole list is returned;
    # index calendar.SUNDAY (6) holds the number of Sundays.
    return number_of_weekdays
