Conditional lambda in pandas returns ValueError - python

In a df composed of the columns asset_id, event_start_date, event_end_date, I wish to add a fourth column datediff that, for each asset_id, captures how many days pass between an end_date and the following start_date for the same asset_id. But if that following start_date is earlier than the current end_date, I would like to capture the difference between the two start_dates instead. The dataset is sorted by (asset_id, start_date asc).
I tried:
events['datediff'] = df.groupby('asset_id').apply(
    lambda x: x['event_start_date'].shift(-1) - x['event_end_date']
    if x['event_start_date'].shift(-1) > x['event_end_date']
    else x['event_start_date'].shift(-1) - x['event_start_date']
).fillna(pd.Timedelta(seconds=0)).reset_index(drop=True)
But this is:
- not working: it throws ValueError: The truth value of a Series is ambiguous.
- rather inelegant anyway.
Thanks!

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'asset_id': [0, 0, 1, 1],
    'event_start_date': ['2019-07-08', '2019-07-11', '2019-07-15', '2019-07-25'],
    'event_end_date': ['2019-07-08', '2019-07-23', '2019-07-29', '2019-07-25']
})
df['event_end_date'] = pd.to_datetime(df['event_end_date'])
df['event_start_date'] = pd.to_datetime(df['event_start_date'])

# Next start date within the same asset_id (NaT for the last event of each asset).
df['next_start'] = df.groupby('asset_id')['event_start_date'].shift(-1)

# If the next event starts after the current one ends, measure from the end date;
# otherwise (overlapping events) measure between the two start dates.
df['date_diff'] = np.where(
    df['next_start'] > df['event_end_date'],
    (df['next_start'] - df['event_end_date']).dt.days,
    (df['next_start'] - df['event_start_date']).dt.days
)
df = df.drop(columns=['next_start']).fillna(0)
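For the sample data above, this should produce the following (the third row takes the start-to-start difference because the next event starts 2019-07-25, before the current one ends on 2019-07-29; the last event of each asset has no following start, so its NaN is filled with 0):

   asset_id event_start_date event_end_date  date_diff
0         0       2019-07-08     2019-07-08        3.0
1         0       2019-07-11     2019-07-23        0.0
2         1       2019-07-15     2019-07-29       10.0
3         1       2019-07-25     2019-07-25        0.0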


How to insert multiple Python parameters into a SQL string to create a dataframe

I've been trying to create a dataframe based on a SQL string, but I want to query the same things (3 counts) over different periods (in this example, monthly).
For context, I've been working in a Python notebook on Civis.
I came up with this:
START_DATE_LIST = ["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01", "2021-05-01", "2021-06-01",
                   "2021-07-01", "2021-08-01", "2021-09-01", "2021-10-01", "2021-11-01", "2021-12-01"]
END_DATE_LIST = ["2021-01-31", "2021-02-28", "2021-03-31", "2021-04-30", "2021-05-31", "2021-06-30",
                 "2021-07-31", "2021-08-31", "2021-09-30", "2021-10-31", "2021-11-30", "2021-12-31"]

for start_date, end_date in zip(START_DATE_LIST, END_DATE_LIST):
    SQL = f"select count(distinct case when (CON.firstdonationdate__c < {start_date} and CON.last_gift_date__c > {start_date}) then CON.id else null end) as Donors_START_DATE, \
        count(distinct case when (CON.firstdonationdate__c < {end_date} and CON.c_last_gift_date__c > {end_date}) then CON.id else null end) as Donors_END_DATE, \
        count(distinct case when (CON.firstdonationdate__c > {start_date} and CON.firstdonationdate__c < {end_date}) then CON.id else null end) as New_Donors \
        from staging.contact CON;"
    df2 = civis.io.read_civis_sql(SQL, "database", use_pandas=True)
    df2['START_DATE'] = start_date
    df2['END_DATE'] = end_date
It runs, but then the output is only:
donors_start_date donors_end_date new_donors START_DATE END_DATE
0 47458 0 0 2021-12-01 2021-12-31
I'm thinking I have two problems:
1/ It reruns the df each time; I need to find a way to stack up the outputs for each month.
2/ Why doesn't it compute the last two counts for the last month?
Any feedback is greatly appreciated!
I think you have correctly identified the problem yourself: in each iteration, you perform an SQL query and assign the result to a DataFrame object called df2 (thus overwriting its previous value).
Instead, you want to create a DataFrame object outside the loop, then append data to it:
import pandas as pd

START_DATE_LIST = ...
END_DATE_LIST = ...

df = pd.DataFrame()
for start_date, end_date in zip(START_DATE_LIST, END_DATE_LIST):
    SQL = ...
    row = civis.io.read_civis_sql(SQL, "database", use_pandas=True)
    row['START_DATE'] = start_date
    row['END_DATE'] = end_date
    df = df.append(row)
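Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On current pandas, a sketch of the same accumulation pattern collects the per-month frames in a list and concatenates once at the end:

import pandas as pd

frames = []
for start_date, end_date in zip(START_DATE_LIST, END_DATE_LIST):
    SQL = ...  # same query as above
    row = civis.io.read_civis_sql(SQL, "database", use_pandas=True)
    row['START_DATE'] = start_date
    row['END_DATE'] = end_date
    frames.append(row)

# One concat at the end is cheaper than growing a DataFrame per iteration.
df = pd.concat(frames, ignore_index=True)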

Pandas DF, DateOffset, creating new column

So I'm working with the JHU covid19 data and they've let their recovered dataset go; they're no longer tracking it, just confirmed cases and deaths. What I'm trying to do here is recreate it. The table is the confirmed cases and deaths for every country for every date, sorted by date, and my getRecovered function below attempts to pick the date for that row, find the date two weeks before that for the country of that row, and return a 'Recovered' column, which is the confirmed of two weeks ago minus the dead today.
Maybe a pointless exercise, but I'd still like to know how to do it haha. I know it's a big dataset and there are a lot of operations there, but it's been running for 20 minutes now and still going. Did I do something wrong, or would it just take this long?
Thanks for any help, friends.
import pandas as pd
import wget

urls = [
    'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv',
    'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
]
[wget.download(url) for url in urls]

confirmed = pd.read_csv('time_series_covid19_confirmed_global.csv')
deaths = pd.read_csv('time_series_covid19_deaths_global.csv')

dates = confirmed.columns[4:]

confirmed_long_form = confirmed.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Confirmed'
)
deaths_long_form = deaths.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Deaths'
)

full_table = confirmed_long_form.merge(
    right=deaths_long_form,
    how='left',
    on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long']
)
full_table['Date'] = pd.to_datetime(full_table['Date'])
full_table = full_table.sort_values(by='Date', ascending=True)
def getRecovered(row):
    ts = row['Date']
    country = row['Country/Region']
    ts = pd.Timestamp(ts)
    do = pd.tseries.offsets.DateOffset(n=14)
    newTimeStamp = ts - do
    oldrow = full_table.loc[(full_table['Date'] == newTimeStamp) & (full_table['Country/Region'] == country)]
    return oldrow['Confirmed'] - row['Deaths']

full_table['Recovered'] = full_table.apply(lambda row: getRecovered(row), axis=1)
full_table
Your function is being applied row by row, which is likely why performance is suffering. Pandas is fastest when you make use of vectorised functions. For example, you can use
pd.to_datetime(full_table['Date'])
to convert the whole date column much faster (see: Convert DataFrame column type from string to datetime).
You can then add the date offset to that column, something like:
full_table['Recovery_date'] = pd.to_datetime(full_table['Date']) - pd.tseries.offsets.DateOffset(n=14)
You can then self-merge the table on date == recovery_date (plus any other keys) and subtract the numbers, as sketched below.
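A minimal sketch of that self-merge, assuming the full_table built in the question (Province/State is kept in the join keys because Country/Region alone is not unique in this table, and Confirmed_14d_ago is just an illustrative name):

keys = ['Province/State', 'Country/Region']
# Rename so that joining on Recovery_date pulls in the confirmed count from 14 days earlier.
lookup = full_table[keys + ['Date', 'Confirmed']].rename(
    columns={'Date': 'Recovery_date', 'Confirmed': 'Confirmed_14d_ago'}
)
full_table['Recovery_date'] = full_table['Date'] - pd.tseries.offsets.DateOffset(n=14)
merged = full_table.merge(lookup, on=keys + ['Recovery_date'], how='left')
merged['Recovered'] = merged['Confirmed_14d_ago'] - merged['Deaths']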

How to find and add missing dates in a dataframe of sorted dates (descending order)?

In Python, I have a DataFrame with column 'Date' (format e.g. 2020-06-26). This column is sorted in descending order: 2020-06-26, 2020-06-25, 2020-06-24...
The other column, 'Reviews', is made of text reviews of a website. My data can have multiple reviews on a given date or no reviews on another date. I want to find which dates are missing in column 'Date'. Then, for each missing date, add one row with the date in format='%Y-%m-%d' and an empty review in 'Reviews', to be able to plot them. How should I do this?
from datetime import date, timedelta

d = data['Date']
print(d[0])
print(d[-1])
date_set = set(d[-1] + timedelta(x) for x in range((d[0] - d[-1]).days))
missing = sorted(date_set - set(d))
missing = pd.to_datetime(missing, format='%Y-%m-%d')

idx = pd.date_range(start=min(data.Date), end=max(data.Date), freq='D')

# tried this
data = data.reindex(idx, fill_value=0)
data.head()
# Got TypeError: 'fill_value' ('0') is not in this Categorical's categories.

# also tried this
df2 = (pd.DataFrame(data.set_index('Date'), index=idx).fillna(0) + data.set_index('Date')).ffill().stack()
df2.head()
# Got ValueError: cannot reindex from a duplicate axis
This is my code:
for i in range(len(df)):
    if i > 0:
        prev = df.loc[i-1]["Date"]
        current = df.loc[i]["Date"]
        for a in range((prev - current).days):
            if a > 0:
                df.loc[df["Date"].count()] = [prev - timedelta(days=a), None]

df = df.sort_values("Date", ascending=False)
print(df)
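A vectorized sketch of the same idea, assuming data['Date'] is already datetime: build the full daily range, take the set difference to find the missing dates, then concatenate one empty-review row per missing date:

import pandas as pd

# Full calendar between the earliest and latest review dates.
full_range = pd.date_range(start=data['Date'].min(), end=data['Date'].max(), freq='D')

# Dates that never appear in the data.
missing = full_range.difference(pd.DatetimeIndex(data['Date']))

# One empty-review row per missing date, then re-sort descending.
filler = pd.DataFrame({'Date': missing, 'Reviews': ''})
data = pd.concat([data, filler], ignore_index=True).sort_values('Date', ascending=False)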

SyntaxError comparing a date with a field in Python

Q) Resample the data to get prices for the end of the business month. Select the Adjusted Close for each stock.

import pandas as pd
mmm = pd.read_csv('mmm.csv')
ibm = pd.read_csv('ibm.csv')
fb = pd.read_csv('fb.csv')
amz_date = amz.loc['Date']==2017-6-30 # dataframe showing entire row for that date
amz_price = amz_date[:, ['Date', 'AdjClose']] # dataframe with only these 2 columns
mmm_date = mmm.loc['Date']==2017-6-30 # dataframe showing entire row for that date
mmm_price = mmm_date[:, ['Date', 'AdjClose']]
ibm_date = ibm.loc['Date']==2017-6-30 # dataframe showing entire row for that date
ibm_price = ibm_date[:, ['Date', 'AdjClose']]
fb_date = fb.loc['Date']==2017-6-30 # dataframe showing entire row for that date
fb_price = fb_date[:, ['Date', 'AdjClose']]
KeyError: 'Date'
What am I doing wrong? Also, Date is a column in the csv file.
Your particular problem is that 06 is not a legal way to write an integer here (Python 3 rejects leading zeros in decimal literals). To do arithmetic, you'd need to drop the leading zero.
Your next problem is that 2017-6-30 is then an arithmetic expression that evaluates to the integer 1981, not a date. You need to express this as a date, such as datetime.strptime("2017-06-30", "%Y-%m-%d").
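A sketch of the intended selection, assuming each CSV has Date and AdjClose columns (the broken .loc and slice usage from the question is replaced by a boolean mask):

import pandas as pd

mmm = pd.read_csv('mmm.csv', parse_dates=['Date'])

# Boolean mask selects the row(s) for that date; then keep just the two columns.
mmm_date = mmm[mmm['Date'] == pd.Timestamp('2017-06-30')]
mmm_price = mmm_date[['Date', 'AdjClose']]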

Pandas Columns Operations with List

I have a pandas dataframe with two columns: the first with a single date ('action_date') and the second with a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the list in the corresponding 'verification_date' column, and then fill two new df columns with the number of dates in verification_date whose difference is over or under 360 days.
Here is my code:
import pandas as pd

df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
# pd.TimeGrouper has since been removed from pandas; pd.Grouper is the equivalent.
df = df.groupby(pd.Grouper(freq='2D'))['verification_date'].apply(list).reset_index()
def make_columns(df):
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df

make_columns(df)
This kinda works EXCEPT the df has the same values for each row, which is not true as the dates are different. For example, in the first row of the dataframe there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty, and instead the under_360 column is populated with 1, which is accurate only for the second row of 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem was that you were always updating the whole column with the value of the last calculation with these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want to do instead is set the value for each row's calculation individually, which you can do by replacing the above lines with these:
df.at[i, 'over_360'] = len(over_360)
df.at[i, 'under_360'] = len(under_360)
What this does is set a value at row i in column over_360 or under_360. (The older forms df.set_value(i, 'over_360', len(over_360)) and df.ix[i, 'over_360'] = len(over_360) did the same thing, but both have been removed from recent pandas.)
You might want to try this:
df['over_360'] = df.apply(lambda x: sum((x['action_date'] - i).days > 360 for i in x['verification_date']), axis=1)
df['under_360'] = df.apply(lambda x: sum((x['action_date'] - i).days < 360 for i in x['verification_date']), axis=1)
I believe it should be a bit faster.
You didn't specify what to do if the difference is exactly 360, so you can just change > or < into >= or <=.
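As a quick check on the question's sample data, both approaches should now produce the result the asker describes as correct (2017-01-01 is 366 and 724 days after its two verification dates, while 2017-01-03 is only 2 days after 2017-01-01):

  action_date  over_360  under_360
0  2017-01-01         2          0
1  2017-01-03         0          1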
