I'm trying to split a df by datetime. The df is indexed on the datetime variable. Essentially, I can do:
first = df['2020-04-09':'2020-04-21']
second = df['2020-04-22':'2020-05-08']
and that yields my desired result of 2 dfs, each with their respective datetime range's worth of data.
However, I'd like a way to allow for easier editing at the top of the script by assigning the datetime ranges to local variables. Ideally something like this:
first_dates = '2020-04-09':'2020-04-21'
second_dates = '2020-04-22':'2020-05-08'
Such that later on I'm able to use something like:
first = df[first_dates]
second = df[second_dates]
and yield the same result of 2 dfs with their respective date ranges worth of data.
Is this what you want?
# edit this
date_str = '2020-04-21'
# no need to edit this
date = pd.to_datetime(date_str, utc=True)
first = df[:date]
second = df[date + pd.to_timedelta('1D'):]
Alternatively, using the datetime module you could build boolean masks and compare against the index, like:
import datetime as dt
mask1 = df.index <= dt.date(2020, 4, 21)
mask2 = df.index > dt.date(2020, 4, 21)
df1 = df.loc[mask1]
df2 = df.loc[mask2]
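For the actual question — storing the date ranges in variables at the top of the script — Python's built-in slice objects work with .loc. A minimal sketch with dummy data (the column name and values are assumed, not from the original):

```python
import numpy as np
import pandas as pd

# dummy frame with a daily DatetimeIndex
idx = pd.date_range('2020-04-09', '2020-05-08', freq='D')
df = pd.DataFrame({'value': np.arange(len(idx))}, index=idx)

# edit the ranges here, at the top of the script
first_dates = slice('2020-04-09', '2020-04-21')
second_dates = slice('2020-04-22', '2020-05-08')

# reuse them anywhere a .loc slice is accepted
first = df.loc[first_dates]
second = df.loc[second_dates]
```

Both endpoints are inclusive, just as with the literal `df['2020-04-09':'2020-04-21']` slicing.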
I'm trying to create a new column in a DataFrame that comes from a CSV file. What makes this a little bit tricky is that the values of this new column depend on conditions involving other columns of the DataFrame.
The output column depends on the values from the following columns from this dataframe: VaccineCode | Occurrence | VaccineN | firstVaccineDate
So if the condition is met for a specific vaccine, I have to add the respective number of days to the date from the ApplicationDate column, in order to get the vaccine date of the second dose.
My code:
import pandas as pd
import datetime
from datetime import timedelta, date, datetime
df = pd.read_csv(path_csv, engine='python', sep=';')
criteria_Astrazeneca = (df.VaccineCode == 85) & (df.Occurrence == 1) & (df.VaccineN == 1)
criteria_Pfizer = (df.VaccineCode == 86) & (df.Occurrence == 1) & (df.VaccineN == 1)
criteria_CoronaVac = (df.VaccineCode == 87) & (df.Occurrence == 1) & (df.VaccineN == 1)
days_pfizer = 56
days_coronaVac = 28
days_astraZeneca = 84
What I've tried so far:
df['New_Column'] = df[criteria_CoronaVac].firstVaccineDate + timedelta(days=days_coronaVac)
This works until the point that I have to complete the same New_Column with the others results, like this:
df['New_Column'] = df[criteria_CoronaVac].firstVaccineDate + timedelta(days=days_coronaVac)
df['New_Column'] = df[criteria_Pfizer].firstVaccineDate + timedelta(days=days_pfizer)
df['New_Column'] = df[criteria_Astrazeneca].firstVaccineDate + timedelta(days=days_astraZeneca)
Naturally, the problem with this approach comes from the fact that the next statement overwrites those before, so I end up just with the New_Column filled with the results that came from the last statement. I need a way to put all results in the same column.
My last try was:
df['New_Column'] = df[criteria_CoronaVac].firstVaccineDate + timedelta(days=days_coronaVac)
df[criteria_Pfizer].loc[:,'New_Column'] = df[criteria_Pfizer].firstVaccineDate + timedelta(days=days_pfizer)
But it gives the following error:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_column(ilocs[0], value, pi)
Thank you very much @ddejohn, the first link helped me to solve my problem as follows:
df['New_Column'] = df[criteria_CoronaVac].firstVaccineDate + timedelta(days=days_coronaVac)
df.loc[criteria_Pfizer,'New_Column'] = df[criteria_Pfizer].firstVaccineDate + timedelta(days=days_pfizer)
df.loc[criteria_Astrazeneca,'New_Column'] = df[criteria_Astrazeneca].firstVaccineDate + timedelta(days=days_astraZeneca)
That way, the first statement creates the column and fills it at the CoronaVac indexes, and the next ones fill the same column just at their respective indexes.
Problem solved, thanks again.
You could also use a DataFrame transform to create the new column in a single pass.
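For example, the three conditional assignments could be collapsed into one pass with np.select (swapping in np.select for the transform suggestion, since it suits row-wise conditions). The codes and day offsets come from the question; the sample rows are made up:

```python
import numpy as np
import pandas as pd

# made-up rows mirroring the question's columns
df = pd.DataFrame({
    'VaccineCode': [85, 86, 87],
    'Occurrence': [1, 1, 1],
    'VaccineN': [1, 1, 1],
    'firstVaccineDate': pd.to_datetime(['2021-05-01', '2021-05-02', '2021-05-03']),
})

conditions = [
    (df.VaccineCode == 85) & (df.Occurrence == 1) & (df.VaccineN == 1),  # AstraZeneca
    (df.VaccineCode == 86) & (df.Occurrence == 1) & (df.VaccineN == 1),  # Pfizer
    (df.VaccineCode == 87) & (df.Occurrence == 1) & (df.VaccineN == 1),  # CoronaVac
]
choices = [
    df.firstVaccineDate + pd.Timedelta(days=84),  # AstraZeneca: 84 days
    df.firstVaccineDate + pd.Timedelta(days=56),  # Pfizer: 56 days
    df.firstVaccineDate + pd.Timedelta(days=28),  # CoronaVac: 28 days
]
# rows matching no condition get NaT
df['New_Column'] = np.select(conditions, choices, default=np.datetime64('NaT', 'ns'))
```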
I'm having a hard time updating a string value in a subset of a Pandas DataFrame.
I can modify the action column using regular expressions with:
df['action'] = df.action.str.replace(r'([^a-z0-9\._]{2,})', '', regex=True)
However, if the string contains a specific word, I don't want to modify it, so I tried to only update a subset like this:
df[df['action'].str.contains('TIME')==False]['action'] = df[df['action'].str.contains('TIME')==False].action.str.replace('([^a-z0-9\._]{2,})','')
and also using .loc like:
df.loc('action',df.action.str.contains('TIME')==False) = df.loc('action',df.action.str.contains('TIME')==False).action.str.replace('([^a-z0-9\._]{2,})','')
but in both cases, nothing gets updated. Is there a better way to achieve this?
You can do it with .loc, but you had it the wrong way around: the row indexer comes first and the column second, and .loc uses square brackets [], not parentheses ().
mask_time = ~df['action'].str.contains('TIME') # same as df.action.str.contains('TIME')==False
df.loc[mask_time, 'action'] = df.loc[mask_time, 'action'].str.replace(r'([^a-z0-9\._]{2,})', '', regex=True)
example:
#dummy df
df = pd.DataFrame({'action': ['TIME 1', 'ABC 2']})
print (df)
action
0 TIME 1
1 ABC 2
the result after applying the method above:
action
0 TIME 1
1 2
Try this, it should work; I found it here:
df.loc[df.action.str.contains('TIME') == False, 'action'] = df.action.str.replace(r'([^a-z0-9\._]{2,})', '', regex=True)
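To sanity-check the mask-with-.loc pattern on the dummy frame from the first answer (regex=True is needed on recent pandas, where str.replace no longer treats the pattern as a regex by default):

```python
import pandas as pd

# dummy frame from the example above
df = pd.DataFrame({'action': ['TIME 1', 'ABC 2']})

# rows whose action does NOT contain 'TIME'
mask_time = ~df['action'].str.contains('TIME')

# strip runs of two or more disallowed characters, only on the masked rows
df.loc[mask_time, 'action'] = df.loc[mask_time, 'action'].str.replace(
    r'[^a-z0-9._]{2,}', '', regex=True)
```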
I have a pandas dataframe where one of the columns consists of datetime values with varying frequences.
I want to create a new column which flags whenever the gap between two datetime values is greater than one day (datetime current row + timedelta(days=1) < datetime next row).
However, I would want to do this with a list operation, rather than a for loop.
Had the values been int values, you could do something like:
df_ship["gap_gt_1"] = (df_ship['datetime']+1).lt(df_ship['datetime'].shift().bfill()).astype(int)
However, lt and similar operators don't work with datetime objects.
I've tried to do the following, but it only returns 'false' values.
df_ship["gap_gt_1"] = ((df_ship['datetime'] + timedelta(days=1)) < (df_ship['datetime'].shift()))
You can try to do:
import numpy as np
import pandas as pd
# Take the difference in dates
df["timedelta"] = df['date'] - df['date'].shift(1)
# To make the flags
conditions = [df["timedelta"] > pd.Timedelta(days=1)]
type_choices = [1]
df["flag"] = np.select(conditions, type_choices, default=0)
I'm pretty new to Python programming. I read a CSV file into a DataFrame with the median house price of each month as columns. Now I want to create columns holding the mean value of each quarter, e.g. create column housing['2000q1'] as the mean of 2000-01, 2000-02 and 2000-03, column housing['2000q2'] as the mean of 2000-04, 2000-05 and 2000-06, and so on.
raw dataframe named 'Housing'
I tried to use nested for loops as below, but always come with errors.
for i in range(2000, 2017):
    for j in range(1, 5):
        Housing[i 'q' j] = Housing[[i'-'j*3-2, i'-'j*3-1, i'_'j*3]].mean(axis=1)
Thank you!
Usually, we work with data where the rows are time, so it's good practice to do the same and transpose your data, starting with df = Housing.set_index('CountyName').T (also, variable names should usually start with a lowercase letter, but this isn't important here).
Since your data is already in such a nice format, there is a pragmatic (in the sense that you need not know too much about datetime objects and methods) solution, starting with df = Housing.set_index('CountyName').T:
df.reset_index(inplace = True) # This moves the dates to a column named 'index'
df.rename(columns = {'index':'quarter'}, inplace = True) # Rename this column into something more meaningful
# Rename the months into the appropriate quarters
df['quarter'] = df.quarter.str.replace('-01|-02|-03', 'q1', regex=True)  # str.replace has no inplace, so assign back
df['quarter'] = df.quarter.str.replace('-04|-05|-06', 'q2', regex=True)
df['quarter'] = df.quarter.str.replace('-07|-08|-09', 'q3', regex=True)
df['quarter'] = df.quarter.str.replace('-10|-11|-12', 'q4', regex=True)
df = df[df.quarter != 'SizeRank'] # Drop the SizeRank row to avoid including it in the calculation of means
c = df.notnull().sum(axis = 1) # Count the number of non-empty entries
df['total'] = df.sum(axis = 1) # The totals on each month
df['c'] = c # only assign c after computing the total, so it doesn't interfere with the total column
g = df.groupby('quarter')[['total','c']].sum()
g['q_mean'] = g['total']/g['c']
g
g['q_mean'] or g[['q_mean']] should give you the required answer.
Note that we needed to compute the mean manually because you had missing data; otherwise, df.groupby('quarter').mean().mean() would have immediately given you the answer you needed.
A remark: the technically 'correct' way would be to convert your dates into a datetime-like object (which you can do with the pd.to_datetime() function), then run a groupby with a pd.Grouper(freq='Q') argument (pd.TimeGrouper is the old name for this and has been removed from recent pandas versions); this would certainly be worth learning more about if you are going to work with time-indexed data a lot.
You can achieve this using pandas resampling function to compute quarterly averages in a very simple way.
pandas resampling: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
offset names summary: pandas resample documentation
In order to use this function, you need to have only time as columns, so you should temporarily set CountyName and SizeRank as indexes.
Code:
QuarterlyAverage = Housing.set_index(['CountyName', 'SizeRank'], append = True)\
.resample('Q', axis = 1).mean()\
.reset_index(['CountyName', 'SizeRank'], drop = False)
Thanks to @jezrael for suggesting axis = 1 in resampling.
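If the dates sit on the index instead of the columns (i.e. after the transpose suggested earlier), quarterly means reduce to a one-line groupby on periods, which avoids version-dependent resample aliases. A sketch on dummy monthly data (the column name and values are made up):

```python
import pandas as pd

# dummy monthly medians with the dates on the index
idx = pd.period_range('2000-01', '2000-06', freq='M').to_timestamp()
df = pd.DataFrame({'price': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]}, index=idx)

# bucket the monthly stamps into quarters and average each bucket
quarterly = df.groupby(df.index.to_period('Q')).mean()
```

The result is indexed by PeriodIndex entries such as 2000Q1 and 2000Q2, one row per quarter.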
I have one set of values measured at regular times. Say:
import pandas as pd
import numpy as np
rng = pd.date_range('2013-01-01', periods=12, freq='H')
data = pd.Series(np.random.randn(len(rng)), index=rng)
And another set of more arbitrary times, for example, (in reality these times are not a regular sequence)
ts_rng = pd.date_range('2013-01-01 01:11:21', periods=7, freq='87Min')
ts = pd.Series(np.nan, index=ts_rng)
I want to know the value of data interpolated at the times in ts.
I can do this in numpy:
x = np.asarray(ts_rng,dtype=np.float64)
xp = np.asarray(data.index,dtype=np.float64)
fp = np.asarray(data)
ts[:] = np.interp(x,xp,fp)
But I feel pandas has this functionality somewhere in resample, reindex etc. but I can't quite get it.
You can concatenate the two time series and sort by index. Since the values in the second series are NaN, you can interpolate and then just select out the values that represent the points from the second series:
pd.concat([data, ts]).sort_index().interpolate().reindex(ts.index)
or
pd.concat([data, ts]).sort_index().interpolate()[ts.index]
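A quick check of that recipe on toy data. Note that interpolate() defaults to method='linear', which interpolates by position; here the target times are exactly halfway between the hourly stamps so position and time coincide, but for irregular spacing you would pass method='time':

```python
import numpy as np
import pandas as pd

# regular hourly data
rng = pd.date_range('2013-01-01', periods=4, freq='h')
data = pd.Series([0.0, 1.0, 2.0, 3.0], index=rng)

# target times halfway between the original stamps
ts_rng = pd.date_range('2013-01-01 00:30', periods=3, freq='h')
ts = pd.Series(np.nan, index=ts_rng)

# concatenate, sort, interpolate, then keep only the target times
result = pd.concat([data, ts]).sort_index().interpolate().reindex(ts.index)
```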
Assume you would like to evaluate a time series ts on a different datetime_index. This index and the index of ts may overlap. I recommend the following groupby trick, which essentially gets rid of dubious duplicate stamps. I then forward-fill, but feel free to apply fancier methods:
def interpolate(ts, datetime_index):
    x = pd.concat([ts, pd.Series(index=datetime_index)])
    return x.groupby(x.index).first().sort_index().ffill()[datetime_index]
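A usage sketch of that helper on dummy data (np.nan is passed explicitly so the placeholder series is float; groupby(...).first() skips NaN, so the real data value wins whenever a target time coincides with an existing stamp):

```python
import numpy as np
import pandas as pd

def interpolate(ts, datetime_index):
    # merge, de-duplicate stamps (real data wins), sort, forward-fill
    x = pd.concat([ts, pd.Series(np.nan, index=datetime_index)])
    return x.groupby(x.index).first().sort_index().ffill()[datetime_index]

rng = pd.date_range('2013-01-01', periods=3, freq='h')
ts = pd.Series([1.0, 2.0, 3.0], index=rng)
target = pd.date_range('2013-01-01 00:30', periods=2, freq='h')
result = interpolate(ts, target)  # forward-filled values at the target times
```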
Here's a clean one-liner:
ts = np.interp(ts_rng.asi8, data.index.asi8, data.values)