Pandas DF, DateOffset, creating new column - python

So I'm working with the JHU COVID-19 data, and they've let their recovered dataset go; they're no longer tracking it, just confirmed cases and deaths. What I'm trying to do here is recreate it. The table holds the confirmed cases and deaths for every country for every date, sorted by date. My getRecovered function below attempts to take the date of a row, find the date two weeks before it for that row's country, and build a 'Recovered' column as the confirmed count from two weeks ago minus the deaths today.
Maybe a pointless exercise, but I'd still like to know how to do it, haha. I know it's a big dataset and there are a lot of operations there, but it's been running for 20 minutes now and still going. Did I do something wrong, or would it just take this long?
Thanks for any help, friends.
import pandas as pd
import wget

urls = [
    'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv',
    'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
]
[wget.download(url) for url in urls]

confirmed = pd.read_csv('time_series_covid19_confirmed_global.csv')
deaths = pd.read_csv('time_series_covid19_deaths_global.csv')

dates = confirmed.columns[4:]

confirmed_long_form = confirmed.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Confirmed'
)
deaths_long_form = deaths.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Deaths'
)

full_table = confirmed_long_form.merge(
    right=deaths_long_form,
    how='left',
    on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long']
)

full_table['Date'] = pd.to_datetime(full_table['Date'])
full_table = full_table.sort_values(by='Date', ascending=True)

def getRecovered(row):
    # date and country of the current row
    ts = pd.Timestamp(row['Date'])
    country = row['Country/Region']
    # same country, two weeks earlier
    newTimeStamp = ts - pd.tseries.offsets.DateOffset(n=14)
    oldrow = full_table.loc[(full_table['Date'] == newTimeStamp) & (full_table['Country/Region'] == country)]
    return oldrow['Confirmed'] - row['Deaths']

full_table['Recovered'] = full_table.apply(getRecovered, axis=1)
full_table

Your function is being applied row by row, which is likely why performance is suffering. Pandas is fastest when you make use of vectorised operations. For example, you can use
pd.to_datetime(full_table['Date'])
to convert the whole date column much faster (see here: Convert DataFrame column type from string to datetime).
You can then add the date offset to that column, something like:
full_table['Recovery_date'] = pd.to_datetime(full_table['Date']) - pd.tseries.offsets.DateOffset(n=14)
You can then self-merge the table on Date == Recovery_date (plus any other keys) and subtract the numbers.
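For example, a rough sketch of that self-merge (the lookup frame and the Confirmed_14d_ago name are just illustrative; it assumes Recovery_date was added as above):
# confirmed counts keyed by their own date, renamed so they line up with Recovery_date
lookup = full_table[['Province/State', 'Country/Region', 'Date', 'Confirmed']].rename(
    columns={'Date': 'Recovery_date', 'Confirmed': 'Confirmed_14d_ago'})
full_table = full_table.merge(
    lookup,
    on=['Province/State', 'Country/Region', 'Recovery_date'],
    how='left')
# recovered ~= confirmed two weeks ago minus deaths today
full_table['Recovered'] = full_table['Confirmed_14d_ago'] - full_table['Deaths']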

Related

How can I merge the numerous data of two columns within the same DataFrame?

Here is a pic of df1 = fatalities.
So, in order to create a diagram that displays the years with the most injuries (I have an assignment about plane crash incidents in Greece from 2000-2020), I need to create a column out of the minor_injuries and serious_injuries ones.
I started with a first df with more data, but I kept only the columns that I needed, so we have the fatalities df1, which contains the years, the fatal_injuries, the minor_injuries, the serious_injuries and the total number of incidents per year (all_incidents). What I wish to do is merge the minor and serious injuries into a column named total_injuries, or just injuries.
import pandas as pd

pd.set_option('display.max_rows', None)
df = pd.read_csv('all_incidents_cleaned.csv')
df.head()

df['Year'] = pd.to_datetime(df.incident_date).dt.year

fatalities = df.groupby('Year').fatalities.value_counts().unstack().reset_index()
fatalities['all_incidents'] = fatalities[['Θανάσιμος τραυματισμός',
    'Μικρός τραυματισμός', 'Σοβαρός τραυματισμός', 'Χωρίς Τραυματισμό']].sum(axis=1)
df['percentage_deaths_to_all_incidents'] = round((fatalities['Θανάσιμος τραυματισμός'] / fatalities['all_incidents']) * 100, 1)

df1 = fatalities
fatalities_pd = pd.DataFrame(fatalities)
df1

fatalities_pd.rename(columns={'Θανάσιμος τραυματισμός': 'fatal_injuries',
                              'Μικρός τραυματισμός': 'minor_injuries',
                              'Σοβαρός τραυματισμός': 'serious_injuries',
                              'Χωρίς Τραυματισμό': 'no_injuries'}, inplace=True)
df1
For your current dataset, two steps are needed.
First, I would replace the NaN values with 0.
This can be done with (note that fillna returns a new DataFrame unless you assign it back or pass inplace=True):
df1 = df1.fillna(0)
Then you can create a new column "total_injuries" with the sum of the minor and serious injuries:
df1["total_injuries"] = df1["minor_injuries"] + df1["serious_injuries"]
It's always good to check your data for consistency before working on it. Helpful commands look like:
data.shape
data.info()
data.isna().values.any()
data.duplicated().values.any()
duplicated_rows = data[data.duplicated()]
len(duplicated_rows)
data.describe()
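Putting the two steps together (a short sketch, assuming the renamed columns from the question and that Year became a regular column after reset_index()):
df1 = df1.fillna(0)
df1["total_injuries"] = df1["minor_injuries"] + df1["serious_injuries"]
# years ranked by total injuries, for the diagram
df1.sort_values("total_injuries", ascending=False)[["Year", "total_injuries"]]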

PYTHON, Pandas Dataframe: how to select and read only certain rows

To make the purpose clear, here is the code that works perfectly (of course I've put only the beginning; the rest is not important here):
df = pd.read_csv(
    'https://github.com/pcm-dpc/COVID-19/raw/master/dati-andamento-nazionale/'
    'dpc-covid19-ita-andamento-nazionale.csv',
    parse_dates=['data'], index_col='data')
df.index = df.index.normalize()
ts = df[['nuovi_positivi']].dropna()
sts = ts.nuovi_positivi
So basically it takes some data from the online GitHub CSV that you may find here:
Link NAZIONALE. It looks at "data", which is Italian for "date", extracts the value nuovi_positivi for every date, and puts it into the program.
Now I have to do the same thing with this JSON that you may find here:
Link Json
As you may see, now for every date there are 21 different values because Italy has 21 regions (Abruzzo, Basilicata, Campania and so on), but I am interested ONLY in the values of the region "Veneto". I want to extract only the rows that contain "Veneto" under the label "denominazione_regione", to get the value "nuovi_positivi" for every day.
I tried with:
df = pd.read_json('https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni.json',
                  parse_dates=['data'], index_col='data', index_row='Veneto')
df.index = df.index.normalize()
ts = df[['nuovi_positivi']].dropna()
sts = ts.nuovi_positivi
but of course it doesn't work. How can I solve the problem? Thanks.
Try this: read_json has no index_col or index_row parameter (dates are handled via convert_dates), so set the index and filter on the region afterwards:
df = pd.read_json('https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni.json',
                  convert_dates=['data'])
df.index = df['data']
df.index = df.index.normalize()
df = df[df["denominazione_regione"] == 'Veneto']
ts = df[['nuovi_positivi']].dropna()
sts = ts.nuovi_positivi

Merge FAST two DataFrames based on specific conditions, row by row

I have been struggling with my case for the past 10 days and I can't find a fast and efficient solution.
Here is the case: I have one DF containing web traffic data from a human resources website.
Every row of this DataFrame refers to an application (i.e. someone reached the website via a specific web source and applied to a specific job offer at a specific time).
Here is an example:
import pandas as pd

web_data = {'source': ['Google', 'Facebook', 'Email'],
            'job_id': ['123456', '654321', '010101'],
            'd_date_hour_event': ['2019-11-01 00:09:59', '2019-11-01 00:10:41', '2019-11-01 00:19:20'],
            }
web_data = pd.DataFrame(web_data)
In the second DataFrame, I have an extract of a Human Resources internal tool where we gather all the received applications with some complementary data. Here is an example:
hr_data = {'candidate_id': ['ago23ak', 'bli78gro', '123tru456'],
           'job_id': ['675848', '343434', '010101'],
           'date_time_submission': ['2019-11-10 00:24:59', '2019-11-09 12:10:41', '2019-11-01 00:19:22'],
           'job_label': ['HR internship', 'Data Science Supervisor', 'Project Manager']
           }
hr_data = pd.DataFrame(hr_data)
Here are the difficulties I am facing:
There is no unique key I can use to merge those two tables. I have to use "job_id" (which is unique to every job) combined with the time when the application occurred, via the columns "d_date_hour_event" (in the web_data DF) and "date_time_submission" (in the hr_data DF).
For the same application, the time registered in the two tables might not be the same (a difference of a few seconds).
Some of the web_data values might not be present in hr_data.
In the end, I would like to get one DataFrame that looks like this :
result_dataframe.png
Actually, I already coded a function to perform this merge. It looks like this:
from datetime import timedelta

for i, row in web_data.iterrows():
    # store the values needed for the hr_data lookup
    date = row.d_date_hour_event
    job = row.job_id
    # compute the time window
    inf = date - timedelta(seconds=10)
    sup = date + timedelta(seconds=10)
    # check whether there is a matching row in hr_data
    temp_df = pd.DataFrame()
    temp_df = hr_data[(hr_data.job_id == job) &
                      (hr_data.date_time_submission >= inf) &
                      (hr_data.date_time_submission <= sup)].tail(1)
    # if there is a matching row, merge them and update the web_data table
    if not temp_df.empty:
        row = row.to_frame().transpose()
        join = pd.merge(row, temp_df, how='inner', on='job_id', left_index=False, right_index=True)
        web_data.update(join)
But because my web_data has over 250K rows and my hr_data over 140K rows, it takes hours! (An estimated 35 hours of script running time...)
I am sure that iterrows is not optimal and that this code can be optimized. I tried to use a custom function with .apply(lambda x: ...) but without success.
Any help would be more than welcome!
Please let me know if you need more explanations.
Many thanks!
Let's try this in a few steps:
1. Convert the datetime columns in both dataframes to pandas datetime format.
web_data = web_data.assign(d_date_hour_event=lambda x: pd.to_datetime(x['d_date_hour_event']))
hr_data = hr_data.assign(date_time_submission=lambda x: pd.to_datetime(x['date_time_submission']))
2. Rename the job_id of the hr_data dataframe so it will not provide any errors when merging.
hr_data = hr_data.rename(columns={"job_id": "job_id_hr"})
3. Make NumPy arrays from the timestamp and job_id columns of both dataframes, then use NumPy broadcasting to find the rows where the web_data timestamp is within 10 seconds of the hr_data timestamp and the job_id is the same.
import numpy as np

web_data_dates = web_data['d_date_hour_event'].values
hr_data_dates = hr_data['date_time_submission'].values
web_data_job_ids = web_data['job_id'].values
hr_data_job_ids = hr_data['job_id_hr'].values

i, j = np.where(
    (hr_data_dates[:, None] <= (web_data_dates + pd.Timedelta(10, 'S'))) &
    (hr_data_dates[:, None] >= (web_data_dates - pd.Timedelta(10, 'S'))) &
    (hr_data_job_ids[:, None] == web_data_job_ids)
)

overlapping_rows = pd.DataFrame(
    np.column_stack([web_data.values[j], hr_data.values[i]]),
    columns=web_data.columns.append(hr_data.columns)
)
4. Assign new columns to the original web_data dataframe, so we can update these rows with all the information in case any rows overlap
web_data = web_data.assign(candidate_id=np.nan, job_id_hr=np.nan, date_time_submission=np.datetime64('NaT'), job_label=np.nan)
Finally, just update the web_data dataframe (or first create a copy if you don't want to overwrite the original dataframe):
web_data.update(overlapping_rows)
This should be a lot faster than iterating over all rows.
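If this kind of approximate-time join comes up often, pd.merge_asof is also worth a look. A rough sketch (not part of the steps above; it assumes the datetime conversion and the job_id_hr rename from steps 1-2 have been applied):
# both frames must be sorted by their time column for merge_asof
web_sorted = web_data.sort_values('d_date_hour_event')
hr_sorted = hr_data.sort_values('date_time_submission')
merged = pd.merge_asof(
    web_sorted, hr_sorted,
    left_on='d_date_hour_event', right_on='date_time_submission',
    left_by='job_id', right_by='job_id_hr',   # only match within the same job
    direction='nearest',                      # take the nearest submission in time...
    tolerance=pd.Timedelta('10s'))            # ...but at most 10 seconds away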
This is the code I am using (which is not working unless you make the changes I described in the comments):
import numpy as np
import pandas as pd

web_data = {'source': ['Google', 'Facebook', 'Email'],
            'job_id': ['123456', '654321', '010101'],
            'd_date_hour_event': ['2019-11-01 00:09:59', '2019-11-01 00:10:41', '2019-11-01 00:19:20'],
            }
web_data = pd.DataFrame(web_data)

# placed '010101' and '2019-11-01 00:19:22' in second position instead of third position like they used to be;
# if you switch these values back to third position in the 'job_id' and 'date_time_submission' arrays respectively, it should work
hr_data = {'candidate_id': ['ago23ak', 'bli78gro', '123tru456'],
           'job_id': ['675848', '010101', '343434'],
           'date_time_submission': ['2019-11-10 00:24:59', '2019-11-01 00:19:22', '2019-11-09 12:10:41'],
           'job_label': ['HR internship', 'Data Science Supervisor', 'Project Manager']
           }
hr_data = pd.DataFrame(hr_data)

hr_data = hr_data.rename(columns={"job_id": "job_id_hr"})
web_data = web_data.assign(d_date_hour_event=lambda x: pd.to_datetime(x['d_date_hour_event']))
hr_data = hr_data.assign(date_time_submission=lambda x: pd.to_datetime(x['date_time_submission']))

web_data_dates = web_data['d_date_hour_event'].values
hr_data_dates = hr_data['date_time_submission'].values
web_data_job_ids = web_data['job_id'].values
hr_data_job_ids = hr_data['job_id_hr'].values

i, j = np.where(
    (hr_data_dates[:, None] <= (web_data_dates + pd.Timedelta(10, 'S'))) &
    (hr_data_dates[:, None] >= (web_data_dates - pd.Timedelta(10, 'S'))) &
    (hr_data_job_ids[:, None] == web_data_job_ids[:, None])  # note: [:, None] on both sides, unlike the answer above
)

overlapping_rows = pd.DataFrame(
    np.column_stack([web_data.values[j], hr_data.values[i]]),
    columns=web_data.columns.append(hr_data.columns)
)
overlapping_rows
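The likely culprit is the extra [:, None] on web_data_job_ids in the np.where call above: two (3, 1) arrays are compared elementwise instead of broadcasting to a full 3x3 pairwise comparison, so a match is only found when the two rows happen to sit at the same position, which is why reordering the sample data makes it "work". Dropping it restores the broadcast used in the answer:
i, j = np.where(
    (hr_data_dates[:, None] <= (web_data_dates + pd.Timedelta(10, 'S'))) &
    (hr_data_dates[:, None] >= (web_data_dates - pd.Timedelta(10, 'S'))) &
    (hr_data_job_ids[:, None] == web_data_job_ids)   # (n_hr, 1) vs (n_web,) broadcasts to (n_hr, n_web)
)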

SyntaxError comparing a date with a field in Python

Q) Resample the data to get prices for the end of the business month. Select the Adjusted Close for each stock.

import pandas as pd
mmm = pd.read_csv('mmm.csv')
ibm = pd.read_csv('ibm.csv')
fb = pd.read_csv('fb.csv')

amz_date = amz.loc['Date']==2017-6-30  # dataframe showing entire row for that date
amz_price = amz_date[:, ['Date', 'AdjClose']]  # dataframe with only these 2 columns
mmm_date = mmm.loc['Date']==2017-6-30  # dataframe showing entire row for that date
mmm_price = mmm_date[:, ['Date', 'AdjClose']]
ibm_date = ibm.loc['Date']==2017-6-30  # dataframe showing entire row for that date
ibm_price = ibm_date[:, ['Date', 'AdjClose']]
KeyError: 'Date'
What am I doing wrong? Also, 'Date' is a column in the CSV file.
Your particular problem is that "06" is not a legal way to describe a value here. To do arithmetic, you'd need to drop the leading zero.
Your next problem is that 2017-06-30 is (would be) an arithmetic expression that evaluates to the integer 1981. You need to express this as a date, such as datetime.strptime("6-30-2017", "%m-%d-%Y").
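For illustration, a minimal sketch of the boolean-mask version (assuming 'Date' and 'AdjClose' are the column names in mmm.csv):
import pandas as pd
mmm = pd.read_csv('mmm.csv', parse_dates=['Date'])
# compare the Date column against a real date, then keep only the two columns of interest
mmm_price = mmm.loc[mmm['Date'] == '2017-06-30', ['Date', 'AdjClose']]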

Pandas Columns Operations with List

I have a pandas dataframe with two columns: the first with a single date ('action_date') and the second with a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the list in the corresponding 'verification_date' column, and then fill new df columns with the number of dates in verification_date that differ by either more or less than 360 days.
Here is my code:
import pandas as pd

df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
# note: pd.TimeGrouper is deprecated in newer pandas; pd.Grouper(freq='2D') is the equivalent
df = df.groupby(pd.TimeGrouper(freq='2D'))['verification_date'].apply(list).reset_index()

def make_columns(df):
    df = df
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df

make_columns(df)
This kinda works, EXCEPT the df has the same values for each row, which is not true as the dates are different. For example, in the first row of the dataframe there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty, and instead the under_360 column is populated with 1, which is accurate only for the second row of 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem was that you were always updating the whole column with the value of the last calculation in these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want to do instead is set the value for each row's calculation accordingly. You can do this by replacing the lines above with:
df.set_value(i, 'over_360', len(over_360))
df.set_value(i, 'under_360', len(under_360))
What this does is set a value at row i and column over_360 or under_360.
You can learn more about it here.
If you don't like using set_value you can also use this:
df.ix[i, 'over_360'] = len(over_360)
df.ix[i, 'under_360'] = len(under_360)
You can check dataframe.ix here.
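Note that in recent pandas versions both set_value and .ix have been removed, so on a current install the same per-row assignment is usually written with .loc:
df.loc[i, 'over_360'] = len(over_360)
df.loc[i, 'under_360'] = len(under_360)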
You might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do if == 360, so you can just change > or < into >= or <=.
