Pandas calculate last few months data in new column - python

I have a dataframe in the following format:
ID | 01/01/2016 | 02/03/2016 | 02/15/2016 | ........
11 | 100 | 200 | 100 | ........
I am trying to calculate the sum of, e.g., the last 3 months' data in a new column. The expected output should be as follows:
ID | 01/01/2016 | 02/03/2016 | 02/15/2016 | ........ | Last 3 Months
11 | 100 | 200 | 100 | ........ | 300
As a solution, I need to take today's date, compare it with the dates in the column headers, and sum up the matching values. However, I am not sure how to do that. Could you please give some tips?
Thank you.

This is not as straightforward as it may initially seem. You need to decide how to handle year boundaries and the differing number of days in each month. I do this with a simple helper function. You can adjust the code below to meet your needs, but it should get you started.
from __future__ import division, print_function
import calendar
import datetime as dt
import numpy as np
import pandas as pd

def subtract_months(m):
    '''Subtract the specified number of months from today's date.

    Parameters
    ----------
    m : integer
        how many months to subtract from today's date

    Returns
    -------
    date : datetime.date value'''
    yr = dt.date.today().year
    mon = dt.date.today().month - m
    day = dt.date.today().day
    # test whether we went into another year
    if mon <= 0:
        yr -= 1
        mon = 12 + mon
    # test whether we have exceeded the maximum number of days in the month
    if day > calendar.monthrange(yr, mon)[1]:
        day = calendar.monthrange(yr, mon)[1]
    return dt.date(yr, mon, day)

dates = pd.date_range('20160101', '20170101', freq='1D')
data = np.random.randint(0, 100, (5, 367))
df = pd.DataFrame(data=data, index=list('ABCDE'), columns=dates)
# now add a new column
df['Last 3 Months'] = df.T.truncate(before=subtract_months(3), after=dt.date.today()).sum(axis=0)
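As an alternative to that last truncate line (run it instead of, not in addition to, the line above, since all column labels must still be dates), here is a hedged sketch that selects the recent columns directly with a pd.DateOffset mask:
import pandas as pd

cutoff = pd.Timestamp.today() - pd.DateOffset(months=3)  # DateOffset handles month-end clipping
recent = df.columns >= cutoff                            # boolean mask over the date columns of df above
df['Last 3 Months'] = df.loc[:, recent].sum(axis=1)      # row-wise sum of the recent columns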

Related

Calculate difference in days in one single column based on the values in another column (pandas)

I have a pandas df (called df2) like this:
id | orderdate |
___________________
123|2020-11-01 |
123|2020-08-01 |
233|2020-07-01 |
233|2020-11-04 |
444|2020-11-04 |
444|2020-05-03 |
444|2020-04-01 |
444|2020-11-25 |
The values of orderdate are datetime with the format '%Y%m%d'. They represent orders of a client. I want to calculate the delta time between the first order and the second one for each id (each client).
I come up with:
for i in list(set(df2.id)):
    list_sorted = list(set(df2.loc[df2['id'] == i, 'orderdate']))
    list_sorted = sorted(list_sorted)  # get sorted list of the order dates in ascending order
    min_list = list_sorted[0]  # first element is first order
    df2.loc[df2['id'] == i, 'First Order'] = min_list
    if len(list_sorted) > 1:
        penultimate_list = list_sorted[1]  # second element is second order
        df2.loc[df2['id'] == i, 'Second Order'] = penultimate_list
        df2.loc[df2['id'] == i, 'Delta orders'] = penultimate_list - min_list  # calculate delta
    else:
        df2.loc[df2['id'] == i, 'Delta orders'] = None
My expected outcome is:
id | orderdate | First Order | Second Order| Delta Orders
______________________________________________
123|2020-11-01 |2020-08-01 | 2020-11-01 | 92 days
123|2020-08-01 |2020-08-01 | 2020-11-01 | 92 days
233|2020-07-01 |2020-07-01 | 2020-11-04 | 126 days
233|2020-11-04 |2020-07-01 | 2020-11-04 | 126 days
444|2020-11-04 |2020-04-01 | 2020-05-03 | 32 days
444|2020-05-03 |2020-04-01 | 2020-05-03 | 32 days
444|2020-04-01 |2020-04-01 | 2020-05-03 | 32 days
444|2020-11-25 |2020-04-01 | 2020-05-03 | 32 days
It works but I feel like it's cumbersome. Any easier way to do it?
Slightly different from what you want, but it's a start:
import pandas as pd
from io import StringIO
data = StringIO(
"""id|orderdate
123|2020-11-01
123|2020-08-01
233|2020-07-01
233|2020-11-04
444|2020-11-04
444|2020-05-03
444|2020-04-01
444|2020-11-25 """)
df = pd.read_csv(data, sep='|')
df['orderdate'] = pd.to_datetime(df['orderdate'], infer_datetime_format=True)
df = df.sort_values(['id', 'orderdate'], ascending=False)
def date_diff(df):
    df['order_time_diff'] = (df['orderdate'] - df['orderdate'].shift(-1)).dt.days
    df = df.dropna()
    return df
# this calculates all order differences
df.groupby('id').apply(date_diff)
# this will get the data as requested
df.groupby('id', as_index=False).apply(date_diff).groupby('id').tail(1)
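If you need exactly the First Order / Second Order / Delta columns from the expected outcome, here is a hedged groupby-based sketch of my own (it reuses the df parsed above; the NaT fallback for clients with a single order is an assumption):
import pandas as pd

grouped = df.groupby('id')['orderdate']
df['First Order'] = grouped.transform('min')  # earliest order per client
df['Second Order'] = grouped.transform(       # second-earliest order per client
    lambda s: s.sort_values().iloc[1] if len(s) > 1 else pd.NaT)
df['Delta orders'] = df['Second Order'] - df['First Order']  # positive timedelta, e.g. "92 days"
print(df)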

multiple columns to single datetime dataframe column

I have a data frame that contains (among others) columns for the time of day (00:00-23:59:59), day (1-7), month (1-12), and year (2000-2019). How can I combine the values of these columns, row by row, into a new DateTime object and then store these new date-times in a new column? I've read other posts pertaining to such a task, but they all seem to involve converting one date column into one DateTime column, whereas I have 4 columns that need to be transformed into a DateTime. Any help is appreciated!
e.g.
| 4:30:59 | 1 | 1 | 2000 | becomes 2000/1/1 4:30:59
This is the only code I have so far, which probably doesn't do anything:
#creating datetime object (MISC)
data = pd.read_csv('road_accidents_data_clean.csv',delimiter=',')
df = pd.DataFrame(data)
format = '%Y-%m-%d %H:%M:%S'
n = 0
df['datetime'] = data.loc[n,'Crash_Day'],data.loc[n,'Crash_Month'],data.loc[n,'Year']
My DataFrame is laid out as follows:
Index | Age | Year | Crash_Month | Crash_Day | Crash_Time | Road_User | Gender |
0 37 2000 1 1 4:30:59 DRIVER MALE
1 42 2000 1 1 7:45:10 DRIVER MALE
2 25 2000 1 1 10:15:30 PEDESTRIAN FEMALE
Crash_Type | Injury_Severity | Crash_LGA | Crash_Area_Type | Datetime |
UNKNOWN 1 YARRA MELBOURNE NaN
OVERTAKING 1 YARRA MELBOURNE NaN
ADJACENT DIR 0 MONASH MELBOURNE NaN
NOTE: the dataframe is 13 columns wide; I just couldn't fit them all on one line, so Crash_Type starts to the right of Gender.
Below is the code I've been suggested to use, along with my adaptation of it:
df = pd.DataFrame(dict(
    Crash_Time=['4:30:59','4:20:00'],
    Crash_Day=[1,20],
    Crash_Month=[1,4],
    Year=[2000,2020],
))
data['Datetime'] = df['Datetime'] = pd.to_datetime(
    np.sum([
        df['Year'].astype(str),
        '-',
        df['Crash_Month'].astype(str),
        '-',
        df['Crash_Day'].astype(str),
        ' ',
        df['Crash_Time'],
    ]),
    format='%Y-%m-%d %H:%M:%S',
)
I've adapted this code in order to combine the values for the datetime column with my original dataframe.
Combine the columns into a single series of strings using + (converting to str where needed with the pandas.Series.astype method), then pass that new series into pd.to_datetime before assigning it to a new column in your df:
import pandas as pd
df = pd.DataFrame(dict(time=['4:30:59'],date=[1],month=[1],year=[2000]))
df['datetime'] = pd.to_datetime(
    df['year'].astype(str)+'-'+df['month'].astype(str)+'-'+df['date'].astype(str)+' '+df['time'],
    format='%Y-%m-%d %H:%M:%S',
)
print(df)
example in python tutor
Edit: you can also use numpy.sum to make that one long line of column additions easier on the eyes:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(
    time=['4:30:59','4:20:00'],
    date=[1,20],
    month=[1,4],
    year=[2000,2020],
))
df['datetime'] = pd.to_datetime(
    np.sum([
        df['year'].astype(str),
        '-',
        df['month'].astype(str),
        '-',
        df['date'].astype(str),
        ' ',
        df['time'],
    ]),
    format='%Y-%m-%d %H:%M:%S',
)
sum example in python tutor
Edit 2: using your actual column names, it should be something like this:
import pandas as pd
import numpy as np
'''
Index | Age | Year | Crash_Month | Crash_Day | Crash_Time | Road_User | Gender |
0 37 2000 1 1 4:30:59 DRIVER MALE
Crash_Type | Injury_Severity | Crash_LGA | Crash_Area_Type | Datetime |
UNKNOWN 1 YARRA MELBOURNE NaN
'''
df = pd.DataFrame(dict(
    Crash_Time=['4:30:59','4:20:00'],
    Crash_Day=[1,20],
    Crash_Month=[1,4],
    Year=[2000,2020],
))
df['Datetime'] = pd.to_datetime(
    np.sum([
        df['Year'].astype(str),
        '-',
        df['Crash_Month'].astype(str),
        '-',
        df['Crash_Day'].astype(str),
        ' ',
        df['Crash_Time'],
    ]),
    format='%Y-%m-%d %H:%M:%S',
)
print(df)
another python tutor link
One thing to note: you might want to double-check whether your csv file is separated by just a comma, or by a comma plus a space. You may need to load the data with df = pd.read_csv('road_accidents_data_clean.csv', sep=', ') if there is an extra space separating the data in addition to the comma; you don't want that extra space ending up in your data.
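If the file does turn out to have a space after each comma, a hedged alternative is pandas' skipinitialspace option, which keeps the single-character separator (the file name below is the one from your question):
import pandas as pd

# skipinitialspace drops the blank that follows each comma
data = pd.read_csv('road_accidents_data_clean.csv', sep=',', skipinitialspace=True)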

How would I iterate through a list of lists and perform computational filtering?

This one is a bit of a doozy.
At a high level, I'm trying to figure out how to run a nested for loop. I'm essentially trying to iterate through columns and rows, and perform a computational check to make sure outcomes match a specified requirement - if so, they loop to the next row, if not, they are kicked out and the loop moves onto the next user.
Specifically, I want to perform a T-Test between a control/treatment group of users, and make sure the result is less than a pre-determined value.
Example:
I have my table of values - "DF" - there are 7 columns. The user_id column specifies the user's unique identifier. The user_type column is a binary classifier, users can be of either T (treatment) or C (control) types. The 3 "hour" columns are dummy number columns, values that I'll perform computation on. The mon column is the month, and tval is the number that the computation will have to be less than to be accepted.
In this case, the month is all January data. Each month can have a different tval.
DF
| user_id | user_type | hour1 | hour2 | hour3 | mon | tval |
|---------|-----------|-------|-------|-------|-----|------|
| 4 | T | 1 | 10 | 100 | 1 | 2.08 |
| 5 | C | 2 | 20 | 200 | 1 | 2.08 |
| 6 | C | 3 | 30 | 300 | 1 | 2.08 |
| 7 | T | 4 | 40 | 400 | 1 | 2.08 |
| 8 | T | 5 | 50 | 500 | 1 | 2.08 |
My goal is to iterate through each T user - and for each, loop through each C user. For each "Pair", I want to perform computation (t-test) between their hour 1 values. If the value is less than the tval, move to hour2 values, etc. If not, it gets kicked out and the loop moves to the next C user without completing that C user's loop. If it passes all value checks, the user_ids of each would be appended to a list or something external.
The output would hopefully look like a table of pairs. The T user and C user that have successfully iterated through all hour columns, and the month that passed (as each set of users have data for all 12 months).
Output:
| t_userid | c_userid | month |
|--------- |-----------|-------|
| 4 | 5 | 1 |
| 8 | 6 | 1 |
To sum it all up:
For each T user:
For each C user:
If t-test on t.hour1 and c.hour1 is less than X number (passing test):
move to next hour (hour2) and repeat
If all hours pass, add pair (T user_id, c_user_id) to separate list/series/df,etc
else: skip following hours and move to next C user.
I'm wondering if my data format is also incorrect. Would this be easier if I unpivoted my hourly data and iterated over each row? Any help is greatly appreciated. Thanks, and let me know if any clarification is necessary.
EDIT:
So far I've split the data between Treat and Control groups, and calculated the average and standard deviation of each user's monthly data (which is normally broken down by day), adding them as columns hour1_avg and hour1_stdev. I've attempted another for loop, but am getting a ValueError.
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I know this is due to the fact that I can't compare a pandas Series to a float, int, str, etc. I will make another post addressing this question.
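To illustrate, a minimal example that reproduces the same error:
import pandas as pd

s = pd.Series([1.0, 2.0]) > 1.5  # elementwise comparison yields a boolean Series
# if s: ...                      # raises "The truth value of a Series is ambiguous"
print(s.any(), s.all())          # reducing to a single bool avoids the error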
Here's what I have so far:
for i in treatment.user_id:
    for j in control.user_id:
        if np.absolute((treatment['hour1_avg'] - control['hour1_avg'])/np.sqrt((treatment['hour1_stdev']**2/31)+(control['hour1_stdev']**2/31))) > treatment.tval:
            "End loop, move to next control user"
        else:
            "copy paste if statement above, but for hour2, etc etc"
1. Split the dataframe into control and treatment groups.
2. Join the resulting dataframes on a constant field (this will create all pairs).
3. Use a combination of apply and any to make the decision.
4. Filter the join using the decision vector.
Code to illustrate the idea:
# assuming the input is in df
control = df[df['user_type'] == 'C']
treatment = df[df['user_type'] == 'T']

# part 2: pairs will be created month-wise.
# If you want all vs all, create a temp field instead, e.g.: control['_temp'] = 1
pairs = treatment.merge(control, left_on='mon', right_on='mon')

# part 3
def test(row):
    # all() stops evaluating at the first False
    return all(
        row['hour%d_x' % i] - row['hour%d_y' % i] < row['tval_x']
        for i in range(1, 4))

# all_less is a series of bool
all_less = pairs.apply(test, axis=1)

# part 4
output = pairs.loc[all_less, ['user_id_x', 'user_id_y', 'mon']].rename(
    columns={'user_id_x': 't_user_id', 'user_id_y': 'c_user_id'})
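For reference, a hedged reconstruction of the sample DF from the question that the snippet above assumes as its input df:
import pandas as pd

df = pd.DataFrame({
    'user_id':   [4, 5, 6, 7, 8],
    'user_type': ['T', 'C', 'C', 'T', 'T'],
    'hour1':     [1, 2, 3, 4, 5],
    'hour2':     [10, 20, 30, 40, 50],
    'hour3':     [100, 200, 300, 400, 500],
    'mon':       [1, 1, 1, 1, 1],
    'tval':      [2.08, 2.08, 2.08, 2.08, 2.08],
})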

How do I create a heatmap from two columns plus the value of those two in Python?

Thank you for helping!
I would like to generate a heatmap in python, from the data df.
(i am using pandas, seaborn, numpy, and matplotlib in my project)
The dataframe df looks like:
index | a | b | c | year | month
0 | | | | 2013 | 1
1 | | | | 2015 | 4
2 | | | | 2016 | 10
3 | | | | 2017 | 1
In the dataset, each row is a ticket.
The dataset is big (51 columns and 100k+ rows), so a, b, c are just there to show some random columns (for month: 1 = Jan, 2 = Feb, ...).
For the heatmap:
x-axis = year,
y-axis = month,
value = a count of the number of rows, i.e. how many tickets were given in that year and month.
The result I imagine should look something like this example from the seaborn documentation:
https://seaborn.pydata.org/_images/seaborn-heatmap-4.png
I am new to coding; I tried a lot of random things I found on the internet and have not been able to make it work.
Thank you for helping!
This should do (with generated data):
import pandas as pd
import seaborn as sns
import random
y = [random.randint(2013,2017) for n in range(2000)]
m = [random.randint(1,12) for n in range(2000)]
df = pd.DataFrame([y,m]).T
df.columns=['y','m']
df['count'] = 1
df2 = df.groupby(['y','m'], as_index=False).count()
df_p = pd.pivot_table(df2,'count','m','y')
sns.heatmap(df_p)
You probably won't need the count column in your own data, but I added it because I needed an extra column for the groupby count to work.
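With the question's own year and month columns, a hedged alternative is pd.crosstab, which counts the row combinations directly and skips the helper count column:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# df is assumed to already hold one row per ticket, with 'year' and 'month' columns
counts = pd.crosstab(df['month'], df['year'])  # rows: month, columns: year, values: ticket counts
sns.heatmap(counts, annot=True, fmt='d')
plt.show()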

Using Pandas to join and append columns in a loop

I want to append columns from tables generated in a loop to a dataframe. I was hoping to accomplish this using pandas.merge, but it doesn't seem to be working out for me.
My code:
from datetime import date
from datetime import timedelta
import pandas
import numpy
import pyodbc
date1 = date(2017, 1, 1) #Starting Date
date2 = date(2017, 1, 10) #Ending Date
DateDelta = date2 - date1
DateAdd = DateDelta.days
StartDate = date1
count = 1
# Create the holding table
conn = pyodbc.connect('Server Information')
basetable = pandas.read_sql("SELECT....")
while count <= DateAdd:
    print(StartDate)
    datatable = pandas.read_sql("SELECT...WHERE Date = "+str(StartDate)+"...")
    finaltable = basetable.merge(datatable,how='left',left_on='OrganizationName',right_on='OrganizationName')
    StartDate = StartDate + timedelta(days=1)
    count = count + 1
print(finaltable)
Shortened the select statements for brevity's sake, but the tables produced look like this:
Basetable:
School_District
---------------
District_Alpha
District_Beta
...
District_Zed
Datatable:
School_District|2016-01-01|
---------------|----------|
District_Alpha | 400 |
District_Beta | 300 |
... | 200 |
District_Zed | 100 |
I have the datatable written so the column takes the name of the date selected for that particular loop, so column names will be unique once I get this up and running. My problem, however, is that the above code only produces one column of data. I have a good guess as to why: only the last merge is being kept. I thought using pandas.append would be the way to get around that, but pandas.append doesn't "join" like merge does. Is there some other way to accomplish a sort of join-and-append using Pandas? My goal is to keep this flexible so that other dates can be easily input depending on our data needs.
In the end, what I want to see is:
School_District|2016-01-01|2016-01-02|... |2016-01-10|
---------------|----------|----------|-----|----------|
District_Alpha | 400 | 1 | | 45 |
District_Beta | 300 | 2 | | 33 |
... | 200 | 3 | | 5435 |
District_Zed | 100 | 4 | | 333 |
Your error is in the statement finaltable = basetable.merge(datatable, ...). At each loop iteration you merge the original basetable with the new datatable, store the result in finaltable... and then discard it on the next pass. What you need is basetable = basetable.merge(datatable, ...), so each day's column accumulates. No finaltable needed.
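A minimal sketch of that fix, keeping the truncated SQL strings and the merge keys from the question as placeholders:
basetable = pandas.read_sql("SELECT....")  # placeholder query from the question
while count <= DateAdd:
    datatable = pandas.read_sql("SELECT...WHERE Date = "+str(StartDate)+"...")
    # reassign basetable so each day's column accumulates instead of being discarded
    basetable = basetable.merge(datatable, how='left',
                                left_on='OrganizationName', right_on='OrganizationName')
    StartDate = StartDate + timedelta(days=1)
    count = count + 1
print(basetable)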
