Gaps in dates while rolling up quarters into a single row - python

I am attempting to roll up rows from a data set with similar measures into a consolidated row. Two conditions must be met for the roll-up:
1. The measures (measure1 through measure5) must remain the same across the rows for them to be rolled up into a single row.
2. The dates must be continuous (no gaps between one row's end_date and the next row's begin_date).
If these conditions are not met, the code should keep the row separate.
This is the sample data that I am using:
id,measure1,measure2,measure3,measure4,measure5,begin_date,end_date
ABC123XYZ789,1,1,1,1,1,1/1/2019,3/31/2019
ABC123XYZ789,1,1,1,1,1,4/23/2019,6/30/2019
ABC123XYZ789,1,1,1,1,1,7/1/2019,9/30/2019
ABC123XYZ789,1,1,1,1,1,10/12/2019,12/31/2019
FGH589J6U88SW,1,1,1,1,1,1/1/2019,3/31/2019
FGH589J6U88SW,1,1,1,1,1,4/1/2019,6/30/2019
FGH589J6U88SW,1,1,1,2,1,7/1/2019,9/30/2019
FGH589J6U88SW,1,1,1,2,1,10/1/2019,12/31/2019
253DRWQ85AT2F334B,1,2,1,3,1,1/1/2019,3/31/2019
253DRWQ85AT2F334B,1,2,1,3,1,4/1/2019,6/30/2019
253DRWQ85AT2F334B,1,2,1,3,1,7/1/2019,9/30/2019
253DRWQ85AT2F334B,1,2,1,3,1,10/1/2019,12/31/2019
The expected result should be:
id,measure1,measure2,measure3,measure4,measure5,begin_date,end_date
ABC123XYZ789,1,1,1,1,1,1/1/2019,3/31/2019
ABC123XYZ789,1,1,1,1,1,4/23/2019,9/30/2019
ABC123XYZ789,1,1,1,1,1,10/12/2019,12/31/2019
FGH589J6U88SW,1,1,1,1,1,1/1/2019,6/30/2019
FGH589J6U88SW,1,1,1,2,1,7/1/2019,12/31/2019
253DRWQ85AT2F334B,1,2,1,3,1,1/1/2019,12/31/2019
I have implemented the code below, which seems to address condition #1, but I am looking for ideas on how to incorporate condition #2 into the solution.
import pandas as pd
import time
startTime=time.time()
data=pd.read_csv('C:\\Users\\usertemp\\Data\\Rollup2.csv')
data['end_date']= pd.to_datetime(data['end_date'])
data['begin_date']= pd.to_datetime(data['begin_date'])
data = data.groupby(['id','measure1','measure2', 'measure3', 'measure4', 'measure5']) \
['begin_date', 'end_date'].agg({'begin_date': ['min'], 'end_date': ['max']}).reset_index()
print(data)
print("It took %s seconds for the collapse process" % (time.time() - startTime))
Any help is appreciated.

You can do the following.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# df is the data read in above; convert begin_date and end_date to datetime
df['begin_date'] = pd.to_datetime(df['begin_date'], format='%m/%d/%Y')
df['end_date'] = pd.to_datetime(df['end_date'], format='%m/%d/%Y')
# New column holding the previous row's end_date + 1 day
# (the shift runs over the whole frame, so rows are assumed to be sorted by id and begin_date)
df['end_date_prev'] = df['end_date'].iloc[:-1] + timedelta(days=1)
df['end_date_prev'] = np.roll(df['end_date_prev'], 1)
# Counter that increments whenever begin_date does not follow the previous end_date (a gap);
# rows sharing the same counter value are contiguous
df['cont'] = (~(df['begin_date'] == df['end_date_prev'])).astype(int).cumsum()
# Since all measures need to match, build a single string column combining measure1-measure5
df['comb_measure'] = df['measure1'].astype(str).str.cat(df[['measure{}'.format(i) for i in range(2, 6)]].astype(str))
# Roll up: within each id / measure combination / contiguous block, keep the first begin_date and the last end_date
new_df = df.groupby(['id', 'comb_measure', 'cont']).agg(
    {'measure1': 'first', 'measure2': 'first', 'measure3': 'first', 'measure4': 'first', 'measure5': 'first',
     'begin_date': 'first', 'end_date': 'last'})
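To get a flat table like the expected output, the grouping keys can be reset and the helper columns dropped afterwards. A minimal sketch, assuming a reasonably recent pandas (drop(columns=...) needs pandas 0.21+):

new_df = (new_df
          .reset_index()
          .drop(columns=['comb_measure', 'cont']))
print(new_df)

Since the end_date_prev shift runs over the whole frame, each id's first row is compared against the previous id's last end_date; because id is also a groupby key this does not change the result here, but computing the shift per id with df.groupby('id')['end_date'].shift(1) makes the intent explicit.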

Related

Pandas dataframe - how to calculate the difference of end and start values for a list of items

The task is to create a function that takes two inputs: (1) a list of stock codes and (2) the start date of the calculation. The function should return a list of the names of the 10 stocks with the highest percentage return.
import numpy as np
import pandas as pd
import yfinance as yf
#Stocks.xlsx contains the field "Stock Code"
stock_list = pd.read_excel("Stocks.xlsx")
stock_code_List = stock_list["Stock code"]
start_date = "2019-01-01"
def top_10_stocks(start_date, stock_code):
    for i in stock_code:
        yf.download("IBM", start_date)["Close"]
        print(i)
    %Gain_Loss = ((df.iloc[-1]-df.iloc[0])/df.iloc[0])*100
    sort_value = %Gain_Lost.sort()
This is where I got stuck.
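One possible way to finish it (not from the thread, just a hedged sketch): download each ticker's closing prices, compute the percentage gain from the first to the last close, and take the ten largest. The gains dictionary, the use of the loop variable as the ticker (instead of the hard-coded "IBM"), and the assumption that yf.download() returns a frame whose 'Close' column is a plain Series for a single ticker are all mine:

import pandas as pd
import yfinance as yf

def top_10_stocks(start_date, stock_codes):
    gains = {}
    for code in stock_codes:
        # Closing prices for this ticker from start_date onward
        close = yf.download(code, start=start_date)["Close"]
        if close.empty:
            continue
        # Percentage return from the first to the last available close
        gains[code] = (close.iloc[-1] - close.iloc[0]) / close.iloc[0] * 100
    # Names of the ten largest percentage returns, best first
    return pd.Series(gains).nlargest(10).index.tolist()

top_names = top_10_stocks(start_date, stock_code_List)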

Change date format

I have this code where I wish to change the date format, but I only manage to change one value and not the whole dataset.
Code:
import pandas as pd
df = pd.read_csv ("data_q_3.csv")
result = df.groupby ("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_column', None)
print ("Covid 19 top 10 countries based on confirmed case:")
print(result)
from datetime import datetime
datetime.fromisoformat("2020-03-18T12:13:09").strftime("%Y-%m-%d-%H:%M")
Does anyone know how to fix the code so that the datetime changes in the whole dataset?
Thanks!
After looking at your problem for a while, I figured out how to change the values in the 'DateTime' column. The only problem that may arise is if the 'Country/Region' column has duplicate location names.
Editing the time is simple; all you have to do is make use of Python's slicing. You can slice a string by typing
string = 'abcdefghijklmnopqrstuvwxyz'
print(string[0:5])
which will result in abcde.
Below is the finished code.
import pandas as pd
# read the data
df = pd.read_csv("data_q_3.csv")
# top 10 countries by confirmed cases
result = df.groupby("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_column', None)
# loop over every row of the column
for row in result.index:
    # get the currently stored time
    time = result.at[row, 'DateTime']
    # reformat the time string by slicing it
    # from index 0 to 10 and from index 12 to 16
    # and putting a dash in the middle
    time = time[0:10] + "-" + time[12:16]
    # store the new time back in the result
    result.at[row, 'DateTime'] = time
# print result
print("Covid 19 top 10 countries based on confirmed case:")
print(result)
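If the 'DateTime' column is in a format that pandas can parse, the whole column can also be reformatted without a loop. A hedged sketch, assuming result is the frame built above and its 'DateTime' values are parseable by pd.to_datetime:

import pandas as pd

# Parse the whole column at once, then render it in the desired format
result['DateTime'] = pd.to_datetime(result['DateTime']).dt.strftime('%Y-%m-%d-%H:%M')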

Dataframe drop between_time multiple rows by shifting timedelta

I would like to drop multiple groups of rows by a time criterion; the date part can be ignored.
I have a dataframe that contains 100 million rows, with a sampling interval of around 0.001 s, although it varies between columns.
The goal is to drop groups of rows in a repeating pattern: keep rows for a "leave" duration (e.g. 0.01 s), then drop rows for a "drop" duration (e.g. 0.1 s), and so on (the original post illustrated this with a figure).
I have had problems with Timestamp-to-Time conversions and with defining a one-liner that drops multiple groups of rows.
I have tried the following code:
import pandas as pd
from datetime import timedelta#, timestamp
from datetime import datetime
import numpy as np
# leave_duration=0.01 seconds
# drop_duration=0.1 seconds
i = pd.date_range('2018-01-01 00:01:15.004', periods=1000, freq='2ms')
i=i.append(pd.date_range('2018-01-01 00:01:15.004', periods=1000, freq='3ms'))
i=i.append(pd.date_range('2018-01-01 00:01:15.004', periods=1000, freq='0.5ms'))
df = pd.DataFrame({'A': range(len(i))}, index=i)
df=df.sort_index()
minimum_time=df.index.min()
print("Minimum time:",minimum_time)
maximum_time=df.index.max()
print("Maximum time:",maximum_time)
# futuredate = minimum_time + timedelta(microseconds=100)
print("Dataframe before dropping:\n",df)
df.drop(df.between_time(*pd.to_datetime([minimum_time, maximum_time]).time).index, inplace=True)
print("Dataframe after dropping:\n",df)
# minimum_time=str(minimum_time).split()
# minimum_time=minimum_time[1]
# print(minimum_time)
# maximum_time=str(maximum_time).split()
# maximum_time=maximum_time[1]
# print(maximum_time)
How can I drop rows by this kind of repeating time criterion?
This works for me:
df = df.loc[(df.index - df.index[0]) % pd.to_timedelta('110ms') > pd.to_timedelta('100ms')]
The offset of each row from the first timestamp is folded into a repeating 110 ms cycle (100 ms drop + 10 ms leave), and only rows that fall in the last 10 ms of each cycle are kept.
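The same idea with the leave/drop durations named explicitly, as a hedged sketch (the variable names are mine; df is the dataframe from the question):

import pandas as pd

leave = pd.to_timedelta('10ms')   # keep rows for 0.01 s
drop = pd.to_timedelta('100ms')   # then drop rows for 0.1 s
cycle = leave + drop

# Position of each row inside its cycle, measured from the first timestamp
offset = (df.index - df.index[0]) % cycle
# Keep only the rows that fall in the trailing "leave" part of each cycle
df = df.loc[offset > drop]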
I *think* this is what you're looking for. If not, it hopefully gets you closer.
I define the drop periods by taking the minimum time and incrementing it by your leave/drop times, appending each period to a dictionary where the key is the start of the drop period and the value is its end.
Lastly I iterate through the dictionary and drop the rows that fall between those two times in your dataframe, shedding rows at each step.
drop_periods = {}
start_drop = minimum_time + timedelta(seconds=0.01)
end_drop = start_drop + timedelta(seconds=0.1)
drop_periods[start_drop] = end_drop
while end_drop < maximum_time:
    start_drop = end_drop + timedelta(seconds=0.01)
    end_drop = start_drop + timedelta(seconds=0.1)
    drop_periods[start_drop] = end_drop
for start, end in drop_periods.items():
    print("Dataframe length before dropping:", len(df))
    df.drop(df.between_time(*pd.to_datetime([start, end]).time).index, inplace=True)
    print("Dataframe length after dropping:", len(df))

Python loop a whole day, and when current time equals to a given time, do something

I have a Pandas DataFrame df with three columns (time, from and to). I want to execute a function that loops over df['time']; when the current time equals a time in df['time'], it should call another function, e.g. print something. Each row should be executed only once. With the real data, the script will run for 24 hours in the cloud.
import pandas as pd
df = pd.DataFrame({'time': ['08:35', '09:35', '09:45', '10:10'],
                   'from': ['SHH', 'SZH', 'WXH', 'ZJH'],
                   'to': ['NJH', 'NJH', 'NJH', 'NJH']})
df
  from   time   to
0  SHH  08:35  NJH
1  SZH  09:35  NJH
2  WXH  09:45  NJH
3  ZJH  10:10  NJH
For example, when the current time is 08:35, print "Time is reached, train from SHH to NJH", and when the current time is 09:35, print "Time is reached, train from SZH to NJH". I don't know how to modify the code below to do this job. Any help is appreciated.
import datetime
import time

def ex(a, b):
    print("Time is reached. train from {} to {}".format(a, b))

time_ls = list(df['time'])
from_ls = list(df['from'])
to_ls = list(df['to'])

def run():
    for i in range(len(df['time'])):
        while time.strftime("%H:%M", time.localtime()) == df['time'][i]:
            time_ls.remove(df['time'][i])
            yield ex(from_ls[i], to_ls[i])
If you want to extract rows from a pandas.DataFrame that meet a certain condition, you want to slice the DataFrame instead of manually iterating over all of its rows and checking that condition on your own. The pandas implementation for this is way faster than any manual attempt.
Once you got the dataframe that only contains the rows that match the current time (hour and minute), you can iterate over that smaller DataFrame and print the result for each of its rows (since you know that it only contains those rows that match).
See the following example:
from datetime import datetime as dt
import pandas as pd

if __name__ == '__main__':
    df = pd.DataFrame({
        'time': ['08:35', '09:35', '09:45', '10:10'],
        'from': ['SHH', 'SZH', 'WXH', 'ZJH'],
        'to': ['NJH', 'NJH', 'NJH', 'NJH']})

    ct = dt.strftime(dt.now(), '%H:%M')  # Get current hours and minutes
    dn = df.loc[df['time'] == ct]        # Slice DataFrame based on the 'time' column

    for row in dn.iterrows():
        # Iterate over all rows that meet the condition and print them
        print('{time:s} -- train from {from:s} to {to:s}.'.format(**dict(row[1])))
It yields (if the hour and minute of the ct string match 08:35, as in this example):
08:35 -- train from SHH to NJH.
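Since the question asks for something that keeps running for a whole day, the check above can be wrapped in a simple polling loop. This is not from the original answer, just a hedged sketch; the 30-second sleep and the seen set (so each row fires only once) are my additions, and df is assumed to be the DataFrame above:

import time
from datetime import datetime as dt

seen = set()  # index labels of rows that have already been announced

while True:
    ct = dt.strftime(dt.now(), '%H:%M')
    # Rows whose time matches the current minute and that have not fired yet
    dn = df.loc[(df['time'] == ct) & (~df.index.isin(seen))]
    for idx, row in dn.iterrows():
        print('{} -- train from {} to {}.'.format(row['time'], row['from'], row['to']))
        seen.add(idx)
    time.sleep(30)  # poll twice a minute so no minute is missed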

Spatial temporal query in python with many records

I have a dataframe of 600,000 x/y points with date-time information, along with another field, 'status', that carries extra descriptive information.
My objective is, for each record, to sum the 'status' column over all other records that fall within a certain spatio-temporal buffer: within the previous 8 hours and less than 100 meters away.
Currently I have the data in a pandas data frame.
I could loop through the rows and, for each record, subset the dates of interest, then calculate distances and restrict the selection further. However, that is still quite slow with so many records.
This takes 4.4 hours to run.
I can see that I could create a 3-dimensional k-d tree with x, y, and the date as epoch time. However, I am not certain how to restrict the distances properly when mixing dates and geographic distances.
Here is some reproducible code for you guys to test on:
Import
import numpy as np
import numpy.random as npr
import pandas as pd
from pandas import DataFrame, date_range
from datetime import datetime, timedelta
Create data
np.random.seed(111)
Function to generate test data
def CreateDataSet(Number=1):
    Output = []
    for i in range(Number):
        # Create a date range with hourly frequency
        date = date_range(start='10/1/2012', end='10/31/2012', freq='H')
        # Create long/lat data
        laty = npr.normal(4815862, 5000, size=len(date))
        longx = npr.normal(687993, 5000, size=len(date))
        # Status of interest
        status = [0, 1]
        # Make a random list of statuses
        random_status = [status[npr.randint(low=0, high=len(status))] for i in range(len(date))]
        # User pool
        user = ['sally', 'derik', 'james', 'bob', 'ryan', 'chris']
        # Make a random list of users
        random_user = [user[npr.randint(low=0, high=len(user))] for i in range(len(date))]
        Output.extend(zip(random_user, random_status, date, longx, laty))
    return pd.DataFrame(Output, columns=['user', 'status', 'date', 'long', 'lat'])
#Create data
data = CreateDataSet(3)
len(data)
#some time deltas
before = timedelta(hours = 8)
after = timedelta(minutes = 1)
Function to speed up
def work(df):
    output = []
    # Loop through the data by index
    for i in range(0, len(df)):
        l = []
        # First filter the data by date to get a smaller set to compute distances for:
        # mask of all dates within the window around date i
        date_mask = (df['date'] >= df['date'].iloc[i] - before) & (df['date'] <= df['date'].iloc[i] + after)
        # Mask of all users who are not user i (themselves)
        user_mask = df['user'] != df['user'].iloc[i]
        # Apply both masks
        dists_to_check = df[date_mask & user_mask]
        # For point i, create the coordinate to calculate distances from
        a = np.array((df['long'].iloc[i], df['lat'].iloc[i]))
        # Array of coordinates to check against on the masked data
        b = np.array((dists_to_check['long'].values, dists_to_check['lat'].values))
        # For j in the date-queried data
        for j in range(1, len(dists_to_check)):
            # Compute the Euclidean distance between point a and each point of b
            x = np.linalg.norm(a - np.array((b[0][j], b[1][j])))
            # If the distance is within the range of interest, record the index
            if x <= 100:
                l.append(j)
        try:
            # Use the list of desired indexes 'l' to query the final subset of the data
            data = dists_to_check.iloc[l]
            # Summarize the column of interest and append it to the output list
            output.append(data['status'].sum())
        except IndexError:
            output.append(0)
            # print("There were no data to add")
    return pd.DataFrame(output)
Run code and time it
start = datetime.now()
out = work(data)
print(datetime.now() - start)
Is there a way to do this query in a vectorized way, or should I be chasing another technique?
<3
Here is what at least partly solves my problem. Since the loop can operate on different parts of the data independently, parallelization makes sense here.
Using IPython.parallel:
from IPython.parallel import Client
cli = Client()
cli.ids
cli = Client()
dview=cli[:]
with dview.sync_imports():
    import numpy as np
    import os
    from datetime import timedelta
    import pandas as pd
# We also need to move the time deltas and the output list into the function as
# local variables, and add the IPython.parallel decorator
@dview.parallel(block=True)
def work(df):
    before = timedelta(hours=8)
    after = timedelta(minutes=1)
    output = []
    # ... the rest of the body is the same as the work() function above ...
Final time: 1:17:54.910206, about 1/4 of the original run time.
I would still be very interested in any suggestions for small speed improvements within the body of the function.
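Not from the thread, just a hedged sketch of the k-d tree idea raised in the question: build a scipy cKDTree on the projected x/y coordinates to pull the spatial neighbours within 100 m in one call, then filter those candidates by the 8-hour window and the user check. Column names match the sample data above; everything else (the function name, the use of numpy timedeltas) is my assumption:

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def status_sum_kdtree(df, radius=100,
                      before=np.timedelta64(8, 'h'), after=np.timedelta64(1, 'm')):
    pts = df[['long', 'lat']].values
    # Spatial index over the projected coordinates (metres)
    tree = cKDTree(pts)
    # For every point, the indexes of all points within `radius` metres (including itself)
    neighbours = tree.query_ball_point(pts, r=radius)

    dates = df['date'].values
    users = df['user'].values
    status = df['status'].values

    out = np.zeros(len(df))
    for i, idx in enumerate(neighbours):
        idx = np.asarray(idx, dtype=int)
        # Temporal window around record i, excluding the same user (and therefore the record itself)
        in_window = (dates[idx] >= dates[i] - before) & (dates[idx] <= dates[i] + after)
        other_user = users[idx] != users[i]
        out[i] = status[idx[in_window & other_user]].sum()
    return pd.DataFrame(out)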
