I have an existing dataframe which looks like:
id start_date end_date
0 1 20170601 20210531
1 2 20181001 20220930
2 3 20150101 20190228
3 4 20171101 20211031
I am trying to add 85 columns to this dataframe, one for each month/year between 20120101 and 20190101: a column gets 1 if that month falls within the row's start_date-to-end_date range, and 0 otherwise.
I tried the following method:
from collections import OrderedDict
from datetime import datetime, timedelta

import pandas as pd

start, end = [datetime.strptime(_, "%Y%m%d") for _ in ['20120101', '20190201']]
global_list = list(OrderedDict(((start + timedelta(_)).strftime(r"%m/%y"), None)
                               for _ in range((end - start).days)).keys())

def get_count(contract_start_date, contract_end_date):
    start, end = [datetime.strptime(_, "%Y%m%d") for _ in [contract_start_date, contract_end_date]]
    current_list = list(OrderedDict(((start + timedelta(_)).strftime(r"%m/%y"), None)
                                    for _ in range((end - start).days)).keys())
    temp_list = []
    for each in global_list:
        if each in current_list:
            temp_list.append(1)
        else:
            temp_list.append(0)
    return pd.Series(temp_list)

sample_df[global_list] = sample_df[['contract_start_date', 'contract_end_date']].apply(
    lambda x: get_count(*x), axis=1)
and the sample df looks like:
customer_id contract_start_date contract_end_date 01/12 02/12 03/12 04/12 05/12 06/12 07/12 ... 04/18 05/18 06/18 07/18 08/18 09/18 10/18 11/18 12/18 01/19
1 1 20181001 20220930 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 1 1
9 2 20160701 20200731 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1
3 3 20171101 20211031 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1
3 rows × 88 columns
It works fine for a small dataset, but for 160k rows it didn't stop even after 3 hours. Can someone tell me a better way to do this?
I am also facing problems when the dates overlap for the same customer.
First I'd clip the out-of-range dates to normalize end_date (with the date columns already parsed to datetimes), ensuring it falls inside the target range:
In [11]: df.end_date = df.end_date.where(df.end_date < '2019-02-01', pd.Timestamp('2019-01-31')) + pd.offsets.MonthBegin()
In [12]: df
Out[12]:
id start_date end_date
0 1 2017-06-01 2019-02-01
1 2 2018-10-01 2019-02-01
2 3 2015-01-01 2019-02-01
3 4 2017-11-01 2019-02-01
Note: you'll need to do the same trick for start_date if there are dates prior to 2012.
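A minimal sketch of that clamp, assuming start_date has also been parsed to datetimes:
df.start_date = df.start_date.where(df.start_date >= '2012-01-01', pd.Timestamp('2012-01-01'))
No MonthBegin shift is needed here, since the 1 marker below goes in the start month itself.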
I'd create the resulting DataFrame with a date range as its columns and then fill it in: a 1 at each row's start month and a -1 at its (shifted) end month, forward-filling in between:
In [13]: m = pd.date_range('2012-01-01', '2019-01-01', freq='MS')
In [14]: res = pd.DataFrame(0., columns=m, index=df.index)
In [15]: res.update(pd.DataFrame(np.diag(np.ones(len(df))), df.index, df.start_date).groupby(axis=1, level=0).sum())
In [16]: res.update(-pd.DataFrame(np.diag(np.ones(len(df))), df.index, df.end_date).groupby(axis=1, level=0).sum())
The groupby sum is required if multiple rows start or end in the same month.
# -1 and NaN were really placeholders for zero
In [17]: res = res.replace(0, np.nan).ffill(axis=1).replace([np.nan, -1], 0)
In [18]: res
Out[18]:
2012-01-01 2012-02-01 2012-03-01 2012-04-01 2012-05-01 ... 2018-09-01 2018-10-01 2018-11-01 2018-12-01 2019-01-01
0 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0
1 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 1.0 1.0 1.0
2 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0
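To end up with the wide frame from the question, the flag columns can be joined back onto df; a minimal sketch (the column labels stay as timestamps here rather than the 'mm/yy' strings):
out = df.join(res.astype(int))   # note: end_date was clamped in place above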
I have a pandas.DataFrame with a column that contains integers, strings, dates, ...
I want to create indicator columns (containing 0/1) that tell, in an efficient way, whether the value in that column is a string or not, a date or not, and so on.
A
0 Hello
1 Name
2 123
3 456
4 22/03/2019
And the output should be
A A_string A_number A_date
0 Hello 1 0 0
1 Name 1 0 0
2 123 0 1 0
3 456 0 1 0
4 22/03/2019 0 0 1
Using pandas str methods to check for the string type could help:
df = pd.read_clipboard()
df['A_string'] = df.A.str.isalpha().astype(int)
df['A_number'] = df.A.str.isdigit().astype(int)
# naive assumption: anything that is not purely alphanumeric is treated as a date
df['A_Date'] = (~df.A.str.isalnum()).astype(int)
df.filter(['A','A_string','A_number','A_Date'])
A A_string A_number A_Date
0 Hello 1 0 0
1 Name 1 0 0
2 123 0 1 0
3 456 0 1 0
4 22/03/2019 0 0 1
We can use the native pandas .to_numeric and .to_datetime to test for numbers and dates. Then we can use .loc for assignment and fillna to match your target df.
df.loc[~pd.to_datetime(df['A'],errors='coerce').isna(),'A_Date'] = 1
df.loc[~pd.to_numeric(df['A'],errors='coerce').isna(),'A_Number'] = 1
df.loc[pd.to_numeric(df['A'], errors='coerce').isna()
       & pd.to_datetime(df['A'], errors='coerce').isna(),
       'A_String'] = 1
df = df.fillna(0)
print(df)
A A_Date A_Number A_String
0 Hello 0.0 0.0 1.0
1 Name 0.0 0.0 1.0
2 123 0.0 1.0 0.0
3 456 0.0 1.0 0.0
4 22/03/2019 1.0 0.0 0.0
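Both checks can also be combined into a single pass that builds all three flags at once; a sketch, assuming the number check takes priority over the date check (so '123' is never flagged as a date) and that dates are day-first as in the sample:
import pandas as pd

a = df['A'].astype(str)                                   # treat every cell as text for the checks
is_number = pd.to_numeric(a, errors='coerce').notna()
# dayfirst=True is an assumption based on the 22/03/2019 sample value
is_date = pd.to_datetime(a, errors='coerce', dayfirst=True).notna() & ~is_number
flags = pd.DataFrame({'A_string': ~is_number & ~is_date,
                      'A_number': is_number,
                      'A_date': is_date}).astype(int)
out = pd.concat([df, flags], axis=1)                      # A, A_string, A_number, A_date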
I want to combine two dataframes. One, let's say Empty_DF, is "empty" (all zeros) and large (320 columns by 240 rows), with integer index and column labels. The other one, ROI_DF, is smaller and filled, and its index and column labels match Empty_DF at a certain location.
I have tried to use the pandas.merge function as suggested in this question; however, it just appends the columns to the empty dataframe Empty_DF instead of replacing the values.
Empty_DF = pd.DataFrame({'a': [0, 0, 0, 0, 0, 0],
                         'b': [0, 0, 0, 0, 0, 0],
                         'c': [0, 0, 0, 0, 0, 0]}, index=list('abcdef'))
print (Empty_DF)
ROI_DF = pd.DataFrame({'a': range(4),
                       'b': [5, 6, 7, 8]}, index=list('abce'))
print(ROI_DF)
a b c
a 0 0 0
b 0 0 0
c 0 0 0
d 0 0 0
e 0 0 0
f 0 0 0
In this example it is sufficient, since the dataframe is small, to use the pandas.fillna option with pandas.drop. Is there a more efficient way of doing this for bigger dataframes?
df3 = pd.merge(Empty_DF, ROI_DF, how='left', left_index=True,
               right_index=True, suffixes=('_x', ''))
df3['a'].fillna(df3['a_x'], inplace=True)
df3['b'].fillna(df3['b_x'], inplace=True)
df3.drop(['a_x', 'b_x'], axis=1, inplace=True)
print(df3)
a b c
a 0 5 0
b 1 6 0
c 2 7 0
d 0 0 0
e 3 8 0
f 0 0 0
This is a perfect case for DataFrame.update, which aligns on the index and columns:
Empty_DF.update(ROI_DF)
Output
print(Empty_DF)
a b c
a 0.0 5.0 0
b 1.0 6.0 0
c 2.0 7.0 0
d 0.0 0.0 0
e 3.0 8.0 0
f 0.0 0.0 0
Note that update is in place, as quoted from the documentation:
Modify in place using non-NA values from another DataFrame.
That means that your original dataframe will be updated by the new values. To prevent this, use:
df3 = Empty_DF.copy()
df3.update(ROI_DF)
You can either use update:
Empty_DF.update(ROI_DF)
output:
a b c
a 0.0 5.0 0
b 1.0 6.0 0
c 2.0 7.0 0
d 0.0 0.0 0
e 3.0 8.0 0
f 0.0 0.0 0
Or loc:
Empty_DF.loc[ROI_DF.index, ROI_DF.columns] = ROI_DF
output:
a b c
a 0 5 0
b 1 6 0
c 2 7 0
d 0 0 0
e 3 8 0
f 0 0 0
In your case, reindex_like also works:
yourdf = ROI_DF.reindex_like(Empty_DF).fillna(0)
a b c
a 0.0 5.0 0.0
b 1.0 6.0 0.0
c 2.0 7.0 0.0
d 0.0 0.0 0.0
e 3.0 8.0 0.0
f 0.0 0.0 0.0
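One caveat: update and reindex_like go through NaN internally, so the result comes back as floats (0.0, 5.0, ...). If integer dtype matters, a cast at the end restores it; a sketch:
filled = ROI_DF.reindex_like(Empty_DF).fillna(0).astype(int)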
I have a dataset with three inputs plus date and time. The data was not collected at regular time intervals. What I want first is to set my start time to 0 and convert the other times into minutes.
My code is:
import pandas as pd

data = pd.read_csv('data6.csv', sep=',')
data['date'] = pd.to_datetime(data['date'] + " " + data['time'], format='%d/%m/%Y %H:%M:%S')
lastday = data.loc[0, 'date']

def convert_time(x):
    global lastday
    if x.date() == lastday.date():
        tm = x - lastday
        return tm.total_seconds() / 60
    else:
        lastday = x
        return 0

data['time'] = data['date'].apply(convert_time)
Then I got results like this (screenshot omitted). But what I expected is: set the time for every one minute from starting time 0; if a column has no value at that time, put 0 until values are available, and when a value arrives, put it in that minute's row. If a new day starts, reset the start time to 0 and count minutes again. In other words, group the data into one-minute bins:
Date        time in min   X1    X2    X3
10/3/2018   1             63    0     0
            2             0     0     0
            3             0     0     0
            ...           (0 in every column until values
                           are available, then that value)
            13            0     0     0
10/4/2018   0             120   30    60
            1             0     0     0
My CSV file: see the csv_url used in the answer below.
After the new code, my time column displays like this (screenshot omitted).
Pandas has functions for this: resample on a datetime index. You have to supply an aggregation in case your data has multiple values within one minute. The example below sums these values; it is easy to change this.
Please correct me if this is not what you want.
Code
# Read CSV
csv_url = 'https://docs.google.com/spreadsheets/d/1WWq1qhqi4bGzNir_svQV7VstBkGbocToipPCY83Cclc/gviz/tq?tqx=out:csv&sheet=1512153575'
data = pd.read_csv(csv_url)
data['date'] = pd.to_datetime(data['date'] + " " + data['time'], format='%d/%m/%Y %H:%M:%S')
# Resample to 1 minute (T is minute)
df = data.set_index('date') \
         .resample('1T') \
         .sum() \
         .fillna(0)
# Optional ugly one-liner: re-index as minutes since the first sample of each day (restarts at 0 every day)
df.index = ((df.index.to_series() - df.index.to_series().groupby(df.index.date).transform('min')).dt.total_seconds() / 60).astype(int)
Output
df.head()
x1 x2 x3 Unnamed: 5 Unnamed: 6 Unnamed: 7
date
2018-03-10 06:15:00 63 0 0 0.0 0.0 0.0
2018-03-10 06:16:00 0 0 0 0.0 0.0 0.0
2018-03-10 06:17:00 0 0 0 0.0 0.0 0.0
2018-03-10 06:18:00 0 0 0 0.0 0.0 0.0
2018-03-10 06:19:00 0 0 0 0.0 0.0 0.0
Output 2
With the ugly one-liner applied:
x1 x2 x3 Unnamed: 5 Unnamed: 6 Unnamed: 7
date
0 63 0 0 0.0 0.0 0.0
1 0 0 0 0.0 0.0 0.0
2 0 0 0 0.0 0.0 0.0
3 0 0 0 0.0 0.0 0.0
4 0 0 0 0.0 0.0 0.0
5 0 0 0 0.0 0.0 0.0
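As noted above, the aggregation is easy to swap; for example, to take the mean instead of the sum of values landing in the same minute (a sketch, dropping the raw time column first so only numeric columns are aggregated):
# mean of the values that fall in the same minute, instead of their sum
df = data.set_index('date').drop(columns='time').resample('1T').mean().fillna(0)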
You may create a dataframe df2 containing the time column and the minutes of the day, and then use:
csv_url = 'https://docs.google.com/spreadsheets/d/1WWq1qhqi4bGzNir_svQV7VstBkGbocToipPCY83Cclc/gviz/tq?tqx=out:csv&sheet=1512153575'
data = pd.read_csv(csv_url)
df = pd.merge(data,df2,how='outer',on='time')
df = df.fillna(0)
df2 looks like the picture (not shown here); you can create it with a script or in Excel.
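Since the picture is missing, here is a rough sketch of how such a df2 could be generated (the 'time' format and the column names are assumptions and must match the CSV):
import pandas as pd

minutes = pd.date_range('00:00:00', '23:59:00', freq='T')    # one row per minute of the day
df2 = pd.DataFrame({'time': minutes.strftime('%H:%M:%S'),    # format must match the CSV's time column
                    'minutes_of_day': minutes.hour * 60 + minutes.minute})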
I am trying to calculate the # of days between failures. I'd like to know on each day in the series the # of days passed since the last failure where failure = 1. There may be anywhere from 1 to 1500 devices.
For example, I'd like my dataframe to look like this (please pull the data from the URL in the second code block; this is just a short example of a larger dataframe):
date device failure elapsed
10/01/2015 S1F0KYCR 1 0
10/07/2015 S1F0KYCR 1 7
10/08/2015 S1F0KYCR 0 0
10/09/2015 S1F0KYCR 0 0
10/17/2015 S1F0KYCR 1 11
10/31/2015 S1F0KYCR 0 0
10/01/2015 S8KLM011 1 0
10/02/2015 S8KLM011 1 2
10/07/2015 S8KLM011 0 0
10/09/2015 S8KLM011 0 0
10/11/2015 S8KLM011 0 0
10/21/2015 S8KLM011 1 20
Sample Code:
Edit: Please pull the actual data from the code block below. The above sample data is a short example. Thanks.
url = "https://raw.githubusercontent.com/dsdaveh/device-failure-analysis/master/device_failure.csv"
df = pd.read_csv(url, encoding = "ISO-8859-1")
df = df.sort_values(by = ['date', 'device'], ascending = True) #Sort by date and device
df['date'] = pd.to_datetime(df['date'],format='%Y/%m/%d') #format date to datetime
This is where I am running into obstacles. The new column should contain the # of days since the last failure, where failure = 1.
test['date'] = 0
for i in test.index[1:]:
    if not test['failure'][i]:
        test['elapsed'][i] = test['elapsed'][i-1] + 1
I have also tried
fails = df[df.failure == 1]
fails.Dates = fails.index      # need this because .diff() won't work on the index
fails.Elapsed = fails.Dates.diff()
Using pandas.DataFrame.groupby with diff and numpy.where:
import pandas as pd
import numpy as np

df['date'] = pd.to_datetime(df['date'])
# days since the previous row in the same (device, failure) group, counted inclusively (+1)
s = df.groupby(['device', 'failure'])['date'].diff().dt.days.add(1)
s = s.fillna(0)
# keep the count only on failure rows; everything else is 0
df['elapsed'] = np.where(df['failure'], s, 0)
Output:
Date Device Failure Elapsed
0 2015-10-01 S1F0KYCR 1 0.0
1 2015-10-07 S1F0KYCR 1 7.0
2 2015-10-08 S1F0KYCR 0 0.0
3 2015-10-09 S1F0KYCR 0 0.0
4 2015-10-17 S1F0KYCR 1 11.0
5 2015-10-31 S1F0KYCR 0 0.0
6 2015-10-01 S8KLM011 1 0.0
7 2015-10-02 S8KLM011 1 2.0
8 2015-10-07 S8KLM011 0 0.0
9 2015-10-09 S8KLM011 0 0.0
10 2015-10-11 S8KLM011 0 0.0
11 2015-10-21 S8KLM011 1 20.0
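If the integer look of the expected output matters, the float elapsed column can simply be cast back afterwards (a sketch; safe here because fillna(0) and np.where leave no NaN):
df['elapsed'] = df['elapsed'].astype(int)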
Update:
It turns out the actual data linked in the OP contains no device with more than one failure, making the final result all zeros (i.e. no second failure ever occurs, so there is nothing to compute for elapsed). Using the OP's original snippet:
import pandas as pd
url = "http://aws-proserve-data-science.s3.amazonaws.com/device_failure.csv"
df = pd.read_csv(url, encoding = "ISO-8859-1")
df = df.sort_values(by = ['date', 'device'], ascending = True)
df['date'] = pd.to_datetime(df['date'],format='%Y/%m/%d')
Find if any device has more than 1 failure:
df.groupby(['device'])['failure'].sum().gt(1).any()
# False
This confirms that the all-zeros df['elapsed'] is actually a correct answer :)
If you tweak your data a bit, it does yield elapsed just as expected.
# create a second failure occurrence for device S1F0RRB1
df.loc[6879, 'device'] = 'S1F0RRB1'
s = df.groupby(['device', 'failure'])['date'].diff().dt.days.add(1)
s = s.fillna(0)
df['elapsed'] = np.where(df['failure'], s, 0)
df['elapsed'].value_counts()
# 0.0 124493
# 3.0 1
Here is one way
df['elapsed'] = df[df.Failure.astype(bool)].groupby('Device').Date.diff().dt.days.add(1)
df.elapsed.fillna(0, inplace=True)
df
Out[225]:
Date Device Failure Elapsed elapsed
0 2015-10-01 S1F0KYCR 1 0 0.0
1 2015-10-07 S1F0KYCR 1 7 7.0
2 2015-10-08 S1F0KYCR 0 0 0.0
3 2015-10-09 S1F0KYCR 0 0 0.0
4 2015-10-17 S1F0KYCR 1 11 11.0
5 2015-10-31 S1F0KYCR 0 0 0.0
6 2015-10-01 S8KLM011 1 0 0.0
7 2015-10-02 S8KLM011 1 2 2.0
8 2015-10-07 S8KLM011 0 0 0.0
9 2015-10-09 S8KLM011 0 0 0.0
10 2015-10-11 S8KLM011 0 0 0.0
11 2015-10-21 S8KLM011 1 20 20.0
Hi, I would like to implement a counter which counts the number of successive zero observations in a dataframe (across multiple columns), resetting whenever a non-zero observation is found. I have used a for loop, but it is incredibly slow; I am sure there must be far more efficient ways. This is my code:
Here is a snapshot of df
df.head()
ACL ACT ADH ADR AFE AFH AFT
2013-02-05 NaN NaN NaN NaN NaN NaN NaN
2013-02-12 -0.136861 -0.020406 0.046150 0.000000 -0.005321 NaN 0.058195
2013-02-19 -0.006632 0.041665 0.007365 0.012738 0.040930 NaN -0.037818
2013-02-26 -0.023848 -0.023999 -0.030677 -0.003144 0.050604 NaN -0.047604
2013-03-05 0.009771 -0.024589 -0.021073 -0.039432 0.047315 NaN 0.068727
I first initialise an empty dataframe with the same shape and labels as df above:
df1 = pd.DataFrame(index=df.index, columns=df.columns)
df1 = df1.fillna(0)
Then I create my function which iterates over the rows, but this only deals with one column at a time
def zero_obs(x=df, y=df1):
    for i in range(len(x)):
        if x[i] == 0:
            y[i] = y[i-1] + 1
        else:
            y[i] = 0
    return y

for col in df.columns:
    df1[col] = zero_obs(x=df[col], y=df1[col])
Really appreciate any help!!
The output I expect is as follows:
df1.tail()
BRN AXL TTO AGL ACL
2017-01-03 3 125 0 0 0
2017-01-10 0 126 0 0 0
2017-01-17 1 127 0 0 0
2017-01-24 0 128 0 0 0
2017-01-31 0 129 1 0 0
setup
Consider the dataframe df
df = pd.DataFrame(
    np.zeros((10, 2), dtype=int),
    columns=list('AB')
)
df.loc[[0, 4, 8], 'A'] = 1
df.loc[6, 'B'] = 1
print(df)
A B
0 1 0
1 0 0
2 0 0
3 0 0
4 1 0
5 0 0
6 0 1
7 0 0
8 1 0
9 0 0
Option 1
pandas apply
def zero_obs(x):
    """`x` is assumed to be a `pd.Series`"""
    # running count of zeros seen so far
    csum = x.eq(0).cumsum()
    # that running count, frozen at each non-zero position and carried forward
    cpos = csum.where(x.ne(0)).ffill().fillna(0)
    # the difference is the number of zeros since the last non-zero observation
    return csum.sub(cpos)

print(df.apply(zero_obs))
A B
0 0.0 1.0
1 1.0 2.0
2 2.0 3.0
3 3.0 4.0
4 0.0 5.0
5 1.0 6.0
6 2.0 0.0
7 3.0 1.0
8 0.0 2.0
9 1.0 3.0
Option 2
don't use apply
This function works just as well on df
zero_obs(df)
A B
0 0.0 1.0
1 1.0 2.0
2 2.0 3.0
3 3.0 4.0
4 0.0 5.0
5 1.0 6.0
6 2.0 0.0
7 3.0 1.0
8 0.0 2.0
9 1.0 3.0
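Applied to the question's frame, the same function replaces the whole loop; a quick sketch (note this treats NaN like a non-zero and resets the counter there, which is an assumption about the intended handling):
df1 = zero_obs(df).astype(int)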