Dataframe has every other column as a timestamp, how do I get them into one column? - python

I have a dataframe I import from Excel that is of 'n x n' length and looks like the following (sorry, I do not know how to easily reproduce it with code).
How do I get the timestamps into one column, like the following? (I've tried pivot.)

You may need to extract the data in three groups of (time, value) columns, rename the columns, add an "A"/"B"/"C" flag column, and concatenate the pieces together. See the test below:
import pandas as pd

abc_list = [["2017-10-01",0,"2017-10-02",1,"2017-10-03",8],
            ["2017-11-01",3,"2017-11-01",5,"2017-11-05",10],
            ["2017-12-01",0,"2017-12-07",7,"2017-12-07",12]]
df = pd.DataFrame(abc_list, columns=["Time1","A","Time2","B","Time3","C"])
The output:
Time1 A Time2 B Time3 C
0 2017-10-01 0 2017-10-02 1 2017-10-03 8
1 2017-11-01 3 2017-11-01 5 2017-11-05 10
2 2017-12-01 0 2017-12-07 7 2017-12-07 12
Then:
df_a = df.iloc[:,0:2].rename(columns={'Time1':'time','A':'value'})
df_a['flag'] = "A"
df_b = df.iloc[:,2:4].rename(columns={'Time2':'time','B':'value'})
df_b['flag'] = "B"
df_c = df.iloc[:,4:].rename(columns={'Time3':'time','C':'value'})
df_c['flag'] = "C"
df_final = pd.concat([df_a,df_b,df_c])
df_final.reset_index(drop=True)
output:
time value flag
0 2017-10-01 0 A
1 2017-11-01 3 A
2 2017-12-01 0 A
3 2017-10-02 1 B
4 2017-11-01 5 B
5 2017-12-07 7 B
6 2017-10-03 8 C
7 2017-11-05 10 C
8 2017-12-07 12 C
This is a bit verbose and not a very pythonic way to do it.
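The same slicing generalizes to any number of (time, value) pairs with a loop; a minimal sketch, assuming the columns always alternate time, value as above:
parts = []
for i in range(0, df.shape[1], 2):
    part = df.iloc[:, i:i+2].copy()
    part.columns = ['time', 'value']   # normalize the pair's names
    part['flag'] = df.columns[i+1]     # the value-column name becomes the flag
    parts.append(part)
df_final = pd.concat(parts, ignore_index=True)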
Here is another way:
columns = pd.MultiIndex.from_tuples([('A','Time'),('A','Value'),('B','Time'),('B','Value'),('C','Time'),('C','Value')],names=['Group','Sub_value'])
df.columns=columns
Output:
Group A B C
Sub_value Time Value Time Value Time Value
0 2017-10-01 0 2017-10-02 1 2017-10-03 8
1 2017-11-01 3 2017-11-01 5 2017-11-05 10
2 2017-12-01 0 2017-12-07 7 2017-12-07 12
Run:
df.stack(level='Group')
Output:
Sub_value Time Value
Group
0 A 2017-10-01 0
B 2017-10-02 1
C 2017-10-03 8
1 A 2017-11-01 3
B 2017-11-01 5
C 2017-11-05 10
2 A 2017-12-01 0
B 2017-12-07 7
C 2017-12-07 12
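If you want the group label as an ordinary column rather than an index level, a small follow-up sketch:
# move the 'Group' index level into a regular column and flatten the index
out = df.stack(level='Group').reset_index(level='Group').reset_index(drop=True)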

This is one method. It is fairly easy to extend to any number of columns.
import pandas as pd

# read in pairs of columns and assign a 'Category' column
dfs = {j: pd.read_excel('file.xlsx', usecols=[2*i, 2*i+1], skiprows=[0],
                        header=None, names=['Date', 'Value']).assign(Category=j)
       for i, j in enumerate(['A', 'B', 'C'])}
# concatenate the dataframes
df = pd.concat(list(dfs.values()), ignore_index=True)
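If the workbook is large, an alternative sketch (assuming pandas >= 1.0 and the same sheet layout) reads the file once and slices the pairs in memory instead of calling read_excel three times:
raw = pd.read_excel('file.xlsx', skiprows=[0], header=None)
# slice each (date, value) pair out of the single read and tag it
parts = [raw.iloc[:, [2*i, 2*i+1]]
            .set_axis(['Date', 'Value'], axis=1)
            .assign(Category=cat)
         for i, cat in enumerate(['A', 'B', 'C'])]
df = pd.concat(parts, ignore_index=True)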

Related

pandas get delta from corresponding date in a separate list of dates

I have a dataframe:
   a  b
7 2019-05-01 00:00:01
6 2019-05-02 00:15:01
1 2019-05-06 00:10:01
3 2019-05-09 01:00:01
8 2019-05-09 04:20:01
9 2019-05-12 01:10:01
4 2019-05-16 03:30:01
And
l = [datetime.datetime(2019, 5, 2), datetime.datetime(2019, 5, 10), datetime.datetime(2019, 5, 22)]
I want to add a column with the following:
for each row, find the last date from l that is before it, and add the number of days between them.
If none of the dates is earlier, add the delta from the smallest one.
So the new column will be:
   a  b                    delta  date
   7  2019-05-01 00:00:01     -1  datetime.datetime(2019, 5, 2)
   6  2019-05-02 00:15:01      0  datetime.datetime(2019, 5, 2)
   1  2019-05-06 00:10:01      4  datetime.datetime(2019, 5, 2)
   3  2019-05-09 01:00:01      7  datetime.datetime(2019, 5, 2)
   8  2019-05-09 04:20:01      7  datetime.datetime(2019, 5, 2)
   9  2019-05-12 01:10:01      2  datetime.datetime(2019, 5, 10)
   4  2019-05-16 03:30:01      6  datetime.datetime(2019, 5, 10)
How can I do it?
Using merge_asof to align df['b'] with the list (as a Series), then computing the difference:
# ensure datetime
df['b'] = pd.to_datetime(df['b'])
# craft Series for merging (could be combined with line below)
s = pd.Series(l, name='l')
# merge and fillna with minimum date
ref = pd.merge_asof(df['b'], s, left_on='b', right_on='l')['l'].fillna(s.min())
# compute the delta as days
df['delta'] = (df['b'] - ref).dt.days
output:
a b delta
0 7 2019-05-01 00:00:01 -1
1 6 2019-05-02 00:15:01 0
2 1 2019-05-06 00:10:01 4
3 3 2019-05-09 01:00:01 7
4 8 2019-05-09 04:20:01 7
5 9 2019-05-12 01:10:01 2
6 4 2019-05-16 03:30:01 6
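If you also want the matched reference date as a column, as in the expected output, assign ref as well:
# use the underlying values to sidestep index alignment
df['date'] = ref.to_numpy()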
Here's a one-line solution if your b column holds datetime objects (otherwise convert it first). The sort key puts non-negative deltas first and breaks ties by absolute size:
df['delta'] = df.apply(lambda x: sorted([x.b - i for i in l],
                       key=lambda y: (y < datetime.timedelta(0), abs(y)))[0].days, axis=1)
Explanation: to each row you apply a function that:
Computes the timedelta between the row's datetime and every datetime present in l, and stores them in a list
Sorts this list so that non-negative deltas come first, smallest in absolute value first
Takes the first value (the closest preceding date from l, or the least-late one if none precedes) and returns its days
This code separates the timestamp column of the dataset into its components, for example:
weekday   Friday
year      2014
day       01
hour      00
minute    03
rides['weekday'] = rides.timestamp.dt.strftime("%A")
rides['year'] = rides.timestamp.dt.strftime("%Y")
rides['day'] = rides.timestamp.dt.strftime("%d")
rides['hour'] = rides.timestamp.dt.strftime("%H")
rides["minute"] = rides.timestamp.dt.strftime("%M")

Count days by ID - Pandas

Given the following table, how can I count the days elapsed per ID,
without using a for loop (or any loop)? The data is large.
ID Date
a 01/01/2020
a 05/01/2020
a 08/01/2020
a 10/01/2020
b 05/05/2020
b 08/05/2020
b 12/05/2020
c 08/08/2020
c 22/08/2020
to have this result
ID Date Days Evolved Since Initial Date
a 01/01/2020 1
a 05/01/2020 4
a 08/01/2020 7
a 10/01/2020 9
b 05/05/2020 1
b 08/05/2020 3
b 12/05/2020 7
c 08/08/2020 1
c 22/08/2020 14
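The solutions below assume Date is already a datetime column; if it was read as dd/mm/yyyy strings as shown, convert it first:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')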
Use GroupBy.transform with 'first' to broadcast each group's first value, so you can subtract it. Then, if the datetimes are not duplicated within a group, it is safe to replace 0 with 1:
df['new'] = df['Date'].sub(df.groupby("ID")['Date'].transform('first')).dt.days.replace(0, 1)
print (df)
ID Date new
0 a 2020-01-01 1
1 a 2020-01-05 4
2 a 2020-01-08 7
3 a 2020-01-10 9
4 b 2020-05-05 1
5 b 2020-05-08 3
6 b 2020-05-12 7
7 c 2020-08-08 1
8 c 2020-08-22 14
Or set 1 for the first value of each group with Series.where and Series.duplicated:
df['new'] = (df['Date'].sub(df.groupby("ID")['Date'].transform('first'))
                       .dt.days.where(df['ID'].duplicated(), 1))
print (df)
ID Date new
0 a 2020-01-01 1
1 a 2020-01-05 4
2 a 2020-01-08 7
3 a 2020-01-10 9
4 b 2020-05-05 1
5 b 2020-05-08 3
6 b 2020-05-12 7
7 c 2020-08-08 1
8 c 2020-08-22 14
You could do something like the following (with df being your dataframe):
def days_evolved(sdf):
    sdf["Days_evolved"] = sdf.Date - sdf.Date.iat[0]
    sdf["Days_evolved"].iat[0] = pd.Timedelta(days=1)
    return sdf

df = df.groupby("ID", as_index=False, sort=False).apply(days_evolved)
Result for the sample:
ID Date Days_evolved
0 a 2020-01-01 1 days
1 a 2020-01-05 4 days
2 a 2020-01-08 7 days
3 a 2020-01-10 9 days
4 b 2020-05-05 1 days
5 b 2020-05-08 3 days
6 b 2020-05-12 7 days
7 c 2020-08-08 1 days
8 c 2020-08-22 14 days
If you want int instead of pd.Timedelta then do
df["Days_evolved"] = df["Days_evolved"].dt.days
at the end.

Pandas groupby datetime columns by periods

I have the following dataframe:
df = pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],[1,7,8,4,3,4,3]]),
                  columns=['9:00:00','9:05:00','09:10:00','09:15:00','09:20:00','09:25:00','09:30:00'])
  9:00:00 9:05:00 09:10:00 09:15:00 09:20:00 09:25:00 09:30:00
a 1 2 3 4 7 9 5
b 2 6 5 4 9 8 2
c 3 5 3 21 12 6 7
d 1 7 8 4 3 4 3
I would like to get, for each row (e.g. a, b, c, d, ...), the mean value between specific hours. The hours are between 9 and 15, and I want to group by period, for example to calculate the mean value between 09:00:00 and 11:00:00, between 11 and 12, between 13 and 15 (or any period I decide on).
I first tried to convert the column values to datetime format, thinking that would make it easier:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
but then I got column names with a fake year, "1900-01-01 09:00:00"...
Also, the column headers' dtype was object, so I felt a bit lost...
My end goal is to be able to calculate new columns with the mean value for each row, using only the columns that fall inside the defined time period (e.g. 9-11 etc...).
If you need fixed periods, e.g. every 2 hours, use resample along the columns (note the bin edge snaps back to 08:00:00, the nearest 2-hour boundary):
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
df1 = df.resample('2H', axis=1).mean()
print (df1)
1900-01-01 08:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
If you need custom periods, it is possible to use cut:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
bins = ['5:00:00','9:00:00','11:00:00','12:00:00', '23:59:59']
dates = pd.to_datetime(bins,format="%H:%M:%S")
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
And last, take the mean per column group; the reason for the all-NaN columns is that the column labels are categoricals, so every defined category shows up:
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 5:00:00-9:00:00 11:00:00-12:00:00 12:00:00-23:59:59
0 4.428571 NaN NaN NaN
1 5.142857 NaN NaN NaN
2 8.142857 NaN NaN NaN
3 4.285714 NaN NaN NaN
To avoid the NaN columns, convert the column names to strings:
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
EDIT: the same solution with timedeltas, which match the HH:MM:SS format better:
df.columns = pd.to_timedelta(df.columns)
print (df)
0 days 09:00:00 0 days 09:05:00 0 days 09:10:00 0 days 09:15:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
0 days 09:20:00 0 days 09:25:00 0 days 09:30:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
bins = ['9:00:00','11:00:00','12:00:00']
dates = pd.to_timedelta(bins)
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
# missing values because no datetimes exist between 11:00:00 and 12:00:00
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 11:00:00-12:00:00
0 4.428571 NaN
1 5.142857 NaN
2 8.142857 NaN
3 4.285714 NaN
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
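Note: DataFrame.mean(level=..., axis=1) as used above was removed in pandas 2.0; a sketch of an equivalent via transpose and groupby:
# group the columns by their level-0 label, average, and transpose back
df3 = df.rename(columns=str).T.groupby(level=0).mean().T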
I am going to show you my code and the results after the execution.
First, import the libraries and create the dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],[1,7,8,4,3,4,3]]),
                  columns=['9:00:00','9:05:00','09:10:00','09:15:00','09:20:00','09:25:00','09:30:00'])
It is convenient to create a class in order to define what a period is:
class Period():
    def __init__(self, initial, end):
        self.initial = initial
        self.end = end

    def __repr__(self):
        return self.initial + ' -- ' + self.end
With .loc we can get a sub-dataframe containing only the columns we want:
def get_colMean(df, period):
    df2 = df.loc[:, period.initial:period.end]
    array_mean = df2.mean(axis=1).values
    col_name = 'mean_' + period.initial + '--' + period.end
    pd_colMean = pd.DataFrame(array_mean, columns=[col_name])
    return pd_colMean
Finally we use .join in order to add our column with the means to the original dataframe:
def join_colMean(df, period):
    pd_colMean = get_colMean(df, period)
    df = df.join(pd_colMean)
    return df
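A hypothetical usage with the columns defined above (the labels must match the column names exactly):
period = Period('9:00:00', '09:15:00')
df = join_colMean(df, period)
print(df)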
(The results were shown as an image in the original post.)

Filling Missing Dates for a combination of columns

I have a dataframe with 3 columns: one Date column and 2 object columns. I need to fill in the missing dates for every COL1 and COL2 combination, using the max and min dates of the dataframe. The Date column only contains the first day of each month.
I have done it in a naive manner, but the original data is thousands of records, and iterating through all COL1+COL2 combinations and date ranges takes a huge amount of time. The original dataframe contains 15000 records and 30 columns. I need to fill in the missing date + col1 + col2 rows and leave the remaining columns' values empty. For example, if I have data for Jan 2019 for a col1+col2 combination but not for Feb, I want to insert a row with Feb, col1, col2, and the other fields empty.
There should be the same unique combinations of COL1 + COL2 before and after the filling.
Please help me optimize it.
df_1 = pd.DataFrame({'Date': ['2018-01-01','2018-02-01','2018-03-01','2018-05-01','2018-05-01'],
                     'COL1': ['A','A','B','B','A'],
                     'COL2': ['1','2','1','2','1']})
df_1['Date'] = pd.to_datetime(df_1['Date'])
Initial Dataframe -->>
Date COL1 COL2
0 2018-01-01 A 1
1 2018-02-01 A 2
2 2018-03-01 B 1
3 2018-05-01 B 2
4 2018-05-01 A 1
--
print(df_1.dtypes)
print(df_1)
COLS_COMBO = [i for i in list(set(list(df_1[['COL1','COL2']].itertuples(name='',index=False))))]
months_range = [str(i.date()) for i in list(pd.date_range(start=min(df_1['Date']).date(),
                                                          end=max(df_1['Date']).date(), freq='MS'))]
print(COLS_COMBO)
print(months_range)
for col in COLS_COMBO:
    col1, col2 = col[0], col[1]
    for month in months_range:
        d = df_1[(df_1['Date'] == month) & (df_1['COL1'] == col1) & (df_1['COL2'] == col2)]
        if len(d) == 0:
            dx = {'Date': month, 'COL1': col1, 'COL2': col2}
            df_1 = df_1.append(dx, ignore_index=True)
print(df_1)
OUTPUT
Data TYPES -->>
Date datetime64[ns]
COL1 object
COL2 object
dtype: object
Unique combinations of COL1 + COL2 -->>
[('A', '2'), ('B', '2'), ('B', '1'), ('A', '1')]
Months range using min, max in the dataframe -->>
['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01']
My final output is
FINAL Dataframe -->>
Date COL1 COL2
0 2018-01-01 A 1
1 2018-02-01 A 2
2 2018-03-01 B 1
3 2018-05-01 B 2
4 2018-05-01 A 1
5 2018-01-01 A 2
6 2018-02-01 A 2
7 2018-03-01 A 2
8 2018-04-01 A 2
9 2018-05-01 A 2
10 2018-01-01 B 2
11 2018-02-01 B 2
12 2018-03-01 B 2
13 2018-04-01 B 2
14 2018-05-01 B 2
15 2018-01-01 B 1
16 2018-02-01 B 1
17 2018-03-01 B 1
18 2018-04-01 B 1
19 2018-05-01 B 1
20 2018-01-01 A 1
21 2018-02-01 A 1
22 2018-03-01 A 1
23 2018-04-01 A 1
24 2018-05-01 A 1
PS:
COL1 is like a parent and COL2 is a child, so the original combinations must not change, and existing (date + col1 + col2) combinations shouldn't be duplicated or updated.
You can use:
from itertools import product
#get all unique combinations of columns
COLS_COMBO = df_1[['COL1','COL2']].drop_duplicates().values.tolist()
#remove times and create MS date range
dates = df_1['Date'].dt.floor('d')
months_range = pd.date_range(dates.min(), dates.max(), freq='MS')
print(COLS_COMBO)
print(months_range)
#create all combinations of values
df = pd.DataFrame([(c, a, b) for (a, b), c in product(COLS_COMBO, months_range)],
                  columns=['Date','COL1','COL2'])
print (df)
Date COL1 COL2
0 2018-01-01 A 1
1 2018-02-01 A 1
2 2018-03-01 A 1
3 2018-04-01 A 1
4 2018-05-01 A 1
5 2018-01-01 A 2
6 2018-02-01 A 2
7 2018-03-01 A 2
8 2018-04-01 A 2
9 2018-05-01 A 2
10 2018-01-01 B 1
11 2018-02-01 B 1
12 2018-03-01 B 1
13 2018-04-01 B 1
14 2018-05-01 B 1
15 2018-01-01 B 2
16 2018-02-01 B 2
17 2018-03-01 B 2
18 2018-04-01 B 2
19 2018-05-01 B 2
#add to original df_1 and remove duplicates
df_1 = pd.concat([df_1, df], ignore_index=True).drop_duplicates()
print (df_1)
Date COL1 COL2
0 2018-01-01 A 1
1 2018-02-01 A 2
2 2018-03-01 B 1
3 2018-05-01 B 2
4 2018-05-01 A 1
6 2018-02-01 A 1
7 2018-03-01 A 1
8 2018-04-01 A 1
10 2018-01-01 A 2
12 2018-03-01 A 2
13 2018-04-01 A 2
14 2018-05-01 A 2
15 2018-01-01 B 1
16 2018-02-01 B 1
18 2018-04-01 B 1
19 2018-05-01 B 1
20 2018-01-01 B 2
21 2018-02-01 B 2
22 2018-03-01 B 2
23 2018-04-01 B 2
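An alternative sketch uses a cross merge (assuming pandas >= 1.2 for how='cross'); the final left join also carries any remaining columns through, leaving them NaN for the inserted rows:
months = pd.DataFrame({'Date': pd.date_range(df_1['Date'].min(), df_1['Date'].max(), freq='MS')})
combos = df_1[['COL1', 'COL2']].drop_duplicates()
# full grid: every month crossed with every observed COL1/COL2 pair
grid = combos.merge(months, how='cross')
out = grid.merge(df_1, on=['COL1', 'COL2', 'Date'], how='left')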

to_datetime assemblage error due to extra keys

My pandas version is 0.23.4.
I tried to run this code:
df['date_time'] = pd.to_datetime(df[['year','month','day','hour_scheduled_departure','minute_scheduled_departure']])
and the following error appeared:
extra keys have been passed to the datetime assemblage: [hour_scheduled_departure, minute_scheduled_departure]
Any ideas of how to get the job done by pd.to_datetime?
@anky_91:
In this image an extract of the first 10 rows is presented. First column [int32]: year; second column [int32]: month; third column [int32]: day; fourth column [object]: hour; fifth column [object]: minute. The objects have length 2.
Another solution:
>>> pd.concat([df.A,pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(),name='Date').map(lambda x: '0'.join(map(str,x))))],axis=1)
A Date
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
For the example you added as an image (I have skipped the last 3 columns to save time). The string-join trick above assumes single-digit values, so zero-pad month and day first:
>>> df.month = df.month.map("{:02}".format)
>>> df.day = df.day.map("{:02}".format)
>>> pd.concat([df.A,pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(),name='Date').map(lambda x: ''.join(map(str,x))))],axis=1)
A Date
0 a 2015-01-01 00:05:00
1 b 2015-01-01 00:01:00
2 c 2015-01-01 00:02:00
3 d 2015-01-01 00:02:00
4 e 2015-01-01 00:25:00
5 f 2015-01-01 00:25:00
You can rename the columns, which makes it possible to use pandas.to_datetime with the columns year, month, day, hour, and minute:
df = pd.DataFrame({
    'A': list('abcdef'),
    'year': [2002,2002,2002,2002,2002,2002],
    'month': [7,8,9,4,2,3],
    'day': [1,3,5,7,1,5],
    'hour_scheduled_departure': [5,3,6,9,2,4],
    'minute_scheduled_departure': [7,8,9,4,2,3]
})
print (df)
A year month day hour_scheduled_departure minute_scheduled_departure
0 a 2002 7 1 5 7
1 b 2002 8 3 3 8
2 c 2002 9 5 6 9
3 d 2002 4 7 9 4
4 e 2002 2 1 2 2
5 f 2002 3 5 4 3
cols = ['year','month','day','hour_scheduled_departure','minute_scheduled_departure']
d = {'hour_scheduled_departure':'hour','minute_scheduled_departure':'minute'}
df['date_time'] = pd.to_datetime(df[cols].rename(columns=d))
#if necessary remove columns
df = df.drop(cols, axis=1)
print (df)
A date_time
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
Detail:
print (df[cols].rename(columns=d))
year month day hour minute
0 2002 7 1 5 7
1 2002 8 3 3 8
2 2002 9 5 6 9
3 2002 4 7 9 4
4 2002 2 1 2 2
5 2002 3 5 4 3
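Before the drop step, an equivalent of the rename without a dict (assuming the column order in cols) is set_axis:
# relabel positionally instead of renaming by dict
df['date_time'] = pd.to_datetime(
    df[cols].set_axis(['year', 'month', 'day', 'hour', 'minute'], axis=1))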
