pandas DataFrame pivot with sort - python

I'm having a problem presenting data in the required way. My dataframe is formatted and then sorted by 'Site ID'. I need to present the data by Site ID with all date instances grouped alongside.
I'm 90% of the way there in terms of how I want it to look, using pivot_table:
df_pivot = pd.pivot_table(df, index=['Site Ref','Site Name', 'Date'])
However, the Date column is not sorted chronologically.
(The tiny example output mostly appears sorted, but the ****Thu Jan 11 2018 10:43:20 entry**** illustrates the issue I see on large data sets.)
I cannot figure out how to present the data as shown below while also having the dates sorted per Site ID.
Any help is gratefully accepted.
df = pd.DataFrame.from_dict([{'Site Ref': '1234567', 'Site Name': 'Building A', 'Date': 'Mon Jan 08 2018 10:43:20', 'Duration': 120}, {'Site Ref': '1245678', 'Site Name':'Building B', 'Date': 'Mon Jan 08 2018 10:43:20', 'Duration': 120}, {'Site Ref': '1245678', 'Site Name':'Building B', 'Date': 'Tue Jan 09 2018 10:43:20', 'Duration': 70}, {'Site Ref': '1245678', 'Site Name':'Building B', 'Date': 'Wed Jan 10 2018 10:43:20', 'Duration': 120}, {'Site Ref': '1212345', 'Site Name':'Building C', 'Date': 'Fri Jan 12 2018 10:43:20', 'Duration': 100}, {'Site Ref': '1123456', 'Site Name':'Building D', 'Date': 'Thu Jan 11 2018 10:43:20', 'Duration': 80}, {'Site Ref': '1123456', 'Site Name':'Building D', 'Date': 'Fri Jan 12 2018 12:22:20', 'Duration': 80}, {'Site Ref': '1123456', 'Site Name':'Building D', 'Date': 'Mon Jan 15 2018 11:43:20', 'Duration': 90}, {'Site Ref': '1123456', 'Site Name':'Building D', 'Date': 'Wed Jan 17 2018 10:43:20', 'Duration': 220}])
df = pd.DataFrame(df, columns=['Site Ref', 'Site Name', 'Date', 'Duration'])
df = df.sort_values(by=['Site Ref'])
df
Site Ref Site Name Date Duration
5 1123456 Building D Thu Jan 11 2018 10:43:20 80
6 1123456 Building D Fri Jan 12 2018 12:22:20 80
7 1123456 Building D Mon Jan 15 2018 11:43:20 90
8 1123456 Building D Wed Jan 17 2018 10:43:20 220
4 1212345 Building C Fri Jan 12 2018 10:43:20 100
0 1234567 Building A Mon Jan 08 2018 10:43:20 120
1 1245678 Building B Mon Jan 08 2018 10:43:20 120
2 1245678 Building B Tue Jan 09 2018 10:43:20 70
3 1245678 Building B Wed Jan 10 2018 10:43:20 120
df_pivot = pd.pivot_table(df, index=['Site Ref','Site Name', 'Date'])
df_pivot
Site Ref Site Name Date
1123456 Building D Fri Jan 12 2018 12:22:20 80
Mon Jan 15 2018 11:43:20 90
****Thu Jan 11 2018 10:43:20 80****
Wed Jan 17 2018 10:43:20 220
1212345 Building C Fri Jan 12 2018 10:43:20 100
1234567 Building A Mon Jan 08 2018 10:43:20 120
1245678 Building B Mon Jan 08 2018 10:43:20 120
Tue Jan 09 2018 10:43:20 70
Wed Jan 10 2018 10:43:20 120

It's sorted lexicographically because Date has object (string) dtype.
Workaround: add a new column of datetime dtype, place it before Date in the pivot_table index, and drop it afterwards:
In [74]: (df.assign(x=pd.to_datetime(df['Date']))
            .pivot_table(index=['Site Ref','Site Name', 'x', 'Date'])
            .reset_index(level='x', drop=True))
Out[74]:
Duration
Site Ref Site Name Date
1123456 Building D Thu Jan 11 2018 10:43:20 80
Fri Jan 12 2018 12:22:20 80
Mon Jan 15 2018 11:43:20 90
Wed Jan 17 2018 10:43:20 220
1212345 Building C Fri Jan 12 2018 10:43:20 100
1234567 Building A Mon Jan 08 2018 10:43:20 120
1245678 Building B Mon Jan 08 2018 10:43:20 120
Tue Jan 09 2018 10:43:20 70
Wed Jan 10 2018 10:43:20 120
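To see why the string dates come out in the wrong order, here is a minimal standard-library sketch comparing lexicographic and chronological order. The format string is an assumption based on the sample data in the question.

```python
from datetime import datetime

# Date strings for site 1123456, copied from the question.
dates = [
    "Thu Jan 11 2018 10:43:20",
    "Fri Jan 12 2018 12:22:20",
    "Mon Jan 15 2018 11:43:20",
    "Wed Jan 17 2018 10:43:20",
]

# Assumed layout: weekday, month, day, year, time.
FMT = "%a %b %d %Y %H:%M:%S"

# Plain string comparison orders by the weekday name first: Fri < Mon < Thu < Wed.
lexicographic = sorted(dates)

# Parsing to datetime first gives true chronological order.
chronological = sorted(dates, key=lambda s: datetime.strptime(s, FMT))

print(lexicographic)
print(chronological)
```

The first list starts with the Friday entry, which is the ordering pivot_table produced; the second matches the calendar.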

You need to convert your dates to datetime values rather than strings. Something like the following would work on your current pivot table:
df_pivot.reset_index(inplace=True)
df_pivot['Date'] = pd.to_datetime(df_pivot['Date'])
df_pivot.sort_values(by=['Site Ref', 'Date'], inplace=True)
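An alternative worth considering is to convert Date before pivoting, so that pivot_table sorts the index chronologically on its own. The following is a sketch on a small, deliberately shuffled subset of the question's data; the explicit format string is an assumption based on that sample.

```python
import pandas as pd

# Small subset of the question's data, deliberately out of date order.
df = pd.DataFrame(
    {
        "Site Ref": ["1123456", "1123456", "1123456", "1212345"],
        "Site Name": ["Building D", "Building D", "Building D", "Building C"],
        "Date": [
            "Mon Jan 15 2018 11:43:20",
            "Thu Jan 11 2018 10:43:20",
            "Fri Jan 12 2018 12:22:20",
            "Fri Jan 12 2018 10:43:20",
        ],
        "Duration": [90, 80, 80, 100],
    }
)

# Parse the strings once; the format is assumed from the sample values.
df["Date"] = pd.to_datetime(df["Date"], format="%a %b %d %Y %H:%M:%S")

# pivot_table sorts its index, and datetime values sort chronologically.
df_pivot = pd.pivot_table(
    df, index=["Site Ref", "Site Name", "Date"], values="Duration"
)
print(df_pivot)
```

The trade-off is that the index then displays ISO-style timestamps rather than the original strings.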

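Sort the values by Site Ref, then group with sort=False and take the mean, so the groups keep the order of the rows. Note that this relies on the rows for each site already being in chronological order: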
df.sort_values('Site Ref').groupby(['Site Ref','Site Name','Date'],sort=False).mean()
Duration
Site Ref Site Name Date
1123456 Building D Thu Jan 11 2018 10:43:20 80
Fri Jan 12 2018 12:22:20 80
Mon Jan 15 2018 11:43:20 90
Wed Jan 17 2018 10:43:20 220
1212345 Building C Fri Jan 12 2018 10:43:20 100
1234567 Building A Mon Jan 08 2018 10:43:20 120
1245678 Building B Mon Jan 08 2018 10:43:20 120
Tue Jan 09 2018 10:43:20 70
Wed Jan 10 2018 10:43:20 120
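If the rows are not already in date order, the same groupby idea can be made independent of the input order by sorting on a parsed copy of the date as well. This is a sketch only: the `_when` helper column is a hypothetical name, and the format string is assumed from the sample data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Site Ref": ["1245678", "1123456", "1123456", "1245678"],
        "Site Name": ["Building B", "Building D", "Building D", "Building B"],
        "Date": [
            "Tue Jan 09 2018 10:43:20",
            "Fri Jan 12 2018 12:22:20",
            "Thu Jan 11 2018 10:43:20",
            "Mon Jan 08 2018 10:43:20",
        ],
        "Duration": [70, 80, 80, 120],
    }
)

# Hypothetical helper column holding the parsed timestamps.
df["_when"] = pd.to_datetime(df["Date"], format="%a %b %d %Y %H:%M:%S")

result = (
    df.sort_values(["Site Ref", "_when"])  # site first, then chronological
    .groupby(["Site Ref", "Site Name", "Date"], sort=False)["Duration"]
    .mean()  # sort=False keeps groups in order of first appearance
)
print(result)
```

Selecting the Duration column before calling mean keeps the helper column out of the result while the original date strings stay in the index.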

Related

Get specific data from txt file to pandas dataframe

I have such data in a txt file:
Wed Mar 23 16:59:25 GMT 2022
1 State
1 ESTAB
Wed Mar 23 16:59:26 GMT 2022
1 State
1 ESTAB
1 CLOSE-WAIT
Wed Mar 23 16:59:27 GMT 2022
1 State
1 ESTAB
10 FIN-WAIT
Wed Mar 23 16:59:28 GMT 2022
1 State
1 CLOSE-WAIT
102 ESTAB
I want to get a pandas dataframe looking like this:
timestamp | State | ESTAB | FIN-WAIT | CLOSE-WAIT
Wed Mar 23 16:59:25 GMT 2022 | 1 | 1 | 0 | 0
Wed Mar 23 16:59:26 GMT 2022 | 1 | 1 | 0 | 1
Wed Mar 23 16:59:27 GMT 2022 | 1 | 1 | 10 | 0
Wed Mar 23 16:59:28 GMT 2022 | 1 | 102 | 0 | 1
That means the first line of each block should be used for the first column, timestamp. The other columns should be filled with the numbers, according to the string following each number. A new row begins with each block.
How can I do this with pandas?
First, read the text file into a list of lists. Each inner list holds the lines of one block, and the outer list holds the blocks:
import pandas as pd
with open('data.txt', 'r') as f:
    res = f.read()
records = [list(map(str.strip, line.strip().split('\n'))) for line in res.split('\n\n')]
print(records)
[['Wed Mar 23 16:59:25 GMT 2022', '1 State', '1 ESTAB'], ['Wed Mar 23 16:59:26 GMT 2022', '1 State', '1 ESTAB', '1 CLOSE-WAIT'], ['Wed Mar 23 16:59:27 GMT 2022', '1 State', '1 ESTAB', '10 FIN-WAIT'], ['Wed Mar 23 16:59:28 GMT 2022', '1 State', '1 CLOSE-WAIT', '102 ESTAB']]
Then turn the list of lists into a list of dictionaries by defining each key and value manually:
l = []
for record in records:
    d = {}
    d['timestamp'] = record[0]
    for r in record[1:]:
        key = r.split(' ')[1]
        value = r.split(' ')[0]
        d[key] = value
    l.append(d)
print(l)
[{'timestamp': 'Wed Mar 23 16:59:25 GMT 2022', 'State': '1', 'ESTAB': '1'}, {'timestamp': 'Wed Mar 23 16:59:26 GMT 2022', 'State': '1', 'ESTAB': '1', 'CLOSE-WAIT': '1'}, {'timestamp': 'Wed Mar 23 16:59:27 GMT 2022', 'State': '1', 'ESTAB': '1', 'FIN-WAIT': '10'}, {'timestamp': 'Wed Mar 23 16:59:28 GMT 2022', 'State': '1', 'CLOSE-WAIT': '1', 'ESTAB': '102'}]
Finally, pass this list of dictionaries to DataFrame and fill the NaN cells:
df = pd.DataFrame(l).fillna(0)
print(df)
timestamp State ESTAB CLOSE-WAIT FIN-WAIT
0 Wed Mar 23 16:59:25 GMT 2022 1 1 0 0
1 Wed Mar 23 16:59:26 GMT 2022 1 1 1 0
2 Wed Mar 23 16:59:27 GMT 2022 1 1 0 10
3 Wed Mar 23 16:59:28 GMT 2022 1 102 1 0
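In case the blocks in the file are not separated by blank lines, splitting on '\n\n' will not find them. A hedged alternative is to start a new block whenever a line does not begin with a digit. The sketch below assumes every count line starts with a number and every timestamp line starts with a letter; `parse_blocks` is a hypothetical helper name, and the counts are stored as integers.

```python
def parse_blocks(text):
    """Turn 'timestamp' lines followed by 'count state' lines into row dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank separator lines
        if line[0].isdigit():
            count, state = line.split(maxsplit=1)
            rows[-1][state] = int(count)  # attach to the current block
        else:
            rows.append({"timestamp": line})  # a new block starts here
    # Give every row the same keys, defaulting missing counts to 0.
    states = sorted({k for r in rows for k in r if k != "timestamp"})
    return [
        {"timestamp": r["timestamp"], **{s: r.get(s, 0) for s in states}}
        for r in rows
    ]


sample = """Wed Mar 23 16:59:25 GMT 2022
1 State
1 ESTAB
Wed Mar 23 16:59:27 GMT 2022
1 State
1 ESTAB
10 FIN-WAIT
"""
rows = parse_blocks(sample)
print(rows)
```

Passing `rows` to `pd.DataFrame` should then give one row per timestamp with integer columns.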
Try:
#read text file to a DataFrame
df = pd.read_csv("data.txt", header=None, skip_blank_lines=False)
#Extract possible column names
df["Column"] = df[0].str.extract("(State|ESTAB|FIN-WAIT|CLOSE-WAIT)")
#Remove the column names from the data
df[0] = df[0].str.replace("(State|ESTAB|FIN-WAIT|CLOSE-WAIT)","",regex=True)
df = df.dropna(how="all").fillna("timestamp")
df["Index"] = df["Column"].eq("timestamp").cumsum()
#Pivot the data to match expected output structure
output = df.pivot(index="Index", columns="Column", values=0)
#Re-format columns as needed
output = output.set_index("timestamp").astype(float).fillna(0).astype(int).reset_index()
>>> output
Column timestamp CLOSE-WAIT ESTAB FIN-WAIT State
0 Wed Mar 23 16:59:25 GMT 2022 0 1 0 1
1 Wed Mar 23 16:59:26 GMT 2022 1 1 0 1
2 Wed Mar 23 16:59:27 GMT 2022 0 1 10 1
3 Wed Mar 23 16:59:28 GMT 2022 1 102 0 1

How to split pandas column into two columns with strings and ints

I'm looking to split the column Date range into two columns, starting date and ending date. However, split doesn't seem to work because it does not recognise the '-'. Any advice?
I tried using
'''
ebola1 = pd.DataFrame(ebola['Date range'].str.split('-',1).to_list(),columns = ['start date','end date'])
'''
However, it returns the following:
So (1) it doesn't recognize the '-', (2) how do I distinguish between 'Jun-Nov 1976' and 'Oct 2001-Mar 2002', and (3) how do I include the new columns in the existing table?
Thanks for the help!
The data uses an en dash (–) rather than a hyphen (-), so split on '–' using Series.str.split with expand=True to get a DataFrame:
data = ['Jun–Nov 1976', 'Sep–Oct 1976', 'Jun 1977', 'Jul–Oct 1979', 'Nov 1994', 'Nov 1994–Feb 1995', 'Jan–Jul 1995', 'Jan–Mar 1996', 'Jul 1996–Jan 1997', 'Oct 2000–Feb 2001', 'Oct 2001–Mar 2002', 'Oct 2001–Mar 2002', 'Oct 2001–Mar 2002', 'Oct 2001–Mar 2002', 'Oct 2001–Mar 2002', 'Dec 2002–Apr 2003', 'Dec 2002–Apr 2003', 'Dec 2002–Apr 2003', 'Oct–Dec 2003', 'Apr–Jun 2004']
ebola = pd.DataFrame(data, columns=['Date range'])
ebola1 = ebola['Date range'].str.split('–', n=1, expand=True)
ebola1.columns = ['start date','end date']
Then use numpy.where to append the year extracted from end date (via Series.str.extract) to start date, but only where start date does not already contain a year (tested with Series.str.contains):
import numpy as np
mask = ebola1['start date'].str.contains(r'\d')
years = ebola1['end date'].str.extract(r'(\d+)', expand=False)
ebola1['start date'] = np.where(mask,
                                ebola1['start date'],
                                ebola1['start date'] + ' ' + years)
print (ebola1)
start date end date
0 Jun 1976 Nov 1976
1 Sep 1976 Oct 1976
2 Jun 1977 None
3 Jul 1979 Oct 1979
4 Nov 1994 None
5 Nov 1994 Feb 1995
6 Jan 1995 Jul 1995
7 Jan 1996 Mar 1996
8 Jul 1996 Jan 1997
9 Oct 2000 Feb 2001
10 Oct 2001 Mar 2002
11 Oct 2001 Mar 2002
12 Oct 2001 Mar 2002
13 Oct 2001 Mar 2002
14 Oct 2001 Mar 2002
15 Dec 2002 Apr 2003
16 Dec 2002 Apr 2003
17 Dec 2002 Apr 2003
18 Oct 2003 Dec 2003
19 Apr 2004 Jun 2004
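The same splitting rule can also be written as a plain function, which makes the two cases from the question ('Jun–Nov 1976' versus 'Oct 2001–Mar 2002') easy to check in isolation. This is a sketch, not the answer's method: `split_range` is a hypothetical helper, and it assumes that whenever an end date exists it contains a year.

```python
import re


def split_range(text):
    """Split a date range into (start, end); end is None for a single date."""
    # Accept both the en dash used in the data and a plain hyphen.
    parts = re.split(r"[–-]", text, maxsplit=1)
    start = parts[0].strip()
    end = parts[1].strip() if len(parts) == 2 else None
    # If the start has no year of its own, borrow the year from the end.
    if end is not None and not re.search(r"\d", start):
        year = re.search(r"\d+", end).group()
        start = f"{start} {year}"
    return start, end


for value in ["Jun–Nov 1976", "Oct 2001–Mar 2002", "Jun 1977"]:
    print(value, "->", split_range(value))
```

For the third point in the question, something like `ebola.join(ebola1)` could attach the new columns to the existing table, since join aligns on the index.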

How to generate even calendar dates?

Does anyone know how to generate a list in Calendar in python (or some other platform) with "even days", month and year from 2018 until 2021?
Example:
Sun, 02 Jan 2019
Tue, 04 Jan 2019
Thur, 06 Jan 2019
Sat, 08 Jan 2019
Sun, 10 Jan 2019
Tue, 12 Jan 2019
Thur, 14 Jan 2019
Sat, 16 Jan 2019
Sun, 18 Jan 2019
Tue, 20 Jan 2019
Thur, 22 Jan 2019
and so on, respecting the calendar until 2021.
EDIT:
how to generate in python a calendar list between 2018 and 2022 with 2 formats:
Day of the week, Date Month Year Time (hours: minutes: seconds) - Year-Month-Date Time (hours: minutes: seconds)
Note:
Dates: even dates only
Times: randomly generated times
Example:
Tue, 02 Jan 2018 00:59:23 - 2018-01-02 00:59:23
Thu, 04 Jan 2018 10:24:52 - 2018-01-04 10:24:52
Sat, 06 Jan 2018 04:11:09 - 2018-01-06 04:11:09
Mon, 08 Jan 2018 16:12:40 - 2018-01-08 16:12:40
Wed, 10 Jan 2018 10:08:15 - 2018-01-10 10:08:15
Fri, 12 Jan 2018 07:10:09 - 2018-01-12 07:10:09
Sun, 14 Jan 2018 11:50:10 - 2018-01-14 11:50:10
Tue, 16 Jan 2018 02:29:22 - 2018-01-16 02:29:22
Thu, 18 Jan 2018 19:07:20 - 2018-01-18 19:07:20
Sat, 20 Jan 2018 08:50:13 - 2018-01-20 08:50:13
Mon, 22 Jan 2018 02:40:02 - 2018-01-22 02:40:02
and so on, until the year 2022 ...
Here's something fairly simple that seems to work and handles leap years:
from calendar import isleap
from datetime import date
# Days in each month (1-12).
MDAYS = [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
def dim(year, month):
    """ Number of days in month of the given year. """
    return MDAYS[month] + ((month == 2) and isleap(year))
start_year, end_year = 2018, 2021
for year in range(start_year, end_year+1):
    for month in range(1, 12+1):
        days = dim(year, month)
        for day in range(1, days+1):
            if day % 2 == 0:
                dt = date(year, month, day)
                print(dt.strftime('%a, %d %b %Y'))
Output:
Tue, 02 Jan 2018
Thu, 04 Jan 2018
Sat, 06 Jan 2018
Mon, 08 Jan 2018
Wed, 10 Jan 2018
Fri, 12 Jan 2018
Sun, 14 Jan 2018
Tue, 16 Jan 2018
...
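As a variation, the hand-written table of month lengths could be replaced by calendar.monthrange from the standard library, which already accounts for leap years. The sketch below collects the even days of a single year; `even_dates` is a hypothetical helper name.

```python
from calendar import monthrange
from datetime import date


def even_dates(year):
    """Return every even-numbered calendar day of the given year."""
    result = []
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in the month).
        _, days = monthrange(year, month)
        result.extend(date(year, month, day) for day in range(2, days + 1, 2))
    return result


dates_2020 = even_dates(2020)
print(dates_2020[0].strftime('%a, %d %b %Y'))
print(len(dates_2020))
```

Stepping through range(2, days + 1, 2) avoids testing each day for evenness.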
Edit:
Here's a way to do what (I think) you asked how to do in your follow-on question:
from calendar import isleap
from datetime import date, datetime, time
from random import randrange
# Days in each month (1-12).
MDAYS = [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
def dim(year, month):
    """ Number of days in month of the given year. """
    return MDAYS[month] + ((month == 2) and isleap(year))
def whenever():
    """ Gets the time value. """
    # Currently just returns a randomly selected time of day.
    return time(*map(randrange, (24, 60, 60))) # hour:minute:second
start_year, end_year = 2018, 2021
for year in range(start_year, end_year+1):
    for month in range(1, 12+1):
        days = dim(year, month)
        for day in range(1, days+1):
            if day % 2 == 0:
                dt, when = date(year, month, day), whenever()
                dttm = datetime.combine(dt, when)
                print(dt.strftime('%a, %d %b %Y'), when, '-', dttm)
Output:
Tue, 02 Jan 2018 00:54:02 - 2018-01-02 00:54:02
Thu, 04 Jan 2018 10:19:51 - 2018-01-04 10:19:51
Sat, 06 Jan 2018 22:48:09 - 2018-01-06 22:48:09
Mon, 08 Jan 2018 06:48:46 - 2018-01-08 06:48:46
Wed, 10 Jan 2018 14:01:54 - 2018-01-10 14:01:54
Fri, 12 Jan 2018 05:42:43 - 2018-01-12 05:42:43
Sun, 14 Jan 2018 21:42:37 - 2018-01-14 21:42:37
Tue, 16 Jan 2018 08:08:39 - 2018-01-16 08:08:39
...
What about:
import datetime
d = datetime.date.today() # Define Start date
while d.year <= 2021: # This will go *through* 2021
    if d.day % 2 == 0: # Print if even date
        print(d.strftime('%a, %d %b %Y'))
    d += datetime.timedelta(days=1) # Jump forward a day
Fri, 02 Nov 2018
Sun, 04 Nov 2018
Tue, 06 Nov 2018
Thu, 08 Nov 2018
Sat, 10 Nov 2018
Mon, 12 Nov 2018
Wed, 14 Nov 2018
Fri, 16 Nov 2018
Sun, 18 Nov 2018
Tue, 20 Nov 2018
Thu, 22 Nov 2018
...
Fri, 24 Dec 2021
Sun, 26 Dec 2021
Tue, 28 Dec 2021
Thu, 30 Dec 2021
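The day-by-day loop can be wrapped in a generator with explicit bounds, so that the result does not depend on the date the script is run. This is a minimal sketch; `even_days` and the chosen start and end dates are assumptions taken from the 2018 to 2021 range in the question.

```python
from datetime import date, timedelta


def even_days(start, end):
    """Yield each date from start to end (inclusive) whose day of month is even."""
    current = start
    while current <= end:
        if current.day % 2 == 0:
            yield current
        current += timedelta(days=1)


days = list(even_days(date(2018, 1, 1), date(2021, 12, 31)))
print(days[0].strftime('%a, %d %b %Y'))   # first even day in the range
print(days[-1].strftime('%a, %d %b %Y'))  # last even day in the range
print(len(days))
```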

Change date format in pandas dataframe

I have this dataframe:
date value
1 Thu 17th Nov 2016 385.943800
2 Fri 18th Nov 2016 1074.160340
3 Sat 19th Nov 2016 2980.857860
4 Sun 20th Nov 2016 1919.723960
5 Mon 21st Nov 2016 884.279340
6 Tue 22nd Nov 2016 869.071070
7 Wed 23rd Nov 2016 760.289260
8 Thu 24th Nov 2016 2481.689270
9 Fri 25th Nov 2016 2745.990070
10 Sat 26th Nov 2016 2273.413250
11 Sun 27th Nov 2016 2630.414900
12 Mon 28th Nov 2016 817.322310
13 Tue 29th Nov 2016 1766.876030
14 Wed 30th Nov 2016 469.388420
I would like to change the format of the date column to this format YYYY-MM-DD. The dataframe consists of more than 200 rows, and every day new rows will be added, so I need to find a way to do this automatically.
This link is not helping because it sets the dates like this dates = ['30th November 2009', '31st March 2010', '30th September 2010'] and I can't do it for every row. Anyone knows a way to solve this?
Dateutil will do this job.
from dateutil import parser
print(df)
df2 = df.copy()
df2.date = df2.date.apply(lambda x: parser.parse(x))
df2
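If the column should end up as YYYY-MM-DD strings, as the question asks, one pandas-only alternative is to strip the ordinal suffix and parse with an explicit format. This is a sketch on a few rows copied from the question; the regular expression and the format string are assumptions based on those sample values.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": [
            "Thu 17th Nov 2016",
            "Mon 21st Nov 2016",
            "Tue 22nd Nov 2016",
            "Wed 23rd Nov 2016",
        ],
        "value": [385.9438, 884.27934, 869.07107, 760.28926],
    }
)

# Drop the ordinal suffix that follows the day number (17th -> 17).
cleaned = df["date"].str.replace(r"(?<=\d)(?:st|nd|rd|th)", "", regex=True)

# Parse with an explicit format, then render as YYYY-MM-DD strings.
df["date"] = pd.to_datetime(cleaned, format="%a %d %b %Y").dt.strftime("%Y-%m-%d")
print(df)
```

Because the conversion is applied to the whole column, rows added later are handled by rerunning the same two lines.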

calculating the mean of a list of timestamps ignoring weekend days in python

I have a list of timestamps and I want to calculate the mean of the list, but I need to ignore the weekend days which are Saturday and Sunday and consider Friday and Monday as one day. I only want to include the working days from Monday to Friday. This is an example of the list. I wrote the timestamps in readable format to follow the process easily.
Example:
['Wed Feb 17 12:57:40 2011', ' Wed Feb 8 12:57:40 2011', 'Tue Jan 25 17:15:35 2011']
MIN='Tue Jan 25 17:15:35 2011'
'Wed Feb 17 12:57:40 2011': since there are 6 weekend days between this timestamp and the MIN, I shift it back 6 days, giving 'Fri Feb 11 12:57:40 2011'.
'Wed Feb 8 12:57:40 2011': since there are 4 weekend days between this timestamp and the MIN, I shift it back 4 days, giving 'Wed Feb 4 12:57:40 2011'.
The new list is now ['Fri Feb 11 12:57:40 2011', 'Wed Feb 4 12:57:40 2011', 'Tue Jan 25 17:15:35 2011']
MAX= 'Fri Feb 11 12:57:40 2011'
average= (Fri Feb 11 12:57:40 2011 + Wed Feb 4 12:57:40 2011 + Tue Jan 25 17:15:35 2011) /3
difference= MAX - average
Edit: [Removed previous code, which had an error; replaced with code below.]
Here is some output from code that squeezes out weekends, computes average, and puts weekends back in to get an apparently valid average. The code is shown after the output from some test cases.
['Fri Jan 13 12:00:00 2012', 'Mon Jan 16 11:00:00 2012']
Average = Fri Jan 13 23:30:00 2012
['Fri Jan 13 12:00:00 2012', 'Mon Jan 16 13:00:00 2012']
Average = Mon Jan 16 00:30:00 2012
['Fri Jan 13 14:17:58 2012', 'Sat Jan 14 1:2:3 2012', 'Sun Jan 15 4:5:6 2012', 'Mon Jan 16 11:03:29 2012', 'Wed Jan 18 14:27:17 2012', 'Mon Jan 23 10:02:12 2012', 'Mon Jan 30 10:02:12 2012']
Average = Thu Jan 19 16:46:37 2012
['Fri Jan 14 14:17:58 2011', 'Mon Jan 17 11:03:29 2011', 'Wed Jan 19 14:27:17 2011', 'Mon Jan 24 10:02:12 2011']
Average = Wed Jan 19 00:27:44 2011
Python code:
from time import strptime, mktime, localtime, asctime
from math import floor
def averageBusinessDay(dates):
    f = [mktime(strptime(x)) for x in dates]
    h = [x for x in f if localtime(x).tm_wday < 5] # Get rid of weekend days
    bweek, cweek, dweek = 3600*24*5, 3600*24*7, 3600*24*2
    e = localtime(h[0]) # Get struct_time for first item
    # fm is first Monday in local time
    fm = mktime((e.tm_year, e.tm_mon, e.tm_mday-e.tm_wday, 0,0,0,0,0,0))
    i = [x-fm for x in h] # Subtract leading Monday
    j = [x-floor(x/cweek)*dweek for x in i] # Squeeze out weekends
    avx = sum(j)/len(j)
    avt = asctime(localtime(avx+floor(avx/bweek)*dweek+fm))
    return avt
def atest(dates):
    print(dates)
    print('Average =', averageBusinessDay(dates))
atest(['Fri Jan 13 12:00:00 2012', 'Mon Jan 16 11:00:00 2012'])
atest(['Fri Jan 13 12:00:00 2012', 'Mon Jan 16 13:00:00 2012'])
atest(['Fri Jan 13 14:17:58 2012', 'Sat Jan 14 1:2:3 2012', 'Sun Jan 15 4:5:6 2012', 'Mon Jan 16 11:03:29 2012', 'Wed Jan 18 14:27:17 2012', 'Mon Jan 23 10:02:12 2012', 'Mon Jan 30 10:02:12 2012'])
atest(['Fri Jan 14 14:17:58 2011', 'Mon Jan 17 11:03:29 2011', 'Wed Jan 19 14:27:17 2011', 'Mon Jan 24 10:02:12 2011'])
Split each string on ' ' and take the first element; if it's not 'Sat' or 'Sun', it's a weekday. Now I need to know what you mean by the "mean" of a list of dates.
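A slightly more robust variant of that weekday check parses each timestamp and asks the resulting datetime for its weekday, instead of trusting the leading text. This is a sketch; the format string is an assumption based on the sample timestamps, and `is_weekday` is a hypothetical helper.

```python
from datetime import datetime

FMT = "%a %b %d %H:%M:%S %Y"  # assumed layout, e.g. 'Fri Jan 13 14:17:58 2012'


def is_weekday(stamp):
    """True if the timestamp falls on Monday through Friday."""
    # weekday() returns 0 for Monday through 6 for Sunday.
    return datetime.strptime(stamp.strip(), FMT).weekday() < 5


stamps = [
    "Fri Jan 13 14:17:58 2012",
    "Sat Jan 14 1:2:3 2012",
    "Sun Jan 15 4:5:6 2012",
    "Mon Jan 16 11:03:29 2012",
]
weekdays = [s for s in stamps if is_weekday(s)]
print(weekdays)
```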
