The first 20 observations in my data are:
id day hour consumption
0 012af199245dedacf9ea0ba6eedef4e89272c7dc Saturday 8 0.000000
1 019ebd48fe9c9ab20051e9de1d5ddfc6fd13c55b Tuesday 16 0.000000
2 0310daaa6368cf0618f341351b8451e509da27d7 Wednesday 17 0.000000
3 04a2ddb034ff774cda02130fd59b280d55f762d7 Tuesday 16 -0.017699
4 04d61391eeea5b957847dbe08b52d88e64909dbf Thursday 15 0.000000
5 04f1fa8b29c58e19eebf0e26169975a66ec7cbbf Tuesday 15 0.000000
6 0561aa699b6c91c842b850c6b73ee4b3c8cbb03b Thursday 12 -0.002597
7 059492a3600ef0b39726af2201a0ad87610a4a02 Thursday 17 0.000000
8 059fb9175372802b43b3fdcebd2a507bc89e71b0 Thursday 12 -0.001541
9 05da142ebe95e15ab30dee30d1a982d8f419dfb2 Tuesday 20 -0.003050
10 0663c2fd03deecf7f52c3e5c7c0be5c94a3292b8 Sunday 13 -0.005613
11 07040b85d9c0c0ff122b3fef3ab73eab6c53ff0e Saturday 18 0.000000
12 07a33356cb6330b2090152d30413b224ad1c018b Saturday 20 0.005013
13 07d67b08fab92657c699dbeec931a48c9f1cfbf7 Friday 15 -0.015675
14 07f92e8eb78f9d8ab6446ffd2649990cffce2ead Friday 16 -0.004035
15 086cfca739da633d89100874a6c91c37e04880af Friday 0 -0.004068
16 0a64e559b80b819b2a48a939fa96b1f3f3791e54 Monday 12 -0.007687
17 0b477ac123374072c5acf34d1d063d6ae6c4bf0b Friday 21 0.000000
18 0bf144e77495b06fb319f4a312f09015da7c5afd Tuesday 4 0.000000
19 0d1263d90f5a5449a1d0eb80c0f217daff646d36 Saturday 8 -0.005963
I am trying to create a heatmap by doing:
sns.heatmap(df.pivot("day", "hour", "consumption"))
But I am getting the error:
ValueError: Index contains duplicate entries, cannot reshape
I attempted using pivot_table() instead, which according to the documentation handles duplicate entries by aggregating them. But then I get:
DataError: No numeric types to aggregate
Alternatively, plotting the number of the day of the week instead works:
# Convert dates into the number of the day of the week
# 0=Mon; 6=Sunday
df['day_num'] = df['timestamp'].dt.weekday
sns.heatmap(df.pivot_table(index="day_num", columns="hour", values="consumption"))
However, I'd like to keep the names (or abbreviations) in the plot.
How can I fix this?
Note that the parameters of pivot and pivot_table are ordered differently, so if you don't name them, the argument order has to change accordingly:
pivot
# pivot(index=None, columns=None, values=None)
df.pivot('day', 'hour', 'consumption')
pivot_table
# pivot_table(values=None, index=None, columns=None, ...)
df.pivot_table('consumption', 'day', 'hour')
To avoid ambiguity, I suggest using named params:
sns.heatmap(df.pivot_table(index='day', columns='hour', values='consumption'))
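Two follow-ups, since the question hits both problems: if consumption is stored as strings, pivot_table raises the "No numeric types to aggregate" error, and by default the day rows come out alphabetically rather than in calendar order. A minimal sketch of both fixes, assuming the day column holds full weekday names:
import pandas as pd
import seaborn as sns

# Make sure the values are numeric; object dtype is what triggers
# 'DataError: No numeric types to aggregate'
df['consumption'] = pd.to_numeric(df['consumption'], errors='coerce')

# An ordered categorical makes pivot_table sort rows Monday..Sunday
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['day'] = pd.Categorical(df['day'], categories=days, ordered=True)

sns.heatmap(df.pivot_table(index='day', columns='hour', values='consumption'))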
Related
I've got a dataframe with two columns: one is a datetime column consisting of dates, and the other one holds a quantity. It looks something like this:
Date Quantity
0 2019-01-05 10
1 2019-01-10 15
2 2019-01-22 14
3 2019-02-03 12
4 2019-05-11 25
5 2019-05-21 4
6 2019-07-08 1
7 2019-07-30 15
8 2019-09-05 31
9 2019-09-10 44
10 2019-09-25 8
11 2019-12-09 10
12 2020-04-11 111
13 2020-04-17 5
14 2020-06-05 17
15 2020-06-16 12
16 2020-06-22 14
I want to make another dataframe. It should consist of two columns: one is Month/Year and the other is Till Highest. I basically want to calculate the highest quantity value seen up to and including that month, grouped by month/year. Precisely, what I want is:
Month/Year Till Highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
In my case the dataset is vast, and I have readings for almost every day of each month and each year in the specified timeline. Here I've made a dummy dataset to show an example of what I want.
Please help me with this. Thanks in advance :)
See the annotated code:
(df
 # convert date to monthly period (2019-01)
 .assign(Date=pd.to_datetime(df['Date']).dt.to_period('M'))
 # period and max quantity per month
 .groupby('Date')
 .agg(**{'Month/Year': ('Date', 'first'),
         'Till highest': ('Quantity', 'max')})
 # format periods as Jan/2019 and get the cumulated max quantity
 .assign(**{'Month/Year': lambda d: d['Month/Year'].dt.strftime('%b/%Y'),
            'Till highest': lambda d: d['Till highest'].cummax()})
 # drop the groupby index
 .reset_index(drop=True)
)
output:
Month/Year Till highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
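For reference, the chain above assumes df already holds the question's sample data; it can be rebuilt for testing like this:
import pandas as pd

df = pd.DataFrame({
    'Date': ['2019-01-05', '2019-01-10', '2019-01-22', '2019-02-03',
             '2019-05-11', '2019-05-21', '2019-07-08', '2019-07-30',
             '2019-09-05', '2019-09-10', '2019-09-25', '2019-12-09',
             '2020-04-11', '2020-04-17', '2020-06-05', '2020-06-16',
             '2020-06-22'],
    'Quantity': [10, 15, 14, 12, 25, 4, 1, 15, 31, 44, 8, 10, 111, 5, 17, 12, 14],
})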
In R you can use cummax:
df=data.frame(Date=c("2019-01-05","2019-01-10","2019-01-22","2019-02-03","2019-05-11","2019-05-21","2019-07-08","2019-07-30","2019-09-05","2019-09-10","2019-09-25","2019-12-09","2020-04-11","2020-04-17","2020-06-05","2020-06-16","2020-06-22"),Quantity=c(10,15,14,12,25,4,1,15,31,44,8,10,111,5,17,12,14))
data.frame(`Month/Year` = unique(format(as.Date(df$Date), "%b/%Y")),
           `Till Highest` = cummax(tapply(df$Quantity, sub("-..$", "", df$Date), max)),
           check.names = F, row.names = NULL)
Month/Year Till Highest
1 Jan/2019 15
2 Feb/2019 15
3 May/2019 25
4 Jul/2019 25
5 Sep/2019 44
6 Dec/2019 44
7 Apr/2020 111
8 Jun/2020 111
Assume I have the following two dataframes, DataFrame A and DataFrame B.
DataFrame A has four columns: Year, Month, Day and Temperature (e.g. 2021 || 7 || 5 || 23). Currently, some of the temperature cells in DataFrame A are NaN.
DataFrame B has two columns: Date and Temperature (e.g. 2021/7/7 || 28).
The time intervals of DataFrame A and DataFrame B are different; DataFrame A's interval is smaller than DataFrame B's, but some of the timestamps overlap (e.g. every 10 mins in DataFrame B and every 5 mins in DataFrame A).
Now I want to copy the temperature data from DataFrame B into DataFrame A wherever DataFrame A has a NaN value.
I have a method that uses a loop, but it is very slow. I want to use pandas vectorization instead, but I don't know how. Can anyone teach me?
import pandas as pd
from tqdm import tqdm

# Slow row-by-row fill: for each NaN, look up the matching date string in dfB
for i in tqdm(range(len(dfA['Temperature']))):
    if pd.isna(dfA['Temperature'].iloc[i]):
        date_time_str = (str(dfA['Year'].iloc[i]) + '/'
                         + str(dfA['Month'].iloc[i]) + '/'
                         + str(dfA['Day'].iloc[i]))
        try:
            dfA['Temperature'].iloc[i] = float(dfB.loc[dfB['Date'] == date_time_str].iloc[:, 1])
        except (TypeError, ValueError):
            print("no value")
This solution is very slow; how can I do it with pandas vectorization?
Method I tried for vectorization:
dfA.loc[df['temp'].isnull() & ((datetime.datetime(dfA['Year'], df['*Month'], dfA['Day']).strftime("%Y/%m/%d %H:%M"))in dfB.Date.values) , 'temp'] = float(dfB[dfB['Date'] == datetime.datetime(dfA['Year'], df['*Month'], dfA['Day']].iloc[:, 1])
Above is my attempt, but it doesn't work.
Example data:
DataFrame A
Year Month Day Temperature
2020 1 17 25
2020 1 18 NaN
2020 1 19 28
2020 1 20 NaN
2020 1 21 NaN
2020 1 22 NaN
DataFrame B
Date Temp
1/17/2020 25
1/19/2020 28
1/21/2020 31
1/23/2020 34
1/25/2020 23
1/27/2020 54
Expected Output
Year Month Day Temperature
2020 1 17 25
2020 1 18 NaN
2020 1 19 28
2020 1 20 NaN
2020 1 21 31
2020 1 22 NaN
Let's map them:
dfa['Date'] = pd.to_datetime(dfa[['Day', 'Month', 'Year']])
dfb['Date'] = pd.to_datetime(dfb['Date'])
dfa['Temperature'] = dfa['Temperature'].fillna(dfa.pop('Date').map(dfb.set_index('Date')['Temp']))
Using fillna here means only the missing temperatures are filled in; existing values in DataFrame A are left untouched.
OR
Let's merge them:
dfa['Date']=pd.to_datetime(dfa[['Day','Month','Year']])
dfb['Date']=pd.to_datetime(dfb['Date'])
dfa=dfa.merge(dfb[['Date','Temp']],on='Date',how='left')
dfa['Temperature']=dfa['Temperature'].fillna(dfa.pop('Temp'))
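In the merge variant the helper Date column sticks around afterwards; it can be dropped once Temperature is filled:
dfa = dfa.drop(columns='Date')  # the merge key is no longer needed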
One way using pandas.to_datetime with pandas.Series.fillna, which aligns on the index (hence both frames are indexed by the full date first):
df1 = df1.set_index(pd.to_datetime(df1[["Year", "Month", "Day"]]))
s = df2.set_index(pd.to_datetime(df2.pop("Date"))).squeeze()
df1["Temperature"] = df1["Temperature"].fillna(s)
print(df1.reset_index(drop=True))
Output:
Year Month Day Temperature
0 2020 1 17 25.0
1 2020 1 18 NaN
2 2020 1 19 28.0
3 2020 1 20 NaN
4 2020 1 21 31.0
5 2020 1 22 NaN
I have a dataframe that contains two columns: Date, from 2018 until now, and Orders, with the order count for each day.
Date Orders
0 2018-01-01 57
1 2018-01-02 324
2 2018-01-03 54
3 2018-01-04 677
4 2018-01-05 234
5 2018-01-06 54
6 2018-01-07 234
7 2018-01-08 65
8 2018-01-09 234
9 2018-01-10 54
10 2018-01-11 234
11 2018-01-12 65
12 2018-01-13 7
13 2018-01-14 6
14 2018-01-15 57
15 2018-01-16 324
16 2018-01-17 54
17 2018-01-18 677
18 2018-01-19 234
19 2018-01-20 54
...
I need to export this into multiple Excel files so that every file contains the data for only one particular month.
I am trying to work on this script but I am stuck:
import pandas as pd
df = pd.read_excel("data/SampleData.xlsx")
for dates in Date:
    currMonth = something???
    filename = 'file_' + list(set(pd.to_datetime(df.loc[currMonth, 'datestart']).dt.strftime('%m%d%y')))[0] + '.xlsx'
    df.loc[idx, 'data'].to_excel(filename)
So I think I have to create a variable that stores the start and end of each month and then iterate through it.
Any idea how to get this to work?
You might want to have a look at the pandas Series.dt.month docs. You can use simple integers to address the month, so you should be able to iterate like this (not tested):
# assumes df['Date'] is already datetime64; if not, convert with pd.to_datetime first
for month in range(1, 13):
    df_per_month = df[df['Date'].dt.month == month]
    df_per_month.to_excel(f'{month}.xlsx')
Edit: Note that according to docs, month ranges from 1-12.
Also, if you want to iterate month and year, you would have to do something like:
for year in range(2018, 2022):
    for month in range(1, 13):
        data = df[(df['Date'].dt.month == month) & (df['Date'].dt.year == year)]
        data.to_excel(f'{month}-{year}.xlsx')
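Alternatively, grouping on a monthly period sidesteps the nested loops entirely and only writes files for months that actually appear in the data. A sketch, assuming the Date column parses cleanly:
import pandas as pd

df = pd.read_excel("data/SampleData.xlsx")
df['Date'] = pd.to_datetime(df['Date'])

# one file per month present in the data, named e.g. 2018-01.xlsx
for period, chunk in df.groupby(df['Date'].dt.to_period('M')):
    chunk.to_excel(f'{period}.xlsx', index=False)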
I subsetted a big dataframe, slicing out only one column, Start Time, of dtype object.
test = taxi_2020['Start Time']
Got a column
0 00:15:00
1 00:15:00
2 00:15:00
3 00:15:00
4 00:15:00
...
4137289 00:00:00
4137290 00:00:00
4137291 00:00:00
4137292 00:00:00
4137293 00:00:00
Name: Start Time, Length: 4137294, dtype: object
Then I grouped and summarized it by count (to the best of my knowledge):
test.value_counts().sort_index().reset_index()
and got two columns
index Start Time
0 00:00:00 24005
1 00:15:00 22815
2 00:30:00 20438
3 00:45:00 19012
4 01:00:00 18082
... ... ...
91 22:45:00 32365
92 23:00:00 31815
93 23:15:00 29582
94 23:30:00 26903
95 23:45:00 24599
I'm not sure why this index column appeared, and I have failed to rename or convert it.
What would I like to see?
My ideal output groups the time by hour (24h format is OK). The data counts rides every 15 min, so basically each group of four consecutive rows should be summed together; 00:15:00 can count towards hour 0 and 23:00:00 towards hour 23.
My ideal output:
Hour Rides
0 34000
1 60000
2 30000
3 40000
I would like to create afterwards a simple histogram to show the occurrence by the hour.
Appreciate any help!
IIUC,
import numpy as np
import pandas as pd

# Create a dummy input dataframe
test = pd.DataFrame({'time': pd.date_range('2020-06-01', '2020-06-01 23:59:00', freq='15T').strftime('%H:%M:%S'),
                     'rides': np.random.randint(15000, 28000, 96)})
Let's create a DatetimeIndex from the strings and resample, aggregating with sum, then convert the DatetimeIndex to hours:
# select the numeric column so the string 'time' column doesn't break the sum
test2 = (test.set_index(pd.to_datetime(test['time'], format='%H:%M:%S'))['rides']
             .rename_axis('hour').resample('H').sum())
test2.index = test2.index.hour
test2.reset_index()
Output:
hour rides
0 0 74241
1 1 87329
2 2 76933
3 3 86208
4 4 88002
5 5 82618
6 6 82188
7 7 81203
8 8 78591
9 9 95592
10 10 99778
11 11 85294
12 12 93931
13 13 80490
14 14 84181
15 15 71786
16 16 90962
17 17 96568
18 18 85646
19 19 88324
20 20 83595
21 21 89284
22 22 72061
23 23 74057
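Equivalently, you can skip the resample and group directly on the extracted hour; a one-chain sketch against the same dummy test frame:
hourly = (test.assign(hour=pd.to_datetime(test['time'], format='%H:%M:%S').dt.hour)
              .groupby('hour', as_index=False)['rides'].sum())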
Step by step, I found the answer myself.
Using this code, I renamed the columns of the value_counts() result:
test2 = (test.value_counts().sort_index().reset_index()
             .rename(columns={'index': 'Time', 'Start Time': 'Rides'}))
The remaining question was how to summarize by the hour. After applying
test2['hour'] = pd.to_datetime(test2['Time'], format='%H:%M:%S').dt.hour
I came closer: each row now carries its hour.
Finally, I grouped by the hour value:
test3 = test2.groupby('hour', as_index=False).agg({"Rides": "sum"})
print(test3)
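For the histogram-style view the question asks for, a bar chart of test3 is one call away (assuming matplotlib is installed):
import matplotlib.pyplot as plt

test3.plot.bar(x='hour', y='Rides', legend=False)
plt.xlabel('Hour')
plt.ylabel('Rides')
plt.show()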
I have a Pandas dataframe with a DatetimeIndex and some other columns, similar to this:
import pandas as pd
import numpy as np
rng = pd.date_range('2017-12-01', '2018-01-05', freq='6H')  # avoid shadowing the built-in range
df = pd.DataFrame(index=rng)
# Average speed in miles per hour
df['value'] = np.random.randint(low=0, high=60, size=len(df.index))
df.info()
# DatetimeIndex: 141 entries, 2017-12-01 00:00:00 to 2018-01-05 00:00:00
# Freq: 6H
# Data columns (total 1 columns):
# value 141 non-null int64
# dtypes: int64(1)
# memory usage: 2.2 KB
df.head(10)
# value
# 2017-12-01 00:00:00 15
# 2017-12-01 06:00:00 54
# 2017-12-01 12:00:00 19
# 2017-12-01 18:00:00 13
# 2017-12-02 00:00:00 35
# 2017-12-02 06:00:00 31
# 2017-12-02 12:00:00 58
# 2017-12-02 18:00:00 6
# 2017-12-03 00:00:00 8
# 2017-12-03 06:00:00 30
How can I select or filter the entries that are:
Weekdays only (that is, not weekend days Saturday or Sunday)
Not within N days of the dates in a list (e.g. U.S. holidays like '12-25' or '01-01')?
I was hoping for something like:
df = exclude_Sat_and_Sun(df)
omit_days = ['12-25', '01-01']
N = 3 # days near the holidays
df = exclude_days_near_omit_days(N, omit_days)
I was thinking of creating a new column to break out the month and day and then comparing them to the criteria for 1 and 2 above. However, I was hoping for something more Pythonic using the DateTimeIndex.
Thanks for any help.
The first part can be easily accomplished using the Pandas DatetimeIndex.dayofweek property, which counts weekdays starting with Monday as 0 and ending with Sunday as 6.
df[df.index.dayofweek < 5] will give you only the weekdays.
For the second part you can use the datetime module. Below I will give an example for only one date, namely 2017-12-25. You can easily generalize it to a list of dates, for example by defining a helper function.
from datetime import datetime, timedelta
N = 3
df[abs(df.index.date - datetime.strptime("2017-12-25", '%Y-%m-%d').date()) > timedelta(N)]
This will give all dates that are more than N=3 days away from 2017-12-25. That is, it will exclude an interval of 7 days from 2017-12-22 to 2017-12-28.
Lastly, you can combine the two criteria using the & operator, as you probably know.
df[
    (df.index.dayofweek < 5)
    & (abs(df.index.date - datetime.strptime("2017-12-25", '%Y-%m-%d').date()) > timedelta(N))
]
I followed the answer by @Bahman Engheta and created a function to omit dates from a dataframe.
import pandas as pd
from datetime import datetime, timedelta
def omit_dates(df, list_years, list_dates, omit_days_near=3, omit_weekends=False):
    '''
    Given a Pandas dataframe with a DatetimeIndex, remove rows that have a date
    near a given list of dates and/or a date on a weekend.

    Parameters:
    ----------
    df : Pandas dataframe
    list_years : list of str
        Contains a list of years in string form
    list_dates : list of str
        Contains a list of dates in string form encoded as MM-DD
    omit_days_near : int
        Threshold of days away from list_dates to remove. For example, if
        omit_days_near=3, then omit all days that are 3 days away from
        any date in list_dates.
    omit_weekends : bool
        If true, omit dates that are on weekends.

    Returns:
    -------
    Pandas dataframe
        New resulting dataframe with dates omitted.
    '''
    if not isinstance(df, pd.DataFrame):
        raise ValueError("df is expected to be a Pandas dataframe, not %s" % type(df).__name__)
    if not isinstance(df.index, pd.DatetimeIndex):
        raise ValueError("Dataframe is expected to have an index of DatetimeIndex, not %s" %
                         type(df.index).__name__)
    if not isinstance(list_years, list):
        list_years = [list_years]
    if not isinstance(list_dates, list):
        list_dates = [list_dates]

    result = df.copy()
    if omit_weekends:
        result = result.loc[result.index.dayofweek < 5]
    dates_to_omit = ['%s-%s' % (year, date) for year in list_years for date in list_dates]
    for date in dates_to_omit:
        result = result.loc[abs(result.index.date - datetime.strptime(date, '%Y-%m-%d').date()) > timedelta(omit_days_near)]
    return result
Here is example usage. Suppose you have a dataframe that has a DatetimeIndex and other columns, like this:
import pandas as pd
import numpy as np
rng = pd.date_range('2017-12-01', '2018-01-05', freq='1D')
df = pd.DataFrame(index=rng)
df['value'] = np.random.randint(low=0, high=60, size=len(df.index))
The resulting dataframe looks like this:
value
2017-12-01 42
2017-12-02 35
2017-12-03 49
2017-12-04 25
2017-12-05 19
2017-12-06 28
2017-12-07 21
2017-12-08 57
2017-12-09 3
2017-12-10 57
2017-12-11 46
2017-12-12 20
2017-12-13 7
2017-12-14 5
2017-12-15 30
2017-12-16 57
2017-12-17 4
2017-12-18 46
2017-12-19 32
2017-12-20 48
2017-12-21 55
2017-12-22 52
2017-12-23 45
2017-12-24 34
2017-12-25 42
2017-12-26 33
2017-12-27 17
2017-12-28 2
2017-12-29 2
2017-12-30 51
2017-12-31 19
2018-01-01 6
2018-01-02 43
2018-01-03 11
2018-01-04 45
2018-01-05 45
Now, let's specify dates to remove. I want to remove the dates '12-10', '12-25', '12-31', and '01-01' (following MM-DD notation) and all dates within 2 days of those dates. Further, I want to remove those dates from both the years '2016' and '2017'. I also want to remove weekend dates.
I'll call my function like this:
years = ['2016', '2017']
holiday_dates = ['12-10', '12-25', '12-31', '01-01']
omit_dates(df, years, holiday_dates, omit_days_near=2, omit_weekends=True)
The result is:
value
2017-12-01 42
2017-12-04 25
2017-12-05 19
2017-12-06 28
2017-12-07 21
2017-12-13 7
2017-12-14 5
2017-12-15 30
2017-12-18 46
2017-12-19 32
2017-12-20 48
2017-12-21 55
2017-12-22 52
2017-12-28 2
2018-01-03 11
2018-01-04 45
2018-01-05 45
Is that answer correct? Here are the calendars for December 2017 and January 2018:
December 2017
Su Mo Tu We Th Fr Sa
1 2
3 4 5 6 7 8 9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31
January 2018
Su Mo Tu We Th Fr Sa
1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31
Looks like it works.
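As an aside, if the omitted dates really are U.S. holidays, pandas ships a federal holiday calendar, so the MM-DD list doesn't have to be maintained by hand. A sketch of the same weekday-plus-window filter built on it (the N-day window logic here is my own, not part of the calendar API):
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

N = 2
holidays = USFederalHolidayCalendar().holidays(start=df.index.min(), end=df.index.max())

# distance from every row to every holiday; keep rows farther than N days from all of them
dist = np.abs(df.index.values[:, None] - holidays.values[None, :])
far_from_holidays = (dist > np.timedelta64(N, 'D')).all(axis=1)

result = df[(df.index.dayofweek < 5) & far_from_holidays]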