astype() does not change floats - python

Even though this seems really simple, it drives me nuts. Why is .astype(int) not changing the floats to ints? Thank you
import pandas as pd
import numpy as np

df_new = pd.crosstab(df["date"], df["place"]).reset_index()
places = ['cityA', "cityB", "cityC"]
df_new[places] = df_new[places].fillna(0).astype(int)
sums = df_new.select_dtypes(np.number).sum().rename('total')  # pd.np was removed in pandas 1.0
df_new = df_new.append(sums)  # DataFrame.append was removed in pandas 2.0; use pd.concat([df_new, sums.to_frame().T])
print(df_new)
Output:
place date cityA cityB cityC
0 2008-01-01 0.0 0.0 51.0
1 2009-06-01 0.0 618.0 0.0
2 2015-07-01 549.0 0.0 0.0
3 2016-01-01 41.0 0.0 0.0
4 2016-04-01 62.0 0.0 0.0
5 2017-01-01 800.0 0.0 0.0
6 2018-07-01 69.0 0.0 0.0
total NaT 1521.0 618.0 51.0

The .astype(int) itself works; it is the append step that brings the floats back. The appended total row has no value for date, so a missing value is introduced there, and since missing values (NaN) are floats in pandas, the values are upcast to float in the process. See the pandas docs on integer NA support.
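A minimal sketch of the upcast, plus one way around it with the nullable Int64 dtype (the series below are illustrative, not from the question):
import pandas as pd
import numpy as np

s = pd.Series([1, 2, 3])   # dtype: int64
s.loc[3] = np.nan          # introducing a missing value...
print(s.dtype)             # float64 -- the NaN forced the upcast

s2 = pd.Series([1, 2, 3], dtype='Int64')  # nullable integer dtype
s2.loc[3] = pd.NA
print(s2.dtype)            # Int64 -- stays integer, missing value shown as <NA>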

Related

transform event based data into time series data with pandas using groupby and reindex

We want to transform event-based data into multiple time series.
As an example, we use pandas to plot the changes in salary per employee in a company over time. A salary-change event is an entry in a table with a date, a name, and the new salary.
employee salary
date
2000-01-01 anna 4500
2003-01-01 oli 5000
2010-01-01 anna 6500
2012-01-01 lena 5000
2013-01-01 oli 7000
2016-01-01 lena 6500
2017-01-09 joe 5000
2018-01-09 peter 5000
2019-01-09 joe 5500
2019-01-31 lena 0
2020-01-01 anna 8500
2020-01-09 peter 5500
2021-01-09 joe 6000
2022-02-28 peter 0
The changes happen at irregularly spaced intervals, so to work with the data we want to reindex to a common, regularly spaced index and then do fill operations on the missing data points.
time_series_index = pd.date_range(df_events.index.min(), df_events.index.max())
df_time_series = pd.DataFrame()
for name, group in df_events.groupby('employee'):
    time_series = group['salary'].reindex(time_series_index)
    time_series = time_series.ffill().fillna(0)
    df_time_series[name] = time_series
print(df_time_series)
anna joe lena oli peter
2000-01-01 4500.0 0.0 0.0 0.0 0.0
2000-01-02 4500.0 0.0 0.0 0.0 0.0
2000-01-03 4500.0 0.0 0.0 0.0 0.0
2000-01-04 4500.0 0.0 0.0 0.0 0.0
2000-01-05 4500.0 0.0 0.0 0.0 0.0
... ... ... ... ... ...
2022-02-24 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-25 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-26 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-27 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-28 8500.0 6000.0 0.0 7000.0 0.0
The loop above does the job of reindexing to a common index.
Now the question arose whether this approach is state of the art, or whether there is a more compact and straightforward way to do it. We assume that transforming events into time series is a common problem, so we expected there to be a standard way to solve this kind of problem.
We tried to make it compact by removing the loop as follows.
df_time_series = df_events.groupby('employee')['salary'].reindex(time_series_index)
It throws AttributeError:
AttributeError: 'SeriesGroupBy' object has no attribute 'reindex'
This should work. If your index is already a datetime index, then you do not need the .rename(pd.to_datetime) part:
(df.rename(pd.to_datetime)             # make sure the index is datetime
   .set_index('employee', append=True) # -> MultiIndex (date, employee)
   .unstack()                          # one column per employee
   .asfreq('D')                        # reindex to a daily frequency
   .ffill()                            # carry the last salary forward
   .fillna(0))                         # zero before the first event
Output:
salary
employee anna joe lena oli peter
2000-01-01 4500.0 0.0 0.0 0.0 0.0
2000-01-02 4500.0 0.0 0.0 0.0 0.0
2000-01-03 4500.0 0.0 0.0 0.0 0.0
2000-01-04 4500.0 0.0 0.0 0.0 0.0
2000-01-05 4500.0 0.0 0.0 0.0 0.0
... ... ... ... ... ...
2022-02-24 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-25 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-26 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-27 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-28 8500.0 6000.0 0.0 7000.0 0.0
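As for the AttributeError in the question: reindex is not defined on a SeriesGroupBy object, but the same effect can be had by applying it per group. A sketch of that variant, assuming df_events and time_series_index from the question:
df_time_series = (df_events.groupby('employee')['salary']
                  .apply(lambda s: s.reindex(time_series_index))  # reindex each group
                  .unstack(level=0)   # employees become columns
                  .ffill()            # carry the last salary forward
                  .fillna(0))         # zero before the first event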

Repeating dates in pandas DataFrame without hour format

I'm trying to insert a range of date labels in my dataframe, df1. I've managed to get part of the way, but I still have some bumps that I want to smooth out.
I'm trying to generate a column with dates from 2017-01-01 to 2020-12-31 with all dates repeated 24 times, i.e., a column with 35,068 rows.
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
num_repeats = 24
repeated_dates = pd.DataFrame(dates.repeat(num_repeats))
df1.insert(0, 'Date', repeated_dates)
However, it only generates some of the repetitions of the last date, so my column is NaT for the remaining x hours.
output:
Date DK1 Up DK1 Down DK2 Up DK2 Down
0 2017-01-01 0.0 0.0 0.0 0.0
1 2017-01-01 0.0 0.0 0.0 0.0
2 2017-01-01 0.0 0.0 0.0 0.0
3 2017-01-01 0.0 0.0 0.0 0.0
4 2017-01-01 0.0 0.0 0.0 0.0
... ... ... ... ... ...
35063 2020-12-31 0.0 0.0 0.0 0.0
35064 NaT 0.0 0.0 0.0 0.0
35065 NaT 0.0 -54.1 0.0 0.0
35066 NaT 25.5 0.0 0.0 0.0
35067 NaT 0.0 0.0 0.0 0.0
Furthermore, how can I change the date format from '2017-01-01' to '01-01-2017'?
You set this up perfectly, so here are the dates that you have:
import pandas as pd
import numpy as np
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
num_repeats = 24
df = pd.DataFrame(dates.repeat(num_repeats), columns=['date'])
and converting the column to the format you want is simple with the strftime function
df['newFormat'] = df['date'].dt.strftime('%d-%m-%Y')
Which gives
date newFormat
0 2017-01-01 01-01-2017
1 2017-01-01 01-01-2017
2 2017-01-01 01-01-2017
3 2017-01-01 01-01-2017
4 2017-01-01 01-01-2017
... ... ...
35059 2020-12-31 31-12-2020
35060 2020-12-31 31-12-2020
35061 2020-12-31 31-12-2020
35062 2020-12-31 31-12-2020
35063 2020-12-31 31-12-2020
Now,
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
gives
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
'2017-01-09', '2017-01-10',
...
'2020-12-22', '2020-12-23', '2020-12-24', '2020-12-25',
'2020-12-26', '2020-12-27', '2020-12-28', '2020-12-29',
'2020-12-30', '2020-12-31'],
dtype='datetime64[ns]', length=1461, freq='D')
and
1461 * 24 = 35064
so I am not sure where 35,068 comes from. Are you sure about that number?
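If the frame really has one row per hour, a variant that sidesteps the row-count mismatch entirely is to derive the labels from the frame's own length. A sketch, assuming df1 holds the hourly data:
dates = pd.date_range(start='2017-01-01', periods=len(df1), freq='h')  # one timestamp per row
df1.insert(0, 'Date', dates.strftime('%d-%m-%Y'))  # day-month-year strings, 24 per day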

Check if my time series index data has any missing values for weekdays

I have time series data from 5 Jan 2015 to 28 Dec 2018. I observed that some of the working days' dates and their values are missing. How do I check how many weekdays are missing in my time range, and what those dates are, so that I can extrapolate values for them?
Example:
Date Price Volume
2018-12-28 172.0 800
2018-12-27 173.6 400
2018-12-26 170.4 500
2018-12-25 171.0 2200
2018-12-21 172.8 800
Looking at the calendar, 21 Dec 2018 was a Friday. Excluding Saturday and Sunday, the dataset should then contain 24 Dec 2018, but it's missing. I need to identify such missing dates in the range.
My approach till now:
I tried using
pd.date_range('2015-01-05','2018-12-28',freq='W')
to identify the number of weeks and then manually calculate the number of weekdays from them, to get the count of missing dates.
But that didn't solve the problem, as I need to identify the missing dates themselves, not just count them.
Let's say this is your full dataset:
Date Price Volume
2018-12-28 172.0 800
2018-12-27 173.6 400
2018-12-26 170.4 500
2018-12-25 171.0 2200
2018-12-21 172.8 800
And dates were:
dates = pd.date_range('2018-12-15', '2018-12-31')
First, make sure the Date column is actually a date type:
df['Date'] = pd.to_datetime(df['Date'])
Then set Date as the index:
df = df.set_index('Date')
Then reindex with unutbu's solution:
df = df.reindex(dates, fill_value=0.0)
Then reset the index to make it easier to work with:
df = df.reset_index()
It now looks like this:
index Price Volume
0 2018-12-15 0.0 0.0
1 2018-12-16 0.0 0.0
2 2018-12-17 0.0 0.0
3 2018-12-18 0.0 0.0
4 2018-12-19 0.0 0.0
5 2018-12-20 0.0 0.0
6 2018-12-21 172.8 800.0
7 2018-12-22 0.0 0.0
8 2018-12-23 0.0 0.0
9 2018-12-24 0.0 0.0
10 2018-12-25 171.0 2200.0
11 2018-12-26 170.4 500.0
12 2018-12-27 173.6 400.0
13 2018-12-28 172.0 800.0
14 2018-12-29 0.0 0.0
15 2018-12-30 0.0 0.0
16 2018-12-31 0.0 0.0
Do:
df['weekday'] = df['index'].dt.dayofweek
Finally, how many weekdays are missing in your time range:
missing_weekdays = df[(~df['weekday'].isin([5,6])) & (df['Volume'] == 0.0)]
Result:
>>> missing_weekdays
index Price Volume weekday
2 2018-12-17 0.0 0.0 0
3 2018-12-18 0.0 0.0 1
4 2018-12-19 0.0 0.0 2
5 2018-12-20 0.0 0.0 3
9 2018-12-24 0.0 0.0 0
16 2018-12-31 0.0 0.0 0
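If you only need the missing dates themselves, a more direct sketch is to build the full business-day calendar and take the set difference (assuming df['Date'] is already datetime):
all_weekdays = pd.bdate_range('2015-01-05', '2018-12-28')  # Mon-Fri only
missing = all_weekdays.difference(pd.DatetimeIndex(df['Date']))
print(len(missing))  # how many weekdays are missing
print(missing)       # which dates they are
Note that bdate_range knows nothing about public holidays, so holiday dates will also show up as missing.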

Creating a multi-index (3 axis) to take the mean of 1 axis

I have a list of DataFrames, each a time series with a datetime index. I have another list, longnames, that I want to associate with each of those DataFrames. I would like to group these DataFrame/longname pairs under a list of main labels (each longname has the form MainName,SubName). I then want to take the mean over the datetime index, across the longnames, grouped by MainName. I'm sorry if this sounds confusing.
What I have in mind is confusing and complicated, so I was wondering if anyone has a better approach that I should be taking.
What I have done so far is concatenate the list of DataFrames into one column using pd.concat(), but I can't seem to label them with the "keys" argument for the longnames; it gives me an error,
ValueError: Shape of passed values is (823748, 2), indices imply (3343070, 2).
and loses my second indexer. If that had worked, I was hoping to simply group them using the short names, e.g.:
ShortNames = ['MainName1','MainName2']
idx = allvars.index.str.extract('('+ '|'.join(ShortNames) + ')', expand=False)
Allmean = allvars.groupby(idx).mean(axis = (1,2,3))
I have multiple DataFrames that look like this one:
Amount(mm)
Date
1900-01-01 0.0
1900-01-02 0.0
1900-01-03 5.1
1900-01-04 0.0
1900-01-05 0.0
1900-01-06 0.0
1900-01-07 0.0
The list of longnames I have looks like this:
longnames = ['MainName1,SubName1', 'MainName1,SubName2', 'MainName2,SubName1', 'MainName2,SubName2']
Overall, I want to take the mean over the datetime index, grouped by MainName. The result should have only two index levels, the MainName and the datetime index, similar to:
Amount(mm)
Date
MainName1 1900-01-01 0.0
1900-01-02 0.0
1900-01-03 5.1
1900-01-04 0.0
1900-01-05 0.0
1900-01-06 0.0
1900-01-07 0.0
MainName2 1900-01-04 8.0
1900-01-05 9.0
1900-01-06 1.0
1900-01-07 2.0
Sample DataFrames:
print (df1)
print (df2)
print (df3)
Amount(mm)
Date
1900-01-01 0.0
1900-01-02 0.0
1900-01-03 5.1
1900-01-04 0.0
1900-01-05 0.0
1900-01-06 0.0
1900-01-07 0.0
Amount(mm)
Date
1900-01-01 4.0
1900-01-02 5.0
1900-01-03 5.1
1900-01-04 6.0
Amount(mm)
Date
1900-01-04 8.0
1900-01-05 9.0
1900-01-06 1.0
1900-01-07 2.0
First, the list longsnames must have the same length as the number of DataFrames (here 3):
dfs = [df1,df2,df3]
longsnames = ['MainName1,SubName1', 'MainName1,SubName2', 'MainName2,SubName1']
allvars = pd.concat(dfs, keys = longsnames)
print (allvars)
Amount(mm)
Date
MainName1,SubName1 1900-01-01 0.0
1900-01-02 0.0
1900-01-03 5.1
1900-01-04 0.0
1900-01-05 0.0
1900-01-06 0.0
1900-01-07 0.0
MainName1,SubName2 1900-01-01 4.0
1900-01-02 5.0
1900-01-03 5.1
1900-01-04 6.0
MainName2,SubName1 1900-01-04 8.0
1900-01-05 9.0
1900-01-06 1.0
1900-01-07 2.0
Then select the first level of the MultiIndex with Index.get_level_values and extract the short names from it:
ShortNames = ['MainName1','MainName2']
idx = allvars.index.get_level_values(0).str.extract('('+ '|'.join(ShortNames) + ')', expand=False)
print (idx)
Index(['MainName1', 'MainName1', 'MainName1', 'MainName1', 'MainName1',
'MainName1', 'MainName1', 'MainName1', 'MainName1', 'MainName1',
'MainName1', 'MainName2', 'MainName2', 'MainName2', 'MainName2'],
dtype='object')
And last, aggregate the mean:
Allmean = allvars.groupby([idx, 'Date']).mean()
# alternative for older pandas versions:
# Allmean = allvars.groupby([idx, allvars.index.get_level_values(1)]).mean()
print (Allmean)
Amount(mm)
Date
MainName1 1900-01-01 2.00
1900-01-02 2.50
1900-01-03 5.10
1900-01-04 3.00
1900-01-05 0.00
1900-01-06 0.00
1900-01-07 0.00
MainName2 1900-01-04 8.00
1900-01-05 9.00
1900-01-06 1.00
1900-01-07 2.00
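An alternative sketch: split each long name into a (MainName, SubName) tuple, so pd.concat builds the three-level MultiIndex directly and no string extraction is needed (this assumes every long name splits cleanly on the comma):
keys = [tuple(name.split(',')) for name in longsnames]
allvars = pd.concat(dfs, keys=keys, names=['Main', 'Sub'])  # index levels: (Main, Sub, Date)
Allmean = allvars.groupby(level=['Main', 'Date']).mean()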

Pandas DataFrame --> GroupBy --> MultiIndex Process

I'm trying to restructure a large DataFrame of the following form as a MultiIndex:
date store_nbr item_nbr units snowfall preciptotal event
0 2012-01-01 1 1 0 0.0 0.0 0.0
1 2012-01-01 1 2 0 0.0 0.0 0.0
2 2012-01-01 1 3 0 0.0 0.0 0.0
3 2012-01-01 1 4 0 0.0 0.0 0.0
4 2012-01-01 1 5 0 0.0 0.0 0.0
I want to group by store_nbr (1-45), within each store_nbr group by item_nbr (1-111), and then, for each index pair (e.g., store_nbr=12, item_nbr=109), display the rows in chronological order, so that the ordered rows look like, for example:
store_nbr=12, item_nbr=109: date=2014-02-06, units=0, snowfall=...
date=2014-02-07, units=0, snowfall=...
date=2014-02-08, units=0, snowfall=...
... ...
store_nbr=12, item_nbr=110: date=2014-02-06, units=0, snowfall=...
date=2014-02-07, units=1, snowfall=...
date=2014-02-08, units=1, snowfall=...
...
It looks like some combination of groupby and set_index might be useful here, but I'm getting stuck after the following line:
grouped = stores.set_index(['store_nbr', 'item_nbr'])
This produces the following MultiIndex:
date units snowfall preciptotal event
store_nbr item_nbr
1 1 2012-01-01 0 0.0 0.0 0.0
2 2012-01-01 0 0.0 0.0 0.0
3 2012-01-01 0 0.0 0.0 0.0
4 2012-01-01 0 0.0 0.0 0.0
5 2012-01-01 0 0.0 0.0 0.0
Does anyone have any suggestions from here? Is there an easy way to do this by manipulating groupby objects?
You can sort your rows chronologically with:
df.sort_values(by='date')
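To get the grouped, chronologically ordered view the question asks for in one step, a sketch (using the stores frame from the question) is to include date in the index and sort the whole MultiIndex:
grouped = (stores.set_index(['store_nbr', 'item_nbr', 'date'])
                 .sort_index())  # sorts by store, then item, then date
print(grouped.loc[(12, 109)])    # all rows for store 12, item 109, in date order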
