I have time series data that is mostly quarterly but reported in year-month-day format, for multiple variables and multiple countries. Some variables for some dates are posted on the last day of the quarter, while others post on a date close to the last day. I would like to perform a resample that aggregates each row to an end-of-quarter frequency. I have this:
Date Country Var1 Var2 Var3
2012-03-30 China 12 NaN 200
2012-03-31 China NaN 50 NaN
2012-06-28 China 13 NaN 199
2012-06-30 China NaN 48 NaN
2012-09-30 China 13 49 200
2012-12-31 China 12 50 201
What I want to see is
Date Country Var1 Var2 Var3
2012-03-31 China 12 50 200
2012-06-30 China 13 48 199
2012-09-30 China 13 49 200
2012-12-31 China 12 50 201
I tried a couple of different resample ideas. First I tried
df=df.groupby("Country").resample('Q').applymap(lambda x: df.shift(1) if math.isnan(x) else x)
Then I tried converting all the NaNs to zeros and then aggregating by sum, but this is not ideal since I can no longer tell which values are actually zero and which were missing.
df=df.fillna(0)
df=df.groupby("Country").resample('Q').sum()
Here's a small example with my own dataframe doing what you want.
import numpy as np
import pandas as pd

# creating the dataframe
df = pd.DataFrame(np.random.randn(8, 3), columns=['Var1', 'Var2', 'Var3'])
# adding NaN values (df.loc avoids chained assignment, which may not write through)
df.loc[1, 'Var1'] = np.nan
df.loc[5, 'Var1'] = np.nan
df.loc[4, 'Var2'] = np.nan
df.loc[6, 'Var2'] = np.nan
df
'''
Var1 Var2 Var3
0 -0.437551 -2.707623 0.726240
1 NaN 2.529733 0.484732
2 0.199278 -0.316516 -0.655426
3 0.732910 -0.638045 -0.706436
4 0.877915 NaN -1.141384
5 NaN -2.050228 2.091994
6 -1.119849 NaN 1.222602
7 0.406632 -2.255687 0.742452
'''
# backfilling values in Var2
# backfilling values in Var2
df['Var2'] = df['Var2'].fillna(method='backfill')
# dropping rows that still have NaN (based on column Var1)
df = df.dropna()
df
'''
Var1 Var2 Var3
0 -0.437551 -2.707623 0.726240
2 0.199278 -0.316516 -0.655426
3 0.732910 -0.638045 -0.706436
4 0.877915 -2.050228 -1.141384
6 -1.119849 -2.255687 1.222602
7 0.406632 -2.255687 0.742452
'''
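Tying this back to the quarterly data in the question: because last() keeps the last non-missing value per column, one possible sketch (assuming Date is parsed to datetimes; spell the rule 'QE' instead of 'Q' on pandas 2.2+) is:

import pandas as pd

# rebuild the sample data from the question
data = pd.DataFrame({
    'Date': pd.to_datetime(['2012-03-30', '2012-03-31', '2012-06-28',
                            '2012-06-30', '2012-09-30', '2012-12-31']),
    'Country': ['China'] * 6,
    'Var1': [12, None, 13, None, 13, 12],
    'Var2': [None, 50, None, 48, 49, 50],
    'Var3': [200, None, 199, None, 200, 201],
})

# group by country, resample to quarter end, and keep the last
# non-missing value of each variable inside the quarter
quarterly = (data.set_index('Date')
                 .groupby('Country')
                 .resample('Q')
                 .last())
print(quarterly)

This keeps genuine zeros distinct from missing values, since nothing is filled with 0 along the way.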
I have a pandas dataframe containing NaN values for some column.
I'm trying to fill them with a default value (30), but it doesn't work.
Original dataframe:
type avg_speed
0 CAR 32.0
1 CAR NaN
2 CAR NaN
3 BIKE 16.2
4 CAR 28.5
5 SCOOTER 29.7
6 CAR 30.7
7 CAR NaN
8 BIKE NaN
9 BIKE 35.1
...
Desired result:
type avg_speed
0 CAR 32.0
1 CAR 30
2 CAR 30
3 BIKE 16.2
4 CAR 28.5
5 SCOOTER 29.7
6 CAR 30.7
7 CAR 30
8 BIKE 30
9 BIKE 35.1
My code:
def fill_with_default(pandas_df, column_name, default_value):
    print(f"Total count: {pandas_df.count()}")
    print(f"Count of Nan BEFORE: {pandas_df[column_name].isna().sum()}")
    pandas_df[column_name].fillna(default_value, inplace=True)
    print(f"Count of Nan AFTER: {pandas_df[column_name].isna().sum()}")
    return pandas_df
df = fill_with_default(df, "avg_speed", 30)
Output:
Total count: 105018
Count of Nan BEFORE: 49514
Count of Nan AFTER: 49514
The chain of dataframe transformations and the list of columns are too long, so it's difficult to show all the steps (join with another dataframe, drop useless columns, add useful columns, join with other dataframes, filter, etc.).
I've tried other options but they also don't work:
#pandas_df.fillna({column_name: default_value}, inplace=True)
#pandas_df.loc[pandas_df[column_name].isnull(),column_name] = default_value
...
The type of the column before applying "fillna" is float64, the same as default_value.
Therefore, my question is: what could be the potential reasons for this problem?
What kind of transformation can lead to it? This method works for another, similar dataframe; the only difference between them lies in the chain of transformations.
BTW, a warning is logged at this point:
/home/hadoop/.local/lib/python3.6/site-packages/pandas/core/generic.py:6287: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._update_inplace(new_data)
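For what it's worth, that SettingWithCopyWarning usually means pandas_df is itself a slice or copy of an earlier dataframe, in which case an inplace fillna can silently fail to propagate. A minimal sketch of a pattern that sidesteps this (the helper simply mirrors the one above, rewritten to copy and assign instead of mutating in place):

import numpy as np
import pandas as pd

def fill_with_default(pandas_df, column_name, default_value):
    # work on an explicit copy so the write cannot land on a view of an
    # earlier slice, and assign the result instead of using inplace=True
    out = pandas_df.copy()
    out[column_name] = out[column_name].fillna(default_value)
    return out

df = pd.DataFrame({'type': ['CAR', 'CAR', 'BIKE'],
                   'avg_speed': [32.0, np.nan, np.nan]})
df = fill_with_default(df, 'avg_speed', 30)
print(df['avg_speed'].isna().sum())  # 0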
In my code I've generated a range of dates using pd.date_range in an effort to compare it to a column of dates read in from Excel using pandas. The generated range of dates is referred to as "all_dates".
all_dates=pd.date_range(start='1998-12-31', end='2020-06-23')
for i, date in enumerate(period):        # where 'Period' is the column of excel dates
    if date == all_dates[i]:             # loop until date from excel doesn't match date from generated dates
        continue
    else:
        missing_dates_stock.append(i)    # keep list of locations where dates are missing
        stock_data.insert(i, "NaN")      # insert 'NaN' where missing date is found
This results in TypeError: argument of type 'Timestamp' is not iterable. How can I make the data types match such that I can iterate and compare them? Apologies as I am not very fluent in Python.
I think you are trying to create a NaN row if the date does not exist in the excel file.
Here's a way to do it. You can use the df.merge option.
I am creating df1 to simulate the excel file. It has two columns, sale_dt and sale_amt. If a sale_dt does not exist, then we want to create a separate row with NaN in the columns. To simulate this, I am creating a date range from 1998-12-31 through 2020-06-23 that skips 4 days between rows. So we have a dataframe with 4 missing dates between every two rows, and the solution should create 4 dummy rows with the correct dates in ascending order.
import pandas as pd
import random
#create the sales dataframe with missing dates
df1 = pd.DataFrame({'sale_dt': pd.date_range(start='1998-12-31', end='2020-06-23', freq='5D'),
                    'sale_amt': random.sample(range(1, 2000), 1570)})
print (df1)
#now create a dataframe with all the dates between '1998-12-31' and '2020-06-23'
df2 = pd.DataFrame({'date':pd.date_range(start='1998-12-31', end='2020-06-23', freq='D')})
print (df2)
#now merge both dataframes with an outer join so you get all the rows.
#I am also sorting the data in ascending order so you can see the dates,
#dropping the original sale_dt column, renaming the date column to sale_dt,
#and then resetting the index.
df1 = (df1.merge(df2,left_on='sale_dt',right_on='date',how='outer')
.drop(columns=['sale_dt'])
.rename(columns={'date':'sale_dt'})
.sort_values(by='sale_dt')
.reset_index(drop=True))
print (df1.head(20))
The original dataframe was:
sale_dt sale_amt
0 1998-12-31 1988
1 1999-01-05 1746
2 1999-01-10 1395
3 1999-01-15 538
4 1999-01-20 1186
... ... ...
1565 2020-06-03 560
1566 2020-06-08 615
1567 2020-06-13 858
1568 2020-06-18 298
1569 2020-06-23 1427
The output of this will be (first 20 rows):
sale_amt sale_dt
0 1988.0 1998-12-31
1 NaN 1999-01-01
2 NaN 1999-01-02
3 NaN 1999-01-03
4 NaN 1999-01-04
5 1746.0 1999-01-05
6 NaN 1999-01-06
7 NaN 1999-01-07
8 NaN 1999-01-08
9 NaN 1999-01-09
10 1395.0 1999-01-10
11 NaN 1999-01-11
12 NaN 1999-01-12
13 NaN 1999-01-13
14 NaN 1999-01-14
15 538.0 1999-01-15
16 NaN 1999-01-16
17 NaN 1999-01-17
18 NaN 1999-01-18
19 NaN 1999-01-19
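An alternative sketch, assuming the sale dates are unique: skip the merge and reindex the original df1 (before the merge above) against the full daily range, so missing dates become NaN rows automatically.

full_range = pd.date_range(start='1998-12-31', end='2020-06-23', freq='D')
df1_full = (df1.set_index('sale_dt')
               .reindex(full_range)        # a row appears for every date in the range
               .rename_axis('sale_dt')
               .reset_index())
print(df1_full.head(20))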
I have data that contains prices, volumes and other data about various financial securities. My input data looks like the following:
import numpy as np
import pandas
prices = np.random.rand(15) * 100
volumes = np.random.randint(15, size=15) * 10
idx = pandas.Series([2007, 2007, 2007, 2007, 2007, 2008,
                     2008, 2008, 2008, 2008, 2009, 2009,
                     2009, 2009, 2009], name='year')
df = pandas.DataFrame({'price': prices, 'volume': volumes})
df.index = idx
# BELOW IS AN EXAMPLE OF WHAT THE INPUT MIGHT LOOK LIKE
# IT WON'T BE EXACT BECAUSE OF THE USE OF RANDOM
# price volume
# year
# 2007 0.121002 30
# 2007 15.256424 70
# 2007 44.479590 50
# 2007 29.096013 0
# 2007 21.424690 0
# 2008 23.019548 40
# 2008 90.011295 0
# 2008 88.487664 30
# 2008 51.609119 70
# 2008 4.265726 80
# 2009 34.402065 140
# 2009 10.259064 100
# 2009 47.024574 110
# 2009 57.614977 140
# 2009 54.718016 50
I want to produce a data frame that looks like:
year 2007 2008 2009
0 0.121002 23.019548 34.402065
1 15.256424 90.011295 10.259064
2 44.479590 88.487664 47.024574
3 29.096013 51.609119 57.614977
4 21.424690 4.265726 54.718016
I know of one way to produce the output above using groupby:
df = df.reset_index()
grouper = df.groupby('year')
df2 = None
for group, data in grouper:
    series = data['price'].copy()
    series.index = range(len(series))
    series.name = group
    df2 = pandas.DataFrame(series) if df2 is None else pandas.concat([df2, series], axis=1)
And I also know that you can do pivot to get a DataFrame that has NaNs for the missing indices on the pivot:
# df = df.reset_index()
df.pivot(columns='year', values='price')
# Output
# year 2007 2008 2009
# 0 0.121002 NaN NaN
# 1 15.256424 NaN NaN
# 2 44.479590 NaN NaN
# 3 29.096013 NaN NaN
# 4 21.424690 NaN NaN
# 5 NaN 23.019548 NaN
# 6 NaN 90.011295 NaN
# 7 NaN 88.487664 NaN
# 8 NaN 51.609119 NaN
# 9 NaN 4.265726 NaN
# 10 NaN NaN 34.402065
# 11 NaN NaN 10.259064
# 12 NaN NaN 47.024574
# 13 NaN NaN 57.614977
# 14 NaN NaN 54.718016
My question is the following:
Is there a way that I can create my output DataFrame in the groupby without creating the series, or is there a way I can re-index my input DataFrame so that I get the desired output using pivot?
You need to label each year 0-4. To do this, use the cumcount after grouping. Then you can pivot correctly using that new column as the index.
df['year_count'] = df.groupby(level='year').cumcount()
df.reset_index().pivot(index='year_count', columns='year', values='price')
year 2007 2008 2009
year_count
0 61.682275 32.729113 54.859700
1 44.231296 4.453897 45.325802
2 65.850231 82.023960 28.325119
3 29.098607 86.046499 71.329594
4 67.864723 43.499762 19.255214
You can use groupby with apply to build a new Series from the underlying numpy values, and then reshape with unstack:
print (df.groupby(level='year')['price'].apply(lambda x: pd.Series(x.values)).unstack(0))
year 2007 2008 2009
0 55.360804 68.671626 78.809139
1 50.246485 55.639250 84.483814
2 17.646684 14.386347 87.185550
3 54.824732 91.846018 60.793002
4 24.303751 50.908714 22.084445
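If every year is guaranteed to contribute the same number of rows (five here) and the frame is sorted by year, a further sketch (assuming year is still the index, as in the answers above, and pandas imported as pd) is to reshape the underlying numpy array directly:

# one column per year; relies on equal group sizes and rows sorted by year
years = df.index.unique()
wide = pd.DataFrame(df['price'].to_numpy().reshape(len(years), -1).T,
                    columns=years)
print(wide)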
I have a pivot table and I want to plot the values for the 12 months of each year for each town.
                               2010-01     2010-02     2010-03
City       RegionName
Atlanta    Downtown                NaN         NaN         NaN
           Midtown         194.263702  196.319964  197.946962
Alexandria Alexandria West         NaN         NaN         NaN
           Landmark-Van Dom        NaN         NaN         NaN
How can I select only the values for each region of each town? I thought maybe it would be better to change the column names with years and months to datetime format and set them as index. How can I do this?
The result must be:
         City        RegionName
2010-01  Atlanta     Downtown                 NaN
                     Midtown           194.263702
         Alexandria  Alexandria West          NaN
                     Landmark-Van Dom         NaN
Here's some similar dummy data to play with:
idx = pd.MultiIndex.from_arrays([['A','A', 'B','C','C'],
['A1','A2','B1','C1','C2']], names=['City','Region'])
idcol = pd.date_range('2012-01', freq='M', periods=12)
df = pd.DataFrame(np.random.rand(5,12), index=idx, columns=[t.strftime('%Y-%m') for t in idcol])
Let's see what we've got:
print(df.iloc[:, :3])
2012-01 2012-02 2012-03
City Region
A A1 0.513709 0.941354 0.133290
A2 0.734199 0.005218 0.068914
B B1 0.043178 0.124049 0.603469
C C1 0.721248 0.483388 0.044008
C2 0.784137 0.864326 0.450250
Let's convert these to a datetime: df.columns = pd.to_datetime(df.columns)
Now to plot you just need to transpose:
df.T.plot()
Update after you updated your question:
Use stack, and then reorder if you want:
df = df.stack().reorder_levels([2,0,1])
df.head()
City Region
2012-01-01 A A1 0.513709
2012-02-01 A A1 0.941354
2012-03-01 A A1 0.133290
2012-04-01 A A1 0.324518
2012-05-01 A A1 0.554125
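If the goal is then to pull out one region of one town from that stacked Series, a small sketch using xs with the dummy labels from above (substitute the real City/RegionName pair for actual data):

# select a single (City, Region) pair; what remains is a date-indexed
# Series that can be plotted on its own
one_region = df.xs(('A', 'A1'), level=['City', 'Region'])
one_region.plot()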
I found this description of how to resample a multi-index:
Resampling Within a Pandas MultiIndex
However, as soon as I use count instead of sum, the solution no longer works.
This might be related to: Resampling with 'how=count' causing problems
Not working: count (with a string column present):
import datetime
import pandas as pd

values_a = [1]*16
states = ['Georgia']*8 + ['Alabama']*8
#cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
dates = pd.DatetimeIndex([datetime.datetime(2012, 1, 1) + datetime.timedelta(days=i) for i in range(4)]*4)
df2 = pd.DataFrame({'value_a': values_a},
                   index=[states, dates])
df2.index.names = ['State', 'Date']
df2.reset_index(level=[0], inplace=True)
print(df2.groupby(['State']).resample('W', how='count'))
Yields:
2012-01-01 2012-01-08
State value_a State value_a
State
Alabama 2 2 6 6
Georgia 2 2 6 6
The working version with sum and numbers as values
values_a = [1]*16
states = ['Georgia']*8 + ['Alabama']*8
#cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
dates = pd.DatetimeIndex([datetime.datetime(2012, 1, 1) + datetime.timedelta(days=i) for i in range(4)]*4)
df2 = pd.DataFrame({'value_a': values_a},
                   index=[states, dates])
df2.index.names = ['State', 'Date']
df2.reset_index(level=[0], inplace=True)
print(df2.groupby(['State']).resample('W', how='sum'))
Yields (notice no duplication of 'State'):
value_a
State Date
Alabama 2012-01-01 2
2012-01-08 6
Georgia 2012-01-01 2
2012-01-08 6
When using count, State isn't a nuisance column (count works on strings), so the resample is going to apply count to it as well (although the output is not what I would expect). You could do something like the following (telling it to apply count only to value_a):
>>> print df2.groupby(['State']).resample('W',how={'value_a':'count'})
value_a
State Date
Alabama 2012-01-01 2
2012-01-08 6
Georgia 2012-01-01 2
2012-01-08 6
Or more generally, you can apply different kinds of how to different columns:
>>> print df2.groupby(['State']).resample('W',how={'value_a':'count','State':'last'})
State value_a
State Date
Alabama 2012-01-01 Alabama 2
2012-01-08 Alabama 6
Georgia 2012-01-01 Georgia 2
2012-01-08 Georgia 6
So while the above allows you to count a resampled multi-index dataframe, it doesn't explain the behavior of the output from how='count'. The following is closer to the way I would expect it to behave:
print df2.groupby(['State']).resample('W',how={'value_a':'count','State':'count'})
State value_a
State Date
Alabama 2012-01-01 2 2
2012-01-08 6 6
Georgia 2012-01-01 2 2
2012-01-08 6 6
@Karl D's solution is correct; this will be possible in 0.14/master (releasing shortly), see the docs here
In [118]: df2.groupby([pd.Grouper(level='Date',freq='W'),'State']).count()
Out[118]:
value_a
Date State
2012-01-01 Alabama 2
Georgia 2
2012-01-08 Alabama 6
Georgia 6
Prior to 0.14 it was difficult to groupby/resample with a time-based grouper and another grouper. pd.Grouper allows a very flexible specification to do this.
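As a side note, the how= keyword used above was removed from resample in later pandas releases; a rough modern equivalent, assuming df2 as built in the question (Date as the index, State as a column), would be:

# modern API: select the column first, then resample and count
counts = df2.groupby('State')['value_a'].resample('W').count()
print(counts)

# or, with pd.Grouper, essentially the 0.14 answer above
counts2 = df2.groupby([pd.Grouper(level='Date', freq='W'), 'State'])['value_a'].count()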