Using .ix loses headers - python

How come when I use:
dfa=df1["10_day_mean"].ix["2015":"2015"]
The dataframe dfa has no header?
dfa:
Date
2015-01-10 2.000000
2015-01-20 3.000000
df1:
10day_mean Attenuation Channel1 Channel2 Channel3 Channel4 \
Date
2004-02-27 3.025 2.8640 NaN NaN NaN NaN
Is there a way to change the header of dfa? When I plot it, my legend shows 10_day_mean, and I wish to relabel it as "Daily mean of every 10 days".
Thanks guys
I tried
dfa=dfa.rename(columns={0:"rename"})
and
dfa=dfa.rename(columns={"10day_mean":"rename"})
But then it just says None.

Your confusion here is that when you do this:
dfa=df1["10_day_mean"].ix["2015":"2015"]
this returns a Series, which has no column header to display, so the name isn't shown above the values; it appears at the bottom of the output in the summary info as Name:.
To get the output you want, you can use double subscripting to force a single-column DataFrame to be returned:
dfa=df1[["10_day_mean"]].ix["2015":"2015"]
Example:
In [90]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df
Out[90]:
a b c
0 -1.002036 -1.703049 2.123096
1 0.497920 1.556211 -1.807895
2 0.400020 -0.703138 1.452735
3 -0.296604 -0.227155 -0.311047
4 -0.314948 -0.654925 -0.434458
In [91]:
df['a'].iloc[2:4]
Out[91]:
2 0.400020
3 -0.296604
Name: a, dtype: float64
In [92]:
df[['a']].iloc[2:4]
Out[92]:
a
2 0.400020
3 -0.296604
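As for the legend label, here is a small sketch, assuming a recent pandas where .loc replaces the now-removed .ix and assuming the column really is called "10_day_mean" as in the question. You can either keep the single-column DataFrame and rename the column, or rename the Series itself, since the Series name is what shows up in the plot legend:
# Option 1: single-column DataFrame, rename the column.
dfa = df1[["10_day_mean"]].loc["2015":"2015"]
dfa = dfa.rename(columns={"10_day_mean": "Daily mean of every 10 days"})
# Option 2: rename the Series directly before plotting.
dfa = df1["10_day_mean"].loc["2015":"2015"].rename("Daily mean of every 10 days")
dfa.plot(legend=True)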

Related

Grouper() and agg() functions produce multiple copies when squashed

I have a sample dataframe as given below.
import pandas as pd
import numpy as np

NaN = np.nan
data = {'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
        'Date': ['2021-09-20 04:34:57', '2021-09-20 04:37:25', '2021-09-20 04:38:26',
                 '2021-09-01 00:12:29', '2021-09-01 11:20:58', '2021-09-02 09:20:58'],
        'Name': ['xx', 'xx', NaN, 'yy', NaN, NaN],
        'Height': [174, 174, NaN, 160, NaN, NaN],
        'Weight': [74, NaN, NaN, 58, NaN, NaN],
        'Gender': [NaN, 'Male', NaN, NaN, 'Female', NaN],
        'Interests': [NaN, NaN, 'Hiking,Sports', NaN, NaN, 'Singing']}
df1 = pd.DataFrame(data)
df1
I want to combine the data present on the same date into a single row. The 'Date' column is in timestamp format. I have written code for it; here is my TRY code:
TRY:
df1['Date'] = pd.to_datetime(df1['Date'])
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .agg(lambda x: ''.join(x.dropna().astype(str)))
             .reset_index()
          ).replace('', np.nan)
This gives an output where, if the same value appears in multiple entries for a date, the result repeats it within a single cell, e.g. 'xxxx' and '174.0 174.0' in the first row.
However, I do not want the values to be repeated if there are multiple entries: the first row should have 'xx' and 174.0 instead of 'xxxx' and '174.0 174.0'.
Any help is greatly appreciated. Thank you.
In your case, replace the join aggregation with first:
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .first()
             .reset_index()
          ).replace('', np.nan)
df_out
Out[113]:
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female None
2 B 2021-09-02 None NaN NaN None Singing
Since you're only trying to keep the first available value for each column for each date, you can do:
>>> df1.groupby(["ID", pd.Grouper(key='Date', freq='D')]).agg("first").reset_index()
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female None
2 B 2021-09-02 None NaN NaN None Singing
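If a column could genuinely contain more than one distinct value per day and you still want a join-style aggregation rather than first, one hedged variant (not from the answers above; the space separator is an assumption) is to deduplicate before joining:
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .agg(lambda x: ' '.join(x.dropna().astype(str).unique()))
             .reset_index()
          ).replace('', np.nan)
This keeps 'xx' and '174.0' only once per day, while still concatenating genuinely different values.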

Pandas Resample-Sum without Zero filling

When resampling a Series with mean aggregation (daily to monthly), missing datetimes are filled with NaNs, which is fine since we can simply remove them using .dropna().
However, with sum/total aggregation, missing datetimes are filled with 0s (zeros), which is technically correct but a bit bothersome, since masks are then needed to remove them.
The question is whether there is a more efficient way to resample with an aggregate sum without zero-filling or using masks, preferably something similar to dropna() but for dropping 0s.
For example:
ser = pd.Series([1]*6)
ser.index = pd.to_datetime(['2000-01-01', '2000-01-02', '2000-03-01', '2000-03-02', '2000-05-01', '2000-05-02'])
# wanted output
# 2000-01-31 2.0
# 2000-03-31 2.0
# 2000-05-31 2.0
# ideal output but for aggregate sum.
ser.resample('M').mean().dropna()
# 2000-01-31 1.0
# 2000-03-31 1.0
# 2000-05-31 1.0
# not ideal
ser.resample('M').sum()
# 2000-01-31 2
# 2000-02-29 0
# 2000-03-31 2
# 2000-04-30 0
# 2000-05-31 2
Using .groupby() with pd.Grouper() seems to have exactly the same behavior as resampling.
# not ideal
ser.groupby(pd.Grouper(freq='M')).sum()
# 2000-01-31 2
# 2000-02-29 0
# 2000-03-31 2
# 2000-04-30 0
# 2000-05-31 2
Using .groupby() with index.year is also doable; however, there does not seem to be an equivalent 'identity' for a calendar month (.index.month is not what we are after, since it would lump the same month from different years together).
ser = pd.Series([1]*6)
ser.index = pd.to_datetime(['2000-01-01', '2000-01-02', '2002-03-01', '2002-03-02', '2005-05-01', '2005-05-02'])
ser.groupby(ser.index.year).sum()
# 2000 2
# 2002 2
# 2005 2
Use pd.offsets.MonthEnd and add it to the DatetimeIndex of ser to create a month-end grouper, then use Series.groupby with this grouper and aggregate with sum or mean:
grp = ser.groupby(ser.index + pd.offsets.MonthEnd())
s1, s2 = grp.sum(), grp.mean()
Result:
print(s1)
2000-01-31 2
2002-03-31 2
2005-05-31 2
dtype: int64
print(s2)
2000-01-31 1
2002-03-31 1
2005-05-31 1
dtype: int64
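If you would rather stay with resample, here is one alternative sketch, assuming pandas 0.22 or newer where sum accepts min_count: with min_count=1 an empty bin yields NaN instead of 0, so dropna() can remove it just as in the mean case (shown here on the first example ser):
# empty months become NaN rather than 0, then get dropped
ser.resample('M').sum(min_count=1).dropna()
# 2000-01-31    2.0
# 2000-03-31    2.0
# 2000-05-31    2.0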

Forward filling missing dates into Python Pandas Dataframe

I have a Pandas dataframe that is filled as follows:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
8/31/2010 1
9/30/2010 4
12/31/2010 2
Note how there are missing months (i.e. 7, 10, 11) in the data. I want to fill in the missing data through a forward filling method so that it looks like this:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
7/30/2010 3
8/31/2010 1
9/30/2010 4
10/29/2010 4
11/30/2010 4
12/31/2010 2
A missing date takes the tag of the previous date. All dates represent the last business day of the month.
This is what I tried to do:
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df.ref_date.index = pd.to_datetime(df.ref_date.index)
df = df.reindex(index=[idx], columns=[ref_date], method='ffill')
It's giving me the error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
where pd is pandas and df is the dataframe.
I'm new to Pandas Dataframe, so any help would be appreciated!
You were very close. You just need to set the dataframe's index to ref_date, reindex it to the business-month-end index with method='ffill', then reset the index and rename the column back to the original:
# First ensure the dates are Pandas Timestamps.
df['ref_date'] = pd.to_datetime(df['ref_date'])
# Create a business-month-end index.
idx_monthly = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
# Set the index, reindex with forward fill, then restore ref_date as a column.
>>> (df
.set_index('ref_date')
.reindex(idx_monthly, method='ffill')
.reset_index()
.rename(columns={'index': 'ref_date'}))
ref_date tag
0 2010-01-29 1.0
1 2010-02-26 3.0
2 2010-03-31 4.0
3 2010-04-30 4.0
4 2010-05-31 1.0
5 2010-06-30 3.0
6 2010-07-30 3.0
7 2010-08-31 1.0
8 2010-09-30 4.0
9 2010-10-29 4.0
10 2010-11-30 4.0
11 2010-12-31 2.0
Thanks to a previous answerer (who has since deleted their answer), I got the solution:
df['ref_date'] = pd.to_datetime(df['ref_date'])
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df = df.set_index('ref_date').reindex(idx).ffill().reset_index().rename(columns={'index': 'ref_date'})
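An alternative sketch (not from the answers above): once ref_date is the index, DataFrame.asfreq can generate the business-month-end range between the first and last dates and forward-fill the gaps in one step.
df['ref_date'] = pd.to_datetime(df['ref_date'])
# asfreq builds the 'BM' range from the first to the last index date and forward-fills.
filled = (df.set_index('ref_date')
            .asfreq('BM', method='ffill')
            .reset_index())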

How to calculate rolling mean on a GroupBy object using Pandas?

How to calculate rolling mean on a GroupBy object using Pandas?
My Code:
df = pd.read_csv("example.csv", parse_dates=['ds'])
df = df.set_index('ds')
grouped_df = df.groupby('city')
What grouped_df looks like: a DataFrameGroupBy object keyed by city (the screenshot of its output is omitted here).
I want to calculate the rolling mean for each of the groups in my GroupBy object.
I tried pd.rolling_mean(grouped_df, 3).
Here is the error I get:
AttributeError: 'DataFrameGroupBy' object has no attribute 'dtype'
Edit: Should I maybe iterate over the groups and calculate the rolling mean on each group as I go?
You could try iterating over the groups
In [39]: df = pd.DataFrame({'a':list('aaaaabbbbbaaaccccbbbccc'),"bookings":range(1,24)})
In [40]: grouped = df.groupby('a')
In [41]: for group_name, group_df in grouped:
....: print group_name
....: print pd.rolling_mean(group_df['bookings'],3)
....:
a
0 NaN
1 NaN
2 2.000000
3 3.000000
4 4.000000
10 6.666667
11 9.333333
12 12.000000
dtype: float64
b
5 NaN
6 NaN
7 7.000000
8 8.000000
9 9.000000
17 12.333333
18 15.666667
19 19.000000
dtype: float64
c
13 NaN
14 NaN
15 15
16 16
20 18
21 20
22 22
dtype: float64
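In current pandas, pd.rolling_mean has been removed; a hedged sketch of the same per-group rolling mean without iterating, using the GroupBy object's own .rolling() (same toy frame as above):
df = pd.DataFrame({'a': list('aaaaabbbbbaaaccccbbbccc'), 'bookings': range(1, 24)})
# .rolling(3) is applied within each group of 'a'
df.groupby('a')['bookings'].rolling(3).mean()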
You want the dates on your left column and all city values as separate columns. One way to do this is set the index on date and city, and then unstack. This is equivalent to a pivot table. You can then perform your rolling mean in the usual fashion.
df = pd.read_csv("example.csv", parse_dates=['ds'])
df = df.set_index(['ds', 'city']).unstack('city')
rm = pd.rolling_mean(df, 3)
I wouldn't recommend wrapping this in a function, as the data for a given city can then simply be returned as follows (the : selects all rows):
df.loc[:, city]
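A sketch of that pivot approach with the current rolling API; the column names 'ds' and 'city', the value column 'bookings', and the city name 'some_city' are assumptions taken from the question and the toy example above:
df = pd.read_csv("example.csv", parse_dates=['ds'])
# one column per city, indexed by date
wide = df.set_index(['ds', 'city'])['bookings'].unstack('city')
rm = wide.rolling(3).mean()
# all rolling values for a single (placeholder) city
rm['some_city']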

Computing np.diff in Pandas after using groupby leads to unexpected result

I've got a dataframe, and I'm trying to append a column of sequential differences to it. I have found a method that I like a lot (and generalizes well for my use case). But I noticed one weird thing along the way. Can you help me make sense of it?
Here is some data that has the right structure (code modeled on an answer here):
import pandas as pd
import numpy as np
import random
from itertools import product
random.seed(1) # so you can play along at home
np.random.seed(2) # ditto
# make a list of dates for a few periods
dates = pd.date_range(start='2013-10-01', periods=4).to_native_types()
# make a list of tickers
tickers = ['ticker_%d' % i for i in range(3)]
# make a list of all the possible (date, ticker) tuples
pairs = list(product(dates, tickers))
# put them in a random order
random.shuffle(pairs)
# exclude a few possible pairs
pairs = pairs[:-3]
# make some data for all of our selected (date, ticker) tuples
values = np.random.rand(len(pairs))
mydates, mytickers = zip(*pairs)
data = pd.DataFrame({'date': mydates, 'ticker': mytickers, 'value':values})
Ok, great. This gives me a frame like so:
date ticker value
0 2013-10-03 ticker_2 0.435995
1 2013-10-04 ticker_2 0.025926
2 2013-10-02 ticker_1 0.549662
3 2013-10-01 ticker_0 0.435322
4 2013-10-02 ticker_2 0.420368
5 2013-10-03 ticker_0 0.330335
6 2013-10-04 ticker_1 0.204649
7 2013-10-02 ticker_0 0.619271
8 2013-10-01 ticker_2 0.299655
My goal is to add a new column to this dataframe that will contain the sequential changes. The data needs to be sorted to compute these, but the sorting and the differencing need to be done "ticker-wise" so that gaps in one ticker don't cause NAs for another ticker. I want to do this without perturbing the dataframe in any other way (i.e. I do not want the resulting DataFrame to be reordered based on the sort that was necessary to do the differencing). The following code works:
data1 = data.copy() #let's leave the original data alone for later experiments
data1.sort(['ticker', 'date'], inplace=True)
data1['diffs'] = data1.groupby(['ticker'])['value'].transform(lambda x: x.diff())
data1.sort_index(inplace=True)
data1
and returns:
date ticker value diffs
0 2013-10-03 ticker_2 0.435995 0.015627
1 2013-10-04 ticker_2 0.025926 -0.410069
2 2013-10-02 ticker_1 0.549662 NaN
3 2013-10-01 ticker_0 0.435322 NaN
4 2013-10-02 ticker_2 0.420368 0.120713
5 2013-10-03 ticker_0 0.330335 -0.288936
6 2013-10-04 ticker_1 0.204649 -0.345014
7 2013-10-02 ticker_0 0.619271 0.183949
8 2013-10-01 ticker_2 0.299655 NaN
So far, so good. If I replace the middle line above with the more concise code shown here, everything still works:
data2 = data.copy()
data2.sort(['ticker', 'date'], inplace=True)
data2['diffs'] = data2.groupby('ticker')['value'].diff()
data2.sort_index(inplace=True)
data2
A quick check shows that, in fact, data1 is equal to data2. However, if I do this:
data3 = data.copy()
data3.sort(['ticker', 'date'], inplace=True)
data3['diffs'] = data3.groupby('ticker')['value'].transform(np.diff)
data3.sort_index(inplace=True)
data3
I get a strange result:
date ticker value diffs
0 2013-10-03 ticker_2 0.435995 0
1 2013-10-04 ticker_2 0.025926 NaN
2 2013-10-02 ticker_1 0.549662 NaN
3 2013-10-01 ticker_0 0.435322 NaN
4 2013-10-02 ticker_2 0.420368 NaN
5 2013-10-03 ticker_0 0.330335 0
6 2013-10-04 ticker_1 0.204649 NaN
7 2013-10-02 ticker_0 0.619271 NaN
8 2013-10-01 ticker_2 0.299655 0
What's going on here? When you call the .diff method on a Pandas object, is it not just calling np.diff? I know there's a diff method on the DataFrame class, but I couldn't figure out how to pass that to transform without the lambda syntax I used to make data1 work. Am I missing something? Why is the diffs column in data3 screwy? How can I call the Pandas diff method within transform without needing to write a lambda to do it?
Nice, easy-to-reproduce example! More questions should be like this.
Just pass the function object to transform, e.g. Series.diff (this is tantamount to passing a lambda that calls it). This is equivalent to data1/data2:
In [32]: data3['diffs'] = data3.groupby('ticker')['value'].transform(Series.diff)
In [34]: data3.sort_index(inplace=True)
In [25]: data3
Out[25]:
date ticker value diffs
0 2013-10-03 ticker_2 0.435995 0.015627
1 2013-10-04 ticker_2 0.025926 -0.410069
2 2013-10-02 ticker_1 0.549662 NaN
3 2013-10-01 ticker_0 0.435322 NaN
4 2013-10-02 ticker_2 0.420368 0.120713
5 2013-10-03 ticker_0 0.330335 -0.288936
6 2013-10-04 ticker_1 0.204649 -0.345014
7 2013-10-02 ticker_0 0.619271 0.183949
8 2013-10-01 ticker_2 0.299655 NaN
[9 rows x 4 columns]
I believe that np.diff doesn't follow numpy's own ufunc guidelines for processing array inputs (whereby it tries various methods to coerce the input and wrap the output, e.g. __array__ on input and __array_wrap__ on output). I am not really sure why; see a bit more info here. So the bottom line is that np.diff does not deal with the index properly and does its own calculation, which in this case is wrong.
Pandas has a lot of methods that don't just call the underlying numpy function, mainly because they handle different dtypes, handle NaNs, and, in this case, handle "special" diffs: e.g. you can pass a time frequency to a datelike index, where it calculates how many periods to actually diff.
You can see that the Series .diff() method is different to np.diff():
In [11]: data.value.diff() # Note the NaN
Out[11]:
0 NaN
1 -0.410069
2 0.523736
3 -0.114340
4 -0.014955
5 -0.090033
6 -0.125686
7 0.414622
8 -0.319616
Name: value, dtype: float64
In [12]: np.diff(data.value.values) # the values array of the column
Out[12]:
array([-0.41006867, 0.52373625, -0.11434009, -0.01495459, -0.09003298,
-0.12568619, 0.41462233, -0.31961629])
In [13]: np.diff(data.value) # on the column (Series)
Out[13]:
0 NaN
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 NaN
Name: value, dtype: float64
In [14]: np.diff(data.value.index) # er... on the index
Out[14]: Int64Index([8], dtype=int64)
In [15]: np.diff(data.value.index.values)
Out[15]: array([1, 1, 1, 1, 1, 1, 1, 1])
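For reference, DataFrame.sort was later removed in favour of sort_values, so a sketch of the data2 approach on current pandas (same column names as the question) would look like this:
data4 = data.copy()
data4 = data4.sort_values(['ticker', 'date'])
# Series.diff keeps the index, so the result aligns back onto the original row order
data4['diffs'] = data4.groupby('ticker')['value'].diff()
data4 = data4.sort_index()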
