How to groupby time series data - python

I have the dataframe below; column B's dtype is datetime64.
   A  B
0  a  2016-09-13
1  b  2016-09-14
2  b  2016-09-15
3  a  2016-10-13
4  a  2016-10-14
I would like to group by month (or, more generally, by year, day, etc.), so I would like to get the count result below, keyed on column B.
         a  b
2016-09  1  2
2016-10  2  0
I tried groupby, but I couldn't figure out how to handle the datetime64 dtype. How can I group on a datetime64 column?

If you set the datetime column as the index you can use pd.TimeGrouper (since deprecated and removed in favour of pd.Grouper) to group by various time frequencies. Example code:
import pandas as pd

# recreate dataframe
df = pd.DataFrame({'A': ['a', 'b', 'b', 'a', 'a'],
                   'B': ['2016-09-13', '2016-09-14', '2016-09-15',
                         '2016-10-13', '2016-10-14']})
df['B'] = pd.to_datetime(df['B'])

# set column B as index for use of TimeGrouper
df.set_index('B', inplace=True)

# now do the magic of Ami Tavory's answer combined with TimeGrouper:
df = df.groupby([pd.TimeGrouper('M'), 'A']).size().unstack().fillna(0)
This returns:
A             a    b
B
2016-09-30  1.0  2.0
2016-10-31  2.0  0.0
Alternatively (credit to ayhan), skip the set_index step and use the following one-liner straight after creating the dataframe:
# recreate dataframe
df = pd.DataFrame({'A': ['a', 'b', 'b', 'a', 'a'],
                   'B': ['2016-09-13', '2016-09-14', '2016-09-15',
                         '2016-10-13', '2016-10-14']})
df['B'] = pd.to_datetime(df['B'])
df = df.groupby([pd.Grouper(key='B', freq='M'), 'A']).size().unstack().fillna(0)
which returns the same result.
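If you also want the index labels to read 2016-09 / 2016-10 (as in the question) rather than month-end dates, one option is to convert the resulting DatetimeIndex to a monthly PeriodIndex. A minimal, self-contained sketch (on very recent pandas the 'M' alias for Grouper may warn in favour of 'ME'):
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b', 'a', 'a'],
                   'B': pd.to_datetime(['2016-09-13', '2016-09-14', '2016-09-15',
                                        '2016-10-13', '2016-10-14'])})

out = df.groupby([pd.Grouper(key='B', freq='M'), 'A']).size().unstack(fill_value=0)
out.index = out.index.to_period('M')   # index labels become 2016-09, 2016-10
print(out)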

Say you start with
In [247]: df = pd.DataFrame({'A': ['a', 'b', 'b', 'a', 'a'], 'B': ['2016-09-13', '2016-09-14', '2016-09-15', '2016-10-13', '2016-10-14']})
In [248]: df.B = pd.to_datetime(df.B)
Then you can groupby-size, then unstack:
In [249]: df = df.groupby([df.B.dt.year.astype(str) + '-' + df.B.dt.month.astype(str), df.A]).size().unstack().fillna(0).astype(int)
Finally, you just need to make B a date again:
In [250]: df.index = pd.to_datetime(df.index)
In [251]: df
Out[251]:
A           a  b
B
2016-10-01  2  0
2016-09-01  1  2
Note that the final conversion to a datetime sets a uniform day (you can't have a "dayless" object of this type).
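An alternative sketch (not from the original answers) that avoids both the string concatenation and the uniform-day issue is to group on a monthly period directly:
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b', 'a', 'a'],
                   'B': pd.to_datetime(['2016-09-13', '2016-09-14', '2016-09-15',
                                        '2016-10-13', '2016-10-14'])})

# dt.to_period('M') turns each timestamp into a year-month period such as 2016-09
out = df.groupby([df.B.dt.to_period('M'), df.A]).size().unstack(fill_value=0)
print(out)
# A        a  b
# B
# 2016-09  1  2
# 2016-10  2  0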

Related

Aggregate values in pandas dataframe based on lists of indices in a pandas series

Suppose you have a dataframe with an "id" column and a column of values:
df1 = pd.DataFrame({'id': ['a', 'b', 'c'], 'vals': [1, 2, 3]})
df1
  id  vals
0  a     1
1  b     2
2  c     3
You also have a series that contains lists of "id" values that correspond to those in df1:
df2 = pd.Series([['b', 'c'], ['a', 'c'], ['a', 'b']])
df2
0    [b, c]
1    [a, c]
2    [a, b]
dtype: object
Now, you need a computationally efficient method for taking the mean of the "vals" column in df1 using the corresponding ids in df2, and creating a new column in df1. For instance, for the first row (index=0) you would take the mean of the values for ids "b" and "c" in df1 (since these are the id values in df2 for index=0):
  id  vals  avg_vals
0  a     1       2.5
1  b     2       2.0
2  c     3       1.5
You could do it this way:
df1['avg_vals'] = df2.apply(lambda x: df1.loc[df1['id'].isin(x), 'vals'].mean())
df1
  id  vals  avg_vals
0  a     1       2.5
1  b     2       2.0
2  c     3       1.5
...but suppose it is too slow for your purposes, and you need something much more computationally efficient if possible.
Let us try:
df1['new'] = pd.DataFrame(df2.tolist()).replace(dict(zip(df1.id, df1.vals))).mean(1)
df1
Out[109]:
  id  vals  new
0  a     1  2.5
1  b     2  2.0
2  c     3  1.5
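For readability, here is roughly what that one-liner does, step by step (a sketch added for explanation, not part of the original answer):
wide = pd.DataFrame(df2.tolist())                          # one column per list element
#    0  1
# 0  b  c
# 1  a  c
# 2  a  b
mapped = wide.replace(dict(zip(df1['id'], df1['vals'])))   # map each id to its value
df1['new'] = mapped.mean(axis=1)                           # row-wise mean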
Try something like:
df1['avg_vals'] = (df2.explode()
                      .map(df1.set_index('id')['vals'])
                      .groupby(level=0)
                      .mean()
                   )
Output:
  id  vals  avg_vals
0  a     1       2.5
1  b     2       2.0
2  c     3       1.5
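Written out step by step, that chain does the following (again just a sketch for explanation):
flat = df2.explode()                              # one id per row, original index kept
vals = flat.map(df1.set_index('id')['vals'])      # look up each id's value in df1
df1['avg_vals'] = vals.groupby(level=0).mean()    # average back per original df2 row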
Thanks to @BENY and @mozway for their answers, but these still were not performing as efficiently as I needed. I was able to take some of mozway's answer and add a merge and groupby to it, which sped things up:
df1 = pd.DataFrame({'id': ['a', 'b', 'c'] , 'vals': [1, 2, 3]})
df2 = pd.Series([['b', 'c'], ['a', 'c'], ['a', 'b']])
df2 = df2.explode().reset_index(drop=False)
df1['avg_vals'] = pd.merge(df1, df2, left_on='id', right_on=0, how='right').groupby('index').mean()['vals']
df1
  id  vals  avg_vals
0  a     1       2.5
1  b     2       2.0
2  c     3       1.5
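One caveat with the snippet above: on newer pandas (2.0+) GroupBy.mean() no longer silently drops non-numeric columns, so it is safer to select 'vals' before taking the mean. A sketch under that assumption:
import pandas as pd

df1 = pd.DataFrame({'id': ['a', 'b', 'c'], 'vals': [1, 2, 3]})
df2 = pd.Series([['b', 'c'], ['a', 'c'], ['a', 'b']])

exploded = df2.explode().reset_index()                    # columns: 'index' and 0
merged = df1.merge(exploded, left_on='id', right_on=0, how='right')
df1['avg_vals'] = merged.groupby('index')['vals'].mean()  # mean of 'vals' only
print(df1)
#   id  vals  avg_vals
# 0  a     1       2.5
# 1  b     2       2.0
# 2  c     3       1.5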

Why does the pandas operation df.loc[:, ['a', 'b']] = df.loc[:, ['c', 'd']] not change the values in df.loc[:, ['a', 'b']]?

I have to use df.loc[:, ['a', 'b']] = df.loc[:, ['c', 'd']].values to successfully change the values in df.loc[:, ['a', 'b']]. Why?
In contrast, df.loc[:, ['a']] = df['c'] works. Why?
df is a pandas.DataFrame.
This answer applies to older pandas versions; in the latest versions of pandas it all works as expected, so converting to numpy arrays or renaming columns is no longer necessary.
The reason is index alignment: columns named c, d are being assigned to columns a, b, so nothing matches and the assignment has no effect. To prevent this, convert the selected DataFrame to a numpy array, which has no column labels, so the assignment works.
Or you can use rename so both DataFrames have the same column names:
df = pd.DataFrame({
    'a': list('abcdef'),
    'b': [4, 5, 4, 5, 5, 4],
    'c': [7, 8, 9, 4, 2, 3],
    'd': [1, 3, 5, 7, 1, 0],
})

df.loc[:, ['a', 'b']] = df.loc[:, ['c', 'd']].rename(columns={'c': 'a', 'd': 'b'})
print(df)
   a  b  c  d
0  7  1  7  1
1  8  3  8  3
2  9  5  9  5
3  4  7  4  7
4  2  1  2  1
5  3  0  3  0
A Series has no columns, so assigning from a Series (as in df.loc[:, ['a']] = df['c']) works fine.
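To see the column alignment the answer describes, here is a tiny illustration (a sketch; it shows what the right-hand side effectively looks like once it is aligned to the target columns 'a' and 'b'):
import pandas as pd

right = pd.DataFrame({'c': [10, 20], 'd': [30, 40]})

# aligning a frame labelled 'c', 'd' to target columns 'a', 'b' yields all NaN,
# which is why the plain .loc assignment appeared to do nothing on older pandas
print(right.reindex(columns=['a', 'b']))
#     a   b
# 0 NaN NaN
# 1 NaN NaN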
Following up on @jezrael's answer, you can reassign the columns using the same syntax as a new column definition, i.e.:
import pandas as pd

df = pd.DataFrame({
    'a': list('abcdef'),
    'b': [4, 5, 4, 5, 5, 4],
    'c': [7, 8, 9, 4, 2, 3],
    'd': [1, 3, 5, 7, 1, 0]})

df[['a', 'b']] = df.loc[:, ['c', 'd']]
df[['a', 'b']] = df[['c', 'd']]
Both lines will yield the expected result. On the other hand, to use loc on the left-hand side you can convert the right-hand side to a numpy array, or rename columns as in jezrael's answer:
df.loc[:, ['a', 'b']] = df.loc[:, ['c', 'd']].to_numpy()

fillna for category column with Series input does not work as expected

I have a category column which I want to fill with a Series.
I tried this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b'], 'value': ['c', np.nan]})
df['value'] = df['value'].astype("category")
df['value'] = df['value'].cat.add_categories(df['key'].unique())
print(df['value'].cat.categories)
df['value'] = df['value'].fillna(df['key'])
print(df)
Expected output:
Index(['c', 'a', 'b'], dtype='object')
  key value
0   a     c
1   b     b
Actual output:
Index(['c', 'a', 'b'], dtype='object')
  key value
0   a     a
1   b     b
This appears to be a bug, but thankfully the workaround is quite simple. You will have to treat "value" as a string column when filling.
df['value'] = pd.Categorical(
    df.value.astype(object).fillna(df.key), categories=df.stack().unique())
df
  key value
0   a     c
1   b     b
From the docs, fillna on Categorical data accepts a scalar, not a Series ("value : scalar — Value to use to fill holes (e.g. 0)"), so you may need to fill as object and then convert back to category:
df.value.astype('object').fillna(df.key)  # then convert to category again
Out[248]:
0    c
1    b
Name: value, dtype: object
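Putting that together, a minimal round-trip sketch of the workaround (fill as plain objects, then convert back to category):
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b'], 'value': ['c', np.nan]})
df['value'] = df['value'].astype('category')

filled = df['value'].astype(object).fillna(df['key'])  # fill using the Series
df['value'] = filled.astype('category')                # back to a category column
print(df)
#   key value
# 0   a     c
# 1   b     b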

Pandas filter dataframe for positive and negative values

I have a pandas dataframe with 3 columns where:
Category dtype - string
Date dtype - datetime
Values dtype - float
df = pd.DataFrame()
df['category'] = ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a']
df['date'] = ['2018-01-01', '2018-01-01', '2018-01-03', '2018-01-05', '2018-01-01', '2018-01-02', '2018-01-06', '2018-01-03', '2018-01-04','2018-01-01']
df['values'] = [1, 2, -1.5, 2.3, 5, -0.7, -5.2, -5.2, 1, -1.1]
df
Per category, I want to keep the row with a positive value and the row with a negative value whose dates are closest to each other.
So, essentially an output that looks like:
df = pd.DataFrame()
df['category'] = ['a', 'a','b', 'b', 'c', 'c']
df['date'] = ['2018-01-01', '2018-01-01', '2018-01-01', '2018-01-03', '2018-01-01', '2018-01-03']
df['values'] = [1, -1.1, 2, -1.5, 5, -5.2]
df
I have looked at similar queries on SO (Identifying closest value in a column for each filter using Pandas, How do I find the closest values in a Pandas series to an input number?)
The first one uses idxmin, which returns the first occurrence, not the closest in value.
The second is about a specific value as an input; I don't think a plain np.argsort works in my case.
I can imagine using a complex web of if statements to do this, but I'm not sure what the most efficient way of doing it is with pandas.
Any guidance would be greatly appreciated.
IIUC, sort your dataframe first then use idxmin:
df1 = df.sort_values(['category', 'date'])
df1[df1.groupby('category')['values']
       .transform(lambda x: x.index.isin([x.ge(0).idxmin(), x.lt(0).idxmin()]))]
Output:
  category       date  values
0        a 2018-01-01     1.0
9        a 2018-01-01    -1.1
1        b 2018-01-01     2.0
2        b 2018-01-03    -1.5
4        c 2018-01-01     5.0
7        c 2018-01-03    -5.2
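An equivalent but perhaps more explicit way to express the same idea (an alternative sketch, not the original answer): after sorting by date, keep the first non-negative row and the first negative row per category.
import pandas as pd

df = pd.DataFrame({
    'category': ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a'],
    'date': pd.to_datetime(['2018-01-01', '2018-01-01', '2018-01-03', '2018-01-05',
                            '2018-01-01', '2018-01-02', '2018-01-06', '2018-01-03',
                            '2018-01-04', '2018-01-01']),
    'values': [1, 2, -1.5, 2.3, 5, -0.7, -5.2, -5.2, 1, -1.1],
})

df1 = df.sort_values(['category', 'date'])
first_pos = df1[df1['values'] >= 0].groupby('category').head(1)  # earliest positive per category
first_neg = df1[df1['values'] < 0].groupby('category').head(1)   # earliest negative per category
out = pd.concat([first_pos, first_neg]).sort_values(['category', 'date'])
print(out)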

pandas - filter dataframe by another dataframe by row elements

I have a dataframe df1 which looks like:
   c  k  l
0  A  1  a
1  A  2  b
2  B  2  a
3  C  2  a
4  C  2  d
and another called df2 like:
   c  l
0  A  b
1  C  a
I would like to filter df1, keeping only the rows whose (c, l) values are NOT in df2. The values to filter out are the (A, b) and (C, a) tuples. So far I tried to apply the isin method:
d = df1[~(df1['l'].isin(df2['l']) & df1['c'].isin(df2['c']))]
That seems too complicated to me, and it returns:
   c  k  l
2  B  2  a
4  C  2  d
but I'm expecting:
   c  k  l
0  A  1  a
2  B  2  a
4  C  2  d
You can do this efficiently using isin on a multiindex constructed from the desired columns:
df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

keys = list(df2.columns.values)
i1 = df1.set_index(keys).index
i2 = df2.set_index(keys).index
df1[~i1.isin(i2)]
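For reference, that last line keeps exactly the rows the question expects:
   c  k  l
0  A  1  a
2  B  2  a
4  C  2  d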
I think this improves on @IanS's similar solution because it doesn't assume any column type (i.e. it will work with numbers as well as strings).
(The answer above is an edit; the following was my initial answer.)
Interesting! This is something I haven't come across before... I would probably solve it by merging the two dataframes, then dropping the rows that are present in df2. Here is an example, which makes use of a temporary marker column:
df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

# create a column marking df2 values
df2['marker'] = 1

# join the two, keeping all of df1's indices
joined = pd.merge(df1, df2, on=['c', 'l'], how='left')
joined

# extract desired columns where marker is NaN
joined[pd.isnull(joined['marker'])][df1.columns]
There may be a way to do this without the temporary marker column, but I can't think of one. As long as your data isn't huge, the above method should be fast and sufficient.
This is pretty succinct and works well, provided both frames are indexed by the same key columns (e.g. after set_index(['c', 'l']) on each):
df1 = df1[~df1.index.isin(df2.index)]
Using DataFrame.merge & DataFrame.query:
A more elegant method is to do a left join with the argument indicator=True, then filter all the rows which are left_only with query:
d = (
    df1.merge(df2,
              on=['c', 'l'],
              how='left',
              indicator=True)
       .query('_merge == "left_only"')
       .drop(columns='_merge')
)
print(d)
   c  k  l
0  A  1  a
2  B  2  a
4  C  2  d
indicator=True returns a dataframe with an extra column _merge which marks each row as left_only, both, or right_only:
df1.merge(df2, on=['c', 'l'], how='left', indicator=True)
   c  k  l     _merge
0  A  1  a  left_only
1  A  2  b       both
2  B  2  a  left_only
3  C  2  a       both
4  C  2  d  left_only
I think this is quite a simple approach when you want to filter a dataframe on multiple columns from another dataframe, or even on a custom list of tuples.
df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

# values of df2 columns 'c' and 'l' that will be used to filter df1
idxs = list(zip(df2.c.values, df2.l.values))  # [('A', 'b'), ('C', 'a')]
# keep the rows of df1 whose (c, l) pair is NOT among the df2 pairs in idxs
df1 = df1[~pd.Series(list(zip(df1.c, df1.l)), index=df1.index).isin(idxs)]
How about:
df1['key'] = df1['c'] + df1['l']
d = df1[~df1['key'].isin(df2['c'] + df2['l'])].drop(['key'], axis=1)
Another option that avoids creating an extra column or doing a merge would be to do a groupby on df2 to get the distinct (c, l) pairs and then just filter df1 using that.
gb = df2.groupby(['c', 'l']).groups
df1[[p not in gb for p in zip(df1['c'], df1['l'])]]
For this small example, it actually seems to run a bit faster than the pandas-based approach (666 µs vs. 1.76 ms on my machine), but I suspect it could be slower on larger examples since it's dropping into pure Python.
You can concatenate both DataFrames and drop all duplicates (note that DataFrame.append was removed in pandas 2.0, so use pd.concat):
pd.concat([df1, df2]).drop_duplicates(subset=['c', 'l'], keep=False)
Output:
   c    k  l
0  A  1.0  a
2  B  2.0  a
4  C  2.0  d
Note that this method doesn't work if df1 itself contains duplicate (c, l) pairs.
