Python - Get group names from aggregated results in pandas - python

I have a dataframe like this:
minute values
0 1 3
1 2 4
2 1 1
3 4 6
4 3 7
5 2 2
When I apply
df.groupby('minute').sum().sort('values', ascending=False)
This gives:
values
minute
3 7
2 6
4 6
1 4
I want to get the first two values of the minute column as an array, like [3, 2]. How can I access the values in the minute column?

If you want the values from the minute column of the grouped dataframe (after the groupby, that column becomes the index), you can access them through DataFrame.index. Note that DataFrame.sort was deprecated in pandas 0.17 and removed in 0.20; on current versions use sort_values instead. Example -
grouped = df.groupby('minute').sum().sort_values('values', ascending=False)
grouped.index[:2]
If you really want a list, you can use .tolist() to convert it. Example -
grouped.index[:2].tolist()
Demo (output from an older pandas release, where DataFrame.sort still existed) -
In [3]: df
Out[3]:
minute values
0 1 3
1 2 4
2 1 1
3 4 6
4 3 7
5 2 2
In [4]: grouped = df.groupby('minute').sum().sort('values', ascending=False)
In [5]: grouped.index[:2]
Out[5]: Int64Index([3, 2], dtype='int64', name='minute')
In [6]: grouped.index[:2].tolist()
Out[6]: [3, 2]
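
The same idea can be sketched for current pandas releases. This is a minimal sketch, not the answerer's original code: it uses sort_values in place of the removed sort, and kind="stable" is an added assumption so that the tied sums (minutes 2 and 4 both total 6) are not reordered arbitrarily.

```python
import pandas as pd

df = pd.DataFrame({
    "minute": [1, 2, 1, 4, 3, 2],
    "values": [3, 4, 1, 6, 7, 2],
})

# Sum per minute; the group keys become the index of the result.
grouped = df.groupby("minute").sum()

# Sort by the summed values, largest first. A stable sort is requested so
# that groups with equal sums are not reordered arbitrarily.
grouped = grouped.sort_values("values", ascending=False, kind="stable")

# The first two index labels are the minutes with the largest sums.
top_two = grouped.index[:2].tolist()
print(top_two)
```

Because the group keys live in the index, no extra column lookup is needed; grouped.reset_index() would turn minute back into a regular column if that is preferred.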

Related

Allocate lowest value over n rows to n rows in DataFrame

I need to take the lowest value over every n rows and write it to those n rows in a new column of the dataframe. For example:
n=3
Column 1 Column 2
5 3
3 3
4 3
7 2
8 2
2 2
5 4
4 4
9 4
8 2
2 2
3 2
5 2
Please note that if the number of rows is not divisible by n, the leftover rows are merged into the last group. So in this example, the last group at the end of the dataframe has 4 rows.
Thanking you in advance!
I do not know of any straightforward way to do this, but here is a working example (not elegant, but it works).
If you do not care about the number of rows being divisible by n, you could use .groupby():
import pandas as pd
d = {'col1': [1, 2,1,5,3,2,5,6,4,1,2] }
df = pd.DataFrame(data=d)
n=3
df['new_col']=df.groupby(df.index // n).transform('min')
which yields:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 4
7 6 4
8 4 4
9 1 1
10 2 1
However, we can see that the last 2 rows form their own group instead of being merged with the 3 previous rows.
A way around this is to look at the .count() of elements in each group generated by groupby, and check the last one:
import pandas as pd
d = {'col1': [1, 2,1,5,3,2,5,6,4,1,2] }
df = pd.DataFrame(data=d)
n=3
# Temporary dataframe
A = df.groupby(df.index // n).transform('min')
# The min value of each group in a second dataframe
min_df = df.groupby(df.index // n).min()
# The size of the last group
last_batch = df.groupby(df.index // n).count()[-1:]
# If the last group is smaller than n, merge it with the previous group
if last_batch.values[0][0] != n:
    last_group = last_batch + n
    A[-last_group.values[0][0]:] = min_df[-2:].min()
# Assign the temporary modified dataframe to df
df['new_col'] = A
which yields the expected result:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 1
7 6 1
8 4 1
9 1 1
10 2 1
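
An alternative is to build the group labels first and fold the leftover rows into the previous group before calling groupby. The following is a minimal sketch under that assumption; the helper name min_over_blocks is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def min_over_blocks(df, col, n):
    """Return the per-block minimum of `col`, using blocks of n rows.

    If len(df) is not a multiple of n, the leftover rows at the end are
    merged into the last full block.
    """
    # Label rows 0,0,0,1,1,1,... by position, independent of the index.
    group_ids = np.arange(len(df)) // n
    remainder = len(df) % n
    if remainder and len(df) > n:
        # Give the leftover rows the label of the block just before them.
        group_ids[-remainder:] = group_ids[-remainder - 1]
    return df[col].groupby(group_ids).transform("min")


df = pd.DataFrame({"col1": [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]})
df["new_col"] = min_over_blocks(df, "col1", 3)
print(df)
```

Labeling by position rather than by df.index means the sketch does not rely on the index being a default 0..N-1 range.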

Creating Data Frame with repeating values that repeat

I'm trying to create a dataframe in pandas with two variables, "date" and "time_of_day". "date" is 120 observations long and covers 30 days, with four observations per day (1,1,1,1; 2,2,2,2; etc.), and the second variable, "time_of_day", repeats the values 1,2,3,4 thirty times.
The closest I found to this question was "How to create a series of numbers using Pandas in Python", which got me to the code below, but I'm receiving an error that the data must be a 1-dimensional array.
df = pd.DataFrame({'date': np.tile([pd.Series(range(1,31))],4), 'time_of_day': pd.Series(np.tile([1, 2, 3, 4],30 ))})
So the final dataframe would look something like
date time_of_day
1 1
1 2
1 3
1 4
2 1
2 2
2 3
2 4
Thanks much!
You need np.repeat for one column and np.tile for the other:
df = pd.DataFrame({'date': np.repeat(range(1,31),4),
'time_of_day': np.tile([1, 2, 3, 4],30)})
print(df.head(10))
date time_of_day
0 1 1
1 1 2
2 1 3
3 1 4
4 2 1
5 2 2
6 2 3
7 2 4
8 3 1
9 3 2
Or you could use pd.MultiIndex.from_product, which gives the same result:
df = (
    pd.MultiIndex.from_product([range(1,31), range(1,5)],
                               names=['date','time_of_day'])
    .to_frame(index=False)
)
Or use product from itertools:
from itertools import product
df = pd.DataFrame(product(range(1,31), range(1,5)), columns=['date','time_of_day'])
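Newer pandas versions (1.2 and later) also support a cross join in merge. Naming the columns up front gives the same column labels as the other approaches:
out = pd.DataFrame({'date': range(1,31)}).merge(pd.DataFrame({'time_of_day': [1, 2, 3, 4]}), how='cross')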
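
The difference between np.repeat and np.tile is easier to see on a small input. The sketch below uses 3 days instead of 30 (an assumption made only to keep the arrays short) and checks that the repeat/tile construction matches the itertools.product one.

```python
from itertools import product

import numpy as np
import pandas as pd

days = range(1, 4)      # 3 days instead of 30, to keep the output small
slots = [1, 2, 3, 4]

# np.repeat repeats each element in place: 1,1,1,1,2,2,2,2,...
date = np.repeat(list(days), len(slots))
# np.tile repeats the whole sequence: 1,2,3,4,1,2,3,4,...
time_of_day = np.tile(slots, len(days))

df_np = pd.DataFrame({"date": date, "time_of_day": time_of_day})
df_prod = pd.DataFrame(list(product(days, slots)),
                       columns=["date", "time_of_day"])

# Compare values only, so platform-specific integer dtypes do not matter.
same = bool((df_np.to_numpy() == df_prod.to_numpy()).all())
print(df_np)
print(same)
```

The original attempt failed because np.tile was applied to a list wrapping a Series, which produces a 2-dimensional array; here both columns are plain 1-dimensional arrays of equal length.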

python - how to iterate each row and set the most often appears [duplicate]

How do I find the most frequent value in each row of a dataframe?
For example:
In [14]: df
Out[14]:
a b c
0 2 3 3
1 1 1 2
2 7 7 8
Expected result:
[3,1,7]
Try the .mode() method:
In [88]: df
Out[88]:
a b c
0 2 3 3
1 1 1 2
2 7 7 8
In [89]: df.mode(axis=1)
Out[89]:
0
0 3
1 1
2 7
From docs:
Gets the mode(s) of each element along the axis selected. Adds a row
for each mode per label, fills in gaps with nan.
Note that there could be multiple values returned for the selected
axis (when more than one item share the maximum frequency), which is
the reason why a dataframe is returned. If you want to impute missing
values with the mode in a dataframe df, you can just do this:
df.fillna(df.mode().iloc[0])
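
To get the plain list asked for in the question, one option is to take the first column of the mode result. This is a minimal sketch, assuming that keeping only the first mode is acceptable when a row has ties (in that case mode returns extra columns, which are ignored here).

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 1, 7], "b": [3, 1, 7], "c": [3, 2, 8]})

# One row per original row, one column per mode found in that row.
modes = df.mode(axis=1)

# Keep only the first mode of each row and convert it to a Python list.
most_frequent = modes.iloc[:, 0].tolist()
print(most_frequent)
```

Selecting by position with iloc avoids depending on the integer column label that mode assigns to the result.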

Pandas indexing behavior after grouping: do I see an "extra row"?

This might be a very simple question, but I am trying to understand how grouping and indexing work in pandas.
Let's say I have a DataFrame with the following data:
df = pd.DataFrame(data={
'p_id': [1, 1, 1, 2, 3, 3, 3, 4, 4],
'rating': [5, 3, 2, 2, 5, 1, 3, 4, 5]
})
Now, the index would be assigned automatically, so the DataFrame looks like:
p_id rating
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
When I try to group it by p_id, I get:
>> df[['p_id', 'rating']].groupby('p_id').count()
rating
p_id
1 3
2 1
3 3
4 2
I noticed that p_id has now become the index of the grouped DataFrame, but the first row looks weird to me: why does it contain the p_id index with an empty rating?
I know how to fix it, kind of, if I do this:
>> df[['p_id', 'rating']].groupby('p_id', as_index=False).count()
p_id rating
0 1 3
1 2 1
2 3 3
3 4 2
Now I don't have this weird first column, but I have both index and p_id.
So my question is: where does this extra row come from when I don't use as_index=False, and is there a way to group a DataFrame and keep p_id as the index without having to deal with this extra row? If there are any docs I can read on this, that would also be greatly appreciated.
It's just an index name...
Demo:
In [46]: df
Out[46]:
p_id rating
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
In [47]: df.index.name = 'AAA'
Pay attention to the index name, AAA:
In [48]: df
Out[48]:
p_id rating
AAA
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
You can get rid of it using rename_axis() method:
In [42]: df[['p_id', 'rating']].groupby('p_id').count().rename_axis(None)
Out[42]:
rating
1 3
2 1
3 3
4 2
There is no "extra row"; it is simply how pandas renders a DataFrame whose index has a name. The result of groupby(...).count() is an ordinary DataFrame: rating is the column, and p_id has gone from being a column to being the (row) index, so its name is printed on its own line below the column header.
Another reason the two lines are staggered (the row with the column names and the row with the index or MultiIndex names) is that the index can be a MultiIndex, if you group by multiple columns.
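
A quick way to confirm that the "extra row" is only the index name is to inspect the result directly. This is a minimal sketch using the DataFrame from the question; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "p_id": [1, 1, 1, 2, 3, 3, 3, 4, 4],
    "rating": [5, 3, 2, 2, 5, 1, 3, 4, 5],
})

counts = df.groupby("p_id").count()

# Four groups, one column: the printed "p_id" line is not a data row.
print(counts.shape)
print(counts.index.name)

# Drop the index name but keep p_id as the index.
unnamed = counts.rename_axis(None)

# Or move p_id back into a regular column with a fresh default index.
flat = counts.reset_index()
print(flat)
```

rename_axis(None) only changes how the frame is labeled and printed; the index values are the same in counts and unnamed.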
