Pandas: Get top 10 values AFTER grouping - python

I have a pandas DataFrame with a column 'id' and a column 'value'. It is already sorted first by id (ascending) and then by value (descending). What I need is the top 10 values per id.
I assumed that something like the following would work, but it doesn't:
df.groupby("id", as_index=False).aggregate(lambda (index,rows) : rows.iloc[:10])
What I get is just a list of ids, the value column (and other columns that I omitted for the question) aren't there anymore.
Any ideas how it might be done, without iterating through each of the single rows and appending the first ten to another data structure?

Is this what you're looking for?
df.groupby('id').head(10)
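Note that head(10) just takes the first 10 rows of each group, so it relies on the frame already being sorted by id and value as you describe. If that ordering can't be assumed, a sketch like the following (reusing the column names from your question) sorts first and then takes the head:
df.sort_values(['id', 'value'], ascending=[True, False]).groupby('id').head(10)
Alternatively, df.groupby('id')['value'].nlargest(10) gives the 10 largest values per group without a prior sort, though it returns a Series keyed by id and the original row index rather than the full rows.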

I would like to answer this by giving an example dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([['a','a','b','c','a','c','b'],[4,6,1,8,9,4,1],[12,11,7,1,5,5,7],[123,54,146,96,10,114,200]]).T, columns=['item','date','hour','value'])
df['value'] = pd.to_numeric(df['value'])  # np.array of mixed types stores everything as strings, so convert back
This gives you a dataframe
item date hour value
a 4 12 123
a 6 11 54
b 1 7 146
c 8 1 96
a 9 5 10
c 4 5 114
b 1 7 200
Grouping by 'item' and taking head(2) keeps the first two rows of each group:
df.groupby(['item'])['value'].head(2)
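This should print something like the following (head preserves the original row order within each group):
0    123
1     54
2    146
3     96
5    114
6    200
Name: value, dtype: int64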

Related

Put level of dataframe index at the same level of columns on a Multi-Index Dataframe

Context: I'd like to "bump" the index level of a multi-index dataframe up. In other words, I'd like to put the index level of a dataframe at the same level as the columns of a multi-indexed dataframe
Let's say we have this dataframe:
tt = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
tt.index.name = 'Index Column'
And we perform this change to add a multi-index level (like a label of a table)
tt = pd.concat([tt],keys=['Multi-Index Table Label'], axis=1)
Which results in this:
Multi-Index Table Label
A B C
Index Column
0 1 4 7
1 2 5 8
2 3 6 9
Desired Output: How can I make it so that the dataframe looks like this instead (notice the removal of the empty level on the dataframe/table):
Multi-Index Table Label
Index Column A B C
0 1 4 7
1 2 5 8
2 3 6 9
Attempts: I was testing something out and you can essentially remove the index level by doing this:
tt.index.name = None
Which would result in:
Multi-Index Table Label
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Essentially this removes that extra level/empty line, but I do want to keep the Index Column, as it gives information about the type of data on the index (which in this example is just 0, 1, 2 but could be years, dates, etc.).
How could I do that?
Thank you all in advance :)
How about this:
import pandas as pd

tt = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
tt.insert(loc=0, column='Index Column', value=tt.index)  # move the index into a real column
tt = pd.concat([tt], keys=['Multi-Index Table Label'], axis=1)
tt = tt.style.hide_index()  # hide_index() is deprecated since pandas 1.4; use tt.style.hide(axis='index') there
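Keep in mind that .style returns a display-only Styler, not a DataFrame. If you need to keep working with a regular DataFrame, one alternative sketch (same toy data) is to move the index into a column before adding the label, accepting the default integer index on the left:
import pandas as pd

tt = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
tt = tt.rename_axis('Index Column').reset_index()  # the old index becomes a real column
tt = pd.concat([tt], keys=['Multi-Index Table Label'], axis=1)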

Python: extract column from pandas pivot

I have a pivoted table
total_chart = df.pivot_table(index="Name", values="Items", aggfunc='count')
The output gives
A 8
B 52
C 24
D 6
E 43
F 5
G 13
I 1
I am trying to get only the second column (the numbers only).
Is there any simple way to get it?
The code below should do the trick for you.
It counts the rows per "Name" (which matches the pivot's count of "Items" as long as "Items" has no missing values), sorts them ascending by the index, and outputs just the counts without the index.
df['Name'].value_counts().sort_index(ascending=True).tolist()
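Alternatively, since the counts already live in the pivot, a sketch that pulls the column straight out of total_chart (names taken from the question):
total_chart = df.pivot_table(index="Name", values="Items", aggfunc='count')
counts = total_chart['Items'].tolist()  # e.g. [8, 52, 24, 6, 43, 5, 13, 1]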

Merging two dataframes while considering overlaps and missing indexes [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 1 year ago.
I have multiple dataframes that have an ID and a value, and I am trying to merge them such that each ID has all the values in its row.
ID  Value
1   10
3   21
4   12
5   43
7   11
And then I have another dataframe:
ID  Value2
1   12
2   14
4   55
6   23
7   90
I want to merge these two in a way that keeps the IDs that are already in the first dataframe, and if an ID from the second dataframe is not in the first one, it adds a row for that ID with Value2 filled and Value left empty. This is what my result would look like:
ID  Value  Value2
1   10     12
3   21     -
4   12     55
5   43     -
7   11     90
2   -      14
6   -      23
Hope this makes sense. I don't really care about the order of the ID numbers; they can be sorted or not. My goal is to be able to create a dictionary for each ID with "Value", "Value2", "Value3", ... as keys and the corresponding value numbers as the values. Please let me know if any clarification is needed.
You can use pandas' merge method:
import pandas as pd
df1.merge(df2, how='outer', on='ID')
Specifying how='outer' will use the union of keys from both dataframes.
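A minimal runnable sketch with the sample data from the question, including the per-ID dictionary step mentioned at the end (missing values come back as NaN rather than '-'):
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 3, 4, 5, 7], 'Value': [10, 21, 12, 43, 11]})
df2 = pd.DataFrame({'ID': [1, 2, 4, 6, 7], 'Value2': [12, 14, 55, 23, 90]})

merged = df1.merge(df2, how='outer', on='ID')
# one dict per ID, e.g. {1: {'Value': 10.0, 'Value2': 12.0}, 2: {'Value': nan, 'Value2': 14.0}, ...}
per_id = merged.set_index('ID').to_dict(orient='index')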

Create a multiindex DataFrame from existing delimited column names

I have a pandas DataFrame that looks like the following
A_value A_avg B_value B_avg
date
2020-01-01 1 2 3 4
2020-02-01 5 6 7 8
and my goal is to create a multiindex Dataframe that looks like that:
A B
value avg value avg
date
2020-01-01 1 2 3 4
2020-02-01 5 6 7 8
So the part of the column name before the '_' should become the first level of the column index and the part after it the second level. The first part is unstructured; the second is always one of the same four endings.
I tried to solve it with pd.wide_to_long() but I think that is the wrong path, as I don't want to change the df itself. The real df is much larger, so creating it manually is not an option. I'm stuck here and did not find a solution.
You can split the column names on the delimiter and expand to create a MultiIndex:
df.columns = df.columns.str.split("_", expand=True)
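A self-contained sketch with the frame from the question (values and dates reproduced for illustration):
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    index=pd.Index(['2020-01-01', '2020-02-01'], name='date'),
    columns=['A_value', 'A_avg', 'B_value', 'B_avg'],
)
# each name splits at '_' into a two-level MultiIndex: ('A', 'value'), ('A', 'avg'), ...
df.columns = df.columns.str.split('_', expand=True)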

How can I filter a table based on two values at the same time? [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have two dataframes, and I want to keep a row of one dataframe only if its pair of values is present in the other dataframe. Both dataframes share the same column names.
For example, dataframe A has:
col1,col2
1 5
-10 15
6 7
and dataframe B has:
col1,col2
6 7
-10 15
-1 5
So in this example, I would like to pick the value pair in A and see if it is present in B.
The first row of A has the value pair 1,5, and since 1,5 is not present in B, that row would be excluded from A.
The second and third rows of A have the value pairs -10,15 and 6,7, and since both are present in B, I would like to keep them.
So the desired output of the filtered table A would be:
col1,col2
-10 15
6 7
How can I achieve this?
EDIT: One of the first things I tried was a merge, but the resulting dataframe was actually bigger than the original. Since merge and the Merging 101 topic were suggested, I will add the real dataframes here.
Dataframe A has latitude, longitude and id columns (id is not the index). It has 363 rows:
id lat lon
0 0 -33.252192 -70.765291
1 1 -33.224300 -70.780249
2 2 -33.251651 -70.797289
3 3 -33.298574 -70.770133
4 4 -33.214315 -70.787822
... ... ... ...
358 499 -33.227614 -70.770126
359 501 -33.299217 -70.770685
360 502 -33.191476 -70.801492
361 503 -33.239037 -70.780278
362 504 -33.263893 -70.762674
Dataframe B has 73096 rows and it also has an id, latitude and longitude. I'm showing only lat and lon here.
lat lon
1 -33.260415 -70.713767
2 -33.461718 -70.853525
3 -33.258741 -70.638032
4 -33.544858 -70.578624
8 -33.535512 -70.574188
... ... ...
97724 -33.451817 -70.847999
97725 -33.452225 -70.846520
97726 -33.450841 -70.841494
97729 -33.461407 -70.856090
97730 -33.457633 -70.822085
So I want to see if the lat,lon pair in A is present in B and if not then exclude it from A.
When I do A.merge(B) I get a dataframe that is 1108 rows long.
You can try pandas.merge. Since both frames share the column names, something like df1.merge(df2, how='inner', on=['col1','col2']).
(To help you remember: how='inner' takes its name from the inner join in database terminology.)
A simple merge will do
df_out = dfA.merge(dfB)
Output
col1 col2
0 -10 15
1 6 7
df.merge does an inner join by default.
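Regarding the edit: a plain merge can return more rows than A (1108 instead of 363) when a lat/lon key pair occurs multiple times in B, because every match multiplies the row. A minimal sketch that avoids the inflation by de-duplicating B's key columns first (A and B are the frames described above; exact float equality on the coordinates is assumed):
keys = B[['lat', 'lon']].drop_duplicates()
filtered_A = A.merge(keys, on=['lat', 'lon'])  # at most one match per row of A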
