Python: extract column from pandas pivot - python

I have a pivoted table
total_chart = df.pivot_table(index="Name", values="Items", aggfunc='count')
The output gives
A 8
B 52
C 24
D 6
E 43
F 5
G 13
I 1
I am trying to get only the second column (the numbers only).
Is there any simple way to get it?

The code below should do the trick for you.
It counts the rows for each "Name", sorts the result ascending by the index "Name" and outputs just the counts without the index. This matches the pivot's count as long as "Items" has no missing values.
df['Name'].value_counts().sort_index(ascending=True).tolist()
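For illustration, here is a minimal sketch on made-up data (the sample df below is an assumption, not the asker's data). It shows that the single column of the pivoted table can be pulled out as a plain list, and that grouping by "Name" should give the same counts.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's DataFrame.
df = pd.DataFrame({
    "Name": ["B", "A", "B", "C", "A", "B"],
    "Items": [10, 20, 30, 40, 50, 60],
})

total_chart = df.pivot_table(index="Name", values="Items", aggfunc="count")

# The pivot result is a DataFrame indexed by "Name" with one column, "Items".
counts = total_chart["Items"].tolist()

# Equivalent without the pivot: count non-null "Items" per "Name".
counts_groupby = df.groupby("Name")["Items"].count().tolist()

print(counts)  # [2, 3, 1]
```

Both approaches return the counts in index order (A, B, C), so the list lines up with the sorted names.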

Related

using set() with pandas

May I ask you please if we can use set() to read the data in a specific column in pandas? For example, I have the following output from a DataFrame df1:
df1= [
0 -10 2 5
1 24 5 10
2 30 3 6
3 30 2 1
4 30 4 5
]
where the first column is the index. I first tried to isolate the second column
[-10
24
30
30
30]
using the following: x = pd.DataFrame(df1, columns=[0]). Then I transposed the column with XX = x.T and applied the set() function.
However, instead of obtaining [-10 24 30] I got [0 1 2 3 4].
So set() read the index instead of the first column.
set() takes an iterable.
Using a pandas DataFrame as an iterable yields the column names in turn.
Since you've transposed the DataFrame, your index values are now column names, so when you use the transposed DataFrame as an iterable you get those index values.
If you want to get the values in the column using set() you can use:
x = pd.DataFrame(df1, columns=[0])
set(x.iloc[:,0].values)
But if you just want the unique values in column 0 then you can select it as a Series (unique() is a Series method, not a DataFrame method) and use
df1[0].unique()
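The behaviour described above can be sketched as follows, using the numbers from the question. The integer column labels 0, 1, 2 are an assumption about how df1 was built.

```python
import pandas as pd

# Values from the question; integer column labels 0, 1, 2 are assumed.
df1 = pd.DataFrame([[-10, 2, 5], [24, 5, 10], [30, 3, 6], [30, 2, 1], [30, 4, 5]])

# Iterating a DataFrame yields its column labels, not its values.
labels = set(df1)                    # {0, 1, 2}

# After transposing a single column, the old index becomes the column labels.
transposed_labels = set(df1[[0]].T)  # {0, 1, 2, 3, 4}

# Iterating a Series yields its values, so set() works on a single column.
values = set(df1[0])                 # {-10, 24, 30}

# unique() keeps first-appearance order and returns a NumPy array.
unique_values = df1[0].unique().tolist()  # [-10, 24, 30]
```

Note that set() does not guarantee any ordering, while unique() returns the values in the order they first appear.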

How do I eliminate the additional row for aggregate column in pandas groupby

I am trying to aggregate some fields and I get an additional header row on the aggregate column, in this case 'sum'. How do I get rid of it? It makes the dataframe difficult to work with in subsequent processing.
import pandas as pd
Grade=[['A',1],['A',2],['A',10],['B',4],['B',2],['C',3],['D',10],['D',5],['D',1]]
Grade_df=pd.DataFrame(Grade,columns=['grade','count'])
Grade_Aggregate=Grade_df.groupby(['grade']).agg({'count':['sum']}).reset_index()
Grade_Aggregate
grade count
sum
0 A 13
1 B 6
2 C 3
3 D 16
You can use Groupby.sum here directly.
Grade_df.groupby('grade', as_index=False)['count'].sum()
grade count
0 A 13
1 B 6
2 C 3
3 D 16
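If you want to remove the 'sum' level from the column names:
Grade_Aggregate.columns=[i[0] for i in Grade_Aggregate.columns]
If you want it to be appended to the column name (rstrip avoids a trailing underscore on 'grade', whose second level is empty):
Grade_Aggregate.columns=["_".join(i).rstrip("_") for i in Grade_Aggregate.columns]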
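As an alternative sketch, named aggregation avoids the two-level header in the first place. The output column name "total" is an arbitrary choice for this example, not something from the question.

```python
import pandas as pd

Grade = [['A', 1], ['A', 2], ['A', 10], ['B', 4], ['B', 2],
         ['C', 3], ['D', 10], ['D', 5], ['D', 1]]
Grade_df = pd.DataFrame(Grade, columns=['grade', 'count'])

# Passing a list of functions per column produces two-level column labels.
nested = Grade_df.groupby('grade').agg({'count': ['sum']}).reset_index()
print(nested.columns.tolist())  # [('grade', ''), ('count', 'sum')]

# Named aggregation: the keyword ("total" is an arbitrary name) becomes
# a flat column label, so no extra header row appears.
flat = Grade_df.groupby('grade', as_index=False).agg(total=('count', 'sum'))
print(flat.columns.tolist())    # ['grade', 'total']
```

The extra row in the question's output is the second level of the column labels, which only appears because the aggregation functions were passed as a list.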

Iterate through given dataframe

df =
0 20
1 19
2 18
3 17
4 16
I am iterating with a loop:
for k in df:
    af = AffinityPropagation(preference=k).fit(X)
    labels = af.labels_
    score = silhouette_score(frechet, labels)
    print("Preference: {0}, Silhouette score: {1}".format(k,score))
I get only one number, but I want a dataframe of results with one entry per row of df, i.e. of length len(df).
You need to use iterrows, as @CodeDifferently points out in his comment above.
Here is an example:
Where df is:
df = pd.DataFrame({0:range(20,0,-1)})
Then using your method:
for k in df:
    print(k)
Output:
0
This zero is the column header of the dataframe. You are iterating through the dataframe's column names.
Using iterrows:
for _,k in df.iterrows():
    print(k.iloc[0])
Output:
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
Here you are getting each row of the dataframe as a Series, and with iloc you are getting the first (and in this case only) value in each row.
You almost never need to iterate over a DataFrame. Columns are basically NumPy arrays and have array-like 'elementwise' superpowers. (You ~never need to iterate over NumPy arrays either.)
Maybe formulate your task as a function and use the apply() method on the DataFrame or Series. This 'applies' a function to every item in a column without the need for a loop.
But if you really only have one column like this, why use a DataFrame at all? Just use a NumPy array (or get at it with the column's values attribute).
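A minimal sketch of the apply() approach, collecting one result per input value. The function score_for is a hypothetical placeholder for the fit-and-score step in the question; it is not part of any library.

```python
import pandas as pd

df = pd.DataFrame({0: range(20, 15, -1)})  # 20, 19, 18, 17, 16

def score_for(preference):
    """Hypothetical placeholder for the fit-and-score step in the question."""
    return preference / 100

# apply() calls the function once per value and returns a Series
# aligned with the original index.
scores = df[0].apply(score_for)

# Collect preference and score side by side, one row per input value.
results = pd.DataFrame({"preference": df[0], "score": scores})
print(len(results) == len(df))  # True
```

Because the result is built from Series that share df's index, it has exactly len(df) rows, which is what the question asks for.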

Using itertools.combinations with columns

I have a Dataframe df with 3 columns. A,B and C
A B C
2 4 4
5 2 5
6 9 5
My goal is to use itertools.combinations to find all non-repeating column pairs and to put the first column pair in one DataFrame and the second in the other. So all pairs of this would give A:B,A:C,B:C.
So the first dataframe df1 would have the first column of each of those pairs:
A A B
2 2 4
5 5 2
6 6 9
and the second df2:
B C C
4 4 4
2 5 5
9 5 5
I'm trying to do something with itertools like:
for cola, colb in itertools.combinations(df, 2):
    df1[cola]=cola
    df2[colb]=colb
I know that makes no sense. I could convert each column to a list, run itertools over the list of lists, append each result to lists A and B, and then turn those lists back into DataFrames, but then I lose the headers. I tried adding the headers to the lists, but when I rebuild the DataFrame the indexing seems off and I can't fix it. So I'm trying to find out whether there is a way to run itertools over entire columns, headers included.
Utilize the zip function to group the columns to be used in each DataFrame separately, and then use pandas.concat to construct your new DataFrames:
from itertools import combinations
df1_cols, df2_cols = zip(*combinations(df.columns,2))
df1 = pd.concat([df[col] for col in df1_cols],axis=1)
df2 = pd.concat([df[col] for col in df2_cols],axis=1)
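To make the zip step concrete, here is a small sketch on the question's data. It uses selection with a list of labels instead of pandas.concat, which is a variant of the approach above rather than the answer's exact code.

```python
from itertools import combinations
import pandas as pd

df = pd.DataFrame({"A": [2, 5, 6], "B": [4, 2, 9], "C": [4, 5, 5]})

pairs = list(combinations(df.columns, 2))
# [('A', 'B'), ('A', 'C'), ('B', 'C')]

# zip(*pairs) regroups the pairs into "all first members" and
# "all second members".
df1_cols, df2_cols = zip(*pairs)

# Selecting with a list of labels keeps the headers and allows repeats.
df1 = df[list(df1_cols)]
df2 = df[list(df2_cols)]

print(df1.columns.tolist())  # ['A', 'A', 'B']
print(df2.columns.tolist())  # ['B', 'C', 'C']
```

Both results contain duplicate column labels, so later selections such as df1['A'] return a two-column DataFrame rather than a Series.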

Pandas: Get top 10 values AFTER grouping

I have a pandas data frame with a column 'id' and a column 'value'. It is already sorted by first id (ascending) and then value (descending). What I need is the top 10 values per id.
I assumed that something like the following would work, but it doesn't:
df.groupby("id", as_index=False).aggregate(lambda (index,rows) : rows.iloc[:10])
What I get is just a list of ids, the value column (and other columns that I omitted for the question) aren't there anymore.
Any ideas how it might be done, without iterating through each of the single rows and appending the first ten to another data structure?
Is this what you're looking for?
df.groupby('id').head(10)
I would like to answer this by giving an example dataframe:
df = pd.DataFrame(np.array([['a','a','b','c','a','c','b'],[4,6,1,8,9,4,1],[12,11,7,1,5,5,7],[123,54,146,96,10,114,200]]).T,columns=['item','date','hour','value'])
df['value'] = pd.to_numeric(df['value'])
This gives you a dataframe
item date hour value
a 4 12 123
a 6 11 54
b 1 7 146
c 8 1 96
a 9 5 10
c 4 5 114
b 1 7 200
Grouping by item and calling head(2) then returns the first 2 values of each group:
df.groupby(['item'])['value'].head(2)
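A short sketch of what head(n) does on a groupby, using hypothetical data that is already sorted by id (ascending) and value (descending), as in the question. It uses n=2 instead of 10 to keep the example small.

```python
import pandas as pd

# Hypothetical data, already sorted by id (ascending) then value (descending).
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3],
    "value": [9, 7, 4, 8, 3, 5],
})

# head(n) on a groupby keeps the first n rows of each group,
# preserving all columns and the original row order.
top2 = df.groupby("id").head(2)
print(top2["value"].tolist())  # [9, 7, 8, 3, 5]

# If the frame were not pre-sorted, sort it first.
top2_sorted = (
    df.sort_values(["id", "value"], ascending=[True, False])
      .groupby("id")
      .head(2)
)
```

Groups with fewer than n rows (id 3 here) are returned in full, and the other columns, including id, are kept.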
