Python Pandas: Convert ".value_counts" output to dataframe - python

Hi, I want to get the counts of unique values of a dataframe column. value_counts implements this, however I want to use its output somewhere else. How can I convert the .value_counts output to a pandas DataFrame? Here is an example code:
import pandas as pd
df = pd.DataFrame({'a':[1, 1, 2, 2, 2]})
value_counts = df['a'].value_counts(dropna=True, sort=True)
print(value_counts)
print(type(value_counts))
output is:
2 3
1 2
Name: a, dtype: int64
<class 'pandas.core.series.Series'>
What I need is a dataframe like this:
unique_values counts
2 3
1 2
Thank you.

Use rename_axis to name the column created from the index, and reset_index:
df = df['a'].value_counts().rename_axis('unique_values').reset_index(name='counts')
print (df)
unique_values counts
0 2 3
1 1 2
Or if you need a one-column DataFrame, use Series.to_frame:
df = df['a'].value_counts().rename_axis('unique_values').to_frame('counts')
print (df)
counts
unique_values
2 3
1 2
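For completeness, here is the first variant run end-to-end on the question's data (starting from the column rather than the whole frame, so it should behave the same across pandas versions):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 2]})

# name the index, then move it out into a regular column next to the counts
out = (df['a'].value_counts()
              .rename_axis('unique_values')
              .reset_index(name='counts'))
print(out)
```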

I just ran into the same problem, so I'll provide my thoughts here.
Warning
When you deal with Pandas data structures, you have to be aware of the return type.
Another solution here
As @jezrael mentioned before, Pandas does provide the pd.Series.to_frame API.
Step 1
You can also wrap the pd.Series in a pd.DataFrame by just doing
df_val_counts = pd.DataFrame(value_counts) # wrap pd.Series into pd.DataFrame
Then you have a pd.DataFrame with column name 'a', and your first column becomes the index
Input: print(df_val_counts.index.values)
Output: [2 1]
Input: print(df_val_counts.columns)
Output: Index(['a'], dtype='object')
Step 2
What now?
If you want to add new column names, you can simply reset the index with reset_index().
Then change the column names by assigning a list to df.columns:
df_val_counts = df_val_counts.reset_index()
df_val_counts.columns = ['unique_values', 'counts']
Then you get what you need
Output:
unique_values counts
0 2 3
1 1 2
Full Answer here
import pandas as pd
df = pd.DataFrame({'a':[1, 1, 2, 2, 2]})
value_counts = df['a'].value_counts(dropna=True, sort=True)
# solution here
df_val_counts = pd.DataFrame(value_counts)
df_value_counts_reset = df_val_counts.reset_index()
df_value_counts_reset.columns = ['unique_values', 'counts'] # change column names

I'll throw my hat in as well, essentially the same as @wy-hsu's solution, but in function format:
def value_counts_df(df, col):
    """
    Returns pd.value_counts() as a DataFrame

    Parameters
    ----------
    df : Pandas Dataframe
        Dataframe on which to run value_counts(), must have column `col`.
    col : str
        Name of column in `df` for which to generate counts

    Returns
    -------
    Pandas Dataframe
        Returned dataframe will have a single column named "count" which
        contains the value_counts() for each unique value of df[col].
        The index name of this dataframe is `col`.

    Example
    -------
    >>> value_counts_df(pd.DataFrame({'a':[1, 1, 2, 2, 2]}), 'a')
       count
    a
    2      3
    1      2
    """
    df = pd.DataFrame(df[col].value_counts())
    df.index.name = col
    df.columns = ['count']
    return df
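A quick standalone check of the helper (the definition is repeated here so the snippet runs on its own):

```python
import pandas as pd

def value_counts_df(df, col):
    """Return df[col].value_counts() as a one-column DataFrame named 'count'."""
    out = pd.DataFrame(df[col].value_counts())
    out.index.name = col
    out.columns = ['count']
    return out

res = value_counts_df(pd.DataFrame({'a': [1, 1, 2, 2, 2]}), 'a')
print(res)
```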

pd.DataFrame(
    df.groupby(['groupby_col'])['column_to_perform_value_count'].value_counts()
).rename(
    columns={'old_column_name': 'new_column_name'}
).reset_index()
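To make that pattern concrete, here is a sketch with made-up data and column names (group and city are placeholders for your own columns); to_frame is used in place of rename so the counts column gets a fixed name regardless of pandas version:

```python
import pandas as pd

data = pd.DataFrame({
    'group': ['x', 'x', 'x', 'y', 'y'],
    'city':  ['NY', 'NY', 'LA', 'LA', 'LA'],
})

# per-group counts of 'city', flattened back into regular columns
counts = (data.groupby('group')['city'].value_counts()
              .to_frame('n')
              .reset_index())
print(counts)
```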

Example of selecting a subset of columns from a dataframe, grouping, applying value_counts per group, naming the value_counts column Count, and displaying the first n groups.
# Select 5 columns (A..E) from a dataframe (data_df).
# Sort on A,B. groupby B. Display first 3 groups.
df = data_df[['A','B','C','D','E']].sort_values(['A','B'])
g = df.groupby(['B'])
for n, (k, gg) in enumerate(list(g)[:3]):  # display first 3 groups
    display(k, gg.value_counts().to_frame('Count').reset_index())

Related

Stack the columns based on one column, keeping the ids

I have a DataFrame with 100 columns (however I provide only three columns here) and I want to build a new DataFrame with two columns. Here is the DataFrame:
import pandas as pd
df = pd.DataFrame()
df['id'] = [1,2,3]
df['c1'] = [1,5,1]
df['c2'] = [-1,6,5]
df
I want to stack the values of all columns for each id and put them in one column. For example, for id=1 I want to stack its values 1 and -1 in one column. Here is the DataFrame that I want.
Note: df.melt does not solve my question, since I want to keep the ids.
Note 2: I already tried stack and reset_index, and it does not help.
df = df.stack().reset_index()
df.columns = ['id','c']
df
You could first set_index with "id"; then stack + reset_index:
out = (df.set_index('id').stack()
.droplevel(1).reset_index(name='c'))
Output:
id c
0 1 1
1 1 -1
2 2 5
3 2 6
4 3 1
5 3 5
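For what it's worth, melt can also keep the ids via id_vars; after dropping the variable column and sorting by id (melt emits all c1 values before all c2 values, hence the sort), it produces the same frame:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'c1': [1, 5, 1], 'c2': [-1, 6, 5]})

# id_vars keeps 'id' on every row; c1/c2 land in a single value column
out = (df.melt(id_vars='id', value_name='c')
         .drop(columns='variable')
         .sort_values('id', kind='stable')
         .reset_index(drop=True))
print(out)
```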

Aggregate DataFrame down to one row using different functions

I have a dataframe with multiple rows, which I'd like to aggregate down, per-column, to a 1-row dataframe, using a different function per-column.
Take the following dataframe, as an example:
df = pd.DataFrame([[1,2], [2,3]], columns=['A', 'B'])
print(df)
Result:
A B
0 1 2
1 2 3
I'd like to aggregate the first column using sum and the second using mean. There is a convenient DataFrame.agg() method which can take a map of column names to aggregation function, like so:
aggfns = {
'A': 'sum',
'B': 'mean'
}
print(df.agg(aggfns))
However, this results in a Series rather than a DataFrame:
A 3.0
B 2.5
dtype: float64
Among other problems, a series has a single dtype so loses the per-column datatype. A series is well-suited to represent a single dataframe column, but not a single dataframe row.
I managed to come up with this tortured incantation:
df['dummy'] = 0
dfa = df.groupby('dummy').agg(aggfns).reset_index(drop=True)
print(dfa)
This creates a dummy column which is 0 everywhere, groups on it, does the aggregation and drops it, which produces the desired result:
A B
0 3 2.5
Certainly there is something better?
Using Series.to_frame + DataFrame.T (short for transpose):
dfa = df.agg(aggfns).to_frame().T
Output:
>>> dfa
A B
0 3.0 2.5
You could add the grouping column on the fly with assign, so the original DataFrame is left unmutated:
dfa = df.assign(d=0).groupby('d').agg(aggfns).reset_index(drop=True)
Output:
>>> dfa
A B
0 3 2.5
You can explicitly create a new DataFrame()
>>> pd.DataFrame({'A': [df.A.sum()], 'B': [df.B.mean()]})
A B
0 3 2.5
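One caveat with the to_frame().T route: the intermediate Series has a single dtype, so A comes back as a float. A sketch that generalizes the explicit-DataFrame idea to any aggfns map while keeping per-column dtypes:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [2, 3]], columns=['A', 'B'])
aggfns = {'A': 'sum', 'B': 'mean'}

# aggregate column by column, so each result column keeps its own dtype
dfa = pd.DataFrame({col: [df[col].agg(fn)] for col, fn in aggfns.items()})
print(dfa)
print(dfa.dtypes)
```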

Pandas: how to index dataframe for certain value or string without knowing the column name

Although it seems very trivial, I am having a hard time getting the index of a dataframe for a certain string or value at a random position in the dataframe.
I made an example dataframe:
fruits = {
'column1':["Apples","Pears","Bananas","Oranges","Strawberries"],
'column2':[1,2,3,4,5],
'column3':["Kiwis","Mangos","Pineapples","Grapes","Melons"]
}
df = pd.DataFrame(fruits)
column1 column2 column3
0 Apples 1 Kiwis
1 Pears 2 Mangos
2 Bananas 3 Pineapples
3 Oranges 4 Grapes
4 Strawberries 5 Melons
Now I want to get the position index of Mangos, without knowing in which column or row it exists. So far I have succeeded in getting the row index:
print(df.loc[df.isin(["Mangos"]).any(axis=1)].index)
Which results in:
Int64Index([1], dtype='int64')
But now I would also like to retrieve the column index or column name.
This thread is a very simplified version of Get column name where value is something in pandas dataframe, but I could not figure out the code using the other thread.
You can simply do:
df.columns[df.isin(['Mangos']).any()]
Index(['column3'], dtype='object')
Or to just get the column name:
df.columns[df.isin(['Mangos']).any()][0]
# column3
To get the index of the column, try:
df.columns.get_indexer(df.columns[df.isin(['Mangos']).any()])
# [2]
Stack the dataframe to reshape it into a MultiIndex Series, then use boolean indexing to get the index:
s = df.stack()
s[s == 'Mangos'].index
MultiIndex([(1, 'column3')])
You can use np.where as well.
Code:
import numpy as np
[(df.index[i], df.columns[c]) for i, c in zip(*np.where(df.isin(['Mangos'])))]
Output:
[(1, 'column3')]
You can try a simple search yourself.
coordinates = []
indices = df.index
columns = df.columns
for index in indices:  # df is the DataFrame
    for col in columns:
        if df[col][index] == 'Mangos':
            coordinates.append((index, col))
coordinates
Output:
[(1, 'column3')]

Pandas find the first occurrence of a specific value in a row within multiple columns and return column index

For a dataframe:
df = pd.DataFrame({"A":[0,0],"B":[0,1],"C":[1,2],"D":[2,2]})
How to obtain the column name or column index when the value is 2 or a certain value
and put it in a new column at df, say df["TAG"]
df = pd.DataFrame({"A":[0,0],"B":[0,1],"C":[1,2],"D":[2,2],"TAG":["D","C"]})
I tried
df["TAG"] = np.where(df[cols] >= 2, df.columns, '')
where cols is the list of df columns.
So far I can only find how to get the row index when matching a value in Pandas.
In Excel we can do this with MATCH(TRUE,INDEX($A:$D>=2,0),) applied to multiple rows.
Any help or hints are appreciated.
Thank you so much in advance.
Try idxmax:
>>> df['TAG'] = df.ge(2).T.idxmax()
>>> df
A B C D TAG
0 0 0 1 2 D
1 0 1 2 2 C
>>>
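One caveat with idxmax: for a row where no value reaches 2 it still returns the first column ('A'), since every entry of the mask ties at False. If such rows can occur, masking them out with where is one option (a sketch, with an extra all-zeros row added for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 0, 0], "B": [0, 1, 0], "C": [1, 2, 0], "D": [2, 2, 1]})

mask = df.ge(2)
# first column with a value >= 2; '' where the row has no match at all
df['TAG'] = mask.T.idxmax().where(mask.any(axis=1), '')
print(df)
```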

Efficient way to drop a row from Dataframe A if an element equals an element in Dataframe B

I have a column in Dataframe B that contains elements I wish to drop from Dataframe A, should A contain them. I wish to drop the entire row from A.
I'm not new to programming but I am learning the extensive pandas library. From what I've seen, this can't be in any way efficient or proper.
for i in range(0, 106):
    for j in range(0, 171):
        if dfB.iloc[i, 2] == dfA.iloc[j, 0]:
            dfA.drop(j, inplace=True)
IIUC:
dfA = dfA.loc[~dfA["ColumnNameInA"].isin(dfB["ColumnNameInB"])]
You would need to substitute the appropriate column names.
In this case, dfA["ColumnNameInA"].isin(dfB["ColumnNameInB"]) creates a series that is True whenever the value in dfA's column is in dfB's column. We pass that to .loc, and reassign to dfA.
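A self-contained illustration of that pattern, with made-up column names x and y:

```python
import pandas as pd

dfA = pd.DataFrame({'x': [1, 2, 3, 4]})
dfB = pd.DataFrame({'y': [2, 4, 9]})

# keep only the rows of dfA whose x value does not appear in dfB['y']
dfA = dfA.loc[~dfA['x'].isin(dfB['y'])]
print(dfA)
```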
This should also work for the row-wise case (note the !=, since we want to drop the matching rows):
df = df[df['A'] != df2['B']]
Assumption: df and df2 are the same length, and you are comparing row x from df to row x from df2.
Example Dataset:
df = pd.DataFrame({'A': [1,2,3]})
df2 = pd.DataFrame({'B': [1,4,3]})
Output:
df
A
1 2
