I have a dataframe with multiple rows, which I'd like to aggregate down, per-column, to a 1-row dataframe, using a different function per-column.
Take the following dataframe, as an example:
import pandas as pd

df = pd.DataFrame([[1, 2], [2, 3]], columns=['A', 'B'])
print(df)
Result:
   A  B
0  1  2
1  2  3
I'd like to aggregate the first column using sum and the second using mean. There is a convenient DataFrame.agg() method which can take a mapping of column names to aggregation functions, like so:
aggfns = {
    'A': 'sum',
    'B': 'mean'
}
print(df.agg(aggfns))
However, this results in a Series rather than a DataFrame:
A    3.0
B    2.5
dtype: float64
Among other problems, a Series has a single dtype, so the per-column dtypes are lost. A Series is well suited to representing a single DataFrame column, but not a single DataFrame row.
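To make the dtype loss concrete (a small check using the df and aggfns above):
print(df.dtypes)             # A    int64, B    int64
print(df.agg(aggfns).dtype)  # float64 -- both results coerced to one dtype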
I managed to come up with this tortured incantation:
df['dummy'] = 0
dfa = df.groupby('dummy').agg(aggfns).reset_index(drop=True)
print(dfa)
This creates a dummy column which is 0 everywhere, groups on it, does the aggregation and drops it, which produces the desired result:
   A    B
0  3  2.5
Certainly there is something better?
Using Series.to_frame + DataFrame.T (short for transpose):
dfa = df.agg(aggfns).to_frame().T
Output:
>>> dfa
     A    B
0  3.0  2.5
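Note that the intermediate Series forces everything to float, which is why A shows as 3.0. If keeping A as an integer matters, one possible follow-up (a sketch, not required) is convert_dtypes, which infers a nullable dtype per column:
dfa = df.agg(aggfns).to_frame().T.convert_dtypes()
print(dfa.dtypes)  # A should come back as Int64, B as Float64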
You could add the dummy column with assign instead, which leaves the original DataFrame unmodified:
dfa = df.assign(d=0).groupby('d').agg(aggfns).reset_index(drop=True)
Output:
>>> dfa
   A    B
0  3  2.5
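If you would rather not add even a temporary column, you can pass a constant array as the grouper instead (same idea, sketched under the assumption that df and aggfns are as defined in the question):
import numpy as np

dfa = df.groupby(np.zeros(len(df), dtype=int)).agg(aggfns).reset_index(drop=True)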
You can explicitly create a new DataFrame:
>>> pd.DataFrame({'A': [df.A.sum()], 'B': [df.B.mean()]})
   A    B
0  3  2.5
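If you already have the aggfns mapping, the same approach can be written generically with a dict comprehension (a sketch; it assumes every value in aggfns is the name of a Series method):
dfa = pd.DataFrame({col: [getattr(df[col], fn)()] for col, fn in aggfns.items()})
print(dfa.dtypes)  # per-column dtypes are preserved (A int64, B float64)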
I am trying to multiply dataframe 1 column a by dataframe 2 column b.
combineQueryandBookFiltered['pnlValue'] = np.multiply(combineQueryandBookFiltered['pnlValue'], df_fxrate['fx_rate'])
The pnlValue column has many numbers, while the fx_rate column is just one number.
The code executes, but my end result ends up with tons of NaN.
Any help would be appreciated.
It is probably due to the index of your dataframe. You need to use df_fxrate['fx_rate'].values:
combineQueryandBookFiltered['pnlValue'] = np.multiply(combineQueryandBookFiltered['pnlValue'], df_fxrate['fx_rate'].values)
or better:
combineQueryandBookFiltered['pnlValue']=combineQueryandBookFiltered['pnlValue']*df_fxrate['fx_rate'].values
Here is an example:
df1 = pd.DataFrame(index=[1, 2])
df2 = pd.DataFrame(index=[0])
df1['col1'] = [1, 1]
print(df1)
   col1
1     1
2     1
df2['col1'] = [1]
print(df2)
   col1
0     1
print(np.multiply(df1['col1'], df2['col1']))
0   NaN
1   NaN
2   NaN
As you can see, the multiplication is aligned on the index.
So you need something like this:
np.multiply(df1['col1'],df2['col1'].values)
or
df1['col1']*df2['col1'].values
Output:
1    1
2    1
Name: col1, dtype: int64
As you can see, now only the index of the df1['col1'] series is used.
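The same trick can be written with the newer .to_numpy() accessor, which strips the index just like .values does:
df1['col1'] * df2['col1'].to_numpy()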
Hi excelguy,
Is there a reason why you can't use the simple column multiplication?
df['C'] = df['A'] * df['B']
As was pointed out, multiplications of two series are based on their indices and it's likely that your fx_rate series does not have the same indices as the pnlValue series.
But since your fx_rate is only one value, I suggest multiplying your dataframe with a scalar instead:
fx_rate = df_fxrate['fx_rate'].iloc[0]
combineQueryandBookFiltered['pnlValue'] = combineQueryandBookFiltered['pnlValue'] * fx_rate
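A minimal self-contained sketch of the scalar approach (column names taken from the question; the numbers are made up):
import pandas as pd

combineQueryandBookFiltered = pd.DataFrame({'pnlValue': [100.0, 250.0, -40.0]})
df_fxrate = pd.DataFrame({'fx_rate': [1.25]})

fx_rate = df_fxrate['fx_rate'].iloc[0]  # pull the single value out as a scalar
combineQueryandBookFiltered['pnlValue'] = combineQueryandBookFiltered['pnlValue'] * fx_rate
print(combineQueryandBookFiltered)
#    pnlValue
# 0     125.0
# 1     312.5
# 2     -50.0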
I want to count the unique values of column B for each unique value in column A, but only where the corresponding column C is > 0.
df:
A   B  C
1  10  0
1  12  3
2   3  1
I tried this, but it's missing the where clause to filter for C > 0. How do I add it?
df.groupby(['A'])['B'].apply(lambda b : b.astype(int).nunique())
Let's first start by creating the dataframe that OP mentions in the question
import pandas as pd
df = pd.DataFrame({'A': [1,1,2], 'B': [10,12,3], 'C': [0,3,1]})
Now, in order to achieve what OP wants, there are various options. One way is to select the rows of df where column C is greater than 0, then use pandas.DataFrame.groupby to group by column A, and finally use nunique to count the unique values of column B. In one line it would look like the following:
count = df[df['C'] > 0].groupby('A')['B'].nunique()
[Out]:
A
1 1
2 1
If one wants to sum the number of unique items that satisfy the condition, starting from count above, one can simply do
count = count.sum()
[Out]:
2
Assuming one wants to do everything in one line, one can chain .sum() onto the expression above, as
count = df[df['C'] > 0].groupby('A')['B'].nunique().sum()
[Out]:
2
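For what it's worth, the same filter can be expressed with DataFrame.query, which some people find closer to the SQL-style where clause (purely a style choice):
count = df.query('C > 0').groupby('A')['B'].nunique().sum()
print(count)  # 2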
Hi, I want to get the counts of unique values of a dataframe column. value_counts implements this, however I want to use its output somewhere else. How can I convert the .value_counts output to a pandas dataframe? Here is an example code:
import pandas as pd
df = pd.DataFrame({'a':[1, 1, 2, 2, 2]})
value_counts = df['a'].value_counts(dropna=True, sort=True)
print(value_counts)
print(type(value_counts))
output is:
2 3
1 2
Name: a, dtype: int64
<class 'pandas.core.series.Series'>
What I need is a dataframe like this:
unique_values counts
2 3
1 2
Thank you.
Use rename_axis to name the column created from the index, then reset_index:
df = value_counts.rename_axis('unique_values').reset_index(name='counts')
print (df)
unique_values counts
0 2 3
1 1 2
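As a side note, in recent pandas versions (2.0 and later, if memory serves) value_counts() already names the result 'count' and the index after the original column, so a plain reset_index gets you most of the way there; renaming afterwards is then optional:
df_counts = value_counts.reset_index()           # columns 'a' and 'count' on pandas >= 2.0
df_counts.columns = ['unique_values', 'counts']  # rename to the headers requested above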
Or if you need a one-column DataFrame, use Series.to_frame:
df = value_counts.rename_axis('unique_values').to_frame('counts')
print (df)
counts
unique_values
2 3
1 2
I just ran into the same problem, so I'll share my thoughts here.
Warning
When you deal with pandas data structures, you have to be aware of the return type.
Another solution here
As @jezrael mentioned before, pandas does provide the pd.Series.to_frame API.
Step 1
You can also wrap the pd.Series in a pd.DataFrame by just doing:
df_val_counts = pd.DataFrame(value_counts) # wrap pd.Series to pd.DataFrame
Then you have a pd.DataFrame with column name 'a', and the unique values become the index:
Input: print(df_val_counts.index.values)
Output: [2 1]
Input: print(df_val_counts.columns)
Output: Index(['a'], dtype='object')
Step 2
What now?
If you want to give the columns new names, you can simply reset the index with reset_index().
Then change the column names by assigning a list to df.columns:
df_val_counts = df_val_counts.reset_index()
df_val_counts.columns = ['unique_values', 'counts']
Then you get what you need:
Output:
unique_values counts
0 2 3
1 1 2
Full Answer here
import pandas as pd
df = pd.DataFrame({'a':[1, 1, 2, 2, 2]})
value_counts = df['a'].value_counts(dropna=True, sort=True)
# solution here
df_val_counts = pd.DataFrame(value_counts)
df_value_counts_reset = df_val_counts.reset_index()
df_value_counts_reset.columns = ['unique_values', 'counts'] # change column names
I'll throw my hat in as well; essentially the same as @wy-hsu's solution, but in function form:
def value_counts_df(df, col):
    """
    Returns pd.value_counts() as a DataFrame

    Parameters
    ----------
    df : Pandas Dataframe
        Dataframe on which to run value_counts(), must have column `col`.
    col : str
        Name of column in `df` for which to generate counts

    Returns
    -------
    Pandas Dataframe
        Returned dataframe will have a single column named "count" which contains the value_counts()
        for each unique value of df[col]. The index name of this dataframe is `col`.

    Example
    -------
    >>> value_counts_df(pd.DataFrame({'a': [1, 1, 2, 2, 2]}), 'a')
       count
    a
    2      3
    1      2
    """
    df = pd.DataFrame(df[col].value_counts())
    df.index.name = col
    df.columns = ['count']
    return df
pd.DataFrame(
    df.groupby(['groupby_col'])['column_to_perform_value_count'].value_counts()
).rename(
    columns={'old_column_name': 'new_column_name'}
).reset_index()
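A concrete, made-up example of the same pattern (the column names city and fruit are invented for illustration; the rename handles the fact that the counts column produced by value_counts is named differently across pandas versions):
import pandas as pd

sales = pd.DataFrame({'city': ['NY', 'NY', 'LA'], 'fruit': ['apple', 'apple', 'pear']})
out = pd.DataFrame(
    sales.groupby(['city'])['fruit'].value_counts()
).rename(
    columns={'fruit': 'count'}  # older pandas names the counts column 'fruit'; newer uses 'count'
).reset_index()
print(out)
#   city  fruit  count
# 0   LA   pear      1
# 1   NY  apple      2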
Example of selecting a subset of columns from a dataframe, grouping, applying value_counts per group, naming the value_counts column Count, and displaying the first n groups.
# Select 5 columns (A..E) from a dataframe (data_df).
# Sort on A,B. groupby B. Display first 3 groups.
df = data_df[['A','B','C','D','E']].sort_values(['A','B'])
g = df.groupby(['B'])
for n, (k, gg) in enumerate(list(g)[:3]):  # display first 3 groups
    display(k, gg.value_counts().to_frame('Count').reset_index())
I have a pandas dataframe as below. How can I drop any column which is a subset of any of the remaining columns? I would like to do this without using fillna.
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 3, 3], [np.nan, 2, np.nan, 4]], columns=['A', 'B', 'C', 'D'])
df
     A  B    C  D
0  1.0  1  3.0  3
1  NaN  2  NaN  4
I can identify here that column A is subset of B and column C is a subset of D with something like this:
all(df['A'][df['A'].notnull()].isin(df['B']))
I could run a loop over all columns and drop the subset columns. But is there a more efficient way to accomplish this, so that I have the following result:
df
B D
0 1 3
1 2 4
Thanks.
It still requires iteration, but you can use this list comprehension (with an if statement similar to the one you provided) to get columns to keep:
keep_cols = [x for x in df if not any(df.drop(x, axis=1).apply(lambda y: df[x].dropna().isin(y).all()))]
# ['B', 'D']
And then use the result with filter:
df.filter(items=keep_cols)
# B D
# 0 1 3
# 1 2 4
This should be fast enough, since it still uses apply at its core, and seems to be safer/more efficient than dropping columns within a loop.
If you're keen on a one-line solution, of course assigning the list to a variable is an optional step:
df.filter(items=[x for x in df if not any(df.drop(x, axis=1).apply(lambda y: df[x].dropna().isin(y).all()))])
Suppose I have a Pandas DataFrame called df with columns a and b and what I want is the number of distinct values of b per each a. I would do:
distcounts = df.groupby('a')['b'].nunique()
which gives the desired result, but it is a Series object rather than another DataFrame. I'd like a DataFrame instead. In regular SQL, I'd do:
SELECT a, COUNT(DISTINCT b) FROM df GROUP BY a
and haven't been able to emulate this query in Pandas exactly. How to?
I think you need reset_index:
distcounts = df.groupby('a')['b'].nunique().reset_index()
Sample:
df = pd.DataFrame({'a': [7, 8, 8],
                   'b': [4, 5, 6]})
print (df)
a b
0 7 4
1 8 5
2 8 6
distcounts = df.groupby('a')['b'].nunique().reset_index()
print (distcounts)
a b
0 7 1
1 8 2
Another alternative using Groupby.agg instead:
df.groupby('a', as_index=False).agg({'b': 'nunique'})
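If you also want the count column to carry a meaningful name, mirroring the SQL alias, named aggregation works as well (using the same sample df):
distcounts = df.groupby('a', as_index=False).agg(distinct_b=('b', 'nunique'))
print(distcounts)
#    a  distinct_b
# 0  7           1
# 1  8           2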