I know that we can get normalized values from value_counts() on a pandas Series, but when we do a groupby on a DataFrame, the only way to get counts is through size(). Is there any way to get normalized values with size()?
Example:
import pandas as pd

df = pd.DataFrame({'subset_product': ['A','A','A','B','B','C','C'],
                   'subset_close': [1,1,0,1,1,1,0]})
df2 = df.groupby(['subset_product', 'subset_close']).size().reset_index(name='prod_count')
df.subset_product.value_counts()
A 3
B 2
C 2
df2
  subset_product  subset_close  prod_count
0              A             0           1
1              A             1           2
2              B             1           2
3              C             0           1
4              C             1           1
Looking to get:
subset_product  subset_close  prod_count  norm
A               0             1           1/3
A               1             2           2/3
B               1             2           2/2
C               1             1           1/2
C               0             1           1/2
Besides manually calculating the normalized values as prod_count/total, is there any way to get normalized values?
I think this is not possible with only one groupby + size, because the groupby is by two columns, subset_product and subset_close, while the size needed for normalization is by subset_product only.
Possible solutions are map or transform to build a Series the same length as df2, then divide with div:
df2 = df.groupby(['subset_product', 'subset_close']).size().reset_index(name='prod_count')
s = df.subset_product.value_counts()
df2['prod_count'] = df2['prod_count'].div(df2['subset_product'].map(s))
Or:
df2 = df.groupby(['subset_product', 'subset_close']).size().reset_index(name='prod_count')
a = df2.groupby('subset_product')['prod_count'].transform('sum')
df2['prod_count'] = df2['prod_count'].div(a)
print (df2)
subset_product subset_close prod_count
0 A 0 0.333333
1 A 1 0.666667
2 B 1 1.000000
3 C 0 0.500000
4 C 1 0.500000
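Alternatively, on reasonably recent pandas versions SeriesGroupBy.value_counts accepts normalize=True, which yields the per-group shares in one step; a minimal sketch (it gives the shares but not the prod_count column, and the name norm is my choice):
# value_counts on the grouped Series normalizes within each subset_product group
norm = (df.groupby('subset_product')['subset_close']
          .value_counts(normalize=True)      # fractions per subset_product
          .reset_index(name='norm'))
print (norm)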
A related question:
Let's assume the input dataset:
test1 = [[0,7,50], [0,3,51], [0,3,45], [1,5,50],[1,0,50],[2,6,50]]
df_test = pd.DataFrame(test1, columns=['A','B','C'])
that corresponds to:
A B C
0 0 7 50
1 0 3 51
2 0 3 45
3 1 5 50
4 1 0 50
5 2 6 50
I would like to obtain a dataset grouped by 'A', together with the most common value for 'B' in each group, and the number of occurrences of that value:
A most_freq freq
0 3 2
1 5 1
2 6 1
I can obtain the first 2 columns with:
grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())
but I am having problems with the last column.
Also: is there a faster way that doesn't involve apply? This solution doesn't scale well with larger inputs (I also tried dask).
Thanks a lot!
Use SeriesGroupBy.value_counts, which sorts by default, then add DataFrame.drop_duplicates to keep the top value per group after Series.reset_index:
df = (df_test.groupby('A')['B']
             .value_counts()
             .rename_axis(['A','most_freq'])
             .reset_index(name='freq')
             .drop_duplicates('A'))
print (df)
A most_freq freq
0 0 3 2
2 1 0 1
4 2 6 1
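If you want to avoid apply entirely on large inputs, a roughly equivalent sketch counts the (A, B) pairs with size and keeps the largest count per A (the names out and most_freq are my choice; ties within a group are resolved arbitrarily, just as with value_counts):
out = (df_test.groupby(['A', 'B']).size()       # count each (A, B) pair
              .reset_index(name='freq')
              .sort_values(['A', 'freq'], ascending=[True, False])
              .drop_duplicates('A')             # keep the largest count per A
              .rename(columns={'B': 'most_freq'}))
print (out)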
I have a classification task, so conflicts harm the performance, i.e. rows with the same feature but a different label.
idx feature label
0 a 0
1 a 1
2 b 0
3 c 1
4 a 0
5 b 0
How could I get a dataframe formatted as below, keeping only the features that always appear with a single label?
idx feature label
2 b 0
3 c 1
5 b 0
DataFrame.duplicated() only flags duplicated rows; it seems logical operations between df["feature"].duplicated() and df.duplicated() do not return the result I want.
I think you need the rows whose feature group has only one unique label - so use GroupBy.transform with SeriesGroupBy.nunique, compare with 1, and filter with boolean indexing:
df = df[df.groupby('feature')['label'].transform('nunique').eq(1)]
print (df)
idx feature label
2 2 b 0
3 3 c 1
5 5 b 0
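An equivalent but usually slower alternative, if readability matters more than speed, is GroupBy.filter; a sketch starting from the original df (df_clean is my name):
# calls the lambda once per feature group, keeping groups whose label has exactly one unique value
df_clean = df.groupby('feature').filter(lambda g: g['label'].nunique() == 1)
print (df_clean)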
I have a dataframe composed by the following table:
A B C D
A1 5 3 4
A1 8 1 0
A2 1 1 0
A2 1 9 1
A2 1 3 1
A3 0 4 7
...
I need to group the data by the 'A' label, then check whether the sum of the 'B' column for each label is larger than 10. If it is larger than 10, perform an operation that involves subtracting 'C' and 'D'. Finally, I need to drop all rows belonging to the 'A' labels for which the sum is not larger than 10. I am trying to use the groupby method, but I am not sure this is the right way to go. So far I have grouped everything with df.groupby('A')['B'].sum() to get one sum per label, which lets me check the condition against the threshold of 10. But how do I then apply the subtraction between columns C and D and also drop the irrelevant rows?
Use GroupBy.transform with 'sum' to get a Series of aggregated values aligned with the original rows, filter the rows where it is greater than 10 using Series.gt with boolean indexing, and then subtract the columns:
df = df[df.groupby('A')['B'].transform('sum').gt(10)].copy()
df['E'] = df['C'].sub(df['D'])
print (df)
A B C D E
0 A1 5 3 4 -1
1 A1 8 1 0 1
A similar idea, if you also need the sum as a column:
df['sum'] = df.groupby('A')['B'].transform('sum')
df['E'] = df['C'].sub(df['D'])
df = df[df['sum'].gt(10)].copy()
print (df)
A B C D sum E
0 A1 5 3 4 13 -1
1 A1 8 1 0 13 1
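The same filtering can also be written with GroupBy.filter; a sketch starting from the original df (typically slower than transform on large frames, but arguably easier to read):
# keep only the 'A' groups whose 'B' sum exceeds 10, then subtract the columns
df_big = df.groupby('A').filter(lambda g: g['B'].sum() > 10).copy()
df_big['E'] = df_big['C'] - df_big['D']
print (df_big)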
The actual dataframe consists of more than a million rows.
Say for example a dataframe is:
UniqueID Code Value OtherData
1 A 5 Z01
1 B 6 Z02
1 C 7 Z03
2 A 10 Z11
2 B 11 Z24
2 C 12 Z23
3 A 10 Z21
4 B 8 Z10
I want to obtain the ratio A/B for each UniqueID and put it in a new dataframe. For example, for UniqueID 1, the ratio A/B = 5/6.
What is the most efficient way to do this in Python?
Want:
UniqueID RatioAB
1 5/6
2 10/11
3 Inf
4 0
Thank you.
One approach is using pivot_table, aggregating with sum in case there are multiple occurrences of the same letter (otherwise a simple pivot will do), and evaluating A/B on the result:
df.pivot_table(index='UniqueID', columns='Code', values='Value', aggfunc='sum').eval('A/B')
UniqueID
1 0.833333
2 0.909091
3 NaN
4 NaN
dtype: float64
If there is maximum one occurrence of each letter per group:
df.pivot(index='UniqueID', columns='Code', values='Value').eval('A/B')
UniqueID
1 0.833333
2 0.909091
3 NaN
4 NaN
dtype: float64
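Note that the NaN results above differ from the Inf and 0 in the requested output. If a missing code should count as 0, a sketch that reproduces those values by filling before dividing (wide and ratio are my names):
wide = (df.pivot_table(index='UniqueID', columns='Code',
                       values='Value', aggfunc='sum')
          .fillna(0))                 # treat a missing letter as 0
ratio = wide['A'] / wide['B']         # 10/0 -> inf for ID 3, 0/8 -> 0 for ID 4
print (ratio)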
If you only care about A/B ratio:
df1 = df[df['Code'].isin(['A','B'])][['UniqueID', 'Code', 'Value']]
df1 = df1.pivot(index='UniqueID',
                columns='Code',
                values='Value')
df1['RatioAB'] = df1['A']/df1['B']
The most obvious way is via groupby (note that this assumes every UniqueID has both an 'A' and a 'B' row; otherwise iloc[0] raises an IndexError, as it would for IDs 3 and 4 above):
df.groupby('UniqueID').apply(
    lambda g: g.query("Code == 'A'")['Value'].iloc[0]
              / g.query("Code == 'B'")['Value'].iloc[0])
I have two DataFrames, one looks something like this:
df1:
x y Counts
a b 1
a c 3
b c 2
c d 1
The other one has the list of unique values from the first two columns as both its index and its columns:
df2
a b c d
a
b
c
d
What I would like to do is fill in the second DataFrame with values from the first one, where the intersection of an index and a column corresponds to a row of the first DataFrame, e.g.:
a b c d
a 0 1 3 0
b 1 0 2 0
c 3 2 0 1
d 0 0 1 0
When I try to use two for loops with a double if-condition, the computation hangs (a real DataFrame contains more than 1000 rows).
The piece of code I am trying to implement (and which is apparently too 'heavy' to compute):
for i in df2.index:
    for j in df2.columns:
        if (i==df1.x.any() and j==df1.y.any()):
            df2.loc[i,j]=df1.Counts
Important to note: the list of unique values (i.e., the index and columns of the second DataFrame) can be longer than the number of rows of the first DataFrame; in my example they happen to coincide.
If it is of any relevance, the first dataframe basically represents combinations of words from the first and second columns and their occurrences in the text. The occurrences are essentially the edge weights.
So I am trying to create a matrix in order to plot a graph via igraph. I chose to first create a DataFrame and then pass its values, as an array, to igraph.
As far as I understand, python-igraph cannot build a graph from a dataframe, only from a numpy array.
I tried some of the solutions suggested for similar issues, but nothing has worked so far.
Any suggestions to improve my question are warmly welcomed (it's my first question here).
You can do something like this:
import pandas as pd
#df = pd.read_clipboard()   # df here is the edge list (the question's df1)
#df2 = df.copy()
df3 = df2.pivot(index='x', columns='y', values='Counts')
print(df3)
print()
new = sorted(set(df3.columns.tolist() + df3.index.tolist()))
df3 = df3.reindex(new, columns=new).fillna(0).applymap(int)
print(df3)
output:
y b c d
x
a 1.0 3.0 NaN
b NaN 2.0 NaN
c NaN NaN 1.0
y a b c d
x
a 0 1 3 0
b 0 0 2 0
c 0 0 0 1
d 0 0 0 0
Another option: stack df2 and fill the missing values from df1:
import numpy as np

idx = pd.Index(np.unique(df1[['x', 'y']]))
df2 = pd.DataFrame(index=idx, columns=idx)
df2.stack(dropna=False).fillna(df1.set_index(['x', 'y']).Counts) \
   .unstack().fillna(0).astype(int)
a b c d
a 0 1 3 0
b 0 0 2 0
c 0 0 0 1
d 0 0 0 0
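Both outputs above fill only one side of the matrix, while the expected output in the question is symmetric (the word pairs are undirected). A sketch that symmetrizes the zero-filled square result from either answer (called mat here) before handing it to igraph:
# mat is the reindexed, zero-filled square DataFrame (df3 in the first answer,
# the unstacked result in the second)
mat_sym = mat + mat.T          # symmetric weight matrix for an undirected graph
adjacency = mat_sym.values     # numpy array that python-igraph can consume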