Assign the frequency of each value to dataframe with new column - python

I am trying to set up a DataFrame that contains a column called Frequency.
For every row, this column should show how often that row's value occurs in a specific column of the DataFrame. Something like this:
Index Category Frequency
0 1 1
1 3 2
2 3 2
3 4 1
4 7 3
5 7 3
6 7 3
7 8 1
This is just an example.
I already tried value_counts(), but I only get a value in the last row in which each number appears. For the example above:
Index Category Frequency
0 1 1
1 3 N.A
2 3 2
3 4 1
4 7 N.A
5 7 N.A
6 7 3
7 8 1
It is very important that the new column has the same number of rows as the DataFrame, and preferably it should be appended to the same DataFrame.
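For reference, a minimal reconstruction of the example frame (assuming Index is just the default integer index, not a real column):

import pandas as pd

# rebuild the question's example; 'Index' is taken to be the default integer index
df = pd.DataFrame({'Category': [1, 3, 3, 4, 7, 7, 7, 8]})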

# count the rows of each Category group and broadcast the count back to every row
df['Frequency'] = df.groupby('Category')['Category'].transform('count')

Use pandas.Series.map:
df['Frequency'] = df['Category'].map(df['Category'].value_counts())
or pandas.Series.replace:
df['Frequency'] = df['Category'].replace(df['Category'].value_counts())
Output:
Index Category Frequency
0 0 1 1
1 1 3 2
2 2 3 2
3 3 4 1
4 4 7 3
5 5 7 3
6 6 7 3
7 7 8 1
Details
df['Category'].value_counts()
7 3
3 2
4 1
1 1
8 1
Name: Category, dtype: int64
Using value_counts you get a Series whose index holds the Category values and whose values hold their counts. You can then use map or pandas.Series.replace to build a Series in which each Category value is replaced by its count, and finally assign that Series to the Frequency column.

You can also do it with groupby, as below:
df.groupby("Category") \
.apply(lambda g: g.assign(frequency = len(g))) \
.reset_index(level=0, drop=True)
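For comparison, the transform-based approach in the first answer produces the same column without building a new frame per group; an equivalent sketch using 'size', which counts the rows of each group:

df['frequency'] = df.groupby('Category')['Category'].transform('size')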

Related

Maximum value in a given group, increasing row by row

The goal is to put the highest digit seen so far into the new column, increasing row by row within a given group of letters. The expected correct values are the ones I entered manually in the column "col_ok". The only thing I have achieved so far is assigning the overall highest value to each group, shown in the fourth column called "cumulatively". However, for the first row in the "A" group this is not right, because the correct value under the rules described is "1"; the same applies to the values in the second and third rows. Only the value in the fourth row is right, and the value in the fifth row is not. Forgive the inconsistency of my post; I'm not an IT specialist and my English is poor. Thanks in advance for your support.
df = pd.read_csv('C:/Users/.../a.csv', names=['group_letter', 'digit', 'col_ok'],
                 index_col=0)
df = df.assign(cumulatively = df.groupby('group_letter')['col_ok'].transform('max'))
print(df)
group_letter digit col_ok cumulatively
A 1 1 5
A 3 3 5
A 2 2 5
A 5 5 5
A 1 5 5
B 1 1 3
B 2 2 3
B 1 2 3
B 1 2 3
B 3 3 3
C 5 5 6
C 6 6 6
C 1 6 6
C 2 6 6
C 3 6 6
D 4 4 7
D 3 4 7
D 2 4 7
D 5 5 7
D 7 7 7
IIUC use:
df = df.assign(cumulatively = df.groupby('group_letter')['col_ok'].cummax())
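Since the CSV itself isn't available here, a quick standalone check of what cummax does, with values copied from group "A" of the printed frame:

import pandas as pd

# col_ok values within group 'A' of the printed frame
s = pd.Series([1, 3, 2, 5, 5])
print(s.cummax().tolist())  # [1, 3, 3, 5, 5] -- the largest value seen so far, row by row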

How to repeat the cumsum for previous values in a Pandas Series, when the count group is restarted?

I have a Pandas Series that represents a running group count.
How can I create a new Series that carries, in every row, the maximum value of its count group (the count restarts at 1 whenever a new group begins)?
Minimal example:
import pandas as pd
s_count = pd.Series([1,2,3,1,2,3,4,5,1,2,3,4])
Desired:
s_max_count_group = pd.Series([3,3,3,5,5,5,5,5,4,4,4,4])
Print result:
df = pd.DataFrame({
    'counts': s_count,
    'expected': s_max_count_group
})
print(df)
Display:
counts expected
0 1 3
1 2 3
2 3 3
3 1 5
4 2 5
5 3 5
6 4 5
7 5 5
8 1 4
9 2 4
10 3 4
11 4 4
I looked for similar questions and tested some answers, trying the fill, cumsum, diff and mask methods, but with no success so far.
We can identify the individual groups by comparing the count with 1 followed by cumsum, then group the given series on these identified groups and transform using max:
s_count.groupby(s_count.eq(1).cumsum()).transform('max')
0 3
1 3
2 3
3 5
4 5
5 5
6 5
7 5
8 4
9 4
10 4
11 4
dtype: int64
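For clarity, the intermediate grouping key built by eq(1).cumsum() looks like this; every 1 in s_count opens a new group id, so all rows between restarts share one id:

s_count.eq(1).cumsum().tolist()
# [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3]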

How to append a specific string according to each value in a string pandas dataframe column?

Let's take these sample dataframes:
df = pd.DataFrame({'Id':['1','2','3','4','5'], 'Value':[9,8,7,6,5]})
Id Value
0 1 9
1 2 8
2 3 7
3 4 6
4 5 5
df_name = pd.DataFrame({'Id':['1','2','4'], 'Name':['Andrew','Jason','John']})
Id Name
0 1 Andrew
1 2 Jason
2 4 John
I would like to add, in the Id column of df, the Name of the person (taken from df_name) in brackets, when it exists. I know how to do this with a for loop over the Id column of df, but it is inefficient on large dataframes. Is there a better way to do this?
Expected output :
Id Value
0 1 (Andrew) 9
1 2 (Jason) 8
2 3 7
3 4 (John) 6
4 5 5
Use Series.map to match the values and add the parentheses, then replace the non-matched values with the original column via Series.fillna:
df['Id'] = ((df['Id'] + ' (' + df['Id'].map(df_name.set_index('Id')['Name']) + ')')
            .fillna(df['Id']))
print (df)
Id Value
0 1 (Andrew) 9
1 2 (Jason) 8
2 3 7
3 4 (John) 6
4 5 5
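To see why the final fillna is needed, here is a self-contained walk-through of the intermediate step:

import pandas as pd

df = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'], 'Value': [9, 8, 7, 6, 5]})
df_name = pd.DataFrame({'Id': ['1', '2', '4'], 'Name': ['Andrew', 'Jason', 'John']})

# the intermediate map yields NaN where an Id has no match in df_name;
# concatenating a string with NaN stays NaN, so fillna restores those Ids
mapped = df['Id'].map(df_name.set_index('Id')['Name'])
print(mapped.tolist())  # ['Andrew', 'Jason', nan, 'John', nan]

df['Id'] = (df['Id'] + ' (' + mapped + ')').fillna(df['Id'])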

How can I assign other values for corresponding values in a dataframe column

I have a pandas dataframe column that contains different numbers, each occurring with a different frequency. There are 532 unique values in the column, and 59000 values in total.
0 135715
1 138775
2 134915
3 134335
4 134555
5 144995
6 136515
7 135185
8 145555
9 135245
...
How can I replace these values with corresponding values running from 1 to 532? Something like this:
0 1
1 2
2 3
3 4
4 5
5 5
6 5
7 6
8 7
9 1
10 1
11 4
...
I tried np.where() with np.arange(), but it raised an error.
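No answer is recorded here, but a minimal sketch with pd.factorize would cover the described mapping, assuming codes in order of first appearance (starting at 1) are acceptable; the short Series below is a hypothetical stand-in for the real column:

import pandas as pd

# hypothetical stand-in for the 59000-row column with 532 unique values
s = pd.Series([135715, 138775, 134915, 135715, 138775, 144995])

# pd.factorize numbers each distinct value by order of first appearance,
# starting at 0; adding 1 shifts the codes into the range 1..532
codes, uniques = pd.factorize(s)
print(pd.Series(codes + 1).tolist())  # [1, 2, 3, 1, 2, 4]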

Create a new column and assign a value for each group using groupby

I want to create a new column called 'fold' and assign values to it depending on the group of quote_id. Say the first 3 rows share the same quote_id: they should get 1; if the next 4 rows share the same quote_id, they should get 2.
In short, it should assign a number to each distinct group of quote_id.
I have been trying for a long time but I am not getting the expected results.
i = 1
def func(x):
    x['fold'] = i
    return x
in_df.groupby('quote_id').apply(func)
i = i + 1
My output should look like below.
quote_id fold
1300079-DE 1
1300079-DE 1
1300079-DE 1
1300185-DE 2
1300560-DE 3
1301011-DE 4
1301011-DE 4
1301011-DE 4
1301644-DE 5
1301907-DE 6
1301907-DE 6
1301907-DE 6
Call rank with method='dense':
In [10]:
df['fold'] = df['quote_id'].rank(method='dense')
df
Out[10]:
quote_id fold
0 1300079-DE 1
1 1300079-DE 1
2 1300079-DE 1
3 1300185-DE 2
4 1300560-DE 3
5 1301011-DE 4
6 1301011-DE 4
7 1301011-DE 4
8 1301644-DE 5
9 1301907-DE 6
10 1301907-DE 6
11 1301907-DE 6
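Note that Series.rank returns float64; if integer fold values are wanted, a cast can be chained on:

df['fold'] = df['quote_id'].rank(method='dense').astype(int)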
