I have the following df:
Doc Item
1 1
1 1
1 2
1 3
2 1
2 2
I want to add a third column whose values (1) increment by one whenever the value in column "Item" changes and (2) restart whenever the value in column "Doc" changes:
Doc Item NewCol
1 1 1
1 1 1
1 2 2
1 3 3
2 1 1
2 2 2
What is the best way to achieve this?
Thanks a lot.
Use GroupBy.transform with a custom lambda function that calls factorize:
df['NewCol'] = df.groupby('Doc')['Item'].transform(lambda x: pd.factorize(x)[0]) + 1
print (df)
Doc Item NewCol
0 1 1 1
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 1
5 2 2 2
If the values in Item are integers, it is also possible to use GroupBy.rank:
df['NewCol'] = df.groupby('Doc')['Item'].rank(method='dense').astype(int)
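The two approaches agree on the sample data because Item is already sorted within each Doc. A minimal sketch, assuming the frame from the question plus a hypothetical unsorted frame named df_unsorted (not part of the original post), shows where they differ: factorize numbers values by order of first appearance, while a dense rank numbers them by sort order.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"Doc": [1, 1, 1, 1, 2, 2], "Item": [1, 1, 2, 3, 1, 2]})

# Numbering by first appearance within each Doc
by_appearance = df.groupby("Doc")["Item"].transform(lambda x: pd.factorize(x)[0]) + 1
# Numbering by sort order within each Doc
by_rank = df.groupby("Doc")["Item"].rank(method="dense").astype(int)

# Hypothetical unsorted data (an assumption, for illustration only)
df_unsorted = pd.DataFrame({"Doc": [1, 1, 1], "Item": [3, 1, 3]})
unsorted_appearance = (
    df_unsorted.groupby("Doc")["Item"].transform(lambda x: pd.factorize(x)[0]) + 1
)
unsorted_rank = df_unsorted.groupby("Doc")["Item"].rank(method="dense").astype(int)

print(by_appearance.tolist(), by_rank.tolist())
print(unsorted_appearance.tolist(), unsorted_rank.tolist())
```

Which one is appropriate depends on whether the new column should follow the order of appearance or the sorted order of Item.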
I have two different dataframes in pandas.
First
A B C D VALUE
1 2 3 5 0
1 5 3 2 0
2 5 3 2 0
Second
A B C D Value
5 3 3 2 1
1 5 4 3 1
I want the values of columns A and B in the first dataframe to be looked up in the second dataframe. If both A and B match, the VALUE column should be updated. Only these two columns should be searched in the other dataframe, and only one column should be updated. It is essentially the update-by-join process we know from SQL.
Result
A B C D VALUE
1 2 3 5 0
1 5 3 2 1
2 5 3 2 0
If you focus on the bold text, you can understand it more easily. Despite my attempts, I could not succeed: I want only one column to change, but my attempts also change A and B. Only the VALUE column of the matching rows should change.
You can use a merge:
cols = ['A', 'B']
df1['VALUE'] = (df2.merge(df1[cols], on=cols, how='right')
['Value'].fillna(df1['VALUE'], downcast='infer')
)
output:
A B C D VALUE
0 1 2 3 5 0
1 1 5 3 2 1
2 2 5 3 2 0
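A self-contained variant of the same idea, as a sketch only: it avoids the downcast argument of fillna (deprecated in recent pandas releases) by casting back to the original dtype, and it assumes that each (A, B) pair appears at most once in df2, since duplicates would make the merge return more rows than df1 has. The variable name lookup is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 1, 2], "B": [2, 5, 5], "C": [3, 3, 3],
                    "D": [5, 2, 2], "VALUE": [0, 0, 0]})
df2 = pd.DataFrame({"A": [5, 1], "B": [3, 5], "C": [3, 4],
                    "D": [2, 3], "Value": [1, 1]})

cols = ["A", "B"]
# Left merge keeps every row of df1 in its original order;
# rows without a match in df2 get NaN in "Value".
lookup = df1[cols].merge(df2[cols + ["Value"]], on=cols, how="left")["Value"].to_numpy()

# Re-attach df1's index, fall back to the old VALUE where nothing matched,
# then restore the original integer dtype.
df1["VALUE"] = (
    pd.Series(lookup, index=df1.index)
    .fillna(df1["VALUE"])
    .astype(df1["VALUE"].dtype)
)
print(df1)
```

Only VALUE is assigned, so columns A, B, C and D of df1 are left untouched.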
I have a dataframe like this, where the codes column is currently strings.
Station Codes
1 1,2
1 1
2 1
2 2,5
2 2,3
3 1
I want to see the count of each code, ordered by station. I have tried to use the explode function, but by default every string that contains only one number ends up as NaN. The desired output is:
Station Codes Count
1 1 2
1 2 1
2 1 1
2 2 2
2 3 1
2 5 1
3 1 1
print(
    df.assign(Codes=df.Codes.str.split(","))
    .explode("Codes")
    .groupby(["Station", "Codes"], as_index=False)
    .size()
    .rename(columns={"size": "Count"})
)
Prints:
Station Codes Count
0 1 1 2
1 1 2 1
2 2 1 1
3 2 2 2
4 2 3 1
5 2 5 1
6 3 1 1
df['Codes'] = df['Codes'].str.split(',')
df.explode('Codes').groupby('Station')['Codes'].value_counts().reset_index(name='Count')
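One possible cause of the NaN values mentioned in the question (an assumption, since the post does not show the column's dtype) is that some entries in Codes are integers rather than strings; the .str accessor returns NaN for non-string elements of an object column. The sketch below builds a hypothetical mixed-type frame and casts to str before splitting.

```python
import pandas as pd

# Hypothetical frame where single codes were parsed as integers (assumption)
df = pd.DataFrame({
    "Station": [1, 1, 2, 2, 2, 3],
    "Codes": ["1,2", 1, 1, "2,5", "2,3", 1],
})

counts = (
    df.assign(Codes=df["Codes"].astype(str).str.split(","))  # cast first, then split
    .explode("Codes")
    .groupby(["Station", "Codes"], as_index=False)
    .size()
    .rename(columns={"size": "Count"})
)
print(counts)
```

Note that the codes in the result are strings; they can be converted with astype(int) afterwards if numeric codes are needed.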
I'm trying to count the occurrences of some values in a data frame like
user_id event_type
1 a
1 a
1 b
2 a
2 b
2 c
2 c
and I want to get a table like
user_id event_type event_type_count
1 a 2
1 a 2
1 b 1
2 a 1
2 b 1
2 c 2
2 c 2
In other words, I want to add the count of each value next to the value in the data frame.
I've tried to use df.join(pd.crosstab)..., but I get a large data frame with many columns.
What is a better way to solve this problem?
Use GroupBy.transform with 'size', grouping by both columns:
df['event_type_count'] = df.groupby(['user_id','event_type'])['event_type'].transform('size')
print (df)
user_id event_type event_type_count
0 1 a 2
1 1 a 2
2 1 b 1
3 2 a 1
4 2 b 1
5 2 c 2
6 2 c 2
I have a dataframe with many attributes. I want to assign an id to each unique combination of these attributes.
Assume this is my df:
df = pd.DataFrame(np.random.randint(1,3, size=(10, 3)), columns=list('ABC'))
A B C
0 2 1 1
1 1 1 1
2 1 1 1
3 2 2 2
4 1 2 2
5 1 2 1
6 1 2 2
7 1 2 1
8 1 2 2
9 2 2 1
Now, I need to append a new column with an id for each unique combination. The id has to be 0 if the combination occurs only once. In this case:
A B C unique_combination
0 2 1 1 0
1 1 1 1 1
2 1 1 1 1
3 2 2 2 0
4 1 2 2 2
5 1 2 1 3
6 1 2 2 2
7 1 2 1 3
8 1 2 2 2
9 2 2 1 0
My first approach was to use a for loop and check, for every row, whether the row's combination of values occurs more than once in the dataframe, using .query:
unique_combination = 1  # acts as a counter
df['unique_combination'] = 0
for idx, row in df.iterrows():
    if len(df.query('A == @row.A & B == @row.B & C == @row.C')) > 1:
        # check, if one occurrence of the combination already has a value > 0???
        df.loc[idx, 'unique_combination'] = unique_combination
        unique_combination += 1
However, I have no idea how to check whether an ID has already been assigned to a combination (see the comment in the code). Additionally, my approach feels very slow and hacky (I have over 15000 rows). Do you data wranglers see a different approach to my problem?
Thank you very much!
Step 1: Assign a new column filled with 0, i.e.
df['new'] = 0
Step 2: Create a mask for the combinations that occur more than once, i.e.
mask = df.groupby(['A','B','C'])['new'].transform(lambda x : len(x)>1)
Step 3: Assign ids to the masked rows by factorizing the combinations, i.e.
df.loc[mask,'new'] = df.loc[mask,['A','B','C']].astype(str).sum(1).factorize()[0] + 1
# or
# df.loc[mask,'new'] = df.loc[mask,['A','B','C']].groupby(['A','B','C']).ngroup()+1
Output:
A B C new
0 2 1 1 0
1 1 1 1 1
2 1 1 1 1
3 2 2 2 0
4 1 2 2 2
5 1 2 1 3
6 1 2 2 2
7 1 2 1 3
8 1 2 2 2
9 2 2 1 0
A new feature added in Pandas version 0.20.2 creates a column of unique ids automatically for you.
df['unique_id'] = df.groupby(['A', 'B', 'C']).ngroup()
gives the following output
A B C unique_id
0 2 1 2 3
1 2 2 1 4
2 1 2 1 1
3 1 2 2 2
4 1 1 1 0
5 1 2 1 1
6 1 1 1 0
7 2 2 2 5
8 1 2 2 2
9 1 2 2 2
The groups are given ids based on the order they would be iterated over.
See the documentation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#enumerate-groups
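The two answers can be combined. The following is a sketch rather than either author's code: it assumes the sample frame from the question, uses transform('size') for the mask, and passes sort=False to groupby so that ngroup numbers the combinations in order of first appearance, which reproduces the ids requested in the question.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "A": [2, 1, 1, 2, 1, 1, 1, 1, 1, 2],
    "B": [1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
    "C": [1, 1, 1, 2, 2, 1, 2, 1, 2, 1],
})
keys = ["A", "B", "C"]

# True for rows whose combination occurs more than once
mask = df.groupby(keys)["A"].transform("size") > 1

# Combinations that occur once keep the id 0
df["unique_combination"] = 0
# sort=False numbers groups by first appearance instead of sorted key order
df.loc[mask, "unique_combination"] = df[mask].groupby(keys, sort=False).ngroup() + 1

print(df)
```

With the default sort=True the ids would follow the sorted order of the key values, so (1, 2, 1) would be numbered before (1, 2, 2).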
I have the following dataset:
id window Rank member
1 2 2 0
1 3 2 0
2 3 1 0
2 2 1 0
I want to set member equal to Rank where window == 3. To do that, I have the following command:
df["member"]= df[df['window']==3]['Rank'][0]
However, I want to do that in a groupby statement, grouping on id. The command below returns an error. It is probably a simple thing that I am missing, but I cannot figure out how to use groupby in the above command. Any help is greatly appreciated.
df["member"]= df.groupby("id")[df[df['window']==3]['Rank'][0]]
You can achieve this by using pandas.Series.where -
df = pd.DataFrame({'id':[1,1,2,2],'window':[2,3,3,2],'Rank':[2,2,1,1],'member':[0,0,0,0]})
=>
Rank id member window
0 2 1 0 2
1 2 1 0 3
2 1 2 0 3
3 1 2 0 2
df['member'] = df['Rank'].where(df['window']==3, df['member'])
print(df)
=>
Rank id member window
0 2 1 0 2
1 2 1 2 3
2 1 2 1 3
3 1 2 0 2
You can use numpy.where or DataFrame.loc:
df['member'] = np.where(df['window']==3, df['Rank'], df['member'])
print (df)
id window Rank member
0 1 2 2 0
1 1 3 2 2
2 2 3 1 1
3 2 2 1 0
df.loc[df['window']==3, 'member'] = df['Rank']
print (df)
id window Rank member
0 1 2 2 0
1 1 3 2 2
2 2 3 1 1
3 2 2 1 0
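The answers above update only the rows where window == 3. If the intent of grouping on id is instead to give every row of an id the Rank found on that id's window == 3 row (one possible reading of the question, assumed here), a sketch could look like this. The name rank_at_3 and the fallback value 0 for ids without such a row are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "window": [2, 3, 3, 2],
    "Rank": [2, 2, 1, 1],
    "member": [0, 0, 0, 0],
})

# Keep Rank only on the window == 3 rows; all other rows become NaN
rank_at_3 = df["Rank"].where(df["window"] == 3)

# Broadcast the first non-null value within each id to all rows of that id;
# ids without a window == 3 row fall back to 0 (assumption)
df["member"] = rank_at_3.groupby(df["id"]).transform("first").fillna(0).astype(int)

print(df)
```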