I have this data and store it in a DataFrame:
data = {'Eventname': ['100m','200m','Discus','100m','200m','Discus'],
'Year': [2030,2030,2031,2030,2031,2032],
'FirstPlace': ['John Smith', 'Shar jean', 'Abi whi', 'mik jon','joh doe', 'John Smith'],
'SecPlace': ['joh doe', 'John Smith', 'Shar jean', 'Hen Hun','Tom Will', 'Gord Jay'],
'thiPlace': ['mik jon', 'Lisa tru', 'John Smith', 'Bret Tun','Tim Smith', 'Jack Mann'] }
df = pd.DataFrame(data)
I want to create a new dataframe whose first column contains the name of every person who occurs in the [first place], [second place], or [third place] columns, without duplicates. The remaining columns should count how many times each name occurs in each of those columns.
I wrote this code,
NewArr=pd.DataFrame()
NewArr['first']=df['FirstPlace'].value_counts()
NewArr['second']=df['SecPlace'].value_counts()
NewArr['third']=df['thiPlace'].value_counts()
The code has these problems:
It only shows the 5 names that occur in the FirstPlace column.
I want the value 0, not NaN, in the result.
I want to add the column title "AthleteName" above the names.
A possible solution:
df1 = df.melt(value_vars=['FirstPlace', 'SecPlace', 'thiPlace'])
pd.crosstab(df1.value, df1.variable).reset_index(
names='Names').rename_axis(None, axis=1)
Output:
Names FirstPlace SecPlace thiPlace
0 Abi whi 1 0 0
1 Bret Tun 0 0 1
2 Gord Jay 0 1 0
3 Hen Hun 0 1 0
4 Jack Mann 0 0 1
5 John Smith 2 1 1
6 Lisa tru 0 0 1
7 Shar jean 1 1 0
8 Tim Smith 0 0 1
9 Tom Will 0 1 0
10 joh doe 1 1 0
11 mik jon 1 0 1
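The asker's value_counts approach can also be made to work. In the original attempt, the first assignment fixes the index of NewArr to the names found in FirstPlace, and the later assignments are aligned to that index, which is why the other names disappear. The following is a minimal sketch, assuming the same data as in the question; the names cols and result are illustrative only.

```python
import pandas as pd

data = {'Eventname': ['100m', '200m', 'Discus', '100m', '200m', 'Discus'],
        'Year': [2030, 2030, 2031, 2030, 2031, 2032],
        'FirstPlace': ['John Smith', 'Shar jean', 'Abi whi', 'mik jon', 'joh doe', 'John Smith'],
        'SecPlace': ['joh doe', 'John Smith', 'Shar jean', 'Hen Hun', 'Tom Will', 'Gord Jay'],
        'thiPlace': ['mik jon', 'Lisa tru', 'John Smith', 'Bret Tun', 'Tim Smith', 'Jack Mann']}
df = pd.DataFrame(data)

cols = ['FirstPlace', 'SecPlace', 'thiPlace']  # illustrative name
result = (
    df[cols]
    .apply(pd.Series.value_counts)  # count each column; rows are the union of all names
    .fillna(0)                      # names missing from a column get 0 instead of NaN
    .astype(int)
    .rename_axis('AthleteName')     # title for the name column
    .reset_index()
)
print(result)
```

Counting all three columns in one apply call lets pandas build the index from every name at once, so no name is dropped.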
Related
I have a data frame with the column name, and I need to create the column seq, which identifies each separate run in which a name appears in the data frame. It is important to preserve the row order.
import pandas as pd
data = {'name': ['Tom', 'Joseph','Joseph','Joseph', 'Tom', 'Tom', 'John','Tom','Tom','John','Joseph']
, 'seq': ['Tom 0', 'Joseph 0','Joseph 0','Joseph 0', 'Tom 1', 'Tom 1', 'John 0','Tom 2','Tom 2','John 1','Joseph 1']}
df = pd.DataFrame(data)
print(df)
name seq
0 Tom Tom 0
1 Joseph Joseph 0
2 Joseph Joseph 0
3 Joseph Joseph 0
4 Tom Tom 1
5 Tom Tom 1
6 John John 0
7 Tom Tom 2
8 Tom Tom 2
9 John John 1
10 Joseph Joseph 1
Create a boolean mask that marks the rows where the name differs from the previous row. Then keep only those rows (dropping the second, third, ... rows of each run) before grouping by name. cumcount numbers the runs of each name, and the result is forward-filled back onto all rows. Finally, concatenate the name and the sequence number.
# Boolean mask
m = df['name'].ne(df['name'].shift())
# Create sequence number
seq = df.loc[m].groupby('name').cumcount().astype(str) \
.reindex(df.index, fill_value=pd.NA).ffill()
# Concatenate name and seq
df['seq'] = df['name'] + ' ' + seq
Output:
>>> df
name seq
0 Tom Tom 0
1 Joseph Joseph 0
2 Joseph Joseph 0
3 Joseph Joseph 0
4 Tom Tom 1
5 Tom Tom 1
6 John John 0
7 Tom Tom 2
8 Tom Tom 2
9 John John 1
10 Joseph Joseph 1
>>> m
0 True
1 True
2 False
3 False
4 True
5 False
6 True
7 True
8 False
9 True
10 True
Name: name, dtype: bool
You need to check where a new run of a name starts and then number the runs of each name using groupby and cumsum. The resulting string Series can be concatenated with str.cat:
df['seq'] = df['name'].str.cat(
df['name'].ne(df['name'].shift()).groupby(df['name']).cumsum().sub(1).astype(str),
sep=' '
)
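To see what the one-liner above does, it can be split into named steps. This is only a sketch on the question's data; the intermediate names starts and run_no are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Tom', 'Joseph', 'Joseph', 'Joseph', 'Tom', 'Tom',
                            'John', 'Tom', 'Tom', 'John', 'Joseph']})

# True on every row where the name differs from the previous row (start of a run)
starts = df['name'].ne(df['name'].shift())

# Within each name, a running total of run starts; subtract 1 to begin at 0
run_no = starts.astype(int).groupby(df['name']).cumsum().sub(1)

# Join name and run number with a space
df['seq'] = df['name'].str.cat(run_no.astype(str), sep=' ')
print(df)
```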
Assuming your data frame is indexed sequentially (0, 1, 2, 3, ...):
Group the data frame by name.
For each group, apply a gap-and-island algorithm: every time the index jumps by more than 1, start a new island.
def sequencer(group):
    idx = group.index.to_series()
    # Every time the index has a gap >1, create a new island
    return idx.diff().ne(1).cumsum().sub(1)
seq = df.groupby('name').apply(sequencer).droplevel(0).rename('seq')
df.merge(seq, left_index=True, right_index=True)
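The same gap-and-island idea can also be written with transform, which returns a result already aligned to the original rows, so no merge is needed. This is a sketch under the same assumption of a default 0, 1, 2, ... index; the names idx and island_no are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Tom', 'Joseph', 'Joseph', 'Joseph', 'Tom', 'Tom',
                            'John', 'Tom', 'Tom', 'John', 'Joseph']})

# Row positions as a Series so they can be grouped by name
idx = df.index.to_series()

# Within each name, a gap other than 1 between consecutive positions starts a new island
island_no = idx.groupby(df['name']).transform(
    lambda s: s.diff().ne(1).cumsum().sub(1)
)

df['seq'] = island_no
print(df)
```

This yields the numeric island per row; concatenating it with the name, as in the other answers, gives the requested "Tom 0" style labels.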
I have the following dataframe df:
names    status
John     Completed
James    To Do
Jill     To Do
Robert   In Progress
Jill     To Do
Jill     To Do
Marina   Completed
Evy      Completed
Evy      Completed
Now I want the count of each type of status for each user. I can get it like this for all types of statuses.
df = pd.crosstab(df.names,df.status).reset_index("names")
So now the resulting df is
status  names   Completed  In Progress  To Do
0       James   0          0            1
1       Robert  0          1            0
2       John    1          0            0
3       Marina  1          0            0
4       Jill    0          0            3
5       Evy     2          0            0
So my problem is: how can I specify that only particular status values are counted? For example, I want only the counts of In Progress and Completed, not To Do. And how can I add an extra column called Total, which holds the total number of rows for each name in the original dataframe?
Desired Dataframe:
status  names   Completed  In Progress  Total
0       James   0          0            1
1       Robert  0          1            1
2       John    1          0            1
3       Marina  1          0            1
4       Jill    0          0            3
5       Evy     2          0            2
Another way:
Pass the margins and margins_name parameters to pd.crosstab():
df=(pd.crosstab(df.names,df.status,margins=True,margins_name='Total').iloc[:-1]
.reset_index().drop(columns='To Do'))
OR
via crosstab()+assign()
df=(pd.crosstab(df.names,df.status).assign(Total=lambda x:x.sum(axis=1))
.reset_index().drop(columns='To Do'))
OR
In 2 steps:
df=pd.crosstab(df.names,df.status)
df=df.assign(Total=df.sum(axis=1)).drop(columns='To Do').reset_index()
You can create the total from the addition of the three previous columns:
df['Total'] = (df['Completed'] + df['In Progress'] + df['To Do'])
Then you can drop the 'to-do' from your new data frame as follows :
df = df.drop(columns=['To Do'])
df = pd.DataFrame({'names': ['John', 'James', 'Jill', 'Robert', 'Jill', 'Jill', 'Marina', 'Evy', 'Evy'],
'status':['Completed', 'To Do', 'To Do', 'In Progress', 'To Do', 'To Do', 'Completed', 'Completed', 'Completed']})
df = pd.crosstab(df.names,df.status).reset_index("names")
df['Total'] = df['Completed'] + df['In Progress'] + df['To Do']
df = df.drop(columns=['To Do'])
print(df)
Output:
status names Completed In Progress Total
0 Evy 2 0 2
1 James 0 0 1
2 Jill 0 0 3
3 John 1 0 1
4 Marina 1 0 1
5 Robert 0 1 1
I can't comprehend what kind of sorting system you are using. But I think you will manage to do that yourself.
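The answers above all drop To Do by name. If the goal is to list the statuses to keep instead, one possible sketch is to select the wanted columns after the crosstab and take the total from the row count per name. The names wanted, counts, and result are illustrative, and the data is the frame from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'names': ['John', 'James', 'Jill', 'Robert', 'Jill', 'Jill', 'Marina', 'Evy', 'Evy'],
    'status': ['Completed', 'To Do', 'To Do', 'In Progress', 'To Do', 'To Do',
               'Completed', 'Completed', 'Completed'],
})

wanted = ['Completed', 'In Progress']  # statuses to keep (illustrative name)

# Count every status first so that names with only unwanted statuses keep a row,
# then keep just the wanted columns (filled with 0 if a status never occurs)
counts = pd.crosstab(df['names'], df['status']).reindex(columns=wanted, fill_value=0)

# Total = number of rows per name in the original frame, aligned on the names index
counts['Total'] = df.groupby('names').size()

result = counts.reset_index().rename_axis(None, axis=1)
print(result)
```

Computing Total from groupby().size() rather than from the sum of the kept columns is what makes Jill show 3 even though none of her rows has a wanted status.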
I have a data frame (df) in Python with 4 columns (ID, Status, Person, Output). Each ID is repeated 6 times and the Output is the same for each ID. For each ID, the Status will be On/Off (3 of each).
I need to generate a new column with a list of people for each unique ID/Status combination. I also need a second new column with a group ID for each unique list of people.
This is my current code which works but is very slow when working with a large data frame due to the apply(list) function. Is there a more efficient way to do this?
people = df.groupby(['ID','Status'])['Person'].apply(list).reset_index(name='Names_ID')
people['Group_ID'] = people['Names_ID'].rank(method='dense')
df = df.drop_duplicates(subset=['ID','Status'])
df = df.merge(people, on = ('ID', 'Status'))
Here is an example input data frame:
df=
ID Status Person Output
0 On John 1
0 On Mark 1
0 On Michael 1
0 Off Peter 1
0 Off Tim 1
0 Off Jake 1
1 On Peter 0.5
1 On Dennis 0.5
1 On Jasper 0.5
1 Off John 0.5
1 Off Mark 0.5
1 Off Michael 0.5
2 On John 2
2 On Mark 2
2 On Larry 2
2 Off Peter 2
2 Off Dennis 2
2 Off Jasper 2
The desired output is:
df =
ID Status People Group_ID Output
0 On [John, Mark, Michael ] 0 1
0 Off [Peter, Tim, Jake ] 1 1
1 On [Peter, Dennis, Jasper ] 2 0.5
1 Off [John, Mark, Michael ] 0 0.5
2 On [John, Mark, Larry ] 3 2
2 Off [Peter, Dennis, Jasper ] 2 2
Try this:
df_out = df.groupby(['ID', 'Status'])['Person'].apply(list).reset_index()
df_out['Group_ID'] = pd.factorize(df_out['Person'].apply(tuple))[0]
df_out
Output:
ID Status Person Group_ID
0 0 Off [Peter, Tim, Jake] 0
1 0 On [John, Mark, Michael] 1
2 1 Off [John, Mark, Michael] 1
3 1 On [Peter, Dennis, Jasper] 2
4 2 Off [Peter, Dennis, Jasper] 2
5 2 On [John, Mark, Larry] 3
OR
df_out = df.groupby(['ID', 'Status'])['Person'].apply(', '.join).reset_index()
df_out['Group_ID'] = pd.factorize(df_out['Person'])[0]
df_out
import pandas as pd
df = pd.read_clipboard()
df
One method is to use shift twice and join the three columns into a list. Then use groupby to figure out the Group_ID and merge it back into the dataframe.
df['Person1'] = df['Person'].shift(-1)
df['Person2'] = df['Person'].shift(-2)
df['People'] = '[' + df['Person'] + ',' + df['Person1'] + ',' + df['Person2'] + ']'
mult_3 = []
for i in df.index:
    if i==0 or i%3 == 0:
        mult_3.append(i)
df = df.loc[df.index.isin(mult_3)].drop(['Person', 'Person1', 'Person2'], axis=1)
df_people = df.groupby('People').Status.count().reset_index().drop(['Status'], axis=1).reset_index()
df = df.merge(df_people, how='left', on='People').rename(columns={'index':'Group_ID'})
df = df[['ID', 'Status', 'People', 'Group_ID', 'Output']]
df
Python 3.7.6 and Pandas 1.0.3: The bottleneck here is probably the apply calls.
people = df.groupby(['ID','Status', "Output"])['Person'].apply(list).reset_index(name = 'People')
people['Group_ID'] = people["People"].apply(str).astype('category').cat.codes
Output:
ID Status Output People Group_ID
0 0 Off 1 [Peter, Tim, Jake] 3
1 0 On 1 [John, Mark, Michael] 1
2 1 Off 0.5 [John, Mark, Michael] 1
3 1 On 0.5 [Peter, Dennis, Jasper] 2
4 2 Off 2 [Peter, Dennis, Jasper] 2
5 2 On 2 [John, Mark, Larry] 0
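None of the outputs above reproduces the Group_ID numbering of the desired output, because groupby sorts its keys by default. A possible sketch that keeps the original row order (sort=False) and numbers the groups in order of first appearance with pd.factorize is shown below. The frame is rebuilt from the example input, and the names out and People are illustrative; no claim is made about its speed on large data.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [0] * 6 + [1] * 6 + [2] * 6,
    'Status': (['On'] * 3 + ['Off'] * 3) * 3,
    'Person': ['John', 'Mark', 'Michael', 'Peter', 'Tim', 'Jake',
               'Peter', 'Dennis', 'Jasper', 'John', 'Mark', 'Michael',
               'John', 'Mark', 'Larry', 'Peter', 'Dennis', 'Jasper'],
    'Output': [1] * 6 + [0.5] * 6 + [2] * 6,
})

# Collect people per (ID, Status); Output is constant per ID, so grouping on it keeps the column.
# Tuples are hashable, which lets factorize compare whole groups of people.
out = (
    df.groupby(['ID', 'Status', 'Output'], sort=False)['Person']
    .agg(tuple)
    .reset_index(name='People')
)

# One integer per distinct tuple, numbered in order of first appearance
out['Group_ID'] = pd.factorize(out['People'])[0]
print(out)
```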
I have 2 dataframes.
Df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad']})
Df2 = pd.DataFrame({'IDs': ['Jake', 'John', 'Marc', 'Tony', 'Bob']})
I want to loop over every row in Df1['name'] and check if each name is somewhere in Df2['IDs'].
The result should be 1 if the name is in there and 0 if it is not, like so:
Marc 1
Jake 1
Sam 0
Brad 0
Thank you.
Use isin
Df1.name.isin(Df2.IDs).astype(int)
0 1
1 1
2 0
3 0
Name: name, dtype: int32
Show result in data frame
Df1.assign(InDf2=Df1.name.isin(Df2.IDs).astype(int))
name InDf2
0 Marc 1
1 Jake 1
2 Sam 0
3 Brad 0
In a Series object
pd.Series(Df1.name.isin(Df2.IDs).values.astype(int), Df1.name.values)
Marc 1
Jake 1
Sam 0
Brad 0
dtype: int32
This should do it:
Df1 = Df1.assign(result=Df1['name'].isin(Df2['IDs']).astype(int))
By using merge
s=Df1.merge(Df2,left_on='name',right_on='IDs',how='left')
s.IDs=s.IDs.notnull().astype(int)
s
Out[68]:
name IDs
0 Marc 1
1 Jake 1
2 Sam 0
3 Brad 0
This is one way. Convert to set for O(1) lookup and use astype(int) to represent Boolean values as integers.
values = set(Df2['IDs'])
Df1['Match'] = Df1['name'].isin(values).astype(int)
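For reference, here is a complete, runnable version of the isin approach on the two frames from the question. The column name in_Df2 is illustrative.

```python
import pandas as pd

Df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad']})
Df2 = pd.DataFrame({'IDs': ['Jake', 'John', 'Marc', 'Tony', 'Bob']})

# isin returns a boolean Series aligned with Df1; astype(int) turns True/False into 1/0
Df1['in_Df2'] = Df1['name'].isin(Df2['IDs']).astype(int)
print(Df1)
```

Compared with the merge variant, isin never changes the number of rows in Df1, whereas a left merge would repeat a row of Df1 for every duplicate of its name in Df2['IDs'].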
Hello I have the following dataframe
df =
A B
John Tom
Homer Bart
Tom Maggie
Lisa John
I would like to assign a unique ID to each name, so that the result is
df =
A B C D
John Tom 0 1
Homer Bart 2 3
Tom Maggie 1 4
Lisa John 5 0
What I have done is the following:
LL1 = pd.concat([df.a,df.b],ignore_index=True)
LL1 = pd.DataFrame(LL1)
LL1.columns=['a']
nameun = pd.unique(LL1.a.ravel())
LLout['c'] = 0
LLout['d'] = 0
NN = list(nameun)
for i in range(1,len(LLout)):
    LLout.c[i] = NN.index(LLout.a[i])
    LLout.d[i] = NN.index(LLout.b[i])
But since I have a very large dataset this process is very slow.
Here's one way. First get the array of unique names:
In [11]: df.values.ravel()
Out[11]: array(['John', 'Tom', 'Homer', 'Bart', 'Tom', 'Maggie', 'Lisa', 'John'], dtype=object)
In [12]: pd.unique(df.values.ravel())
Out[12]: array(['John', 'Tom', 'Homer', 'Bart', 'Maggie', 'Lisa'], dtype=object)
and make this a Series, mapping names to their respective numbers:
In [13]: names = pd.unique(df.values.ravel())
In [14]: names = pd.Series(np.arange(len(names)), names)
In [15]: names
Out[15]:
John 0
Tom 1
Homer 2
Bart 3
Maggie 4
Lisa 5
dtype: int64
Now use applymap and names.get to lookup these numbers:
In [16]: df.applymap(names.get)
Out[16]:
A B
0 0 1
1 2 3
2 1 4
3 5 0
and assign it to the correct columns:
In [17]: df[["C", "D"]] = df.applymap(names.get)
In [18]: df
Out[18]:
A B C D
0 John Tom 0 1
1 Homer Bart 2 3
2 Tom Maggie 1 4
3 Lisa John 5 0
Note: This assumes that all the values are names to begin with, you may want to restrict this to some columns only:
df[['A', 'B']].values.ravel()
...
df[['A', 'B']].applymap(names.get)
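The unique-then-lookup steps above can also be done in one call with pd.factorize, which numbers values in order of first appearance. This is a sketch rather than the answerer's method; it may be useful on recent pandas, where applymap is deprecated (since 2.1) in favor of DataFrame.map. The names codes, uniques, and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['John', 'Homer', 'Tom', 'Lisa'],
                   'B': ['Tom', 'Bart', 'Maggie', 'John']})

# Flatten row by row (John, Tom, Homer, Bart, ...) and assign one integer per distinct name
codes, uniques = pd.factorize(df[['A', 'B']].to_numpy().ravel())

# Reshape the codes back to the frame's shape and attach them as columns C and D
ids = pd.DataFrame(codes.reshape(df.shape), columns=['C', 'D'], index=df.index)
result = df.join(ids)
print(result)
```

Because the values are flattened row by row, the numbering follows the order in which names first appear when reading the frame left to right, top to bottom.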
(Note: I'm assuming you don't care about the precise details of the mapping -- which number John becomes, for example -- but only that there is one.)
Method #1: you could use a Categorical object as an intermediary:
>>> ranked = pd.Categorical(df.stack()).codes.reshape(df.shape)
>>> df.join(pd.DataFrame(ranked, columns=["C", "D"]))
A B C D
0 John Tom 2 5
1 Homer Bart 1 0
2 Tom Maggie 5 4
3 Lisa John 3 2
It feels like you should be able to treat a Categorical as providing an encoding dictionary somehow (whether directly or by generating a Series) but I can't see a convenient way to do it.
Method #2: you could use rank(method="dense"), which generates an increasing number for each value in sorted order:
>>> ranked = df.stack().rank(method="dense").to_numpy().reshape(df.shape).astype(int)-1
>>> df.join(pd.DataFrame(ranked, columns=["C", "D"]))
A B C D
0 John Tom 2 5
1 Homer Bart 1 0
2 Tom Maggie 5 4
3 Lisa John 3 2