I have a data frame with the column name, and I need to create the column seq, which lets me identify each separate run of a name in the data frame; it's important to preserve the order.
import pandas as pd
data = {'name': ['Tom', 'Joseph','Joseph','Joseph', 'Tom', 'Tom', 'John','Tom','Tom','John','Joseph']
, 'seq': ['Tom 0', 'Joseph 0','Joseph 0','Joseph 0', 'Tom 1', 'Tom 1', 'John 0','Tom 2','Tom 2','John 1','Joseph 1']}
df = pd.DataFrame(data)
print(df)
name seq
0 Tom Tom 0
1 Joseph Joseph 0
2 Joseph Joseph 0
3 Joseph Joseph 0
4 Tom Tom 1
5 Tom Tom 1
6 John John 0
7 Tom Tom 2
8 Tom Tom 2
9 John John 1
10 Joseph Joseph 1
Create a boolean mask to detect whether the name has changed from the previous row. Then filter out the second, third, ... rows of each run before grouping by name; cumcount increments the sequence number, and finally name and sequence number are concatenated.
# Boolean mask
m = df['name'].ne(df['name'].shift())
# Create sequence number
seq = df.loc[m].groupby('name').cumcount().astype(str) \
.reindex(df.index, fill_value=pd.NA).ffill()
# Concatenate name and seq
df['seq'] = df['name'] + ' ' + seq
Output:
>>> df
name seq
0 Tom Tom 0
1 Joseph Joseph 0
2 Joseph Joseph 0
3 Joseph Joseph 0
4 Tom Tom 1
5 Tom Tom 1
6 John John 0
7 Tom Tom 2
8 Tom Tom 2
9 John John 1
10 Joseph Joseph 1
>>> m
0 True
1 True
2 False
3 False
4 True
5 False
6 True
7 True
8 False
9 True
10 True
Name: name, dtype: bool
You need to check for the start of each new run of a name, then create a per-name sequence number using groupby and cumsum; the resulting string Series can be concatenated with str.cat:
df['seq'] = df['name'].str.cat(
df['name'].ne(df['name'].shift()).groupby(df['name']).cumsum().sub(1).astype(str),
sep=' '
)
Assuming your data frame is indexed sequentially (0, 1, 2, 3, ...):
Group the data frame by name
For each group, apply a gap-and-island algorithm: every time the index jumps by more than 1, create a new island
def sequencer(group):
    idx = group.index.to_series()
    # Every time the index has a gap > 1, start a new island
    return idx.diff().ne(1).cumsum().sub(1)

num = df.groupby('name', group_keys=False).apply(sequencer)
df['seq'] = df['name'] + ' ' + num.astype(str)
I have this data, saved in a DataFrame:
data = {'Eventname': ['100m','200m','Discus','100m','200m','Discus'],
'Year': [2030,2030,2031,2030,2031,2032],
'FirstPlace': ['John Smith', 'Shar jean', 'Abi whi', 'mik jon','joh doe', 'John Smith'],
'SecPlace': ['joh doe', 'John Smith', 'Shar jean', 'Hen Hun','Tom Will', 'Gord Jay'],
'thiPlace': ['mik jon', 'Lisa tru', 'John Smith', 'Bret Tun','Tim Smith', 'Jack Mann'] }
df = pd.DataFrame(data)
I want to create a new DataFrame whose first column contains the names of all people occurring in the [FirstPlace], [SecPlace], [thiPlace] columns, without duplicates, and then count how many times each name appears in each column.
I wrote this code:
NewArr=pd.DataFrame()
NewArr['first']=df['FirstPlace'].value_counts()
NewArr['second']=df['SecPlace'].value_counts()
NewArr['third']=df['thiPlace'].value_counts()
The code has these problems:
it only shows me the five names that appear in [FirstPlace]
I want the value 0, not NaN, in the result
I want to add the column title "AthleteName" above the names
A possible solution:
df1 = df.melt(value_vars=['FirstPlace', 'SecPlace', 'thiPlace'])
pd.crosstab(df1.value, df1.variable).reset_index(
names='Names').rename_axis(None, axis=1)
Output:
Names FirstPlace SecPlace thiPlace
0 Abi whi 1 0 0
1 Bret Tun 0 0 1
2 Gord Jay 0 1 0
3 Hen Hun 0 1 0
4 Jack Mann 0 0 1
5 John Smith 2 1 1
6 Lisa tru 0 0 1
7 Shar jean 1 1 0
8 Tim Smith 0 0 1
9 Tom Will 0 1 0
10 joh doe 1 1 0
11 mik jon 1 0 1
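Alternatively, the asker's value_counts approach can be kept almost intact: building the frame from a dict of value_counts Series aligns on the union of all names, and fillna plus rename_axis handle the remaining two points. A sketch (column names as in the question):

```python
import pandas as pd

df = pd.DataFrame({
    'FirstPlace': ['John Smith', 'Shar jean', 'Abi whi', 'mik jon', 'joh doe', 'John Smith'],
    'SecPlace':   ['joh doe', 'John Smith', 'Shar jean', 'Hen Hun', 'Tom Will', 'Gord Jay'],
    'thiPlace':   ['mik jon', 'Lisa tru', 'John Smith', 'Bret Tun', 'Tim Smith', 'Jack Mann'],
})

cols = ['FirstPlace', 'SecPlace', 'thiPlace']
NewArr = (pd.DataFrame({c: df[c].value_counts() for c in cols})  # index = union of all names
            .fillna(0).astype(int)         # 0 instead of NaN
            .rename_axis('AthleteName')    # title above the names
            .reset_index())
```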
The Problem
I had a hard time phrasing this question, but essentially I have a series of X columns that represent weights at specific points in time, and another set of X columns that hold the names of the people that were measured.
That table looks like this (there are more than two columns; this is just a toy example):
a_weight  b_weight  a_name  b_name
10        5         John    Michael
1         2         Jake    Michelle
21        3         Alice   Bob
2         1         Ashley  Brian
What I Want
I want two columns holding the maximum weight and the corresponding name at each point in time. I want this vectorized because there is a lot of data; I can do it using a for loop or an .apply(lambda row: row[col]), but it is very slow.
So the final table would look something like this:
a_weight  b_weight  a_name  b_name    max_weight  max_name
10        5         John    Michael   a_weight    John
1         2         Jake    Michelle  b_weight    Michelle
21        3         Alice   Bob       a_weight    Alice
2         1         Ashley  Brian     a_weight    Ashley
What I've Tried
I've been able to create a mirror df_subset with just the weights, then use the idxmax function to make a max_weight column:
df_subset = df[[c for c in df.columns if "weight" in c]]
max_weight_col = df_subset.idxmax(axis="columns")
This returns a Series matching the max_weight column shown in the table above. Now I run:
df["max_name_col"] = max_weight_col.str.replace("_weight","_name")
and I have this:
a_weight  b_weight  a_name  b_name    max_weight  max_name_col
10        5         John    Michael   a_weight    a_name
1         2         Jake    Michelle  b_weight    b_name
21        3         Alice   Bob       a_weight    a_name
2         1         Ashley  Brian     a_weight    a_name
I basically want to run code like the one below, without the for-loop:
df["max_name"] = [row[row["max_name_col"]] for _, row in df.iterrows()]
How do I move on from here? I feel like I'm so close but I'm stuck. Any help? I'm also open to throwing away the entire code and doing something else if there's a faster way.
You can do that for sure; just pass it through numpy's argmax:
import numpy as np

v1 = df.filter(like='weight').to_numpy()
v2 = df.filter(like='name').to_numpy()
idx = v1.argmax(1)  # column position of the max weight per row
df['max_weight'] = v1[np.arange(len(df)), idx]
df['max_name'] = v2[np.arange(len(df)), idx]
df
Out[921]:
a_weight b_weight a_name b_name max_weight max_name
0 10 5 John Michael 10 John
1 1 2 Jake Michelle 2 Michelle
2 21 3 Alice Bob 21 Alice
3 2 1 Ashley Brian 2 Ashley
This would do the trick assuming you only have 2 weight columns:
df["max_weight"] = df[["a_weight", "b_weight"]].idxmax(axis=1)
mask = df["max_weight"] == "a_weight"
df.loc[mask, "max_name"] = df[mask]["a_name"]
df.loc[~mask, "max_name"] = df[~mask]["b_name"]
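For the two-column case, the same logic can also be sketched in a single pass with numpy.where (assuming, as above, only the a_*/b_* pairs exist; ties go to a_*):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a_weight': [10, 1, 21, 2], 'b_weight': [5, 2, 3, 1],
                   'a_name': ['John', 'Jake', 'Alice', 'Ashley'],
                   'b_name': ['Michael', 'Michelle', 'Bob', 'Brian']})

mask = df['a_weight'] >= df['b_weight']  # True where the a_* pair wins
df['max_weight'] = np.where(mask, 'a_weight', 'b_weight')
df['max_name'] = np.where(mask, df['a_name'], df['b_name'])
```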
We could use idxmax to find the winning weight column, then map each winner to its column position and pick the matching name with numpy advanced indexing (factorize also works here, but only because its appearance-order codes happen to match the column order in this example):
import numpy as np

weight_cols = df.columns[df.columns.str.contains('weight')]
name_cols = df.columns[df.columns.str.contains('name')]
df['max_weight'] = df[weight_cols].idxmax(axis=1)
pos = weight_cols.get_indexer(df['max_weight'])
df['max_name'] = df[name_cols].to_numpy()[np.arange(len(df)), pos]
Output:
a_weight b_weight a_name b_name max_weight max_name
0 10 5 John Michael a_weight John
1 1 2 Jake Michelle b_weight Michelle
2 21 3 Alice Bob a_weight Alice
3 2 1 Ashley Brian a_weight Ashley
I have 2 dataframes.
Df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad']})
Df2 = pd.DataFrame({'IDs': ['Jake', 'John', 'Marc', 'Tony', 'Bob']})
I want to loop over every row in Df1['name'] and check if each name is somewhere in Df2['IDs'].
The result should return 1 if the name is in there, 0 if it is not like so:
Marc 1
Jake 1
Sam 0
Brad 0
Thank you.
Use isin
Df1.name.isin(Df2.IDs).astype(int)
0 1
1 1
2 0
3 0
Name: name, dtype: int32
Show result in data frame
Df1.assign(InDf2=Df1.name.isin(Df2.IDs).astype(int))
name InDf2
0 Marc 1
1 Jake 1
2 Sam 0
3 Brad 0
In a Series object
pd.Series(Df1.name.isin(Df2.IDs).values.astype(int), Df1.name.values)
Marc 1
Jake 1
Sam 0
Brad 0
dtype: int32
This should do it:
Df1 = Df1.assign(result=Df1['name'].isin(Df2['IDs']).astype(int))
By using merge
s=Df1.merge(Df2,left_on='name',right_on='IDs',how='left')
s.IDs=s.IDs.notnull().astype(int)
s
Out[68]:
name IDs
0 Marc 1
1 Jake 1
2 Sam 0
3 Brad 0
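A related sketch uses merge's indicator parameter instead of the notnull check; drop_duplicates guards against a repeated ID multiplying rows:

```python
import pandas as pd

Df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad']})
Df2 = pd.DataFrame({'IDs': ['Jake', 'John', 'Marc', 'Tony', 'Bob']})

s = Df1.merge(Df2.drop_duplicates('IDs'), left_on='name', right_on='IDs',
              how='left', indicator=True)
Df1['Match'] = s['_merge'].eq('both').astype(int)  # 'both' = found in Df2
```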
This is one way. Convert to set for O(1) lookup and use astype(int) to represent Boolean values as integers.
values = set(Df2['IDs'])
Df1['Match'] = Df1['name'].isin(values).astype(int)
Given a DataFrame:
name email
0 Carl carl#yahoo.com
1 Bob bob#gmail.com
2 Alice alice#yahoo.com
3 David dave#hotmail.com
4 Eve eve#gmail.com
How can it be sorted according to the email's domain name (alphabetically, ascending), and then, within each domain group, according to the string before the "#"?
The result of sorting the above should then be:
name email
0 Bob bob#gmail.com
1 Eve eve#gmail.com
2 David dave#hotmail.com
3 Alice alice#yahoo.com
4 Carl carl#yahoo.com
Use:
df = df.reset_index(drop=True)
idx = df['email'].str.split('#', expand=True).sort_values([1,0]).index
df = df.reindex(idx).reset_index(drop=True)
print (df)
name email
0 Bob bob#gmail.com
1 Eve eve#gmail.com
2 David dave#hotmail.com
3 Alice alice#yahoo.com
4 Carl carl#yahoo.com
Explanation:
First reset_index with drop=True for unique default indices
Then split the values into a new DataFrame and sort_values
Last, reindex to the new order
Option 1
sorted + reindex
df = df.set_index('email')
df.reindex(sorted(df.index, key=lambda x: x.split('#')[::-1])).reset_index()
email name
0 bob#gmail.com Bob
1 eve#gmail.com Eve
2 dave#hotmail.com David
3 alice#yahoo.com Alice
4 carl#yahoo.com Carl
Option 2
sorted + pd.DataFrame
As an alternative, you can ditch the reindex call from Option 1 by re-creating a new DataFrame.
pd.DataFrame(
sorted(df.values, key=lambda x: x[1].split('#')[::-1]),
columns=df.columns
)
name email
0 Bob bob#gmail.com
1 Eve eve#gmail.com
2 David dave#hotmail.com
3 Alice alice#yahoo.com
4 Carl carl#yahoo.com
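Option 3
On pandas 1.1+, the key parameter of sort_values can express the same (domain, local part) ordering directly; a sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Carl', 'Bob', 'Alice', 'David', 'Eve'],
    'email': ['carl#yahoo.com', 'bob#gmail.com', 'alice#yahoo.com',
              'dave#hotmail.com', 'eve#gmail.com'],
})

# Sort by a (domain, local part) tuple for each address
out = df.sort_values(
    'email',
    key=lambda s: s.str.split('#').map(lambda parts: (parts[1], parts[0])),
).reset_index(drop=True)
```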
Hello, I have the following dataframe:
df =
A B
John Tom
Homer Bart
Tom Maggie
Lisa John
I would like to assign a unique ID to each name, returning:
df =
A B C D
John Tom 0 1
Homer Bart 2 3
Tom Maggie 1 4
Lisa John 5 0
What I have done is the following:
LL1 = pd.concat([df.A, df.B], ignore_index=True)
LL1 = pd.DataFrame(LL1)
LL1.columns = ['a']
nameun = pd.unique(LL1.a)
LLout = df.rename(columns={'A': 'a', 'B': 'b'})
LLout['c'] = 0
LLout['d'] = 0
NN = list(nameun)
for i in range(len(LLout)):
    LLout.loc[i, 'c'] = NN.index(LLout.a[i])
    LLout.loc[i, 'd'] = NN.index(LLout.b[i])
But since I have a very large dataset this process is very slow.
Here's one way. First get the array of unique names:
In [11]: df.values.ravel()
Out[11]: array(['John', 'Tom', 'Homer', 'Bart', 'Tom', 'Maggie', 'Lisa', 'John'], dtype=object)
In [12]: pd.unique(df.values.ravel())
Out[12]: array(['John', 'Tom', 'Homer', 'Bart', 'Maggie', 'Lisa'], dtype=object)
and make this a Series, mapping names to their respective numbers:
In [13]: names = pd.unique(df.values.ravel())
In [14]: names = pd.Series(np.arange(len(names)), names)
In [15]: names
Out[15]:
John 0
Tom 1
Homer 2
Bart 3
Maggie 4
Lisa 5
dtype: int64
Now use applymap and names.get to look up these numbers:
In [16]: df.applymap(names.get)
Out[16]:
A B
0 0 1
1 2 3
2 1 4
3 5 0
and assign it to the correct columns:
In [17]: df[["C", "D"]] = df.applymap(names.get)
In [18]: df
Out[18]:
A B C D
0 John Tom 0 1
1 Homer Bart 2 3
2 Tom Maggie 1 4
3 Lisa John 5 0
Note: This assumes that all the values are names to begin with; you may want to restrict this to certain columns only:
df[['A', 'B']].values.ravel()
...
df[['A', 'B']].applymap(names.get)
(Note: I'm assuming you don't care about the precise details of the mapping -- which number John becomes, for example -- but only that there is one.)
Method #1: you could use a Categorical object as an intermediary:
>>> ranked = pd.Categorical(df.stack()).codes.reshape(df.shape)
>>> df.join(pd.DataFrame(ranked, columns=["C", "D"]))
A B C D
0 John Tom 2 5
1 Homer Bart 1 0
2 Tom Maggie 5 4
3 Lisa John 3 2
It feels like you should be able to treat a Categorical as providing an encoding dictionary somehow (whether directly or by generating a Series) but I can't see a convenient way to do it.
Method #2: you could use rank(method="dense"), which generates an increasing number for each distinct value in sorted order:
>>> ranked = df.stack().rank(method="dense").sub(1).astype(int).to_numpy().reshape(df.shape)
>>> df.join(pd.DataFrame(ranked, columns=["C", "D"]))
A B C D
0 John Tom 2 5
1 Homer Bart 1 0
2 Tom Maggie 5 4
3 Lisa John 3 2
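As one more sketch, pd.factorize numbers values in order of first appearance (row-major), which reproduces the exact IDs in the question's expected output, unlike the two alphabetical methods above:

```python
import pandas as pd

df = pd.DataFrame({'A': ['John', 'Homer', 'Tom', 'Lisa'],
                   'B': ['Tom', 'Bart', 'Maggie', 'John']})

# Codes follow first appearance: John=0, Tom=1, Homer=2, Bart=3, Maggie=4, Lisa=5
codes, uniques = pd.factorize(df[['A', 'B']].to_numpy().ravel())
df[['C', 'D']] = codes.reshape(-1, 2)
```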