I have this dataset:
import pandas as pd

mydf = pd.DataFrame({'source': ['a', 'b', 'a', 'b'],
                     'text': ['November rain', 'Sweet child omine', 'Paradise City', 'Patience']})
mydf
source text
0 a November rain
1 b Sweet child omine
2 a Paradise City
3 b Patience
And I want to split the text inside column text. This is the expected result:
source text
0 a November
1 a rain
2 b Sweet
3 b child
4 b omine
5 a Paradise
6 a City
7 b Patience
This is what I have tried:
mydf['text'] = mydf['text'].str.split(expand=True)
But it returns an error:
ValueError: Columns must be same length as key
What am I doing wrong? Is there a way to do this without creating an index?
str.split(expand=True) returns a DataFrame, usually with more than one column, so you can't assign it back to a single column:
# output of `str.split(expand=True)`
          0      1      2
0  November   rain   None
1     Sweet  child  omine
2  Paradise   City   None
3  Patience   None   None
I think you mean:
# expand=False is default
mydf['text'] = mydf['text'].str.split()
mydf = mydf.explode('text')
You can also chain with assign:
mydf.assign(text=mydf['text'].str.split()).explode('text')
Output:
source text
0 a November
0 a rain
1 b Sweet
1 b child
1 b omine
2 a Paradise
2 a City
3 b Patience
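If you want the fresh 0..7 index from your expected output rather than the repeated one above, explode accepts ignore_index=True (added in pandas 1.1, if memory serves), or you can chain reset_index(drop=True):

mydf.assign(text=mydf['text'].str.split()).explode('text', ignore_index=True)

This matches the expected output in the question exactly.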
I have a dataframe with employees and their levels.
import pandas as pd
d = {'employees': ["John", "Jamie", "Ann", "Jane", "Kim", "Steve"], 'Level': ["A/Ba", "C/A", "A", "C", "Ba/C", "D"]}
df = pd.DataFrame(data=d)
How do I add a new column that counts the number of other employees sharing any of the same levels? For example, John would get 3: there are two other A's (Jamie and Ann) and one other Ba (Kim). Note that an employee's own levels (John's, in this case) do not count towards that total.
My goal is for the end dataframe to be this:
  employees Level  Number of levels
0      John  A/Ba                 3
1     Jamie   C/A                 4
2       Ann     A                 2
3      Jane     C                 2
4       Kim  Ba/C                 3
5     Steve     D                 0
Try this:
df['Number of levels'] = df['Level'].str.split('/').explode().map(df['Level'].str.split('/').explode().value_counts()).sub(1).groupby(level=0).sum()
Output:
>>> df
employees Level Number of levels
0 John A/Ba 3
1 Jamie C/A 4
2 Ann A 2
3 Jane C 2
4 Kim Ba/C 3
5 Steve D 0
Alternatively, step by step:
exploded = df.Level.str.split("/").explode()
counts = exploded.groupby(exploded).transform("count").sub(1)
df["Num Levels"] = counts.groupby(level=0).sum()
We first explode the "Level" column by splitting over "/" so we can reach each level:
>>> exploded = df.Level.str.split("/").explode()
>>> exploded
0 A
0 Ba
1 C
1 A
2 A
3 C
4 Ba
4 C
5 D
Name: Level, dtype: object
We now need the count of each element in this series, so we group it by itself and transform with "count":
>>> exploded.groupby(exploded).transform("count")
0 3
0 2
1 3
1 3
2 3
3 3
4 2
4 3
5 1
Name: Level, dtype: int64
Since this counts each element itself as well, but we only want the other occurrences, we subtract 1:
>>> counts = exploded.groupby(exploded).transform("count").sub(1)
>>> counts
0 2
0 1
1 2
1 2
2 2
3 2
4 1
4 2
5 0
Name: Level, dtype: int64
Now we need to "come back" to the original rows, and the index is our helper for that: we group by it (that is what level=0 means) and sum the counts:
>>> counts.groupby(level=0).sum()
0 3
1 4
2 2
3 2
4 3
5 0
Name: Level, dtype: int64
This is the end result; assigning it to df["Num Levels"] gives:
employees Level Num Levels
0 John A/Ba 3
1 Jamie C/A 4
2 Ann A 2
3 Jane C 2
4 Kim Ba/C 3
5 Steve D 0
This can all be written in "one line", but that may hinder readability and further debugging:
df["Num Levels"] = (df.Level
.str.split("/")
.explode()
.pipe(lambda ex: ex.groupby(ex))
.transform("count")
.sub(1)
.groupby(level=0)
.sum())
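For comparison, here is a plain-Python sketch of the same logic using collections.Counter (variable names here are my own):

from collections import Counter

# total occurrences of every level across all employees
levels = df['Level'].str.split('/')
totals = Counter(lvl for row in levels for lvl in row)

# for each employee, count the *other* holders of each of their levels
df['Num Levels'] = levels.map(lambda row: sum(totals[lvl] - 1 for lvl in row))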
I am trying to remove the rows with NaN values from a pandas dataframe, and after doing so I want the row identifiers to shift so that the identifiers in the new dataframe start from 0 and are one number apart. By identifiers I mean the numbers at the left of the following example. Note that this is not an actual column of my df; it is placed there by default in every dataframe.
If my df is:
name toy born
0 a 1 2020
1 na 2 2020
2 c 5 2020
3 na 1 2020
4 asf 1 2020
This is what I want after dropna():
name toy born
0 a 1 2020
1 c 5 2020
2 asf 1 2020
This is what I get instead, which I don't want:
name toy born
0 a 1 2020
2 c 5 2020
4 asf 1 2020
You can simply chain df.reset_index(drop=True).
By default, df.dropna and df.reset_index do not operate in place, so the complete answer is as follows:
df = df.dropna().reset_index(drop=True)
Results and Explanations
The above code yields the following result.
>>> df = df.dropna().reset_index(drop=True)
>>> df
name toy born
0 a 1 2020
1 c 5 2020
2 asf 1 2020
We use the argument drop=True to discard the old index entirely. Otherwise it is kept as a new column, and the result looks like this:
>>> df = df.dropna().reset_index()
>>> df
index name toy born
0 0 a 1 2020
1 2 c 5 2020
2 4 asf 1 2020
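For reference, a minimal reproducible sketch of the above (sample values assumed from the printouts, with the 'na' entries taken to be real NaNs):

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['a', np.nan, 'c', np.nan, 'asf'],
                   'toy': [1, 2, 5, 1, 1],
                   'born': [2020] * 5})

df = df.dropna().reset_index(drop=True)  # drop NaN rows, then renumber from 0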
Here is the code I used to create my dataframe:
import pandas as pd

data = [['Anna', 1, 1, 2, 2, 3], ['Bob', 2, 2, 3, 1, 1],
        ['Chloe', 1, 1, 2, 3, 4], ['David', 1, 2, 2, 2, 1]]
df = pd.DataFrame(data, columns=['Name', 'A', 'B', 'C', 'D', 'E'])
I want to create a column that indicates whether a specific pattern of change occurred across the columns. For example, for this dataset I would like the column to show whether the person went from 1 to 2 to 3, or from 1 to 2 to 3 to 4. So for this specific dataframe, both Anna and Chloe would have an indicator in that column conveying that they went through these changes.
The expected outcome should have the following column to the dataframe:
df['Column'] = ['1-2-3','NA','1-2-3-4','NA']
You can take the approach below. Note the helper frame m, which holds just the numeric columns (the original snippet used m without defining it):

import numpy as np

m = df.set_index('Name')  # just the A..E columns, one row per person

# True where the values never decrease left to right
cond = ~m.diff(axis=1).lt(0).any(axis=1)

# join each row's distinct values in order
df = df.assign(new_col=np.where(cond,
                                m.apply(lambda x: '-'.join(map(str, dict.fromkeys(x))), axis=1),
                                'NA'))
print(df)
Name A B C D E new_col
0 Anna 1 1 2 2 3 1-2-3
1 Bob 2 2 3 1 1 NA
2 Chloe 1 1 2 3 4 1-2-3-4
3 David 1 2 2 2 1 NA
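Two details worth noting: diff(axis=1).lt(0).any(axis=1) flags any decrease across a row, and dict.fromkeys is an order-preserving way to de-duplicate a sequence:

>>> list(dict.fromkeys([1, 1, 2, 2, 3]))
[1, 2, 3]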
Here you go:

import numpy as np

def path(values):
    # collect the distinct values in order; bail out on any decrease
    nodes = []
    for i in values:
        if nodes and i < nodes[-1]:
            return np.nan  # a decrease breaks the pattern
        if i not in nodes:
            nodes.append(i)
    return '-'.join(str(i) for i in nodes)

df['path'] = [path(row) for row in np.array(df[['A', 'B', 'C', 'D', 'E']])]
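If I've traced the logic correctly, this yields NaN (rather than the string 'NA') for the rows without the pattern:

>>> df[['Name', 'path']]
    Name     path
0   Anna    1-2-3
1    Bob      NaN
2  Chloe  1-2-3-4
3  David      NaN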
I have two data frames. The first one is:
id code
1 2
2 3
3 3
4 1
and the second one is:
id code name
1 1 Mary
2 2 Ben
3 3 John
I would like to map data frame 1 so that it looks like:
id code name
1 2 Ben
2 3 John
3 3 John
4 1 Mary
This is the code I tried:
mapping = dict(df2[['code','name']].values)
df1['name'] = df1['code'].map(mapping)
My mapping is correct, but the mapped values are all NaN:
mapping = {1:"Mary", 2:"Ben", 3:"John"}
id code name
1 2 NaN
2 3 NaN
3 3 NaN
4 1 NaN
Does anyone know why, and how to solve it?
The problem is that the values in column code have different types in the two frames, so it is necessary to convert them to integers or strings with astype so both have the same type:
print (df1['code'].dtype)
object
print (df2['code'].dtype)
int64
print (type(df1.loc[0, 'code']))
<class 'str'>
print (type(df2.loc[0, 'code']))
<class 'numpy.int64'>
mapping = dict(df2[['code','name']].values)

# option 1: same dtypes - integers
df1['name'] = df1['code'].astype(int).map(mapping)

# option 2: same dtypes - object (i.e. strings)
df2['code'] = df2['code'].astype(str)
mapping = dict(df2[['code','name']].values)
df1['name'] = df1['code'].map(mapping)
print (df1)
id code name
0 1 2 Ben
1 2 3 John
2 3 3 John
3 4 1 Mary
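As a side note, map also accepts a Series directly, so you can skip building the dict; the same dtype alignment still applies (a sketch, reusing the integer conversion from above):

df1['name'] = df1['code'].astype(int).map(df2.set_index('code')['name'])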
An alternative is to use DataFrame.merge (the same dtype alignment applies here as well):
df1.merge(df2.drop(columns='id'), how='left', on='code')
Output:
id code name
0 1 2 Ben
1 2 3 John
2 3 3 John
3 4 1 Mary
I need some help with cleaning a Dataframe that has multi index.
It looks something like this:
                         cost
location         season
Thorp park       autumn   £12
                 spring   £13
                 summer   £22
Sea life centre  summer   £34
                 spring   £43
Alton towers     ...and so on
location and season are index columns. I want to go through the data and remove any location that doesn't have values for all three seasons, so "Sea life centre" should be removed.
Can anyone help me with this?
Another question: my dataframe was created by a groupby command and has no column name for the "cost" column. Is this normal? There are values in the column, just no header.
Option 1
groupby + count. You can use the result to index your dataframe.
df
col
a 1 0
2 1
b 1 3
2 4
3 5
c 2 7
3 8
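For reference, a minimal sketch that reconstructs this sample frame (index tuples read off the printout above):

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1), ('b', 2), ('b', 3), ('c', 2), ('c', 3)])
df = pd.DataFrame({'col': [0, 1, 3, 4, 5, 7, 8]}, index=idx)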
v = df.groupby(level=0)['col'].transform('count')  # per-group row counts, as a Series (using .values here would give a 2-D mask)
df = df[v == 3]
df
col
b 1 3
2 4
3 5
Option 2
groupby + filter (credit to Paul H for the idea):
df.groupby(level=0).filter(lambda g: len(g) == 3)  # len(g) gives a scalar; g.count() would return a per-column Series
col
b 1 3
2 4
3 5
Option 1
Thinking outside the box...
df.drop(df.count(level=0).col[lambda x: x < 3].index)
col
b 1 3
2 4
3 5
Same thing with a little more robustness because I'm not depending on values in a column.
df.drop(df.index.to_series().count(level=0).loc[lambda x: x < 3].index)
col
b 1 3
2 4
3 5
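A version note: count(level=0) was deprecated in pandas 1.3 and removed in 2.0 (if I remember the versions right), so on newer pandas the drop-based snippets above can be written with an explicit groupby instead:

# same idea as the first drop above, without the deprecated count(level=0)
df.drop(df.groupby(level=0)['col'].count().loc[lambda x: x < 3].index)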
Option 2
A more robust version for the general case, with an undetermined number of seasons.
This uses the groupby.pipe method from pandas 0.21:
df.groupby(level=0).pipe(lambda g: g.filter(lambda d: len(d) == g.size().max()))
col
b 1 3
2 4
3 5