I have two data frames. The first one is:
id code
1 2
2 3
3 3
4 1
and the second one is:
id code name
1 1 Mary
2 2 Ben
3 3 John
I would like to map data frame 1 so that it looks like:
id code name
1 2 Ben
2 3 John
3 3 John
4 1 Mary
I tried to use this code:
mapping = dict(df2[['code','name']].values)
df1['name'] = df1['code'].map(mapping)
My mapping is correct, but the mapped values are all NaN:
mapping = {1:"Mary", 2:"Ben", 3:"John"}
id code name
1 2 NaN
2 3 NaN
3 3 NaN
4 1 NaN
Does anyone know why, and how to solve it?
The problem is that the values in the code columns have different types, so it is necessary to convert them with astype to integers or to strings, so both columns share the same dtype:
print (df1['code'].dtype)
object
print (df2['code'].dtype)
int64
print (type(df1.loc[0, 'code']))
<class 'str'>
print (type(df2.loc[0, 'code']))
<class 'numpy.int64'>
mapping = dict(df2[['code','name']].values)
# option 1 - same dtypes: integers
df1['name'] = df1['code'].astype(int).map(mapping)

# option 2 - same dtypes: object (strings)
df2['code'] = df2['code'].astype(str)
mapping = dict(df2[['code','name']].values)
df1['name'] = df1['code'].map(mapping)
print (df1)
id code name
0 1 2 Ben
1 2 3 John
2 3 3 John
3 4 1 Mary
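For reference, here is a minimal, hypothetical reconstruction of the two frames that reproduces the problem (df1['code'] is deliberately built from strings, as happens e.g. when a CSV is read with mixed types):
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4], 'code': ['2', '3', '3', '1']})
df2 = pd.DataFrame({'id': [1, 2, 3],
                    'code': [1, 2, 3],
                    'name': ['Mary', 'Ben', 'John']})

# str codes mapped against int keys -> all NaN
print(df1['code'].map(dict(df2[['code', 'name']].values)))
# after aligning dtypes, the map succeeds
print(df1['code'].astype(int).map(dict(df2[['code', 'name']].values)))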
An alternative way is to use DataFrame.merge (note the 'code' columns must share a dtype here too):
df1.merge(df2.drop(columns=['id']), how='left', on='code')
Output:
id code name
0 1 2 Ben
1 2 3 John
2 3 3 John
3 4 1 Mary
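If duplicated codes in df2 would be a bug in your data, merge can check the many-to-one assumption for you (a defensive sketch, not required for the fix):
# raises pandas.errors.MergeError if 'code' is not unique in df2
df1.merge(df2.drop(columns=['id']), how='left', on='code', validate='m:1')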
Related
I have a dataframe where one of the column names is a variable:
xx = pd.DataFrame([{'ID':1, 'Name': 'Abe', 'HasCar':1},
{'ID':2, 'Name': 'Ben', 'HasCar':0},
{'ID':3, 'Name': 'Cat', 'HasCar':1}])
ID Name HasCar
0 1 Abe 1
1 2 Ben 0
2 3 Cat 1
In this dummy example, column 2 could be "HasCar", or "IsStaff", or some other unknowable value. I want to select all rows where column 2 is true, whatever the column name is.
I've tried the following without success:
xx.iloc[:,[2]] == 1
HasCar
0 True
1 False
2 True
and then trying to use that as an index results in:
xx[xx.iloc[:,[2]] == 1]
ID Name HasCar
0 NaN NaN 1.0
1 NaN NaN NaN
2 NaN NaN 1.0
Which isn't helpful. I suppose I could rename column 2, but that feels a little wrong. The issue seems to be that xx.iloc[:,[2]] returns a DataFrame while xx['HasCar'] returns a Series. I can't figure out how to force an (x, 1)-shaped DataFrame into a Series without knowing the column name, as described here.
Any ideas?
You were almost correct, but you sliced in 2D; use Series slicing instead:
xx[xx.iloc[:, 2] == 1]
Output:
ID Name HasCar
0 1 Abe 1
2 3 Cat 1
The difference:
# 2D slicing, this gives a DataFrame (with a single column)
xx.iloc[:,[2]]
HasCar
0 1
1 0
2 1
# 1D slicing, as Series
xx.iloc[:,2]
0 1
1 0
2 1
Name: HasCar, dtype: int64
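For completeness, if you do end up with a single-column DataFrame as in the question, DataFrame.squeeze() collapses it into a Series, so this works too:
# squeeze() turns the (n, 1) DataFrame into a Series
xx[xx.iloc[:, [2]].squeeze() == 1]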
I have a dataframe with employees and their levels.
import pandas as pd
d = {'employees': ["John", "Jamie", "Ann", "Jane", "Kim", "Steve"], 'Level': ["A/Ba", "C/A", "A", "C", "Ba/C", "D"]}
df = pd.DataFrame(data=d)
How do I add a new column that counts the number of other employees sharing the same levels? For example, John would have 3, as there are two other A's (Jamie and Ann) and one other Ba (Kim). Note that the employee's own levels (John's, in this case) are not included in the count.
My goal is for the end dataframe to be this:
employees Level Number of levels
0 John A/Ba 3
1 Jamie C/A 4
2 Ann A 2
3 Jane C 2
4 Kim Ba/C 3
5 Steve D 0
Try this:
df['Number of levels'] = df['Level'].str.split('/').explode().map(df['Level'].str.split('/').explode().value_counts()).sub(1).groupby(level=0).sum()
Output:
>>> df
employees Level Number of levels
0 John A/Ba 3
1 Jamie C/A 4
2 Ann A 2
3 Jane C 2
4 Kim Ba/C 3
5 Steve D 0
exploded = df.Level.str.split("/").explode()
counts = exploded.groupby(exploded).transform("count").sub(1)
df["Num Levels"] = counts.groupby(level=0).sum()
We first explode the "Level" column by splitting over "/" so we can reach each level:
>>> exploded = df.Level.str.split("/").explode()
>>> exploded
0 A
0 Ba
1 C
1 A
2 A
3 C
4 Ba
4 C
5 D
Name: Level, dtype: object
We now need the count of each element in this series, so we group the series by itself and transform with "count":
>>> exploded.groupby(exploded).transform("count")
0 3
0 2
1 3
1 3
2 3
3 3
4 2
4 3
5 1
Name: Level, dtype: int64
Since this counts each element including itself, but we only want the other occurrences, we subtract 1:
>>> counts = exploded.groupby(exploded).transform("count").sub(1)
>>> counts
0 2
0 1
1 2
1 2
2 2
3 2
4 1
4 2
5 0
Name: Level, dtype: int64
Now we need to "come back" to the original rows, and the index is our helper for that: we group by it (that is what level=0 means) and sum the counts:
>>> counts.groupby(level=0).sum()
0 3
1 4
2 2
3 2
4 3
5 0
Name: Level, dtype: int64
This is the end result, which is assigned to df["Num Levels"] to get:
employees Level Num Levels
0 John A/Ba 3
1 Jamie C/A 4
2 Ann A 2
3 Jane C 2
4 Kim Ba/C 3
5 Steve D 0
This can all be written in "one line", but that may hinder readability and further debugging!
df["Num Levels"] = (df.Level
.str.split("/")
.explode()
.pipe(lambda ex: ex.groupby(ex))
.transform("count")
.sub(1)
.groupby(level=0)
.sum())
I have a data frame like so:
CategoryNumber
1
2
3
1
3
I want to create a new column 'Category' that assigns values based on the value in the 'CategoryNumber' column, like so:
CategoryNumber Category
1 First Category
2 Second Category
3 Third Category
1 First Category
3 Third Category
How do I do this using Python and pandas?
Using map (here mapping the numbers 1-25 to lowercase letters, as a demonstration):
import string
df.CategoryNumber.map(dict(zip(range(1,26),string.ascii_lowercase)))
Out[472]:
0 a
1 b
2 c
3 a
4 c
Name: CategoryNumber, dtype: object
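If you want the literal labels from the question rather than letters, the same map idea works with an explicit dict (the labels below are taken from the desired output):
df['Category'] = df['CategoryNumber'].map(
    {1: 'First Category', 2: 'Second Category', 3: 'Third Category'})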
You can use category codes straight from pandas:
First, make the column a category.
Call cat.codes.
Assign it to your new column.
df['Category2'] = df['CategoryNumber'].astype('category').cat.codes
CategoryNumber Category2
0 1 0
1 2 1
2 3 2
3 1 0
4 3 2
If you need to make it A, B, C, etc., look at map:
df['Letters'] = df['Category2'].map(dict(enumerate(string.ascii_uppercase)))
CategoryNumber Category2 Letters
0 1 0 A
1 2 1 B
2 3 2 C
3 1 0 A
4 3 2 C
In pandas and Python:
I have a large dataset of health records where patients have records of diagnoses.
How do I display the most frequent diagnoses, counting at most one occurrence of the same diagnosis per patient?
Example ('pid' is patient id. 'code' is the code of a diagnosis):
IN:
pid code
1 A
1 B
1 A
1 A
2 A
2 A
2 B
2 A
3 B
3 C
3 D
4 A
4 A
4 A
4 B
OUT:
B 4
A 3
C 1
D 1
I would like to be able to use .isin and .index if possible. For example, this removes all rows whose value in column 'code' occurs fewer than 3 times:
s = df['code'].value_counts().ge(3)
df = df[df['code'].isin(s[s].index)]
You can use groupby + nunique:
df.groupby(by='code').pid.nunique().sort_values(ascending=False)
Out[60]:
code
B 4
A 3
D 1
C 1
Name: pid, dtype: int64
To remove all rows whose code occurs for fewer than 3 distinct patients:
df.groupby(by='code').filter(lambda x: x.pid.nunique()>=3)
Out[55]:
pid code
0 1 A
1 1 B
2 1 A
3 1 A
4 2 A
5 2 A
6 2 B
7 2 A
8 3 B
11 4 A
12 4 A
13 4 A
14 4 B
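groupby.filter calls the lambda once per group, which can be slow with many groups; a vectorized equivalent (a sketch using transform) is:
df[df.groupby('code').pid.transform('nunique') >= 3]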
Since you mention value_counts:
df.groupby('code').pid.value_counts().groupby(level=0).count()
Out[42]:
code
A 3
B 4
C 1
D 1
Name: pid, dtype: int64
You should be able to use the groupby and nunique() functions to obtain a distinct count of patients that had each diagnosis. This should give you the result you need:
df[['pid', 'code']].groupby(['code']).nunique()
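Since you asked about keeping the value_counts / .isin pattern: deduplicating the (pid, code) pairs first reduces the task to a plain value_counts, so your original snippet then works unchanged (a sketch of that approach):
deduped = df.drop_duplicates(['pid', 'code'])
counts = deduped['code'].value_counts()  # B 4, A 3, C 1, D 1
# keep rows whose code occurs for at least 3 distinct patients
s = counts.ge(3)
df[df['code'].isin(s[s].index)]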
I have simple data:
type age
A 4
A 4
B 4
A 5
I want to get
type age count
A 4 2
A 5 1
B 4 1
How do I perform this in pandas? What should I do after df.groupby(['type'])?
Let's use groupby with 'type' and 'age', then count and reset_index:
df.groupby(['type','age'])['age'].count().reset_index(name='count')
Output:
type age count
0 A 4 2
1 A 5 1
2 B 4 1
You could also do:
df.groupby(['type','age']).size().reset_index(name='count')
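On pandas 1.1+ there is also DataFrame.value_counts, which does the same in one call (sorted by count rather than by key):
df.value_counts(['type', 'age']).reset_index(name='count')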