I have two data frames. The first one is:
id code
1 2
2 3
3 3
4 1
and the second one is:
id code name
1 1 Mary
2 2 Ben
3 3 John
I would like to map data frame 1 so that it looks like:
id code name
1 2 Ben
2 3 John
3 3 John
4 1 Mary
I tried to use this code:
mapping = dict(df2[['code','name']].values)
df1['name'] = df1['code'].map(mapping)
My mapping is correct, but the mapped values are all NaN:
mapping = {1:"Mary", 2:"Ben", 3:"John"}
id code name
1 2 NaN
2 3 NaN
3 3 NaN
4 1 NaN
Does anyone know why, and how to solve it?
The problem is that the values in the code columns have different types, so it is necessary to convert them with astype to integers or to strings, so both columns share the same dtype:
print (df1['code'].dtype)
object
print (df2['code'].dtype)
int64
print (type(df1.loc[0, 'code']))
<class 'str'>
print (type(df2.loc[0, 'code']))
<class 'numpy.int64'>
mapping = dict(df2[['code','name']].values)
# option 1 - same dtypes: integers
df1['name'] = df1['code'].astype(int).map(mapping)

# option 2 - same dtypes: object (strings)
df2['code'] = df2['code'].astype(str)
mapping = dict(df2[['code','name']].values)
df1['name'] = df1['code'].map(mapping)
print (df1)
id code name
0 1 2 Ben
1 2 3 John
2 3 3 John
3 4 1 Mary
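For reference, here is a minimal, hypothetical reconstruction of the two frames that reproduces the problem (df1['code'] is deliberately built from strings, as happens e.g. when a CSV is read with mixed types):
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4], 'code': ['2', '3', '3', '1']})
df2 = pd.DataFrame({'id': [1, 2, 3],
                    'code': [1, 2, 3],
                    'name': ['Mary', 'Ben', 'John']})

# str codes mapped against int keys -> all NaN
print(df1['code'].map(dict(df2[['code', 'name']].values)))
# after aligning dtypes, the map succeeds
print(df1['code'].astype(int).map(dict(df2[['code', 'name']].values)))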
An alternative way is to use DataFrame.merge (note the 'code' columns must share a dtype here too):
df1.merge(df2.drop(columns=['id']), how='left', on='code')
Output:
id code name
0 1 2 Ben
1 2 3 John
2 3 3 John
3 4 1 Mary
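If duplicated codes in df2 would be a bug in your data, merge can check the many-to-one assumption for you (a defensive sketch, not required for the fix):
# raises pandas.errors.MergeError if 'code' is not unique in df2
df1.merge(df2.drop(columns=['id']), how='left', on='code', validate='m:1')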
Related
I have a dataframe where one of the column names is a variable:
xx = pd.DataFrame([{'ID':1, 'Name': 'Abe', 'HasCar':1},
{'ID':2, 'Name': 'Ben', 'HasCar':0},
{'ID':3, 'Name': 'Cat', 'HasCar':1}])
ID Name HasCar
0 1 Abe 1
1 2 Ben 0
2 3 Cat 1
In this dummy example, column 2 could be "HasCar", or "IsStaff", or some other unknowable value. I want to select all rows where column 2 is true, whatever the column name is.
I've tried the following without success:
xx.iloc[:,[2]] == 1
HasCar
0 True
1 False
2 True
and then trying to use that as an index results in:
xx[xx.iloc[:,[2]] == 1]
ID Name HasCar
0 NaN NaN 1.0
1 NaN NaN NaN
2 NaN NaN 1.0
Which isn't helpful. I suppose I could rename column 2, but that feels a little wrong. The issue seems to be that xx.iloc[:,[2]] returns a DataFrame while xx['HasCar'] returns a Series. I can't figure out how to force an (x, 1)-shaped DataFrame into a Series without knowing the column name, as described here.
Any ideas?
You were almost correct, but you sliced in 2D; use Series slicing instead:
xx[xx.iloc[:, 2] == 1]
Output:
ID Name HasCar
0 1 Abe 1
2 3 Cat 1
The difference:
# 2D slicing, this gives a DataFrame (with a single column)
xx.iloc[:,[2]]
HasCar
0 1
1 0
2 1
# 1D slicing, as Series
xx.iloc[:,2]
0 1
1 0
2 1
Name: HasCar, dtype: int64
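For completeness, if you do end up with a single-column DataFrame as in the question, DataFrame.squeeze() collapses it into a Series, so this works too:
# squeeze() turns the (n, 1) DataFrame into a Series
xx[xx.iloc[:, [2]].squeeze() == 1]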
I have a dataframe with employees and their levels.
import pandas as pd
d = {'employees': ["John", "Jamie", "Ann", "Jane", "Kim", "Steve"], 'Level': ["A/Ba", "C/A", "A", "C", "Ba/C", "D"]}
df = pd.DataFrame(data=d)
How do I add a new column that counts the number of other employees sharing the same levels? For example, John would have 3, as there are two other A's (Jamie and Ann) and one other Ba (Kim). Note that the employee's own levels (John's, in this case) are not included in the count.
My goal is for the end dataframe to be this:
employees Level Number of levels
0 John A/Ba 3
1 Jamie C/A 4
2 Ann A 2
3 Jane C 2
4 Kim Ba/C 3
5 Steve D 0
Try this:
df['Number of levels'] = df['Level'].str.split('/').explode().map(df['Level'].str.split('/').explode().value_counts()).sub(1).groupby(level=0).sum()
Output:
>>> df
employees Level Number of levels
0 John A/Ba 3
1 Jamie C/A 4
2 Ann A 2
3 Jane C 2
4 Kim Ba/C 3
5 Steve D 0
exploded = df.Level.str.split("/").explode()
counts = exploded.groupby(exploded).transform("count").sub(1)
df["Num Levels"] = counts.groupby(level=0).sum()
We first explode the "Level" column by splitting over "/" so we can reach each level:
>>> exploded = df.Level.str.split("/").explode()
>>> exploded
0 A
0 Ba
1 C
1 A
2 A
3 C
4 Ba
4 C
5 D
Name: Level, dtype: object
We now need the count of each element in this series, so we group the series by itself and transform with "count":
>>> exploded.groupby(exploded).transform("count")
0 3
0 2
1 3
1 3
2 3
3 3
4 2
4 3
5 1
Name: Level, dtype: int64
Since this counts each element including itself, but we only want the other occurrences, we subtract 1:
>>> counts = exploded.groupby(exploded).transform("count").sub(1)
>>> counts
0 2
0 1
1 2
1 2
2 2
3 2
4 1
4 2
5 0
Name: Level, dtype: int64
Now we need to "come back" to the original rows, and the index is our helper for that: we group by it (that is what level=0 means) and sum the counts:
>>> counts.groupby(level=0).sum()
0 3
1 4
2 2
3 2
4 3
5 0
Name: Level, dtype: int64
This is the end result, which is assigned to df["Num Levels"] to get:
employees Level Num Levels
0 John A/Ba 3
1 Jamie C/A 4
2 Ann A 2
3 Jane C 2
4 Kim Ba/C 3
5 Steve D 0
This can all be written in "one line", but that may hinder readability and further debugging!
df["Num Levels"] = (df.Level
.str.split("/")
.explode()
.pipe(lambda ex: ex.groupby(ex))
.transform("count")
.sub(1)
.groupby(level=0)
.sum())
I have a data frame like so:
CategoryNumber
1
2
3
1
3
I want to create a new column 'Category' that assigns values based on the value in the 'CategoryNumber' column, like so:
CategoryNumber Category
1 First Category
2 Second Category
3 Third Category
1 First Category
3 Third Category
How do I do this using Python and pandas?
Using map (here mapping the numbers 1-25 to lowercase letters, as a demonstration):
import string
df.CategoryNumber.map(dict(zip(range(1,26),string.ascii_lowercase)))
Out[472]:
0 a
1 b
2 c
3 a
4 c
Name: CategoryNumber, dtype: object
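If you want the literal labels from the question rather than letters, the same map idea works with an explicit dict (the labels below are taken from the desired output):
df['Category'] = df['CategoryNumber'].map(
    {1: 'First Category', 2: 'Second Category', 3: 'Third Category'})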
You can use category codes straight from pandas:
First, make the column a category.
Call cat.codes.
Assign it to your new column.
df['Category2'] = df['CategoryNumber'].astype('category').cat.codes
CategoryNumber Category2
0 1 0
1 2 1
2 3 2
3 1 0
4 3 2
If you need to make it A, B, C, etc., look at map:
df['Letters'] = df['Category2'].map(dict(enumerate(string.ascii_uppercase)))
CategoryNumber Category2 Letters
0 1 0 A
1 2 1 B
2 3 2 C
3 1 0 A
4 3 2 C
In pandas and Python:
I have a large dataset of health records where patients have records of diagnoses.
How do I display the most frequent diagnoses, counting at most one occurrence of the same diagnosis per patient?
Example ('pid' is patient id. 'code' is the code of a diagnosis):
IN:
pid code
1 A
1 B
1 A
1 A
2 A
2 A
2 B
2 A
3 B
3 C
3 D
4 A
4 A
4 A
4 B
OUT:
B 4
A 3
C 1
D 1
I would like to be able to use .isin and .index if possible. For example, this removes all rows whose value in column 'code' occurs fewer than 3 times:
s = df['code'].value_counts().ge(3)
df = df[df['code'].isin(s[s].index)]
You can use groupby + nunique:
df.groupby(by='code').pid.nunique().sort_values(ascending=False)
Out[60]:
code
B 4
A 3
D 1
C 1
Name: pid, dtype: int64
To remove all rows whose code occurs for fewer than 3 distinct patients:
df.groupby(by='code').filter(lambda x: x.pid.nunique()>=3)
Out[55]:
pid code
0 1 A
1 1 B
2 1 A
3 1 A
4 2 A
5 2 A
6 2 B
7 2 A
8 3 B
11 4 A
12 4 A
13 4 A
14 4 B
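groupby.filter calls the lambda once per group, which can be slow with many groups; a vectorized equivalent (a sketch using transform) is:
df[df.groupby('code').pid.transform('nunique') >= 3]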
Since you mention value_counts:
df.groupby('code').pid.value_counts().groupby(level=0).count()
Out[42]:
code
A 3
B 4
C 1
D 1
Name: pid, dtype: int64
You should be able to use the groupby and nunique() functions to obtain a distinct count of patients that had each diagnosis. This should give you the result you need:
df[['pid', 'code']].groupby(['code']).nunique()
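Since you asked about keeping the value_counts / .isin pattern: deduplicating the (pid, code) pairs first reduces the task to a plain value_counts, so your original snippet then works unchanged (a sketch of that approach):
deduped = df.drop_duplicates(['pid', 'code'])
counts = deduped['code'].value_counts()  # B 4, A 3, C 1, D 1
# keep rows whose code occurs for at least 3 distinct patients
s = counts.ge(3)
df[df['code'].isin(s[s].index)]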
I have simple data:
type age
A 4
A 4
B 4
A 5
I want to get
type age count
A 4 2
A 5 1
B 4 1
How do I perform this in pandas? What should I do after df.groupby(['type'])?
Let's use groupby with 'type' and 'age', then count and reset_index:
df.groupby(['type','age'])['age'].count().reset_index(name='count')
Output:
type age count
0 A 4 2
1 A 5 1
2 B 4 1
You could also do:
df.groupby(['type','age']).size().reset_index(name='count')
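On pandas 1.1+ there is also DataFrame.value_counts, which does the same in one call (sorted by count rather than by key):
df.value_counts(['type', 'age']).reset_index(name='count')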