Pandas label encoding column with default label for invalid row values - python

For a data frame I replaced a set of items in a column with a range of values as follows:
df['borough_num'] = df['Borough'].replace(regex=['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX'], value=[1, 2, 3, 4,5])
The issue is that I want to replace all the remaining elements in 'Borough' (those not mentioned above) with the value 0.
I also need regex matching because some values look like '07 BRONX'; those should still be replaced by 5, not 0.

To replace all other values with 0, you can do:
# create the mapping dict
new_values = ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND', 'BRONX']
maps = dict(zip(new_values, range(1, len(new_values) + 1)))
# map the values, defaulting to 0 for anything not in the dict
df['borough_num'] = df['Borough'].apply(lambda x: maps.get(x, 0))
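The question also mentions strings like '07 BRONX' that should still map to 5. As a minimal sketch of one way to handle that (my own addition, assuming the borough name always appears somewhere inside the string), you can extract the name with a regex first and then reuse the same dict:
pattern = '(' + '|'.join(new_values) + ')'   # e.g. '07 BRONX' matches 'BRONX'
extracted = df['Borough'].str.extract(pattern, expand=False)
df['borough_num'] = extracted.map(maps).fillna(0).astype(int)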

Using map with fillna (data from cold): any value not in the map dict returns NaN, then we just fillna with 0:
df.Borough.map(dict(zip(['QUEENS', 'BRONX'],[1,2]))).fillna(0).astype(int)
0 1
1 2
2 2
3 0
Name: Borough, dtype: int32

I see you want to perform category encoding with some imposed order. I would recommend using pd.Categorical with ordered=True:
df = pd.DataFrame({
    'Borough': ['QUEENS', 'BRONX', 'MANHATTAN', 'BROOKLYN', 'INVALID']})
df
Borough
0 QUEENS
1 BRONX
2 MANHATTAN
3 BROOKLYN
4 INVALID
keys = ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX']
df['borough_num'] = pd.Categorical(
    df['Borough'], categories=keys, ordered=True).codes + 1
df
Borough borough_num
0 QUEENS 3
1 BRONX 5
2 MANHATTAN 1
3 BROOKLYN 2
4 INVALID 0
pd.Categorical assigns the code -1 to strings that are not in categories:
pd.Categorical(
    df['Borough'], categories=keys, ordered=True).codes
array([ 2, 4, 0, 1, -1], dtype=int8)
This should be much faster than using replace anyway, but for reference, here is the equivalent with map and a defaultdict:
from collections import defaultdict
d = defaultdict(int)
d.update(dict(zip(keys, range(1, len(keys) + 1))))
df['borough_num'] = df['Borough'].map(d)
df
Borough borough_num
0 QUEENS 3
1 BRONX 5
2 MANHATTAN 1
3 BROOKLYN 2
4 INVALID 0
Because d is a defaultdict(int), any borough not listed in keys falls back to int(), i.e. 0.

You can also use np.where:
Creating a dummy DataFrame
df = pd.DataFrame({'Borough': ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX', 'TEST']})
df
Borough
0 MANHATTAN
1 BROOKLYN
2 QUEENS
3 STATEN ISLAND
4 BRONX
5 TEST
Your Operation:
df['borough_num'] = df['Borough'].replace(regex=['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX'], value=[1, 2, 3, 4,5])
df
Borough borough_num
0 MANHATTAN 1
1 BROOKLYN 2
2 QUEENS 3
3 STATEN ISLAND 4
4 BRONX 5
5 TEST TEST
Replacing the values of borough_num with 0 where Borough is not in keys, using np.where:
import numpy as np
keys = ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND', 'BRONX']
df['borough_num'] = np.where(~df['Borough'].isin(keys), 0, df['borough_num'])
df
Borough borough_num
0 MANHATTAN 1
1 BROOKLYN 2
2 QUEENS 3
3 STATEN ISLAND 4
4 BRONX 5
5 TEST 0
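The borough_num column will typically still have object dtype at this point (it mixes the values produced by replace with the 0s); as a usage note, not part of the original answer, you can cast it if you want integers:
df['borough_num'] = df['borough_num'].astype(int)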

Related

Conditionals with NaN in python [duplicate]

I have a simple DataFrame like the following:
Team First Season Total Games
0 Dallas Cowboys 1960 894
1 Chicago Bears 1920 1357
2 Green Bay Packers 1921 1339
3 Miami Dolphins 1966 792
4 Baltimore Ravens 1996 326
5 San Franciso 49ers 1950 1003
I want to select all values from the 'First Season' column and replace those that are over 1990 by 1. In this example, only Baltimore Ravens would have the 1996 replaced by 1 (keeping the rest of the data intact).
I have used the following:
df.loc[(df['First Season'] > 1990)] = 1
But, it replaces all the values in that row by 1, and not just the values in the 'First Season' column.
How can I replace just the values from that column?
You need to select that column:
In [41]:
df.loc[df['First Season'] > 1990, 'First Season'] = 1
df
Out[41]:
Team First Season Total Games
0 Dallas Cowboys 1960 894
1 Chicago Bears 1920 1357
2 Green Bay Packers 1921 1339
3 Miami Dolphins 1966 792
4 Baltimore Ravens 1 326
5 San Franciso 49ers 1950 1003
So the syntax here is:
df.loc[<mask> (the mask generates the row labels to index), <optional column(s)>]
You can check the docs and also the '10 minutes to pandas' guide, which shows the semantics.
EDIT
If you want to generate a boolean indicator, then you can just use the boolean condition to generate a boolean Series and cast the dtype to int; this will convert True and False to 1 and 0 respectively:
In [43]:
df['First Season'] = (df['First Season'] > 1990).astype(int)
df
Out[43]:
Team First Season Total Games
0 Dallas Cowboys 0 894
1 Chicago Bears 0 1357
2 Green Bay Packers 0 1339
3 Miami Dolphins 0 792
4 Baltimore Ravens 1 326
5 San Franciso 49ers 0 1003
A bit late to the party but still - I prefer using numpy where:
import numpy as np
df['First Season'] = np.where(df['First Season'] > 1990, 1, df['First Season'])
df.loc[df['First Season'] > 1990, 'First Season'] = 1
Explanation:
df.loc takes two arguments: a row indexer and a column indexer. We check, for each row, whether the value in the 'First Season' column is greater than 1990, and if so we replace it with 1.
df['First Season'].loc[df['First Season'] > 1990] = 1
Strange that nobody has this answer: the only missing part of your code is the ['First Season'] right after df (the extra parentheses inside can also be dropped). Note that this is chained indexing, which pandas discourages (it can raise a SettingWithCopyWarning); df.loc[df['First Season'] > 1990, 'First Season'] = 1 is the safer form.
For a single condition, i.e. (df['employrate'] > 70):
country employrate alcconsumption
0 Afghanistan 55.7000007629394 .03
1 Albania 51.4000015258789 7.29
2 Algeria 50.5 .69
3 Andorra 10.17
4 Angola 75.6999969482422 5.57
use this:
df.loc[df['employrate'] > 70, 'employrate'] = 7
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 51.400002 7.29
2 Algeria 50.500000 .69
3 Andorra nan 10.17
4 Angola 7.000000 5.57
therefore the syntax here is:
df.loc[<mask> (the mask generates the row labels to index), <optional column(s)>]
For multiple conditions, i.e. (df['employrate'] <= 55) & (df['employrate'] > 50),
use this:
df['employrate'] = np.where(
    (df['employrate'] <= 55) & (df['employrate'] > 50), 11, df['employrate']
)
Out[108]:
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 11.000000 7.29
2 Algeria 11.000000 .69
3 Andorra nan 10.17
4 Angola 75.699997 5.57
therefore the syntax here is:
df['<column_name>'] = np.where((<filter 1>) & (<filter 2>), <new value>, df['<column_name>'])
Another option is to use a list comprehension:
df['First Season'] = [1 if year > 1990 else year for year in df['First Season']]
You can also use mask, which replaces the values where the condition is met. Note that mask returns a new Series rather than modifying in place, so assign the result back:
df['First Season'] = df['First Season'].mask(lambda col: col > 1990, 1)
We can update the First Season column in df with the following syntax:
df['First Season'] = expression_for_new_values
To map the values in First Season we can use pandas' .map() method with the syntax below:
data_frame['column'].map({'initial_value_1': 'updated_value_1', 'initial_value_2': 'updated_value_2'})
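As a minimal sketch of that pattern on this example (assuming, for illustration, that only the 1996 value needs remapping), note that values without a key in the dict become NaN and have to be filled back in:
df['First Season'] = (
    df['First Season']
    .map({1996: 1})                # years without a key become NaN
    .fillna(df['First Season'])    # restore the untouched years
    .astype(int)
)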

How can I compare two dataframes row by row in pandas with insensitive cases?

Here is my data:
d = {'ID': [14, 14, 14, 14, 14, 14, 15, 15, 14], 'NAME': ['KWI', 'NED', 'RICK', 'NICH', 'DIONIC', 'RICHARD', 'ROCKY', 'CARLOS', 'SIDARTH'], 'ID_COUNTRY':[1, 2, 3,4,5,6,7,8,9], 'COUNTRY':['MEXICO', 'ITALY', 'CANADA', 'ENGLAND', 'GERMANY', 'UNITED STATES', 'JAPAN', 'SPAIN', 'BRAZIL'], 'ID_CITY':[10, 20, 21, 31, 18, 27, 36, 86, 28], 'CITY':['MX', 'IT', 'CA', 'ENG', 'GE', 'US', 'JP', 'SP', 'BZ'], 'STATUS': ['OK', 'OK', 'OK', 'OK', 'OK', 'NOT', 'OK', 'NOT', 'OK']}
df = pd.DataFrame(data=d)
df:
ID NAME ID_COUNTRY COUNTRY ID_CITY CITY STATUS
0 14 KWI 1 MEXICO 10 MX OK
1 14 NED 2 ITALY 20 IT OK
2 14 RICK 3 CANADA 21 CA OK
3 14 NICH 4 ENGLAND 31 ENG OK
4 14 DIONIC 5 GERMANY 18 GE OK
5 14 RICHARD 6 UNITED STATES 27 US NOT
6 14 ROCKY 7 JAPAN 36 JP OK
7 15 CARLOS 8 SPAIN 86 SP NOT
8 15 SIDHART 9 BRAZIL 28 BZ OK
The df is the base data. The data that I need to compare with df is df1:
d1 = {'ID': [14, 10, 14, 11, 14], 'NAME': ['Kwi', 'NED', 'riCK', 'nich', 'DIONIC'], 'ID_COUNTRY':[1, 2, 3, 6, 5], 'COUNTRY':['MXICO', 'itaLY', 'CANADA', 'ENGLAND', 'GERMANY'], 'ID_CITY':[10, 22, 21, 31, 18], 'CITY':['MX', 'AT', 'CA', 'ENG', 'EG'], 'STATUS': ['', 'OK', '', 'OK', '']}
df1 = pd.DataFrame(data=d1)
df1:
ID NAME ID_COUNTRY COUNTRY ID_CITY CITY STATUS
0 14 Kwi 1 MXICO 10 MX
1 10 NED 2 itaLY 22 AT OK
2 14 riCK 3 CANADA 21 CA
3 11 nich 6 ENGLAND 31 ENG OK
4 14 DIONIC 5 GERMANY 18 EG
Desired output 1 (the values that do not match must appear highlighted):
The data in df1 that does not match df is:
ID NAME ID_COUNTRY COUNTRY ID_CITY CITY STATUS
0 14 Kwi 1 *MXICO* 10 MX **
1 *10* NED 2 itaLY *22* AT OK
2 14 riCK 3 CANADA 21 CA **
3 *11* nich 6 ENGLAND 31 ENG OK
4 14 DIONIC 5 GERMANY 18 *EG* **
*TWO ROWS ARE MISSING*
Note: In this output the row-by-row comparisons need to be case-insensitive for strings, so values like itaLY, Kwi, riCK, nich are fine because they are the same.
Desired output 2:
The data in df1 that does not match df is in:
COUNTRY, STATUS with ID 14, NAME Kwi, ID_COUNTRY 1.
ID, ID_CITY, CITY with ID 10, NAME NED, ID_COUNTRY 2.
STATUS with ID 14, NAME riCK, ID_COUNTRY 3.
ID, ID_COUNTRY with ID 11, NAME nich, ID_COUNTRY 6.
CITY, STATUS with ID 14, NAME DIONIC, ID_COUNTRY 5.
TWO ROWS ARE MISSING.
The result only needs to compare data up to the length of df1, but there is also the possibility that rows mismatch by the ID from df: as I show here (14), the rows with ID 15 are not considered. I think the second output is more specific and efficient; the first one will be slow to read if there is a lot of data to compare.
I hope everyone understands the point of this issue and can find an answer. I have been struggling with this for some time and did not get the solution I want, which is why I came here. Thanks for reading, and I hope to contribute to this platform.
When one wants a case-insensitive comparison between strings in Python, one typically converts both strings to upper or lower case and then does a traditional == or != comparison.
When using pandas, this can be achieved by the .str Series method, which allows the use of string methods such as .upper() and .lower(). In your case, a possible solution would be:
df, df1 = df.astype(str), df1.astype(str)
_df = df1.copy()
for i in df1.index:
    comparison = df.loc[i].str.upper() != df1.loc[i].str.upper()
    _df.loc[i, comparison] = '*' + df1.loc[i, comparison].astype(str) + '*'
If we print the resulting dataframe _df we get your desired output 1:
ID NAME ID_COUNTRY COUNTRY ID_CITY CITY STATUS
0 14 Kwi 1 *MXICO* 10 MX **
1 *10* NED 2 itaLY *22* *AT* OK
2 14 riCK 3 CANADA 21 CA **
3 *11* nich *6* ENGLAND 31 ENG OK
4 14 DIONIC 5 GERMANY 18 *EG* **
In this case I'm assuming that corresponding rows have the same index across both dataframes.
For your second desired output, you can just iterate over each row again:
print("Data in df1 that does't match df:")
for i in _df.index:
not_matching_cols = _df.loc[i].str.endswith('*')
if not_matching_cols.any():
print(','.join(_df.loc[i, not_matching_cols].index), end=' ')
print('with', 'NAME', df1.loc[i, 'NAME'], 'ID_COUNTRY', df1.loc[i, 'ID_COUNTRY'])
If you also want to print the number of rows missing from df1, you can just add:
print(df.shape[0] - df1.shape[0], 'ROWS ARE MISSING')
The output of this last part should be:
Data in df1 that doesn't match df:
COUNTRY,STATUS with NAME Kwi ID_COUNTRY 1
ID,ID_CITY,CITY with NAME NED ID_COUNTRY 2
STATUS with NAME riCK ID_COUNTRY 3
ID,ID_COUNTRY with NAME nich ID_COUNTRY 6
CITY,STATUS with NAME DIONIC ID_COUNTRY 5
4 ROWS ARE MISSING
I am not sure what code you have been using to compare row by row or what your conditions are, but one thing you can try is converting all the string values to lowercase...
df.update(df.select_dtypes('object').applymap(str.lower))
# 'object' is the dtype pandas uses for strings (this assumes every object cell really is a string)
Or if you want to preserve the original columns, you could try making new, temporary columns...
df['name_lower'] = df['name'].apply(str.lower)
df1['name_lower'] = df1['name'].apply(str.lower)
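If you want to keep the comparison fully vectorised, a minimal sketch of an alternative (my own suggestion, not from the answers above) is to lower-case whole frames with .str.lower and compare the aligned copies directly:
df_low = df.astype(str).apply(lambda col: col.str.lower())
df1_low = df1.astype(str).apply(lambda col: col.str.lower())
# boolean mask of cells in df1 that differ from df, aligned on df1's index
mismatch = df_low.reindex(df1_low.index) != df1_low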

Missing first row when construction a Series from a DataFrame

I have a dictionary I call 'test_dict'
test_dict = {'OBJECTID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'Country': {0: 'Vietnam',
1: 'Vietnam',
2: 'Vietnam',
3: 'Vietnam',
4: 'Vietnam'},
'Location': {0: 'Nha Trang',
1: 'Hue',
2: 'Phu Quoc',
3: 'Chu Lai',
4: 'Lao Bao'},
'Lat': {0: 12.250000000000057,
1: 16.401000000000067,
2: 10.227000000000032,
3: 15.406000000000063,
4: 16.627300000000048},
'Long': {0: 109.18333300000006,
1: 107.70300000000009,
2: 103.96700000000004,
3: 108.70600000000007,
4: 106.59970000000004}}
That I convert to a DataFrame
test_df = pd.DataFrame(test_dict)
and I get this:
OBJECTID Country Location Lat Long
0 1 Vietnam Nha Trang 12.2500 109.183333
1 2 Vietnam Hue 16.4010 107.703000
2 3 Vietnam Phu Quoc 10.2270 103.967000
3 4 Vietnam Chu Lai 15.4060 108.706000
4 5 Vietnam Lao Bao 16.6273 106.599700
I want to construct a Series with the location names, and I would like the column "OBJECTID" to be the index. When I try it, I lose the first row.
pd.Series(test_df.Location, index=test_df.OBJECTID)
I get this:
OBJECTID
1 Hue
2 Phu Quoc
3 Chu Lai
4 Lao Bao
5 NaN
Name: Location, dtype: object
What I was hoping to get was this:
OBJECTID
1 Nha Trang
2 Hue
3 Phu Quoc
4 Chu Lai
5 Lao Bao
What am I doing wrong here? Why is the process of converting into a Series losing the first row?
You can fix your code via
pd.Series(test_df.Location.values, index=test_df.OBJECTID)
because the problem is that test_df.Location already carries its own index (starting at 0), and the Series constructor aligns on that index instead of simply relabelling the values.
Edit - my preferred alternative:
test_df.set_index('OBJECTID')['Location']
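which, as a quick check against the data above, should give:
OBJECTID
1 Nha Trang
2 Hue
3 Phu Quoc
4 Chu Lai
5 Lao Bao
Name: Location, dtype: object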
Note that reindexing by OBJECTID does not fix this, because reindex also aligns on the existing 0-based index:
pd.Series(test_df.Location).reindex(test_df.OBJECTID)
Result:
OBJECTID
1 Hue
2 Phu Quoc
3 Chu Lai
4 Lao Bao
5 NaN
Name: Location, dtype: object

Operations on multiple data frame in PANDAS

I have several tables that look like this:
ID YY ZZ
2 97 826
2 78 489
4 47 751
4 110 322
6 67 554
6 88 714
code:
raw = {'ID': [2, 2, 4, 4, 6, 6,],
'YY': [97,78,47,110,67,88],
'ZZ':[826,489,751,322,554,714]}
df = pd.DataFrame(raw)
For each of these dfs, I have to perform a number of operations:
First, group by ID,
extract the length and the average of the column ZZ,
and put the results in a new df.
The new df looks like this:
Cities length mean
Paris 0 0
Madrid 0 0
Berlin 0 0
Warsaw 0 0
London 0 0
code:
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
'length': 0,
'mean': 0}
df2 = pd.DataFrame(raw2)
I pulled out the average and the size of individual groups
df_grouped = df.groupby('ID').ZZ.size()
df_grouped2 = df.groupby('ID').ZZ.mean()
The problem occurs when I try to transfer the results to the new table, because the grouped result does not contain all the cities and the values must be matched by the appropriate key.
I tried to use a dictionary:
dic_cities = {"Paris":df_grouped.loc[2],
"Madrid":df_grouped.loc[4],
"Warsaw":df_grouped.loc[6],
"Berlin":df_grouped.loc[8],
"London":df_grouped.loc[10]}
Unfortunately, I'm receiving KeyError: 8
I have 19 df's from which I have to extract this data and the final tables have to look like this:
Cities length mean
Paris 2 657.5
Madrid 2 536.5
Berlin 0 0.0
Warsaw 2 634.0
London 0 0.0
Does anyone know how to deal with this using groupby and the dictionary, or know a better way to do it?
First, you should index df2 on 'Cities':
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
'length': 0,
'mean': 0}
df2 = pd.DataFrame(raw2).set_index('Cities')
Then you should reverse your dictionary:
dic_cities = {2: "Paris",
4: "Madrid",
6: "Warsaw",
8: "Berlin",
10: "London"}
Once this is done, the processing is as simple as a groupby:
import numpy as np

for i, sub in df.groupby('ID'):
    df2.loc[dic_cities[i]] = sub.ZZ.agg([len, np.mean]).tolist()
Which gives for df2:
length mean
Cities
Paris 2.0 657.5
Madrid 2.0 536.5
Berlin 0.0 0.0
Warsaw 2.0 634.0
London 0.0 0.0
See this:
import pandas as pd
# setup raw data
raw = {'ID': [2, 2, 4, 4, 6, 6,], 'YY': [97,78,47,110,67,88], 'ZZ':[826,489,751,322,554,714]}
df = pd.DataFrame(raw)
# get mean values
mean_values = df.groupby('ID').mean()
# drop column
mean_values = mean_values.drop(['YY'], axis=1)
# get occurrence number
occurrence = df.groupby('ID').size()
# save data
result = pd.concat([occurrence, mean_values], axis=1, sort=False)
# rename columns
result.rename(columns={0:'length', 'ZZ':'mean'}, inplace=True)
# city data
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'], 'length': 0, 'mean': 0}
df2 = pd.DataFrame(raw2)
# rename indexes
df2 = df2.rename(index={0: 2, 1: 4, 2: 8, 3: 6, 4: 10})
# merge data
df2['length'] = result['length']
df2['mean'] = result['mean']
Output:
Cities length mean
2 Paris 2.0 657.5
4 Madrid 2.0 536.5
8 Berlin NaN NaN
6 Warsaw 2.0 634.0
10 London NaN NaN
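If the goal is to end up with 0 instead of NaN for the missing cities, a minimal sketch of a more compact variant (my own suggestion, assuming pandas >= 0.25 for the named aggregation) could look like this:
import pandas as pd

id_to_city = {2: 'Paris', 4: 'Madrid', 8: 'Berlin', 6: 'Warsaw', 10: 'London'}
# group once, then reindex onto the full set of IDs, filling absent ones with 0
summary = df.groupby('ID')['ZZ'].agg(length='size', mean='mean')
summary = summary.reindex(list(id_to_city), fill_value=0)
summary.index = summary.index.map(id_to_city)
summary.index.name = 'Cities'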

ignoring hierarchical index during matrix operations

In the last statement of this routine I get a TypeError
from pandas import DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Missouri'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'items': [5, 12, 6, 45, 0]}
frame = DataFrame(data)

def summary_pivot(df, row=['state'], column=['year'], value=['items'], func=len):
    # rows=/cols= are the old pivot_table keywords (index=/columns= in current pandas)
    return df.pivot_table(value, rows=row, cols=column,
                          margins=True, aggfunc=func, fill_value=0)
test = summary_pivot(frame)
In [545]: test
Out[545]:
items
year 2000 2001 2002 All
state
Missouri 0 0 1 1
Nevada 0 1 0 1
Ohio 1 1 1 3
All 1 2 2 5
price = DataFrame(index=['Missouri', 'Ohio'], columns = ['price'], data = [200, 250])
In [546]: price
Out[546]:
price
Missouri 200
Ohio 250
test * price
TypeError: can only call with other hierarchical index objects
How can I get past this error, so I can multiply correctly the number of items in each state by the corresponding price?
In [659]: price = Series(index=['Missouri', 'Ohio'], data=[200, 250])
In [660]: test1 = test['items']
In [661]: test1.mul(price, axis='index')
Out[661]:
year 2000 2001 2002 All
All NaN NaN NaN NaN
Missouri 0 0 200 200
Nevada NaN NaN NaN NaN
Ohio 250 250 250 750
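For reference, a minimal sketch of the same idea without dropping the hierarchical columns (my own suggestion, not part of the original answer): multiplying the whole pivot table by the price Series along the index broadcasts across all column levels, and states without a price simply come out as NaN:
# broadcast the per-state price across every ('items', year) column
result = test.mul(price, axis=0)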
