pandas.DataFrame.round() not accepting pd.NA or pd.NAN - python

pandas version: 1.2
I have a dataframe whose columns are 'float64', with null values represented as pd.NA. Is there a way to round without converting to string and then Decimal:
df = pd.DataFrame([(.21, .3212), (.01, .61237), (.66123, .03), (.21, .18), (pd.NA, .18)],
                  columns=['dogs', 'cats'])
df
      dogs     cats
0     0.21  0.32120
1     0.01  0.61237
2  0.66123  0.03000
3     0.21  0.18000
4     <NA>  0.18000
Here is what I wanted to do, but it raises an error:
df['dogs'] = df['dogs'].round(2)
TypeError: float() argument must be a string or a number, not 'NAType'
Here is another way I tried, but it silently fails and no rounding occurs:
df.round({'dogs': 1})
      dogs     cats
0     0.21  0.32120
1     0.01  0.61237
2  0.66123  0.03000
3     0.21  0.18000
4     <NA>  0.18000

While annoying, pandas.NA is still relatively new and doesn't support all NumPy ufuncs. (The dict form, df.round({'dogs': 1}), fails silently because the dogs column is object dtype, and DataFrame.round only touches numeric columns.) Oddly, I'm also encountering errors trying to change the "dogs" column's dtype from object -> float, which seems like a bug to me. There are a couple of alternatives that can achieve your desired result, though:
Mask the NA away and round the rest of the column:
na_mask = df["dogs"].notnull()
df.loc[na_mask, "dogs"] = df.loc[na_mask, "dogs"].astype(float).round(1)
print(df)
   dogs     cats
0   0.2  0.32120
1   0.0  0.61237
2   0.7  0.03000
3   0.2  0.18000
4  <NA>  0.18000
Replace the pd.NA with np.nan and then round:
import numpy as np

df = df.replace(pd.NA, np.nan).round({"dogs": 1})
print(df)
   dogs     cats
0   0.2  0.32120
1   0.0  0.61237
2   0.7  0.03000
3   0.2  0.18000
4   NaN  0.18000
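A third option, assuming a pandas version newer than the 1.2 in the question (round support for the nullable dtypes matured after 1.2): convert the column to the nullable Float64 extension dtype, which understands pd.NA natively, and round it directly.
df['dogs'] = df['dogs'].astype('Float64').round(2)  # 'Float64' (capital F) is the nullable dtype; may not work on pandas 1.2 itself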

Alternatively, apply round element-wise and skip missing values (pd.notna is more robust than comparing the string representation):
df['dogs'] = df['dogs'].apply(lambda x: round(x, 2) if pd.notna(x) else x)

Does the following code work?
import pandas as pd
import numpy as np
df = pd.DataFrame([(.21, .3212), (.01, .61237), (.66123, .03), (.21, .18), (np.nan, .18)],
columns=['dogs', 'cats'])
df['dogs'] = df['dogs'].round(2)
print(df)
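For reference, with np.nan the dogs column stays float64 and round(2) succeeds; the printed frame should be:
   dogs     cats
0  0.21  0.32120
1  0.01  0.61237
2  0.66  0.03000
3  0.21  0.18000
4   NaN  0.18000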

Related

How to decode column value from rare label by matching column names

I have two dataframes, as shown below:
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cdf = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'grade': rng.choice(list('ACD'), size=(5)),
                    'dash': rng.choice(list('PQRS'), size=(5)),
                    'dumeel': rng.choice(list('QWER'), size=(5)),
                    'dumma': rng.choice((1234), size=(5)),
                    'target': rng.choice([0, 1], size=(5))
                    })
tdf = pd.DataFrame({'Id': [1, 1, 1, 1, 3, 3, 3],
                    'feature': ['grade=Rare', 'dash=Q', 'dumma=rare', 'dumeel=R', 'dash=Rare', 'dumma=rare', 'grade=D'],
                    'value': [0.2, 0.45, -0.32, 0.56, 1.3, 1.5, 3.7]})
My objective is to:
a) Replace the Rare or rare values in the feature column of the tdf dataframe with the original value from the cdf dataframe.
b) To identify the original value, use the string before the = (Rare, rare, etc.). That string is the column name in the cdf dataframe from which the original value can be found.
I was trying something like the below, but I'm not sure how to go from here:
replace_df = cdf.merge(tdf,how='inner',on='Id')
replace_df ["replaced_feature"] = np.where(((replace_df["feature"].str.contains('rare',regex=True)]) & (replace_df["feature"].str.split('='))])
I have to apply this to big data, where I have a million rows and more than 1000 replacements to be made like this.
I expect my output to be like the one shown below.
Here is one possible approach using MultiIndex.map to substitute values from cdf into tdf:
# split 'feature' into [column_name, value]
s = tdf['feature'].str.split('=')
# mask the rows whose value part is 'rare'/'Rare'
m = s.str[1].isin(['rare', 'Rare'])
# build an (Id, column_name) index and map it onto cdf reshaped to long form
v = tdf[m].set_index(['Id', s[m].str[0]]).index.map(cdf.set_index('Id').stack())
# recombine the column name with the looked-up value
tdf.loc[m, 'feature'] = s[m].str[0] + '=' + v.astype(str)
print(tdf)
   Id     feature  value
0   1     grade=D   0.20
1   1      dash=Q   0.45
2   1  dumma=1123  -0.32
3   1    dumeel=R   0.56
4   3      dash=P   1.30
5   3   dumma=849   1.50
6   3     grade=D   3.70
# list comprehension to find where rare is in the feature col
tdf['feature'] = [x if y.lower() == 'rare' else x + '=' + y for x, y in tdf['feature'].str.split('=')]
# create a mask where feature is in columns of cdf
mask = tdf['feature'].isin(cdf.columns)
# use loc to filter your frame and use merge to join cdf on the Id and feature column - after you use stack
tdf.loc[mask, 'feature'] = tdf.loc[mask, 'feature'] + '=' + tdf.loc[mask].merge(
    cdf.set_index('Id').stack().to_frame(), right_index=True, left_on=['Id', 'feature'])[0].astype(str)
   Id     feature  value
0   1     grade=D   0.20
1   1      dash=Q   0.45
2   1  dumma=1123  -0.32
3   1    dumeel=R   0.56
4   3      dash=P   1.30
5   3   dumma=849   1.50
6   3     grade=D   3.70
My feeling is there's no need to look for Rare values.
Extract the column name from tdf to look up in cdf. Then flatten your cdf dataframe to extract the right values:
r = tdf.set_index('Id')['feature'].str.split('=').str[0].str.lower()
tdf['feature'] = r.values + '=' + cdf.set_index('Id').unstack() \
                                     .loc[zip(r.values, r.index)] \
                                     .astype(str).values
Output:
>>> tdf
   Id     feature  value
0   1     grade=D   0.20
1   1      dash=Q   0.45
2   1  dumma=1123  -0.32
3   1    dumeel=R   0.56
4   3      dash=P   1.30
5   3   dumma=849   1.50
6   3     grade=A   3.70
>>> r
Id             # <- the index is the row of cdf
1      grade   # <- the values are the column of cdf
1       dash
1      dumma
1     dumeel
3       dash
3      dumma
3      grade
Name: feature, dtype: object

Pandas : Top N results with float and string list

idx  float str+list
  1   -0.2    [A,B]
  1   -0.1    [A,D]
  1    0.2    [B,C]
To get the best result:
df.loc[df['float'].idxmax()]['str+list']
How can I get the top 2 idxmax results?
nlargest gives me an error.
Use DataFrame.nlargest:
s = df.nlargest(2, 'float')['str+list']
print (s)
2 [B,C]
1 [A,D]
Name: str+list, dtype: object
Or sort and select the top N values:
df.sort_values('float', ascending=False)['str+list'].head(2)
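If nlargest raises a TypeError, a common cause is that the float column is actually stored as strings (object dtype); a minimal sketch converting it first, assuming the frame above:
df['float'] = pd.to_numeric(df['float'])  # object -> float64, after which nlargest works
df.nlargest(2, 'float')['str+list']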

How to check whether value is between two numbers which is in one cell

I have a dataset with two columns:
import pandas as pd
dict = {'val': ["3.2", "2.4", "-2.3", "-4.9"],
        'conf_interval': ["[-0.83, -1.78]", "[0.71, 2.78]", "[-0.91, -2.28]", "[-0.69, -2.14]"]}
df = pd.DataFrame(dict)
df
    val   conf_interval
0   3.2  [-0.83, -1.78]
1   2.4    [0.71, 2.78]
2  -2.3  [-0.91, -2.28]
3  -4.9  [-0.69, -2.14]
I want to check which of the values in column val lie between the two values in column conf_interval. Is the only way to split the conf_interval column into two columns, or is there another way without splitting it?
The desired output is something like this:
    val   conf_interval  result
0   3.2  [-1.78, -0.83]   False
1   2.4    [0.71, 2.78]    True
2  -2.3  [-2.28, -0.91]   False
3  -4.9    [0.69, 2.14]   False
Use Series.between, with the conf_interval column converted to two float columns by Series.str.split:
df1 = df['conf_interval'].str.strip('[]').str.split(', ', expand=True).astype(float)
df['result'] = df['val'].astype(float).between(df1[0], df1[1])
print (df)
    val   conf_interval  result
0   3.2  [-0.83, -1.78]   False
1   2.4    [0.71, 2.78]    True
2  -2.3  [-0.91, -2.28]   False
3  -4.9  [-0.69, -2.14]   False
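If the bounds can appear in either order, as in the question's input where some intervals are written high-to-low, compare against the row-wise min and max instead; a small variation on the same idea (the results happen to match here either way):
lo, hi = df1.min(axis=1), df1.max(axis=1)  # order-independent bounds
df['result'] = df['val'].astype(float).between(lo, hi)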
I've used the intervals from the expected output's dataframe, where the left-hand side is lower than the right-hand side. Here's one approach using pd.IntervalIndex:
from ast import literal_eval
df['conf_interval'] = df.conf_interval.map(literal_eval)
df['val'] = pd.to_numeric(df.val)
intervals = pd.IntervalIndex.from_tuples(list(map(tuple, df.conf_interval)))
df['result'] = intervals.contains(df.val)
print(df)
    val   conf_interval  result
0   3.2  [-1.78, -0.83]   False
1   2.4    [0.71, 2.78]    True
2  -2.3  [-2.28, -0.91]   False
3  -4.9    [0.69, 2.14]   False
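Note that pd.IntervalIndex.from_tuples defaults to closed='right', so a value equal to the left endpoint would not count as contained; pass closed='both' if both endpoints should match:
intervals = pd.IntervalIndex.from_tuples(list(map(tuple, df.conf_interval)), closed='both')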

Pandas Dataframe: Multiplying Two Columns

I am trying to multiply two columns (ActualSalary * FTE) within the dataframe (OPR) to create a new column (FTESalary), but somehow it has stopped at row 21357, I don't understand what went wrong or how to fix it. The two columns came from importing a csv file using the line: OPR = pd.read_csv('OPR.csv', encoding='latin1')
[In] OPR
[out]
  ActualSalary  FTE
         44600    1
     58,000.00    1
     70,000.00    1
         17550    1
         34693    1
         15674  0.4
[In] OPR["FTESalary"] = OPR["ActualSalary"].str.replace(",", "").astype("float")*OPR["FTE"]
[In] OPR
[out]
  ActualSalary  FTE  FTESalary
         44600    1      44600
     58,000.00    1      58000
     70,000.00    1      70000
         17550    1        NaN
         34693    1        NaN
         15674  0.4        NaN
I am not expecting any NULL values as an output at all, I am really struggling with this. I would really appreciate the help.
Many thanks in advance! (I am new to both coding and here, please let me know via message if I have made mistakes or can improve the way I post questions here)
Sharing the data, @oppresiveslayer:
[In] OPR[0:6].to_dict()
[out]
{'ActualSalary': {0: '44600',
                  1: '58,000.00',
                  2: '70,000.00',
                  3: '39,780.00',
                  4: '0.00',
                  5: '78,850.00'},
 'FTE': {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0}}
For more information on the two columns, @charlesreid1:
[in] OPR['ActualSalary'].astype
[out]
Name: ActualSalary, Length: 21567, dtype: object>
[in] OPR['FTE'].astype
[out]
Name: FTE, Length: 21567, dtype: float64>
The version I am using:
python: 3.7.3, pandas: 0.25.1 on jupyter Notebook 6.0.0
I believe that your ActualSalary column is a mix of strings and integers. That is the only way I've been able to recreate your error:
df = pd.DataFrame(
    {'ActualSalary': ['44600', '58,000.00', '70,000.00', 17550, 34693, 15674],
     'FTE': [1, 1, 1, 1, 1, 0.4]})
>>> df['ActualSalary'].str.replace(',', '').astype(float) * df['FTE']
0 44600.0
1 58000.0
2 70000.0
3 NaN
4 NaN
5 NaN
dtype: float64
The issue arises when you try to remove the commas:
>>> df['ActualSalary'].str.replace(',', '')
0 44600
1 58000.00
2 70000.00
3 NaN
4 NaN
5 NaN
Name: ActualSalary, dtype: object
First convert them to strings, before converting back to floats.
fte_salary = (
    df['ActualSalary'].astype(str).str.replace(',', '')  # remove commas in string, e.g. '55,000.00' -> '55000.00'
    .astype(float)     # convert string column to floats
    .mul(df['FTE'])    # multiply the salary by the Full-Time-Equivalent (FTE) column
)
>>> df.assign(FTESalary=fte_salary) # Assign new column to dataframe.
ActualSalary FTE FTESalary
0 44600 1.0 44600.0
1 58,000.00 1.0 58000.0
2 70,000.00 1.0 70000.0
3 17550 1.0 17550.0
4 34693 1.0 34693.0
5 15674 0.4 6269.6
This should work:
OTR['FTESalary'] = OTR.apply(lambda x: pd.to_numeric(x['ActualSalary'].replace(",", ""), errors='coerce') * x['FTE'], axis=1)
output
  ActualSalary  FTE  FTESalary
0        44600  1.0    44600.0
1    58,000.00  1.0    58000.0
2    70,000.00  1.0    70000.0
3        17550  1.0    17550.0
4        34693  1.0    34693.0
5        15674  0.4     6269.6
OK, I think you need to do this:
OTR['FTESalary'] = OTR.reset_index().apply(lambda x: pd.to_numeric(x['ActualSalary'].replace(",", ""), errors='coerce') * x['FTE'], axis=1).to_numpy().tolist()
I was able to do it in a couple of steps, but with a list comprehension, which might be less readable for a beginner. It makes an intermediate column that does the float conversion, since your ActualSalary column is full of strings at the start.
OPR["X"] = [float(x.replace(",","")) for x in OPR["ActualSalary"]]
OPR["FTESalary"] = OPR["X"]*OPR["FTE"]

iterating re.split() on a dataframe

I am trying to use re.split() to split a single variable in a pandas dataframe into two other variables.
My data looks like:
xg
0.05+0.43
0.93+0.05
0.00
0.11+0.11
0.00
3.94-2.06
I want to create
e a
0.05 0.43
0.93 0.05
0.00
0.11 0.11
0.00
3.94 2.06
I can do this using a for loop and indexing.
for i in range(len(df)):
    if df['xg'].str.len()[i] < 5:
        df['e'][i] = df['xg'][i]
    else:
        df['e'][i], df['a'][i] = re.split("[\+ \-]", df['xg'][i])
However, this is slow, and I don't believe it is a good way of doing this; I am trying to improve my code and my Python understanding.
I have made various attempts using np.where, a list comprehension, and apply with a lambda, but I can't get them to run. I think all my issues come from trying to apply the functions to the whole series rather than to the positional value.
If anyone has an idea of a better method than my ugly for loop, I would be very interested.
Borrowed from this answer using the str.split method with the expand argument:
https://stackoverflow.com/a/14745484/3084939
df = pd.DataFrame({'col': ['1+2', '3+4', '20', '0.6-1.6']})
df[['left', 'right']] = df['col'].str.split('[+-]', expand=True)  # split on '+' or '-'
df.head()
       col left right
0      1+2    1     2
1      3+4    3     4
2       20   20  None
3  0.6-1.6  0.6   1.6
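If numeric columns are needed afterwards, the split strings (and the None in rows with no separator) convert cleanly; a small follow-up, assuming the frame above:
df[['left', 'right']] = df[['left', 'right']].apply(pd.to_numeric)  # None becomes NaN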
This may be what you want. Not sure it's elegant, but should be faster than a python loop.
import pandas as pd
import numpy as np
data = ['0.05+0.43','0.93+0.05','0.00','0.11+0.11','0.00','3.94-2.06']
df = pd.DataFrame(data, columns=['xg'])
# Solution
tmp = df['xg'].str.split(r'[ \-+]')
df['e'] = tmp.apply(lambda x: x[0])
df['a'] = tmp.apply(lambda x: x[1] if len(x) > 1 else np.nan)
del(tmp)
Regex to retain the -ve sign:
import pandas as pd
import re
df1 = pd.DataFrame({'col': ['1+2','3+4','20','0.6-1.6']})
data = [[i] + re.findall('-*[0-9.]+', i) for i in df1['col']]
df = pd.DataFrame(data, columns=["col", "left", "right"])
print(df.head())
       col left right
0      1+2    1     2
1      3+4    3     4
2       20   20  None
3  0.6-1.6  0.6  -1.6
