Finding specific digit pattern with regex in Python

I want to replace all values in a dataframe column that start with "-99." with NaN using regex, as these are the outliers.
I used df['Item'].replace(r(^[-][9][9]\d.*$),np.NaN) but it did not work.

TL;DR
The regular expression posted by @tripleee is fine for detecting numbers (encoded as strings) starting with -99. The problem here is that you are dealing with numbers, and regular expressions are only suited for strings.
MCVE
Let's build a comprehensive example:
import numpy as np
import pandas as pd
df = pd.DataFrame([-999, -99.9, -9, 9, 99.9, 0., 1, -999], columns=['Item'])
Item
0 -999.0
1 -99.9
2 -9.0
3 9.0
4 99.9
5 0.0
6 1.0
7 -999.0
Regular Expression
You can match the outliers using the regular expression (provided the string representation is suitable); all you need is to cast (astype) to string before applying the regular expression (which lives in the str tool suite of Series).
q1 = df['Item'].astype(str).str.match(r'^-99\..*')
0 False
1 True
2 False
3 False
4 False
5 False
6 False
7 False
But if you intend to replace those values with nan using the replace function of the string accessor, it requires extra steps, because this replace function expects another string and nothing else (using np.nan or None will fail). You would have to execute:
df['Item'].astype(str).str.replace(r'^-99\..*', 'nan').astype(float)
IMO this is a pretty bad one-liner because of the "unnecessary" casting, which spoils the very nature of your data.
Logical Indexing
You are better off using logical indexing with the boolean vector above, either by replacing with a sentinel:
df.loc[q1] = np.nan
Item
0 -999.0
1 NaN
2 -9.0
3 9.0
4 99.9
5 0.0
6 1.0
7 -999.0
or slicing:
df = df.loc[~q1,:]
Item
0 -999.0
2 -9.0
3 9.0
4 99.9
5 0.0
6 1.0
7 -999.0
Anyway, converting numbers into strings to detect outliers seems a bit odd (poor performance, complex behaviour that is hard to debug, an extra copy of the data).
Float Arithmetic
Simple filter
If numbers less than -99. have no reason to be valid anyway, you can filter them out using a simple numerical criterion:
q2 = df['Item'] <= -99.
df = df.loc[~q2,:]
Item
2 -9.0
3 9.0
4 99.9
5 0.0
6 1.0
This will perform much better and avoids casting numbers to strings and back. It also avoids extra copies of the data (string, then float again, then overwriting the initial data). So it is both memory efficient (no extra copies) and computationally efficient (regular expressions are expensive) compared with your first approach.
Epsilon ball filter
If numbers less than the cutoff must be kept, you can still do this with float arithmetic. Just swap the less-than criterion for an epsilon-ball criterion around the desired value. To capture all numbers within [-100., -99.] you can use the following setup:
target = -99.5
epsilon = 0.5
q3 = np.abs(df['Item'] - target) <= epsilon
0 False
1 True
2 False
3 False
4 False
5 False
6 False
7 False
Of course you can change the target and make epsilon as small as possible within your machine precision.
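As with q1 earlier, the resulting mask can be used either to drop the flagged rows or to blank them out. A short sketch reusing the example frame above:
# Drop the rows flagged by the epsilon ball criterion...
df_filtered = df.loc[~q3, :]
# ...or keep the rows and replace the outlier values with NaN instead
df.loc[q3, 'Item'] = np.nan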

Dunno about Pandas, but the code you show lacks quotes, and of course the regex doesn't do what you say you want to do. \d.*$ says the -99 has to be followed by another digit, then anything up to the end. Probably you mean
df['Item'].replace(r'^-99\..*',np.NaN)
where the ^ anchor means beginning of line (or, here, beginning of the cell) and -99 just matches literal text. Finally \. matches a literal dot, and .* matches anything after that, up until the end of the cell.
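For completeness, a minimal sketch of how that corrected pattern could be applied in pandas (note that Series.replace needs regex=True, and, as explained above, the values must be strings for a regex to match at all; the column below is hypothetical):
import numpy as np
import pandas as pd

s = pd.Series(['-99.25', '3.5', '-9.0'])            # string-typed example column
print(s.replace(r'^-99\..*', np.nan, regex=True))   # the '-99.25' entry becomes NaN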

Related

Python calculations using groups to sum

Good afternoon all,
Bit stuck with the last stage of a calculation.
I have a dataframe which outputs as such:
LaCode Group Frequency
0 718 NaN 2
1 718 3 1
2 719 1 4
3 719 2 10
I'm struggling with the percentage calculation: for each LaCode, ignore rows where Group is NaN (just put NaN, or leave them blank) and calculate each frequency as a percentage of the total frequency where Group is known.
Should output as such:
Percentage
NaN
100
28.571
71.428
Can anyone help with this? My code doesn't take into account the change in LaCode and I can't work out the correct syntax to incorporate that issue.
Thanks.
Edit: For completeness, I have converted the NaN to an integer that stands out so I can see it (in this instance 0 as that isn't a valid group in the survey)
The code I'm using for the calculation was provided to me and I tweaked it a little. It works OK when there is just one LaCode:
df['Percentage'] = df[df['Value'] != 0]['Count'].apply(lambda x: x/sum(df[df['Value'] != 0]['Count']))
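One possible groupby-based sketch (assuming, as in the snippet above, that the group column is called Value, the frequency column Count, and 0 marks the unknown group):
import numpy as np
import pandas as pd

df = pd.DataFrame({'LaCode': [718, 718, 719, 719],
                   'Value': [0, 3, 1, 2],      # 0 = unknown group
                   'Count': [2, 1, 4, 10]})
known = df['Value'] != 0
# total count per LaCode over the known groups only
totals = df.loc[known].groupby('LaCode')['Count'].transform('sum')
df['Percentage'] = np.nan
df.loc[known, 'Percentage'] = df.loc[known, 'Count'] / totals * 100
print(df)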

Pandas Concat Different Sized DataFrame to End of Column

Note: Contrived example. Please don't hate on forecasting and I don't need advice on it. This is strictly a Pandas how-to question.
Example - One Solution
I have two different sized DataFrames, one representing sales and one representing a forecast.
sales = pd.DataFrame({'sales':[5,3,5,6,4,4,5,6,7,5]})
forecast = pd.DataFrame({'forecast':[5,5.5,6,5]})
The forecast needs to line up with the latest sales, which are at the end of the list of sales numbers [5, 6, 7, 5]. Other times, I might want it at other locations (please don't ask why, I just need it this way).
This works:
df = pd.concat([sales, forecast], ignore_index=True, axis=1)
df.columns = ['sales', 'forecast'] # Not necessary, making next command pretty
df.forecast = df.forecast.shift(len(sales) - len(forecast))
This gives me the desired outcome:
Question
What I want to know is: Can I concatenate to the end of the sales data without performing the additional shift (the last command)? I'd like to do this in one step instead of two. concat or something similar is fine, but I'd like to skip the shift.
I'm not hung up on having two lines of code. That's okay. I want a solution with the maximum possible performance. My application is sensitive to every millisecond we throw at it on account of huge volumes.
Not sure if this is much faster, but you could do
sales = pd.DataFrame({'sales':[5,3,5,6,4,4,5,6,7,5]})
forecast = pd.DataFrame({'forecast':[5,5.5,6,5]})
forecast.index = sales.index[-forecast.shape[0]:]
which gives
forecast
6 5.0
7 5.5
8 6.0
9 5.0
and then simply
pd.concat([sales, forecast], axis=1)
yielding the desired outcome:
sales forecast
0 5 NaN
1 3 NaN
2 5 NaN
3 6 NaN
4 4 NaN
5 4 NaN
6 5 5.0
7 6 5.5
8 7 6.0
9 5 5.0
A one-line solution using the same idea, as mentioned by @Dark in the comments, would be:
pd.concat([sales, forecast.set_axis(sales.index[-len(forecast):], inplace=False)], axis=1)
giving the same output.

Python Pandas Running Totals with Resets

I would like to perform the following task. Given two columns (good and bad), I would like to replace certain rows of the two columns with a running total. Here is an example of the current dataframe along with the desired dataframe.
EDIT: I should have added what my intentions are. I am trying to create an equally binned variable (in this case 20 bins) using a continuous variable as the input. I know the pandas cut and qcut functions are available; however, the returned results will have zeros for the good/bad rate (needed to compute the weight of evidence and information value). Zeros in either the numerator or denominator will not allow the mathematical calculations to work.
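For context, the quantities that break with a zero count are the standard weight-of-evidence and information-value formulas (standard definitions, not spelled out in the post):
import numpy as np

good = np.array([3, 0, 13])   # a zero in either array...
bad = np.array([1, 2, 0])     # ...makes the log/division below blow up
dist_good = good / good.sum()
dist_bad = bad / bad.sum()
woe = np.log(dist_good / dist_bad)           # +/- inf where a count is zero
iv = ((dist_good - dist_bad) * woe).sum()    # and the inf propagates to the total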
d={'AAA':range(0,20),
'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
Here is an explanation of what I need to do to the above dataframe.
Roughly speaking, any time I encounter a zero in either column, I need to keep a running total of the non-zero column until I reach the next row with a non-zero value in the column that contained the zero.
Here is the desired output:
dd={'AAA':range(0,16),
'good':[19,20,60,59,72,64,52,38,24,17,19,12,5,7,6,2],
'bad':[1,1,1,6,8,10,6,6,10,5,8,2,2,1,3,2]}
desired_df=pd.DataFrame(data=dd)
print(desired_df)
The basic idea of my solution is to create a grouping key from a cumsum over the non-zero values, so that each run of zero rows lands in the same group as the next non-zero value. Then you can use groupby + sum to get the desired values.
two_good = df.groupby((df['bad']!=0).cumsum().shift(1).fillna(0))['good'].sum()
two_bad = df.groupby((df['good']!=0).cumsum().shift(1).fillna(0))['bad'].sum()
two_good = two_good.loc[two_good!=0].reset_index(drop=True)
two_bad = two_bad.loc[two_bad!=0].reset_index(drop=True)
new_df = pd.concat([two_bad, two_good], axis=1).dropna()
print(new_df)
bad good
0 1 19.0
1 1 20.0
2 1 28.0
3 6 91.0
4 8 72.0
5 10 64.0
6 6 52.0
7 6 38.0
8 10 24.0
9 5 17.0
10 8 19.0
11 2 12.0
12 2 5.0
13 1 7.0
14 3 6.0
15 1 2.0
This code treats your edge case of trailing zeros differently from your desired output; it simply cuts them off. You'd have to add some extra code to catch that case with different logic.
@P.Tillmann, I appreciate your assistance with this. I assume more advanced readers will find this code appalling, as I do. I would be more than happy to take any recommendation that makes it more streamlined.
d={'AAA':range(0,20),
'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
row_good=0
row_bad=0
row_bad_zero_count=0
row_good_zero_count=0
row_out='NO'
crappy_fix=pd.DataFrame()
for index,row in df.iterrows():
    if row['good']==0 or row['bad']==0:
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count += 1
        row_good_zero_count += 1
        output_ind='1'
        row_out='NO'
    elif index+1 < len(df) and (df.loc[index+1,'good']==0 or df.loc[index+1,'bad']==0):
        row_bad=row['bad']
        row_good=row['good']
        output_ind='2'
        row_out='NO'
    elif (row_bad_zero_count > 1 or row_good_zero_count > 1) and row['good']!=0 and row['bad']!=0:
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count=0
        row_good_zero_count=0
        row_out='YES'
        output_ind='3'
    else:
        row_bad=row['bad']
        row_good=row['good']
        row_bad_zero_count=0
        row_good_zero_count=0
        row_out='YES'
        output_ind='4'
    if ((row['good']==0 or row['bad']==0)
            and (index > 0 and (df.loc[index-1,'good']!=0 or df.loc[index-1,'bad']!=0))
            and row_good != 0 and row_bad != 0):
        row_out='YES'
    if row_out=='YES':
        temp_dict={'AAA':row['AAA'],
                   'good':row_good,
                   'bad':row_bad}
        crappy_fix=crappy_fix.append([temp_dict],ignore_index=True)
    print(str(row['AAA']),'-',
          str(row['good']),'-',
          str(row['bad']),'-',
          str(row_good),'-',
          str(row_bad),'-',
          str(row_good_zero_count),'-',
          str(row_bad_zero_count),'-',
          row_out,'-',
          output_ind)
print(crappy_fix)

Pandas inconsistency with regex "." dot metacharacter?

Consider
df
Cost
Store 1 22.5
Store 1 .........
Store 2 ...
To convert these dots to nan, I can use:
df.replace('^\.+$', np.nan, regex=True)
Cost
Store 1 22.5
Store 1 NaN
Store 2 NaN
What I don't understand is why the following pattern also works:
df.replace('^.+$', np.nan, regex=True)
Cost
Store 1 22.5
Store 1 NaN
Store 2 NaN
Note that, in this case, I haven't escaped the ., so it should be treated as a matchall character, resulting in every single row being converted to NaN... but it isn't.... only the .... rows are matched... even though I used the matchall character.
Contrast this with:
import re
re.sub('^.+$', '', '22.5')
''
Which returns an empty string.
So what's going on?
Halfway through writing this question, I realised what the problem was:
df.Cost.dtype
dtype('O')
df.Cost.values
array([22.5, '.........', '...'], dtype=object)
So, the 22.5 happens to be a numeric value, and the regex pattern simply skips over non-string values when attempting to replace. Doing an astype conversion makes it obvious:
df.astype(str).replace('.+', np.nan, regex=True)
Cost
Store 1 NaN
Store 1 NaN
Store 2 NaN
Problem solved. Leaving this up in case anyone else is confused by this.
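An alternative worth noting: if the end goal is a numeric column with NaN where the placeholder dots were, the regex can be skipped entirely. A sketch:
import pandas as pd

df = pd.DataFrame({'Cost': [22.5, '.........', '...']},
                  index=['Store 1', 'Store 1', 'Store 2'])
df['Cost'] = pd.to_numeric(df['Cost'], errors='coerce')  # non-numeric strings become NaN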

scale numerical values for different groups in python

I want to scale the numerical values (similar to R's scale function) based on different groups.
Note: when I talk about scaling, I am referring to this metric:
(x-group_mean)/group_std
Example dataset (to demonstrate the idea):
advertiser_id value
10 11
10 22
10 2424
11 34
11 342342
.....
Desired results:
advertiser_id scaled_value
10 -0.58
10 -0.57
10 1.15
11 -0.707
11 0.707
.....
Referring to this link: implementing R scale function in pandas in Python?, I used the scale function defined there and want to apply it like this:
dt.groupby("advertiser_id").apply(scale)
but get an error:
ValueError: Shape of passed values is (2, 15770), indices imply (2, 23375)
My original dataset has 15770 rows, and I don't see why the scale function would map a single value to more than one result in my case.
I would appreciate it if you could give me some sample code or suggestions on how to modify it, thanks!
First, np.std behaves differently from most other languages in that its delta degrees of freedom (ddof) defaults to 0. Therefore:
In [9]:
print df
advertiser_id value
0 10 11
1 10 22
2 10 2424
3 11 34
4 11 342342
In [10]:
print df.groupby('advertiser_id').transform(lambda x: (x-np.mean(x))/np.std(x, ddof=1))
value
0 -0.581303
1 -0.573389
2 1.154691
3 -0.707107
4 0.707107
This matches the R result.
Second, if any of your groups (by advertiser_id) happens to contain just one item, its std will be 0 (or nan with ddof=1) and you will get nan. Check whether that is why you are getting nan. R would return nan in this case as well.
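Since pandas' own Series.std already uses ddof=1 by default, an equivalent (and slightly shorter) sketch of the same transform:
import pandas as pd

df = pd.DataFrame({'advertiser_id': [10, 10, 10, 11, 11],
                   'value': [11, 22, 2424, 34, 342342]})
df['scaled_value'] = df.groupby('advertiser_id')['value'].transform(
    lambda x: (x - x.mean()) / x.std())   # Series.std defaults to ddof=1
print(df)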
