Pandas inconsistency with regex "." dot metacharacter? - python

Consider
df
Cost
Store 1 22.5
Store 1 .........
Store 2 ...
To convert these dots to NaN, I can use:
df.replace('^\.+$', np.nan, regex=True)
Cost
Store 1 22.5
Store 1 NaN
Store 2 NaN
What I don't understand is why the following pattern also works:
df.replace('^.+$', np.nan, regex=True)
Cost
Store 1 22.5
Store 1 NaN
Store 2 NaN
Note that, in this case, I haven't escaped the ., so it should be treated as the match-anything metacharacter, meaning every single row should be converted to NaN... but it isn't: only the dot-only rows are matched, even though I used the match-anything character.
Contrast this with:
import re
re.sub('^.+$', '', '22.5')
''
Which returns an empty string.
So what's going on?

Halfway through writing this question, I realised what the problem was:
df.Cost.dtype
dtype('O')
df.Cost.values
array([22.5, '.........', '...'], dtype=object)
So, the 22.5 happens to be a numeric value, and the regex pattern simply skips over non-string values when attempting to replace. Doing an astype conversion makes it obvious:
df.astype(str).replace('.+', np.nan, regex=True)
Cost
Store 1 NaN
Store 1 NaN
Store 2 NaN
Problem solved. Leaving this up in case anyone else is confused by this.
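For anyone who wants to reproduce this, the DataFrame above can be built like so (a minimal sketch based on the values shown):
import numpy as np
import pandas as pd

# 22.5 stays a float while the dot strings stay str, so df.Cost has dtype object
df = pd.DataFrame({'Cost': [22.5, '.........', '...']},
                  index=['Store 1', 'Store 1', 'Store 2'])

# only the string cells are candidates for the regex; the float 22.5 is skipped
df.replace(r'^\.+$', np.nan, regex=True)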

Related

Add commas to decimal column without rounding off

I have a pandas column named Price_col which looks like this.
Price_col
1. 1000000.000
2. 234556.678900
3. 2345.00
4.
5. 23.56
I am trying to add commas to my Price_col so it looks like this.
Price_col
1. 1,000,000.000
2. 234,556.678900
3. 2,345.00
4.
5. 23.56
When I try to convert the values, they always get rounded off. Is there a way to keep the original value without rounding?
I tried the code below; this is what I got for the value 234556.678900.
n = "{:,}".format(234556.678900)
print(n)
>>> 234,556.6789
Add f for fixed-point
>>> "{:,}".format(234556.678900)
'234,556.6789'
>>> "{:,f}".format(234556.678900)
'234,556.678900'
You can also control the precision with .p where p is the number of digits (and you should probably do so). Beware: as you're dealing with floats, you'll have some IEEE 754 aliasing, though the representation via format should be quite nice regardless of the backing data.
>>> "{:,.5f}".format(234556.678900)
'234,556.67890'
>>> "{:,.20f}".format(234556.678900)
'234,556.67889999999897554517'
The full Format Specification Mini-Language can be found here:
https://docs.python.org/3/library/string.html#format-specification-mini-language
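The same format specifiers also work inside f-strings, for what it's worth:
>>> f"{234556.678900:,.2f}"
'234,556.68'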
From your comment, I realized you may really want something else, as described in How to display pandas DataFrame of floats using a format string for columns?, and only change how the data is displayed.
Creating a new column of formatted strings
>>> df = pd.DataFrame({"Price_col": [1000000.000, 234556.678900, 2345.00, None, 23.56]})
>>> df["price2"] = df["Price_col"].apply(lambda x: f"{x:,f}")
>>> df
Price_col price2
0 1000000.0000 1,000,000.000000
1 234556.6789 234,556.678900
2 2345.0000 2,345.000000
3 NaN nan
4 23.5600 23.560000
>>> df.dtypes
Price_col float64
price2 object
dtype: object
Temporarily changing how data is displayed
>>> df = pd.DataFrame({"Price_col": [1000000.000, 234556.678900, 2345.00, None, 23.56]})
>>> print(df)
Price_col
0 1000000.0000
1 234556.6789
2 2345.0000
3 NaN
4 23.5600
>>> with pd.option_context('display.float_format', '€{:>18,.6f}'.format):
... print(df)
...
Price_col
0 € 1,000,000.000000
1 € 234,556.678900
2 € 2,345.000000
3 NaN
4 € 23.560000
>>> print(df)
Price_col
0 1000000.0000
1 234556.6789
2 2345.0000
3 NaN
4 23.5600
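If you want that formatting for the whole session rather than inside a with block, the same format string can be set globally (a sketch; pd.reset_option restores the default afterwards):
>>> pd.set_option('display.float_format', '{:>18,.6f}'.format)
>>> print(df)   # every float column now prints with the chosen format
>>> pd.reset_option('display.float_format')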

Finding specific digit pattern with regex in python

I want to replace all values in a dataframe column that start with "-99." with NaN, using regex, as these values are outliers.
I used df['Item'].replace(r(^[-][9][9]\d.*$),np.NaN) but it did not work.
TL;DR
The regular expression posted by @tripleee is fine for detecting numbers (encoded as strings) starting with -99. The problem here is that you are dealing with numbers, and regular expressions are only suited to strings.
MCVE
Let's build a comprehensive example:
import numpy as np
import pandas as pd
df = pd.DataFrame([-999, -99.9, -9, 9, 99.9, 0., 1, -999], columns=['Item'])
Item
0 -999.0
1 -99.9
2 -9.0
3 9.0
4 99.9
5 0.0
6 1.0
7 -999.0
Regular Expression
Then you can match outliers using the regular expression (provided the string representation is suitable); all you need is to cast (astype) to string before applying the regular expression (which lives in the str accessor of the Series).
q1 = df['Item'].astype(str).str.match(r'^-99\..*')
0 False
1 True
2 False
3 False
4 False
5 False
6 False
7 False
But if you intend to replace those values with NaN using the str.replace function, it requires extra steps, as that replace expects another string and nothing else (using np.nan or None will fail). You would then have to execute:
df['Item'].astype(str).str.replace(r'^-99\..*', 'nan', regex=True).astype(float)
IMO this is a pretty bad one-liner because of "unnecessary" casting which spoils the very nature of your data.
Logical Indexing
You are better off with logical indexing using the boolean vector above, either by replacing with a sentinel:
df.loc[q1] = np.nan
Item
0 -999.0
1 NaN
2 -9.0
3 9.0
4 99.9
5 0.0
6 1.0
7 -999.0
or slicing:
df = df.loc[~q1,:]
Item
0 -999.0
2 -9.0
3 9.0
4 99.9
5 0.0
6 1.0
7 -999.0
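Equivalently, the sentinel replacement above collapses into a single expression with Series.mask (a sketch using the q1 vector from before):
df['Item'] = df['Item'].mask(q1)   # entries where q1 is True become NaN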
Anyway, converting numbers into strings to detect outliers seems a bit odd (poor performance, complex behaviour that is hard to debug, an extra copy of the data).
Float Arithmetic
Simple filter
If numbers less than -99. can never be valid, then you can filter them out using a simple numerical criterion:
q2 = df['Item'] <= -99.
df = df.loc[~q2,:]
Item
2 -9.0
3 9.0
4 99.9
5 0.0
6 1.0
This will perform way better and avoids casting numbers to strings and back. It also avoids extra copies of the data (string, then float again, then overwriting the initial data). So it is both memory- and computation-wise more efficient than your first approach (regular expressions are expensive).
Epsilon ball filter
If numbers less than the cutoff must be kept, you can still do it with float arithmetic. Just swap the less-than criterion for an epsilon-ball criterion around the desired value. To capture all numbers within [-100., -99.] you can use the following setup:
target = -99.5
epsilon = 0.5
q3 = np.abs(df['Item'] - target) <= epsilon
0 False
1 True
2 False
3 False
4 False
5 False
6 False
7 False
Of course you can change the target and make epsilon as small as your machine precision allows.
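The same epsilon-ball test can also be written with np.isclose, which makes the tolerance explicit (a sketch equivalent to q3 above):
q3 = np.isclose(df['Item'], target, atol=epsilon, rtol=0.0)   # True inside [-100., -99.]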
Dunno about Pandas, but the code you show lacks quotes, and of course the regex doesn't do what you say you want to do: \d.*$ says the 99 has to be followed by another digit, then anything up to the end. Probably you mean
df['Item'].replace(r'^-99\..*',np.NaN)
where the ^ anchor means beginning of line (or, here, beginning of the cell) and -99 just matches literal text. Finally \. matches a literal dot, and .* matches anything after that, up until the end of the cell.

Getting an int value from a string value of a column in a dataframe

How do I extract only the numeric value from a string containing numbers, brackets, and other characters?
Example: I have an issue with this -> 946.73 [1](June 2020). I want to remove [1](June 2020) from that string, or extract 946.73 from it.
I have used the filter method.
mobile is a dataframe and Total Subscribers is a column whose values look like 946.73 [1](June 2020).
So I need to get only the numeric value from that column's values.
I tried this method:
mobile['Total Subscribers']= int(filter(str.isdigit, mobile['Total Subscribers']))
url="https://en.wikipedia.org/wiki/List_of_mobile_network_operators"
mobile=pd.read_html(url,match="Company")
mobile=mobile[0]
mobile = mobile.set_index('Rank').rename(columns={
    'Totalsubscriptions(in millions)': 'Total Subscribers',
    'Ownership(100% ownership unless stated otherwise)': 'Ownership'})
mobile['Total Subscribers'] = mobile['Total Subscribers'].apply(
    lambda x: re.search(r'\d+', x).group())
mobile['Total Subscribers']
for i in mobile['Total Subscribers']:
    a = re.sub("[^\d\.]", "", i)
    mobile['Total Subscribers'] = a
    return mobile['Total Subscribers']
This is my code. Please help me solve it.
Try a regex replacement which targets the square brackets and parentheses and their contents:
\[.*\]\(.*\)
i.e
df = pd.DataFrame({'data' : ['946.73 [1](June 2020)']})
print(df)
data
0 946.73 [1](June 2020)
df['data'].replace(r'\[.*\]\(.*\)','',regex=True)
0 946.73
Name: data, dtype: object
edit - changed requirement.
mobile['Total Subscribers'].str.extract(r'(\d+\.\d+)')[0]
Rank
1.0 946.73
2.0 420.00
3.0 398.30
4.0 343.47
5.0 309.52
6.0 279.80
7.0 277.50
8.0 261.46
9.0 261.34
10.0 256.20
11.0 207.96
12.0 204.60
13.0 182.42
14.0 185.50
15.0 171.41
16.0 162.57
17.0 146.10
18.0 145.84
19.0 123.22
20.0 119.87
21.0 118.32
22.0 110.0
23.0 98.49
24.0 89.32
25.0 86.40
26.0 79.67
27.0 75.10
28.0 73.08
29.0 54.5
30.0 52.42
NaN NaN
Your question is a bit confusing to me: if you say you need all numeric values from the string, then your regex (or any isdigit-based filter) will return the 1 and the 2020 as well.
For example, if I write a regex which keeps all digits and decimal points, the output would be something like:
import re
a=re.sub("[^\d\.]", "", "946.73 [1](June 2020)")
Output : 946.7312020
Or you can work on the whole dataframe, match the bracketed and parenthesised parts, and replace them.
You will have to use the .replace function for that; please read its documentation for more clarity.
I guess this should work for you
.replace(r'\[.*\]\(.*\)','',regex=True)
You can set regex=False if you want the pattern treated as a literal string instead:
.replace(r'\[.*\]\(.*\)','',regex=False)
If the value to be extracted is always a floating-point number, you could use a simple regular expression which matches only that:
import re
text="946.73 [1](June 2020)"
matches = re.findall(r"\d+\.\d+", text)
if len(matches) == 1:
    print(matches[0])
else:
    raise ValueError
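Applied to the column in the question, the same pattern can go through str.extract with a float cast (a sketch; assumes every entry of the Total Subscribers column contains a decimal number):
mobile['Total Subscribers'] = (
    mobile['Total Subscribers']
    .str.extract(r'(\d+\.\d+)')[0]
    .astype(float)
)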

Removing dash string from mixed dtype column in pandas Dataframe

I have a dataframe with possible objects mixed with numerical values.
My goal is to change every value to a simple integer; however, some of these values have - between the numbers.
A minimal working example looks like:
import pandas as pd
d = {'API':[float(4433), float(3344), 6666, '6-9-11', '8-0-11', 9990]}
df = pd.DataFrame(d)
I try:
df['API'] = df['API'].str.replace('-','')
But this leaves me with NaN for the numeric entries, because .str only operates on the string values.
The output is:
API
nan
nan
nan
6911
8011
nan
I'd like an output:
API
4433
3344
6666
6911
8011
9990
Where all types are int.
Is there an easy way to take care of just the object types in the Series while leaving the actual numeric values intact? I'm using this technique on large data sets (300,000+ lines), so something like a lambda or Series operations would be preferred over a loop.
Use df.replace with regex=True
df = df.replace('-', '', regex=True).astype(int)
API
0 4433
1 3344
2 6666
3 6911
4 8011
5 9990
Also,
df['API'] = df['API'].astype(str).apply(lambda x: x.replace('-', '')).astype(float).astype(int)
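A vectorised variant without apply goes through pd.to_numeric, which also copes with the float-formatted strings produced by astype(str) (a sketch):
df['API'] = (
    pd.to_numeric(df['API'].astype(str).str.replace('-', '', regex=False))
    .astype(int)
)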

Read flat file to DataFrames using Pandas with field specifiers in-line

I'm attempting to read in a flat-file to a DataFrame using pandas but can't seem to get the format right. My file has a variable number of fields represented per line and looks like this:
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOCinpt|MIME=application/synthesis+ssml|TXID=NUAN-20131203004552049-FCJNJKDCAAANPCKEAAAAAAAA-txt|TXSZ=1167|UCPU=31|SCPU=15
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOCsynd|INPT=1167|DURS=5120|RSTT=stop|UCPU=31|SCPU=15
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOClise|LUSED=0|LMAX=100|OMAX=95|LFEAT=tts|UCPU=0|SCPU=0
I have the field separator at |, I've pulled a list of all unique keys into keylist, and am trying to use the following to read in the data:
keylist = ['TIME',
'CHAN',
# [truncated]
'DURS',
'RSTT']
test_fp = 'c:\\temp\\test_output.txt'
df = pd.read_csv(test_fp, sep='|', names=keylist)
This incorrectly builds the DataFrame as I'm not specifying any way to recognize the key label in the line. I'm a little stuck and am not sure which way to research -- should I be using .read_json() for example?
Not sure if there's a slick way to do this. Sometimes when the data structure is different enough from the norm it's easiest to preprocess it on the Python side. Sure, it's not as fast, but since you could immediately save it in a more standard format it's usually not worth worrying about.
One way:
with open("wfield.txt") as fp:
rows = (dict(entry.split("=",1) for entry in row.strip().split("|")) for row in fp)
df = pd.DataFrame.from_dict(rows)
which produces
>>> df
CHAN DURS EVNT INPT LFEAT LMAX LUSED \
0 FCJNJKDCAAANPCKEAAAAAAAA NaN NVOCinpt NaN NaN NaN NaN
1 FCJNJKDCAAANPCKEAAAAAAAA 5120 NVOCsynd 1167 NaN NaN NaN
2 FCJNJKDCAAANPCKEAAAAAAAA NaN NVOClise NaN tts 100 0
MIME OMAX RSTT SCPU TIME \
0 application/synthesis+ssml NaN NaN 15 20131203004552049
1 NaN NaN stop 15 20131203004552049
2 NaN 95 NaN 0 20131203004552049
TXID TXSZ UCPU
0 NUAN-20131203004552049-FCJNJKDCAAANPCKEAAAAAAA... 1167 31
1 NaN NaN 31
2 NaN NaN 0
[3 rows x 15 columns]
After you've got this, you can reshape as needed. (I'm not sure if you wanted to combine rows with the same TIME & CHAN or not.)
Edit: if you're using an older version of pandas which doesn't support passing a generator to from_dict, you can build it from a list instead:
df = pd.DataFrame(list(rows))
but note that you will have to convert columns from strings to numeric types after the fact.
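For example, the numeric-looking columns can be coerced after the fact (a sketch; pick whichever columns should be numeric in your data):
for col in ["TXSZ", "INPT", "DURS", "UCPU", "SCPU", "LUSED", "LMAX", "OMAX"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")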
