Extracting number from unicode string with regex - python

I have the following dictionary which contains some product data:
dictionary = {'price': [u'3\xa0590 EUR'],
'name': [u'Product name with unicode chars']}
All values are in unicode. As you can see I'm using lists as dictionary values because sometimes I need to concatenate the information from several different sources.
I'm looking for a way to extract the digits from the price value without the non-breaking space (\xa0) and currency at the end (EUR) by using a regex.
In this case I would like to see the following as a result:
3590
Can you please suggest a solution?
[SOLUTION]
Adding the solution here because the comments field wrapped my code unexpectedly:
I used the sub() method from Python's re module, which replaces every match of a pattern. Here is the final code that gives me the expected result:
import re
p = re.compile(u'\xa0| EUR')
result = p.sub(u'', dictionary['price'][0])

Not sure about python, but here's a regex:
p = /\D/g;
s.replace(p, '');
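For reference, the JavaScript snippet above translates fairly directly to Python. This is a minimal sketch, assuming Python 3 (where every str is Unicode) and a price without a decimal part; the variable names are illustrative:

```python
import re

price = '3\xa0590 EUR'  # same value as dictionary['price'][0] in the question

# \D matches any character that is not a digit, which covers both the
# non-breaking space and the currency code, so removing every match
# leaves only the digits.
digits = re.sub(r'\D', '', price)
number = int(digits)
print(number)  # 3590
```

Unlike the pattern that lists '\xa0' and ' EUR' explicitly, this also works for other currencies and separators, but it would silently drop a decimal point as well.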

Related

regex on python splitting a string into a specific sequence

I have a string that may look like
CITS/CPU/0218/2305CITS/VDU/0218/2305CITS/KEY/0218/2305
or
CITS/CPU/0218/2305CITS/VDU/0218/2305 CITS/KEY/0218/2305
or
CITS/CPU/0218/2305 CITS/VDU/0218/2305 CITS/KEY/0218/2305
or
CITS/CPU/0218/2305
I was trying to come up with a regex that would match against a sequence like CITS/CPU/0218/2305 so that I can split any string into a list that matches this case only.
Essentially I just need to extract the */*/*/* part into a list from incoming strings
My code
product_code = "CITS/CPU/0218/2305CITS/VDU/0218/2305 CITS/KEY/0218/2305"
(re.split(r'^((?:[a-z][a-z]+))(.)((?:[a-z][a-z]+))((?:[a-z][a-z]+))(.)(\\d+)(.)(\\d+)$', product_code))
Any suggestions?

Try using re.findall here:
import re
inp = "CITS/CPU/0218/2305CITS/VDU/0218/2305CITS/KEY/0218/2305"
matches = re.findall(r'[A-Z]+/[A-Z]+/[0-9]+/[0-9]+', inp)
print(matches)
This prints:
['CITS/CPU/0218/2305', 'CITS/VDU/0218/2305', 'CITS/KEY/0218/2305']
If you only want the first match, then just access it:
print(matches[0])
CITS/CPU/0218/2305
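To check that the same pattern copes with every variant listed in the question, it can be wrapped in a small function. This is a sketch only: extract_codes and CODE_PATTERN are hypothetical names, and it assumes the codes always consist of two uppercase fields followed by two numeric fields.

```python
import re

# The pattern is the one from the answer above, compiled once for reuse.
CODE_PATTERN = re.compile(r'[A-Z]+/[A-Z]+/[0-9]+/[0-9]+')

def extract_codes(text):
    """Return every LETTERS/LETTERS/digits/digits sequence found in text."""
    return CODE_PATTERN.findall(text)

samples = [
    "CITS/CPU/0218/2305CITS/VDU/0218/2305CITS/KEY/0218/2305",
    "CITS/CPU/0218/2305CITS/VDU/0218/2305 CITS/KEY/0218/2305",
    "CITS/CPU/0218/2305 CITS/VDU/0218/2305 CITS/KEY/0218/2305",
    "CITS/CPU/0218/2305",
]
for sample in samples:
    print(extract_codes(sample))
```

Because findall scans for matches instead of splitting on separators, it does not matter whether the codes are separated by spaces or run together.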

What's a better way to process inconsistently structured strings?

I have an output string like this:
read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec
And I want to just extract one of the numerical values for computation, say iops. I'm processing it like this:
if 'read ' in key:
    my_read_iops = value.split(",")[2].split("=")[1]
    result['test_details']['read'] = my_read_iops
But there are slight inconsistencies with some of the strings I'm reading in and my code is getting super complicated and verbose. So instead of manually counting the number of commas vs "=" chars, what's a better way to handle this?

You can use \s* in a regular expression to handle inconsistent spacing; it matches zero or more whitespace characters:
import re
s = 'read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec'
for m in re.finditer(r'\s*(?P<name>\w*)\s*=\s*(?P<value>[\w/]*)\s*', s):
    print(m.group('name'), m.group('value'))
# io 131220KB
# bw 14016KB/s
# iops 3504
# runt 9362msec
Using named groups, you can build the pattern string from a list of field names:
names = ['io', 'bw', 'iops', 'runt']
name_val_pat = r'\s*{name}\s*=\s*(?P<{group_name}>[\w/]*)\s*'
pattern = ','.join([name_val_pat.format(name=name, group_name=name) for name in names])
# '\s*io\s*=\s*(?P<io>[\w/]*)\s*,\s*bw\s*=\s*(?P<bw>[\w/]*)\s*,\s*iops\s*=\s*(?P<iops>[\w/]*)\s*,\s*runt\s*=\s*(?P<runt>[\w/]*)\s*'
match = re.search(pattern, s)
data_dict = {name: match.group(name) for name in names}
print(data_dict)
# {'io': '131220KB', 'bw': '14016KB/s', 'iops': '3504', 'runt': '9362msec'}
In this way, you only need to change names and keep the order correct.
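If the field names are not known in advance, a shorter variant is to let re.findall collect every name=value pair and feed the result to dict(). This is a sketch under the assumption that names and values never contain commas or "=" characters:

```python
import re

s = 'read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec'

# With two capturing groups, findall returns a list of (name, value)
# tuples, which dict() can consume directly.
fields = dict(re.findall(r'(\w+)\s*=\s*([\w/]+)', s))
iops = int(fields['iops'])
print(fields)
print(iops)  # 3504
```

This does not depend on the order of the fields, at the cost of not validating that all expected fields are present.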

If I were you, I'd use a regex (regular expression) as the first choice.
import re
s = "read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec"
re.search(r"iops=(\d+)", s).group(1)
This pattern finds the text that starts with 'iops=' and is followed by at least one digit. The parentheses capture the digits, so group(1) returns the target string ('3504').
You can find more information about regex at
https://docs.python.org/3.6/library/re.html#module-re
Regex is a powerful language for complex pattern matching with a compact syntax.
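One thing to watch for with inconsistent input is that re.search returns None when nothing matches, so calling .group(1) directly raises AttributeError. A minimal sketch of a guard, where extract_iops is a hypothetical helper name and optional spaces around "=" are assumed to be possible:

```python
import re

def extract_iops(line):
    """Return the iops value as an int, or None if the line has none."""
    m = re.search(r'iops\s*=\s*(\d+)', line)
    return int(m.group(1)) if m else None

print(extract_iops('read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec'))  # 3504
print(extract_iops('no counters on this line'))  # None
```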

from re import match
string = 'read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec'
iops = match(r'.+(iops=)([0-9]+)', string).group(2)
print(iops)  # 3504

How to escape null characters, i.e. [''], while using regex split function? [duplicate]

I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:
time_info = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamp information? i.e. I want:
time_info=['20111007T084734', '20111008T023142']

I'm no Python expert, but maybe you could just remove the empty strings from your list?
str_list = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
time_info = list(filter(None, str_list))  # list() is needed on Python 3, where filter returns an iterator

Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
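The named-group variant mentioned above could look like the sketch below. The group names start and end are illustrative, and the datetime conversion assumes the timestamps follow the YYYYMMDDTHHMMSS layout seen in the sample file names.

```python
import re
from datetime import datetime

f = '000014_L_20111007T084734-20111008T023142.txt'

# "start" and "end" are illustrative group names, not from the original answer.
m = re.search(r'[LU]_(?P<start>\w+)-(?P<end>\w+)\.', f)
time_info = m.groupdict()
print(time_info)

# Optional: turn the captured text into a datetime object.
start = datetime.strptime(time_info['start'], '%Y%m%dT%H%M%S')
print(start)  # 2011-10-07 08:47:34
```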

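If the timestamps always come after the second _, plain string methods are enough. Slice off the extension rather than calling strip(".txt"), which removes any of the characters ".", "t" and "x" from both ends instead of the literal suffix:
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs[:-len(".txt")].split("_", 2)[-1].split("-")
['20111007T084734', '20111008T023142']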

Since this came up on Google, and for completeness: try re.findall as an alternative!
It requires a little rethinking, but it still returns a list of matches like split does, which makes it a nice drop-in replacement in existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer, and it doesn't use re.split(). It does solve the underlying issue: your list of matches suddenly has zero-length strings in it and you don't want that.
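A minimal sketch of the re.findall approach for the file names in the question. The pattern is an assumption based on the sample names (eight digits, a T, six digits), not something given in the answer:

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Match the timestamps themselves instead of the separators around them.
time_info = re.findall(r'\d{8}T\d{6}', f)
print(time_info)  # ['20111007T084734', '20111008T023142']

# Variant with lookarounds: only accept a timestamp that sits between
# the expected delimiters, without including them in the match.
strict = re.findall(r'(?<=[_-])\d{8}T\d{6}(?=[-.])', f)
print(strict)  # ['20111007T084734', '20111008T023142']
```

Since findall only returns what the pattern matched, no empty strings appear at the ends of the list.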

>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

Extracting 2 strings from regular expression Python

I am trying to extract city, state and/or zip code from a string using a regular expression. The regex I am using, taken from the question "get city, state or zip from a string in python", is ([^\d]+)?(\d{5})? and when I tested it on http://regex101.com/ it accurately selects the two strings I want to match.
However I'm not sure how to separate these two strings in Python. Here is what I have tried:
import re
string = "binghamton ny 13905"
reg = re.compile(r'([^\d]+)?(\d{5})?')
match = reg.match(string)
print(match.group())
This simply returns the entire string. Is there a way to pull each match individually?
I have also tried separating the regular expression into two distinct regular expressions (one for city, state and one for zip code) however the zip code regex either returns an empty string or None. All help is appreciated, thanks.

Probably the easiest way is to name the two capturing groups:
reg = re.compile(r'(?P<city>[^\d]+)?(?P<zip>\d{5})?')
and then access the groupdict:
>>> match = reg.match("binghamton ny 13905")
>>> match.groupdict()
{'city': 'binghamton ny ', 'zip': '13905'}
This gives you easy access to the two pieces of information by name, rather than index.

I would agree with jonrsharpe:
import re
string = "binghamton ny 13905"
reg = re.compile(r'(?P<city>[^\d]+)?(?P<zip>\d{5})?')
result = reg.match(string)
Additionally, you can access the groups by name like this:
result.group('city')
result.group('zip')
See the Python re reference page for details.

>>> string = "binghamton ny 13905"
>>> r = re.search(r"([^\d]+)?(\d{5})?", string)
>>> r.groups()
('binghamton ny ', '13905')
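Because both groups are optional, either one can come back as None, and the first group keeps its trailing space. A small sketch of a wrapper that handles both cases follows; split_location and LOCATION are hypothetical names, and \D is used as shorthand for [^\d]:

```python
import re

LOCATION = re.compile(r'(\D+)?(\d{5})?')

def split_location(text):
    """Return (city_state, zip_code); either element may be None."""
    # The pattern can match the empty string, so match() never returns None here.
    city_state, zip_code = LOCATION.match(text).groups()
    if city_state is not None:
        city_state = city_state.strip()
    return city_state, zip_code

print(split_location("binghamton ny 13905"))  # ('binghamton ny', '13905')
print(split_location("13905"))                # (None, '13905')
print(split_location("binghamton ny"))        # ('binghamton ny', None)
```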

Parsing a string using regular expression?

my_string = "Value1=Product Registered;Value2=Linux;Value3=C:5;C++:5;Value4=43;"
I was using the following regex:
tokens = re.findall(r'([^;]+)=([^;]+)', my_string, re.I)
I need to parse value1, value2, etc and put their values into the database. For example, I need to store "C:5;C++:5" for value3 -- but by using the above regex I can only store C:5, because I parse based on ";". What would be a better way to do this?
Thanks!

It seems reasonable to assume that the key names don't contain semicolons. If that assumption doesn't hold, then as Philipp pointed out the language is ambiguous. If it does hold, you can use a lookahead to tell which ; is the separator: it has to be followed by a run of characters that are neither ; nor =, and then either an = or the end of the string:
>>> my_string = "Value1=Product Registered;Value2=Linux;Value3=C:5;C++:5;Value4=43;"
>>> r = re.compile(r'([^;]+)=([^=]+);(?=[^;=]*(?:=|$))')
>>> r.findall(my_string)
[('Value1', 'Product Registered'),
('Value2', 'Linux'),
('Value3', 'C:5;C++:5'),
('Value4', '43')]
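Since the question mentions storing the values in a database, the list of tuples can be turned into a dictionary keyed by name first. This is a sketch that reuses the pattern above; the variable name record is illustrative:

```python
import re

my_string = "Value1=Product Registered;Value2=Linux;Value3=C:5;C++:5;Value4=43;"

r = re.compile(r'([^;]+)=([^=]+);(?=[^;=]*(?:=|$))')

# findall returns (key, value) tuples, which dict() accepts directly.
record = dict(r.findall(my_string))
print(record['Value3'])  # C:5;C++:5
```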
