Extract fields from the string in python

I have text, line by line, that contains field names and their values separated by ':'. If a line has no value for a field, that field does not appear in that line.
For example:
First line:
A:30 B: 40 TS:1/1/1990 22:22:22
Second line:
A:30 TS:1/1/1990 22:22:22
Third line:
A:30 B: 40
It is guaranteed that a single line has at most 3 fields, and their names are A, B and TS.
While writing a Python script for this, I am facing the following issues:
1) I have to extract from each line which fields exist and what their values are.
2) The value of field TS itself contains the separator ' ' (space), so I am unable to retrieve the full value of TS (1/1/1990 22:22:22).
The output values should be extracted like this:
First line:
A=30
B=40
TS=1/1/1990 22:22:22
Second line:
A=30
TS=1/1/1990 22:22:22
Third line:
A=30
B=40
Please help me solve this issue.

import re
a = ["A:30 B: 40 TS:1/1/1990 22:22:22", "A:30 TS:1/1/1990 22:22:22", "A:30 B: 40"]
regex = re.compile(r"^\s*(?:(A)\s*:\s*(\d+))?\s*(?:(B)\s*:\s*(\d+))?\s*(?:(TS)\s*:\s*(.*))?$")
for item in a:
    matches = regex.search(item).groups()
    print({k: v for k, v in zip(matches[::2], matches[1::2]) if k})
will output
{'A': '30', 'B': '40', 'TS': '1/1/1990 22:22:22'}
{'A': '30', 'TS': '1/1/1990 22:22:22'}
{'A': '30', 'B': '40'}
Explanation of the regex:
^\s* # match start of string, optional whitespace
(?: # match the following (optionally, see below)
(A) # identifier A --> backreference 1
\s*:\s* # optional whitespace, :, optional whitespace
(\d+) # any number --> backreference 2
)? # end of optional group
\s* # optional whitespace
(?:(B)\s*:\s*(\d+))?\s* # same with identifier B and number --> backrefs 3 and 4
(?:(TS)\s*:\s*(.*))? # same with id. TS and anything that follows --> 5 and 6
$ # end of string
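The annotated breakdown above can also be kept inside the pattern itself. This is a minimal sketch, assuming Python 3 and the same field names A, B and TS; `FIELDS_RE` and `extract_fields` are hypothetical names chosen for illustration, and the `None` check is an addition not present in the answer.

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments live in the regex.
FIELDS_RE = re.compile(r"""
    ^\s*                            # start of string, optional whitespace
    (?: (A)  \s*:\s* (\d+) )? \s*   # optional field A and its number
    (?: (B)  \s*:\s* (\d+) )? \s*   # optional field B and its number
    (?: (TS) \s*:\s* (.*)  )?       # optional field TS and the rest of the line
    $                               # end of string
""", re.VERBOSE)


def extract_fields(line):
    """Return a dict of the fields present in `line` (hypothetical helper)."""
    match = FIELDS_RE.search(line)
    if match is None:  # e.g. unknown field names or fields out of order
        return {}
    groups = match.groups()
    # groups alternate name, value, name, value, ...; skip fields that are absent
    return {name: value for name, value in zip(groups[::2], groups[1::2]) if name}


for line in ["A:30 B: 40 TS:1/1/1990 22:22:22", "A:30 TS:1/1/1990 22:22:22", "A:30 B: 40"]:
    print(extract_fields(line))
```

Note that this pattern still expects the fields in the order A, B, TS; a line with the fields in another order yields an empty dict here.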

You could use regular expressions. Something like this works if all three fields are present and always appear in the same order; if you're unsure of the order, you would have to match each part individually.
import re
def parseInput(input):
    m = re.match(r"A:\s*(\d+)\s*B:\s*(\d+)\s*TS:(.+)", input)
    return {"A": m.group(1), "B": m.group(2), "TS": m.group(3)}
print(parseInput("A:30 B: 40 TS:1/1/1990 22:22:22"))
This prints out {'A': '30', 'B': '40', 'TS': '1/1/1990 22:22:22'}, which is just a dictionary containing the values.
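For the case where the order is not known, or some fields are missing, one possible approach is to match each "name: value" pair on its own. The sketch below assumes Python 3 and that field names are limited to A, B and TS, as the question states; `FIELD_RE` and `parse_fields` are hypothetical names, not part of the answer above.

```python
import re

# A field name, a colon, then the shortest text that runs up to the next
# field name (or the end of the line). This keeps the space inside TS intact.
FIELD_RE = re.compile(r"\b(A|B|TS)\s*:\s*(.*?)(?=\s+(?:A|B|TS)\s*:|$)")


def parse_fields(line):
    """Return {field_name: value} for every field found, in any order."""
    return {name: value.strip() for name, value in FIELD_RE.findall(line)}


print(parse_fields("A:30 B: 40 TS:1/1/1990 22:22:22"))
print(parse_fields("TS:1/1/1990 22:22:22 A:30"))
print(parse_fields("A:30 B: 40"))
```

The lookahead only ends a value where whitespace is followed by a known field name and a colon, so the colons in the time portion of TS do not split the value.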
P.S. You should accept some answers and familiarize yourself with the etiquette of site and people will be more willing to help you out.

Related

How to split the string using python3

How can I split this string using a regex?
Input:
result = '1,000.03AM2,97.2323,089.301,903.230.0034,928.9911,24.30AM'
I want to split it so that I can store the parts in separate strings for further use.
Expected output:
a = 1,000.03AM, b = 2,97.23, c = 23,089.30, d = 1,903.23, e = 0.00, f = 34,928.99, g = 11,24.30AM
I tried the following, but it gives the wrong output:
import re
print(re.findall(r'[0-9.]+|[^0-9.]', result))

You may extract the strings using
re.findall(r'\d+(?:,\d+)*(?:\.\d{2})?[^,\d]*', text)
Details
\d+ - 1+ digits
(?:,\d+)* - 0 or more repetitions of a comma and 1+ digits
(?:\.\d{2})? - an optional occurrence of a dot and 2 digits
[^,\d]* - any 0 or more chars other than a comma and digit.
Python demo:
import re
text = "1,000.03AM2,97.2323,089.301,903.230.0034,928.9911,24.30AM"
print( re.findall(r'\d+(?:,\d+)*(?:\.\d{2})?[^,\d]*', text) )
# => ['1,000.03AM', '2,97.23', '23,089.30', '1,903.23', '0.00', '34,928.99', '11,24.30AM']

For your result you need the following regex:
re.findall(r"[\d,]+\.\d{2}(?:AM)?", result)
This produces:
['1,000.03AM', '2,97.23', '23,089.30', '1,903.23', '0.00', '34,928.99', '11,24.30AM']
Regex explanation:
[\d,] - match digits and commas
[\d,]+\.\d{2} - match a whole float value (with two digits after the dot)
(?:AM)? - match an optional AM in a non-capturing group; in the examples below I use (?=AM)? so that AM is not included in the result
If something other than AM can appear in that position, change (?:AM) to (?:AM|Other|...)
If you need to parse the values as floats, I have two suggestions. The first is to remove the commas:
list(map(lambda x: float(x.replace(",", "")), re.findall(r"[\d,]+\.\d{2}(?=AM)?", result)))
Result:
[1000.03, 297.23, 23089.3, 1903.23, 0.0, 34928.99, 1124.3]
Another variant is using locale:
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF8')
'en_US.UTF8'
>>> list(map(lambda x: locale.atof(x), re.findall(r"[\d,]+\.\d{2}(?=AM)?", result)))
[1000.03, 297.23, 23089.3, 1903.23, 0.0, 34928.99, 1124.3]

Provided the string length and the width of each field always stay the same, the most efficient solution would be fixed slicing:
a = result[0:10]
b = result[10:17]
c = result[17:26]
d = result[26:34]
e = result[34:38]
f = result[38:47]
g = result[47:57]
Hope this helps.

Regex: (date/time) same item between each part

Writing a simple regex to find dates and times within strings.
There's a small issue with identifying time items when there are specific dates in the string. Here's the regex:
TIME_REGEX = "([0-1][0-9]|2[0-3])[:\-\_]?([0-5][0-9])[:\-\_]?([0-5][0-9])"
The issue is that I need to accept time values without anything between the numbers, hence the two "[:\-\_]?" parts. However, the regex matches even if the two separators differ from each other. So it will also match the date "2011-07-30" as being the time 20:11:07.
Can I change the regex so that both separators between the numbers are the same, so it matches "201107" and "20-11-07", but not "2011-07" or "20:11-07"?

You can store the delimiter in a group and reuse it:
TIME_REGEX = r"([0-1][0-9]|2[0-3])(?P<sep>[:\-\_]?)([0-5][0-9])(?P=sep)([0-5][0-9])"
Here, (?P<sep>...) stores the content of this group under the name sep, which we reuse with (?P=sep). This way, both separators always have to be equal.
Example:
import re
for test in ['201107', '20-11-07', '20-11:07']:
    match = re.match(TIME_REGEX, test)
    if match:
        print(test, match.group(1, 3, 4), "delimiter: '{}'".format(match.group('sep')))
yields:
201107 ('20', '11', '07') delimiter: ''
20-11-07 ('20', '11', '07') delimiter: '-'

I suggest capturing the first separator in a group and using a backreference to that group to match the second separator, as follows. You just have to retrieve the correct groups at the end:
import re
times = ['20-11-07', '2011-07', '20-1107', '201107', '20:11-07', '20-10:07', '20:11:07']
TIME_REGEX = r'([0-1][0-9]|2[0-3])([:\-\_]*)([0-5][0-9])(\2)([0-5][0-9])'
for time in times:
    m = re.search(TIME_REGEX, time)
    if m:
        print(time, "matches with following groups:", m.group(1), m.group(3), m.group(5))
    else:
        print(time, "does not match")
# 20-11-07 matches with following groups: 20 11 07
# 2011-07 does not match
# 20-1107 does not match
# 201107 matches with following groups: 20 11 07
# 20:11-07 does not match
# 20-10:07 does not match
# 20:11:07 matches with following groups: 20 11 07
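One thing to watch with the backreference approach: `re.search` can still find a time inside a longer date, for example "11-07-30" inside "2011-07-30". The sketch below assumes Python 3 and that the whole input string is expected to be a time, so it uses `re.fullmatch`; `TIME_RE` and `parse_time` are hypothetical names used only for illustration.

```python
import re

# Hours, an optional separator captured as group 2, minutes,
# the same separator again via the backreference \2, then seconds.
TIME_RE = re.compile(r"([01][0-9]|2[0-3])([:\-_]?)([0-5][0-9])\2([0-5][0-9])")


def parse_time(text):
    """Return (hours, minutes, seconds) as strings, or None if `text` is not a time."""
    match = TIME_RE.fullmatch(text)
    return match.group(1, 3, 4) if match else None


for candidate in ["201107", "20-11-07", "20:11-07", "2011-07", "2011-07-30"]:
    print(candidate, parse_time(candidate))
```

When times are embedded in longer text, `fullmatch` is too strict; in that case the surrounding context would need its own handling, which is outside this sketch.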

Extract Numeric Data from a Text file in Python

Say I have a text file with the data/string:
Dataset #1: X/Y= 5, Z=7 has been calculated
Dataset #2: X/Y= 6, Z=8 has been calculated
Dataset #10: X/Y =7, Z=9 has been calculated
I want the output to be on a csv file as:
X/Y, X/Y, X/Y
Which should display:
5, 6, 7
Here is my current approach, I am using string.find, but I feel like this is rather difficult in solving this problem:
data = open('TestData.txt').read()
#index of string
counter = 1
if (data.find('X/Y=')==1):
    #extracts segment out of string
    line = data[r+6:r+14]
    r = data.find('X/Y=')
    counter += 1
    print line
else:
    r = data.find('X/Y')
    line = data[r+6:r+14]
    for x in range(0,counter):
        print line
print counter
Error: For some reason, I'm only getting the value 5. When I set up a loop, I get infinite 5's.

If you want the numbers and your txt file is formatted like the first two lines, i.e. X/Y= 6, not like X/Y =7:
import re
result = []
with open("TestData.txt") as f:
    for line in f:
        # The pattern matches one or more digits preceded by "Y=" and a whitespace character (\s).
        s = re.search(r'(?<=Y=\s)\d+', line)
        # If there is a match, i.e. re.search does not return None, add it to the list.
        if s:
            result.append(s.group())
print(result)
['5', '6', '7']
To match the pattern in your comment, you should escape the period like \. or you will match strings like 1.2+3 etc. The "." has a special meaning in re.
So re.search(r'(?<=Counting Numbers =\s)\d\.\d\.\d',s).group()
will return only 1.2.3
If it makes it more explicit, you can use s=re.search(r'(?<=X/Y=\s)\d+',line) using the full X/Y=\s pattern.
Using the original line in your comment and updated line would return :
['5', '6', '7', '5', '5']
The (?<=Y=\s) is called a positive lookbehind assertion.
(?<=...)
Matches if the current position in the string is preceded by a match for ... that ends at the current position
There are lots of nice examples in the re documentation. The items in the parentheses are not returned.

Since it appears that the entities are all on a single line, I would recommend using readline in a loop to read the file line-by-line and then using a regex to parse out the components you're looking for from that line.
Edit re: OP's comment:
One regex pattern that could be used to capture the number given the specified format in this case would be: X/Y\s*=\s*(.+),
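A line-by-line version of this idea might look like the sketch below. It assumes Python 3, that each value is an integer, and it uses `\d+` instead of `.+` in the capture group so the match stops at the digits. An in-memory `io.StringIO` stands in for TestData.txt and for the CSV file so the example is self-contained; `XY_RE` and `extract_xy` are hypothetical names.

```python
import csv
import io
import re

# \s* on both sides of "=" accepts "X/Y= 5" as well as "X/Y =7".
XY_RE = re.compile(r"X/Y\s*=\s*(\d+)")


def extract_xy(lines):
    """Collect the X/Y value from every line that has one."""
    values = []
    for line in lines:
        match = XY_RE.search(line)
        if match:
            values.append(match.group(1))
    return values


# Stand-in for open("TestData.txt"); a real file object can be passed the same way.
sample = io.StringIO(
    "Dataset #1: X/Y= 5, Z=7 has been calculated\n"
    "Dataset #2: X/Y= 6, Z=8 has been calculated\n"
    "Dataset #10: X/Y =7, Z=9 has been calculated\n"
)
values = extract_xy(sample)

# Stand-in for the output CSV file; csv.writer puts all values on one row.
out = io.StringIO()
csv.writer(out).writerow(values)
print(out.getvalue().strip())
```

A capture group is used here rather than a lookbehind because Python's `re` requires lookbehind patterns to have a fixed width, which rules out `\s*` inside them.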

creating Dictionary-object from string that looks like dictionaries

I have a string that looks similar to the following:
myString = "major: 11, minor: 31, name: A=1,B=1,C=1,P=1, severity: 0, comment: this is down"
I have tried this so far:
dict(elem.split(':') for elem in myString.split(','))
It works fine until it reaches the name element above, whose parts cannot be split() on ':'.
Elements in that format I would like to have as a nested dictionary, e.g.
myDic = {'major':'11', 'minor': '31', 'name':{'A':'1', 'B':'1', 'C':'1', 'P': '1'}, 'severity': '0', 'comment': 'this is down'}
If possible I would like to avoid complicated parsing as these turn out to be hard to maintain.
Also I do not know the name/amount of the keys or values in the string above. I just know the format. This is not a JSON-response, this is part of a text in a file and I have no control over the current format.

FYI, this is NOT a complete solution.
If this is the fixed structure of your input and the pattern stays constant within your source, you can distinguish between the two kinds of comma-separated tokens.
The difference between major: 11, and name: A=1,B=1,C=1,P=1, is that the comma after a top-level token is followed by a SPACE, while the commas inside the name value are not. So simply by adding a space to the separator of the second split call, you can split your string properly.
So, the code should be something like this:
dict(elem.split(':') for elem in myString.split(', '))
Pay attention to the second split call: its separator is a comma followed by a SPACE.
As for the JSON format, it needs more work, I guess. I have no idea right now.
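Building on the split-on-", " idea, the nested part could be handled by checking each value for '='. This is only a sketch under the same assumption as above (top-level pairs are separated by a comma plus space, nested pairs by a bare comma), written for Python 3; `parse_pairs` is a hypothetical name.

```python
def parse_pairs(text):
    """Parse 'key: value, key: a=1,b=2' into a dict, nesting a=1,b=2 values."""
    result = {}
    for elem in text.split(", "):
        # partition splits on the first ':' only, so later colons stay in the value
        key, _, value = elem.partition(":")
        value = value.strip()
        if "=" in value:
            # nested pairs are separated by a bare comma and joined by '='
            value = dict(item.split("=", 1) for item in value.split(","))
        result[key.strip()] = value
    return result


myString = "major: 11, minor: 31, name: A=1,B=1,C=1,P=1, severity: 0, comment: this is down"
print(parse_pairs(myString))
```

The sketch breaks if a free-text value such as the comment contains ", " or "=", so it only suits input that keeps to the format shown.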

Here's another suggestion.
Why don't you transform it into dictionary notation?
E.g. in a first step, you replace everything between a ':' and (comma or end of input) that contains a '=' (and maybe no whitespace, I don't know) by wrapping it in braces and replacing '=' with ':'.
In a second step, wrap everything between a ':' and (comma or end of input) in ' quotes, removing leading and trailing whitespace.
Finally, you wrap it all in braces.
I still don't trust that syntax, though... maybe after a few thousand lines have been processed successfully...

At least, this parses the given example correctly...
import re
def parse(s):
    rx = r"""(?x)
        (\w+) \s* : \s*
        (
            (?: \w+ = \w+,)*
            (?: \w+ = \w+)
            |
            (?: [^,]+)
        )
    """
    r = {}
    for key, val in re.findall(rx, s):
        if '=' in val:
            val = dict(x.split('=') for x in val.split(','))
        r[key] = val
    return r
myString = "major: 11, minor: 31, name: A=1,B=1,C=1,P=1, severity: 0, comment: this is down"
print(parse(myString))
# {'major': '11', 'minor': '31', 'name': {'A': '1', 'B': '1', 'C': '1', 'P': '1'}, 'severity': '0', 'comment': 'this is down'}

Convert, or unformat, a string to variables (like format(), but in reverse) in Python

I have strings of the form Version 1.4.0\n and Version 1.15.6\n, and I'd like a simple way of extracting the three numbers from them. I know I can put variables into a string with the format method; I basically want to do that backwards, like this:
# So I know I can do this:
x, y, z = 1, 4, 0
print('Version {0}.{1}.{2}\n'.format(x, y, z))
# Output is 'Version 1.4.0\n'
# But I'd like to be able to reverse it:
mystr='Version 1.15.6\n'
a, b, c = mystr.unformat('Version {0}.{1}.{2}\n')
# And have the result that a, b, c = 1, 15, 6
Someone else I found asked the same question, but the reply was specific to their particular case: Use Python format string in reverse for parsing
A general answer (how to do format() in reverse) would be great! An answer for my specific case would be very helpful too though.

Just to build on Uche's answer, I was looking for a way to reverse a string via a pattern with kwargs. So I put together the following function:
import re
def string_to_dict(string, pattern):
    # turn each {name} placeholder into a named group; literal text is not escaped
    regex = re.sub(r'{(.+?)}', r'(?P<_\1>.+)', pattern)
    values = list(re.search(regex, string).groups())
    keys = re.findall(r'{(.+?)}', pattern)
    _dict = dict(zip(keys, values))
    return _dict
Which works as per:
>>> p = 'hello, my name is {name} and I am a {age} year old {what}'
>>> s = p.format(name='dan', age=33, what='developer')
>>> s
'hello, my name is dan and I am a 33 year old developer'
>>> string_to_dict(s, p)
{'age': '33', 'name': 'dan', 'what': 'developer'}
>>> s = p.format(name='cody', age=18, what='quarterback')
>>> s
'hello, my name is cody and I am a 18 year old quarterback'
>>> string_to_dict(s, p)
{'age': '18', 'name': 'cody', 'what': 'quarterback'}

>>> import re
>>> re.findall(r'(\d+)\.(\d+)\.(\d+)', 'Version 1.15.6\n')
[('1', '15', '6')]

The pypi package parse serves this purpose well:
pip install parse
Can be used like this:
>>> import parse
>>> result=parse.parse('Version {0}.{1}.{2}\n', 'Version 1.15.6\n')
>>> result
<Result ('1', '15', '6') {}>
>>> values=list(result)
>>> print(values)
['1', '15', '6']
Note that the docs say the parse package does not EXACTLY emulate the format specification mini-language by default; it also uses some type-indicators specified by re. Of special note is that s means "whitespace" by default, rather than str. This can be easily modified to be consistent with the format specification by changing the default type for s to str (using extra_types):
result = parse.parse(format_str, string, extra_types=dict(s=str))
Here is a conceptual idea for a modification of the string.Formatter built-in class using the parse package to add unformat capability that I have used myself:
import parse
from string import Formatter
class Unformatter(Formatter):
    '''A parsable formatter.'''
    def unformat(self, format, string, extra_types=dict(s=str), evaluate_result=True):
        return parse.parse(format, string, extra_types, evaluate_result)
    unformat.__doc__ = parse.Parser.parse.__doc__
IMPORTANT: the method name parse is already in use by the Formatter class, so I have chosen unformat instead to avoid conflicts.
UPDATE: You might use it like this- very similar to the string.Formatter class.
Formatting (identical to '{:d} {:d}'.format(1, 2)):
>>> formatter = Unformatter()
>>> s = formatter.format('{:d} {:d}', 1, 2)
>>> s
'1 2'
Unformatting:
>>> result = formatter.unformat('{:d} {:d}', s)
>>> result
<Result (1, 2) {}>
>>> tuple(result)
(1, 2)
This is of course of very limited use as shown above. However, I've put up a pypi package (parmatter - a project originally for my own use but maybe others will find it useful) that explores some ideas of how to put this idea to more useful work. The package relies heavily on the aforementioned parse package. EDIT: a few years of experience under my belt later, I realized parmatter (my first package!) was a terrible, embarrassing idea and have since deleted it.

Actually, the Python regular expression library already provides the general functionality you are asking for. You just have to change the syntax of the pattern slightly:
>>> import re
>>> from operator import itemgetter
>>> mystr='Version 1.15.6\n'
>>> m = re.match(r'Version (?P<_0>.+)\.(?P<_1>.+)\.(?P<_2>.+)', mystr)
>>> list(map(itemgetter(1), sorted(m.groupdict().items())))
['1', '15', '6']
As you can see, you have to change the (un)format strings from {0} to (?P<_0>.+). You could even require a decimal with (?P<_0>\d+). In addition, you have to escape some of the characters to prevent them from being interpreted as regex special characters. But this in turn can be automated, e.g. with
>>> re.sub(r'\\{(\d+)\\}', r'(?P<_\1>.+)', re.escape('Version {0}.{1}.{2}'))
'Version\\ (?P<_0>.+)\\.(?P<_1>.+)\\.(?P<_2>.+)'
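Putting the two steps together (escape the format string, then swap the placeholders for named groups) could look like the sketch below. It assumes Python 3 and positional placeholders of the form {0}, {1}, ...; `unformat` is a hypothetical helper name, not an existing string method.

```python
import re


def unformat(pattern, text):
    """Reverse of pattern.format(...) for numbered fields; returns a tuple of strings."""
    # Escape regex metacharacters in the literal text; this also escapes { and }.
    escaped = re.escape(pattern)
    # Turn each escaped placeholder \{N\} into a named group _N.
    regex = re.sub(r"\\{(\d+)\\}", r"(?P<_\1>.+)", escaped)
    match = re.match(regex, text)
    if match is None:
        return None
    groups = match.groupdict()
    # Order by the numeric part of the group name so that _10 sorts after _9.
    return tuple(groups[name] for name in sorted(groups, key=lambda n: int(n[1:])))


a, b, c = unformat("Version {0}.{1}.{2}", "Version 1.15.6\n")
print(a, b, c)
```

The values come back as strings; wrapping them in `int(...)` is left to the caller. Because each group is a greedy `.+`, adjacent placeholders without literal text between them cannot be separated reliably.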

Some time ago I made the code below that does the reverse of format but limited to the cases I needed.
And, I never tried it, but I think this is also the purpose of the parse library
My code:
import string
import re
_def_re = '.+'
_int_re = '[0-9]+'
_float_re = r'[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?'
_spec_char = r'[\^$.|?*+()'
def format_parse(text, pattern):
    """
    Scan `text` using the string.format-type `pattern`
    If `text` is not a string but iterable return a list of parsed elements
    Not all format-like patterns can be processed:
      - variable name cannot repeat (even unspecified ones s.t. '{}_{0}')
      - alignment is not taken into account
      - only the following variable types are recognized:
           'd' look for and returns an integer
           'f' look for and returns a float
    Examples::
        res = format_parse('the depth is -42.13', 'the {name} is {value:f}')
        print(res)
        print(type(res['value']))
        # {'name': 'depth', 'value': -42.13}
        # <class 'float'>
        print('the {name} is {value:f}'.format(**res))
        # 'the depth is -42.130000'
        # Ex2: without given variable name and an invalid item (2nd)
        versions = ['Version 1.4.0', 'Version 3,1,6', 'Version 0.1.0']
        v = format_parse(versions, 'Version {:d}.{:d}.{:d}')
        # v=[{0: 1, 1: 4, 2: 0}, None, {0: 0, 1: 1, 2: 0}]
    """
    # convert pattern to suitable regular expression & variable name
    v_int = 0   # available integer variable name for unnamed variable
    cur_g = 0   # indices of current regexp group name
    n_map = {}  # map variable name (keys) to regexp group name (values)
    v_cvt = {}  # (optional) type conversion function attached to variable name
    rpattern = '^'  # stores the regexp pattern related to format pattern
    for txt, vname, spec, conv in string.Formatter().parse(pattern):
        # process variable name
        if len(vname) == 0:
            vname = v_int
            v_int += 1
        if vname not in n_map:
            gname = '_' + str(cur_g)
            n_map[vname] = gname
            cur_g += 1
        else:
            gname = n_map[vname]
        # process type of required variables
        if 'd' in spec:
            vtype = _int_re
            v_cvt[vname] = int
        elif 'f' in spec:
            vtype = _float_re
            v_cvt[vname] = float
        else:
            vtype = _def_re
        # check for regexp special characters in txt (add '\' before)
        txt = ''.join(map(lambda c: '\\' + c if c in _spec_char else c, txt))
        rpattern += txt + '(?P<' + gname + '>' + vtype + ')'
    rpattern += '$'
    # replace dictionary key from regexp group-name to the variable-name
    def map_result(match):
        if match is None:
            return None
        match = match.groupdict()
        match = dict((vname, match[gname]) for vname, gname in n_map.items())
        for vname, value in match.items():
            if vname in v_cvt:
                match[vname] = v_cvt[vname](value)
        return match
    # parse pattern
    if isinstance(text, str):
        match = re.search(rpattern, text)
        match = map_result(match)
    else:
        comp = re.compile(rpattern)
        match = map(comp.search, text)
        match = list(map(map_result, match))
    return match
For your case, here is a usage example:
versions = ['Version 1.4.0', 'Version 3.1.6', 'Version 0.1.0']
v = format_parse(versions, 'Version {:d}.{:d}.{:d}')
# v=[{0: 1, 1: 4, 2: 0}, {0: 3, 1: 1, 2: 6}, {0: 0, 1: 1, 2: 0}]
# to get the versions as a list of integer list, you can use:
v = [[vi[i] for i in range(3)] for vi in filter(None,v)]
Note the filter(None,v) to remove unparsable versions (which return None). Here it is not necessary.

This
a, b, c = (int(i) for i in mystr.split()[1].split('.'))
will give you int values for a, b and c
>>> a
1
>>> b
15
>>> c
6
Depending on how regular or irregular, i.e., consistent, your number/version formats will be, you may want to consider the use of regular expressions, though if they will stay in this format, I would favor the simpler solution if it works for you.

Here's a solution in case you don't want to use the parse module. It converts format strings into regular expressions with named groups. It makes a few assumptions (described in the docstring) that were okay in my case, but may not be okay in yours.
import re
def match_format_string(format_str, s):
    """Match s against the given format string, return dict of matches.
    We assume all of the arguments in format string are named keyword arguments (i.e. no {} or
    {:0.2f}). We also assume that all chars are allowed in each keyword argument, so separators
    need to be present which aren't present in the keyword arguments (i.e. '{one}{two}' won't work
    reliably as a format string but '{one}-{two}' will if the hyphen isn't used in {one} or {two}).
    We raise if the format string does not match s.
    Example:
    fs = '{test}-{flight}-{go}'
    s = fs.format(test='first', flight='second', go='third')
    match_format_string(fs, s) -> {'test': 'first', 'flight': 'second', 'go': 'third'}
    """
    # First split on any keyword arguments, note that the names of keyword arguments will be in the
    # 1st, 3rd, ... positions in this list
    tokens = re.split(r'\{(.*?)\}', format_str)
    keywords = tokens[1::2]
    # Now replace keyword arguments with named groups matching them. We also escape between keyword
    # arguments so we support meta-characters there. Re-join tokens to form our regexp pattern
    tokens[1::2] = list(map(u'(?P<{}>.*)'.format, keywords))
    tokens[0::2] = list(map(re.escape, tokens[0::2]))
    pattern = ''.join(tokens)
    # Use our pattern to match the given string, raise if it doesn't match
    matches = re.match(pattern, s)
    if not matches:
        raise Exception("Format string did not match")
    # Return a dict with all of our keywords and their values
    return {x: matches.group(x) for x in keywords}
