Extract date string from (more) complex string (possibly a regex match)

I have a string template that looks like 'my_index-{year}'.
I do something like string_template.format(year=year) where year is some string. Result of this is some string that looks like my_index-2011.
Now to my question: given a string like my_index-2011 and my template 'my_index-{year}', what might be a slick way to extract the {year} portion?
[Note: I know the parse library exists.]

The parse module provides the inverse of format(). Its documentation describes it as:
Parse strings using a specification based on the Python format() syntax.
>>> from parse import parse
>>> s = "my_index-2011"
>>> f = "my_index-{year}"
>>> parse(f, s)['year']
'2011'
Alternatively, since you are extracting a year, you could use the dateutil parser in fuzzy mode:
>>> from dateutil.parser import parse
>>> parse("my_index-2011", fuzzy=True).year
2011
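If adding a dependency is not an option, the same idea can be approximated with the standard library. The sketch below uses a hypothetical helper, extract_fields (not part of any library), that turns a format() template into a regex. It assumes every field is named, is a valid identifier, and carries no format spec.

```python
import re
from string import Formatter


def extract_fields(template, text):
    """Return a dict of named template fields found in text, or None."""
    parts = []
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion)
    for literal, field, _spec, _conv in Formatter().parse(template):
        parts.append(re.escape(literal))
        if field:  # field is None when only trailing literal text remains
            parts.append("(?P<{}>.+?)".format(field))
    # fullmatch requires the whole string to fit the template
    match = re.fullmatch("".join(parts), text)
    return match.groupdict() if match else None


print(extract_fields("my_index-{year}", "my_index-2011"))  # {'year': '2011'}
```

Templates with several adjacent fields are ambiguous under this approach, which is one reason to prefer the parse library when it is available.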

Use the split() string function to split the string into two parts around the dash, then grab just the second part.
mystring = "my_index-2011"
year = mystring.split("-")[1]

I assume "year" is 4 digits and you have multiple indexes:
import re
data = 'adsf-1234 fsfdr lkjdfaif ln ewr-1234 adsferggs sfdgrsfgadsf-3456'
idx = ['ewr', 'adsf']
res = ''
# Build one pattern per index, e.g. 'ewr-[0-9]{4}'
patterns = ['%s-[0-9]{4}' % index for index in idx]
for index, pattern in zip(idx, patterns):
    # Find every 'index-YYYY' occurrence and strip the 'index-' prefix
    res += ' '.join(re.findall(pattern, data)).replace(index + '-', '') + ' '
print(res)
Output:
1234 1234 3456

Yes, a regex would be helpful here.
In [1]: import re
In [2]: s = 'my_string-2014'
In [3]: print( re.search(r'\d{4}', s).group(0) )
2014
Edit: I should have mentioned that your regex can be more sophisticated. You can pull out a subcomponent of a more specific string, for example:
In [4]: print( re.search(r'my_string-(\d{4})$', s).group(1) )
2014
Given the problem you presented, I think any "find the year" formula should be expressible in terms of a regular expression.

You are going to want to use the string method split to split on "-", and then catch the last element as your year:
year = "any_index-2016".split("-")[-1]
Because you take the last element (using -1 as the index), the index name itself can contain hyphens and the year is still extracted correctly.
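A minimal sketch of the same idea with rpartition(), which splits only once at the last hyphen. The helper name year_suffix and the 4-digit check are assumptions added here for illustration; they are not part of the answer above.

```python
def year_suffix(name):
    """Return the text after the last hyphen if it looks like a 4-digit year."""
    _head, sep, tail = name.rpartition("-")
    # sep is empty when the string contains no hyphen at all
    if sep and len(tail) == 4 and tail.isdigit():
        return tail
    return None


print(year_suffix("any-index-2016"))  # 2016
print(year_suffix("no_year_here"))    # None
```

Returning None instead of raising makes it easier to skip strings that do not follow the template.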

Related

extract hour from a string _ unclear format

This question may be a duplicate, but I didn't find an exact solution. I have this type of string that includes a date and time:
"check_in": "10/25/2019 14:30"
I need to extract the hour from it, but the input is not always in a valid format. I tried these patterns so far, but they include the ":" character:
\d+?(:)
(\d+:)
(\d+)*:

Regular expressions aren't always the best way to deal with strings representing dates, especially if you can't rely on the input format to be consistent. Use a specialized parser instead:
>>> from dateutil import parser
>>> parser.parse("10/25/2019 14:30").hour
14
>>> parser.parse("10/25/2019 2:30 PM").hour
14
>>> parser.parse("2019-10-25T143000").hour
14
The module dateutil isn't in the standard library but is well worth the trouble of downloading.

\d+(?=:)
You don't need to match the ":", only to check that it follows the digits, so use a positive lookahead: (?=:).
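A short sketch of how the lookahead pattern could be used from Python. The helper name hour_text is hypothetical, and it returns the first run of digits that is followed by a colon, whatever that run represents.

```python
import re

# Digits that are immediately followed by ':'; the colon itself is not consumed
HOUR_BEFORE_COLON = re.compile(r"\d+(?=:)")


def hour_text(value):
    match = HOUR_BEFORE_COLON.search(value)
    return match.group() if match else None


print(hour_text("10/25/2019 14:30"))  # 14
print(hour_text("10/25/2019"))        # None
```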

First, this is what is wrong with your regexes:
\d+?(:) - finds a number and a colon (14:) and puts the colon into a group
(\d+:) - finds a number and a colon (14:) and puts all of it into a group
(\d+)*: - finds an optional (because of *) number and a colon (14:) and puts the number into a group
So, the last one could work:
>>> import re
>>> match = re.search(r'(\d+)*:', "10/25/2019 14:30")
>>> match.group(0) # whole result
'14:'
>>> match.group(1) # just the number
'14'
But then again, it would give a wrong result (instead of failing) on something like "time: 14:30", making the error difficult to debug later. What you want is a stricter search, e.g. matching the whole string and naming all groups:
>>> regex = r'(?P<month>\d\d)/(?P<day>\d\d)/(?P<year>\d{4}) (?P<hour>\d\d):(?P<minute>\d\d)'
>>> re.search(regex, "10/25/2019 14:30").group('hour')
'14'
Another, easier and even safer way is to use strptime:
>>> import datetime
>>> datetime.datetime.strptime("10/25/2019 14:30", "%m/%d/%Y %H:%M")
datetime.datetime(2019, 10, 25, 14, 30)
Now you have the complete datetime object and you can extract the .hour if you want.
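When the input can arrive in a handful of known layouts, strptime can be tried against each in turn. This is a sketch only: the KNOWN_FORMATS list and the parse_hour name are assumptions made for illustration, and %p depends on the current locale.

```python
from datetime import datetime

# Hypothetical list of layouts the input is expected to use
KNOWN_FORMATS = (
    "%m/%d/%Y %H:%M",
    "%m/%d/%Y %I:%M %p",
    "%Y-%m-%dT%H%M%S",
)


def parse_hour(value):
    """Return the hour from the first format that parses value."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).hour
        except ValueError:
            continue  # try the next layout
    raise ValueError("unrecognized date format: {!r}".format(value))


print(parse_hour("10/25/2019 14:30"))    # 14
print(parse_hour("10/25/2019 2:30 PM"))  # 14
```

Unlike a loose regex, this raises on input such as "time: 14:30" instead of silently returning a wrong value.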

How to replace a pattern using regular expression?

string1 = "2018-Feb-23-05-18-11"
I would like to replace a particular pattern in a string.
The output should be 2018-Feb-23-5-18-11.
How can I do that using re.sub?
Example:
import re
output = re.sub(r'10', r'20', "hello number 10, Agosto 19")
# hello number 20, Agosto 19
I fetch the current datetime from the datetime module and format it as follows:
import datetime
import time
ts = time.time()
st = datetime.datetime.fromtimestamp(ts).strftime("%Y-%b-%d-%I-%M-%S")
I thought re.sub would be the best way to do that.
ex1 :
string1 = "2018-Feb-23-05-18-11"
output : 2018-Feb-23-5-18-11
ex2 :
string1 = "2018-Feb-23-05-8-11"
output : 2018-Feb-23-5-08-11

When working with dates and times, it is almost always best to convert the date first into a Python datetime object rather than trying to attempt to alter it using a regular expression. This can then be converted back into the required date format more easily.
As for leading zeros, the portable strftime() directives only produce zero-padded values, so for more flexibility it is sometimes necessary to mix strftime() with standard Python string formatting:
from datetime import datetime
for test in ['2018-Feb-23-05-18-11', '2018-Feb-23-05-8-11', '2018-Feb-1-0-0-0']:
    dt = datetime.strptime(test, '%Y-%b-%d-%H-%M-%S')
    print('{dt.year}-{}-{dt.day}-{dt.hour}-{dt.minute:02}-{dt.second}'.format(dt.strftime('%b'), dt=dt))
Giving you:
2018-Feb-23-5-18-11
2018-Feb-23-5-08-11
2018-Feb-1-0-00-0
This uses a .format() function to combine the parts. It allows objects to be passed and the formatting is then able to access the object's attributes directly. The only part that needs to be formatted using strftime() is the month.
A re.sub() version would give the same results:
import re
for test in ['2018-Feb-23-05-18-11', '2018-Feb-23-05-8-11', '2018-Feb-1-0-0-0']:
    print(re.sub(r'(\d+-\w+)-(\d+)-(\d+)-(\d+)-(\d+)', lambda x: '{}-{}-{}-{:02}-{}'.format(x.group(1), int(x.group(2)), int(x.group(3)), int(x.group(4)), int(x.group(5))), test))

Use the datetime module.
Ex:
import datetime
string1 = "2018-Feb-23-05-18-11"
d = datetime.datetime.strptime(string1, "%Y-%b-%d-%H-%M-%S")
print("{0}-{1}-{2}-{3}-{4}-{5}".format(d.year, d.strftime("%b"), d.day, d.hour, d.minute, d.second))
Output:
2018-Feb-23-5-18-11
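On Python 3.6 or later, the same conversion can be written with an f-string. This is a sketch, with the function name reformat chosen here for illustration. It assumes the hour is parsed with %H as in the answers above, and that month names follow the current locale.

```python
from datetime import datetime


def reformat(stamp):
    """Drop zero padding everywhere except the minute, which stays two digits."""
    d = datetime.strptime(stamp, "%Y-%b-%d-%H-%M-%S")
    # {d:%b} delegates to strftime; the integer fields use plain formatting
    return f"{d.year}-{d:%b}-{d.day}-{d.hour}-{d.minute:02}-{d.second}"


print(reformat("2018-Feb-23-05-18-11"))  # 2018-Feb-23-5-18-11
print(reformat("2018-Feb-23-05-8-11"))   # 2018-Feb-23-5-08-11
```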

What's a better way to process inconsistently structured strings?

I have an output string like this:
read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec
And I want to just extract one of the numerical values for computation, say iops. I'm processing it like this:
if 'read ' in key:
    my_read_iops = value.split(",")[2].split("=")[1]
    result['test_details']['read'] = my_read_iops
But there are slight inconsistencies with some of the strings I'm reading in and my code is getting super complicated and verbose. So instead of manually counting the number of commas vs "=" chars, what's a better way to handle this?

You can use \s* in a regular expression to handle inconsistent spacing; it matches zero or more whitespace characters:
import re
s = 'read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec'
for m in re.finditer(r'\s*(?P<name>\w*)\s*=\s*(?P<value>[\w/]*)\s*', s):
    print(m.group('name'), m.group('value'))
# io 131220KB
# bw 14016KB/s
# iops 3504
# runt 9362msec
Using named groups, you can build the pattern string from a list of column names:
names = ['io', 'bw', 'iops', 'runt']
name_val_pat = r'\s*{name}\s*=\s*(?P<{group_name}>[\w/]*)\s*'
pattern = ','.join([name_val_pat.format(name=name, group_name=name) for name in names])
# '\s*io\s*=\s*(?P<io>[\w/]*)\s*,\s*bw\s*=\s*(?P<bw>[\w/]*)\s*,\s*iops\s*=\s*(?P<iops>[\w/]*)\s*,\s*runt\s*=\s*(?P<runt>[\w/]*)\s*'
match = re.search(pattern, s)
data_dict = {name: match.group(name) for name in names}
print(data_dict)
# {'io': '131220KB', 'bw': '14016KB/s', 'runt': '9362msec', 'iops': '3504'}
In this way, you only need to change names and keep the order correct.
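If the field order is not reliable either, one option is to collect every name=value pair into a dict and look up the key you need. This is a minimal sketch: the helper name parse_pairs is hypothetical, and it assumes values contain only word characters and "/".

```python
import re

# name, optional spaces, '=', optional spaces, value made of word chars or '/'
PAIR = re.compile(r"(\w+)\s*=\s*([\w/]+)")


def parse_pairs(line):
    """Return all name=value pairs in line as a dict of strings."""
    return dict(PAIR.findall(line))


line = "read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec"
fields = parse_pairs(line)
iops = int(fields["iops"])
print(iops)  # 3504
```

A missing key then surfaces as a KeyError on the lookup rather than as a wrong value picked by position.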

If I were you, I'd use a regular expression (regex) as the first choice.
import re
s = "read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec"
re.search(r"iops=(\d+)", s).group(1)
This code finds the substring that starts with 'iops=' followed by one or more digits. The parentheses capture the digits, so group(1) returns the target string '3504'.
You can find more information about regular expressions at
https://docs.python.org/3.6/library/re.html#module-re
Regular expressions are a powerful language for complex pattern matching with a compact syntax.

from re import match
string = 'read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec'
iops = match(r'.+(iops=)([0-9]+)', string).group(2)
iops
'3504'

Python - Most elegant way to extract a substring, being given left and right borders [duplicate]

I have a string - Python :
string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
Expected output is :
"Atlantis-GPS-coordinates"
I know that the expected output is ALWAYS surrounded by "/bar/" on the left and "/" on the right :
"/bar/Atlantis-GPS-coordinates/"
Proposed solution would look like :
a = string.find("/bar/")
b = string.find("/",a+5)
output = string[a+5:b]
This works, but I don't like it.
Does someone know a beautiful function or tip ?

You can use split:
>>> string.split("/bar/")[1].split("/")[0]
'Atlantis-GPS-coordinates'
Adding a maxsplit of 1 is slightly more efficient, I suppose:
>>> string.split("/bar/", 1)[1].split("/", 1)[0]
'Atlantis-GPS-coordinates'
Or use partition:
>>> string.partition("/bar/")[2].partition("/")[0]
'Atlantis-GPS-coordinates'
Or a regex:
>>> re.search(r'/bar/([^/]+)', string).group(1)
'Atlantis-GPS-coordinates'
Depends on what speaks to you and your data.

What you have isn't all that bad. I'd write it as:
start = string.find('/bar/') + 5
end = string.find('/', start)
output = string[start:end]
as long as you know that /bar/WHAT-YOU-WANT/ is always going to be present. Otherwise, I would reach for the regular expression knife:
>>> import re
>>> PATTERN = re.compile('^.*/bar/([^/]*)/.*$')
>>> s = '/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/'
>>> match = PATTERN.match(s)
>>> match.group(1)
'Atlantis-GPS-coordinates'
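For the case where the markers may be absent, a small helper built on partition() avoids both the index arithmetic and the regex. This is a sketch; the name between and the choice to return None on a missing marker are assumptions made here.

```python
def between(text, left, right):
    """Return the text between the first left marker and the next right marker."""
    _before, found_left, rest = text.partition(left)
    if not found_left:
        return None  # left marker missing
    middle, found_right, _after = rest.partition(right)
    return middle if found_right else None  # None when right marker missing


string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
print(between(string, "/bar/", "/"))  # Atlantis-GPS-coordinates
```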

import re
pattern = r'(?<=/bar/).+?/'
string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
result = re.search(pattern, string)
# Drop the trailing '/' that the pattern matched
print(string[result.start():result.end() - 1])
# Atlantis-GPS-coordinates
How the pattern works:
1. (?<=/bar/) is a lookbehind: the rest of the pattern matches only if /bar/ immediately precedes it.
2. .+?/ matches as few characters as possible up to and including the next '/'.
Hope that helps some.
If you need to run this kind of search many times, it is better to compile the pattern with re.compile() for performance; if you only need it once, don't bother.

Using re (slower than other solutions):
>>> import re
>>> string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
>>> re.search(r'(?<=/bar/)[^/]+(?=/)', string).group()
'Atlantis-GPS-coordinates'

Regex Expression not matching correctly

I'm tackling a python challenge problem to find a block of text in the format xXXXxXXXx (lower vs upper case, not all X's) in a chunk like this:
jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn
I have tested the following RegEx and found it correctly matches what I am looking for from this site (http://www.regexr.com/):
'([a-z])([A-Z]){3}([a-z])([A-Z]){3}([a-z])'
However, when I try to match this expression to the block of text, it just returns the entire string:
In [1]: import re
In [2]: example = 'jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn'
In [3]: expression = re.compile(r'([a-z])([A-Z]){3}([a-z])([A-Z]){3}([a-z])')
In [4]: found = expression.search(example)
In [5]: print(found.string)
jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn
Any ideas? Is my expression incorrect? Also, if there is a simpler way to represent that expression, feel free to let me know. I'm fairly new to RegEx.

You need to return the match group instead of the string attribute.
>>> import re
>>> s = 'jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn'
>>> rgx = re.compile(r'[a-z][A-Z]{3}[a-z][A-Z]{3}[a-z]')
>>> found = rgx.search(s).group()
>>> print(found)
nJDKoJIWh

The string attribute always returns the string passed as input to the match. This is clearly documented:
string
The string passed to match() or search().
The problem has nothing to do with the matching, you're just grabbing the wrong thing from the match object. Use match.group(0) (or match.group()).
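The difference between the attributes of a match object can be seen side by side in a short sketch, using the string and pattern from the question:

```python
import re

s = "jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn"
m = re.search(r"[a-z][A-Z]{3}[a-z][A-Z]{3}[a-z]", s)

print(m.string)   # the full input string, unchanged
print(m.group())  # nJDKoJIWh  (only the matched text)
print(m.span())   # (17, 26)   (start and end offsets of the match in s)
```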

Based on xXXXxXXXx, if you want runs of three uppercase letters separated by single lowercase letters, this is the pattern you want:
([a-z])(([A-Z]){3}([a-z]))+
You can also get the matched text from your search call with group():
print(expression.search(example).group(0))
