regular expression search in python

I am trying to parse some data and have just started reading up on regular expressions, so I am pretty new to them. This is the code I have so far:
String = "MEASUREMENT 3835 303 Oxygen: 235.78 Saturation: 90.51 Temperature: 24.41 DPhase: 33.07 BPhase: 29.56 RPhase: 0.00 BAmp: 368.57 BPot: 18.00 RAmp: 0.00 RawTem.: 68.21"
String = String.strip('\t\x11\x13')
String = String.split("Oxygen:")
print String[1]
String[1].lstrip
print String[1]
What I am trying to do is pull the oxygen reading (235.78) out of the string and put it in its own variable using a regular expression search. I realize that there should be an easy solution, but I am trying to figure out how regular expressions work and they are making my head hurt. Thanks for any help.
Richard

re.search( r"Oxygen: *([\d.]+)", String ).group( 1 )
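As a minimal sketch of how that one-liner could be used to put the reading in its own variable: the names `line`, `match` and `oxygen` are made up for this example, and the input is a shortened copy of the string from the question. The only addition is a check for None, since re.search returns None when the pattern is not found.

```python
import re

# Shortened copy of the measurement string from the question.
line = ("MEASUREMENT 3835 303 Oxygen: 235.78 Saturation: 90.51 "
        "Temperature: 24.41")

# Capture the run of digits and dots that follows "Oxygen:".
match = re.search(r"Oxygen:\s*([\d.]+)", line)

# re.search returns None when nothing matches, so check before using it.
oxygen = float(match.group(1)) if match else None
print(oxygen)  # 235.78
```

Converting with float() is optional; drop it if the value should stay a string.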

import re
string = "blabla Oxygen: 10.10 blabla"
# Raw string, so \W reaches the regex engine unchanged.
regex_oxygen = re.compile(r"Oxygen:\W+([0-9.]*)")
result = regex_oxygen.findall(string)
print(result)

What for?
print(String.split()[4])

For general parsing of lists like this, one could use:
import re
String = "MEASUREMENT 3835 303 Oxygen: 235.78 Saturation: 90.51"
String = String.replace(':', '')
# Drop the "MEASUREMENT <n> <n>" prefix, then split the rest into alternating names and values.
value_list = re.split(r"MEASUREMENT\W+[0-9]+\W+[0-9]+\W", String)[1].rstrip().split()
values = dict(zip(value_list[::2], map(float, value_list[1::2])))
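A different way to get the same kind of dictionary, sketched here under the assumption that every field has the form "Name: number", is to let re.findall pick out the name/value pairs directly. The names `record` and `pairs` are made up for the example, and the input is a shortened copy of the string from the question.

```python
import re

record = ("MEASUREMENT 3835 303 Oxygen: 235.78 Saturation: 90.51 "
          "Temperature: 24.41 RawTem.: 68.21")

# Each field looks like "Name: number"; names may contain a dot (RawTem.).
pairs = re.findall(r"([A-Za-z.]+):\s*(-?\d+(?:\.\d+)?)", record)

# Build a name -> float mapping from the captured pairs.
values = {name: float(number) for name, number in pairs}
print(values["Oxygen"])  # 235.78
```

With two capturing groups, re.findall returns a list of (name, number) tuples, so no prefix stripping is needed.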

I believe the answer to your specific problem has been posted. However, I wanted to point you to a few resources on regular expressions in Python. The Python documentation on regular expressions is the place to start.
O'Reilly also has many good books on the subject, whether you want to understand regular expressions deep down or just enough to make things work.
Finally, regular-expressions.info is a good resource for regular expressions across mainstream languages. You can even test your regular expressions on the website.

I would like to share my "is this an email?" regex, just to inspire you. :)
import re

emailregex = r"^[a-zA-Z.]+@mycompany\.org$"

def validateEmail(email):
    """Return 1 if email is a mycompany.org address, 0 if not."""
    # len("x.y@mycompany.org") == 17
    if len(email) >= 17:
        if re.match(emailregex, email) is not None:
            return 1
    return 0

Related

How to validate hdfs result with python regex to find out if it is a folder, file or non exist

I have a string holding the output of an hdfs command that shows, for a given path, whether it is a folder, a file, or doesn't exist. Here are examples:
str_file:
-rw-rw----+ 3 jdoe clouderausersdev 12267543 2018-02-05 16:41 hdfs://nameservice1/client/abc/part-00000-994917013a6a-c000.snappy.parquet
str_folder:
Found 3 items
-rw-rw----+ 3 jdoe clouderausersdev 0 2018-02-05 16:41 hdfs://nameservice1/client/abc/_SUCCESS
-rw-rw----+ 3 jdoe clouderausersdev 12267543 2018-02-05 16:41 hdfs://nameservice1/client/abc/part-00000-994917013a6a-c000.snappy.parquet
-rw-rw----+ 3 jdoe clouderausersdev 12267543 2018-02-05 16:41 hdfs://nameservice1/client/abc/part-00001-994917013a6a-c000.snappy.parquet
-rw-rw----+ 3 jdoe clouderausersdev 12267543 2018-02-05 16:41 hdfs://nameservice1/client/abc/part-00002-994917013a6a-c000.snappy.parquet
str_nonexist:
ls: `hdfs://nameservice1/client/abc/part-00000.parqu': No such file or directory
Now I want to determine the result with a regex check. Here is the problem:
import re
regex_folder = "Found [1-9]\d items"
regex_file = ".parquet"
regex_error = "No such file"
Testing result is as below:
So, how do I tell the difference between m and m1? Apparently m means nothing was found, while m1 means there is a match.
Eventually I need to tell the three scenarios apart: folder, file, nonexist.
Thank you very much.
Update (as per Филип Димитровски):
Still not working: m1 should show a match and m should not.

Your regex Found [1-9]\d items means you want to search for Found xy items where x is a digit between 1 and 9, and y is a digit between 0 and 9. This is problematic, and I assume it is not what you want. If you want to match one or more digits, just use \d+. You may use online regex testers to debug such issues.
The second problem is the misuse of string literals. When working with regular expressions in Python, you should always use raw string literals, which start with the letter 'r'.
This is good: re.search(r'regex \d here', ...)
This is bad: re.search('regex \d here', ...)
Once you fix those, result = re.search(r'Found \d+ items', some_string) will work. To check whether there are no matches, compare the result with None. If the result is not None, it is a match object. Note: when an expression evaluates to None, the interactive interpreter displays nothing for it.
Here's a working demo:
import re
str1 = 'ffff'
str2 = 'Found 3 items ffff'
reg_folder = r'Found ([1-9]\d*) items'
if re.search(reg_folder, str1) is None:
    print('Nothing found in str1')
result = re.search(reg_folder, str2)
if result is not None:
    num = result[1]
    print('Found the number: {}'.format(num))
else:
    print('Nothing found in str2')
Also, keep in mind that regular expressions are not good at parsing human-friendly messages and there may be libraries for HDFS instead of parsing raw output.
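To connect this to the three scenarios in the question, here is a rough sketch that applies the same None checks in sequence. The function name `classify_ls_output` and the "unknown" fallback are made up for illustration, and treating a line that starts with a permission string such as -rw-rw---- as a file is an assumption based only on the sample output shown in the question.

```python
import re

def classify_ls_output(text):
    """Return 'nonexist', 'folder', 'file' or 'unknown' for captured listing output."""
    if re.search(r"No such file or directory", text) is not None:
        return "nonexist"
    # A folder listing starts with a "Found N items" header line.
    if re.search(r"^Found \d+ items", text, re.MULTILINE) is not None:
        return "folder"
    # Assumption: a single file is listed as one line starting with its permissions.
    if re.search(r"^-[rwx-]{9}", text, re.MULTILINE) is not None:
        return "file"
    return "unknown"

# Sample strings copied (and shortened) from the question.
str_file = ("-rw-rw----+ 3 jdoe clouderausersdev 12267543 2018-02-05 16:41 "
            "hdfs://nameservice1/client/abc/part-00000-994917013a6a-c000.snappy.parquet")
str_folder = ("Found 3 items\n"
              "-rw-rw----+ 3 jdoe clouderausersdev 0 2018-02-05 16:41 "
              "hdfs://nameservice1/client/abc/_SUCCESS")
str_nonexist = ("ls: `hdfs://nameservice1/client/abc/part-00000.parqu': "
                "No such file or directory")

for sample in (str_file, str_folder, str_nonexist):
    print(classify_ls_output(sample))
```

The order of the checks matters: the folder test runs before the file test because a folder listing also contains permission lines.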

How to find ellipses in text string Python?

Fairly new to Python (and Stack Overflow!) here. I have a data set with subject line data (text strings) that I am using to build a bag-of-words model. I'm creating new variables that flag a 0 or 1 for various possible scenarios, but I'm stuck trying to identify where there is an ellipsis ("...") in the text. Here's where I'm starting from:
Data_Frame['Elipses'] = Data_Frame.Subject_Line.str.match('(\w+)\.{2,}(.+)')
Inputting ('...') doesn't work for obvious reasons, but the above RegEx code was suggested--still not working. Also tried this:
Data_Frame['Elipses'] = Data_Frame.Subject_Line.str.match('.\.\.\')
No dice.
The above code shell works for other variables I've created, but I'm also having trouble creating a 0-1 output instead of True/False (would be an 'as.numeric' argument in R.) Any help here would also be appreciated.
Thanks!

Using search() instead of match() would spot an ellipsis at any point in the text. In pandas, str.contains() supports regular expressions, for example:
import pandas as pd
df = pd.DataFrame({'Text' : ["hello..", "again... this", "is......a test", "Real ellipses… here", "...not here"]})
df['Ellipses'] = df.Text.str.contains(r'\w+(\.{3,})|…')
print(df)
Giving you:
                  Text  Ellipses
0              hello..     False
1        again... this      True
2       is......a test      True
3  Real ellipses… here      True
4          ...not here     False
Or without pandas:
import re
for test in ["hello..", "again... this", "is......a test", "Real ellipses… here", "...not here"]:
    print(int(bool(re.search(r'\w+(\.{3,})|…', test))))
This matches the three middle test strings, giving:
0
1
1
1
0
Take a look at search-vs-match for a good explanation in the Python docs.
To display the matching words:
import re
for test in ["hello..", "again... this", "is......a test", "...def"]:
    ellipses = re.search(r'(\w+)\.{3,}', test)
    if ellipses:
        print(ellipses.group(1))
Giving you:
again
is
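To make the search-versus-match point concrete, here is a small sketch using one of the test strings from above. The variable names are made up for the example, and the pattern is simplified to just three or more dots.

```python
import re

text = "again... this"
pattern = r"\.{3,}"

# re.match only tries the pattern at the very start of the string.
at_start = re.match(pattern, text)
print(at_start)  # None, because the text starts with "again"

# re.search scans forward and finds the pattern anywhere in the string.
anywhere = re.search(pattern, text)
print(anywhere.group(0))  # ...

# int(bool(...)) turns the result into the 0/1 flag asked for in the question.
flag = int(bool(anywhere))
print(flag)  # 1
```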

Python multiple if search in one string

I'm a network engineer with no programming experience who recently started with Python, but I'm making small improvements every day.
I need some help with checking multiple matches in an if statement, like:
if "access-class 30" in output and "exec-timeout 5 5" in output:
    print('###### ACL VTY OK!!! ######')
Is it possible to check for multiple keywords in a single string?
Thanks for all your time.

Use the all function with a generator expression:
data = ["access-class 30", "exec-timeout 5 5"]
if all(s in output for s in data):
    print('###### ACL VTY OK!!! ######')
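A small variation on the same idea, sketched here with a hypothetical helper called `missing_keywords`, reports which keywords were not found instead of only a yes/no answer. The `output` string below is invented sample text, not real device output.

```python
def missing_keywords(output, keywords):
    """Return the keywords that do not appear in output (empty list means all found)."""
    return [k for k in keywords if k not in output]

# Hypothetical device output, made up for this example.
output = "line vty 0 4\n access-class 30 in\n exec-timeout 5 5\n"
required = ["access-class 30", "exec-timeout 5 5"]

missing = missing_keywords(output, required)
if not missing:
    print("###### ACL VTY OK!!! ######")
else:
    print("Missing:", ", ".join(missing))
```

An empty list is falsy in Python, so `not missing` is equivalent to the all() check above.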

Yes, it is possible.
You can use regular expressions (regex).
import re
li = []  # List of all the keywords
for l in li:
    # re.escape treats each keyword as literal text; finditer yields one match object per hit.
    for m in re.finditer(re.escape(l), output):
        print('match found')

Parsing file name with RegEx - Python

I'm trying to get the "real" name of a movie from its name when you download it.
So for instance, I have
Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY
and would like to get
Star Wars Episode 4 A New Hope
So I'm using this regex:
.*?\d{1}?[ .a-zA-Z]*
which works fine, but only for a movie with a number, as in 'Iron Man 3' for example.
I'd like to be able to get movies like 'Interstellar' from
Interstellar.2014.1080p.BluRay.H264.AAC-RARBG
and I currently get
Interstellar 2
I tried several ways, and spent quite a lot of time on it already, but figured it wouldn't hurt asking you guys if you had any suggestion/idea/tip on how to do it...
Thanks a lot!

Given your examples and assuming you always download in 1080p (or know that field's value):
x = 'Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'
y = x.split('.')
print(" ".join(y[:y.index('1080p')-1]))
Forget the regex (for now anyway!) and work with the fixed field layout. Find a field you know (1080p) and remove the information you don't want (the year). Recombine the results and you get "Interstellar" and "Star Wars Episode 4 A New Hope".

The following regex would work (assuming the format is something like moviename.year.1080p.anything or moviename.year.720p.anything):
.*(?=.\d{4}.*\d{3,}p)
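As a sketch of how this lookahead pattern might be applied in Python: the helper name `movie_title` is made up for the example, and replacing the remaining dots with spaces is an extra step added here to reach the output asked for in the question.

```python
import re

# The pattern from the answer: everything up to a separator, a 4-digit year,
# and a later resolution such as 720p or 1080p.
pattern = re.compile(r".*(?=.\d{4}.*\d{3,}p)")

def movie_title(filename):
    """Return the title part of filename with dots turned into spaces, or None."""
    m = pattern.match(filename)
    if m is None:
        return None
    return m.group(0).replace(".", " ")

print(movie_title("Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"))
print(movie_title("Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"))
```

The lookahead only checks what follows the title without consuming it, so the year and resolution never end up in the match.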

\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$
Try this with re.sub. See the demo:
https://regex101.com/r/hR7tH4/10
import re
p = re.compile(r'\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$', re.MULTILINE)
test_str = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY\nInterstellar.2014.1080p.BluRay.H264.AAC-RARBG\nIron Man 3"
subst = " "
result = re.sub(p, subst, test_str)

Assuming there is always a four-digit year or a four-digit resolution in the movie's file name, a simple solution is to replace the unwanted parts, matched by
"(?:\.|\d{4,4}.+$)"
with a blank and strip() the result afterwards.
For example:
import re
test1 = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"
test2 = "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"
res1 = re.sub(r"(?:\.|\d{4,4}.+$)", ' ', test1).strip()
res2 = re.sub(r"(?:\.|\d{4,4}.+$)", ' ', test2).strip()
print(res1, res2, sep='\n')
# Star Wars Episode 4 A New Hope
# Interstellar

Find all IPs on an HTML Page

I want to get an HTML page with python and then print out all the IPs from it.
I will define an IP as the following:
x.x.x.x:y
Where:
x = a number between 0 and 256.
y = a number with < 7 digits.
Thanks.

Right. The only part I can't do is the regular expression one. – das
If someone shows me that, I will be fine. – das
import re
ip = re.compile(r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?):\d{1,6}\b")
junk = " 1.1.1.1:123 2.2.2.2:321 312.123.1.12:123 "
print(ip.findall(junk))
# outputs ['1.1.1.1:123', '2.2.2.2:321']
Here is a complete example:
import re, urllib2
f = urllib2.urlopen("http://www.samair.ru/proxy/ip-address-01.htm")
junk = f.read()
ip = re.compile(r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?):\d{1,6}\b")
print(ip.findall(junk))
# ['114.30.47.10:80', '118.228.148.83:80', '119.70.40.101:8080', '12.47.164.114:8888',
#  '121.17.161.114:3128', '122.152.183.103:80', '122.224.171.91:3128', '123.234.32.27:8080',
#  '124.107.85.115:80', '124.247.222.66:6588', '125.76.228.201:808', '128.112.139.75:3128',
#  '128.208.004.197:3128', '128.233.252.11:3124', '128.233.252.12:3124']

The basic approach would be:
Use urllib2 to download the contents of the page
Use a regular expression to extract IPv4-like addresses
Validate each match according to the numeric constraints on each octet
Print out the list of matches
Please provide a clearer indication of what specific part you are having trouble with, along with evidence to show what it is you've tried thus far.
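The extract-then-validate steps above can be sketched as follows, leaving out the download step so the example runs offline. The names `CANDIDATE` and `find_ip_ports` are made up for illustration, and the sample text is the one used in another answer on this page.

```python
import re

# Loose pattern: four 1-3 digit groups, a colon, then a 1-6 digit port.
CANDIDATE = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d{1,6})\b")

def find_ip_ports(text):
    """Return 'a.b.c.d:port' strings whose four octets are all in 0..255."""
    found = []
    for m in CANDIDATE.finditer(text):
        octets = [int(g) for g in m.groups()[:4]]
        # Numeric validation is done in Python instead of inside the regex.
        if all(0 <= o <= 255 for o in octets):
            found.append(m.group(0))
    return found

sample = " 1.1.1.1:123 2.2.2.2:321 312.123.1.12:123 "
print(find_ip_ports(sample))  # ['1.1.1.1:123', '2.2.2.2:321']
```

Keeping the regex loose and checking the numbers afterwards trades a longer pattern for a few lines of plain Python, which some readers find easier to maintain.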

Not to turn this into a who's-a-better-regex-author-war but...
(\d{1,3}\.){3}\d{1,3}\:\d{1,6}

Try:
re.compile(r"\d?\d?\d\.\d?\d?\d\.\d?\d?\d\.\d?\d?\d:\d+").findall(urllib2.urlopen(url).read())

In action:
\b(?:                 # A.B.C in A.B.C.D:port
    (?:
        25[0-5]
      | 2[0-4][0-9]
      | 1[0-9][0-9]
      | [1-9]?[0-9]
    )\.
){3}
(?:                   # D in A.B.C.D:port
    25[0-5]
  | 2[0-4][0-9]
  | 1[0-9][0-9]
  | [1-9]?[0-9]
)
:[1-9]\d{0,5}         # port number any number in (0,999999]
\b
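A pattern laid out with whitespace and comments like this one needs the re.VERBOSE flag when compiled in Python. The following is a minimal sketch of that, with the name `IP_PORT` and the sample text chosen for the example; the pattern is the same as above, written more compactly.

```python
import re

# re.VERBOSE makes the engine ignore whitespace and "# ..." comments in the pattern.
IP_PORT = re.compile(r"""
    \b(?:                 # A.B.C. in A.B.C.D:port
        (?: 25[0-5] | 2[0-4][0-9] | 1[0-9][0-9] | [1-9]?[0-9] )\.
    ){3}
    (?: 25[0-5] | 2[0-4][0-9] | 1[0-9][0-9] | [1-9]?[0-9] )   # D
    :[1-9]\d{0,5}         # port: 1 to 6 digits, no leading zero
    \b
""", re.VERBOSE)

sample = " 1.1.1.1:123 2.2.2.2:321 312.123.1.12:123 "
print(IP_PORT.findall(sample))  # ['1.1.1.1:123', '2.2.2.2:321']
```

Because every group is non-capturing, findall returns the whole matched address rather than pieces of it.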
