Splitting a string similar to ip addresses using regex in Python - python

I want to have a regular expression which will split on seeing a '.'(dot)
For example:
Input: '1.2.3.4.5.6'
Output : ['1', '2', '3', '4', '5', '6']
What I have tried:-
>>> import re
>>> pattern = r'(\d+)(\.(\d+))+'
>>> test = '192.168.7.6'
>>> re.findall(pattern, test)
What I get:-
[('192', '.6', '6')]
What I expect from re.findall():-
[('192', '168', '7', '6')]
Could you please help in pointing what is wrong?
My thinking -
In pattern = '(\d+)(\.(\d+))+', the initial (\d+) will find the first number, i.e. 192. Then (\.(\d+))+ will find one or more occurrences of the form '.<number>', i.e. .168, .7 and .6.
[EDIT:]
This is a simplified version of the problem I am solving.
In reality, the input can be-
192.168 dot 7 {dot} 6
and expected output is still [('192', '168', '7', '6')].
Once I figure out the solution to extract .168, .7, .6 like patterns, I can then extend it to dot 168, {dot} 7 like patterns.

Since you only need to find the numbers, the regex \d+ should be enough to find numbers separated by any other token/separator:
re.findall(r"\d+", test)
This should work on any of those cases:
>>> re.findall(r"\d+", "192.168.7.6")
['192', '168', '7', '6']
>>> re.findall(r"\d+", "192.168 dot 7 {dot} 6 | 125 ; 1")
['192', '168', '7', '6', '125', '1']
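As for why the original pattern returns only the last number: a capturing group that is repeated keeps just its final iteration, so (\.(\d+))+ ends up holding '.6' and '6'. The sketch below shows that behaviour, and also shows re.split on an explicit list of separators as an alternative to \d+. The separators pattern is an assumption based on the three forms mentioned in the question (., dot, {dot}).

```python
import re

test = "192.168.7.6"

# A repeated capturing group only remembers its last iteration,
# so groups 2 and 3 hold '.6' and '6' rather than every number.
m = re.match(r"(\d+)(\.(\d+))+", test)
print(m.groups())  # ('192', '.6', '6')

# Alternative: split on the separators themselves.
# Assumed separator forms: '.', 'dot' and '{dot}', with optional whitespace.
separators = r"\s*(?:\.|\{dot\}|dot)\s*"
parts = re.split(separators, "192.168 dot 7 {dot} 6")
print(parts)  # ['192', '168', '7', '6']
```

Splitting is stricter than \d+: it only treats the listed tokens as separators, which may or may not be what you want.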

Related

How to get all the substrings in string using Regex in Python

I have a string such as: "12345"
using the regex, how to get all of its substrings that consist of one up to three consecutive characters to get an output such as:
'1', '2', '3', '4', '5', '12', '23', '34', '45', '123', '234', '345'
You can use re.findall with a positive lookahead that captures size characters at each position, iterating size from 1 to 3. Because the lookahead consumes no input, overlapping substrings are found as well:
[match for size in range(1, 4) for match in re.findall('(?=(.{%d}))' % size, s)]
However, it would be more efficient to use a list comprehension with nested for clauses to iterate through all the sizes and starting indices:
[s[start:start + size] for size in range(1, 4) for start in range(len(s) - size + 1)]
Given s = '12345', both of the above would return:
['1', '2', '3', '4', '5', '12', '23', '34', '45', '123', '234', '345']
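To see why the lookahead is needed, compare it with a plain pattern, which consumes the characters it matches and therefore skips overlapping substrings. This is a small illustrative sketch for size 2, followed by a check that both approaches above agree.

```python
import re

s = '12345'

# A plain pattern consumes what it matches, so overlaps are lost.
plain = re.findall(r'.{2}', s)
print(plain)  # ['12', '34']

# A lookahead with a capturing group matches at every position.
overlapping = re.findall(r'(?=(.{2}))', s)
print(overlapping)  # ['12', '23', '34', '45']

# Both approaches from the answer produce the same list.
by_regex = [m for size in range(1, 4) for m in re.findall('(?=(.{%d}))' % size, s)]
by_slicing = [s[i:i + size] for size in range(1, 4) for i in range(len(s) - size + 1)]
print(by_regex == by_slicing)  # True
```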

How can I use regex to match only one character in Python?

I am trying to process a list of files
file_list = ['.DS_Store', '9', '7', '6', '8', '01', '4', '3', '2', '5']
the goal is to find the files whose name has only one character.
I tried this code
r = re.compile('[0-9]')
result_list = list(filter(r.match, file_list))
result_list
and got
['9', '7', '6', '8', '01', '4', '3', '2', '5']
where '01' should not be included.
I made a workaround
tmp = []
for i in file_list:
    if len(i)==1:
        tmp.append(i)
tmp
and I got
['9', '7', '6', '8', '4', '3', '2', '5']
This is exactly what I want, although the method is ugly.
How can I use regex in Python to finish the task?
r = re.compile('^[0-9]$')
The ^ matches the beginning of a line and $ matches the end.
And if you really want it to match any character, not just numbers, it should be
r = re.compile('^.$')
The . in the regex is a single-character wildcard.
Match a string if it's simply any single character appearing at the beginning of the string (^.) right before the end of the string ($):
^.$
Your Python then becomes:
r = re.compile('^.$')
result_list = list(filter(r.match, file_list))
Your workaround is equivalent to
[ i for i in file_list if len(i)==1]
This works for any file name that consists of exactly one character, whatever that character is.
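Another option, sketched here, is re.fullmatch (available since Python 3.4), which requires the whole string to match so no anchors are needed. It also avoids one subtlety of $, which matches just before a trailing newline. The variable names are illustrative.

```python
import re

file_list = ['.DS_Store', '9', '7', '6', '8', '01', '4', '3', '2', '5']

# fullmatch succeeds only if the entire string matches the pattern.
single_digit = re.compile(r'[0-9]')
result_list = [name for name in file_list if single_digit.fullmatch(name)]
print(result_list)  # ['9', '7', '6', '8', '4', '3', '2', '5']

# '$' also matches just before a trailing newline; fullmatch does not.
with_anchor = re.match(r'^.$', 'a\n')
with_fullmatch = re.fullmatch(r'.', 'a\n')
print(with_anchor is not None, with_fullmatch is not None)  # True False
```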

Retrieve exactly 1 digit using regular expression in python

I want to print only ages that are less than 10. In this string, only the value 1 should be printed. Somehow, that is not happening.
I used the following code (using regular expressions in Python):
import re
# This is my string
s5 = ("The baby is 1 year old, Sri is 45 years old, Ann is 50 years old; "
      "their father, Sumo is 78 years old and their grandfather, Kris, is 100 years "
      "old")
# print all the single digits from the string
re.findall('[0-9]{1}', s5)
# Out[153]: ['1', '4', '5', '5', '0', '7', '8', '1', '0', '0']
re.findall('\d{1,1}', s5)
# Out[154]: ['1', '4', '5', '5', '0', '7', '8', '1', '0', '0']
re.findall('\d{1}', s5)
# Out[155]: ['1', '4', '5', '5', '0', '7', '8', '1', '0', '0']
The output should be 1 and not all the digits as displayed above.
What am I doing wrong?
You are trying to match "any 1 digit", but you want to match "any 1 digit that is not followed or preceded by another digit".
One way to do that is to use lookarounds:
re.findall(r'(?<![0-9])[0-9](?![0-9])', s5)
Possible lookarounds:
(?<!R)S // negative lookbehind: match S that is not preceded by R
(?<=R)S // positive lookbehind: match S that is preceded by R
(?!R)S // negative lookahead: match S that is not followed by R
(?=R)S // positive lookahead: match S that is followed by R
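Maybe a simpler solution is to use a capturing group (). If the regex passed to findall has one capturing group, it returns a list of the matches within the group instead of the whole matches. Note that this pattern requires a non-digit character on both sides, so it misses a digit at the very start or end of the string:
re.findall(r'[^0-9]([0-9])[^0-9]', s5)
Also note that you can replace any 0-9 with \d, the character class for digits.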
Try this:
k = re.findall(r'(?<!\S)\d(?!\S)', s5)
print(k)
This also works:
re.findall(r'(?<!\S)\d(?![^\s.,?!])', s5)
import re
s = "The baby is 1 year old, Sri is 45 years old, Ann is 50 years old; their father, Sumo is 78 years old and their grandfather, Kris, is 100 years old"
m = re.findall(r'\d+', s)
for i in m:
    if int(i) < 10:
        print(i)
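The approaches above can be compared side by side. The sketch below uses the digit-only lookarounds and the numeric filter on the question's string, then shows the edge case where the capturing-group pattern misses a digit at the start of the string. The edge string is a made-up example.

```python
import re

s5 = ("The baby is 1 year old, Sri is 45 years old, Ann is 50 years old; "
      "their father, Sumo is 78 years old and their grandfather, Kris, is 100 years old")

# Lookarounds: a digit with no digit directly before or after it.
lone_digits = re.findall(r'(?<!\d)\d(?!\d)', s5)
print(lone_digits)  # ['1']

# Numeric filter: extract whole numbers, then compare as integers.
under_ten = [int(n) for n in re.findall(r'\d+', s5) if int(n) < 10]
print(under_ten)  # [1]

# Edge case (hypothetical input): the digit is the first character.
edge = "7 is the age"
by_group = re.findall(r'[^0-9]([0-9])[^0-9]', edge)
by_lookaround = re.findall(r'(?<!\d)\d(?!\d)', edge)
print(by_group, by_lookaround)  # [] ['7']
```

The numeric filter is the easiest to adapt if the threshold changes, since the comparison happens on integers rather than in the pattern.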

Python regular expression retrieving numbers between two different delimiters

I have the following string
"h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"
I would like to use regular expressions to extract the groups:
group1 56,7,1
group2 88,9,1
group3 58,8,1
group4 45
group5 100
group6 null
My ultimate goal is to have tuples such as (group1, group2), (group3, group4), (group5, group6). I am not sure if this all can be accomplished with regular expressions.
I have the following regular expression, which gives me partial results
(?<=h=|d=)(.*?)(?=h=|d=)
The matches have an extra comma at the end like 56,7,1, which I would like to remove and d=, is not returning a null.
You likely do not need regex here. A list comprehension and .split() can do what you need:
Code:
def split_it(a_string):
    if not a_string.endswith(','):
        a_string += ','
    return [x.split(',')[:-1] for x in a_string.split('=') if len(x)][1:]
Test Code:
tests = (
    "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,",
    "h=56,7,1,d=88,9,1,d=,h=58,8,1,d=45,h=100",
)
for test in tests:
    print(split_it(test))
Results:
[['56', '7', '1'], ['88', '9', '1'], ['58', '8', '1'], ['45'], ['100'], ['']]
[['56', '7', '1'], ['88', '9', '1'], [''], ['58', '8', '1'], ['45'], ['100']]
You could match rather than split using the expression
[dh]=([\d,]*),
and grab the first group, see a demo on regex101.com.
That is
[dh]= # d or h, followed by =
([\d,]*) # capture digits and commas, 0+ times
, # require a comma afterwards
In Python:
import re
rx = re.compile(r'[dh]=([\d,]*),')
string = "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"
numbers = [m.group(1) for m in rx.finditer(string)]
print(numbers)
Which yields
['56,7,1', '88,9,1', '58,8,1', '45', '100', '']
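To reach the tuples the question asks for, the matched values can be paired up afterwards. This is a minimal sketch that assumes the input strictly alternates h= and d= entries, as in the sample string; the names values and pairs are illustrative.

```python
import re

string = "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"

# With one capturing group, findall returns the group contents directly.
values = re.findall(r'[dh]=([\d,]*),', string)

# Represent an empty value as None (the "null" from the question).
values = [v or None for v in values]

# Pair consecutive entries, assuming h and d strictly alternate.
pairs = list(zip(values[0::2], values[1::2]))
print(pairs)
# [('56,7,1', '88,9,1'), ('58,8,1', '45'), ('100', None)]
```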
You can use ([a-z]=)([0-9,]+)(,)? and then select the group you need by its index.
You could use $ in positive lookahead to match against the end of the string:
import re
input_str = "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"
groups = []
for x in re.findall(r'(?<=h=|d=)(.*?)(?=d=|h=|$)', input_str):
    m = x.strip(',')
    if m:
        groups.append(m.split(','))
    else:
        groups.append(None)
print(groups)
Output:
[['56', '7', '1'], ['88', '9', '1'], ['58', '8', '1'], ['45'], ['100'], None]
Here, I have assumed that the parameters only have numerical values. If so, you can try this:
(?<=h=|d=)([0-9,]*)
Hope it helps.

How to remove empty separators from read files in Python?

Here is my input file sample (z.txt)
>qrst
ABCDE-- 6 6 35 25 10
>qqqq
ABBDE-- 7 7 28 29 2
I store the alpha and numeric in separate lists. Here is the output of numerics list
#Output : ['', '6', '', '6', '35', '25', '10']
['', '7', '', '7', '28', '29', '', '2']
The output has an extra space when there are single digits because of the way the file has been created. Is there any way to get rid of the '' (empty strings)?
You can take advantage of filter with None as function for that:
numbers = ['', '7', '', '7', '28', '29', '', '2']
numbers = list(filter(None, numbers))  # filter() returns an iterator in Python 3
print(numbers)
See it in action here: https://eval.in/640707
If your input looks like this:
>>> li=[' 6  6  35  25  10', ' 7 7 28  29 2']
Just use .split() which will handle the repeated whitespace as a single delimiter:
>>> [e.split() for e in li]
[['6', '6', '35', '25', '10'], ['7', '7', '28', '29', '2']]
vs .split(" "):
>>> [e.split(" ") for e in li]
[['', '6', '', '6', '', '35', '', '25', '', '10'], ['', '7', '7', '28', '', '29', '2']]
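Applied to a whole data line, .split() with no argument separates the sequence from the numbers in one step. This is a small sketch; the exact spacing in the sample line is an assumption, since only the amount of whitespace differs from the input file.

```python
# Hypothetical data line with irregular spacing between the fields.
line = "ABCDE--  6  6  35  25  10"

# split() with no argument treats any run of whitespace as one delimiter,
# so no empty strings are produced.
sequence, *fields = line.split()
numbers = [int(field) for field in fields]

print(sequence)  # ABCDE--
print(numbers)   # [6, 6, 35, 25, 10]
```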
I guess there are many ways to do this. I prefer using regular expressions, although this might be slower if you have a large input file with tens of thousands of lines. For smaller files, it's okay.
A few points:
Use a context manager (the with statement) to open files. When the with statement ends, the file is closed automatically.
An alternative to re.findall() is re.match() or re.search(). The subsequent code will be slightly different.
If org, sequence and numbers are related element-wise, I suggest you maintain a list of 3-element tuples instead. Of course, you have to buffer the org field and add it to the list of tuples when the next line is read.
import re
org = []
sequence = []
numbers = []
with open('ddd', 'r') as f:
    for line in f:
        line = line.strip()
        if re.search(r'^>', line):
            org.append(line)  # header line such as '>qrst'
        else:
            # The line is already stripped, so capture everything after the
            # sequence; a trailing \s+ here would cut off the last number.
            m = re.findall(r'^([A-Z]+--)\s+(.*)$', line)
            if m:
                sequence.append(m[0][0])
                numbers.append(list(map(int, m[0][1].split())))  # convert from str to int
print(org, sequence, numbers)
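The suggestion to keep 3-element tuples could look roughly like this. It is a sketch only: it reads from an in-memory io.StringIO sample instead of a real file, the spacing in the sample is assumed, and the names records and pending_org are illustrative.

```python
import io
import re

# In-memory stand-in for the input file (spacing is an assumption).
sample = io.StringIO(
    ">qrst\n"
    "ABCDE--  6  6  35  25  10\n"
    ">qqqq\n"
    "ABBDE--  7  7  28  29   2\n"
)

records = []        # list of (org, sequence, numbers) tuples
pending_org = None  # header buffered until its data line arrives

for line in sample:
    line = line.strip()
    if line.startswith('>'):
        pending_org = line
    elif pending_org is not None:
        m = re.match(r'([A-Z]+--)\s+(.*)', line)
        if m:
            values = [int(n) for n in m.group(2).split()]
            records.append((pending_org, m.group(1), values))
            pending_org = None

print(records)
# [('>qrst', 'ABCDE--', [6, 6, 35, 25, 10]), ('>qqqq', 'ABBDE--', [7, 7, 28, 29, 2])]
```

Keeping each record in one tuple means the three fields cannot drift out of step if a line fails to match.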
