regex number of repetitions from code - python

Can you use values from script to inform regexs dynamically how to operate?
For example:
base_pattern = r'\s*(([\d.\w]+)[ \h]+)'
n_rep = random.randint(1, 9)
new_pattern = base_pattern + '{n_rep}'
line_matches = re.findall(new_pattern, some_text)
I keep getting problems with trying to get the grouping to work
Explanation
I am attempting to find the most common number of repetitions of a regex pattern in a text file in order to find table type data within files.
I have the idea to make a regex such as this:
base_pattern = r'\s*(([\d.\w]+)[ \h]+)'
line_matches = np.array([re.findallbase_pattern, line) for line_num, line in enumerate(some_text.split("\n"))])
# Find where the text has similar number of words/data in each line
where_same_pattern= np.where(np.diff([len(x) for x in line_matches])==0)
line_matches_where_same = line_matches[where_same_pattern]
# Extract out just the lines which have data
interesting_lines = np.array([x for x in line_matches_where_same if x != []])
# Find how many words in each line of interest
len_of_lines = [len(l) for l in interesting_lines]
# Use the most prevalent as the most likely number of columns of data
n_cols = Counter(len_of_lines).most_common()[0][0]
# Rerun the data through a regex to find the columns
new_pattern = base_pattern + '{n_cols}'
line_matches = np.array([re.findall(new_pattern, line) for line_num, line in enumerate(some_text.split("\n"))])

you need to use the value of the variable, not a string literal with the name of the variable, e.g.:
new_pattern = base_pattern + '{' + n_cols + '}'

Your pattern is just a string. So, all you need is to convert your number into a string. You can use format (for example, https://infohost.nmt.edu/tcc/help/pubs/python/web/new-str-format.html) to do that:
base_pattern = r'\s*(([\d.\w]+)[ \h]+)'
n_rep = random.randint(1, 9)
new_pattern = base_pattern + '{{{0}}}'.format(n_rep)
print new_pattern ## '\\s*(([\\d.\\w]+)[ \\h]+){6}'
Note that the two first and the two last curly braces are creating the curly braces in the new pattern, while {0} is being replaced by the number n_rep

Related

Replacing String Text That Contains Double Quotes

I have a number series contained in a string, and I want to remove everything but the number series. But the double quotes are giving me errors. Here are examples of the strings and a sample command that I have used. All I want is 127.60-02-15, 127.60-02-16, etc.
<span id="lblTaxMapNum">127.60-02-15</span>
<span id="lblTaxMapNum">127.60-02-16</span>
I have tried all sorts of methods (e.g., triple double quotes, single quotes, quotes with backslashes, etc.). Here is one inelegant way that still isn't working because it's still leaving ">:
text = text.replace("<span id=", "")
text = text.replace("\"lblTaxMapNum\"", "")
text = text.replace("</span>", "")
Here is what I am working with (more specific code). I'm retrieving the data from an CSV and just trying to clean it up.
text = open("outputA.csv", "r")
text = ''.join([i for i in text])
text = text.replace("<span id=", "")
text = text.replace("\"lblTaxMapNum\"", "")
text = text.replace("</span>", "")
outputB = open("outputB.csv", "w")
outputB.writelines(text)
outputB.close()
If you add a > in the second replace it is still not elegant but it works:
text = text.replace("<span id=", "")
text = text.replace("\"lblTaxMapNum\">", "")
text = text.replace("</span>", "")
Alternatively, you could use a regex:
import re
text = "<span id=\"lblTaxMapNum\">127.60-02-16</span>"
pattern = r".*>(\d*.\d*-\d*-\d*)\D*" # the pattern in the brackets matches the number
match = re.search(pattern, text) # this searches for the pattern in the text
print(match.group(1)) # this prints out only the number
You can use beatifulsoup.
from bs4 import BeautifulSoup
strings = ['<span id="lblTaxMapNum">127.60-02-15</span>', '<span id="lblTaxMapNum">127.60-02-16</span>']
# Use BeautifulSoup to extract the text from the <span> tags
for string in strings:
soup = BeautifulSoup(string, 'html.parser')
number_series = soup.span.text
print(number_series)
output:
127.60-02-15
127.60-02-16
it's a little bit long , hope my documents are readable
with open(r'c:\users\GH\desktop\test.csv' , 'r') as f:
text = f.read().strip()
stRange = '<' # we will gonna remove the dump txt from our file by using (range
index) method
endRange = '>' # which means removing all extra literals between <>
text = list(text)
# casting our data to a list to be able to modify our data by reffering to its
components by index number
i = 0
length = len(text)
# we're gonna manipulate our text while we are iterating upon it
# so we have to declare a variable to be able to change it while iterating
while i < length:
if text[i] == stRange:
stRange = text.index(text[i])
elif text[i] != endRange and text[i] != stRange:
i += 1
continue
elif text[i] == endRange:
endRange = text.index(text[i]) # an integer to be used as rangeIndex
i = 0
del text[stRange : endRange + 1] # deleting the extra unwanted
characters
length = len(text) # getting the new length of our data
stRange = '<' # and again , assigning the specific characters to their
variables
endRange = '>'
i += 1
else:
result = str()
for l in text:
result += l
else:
with open(path , 'w') as f:
f.write(result)
with open(path , 'r') as f:
print('the result ==>')
print(f.read())

Finding an exact word in list

I am learning Python and am struggling with fining an exact word in each string in a list of strings.
Apologies if this is an already asked question for this situation.
This is what my code looks like so far:
with open('text.txt') as f:
lines = f.readlines()
lines = [line.rstrip('\n') for line in open('text.txt')]
keyword = input("Enter a keyword: ")
matching = [x for x in lines if keyword.lower() in x.lower()]
match_count = len(matching)
print('\nNumber of matches: ', match_count, '\n')
print(*matching, sep='\n')
Right now, matching will return all strings containing the word, not strings contating the exact word. For example, if I enter in 'local' as the keyword, strings with 'locally' and 'localized' in addition to 'local' will be returned when I only want just instances of 'local' returned.
I have tried:
match_test = re.compile(r"\b" + keyword+ r"\b")
match_test = ('\b' + keyword + '\b')
match_test = re.compile('?:^|\s|$){0}'.format(keyword))
matching = [x for x in lines if keyword.lower() == x.lower()]
matching = [x for x in lines if keyword.lower() == x.lower().strip()]
And none of them shave worked, so I'm a bit stuck.
How do I take the keyword entered from the user, and then return all strings in a list that contain that exact keyword?
Thanks
in means contained in, 'abc' in 'abcd' is True. For exact match use ==
matching = [x for x in lines if keyword.lower() == x.lower()]
You might need to remove spaces\new lines as well
matching = [x for x in lines if keyword.lower().strip() == x.lower().strip()]
Edit:
To find a line containing the keyword you can use loops
matches = []
for line in lines:
for string in line.split(' '):
if string.lower().strip() == keyword.lower().strip():
matches.append(line)
This method avoids having to read the whole file into memory. It also deals with cases like "LocaL" or "LOCAL" assuming you want to capture all such variants. There is a bit of performance overhead on making the temp string each time the line is read, however:
import re
reader(filename, target):
#this regexp matches a word at the front, end or in the middle of a line stripped
#of all punctuation and other non-alpha, non-whitespace characters:
regexp = re.compile(r'(^| )' + target.lower() + r'($| )')
with open(filename) as fin:
matching = []
#read lines one at at time:
for line in fin:
line = line.rstrip('\n')
#generates a line of lowercase and whitespace to test against
temp = ''.join([x.lower() for x in line if x.isalpha() or x == ' '])
print(temp)
if regexp.search(temp):
matching.append(line) #store unaltered line
return matching
Given the following tests:
locally local! localized
locally locale nonlocal localized
the magic word is Local.
Localized or nonlocal or LOCAL
This is returned:
['locally local! localized',
'the magic word is Local.',
'Localized or nonlocal or LOCAL']
Please find my solution which should match only local among following mentioned text in text file . I used search regular expression to find the instance which has only 'local' in string and other strings containing local will not be searched for .
Variables which were provided in text file :
local
localized
locally
local
local diwakar
local
local##!
Code to find only instances of 'local' in text file :
import os
import sys
import time
import re
with open('C:/path_to_file.txt') as f:
for line in f:
a = re.search(r'local\W$', line)
if a:
print(line)
Output
local
local
local
Let me know if this is what you were looking for
Your first test seems to be on the right track
Using input:
import re
lines = [
'local student',
'i live locally',
'keyboard localization',
'what if local was in middle',
'end with local',
]
keyword = 'local'
Try this:
pattern = re.compile(r'.*\b{}\b'.format(keyword.lower()))
matching = [x for x in lines if pattern.match(x.lower())]
print(matching)
Output:
['local student', 'what if local was in middle', 'end with local']
pattern.match will return the first instance of the regex matching or None. Using this as your if condition will filter for strings that match the whole keyword in some place. This works because \b matches the begining/ending of words. The .* works to capture any characters that may occur at the start of the line before your keyword shows up.
For more info about using Python's re, checkout the docs here: https://docs.python.org/3.8/library/re.html
You can try
pattern = re.compile(r"\b{}\b".format(keyword))
match_test = pattern.search(line)
like shown in
Python - Concat two raw strings with an user name

split string every nth character and append ':'

I've read some switch MAC address table into a file and for some reason the MAC address if formatted as such:
'aabb.eeff.hhii'
This is not what a MAC address should be, it should follow: 'aa:bb:cc:dd:ee:ff'
I've had a look at the top rated suggestions while writing this and found an answer that may fit my needs but it doesn't work
satomacoto's answer
The MACs are in a list, so when I run for loop I can see them all as such:
Current Output
['8424.aa21.4er9','fa2']
['94f1.3002.c43a','fa1']
I just want to append ':' at every 2nd nth character, I can just remove the '.' with a simple replace so don't worry about that
Desired output
['84:24:aa:21:4e:r9','fa2']
['94:f1:30:02:c4:3a','fa1']
My code
info = []
newinfo = []
file = open('switchoutput')
newfile = file.read().split('switch')
macaddtable = newfile[3].split('\\r')
for x in macaddtable:
if '\\n' in x:
x = x.replace('\\n', '')
if carriage in x:
x = x.replace(carriage, '')
if '_#' in x:
x = x.replace('_#', '')
x.split('/r')
info.append(x)
for x in info:
if "Dynamic" in x:
x = x.replace('Dynamic', '')
if 'SVL' in x:
x = x.replace('SVL', '')
newinfo.append(x.split(' '))
for x in newinfo:
for x in x[:1]:
if '.' in x:
x = x.replace('.', '')
print(x)
Borrowing from the solution that you linked, you can achieve this as follows:
macs = [['8424.aa21.4er9','fa2'], ['94f1.3002.c43a','fa1']]
macs_fixed = [(":".join(map(''.join, zip(*[iter(m[0].replace(".", ""))]*2))), m[1]) for m in macs]
Which yields:
[('84:24:aa:21:4e:r9', 'fa2'), ('94:f1:30:02:c4:3a', 'fa1')]
If you like regular expressions:
import re
dotted = '1234.3456.5678'
re.sub('(..)\.?(?!$)', '\\1:', dotted)
# '12:34:34:56:56:78'
The template string looks for two arbitrary characters '(..)' and assigns them to group 1. It then allows for 0 or 1 dots to follow '\.?' and makes sure that at the very end there is no match '(?!$)'. Every match is then replaced with its group 1 plus a colon.
This uses the fact that re.sub operates on nonoverlapping matches.
x = '8424.aa21.4er9'.replace('.','')
print(':'.join(x[y:y+2] for y in range(0, len(x) - 1, 2)))
>> 84:24:aa:21:4e:r9
Just iterate through the string once you've cleaned it, and grab 2 string each time you loop through the string. Using range() third optional argument you can loop through every second elements. Using join() to add the : in between the two elements you are iterating.
You can use re module to achieve your desired output.
import re
s = '8424.aa21.4er9'
s = s.replace('.','')
groups = re.findall(r'([a-zA-Z0-9]{2})', s)
mac = ":".join(groups)
#'84:24:aa:21:4e:r9'
Regex Explanation
[a-zA-Z0-9]: Match any alphabets or number
{2}: Match at most 2 characters.
This way you can get groups of two and then join them on : to achieve your desired mac address format
wrong_mac = '8424.aa21.4er9'
correct_mac = ''.join(wrong_mac.split('.'))
correct_mac = ':'.join(correct_mac[i:i+2] for i in range(0, len(correct_mac), 2))
print(correct_mac)

Searching text file for string in python

I'm using Python to search a large text file for a certain string, below the string is the data that I am interested in performing data analysis on.
def my_function(filename, variable2, variable3, variable4):
array1 = []
with open(filename) as a:
special_string = str('info %d info =*' %variable3)
for line in a:
if special_string == array1:
array1 = [next(a) for i in range(9)]
line = next(a)
break
elif special_string != c:
c = line.strip()
In the special_string variable, whatever comes after info = can vary, so I am trying to put a wildcard operator as seen above. The only way I can get the function to run though is if I put in the exact string I want to search for, including everything after the equals sign as follows:
special_string = str('info %d info = more_stuff' %variable3)
How can I assign a wildcard operator to the rest of the string to make my function more robust?
If your special string always occurs at the start of a line, then you can use the below check (where special_string does not have the * at the end):
line.startswith(special_string)
Otherwise, please do look at the module re in the standard library for working with regular expressions.
Have you thought about using something like this?
Based on your input, I'm assuming the following:
variable3 = 100000
special_string = str('info %d info = more_stuff' %variable3)
import re
pattern = re.compile('(info\s*\d+\s*info\s=)(.*)')
output = pattern.findall(special_string)
print(output[0][1])
Which would return:
more_stuff

Can't get rid of hex characters

This program makes an array of verbs which come from a text file.
file = open("Verbs.txt", "r")
data = str(file.read())
table = eval(data)
num_table = len(table)
new_table = []
for x in range(0, num_table):
newstr = table[x].replace(")", "")
split = newstr.rsplit("(")
numx = len(split)
for y in range(0, numx):
split[y] = split[y].split(",", 1)[0]
new_table.append(split[y])
num_new_table = len(new_table)
for z in range(0, num_new_table):
print(new_table[z])
However the text itself contains hex characters such as in
('a\\xc4\\x9fr\\xc4\\xb1[Verb]+[Pos]+[Imp]+[A2sg]', ':', 17.6044921875)('A\\xc4\\x9fr\\xc4\\xb1[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]', ':', 11.5615234375)
I'm trying to get rid of those. How am supposed to do that?
I've looked up pretty much everywhere and decode() returns an error (even after importing codecs).
You could use parse, a python module that allows you to search inside a string for regularly-formatted components, and, from the components returned, you could extract the corresponding integers, replacing them from the original string.
For example (untested alert!):
import parse
# Parse all hex-like items
list_of_findings = parse.findall("\\x{:w}", your_string)
# For each item
for hex_item in list_of_findings:
# Replace the item in the string
your_string = your_string.replace(
# Retrieve the value from the Parse Data Format
hex_item[0],
# Convert the value parsed to a normal hex string,
# then to int, then to string again
str(int("0x"+hex_item[0]))
)
Obs: instead of "int", you could convert the found hex-like values to characters, using chr, as in:
chr(hex_item[0])

Categories