Positive lookbehind, slice words by \t until \n - python

I am new to regex. I am attempting to use regex with Python to find a line in a file and extract all of the subsequent words separated by tab stops. My line looks like this:
#position 4450 4452 4455 4465 4476 4496 D110 D111 D112 D114 D116 D118 D23 D24 D27 D29 D30 D56 D59 D69 D85 D88 D90 D91 JW1 JW10 JW15 JW22 JW28 JW3 JW35 JW39 JW43 JW45 JW47 JW49 JW5 JW52 JW54 JW56 JW57 JW59 JW66 JW7 JW70 JW75 JW77 JW9 REF_OR74A
I have identified that the base of this expression involves the positive lookbehind.
(?<=#position).*
I do not expect this to separate the matches by tab stop. However, it does find my line in the file:
import re
file = open('src.txt','r')
f = list(file)
file.close()
pattern = '(?<=#position).*'
regex = re.compile(pattern)
regex.findall(''.join(f))
['\t4450\t4452\t4455\t4465\t4476\t4496\tD110\tD111\tD112\tD114\tD116\tD118\tD23\tD24\tD27\tD29\tD30\tD56\tD59\tD69\tD85\tD88\tD90\tD91\tJW1\tJW10\tJW15\tJW22\tJW28\tJW3\tJW35\tJW39\tJW43\tJW45\tJW47\tJW49\tJW5\tJW52\tJW54\tJW56\tJW57\tJW59\tJW66\tJW7\tJW70\tJW75\tJW77\tJW9\tREF_OR74A']
With some kludge and list slicing / string methods, I can manipulate this and get my data out. What I'd really like to do is have findall yield a list of just these entries. What would the regular expression look like to do that?

Do you need to use regex? List slicing and string methods don't appear to be as much of a kludge as you say.
something like:
f = open('src.txt','r')
for line in f:
    if line.startswith("#position"):
        l = line.split()  # with no arguments it splits on all whitespace characters
        l = l[1:]  # get rid of the "#position" tag
        break
and further manipulate from there?
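Alternatively, if you do want findall to hand back just the entries, a pattern like the following works (a sketch; it assumes the tokens themselves never contain tabs):
import re

with open('src.txt') as fh:
    text = fh.read()

# Isolate everything after "#position" on that line, then grab each
# run of characters that is neither a tab nor a newline.
m = re.search(r'(?<=#position).*', text)
entries = re.findall(r'[^\t\n]+', m.group(0)) if m else []
print(entries)  # ['4450', '4452', ..., 'REF_OR74A']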

Related

Create a list from searching a text file for [ ]

I'm trying to create a list from a text file that I am reading into Python. The text file contains a bunch of square brackets throughout the file [some text here]. What I am trying to do is first count how many of those square bracket pairings I have [] and then add whatever text is inside of them into a list.
Here is a super simplified version of the text file I am trying to use with the brackets:
"[name] is going to the store! It's going to be at [place] on [day-of-the-week]."
Here is what I have:
bracket_counter = 0
file_name = "example.txt"
readFile = open(file_name)
text_lines = readFile.readlines()
for line in text_lines:
    val = line.split('[', 1)[1].split(']')[0]
    print(val)
    if line has [ and ]:
        bracket_counter += 1
I'm super new to Python. I don't know if I should be using regular expressions for this or if that is overcomplicating things.
Thanks for your help!
You can of course use regular expressions for that. In the following example, you extract all content within square brackets and store it in the variable words.
words = re.findall(r'\[([^\]]+)\]', line)
Don't forget to import re at the top of your program.
Explanation of the regex:
\[ and \] match literal square brackets. As these have special meaning in regex, you have to escape them with a backslash
(...) is a capturing group; all regex parts within normal brackets will be returned as a finding (e.g. in the list words)
[^\]]+ will match all characters except the ]
To put it all together:
This regex looks for an opening square bracket, then matches all characters until a closing square bracket appears and returns the matches within a list.
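Put together as a small runnable script (a sketch; example.txt is the filename from the question):
import re

bracket_counter = 0
words = []

with open("example.txt") as read_file:
    for line in read_file:
        found = re.findall(r'\[([^\]]+)\]', line)
        bracket_counter += len(found)  # each match is one [...] pairing
        words.extend(found)

print(bracket_counter)  # 3 for the sample line
print(words)            # ['name', 'place', 'day-of-the-week']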
If there's a lonely soul out there like me who wondered how this could be done without using regex, I guess this could be a solution also:
results = []
with open("example.txt") as read:
    for line in read:
        start = None; end = None
        for i, v in enumerate(line):
            if v == "[": start = i
            if v == "]": end = i
            if start is not None and end is not None:
                results.append(line[start+1:end])
                start = None; end = None
print(results)
Result:
['name', 'place', 'day-of-the-week']
Python doesn't have a has keyword.
You're probably looking for something like:
('[' in line) or (']' in line)
which will evaluate to True if the line includes [ or ].
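With that fix, the counting loop from the question could look like this (a sketch; it uses and so that only lines containing a complete pair are counted, which seems closer to the intent):
bracket_counter = 0
with open("example.txt") as read_file:
    for line in read_file:
        if '[' in line and ']' in line:
            bracket_counter += 1  # counts lines with a pair, as the original loop did
print(bracket_counter)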

Parsing String by regular expression in python

How can I parse this string in python?
Input String:
someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data
to this
Output array:
['someplace','2018:6:18:0','25.0114','95.2818','2.71164','66.8962','Entire grid contents are set to missing data']
I have already tried split(' '), but since it is not clear how many spaces separate the sub-strings, and the last sub-string may itself contain spaces, this doesn't work.
I need the regular expression.
If you do not provide a sep character, Python's split(sep=None, maxsplit=-1) (see the docs) treats runs of consecutive whitespace as a single separator and splits on them. You can limit the number of splits by providing a maxsplit value:
data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
spl = data.split(None, 6)  # don't give a split char; do at most 6 splits
print(spl)
Output:
['someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164',
'66.8962', 'Entire grid contents are set to missing data']
This will work as long as the first text does not contain any whitespace.
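For instance (a hypothetical first field with a space, to illustrate the limitation):
data2 = "some place 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
print(data2.split(None, 6))
# ['some', 'place', '2018:6:18:0', '25.0114', '95.2818', '2.71164',
#  '66.8962 Entire grid contents are set to missing data']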
If the first text may contain whitespace, you can use/refine this regex solution:
import re
reg = re.findall(r"([^\d]+?) +?([\d:]+) +?([\d.]+) +?([\d.]+) +?([\d.]+) +?([\d.]+) +(.*)$",data)[0]
print(reg)
Output:
('someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164', '66.8962', 'Entire grid contents are set to missing data')
Use e.g. https://regex101.com to check the regex against your other data (follow the link; it applies the above regex to sample data).
[A-Z]{1}[a-zA-Z ]{15,45}|[\w|:|.]+
You can test it here https://pythex.org/
Modify 15,45 according to your needs.
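A usage sketch of that pattern (the {15,45} range bounds the length of the trailing sentence, which is the part to tune):
import re

data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
pattern = r"[A-Z]{1}[a-zA-Z ]{15,45}|[\w|:|.]+"
print(re.findall(pattern, data))
# ['someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164',
#  '66.8962', 'Entire grid contents are set to missing data']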
Maxsplit works with re.split(), too:
import re

text = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
re.split(r"\s+", text, maxsplit=6)
Out:
['someplace',
'2018:6:18:0',
'25.0114',
'95.2818',
'2.71164',
'66.8962',
'Entire grid contents are set to missing data']
EDIT:
If the first and last text parts don't contain digits, we don't need maxsplit and don't have to rely on a fixed number of parts or worry about consecutive spaces:
re.split(r"\s+(?=\d)|(?<=\d)\s+", s)
We cut the string where a space is followed by a digit or vice versa using lookahead and lookbehind.
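For example, with the question's sample line:
import re

s = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
print(re.split(r"\s+(?=\d)|(?<=\d)\s+", s))
# ['someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164',
#  '66.8962', 'Entire grid contents are set to missing data']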
It is hard to answer your question as the requirements are not very precise. I think I would split the line with the split() function and then join the items whose contents contain no numbers. Here is a snippet that works with your single sample:
def containsNumbers(s):
    return any(c.isdigit() for c in s)

data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
lst = data.split()
lst2 = []
i = 0
agg = ''
while i < len(lst):
    if containsNumbers(lst[i]):
        if agg != '':
            lst2.append(agg)
            agg = ''
        lst2.append(lst[i])
    else:
        agg += ' ' + lst[i]
        agg = agg.strip()
        if i == len(lst) - 1:
            lst2.append(agg)
    i += 1
print(lst2)
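For the sample line this prints ['someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164', '66.8962', 'Entire grid contents are set to missing data'].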

Dictionary replaces substrings Python 2.7

I want to replace numbers from a text file and write the result to a new text file. I tried to solve it with a dictionary, but now Python also replaces substrings.
For example: I want to replace the number 01489 with 1489, but with this code it also turns 014896 into 14896 - how can I get rid of this? Thank you!!!
replacements = {'01489':'1489', '01450':'1450'}
infile = open('test_fplan.txt', 'r')
outfile = open('test_fplan_neu.txt', 'w')
for line in infile:
    for src, target in replacements.iteritems():
        line = line.replace(src, target)
    outfile.write(line)
I don't know how your input file looks, but if the numbers are surrounded by spaces, this should work:
replacements = {' 01489 ':' 1489 ', ' 01450 ':' 1450 '}
It looks like your concern is that it's also modifying numbers that contain your src pattern as a substring. To avoid that, you'll need to first define the boundaries that should be respected. For instance, do you want to insist that only matched numbers surrounded by spaces get replaced? Or perhaps just that there be no adjacent digits (or periods or commas). Since you'll probably want to use regular expressions to constrain the matching, as suggested by JoshuaF in another answer, you'll likely need to avoid the simple replace function in favor of something from the re library.
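For instance, a minimal sketch of the "no adjacent digits" boundary, applied per key (the sample line here is made up):
import re

replacements = {'01489': '1489', '01450': '1450'}
line = 'x 01489 y 014896 z'  # hypothetical input line

for src, target in replacements.items():
    # (?<!\d) and (?!\d) forbid a digit immediately before/after the match,
    # so the 01489 inside 014896 is left alone.
    line = re.sub(r'(?<!\d)' + re.escape(src) + r'(?!\d)', target, line)

print(line)  # x 1489 y 014896 z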
Use regexp with negative lookarounds:
import re

replacements = {'01489': '1489', '01450': '1450'}

def find_replacement(match_obj):
    number = match_obj.group(0)
    return replacements.get(number, number)

with open('test_fplan.txt') as infile:
    with open('test_fplan_neu.txt', 'w') as outfile:
        outfile.writelines(
            re.sub(r'(?<!\d)(\d+)(?!\d)', find_replacement, line)
            for line in infile
        )
Check out the regular expression syntax https://docs.python.org/2/library/re.html. It should allow you to match whatever pattern you're looking for exactly.

How to substitute specific patterns in python

I want to replace all occurrences of integers which are greater than 2147483647 and are followed by ^^<int> by the first 3 digits of the numbers. For example, I have my original data as:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
I want to replace the original data by the below mentioned data:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "255"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
The way I have implemented it is by scanning the data line by line. If I find numbers greater than 2147483647, I replace them with the first 3 digits. However, I don't know how I should check that the next part of the string is ^^<int>.
What I want to do is: for numbers greater than 2147483647, e.g. 25500000000, I want to replace them with the first 3 digits of the number. Since my data is 1 terabyte in size, a faster solution is much appreciated.
Use the re module to construct a regular expression:
import re

regex = r"""
"(\d{10,})"    # a quoted run of at least 10 digits, captured into group #1
(\^\^<int>)    # followed by the literal ^^<int> marker (the circumflexes
               # must be escaped, since ^ is special in regex), group #2
"""
COMPILED_REGEX = re.compile(regex, re.VERBOSE)
Next, we need to define a callback function (since re.RegexObject.sub takes a callback) to handle the replacement:
def replace_callback(matches):
    full_line = matches.group(0)
    number_text = matches.group(1)
    number_of_interest = int(number_text, base=10)
    if number_of_interest > 2147483647:
        # str.replace needs strings, so replace the digit text, not the int
        return full_line.replace(number_text, number_text[:3])
    else:
        return full_line
And then find and replace:
fixed_data = COMPILED_REGEX.sub(replace_callback, YOUR_DATA)
If you have a terabyte of data you will probably not want to do this in memory - you'll want to open the file and then iterate over it, replacing the data line by line and writing it back out to another file (there are undoubtedly ways to speed this up, but they would make the gist of the technique harder to follow):
# Given the above
def process_data():
    with open("path/to/your/file") as data_file, \
         open("path/to/output/file", "w") as output_file:
        for line in data_file:
            fixed_data = COMPILED_REGEX.sub(replace_callback, line)
            output_file.write(fixed_data)
If each line in your text file looks like your example, then you can do this:
In [2078]: line = '"QuestionAndAnsweringWebsite" "fact". "Ask a Question" "25500000000"^^ . "language" "89028899"'
In [2079]: re.findall(r'\d+"\^\^', line)
Out[2079]: ['25500000000"^^']
import re

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        for found in re.findall(r'\d+"\^\^', line):
            number = found[:-3]  # strip the trailing "^^ to get just the digits
            if int(number) > 2147483647:
                # keep the "^^ marker intact, shortening only the digits
                line = line.replace(number + '"^^', number[:3] + '"^^')
        outfile.write(line)
Because of the inner for-loop, this has the potential to be an inefficient solution. However, I can't think of a better regex at the moment, so this should get you started, at the very least.
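A single-pass alternative is re.sub with a replacement callback, which avoids the inner loop (a sketch; it assumes the quoted number is always directly followed by ^^<int>, as in the sample):
import re

def shorten(match):
    number = match.group(1)
    if int(number) > 2147483647:
        return '"' + number[:3] + '"^^<int>'
    return match.group(0)  # leave smaller numbers untouched

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        outfile.write(re.sub(r'"(\d+)"\^\^<int>', shorten, line))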

Python - trying to capture the middle of a line, regex or split

I have a text file with some names and emails and other stuff. I want to capture email addresses.
I don't know whether this is a split or regex problem.
Here are some sample lines:
[name]bill billy [email]bill.billy@hotmail.com [dob]01.01.81
[name]mark hilly [email]mark.hilly@hotmail.com [dob]02.11.80
[name]gill silly [email]gill.silly@hotmail.com [dob]03.12.79
I want to be able to do a loop that prints all the email addresses.
Thanks.
I'd use a regex:
import re

data = '''[name]bill billy [email]bill.billy@hotmail.com [dob]01.01.81
[name]mark hilly [email]mark.hilly@hotmail.com [dob]02.11.80
[name]gill silly [email]gill.silly@hotmail.com [dob]03.12.79'''

group_matcher = re.compile(r'\[(.*?)\]([^\[]+)')
for line in data.split('\n'):
    o = dict(group_matcher.findall(line))
    print o['email']
\[ is literally [.
(.*?) is a non-greedy capturing group. It "expands" to capture the text.
\] is literally ]
( is the beginning of a capturing group.
[^\[] matches anything but a [.
+ repeats the last pattern any number of times.
) closes the capturing group.
for line in lines:
    print line.split("]")[2].split(" ")[0]
You can pass whole substrings to partition, not just single characters, so:
email = line.partition('[email]')[-1].partition('[')[0].rstrip()
This has an advantage over the simple split solutions that it will work on fields that can have spaces in the value, on lines that have things in a different order (even if they have [email] as the last field), etc.
To generalize it:
def get_field(line, field):
    return line.partition('[{}]'.format(field))[-1].partition('[')[0].rstrip()
However, I think it's still more complicated than the regex solution. Plus, it can only search for exactly one field at a time, instead of all fields at once (without making it even more complicated). To get two fields, you'll end up parsing each line twice, like this:
for line in data.splitlines():
    print '''{} "babysat" Dan O'Brien on {}'''.format(get_field(line, 'name'),
                                                      get_field(line, 'dob'))
(I may have misinterpreted the DOB field, of course.)
You can split by space and then search for the element that starts with [email]:
line = '[name]bill billy [email]bill.billy@hotmail.com [dob]01.01.81'
items = line.split()
for item in items:
    if item.startswith('[email]'):
        print item.replace('[email]', '', 1)
Say you have a file with lines:
import re

f = open("logfile", "r")
data = f.read()
for line in data.split("\n"):
    match = re.search(r"email\](?P<id>.*)\[dob", line)
    if match:
        # either store or print the emails as you like
        print match.group('id').strip(), "\n"
That's all (try it; for Python 3 and above, remember that print is a function, so make those changes)!
The output from your sample data:
bill.billy@hotmail.com
mark.hilly@hotmail.com
gill.silly@hotmail.com
