value not found and look behind errors - python

I am trying to print the text between two strings in a file, but I get errors.
Regex code:
import re
with open("in.txt") as f:
    lines = f.read()
m = re.findall(r'(?s)(?<=Credit\s*\b).*?(?=Amount)', lines)
for i in m:
    print i
(returns look behind not found)
Another attempt:
with open("in.txt") as f:
    lines = f.read()
cred_ind = (lines.index("Credit"))
am_ind = lines.index("Amount")
print(lines[cred_ind+6:am_ind])
(returns substring not found)
Text file:
....
accounts
Bank
Credit
good value
money
Amount
Amount
Output:
good value
money

Python's re module does not support variable-length lookbehind, and \s* inside (?<=...) makes the lookbehind variable-length.
A simple fix is to avoid lookaround assertions and use a capturing group instead:
m = re.findall(r'(?s)Credit\s*(.*?)Amount', lines)
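As a quick check, the capturing-group pattern can be run against the sample text from the question. This is only a minimal sketch: the string `sample` stands in for the contents of in.txt and is an assumption based on the excerpt shown in the question.

```python
import re

# Stand-in for the contents of in.txt (assumed from the excerpt in the question).
sample = "accounts\nBank\nCredit\ngood value\nmoney\nAmount\nAmount\n"

# Match "Credit", skip any whitespace, then lazily capture up to the first "Amount".
matches = re.findall(r'(?s)Credit\s*(.*?)Amount', sample)

for block in matches:
    # strip() drops the newline left in front of "Amount"
    print(block.strip())
```

If either marker is missing, findall simply returns an empty list, whereas str.index raises ValueError ("substring not found"), which is the second error reported in the question.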

Related

python open csv search for pattern and strip everything else

I got a csv file 'svclist.csv' which contains a single column list as follows:
pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1
pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs
I need to strip everything from each line except the PL5 directory and the two digits in the last path component,
so that the result looks like this:
PL5,00
PL5,01
I started the code as follows:
clean_data = []
with open('svclist.csv', 'rt') as f:
    for line in f:
        if line.__contains__('profile'):
            print(line, end='')
and I'm stuck here.
Thanks in advance for the help.
You can use the regular expression (PL5)[^/].{0,}([0-9]{2,2})
For an explanation, paste the regex into 'https://regexr.com'. It shows how the regex works, so you can make any changes you need.
import re
test_string_list = ['pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1',
                    'pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs']
regex = re.compile(r"(PL5)[^/].{0,}([0-9]{2,2})")
result = []
for test_string in test_string_list:
    matchArray = regex.findall(test_string)
    result.append(matchArray[0])
with open('outfile.txt', 'w') as f:
    for row in result:
        # each row is a tuple such as ('PL5', '00'); join it into "PL5,00"
        f.write(','.join(row) + '\n')
In the above code, I create an empty list to hold the tuples returned by findall and then write them to the file. Slicing with str(row)[1:-1] would remove the parentheses but keep the quotes around each element, so the tuple is joined with a comma instead.
This writes PL5,00 and PL5,01 to 'outfile.txt'.
You can use a regex for this (in general, this is a good option when you are trying to extract a pattern):
import re
# compile the pattern so that pattern.findall() is available
pattern = re.compile(r"pf=/usr/sap/PL5/SYS/profile/PL5_.*(\d{2})")
with open('svclist.csv', 'rt') as f:
    for line in f:
        if 'profile' in line:
            last_two_numbers = pattern.findall(line)[0]
            print(f'PL5,{last_two_numbers}')
This code goes over each line, checks whether "profile" is in the line (the same test as __contains__), then extracts the last pair of consecutive digits according to the pattern.
I made the assumption that the number is always between the two underscores. You could run something similar to this within your for-loop.
test_str = "pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1"
test_list = test_str.split("_")  # split the string at the underscores
output = test_list[1].strip(
    "abcdefghijklmnopqrstuvwxyz" + str.swapcase("abcdefghijklmnopqrstuvwxyz"))  # strip leading and trailing letters
try:
    int(output)  # raises ValueError if anything other than digits is left
    print(f"PL5,{output}")
except ValueError:
    print(f'Something went wrong! Output is PL5,{output}')
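The "number between the two underscores" assumption can also be written as a single regex. The sketch below is illustrative only: the pattern and the helper name sid_and_instance are hypothetical, and they assume every profile name looks like the two sample lines (a system ID, an underscore, some letters followed by two digits, another underscore).

```python
import re

# Hypothetical pattern: assumes the last path component looks like
# <SID>_<letters><two digits>_<host>, as in the two sample lines.
PROFILE_RE = re.compile(r"/profile/([^/_]+)_[A-Za-z]*(\d{2})_")

def sid_and_instance(line):
    """Return 'SID,NN' for a profile line, or None if it does not fit the assumed layout."""
    m = PROFILE_RE.search(line)
    return f"{m.group(1)},{m.group(2)}" if m else None

samples = [
    "pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1",
    "pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs",
]
results = [sid_and_instance(s) for s in samples]
print("\n".join(r for r in results if r))
```

Returning None for lines that do not fit makes it easy to skip or log unexpected entries instead of raising an exception halfway through the file.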

Iterating through a file and replacing strings, leaving the number of characters intact

I'm attempting to anonymize a file so that all the content except certain keywords is replaced with gibberish, but the format is kept the same (including punctuation, string length and capitalization). For example:
I am testing this, check it out! This is a keyword: long
Wow, another line.
should turn in to:
T ad ehistmg ptrs, erovj qo giw! Tgds ar o qpyeogf: long
Yeg, rmbjthe yadn.
I am attempting to do this in Python, but I'm having no luck finding a solution. I have tried replacing via tokenization and writing to another file, but without much success.
Initially let's disregard the fact that we have to preserve some keywords. We will fix that later.
The easiest way to perform this kind of 1-to-1 mapping is to use the method str.translate. The string module also contains constants that contain all ASCII lowercase and uppercase characters, and random.shuffle can be used to obtain a random permutation.
import string
import random
random_caps = list(string.ascii_uppercase)
random_lows = list(string.ascii_lowercase)
random.shuffle(random_caps)
random.shuffle(random_lows)
all_random_chars = ''.join(random_lows + random_caps)
translation_table = str.maketrans(string.ascii_letters, all_random_chars)
with open('the-file-i-want.txt', 'r') as f:
    contents = f.read()
translated_contents = contents.translate(translation_table)
with open('the-file-i-want.txt', 'w') as f:
    f.write(translated_contents)
In Python 2, maketrans is a function in the string module rather than a static method of str.
The translation_table is a mapping from characters to characters, so it maps every ASCII letter to a letter of the same case. The translate method simply applies this table to each character in the string.
Important note: the above method is actually reversible, because each letter is mapped to a unique letter. This means that a simple frequency analysis of the symbols is enough to reverse it.
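To illustrate why a single fixed table is reversible, here is a small sketch (lowercase letters only, with variable names chosen just for this illustration). Anyone who recovers the permutation, for example through frequency analysis, can build the inverse table and undo the translation:

```python
import random
import string

letters = string.ascii_lowercase
shuffled = list(letters)
random.shuffle(shuffled)
shuffled = ''.join(shuffled)

forward = str.maketrans(letters, shuffled)   # plain letter -> scrambled letter
backward = str.maketrans(shuffled, letters)  # scrambled letter -> plain letter

original = "wow, another line."
scrambled = original.translate(forward)
restored = scrambled.translate(backward)

print(scrambled)  # gibberish with the same length and punctuation
print(restored)   # the original text again
```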
If you want to make this harder or impossible, you could re-create the translation_table for every line:
import string
import random
random_caps = list(string.ascii_uppercase)
random_lows = list(string.ascii_lowercase)
with open('the-file-i-want.txt', 'r') as f:
    translated_lines = []
    for line in f:
        random.shuffle(random_lows)
        random.shuffle(random_caps)
        all_random_chars = ''.join(random_lows + random_caps)
        translation_table = str.maketrans(string.ascii_letters, all_random_chars)
        translated_lines.append(line.translate(translation_table))
with open('the-file-i-want.txt', 'w') as f:
    f.writelines(translated_lines)
Also note that you could translate and save the file line by line:
with open('the-file-i-want.txt', 'r') as f, open('output.txt', 'w') as o:
    for line in f:
        random.shuffle(random_lows)
        random.shuffle(random_caps)
        all_random_chars = ''.join(random_lows + random_caps)
        translation_table = str.maketrans(string.ascii_letters, all_random_chars)
        o.write(line.translate(translation_table))
This means you can translate huge files with this code, as long as the lines themselves are not insanely long.
The code above scrambles every letter, without taking the keywords into account.
The simplest way to handle that requirement is to check each line for occurrences of the keywords and "reinsert" them:
import re
import string
import random
random_caps = list(string.ascii_uppercase)
random_lows = list(string.ascii_lowercase)
keywords = ['long'] # add all the possible keywords in this list
keyword_regex = re.compile('|'.join(re.escape(word) for word in keywords))
with open('the-file-i-want.txt', 'r') as f, open('output.txt', 'w') as o:
    for line in f:
        random.shuffle(random_lows)
        random.shuffle(random_caps)
        all_random_chars = ''.join(random_lows + random_caps)
        translation_table = str.maketrans(string.ascii_letters, all_random_chars)
        matches = keyword_regex.finditer(line)
        translated_line = list(line.translate(translation_table))
        for match in matches:
            translated_line[match.start():match.end()] = match.group()
        o.write(''.join(translated_line))
Sample usage (using the version that preserves keywords):
$ echo 'I am testing this, check it out! This is a keyword: long
Wow, another line.' > the-file-i-want.txt
$ python3 trans.py
$ cat output.txt
M vy hoahitc hfia, ufoum ih pzh! Hfia ia v modjpel: long
Ltj, fstkwzb hdsz.
Note how the keyword long is preserved.
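One caveat: a plain alternation of keywords also matches inside longer words, so "long" would be left readable inside "along". Assuming keywords should only survive as whole words, a possible variation wraps the alternation in \b word boundaries. The function name anonymize_line and its parameters are hypothetical and chosen only for this sketch:

```python
import random
import re
import string

def anonymize_line(line, keywords, rng=random):
    """Scramble the letters of one line, keeping whole-word keywords intact."""
    lows = list(string.ascii_lowercase)
    caps = list(string.ascii_uppercase)
    rng.shuffle(lows)
    rng.shuffle(caps)
    table = str.maketrans(string.ascii_letters, ''.join(lows + caps))
    # \b ... \b restricts matches to whole words (assumption: keywords are plain words)
    keyword_regex = re.compile(
        r'\b(?:' + '|'.join(re.escape(word) for word in keywords) + r')\b')
    chars = list(line.translate(table))
    for match in keyword_regex.finditer(line):
        # put the original keyword back over its scrambled version
        chars[match.start():match.end()] = match.group()
    return ''.join(chars)

text = "He went along: long"
result = anonymize_line(text, ["long"])
print(result)
```

Here only the final "long" is restored; the "long" inside "along" is scrambled together with the rest of the word.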

Extract numeric data from matched regular expression

I have some temperature data in a csv file and I want to extract only the temperatures for, say, the first month of each year, so after processing I want the list [1.4, -5.8] for the example below.
1866-01-01 00:00:01;1866-02-01 00:00:00;1866-01;1.4;G
1866-02-01 00:00:01;1866-03-01 00:00:00;1866-02;-3.0;G
1900-01-01 00:00:01;1900-01-01 00:00:00;1900-01;-5.8;G
I thought of doing this with python module re, but I always have issues getting to grips with regular expressions! For instance my quick test below returns all lines when I only expect it to return the entries from the first month of the year...
import numpy as np
import re
regex = '\d{4}-01-\d{2}\s\d{2}:\d{2}:\d{2};\d{4}-01-\d{2}\s\d{2}:\d{2}:\d{2};\d{4}-01;[-+]?\d*\.\d+|\d+;G'
with open('test.csv', 'rb') as fid:
    for line in fid:
        match = re.findall(regex,line)
        if match:
            print line
            print match
Use the csv module, specifying ; as the delimiter. The third column in the data is YYYY-MM, so check whether it's the first month and print the temperature if it is:
import csv
with open('data') as f:
    for row in csv.reader(f, delimiter=';'):
        year, month = row[2].split('-')
        if int(month) == 1:
            print(row[3])
Output
1.4
-5.8
For comparison, here is the simplest regex that I could come up with to extract the required value:
import re
with open('data') as f:
    temperature = re.findall(r'\d{4}-01;(.+?);', f.read())
print('\n'.join(temperature))
You can see how it takes more effort to read & understand the regex than it does the Python code.
There is an even easier way that relies on your data consisting of fixed width fields:
with open('data') as f:
    for line in f:
        if line[45:47] == '01':
            print(line[48:-3])
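The question asks for a list such as [1.4, -5.8] rather than printed values. As a minimal sketch, the csv approach can collect floats instead of printing them; io.StringIO is used here only as a stand-in for the real file, with the three sample rows copied from the question:

```python
import csv
import io

# Sample rows copied from the question; io.StringIO stands in for the real file.
data = io.StringIO(
    "1866-01-01 00:00:01;1866-02-01 00:00:00;1866-01;1.4;G\n"
    "1866-02-01 00:00:01;1866-03-01 00:00:00;1866-02;-3.0;G\n"
    "1900-01-01 00:00:01;1900-01-01 00:00:00;1900-01;-5.8;G\n"
)

temperatures = []
for row in csv.reader(data, delimiter=';'):
    year, month = row[2].split('-')  # third column is YYYY-MM
    if int(month) == 1:
        temperatures.append(float(row[3]))

print(temperatures)
```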
I suggest the following regex:
^(?:\d{4}-01-.*?)(-?\d+\.\d+)
Demo and explanation of behavior: regex101
The number is in the first capturing group.
Alternatively, with a positive lookahead:
^(?=\d{4}-01).*?(-?\d+\.\d+)
Demo and explanation of behavior: regex101
You have to put brackets around what you want to extract. So you should change the last part to ;([-+]?\d*\.\d+|\d+);G.
Try this code and tell me if it works:
import re
regex1 = re.compile(r'\d{4}-01-\d{2}')
regex2 = re.compile(r'([-+]?\d*\.\d+|\d+);G')
with open('test.csv', 'r') as fid:
    for line in fid:
        # match() anchors at the start of the line, so only the first timestamp is tested
        match1 = regex1.match(line)
        if match1:
            match2 = regex2.findall(line)
            print(line)
            print(match2)
Hope this helps.
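One likely reason the pattern in the question matches every line is that | has the lowest precedence in a regex, so an ungrouped alternation splits the whole pattern into two alternatives. The sketch below uses a shortened version of the question's pattern (an assumption made for brevity) to show the difference grouping makes:

```python
import re

january = "1866-01-01 00:00:01;1866-02-01 00:00:00;1866-01;1.4;G"
february = "1866-02-01 00:00:01;1866-03-01 00:00:00;1866-02;-3.0;G"

# Ungrouped: reads as "<year>-01;<float>"  OR  "<digits>;G".
ungrouped = re.compile(r"\d{4}-01;[-+]?\d*\.\d+|\d+;G")
# Grouped: the alternation is confined to the number itself.
grouped = re.compile(r"\d{4}-01;([-+]?\d*\.\d+|\d+);G")

february_ungrouped = ungrouped.findall(february)  # non-empty: the "\d+;G" branch matches
february_grouped = grouped.findall(february)      # empty: February is rejected
january_grouped = grouped.findall(january)        # captures the January temperature

print(february_ungrouped, february_grouped, january_grouped)
```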

extract float numbers from data file

I'm trying to extract the values (floats) from my data file. I only want the first value on each line; the second one is the error (e.g. in xo # 9.95322254_0.00108217853, 9.953... is the value and 0.0010... is the error).
Here is my code:
import sys
import re
inf = sys.argv[1]
out = sys.argv[2]
f = inf
outf = open(out, 'w')
intensity = []
with open(inf) as f:
    pattern = re.compile(r"[^-\d]*([\-]{0,1}\d+\.\d+)[^-\d]*")
    for line in f:
        f.split("\n")
        match = pattern.match(line)
        if match:
            intensity.append(match.group(0))
for k in range(len(intensity)):
    outf.write(intensity[k])
but it doesn't work. The output file is empty.
the lines in data file look like:
xo_Is
xo # 9.95322254`_0.00108217853
SPVII_to_PVII_Peak_type
PVII_m(#, 1.61879`_0.08117)
PVII_h(#, 0.11649`_0.00216)
I # 0.101760618`_0.00190314017
each time the first number is the value I want to extract and the second one is the error.
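You were almost there, but your code contains errors that prevent it from running (f.split("\n") is called on the file object, and group(0) is the whole match rather than the number). The following works:
import re
import sys
inf, out = sys.argv[1], sys.argv[2]  # input and output paths, as in the question
pattern = re.compile(r"[^-\d]*(-?\d+\.\d+)[^-\d]*")
with open(inf) as f, open(out, 'w') as outf:
    for line in f:
        match = pattern.match(line)
        if match:
            outf.write(match.group(1) + '\n')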
I think you should test your pattern on a simple string instead of a file. This will show whether the error is in the pattern or in the code that parses the file. The pattern looks good. Additionally, in most languages I know, group(0) is the entire match, and for your number you need to use group(1).
Are you sure that f.split('\n') must be inside the for loop?
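Following the suggestion to test the pattern on plain strings first, here is a minimal sketch that uses three sample lines copied from the question and records both group(0) and group(1) for comparison:

```python
import re

pattern = re.compile(r"[^-\d]*(-?\d+\.\d+)[^-\d]*")

# Sample lines copied from the question.
lines = [
    "xo_Is",
    "xo # 9.95322254`_0.00108217853",
    "PVII_m(#, 1.61879`_0.08117)",
]

values = []
for line in lines:
    match = pattern.match(line)
    if match:
        # group(0) is the whole matched text; group(1) is only the captured number
        values.append((match.group(0), match.group(1)))

print(values)
```

Lines without a float, such as xo_Is, produce no match and are skipped, and group(0) includes the surrounding label text, which is why group(1) is the one to write out.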

Python Regular Expression loop

I have this code which looks for certain things in a file. The file looks like this:
name;lastname;job;5465465
name2;lastname2;job2;5465465
name3;lastname3;job3;5465465
This is the python code:
import re
import sys
filehandle = open('somefile.csv', 'r')
text = filehandle.read()
b = re.search("([a-zA-Z]+);([a-z\sA-Z]+);([a-zA-Z]*);([0-9^-]+)\n?",text)
print (b.group(2),b.group(1),b.group(3),b.group(4))
Now it only prints:
lastname;name;job;5465465
It is supposed to print the last name first, so I did that with groups. Now I need a loop to print all lines like this:
lastname;name;job;5465465
lastname2;name2;job2;5465465
lastname3;name3;job3;5465465
I tried all kinds of loops, but none of them goes through the whole file. How do I need to do this?
It must be done with the re module. I know it's easy with the csv module ;)
You need to process the file line by line.
import re
import sys
with open('somefile.csv', 'r') as filehandle:
    for text in filehandle:
        b = re.search(r"([a-zA-Z]+);([a-z\sA-Z]+);([a-zA-Z]*);([0-9^-]+)\n?", text)
        if b:  # skip lines the pattern does not match instead of failing on None
            print(b.group(2), b.group(1), b.group(3), b.group(4))
Your file has nicely semi-colon separated values, so it would be easier to just use split or the csv library as has been suggested.
No need for re, but a good job for csv:
import csv
with open('somefile.csv', 'r') as f:
    for rec in csv.reader(f, delimiter=';'):
        print (rec[1], rec[0], rec[2], rec[3])
You can use re if you want to check the validity of individual elements (valid phone number, no numbers in name, capitalized names, etc.).
The fault is not with the loops, but rather with your regex / capture group patterns. The class [a-zA-Z]+ will not match "lastname3" or "lastname2" because they contain digits. This sample works:
import re
import sys
for line in open('somefile.csv', 'r'):
    b = re.search(r"(\w+);(\w+);(\w*);([0-9^-]+)\n?", line)
    if b:
        print("%s;%s;%s;%s" % (b.group(2), b.group(1), b.group(3), b.group(4)))
It seems you just want to reorder what you have, in which case I don't know whether regular expressions are needed. I believe the following might be of use:
import operator
reorder = operator.itemgetter(1, 0, 2, 3)
http://docs.python.org/library/operator.html
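As a rough sketch of how the itemgetter idea could be applied, each line can be split on ';', reordered and joined again. The sample lines are copied from the question, and this deliberately sidesteps the re module, so it only fits if that constraint can be relaxed:

```python
import operator

# Pick the fields in the order: lastname, name, job, number.
reorder = operator.itemgetter(1, 0, 2, 3)

lines = [
    "name;lastname;job;5465465",
    "name2;lastname2;job2;5465465",
    "name3;lastname3;job3;5465465",
]

reordered = [';'.join(reorder(line.split(';'))) for line in lines]
print('\n'.join(reordered))
```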
