I'm trying to extract values (floats) from my data file. I only want the first number on each line; the second one is the error. (E.g. in xo # 9.95322254`_0.00108217853,
9.953... is the value and 0.0010... is the error.)
Here is my code:
import sys
import re
inf = sys.argv[1]
out = sys.argv[2]
f = inf
outf = open(out, 'w')
intensity = []
with open(inf) as f:
    pattern = re.compile(r"[^-\d]*([\-]{0,1}\d+\.\d+)[^-\d]*")
    for line in f:
        f.split("\n")
        match = pattern.match(line)
        if match:
            intensity.append(match.group(0))
for k in range(len(intensity)):
    outf.write(intensity[k])
But it doesn't work: the output file is empty.
The lines in the data file look like:
xo_Is
xo # 9.95322254`_0.00108217853
SPVII_to_PVII_Peak_type
PVII_m(#, 1.61879`_0.08117)
PVII_h(#, 0.11649`_0.00216)
I # 0.101760618`_0.00190314017
Each time, the first number is the value I want to extract and the second one is the error.
You were almost there, but your code contains errors that prevent it from running. The following works:
import re
pattern = re.compile(r"[^-\d]*(-?\d+\.\d+)[^-\d]*")
with open(inf) as f, open(out, 'w') as outf:
    for line in f:
        match = pattern.match(line)
        if match:
            # group(1) is the captured number only; group(0) would be the whole match
            outf.write(match.group(1) + '\n')
I think you should test your pattern on a simple string instead of a file. That will show where the error is: in the pattern or in the code that parses the file. The pattern looks good. Also, in most languages I know, group(0) is the entire match; to get just your number you need group(1).
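A minimal sketch of that suggestion, using sample lines copied from the question. The helper name extract_value is made up for illustration; the pattern is the one from the answer above.

```python
import re

# Skip any leading non-numeric text, then capture the first float on the line.
pattern = re.compile(r"[^-\d]*(-?\d+\.\d+)[^-\d]*")

def extract_value(line):
    """Return the first float on the line as a string, or None (hypothetical helper)."""
    match = pattern.match(line)
    return match.group(1) if match else None

samples = [
    "xo_Is",
    "xo # 9.95322254`_0.00108217853",
    "PVII_m(#, 1.61879`_0.08117)",
    "I # 0.101760618`_0.00190314017",
]
for line in samples:
    print(repr(line), "->", extract_value(line))

# group(0) is the whole matched text, group(1) only the captured number.
m = pattern.match(samples[1])
print(repr(m.group(0)))
print(repr(m.group(1)))
```

If the pattern behaves on plain strings like these, the remaining problem is most likely in the file-handling code rather than in the regex.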
Are you sure that f.split('\n') belongs inside the for loop?
Related
I have a text file inside it is:
"000000002|ROOT |237277309|000000003|ROOT |337277309|000000004|ROOT |437277309|"
Now I'm trying to use a regular expression to get the first chunk of number before '|ROOT ', the number is 000000002.
I tried to use:
with open(file, 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.read()
    x = re.findall("^\s*[0-9].(ROOT$)", lines)[0]
    print(x)
And it does not work. My strategy is to get the string that starts with a number and ends with ROOT, and take the first match.
ROOT$ requires the four characters ROOT adjacent to the end of the line. findall returns all matches; if you only care about the first, probably simply use match or search.
import re
with open(file, 'r', encoding='utf-8', errors='ignore') as f:
    for line in f:
        m = re.match(r'(\d+)\|ROOT', line)
        if m:
            print(m.group(1))
            break
The break causes the loop to terminate as soon as the first match is found. We read one line at a time until we find one which matches, then terminate. (This also optimizes the program by avoiding the unnecessary reading of lines we do not care about, and by avoiding reading more than one line into memory at a time.) The parentheses in the regex cause the match inside them to be captured into group(1).
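The difference between the anchored pattern from the question and match/search can be checked on the sample string directly. This is only a sketch; the quoted variant assumes the quotation marks shown in the question are literally present in the file, which the question does not make clear.

```python
import re

text = "000000002|ROOT |237277309|000000003|ROOT |337277309|000000004|ROOT |437277309|"

# The question's pattern requires ROOT at the very end of the string: no match.
print(re.findall(r"^\s*[0-9].(ROOT$)", text))

# re.match only tries at the start of the string.
first = re.match(r"(\d+)\|ROOT", text)
print(first.group(1))

# If the line starts with a literal quote, match fails but search still succeeds.
quoted = '"' + text + '"'
print(re.match(r"(\d+)\|ROOT", quoted))
print(re.search(r"(\d+)\|ROOT", quoted).group(1))
```

So re.search is the safer choice whenever something may precede the number on the line.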
Check out this code:
import re
# 000000002|ROOT |237277309|000000003|ROOT |337277309|000000004|ROOT |437277309|
file = './file.txt'
with open(file, 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.read()
    x = re.findall(r"(\d*[0-9])\|ROOT", lines)
    print(x)
    x = re.findall(r"(\d*[0-9])\|ROOT", lines)[0]
    print(x)
OUTPUT :
['000000002', '000000003', '000000004']
000000002
I have a csv file 'svclist.csv' which contains a single-column list as follows:
pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1
pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs
I need to strip everything from each line except the PL5 directory and the 2 digits in the last directory name,
so that the result looks like this:
PL5,00
PL5,01
I started the code as follows:
clean_data = []
with open('svclist.csv', 'rt') as f:
    for line in f:
        if line.__contains__('profile'):
            print(line, end='')
and I'm stuck here.
Thanks in advance for the help.
You can use the regular expression (PL5)[^/].{0,}([0-9]{2,2})
For an explanation, copy the regex and paste it into https://regexr.com. That will show how the regex works, and you can make any required changes.
import re
test_string_list = ['pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1',
                    'pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs']
regex = re.compile("(PL5)[^/].{0,}([0-9]{2,2})")
result = []
for test_string in test_string_list:
    matchArray = regex.findall(test_string)
    result.append(matchArray[0])  # a tuple such as ('PL5', '00')
with open('outfile.txt', 'w') as f:
    for row in result:
        f.write(','.join(row) + '\n')  # writes PL5,00 without quotes or parentheses
In the above code, I've created an empty list to hold the tuples that findall returns. Then I write them to the file. Slicing str(row)[1:-1] would remove the () at the start and end but leave the quotes around each item, so each tuple is joined with ',' instead.
Each joined row is then written to 'outfile.txt' followed by a newline.
You can use a regex for this (in general, when trying to extract a pattern, this might be a good option):
import re
# compile the pattern so that it has a findall method
pattern = re.compile(r"pf=/usr/sap/PL5/SYS/profile/PL5_.*(\d{2})")
with open('svclist.csv', 'rt') as f:
    for line in f:
        if 'profile' in line:
            last_two_numbers = pattern.findall(line)[0]
            print(f'PL5,{last_two_numbers}')
This code goes over each line, checks whether "profile" is in the line (this is the same as __contains__), then extracts the last two consecutive digits according to the pattern.
I made the assumption that the number is always between the two underscores. You could run something similar to this within your for loop.
test_str = "pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1"
test_list = test_str.split("_")  # splits the string at the underscores
letters = "abcdefghijklmnopqrstuvwxyz"
output = test_list[1].strip(letters + letters.upper())  # removes leading and trailing letters
try:
    int(output)  # tests whether any non-digit characters are left
    print(f"PL5,{output}")
except ValueError:
    print(f'Something went wrong! Output is PL5,{output}')
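For comparison, here is a sketch that looks only at the last path component instead of the whole line. It assumes profile names always follow the layout SID_NAMEnn_host seen in the two sample lines; that layout, the constant PROFILE_RE and the helper sid_and_instance are assumptions made for this example, not something the question confirms.

```python
import re

# Assumed layout of the profile name: <SID>_<instance name><2 digits>_<host>
PROFILE_RE = re.compile(r"^([A-Z0-9]{3})_[A-Z]+(\d{2})_")

def sid_and_instance(line):
    """Return 'SID,NN' for a profile path, or None if the name does not fit."""
    name = line.strip().rsplit("/", 1)[-1]  # last path component
    m = PROFILE_RE.match(name)
    return f"{m.group(1)},{m.group(2)}" if m else None

lines = [
    "pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1",
    "pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs",
]
for line in lines:
    print(sid_and_instance(line))
```

Because the SID is captured rather than hard-coded, the same sketch would also handle systems other than PL5, as long as the naming assumption holds.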
I am trying to extract data from a .txt file in Python. My goal is to find the last occurrence of a certain word and show the line after it, so I reverse the text and read from the end. In this case I search for the word 'MEC' and show the next line, but I capture every occurrence of the word, not just the first one found.
Any idea what I need to do?
Thanks!
This is what my code looks like:
import re
from file_read_backwards import FileReadBackwards
with FileReadBackwards("camdex.txt", encoding="utf-8") as file:
    for l in file:
        lines = l
        while line:
            if re.match('MEC', line):
                x = (file.readline())
                x2 = (x.strip('\n'))
                print(x2)
                break
            line = file.readline()
The txt file contains this:
MEC
29/35
MEC
28,29/35
And my code prints this output:
28,29/35
29/35
And my objective is to print only this:
28,29/35
This will give you the result as well. Loop through the lines, append the line that follows each match to a list, then print the last element.
import re
with open("data/camdex.txt", encoding="utf-8") as file:
    result = []
    for line in file:
        if re.match('MEC', line):
            x = file.readline()  # the line right after the "MEC" line
            result.append(x.strip('\n'))
print(result[-1])
Get rid of the extra imports and overhead. Read your file normally, remembering the last line that qualifies.
last = None
with open("camdex.txt", encoding="utf-8") as file:
    for line in file:
        if line.startswith("MEC"):
            # the value is on the line after "MEC"; "" if "MEC" is the last line
            last = next(file, "").rstrip("\n")
print(last)
If the file is very large, then reading backwards makes sense -- seeking to the end and backing up will be faster than reading to the end.
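A rough sketch of that seek-from-the-end idea using only the standard library. The function name value_after_last_marker and the block size are choices made for this example; the sketch assumes UTF-8 text with "\n" line endings and that the value sits on the line after the marker, as in the sample file.

```python
import os
import tempfile

def value_after_last_marker(path, marker=b"MEC", block_size=4096):
    """Read blocks from the end of the file until a marker line is found;
    return the line that follows the last marker line, or None."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""
        while pos > 0:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            tail = f.read(step) + tail
            lines = tail.split(b"\n")
            # lines[0] may be a partial line unless we reached the file start
            start = 0 if pos == 0 else 1
            # scan from the end; stop one short so lines[i + 1] always exists
            for i in range(len(lines) - 2, start - 1, -1):
                if lines[i].startswith(marker):
                    return lines[i + 1].decode("utf-8").strip()
    return None

# Small demonstration with the sample data from the question.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("MEC\n29/35\nMEC\n28,29/35\n")
print(value_after_last_marker(tmp.name, block_size=4))
os.remove(tmp.name)
```

The tiny block size in the demonstration only forces several backward reads; for a real file something like the default 4096 bytes is more sensible. The tail is re-split on every pass, which is fine when the marker is near the end but wasteful otherwise.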
I have some temperature data in a csv file and I want to extract only the temperatures for, say, the first month of the year, so after processing I want the list [1.4, -5.8] for the example below.
1866-01-01 00:00:01;1866-02-01 00:00:00;1866-01;1.4;G
1866-02-01 00:00:01;1866-03-01 00:00:00;1866-02;-3.0;G
1900-01-01 00:00:01;1900-01-01 00:00:00;1900-01;-5.8;G
I thought of doing this with python module re, but I always have issues getting to grips with regular expressions! For instance my quick test below returns all lines when I only expect it to return the entries from the first month of the year...
import numpy as np
import re
regex = '\d{4}-01-\d{2}\s\d{2}:\d{2}:\d{2};\d{4}-01-\d{2}\s\d{2}:\d{2}:\d{2};\d{4}-01;[-+]?\d*\.\d+|\d+;G'
with open('test.csv', 'rb') as fid:
    for line in fid:
        match = re.findall(regex,line)
        if match:
            print line
            print match
Use the csv module, specifying ; as the delimiter. The third column in the data is YYYY-MM, so check whether it's the first month and print the temperature if it is:
import csv
with open('data') as f:
    for row in csv.reader(f, delimiter=';'):
        year, month = row[2].split('-')
        if int(month) == 1:
            print(row[3])
Output
1.4
-5.8
For comparison, here is the simplest regex that I could come up with to extract the required value:
import re
with open('data') as f:
    temperature = re.findall(r'\d{4}-01;(.+?);', f.read())
print('\n'.join(temperature))
You can see how it takes more effort to read & understand the regex than it does the Python code.
There is an even easier way that relies on your data consisting of fixed width fields:
with open('data') as f:
    for line in f:
        if line[45:47] == '01':
            print(line[48:-3])
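Since the question asks for a list of floats rather than printed strings, the csv approach can be adapted as in the sketch below. The helper name january_temperatures is made up for this example, and the sample rows are the three lines from the question fed in through io.StringIO instead of a file.

```python
import csv
import io

sample = """\
1866-01-01 00:00:01;1866-02-01 00:00:00;1866-01;1.4;G
1866-02-01 00:00:01;1866-03-01 00:00:00;1866-02;-3.0;G
1900-01-01 00:00:01;1900-01-01 00:00:00;1900-01;-5.8;G
"""

def january_temperatures(lines):
    """Collect the temperature column as floats for rows whose YYYY-MM ends in -01."""
    temps = []
    for row in csv.reader(lines, delimiter=";"):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        if row[2].endswith("-01"):
            temps.append(float(row[3]))
    return temps

print(january_temperatures(io.StringIO(sample)))  # [1.4, -5.8]
```

An open file object can be passed in place of the StringIO, because csv.reader accepts any iterable of lines.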
I suggest the following regex:
^(?:\d{4}-01-.*?)(-?\d+\.\d+)
Demo and explanation of behavior: regex101
The number is in the first capturing group.
Alternatively, with a positive lookahead:
^(?=\d{4}-01).*?(-?\d+\.\d+)
Demo and explanation of behavior: regex101
You have to put parentheses around what you want to extract, so you should change the last part to ;([-+]?\d*\.\d+|\d+);G.
Try this code and tell me if it works:
import re
regex1 = re.compile(r'\d{4}-01-\d{2}')
regex2 = re.compile(r'([-+]?\d*\.\d+|\d+);G')
with open('test.csv') as fid:
    for line in fid:
        match1 = regex1.findall(line)
        if match1:
            match2 = regex2.findall(line)
            print(line, end='')
            print(match2)
Hope this helps.
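The reason the original regex matched every line is that | has the lowest precedence in a regular expression: without parentheses, the whole pattern splits into "everything before |" or "\d+;G", and the second alternative fits any row. The sketch below uses a shortened version of the question's regex (only the YYYY-01; column check is kept) to show the effect on the February and January sample rows.

```python
import re

feb = "1866-02-01 00:00:01;1866-03-01 00:00:00;1866-02;-3.0;G"
jan = "1866-01-01 00:00:01;1866-02-01 00:00:00;1866-01;1.4;G"

# Ungrouped: alternatives are "\d{4}-01;[-+]?\d*\.\d+" OR "\d+;G".
ungrouped = r"\d{4}-01;[-+]?\d*\.\d+|\d+;G"
# Grouped: the alternation only covers the two number formats.
grouped = r"\d{4}-01;([-+]?\d*\.\d+|\d+);G"

print(re.findall(ungrouped, feb))  # still finds something on a February row
print(re.findall(grouped, feb))    # no match, as intended
print(re.findall(grouped, jan))    # the captured temperature
```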
I have a program that polls a server's current Wi-Fi status every minute and saves that info to a .txt file. The output is:
*****CURRENT WIFI SIGNAL STRENGTH*****: Link Quality=57/70 Signal level=-53 dBm
The text file contains many of these lines. What I'm trying to accomplish is:
-Find the signal dBm values in all the lines and append them to an array, so that I can then do other things such as sorting and averaging. I can't seem to get it working quite right.
Does anyone know how to do this?
Thank you!
I would go through each line in the file and split the line at =, then take the last piece, split it at the space, and take the first item, which yields -53 (as a string).
strengthValues = []
with open("input.txt", "r") as f:
    fileLines = f.readlines()
for line in fileLines:
    lineSplit = line.split('=')
    strengthValues.append(lineSplit[-1].split()[0])
print(strengthValues)
Or with a list comprehension:
with open("test.txt", "r") as f:
    fileLines = f.readlines()
strengthValues = [line.split('=')[-1].split()[0] for line in fileLines]
print(strengthValues)
import re
signal_levels = []
try:
    with open("file.txt") as fh:
        lines = fh.readlines()
except IOError as err:
    # error handling goes here; re-raise until it is written
    raise
Then you can either make use of the re module:
for line in lines:
    matches = re.search(r'Signal level=(-?[0-9]+) dBm$', line)
    if matches is None:
        # possible error handling; here the line is simply skipped
        continue
    signal_levels.append(int(matches.group(1)))
Or without it (inspired by heinst's answer):
for line in lines:
    try:
        value = int(line.split('=')[-1].split()[0])
        signal_levels.append(value)
    except (ValueError, IndexError) as err:
        # possible error handling; here the line is simply skipped
        continue
Assuming that the signal level is the only negative number on any line, you could use a regular expression with the findall function to search for all negative numbers in the file and return them as a list of strings (based on MC93's answer).
import re
with open("input.txt", "r") as f_in:
    signal_levels = re.findall(r"-\d+", f_in.read())
Alternatively, you could get a list of ints with a list comprehension.
with open("input.txt", "r") as f_in:
    signal_levels = [int(n) for n in re.findall(r"-\d+", f_in.read())]
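Once the values are ints, the sorting and averaging mentioned in the question need only the standard library. In the sketch below the first log line is from the question; the other two are invented sample data so that there is something to sort and average.

```python
import re
import statistics

# First line is from the question; the other two are made-up sample data.
log = """\
*****CURRENT WIFI SIGNAL STRENGTH*****: Link Quality=57/70 Signal level=-53 dBm
*****CURRENT WIFI SIGNAL STRENGTH*****: Link Quality=60/70 Signal level=-50 dBm
*****CURRENT WIFI SIGNAL STRENGTH*****: Link Quality=49/70 Signal level=-61 dBm
"""

# Capture only the number that follows "Signal level=".
levels = [int(n) for n in re.findall(r"Signal level=(-?\d+) dBm", log)]

print(sorted(levels))            # weakest (most negative) signal first
print(statistics.mean(levels))   # average signal level in dBm
print(min(levels), max(levels))
```

Anchoring the pattern on "Signal level=" avoids relying on the signal level being the only negative number on the line.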