Extract numeric data from matched regular expression - python

I have some temperature data in a csv file and I want to extract only the temperatures for, say, the first month of the year, so after processing I want the list [1.4, -5.8] for the example below.
1866-01-01 00:00:01;1866-02-01 00:00:00;1866-01;1.4;G
1866-02-01 00:00:01;1866-03-01 00:00:00;1866-02;-3.0;G
1900-01-01 00:00:01;1900-01-01 00:00:00;1900-01;-5.8;G
I thought of doing this with python module re, but I always have issues getting to grips with regular expressions! For instance my quick test below returns all lines when I only expect it to return the entries from the first month of the year...
import numpy as np
import re
regex = '\d{4}-01-\d{2}\s\d{2}:\d{2}:\d{2};\d{4}-01-\d{2}\s\d{2}:\d{2}:\d{2};\d{4}-01;[-+]?\d*\.\d+|\d+;G'
with open('test.csv', 'rb') as fid:
    for line in fid:
        match = re.findall(regex,line)
        if match:
            print line
            print match

Use the csv module, specifying ; as the delimiter. The third column in the data is YYYY-MM, so check whether it's the first month and print the temperature if it is:
import csv
with open('data') as f:
    for row in csv.reader(f, delimiter=';'):
        year, month = row[2].split('-')
        if int(month) == 1:
            print(row[3])
Output
1.4
-5.8
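Since the question asks for a list such as [1.4, -5.8] rather than printed output, the same loop can collect floats instead. This is only a minimal sketch: the january_temperatures helper and the in-memory sample string are illustrative names I made up, and the sample is just the three rows from the question.

```python
import csv
import io

# The three sample rows from the question, held in memory for the demo.
sample = (
    "1866-01-01 00:00:01;1866-02-01 00:00:00;1866-01;1.4;G\n"
    "1866-02-01 00:00:01;1866-03-01 00:00:00;1866-02;-3.0;G\n"
    "1900-01-01 00:00:01;1900-01-01 00:00:00;1900-01;-5.8;G\n"
)

def january_temperatures(fileobj):
    """Return the temperatures (as floats) of rows whose YYYY-MM column is January."""
    temps = []
    for row in csv.reader(fileobj, delimiter=';'):
        if not row:
            continue  # skip blank lines
        year, month = row[2].split('-')
        if int(month) == 1:
            temps.append(float(row[3]))
    return temps

temps = january_temperatures(io.StringIO(sample))
print(temps)
```

With a real file you would pass the object returned by open('data') instead of the StringIO.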
For comparison, here is the simplest regex that I could come up with to extract the required value:
import re
with open('data') as f:
    temperature = re.findall(r'\d{4}-01;(.+?);', f.read())
print('\n'.join(temperature))
You can see how it takes more effort to read & understand the regex than it does the Python code.
There is an even easier way that relies on your data consisting of fixed width fields:
with open('data') as f:
    for line in f:
        if line[45:47] == '01':
            print(line[48:-3])

I suggest the following regex:
^(?:\d{4}-01-.*?)(-?\d+\.\d+)
Demo and explanation of behavior: regex101
The number is in the first capturing group.
Alternatively, with a positive lookahead:
^(?=\d{4}-01).*?(-?\d+\.\d+)
Demo and explanation of behavior: regex101
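Both patterns start with ^, so when they are applied to the whole file content at once they presumably need re.MULTILINE to let ^ match at the start of every line. A small sketch of how they might be used from Python, assuming the three sample rows from the question (the variable names are mine):

```python
import re

# The three sample rows from the question as one string.
data = (
    "1866-01-01 00:00:01;1866-02-01 00:00:00;1866-01;1.4;G\n"
    "1866-02-01 00:00:01;1866-03-01 00:00:00;1866-02;-3.0;G\n"
    "1900-01-01 00:00:01;1900-01-01 00:00:00;1900-01;-5.8;G\n"
)

# Non-capturing prefix variant; re.MULTILINE makes ^ match at each line start.
prefix_pattern = re.compile(r'^(?:\d{4}-01-.*?)(-?\d+\.\d+)', re.MULTILINE)
# Positive lookahead variant.
lookahead_pattern = re.compile(r'^(?=\d{4}-01).*?(-?\d+\.\d+)', re.MULTILINE)

# findall returns the content of the single capturing group for each match.
prefix_values = prefix_pattern.findall(data)
lookahead_values = lookahead_pattern.findall(data)
print(prefix_values)
print(lookahead_values)
```

Note that both patterns test the month in the first column, not the YYYY-MM column, which happens to agree for the sample data.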

You have to put brackets around what you want to extract. So you should change the last part to ;([-+]?\d*\.\d+|\d+);G.
Try this code and tell me if it works:
import re
regex1 = re.compile('\d{4}-01-\d{2}')
regex2 = re.compile('([-+]?\d*\.\d+|\d+);G')
with open('test.csv', 'rb') as fid:
    for line in fid:
        match1 = re.findall(regex1,line)
        if match1:
            match2 = re.findall(regex2, line)
            print line
            print match2
Hope this helps.
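As far as I can tell, the reason the original regex matched every line is that | has the lowest precedence in a regex: without parentheses the pattern means "everything up to \.\d+" OR "\d+;G", and the second alternative matches the tail of every row. A hedged sketch of the difference, using a shortened version of the question's pattern and its three sample rows (the variable names are illustrative):

```python
import re

rows = [
    "1866-01-01 00:00:01;1866-02-01 00:00:00;1866-01;1.4;G",
    "1866-02-01 00:00:01;1866-03-01 00:00:00;1866-02;-3.0;G",
    "1900-01-01 00:00:01;1900-01-01 00:00:00;1900-01;-5.8;G",
]

# Ungrouped alternation: reads as (\d{4}-01;[-+]?\d*\.\d+) OR (\d+;G).
ungrouped = re.compile(r'\d{4}-01;[-+]?\d*\.\d+|\d+;G')
# Grouped alternation: the | now only chooses between the two number forms.
grouped = re.compile(r'\d{4}-01;([-+]?(?:\d*\.\d+|\d+));G')

# Which rows does the ungrouped pattern match at all?
ungrouped_hits = [bool(ungrouped.search(row)) for row in rows]

# Temperatures captured by the grouped pattern.
grouped_temps = []
for row in rows:
    match = grouped.search(row)
    if match:
        grouped_temps.append(float(match.group(1)))

print(ungrouped_hits)
print(grouped_temps)
```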

Related

python open csv search for pattern and strip everything else

I got a csv file 'svclist.csv' which contains a single column list as follows:
pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1
pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs
I need to strip everything from each line except the PL5 directory and the 2 digits in the last directory,
so the result should look like this:
PL5,00
PL5,01
I started the code as follows:
clean_data = []
with open('svclist.csv', 'rt') as f:
    for line in f:
        if line.__contains__('profile'):
            print(line, end='')
and I'm stuck here.
Thanks in advance for the help.
You can use the regular expression (PL5)[^/].{0,}([0-9]{2,2})
For an explanation, paste the regex into 'https://regexr.com'. It shows how each part of the regex works, so you can make any changes you need.
import re
test_string_list = ['pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1',
                    'pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs']
regex = re.compile("(PL5)[^/].{0,}([0-9]{2,2})")
result = []
for test_string in test_string_list:
    matchArray = regex.findall(test_string)
    result.append(matchArray[0])  # a tuple such as ('PL5', '00')
with open('outfile.txt', 'w') as f:
    for row in result:
        f.write(','.join(row) + '\n')  # writes PL5,00
In the above code, I've created an empty list to hold the tuples returned by findall. Then I write them to the file. Joining each tuple with ','.join(row) gives PL5,00 without the parentheses and quotes that str(row) would include.
Each joined string is then written as one line of 'outfile.txt'.
You can use a regex for this (in general, when trying to extract a pattern, this might be a good option):
import re
# compile the pattern so that it has a findall() method
pattern = re.compile(r"pf=/usr/sap/PL5/SYS/profile/PL5_.*(\d{2})")
with open('svclist.csv', 'rt') as f:
    for line in f:
        if 'profile' in line:
            last_two_numbers = pattern.findall(line)[0]
            print(f'PL5,{last_two_numbers}')
This code goes over each line, checks if "profile" is in the line (this is the same as calling __contains__), then extracts the last two consecutive digits according to the pattern.
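If the system ID is not always PL5, the pattern could capture it as well instead of hard-coding it. This is a sketch under my own assumptions, not something stated in the question: that the last directory always looks like <ID>_<letters><two digits>_<rest>, and that a helper called extract_id_and_number is acceptable.

```python
import re

# Assumed layout of the last path component: <ID>_<letters><2 digits>_<rest>
profile_pattern = re.compile(r'profile/([^_/]+)_[A-Za-z]*(\d{2})_')

def extract_id_and_number(line):
    """Return 'ID,NN' for a matching line, or None if the line does not match."""
    match = profile_pattern.search(line)
    if match is None:
        return None
    return ','.join(match.groups())

lines = [
    'pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1',
    'pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs',
]
cleaned = [extract_id_and_number(line) for line in lines]
print(cleaned)
```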
I made the assumption that the number is always between the two underscores. You could run something similar to this within your for-loop.
test_str = "pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1"
test_list = test_str.split("_")  # splits the string at the underscores
output = test_list[1].strip(
    "abcdefghijklmnopqrstuvwxyz" + str.swapcase("abcdefghijklmnopqrstuvwxyz"))  # removing leading and trailing letters
try:
    int(output)  # testing whether any non-digit characters are left
    print(f"PL5,{output}")
except ValueError:
    print(f'Something went wrong! Output is PL5,{output}')

Extracting substrings from CSV table

I'm trying to clean up the data from a csv table that looks like this:
KATY PERRY#katyperry
1,084,149,282,038,820
Justin Bieber#justinbieber
10,527,300,631,674,900,000
Barack Obama#BarackObama
9,959,243,562,511,110,000
I want to extract just the "#" handles, such as:
#katyperry
#justinbieber
#BarackObama
This is the code I've put together, but all it does is repeat the second line of the table over and over:
import csv
import re
with open('C:\\Users\\TK\\Steemit\\Scripts\\twitter.csv', 'rt', encoding='UTF-8') as inp:
    read = csv.reader(inp)
    for row in read:
        for i in row:
            if i.isalpha():
                stringafterword = re.split('\\#\\',row)[-1]
                print(stringafterword)
If you are willing to use re, you can get a list of strings in one line:
import re
#content string added to make it a working example
content = """KATY PERRY#katyperry
1,084,149,282,038,820
Justin Bieber#justinbieber
10,527,300,631,674,900,000
Barack Obama#BarackObama
9,959,243,562,511,110,000"""
#solution using 're':
m = re.findall('#.*', content)
print(m)
#option without 're' but using str.find() based on your loop:
for row in content.split():
    pos_of_at = row.find('#')
    if pos_of_at > -1: #-1 indicates "substring not found"
        print(row[pos_of_at:])
You should of course replace the content string with the file content.
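Firstly, "#" is a symbol, not a letter, and str.isalpha() returns True only if every character in the string is alphabetic. Therefore if i.isalpha(): is False for any field that contains a "#" (or a space), so your re.split() is never even called for those fields.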
Try this:
import csv
import re
with open('C:\\Users\\input.csv', 'rt', encoding='UTF-8') as inp:
    read = csv.reader(inp)
    for row in read:
        for i in row:
            stringafterword = re.findall('#.*',i)
            print(stringafterword)
Here I have removed the if-condition and replaced re.split() with re.findall('#.*', i), which returns the part of each field from the # onwards (or an empty list if the field has no #).
Hope it works.
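To see the isalpha() point in isolation, here is a small self-contained sketch. It assumes the six sample lines from the question, read through csv.reader from an in-memory string; the names sample, rows and handles are mine, and empty results are skipped so only the handles remain.

```python
import csv
import io
import re

sample = """KATY PERRY#katyperry
1,084,149,282,038,820
Justin Bieber#justinbieber
10,527,300,631,674,900,000
Barack Obama#BarackObama
9,959,243,562,511,110,000"""

rows = list(csv.reader(io.StringIO(sample)))

# The first row is a single field; the space and the '#' make isalpha() False.
first_field = rows[0][0]
print(first_field, first_field.isalpha())

# Collect the handles: search every field and keep only the actual matches.
handles = []
for row in rows:
    for field in row:
        match = re.search('#.*', field)
        if match:
            handles.append(match.group(0))
print(handles)
```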

value not found and look behind errors

I am trying to print the text between 2 strings in a file, but I get errors.
Regex version:
import re
with open("in.txt") as f:
    lines = f.read()
m = re.findall(r'(?s)(?<=Credit\s*\b).*?(?=Amount)', lines)
for i in m:
    print i
(returns look behind not found)
Another attempt:
with open("in.txt") as f:
    lines = f.read()
cred_ind = (lines.index("Credit"))
am_ind = lines.index("Amount")
print(lines[cred_ind+6:am_ind])
(returns substring not found)
Text file:
....
accounts
Bank
Credit
good value
money
Amount
Amount
Output:
good value
money
Python's re module does not support variable-length lookbehind.
A simple fix is to avoid lookaround assertions and use a capturing group instead:
m = re.findall(r'(?s)Credit\s*(.*?)Amount', lines)
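To illustrate both points, here is a short sketch that first tries to compile the lookbehind version and then applies the capturing-group version. It assumes the sample text from the question held in a string, uses re.DOTALL in place of the inline (?s), and strips the surrounding whitespace from the captured text; on the CPython versions I am aware of, the first compile raises re.error.

```python
import re

# The sample file content from the question.
text = "accounts\nBank\nCredit\ngood value\nmoney\nAmount\nAmount\n"

# \s* inside a lookbehind has no fixed width, so re refuses to compile it.
try:
    re.compile(r'(?<=Credit\s*\b).*?(?=Amount)', re.DOTALL)
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print("lookbehind rejected:", exc)

# Capturing-group version: findall returns only the group's content.
between = [part.strip()
           for part in re.findall(r'Credit\s*(.*?)Amount', text, re.DOTALL)]
print(between)
```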

extract float numbers from data file

I'm trying to extract the values (floats) from my datafile. I only want to extract the first value on each line; the second one is the error (e.g. xo # 9.95322254_0.00108217853 means 9.953... is the value and 0.0010... is the error).
Here is my code:
import sys
import re
inf = sys.argv[1]
out = sys.argv[2]
f = inf
outf = open(out, 'w')
intensity = []
with open(inf) as f:
    pattern = re.compile(r"[^-\d]*([\-]{0,1}\d+\.\d+)[^-\d]*")
    for line in f:
        f.split("\n")
        match = pattern.match(line)
        if match:
            intensity.append(match.group(0))
for k in range(len(intensity)):
    outf.write(intensity[k])
but it doesn't work. The output file is empty.
the lines in data file look like:
xo_Is
xo # 9.95322254`_0.00108217853
SPVII_to_PVII_Peak_type
PVII_m(#, 1.61879`_0.08117)
PVII_h(#, 0.11649`_0.00216)
I # 0.101760618`_0.00190314017
each time the first number is the value I want to extract and the second one is the error.
You were almost there, but your code contains errors preventing it from running. The following works:
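import re
import sys
inf, out = sys.argv[1], sys.argv[2]  # input and output paths, as in the question
pattern = re.compile(r"[^-\d]*(-?\d+\.\d+)[^-\d]*")
with open(inf) as f, open(out, 'w') as outf:
    for line in f:
        match = pattern.match(line)
        if match:
            outf.write(match.group(1) + '\n')  # group(1) is the value only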
I think you should test your pattern on a simple string instead of a file. This will show where the error is: in the pattern or in the code that parses the file. The pattern looks good. Additionally, in most languages I know, group(0) is the whole match, and for your number you need to use group(1).
Are you sure that f.split("\n") belongs inside the for loop? f is a file object, which has no split method.
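Following the suggestion to test the pattern on plain strings first, here is a minimal sketch that runs it over the sample lines from the question and compares group(0) with group(1). The lists and variable names are illustrative only.

```python
import re

pattern = re.compile(r"[^-\d]*(-?\d+\.\d+)[^-\d]*")

# Sample lines copied from the question's data file.
sample_lines = [
    "xo_Is",
    "xo # 9.95322254`_0.00108217853",
    "SPVII_to_PVII_Peak_type",
    "PVII_m(#, 1.61879`_0.08117)",
    "PVII_h(#, 0.11649`_0.00216)",
    "I # 0.101760618`_0.00190314017",
]

whole_matches = []  # group(0): everything the pattern consumed
values = []         # group(1): just the first number
for line in sample_lines:
    match = pattern.match(line)
    if match:
        whole_matches.append(match.group(0))
        values.append(match.group(1))

print(whole_matches[0])  # includes the label and separator around the number
print(values)
```

Lines without a decimal number, such as xo_Is, simply produce no match and are skipped.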

Python Regular Expression loop

I have this code which looks for certain things in a file. The file looks like this:
name;lastname;job;5465465
name2;lastname2;job2;5465465
name3;lastname3;job3;5465465
This is the python code:
import re
import sys
filehandle = open('somefile.csv', 'r')
text = filehandle.read()
b = re.search("([a-zA-Z]+);([a-z\sA-Z]+);([a-zA-Z]*);([0-9^-]+)\n?",text)
print (b.group(2),b.group(1),b.group(3),b.group(4))
Now it only prints:
lastname;name;job;5465465
It is supposed to print the lastname first, so I did that with groups. Now I need a loop to print all lines like this:
lastname;name;job;5465465
lastname2;name2;job2;5465465
lastname3;name3;job3;5465465
I tried all kinds of loops, but none of them goes through the whole file. How should I do this?
It must be done with the re module. I know it's easy with the csv module ;)
You need to process the file line by line.
import re
import sys
with open('somefile.csv', 'r') as filehandle:
    for text in filehandle:
        b = re.search("([a-zA-Z]+);([a-z\sA-Z]+);([a-zA-Z]*);([0-9^-]+)\n?",text)
        if b:  # skip lines the pattern does not match (e.g. names containing digits)
            print (b.group(2),b.group(1),b.group(3),b.group(4))
Your file has nicely semi-colon separated values, so it would be easier to just use split or the csv library as has been suggested.
No need for re, but a good job for csv:
import csv
with open('somefile.csv', 'r') as f:
    for rec in csv.reader(f, delimiter=';'):
        print (rec[1], rec[0], rec[2], rec[3])
You can use re if you want to check the validity of individual elements (valid phone number, no numbers in name, capitalized names, etc.).
The fault is not with the loops, but rather with your regex / capture group patterns. The class [a-zA-Z]+ will not match "lastname3" or "lastname2". This sample works:
import re
import sys
for line in open('somefile.csv', 'r'):
    b = re.search(r"(\w+);(\w+);(\w*);([0-9^-]+)\n?",line)
    if b:
        print("%s;%s;%s;%s" % (b.group(2),b.group(1),b.group(3),b.group(4)))
It seems as if you just want to reorder what you have, in which case I don't know whether regexes are needed. I believe the following might be of use:
import operator
reorder = operator.itemgetter(1, 0, 2, 3)
http://docs.python.org/library/operator.html
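A minimal sketch of how that itemgetter could be applied, assuming each line is split on ';' first (the splitting and joining are my additions, and the sample lines are the ones from the question):

```python
import operator

# Picks fields 1, 0, 2, 3 in that order, i.e. swaps the first two columns.
reorder = operator.itemgetter(1, 0, 2, 3)

sample_lines = [
    "name;lastname;job;5465465",
    "name2;lastname2;job2;5465465",
    "name3;lastname3;job3;5465465",
]

# Split each line into fields, reorder them, and join them back together.
reordered = [';'.join(reorder(line.split(';'))) for line in sample_lines]
for line in reordered:
    print(line)
```

This does not use re, so it only helps if the "must be done with re" requirement can be relaxed.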
