Extract Numeric Data from a Text file in Python

Say I have a text file with the data/string:
Dataset #1: X/Y= 5, Z=7 has been calculated
Dataset #2: X/Y= 6, Z=8 has been calculated
Dataset #10: X/Y =7, Z=9 has been calculated
I want the output to be on a csv file as:
X/Y, X/Y, X/Y
Which should display:
5, 6, 7
Here is my current approach. I am using string.find, but it feels like a difficult way to solve this problem:
data = open('TestData.txt').read()
#index of string
counter = 1
if (data.find('X/Y=')==1):
#extracts segment out of string
line = data[r+6:r+14]
r = data.find('X/Y=')
counter += 1
print line
else:
r = data.find('X/Y')
line = data[r+6:r+14]
for x in range(0,counter):
print line
print counter
Error: For some reason, I'm only getting the value of 5. When I set up a loop, I get infinite 5's.

If you want the numbers and your txt file is formatted like the first two lines, i.e. X/Y= 6 and not X/Y =7:
import re
result = []
with open("TestData.txt") as f:
    for line in f:
        # Match one or more digits preceded by "Y=" and one whitespace character.
        s = re.search(r'(?<=Y=\s)\d+', line)
        if s:  # re.search returns None when there is no match
            result.append(s.group())
print(result)
When every line uses the X/Y= N spacing, this prints:
['5', '6', '7']
To match the pattern in your comment, you should escape the period like \. or you will match strings like 1.2+3 etc. The "." has special meaning in re.
So re.search(r'(?<=Counting Numbers =\s)\d\.\d\.\d',s).group()
will return only 1.2.3
If it makes it more explicit, you can use s=re.search(r'(?<=X/Y=\s)\d+',line) using the full X/Y=\s pattern.
Using the original line in your comment and updated line would return :
['5', '6', '7', '5', '5']
The (?<=Y=\s) is called a positive lookbehind assertion. From the re documentation:
(?<=...)
Matches if the current position in the string is preceded by a match for ... that ends at the current position.
There are lots of nice examples in the re documentation. The text matched inside the lookbehind parentheses is not returned.
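As a small illustration of the lookbehind described above, the sketch below compares it with a capture group on two of the sample lines. The `lines` list and the variable names are made up for the example. Python's re module only accepts fixed-width lookbehinds, so optional spacing such as `\s*` cannot go inside `(?<=...)`; a capture group is one way around that.

```python
import re

# Two of the sample lines; the second uses the "X/Y =7" spacing.
lines = [
    "Dataset #1: X/Y= 5, Z=7 has been calculated",
    "Dataset #10: X/Y =7, Z=9 has been calculated",
]

# Lookbehind: the digits must be preceded by exactly "X/Y=" and one whitespace character.
lookbehind = re.compile(r'(?<=X/Y=\s)\d+')
# Capture group: optional whitespace on either side of "=", digits captured in group 1.
capture = re.compile(r'X/Y\s*=\s*(\d+)')

results = []
for line in lines:
    lb = lookbehind.search(line)
    cap = capture.search(line)
    results.append((lb.group() if lb else None, cap.group(1) if cap else None))

print(results)
```

The lookbehind finds nothing on the second line because the space sits before the "=" rather than after it, while the capture group handles both spacings.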

Since it appears that the entities are all on a single line, I would recommend using readline in a loop to read the file line-by-line and then using a regex to parse out the components you're looking for from that line.
Edit re: OP's comment:
One regex pattern that could be used to capture the number given the specified format in this case would be: X/Y\s*=\s*(.+),
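A minimal sketch of the line-by-line approach, assuming the values are plain integers and that the CSV should hold one header row of X/Y labels followed by one row of values. The names `extract_xy` and `write_csv` are hypothetical, and the pattern captures `\d+` instead of `.+` so that only digits are returned.

```python
import csv
import io
import re

# Tolerates optional spaces on either side of "=" and captures only the digits.
XY = re.compile(r'X/Y\s*=\s*(\d+)')

def extract_xy(lines):
    """Return the X/Y value from every line that has one, as strings."""
    values = []
    for line in lines:
        match = XY.search(line)
        if match:
            values.append(match.group(1))
    return values

def write_csv(values, out):
    """Write a header row of X/Y labels, then the values, to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(['X/Y'] * len(values))
    writer.writerow(values)

sample = [
    "Dataset #1: X/Y= 5, Z=7 has been calculated",
    "Dataset #2: X/Y= 6, Z=8 has been calculated",
    "Dataset #10: X/Y =7, Z=9 has been calculated",
]
values = extract_xy(sample)
buffer = io.StringIO()
write_csv(values, buffer)
print(values)
```

With a real file, the same functions can be given `open('TestData.txt')` as `lines` and a file opened with `open('out.csv', 'w', newline='')` as `out`.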

Related

How can I extract numbers containing commas from strings in python

I am trying to find all numbers in text and return them in a list of floats.
In the text:
Commas are used to separate thousands
Several consecutive numbers are separated by a comma and a space
Numbers can be attached to words
My code seems to extract numbers separated with a comma and space and numbers attached to words.
However, it extracts numbers that contain thousands commas as separate numbers:
import re
text = "30feet is about 10metre but that's 1 rough estimate several numbers are like 2, 137, and 40 or something big numbers are like 2,137,040 or something"
list(map(int, re.findall(r'\d+', text)))
The suggestions below work beautifully.
Unfortunately, the code below returns a list of strings:
nums = re.findall(r'\b\d{1,3}(?:,\d{3})*(?:\.\d+)?(?!\d)', text)
print(nums)
I need to return the output as a list of floats, with commas between but no speech marks.
Eg.
extract_numbers("1, 2, 3, un pasito pa'lante Maria")
is [1.0, 2.0, 3.0]
Unfortunately, I have not yet been successful in my attempts. Currently, my code reads
def extract_numbers(text):
    nums = re.findall(r'\b\d{1,3}(?:,\d{3})*(?:\.\d+)?(?!\d)', text)
    return "[{0}]".format(', '.join(map(str, nums)))
extract_numbers(TEXT_SAMPLE)
You may try doing a regex re.findall search on the following pattern:
\b\d{1,3}(?:,\d{3})*(?:\.\d+)?(?!\d)
Sample script:
import re
text = "30feet is about 10metre but that's 1 rough estimate several numbers are like 2, 137, and 40 or something big numbers are like 2,137,040 or something"
nums = re.findall(r'\b\d{1,3}(?:,\d{3})*(?:\.\d+)?(?!\d)', text)
print(nums)
This prints:
['30', '10', '1', '2', '137', '40', '2,137,040']
Here is an explanation of the regex pattern:
\b            word boundary
\d{1,3}       match 1 to 3 leading digits
(?:,\d{3})*   followed by zero or more thousands terms
(?:\.\d+)?    match an optional decimal component
(?!\d)        assert the "end" of the number by checking that no digit follows
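To get the list of floats the question asks for, one option is to strip the thousands commas from each match and pass it to float. This is a sketch built on the pattern above; the module-level name `NUMBER` is an assumption for the example.

```python
import re

# Same pattern as above: 1-3 digits, optional ",ddd" groups, optional decimal part.
NUMBER = re.compile(r'\b\d{1,3}(?:,\d{3})*(?:\.\d+)?(?!\d)')

def extract_numbers(text):
    """Return every number found in text as a float, ignoring thousands commas."""
    return [float(n.replace(',', '')) for n in NUMBER.findall(text)]

print(extract_numbers("1, 2, 3, un pasito pa'lante Maria"))
```

Returning the list itself, instead of formatting it into a string, is what removes the quotation marks from the output.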
Create a pattern with a character class [] that accepts both digits and commas.
Code:
import re
text = "30feet is about 10metre but that's 1 rough estimate several numbers are like 2, 137, and 40 or something big numbers are like 2,137,040 or something"
out = [
    int(match.replace(',', ''))
    for match in re.findall(r'[\d,]+', text)
]
print(out)
Output
[30, 10, 1, 2, 137, 40, 2137040]
You need to match the commas as well, then strip them before converting each match to an integer:
list(map(lambda n: int(n.replace(',', '')), re.findall(r'[\d,]+', text)))
A list comprehension does the same thing and is easier to read than map with a lambda:
[int(n.replace(',', '')) for n in re.findall(r'[\d,]+', text)]
Note that [\d,]+ also matches a comma that has no digit next to it, and int('') raises ValueError, so this assumes every comma in the text is attached to a number.
Why not use this?
array = re.findall(r'[0-9]+', text)

Removing whitespaces in a txt line and adding the values to a list

I have the following line of text from a .txt file
0.14999999999999999 0.20000000000000001 0.29999999999999999 0.34999999999999998 0.50000000000000000 0.59999999999999998 0.69999999999999996 0.72999999999999998 0.84999999999999998 0.90000000000000002 \n
I've been facing a problem on how to make this first line into a list. I've tried line.strip() but that only took care of the first and last spaces. There are still spaces left that I couldn't get rid of. As seen below:
'0.14999999999999999 0.20000000000000001 0.29999999999999999 0.34999999999999998 0.50000000000000000 0.59999999999999998 0.69999999999999996 0.72999999999999998 0.84999999999999998 0.90000000000000002'
I also can't just replace all " " with "" as all numbers would get crumpled together. I also can't assume the number of whitespaces, nor the number of spaces before the ".", as I may have numbers greater than or equal to 10 down the line.
Use re.split. Here is an example:
a = '0.14999999999999999 0.20000000000000001 0.29999999999999999 0.34999999999999998 0.50000000000000000 0.59999999999999998 0.69999999999999996 0.72999999999999998 0.84999999999999998 0.90000000000000002'
import re
output = re.split(' +', a)
The output is:
['0.14999999999999999',
'0.20000000000000001',
'0.29999999999999999',
'0.34999999999999998',
'0.50000000000000000',
'0.59999999999999998',
'0.69999999999999996',
'0.72999999999999998',
'0.84999999999999998',
'0.90000000000000002']
To convert every element of the output to a float, use map:
output = list(map(float, output))
You don't even need re. Just use split on the line.
' 1 2 3 4 5 '.split()
['1', '2', '3', '4', '5']
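Combining the two ideas above, str.split() with no arguments splits on any run of whitespace and drops the leading and trailing blanks, so the tokens can go straight to float. A short sketch, with a made-up `line` that mixes spacing widths and includes a value of 10 or more:

```python
# A sample line with irregular spacing and a trailing newline.
line = ' 0.14999999999999999  0.20000000000000001   10.5 \n'

# split() with no separator handles any amount of whitespace between values.
values = [float(token) for token in line.split()]
print(values)
```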

Regex Extract specific data between specific strings in python

Using regex in python 3.6.3 I am trying to extract scientific notation numbers associated with a specific start text and end text. From the following sample data:
Not_the_data : REAL[10] (randomtext := doesntapply) := [1.00000000e+000,-2.00000000e000,3.00000000e+000,4.00000000e+000,5.00000000e+000,6.00000000e+000
,7.00000000e+000,8.00000000e-000,9.00000000e+000,1.00000000e+001,1.10000000e+001];
This_data : REAL[2,27] (RADIX := Float) := [3.45982254e-001,9.80374157e-001,8.29904616e-001,1.57800000e+002,4.48320538e-001,6.20533180e+001
,1.80081348e+003,-8.93283653e+000,5.25826037e-001,2.16974407e-001,1.17304848e+002,6.82604387e-002
,3.76116596e-002,6.82604387e-002,3.76116596e-002];
Not_it_either : REAL[72] (randomtext := doesntapply) := [0.00000000e+000,-0.00000000e000,0.00000000e+000,0.00000000e+000,0.00000000e+000,0.00000000e+000];
I would want only the data in the "This_data" set:
['3.45982254e-001','9.80374157e-001','8.29904616e-001','1.57800000e+002','4.48320538e-001','6.20533180e+001','1.80081348e+003','-8.93283653e+000','5.25826037e-001','2.16974407e-001','1.17304848e+002','6.82604387e-002','3.76116596e-002','6.82604387e-002','3.76116596e-002']
If I don't use the lookaround functions I can get all the numbers that match the scientific notation easily like this:
values = re.findall('(-?[0-9]+.[0-9]+e[+-][0-9]+)',_DATA_,re.DOTALL|re.MULTILINE)
But as soon as I add a lookbehind:
values = re.findall('(?<=This_data).*?(-?[0-9]+.[0-9]+e[+-][0-9]+)+',_DATA_,re.DOTALL|re.MULTILINE)
all but the first number in the desired set drop off. I have attempted multiple iterations of this using positive and negative lookahead and lookbehind on debugex to no avail.
My source file is 50k+ lines and the data set desired is 10-11k lines. Ideally I would like to capture my data set in one read through of my file.
How can I correctly use a lookahead or lookbehind function to limit my data capture to numbers that meet the format but only from the desired "This_Data" set?
Any help is appreciated!
You might have an easier time parsing the file one line at a time, skipping lines that don't meet the criteria. It looks like each record ends with a semicolon, so you can use that as the signal to stop parsing.
import re
PARSING = False
out = []
with open('path/to/file.data') as fp:
    for line in fp:
        if line.startswith('This_data'):
            PARSING = True
        if PARSING:
            # extend (not append) keeps out as one flat list of strings
            out.extend(re.findall(r'(-?[0-9]+\.[0-9]+e[+-][0-9]+)', line))
            # a semicolon marks the end of the record, so stop parsing
            if ';' in line:
                PARSING = False
# show the results:
print(out)
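If the whole file fits in memory, another option is to isolate the This_data record first and only then search it for numbers. The lookbehind in the question finds a single value because a lookbehind only checks the text immediately before each match, and that is true only for the first one. The sketch below is an assumption-laden alternative, not the answer's method: `extract_block_values` is a hypothetical helper, the sample `data` is shortened, and the exponent sign is treated as optional.

```python
import re

# Scientific-notation number; the sign of the exponent is treated as optional here.
NUMBER = re.compile(r'-?\d+\.\d+e[+-]?\d+')

def extract_block_values(data, name):
    """Return the numbers in the record that starts with name and ends at the next ';'."""
    block = re.search(r'^' + re.escape(name) + r'\b.*?;', data, re.DOTALL | re.MULTILINE)
    if block is None:
        return []
    return NUMBER.findall(block.group())

# Shortened stand-in for the file contents.
data = (
    "Not_the_data : REAL[2] (randomtext := doesntapply) := [1.00000000e+000,-2.00000000e000\n"
    ",3.00000000e+000];\n"
    "This_data : REAL[2,27] (RADIX := Float) := [3.45982254e-001,9.80374157e-001\n"
    ",-8.93283653e+000];\n"
    "Not_it_either : REAL[72] (randomtext := doesntapply) := [0.00000000e+000];\n"
)
values = extract_block_values(data, 'This_data')
print(values)
```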

Define string between comma x and comma y then split all bytes using a comma

I have some data that I am parsing which takes the following format:
8344,5354,Binh Duong,1,0103313333333033133331,1,13333333331,1,00313330133
,8344,7633,TT Ha Noi,2,3330333113333303111303,3,33133331133,2,30333133010
....more data.....
The first record does not start with a comma, but all subsequent rows of data do. I want to take all the numbers between the 4th and 5th comma on the first line and 5th and 6th comma on all other lines and split this string using commas.
So in the above example '0103313333333033133331' should print as '0,1,0,3,3,1,3,3,3,3,3,3,3,0,3,3,1,3,3,3,3,1'. The difficulty is that the length of the string between comma x and y varies depending on what data I am parsing. I've used regex's to isolate the string in question provided it has 16 digits in it, however this is not the case in all items I might be parsing.
As a result using a .format() method with 16 instances of '{},' threw up a tuple index error on items where the string was not 16 bytes long.
Can anyone suggest a method of achieving what I want?
Thanks
I would use str.split() to get the correct field, and str.join() to put a comma between its characters:
with open('xx.in') as input_file:
    for line in input_file:
        # drop the newline and any leading comma so the field index is the same on every row
        line = line.strip().strip(',')
        fields = line.split(',')
        field = fields[4]
        print(','.join(field))
A slightly different approach with a regex that grabs the 5th element of the comma separated line from the end:
>>> import re
>>> lines = ['8344,5354,Binh Duong,1,0103313333333033133331,1,13333333331,1,00313330133',',8344,7633,TT Ha Noi,2,3330333113333303111303,3,33133331133,2,30333133010']
>>> for line in lines:
... num = re.search(r'\d+(?=(?:,[^,]+){4}$)', line).group()
... seq = ','.join(list(num))
... print(seq)
...
0,1,0,3,3,1,3,3,3,3,3,3,3,0,3,3,1,3,3,3,3,1
3,3,3,0,3,3,3,1,1,3,3,3,3,3,0,3,1,1,1,3,0,3
You can use this regex:
^,?\d+,\d+,[\w\s]+,\d+,(\d+)
MATCH 1
1. [23-45] `0103313333333033133331`
MATCH 2
1. [97-119] `3330333113333303111303`
Then you can insert a comma after each digit of the captured group by substituting on (\d):
p = re.compile(r'(\d)')
test_str = "0103313333333033133331"
subst = r"\1,"  # raw string, so \1 is a backreference and not the character \x01
result = re.sub(p, subst, test_str)
>> 0,1,0,3,3,1,3,3,3,3,3,3,3,0,3,3,1,3,3,3,3,1,
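Putting the regex above together with str.join gives a version with no trailing comma. This is a sketch under the assumption that every row has the layout shown in the question; `spread_digits` is a hypothetical name.

```python
import re

# Optional leading comma, four fields, then the digit field we want in group 1.
FIELD = re.compile(r'^,?\d+,\d+,[\w\s]+,\d+,(\d+)')

def spread_digits(line):
    """Return the digit field with a comma between its digits, or None if the line does not match."""
    match = FIELD.match(line)
    if match is None:
        return None
    return ','.join(match.group(1))

rows = [
    '8344,5354,Binh Duong,1,0103313333333033133331,1,13333333331,1,00313330133',
    ',8344,7633,TT Ha Noi,2,3330333113333303111303,3,33133331133,2,30333133010',
]
for row in rows:
    print(spread_digits(row))
```

Because str.join iterates over the characters of the string, the length of the field does not matter.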

How to substitute specific patterns in python

I want to replace all occurrences of integers which are greater than 2147483647 and are followed by ^^<int> by the first 3 digits of the numbers. For example, I have my original data as:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
I want to replace the original data by the below mentioned data:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "255"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
The way I have implemented this is by scanning the data line by line. If I find numbers greater than 2147483647, I replace them with their first 3 digits. However, I don't know how I should check that the next part of the string is ^^<int>.
What I want to do is: for numbers greater than 2147483647 e.g. 25500000000, I want to replace them with the first 3 digits of the number. Since my data is 1 Terabyte in size, a faster solution is much appreciated.
Use the re module to construct a regular expression:
import re
regex = r"""
(                 # Capture in group #1
    "[\w\s]+"     # Three sequences of quoted letters and whitespace characters,
    \s+           # each followed by one or more whitespace characters
    "[\w\s]+"
    \s+
    "[\w\s]+"
    \s+
)
"(\d{10,})"       # Match a quoted run of at least 10 digits into group #2
(\^\^\s+\.\s+)    # Followed by two escaped circumflex characters, whitespace and a period
                  # into group #3
(.*)              # Followed by anything at all into group #4
"""
COMPILED_REGEX = re.compile(regex, re.VERBOSE)
Next, we need to define a callback function (the compiled pattern's sub method accepts a function as the replacement) to handle the replacement:
def replace_callback(matches):
    full_line = matches.group(0)
    number_text = matches.group(2)
    number_of_interest = int(number_text, base=10)
    if number_of_interest > 2147483647:
        # str.replace needs string arguments, so pass the digit text, not the int
        return full_line.replace(number_text, number_text[:3])
    else:
        return full_line
And then find and replace:
fixed_data = COMPILED_REGEX.sub(replace_callback, YOUR_DATA)
If you have a terabyte of data you will probably not want to do this in memory. You'll want to open the file and iterate over it, replacing the data line by line and writing it out to another file (there are undoubtedly ways to speed this up, but they will make the gist of the technique harder to follow):
# Given the above
def process_data():
    with open("path/to/your/file") as data_file, \
            open("path/to/output/file", "w") as output_file:
        for line in data_file:
            fixed_data = COMPILED_REGEX.sub(replace_callback, line)
            output_file.write(fixed_data)
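For the sample data as shown in the question, where the number is followed directly by ^^<int>, a narrower pattern with a lookahead may be enough. The sketch below is an assumption about the data layout, not the regex from the answer above; `PATTERN`, `shorten` and `fix_line` are hypothetical names.

```python
import re

INT_MAX = 2147483647
# A quoted run of digits that is immediately followed by ^^<int> (checked, not consumed).
PATTERN = re.compile(r'"(\d+)"(?=\^\^<int>)')

def shorten(match):
    """Replacement callback: keep the first 3 digits when the number exceeds INT_MAX."""
    digits = match.group(1)
    if int(digits) > INT_MAX:
        return '"' + digits[:3] + '"'
    return match.group(0)

def fix_line(line):
    """Apply the replacement to every matching number on one line."""
    return PATTERN.sub(shorten, line)

print(fix_line('"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .'))
```

Numbers that are not followed by ^^<int>, or that fit in 32 bits, are left untouched, and `fix_line` can be called from the same line-by-line loop as `process_data`.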
If each line in your text file looks like your example, then you can do this:
In [2078]: line = '"QuestionAndAnsweringWebsite" "fact". "Ask a Question" "25500000000"^^ . "language" "89028899"'
In [2079]: re.findall('\d+"\^\^', line)
Out[2079]: ['25500000000"^^']
with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        for found in re.findall(r'\d+"\^\^', line):
            # found looks like 25500000000"^^ so found[:-3] is the digits alone
            if int(found[:-3]) > 2147483647:
                # keep the first 3 digits and put the trailing "^^ back
                line = line.replace(found, found[:3] + found[-3:])
        outfile.write(line)
Because of the inner for-loop, this has the potential to be an inefficient solution. However, I can't think of a better regex at the moment, so this should get you started, at the very least.
