Extract Numbers from Text File excluding Dates - python

I have simple code which extracts numbers from a text file. It looks like this:
import re
codata = []
with open(r"filename.txt") as file:
    for line in file:
        for i in re.findall(r'\d+(?:\.\d+)?', line):
            codata.append(i)
The text contains a lot of financial data and also a lot of dates, which I don't want. Is there an easy way to modify the code to exclude dates? The dates generally follow these formats (I'm using a specific date as an example of each format, but it can be any date):
August 31, 2018
8/31/2018
8/31/18
August 2018
FY2018
CY2018
fiscal year 2018
calendar year 2018
Here is an example. I have a text file with the following text:
"For purposes of the financial analyses described in this section, the term “implied merger consideration” refers to the implied value of the per share consideration provided for in the transaction of $80.38 consisting of the cash portion of the consideration of $20.25 and the implied value of the stock portion of the consideration of 0.275 shares of XXX common stock based on XXX’s closing stock price of $218.67 per share on July 14, 2018."
When I run my code I posted above, I get this output from print(codata):
['80.38', '20.25', '0.275', '218.67', '14', '2018']
I would like to get this output instead:
['80.38', '20.25', '0.275', '218.67']
So I don't want to pick up the numbers 14 and 2018 associated with the date "July 14, 2018". If I know that any numbers related to dates within the text would have the formats that I outlined above, how should I modify my code to get the desired output?

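It's hard to understand exactly what you want, but if you are just looking for the numbers as numeric values, you can do this. Note that int() raises ValueError on a string such as '80.38', so float() is used here to cover both integers and decimals.
import re
codata = []
with open(r"filename.txt") as file:
    for line in file:
        for i in re.findall(r'\d+(?:\.\d+)?', line):
            # every match is digits with an optional decimal part, so float() always succeeds
            codata.append(float(i))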

Here's a regex that will match and remove your current set of dates:
import re
p = r"(((january|february|march|april|may|june|july|august|september|october|november|december) +[\d, ]+)|" + \
    r"((\d?\d\/){2}(\d\d){1,2})|" + r"((fiscal year|fy|calendar year|cy) *(\d\d){1,2}))"
codata = []
with open(r"filename.txt") as file:
    for line in file:
        codata.append(re.sub(p, "", line, flags=re.IGNORECASE))
print(codata)
Output (assuming input file is the same as your provided date list):
['\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n']
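To get from "dates removed" to the list of numbers the question asks for, the two steps can be chained: strip the dates first, then run the original number regex on what is left. This is only a minimal sketch; DATE_RE, NUMBER_RE and extract_numbers are names I made up, and the date pattern is my own tightened variant that assumes four-digit years after month names and FY/CY prefixes.

```python
import re

MONTHS = (r"(?:january|february|march|april|may|june|july|august|"
          r"september|october|november|december)")

# Hypothetical pattern covering only the date formats listed in the question.
DATE_RE = re.compile(
    rf"\b{MONTHS}\s+(?:\d{{1,2}},\s*)?\d{{4}}\b"          # August 31, 2018 / August 2018
    r"|\b\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})\b"               # 8/31/2018 / 8/31/18
    r"|\b(?:fiscal year|calendar year|fy|cy)\s*\d{4}\b",  # FY2018 / fiscal year 2018
    re.IGNORECASE,
)
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(text):
    """Blank out anything that looks like a date, then collect the remaining numbers."""
    return NUMBER_RE.findall(DATE_RE.sub(" ", text))


sample = ("consideration of $80.38 consisting of the cash portion of $20.25 "
          "and 0.275 shares based on a closing price of $218.67 per share "
          "on July 14, 2018.")
print(extract_numbers(sample))  # ['80.38', '20.25', '0.275', '218.67']
```

Replacing each date with a space rather than an empty string keeps neighbouring tokens from being glued together into a new number.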

Considering the text sample, I assume that every price starts with a $ sign. In that case, you are probably looking for the following regex:
r"(?<=\$)\d+\.?\d*(?= )"
the result would be:
['80.38', '20.25', '218.67']
Or in case you want the $ sign in your list the regex would be:
r"\$\d+\.?\d*(?= )"
and the result in that case:
['$80.38', '$20.25', '$218.67']
To clarify, (?<=\$) means that the match must be preceded by a $ sign, but the $ sign itself is not included in the output. (?= ) means that the price must be followed by a space.
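One caveat with the trailing (?= ) is that a price followed by a comma or a full stop is skipped. As a rough sketch (PRICE_RE is a hypothetical name, and the sample sentence is a shortened stand-in for the question's text), the lookbehind can be kept while the lookahead is dropped:

```python
import re

# Lookbehind requires a '$' right before the digits but does not consume it.
PRICE_RE = re.compile(r"(?<=\$)\d+(?:\.\d+)?")

sample = "consideration of $80.38, cash of $20.25 and 0.275 shares at $218.67."
prices = PRICE_RE.findall(sample)
print(prices)  # ['80.38', '20.25', '218.67']
```

The share ratio 0.275 is still not captured, because it has no $ in front of it; this approach only fits if dollar amounts are all you need.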

Related

Extract date from a string with a lot of numbers

There seem to be quite a few ways to extract datetimes in various formats from a string, but there seems to be an issue when the string contains many numbers and symbols.
Here is an example:
t = 'Annual Transmission Revenue Requirements and Rates Transmission Owner (Transmission Zone) Annual Transmission Revenue Requirement Network Integration Transmission Service Rate ($/MW-Year) AE (AECO) $136,632,319 $53,775 AEP (AEP) $1,295,660,732 $59,818.14 AP (APS) $128,000,000 $17,895 ATSI (ATSI) $659,094,666 $54,689.39 BC (BGE) $230,595,535 $35,762 ComEd, Rochelle (CE) $702,431,433 $34,515.60 Dayton (DAY) $40,100,000 $13,295.76 Duke (DEOK) $121,250,903 $24,077 Duquesne (DLCO) $139,341,808 $51,954.44 Dominion (DOM) $1,031,382,000 $52,457.21 DPL, ODEC (DPL) $163,224,128 $42,812 East Kentucky Power Cooperative (EKPC) $83,267,903 $24,441 MAIT (METED, PENELEC) $150,858,703 $26,069.39 JCPL $135,000,000 $23,597.27 PE (PECO) $155,439,100 $19,093 PPL, AECoop, UGI (PPL) $435,349,329 $58,865 PEPCO, SMECO (PEPCO) $190,876,083 $31,304.21 PS (PSEG) $1,248,819,352 $130,535.22 Rockland (RECO) $17,724,263 $44,799 TrAILCo $226,652,117.80 n/a Effective June 1, 2018 '
import datefinder
m = datefinder.find_dates(t)
for match in m:
    print(match)
Is there a way to smoothly extract the date? I can resort to re for specific formats if no better way exists. Judging by datefinder's GitHub page, it seems that it was abandoned a year ago.
Although I don't know exactly how your dates are formatted, here's a regex solution that works with dates separated by '/'. It should work whether the month and day are written as a single digit or with a leading zero.
If your dates are separated by hyphens instead, replace each / in the regex with a hyphen.
Edit: Added the second print statement with a stricter regex. That's probably the better way to go.
import re
mystring = r'joasidj9238nlsd93901/01/2021oijweo8939n'
print(re.findall(r'\d{1,2}/\d{1,2}/\d{2,4}', mystring))  # This would probably work in most cases
print(re.findall(r'(?:0?[1-9]|1[0-2])/(?:0?[1-9]|[12]\d|3[01])/\d{2,4}', mystring))  # Stricter: month must be 1-12 and day 1-31
Edit #2: Here's a way to do it with the month name spelled out (in full, or 3-character abbreviation), followed by day, followed by comma, followed by a 2 or 4 digit year.
import re
mystring = r'Jan 1, 2020'
print(re.findall(r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{1,2},\s+\d{2,4}', mystring))  # Oct(ober) was missing from the original list
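For the kind of text in the question (lots of dollar amounts, one spelled-out date), a sketch along these lines finds the date and turns it into a datetime. MONTH_DATE_RE and find_month_dates are hypothetical names; it assumes full English month names, a four-digit year, and the default C locale for %B.

```python
import re
from datetime import datetime

# Full month name, day, comma, four-digit year, e.g. "June 1, 2018".
MONTH_DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
)


def find_month_dates(text):
    """Return a datetime for every 'Month D, YYYY' occurrence in text."""
    return [datetime.strptime(s, "%B %d, %Y") for s in MONTH_DATE_RE.findall(text)]


t = "Rockland (RECO) $17,724,263 $44,799 TrAILCo $226,652,117.80 n/a Effective June 1, 2018 "
print(find_month_dates(t))  # [datetime.datetime(2018, 6, 1, 0, 0)]
```

Because the pattern insists on a month name, amounts such as $226,652,117.80 are never mistaken for dates.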

Str.find() unable to search for '\n'

I am trying to extract text between two keywords using str.find(). But it fails to find the occurrence of '\n'
text = 'Cardiff, the 6th November 2007\n company \n'
String_to_extract = '6th November 2007'
keywords = {'date': ['Cardiff, the ' , '\n']}
Code:
text2=text[text.find(keywords['date']0])+len(keywords[0]):text.find(keywords['date'][1])]
print(text2)
str.find() seems unable to find '\n', which results in no output.
P.S. I want to use the str.find() method only.
There are several problems here:
In the keywords dictionary you use a date variable as the key, which should be a string: 'date'.
In the keywords dictionary you doubly escaped \\n, while you don't do this in the text variable.
In the index calculations you use a key variable that is defined nowhere; this should be the 'date' key defined in the keywords dictionary.
And finally, for the first index you calculate the starting position of the keyword, while it should be its ending position.
Try this:
# String to be extracted = '6th November 2007'
text = 'Cardiff, the 6th November 2007\n\n \n\n'
keywords = {'date' : ['Cardiff, the ' , '\n\n']}
a = text.find(keywords['date'][0]) + len(keywords['date'][0])
b = text.find(keywords['date'][1])
text2 = text[a:b]
print(text2)
You've calculated the first index incorrectly. Try this:
text = 'Cardiff, the 6th November 2007\n\n company \n\n'
keywords = ['Cardiff, the ', '\n']
result = text[text.find(keywords[0])+len(keywords[0]):text.find(keywords[1])]
Output:
6th November 2007
To generalize the answer, use this code:
text2 = text[text.find(keywords[key][0])+len(keywords[key][0]):text.find(keywords[key][1])] # you can replace the key with whatever you want as keys
This is a really interesting question, and it goes to show how something trivial can become hard to spot when calls are chained. Let's see what's happening in your code. You say that your code can't seem to find the first occurrence; I would argue the opposite: it definitely finds it. In the text 'Cardiff, the 6th November 2007\n\n \n\n' you are looking for the first occurrence of 'Cardiff, the '. That substring starts at index 0, i.e. text[0], so text[text.find(keywords[key][0]):text.find(keywords[key][1])] essentially becomes text[0:text.find(keywords[key][1])]. In Python slicing the start index is inclusive, so you get Cardiff, the 6th November 2007 as output and conclude that it did not find the first occurrence. To fix it, start the slice just after 'Cardiff, the ' by adding the length of that keyword to the start index:
text2 = text[text.find(keywords[key][0])+len(keywords[key][0]):text.find(keywords[key][1])]
There are other ways to achieve what you want, but this is what you were trying to do originally.
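If this is needed in more than one place, the slicing can be wrapped in a small helper that still uses only str.find(). This is just a sketch (extract_between is a name I made up); it also starts the search for the end marker after the start marker, and returns None when either marker is missing, since find() returns -1 in that case.

```python
def extract_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    i = text.find(start)
    if i == -1:
        return None            # start marker not present
    i += len(start)            # move past the start marker
    j = text.find(end, i)      # only look for the end marker after it
    if j == -1:
        return None            # end marker not present
    return text[i:j]


text = 'Cardiff, the 6th November 2007\n company \n'
print(extract_between(text, 'Cardiff, the ', '\n'))  # 6th November 2007
```

Passing the start offset to the second find() matters when the end marker (here '\n') can also occur before the start marker.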

How do I grab specific text in between other text?

I need help grabbing just K334-76A9 from this string:
b'\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n Volume in drive C has no label.\r\n Volume Serial Number is K334-76A9\r\n
Please help, I have tried so many things but none have worked.
Sorry if my question is bad :/
If you want to find the format xxxx-xxxx, no matter what string you have you can do it like this:
import re
b = '\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n Volume in drive C has no label.\r\n Volume Serial Number is K334-76A9\r\n'
splitString = b.split()
r = re.compile(r'.{4}-.{4}')
for string in splitString:
    if r.match(string):
        print(string)
Output:
K334-76A9
Here's code that grabs everything after "Serial Number is " up to the next whitespace character.
import re
data = b'\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n Volume in drive C has no label.\r\n Volume Serial Number is K334-76A9\r\n'
pat = re.compile(r"Serial Number is ([^\s]+)")
match = pat.search(data.decode("ASCII"))
if match:
    print(match.group(1))
Result:
K334-76A9
You can adjust the regular expression per your needs. Regular expressions are Da Bomb! This one's really simple, but you can do amazingly complex things with them.
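As one example of such an adjustment, the two ideas above (anchoring on the label and checking the xxxx-xxxx shape) can be combined and applied directly to the bytes. This is a sketch only: SERIAL_RE and find_serial are hypothetical names, and it assumes the serial is two groups of four ASCII letters or digits.

```python
import re

# Bytes pattern, so the raw command output can be searched without decoding first.
SERIAL_RE = re.compile(rb"Serial Number is\s+([0-9A-Za-z]{4}-[0-9A-Za-z]{4})\b")


def find_serial(raw):
    """Return the volume serial as a str, or None if the label is not found."""
    m = SERIAL_RE.search(raw)
    return m.group(1).decode("ascii") if m else None


data = (b'\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n Volume in drive C '
        b'has no label.\r\n Volume Serial Number is K334-76A9\r\n')
print(find_serial(data))  # K334-76A9
```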

get text in div after specific character [xpath]

I am using XPath to scrape a website, and I've been able to access most of the information I need except for the date. The date is text in a div, formatted as shown below.
October 13, 2018 / 1:31 AM / Updated 5 hours ago
I just want to get the date, not the time or anything else. However, with my current code, I am getting the entire text in the div. My code is below.
item['datePublished'] = response.xpath("//div[contains(@class, 'ArticleHeader_date') and substring-before(., '/')]/text()").extract()
As hinted, there are ways to do this in XPath 2.0+. However, this should be done in the host language.
One way is to extract the date using a regex after the value has been retrieved, e.g.
\w+\ \d\d?,\ \d{4}
Code Sample:
import re
regex = r"\w+\ \d\d?,\ \d{4}"
test_str = "October 13, 2018 / 1:31 AM / Updated 5 hours ago"
matches = re.search(regex, test_str)
if matches:
    print(matches.group())
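If the header always has the "date / time / updated" layout from the question, plain string splitting in the host language is another option. A minimal sketch, where date_part is a made-up helper name and the layout is an assumption (it would not work if the date itself contained a slash):

```python
from datetime import datetime


def date_part(header_text):
    """Return the text before the first '/' with surrounding whitespace removed."""
    return header_text.split("/", 1)[0].strip()


raw = "October 13, 2018 / 1:31 AM / Updated 5 hours ago"
date_str = date_part(raw)
parsed = datetime.strptime(date_str, "%B %d, %Y")  # assumes English month names
print(date_str)       # October 13, 2018
print(parsed.date())  # 2018-10-13
```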

Replace word between two substrings (keeping other words)

I'm trying to replace a word (e.g. on) if it falls between two substrings (e.g. <temp> & </temp>) however other words are present which need to be kept.
string = "<temp>The sale happened on February 22nd</temp>"
The desired string after the replace would be:
Result = <temp>The sale happened {replace} February 22nd</temp>
I've tried using regex, but I've only been able to figure out how to replace everything lying between the two <temp> tags (because of the .*?).
result = re.sub('<temp>.*?</temp>', '{replace}', string, flags=re.DOTALL)
However, "on" may also appear later in the string, outside <temp></temp>, and I wouldn't want to replace it there.
re.sub('(<temp>.*?) on (.*?</temp>)', lambda x: x.group(1)+" <replace> "+x.group(2), string, flags=re.DOTALL)
Output:
<temp>The sale happened <replace> February 22nd</temp>
Edit:
Changed the regex based on suggestions by Wiktor and HolyDanna.
P.S: Wiktor's comment on the question provides a better solution.
Try lxml:
from lxml import etree
root = etree.fromstring("<temp>The sale happened on February 22nd</temp>")
root.text = root.text.replace(" on ", " {replace} ")
print(etree.tostring(root, pretty_print=True).decode())  # tostring() returns bytes, so decode before printing
Output:
<temp>The sale happened {replace} February 22nd</temp>
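If you'd rather stay with re, a nested substitution is one more way to do it: the outer pattern finds each <temp>...</temp> block, and a function replacement rewrites the word only inside that block. This is a sketch under the assumption that the tags are not nested; TEMP_BLOCK_RE and replace_in_temp are hypothetical names.

```python
import re

TEMP_BLOCK_RE = re.compile(r"<temp>.*?</temp>", re.DOTALL)


def replace_in_temp(text, word, replacement):
    """Replace whole-word occurrences of `word`, but only inside <temp> blocks."""
    word_re = re.compile(rf"\b{re.escape(word)}\b")
    # Function replacements insert `replacement` literally (no backslash processing).
    return TEMP_BLOCK_RE.sub(
        lambda block: word_re.sub(lambda _: replacement, block.group(0)),
        text,
    )


string = "It went on. <temp>The sale happened on February 22nd</temp> Later on."
print(replace_in_temp(string, "on", "{replace}"))
# It went on. <temp>The sale happened {replace} February 22nd</temp> Later on.
```

Unlike a single pattern with " on " in the middle, this handles several occurrences inside one block. Note that the word is also matched against the tag text itself, so it should not be "temp".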
