I'm trying to replace a word (e.g. on) when it falls between two substrings (e.g. <temp> and </temp>); however, the other words between them need to be kept.
string = "<temp>The sale happened on February 22nd</temp>"
The desired string after the replace would be:
Result = <temp>The sale happened {replace} February 22nd</temp>
I've tried using regex, but I've only been able to figure out how to replace everything between the two <temp> tags (because of the .*?).
result = re.sub('<temp>.*?</temp>', '{replace}', string, flags=re.DOTALL)
However, on may appear later in the string, outside <temp></temp>, and I wouldn't want to replace it there.
re.sub('(<temp>.*?) on (.*?</temp>)', lambda x: x.group(1)+" {replace} "+x.group(2), string, flags=re.DOTALL)
Output:
<temp>The sale happened {replace} February 22nd</temp>
Edit:
Changed the regex based on suggestions by Wiktor and HolyDanna.
P.S: Wiktor's comment on the question provides a better solution.
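If on can occur more than once inside a block, one possible sketch (not taken from the answers here) is to match each <temp>...</temp> block first and then run a second, word-bounded substitution only on the matched text. The function name replace_inside_temp and the sample string are made up for illustration.

```python
import re

# Matches one <temp>...</temp> block at a time (non-greedy).
TEMP_BLOCK = re.compile(r"<temp>.*?</temp>", flags=re.DOTALL)


def replace_inside_temp(text, word="on", replacement="{replace}"):
    """Replace whole-word occurrences of `word`, but only inside <temp> blocks."""
    word_pattern = re.compile(r"\b" + re.escape(word) + r"\b")

    def fix_block(match):
        # A callable replacement avoids any backslash handling in `replacement`.
        return word_pattern.sub(lambda _m: replacement, match.group(0))

    return TEMP_BLOCK.sub(fix_block, text)


sample = "<temp>The sale happened on February 22nd</temp> and shipped on Monday"
print(replace_inside_temp(sample))
# <temp>The sale happened {replace} February 22nd</temp> and shipped on Monday
```

The second "on" is left alone because it lies outside the matched block.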
Try lxml:
from lxml import etree
root = etree.fromstring("<temp>The sale happened on February 22nd</temp>")
root.text = root.text.replace(" on ", " {replace} ")
print(etree.tostring(root, pretty_print=True, encoding="unicode"))  # encoding="unicode" returns str rather than bytes
Output:
<temp>The sale happened {replace} February 22nd</temp>
There seem to be quite a few ways to extract datetimes in various formats from a string, but there is an issue when the string contains many numbers and symbols.
Here is an example:
t = 'Annual Transmission Revenue Requirements and Rates Transmission Owner (Transmission Zone) Annual Transmission Revenue Requirement Network Integration Transmission Service Rate ($/MW-Year) AE (AECO) $136,632,319 $53,775 AEP (AEP) $1,295,660,732 $59,818.14 AP (APS) $128,000,000 $17,895 ATSI (ATSI) $659,094,666 $54,689.39 BC (BGE) $230,595,535 $35,762 ComEd, Rochelle (CE) $702,431,433 $34,515.60 Dayton (DAY) $40,100,000 $13,295.76 Duke (DEOK) $121,250,903 $24,077 Duquesne (DLCO) $139,341,808 $51,954.44 Dominion (DOM) $1,031,382,000 $52,457.21 DPL, ODEC (DPL) $163,224,128 $42,812 East Kentucky Power Cooperative (EKPC) $83,267,903 $24,441 MAIT (METED, PENELEC) $150,858,703 $26,069.39 JCPL $135,000,000 $23,597.27 PE (PECO) $155,439,100 $19,093 PPL, AECoop, UGI (PPL) $435,349,329 $58,865 PEPCO, SMECO (PEPCO) $190,876,083 $31,304.21 PS (PSEG) $1,248,819,352 $130,535.22 Rockland (RECO) $17,724,263 $44,799 TrAILCo $226,652,117.80 n/a Effective June 1, 2018 '
import datefinder
m = datefinder.find_dates(t)
for match in m:
    print(match)
Is there a way to smoothly extract the date? I can resort to re for specific formats if no better way exists. From datefinder's GitHub page, it seems to have been abandoned a year ago.
Although I don't know exactly how your dates are formatted, here's a regex solution that works with dates separated by '/'. It handles months and days written as a single digit or with a leading zero.
If your dates are separated by hyphens instead, replace each / in the regex with a hyphen.
Edit: Added the second print statement with a stricter regex. That's probably the better way to go.
import re
mystring = r'joasidj9238nlsd93901/01/2021oijweo8939n'
print(re.findall(r'\d{1,2}/\d{1,2}/\d{2,4}', mystring)) # This would probably work in most cases
print(re.findall(r'(?:0?[1-9]|1[0-2])/(?:0?[1-9]|[12]\d|3[01])/\d{2,4}', mystring)) # This one is probably a better solution: it limits the month to 1-12 and the day to 1-31. (More protection against weirdness.)
Edit #2: Here's a way to do it with the month name spelled out (in full, or 3-character abbreviation), followed by day, followed by comma, followed by a 2 or 4 digit year.
import re
mystring = r'Jan 1, 2020'
print(re.findall(r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{1,2}\,\s+\d{2,4}',mystring))
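To get actual datetime objects for text like the sample in the question, a minimal sketch could combine a month-name regex with datetime.strptime. It assumes full English month names, a four-digit year, and the default (English) locale for %B; find_long_dates is a hypothetical helper name.

```python
import re
from datetime import datetime

# Full month name, day, comma, four-digit year, e.g. "June 1, 2018".
LONG_DATE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
)


def find_long_dates(text):
    """Return datetime objects for every 'Month D, YYYY' date found in text."""
    dates = []
    for found in LONG_DATE.findall(text):
        try:
            dates.append(datetime.strptime(found, "%B %d, %Y"))
        except ValueError:
            # Looks like a date but is not valid, e.g. "February 31, 2018".
            continue
    return dates


tail = "TrAILCo $226,652,117.80 n/a Effective June 1, 2018 "
print(find_long_dates(tail))
# [datetime.datetime(2018, 6, 1, 0, 0)]
```

The dollar amounts are ignored because nothing is matched unless a month name comes first.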
I am trying to extract text between two keywords using str.find(). But it fails to find the occurrence of '\n'
text = 'Cardiff, the 6th November 2007\n company \n'
String_to_extract = '6th November 2007'
keywords = {'date': ['Cardiff, the ' , '\n']}
Code:
text2=text[text.find(keywords['date']0])+len(keywords[0]):text.find(keywords['date'][1])]
print(text2)
str.find() seems unable to find '\n', which results in no output.
P.S. I want to use the str.find() method only.
There are several problems here:
In the keywords dictionary you use a date variable that should be string: 'date'.
In the keywords dictionary you doubly escaped \\n, while you don't do this in the text variable.
In the index calculations you use a key variable that is defined nowhere; this should be the 'date' key defined in the keywords dictionary.
And finally, you calculate the starting position of the first index, while it should be the ending position.
Try this:
# String to be extracted = '6th November 2007'
text = 'Cardiff, the 6th November 2007\n\n \n\n'
keywords = {'date' : ['Cardiff, the ' , '\n\n']}
a = text.find(keywords['date'][0]) + len(keywords['date'][0])
b = text.find(keywords['date'][1])
text2 = text[a:b]
print(text2)
You've calculated the first index incorrectly. Try this:
text = 'Cardiff, the 6th November 2007\n\n company \n\n'
keywords = ['Cardiff, the ', '\n']
result = text[text.find(keywords[0])+len(keywords[0]):text.find(keywords[1])]
Output:
6th November 2007
To generalize the answer, use this code:
text2 = text[text.find(keywords[key][0])+len(keywords[key][0]):text.find(keywords[key][1])] # you can replace the key with whatever you want as keys
This is a really interesting question, and it goes to show how something trivial can become hard to spot when calls are chained. Let's see what's happening in your code. You say that your code can't seem to find the first occurrence; I would argue the opposite: it definitely finds it. In the text 'Cardiff, the 6th November 2007\n\n \n\n' you are looking for the first occurrence of 'Cardiff, the '. That substring starts at index 0, i.e. text[0], so text[text.find(keywords[key][0]):text.find(keywords[key][1])] essentially becomes text[0:text.find(keywords[key][1])]. In Python slicing the start index is inclusive, so you get Cardiff, the 6th November 2007 as output and conclude that the first occurrence was not found. To fix it, start the slice after 'Cardiff, the '. You can do this by changing the text2 assignment as follows:
text2 = text[text.find(keywords[key][0])+len(keywords[key][0]):text.find(keywords[key][1])]
There are other ways to achieve what you want, but this is what you were trying to do originally.
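The same idea can be wrapped in a small helper that still uses only str.find(). This is a sketch with a hypothetical name, extract_between. It also searches for the end marker starting after the start marker, which avoids an empty result when the end marker happens to occur earlier in the text.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either marker is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    # Move past the start marker so it is not included in the result.
    start += len(start_marker)
    # Search for the end marker only after the start marker.
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


text = 'Cardiff, the 6th November 2007\n company \n'
keywords = {'date': ['Cardiff, the ', '\n']}
print(extract_between(text, *keywords['date']))
# 6th November 2007
```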
I have simple code which extracts numbers from a text file. It looks like this:
import re
codata = []
with open(r"filename.txt") as file:
    for line in file:
        for i in re.findall(r'\d+(?:\.\d+)?', line):
            codata.append(i)
The text contains a lot of financial data and also a lot of dates which I don't want. Is there an easy way to modify the code to exclude dates? The dates generally follow these formats (I'm using a specific date as an example for the format but it can be any date):
August 31, 2018
8/31/2018
8/31/18
August 2018
FY2018
CY2018
fiscal year 2018
calendar year 2018
Here is an example. I have a text file with the following text:
"For purposes of the financial analyses described in this section, the term “implied merger consideration” refers to the implied value of the per share consideration provided for in the transaction of $80.38 consisting of the cash portion of the consideration of $20.25 and the implied value of the stock portion of the consideration of 0.275 shares of XXX common stock based on XXX’s closing stock price of $218.67 per share on July 14, 2018."
When I run my code I posted above, I get this output from print(codata):
['80.38', '20.25', '0.275', '218.67', '14', '2018']
I would like to get this output instead:
['80.38', '20.25', '0.275', '218.67']
So I don't want to pick up the numbers 14 and 2018 associated with the date "July 14, 2018". If I know that any numbers related to dates within the text would have the formats that I outlined above, how should I modify my code to get the desired output?
Hard to understand exactly what you want. But if you are just looking for numbers you can do this (and if it has a decimal, use float instead).
import re
codata = []
with open(r"filename.txt") as file:
    for line in file:
        for i in re.findall(r'\d+(?:\.\d+)?', line):
            try:
                codata.append(int(i))
            except ValueError:
                # int() rejects strings such as '80.38'; skip them
                continue
Here's a regex that will match and remove your current set of dates:
import re
p = r"(((january|february|march|april|may|june|july|august|september|october|november|december) +[\d, ]+)|" + \
r"((\d?\d\/){2}(\d\d){1,2})|" + r"((fiscal year|fy|calendar year|cy) *(\d\d){1,2}))"
codata = []
with open(r"filename.txt") as file:
    for line in file:
        codata.append(re.sub(p, "", line, flags=re.IGNORECASE))
print(codata)
Output (assuming input file is the same as your provided date list):
['\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n']
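To get the list of numbers the question asks for, the date removal can be followed by the original number regex. The sketch below is one way to combine the two steps. It uses its own, slightly stricter date pattern with word boundaries (an assumption, covering only the formats listed in the question), and extract_numbers is a hypothetical name.

```python
import re

MONTHS = (r"(?:january|february|march|april|may|june|july|august|"
          r"september|october|november|december)")

DATE_PATTERN = re.compile(
    r"\b" + MONTHS + r"\s+(?:\d{1,2},\s*)?\d{4}\b"          # August 31, 2018 / August 2018
    r"|\b\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})\b"                 # 8/31/2018 / 8/31/18
    r"|\b(?:fiscal year|calendar year|fy|cy)\s*\d{4}\b",    # FY2018 / fiscal year 2018
    flags=re.IGNORECASE,
)
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(text):
    """Blank out the known date formats, then collect the remaining numbers."""
    without_dates = DATE_PATTERN.sub(" ", text)
    return NUMBER_PATTERN.findall(without_dates)


sample = ("the transaction of $80.38 consisting of the cash portion of $20.25 "
          "and 0.275 shares based on a closing stock price of $218.67 per share "
          "on July 14, 2018.")
print(extract_numbers(sample))
# ['80.38', '20.25', '0.275', '218.67']
```

Replacing each date with a space rather than an empty string keeps neighbouring numbers from being glued together.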
Considering the text sample I assume that every price starts with $ sign, in that case you are probably looking for the following regex:
r"(?<=\$)\d+\.?\d*(?= )"
the result would be:
['80.38', '20.25', '218.67']
Or in case you want the $ sign in your list the regex would be:
r"\$\d+\.?\d*(?= )"
and the result in that case:
['$80.38', '$20.25', '$218.67']
To clarify, the (?<=\$) means that our match needs to be preceded by the $ sign, but the $ sign is not added to the output. (?= ) means that the price should be followed by a space.
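A short runnable comparison may help here. The first pattern is the one from this answer; the second is a hypothetical variant that drops the trailing-space requirement, so an amount at the end of a sentence is also picked up. The sample sentence is made up.

```python
import re

# Pattern from the answer: amount must be preceded by "$" and followed by a space.
FOLLOWED_BY_SPACE = re.compile(r"(?<=\$)\d+\.?\d*(?= )")
# Variant (assumption): only require the leading "$".
ANY_DOLLAR_AMOUNT = re.compile(r"(?<=\$)\d+(?:\.\d+)?")

sample = "cash of $20.25 and a closing price of $218.67."

print(FOLLOWED_BY_SPACE.findall(sample))
# ['20.25']
print(ANY_DOLLAR_AMOUNT.findall(sample))
# ['20.25', '218.67']
```

Neither pattern picks up values without a $ sign, such as the 0.275 shares in the question.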
import re
col4="""May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
b=re.findall(r'\sCiteSeerX',col4)
print(b)
I have to print "May god bless our families studied". I'm using Python regular expressions to extract the file name, but I'm only getting CiteSeerX as output. I'm doing this on a very large dataset, so I only want to use regular expressions; if there is any other efficient and faster way, please point it out.
I also want the last year, 2004, as output.
I'm new to regular expressions and I know that my implementation above is wrong, but I can't find a correct one. This is a very naive question. I'm sorry, and thank you in advance.
Here is an answer that doesn't use regex.
>>> s = "now is the time for all good men"
>>> s.find("all")
20
>>> s[:20]
'now is the time for '
>>>
If the structure of all your data is similar to the sample you provided, this should get you going:
import re
data = re.findall(r"(.*?) CiteSeerX.*(\d{4})$", col4)
if data:
    # we have a match; unpack the two capturing groups
    title, year = data[0]
    print(title, year)
else:
    print("Unable to parse the string")
# Output: May god bless our families studied. 2004
This snippet extracts everything before CiteSeerX as the title and the last 4 digits as the year (again, assuming that the structure is similar for all the data you have). The brackets mark the capturing groups for the parts that we are interested in.
Update:
For the case where there is metadata following the year of publication, use the following regular expression:
import re
YEAR = r"\d{4}"
DATE = r"\d\d\d\d-\d\d-\d\d"
def parse_citation(s):
    regex = r"(.*?) CiteSeerX\s+{date} {date} ({year}).*$".format(date=DATE, year=YEAR)
    data = re.findall(regex, s)
    if data:
        # we have a match; return the (title, year) tuple
        return data[0]
    else:
        return None
c1 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
c2 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004 application/pdf text http //citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.1483 http //www.biomedcentral.com/content/pdf/1471-2350-5-20.pdf en Metadata may be used without restrictions as long as the oai identifier remains attached to it."""
print(parse_citation(c1))
print(parse_citation(c2))
# Output:
# ('May god bless our families studied.', '2004')
# ('May god bless our families studied.', '2004')
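Since the question also asks about faster alternatives for a large dataset, here is a sketch of the same extraction with plain string methods only. It assumes the layout of the two samples: the title ends at " CiteSeerX" and the year is the first standalone four-digit token after it. split_citation is a hypothetical name, and whether it is actually faster would need to be measured on real data.

```python
def split_citation(line, marker=" CiteSeerX"):
    """Return (title, year) using str methods only, or None if the marker is missing."""
    position = line.find(marker)
    if position == -1:
        return None
    title = line[:position]
    # Everything after the marker, split on whitespace.
    fields = line[position + len(marker):].split()
    # Assumption: the year is the first token made of exactly four digits.
    year = next((f for f in fields if len(f) == 4 and f.isdigit()), None)
    return title, year


c1 = "May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"
print(split_citation(c1))
# ('May god bless our families studied.', '2004')
```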
I'm trying to get the "real" name of a movie from its name when you download it.
So for instance, I have
Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY
and would like to get
Star Wars Episode 4 A New Hope
So I'm using this regex:
.*?\d{1}?[ .a-zA-Z]*
which works fine, but only for a movie with a number, as in 'Iron Man 3' for example.
I'd like to be able to get movies like 'Interstellar' from
Interstellar.2014.1080p.BluRay.H264.AAC-RARBG
and I currently get
Interstellar 2
I tried several ways, and spent quite a lot of time on it already, but figured it wouldn't hurt asking you guys if you had any suggestion/idea/tip on how to do it...
Thanks a lot!
Given your examples and assuming you always download in 1080p (or know that field's value):
x = 'Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'
y = x.split('.')
print(" ".join(y[:y.index('1080p')-1]))
Forget the regex (for now anyway!) and work with the fixed field layout. Find a field you know (1080p) and remove the information you don't want (the year). Recombine the results and you get "Interstellar" and "Star Wars Episode 4 A New Hope".
The following regex would work (assuming the format is something like moviename.year.1080p.anything or moviename.year.720p.anything):
.*(?=.\d{4}.*\d{3,}p)
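In Python, this pattern could be applied roughly as follows. This is only a sketch: movie_title is a hypothetical helper, and it returns None when the name has no year followed by a resolution such as 1080p.

```python
import re

# Everything up to (but not including) ".<year>...<resolution>p".
RELEASE_PATTERN = re.compile(r".*(?=.\d{4}.*\d{3,}p)")


def movie_title(filename):
    """Return the title part of a release name with dots turned into spaces."""
    match = RELEASE_PATTERN.match(filename)
    if match is None:
        return None
    return match.group(0).replace(".", " ")


print(movie_title("Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"))
# Star Wars Episode 4 A New Hope
print(movie_title("Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"))
# Interstellar
```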
\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$
Try this with re.sub. See demo.
https://regex101.com/r/hR7tH4/10
import re
p = re.compile(r'\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$', re.MULTILINE)
test_str = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY\nInterstellar.2014.1080p.BluRay.H264.AAC-RARBG\nIron Man 3"
subst = " "
result = re.sub(p, subst, test_str)
Assuming there is always a four-digit year or a four-digit resolution in the movie's file name, a simple solution is to replace the unwanted parts matched by:
"(?:\.|\d{4,4}.+$)"
with a blank and strip() the result afterwards.
For example:
import re
test1 = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"
test2 = "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"
res1 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test1).strip()
res2 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test2).strip()
print(res1, res2, sep='\n')
>>> Star Wars Episode 4 A New Hope
>>> Interstellar