I would like to extract a specific portion from a text.
For example, I have this text:
"*Lorem ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrum exercitationem ullamco laboriosam, nisi ut aliquid ex ea commodi consequatur.
Duis aute irure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum*",
I would like to extract the content from "Duis aute" up to the start of the next line (i.e. ending with "nulla pariatur").
How could I do this in Python? Thanks in advance to everyone.
Sorry for poor English.
You can use this:
with open('filename.txt') as f:   # open the file and read its contents
    data = f.read()

s_index = data.index('Duis aute')   # starting index of the text
e_index = data.index('.', s_index)  # index of the first '.' after s_index, so we stop at the right dot
text = data[s_index:e_index]
print(text)
Output
Duis aute irure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur
If you want the extracted text to end at a newline (\n) instead, use this:
with open('filename.txt') as f:
    data = f.read()

# try/except because .index() raises ValueError if the substring is not in the string.
try:
    s_index = data.index('Duis aute')
    e_index = data.index('\n', s_index)
except ValueError:
    print('Value Not Found.')
else:
    text = data[s_index:e_index]
    print(text)
Testing
with open('filename.txt') as f:
    data = f.read()

try:
    s_index = data.index('ipsum dolor')
    e_index = data.index('\n', s_index)
except ValueError:
    print('Value Not Found.')
else:
    text = data[s_index:e_index]
    print(text)
Output
ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.
with open('filename.txt') as f:
    data = f.read()

try:
    s_index = data.index('Ut enim ad minim')
    e_index = data.index('\n', s_index)
except ValueError:
    print('Value Not Found.')
else:
    text = data[s_index:e_index]
    print(text)
Output
Ut enim ad minim veniam, quis nostrum exercitationem ullamco laboriosam, nisi ut aliquid ex ea commodi consequatur.
And if you only need one word after the given word, use this:
with open('filename.txt') as f:
    data = f.read()

try:
    s_index = data.index('Lorem')
    # start searching for the next space just past "Lorem " so we get the word that follows it
    e_index = data.index(' ', s_index + len('Lorem') + 1)
except ValueError:
    print('Value Not Found.')
else:
    text = data[s_index:e_index]
    print(text)
Output
Lorem ipsum
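As a side note, if you would rather avoid try/except, str.find() returns -1 instead of raising ValueError when the substring is missing. A small sketch of the same extraction using find():
with open('filename.txt') as f:
    data = f.read()

s_index = data.find('Duis aute')
if s_index == -1:
    print('Value Not Found.')
else:
    e_index = data.find('\n', s_index)
    if e_index == -1:        # no newline after the match: take the rest of the text
        e_index = len(data)
    print(data[s_index:e_index])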
If you are trying to extract a particular "sentence", one way is to split on the sentence separator (\n, for example):
sentences = s.split('\n')
If you have multiple delimiters for a sentence, you can use the re module:
import re
sentences = re.split(r'\.|\n', s)
You can then extract the matches from sentences:
required = '\n'.join(_ for _ in sentences if _.strip().startswith('Duis aute'))
Of course, you can combine all of this into a one-liner:
'\n'.join(_ for _ in s.split('.') if _.strip().startswith('Duis aute'))
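Applied to the sample text from the question (a quick check, assuming the text is stored in s), this picks out just the sentence of interest:
s = ("Lorem ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.\n"
     "Ut enim ad minim veniam, quis nostrum exercitationem ullamco laboriosam, nisi ut aliquid ex ea commodi consequatur.\n"
     "Duis aute irure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.\n"
     "Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum")

print('\n'.join(_ for _ in s.split('\n') if _.strip().startswith('Duis aute')))
# Duis aute irure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.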
The function I created to find a list of regex matches doesn't work: instead of printing a list of all the matches, it prints one match at a time. I tried multiple times and I don't understand what the error could be.
For instance, this is the text I want to find the regex in: '] prima ciao hello'
This is the function:
def find_regex(regex, text):
    l = []
    matches_prima = re.findall(regex, text)
    lunghezza_prima = len(matches_prima)
    for x in matches_prima:
        l.extend(matches_prima)
    print(l)
And it is called from another function like this:
def main():
    testo = '] prima ciao hello', 'ola'
    find_prima = re.compile(r"\]\s*prima(?!\S)")
    print(find_regex(find_prima, testo))

if __name__ == "__main__":
    main()
So given a regex, I call it like print(find_regex(find_prima,testo)). But the output is:
['] prima']
[]
So I get them printed one at a time.
But I need the full list instead, so I can count all the matches. What am I doing wrong?
Try this:
import re
txt = """mypattern, Lorem ipsum dolor sit amet, aliquip sunt ad irure ad
labore nulla do et est eiusmod ut fugiat. Minim enim incididunt ullamco
deserunt Lorem cillum in est ullamco dolor qui sint labore. Reprehenderit
laborum anim magna pariatur proident cillum et eiusmod eu laboris cillum.
Quis et nostrud laboris non. Est incididunt dolore sint dolore. Sunt eu
mypattern, ipsum ullamco dolore ad ut veniam est. dolore mollit ut sunt nulla
"""
print([line for line in txt.splitlines() if re.match(r"mypattern, ", line) is not None])
Output:
['mypattern, Lorem ipsum dolor sit amet, aliquip sunt ad irure ad', 'mypattern, ipsum ullamco dolore ad ut veniam est. dolore mollit ut sunt nulla']
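As for the original find_regex: one likely issue is that it prints inside the function instead of returning a list, and it extends l with the whole matches_prima list once per match, so longer inputs would produce duplicates. A minimal rewrite (a sketch, assuming testo is meant to be a tuple of strings that should each be searched) could look like this:
import re

def find_regex(regex, texts):
    matches = []                      # one combined list for all input strings
    for text in texts:
        matches.extend(re.findall(regex, text))
    return matches

def main():
    testo = ('] prima ciao hello', 'ola')
    find_prima = re.compile(r"\]\s*prima(?!\S)")
    result = find_regex(find_prima, testo)
    print(result)        # ['] prima']
    print(len(result))   # 1 -> total number of matches

if __name__ == "__main__":
    main()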
For a text processing task I need to apply multiple regex substitutions (i.e. re.sub). There are multiple regex patterns, each with custom replacement parameters. The result needs to be the original text, the text with replacements, and a map of tuples identifying the start/end indices of the replaced strings in the source text and the corresponding indices in the result text.
e.g.
The following is sample code with the input text and an array of 3 modifier tuples.
text = '''
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On Apr. 6th, 2009 Ut enim culpa minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex 5 ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt 6 mollit anim id est laborum.
'''
modifiers = [
    (
        r'([\w]+\.?)\s+(\d{1,2})\w{2},\s+(\d{4})',
        { 1: lambda x: month(x), 2: lambda x: num2text(x), 3: lambda x: num2text(x) }
    ),
    (
        r' (\d) ',
        { 1: lambda x: num2text(x) }
    ),
    (
        r'(culpa)',
        { 1: 'culpae' }
    )
]
sample output index map:
[((7, 11), (7, 30)), ((12, 14), (31, 35)), ((20, 22), (41, 51)), ((23, 28), (52, 57)),...]
I already wrote a complicated function that tries to handle all the corner cases of the index offsetting that happens during replacements, but it is already taking too much time.
Maybe there is already a solution for this task?
Here is a demo of the current state.
The word transformation/expansion (normalization) functions were intentionally made simplistic, with a fixed-value dict mapping.
The ultimate goal is to make a text dataset generator. The dataset needs to have two text parts: one with numbers, abbreviations and other expandable strings, and the other with those fully expanded into their textual representation (e.g. 3 -> three, apr. -> april, etc.), plus an offset mapping to link parts of the non-expanded text with the corresponding parts of the expanded text.
One of the corner cases my implementation already handles is when there are at least two modifiers, A and B, applied to text like 'text text a text b text a text b': after the first modifier runs, the output span of the second 'a' replacement becomes incorrect, because modifier B then alters the output text before the second 'a'.
I also partially dealt with the case where a subsequent modifier replaces the output of a previous modifier's replacement and has to figure out the original source span location.
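For reference, the span bookkeeping for a single pattern can be done with re.finditer by accumulating how much each replacement grows or shrinks the output. The sketch below only illustrates that core idea (with a trivial stand-in expander, not the real month/num2text functions); it is not a full multi-modifier solution:
import re

def sub_with_spans(pattern, expand, text):
    # Apply one pattern; return (new_text, [((src_start, src_end), (dst_start, dst_end)), ...]).
    out, spans = [], []
    last = 0      # end of the previous match in the source text
    offset = 0    # net length change of the output so far
    for m in re.finditer(pattern, text):
        replacement = expand(m.group(1))
        out.append(text[last:m.start()])
        out.append(replacement)
        dst_start = m.start() + offset
        spans.append(((m.start(), m.end()), (dst_start, dst_start + len(replacement))))
        offset += len(replacement) - (m.end() - m.start())
        last = m.end()
    out.append(text[last:])
    return ''.join(out), spans

new_text, spans = sub_with_spans(r'(\d)', lambda d: 'five' if d == '5' else d, 'aliquip ex 5 ea commodo')
print(new_text)   # aliquip ex five ea commodo
print(spans)      # [((11, 12), (11, 15))]
Chaining several such passes and composing their span maps is exactly where the corner cases described above appear.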
UPDATE
I am writing a Python package called re-map.
One might also consider spaCy, mentioned here.
Here is a code example that handles your text modifiers using re, datetime and a third-party package called inflect.
The code will return the modified text along with the positions of the modified words.
PS: You need to explain more about what you're trying to do. Meanwhile, you can use this code and modify it to fulfill your needs.
To install inflect: pip install inflect
Sample code:
import re
from datetime import datetime

import inflect

ENGINE = inflect.engine()


def num2words(num):
    """Number to words using the inflect package"""
    return ENGINE.number_to_words(num)


def pretty_format_date(pattern, date_found, text):
    """Pretty format dates"""
    _month, _day, _year = date_found.groups()
    month = datetime.strptime('{day}/{month}/{year}'.format(
        day=_day, month=_month.strip('.'), year=_year
    ), '%d/%b/%Y').strftime('%B')
    day, year = num2words(_day), num2words(_year)
    date = '{month} {day}, {year} '.format(month=month, day=day, year=year)
    begin, end = date_found.span()
    _text = re.sub(pattern, date, text[begin:end])
    text = text[:begin] + _text + text[end:]
    return text, begin, end


def format_date(pattern, text):
    """Format the given string into a date"""
    spans = []
    # The for loop prevents us from going into an infinite loop
    # if there is malformed text or a bad regex.
    for _ in re.findall(pattern, text):
        date_found = re.search(pattern, text)
        if not date_found:
            break
        try:
            text, begin, end = pretty_format_date(pattern, date_found, text)
            spans.append([begin, end])
        except Exception:
            # Pass without any modification if there are errors with the date format.
            pass
    return text, spans


def number_to_words(pattern, text):
    """Number to words with spans"""
    spans = []
    # The for loop prevents us from going into an infinite loop
    # if there is malformed text or a bad regex.
    for _ in re.findall(pattern, text):
        number_found = re.search(pattern, text)
        if not number_found:
            break
        _number = number_found.groups()
        number = num2words(_number)
        begin, end = number_found.span()
        spans.append([begin, end])
        _text = re.sub(pattern, number, text[begin:end])
        text = text[:begin] + ' {} '.format(_text) + text[end:]
    return text, spans


def custom_func(pattern, text, output):
    """Custom function"""
    spans = []
    for _ in re.findall(pattern, text):
        _found = re.search(pattern, text)
        begin, end = _found.span()
        spans.append([begin, end])
        _text = re.sub(pattern, output, text[begin:end])
        text = text[:begin] + ' {} '.format(_text) + text[end:]
    return text, spans
text = '''
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On Apr. 6th, 2009 Ut enim culpa minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex 5 ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt 6 mollit anim id est laborum.
'''

modifiers = [
    (
        r'([\w]+\.?)\s+(\d{1,2})\w{2},\s+(\d{4})',
        format_date
    ),
    (
        r' (\d) ',
        number_to_words
    ),
    (
        r'( \bculpa\b)',  # Better to use this pattern to catch the exact word
        'culpae'
    )
]

for regex, func in modifiers:
    if not isinstance(func, str):
        print('\n{} {} {}'.format('#' * 20, func.__name__, '#' * 20))
        _text, spans = func(regex, text)
    else:
        print('\n{} {} {}'.format('#' * 20, func, '#' * 20))
        _text, spans = custom_func(regex, text, func)
    print(_text, spans)
Output:
#################### format_date ####################
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On April six, two thousand and nine Ut enim culpa minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex 5 ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt 6 mollit anim id est laborum.
[[128, 142]]
#################### number_to_words ####################
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On Apr. 6th, 2009 Ut enim culpa minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex five ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt six mollit anim id est laborum.
[[231, 234], [463, 466]]
#################### culpae ####################
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On Apr. 6th, 2009 Ut enim culpae minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex 5 ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpae qui officia deserunt 6 mollit anim id est laborum.
[[150, 156], [435, 441]]
Demo on Replit
I wrote the re-map Python library to solve the problem described.
Here is a demo.
I have a potentially large amount of text output coming from an application. The output can be broken up into different sections, and I would like to determine which section I am processing based on the existence of one or more keywords or key phrases.
Dummy Example output:
******************
** MyApp 1.1 **
** **
******************
**Copyright **
******************
Note # 1234
Text of the note 1234
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
******************
INPUT INFO:
Number of data points: 123456
Number of cases: 983
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
******************
Analysis Type: Simple
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
******************
Results:
Data 1: 1234e-10
Data 2
------
1 2
2 3.4
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
*******************
CPU TIME: 1:01:12
WALL TIME: 1:04:23
*******************
So I created a dictionary like the one below and am trying to look up the values of the dict in each chunk.
def process(this_chunk):
    keys = ['banner_k', 'runSummary', 'inputSummary_k']
    vals = [['MyApp', 'Copyright'], ['CPU TIME'], ['Number of data']]
    for k, v in zip(keys, vals):
        chunkdict[k] = v

    for k, v in chunkdict.items():
        if any(x in v for x in this_chunk.splitlines()):
            print(k + " is in this chunk")
            process_for_k(chunk)  # Function for each specific section.
            break
        else:
            print(k + " is not in this chunk")
    return
But this does not identify all the chunks. The values are indeed present, but they are only matched in one chunk. To be specific, my real application has the exact words 'CPU TIME' and 'Copyright' in its output.
The section with 'CPU TIME' is captured correctly, but the section with 'Copyright' is not found.
Is this the right approach to identifying sections with known keywords?
Any ideas why this (if any(x in v for x in this_chunk.splitlines()):) might not work?
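One thing to check: any(x in v for x in this_chunk.splitlines()) asks whether any complete line of the chunk is an element of the keyword list v, not whether a keyword occurs somewhere in the chunk. Reversing the test may be what was intended; a small sketch (the process_for_k call is omitted here):
def process(this_chunk):
    chunkdict = {
        'banner_k': ['MyApp', 'Copyright'],
        'runSummary': ['CPU TIME'],
        'inputSummary_k': ['Number of data'],
    }
    for k, keywords in chunkdict.items():
        # True if any keyword occurs anywhere in the chunk's text
        if any(keyword in this_chunk for keyword in keywords):
            print(k + " is in this chunk")
            break
    else:
        print("no known section found in this chunk")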
I'm trying to split a large file that has several paragraphs; each one is of variable length, and the only delimiter is the bullet point that starts the next paragraph...
Is there a way to get several different files with each individual paragraph?
The final thing is to write each individual paragraph to a MySQL DB...
example input:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
output: each paragraph is a separate entry in the DB
This is how you split your file by bullet point and write each paragraph to its own file:
paragraphs = open(source_file).read().split(u'\u2022')
for i, par in enumerate(paragraphs):
    open("%s.txt" % i, "w").write(par)
Each file can then be loaded into MySQL with a statement like:
LOAD DATA INFILE '0.txt' INTO TABLE your_DB_name.your_table;
This connects to the MySQL DB, reads the file, splits it at each bullet point, and inserts the data into a MySQL DB table.
My Code:
# Server connection to MySQL:
import MySQLdb

conn = MySQLdb.connect(host="localhost",
                       user="root",
                       passwd="newpassword",
                       db="db")
x = conn.cursor()
try:
    file_data = open("FILE_NAME_WITH_EXTENSION").read().split(u'\u2022')
    for text in file_data:
        print(text)
        x.execute("""INSERT INTO TABLE_NAME VALUES (%s)""", (text,))
    conn.commit()
except Exception:
    conn.rollback()
conn.close()