Regex for CSV split including multiple double quotes - python

I have CSV column data containing text. Each row is delimited with double quotes ".
Sample text in a row is similar to this (note: the new lines and the spaces before each line are intentional):
"Lorem ipsum dolor sit amet,
consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna
aliqua. Ut ""enim ad"" minim veniam,
quis nostrud exercitation ullamco laboris nisi
ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat ""nulla pariatu"""
"ex ea commodo
consequat. Duis aute irure ""dolor in"" reprehenderit
in voluptate velit esse
cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt
mollit anim id est laborum."
The above represents two consecutive rows.
I want to select, as separate groups, all the text contained between each first double quote " (starting a line) and each LAST double quote ".
As you can see, though, there are line breaks in the text, along with doubled escaped quotes "" which are part of the text I need to select.
I came up with something like this
(?s)(?!")[^\s](.+?)(?=")
but the multiple double quotes are breaking my desired match.
I'm a real novice with regex, so I think maybe I'm missing something very basic. Not sure if it's relevant, but I'm using Sublime Text 3, so the regex flavor should be Python, I think.
What can I do to achieve what I need?

You can use the following regex:
"[^"]*(?:""[^"]*)*"
See demo
This regex will match either non-quote characters or two consecutive double quotes inside the double quotation marks.
How does it work? Let me share a graphic from debuggex.com:
With the regex, we match:
" - (1) - a literal quote
[^"]* - (2, 3) - 0 or more characters other than a quote (yes, including a newline, since this is a negated character class); if there are none, the regex looks for the final literal quote (6)
(?:""[^"]*)* - (4,5) - 0 or more sequences of:
"" - (4) - double double quotation marks
[^"]* - (5) - 0 or more characters other than a quote
" - (6) - the final literal quote.
This works faster than "(?:[^"]|"")*" (although it yields the same results), because processing of the former is linear, involving much less backtracking.
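If you want to try the pattern from Python itself rather than Sublime, here is a quick sketch (the sample text is illustrative, not the poster's file):

```python
import re

# The answer's pattern: quoted fields where "" is an escaped quote.
# No DOTALL flag is needed: [^"] already matches newlines.
pattern = r'"[^"]*(?:""[^"]*)*"'
sample = '"first ""quoted"" field"\n"second\nfield"'
fields = re.findall(pattern, sample)
# fields == ['"first ""quoted"" field"', '"second\nfield"']
```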

If you are using Python, then you do not need a regex; you can directly use the standard csv library, and doubled double quotes inside a single row will be handled automatically. Example (for the CSV you posted above, saved in a.csv):
>>> import csv
>>> with open('a.csv', 'r') as f:
...     reader = csv.reader(f)
...     for row in reader:
...         print(row)
...
['Lorem ipsum dolor sit amet, \n consectetur adipisicing elit, sed do eiusmod\n tempor incididunt ut labore et dolore magna \n aliqua. Ut "enim ad" minim veniam,\n quis nostrud exercitation ullamco laboris nisi \n ut aliquip ex ea commodo\n consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse\n cillum dolore eu fugiat "nulla pariatu"']
['ex ea commodo\n consequat. Duis aute irure "dolor in" reprehenderit \n in voluptate velit esse\n cillum dolore eu fugiat nulla pariatur. \n Excepteur sint occaecat cupidatat non\n proident, sunt in culpa qui officia deserunt \n mollit anim id est laborum.']
This was handled correctly by the csv module basically because " is the default quotechar, so anything between two " characters is considered part of that single column, even if it contains \n, spaces, etc.
Also, the csv module has another argument called doublequote:
Controls how instances of quotechar appearing inside a field should themselves be quoted. When True, the character is doubled. When False, the escapechar is used as a prefix to the quotechar. It defaults to True.
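You can see the doublequote behavior from the writing side too; with the default doublequote=True, embedded quote characters come back out doubled. A small sketch using an in-memory buffer (the sample field is illustrative):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(['Ut "enim ad" minim'])
# With doublequote=True (the default), the embedded quotes are doubled:
# buf.getvalue().strip() == '"Ut ""enim ad"" minim"'
```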

Related

Is it possible to drop sentences from the text with NLTK in Python?

For example, I have a text that consists of several sentences:
"First sentence is not relevant. Second contains information about KPI I want to keep. Third is useless. Fourth mentions topic relevant for me".
In addition, I have self-constructed dictionary with words {KPI, topic}.
Is it somehow possible to write code that will keep only those sentences where at least one word from the dictionary is mentioned? So that from the above example, only the 2nd and 4th sentences remain.
Thanks
P.S. I already have code to tokenize the text into sentences, but keeping only the "relevant" ones does not seem to be a common task, as far as I can see.
One solution would be to use list comprehensions (see example below).
But there might be a better and more pythonic solution out there.
sentences = ['Lorem ipsum dolor keyword sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.',
'Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.',
'Duis aute irure other_keyword dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.',
'Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.']
vocabulary = {'keyword': 'Topic 1',
'other_keyword': 'Topic 2'}
[sentence for sentence in sentences if any(word in sentence for word in list(vocabulary.keys()))]
>>> ['Lorem ipsum dolor keyword sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.',
'Duis aute irure other_keyword dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.']
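One caveat with the plain word-in-sentence test above: 'in' does substring matching, so a short key would also match inside longer words. A hedged variant that compiles the vocabulary into a word-boundary regex (sentences shortened here for illustration):

```python
import re

sentences = ['Lorem ipsum dolor keyword sit amet.',
             'Ut enim ad minim veniam.',
             'Duis aute irure other_keyword dolor.']
vocabulary = {'keyword': 'Topic 1', 'other_keyword': 'Topic 2'}

# re.escape guards against regex metacharacters in the keys;
# \b requires whole-token matches instead of substring hits.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, vocabulary)) + r')\b')
relevant = [s for s in sentences if pattern.search(s)]
# relevant keeps the first and third sentences
```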

Python snippet to manage regex replacement index map?

For a text-processing task I need to apply multiple regex substitutions (i.e. re.sub). There are multiple regex patterns with custom replacement parameters. The result needs to be the original text, the text with replacements, and a map of tuples identifying the start/end indices of replaced strings in the source text and the corresponding indices in the result text.
e.g.
following is a sample code having input text and an array of 3 modifier tuples.
text = '''
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On Apr. 6th, 2009 Ut enim culpa minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex 5 ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt 6 mollit anim id est laborum.
'''
modifiers = [
    (
        r'([\w]+\.?)\s+(\d{1,2})\w{2},\s+(\d{4})',
        { 1: lambda x: month(x), 2: lambda x: num2text(x), 3: lambda x: num2text(x) }
    ),
    (
        r' (\d) ',
        { 1: lambda x: num2text(x) }
    ),
    (
        r'(culpa)',
        { 1: 'culpae' }
    )
]
sample output index map:
[((7, 11), (7, 30)), ((12, 14), (31, 35)), ((20, 22), (41, 51)), ((23, 28), (52, 57)),...]
I have already written a complicated function that tries to handle all the corner cases of the index offsetting that happens during replacements, but it's already taking too much time.
Maybe there is already a solution for this task?
Here is a demo of current state.
Word transformation expansion (normalization) functions were intentionally made simplistic with fixed value dict mapping.
The ultimate goal is to make a text dataset generator. The dataset needs to have two text parts: one with numbers, abbreviations, and other expandable strings, and the other fully expanded into textual representation (e.g. 3 -> three, apr. -> april, etc.), plus an offset mapping that links parts of the non-expanded text to the corresponding parts of the expanded text.
One of the corner cases my implementation already deals with: when there are at least two modifiers A and B and the text looks like 'text text a text b text a text b', the output span computed for the second 'a' replacement becomes incorrect, because modifier B comes in and alters the output text before the second 'a'.
It also partially deals with the case where a subsequent modifier replaces a replacement produced by an earlier modifier, and figures out the initial source span location.
UPDATE
I am writing a Python package called re-map.
One might also consider spacy mentioned here.
Here is a code example that handles your text modifiers using re, datetime, and a third-party package called inflect.
The code returns the modified text along with the positions of the modified words.
PS: You need to explain more about what you're trying to do; otherwise, you can take this code and modify it to fulfill your needs.
To install inflect: pip install inflect
Sample code:
import re
from datetime import datetime
import inflect

ENGINE = inflect.engine()

def num2words(num):
    """Number to words using the inflect package"""
    return ENGINE.number_to_words(num)

def pretty_format_date(pattern, date_found, text):
    """Pretty-format dates"""
    _month, _day, _year = date_found.groups()
    month = datetime.strptime('{day}/{month}/{year}'.format(
        day=_day, month=_month.strip('.'), year=_year
    ), '%d/%b/%Y').strftime('%B')
    day, year = num2words(_day), num2words(_year)
    date = '{month} {day}, {year} '.format(month=month, day=day, year=year)
    begin, end = date_found.span()
    _text = re.sub(pattern, date, text[begin:end])
    text = text[:begin] + _text + text[end:]
    return text, begin, end

def format_date(pattern, text):
    """Format matches of the given pattern as dates"""
    spans = []
    # The for loop prevents us from going into an infinite loop
    # if there is malformed text or a bad regex
    for _ in re.findall(pattern, text):
        date_found = re.search(pattern, text)
        if not date_found:
            break
        try:
            text, begin, end = pretty_format_date(pattern, date_found, text)
            spans.append([begin, end])
        except Exception:
            # Pass without any modification if there are errors with date formats
            pass
    return text, spans

def number_to_words(pattern, text):
    """Number to words, with spans"""
    spans = []
    # The for loop prevents us from going into an infinite loop
    # if there is malformed text or a bad regex
    for _ in re.findall(pattern, text):
        number_found = re.search(pattern, text)
        if not number_found:
            break
        number = num2words(number_found.group(1))
        begin, end = number_found.span()
        spans.append([begin, end])
        _text = re.sub(pattern, number, text[begin:end])
        text = text[:begin] + ' {} '.format(_text) + text[end:]
    return text, spans

def custom_func(pattern, text, output):
    """Custom string replacement, with spans"""
    spans = []
    for _ in re.findall(pattern, text):
        _found = re.search(pattern, text)
        begin, end = _found.span()
        spans.append([begin, end])
        _text = re.sub(pattern, output, text[begin:end])
        text = text[:begin] + ' {} '.format(_text) + text[end:]
    return text, spans
text = '''
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On Apr. 6th, 2009 Ut enim culpa minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex 5 ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt 6 mollit anim id est laborum.
'''
modifiers = [
    (
        r'([\w]+\.?)\s+(\d{1,2})\w{2},\s+(\d{4})',
        format_date
    ),
    (
        r' (\d) ',
        number_to_words
    ),
    (
        r'( \bculpa\b)',  # Better to use this pattern to catch the exact word
        'culpae'
    )
]

for regex, func in modifiers:
    if not isinstance(func, str):
        print('\n{} {} {}'.format('#' * 20, func.__name__, '#' * 20))
        _text, spans = func(regex, text)
    else:
        print('\n{} {} {}'.format('#' * 20, func, '#' * 20))
        _text, spans = custom_func(regex, text, func)
    print(_text, spans)
Output:
#################### format_date ####################
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On April six, two thousand and nine Ut enim culpa minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex 5 ea commodo consequat. Duis aute irure dolorin reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt 6 mollit animid est laborum.
[[128, 142]]
#################### number_to_words ####################
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On Apr. 6th, 2009 Ut enim culpa minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex five ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt six mollit anim id est laborum.
[[231, 234], [463, 466]]
#################### culpae ####################
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. On Apr. 6th, 2009 Ut enim culpae minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex 5 ea commodo consequat. Duis aute irure dolorin reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. On June 23rd, 3004 excepteur sint occaecat
cupidatat non proident, sunt in culpae qui officia deserunt 6 mollit anim id est laborum.
[[150, 156], [435, 441]]
Demo on Replit
I wrote a re-map Python library to solve the problem described.
Here is a demo.
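For reference, the core offset bookkeeping the question revolves around can be sketched in one pass for a single pattern; sub_with_map is a hypothetical helper for illustration, not part of re-map or spacy:

```python
import re

def sub_with_map(pattern, repl, text):
    """Apply one regex substitution and also return a list of
    ((src_start, src_end), (dst_start, dst_end)) span pairs."""
    out, spans = [], []
    last = offset = 0
    for m in re.finditer(pattern, text):
        out.append(text[last:m.start()])   # untouched text before the match
        replacement = repl(m)
        dst_start = m.start() + offset     # shift by the net length change so far
        out.append(replacement)
        spans.append(((m.start(), m.end()),
                      (dst_start, dst_start + len(replacement))))
        offset += len(replacement) - (m.end() - m.start())
        last = m.end()
    out.append(text[last:])
    return ''.join(out), spans
```

Chaining several modifiers then reduces to composing these span maps, which is exactly where the corner cases described in the question appear.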

Unable to identify text segments based on keywords

I have a potentially large amount of text output coming from an application. The output can be broken up into different sections, and I would like to determine which section I am processing based on the existence of one or more keywords or key phrases.
Dummy Example output:
******************
** MyApp 1.1 **
** **
******************
**Copyright **
******************
Note # 1234
Text of the note 1234
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
******************
INPUT INFO:
Number of data points: 123456
Number of cases: 983
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
******************
Analysis Type: Simple
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
******************
Results:
Data 1: 1234e-10
Data 2
------
1 2
2 3.4
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
*******************
CPU TIME: 1:01:12
WALL TIME: 1:04:23
*******************
So I created a dictionary like the one below, and am trying to look up the values of the dict in each chunk.
def process(this_chunk):
    keys = ['banner_k', 'runSummary', 'inputSummary_k']
    vals = [['MyApp', 'Copyright'], ['CPU TIME'], ['Number of data']]
    for k, v in zip(keys, vals):
        chunkdict[k] = v
    for k, v in chunkdict.items():
        if any(x in v for x in this_chunk.splitlines()):
            print(k + " is in this chunk")
            process_for_k(chunk)  # Function for each specific section.
            break
        else:
            print(k + " is not in this chunk")
    return
But this does not identify all the chunks. The values are indeed present, but they are matched in only one chunk. To be specific, my real application has the exact words 'CPU TIME' and 'Copyright' in its output.
The section with 'CPU TIME' is captured correctly, but the section with 'Copyright' is not found.
Is this the right approach to identifying sections with known keywords?
Any ideas why this (if any(x in v for x in this_chunk.splitlines()):) might not work?
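For what it's worth, the likely culprit is the direction of the membership test: x in v asks whether an entire line of the chunk is an element of the keyword list, not whether a keyword occurs inside a line. A sketch with the test turned around (identify_chunk is an illustrative name, and the section-specific processing is assumed to live elsewhere):

```python
def identify_chunk(this_chunk):
    """Return the key of the first section whose keywords appear in the chunk."""
    chunkdict = {
        'banner_k': ['MyApp', 'Copyright'],
        'runSummary': ['CPU TIME'],
        'inputSummary_k': ['Number of data'],
    }
    for k, keywords in chunkdict.items():
        # Substring-test each keyword against each line of the chunk
        if any(kw in line for line in this_chunk.splitlines() for kw in keywords):
            return k
    return None
```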

Any way to search zlib-compressed text?

For a project I have to store a great deal of text, and I was hoping to keep the database size small by zlib-compressing it. Is there a way to search zlib-compressed text for substrings without decompressing?
I would like to do something like the following:
>>> import zlib
>>> lorem = zlib.compress("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.")
>>> test_string = zlib.compress("Lorem")
>>> test_string in lorem
False
No. You cannot compress a short string and expect to find the result of that compression in the compressed version of a file that contains that original short string. Compression codes the data differently depending on the data that precedes it. In fact, that's how most compressors work -- by using the preceding data for matching strings and statistical distributions.
To search for a string, you have to decompress the data. You do not have to store the decompressed data though. You can read in the compressed data and decompress on the fly, discarding that data as you go until you find your string or get to the end. If the compressed data is very large and on slow mass media, this may be faster than searching for the string in the same data uncompressed on the same media.
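A minimal sketch of that decompress-on-the-fly search, keeping a small overlap between chunks so a match spanning two chunks is not missed (contains is a hypothetical helper, not part of zlib):

```python
import zlib

def contains(compressed, needle, chunk_size=4096):
    """Stream-decompress `compressed` and report whether `needle` occurs,
    without ever holding the whole decompressed text in memory."""
    d = zlib.decompressobj()
    tail = b''
    overlap = max(len(needle) - 1, 0)  # bytes kept to catch border-spanning matches
    for i in range(0, len(compressed), chunk_size):
        data = tail + d.decompress(compressed[i:i + chunk_size])
        if needle in data:
            return True
        tail = data[-overlap:] if overlap else b''
    # Check any remainder the decompressor was still buffering
    return needle in tail + d.flush()
```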

How to iterate over a sequence of spaces?

OK, so I'm trying to take an input text file and justify it like Microsoft Word or any other word processor would. I have gotten the text to do character justification, bar the last line. I'm trying to figure out how to iterate over each space in the last line and insert a ' ' to get the last line up to the specified length.
If I try:
for ' ' in new:
insert(new,' ',find(' '))
I get a non-iterable error; hence the while loop in the code. But this only inserts all the spaces at the first space.
In the spirit of the simple style that Python has taught me: is there also a way to get this program to justify by words and not chars?
I was using the 'Lorem ipsum...' paragraph as my default text.
Any help is appreciated.
full code:
inf = open('filein.txt', 'r')
of = open('fileout.txt', 'w')
inf.tell()
n = input('enter the number of characters per line: ')

def insert(original, new, pos):
    # Inserts new inside original at pos.
    return original[:pos] + new + original[pos:]

try:
    print 'you entered {}\n'.format(int(n))
except:
    print 'there was an error'
    n = input('enter the number of characters per line: ')
else:
    new = inf.readline(n)

    def printn(l):
        print>>of, l+'\n'
        print 'printing to file',
        print '(first char: {} || last char: {})'.format(l[0], l[-1])

    while new != '':  # multiple spaces present at EOF
        if new[0] != ' ':  # check space at beginning of line
            if new[-1] != ' ':  # check space at end of line
                while (len(new) < n):
                    new = insert(new, ' ', (new.find(' ')))
                printn(new)
        elif new[0] == ' ':
            new = new.lstrip()  # remove leading whitespace
            new = insert(new, ' ', (new.find(' ')))
            while (len(new) < n):
                new = insert(new, ' ', (new.find(' ')))
            printn(new)
        elif new[-1] == ' ':
            new = new.rstrip()  # remove trailing whitespace
            new = insert(new, ' ', (new.rfind(' ')))
            while (len(new) < n):
                new = insert(new, ' ', (new.rfind(' ')))
            printn(new)
        new = inf.readline(n)

    print '\nclosing files...'
    inf.close()
    print 'input closed'
    of.close()
    print 'output closed'
input:
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
If line length n == 37
output:
Lorem ipsum dolor sit amet, consectet
ur adipisicing elit, sed do eiusmod t
magna aliqua. Ut enim ad minim veniam
, quis nostrud exercitation ullamco l
consequat. Duis aute irure dolor in r
cillum dolore eu fugiat nulla pariatu
non proident, sunt in culpa qui offic
ia deserunt mollit anim id est laboru
m .
I'm having a little trouble understanding what you want to do... It seems like you might want the textwrap module (http://docs.python.org/library/textwrap.html). If this isn't what you want, let me know and I'll happily delete this answer...
EDIT
Does this do what you want?
s="""Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud
exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute
irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum."""
import textwrap
print textwrap.fill(s,37)
EDIT2
In this situation, I would break your string into substrings that are each N blocks long, store those strings in a list one after another, and then "\n".join(list_of_strings) at the end of the day. I won't code it up, though, since I suspect this is homework.
for ' ' in new:
insert(new,' ',find(' '))
You're trying to assign to a literal. Try it like this
for x in new:
    if x == ' ':
        insert(new, x, find(' '))
and see if it works?
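To justify by words rather than by characters, the usual approach is to split the line into words and share the leftover spaces across the gaps, left-most gaps first (justify_line is an illustrative helper, not stdlib):

```python
def justify_line(line, width):
    """Pad `line` to `width` by widening the gaps between words,
    giving the left-most gaps the extra spaces first."""
    words = line.split()
    if len(words) < 2:                # nothing to distribute: left-justify
        return line.ljust(width)
    total_spaces = width - sum(len(w) for w in words)
    base, extra = divmod(total_spaces, len(words) - 1)
    out = []
    for i, word in enumerate(words[:-1]):
        out.append(word)
        out.append(' ' * (base + (1 if i < extra else 0)))
    out.append(words[-1])
    return ''.join(out)
```

Applying this to every line except the last (which word processors leave ragged) avoids splitting words like 'consectet / ur' in the sample output.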
