Replacing String Text That Contains Double Quotes - python

I have a number series contained in a string, and I want to remove everything but the number series. But the double quotes are giving me errors. Here are examples of the strings and a sample command that I have used. All I want is 127.60-02-15, 127.60-02-16, etc.
<span id="lblTaxMapNum">127.60-02-15</span>
<span id="lblTaxMapNum">127.60-02-16</span>
I have tried all sorts of methods (e.g., triple double quotes, single quotes, quotes with backslashes, etc.). Here is one inelegant way that still isn't working because it's still leaving ">:
text = text.replace("<span id=", "")
text = text.replace("\"lblTaxMapNum\"", "")
text = text.replace("</span>", "")
Here is what I am working with (more specific code). I'm retrieving the data from a CSV and just trying to clean it up.
text = open("outputA.csv", "r")
text = ''.join([i for i in text])
text = text.replace("<span id=", "")
text = text.replace("\"lblTaxMapNum\"", "")
text = text.replace("</span>", "")
outputB = open("outputB.csv", "w")
outputB.writelines(text)
outputB.close()

If you add a > in the second replace, it's still not elegant, but it works:
text = text.replace("<span id=", "")
text = text.replace("\"lblTaxMapNum\">", "")
text = text.replace("</span>", "")
Alternatively, you could use a regex:
import re
text = "<span id=\"lblTaxMapNum\">127.60-02-16</span>"
pattern = r".*>(\d*\.\d*-\d*-\d*)\D*" # the group in parentheses captures the number series
match = re.search(pattern, text) # this searches for the pattern in the text
print(match.group(1)) # this prints out only the number

You can use BeautifulSoup.
from bs4 import BeautifulSoup
strings = ['<span id="lblTaxMapNum">127.60-02-15</span>', '<span id="lblTaxMapNum">127.60-02-16</span>']
# Use BeautifulSoup to extract the text from the <span> tags
for string in strings:
    soup = BeautifulSoup(string, 'html.parser')
    number_series = soup.span.text
    print(number_series)
Output:
127.60-02-15
127.60-02-16

It's a little bit long; hope my comments are readable.
path = r'c:\users\GH\desktop\test.csv'
with open(path, 'r') as f:
    text = f.read().strip()

stRange = '<'   # we are going to remove the unwanted text by using the (range index) method,
endRange = '>'  # which means removing all extra literals between < and >
text = list(text)
# cast our data to a list to be able to modify it by referring to its components by index number
i = 0
length = len(text)
# we are going to manipulate our text while iterating over it,
# so we have to keep the length in a variable we can update while iterating
while i < length:
    if text[i] == stRange:
        stRange = text.index(text[i])
    elif text[i] != endRange and text[i] != stRange:
        i += 1
        continue
    elif text[i] == endRange:
        endRange = text.index(text[i])        # an integer to be used as a range index
        i = 0
        del text[stRange : endRange + 1]      # delete the extra unwanted characters
        length = len(text)                    # get the new length of our data
        stRange = '<'                         # and again, assign the delimiter characters
        endRange = '>'                        # to their variables
    i += 1
else:
    result = str()
    for l in text:
        result += l

with open(path, 'w') as f:
    f.write(result)
with open(path, 'r') as f:
    print('the result ==>')
    print(f.read())

Related

How to split sentences from multiple txt files while keeping track of which paragraph each is from and its position in the paragraph?

I’m stuck with the following problem. I’m trying to split the sentences of multiple text files while keeping track of which paragraph each one is from and its position in the paragraph. What I mean by position is whether it’s, for example, the first, second, or last sentence. In addition, I want to omit duplicates while keeping track of how many there are in the same paragraph and at the same position. All text files use one format and have the same number of paragraphs. What I’m aiming for is this:
SENTENCE     PARAGRAPH   POSITION   OCCURENCES
Sentence_A   1           1          1
Sentence_B   1           2          4
Sentence_C   2           1          1
I managed to count all occurrences using the following code.
from pathlib import Path
import pandas as pd
import re

my_dir_path = "data/text_files/"
sentences = []
for file in Path(my_dir_path).iterdir():
    with open(file, 'r', encoding='windows-1252') as file_open:
        string = file_open.read()
    string = re.sub(r"\n+", '\n', string)
    string = re.sub(r'(\d+)\.(\d+)', r"\1:\2", string)
    string = string.replace('\n', ' ')
    string = string.replace('. ', '.')
    string = string.split('.')
    for elem in string:
        if 'Verwachting vandaag en morgen:' in elem:  # omit specific sentence
            continue
        else:
            sentences.append(elem)
df = pd.DataFrame(sentences, columns=['SENTENCES'])
df1 = df['SENTENCES'].value_counts().rename('SENTENCES').reset_index(name='OCCURENCES')
Adding the conditional part about the paragraph and its position is where I got stuck. I’m a novice at Python, any pointers are appreciated :]
Example text: https://pastebin.com/y30hzWCB. It always has 2 newlines between paragraphs. Some sentences are cut off in-between with a newline.
Thanks in advance!
from pathlib import Path
import pandas as pd
import re
from collections import defaultdict

my_dir_path = "data/text_files/"
data = {"SENTENCE": [], "PARAGRAPH": [], "POSITION": [], "OCCURENCES": []}
for file in Path(my_dir_path).iterdir():
    parano = 1
    pos = 0
    already = False
    occ = defaultdict(int)
    with open(file, 'r', encoding='windows-1252') as file_open:
        string = file_open.read()
    string = re.sub(r"\n+", '\n', string)
    string = re.sub(r'(\d+)\.(\d+)', r"\1:\2", string)
    string = string.replace('\n', ' ')
    string = string.replace('. ', '.')
    string = string.split('.')
    for elem in string:
        if elem == "\n" and already:
            parano += 1
            already = True
            pos = 0
        else:
            already = False
            pos += 1
        if 'Verwachting vandaag en morgen:' in elem:  # omit specific sentence
            continue
        elif elem not in data["SENTENCE"]:
            data["SENTENCE"].append(elem)
            data["PARAGRAPH"].append(parano)
            data["POSITION"].append(pos)
            occ[elem] = 1
        else:
            occ[elem] += 1
for i in data["SENTENCE"]:
    data["OCCURENCES"].append(occ[i])
df = pd.DataFrame(data)
What I did:
- created a dictionary data which is directly used to create the required dataframe
- kept count of the paragraph number parano by checking whether the current line is a newline
- kept count of position by simply increasing pos each time a valid sentence occurred

Python parse text from multiple txt files

Seeking advice on how to mine items from multiple text files to build a dictionary.
This text file: https://pastebin.com/Npcp3HCM
Was manually transformed into this required data structure: https://drive.google.com/file/d/0B2AJ7rliSQubV0J2Z0d0eXF3bW8/view
There are thousands of such text files and they may have different section headings as shown in these examples:
https://pastebin.com/wWSPGaLX
https://pastebin.com/9Up4RWHu
I started off by reading the files
from glob import glob

txtPth = '../tr-txt/*.txt'
txtFiles = glob(txtPth)
with open(txtFiles[0], 'r') as tf:
    allLines = [line.rstrip() for line in tf]
sectionHeading = ['Corporate Participants',
                  'Conference Call Participants',
                  'Presentation',
                  'Questions and Answers']
for lineNum, line in enumerate(allLines):
    if line in sectionHeading:
        print(lineNum, allLines[lineNum])
My idea was to look for the line numbers where section headings exist and try to extract the content between those line numbers, then strip out separators like dashes. That didn't work, and I got stuck trying to create a dictionary of this kind so that I can later run various natural language processing algorithms on the extracted items.
{file-name1:{
{date-time:[string]},
{corporate-name:[string]},
{corporate-participants:[name1,name2,name3]},
{call-participants:[name4,name5]},
{section-headings:{
{heading1:[
{name1:[speechOrderNum, text-content]},
{name2:[speechOrderNum, text-content]},
{name3:[speechOrderNum, text-content]}],
{heading2:[
{name1:[speechOrderNum, text-content]},
{name2:[speechOrderNum, text-content]},
{name3:[speechOrderNum, text-content]},
{name2:[speechOrderNum, text-content]},
{name1:[speechOrderNum, text-content]},
{name4:[speechOrderNum, text-content]}],
{heading3:[text-content]},
{heading4:[text-content]}
}
}
}
The challenge is that different files may have different headings and numbers of headings. But there will always be a section called "Presentation" and very likely a "Question and Answer" section. These section headings are always separated by a string of equals signs, and the content of different speakers is always separated by a string of dashes. The "speech order" in the Q&A section is indicated with a number in square brackets. The participants are always listed at the beginning of the document with an asterisk before their name, and their title is always on the next line.
Any suggestion on how to parse the text files is appreciated. The ideal help would be to provide guidance on how to produce such a dictionary (or other suitable data structure) for each file that can then be written to a database.
Thanks
--EDIT--
One of the files looks like this: https://pastebin.com/MSvmHb2e
In which the "Question & Answer" section is mislabeled as "Presentation" and there is no other "Question & Answer" section.
And final sample text: https://pastebin.com/jr9WfpV8
The comments in the code should explain everything. Let me know if anything is underspecified and needs more comments.
In short, I leverage regex to find the '=' delimiter lines to subdivide the entire text into subsections, then handle each type of section separately for clarity's sake (so you can tell how I am handling each case).
Side note: I am using the words 'attendee' and 'author' interchangeably.
EDIT: Updated the code to sort based on the '[x]' pattern found right next to the attendee/author in the presentation/QA section. Also changed the pretty print part since pprint does not handle OrderedDict very well.
To strip any additional whitespace, including \n, from either end of a string, simply do str.strip(). If you specifically need to strip only \n, then just do str.strip('\n').
I have modified the code to strip any whitespace around the talks.
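As a quick, minimal illustration of the difference (the sample string is made up):

```python
s = "  some talk text \n"
# strip() removes all leading/trailing whitespace, including the newline
print(repr(s.strip()))      # 'some talk text'
# strip('\n') removes only newline characters at either end
print(repr(s.strip('\n')))  # '  some talk text '
```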
import json
import re
from collections import OrderedDict
from pprint import pprint

# Subdivides a collection of lines based on the delimiting regular expression.
# >>> example_string = '''=============================
# ... asdfasdfasdf
# ... sdfasdfdfsdfsdf
# ... =============================
# ... asdfsdfasdfasd
# ... ============================='''
# >>> subdivide(example_string, "^=+")
# ['asdfasdfasdf\nsdfasdfdfsdfsdf', 'asdfsdfasdfasd']
def subdivide(lines, regex):
    equ_pattern = re.compile(regex, re.MULTILINE)
    sections = equ_pattern.split(lines)
    sections = [section.strip('\n') for section in sections]
    return sections

# For processing sections with dashes in them: returns the heading of the section along with
# a dictionary where each key is a subsection's header and each value is the text in that subsection.
def process_dashed_sections(section):
    subsections = subdivide(section, "^-+")
    heading = subsections[0]  # header of the section
    d = {key: value for key, value in zip(subsections[1::2], subsections[2::2])}
    index_pattern = re.compile(r"\[(.+)\]", re.MULTILINE)

    # Sort the dictionary by first capturing the pattern '[x]' and extracting the 'x' number.
    # This is passed as a key function to 'sorted' to sort based on 'x'.
    def cmp(d):
        mat = index_pattern.findall(d[0])
        if mat:
            return int(mat[0])
        # There are issues when dealing with subsections containing '-'s but no '[x]' pattern.
        # This is just to deal with that small issue.
        else:
            return 0

    o_d = OrderedDict(sorted(d.items(), key=cmp))
    return heading, o_d

# Renames the keys of the 'd' dictionary to the proper names present in the attendees.
# It searches for the best match for the key in the 'attendees' list and replaces the corresponding key.
# >>> d = {'mr. man ceo of company [1]': ' This is talk a',
# ...      'ms. woman ceo of company [2]': ' This is talk b'}
# >>> l = ['mr. man', 'ms. woman']
# >>> assign_attendee(d, l)
# {'mr. man': 'This is talk a', 'ms. woman': 'This is talk b'}
def assign_attendee(d, attendees):
    new_d = OrderedDict()
    for key, value in d.items():
        a = [a for a in attendees if a in key]
        if len(a) == 1:
            # strip out any additional whitespace around the text, including '\n'
            new_d[a[0]] = value.strip()
        elif len(a) == 0:
            # strip out any additional whitespace around the text, including '\n'
            new_d[key] = value.strip()
    return new_d

if __name__ == '__main__':
    with open('input.txt', 'r') as input:
        lines = input.read()

    # regex pattern for matching the header of each section
    header_pattern = re.compile("^.*[^\n]", re.MULTILINE)
    # regex pattern for matching the sections that contain
    # the list of attendees (those that start with asterisks)
    ppl_pattern = re.compile(r"^(\s+\*)(.+)(\s.*)", re.MULTILINE)
    # regex pattern for matching sections with subsections in them
    dash_pattern = re.compile("^-+", re.MULTILINE)

    ppl_d = dict()
    talks_d = dict()

    # Step 1. Divide the entire document into sections using the '=' divider.
    sections = subdivide(lines, "^=+")
    header = []
    print(sections)

    # Step 2. Handle each section like a switch case.
    for section in sections:
        # Handle headers
        if len(section.split('\n')) == 1:  # likely to match only a header
            header = header_pattern.match(section).string
        # Handle attendees/authors
        elif ppl_pattern.match(section):
            ppls = ppl_pattern.findall(section)
            d = {key.strip(): value.strip() for (_, key, value) in ppls}
            ppl_d.update(d)
            # Assuming that if the previous section was detected as a header,
            # then this section relates to that header.
            if header:
                talks_d.update({header: ppl_d})
        # Handle subsections
        elif dash_pattern.findall(section):
            heading, d = process_dashed_sections(section)
            talks_d.update({heading: d})
        # Else it's just some random text.
        else:
            # Assuming that if the previous section was detected as a header,
            # then this section relates to that header.
            if header:
                talks_d.update({header: section})

    # pprint(talks_d)
    # Assign the talk material to the appropriate attendee/author. Still works if no match is found.
    for key, value in talks_d.items():
        talks_d[key] = assign_attendee(value, ppl_d.keys())

    # OrderedDict does not pretty-print via 'pprint', so a small hack: use json output to pretty-print.
    print(json.dumps(talks_d, indent=4))
Could you please confirm whether you only require the "Presentation" and "Question and Answer" sections?
Also, regarding the output: is it OK to dump CSV format similar to what you have "manually transformed"?
Updated the solution to work for every sample file you provided.
Output is from cells "D:H" as per the "Parsed-transcript" file shared.
# state = ["other", "head", "present", "qa", "speaker", "data"]
# codes :    0,       1,       2,        3,      4,        5

def writecell(out, data):
    out.write(data)
    out.write(",")

def readfile(fname, outname):
    initstate = 0
    f = open(fname, "r")
    out = open(outname, "w")
    head = ""
    head_written = 0
    quotes = 0
    had_speaker = 0
    for line in f:
        line = line.strip()
        if not line:
            continue
        if initstate in [0, 5] and not any([s for s in line if "=" != s]):
            if initstate == 5:
                out.write('"')
                quotes = 0
            out.write("\n")
            initstate = 1
        elif initstate in [0, 5] and not any([s for s in line if "-" != s]):
            if initstate == 5:
                out.write('"')
                quotes = 0
            out.write("\n")
            initstate = 4
        elif initstate == 1 and line == "Presentation":
            initstate = 2
            head = "Presentation"
            head_written = 0
        elif initstate == 1 and line == "Questions and Answers":
            initstate = 3
            head = "Questions and Answers"
            head_written = 0
        elif initstate == 1 and not any([s for s in line if "=" != s]):
            initstate = 0
        elif initstate in [2, 3] and not any([s for s in line if ("=" != s and "-" != s)]):
            initstate = 4
        elif initstate == 4 and '[' in line and ']' in line:
            comma = line.find(',')
            speech_st = line.find('[')
            speech_end = line.find(']')
            if speech_st == -1:
                initstate = 0
                continue
            if comma == -1:
                firm = ""
                speaker = line[:speech_st].strip()
            else:
                speaker = line[:comma].strip()
                firm = line[comma+1:speech_st].strip()
            head_written = 1
            if head_written:
                writecell(out, head)
                head_written = 0
            order = line[speech_st+1:speech_end]
            writecell(out, speaker)
            writecell(out, firm)
            writecell(out, order)
            had_speaker = 1
        elif initstate == 4 and not any([s for s in line if ("=" != s and "-" != s)]):
            if had_speaker:
                initstate = 5
                out.write('"')
                quotes = 1
            had_speaker = 0
        elif initstate == 5:
            line = line.replace('"', '""')
            out.write(line)
        elif initstate == 0:
            continue
        else:
            continue
    f.close()
    if quotes:
        out.write('"')
    out.close()

readfile("Sample1.txt", "out1.csv")
readfile("Sample2.txt", "out2.csv")
readfile("Sample3.txt", "out3.csv")
Details
In this solution there is a state machine which works as follows:
1. detects whether a heading is present; if yes, writes it
2. detects speakers after the heading is written
3. writes the notes for that speaker
4. switches to the next speaker, and so on...
You can later process the csv files as you want.
You can also populate the data in any format you want once basic processing is done.
Edit:
Please replace the function "writecell"
def writecell(out, data):
    data = data.replace('"', '""')
    out.write('"')
    out.write(data)
    out.write('"')
    out.write(",")

Can't get rid of hex characters

This program makes an array of verbs which come from a text file.
file = open("Verbs.txt", "r")
data = str(file.read())
table = eval(data)
num_table = len(table)
new_table = []
for x in range(0, num_table):
    newstr = table[x].replace(")", "")
    split = newstr.rsplit("(")
    numx = len(split)
    for y in range(0, numx):
        split[y] = split[y].split(",", 1)[0]
        new_table.append(split[y])
num_new_table = len(new_table)
for z in range(0, num_new_table):
    print(new_table[z])
However the text itself contains hex characters such as in
('a\\xc4\\x9fr\\xc4\\xb1[Verb]+[Pos]+[Imp]+[A2sg]', ':', 17.6044921875)('A\\xc4\\x9fr\\xc4\\xb1[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]', ':', 11.5615234375)
I'm trying to get rid of those. How am I supposed to do that?
I've looked pretty much everywhere, and decode() returns an error (even after importing codecs).
You could use parse, a python module that allows you to search inside a string for regularly-formatted components, and, from the components returned, you could extract the corresponding integers, replacing them from the original string.
For example (untested alert!):
import parse

# Parse all hex-like items
list_of_findings = parse.findall("\\x{:w}", your_string)
# For each item
for hex_item in list_of_findings:
    # Replace the item in the string
    your_string = your_string.replace(
        # the matched escape sequence, e.g. 'c4' re-prefixed with '\x'
        "\\x" + hex_item[0],
        # convert the parsed value from hex to int, then to string again
        str(int(hex_item[0], 16))
    )
Obs: instead of int, you could convert the found hex-like values to characters, using chr, as in:
chr(int(hex_item[0], 16))
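Alternatively, here is a sketch that needs no third-party module, assuming (as the sample output suggests) that the \xNN pairs are escaped UTF-8 bytes: round-trip the string through the unicode_escape codec and then decode as UTF-8 to recover the real characters.

```python
s = 'a\\xc4\\x9fr\\xc4\\xb1[Verb]+[Pos]+[Imp]+[A2sg]'

# 1) latin-1 turns the text back into raw bytes unchanged,
# 2) unicode_escape interprets the literal \xNN escapes,
# 3) a second latin-1/utf-8 round trip decodes those bytes as UTF-8
decoded = (s.encode('latin-1')
            .decode('unicode_escape')
            .encode('latin-1')
            .decode('utf-8'))
print(decoded)  # ağrı[Verb]+[Pos]+[Imp]+[A2sg]
```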

regex number of repetitions from code

Can you use values from a script to inform a regex dynamically how to operate?
For example:
base_pattern = r'\s*(([\d.\w]+)[ \h]+)'
n_rep = random.randint(1, 9)
new_pattern = base_pattern + '{n_rep}'
line_matches = re.findall(new_pattern, some_text)
I keep getting problems with trying to get the grouping to work
Explanation
I am attempting to find the most common number of repetitions of a regex pattern in a text file in order to find table type data within files.
I have the idea to make a regex such as this:
base_pattern = r'\s*(([\d.\w]+)[ \h]+)'
line_matches = np.array([re.findall(base_pattern, line) for line_num, line in enumerate(some_text.split("\n"))])
# Find where the text has similar number of words/data in each line
where_same_pattern= np.where(np.diff([len(x) for x in line_matches])==0)
line_matches_where_same = line_matches[where_same_pattern]
# Extract out just the lines which have data
interesting_lines = np.array([x for x in line_matches_where_same if x != []])
# Find how many words in each line of interest
len_of_lines = [len(l) for l in interesting_lines]
# Use the most prevalent as the most likely number of columns of data
n_cols = Counter(len_of_lines).most_common()[0][0]
# Rerun the data through a regex to find the columns
new_pattern = base_pattern + '{n_cols}'
line_matches = np.array([re.findall(new_pattern, line) for line_num, line in enumerate(some_text.split("\n"))])
You need to use the value of the variable, not a string literal containing the name of the variable; and since n_cols is an integer, convert it to a string first, e.g.:
new_pattern = base_pattern + '{' + str(n_cols) + '}'
Your pattern is just a string. So, all you need is to convert your number into a string. You can use format (for example, https://infohost.nmt.edu/tcc/help/pubs/python/web/new-str-format.html) to do that:
base_pattern = r'\s*(([\d.\w]+)[ \h]+)'
n_rep = random.randint(1, 9)
new_pattern = base_pattern + '{{{0}}}'.format(n_rep)
print(new_pattern)  # \s*(([\d.\w]+)[ \h]+){6}
Note that the first two and the last two curly braces produce the literal braces in the new pattern, while {0} is replaced by the number n_rep.
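One caveat worth noting: Python's re module does not actually support \h (recent versions raise "bad escape \h"), so a runnable sketch of the same idea has to substitute something like [ \t] for [ \h]:

```python
import re

base_pattern = r'\s*(([\d.\w]+)[ \t]+)'  # [ \t] instead of [ \h], which Python's re rejects
n_rep = 3
new_pattern = base_pattern + '{{{0}}}'.format(n_rep)  # \s*(([\d.\w]+)[ \t]+){3}
# the group must now repeat exactly n_rep times for a match
print(bool(re.match(new_pattern, "1.02 33.4 abc ")))  # True
```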

python regex unicode - extracting data from an utf-8 file

I am facing difficulties extracting data from a UTF-8 file that contains Chinese characters.
The file is actually the CEDICT (Chinese-English dictionary) and looks like this:
賓 宾 [bin1] /visitor/guest/object (in grammar)/
賓主 宾主 [bin1 zhu3] /host and guest/
賓利 宾利 [Bin1 li4] /Bentley/
賓士 宾士 [Bin1 shi4] /Taiwan equivalent of 奔馳|奔驰[Ben1 chi2]/
賓夕法尼亞 宾夕法尼亚 [Bin1 xi1 fa3 ni2 ya4] /Pennsylvania/
賓夕法尼亞大學 宾夕法尼亚大学 [Bin1 xi1 fa3 ni2 ya4 Da4 xue2] /University of Pennsylvania/
賓夕法尼亞州 宾夕法尼亚州 [Bin1 xi1 fa3 ni2 ya4 zhou1] /Pennsylvania/
Until now, I managed to get the first two fields using split(), but I can't figure out how to extract the other two fields (say, for the second line, "bin1 zhu3" and "host and guest"). I have been trying to use a regex, but it doesn't work for a reason I don't understand.
#!/bin/python
#coding=utf-8
import re

class REMatcher(object):
    def __init__(self, matchstring):
        self.matchstring = matchstring

    def match(self, regexp):
        self.rematch = re.match(regexp, self.matchstring)
        return bool(self.rematch)

    def group(self, i):
        return self.rematch.group(i)

def look(character):
    myFile = open("/home/quentin/cedict_ts.u8", "r")
    for line in myFile:
        line = line.rstrip()
        elements = line.split(" ")
        try:
            if line != "" and elements[1] == character:
                myFile.close()
                return line
        except:
            myFile.close()
            break
    myFile.close()
    return "Aucun résultat :("

translation = look("賓主")  # translation contains one line of the file
elements = translation.split()
traditionnal = elements[0]
simplified = elements[1]
print "Traditionnal:" + traditionnal
print "Simplified:" + simplified
m = REMatcher(translation)
tr = ""
if m.match(r"\[(\w+)\]"):
    tr = m.group(1)
print "Pronouciation:" + tr
Any help appreciated.
This builds a dictionary to look up translations by either simplified or traditional characters and works in both Python 2.7 and 3.3:
# coding: utf8
import re
import codecs

# Process the whole file, decoding from UTF-8 to Unicode
with codecs.open('cedict_ts.u8', encoding='utf8') as datafile:
    D = {}
    for line in datafile:
        # Skip comment lines
        if line.startswith('#'):
            continue
        trad, simp, pinyin, trans = re.match(r'(.*?) (.*?) \[(.*?)\] /(.*)/', line).groups()
        D[trad] = (simp, pinyin, trans)
        D[simp] = (trad, pinyin, trans)
Output (Python 3.3):
>>> D['马克']
('馬克', 'Ma3 ke4', 'Mark (name)')
>>> D['一路顺风']
('一路順風', 'yi1 lu4 shun4 feng1', 'to have a pleasant journey (idiom)')
>>> D['馬克']
('马克', 'Ma3 ke4', 'Mark (name)')
Output (Python 2.7, you have to print strings to see non-ASCII characters):
>>> D[u'马克']
(u'\u99ac\u514b', u'Ma3 ke4', u'Mark (name)')
>>> print D[u'马克'][0]
馬克
I would continue to use splits instead of regular expressions, with the maximum split number given. It depends on how consistent the format of the input file is.
elements = translation.split(' ',2)
traditionnal = elements[0]
simplified = elements[1]
rest = elements[2]
print "Traditionnal:" + traditionnal
print "Simplified:" + simplified
elems = rest.split(']')
tr = elems[0].strip('[')
print "Pronouciation:" + tr
Output:
Traditionnal:賓主
Simplified:宾主
Pronouciation:bin1 zhu3
EDIT: To split the last field into a list, split on the /:
translations = elems[1].strip().strip('/').split('/')
#strip the spaces, then the first and last slash,
#then split on the slashes
Output (for the first line of input):
['visitor', 'guest', 'object (in grammar)']
Heh, I've done this exact same thing before. Basically you just need to use regex with groupings. Unfortunately, I don't know Python regex super well (I did the same thing using C#), but you should really do something like this:
matcher = r"(\b\w+\b) (\b\w+\b) \[(.*?)\] /(.*?)/"
Basically you match the entire line using one expression, but then you use ( ) to separate each item into a regex group. Then you just need to read the groups and voila!
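In Python, that idea might look like the following sketch; the exact pattern is an assumption based on the CEDICT lines shown above (\S+ is used for the two character fields instead of \b\w+\b):

```python
import re

line = "賓主 宾主 [bin1 zhu3] /host and guest/"
# one expression for the whole line; each ( ) is a separate regex group
m = re.match(r"(\S+) (\S+) \[(.*?)\] /(.*)/", line)
if m:
    trad, simp, pinyin, trans = m.groups()
    print(trad, simp, pinyin, trans.split('/'))
```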
