Removing encoded text from strings read from txt file - python

Here's the problem:
I copied and pasted this entire list to a txt file from https://www.cboe.org/mdx/mdi/mdiproducts.aspx
Sample of text lines:
BFLY - The CBOE S&P 500 Iron Butterfly Index
BPVIX - CBOE/CME FX British Pound Volatility Index
BPVIX1 - CBOE/CME FX British Pound Volatility First Term Structure Index
BPVIX2 - CBOE/CME FX British Pound Volatility Second Term Structure Index
These lines of course appear normal in my text file, and I saved the file with utf-8 encoding.
My goal is to use Python to strip out only the symbols from this long list, e.g. BFLY, BPVIX, etc., and write them to a new file.
I am using the following code to read the file and split it:
x = open('sometextfile.txt', 'r')
y = x.read().split()
The issue I'm seeing is that there are unfamiliar characters popping up, and they are affecting my ability to filter the list. Example:
print(y[0])
BFLY
(The stray characters are invisible when printed, but they are still in the string.)
I'm guessing that these characters have something to do with the encoding and I have tried a few different things with the codec module without success. Using .decode('utf-8') throws an error when trying to use it against the above variables x or y. I am able to use .encode('utf-8'), which obviously makes things even worse.
The main problem shows up when I try to loop through the list and remove any items that are not all upper case or that contain non-alpha characters. Ex:
y[0].isalpha()
False
y[0].isupper()
False
So in this example the symbol BFLY ends up being removed from the list.
Funny thing is that these characters are not present in a txt file if I do something like:
q=open('someotherfile.txt','w')
q.write(y[0])
Any help would be greatly appreciated. I would really like to understand why this frequently happens when copying and pasting text from web pages like this one.

Why not use Regex?
I think this will catch the letters in caps
"[A-Z]{1,}/?[A-Z]{1,}[0-9]?"
This is better. I got a list of all such symbols. Here's my result.
['BFLY', 'CBOE', 'BPVIX', 'CBOE/CME', 'FX', 'BPVIX1', 'CBOE/CME', 'FX', 'BPVIX2', 'CBOE/CME', 'FX']
Here's the code
import re
# `a` holds the full text read from the file
reg_obj = re.compile(r'[A-Z]{1,}/?[A-Z]{1,}[0-9]?')
sym = reg_obj.findall(a)
print(sym)
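If only the ticker at the start of each line is wanted, anchoring the pattern helps, and opening the file with the utf-8-sig codec strips the invisible byte-order mark (\ufeff) that often rides along when a web copy-paste is saved as UTF-8. A minimal sketch, assuming the 'SYMBOL - description' layout shown above:

import re

# utf-8-sig removes a leading byte-order mark (\ufeff) if one is present
with open('sometextfile.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()

# one symbol at the start of each line: capital letters plus optional digits
symbols = re.findall(r'^[A-Z]+[0-9]*', text, re.MULTILINE)
print(symbols)  # ['BFLY', 'BPVIX', 'BPVIX1', 'BPVIX2']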

Related

Removing various symbols from a text

I am trying to clean some texts that are very different from one another. I would like to remove the headlines, quotation marks, abbreviations, special symbols, and periods that don't actually end sentences.
Example input:
This is a headline
And inside the text there are 'abbreviations', e.g. "bzw." in German or some German dates, like 2. Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely.
• they have
◦ different bullet points
- or even equations and
Sometimes there are special symbols. ✓
Example output:
And inside the text there are abbreviations, for example beziehungsweise in German or some German dates, like 2 Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely. Sometimes there are special symbols.
What I did:
with open(r'C:\Users\me\Desktop\ex.txt', 'r', encoding="utf8") as infile:
    data = infile.read()
    data = data.replace("'", '')
    data = data.replace("e.g.", 'for example')
    # and so on
with open(r'C:\Users\me\Desktop\ex.txt', 'w', encoding="utf8") as outfile:
    outfile.write(data)
My problems (although number 2 is the most important):
1. I just want a string with this input, but it obviously breaks because of the quotation marks. Is there any way to do this other than working with files like I did? In reality, I'm copy-pasting a text and want an app to clean it.
2. The code seems very inefficient because I just manually write the things that I remember to delete/clean, but I don't know all the abbreviations by heart. How do I clean it in one go, so to speak?
3. Is there any way to eliminate the headline and enumeration, and the period (.) that appears in that German date? My code doesn't do that.
Edit: I just remembered stuff like text = re.sub(r"(#\[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text), but regex is inefficient for huge texts, isn't it?
To easily remove all non-standard symbols you can use str.isalnum(), which only returns True for alphanumeric characters, or str.isascii() for ASCII-only strings. str.isprintable() seems viable too. A full list can be found in the Python documentation on string methods. Using those functions you can iterate over the string and filter each character. So something like this:
filteredData = ''.join(filter(str.isidentifier, data))
You can also combine checks by creating a function that applies several of these predicates to each character, like this:
def FilterKey(char: str):
    return char.isidentifier() and char.isalpha()
Which can be used with filter like this:
filteredData = ''.join(filter(FilterKey, data))
If it returns True, the character is included in the output; if it returns False, it is excluded.
You can also extend this by adding your own checks on the characters in the function's return expression; afterwards, to remove larger chunks of text, you can use the usual str.replace(old, new).
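Putting those pieces together, here is a minimal sketch; the abbreviation table is a hypothetical example and would need to be extended by hand:

import re

# hypothetical lookup table; extend it with the abbreviations you care about
ABBREVIATIONS = {"e.g.": "for example", "bzw.": "beziehungsweise"}

def clean(text: str) -> str:
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    # drop bullets and other symbols: keep only letters, digits,
    # whitespace, commas and periods
    text = ''.join(ch for ch in text if ch.isalnum() or ch in ' \n,.')
    return re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace

Character-level filtering like this runs in one pass over the text, so it avoids writing a separate replace call for every stray symbol.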

Add special characters in csv pandas python

While writing strings containing certain special characters, such as
Töölönlahdenkatu
using to_csv from pandas, the result in the csv looks like
T%C3%B6%C3%B6l%C3%B6nlahdenkatu
How do we get the text of the string written out as it is? This is my to_csv command:
df.to_csv(csv_path,index=False,encoding='utf8')
I have even tried
df.to_csv(csv_path,index=False,encoding='utf-8')
df.to_csv(csv_path,index=False,encoding='utf-8-sig')
and still no success. There are other characters replaced with random symbols:
'-' to –
Is there a workaround?
What you're trying to do is remove German umlauts and Spanish tildes. There is an easy solution for that.
import unicodedata
data = u'Töölönlahdenkatu Adiós Pequeño'
english = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')
print(english)
output : b'Toolonlahdenkatu Adios Pequeno'
Let me know if it works or if there are any edge cases.
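If a str rather than bytes is needed (for example, to put the cleaned value back into the DataFrame before to_csv), decoding the result completes the round trip:

ascii_text = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore').decode('ASCII')
print(ascii_text)  # Toolonlahdenkatu Adios Pequeno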
Special characters like ö cannot be stored in a CSV the same way English letters can. The "random symbols" tell a program like Excel to interpret the letters as special characters when you open the file, but the special characters themselves cannot be seen when you view the CSV in VS Code (for instance).
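One more observation, offered as an assumption about how the data arrived: %C3%B6 is the URL percent-encoding of the UTF-8 bytes for ö, which suggests the values were URL-encoded before they ever reached pandas. If that is the case, decoding them first should fix the CSV output; a minimal sketch, using a hypothetical column named 'street':

from urllib.parse import unquote

df['street'] = df['street'].map(unquote)  # 'T%C3%B6%C3%B6l%C3%B6nlahdenkatu' -> 'Töölönlahdenkatu'
df.to_csv(csv_path, index=False, encoding='utf-8-sig')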

Python - Dividing a book in PDF form into individual text files that correspond with page numbers

I've converted my PDF file into a long string using PDFminer.
I'm wondering how I should go about dividing this string into smaller, individual strings/pages. Each page is delimited by a certain series of characters (CRLF, FF, page number, etc.), and the string should be split and written to a new text file wherever these characters occur.
I have no experience with regex, but is using the re module the best way to go about this?
My vague idea for implementation is that I have to iterate through the file using the re.search function, creating text files with each new form feed found. The only code I have is PDF > text conversion. Can anyone point me in the right direction?
Edit: I think the expression I should use is something like ^.*(?=(\d\n\n\d\n\n\f\bFavela\b)) (capture everything before 2 digits, the line breaks, and the book's title 'Favela', which appears on top of each page).
Can I save these \d digits as variables? I want to use them as file names as I iterate through the book and scoop up the portions of text divided by each appearance of \fFavela.
I'm thinking the re.sub method would do it, looping through and replacing with an empty string as I go.
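A minimal sketch of that approach, assuming (per the description above) that pages are separated by form feeds and that each page begins with its page number followed by the running title 'Favela':

import re

# assumed layout: pages separated by form feeds (\f), each page starting
# with its page number followed by the running title 'Favela'
with open('book.txt', encoding='utf-8') as f:
    text = f.read()

for page in text.split('\f'):
    m = re.match(r'\s*(\d+)\s*Favela', page)
    if not m:
        continue
    number = m.group(1)  # the captured page number, reused as a file name
    with open('page_{}.txt'.format(number), 'w', encoding='utf-8') as out:
        out.write(page)

re.match anchors at the start of each split-off page, and the captured group is available via m.group(1), so the page number can indeed be saved as a variable and reused as the file name.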

Why am I getting odd characters?

Sorry if this isn't a reproducible example, but I am guessing someone will know what to do when I describe the problem. The problem I have is that I am getting characters like "\xe2" and "\x80" from a txt file that I am reading in the following way:
words = open("directory/file.txt", "r")
liness = []
for x in words.readlines():
    liness.append(x.rstrip('\n'))
When I print liness I get the list I want, but then when I use max() in the following way:
max(liness, key=len)
it returns a line from file.txt that contains \xe2 and \x80. I know this probably has something to do with encoding, but I haven't had luck solving it. Anyone?
I tried to reproduce your error but used the following code:
import string

words = open("directory/file.txt", 'r', 0)
line = words.readline()
wordlist = string.split(line)
Unfortunately, I was not able to reproduce your error, as you would have guessed. My file was a txt file with a list of English words.
I assume that you are reading a .txt file with non-standard American English characters, correct? If you are not using American English characters, you might want to check out this post:
Handling non-standard American English Characters and Symbols in a CSV, using Python
You will need to determine what type of encoding/decoding to use based on your file.
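For what it's worth, \xe2 and \x80 are the leading bytes of multi-byte UTF-8 sequences for punctuation such as dashes and curly quotes (\xe2\x80\x93, for example, is EN DASH). In Python 3, declaring the encoding when opening the file decodes those byte sequences into the characters they represent; a minimal sketch:

with open("directory/file.txt", "r", encoding="utf-8") as f:
    liness = [line.rstrip('\n') for line in f]
print(max(liness, key=len))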

Python: Removing particular character (u"\u2610") from string

I have been wrestling with decoding and encoding in Python, and I can't quite figure out how to resolve my problem. I am looping over xml text files (sample) that are apparently coded in utf-8, using Beautiful Soup to parse each file, then looking to see if any sentence in the file contains one or more words from two different list of words. Because the xml files are from the eighteenth century, I need to retain the em dashes that are in the xml. The code below does this just fine, but it also retains a pesky box character that I wish to remove. I believe the box character is this character.
(You can find an example of the character I wish to remove in line 3682 of the sample file above. On this webpage, the character looks like an 'or' pipe, but when I read the xml file in Komodo, it looks like a box. When I try to copy and paste the box into a search engine, it looks like an 'or' pipe. When I print to console, though, the character looks like an empty box.)
To sum up, the code below runs without errors, but it prints the empty box character that I would like to remove.
for work in glob.glob(pathtofiles):
    openfile = open(work)
    readfile = openfile.read()
    stringfile = str(readfile)
    decodefile = stringfile.decode('utf-8', 'strict')  # is this the dodgy line?
    soup = BeautifulSoup(decodefile)
    textwithtags = soup.findAll('text')
    textwithtagsasstring = str(textwithtags)
    # this method strips everything between angle brackets as it should
    textwithouttags = stripTags(textwithtagsasstring)
    # clean text
    nonewlines = textwithouttags.replace("\n", " ")
    noextrawhitespace = re.sub(' +', ' ', nonewlines)
    print noextrawhitespace  # the boxes appear
I tried to remove the boxes by using
noboxes = noextrawhitespace.replace(u"\u2610", "")
But Python threw an error flag:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 280: ordinal not in range(128)
Does anyone know how I can remove the boxes from the xml files? I would be grateful for any help others can offer.
The problem is that you're mixing unicode and str. Whenever you do that, Python has to convert one to the other, which it does by using sys.getdefaultencoding(), which is usually ASCII, which is almost never what you want.*
If the exception comes from this line:
noboxes = noextrawhitespace.replace(u"\u2610", "")
… the fix is simple… except that you have to know whether noextrawhitespace is supposed to be a unicode object or a UTF-8-encoded str object. If the former, it's this:
noboxes = noextrawhitespace.replace(u"\u2610", u"")
If the latter, it's this:
noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
But really, you have to get all of the strings consistent in your code; mixing the two up is going to cause problems in more places than this one.
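A common way to get that consistency in Python 2 is to decode once at the boundary, when reading the file, and work with unicode objects everywhere in between; a minimal sketch using io.open, which accepts an encoding argument and returns unicode:

import io

with io.open(work, encoding='utf-8') as f:  # io.open decodes to unicode in Python 2
    decodefile = f.read()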
Since I don't have your XML files to test, I wrote my own:
<xml>
<text>abc☐def</text>
</xml>
Then, I added these two lines to the bottom of your code (and a bit to the top to just open my file instead of globbing for whatever):
noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
print noboxes
The output is now:
[<text>abc☐def</text>]
[<text>abc☐def</text>]
[<text>abcdef</text>]
So, I think that's what you want here.
* Sure sometimes you want ASCII… but those aren't usually the times when you have unicode objects…
Give this a try:
noextrawhitespace.replace("\\u2610", "")
I think you are just missing that extra '\'.
This might also work.
print(noextrawhitespace.decode('unicode_escape').encode('ascii','ignore'))
Reading your sample, the following are the non-ASCII characters in the document:
0x2223 DIVIDES
0x2022 BULLET
0x3009 RIGHT ANGLE BRACKET
0x25aa BLACK SMALL SQUARE
0x25ca LOZENGE
0x3008 LEFT ANGLE BRACKET
0x2014 EM DASH
0x2026 HORIZONTAL ELLIPSIS
\u2223 is the actual character in question in line 3682, and it is being used as a soft hyphen. The others are used in markup for tagging illegible characters, such as:
<GAP DESC="illegible" RESP="oxf" EXTENT="4+ letters" DISP="\u2022\u2022\u2022\u2022\u2026"/>
Here's some code to do what your code is attempting. Make sure to process in Unicode:
from bs4 import BeautifulSoup
import re
with open('k000039.000.xml') as f:
    soup = BeautifulSoup(f)  # BS figures out the encoding
text = u''.join(soup.strings)  # strings is a generator for just the text bits
text = re.sub(ur'\s+', ur' ', text)  # simplify all white space
text = text.replace(u'\u2223', u'')  # get rid of the DIVIDES character
print text
Output:
[[truncated]] reckon my self a Bridegroom too. Buckle. I doubt Kickey won't find him such. [Aside.] Mrs. Sago. Well,—poor Keckky's bound to good Behaviour, or she had lost quite her Puddy's Favour. Shall I for this repine at Fortune?—No. I'm glad at Heart that I'm forgiven so. Some Neighbours Wives have but too lately shown, When Spouse had left 'em all their Friends were flown. Then all you Wives that wou'd avoid my Fate. Remain contented with your present State FINIS.
