I am trying to clean some texts that are very different from one another. I would like to remove the headlines, quotation marks, abbreviations, special symbols, and periods that don't actually end sentences.
Example input:
This is a headline
And inside the text there are 'abbreviations', e.g. "bzw." in German or some German dates, like 2. Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely.
• they have
◦ different bullet points
- or even equations and
Sometimes there are special symbols. ✓
Example output:
And inside the text there are abbreviations, for example beziehungsweise in German or some German dates, like 2 Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely. Sometimes there are special symbols.
What I did:
with open(r'C:\Users\me\Desktop\ex.txt', 'r', encoding='utf8') as infile:
    data = infile.read()

data = data.replace("'", '')
data = data.replace("e.g.", 'for example')
# and so on

with open(r'C:\Users\me\Desktop\ex.txt', 'w', encoding='utf8') as outfile:
    outfile.write(data)
My problems (although number 2 is the most important):
1. I just want a string with this input, but it obviously breaks because of the quotation marks. Is there any way to do this other than working with files like I did? In reality, I'm copy-pasting a text and want an app to clean it.
2. The code seems very inefficient because I just manually write out the things that I remember to delete/clean, but I don't know all the abbreviations by heart. How do I clean it in one go, so to speak?
3. Is there any way to eliminate the headline, the enumeration, and the period (.) that appears in that German date? My code doesn't do that.
Edit: I just remembered things like text = re.sub(r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text), but regex is inefficient for huge texts, isn't it?
To easily remove all non-standard symbols you can use str.isalnum(), which only returns True for alphanumeric characters, or str.isascii() for ASCII-only strings; str.isprintable() seems viable too. A full list can be found here. Using those functions you can iterate over the string and filter each character. So, something like this:
filteredData = filter(str.isidentifier, data)
You can also combine those by creating a function that applies multiple checks, like this:
def FilterKey(char: str): return char.isidentifier() and char.isalpha()
Which can be used in filter like this:
filteredData = filter(FilterKey, data)
If it returns True, the character is included in the output; if it returns False, it is excluded.
You can also extend this by including your own checks on the chars in the return of the function. Afterwards, to remove larger chunks of text, you can use the usual str.replace(old, new).
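For instance, a rough sketch chaining the two steps; the keep_char rule and the sample text here are just illustrations, not a complete cleaning policy:

import re

def keep_char(ch: str) -> bool:
    # keep letters, digits and a little sentence punctuation; drops quotes, bullets, ticks, ...
    return ch.isalnum() or ch in ' .,\n'

data = "There are 'abbreviations', e.g. in dates like 2. Dezember 2017. ✓"
cleaned = ''.join(filter(keep_char, data))        # character-level filtering
cleaned = cleaned.replace('e.g.', 'for example')  # chunk-level replacements
# naive stab at the date problem: drop a period squeezed between a digit and a space;
# beware that this also eats a genuine sentence-ending period that follows a number
cleaned = re.sub(r'(?<=\d)\.(?=\s)', '', cleaned)
print(cleaned)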
Related
I have made a variable like below,
data = '''
hello my name is mj
and I like reading novels and webtoons
nice meeting you all!
'''
and used data.split('\n') to split by sentences.
The data came up like,
['', 'hello my name is mj', 'and I like reading novels and webtoons', 'nice meeting you all!', '']
In the above list, why are there double quotation marks (") at the start and at the end? Are those single sentences like 'hello my name is mj' and 'and I like ~' tied up as one string? If so, why?
Wait, while writing this question I think I got the answer: it is not a double quotation, it is two single quotations written in a row. As there is nothing written next to the two '''s, the split just made an empty string.
There is a \n character at the beginning and end of your string, therefore it is also part of the return value from split. You can do something like this:
[x for x in data.split('\n') if x]
Using list comprehension with a condition to filter only lines that are not empty.
...to split by sentence
There is the native splitlines method for this. It is advisable to use that method, as it is aware of all the different conventions for line breaks.
Also, it will not create an extra entry at the end when the input ends with a line break like in your example. However, since you have an explicit empty line at the beginning, that one would still be included.
It might be a pragmatic solution to just strip your input from surrounding white space:
data.strip().splitlines()
For your example input, this will evaluate to:
[
'hello my name is mj',
'and I like reading novels and webtoons',
'nice meeting you all!'
]
I am working with some .txt files that don't have structure (they are messy); they represent a number of pages. In order to give them some structure, I would like to mark the page numbers, since the files themselves don't have them. This can be done by replacing every run of three newlines with an annotation like:
\n
page: N
\n
Where N is the page number. I tried a simple replace, but it gets confused and does not give me the expected format. Any idea how to replace that whitespace with some kind of identifier, just so I can parse the files and get the position of some information (the page)?
I also tried this:
import re
replaced = re.sub(r'\b(\s+\t+)\b', '\n\n\n', text)
print(replaced)
If the format is as regular as you state in your problem description:
Replace every occurrence of three newlines \n with page: N
You wouldn't have to use the re module. Something as simple as the following would do the trick:
>>> s='aaaaaaaaaaaaaaaaa\n\n\nbbbbbbbbbbbbbbbbbbbbbbb\n\n\nccccccccccccccccccccccc'
>>> pages = s.split('\n\n\n')
>>> ''.join(page + '\n\tpage: {}\n'.format(i + 1) for i, page in enumerate(pages))
'aaaaaaaaaaaaaaaaa\n\tpage: 1\nbbbbbbbbbbbbbbbbbbbbbbb\n\tpage: 2\nccccccccccccccccccccccc\n\tpage: 3\n'
I suspect, though, that your format is less regular than that, but you'll have to include more details before I can give a good answer for that.
If you want to split with messy whitespace (which I'll define as at least three newlines with any other whitespace mixed in), you can replace s.split('\n\n\n') with:
re.split(r'(?:\n\s*?){3,}', s)
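For example, a quick demonstration of that variant on input with stray spaces and tabs mixed into the breaks:

import re

s = 'aaaa\n \n\t\nbbbb\n\n\n\ncccc'  # three-plus newlines with other whitespace mixed in
pages = re.split(r'(?:\n\s*?){3,}', s)
print(pages)  # ['aaaa', 'bbbb', 'cccc']
print(''.join(page + '\n\tpage: {}\n'.format(i + 1) for i, page in enumerate(pages)))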
Here's the problem:
I copied and pasted this entire list to a txt file from https://www.cboe.org/mdx/mdi/mdiproducts.aspx
Sample of text lines:
BFLY - The CBOE S&P 500 Iron Butterfly Index
BPVIX - CBOE/CME FX British Pound Volatility Index
BPVIX1 - CBOE/CME FX British Pound Volatility First Term Structure Index
BPVIX2 - CBOE/CME FX British Pound Volatility Second Term Structure Index
These lines of course appear normal in my text file, and I saved the file with utf-8 encoding.
My goal is to use python to strip out only the symbols from this long list, e.g. BFLY, BPVIX etc., and write them to a new file.
I am using the following code to read the file and split it:
x=open('sometextfile.txt','r')
y=x.read().split()
The issue I'm seeing is that there are unfamiliar characters popping up and they are affecting my ability to filter the list. Example:
print(y[0])
BFLY
(the output looks like plain BFLY here, but an invisible extra character is evidently attached to it)
I'm guessing that these characters have something to do with the encoding, and I have tried a few different things with the codecs module without success. Using .decode('utf-8') throws an error when I try it on the variables x or y above. I am able to use .encode('utf-8'), which obviously makes things even worse.
The main problem shows up when I try to loop through the list and remove any items that are not all upper case or that contain non-alpha characters. Ex:
y[0].isalpha()
False
y[0].isupper()
False
So in this example the symbol BFLY ends up being removed from the list.
The funny thing is that these characters are not present in the txt file if I do something like:
q=open('someotherfile.txt','w')
q.write(y[0])
Any help would be greatly appreciated. I would really like to understand why this frequently happens when copying and pasting text from web pages like this one.
Why not use Regex?
I think this will catch the letters in caps
"[A-Z]{1,}/?[A-Z]{1,}[0-9]?"
This works better: I got a list of all such symbols. Here's my result.
['BFLY', 'CBOE', 'BPVIX', 'CBOE/CME', 'FX', 'BPVIX1', 'CBOE/CME', 'FX', 'BPVIX2', 'CBOE/CME', 'FX']
Here's the code
import re
reg_obj = re.compile(r'[A-Z]{1,}/?[A-Z]{1,}[0-9]?')
sym = reg_obj.findall(a)  # 'a' holds the text read from the file
print(sym)
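As for where the stray characters come from: text copied from web pages and saved on Windows often starts with a UTF-8 byte order mark, which would explain y[0].isalpha() being False while the file looks clean. That is only a guess from the symptoms, but if it's right, opening the file with the utf-8-sig codec strips it:

# utf-8-sig decodes UTF-8 and silently drops a leading BOM, if one is present
with open('sometextfile.txt', 'r', encoding='utf-8-sig') as x:
    y = x.read().split()

print(y[0].isalpha())  # True once the BOM is gone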
I'm an experienced C programmer, but a complete python newbie. I'm learning python mostly for fun, and as a first exercise want to parse a text file, extracting the meaningful bits from the fluff, and ending up with a tab-delimited string of those bits in a different order.
I've had a blast plowing through tutorials and documentation and stackoverflow Q&As, merrily splitting strings and reading lines from files and etc. Now I think I'm at the point where I need a few road signs from experienced folks to avoid blind alleys.
Here's one chunk of the text I want to parse (you may recognize this as a McMaster order). The actual file will contain one or more chunks like this.
1 92351A603 Lag Screw for Wood, 18-8 Stainless Steel, 5/16" Diameter, 5" Long, packs of 5
Your Part Number: 7218-GYROID
22
packs today
5.85
per pack 128.70
Note that the information is split over several lines in the file. I'd like to end up with a tab-delimited string that looks like this:
22\tpacks\tLag Screw for Wood, 18-8 Stainless Steel, 5/16" Diameter, 5" Long, packs of 5\t\t92351A603\t5.85\t\t128.70\t7218-GYROID\n
So I need to extract some parts of the string while ignoring others, rearrange them a bit, and re-pack them into a string.
Here's the (very early) code I have at the moment: it reads the file a line at a time and splits each line with delimiters, and I end up with several lists of strings, including a bunch of empty ones where there were double tabs:
import sys
import string

def split(delimiters, string, maxsplit=0):
    """Split the given string with the given delimiters (an array of strings)
    This function lifted from stackoverflow in a post by Kos"""
    import re
    regexPattern = '|'.join(map(re.escape, delimiters))
    return re.split(regexPattern, string, maxsplit)

delimiters = "\t", "\n", "\r", "Your Part Number: "

with open(sys.argv[1], 'r') as f:
    for line in f:
        print(split(delimiters, line))
f.close()
Question 1 is basic: how can I remove the empty strings from my lists, then mash all the strings together into one list? In C I'd loop through all the lists, ignoring the empties and sticking the other strings in a new list. But I have a feeling python has a more elegant way to do this sort of thing.
Question 2 is more open ended: what's a robust strategy here? Should I read more than one line at a time in the first place? Make a dictionary, allowing easier re-ordering of the items later?
Sorry for the novel. Thanks for any pointers. And please, stylistic comments are more than welcome, style matters.
You don't need to close the file when using with.
If I were to implement this, I might use one big regex to extract the parts from each chunk (with finditer) and reassemble them for output.
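A sketch of that idea, assuming every chunk looks exactly like your sample (the group names are mine):

import re
import sys

chunk_re = re.compile(
    r'(?P<line>\d+)\s+(?P<part>\S+)\s+(?P<desc>.+?)\n'
    r'Your Part Number: (?P<mypart>\S+)\n'
    r'(?P<qty>\d+)\n'
    r'(?P<unit>\w+) today\n'
    r'(?P<price>[\d.]+)\n'
    r'per \w+ (?P<total>[\d.]+)'
)

with open(sys.argv[1], 'r') as f:
    text = f.read()

for m in chunk_re.finditer(text):
    # reassemble in the desired order; the empty strings produce the double tabs
    print('\t'.join([m['qty'], m['unit'], m['desc'], '', m['part'],
                     m['price'], '', m['total'], m['mypart']]))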
You can remove empty strings by:
new_list = filter(None, old_list)
Replace the first parameter with a lambda expression that is True for elements you want to keep. Passing None is equivalent to lambda x: x.
You can mash strings together into one string using:
a_string = "".join(list_of_strings)
Here the empty string is the separator, so the pieces are joined with nothing in between, but you can use any string as the separator.
If you have several lists (of whatever) and you want to join them together into one list, then:
new_list = reduce(lambda x, y: x + y, old_list)
That will simply concatenate them.
If you're new to Python, then functions like filter and reduce may seem a bit alien, but they save a lot of time coding, so it's worth getting to know them. (Note: in Python 3, reduce moved to functools, and filter returns a lazy iterator instead of a list.)
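A quick sketch of those pieces together, on made-up split results:

from functools import reduce

lists = [['22', '', 'packs'], ['', '5.85'], ['128.70']]  # pretend per-line split output

merged = reduce(lambda x, y: x + y, lists)  # ['22', '', 'packs', '', '5.85', '128.70']
kept = list(filter(None, merged))           # ['22', 'packs', '5.85', '128.70']
print('\t'.join(kept))                      # one tab-delimited string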
I think you're on the right track to solving your problem. I'd do this:
break up everything into lines
break the resulting list into smaller list, one list per order
parse the orders into "something meaningful"
sort, output the result
Personally, I'd make a class to handle the last two parts (they kind of belong together logically) but you could get by without it.
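For instance, something along these lines; the field names and methods are only a sketch:

class Order:
    def __init__(self, lines):
        self.lines = lines  # raw lines belonging to one order
        self.fields = {}    # parsed name -> value, filled in by parse()

    def parse(self):
        # "something meaningful": pull out whatever the raw lines contain
        for line in self.lines:
            if line.startswith('Your Part Number: '):
                self.fields['my_part'] = line[len('Your Part Number: '):]
        return self

    def as_row(self, column_order):
        # reorder into a tab-delimited string; missing fields become empty columns
        return '\t'.join(self.fields.get(name, '') for name in column_order)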
I would like to be able to search a dictionary in Python using user input wildcards.
I have found this:
import fnmatch
lst = ['this','is','just','a','test', 'thing']
filtered = fnmatch.filter(lst, 'th*')
This matches this and thing. Now if I try to input a whole file and search through it:
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower()

filtered = fnmatch.filter(file_contents, 'th*')
this doesn't match anything. The difference is that the file I am reading from is a text file (a Shakespeare play), so it has spaces and is not a list. I can match single letters: if I search for just 't' I get a bunch of t's. So this tells me that I am matching single characters; however, I want to match whole words, and more than that, to preserve the wildcard structure.
What I would like is for a user to enter text (including what will be a wildcard) and to have it substituted in place of 'th*', with the wildcard still doing what it should. That leads to the question: can I just stick in a variable holding the search text in place of 'th*'? After some investigation I am wondering if I am supposed to translate the 'th*' first, and have found something such as:
regex = fnmatch.translate('th*')
print(regex)
which outputs th.*\Z(?ms)
Is this the right way to go about doing this? I don't know if it is needed.
What would be the best way of "passing in regex formulas", and what do I have wrong in the code, since it does not operate on the incoming string in the second snippet the way it (correctly) does in the first?
If the problem is just that you "have spaces and it is not a list," why not make it into a list?
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower().split()  # split on whitespace to make a list of words

filtered = fnmatch.filter(file_contents, 'th*')
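To plug in user input: fnmatch.filter happily takes the pattern from a variable, so no translation is strictly needed; fnmatch.translate is useful if you want a compiled regex you can reuse. A sketch, with user_pattern standing in for whatever the user typed:

import fnmatch
import re

user_pattern = 'th*'  # stand-in for the user's input

with open('testfilefolder/wssnt10.txt') as f:
    words = f.read().lower().split()

matches = fnmatch.filter(words, user_pattern)  # option 1: pass the variable directly

word_re = re.compile(fnmatch.translate(user_pattern))  # option 2: translate once, reuse
matches = [w for w in words if word_re.match(w)]
print(matches)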