Remove list values that consist only of special characters - Python

From a PDF file I extract all the text as a string and convert it into a list by splitting on newlines (one or more), on runs of two or more whitespace characters, and on every dot (.).
Now, if a value in the list consists of only special characters, I want that value to be excluded.
import re
import PyPDF2

pdfFileObj = open('Python String.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
text = pageObj.extractText()
z = re.split("\n+|[.]|\s{2,}", text)
while "" in z:
    z.remove("")
print(z)
My output is
['split()', 'method in Python split a string into a list of strings after breaking the', 'given string by the specified separator', 'Syntax', ':', 'str', 'split(separator, maxsplit)', 'Parameters', ':', 'separator', ':', 'This is a delimiter', ' The string splits at this specified separator', ' If is', 'no', 't provided then any white space is a separator', 'maxsplit', ':', 'It is a number, which tells us to split the string into maximum of provi', 'ded number of times', ' If it is not provided then the default is', '-', '1 that means there', 'is no limit', 'Returns', ':', 'Returns a list of s', 'trings after breaking the given string by the specifie', 'd separator']
Here are some values that contain only special characters and I want to remove those. Thanks

Remove the special characters before converting the text to a list.
Remove the while "" in z: z.remove("") loop and add the following line right after reading the text variable:
text = re.sub('(a|b|c)', '', text)
In this example, the special characters to remove are a, b and c.
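The same idea scales to several characters with a character class (a small sketch; ':' and '-' here stand in for whatever characters need removing):

```python
import re

text = "Parameters : separator - maxsplit"
# Strip the unwanted special characters (here ':' and '-') before splitting
cleaned = re.sub(r"[:\-]", "", text)
print(cleaned)
```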

Use a regular expression that tests if a string contains any letters or numbers.
import re
z = [x for x in z if re.search(r'[a-z\d]', x, flags=re.I)]
In the regexp, a-z matches letters, \d matches digits, so [a-z\d] matches any letter or digit (and the re.I flag makes it case-insensitive). So the list comprehension includes any elements of z that contain a letter or digit.
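Applied to a trimmed-down sample of the list above, the filter keeps only entries with at least one letter or digit:

```python
import re

# A short sample of the extracted list, including punctuation-only entries
z = ['split()', ':', 'Syntax', '-', 'str', '...']
z = [x for x in z if re.search(r'[a-z\d]', x, flags=re.I)]
print(z)  # ['split()', 'Syntax', 'str']
```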

Related

Regex matching optional pattern/string

I have a list of strings, and I'm trying to write regex that captures groups of strings that may or may not contain a certain pattern.
any ascii character string
another string = other stuff
string = another string = string
I'm trying to capture the part of the string before first occurrence of the pattern (" = ") and after the pattern. I've tried this:
\s*?([\x00-\x7F]+)( = )?(.*)?
but then it just captures the entire string as one group.
How would I do this?
You can solve this with regular expressions:
>>> text = '''any ascii character string
another string = other stuff
string = another string = string'''
>>> re.findall('^([^=]+?)(?:=(.*?))?$', text, re.M)
[('any ascii character string', ''),
('another string ', ' other stuff'),
('string ', ' another string = string')]
But in this case, a lot easier would be to just split by line and then partition/split the line by the first equals character:
>>> [line.split('=', 1) for line in text.splitlines()]
[['any ascii character string'],
['another string ', ' other stuff'],
['string ', ' another string = string']]
If you don’t like that whitespace, just strip it away:
>>> [list(map(str.strip, line.split('=', 1))) for line in text.splitlines()]
[['any ascii character string'],
['another string', 'other stuff'],
['string', 'another string = string']]
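The partition variant mentioned above works similarly but always yields a before/after pair; a brief illustration (dropping the separator itself with [::2]):

```python
text = '''any ascii character string
another string = other stuff
string = another string = string'''

# partition('=') returns (before, sep, after); [::2] keeps only before/after
pairs = [line.partition('=')[::2] for line in text.splitlines()]
print(pairs)
```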

Split and replace a string with another string of same length in python

I want to split the given string into letters, digits and special characters.
After splitting, these have to be replaced by other letters, digits and special characters respectively.
E.g. abc123wer#xyz.com is the given string. Then
Split output: ['abc','123','wer','#','xyz','.','com']
Replacing should happen from a file which contains some letters, digits and special characters.
Replacement output: ['xyz','231','etr','$','pou','#','fin']
One option to split the string is to use regex with re module to match letters [a-zA-Z]+, digits [0-9]+ and non-alphanumeric [^a-zA-Z0-9]+ respectively:
import re

s = 'abc123wer#xyz.com'
re.findall("[a-zA-Z]+|[0-9]+|[^a-zA-Z0-9]+", s)
# ['abc', '123', 'wer', '#', 'xyz', '.', 'com']
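The replacement half of the question can be sketched by mapping each token through a lookup table (the replacements dict below is made up for illustration; in practice it would be built from the file of substitutes):

```python
import re

s = 'abc123wer#xyz.com'
tokens = re.findall("[a-zA-Z]+|[0-9]+|[^a-zA-Z0-9]+", s)

# Hypothetical replacement table; in practice it would be loaded from the
# file of substitutes mentioned in the question.
replacements = {'abc': 'xyz', '123': '231', 'wer': 'etr',
                '#': '$', 'xyz': 'pou', '.': '#', 'com': 'fin'}
replaced = [replacements.get(tok, tok) for tok in tokens]
print(replaced)  # ['xyz', '231', 'etr', '$', 'pou', '#', 'fin']
```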

Python split with multiple delimiters not working

I have a string:
feature.append(freq_and_feature(text, freq))
I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.
These strings are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, like the accepted answer for this one (Python: Split string with multiple delimiters):
import re
with open("helper.txt", "r") as helper:
    for row in helper:
        print re.split('\' .,()_', row)
However, I get the following, which is not what I want.
['feature.append(freq_and_feature(text, freq))\n']
re.split('\' .,()_', row)
This looks for the string ' .,()_ to split on. You probably meant
re.split('[\' .,()_]', row)
re.split takes a regular expression as the first argument. To say "this OR that" in regular expressions, you write a|b, which matches either a or b; ab would only match a followed by b. Luckily, so that we don't have to write '| |.|,|(|..., there is a shorthand: anything inside []s is treated as "match any one of these".
It seems you want to split a string with non-word or underscore characters. Use
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
See the IDEONE demo
The [\W_]+ regex matches 1+ characters that are not word (\W = [^a-zA-Z0-9_]) or underscores.
You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).
You can try this
str = re.split('[.(_,)]+', row, flags=re.IGNORECASE)
str.pop()
print str
This will result:
['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']
I think you are trying to split on the basis of non-word characters. It should be
re.split(r'[^A-Za-z0-9]+', s)
[^A-Za-z0-9] can be translated to --> [\W_]
Python Code
import re

s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])
This will also work, indeed
import re

p = re.compile(r'[^\W_]+')
test_str = "feature.append(freq_and_feature(text, freq))"
print(re.findall(p, test_str))
Ideone Demo

How can I simplify this function?

Is there any way to simplify this function? Specifically, I'd like to rewrite it with fewer lines of indentation.
# split string (first argument) at location of separators (second argument, should be a string)
def split_string(text, separators):
    text = ' ' + text + ' '
    words = []
    word = ""
    for char in text:
        if char not in separators:
            word += char
        else:
            if word:
                words.append(word)
                word = ""
    if not words:
        words.append(text)
    return words
Try using re.split, for example:
re.split('[%s]' % separators, string)
The [] creates a regular expression character class to split on.
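Note that this breaks if separators contains regex metacharacters such as . or ]; re.escape guards against that. A small sketch (split_on is a hypothetical helper name):

```python
import re

def split_on(text, separators):
    # re.escape makes characters like '.' or ']' literal inside the class;
    # the filter drops empty strings produced by consecutive separators.
    pattern = '[%s]' % re.escape(separators)
    return [part for part in re.split(pattern, text) if part]

print(split_on("foo.,.bar", ".,"))  # ['foo', 'bar']
```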
Your code seems to produce
>>> split_string("foo.,.bar", ".,")
[' foo']
but your comment says
split_string("foo.,.bar", ".,") will return ["foo", "bar"]
Assuming the comment is what's intended, then I'd use itertools.groupby (I hate using regexes):
from itertools import groupby
def splitter(text, separators):
    grouped = groupby(text, lambda c: c in separators)
    return [''.join(g) for k, g in grouped if not k]
which gives
>>> splitter("foo.,.bar", ".,")
['foo', 'bar']
groupby returns an iterator over consecutive terms grouped by some function -- in this case, lambda c: c in separators -- of the terms.
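To see what groupby is doing here, one can inspect the groups it produces for a sample input:

```python
from itertools import groupby

# Each group is a run of either separator or non-separator characters
separators = ".,"
groups = [(is_sep, ''.join(g))
          for is_sep, g in groupby("foo.,.bar", lambda c: c in separators)]
print(groups)  # [(False, 'foo'), (True, '.,.'), (False, 'bar')]
```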
You should use the split() method. Taken from the official documentation:
str.split([sep[, maxsplit]])
Return a list of the words in the string, using sep as the delimiter string.
If maxsplit is given, at most maxsplit splits are done (thus, the list will have
at most maxsplit+1 elements). If maxsplit is not specified or -1, then there is no
limit on the number of splits (all possible splits are made).
If sep is given, consecutive delimiters are not grouped together and are deemed
to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2']).
The sep argument may consist of multiple characters (for example,
'1<>2<>3'.split('<>') returns ['1', '2', '3']). Splitting an empty string with a
specified separator returns [''].
If sep is not specified or is None, a different splitting algorithm is applied:
runs of consecutive whitespace are regarded as a single separator, and the result
will contain no empty strings at the start or end if the string has leading or
trailing whitespace. Consequently, splitting an empty string or a string consisting
of just whitespace with a None separator returns [].
For example, ' 1 2 3 '.split() returns ['1', '2', '3'], and
' 1 2 3 '.split(None, 1) returns ['1', '2 3 '].
You can do:
myString = "Some-text-here"
splitWords = myString.split("-")
The above code will return a list of the separated words. I used "-" as the delimiter; you may use any delimiter you like. The default is the whitespace delimiter, like this:
myString = "Some text here"
splitWords = myString.split()

Splitting strings separated by multiple possible characters?

...note that values will be delimited by one or more space or TAB characters
How can I use the split() method if there are multiple separating characters of different types, as in this case?
By default, split() can handle multiple types of whitespace. Not sure if it's enough for what you need, but try it:
>>> s = "a \tb c\t\t\td"
>>> s.split()
['a', 'b', 'c', 'd']
It certainly works for multiple spaces and tabs mixed.
Split using regular expressions and not just one separator:
http://docs.python.org/2/library/re.html
I had the same problem with some strings separated by different whitespace chars, and used \s as shown in the Regular Expressions library specification.
\s matches any whitespace character, this is equivalent to the set [ \t\n\r\f\v].
You will need to import re as the regular expression handler:
import re
line = "something separated\t by \t\t\t different \t things"
workstr = re.sub('\s+','\t',line)
So any run of one or more (+) whitespace characters (\s) is transformed into a single tab (\t), which you can then reprocess with split('\t'):
workstr = "something\tseparated\tby\tdifferent\tthings"
newline = workstr.split('\t')
newline = ['something', 'separated', 'by', 'different', 'things']
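An equivalent shortcut (assuming only whitespace delimiters matter) skips the substitution step and splits directly on runs of whitespace:

```python
import re

line = "something separated\t by \t\t\t different \t things"
# One or more whitespace characters of any kind act as a single delimiter
parts = re.split(r'\s+', line)
print(parts)
```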
Do a text substitution first then split.
e.g. replace all tabs with spaces, then split on space.
You can use regular expressions first:
import re
re.sub('\s+', ' ', 'text with whitespace etc').split()
['text', 'with', 'whitespace', 'etc']
For whitespace delimiters, str.split() already does what you may want. From the Python Standard Library,
str.split([sep[, maxsplit]])
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
For example, ' 1 2 3 '.split() returns ['1', '2', '3'], and ' 1 2 3 '.split(None, 1) returns ['1', '2 3 '].
