How to remove all non-alphabetic characters from a string? - python

I have been working on a program which will take a hex file, and if the file name starts with "CID", then it should remove the first 104 characters, and after that point there are a few words. I also want to remove everything after the words, but the problem is that the part I want to isolate varies in length.
My code is currently like this:
y = 0
import os
files = os.listdir(".")
filenames = []
for names in files:
    if names.endswith(".uexp"):
        filenames.append(names)
        y +=1
print(y)
print(filenames)
for x in range(1,y):
    filenamestart = (filenames[x][0:3])
    print(filenamestart)
    if filenamestart == "CID":
        openFile = open(filenames[x],'r')
        fileContents = (openFile.read())
        ItemName = (fileContents[104:])
        print(ItemName)
Input Example file (pulled from HxD):
.........................ýÿÿÿ................E.................!...1AC9816A4D34966936605BB7EFBC0841.....Sun Tan Specialist.................9.................!...9658361F4EFF6B98FF153898E58C9D52.....Outfit.................D.................!...F37BE72345271144C16FECAFE6A46F2A.....Don't get burned............................................................................................................................Áƒ*ž
I have got it working to remove the first 104 characters, but I would also like to remove the characters after 'Sun Tan Specialist', which will differ in length, so I am left with only that part.
I appreciate any help that anyone can give me.

One way to remove non-alphabetic characters in a string is to use regular expressions.
>>> import re
>>> re.sub(r'[^a-z]', '', "lol123\t")
'lol'
EDIT
The first argument r'[^a-z]' is the pattern that matches what will be removed (here, by replacing it with the empty string ''). The square brackets denote a character class (the pattern matches any single character in the class), the ^ negates the class, and a-z denotes all the lowercase ASCII letters. More information here:
https://docs.python.org/3/library/re.html#regular-expression-syntax
So for instance, to keep also capital letters and spaces it would be:
>>> re.sub(r'[^a-zA-Z ]', '', 'Lol !this *is* a3 -test\t12378')
'Lol this is a test'
However from the data you give in your question the exact process you need seems to be a bit more complicated than just "getting rid of non-alphabetical characters".
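For the file shown in the question, one possible approach is to read the file as bytes, skip the 104-byte header, and take the first run of letters, spaces and apostrophes. This is only a sketch: the function name first_text_run is made up for illustration, and it assumes the item name is plain ASCII, always starts at offset 104, and is followed by a non-text byte. None of that is confirmed beyond the single sample given.

```python
import re

def first_text_run(data: bytes, offset: int = 104) -> str:
    """Return the first run of ASCII letters, spaces and apostrophes at or after offset."""
    # Assumption: the item name is plain ASCII and is followed by a non-text byte.
    match = re.search(rb"[A-Za-z][A-Za-z' ]*", data[offset:])
    return match.group(0).decode("ascii").strip() if match else ""

# Synthetic stand-in for a .uexp file: 104 filler bytes, the name, then more data.
sample = b"\x00" * 104 + b"Sun Tan Specialist" + b"\x00" * 5 + b"Outfit"
print(first_text_run(sample))  # Sun Tan Specialist
```

With a real file, the bytes would come from open(filename, "rb").read() rather than text mode, since the file is not text.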

You can use filter:
import string
print(''.join(filter(lambda character: character in string.ascii_letters + string.digits, '(ABC), DEF!'))) # => ABCDEF
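If only letters should survive (no digits), a similar effect can probably be had with str.isalpha and a generator expression. The helper name keep_letters below is hypothetical; note that str.isalpha is Unicode-aware, so accented letters are kept as well.

```python
def keep_letters(text: str) -> str:
    """Drop every character for which str.isalpha() is False."""
    return "".join(ch for ch in text if ch.isalpha())

print(keep_letters("(ABC), DEF!"))  # ABCDEF
print(keep_letters("lol123\t"))     # lol
```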

You mentioned in a comment that you got the string down to Sun Tan SpecialistFEFFBFFECDOutfitDFBECFECAFEAFADont get burned
Essentially your goal at this point is to remove any uppercase letter that isn't immediately followed by a lowercase letter because Upper Lower indicates the start of a phrase. You can use a for loop to do this.
h = "Sun Tan SpecialistFEFFBFFECDOutfitDFBECFECAFEAFADont get burned"
output = ""
for i in range(len(h)):
    # Keep spaces
    if h[i] == " ":
        output += h[i]
    # Start of a phrase found, so separate with space and store character
    elif h[i].isupper() and i + 1 < len(h) and h[i + 1].islower():
        output += " " + h[i]
    # We want all lowercase characters
    elif h[i].islower():
        output += h[i]
# [1:] because we appended a space to the start of every word
print(output[1:])
# If you don't care about Outfit since it is always there, remove it
print(output[1:].replace("Outfit", ""))
Output:
Sun Tan Specialist Outfit Dont get burned
Sun Tan Specialist Dont get burned
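The same "uppercase followed by lowercase starts a phrase" idea can also be sketched with a single regular expression. The pattern below is an assumption about what the phrases look like (a capitalised word, optionally followed by more space-separated words); it is not taken from the question and may need adjusting for other files.

```python
import re

h = "Sun Tan SpecialistFEFFBFFECDOutfitDFBECFECAFEAFADont get burned"

# A phrase starts with an uppercase letter followed by lowercase letters,
# and may continue with further space-separated words.
phrases = re.findall(r"[A-Z][a-z]+(?: [A-Za-z][a-z]*)*", h)
print(phrases)     # ['Sun Tan Specialist', 'Outfit', 'Dont get burned']
print(phrases[0])  # Sun Tan Specialist
```

Because findall returns the phrases as a list, the item name is simply the first element, with no extra spaces to trim.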

Related

Splitting a string into smaller strings using split()

I'm trying to write a D&D dice roller program in Python. I'd like to have the input be typed in the form "xdy+z" (ex. 4d6+12, meaning roll 4 6-sided dice and add 12 to the result), and have the program "roll the dice", add the modifier, and output the result. I'm trying to figure out how to split the string into numbers so I can have the program do the math.
I know of the split() function, and I'm trying to use it. When I input the example above, I get the string [4 6 12], but when I have the program print string[1], I get a white space because it's still one full string. I'd either like to figure out how to get the program to identify the individual numbers in the string, or a way to split the full string into smaller strings (like string1 = [4], string2 = [6], string3 = [12]).
Yes, I tried Google and searching this site, but I'm not sure what the terminology for this type of process is, so it's been hard to find help.
Here's the relevant code:
separators = ["d", "+", "-"]
for sep in separators:
    inputText = inputText.replace(sep, ' ')
Solution without regular expressions, using find, slicing and split. First, find the position of + or - to get the last number. Then, split the rest at d. Also, convert the partial strings to int:
s = "4d6+12"
if "-" in s:
    pos = s.find("-")
    add = int(s[pos:])
elif "+" in s:
    pos = s.find("+")
    add = int(s[pos:])
else:
    add = 0
    pos = len(s)
rolls, sides = map(int, s[:pos].split("d"))
print(f"{rolls=} {sides=} {add=}")
You can use a regular expression for this instead of splitting the string up. Just capture each numeric position with a group. Check out the docs on the re package for details.
Here's a sample:
import re
pattern = r"(?P<rolls>\d+)d(?P<sides>\d+)\+(?P<add>\d+)"
match = re.match(pattern, input_text)
rolls = int(match["rolls"])
sides = int(match["sides"])
add = int(match["add"])
With input_text as "4d6+12" as in your example, the resulting values are:
print(rolls) # 4
print(sides) # 6
print(add) # 12
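The pattern above only accepts a "+z" modifier. A possible extension, sketched here under the assumption that the modifier is optional and may be negative, makes the last group optional and signed and then actually rolls the dice. The names DICE_PATTERN, parse_dice and roll are made up for this example.

```python
import random
import re

DICE_PATTERN = re.compile(r"(\d+)d(\d+)([+-]\d+)?")

def parse_dice(text):
    """Split 'xdy', 'xdy+z' or 'xdy-z' into (rolls, sides, modifier)."""
    match = DICE_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a dice expression: {text!r}")
    rolls, sides, modifier = match.groups()
    # The modifier group is None when no '+z' or '-z' part was given.
    return int(rolls), int(sides), int(modifier or 0)

def roll(text, rng=random):
    """Roll the dice described by text and add the modifier."""
    rolls, sides, modifier = parse_dice(text)
    return sum(rng.randint(1, sides) for _ in range(rolls)) + modifier

print(parse_dice("4d6+12"))  # (4, 6, 12)
print(parse_dice("2d8-3"))   # (2, 8, -3)
print(parse_dice("1d20"))    # (1, 20, 0)
print(roll("4d6+12"))        # some value between 16 and 36
```

Using fullmatch rather than match rejects trailing junk such as "4d6+12abc" instead of silently ignoring it.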

Count total number of words in a file?

I want to find the total number of words in a file (text/string). I was able to get an output with my code, but I'm not sure if it is correct. Here are some sample files for y'all to try and see what you get.
Also note, use of any modules/libraries is not permitted.
sample1: https://www.dropbox.com/s/kqwvudflxnmldqr/sample1.txt?dl=0
sample2 - https://www.dropbox.com/s/7xph5pb9bdf551h/sample2.txt?dl=0
sample3 - https://www.dropbox.com/s/4mdb5hgnxyy5n2p/sample3.txt?dl=0
There are some things you must consider before counting the words.
A sentence is a sequence of words followed by either a full-stop, question mark or exclamation mark, which in turn must be followed either by a quotation mark (so the sentence is the end of a quote or spoken utterance), or white space (space, tab or new-line character).
E.g. if a full-stop is not at the end of a sentence, it is to be regarded as white space, so it serves to end words.
Like 3.42 would be two words. Or P.yth.on would be 3 words.
A double hyphen (--) is to be regarded as a space character.
That being said, first of all, I opened and read the file to get all the text. I then replaced all the useless characters with blank space so it is easier to count the words. This includes '--' as well.
Then I split the text into words, created a dictionary to store count of the words. After completing the dictionary, I added all the values to get the total number of words and printed this. See below for code:
def countwords():
    filename = input("Name of file? ")
    text = open(filename, "r").read()
    text = text.lower()
    for ch in '!.?"#$%&()*+/:<=>#[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    text = text.replace('--', ' ')
    text = text.rstrip("\n")
    words = text.split()
    count = {}
    for w in words:
        count[w] = count.get(w,0) + 1
    wordcount = sum(count.values())
    print(wordcount)
So for the sample1 text file, my word count is 321.
For sample2: 542
For sample3: 139
I was hoping I could compare these answers with some Python pros here to see whether my results are correct and, if they are not, what I'm doing wrong.
You can try this solution using regex.
# word counter using regex
import re
while True:
    string = input("Enter the string: ")
    if string == "Done":  # command to terminate the loop
        break
    count = len(re.findall("[a-zA-Z_]+", string))
    print(count)
print("Terminated")
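Since the question rules out modules, an import-free version may be closer to what is wanted. The sketch below follows the question's stated rules as I read them ("--" and the listed punctuation count as white space) and relies on the fact that summing the per-word counts in a dictionary gives the same number as the length of the split list. The function name count_words and the sample sentence are made up for illustration.

```python
def count_words(text):
    """Count words, treating '--' and the listed punctuation as spaces."""
    text = text.replace("--", " ")
    for ch in '!.?"#$%&()*+/:<=>[\\]^_`{|}~':
        text = text.replace(ch, " ")
    # split() with no argument ignores runs of spaces, tabs and newlines.
    return len(text.split())

print(count_words("Pi is 3.42 -- or so. P.yth.on!"))  # 9
```

Here 3.42 counts as two words and P.yth.on as three, as the rules in the question require.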

split strings with multiple special characters into lists without importing anything in python

I need to make a program that will capitalize the first word in a sentence, and I want to be sure that all the special characters that are used to end a sentence can be used.
I cannot import anything! This is for a class and I just want some examples of how to do this.
I have tried to use if to look in the list to see if it finds the matching character and do the correct split operation...
This is the function I have now... I know it's not good at all, as it just returns the original string...
def getSplit(userString):
    userStringList = []
    if "? " in userString:
        userStringList=userString.split("? ")
    elif "! " in userStringList:
        userStringList = userString.split("! ")
    elif ". " in userStringList:
        userStringList = userString.split(". ")
    else:
        userStringList = userString
    return userStringList
I want to be able to input something like this is a test. this is a test? this is definitely a test!
and get ['this is a test.', 'this is a test?', 'this is definitely a test!']
and then send the list of sentences to another function to capitalize the first letter of each sentence.
This is an old homework assignment in which I could only make it use one special character to separate the string into a list, but I want the user to be able to put in more than just one kind of sentence...
This may help. Use str.replace to replace the special chars with spaces and then use str.split.
Ex:
def getSplit(userString):
    return userString.replace("!", " ").replace("?", " ").replace(".", " ").split()

print([x.capitalize() for x in getSplit("sdfsdf! sdfsdfdf? sdfsfdsf.sdfsdfsd!fdfgdfg?dsfdsfgf")])
Normally, you could use re.split(), but since you cannot import anything, the best option would be just to do a for loop. Here it is:
def getSplit(user_input):
    n = len(user_input)
    sentences =[]
    previdx = 0
    for i in range(n - 1):
        if(user_input[i:i+2] in ['. ', '! ', '? ']):
            sentences.append(user_input[previdx:i+2].capitalize())
            previdx = i + 2
    sentences.append(user_input[previdx:n].capitalize())
    return "".join(sentences)
I would split the string at each white space. Then scan the list for words that contain the special character. If any is present, the next word is capitalised. Join the list back at the end. Of course, this assumes that there are no more than two consecutive spaces between words.
def capitalise(text):
    new_words = []
    capitalise_next = True  # the first word always starts a sentence
    for word in text.split():
        new_words.append(word.capitalize() if capitalise_next else word)
        # A word containing ., ! or ? ends a sentence, so capitalise the next one
        capitalise_next = any(ch in word for ch in ".!?")
    return " ".join(new_words)
If you can use the re module which is available by default in python, this is how you could do it:
import re
a = 'test this. and that, and maybe something else?even without space. or with multiple.\nor line breaks.'
print(re.sub(r'[.!?]\s*\w', lambda x: x.group(0).upper(), a))
Would lead to:
test this. And that, and maybe something else?Even without space. Or with multiple.\nOr line breaks.
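For the original requirement (no imports, and a list of sentences that keep their terminators), a character-by-character scan is one option. This is a sketch, not a complete sentence splitter: the names split_sentences and capitalize_first are hypothetical, and it treats every ".", "!" or "?" as a sentence end, so input such as "3.14" or "..." would be split too.

```python
def split_sentences(text, terminators=".!?"):
    """Split text into sentences, keeping the terminating character."""
    sentences = []
    current = ""
    for ch in text:
        current += ch
        if ch in terminators:
            sentences.append(current.strip())
            current = ""
    # Keep any trailing text that has no terminator.
    if current.strip():
        sentences.append(current.strip())
    return sentences

def capitalize_first(sentence):
    """Uppercase only the first character, leaving the rest unchanged."""
    return sentence[:1].upper() + sentence[1:]

parts = split_sentences("this is a test. this is a test? this is definitely a test!")
print(parts)
# ['this is a test.', 'this is a test?', 'this is definitely a test!']
print(" ".join(capitalize_first(p) for p in parts))
# This is a test. This is a test? This is definitely a test!
```

Unlike str.capitalize, capitalize_first does not lowercase the rest of the sentence, so names and acronyms inside it survive.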

Iterate over a string in Python and add some new characters

I need to iterate over a string in Python and add some whitespace characters after existing ones in certain cases. So I need something like the following code (of course, it doesn't work):
for i, c in enumerate(some_string):
    if c == ' ':
        rand_value = random.randint(0, 2)
        if rand_value == 0:
            # Place two whitespaces here
        elif rand_value == 1:
            # Place three whitespaces here
How can I write such code in Python?
Example:
Input -- "some string with whitespace characters"
Possible outputs -- "some string with whitespace characters", "some string with whitespace characters", "some string with whitespace characters", etc
Thanks in advance.
I'd go for:
import re
from random import randint

text = 'this is some example text'
for i in range(5):
    print(re.sub(' ', lambda m: m.group() * randint(1, 3), text))
Which gives:
this is some example text
this is some example text
this is some example text
this is some example text
this is some example text
This reads as, find a space, then replace it by 1 to 3 spaces... it should be straight forward enough to adapt to other scenarios more easily than looping/re-joining etc...
A concise solution:
from random import choice
# input_text is the string to process
output = ''.join([choice(['  ', '   ']) if c == ' ' else c for c in input_text])
For each character in the input string, if the current character is a space, output either two or three spaces selected randomly; for any input other than space, copy to output. Then, since the result of the list comprehension is a list, join the characters to make a new string.
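If the number of spaces should be configurable, the same idea can be wrapped in a small function. This is a sketch only; pad_spaces and its low, high and rng parameters are names invented here, not part of either answer.

```python
import random

def pad_spaces(text, low=1, high=3, rng=random):
    """Replace each single space with between low and high spaces."""
    return "".join(" " * rng.randint(low, high) if ch == " " else ch for ch in text)

padded = pad_spaces("some string with whitespace characters")
print(repr(padded))
```

Passing a seeded random.Random instance as rng makes the output reproducible, which can help when testing.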

Python regular expression to remove space and capitalize letters where the space was?

I want to create a list of tags from a user supplied single input box, separated by comma's and I'm looking for some expression(s) that can help automate this.
What I want is to supply the input field and:
remove all double+ whitespaces, tabs, new lines (leaving just single spaces)
remove ALL (single's and double+) quotation marks, except for comma's, which there can be only one of
in between each comma, i want Something Like Title Case, but excluding the first word and not at all for single words, so that when the last spaces are removed, the tag comes out as 'somethingLikeTitleCase' or just 'something' or 'twoWords'
and finally, remove all remaining spaces
Here's what I have gathered around SO so far:
import re

def no_whitespace(s):
    """Remove all whitespace & newlines. """
    return re.sub(r"(?m)\s+", "", s)

# remove spaces, newlines, all whitespace
# http://stackoverflow.com/a/42597/523051
tag_list = ''.join(no_whitespace(tags_input))
# split into a list at comma's
tag_list = tag_list.split(',')
# remove any empty strings (since I currently don't know how to remove double comma's)
# http://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings
tag_list = filter(None, tag_list)
I'm lost though when it comes to modifying that regex to remove all the punctuation except comma's and I don't even know where to begin for the capitalizing.
Any thoughts to get me going in the right direction?
As suggested, here are some sample inputs = desired_outputs
form: 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps' should come out as
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']
Here's an approach to the problem (that doesn't use any regular expressions, although there's one place where it could). We split up the problem into two functions: one function which splits a string into comma-separated pieces and handles each piece (parseTags), and one function which takes a string and processes it into a valid tag (sanitizeTag). The annotated code is as follows:
# This function takes a string with commas separating raw user input, and
# returns a list of valid tags made by sanitizing the strings between the
# commas.
def parseTags(str):
    # First, we split the string on commas.
    rawTags = str.split(',')
    # Then, we sanitize each of the tags. If sanitizing gives us back None,
    # then the tag was invalid, so we leave those cases out of our final
    # list of tags. We can use None as the predicate because sanitizeTag
    # will never return '', which is the only falsy string. (list() is
    # needed on Python 3, where filter returns an iterator.)
    return list(filter(None, map(sanitizeTag, rawTags)))

# This function takes a single proto-tag---the string in between the commas
# that will be turned into a valid tag---and sanitizes it. It either
# returns an alphanumeric string (if the argument can be made into a valid
# tag) or None (if the argument cannot be made into a valid tag; i.e., if
# the argument contains only whitespace and/or punctuation).
def sanitizeTag(str):
    # First, we turn non-alphanumeric characters into whitespace. You could
    # also use a regular expression here; see below.
    str = ''.join(c if c.isalnum() else ' ' for c in str)
    # Next, we split the string on spaces, ignoring leading and trailing
    # whitespace.
    words = str.split()
    # There are now three possibilities: there are no words, there was one
    # word, or there were multiple words.
    numWords = len(words)
    if numWords == 0:
        # If there were no words, the string contained only spaces (and/or
        # punctuation). This can't be made into a valid tag, so we return
        # None.
        return None
    elif numWords == 1:
        # If there was only one word, that word is the tag, no
        # post-processing required.
        return words[0]
    else:
        # Finally, if there were multiple words, we camel-case the string:
        # we lowercase the first word, capitalize the first letter of all
        # the other words and lowercase the rest, and finally stick all
        # these words together without spaces.
        return words[0].lower() + ''.join(w.capitalize() for w in words[1:])
And indeed, if we run this code, we get:
>>> parseTags("tHiS iS a tAg, \t\n!&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']
There are two points in this code that are worth clarifying. First is the use of str.split() in sanitizeTag. This will turn ' a b c ' into ['a','b','c'], whereas str.split(' ') would produce ['','a','b','c','']. This is almost certainly the behavior you want, but there's one corner case. Consider the string tAG$. The $ gets turned into a space, and is stripped out by the split; thus, this gets turned into tAG instead of tag. This is probably what you want, but if it isn't, you have to be careful. What I would do is change that line to words = re.split(r'\s+', str), which will split the string on whitespace but leave in the leading and trailing empty strings; however, I would also change parseTags to use rawTags = re.split(r'\s*,\s*', str). You must make both these changes; 'a , b , c'.split(',') becomes ['a ', ' b ', ' c'], which is not the behavior you want, whereas r'\s*,\s*' deletes the space around the commas too. If you ignore leading and trailing white space, the difference is immaterial; but if you don't, then you need to be careful.
Finally, there's the non-use of regular expressions, and instead the use of str = ''.join(c if c.isalnum() else ' ' for c in str). You can, if you want, replace this with a regular expression. (Edit: I removed some inaccuracies about Unicode and regular expressions here.) Ignoring Unicode, you could replace this line with
str = re.sub(r'[^A-Za-z0-9]', ' ', str)
This uses [^...] to match everything but the listed characters: ASCII letters and numbers. However, it's better to support Unicode, and it's easy, too. The simplest such approach is
str = re.sub(r'\W', ' ', str, flags=re.UNICODE)
Here, \W matches non-word characters; a word character is a letter, a number, or the underscore. With flags=re.UNICODE specified (not available before Python 2.7; you can instead use r'(?u)\W' for earlier versions and 2.7), letters and numbers are both any appropriate Unicode characters; without it, they're just ASCII. If you don't want the underscore, you can add |_ to the regex to match underscores as well, replacing them with spaces too:
str = re.sub(r'\W|_', ' ', str, flags=re.UNICODE)
This last one, I believe, matches the behavior of my non-regex-using code exactly.
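A quick way to gain some confidence in that claim is to run both versions over a few inputs and compare. The sketch below is a spot check on hand-picked samples, not a proof; the helper names are made up, and it assumes Python 3, where str patterns are Unicode-aware by default so the flags argument can be omitted.

```python
import re

def blank_with_isalnum(text):
    # The non-regex version: anything that is not alphanumeric becomes a space.
    return "".join(c if c.isalnum() else " " for c in text)

def blank_with_regex(text):
    # In Python 3, str patterns are Unicode-aware by default, so no flag is needed.
    return re.sub(r"\W|_", " ", text)

samples = ["tHiS iS a tAg", "no!punc$$", "snake_case tag", "caf\u00e9 au lait!"]
for s in samples:
    assert blank_with_isalnum(s) == blank_with_regex(s)
print([blank_with_regex(s) for s in samples])
```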
Also, here's how I'd write the same code without those comments; this also allows me to eliminate some temporary variables. You might prefer the code with the variables present; it's just a matter of taste.
def parseTags(str):
    return list(filter(None, map(sanitizeTag, str.split(','))))

def sanitizeTag(str):
    words = ''.join(c if c.isalnum() else ' ' for c in str).split()
    numWords = len(words)
    if numWords == 0:
        return None
    elif numWords == 1:
        return words[0]
    else:
        return words[0].lower() + ''.join(w.capitalize() for w in words[1:])
To handle the newly-desired behavior, there are two things we have to do. First, we need a way to fix the capitalization of the first word: lowercase the whole thing if the first letter's lowercase, and lowercase everything but the first letter if the first letter's upper case. That's easy: we can just check directly. Secondly, we want to treat punctuation as completely invisible: it shouldn't uppercase the following words. Again, that's easy—I even discuss how to handle something similar above. We just filter out all the non-alphanumeric, non-whitespace characters rather than turning them into spaces. Incorporating those changes gives us
def parseTags(str):
    return list(filter(None, map(sanitizeTag, str.split(','))))

def sanitizeTag(str):
    # ''.join is needed on Python 3, where filter on a string returns an iterator
    words = ''.join(filter(lambda c: c.isalnum() or c.isspace(), str)).split()
    numWords = len(words)
    if numWords == 0:
        return None
    elif numWords == 1:
        return words[0]
    else:
        words0 = words[0].lower() if words[0][0].islower() else words[0].capitalize()
        return words0 + ''.join(w.capitalize() for w in words[1:])
Running this code gives us the following output
>>> parseTags("tHiS iS a tAg, AnD tHIs, \t\n!&#^ , se#%condcomment$ , No!pUnc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'AndThis', 'secondcomment', 'NopUnc', 'ifNOSPACESthenPRESERVEcaps']
You could use a white list of characters allowed to be in a word, everything else is ignored:
import re
def camelCase(tag_str):
    words = re.findall(r'\w+', tag_str)
    nwords = len(words)
    if nwords == 1:
        return words[0] # leave unchanged
    elif nwords > 1: # make it camelCaseTag
        return words[0].lower() + ''.join(map(str.title, words[1:]))
    return '' # no word characters
This example uses \w word characters.
Example
tags_str = """ 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$,
ifNOSPACESthenPRESERVEcaps' """
print("\n".join(filter(None, map(camelCase, tags_str.split(',')))))
Output
thisIsATag
whitespace
secondcomment
noPunc
ifNOSPACESthenPRESERVEcaps
I think this should work
import re

def toCamelCase(s):
    # remove all punctuation
    # modify to include other characters you may want to keep
    s = re.sub(r"[^a-zA-Z0-9\s]", "", s)
    # remove leading spaces
    s = re.sub(r"^\s+", "", s)
    # camel case
    s = re.sub(r"\s[a-z]", lambda m: m.group(0)[1].upper(), s)
    # remove all punctuation and spaces
    s = re.sub(r"[^a-zA-Z0-9]", "", s)
    return s

tag_list = [s for s in (toCamelCase(s.lower()) for s in tag_list.split(',')) if s]
The key here is to make use of re.sub to make the replacements you want.
EDIT : Doesn't preserve caps, but does handle uppercase strings with spaces
EDIT : Moved "if s" after the toCamelCase call
