I'm creating a script that queries websites, and my results end up looking something like this:
result = "
nameof1stlink
38
nameof2ndlink120
12
nameof3rdlink15
7
nameof4thlin...
k143
43
"
Basically, I want to remove the numbers that come after each line of text. That would be easy for me to do in a pattern, but there is the occasional long string that takes up two separate lines. There's also the matter of needing to keep the numbers in the actual text names.
I was thinking of checking each individual line for string length and just removing those w/o 5 or more letters / numbers, but I wasn't sure if that would work, and I wasn't too sure how to do it either.
Any help from you guys would be great.
Thanks! :)
You could maybe use regex matching, looking for a link-like string (allowing for newlines) followed by a number and a newline, which you'd want to ignore. Then, to accommodate multi-line links, use simple str.replace() to remove any occurrences of the consistent ...\n that occurs when the link is split across multiple lines.
What I have in mind, given the example you've provided, is this:
import re
result = """nameof1stlink
38
nameof2ndlink120
12
nameof3rdlink15
7
nameof4thlin...
k143
43"""
matches = re.findall(r'([A-Za-z0-9\n/_.-]+?)[0-9\n]+[\n\b]', result, flags=re.M)
# ( ... )               capture this group:
# [A-Za-z0-9\n/_.-]+?   the shortest possible run of at least one link character
#                       (newlines allowed, for multi-line entries)
# [0-9\n]+              then at least one digit or newline
# [\n\b]                ending with \n (inside [], \b is a backspace, not end-of-string)
# re.M only changes how ^ and $ behave, so it has no effect on this pattern
# matches = ['nameof1stlink', 'nameof2ndlink', 'nameof3rdlink', 'nameof4thlin...\nk']
# note: digits at the end of a name (120, 15, 143) are consumed by [0-9\n]+
links = [link.replace('...\n', '') for link in matches]
# links = ['nameof1stlink', 'nameof2ndlink', 'nameof3rdlink', 'nameof4thlink']
I'm not sure what your links look like, but I assumed [A-Za-z0-9/_.-] (alphanumerics plus /, _, ., and -) covers all the standard parts of hyperlinks. And \n needs to be thrown somewhere in there to accommodate for multi-line entries. You can modify this character class depending on what you expect your links to look like.
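Since the regex above drops digits that sit at the end of a name, a line-based pass may be a simpler alternative when those digits need to be kept. This is only a sketch, assuming that names and counts strictly alternate and that a wrapped name always ends in a literal '...'; extract_names is a hypothetical helper name, not something from the question.

```python
def extract_names(result):
    """Return the link names from alternating name/count lines (hypothetical helper)."""
    # Drop blank lines so a leading or trailing newline does not shift the pairing.
    lines = [ln for ln in result.splitlines() if ln.strip()]

    merged = []
    i = 0
    while i < len(lines):
        line = lines[i]
        # Assumption: a name that wraps onto the next line ends with '...'.
        while line.endswith('...') and i + 1 < len(lines):
            i += 1
            line = line[:-3] + lines[i]
        merged.append(line)
        i += 1

    # Assumption: after merging, names sit at even positions and counts at odd ones.
    return merged[0::2]


result = """nameof1stlink
38
nameof2ndlink120
12
nameof3rdlink15
7
nameof4thlin...
k143
43"""
print(extract_names(result))
```

Because the counts are dropped by position rather than by pattern, the numbers that belong to the names stay untouched.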
I have some issues in finding the correct regular expression.
Lets say I have this list of keywords:
keywords = [' b.o.o', ' a.b.a', ' titi']
(Please note that there is a blank space before each keyword, and that this list can contain up to 100 keywords, so I can't do it without a function.)
and my dataframe df:
I use the following code to extract the matching words. It only partially works, because it also extracts words that are not an exact match:
keywords = [' b.o.o', ' a.b.a', ' titi']
pattern = '(' + '|'.join([fr'\\b({k})\\b' for k in keywords]) + ')'
df.withColumn('words', F.expr(f"regexp_extract_all(colB, '{pattern}', 1)"))
Actual output :
Expected output :
As we can see, it extracts words that are not an exact match, because the dot is not treated literally. For example, this code considers awbwa a match, because a dot in the pattern matches any character, including w. I also tried:
pattern = '(' + '|'.join([fr'\\b({k})\\b' for k in [re.escape(x) for x in keywords]]) + ')'
to add a backslash before every dot and before the blank space, but it doesn't work.
I searched on Stack Overflow but didn't find an answer.
I finally figured it out: for some reason re.escape doesn't work here, and the solution was to wrap each dot in [] (that is, to write [.] instead of .).
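To illustrate the difference the brackets make, here is a minimal sketch in plain Python re rather than Spark, so it only demonstrates the pattern itself and not how regexp_extract_all handles escaping. The helper name literal_dot_pattern and the sample text are made up for the example, and the \b anchors from the question are left out to keep it short.

```python
import re

keywords = [' b.o.o', ' a.b.a', ' titi']


def literal_dot_pattern(keywords):
    """Build an alternation in which every dot is a literal dot (hypothetical helper)."""
    # '[.]' is a character class containing only '.', so no backslash is needed.
    return '(' + '|'.join(k.replace('.', '[.]') for k in keywords) + ')'


text = 'x a.b.a y awbwa z titi'

# Unescaped dots match any character, so ' awbwa' is picked up as well.
naive = '(' + '|'.join(keywords) + ')'
print(re.findall(naive, text))

# With [.] only the exact keywords are found.
print(re.findall(literal_dot_pattern(keywords), text))
```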
I have been working on a program that takes a hex file and, if the file name starts with "CID", removes the first 104 characters; after that point there are a few words. I also want to remove everything after the words, but the problem is that the part I want to isolate varies in length.
My code is currently like this:
import os

y = 0
files = os.listdir(".")
filenames = []
for names in files:
    if names.endswith(".uexp"):
        filenames.append(names)
        y += 1
print(y)
print(filenames)
for x in range(1, y):
    filenamestart = (filenames[x][0:3])
    print(filenamestart)
    if filenamestart == "CID":
        openFile = open(filenames[x], 'r')
        fileContents = (openFile.read())
        ItemName = (fileContents[104:])
        print(ItemName)
Input Example file (pulled from HxD):
.........................ýÿÿÿ................E.................!...1AC9816A4D34966936605BB7EFBC0841.....Sun Tan Specialist.................9.................!...9658361F4EFF6B98FF153898E58C9D52.....Outfit.................D.................!...F37BE72345271144C16FECAFE6A46F2A.....Don't get burned............................................................................................................................Áƒ*ž
I have got it working to remove the first 104 characters, but I would also like to remove the characters after 'Sun Tan Specialist', which will differ in length, so I am left with only that part.
I appreciate any help that anyone can give me.
One way to remove non-alphabetic characters in a string is to use regular expressions [1].
>>> import re
>>> re.sub(r'[^a-z]', '', "lol123\t")
'lol'
EDIT
The first argument r'[^a-z]' is the pattern that captures what will be removed (here, by replacing it with an empty string ''). The square brackets denote a character class (the pattern matches any single character in the class), the ^ is a "not" operator, and a-z denotes all the lowercase alphabetic characters. More information here:
https://docs.python.org/3/library/re.html#regular-expression-syntax
So for instance, to keep also capital letters and spaces it would be:
>>> re.sub(r'[^a-zA-Z ]', '', 'Lol !this *is* a3 -test\t12378')
'Lol this is a test'
However from the data you give in your question the exact process you need seems to be a bit more complicated than just "getting rid of non-alphabetical characters".
You can use filter:
import string
print(''.join(filter(lambda character: character in string.ascii_letters + string.digits, '(ABC), DEF!'))) # => ABCDEF
You mentioned in a comment that you got the string down to Sun Tan SpecialistFEFFBFFECDOutfitDFBECFECAFEAFADont get burned
Essentially your goal at this point is to remove any uppercase letter that isn't immediately followed by a lowercase letter because Upper Lower indicates the start of a phrase. You can use a for loop to do this.
h = "Sun Tan SpecialistFEFFBFFECDOutfitDFBECFECAFEAFADont get burned"
output = ""
for i in range(len(h)):
    # Keep spaces
    if h[i] == " ":
        output += h[i]
    # Start of a phrase found (uppercase followed by lowercase), so separate
    # it with a space unless one is already there, and store the character
    elif h[i].isupper() and i + 1 < len(h) and h[i + 1].islower():
        if output and not output.endswith(" "):
            output += " "
        output += h[i]
    # We want all lowercase characters
    elif h[i].islower():
        output += h[i]
print(output)
# If you don't care about Outfit since it is always there, remove it
print(output.replace(" Outfit", ""))
Output:
Sun Tan Specialist Outfit Dont get burned
Sun Tan Specialist Dont get burned
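For the original goal of keeping only the first name after the 104-character header, a shorter route may be to match the leading run of name characters directly. This is a sketch under assumptions: the name is taken to consist only of letters, spaces and apostrophes, and whatever follows it is taken to be some other character. The function name first_phrase and its header_len parameter are invented for the example.

```python
import re


def first_phrase(contents, header_len=104):
    """Return the run of name characters that follows the header (hypothetical helper)."""
    rest = contents[header_len:]
    # Assumption: the name is made of letters, spaces and apostrophes only,
    # so the match stops at the first character outside that set.
    match = re.match(r"[A-Za-z' ]+", rest)
    return match.group(0).strip() if match else ""


# Stand-in for the file contents: 104 filler characters, the name, then more filler.
sample = "." * 104 + "Sun Tan Specialist" + "." * 17 + "9" + "." * 17
print(first_phrase(sample))
```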
There are probably several ways to solve this problem, so I'm open to any ideas.
I have a file, within that file is the string "D133330593" Note: I do have the exact position within the file this string exists, but I don't know if that helps.
Following this string, there are 6 digits, I need to replace these 6 digits with 6 other digits.
This is what I have so far:
def editfile():
    f = open(filein,'r')
    filedata = f.read()
    f.close()
    #This is the line that needs help
    newdata = filedata.replace( -TOREPLACE- ,-REPLACER-)
    #Basically what I need is something that lets me say "D133330593******"
    #->"D133330593123456" Note: The following 6 digits don't need to be
    #anything specific, just different from the original 6
    f = open(filein,'w')
    f.write(newdata)
    f.close()
Use the re module to define your pattern and then use the sub() function to substitute occurrences of that pattern with your own string.
import re
...
pat = re.compile(r"D133330593\d{6}")
newdata = re.sub(pat, "D133330593abcdef", filedata)
The above defines a pattern as your string ("D133330593") followed by six decimal digits. The next line then replaces ALL occurrences of this pattern with your replacement string (the prefix followed by "abcdef" in this case), if that is what you want.
If you do not want every occurrence replaced, you could use the count keyword argument of the sub() function, which lets you specify the maximum number of replacements to make.
Check out this library for more info - https://docs.python.org/3.6/library/re.html
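If each occurrence should get its own replacement digits, one option is to pass a function instead of a string as the replacement, which re.sub() calls once per match. The sketch below assumes the new digits are supplied as a list of strings; replace_following_digits is a hypothetical helper name, and matches beyond the end of the list are left unchanged.

```python
import re


def replace_following_digits(text, prefix, replacements):
    """Swap the six digits after each `prefix` for successive items of `replacements`."""
    remaining = iter(replacements)
    # re.escape keeps the prefix literal even if it contained regex metacharacters.
    pattern = re.compile("(" + re.escape(prefix) + r")\d{6}")

    def repl(match):
        new_digits = next(remaining, None)
        # Assumption: once the replacements run out, later matches stay as they are.
        if new_digits is None:
            return match.group(0)
        return match.group(1) + new_digits

    return pattern.sub(repl, text)


data = "aD133330593090909bD133330593111111c"
print(replace_following_digits(data, "D133330593", ["123456", "654321"]))
```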
Let's simplify your problem to you having a string:
s = "zshisjD133330593090909fdjgsl"
and you want to replace the 6 characters after "D133330593" with "123456" to produce:
"zshisjD133330593123456fdjgsl"
To achieve this, we first need to find the index of "D133330593". This is done by just using str.index:
i = s.index("D133330593")
Then replace the next 6 characters, but for this, we should first calculate the length of our string that we want to replace:
l = len("D133330593")
then do the replace:
s[:i+l] + "123456" + s[i+l+6:]
which gives us the desired result of:
'zshisjD133330593123456fdjgsl'
I am sure that you can now integrate this into your code to work with a file, but this is the heart of your problem.
Note that storing the index and the length in variables, as above, avoids recomputing them. Nevertheless, if your file isn't too long (i.e. efficiency isn't a big deal) you can do the whole process outlined above in one line:
s[:s.index("D133330593")+len("D133330593")] + "123456" + s[s.index("D133330593")+len("D133330593")+6:]
which gives the same result.
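To tie this back to the file, the slicing can be wrapped in a small function and called between the read and the write. This is a sketch only: replace_after and edit_file are hypothetical names, the file is assumed to be readable as text, and only the first occurrence of the marker is handled (str.index raises ValueError if the marker is missing).

```python
def replace_after(s, marker, new):
    """Overwrite the len(new) characters that follow the first `marker` in `s`."""
    start = s.index(marker) + len(marker)
    return s[:start] + new + s[start + len(new):]


def edit_file(path, marker, new):
    """Apply replace_after to a whole text file in place (assumes it fits in memory)."""
    with open(path, "r") as f:
        filedata = f.read()
    with open(path, "w") as f:
        f.write(replace_after(filedata, marker, new))


print(replace_after("zshisjD133330593090909fdjgsl", "D133330593", "123456"))
```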
I have a long text file converted from a PDF and I want to remove instances of some things, e.g. like page numbers that will appear by themselves but possibly surrounded by spaces. I made a regex that works on short lines: e.g.
news1 = 'Hello done.\n4\nNext paragraph.'
m = re.sub('\n *[0-9] *\n', ' ', news1)
print(m)
Hello done. Next paragraph.
But when I try this on more complex strings, it fails, e.g
news = '1 \n Hello done. \n 4 \n 44 \n Next paragraph.'
m = re.sub('\n *[0-9] *\n', ' ', news)
print(m)
1
Hello done. 44
Next paragraph.
How do I make this work across the entire file? Should I instead read line by line and deal with it per line, instead of trying to edit the whole string?
I've also tried using the periods to match with whatever but that doesn't get the initial '1' in the more complex string. So I guess I could do 2 regexs.
m = re.sub('. *[0-9] *.', '', news)
1
Hello done.
Next paragraph.
Thoughts?
I would recommend doing it line by line unless you have some specific reason to slurp it all in as a string. Then just a few regexes to clean it all up like:
#not sure how the pages are numbered, but perhaps...
text = re.sub(r"^\s*\d+\s*$", "", text)
#chuck a line in to strip out stuff in all caps of at least 3 letters
text = re.sub(r"[A-Z]{3,}", "", text)
#concatenate multiple whitespace to 1 space, handy to clean up the data
text = re.sub(r"\s+", " ", text)
#trim the start and end of the line
text = text.strip()
That's just one strategy, but it's the route I would go with: it's easy to maintain down the road when your business side comes up with "OH OH! Can you also replace any mention of 'Cat' with 'Dog'?" I think it's easier to troubleshoot and log your changes as well. Maybe even try using re.subn to track changes?
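Put together, the line-by-line route could look roughly like the sketch below. It assumes that a page number is any line containing nothing but digits and optional whitespace, which may be too aggressive for some documents; drop_page_numbers is a made-up name.

```python
import re

# A line that holds only digits, optionally surrounded by whitespace.
PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")


def drop_page_numbers(text):
    """Remove digit-only lines and join the remaining lines with single spaces."""
    kept = []
    for line in text.splitlines():
        if PAGE_NUMBER.match(line):
            continue  # assumed to be a page number
        line = line.strip()
        if line:
            kept.append(line)
    return " ".join(kept)


news = '1 \n Hello done. \n 4 \n 44 \n Next paragraph.'
print(drop_page_numbers(news))
```

Working per line sidesteps the problem in the question, where one match consumed the newline that the next page number needed in order to match.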
I'm currently using Python to search through a .config file and look for an integer in a line such as locationId="225".
It replaces the integer such as 225 with another number of my choosing.
This works fine. However, I'm not sure how to enter my own number if the original .config file is missing a number. Example:
locationID=""
So if the original locationId is missing an integer, I still want to replace it with my new integer.
I have used:
import re
sys.stdout.write(re.sub(r'(locationid=")', r'\1 ' + newtext, line))
but this causes it to output something such as
locationId=" 33"
with a space before the 33. How do I remove the space before the 33 and make it output
locationId="33"
?
I basically just want to know how to remove the space before the number.
The space is coming from your replacement string, r'\1 ', but removing that space causes a problem when you concatenate a number, say, 1, with it. If newtext is 1 then the replacement string becomes r'\11' without the space, which Python reads as a reference to group 11 rather than group 1 followed by the digit 1.
Remove the double quote from the capturing group and add it to the replacement string:
re.sub(r'(locationid=)"', r'\1"' + newtext, line)
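Another way around the ambiguity is the \g<1> form of a group reference, which can be followed directly by digits. The sketch below also replaces any digits already present, so it covers both the empty and the filled-in case. The function name set_location_id is hypothetical, and matching case-insensitively is an assumption made because the question spells the attribute several ways.

```python
import re


def set_location_id(line, new_id):
    """Set the number inside locationId="..." whether or not one is already there."""
    return re.sub(
        r'(locationId=")\d*(")',
        # \g<1> is an unambiguous reference to group 1, even when digits follow it.
        r'\g<1>' + str(new_id) + r'\g<2>',
        line,
        flags=re.IGNORECASE,  # assumption: the attribute's case is not consistent
    )


print(set_location_id('locationId=""', 33))
print(set_location_id('locationId="225"', 33))
```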