I have the string https://www.exampleurl.com/
How would I insert a word in the middle of this string so it looks like this: https://www.subdomain.exampleurl.com/
I know I can insert the word if I do this:
url = 'https://www.exampleurl.com/'
url[:12] + 'subdomain'
That gives me https://www.subdomain, but I can't figure out how to produce the rest of the string dynamically, so that it adjusts to whatever subdomain is being inserted.
My goal is for the end result to look like the following https://www.subdomain.exampleurl.com/
url = 'https://www.exampleurl.com/'
content = url.split("www.")
url = content[0] + "www." + "subdomain." + content[1]
url = 'https://www.exampleurl.com/'
text = url.split(".")
url = text[0] + '.subdomain.' + text[1] + '.' + text[2]
Final output : https://www.subdomain.exampleurl.com/
Better to split on the first '.':
l = url.split('.', 1)
l[0] + '.subdomain.' + l[1]
# OR if subdomain is a variable:
f'{l[0]}.{subdomain}.{l[1]}'
output: 'https://www.subdomain.exampleurl.com/'
Using replace (once)
url = 'https://www.exampleurl.com/'
url = url.replace(".", ".subdomain.", 1)  # only replace the first "." to get the desired result
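Whichever variant you choose, it can be wrapped in a small helper for reuse; a minimal sketch (the name insert_subdomain is my own, and the f-string needs Python 3.6+):

def insert_subdomain(url, subdomain):
    # split on the first "." so only the leading "www" label is separated
    head, tail = url.split('.', 1)
    return f'{head}.{subdomain}.{tail}'

print(insert_subdomain('https://www.exampleurl.com/', 'subdomain'))
# https://www.subdomain.exampleurl.com/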
What I want to do: I have a list of URLs with multiple parameters, such as:
https://www.somesite.com/path/path2/path3?param1=value1&param2=value2
and what I would want to get is something like this:
https://www.somesite.com/path/path2/path3?param1=PAYLOAD&param2=value2
https://www.somesite.com/path/path2/path3?param1=value1&param2=PAYLOAD
I want to iterate through every parameter (basically every match of "=" and "&") and replace each value one at a time. Thank you in advance.
from urllib.parse import urlparse
import re

urls = ["https://www.somesite.com/path/path2/path3?param1=value1&param2=value2&param3=value3",
        "https://www.anothersite.com/path/path2/path3?param1=value1&param2=value2&param3=value3"]

parseds = [urlparse(url) for url in urls]
newurls = []
for parsed in parseds:
    params = parsed[4].split("&")  # parsed[4] is the query string
    for param in params:
        newparam = re.sub("=.+", "=PAYLOAD", param)
        newurls.append(
            parsed[0] +                         # scheme
            "://" +
            parsed[1] +                         # netloc
            parsed[2] +                         # path
            "?" +
            parsed[4].replace(param, newparam)  # query with one value replaced
        )
newurls is:
['https://www.somesite.com/path/path2/path3?param1=PAYLOAD&param2=value2&param3=value3',
 'https://www.somesite.com/path/path2/path3?param1=value1&param2=PAYLOAD&param3=value3',
 'https://www.somesite.com/path/path2/path3?param1=value1&param2=value2&param3=PAYLOAD',
 'https://www.anothersite.com/path/path2/path3?param1=PAYLOAD&param2=value2&param3=value3',
 'https://www.anothersite.com/path/path2/path3?param1=value1&param2=PAYLOAD&param3=value3',
 'https://www.anothersite.com/path/path2/path3?param1=value1&param2=value2&param3=PAYLOAD']
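As a side note, if the values may contain URL-encoded characters or extra "=" signs, the standard library can handle the splitting and reassembly for you; a sketch of that alternative using parse_qsl and urlencode (not what the code above does, and note that urlencode will percent-encode the values):

from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

url = "https://www.somesite.com/path?param1=value1&param2=value2"
parsed = urlparse(url)
pairs = parse_qsl(parsed.query)  # [('param1', 'value1'), ('param2', 'value2')]
for i in range(len(pairs)):
    # swap in PAYLOAD for the i-th value only, keeping the others intact
    fuzzed = [(k, 'PAYLOAD' if j == i else v) for j, (k, v) in enumerate(pairs)]
    print(urlunparse(parsed._replace(query=urlencode(fuzzed))))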
I've solved it:
from urllib.parse import urlparse

url = "https://github.com/search?p=2&q=user&type=Code&name=djalel"
parsed = urlparse(url)
query = parsed.query
params = query.split("&")
new_query = []
for param in params:
    l = params.index(param)
    param = str(param.split("=")[0]) + "=" + "PAYLOAD"
    params[l] = param
    new_query.append("&".join(params))
    # reset params so each joined query contains only one PAYLOAD
    params = query.split("&")

for query in new_query:
    print(str(parsed.scheme) + '://' + str(parsed.netloc) + str(parsed.path) + '?' + query)
Output:
https://github.com/search?p=PAYLOAD&q=user&type=Code&name=djalel
https://github.com/search?p=2&q=PAYLOAD&type=Code&name=djalel
https://github.com/search?p=2&q=user&type=PAYLOAD&name=djalel
https://github.com/search?p=2&q=user&type=Code&name=PAYLOAD
So I've written something to pull a certain string (the beneficiary) out of PDFs and rename each file based on that string, but there is a problem if there are duplicates: is there any way to add a +1 counter behind the name?
My inefficient code follows; I appreciate any help!
for filename in os.listdir(input_dir):
    if filename.endswith('.pdf'):
        input_path = os.path.join(input_dir, filename)
        pdf_array = glob.glob(input_dir + '*.pdf')
        for current_pdf in pdf_array:
            with pdfplumber.open(current_pdf) as pdf:
                page = pdf.pages[0]
                text = page.extract_text()
                keyword = text.split('\n')[2]
                try:
                    if 'attention' in keyword:
                        pdf_to_att = text.split('\n')[2]
                        start_to_att = 'For the attention of: '
                        to_att = pdf_to_att.removeprefix(start_to_att)
                        pdf.close()
                        result = to_att
                        os.rename(current_pdf, result + '.pdf')
                    else:
                        pdf_to_ben = text.split('\n')[1]
                        start_to_ben = 'Beneficiary Name : '
                        end_to_ben = pdf_to_ben.rsplit(' ', 1)[1]
                        to_ben = pdf_to_ben.removeprefix(start_to_ben).removesuffix(end_to_ben).rstrip()
                        pdf.close()
                        result = to_ben
                        os.rename(current_pdf, result + '.pdf')
                except Exception:
                    pass

messagebox.showinfo("A Title", "Done!")
Edit: the desired output should be:
AAA.pdf
AAA_2.pdf
BBB.pdf
CCC.pdf
CCC_2.pdf
I would use a dict to record the occurrence count of each filename.
dict.get() returns the value for key if key is in the dictionary, else default; if default is not given, it defaults to None.
pdf_name_count = {}
for current_pdf in pdf_array:
    with pdfplumber.open(current_pdf) as pdf:
        page = pdf.pages[0]
        text = page.extract_text()
        keyword = text.split('\n')[2]
        try:
            if 'attention' in keyword:
                ...
                result = to_att
            else:
                ...
                result = to_ben
            filename_count = pdf_name_count.get(result, 0)
            if filename_count >= 1:
                filename = f'{result}_{filename_count + 1}.pdf'
            else:
                filename = result + '.pdf'
            os.rename(current_pdf, filename)
            # increase the name's occurrence count by 1
            pdf_name_count[result] = filename_count + 1
        except Exception:
            pass
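Because the count is keyed by the extracted name, AAA and CCC each get their own _2 suffix while BBB stays unsuffixed, which matches the desired output above.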
What you want is to build a string for the filename that includes a counter; let's call it cnt. Python has the f-string syntax for this exact purpose: it lets you interpolate a variable into a string.
Initialize your counter before the for loop:
cnt = 0
Replace
os.rename(current_pdf, result + '.pdf')
with
os.rename(current_pdf, f'{result}_{cnt}.pdf')
cnt += 1
The f before the opening quote introduces the f-string, and the curly braces {} let you include any Python expression; in your case we just substitute the values of the two variables result and cnt. Then we increment the counter, of course.
os.path.isfile can meet your needs.
import os

def get_new_name(result):
    file_name = result + '{}.pdf'
    file_number = 0
    if os.path.isfile(file_name.format('')):  # AAA.pdf already exists
        file_number = 2
        while os.path.isfile(file_name.format('_{}'.format(file_number))):
            file_number += 1
    if file_number:
        pdf_name = file_name.format('_{}'.format(file_number))
    else:
        pdf_name = file_name.format('')
    return pdf_name
I updated the code for your output format; it should work.
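With that helper, the rename in your loop would become something like the line below (current_pdf and result are the variables from your code; note that os.path.isfile checks paths relative to the current working directory, so you may need os.path.join(input_dir, ...) on both sides):

os.rename(current_pdf, get_new_name(result))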
I have stored my output in a dictionary, like this:
str3 = "Triangle, Bow, Boat"
str1 = "some text regarding body parts"
str2 = "some text regarding themes"

d = {}
key = str3
d[key] = str1
d[key] = [d[key]]
d[key].append(str2)
print(d)
{'Triangle, Bow, Boat': ['some text regarding body parts', 'some text regarding themes']}
And I am trying to get it to be returned to html so that it appears separated on three lines as such:
Triangle, Bow, Boat
some text regarding body parts
some text regarding themes
I have tried creating the entire output as one string with newline and break characters, but this didn't work.
So I'm trying to use some combination of jsonify and json.dump to get these to display properly in HTML.
I think you want to do:
string = key + '<br>' + d[key][0] + '<br>' + d[key][1]
return '<p>' + string + '</p>'
(Note the second index is d[key][1], not d[key[1]], and a + sign is required between every piece.)
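If the list under a key can grow beyond two entries, joining is less error-prone than indexing each element; a small sketch assuming the same d and key as above:

string = key + '<br>' + '<br>'.join(d[key])
return '<p>' + string + '</p>'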
I have a URL as follows: https://www.vq.com/36851082/?p=1. I want to create a file named list_of_urls.txt which contains the URL links from p=1 to p=20, separated by spaces, and save it as a txt file.
Here is what I have tried, but it only prints the last one:
url = "https://www.vq.com/36851082/?p="
list_of_urls = []
for page in range(20):
    list_of_urls = url + str(page)
print(list_of_urls)
The expected txt file content would be all 20 URLs on one line, separated by spaces.
This is the occasion to use f-strings, available since Python 3.6 and fully described in PEP 498 -- Literal String Interpolation.
url_base = "https://www.vq.com/36851082/?p="

with open('your.txt', 'w') as f:
    for page in range(1, 20 + 1):
        f.write(f'{url_base}{page} ')
        #f.write('{}{} '.format(url_base, page))
        #f.write('{0}{1} '.format(url_base, page))
        #f.write('{u}{p} '.format(u=url_base, p=page))
        #f.write('{u}{p} '.format(**{'u':url_base, 'p':page}))
        #f.write('%s%s ' % (url_base, page))
Notice the space character at the end of each formatting expression.
Be careful with range - it starts from 0 by default, and the last number of the range is not included. Hence, if you want the numbers 1 - 20 you need to use range(1, 21).
url_template = "https://www.vq.com/36851082/?p={page}"
urls = [url_template.format(page=page) for page in range(1, 21)]

with open("/tmp/urls.txt", "w") as f:
    f.write(" ".join(urls))
Try this :)
url = "https://www.vq.com/36851082/?p="
list_of_urls = ""
for page in range(1, 21):
    list_of_urls = list_of_urls + url + str(page) + " "
print(list_of_urls)
Not sure if you want one line inside your file but if so:
url = "https://www.vq.com/36851082/?p=%i"

with open("expected.txt", "w") as f:
    f.write(' '.join([url % i for i in range(1, 21)]))
Output:
https://www.vq.com/36851082/?p=1 https://www.vq.com/36851082/?p=2 https://www.vq.com/36851082/?p=3 https://www.vq.com/36851082/?p=4 https://www.vq.com/36851082/?p=5 https://www.vq.com/36851082/?p=6 https://www.vq.com/36851082/?p=7 https://www.vq.com/36851082/?p=8 https://www.vq.com/36851082/?p=9 https://www.vq.com/36851082/?p=10 https://www.vq.com/36851082/?p=11 https://www.vq.com/36851082/?p=12 https://www.vq.com/36851082/?p=13 https://www.vq.com/36851082/?p=14 https://www.vq.com/36851082/?p=15 https://www.vq.com/36851082/?p=16 https://www.vq.com/36851082/?p=17 https://www.vq.com/36851082/?p=18 https://www.vq.com/36851082/?p=19 https://www.vq.com/36851082/?p=20
This one also works, thanks to my colleague!
url = "https://www.vq.com/36851082/?p=%d"
result = " ".join([url % (x + 1) for x in range(20)])

with open("list_of_urls.txt", "w") as f:
    f.write(result)
I want to extract website names from a URL. For example, https://plus.google.com/in/test.html
should give the output "plus google".
Some more test cases are:
WWW.OH.MADISON.STORES.ADVANCEAUTOPARTS.COM/AUTO_PARTS_MADISON_OH_7402.HTML
Output:- OH MADISON STORES ADVANCEAUTOPARTS
WWW.LQ.COM/LQ/PROPERTIES/PROPERTYPROFILE.DO?PROPID=6054
Output:- LQ
WWW.LOCATIONS.DENNYS.COM
Output:- LOCATIONS DENNYS
WV.WESTON.STORES.ADVANCEAUTOPARTS.COM
Output:- WV WESTON STORES ADVANCEAUTOPARTS
WOODYANDERSONFORDFAYETTEVILLE.NET/
Output:- WOODYANDERSONFORDFAYETTEVILLE
WILMINGTONMAYFAIRETOWNCENTER.HGI.COM
Output:- WILMINGTONMAYFAIRETOWNCENTER HGI
WHITEHOUSEBLACKMARKET.COM/
Output:- WHITEHOUSEBLACKMARKET
WINGATEHOTELS.COM
Output:- WINGATEHOTELS
string = str(input("Enter the url "))
new_list = list(string)
count = 0
flag = 0
if 'w' in new_list:
    index1 = new_list.index('w')
    new_list.pop(index1)
    count += 1
    if 'w' in new_list:
        index2 = new_list.index('w')
        if index2 != -1 and index2 == index1:
            new_list.pop(index2)
            count += 1
            if 'w' in new_list:
                index3 = new_list.index('w')
                if index3 != -1 and index3 == index2 and new_list[index3 + 1] == '.':
                    new_list.pop(index3)
                    count += 1
                    flag = 1
if flag == 0:
    start = string.find('/')
    start += 2
    end = string.rfind('.')
    new_string = string[start:end]
    print(new_string)
elif flag == 1:
    start = string.find('.')
    start = start + 1
    end = string.rfind('.')
    new_string = string[start:end]
    print(new_string)
The above works for some test cases, but not all. Please help me with it. Thanks!
This is something you could build upon, using urllib.parse.urlparse:
from urllib.parse import urlparse

tests = ('https://plus.google.com/in/test.html',
         ('WWW.OH.MADISON.STORES.ADVANCEAUTOPARTS.COM/'
          'AUTO_PARTS_MADISON_OH_7402.HTML'),
         'WWW.LQ.COM/LQ/PROPERTIES/PROPERTYPROFILE.DO?PROPID=6054')

def extract(url):
    # urlparse will not work without a 'scheme'
    if not url.startswith('http'):
        url = 'http://' + url
    parsed = urlparse(url).netloc
    split = parsed.split('.')[:-1]  # get rid of the TLD
    if split[0].lower() == 'www':
        split = split[1:]
    ret = ' '.join(split)
    return ret

for url in tests:
    print(extract(url))
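Running this over the three tests prints:

plus google
OH MADISON STORES ADVANCEAUTOPARTS
LQ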
The function strips the URL from the double slash to the next single slash; the rest is clean-up:
def stripURL(url, TwoSlashes, OneSlash):
    try:
        start = url.index(TwoSlashes) + len(TwoSlashes)
        end = url.index(OneSlash, start)
        return url[start:end]
    except ValueError:
        return ""

url = input("URL : ")
if "www." in url:
    url = url.replace("www.", "")
Strip = stripURL(url, "//", "/")
# strip anything after the last period found
Stripped = Strip[:Strip.rfind(".")]
# get rid of any periods used in the name
Stripped = Stripped.replace(".", " ")
print(Stripped)