How to remove a specific part of a link? - python

So basically I'm making a script that can download a bunch of maps from TrackmaniaExchange based on a search result. However, to download the map files, I need the actual download link, which the search result doesn't give.
I already know how to download maps. The link is https://trackmania.exchange/maps/download/(map id). However, the hrefs for the search results are /maps/(map id)/(map name).
What I was thinking of doing is using selenium to go to the site, grab the href for the map, edit the link with re.sub so that it'll point to /maps/download/(map id)/, and remove the end of the link with re.sub so there's no map name at the end of it. I don't know how to go about it, though. This is what I have so far in my script:
import requests
import os.path
import os
import selenium.webdriver as webdriver
from selenium.webdriver.firefox.options import Options
import time
import re
def Search():
    link = "https://trackmania.exchange/mapsearch2?limit=100" #Trackmania Exchange link, will scrape all 100 results
    checkedlink = re.sub("\s", "+", link) #Replaces spaces with + for track names (this shouldn't happen with authors/tags)
    options = Options() #This is for selenium
    options.binary_location = "C:/Program Files/Mozilla Firefox/firefox.exe"
    driver = webdriver.Firefox(options=options)
    search_box = driver.find_element_by_name("trackname")
    sitelinks = driver.find_element_by_xpath("/html/[div/@id='container'/@data-select2-id='container']/[div/@class='container-inner']/[div/@class='ly-box-open']/[div/@class='box-col-6']/[div/@class='windowv2-panel']/[div/@id='searchResults-container']/div/div/table/tbody/[tr/@class='WindowTableCell2v2 with-hover has-image']/[td/@class='cell-ellipsis']")
    results = []
    name = input("Track Name (if nothing, hit enter)") #Prompts the user to input stuff
    author = input("Track Author (if nothing, hit enter)")
    tags = input("Tags (separate with %2C if there's multiple, if nothing, hit enter)")
    path = input("Map download directory (do not leave blank, use forward slashes)")
    print("WARNING: Download wget for this script to work.")
    type(name) #These are to make a link to find html with
    type(author)
    type(tags)
    type(path)
    if path == "":
        print("Please put a path next time you start this")
        time.sleep(3)
        os.exit()
    else: #And so begins the if/else hellhole to find out what needs to be added to the link
        if tags == "":
            if name == "":
                if author == "":
                    print("Chief, you cant just enter nothing. Put something in here next time")
                    time.sleep(3)
                    os.exit()
                else:
                    link = link+"&author="+author
            else:
                link = link+"&trackname="+name
                if author != "":
                    link = link+"&author="+author
        else:
            link = link+"&tags="+tags
            if name != "":
                link = link+"&trackname="+name
                if author != "":
                    link = link+"&author="+author
            else:
                if author != "":
                    link = link+"&author="+author
    print("Checking link...")
    checkedlink() #this is to make sure there's no spaces in the link. tags are separated by %2C, but track names are separated by +
    print("Attempting to download...")
    driver.get(link)
    links = sitelinks
    for link in links:
        href = link.get_attribute("href")
    browser.close()
    with open("list.txt", "w", encoding="utf-8") as f:
        f.write(href)
        for line in f:
            h = re.findall("\d") #My failed attempt at removing the end of the link
            re.sub("/maps/", "https://trackmania.exchange/maps/download", f)
            re.sub("") #unfinished part cause i was stubbed
    os.system("wget --directory-prefix="path" -i list.txt")
Search()
Their API is listed on the site, and after looking over the rules for the site, this is allowed. I also haven't really tested the script after making the if/else hellhole, but I can work on that later. All I need help with is removing the map name after the map ID. If you need a proper example, one of the hrefs on the front page for me is /maps/91677/cloudy-day. It'll be different for every link, so I don't really know what I should do.

If I know the URL format will be /maps/id/some-text and the ID will only include numbers, then I would simply grab the ID from the link using the regex below, and then use an f-string to build the URL.
map_id = re.search(r"\d+", url).group(0)
get_map_url = f"https://trackmania.exchange/maps/download/{map_id}"
Play around on regex101 with different URLs you may come across.
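For example, applied to the sample href from the question, the whole step might look like this (a small sketch; the variable names are just for illustration):
import re
href = "/maps/91677/cloudy-day"  # href scraped from the search results
map_id = re.search(r"\d+", href).group(0)  # first run of digits = the map ID
download_url = f"https://trackmania.exchange/maps/download/{map_id}"
print(download_url)  # https://trackmania.exchange/maps/download/91677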

Related

Why won't webbrowser module open my html file in my browser

I am using the Python webbrowser module to try and open an HTML file. I added a short routine to fetch a page's code from a website, allowing me to store a web page in case I ever need to view it without Wi-Fi, for instance a news article or something else.
The code itself is fairly short so far, so here it is:
import requests as req
from bs4 import BeautifulSoup as bs
import webbrowser
import re
webcheck = re.compile('^(https?:\/\/)?(www.)?([a-z0-9]+\.[a-z]+)([\/a-zA-Z0-9#\-_]+\/?)*$')
#Valid URL Check
while True:
    url = input('URL (MUST HAVE HTTP://): ')
    check = webcheck.search(url)
    groups = list(check.groups())
    if check != None:
        for group in groups:
            if group == 'https://':
                groups.remove(group)
            elif group.count('/') > 0:
                groups.append(group.replace('/', '--'))
                groups.remove(group)
        filename = ''.join(groups) + '.html'
        break
#Getting Website Data
reply = req.get(url)
soup = bs(reply.text, 'html.parser')
#Writing Website
with open(filename, 'w') as file:
    file.write(reply.text)
#Open Website
webbrowser.open(filename)
webbrowser.open('https://www.youtube.com')
I added webbrowser.open('https://www.youtube.com') so that I knew the module was working, which it was, as it did open up youtube.
However, webbrowser.open(filename) doesn't do anything, yet it returns True if I define it as a variable and print it.
The HTML file itself has a period in the name, but I don't think that should matter, as I have made a file without it in the name and it won't open either.
Does webbrowser need special permissions to work?
I'm not sure what to do as I've removed characters from the filename and even showed that the module is working by opening youtube.
What can I do to fix this?
From the webbrowser documentation:
Note that on some platforms, trying to open a filename using this function, may work and start the operating system’s associated program. However, this is neither supported nor portable.
So it seems that webbrowser can't do what you want. Why did you expect that it would?
Adding file:// + the full path name does the trick, for anyone wondering.
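A minimal sketch of that fix, reusing the filename variable from the question (using os.path.realpath is just one way to get an absolute path):
import os
import webbrowser
file_url = 'file://' + os.path.realpath(filename)  # absolute file:// URL to the saved page
webbrowser.open(file_url)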

Retry mechanism for my web crawler script

So, I am trying to make a website crawler which would retrieve all links within the site and print them to the console and also redirect the links to a text file using a python script.
This script takes in the URL of the website you want to retrieve links from, the number of URLs to be followed from the main page, and the maximum number of URLs to be retrieved, and then retrieves the URLs using the functions crawl(), is_valid() and get_all_website_links(). It also separates external links from internal links through the get_all_website_links() function.
So far I have been successful with retrieving, printing, and redirecting the links to the text file using the script, but I ran into a problem when the server refuses to connect: the link retrieval stops and the execution ends.
What I want my script to do is to retry a specified number of times and continue to the next link if it fails even after retrying.
I tried to implement this mechanism myself but could not come up with anything.
I'm appending my python script below for your better understanding.
An elaborate explanation with implementation would be deeply appreciated!
Pardon me if my grammar is bad ;)
Thanks for your time :)
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama
import sys
sys.setrecursionlimit(99999999)
print("WEBSITE CRAWLER".center(175,"_"))
print("\n","="*175)
print("\n\n\n\nThis program does not tolerate faults!\nPlease type whatever you are typing correctly!\nIf you think you have made a mistake please close the program and reopen it!\nIf you proceed with errors the program will crash/close!\nHelp can be found in the README.txt file!\n\n\n")
print("\n","="*175)
siteurl = input("Enter the address of the site (Please don't forget https:// or http://, etc. at the front!) :")
max_urls = int(input("Enter the number of urls you want to crawl through the main page : "))
filename = input("Give a name for your text file (Don't append .txt at the end!) : ")
# init the colorama module
colorama.init()
GREEN = colorama.Fore.GREEN
MAGENTA = colorama.Fore.MAGENTA
RESET = colorama.Fore.RESET
# initialize the set of links (unique links)
internal_urls = set()
external_urls = set()
def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)
def get_all_website_links(url):
    """
    Returns all URLs that is found on `url` in which it belongs to the same website
    """
    # all URLs of `url`
    urls = set()
    # domain name of the URL without the protocol
    domain_name = urlparse(url).netloc
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for a_tag in soup.findAll("a"):
        href = a_tag.attrs.get("href")
        if href == "" or href is None:
            # href empty tag
            continue
        # join the URL if it's relative (not absolute link)
        href = urljoin(url, href)
        parsed_href = urlparse(href)
        # remove URL GET parameters, URL fragments, etc.
        href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
        if not is_valid(href):
            # not a valid URL
            continue
        if href in internal_urls:
            # already in the set
            continue
        if domain_name not in href:
            # external link
            if href not in external_urls:
                print(f"{MAGENTA} [!] External link: {href}{RESET}")
                with open(filename+".txt","a") as f:
                    print(f"{href}",file = f)
                external_urls.add(href)
            continue
        print(f"{GREEN}[*] Internal link: {href}{RESET}")
        with open(filename+".txt","a") as f:
            print(f"{href}",file = f)
        urls.add(href)
        internal_urls.add(href)
    return urls
# number of urls visited so far will be stored here
total_urls_visited = 0
def crawl(url, max_urls=50000):
    """
    Crawls a web page and extracts all links.
    You'll find all links in `external_urls` and `internal_urls` global set variables.
    params:
        max_urls (int): number of max urls to crawl, default is 30.
    """
    global total_urls_visited
    total_urls_visited += 1
    links = get_all_website_links(url)
    for link in links:
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls=max_urls)
if __name__ == "__main__":
    crawl(siteurl, max_urls)
    print("[+] Total External links:", len(external_urls))
    print("[+] Total Internal links:", len(internal_urls))
    print("[+] Total:", len(external_urls) + len(internal_urls))
    input("Press any key to exit...")

Cycle through URLs from a txt

This is my first question so please bear with me (I have googled this and I did not find anything)
I'm making a program which goes to a URL, clicks a button, checks if the page gets forwarded, and if it does, saves that URL to a file.
So far I've got the first two steps done but I'm having some issues.
I want Selenium to repeat this process with multiple urls (if possible, multiple at a time).
I have all the urls in a txt called output.txt
At first I did
url_list = https://example.com
to see if my program even worked, and it did. However, I am stuck on how to get it to go to the next URL in the list, and I am unable to find anything on the internet which helps.
This is my code so far
import selenium
from selenium import webdriver
url_list = "C\\user\\python\\output.txt"
def site():
    driver = webdriver.Chrome("C:\\python\\chromedriver")
    driver.get(url_list)
    send = driver.find_element_by_id("NextButton")
    send.click()
    if (driver.find_elements_by_css_selector("a[class='Error']")):
        print("Error class found")
I have no idea how to get Selenium to go to the first URL in the list, then on to the second one, and so forth.
If anyone would be able to help me I'd be very grateful.
I think the problem is that you assumed the name of the file containing the URLs is itself a URL. You need to open the file first and build the URL list.
According to the docs (https://selenium.dev/documentation/en/webdriver/browser_manipulation/), get expects a URL, not a file path.
import selenium
from selenium import webdriver
with open("C\\user\\python\\output.txt") as f:
url_list = f.read().split('\n')
def site():
driver = webdriver.Chrome("C:\\python\\chromedriver")
for url in url_list:
driver.get(url)
send = driver.find_element_by_id("NextButton")
send.click()
if (driver.find_elements_by_css_selector("a[class='Error']")):
print("Error class found")

How to extract text from previously inserted input with Selenium?

For some reason, all I want is to get back from the input field what I have just written IN THE INPUT FIELD, just to check.
from selenium import webdriver
import os
xpath_user = '//*[@id="login-username"]'
user = 'user@yahoo.com'
dir_path = os.path.dirname(os.path.realpath(__file__))
chromedriver = dir_path + "/chromedriver.exe"
driver = webdriver.Chrome(chromedriver)
driver.implicitly_wait(3)
driver.get('https:\\www.yahoo.com')
driver.find_element_by_xpath(xpath_user).send_keys(user)
element = driver.find_element_by_xpath(xpath_user).text
print(element)
if element == 'user@yahoo.com':
    print("Good")
In this example, the output is '', but I want the actual 'user@yahoo.com'. I don't know if it is even possible, because 'user@yahoo.com' doesn't appear in the HTML form of the page. Maybe I am missing something or there is a workaround. I'd be glad if someone could help me.
Note that my experience with python is limited.
Try driver.find_element_by_xpath(xpath_user).get_attribute("value")
The text property is for text within the tags of an element.
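Applied to the snippet in the question, the check might look like this (a minimal sketch reusing the driver and xpath_user variables from above):
typed_value = driver.find_element_by_xpath(xpath_user).get_attribute("value")  # what was typed into the field
print(typed_value)
if typed_value == 'user@yahoo.com':
    print("Good")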

selenium stop after input

I am new to Python and Selenium coding, but I think I've figured it out. I tried to build some examples for myself to learn from, and I have 2 questions.
First of all, for some reason my code stops after the input and never reaches the yalla() function:
yallaurl = str(input('Your URL + ' + ""))
browser = webdriver.Chrome()
browser.get(yallaurl)
browser.maximize_window()
yalla()
Other than that, my other question is about browser.find_element_by_xpath. After I go to an HTML element and click Copy XPath, I get something like this:
/html/body/table[2]/tbody/tr/td/form/table[4]/tbody/tr[2]/td/table/tbody/tr[2]/td[2]
So how does this line of code work? Is this legit?
def yalla():
    sleep(2)
    count = len(browser.find_elements_by_class_name('flyingCart'))
    email = browser.find_element_by_xpath('/html/body/table[2]/tbody/tr/td/form/table[4]/tbody/tr[2]/td/table/tbody/tr[2]/td[2]')
    for x in range(2, count):
        itemdesc[x] = browser.find_element_by_xpath(
            "/html/body/table[2]/tbody/tr/td/form/table[1]/tbody/tr[2]/td[2]/table/tbody/tr[x]/td[2]/a[1]/text()")
        priceper[x] = browser.find_element_by_xpath(
            "/html/body/table[2]/tbody/tr/td/form/table[1]/tbody/tr[2]/td[2]/table/tbody/tr[x]/td[5]/text()")
        amount[x] = browser.find_element_by_xpath(
            "/html/body/table[2]/tbody/tr/td/form/table[1]/tbody/tr[2]/td[2]/table/tbody/tr[x]/td[6]")
    browser.navigate().to('https://www.greeninvoice.co.il/app/documents/new#type=100')
    checklogininvoice()
Yes, your code will run just fine and is legit, but it is not recommended.
As described, the absolute path works fine, but it would break if the HTML were changed even slightly.
Reference: https://selenium-python.readthedocs.io/locating-elements.html
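For illustration only (the class name below is hypothetical, not taken from the page in the question), a relative locator tends to survive layout changes that break an absolute path:
# Absolute path: breaks if any table or row above the target is added or removed
browser.find_element_by_xpath('/html/body/table[2]/tbody/tr/td/form/table[4]/tbody/tr[2]/td/table/tbody/tr[2]/td[2]')
# Relative path anchored on an attribute: keeps working as long as that attribute stays
browser.find_element_by_xpath("//td[@class='email-cell']")  # hypothetical class name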
Firstly, this code is confusing:
yallaurl = str(input('Your URL + ' + ""))
This is essentially equivalent to:
yallaurl = input('Your URL: ')
Yes, this code is correct:
browser.find_element_by_xpath('/html/body/table[2]/tbody/tr/td/form/table[4]/tbody/tr[2]/td/table/tbody/tr[2]/td[2]')
Please refer to the docs for proper usage.
Here is the suggested use of this method:
from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, '/html/body/table[2]/tbody/tr/td/form/table[4]/tbody/tr[2]/td/table/tbody/tr[2]/td[2]')
This code will return an object of the element you have selected. To print the HTML of the element itself, this should work:
print(element.get_attribute('outerHTML'))
For further information on page objects, please refer to this page of the docs.
Since you have not provided the code for your 'yalla' function, it is hard to diagnose the problem there.
