I'm writing a web scraping program and need to bulk search on FedEx. To do this I normally concatenate all my tracking numbers with "\n" between them, as an equivalent to pasting text from an Excel column.
The issue is that when I enter my string into their search box, it enters the numbers concatenated as if without the delimiter, so the search box only sees one long tracking number rather than multiple (like if pasted from Excel). Any idea on how I can get the string formatted, or sent to the search box, correctly?
this is what it looks like when I paste 2 tracking numbers 12345 and abcdefg:
and here's what it should look like:
here is my code for sending the string to the search box:
from selenium.webdriver.common.by import By
from time import sleep

def fedex_bulk(tn_list):
    # you can mostly ignore until the end of this function, all this is setup for the driver #
    # all relevant formatting is in the creation of the variable search_str #
    driver = start_uc()
    loaded = False
    size = 3
    # chunk the tracking numbers into sublists of at most `size` (slicing clamps at the end)
    tn_list = [tn_list[i:i+size] for i in range(0, len(tn_list), size)]
    tn_dict = []
    for sublist in tn_list:
        tries = 0
        ### concatenate all tracking numbers with "\n" delimiter
        search_str = ''
        for tn in sublist:
            search_str += tn + '\n'
        ### loop until numbers searched or tried 4 times
        while not loaded:
            try:
                if tries == 4:
                    break
                tries += 1
                ### refresh until loaded
                driver.get("https://www.fedex.com/en-us/tracking.html")
                page_loaded = False
                while not page_loaded:
                    try:
                        inputform = driver.find_element(By.XPATH, "//input[@class='form-input__element ng-pristine ng-invalid ng-touched']")
                        page_loaded = True
                    except:
                        driver.refresh()
                        sleep(5)
                ### search_str sent to search box, formatted incorrectly
                inputform.send_keys(search_str)
                sleep(1)
                driver.find_element(By.XPATH, "//button[@type = 'submit']").click()
            except:
                pass  # closes the try above; the rest of the retry handling was omitted in the post
Thank you in advance!
I think the problem here is the following:
Inside the for sublist in tn_list: loop you append a '\n' to each tracking number tn in the sublist, so search_str contains the tracking numbers concatenated with '\n' between them.
But inside the while not page_loaded: loop you locate the first input field and then send that entire long string, containing multiple tracking numbers, to it.
The search input element on the page is probably restricted to valid input characters only, so it simply ignores all the '\n' characters.
On the other hand, you are not inserting your tracking numbers into the other search input fields, as shown in the picture of how it should look.
So, in order to make your code work as you want, you will probably need to insert a single tracking number each time, or insert the numbers into separate search input fields; a sketch of the first approach follows.
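Here is a minimal sketch of the one-number-per-search approach, reusing start_uc() and the submit-button locator from the question. The contains(@class, ...) match is my assumption for locating the input more robustly; treat the whole thing as a starting point rather than a drop-in fix:

from selenium.webdriver.common.by import By
from time import sleep

def fedex_search_one_by_one(tn_list):
    driver = start_uc()  # helper from the question, assumed to return a driver
    for tn in tn_list:
        driver.get("https://www.fedex.com/en-us/tracking.html")
        # assumption: match a class substring instead of the full, state-dependent class list
        inputform = driver.find_element(By.XPATH, "//input[contains(@class, 'form-input__element')]")
        inputform.send_keys(tn)  # a single tracking number, so no delimiter is needed
        sleep(1)
        driver.find_element(By.XPATH, "//button[@type = 'submit']").click()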
My Python code looks like below. Basically, I am joining two parts of a URL using the urljoin function of urllib. The issue I am facing is that during the URL join, my output looks like below: the list entered by the user ends up in the start part of the URL, end receives the start value, and the terms placeholders stay unfilled. My expected output is also mentioned below.
To summarize, I want the user to input the total number of terms, and the entered terms should be passed into the query part of the URL (i.e. terms[]=" "&terms[]=" "). Not sure if I am missing something.
Thanks in advance for your help!
Code
from urllib.parse import urljoin
num_terms=int(input("Enter total number of search terms:")) #Asking user for number of terms
a=input("Enter all search terms: ").split(",",num_terms) #User enters all the terms
start,end=input("Enter start and end date").split() #User enters start and end date
base_url="http://mytest.org"
join_url="/comments/data?"+"terms[]={}"+"&terms[]={}"*int(num_terms-1)+"&start={}&end={}".format(a,start,end)
url=urljoin(base_url,join_url) #Joining url
url
Output:
Enter total number of search terms:3
Enter all search terms: ty ou io
Enter start and end date2345 7890
"http://mytest.org/comments/data?terms[]={}&terms[]={}&terms[]={}start=['ty ou io']&end=2345"
Expected Output
"http://mytest.org/comments/data?terms[]=ty&terms[]=ou&terms[]=io&start=2345&end=7890"
One issue I spotted: the search terms you entered don't contain any commas (,), which is the delimiter you used to split the string.
# the base URL path
url_base = "http://mytest.org/comments/data?"

# you don't need a search term number here; the split below will do the job
# ask for the search terms directly; there must be at least one item
a = input("Enter all search terms (separate by ,): ").split(",")
while len(a) < 1:
    a = input("Enter all search terms (separate by ,): ").split(",")

# ask for the start and end dates; no guarantee they are correct,
# so use a loop to force the user to do the check for you
dates = input("Enter the start and end date (separate by ,): ").split(",")
while len(dates) != 2:
    dates = input("Enter the start and end date (separate by ,): ").split(",")

# form the URL
url_search = "&".join([f"terms[]={x}" for x in a])
url_date = "start=" + dates[0] + "&end=" + dates[1]

# the final result
url_final = "&".join([url_base, url_search, url_date])

# print the result
print(url_final)
The output is like:
Enter all search terms (separate by ,): ty,ou,io
Enter the start and end date (separate by ,): 2000,2022
http://mytest.org/comments/data?&terms[]=ty&terms[]=ou&terms[]=io&start=2000&end=2022
As the author mentioned in a comment, they will use requests to make an API call, so constructing the URL by hand isn't necessary; you can just use the functionality of the module you're using. You can let requests build the query string internally by passing a dict of URL params to the params argument (read Passing Parameters In URLs):
import requests

response = requests.get(
    "http://mytest.org/comments/data",
    params={
        "terms[]": ["ty", "ou", "io"],
        "start": 2345,
        "end": 7890
    }
)
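As a quick sanity check, you can print the URL that requests actually built. Note that requests percent-encodes the square brackets ([] becomes %5B%5D), which servers treat the same way:

print(response.url)
# e.g. http://mytest.org/comments/data?terms%5B%5D=ty&terms%5B%5D=ou&terms%5B%5D=io&start=2345&end=7890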
One problem is that your code only formats the last bit of the URL. That is,
"&start={}&end={}".format(a,start,end)
is the only part the formatting applies to; you need to add parentheses around the whole string.
The other thing is that we need to unpack the list of terms, a, in the .format call:
join_url=("/comments/data?"+"terms[]={}"+"&terms[]={}"*int(num_terms-1)+"&start={}&end={}").format(*a,start,end)
But I'd recommend using f-strings instead of .format:
join_url=("/comments/data?"+'&'.join([f"terms[]={term}"for term in a])+f"&start={start}&end={end}")
(I also used str.join for the terms instead of string multiplication.)
A simple for loop should suffice:
terms = ""
for i in range(num_terms):
terms += f"terms[]={a[i]}&"
Basically, format fills one placeholder per argument; it does not iterate over a list the way you wanted. This is a simple way to achieve your goal (note the loop leaves a trailing '&', which is why the output below shows a double & before start). You could probably use a list comprehension as well:
[f"terms[]={term}" for term in a]
Output:
Enter total number of search terms:3
Enter all search terms: au,io,ua
Enter start and end date233 444
http://mytest.org/comments/data?terms[]=au&terms[]=io&terms[]=ua&&start=233&end=444
I am trying to extract PDF page numbers if the page contains certain strings, and then append the selected page numbers to a list. For example, page 2, 254, 439 and 458 meet the criteria and I'm expecting the output as a list [2,254,439,458]. My code is:
import re
import PyPDF2

object = PyPDF2.PdfFileReader(file_path)
NumPages = object.getNumPages()
String = 'specific string'
for i in range(0, NumPages):
    PageObj = object.getPage(i)
    Text = PageObj.extractText()
    ReSearch = re.search(String, Text)
    Pagelist = []
    if ReSearch != None:
        Pagelist.append(i)
        print(Pagelist)
I received output as:
[2]
[254]
[439]
[458]
Could someone please take a look and see how I can fix it? Thank you
Right now you are defining a new list in every iteration, so you have to define the list only once, before the loop. Also print it outside the loop:
Pagelist = []
for i in range(0, NumPages):
    # rest of the loop
print(Pagelist)
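Put together with the loop body from the question (reusing object, NumPages and String as defined there), the corrected version would look something like this:

Pagelist = []
for i in range(0, NumPages):
    PageObj = object.getPage(i)
    Text = PageObj.extractText()
    if re.search(String, Text) is not None:
        Pagelist.append(i)  # record only pages that contain the string
print(Pagelist)  # e.g. [2, 254, 439, 458]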
I'm writing a Python script to read a PDF file and record both the string that appears after every instance where "time" is mentioned and the page number it's mentioned on.
I have gotten it to recognize when each page has the string "time" on it and report the page number; however, if the page has "time" more than once, it does not tell me. I'm assuming this is because the criterion of having the string "time" on the page at least once has already been fulfilled, so it skips to the next page to perform the check.
How would I go about finding multiple instances of the word "time"?
This is my code:
import PyPDF2

def pdf_read():
    pdfFile = "records\document.pdf"
    pdf = PyPDF2.PdfFileReader(pdfFile)
    pageCount = pdf.getNumPages()
    for pageNumber in range(pageCount):
        page = pdf.getPage(pageNumber)
        pageContent = page.extractText()
        if "Time" in pageContent or "time" in pageContent:
            print(pageNumber)
Also, as a side note: this PDF is a scanned document, so when I read the text in Python (or copy and paste it into Word) a lot of words come up with multiple random symbols and characters, even though the page is perfectly legible. Is this a limitation I can't get around without applying more complex concepts such as machine learning in order to read the files accurately?
A solution would be to create a list of strings off pageContent and count the frequency of the word 'time' in the list. This also makes it easier to select the word following 'time': you can simply retrieve the next item in the list:
import PyPDF2
import string

pdfFile = "records\document.pdf"
pdf = PyPDF2.PdfFileReader(pdfFile)
pageCount = pdf.getNumPages()
for pageNumber in range(pageCount):
    page = pdf.getPage(pageNumber)
    pageContent = page.extractText()
    pageContent = ' '.join(pageContent.splitlines()).split()  # words to list (join with a space so line breaks don't fuse words)
    pageContent = ["".join(j.lower() for j in i if j not in string.punctuation) for i in pageContent]  # remove punctuation and lowercase
    print(pageContent.count('time'))  # count occurrences of 'time' in the list (already lowercased above)
    print([(j, pageContent[i+1] if i+1 < len(pageContent) else '') for i, j in enumerate(pageContent) if j == 'time'])  # list 'time' and the following word
Note that this example also strips each word of punctuation characters. Hopefully this sufficiently cleans up the bad OCR.
I am creating a bot that automates my work and copies particular values from a particular website. Everything works fine except for the last line of my code: w.text produces an outcome which is text, and I need a number. Each element that I need the value of looks like this when inspected:
<span class="good">€25,217.65</span>
How do I get the value as a number instead of as text? I tried w.value and w.get_attribute('value') but neither works.
Here is my program (excluding downloads of libraries and files):
driver = webdriver.Chrome(driver_path)
driver.get('https://seabass-admin.igp.cloud/')

# waiting for login table to load
try:
    element = WebDriverWait(driver, 10).until(
        ec.presence_of_element_located((By.XPATH, '//*[@id="email"]'))
    )
except:
    driver.quit()

# entering sensitive info
driver.find_element_by_id("email").send_keys(pwx.em)  # login details
driver.find_element_by_id("password").send_keys(pwx.pw)  # password details
driver.find_element_by_xpath('//*[@id="appContainer"]/div/form/button').click()  # click sign in

# waiting for page to load
try:
    element = WebDriverWait(driver, 10).until(
        ec.presence_of_element_located((By.XPATH, '//*[@id="testing"]/section/section[4]/div/table/tbody/tr[2]/td[3]/span'))
    )
except:
    driver.quit()

# getting info from the page
w = driver.find_element_by_xpath('//*[@id="testing"]/section/section[4]/div/table/tbody/tr[2]/td[3]/span')
cell = outcome['import']
cell[withdrawal_cell].value = w.text
You could use some of Python's built-in string methods for that:
str.strip() to remove any leading or trailing '€' character, then
str.replace() to remove the ',' (replace it with an empty string '')
Specifically:
str_w = w.text  # this is the '€25,217.65' string
digits = str_w.strip('€').replace(',', '')  # use the methods above to get a number-like string
cell[withdrawal_cell].value = float(digits)  # convert to a float number
As per the HTML you have shared:
<span class="good">€25,217.65</span>
The text €25,217.65 is the innerHTML.
So, you can extract the text €25,217.65 using either:
w.get_attribute("innerHTML")
the text attribute.
Now, to get the value €25,217.65 as a number instead of text, you need to:
Remove the € and , characters using re.sub():
import re
string = "€25,217.65"
my_string = re.sub('[€,]', '', string)
Finally, to convert the string to a float, pass it as an argument to float() as follows:
my_number = float(my_string)
So the entire operation in a single line:
import re
string = "€25,217.65"
print(float(re.sub('[€,]', '', string)))
Effectively, your line of code can be either of the following:
Using text attribute:
cell[withdrawal_cell].value = float(re.sub('[€,]', '', w.text))
Using get_attribute("innerHTML"):
cell[withdrawal_cell].value = float(re.sub('[€,]', '', w.get_attribute("innerHTML")))
I have a .txt file that is currently formatted kind of like this:
John,bread,17,www.google.com
Emily,apples,24,
Anita,35,www.website.com
Charles,banana,www.stackoverflow.com
Susie,french fries,31,www.regexr.com
...
The first column will never have any missing values.
I'm trying to use python to convert this into a .csv file. I know how to do this if I have all of the column data for each row, but my .txt is missing some data in certain columns. How can I convert this to a .csv while making sure the same type of data remains in the same column? Thanks :)
Split by commas. You know the pattern should be word, word, int (I'm assuming), string in the pattern of www.word.word.
If there is only 1 word at the front instead of 2, add another comma after the first word.
If the number is missing, add a comma after the second word.
Etc...
Say you get the line "Susie,www.regexr.com": you know that a word and a number are missing, so add 2 commas after the first word.
It's essentially a bunch of if statements or a switch-case statement.
There probably is a more elegant way of doing this, but my mind is fried from dealing with server and phone issues all morning.
This isn't tested in any way, I hope I didn't just embarrass myself:
import re

# read_line is a line read from the file
split_line = read_line.split(',')
num_elements = len(split_line)  # do this only once for efficiency
if num_elements == 3:  # need to add an element somewhere, depending on what's missing
    if re.search(r'www\..+\..+', split_line[2]):  # starting at the last element: is it a URL?
        if re.search(r'\d', split_line[1]):  # if the previous element is a digit
            # if so, add a comma as the only element missing is the string at split_line[1]
            read_line = split_line[0]+','+','+split_line[1]+','+split_line[2]
        else:
            # if not, add a comma at split_line[2]
            read_line = split_line[0]+','+split_line[1]+','+','+split_line[2]
    else:
        # last element isn't a URL, add a comma in its place
        read_line = split_line[0]+','+split_line[1]+','+split_line[2]+','
elif num_elements == 2:  # need two more elements; the first one is assumed to always be there
    if re.search(r'www\..+\..+', split_line[1]):  # the second element is a URL
        # insert empty fields for the missing string and number
        read_line = split_line[0]+',,,'+split_line[1]
    elif re.search(r'\d', split_line[1]):  # the second element contains digits
        # insert empty fields for the missing string and URL
        read_line = split_line[0]+',,'+split_line[1]+','
    else:
        # insert empty fields for the missing number and URL
        read_line = split_line[0]+','+split_line[1]+',,'
elif num_elements == 1:
    read_line = split_line[0]+',,,'
I thought about your issue and I can only offer a half-baked solution, since your CSV file does not mark missing data with something like ,,.
Your current CSV file looks like this:
John,bread,17,www.google.com
Emily,apples,24,
Anita,35,www.website.com
Charles,banana,www.stackoverflow.com
Susie,french fries,31,www.regexr.com
If you find a way to change your CSV file to look like this:
John,bread,17,www.google.com
Emily,apples,24,
Anita,,35,www.website.com
Charles,banana,,www.stackoverflow.com
Susie,french fries,31,www.regexr.com
You can use the solution below. For info, I've put your input into a text file:
In [1]: import pandas as pd
In [2]: population = pd.read_csv('input_to_csv.txt')
In [3]: mod_population=population.fillna("NaN")
In [4]: mod_population.to_csv('output_to_csv.csv',index=False)
One suggestion would be to do a regex check, if you can assume some kind of uniformity. For example, build a list of regex patterns, since each piece of data seems to be of a different kind.
If the second column you read in matches only letters and spaces, it's likely food. On the other hand, if it's a digit match, you should assume the food is missing. If it's a URL match, you are missing both. You'll want to be thorough with your test cases, but if the actual data is similar to your example, you have 3 relatively distinctive cases: a string, an integer, and a URL. This should make writing the regex checks relatively trivial. Importing re and using re.search should let you test each regex without too much overhead, as in the sketch below.
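As a rough illustration of that idea, here is a sketch that classifies each optional field with a regex and writes the fields into fixed columns. The patterns and file names are assumptions based on the sample data, not tested against the real file:

import re

# assumed patterns: a plain integer and a www.*.* URL; anything else is treated as food
patterns = {
    "number": re.compile(r"^\d+$"),
    "url": re.compile(r"^www\.\S+\.\S+$"),
}

def classify(field):
    for name, pattern in patterns.items():
        if pattern.match(field):
            return name
    return "food"  # plain words fall through to the food column

with open("input.txt") as src, open("output.csv", "w") as dst:
    for line in src:
        if not line.strip():
            continue  # skip blank lines
        name, *rest = line.strip().split(",")
        row = {"food": "", "number": "", "url": ""}
        for field in filter(None, rest):  # drop empty fields left by trailing commas
            row[classify(field)] = field
        dst.write(f'{name},{row["food"]},{row["number"]},{row["url"]}\n')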