Save duplicates with Selenium to a .txt file - Python

So, my goal was to write a script that scrapes users who used a specific hashtag on Instagram and writes their accounts into a .txt file, and it mostly works!
My problem is that even though some accounts posted several pictures, my script shows each name only once. Any idea how I could count them, or how to get my script to stop dropping the duplicates?
I have looked everywhere but can't find a solution.
This is the part of my code that writes the file:
def generate_initial_information_txt(initial_information):
    initial_information_txt = open("initial_information", "w+")
    for user in initial_information:
        initial_information_txt.write(user + "\n")
This is the part that finds the name:
for user in range(30):
    el = self.driver.find_element_by_xpath('/html/body/div[4]/div[2]/div/article/header/div[2]/div[1]/div[1]')
    el = el.find_element_by_tag_name('a')
    time.sleep(2)
    profile = el.get_attribute('href')
    open_recent_posts_set.add(profile)
    time.sleep(2)
    next_button = self.driver.find_element_by_xpath('/html/body/div[4]/div[1]/div/div/a[2]')
    next_button.click()
    time.sleep(2)
The URL would be
https://instagram.com/explore/tags/hansaviertel_ms
So I start scraping the "Recent" posts and, for example, "Hansaforum" posted 5 of the first 6. If I use a range of 6, the script just writes a .txt file with two accounts, not "Hansaforum" 5 times. I'd like to get the number of occurrences in some way.
Thanks :)
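A minimal sketch of one way to keep the duplicates: the script adds profiles to a set (open_recent_posts_set), and a set silently discards repeated entries, so collecting them in a list and counting with collections.Counter would preserve the information. The "url count" output format below is an assumption, not part of the original script.

from collections import Counter

# Use a list instead of a set so repeated profiles are kept;
# inside the scraping loop, append instead of add:
# open_recent_posts.append(profile)
open_recent_posts = []

def generate_initial_information_txt(initial_information):
    # Count how often each profile URL was collected and write "url count" lines
    counts = Counter(initial_information)
    with open("initial_information", "w") as initial_information_txt:
        for user, amount in counts.items():
            initial_information_txt.write(f"{user} {amount}\n")

# Example with made-up profile URLs:
generate_initial_information_txt([
    "https://instagram.com/hansaforum/",
    "https://instagram.com/hansaforum/",
    "https://instagram.com/someone_else/",
])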

Related

Google search company names and websites in Selenium Python

As a starting videographer I am trying to make a list of companies in a specific area.
So far I was able to get to the results with this code:
search_input = "hovenier Ridderkerk"
PATH = **location chromedriver**
driver = webdriver.Chrome(PATH)
driver.get("https://www.google.com")
cookie_consent = driver.find_element_by_xpath('//*[@id="L2AGLb"]').click()
time.sleep(0.5)
search = driver.find_element_by_xpath('/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input')
time.sleep(0.5)
search.send_keys(search_input)
time.sleep(0.5)
search.send_keys(Keys.RETURN)
time.sleep(0.5)
show_all = driver.find_element_by_xpath('//*[@id="rso"]/div[2]/div/div/div/div/div[6]/div/g-more-link/a/div').click()
time.sleep(0.5)
However, after this I am a bit stuck. It appears that the names and websites of these companies are spread across a lot of classes, and I can't figure out which element I should iterate over to get all the company names and websites into the list.
Edit: I added the manual steps that I would like it to do:
go to google.com
search for 'hoveniers Ridderkerk' (garden companies)
show all results
make a list with all the company names and website addresses
Can anyone guide me in the correct direction? Perhaps also explain how you know which element to use, so next time I can do it myself?
Thank you!
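A rough sketch of the kind of loop that is usually needed here, reusing the driver from the code above: find every result container, then read the name and link out of each one. The selectors (div.g for a result block, h3 for its title) are common guesses for Google's result markup, not something confirmed for this particular results page, and Google changes its class names often, so inspect the page and adjust them.

from selenium.common.exceptions import NoSuchElementException

companies = []
for block in driver.find_elements_by_css_selector('div.g'):  # one block per result (assumed class)
    try:
        name = block.find_element_by_tag_name('h3').text                     # company / page name
        website = block.find_element_by_tag_name('a').get_attribute('href')  # website address
        companies.append((name, website))
    except NoSuchElementException:
        continue  # skip blocks without a plain title/link (ads, widgets, ...)
print(companies)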

Selenium webdriver url changes automatically for unknown reason

Description:
I am trying to make a job ad parser which works on the indeed.com site (I am using Python + Selenium + chromedriver).
I am able to log in with my Facebook credentials and am then redirected to the default site, which is hu.indeed.com (as I am living in Hungary).
I would like to search for jobs available in London, and therefore get the Selenium driver to change to the uk.indeed.com site.
Then I get Selenium to locate and input my job search criteria in the position input field, and the locality in the locality field. Up until now everything works smoothly.
The problem:
After pressing the search button I am able to see the results window, but after a very short time I am automatically redirected to the hu.indeed.com site. As you can see from my code below, I have no such commands, and I have no clue whatsoever why or how this is happening. My print statements show that driver.current_url changes at some point, and I don't understand why that happens or how I could prevent it.
Could you please let me know why the URL changes and how I could prevent that?
Code:
driver.get("https://uk.indeed.com/")
time.sleep(1)
job_type_input=driver.find_element_by_xpath('//*[@id="text-input-what"]')
search_text=f"{jobs[0]} {extra_info}"
job_type_input.send_keys(search_text)
time.sleep(1)
print(f"1 print:{driver.current_url}") #<--- 1. print
job_location_input=driver.find_element_by_xpath('//*[@id="text-input-where"]')
job_location_input.send_keys(cities[0])
search_button=driver.find_element_by_xpath('//*[@id="jobsearch"]/button')
search_button.click()
time.sleep(5)
print(f"2 print:{driver.current_url}") #<--- 2. print
print(f"3 print:{driver.current_url}") #<--- 3. print
try:
    mosaic_element = driver.find_element_by_id("mosaic-provider-jobcards")
    html = mosaic_element.get_attribute('innerHTML')
    print("success")
except:
    print("error in try")
print(f"4 print:{driver.current_url}") #<--- 4. print
Output:
1 print:https://uk.indeed.com/
2 print:https://hu.indeed.com/
3 print:https://hu.indeed.com/
error in try
4 print:https://hu.indeed.com/
I am the one who wrote the original post, and I found the solution to this problem. As Max Daroshchanka mentioned in his answer, the problem was caused by indeed.com itself, which reloaded the page due to some plugin (or something similar). My solution was therefore to interact with the input field only after some time had passed (using time.sleep(2)).
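In code, the fix described above amounts to something like the sketch below: pause after loading the page (and again before submitting), so the site's own reload has already happened before the fields are used. It reuses search_text and cities from the question's code, and the exact sleep lengths are guesses beyond the time.sleep(2) the poster mentions.

driver.get("https://uk.indeed.com/")
time.sleep(2)  # let indeed.com finish its own reload before touching the inputs

job_type_input = driver.find_element_by_xpath('//*[@id="text-input-what"]')
job_type_input.send_keys(search_text)

job_location_input = driver.find_element_by_xpath('//*[@id="text-input-where"]')
job_location_input.send_keys(cities[0])

time.sleep(2)  # give the page a moment again before submitting
search_button = driver.find_element_by_xpath('//*[@id="jobsearch"]/button')
search_button.click()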

String filtering in an if statement not working in Python

I am writing a web scraper that scrapes data from a list of links one after the other. The problem is that the website uses the same class names for up to 3 different buttons at once, with no other unique identifiers, which to my understanding makes it impossible to point to the exact button when there are several.
I used driver.find_element, which worked well since it just found the first result and basically ignored the other buttons. However, on some pages the offers information that I am trying to scrape is missing, which results in the script picking up wrong data and filling it in, even though I am not interested in that data at all.
So I came up with a solution that checks whether the scraped information contains a specific string that only appears for the one piece of information I am trying to get; if the string is not found, the data variable should be overwritten with empty data so that it is obvious the information doesn't exist.
However, the if statement I am trying to filter the strings with doesn't seem to work at all. When there are no buttons on the webpage, it does manage to fill in the variable with empty data. However, once a different button appears, it is not filtered out, gets through somehow, and ruins the whole thing.
This is an example webpage that doesn't contain the data at all:
https://reality.idnes.cz/rk/detail/nido-group-s-r-o/5a85b108a26e3a2adb4e394c/?page=185
This is an example webpage that contains 2 buttons with data, the first of which I am trying to scrape. Look for the "nemovitostí" text in the blue button; that is what I am trying to filter on.
https://reality.idnes.cz/rk/detail/m-m-reality-holding-a-s/5a85b582a26e3a321d4f2700/
This is the problematic code:
# Offers
offers = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "span.btn__text")))
offers = offers.text
print(offers)
# Check if scraped information contains offers else move on
if "nemovitostí" or "nemovitosti" or "nemovitost" in offers:
    pass
else:
    offers = ""
Since the if statement should supposedly look for that set of strings and otherwise execute the code under the else statement, I can't understand how the data gets in at all. There are no errors or warnings; it just picks up the data instead of ignoring it, even when the string is different.
This is more of the code for reference:
# Open links.csv file and read its contents
with open('links.csv') as read:
    reader = csv.reader(read)
    link_list = list(reader)
# Information search
for link in link_list:
    driver.get(', '.join(link))
    # Title
    title = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "h1.b-annot__title.mb-5")))
    # Offers
    offers = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "span.btn__text")))
    offers = offers.text
    print(offers)
    # Check if scraped information contains offers else move on
    if "nemovitostí" or "nemovitosti" or "nemovitost" in offers:
        None
    else:
        offers = ""
    # Address
    address = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "p.font-sm")))
    # Phone number
    # Try to obtain phone number; if nonexistent, move on
    try:
        phone_number = wait.until(ec.presence_of_element_located((By.XPATH, "//a[./span[contains(@class, 'icon icon--phone')]]")))
        phone_number = phone_number.text
    except TimeoutException:
        phone_number = ""
    # Email
    # Try to obtain email; if nonexistent, move on
    try:
        email = wait.until(ec.presence_of_element_located((By.XPATH, "//a[./span[contains(@class, 'icon icon--email')]]")))
        email = email.text
    except TimeoutException:
        email = ""
    # Print scraping results
    print(title.text, " ", offers, " ", address.text, " ", phone_number, " ", email)
    # Save results to a list
    company = [title.text, offers, address.text, phone_number, email]
    # Write results to scraped.xlsx file
    worksheet.write_row(row, 0, company)
    del title, offers, address, phone_number, email
    # Push row number lower
    row += 1
workbook.close()
driver.quit()
How is it possible that the data still gets through? Is there an error in my syntax? If you see my mistake, please let me know so I can get better next time! Thanks to anyone for any sort of help!
1. The problem is that the website uses the same class names for up to 3 different buttons at once with no other unique identifiers used which to my understanding makes it impossible to point to the exact button if there are more
You can actually get the element you need if you use By.XPATH instead of By.CSS_SELECTOR.
First would be (//span[@class='btn__text'])[1], second (//span[@class='btn__text'])[2] and third (//span[@class='btn__text'])[3]
Or, if you are not sure what the order will be, you can be more specific, like:
(//span[@class='btn__text' and contains(text(),'nemovitostí')])
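For illustration, the suggestion might be used with the same old-style Selenium API as the question's script (the XPath strings are the ones from this answer; driver is the question's webdriver instance):

# First span.btn__text button on the page
first_offer = driver.find_element_by_xpath("(//span[@class='btn__text'])[1]")
print(first_offer.text)

# Or, independent of ordering: the button whose text mentions "nemovitostí"
offer_button = driver.find_element_by_xpath(
    "//span[@class='btn__text' and contains(text(),'nemovitostí')]")
print(offer_button.text)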
2. The second problem is related to if syntax in Python.
It should be like this:
if "nemovitostí" in offers or "nemovitosti" in offers or "nemovitost" in offers:
There might be a nicer way to write this, maybe something like this:
for i in ["nemovitostí", "nemovitosti", "nemovitost"]:
    if i in offers:
The most idiomatic way to write this would be the following:
value = ["nemovitostí", "nemovitosti", "nemovitost"]
if any(s in offers for s in value):
    # do something here
else:
    offers = ""
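As a quick self-contained demonstration of the difference: a bare non-empty string such as "nemovitostí" is always truthy, so the original condition is always True, while the any() version only matches when one of the substrings actually occurs in offers (the sample button texts below are made up):

offers = "Zobrazit detail"  # made-up text from some unrelated button

# Original condition: always True, because the bare string "nemovitostí" is truthy
print(bool("nemovitostí" or "nemovitosti" or "nemovitost" in offers))  # True

# Fixed condition: only True when one of the substrings is really present
value = ["nemovitostí", "nemovitosti", "nemovitost"]
print(any(s in offers for s in value))  # False

offers = "126 nemovitostí"
print(any(s in offers for s in value))  # True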

How to scrape a paginated table

I am trying to scrape all the data from the table on this website (https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s) but can't seem to figure out how I would go about scraping all of the subsequent pages. This is the code to scrape the first page of results into a CSV file:
import csv
import requests
from bs4 import BeautifulSoup
url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, "html.parser")
fileList = []
# For the table-header cells
tableHeader = soup.find('tr', attrs={'class': 'table-header'})
rowList = []
for cell in tableHeader.findAll('th'):
    cellText = cell.text.replace(' ', '').replace('\n', '')
    rowList.append(cellText)
fileList.append(rowList)
# For the table body cells
table = soup.find('tbody', attrs={'class': 'stripe'})
for row in table.findAll('tr'):
    rowList = []
    for cell in row.findAll('td'):
        cellText = cell.text.replace(' ', '').replace('\n', '')
        if cellText == "Details":
            continue
        rowList.append(cellText)
    fileList.append(rowList)
outfile = open("./prison-inmates.csv", "w")
writer = csv.writer(outfile)
writer.writerows(fileList)
How do I get to the next page of results?
Code taken from this tutorial (http://first-web-scraper.readthedocs.io/en/latest/)
Although I wasn't able to get your posted code to run, I did find that the original tutorial code you linked to can be changed on the url = line to:
url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s' \
+ '?max_rows=250'
Running python scrape.py then successfully outputs inmates.csv with all available records.
In short, this works by:
instead of asking "How do I get to the next page?"
we ask "How do I remove pagination?"
we make the page send all records at once, so there is no pagination to deal with in the first place
This allows us to use the original tutorial code to save the complete set of records.
Explanation
url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s' to use the new URL. The old URL in the tutorial, http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp, redirects to this new URL but doesn't work with our solution, so we can't use the old URL
\ is a line continuation allowing me to continue the line of code on the next line, for readability
+ is to concatenate so we can add the ?max_rows=250.
So the result is equivalent to url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s?max_rows=250'
?max_rows=<number-of-records-to-display> is a query string I found that works for this particular Current Detainees page. It can be found by first noticing the Page Size text entry field meant for users to set a custom rows-per-page value; it shows a default value of 50. Examine its HTML, for example in Firefox (52.7.3): press Ctrl+Shift+I to open the Web Developer Inspector, click the Select element button (the icon resembling a box outline with a mouse cursor arrow on it), then click on the input field containing 50. The HTML pane below highlights: <input class="mrcinput" name="max_rows" size="3" title="max_rowsp" value="50" type="text">. This means the form submits a variable named max_rows, which is a number defaulting to 50. Depending on how a page is coded, it may recognize such variables when they are appended to the URL as a query string, so it is worth appending ?max_rows= plus a number of your choice. At the time, the page said 250 Total Items, so I tried 250 by changing my browser address bar to load https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s?max_rows=250. It successfully displayed 250 records, making pagination unnecessary, so this ?max_rows=250 is what we use to form the URL in our script.
However, the page now says 242 Total Items, so it seems they are removing inmates, or at least listed inmate records. You can use ?max_rows=242, but ?max_rows=250 will still work, because 250 is larger than the total number of records (242); as long as the value is larger than the total, the page will not need to paginate, and you get all the records on one page.
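For completeness, the same query string can be passed through requests with the params argument, which is equivalent to appending ?max_rows=250 to the URL by hand:

import requests

url = 'https://report.boonecountymo.org/mrcjava/servlet/SH01_MP.I00290s'
# Equivalent to requesting .../SH01_MP.I00290s?max_rows=250
response = requests.get(url, params={'max_rows': 250})
print(response.url)       # final URL including the query string
html = response.content   # feed this into the BeautifulSoup code above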
Warranty
This isn't a universal solution for scraping table data when you encounter pagination. It works for this Current Detainees page and for pages that happen to be coded the same way.
This is because pagination isn't implemented uniformly, so any code or solution depends on how the page implements it. Here we use ?max_rows=...; another website, even if it has an adjustable per-page limit, may use a different name for this max_rows variable, or ignore query strings altogether, so our solution may not work elsewhere.
Scalability issues: if you are dealing with a different website where you need millions of records, for example, a download-all-at-once approach like this can run into memory limits on both the server side and your own computer, and either could time out and fail to finish delivering or processing. A different approach, resembling the pagination you originally asked about, would be more suitable there (a generic sketch of that shape follows below).
So, in the future, if you need to download large numbers of records, this download-all-at-once approach will likely run you into memory-related trouble, but for scraping this particular Current Detainees page it will get the job done.
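For comparison, a page-by-page scraper usually has roughly the shape below. Everything in it is hypothetical: the base URL, the page query parameter, the row class, and the stopping rule are placeholders, not something this Boone County page is known to support, so treat it only as the general pattern.

import requests
from bs4 import BeautifulSoup

base_url = 'https://example.com/records'   # hypothetical paginated listing
all_rows = []
page = 1
while True:
    # 'page' is a hypothetical query parameter; real sites name and count pages differently
    soup = BeautifulSoup(requests.get(base_url, params={'page': page}).content, 'html.parser')
    rows = soup.find_all('tr', attrs={'class': 'data-row'})  # hypothetical row class
    if not rows:
        break  # no rows returned: we ran past the last page
    all_rows.extend(row.get_text(strip=True) for row in rows)
    page += 1
print(len(all_rows), 'rows scraped')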

How to make Python requests.get wait a few seconds?

I wanted to get some experience with HTML crawling, so I wanted to see if I could grab some values from the following site: http://www.iex.nl/Aandeel-Koers/11890/Royal-Imtech/koers.aspx
This site shows the price of Imtech shares.
If you take a look at the site, you see there is one number shown in bold; this is the price of the share.
As you may have seen, this price changes, and that's okay. I only want the value at the moment I run my script.
But if you reload the page, you may notice how it first shows "laatste koers" and only after a delay of about a second shows "realtime".
As you may have figured out by now, I'm interested in the "realtime" value.
Here is my question: how do I get this value? I've tried time.sleep(2) in different places, and I've tried a timeout on the request. Neither worked.
How can I fix this?
from lxml import html
import requests
pagina = 'http://www.iex.nl/Aandeel-Koers/11890/Royal-Imtech/koers.aspx'
page = requests.get(pagina)
tree = html.fromstring(page.text)
koers = tree.xpath('//span[@class="RealtimeLabel"]/text()')
prices = tree.xpath('//span[@id="ctl00_ctl00_Content_LeftContent_PriceDetails_lblLastPrice"]/text()')
print koers[0], pagina.split("/")[5], prices[0]
I get output like this
Laatste koers Royal-Imtech 0,093
While I want output like this
Realtime Royal-Imtech 0,093
I would suggest using a wait until the element changes.
The block of code below should help you.
import time

def wait_while(condition, timeout, delta=1):
    """
    @condition: lambda function which checks if the text contains "REALTIME"
    @timeout: max waiting time in seconds
    @delta: time after which another check has to be made
    """
    max_time = time.time() + timeout
    while max_time > time.time():
        if condition():
            return True
        time.sleep(delta)
    return False
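One way the helper might be wired up to the question's code: re-fetch the page inside the condition and check whether the label has switched. This is a sketch of mine, and it assumes the label eventually changes in the raw HTML; if the switch to "Realtime" happens only via JavaScript in the browser, a plain requests approach will never see it and a browser-based tool would be needed instead.

from lxml import html
import requests

pagina = 'http://www.iex.nl/Aandeel-Koers/11890/Royal-Imtech/koers.aspx'

def realtime_shown():
    # Re-download the page and check whether the label has switched to "Realtime"
    tree = html.fromstring(requests.get(pagina).text)
    labels = tree.xpath('//span[@class="RealtimeLabel"]/text()')
    return bool(labels) and "realtime" in labels[0].lower()

# Poll for up to 10 seconds, checking every 2 seconds
if wait_while(realtime_shown, timeout=10, delta=2):
    tree = html.fromstring(requests.get(pagina).text)
    prices = tree.xpath('//span[@id="ctl00_ctl00_Content_LeftContent_PriceDetails_lblLastPrice"]/text()')
    print(prices[0] if prices else "price not found")
else:
    print("label never switched to Realtime in the raw HTML")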
