Related
I am trying to web-scrape Airbnb with selenium. However it's been a HUGE impossible mission.
First, I create a drive, where the argument "executable_path" is where my chromedriver is installed.
driver = webdriver.Chrome(executable_path=r'C:\directory\directory\directory\chromedriver.exe')
Secondly, I do the other stuffs:
driver.get('https://www.airbnb.com.br/')
a = driver.find_element(By.CLASS_NAME, "cqtsvk7 dir dir-ltr")
a.click()
a.send_keys('Poland')
Here I received the error: NoSuchWindowException: Message: no such window: target window already closed from unknown error: web view not found
Moreover, when I create the variables to store the html elements, it doesn't work as well:
title = driver.find_elements(By.CLASS_NAME, 'a-size-base-plus a-color-base a-text-normal')
place = driver.find_elements(By.ID, 'title_49247685')
city = driver.find_elements(By.CLASS_NAME, 'f15liw5s s1cjsi4j dir dir-ltr')
price = driver.find_elements(By.CLASS_NAME, 'p11pu8yw dir dir-ltr')
Please, someone could help me? How can I get the place, city and price of all of my query of place to travel in airbnb? (I know how to store all in a pandas df, my problem is the use of selenium. Those "get_elements" seem not to work properly in airbnb.
I received the error: NoSuchWindowException: Message: no such window: target window already closed from unknown error: web view not found
Which line is raising this error? I don't see anything in your snippets that could be causing it, but is there anything in your code [before the included snippet], or some external factor that could be causing the automated window to get closed? You could see if any of the answers to this helps you with the issue, especially if you're using .switch_to.window anywhere in your code.
Searching
(You should include screenshots or better descriptions of the fields you are targeting, especially when the issue is that you're having difficulty targeting them.)
Secondly, I do the other stuffs:
driver.get('https://www.airbnb.com.br/')
a = driver.find_element(By.CLASS_NAME, "cqtsvk7 dir dir-ltr")
want that selenium search for me the country where I want to extract the data (Poland, in this case)
If you mean that you're trying to enter "Poland" into this input field, then the class cqtsvk7 in cqtsvk7 dir dir-ltr appears to change. The id attribute might be more reliable; but also, it seems like you need to click on the search area to make the input interactable; and after entering "Poland" you also have to click on the search icon and wait to load the results.
# from selenium.webdriver.support.ui import WebDriverWait
def search_airbnb(search_for, browsr, wait_timeout=5):
wait_til = WebDriverWait(browsr, wait_timeout).until
browsr.get('https://www.airbnb.com.br/')
wait_til(EC.element_to_be_clickable(
(By.CSS_SELECTOR, 'div[data-testid="little-search"]')))
search_area = browsr.find_element(
By.CSS_SELECTOR, 'div[data-testid="little-search"]')
search_area.click()
print('CLICKED search_area')
wait_til(EC.visibility_of_all_elements_located(
(By.ID, "bigsearch-query-location-input")))
a = browsr.find_element(By.ID, "bigsearch-query-location-input")
a.send_keys(search_for)
print(f'ENTERED "{search_for}"')
wait_til(EC.element_to_be_clickable((By.CSS_SELECTOR,
'button[data-testid="structured-search-input-search-button"]')))
search_btn = browsr.find_element(By.CSS_SELECTOR,
'button[data-testid="structured-search-input-search-button"]')
search_btn.click()
print('CLICKED search_btn')
searchFor = 'Poland'
search_airbnb(searchFor, driver) # , 15) # adjust wait_timeout if necessary
Notice that for the clicked elements, I used By.CSS_SELECTOR; if unfamiliar with CSS selectors, you can consult this reference. You can also use By.XPATH in these cases; this XPath cheatsheet might help then.
Scraping Results
How can I get the place, city and price of all of my query of place to travel in airbnb?
Again, you can use CSS selectors [or XPaths] as they're quite versatile. If you use a function like
def select_get(elem, sel='', tAttr='innerText', defaultVal=None, isv=False):
try:
el = elem.find_element(By.CSS_SELECTOR, sel) if sel else elem
rVal = el.get_attribute(tAttr)
if isinstance(rVal, str): rVal = rVal.strip()
return defaultVal if rVal is None else rVal
except Exception as e:
if isv: print(f'failed to get "{tAttr}" from "{sel}"\n', type(e), e)
return defaultVal
then even if a certain element or attribute is missing in any of the cards, it'll just fill in with defaultVal and all the other cards will still be scraped instead of raising an error and crashing the whole program.
You can get a list of dictionaries in listings by looping through the result cards with list comprehension like
listings = [{
'name': select_get(el, 'meta[itemprop="name"]', 'content'), # SAME TEXT AS
# 'title_sub': select_get(el, 'div[id^="title_"]+div+div>span'),
'city_title': select_get(el, 'div[id^="title_"]'),
'beds': select_get(el, 'div[id^="title_"]+div+div+div>span'),
'dates': select_get(el, 'div[id^="title_"]+div+div+div+div>span'),
'price': select_get(el, 'div[id^="title_"]+div+div+div+div+div div+span'),
'rating': select_get(el, 'div[id^="title_"]~span[aria-label]', 'aria-label')
# 'url': select_get(el, 'meta[itemprop="url"]', 'content', defaultVal='').split('?')[0],
} for el in driver.find_elements(
By.CSS_SELECTOR, 'div[itemprop="itemListElement"]' ## RESULT CARD SELECTOR
)]
Dealing with Pagination
If you wanted to scrape from multiple pages, you can loop through them. [You can also use while True (instead of a for loop as below) for unlimited pages, but I feel like it's safer like this, even if you set an absurdly high limit like maxPages=5000 or something; either way, it should break out of the loop once it rreaches the last page.]
maxPages = 50 # adjust as preferred
wait = WebDriverWait(browsr, 3) # adjust timeout as necessary
listings, addedIds = [], []
isFirstPage = True
for pgi in range(maxPages):
prevLen = len(listings) # just for printing progress
## wait to load all the cards ##
try:
wait.until(EC.visibility_of_all_elements_located(
(By.CSS_SELECTOR, 'div[itemprop="itemListElement"]')))
except Exception as e:
print(f'[{pgi}] Failed to load listings', type(e), e)
continue # losing one loop for additional wait time
## check current page number according to driver ##
try:
pgNum = driver.find_element(
By.CSS_SELECTOR, 'button[aria-current="page"]'
).text.strip() if not isFirstPage else '1'
except Exception as e:
print('Failed to find pgNum', type(e), e)
pgNum = f'?{pgi+1}?'
## collect listings ##
pgListings = [{
'listing_id': select_get(
el, 'div[role="group"]>a[target^="listing_"]', 'target',
defaultVal='').replace('listing_', '', 1).strip(),
# 'position': 'pg_' + str(pgNum) + '-pos_' + select_get(
# el, 'meta[itemprop="position"]', 'content', defaultVal=''),
'name': select_get(el, 'meta[itemprop="name"]', 'content'),
#####################################################
### INCLUDE ALL THE key-value pairs THAT YOU WANT ###
#####################################################
} for el in driver.find_elements(
By.CSS_SELECTOR, 'div[itemprop="itemListElement"]'
)]
## [ only checks for duplicates against listings frm previous pages ] ##
listings += [pgl for pgl in pgListings if pgl['listing_id'] not in addedIds]
addedIds += [l['listing_id'] for l in pgListings]
### [OR] check for duplicates within the same page as well ###
## for pgl in pgListings:
## if pgl['listing_id'] not in addedIds:
## listings.append(pgl)
## addedIds.append(addedIds)
print(f'[{pgi}] extracted', len(listings)-prevLen,
f'listings [of {len(pgListings)} total] from page', pgNum)
## got to next page ##
nxtPg = driver.find_elements(By.CSS_SELECTOR, 'a[aria-label="Próximo"]')
if not nxtPg:
print(f'No more next page [{len(listings)} listings so far]\n')
break ### [OR] START AGAIN FROM page1 WITH:
## try: _, isFirstPage = search_airbnb(searchFor, driver), True
## except Exception as e: print('Failed to search again', type(e), e)
## continue
### bc airbnb doesn't show all results even across all pages
### so you can get a few more every re-scrape [but not many - less than 5 per page]
try: _, isFirstPage = nxtPg[0].click(), False
except Exception as e: print('Failed to click next', type(e), e)
dMsg = f'[reduced from {len(addedIds)} after removing duplicates]'
print('extracted', len(listings), 'listings with', dMsg)
[listing_id seems to be the easiest way to ensure that only unique listings are collected. You can also form a link to that listing like f'https://www.airbnb.com.br/rooms/{listing_id}'.]
Combining with Old Data [Load & Save]
If you want to save to CSV and also load previous from the same file with old and new data combined without duplicates, you can do some thing like
# import pandas as pd
# import os
fileName = 'pol_airbnb.csv'
maxPages = 50
try:
listings = pd.read_csv(fileName).to_dict('records')
addedIds = [str(l['listing_id']).strip() for l in listings]
print(f'loaded {len(listings)} previously extracted listings')
except Exception as e:
print('failed to load previous data', type(e), e)
listings, addedIds = [], []
#################################################
# for pgi... ## LOOP THROUGH PAGES AS ABOVE #####
#################################################
dMsg = f'[reduced from {len(addedIds)} after removing duplicates]'
print('extracted', len(listings), 'listings with', dMsg)
pd.DataFrame(listings).set_index('listing_id').to_csv(fileName)
print('saved to', os.path.abspath(fileName))
Note that keeping the old data might mean that some the listings are no longer available.
View pol_airbnb.csv for my results with maxPages=999 and searching again instead of break-ing in if not nxtPg.....
I am trying to make a selenium python script to collect data from each job in an indeed job search. I can easily get the data from the first and second page. The problem I am running into is while looping through the pages, the script only clicks the next page and the previous page, in that order. Going from page 1 -> 2 -> 1 -> 2 -> ect. I know it is doing this because both the next and previous button have the same class name. So when I redeclare the webelement variable when the page uploads, it hits the previous button because that is the first location of the class in the stack. I tried making it always click the next button by using the xpath, but I still run into the same errors. I would inspect the next button element, and copy the full xpath. my code is below, I am using python 3.7.9 and pip version 21.2.4
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import time
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
HTTPS = "https://"
# hard coded data to test
siteDomain = "indeed.com"
jobSearch = "Software Developer"
locationSearch = "Richmond, VA"
listOfJobs = []
def if_exists_by_id(id):
try:
driver.find_element_by_id(id)
except NoSuchElementException:
return False
return True
def if_exists_by_class_name(class_name):
try:
driver.find_element_by_class_name(class_name)
except NoSuchElementException:
return False
return True
def if_exists_by_xpath(xpath):
try:
driver.find_element_by_xpath(xpath)
except NoSuchElementException:
return False
return True
def removeSpaces(strArray):
newjobCounter = 0
jobCounter = 0
for i, word in enumerate(strArray):
jobCounter += 1
if strArray[i].__contains__("\n"):
strArray[i] = strArray[i].replace("\n", " ")
if strArray[i].__contains__("new"):
newjobCounter += 1
print(strArray[i] + "\n")
if newjobCounter == 0:
print("Unfortunately, there are no new jobs for this search")
else:
print("With " + str(newjobCounter) + " out of " + str(jobCounter) + " new jobs!")
return strArray
try:
# Goes to Site
driver.get(HTTPS + siteDomain)
# obtains access to elements from website
searchJob = driver.find_element_by_name("q")
searchLocation = driver.find_element_by_name("l")
# clear text field
searchJob.send_keys(Keys.CONTROL, "a", Keys.BACK_SPACE)
searchLocation.send_keys(Keys.CONTROL, "a", Keys.BACK_SPACE)
# inputs values into website elements
searchJob.send_keys(jobSearch)
searchLocation.send_keys(locationSearch)
# presses button to search
searchLocation.send_keys(Keys.RETURN)
# Begin looping through pages
pageList = driver.find_element_by_class_name("pagination")
page = pageList.find_elements_by_tag_name("li")
numPages = 0
for i,x in enumerate(page):
time.sleep(1)
# checks for popup, if there is popup, exit out and sleep
if if_exists_by_id("popover-x"):
driver.find_element_by_id("popover-x").click()
time.sleep(1)
# increment page counter variabke
numPages += 1
# obtains data in class name value
jobCards = driver.find_elements_by_class_name("jobCard_mainContent")
# prints number of jobs returned
print(str(len(jobCards)) + " jobs in: " + locationSearch)
# inserts each job into list of jobs array
# commented out to make debugging easier
# for jobCard in jobCards:
# listOfJobs.append(jobCard.text)
# supposed to click the next page, but keeps alternating
# between next page and previous page
driver.find_element_by_class_name("np").click()
print("On page number: " + str(numPages))
# print(removeSpaces(listOfJobs))
except ValueError:
print(ValueError)
finally:
driver.quit()
Any help will be greatly appreciated, also if I am implementing bad coding practices in the structure of the script please let me know as I am trying to learn as much as possible! :)
I have tested your code.. the thing is there are 2 'np' class elements when we go to the 2nd page.. what you can do is for first time use find_element_by_class_name('np') and for all the other time use find_elements_by_class_name('np')[1] that will select the next button.. and you can use find_elements_by_class_name('np')[0] for the previous button if needed. Here is the code!
if i == 0:
driver.find_element_by_class_name("np").click()
else:
driver.find_elements_by_class_name("np")[1].click()
Just replace the line driver.find_element_by_class_name("np").click() with the code snippet above.. I have tested it and it worked like a charm.
Also i am not as experienced as the other devs here.. But i am glad if i could help you. (This is my first answer ever on stackoverflow)
Working on the python Bot with selenium, and infinite scrolling in dialog box isn't working due to "None" return from the "arguments[0].scrollHeight"
dialogBx=driver.find_element_by_xpath("//div[#role='dialog']/div[2]")
print(dialogBx) #<selenium.webdriver.remote.webelement.WebElement (session="fcec89cc11fa5fa5eaf29a8efa9989f9", element="31bfd470-de78-XXXX-XXXX-ac1ffa6224c4")>
print(type(dialogBx)) #<class 'selenium.webdriver.remote.webelement.WebElement'>
sleep(5)
last_height=driver.execute_script("arguments[0].scrollHeight",dialogBx);
print("Height : ",last_height) #None
I needed last height to compare, please suggest solution.
Ok, to answer your question, since you are inside a dialog we should focus on it. When you execute : last_height=driver.execute_script("arguments[0].scrollHeight",dialogBx); I believe you are executing that in the main page or in a wrong div (not 100% sure). Either way I took a diferente approach, we are going to select the last <li> item currently available in the dialog and scroll down to its position, this will force the dialog to update. I will extract a code from the full code you will see below:
last_li_item = driver.find_element_by_xpath('/html/body/div[4]/div/div[2]/ul/div/li[{p}]'.format(p=start_pos))
last_li_item.location_once_scrolled_into_view
We first select the last list item and then the property location_once_scrolled_into_view. This property will scroll our dialog down to our last item and then it will load more items. start_pos is just the position in the list of <li> element we have available. ie.: <div><li></li><li></li><li></li></div> start_pos=2 which is the last li item starting from 0. I put this variable name because it is inside a for loop which is watching the changes of li items inside the div, you will get it once you see the full code.
In other hand to execute this,simply change the parameters at the top and execute the test function test(). If you are already log in to instagram you can just run get_list_of_followers().
Note: Using this function use a Follower class that is also in this code. You can remove if you wish but you will need to modify the function.
IMPORTANT:
When you execute this program, the dialog box items will be increasing until there is no more items to load, so a TODO would be remove the element you have already processed otherwise I believe performace will get slower when you start hitting big numbers!
Let me know if you need any other explanation. Now the code:
import time
from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
# instagram url as our base
base_url = "https://www.instagram.com"
# =====================MODIFY THESE TO YOUR NEED=========
# the user we wish to get the followers from
base_user = "/nasa/"
# how much do you wish to sleep to wait for loading (seconds)
sleep_time = 3
# True will attempt login with facebook, False with instagram
login_with_facebook = True
# Credentials here
username = "YOUR_USERNAME"
password = "YOUR_PASSWORD"
# How many users do you wish to retrieve? -1 = all or n>0
get_users = 10
#==========================================================
# This is the div that contains all the followers info not the dialog box itself
dialog_box_xpath = '/html/body/div[4]/div/div[2]/ul/div'
total_followers_xpath = '/html/body/div[1]/section/main/div/header/section/ul/li[2]/a/span'
followers_button_xpath = '/html/body/div[1]/section/main/div/header/section/ul/li[2]/a'
insta_username_xpath = '/html/body/div[5]/div/div[2]/div[2]/div/div/div[1]/div/form/div[2]/div/label/input'
insta_pwd_xpath = '/html/body/div[5]/div/div[2]/div[2]/div/div/div[1]/div/form/div[3]/div/label/input'
insta_login_button_xpath = '/html/body/div[5]/div/div[2]/div[2]/div/div/div[1]/div/form/div[4]/button'
insta_fb_login_button_xpath = '/html/body/div[5]/div/div[2]/div[2]/div/div/div[1]/div/form/div[6]/button'
fb_username_xpath = '/html/body/div[1]/div[3]/div[1]/div/div/div[2]/div[1]/form/div/div[1]/input'
fb_pwd_xpath = '/html/body/div[1]/div[3]/div[1]/div/div/div[2]/div[1]/form/div/div[2]/input'
fb_login_button_xpath = '/html/body/div[1]/div[3]/div[1]/div/div/div[2]/div[1]/form/div/div[3]/button'
u_path = fb_username_xpath if login_with_facebook else insta_username_xpath
p_path = fb_pwd_xpath if login_with_facebook else insta_pwd_xpath
lb_path = fb_login_button_xpath if login_with_facebook else insta_login_button_xpath
# Simple class of a follower, you dont actually need this but for explanation is ok.
class Follower:
def __init__(self, user_name, href):
self.username = user_name
self.href = href
#property
def get_username(self):
return self.username
#property
def get_href(self):
return self.href
def __repr__(self):
return self.username
def test():
base_user_path = base_url + base_user
driver = webdriver.Chrome()
driver.get(base_user_path)
# click the followers button and will ask for login
driver.find_element_by_xpath(followers_button_xpath).click()
time.sleep(sleep_time)
# now we decide if we will login with facebook or instagram
if login_with_facebook:
driver.find_element_by_xpath(insta_fb_login_button_xpath).click()
time.sleep(sleep_time)
username_input = driver.find_element_by_xpath(u_path)
username_input.send_keys(username)
password_input = driver.find_element_by_xpath(p_path)
password_input.send_keys(password)
driver.find_element_by_xpath(lb_path).click()
# We need to wait a little longer for the page to load so. Feel free to change this to your needs.
time.sleep(10)
# click the followers button again
driver.find_element_by_xpath(followers_button_xpath).click()
time.sleep(sleep_time)
# now we get the list of followers from the dialog box. This function will return a list of follower objects.
followers: list[Follower] = get_list_of_followers(driver, dialog_box_xpath, get_users)
# close the driver we do not need it anymore.
driver.close()
for follower in followers:
print(follower, follower.get_href)
def get_list_of_followers(driver, d_xpath=dialog_box_xpath, get_items=10):
"""
Get a list of followers from instagram
:param driver: driver instance
:param d_xpath: dialog box xpath. By default it gets the global parameter but you can change it
:param get_items: how many items do you wish to obtain? -1 = Try to get all of them. Any positive number will be
= the number of followers to obtain
:return: list of follower objects
"""
# getting the dialog content element
dialog_box: WebElement = driver.find_element_by_xpath(d_xpath)
# getting all the list items (<li></li>) inside the dialog box.
dialog_content: list[WebElement] = dialog_box.find_elements_by_tag_name("li")
# Get the total number of followers. since we get a string we need to convert to int by int(<str>)
total_followers = int(driver.find_element_by_xpath('/html/body/div[1]/section/main/div/header/section/ul/li['
'2]/a/span').get_attribute("title").replace(".",""))
# how many items we have without scrolling down?
li_items = len(dialog_content)
# We are trying to get n elements (n=get_items variable). Now we need to check if there are enough followers to
# retrieve from if not we will get the max quantity of following. This applies only if n is >=0. If -1 then the
# total amount of followers is n
if get_items == -1:
get_items = total_followers
elif -1 < get_items <= total_followers:
# no need to change anything, git is ok to work with get_items
pass
else:
# if it not -1 and not between 0 and total followers then we raise an error
raise IndexError
# You can start from greater than 0 but that will give you a shorter list of followers than what you wish if
# there is not enough followers available. i.e: total_followers = 10, get_items=10, start_from=1. This will only
# return 9 followers not 10 even if get_items is 10.
return generate_followers(0, get_items, total_followers, dialog_box, driver)
def generate_followers(start_pos, get_items, total_followers, dialog_box_element: WebElement, driver):
"""
Generate followers based on the parameters
:param start_pos: index of where to start getting the followers from
:param get_items: total items to get
:param total_followers = total number of followers
:param dialog_box_element: dialog box to get the list items count
:param driver: driver object
:return: followers list
"""
if -1 < start_pos < total_followers:
# we want to count items from our current position until the last element available without scrolling. We do
# it this way so when we scroll down, the list items will be greater but we will start generating followers
# from our last current position not from the beginning!
first = dialog_box_element.find_element_by_xpath("./li[{pos}]".format(pos=start_pos+1))
li_items = dialog_box_element.find_elements_by_xpath("./li[position()={pos}][last("
")]/following-sibling::li"
.format(pos=(start_pos + 1)))
li_items.insert(0, first)
print("Generating followers from position position: {pos} with {li_count} list items"
.format(pos=(start_pos+1), li_count=len(li_items)))
followers = []
for i in range(len(li_items)):
anchors = li_items[i].find_elements_by_tag_name("a")
anchor = anchors[0] if len(anchors) ==1 else anchors[1]
follower = Follower(anchor.text, anchor.get_attribute(
"href"))
followers.append(follower)
get_items -= 1
start_pos += 1
print("Follower {f} added to the list".format(f=follower))
# we break the loop if our starting position is greater than 0 or if get_items has reached 0 (means if we
# request 10 items we got them all no need to continue
if start_pos >= total_followers or get_items == 0:
print("finished")
return followers
print("finished loop, executing scroll down...")
last_li_item = driver.find_element_by_xpath('/html/body/div[4]/div/div[2]/ul/div/li[{p}]'.format(p=start_pos))
last_li_item.location_once_scrolled_into_view
time.sleep(sleep_time)
followers.extend(generate_followers(start_pos, get_items, total_followers, dialog_box_element, driver))
return followers
else:
raise IndexError
is expected the below is supposed to run without issues.
Solution to Reddit data:
import requests
import re
import praw
from datetime import date
import csv
import pandas as pd
import time
import sys
class Crawler(object):
'''
basic_url is the reddit site.
headers is for requests.get method
REX is to find submission ids.
'''
def __init__(self, subreddit="apple"):
'''
Initialize a Crawler object.
subreddit is the topic you want to parse. default is r"apple"
basic_url is the reddit site.
headers is for requests.get method
REX is to find submission ids.
submission_ids save all the ids of submission you will parse.
reddit is an object created using praw API. Please check it before you use.
'''
self.basic_url = "https://www.reddit.com"
self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
self.REX = re.compile(r"<div class=\" thing id-t3_[\w]+")
self.subreddit = subreddit
self.submission_ids = []
self.reddit = praw.Reddit(client_id="your_id", client_secret="your_secret", user_agent="subreddit_comments_crawler")
def get_submission_ids(self, pages=2):
'''
Collect all ids of submissions..
One page has 25 submissions.
page url: https://www.reddit.com/r/subreddit/?count25&after=t3_id
id(after) is the last submission from last page.
'''
# This is page url.
url = self.basic_url + "/r/" + self.subreddit
if pages <= 0:
return []
text = requests.get(url, headers=self.headers).text
ids = self.REX.findall(text)
ids = list(map(lambda x: x[-6:], ids))
if pages == 1:
self.submission_ids = ids
return ids
count = 0
after = ids[-1]
for i in range(1, pages):
count += 25
temp_url = self.basic_url + "/r/" + self.subreddit + "?count=" + str(count) + "&after=t3_" + ids[-1]
text = requests.get(temp_url, headers=self.headers).text
temp_list = self.REX.findall(text)
temp_list = list(map(lambda x: x[-6:], temp_list))
ids += temp_list
if count % 100 == 0:
time.sleep(60)
self.submission_ids = ids
return ids
def get_comments(self, submission):
'''
Submission is an object created using praw API.
'''
# Remove all "more comments".
submission.comments.replace_more(limit=None)
comments = []
for each in submission.comments.list():
try:
comments.append((each.id, each.link_id[3:], each.author.name, date.fromtimestamp(each.created_utc).isoformat(), each.score, each.body) )
except AttributeError as e: # Some comments are deleted, we cannot access them.
# print(each.link_id, e)
continue
return comments
def save_comments_submissions(self, pages):
'''
1. Save all the ids of submissions.
2. For each submission, save information of this submission. (submission_id, #comments, score, subreddit, date, title, body_text)
3. Save comments in this submission. (comment_id, submission_id, author, date, score, body_text)
4. Separately, save them to two csv file.
Note: You can link them with submission_id.
Warning: According to the rule of Reddit API, the get action should not be too frequent. Safely, use the defalut time span in this crawler.
'''
print("Start to collect all submission ids...")
self.get_submission_ids(pages)
print("Start to collect comments...This may cost a long time depending on # of pages.")
submission_url = self.basic_url + "/r/" + self.subreddit + "/comments/"
comments = []
submissions = []
count = 0
for idx in self.submission_ids:
temp_url = submission_url + idx
submission = self.reddit.submission(url=temp_url)
submissions.append((submission.name[3:], submission.num_comments, submission.score, submission.subreddit_name_prefixed, date.fromtimestamp(submission.created_utc).isoformat(), submission.title, submission.selftext))
temp_comments = self.get_comments(submission)
comments += temp_comments
count += 1
print(str(count) + " submissions have got...")
if count % 50 == 0:
time.sleep(60)
comments_fieldnames = ["comment_id", "submission_id", "author_name", "post_time", "comment_score", "text"]
df_comments = pd.DataFrame(comments, columns=comments_fieldnames)
df_comments.to_csv("comments.csv")
submissions_fieldnames = ["submission_id", "num_of_comments", "submission_score", "submission_subreddit", "post_date", "submission_title", "text"]
df_submission = pd.DataFrame(submissions, columns=submissions_fieldnames)
df_submission.to_csv("submissions.csv")
return df_comments
if __name__ == "__main__":
args = sys.argv[1:]
if len(args) != 2:
print("Wrong number of args...")
exit()
subreddit, pages = args
c = Crawler(subreddit)
c.save_comments_submissions(int(pages))
but I got:
(base) UserAir:scrape_reddit user$ python reddit_crawler.py apple 2
Start to collect all submission ids...
Traceback (most recent call last):
File "reddit_crawler.py", line 127, in
c.save_comments_submissions(int(pages))
File "reddit_crawler.py", line 94, in save_comments_submissions
self.get_submission_ids(pages)
File "reddit_crawler.py", line 54, in get_submission_ids
after = ids[-1]
IndexError: list index out of range
Erik's answer diagnoses the specific cause of this error, but more broadly I think this is caused by you not using PRAW to its fullest potential. Your script imports requests and performs a lot of manual requests that PRAW has methods for already. The whole point of PRAW is to prevent you from having to write these requests that do things such as paginate a listing, so I recommend you take advantage of that.
As an example, your get_submission_ids function (which scrapes the web version of Reddit and handles paginating) could be replaced by just
def get_submission_ids(self, pages=2):
return [
submission.id
for submission in self.reddit.subreddit(self.subreddit).hot(
limit=25 * pages
)
]
because the .hot() function does everything you tried to do by hand.
I'm going to go one step further here and have the function just return a list of Submission objects, because the rest of your code ends up doing things that would better by done by interacting with the PRAW Submission object. Here's that code (I renamed the function to reflect its updated purpose):
def get_submissions(self, pages=2):
return list(self.reddit.subreddit(self.subreddit).hot(limit=25 * pages))
(I've updated this function to just return its result, as your version both returns the value and sets it as self.submission_ids, unless pages is 0. That felt quite inconsistent, so I made it just return the value.)
Your get_comments function looks good.
The save_comments_submissions function, like get_submission_ids, does a lot of manual work that PRAW can handle. You construct a temp_url that has the full URL of a post, and then use that to make a PRAW Submission object, but we can replace that with directly using the one returned by get_submissions. You also have some calls to time.sleep() which I removed because PRAW will automatically sleep the appropriate amount for you. Lastly, I removed the return value of this function because the point of the function is to save data to disk, not to return it to anywhere else, and the rest of your script doesn't use the return value. Here's the updated version of that function:
def save_comments_submissions(self, pages):
"""
1. Save all the ids of submissions.
2. For each submission, save information of this submission. (submission_id, #comments, score, subreddit, date, title, body_text)
3. Save comments in this submission. (comment_id, submission_id, author, date, score, body_text)
4. Separately, save them to two csv file.
Note: You can link them with submission_id.
Warning: According to the rule of Reddit API, the get action should not be too frequent. Safely, use the defalut time span in this crawler.
"""
print("Start to collect all submission ids...")
submissions = self.get_submissions(pages)
print(
"Start to collect comments...This may cost a long time depending on # of pages."
)
comments = []
pandas_submissions = []
for count, submission in enumerate(submissions):
pandas_submissions.append(
(
submission.name[3:],
submission.num_comments,
submission.score,
submission.subreddit_name_prefixed,
date.fromtimestamp(submission.created_utc).isoformat(),
submission.title,
submission.selftext,
)
)
temp_comments = self.get_comments(submission)
comments += temp_comments
print(str(count) + " submissions have got...")
comments_fieldnames = [
"comment_id",
"submission_id",
"author_name",
"post_time",
"comment_score",
"text",
]
df_comments = pd.DataFrame(comments, columns=comments_fieldnames)
df_comments.to_csv("comments.csv")
submissions_fieldnames = [
"submission_id",
"num_of_comments",
"submission_score",
"submission_subreddit",
"post_date",
"submission_title",
"text",
]
df_submission = pd.DataFrame(pandas_submissions, columns=submissions_fieldnames)
df_submission.to_csv("submissions.csv")
Here's an updated version of the whole script that uses PRAW fully:
from datetime import date
import sys
import pandas as pd
import praw
class Crawler:
"""
basic_url is the reddit site.
headers is for requests.get method
REX is to find submission ids.
"""
def __init__(self, subreddit="apple"):
"""
Initialize a Crawler object.
subreddit is the topic you want to parse. default is r"apple"
basic_url is the reddit site.
headers is for requests.get method
REX is to find submission ids.
submission_ids save all the ids of submission you will parse.
reddit is an object created using praw API. Please check it before you use.
"""
self.subreddit = subreddit
self.submission_ids = []
self.reddit = praw.Reddit(
client_id="your_id",
client_secret="your_secret",
user_agent="subreddit_comments_crawler",
)
def get_submissions(self, pages=2):
"""
Collect all submissions..
One page has 25 submissions.
page url: https://www.reddit.com/r/subreddit/?count25&after=t3_id
id(after) is the last submission from last page.
"""
return list(self.reddit.subreddit(self.subreddit).hot(limit=25 * pages))
def get_comments(self, submission):
"""
Submission is an object created using praw API.
"""
# Remove all "more comments".
submission.comments.replace_more(limit=None)
comments = []
for each in submission.comments.list():
try:
comments.append(
(
each.id,
each.link_id[3:],
each.author.name,
date.fromtimestamp(each.created_utc).isoformat(),
each.score,
each.body,
)
)
except AttributeError as e: # Some comments are deleted, we cannot access them.
# print(each.link_id, e)
continue
return comments
def save_comments_submissions(self, pages):
"""
1. Save all the ids of submissions.
2. For each submission, save information of this submission. (submission_id, #comments, score, subreddit, date, title, body_text)
3. Save comments in this submission. (comment_id, submission_id, author, date, score, body_text)
4. Separately, save them to two csv file.
Note: You can link them with submission_id.
Warning: According to the rule of Reddit API, the get action should not be too frequent. Safely, use the defalut time span in this crawler.
"""
print("Start to collect all submission ids...")
submissions = self.get_submissions(pages)
print(
"Start to collect comments...This may cost a long time depending on # of pages."
)
comments = []
pandas_submissions = []
for count, submission in enumerate(submissions):
pandas_submissions.append(
(
submission.name[3:],
submission.num_comments,
submission.score,
submission.subreddit_name_prefixed,
date.fromtimestamp(submission.created_utc).isoformat(),
submission.title,
submission.selftext,
)
)
temp_comments = self.get_comments(submission)
comments += temp_comments
print(str(count) + " submissions have got...")
comments_fieldnames = [
"comment_id",
"submission_id",
"author_name",
"post_time",
"comment_score",
"text",
]
df_comments = pd.DataFrame(comments, columns=comments_fieldnames)
df_comments.to_csv("comments.csv")
submissions_fieldnames = [
"submission_id",
"num_of_comments",
"submission_score",
"submission_subreddit",
"post_date",
"submission_title",
"text",
]
df_submission = pd.DataFrame(pandas_submissions, columns=submissions_fieldnames)
df_submission.to_csv("submissions.csv")
if __name__ == "__main__":
args = sys.argv[1:]
if len(args) != 2:
print("Wrong number of args...")
exit()
subreddit, pages = args
c = Crawler(subreddit)
c.save_comments_submissions(int(pages))
I realize that my answer here gets into Code Review territory, but I hope that this answer is helpful for understanding some of the things PRAW can do. Your "list index out of range" error would have been avoided by using the pre-existing library code, so I do consider this to be a solution to your problem.
When my_list[-1] throws an IndexError, it means that my_list is empty:
>>> ids = []
>>> ids[-1]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: list index out of range
>>> ids = ['1']
>>> ids[-1]
'1'
I am trying to use a set in order to stop users being re printed in the following code. I managed to get python to accept he code without producing any bugs, but if I let the code run on a 10 second loop it continues to print the users who should have already been logged. This is my first attempt at using a set, and I am a complete novice at python (building it all so far based on examples I have seen and reverse engineering them.)
Below is an example of the code I am using
import mechanize
import urllib
import json
import re
import random
import datetime
from sched import scheduler
from time import time, sleep
######Code to loop the script and set up scheduling time
s = scheduler(time, sleep)
random.seed()
def run_periodically(start, end, interval, func):
event_time = start
while event_time < end:
s.enterabs(event_time, 0, func, ())
event_time += interval + random.randrange(-5, 45)
s.run()
###### Code to get the data required from the URL desired
def getData():
post_url = "URL OF INTEREST"
browser = mechanize.Browser()
browser.set_handle_robots(False)
browser.addheaders = [('User-agent', 'Firefox')]
######These are the parameters you've got from checking with the aforementioned tools
parameters = {'page' : '1',
'rp' : '250',
'sortname' : 'roi',
'sortorder' : 'desc'
}
#####Encode the parameters
data = urllib.urlencode(parameters)
trans_array = browser.open(post_url,data).read().decode('UTF-8')
xmlload1 = json.loads(trans_array)
pattern1 = re.compile('> (.*)<')
pattern2 = re.compile('/control/profile/view/(.*)\' title=')
pattern3 = re.compile('<span style=\'font-size:12px;\'>(.*)<\/span>')
##### Making the code identify each row, removing the need to numerically quantify the number of rows in the xmlfile,
##### thus making number of rows dynamic (change as the list grows, required for looping function to work un interupted)
for row in xmlload1['rows']:
cell = row["cell"]
##### defining the Keys (key is the area from which data is pulled in the XML) for use in the pattern finding/regex
user_delimiter = cell['username']
selection_delimiter = cell['race_horse']
if strikeratecalc2 < 12 : continue;
##### REMAINDER OF THE REGEX DELMITATIONS
username_delimiter_results = re.findall(pattern1, user_delimiter)[0]
userid_delimiter_results = (re.findall(pattern2, user_delimiter)[0])
user_selection = re.findall(pattern3, selection_delimiter)[0]
##### Code to stop duplicate posts of each user throughout the day
userset = set ([])
if userid_delimiter_results in userset: continue;
##### Printing the results of the code at hand
print "user id = ",userid_delimiter_results
print "username = ",username_delimiter_results
print "user selection = ",user_selection
print ""
##### Code to stop duplicate posts of each user throughout the day part 2 (udating set to add users already printed to the ignore list)
userset.update(userid_delimiter_results)
getData()
run_periodically(time()+5, time()+1000000, 300, getData)
Any comments will be greatly appreciated, this may seem common sense to you seasoned coders, but I really am just getting past "Hello world"
Kind regards AEA
This:
userset.update(userid_delimiter_results)
Should probably be this:
userset.add(userid_delimiter_results)
To prove it, try printing the contents of userset after each call.