I wrote a script that gets all of the products on an online shopping website.
The website loads more items as you scroll down, as is usual with infinite scroll, so I can't get enough products from the single page that requests fetches.
How can I get as many products as I want?
Here is my current code:
from bs4 import BeautifulSoup
import requests
url = "https://www.trendyol.com/erkek-t-shirt-x-g2-c73"
html_text = requests.get(url).text
main_soup = BeautifulSoup(html_text, 'lxml')
all_items = main_soup.find_all('div', class_="p-card-wrppr with-campaign-view")
for item in all_items:
    title = item.find('div', class_="prdct-desc-cntnr-ttl-w two-line-text").text
    print(title)
    print()
The data is loaded via JavaScript through their REST API, so you can make the same request yourself to obtain the information:
import requests
api_url = "https://public.trendyol.com/discovery-web-searchgw-service/v2/api/infinite-scroll/erkek-t-shirt-x-g2-c73"
params = {
    "pi": "1",
    "culture": "tr-TR",
    "userGenderId": "1",
    "pId": "0",
    "scoringAlgorithmId": "2",
    "categoryRelevancyEnabled": "false",
    "isLegalRequirementConfirmed": "false",
    "searchStrategyType": "DEFAULT",
    "productStampType": "A",
    "fixSlotProductAdsIncluded": "false",
    "searchAbDeciderValues": "",
}
page = 1
while True:
    params['pi'] = page
    data = requests.get(api_url, params=params).json()

    if not data.get('result', {}).get('products'):
        break

    for p in data['result']['products']:
        print('{:<15} {}'.format(p['id'], p['name']))
    print()

    page += 1
Prints:
311550273 Erkek Nem Emici Hızlı Kuruma Atletik Teknik Performans T-shirt
114225365 Erkek Sarı Mikro Polyester Performans Antrenman Sporcu Tişört
35907408 Dry Park VII BV6708-010 Erkek Tişört
101771784 Erkek Siyah-Beyaz-Antrasit 3'lü Bisiklet Yaka Düz T-Shirt E001010
39501961 Erkek Siyah Pis Yaka Salaş T-shirt
311550264 Erkek Nem Emici Hızlı Kuruma Atletik Teknik Performans T-shirt
95401172 Erkek Koyu Lacivert Pike Kısa Kollu Basic Tişört
339923963 Fit NBA Brooklyn Nets Regular Fit Bisiklet Yaka Tişört
62247632 Ua Big Logo Ss - 1329583-600
101770796 Erkek Siyah 2'li Bisiklet Yaka %100 Pamuk Basic T-Shirt E001011
270985631 Erkek Beyaz Pis Yaka Salaş T-shirt
96531669 Logo Baskılı Kırmızı Tişört Slim Fit / Dar Kesim 065781-33099
101771002 Erkek Beyaz 2'li Bisiklet Yaka %100 Pamuk Basic T-Shirt E001012
382858755 Sw Tshirt | Bordo
336669786 Fit NBA Brooklyn Nets Oversize Fit Kapüşonlu Tişört
339562039 Fit NBA Golden State Warriors Boxy Fit Bisiklet Yaka Tişört
347257450 Oversize Dragon Vs Phoenix Unisex T-shirt
443287890 Oversize Bisiklet Yaka Les Benjamınsbakılı Tshirt
340645587 TULIO GRİ T-SHIRT S/S TEE
348227065 Fenerbahçe Sk Unisex Mavi Futbol Tişört 77313601
443316559 Oversize Bisiklet Yaka Les Benjamınsbakılı Tshirt
293975462 Chaos Karma Baskı Oversize Siyah Unisex Tshirt
249703589 Unisex Chicago Özel Baskılı Oversize Penye T-shirt Tişört
301833907 Unisex Distraction Siyah Tshirt
302756840 Unisex First Class Beyaz Tshirt
382195053 Wreck The World Oversize | Beige
321828215 Erkek 5'li Paket Dry Fit Siyah Lacivert Beyaz Haki Gri Atletik Nem Emici Günlük Tshirt
90315375 Logo Baskılı Siyah Tişört Slim Fit / Dar Kesim 065781-900
...and so on.
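If you only need a fixed number of products rather than every page, you can collect them into a list and stop once you have enough. A minimal sketch, reusing api_url and params from the code above (the cap of 200 is just an example):
products = []
page = 1
while len(products) < 200:  # stop once we have as many products as we want
    params['pi'] = page
    data = requests.get(api_url, params=params).json()
    batch = data.get('result', {}).get('products')
    if not batch:  # ran out of pages
        break
    products.extend(batch)
    page += 1

print(len(products))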
Right now I'm trying to scrape a table from rottentomatoes.com, but every time I run the code it just prints <a href> tags. For now, all I want are the movie titles, numbered.
This is my code so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
url = "https://www.rottentomatoes.com/top/bestofrt/"
headers = {"Accept-Language": "en-US, en;q=0.5"}
titles = []
year_released = []
def get_requests():
    try:
        result = requests.get(url=url)
        soup = BeautifulSoup(result.text, 'html.parser')
        table = soup.find('table', class_='table')
        for name in table:
            td = soup.find_all('a', class_='unstyled articleLink')
            titles.append(td)
            print(titles)
            break
    except:
        print("The result could not get fetched")
And this is my output:
[[Opening This Week, Top Box Office, Coming Soon to Theaters, Weekend Earnings, Certified Fresh Movies, On Dvd & Streaming, VUDU, Netflix Streaming, iTunes, Amazon and Amazon Prime, Top DVD & Streaming, New Releases, Coming Soon to DVD, Certified Fresh Movies, Browse All, Top Movies, Trailers, Forums,
View All
,
View All
, Top TV Shows, Certified Fresh TV, 24 Frames, All-Time Lists, Binge Guide, Comics on TV, Countdown, Critics Consensus, Five Favorite Films, Now Streaming, Parental Guidance, Red Carpet Roundup, Scorecards, Sub-Cult, Total Recall, Video Interviews, Weekend Box Office, Weekly Ketchup, What to Watch, The Zeros, View All, View All, View All,
It Happened One Night (1934),
Citizen Kane (1941),
The Wizard of Oz (1939),
Modern Times (1936),
Black Panther (2018),
Parasite (Gisaengchung) (2019),
Avengers: Endgame (2019),
Casablanca (1942),
Knives Out (2019),
Us (2019),
Toy Story 4 (2019),
Lady Bird (2017),
Mission: Impossible - Fallout (2018),
BlacKkKlansman (2018),
Get Out (2017),
The Irishman (2019),
The Godfather (1972),
Mad Max: Fury Road (2015),
Spider-Man: Into the Spider-Verse (2018),
Moonlight (2016),
Sunset Boulevard (1950),
All About Eve (1950),
The Cabinet of Dr. Caligari (Das Cabinet des Dr. Caligari) (1920),
The Philadelphia Story (1940),
Roma (2018),
Wonder Woman (2017),
A Star Is Born (2018),
Inside Out (2015),
A Quiet Place (2018),
One Night in Miami (2020),
Eighth Grade (2018),
Rebecca (1940),
Booksmart (2019),
Logan (2017),
His Girl Friday (1940),
Portrait of a Lady on Fire (Portrait de la jeune fille en feu) (2020),
Coco (2017),
Dunkirk (2017),
Star Wars: The Last Jedi (2017),
A Night at the Opera (1935),
The Shape of Water (2017),
Thor: Ragnarok (2017),
Spotlight (2015),
The Farewell (2019),
Selma (2014),
The Third Man (1949),
Rear Window (1954),
E.T. The Extra-Terrestrial (1982),
Seven Samurai (Shichinin no Samurai) (1956),
La Grande illusion (Grand Illusion) (1938),
Arrival (2016),
Singin' in the Rain (1952),
The Favourite (2018),
Double Indemnity (1944),
All Quiet on the Western Front (1930),
Snow White and the Seven Dwarfs (1937),
Marriage Story (2019),
The Big Sick (2017),
On the Waterfront (1954),
Star Wars: Episode VII - The Force Awakens (2015),
An American in Paris (1951),
The Best Years of Our Lives (1946),
Metropolis (1927),
Boyhood (2014),
Gravity (2013),
Leave No Trace (2018),
The Maltese Falcon (1941),
The Invisible Man (2020),
12 Years a Slave (2013),
Once Upon a Time In Hollywood (2019),
Argo (2012),
Soul (2020),
Ma Rainey's Black Bottom (2020),
The Kid (1921),
Manchester by the Sea (2016),
Nosferatu, a Symphony of Horror (Nosferatu, eine Symphonie des Grauens) (Nosferatu the Vampire) (1922),
The Adventures of Robin Hood (1938),
La La Land (2016),
North by Northwest (1959),
Laura (1944),
Spider-Man: Far From Home (2019),
Incredibles 2 (2018),
Zootopia (2016),
Alien (1979),
King Kong (1933),
Shadow of a Doubt (1943),
Call Me by Your Name (2018),
Psycho (1960),
1917 (2020),
L.A. Confidential (1997),
The Florida Project (2017),
War for the Planet of the Apes (2017),
Paddington 2 (2018),
A Hard Day's Night (1964),
Widows (2018),
Never Rarely Sometimes Always (2020),
Baby Driver (2017),
Spider-Man: Homecoming (2017),
The Godfather, Part II (1974),
The Battle of Algiers (La Battaglia di Algeri) (1967), View All, View All]]
Reading tables via pandas.read_html() as suggested by @F.Hoque would probably be the leaner approach, but you can also get your results with BeautifulSoup alone.
Iterate over all <tr> of the <table>, pick the information from the tags via .text / .get_text(), and store it in a structured list of dicts:
data = []

for row in soup.select('table.table tr')[1:]:
    data.append({
        'rank': row.td.text,
        # rsplit so titles that themselves contain "(...)" keep their full name
        'title': row.a.text.rsplit(' (', 1)[0].strip(),
        'releaseYear': row.a.text.rsplit(' (', 1)[1][:-1]
    })
Example
import requests
from bs4 import BeautifulSoup
url = "https://www.rottentomatoes.com/top/bestofrt/"
headers = {"Accept-Language": "en-US, en;q=0.5"}
result = requests.get(url=url)
soup = BeautifulSoup(result.text, 'html.parser')
data = []
for row in soup.select('table.table tr')[1:]:
    data.append({
        'rank': row.td.text,
        'title': row.a.text.rsplit(' (', 1)[0].strip(),
        'releaseYear': row.a.text.rsplit(' (', 1)[1][:-1]
    })
data
Output
[{'rank': '1.', 'title': 'It Happened One Night', 'releaseYear': '1934'},
{'rank': '2.', 'title': 'Citizen Kane', 'releaseYear': '1941'},
{'rank': '3.', 'title': 'The Wizard of Oz', 'releaseYear': '1939'},
{'rank': '4.', 'title': 'Modern Times', 'releaseYear': '1936'},
{'rank': '5.', 'title': 'Black Panther', 'releaseYear': '2018'},...]
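For reference, the pandas.read_html() route mentioned above would look roughly like this (a sketch, assuming the ranking table keeps the class "table" used in the selectors above; fetching the HTML with requests first avoids relying on pandas' default request):
import requests
import pandas as pd

url = "https://www.rottentomatoes.com/top/bestofrt/"
html = requests.get(url, headers={"Accept-Language": "en-US, en;q=0.5"}).text
# read_html returns one DataFrame per <table>; attrs narrows it down to class="table"
df = pd.read_html(html, attrs={"class": "table"})[0]
print(df.head())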
So I'm trying to web-scrape search results from Sportchek with BS4, specifically this link: "https://www.sportchek.ca/categories/men/footwear/basketball-shoes.html?page=1". I want to get the prices of the shoes and put them into a system to sort them, but to do that I need to get the prices first, and I cannot find a way to do it. In the HTML the class is product-price-text, but I can't glean anything off of it. At this point, getting even the price of a single shoe would be fine. I just need help scraping anything class-related with BS4, because none of it works. I've tried
print(soup.find_all("span", class_="product-price-text"))
and even that won't work, so please help.
The data is loaded dynamically via JavaScript. You can use the requests module to load it:
import json
import requests
url = "https://www.sportchek.ca/services/sportchek/search-and-promote/products?page=1&lastVisibleProductNumber=12&x1=ast-id-level-3&q1=men%3A%3Ashoes-footwear%3A%3Abasketball&preselectedCategoriesNumber=3&preselectedBrandsNumber=0&count=24"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0",
}
data = requests.get(url, headers=headers).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for p in data["products"]:
    print("{:<10} {:<10} {}".format(p["code"], p["price"], p["title"]))
Prints:
332799300 83.97 Nike Unisex KD Trey 5 VII TB Basketball Shoes - Black/White/Volt - Black
333323940 180.0 Nike Men's Air Jordan 1 Zoom Air Comfort Basketball Shoes - Black/Chile Red-white-university Gold
333107663 134.99 Nike Men's Mamba Fury Basketball Shoes - Black/Smoke Grey/White
333003748 134.99 Nike Men's Lebron Witness IV Basketball Shoes - White
333003606 104.99 Nike Men's Kyrie Flytrap III Basketball Shoes - Black/Uni Red/Bright Crimson
333003543 94.99 Nike Men's Precision III Basketball Shoes - Black/White
333107554 94.99 Nike Men's Precision IV Basketball Shoes - Black/Mtlc Gold/Dk Smoke Grey
333107404 215.0 Nike Men's LeBron XVII Low Basketball Shoes - Black/White/Multicolor
333107617 119.99 Nike Men's KD Trey 5 VIII Basketball Shoes - Black/White-aurora Green/Smoke Grey
333166326 125.98 Nike Men's KD13 Basketball Shoes - Black/White-wolf Grey
333166731 138.98 Nike Men's LeBron XVII Low Basketball Shoes - Particle Grey/White-lt Smoke Grey-black
333183810 129.99 adidas Men's D.O.N 2 Basketball Shoes - Gold/Black/Gold
333206770 111.97 Under Armour Men's Embid Basketball Shoes - Red/White
333181809 165.0 Nike Men's Air Jordan React Elevation Basketball Shoes - Black/White-lt Smoke Grey-volt
333307276 104.99 adidas Men's Harden Stepback 2 Basketball Shoes - White/Blackwhite/Black
333017256 89.99 Under Armour Men's Jet Mid Sneaker - Black/Halo Grey
332912833 134.99 Nike Men's Zoom LeBron Witness IV Running Shoes - Black/Gym Red/University Red
332799162 79.88 Under Armour Men's Curry 7 "Quiet Eye" Basketball Shoes - Black - Black
333276525 119.99 Nike Men's Kyrie Flytrap IV Basketball Shoes - Black/White-metallic Silver
333106290 145.97 Nike Men's KD13 Basketball Shoes - Black/White/Wolf Grey
333181345 144.99 Nike Men's PG 4 TB Basketball Shoes - Black/White-pure Platinum
333241817 149.99 PUMA Men's Clyde All-Pro Basketball Shoes - Puma White/Blue Atolpuma White/Blue Atol
333186052 77.97 adidas Men's Harden Stepback Basketball Shoes - Black/Gold/White
333316063 245.0 Nike Men's Air Jordan 13 Retro Basketball Shoes - White/Blackwhite/Starfish-black
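Since the goal is to sort the prices, you can do that directly on the JSON response; a small sketch reusing the data dict from the code above:
# sort the fetched products by price, cheapest first
for p in sorted(data["products"], key=lambda item: item["price"]):
    print("{:<10} {}".format(p["price"], p["title"]))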
EDIT: To extract the API URL from the category page:
import re
import json
import requests
# your URL:
url = "https://www.sportchek.ca/categories/men/footwear/basketball-shoes.html?page=1"
api_url = "https://www.sportchek.ca/services/sportchek/search-and-promote/products?page=1&x1=ast-id-level-3&q1={cat}&count=24"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0",
}
html_text = requests.get(url, headers=headers).text
cat = re.search(r"br_data\.cat_id=\'(.*?)';", html_text).group(1)
data = requests.get(api_url.format(cat=cat), headers=headers).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for p in data["products"]:
    print("{:<10} {:<10} {}".format(p["code"], p["price"], p["title"]))
Using Selenium:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
browser = webdriver.Chrome('/home/cam/Downloads/chromedriver')
url='https://www.sportchek.ca/categories/men/footwear/basketball-shoes.html?page=1'
browser.get(url)
time.sleep(10)
html = browser.page_source
soup = BeautifulSoup(html)
def get_data():
    links = soup.find_all('span', attrs={'class': "product-price-text"})
    for i in set(links):
        print(i.text)

get_data()
Output:
$245.00
$215.00
$144.99
$165.00
$129.99
$104.99
$149.99
$195.00
$180.00
$119.99
$134.99
$89.99
$94.99
$215.00
I am working on a data science project where I will sum up fantasy football points by the college the players went to (e.g., Alabama has 56 active players in the NFL, so I will go through a database and add up all of their fantasy points to compare with other schools).
I was looking at the website:
https://fantasydata.com/nfl/fantasy-football-leaders?season=2020&seasontype=1&scope=1&subscope=1&aggregatescope=1&range=3
and I was going to use Beautiful Soup to scrape the rows of players and statistics and ultimately, fantasy football points.
However, I am having trouble figuring out how to extract each player's college alma mater. To do so, I would have to:
Click each player's name
Scrape each and every profile of the hundreds of NFL players for the one line "College"
Place all of this information into its own column.
Any suggestions here?
There's no need for Selenium, or other headless, automated browsers. That's overkill.
If you take a look at your browser's network traffic, you'll notice that your browser makes a POST request to this REST API endpoint: https://fantasydata.com/NFL_FantasyStats/FantasyStats_Read
If the POST request is well-formed, the API responds with JSON, containing information about every single player. Normally, this information would be used to populate the DOM asynchronously using JavaScript. There's quite a lot of information there, but unfortunately, the college information isn't part of the JSON response. However, there is a field PlayerUrlString, which is a relative-URL to a given player's profile page, which does contain the college name. So:
Make a POST request to the API to get information about all players
For each player in the response JSON:
Visit that player's profile
Use BeautifulSoup to extract the college name from the current player's profile
Code:
def main():
    import requests
    from bs4 import BeautifulSoup

    url = "https://fantasydata.com/NFL_FantasyStats/FantasyStats_Read"

    data = {
        "sort": "FantasyPoints-desc",
        "pageSize": "50",
        "filters.season": "2020",
        "filters.seasontype": "1",
        "filters.scope": "1",
        "filters.subscope": "1",
        "filters.aggregatescope": "1",
        "filters.range": "3",
    }

    response = requests.post(url, data=data)
    response.raise_for_status()

    players = response.json()["Data"]

    for player in players:
        url = "https://fantasydata.com" + player["PlayerUrlString"]

        response = requests.get(url)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, "html.parser")
        college = soup.find("dl", {"class": "dl-horizontal"}).findAll("dd")[-1].text.strip()

        print(player["Name"] + " went to " + college)

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Patrick Mahomes went to Texas Tech
Kyler Murray went to Oklahoma
Aaron Rodgers went to California
Russell Wilson went to Wisconsin
Josh Allen went to Wyoming
Deshaun Watson went to Clemson
Ryan Tannehill went to Texas A&M
Lamar Jackson went to Louisville
Dalvin Cook went to Florida State
...
You can also edit the pageSize POST parameter in the data dictionary. The 50 corresponds to information about the first 50 players in the JSON response (according to the filters set by the other POST parameters). Changing this value will yield more or less players in the JSON response.
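For example, to pull the first 300 players in a single response, bump pageSize before posting (a sketch reusing the data dict from main() above; whether the API caps larger page sizes is an assumption you would need to verify):
url = "https://fantasydata.com/NFL_FantasyStats/FantasyStats_Read"
data["pageSize"] = "300"  # was "50"

response = requests.post(url, data=data)
response.raise_for_status()
players = response.json()["Data"]
print(len(players))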
I agree, APIs are the way to go if they are there. My second "go to" is pandas' .read_html() (which uses BeautifulSoup under the hood to parse <table> tags). Here's an alternate solution using ESPN's API to get team roster links, then using pandas to pull the table from each link. It saves you the trouble of having to iterate through each player to get the college (I wish they just had an API that returned all players; nfl.com USED to have that, but it's no longer publicly available, that I know of).
Code:
import requests
import pandas as pd

url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/athletes/101'

all_teams = []
roster_links = []
for i in range(1, 35):
    url = 'http://site.api.espn.com/apis/site/v2/sports/football/nfl/teams/{teamId}'.format(teamId=i)
    jsonData = requests.get(url).json()
    print(jsonData['team']['displayName'])

    for link in jsonData['team']['links']:
        if link['text'] == 'Roster':
            roster_links.append(link['href'])
            break

for link in roster_links:
    print(link)
    tables = pd.read_html(link)
    df = pd.concat(tables).drop('Unnamed: 0', axis=1)
    df['Jersey'] = df['Name'].str.replace("([A-Za-z.' ]+)", '')
    df['Name'] = df['Name'].str.extract("([A-Za-z.' ]+)")
    all_teams.append(df)

final_df = pd.concat(all_teams).reset_index(drop=True)
Output:
print (final_df)
Name POS Age HT WT Exp College Jersey
0 Matt Ryan QB 35 6' 4" 217 lbs 13 Boston College 2
1 Matt Schaub QB 39 6' 6" 245 lbs 17 Virginia 8
2 Todd Gurley II RB 26 6' 1" 224 lbs 6 Georgia 21
3 Brian Hill RB 25 6' 1" 219 lbs 4 Wyoming 23
4 Qadree Ollison RB 24 6' 1" 232 lbs 2 Pittsburgh 30
... .. ... ... ... .. ... ...
1772 Jonathan Owens S 25 5' 11" 210 lbs 2 Missouri Western 36
1773 Justin Reid S 23 6' 1" 203 lbs 3 Stanford 20
1774 Ka'imi Fairbairn PK 26 6' 0" 183 lbs 5 UCLA 7
1775 Bryan Anger P 32 6' 3" 205 lbs 9 California 9
1776 Jon Weeks LS 34 5' 10" 242 lbs 11 Baylor 46
[1777 rows x 8 columns]
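Once you have final_df, a per-college roll-up is a one-liner. For instance, counting rostered players per school (the fantasy points themselves would still have to be joined in from the stats source):
# players per college, most common first (column names as shown in final_df above)
print(final_df.groupby('College')['Name'].count().sort_values(ascending=False).head(10))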
I am using Selenium, and the automation part works, but the data is being saved to the CSV inaccurately. Even though I have four addresses in my f (the CSV file), it returns the data for the first address over and over again in the output CSV. Also, how can I tell Python to write just one heading for all of the columns, instead of repeating Permit, Address, Street Name, etc. every time it iterates? Please let me know if anyone needs further details.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv
driver = webdriver.Chrome("C:\Python27\Scripts\chromedriver.exe")
chrome = driver.get('https://etrakit.friscotexas.gov/Search/permit.aspx')
wait = WebDriverWait(driver, 10)
with open('C:/Users/list.csv', 'r') as f:
    addresses = f.readlines()

for address in addresses:
    driver.find_element_by_css_selector('#cplMain_txtSearchString').clear()
    driver.find_element_by_css_selector('#cplMain_txtSearchString').send_keys(address)
    driver.find_element_by_css_selector('#cplMain_btnSearch').click()
    table = wait.until(EC.visibility_of_element_located((By.ID, "ctl00_cplMain_rgSearchRslts_ctl00")))
    df = pd.read_html(table.get_attribute("outerHTML"))[0]
    with open('thematchingresults.csv', 'a') as f:
        df.to_csv(f)
The four addresses I am trying to parse for:
6525 Mountain Sky Rd
6543 Mountain Sky Rd
6561 Mountain Sky Rd
6579 Mountain Sky Rd
How the data is being fed into the CSV file:
Permit Number Address Street Name Applicant Name Contractor Name SITE_SUBDIVISION RECORDID
0 B13-2169 6525 MOUNTAIN SKY RD MOUNTAIN SKY RD SHADDOCK HOMES LTD SHADDOCK HOMES LTD PCR - SHERIDAN MAC:1306181017281473
1 L13-3451 6525 MOUNTAIN SKY RD MOUNTAIN SKY RD TDS IRRIGATION TDS IRRIGATION SHERIDAN ECON:131115094522681
2 ROW13-6260 6525 Mountain Sky Rd Mountain Sky Rd AT&T Broadband & Internet Serv Housley Group SSW:1312030140165722
Permit Number Address Street Name Applicant Name Contractor Name SITE_SUBDIVISION RECORDID
0 B13-2169 6525 MOUNTAIN SKY RD MOUNTAIN SKY RD SHADDOCK HOMES LTD SHADDOCK HOMES LTD PCR - SHERIDAN MAC:1306181017281473
1 L13-3451 6525 MOUNTAIN SKY RD MOUNTAIN SKY RD TDS IRRIGATION TDS IRRIGATION SHERIDAN ECON:131115094522681
2 ROW13-6260 6525 Mountain Sky Rd Mountain Sky Rd AT&T Broadband & Internet Serv Housley Group SSW:1312030140165722
Your code is almost working perfectly, but your wait.until() appears to be satisfied immediately, before the page has had a chance to update itself. Simply adding a short delay (here two seconds, after also adding import time) before the wait.until() worked for me, although you should investigate a more rigorous approach:
time.sleep(2)
This gave me the following CSV output file:
,Permit Number,Address,Street Name,Applicant Name,Contractor Name,SITE_SUBDIVISION,RECORDID
0,B13-2169,6525 MOUNTAIN SKY RD,MOUNTAIN SKY RD,SHADDOCK HOMES LTD,SHADDOCK HOMES LTD,PCR - SHERIDAN,MAC:1306181017281473
1,L13-3451,6525 MOUNTAIN SKY RD,MOUNTAIN SKY RD,TDS IRRIGATION,TDS IRRIGATION,SHERIDAN,ECON:131115094522681
2,ROW13-6260,6525 Mountain Sky Rd,Mountain Sky Rd,AT&T Broadband & Internet Serv,Housley Group,,SSW:1312030140165722
,Permit Number,Address,Street Name,Applicant Name,Contractor Name,SITE_SUBDIVISION,RECORDID
0,B14-0771,6543 MOUNTAIN SKY RD,MOUNTAIN SKY RD,DREES CUSTOM HOMES,DREES CUSTOM HOMES,PCR - SHERIDAN,LWE:1403121043033654
1,L14-2401,6543 MOUNTAIN SKY RD,MOUNTAIN SKY RD,DFW SITE DESIGN,DFW SITE DESIGN,SHERIDAN,ECON:140711080345627
2,ROW15-4097,6543 MOUNTAIN SKY RD,MOUNTAIN SKY RD,HOUSLEY GROUP,HOUSLEY GROUP,,TLW:1507220204411002
,Permit Number,Address,Street Name,Applicant Name,Contractor Name,SITE_SUBDIVISION,RECORDID
0,B13-2364,6561 MOUNTAIN SKY RD,MOUNTAIN SKY RD,DREES CUSTOM HOMES,DREES CUSTOM HOMES,PCR - SHERIDAN,MAC:1307030929232194
1,L14-1500,6561 MOUNTAIN SKY RD,MOUNTAIN SKY RD,DFW SITE DESIGN,DFW SITE DESIGN,SHERIDAN,ECON:140424040055127
2,P15-0073,6561 MOUNTAIN SKY RD,MOUNTAIN SKY RD,RIVERBEND/SANDLER POOLS,,SHERIDAN,HC:1502160438345148
,Permit Number,Address,Street Name,Applicant Name,Contractor Name,SITE_SUBDIVISION,RECORDID
0,B13-2809,6579 MOUNTAIN SKY RD,MOUNTAIN SKY RD,SHADDOCK HOMES LTD,SHADDOCK HOMES LTD,PCR - SHERIDAN,MAC:1308050328358768
1,B13-4096,6579 MOUNTAIN SKY RD,MOUNTAIN SKY RD,MIRAGE CUSTOM POOLS,MIRAGE CUSTOM POOLS,PCR - SHERIDAN,MAC:1312030307087756
2,L14-1640,6579 MOUNTAIN SKY RD,MOUNTAIN SKY RD,TDS IRRIGATION,TDS IRRIGATION,SHERIDAN,ECON:140506012624706
3,P14-0018,6579 MOUNTAIN SKY RD,MOUNTAIN SKY RD,MIRAGE CUSTOM POOLS,,SHERIDAN,LCR:1401130949212891
4,ROW14-3205,6579 MOUNTAIN SKY RD,MOUNTAIN SKY RD,Housley Group,Housley Group,,TLW:1406190424422330
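As for writing the heading only once: a simple option is to let pandas skip the header whenever the output file already exists, e.g. replacing the df.to_csv(f) call inside your loop with something along these lines (a sketch; the os.path check is my addition):
import os

out_path = 'thematchingresults.csv'
# write the column headings only on the first append, i.e. when the file does not exist yet
df.to_csv(out_path, mode='a', header=not os.path.exists(out_path), index=False)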
As an alternative approach, you could keep polling the table until you see new data has been loaded:
import pandas as pd
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv
driver = webdriver.Chrome(r"C:\Python27\chromedriver.exe")
chrome = driver.get('https://etrakit.friscotexas.gov/Search/permit.aspx')
wait = WebDriverWait(driver, 10)
with open('C:/Users/list.csv','r') as f:
addresses = f.readlines()
old_table_html = []
for address in addresses:
    print(address)
    driver.find_element_by_css_selector('#cplMain_txtSearchString').clear()
    driver.find_element_by_css_selector('#cplMain_txtSearchString').send_keys(address)
    driver.find_element_by_css_selector('#cplMain_btnSearch').click()

    while True:
        try:
            table = wait.until(EC.visibility_of_element_located((By.ID, "ctl00_cplMain_rgSearchRslts_ctl00")))
            table_html = table.get_attribute("outerHTML")

            if table_html != old_table_html:
                break
        except selenium.common.exceptions.StaleElementReferenceException:
            pass

    old_table_html = table_html
    df = pd.read_html(table_html)[0]

    with open('thematchingresults.csv', 'a') as f:
        df.to_csv(f)