How to fix IndexError when you have tried "everything" - python

My Python web scraper gathers a lot of data and then all of a sudden stops with an IndexError. I have tried different pages and setups, but it stops at random spots.
(Part of) my code is as follows:
numListings = int(re.findall(r'\d+', numListingsRaw)[0])
numPages = math.ceil(numListings / 100)
print(numPages)
for numb in range(1, numPages):
    pageSoup = make_soup("https://url" + str(numb) + "&pmax=5000&srt=df-a")
    containers = pageSoup.findAll("li", {"class": "occasion popup_click_event aec_popup_click"})
    for container in containers:
        ID = container.a["data-id"]
        titel = container["data-vrnt"].replace(",", "|")
        URL = container.a["href"]
        merk = container["data-mrk"]
        soort = container["data-mdl"]
        prijs = container.find("div", {"class": "occ_price"}).text.strip()
        ## Bouwjaar en km
        bouwjaarKM = container.span.text.strip().split(", ")
        bouwjaarRaw = bouwjaarKM[0].split(": ")
        bouwjaar = bouwjaarRaw[1]
        km_int = int(''.join(filter(str.isdigit, bouwjaarKM[1])))
        km = str(km_int)
        rest = container.find("div", {"class": "occ_extrainfo"}).text.strip()
        rest_split = rest.split(", ")
        brandstof = rest_split[0]
        inhoud = rest_split[1]
        vermogen = rest_split[2]
        transmissie = rest_split[3]
        carroserie = rest_split[4]
        kleur = rest_split[5]
This is the exact error message:
Traceback (most recent call last):
  File "Webscraper_multi2.py", line 62, in <module>
    inhoud = rest_split[1]
IndexError: list index out of range
I know it has something to do with the for loop, but I cannot get my head around it.
Your help is much appreciated.
Thanks in advance,
Tom

Check the length before trying to access an index that requires it:
rest = container.find("div", {"class": "occ_extrainfo"}).text.strip()
rest_split = rest.split(", ")
if len(rest_split) >= 6:
    brandstof = rest_split[0]
    inhoud = rest_split[1]
    vermogen = rest_split[2]
    transmissie = rest_split[3]
    carroserie = rest_split[4]
    kleur = rest_split[5]
If you know that your split list is exactly the length you want (if len(rest_split) == 6:), you can unpack the list in a single line:
brandstof, inhoud, vermogen, transmissie, carroserie, kleur = rest_split
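As an aside not taken from the answers above: if the list may carry extra trailing fields, Python 3's starred unpacking still requires at least the six named elements but tolerates more:
# sketch: *extra absorbs any fields beyond the six named ones (still needs >= 6)
brandstof, inhoud, vermogen, transmissie, carroserie, kleur, *extra = rest_split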

Print the value of rest_split. You will find that it is a list of length less than 2, whereas a length of at least 2 is needed for a list to have an index of 1.
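For example, a quick way to inspect the offending element (a sketch reusing the question's own variables):
rest = container.find("div", {"class": "occ_extrainfo"}).text.strip()
rest_split = rest.split(", ")
print(len(rest_split), rest_split)  # reveals which container has too few fields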

Thank you all for the extremely fast replies! With your help I got it working.
For some context:
I was trying to scrape a second-hand car website. Following the tips I got, I changed the output per item to print the rest_split list.
The list that I am trying to scrape is 7 elements long. But on the website, for some reason, a motorcycle was added to the search results. That one only had 1 element, hence the error.
The solution for people that might have a similar problem:
rest = container.find("div", {"class": "occ_extrainfo"}).text.strip()
rest_split = rest.split(", ")
if len(rest_split) == 7:
    brandstof = rest_split[0]
    inhoud = rest_split[1]
    vermogen = rest_split[2]
    transmissie = rest_split[3]
    carroserie = rest_split[4]
    kleur = rest_split[5]
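(Note that the guard checks for seven fields, the length of a full car listing, even though only the first six are used; the single-field motorcycle row fails the check and is simply skipped.)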
Special thanks to JacobIRR who actually made life so easy that I didn't even have to think about it.

Related

How to get a big amount of data as fast as possible

I am trying to return an array of constructed objects that are built on top of objects that I retrieve from some URL, plus other fields that I get from another URL.
I have an array that consists of two arrays, each of which has about 8000 objects...
I have tried to make each object construction a thread; however, it still takes a lot of time...
Any solution? Here is my code:
def get_all_players_full_data(ea_players_json):
    all = []
    ea_players_json = list(ea_players_json.values())
    for i in range(len(ea_players_json)):
        for player_obj in ea_players_json[i]:
            all.append(player_obj)
    for player_obj in range(len(all)):
        all_data = []
        with concurrent.futures.ThreadPoolExecutor(len(all)) as executor:
            for player_data in all:
                future = executor.submit(build_full_player_data_obj, player_data)
                print(future.result())
                all_data.append(future.result())

def build_full_player_data_obj(ea_player_data):
    if ea_player_data.get("c") is not None:
        player_full_name = ea_player_data.get("c")
    else:
        player_full_name = ea_player_data.get("f") + " " + ea_player_data.get("l")
    player_id = ea_player_data.get("id")
    # go to futhead to find all cards of that player
    futhead_url_player_data = f'{FUTHEAD_PLAYER}{player_full_name}'
    details_of_specific_player = json.loads(requests.get(futhead_url_player_data).content)
    cards_from_the_same_id = []
    for player_in_json_futhead in details_of_specific_player:
        if player_in_json_futhead["player_id"] == player_id:
            rating = player_in_json_futhead["rating"]
            specific_card_id = player_in_json_futhead["def_id"]
            revision = player_in_json_futhead["revision_type"]
            name = player_in_json_futhead["full_name"]
            nation = player_in_json_futhead["nation_name"]
            position = player_in_json_futhead["position"]
            club = player_in_json_futhead["club_name"]
            cards_from_the_same_id.append(Player(specific_card_id, name, rating, revision, nation,
                                                 position, club))
    return cards_from_the_same_id
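One likely culprit, offered as a hedged observation rather than a verified fix: calling future.result() right after executor.submit() blocks until that task finishes, so the tasks effectively run one at a time (and the outer for player_obj in range(len(all)) loop repeats all of that work). A minimal sketch that submits everything first and only then collects results, letting the HTTP requests overlap:
import concurrent.futures

def get_all_players_full_data(ea_players_json):
    # flatten the nested player groups into one list
    players = [p for group in ea_players_json.values() for p in group]
    # max_workers=20 is an arbitrary choice for this sketch
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        # submit every task before waiting on any result
        futures = [executor.submit(build_full_player_data_obj, p) for p in players]
        return [f.result() for f in futures]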

How to make the process of adding automatic, based on keyword count

I am trying to make a form where, if I input a medicine's name, it will show the solution for that medicine serially. But the way I am building it is limited: the more lines I code, the more entries it can handle. It would be great if you could help me make something short that can handle any number of inputs in a loop.
df = pd.DataFrame({'FEVER': ['NAPA_PLUS', 'JERIN', 'PARASITAMOL'],
                   'GASTRIC': ['SECLO40', 'SECLO20', 'ANTACID'],
                   'WATERINESS': ['ORSALINE', 'TESTY_SALINE', 'HOME_MADE_SALINE']})

def word_list(text):
    return list(filter(None, re.split(r'\W+', text)))

session = raw_input("INPUT THE NAME OF THE MEDICINES ONE BY ONE BY KEEPING SPACE:")
feedback = session
print(word_list(feedback))
dff = pd.DataFrame({'itemlist': [feedback]})
dff['1'] = dff['itemlist'].astype(str).str.split().str[0]
dff['2'] = dff['itemlist'].astype(str).str.split().str[1]
dff['3'] = dff['itemlist'].astype(str).str.split().str[2]
dff['4'] = dff['itemlist'].astype(str).str.split().str[3]
dff['5'] = dff['itemlist'].astype(str).str.split().str[4]
for pts1 in dff['1']:
    pts1 = df.columns[df.isin([pts1]).any()]
for pts2 in dff['2']:
    pts2 = df.columns[df.isin([pts2]).any()]
for pts3 in dff['3']:
    pts3 = df.columns[df.isin([pts3]).any()]
for pts4 in dff['4']:
    pts4 = df.columns[df.isin([pts4]).any()]
for pts5 in dff['5']:
    pts5 = df.columns[df.isin([pts5]).any()]
This wraps your repeated code into two loops:
...
dff = pd.DataFrame({'itemlist': [feedback]})
limit = 5
for i in xrange(limit):
    name = str(i + 1)
    dff[name] = dff['itemlist'].astype(str).str.split().str[i]
    for pts in dff[name]:
        pts = df.columns[df.isin([pts]).any()]
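For reference, a further compaction is possible (a sketch reusing the question's df, word_list and feedback; it assumes every entered name appears somewhere in df):
names = word_list(feedback)
# map each entered medicine to the column (ailment) that contains it
matches = {name: list(df.columns[df.isin([name]).any()]) for name in names}
print(matches)  # e.g. {'JERIN': ['FEVER'], 'SECLO40': ['GASTRIC']}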

Searching through html in scrapy?

Is it possible to use a for loop to search through the text of tags that correspond to a certain phrase? I've been trying to create this loop, but it hasn't been working. Any help is appreciated, thanks! Here is my code:
def parse_page(self, response):
    titles2 = response.xpath('//div[@id = "mainColumn"]/h1/text()').extract_first()
    year = response.xpath('//div[@id = "mainColumn"]/h1/span/text()').extract()[0].strip()
    aud = response.xpath('//div[@id="scorePanel"]/div[2]')
    a_score = aud.xpath('./div[1]/a/div/div[2]/div[1]/span/text()').extract()
    a_count = aud.xpath('./div[2]/div[2]/text()').extract()
    c_score = response.xpath('//a[@id = "tomato_meter_link"]/span/span[1]/text()').extract()[0].strip()
    c_count = response.xpath('//div[@id = "scoreStats"]/div[3]/span[2]/text()').extract()[0].strip()
    info = response.xpath('//div[@class="panel-body content_body"]/ul')
    mp_rating = info.xpath('./li[1]/div[2]/text()').extract()[0].strip()
    genre = info.xpath('./li[2]/div[2]/a/text()').extract_first()
    date = info.xpath('./li[5]/div[2]/time/text()').extract_first()
    box = response.xpath('//section[@class = "panel panel-rt panel-box "]/div')
    actor1 = box.xpath('./div/div[1]/div/a/span/text()').extract()
    actor2 = box.xpath('./div/div[2]/div/a/span/text()').extract()
    actor3 = box.xpath('./div/div[3]/div/a/span/text()').extract_first()
    for x in info.xpath('//li'):
        if info.xpath("./li[x]/div[1][contains(text(), 'Box Office: ')/text()]]
            box_office = info.xpath('./li[x]/div[2]/text()')
        else if info.xpath('./li[x]/div[1]/text()').extract[0] == "Runtime: "):
            runtime = info.xpath('./li[x]/div[2]/time/text()')
Your for loop is completely wrong:
1. You're using info as the base but searching from the root; a relative path fixes that:
for x in info.xpath('.//li'):
2. x is an HTML node element, and you can use it this way:
if x.xpath("./div[1][contains(., 'Box Office: ')]"):
    box_office = x.xpath('./div[2]/text()').extract_first()
I think you might need re() or re_first() to match the certain phrase.
For example:
elif info.xpath('./li[x]/div[1]/text()').re_first('Runtime:') == 'Runtime:':
    runtime = info.xpath('./li[x]/div[2]/time/text()')
And you need to modify your for loop, because the variable x in it is actually a Selector, not a number, so it's not right to use it like this: li[x].
gangabass in the last answer made a good point on this.
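Putting both corrections together, the loop might look like this (a sketch, assuming the div/time layout shown in the question):
for li in info.xpath('./li'):
    # read the label of each row, then pick the matching value selector
    label = li.xpath('./div[1]/text()').extract_first(default='').strip()
    if label == 'Box Office:':
        box_office = li.xpath('./div[2]/text()').extract_first()
    elif label == 'Runtime:':
        runtime = li.xpath('./div[2]/time/text()').extract_first()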

Looping through scraped data and outputting the result

I am trying to scrape the BBC football results website to get teams, shots, goals, cards and incidents. I currently have 3 teams' data passed into the URL.
I am writing the script in Python and using the BeautifulSoup bs4 package. When outputting the results to screen, the first team is printed, then the first and second teams, then the first, second and third teams. So the first team is effectively printed 3 times, when I am trying to get the 3 teams just once.
Once I have this problem sorted, I will write the results to file. I am adding the teams' data into data frames and then into a list (I am not sure if this is the best method).
I am sure it is something to do with the for loops, but I am unsure how to resolve the problem.
Code:
from bs4 import BeautifulSoup
import urllib2
import pandas as pd

out_list = []
for numb in ('EFBO839787', 'EFBO839786', 'EFBO815155'):
    url = 'http://www.bbc.co.uk/sport/football/result/partial/' + numb + '?teamview=false'
    teams_list = []
    inner_page = urllib2.urlopen(url).read()
    soupb = BeautifulSoup(inner_page, 'lxml')
    for report in soupb.find_all('td', 'match-details'):
        home_tag = report.find('span', class_='team-home')
        home_team = home_tag and ''.join(home_tag.stripped_strings)
        score_tag = report.find('span', class_='score')
        score = score_tag and ''.join(score_tag.stripped_strings)
        shots_tag = report.find('span', class_='shots-on-target')
        shots = shots_tag and ''.join(shots_tag.stripped_strings)
        away_tag = report.find('span', class_='team-away')
        away_team = away_tag and ''.join(away_tag.stripped_strings)
        df = pd.DataFrame({'away_team': [away_team], 'home_team': [home_team], 'score': [score]})
        out_list.append(df)
    for shots in soupb.find_all('td', class_='shots'):
        home_shots_tag = shots.find('span', class_='goal-count-home')
        home_shots = home_shots_tag and ''.join(home_shots_tag.stripped_strings)
        away_shots_tag = shots.find('span', class_='goal-count-away')
        away_shots = away_shots_tag and ''.join(away_shots_tag.stripped_strings)
        dfb = pd.DataFrame({'home_shots': [home_shots], 'away_shots': [away_shots]})
        out_list.append(dfb)
    for incidents in soupb.find("table", class_="incidents-table").find("tbody").find_all("tr"):
        home_inc_tag = incidents.find("td", class_="incident-player-home")
        home_inc = home_inc_tag and ''.join(home_inc_tag.stripped_strings)
        type_inc_goal_tag = incidents.find("td", "span", class_="incident-type goal")
        type_inc_goal = type_inc_goal_tag and ''.join(type_inc_goal_tag.stripped_strings)
        type_inc_tag = incidents.find("td", class_="incident-type")
        type_inc = type_inc_tag and ''.join(type_inc_tag.stripped_strings)
        time_inc_tag = incidents.find('td', class_='incident-time')
        time_inc = time_inc_tag and ''.join(time_inc_tag.stripped_strings)
        away_inc_tag = incidents.find('td', class_='incident-player-away')
        away_inc = away_inc_tag and ''.join(away_inc_tag.stripped_strings)
        df_incidents = pd.DataFrame({'home_player': [home_inc], 'event_type': [type_inc_goal],
                                     'event_time': [time_inc], 'away_player': [away_inc]})
        out_list.append(df_incidents)
    print "end"
    print out_list
I am new to Python and Stack Overflow; any suggestions on formatting my questions are also welcome.
Thanks in advance!
Those 3 for loops should be inside your main for loop.
out_list = []
for numb in ('EFBO839787', 'EFBO839786', 'EFBO815155'):
    url = 'http://www.bbc.co.uk/sport/football/result/partial/' + numb + '?teamview=false'
    teams_list = []
    inner_page = urllib.request.urlopen(url).read()
    soupb = BeautifulSoup(inner_page, 'lxml')
    for report in soupb.find_all('td', 'match-details'):
        # your code as it is
    for shots in soupb.find_all('td', class_='shots'):
        # your code as it is
    for incidents in soupb.find("table", class_="incidents-table").find("tbody").find_all("tr"):
        # your code as it is
It works just fine - it shows each team just once.
Here's the output of the first for loop:
[{'score': ['1-3'], 'away_team': ['Man City'], 'home_team': ['Dynamo Kiev']},
{'score': ['1-0'], 'away_team': ['Zenit St P'], 'home_team': ['Benfica']},
{'score': ['1-2'], 'away_team': ['Boston United'], 'home_team': ['Bradford Park Avenue']}]
This looks like a printing problem: at what indentation level are you printing out_list?
It should be at zero indentation, all the way to the left in your code.
Either that, or you want to move out_list into the topmost for loop so that it is re-assigned on every iteration.
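For instance, a sketch of that fix: keep appending inside the loops, but combine and print only once, after the outer loop has finished:
# at zero indentation, i.e. outside every loop (a sketch, not from the answer)
combined = pd.concat(out_list, ignore_index=True)
print(combined)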

List comes back as empty when retrieving data from website; Python

I am trying to parse data from a website by inserting the data into a list, but the list comes back empty.
url = "http://www.releasechimps.org/resources/publication/whos-there-md-anderson"
http = urllib3.PoolManager()
r = http.request('GET', url)
soup = BeautifulSoup(r.data, "html.parser")
#print(r.data)
loop = re.findall(r'<td>(.*?)</td>', str(r.data))
#print(str(loop))
newLoop = str(loop)
#print(newLoop)
for x in range(1229):
    if "\\n\\t\\t\\t\\t" in loop[x]:
        loop[x] = loop[x].replace("\\n\\t\\t\\t\\t", "")
    list0_v2.append(str(loop[x]))
    print(loop[x])
print(str(list0_v2))
Edit: Didn't really have anything else going on, so I made your data format into a nice list of dictionaries. There's a weird <td height="26"> on monkey 111, so I had to change the regex slightly.
Hope this helps you, I did it cause I care about the monkeys man.
import html
import re
import urllib.request

list0_v2 = []
final_list = []
url = "http://www.releasechimps.org/resources/publication/whos-there-md-anderson"
data = urllib.request.urlopen(url).read()
loop = re.findall(r'<td.*?>(.*?)</td>', str(data))
for item in loop:
    if "\\n\\t\\t\\t\\t" in item or "em>" in item:
        item = item.replace("\\n\\t\\t\\t\\t", "").replace("<em>", "")\
            .replace("</em>", "")
    if " " == item:
        continue
    list0_v2.append(item)

n = 1
while len(list0_v2) != 0:
    form = {"n": 0, "name": "", "id": "", "gender": "", "birthdate": "", "notes": ""}
    try:
        if list0_v2[5][-1] == '.':
            numb, name, ids, gender, birthdate, notes = list0_v2[0:6]
            form["notes"] = notes
            del(list0_v2[0:6])
        else:
            raise Exception('foo')
    except:
        numb, name, ids, gender, birthdate = list0_v2[0:5]
        del(list0_v2[0:5])
    form["n"] = int(numb)
    form["name"] = html.unescape(name)
    form["id"] = ids
    form["gender"] = gender
    form["birthdate"] = birthdate
    final_list.append(form)
    n += 1

for li in final_list:
    print("{:3} {:10} {:10} {:3} {:10} {}".format(li["n"], li["name"], li["id"],
                                                  li["gender"], li["birthdate"], li["notes"]))
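A hedged aside on the chunking trick above: it consumes the flat list five or six fields at a time, using the trailing period on the notes field to decide the record width. The same idea on toy (made-up) data:
# hypothetical data: records are 2 or 3 fields wide; a trailing '.' marks a notes field
flat = ["1", "Ali", "note one.", "2", "Bob"]
records = []
while flat:
    width = 3 if len(flat) >= 3 and flat[2].endswith('.') else 2
    records.append(flat[:width])
    del flat[:width]
print(records)  # [['1', 'Ali', 'note one.'], ['2', 'Bob']]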
