How to save elements to Excel with pandas? - python

My code does the following:
1. Enters the site.
2. Collects the links and saves them in a dictionary.
3. Enters the links saved in the dictionary to extract the elements, and also saves those in a dictionary.
4. Finally, saves the information about the elements in the dictionary to Excel with pandas.
Problem:
Some pages do not contain the information to be extracted (it is probably a bug on the site), so pandas does not save the information already collected.
Here's the error:
ValueError: All arrays must be of the same length
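For reference, this is the error pandas raises when the dict's lists end up with different lengths; a minimal reproduction (not my actual data) would be:
import pandas as pd

# Two columns of unequal length -> ValueError: All arrays must be of the same length
pd.DataFrame({'Pacote': ['a', 'b'], 'Link Afiliado': ['x', 'y', 'z']})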
Here's part of my code; I didn't include it all so this wouldn't get too long.
I'm using Selenium.
links = []
imagem = []
pacote = []
counter = 1
for linkAtual in links:
    driver.get(linkAtual)
    try:
        driver.find_element(
            By.XPATH, "//button[normalize-space()='Ir para a oferta']").click()
        sleep(2)
    except:
        print("proxima pagina")
    try:
        titulo = driver.find_element(By.TAG_NAME, "h1")
        print(titulo.text)
        pacote.append(titulo.text.replace("Pacote de Viagem - ", "").replace("+", " e ").replace("2022", "").replace(
            "2023", "").replace("2024", "").replace("2025", "").replace("(", "").replace(")", "").replace("-", ""))
        print("baixar imagem ")
        primeiro_caminho = driver.find_element(By.XPATH, "(//img)[2]")
        atributoSrc = primeiro_caminho.get_attribute("src")
        # file_name = f"{titulo}{counter:02d}.jpg"
        file_name = f"image{counter:02d}.jpg"
        imagem.append(atributoSrc)
        urllib.request.urlretrieve(
            atributoSrc, f"C:\\__Imagens e Planilhas Python\\Afiliacoes\\Fotos\\{file_name}")
        counter += 1
    except:
        print("não tem conteudo")
data = {'Pacote': pacote, 'Link Afiliado': links}
#df = pd.DataFrame.from_dict(data, orient="index")
df = pd.DataFrame(data)
df.to_excel(r"C:\__Imagens e Planilhas Python\Afiliacoes\pacotes.xlsx",engine='xlsxwriter')
print(df)
As for pages that don't work: of the roughly 300 links I collect, about 4 don't have the information.
Here is an example of an error link:
https://www.hurb.com/br/packages/la-romana-passagem-aerea-hospedagem/1416419?utm_source=Felipe-F-clubehurb&utm_medium=clubehu-promotion&utm_campaign=689696&cmp=689696
Here are examples of valid links:
https://www.hurb.com/br/packages/costa-do-sauipe-passagem-aerea-hospedagem-all-inclusive/1421105?utm_source=Felipe-F-clubehurb&utm_medium=clubehu-product&utm_campaign=689696&cmp=689696
https://www.hurb.com/br/packages/rio-de-janeiro-passagem-aerea-hospedagem/1407451?utm_source=Felipe-F-clubehurb&utm_medium=clubehu-product&utm_campaign=689696&cmp=689696
https://www.hurb.com/br/packages/pacote-aereo-hospedagem-dubrovnik/1419049?utm_source=Felipe-F-clubehurb&utm_medium=clubehu-product&utm_campaign=689696&cmp=689696

You could edit your 2nd try...except block to:
titlTxt, atributoSrc = None, None  # initiate/clear as null
try:
    titulo = driver.find_element(By.TAG_NAME, "h1")
    print(titulo.text)
    # pacote.append...  # moved outside the block
    titlTxt = titulo.text.replace("Pacote de Viagem - ", "").replace("+", " e ").replace("2022", "").replace(
        "2023", "").replace("2024", "").replace("2025", "").replace("(", "").replace(")", "").replace("-", "")
    print("baixar imagem ")
    primeiro_caminho = driver.find_element(By.XPATH, "(//img)[2]")
    atributoSrc = primeiro_caminho.get_attribute("src")
    # file_name = f"{titulo}{counter:02d}.jpg"
    file_name = f"image{counter:02d}.jpg"
    # imagem.append(atributoSrc)  # moved outside the block
    urllib.request.urlretrieve(
        atributoSrc, f"C:\\__Imagens e Planilhas Python\\Afiliacoes\\Fotos\\{file_name}")
    counter += 1
except:
    print("não tem conteudo")
pacote.append(titlTxt)
imagem.append(atributoSrc)
So now pacote and imagem each get exactly one item corresponding to each item in links, and all 3 lists can be expected to be the same length.
(If you had separate try...except blocks for each list, you could append in both try and except; but when extracting for multiple lists in the same try, some of them might already have been appended to before the exception is raised, so appending in except as well risks adding to the same list twice in a single loop iteration.)
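A different minimal sketch of the same idea (assuming the same driver, links and column names as above) is to build one dict per link and let pandas fill in the gaps, so the column lengths always match:
rows = []
for linkAtual in links:
    driver.get(linkAtual)
    row = {'Pacote': None, 'Imagem': None, 'Link Afiliado': linkAtual}
    try:
        row['Pacote'] = driver.find_element(By.TAG_NAME, "h1").text
        row['Imagem'] = driver.find_element(By.XPATH, "(//img)[2]").get_attribute("src")
    except Exception:
        print("não tem conteudo")
    rows.append(row)  # every row has the same keys, missing values stay None

df = pd.DataFrame(rows)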

Related

Trying to update code to return HTML value in Selenium (Python)

I am using Selenium to crawl some Facebook group information:
with open("groups.txt") as file:
lines = file.readlines()
total = len(lines)
count = 1
for line in lines:
group_id = line.strip().split(".com/")[1]
if "groups" not in line:
new_line = "https://www.facebook.com/groups/" + str(group_id) + "/about"
else:
new_line = line.strip() + '/about'
sleep(2)
driver.get(new_line)
page_source = driver.page_source
page_id = page_source.split('"groupID":"')[1].split('","')[0]
page_followers = page_source.split('<!-- --> total members')[0][-15:]
page_followers = str(page_followers.split('>')[1]).replace(',', '')
page_name = page_source.split("</title>")[0].split("<title>")[1]
df1.loc[len(df1)] = [line.strip(), 'https://www.facebook.com/' + str(page_id), page_followers, page_name]
print(f"{count}/{total}", line.strip(), 'https://www.facebook.com/' + str(page_id), page_followers)
count += 1
df1.to_csv("groups.csv", encoding='utf-8', index=False, header=False)
Facebook has updated something recently, so this code fails to return the number of group members.
These are the relevant lines:
page_followers = page_source.split('<!-- --> total members')[0][-15:]
page_followers = str(page_followers.split('>')[1]).replace(',', '')
Taking view-source:https://www.facebook.com/groups/764385144252353/about as an example, I find two instances of "total members". Is it possible to get some advice on what I should change to be able to catch this number?
NEW
This code extracts the exact number of members and converts it from a string to an integer:
driver.get('https://www.facebook.com/groups/410943193806268/about')
members = driver.find_element(By.XPATH, "//span[contains(text(), 'total members')]").text
members = int(''.join(i for i in members if i.isdigit()))
print(members)
Output:
15589
OLD
I suggest not using page_source to extract this kind of data; instead, use find_element in this way:
driver.find_element(By.CSS_SELECTOR, "a[href*='members']").text.split()[0]
Output:
'186'
Explanation: a[href*='members'] searches for <a> elements (for example <a class='test'>...</a>) having an href attribute containing the string members (for example ...)
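If the member count is only rendered after the page finishes loading, a variant with an explicit wait (reusing the XPath from the NEW snippet above; the 10-second timeout is just an example) could look like this:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the element to be present before reading its text
elem = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//span[contains(text(), 'total members')]"))
)
members = int(''.join(ch for ch in elem.text if ch.isdigit()))
print(members)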

Python Scraping for String to get an Alarm

I think I've got to a point where I need help from professionals. I would like to build a scraper for a browser game that sends an alarm to a bot (Telegram or Discord). Connecting the bot is not the problem at first; it is more about getting the right result.
My script runs in a while loop (it also runs without one) and is supposed to look for links in an <a> tag. These links contain an ID. The ID is always incremented by 1 when a new player signs up to the game, and that's exactly what I need.
Since I need to compare the information, I figured I need to save it in a .csv file. And there lies the problem: the output in the .csv looks like this:
index.php?section=impressum
I have two problems:
1. I want to limit the output to the first 5 results in the file.
2. Only write to the file when something changes, and only write the corresponding change.
This is my code so far:
import requests
import time
import csv
from datetime import datetime
from bs4 import BeautifulSoup


def writeCSV(data):
    csv_file = open('ags_scrape.csv', 'w')
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow([data])
    csv_file.close()


sleepTimer = 3
# Address of the website
url = "https://www.ag-spiel.de/"
allAGs = []
firstRun = True

while True:
    response = requests.get(url + "index.php?section=live")
    # Parse the HTML document from the page source with BeautifulSoup
    html = BeautifulSoup(response.text, 'html.parser')
    # Parse the URLs out of the <a href> tags
    newDetected = False
    newAGs = []
    possible_links = html.find_all('a')
    for link in possible_links:
        if link.has_attr('href'):
            inhalt = str(link.attrs['href'])
            if "aktie=" in inhalt:
                if firstRun is True:
                    allAGs.append(inhalt)
                else:
                    if str(inhalt) not in allAGs:
                        newDetected = True
                        print("ATTENTION!!! New AG! Url is: " + inhalt)
                        allAGs.append(inhalt)
                        # write to file
                        writeCSV(inhalt)
                    else:
                        # print("Debug output " + inhalt + " already in AGlist")
                        continue
    if firstRun is True:
        print("First run successful, current ags: " + str(len(allAGs)))
        for AGurl in allAGs:
            print(AGurl)
    else:
        if newDetected is False:
            print(str(datetime.now().strftime("%H:%M:%S")) + ": Nothing changed")
            writeCSV(inhalt)
        else:
            print("Something Changed, current ags: " + str(len(allAGs)))
            for AGurl in allAGs:
                print(AGurl)
    firstRun = False
    time.sleep(sleepTimer)
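One possible sketch for both points, assuming the hrefs are collected into a current_links list inside the existing while loop (write_new_links is a hypothetical helper, not part of the code above):
def write_new_links(new_links, path='ags_scrape.csv', limit=5):
    # Append only the changes, at most `limit` per run, instead of
    # overwriting the whole file with the last link that was seen.
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        for link in new_links[:limit]:
            writer.writerow([datetime.now().strftime("%H:%M:%S"), link])

# Inside the while loop, after parsing the page:
current_links = [str(a['href']) for a in html.find_all('a', href=True)
                 if "aktie=" in str(a['href'])]
new_links = [l for l in current_links if l not in allAGs]
if new_links and not firstRun:  # only touch the file when something changed
    write_new_links(new_links)
allAGs.extend(new_links)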

How to open a file and delete the first item (item at index 0)

At the moment I am working on a plug-in for a chat bot for Twitch.
I have this working so far, so that I am able to add items to a file.
# Variables
f = open("Tank_request_list.txt", "a+")
fr = open("Tank_request_list.txt", "r")
tr = "EBR"  # test input
tank_request = fr.read()
treq = tank_request.split("#")

with open("Tank_request_list.txt") as fr:
    empty = fr.read(1)
    if not empty:
        f.write(tr)
        f.close()
    else:
        tr = "#" + tr
        f.write(tr)
        f.close()
I now need to work out how to delete an item at index 0.
I also have this piece of code I need to implement:
# List Length
list_length = len(treq)
print "There are %d tanks in the queue." % (list_length)

# Next 5 Requests
print "Next 5 Requests are:"

def tank_lst(x):
    for i in range(5):
        print "- " + x[i]

# Run Tank_request
tank_lst(treq)
The following will return the right answer but not write it.
def del_played(tank):
    del tank[0]
    return tank

tanks = treq
print del_played(tanks)
First, remove the content: use the truncate function to remove the existing content from the file, then write the new list into it.
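A minimal sketch of that approach, assuming the same '#'-separated Tank_request_list.txt format as above:
# Read the queue, drop the first item, then truncate and rewrite the file
with open("Tank_request_list.txt", "r+") as f:
    treq = f.read().split("#")
    del treq[0]              # remove the item at index 0
    f.seek(0)                # go back to the start of the file
    f.truncate()             # clear the old contents
    f.write("#".join(treq))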

Try block gives output even when exception is raised by the last command (but not the first)

I use try/except to catch problems when reading a file line by line. The try block contains a series of manipulations, the last of which is usually the cause of the exception. Surprisingly, I noticed that all previous manipulations are executed within the try block even when an exception is raised. This is a problem when trying to turn the dictionary I created into a data frame, because the lengths of the lists are unequal.
This code creates the problem:
d = {'dates':[], 'states':[], 'longitude':[], 'latitude':[], 'tweet_ids':[], 'user_ids':[], 'source':[]}

for file in f:
    print("Processing file " + file)
    t1 = file.split('/')[-1].split("_")
    date = t1[0]
    state_code = t1[1]
    state = list(states_ref.loc[states_ref.code == state_code]['abbr'])[0]
    collection = JsonCollection(file)
    counter = 0
    for tweet in collection.get_iterator():
        counter += 1
        try:
            d['dates'].append(date)
            d['states'].append(state)
            t2 = tweet_parser.get_entity_field('geo', tweet)
            if t2 == None:
                d['longitude'].append(t2)
                d['latitude'].append(t2)
            else:
                d['longitude'].append(t2['coordinates'][1])
                d['latitude'].append(t2['coordinates'][0])
            # note: the 3 lines below are the ones that can raise an exception
            temp = tweet_parser.get_entity_field('source', tweet)
            t5 = re.findall(r'>(.*?)<', temp)[0]
            d['source'].append(t5)
        except:
            c += 1
            print("Tweet {} in file {} had a problem and got skipped".format(counter, file))
            print("This is a total of {} tweets I am missing from the {} archive I process.".format(c, sys.argv[1]))
            next

tab = pd.DataFrame.from_dict(d)
I have fixed the problem by moving the manipulation that is prone to giving the error to the top, but I would like to better understand why try/except is behaving like this. Any ideas?
This code works:
d = {'dates':[], 'states':[], 'longitude':[], 'latitude':[], 'tweet_ids':[], 'user_ids':[], 'source':[]}

for file in f:
    print("Processing file " + file)
    t1 = file.split('/')[-1].split("_")
    date = t1[0]
    state_code = t1[1]
    state = list(states_ref.loc[states_ref.code == state_code]['abbr'])[0]
    collection = JsonCollection(file)
    counter = 0
    for tweet in collection.get_iterator():
        counter += 1
        try:
            # note: the 3 lines below are the ones that can raise an exception
            temp = tweet_parser.get_entity_field('source', tweet)
            t5 = re.findall(r'>(.*?)<', temp)[0]
            d['source'].append(t5)
            d['dates'].append(date)
            d['states'].append(state)
            t2 = tweet_parser.get_entity_field('geo', tweet)
            if t2 == None:
                d['longitude'].append(t2)
                d['latitude'].append(t2)
            else:
                d['longitude'].append(t2['coordinates'][1])
                d['latitude'].append(t2['coordinates'][0])
        except:
            c += 1
            print("Tweet {} in file {} had a problem and got skipped".format(counter, file))
            print("This is a total of {} tweets I am missing from the {} archive I process.".format(c, sys.argv[1]))
            next

tab = pd.DataFrame.from_dict(d)
You could always use a temporary object to hold the output of your functions before appending to the target object. That way, if something fails, the exception is raised before any data is put into the target object.
try:
    # Put all data into a temporary dictionary first.
    # Everything that can raise an exception happens here.
    temp = tweet_parser.get_entity_field('source', tweet)
    t2 = tweet_parser.get_entity_field('geo', tweet)
    tempDictionary = {
        "source": re.findall(r'>(.*?)<', temp)[0],
        "latitude": None if (t2 is None) else t2['coordinates'][0],
        "longitude": None if (t2 is None) else t2['coordinates'][1]
    }
    # Only now append the data from the temporary dictionary
    d['source'].append(tempDictionary['source'])
    d['latitude'].append(tempDictionary['latitude'])
    d['longitude'].append(tempDictionary['longitude'])
    d['dates'].append(date)
    d['states'].append(state)
except:
    c += 1
    print("Tweet {} in file {} had a problem and got skipped".format(counter, file))
    print("This is a total of {} tweets I am missing from the {} archive I process.".format(c, sys.argv[1]))

If [string] in [string] not working for requested webpage text

I am trying to open a webpage and scrape some strings from it into a list. The list would ultimately be populated by all of the names displayed on the webpage. In trying to do so, my code looks like this:
import xlsxwriter, urllib.request, string, http.cookiejar, requests


def main():
    username = 'john.mauran'
    password = 'fZSUME1q'
    log_url = 'https://aries.case.com.pl/'
    dest_url = 'https://aries.case.com.pl/main_odczyt.php?strona=eksperci'
    login_values = {'username': username, 'password': password}
    r = requests.post(dest_url, data=login_values, verify=False, allow_redirects=False)
    open_sesame = r.text

    # reads the expert page
    readpage_list = open_sesame.splitlines()

    # opens up a new file in excel
    workbook = xlsxwriter.Workbook('expert_book.xlsx')

    # adds worksheet to file
    worksheet = workbook.add_worksheet()

    # initializing the variable used to move names and dates
    # in the excel spreadsheet
    boxcoA = ""
    boxcoB = ""

    # initializing expert attribute variables and lists
    url_ticker = 0
    name_ticker = 0
    raw_list = []
    url_list = []
    name_list = []
    date_list = []

    # this loop goes through and finds all the lines
    # that contain the expert URL and name and saves them to raw_list:
    # raw_list loop
    for i in open_sesame:
        if '<tr><td align=left><a href=' in i:
            raw_list += i

    if not raw_list:
        print("List is empty")
    if raw_list:
        print(raw_list)


main()
As you can see, all I want to do is take the lines from the text returned by the Requests operation which start with the following characters: '<tr><td align=left><a href='
I don't know exactly what you're trying to do, but this doesn't make any sense:
for i in open_sesame:
    if '<tr><td align=left><a href=' in i:
        raw_list += i
First of all, if you iterate over open_sesame, which is a string, each item in the iteration will be a character of the string. Then '<tr><td align=left><a href=' in i will always be False.
Second of all, raw_list += i is not how you append an item to a list.
Finally, why is the variable called open_sesame? Is it a joke?
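What was probably intended is to iterate over the lines (for example the readpage_list computed with splitlines() above) and use append; a minimal sketch:
raw_list = []
# Iterate over the lines of the response, not over its characters
for line in readpage_list:
    if '<tr><td align=left><a href=' in line:
        raw_list.append(line)  # append the whole matching line

if not raw_list:
    print("List is empty")
else:
    print(raw_list)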
