Multiple headers when I just want a single header - Python

So I have this chunk of a script:
import pandas as pd

def alltime_tracker(sheet_url, sheet_name, concatfile):
    temp2 = pd.read_csv(sheet_url)
    temp2 = temp2[temp2['Time Complete'] != today]  # `today` is defined earlier in the full script
    temp2 = pd.concat([temp2, concatfile])
    temp2.to_csv(sheet_name, encoding='utf-8', index=False)

sheet1_url = 'https://docs.google.com...'
sheet1_name = 'sheet1.csv'
alltime_tracker(sheet1_url, sheet1_name, damagedclaim)  # `damagedclaim` is today's DataFrame, defined elsewhere
The intended output is that the tracker gets updated every day. sheet1_url points to a blank spreadsheet that contains only headers, while concatfile is a different spreadsheet filled with headers and today's data. I've connected this script to Google Docs / the Google API so it runs automatically every day. However, when the script ran on the second day, it produced multiple headers (three headers in total after two days of running). Can anybody help me with this issue so that it only produces one header each time the script runs? Thank you.
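One possible workaround, assuming the stray headers are ending up inside the sheet as ordinary data rows (I can't see the upload side of the automation, so this is a sketch rather than a confirmed fix): drop any row whose 'Time Complete' value is literally the column name before concatenating.

import pandas as pd

def alltime_tracker(sheet_url, sheet_name, concatfile):
    temp2 = pd.read_csv(sheet_url)
    # Drop any stray header rows that were appended as ordinary data
    temp2 = temp2[temp2['Time Complete'] != 'Time Complete']
    # Keep only rows that are not from today, as in the original
    temp2 = temp2[temp2['Time Complete'] != today]  # `today` defined elsewhere in the script
    temp2 = pd.concat([temp2, concatfile])
    # index=False means to_csv writes exactly one header row
    temp2.to_csv(sheet_name, encoding='utf-8', index=False)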


How do I limit the rate of a scraper?

So I am trying to create a table by scraping hundreds of similar pages at a time and then saving them into the same Excel table, with something like this:
# let urls be a list of hundreds of different URLs
def save_table(urls):
    <define columns and parameters of the dataframe to be saved, df>
    writer = pd.ExcelWriter(<address>, engine='xlsxwriter')
    for i in range(0, len(urls)):
        # here, return_html_soup is the function returning the html soup of any individual URL
        soup = return_html_soup(urls[i])
        temp_table = some_function(soup)
        df = df.append(temp_table, ignore_index=True)
    # I chose to_excel instead of to_csv here because there are certain letters on the
    # original website that don't show up in a CSV
    df.to_excel(writer, sheet_name=<some name>)
    writer.save()
    writer.close()
I now hit HTTP Error 429: Too Many Requests, without any Retry-After header.
Is there a way for me to get around this? I know this error happens because I've basically asked to scrape too many websites in too short an interval. Is there a way to limit the rate at which my code opens links?
The official Python documentation is the best place to go:
https://docs.python.org/3/library/time.html#time.sleep
Here is an example using 5 seconds, but you can customize it according to what you need and the restrictions you have.
import time

# let urls be a list of hundreds of different URLs
def save_table(urls):
    <define columns and parameters of the dataframe to be saved, df>
    writer = pd.ExcelWriter(<address>, engine='xlsxwriter')
    for i in range(0, len(urls)):
        # here, return_html_soup is the function returning the html soup of any individual URL
        soup = return_html_soup(urls[i])
        temp_table = some_function(soup)
        df = df.append(temp_table, ignore_index=True)
        # New code to wait for some time
        time.sleep(5)
    # I chose to_excel instead of to_csv here because there are certain letters on the
    # original website that don't show up in a CSV
    df.to_excel(writer, sheet_name=<some name>)
    writer.save()
    writer.close()
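If a fixed delay still trips the limit, another option is to back off and retry whenever a 429 comes back. This is a minimal sketch, not part of either answer above; fetch_with_backoff is a hypothetical helper you could call in place of return_html_soup, and it assumes the pages are fetched with requests and parsed with BeautifulSoup:

import time
import requests
from bs4 import BeautifulSoup

def fetch_with_backoff(url, max_retries=5, base_delay=5):
    """Fetch one page, backing off exponentially on HTTP 429 responses."""
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            response.raise_for_status()
            return BeautifulSoup(response.text, 'html.parser')
        # No Retry-After header to honour, so fall back to exponential backoff
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("Still rate-limited after retries: " + url)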
The way I did it:
Scrape the whole element once and then parse through it by header, tag, or name. I used bs4 with robinstocks for market data; it runs every 10 minutes or so and works fine, specifically the get_element_by_name functionality. Or just use a time delay from the time lib.

Append new data to an existing frame and upload to Sheets - Python

I'm connected to my API client, sent the credentials, made the request, asked the API for data, and put it into a DataFrame.
Then I have to upload this data to a sheet, which will be connected to Power BI as a data source in order to develop a dashboard and monitor some KPIs and so on.
A simple and common ETL process. BUT: to be honest, I'm a rookie and I'm doing my best.
Above this point is just code to connect to the API; here is where the "extraction" begins:
if response_page.status_code == 200:
    if page == 1:
        df = pd.DataFrame(json.loads(response_page.content)["list"])
    else:
        df2 = pd.DataFrame(json.loads(response_page.content)["list"])
        df = df.append(df2)
Then I just pick out the columns I need:
columnas = ['orderId','totalValue','paymentNames']
df2 = df[columnas]
df2
This is what the DF looks like (the screenshot showed the existing data that the new data needs to be appended to).
Then I connect to Sheets, send the credentials, and open the sheet ("carrefourMetodosDePago") and the worksheet ("transacciones"):
sa = gspread.service_account(filename="service_account.json")
sh = sa.open("carrefourMetodosDePago")
wks = sh.worksheet("transacciones")
The magic begins:
wks.update([df2.columns.values.tolist()] + df2.values.tolist())
With this line I upload what the picture shows to the sheet!
I need the new data generated by the API to be appended/merged/concatenated to the current data, so that the code uploads the current data PLUS the new data every time I run it, and so forth.
How can I do that? Should I use a for loop and iterate over every new piece of data and append it to the sheet?
This is the best I could do; I think I've reached my turning point here...
If I explained myself poorly, just let me know.
If you've read this far, thank you for giving me some of your time :)
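One way to get append behaviour, sticking with gspread as in the snippets above: write the header only when the worksheet is empty, then push just the new rows with append_rows (a standard gspread worksheet method), so the existing data is never overwritten. This is a minimal sketch, not a tested drop-in:

import gspread

sa = gspread.service_account(filename="service_account.json")
sh = sa.open("carrefourMetodosDePago")
wks = sh.worksheet("transacciones")

# df2 is the DataFrame built from the latest API response, as above
if not wks.get_all_values():
    # Empty worksheet: write the header exactly once
    wks.append_row(df2.columns.values.tolist())

# Append only the new rows below whatever is already in the sheet
wks.append_rows(df2.values.tolist())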

Python code not writing output to an Excel sheet but able to take input from another sheet in the same workbook

Background:
I am fetching the option chain for a symbol from the web and then writing it to an Excel sheet. I also have another sheet in the same workbook from which I take inputs for the program to run. All of this I am doing with Excel 2016.
Here is a sample of the code, as the whole program is pretty long:
import xlwings as xw
import pandas as pd

excel_file = 'test.xlsx'
wb = xw.Book(excel_file)
wb.save()

# Fetching user input for Script/Ticker, else it will be set to NIFTY as default
try:
    Script_Input = pd.read_excel(excel_file, sheet_name='Input_Options', usecols='C')
    script = Script_Input.iloc[0, 0]
except:
    script = 'NIFTY'

# Writing data in the sheet
sht_name = script + '_OC'
try:
    wb.sheets.add(sht_name)
    print('new sheet added')
    wb.save()
except:
    pass
    # print('sheet already present')

# directing pointer towards current sheet to be written
# (df, underlying and underlying_Value come from earlier parts of the full program)
sheet = wb.sheets(sht_name)
sheet.range('A4').options(index=False, header=False).value = df
sheet.range('B1').value = underlying
sheet.range('C1').value = underlying_Value
# sheet.range('A3').options(index=False, header=False).value = ce_data_final
# sheet.range('J3').options(index=False, header=False).value = pe_data_final
wb.save()
Problem: Since yesterday, I can open my Excel workbook with Excel 2016 and change inputs for my program, but no data gets written to the sheet that takes output from the program. The program runs perfectly, as I can verify the output on the terminal. Also, once I delete the sheet, no new sheet is created as it should be.
What I tried: I have uninstalled every other version of Excel I had, so now only Excel 2016 is present.
I have made sure that all the respective file formats use Excel 2016 as the default app.
Also note that 2 days ago I was able to write data perfectly into the respective sheet, but now I am not able to do so.
Any help appreciated...
Sorry to everyone who tried to solve this question.
After #buran asked about 'df', I looked into my code and found that I had a return statement before writing 'df' to the sheet (I have created a separate function to write data to Excel). Now that I have moved that statement to its proper place, the code works fine. I am extremely sorry, as I did not realise what the problem was in the first place and assumed it had to do with Excel and Python. Now the program runs perfectly and I am getting the output I want.
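For anyone who hits the same symptom, here is a hypothetical reconstruction of the bug pattern described above (not the asker's actual function): an early return skips the xlwings write calls even though the terminal output looks fine.

def write_option_chain(sheet, df, underlying, underlying_value):
    # return df        # <-- misplaced return: nothing below ever runs, so the sheet stays empty
    sheet.range('A4').options(index=False, header=False).value = df
    sheet.range('B1').value = underlying
    sheet.range('C1').value = underlying_value
    return df          # return after the writes instead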

Need help creating a loop that will go through row by row in Excel

I'm a beginner at Python, and I have been trying my hand at some projects. I have an Excel spreadsheet that contains a column of URLs. I want to open each URL, pull some data from it, output that data to a different column in my spreadsheet, and then move down to the next URL and repeat.
I was able to write code that completes almost the entire process if I enter a single URL, but I'm bad at creating loops.
My list is only 10 cells long.
My question is: what code can I use to loop through a column until it hits a stopping point?
import urllib.request
import csv
import pandas as pd
from openpyxl import load_workbook

xl = pd.ExcelFile("filename.xlsx")
ws = xl.parse("Sheet1")

i = 0  # This is where I insert the row number for a specific URL
urlpage = str(ws['URLPage'][i])  # 'URLPage' is the name of the column in Excel
p = urlpage.replace(" ", "")  # This line is for deleting whitespace in my URL
response = urllib.request.urlopen(p)
Also, as stated, I'm newer to Python, so if you see anywhere I can improve the code I already have, please let me know.
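A minimal sketch of the loop, building on the snippet above: it walks the URLPage column row by row, fetches each URL, and writes one value per row into a new column before saving. The 'Result' column, the len(html) placeholder, and the output file name are illustrative choices, not taken from the question:

import urllib.request
import pandas as pd

xl = pd.ExcelFile("filename.xlsx")
ws = xl.parse("Sheet1")

results = []
for i in range(len(ws)):  # go row by row until the column runs out
    urlpage = str(ws['URLPage'][i]).replace(" ", "")  # strip whitespace as before
    with urllib.request.urlopen(urlpage) as response:
        html = response.read()
    results.append(len(html))  # placeholder: replace with the real data pulled from the page

ws['Result'] = results  # new column, one value per URL row
ws.to_excel("filename_with_results.xlsx", sheet_name="Sheet1", index=False)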

Python: Issue with rapidly reading and writing excel files after web scraping? Works for a bit then weird issues come up

So I developed a script that pulls data from a live-updating site tracking coronavirus data. I set it up to pull data every 30 minutes but recently tested it with updates every 30 seconds.
The idea is that it creates the request to the site, pulls the HTML, builds a list of all the data I need, then restructures it into a dataframe (basically the country, the cases, deaths, etc.).
Then it takes each row and appends it to the corresponding one of the 123 Excel files that exist for the various countries. This works well for, I believe, somewhere in the range of 30-50 iterations before it either causes file corruption or weird data entries.
My code is below. I know it's poorly written (my initial reasoning was that I felt confident I could set it up quickly and I wanted to collect data quickly... unfortunately I overestimated my abilities, but now I want to learn what went wrong). Below my code I'll include sample output.
PLEASE note that the 30-second interval is only for quick testing; I don't usually intend to send that many requests for months. I just wanted to see what the issue was. Originally it was set to pull every 30 minutes when I detected this issue.
See below for the code:
import schedule
import time

def RecurringProcess2():
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import datetime
    import numpy as np
    from os import listdir
    import os
    try:
        extractTime = datetime.datetime.now()
        extractTime = str(extractTime)
        print("Access Initiated at " + extractTime)
        link = 'https://www.worldometers.info/coronavirus/'
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'html.parser').findAll('td')  # [1107].get_text()
        table = pd.DataFrame(columns=['Date and Time', 'Country', 'Total Cases', 'New Cases', 'Total Deaths', 'New Deaths', 'Total Recovered', 'Active Cases', 'Serious Critical', 'Total Cases/1M pop'])
        soupList = []
        for i in range(1107):
            value = soup[i].get_text()
            soupList.insert(i, value)
        table = np.reshape(soupList, (123, -1))
        table = pd.DataFrame(table)
        table.columns = ['Country', 'Total Cases', 'New Cases (+)', 'Total Deaths', 'New Deaths (+)', 'Total Recovered', 'Active Cases', 'Serious Critical', 'Total Cases/1M pop']
        table['Date & Time'] = extractTime
        # Below code is run once to generate the initial files. That's it.
        # for i in range(122):
        #     fileName = table.iloc[i,0] + '.xlsx'
        #     table.iloc[i:i+1,:].to_excel(fileName)
        FilesDirectory = 'D:\\Professional\\Coronavirus'
        fileType = '.csv'
        filenames = listdir(FilesDirectory)
        DataFiles = [filename for filename in filenames if filename.endswith(fileType)]
        for file in DataFiles:
            countryData = pd.read_csv(file, index_col=0)
            MatchedCountry = table.loc[table['Country'] == str(file)[:-4]]
            if file == ' USA .csv':
                print("Country Data Rows: ", len(countryData))
                if os.stat(file).st_size < 1500:
                    print("File Size under 1500")
            countryData = countryData.append(MatchedCountry)
            countryData.to_csv(FilesDirectory + '\\' + file, index=False)
    except:
        pass
    print("Process Complete!")
    return

schedule.every(30).seconds.do(RecurringProcess2)

while True:
    schedule.run_pending()
    time.sleep(1)
When I check the files after some number of iterations (it's usually successful for around 30-50), they have either kept only 2 rows and lost all the others, or they keep appending while deleting a single entry from the row above, while the row two above loses 2 entries, and so on (essentially forming a triangle of sorts).
Above that there would be a few hundred empty rows. Does anyone have an idea of what is going wrong here? I'd consider this a failed attempt, but I would still like to learn from it. I appreciate any help in advance.
Hi, as per my understanding the webpage only has one table element. My suggestion would be to use pandas' read_html method, as it returns a clean and structured table.
Try the code below; you can modify it to run on the same schedule:
import requests
import pandas as pd
url = 'https://www.worldometers.info/coronavirus/'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
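If it helps, this is one way the read_html approach could be wired into the same schedule loop as the original script; the single cumulative CSV (OUTFILE) and the os.path.exists header check are my own illustrative choices, not part of either script above:

import os
import time

import requests
import pandas as pd
import schedule

URL = 'https://www.worldometers.info/coronavirus/'
OUTFILE = 'coronavirus_snapshots.csv'  # illustrative: one cumulative file instead of 123

def pull_snapshot():
    html = requests.get(URL).text
    df = pd.read_html(html)[-1]          # same table selection as the answer above
    df['Date & Time'] = pd.Timestamp.now()
    # Append each snapshot; only write the header the first time
    df.to_csv(OUTFILE, mode='a', header=not os.path.exists(OUTFILE), index=False)

schedule.every(30).minutes.do(pull_snapshot)

while True:
    schedule.run_pending()
    time.sleep(1)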
Disclaimer: I'm still evaluating this solution. So far it has worked almost perfectly for 77 rows.
Originally I had set the script up to work with .xlsx files. I converted everything to .csv but retained the index-column code:
countryData = pd.read_csv(file,index_col=0)
I started realizing that things were being ordered differently every time the script ran. I have since removed that from the code and so far it works. Almost.
For some reason the first two columns of every file now look like this:
Unnamed: 0    Unnamed: 0.1
0             7
              7
I don't know why, but apart from those two columns it still seems to be reading and writing correctly. Not sure what's going on here.
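For what it's worth, those Unnamed columns are the usual sign of writing a DataFrame's index to CSV and then reading the file back without treating that column as the index; each round trip then adds another anonymous column. A minimal sketch of the symmetric pattern that avoids it (the file name and sample data are just illustrative):

import pandas as pd

df = pd.DataFrame({'Country': ['USA'], 'Total Cases': [100]})

# Write without the index so no anonymous column is created...
df.to_csv('USA.csv', index=False)

# ...and read it back as plain columns (no index_col needed)
round_trip = pd.read_csv('USA.csv')
print(round_trip.columns.tolist())  # ['Country', 'Total Cases'] -- no 'Unnamed: 0'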
