I'm currently encountering an error with my code, and I have no idea why. I originally thought it was because I couldn't save a CSV file with hyphens in its name, but that turns out not to be the case. Does anyone have any suggestions as to what might be causing the problem? My code is below:
import pandas as pd
import requests
query_set = ["points-per-game"]
for query in query_set:
    url = 'https://www.teamrankings.com/ncaa-basketball/stat/' + str(query)
    html = requests.get(url).content
    df_list = pd.read_html(html)
    print(df_list)
    df_list.to_csv(str(query) + "stat.csv", encoding="utf-8")
The read_html() method returns a list of DataFrames, not a single DataFrame:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
You'll want to loop through your df_list and run to_csv on each entry, like so:
import pandas as pd
import requests

query_set = ["points-per-game"]

for query in query_set:
    url = 'https://www.teamrankings.com/ncaa-basketball/stat/' + str(query)
    html = requests.get(url).content
    df_list = pd.read_html(html)
    print(df_list)
    # write each table to its own file so later tables don't overwrite earlier ones
    for i, current_df in enumerate(df_list):
        current_df.to_csv(str(query) + "stat" + str(i) + ".csv", encoding="utf-8")
        print(current_df)
The function pd.read_html returns a list of the DataFrames found in the HTML source. Use df_list[0] to get the DataFrame that is the first element of this list.
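For example, a minimal sketch of that approach using the URL from the question (the output filename, and the assumption that the first table on the page is the one you want, are mine):

import pandas as pd
import requests

url = 'https://www.teamrankings.com/ncaa-basketball/stat/points-per-game'
html = requests.get(url).text           # fetch the page HTML

df_list = pd.read_html(html)            # list of DataFrames, one per <table> found
df = df_list[0]                         # take the first table (assumed to be the stats table)
df.to_csv("points-per-gamestat.csv", encoding="utf-8", index=False)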
I am experiencing this error with the code below:
File "\<stdin\>", line 1, in \<module\>
AttributeError: 'list' object has no attribute 'to_excel'
I want to save the table I am scraping from Wikipedia to an Excel file, but I can't work out how to adjust the code to get the data list from the terminal into the Excel file using to_excel.
I can see it works for a similar problem when a dataset is set out as a DataFrame
(i.e. df = pd.DataFrame(data, columns=['Product', 'Price'])).
But I can't work out how to adjust my code for the df = pd.read_html(str(congress_table)) line, which I think is the issue (i.e. using read_html and sourcing the data from a table id).
How can I adjust the code to make it save an Excel file to the path specified?
from bs4 import BeautifulSoup
import requests
import pandas as pd
wiki_url = 'https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives'
table_id = 'votingmembers'
response = requests.get(wiki_url)
soup = BeautifulSoup(response.text, 'html.parser')
congress_table = soup.find('table', attrs={'id': table_id})
df = pd.read_html(str(congress_table))
df.to_excel (r'C:\Users\name\OneDrive\Code\.vscode\Test.xlsx', index = False, header=True)
print(df)
I was expecting the data list to be saved to Excel at the folder path specified.
I tried following multiple guides, but they only show solutions for DataFrames built directly with pd.DataFrame, not for the list that read_html returns.
pandas.read_html() returns a list of the tables it finds as DataFrame objects, so you have to pick one by index, in your case [0]. You also do not need requests and BeautifulSoup separately; just go with pandas.read_html():
pd.read_html(wiki_url,attrs={'id': table_id})[0]
Example
import pandas as pd

wiki_url = 'https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives'
table_id = 'votingmembers'

# read_html fetches the page itself and filters tables by their HTML attributes;
# it still returns a list, so take the single match with [0]
df = pd.read_html(wiki_url, attrs={'id': table_id})[0]

# writing .xlsx requires an Excel engine such as openpyxl to be installed
df.to_excel(r'C:\Users\name\OneDrive\Code\.vscode\Test.xlsx', index=False, header=True)
I need to collect data from the website - https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow= and store it in a dataframe using pandas. For this I use the following code and get the data quite easily -
import pandas as pd
import requests
url = "https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow="
link = requests.get(url).text
df = pd.read_html(link)
df = df[-1]
But if you notice, there is another hyperlink, named "Details", on the extreme right-hand side of every row of the table on the webpage. I would also like to add the data from inside that hyperlink to every row in our dataframe. How do we do that?
As suggested by Shi XiuFeng, BeautifulSoup is better suited for your problem, but if you still want to proceed with your current code, you would have to use regex to extract the URLs and add them as a column, like this:
import re
import pandas as pd
import requests

url = "https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow="
link = requests.get(url)
link_text = link.text

# grab the table body, then pull the href out of every "Details" anchor
# (re.DOTALL lets .*? span line breaks in the HTML source)
tbody = re.findall(r'(<tbody.*?>.*?</tbody>)', link_text, re.DOTALL)[0]
res = re.findall(r'<a href="(.*?)">Details</a>', tbody)

df = pd.read_html(link_text)
df = df[-1]
df['links'] = res
print(df)
Hope that solves your problem.
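If you also want to pull the data from each "Details" page into Python, one rough sketch (untested; the base URL used to resolve relative links, and the assumption that each details page contains a table pandas can parse, are both guesses):

import pandas as pd
import requests

base = "https://webgate.ec.europa.eu/rasff-window/portal/"   # assumed base for relative links

detail_tables = []
for href in res:                                      # res is the list of hrefs built above
    detail_url = href if href.startswith("http") else base + href
    detail_html = requests.get(detail_url).text
    detail_tables.append(pd.read_html(detail_html))   # list of tables found on that page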
I am new to Python and am stuck. I can't figure out how to output only one of the tables given. The output includes the desired table, but three versions of it: the first two are awfully formatted, and the last one is the table I want.
I have tried running a for loop and counting to only print the third table.
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
for df in dfs:
    print(df[0:])
Just use index to print the table.
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
print(dfs[2])
OR
print(dfs[-1])
OR, if you want to use a loop, then try this:
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
for i, df in enumerate(dfs):
    if i == 2:
        print(df)
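Relying on a fixed position breaks if ESPN reorders or adds tables; a slightly more defensive sketch is to pick the table by a column you expect it to have (the 'PLAYER' column name here is an assumption about the leaderboard table):

import pandas as pd

url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header=0)

# 'PLAYER' is a guess at a column unique to the leaderboard table; adjust as needed
leaderboard = next((df for df in dfs if 'PLAYER' in df.columns), None)
print(leaderboard)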
I have extracted a table from a site with the help of BeautifulSoup. Now I want to keep this process going in a loop over several different URLs. If it is possible, I would like to extract these tables into different Excel documents, or different sheets within one document.
I have been trying to put the code through a loop and append the df to a list.
from bs4 import BeautifulSoup
import requests
import pandas as pd
xl = pd.ExcelFile(r'path/to/file.xlsx')
link = xl.parse('Sheet1')
#this is what I can't figure out
for i in range(0,10):
    try:
        url = link['Link'][i]
        html = requests.get(url).content
        df_list = pd.read_html(html)
        soup = BeautifulSoup(html,'lxml')
        table = soup.select_one('table:contains("Fees Earned")')
        df = pd.read_html(str(table))
        list1.append(df)
    except ValueError:
        print('Value')
        pass
#Not as important
a = df[0]
writer = pd.ExcelWriter('mytables.xlsx')
a.to_excel(writer,'Sheet1')
writer.save()
I get a 'ValueError' (no tables found) for the first nine tables, and only the last table is printed when I print mylist. However, when I print them without the for loop, one link at a time, it works.
I can't append the value of df[i] because it says 'index out of range'
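For the Excel side of it, a minimal sketch of writing each scraped table to its own sheet of one workbook (the scraped_tables list here is a placeholder for whatever DataFrames the loop above collects, and writing .xlsx needs an engine such as openpyxl installed):

import pandas as pd

# placeholder data standing in for the DataFrames collected by the scraping loop
scraped_tables = [pd.DataFrame({'Fees Earned': [100]}), pd.DataFrame({'Fees Earned': [200]})]

with pd.ExcelWriter('mytables.xlsx') as writer:
    for i, table_df in enumerate(scraped_tables):
        table_df.to_excel(writer, sheet_name='Sheet' + str(i + 1), index=False)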
I'm attempting to take dates from a dataframe and loop through them within URLs. I've managed to print the URLs (1st code), but when I attempt to turn each URL's JSON into a dataframe (2nd code) I get this response:
AttributeError: 'str' object has no attribute 'json'
#1st code
import requests
import pandas as pd
df = pd.read_csv('NBADates.csv')
df.to_dict('series')
for row in df.loc[ : ,"Date"]:
    url = url_template.format(row=row)
    print(url)
Any ideas on what I'm doing wrong?
#2nd code
import requests
import csv
import pandas as pd
url_template = "https://stats.nba.com/stats/leaguedashptstats?College=&Conference=&Country=&DateFrom={row}&DateTo={row}&Division=&DraftPick=&DraftYear=&GameScope=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=Totals&PlayerExperience=&PlayerOrTeam=Player&PlayerPosition=&PtMeasureType=SpeedDistance&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="
df = pd.read_csv('NBADates.csv')
df.to_dict('series')
for row in df.loc[ : ,"Date"]:
url = url_template.format(row=row)
stats = url.json()['resultSets'][0]['rowSet']
headers = url.json()['resultSets'][0]['headers']
stats_df = pd.DataFrame(stats, columns=headers)
# Append to the big dataframe
lineup_df = lineup_df.append(stats_df, ignore_index=True)
lineup_df.to_csv("Stats.csv")
I think you forgot to actually request the URL. You should send a request, and if the response is JSON, parse it:
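A minimal sketch of that fix applied to the question's loop (untested; the request headers are a guess, since stats.nba.com often stalls on requests that don't look like a browser, and pd.concat is used in place of the now-removed DataFrame.append):

import pandas as pd
import requests

url_template = "https://stats.nba.com/stats/leaguedashptstats?College=&Conference=&Country=&DateFrom={row}&DateTo={row}&Division=&DraftPick=&DraftYear=&GameScope=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=Totals&PlayerExperience=&PlayerOrTeam=Player&PlayerPosition=&PtMeasureType=SpeedDistance&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="

# guessed headers; adjust as needed if the API rejects the request
request_headers = {"User-Agent": "Mozilla/5.0"}

df = pd.read_csv('NBADates.csv')          # same input file as in the question
lineup_df = pd.DataFrame()

for row in df.loc[:, "Date"]:
    url = url_template.format(row=row)
    response = requests.get(url, headers=request_headers)   # actually send the request
    data = response.json()                                   # parse the JSON body
    stats = data['resultSets'][0]['rowSet']
    headers = data['resultSets'][0]['headers']
    stats_df = pd.DataFrame(stats, columns=headers)
    lineup_df = pd.concat([lineup_df, stats_df], ignore_index=True)

lineup_df.to_csv("Stats.csv", index=False)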