I am having some problems with pandas read_html. When I tried to read the table with pandas, it did not work, so I fetched the page with requests and BeautifulSoup instead, and that solved the problem. I would still like to know why I could not get the table with pandas on the first attempt. Thank you.
First code:
import pandas as pd
url = 'https://finance.naver.com/item/sise_day.nhn?code=005930&page=1'
r = pd.read_html(url)[0]
Second code that I tried:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://finance.naver.com/item/sise_day.nhn?code=005930&page=1'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
table = str(soup.select("table"))
data=pd.read_html(table)[0]
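One way to narrow down a difference like this is to keep the download and the parsing as two separate steps, so you can see which one fails. The sketch below is only an illustration: SAMPLE_HTML is made-up markup (not the real Naver page), fetch_html and first_table are hypothetical helper names, and the User-Agent value is an assumption, not a confirmed cause of the original failure.

```python
import io

import pandas as pd
import requests

# Stand-in markup for illustration only; not the real Naver page.
SAMPLE_HTML = """
<table>
  <tr><th>date</th><th>close</th></tr>
  <tr><td>2020.01.02</td><td>55200</td></tr>
  <tr><td>2020.01.03</td><td>55500</td></tr>
</table>
"""


def fetch_html(url: str) -> str:
    """Download a page with requests so headers and HTTP errors are visible."""
    # The User-Agent value is an assumption; some sites reject default client strings.
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


def first_table(html: str) -> pd.DataFrame:
    """Parse HTML text and return the first table as a DataFrame."""
    # Wrapping the text in StringIO avoids the deprecation warning for literal HTML.
    return pd.read_html(io.StringIO(html))[0]


# Parsing step only, run on the sample markup (no network access needed).
df = first_table(SAMPLE_HTML)
print(df)
```

If first_table(fetch_html(url)) works while pd.read_html(url) does not, the problem is more likely in how the page is downloaded than in how the table is parsed.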
I've just started to get into web scraping using Python and I'm slowly making progress. I hope someone can help me out.
I'm trying to scrape all the aircraft on the Icelandic aircraft register. I've written a script that pulls all the data in from the table and prints it to the screen, as shown here:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.icetra.is/aviation/aircraft/register/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
aircraft = soup.findAll('tr')
for ac in aircraft:
    print(ac.get_text())
What I would like to be able to do is save it to a CSV file with rows and columns. My guess is that I need to have each of the columns as a variable and read each row of data into the relevant column.
Regards,
Mark
You can use DataFrame.to_csv() from pandas. Here's an example:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.icetra.is/aviation/aircraft/register/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
aircraft = soup.findAll('tr')
aircrafts = [ac.get_text() for ac in aircraft]
df = pd.DataFrame({"Aircrafts": aircrafts})
df.to_csv("aircrafts.csv")
Edit: I've noticed that soup.findAll('tr') might be returning more than you wanted; in this case it's getting the text of the whole row as a single string. You might want to use ac.stripped_strings (see the BeautifulSoup documentation) to get each cell's string separately.
Edit 2: You should try pd.read_html() to read this table. However, I tried fixing my last code and got this solution:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.icetra.is/aviation/aircraft/register/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
aircraft = soup.findAll('tr')
rows = [list(ac.stripped_strings) for ac in aircraft]
df = pd.DataFrame.from_records(rows)
df.columns = df.iloc[0]
df.drop(index=0, inplace=True)
df.to_csv("aircrafts.csv", index=False)
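To make the difference between get_text() and stripped_strings concrete, here is a small sketch on a made-up two-column table. The markup and the values in it are hypothetical and only stand in for the register page.

```python
from bs4 import BeautifulSoup

# Hypothetical two-column table standing in for the register page.
SAMPLE_HTML = """
<table>
  <tr><th>Registration</th><th>Type</th></tr>
  <tr><td> TF-ABC </td><td>Cessna 172</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
data_row = soup.find_all("tr")[1]

# get_text() joins every text node in the row into a single string.
as_text = data_row.get_text()

# stripped_strings yields each non-empty text node with whitespace trimmed.
as_cells = list(data_row.stripped_strings)

print(repr(as_text))
print(as_cells)
```

Keep in mind that stripped_strings works on text nodes, not on cells: an empty cell produces no string and a cell with nested tags can produce several, so rows may end up with different lengths.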
I want to create a Python script that fetches all the data from the following website: https://www.bis.doc.gov/dpl/dpl.txt, stores it in an Excel file, and counts the number of records in it. I've tried to achieve this with the following code:
import requests
import re
from bs4 import BeautifulSoup
URL = "https://www.bis.doc.gov/dpl/dpl.txt"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "lxml")
print(soup)
I've fetched the data but don't know the next step of storing it as an Excel file. Could anyone please guide me or share ideas? Thank you in advance!
You can do this easily with pandas, since the data is tab-separated.
Note: openpyxl needs to be installed for this to work.
import requests
import io
import pandas as pd
URL = "https://www.bis.doc.gov/dpl/dpl.txt"
page = requests.get(URL)
df = pd.read_csv(io.StringIO(page.text), sep="\t")
df.to_excel(r'i_data.xlsx', index = False)
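The question also asks for the number of records, which the snippet above doesn't print. A minimal sketch of that step follows; SAMPLE_TSV is made-up data, and the real file's columns may differ.

```python
import io

import pandas as pd

# Made-up tab-separated sample; the real file's columns may differ.
SAMPLE_TSV = "Name\tCountry\nAlpha Ltd\tAA\nBeta GmbH\tBB\nGamma Inc\tCC\n"

df = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t")

# len(df) counts data rows; the header line is not included.
record_count = len(df)
print(record_count)
```

With the real download you would call len(df) on the DataFrame built from page.text in the same way.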
I'm new to Python and am trying to extract data from one particular table on https://www.screener.in/company/ABB/consolidated/ (the last table, which is Shareholding Pattern).
I'm using the BeautifulSoup library for this, but I do not know how to go about it.
Below is my code snippet so far. I am failing to pick the right table because the page has multiple tables that all share common classes and IDs, which makes it difficult to filter for the one table I want.
import requests
import urllib.request
from bs4 import BeautifulSoup
url = "https://www.screener.in/company/ABB/consolidated/"
r = requests.get(url)
print(r.status_code)
html_content = r.text
soup = BeautifulSoup(html_content,"html.parser")
# print(soup)
#data_table = soup.find('table', class_ = "data-table")
# print(data_table)
table_needed = soup.find("<h2>ShareholdingPattern</h2>")
#sub = table_needed.contents[0]
print(table_needed)
Just use requests and pandas. Grab the last table and dump it to a .csv file.
Here's how:
import pandas as pd
import requests
df = pd.read_html(
    requests.get("https://www.screener.in/company/ABB/consolidated/").text,
    flavor="bs4",
)
df[-1].to_csv("last_table.csv", index=False)
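If you would rather not depend on the table being the last one, another option is to locate it by its heading. Note that soup.find() takes a tag name or filter, not a string of markup, which is why soup.find("<h2>ShareholdingPattern</h2>") returns None. The sketch below runs on made-up markup; the assumption that each table follows an h2 heading is mine and has not been checked against the real page, and table_after_heading is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up markup: two tables with the same class, each under its own <h2>.
SAMPLE_HTML = """
<section><h2>Quarterly Results</h2>
<table class="data-table"><tr><td>Sales</td><td>100</td></tr></table></section>
<section><h2>Shareholding Pattern</h2>
<table class="data-table"><tr><td>Promoters</td><td>75%</td></tr></table></section>
"""


def table_after_heading(html, heading_text):
    """Return the first <table> that follows an <h2> with the given text, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Match on the heading's visible text, ignoring surrounding whitespace.
    heading = soup.find(
        lambda tag: tag.name == "h2" and tag.get_text(strip=True) == heading_text
    )
    if heading is None:
        return None
    # find_next walks forward in document order to the first <table>.
    return heading.find_next("table")


table = table_after_heading(SAMPLE_HTML, "Shareholding Pattern")
cells = [td.get_text(strip=True) for td in table.find_all("td")]
print(cells)
```

The returned tag can then be converted with pd.read_html(io.StringIO(str(table)))[0] if you want a DataFrame.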
Output from a .csv file:
I am attempting to scrape tables from the website spotrac.com and save the data to a pandas dataframe. For whatever reason, if the table I am scraping is over 100 rows, the BeautifulSoup object only appears to grab the first 100 rows of the table. If you run my code below, you'll see that the resulting dataframe has only 100 rows, and ends with "David Montgomery." If you visit the webpage (https://www.spotrac.com/nfl/rankings/2019/base/running-back/) and ctrl+F "David Montgomery", you'll see that there are additional rows. If you change the URL in the s.get() call to "https://www.spotrac.com/nfl/rankings/2019/base/wide-receiver/" you'll see that the same thing happens. Only the first 100 rows are included in the BeautifulSoup object and in the dataframe.
import pandas as pd
import requests, lxml.html
from bs4 import BeautifulSoup
# Begin requests session
with requests.session() as s:
    # Get page
    r = s.get('https://www.spotrac.com/nfl/rankings/2019/base/running-back/')
    # Get page content, find first table, and save to df
    soup = BeautifulSoup(r.content,'lxml')
    table = soup.find_all('table')[0]
    df_list = pd.read_html(str(table))
    df = df_list[0]
I have read that changing the parser can help. I have tried using different parsers by replacing the BeautifulSoup object code with the following:
soup = BeautifulSoup(r.content,'html5lib')
soup = BeautifulSoup(r.content,'html.parser')
Neither of these changes worked. I have run "pip install html5lib" and "pip install lxml" and confirmed that both were already installed.
This page uses JavaScript to load extra data.
In DevTools in Firefox/Chrome you can see that it sends a POST request with the extra form data {'ajax': True, 'mobile': False}. Sending the same POST request with requests returns the full table:
import pandas as pd
import requests, lxml.html
from bs4 import BeautifulSoup
with requests.session() as s:
    r = s.post('https://www.spotrac.com/nfl/rankings/2019/base/running-back/', data={'ajax': True, 'mobile': False})
    # Get page content, find first table, and save to df
    soup = BeautifulSoup(r.content, 'lxml')
    table = soup.find_all('table')[0]
    df_list = pd.read_html(str(table))
    df = df_list[0]
    print(df)
I suggest you use requests-html, which can render the page's JavaScript before you parse it:
import pandas as pd
from bs4 import BeautifulSoup
from requests_html import HTMLSession
if __name__ == "__main__":
    # Begin requests session
    s = HTMLSession()
    # Get page
    r = s.get('https://www.spotrac.com/nfl/rankings/2019/base/running-back/')
    r.html.render()
    # Get page content, find first table, and save to df
    soup = BeautifulSoup(r.html.html, 'lxml')
    table = soup.find_all('table')[0]
    df_list = pd.read_html(str(table))
    df = df_list[0]
Then you will get 140 rows.
I am trying to import a table from a website and afterwards transform the data into a pandas dataframe.
The website is: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
That's my code so far:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
website_url = requests.get(
    'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_url,'lxml')
My_table = soup.find('table',{'class':'wikitable sortable'})
for x in soup.find_all('table',{'class':'wikitable sortable'}):
    table = x.text
print(My_table)
print(table)
Output of print(My_table)
Output of print(table)
How do I convert this webpage table to a pandas DataFrame?
Have you tried pd.read_html()?
Also, since the table is very standard, why not copy it directly into Excel and import that as a DataFrame?
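For reference, here is a minimal sketch of the pd.read_html() route. It uses the attrs parameter to pick the table by the same class the question's code searches for. SAMPLE_HTML and its rows are illustrative stand-ins rather than the real Wikipedia page, and the class attribute has to match the string exactly, which may not hold on the live page.

```python
import io

import pandas as pd

# Illustrative markup: one unrelated table plus one with the target class.
SAMPLE_HTML = """
<table class="infobox"><tr><th>k</th></tr><tr><td>v</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th></tr>
  <tr><td>M1A</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td></tr>
</table>
"""

# attrs restricts the result to tables whose class attribute equals this string.
tables = pd.read_html(
    io.StringIO(SAMPLE_HTML), attrs={"class": "wikitable sortable"}
)
df = tables[0]
print(df)
```

For the real page you would pass the text downloaded with requests (wrapped in io.StringIO) in place of SAMPLE_HTML.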