Having trouble putting data into a pandas dataframe - python

I am new to coding, so take it easy on me! I recently started a pet project which scrapes data from a table and will create a CSV of the data for me. I believe I have successfully pulled the data, but trying to put it into a dataframe returns the error "Shape of passed values is (31719, 1), indices imply (31719, 23)". I have checked the lengths of my headers and my rows and those numbers are correct, but when I try to put it into a dataframe it appears that only one column is being pulled in. Again, I am very new to all of this but would appreciate any help! Code below:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2018&month=0&season1=2018&ind=0&page=1_1500'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# pulling table from HTML
Table1 = soup.find('table', id='LeaderBoard1_dg1_ctl00')

# finding and filling table columns
headers = []
for i in Table1.find_all('th'):
    title = i.text
    headers.append(title)

# finding and filling table rows
rows = []
for j in Table1.find_all('td'):
    data = j.text
    rows.append(data)

# filling dataframe
df = pd.DataFrame(rows, columns=headers)

# show dataframe
print(df)

You are creating a new dataframe with 23 columns, but rows is a one-dimensional list, so the shape of the passed values does not match the indices: you are passing (31719, 1) worth of data to a dataframe whose headers imply (31719, 23), which won't work.
If you want to create a dataframe with the data as you currently have it (a single column), you should just use:
df = pd.DataFrame(rows, columns=headers[1:2])
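If the goal is the full table rather than a single column, another option (not part of the original answer) is to collect the cells row by row, one list per tr tag, instead of flattening every td into one long list; the column labels would still have to be taken from a cleaned-up header row:

records = []
for tr in Table1.find_all('tr'):
    # one sub-list of cell texts per table row; header/pager rows with no <td> are skipped
    cells = [td.text for td in tr.find_all('td')]
    if cells:
        records.append(cells)
df = pd.DataFrame(records)
print(df.head())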

Alternatively, you can achieve your goal directly by using pandas.read_html, which handles the BeautifulSoup processing for you:
pd.read_html(url, attrs={'id':'LeaderBoard1_dg1_ctl00'}, header=[1])[0].iloc[:-1]
attrs={'id':'LeaderBoard1_dg1_ctl00'} selects the table by id
header=[1] adjusts the header, because there are multiple header rows
.iloc[:-1] removes the table footer with the pagination
Example
import pandas as pd

pd.read_html('https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2018&month=0&season1=2018&ind=0&page=1_1500',
             attrs={'id': 'LeaderBoard1_dg1_ctl00'},
             header=[1])[0].iloc[:-1]
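Since the end goal in the question is a CSV file, the read_html result can be written straight to disk with to_csv; the file name below is just an example:

import pandas as pd

url = 'https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2018&month=0&season1=2018&ind=0&page=1_1500'
df = pd.read_html(url, attrs={'id': 'LeaderBoard1_dg1_ctl00'}, header=[1])[0].iloc[:-1]
df.to_csv('fangraphs_2018.csv', index=False)  # write the scraped table to disk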

Related

web scraping a dataframe

I'm currently trying to web-scrape a table (a company's stock exchange data) from a website in order to build a new dataframe in Python from this data.
I've tried to scrape the rows of the table in order to store them in a CSV file and then use the method pandas.read_csv().
I'm running into trouble because the CSV file is not as good as I thought.
How can I get exactly the same dataframe in Python by web-scraping it?
Here's my code:
from bs4 import BeautifulSoup
import urllib.request as ur
import csv
import pandas as pd

url_danone = "https://www.boursorama.com/cours/1rPBN/"
our_url = ur.urlopen(url_danone)
soup = BeautifulSoup(our_url, 'html.parser')

with open('danone.csv', 'w') as filee:
    for ligne in soup.find_all("table", {"class": "c-table c-table--generic"}):
        row = ligne.find("tr", {"class": "c-table__row"}).get_text()
        writer = csv.writer(filee)
        writer.writerow(row)
The dataframe in the website
The csv file
You can use pd.read_html to read the required table:
import pandas as pd
url = "https://www.boursorama.com/cours/1rPBN/"
df = pd.read_html(url)[1].rename(columns={"Unnamed: 0": ""}).set_index("")
print(df)
df.to_csv("data.csv")
Prints and saves data.csv (screenshot from LibreOffice):
Please try this for loop instead:
rows = []
headers = []

# loop to get the values
for tr in soup.find_all("tr", {"class": "c-table__row"})[13:18]:
    row = [td.text.strip() for td in tr.select('td') if td.text.strip()]
    rows.append(row)

# get the header
for th in soup.find_all("th", {"class": "c-table__cell c-table__cell--head c-table__cell--dotted c-table__title / u-text-uppercase"}):
    head = th.text.strip()
    headers.append(head)
This gets your values and headers in the way you want. Note that, since the tables don't have ids or any unique identifiers, you need to properly establish which rows you want, considering all the tables on the page (see the [13:18] slice in the code above).
You can check your content by making a simple dataframe from the headers and rows, as below:
# build a dataframe from the scraped rows and headers
df = pd.DataFrame(rows, columns=headers)
print(df.head())
Hope this helps.

scrape data to store into pandas dataframe

I am trying to scrape the table "List of chemical elements" from this website https://en.wikipedia.org/wiki/List_of_chemical_elements
I want to then store the table data in a pandas dataframe so that I can convert it into a CSV file. So far I have scraped and stored the headers of the table in a dataframe. I also managed to retrieve each individual row of data from the table. However, I am having trouble storing the table data in the dataframe. Below is what I have so far:
from bs4 import BeautifulSoup
import requests as r
import pandas as pd

response = r.get('https://en.wikipedia.org/wiki/List_of_chemical_elements')
wiki_text = response.text
soup = BeautifulSoup(wiki_text, 'html.parser')

table = soup.select_one('table.wikitable')
table_body = table.find('tbody')
#print(table_body)

rows = table_body.find_all('tr')
cols = [c.text.replace('\n', '') for c in rows[1].find_all('th')]

df2a = pd.DataFrame(columns=cols)
df2a

for row in rows:
    records = row.find_all('td')
    if records != []:
        records = [r.text.strip() for r in records]
        print(records)
Here I have found all of the column-header data; it is split into two parts (a first and a second header row), so I pull them separately:
all_columns=soup.find_all("tr",attrs={"style":"vertical-align:top"})
first_column_data=[i.get_text(strip=True) for i in all_columns[0].find_all("th")]
second_column_data=[i.get_text(strip=True) for i in all_columns[1].find_all("th")]
Now, since we need 16 columns, take the appropriate cells from each part and add them to the new_lst list, which becomes the list of column names:
new_lst=[]
new_lst.extend(second_column_data[:3])
new_lst.extend(first_column_data[1:])
Now we have to find the row data: iterate through all tr tags with the anchor class, find the respective td tags in each, and append the resulting list of cell texts to main_lst:
main_lst = []
for i in soup.find_all("tr", attrs={"class": "anchor"}):
    row_data = [row.get_text(strip=True) for row in i.find_all("td")]
    main_lst.append(row_data)
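To go from these two lists to the frame shown in the output below, they can be combined in a single call — a small sketch not spelled out above, assuming each row in main_lst has exactly one cell per label in new_lst:

# build the final dataframe from the assembled header list and the collected rows
df = pd.DataFrame(main_lst, columns=new_lst)
print(df)
df.to_csv('chemical_elements.csv', index=False)  # file name is just an example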
Output:
Atomic numberZ Symbol Name Origin of name[2][3] Group Period Block Standardatomicweight[a] Density[b][c] Melting point[d] Boiling point[e] Specificheatcapacity[f] Electro­negativity[g] Abundancein Earth'scrust[h] Origin[i] Phase atr.t.[j]
0 1 H Hydrogen Greekelementshydro-and-gen, 'water-forming' 1 1 s-block 1.008 0.00008988 14.01 20.28 14.304 2.20 1400 primordial gas
....
Let pandas parse it for you:
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_chemical_elements')[0]
df.to_csv('file.csv', index=False)

How to retrieve and store 2nd and 3rd row elements in a dataframe

I am new to Pandas, web scraping, and BeautifulSoup in Python.
While learning to do some basic web scraping in Python using requests and BeautifulSoup, I got confused by the task of assigning the 2nd and 3rd elements of an HTML table to a pandas dataframe.
Suppose I have this table:
Here is my code so far:
import pandas as pd
from bs4 import BeautifulSoup
import requests

html_data = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_data, 'html.parser')

data = pd.DataFrame(columns=["Name", "Market Cap (US$ Billion)"])

# The index [3] gets to the third table on the webpage ("By market capitalization") and finds all of its rows
for row in soup.find_all('tbody')[3].find_all('tr'):
    col = row.find_all('td')  # the individual column values in a particular row
    for j, cell in enumerate(col):
        pass  # Further code here
As can be seen, I want to target the 2nd and 3rd column values of each row and append them to the empty dataframe, data, so that data contains the bank names and market cap values. How can I achieve that kind of functionality?
For tables I would suggest pandas:
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_largest_banks'
tables = pd.read_html(url)
df = tables[1]
If you prefer using BeautifulSoup, you can try this to accomplish the same:
url = 'https://en.wikipedia.org/wiki/List_of_largest_banks'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser').find_all('table')

table = soup[1]
table_rows = table.find_all('tr')
table_header = [th.text.strip() for th in table_rows[0].find_all('th')]

table_data = []
for row in table_rows[1:]:
    table_data.append([td.text.strip() for td in row.find_all('td')])

df = pd.DataFrame(table_data, columns=table_header)
When needed, you can set Rank as the index with df.set_index('Rank', inplace=True). The image below shows the unmodified dataframe.
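If, as in the question, only the bank names and market caps are needed, those two columns can be picked out of the parsed table by position — a small sketch on top of either approach above; the column positions are an assumption and should be checked against df.columns for the live page:

# keep only the 2nd and 3rd columns (bank name and market cap) and
# rename them to match the empty dataframe from the question
data = df.iloc[:, [1, 2]].copy()
data.columns = ["Name", "Market Cap (US$ Billion)"]
print(data.head())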

python pd.read_html gives error

I've been using pandas and requests to pull some tables to get NFL statistics. It's been going pretty well; I've been able to pull tables from other sites, until I tried to get the NFL combine table from this one particular site.
It gives me an error after df_list = pd.read_html(html).
The error I get is:
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('<U1')
Here's the code I've been using at other sites that worked really well.
import requests
import pandas as pd

df = pd.DataFrame()
url = 'http://nflcombineresults.com/nflcombinedata_expanded.php?year=1987&pos=&college='
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
I've read and seen a little bit about BeautifulSoup, but the simplicity of the pd.read_html() is just so nice and compact. So I don't know if there's a quick fix that I am not aware of, or if I need to indeed dive into BeautifulSoup to get these tables from 1987 - 2017.
This isn't shorter, but may be more robust:
import requests
import pandas as pd
from bs4 import BeautifulSoup
A convenience function:
def souptable(table):
    for row in table.find_all('tr'):
        yield [col.text for col in row.find_all('td')]
Return a DataFrame with data loaded for a given year:
def getyear(year):
    url = 'http://nflcombineresults.com/nflcombinedata_expanded.php?year=%d&pos=&college=' % year
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    data = list(souptable(soup.table))
    df = pd.DataFrame(data[1:], columns=data[0])
    df = df[pd.notnull(df['Name'])]
    return df.apply(pd.to_numeric, errors="ignore")
This function slices out the heading row when the DataFrame is created, uses the first row for column names, and filters out any rows with an empty Name value.
Finally, concatenate up as many years as you need into a single DataFrame:
dfs = pd.concat([getyear(year) for year in range(1987, 1990)])
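If the result should end up on disk, the combined frame can then be written out in one go (the file name is just an example):

dfs.to_csv('combine_1987_1989.csv', index=False)  # one file covering the concatenated seasons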
OK, after doing some more research, it looks like the issue is that the last row is one merged cell, and that's possibly where the problem is coming from. So I did go into using BeautifulSoup to pull the data. Here is my solution:
import requests
import pandas as pd
from bs4 import BeautifulSoup
I wanted to pull data for each year from 1987 to 2017:
seasons = list(range(1987, 2018))
df = pd.DataFrame()
temp_df = pd.DataFrame()
The loop runs through each year, appending each cell as a new row. Then, knowing the last cell is a "blank", I eliminate that last row by redefining the dataframe as df[:-1] before the loop appends the next year's data.
for i in seasons:
    df = df[:-1]
    url = 'http://nflcombineresults.com/nflcombinedata_expanded.php?year=%s&pos=&college=' % (i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    for tr in soup.table.find_all('tr'):
        row = [td.text for td in tr.find_all('td')]
        temp_df = row
        df = df.append(temp_df, ignore_index=True)
Finally, since there is no next year to append, I need to eliminate the last row. Then I reshape the dataframe into 16 columns, rename the columns from the first row, and eliminate the repeated header rows within the dataframe.
df = df[:-1]
df = (pd.DataFrame(df.values.reshape(-1, 16)))
df.columns = df.iloc[0]
df = df[df.Name != 'Name']
I'm still learning python, so any input, advice, any respectful constructive criticism is always welcome. Maybe there is a better, more appropriate solution?

Creating a DataFrame using .loc to set with enlargement

I'm trying to create a Pandas DataFrame by iterating through data in a soup (from BeautifulSoup4). This SO post suggested using the .loc method to set with enlargement to create a DataFrame.
However, this method takes a long time to run (around 8 minutes for a df of 30,000 rows and 5 columns). Is there any quicker way of doing this? Here's my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://api.turfgame.com/v3/zones"
r = requests.get(url)
soup = BeautifulSoup(r.content)

col_names = ["name", "lat", "lng", "points_take", "points_hold"]
dfi = pd.DataFrame(columns=col_names)

def get_all_zones():
    for attr in soup.find_all("zone"):
        col_values = [attr.get("name"), attr.get("lat"), attr.get("lng"), attr.get("points_take"), attr.get("points_hold")]
        pos = len(dfi.index)
        dfi.loc[pos] = col_values
    return dfi

get_all_zones()
Avoid
df.loc[pos] = row
whenever possible. Pandas NDFrames store the underlying data in blocks (of common dtype) which (for DataFrames) are associated with columns. DataFrames are column-based data structures, not row-based data structures.
To access a row, the DataFrame must access each block, pick out the appropriate row and copy the data into a new Series.
Adding a row to an existing DataFrame is also slow, since a new row must be appended to each block, and new data copied into the new row. Even worse, the data block has to be contiguous in memory. So adding a new row may force Pandas (or NumPy) to allocate a whole new array for the block, and all the data for that block has to be copied into a larger array just to accommodate that one row. All that copying makes things very slow, so avoid it if possible.
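As a rough, self-contained illustration of the gap (not part of the original answer; exact timings depend on your machine and pandas version), here is the same 10,000-row build done both ways:

import time
import pandas as pd

cols = ["a", "b", "c"]

# slow: grow the DataFrame one row at a time via .loc setting-with-enlargement
start = time.time()
df_slow = pd.DataFrame(columns=cols)
for i in range(10000):
    df_slow.loc[len(df_slow.index)] = [i, i * 2, i * 3]
print("loc enlargement:", round(time.time() - start, 2), "seconds")

# fast: collect the rows in a plain Python list, build the DataFrame once
start = time.time()
rows = [[i, i * 2, i * 3] for i in range(10000)]
df_fast = pd.DataFrame(rows, columns=cols)
print("list then DataFrame:", round(time.time() - start, 2), "seconds")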
The solution in this case is to append the data to a Python list and create the DataFrame in one fell swoop at the end:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "http://api.turfgame.com/v3/zones"
r = requests.get(url)
soup = BeautifulSoup(r.content)
col_names = ["name", "lat", "lng", "points_take", "points_hold"]
data = []
def get_all_zones():
    for attr in soup.find_all("zone"):
        col_values = [attr.get("name"), attr.get("lat"), attr.get("lng"), attr.get("points_take"), attr.get("points_hold")]
        data.append(col_values)
    dfi = pd.DataFrame(data, columns=col_names)
    return dfi
dfi = get_all_zones()
print(dfi)
