How to extract a table using requests/pandas in Python

I tried the code below for extracting product name, year, and values from a table, but somewhere I am getting issues.
my code:
import requests
import pandas as pd
import pymysql
try:
    df = []
    dates1 = []
    try:
        url = 'http://cpmaindia.com/fiscal_custom_duty.php'
        html = requests.get(url).content
        tab_list = pd.read_html(html)
        tab = tab_list[0]
        tab.apply(lambda x: x.tolist(), axis=1)
        tab = tab.values.tolist()
        print(tab)
    except Exception as e:
        raise e
except Exception as e:
    raise e
This is what I tried, but I am not getting the desired output. I want to parse the table only.
Thanks

tab_list[0] produces the following:
print (tab)
0
0 <!-- function MM_swapImgRestore() { //v3.0 va...
1 Custom Duty Import Duty on Petrochemicals (%)...
2 <!-- body { \tmargin-left: 0px; \tmargin-top: ...
Did you mean to grab tab_list[8]?
Also, if you're using pandas to read in the table from html, there is no need to use requests:
import pandas as pd
url = 'http://cpmaindia.com/fiscal_custom_duty.php'
tab_list = pd.read_html(url)
table = tab_list[8]
table.columns = table.iloc[0,:]
table = table.iloc[1:,2:-1]
Output:
print (table)
0 Import Duty on Petrochemicals (%) ... Import Duty on Petrochemicals (%)
1 Product / Year - ... 16/17
2 Naphtha ... 5
3 Ethylene ... 2.5
4 Propylene ... 2.5
5 Butadiene ... 2.5
6 Benzene ... 2.5
7 Toluene ... 2.5
8 Mixed Xylene ... 2.5
9 Para Xylene ... 0
10 Ortho Xylene ... 0
11 LDPE ... 7.5
12 LLDPE ... 7.5
13 HDPE ... 7.5
14 PP ... 7.5
15 PVC ... 7.5
16 PS ... 7.5
17 EDC ... 2
18 VCM ... 2
19 Styrene ... 2
20 SBR ... 10
21 PBR ... 10
22 MEG ... 5
23 DMT ... 5
24 PTA ... 5
25 ACN ... 5
[25 rows x 7 columns]
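Since the original question asks for product, year, and value triples, here is a hedged follow-up that continues from the table produced above: promote the "Product / Year" row to the header and reshape from wide to long with melt. The labels used below ("Product / Year", the year columns) are read off the printed output and are assumptions about the page's current layout.
# Continuing from `table` above (tab_list[8] after the iloc slicing).
table.columns = table.iloc[0]                  # promote the "Product / Year" row to the header
table = table.iloc[1:].reset_index(drop=True)  # drop that header row from the body
table = table.rename(columns={'Product / Year': 'Product'})
# Reshape from wide (one column per fiscal year) to long (product, year, value).
long_form = table.melt(id_vars='Product', var_name='Year', value_name='Duty (%)')
print(long_form.head())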

Related

Binance Delivery Futures Historical Prices

I'm trying to get from binance the historical futures prices of the futures contract that expires on the 31st of December 2021.
I have figured this out for perps but am struggling with a futures contract with a delivery date. The code for the perps is below
df = pd.DataFrame(client.futures_historical_klines(
    symbol='BTCUSDT',
    interval='1d',
    start_str='2021-06-01',
    end_str='2021-06-30'
))
I assumed that replacing the symbol with BTCUSD_211231 or BTCUSDT_211231 would have done the trick, but unfortunately I get the below error message:
BinanceAPIException: APIError(code=-1121): Invalid symbol.
Any help is much appreciated!
Thanks
According to the Binance documentation, you can set contractType to select the desired contract.
The following options for contractType are available:
PERPETUAL
CURRENT_MONTH
NEXT_MONTH
CURRENT_QUARTER
NEXT_QUARTER
The following code works for me:
import binance
import pandas as pd
client = binance.Client()
r = client.futures_continous_klines(
    pair='BTCUSDT',
    contractType='CURRENT_QUARTER',
    interval='1d',
)
df = pd.DataFrame(r)
print(df)
Output:
0 1 2 3 4 5 6 7 8 9 10 11
0 1612310400000 36054.1 39000.0 31111.0 38664.1 688.738 1612396799999 21632849.1822 33908 336.944 10440572.5057 0
1 1612396800000 38664.1 43820.2 37445.4 38328.3 757.761 1612483199999 29584739.5781 16925 387.058 15156395.7362 0
2 1612483200000 38304.6 39955.4 37858.3 39848.1 383.410 1612569599999 14995639.3696 5563 183.214 7170636.7752 0
3 1612569600000 39876.5 42727.3 39437.6 41245.1 453.336 1612655999999 18775322.9609 8898 225.566 9347858.6798 0
4 1612656000000 41240.3 41642.5 38639.2 40592.0 428.693 1612742399999 17269756.5859 7553 202.850 8175165.3435 0
.. ... ... ... ... ... ... ... ... ... ... ... ..
242 1633219200000 48654.6 50479.1 48100.0 49325.1 2222.058 1633305599999 109305621.1491 27633 1106.026 54396950.3347 0
243 1633305600000 49325.1 50738.3 47961.5 50461.2 1808.321 1633391999999 89174286.5122 28367 925.258 45643475.0561 0
244 1633392000000 50491.5 53151.3 50245.3 52764.5 1860.870 1633478399999 95741544.2087 29105 921.897 47452449.1528 0
245 1633478400000 52769.4 57442.1 51606.6 56710.1 2431.580 1633564799999 132849081.0296 38013 1225.920 67014360.6873 0
246 1633564800000 56723.3 56739.9 54769.0 55645.2 1188.181 1633651199999 66322580.2665 21176 570.967 31878146.7653 0
[247 rows x 12 columns]
The meaning of each column in the above dataframe is documented in the Binance documentation (see right side, under "Response").
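If you want readable column labels instead of integer positions, you can assign them yourself. The names below follow the field order described in the Binance klines documentation; they are my reading of those docs, so verify them against the current documentation before relying on them.
# Assumed column order, taken from the Binance continuous-klines docs.
df.columns = [
    'open_time', 'open', 'high', 'low', 'close', 'volume',
    'close_time', 'quote_asset_volume', 'number_of_trades',
    'taker_buy_volume', 'taker_buy_quote_volume', 'ignore',
]
# Timestamps come back as milliseconds since the epoch.
df['open_time'] = pd.to_datetime(df['open_time'], unit='ms')
df['close_time'] = pd.to_datetime(df['close_time'], unit='ms')
print(df.head())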

How to scrape a non-tabled list from Wikipedia and create a dataframe?

en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul
In the link above there is untabulated data for the Istanbul neighbourhoods.
I want to fetch these neighbourhoods into a data frame with this code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl="https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response=requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList=soup.findAll('a',{'class':"new"})
neighborhoods=[]
for item in tocList:
    text = item.get_text()
    neighborhoods.append(text)
df = pd.DataFrame(neighborhoods, columns=['Neighborhood'])
print(df)
and I got this output:
Neighborhood
0 Maden
1 Nizam
2 Anadolu
3 Arnavutköy İmrahor
4 Arnavutköy İslambey
... ...
705 Seyitnizam
706 Sümer
707 Telsiz
708 Veliefendi
709 Yeşiltepe
710 rows × 1 columns
but some data are not fetched, check the data below and compare to the output:
Adalar
Burgazada
Heybeliada
Kınalıada
Maden
Nizam
findAll() is not fetching the neighbourhoods that appear as plain list items or ordinary links rather than <a> tags with the "new" class, e.g.:
<ol><li>Burgazada</li>
<li>Heybeliada</li>
Also, can the code be developed into 2 columns, 'Neighborhood' and its 'District'?
Are you trying to fetch this list from the Table of Contents?
Please check if this solves your problem:
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl="https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response=requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList=soup.findAll('span',{'class':"toctext"})
districts=[]
blocked_words = ['Neighbourhoods by districts','Further reading', 'External links']
for item in tocList:
    text = item.get_text()
    if text not in blocked_words:
        districts.append(text)
df = pd.DataFrame(districts, columns=['districts'])
print(df)
Output:
districts
0 Adalar
1 Arnavutköy
2 Ataşehir
3 Avcılar
4 Bağcılar
5 Bahçelievler
6 Bakırköy
7 Başakşehir
8 Bayrampaşa
9 Beşiktaş
10 Beykoz
11 Beylikdüzü
12 Beyoğlu
13 Büyükçekmece
14 Çatalca
15 Çekmeköy
16 Esenler
17 Esenyurt
18 Eyüp
19 Fatih
20 Gaziosmanpaşa
21 Güngören
22 Kadıköy
23 Kağıthane
24 Kartal
25 Küçükçekmece
26 Maltepe
27 Pendik
28 Sancaktepe
29 Sarıyer
30 Silivri
31 Sultanbeyli
32 Sultangazi
33 Şile
34 Şişli
35 Tuzla
36 Ümraniye
37 Üsküdar
38 Zeytinburnu
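To also get the two columns asked for ('Neighborhood' and its 'District'), here is a rough sketch. It assumes each district appears as a section heading (with the usual mw-headline span) and that the neighbourhood names sit in list items between that heading and the next one, as the article was structured when the question was asked; Wikipedia's markup changes over time, so treat the selectors as assumptions.
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
soup = BeautifulSoup(requests.get(wikiurl).text, 'html.parser')
# Headings that are not districts; extend this set as needed.
skip = {'Contents', 'See also', 'References', 'Further reading', 'External links',
        'Neighbourhoods by districts'}
rows = []
for heading in soup.find_all(['h2', 'h3']):
    headline = heading.find('span', {'class': 'mw-headline'})
    if headline is None:
        continue
    district = headline.get_text()
    if district in skip:
        continue
    # Walk the elements following the heading until the next heading,
    # collecting every list item as a neighbourhood of this district.
    for sibling in heading.find_next_siblings():
        if sibling.name in ('h2', 'h3'):
            break
        for li in sibling.find_all('li'):
            rows.append({'District': district,
                         'Neighborhood': li.get_text(strip=True)})
df = pd.DataFrame(rows)
print(df)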

Permute the last column only using python

I have tried
import numpy as np
import pandas as pd
df1 = pd.read_csv('Trans_ZS_Control_64')
df = df1.apply(np.random.permutation)
This permuted the entire data, but I want to permute the values of the last column only, up to 100 times.
How do I proceed with this?
Input Data
0.051424 0.535067 0.453645 0.161857 -0.017189 -0.001850 0.481861 0.711553 0.083747 0.583215 ... 0.541249 0.048360 0.370659 0.890987 0.723995 -0.014502 1.295998 0.150719 0.885673 1
-0.067129 0.673519 0.212407 0.195590 -0.034868 -0.231418 0.480255 0.643735 -0.054970 0.511684 ... 0.524751 0.206757 0.578314 0.614924 0.230632 -0.074980 0.747007 0.047382 1.413796 1
-0.994564 -0.881392 -1.150127 -0.589125 -0.663275 -0.955622 -1.088923 -1.210452 -0.922861 -0.689851 ... -0.442188 -1.294110 -0.934985 -1.085506 -0.808874 -0.779111 -1.032484 -1.026208 -0.248476 1
-0.856323 -0.619472 -1.113073 -0.691285 -0.515566 -1.080643 -0.513487 -0.912825 -1.010245 -0.870335 ... -0.941149 -1.012917 -1.647812 -0.654150 -0.735166 -0.984510 -0.949168 -1.052115 -0.052492 1
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
-0.145871 0.832727 -0.003379 0.327546 1.409891 0.840316 0.700613 0.184477 0.962488 0.200397 ... -0.337530 0.988197 0.751663 0.480126 0.663302 -0.522189 0.512744 -0.063515 1.125415 0
0.972923 0.857971 -0.195672 0.190443 1.652155 0.763571 0.604728 0.115846 0.942269 0.453387 ... -0.522834 0.985770 0.570573 0.438632 0.737030 -0.445704 0.387023 0.031686 1.266407 0
0.281427 1.060266 0.172624 0.258344 1.544505 0.859777 0.689876 0.439106 0.955198 0.335523 ... -0.442724 0.929343 0.707809 0.290670 0.688595 -0.438848 0.762695 -0.105879 0.944989 0
0.096601 1.112720 0.105861 -0.133927 1.526764 0.773759 0.661673 -0.007070 0.884725 0.478899 ... -0.404426 0.966646 0.994733 0.418965 0.862612 -0.174580 0.407309 -0.010520 1.044876 0
-0.298780 1.036580 0.131270 0.019826 1.381928 0.879310 0.619529 -0.022691 0.982060 -0.039355 ... -0.702316 0.985320 0.457767 0.215949 0.752685 -0.405060 0.166226 -0.216972 1.021018 0
Expected output: randomly permute the last column
0.051424 0.535067 0.453645 0.161857 -0.017189 -0.001850 0.481861 0.711553 0.083747 0.583215 ... 0.541249 0.048360 0.370659 0.890987 0.723995 -0.014502 1.295998 0.150719 0.885673 0
-0.067129 0.673519 0.212407 0.195590 -0.034868 -0.231418 0.480255 0.643735 -0.054970 0.511684 ... 0.524751 0.206757 0.578314 0.614924 0.230632 -0.074980 0.747007 0.047382 1.413796 0
-0.994564 -0.881392 -1.150127 -0.589125 -0.663275 -0.955622 -1.088923 -1.210452 -0.922861 -0.689851 ... -0.442188 -1.294110 -0.934985 -1.085506 -0.808874 -0.779111 -1.032484 -1.026208 -0.248476 1
-0.856323 -0.619472 -1.113073 -0.691285 -0.515566 -1.080643 -0.513487 -0.912825 -1.010245 -0.870335 ... -0.941149 -1.012917 -1.647812 -0.654150 -0.735166 -0.984510 -0.949168 -1.052115 -0.052492 1
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
-0.145871 0.832727 -0.003379 0.327546 1.409891 0.840316 0.700613 0.184477 0.962488 0.200397 ... -0.337530 0.988197 0.751663 0.480126 0.663302 -0.522189 0.512744 -0.063515 1.125415 0
0.972923 0.857971 -0.195672 0.190443 1.652155 0.763571 0.604728 0.115846 0.942269 0.453387 ... -0.522834 0.985770 0.570573 0.438632 0.737030 -0.445704 0.387023 0.031686 1.266407 1
0.281427 1.060266 0.172624 0.258344 1.544505 0.859777 0.689876 0.439106 0.955198 0.335523 ... -0.442724 0.929343 0.707809 0.290670 0.688595 -0.438848 0.762695 -0.105879 0.944989 0
0.096601 1.112720 0.105861 -0.133927 1.526764 0.773759 0.661673 -0.007070 0.884725 0.478899 ... -0.404426 0.966646 0.994733 0.418965 0.862612 -0.174580 0.407309 -0.010520 1.044876 0
-0.298780 1.036580 0.131270 0.019826 1.381928 0.879310 0.619529 -0.022691 0.982060 -0.039355 ... -0.702316 0.985320 0.457767 0.215949 0.752685 -0.405060 0.166226 -0.216972 1.021018 1
Not sure if this is what you meant, but you could do it like this:
import pandas as pd
import numpy as np
v = np.arange(0,10)
df = pd.DataFrame({'c1': v, 'c2': v, 'c3': v})
df
this would create the following df:
c1 c2 c3
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
to permute the last column you could run this:
df1 = df.copy()
df1.c3 = np.random.permutation(df1.c3)
df1
resulting in:
c1 c2 c3
0 0 0 5
1 1 1 9
2 2 2 2
3 3 3 6
4 4 4 0
5 5 5 4
6 6 6 8
7 7 7 7
8 8 8 1
9 9 9 3
I hope it helps
Just create a dataframe from your last column and permute that. It seems like permuting individual columns with apply doesn't work the way you expect it to.
import numpy as np
import pandas as pd
df = pd.read_csv('Trans_ZS_Control_64')
column_to_change = pd.DataFrame(df['last_column_name'])
for i in range(100):
    column_to_change = column_to_change.apply(np.random.permutation)
df['last_column_name'] = column_to_change
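If the last column's name is not known in advance, it can be selected by position instead. A minimal sketch, assuming "permute up to 100 times" means producing 100 independently shuffled copies of the data ('Trans_ZS_Control_64' is the file from the question):
import numpy as np
import pandas as pd
df = pd.read_csv('Trans_ZS_Control_64')
last_col = df.columns[-1]  # select the last column by position
permuted_copies = []
for _ in range(100):
    copy = df.copy()
    # Shuffle only the last column; every other column stays in place.
    copy[last_col] = np.random.permutation(copy[last_col].values)
    permuted_copies.append(copy)
print(permuted_copies[0].head())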

Beautiful soup, how to scrape multiple urls and save them in a csv file

So I am wondering how to scrape multiple websites/urls and save the data to a csv file. I can only save the first page right now. I have tried many different ways, but it doesn't seem to work. How can I save 5 pages in a csv file and not only one?
import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd
import re
from datetime import timedelta
import datetime
import time
urls = ['https://store.steampowered.com/search/?specials=1&page=1', 'https://store.steampowered.com/search/?specials=1&page=2', 'https://store.steampowered.com/search/?specials=1&page=3', 'https://store.steampowered.com/search/?specials=1&page=4','https://store.steampowered.com/search/?specials=1&page=5']
for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html, 'html.parser')
    data = []
    ts = time.time()
    st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
    for container in soup.find_all('div', attrs={'class': 'responsive_search_name_combined'}):
        title = container.find('span', attrs={'class': 'title'}).text
        if container.find('span', attrs={'class': 'win'}):
            win = '1'
        else:
            win = '0'
        if container.find('span', attrs={'class': 'mac'}):
            mac = '1'
        else:
            mac = '0'
        if container.find('span', attrs={'class': 'linux'}):
            linux = '1'
        else:
            linux = '0'
        data.append({
            'Title': title.encode('utf-8'),
            'Time': st,
            'Win': win,
            'Mac': mac,
            'Linux': linux})

with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
    fields = ['Title', 'Win', 'Mac', 'Linux', 'Time']
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(data)
testing = pd.read_csv('data.csv')
heading = testing.head(100)
discription = testing.describe()
print(heading)
The issue is that you are re-initializing your data on each url, and only writing it after the very last iteration, so you always end up with just the data from the last url. You need that data to keep appending rather than being overwritten on each iteration:
import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd
import re
from datetime import timedelta
import datetime
import time
urls = ['https://store.steampowered.com/search/?specials=1&page=1', 'https://store.steampowered.com/search/?specials=1&page=2', 'https://store.steampowered.com/search/?specials=1&page=3', 'https://store.steampowered.com/search/?specials=1&page=4','https://store.steampowered.com/search/?specials=1&page=5']
results_df = pd.DataFrame() #<-- initialize a results dataframe to dump/store the data you collect after each iteration
for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html, 'html.parser')
    data = []  # <-- your data list is "reset" after each iteration of your urls
    ts = time.time()
    st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
    for container in soup.find_all('div', attrs={'class': 'responsive_search_name_combined'}):
        title = container.find('span', attrs={'class': 'title'}).text
        if container.find('span', attrs={'class': 'win'}):
            win = '1'
        else:
            win = '0'
        if container.find('span', attrs={'class': 'mac'}):
            mac = '1'
        else:
            mac = '0'
        if container.find('span', attrs={'class': 'linux'}):
            linux = '1'
        else:
            linux = '0'
        data.append({
            'Title': title,
            'Time': st,
            'Win': win,
            'Mac': mac,
            'Linux': linux})
    temp_df = pd.DataFrame(data)  # <-- temporary storing the data in a dataframe
    results_df = results_df.append(temp_df).reset_index(drop=True)  # <-- dumping that data into a results dataframe

results_df.to_csv('data.csv', index=False)  # <-- writing the results dataframe to csv
testing = pd.read_csv('data.csv')
heading = testing.head(100)
discription = testing.describe()
print(heading)
Output:
print (results_df)
Linux Mac ... Title Win
0 0 0 ... Tom Clancy's Rainbow Six® Siege 1
1 0 0 ... Tom Clancy's Rainbow Six® Siege 1
2 1 1 ... Total War: WARHAMMER II 1
3 0 0 ... Tom Clancy's Rainbow Six® Siege 1
4 1 1 ... Total War: WARHAMMER II 1
5 0 1 ... Frostpunk 1
6 0 0 ... Tom Clancy's Rainbow Six® Siege 1
7 1 1 ... Total War: WARHAMMER II 1
8 0 1 ... Frostpunk 1
9 1 1 ... Two Point Hospital 1
10 0 0 ... Tom Clancy's Rainbow Six® Siege 1
11 1 1 ... Total War: WARHAMMER II 1
12 0 1 ... Frostpunk 1
13 1 1 ... Two Point Hospital 1
14 0 0 ... Black Desert Online 1
15 0 0 ... Tom Clancy's Rainbow Six® Siege 1
16 1 1 ... Total War: WARHAMMER II 1
17 0 1 ... Frostpunk 1
18 1 1 ... Two Point Hospital 1
19 0 0 ... Black Desert Online 1
20 1 1 ... Kerbal Space Program 1
21 0 0 ... Tom Clancy's Rainbow Six® Siege 1
22 1 1 ... Total War: WARHAMMER II 1
23 0 1 ... Frostpunk 1
24 1 1 ... Two Point Hospital 1
25 0 0 ... Black Desert Online 1
26 1 1 ... Kerbal Space Program 1
27 1 1 ... BioShock Infinite 1
28 0 0 ... Tom Clancy's Rainbow Six® Siege 1
29 1 1 ... Total War: WARHAMMER II 1
... .. ... ... ..
1595 0 0 ... VEGAS Pro 14 Edit Steam Edition 1
1596 0 0 ... ABZU 1
1597 0 0 ... Sacred 2 Gold 1
1598 0 0 ... Sakura Bundle 1
1599 1 1 ... Distance 1
1600 0 0 ... LEGO® Batman™: The Videogame 1
1601 0 0 ... Sonic Forces 1
1602 0 0 ... The Stronghold Collection 1
1603 0 0 ... Miscreated 1
1604 0 0 ... Batman™: Arkham VR 1
1605 1 1 ... Shadowrun Returns 1
1606 0 0 ... Upgrade to VEGAS Pro 16 Edit 1
1607 0 0 ... Girl Hunter VS Zombie Bundle 1
1608 0 1 ... Football Manager 2019 Touch 1
1609 0 1 ... Total War: NAPOLEON - Definitive Edition 1
1610 1 1 ... SteamWorld Dig 2 1
1611 0 0 ... Condemned: Criminal Origins 1
1612 0 0 ... Company of Heroes 1
1613 0 0 ... LEGO® Batman™ 2: DC Super Heroes 1
1614 1 1 ... Euro Truck Simulator 2 Map Booster 1
1615 0 0 ... Sonic Adventure DX 1
1616 0 0 ... Worms Armageddon 1
1617 1 1 ... Unforeseen Incidents 1
1618 0 0 ... Warhammer 40,000: Space Marine Collection 1
1619 0 0 ... VEGAS Pro 14 Edit Steam Edition 1
1620 0 0 ... ABZU 1
1621 0 0 ... Sacred 2 Gold 1
1622 0 0 ... Sakura Bundle 1
1623 1 1 ... Distance 1
1624 0 0 ... Worms Revolution 1
[1625 rows x 5 columns]
So I was apparently very blind to my own code; that can happen when you stare at it all day. All I actually had to do was move the "data = []" above the for loop so it wouldn't reset every time.
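For completeness, a minimal sketch of that simpler fix, keeping the original csv.DictWriter approach: initialize data once before the loop and write the file once after it.
import requests
import csv
import datetime
import time
from bs4 import BeautifulSoup
urls = ['https://store.steampowered.com/search/?specials=1&page=%d' % page
        for page in range(1, 6)]
data = []  # initialized once, before the loop, so rows from every page accumulate
for url in urls:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    st = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
    for container in soup.find_all('div', attrs={'class': 'responsive_search_name_combined'}):
        data.append({
            'Title': container.find('span', attrs={'class': 'title'}).text,
            'Time': st,
            'Win': '1' if container.find('span', attrs={'class': 'win'}) else '0',
            'Mac': '1' if container.find('span', attrs={'class': 'mac'}) else '0',
            'Linux': '1' if container.find('span', attrs={'class': 'linux'}) else '0',
        })
# Write the CSV once, after all pages have been scraped.
with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['Title', 'Win', 'Mac', 'Linux', 'Time'])
    writer.writeheader()
    writer.writerows(data)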

Turn an HTML table into a CSV file

How do I turn a table like this--batting gamelogs table--into a CSV file using Python and BeautifulSoup?
I want the first header where it says Rk, Gcar, Gtm, etc. and not any of the other headers within the table (the ones for each month of the season).
Here is the code I have so far:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv

def stir_the_soup():
    player_links = open('player_links.txt', 'r')
    player_ID_nums = open('player_ID_nums.txt', 'r')
    id_nums = [x.rstrip('\n') for x in player_ID_nums]
    idx = 0
    for url in player_links:
        print url
        soup = BeautifulSoup(urlopen(url), "lxml")
        p_type = ""
        if url[-12] == 'p':
            p_type = "pitching"
        elif url[-12] == 'b':
            p_type = "batting"
        table = soup.find(lambda tag: tag.name == 'table' and tag.has_attr('id') and tag['id'] == (p_type + "_gamelogs"))
        header = [[val.text.encode('utf8') for val in table.find_all('thead')]]
        rows = []
        for row in table.find_all('tr'):
            rows.append([val.text.encode('utf8') for val in row.find_all('th')])
            rows.append([val.text.encode('utf8') for val in row.find_all('td')])
        with open("%s.csv" % id_nums[idx], 'wb') as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(row for row in rows if row)
        idx += 1
    player_links.close()

if __name__ == "__main__":
    stir_the_soup()
The id_nums list contains all of the id numbers for each player to use as the names for the separate CSV files.
For each row, the leftmost cell is a <th> tag and the rest of the row is <td> tags. In addition to the header, how do I put those into one row?
This code gets you the big table of stats, which is what I think you want.
Make sure you have lxml, beautifulsoup4 and pandas installed.
import pandas as pd
df = pd.read_html(r'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010')
print(df[4])
Here is the output of first 5 rows. You may need to clean it slightly as I don't know what your exact endgoal is:
df[4].head(5)
Rk Gcar Gtm Date Tm Unnamed: 5 Opp Rslt Inngs PA ... CS BA OBP SLG OPS BOP aLI WPA RE24 Pos
0 1 66 2 (1) Apr 6 ARI NaN SDP L,3-6 7-8 1 ... 0 1.000 1.000 1.000 2.000 9 .94 0.041 0.51 PH
1 2 67 3 Apr 7 ARI NaN SDP W,5-3 7-8 1 ... 0 .500 .500 .500 1.000 9 1.16 -0.062 -0.79 PH
2 3 68 4 Apr 9 ARI NaN PIT W,9-1 8-GF 1 ... 0 .667 .667 .667 1.333 2 .00 0.000 0.13 PH SS
3 4 69 5 Apr 10 ARI NaN PIT L,3-6 CG 4 ... 0 .500 .429 .500 .929 2 1.30 -0.040 -0.37 SS
4 5 70 7 (1) Apr 13 ARI # LAD L,5-9 6-6 1 ... 0 .429 .375 .429 .804 9 1.52 -0.034 -0.46 PH
To select certain columns within this DataFrame: df[4]['COLUMN_NAME_HERE'].head(5)
Example: df[4]['Gcar']
Also, if typing df[4] gets annoying, you can always assign it to another dataframe: df2 = df[4]
import pandas as pd
from bs4 import BeautifulSoup
import urllib2
url = 'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010'
html=urllib2.urlopen(url)
bs = BeautifulSoup(html,'lxml')
table = str(bs.find('table',{'id':'batting_gamelogs'}))
dfs = pd.read_html(table)
This uses Pandas, which is pretty useful for stuff like this. It also puts it in a pretty reasonable format to do other operations on.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html
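Since the end goal is a CSV file, a short hedged follow-up to either answer above: baseball-reference gamelog tables repeat the header row inside the body (once per month), so those repeats can be dropped before writing with to_csv. The table index and the 'Rk' column name are taken from the first answer's output and may differ for other pages.
import pandas as pd
url = 'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010'
tables = pd.read_html(url)
gamelog = tables[4]  # index 4 per the answer above; may vary by page
# Drop the header rows that the site repeats inside the table body.
gamelog = gamelog[gamelog['Rk'] != 'Rk']
gamelog.to_csv('abreuto01_2010_batting.csv', index=False)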
