BeautifulSoup scraping incorrect table - python

I was scraping this site with the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.pro-football-reference.com/teams/buf/2021_injuries.htm"
r = requests.get(url)
stats_page = BeautifulSoup(r.content, features="lxml")
table = stats_page.findAll('table')[0] #get FIRST table on page
for player in table.findAll("tr"):
    print([i.getText() for i in player.findAll("td")])
The output is:
[]
['', 'IR', 'IR', 'IR', 'IR', 'IR', 'IR', 'IR']
['', 'Q', '', '', '', '', '', '']
['', '', '', '', 'Q', '', '', '']
['', '', '', 'O', '', '', '', 'IR']
['', '', 'Q', '', '', '', '', '']
['', '', '', 'Q', '', '', '', '']
['', '', '', '', 'Q', '', '', '']
['O', 'Q', '', '', '', '', '', '']
['', '', '', '', 'Q', '', '', '']
['', 'Q', '', 'Q', '', '', '', '']
['', '', '', 'O', '', '', '', '']
['Q', '', '', '', '', '', '', '']
['', 'IR', 'IR', 'IR', 'IR', 'IR', 'IR', 'IR']
['', '', 'Q', '', '', '', '', '']
['', 'IR', 'IR', 'IR', 'IR', '', '', '']
This is clearly the output I would expect from the 2nd table on the page, "Team Injuries", rather than the 1st table on the page, "Week 10 injury report". Any idea why BeautifulSoup is seemingly ignoring the first table on the page?

The table you want is inside an HTML comment, so BeautifulSoup does not parse its contents as HTML.
You will need to first locate this comment containing the table and then parse the HTML inside that separately. For example:
import requests
from bs4 import BeautifulSoup, Comment
url = "https://www.pro-football-reference.com/teams/buf/2021_injuries.htm"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if '<table' in comment:
        soup_table = BeautifulSoup(comment, "lxml")
        table = soup_table.findAll('table')[0]  # get FIRST table on page
        for player in table.findAll("tr"):
            print([i.getText() for i in player.findAll("td")])
        break
This would display your output as:
[]
['DE', '', 'Injured Reserve', '']
['OG', '', 'Injured Reserve', '']
['WR', '', 'Injured Reserve', '']
['DE', 'DNP', '', 'Rest']
['WR', 'DNP', '', 'Rest']
['T', 'Limited', '', 'Back']
['ILB', 'DNP', '', 'Hamstring']
['CB', 'Limited', '', 'Hamstring']
['CB', 'Limited', '', 'Concussion']
['TE', 'Limited', '', 'Hand']
['RB', 'DNP', '', 'Concussion']
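Here is a minimal, self-contained sketch of the same comment-extraction technique, using made-up markup instead of the live page (the real page's structure may differ); it uses the stdlib `html.parser` backend so `lxml` isn't required:

```python
from bs4 import BeautifulSoup, Comment

# Made-up markup mimicking the key trait of the real page:
# the table is hidden inside an HTML comment.
html = """
<div>
<!--
<table>
  <tr><th>Player</th><th>Status</th></tr>
  <tr><td>A. Example</td><td>Q</td></tr>
</table>
-->
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows_out = []
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if "<table" in comment:
        # Parse the comment's text as a fresh document to reveal the table.
        inner = BeautifulSoup(comment, "html.parser")
        for row in inner.find("table").find_all("tr"):
            rows_out.append([cell.get_text() for cell in row.find_all("td")])

print(rows_out)  # [[], ['A. Example', 'Q']]
```

The first row is empty because the header row contains only th cells, matching the leading `[]` in the outputs above.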

Related

Iterating over a list with a while statement and values not being excluded with list.remove

I'm running some code to clean a database. Basically, if a value appears in another list, it should be removed.
Below you can see the code:
pattern = re.compile(r"((?:\d{10}|\d{9}|\d{8}|\d{7}|\d{6}|\d{5}|\d{4})(?:-?[\d]))?(?!\S)")
cc = pattern.findall(a)
print("cpf:", cpf)
print("ag:", ag)
print("cc start:", cc)
for i in cc:
    print("i:", i)
    try:
        while i in ag: cc.remove(i)
    except: pass
    try:
        while i in cpf: cc.remove(i)
    except: pass
    try:
        while "" in i: cc.remove(i)
    except: pass
print("final cc:", cc)
It prints the following to my screen:
cpf: ['00770991092']
ag: 3527
cc start: ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '00770991092', '', '', '', '', '', '', '', '', '01068651-0', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
i:
i: 01068651-0
final cc: ['00770991092']
Well, the '' values are removed; that seems to be working fine. However, since '00770991092' is a value inside cpf, it should have been removed, but it wasn't. In the "final cc" that's the value I'm getting, and it should be '01068651-0'.
Even If I run this check:
if cc in cpf:print(True)
It confirms it is True.
What am I missing?
PS: I find it quite intriguing that when I print(i) inside the for loop, only two values show up (and one is empty).
Modifying a list while you're iterating over it doesn't work well. Is building a new list an option? Something like the following (note that ag is an int in your output, so compare against str(ag) rather than testing membership in it):

filtered_cc = [
    i for i in cc
    if not (i == str(ag) or i in cpf or i == "")
]
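To see why the original loop skips values, here is a tiny demonstration with made-up data: each remove() shifts the remaining elements left, so the iterator's next step jumps over the element that slid into the freed slot.

```python
data = ["", "", "keep", ""]
for item in data:
    if item == "":
        data.remove(item)

# One empty string survives: after the first removal shifted the
# list, the iterator skipped straight over it.
print(data)  # ['keep', '']
```

This is exactly why only two values of `i` were printed in the question's loop.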

Python Pandas read_html get rid of nested span element in table

I'm trying to grab some stock data from a website. The German website onvista.de has all the information I need. Now I've tried to get the stock data into a pandas dataframe.
Like this:
url = 'https://www.onvista.de/aktien/fundamental/ADLER-REAL-ESTATE-AG-Aktie-DE0005008007'
onvista_table = pd.read_html(url)
This works fine for other websites. But the onvista site has a nested 'span' element with text in it inside each th element. How do I get rid of the span elements in the th elements, so I get a proper dataframe without that text?
So I tried it with beautifulsoup to get rid of the 'span' element:
url = 'https://www.onvista.de/aktien/fundamental/ADLER-REAL-ESTATE-AG-Aktie-DE0005008007'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
onvista_table = soup
clean_data = []
for i in range(0, len(onvista_table.find_all('table'))):
    table = onvista_table.find_all('table')[i]
    for tool_tip_span in table.find_all('span', {"class": "INFO_LAYER_CONTAINER"}):
        tool_tip_span.decompose()
    rows = table.find_all('tr')
    for row in rows:
        raw_data = []
        for cell in row.find_all(['td', 'th']):
            raw_data.append(cell.get_text().strip())
        if len(raw_data) < 9:
            print(raw_data)
the result looks like this:
['Gewinn', '2020e', '2019e', '2018e', '2017', '2016', '2015', '2014']
['Gewinn pro Aktie in EUR', '-', '1,20', '0,89', '1,91', '2,11', '1,83', '4,65']
['KGV', '-', '12,52', '16,79', '6,95', '6,24', '7,06', '1,45']
['Gewinnwachstum', '-', '+45,18%', '-60,00%', '-9,47%', '+15,30%', '-60,64%', '+80,93%']
['PEG', '-', '-', '0,49', '-0,13', '-0,65', '0,46', '-0,02']
['Dividende', '2020e', '2019e', '2018e', '2017', '2016', '2015', '2014']
['Dividende (netto) in EUR', '-', '0,00', '0,00', '0,00', '0,00', '0,00', '0,00']
['Dividendenrendite', '-', '0,00', '0,05', '0,00', '0,00', '0,00', '0,00']
['Cash-Flow', '2020e', '2019e', '2018e', '2017', '2016', '2015', '2014']
['Cashflow pro Aktie in EUR', '-', '1,63', '2,38', '0,63', '2,11', '0,54', '0,52']
['Kurs-Cashflow Verhältnis (KCV)', '-', '9,25', '6,32', '21,08', '6,24', '23,94', '13,00']
['Umsatz', '', '', '', '2017', '2016', '2015', '2014']
['Umsatz in Mio. EUR', '', '', '', '299,30', '412,80', '384,80', '140,70']
['Umsatzwachstum', '', '', '', '-27,49%', '+7,27%', '+173,48%', '+632,81%']
['Umsatz pro Mitarbeiter in EUR', '', '', '', '598.600,00', '1.294.043,88', '1.485.714,28', '1.851.315,78']
['Buchwert', '', '', '', '2017', '2016', '2015', '2014']
['Buchwert pro Aktie in EUR', '', '', '', '18,03', '19,16', '16,87', '9,76']
['Kurs-Buchwert-Verhältnis', '', '', '', '0,73', '0,75', '0,84', '0,76']
['Bilanz', '', '', '', '2017', '2016', '2015', '2014']
['Bilanzsumme in Mio. EUR', '', '', '', '3.779,00', '3.430,50', '3.076,20', '1.416,50']
['Eigenkapitalquote', '', '', '', '+29,48%', '+28,71%', '+27,19%', '+23,36%']
['Verschuldungsgrad', '', '', '', '+239,10%', '+248,20%', '+267,74%', '+327,94%']
['dynam. Verschuldungsgrad', '', '', '', '+7.340,49%', '+2.430,71%', '+8.958,80%', '+6.500,00%']
['Bilanzierungsmethode', '', '', '', 'IFRS', 'IFRS', 'IFRS', 'IFRS']
['Marktkapitalisierung', '', '', '', '2017', '2016', '2015', '2014']
['Marktkapitalisierung in Mio. EUR', '', '', '', '764,52', '691,20', '655,58', '237,00']
['Marktkapitalisierung/Umsatz', '', '', '', '2,55', '1,67', '1,70', '1,68']
['Marktkapitalisierung/Mitarbeiter in EUR', '', '', '', '1.529.050,36', '2.166.794,35', '2.531.214,90', '3.118.461,26']
['Marktkapitalisierung/EBITDA', '', '', '', '2,44', '2,19', '3,69', '1,37']
['Rentabilität', '', '', '', '2017', '2016', '2015', '2014']
['Cashflow-Marge', '', '', '', '+12,12%', '+24,37%', '+6,49%', '+11,86%']
['EBIT-Marge', '', '', '', '+104,17%', '+75,82%', '+45,79%', '+122,45%']
['EBITDA-Marge', '', '', '', '+104,57%', '+76,11%', '+46,04%', '+122,81%']
['Eigenkapitalrendite', '', '', '', '+11,37%', '+12,27%', '+8,61%', '+32,87%']
['Gesamtkapitalrendite', '', '', '', '+7,55%', '+7,26%', '+5,08%', '+10,58%']
['Cashflow Return on Investment', '', '', '', '+0,96%', '+2,93%', '+0,81%', '+1,17%']
['Steuerquote', '', '', '', '+9,97%', '+28,65%', '+17,40%', '+15,96%']
This is exactly what I want, only as a pandas dataframe. So can someone please tell me how I can do this?
Kind regards,
Hoh
Once you have each table as a list of lists, you can load it into a new data frame. Example data:

raw_data = [
    ['Gewinn', '2020e', '2019e', '2018e', '2017', '2016', '2015', '2014'],
    ['Gewinn pro Aktie in EUR', '-', '1,20', '0,89', '1,91', '2,11', '1,83', '4,65'],
    ['KGV', '-', '12,52', '16,79', '6,95', '6,24', '7,06', '1,45'],
    ['Gewinnwachstum', '-', '+45,18%', '-60,00%', '-9,47%', '+15,30%', '-60,64%', '+80,93%'],
    ['PEG', '-', '-', '0,49', '-0,13', '-0,65', '0,46', '-0,02']
]

Create the data frame like so:

# use the first list as the headers
headers = raw_data.pop(0)
df_gewinn = pd.DataFrame(raw_data, columns=headers)
Then repeat this for each table (Dividende, Cash-Flow, Umsatz, etc.).
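Rather than repeating this by hand, the printed rows can be split into one DataFrame per section automatically, because every section starts with a header row whose cells after the label are all years like '2017' or '2020e'. A sketch, using an abridged copy of the rows above (the `is_header` helper is made up for the example):

```python
import pandas as pd

# Abridged copy of the scraped rows from the question.
rows = [
    ['Gewinn', '2020e', '2019e', '2018e', '2017'],
    ['Gewinn pro Aktie in EUR', '-', '1,20', '0,89', '1,91'],
    ['KGV', '-', '12,52', '16,79', '6,95'],
    ['Dividende', '2020e', '2019e', '2018e', '2017'],
    ['Dividende (netto) in EUR', '-', '0,00', '0,00', '0,00'],
]

def is_header(row):
    # A header row's non-empty cells (after the label) all start
    # with a four-digit year, e.g. '2017' or '2020e'.
    return all(c[:4].isdigit() for c in row[1:] if c)

tables = {}
current = None
for row in rows:
    if is_header(row):
        current = row[0]          # section name, e.g. 'Gewinn'
        tables[current] = [row]
    elif current is not None:
        tables[current].append(row)

# Turn each section into a DataFrame: first row = headers.
frames = {name: pd.DataFrame(body[1:], columns=body[0])
          for name, body in tables.items()}
print(frames['Gewinn'])
```

With the full row list this yields one frame per section (Gewinn, Dividende, Cash-Flow, Umsatz, and so on).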

How to know index of a decimal value in a python list

I have a list like the following
['UIS', '', '', '', '', '', '', '', '', '02/05/2014', 'N', '', '', '', '', '9:30:00', '', '', '', '', '', '', '', '', '31.8000', '', '', '', '', '', '', '3591', 'O', '', '', '', '', '0', '', '', '', '', '', '', '', '', '', '', '', '', '', '0']
Now, how can I tell which element is a decimal? Basically I want to track the 31.8000 value in the list. Is that possible?
You can reliably check whether a string holds a floating point number by literal-evaluating it and testing whether the result is of type float, like this:

from ast import literal_eval

result = []
for item in data:
    temp = ""
    try:
        temp = literal_eval(item)
    except (SyntaxError, ValueError):
        pass
    if isinstance(temp, float):
        result.append(item)

print(result)
# ['31.8000']
If you want to get the indexes, just enumerate the data like this

for idx, item in enumerate(data):
    ...

and while preparing the result, append the index instead of the actual element:

result.append(idx)
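Putting those two pieces together, here is a self-contained sketch with a shortened copy of the list:

```python
from ast import literal_eval

data = ['UIS', '', '02/05/2014', '9:30:00', '31.8000', '3591', '0']
indexes = []
for idx, item in enumerate(data):
    try:
        value = literal_eval(item)
    except (SyntaxError, ValueError):
        # 'UIS', dates, and times are not valid Python literals.
        continue
    if isinstance(value, float):
        indexes.append(idx)

print(indexes)  # [4]
```

Note that '3591' and '0' evaluate to ints, not floats, so they are correctly skipped.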
Iterate over the list and check if float() succeeds (note that float() also accepts integer strings such as '3591', so this collects the indexes of all numeric entries, not just decimals):

floatables = []
for i, item in enumerate(data):
    try:
        float(item)
        floatables.append(i)
    except ValueError:
        pass

print(floatables)
Alternatively, if you want to match only the decimal format you can use

import re

decimals = []
for i, item in enumerate(data):
    if re.match(r"^\d+?\.\d+?$", item) is not None:
        decimals.append(i)

print(decimals)
Using a list comprehension and a regular expression match (the sign must be optional, [+-]?, or unsigned numbers like 31.8000 won't match):

>>> import re
>>> [float(i) for i in x if re.match(r'^[+-]?\d+?[.]\d+$', i)]
[31.8]

If you want to track the indexes of the floats:

>>> [x.index(i) for i in x if re.match(r'[+-]?\d+?[.]\d+', i)]
[24]
import decimal

data = ['UIS', '', '', '', '', '', '', '', '', '02/05/2014', 'N', '', '', '', '', '9:30:00', '', '', '', '', '', '', '', '', '31.8000', '', '', '', '', '', '', '3591', 'O', '', '', '', '', '0', '', '', '', '', '', '', '', '', '', '', '', '', '', '0']

target = decimal.Decimal('31.8000')

def is_target(value):
    try:
        return decimal.Decimal(value) == target
    except decimal.InvalidOperation:
        return False

output = list(filter(is_target, data))
print(output)

Python array deleting items

I have an array:
a=['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '151 ihi Chun', '151 ihi Chun', '149 st Hg', '149 st Hg', '125 Tatane', '125 Tatane', '174 Sunnygat', '174 Sunnygat', '174 Sunnygat', '126 Nank', '126 Nank', '162 Rass', '162 Rass']
I want to remove all '' items, but can't.

a.remove('')

or

while a.index(''): a.remove('')

don't help.
Use a filter() call with None as the filter function (it tests each item for truth, so it drops empty strings); in Python 3 wrap it in list(), since filter() returns an iterator:

a = list(filter(None, a))
or a list comprehension:
a = [e for e in a if e]
If you need to explicitly allow other 'false' values and only want to filter out empty strings, use:
a = [e for e in a if e != '']
If those items are actually '', in other words, empty strings, then you can use the following:
a = [x for x in a if x]
Since an empty string evaluates to false when used in a truth testing statement.
try

for i in a:
    a.remove('')
a.remove('')

The first pass doesn't remove everything because removing items while iterating shrinks the list under the iterator, so the loop skips elements and ends early; that's why a second pass appears to finish the job. This approach is unreliable.
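If you do want in-place removal, iterating over a copy of the list avoids the skipping problem entirely. A small sketch with made-up data:

```python
a = ["", "x", "", "y", ""]
for item in a[:]:   # a[:] is a shallow copy; we mutate the original
    if item == "":
        a.remove(item)

print(a)  # ['x', 'y']
```

Since the iterator walks the unchanging copy, every empty string is visited and removed in a single pass.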

Python merging two CSV files

I have two CSV files. One:
s555555,7
s333333,10
s666666,9
s111111,10
s999999,9
And two:
s111111,,,,,
s222222,,,,,
s333333,,,,,
s444444,,,,,
s555555,,,,,
s666666,,,,,
s777777,,,,,
I want to end up with:
[['s111111', '10', '', '', '', ''],
['s222222', '', '', '', '', ''],
['s333333', '10', '', '', '', ''],
['s444444', '', '', '', '', ''],
['s555555', '7', '', '', '', ''],
['s666666', '9', '', '', '', ''],
['s777777', '', '', '', '', '']]
Here's my code:
new_marks = get_marks_from_file('assign1_marks.csv')
marks = get_marks_from_file('marks.csv')

def merge_marks(all_marks, new_marks, column):
    for n in range(len(new_marks)):
        for a in range(len(all_marks)):
            if all_marks[a][0] == new_marks[n][0]:
                all_marks[a][column] = new_marks[n][column]
                return marks
What am I doing wrong? I keep getting:
>>> merge_marks(marks, new_marks, 1)
[['s111111', '', '', '', '', ''],
['s222222', '', '', '', '', ''],
['s333333', '', '', '', '', ''],
['s444444', '', '', '', '', ''],
['s555555', '7', '', '', '', ''],
['s666666', '', '', '', '', ''],
['s777777', '', '', '', '', '']]
The line
return marks
has to be unindented by three levels, to get it out of both for loops and the if statement. Right now it is returning with the first all_marks[a][0]==new_marks[n][0] match it finds and never replacing the others.
You also want to return all_marks rather than marks: in this case the global variable marks happens to be the same object and is also changed, but the function would return the wrong list if you called it with a variable named literally anything else.
The solution is thus:
def merge_marks(all_marks, new_marks, column):
    for n in range(len(new_marks)):
        for a in range(len(all_marks)):
            if all_marks[a][0] == new_marks[n][0]:
                all_marks[a][column] = new_marks[n][column]
    return all_marks
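As a side note, the nested loop is O(n*m) in the sizes of the two lists; indexing the new marks by student id first makes the merge linear. A sketch with the same data shapes as the question (the function name is made up):

```python
def merge_marks_fast(all_marks, new_marks, column):
    # Map student id -> row of new marks for O(1) lookup.
    lookup = {row[0]: row for row in new_marks}
    for row in all_marks:
        if row[0] in lookup:
            row[column] = lookup[row[0]][column]
    return all_marks

marks = [['s111111', '', ''], ['s222222', '', ''], ['s333333', '', '']]
new = [['s333333', '10'], ['s111111', '7']]
print(merge_marks_fast(marks, new, 1))
# [['s111111', '7', ''], ['s222222', '', ''], ['s333333', '10', '']]
```

The dict makes each lookup constant time, so the function touches each row of each list only once.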
