Python Pandas read_html get rid of nested span element in table

I'm trying to grab some stock data from a website. The German site onvista.de has all the information I need. Now I'm trying to get the stock data into a pandas DataFrame, like this:
import pandas as pd

url = 'https://www.onvista.de/aktien/fundamental/ADLER-REAL-ESTATE-AG-Aktie-DE0005008007'
onvista_table = pd.read_html(url)
This works fine for other websites, but the onvista page has a nested 'span' element inside the th elements, and that span contains text of its own. How do I get rid of the span inside the th, to get a proper dataframe without that text?
So I tried it with BeautifulSoup to get rid of the 'span' elements:
import requests
from bs4 import BeautifulSoup

url = 'https://www.onvista.de/aktien/fundamental/ADLER-REAL-ESTATE-AG-Aktie-DE0005008007'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

clean_data = []
for table in soup.find_all('table'):
    # drop the tooltip spans before extracting the cell text
    for tool_tip_span in table.find_all('span', {"class": "INFO_LAYER_CONTAINER"}):
        tool_tip_span.decompose()
    rows = table.find_all('tr')
    for row in rows:
        raw_data = []
        for cell in row.find_all(['td', 'th']):
            raw_data.append(cell.get_text().strip())
        if len(raw_data) < 9:
            print(raw_data)
The result looks like this:
['Gewinn', '2020e', '2019e', '2018e', '2017', '2016', '2015', '2014']
['Gewinn pro Aktie in EUR', '-', '1,20', '0,89', '1,91', '2,11', '1,83', '4,65']
['KGV', '-', '12,52', '16,79', '6,95', '6,24', '7,06', '1,45']
['Gewinnwachstum', '-', '+45,18%', '-60,00%', '-9,47%', '+15,30%', '-60,64%', '+80,93%']
['PEG', '-', '-', '0,49', '-0,13', '-0,65', '0,46', '-0,02']
['Dividende', '2020e', '2019e', '2018e', '2017', '2016', '2015', '2014']
['Dividende (netto) in EUR', '-', '0,00', '0,00', '0,00', '0,00', '0,00', '0,00']
['Dividendenrendite', '-', '0,00', '0,05', '0,00', '0,00', '0,00', '0,00']
['Cash-Flow', '2020e', '2019e', '2018e', '2017', '2016', '2015', '2014']
['Cashflow pro Aktie in EUR', '-', '1,63', '2,38', '0,63', '2,11', '0,54', '0,52']
['Kurs-Cashflow Verhältnis (KCV)', '-', '9,25', '6,32', '21,08', '6,24', '23,94', '13,00']
['Umsatz', '', '', '', '2017', '2016', '2015', '2014']
['Umsatz in Mio. EUR', '', '', '', '299,30', '412,80', '384,80', '140,70']
['Umsatzwachstum', '', '', '', '-27,49%', '+7,27%', '+173,48%', '+632,81%']
['Umsatz pro Mitarbeiter in EUR', '', '', '', '598.600,00', '1.294.043,88', '1.485.714,28', '1.851.315,78']
['Buchwert', '', '', '', '2017', '2016', '2015', '2014']
['Buchwert pro Aktie in EUR', '', '', '', '18,03', '19,16', '16,87', '9,76']
['Kurs-Buchwert-Verhältnis', '', '', '', '0,73', '0,75', '0,84', '0,76']
['Bilanz', '', '', '', '2017', '2016', '2015', '2014']
['Bilanzsumme in Mio. EUR', '', '', '', '3.779,00', '3.430,50', '3.076,20', '1.416,50']
['Eigenkapitalquote', '', '', '', '+29,48%', '+28,71%', '+27,19%', '+23,36%']
['Verschuldungsgrad', '', '', '', '+239,10%', '+248,20%', '+267,74%', '+327,94%']
['dynam. Verschuldungsgrad', '', '', '', '+7.340,49%', '+2.430,71%', '+8.958,80%', '+6.500,00%']
['Bilanzierungsmethode', '', '', '', 'IFRS', 'IFRS', 'IFRS', 'IFRS']
['Marktkapitalisierung', '', '', '', '2017', '2016', '2015', '2014']
['Marktkapitalisierung in Mio. EUR', '', '', '', '764,52', '691,20', '655,58', '237,00']
['Marktkapitalisierung/Umsatz', '', '', '', '2,55', '1,67', '1,70', '1,68']
['Marktkapitalisierung/Mitarbeiter in EUR', '', '', '', '1.529.050,36', '2.166.794,35', '2.531.214,90', '3.118.461,26']
['Marktkapitalisierung/EBITDA', '', '', '', '2,44', '2,19', '3,69', '1,37']
['Rentabilität', '', '', '', '2017', '2016', '2015', '2014']
['Cashflow-Marge', '', '', '', '+12,12%', '+24,37%', '+6,49%', '+11,86%']
['EBIT-Marge', '', '', '', '+104,17%', '+75,82%', '+45,79%', '+122,45%']
['EBITDA-Marge', '', '', '', '+104,57%', '+76,11%', '+46,04%', '+122,81%']
['Eigenkapitalrendite', '', '', '', '+11,37%', '+12,27%', '+8,61%', '+32,87%']
['Gesamtkapitalrendite', '', '', '', '+7,55%', '+7,26%', '+5,08%', '+10,58%']
['Cashflow Return on Investment', '', '', '', '+0,96%', '+2,93%', '+0,81%', '+1,17%']
['Steuerquote', '', '', '', '+9,97%', '+28,65%', '+17,40%', '+15,96%']
This is exactly what I want, only as a pandas DataFrame. Can someone please tell me how I can do this?
Kind regards,
Hoh

Once you have each table as a list of lists, you can load it into a new data frame. Example data:
raw_data = [
    ['Gewinn', '2020e', '2019e', '2018e', '2017', '2016', '2015', '2014'],
    ['Gewinn pro Aktie in EUR', '-', '1,20', '0,89', '1,91', '2,11', '1,83', '4,65'],
    ['KGV', '-', '12,52', '16,79', '6,95', '6,24', '7,06', '1,45'],
    ['Gewinnwachstum', '-', '+45,18%', '-60,00%', '-9,47%', '+15,30%', '-60,64%', '+80,93%'],
    ['PEG', '-', '-', '0,49', '-0,13', '-0,65', '0,46', '-0,02']
]
Create the data frame like so:
import pandas as pd

# use the first list as the column headers
headers = raw_data.pop(0)
df_gewinn = pd.DataFrame(raw_data, columns=headers)
Then repeat this for each table (Dividende, Cash-Flow, Umsatz, etc.); a sketch of that loop follows.
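A minimal sketch of that loop, assuming you collect the cleaned rows per scraped table into a list called tables (one list of rows per table, first row being the headers). The tables name and grouping step are hypothetical, not part of the original code:
import pandas as pd

dataframes = {}
for rows in tables:
    headers = rows[0]          # e.g. ['Gewinn', '2020e', ..., '2014']
    name = headers[0]          # section name: 'Gewinn', 'Dividende', ...
    dataframes[name] = pd.DataFrame(rows[1:], columns=headers)

df_gewinn = dataframes['Gewinn']
df_dividende = dataframes['Dividende']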

Related

Requests object not filtering correctly

I'm trying to retrieve all URLs from a page using Python's Requests library. I can't figure out why my filter is returning hundreds of items more than I am expecting. Code:
import requests
import re

r = requests.get('http://exrx.net/Lists/ExList/NeckWt', headers=headers_dict, timeout=3)
counter = 0
raw_html = r.text
listly = re.split('\"', raw_html)
for i in listly:
    if "https://exrx.net" in i or "../../" in i:
        pass
    else:
        listly.remove(i)
        counter += 1
print(listly)
print('-' * 5)
print('the list is now', len(listly), 'objects long')
print(counter, ' objects were removed')
print('-' * 5)
The final list, however, contains 487 items (down from >900), including the following, which are confusingly not matched by my if/else block. I cannot figure out why they are not being deleted:
['en', 'Content-Type', 'text/html; charset=utf-8', '... func = ', '... func.apply: ', "----- F'D: ", '... file = ', "----- ERR'D: ", "----- F'D: ", '', 'load', '_', ' blocked = TIME DELAY!', ' blocked = ', ' blocked = ', 'markLoaded dummyfile: ', '1', "let's go", 'on', 'on', 'on', 'on', 'script', 'text/javascript', 'head', '/detroitchicago/grapefruit.gif', 'prerender', '?orig=', '&v=', '/porpoiseant/army.gif', 'compid', '0', '', 'impression', '', 'impression', 'prerender', '?orig=', '&sts=', 'domain_id', '&visit_uuid=', 'undefined', 'false', 'false', 'function', 'CustomEvent', 'false', 'false', 'content-type', 'text/html; charset=UTF-8', 'generator', 'concrete5', 'shortcut icon', 'https://exrx.net/application/files/8014/4923/2704/Runner3.jpg', 'image/x-icon', 'icon', 'https://exrx.net/application/files/8014/4923/2704/Runner3.jpg', 'image/x-icon', 'canonical', 'https://exrx.net/Lists/ExList/NeckWt', 'text/javascript', '/index.php', '/updates/concrete5-8.5.7/concrete/images', '/index.php/tools/required', 'https://exrx.net', '', 'en_US', 'text/css', 'Logo', '79715', '3471', 'text/javascript', 'https://cdnjs.cloudflare.com/ajax/libs/jquery/1.11.3/jquery.min.js?ccm_nocache=1a72ca0f3692b16db9673a9a89faff0649086c52', 'text/javascript', '/updates/concrete5-8.5.7/concrete/js/ie/html5-shiv.js?ccm_nocache=1a72ca0f3692b16db9673a9a89faff0649086c52', 'text/javascript', '/updates/concrete5-8.5.7/concrete/js/ie/respond.js?ccm_nocache=1a72ca0f3692b16db9673a9a89faff0649086c52', 'text/javascript', '', 'touchstart', 'https://fonts.googleapis.com/css?family=Source+Sans+Pro:300,400,700,900', 'stylesheet', 'text/css', '/application/files/cache/css/fruitful/iGotStyle.css?ts=1644387679', 'stylesheet', 'text/css', 'all', 'viewport', 'width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no', '/application/files/cache/css/fruitful/accessory.css?ts=1644387679', 'stylesheet', 'text/css', 'all', 'https://use.fontawesome.com/bf47fdcc0a.js', '', 'text/css', '', '//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js', 'ca-pub-6329449765532083', 'text/javascript', '1', 'https://exrx.net/Lists/ExList/NeckWt', 'false', 'false', 'text/javascript', 'false', 'ad_cache_level', 'ad_lazyload_version', 'ad_load_version', 'city', 'Sydney', 'country', 'AU', 'days_since_last_visit', 'domain_id', 'domain_test_group', 'engaged_time_visit', 'ezcache_level', 'ezcache_skip_code', 'form_factor_id', 'framework_id', 'is_return_visitor', 'is_sitespeed', 'last_page_load', '', 'last_pageview_id', '', 'lt_cache_level', 'metro_code', 'page_ad_positions', '', 'page_view_count', 'page_view_id', '578b3a09-c637-461b-4c42-c6c83546001c', 'position_selection_id', 'postal_code', '2000', 'pv_event_count', 'response_size_orig', 'response_time_orig', 'serverid', '54.66.141.238:27055', 'state', 'NSW', 't_epoch', 'template_id', 'time_on_site_visit', 'url', 'https://exrx.net/Lists/ExList/NeckWt', 'user_id', 'weather_precipitation', 'weather_summary', '', 'weather_temperature', 'word_count', 'worst_bad_word_level', '&ez_orig=1', 'expires=', 'ezux_lpl_107151=', '|', '|', '; ', 'complete', 'onload', 'attach_ezolpl', 'attach_ezolpl', '578b3a09-c637-461b-4c42-c6c83546001c', 'false', 'page527', 'ccm-page ccm-page-id-527 page-type-page page-template-directory-template', 'siteHeader', 'container', 'row', 'logo', 'col-xs-6 col-md-3', 'ccm-custom-style-container ccm-custom-style-logo-79715', 'https://exrx.net/', '/application/files/3114/3635/4565/logo_same_proportion_5_2_2015.gif', 'ExRx.net: Exercise Prescription on Internet', 'ccm-image-block 
img-responsive bID-79715', 'mainNav', 'clearfix hidden-xs hidden-sm col-sm-9', 'nav', '', 'https://exrx.net/Lists/Directory', '_self', '', '', '/Lists/Directory', '_self', '', '', '/WeightTraining/Instructions', '_self', '', '', '/Lists/Muscle', '_self', '', '', '/Lists/Articulations', '_self', '', '', '/Calculators', '_self', '', '', 'https://exrx.net/Beginning', '_self', '', '', '/Beginning', '_self', '', '', '/WeightTraining', '_self', '', '', '/Kinesiology', '_self', '', '', '/Aerobic', '_self', '', '', '/ExInfo', '_self', '', '', '/Sports', '_self', '', '', '/Bodybuilding', '_self', '', '', '/Drugs', '_self', '', '', '/Psychology', '_self', '', '', '/FatLoss', '_self', '', '', '/Nutrition', '_self', '', '', '/Testing', '_self', '', '', 'https://exrx.net/Notes/SiteJournal', '_self', '', '', '/Notes/SiteJournal', '_self', '', '', '/People/Contact', '_self', '', '', '/Notes/Feedback', '_self', '', '', '/Notes/Archive/Feedback10', '_self', '', '', '/Questions', '_self', '', '', '/forum/', '_blank', '', '', '/Links', '_self', '', '', '/Abstracts', '_self', '', '', '/Journals', '_self', '', '', '/Videos', '_self', '', '', '/Talks', '_self', '', '', '/Notes/Donations', '_self', '', '', 'https://exrx.net/Store', '_self', '', 'mobileAssets', 'col-xs-6 visible-xs-block visible-sm-block text-right', 'icoMobileNav', 'fa fa-bars', 'text/javascript', '/packages/fruitful/themes/fruitful/js/initExRx.js', 'headerShell', 'container', 'row', 'col-sm-12', 'fruitful-page-title fruitfull-title-padding', 'page-title', 'row Breadcrumb-Container Add-Margin-Top', 'container', 'col-sm-9', 'http://exrx.net', '../Directory', 'col-sm-3', 'google_translate_element', 'text/javascript', 'mainShell', 'container ', 'row', 'col-sm-12', 'ccm-custom-style-container ccm-custom-style-directorytopadvertise-86906 Add-Margin-Bottom', 'ezoic-pub-ad-placeholder-103', '', '//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js', 'adsbygoogle', 'display:block; height:90px;', 'ca-pub-6329449765532083', '4409668012', 'container', 'row', 'col-sm-12', 'Sternocleidomastoid', '../../Muscles/Sternocleidomastoid', 'container', 'row', 'col-sm-12', 'row', 'col-sm-6', '../../WeightExercises/Sternocleidomastoid/CBNeckFlx', '../../WeightExercises/Sternocleidomastoid/CBNeckFlxBelt', '../../WeightExercises/Sternocleidomastoid/CBNeckRotationBelt', '../../WeightExercises/Sternocleidomastoid/CBNeckLtrFlxBelt', '_top', '../../WeightExercises/Sternocleidomastoid/LVNeckFlexionH', '_top', '../../WeightExercises/Sternocleidomastoid/LVLateralNeckFlexionH', '_top', '../../WeightExercises/Sternocleidomastoid/LVNeckFlx', '_top', '../../WeightExercises/Sternocleidomastoid/LVNeckLtrFlx', '_top', '../../WeightExercises/Sternocleidomastoid/WtLyingNeckFlexion', '../../WeightExercises/Sternocleidomastoid/WtNeckFlx', '_top', '../../WeightExercises/Sternocleidomastoid/WtNeckLateralFlex', '_top', 'col-sm-6', '../../WeightExercises/Sternocleidomastoid/BWFrontNeckBridge', '../../WeightExercises/Sternocleidomastoid/BWWallFrontNeckBridge', '../../WeightExercises/Sternocleidomastoid/BWWallSideNeckBridge', '../../Stretches/Sternocleidomastoid/NeckRetraction', '../../Stretches/Sternocleidomastoid/NeckRotation', 'https://exrx.net/WeightExercises/Sternocleidomastoid/STNeckFlexion', 'https://exrx.net/WeightExercises/Sternocleidomastoid/STNeckLateralFlexion', '', '//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js', 'adsbygoogle', 'display:inline-block;width:300px;height:250px', 'ca-pub-6329449765532083', '2861896011', 'container', 'row', 'col-sm-12', 'Splenius', 
'../../Muscles/Splenius', 'container', 'row', 'col-sm-12', 'row', 'col-sm-6', '../../WeightExercises/Splenius/CBNeckExt', '_top', '../../WeightExercises/Splenius/CBNeckExtBelt', '../../WeightExercises/Splenius/LVNeckExtentionH', '../../WeightExercises/Splenius/LVNeckExt', '_top', '../../WeightExercises/Splenius/WtLyingNeckExtension', '../../WeightExercises/Splenius/WtNeckExtension', '../../WeightExercises/Splenius/WtNeckExt', '_top', '../../WeightExercises/Splenius/WtNeckHarnessExt', '#Sternocleidomastoid', 'col-sm-6', 'https://exrx.net/WeightExercises/Splenius/BRNeckRetraction', '../../WeightExercises/Splenius/BWRearNeckBridge', '../../WeightExercises/Splenius/BWWallRearNeckBridge', '../../WeightExercises/Splenius/LyingIsometricNeckRetr', '../../Stretches/Splenius/Neck', 'https://exrx.net/WeightExercises/Splenius/STNeckExtension', '../../Stretches/ErectorSpinae/Plow', 'WaistWt#Erector', 'container', 'row', 'col-sm-12', 'BackWt', 'BackWt#UpperTrap', 'WaistWt', 'WaistWt#Erector', 'container', 'row', 'col-sm-12 Add-Margin-Top', 'container', 'subfooter no-print', 'text-align: center;', 'text-align: center;', '../../Lists/Directory', '../../Notes/Notes', '_parent', 'site-footer', 'container ', 'row', 'copyright', 'col-xs-12 col-sm-3', 'col-xs-12 col-sm-9', 'margin:0px !important', 'https://exrx.net/People/Contact', 'https://exrx.net/Notes/Privacy', 'https://exrx.net/Notes/Legal', 'https://exrx.net/Notes/ADA', 'https://www.facebook.com/pages/ExRxnet/1685475628344232', 'https://exrx.net/Notes/Feedback', 'ajax', 'https://exrx.net/Notes/Archive/Feedback1', 'https://exrx.net/Store', 'amzn-assoc-ad-d457ebf0-12d4-46d4-a3f1-6d2aa75f0d88', '', '//z-na.amazon-adsystem.com/widgets/onejs?MarketPlace=US&adInstanceId=d457ebf0-12d4-46d4-a3f1-6d2aa75f0d88', '/packages/fruitful/themes/fruitful/js/functions.js', 'text/javascript', '', '/packages/fruitful/themes/fruitful/js/bootstrap.min.js', 'text/javascript', '', 'text/javascript', '', '#mainNav', 'body', 'id', 'mobileNav', 'visible-xs-block visible-sm-block', 'hidden-xs hidden-sm', '#icoMobileNav', '.ccm-page, #mobileNav', 'slideOver', 'text/javascript', '/updates/concrete5-8.5.7/concrete/js/picturefill.js?ccm_nocache=1a72ca0f3692b16db9673a9a89faff0649086c52', 'exrx_net', 'audins.js', '__ez.script.add', '//go.ezoic.net/detroitchicago/audins.js?cb=195-3', 'display:none;', '//pixel.quantserve.com/pixel/p-31iz6hfFutd16.gif?labels=Domain.exrx_net,DomainId.107151', '0', '1', '1', 'Quantcast', 'text/javascript', 'false']
Take a look at BeautifulSoup, the main Python web-scraping library. The best way, in my opinion, to get all the links on the page is by doing something like:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req = Request("http://exrx.net/Lists/ExList/NeckWt")
page_source = urlopen(req)
soup = BeautifulSoup(page_source, "lxml")

links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
This gets all the links on the page without your having to parse the page's HTML manually.
Generally, you should not remove elements from a list while iterating over it, which is what your for loop does. Instead, append the desired elements to a new list, or use a list comprehension.
Example of a list comprehension:
listly = [s for s in listly if "https://exrx.net" in s or "../../" in s]
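Combining the two ideas, a small sketch that collects only the internal links straight from the soup (same URL and filter criteria as above; this is an illustration, not code from the question):
internal_links = [
    href
    for href in (link.get('href') for link in soup.findAll('a'))
    if href and ("https://exrx.net" in href or "../../" in href)
]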

BeautifulSoup scraping incorrect table

I was scraping this site with the following code:
import requests
from bs4 import BeautifulSoup

url = "https://www.pro-football-reference.com/teams/buf/2021_injuries.htm"
r = requests.get(url)
stats_page = BeautifulSoup(r.content, features="lxml")
table = stats_page.findAll('table')[0]  # get FIRST table on page
for player in table.findAll("tr"):
    print([i.getText() for i in player.findAll("td")])
The output is:
[]
['', 'IR', 'IR', 'IR', 'IR', 'IR', 'IR', 'IR']
['', 'Q', '', '', '', '', '', '']
['', '', '', '', 'Q', '', '', '']
['', '', '', 'O', '', '', '', 'IR']
['', '', 'Q', '', '', '', '', '']
['', '', '', 'Q', '', '', '', '']
['', '', '', '', 'Q', '', '', '']
['O', 'Q', '', '', '', '', '', '']
['', '', '', '', 'Q', '', '', '']
['', 'Q', '', 'Q', '', '', '', '']
['', '', '', 'O', '', '', '', '']
['Q', '', '', '', '', '', '', '']
['', 'IR', 'IR', 'IR', 'IR', 'IR', 'IR', 'IR']
['', '', 'Q', '', '', '', '', '']
['', 'IR', 'IR', 'IR', 'IR', '', '', '']
This is clearly the output I would expect from the 2nd table on the page, "Team Injuries", rather than the 1st, "Week 10 injury report". Any idea why BeautifulSoup is seemingly ignoring the first table on the page?
The table you want is inside an HTML comment, so BeautifulSoup will not parse its contents as HTML.
You will need to first locate the comment containing the table, then parse the HTML inside it separately. For example:
import requests
from bs4 import BeautifulSoup, Comment

url = "https://www.pro-football-reference.com/teams/buf/2021_injuries.htm"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")

for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if '<table' in comment:
        soup_table = BeautifulSoup(comment, "lxml")
        table = soup_table.findAll('table')[0]  # first table inside the comment
        for player in table.findAll("tr"):
            print([i.getText() for i in player.findAll("td")])
        break
This would display your output as:
[]
['DE', '', 'Injured Reserve', '']
['OG', '', 'Injured Reserve', '']
['WR', '', 'Injured Reserve', '']
['DE', 'DNP', '', 'Rest']
['WR', 'DNP', '', 'Rest']
['T', 'Limited', '', 'Back']
['ILB', 'DNP', '', 'Hamstring']
['CB', 'Limited', '', 'Hamstring']
['CB', 'Limited', '', 'Concussion']
['TE', 'Limited', '', 'Hand']
['RB', 'DNP', '', 'Concussion']
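If the end goal is a DataFrame rather than printed rows, the comment's HTML can also be handed to pandas. A minimal sketch, assuming pandas is installed (an addition for illustration, not part of the original answer):
import pandas as pd
from io import StringIO

# inside the loop above, once a comment containing '<table' is found:
df = pd.read_html(StringIO(str(comment)))[0]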

Iterating over a list with a while statement and values not being excluded with list.remove

I'm running some code to clean a database. Basically, if certain values appear in a list, they should be removed.
Below you can see the code:
import re

pattern = re.compile(r"((?:\d{10}|\d{9}|\d{8}|\d{7}|\d{6}|\d{5}|\d{4})(?:-?[\d]))?(?!\S)")
cc = pattern.findall(a)
print("cpf:", cpf)
print("ag:", ag)
print("cc start:", cc)
for i in cc:
    print("i:", i)
    try:
        while i in ag:
            cc.remove(i)
    except:
        pass
    try:
        while i in cpf:
            cc.remove(i)
    except:
        pass
    try:
        while "" in i:
            cc.remove(i)
    except:
        pass
print("final cc:", cc)
It prints in my screen the following:
cpf: ['00770991092']
ag: 3527
cc start: ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '00770991092', '', '', '', '', '', '', '', '', '01068651-0', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
i:
i: 01068651-0
final cc: ['00770991092']
Well, the '' values are removed; that seems to be working fine. However, since '00770991092' is a value inside cpf, it should have been removed, but it wasn't. It's the value I'm getting in "final cc", when the result should be '01068651-0'.
Even if I run this check:
if cc in cpf: print(True)
it confirms that it is True.
What am I missing?
PS: I find it quite intriguing that when I print(i) inside the for loop, only two values show up (and one is empty).
Modifying a list while you're iterating over it doesn't work very well. Is building a new list an option? Something like:
filtered_cc = [
    i for i in cc
    if not (i in ag or i in cpf or i == "")
]
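As for the PS: removing items during iteration shifts the remaining elements left, so the loop's internal index skips the element that follows each removal. A minimal sketch with hypothetical data showing the effect:
items = ['', '', 'remove-me', 'keep']
for i in items:
    print("visiting:", i)
    if i == '' or i == 'remove-me':
        items.remove(i)
print(items)  # ['', 'keep']

# "visiting" is printed only twice ('' and 'remove-me'): after each removal
# the list shrinks, the iterator jumps past the next element, and one ''
# survives. That is why only two values of i show up in your loop.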

Getting too many matches for one string segment in regex (python)

I'm trying to write a regex script to find all instances of money amounts in a text. The code matches correctly, but I can't figure out why it returns multiple versions of things for each string it matches.
For example, in this code:
import re

string = "$50.00"
print("number dollars: ")
print(re.findall(r"\-?\(?\$?\s*\-?\s*\(?(((\d{1,3}((\,\d{3})*|\d*))?(\.\d{1,4})?)|((\d{1,3}((\,\d{3})*|\d*))(\.\d{0,4})?))\)?\ ?(one)?\ ?(two)?\ ?(three)?\ ?(four)?\ ?(five)?\ ?(six)?\ ?(seven)?\ ?(eight)?\ ?(nine)?\ ?(ten)?\ ?(eleven)?\ ?(twelve)?\ ?(thirteen)?\ ?(fourteen)?\ ?(fifteen)?\ ?(sixteen)?\ ?(seventeen)?\ ?(eighteen)?\ ?(nineteen)?\ ?(hundred)?\ ?(thousand)?\ ?(million)?\ ?(billion)?\ ?(trillion)?\ ?(dollars)?\ ?(pounds)?\ ?(euros)?", string))
This is the result I get:
number dollars:
[('50.00', '50.00', '50', '', '', '.00', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')]
This is the regex by itself:
\-?\(?\$?\s*\-?\s*\(?(((\d{1,3}((\,\d{3})*|\d*))?(\.\d{1,4})?)|((\d{1,3}((\,\d{3})*|\d*))(\.\d{0,4})?))\)?\ ?(one)?\ ?(two)?\ ?(three)?\ ?(four)?\ ?(five)?\ ?(six)?\ ?(seven)?\ ?(eight)?\ ?(nine)?\ ?(ten)?\ ?(eleven)?\ ?(twelve)?\ ?(thirteen)?\ ?(fourteen)?\ ?(fifteen)?\ ?(sixteen)?\ ?(seventeen)?\ ?(eighteen)?\ ?(nineteen)?\ ?(hundred)?\ ?(thousand)?\ ?(million)?\ ?(billion)?\ ?(trillion)?\ ?(dollars)?\ ?(pounds)?\ ?(euros)?
The result contains a string for each and every parenthesized group, corresponding to the portion of the string matched by the subexpression in that group, in order of opening parentheses (e.g. (\d+(\.\d+)?) would give ('50.00', '.00')). To prevent the contents of a group from being captured, put ?: right after the opening parenthesis (e.g. (?:\,\d{3})*); this creates a non-capturing group.
The majority of the groups are for words that don't appear in the string, which produces most of the empty strings in the result.
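A minimal illustration of the difference (not from the original post):
import re

# capturing inner group: findall returns one tuple per match,
# with a slot for every group
print(re.findall(r"(\d+(\.\d+)?)", "$50.00"))    # [('50.00', '.00')]

# non-capturing inner group: only the outer group is reported
print(re.findall(r"(\d+(?:\.\d+)?)", "$50.00"))  # ['50.00']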

Python merging two CSV files

I have two CSV files. One:
s555555,7
s333333,10
s666666,9
s111111,10
s999999,9
And two:
s111111,,,,,
s222222,,,,,
s333333,,,,,
s444444,,,,,
s555555,,,,,
s666666,,,,,
s777777,,,,,
I want to end up with:
[['s111111', '10', '', '', '', ''],
['s222222', '', '', '', '', ''],
['s333333', '10', '', '', '', ''],
['s444444', '', '', '', '', ''],
['s555555', '7', '', '', '', ''],
['s666666', '9', '', '', '', ''],
['s777777', '', '', '', '', '']]
Here's my code:
new_marks = get_marks_from_file('assign1_marks.csv')
marks = get_marks_from_file('marks.csv')

def merge_marks(all_marks, new_marks, column):
    for n in range(len(new_marks)):
        for a in range(len(all_marks)):
            if all_marks[a][0] == new_marks[n][0]:
                all_marks[a][column] = new_marks[n][column]
                return marks
What am I doing wrong? I keep getting:
>>> merge_marks(marks, new_marks, 1)
[['s111111', '', '', '', '', ''],
['s222222', '', '', '', '', ''],
['s333333', '', '', '', '', ''],
['s444444', '', '', '', '', ''],
['s555555', '7', '', '', '', ''],
['s666666', '', '', '', '', ''],
['s777777', '', '', '', '', '']]
The line
return marks
has to be unindented by three levels, to get it out of both for loops and the if statement. Right now the function returns at the first all_marks[a][0] == new_marks[n][0] match it finds and never replaces the others.
You also want to return all_marks rather than marks: in this case the global variable marks happens to be the same object and is also changed, but the function would fail if you called it with a variable named anything else.
The solution is thus:
def merge_marks(all_marks, new_marks, column):
    for n in range(len(new_marks)):
        for a in range(len(all_marks)):
            if all_marks[a][0] == new_marks[n][0]:
                all_marks[a][column] = new_marks[n][column]
    return all_marks
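As a side note, the nested loops scan all_marks once per row of new_marks. A dictionary lookup avoids that; here is a sketch of the same merge (an alternative for illustration, not the answer's code):
def merge_marks_dict(all_marks, new_marks, column):
    # build an id -> value lookup once, then do a single pass
    lookup = {row[0]: row[column] for row in new_marks}
    for row in all_marks:
        if row[0] in lookup:
            row[column] = lookup[row[0]]
    return all_marks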
