Split numbers from api with commas? - python

import urllib.request,
urllib.parse, urllib.error
from bs4 import BeautifulSoup
url = "https://api.monzo.com/crowdfunding-investment/total"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style"]):
script.extract() # rip it out
# get text
text = soup.get_text()
if 'invested_amount' in text:
result = text.split(",")
invested = str(result[1])
investedn = invested.split(':')[1]
print(investedn)
Hi all. I’m trying to split investedn into thousands with commas. Anyone know how to do this?
Also, how can I remove the last four numbers from the string?
Thanks!

Simply use
"{:,}".format(number)
https://docs.python.org/3/library/string.html#format-specification-mini-language
e.g.
In [19]: "{:,}".format(17462233620)
Out[19]: '17,462,233,620'

Managed to fix!
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
url = "https://api.monzo.com/crowdfunding-investment/total"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style"]):
script.extract() # rip it out
# get text
text = soup.get_text()
if 'invested_amount' in text:
result = text.split(",")
invested = str(result[1])
investedn = invested.split(':')[1]
plainnum = int(str(investedn)[:-4])
number = "{:,}".format(int(plainnum))
print(number)
I had messed up quite a bit, but figured it out.
Thanks!

The text you got back from that URL is not HTML. It is data encoded in JSON format, which is easy to parse:
import urllib.request
import json
url = "https://api.monzo.com/crowdfunding-investment/total"
json_text = urllib.request.urlopen(url).read()
json_text = json_text.decode('utf-8')
data = json.loads(json_text)
print(data)
print('Invested amount: {:,}'.format(data['invested_amount']))
Output:
{'invested_amount': 17529735495, 'share_price': 77145, 'shares_invested': 227231, 'max_shares': 2592520, 'max_amount': 199999955400, 'status': 'pending'}
Invested amount: 17,529,735,495
Notes
json_text was an array of bytes, not a string. That is why I decode it using a guess of UTF-8.
data is just a normal Python dictionary.

a = "17462233620"
b = ""
for i in range(len(a), 0 , -3):
b = a[i-3:i]+","+b
b = "£" + a[0:i] + b[:-1]
print(b) # Output £17,462,233,620

Related

Using multiple for loop with Python Using Beautiful Soup

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = "https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/109825373"
data = requests.get(url)
soup = bs(data.content,"html.parser")
The code below are a test with to get 1 item.
property_overview = soup.find(class_="p24_regularListing").find(class_="p24_propertyOverview").find(class_='p24_propertyOverviewRow').find(class_='col-xs-6 p24_propertyOverviewKey').text
property_overview
Output : 'Listing Number'
The code below is what we have to get all the col-xs-6 p24_propertyOverviewKey
p24_regularListing_items = soup.find_all(class_="p24_regularListing")
for p24_propertyOverview_item in p24_regularListing_items:
p24_propertyOverview_items = p24_propertyOverview_item.find_all(class_="p24_propertyOverview")
for p24_propertyOverviewRow_item in p24_propertyOverview_items:
p24_propertyOverviewRow_items = p24_propertyOverviewRow_item.find_all(class_="p24_propertyOverviewRow")
for p24_propertyOverviewKey_item in p24_propertyOverviewRow_items:
p24_propertyOverviewKey_items = p24_propertyOverviewKey_item.find_all(class_="col-xs-6 p24_propertyOverviewKey")
p24_propertyOverviewKey_items
The code above only outputs 1 item. and not all
To put things more simply, you can use soup.select() (and via the comments, you can then use .get_text() to extract the text from each tag).
from bs4 import BeautifulSoup
import requests
resp = requests.get(
"https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/109825373"
)
resp.raise_for_status()
soup = BeautifulSoup(resp.content, "html.parser")
texts = []
for tag in soup.select(
# NB: this selector uses Python's implicit string concatenation
# to split it onto several lines.
".p24_regularListing "
".p24_propertyOverview "
".p24_propertyOverviewRow "
".p24_propertyOverviewKey"
):
texts.append(tag.get_text())
print(texts)

how to use python to parse a html that is in txt format?

I am trying to parse a txt, example as below link.
The txt, however, is in the form of html. I am trying to get "COMPANY CONFORMED NAME" which located at the top of the file, and my function should return "Monocle Acquisition Corp".
https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt
I have tried below:
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")
However, "soup" does not contain "COMPANY CONFORMED NAME" at all.
Can someone point me in the right direction?
The data you are looking for is not in an HTML structure so Beautiful Soup is not the best tool. The correct and fast way of searching for this data is just using a simple Regular Expression like this:
import re
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
text_string = r.content.decode()
name_re = re.compile("COMPANY CONFORMED NAME:[\\t]*(.+)\n")
match = name_re.search(text_string).group(1)
print(match)
the part you look like is inside a huge tag <SEC-HEADER>
you can get the whole section by using soup.find('sec-header')
but you will need to parse the section manually, something like this works, but it's some dirty job :
(view it in replit : https://repl.it/#gui3/stackoverflow-parsing-html)
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")
header = soup.find('sec-header').text
company_name = None
for line in header.split('\n'):
split = line.split(':')
if len(split) > 1 :
key = split[0]
value = split[1]
if key.strip() == 'COMPANY CONFORMED NAME':
company_name = value.strip()
break
print(company_name)
There may be some library able to parse this data better than this code

Scrape a MediaWiki website (specific html tags) using Python

I would like to scrape this specific MediaWiki website with specific tags. Here is my current code.
import urllib.request
from bs4 import BeautifulSoup
url = "https://wiki.sa-mp.com/wiki/Strfind"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
# kill all script and style elements
for script in soup(["script", "style"]):
script.extract() # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
If you look at the URL, there is the description, parameters, return values and the example usage. That's what I would like to scrape. Thank you!
There may be a more efficient way to do this but the following uses css selectors to grab that information
from bs4 import BeautifulSoup
import requests as re
url ="https://wiki.sa-mp.com/wiki/Strfind"
response = re.get(url)
soup = BeautifulSoup(response.content, "lxml")
description = soup.select_one('.description').text
initial_parameters = soup.select('.parameters,.param')
final_parameters = [parameter.text for parameter in initial_parameters]
returnValues = soup.select_one('#bodyContent > .param + p + div').text
exampleUsage = soup.select_one('.pawn').text
results = [description,final_parameters,returnValues,exampleUsage]
print(results)

BeautifulSoup not letting me get the text

I'm looking to get all the text in tag
It gives me the text in the console, but it doesn't put it in the .txt file.
It works with body.text, but not with article.text. I don't know what to do.
import bs4 as bs
import urllib.request
#import re
sauce = urllib.request.urlopen('http://www.bodoniparavia.it/index.php/it/amministrazione-trasparente/bandi-di-gara-e-contratti.html')
soup = bs.BeautifulSoup(sauce,'lxml')
body = soup.body
article = body.find('article')
article1 = article.text
print(article1)
x = open('file.txt','w')
x.write(article1)
x.close
It seems to be working fine for me but try adding encoding = 'utf-8' to the write statement. So the code would now look like this
import bs4 as bs
import urllib.request
#import re
sauce = urllib.request.urlopen('http://www.bodoniparavia.it/index.php/it/amministrazione-trasparente/bandi-di-gara-e-contratti.html')
soup = bs.BeautifulSoup(sauce,'lxml')
body = soup.body
article = body.find('article')
article1 = article.text
print(article1)
x = open('file.txt','w',encoding = 'utf-8')
x.write(article1)
x.close()

extract text from html file python

I have write down a code to extract some text from the html file, This code extract the requested line from the webpage now I want to extract sequence data.Unfortunately I am not able to extract the text, its showing some error.
import urllib2
from HTMLParser import HTMLParser
import nltk
from bs4 import BeautifulSoup
# Proxy information were removed
# from these two lines
proxyOpener = urllib2.build_opener(proxyHandler)
urllib2.install_opener(proxyOpener)
response = urllib2.urlopen('http://tuberculist.epfl.ch/quicksearch.php?gene+name=Rv0470c')
################## BS Block ################################
soup = BeautifulSoup(response)
text = soup.get_text()
print text
##########################################################
html = response.readline()
for l in html:
if "|Rv0470c|" in l:
print l # code is running successfully till here
raw = nltk.clean_html(html)
print raw
How can I run this code successfully? I have already checked all the available threads and solution, but nothing is working.
i want to extract this part:
M. tuberculosis H37Rv|Rv0470c|pcaA
MSVQLTPHFGNVQAHYDLSDDFFRLFLDPTQTYSCAYFERDDMTLQEAQIAKIDLALGKLNLEPGMTLLDIGCGWGATMRRAIEKYDVNVVGLTLSENQAGHVQKMFDQMDTPRSRRVLLEGWEKFDEPVDRIVSIGAFEHFGHQRYHHFFEVTHRTLPADGKMLLHTIVRPTFKEGREKGLTLTHELVHFTKFILAEIFPGGWLPSIPTVHEYAEKVGFRVTAVQSLQLHYARTLDMWATALEANKDQAIAIQSQTVYDRYMKYLTGCAKLFRQGYTDVDQFTLEK
i am able to extract desired text after writing down this code: which works without any dependencies accept "urllib2" and for my case it works like a charm.
import urllib2
httpProxy = {'username': '------', '-----': '-------', 'host': '------', 'port': '-----'}
proxyHandler = urllib2.ProxyHandler({'http': 'http://'+httpProxy['username']+':'+httpProxy['password']+'#'+httpProxy['host']+':'+httpProxy['port']})
proxyOpener = urllib2.build_opener(proxyHandler)
urllib2.install_opener(proxyOpener)
response = urllib2.urlopen('http://tuberculist.epfl.ch/quicksearch.php?gene+name=Rv0470c')
html = response.readlines()
f = open("/home/zebrafish/Desktop/output.txt",'w')
for l in html:
if "|Rv0470c|" in l:
l = l.split("</small>")[0].split("<TR><TD><small style=font-family:courier>")[1]
l = l.split("<br />")
ttl = l[:1]
seq = "".join(l[1:])
f.write("".join(ttl))
f.write(seq)
f.close()
I'm not quite sure about what exactly you are requesting as a whole, but here's my ad hoc take on your problem (similar to yours actually) which does retrieve the part of the html you request. Maybe you can get some ideas. (adjust for Python2)
import requests
from bs4 import BeautifulSoup
url = 'http://tuberculist.epfl.ch/quicksearch.php?gene+name=Rv0470c'
r = requests.get(url)
html = r.content
soup = BeautifulSoup(html, "lxml")
for n in soup.find_all('tr'):
if "|Rv0470c|" in n.text:
nt = n.text
while '\n' in nt:
nt.replace('\n','\t')
nt=nt.split('\t')
nt = [x for x in nt if "|Rv0470c|" in x][0].strip()
print (nt.lstrip('>'))

Categories