Iterate through a html file and extract data to CSV file - python
I have searched high and low for a solution, but none have quite fit what I need to do.
I have an HTML page that is saved as a file, let's call it sample.html, and I need to extract recurring JSON data from it. An example file is shown below.
I need to get the info from these files regularly, so the number of objects changes every time. An object would be considered as "{"SpecificIdent":2588,"SpecificNum":29,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false}"
I need to get each of the values into a CSV file, with the column headings being SpecificIdent, SpecificNum, Meter, Power, WPower, SNumber, isI. The data from each object would form the rows.
I apologize if this is a basic question in Python, but I am pretty new to it and cannot fathom the best way to do this. Any assistance would be greatly appreciated.
Kind regards
A
<html><head><meta name="color-scheme" content="light dark"></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">[{"SpecificIdent":2588,"SpecificNum":29,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":3716,"SpecificNum":39,"Meter":1835,"Power":11240.0,"WPower":null,"SNumber":"0703-403548","isI":false},{"SpecificIdent":6364,"SpecificNum":27,"Meter":7768,"Power":29969.0,"WPower":null,"SNumber":"467419","isI":false},{"SpecificIdent":6583,"SpecificNum":51,"Meter":7027,"Power":36968.0,"WPower":null,"SNumber":"JE1449-521248","isI":false},{"SpecificIdent":6612,"SpecificNum":57,"Meter":12828,"Power":53918.0,"WPower":null,"SNumber":"JE1509-534327","isI":false},{"SpecificIdent":7139,"SpecificNum":305,"Meter":6264,"Power":33101.0,"WPower":null,"SNumber":"JE1449-521204","isI":false},{"SpecificIdent":7551,"SpecificNum":116,"Meter":0,"Power":21569.0,"WPower":null,"SNumber":"JE1449-521252","isI":false},{"SpecificIdent":7643,"SpecificNum":56,"Meter":7752,"Power":40501.0,"WPower":null,"SNumber":"JE1449-521200","isI":false},{"SpecificIdent":8653,"SpecificNum":49,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":9733,"SpecificNum":142,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":10999,"SpecificNum":20,"Meter":7723,"Power":6987.0,"WPower":null,"SNumber":"JE1608-625534","isI":false},{"SpecificIdent":12086,"SpecificNum":24,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":14590,"SpecificNum":35,"Meter":394,"Power":10941.0,"WPower":null,"SNumber":"BN1905-944799","isI":false},{"SpecificIdent":14954,"SpecificNum":100,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"517163","isI":false},{"SpecificIdent":14995,"SpecificNum":58,"Meter":0,"Power":38789.0,"WPower":null,"SNumber":"JE1444-511511","isI":false},{"SpecificIdent":15245,"SpecificNum":26,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"430149","isI":false},{"SpecificIdent":18824,"SpecificNum":55,"Meter
":8236,"Power":31358.0,"WPower":null,"SNumber":"0703-310839","isI":false},{"SpecificIdent":20745,"SpecificNum":41,"Meter":0,"Power":60963.0,"WPower":null,"SNumber":"JE1447-517260","isI":false},{"SpecificIdent":31584,"SpecificNum":11,"Meter":0,"Power":3696.0,"WPower":null,"SNumber":"467154","isI":false},{"SpecificIdent":32051,"SpecificNum":40,"Meter":7870,"Power":13057.0,"WPower":null,"SNumber":"JE1608-625593","isI":false},{"SpecificIdent":32263,"SpecificNum":4,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":33137,"SpecificNum":132,"Meter":5996,"Power":26650.0,"WPower":null,"SNumber":"459051","isI":false},{"SpecificIdent":33481,"SpecificNum":144,"Meter":4228,"Power":16136.0,"WPower":null,"SNumber":"JE1603-617807","isI":false},{"SpecificIdent":33915,"SpecificNum":145,"Meter":5647,"Power":3157.0,"WPower":null,"SNumber":"JE1518-549610","isI":false},{"SpecificIdent":36051,"SpecificNum":119,"Meter":2923,"Power":12249.0,"WPower":null,"SNumber":"135493","isI":false},{"SpecificIdent":37398,"SpecificNum":21,"Meter":58,"Power":5540.0,"WPower":null,"SNumber":"BN1925-982761","isI":false},{"SpecificIdent":39024,"SpecificNum":50,"Meter":7217,"Power":38987.0,"WPower":null,"SNumber":"JE1445-511599","isI":false},{"SpecificIdent":39072,"SpecificNum":59,"Meter":5965,"Power":32942.0,"WPower":null,"SNumber":"JE1449-521199","isI":false},{"SpecificIdent":40601,"SpecificNum":9,"Meter":0,"Power":59655.0,"WPower":null,"SNumber":"JE1447-517150","isI":false},{"SpecificIdent":40712,"SpecificNum":37,"Meter":0,"Power":5715.0,"WPower":null,"SNumber":"JE1502-525840","isI":false},{"SpecificIdent":41596,"SpecificNum":53,"Meter":8803,"Power":60669.0,"WPower":null,"SNumber":"JE1503-527155","isI":false},{"SpecificIdent":50276,"SpecificNum":30,"Meter":2573,"Power":4625.0,"WPower":null,"SNumber":"JE1545-606334","isI":false},{"SpecificIdent":51712,"SpecificNum":69,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":56140,"SpecificNum":10,"Meter":5169
,"Power":26659.0,"WPower":null,"SNumber":"JE1547-609024","isI":false},{"SpecificIdent":56362,"SpecificNum":6,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":58892,"SpecificNum":113,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":65168,"SpecificNum":5,"Meter":12739,"Power":55833.0,"WPower":null,"SNumber":"JE1449-521284","isI":false},{"SpecificIdent":65255,"SpecificNum":60,"Meter":5121,"Power":27784.0,"WPower":null,"SNumber":"JE1449-521196","isI":false},{"SpecificIdent":65665,"SpecificNum":47,"Meter":11793,"Power":47576.0,"WPower":null,"SNumber":"JE1509-534315","isI":false},{"SpecificIdent":65842,"SpecificNum":8,"Meter":10783,"Power":46428.0,"WPower":null,"SNumber":"JE1509-534401","isI":false},{"SpecificIdent":65901,"SpecificNum":22,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":65920,"SpecificNum":17,"Meter":9316,"Power":38242.0,"WPower":null,"SNumber":"JE1509-534360","isI":false},{"SpecificIdent":66119,"SpecificNum":43,"Meter":12072,"Power":52157.0,"WPower":null,"SNumber":"JE1449-521259","isI":false},{"SpecificIdent":70018,"SpecificNum":34,"Meter":11172,"Power":49706.0,"WPower":null,"SNumber":"JE1449-521285","isI":false},{"SpecificIdent":71388,"SpecificNum":54,"Meter":6947,"Power":36000.0,"WPower":null,"SNumber":"JE1445-512406","isI":false},{"SpecificIdent":71892,"SpecificNum":36,"Meter":15398,"Power":63691.0,"WPower":null,"SNumber":"JE1447-517256","isI":false},{"SpecificIdent":72600,"SpecificNum":38,"Meter":14813,"Power":62641.0,"WPower":null,"SNumber":"JE1447-517189","isI":false},{"SpecificIdent":73645,"SpecificNum":2,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":77208,"SpecificNum":28,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":77892,"SpecificNum":15,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":78513,"SpecificNum":31,"Meter":6711,"Power":36461.0,"WPower":null,"SNumber":"
JE1445-511601","isI":false},{"SpecificIdent":79531,"SpecificNum":18,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false}]</pre></body></html>
I have tried examples from bs4, jsontoxml, and others, but I am sure there is a simple way to iterate and extract this?
I would use Python's standard library in the following way:
import csv
import json
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        # keep the most recent non-blank text node
        if data.strip():
            self.data = data
parser = MyHTMLParser()
with open("sample.html","r") as f:
    parser.feed(f.read())
with open('sample.csv', 'w', newline='') as csvfile:
    fieldnames = ['SpecificIdent', 'SpecificNum', 'Meter', 'Power', 'WPower', 'SNumber', 'isI']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(json.loads(parser.data))
which creates a file starting with the following lines:
SpecificIdent,SpecificNum,Meter,Power,WPower,SNumber,isI
2588,29,0,0.0,,,False
3716,39,1835,11240.0,,0703-403548,False
6364,27,7768,29969.0,,467419,False
6583,51,7027,36968.0,,JE1449-521248,False
6612,57,12828,53918.0,,JE1509-534327,False
7139,305,6264,33101.0,,JE1449-521204,False
7551,116,0,21569.0,,JE1449-521252,False
7643,56,7752,40501.0,,JE1449-521200,False
8653,49,0,0.0,,,False
Disclaimer: this assumes the JSON array you want is the last text element that is not empty (i.e. contains at least one non-whitespace character).
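If relying on "the last non-empty text node" feels fragile, one possible variant is to collect text only while the parser is inside the <pre> element. The sketch below is not part of the answer above; the class name PreTextParser and the helper extract_records are made up for illustration, and it assumes the page contains a single <pre> block holding the JSON array.

```python
import json
from html.parser import HTMLParser


class PreTextParser(HTMLParser):
    """Collect only the text found inside <pre> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        # Text outside <pre> is ignored, so stray text elsewhere cannot win.
        if self.in_pre:
            self.chunks.append(data)


def extract_records(html_text):
    """Return the JSON array stored in the <pre> block as a list of dicts."""
    parser = PreTextParser()
    parser.feed(html_text)
    parser.close()
    return json.loads("".join(parser.chunks))


# Shortened stand-in for sample.html (assumption: same layout as the real file).
sample = (
    '<html><body><pre>[{"SpecificIdent":2588,"SpecificNum":29,"Meter":0,'
    '"Power":0.0,"WPower":null,"SNumber":"","isI":false}]</pre></body></html>'
)
records = extract_records(sample)
print(records[0]["SpecificIdent"])
```

The list returned by extract_records can be passed to csv.DictWriter.writerows exactly as in the answer above.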
There is a Python library called BeautifulSoup that you could use to parse the whole HTML file:
# pip install bs4
from bs4 import BeautifulSoup
# your_html is the HTML source as a string; naming the parser avoids a bs4 warning
html = BeautifulSoup(your_html, 'html.parser')
From here on, you can perform any actions upon the html. In your case, you just need to find the <pre> element, and get its contents. This can be achieved easily:
pre = html.body.find('pre')
text = pre.text
Next, you need to parse the text, which appears to be JSON. You can do this with Python's built-in json library:
import json
result = json.loads(text)
Now we need to convert this to a CSV file. This can be done using the csv library:
import csv
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('result.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=[
        "SpecificIdent",
        "SpecificNum",
        "Meter",
        "Power",
        "WPower",
        "SNumber",
        "isI"
    ])
    writer.writeheader()
    writer.writerows(result)
Finally, your code should look something like this:
from bs4 import BeautifulSoup
import json
import csv
# read the saved page
with open('raw.html', 'r') as f:
    raw = f.read()
# the JSON array is the text content of the <pre> element
html = BeautifulSoup(raw, 'html.parser')
pre = html.body.find('pre')
text = pre.text
result = json.loads(text)
# write one CSV row per JSON object
with open('result.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=[
        "SpecificIdent",
        "SpecificNum",
        "Meter",
        "Power",
        "WPower",
        "SNumber",
        "isI"
    ])
    writer.writeheader()
    writer.writerows(result)
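Since the files change from run to run, it may be convenient not to hard-code the column names. The following is a minimal sketch, assuming the parsed JSON is a list of flat dicts; the function name records_to_csv is hypothetical, and it writes to an in-memory buffer so it can be tried without touching any file. Keys missing from a record are written as empty cells via restval.

```python
import csv
import io


def records_to_csv(records):
    """Render a list of flat dicts as CSV text, deriving the header from the data."""
    # Union of all keys, kept in first-seen order.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up records: the second one lacks some keys and adds a new one.
records = [
    {"SpecificIdent": 2588, "Meter": 0, "WPower": None, "isI": False},
    {"SpecificIdent": 3716, "Meter": 1835, "SNumber": "0703-403548"},
]
csv_text = records_to_csv(records)
print(csv_text)
```

To write a real file instead, the same DictWriter calls can be pointed at open('result.csv', 'w', newline='').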
Related
how to replace HTML codes in HTML file using python?
I'm trying to replace all HTML codes in my HTML file in a for loop (not sure if this is the easiest approach) without changing the formatting of the original file. When I run the code below, the codes are not replaced. Does anyone know what could be wrong?
import re
tex=open('ALICE.per-txt.txt', 'r')
tex=tex.read()
for i in tex:
    if i =='&otilde;':
        i=='õ'
    elif i == '&ccedil;':
        i=='ç'
with open('Alice1.replaced.txt', "w") as f:
    f.write(tex)
    f.close()
You can use html.unescape.
>>> import html
>>> html.unescape('&otilde;')
'õ'
With your code:
import html
with open('ALICE.per-txt.txt', 'r') as f:
    html_text = f.read()
html_text = html.unescape(html_text)
with open('ALICE.per-txt.txt', 'w') as f:
    f.write(html_text)
Please note that I opened the files with a with statement. This takes care of closing the file after the with block - something you forgot to do when reading the file.
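As a small illustration of why a character-by-character loop is unnecessary, html.unescape converts every named and numeric character reference in a string in one call. The sample string below is made up for the example.

```python
import html

# Named references (&ccedil;, &otilde;, &amp;) and a numeric one (&#245;) in one string.
encoded = "Li&ccedil;&otilde;es &amp; &#245;"
decoded = html.unescape(encoded)
print(decoded)
```

Everything that is not a character reference, including whitespace and line breaks, is left as it was, so the layout of the file is preserved.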
extracting hyperlinks from rtf with python
I'm trying to extract hyperlinks from RTFs with Python. I have about 1000 RTFs to go through, so I figured this could ease my task. But my code doesn't extract links to the articles, just the front page of that database. Here's what I wrote:
import csv
import re
with open('text.rtf', 'r') as file:
    for line in file:
        urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        print(urls)
with open ('some.csv','w') as fw:
    writer = csv.writer(fw)
    writer.writerows(urls)
And this is what's printed out:
[] [] [] ['https://database.com'] [] []
The csv file is empty. (And I want to write those urls into a csv file... Is it even possible?) I guess this needs to be modified: 'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line' I do not know how.
Python's built-in function open cannot decode RTF files on its own. You'll need to install another package to handle that. There's another that can help with extracting urls from text as well. Not sure it's the most accurate url extractor solution, though.
pip install striprtf urlextract
Back in your main file you can try something like:
import csv
from striprtf import striprtf
from urlextract import URLExtract
with open('text.rtf', 'r') as rtf_file:
    file_text = striprtf.rtf_to_text(rtf_file.read())
extractor = URLExtract()
urls = extractor.find_urls(file_text)
with open('some.csv', 'w', newline='') as fw:
    fieldnames = ['urls']
    writer = csv.DictWriter(fw, fieldnames=fieldnames)
    writer.writeheader()
    for link in urls:
        writer.writerow({'urls': link})
Hopefully this will get you what you need.
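For reference, here is a standard-library sketch of what goes wrong in the question's own code. The original pattern's character class [-\w.] has no "/", so each match stops at the host name; urls is overwritten on every line, so only the last line's result survives; and csv.writer.writerows expects each row to be a sequence, so a bare string would be split into characters. The pattern and function names below are illustrative assumptions rather than a complete URL grammar, and the sample text is made up.

```python
import csv
import io
import re

# Assumption: a URL runs until whitespace, a quote, an angle bracket,
# a brace or a backslash (the last two are common in raw RTF markup).
URL_PATTERN = re.compile(r"https?://[^\s\"'<>{}\\]+")


def find_urls(text):
    """Return the distinct URLs in text, in order of first appearance."""
    seen = []
    for url in URL_PATTERN.findall(text):
        if url not in seen:
            seen.append(url)
    return seen


def urls_to_csv(urls):
    """Render the URLs as a one-column CSV with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(["urls"])
    # Wrap each URL in a list so it is written as a single cell.
    writer.writerows([url] for url in urls)
    return buffer.getvalue()


sample_text = (
    "See https://database.com/article/123?id=7 and\n"
    "https://database.com and again https://database.com/article/123?id=7"
)
urls = find_urls(sample_text)
csv_text = urls_to_csv(urls)
print(csv_text)
```

Running find_urls over the whole file's text, rather than line by line, also keeps the results from earlier lines.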
How to download a csv file in python from a server?
from pip._vendor import requests
import csv
url = 'https://docs.google.com/spreadsheets/abcd'
dataReader = csv.reader(open(url), delimiter=',', quotechar='"')
exampleData = list(dataReader)
exampleData
Use Python Requests.
import requests
r = requests.get(url)
lines = r.text.splitlines()
We use splitlines to turn the text into an iterable of lines, like a file handle. You should probably wrap it in a try/except block in case of errors.
You need to use something like urllib2 to retrieve the file (urllib2 exists only in Python 2; in Python 3 the equivalent is urllib.request). For example:
import urllib2
import csv
csvfile = urllib2.urlopen('https://docs.google.com/spreadsheets/abcd')
dataReader = csv.reader(csvfile, delimiter=',', quotechar='"')
do_stuff(dataReader)
You can import urllib.request and then call data_stream = urllib.request.urlopen(url) to get a file-like object for the response. data_stream.read() returns bytes, so decode it before parsing, e.g. data = data_stream.read().decode('utf-8'). The right encoding depends on your source, so you may need to do some extra manipulation. csv.reader expects an iterable of lines, so pass it data.splitlines() rather than the raw string.
An example that translates from bytes and may work for you:
import csv
import urllib.request
data = urllib.request.urlopen(url)
# decode the bytes (assumes UTF-8); str() on bytes would leave a b'...' wrapper
data_csv = data.read().decode('utf-8')
dataReader = csv.reader(data_csv.splitlines(), delimiter=",")
headers = next(dataReader)
exampleData = list(dataReader)
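The parsing half of this can be tried offline. The sketch below takes a bytes payload such as the one returned by read(), decodes it, and lets csv.reader handle quoting, which splitting on commas by hand would get wrong. The function name parse_csv_bytes, the UTF-8 default and the sample payload are assumptions for illustration.

```python
import csv
import io


def parse_csv_bytes(payload, encoding="utf-8"):
    """Decode a CSV payload and return (headers, rows)."""
    text = payload.decode(encoding)
    # newline="" lets the csv module deal with \r\n and quoted line breaks itself.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=",", quotechar='"')
    rows = list(reader)
    if not rows:
        return [], []
    return rows[0], rows[1:]


# Stand-in for the bytes a server might send back (made-up data).
payload = b'name,team\r\n"Brooks, Aaron",CHI\r\n'
headers, rows = parse_csv_bytes(payload)
print(headers, rows)
```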
Import CSV NBA Stats in Excel
So after struggling for a long time I've found a way to get the data from nba.com as comma-separated values. This is the result:
http://stats.nba.com/stats/leaguedashplayerstats?DateFrom=&DateTo=&GameScope=&GameSegment=&LastNGames=15&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2015-16&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&VsConference=&VsDivision=
How do I get that into a nice CSV or Excel file? Or even better, if possible, how can I automatically query this data, like web querying a table through an Excel web query?
The following should get you started:
import requests
import csv
url = "http://stats.nba.com/stats/leaguedashplayerstats?DateFrom=&DateTo=&GameScope=&GameSegment=&LastNGames=15&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2015-16&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&VsConference=&VsDivision="
data = requests.get(url)
entries = data.json()
# in Python 3, open CSV files in text mode with newline='' (Python 2 used 'wb')
with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(entries['resultSets'][0]['headers'])
    csv_output.writerows(entries['resultSets'][0]['rowSet'])
This would produce an output.csv file starting as follows:
PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_ABBREVIATION,AGE,GP,W,L,W_PCT,MIN,OFF_RATING,DEF_RATING,NET_RATING,AST_PCT,AST_TO,AST_RATIO,OREB_PCT,DREB_PCT,REB_PCT,TM_TOV_PCT,EFG_PCT,TS_PCT,USG_PCT,PACE,PIE,FGM,FGA,FGM_PG,FGA_PG,FG_PCT,CFID,CFPARAMS
201166,Aaron Brooks,1610612741,CHI,31.0,13,6,7,0.462,17.5,105.8,106.8,-0.9,0.243,2.4,25.9,0.015,0.077,0.046,10.8,0.5,0.511,0.198,95.84,0.065,36,85,2.8,6.5,0.424,5,"201166,1610612741"
203932,Aaron Gordon,1610612753,ORL,20.0,15,3,12,0.2,23.0,98.9,106.4,-7.5,0.1,1.91,15.7,0.089,0.228,0.158,8.2,0.575,0.608,0.151,94.16,0.124,46,87,3.1,5.8,0.529,5,"203932,1610612753"
1626151,Aaron Harrison,1610612766,CHA,21.0,7,3,4,0.429,4.2,103.3,95.4,7.9,0.0,0.0,0.0,0.08,0.08,0.08,16.7,0.0,0.0,0.095,100.22,-0.032,0,5,0.0,0.7,0.0,5,"1626151,1610612766"
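The conversion step can be checked without calling the API. The sketch below assumes the response has the shape used in the answer, a resultSets list whose entries carry headers and rowSet; the function name result_set_to_csv and the tiny sample dictionary are made up for the example.

```python
import csv
import io


def result_set_to_csv(entries, index=0):
    """Render one result set (headers + rowSet) as CSV text."""
    result_set = entries["resultSets"][index]
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(result_set["headers"])
    writer.writerows(result_set["rowSet"])
    return buffer.getvalue()


# Cut-down stand-in for the parsed JSON response (assumed shape).
entries = {
    "resultSets": [
        {
            "headers": ["PLAYER_ID", "PLAYER_NAME", "CFPARAMS"],
            "rowSet": [[201166, "Aaron Brooks", "201166,1610612741"]],
        }
    ]
}
csv_text = result_set_to_csv(entries)
print(csv_text)
```

Note that the csv module quotes the CFPARAMS value because it contains a comma, which matches the quoted last column in the output shown above.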
Download JSON data and convert it to CSV using Python
I'm currently using Yahoo Pipes, which provides me with a JSON file from a URL. I would like to be able to fetch it and convert it into a CSV file, and I have no idea where to begin (I'm a complete beginner in Python). How can I fetch the JSON data from the URL? How can I transform it to CSV? Thank you
import json
import csv
from urllib.request import urlopen  # Python 3 replacement for urllib2
def getRows(data):
    # ?? this totally depends on what's in your data
    return []
url = "http://www.yahoo.com/something"
data = urlopen(url).read()
data = json.loads(data)
fname = "mydata.csv"
# text mode with newline='' is what the csv module expects in Python 3
with open(fname, 'w', newline='') as outf:
    outcsv = csv.writer(outf)
    outcsv.writerows(getRows(data))