Importing HTML code into CSV using Python
I have an HTML page with data that I want to bring into Python and write to a CSV. I'm not sure which package will let me do this, as I've tried a few different ones (bs4 and urllib) with no success.
This is the HTML link:
https://www.cmegroup.com/CmeWS/mvc/Volume/Details/F/8478/20200807/F?tradeDate=20200807
Out of interest, what kind of link is this? The data almost appears to be in CSV format already. Apologies if this is a silly question; I've tried searching for the file type online too.
I tried a URL request on this link, but received an error when making the request:
from urllib.request import urlopen as uReq
cme_url = "https://www.cmegroup.com/CmeWS/mvc/Volume/Details/F/8478/20200807/F?tradeDate=20200807"
#opening up connection
uClient = uReq(cme_url)
I have scoured Stack Overflow for examples that could answer my question, but without success. For example, this one didn't help because it starts from a file that is already a CSV: Importing CSV into Python
I really appreciate your assistance.
You can read JSON from a URL and convert it to CSV in a couple of steps:
Use requests to fetch the JSON text and convert it to a dictionary
Use pandas to write the dictionary out as a CSV file
I assume you only want the month data.
Here's the code:
import requests
import pandas as pd

url = 'https://www.cmegroup.com/CmeWS/mvc/Volume/Details/F/8478/20200807/F?tradeDate=20200807'
r = requests.get(url)
dj = r.json()  # decode the JSON response into a dictionary
df = pd.DataFrame(dj['monthData'])  # the per-month rows live under the 'monthData' key
df.to_csv('out.csv', index=False)  # write the table out as CSV
Output (out.csv)
month,monthID,globex,openOutcry,totalVolume,blockVolume,efpVol,efrVol,eooVol,efsVol,subVol,pntVol,tasVol,deliveries,opnt,aon,atClose,change,strike,exercises
AUG 20,AUG-20-Calls,"10,007",0,"10,007",0,0,0,0,0,0,0,0,0,-,-,"9,372","-1,103",0,0
SEP 20,SEP-20-Calls,"1,316",0,"1,316",0,0,0,0,0,0,0,0,0,-,-,"2,899",47,0,0
OCT 20,OCT-20-Calls,115,0,115,0,0,0,0,0,0,0,0,0,-,-,614,32,0,0
NOV 20,NOV-20-Calls,16,0,16,0,0,0,0,0,0,0,0,0,-,-,68,6,0,0
DEC 20,DEC-20-Calls,13,0,13,0,0,0,0,0,0,0,0,0,-,-,105,-3,0,0
JAN 21,JAN-21-Calls,6,0,6,0,0,0,0,0,0,0,0,0,-,-,5,4,0,0
DEC 21,DEC-21-Calls,0,0,0,0,0,0,0,0,0,0,0,0,-,-,1,0,0,0
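The code above assumes only the month data is wanted; if the other sections of the response are also of interest, a quick exploratory check (not part of the answer itself) shows what else the decoded dictionary contains:
# list the top-level keys of the decoded JSON to see which other sections exist
print(list(dj.keys()))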
The data returned by the URL you provided is actually JSON, not HTML.
So your question is really "How do I convert JSON to CSV?"
Python itself can solve this problem, with its built-in JSON encoder and decoder (the json module).
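For example, a minimal sketch of that approach, reusing the URL and the 'monthData' key from the answer above; the download still uses requests, but the JSON-to-CSV conversion itself needs nothing beyond the standard-library json and csv modules (the output filename is just an example):
import csv
import json
import requests

url = 'https://www.cmegroup.com/CmeWS/mvc/Volume/Details/F/8478/20200807/F?tradeDate=20200807'
text = requests.get(url).text

data = json.loads(text)   # decode the JSON text into a dictionary
rows = data['monthData']  # list of dicts, one per month, as in the pandas answer

# write the rows with csv.DictWriter, using the keys of the first row as the header
with open('out_stdlib.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)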
Related
How to import a file with extension .A?
I downloaded a file with extension .A which contains a time series I would like to work on in Python. I'm not an expert at all with .A files, but if I open it in a text editor I can see it contains the data I'd like to work on. How can I convert that file in Python into something I can work with (i.e. an array, a pandas Series...)?
import requests
response = requests.get("https://sdw-wsrest.ecb.europa.eu/service/data/EXR/D.USD.EUR.SP00.A?startPeriod=2021-02-20&endPeriod=2021-02-25")
data = response.text
You need to read up on parsing XML. This code will get the data into a data structure typical for XML; you may mangle it as you see fit from there. You need to provide more information about how you'd like these data to look in order to get a more complete answer.
import requests
import xml.etree.ElementTree as ET

response = requests.get("https://sdw-wsrest.ecb.europa.eu/service/data/EXR/D.USD.EUR.SP00.A?startPeriod=2021-02-20&endPeriod=2021-02-25")
data = response.text
root = ET.fromstring(data)
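If it helps, a quick way to see what the parsed tree actually contains before deciding how to flatten it (just a generic exploratory sketch, not specific to the ECB feed):
# walk every element in the tree, printing its (namespaced) tag, attributes and text
for elem in root.iter():
    print(elem.tag, elem.attrib, (elem.text or '').strip())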
Reading CSV using Pandas
I am attempting to read the following CSV so I can process it further, but I am getting a pandas.errors.ParserError. I would really appreciate any help on how I can read it. Can you help me identify what I am doing wrong? My code:
import pandas as pd
logic_df = pd.read_csv("http://www.sharecsv.com/s/6c1b912f54d87d45f4728f8fb1510a5eb/random.csv")
I am not sure if there is something wrong with my CSV, because I ran it through a CSV linter and it said the file is fine, so I am not sure what the issue is. I also tried the following, with no luck:
logic_df = pd.read_csv("http://www.sharecsv.com/s/6cb912f54d87d45f4728f81fb1510a5eb/random.csv", error_bad_lines=False)
Changing the URL to the direct link of the table should work:
df = pd.read_csv("http://www.sharecsv.com/dl/6cb912f54d87d45f4728f8fb1510a5eb/random.csv")
The thing is, your URL points to an HTML page, not a CSV file per se. You can either use the URL above, or read your URL's page source with pd.read_html, like this:
df = pd.read_html('http://www.sharecsv.com/s/6cb912f54d87d45f4728f8fb1510a5eb/random.csv', header=0)[0]
Hope it helps!
Need to extract data from html tables
I am new to scraping and I am trying to extract the data from HTML tables and save it as a CSV file. How do I do that? This is what I have done so far:
from bs4 import BeautifulSoup
import os

os.chdir('/Users/adityavemuganti/Downloads/Accounts_Monthly_Data-June2018')
soup = BeautifulSoup(open('Prod224_0055_00007464_20170930.html'), "html.parser")
Format = soup.prettify()
table = soup.find("table", attrs={"class": "details"})
Here is the HTML file I am trying to scrape from: http://download.companieshouse.gov.uk/Accounts_Bulk_Data-2019-08-03.zip (it is a zip file). I have uncompressed the zip file and read the contents into 'soup' as mentioned above. Now I am trying to read the data sitting in the table tag into a CSV/XLSX format.
Pandas is the way to go here: read_html and to_csv, or, if you prefer XLSX output, to_excel.
import pandas as pd

dataframes = pd.read_html('yoururlhere')
# Assuming there is only one table in the file; if not, you may need to do a little more digging
df = dataframes[0]
df.to_csv('filename.csv')
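Since the files in that archive are local HTML documents rather than URLs, read_html can also be pointed at a file path. A small sketch using the filename and table class from the question (the output name is just an example):
import pandas as pd

# read_html also accepts a local path; attrs narrows the match to the <table class="details"> the question targets
tables = pd.read_html('Prod224_0055_00007464_20170930.html', attrs={'class': 'details'})
tables[0].to_csv('Prod224_0055_00007464_20170930.csv', index=False)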
HTML hidden elements
I'm actually trying to code a little "GPS", and I couldn't use the Google API because of the daily request limit. I decided to use the site "viamichelin", which gives me the distance between two addresses. I wrote a little piece of code to build all the URLs I needed, like this:
import pandas
import numpy as np

df = pandas.read_excel('C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Clients')
df2 = pandas.read_excel('C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Agences')
matrix = df.as_matrix(columns=None)
clients = np.squeeze(np.asarray(matrix))
matrix2 = df2.as_matrix(columns=None)
agences = np.squeeze(np.asarray(matrix2))
compteagences = 0
comptetotal = 0
for j in agences:
    compteclients = 0
    for i in clients:
        print agences[compteagences]
        print clients[compteclients]
        url = 'https://fr.viamichelin.be/web/Itineraires?departure=' + agences[compteagences] + '&arrival=' + clients[compteclients] + '&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption='
        print url
        compteclients += 1
        comptetotal += 1
    compteagences += 1
All my data is in Excel, which is why I used the pandas library. I have all the URLs needed for my project. However, I would like to extract the number of kilometers for each trip, and there's a little problem: the information I need is not in the page source, so I can't extract it with Python... The site is presented like this (see the "Michelin" screenshot). When I click on "inspect" I can find the information I need (on the left), but I can't find it in the page source (on the right)... Can someone provide me some help? (See the "Itinerary" screenshot.)
I have already tried this, without succeeding:
import os
import csv
import requests
from bs4 import BeautifulSoup

requete = requests.get("https://fr.viamichelin.be/web/Itineraires?departure=Rue%20Lebeau%2C%20Liege%2C%20Belgique&departureId=34MTE1Mmc2NzQwMDM0NHoxMDU1ZW44d2NOVEF1TmpNek5ERT1jTlM0MU5qazJPQT09Y05UQXVOak16TkRFPWNOUzQxTnpBM01nPT1jTlRBdU5qTXpOREU9Y05TNDFOekEzTWc9PTBhUnVlIExlYmVhdQ==&arrival=Rue%20Rys%20De%20Mosbeux%2C%20Trooz%2C%20Belgique&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption=")
page = requete.content
soup = BeautifulSoup(page, "html.parser")
print soup
Looking at the inspector for the page, the actual routing is done via a JavaScript invocation to this rather long URL. The data you need seems to be in that response, starting from _scriptLoaded(. (Since it's a JavaScript object literal, you can use Python's built-in JSON library to load the data into a dict.)
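A sketch of that idea, assuming the wrapped payload is valid JSON as the answer suggests; the routing URL below is a hypothetical placeholder for the long request URL found in the browser's network inspector:
import json
import requests

# hypothetical placeholder: paste the long routing URL from the network inspector here
routing_url = "https://fr.viamichelin.be/..."

text = requests.get(routing_url).text

# the response is a JSONP-style wrapper: _scriptLoaded({...});
start = text.index("_scriptLoaded(") + len("_scriptLoaded(")
end = text.rindex(")")
itinerary = json.loads(text[start:end])
print(itinerary.keys())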
How to Read a WebPage with Python and write to a flat file?
Very novice at Python here. I'm trying to read the table presented at this page (with the current filters set as is) and then write it to a CSV file: http://www65.myfantasyleague.com/2017/options?L=47579&O=243&TEAM=DAL&POS=RB
I tried the approach below. It creates the CSV file but does not fill it with the actual table contents. I appreciate any help in advance, thanks.
import requests
import pandas as pd

url = 'http://www65.myfantasyleague.com/2017/optionsL=47579&O=243&TEAM=DAL&POS=RB'
csv_file = 'DAL.RB.csv'
pd.read_html(requests.get(url).content)[-1].to_csv(csv_file)
Generally, try to describe your problem more precisely, try to debug, and don't put everything on one line. With that said, your specific problems here were the table index and the missing ? in the URL (after options):
import requests
import pandas as pd

url = 'http://www65.myfantasyleague.com/2017/options?L=47579&O=243&TEAM=DAL&POS=RB'  # -^- added the missing ?
csv_file = 'DAL.RB.csv'
pd.read_html(requests.get(url).content)[1].to_csv(csv_file)  # -^- changed the index from -1 to 1
This yields a CSV file with the table in it.
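If the right index isn't obvious, a quick diagnostic sketch to see how many tables pandas finds on the page and what each one looks like:
import requests
import pandas as pd

url = 'http://www65.myfantasyleague.com/2017/options?L=47579&O=243&TEAM=DAL&POS=RB'
tables = pd.read_html(requests.get(url).content)

# print the number of tables found and a preview of each, to pick the right index
print(len(tables))
for i, t in enumerate(tables):
    print(i, t.shape)
    print(t.head())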