I have an HTML page with data that I want to load into Python and write to a CSV file. I'm not sure which package will let me do this; I've tried a few (bs4 and urllib) with no success.
This is the HTML link:
https://www.cmegroup.com/CmeWS/mvc/Volume/Details/F/8478/20200807/F?tradeDate=20200807
Out of interest, what kind of HTML link is this? It appears to almost be in CSV format already. Apologies if this is a silly question. I've tried to search file types on the internet too.
I tried a URL request on this web link but received an error when trying to make the request:
from urllib.request import urlopen as uReq
cme_url = "https://www.cmegroup.com/CmeWS/mvc/Volume/Details/F/8478/20200807/F?tradeDate=20200807"
#opening up connection
uClient = uReq(cme_url)
I have scoured Stack Overflow for examples that could answer my question, but without success. For example, this one didn't help because it starts from a file that is already a CSV: Importing CSV into Python
I really appreciate your assistance.
You can read json from a URL and convert it to csv in a couple steps:
Use requests to get the json text and convert it to a dictionary
Use pandas to convert the dictionary to a csv file
I assume you only want the month data.
Here's the code:
import requests
import pandas as pd
url = 'https://www.cmegroup.com/CmeWS/mvc/Volume/Details/F/8478/20200807/F?tradeDate=20200807'
r = requests.get(url)
dj = r.json()
df = pd.DataFrame(dj['monthData'])
df.to_csv('out.csv', index=False)
Output (out.csv)
month,monthID,globex,openOutcry,totalVolume,blockVolume,efpVol,efrVol,eooVol,efsVol,subVol,pntVol,tasVol,deliveries,opnt,aon,atClose,change,strike,exercises
AUG 20,AUG-20-Calls,"10,007",0,"10,007",0,0,0,0,0,0,0,0,0,-,-,"9,372","-1,103",0,0
SEP 20,SEP-20-Calls,"1,316",0,"1,316",0,0,0,0,0,0,0,0,0,-,-,"2,899",47,0,0
OCT 20,OCT-20-Calls,115,0,115,0,0,0,0,0,0,0,0,0,-,-,614,32,0,0
NOV 20,NOV-20-Calls,16,0,16,0,0,0,0,0,0,0,0,0,-,-,68,6,0,0
DEC 20,DEC-20-Calls,13,0,13,0,0,0,0,0,0,0,0,0,-,-,105,-3,0,0
JAN 21,JAN-21-Calls,6,0,6,0,0,0,0,0,0,0,0,0,-,-,5,4,0,0
DEC 21,DEC-21-Calls,0,0,0,0,0,0,0,0,0,0,0,0,-,-,1,0,0,0
The URL you provided returns JSON rather than HTML.
So your question is really "How do I convert JSON to CSV?"
Python can do this on its own with the standard-library json module (JSON encoder and decoder) and the csv module.
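As a minimal sketch of that standard-library route, the snippet below parses JSON text with json and writes the records with csv.DictWriter. The sample text is a hypothetical, shortened stand-in for the "monthData" list shown in the output above; a real script would fetch the text from the URL first, and the function name month_data_to_csv is made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample shaped like the "monthData" list in the response;
# only a few of the real columns are included here.
sample_text = '''
{"monthData": [
  {"month": "AUG 20", "monthID": "AUG-20-Calls", "globex": "10,007", "totalVolume": "10,007"},
  {"month": "SEP 20", "monthID": "SEP-20-Calls", "globex": "1,316", "totalVolume": "1,316"}
]}
'''

def month_data_to_csv(json_text, out_file):
    """Write the 'monthData' records from json_text to out_file as CSV."""
    records = json.loads(json_text)["monthData"]
    # Use the keys of the first record as the header row.
    writer = csv.DictWriter(out_file, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)

# Write into an in-memory buffer; pass an open file instead to create out.csv.
buffer = io.StringIO()
month_data_to_csv(sample_text, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Values that contain a comma, such as "10,007", are quoted automatically by the csv module, which matches the quoting seen in the pandas output.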
I am new to scraping and I am trying to extract the data from html tables and save it as a csv file. How do I do that?
This is what I have done so far:
from bs4 import BeautifulSoup
import os
os.chdir('/Users/adityavemuganti/Downloads/Accounts_Monthly_Data-June2018')
soup=BeautifulSoup(open('Prod224_0055_00007464_20170930.html'),"html.parser")
Format=soup.prettify()
table=soup.find("table",attrs={"class":"details"})
Here is the html file I am trying to scrape from:
http://download.companieshouse.gov.uk/Accounts_Bulk_Data-2019-08-03.zip (It is a zip file). I have uncompressed the zip file and read the contents into 'soup' as shown above. Now I am trying to write the data sitting in the table tag to CSV/XLSX.
Pandas is the way to go here: use read_html to parse the tables and to_csv to write them out, or to_excel if you prefer xlsx output.
import pandas as pd
dataframes = pd.read_html('yoururlhere')
# Assuming there is only one table in the file, if not then you may need to do a little more digging
df = dataframes[0]
df.to_csv('filename.csv')
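If pandas or its HTML parser dependencies are not available, or you only want the table with class "details" from the question, the same extraction can be sketched with the standard library alone. Note that this swaps in a different technique (html.parser plus csv) and is not how read_html works internally. The class name TableRows and the sample HTML are hypothetical, and the sketch assumes the target table contains no nested tables.

```python
import csv
import io
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the cell text of every row inside <table class="details">."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._in_table = False  # True while inside the target table
        self._row = None        # cells of the row currently being read
        self._cell = None       # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "details" in classes:
            self._in_table = True
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._row is not None and tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table":
            self._in_table = False

# Hypothetical stand-in for the contents of the downloaded HTML file.
sample_html = """
<table class="details">
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>Turnover</td><td>1,200</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)

# Write to an in-memory buffer; use open('filename.csv', 'w', newline='') for a file.
buffer = io.StringIO()
csv.writer(buffer).writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

For a real file, read its text with open(...).read() and pass that to parser.feed() in place of sample_html.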
I'm trying to code a little "GPS", and I couldn't use the Google API because of its daily request limit.
I decided to use the site "viamichelin", which gives me the distance between two addresses. I wrote a little code to build all the URLs I need, like this:
import pandas
import numpy as np
# Raw strings keep the backslashes in the Windows paths from being read as escapes
df = pandas.read_excel(r'C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Clients')
df2 = pandas.read_excel(r'C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Agences')
# .values replaces DataFrame.as_matrix(), which was removed from recent pandas versions
clients = np.squeeze(np.asarray(df.values))
agences = np.squeeze(np.asarray(df2.values))
compteagences=0
comptetotal=0
for j in agences:
    compteclients=0
    for i in clients:
        print(agences[compteagences])
        print(clients[compteclients])
        url ='https://fr.viamichelin.be/web/Itineraires?departure='+agences[compteagences]+'&arrival='+clients[compteclients]+'&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption='
        print(url)
        compteclients+=1
        comptetotal+=1
    compteagences+=1
All my data is in Excel, which is why I used the pandas library. I now have all the URLs needed for my project.
However, I would like to extract the number of kilometers, and there's a problem: the information I need isn't in the page source, so I can't extract it with Python. The site looks like this:
[Screenshot: Michelin]
When I click on "Inspect" I can find the information I need (on the left), but I can't find it in the source code (on the right). Can someone help me?
[Screenshot: Itinerary]
I have already tried this, without success:
import os
import csv
import requests
from bs4 import BeautifulSoup
requete = requests.get("https://fr.viamichelin.be/web/Itineraires?departure=Rue%20Lebeau%2C%20Liege%2C%20Belgique&departureId=34MTE1Mmc2NzQwMDM0NHoxMDU1ZW44d2NOVEF1TmpNek5ERT1jTlM0MU5qazJPQT09Y05UQXVOak16TkRFPWNOUzQxTnpBM01nPT1jTlRBdU5qTXpOREU9Y05TNDFOekEzTWc9PTBhUnVlIExlYmVhdQ==&arrival=Rue%20Rys%20De%20Mosbeux%2C%20Trooz%2C%20Belgique&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption=")
page = requete.content
soup = BeautifulSoup(page, "html.parser")
print(soup)
Looking at the inspector for the page, the actual routing is done via a JavaScript invocation to this rather long URL.
The data you need seems to be in that response, wrapped in a _scriptLoaded( ... ) call. (The argument is a JavaScript object literal, so once you strip the wrapper you can try loading it into a dict with Python's built-in json library, provided the literal is also valid JSON.)
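Stripping that wrapper could look roughly like the sketch below. The helper name parse_jsonp, the sample response body, and its field names are all hypothetical; the real payload will be much larger and structured differently, and json.loads will only accept it if the object literal is strict JSON (double-quoted keys, no trailing commas).

```python
import json

def parse_jsonp(text, callback="_scriptLoaded"):
    """Return the data wrapped in callback(...) as Python objects."""
    # Skip past "callback(" and stop at the last closing parenthesis.
    start = text.index(callback + "(") + len(callback) + 1
    end = text.rindex(")")
    return json.loads(text[start:end])

# Hypothetical response body, only to show the shape of the wrapper.
sample = '_scriptLoaded({"summary": {"distance": 12345}});'
data = parse_jsonp(sample)
print(data["summary"]["distance"])
```

In practice you would fetch the long URL with requests.get(...) and pass its .text to parse_jsonp, then look through the resulting dict for the distance value.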
I'm new to Python and to working with APIs. My source created an FTP URL where they dump files on a daily basis, and I would like to grab a file to perform engineering and analysis. My problem is: how do I specify a username and password to pull the CSV?
import pandas as pd
data = pd.read_csv('http://site-ftp.site.com/test/cat/filename.csv')
How do I include credentials in this?
PS: the URL is fake, for the sake of an example.
With older versions of Pandas, you can use something like requests.get() to download the CSV data into memory. Then you could use StringIO to make the data "file like" so that pd.read_csv() can read it in. This approach avoids having to first write the data to a file.
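import requests
import pandas as pd
from io import StringIO
from requests.auth import HTTPBasicAuth
# Download the CSV over HTTP with basic authentication
response = requests.get("http://site-ftp.site.com/test/cat/filename.csv", auth=HTTPBasicAuth('user', 'password'))
# Wrap the downloaded text in a file-like object for read_csv
data = pd.read_csv(StringIO(response.text))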
print(data)
From pandas 0.19.2, the pd.read_csv() function now lets you just pass the URL directly. For example:
data = pd.read_csv('http://site-ftp.site.com/test/cat/filename.csv')
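Passing a bare URL does not by itself supply a username and password. If the source really is an FTP server, as the question suggests, one possible approach is to embed percent-encoded credentials in an ftp:// URL. This is a sketch under assumptions: the helper name ftp_url_with_credentials is hypothetical, the host and path are the fake ones from the question, and it assumes pandas passes the URL to Python's urllib, which understands the user:password@host form for FTP.

```python
from urllib.parse import quote

def ftp_url_with_credentials(user, password, host, path):
    """Build an ftp:// URL that embeds percent-encoded credentials."""
    # safe="" also encodes '/', ':' and '@', which would otherwise break the URL.
    return "ftp://{}:{}@{}/{}".format(
        quote(user, safe=""), quote(password, safe=""), host, path.lstrip("/")
    )

# Fake host and path from the question; the password is a made-up example.
url = ftp_url_with_credentials(
    "user", "p@ss:word", "site-ftp.site.com", "/test/cat/filename.csv"
)
print(url)
```

The resulting string would then be passed as pd.read_csv(url). Keep in mind that credentials embedded in a URL can end up in logs and tracebacks, so the requests-based approach above may be preferable when the server speaks HTTP.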