How to download a CSV file in Python from a server?

from pip._vendor import requests
import csv
url = 'https://docs.google.com/spreadsheets/abcd'
dataReader = csv.reader(open(url), delimiter=',', quotechar='"')
exampleData = list(dataReader)
exampleData

Use Python Requests.
import requests
r = requests.get(url)
lines = r.text.splitlines()
We use splitlines() to turn the text into an iterable, like a file handle. You should probably wrap it in a try/except block in case of errors.
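For instance, a minimal sketch of the full pattern (the URL is a placeholder), with the parsing split into a helper so the try/except stays around the network call only:

```python
import csv

def parse_csv_text(text):
    # splitlines() turns the response body into the line iterable csv.reader expects
    return list(csv.reader(text.splitlines(), delimiter=',', quotechar='"'))

# With requests (hypothetical URL), wrapped in try/except as suggested:
# import requests
# try:
#     r = requests.get('https://example.com/data.csv', timeout=10)
#     r.raise_for_status()
#     rows = parse_csv_text(r.text)
# except requests.RequestException as err:
#     print('download failed:', err)

rows = parse_csv_text('name,qty\napple,3\n"b,c",7\n')
# rows == [['name', 'qty'], ['apple', '3'], ['b,c', '7']]
```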

You need to use something like urllib2 (on Python 2) to retrieve the file.
For example:
import urllib2
import csv

csvfile = urllib2.urlopen('https://docs.google.com/spreadsheets/abcd')
dataReader = csv.reader(csvfile, delimiter=',', quotechar='"')
do_stuff(dataReader)
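Note that urllib2 exists only on Python 2; on Python 3 the equivalent lives in urllib.request, and the byte stream must be decoded before csv.reader can consume it. A sketch using codecs.iterdecode, with an in-memory stream standing in for the urlopen() response:

```python
import codecs
import csv
import io

# io.BytesIO stands in for the file-like object urllib.request.urlopen() returns;
# iterating it yields byte lines, which iterdecode turns into text lines
byte_stream = io.BytesIO(b'a,b\n1,2\n')
dataReader = csv.reader(codecs.iterdecode(byte_stream, 'utf-8'),
                        delimiter=',', quotechar='"')
rows = list(dataReader)
# rows == [['a', 'b'], ['1', '2']]
```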

You can import urllib.request and then simply call data_stream = urllib.request.urlopen(url) to get a buffer of the file. You can then save the CSV data with data = str(data_stream.read()), which may be a bit unclean depending on your source or its encoding, so you may need to do some manipulation; if not, you can just pass it to csv.reader(data, delimiter=',').
An example requiring translation from byte format that may work for you:
import urllib.request
import csv

data = urllib.request.urlopen(url)
data_csv = str(data.read())
# split off the b' flag from the string, then split at the escaped newlines, dropping the empty trailing piece
dataReader = csv.reader(data_csv.split("b\'", 1)[1].split("\\n")[:-1], delimiter=",")
headers = next(dataReader)
exampleData = list(dataReader)
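An aside (my suggestion, not part of the original answer): decoding the bytes explicitly usually avoids the b'...' string surgery above entirely. Sample bytes stand in for data.read():

```python
import csv
import io

raw = b'SpecificIdent,Power\n2588,0.0\n'  # stand-in for urlopen(url).read()
# decode to text, then wrap in StringIO so csv.reader gets a file-like object
dataReader = csv.reader(io.StringIO(raw.decode('utf-8')), delimiter=',')
headers = next(dataReader)
exampleData = list(dataReader)
# headers == ['SpecificIdent', 'Power']; exampleData == [['2588', '0.0']]
```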

Related

Iterate through a html file and extract data to CSV file

I have searched high and low for a solution, but none has quite fit what I need to do.
I have an HTML page saved as a file, let's call it sample.html, and I need to extract recurring JSON data from it. An example file is as follows.
I need to get the info from these files regularly, so the number of objects changes every time; one object would be "{"SpecificIdent":2588,"SpecificNum":29,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false}".
I need to get each of the values into a CSV file, with the column headings SpecificIdent, SpecificNum, Meter, Power, WPower, SNumber, isI. The associated data would form the rows for each.
I apologize if this is a basic question in Python, but I am pretty new to it and cannot fathom the best way to do this. Any assistance would be greatly appreciated.
Kind regards
A
<html><head><meta name="color-scheme" content="light dark"></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">[{"SpecificIdent":2588,"SpecificNum":29,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":3716,"SpecificNum":39,"Meter":1835,"Power":11240.0,"WPower":null,"SNumber":"0703-403548","isI":false},{"SpecificIdent":6364,"SpecificNum":27,"Meter":7768,"Power":29969.0,"WPower":null,"SNumber":"467419","isI":false},{"SpecificIdent":6583,"SpecificNum":51,"Meter":7027,"Power":36968.0,"WPower":null,"SNumber":"JE1449-521248","isI":false},{"SpecificIdent":6612,"SpecificNum":57,"Meter":12828,"Power":53918.0,"WPower":null,"SNumber":"JE1509-534327","isI":false},{"SpecificIdent":7139,"SpecificNum":305,"Meter":6264,"Power":33101.0,"WPower":null,"SNumber":"JE1449-521204","isI":false},{"SpecificIdent":7551,"SpecificNum":116,"Meter":0,"Power":21569.0,"WPower":null,"SNumber":"JE1449-521252","isI":false},{"SpecificIdent":7643,"SpecificNum":56,"Meter":7752,"Power":40501.0,"WPower":null,"SNumber":"JE1449-521200","isI":false},{"SpecificIdent":8653,"SpecificNum":49,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":9733,"SpecificNum":142,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":10999,"SpecificNum":20,"Meter":7723,"Power":6987.0,"WPower":null,"SNumber":"JE1608-625534","isI":false},{"SpecificIdent":12086,"SpecificNum":24,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":14590,"SpecificNum":35,"Meter":394,"Power":10941.0,"WPower":null,"SNumber":"BN1905-944799","isI":false},{"SpecificIdent":14954,"SpecificNum":100,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"517163","isI":false},{"SpecificIdent":14995,"SpecificNum":58,"Meter":0,"Power":38789.0,"WPower":null,"SNumber":"JE1444-511511","isI":false},{"SpecificIdent":15245,"SpecificNum":26,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"430149","isI":false},{"SpecificIdent":18824,"SpecificNum":55,"Meter
":8236,"Power":31358.0,"WPower":null,"SNumber":"0703-310839","isI":false},{"SpecificIdent":20745,"SpecificNum":41,"Meter":0,"Power":60963.0,"WPower":null,"SNumber":"JE1447-517260","isI":false},{"SpecificIdent":31584,"SpecificNum":11,"Meter":0,"Power":3696.0,"WPower":null,"SNumber":"467154","isI":false},{"SpecificIdent":32051,"SpecificNum":40,"Meter":7870,"Power":13057.0,"WPower":null,"SNumber":"JE1608-625593","isI":false},{"SpecificIdent":32263,"SpecificNum":4,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":33137,"SpecificNum":132,"Meter":5996,"Power":26650.0,"WPower":null,"SNumber":"459051","isI":false},{"SpecificIdent":33481,"SpecificNum":144,"Meter":4228,"Power":16136.0,"WPower":null,"SNumber":"JE1603-617807","isI":false},{"SpecificIdent":33915,"SpecificNum":145,"Meter":5647,"Power":3157.0,"WPower":null,"SNumber":"JE1518-549610","isI":false},{"SpecificIdent":36051,"SpecificNum":119,"Meter":2923,"Power":12249.0,"WPower":null,"SNumber":"135493","isI":false},{"SpecificIdent":37398,"SpecificNum":21,"Meter":58,"Power":5540.0,"WPower":null,"SNumber":"BN1925-982761","isI":false},{"SpecificIdent":39024,"SpecificNum":50,"Meter":7217,"Power":38987.0,"WPower":null,"SNumber":"JE1445-511599","isI":false},{"SpecificIdent":39072,"SpecificNum":59,"Meter":5965,"Power":32942.0,"WPower":null,"SNumber":"JE1449-521199","isI":false},{"SpecificIdent":40601,"SpecificNum":9,"Meter":0,"Power":59655.0,"WPower":null,"SNumber":"JE1447-517150","isI":false},{"SpecificIdent":40712,"SpecificNum":37,"Meter":0,"Power":5715.0,"WPower":null,"SNumber":"JE1502-525840","isI":false},{"SpecificIdent":41596,"SpecificNum":53,"Meter":8803,"Power":60669.0,"WPower":null,"SNumber":"JE1503-527155","isI":false},{"SpecificIdent":50276,"SpecificNum":30,"Meter":2573,"Power":4625.0,"WPower":null,"SNumber":"JE1545-606334","isI":false},{"SpecificIdent":51712,"SpecificNum":69,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":56140,"SpecificNum":10,"Meter":5169
,"Power":26659.0,"WPower":null,"SNumber":"JE1547-609024","isI":false},{"SpecificIdent":56362,"SpecificNum":6,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":58892,"SpecificNum":113,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":65168,"SpecificNum":5,"Meter":12739,"Power":55833.0,"WPower":null,"SNumber":"JE1449-521284","isI":false},{"SpecificIdent":65255,"SpecificNum":60,"Meter":5121,"Power":27784.0,"WPower":null,"SNumber":"JE1449-521196","isI":false},{"SpecificIdent":65665,"SpecificNum":47,"Meter":11793,"Power":47576.0,"WPower":null,"SNumber":"JE1509-534315","isI":false},{"SpecificIdent":65842,"SpecificNum":8,"Meter":10783,"Power":46428.0,"WPower":null,"SNumber":"JE1509-534401","isI":false},{"SpecificIdent":65901,"SpecificNum":22,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":65920,"SpecificNum":17,"Meter":9316,"Power":38242.0,"WPower":null,"SNumber":"JE1509-534360","isI":false},{"SpecificIdent":66119,"SpecificNum":43,"Meter":12072,"Power":52157.0,"WPower":null,"SNumber":"JE1449-521259","isI":false},{"SpecificIdent":70018,"SpecificNum":34,"Meter":11172,"Power":49706.0,"WPower":null,"SNumber":"JE1449-521285","isI":false},{"SpecificIdent":71388,"SpecificNum":54,"Meter":6947,"Power":36000.0,"WPower":null,"SNumber":"JE1445-512406","isI":false},{"SpecificIdent":71892,"SpecificNum":36,"Meter":15398,"Power":63691.0,"WPower":null,"SNumber":"JE1447-517256","isI":false},{"SpecificIdent":72600,"SpecificNum":38,"Meter":14813,"Power":62641.0,"WPower":null,"SNumber":"JE1447-517189","isI":false},{"SpecificIdent":73645,"SpecificNum":2,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":77208,"SpecificNum":28,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":77892,"SpecificNum":15,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":78513,"SpecificNum":31,"Meter":6711,"Power":36461.0,"WPower":null,"SNumber":"
JE1445-511601","isI":false},{"SpecificIdent":79531,"SpecificNum":18,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false}]</pre></body></html>
I have tried examples from bs4, jsontoxml, and others, but I am sure there is a simple way to iterate and extract this?
I would harness Python's standard library in the following way:
import csv
import json
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        if data.strip():
            self.data = data

parser = MyHTMLParser()
with open("sample.html", "r") as f:
    parser.feed(f.read())
with open('sample.csv', 'w', newline='') as csvfile:
    fieldnames = ['SpecificIdent', 'SpecificNum', 'Meter', 'Power', 'WPower', 'SNumber', 'isI']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(json.loads(parser.data))
which creates a file starting with the following lines:
SpecificIdent,SpecificNum,Meter,Power,WPower,SNumber,isI
2588,29,0,0.0,,,False
3716,39,1835,11240.0,,0703-403548,False
6364,27,7768,29969.0,,467419,False
6583,51,7027,36968.0,,JE1449-521248,False
6612,57,12828,53918.0,,JE1509-534327,False
7139,305,6264,33101.0,,JE1449-521204,False
7551,116,0,21569.0,,JE1449-521252,False
7643,56,7752,40501.0,,JE1449-521200,False
8653,49,0,0.0,,,False
Disclaimer: this assumes the JSON array you want is the last non-empty text element (i.e. it contains at least one non-whitespace character).
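A minimal self-contained check of this approach, with a tiny inline string standing in for sample.html:

```python
import json
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        # remember the last non-whitespace text node seen
        if data.strip():
            self.data = data

parser = MyHTMLParser()
parser.feed('<html><body><pre>[{"SpecificIdent": 2588, "Power": 0.0}]</pre></body></html>')
records = json.loads(parser.data)
# records == [{'SpecificIdent': 2588, 'Power': 0.0}]
```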
There is a Python library called BeautifulSoup that you can use to parse the whole HTML file:
# pip install bs4
from bs4 import BeautifulSoup

html = BeautifulSoup(your_html, 'html.parser')
From here on, you can perform any actions upon the HTML. In your case, you just need to find the <pre> element and get its contents. This can be achieved easily:
pre = html.body.find('pre')
text = pre.text
Finally, you need to parse the text, which appears to be JSON. You can do this with Python's built-in json library:
import json
result = json.loads(text)
Now, we need to convert this to a CSV file. This can be done using the csv library:
import csv

with open('result.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=[
        "SpecificIdent",
        "SpecificNum",
        "Meter",
        "Power",
        "WPower",
        "SNumber",
        "isI"
    ])
    writer.writeheader()
    writer.writerows(result)
Finally, your code should look something like this:
from bs4 import BeautifulSoup
import json
import csv

with open('raw.html', 'r') as f:
    raw = f.read()

html = BeautifulSoup(raw, 'html.parser')
pre = html.body.find('pre')
text = pre.text
result = json.loads(text)

with open('result.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=[
        "SpecificIdent",
        "SpecificNum",
        "Meter",
        "Power",
        "WPower",
        "SNumber",
        "isI"
    ])
    writer.writeheader()
    writer.writerows(result)

From gzip to json to dataframe to csv

I am trying to get some data from an open API:
https://data.brreg.no/enhetsregisteret/api/enheter/lastned
but I am having difficulties understanding the different types of objects and the order the conversions should be in. Is it strings to bytes, is it BytesIO or StringIO, is it decode('utf-8') or decode('unicode'), etc.?
So far:
url_get = 'https://data.brreg.no/enhetsregisteret/api/enheter/lastned'
with urllib.request.urlopen(url_get) as response:
    encoding = response.info().get_param('charset', 'utf8')
    compressed_file = io.BytesIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)
and now is where I am stuck, how should I write the next line of code?
json_str = json.loads(decompressed_file.read().decode('utf-8'))
My workaround is that if I write it out as a JSON file and then read it back in, the transformation to a DataFrame works:
with io.open('brreg.json', 'wb') as f:
    f.write(decompressed_file.read())
with open(f_path, encoding='utf-8') as fin:
    d = json.load(fin)
df = json_normalize(d)
with open('brreg_2.csv', 'w', encoding='utf-8', newline='') as fout:
    fout.write(df.to_csv())
I found many SO posts about it, but I am still confused. The first one explains it quite well, but I still need some spoon-feeding.
Python 3, read/write compressed json objects from/to gzip file
TypeError when trying to convert Python 2.7 code to Python 3.4 code
How can I create a GzipFile instance from the “file-like object” that urllib.urlopen() returns?
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
It works fine for me using the decompress function rather than the GzipFile class to decompress the file, though I'm not sure why yet...
import urllib.request
import gzip
import io
import json

url_get = 'https://data.brreg.no/enhetsregisteret/api/enheter/lastned'
with urllib.request.urlopen(url_get) as response:
    encoding = response.info().get_param('charset', 'utf8')
    compressed_file = io.BytesIO(response.read())
    decompressed_file = gzip.decompress(compressed_file.read())
    json_str = json.loads(decompressed_file.decode('utf-8'))
EDIT: in fact, the following also works fine for me, which appears to be your exact code...
(Further edit, turns out it's not quite your exact code because your final line was outside the with block which meant response was no longer open when it was needed - see comment thread)
import urllib.request
import gzip
import io
import json

url_get = 'https://data.brreg.no/enhetsregisteret/api/enheter/lastned'
with urllib.request.urlopen(url_get) as response:
    encoding = response.info().get_param('charset', 'utf8')
    compressed_file = io.BytesIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)
    json_str = json.loads(decompressed_file.read().decode('utf-8'))
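For an offline sanity check of both decompression routes (no network; the payload here is made up), a round-trip shows they are equivalent:

```python
import gzip
import io
import json

payload = json.dumps({"organisasjonsnummer": "123", "navn": "Example AS"}).encode('utf-8')
compressed = gzip.compress(payload)

# Route 1: gzip.decompress on the raw bytes
obj1 = json.loads(gzip.decompress(compressed).decode('utf-8'))

# Route 2: gzip.GzipFile over a BytesIO, as in the question
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as gz:
    obj2 = json.loads(gz.read().decode('utf-8'))

# obj1 == obj2: both routes recover the original object
```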

Read csv from url one line at the time in Python 3.X

I have to read an online csv-file into a postgres database, and in that context I have some problems reading the online csv-file properly.
If I just import the file, it reads as bytes, so I have to decode it. During the decoding, however, it seems that the entire file is turned into one long string.
# Libraries
import csv
import urllib.request

# Function for importing csv from url
def csv_import(url):
    url_open = urllib.request.urlopen(url)
    csvfile = csv.reader(url_open.decode('utf-8'), delimiter=',')
    return csvfile
# Reading file
p_pladser = csv_import("http://wfs-kbhkort.kk.dk/k101/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=k101:p_pladser&outputFormat=csv&SRSNAME=EPSG:4326")
When I try to read the imported file line by line, it only reads one character at a time.
for row in p_pladser:
    print(row)
    break

['F']
Can you help me identify where it goes wrong? I am using Python 3.6.
EDIT: Per request my solution in R
# Loading library
library(RPostgreSQL)
# Reading dataframe
p_pladser = read.csv("http://wfs-kbhkort.kk.dk/k101/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=k101:p_pladser&outputFormat=csv&SRSNAME=EPSG:4326", encoding = "UTF-8", stringsAsFactors = FALSE)
# Creating database connection
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "secretdatabase", host = "secrethost", user = "secretuser", password = "secretpassword")
# Uploading dataframe to postgres database
dbWriteTable(con, "p_pladser", p_pladser , append = TRUE, row.names = FALSE, encoding = "UTF-8")
I have to upload several tables of 10,000 to 100,000 rows each, and in total in R it takes 1-2 seconds to upload them all.
csv.reader expects a file-like object as its argument, not a string. You have two options here:
Either you read the data into a string (as you currently do) and then use io.StringIO to build a file-like object around that string:
import csv
import io
import urllib.request

def csv_import(url):
    url_open = urllib.request.urlopen(url)
    csvfile = csv.reader(io.StringIO(url_open.read().decode('utf-8')), delimiter=',')
    return csvfile
Or you use an io.TextIOWrapper around the binary stream provided by urllib.request:
def csv_import(url):
    url_open = urllib.request.urlopen(url)
    csvfile = csv.reader(io.TextIOWrapper(url_open, encoding='utf-8'), delimiter=',')
    return csvfile
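Either variant yields a normal csv.reader. As a quick local check of the StringIO route, with made-up data standing in for the decoded response:

```python
import csv
import io

# stand-in for url_open.read().decode('utf-8'); the field names are hypothetical
decoded = 'wkb_geometry,antal_pladser\n"POINT (12.5 55.6)",4\n'
rows = list(csv.reader(io.StringIO(decoded), delimiter=','))
# rows == [['wkb_geometry', 'antal_pladser'], ['POINT (12.5 55.6)', '4']]
```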
How about loading the CSV with pandas?
import pandas as pd

csv = pd.read_csv("http://wfs-kbhkort.kk.dk/k101/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=k101:p_pladser&outputFormat=csv&SRSNAME=EPSG:4326")
print(csv.columns)
OR if you have the CSV downloaded on your machine, then directly:
csv = pd.read_csv("<path_to_csv>")
OK! You may consider passing delimiter and quotechar arguments to csv.reader, because the CSV contains quotes as well. Something like this:
with open('p_pladser.csv') as f:
    rows = csv.reader(f, delimiter=',', quotechar='"')
    for row in rows:
        print(row)

Print JSON data from csv list of multiple urls

Very new to Python and haven't found a specific answer on SO, but apologies in advance if this appears very naive or is answered elsewhere already.
I am trying to print the 'IncorporationDate' JSON data from multiple URLs of a public data set. I have the URLs saved as a CSV file, snippet below. I have only got as far as printing ALL the JSON data from one URL, and I am uncertain how to run that over all of the URLs in the CSV and write just the IncorporationDate values to a CSV.
Any basic guidance or edits are really welcome!
try:
    # For Python 3.0 and later
    from urllib.request import urlopen
except ImportError:
    # Fall back to Python 2's urllib2
    from urllib2 import urlopen
import json

def get_jsonparsed_data(url):
    response = urlopen(url)
    data = response.read().decode("utf-8")
    return json.loads(data)

url = ("http://data.companieshouse.gov.uk/doc/company/01046514.json")
print(get_jsonparsed_data(url))

import csv
with open('test.csv') as f:
    lis = [line.split() for line in f]
    for i, x in enumerate(lis):
        print()

import StringIO
s = StringIO.StringIO()
with open('example.csv', 'w') as f:
    for line in s:
        f.write(line)
Snippet of csv:
http://business.data.gov.uk/id/company/01046514.json
http://business.data.gov.uk/id/company/01751318.json
http://business.data.gov.uk/id/company/03164710.json
http://business.data.gov.uk/id/company/04403406.json
http://business.data.gov.uk/id/company/04405987.json
Welcome to the Python world.
For making HTTP requests, we commonly use requests because of its dead-simple API.
The code snippet below does what I believe you want:
It grabs the data from each of the URLs you posted.
It creates a new CSV file with each of the IncorporationDate values.
```
import csv
import requests

COMPANY_URLS = [
    'http://business.data.gov.uk/id/company/01046514.json',
    'http://business.data.gov.uk/id/company/01751318.json',
    'http://business.data.gov.uk/id/company/03164710.json',
    'http://business.data.gov.uk/id/company/04403406.json',
    'http://business.data.gov.uk/id/company/04405987.json',
]

def get_company_data():
    for url in COMPANY_URLS:
        res = requests.get(url)
        if res.status_code == 200:
            yield res.json()

if __name__ == '__main__':
    for data in get_company_data():
        try:
            incorporation_date = data['primaryTopic']['IncorporationDate']
        except KeyError:
            continue
        else:
            with open('out.csv', 'a') as csvfile:
                writer = csv.writer(csvfile)
                writer.writerow([incorporation_date])
```
First step, you have to read all the URLs in your CSV (note that csv.reader needs an open file object, not a filename, and each row it yields is a list):
import csv

with open('text.csv') as f:
    csvReader = csv.reader(f)
    # next(csvReader)  # uncomment if you have a header in the .CSV file
    all_urls = [row[0] for row in csvReader if row]
Second step, fetch the data from the URL:
import json
from urllib.request import urlopen

def get_jsonparsed_data(url):
    response = urlopen(url)
    data = response.read().decode("utf-8")
    return json.loads(data)

url_data = get_jsonparsed_data("give_your_url_here")
Third step:
Go through all the URLs that you got from the CSV file
Get the JSON data
Fetch the field you need, in your case "IncorporationDate"
Write it into an output CSV file, which I'm naming IncorporationDates.csv
Code below:
with open('IncorporationDates.csv', 'w', newline='') as abc:
    writer = csv.writer(abc)
    for each_url in all_urls:
        url_data = get_jsonparsed_data(each_url)
        writer.writerow([url_data['primaryTopic']['IncorporationDate']])

Python csv library leaves empty rows even when using a valid lineterminator

I am trying to fetch data from the internet and save it to a CSV file.
I am facing a problem writing to the CSV file: the library leaves an empty row in the file.
The data is random.org integers in text/plain format.
I'm using urllib.request to fetch the data and I am using this code to get the data and decode it
req = urllib.request.Request(url, data=None, headers={
    'User-Agent': '(some text)'})
with urllib.request.urlopen(req) as response:
    html = response.read()
    encoding = response.headers.get_content_charset('utf-8')
    data = html.decode(encoding)
I am using this line of code to open the csv file: csvfile = open('data.csv', "a")
Writing to the file:
writer = csv.writer(csvfile, lineterminator = '\n')
writer.writerows(data)
and of course I close the file at the end
Things I tried that didn't help:
Using lineterminator='\n' when writing
Using newline="" when opening the file
Defining a delimiter, a quotechar, and quoting
Updated Answer
If, when you build the data list, you append each item as a list (so that data becomes a list of lists) and then set your line terminator to '\n', it should work. Below is the working code I used to test.
import csv
import random

csvfile = open('csvTest.csv', 'a')
data = []
for x in range(5):
    data.append([str(random.randint(0, 100))])
writer = csv.writer(csvfile, lineterminator='\n')
writer.writerows(data)
csvfile.close()
and it outputs the five numbers, one per row, with no empty rows between them.
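The key point is that writerows() treats each element of its argument as one row. Writing to an in-memory buffer makes the difference visible (this check is my own, not part of the original answer):

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf, lineterminator='\n').writerows([['12'], ['7'], ['42']])
# list of lists: one value per row -> '12\n7\n42\n'

buf2 = io.StringIO()
csv.writer(buf2, lineterminator='\n').writerows('12')
# a bare string: each character becomes its own row -> '1\n2\n'
```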
