How to download a csv file in Python from a server?
```python
from pip._vendor import requests
import csv

url = 'https://docs.google.com/spreadsheets/abcd'
dataReader = csv.reader(open(url), delimiter=',', quotechar='"')
exampleData = list(dataReader)
exampleData
```
Use Python Requests:

```python
import requests

r = requests.get(url)
lines = r.text.splitlines()
```

We use splitlines to turn the text into an iterable of lines, like a file handle. You should probably wrap it in a try/except block in case of errors.
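Putting those pieces together, a fuller sketch (the timeout value is illustrative, and the parsing step is split into its own helper so it can be reused on any downloaded text):

```python
import csv

import requests


def parse_csv_text(text):
    # splitlines() turns the downloaded text into the iterable of lines
    # that csv.reader expects
    return list(csv.reader(text.splitlines(), delimiter=',', quotechar='"'))


def fetch_csv_rows(url, timeout=10):
    # Download the CSV, failing loudly on HTTP or network errors
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
    except requests.RequestException as exc:
        raise RuntimeError(f"download failed: {exc}") from exc
    return parse_csv_text(r.text)
```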
You need to use something like urllib2 to retrieve the file. For example:

```python
import urllib2
import csv

csvfile = urllib2.urlopen('https://docs.google.com/spreadsheets/abcd')
dataReader = csv.reader(csvfile, delimiter=',', quotechar='"')
do_stuff(dataReader)
```
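Note that urllib2 is Python 2 only; in Python 3 the equivalent lives in urllib.request, and since urlopen there returns a binary stream, it needs a text wrapper before csv.reader can consume it. A minimal sketch, with io.BytesIO standing in for the HTTP response:

```python
import csv
import io

# Simulated binary response; in real code this would be
# urllib.request.urlopen('https://docs.google.com/spreadsheets/abcd')
binary_stream = io.BytesIO(b'name,qty\r\nwidget,"1,000"\r\n')

# TextIOWrapper decodes the bytes on the fly for csv.reader
reader = csv.reader(io.TextIOWrapper(binary_stream, encoding='utf-8'),
                    delimiter=',', quotechar='"')
rows = list(reader)
```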
You can import urllib.request and then simply call data_stream = urllib.request.urlopen(url) to get a buffer of the file. You can then save the csv data as data = str(data_stream.read()), which may be a bit unclean depending on your source or encoding, so you may need to do some manipulation; if not, you can just throw it into csv.reader(data, delimiter=','). An example requiring translation from byte format that may work for you:

```python
import csv
import urllib.request

data = urllib.request.urlopen(url)
data_csv = str(data.read())
# split off the b' flag from the stringified bytes, then also split at
# the escaped newlines, dropping the trailing empty piece
dataReader = csv.reader(data_csv.split("b\'", 1)[1].split("\\n")[:-1], delimiter=",")
headers = next(dataReader)
exampleData = list(dataReader)
```
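The b'...' string surgery above can be avoided entirely by decoding the bytes instead of stringifying them; a sketch with a canned payload standing in for what data.read() would return:

```python
import csv

# Assumed sample of what urlopen(url).read() might return
payload = b'col1,col2\nfoo,1\nbar,2\n'

# bytes.decode gives a real str, so there is no b' prefix to strip
text = payload.decode('utf-8')
reader = csv.reader(text.splitlines(), delimiter=',')
headers = next(reader)
exampleData = list(reader)
```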
Iterate through a html file and extract data to CSV file
I have searched high and low for a solution, but none has quite fit what I need to do. I have an HTML page that is saved as a file, let's call it sample.html, and I need to extract recurring JSON data from it. I need to get the info from these files regularly, so the number of objects changes every time; an object would be considered as {"SpecificIdent":2588,"SpecificNum":29,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false}. I need to get each of the values into a CSV file, with column headings being SpecificIdent, SpecificNum, Meter, Power, WPower, SNumber, isI, and the associated data as the rows from each. An example file is as follows:

```html
<html><head><meta name="color-scheme" content="light dark"></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">[{"SpecificIdent":2588,"SpecificNum":29,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":3716,"SpecificNum":39,"Meter":1835,"Power":11240.0,"WPower":null,"SNumber":"0703-403548","isI":false},{"SpecificIdent":6364,"SpecificNum":27,"Meter":7768,"Power":29969.0,"WPower":null,"SNumber":"467419","isI":false},{"SpecificIdent":6583,"SpecificNum":51,"Meter":7027,"Power":36968.0,"WPower":null,"SNumber":"JE1449-521248","isI":false},{"SpecificIdent":6612,"SpecificNum":57,"Meter":12828,"Power":53918.0,"WPower":null,"SNumber":"JE1509-534327","isI":false},{"SpecificIdent":7139,"SpecificNum":305,"Meter":6264,"Power":33101.0,"WPower":null,"SNumber":"JE1449-521204","isI":false},{"SpecificIdent":7551,"SpecificNum":116,"Meter":0,"Power":21569.0,"WPower":null,"SNumber":"JE1449-521252","isI":false},{"SpecificIdent":7643,"SpecificNum":56,"Meter":7752,"Power":40501.0,"WPower":null,"SNumber":"JE1449-521200","isI":false},{"SpecificIdent":8653,"SpecificNum":49,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":9733,"SpecificNum":142,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":10999,"SpecificNum":20,"Meter":7723,"Power":6987.0,"WPower":null,"SNumber":"JE1608-625534","isI":false},{"SpecificIdent":12086,"SpecificNum":24,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":14590,"SpecificNum":35,"Meter":394,"Power":10941.0,"WPower":null,"SNumber":"BN1905-944799","isI":false},{"SpecificIdent":14954,"SpecificNum":100,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"517163","isI":false},{"SpecificIdent":14995,"SpecificNum":58,"Meter":0,"Power":38789.0,"WPower":null,"SNumber":"JE1444-511511","isI":false},{"SpecificIdent":15245,"SpecificNum":26,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"430149","isI":false},{"SpecificIdent":18824,"SpecificNum":55,"Meter":8236,"Power":31358.0,"WPower":null,"SNumber":"0703-310839","isI":false},{"SpecificIdent":20745,"SpecificNum":41,"Meter":0,"Power":60963.0,"WPower":null,"SNumber":"JE1447-517260","isI":false},{"SpecificIdent":31584,"SpecificNum":11,"Meter":0,"Power":3696.0,"WPower":null,"SNumber":"467154","isI":false},{"SpecificIdent":32051,"SpecificNum":40,"Meter":7870,"Power":13057.0,"WPower":null,"SNumber":"JE1608-625593","isI":false},{"SpecificIdent":32263,"SpecificNum":4,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":33137,"SpecificNum":132,"Meter":5996,"Power":26650.0,"WPower":null,"SNumber":"459051","isI":false},{"SpecificIdent":33481,"SpecificNum":144,"Meter":4228,"Power":16136.0,"WPower":null,"SNumber":"JE1603-617807","isI":false},{"SpecificIdent":33915,"SpecificNum":145,"Meter":5647,"Power":3157.0,"WPower":null,"SNumber":"JE1518-549610","isI":false},{"SpecificIdent":36051,"SpecificNum":119,"Meter":2923,"Power":12249.0,"WPower":null,"SNumber":"135493","isI":false},{"SpecificIdent":37398,"SpecificNum":21,"Meter":58,"Power":5540.0,"WPower":null,"SNumber":"BN1925-982761","isI":false},{"SpecificIdent":39024,"SpecificNum":50,"Meter":7217,"Power":38987.0,"WPower":null,"SNumber":"JE1445-511599","isI":false},{"SpecificIdent":39072,"SpecificNum":59,"Meter":5965,"Power":32942.0,"WPower":null,"SNumber":"JE1449-521199","isI":false},{"SpecificIdent":40601,"SpecificNum":9,"Meter":0,"Power":59655.0,"WPower":null,"SNumber":"JE1447-517150","isI":false},{"SpecificIdent":40712,"SpecificNum":37,"Meter":0,"Power":5715.0,"WPower":null,"SNumber":"JE1502-525840","isI":false},{"SpecificIdent":41596,"SpecificNum":53,"Meter":8803,"Power":60669.0,"WPower":null,"SNumber":"JE1503-527155","isI":false},{"SpecificIdent":50276,"SpecificNum":30,"Meter":2573,"Power":4625.0,"WPower":null,"SNumber":"JE1545-606334","isI":false},{"SpecificIdent":51712,"SpecificNum":69,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":56140,"SpecificNum":10,"Meter":5169,"Power":26659.0,"WPower":null,"SNumber":"JE1547-609024","isI":false},{"SpecificIdent":56362,"SpecificNum":6,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":58892,"SpecificNum":113,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":65168,"SpecificNum":5,"Meter":12739,"Power":55833.0,"WPower":null,"SNumber":"JE1449-521284","isI":false},{"SpecificIdent":65255,"SpecificNum":60,"Meter":5121,"Power":27784.0,"WPower":null,"SNumber":"JE1449-521196","isI":false},{"SpecificIdent":65665,"SpecificNum":47,"Meter":11793,"Power":47576.0,"WPower":null,"SNumber":"JE1509-534315","isI":false},{"SpecificIdent":65842,"SpecificNum":8,"Meter":10783,"Power":46428.0,"WPower":null,"SNumber":"JE1509-534401","isI":false},{"SpecificIdent":65901,"SpecificNum":22,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":65920,"SpecificNum":17,"Meter":9316,"Power":38242.0,"WPower":null,"SNumber":"JE1509-534360","isI":false},{"SpecificIdent":66119,"SpecificNum":43,"Meter":12072,"Power":52157.0,"WPower":null,"SNumber":"JE1449-521259","isI":false},{"SpecificIdent":70018,"SpecificNum":34,"Meter":11172,"Power":49706.0,"WPower":null,"SNumber":"JE1449-521285","isI":false},{"SpecificIdent":71388,"SpecificNum":54,"Meter":6947,"Power":36000.0,"WPower":null,"SNumber":"JE1445-512406","isI":false},{"SpecificIdent":71892,"SpecificNum":36,"Meter":15398,"Power":63691.0,"WPower":null,"SNumber":"JE1447-517256","isI":false},{"SpecificIdent":72600,"SpecificNum":38,"Meter":14813,"Power":62641.0,"WPower":null,"SNumber":"JE1447-517189","isI":false},{"SpecificIdent":73645,"SpecificNum":2,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":77208,"SpecificNum":28,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":77892,"SpecificNum":15,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":78513,"SpecificNum":31,"Meter":6711,"Power":36461.0,"WPower":null,"SNumber":"JE1445-511601","isI":false},{"SpecificIdent":79531,"SpecificNum":18,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false}]</pre></body></html>
```

I have tried examples from bs4, jsontoxml, and others, but I am sure there is a simple way to iterate and extract this? I apologize if this is a basic question in Python, but I am pretty new to it and cannot fathom the best way to do this. Any assistance would be greatly appreciated. Kind regards, A
I would harness Python's standard library the following way:

```python
import csv
import json
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        if data.strip():
            self.data = data

parser = MyHTMLParser()
with open("sample.html", "r") as f:
    parser.feed(f.read())

with open('sample.csv', 'w', newline='') as csvfile:
    fieldnames = ['SpecificIdent', 'SpecificNum', 'Meter', 'Power', 'WPower', 'SNumber', 'isI']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(json.loads(parser.data))
```

which creates a file starting with the following lines:

```
SpecificIdent,SpecificNum,Meter,Power,WPower,SNumber,isI
2588,29,0,0.0,,,False
3716,39,1835,11240.0,,0703-403548,False
6364,27,7768,29969.0,,467419,False
6583,51,7027,36968.0,,JE1449-521248,False
6612,57,12828,53918.0,,JE1509-534327,False
7139,305,6264,33101.0,,JE1449-521204,False
7551,116,0,21569.0,,JE1449-521252,False
7643,56,7752,40501.0,,JE1449-521200,False
8653,49,0,0.0,,,False
```

Disclaimer: this assumes the JSON array you want is the last text element which is not empty (i.e. contains at least one non-whitespace character).
There is a Python library called BeautifulSoup that you could utilize to parse the whole HTML file:

```python
# pip install bs4
from bs4 import BeautifulSoup

html = BeautifulSoup(your_html)
```

From here on, you can perform any actions upon the html. In your case, you just need to find the <pre> element and get its contents. This can be achieved easily:

```python
pre = html.body.find('pre')
text = pre.text
```

Finally, you need to parse the text, which appears to be JSON. You can do so with Python's internal json library:

```python
import json

result = json.loads(text)
```

Now we need to convert this to a CSV file. This can be done using the csv library:

```python
import csv

with open('GFG', 'w') as f:
    writer = csv.DictWriter(f, fieldnames=[
        "SpecificIdent", "SpecificNum", "Meter", "Power", "WPower", "SNumber", "isI"
    ])
    writer.writeheader()
    writer.writerows(result)
```

Finally, your code should look something like this:

```python
from bs4 import BeautifulSoup
import json
import csv

with open('raw.html', 'r') as f:
    raw = f.read()

html = BeautifulSoup(raw)
pre = html.body.find('pre')
text = pre.text
result = json.loads(text)

with open('result.csv', 'w') as f:
    writer = csv.DictWriter(f, fieldnames=[
        "SpecificIdent", "SpecificNum", "Meter", "Power", "WPower", "SNumber", "isI"
    ])
    writer.writeheader()
    writer.writerows(result)
```
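Both answers reduce to the same core step: a JSON array of objects fed to csv.DictWriter. A self-contained sketch, with the data trimmed to the question's first two records and an in-memory buffer standing in for the output file:

```python
import csv
import io
import json

# First two records from the question's <pre> element
text = ('[{"SpecificIdent":2588,"SpecificNum":29,"Meter":0,"Power":0.0,'
        '"WPower":null,"SNumber":"","isI":false},'
        '{"SpecificIdent":3716,"SpecificNum":39,"Meter":1835,"Power":11240.0,'
        '"WPower":null,"SNumber":"0703-403548","isI":false}]')

records = json.loads(text)  # null -> None, false -> False

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=[
    "SpecificIdent", "SpecificNum", "Meter", "Power", "WPower", "SNumber", "isI"])
writer.writeheader()
writer.writerows(records)  # None values come out as empty cells
```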
From gzip to json to dataframe to csv
I am trying to get some data from an open API: https://data.brreg.no/enhetsregisteret/api/enheter/lastned but I am having difficulties understanding the different types of objects and the order the conversions should be in. Is it strings to bytes, is it BytesIO or StringIO, is it decode('utf-8') or decode('unicode'), etc.? So far:

```python
url_get = 'https://data.brreg.no/enhetsregisteret/api/enheter/lastned'
with urllib.request.urlopen(url_get) as response:
    encoding = response.info().get_param('charset', 'utf8')
    compressed_file = io.BytesIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)
```

and now is where I am stuck. How should I write the next line of code?

```python
json_str = json.loads(decompressed_file.read().decode('utf-8'))
```

My workaround is that if I write it as a JSON file, then read it in again and do the transformation to a DataFrame, it works:

```python
with io.open('brreg.json', 'wb') as f:
    f.write(decompressed_file.read())

with open(f_path, encoding='utf-8') as fin:
    d = json.load(fin)

df = json_normalize(d)

with open('brreg_2.csv', 'w', encoding='utf-8', newline='') as fout:
    fout.write(df.to_csv())
```

I found many SO posts about it, but I am still confused. This first one explains it quite well, but I still need some spoon feeding:

Python 3, read/write compressed json objects from/to gzip file
TypeError when trying to convert Python 2.7 code to Python 3.4 code
How can I create a GzipFile instance from the "file-like object" that urllib.urlopen() returns?
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
It works fine for me using the decompress function rather than the GzipFile class to decompress the file, but I'm not sure why yet...

```python
import urllib.request
import gzip
import io
import json

url_get = 'https://data.brreg.no/enhetsregisteret/api/enheter/lastned'
with urllib.request.urlopen(url_get) as response:
    encoding = response.info().get_param('charset', 'utf8')
    compressed_file = io.BytesIO(response.read())
    decompressed_file = gzip.decompress(compressed_file.read())
    json_str = json.loads(decompressed_file.decode('utf-8'))
```

EDIT: in fact the following also works fine for me, which appears to be your exact code... (Further edit: it turns out it's not quite your exact code, because your final line was outside the with block, which meant response was no longer open when it was needed; see the comment thread.)

```python
import urllib.request
import gzip
import io
import json

url_get = 'https://data.brreg.no/enhetsregisteret/api/enheter/lastned'
with urllib.request.urlopen(url_get) as response:
    encoding = response.info().get_param('charset', 'utf8')
    compressed_file = io.BytesIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)
    json_str = json.loads(decompressed_file.read().decode('utf-8'))
```
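The two routes are equivalent, which can be checked locally by compressing a small payload instead of hitting the live API (the record content below is made up for illustration):

```python
import gzip
import io
import json

# A tiny gzip-compressed JSON payload standing in for the API response
payload = gzip.compress(json.dumps([{"name": "Example AS"}]).encode('utf-8'))

# Route 1: gzip.decompress on the raw bytes
data1 = json.loads(gzip.decompress(payload).decode('utf-8'))

# Route 2: GzipFile wrapped around a BytesIO, as in the question
with gzip.GzipFile(fileobj=io.BytesIO(payload)) as f:
    data2 = json.loads(f.read().decode('utf-8'))
```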
Read csv from url one line at a time in Python 3.X
I have to read an online csv-file into a postgres database, and in that context I have some problems reading the online csv-file properly. If I just import the file it reads as bytes, so I have to decode it. During the decoding, however, it seems that the entire file is turned into one long string.

```python
# Libraries
import csv
import urllib.request

# Function for importing csv from url
def csv_import(url):
    url_open = urllib.request.urlopen(url)
    csvfile = csv.reader(url_open.decode('utf-8'), delimiter=',')
    return csvfile

# Reading file
p_pladser = csv_import("http://wfs-kbhkort.kk.dk/k101/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=k101:p_pladser&outputFormat=csv&SRSNAME=EPSG:4326")
```

When I try to read the imported file line by line it only reads one character at a time:

```python
for row in p_pladser:
    print(row)
    break
```

```
['F']
```

Can you help me identify where it goes wrong? I am using Python 3.6.

EDIT: Per request, my solution in R:

```r
# Loading library
library(RPostgreSQL)

# Reading dataframe
p_pladser = read.csv("http://wfs-kbhkort.kk.dk/k101/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=k101:p_pladser&outputFormat=csv&SRSNAME=EPSG:4326",
                     encoding = "UTF-8", stringsAsFactors = FALSE)

# Creating database connection
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "secretdatabase", host = "secrethost", user = "secretuser", password = "secretpassword")

# Uploading dataframe to postgres database
dbWriteTable(con, "p_pladser", p_pladser, append = TRUE, row.names = FALSE, encoding = "UTF-8")
```

I have to upload several tables of 10,000 to 100,000 rows, and in total in R it takes 1-2 seconds to upload them all.
csv.reader expects as argument a file-like object, not a string. You have two options here. Either you read the data into a string (as you currently do) and then use an io.StringIO to build a file-like object around that string:

```python
import csv
import io
import urllib.request

def csv_import(url):
    url_open = urllib.request.urlopen(url)
    csvfile = csv.reader(io.StringIO(url_open.read().decode('utf-8')), delimiter=',')
    return csvfile
```

or you use an io.TextIOWrapper around the binary stream provided by urllib.request:

```python
import csv
import io
import urllib.request

def csv_import(url):
    url_open = urllib.request.urlopen(url)
    csvfile = csv.reader(io.TextIOWrapper(url_open, encoding='utf-8'), delimiter=',')
    return csvfile
```
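Both options can be exercised without the network by substituting an in-memory byte stream for the urlopen response; they produce identical rows:

```python
import csv
import io

raw = b'id,name\n1,Alpha\n2,Beta\n'  # stand-in for the HTTP body

# Option 1: decode to str, then wrap in StringIO
rows1 = list(csv.reader(io.StringIO(raw.decode('utf-8')), delimiter=','))

# Option 2: TextIOWrapper directly around the binary stream
rows2 = list(csv.reader(io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8'),
                        delimiter=','))
```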
How about loading the CSV with pandas!

```python
import pandas as pd

csv = pd.read_csv("http://wfs-kbhkort.kk.dk/k101/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=k101:p_pladser&outputFormat=csv&SRSNAME=EPSG:4326")
print(csv.columns)
```

Or, if you have the CSV downloaded on your machine, then directly:

```python
csv = pd.read_csv("<path_to_csv>")
```

OK! You may also consider passing delimiter and quotechar arguments to csv.reader, because the CSV contains quotes as well. Something like this:

```python
with open('p_pladser.csv') as f:
    rows = csv.reader(f, delimiter=',', quotechar='"')
    for row in rows:
        print(row)
```
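pandas also accepts any file-like object, so an already-downloaded payload can be parsed the same way. A sketch on a hypothetical two-row sample (the column names here are made up, not taken from the real k101:p_pladser layer):

```python
import io

import pandas as pd

# Hypothetical sample of a downloaded CSV payload
sample = 'FID,antal_pladser\np1,4\np2,2\n'

# read_csv handles delimiters and quoting itself
df = pd.read_csv(io.StringIO(sample))
```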
Print JSON data from csv list of multiple urls
Very new to Python and haven't found a specific answer on SO, but apologies in advance if this appears very naive or is answered elsewhere already. I am trying to print 'IncorporationDate' JSON data from multiple urls of a public data set. I have the urls saved as a csv file, snippet below. I am only getting as far as printing ALL the JSON data from one url, and I am uncertain how to run that over all of the csv urls and write just the IncorporationDate values to csv. Any basic guidance or edits are really welcomed!

```python
try:
    # For Python 3.0 and later
    from urllib.request import urlopen
except ImportError:
    # Fall back to Python 2's urllib2
    from urllib2 import urlopen

import json

def get_jsonparsed_data(url):
    response = urlopen(url)
    data = response.read().decode("utf-8")
    return json.loads(data)

url = ("http://data.companieshouse.gov.uk/doc/company/01046514.json")
print(get_jsonparsed_data(url))

import csv
with open('test.csv') as f:
    lis = [line.split() for line in f]
    for i, x in enumerate(lis):
        print()

import StringIO
s = StringIO.StringIO()
with open('example.csv', 'w') as f:
    for line in s:
        f.write(line)
```

Snippet of csv:

```
http://business.data.gov.uk/id/company/01046514.json
http://business.data.gov.uk/id/company/01751318.json
http://business.data.gov.uk/id/company/03164710.json
http://business.data.gov.uk/id/company/04403406.json
http://business.data.gov.uk/id/company/04405987.json
```
Welcome to the Python world. For making HTTP requests, we commonly use requests because of its dead-simple API. The code snippet below does what I believe you want: it grabs the data from each of the urls you posted and creates a new CSV file with each of the IncorporationDate values.

```python
import csv
import requests

COMPANY_URLS = [
    'http://business.data.gov.uk/id/company/01046514.json',
    'http://business.data.gov.uk/id/company/01751318.json',
    'http://business.data.gov.uk/id/company/03164710.json',
    'http://business.data.gov.uk/id/company/04403406.json',
    'http://business.data.gov.uk/id/company/04405987.json',
]

def get_company_data():
    for url in COMPANY_URLS:
        res = requests.get(url)
        if res.status_code == 200:
            yield res.json()

if __name__ == '__main__':
    for data in get_company_data():
        try:
            incorporation_date = data['primaryTopic']['IncorporationDate']
        except KeyError:
            continue
        else:
            with open('out.csv', 'a') as csvfile:
                writer = csv.writer(csvfile)
                writer.writerow([incorporation_date])
```
First step: read all the URLs in your CSV. (The reader needs an open file object, not a filename, and each row is a list, so take its first cell.)

```python
import csv

with open('text.csv') as f:
    csvReader = csv.reader(f)
    # next(csvReader)  # uncomment if you have a header in the .CSV file
    all_urls = [row[0] for row in csvReader if row]
```

Second step: fetch the data from a URL.

```python
import json
from urllib.request import urlopen

def get_jsonparsed_data(url):
    response = urlopen(url)
    data = response.read().decode("utf-8")
    return json.loads(data)

url_data = get_jsonparsed_data("give_your_url_here")
```

Third step: go through all the URLs you got from the CSV file, get the JSON data, fetch the field you need ("IncorporationDate" in your case), and write it to an output CSV file, named IncorporationDates.csv here. The file is opened once before the loop, since opening it with 'w' inside the loop would overwrite it on every iteration. Code below:

```python
with open('IncorporationDates.csv', 'w') as abc:
    for each_url in all_urls:
        url_data = get_jsonparsed_data(each_url)
        abc.write(url_data['primaryTopic']['IncorporationDate'] + '\n')
```
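The nested lookup both answers depend on can be sketched on stub payloads. The primaryTopic/IncorporationDate path comes from the question; the date value and the stub records are placeholders:

```python
import csv
import io

payloads = [
    {"primaryTopic": {"IncorporationDate": "1971-09-24"}},
    {"primaryTopic": {}},  # a record missing the field
]

buf = io.StringIO()
writer = csv.writer(buf, lineterminator='\n')
for data in payloads:
    # .get() avoids a KeyError when a company lacks the field
    date = data.get("primaryTopic", {}).get("IncorporationDate")
    if date is not None:
        writer.writerow([date])
```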
Python csv library leaves empty rows even when using a valid lineterminator
I am trying to fetch data from the internet and save it to a csv file, but I am facing a problem writing the csv file: the library leaves an empty row in the file. The data is random.org integers in text/plain format. I'm using urllib.request to fetch the data, with this code to get and decode it:

```python
req = urllib.request.Request(url, data=None, headers={'User-Agent': '(some text)'})
with urllib.request.urlopen(req) as response:
    html = response.read()
    encoding = response.headers.get_content_charset('utf-8')
    data = html.decode(encoding)
```

I open the csv file with:

```python
csvfile = open('data.csv', "a")
```

and write to it with:

```python
writer = csv.writer(csvfile, lineterminator='\n')
writer.writerows(data)
```

and of course I close the file at the end. Things I tried that didn't help:

Using lineterminator='\n' when writing
Using newline="" when opening the file
Defining a delimiter, a quotechar, and quoting
Updated answer: if, when you build the data list that is being written, you append each value as a list (so that data becomes a list of lists) and then set the line terminator to '\n', it should work. Below is the working code I used to test:

```python
import csv
import random

csvfile = open('csvTest.csv', 'a')
data = []
for x in range(5):
    data.append([str(random.randint(0, 100))])

writer = csv.writer(csvfile, lineterminator='\n')
writer.writerows(data)
csvfile.close()
```

and it outputs one number per row, with no empty rows in between.
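To see why the original produced odd rows: writerows() iterates its argument, so a bare string yields one "row" per character, while a list of lists yields one row per inner list. A small sketch of the fix, with an in-memory buffer and fixed values instead of random ones so the output is checkable:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, lineterminator='\n')

# Each value wrapped in its own list: one row per value
writer.writerows([['12'], ['7'], ['42']])
```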