How to read a CSV file from a URL with Python?

When I do a curl call to the API link http://example.com/passkey=wedsmdjsjmdd
curl 'http://example.com/passkey=wedsmdjsjmdd'
I get the employee output data in CSV format, like:
"Steve","421","0","421","2","","","","","","","","","421","0","421","2"
How can I parse through this using Python?
I tried:
import csv
cr = csv.reader(open('http://example.com/passkey=wedsmdjsjmdd', "rb"))
for row in cr:
    print row
but it didn't work and I got an error:
http://example.com/passkey=wedsmdjsjmdd No such file or directory:
Thanks!

Using pandas, it is very simple to read a csv file directly from a url:
import pandas as pd

data = pd.read_csv('https://example.com/passkey=wedsmdjsjmdd')
This will read your data in tabular format, which will be very easy to process.
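For example, once the DataFrame is loaded you can take a quick look at the first rows and the inferred column types (a small sketch, using the medals file from the answers below as a stand-in URL):
import pandas as pd

# read the CSV straight from the URL into a DataFrame
data = pd.read_csv('http://winterolympicsmedals.com/medals.csv')
print(data.head())    # first five rows
print(data.dtypes)    # column types pandas inferred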

You need to replace open with urllib.urlopen or urllib2.urlopen.
e.g.
import csv
import urllib2
url = 'http://winterolympicsmedals.com/medals.csv'
response = urllib2.urlopen(url)
cr = csv.reader(response)
for row in cr:
    print row
This would output the following
Year,City,Sport,Discipline,NOC,Event,Event gender,Medal
1924,Chamonix,Skating,Figure skating,AUT,individual,M,Silver
1924,Chamonix,Skating,Figure skating,AUT,individual,W,Gold
...
The original question is tagged "python-2.x", but for a Python 3 implementation (which requires only minor changes) see below.

You could do it with the requests module as well:
import csv

import requests

url = 'http://winterolympicsmedals.com/medals.csv'
r = requests.get(url)
text = r.iter_lines()
reader = csv.reader(text, delimiter=',')
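Note that on Python 3, r.iter_lines() yields bytes, so each line has to be decoded before csv.reader can use it. A minimal sketch, assuming a UTF-8 encoded response:
import csv

import requests

url = 'http://winterolympicsmedals.com/medals.csv'
r = requests.get(url)
# decode each line from bytes to str before handing it to csv.reader
lines = (line.decode('utf-8') for line in r.iter_lines())
reader = csv.reader(lines, delimiter=',')
for row in reader:
    print(row)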

To increase performance when downloading a large file, the below may work a bit more efficiently:
import requests
from contextlib import closing
import csv

url = "http://download-and-process-csv-efficiently/python.csv"
with closing(requests.get(url, stream=True)) as r:
    reader = csv.reader(r.iter_lines(), delimiter=',', quotechar='"')
    for row in reader:
        # Handle each row here...
        print row
By setting stream=True in the GET request, when we pass r.iter_lines() to csv.reader(), we are passing a generator to csv.reader(). By doing so, we enable csv.reader() to lazily iterate over each line in the response with for row in reader.
This avoids loading the entire file into memory before we start processing it, drastically reducing memory overhead for large files.
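On Python 3 the same streaming approach needs decoded lines; requests can do that for you via the decode_unicode flag of iter_lines(). A sketch, assuming the server reports its encoding correctly:
import csv
from contextlib import closing

import requests

url = "http://download-and-process-csv-efficiently/python.csv"
with closing(requests.get(url, stream=True)) as r:
    # decode_unicode=True makes iter_lines() yield str instead of bytes
    reader = csv.reader(r.iter_lines(decode_unicode=True), delimiter=',', quotechar='"')
    for row in reader:
        print(row)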

This question is tagged python-2.x so it didn't seem right to tamper with the original question, or the accepted answer. However, Python 2 is now unsupported, and this question still has good google juice for "python csv urllib", so here's an updated Python 3 solution.
It's now necessary to decode urlopen's response (in bytes) into a valid local encoding, so the accepted answer has to be modified slightly:
import csv, urllib.request
url = 'http://winterolympicsmedals.com/medals.csv'
response = urllib.request.urlopen(url)
lines = [l.decode('utf-8') for l in response.readlines()]
cr = csv.reader(lines)
for row in cr:
    print(row)
Note the extra line beginning with lines =, the fact that urlopen is now in the urllib.request module, and print of course requires parentheses.
It's hardly advertised, but yes, csv.reader can read from a list of strings.
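A quick demonstration (any iterable of strings will do, not just a file object):
import csv

rows = list(csv.reader(['a,"b,c"', '1,2']))
print(rows)  # [['a', 'b,c'], ['1', '2']]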
And since someone else mentioned pandas, here's a pandas rendition that displays the CSV in a console-friendly output:
python3 -c 'import pandas
df = pandas.read_csv("http://winterolympicsmedals.com/medals.csv")
print(df.to_string())'
Pandas is not a lightweight library, though. If you don't need the things that pandas provides, or if startup time is important (e.g. you're writing a command line utility or any other program that needs to load quickly), I'd advise that you stick with the standard library functions.

import pandas as pd

url = 'https://raw.githubusercontent.com/juliencohensolal/BankMarketing/master/rawData/bank-additional-full.csv'
data = pd.read_csv(url, sep=";")  # use sep="," for comma-separated files
data.describe()

I am also using this approach for csv files (Python 3.6.9):
import csv
import io

import requests

# any CSV URL works here; the medals file from the answers above is used as an example
url = 'http://winterolympicsmedals.com/medals.csv'
r = requests.get(url)
buff = io.StringIO(r.text)
dr = csv.DictReader(buff)
for row in dr:
    print(row)
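DictReader returns each row as a dictionary keyed by the header row, so fields can be read by column name instead of by index. With the medals file, whose header includes a Year column, the loop body could be:
for row in dr:
    print(row['Year'])  # access a field by its header name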

What you were trying to do with the curl command was to download the file to your local hard drive (HD). You then need to point open() at that path on the drive:
curl http://example.com/passkey=wedsmdjsjmdd -o ./example.csv
cr = csv.reader(open('./example.csv', "r"))
for row in cr:
    print row

Related

Python Web scraping multiple URLs Output CSV

I have literally started with Python today. I have managed to get one url to display data in Python using:
import requests

URL = "https://gateway.pinata.cloud/ipfs/QmUJrnabRCMLsnvXNryojLWcysc4WwJCLqWYvJcWADfZFo/chadsJSON/1.json"
page = requests.get(URL)
print(page.text)
I need to look up multiple urls (like the one above, but numbered 1 up to 10,000) and save the results as a csv, with each url's data in one cell; I will then be able to manipulate the data to make it usable in Excel.
Please, can anyone write the Python code I can run?
This is a very simple answer, but it will be very useful in your case, although you will still need to do some work to reach the required output.
This is how you can send requests from 0 to 100, using a for loop:
For Loop
import requests

for i in range(100):
    URL = "https://gateway.pinata.cloud/ipfs/QmUJrnabRCMLsnvXNryojLWcysc4WwJCLqWYvJcWADfZFo/chadsJSON/" + str(i) + ".json"
    page = requests.get(URL)
    print(page.text)
And to store the data in a CSV file, I advise you to use the csv library, which is really helpful for that; you can read more in its documentation: https://docs.python.org/3/library/csv.html
import csv

with open('file.csv', 'a+', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["COL1", "COL2"])
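Putting the two pieces together, here is a minimal sketch of what the question describes: one row per URL, with the whole response text in a single cell (the range and the column names are assumptions):
import csv

import requests

with open('file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["url", "data"])  # header row
    for i in range(1, 10001):
        url = ("https://gateway.pinata.cloud/ipfs/"
               "QmUJrnabRCMLsnvXNryojLWcysc4WwJCLqWYvJcWADfZFo/chadsJSON/"
               + str(i) + ".json")
        page = requests.get(url)
        writer.writerow([url, page.text])  # each response lands in one cell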

How to parse jsonlines file using pandas

I am new to Python and trying to parse data from a file that contains millions of lines. I tried to go old school and parse it with Excel, but that fails. How can I parse the information efficiently and export it into an Excel file, so that it is easier for other people to read?
I tried using this code provided by someone else, but no luck so far:
import re
import pandas as pd

def clean_data(filename):
    with open(filename, "r") as inputfile:
        for row in inputfile:
            if re.match(r"\[", row) is None:
                yield row

with open(clean_file, 'w') as outputfile:
    for row in clean_data(filename):
        outputfile.write(row)
NameError: name 'clean_file' is not defined
It looks like clean_file is not defined, which is probably a problem from copy/pasting code.
Did you mean to write to a file called "clean_file"? In that case you need to wrap it in quotes: with open("clean_file", 'w')
If you want to work with json, I suggest looking into the json package, which has lots of tools for loading and parsing json. Otherwise, if the json is flat, you can just use the inbuilt pandas function read_json.
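Since the title asks about jsonlines specifically: read_json parses newline-delimited JSON directly when given lines=True, and the resulting DataFrame can then be written out for Excel users. A minimal sketch (the file names are placeholders):
import pandas as pd

# lines=True tells pandas to treat the file as one JSON object per line
df = pd.read_json("data.jsonl", lines=True)
df.to_excel("data.xlsx", index=False)  # needs an Excel writer such as openpyxl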

Download "csv-like" text data file and convert it to CSV in python

First question here so forgive any lapses in the etiquette.
I'm new to Python. I have a small project I'm trying to accomplish, both for practical reasons and as a learning experience, and maybe some people here can help me out. There's a proprietary system I regularly retrieve data from. Unfortunately they don't use standard CSV format: they use a strange character, ‡, to separate the data. I need it in CSV format in order to import it into another system, so what I need to do is replace the special character with a comma and clean up the data by removing whitespace and other minor things like unrecognized characters, so it's the way I need it in CSV to import it.
I want to learn some python so I figured I'd write it in python. I'll be reading it from a webservice URL, but for now I just have some test data in the same format I'd receive.
In reality it will be tons of data per request but I can scale it when I understand how to retrieve and manipulate the data properly.
My code so far, just trying to read and write two columns from the data:
import requests
import csv

r = requests.get('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=0')
data = r.text
with open("testData.csv", "wb") as csvfile:
    f = csv.writer(csvfile)
    f.writerow(["PlayerID", "Partner"])  # add headers
    for elem in data:
        f.writerow([elem["PlayerID"], elem["Partner"]])
I'm getting this error.
File "csvTest.py", line 14, in
f.writerow([elem["PlayerID"], elem["Partner"]])
TypeError: string indices must be integers
It's probably evident by that, that I don't know how to manipulate the data much nor read it properly. I was able to pull back some JSON data and output it so i know the structure works at core with standardized data.
Thanks in advance for any tips.
I'll continue to poke at it.
Sample data is at the dropbox link mentioned in the script.
https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=0
There are multiple problems. First, the link is incorrect, since it returns the html. To get the raw file, use:
r = requests.get ('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=1')
Then, data is a string, so elem in data will iterate over all the characters of the string, which is not what you want.
Also, your data is unicode, not a byte string, so it needs to be encoded first.
Here is your program, with some changes:
import requests
import csv

r = requests.get('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=1')
data = r.text.encode('utf-8').replace("\xc2\x87", ",").splitlines()
headers = data.pop(0).split(",")
pidx = headers.index('PlayerID')
partidx = headers.index('Partner')
with open("testData.csv", "wb") as csvfile:
    f = csv.writer(csvfile)
    f.writerow(["PlayerID", "Partner"])  # add headers
    for line in data:  # the header line was already popped above
        words = line.split(',')
        f.writerow([words[pidx], words[partidx]])
Output:
PlayerID,Partner
1038005,EXT
254034,EXT
Use split:
lines = data.split('\n')  # split your data into lines
headers = lines[0].split('‡')
player_index = headers.index('PlayerID')
partner_index = headers.index('Partner')
for line in lines[1:]:  # skip the headers line
    words = line.split('‡')  # split each line by the delimiter '‡'
    print words[player_index], words[partner_index]
For this to work, define the encoding of your python source code as UTF-8 by adding this line to the top of your file:
# -*- coding: utf-8 -*-
Read more about it in PEP 0263.
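On Python 3 the manual splitting is not needed at all: the csv module accepts any single-character delimiter, including a non-ASCII one such as '‡'. A minimal sketch, assuming the ‡-separated text has already been fetched into a string named text:
import csv
import io

reader = csv.reader(io.StringIO(text), delimiter='‡')
headers = next(reader)
for row in reader:
    print(row[headers.index('PlayerID')], row[headers.index('Partner')])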

remove <feff> from a file

I am using this Python script to convert CSV to XML. After the conversion I see <feff> tags in the text (in vim), which cause an XML parsing error.
I have already tried answers from here, without success.
The converted XML file.
Thanks for any help!
Your input file has BOM (byte-order mark) characters, and Python doesn't strip them automatically when the file is encoded as UTF-8. See: Reading Unicode file data with BOM chars in Python
>>> s = '\xef\xbb\xbfABC'
>>> s.decode('utf8')
u'\ufeffABC'
>>> s.decode('utf-8-sig')
u'ABC'
So for your specific case, try something like
from io import StringIO
s = StringIO(open(csvFile).read().decode('utf-8-sig'))
csvData = csv.reader(s)
Very terrible style, but that script is a hacked together script anyway for a one-shot job.
Change utf-8 to utf-8-sig:
import csv

with open('example.txt', 'r', encoding='utf-8-sig') as file:
    reader = csv.reader(file)  # the BOM is stripped before csv sees the data
Here's an example of a script that uses a real XML-aware library to run a similar conversion. It doesn't have the exact same output, but, well, it's an example -- salt to taste.
import csv
import lxml.etree

csvFile = 'myData.csv'
xmlFile = 'myData.xml'
reader = csv.reader(open(csvFile, 'r'))
with lxml.etree.xmlfile(xmlFile) as xf:
    xf.write_declaration(standalone=True)
    with xf.element('root'):
        for row in reader:
            row_el = lxml.etree.Element('row')
            for col in row:
                col_el = lxml.etree.SubElement(row_el, 'col')
                col_el.text = col
            xf.write(row_el)
To refer to the content of, say, row 2 column 3, you'd then use XPath like /root/row[2]/col[3]/text().
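As a quick check, the generated file can be parsed back and queried with that XPath (a sketch, reusing the file name from the example above):
import lxml.etree

tree = lxml.etree.parse('myData.xml')
print(tree.xpath('/root/row[2]/col[3]/text()'))  # content of row 2, column 3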

Error with urlopen: new-line character seen in unquoted field

I am using urllib.urlopen with Python 2.7 to read csv files located on an external webserver:
# Try & Except statements removed for clarity
import urllib
import csv
url = ...
csv_file = urllib.urlopen(url)
for row in csv.reader(csv_file):
    do_something()
All 100+ files can be read fine, except one that has been updated recently and that returns:
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
The file is accessible here. According to my text editor, its mode is Mac (CR), as opposed to Windows (CRLF) for the other files.
I found, based on this thread, that Python's urlopen will correctly handle all newline formats. Therefore, the problem is likely to come from somewhere else. I have no clue, though. The file opens fine with all my text editors and my spreadsheet editors.
Does anyone have any idea how to diagnose the problem?
* EDIT *
The creator of the file informed me by email that I was not the only one to experience such issues. Therefore, he decided to regenerate it. The code above now works fine again. Unfortunately, using a new file also means that the issue can no longer be reproduced, and the proposed solutions cannot be tested properly.
Before closing the question, I want to thank all the stackers who dedicated some of their time to figure out a solution and post it here.
It might be a corrupt .csv file? Otherwise, this code runs perfectly.
#!/usr/bin/python
import urllib
import csv
url = "http://www.football-data.co.uk/mmz4281/1213/I1.csv"
csv_file = urllib.urlopen(url)
for row in csv.reader(csv_file):
    print row
Credits to J.F. Sebastian for the .csv file.
Although, you might want to consider sharing the specific .csv file with us, so we can try to re-create the error.
The following code runs without any error:
#!/usr/bin/env python
import csv
import urllib2
r = urllib2.urlopen('http://www.football-data.co.uk/mmz4281/1213/I1.csv')
for row in csv.reader(r):
    print row
I was having the same problem with a downloaded csv.
I know the fix would be to use open with 'rU'. But I would rather not have to save the file to disk, just to open it back up into a variable. That seems unnecessary.
file = open(filepath,'rU')
mydata = csv.reader(file)
So if someone has a better solution that would be nice. Stackoverflow links that got me this far:
CSV new-line character seen in unquoted field error
Open the file in universal-newline mode using the CSV Django module
I found what I actually wanted with stringIO, or cStringIO, or io:
Using Python, how do I to read/write data in memory like I would with a file?
I ended up getting io working:
import csv
import urllib2
import io

# warning: it's a 20MB csv
url = 'http://poweredgec.com/latest_poweredge-11g.csv'
urlRead = urllib2.urlopen(url).read()
# io.BytesIO wraps the downloaded bytes in an in-memory file object,
# so nothing has to be written to disk
ramFile = io.BytesIO(urlRead)
csvCurrent = csv.reader(ramFile)
csvTuple = map(tuple, csvCurrent)
print csvTuple
