I'm new to Python 3 and requests. I found a dataset on Harvard Dataverse, but I've been stuck for hours trying to extract it: instead of readable data, the content I get back is full of question marks. I found similar issues, but I'm still unable to solve mine.
Can anyone help me, please?
It would be much appreciated ;)
Many thanks!!
import requests
import pandas as pd
import csv
import sys
#print(sys.executable)
#print(sys.version)
#print(sys.version_info)
url = "https://dataverse.harvard.edu/api/access/datafile/5856951"
r = requests.get(url)
print(type(r))
print('*************')
print('Response Code:', r.status_code)
print('*************')
print('Response Headers:\n', r.headers)
print('*************')
print('Response Content:\n',r.text)
print(r.encoding)
print(r.content)
with open('myfile.csv', mode='w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(r.text)
df = pd.read_csv('myfile.csv')
data = pd.DataFrame(df)
print("The content of the file is:\n", data)
print(data.head(10))
It seems that the request URL is not returning a valid text response; instead it is returning a whole Excel file that contains the dataset you want. That is why writing r.text to a CSV file produces unreadable characters.
Instead of reading the response text directly, first save the response to an Excel file, 'dataset.xlsx', and then read that file to get the results you want.
The following code saves the response to an Excel file. You can then use the xlrd Python library (https://www.geeksforgeeks.org/reading-excel-file-using-python/) to extract data from the file.
import requests

url = "https://dataverse.harvard.edu/api/access/datafile/5856951"
resp = requests.get(url)
with open('dataset.xlsx', 'wb') as f:  # 'wb': the payload is binary
    f.write(resp.content)
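One hypothetical extra check before saving: look at the Content-Type header to see what the server actually sent back, and pick the file extension from it. The helper below is a sketch, and the mapping covers only a few common spreadsheet MIME types:

```python
# Hypothetical helper: map the response's Content-Type header to a file
# extension, so an Excel payload doesn't get saved as .csv by mistake.
def pick_extension(content_type):
    mapping = {
        'text/csv': '.csv',
        'text/tab-separated-values': '.tab',
        'application/vnd.ms-excel': '.xls',
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': '.xlsx',
    }
    # Strip parameters such as '; charset=utf-8' before looking up.
    base = content_type.split(';')[0].strip().lower()
    return mapping.get(base, '.bin')
```

With a real response you would call `pick_extension(resp.headers.get('Content-Type', ''))` before choosing the filename.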
Related
I'm trying to loop through a CSV with a column of URLs, send an HTTP request for each URL, parse the response HTML for a URL, and store that URL in a new column. I've been looking through similar posts on here trying to piece this together, but I'm at a loss. Any assistance is greatly appreciated.
All I have so far is importing csv and requests and opening the CSV file.
import requests
import csv
with open('UsersWithoutSignaturesOU.csv', newline='') as csvfile:
    errors = []
    reader = csv.DictReader(csvfile)
    for row in reader:
        try:
            r = requests.get(row['thumbnailPhotoUrl'])
        except requests.RequestException:
            errors.append(row)  # keep track of rows that failed
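To round the fragment out, here is a minimal sketch of the rest of the loop. The HTTP call is passed in as a `fetch` parameter so the logic can be exercised with a stub instead of live requests; `fetch` is assumed to request each page and return the URL extracted from its HTML (that parsing step is site-specific and left out). The 'signatureUrl' column name is an assumption:

```python
# Sketch: add a new column to each row by calling `fetch` on the URL
# column. Failed requests are recorded as blanks rather than aborting
# the whole loop.
def add_url_column(rows, fetch):
    out = []
    for row in rows:
        try:
            row['signatureUrl'] = fetch(row['thumbnailPhotoUrl'])
        except Exception:
            row['signatureUrl'] = ''
        out.append(row)
    return out

# Exercise the loop with a stub in place of the real HTTP call.
rows = [{'thumbnailPhotoUrl': 'http://example.com/a.jpg'}]
result = add_url_column(rows, lambda u: u.replace('.jpg', '.sig'))
```

The resulting rows can then be written back out with `csv.DictWriter`, adding 'signatureUrl' to the fieldnames.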
Maybe someone knows what the problem is with downloading from the site below... I run this code in Jupyter, and nothing happens.
import requests
import os
url = 'http://www.football-data.co.uk/mmz4281/1920/E0.csv'
response = requests.get(url)
with open(os.path.join("folder", "file"), 'wb') as f:
    f.write(response.content)
I've tried this code and it works fine on my side, assuming folder and file were defined correctly. Alternatively, you can use pandas, which can read a CSV file directly from a URL. The code would then become:
import pandas as pd
import csv
url = '{Some CSV target}'
df = pd.read_csv(url)
df.to_csv('{absolute path to CSV}', sep=',', index=False, quoting=csv.QUOTE_ALL)
It started working only after declaring a proxy.
I'm practicing on my work laptop; maybe the local network is blocking my requests.
I hope this can be helpful for someone.
Thanks to everyone for your help!
import requests
import os
os.environ['HTTP_PROXY'] = 'your proxy'
url = 'http://www.football-data.co.uk/mmz4281/1920/E0.csv'
response = requests.get(url)
with open(os.path.join("C:/DownloadLocation", "file.csv"), 'wb') as f:
    f.write(response.content)
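An alternative to the environment variable: requests also accepts a per-request `proxies` argument, so the proxy applies only to the calls you pass it to. The proxy address below is a placeholder, not a real server:

```python
import requests

# Pass the proxy per request instead of via HTTP_PROXY; the address is a
# placeholder for your real proxy.
proxies = {
    'http': 'http://your.proxy:8080',
    'https': 'http://your.proxy:8080',
}
url = 'http://www.football-data.co.uk/mmz4281/1920/E0.csv'
# response = requests.get(url, proxies=proxies)  # run this behind your proxy
```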
I'm trying to use requests to download the content of some web pages which are in fact PDFs.
I've tried the following code, but the output that comes back does not seem to be properly decoded:
link= 'http://www.pdf995.com/samples/pdf.pdf'
import requests
r = requests.get(link)
r.text
The output looks like below:
'%PDF-1.3\n%�쏢\n30 0 obj\n<>\nstream\nx��}ݓ%�m���\x15S�%NU���M&O7�㛔]ql�����+Kr�+ْ%���/~\x00��=����{feY�T�\x05��\r�\x00�/���q�8�8�\x7f�\x7f�~����\x1f�ܷ�O�z�7�7�o\x1f����7�\'�{��\x7f<~��\x1e?����C�%\ByLշK����!_b^0o\x083�K\x0b\x0b�\x05z�E�S���?�~ �]rb\x10C�y�>_r�\x10�<�K��<��!>��(�\x17���~�.m��]2\x11��
etc
I was hoping to get the HTML. I also tried with BeautifulSoup, but it does not decode it either. I hope someone can help. Thank you, BR
Yes; a PDF file is a binary file, not a text file, so you should use r.content instead of r.text to access the raw bytes.
PDF files are not easy to deal with programmatically, but you might (for example) save the content to a file:
import requests
link = 'http://www.pdf995.com/samples/pdf.pdf'
r = requests.get(link)
with open('pdf.pdf', 'wb') as f:
    f.write(r.content)
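As an optional sanity check (not part of the answer above), a PDF can be recognized by its first bytes, since every PDF file begins with `%PDF`:

```python
# Every PDF file begins with the magic bytes b'%PDF', so you can verify
# the download really is a PDF before (or after) writing it to disk.
def looks_like_pdf(data):
    return data.startswith(b'%PDF')
```

With the response above you would call `looks_like_pdf(r.content)`; a False result usually means the server sent an error page instead of the document.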
I need to download a file from an external source; I am using Basic authentication to log in to the URL.
import requests
response = requests.get('<external url>', auth=('<username>', '<password>'))
data = response.json()
html = data['list'][0]['attachments'][0]['url']
print(html)
data = requests.get('<API URL to download the attachment>', auth=('<username>', '<password>'), stream=True)
print(data.content)
I am getting the output below:
<url to download the binary data>
\x00\x00\x13\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0f\xcb\x00\x00\x1e\x00\x1e\x00\xbe\x07\x00\x00.\xcf\x05\x00\x00\x00'
I am expecting the URL to download the Word document within the same session.
Working solution
import requests
import shutil
response = requests.get('<url>', auth=('<username>', '<password>'))
data = response.json()
html = data['list'][0]['attachments'][0]['url']
print(html)
data = requests.get('<url>', auth=('<username>', '<password>'), stream=True)
with open("C:/myfile.docx", 'wb') as f:
    data.raw.decode_content = True
    shutil.copyfileobj(data.raw, f)
I am able to download the file as it is.
When you want to download a file directly you can use shutil.copyfileobj():
https://docs.python.org/2/library/shutil.html#shutil.copyfileobj
You already are passing stream=True to requests which is what you need to get a file-like object back. Just pass that as the source to copyfileobj().
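For the curious, copyfileobj is essentially a chunked read/write loop. Here is a simplified sketch of what it does, using an in-memory buffer in place of `data.raw` so it runs without a network connection:

```python
import io

# Simplified version of what shutil.copyfileobj does: read the source
# file-like object in fixed-size chunks and write each chunk out, so the
# whole download never has to sit in memory at once.
def copy_in_chunks(src, dst, chunk_size=8192):
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)

src = io.BytesIO(b'x' * 20000)  # stands in for data.raw
dst = io.BytesIO()
copy_in_chunks(src, dst)
```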
I'm trying to open a CSV file from a URL, but for some reason I get an error saying that there is an invalid mode or filename. I'm not sure what the issue is. Help?
url = "http://...."
data = open(url, "r")
read = csv.DictReader(data)
Download the stream, then process:
import csv
import urllib2

url = "http://httpbin.org/get"
response = urllib2.urlopen(url)
data = response.read()
read = csv.DictReader(data.splitlines())  # DictReader needs lines, not one string
I recommend pandas for this:
import pandas as pd
read = pd.read_csv("http://....", ...)
Please see the documentation.
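read_csv accepts a file-like object as well as a URL, so the same call can be tried offline; here a StringIO stands in for the remote file:

```python
import io
import pandas as pd

# read_csv takes a URL or any file-like object; StringIO stands in for
# the remote CSV so the example runs without a network connection.
csv_text = "name,medals\nUSA,10\nNOR,8\n"
df = pd.read_csv(io.StringIO(csv_text))
```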
You can do the following :
import csv
import urllib2
url = 'http://winterolympicsmedals.com/medals.csv'
response = urllib2.urlopen(url)
cr = csv.reader(response)
for row in cr:
    print row
Slightly tongue-in-cheek:
>>> import json
>>> for line in open('some.csv'):
...     print json.loads('[' + line + ']')
CSV is not a well-defined format; JSON is, so this will parse a certain type of CSV correctly every time.
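For a concrete illustration of why naive comma-splitting is not enough (and why the csv module, or tricks like the JSON one above, exist): quoted fields may themselves contain commas.

```python
import csv
import io

# A quoted field containing a comma: the csv module parses it correctly,
# while a naive split on ',' breaks the field apart.
line = 'a,"b,c",d'
row = next(csv.reader(io.StringIO(line)))
naive = line.split(',')
```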