Download csv file and convert to JSON - python

I would like to write a Python script that downloads a CSV file from a URL and then returns it as JSON. The problem is that I need it to execute as fast as possible. What is the best way to do it? I was thinking about something like this:
import csv, io, json, requests

r_bytes = requests.get(URL).content
r = r_bytes.decode('utf8')
reader = csv.DictReader(io.StringIO(r))
json_data = json.dumps(list(reader))
What do you think? It doesn't look good to me, but I can't find any better way to solve this problem.

I tried comparing your conversion process with pandas and used this code:
import io
import pandas as pd
import requests
import json
import csv
import time
r_bytes = requests.get("https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv").content
print("finished download")
r = r_bytes.decode('utf8')
print("finished decode")
start_df_timestamp = time.time()
df = pd.read_csv(io.StringIO(r), sep=";")
result_df = json.dumps(df.to_dict('records'))
end_df_timestamp = time.time()
print("The df method took {d_t}s".format(d_t=end_df_timestamp-start_df_timestamp))
start_csv_reader_timestamp = time.time()
reader = csv.DictReader(io.StringIO(r))
result_csv_reader = json.dumps(list(reader))
end_csv_reader_timestamp = time.time()
print("The csv-reader method took {d_t}s".format(d_t=end_csv_reader_timestamp-start_csv_reader_timestamp))
and the result was:
finished download
finished decode
The df method took 0.200181245803833s
The csv-reader method took 0.3164360523223877s
This was using a random 37k-row CSV file, and I noticed that downloading it was by far the most time-intensive step. Even though the pandas functions were faster for me, you should probably profile your code to see whether the conversion really adds significantly to your runtime. :-)
PS: If you need to constantly monitor the CSV and processing updates turns out to be time-intensive, you could use hashes to only reprocess the file when it has actually changed since your last download.
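A minimal sketch of that hash check, assuming the whole file fits in memory (the URL is just a placeholder):
import hashlib
import requests

URL = 'https://example.com/data.csv'   # placeholder
last_hash = None

def fetch_if_changed(url):
    """Return the decoded CSV text only if its content changed since the last call."""
    global last_hash
    content = requests.get(url).content
    digest = hashlib.sha256(content).hexdigest()
    if digest == last_hash:
        return None                     # unchanged: skip re-parsing / re-converting
    last_hash = digest
    return content.decode('utf8')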

Related

How do I adjust this nested loop to store the output of different URL requests in separate databases or .csv files?

So I'm working on a simple project, but apparently I'm stuck at the first step. Basically, I'm requesting .json files from a public GitHub repository: 7 different files which I aim to download and convert to 7 differently named databases.
I tried to use this nested loop to create 7 different CSV files; the only problem is that it gives me 7 differently named CSV files with the same content (the one from the last URL).
I think it has something to do with the way I store the data from the JSON output in the variable "data".
How could I solve this problem?
import pandas as pd
import datetime
import re, json, requests #this is needed to import the data from the github repository
naz_l_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-andamento-nazionale-latest.json'
naz_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-andamento-nazionale.json'
reg_l_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni-latest.json'
reg_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni.json'
prov_l_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-province-latest.json'
prov_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-province.json'
news_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-note.json'
list_of_url= [naz_l_url,naz_url, reg_l_url,reg_url,prov_url,prov_l_url,news_url]
csv_names = ['01','02','03','04','05','06','07']
for i in list_of_url:
    resp = requests.get(i)
    data = pd.read_json(resp.text, convert_dates=True)
    for x in csv_names:
        data.to_csv(f"{x}_df.csv")
I want to try two different ways: one with the loop giving me CSV files, and another with the loop giving me pandas DataFrames. But first I need to solve the problem of the loop giving me the same output.
The problem is that you are iterating over the full list of names for each URL you download: note how for x in csv_names is inside the for i in list_of_url loop, so every URL's data gets written to all seven files, and only the last URL's data survives.
Where the problem comes from
Python uses indentation levels to determine when you are in and out of a loop (as other languages might use curly braces, begin/end, or do/end). I'd recommend you brush up on this topic. For example, with Concept of Indentation in Python. You can see the official documentation about Compound statements, too.
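A tiny illustration of the difference, using made-up lists just to show the nesting:
urls = ['u1', 'u2']
names = ['01', '02']

for url in urls:
    print('downloading', url)
    for name in names:              # nested: runs for EVERY url, so files get overwritten
        print('  writing', name)

for url in urls:
    print('downloading', url)
for name in names:                  # not nested: runs once, after the download loop finishes
    print('writing', name)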
Proposed solution
I'd recommend you replace the naming of the files, and do something like this instead:
import pandas as pd
import datetime
import re, json, requests #this is needed to import the data from the github repository
from urllib.parse import urlparse
naz_l_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-andamento-nazionale-latest.json'
naz_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-andamento-nazionale.json'
reg_l_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni-latest.json'
reg_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni.json'
prov_l_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-province-latest.json'
prov_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-province.json'
news_url = 'https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-note.json'
list_of_url= [naz_l_url,naz_url, reg_l_url,reg_url,prov_url,prov_l_url,news_url]
csv_names = ['01','02','03','04','05','06','07']
for url in list_of_url:
    resp = requests.get(url)
    data = pd.read_json(resp.text, convert_dates=True)
    # here is where you DON'T want to have a nested `for` loop
    file_name = urlparse(url).path.split('/')[-1].replace('json', 'csv')
    data.to_csv(file_name)
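If you would rather keep the numbered names from csv_names, a small variation (just a sketch) is to pair the two lists with zip instead of nesting the loops:
for url, name in zip(list_of_url, csv_names):
    resp = requests.get(url)
    data = pd.read_json(resp.text, convert_dates=True)
    data.to_csv(f"{name}_df.csv")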

Reading excel file with pandas and printing it for inserting it in http GET statement for Rest-API

I want to read each line of an Excel file (.xlsx) in the column called 'ABC'. There are 4667 lines, and each line contains a string.
I want to print each string, but it does not work.
import requests
import pandas as pd
get_all_ABC = pd.read_excel(r'C:\Users\XXX\XXX2\XXX3\table.xlsx', header=0)
row_iterator = get_all_ABC.iterrows()
_, last = row_iterator.__next__()
for i, row in row_iterator:
    r = requests.get(row["ABC"])
    r = requests.get(last["ABC"])
    last = row
data = r.text
print(r.text)
Why are you using the requests library? That is for making HTTP requests. Also, it's almost always bad practice to iterate over rows in pandas, and 99% of the time it's unnecessary.
Also, the r.text at the end only refers to the last request made inside the loop, so you will only ever see one result there.
Could you explain exactly what you're trying to accomplish? I don't think I'm understanding correctly.
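In the meantime, here is a minimal sketch of what I think you're after: one GET per value in the 'ABC' column, without iterrows (API_URL is just a placeholder for whatever endpoint you're calling, and the file path is copied from your snippet):
import requests
import pandas as pd

API_URL = 'http://example.com/endpoint'          # placeholder: whatever you're actually calling

df = pd.read_excel(r'C:\Users\XXX\XXX2\XXX3\table.xlsx', header=0)

responses = []
for value in df['ABC']:                          # iterate over the column values, not the rows
    r = requests.get(API_URL, params={'ABC': value})
    responses.append(r.text)

print(responses[:5])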
Julian L is right on his points. I mixed a lot up. I have to use the requests library for my overall problem because I call the GET method of a REST API server with the strings that are written in the roughly 4000 lines of the column 'ABC' in the Excel file. Before, I tried the following Python script (in that script I also do not use an iteration):
import requests
import pandas as pd
get_all_ABC = pd.read_excel(r'C:\Users\XXX\XXX2\XXX3\table.xlsx', skiprows=1).set_index('ABC')
r = requests.get('http://localhost:5000/api/sensors/data?ABC={get_all_ABC}')
print(r.json())
But this does not work either.
This thread leads nowhere. I'll delete it and open a new one in which I describe the problem in more detail.

Python: Issue with rapidly reading and writing excel files after web scraping? Works for a bit then weird issues come up

So I developed a script that would pull data from a live-updated site tracking coronavirus data. I set it up to pull data every 30 minutes but recently tested it on updates every 30 seconds.
The idea is that it creates the request to the site, pulls the html, creates a list of all of the data I need, then restructures into a dataframe (basically it's the country, the cases, deaths, etc.).
Then it takes each row and appends it to the corresponding one of the 123 Excel files for the various countries. This works well for, I believe, somewhere in the range of 30-50 iterations before it either causes file corruption or weird data entries.
I have my code below. I know it's poorly written (my initial reasoning was that I felt confident I could set it up quickly and I wanted to collect data quickly... unfortunately I overestimated my abilities, but now I want to learn what went wrong). Below my code I'll include sample output.
PLEASE note that the 30-second interval is only for quick testing; I don't intend to send requests that frequently for months. I just wanted to see what the issue was. Originally it was set to pull every 30 minutes when I detected this issue.
See below for the code:
import schedule
import time
def RecurringProcess2():
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import datetime
    import numpy as np
    from os import listdir
    import os
    try:
        extractTime = datetime.datetime.now()
        extractTime = str(extractTime)
        print("Access Initiated at " + extractTime)
        link = 'https://www.worldometers.info/coronavirus/'
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'html.parser').findAll('td')  # [1107].get_text()
        table = pd.DataFrame(columns=['Date and Time', 'Country', 'Total Cases', 'New Cases', 'Total Deaths', 'New Deaths', 'Total Recovered', 'Active Cases', 'Serious Critical', 'Total Cases/1M pop'])
        soupList = []
        for i in range(1107):
            value = soup[i].get_text()
            soupList.insert(i, value)
        table = np.reshape(soupList, (123, -1))
        table = pd.DataFrame(table)
        table.columns = ['Country', 'Total Cases', 'New Cases (+)', 'Total Deaths', 'New Deaths (+)', 'Total Recovered', 'Active Cases', 'Serious Critical', 'Total Cases/1M pop']
        table['Date & Time'] = extractTime
        # Below code is run once to generate the initial files. That's it.
        # for i in range(122):
        #     fileName = table.iloc[i,0] + '.xlsx'
        #     table.iloc[i:i+1,:].to_excel(fileName)
        FilesDirectory = 'D:\\Professional\\Coronavirus'
        fileType = '.csv'
        filenames = listdir(FilesDirectory)
        DataFiles = [filename for filename in filenames if filename.endswith(fileType)]
        for file in DataFiles:
            countryData = pd.read_csv(file, index_col=0)
            MatchedCountry = table.loc[table['Country'] == str(file)[:-4]]
            if file == ' USA .csv':
                print("Country Data Rows: ", len(countryData))
                if os.stat(file).st_size < 1500:
                    print("File Size under 1500")
            countryData = countryData.append(MatchedCountry)
            countryData.to_csv(FilesDirectory + '\\' + file, index=False)
    except:
        pass
    print("Process Complete!")
    return

schedule.every(30).seconds.do(RecurringProcess2)

while True:
    schedule.run_pending()
    time.sleep(1)
When I check the files after some number of iterations (usually successful for around 30-50), they have either kept only 2 rows and lost all the others, or they keep appending while deleting a single entry in the row above, while the row two above loses 2 entries, etc. (essentially forming a triangle of sorts).
Above that data there would be a few hundred empty rows. Does anyone have an idea of what is going wrong here? I'd consider this a failed attempt but would still like to learn from it. I appreciate any help in advance.
Hi, as per my understanding the webpage only has one table element. My suggestion would be to use pandas' read_html method, as it provides a clean and structured table.
Try the below code, which you can modify to schedule the same:
import requests
import pandas as pd
url = 'https://www.worldometers.info/coronavirus/'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
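If you still want one CSV per country, a rough sketch of combining read_html with the schedule loop could look like the following (which column holds the country name, and the exact table layout, are assumptions here; inspect df.columns on the live table first):
import datetime
import os
import time

import pandas as pd
import requests
import schedule

URL = 'https://www.worldometers.info/coronavirus/'

def pull_and_append():
    html = requests.get(URL).text
    table = pd.read_html(html)[-1]
    table['Date & Time'] = str(datetime.datetime.now())
    for _, row in table.iterrows():
        country = str(row.iloc[0]).strip()   # assumes the first column holds the country name
        out_file = f"{country}.csv"
        # append one row per run; only write the header when the file is created
        row.to_frame().T.to_csv(out_file, mode='a', header=not os.path.exists(out_file), index=False)

schedule.every(30).minutes.do(pull_and_append)

while True:
    schedule.run_pending()
    time.sleep(1)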
Disclaimer: I'm still evaluating this solution. So far it works almost perfectly for 77 rows.
Originally I had set the script up to run for .xlsx files. I converted everything to .csv but retained the index column code:
countryData = pd.read_csv(file,index_col=0)
I started realizing that things were being ordered differently every time the script ran. I have since removed that from the code and so far it works. Almost.
   Unnamed: 0  Unnamed: 0.1
0           7             7
For some reason I have the above output in every file. I don't know why. But it's in the first 2 columns yet it still seems to be reading and writing correctly. Not sure what's going on here.
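For what it's worth, those Unnamed: 0 columns usually appear when a frame is written with the default to_csv(..., index=True) and then read back without index_col=0. A tiny reproduction (the file name is just an illustration):
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
df.to_csv('tmp.csv')                          # default index=True writes the index as an unnamed first column
round_trip = pd.read_csv('tmp.csv')           # comes back with an extra 'Unnamed: 0' column
clean = pd.read_csv('tmp.csv', index_col=0)   # or write with index=False in the first place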

How to download xlsx file from URL and save in data frame via python

I would like the following code to download the xlsx file from the URL and save it into a DataFrame.
I receive this error:
AttributeError: 'str' object has no attribute 'content'
Below is the code:
import requests
import xlrd
import pandas as pd
filed = 'https://www.icicipruamc.com/downloads/others/monthly-portfolio-disclosures/monthly-portfolio-disclosure-november19/Arbitrage.xlsx'
resp = requests.get(filed)
workbook = xlrd.open_workbook(file_contents = filed.content)
worksheet = workbook.sheet_by_index(0)
first_row = worksheet.row(0)
df = pd.DataFrame(first_row)
pandas already has a function that converts an Excel file directly into a pandas DataFrame (using xlrd):
import pandas as pd

MY_EXCEL_URL = "https://www.yes.com/xl.xlsx"

xl_df = pd.read_excel(MY_EXCEL_URL,
                      sheet_name='my_sheet',
                      skiprows=range(5),
                      skipfooter=0)
Then you can handle / save the file using pd.DataFrame.to_excel.
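For example, a minimal follow-up to save a local copy (the file name is just an illustration):
xl_df.to_excel('local_copy.xlsx', index=False)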
This function works; I tested the individual components. The ICICI URL you have gives me a 404, so make sure the site works and actually serves an Excel sheet before trying this out.
import requests
import pandas as pd
def excel_to_pandas(URL, local_path):
    resp = requests.get(URL)
    with open(local_path, 'wb') as output:
        output.write(resp.content)
    df = pd.read_excel(local_path)
    return df

print(excel_to_pandas("https://www.websiteforxls.com", '~/Desktop/my_downloaded.xls'))
As a footnote, this was super simple, and I'm disappointed you couldn't do this on your own. I might not have been able to do this 5 years ago, and that's why I decided to help.
If you want to code, learn the basics, literally the basics: classes, functions, variables, types, OOP principles. That's all you need to start. Then you need to learn how to search and how to make different components work together the way you require them to. And with SO, if you show some effort, we are happy to help. We are a community, not a place to solve your homework. Try harder next time.

Pandas MemoryError when reading large CSV followed by `.iloc` slicing columns

I've been trying to process a 1.4GB CSV file with Pandas, but keep having memory problems. I have tried different things in attempt to make Pandas read_csv work to no avail.
It didn't work when I used the iterator=True and chunksize=number parameters. Moreover, the smaller the chunksize, the slower it is to process the same amount of data.
(Simple per-chunk overhead doesn't explain it, because it was far slower when the number of chunks was large. I suspect that when processing each chunk, pandas needs to go through all the chunks before it to "get to it", instead of jumping right to the start of the chunk. That seems the only way this can be explained.)
Then as a last resort, I split the CSV files into 6 parts, and tried to read them one by one, but still get MemoryError.
(I have monitored the memory usage of Python while running the code below, and found that each time Python finishes processing a file and moves on to the next, the memory usage goes up. It seemed quite obvious that pandas didn't release the memory for the previous file even though it had finished processing it.)
The code may not make sense but that's because I removed the part where it writes into an SQL database to simplify it and isolate the problem.
import csv, pandas as pd
import glob

filenameStem = 'Crimes'
counter = 0
for filename in glob.glob(filenameStem + '_part*.csv'):  # reading files Crimes_part1.csv through Crimes_part6.csv
    chunk = pd.read_csv(filename)
    df = chunk.iloc[:, [5, 8, 15, 16]]
    df = df.dropna(how='any')
    counter += 1
    print(counter)
you may try to parse only those columns that you need (as #BrenBarn said in comments):
import os
import glob
import pandas as pd

def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)

fmask = 'Crimes_part*.csv'
cols = [5, 8, 15, 16]

df = get_merged_csv(glob.glob(fmask), index_col=None, usecols=cols).dropna(how='any')
print(df.head())
PS this will include only 4 out of at least 17 columns in your resulting data frame
Thanks for the reply.
After some debugging, I have located the problem. The iloc subsetting in pandas created a circular reference, which prevented garbage collection. A detailed discussion can be found here
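In case someone hits the same pattern, a rough sketch of releasing memory explicitly between files (whether gc.collect() actually helps depends on the pandas version; the file mask and columns are taken from the snippets above):
import gc
import glob

import pandas as pd

for filename in glob.glob('Crimes_part*.csv'):
    df = pd.read_csv(filename, usecols=[5, 8, 15, 16]).dropna(how='any')
    # ... write df to the SQL database here ...
    del df
    gc.collect()      # force collection so the previous file's memory is freed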
I have found the same issue with CSV files. First, read the CSV in chunks with a fixed chunksize: use the chunksize or iterator parameter to return the data in chunks.
Syntax:
csv_onechunk = pandas.read_csv(filepath, sep=delimiter, skiprows=1, chunksize=10000)
Then concatenate the chunks (only valid with the C parser).
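A minimal sketch of that chunked read, combined with the column selection from the answer above (the file name, chunk size, and column indices are taken from the earlier snippets and are only assumptions):
import pandas as pd

chunks = pd.read_csv('Crimes_part1.csv', chunksize=10000, usecols=[5, 8, 15, 16])
df = pd.concat((chunk.dropna(how='any') for chunk in chunks), ignore_index=True)
print(df.head())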
