Disclosure: I am not a developer or anything; I just had to do this because, well, I had to. Of course, I was super proud when I coded hangman in Python, but that was pretty much it.
So I had to put data from one service into a MySQL table, connecting to the service through their aggregation API. To my surprise everything works as expected, BUT there are two problems:
The script is super slow: it takes around 500-700 seconds to execute.
It works when I run it manually, but it times out on the scheduler.
So my question to you, fellow community: could you hint at what I should read, or maybe change, to make it at least a little bit faster?
As business background: I have to run separate queries for 10 different languages, but in the code below I show only one language and describe the repetition around it.
The timeout on scheduled execution happens somewhere between the 5th and 6th language.
# used modules
import requests
import json
import pandas as pd
import MySQLdb

url = 'here comes URI to service API aggregation call'
headers = {'Integration-Key': 'Key', 'Content-Type': 'application/json'}

# the request body is different for each of the 10 languages, so there are 10 of these variables (data_en, data_sv, ...)
data_en = '''{Here comes a long long JSON request so API can aggregate it all }'''

# requesting data from API
# again, this block is repeated 10 times, once per language
response = requests.post(url, headers=headers, data=data_en)
json_data = json.loads(response.text)
df_en = pd.DataFrame(json_data['results'])

# on schedule, it times out after the 5th or 6th language

# creating merged table
df = pd.concat([df_en, df_sv, and_so_on], ignore_index=True)

db = MySQLdb.connect(host="host", user="user", passwd="pws", db="db")
df.to_sql(con=db, name='nps', if_exists='replace', flavor='mysql')
I have never found to_sql to work for datasets that are at all large. I recommend turning your dataframe into a CSV and then using psycopg2 to do a bulk COPY up to your table.
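A minimal sketch of that CSV-plus-COPY approach, assuming a PostgreSQL target as the psycopg2 suggestion implies (for the MySQL table in the question, the analogous bulk path would be LOAD DATA INFILE), and reusing the merged df and the nps table from the question:

import io
import psycopg2  # PostgreSQL driver, as suggested in the answer

# dump the merged dataframe to an in-memory CSV
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
buf.seek(0)

# bulk-load it in one COPY statement instead of row-by-row INSERTs
conn = psycopg2.connect(host="host", user="user", password="pws", dbname="db")
with conn, conn.cursor() as cur:
    cur.copy_expert("COPY nps FROM STDIN WITH (FORMAT csv)", buf)
conn.close()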
I am trying to understand the best way to deal with Sessions.
I am scraping data from a website (using scrapy, though that's kind of irrelevant here) and, when I encounter new data, I save it to my database; when it matches something that already exists, I update some values.
My issue is that each webpage I am scraping concerns only one entry in my table, and may need an update in some other table.
For instance, let's assume I am scraping cars and some features (gearbox type, engine type, owner, ...) are within linked tables.
At the moment, I am doing something like:
from sqlalchemy import create_engine, select
from sqlalchemy.orm import Session

def parse_cars_page(self, response):  # standard scrapy callback; response holds the scraped html data
    # do some stuff here to find the car, its parameters, among which its gearbox and motor types

    engine = create_engine("connection_string")
    with Session(engine) as session:
        stmt_cars = select(model.Cars).where(model.Cars.id == car.id)
        current_car = session.execute(stmt_cars).scalar_one_or_none()

        if current_car is not None:
            pass  # update some data on the existing car
        else:
            current_car = model.Cars(id=car.id)  # the car is new, so create it
            session.add(current_car)

            stmt_motor = select(model.Motors).where(model.Motors.type == car.motor_type)
            current_motor = session.execute(stmt_motor).scalar_one_or_none()
            if current_motor is None:
                current_motor = model.Motors(create_it_using_my_data)
            current_car.motor = current_motor
            # then do the same for the gearbox, owner, .........

        session.commit()
This works fine. But it's slow, very slow. The whole with statement takes about a minute to execute.
As the info for each entry is contained within a single page, and I am opening thousands of them, getting the data is tedious and takes an enormous amount of time.
I tried to find best practices in the SQLAlchemy documentation, but I can't find what I am looking for, and I understand that keeping a Session open and sharing it across my whole app isn't a good idea.
My app is the only thing that will update the data in the database. Other processes may read it but not write to it.
Is there a way to open a Session, copy a snapshot of the database into memory, update this snapshot while the Session is closed, and only open a Session for synchronization once I have generated X new entries?
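To make that batching idea concrete, here is a minimal sketch; it assumes the same hypothetical model.Cars / model.Motors classes, a plain in-memory buffer of scraped items, and an engine created once at module level (creating it per page, as in the snippet above, is relatively expensive):

from sqlalchemy import create_engine, select
from sqlalchemy.orm import Session

engine = create_engine("connection_string")  # created once, not per page
scraped_items = []                           # in-memory buffer of scraped car dicts
BATCH_SIZE = 100                             # hypothetical threshold

def buffer_item(item):
    scraped_items.append(item)
    if len(scraped_items) >= BATCH_SIZE:
        flush()

def flush():
    # one short-lived Session synchronises the whole batch in a single transaction
    with Session(engine) as session:
        for item in scraped_items:
            car = session.execute(
                select(model.Cars).where(model.Cars.id == item["id"])
            ).scalar_one_or_none()
            if car is None:
                car = model.Cars(id=item["id"])
                session.add(car)
            # update car attributes / linked motor and gearbox rows here
        session.commit()
    scraped_items.clear()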
I have a Twitter bot (here's the GitHub page). It has two TimerTrigger functions (that rely on Tweepy), each of which creates and pushes Tweets. Pretty simple, it works, and it costs basically nothing. My question: what's the best method to capture those Tweets in a table? (The first column would be the Tweet ID, the second column the first Twitter handle, the third column the Tweet text, etc.)
I have read about creating some sort of Blob storage function that fires when another function fires (or an HTTP trigger, but I think this should be simpler than that) - but I'd really prefer to just use a script in a shared code folder (to instantiate in my __init__.py scripts) to capture and store each Tweet as it gets fired off into the Twitterverse. Azure provides this sample code, which I think is pretty good, but I have no idea of the best place to put it.
Using the Azure Tables client you can access both Azure Storage Tables and CosmosDB. For what you are trying to do, I think the simplest (and most cost-effective) solution is Azure Storage Tables.
from azure.data.tables import TableServiceClient

TWEET_ID = u'001234'
HANDLE = u'RedMarker'

# each tweet becomes one entity; PartitionKey + RowKey together must be unique
my_entity = {
    u'PartitionKey': HANDLE,
    u'RowKey': TWEET_ID,
    u'TweetText': u"blah.....blah....blah"
}

table_service_client = TableServiceClient.from_connection_string(conn_str="<connection_string>")
table_client = table_service_client.get_table_client(table_name="myTable")
entity = table_client.create_entity(entity=my_entity)
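One way to call this from a shared code folder is to wrap it in a small helper; this is a sketch with a hypothetical store_tweet function and table name (create_table_if_not_exists makes the table on first use and is safe to call repeatedly):

from azure.data.tables import TableServiceClient

def store_tweet(conn_str, tweet_id, handle, text, table_name="tweets"):
    # create_table_if_not_exists returns a TableClient whether or not the table already existed
    service = TableServiceClient.from_connection_string(conn_str=conn_str)
    table = service.create_table_if_not_exists(table_name=table_name)
    table.create_entity(entity={
        "PartitionKey": handle,   # one partition per Twitter handle
        "RowKey": str(tweet_id),  # RowKey must be unique within the partition
        "TweetText": text,
    })

Each TimerTrigger function would then import store_tweet and call it right after its Tweepy call succeeds.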
I'm a very novice web developer currently building a website from scratch. I have most of the frontend set up, but I am really struggling with the backend and databases.
The point of the website is to display a graph of class completion status (for each class, it will show what percentage is complete/incomplete and how many total users there are). It retrieves this data from a CSV file on an SFTP server. The issue I am having is that when I try to access the data directly, it loads incredibly slowly.
Here is the code I am using to retrieve the data:
import pandas

Courses = ['']
Total = [0]
Compl = [0]
i = 0  # index of the course currently being counted

csvreal = pandas.read_csv(file)

for index, row in csvreal.iterrows():
    string = csvreal.loc[[index]].to_string(index=False, header=False)
    if Courses[i] != string.split(' ')[0]:
        # new course name: start a new bucket
        i += 1
        Courses.append(string.split(' ')[0])
        Total.append(0)
        Compl.append(0)
    if len(string.split(' ')[2]) > 3:
        # a completion date is present, so count this row as completed
        Compl[i] += 1
    Total[i] += 1
To explain it a little bit: the CSV file has the roster information, i.e. each row has the name of the course, the name of the user, the completion date, and the course code. The course name is the first column, which is why the code uses string.split(' ')[0], the first part of the string. If the user has not completed the course, the third column (completion date) is empty; that is why the code checks whether that field is longer than 3 characters - if it is, the user has completed it.
This takes entirely too long to compute. About 30 seconds with around 7,000 entries. Recently the CSV size was increased to something like 36,000.
I was advised to set up a database using SQL and have a nightly cronjob parse the data, so the website retrieves the data from the database instead of the CSV.
Any advice on where to even begin, or how to do this would be greatly appreciated.
Before I recommend using a database: how fast is the connection to the SFTP server you are getting the data from? Would it be faster to host the file on the local machine? If that isn't the issue, see below.
Yes, in this case a database would speed up your computation and retrieval time. You need to set up a SQL database, have a way to put data into it, and then retrieve it. I included resources at the bottom that will help you familiarize yourself with SQL. Knowledge of PHP will be needed in order to interact with and manipulate the database.
SQL will also be much simpler for you to work with. For example, you needed to check whether a cell is empty. In SQL, this can be done with:
SELECT * FROM table WHERE some_col IS NULL OR some_col = '';
https://www.khanacademy.org/computing/computer-programming/sql
https://www.w3schools.com/sql/
https://www.guru99.com/introduction-to-database-sql.html
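To make the nightly-cronjob idea concrete, here is a minimal sketch; it assumes a local SQLite database purely for illustration (the answer doesn't name a specific engine) and a hypothetical roster.csv laid out as described in the question (course name, user, completion date, course code):

import sqlite3
import pandas

# nightly job: reload the roster CSV into a SQL table
# names= assumes the CSV has no header row; drop it if there is one
df = pandas.read_csv("roster.csv", names=["course", "user", "completion_date", "course_code"])
conn = sqlite3.connect("roster.db")
df.to_sql("roster", conn, if_exists="replace", index=False)

# the website then runs one aggregate query instead of looping over rows
query = """
    SELECT course,
           COUNT(*) AS total,
           SUM(CASE WHEN completion_date IS NOT NULL AND completion_date <> '' THEN 1 ELSE 0 END) AS completed
    FROM roster
    GROUP BY course
"""
for course, total, completed in conn.execute(query):
    print(course, completed, total)
conn.close()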
I'm a Ruby dev doing a lot of data work who's decided to switch to Python. I'm enjoying making the transition so far and have been blown away by Pandas, Jupyter Notebooks, etc.
My current task is to write a lightweight RESTful API that under the hood is running queries against Google BigQuery.
I have a really simple test running in Flask. This works fine, but I did have trouble rendering the BigQuery response as JSON. To get around this, I used Pandas and then converted the dataframe to JSON. While it works, it feels like an unnecessary step, and I'm not even sure this is a legitimate use case for Pandas. I have also read that creating a dataframe can be slow as data volume increases.
Below is my little mock up test in Flask. It would be really helpful to hear from experienced Python Devs how you'd approach this and if there are any other libraries I should be looking at here.
from flask import Flask
from google.cloud import bigquery

app = Flask(__name__)

@app.route("/bq_test")
def bq_test():
    client = bigquery.Client.from_service_account_json('/my_creds.json')
    sql = """select * from `my_dataset.my_table` limit 1000"""
    # run the query, pull the result into a dataframe, then serialise it to JSON
    df = client.query(sql).to_dataframe()
    return df.to_json(orient="records")

if __name__ == "__main__":
    app.run()
From the BigQuery documentation:
BigQuery supports functions that help you retrieve data stored in JSON-formatted strings and functions that help you transform data into JSON-formatted strings:
JSON_EXTRACT or JSON_EXTRACT_SCALAR
JSON_EXTRACT(json_string_expr, json_path_string_literal), which returns JSON values as STRINGs.
JSON_EXTRACT_SCALAR(json_string_expr, json_path_string_literal), which returns scalar JSON values as STRINGs.
Description
The json_string_expr parameter must be a JSON-formatted string. ...
The json_path_string_literal parameter identifies the value or values you want to obtain from the JSON-formatted string. You construct this parameter using the JSONPath format.
https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions
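As a quick illustration of the scalar variant inside the question's setup, reusing the client from the snippet above (the payload column and the JSONPath here are made up):

# hypothetical example: a payload column holding JSON-formatted strings
sql = """
    SELECT JSON_EXTRACT_SCALAR(payload, '$.user.name') AS user_name
    FROM `my_dataset.my_table`
    LIMIT 1000
"""
rows = client.query(sql).result()           # RowIterator of Row objects
names = [row["user_name"] for row in rows]  # each extracted value comes back as a STRING, or None if absent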
I have been doing a fair amount of manual data analysis, reporting and dashboarding recently via SQL and wonder if Python would be able to automate a lot of this. I am not familiar with Python at all, so I hope my question makes sense. For security/performance reasons, we store databases on a number of servers (more than 5) which contain data that would be pertinent to a query. Unfortunately, these servers are set up so they cannot talk to each other, so I can't pull data from two servers in the same query. I believe this is a limitation due to using Windows credentials/security.
For my data analysis and reporting needs, I have to grab pertinent data from two or more of these servers. The way I currently do this is by running a query, grabbing the results, running another query with the results, doing some formula work in Excel, and then running another query, and so on until I get what I need.
Unfortunately this is both time-consuming and forces me to pull massive datasets (in the multiple millions of rows), which I then have to continually narrow down based on criteria that live in said databases.
I know Python has the ability to query SQL Server, however I figured I would ask the experts:
Can I manipulate the data in the background with Python similar to how I can with Excel (lookups, statistical functions, etc.), perhaps even XML/web APIs?
Can Python handle connections to multiple different database servers at the same time?
Does Python handle Windows credentials well?
If Python is not the tool for this, can you name one that would work better?
Please let me know if I can provide additional pertinent details.
Ideally, I would like to end up creating our own separate database and creating automated processes to pull everything from other databases but currently that is not possible due to project constraints.
Thanks!
I haven't used Windows credentials, but I have used Python to work with multiple MS-SQL databases at the same time and it worked very well. You can use the pymssql library or, better, SQLAlchemy.
I think you should start with a basic Python tutorial first, though. Because you want to work with millions of rows, it's very important to understand list, set, tuple and dict in Python; for good performance, you should use the right type.
A basic example with pymssql
import pymssql

# one connection object per server
conn1 = pymssql.connect("Host1", "user1", "password1", "db1")
conn2 = pymssql.connect("Host2", "user2", "password2", "db2")

cursor1 = conn1.cursor()
cursor2 = conn2.cursor()

# note: SQL Server uses TOP rather than LIMIT
cursor1.execute('SELECT TOP 10 * FROM TABLE1')
cursor2.execute('SELECT TOP 10 * FROM TABLE2')

result1 = cursor1.fetchall()
result2 = cursor2.fetchall()

# print each row from the first server
for row in result1:
    print(row)

# print each row from the second server
for row in result2:
    print(row)
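On the Excel-style lookups and statistics part of the question: once each server's rows are fetched, pandas can join and aggregate the two result sets in memory. A small sketch reusing cursor1 and cursor2 from above (the table and column names are made up):

import pandas as pd

cursor1.execute("SELECT TOP 1000 order_id, customer_id, amount FROM ORDERS")  # hypothetical table on server 1
orders = pd.DataFrame(cursor1.fetchall(), columns=["order_id", "customer_id", "amount"])

cursor2.execute("SELECT customer_id, region FROM CUSTOMERS")                  # hypothetical table on server 2
customers = pd.DataFrame(cursor2.fetchall(), columns=["customer_id", "region"])

# the Excel-VLOOKUP equivalent: merge the two result sets on a shared key
merged = orders.merge(customers, on="customer_id", how="left")

# Excel-style aggregation: total and average amount per region
print(merged.groupby("region")["amount"].agg(["sum", "mean"]))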
You can do all of what you asked. Python allows you to create multiple connection objects via a library, so, for example, let's say you use the MySQL Python library, you would create two different objects like this:
NOT ACTUAL CODE, JUST EXAMPLE
conn1 = mysqlConnect(server1, user, pass)
conn2 = mysqlConnect(server2, user, pass)
Like this, conn1 connects to one database and conn2 connects to a different one; usually you would then do:
conn1.execute(query_to_server_1)
conn2.execute(query_to_server_2)
This helps maintain two different connections in the same script. If you are looking for multithreading, Python offers libraries that will help you execute multiple tasks from one master script.
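The answer doesn't name the multithreading library, but the standard-library concurrent.futures module is one common choice for running the two queries at the same time; a sketch reusing conn1 and conn2 from the pymssql example above:

from concurrent.futures import ThreadPoolExecutor

def run_query(conn, sql):
    # each thread uses its own connection, so the two servers are queried concurrently
    cursor = conn.cursor()
    cursor.execute(sql)
    return cursor.fetchall()

with ThreadPoolExecutor(max_workers=2) as pool:
    future1 = pool.submit(run_query, conn1, "SELECT TOP 10 * FROM TABLE1")
    future2 = pool.submit(run_query, conn2, "SELECT TOP 10 * FROM TABLE2")
    result1, result2 = future1.result(), future2.result()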