function to reduce redundancy for reading database in sqlite3 - python

Hi guys:) I'm a newbie at programming and would like to ask for help in creating a function to help reduce redundancy in my code. I have successfully created a database holding 5 different tables for data of different countries. All tables have the same structure (see attached screenshots for reference). My objective is to calculate the summation of all rows within all the different tables for a particular parameter (type of pollution).
I have managed to write code that selects the data I need for one country. I tried writing code to calculate the summation but couldn't figure it out, so I decided to just select the data and then add up the values myself with a calculator. I know that sort of defeats the purpose of programming, but at my beginner level it feels like the only way I can do it. My issue is that I have five countries, and I don't want to repeat the same block of code for each one. This is my code for one country:
def read_MaltaData():
    conn = sqlite3.connect('FinalProjectDatabase.sqlite3')
    Malta = conn.cursor()
    Malta.execute("SELECT * FROM MaltaData WHERE AirPollutant = 'PM10'")
    result = Malta.fetchall()
    print(result)
my result is this:
[('Malta', 'Valletta', 'MT00005', 'Msida', 'PM10', 64.3, 'ug/m3', 'Traffic', 'urban', 14.489985999999998, 35.895835999489535, 2.0), ('Malta', None, etc.
(I am going to manually calculate the data I require -in this case 64.3 + the value from the next row- as I don't know how to do it in python)
To clarify, my aim isn't to have a sum total across all the tables as one whole value (i.e. I don't want to add the values of all the countries all together). My desired output should look something like this:
Malta summation value
italy summation value
france summation value
and not like this
countries all together = one whole value (i.e. all summation values added together)
I would greatly appreciate any help I can get. Unfortunately I am not able to share the database with you, which is why I am sharing screenshots of it instead.
image of all 5 different tables in one database:
image of one table (all tables look the same, just with different values)

You can use UNION ALL to get a row for each country:
SELECT 'France' country, SUM(AirPollutionLevel) [summation value] FROM FranceData WHERE AirPollutant = 'PM10'
UNION ALL
SELECT 'Germany' country, SUM(AirPollutionLevel) [summation value] FROM GermanyData WHERE AirPollutant = 'PM10'
UNION ALL
SELECT 'Italy' country, SUM(AirPollutionLevel) [summation value] FROM ItalyData WHERE AirPollutant = 'PM10'
UNION ALL
SELECT 'Malta' country, SUM(AirPollutionLevel) [summation value] FROM MaltaData WHERE AirPollutant = 'PM10'
UNION ALL
SELECT 'Poland' country, SUM(AirPollutionLevel) [summation value] FROM PolandData WHERE AirPollutant = 'PM10'
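To run a UNION ALL query like this from Python without typing every branch by hand, the statement can be assembled from a list of country names. This is only a minimal sketch: it assumes the tables are named `<Country>Data` and that the column is spelled `AirPollutionLevel`, and the in-memory tables and values are made up for illustration. `build_union_query` is a hypothetical helper name.

```python
import sqlite3

# Assumed table naming scheme: one "<Country>Data" table per country.
COUNTRIES = ("France", "Germany", "Italy", "Malta", "Poland")


def build_union_query(countries):
    """Build one SELECT per country and glue them together with UNION ALL."""
    parts = [
        f"SELECT '{c}' AS country, SUM(AirPollutionLevel) AS total "
        f"FROM {c}Data WHERE AirPollutant = ?"
        for c in countries
    ]
    return "\nUNION ALL\n".join(parts)


# Made-up demo data in an in-memory database (not the asker's real data).
conn = sqlite3.connect(":memory:")
for c in COUNTRIES:
    conn.execute(f"CREATE TABLE {c}Data (AirPollutant TEXT, AirPollutionLevel REAL)")
    conn.executemany(
        f"INSERT INTO {c}Data VALUES (?, ?)",
        [("PM10", 1.5), ("PM10", 2.5), ("NO2", 100.0)],
    )

# The pollutant is passed once per SELECT as a bound parameter.
query = build_union_query(COUNTRIES)
rows = conn.execute(query, ["PM10"] * len(COUNTRIES)).fetchall()
for country, total in rows:
    print(country, total)
conn.close()
```

Each row of the result is a (country, total) pair, so the per-country sums stay separate, which is what the question asks for.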

If you pass the country name as argument to the data retrieval function, you can generate the table names dynamically (note the f-string arguments in execute and print):
First draft
def print_CountryData(country):
    conn = sqlite3.connect('FinalProjectDatabase.sqlite3')
    cur = conn.cursor()
    cur.execute(f"SELECT SUM(AirPollutionLevel) FROM {country}Data WHERE AirPollutant = 'PM10'")
    sumVal = cur.fetchone()[0]
    print(f"{country} {sumVal}")

# example call:
for country in ('France', 'Germany', 'Italy', 'Malta', 'Poland'):
    print_CountryData(country)
While building query strings on your own with simple string functions is discouraged in the sqlite3 documentation for security reasons, in your case, where you have full control of the actual arguments, I'd consider it safe.
This answer adapts the summation from the great answer given by forpas but refuses to move the repetition to SQL. It also shows both integration with python and output formatting.
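If the country name ever comes from somewhere less trusted, one way to keep the dynamic table name safe is to check it against a fixed set before interpolating it, and to pass the pollutant as a bound parameter. The following is a sketch under those assumptions; `KNOWN_COUNTRIES` and `sum_pollution` are hypothetical names, and the demo table holds made-up values.

```python
import sqlite3

# Only these names may ever be interpolated into a table name.
KNOWN_COUNTRIES = frozenset({"France", "Germany", "Italy", "Malta", "Poland"})


def sum_pollution(conn, country, pollutant="PM10"):
    """Return the summed AirPollutionLevel for one country's table."""
    if country not in KNOWN_COUNTRIES:
        raise ValueError(f"unknown country: {country!r}")
    # Table names cannot be bound parameters, so the checked name is
    # interpolated; the pollutant value is passed as a "?" parameter.
    sql = f"SELECT SUM(AirPollutionLevel) FROM {country}Data WHERE AirPollutant = ?"
    return conn.execute(sql, (pollutant,)).fetchone()[0]


# Made-up demo table, only so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MaltaData (AirPollutant TEXT, AirPollutionLevel REAL)")
conn.executemany(
    "INSERT INTO MaltaData VALUES (?, ?)", [("PM10", 64.25), ("PM10", 10.25)]
)
malta_total = sum_pollution(conn, "Malta")
print("Malta", malta_total)
```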
MRE-style version
This is an improved version of my first answer, transformed into a Minimal, Reproducible Example and combined with output. Also, some performance improvements were made, for instance opening the database only once.
import sqlite3
import random  # to simulate actual pollution values

# Countries we have data for
countries = ('France', 'Germany', 'Italy', 'Malta', 'Poland')

# There is one table for each country
def tableName(country):
    return country+'Data'

# Generate minimal version of tables filled with random data
def setup_CountryData(cur):
    for country in countries:
        cur.execute(f'''CREATE TABLE {tableName(country)}
                        (AirPollutant text, AirPollutionLevel real)''')
        for i in range(5):
            cur.execute(f"""INSERT INTO {tableName(country)} VALUES
                            ('PM10', {100*random.random()})""")

# Sum up pollution data for each country
def print_CountryData(cur):
    for country in countries:
        cur.execute(f"""SELECT SUM(AirPollutionLevel) FROM
                        {tableName(country)} WHERE AirPollutant = 'PM10'""")
        sumVal = cur.fetchone()[0]
        print(f"{country:10} {sumVal:9.5f}")

# For testing, we use an in-memory database
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
setup_CountryData(cur)

# The functionality actually required
print_CountryData(cur)
Sample output:
France     263.79430
Germany    245.20942
Italy      225.72068
Malta      167.72690
Poland     290.64190
It's often hard to evaluate a solution without actually trying it. That's the reason why questioners on StackOverflow are constantly encouraged to ask in this style: it makes it much more likely someone will understand and fix the problem ... quickly

If the database is not too big you could use pandas.
This approach is less efficient than using SQL queries directly but can be used if you want to explore the data interactively in a notebook for example.
You can create a dataframe from your SQLite db using pandas.read_sql_query
and then perform your calculation using pandas.DataFrame methods, which are designed for this type of task.
For your specific case:
import sqlite3
import pandas as pd
conn = sqlite3.connect(db_file)
query = "SELECT * FROM MaltaData WHERE AirPollutant = 'PM10'"
df = pd.read_sql_query(query, conn)
# check dataframe content
print(df.head())
If I understood correctly and you want to compute the sum of the values in a given column:
s = df['AirPollutionLevel'].sum()
Note that sum() skips missing values (NaN) by default, so this step is optional, but you can fill them with 0s before summing to make that explicit:
s = df['AirPollutionLevel'].fillna(0).sum()

Related

How to use variable column name in filter in Django ORM?

I have two tables BloodBank(id, name, phone, address) and BloodStock(id, a_pos, b_pos, a_neg, b_neg, bloodbank_id). I want to fetch all the columns from two tables where the variable column name (say bloodgroup) which have values like a_pos or a_neg... like that and their value should be greater than 0. How can I write ORM for the same?
SQL query is written like this to get the required results.
sql="select * from public.bloodbank_bloodbank as bb, public.bloodbank_bloodstock as bs where bs."+blood+">0 and bb.id=bs.bloodbank_id order by bs."+blood+" desc;"
cursor = connection.cursor()
cursor.execute(sql)
bloodbanks = cursor.fetchall()
You could be more specific in your questions, but I believe you have a variable called blood which contains the string name of the column and that the columns a_pos, b_pos, etc. are numeric.
You can use a dictionary to create keyword arguments from strings:
filter_dict = {'bloodstock__' + blood + '__gt': 0}
bloodbanks = Bloodbank.objects.filter(**filter_dict)
This will get you Bloodbank objects that have a related bloodstock with a greater than zero value in the bloodgroup represented by the blood variable.
Note that the way I have written this, you don't get the bloodstock columns selected, and you may get duplicate bloodbanks. If you want to eliminate duplicate bloodbanks, you can add .distinct() to your query. The bloodstocks are available for each bloodbank instance using .bloodstock_set.all().
The ORM will generate SQL using a join. Alternatively, you can do an EXISTS in the where clause and no join.
from django.db.models import Exists, OuterRef
filter_dict = {blood + '__gt': 0}
exists = Exists(Bloodstock.objects.filter(
    bloodbank_id=OuterRef('id'),
    **filter_dict
))
bloodbanks = Bloodbank.objects.filter(exists)
There will be no need for a .distinct() in this case.
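Because the column name arrives as a string, it may be worth validating it before turning it into a lookup key. Here is a small plain-Python sketch of the dictionary-unpacking idea used above. It does not need Django to run: `show_kwargs` is a hypothetical stand-in for `QuerySet.filter`, and `BLOOD_COLUMNS` is an assumed whitelist based on the columns named in the question.

```python
# Hypothetical whitelist of the blood-group columns named in the question.
BLOOD_COLUMNS = {"a_pos", "a_neg", "b_pos", "b_neg"}


def build_stock_filter(blood, minimum=0):
    """Return keyword arguments for a Django-style .filter(**kwargs) call."""
    if blood not in BLOOD_COLUMNS:
        raise ValueError(f"unknown blood group column: {blood!r}")
    # "bloodstock" is the assumed reverse relation name, "__gt" the lookup.
    return {f"bloodstock__{blood}__gt": minimum}


def show_kwargs(**kwargs):
    """Stand-in for QuerySet.filter that just returns what it received."""
    return kwargs


received = show_kwargs(**build_stock_filter("a_pos"))
print(received)
```

In a real project the same dictionary would be passed as `Bloodbank.objects.filter(**build_stock_filter(blood))`.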

Pull column names along with data in Teradata Python module

I am running the below snippet in python:
with udaExec.connect(method="ODBC", system=<server IP>,username=<user>,password=<pwd>) as session:
    for row in session.execute("""sel top 3 * from retail.employee"""):
        print(row)
The above query is returning data without the column names. How do I pull column names along with data from the employee table while using teradata python module in python3.x ?
I will use pandas and teradata to get full control of data.
import teradata
import pandas as pd
with udaExec.connect(method="ODBC", system=<server IP>,username=<user>,password=<pwd>) as session:
    query = '''sel top 3 * from retail.employee'''
    df = pd.read_sql(query,session)
    print(df.columns.tolist())  # columns
    print(df.head(2))  # beautiful first 2 rows
I've found pandas pretty pretty thick, but useful at times.
But I see the column names are in the cursor description: https://pypi.org/project/teradatasql/#CursorAttributes
The index isn't working for me in pypi for this page, so you'll probably have to scroll down, but you should find the following:
.description
Read-only attribute consisting of a sequence of seven-item sequences that each describe a result set column, available after a SQL request is executed.
.description[Column][0] provides the column name.
.description[Column][1] provides the column type code as an object comparable to one of the Type Objects listed below.
.description[Column][2] provides the column display size in characters. Not implemented yet.
.description[Column][3] provides the column size in bytes.
.description[Column][4] provides the column precision if applicable, or None otherwise.
.description[Column][5] provides the column scale if applicable, or None otherwise.
.description[Column][6] provides the column nullability as True or False.
If you want to replicate pandas to_dict, you can do the following:
import teradatasql

# conn is assumed to be a dict of connection keyword arguments
with teradatasql.connect(**conn) as con:
    with con.cursor() as cur:
        cur.execute("sel top 3 * from retail.employee;")
        rows = cur.fetchall()
        columns = [d[0] for d in cur.description]
        # one dict per row, mapping column name to value
        list_of_dict = [dict(zip(columns, row)) for row in rows]
Result:
[
{
"Name":"John Doe",
"SomeOtherEmployeeColumn":"arbitrary data"
}
]
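The `.description` attribute is part of the general Python DB-API cursor interface, so the same pattern can be tried out without a Teradata server. The sketch below uses the standard-library `sqlite3` module as a stand-in for `teradatasql`; the `employee` table, its columns, and its rows are made up for illustration.

```python
import sqlite3

# sqlite3 stands in for teradatasql here; both expose cursor.description.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical employee table with made-up columns and rows.
cur.execute("CREATE TABLE employee (name TEXT, dept TEXT)")
cur.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("John Doe", "Sales"), ("Jane Roe", "IT")],
)

cur.execute("SELECT name, dept FROM employee ORDER BY name")
# Item 0 of each description entry is the column name.
columns = [d[0] for d in cur.description]
# Pair every row with the column names to get one dict per row.
records = [dict(zip(columns, row)) for row in cur.fetchall()]
print(columns)
print(records)
conn.close()
```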
Have you tried:
with udaExec.connect(method="ODBC", system=<server IP>,username=<user>,password=<pwd>) as session:
    for row in session.execute("""sel top 3 * from retail.employee"""):
        print(row.name + ": " + row.val)

List/Dict structure issue

I'm confused about how to structure a list/dict I need. I have scraped three pieces of info off ESPN: Conference, Team, and link to team homepage for future stat scraping.
When the program first runs, I'd like to build a dictionary/list so that one can type in a school and it would print the conference the school is in OR one could select an entire conference and it would print the corresponding list of schools. The link associated with each school isn't something the end user needs to know about, but it is important that the correct link is associated with the correct school so that future stats from that specific school can be scraped.
For example the info scraped is:
SEC, UGA, www.linka.com
ACC, FSU, www.linkb.com
etc...
I know I could create a list of dictionaries like:
sec_list=[{UGA: www.linka.com, Alabama: www.linkc.com, etc...}]
acc_list=[{FSU: www.linkb.com, etc...}]
The problem is I'd have to create about 26 lists here to hold every conference, which sounds excessive. Is there a way to lump everything into one list but still have the ability to extract schools from a specific conference, or search for a school and have the correct conference returned as well? Of course, the link to the school must also correspond to the correct school.
Python ships with sqlite3 to handle database problems and it has an :memory: mode for in-memory databases. I think it will solve your problem directly and with clear code.
import sqlite3
from pprint import pprint
# Load the data from a list of tuples in the form [(conf, school, link), ...]
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE Espn (conf text, school text, link text)')
c.execute('CREATE INDEX Cndx ON Espn (conf)')
c.execute('CREATE INDEX Sndx ON Espn (school)')
c.executemany('INSERT INTO Espn VALUES (?, ?, ?)', data)
conn.commit()
# Run queries
pprint(c.execute("SELECT * FROM Espn WHERE conf = 'Big10'").fetchall())
pprint(c.execute("SELECT * FROM Espn WHERE school = 'Alabama'").fetchall())
In-memory databases are so easy to create and query that they are often the easiest solution to the problem of how to have multiple lookup keys and do analytics on relational data. Trying to use dicts and lists for this kind of work just makes the problem unnecessarily complicated.
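To cover the two lookups from the question (school to conference, conference to schools), the queries can be wrapped in small helper functions that take the search term as a bound parameter. This is a sketch with made-up sample rows; `conference_of` and `schools_in` are hypothetical helper names.

```python
import sqlite3

# Made-up sample rows in the (conf, school, link) shape from the question.
data = [
    ("SEC", "UGA", "www.linka.com"),
    ("ACC", "FSU", "www.linkb.com"),
    ("SEC", "Alabama", "www.linkc.com"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Espn (conf TEXT, school TEXT, link TEXT)")
conn.executemany("INSERT INTO Espn VALUES (?, ?, ?)", data)


def conference_of(school):
    """Return the conference for a school, or None if the school is unknown."""
    row = conn.execute(
        "SELECT conf FROM Espn WHERE school = ?", (school,)
    ).fetchone()
    return row[0] if row else None


def schools_in(conf):
    """Return (school, link) pairs for every school in a conference."""
    return conn.execute(
        "SELECT school, link FROM Espn WHERE conf = ? ORDER BY school", (conf,)
    ).fetchall()


print(conference_of("FSU"))
print(schools_in("SEC"))
```

Because each row keeps the link next to the school, the link stays tied to the right school no matter which lookup is used.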
It's true you can do this with a list of dictionaries, but you might find it easier to be able to look up information with named fields. In that case, I'd recommend storing your scraped data in a Pandas DataFrame.
You want it so that "one can type in a school and it would print the conference the school is in OR one could select an entire conference and it would print the corresponding list of schools".
Here's an example of what that would look like, using Pandas and a couple of convenience functions.
First, some example data:
confs = ['ACC','Big10','BigEast','BigSouth','SEC',
'ACC','Big10','BigEast','BigSouth','SEC']
teams = ['school{}'.format(x) for x in range(10)]
links = ['www.{}.com'.format(x) for x in range(10)]
scrape = list(zip(confs,teams,links))
[('ACC', 'school0', 'www.0.com'),
('Big10', 'school1', 'www.1.com'),
('BigEast', 'school2', 'www.2.com'),
('BigSouth', 'school3', 'www.3.com'),
('SEC', 'school4', 'www.4.com'),
('ACC', 'school5', 'www.5.com'),
('Big10', 'school6', 'www.6.com'),
('BigEast', 'school7', 'www.7.com'),
('BigSouth', 'school8', 'www.8.com'),
('SEC', 'school9', 'www.9.com')]
Now convert to DataFrame:
import pandas as pd
df = pd.DataFrame.from_records(scrape, columns=['conf','school','link'])
       conf   school       link
0       ACC  school0  www.0.com
1     Big10  school1  www.1.com
2   BigEast  school2  www.2.com
3  BigSouth  school3  www.3.com
4       SEC  school4  www.4.com
5       ACC  school5  www.5.com
6     Big10  school6  www.6.com
7   BigEast  school7  www.7.com
8  BigSouth  school8  www.8.com
9       SEC  school9  www.9.com
Type in school, get conference:
def get_conf(df, school):
    return df.loc[df.school==school, 'conf'].values
get_conf(df, school = 'school1')
['Big10']
Type in conference, get schools:
def get_schools(df, conf):
    return df.loc[df.conf==conf, 'school'].values
get_schools(df, conf = 'Big10')
['school1' 'school6']
It's unclear from your question whether you also want the links associated with schools returned when searching by conference. If so, just update get_schools() to:
def get_schools(df, conf):
    return df.loc[df.conf==conf, ['school','link']].values

Dynamically add filter to SQLAlchemy TextClause

Assume I have a SQLAlchemy table which looks like:
class Country:
    name = VARCHAR
    population = INTEGER
    continent = VARCHAR
    num_states = INTEGER
My application allows seeing name and population for all Countries. So I have a TextClause which looks like
"select name, population from Country"
I allow raw queries in my application, so I don't have the option to change this to a selectable.
At runtime, I want to allow my users to choose a field name and put a field value on which I want to allow filtering. eg: User can say I only want to see name and population for countries where Continent is Asia. So I dynamically want to add the filter
.where(Country.c.continent == 'Asia')
But I can't add .where to a TextClause.
Similarly, my user may choose to see name and population for countries where num_states is greater than 10. So I dynamically want to add the filter
.where(Country.c.num_states > 10)
But again I can't add .where to a TextClause.
What are the options I have to solve this problem?
Could subquery help here in any way?
Please add a filter based on the conditions. filter is used for adding where conditions in sqlalchemy.
Country.query.filter(Country.num_states > 10).all()
You can also do this:
query = Country.query.filter(Country.continent == 'Asia')
if user_input == 'states':
    query = query.filter(Country.num_states > 10)
query = query.all()
This is not doable in a general sense without parsing the query. In relational algebra terms, the user applies projection and selection operations to a table, and you want to apply selection operations to it. Since the user can apply arbitrary projections (e.g. user supplies SELECT id FROM table), you are not guaranteed to be able to always apply your filters on top, so you have to apply your filters before the user does. That means you need to rewrite it to SELECT id FROM (some subquery), which requires parsing the user's query.
However, we can sort of cheat depending on the database that you are using, by having the database engine do the parsing for you. The way to do this is with CTEs, by basically shadowing the table name with a CTE.
Using your example, it looks like the following. User supplies query
SELECT name, population FROM country;
You shadow country with a CTE:
WITH country AS (
SELECT * FROM country
WHERE continent = 'Asia'
) SELECT name, population FROM country;
Unfortunately, because of the way SQLAlchemy's CTE support works, it is tough to get it to generate a CTE for a TextClause. The solution is to basically generate the string yourself, using a custom compilation extension, something like this:
from sqlalchemy import select, text
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.sql.expression import ClauseElement, Executable

class WrappedQuery(Executable, ClauseElement):
    def __init__(self, name, outer, inner):
        self.name = name
        self.outer = outer
        self.inner = inner

# Register the function below as the compiler for WrappedQuery
@compiles(WrappedQuery)
def compile_wrapped_query(element, compiler, **kwargs):
    return "WITH {} AS ({}) {}".format(
        element.name,
        compiler.process(element.outer),
        compiler.process(element.inner))

c = Country.__table__
cte = select(["*"]).select_from(c).where(c.c.continent == "Asia")
query = WrappedQuery("country", cte, text("SELECT name, population FROM country"))
session.execute(query)
From my tests, this only works in PostgreSQL. SQLite and SQL Server both treat it as recursive instead of shadowing, and MySQL does not support CTEs.
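To see the shape of the statement such a wrapper produces, here is a plain string-formatting sketch that needs neither SQLAlchemy nor a database. `wrap_with_cte` is a hypothetical helper, the filter is hard-coded for illustration, and the resulting SQL is only printed, not executed. As noted above, the shadowing reportedly behaves this way only in PostgreSQL.

```python
def wrap_with_cte(table_name, filtered_select, user_query):
    """Prefix user_query with a CTE that shadows table_name."""
    # Strip a trailing semicolon so the pieces can be joined safely.
    user_query = user_query.strip().rstrip(";")
    return f"WITH {table_name} AS ({filtered_select}) {user_query}"


sql = wrap_with_cte(
    "country",
    "SELECT * FROM country WHERE continent = 'Asia'",
    "SELECT name, population FROM country;",
)
print(sql)
```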
I couldn't find anything nice for this in the documentation. I ended up resorting to pretty much just string processing... but at least it works!
from sqlalchemy.sql import text

query = "select name, population from Country"
params = {}
if continent is not None:
    # The leading space keeps the clause apart from the table name;
    # the bound parameter :continent avoids quoting and injection problems
    query += " WHERE continent = :continent"
    params["continent"] = continent
text_clause = text(query)
with sql_connection() as conn:
    results = conn.execute(text_clause, params)
You could also chain this logic with more clauses, although you'll have to create a boolean flag for the first WHERE clause and then use AND for the subsequent ones.
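One way to avoid the boolean flag is to collect the conditions in a list and join them with AND at the end. The sketch below shows that idea with the standard-library `sqlite3` module and `?` placeholders rather than SQLAlchemy, so it runs on its own. `ALLOWED_FILTERS` and `build_query` are hypothetical names, the table contents are made up, and it assumes the base query has no WHERE clause of its own.

```python
import sqlite3

# Hypothetical whitelist: column name -> comparison operators allowed for it.
ALLOWED_FILTERS = {"continent": {"="}, "num_states": {"=", ">", "<"}}


def build_query(base, filters):
    """Append conditions to base; filters holds (column, op, value) triples."""
    clauses, params = [], []
    for column, op, value in filters:
        if op not in ALLOWED_FILTERS.get(column, ()):
            raise ValueError(f"filter not allowed: {column} {op}")
        clauses.append(f"{column} {op} ?")
        params.append(value)
    if clauses:
        # Joining with AND removes the need for a "first clause" flag.
        base += " WHERE " + " AND ".join(clauses)
    return base, params


# Made-up demo data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Country (name TEXT, population INTEGER, "
    "continent TEXT, num_states INTEGER)"
)
conn.executemany(
    "INSERT INTO Country VALUES (?, ?, ?, ?)",
    [
        ("India", 1400, "Asia", 28),
        ("Nepal", 30, "Asia", 7),
        ("Brazil", 210, "South America", 26),
    ],
)

sql, params = build_query(
    "SELECT name, population FROM Country",
    [("continent", "=", "Asia"), ("num_states", ">", 10)],
)
rows = conn.execute(sql, params).fetchall()
print(sql, params, rows)
```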

Summing database column in python

I have recently encountered the problem of adding the elements of a database column. Here is the following code:
import sqlite3
con = sqlite3.connect("values.db")
cur = con.cursor()
cur.execute('SELECT objects FROM data WHERE firm = "sony"')
As you can see, I connect to the database (sql) and tell Python to select the column "objects".
The problem is that I do not know the appropriate command for summing the selected objects.
Any ideas or advice would be highly appreciated.
Thank you in advance!!
If you can, have the database do the sum, as that reduces data transfer and lets the database do what it's good at.
cur.execute("SELECT sum(objects) FROM data WHERE firm = 'sony'")
or, if you're really just looking for the total count of objects.
cur.execute("SELECT count(objects) FROM data WHERE firm = 'sony'")
either way, your result is simply:
count = cur.fetchall()[0][0]
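For reference, here is a small self-contained sketch of both aggregates on made-up rows. It also shows one case worth knowing about: when no row matches, SUM returns NULL (None in Python), which COALESCE can turn into 0. The table contents are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Made-up rows standing in for the asker's "data" table.
cur.execute("CREATE TABLE data (firm TEXT, objects INTEGER)")
cur.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [("sony", 3), ("sony", 4), ("acme", 10)],
)

# SUM adds the values; COUNT counts the non-NULL rows.
cur.execute(
    "SELECT SUM(objects), COUNT(objects) FROM data WHERE firm = ?", ("sony",)
)
total, n_rows = cur.fetchone()

# With no matching rows SUM yields NULL; COALESCE replaces it with 0.
cur.execute(
    "SELECT SUM(objects), COALESCE(SUM(objects), 0) FROM data WHERE firm = ?",
    ("nobody",),
)
raw_sum, safe_sum = cur.fetchone()
print(total, n_rows, raw_sum, safe_sum)
```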
Try the following line:
print(sum(row[0] for row in cur.fetchall()))
If you want the items instead of adding them together:
print([row[0] for row in cur.fetchall()])
