I'm confused about how to structure a list/dict I need. I have scraped three pieces of info off ESPN: Conference, Team, and a link to the team homepage for future stat scraping.
When the program first runs, I'd like to build a dictionary/list so that one can type in a school and it would print the conference the school is in, OR one could select an entire conference and it would print the corresponding list of schools. The end user doesn't need to know about the link associated with each school, but it is important that the correct link stays associated with the correct school so that future stats for that specific school can be scraped.
For example, the info scraped is:
SEC, UGA, www.linka.com
ACC, FSU, www.linkb.com
etc...
I know I could create a list of dictionaries like:
sec_list = [{'UGA': 'www.linka.com', 'Alabama': 'www.linkc.com', ...}]
acc_list = [{'FSU': 'www.linkb.com', ...}]
The problem is I'd have to create about 26 lists here to hold every conference, which sounds excessive. Is there a way to lump everything into one structure but still have the ability to extract schools from a specific conference, or search for a school and have the correct conference returned? Of course, the link must still correspond to the correct school.
Python ships with sqlite3 to handle database problems, and it has a :memory: mode for in-memory databases. I think it will solve your problem directly and with clear code.
import sqlite3
from pprint import pprint
# Load the data from a list of tuples in the form [(conf, school, link), ...]
data = [('SEC', 'UGA', 'www.linka.com'),
        ('ACC', 'FSU', 'www.linkb.com')]
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE Espn (conf text, school text, link text)')
c.execute('CREATE INDEX Cndx ON Espn (conf)')
c.execute('CREATE INDEX Sndx ON Espn (school)')
c.executemany('INSERT INTO Espn VALUES (?, ?, ?)', data)
conn.commit()
# Run queries
pprint(c.execute('SELECT * FROM Espn WHERE conf = ?', ('Big10',)).fetchall())
pprint(c.execute('SELECT * FROM Espn WHERE school = ?', ('Alabama',)).fetchall())
In-memory databases are so easy to create and query that they are often the simplest solution to the problem of having multiple lookup keys and doing analytics on relational data. Trying to use dicts and lists for this kind of work just makes the problem unnecessarily complicated.
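For completeness, here is a minimal self-contained sketch of how lookups against that table could work; the helper names (get_conference, get_schools, get_link) and the two sample rows are my own additions, not from the question:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE Espn (conf text, school text, link text)')
c.executemany('INSERT INTO Espn VALUES (?, ?, ?)',
              [('SEC', 'UGA', 'www.linka.com'),
               ('ACC', 'FSU', 'www.linkb.com')])

def get_conference(school):
    # Parameterized query: safe even if `school` comes straight from user input
    row = c.execute('SELECT conf FROM Espn WHERE school = ?', (school,)).fetchone()
    return row[0] if row else None

def get_schools(conf):
    return [s for (s,) in c.execute('SELECT school FROM Espn WHERE conf = ?', (conf,))]

def get_link(school):
    # The end user never sees this, but it keeps each link tied to its school
    row = c.execute('SELECT link FROM Espn WHERE school = ?', (school,)).fetchone()
    return row[0] if row else None
```

get_conference('UGA') returns 'SEC', and get_link('FSU') hands back the URL needed for the later stat scrape.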
It's true you can do this with a list of dictionaries, but you might find it easier to look up information with named fields. In that case, I'd recommend storing your scraped data in a Pandas DataFrame.
You want it so that "one can type in a school and it would print the conference the school is in OR one could select an entire conference and it would print the corresponding list of schools".
Here's an example of what that would look like, using Pandas and a couple of convenience functions.
First, some example data:
confs = ['ACC', 'Big10', 'BigEast', 'BigSouth', 'SEC',
         'ACC', 'Big10', 'BigEast', 'BigSouth', 'SEC']
teams = ['school{}'.format(x) for x in range(10)]
links = ['www.{}.com'.format(x) for x in range(10)]
scrape = list(zip(confs, teams, links))
[('ACC', 'school0', 'www.0.com'),
('Big10', 'school1', 'www.1.com'),
('BigEast', 'school2', 'www.2.com'),
('BigSouth', 'school3', 'www.3.com'),
('SEC', 'school4', 'www.4.com'),
('ACC', 'school5', 'www.5.com'),
('Big10', 'school6', 'www.6.com'),
('BigEast', 'school7', 'www.7.com'),
('BigSouth', 'school8', 'www.8.com'),
('SEC', 'school9', 'www.9.com')]
Now convert to DataFrame:
import pandas as pd
df = pd.DataFrame.from_records(scrape, columns=['conf','school','link'])
conf school link
0 ACC school0 www.0.com
1 Big10 school1 www.1.com
2 BigEast school2 www.2.com
3 BigSouth school3 www.3.com
4 SEC school4 www.4.com
5 ACC school5 www.5.com
6 Big10 school6 www.6.com
7 BigEast school7 www.7.com
8 BigSouth school8 www.8.com
9 SEC school9 www.9.com
Type in school, get conference:
def get_conf(df, school):
    return df.loc[df.school==school, 'conf'].values
get_conf(df, school = 'school1')
['Big10']
Type in conference, get schools:
def get_schools(df, conf):
    return df.loc[df.conf==conf, 'school'].values
get_schools(df, conf = 'Big10')
['school1' 'school6']
It's unclear from your question whether you also want the links associated with schools returned when searching by conference. If so, just update get_schools() to:
def get_schools(df, conf):
    return df.loc[df.conf==conf, ['school','link']].values
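If you want the whole mapping at once rather than one lookup at a time, a groupby does it in one line. This is a sketch against a small hand-made df in the same shape as the example above, not the questioner's real data:

```python
import pandas as pd

scrape = [('ACC', 'school0', 'www.0.com'),
          ('SEC', 'school4', 'www.4.com'),
          ('ACC', 'school5', 'www.5.com')]
df = pd.DataFrame.from_records(scrape, columns=['conf', 'school', 'link'])

# conference -> list of schools
by_conf = df.groupby('conf')['school'].apply(list).to_dict()

# school -> {'conf': ..., 'link': ...}, keeping each link tied to its school
by_school = df.set_index('school')[['conf', 'link']].to_dict('index')
```

by_conf answers the "select a conference, get its schools" case, and by_school answers the "type a school, get its conference (and link)" case, from the same frame.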
I have a table full of movie genres, like this:
id | genre
---+----------------------------
1 | Drama, Romance, War
2 | Drama, Musical, Romance
3 | Adventure, Biography, Drama
I'm looking for a way to get the most common word in the whole genre column and return it to a variable for further steps in Python.
I'm new to Python so I really don't know how to do it. Currently, I have these lines to connect to the database but don't know how to get the most common word mentioned above.
conn = mysql.connect()
cursor = conn.cursor()
most_common_word = cursor.execute()
cursor.close()
conn.close()
First you need to get the list of words in each row, i.e. create another table like:
genre_words(genre_id bigint, word varchar(50))
For clues how to do that you may check this question:
SQL split values to multiple rows
You can do that in a temporary table if you wish, or use a transaction and roll back; which to choose depends on your data size and the machine the DB runs on.
After that the query is really simple:
select count(*) as c, word from genre_words group by word order by c desc limit 1;
You can also do it in Python, but then it's not really a MySQL question at all: read the table and build a simple word-to-count mapping, adding a word when it is new and incrementing its counter when it already exists.
from collections import Counter
# Connect to database and get rows from table
rows = ...
# Create a list to hold all of the genres
genres = []
# Loop through each row and split the genre string by the comma character
# to create a list of individual genres
for row in rows:
    genre_list = [g.strip() for g in row['genre'].split(',')]
    genres.extend(genre_list)
# Use a Counter to count the number of occurrences of each genre
genre_counts = Counter(genres)
# Get the most common genre
most_common_genre = genre_counts.most_common(1)
# Print the most common genre
print(most_common_genre)
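To make the snippet above concrete, here is a self-contained version using an in-memory SQLite table in place of MySQL (purely for illustration), seeded with the sample rows from the question:

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE movies (id integer, genre text)')
conn.executemany('INSERT INTO movies VALUES (?, ?)',
                 [(1, 'Drama, Romance, War'),
                  (2, 'Drama, Musical, Romance'),
                  (3, 'Adventure, Biography, Drama')])

genres = []
for (genre,) in conn.execute('SELECT genre FROM movies'):
    # Split on commas and strip the padding spaces around each word
    genres.extend(g.strip() for g in genre.split(','))

most_common_word, count = Counter(genres).most_common(1)[0]
print(most_common_word, count)  # Drama 3
```

Note the strip(): without it, ' Romance' and 'Romance' would be counted as two different words.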
Hi guys :) I'm a newbie at programming and would like to ask for help in creating a function to reduce redundancy in my code. I have successfully created a database holding 5 tables of data for different countries. All tables have the same structure (see attached screenshots for reference). My objective is to calculate the sum of all rows within each table for a particular parameter (type of pollution).
I have managed to write code that selects the data I need for one country. (I tried writing code to calculate the sum but couldn't figure it out, so I decided to just select the data and then calculate the values myself with a calculator. I know that sort of defeats the purpose of programming, but at my beginner level it feels like the only way.) My issue is that I have five countries, and I don't want to repeat the same block of code for each one. This is my code for one country:
import sqlite3

def read_MaltaData():
    conn = sqlite3.connect('FinalProjectDatabase.sqlite3')
    Malta = conn.cursor()
    Malta.execute("SELECT * FROM MaltaData WHERE AirPollutant = 'PM10'")
    result = Malta.fetchall()
    print(result)
my result is this:
[('Malta', 'Valletta', 'MT00005', 'Msida', 'PM10', 64.3, 'ug/m3', 'Traffic', 'urban', 14.489985999999998, 35.895835999489535, 2.0), ('Malta', None, etc.
(I am going to manually calculate the data I require, in this case 64.3 plus the value from the next row, as I don't know how to do it in Python.)
To clarify, my aim isn't to have a sum total across all the tables as one whole value (i.e. I don't want to add the values of all the countries all together). My desired output should look something like this:
Malta summation value
Italy summation value
France summation value
and not like this
countries all together = one whole value (i.e. all summation values added together)
I would greatly appreciate any help I can get. Unfortunately I am not able to share the database with you, which is why I am sharing screenshots of it instead.
image of all 5 different tables in one database:
image of one table (all tables look the same, just with different values)
You can use UNION ALL to get a row for each country:
SELECT 'France' country, SUM(AirPollutionLevel) [summation value] FROM FranceData WHERE AirPollutant = 'PM10'
UNION ALL
SELECT 'Germany' country, SUM(AirPollutionLevel) [summation value] FROM GermanyData WHERE AirPollutant = 'PM10'
UNION ALL
SELECT 'Italy' country, SUM(AirPollutionLevel) [summation value] FROM ItalyData WHERE AirPollutant = 'PM10'
UNION ALL
SELECT 'Malta' country, SUM(AirPollutionLevel) [summation value] FROM MaltaData WHERE AirPollutant = 'PM10'
UNION ALL
SELECT 'Poland' country, SUM(AirPollutionLevel) [summation value] FROM PolandData WHERE AirPollutant = 'PM10'
If you pass the country name as argument to the data retrieval function, you can generate the table names dynamically (note the f-string arguments in execute and print):
First draft
import sqlite3

def print_CountryData(country):
    conn = sqlite3.connect('FinalProjectDatabase.sqlite3')
    cur = conn.cursor()
    cur.execute(f"SELECT SUM(AirPollutionLevel) FROM {country}Data WHERE AirPollutant = 'PM10'")
    sumVal = cur.fetchone()[0]
    print(f"{country} {sumVal}")

# example call:
for country in ('France', 'Germany', 'Italy', 'Malta', 'Poland'):
    print_CountryData(country)
While building query strings with plain string functions is discouraged in the sqlite3 documentation for security reasons, in your case, where you have total control of the actual arguments, I'd consider it safe.
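If you want an extra safety net anyway, you can validate the country against a fixed allow-list before it ever reaches the f-string. KNOWN_COUNTRIES and table_name are illustrative names, not from the question:

```python
KNOWN_COUNTRIES = ('France', 'Germany', 'Italy', 'Malta', 'Poland')

def table_name(country):
    # Only the five hard-coded table prefixes are ever allowed through
    if country not in KNOWN_COUNTRIES:
        raise ValueError(f'unknown country: {country!r}')
    return f'{country}Data'
```

table_name('Malta') yields 'MaltaData', while anything unexpected raises instead of being interpolated into SQL.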
This answer adapts the summation from the great answer given by forpas but refuses to move the repetition to SQL. It also shows both integration with python and output formatting.
MRE-style version
This is an improved version of my first answer, transformed into a Minimal, Reproducible Example and combined with output. Also, some performance improvements were made, for instance opening the database only once.
import sqlite3
import random  # to simulate actual pollution values

# Countries we have data for
countries = ('France', 'Germany', 'Italy', 'Malta', 'Poland')

# There is one table for each country
def tableName(country):
    return country + 'Data'

# Generate minimal versions of the tables filled with random data
def setup_CountryData(cur):
    for country in countries:
        cur.execute(f'''CREATE TABLE {tableName(country)}
                        (AirPollutant text, AirPollutionLevel real)''')
        for i in range(5):
            cur.execute(f"""INSERT INTO {tableName(country)} VALUES
                            ('PM10', {100*random.random()})""")

# Sum up the pollution data for each country
def print_CountryData(cur):
    for country in countries:
        cur.execute(f"""SELECT SUM(AirPollutionLevel) FROM
                        {tableName(country)} WHERE AirPollutant = 'PM10'""")
        sumVal = cur.fetchone()[0]
        print(f"{country:10} {sumVal:9.5f}")

# For testing, we use an in-memory database
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
setup_CountryData(cur)

# The functionality actually required
print_CountryData(cur)
Sample output:
France 263.79430
Germany 245.20942
Italy 225.72068
Malta 167.72690
Poland 290.64190
It's often hard to evaluate a solution without actually trying it. That's the reason questioners on Stack Overflow are constantly encouraged to ask in this style: it makes it much more likely that someone will understand the problem and fix it quickly.
If the database is not too big you could use pandas.
This approach is less efficient than using SQL queries directly but can be used if you want to explore the data interactively in a notebook for example.
You can create a dataframe from your SQLite db using pandas.read_sql_query
and then perform your calculation using pandas.DataFrame methods, which are designed for this type of task.
For your specific case:
import sqlite3
import pandas as pd
conn = sqlite3.connect(db_file)
query = "SELECT * FROM MaltaData WHERE AirPollutant = 'PM10'"
df = pd.read_sql_query(query, conn)
# check dataframe content
print(df.head())
If I understood correctly, you then want to compute the sum of the values in a given column:
s = df['AirPollutionLevel'].sum()
If you have missing values you might want to fill them with 0s before summing:
s = df['AirPollutionLevel'].fillna(0).sum()
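If you go this route for all five countries, you can tag each frame with its country and let groupby produce the per-country sums in one go. The sketch below builds a tiny stand-in database so it is runnable; table and column names are assumed to match the question:

```python
import sqlite3
import pandas as pd

# Tiny stand-in database with two of the five country tables
conn = sqlite3.connect(':memory:')
for country, levels in [('Malta', [64.3, 35.7]), ('Italy', [10.0, 20.0])]:
    conn.execute(f'CREATE TABLE {country}Data (AirPollutant text, AirPollutionLevel real)')
    conn.executemany(f'INSERT INTO {country}Data VALUES (?, ?)',
                     [('PM10', level) for level in levels])

frames = []
for country in ('Malta', 'Italy'):
    df = pd.read_sql_query(f"SELECT * FROM {country}Data WHERE AirPollutant = 'PM10'", conn)
    df['country'] = country
    frames.append(df)

# One frame for everything, then one sum per country
all_data = pd.concat(frames, ignore_index=True)
sums = all_data.groupby('country')['AirPollutionLevel'].sum()
print(sums)
```

This gives exactly the "one summation value per country" output the question asks for, without ever adding the countries together.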
I have kind of a dual question which has kept me from proceeding for a while now. I have read lots of articles, checked Stack Overflow numerous times and read the mongoengine docs again, but I cannot find an answer that works for me. I am using MongoDB to store the data of a Flask web app. To query the DB, I am using mongoengine. Now suppose my user model looks like this:
Users
  name: Superman
  kudos:
    0: date, category A
    1: date, category B

  name: Superman
  kudos:
    0: date, category A
    1: date, category A
    2: date, category B
The kudos are nested documents that get created whenever a user receives a kudo. I store them as a db.ListField(date=now). This works perfectly fine.
In a relational DB I would have a separate kudo schema. In MongoDB I assumed it would be better to create nested documents within the Users collection; otherwise you are still creating all kinds of separate schemas with relations to each other.
So here are my two main questions:
Is my architecture true to how mongoengine should be used?
How can I get a list (dict actually) of kudos per category? I would like to query and get Category - Count.
Result should be:
kudos=[(category A, 3), (category B, 2)]
If I already had something even remotely working I would provide it, but I am completely stuck. That's why I even started doubting whether to store the kudos in a separate collection, but I feel like I would then be getting off track in correctly using a NoSQL DB.
Assuming you have the following schema and data:
import datetime as dt
from mongoengine import *
connect(host='mongodb://localhost:27017/testdb')
class Kudo(EmbeddedDocument):
    date = DateTimeField(default=dt.datetime.utcnow)
    category = StringField()

class User(Document):
    name = StringField(required=True)
    kudos = EmbeddedDocumentListField(Kudo)

superman = User(name='superman', kudos=[Kudo(category='A')]).save()
batman = User(name='batman', kudos=[Kudo(category='A'), Kudo(category='B')]).save()
This isn't the most efficient but you can get the distribution with the following simple snippet:
import itertools
from collections import Counter

raw_kudos = User.objects.scalar('kudos')  # a list of lists
categories_counter = Counter(k.category for k in itertools.chain.from_iterable(raw_kudos))
print(categories_counter)  # a Counter (i.e. a dict) --> Counter({'A': 2, 'B': 1})
And if you need higher performance, you'll need to use an aggregation pipeline
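For reference, the aggregation-pipeline version would unwind the embedded list and group by category. The stages below are a sketch of what you would pass to User.objects.aggregate(...), and the plain-Python reduction underneath shows what that pipeline computes on the sample data:

```python
from collections import Counter

# Pipeline you would hand to User.objects.aggregate(...)
pipeline = [
    {'$unwind': '$kudos'},                           # one document per kudo
    {'$group': {'_id': '$kudos.category',            # bucket by category
                'count': {'$sum': 1}}},
]

# The same reduction over plain dicts, mirroring the sample users above
users = [
    {'name': 'superman', 'kudos': [{'category': 'A'}]},
    {'name': 'batman', 'kudos': [{'category': 'A'}, {'category': 'B'}]},
]
counts = Counter(k['category'] for u in users for k in u['kudos'])
print(counts)  # Counter({'A': 2, 'B': 1})
```

The pipeline runs the counting inside MongoDB instead of pulling every kudos list over the wire, which is where the performance difference comes from.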
I'm currently creating an application which maps peoples skills against various technologies.
I have 3 tables:

Employees
  Name
  Department

Skill
  Skill name

Results
  Name (FK)
  Skill (FK)
  Skill level
I wish to be able to see every single employee with each skill listed in a table. I believe the correct procedure to retrieve this information would be to perform some sort of for loop and select the info from the 3 tables? The alternative is adding rows to the Results table each time an employee or skill is added (although this doesn't seem like correct logic to me).
I think this is the correct logic, since you have to keep the level of each skill for each employee.
Let's say you have created three models:
Employee
Skill
Result
To get the skills of the employee with id = 37:
emp = Employee.objects.get(pk=37)
# here we get a list of (skill, level) tuples for this employee
skill_level_array = [(Skill.objects.filter(pk=x.skill), x.level) for x in Result.objects.filter(employee=emp)]
To get skills for all employees:
all_emp = Employee.objects.all()
grand_array = {}
for emp in all_emp:
    skill_level_array = [(Skill.objects.filter(pk=x.skill), x.level) for x in Result.objects.filter(employee=emp)]
    grand_array[emp] = skill_level_array
Now grand_array is a dictionary with each employee as a key and a list of (skill, level) tuples as the value.
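The bucketing that loop performs can also be done in a single pass over the Results rows instead of one query per employee. Here is a plain-Python sketch, with made-up (employee, skill, level) tuples standing in for the query results:

```python
from collections import defaultdict

# Stand-ins for rows of the Results table: (employee, skill, level)
results = [
    ('Alice', 'Python', 3),
    ('Alice', 'SQL', 2),
    ('Bob', 'Python', 1),
]

# Single pass: group every (skill, level) pair under its employee
skills_by_employee = defaultdict(list)
for employee, skill, level in results:
    skills_by_employee[employee].append((skill, level))

print(dict(skills_by_employee))
# {'Alice': [('Python', 3), ('SQL', 2)], 'Bob': [('Python', 1)]}
```

With an ORM you would fetch all Result rows once (rather than filtering per employee) and feed them through the same grouping.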
I am extremely new to Python (< 1 week), and I hope to read three "variables", PMID, Abstract Text, and Mesh, into a dataframe. My .xml file is 10 GB.
Right now the following code produces a list of the PMIDs and the Abstract Texts. How would I convert it into a dataframe where there are 3 variables, PMID, Abstract Text, and Mesh, in which each DescriptorName within the Mesh from the XML is separated by a comma (for ex: Adenocarcinoma, Antineoplastic Agents, Colorectal Neoplasms)? Please note that the following snippet is only 1 PMID. There are about 1.8 mil in total.
Please note that some PMIDs do not contain any Abstract Texts or Mesh...in that case, I would like NA or "" to stand in place for its row.
import xml.etree.cElementTree as etree
# read in all PMIDs and Abstract Texts - got too scared to parse Mesh
# incorrectly since it's very time consuming to re-run
pmid_abstract = []
for event, element in etree.iterparse("pubmed_result.xml"):
    if element.tag in ["PMID", "AbstractText"]:
        pmid_abstract.append(element.text)
    element.clear()
This contains only the relevant tags in .xml for one PMID only
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">29310420</PMID>
<Article PubModel="Print">
<Abstract>
<AbstractText Label="RATIONALE" NlmCategory="BACKGROUND">Regorafenib is the new standard third-line therapy in metastatic colorectal cancer (mCRC). However, the reported 1-year overall survival rate does not exceed 25%.</AbstractText>
<AbstractText Label="PATIENT CONCERNS" NlmCategory="UNASSIGNED">A 55-year-old man affected by mCRC, treated with regorafenib combined with stereotactic body radiotherapy (SBRT), showing a durable response.</AbstractText>
</Abstract>
</Article>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000230" MajorTopicYN="N">Adenocarcinoma</DescriptorName>
<QualifierName UI="Q000000981" MajorTopicYN="N">diagnostic imaging</QualifierName>
<QualifierName UI="Q000628" MajorTopicYN="Y">therapy</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D000970" MajorTopicYN="N">Antineoplastic Agents</DescriptorName>
<QualifierName UI="Q000627" MajorTopicYN="Y">therapeutic use</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D015179" MajorTopicYN="N">Colorectal Neoplasms</DescriptorName>
<QualifierName UI="Q000000981" MajorTopicYN="N">diagnostic imaging</QualifierName>
<QualifierName UI="Q000628" MajorTopicYN="Y">therapy</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
</PubmedArticle>
This might be what you want.
I appended a copy of the complete PubmedArticle element that you posted to the end of the XML file, and then enclosed the two elements in a single PubmedArticles (plural) element, to demonstrate the principle involved in processing such files. Because your file is so large, I chose to put temporary results into a SQL database table and then import them from there into pandas.
The first time through the loop there is no record to process. Afterwards, each time a PMID element is encountered this implies that the previous PubmedArticle has been completely processed and is available for storing to the database. As other elements are encountered they are simply inserted into the dictionary representing the current article.
from xml.etree import ElementTree
import sqlite3
import pandas as pd

conn = sqlite3.connect('ragtime.db')
c = conn.cursor()
c.execute('DROP TABLE IF EXISTS ragtime')
c.execute('''CREATE TABLE ragtime (PMID text, AbstractText text, DescriptorName text)''')

def store(record):
    c.execute('INSERT INTO ragtime VALUES (?, ?, ?)',
              [record['PMID'], ' '.join(record['AbstractText']), ','.join(record['DescriptorName'])])

record = None
for ev, el in ElementTree.iterparse('ragtime.xml'):
    if el.tag == 'PMID':
        # A new PMID means the previous article is complete and can be stored
        if record:
            store(record)
        record = {'PMID': el.text, 'AbstractText': [], 'DescriptorName': []}
    elif el.tag == 'AbstractText':
        record['AbstractText'].append(el.text)
    elif el.tag == 'DescriptorName':
        record['DescriptorName'].append(el.text)
    el.clear()  # free the element; this matters for a 10 GB input

store(record)  # don't forget the last article
conn.commit()

df = pd.read_sql_query('SELECT * FROM ragtime', conn)
print(df.head())
conn.close()
It produces the following printed result.
PMID AbstractText \
0 29310420 Regorafenib is the new standard third-line the...
1 29310425 Regorafenib is the new standard third-line the...
DescriptorName
0 Adenocarcinoma,Antineoplastic Agents,Colorecta...
1 Adenocarcinoma,Antineoplastic Agents,Colorecta...