Counting Organizations (Database sqlite3)

I need help with this python code. I am making an application that will read the mailbox data (mbox.txt) and count the number of email messages per organization (i.e. domain name of the email address) using a database with the following schema to maintain the counts. The top organizational count is 536.
This is the Schema: CREATE TABLE Counts (org TEXT, count INTEGER)
I've tried so many times I just can't get the count of 536. Here's my code below:
import sqlite3
conn = sqlite3.connect('emaildb.sqlite')
cur = conn.cursor()
cur.execute('DROP TABLE IF EXISTS Counts')
cur.execute('''
CREATE TABLE Counts (org TEXT, count INTEGER)''')
fname = input('Enter file name: ')
if (len(fname) < 1): fname = 'mbox.txt'
fh = open(fname)
for line in fh:
    if not line.startswith('From: '): continue
    pieces = line.split()
    org = pieces[1]
    cur.execute('SELECT count FROM Counts WHERE org = ? ', (org,))
    row = cur.fetchone()
    if row is None:
        cur.execute('''INSERT INTO Counts (org, count)
                VALUES (?, 1)''', (org,))
    else:
        cur.execute('UPDATE Counts SET count = count + 1 WHERE org = ?',
                    (org,))
conn.commit()
# https://www.sqlite.org/lang_select.html
sqlstr = 'SELECT org, count FROM Counts ORDER BY count DESC LIMIT 10'
for row in cur.execute(sqlstr):
    print(str(row[0]), row[1])
cur.close()
The highest number that I got is 195. Here is the output of the code above:
Enter the file name:
zqian@umich.edu 195
mmmay@indiana.edu 161
cwen@iupui.edu 158
chmaurer@iupui.edu 111
aaronz@vt.edu 110
ian@caret.cam.ac.uk 96
jimeng@umich.edu 93
rjlowe@iupui.edu 90
dlhaines@umich.edu 84
david.horwitz@uct.ac.za 67
I downloaded the data from the link below and saved it as a text file called mbox.txt:
(https://www.py4e.com/code3/mbox.txt)

You're not extracting the domain from the email address, so multiple addresses at the same domain are being treated as different organizations.
for line in fh:
    if not line.startswith('From: '): continue
    pieces = line.split()
    email = pieces[1]
    pieces = email.split('@')
    org = pieces[1]
    ...
Also, you might want to use the approach from "SQLite INSERT - ON DUPLICATE KEY UPDATE (UPSERT)" so you don't have to run a SELECT query to check whether the organization already exists.
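A minimal sketch of that UPSERT idea, under a few assumptions that are not in the original post: the org column is declared PRIMARY KEY (ON CONFLICT needs a uniqueness constraint to detect duplicates), the bundled SQLite is 3.24.0 or later, and addresses use '@' as in the original mbox.txt. The function name count_orgs and the sample lines are hypothetical.

```python
import sqlite3

def count_orgs(lines):
    """Count messages per organization with one UPSERT per 'From: ' line."""
    conn = sqlite3.connect(':memory:')  # in-memory database, for illustration only
    cur = conn.cursor()
    # Assumption: org is PRIMARY KEY so ON CONFLICT(org) can detect duplicates.
    cur.execute('CREATE TABLE Counts (org TEXT PRIMARY KEY, count INTEGER)')
    for line in lines:
        if not line.startswith('From: '):
            continue
        email = line.split()[1]
        org = email.split('@')[1]
        # Insert a new row, or bump the count if the org already exists.
        cur.execute('''INSERT INTO Counts (org, count) VALUES (?, 1)
                       ON CONFLICT(org) DO UPDATE SET count = count + 1''',
                    (org,))
    conn.commit()
    rows = cur.execute(
        'SELECT org, count FROM Counts ORDER BY count DESC, org').fetchall()
    conn.close()
    return rows

# Hypothetical sample lines in the shape of mbox.txt headers.
sample = [
    'From: alice@umich.edu\n',
    'Subject: hello\n',
    'From: bob@iupui.edu\n',
    'From: carol@umich.edu\n',
]
print(count_orgs(sample))
```

With the sample above, the two umich.edu addresses collapse into a single row, which is the behaviour the assignment asks for.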

Your retrieved results are email addresses, not email domains. You have to split the email addresses at the '@' symbol to get domain names:
    if not line.startswith('From: '):
        continue
    pieces = line.rstrip().split('@') # this is what you want
    org = pieces[1]
    cur.execute('SELECT count FROM Counts WHERE org = ? ', (org,))
Explanation: instead of splitting the string on whitespace, which is the default behaviour of Python's str.split() method, we strip the trailing newline and split the string at the '@' sign. So a line in your text file like 'From: name@email.com' would become a list with two parts: ['From: name', 'email.com']
Then you can use the second part as the organization and count that instead of the full address, and hopefully the code will work.
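To make the difference concrete, here is a small sketch comparing the two ways of splitting. The sample line is hypothetical and assumes the address uses '@' as in the original mbox.txt.

```python
line = 'From: name@email.com\n'  # hypothetical sample line

by_whitespace = line.split()           # default: split on runs of whitespace
by_at_sign = line.rstrip().split('@')  # strip the newline, then split at '@'

print(by_whitespace)  # ['From:', 'name@email.com']
print(by_at_sign)     # ['From: name', 'email.com']

org = by_at_sign[1]   # the domain part is what should be counted
print(org)            # email.com
```

Without the rstrip() call, the domain would carry the line's trailing newline, which shows up as blank lines when the results are printed.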

Related

Python SQLite Database Result has unwanted brackets and quotation marks

I am doing an assignment from the Coursera course "Using Databases with Python", and in one of the assignments I ran into an issue where the org column of my database result has brackets and quotation marks around it. (It should return org as iupui.edu instead of my current result, ['iupui.edu'].)
Please refer to my code below:
import sqlite3
import re
conn = sqlite3.connect('emaildb.sqlite')
cur = conn.cursor()
cur.execute('DROP TABLE IF EXISTS Counts')
cur.execute('''
CREATE TABLE Counts (org TEXT, count INTEGER)''')
fname = input('Enter file name: ')
if (len(fname) < 1): fname = 'mbox-short.txt'
fh = open(fname)
for line in fh:
    if not line.startswith('From: '): continue
    pieces = line.split()
    email = pieces[1]
    org = str(re.findall('@(\S+)', email))
    cur.execute('SELECT count FROM Counts WHERE org = ? ', (org,))
    row = cur.fetchone()
    if row is None:
        cur.execute('''INSERT INTO Counts (org, count)
                VALUES (?, 1)''', (org,))
    else:
        cur.execute('UPDATE Counts SET count = count + 1 WHERE org = ?',
                    (org,))
conn.commit()
# https://www.sqlite.org/lang_select.html
sqlstr = 'SELECT org, count FROM Counts ORDER BY count DESC LIMIT 10'
for row in cur.execute(sqlstr):
    print(str(row[0]), row[1])
cur.close()
The mbox file is here: https://www.py4e.com/code3/mbox.txt
I have a feeling I shouldn't convert org to string class but I don't know what else to convert it to because
I would greatly appreciate your help as I've been trying to fix it for hours!

This has to do with how you're saving it:
org = str(re.findall('@(\S+)', email))
Here, you find all email orgs, right? But how do you process them? Instead of taking the first value, you cast the whole result to a string, and that is the problem: findall returns a list, even if there is only one result. This is what you can do:
org = re.findall('@(\S+)', email)[0]
Now, org is still a string, but it no longer has brackets, as you are not casting a list to a string.
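The difference is easy to see in isolation. This is a small sketch with a hypothetical address (assuming '@' as the separator, as in the original mbox.txt):

```python
import re

email = 'someone@iupui.edu'  # hypothetical address

matches = re.findall(r'@(\S+)', email)  # findall always returns a list
as_text = str(matches)                  # string form of the whole list
org = matches[0]                        # first (and only) captured domain

print(matches)  # ['iupui.edu']
print(as_text)  # ['iupui.edu']  (a string that includes the brackets and quotes)
print(org)      # iupui.edu
```

The first two prints look identical, but the second value is a single string containing the brackets and quotes, and that string is what ended up in the database.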

Parsing just a column of a CSV file with multiple columns

I am learning Python and am currently working with it to parse a CSV file.
The CSV file has 3 columns:
Full_name, university, and Birth_Year.
I have successfully loaded, read, and printed the content of a given CSV file in Python, but here’s where I am stuck:
I want to take ONLY the column Full_name and parse it into 3 columns: first, middle, and last. If there are only 2 words in the name, then the middle name should be null.
The resulting parsed output should then be inserted into a SQL database through Python.
Here’s my code so far:
import csv
import sys
if __name__ == '__main__':
    if len (sys.argv) != 2:
        print("Please enter the csv file too: python name_parsing.py student_info.csv")
        sys.exit()
    else:
        with open(sys.argv[1], "r" ) as file:
            reader = csv.DictReader(file) #I use DictReader because the csv file has > 1 col
            # Print the names on the cmd
            for row in reader:
                name = row["Full_name"]
                for name in reader:
                    if len(name) == 2:
                        print(first_name = name[0])
                        print(middle_name = None)
                        print(last_name = name[2])
                    if len(name) == 3 : # The assumption is that name is either 2 or 3 words only.
                        print(first_name = name[0])
                        print(middle_name = name[1])
                        print(last_name = name[2])
                    db.execute("INSERT INTO name (first, middle, last) VALUES(?,?,?)",
                               row["first_name"], row["middle_name"], row["last_name"])
Running the program above gives me no output whatsoever. How to parse my code the right way? Thank you.

I created a sample file based on your description. The content looks as below:
Full_name,University,Birth_Year
Prakash Ranjan Gupta,BPUT,1920
Hari Shankar,NIT,1980
John Andrews,MIT,1950
Arbaaz Aslam Khan,REC,2005
I then ran the code below, which works fine in my Jupyter notebook. You can add the command-line check (len(sys.argv) != 2, etc.) as you need. I have used an sqlite3 database; I hope this works for you. If you want the if/main block added, let me know and I can edit.
This follows the structure of your code. (Otherwise, I believe you can do this more easily with pandas.)
import csv
import sqlite3
con = sqlite3.connect('name_data.sql') ## Make DB connection and create a table if it does not exist
cur = con.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS UNIV_DATA
            (FIRSTNAME TEXT,
            MIDDLE_NAME TEXT,
            LASTNAME TEXT,
            UNIVERSITY TEXT,
            YEAR TEXT)''')
with open('names_data.csv') as fh:
    read_data = csv.DictReader(fh)
    for uniData in read_data:
        lst_nm = uniData['Full_name'].split()
        if len(lst_nm) == 2:
            fn,ln = lst_nm
            mn = None
        else:
            fn,mn,ln = lst_nm
        # print(fn,mn,ln,uniData['University'],uniData['Birth_Year'] )
        cur.execute('''
            INSERT INTO UNIV_DATA
            (FIRSTNAME, MIDDLE_NAME, LASTNAME, UNIVERSITY, YEAR)
            VALUES(?,?,?,?,?)''',
            (fn,mn,ln,uniData['University'],uniData['Birth_Year'])
        )
con.commit()
cur.close()
con.close()
If you want to read the data in the table UNIV_DATA:
Option 1: (prints the rows in the form of tuple)
import sqlite3
con = sqlite3.connect('name_data.sql') #Make connection to DB and create a connection object
cur = con.cursor() #Create a cursor object
results = cur.execute('SELECT * FROM UNIV_DATA') # Execute the query; 'results' iterates over the rows retrieved
for result in results: # Traverse 'results' in a loop to print the rows retrieved
    print(result)
cur.close() #close the cursor
con.close() #close the connection
Option 2: (prints all the rows in the form of a pandas data frame - execute in jupyter ...preferably )
import sqlite3
import pandas as pd
con = sqlite3.connect('name_data.sql') #Make connection to DB and create a connection object
df = pd.read_sql('SELECT * FROM UNIV_DATA', con) #Query the table and store the result in a dataframe : df
df

When you call name = row["Full_name"] it is going to return a string representing the name, e.g. "John Smith".
In Python, strings are sequences, much like lists, so in this case len(name) would return 10, as "John Smith" has 10 characters (including the space). As this doesn't equal 2 or 3, nothing will happen in your for loop.
What you need is some way to turn the string into a list containing the first, middle and last names. You can do this using the split method. If you call name.split(" ") it splits the string wherever there is a space; continuing the above example, this would return ["John", "Smith"], which should make your code work.
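As a rough sketch of that idea, the helper below (split_name is a hypothetical name, not something from the question) turns a full name into first, middle and last parts, assuming every name has exactly two or three words as stated in the question:

```python
def split_name(full_name):
    """Return (first, middle, last); middle is None for two-word names."""
    parts = full_name.split()  # split on whitespace
    if len(parts) == 2:
        first, last = parts
        return first, None, last
    if len(parts) == 3:
        first, middle, last = parts
        return first, middle, last
    # Assumption from the question: names have only 2 or 3 words.
    raise ValueError('expected 2 or 3 words, got %d' % len(parts))

print(len('John Smith'))                   # 10 (characters, not words)
print(split_name('John Smith'))            # ('John', None, 'Smith')
print(split_name('Prakash Ranjan Gupta'))  # ('Prakash', 'Ranjan', 'Gupta')
```

The returned tuple can be passed directly as the parameter tuple of an INSERT statement; sqlite3 stores None as NULL.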

implementing sqlite sort by function and select function together in python along with pandas

I am writing a program to search through a database. My table contains Id, Title, Date, Content. I only need date and content as output. The output must be sorted with respect to 'Id'. But I don't want Id in output. How can I implement this ?
I tried to sort data before selecting Id and Content, unfortunately it is not giving me desired result.
import os
import sqlite3
import pandas as pd
conn = sqlite3.connect('insight.db')
c = conn.cursor()
def search_keyword(term):
    c.execute("SELECT * FROM GkData ORDER BY Id DESC")
    print(pd.read_sql_query("SELECT date, content FROM GkData WHERE {} LIKE '%{}%'".format('content', term), conn))
    c.execute("SELECT content FROM GkData WHERE {} LIKE '%{}%'".format('content', term))
    data = c.fetchall()
    for idx, row in enumerate(data):
        new_data = str(row).replace("',)", " ").replace("('", " ")
        print('[ ' + str(idx) + ' ] =====> ' + new_data)
while True:
    search = input("Search term: ")
    search_keyword(str(search))
conn.close()

"The output must be sorted with respect to 'Id'. But I don't want Id in output. How can I implement this ?"
SELECT date, content FROM GkData WHERE content LIKE '%something%' ORDER BY id ASC
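The ORDER BY clause can refer to a column that is not in the SELECT list, so the sorting and the selection belong in the same statement. Here is a minimal sketch in Python, assuming the table layout described in the question (Id, Title, Date, Content); the in-memory database and sample rows are made up for illustration, and the search term is passed as a query parameter rather than formatted into the SQL string.

```python
import sqlite3

def search_keyword(conn, term):
    """Return (date, content) rows whose content contains term, ordered by Id."""
    sql = ("SELECT date, content FROM GkData "
           "WHERE content LIKE ? ORDER BY Id ASC")
    # The wildcards go into the parameter value, not into the SQL text.
    return conn.execute(sql, ('%' + term + '%',)).fetchall()

# Hypothetical data standing in for insight.db.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE GkData (Id INTEGER, Title TEXT, Date TEXT, Content TEXT)')
conn.executemany('INSERT INTO GkData VALUES (?, ?, ?, ?)', [
    (3, 't3', '2020-01-03', 'python and sqlite'),
    (1, 't1', '2020-01-01', 'sqlite basics'),
    (2, 't2', '2020-01-02', 'pandas only'),
])
conn.commit()

rows = search_keyword(conn, 'sqlite')
print(rows)
```

The same SQL string and parameter tuple can be handed to pd.read_sql_query through its params argument if a data frame is preferred.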

python script is not saving into database

I am currently learning how to modify data with Python using Visual Studio and SQLite. My assignment is to count how many times each email address is found in a file. Then I must store these counts in SQLite in a table named Counts with two columns (org, count). I have written code that runs and prints the results to the Visual Studio output window, but nothing is saved to the database.
this is my program:
import sqlite3
conn = sqlite3.connect('database3.db')
cur = conn.cursor()
cur.execute('DROP TABLE IF EXISTS Counts')
cur.execute('''CREATE TABLE Counts (email TEXT, count INTEGER)''')
#cur.execute("INSERT INTO Counts Values('mlucygray@gmail.com',1)")
# Save (commit) the changes
conn.commit()
fname = input('Enter file name: ')
if (len(fname) < 1): fname = 'mbox-short.txt'
fh = open(fname)
for line in fh:
    if not line.startswith('From: '): continue
    pieces = line.split()
    email = pieces[1]
    cur.execute('SELECT count FROM Counts WHERE email = ? ', (email,))
    row = cur.fetchone()
    if row is None:
        cur.execute('''INSERT INTO Counts (email, count) VALUES (?, 1)''', (email,))
    else:
        cur.execute('UPDATE Counts SET count = count + 1 WHERE email = ?',(email,))
cur.execute('SELECT * FROM Counts')
# https://www.sqlite.org/lang_select.html
sqlstr = 'SELECT email, count FROM Counts ORDER BY count DESC LIMIT 10'
conn.commit()
for row in cur.execute(sqlstr):
    print(str(row[0]), row[1])
conn.commit()
cur.close()
Thank you for any suggestions

You need to commit after INSERT/UPDATE statements; you don't need to commit after executing SELECT statements.
for line in fh:
    if not line.lower().startswith('from: '): continue
    pieces = line.split()
    email = pieces[1]
    cur.execute('SELECT count FROM Counts WHERE email = ?', (email,))
    row = cur.fetchone()
    if row is None:
        cur.execute('''INSERT INTO Counts (email, count) VALUES (?, 1)''', (email,))
    else:
        cur.execute('UPDATE Counts SET count = count + 1 WHERE email = ?',(email,))
conn.commit()
sqlstr = 'SELECT email, count FROM Counts ORDER BY count DESC LIMIT 10'
for row in cur.execute(sqlstr):
    print(str(row[0]), row[1])
cur.close()
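To see why the commit matters, here is a small sketch that inserts a row, optionally commits, closes the connection, and then reopens the file to check what was actually saved. The function name rows_saved, the temporary database file, and the sample address are all hypothetical; the sketch assumes the sqlite3 module's default transaction handling, where closing a connection does not commit pending changes.

```python
import os
import sqlite3
import tempfile

def rows_saved(commit):
    """Insert one row, optionally commit, then reopen the file and count rows."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'demo.db')  # throwaway database file
        conn = sqlite3.connect(path)
        conn.execute('CREATE TABLE Counts (email TEXT, count INTEGER)')
        conn.commit()
        conn.execute("INSERT INTO Counts VALUES ('a@example.com', 1)")
        if commit:
            conn.commit()
        conn.close()  # close() does not commit; uncommitted changes are lost

        conn = sqlite3.connect(path)  # a fresh connection sees only saved data
        saved = conn.execute('SELECT COUNT(*) FROM Counts').fetchone()[0]
        conn.close()
        return saved

print(rows_saved(commit=False))  # 0
print(rows_saved(commit=True))   # 1
```

Within a single connection the uncommitted rows are still visible to SELECT, which is why a script can print correct results while the database file stays empty.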

Python 2.7 (Anaconda 2) - IndentationError: unexpected indent Notepad++ auto separating code

Okay, so this code is for an online class, and I feel the issue is caused by Notepad++, but I can't prove anything.
Actual Error is:
[Anaconda2] C:\Users\ia566\Desktop\Python>python countorgdb.py
File "countorgdb.py", line 20
print pieces
^
IndentationError: unexpected indent
Here is a picture of Notepad++:
Notepad++ Screen Shot With Error
Notepad++ with line deleted
Notice how the file tree changes. Any addition to the code causes an extra branch in the file tree and thus an IndentationError.
Full Code:
import sqlite3
conn = sqlite3.connect('jbemaildb.sqlite')
cur = conn.cursor()
cur.execute('''
DROP TABLE IF EXISTS Counts''')
cur.execute('''
CREATE TABLE Counts (org TEXT, count INTEGER)''')
fname = raw_input('Enter file name: ')
if ( len(fname) < 1 ) : fname = 'mbox-short.txt'
fh = open(fname)
for line in fh:
    if not line.startswith('From: ') : continue
    pieces = line.split()
    org = pieces[1]
    print org
    # Here is where I am trying to add a print statement and more code.
    cur.execute('SELECT count FROM Counts WHERE org = ? ', (org, ))
    row = cur.fetchone()
    if row is None:
        cur.execute('''INSERT INTO Counts (org, count)
                VALUES ( ?, 1 )''', (org, ) )
    else :
        cur.execute('UPDATE Counts SET count=count+1 WHERE org = ?',
            (org, ))
    # This statement commits outstanding changes to disk each
    # time through the loop - the program can be made faster
    # by moving the commit so it runs only after the loop completes
    conn.commit()
# https://www.sqlite.org/lang_select.html
sqlstr = 'SELECT org, count FROM Counts ORDER BY count DESC LIMIT 10'
print
print "Counts:"
for row in cur.execute(sqlstr) :
    print str(row[0]), row[1]
cur.close()
I cannot figure out what I am doing to make Notepad++ so angry. I just want to be able to add to my code.

An IndentationError occurs when spaces or tabs are not used consistently. You should indent every level of your code with four spaces and avoid mixing in tabs. Try it out and see if it helps.
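A quick way to check a snippet without running it is to hand the source text to the built-in compile() function and catch the exception. The sketch below is written for Python 3 (the question uses Python 2.7, where mixed tabs and spaces are treated more leniently); the helper name indentation_problem and the sample snippets are hypothetical.

```python
def indentation_problem(source):
    """Return the name of the indentation-related error in source, or None."""
    try:
        compile(source, '<example>', 'exec')
    except IndentationError as err:  # TabError is a subclass of IndentationError
        return type(err).__name__
    return None

ok = "for i in range(2):\n    print(i)\n"
extra = "x = 1\n    print(x)\n"            # second line indented with no block open
mixed = "if True:\n\tx = 1\n        y = 2\n"  # a tab on one line, spaces on the next

print(indentation_problem(ok))     # None
print(indentation_problem(extra))  # IndentationError
print(indentation_problem(mixed))  # TabError
```

The mixed case is the one an editor can hide: a tab and eight spaces may look identical on screen, so turning on visible whitespace or converting tabs to spaces in the editor is usually the practical fix.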
