ETL table selection by Variable

I'm trying to select rows within a table and create a new table with the information from the original table using PETL.
My code right now is:
import petl as etl
table_all = (
    etl.fromcsv("practice_locations.csv")
    .convert('Practice_Name', 'upper')
    .convert('Suburb', str)
    .convert('State', str)
    .convert('Postcode', int)
    .convert('Lat', str)
    .convert('Long', str)
)
def selection(post_code):
    table_selected = etl.select(table_all, "{Postcode} == 'post_code'")
    print(post_code)
    etl.tojson(table_selected, 'location.json', sort_keys=True)
But I cannot seem to populate table_selected by using the selection function as it is. The etl.select call works if I replace post_code with a literal value, like this:
table_selected = etl.select(table_all, "{Postcode} == 4510")
Which outputs the correct table shown as:
+--------------------------------+--------------+-------+----------+--------------+--------------+
| Practice_Name                  | Suburb       | State | Postcode | Lat          | Long         |
+================================+==============+=======+==========+==============+==============+
| 'CABOOLTURE COMBINED PRACTICE' | 'Caboolture' | 'QLD' |     4510 | '-27.085007' | '152.951707' |
+--------------------------------+--------------+-------+----------+--------------+--------------+
I'm sure I am just trying to call post_code in a way that is wrong but have tried everything from the PETL documentation and can't seem to figure it out.
Any help is much appreciated.

"{Postcode} == 'post_code'" is a plain string literal, so post_code is never replaced with the value passed to your selection function.
You need to format your select string, and escape {Postcode} by doubling its braces when using format:
table_selected = etl.select(table_all, "{{Postcode}} == {post_code}".format(post_code=post_code))
Testing this in the console:
>>> "{{Postcode}} == {post_code}".format(post_code=1234)
'{Postcode} == 1234'
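The brace escaping above can be wrapped in a small helper. This is only a sketch: build_postcode_filter is a hypothetical name, and it assumes the Postcode column holds integers, as in the question's convert('Postcode', int) step.

```python
def build_postcode_filter(post_code):
    """Return a selection expression string for an integer postcode."""
    # Doubled braces come out of str.format() as single literal braces,
    # so the field reference {Postcode} is preserved in the result.
    # int() rejects anything that is not a whole number before it
    # ends up inside the expression.
    return "{{Postcode}} == {0}".format(int(post_code))


expression = build_postcode_filter(4510)
print(expression)
```

The resulting string can then be passed as the second argument of etl.select. petl's select also accepts a function of the row, for example etl.select(table_all, lambda rec: rec.Postcode == post_code), which avoids building a string at all.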

Related

How to group column values into 'others'?

I have a huge list of website names in my dataframe.
e.g array(['google','facebook','yahoo','youtube', and many other small websites])
Dataframe has around 40 more websites.
I want to group the other websites name as 'other'
My input table is something like
|Website |
|-------------|
|google.com |
|youtube.com |
|yahoo.com |
|nyu.com |
|something.com|
My desired output will be something like
|Website |
|-----------|
|google.com |
|youtube.com|
|yahoo.com |
|others |
|others |
I tried a few things, but they didn't work. Should I rename them manually? Or is there a way to create a new column that marks them as 'others', with a few exceptions as above?
Thanks in advance.

Try:
m=df['Website'].isin(['google.com','youtube.com','yahoo.com'])
#Finally:
df.loc[~m,'Website']='others'
OR
m=df['Website'].str.contains('google|youtube|yahoo')
#Finally:
df.loc[~m,'Website']='others'

Try using str.contains, negated with ~ so that only the rows that do not match are renamed:
df.loc[~df['Website'].str.contains('google|youtube|yahoo|facebook'),'Website']='others'

Maybe...
# maintain a list of sites you wish to keep
sitesToKeep = ['google.com', 'youtube.com', 'yahoo.com']
# for all rows where the value in the column 'Website' is not present in the list 'sitesToKeep', change the value to 'others'
df.loc[~df.Website.isin(sitesToKeep), 'Website'] = 'others'
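The answers above differ in one respect: isin compares whole values, while str.contains matches a pattern anywhere in the value. The following plain-Python sketch mirrors both behaviours without pandas. The function names and the sample value 'mail.google.com' are made up for illustration.

```python
import re

KEEP = ['google.com', 'youtube.com', 'yahoo.com']
KEEP_PATTERN = re.compile('google|youtube|yahoo')


def group_exact(sites, keep=KEEP):
    """Keep a site only if it equals one of the listed names (like isin)."""
    keep_set = set(keep)
    return [site if site in keep_set else 'others' for site in sites]


def group_by_pattern(sites, pattern=KEEP_PATTERN):
    """Keep a site if the pattern occurs anywhere in it (like str.contains)."""
    return [site if pattern.search(site) else 'others' for site in sites]


sites = ['google.com', 'youtube.com', 'yahoo.com', 'nyu.com', 'something.com']
print(group_exact(sites))
print(group_by_pattern(['mail.google.com', 'nyu.com']))
```

Exact matching is the stricter choice when the list of sites to keep is known in full. Pattern matching also keeps subdomains and any other value that happens to contain one of the names.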

Check if value in a column exists in URL using lambda function

I have a dataframe that has 2 columns. One is the URL and other is the username.
+----------------------------------------+---------------+
| URL                                    | Username      |
+----------------------------------------+---------------+
| johnsmith/stackoverflow.com/?=abc      | johnsmith     |
| michealrod/stackoverflow.com/?=payment | michealrod    |
| stephaniejean/stackoverflow.com/?=abc  | stephaniejean |
+----------------------------------------+---------------+
I want to write a lambda function that checks if the username exists in the URL. I am trying to write something like this, but I am getting an error:
df['exists'] = df.apply(lambda x : df['Username'] in df['URL']).any()
So basically I am trying to get True if the username is part of the URL and False if the username does not exist in the URL.

Assuming your data is clean, a list comprehension is relatively efficient:
df['exists'] = [x in y for x, y in zip(df['Username'], df['URL'])]
You can use apply but with worse performance:
df['exists'] = df.apply(lambda row: row['Username'] in row['URL'], axis=1)
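If the data is not clean, the in test raises a TypeError on missing values. Here is a minimal sketch of a more defensive version of the same list comprehension, written against plain lists. username_in_url is a hypothetical helper, and treating non-string values as "not found" is an assumption about what the result should be.

```python
def username_in_url(usernames, urls):
    """Return one boolean per pair: True if the username occurs in the URL."""
    return [
        # Non-string values (None, NaN, numbers) count as "not found"
        # instead of raising a TypeError.
        isinstance(name, str) and isinstance(url, str) and name in url
        for name, url in zip(usernames, urls)
    ]


usernames = ['johnsmith', 'michealrod', None]
urls = [
    'johnsmith/stackoverflow.com/?=abc',
    'someoneelse/stackoverflow.com/?=payment',
    'stephaniejean/stackoverflow.com/?=abc',
]
print(username_in_url(usernames, urls))
```

With a dataframe, the two arguments would be df['Username'] and df['URL'], and the returned list can be assigned to df['exists'].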

Check with numpy char.find (the public name for core.defchararray.find):
df['exists']=np.char.find(df.URL.values.astype(str),df.Username.values.astype(str))!=-1

Convert a value using a value from a different row with petl?

I have the following table:
+---------+------------+----------------+
| IRR     | Price List | Cambrdige Data |
+=========+============+================+
| '1.56%' | '0'        | '6/30/1989'    |
+---------+------------+----------------+
| '5.17%' | '100'      | '9/30/1989'    |
+---------+------------+----------------+
| '4.44%' | '0'        | '12/31/1990'   |
+---------+------------+----------------+
I'm trying to write a calculator that updates the Price List field by making a simple calculation. The logic is basically this:
previous price * ( 1 + IRR%)
So for the last row, the calculation would be: 100 * (1 + 4.44%) = 104.44
Since I'm using petl, I'm trying to figure out how to update a field with its above value and a value from the same row and then populate this across the whole Price List column. I can't seem to find a useful petl utility for this. Should I just manually write a method? What do you guys think?

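Try this:
# a conversion can access other values from the same row;
# with pass_row=True the function receives (value, row)
# note: this uses a fixed base of 100, not the previous row's price
table = etl.convert(table, 'Price List',
                    lambda v, row: 100 * (1 + float(row.IRR.strip('%')) / 100),
                    pass_row=True)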
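The calculation in the question depends on the previous row's price, which a per-row conversion cannot see. One way to handle that is a plain loop that carries the previous price forward. The sketch below is not a petl utility. compound_prices is a hypothetical helper, and it assumes that a non-zero Price List value is a given starting price and that '0' means "to be computed".

```python
def compound_prices(rows):
    """Fill in prices as previous price * (1 + IRR), row by row.

    rows is a sequence of (irr_text, price_text) pairs such as
    ('4.44%', '0'). A non-zero price is kept as a seed value.
    """
    prices = []
    previous = 0.0
    for irr_text, price_text in rows:
        rate = float(irr_text.rstrip('%')) / 100.0
        given = float(price_text)
        # Keep a seed price as it is; otherwise compound the previous price.
        price = given if given != 0 else previous * (1 + rate)
        prices.append(price)
        previous = price
    return prices


rows = [('1.56%', '0'), ('5.17%', '100'), ('4.44%', '0')]
prices = compound_prices(rows)
print([round(p, 2) for p in prices])
```

The first row stays at 0 because no earlier price exists, the second row keeps its seed of 100, and the third row matches the question's example of 100 * (1 + 4.44%).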

Automatically multiprocessing a 'function apply' on a dataframe column

I have a simple dataframe with two columns.
+---------+-------+
| subject | score |
+---------+-------+
| wow     | 0     |
+---------+-------+
| cool    | 0     |
+---------+-------+
| hey     | 0     |
+---------+-------+
| there   | 0     |
+---------+-------+
| come on | 0     |
+---------+-------+
| welcome | 0     |
+---------+-------+
For every record in 'subject' column, I am calling a function and updating the results in column 'score' :
df['score'] = df['subject'].apply(find_score)
Here find_score is a function, which processes strings and returns a score :
def find_score (row):
    # Imports the Google Cloud client library
    from google.cloud import language
    # Instantiates a client
    language_client = language.Client()
    import re
    pre_text = re.sub('<[^>]*>', '', row)
    text = re.sub(r'[^\w]', ' ', pre_text)
    document = language_client.document_from_text(text)
    # Detects the sentiment of the text
    sentiment = document.analyze_sentiment().sentiment
    print("Sentiment score - %f " % sentiment.score)
    return sentiment.score
This works as expected, but it's quite slow because it processes the records one by one.
Is there a way to parallelise this without manually splitting the dataframe into smaller chunks? Is there a library that does this automatically?
Cheers

The instantiation of language.Client every time you call the find_score function is likely a major bottleneck. You don't need to create a new client instance for every use of the function, so try creating it outside the function, before you call it:
# Imports the Google Cloud client library (must come before the client is created)
from google.cloud import language
import re
# Instantiates a client once, outside the function
language_client = language.Client()
def find_score (row):
    pre_text = re.sub('<[^>]*>', '', row)
    text = re.sub(r'[^\w]', ' ', pre_text)
    document = language_client.document_from_text(text)
    # Detects the sentiment of the text
    sentiment = document.analyze_sentiment().sentiment
    print("Sentiment score - %f " % sentiment.score)
    return sentiment.score
df['score'] = df['subject'].apply(find_score)
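If you insist, you can use multiprocessing like this:
from multiprocessing import Pool
# <Define functions and datasets here>
if __name__ == '__main__':  # required where worker processes are spawned, e.g. on Windows
    pool = Pool(processes = 8) # or some number of your choice
    df['score'] = pool.map(find_score, df['subject'])
    pool.close()  # no more work will be submitted
    pool.join()   # wait for the workers to exit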
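Because each call to find_score spends most of its time waiting on a remote service, a thread pool may be enough and avoids starting extra processes. The sketch below uses the standard library's concurrent.futures. score_all and fake_score are hypothetical names, fake_score only stands in for the real API call, and whether the real client can be shared safely between threads is an assumption to verify.

```python
from concurrent.futures import ThreadPoolExecutor


def score_all(subjects, scorer, max_workers=8):
    """Apply scorer to every subject concurrently, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the inputs,
        # so the list lines up with the original column.
        return list(executor.map(scorer, subjects))


def fake_score(text):
    # Hypothetical stand-in for the remote sentiment call.
    return len(text) / 10.0


scores = score_all(['wow', 'cool', 'hey'], fake_score)
print(scores)
```

With a dataframe, the call would look like df['score'] = score_all(df['subject'], find_score).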

Python Output Length

I'm attempting to output my database table data, which works aside from long table rows. The columns need to be as large as the longest database row. I'm having trouble implementing a calculation to correctly output the table proportionally instead of a huge mess when long rows are outputted (without using a third party library e.g. Print results in MySQL format with Python). Please let me know if you need more information.
Database connection:
connection = sqlite3.connect("test_.db")
c = connection.cursor()
c.execute("SELECT * FROM MyTable")
results = c.fetchall()
formatResults(results)
Table formatting:
def formatResults(x):
    try:
        widths = []
        columns = []
        tavnit = '|'
        separator = '+'
        for cd in c.description:
            widths.append(max(cd[2], len(cd[0])))
            columns.append(cd[0])
        for w in widths:
            tavnit += " %-"+"%ss |" % (w,)
            separator += '-'*w + '--+'
        print(separator)
        print(tavnit % tuple(columns))
        print(separator)
        for row in x:
            print(tavnit % row)
        print(separator)
        print("")
    except:
        showMainMenu()
        pass
Output problem example:
+------+------+---------+
| Date | Name | LinkOrFile |
+------+------+---------+
| 03-17-2016 | hi.com | Locky |
| 03-18-2016 | thisisitsqq.com | None |
| 03-19-2016 | http://ohiyoungbuyff.com\69.exe?1 | None |
| 03-20-2016 | http://thisisitsqq..com\69.exe?1 | None |
| 03-21-2016 | %Temp%\zgHRNzy\69.exe | None |
| 03-22-2016 | | None |
| 03-23-2016 | E52219D0DA33FDD856B2433D79D71AD6 | Downloader |
| 03-24-2016 | microsoft.com | None |
| 03-25-2016 | 89.248.166.132 | None |
| 03-26-2016 | http://89.248.166.131/55KB5js9dwPtx4= | None |

If your main problem is making column widths consistent across all the lines, this python package could do the job: https://pypi.python.org/pypi/tabulate
Below you find a very simple example of a possible formatting approach.
The key point is to find the largest length of each column and then use format method of the string object:
#!/usr/bin/python
import random
import string
def randomString(minLen = 1, maxLen = 10):
    """ Random string of length between minLen and maxLen """
    l = random.randint(minLen, maxLen)
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(l))
COLUMNS = 4
def randomTable():
    table = []
    for i in range(10):
        table.append( [randomString() for j in range(COLUMNS)] )
    return table
def findMaxColumnLengs(table):
    """ Returns list of max column lengths """
    maxLens = [0] * COLUMNS
    for l in table:
        lens = [len(s) for s in l]
        maxLens = [max(maxLens[e[0]], e[1]) for e in enumerate(lens)]
    return maxLens
if __name__ == '__main__':
    ll = randomTable()
    ml = findMaxColumnLengs(ll)
    # list of formatting statements, see format docs
    formatStrings = ["{:<%s}" % str(m) for m in ml ]
    fmtStr = "|".join(formatStrings)
    print("==================================")
    for l in ll:
        print(l)
    print("==================================")
    for l in ll:
        print(fmtStr.format(*l))
This prints the initial table packed in the list of lists and the formatted output.
==================================
['2U7Q', 'DZK8Z5XT', '7ZI0W', 'A9SH3V3U']
['P7SOY3RSZ1', 'X', 'Z2W', 'KF6']
['NO8IEY9A', '4FVGQHG', 'UGMJ', 'TT02X']
['9S43YM', 'JCUT0', 'W', 'KB']
['P43T', 'QG', '0VT9OZ0W', 'PF91F']
['2TEQG0H6A6', 'A4A', '4NZERXV', '6KMV22WVP0']
['JXOT', 'AK7', 'FNKUEL', 'P59DKB8']
['BTHJ', 'XVLZZ1Q3H', 'NQM16', 'IZBAF']
['G0EF21S', 'A0G', '8K9', 'RGOJJYH2P9']
['IJ', 'SRKL8TXXI', 'R', 'PSUZRR4LR']
==================================
2U7Q      |DZK8Z5XT |7ZI0W   |A9SH3V3U
P7SOY3RSZ1|X        |Z2W     |KF6
NO8IEY9A  |4FVGQHG  |UGMJ    |TT02X
9S43YM    |JCUT0    |W       |KB
P43T      |QG       |0VT9OZ0W|PF91F
2TEQG0H6A6|A4A      |4NZERXV |6KMV22WVP0
JXOT      |AK7      |FNKUEL  |P59DKB8
BTHJ      |XVLZZ1Q3H|NQM16   |IZBAF
G0EF21S   |A0G      |8K9     |RGOJJYH2P9
IJ        |SRKL8TXXI|R       |PSUZRR4LR

The code that you used is for MySQL. The critical part is the line widths.append(max(cd[2], len(cd[0]))) where cd[2] gives the length of the longest data in that column. This works for MySQLdb.
However, you are using sqlite3, for which the value cd[2] is set to None:
https://docs.python.org/2/library/sqlite3.html#sqlite3.Cursor.description
Thus, you will need to replace the following logic:
for cd in c.description:
    widths.append(max(cd[2], len(cd[0])))
    columns.append(cd[0])
with your own. The rest of the code should be fine as long as widths is computed correctly.
The easiest way to compute widths correctly is to go through each row of the result, find the maximum width of each column, and store it in widths. A rough sketch:
columns = [cd[0] for cd in c.description]  # Get column headers
widths = [len(name) for name in columns]  # Start from the header widths
for row in x:
    for i, v in enumerate(row):  # row is a tuple of column values
        # str() is needed because a value may be None or a number
        widths[i] = max(len(str(v)), widths[i])  # Compare with the width already stored
At the end of this, widths should contain the maximum length of each column.
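Putting the pieces together, the width calculation and the table drawing can be sketched end to end with only the standard library. format_table is a hypothetical helper, and the in-memory database and its two sample rows exist only to make the example self-contained.

```python
import sqlite3


def format_table(columns, rows):
    """Return the table as a list of text lines with aligned columns."""
    # Render every cell as text first so None and numbers have a length.
    text_rows = [[str(value) for value in row] for row in rows]
    widths = [len(name) for name in columns]
    for row in text_rows:
        for i, cell in enumerate(row):
            widths[i] = max(widths[i], len(cell))
    separator = '+' + '+'.join('-' * (w + 2) for w in widths) + '+'

    def line(cells):
        padded = (cell.ljust(w) for cell, w in zip(cells, widths))
        return '| ' + ' | '.join(padded) + ' |'

    lines = [separator, line(columns), separator]
    lines.extend(line(row) for row in text_rows)
    lines.append(separator)
    return lines


connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute("CREATE TABLE MyTable (Date TEXT, Name TEXT, LinkOrFile TEXT)")
cursor.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?)",
    [("03-17-2016", "hi.com", "Locky"),
     ("03-18-2016", "thisisitsqq.com", None)],
)
cursor.execute("SELECT * FROM MyTable")
columns = [cd[0] for cd in cursor.description]  # only cd[0] is set by sqlite3
lines = format_table(columns, cursor.fetchall())
connection.close()
print('\n'.join(lines))
```

Returning the lines instead of printing them inside the helper keeps the formatting separate from the database code and makes the output easy to check.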
