Automatically multiprocessing a 'function apply' on a dataframe column - python

I have a simple dataframe with two columns.
+---------+-------+
| subject | score |
+---------+-------+
| wow     | 0     |
| cool    | 0     |
| hey     | 0     |
| there   | 0     |
| come on | 0     |
| welcome | 0     |
+---------+-------+
For every record in the 'subject' column, I call a function and update the result in the 'score' column:
df['score'] = df['subject'].apply(find_score)
Here find_score is a function, which processes strings and returns a score :
def find_score(row):
    # Imports the Google Cloud client library
    from google.cloud import language
    # Instantiates a client
    language_client = language.Client()
    import re
    pre_text = re.sub('<[^>]*>', '', row)
    text = re.sub(r'[^\w]', ' ', pre_text)
    document = language_client.document_from_text(text)
    # Detects the sentiment of the text
    sentiment = document.analyze_sentiment().sentiment
    print("Sentiment score - %f " % sentiment.score)
    return sentiment.score
This works fine as expected, but it's quite slow because it processes the records one by one.
Is there a way to parallelise this without manually splitting the dataframe into smaller chunks? Is there any library that does that automatically?
Cheers

The instantiation of language.Client every time you call the find_score function is likely a major bottleneck. You don't need to create a new client instance for every use of the function, so try creating it outside the function, before you call it:
# Imports the Google Cloud client library
from google.cloud import language

# Instantiates a client
language_client = language.Client()

def find_score(row):
    import re
    pre_text = re.sub('<[^>]*>', '', row)
    text = re.sub(r'[^\w]', ' ', pre_text)
    document = language_client.document_from_text(text)
    # Detects the sentiment of the text
    sentiment = document.analyze_sentiment().sentiment
    print("Sentiment score - %f " % sentiment.score)
    return sentiment.score
df['score'] = df['subject'].apply(find_score)
If you insist, you can use multiprocessing like this:
from multiprocessing import Pool
# <Define functions and datasets here>
pool = Pool(processes = 8) # or some number of your choice
df['score'] = pool.map(find_score, df['subject'])
pool.terminate()
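If you would rather not manage a Pool yourself, there are third-party libraries built for exactly this. A minimal sketch using pandarallel (an assumption on my part that it fits your setup; it needs pip install pandarallel, and find_score and df defined as above):
from pandarallel import pandarallel

pandarallel.initialize()  # starts one worker process per CPU core by default

# parallel_apply mirrors Series.apply but spreads the rows across the workers
df['score'] = df['subject'].parallel_apply(find_score)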

Related

Importing csv formatted for excel into a dataframe

I am receiving datafiles from 2 different people and the files are coming through with different formats despite both users using the same system and the same browser.
I would like to be able to make my code smart enough to read either format but so far I have been unsuccessful.
The data coming through that I am having issues with looks like this:
+----------------+---------------+--------------+
| Customer Name | Customer code | File Ref |
+----------------+---------------+--------------+
| ACCOUNT SET UP | ="35" | R2I0025715 |
+----------------+---------------+--------------+
| Xenox | ="4298" | ="913500999" |
+----------------+---------------+--------------+
and the data that is importing cleanly looks like this
+----------------+---------------+------------+
| Customer Name | Customer code | File Ref |
+----------------+---------------+------------+
| ACCOUNT SET UP | 35 | R2I0025715 |
+----------------+---------------+------------+
| Xenox | 4298 | 913500999 |
+----------------+---------------+------------+
I am trying to import the data with the following code pd.read_csv(f, encoding='utf-8', dtype={"Customer Name": "string", "Customer code": "string", "File Ref": "string"})
A workaround that I am using is opening each csv in excel, and saving. But when this involves hundreds of files, it isn't really a workaround.
Can anyone help?
You could use the standard strip() function to remove leading and trailing = and " characters on all of your columns.
For example:
import pandas as pd
data = {
    'Customer Name' : ['ACCOUNT SET UP', 'Xenox', 'ACCOUNT SET UP', 'Xenox'],
    'Customer Code': ['="35"', '="4298"', '35', '4298'],
    'File Ref': ['R2I0025715', '="913500999"', 'R2I0025715', '913500999']
}
df = pd.DataFrame(data)

for col in df.columns:
    df[col] = df[col].str.strip('="')

print(df)
Giving you:
    Customer Name Customer Code    File Ref
0  ACCOUNT SET UP            35  R2I0025715
1           Xenox          4298   913500999
2  ACCOUNT SET UP            35  R2I0025715
3           Xenox          4298   913500999
If you just want to apply it to specific columns, use:
for col in ['Customer Code', 'File Ref']:
    df[col] = df[col].str.strip('="')
My Solution:
import re
import pandas as pd
def removechar(x):
    x = str(x)
    out = re.sub('="', '', x)
    return(out)

def removechar2(x):
    x = str(x)
    out = re.sub('"', '', x)
    out = int(out)  # could use float(), depends on what you want
    return(out)

# then use applymap from pandas
Example:
datas = {'feature1': ['="23"', '="24"', '="23"', '="83"'], 'feature2': ['="23"', '="2"', '="3"', '="23"']}
test = pd.DataFrame(datas) # Example dataframe
test
Out[1]:
  feature1 feature2
0    ="23"    ="23"
1    ="24"     ="2"
2    ="23"     ="3"
3    ="83"    ="23"
#applymap my functions
test = test.applymap(removechar)
test = test.applymap(removechar2)
test
Out[2]:
feature1 feature2
0 23 23
1 24 2
2 23 3
3 83 23
#fixed
Note that you could probably do this with a single applymap call and one function running re.sub; try reading the documentation for re.sub. This was just something quick I whipped up.
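For reference, a sketch of that one-liner (assuming the same test dataframe as above):
import re

# Remove the leading =" and the trailing ", then convert to int, in one pass
test = test.applymap(lambda x: int(re.sub(r'^="|"$', '', str(x))))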

ETL table selection by Variable

I'm trying to select rows within a table and create a new table with the information from the original table using PETL.
My code right now is:
import petl as etl
table_all = (
    etl.fromcsv("practice_locations.csv")
       .convert('Practice_Name', 'upper')
       .convert('Suburb', str)
       .convert('State', str)
       .convert('Postcode', int)
       .convert('Lat', str)
       .convert('Long', str)
)

def selection(post_code):
    table_selected = etl.select(table_all, "{Postcode} == 'post_code'")
    print(post_code)
    etl.tojson(table_selected, 'location.json', sort_keys=True)
But I cannot seem to populate table_selected by using the selection function as it stands. The etl.select call works if I hard-code the postcode instead of using post_code, like this:
table_selected = etl.select(table_all, "{Postcode} == 4510")
Which outputs the correct table shown as:
+--------------------------------+--------------+-------+----------+--------------+--------------+
| Practice_Name | Suburb | State | Postcode | Lat | Long |
+================================+==============+=======+==========+==============+==============+
| 'CABOOLTURE COMBINED PRACTICE' | 'Caboolture' | 'QLD' | 4510 | '-27.085007' | '152.951707' |
+--------------------------------+--------------+-------+----------+--------------+--------------+
I'm sure I am just trying to call post_code in a way that is wrong but have tried everything from the PETL documentation and can't seem to figure it out.
Any help is much appreciated.
"{Postcode} == 'post_code'" will not replace post_code with the value passed to your selection function.
You need to format your select string (and escape {Postcode} when using format)
table_selected = etl.select(table_all, "{{Postcode}} == {post_code}".format(post_code=post_code))
Testing this in console
>>> "{{Postcode}} == {post_code}".format(post_code=1234)
'{Postcode} == 1234'
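As an aside (an alternative not covered above), petl's select also accepts a plain function instead of an expression string, which sidesteps the string formatting entirely. A sketch, reusing the names from your code:
def selection(post_code):
    # Each record supports lookup by field name, so no string templating is needed
    table_selected = etl.select(table_all, lambda rec: rec['Postcode'] == post_code)
    etl.tojson(table_selected, 'location.json', sort_keys=True)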

Applying a udf function in a distributed fashion in PySpark

Say I have a very basic Spark DataFrame that consists of a couple of columns, one of which contains a value that I want to modify.
|| value || lang ||
| 3 | en |
| 4 | ua |
Say I want to have a new column per specific class, where I add a float number to the given value (this is not very relevant to the final question; in reality I run an sklearn prediction there, but for simplicity let's assume we are adding something, the idea being that I modify the value in some way). So, given a dict classes={'1':2.0, '2':3.0}, I would like to have a column for each class where I add the class's value to the value from the DF and then save each result to a csv:
class_1.csv
|| value || lang || my_class | modified ||
| 3 | en | 1 | 5.0 | # this is 3+2.0
| 4 | ua | 1 | 6.0 | # this is 4+2.0
class_2.csv
|| value || lang || my_class | modified ||
| 3 | en | 2 | 6.0 | # this is 3+3.0
| 4 | ua | 2 | 7.0 | # this is 4+3.0
So far I have the following code that works and modifies the value for each defined class, but it is done with a for loop and I am looking for a more advanced optimization for it:
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf
from pyspark.sql.functions import lit
# create session and context
spark = pyspark.sql.SparkSession.builder.master("yarn").appName("SomeApp").getOrCreate()
conf = SparkConf().setAppName('Some_App').setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)
my_df = spark.read.csv("some_file.csv")
# modify the value here
def do_stuff_to_column(value, separate_class):
    # do stuff to column, let's pretend we just add a specific value per specific class that is read from a dictionary
    class_dict = {'1': 2.0, '2': 3.0}  # would be loaded from somewhere
    return float(value + class_dict[separate_class])
# iterate over each given class later
class_dict = {'1':2.0, '2':3.0} # in reality have more than 10 classes
# create a udf function
udf_modify = udf(do_stuff_to_column, FloatType())
# loop over each class
for my_class in class_dict:
    # create the column first with lit
    my_df2 = my_df.withColumn("my_class", lit(my_class))
    # modify using udf function
    my_df2 = my_df2.withColumn("modified", udf_modify("value", "my_class"))
    # write to csv now
    my_df2.write.format("csv").save("class_" + my_class + ".csv")
So the question is, is there a better/faster way of doing this than in a for loop?
I would use some form of join, in this case crossJoin. Here's a MWE:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(3, 'en'), (4, 'ua')], ['value', 'lang'])
classes = spark.createDataFrame([(1, 2.), (2, 3.)], ['class_key', 'class_value'])
res = df.crossJoin(classes).withColumn('modified', F.col('value') + F.col('class_value'))
res.show()
For saving as separate CSV's I think there is no better way than to use a loop.
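If one output directory per class (rather than an individually named file) is acceptable, a possible alternative, and this is an assumption about your output requirements rather than part of the answer above, is to let Spark split the write itself with partitionBy:
# Writes classes_output/class_key=1/..., classes_output/class_key=2/..., etc.
res.write.partitionBy('class_key').format('csv').save('classes_output')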

Making Table without using Texttable

I am writing Python code to show items in a store. As I am still learning, I want to know how to make a table which looks exactly like a table made using Texttable.
My code is
Goods = ['Book','Gold']
Itemid= [711001,711002]
Price= [200,50000]
Count= [100,2]
Category= ['Books','Jewelry']
titles = ['', 'Item Id', 'Price', 'Count','Category']
data = [titles] + list(zip(Goods, Itemid, Price, Count, Category))
for i, d in enumerate(data):
    line = '|'.join(str(x).ljust(12) for x in d)
    print(line)
    if i == 0:
        print('=' * len(line))
My Output:
            |Item Id     |Price       |Count       |Category
================================================================
Book        |711001      |200         |100         |Books
Gold        |711002      |50000       |2           |Jewelry
Output I want:
+------+---------+-------+-------+-----------+
| | Item Id | Price | Count | Category |
+======+=========+=======+=======+===========+
| Book | 711001 | 200 | 100 | Books |
+------+---------+-------+-------+-----------+
| Gold | 711002 | 50000 | 2 | Jewelry |
+------+---------+-------+-------+-----------+
Your code is building its output by hand, using str.join(). You can do it that way, but it is very tedious. Use string formatting instead.
To help you along here is one line:
content_format = "| {Goods:4.4s} | {ItemId:<7d} | {Price:<5d} | {Count:<5d} | {Category:9s} |"
output_line = content_format.format(Goods="Book",ItemId=711001,Price=200,Count=100,Category="Books")
Texttable adjusts its cell widths to fit the data. If you want to do the same, then you will have to put computed field widths in content_format instead of using numeric literals the way I have done in the example above. Again, here is one example to get you going:
content_format = "| {Goods:4.4s} | {ItemId:<7d} | {Price:<5d} | {Count:<5d} | {Category:{CategoryWidth}s} |"
output_line = content_format.format(Goods="Book",ItemId=711001,Price=200,Count=100,Category="Books",CategoryWidth=9)
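Putting those ideas together, here is a minimal sketch (using the example rows from your question, not Texttable's own algorithm) that computes each column's width from the data and then draws the borders:
rows = [['Book', 711001, 200, 100, 'Books'],
        ['Gold', 711002, 50000, 2, 'Jewelry']]
titles = ['', 'Item Id', 'Price', 'Count', 'Category']
table = [titles] + [[str(x) for x in row] for row in rows]

# The widest cell in each column decides that column's width
widths = [max(len(row[i]) for row in table) for i in range(len(titles))]
border = '+' + '+'.join('-' * (w + 2) for w in widths) + '+'
header_border = border.replace('-', '=')

print(border)
for i, row in enumerate(table):
    print('| ' + ' | '.join(cell.ljust(w) for cell, w in zip(row, widths)) + ' |')
    print(header_border if i == 0 else border)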
But if you already know how to do this using Texttable, why not use that? Your comment says it's not available in Python: not true, I just downloaded version 0.9.0 using pip.
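For completeness, a sketch of what that looks like with Texttable itself (assuming pip install texttable; the first row passed to add_rows is treated as the header):
from texttable import Texttable

t = Texttable()
t.add_rows([['', 'Item Id', 'Price', 'Count', 'Category'],
            ['Book', 711001, 200, 100, 'Books'],
            ['Gold', 711002, 50000, 2, 'Jewelry']])
print(t.draw())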

Python Output Length

I'm attempting to output my database table data, which works except for long table rows. Each column needs to be as wide as the longest value in that column. I'm having trouble implementing a calculation to output the table proportionally instead of a huge mess when long rows are output (without using a third-party library, e.g. Print results in MySQL format with Python). Please let me know if you need more information.
Database connection:
connection = sqlite3.connect("test_.db")
c = connection.cursor()
c.execute("SELECT * FROM MyTable")
results = c.fetchall()
formatResults(results)
Table formatting:
def formatResults(x):
    try:
        widths = []
        columns = []
        tavnit = '|'
        separator = '+'
        for cd in c.description:
            widths.append(max(cd[2], len(cd[0])))
            columns.append(cd[0])
        for w in widths:
            tavnit += " %-" + "%ss |" % (w,)
            separator += '-' * w + '--+'
        print(separator)
        print(tavnit % tuple(columns))
        print(separator)
        for row in x:
            print(tavnit % row)
            print(separator)
        print("")
    except:
        showMainMenu()
        pass
Output problem example:
+------+------+---------+
| Date | Name | LinkOrFile |
+------+------+---------+
| 03-17-2016 | hi.com | Locky |
| 03-18-2016 | thisisitsqq.com | None |
| 03-19-2016 | http://ohiyoungbuyff.com\69.exe?1 | None |
| 03-20-2016 | http://thisisitsqq..com\69.exe?1 | None |
| 03-21-2016 | %Temp%\zgHRNzy\69.exe | None |
| 03-22-2016 | | None |
| 03-23-2016 | E52219D0DA33FDD856B2433D79D71AD6 | Downloader |
| 03-24-2016 | microsoft.com | None |
| 03-25-2016 | 89.248.166.132 | None |
| 03-26-2016 | http://89.248.166.131/55KB5js9dwPtx4= | None |
If your main problem is making column widths consistent across all the lines, this python package could do the job: https://pypi.python.org/pypi/tabulate
Below is a very simple example of a possible formatting approach.
The key point is to find the largest length of each column and then use the format method of the string object:
#!/usr/bin/python
import random
import string
from operator import itemgetter

def randomString(minLen=1, maxLen=10):
    """ Random string of length between 1 and 10 """
    l = random.randint(minLen, maxLen)
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(l))

COLUMNS = 4

def randomTable():
    table = []
    for i in range(10):
        table.append([randomString() for j in range(COLUMNS)])
    return table

def findMaxColumnLengs(table):
    """ Returns tuple of max column lengs """
    maxLens = [0] * COLUMNS
    for l in table:
        lens = [len(s) for s in l]
        maxLens = [max(maxLens[e[0]], e[1]) for e in enumerate(lens)]
    return maxLens

if __name__ == '__main__':
    ll = randomTable()
    ml = findMaxColumnLengs(ll)
    # tuple of formatting statements, see format docs
    formatStrings = ["{:<%s}" % str(m) for m in ml]
    fmtStr = "|".join(formatStrings)
    print "=================================="
    for l in ll:
        print l
    print "=================================="
    for l in ll:
        print fmtStr.format(*l)
This prints the initial table packed in the list of lists and the formatted output.
==================================
['2U7Q', 'DZK8Z5XT', '7ZI0W', 'A9SH3V3U']
['P7SOY3RSZ1', 'X', 'Z2W', 'KF6']
['NO8IEY9A', '4FVGQHG', 'UGMJ', 'TT02X']
['9S43YM', 'JCUT0', 'W', 'KB']
['P43T', 'QG', '0VT9OZ0W', 'PF91F']
['2TEQG0H6A6', 'A4A', '4NZERXV', '6KMV22WVP0']
['JXOT', 'AK7', 'FNKUEL', 'P59DKB8']
['BTHJ', 'XVLZZ1Q3H', 'NQM16', 'IZBAF']
['G0EF21S', 'A0G', '8K9', 'RGOJJYH2P9']
['IJ', 'SRKL8TXXI', 'R', 'PSUZRR4LR']
==================================
2U7Q |DZK8Z5XT |7ZI0W |A9SH3V3U
P7SOY3RSZ1|X |Z2W |KF6
NO8IEY9A |4FVGQHG |UGMJ |TT02X
9S43YM |JCUT0 |W |KB
P43T |QG |0VT9OZ0W|PF91F
2TEQG0H6A6|A4A |4NZERXV |6KMV22WVP0
JXOT |AK7 |FNKUEL |P59DKB8
BTHJ |XVLZZ1Q3H|NQM16 |IZBAF
G0EF21S |A0G |8K9 |RGOJJYH2P9
IJ |SRKL8TXXI|R |PSUZRR4LR
The code that you used is for MySQL. The critical part is the line widths.append(max(cd[2], len(cd[0]))) where cd[2] gives the length of the longest data in that column. This works for MySQLdb.
However, you are using sqlite3, for which the value cd[2] is set to None:
https://docs.python.org/2/library/sqlite3.html#sqlite3.Cursor.description
Thus, you will need to replace the following logic:
for cd in c.description:
    widths.append(max(cd[2], len(cd[0])))
    columns.append(cd[0])
with your own. The rest of the code should be fine as long as widths is computed correctly.
The easiest way to get the widths variable right is to traverse each row of the result, find the maximum width of each column, and store it in widths. This is just some pseudocode:
for cd in c.description:
    columns.append(cd[0])  # Get column headers

widths = [0] * len(c.description)  # Initialize to number of columns.
for row in x:
    for i in range(len(row)):  # This assumes that row is an iterable, like list
        v = row[i]  # Take value of ith column
        widths[i] = max(len(str(v)), widths[i])  # Compare with stored value; str() handles ints and None
At the end of this, widths should contain the maximum length of each column.
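Putting that together, a minimal sketch under the same assumptions (a sqlite3 cursor c and fetched rows x) that also accounts for the header widths, so a short column with a long name still lines up:
columns = [cd[0] for cd in c.description]
widths = [len(name) for name in columns]         # start from the header widths
for row in x:
    for i, v in enumerate(row):
        widths[i] = max(widths[i], len(str(v)))  # str() handles ints and None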
