Python pandas parallel processing using jug's TaskGenerator

I am trying to read a pandas dataframe and perform certain operations and return the dataframe. I also want to multiprocess the operation to take advantage of multiple cores that my system has.
import re
import time

import pandas as pd
from jug import TaskGenerator

@TaskGenerator
def find_replace(input_path_find):
    start_time = time.clock()
    df_find = pd.read_csv(input_path_find)
    # remove characters outside the allowed set
    df_find.currentTitle = df_find.currentTitle.str.replace(r"[^a-zA-Z0-9`~!|##%&_}={:\"\];<>,./. -]", r'')
    # collapse extra whitespace
    df_find.currentTitle = df_find.currentTitle.str.replace(r'\s+', ' ')
    # title length, used as a temporary sort key
    df_find['currentTitle_sort'] = df_find.currentTitle.str.len()
    # sort by length, longest first
    df_find = df_find.sort_values(by='currentTitle_sort', ascending=False)
    # reindex and drop the temporary column
    df_find.reset_index(drop=True, inplace=True)
    del df_find['currentTitle_sort']
    return df_find
When I pass the path of the CSV file I want to process:
df_returned = find_replace('C:\\Users\\Dell\\Downloads\\Find_Replace_in_this_v1.csv')
I get some unexpected output instead of a dataframe:
find_replace
Task(__main__.find_replace, args=('C:\\Users\\Dell\\Downloads\\Find_Replace_in_this_v1.csv',), kwargs={})
Any help? I basically want to save the output from the function.
I have already checked the answer to "Pandas memoization" and it didn't work. Also, I am using Python 2.7 with the Anaconda IDE.

This is a misunderstanding of how jug works.
The result you are getting is indeed a Task object, which you can run with df_returned.run().
Usually, though, you would save this script to a file (say analysis.py) and call jug execute analysis.py to execute the tasks.
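
To illustrate why calling a decorated function returns a task rather than a dataframe, here is a minimal sketch of deferred execution. ToyTask and toy_task_generator are hypothetical names made up for this illustration; they are not jug's classes, and jug itself does more than this (for example, it keeps results in a store so that jug execute can pick up the work).

```python
class ToyTask(object):
    """Hypothetical stand-in for a deferred task; not jug's real class."""

    def __init__(self, func, args, kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def run(self):
        # The wrapped function is only called here, not when the task is created.
        return self.func(*self.args, **self.kwargs)

    def __repr__(self):
        return "ToyTask(%s, args=%r, kwargs=%r)" % (
            self.func.__name__, self.args, self.kwargs)


def toy_task_generator(func):
    """Decorator: calling the decorated function builds a task instead of running it."""
    def make_task(*args, **kwargs):
        return ToyTask(func, args, kwargs)
    return make_task


@toy_task_generator
def add(a, b):
    return a + b


task = add(2, 3)     # a ToyTask object, nothing has been computed yet
result = task.run()  # the actual computation happens here
```

Printing task shows a description of the pending call, much like the Task(...) line in the question, while result holds the computed value.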

Related

Reading different set of json files same time with python

I have two sets of JSON files, b and c. Each set normally contains between 500 and 1000 files. Right now I am reading them separately. Can I read them at the same time using multi-threading? I have enough memory and processors.
import json
import pandas as pd

yc = ...  # number of c files
yb = ...  # number of b files

c_output_transaction_list = []
for num in range(yc):
    c_json_file = './output/d_c_' + str(num) + '.json'
    print(c_json_file)
    c_transaction_list = json.load(open(c_json_file))['data']['transaction_list']
    c_output_transaction_list.extend(c_transaction_list)
df_res_c = pd.DataFrame(c_output_transaction_list)

b_output_transaction_list = []
for num in range(yb):
    b_json_file = './output/d_b_' + str(num) + '.json'
    print(b_json_file)
    b_transaction_list = json.load(open(b_json_file))['data']['transaction_list']
    b_output_transaction_list.extend(b_transaction_list)
df_res_b = pd.DataFrame(b_output_transaction_list)

I use this method to read hundreds of files in parallel into a final dataframe. Since I don't have your data, you'll have to verify that it does what you want. Reading the multiprocessing documentation will help. I use the same code on Linux (AWS EC2 reading S3 files) and on Windows reading the same S3 files, and I find it saves a lot of time.
import json
import os
import pandas as pd
from multiprocessing import Pool

# Set the number of processes yourself or take cpu_count from the os module.
# Playing around with this does make a difference; for me, using the maximum
# isn't always the fastest overall.
num_proc = os.cpu_count()

# Define the function that creates a dataframe from one file.
# Note: this differs from your code, which builds one list and creates the dataframe at the end.
def json_parse(c_json_file):
    with open(c_json_file) as f:
        c_transaction_list = json.load(f)['data']['transaction_list']
    return pd.DataFrame(c_transaction_list)

# This is the multiprocessing function that feeds the file names to the parsing function.
# If you don't pass num_proc it defaults to 4.
def json_multiprocess(fn_list, num_proc=4):
    with Pool(num_proc) as pool:
        # map passes one file name per call, in chunks of 15.
        # starmap is only needed when each item is a tuple of several arguments (e.g. built with zip()).
        r = pool.map(json_parse, fn_list, 15)
    return r

if __name__ == '__main__':  # needed where worker processes are spawned, e.g. on Windows
    # build your file list first
    yc = ...  # number of c files
    flist = []
    for num in range(yc):
        c_json_file = './output/d_c_' + str(num) + '.json'
        flist.append(c_json_file)
    # get a list of your intermediate dataframes
    dfs = json_multiprocess(flist, num_proc=num_proc)
    # concatenate your dataframes
    df_res_c = pd.concat(dfs)
Then do the same for your next set of files...
Use the example in Aelarion's comment to help structure the file
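
Since the question asks about multi-threading specifically, here is a minimal sketch of the same idea with a thread pool from the standard library. It assumes the files have the data/transaction_list layout shown in the question; read_transactions, read_all and the temporary demo files are made up for illustration. Whether threads or processes are faster depends on how much of the time goes to disk or network I/O versus JSON parsing, so it is worth timing both.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_transactions(path):
    """Return the transaction list stored in one JSON file."""
    with open(path) as f:
        return json.load(f)['data']['transaction_list']


def read_all(paths, max_workers=8):
    """Read many files concurrently; results keep the order of `paths`."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        per_file = list(executor.map(read_transactions, paths))
    # Flatten the per-file lists into one list of transactions.
    return [row for rows in per_file for row in rows]


# Tiny demo with made-up files so the sketch runs on its own.
demo_dir = tempfile.mkdtemp()
demo_paths = []
for num in range(3):
    path = os.path.join(demo_dir, 'd_c_%d.json' % num)
    with open(path, 'w') as f:
        json.dump({'data': {'transaction_list': [{'id': num}]}}, f)
    demo_paths.append(path)

transactions = read_all(demo_paths)
```

The combined list can then be passed to pd.DataFrame(transactions), and the b files can be read with a second call to read_all.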

Python and Dask - reading and concatenating multiple files

I have some parquet files, all coming from the same domain but with some differences in structure. I need to concatenate all of them. Below some example of these files:
file 1:
A,B
True,False
False,False
file 2:
A,C
True,False
False,True
True,True
What I am looking to do is to read and concatenate these files in the fastest way possible obtaining the following result:
A,B,C
True,False,NaN
False,False,NaN
True,NaN,False
False,NaN,True
True,NaN,True
To do that I am using the following code, adapted from "Reading multiple files with Dask" and "Dask dataframes: reading multiple files & storing filename in column":
import glob

import dask
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client

def read_parquet(path):
    return pd.read_parquet(path)

if __name__ == '__main__':
    files = glob.glob('test/*/file.parquet')
    print('Start dask client...')
    client = Client()
    results = [dd.from_delayed(dask.delayed(read_parquet)(f)) for f in files]
    results = dd.concat(results).compute()
    client.close()
This code works, and it is already the fastest version I could come up with (I tried sequential pandas and multiprocessing.Pool). My idea was that Dask could ideally start part of the concatenation while still reading some of the files, however, from the task graph I see some sequential reading of the metadata of each parquet file, see the screenshot below:
The first part of the task graph is a mixture of read_parquet followed by read_metadata. The first part always shows only 1 task executed (in the task processing tab). The second part is a combination of from_delayed and concat and it is using all of my workers.
Any suggestion on how to speed up the file reading and reduce the execution time of the first part of the graph?

The problem with your code is that you use the Pandas version of read_parquet.
Instead use:
the dask version of read_parquet,
the map and gather methods offered by Client,
the dask version of concat.
Something like:
def read_parquet(path):
    return dd.read_parquet(path)

def myRead():
    L = client.map(read_parquet, glob.glob('file_*.parquet'))
    lst = client.gather(L)
    return dd.concat(lst)

result = myRead().compute()
Before that I created a client, once only. The reason is that during my earlier experiments I got an error message when I attempted to create it again (in a function), even though the first instance had been closed before.

Python: Do not remove the data when I stop the program. Loading a very big database only once

So, I have this database with thousands of rows and columns. At the start of the program I load the data and assign a variable to it:
data=np.loadtxt('database1.txt',delimiter=',')
Since this database contains many elements, it takes minutes to start the program. Is there a way in Python (similar to .mat files in MATLAB) to load the data only once, even when I stop the program and then run it again? Currently my time is wasted waiting for the program to load the data if I just change a small thing for testing.

Firstly, the NumPy package isn't good at reading large text files; the Pandas package is much stronger at this.
So just stop using np.loadtxt and start using pd.read_csv instead.
But if you want to keep using NumPy, I think the np.fromfile() function is more efficient and faster than np.loadtxt(). Note that it returns a flat 1-D array, so check that the result contains all your values and reshape it to the original number of columns.
So, my advice is to try:
data = np.fromfile('database1.txt', sep=',')
instead of:
data = np.loadtxt('database1.txt',delimiter=',')

You could use pickle to cache your data.
import pickle
import os
import numpy as np

if os.path.isfile("cache.p"):
    # cache exists: load the already parsed array
    with open("cache.p", "rb") as f:
        data = pickle.load(f)
else:
    # first run: parse the text file, then write the cache
    data = np.loadtxt('database1.txt', delimiter=',')
    with open("cache.p", "wb") as f:
        pickle.dump(data, f)
The first run will be very slow, but later executions will be pretty fast.
I just tested this with a file containing 1 million rows and 20 columns of random floats: it took ~30s the first time and ~0.4s the following times.
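
One thing the snippet above does not handle is a source file that changes after the cache was written. A minimal sketch of one way to deal with that is to compare modification times. The helper load_with_cache, the parser parse_csv_floats and the demo file contents are made up for illustration; in the real program the parse argument could be something like lambda p: np.loadtxt(p, delimiter=',').

```python
import os
import pickle
import tempfile


def load_with_cache(source_path, cache_path, parse):
    """Return parse(source_path), reusing a pickle cache while it is up to date."""
    cache_is_fresh = (
        os.path.isfile(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)
    )
    if cache_is_fresh:
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    # Cache missing or older than the source: parse again and rewrite it.
    data = parse(source_path)
    with open(cache_path, 'wb') as f:
        pickle.dump(data, f)
    return data


parse_calls = []  # records how often the slow parser actually runs


def parse_csv_floats(path):
    """Hypothetical slow parser: rows of comma-separated floats."""
    parse_calls.append(path)
    with open(path) as f:
        return [[float(x) for x in line.split(',')] for line in f if line.strip()]


# Tiny demo with a made-up two-row file.
demo_dir = tempfile.mkdtemp()
source = os.path.join(demo_dir, 'database1.txt')
cache = os.path.join(demo_dir, 'cache.p')
with open(source, 'w') as f:
    f.write('1.0,2.0\n3.0,4.0\n')

first = load_with_cache(source, cache, parse_csv_floats)   # parses and writes the cache
second = load_with_cache(source, cache, parse_csv_floats)  # served from the cache
```

In the demo the parser runs only once; the second call is answered from cache.p.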

Nested For Loops With Calculations Vs. Linear Process

I'm iterating over M dataframes, each containing a column with N URLs. For each URL, I extract paragraph text, then conduct standard cleaning for textual analysis before calculating "sentiment" scores.
Is it more efficient for me to:
Continue as it is (compute scores in the URL for-loop itself), or
Extract all of the text from the URLs first, and then separately iterate over the list / column of text?
Or does it not make any difference?
I am currently running the calculations within the loop itself. Each DF has about 15,000 - 20,000 URLs, so it's taking an insane amount of time too!
import pandas as pd
import requests
from bs4 import BeautifulSoup

# DFs are stored on a website
# I extract links to each .csv file and store it as a list in "df_links"
for link in df_links:
    cleaned_articles = []
    df = pd.read_csv(link, sep="\t", header=None)
    # Conduct df cleaning
    # URLs for articles to scrape are stored in 1 column, which I iterate over as...
    for url in df['article_url']:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        para_text = [text.get_text() for text in soup.findAll('p')]
        text = " ".join(para_text)
        words = text.split()
        if len(words) > 500:
            # Conduct Text Cleaning & Scores Computations
            # Cleaned text stored as a variable "clean_text"
            cleaned_articles.append(clean_text)
    df['article_text'] = cleaned_articles
    df.to_csv('file_name.csv')

To answer the question: it shouldn't make much of a difference whether you download the data first and then analyse it. You'd just be rearranging the order in which you do a set of tasks that would effectively take the same time.
The only difference may arise if the text corpora are rather large, in which case read/write time to disk will start to play a part, so running the analytics entirely in memory could be a little faster. But this still isn't going to really solve your problem.
May I be so bold as to reinterpret your question as: "My analysis is taking too long, help me speed it up!"
This sounds like a perfect use case for multiprocessing! Since this sounds like a data science project, you'll need to pip install multiprocess if you're using an IPython notebook (like Jupyter), or import multiprocessing if you're using a Python script. This is because of the way Python passes information between processes. Don't worry though, the APIs of multiprocess and multiprocessing are identical!
A basic and easy way to speed up your analysis is to indent your for loop and put it in a function. That function can then be passed to a multiprocessing map, which can spawn multiple processes and run the analysis on several URLs at once:
from multiprocess import Pool
import numpy as np
import os
import pandas as pd

num_cpus = os.cpu_count()

def analytics_function(links):
    # Your full function, including fetching data, goes here; it accepts an array of links
    return something

df_links_split = np.array_split(df_links, num_cpus * 2)  # I normally just use 2 as a rule of thumb
pool = Pool(num_cpus * 2)  # Start a pool with num_cpus * 2 processes
list_of_returned = pool.map(analytics_function, df_links_split)
This will spin up a load of processes and utilise your full CPU. You won't be able to do much else on your computer, and you'll need to keep your resource monitor open to check that you're not maxing out your memory and slowing down or crashing the process. But it should significantly speed up your analysis, by up to roughly a factor of num_cpus * 2!

Whether you extract all of the texts and then process them, or extract one text and process it before extracting the next, won't make any difference.
Doing ABABAB takes as much time as doing AAABBB.
You might, however, be interested in using threads or asynchronous requests to fetch all of the data in parallel.
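
As a minimal sketch of the thread-based option, the standard library's ThreadPoolExecutor can overlap the time spent waiting on each request. To keep the example runnable without network access, fake_fetch stands in for a real call such as requests.get(url).text; fetch_all, fake_fetch and the example.com URLs are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=16):
    """Call fetch(url) for every URL on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))


def fake_fetch(url):
    """Stand-in for a network call such as requests.get(url).text."""
    time.sleep(0.05)  # simulate waiting on the network
    return '<p>text from %s</p>' % url


urls = ['http://example.com/%d' % i for i in range(20)]
start = time.time()
pages = fetch_all(urls, fake_fetch)
elapsed = time.time() - start  # well below the 20 * 0.05 s a sequential loop would need
```

The downloaded pages can then be cleaned and scored in an ordinary loop, or handed to a process pool if the scoring itself turns out to be the slow part.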

python - using functions from a different py script gives NameError

I have a py script, let's call it MergeData.py, where I merge two data files. Since I have a lot of pairs of data files that have to be merged, I thought it would be good for readability to put my code in MergeData.py into a function, say merge_data(), and call this function in a loop over all my pairs of data files in a different py script.
2 Questions:
Is it wise, in terms of speed, to call the function from a different file instead of running the code directly in the loop? (I have thousands of pairs that have to be merged.)
I thought that, to use the function in MergeData.py, I have to include from MergeData import merge_data at the head of my script. Within the function merge_data I make use of pandas, which I import in the main file with 'import pandas as pd'. When calling the function I get the error 'NameError: global name 'pd' is not defined'. I have tried all possible places to import the pandas module, even within the function, but the error keeps popping up. What am I doing wrong?
In MergeData.py I have
def merge_data(myFile1,myFile2):
    df1 = pd.read_csv(myFile1)
    df2 = pd.read_csv(myFile2)
    # ... my code
and in the other file I have
import pandas as pd
from MergeData import merge_data
# then some code to get my file names followed by
FileList = zip(FileList1,FileList2)
for myFile1,myFile2 in FileList:
    # Run Merging Algorithm
    dataEq = merge_data(myFile1,myFile2)
I am aware of "What is the best way to call a Python script from another Python script?", but cannot really see if that relates to me.

You need to move the line
import pandas as pd
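into the module in which the symbol pd is actually needed, i.e. move it out of your "other file" and into your MergeData.py file. Each module has its own global namespace, so a name imported in the calling script is not visible inside MergeData.py.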
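
The following self-contained sketch shows the same rule in action. To avoid depending on pandas it uses the standard json module as a stand-in, and the module name merge_demo is made up for illustration: the throwaway module imports what it needs itself, so the calling code does not have to.

```python
import importlib
import os
import sys
import tempfile

# Source of a throwaway module that imports the name it uses.
module_source = (
    "import json  # imported here, where the name is actually used\n"
    "\n"
    "def merge_data(text1, text2):\n"
    "    merged = json.loads(text1)\n"
    "    merged.update(json.loads(text2))\n"
    "    return merged\n"
)

demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, 'merge_demo.py'), 'w') as f:
    f.write(module_source)

# Make the temporary directory importable and load the module.
sys.path.insert(0, demo_dir)
merge_demo = importlib.import_module('merge_demo')

# The caller never has to import json for this to work: merge_data
# looks the name up in merge_demo's own global namespace.
merged = merge_demo.merge_data('{"a": 1}', '{"b": 2}')
```

If the import line were removed from the module and placed only in the calling script, the call would fail with a NameError like the one in the question.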
