What can I do to prevent the same file from being read more than once?
For background, here are the details.
I'm trying to read a list of files in a folder, transform them, write the output to a file, and check the gap between the data before and after the transformation.
First, the reading part:
import glob2
import pandas as pd
import dask.dataframe as dd
from dask import delayed

def load_file(file):
    df = pd.read_excel(file)
    return df

file_list = glob2.glob("folder path here")
future_list = [delayed(load_file)(file) for file in file_list]
read_result_dd = dd.from_delayed(future_list)
After that, I do some transformation to the data:
def transform(df):
    # do something to df
    return df

transformation_result = read_result_dd.map_partitions(lambda df: transform(df))
I would like to achieve two things.
First, to get the transformation output:
Outputfile = transformation_result.compute()
Outputfile.to_csv("path and param here")
Second, to get the comparison:
read_result_comp = read_result_dd.groupby("groupby param here")["result param here"].sum().reset_index()
transformation_result_comp = transformation_result.groupby("groupby param here")["result param here"].sum().reset_index()
Checker = read_result_dd.merge(transformation_result, on=['header_list'], how='outer').compute()
Checker.to_csv("path and param here")
The problem is that if I compute Outputfile and Checker in sequence, i.e.:
Outputfile = transformation_result.compute()
Checker = read_result_dd.merge(transformation_result, on=['header_list'], how='outer').compute()
Outputfile.to_csv("path and param here")
Checker.to_csv("path and param here")
it will read all of the files twice (once for each compute).
Is there any way to have the files read only once?
Also, is there a way to have both compute() calls run as one sequence? (If I run them as two separate calls, the Dask dashboard shows that it runs the first, clears the dashboard, and then runs the second, instead of running both in a single sequence.)
I cannot simply call .compute() on the read result and reuse it, because the resulting dataframe is too big for my RAM. Both the checker and the output file are significantly smaller than the original data.
Thanks
You can call the dask.compute function on multiple Dask collections at once; the shared parts of the task graph (such as the file reads) are then only executed once:
a, b = dask.compute(a, b)
https://docs.dask.org/en/latest/api.html#dask.compute
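Applied to the variables above, a rough sketch (reusing the question's placeholder paths) would look like:

import dask

# build both lazy results first, then compute them together so the shared
# file-reading tasks in the graph are executed only once
checker_dd = read_result_dd.merge(transformation_result, on=['header_list'], how='outer')

Outputfile, Checker = dask.compute(transformation_result, checker_dd)

Outputfile.to_csv("path and param here")
Checker.to_csv("path and param here")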
In the future, I recommend producing an MCVE
I'm trying to retrieve the index of a row within a dataframe using the loc method and a comparison of data from another dataframe within a for loop. Maybe I'm going about this wrong, I dunno. Here's a bit of information to help give the problem some context...
The following function imports some inventory data into a pandas dataframe from an xlsx file; this seemingly works just fine:
def import_inventory():
    import warnings
    try:
        with warnings.catch_warnings(record=True):
            warnings.simplefilter("always")
            return pandas.read_excel(config_data["inventory_file"], header=1)
    except Exception as E:
        writelog.error(E)
        sys.exit(E)
The following function imports some data from a combination of CSV files, creating a singular dataframe to work from during comparison; this seemingly works just fine:
def get_report_results():
    output_dir = f"{config_data['output_path']}/reports"
    report_ids = []
    ......
    ...execute and download the report csv files
    ......
    reports_content = []
    for path, current_directory, files in os.walk(output_dir):
        for file in files:
            file_path = os.path.join(path, file)
            # clean_csv_data simply cleans up the CSV content (removes blank rows and
            # unnecessary footer data) and updates the same file that was passed in
            clean_csv_data(file_path)
            current_file_content = pandas.read_csv(file_path, index_col=None, header=7)
            reports_content.append(current_file_content)
    reports_content = pandas.concat(reports_content, axis=0, ignore_index=True)
    return reports_content
The problem occurs in the following function, which is supposed to search the reports content for the existence of an ID value and then grab that row's index so I can use it later to modify some columns and add some columns.
def search_reports(inventory_df, reports_df):
    for index, row in inventory_df.iterrows():
        reports_index = reports_df.loc[reports_df["Inventory ID"] == row["Inv ID"]].index[0]
        print(reports_df.iloc[reports_index]["Lookup ID"])
Here's the error I receive upon comparison
Length of values (1) does not match length of index (4729)
I can't quite figure out why this is happening. If I pull everything out of functions the work seems to happen the way it should. Any ideas?
There's a bit more work happening to the dataframe that comes from import_inventory, but I didn't want to clutter the question. It's nothing major - one function adds a few columns by splitting a comma-separated value in the inventory out into its own columns, another adds a column based on the contents of another column.
Edit:
As requested, the full stack trace is below. I've also included the other functions that operate on the original inventory_df object between its retrieval (import_inventory) and its final comparison (search_reports).
This function also operates on the inventory_df dataframe, only this time it retrieves a single column from each row (if it has data) and breaks the semicolon-separated list of key:value tags apart for further inspection. If it finds one of the expected keys, it creates the necessary column for it and populates that row with the found value.
def sort_tags(inventory_df):
    cluster_key = "Cluster:"
    nodetype_key = "NodeType:"
    project_key = "project:"
    tags = inventory_df["Tags List"]
    for index, tag in tags.items():
        if not pandas.isna(tag):
            tag_keysvalues = tag.split(";")
            if any(cluster_key in string for string in tag_keysvalues):
                pair = [x for x in tag_keysvalues if x.startswith(cluster_key)]
                key_value_split = pair[0].split(":")
                inventory_df.loc[index, "Cluster Name"] = key_value_split[1]
            if any(nodetype_key in string for string in tag_keysvalues):
                pair = [x for x in tag_keysvalues if x.startswith(nodetype_key)]
                key_value_split = pair[0].split(":")
                inventory_df.loc[index, "Node Type"] = key_value_split[1]
            if any(project_key in string for string in tag_keysvalues):
                pair = [x for x in tag_keysvalues if x.startswith(project_key)]
                key_value_split = pair[0].split(":")
                inventory_df.loc[index, "Project Name"] = key_value_split[1]
    return inventory_df
This function compares the new inventory DF with a CSV import-to-DF of the old inventory. It creates new columns based on old inventory data if it finds a match. I know this is ugly code, but I'm hoping to replace it when I can find a solution to my current problem.
def compare_inventories(old_inventory_df, inventory_df):
    aws_rowcount = len(inventory_df)
    now = parser.parse(datetime.utcnow().isoformat()).replace(tzinfo=timezone.utc).astimezone(tz=None)
    for a_index, a_row in inventory_df.iterrows():
        if a_row["Comments"] != "none":
            for o_index, o_row in old_inventory_df.iterrows():
                last_checkin = parser.parse(str(o_row["last_checkin"])).replace(tzinfo=timezone.utc).astimezone(tz=None)
                if (a_row["Comments"] == o_row["asset_name"]) and ((now - timedelta(days=30)) <= last_checkin):
                    inventory_df.loc[a_index, ["Found in OldInv", "OldInv Address", "OldInv Asset ID", "Inv ID"]] = ["true", o_row["address"], o_row["asset_id"], o_row["host_id"]]
    return inventory_df
Here's the stack trace for the error:
Traceback (most recent call last):
  File "c:\Users\beefcake-quad\Code\INVENTORYAssetSnapshot\main.py", line 52, in main
    reports_index = reports_df.loc[reports_df["Inventory ID"] == row["Inv ID"]].index
  File "c:\Users\beefcake-quad\Code\INVENTORYAssetSnapshot\.venv\lib\site-packages\pandas\core\ops\common.py", line 70, in new_method
    return method(self, other)
  File "c:\Users\beefcake-quad\Code\INVENTORYAssetSnapshot\.venv\lib\site-packages\pandas\core\arraylike.py", line 40, in __eq__
    return self._cmp_method(other, operator.eq)
  File "c:\Users\beefcake-quad\Code\INVENTORYAssetSnapshot\.venv\lib\site-packages\pandas\core\series.py", line 5625, in _cmp_method
    return self._construct_result(res_values, name=res_name)
  File "c:\Users\beefcake-quad\Code\INVENTORYAssetSnapshot\.venv\lib\site-packages\pandas\core\series.py", line 3017, in _construct_result
    out = self._constructor(result, index=self.index)
  File "c:\Users\beefcake-quad\Code\INVENTORYAssetSnapshot\.venv\lib\site-packages\pandas\core\series.py", line 442, in __init__
    com.require_length_match(data, index)
  File "c:\Users\beefcake-quad\Code\INVENTORYAssetSnapshot\.venv\lib\site-packages\pandas\core\common.py", line 557, in require_length_match
    raise ValueError(
ValueError: Length of values (1) does not match length of index (7150)
The line

reports_index = reports_df.loc[report_data["Inventory ID"] == row["Inv ID"].index[0]

is missing a ] at the end.
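With the bracket placed so that it closes the .loc[...] indexer, as in the search_reports function shown above, the line would read:

# close the boolean-mask .loc[...] first, then take the first matching index
reports_index = reports_df.loc[reports_df["Inventory ID"] == row["Inv ID"]].index[0]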
I have a very long dictionary containing more than 1 million keys.
df['file'] = files
df['features'] = df.apply(lambda row: getVector(model, row['file']), axis=1)
Below is the getVector function:
def getVector(model, file):
    # ... code that builds `inp` from the image file goes here ...
    vector = model.predict(inp.reshape(1, 100, 100, 1))
    print(file + " is added.")
    return vector
But while it prints "blahblah.jpg is added.", that is of little use, because I still do not know how many files have been processed.
My question is: how do I get the count of files that have been processed?
For example,
1200 out of 1,000,000 is processed.
or even just
1200
I do not need a neat and clean output; I just want to know how many files have been processed.
Any idea would be appreciated.
You can use a decorator to count it:
import functools

def count(func):
    @functools.wraps(func)
    def wrappercount(*args):
        result = func(*args)
        func.counter += 1
        print(func.counter, "of 1,000,000 is processed.")
        return result  # pass the vector back so df.apply still receives it
    func.counter = 0
    return wrappercount

@count
def getVector(*args):
    # write code and add arguments
Learn more about decorators here.
If you have a class for which you want to count a method's usage, you can use the following:
class getVector:
    counter = 0
    def __init__(self):  # add your arguments here
        # write code here
        getVector.counter += 1
        print(getVector.counter, "of 1,000,000 is processed.")
How about creating a column in the initial dataframe with incremental numbers? You would use this column as an input to your getVector function and display its value.
Something like this:

df['features'] = df.apply(lambda row: getVector(model, row['file'], row['count']), axis=1)
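A rough, self-contained sketch of that idea (the dummy file list and the placeholder body of getVector are only stand-ins to make it runnable; in the real code the model.predict(...) call from the question goes inside):

import pandas as pd

files = [f"img_{i}.jpg" for i in range(5)]  # stand-in for the real list of files
total = len(files)

def getVector(file, count):
    # the real model.predict(...) call would go here
    print(f"{count} out of {total:,} is processed.")
    return [0.0]  # placeholder "vector"

df = pd.DataFrame({'file': files})
df['count'] = range(1, total + 1)           # the incremental-number column
df['features'] = df.apply(lambda row: getVector(row['file'], row['count']), axis=1)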
So I already checked other questions here about (almost) the same topic; however, I did not find something that solves my problem.
Basically, I have a piece of code in Python that tries to open the file as a data frame and execute some eye tracking functions (PyGaze). I have 1000 files that I need to analyse and wanted to create a for-loop to execute my code on all the files automatically.
The code is the following:
os.chdir("/Users/Documents//Analyse/Eye movements/Python - Eye Analyse")
directory = '/Users/Documents/Analyse/Eye movements/R - Filtering Data/Filtered_data/Filtered_data_test'

for files in glob.glob(os.path.join(directory, "*.csv")):
    # Download csv, plot
    df = pd.read_csv(files, parse_dates=True)
    # Plot raw data
    plt.plot(df['eye_x'], df['eye_y'], 'ro', c="red")
    plt.ylim([0, 1080])
    plt.xlim([0, 1920])
    # Fixation analysis
    from detectors import fixation_detection
    fixations_data = fixation_detection(df['eye_x'], df['eye_y'], df['time'], maxdist=25, mindur=100)
    Efix_data = fixations_data[1]
    numb_fixations = len(Efix_data)  # number of fixations
    fixation_start = [i[0] for i in Efix_data]
    fixation_stop = [i[1] for i in Efix_data]
    fixation = {'start': fixation_start, 'stop': fixation_stop}
    fixation_frame = pd.DataFrame(data=fixation)
    fixation_frame['difference'] = fixation_frame['stop'] - fixation_frame['start']
    mean_fixation_time = fixation_frame['difference'].mean()  # mean fixation time
    final = {'number_fixations': [numb_fixations], 'mean_fixation_time': [mean_fixation_time]}
    final_frame = pd.DataFrame(data=final)
    # write everything in one document
    final_frame.to_csv("/Users/Documents/Analyse/Eye movements/final_data.csv")
The code runs (no errors); however, it only works for the first file. The code is not run for the other files present in the folder/directory.
I do not see where my mistake is.
Your output file name is constant, so it gets overwritten on each iteration of the for loop. Try the following in place of your final line; it opens the file in "append" mode instead:
#write everything in one document
with open("/Users/Documents/Analyse/Eye movements/final_data.csv", "a") as f:
    final_frame.to_csv(f, header=False)
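Alternatively (just a sketch of the same idea, using the question's path), to_csv can append on its own via its mode parameter, writing the header row only when the file does not exist yet:

import os

out_path = "/Users/Documents/Analyse/Eye movements/final_data.csv"
# append on every iteration; write the header only the first time
final_frame.to_csv(out_path, mode="a", header=not os.path.isfile(out_path))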
I am having an issue using the median function in numpy. The code used to work on a previous computer, but when I tried to run it on my new machine I got the error "cannot perform reduce with flexible type". To try to fix this, I attempted to use the map() function to make sure my list contained floating-point values, and got this error message: could not convert string to float: .
After some more attempts at debugging, it seems that my issue is with the splitting of the lines in my input file. The lines are of the form 2456893.248202,4.490 and I want to split on the ",". However, when I print out the list for the second column of that line, I get
4
.
4
9
0
so it seems to somehow be splitting each character, though I'm not sure how. The relevant section of code is below; I appreciate any thoughts or ideas, and thanks in advance.
def curve_split(fn):
    with open(fn) as f:
        for line in f:
            line = line.strip()
            time, lc = line.split(",")
            # debugging stuff
            g = open('test.txt', 'w')
            l1 = map(lambda x: x + '\n', lc)
            g.writelines(l1)
            g.close()
            # end debugging stuff
    return time, lc
if __name__ == '__main__':
    # place where I keep the lightcurve files from the image subtraction
    dirname = '/home/kuehn/m4/kepler/subtraction/detrending'
    files = glob.glob(dirname + '/*lc')
    print(len(files))
    # in order to create our lightcurve array, we need to know
    # the length of one of our lightcurve files
    lc0 = curve_split(files[0])
    lcarr = np.zeros([len(files), len(lc0)])
    # loop through every file
    for i, fn in enumerate(files):
        time, lc = curve_split(fn)
        lc = map(float, lc)
        # debugging
        print(fn[5:58])
        print(lc)
        print(time)
        # end debugging
        lcm = lc/np.median(float(lc))
        #lcm = ((lc[qual0]-np.median(lc[qual0]))/
        #       np.median(lc[qual0]))
        lcarr[i] = lcm
        print(fn, i, len(files))
What I am essentially looking for is the `paste` command in bash, but in Python 2. Suppose I have a csv file:
a1,b1,c1,d1
a2,b2,c2,d2
a3,b3,c3,d3
And another such:
e1,f1
e2,f2
e3,f3
I want to pull them together into this:
a1,b1,c1,d1,e1,f1
a2,b2,c2,d2,e2,f2
a3,b3,c3,d3,e3,f3
This is the simplest case, where I have a known number of files, and only two of them. What if I wanted to do this with an arbitrary number of files, without knowing how many I have?
I am thinking along the lines of using zip with a list of csv.reader iterables. There will be some unpacking involved, but it seems this much Python-foo is above my IQ level ATM. Can someone suggest how to implement this idea, or something completely different?
I suspect this should be doable with a short snippet. Thanks.
file1 = open("file1.csv", "r")
file2 = open("file2.csv", "r")

for line in file1:
    # print() already adds a newline, so don't append another "\n" here
    print(line.strip().strip(",") + "," + file2.readline().strip())
Extendable to as many files as you wish; just keep adding to the print statement. Instead of printing, you can also append to a list or do whatever else you wish. You may have to worry about the files having different lengths; I did not, as you did not specify.
Assuming the number of files is unknown, and that all the files are properly formatted as csv and have the same number of lines:
files = ['csv1', 'csv2', 'csv3']
fs = map(open, files)
done = False
while not done:
    chunks = []
    for f in fs:
        try:
            l = next(f).strip()
            chunks.append(l)
        except StopIteration:
            done = True
            break
    if not done:
        print ','.join(chunks)
for f in fs:
    f.close()
There seems to be no easy way of using context managers with a variable list of files, at least in Python 2 (see a comment in the accepted answer here), so the files have to be closed manually, as above.
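For completeness, the zip-with-csv.reader idea from the question could be sketched roughly like this (assuming, as above, equal-length and well-formed files):

import csv

files = ['csv1', 'csv2', 'csv3']
fs = [open(name, 'rb') for name in files]  # 'rb' for csv.reader on Python 2
readers = [csv.reader(f) for f in fs]
for rows in zip(*readers):
    # rows holds one row (a list of fields) from each file; flatten and join them
    print ','.join(field for row in rows for field in row)
for f in fs:
    f.close()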
You could try pandas
In your case, the group of columns [a,b,c,d] and the group [e,f] can each be treated as a DataFrame in pandas, and it's easy to join them because pandas has a function called concat.
import pandas as pd

# define group [a-d] as df1 (header=None because the sample csv has no header row)
df1 = pd.read_csv('1.csv', header=None)
# define group [e-f] as df2
df2 = pd.read_csv('2.csv', header=None)

pd.concat([df1, df2], axis=1)
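If the combined rows then need to be written back out as csv (as in the paste example), the concatenated frame can simply be saved; the output file name here is only an assumption:

combined = pd.concat([df1, df2], axis=1)
combined.to_csv('combined.csv', index=False, header=False)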