seek() function without output - Python

tips_file = open('code_tips.txt', 'w+')
tips_file.write('-use simple function and variable names\n-comment code\n-organize code into functions\n')
tips_file.seek(0)
tips_text = tips_file.read()
print (tips_text)
There is no output for the above code.
But the following code does produce output:
tips_file.seek(13)
tips_text = tips_file.read()
print(tips_text)
I am wondering why there is a difference.
I just figured out what happens if I separate the code into 2 cells:
tips_file = open('code_tips.txt', 'w+')
tips_file.write('-use simple function and variable names\n-comment code\n-organize code into functions\n')
and
tips_file.seek(0)
tips_text = tips_file.read()
print(tips_text)
Then I get output if I run the first cell once and the second cell twice. If I run the second cell only once, it gives no output, the same as in the original problem. I am really confused.
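For reference, here is a minimal sketch of the write, seek, read pattern from the question, run as a single script. The temporary directory and the `with` block are assumptions added for the sketch and are not part of the original code. In a plain CPython 3 script, `seek(0)` followed by `read()` is expected to return everything that was written, while `read()` straight after `write()` returns an empty string because the position is at the end of the file.

```python
import os
import tempfile

text = ('-use simple function and variable names\n'
        '-comment code\n'
        '-organize code into functions\n')

# Assumption: a throwaway directory stands in for the notebook's working folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'code_tips.txt')

    # 'w+' truncates the file and allows both writing and reading.
    with open(path, 'w+') as tips_file:
        tips_file.write(text)

        # After write() the position is at the end of the file,
        # so read() without a prior seek() returns an empty string.
        at_end = tips_file.read()

        tips_file.seek(0)  # rewind to the start
        full_text = tips_file.read()

        # Offset 13 is usable here because the text is plain ASCII.
        tips_file.seek(13)
        tail_text = tips_file.read()

print(repr(at_end))
print(full_text)
print(tail_text)
```

If the notebook shows different behaviour, comparing it against this script run outside the notebook may help narrow down whether the cell execution order is involved.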

Related

Prevent reading data multiple times using Dask

What can I do to prevent the same file from being read more than once?
For background, the details are below.
I'm trying to read a list of files in a folder, transform the data, write the result to a file, and check the gap before and after the transformation.
First, the reading part:
def load_file(file):
    df = pd.read_excel(file)
    return df
file_list = glob2.glob("folder path here")
future_list = [delayed(load_file)(file) for file in file_list]
read_result_dd = dd.from_delayed(future_list)
After that, I do some transformation on the data:
def transform(df):
    # do something to df
    return df
transformation_result = read_result_dd.map_partitions(lambda df: transform(df))
I would like to achieve 2 things.
First, to get the transformation output:
Outputfile = transformation_result.compute()
Outputfile.to_csv("path and param here")
Second, to get the comparison:
read_result_comp = read_result_dd.groupby("groupby param here")["result param here"].sum().reset_index()
transformation_result_comp = transformation_result.groupby("groupby param here")["result param here"].sum().reset_index()
Checker = read_result_dd.merge(transformation_result, on=['header_list'], how='outer').compute()
Checker.to_csv("path and param here")
The problem is that if I call Outputfile and Checker in sequence, i.e.:
Outputfile = transformation_result.compute()
Checker = read_result_dd.merge(transformation_result, on=['header_list'], how='outer').compute()
Outputfile.to_csv("path and param here")
Checker.to_csv("path and param here")
it reads every input file twice (once for each compute() call).
Is there any way to have the read done only once?
Also, is there any way to have both compute() calls run as a single sequence? (If I run them on two lines, the Dask dashboard shows the first one running, then the dashboard clears, and then the second one runs, instead of both running in a single sequence.)
I cannot call .compute() on the read result itself because my RAM can't hold it; the resulting dataframe is too big. Both the checker and the output file are significantly smaller than the original data.
Thanks
You can call the dask.compute function on multiple Dask collections at once, so that work shared between them is computed in a single pass:
a, b = dask.compute(a, b)
https://docs.dask.org/en/latest/api.html#dask.compute
In the future, I recommend producing an MCVE (minimal, complete, verifiable example).

For-loop over multiple files in same directory in Python

I already checked other questions here about (almost) the same topic, but I did not find anything that solves my problem.
Basically, I have a piece of code in Python that tries to open the file as a data frame and execute some eye tracking functions (PyGaze). I have 1000 files that I need to analyse and wanted to create a for-loop to execute my code on all the files automatically.
The code is the following:
os.chdir("/Users/Documents//Analyse/Eye movements/Python - Eye Analyse")
directory = '/Users/Documents/Analyse/Eye movements/R - Filtering Data/Filtered_data/Filtered_data_test'
for files in glob.glob(os.path.join(directory,"*.csv")):
    #Load csv, plot
    df = pd.read_csv(files, parse_dates = True)
    #Plot raw data
    plt.plot(df['eye_x'],df['eye_y'], 'ro', c="red")
    plt.ylim([0,1080])
    plt.xlim([0,1920])
    #Fixation analysis
    from detectors import fixation_detection
    fixations_data = fixation_detection(df['eye_x'],df['eye_y'], df['time'],maxdist=25, mindur=100)
    Efix_data = fixations_data[1]
    numb_fixations = len(Efix_data) #number of fixations
    fixation_start = [i[0] for i in Efix_data]
    fixation_stop = [i[1] for i in Efix_data]
    fixation = {'start' : fixation_start, 'stop': fixation_stop}
    fixation_frame = pd.DataFrame(data=fixation)
    fixation_frame['difference'] = fixation_frame['stop'] - fixation_frame['start']
    mean_fixation_time = fixation_frame['difference'].mean() #mean fixation time
    final = {'number_fixations' : [numb_fixations], 'mean_fixation_time': [mean_fixation_time]}
    final_frame = pd.DataFrame(data=final)
    #write everything in one document
    final_frame.to_csv("/Users/Documents/Analyse/Eye movements/final_data.csv")
The code runs (no errors); however, it only seems to run for the first file. The code is not run for the other files present in the folder/directory.
I do not see where my mistake is.
Your output file name is constant, so it gets overwritten with each iteration of the for loop. Try the following instead of your final line, which opens the file in "append" mode instead:
#write everything in one document
with open("/Users/Documents/Analyse/Eye movements/final_data.csv", "a") as f:
    final_frame.to_csv(f, header=False)
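An alternative to appending inside the loop is to collect one summary row per input file and write the output once, with a header, after the loop has finished. The sketch below uses only the standard library `csv` and `statistics` modules in place of pandas and PyGaze. The `summarise` helper, the sample files, the `duration` column, and the `summary.out` file name are all hypothetical stand-ins for the fixation analysis.

```python
import csv
import glob
import os
import statistics
import tempfile


def summarise(path):
    """Hypothetical stand-in for the fixation analysis: count rows, average one column."""
    with open(path, newline='') as f:
        durations = [float(row['duration']) for row in csv.DictReader(f)]
    return {'file': os.path.basename(path),
            'number_fixations': len(durations),
            'mean_fixation_time': statistics.mean(durations)}


with tempfile.TemporaryDirectory() as directory:
    # Create two small sample inputs (made-up numbers, for illustration only).
    samples = {'a.csv': [100, 200], 'b.csv': [150, 250, 350]}
    for name, values in samples.items():
        with open(os.path.join(directory, name), 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['duration'])
            writer.writerows([v] for v in values)

    # One row per input file, gathered in a list instead of overwriting a file.
    paths = sorted(glob.glob(os.path.join(directory, '*.csv')))
    rows = [summarise(p) for p in paths]

    # Write the summary once, after the loop, so nothing is overwritten.
    out_path = os.path.join(directory, 'summary.out')
    with open(out_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

    # Read the summary back to check its contents.
    with open(out_path, newline='') as f:
        written = list(csv.DictReader(f))

print(written)
```

Compared with append mode, this keeps a header row and avoids stale rows piling up when the script is run more than once.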

Search text file and save to specific variables

I have looked at the previous questions and answers to my question but can't seem to get this to work. I am very new to python so I apologize in advance for this basic question.
I am running a Monte Carlo simulation in a separate piece of software. It creates a text file output that is quite lengthy. I want to retrieve 3 values that are under one heading. I have created the following code that isolates the part of the text file I want.
f = open("/Users/scott/Desktop/test/difinp0.txt","rt")
difout = f.readlines()
f.close()
d = range(1,5)
for i, line in enumerate(difout):
    if "Chi-Square Test for Difference Testing" in line:
        for l in difout[i:i+5]: print(l)
This produces the following:
Chi-Square Test for Difference Testing
Value 12.958
Degrees of Freedom 10
P-Value 0.2261
Note: there is a blank line between the heading and the next line titled "Value."
There are different statistics with the same labels elsewhere in the output, but I need the ones under the heading "Chi-Square Test for Difference Testing."
What I am looking for is to save the values into 3 variables for use later:
chivalue (which in this case would be 12.958)
chidf (which in this case would be 10)
chip (which in this case would be 0.2261)
I've tried to enumerate "l" and retrieve from there but I can't seem to get it to work.
Any thoughts would be greatly appreciated. Again, apologies for such a basic question.
One option is to build a function that parses the input lines and returns the variables you want
def parser(text_lines):
    v, d, p = [None]*3
    for line in text_lines:
        if line.strip().startswith('Value'):
            v = float(line.strip().split(' ')[-1])
        if line.strip().startswith('Degrees'):
            d = float(line.strip().split(' ')[-1])
        if line.strip().startswith('P-Value'):
            p = float(line.strip().split(' ')[-1])
    return v,d,p
for i, line in enumerate(difout):
    if "Chi-Square Test for Difference Testing" in line:
        for l in difout[i:i+5]:
            print(l)
        value, degree, p_val = parser(difout[i:i+5])
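The same idea can be sketched as one self-contained function that first finds the heading and then scans only the few lines below it, so statistics with the same labels elsewhere in the file are ignored. The sample text, the function name `extract_chi_square`, and the five-line window are assumptions for illustration; the real output layout may differ.

```python
HEADING = "Chi-Square Test for Difference Testing"

# Made-up sample: an unrelated block with the same labels comes first.
sample = """Chi-Square Test of Model Fit

          Value                             40.000
          Degrees of Freedom                    20
          P-Value                           0.0050

Chi-Square Test for Difference Testing

          Value                             12.958
          Degrees of Freedom                    10
          P-Value                           0.2261
"""


def extract_chi_square(lines, heading=HEADING, window=5):
    """Return (value, df, p) parsed from the lines directly under `heading`."""
    for i, line in enumerate(lines):
        if heading in line:
            value = df = p = None
            for entry in lines[i + 1:i + 1 + window]:
                parts = entry.split()
                if not parts:
                    continue  # skip the blank line after the heading
                if parts[0] == "Value":
                    value = float(parts[-1])
                elif parts[0] == "Degrees":
                    df = int(parts[-1])
                elif parts[0] == "P-Value":
                    p = float(parts[-1])
            return value, df, p
    return None, None, None


chivalue, chidf, chip = extract_chi_square(sample.splitlines())
print(chivalue, chidf, chip)
```

Splitting on any whitespace with `split()` rather than on a single space makes the parsing tolerant of the column padding such reports often use.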

Having an issue with using median function in numpy

I am having an issue with using the median function in numpy. The code used to work on a previous computer, but when I tried to run it on my new machine, I got the error "cannot perform reduce with flexible type". To try to fix this, I used the map() function to make sure my list was floating point and got this error message: could not convert string to float: .
After some more debugging, it seems that my issue is with the splitting of the lines in my input file. The lines are of the form: 2456893.248202,4.490 and I want to split on the ",". However, when I print out the list for the second column of that line, I get
4
.
4
9
0
so it seems to be splitting each character somehow, though I'm not sure how. The relevant section of code is below. I appreciate any thoughts or ideas, and thanks in advance.
def curve_split(fn):
    with open(fn) as f:
        for line in f:
            line = line.strip()
            time,lc = line.split(",")
            #debugging stuff
            g=open('test.txt','w')
            l1=map(lambda x:x+'\n',lc)
            g.writelines(l1)
            g.close()
            #end debugging stuff
    return time,lc
if __name__ == '__main__':
    # place where I keep the lightcurve files from the image subtraction
    dirname = '/home/kuehn/m4/kepler/subtraction/detrending'
    files = glob.glob(dirname + '/*lc')
    print(len(files))
    # in order to create our lightcurve array, we need to know
    # the length of one of our lightcurve files
    lc0 = curve_split(files[0])
    lcarr = np.zeros([len(files),len(lc0)])
    # loop through every file
    for i,fn in enumerate(files):
        time,lc = curve_split(fn)
        lc = map(float, lc)
        # debugging
        print(fn[5:58])
        print(lc)
        print(time)
        # end debugging
        lcm = lc/np.median(float(lc))
        #lcm = ((lc[qual0]-np.median(lc[qual0]))/
        # np.median(lc[qual0]))
        lcarr[i] = lcm
        print(fn,i,len(files))
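One likely cause of the character-by-character output is that `curve_split` returns the strings from a single line, and iterating over a string (as `map` and `writelines` do) yields one character at a time. The sketch below shows that behaviour and one way to collect a whole column instead. It uses the standard library `statistics.median` in place of `np.median`, and the function name `read_curve` and the sample values are hypothetical.

```python
import os
import statistics
import tempfile

# Iterating over a string yields single characters, which is why
# float() eventually receives "." and fails.
chars = list("4.490")
print(chars)


def read_curve(path):
    """Read a two-column 'time,flux' file into two lists of floats."""
    times, fluxes = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            time_str, flux_str = line.split(",")
            times.append(float(time_str))
            fluxes.append(float(flux_str))
    return times, fluxes


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.lc")
    with open(path, "w") as f:
        # Made-up values in the same format as the question's input lines.
        f.write("2456893.248202,4.490\n"
                "2456893.268636,4.500\n"
                "2456893.289070,4.510\n")
    times, fluxes = read_curve(path)

# Normalise by the median of the whole column.
median_flux = statistics.median(fluxes)
normalised = [value / median_flux for value in fluxes]
print(median_flux, normalised)
```

With the column held as a list of floats, the same values could be passed to `np.array` before calling `np.median` in the original script.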

Output with Python Glob // Cannot find where is error in Python code

I have the following code, which does NOT give an error but it also does not produce an output.
The script is made to do the following:
The script takes an input file of 4 tab-separated columns:
It then counts the unique values in Column 1 and the frequency of corresponding values in Column 4 (which contains 2 different tags: C and D).
The output is 3 tab-separated columns containing the unique values of column 1 and their corresponding frequency of values in Column 4: Column 2 has the frequency of the string in Column 1 that corresponds with Tag C and Column 3 has the frequency of the string in Column 1 that corresponds with Tag D.
Here is a sample of input:
algorithm-n like-1-resonator-n 8.1848 C
algorithm-n produce-hull-n 7.9104 C
algorithm-n like-1-resonator-n 8.1848 D
algorithm-n produce-hull-n 7.9104 D
anything-n about-1-Zulus-n 7.3731 C
anything-n above-shortage-n 6.0142 C
anything-n above-1-gig-n 5.8967 C
anything-n above-1-magnification-n 7.8973 C
anything-n after-1-memory-n 2.5866 C
and here is a sample of the desired output:
algorithm-n 2 2
anything-n 5 0
The code I am using is the following (which one will see takes into consideration all suggestions from the comments):
from collections import defaultdict, Counter
def sortAndCount(opened_file):
    lemma_sense_freqs = defaultdict(Counter)
    for line in opened_file:
        lemma, _, _, senseCode = line.split()
        lemma_sense_freqs[lemma][senseCode] += 1
    return lemma_sense_freqs
def writeOutCsv(output_file, input_dict):
    with open(output_file, "wb") as outfile:
        for lemma in input_dict.keys():
            for senseCode in input_dict[lemma].keys():
                outstring = "\t".join([lemma, senseCode,\
                                       str(input_dict[lemma][senseCode])])
                outfile.write(outstring + "\n")
import os
import glob
folderPath = "Python_Counter" # declare here
for input_file in glob.glob(os.path.join(folderPath, 'out_')):
    with open(input_file, "rb") as opened_file:
        lemma_sense_freqs = sortAndCount(input_file)
        output_file = "count_*.csv"
        writeOutCsv(output_file, lemma_sense_freqs)
My intuition is that the problem comes from the "glob" function.
But, as I said before: the code itself DOES NOT give me an error, yet it doesn't seem to produce any output either.
Can someone help?
I have referred to the documentation here and here, and I cannot seem to understand what I am doing wrong.
Can someone give me insight into how to solve the problem and write out the results for the files matched by glob? I have a large number of files that I need to process.
With regard to your original code, lemma_sense_freqs is not defined because it should be returned by the function sortAndCount(), and you never call that function.
For instance, you have a second function in your code, which is called writeOutCsv. You define it, and then you actually call it on the last line.
By contrast, you never call the function sortAndCount() (which is the one that should return the value of lemma_sense_freqs). Hence the error.
I don't know exactly what you want to achieve with that code, but you definitely need to write something like this at some point (try just before the last line):
lemma_sense_freqs = sortAndCount(input_file)
This is how you call the function you need; lemma_sense_freqs will then have a value associated with it and you shouldn't get the error.
I cannot be more specific because it is not clear exactly what you want to achieve with that code. However, you are just experiencing a basic issue at the moment (you defined a function but never used it to retrieve the value of lemma_sense_freqs). Try adding the piece of code I suggest and play with it.
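Independently of the point above, two details in the posted script would also prevent output: `glob.glob(os.path.join(folderPath, 'out_'))` contains no wildcard, so it only matches a file named exactly `out_`, and `"count_*.csv"` is used as a literal file name. The sketch below is one possible arrangement under stated assumptions: input files are named `out_*`, each input gets its own `count_<name>.csv`, and the tags are `C` and `D`. The sample data, the temporary directory, and the renamed helper functions are for illustration only.

```python
import glob
import os
import tempfile
from collections import Counter, defaultdict


def sort_and_count(opened_file):
    """Count tag occurrences (column 4) per lemma (column 1)."""
    lemma_sense_freqs = defaultdict(Counter)
    for line in opened_file:
        if not line.strip():
            continue  # ignore blank lines
        lemma, _, _, sense_code = line.split()
        lemma_sense_freqs[lemma][sense_code] += 1
    return lemma_sense_freqs


def write_counts(output_file, freqs, tags=("C", "D")):
    """Write one tab-separated row per lemma: lemma, count of C, count of D."""
    with open(output_file, "w") as outfile:
        for lemma in sorted(freqs):
            counts = [str(freqs[lemma][tag]) for tag in tags]
            outfile.write("\t".join([lemma] + counts) + "\n")


with tempfile.TemporaryDirectory() as folder_path:
    # Small sample in the question's format (assumed file name: out_sample).
    with open(os.path.join(folder_path, "out_sample"), "w") as f:
        f.write("algorithm-n\tlike-1-resonator-n\t8.1848\tC\n"
                "algorithm-n\tproduce-hull-n\t7.9104\tD\n"
                "anything-n\tabout-1-Zulus-n\t7.3731\tC\n")

    # The trailing * is what makes glob match every file starting with out_.
    for input_file in glob.glob(os.path.join(folder_path, "out_*")):
        with open(input_file) as opened_file:
            # Pass the open file object, not the file name.
            freqs = sort_and_count(opened_file)
        output_file = os.path.join(
            folder_path, "count_" + os.path.basename(input_file) + ".csv")
        write_counts(output_file, freqs)

    with open(os.path.join(folder_path, "count_out_sample.csv")) as f:
        result = f.read()

print(result)
```

Because a `Counter` returns 0 for a missing key, lemmas that never occur with tag `D` still get a row with a zero in the third column, matching the desired output format in the question.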
