Problems with variable referenced before assignment when using os.path.walk - python

OK. I have some background in Matlab and I'm now switching to Python.
I have this bit of code, running under Python 2.6.5 on 64-bit Linux, which walks through directories, finds files named 'GeneralData.dat', retrieves some data from them and stitches them into a new data set:
import pylab as p
import os, re
import linecache as ln

def LoadGenomeMeanSize(arg, dirname, files):
    for file in files:
        filepath = os.path.join(dirname, file)
        if filepath == os.path.join(dirname,'GeneralData.dat'):
            data = p.genfromtxt(filepath)
            if data[-1,4] != 0.0: # checking if data set is OK
                data_chopped = data[1000:-1,:] # removing some of data
                Grand_mean = data_chopped[:,2].mean()
                Grand_STD = p.sqrt((sum(data_chopped[:,4]*data_chopped[:,3]**2) + sum((data_chopped[:,2]-Grand_mean)**2))/sum(data_chopped[:,4]))
            else:
                break
        if filepath == os.path.join(dirname,'ModelParams.dat'):
            l = re.split(" ", ln.getline(filepath, 6))
            turb_param = float(l[2])
            arg.append((Grand_mean, Grand_STD, turb_param))

GrandMeansData = []
os.path.walk(os.getcwd(), LoadGenomeMeanSize, GrandMeansData)
GrandMeansData = sorted(GrandMeansData, key=lambda data_sort: data_sort[2])

TheMeans = p.zeros((len(GrandMeansData), 3))
i = 0
for item in GrandMeansData:
    TheMeans[i,0] = item[0]
    TheMeans[i,1] = item[1]
    TheMeans[i,2] = item[2]
    i += 1

print TheMeans # just checking...
# later do some computation on TheMeans in NumPy
And it throws me this (though I would swear it was working a month ago):
Traceback (most recent call last):
  File "/home/User/01_PyScripts/TESTtest.py", line 29, in <module>
    os.path.walk(os.getcwd(), LoadGenomeMeanSize, GrandMeansData)
  File "/usr/lib/python2.6/posixpath.py", line 233, in walk
    walk(name, func, arg)
  File "/usr/lib/python2.6/posixpath.py", line 225, in walk
    func(arg, top, names)
  File "/home/User/01_PyScripts/TESTtest.py", line 26, in LoadGenomeMeanSize
    arg.append((Grand_mean, Grand_STD, turb_param))
UnboundLocalError: local variable 'Grand_mean' referenced before assignment
All right... so I went and did some reading and came up with this version using global variables:
import pylab as p
import os, re
import linecache as ln

Grand_mean = p.nan
Grand_STD = p.nan

def LoadGenomeMeanSize(arg, dirname, files):
    for file in files:
        global Grand_mean
        global Grand_STD
        filepath = os.path.join(dirname, file)
        if filepath == os.path.join(dirname,'GeneralData.dat'):
            data = p.genfromtxt(filepath)
            if data[-1,4] != 0.0: # checking if data set is OK
                data_chopped = data[1000:-1,:] # removing some of data
                Grand_mean = data_chopped[:,2].mean()
                Grand_STD = p.sqrt((sum(data_chopped[:,4]*data_chopped[:,3]**2) + sum((data_chopped[:,2]-Grand_mean)**2))/sum(data_chopped[:,4]))
            else:
                break
        if filepath == os.path.join(dirname,'ModelParams.dat'):
            l = re.split(" ", ln.getline(filepath, 6))
            turb_param = float(l[2])
            arg.append((Grand_mean, Grand_STD, turb_param))

GrandMeansData = []
os.path.walk(os.getcwd(), LoadGenomeMeanSize, GrandMeansData)
GrandMeansData = sorted(GrandMeansData, key=lambda data_sort: data_sort[2])

TheMeans = p.zeros((len(GrandMeansData), 3))
i = 0
for item in GrandMeansData:
    TheMeans[i,0] = item[0]
    TheMeans[i,1] = item[1]
    TheMeans[i,2] = item[2]
    i += 1

print TheMeans # just checking...
# later do some computation on TheMeans in NumPy
It does not give error messages. It even produces a file with data... but the data are wrong! I checked some of them manually by running these commands:
import pylab as p
data = p.genfromtxt(filepath)
data_chopped = data[1000:-1,:]
Grand_mean = data_chopped[:,2].mean()
Grand_STD = p.sqrt((sum(data_chopped[:,4]*data_chopped[:,3]**2) \
    + sum((data_chopped[:,2]-Grand_mean)**2))/sum(data_chopped[:,4]))
on selected files. They are different :-(
1) Can anyone explain to me what's wrong?
2) Does anyone know a solution to this?
I'll be grateful for help :-)
Cheers,
PTR

I would say this condition is not passing:
if filepath == os.path.join(dirname,'GeneralData.dat'):
which means GeneralData.dat is not being processed before ModelParams.dat. Maybe you need to sort the file list alphabetically, or the file simply isn't there.
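A quick way to test that hypothesis (just a sketch, reusing the callback signature from the question) is to report and skip directories that lack the expected file before doing anything else:

def LoadGenomeMeanSize(arg, dirname, files):
    # Diagnostic: if GeneralData.dat is missing from this directory, say so
    # and bail out, so the append can never run with an unassigned Grand_mean.
    if 'GeneralData.dat' not in files:
        print 'No GeneralData.dat in', dirname
        return
    ...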

I see one issue with the code and with the solution that you have provided.
Never hide a "variable referenced before assignment" issue by just making the variable visible. Try to understand why it happened.
Before you created the global variable Grand_mean, you were being told that you were accessing Grand_mean before any value had been assigned to it. In such a case, initializing the variable outside the function and marking it as global only serves to hide the issue.
You see erroneous results because you have now made the variable visible by making it global, but the underlying issue still exists: your Grand_mean was never set to correct data.
This means that the section of code under "if filepath == os.path.join(dirname,..." was never executed.

Using global is not the right solution. That only makes sense if you do in fact want to reference and assign to the global Grand_mean name. The need for disambiguation comes from the way the interpreter scans function bodies for assignments when deciding which names are local.
You could start by assigning a default value to Grand_mean within the scope of LoadGenomeMeanSize(). Only one of your branches actually assigns a value to Grand_mean with the correct semantic meaning within one loop iteration. You are likely running into a case where
if filepath == os.path.join(dirname,'ModelParams.dat'): is true, but either
if filepath == os.path.join(dirname,'GeneralData.dat'): or if data[-1,4] != 0.0: is not. It's likely the second condition that is failing for you.
The quick and dirty answer is that you probably need to rearrange your code like this:
...
if filepath == os.path.join(dirname,'GeneralData.dat'):
    data = p.genfromtxt(filepath)
    if data[-1,4] != 0.0: # checking if data set is OK
        data_chopped = data[1000:-1,:] # removing some of data
        Grand_mean = data_chopped[:,2].mean()
        Grand_STD = p.sqrt((sum(data_chopped[:,4]*data_chopped[:,3]**2) + sum((data_chopped[:,2]-Grand_mean)**2))/sum(data_chopped[:,4]))

        if filepath == os.path.join(dirname,'ModelParams.dat'):
            l = re.split(" ", ln.getline(filepath, 6))
            turb_param = float(l[2])
            arg.append((Grand_mean, Grand_STD, turb_param))
    else:
        break
...
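A structurally safer variation (a sketch, not the poster's exact code) is to treat each directory as a unit instead of looping over individual file names, so the append can only happen after both values have actually been computed:

import pylab as p
import os, re
import linecache as ln

def LoadGenomeMeanSize(arg, dirname, files):
    # Act only on directories that contain both files; everything else is skipped.
    if 'GeneralData.dat' not in files or 'ModelParams.dat' not in files:
        return
    data = p.genfromtxt(os.path.join(dirname, 'GeneralData.dat'))
    if data[-1,4] == 0.0:              # data set is not OK, skip this directory
        return
    data_chopped = data[1000:-1,:]     # removing some of the data
    Grand_mean = data_chopped[:,2].mean()
    Grand_STD = p.sqrt((sum(data_chopped[:,4]*data_chopped[:,3]**2)
                        + sum((data_chopped[:,2]-Grand_mean)**2))
                       / sum(data_chopped[:,4]))
    l = re.split(" ", ln.getline(os.path.join(dirname, 'ModelParams.dat'), 6))
    arg.append((Grand_mean, Grand_STD, float(l[2])))

With this shape there is no dependence on the order in which os.path.walk hands you the file names.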

Related

XML to dictionary extraction

I wrote some code, and some values for ecd are missing. I would like to indicate them as 'None' or 0000 so that I can create a dataframe. Unfortunately, the code runs until it hits the missing value and then crashes, and I cannot spot the mistake.
The error message:
File "extra.py", line 236, in <module>
if dic['mudLogs']['mudLog']['geologyInterval'][i]['ecdTdAv']['#text'] != None:
KeyError: 'ecdTdAv'
Code:
import xmltodict

xml_file = 'C:\\Users\\jtfra\\Desktop\\Thesis\\Volve_Real_Time_DData\\WITSML Realtime drilling data\\Norway-Statoil-NO 15_$47$_9-F-11\\1\\mudLog\\1.xml'

def convert(xml_file, xml_attribs=True):
    with open(xml_file, "rb") as f: # notice the "rb" mode
        d = xmltodict.parse(f, xml_attribs=xml_attribs)
        return d

dic = convert(xml_file)
mdTop, ecd = [], []
for i in range(len(dic['mudLogs']['mudLog']['geologyInterval'])):
    mdTop.append(dic['mudLogs']['mudLog']['geologyInterval'][i]['mdTop']['#text'])
    if dic['mudLogs']['mudLog']['geologyInterval'][i]['ecdTdAv']['#text'] != None:
        ecd.append(dic['mudLogs']['mudLog']['geologyInterval'][i]['ecdTdAv']['#text'])
    else:
        ecd.append('None')
print(ecd)
Instead of accessing it as:
dic['mudLogs']['mudLog']['geologyInterval'][0]['ecdTdAv']
do:
dic['mudLogs']['mudLog']['geologyInterval'][0].get('ecdTdAv', '0000')
or similar.
You can also check if key is present with:
if 'ecdTdAv' in dic['mudLogs']['mudLog']['geologyInterval'][i]:
    # do something with it, e.g.:
    print(dic['mudLogs']['mudLog']['geologyInterval'][i]['ecdTdAv']['#text'])
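Applied to the loop from the question, the fix might look something like this (a sketch; it assumes the same dictionary layout shown in the traceback):

intervals = dic['mudLogs']['mudLog']['geologyInterval']
mdTop, ecd = [], []
for interval in intervals:
    mdTop.append(interval['mdTop']['#text'])
    # .get() returns None instead of raising KeyError when 'ecdTdAv' is absent
    ecd_entry = interval.get('ecdTdAv')
    ecd.append(ecd_entry['#text'] if ecd_entry else 'None')
print(ecd)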

I need a shortcut

So I'm just trying to make a simple script that can filter emails with different domains. It's working great, but I need a shortcut, because I don't want to write if and elif statements many times. Can anyone tell me how to write my script with a function so that it becomes shorter and easier? Thanks in advance. The script is below:
f_location = 'C:/Users/Jack The Reaper/Desktop/mix.txt'
text = open(f_location)
good = open('C:/Users/Jack The Reaper/Desktop/good.txt','w')
for line in text:
    if '#yahoo' in line:
        yahoo = None
    elif '#gmail' in line:
        gmail = None
    elif '#yahoo' in line:
        yahoo = None
    elif '#live' in line:
        live = None
    elif '#outlook' in line:
        outlook = None
    elif '#hotmail' in line:
        hotmail = None
    elif '#aol' in line:
        aol = None
    else:
        if ' ' in line:
            good.write(line.strip(' '))
        elif '' in line:
            good.write(line.strip(''))
        else:
            good.write(line)
text.close()
good.close()
I would suggest you use a dict for this instead of having separate variables for all the cases.
my_dict = {}
...
if '#yahoo' in line:
    my_dict['yahoo'] = None
But if you want to do it the way you described in the question, you can do it as shown below:
email_domains = ['#yahoo', '#gmail', '#live', '#outlook', '#hotmail', '#aol']
for e in email_domains:
    if e in line:
        locals()[e[1:]] = None
        # if you use a dict, use the line below instead
        # my_dict[e[1:]] = None
locals() returns a dictionary of the current namespace. The keys in this dict are the variable names and the values are the variables' values.
So locals()['gmail'] = None creates a local variable named gmail (if it doesn't already exist) and assigns None to it.
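For completeness, here is a sketch of the whole filtering loop using the dict approach suggested above (the file paths are simply the ones from the question):

email_domains = ['#yahoo', '#gmail', '#live', '#outlook', '#hotmail', '#aol']
seen = {}

with open('C:/Users/Jack The Reaper/Desktop/mix.txt') as text, \
     open('C:/Users/Jack The Reaper/Desktop/good.txt', 'w') as good:
    for line in text:
        for e in email_domains:
            if e in line:
                seen[e[1:]] = None      # record the matched domain in one dict
                break
        else:                           # no listed domain matched: keep the address
            good.write(line.strip(' '))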
Since you stated the problem and provided a sample file, I have two solutions: a one-line solution and a more detailed one.
First, let's import the re module and define the regex pattern:
import re
pattern = r'.+#(?!gmail|yahoo|aol|hotmail|live|outlook).+'
Now the detailed version:
emails = []
with open('emails.txt','r') as f:
    for line in f:
        match = re.finditer(pattern, line)
        for find in match:
            emails.append(find.group())

with open('result.txt','w') as f:
    f.write('\n'.join(emails))
Output in the result.txt file:
nic-os9#gmx.de
angelique.charuel#sfr.fr
nannik#interia.pl
l.andrioli#freenet.de
kamil_sieminski8#o2.pl
hugo.lebrun.basket#orange.fr
The one-line solution, if you want it as short as possible:
with open('results.txt','w') as file:
    file.write('\n'.join([find.group() for line in open('emails.txt','r') for find in re.finditer(pattern,line)]))
output:
nic-os9#gmx.de
angelique.charuel#sfr.fr
nannik#interia.pl
l.andrioli#freenet.de
kamil_sieminski8#o2.pl
hugo.lebrun.basket#orange.fr
P.S.: With the one-line solution the input file will not be closed automatically. Python usually cleans that up, so it's not a big issue (but not always); still, if you want, you can use a with block for the input file as well.
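For instance, a sketch of the same one-liner with both files managed by with blocks (it assumes pattern is defined as above):

with open('emails.txt', 'r') as src, open('results.txt', 'w') as dst:
    # both files are closed automatically when the block ends
    dst.write('\n'.join(find.group() for line in src for find in re.finditer(pattern, line)))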

Why does pycharm warn about "Redeclared variable defined above without usage"?

Why does PyCharm warn me about Redeclared 'do_once' defined above without usage in the code below? (The warning is at line 3.)
for filename in glob.glob(os.path.join(path, '*.'+filetype)):
    with open(filename, "r", encoding="utf-8") as file:
        do_once = 0
        for line in file:
            if 'this_text' in line:
                if do_once == 0:
                    # do stuff
                    do_once = 1
                # some other stuff because of 'this_text'
            elif 'that_text' in line and do_once == 0:
                # do stuff
                do_once = 1
Since I want it to do this once for each file, it seems appropriate to reset it every time a new file is opened, and it does work just like I want it to. But since I have not studied Python, just learned some things by doing and googling, I want to know why it is giving me a warning and what I should do differently.
Edit:
Tried with a boolean instead and still got the warning:
Short code that reproduces the warning for me:
import os
import glob

path = 'path'
for filename in glob.glob(os.path.join(path, '*.txt')):
    with open(filename, "r", encoding="utf-8") as ins:
        do_once = False
        for line in ins:
            if "this" in line:
                print("this")
            elif "something_else" in line and do_once == False:
                do_once = True
In order to solve the general case:
What you may be doing
v1 = []
for i in range(n):
    v1.append([randrange(10)])

v2 = []
for i in range(n): # <<< Redeclared i without usage
    v2.append([randrange(10)])
What you can do
v1 = [[randrange(10)] for _ in range(5)] # use dummy variable "_"
v2 = [[randrange(10)] for _ in range(5)]
My guess is that PyCharm is being confused by the use of integers as flags; there are several alternatives that could be used in your case.
Use a boolean flag instead of an integer:
file_processed = False
for line in file:
    if 'this' in line and not file_processed:
        # do stuff
        file_processed = True
    ...
A better approach would be to simply stop once you have processed something in the file, e.g.:
for filename in [...list...]:
    with open(filename) as f:
        for line in f:
            if 'this_text' in line:
                # Do stuff
                break # Break out of this for loop and go to the next file
Not really an answer, but maybe an explanation:
Apparently, PyCharm is trying to avoid code like
do_once = False
do_once = True
However, it's also flagging normal code like the OP's:
item_found = False
for item in items:
if item == item_that_i_want:
item_found = True
if item_found:
# do something
or, something like
last_message = ''
try:
    # do something
    if success:
        last_message = 'successfully did something'
    else:
        last_message = 'did something without success'
    # do something else
    if success:
        last_message = '2nd something was successful'
    else:
        last_message = '2nd something was not successful'
    # and so on
except Exception:
    pass  # error handling omitted in this illustration
print(last_message)
The Redeclared 'last_message' defined above without usage warning will appear for every line where last_message was reassigned without being used in between.
So, the workaround would be different for each case where this is happening:
ignore the warning(s)
print or log the value somewhere after setting it
perhaps make a function to call for setting/retrieving the value
determine if there's an alternate way to accomplish the desired outcome
My code was using the last_message example, and I just removed the code reassigning last_message in each case (though printing after each reassignment also removed the warnings). I was using it for testing to locate a problem, so it wasn't critical. Had I wanted to log the completed actions, I might've used a function to do so instead of reassigning the variable each time.
If I find a way to turn it off or avoid the warning in PyCharm, I'll update this answer.

Python variable/count not increasing

My code attempts to create a folder and then download a PDF into the corresponding folder. In my current code the variable and counter i keeps track of which folder to download to, but it seems not to be updating for some reason. At the end of the elif statement I want the variable i to increase by 1. I don't understand what the issue is here; I'm fairly new to Python. If a similar situation were coded in Java I know this would work just fine, but I'm not sure why it's not working in Python.
import re
import os
import urllib

f = open("newfile.txt")
suffix = '.pdf'
for line in f:
    i = 0
    folderopt = str(i)
    if suffix in line:
        print('download')
        url = line.rstrip('\n')
        pdfname = url.split('LTN',1)[1]
        print ('download to:'+'/Users/user/Desktop/PDF/'+folderopt+'/'+pdfname)
        urllib.urlretrieve(url,'/Users/user/Desktop/PDF/'+folderopt+'/'+pdfname)
    elif line>i:
        filename = line.rstrip('\n')
        print ('code:'+filename)
        os.mkdir('/Users/user/Desktop/PDF/'+filename)
        global i
        i = i+1
f.close
EDIT: I put the variable outside of the for loop and I am still getting this:
IOError: [Errno 2] No such file or directory: '/Users/user/Desktop/PDF/0/20160412398.pdf'
the i count has not increased even though the folder /Users/user/Desktop/PDF/1 was created.
I changed the elif statement to
elif int(line.rstrip('\n')) > i
and it is still not working.
You set i to 0 inside the loop; try this:
import re
import os
import urllib

f = open("newfile.txt")
suffix = '.pdf'
i = 0
for line in f:
    folderopt = str(i)
    if suffix in line:
        print('download')
        url = line.rstrip('\n')
        pdfname = url.split('LTN',1)[1]
        print ('download to:'+'/Users/user/Desktop/PDF/'+folderopt+'/'+pdfname)
        urllib.urlretrieve(url,'/Users/user/Desktop/PDF/'+folderopt+'/'+pdfname)
    elif line>i:
        filename = line.rstrip('\n')
        print ('code:'+filename)
        os.mkdir('/Users/user/Desktop/PDF/'+filename)
        global i
        i += 1
f.close
That's because the first line in your loop sets i to 0.
Try
i = 0
for line in f:
instead of
for line in f:
    i = 0
EDIT:
Also, the i+=1 should be outside the elif condition, like so
elif:
    ...
i += 1
not
elif:
    ...
    i += 1
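Putting these suggestions together with the int() conversion from the question's edit, a corrected version might look roughly like this (a sketch; the paths and the 'LTN' split are taken straight from the question):

import os
import urllib

f = open("newfile.txt")
suffix = '.pdf'
i = 0                                    # counter initialised once, outside the loop
for line in f:
    folderopt = str(i)
    if suffix in line:
        url = line.rstrip('\n')
        pdfname = url.split('LTN', 1)[1]
        print('download to: /Users/user/Desktop/PDF/' + folderopt + '/' + pdfname)
        urllib.urlretrieve(url, '/Users/user/Desktop/PDF/' + folderopt + '/' + pdfname)
    elif int(line.rstrip('\n')) > i:     # compare numbers, not a string to an int
        filename = line.rstrip('\n')
        print('code:' + filename)
        os.mkdir('/Users/user/Desktop/PDF/' + filename)
        i = i + 1                        # no 'global' statement needed at module level
f.close()                                # note the (): f.close without them does nothing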

Auto-increment file name Python

I am trying to write a function that assigns a path name and filename to a variable, based on the name of a file that exists in the folder. Then, if the file name already exists, it is auto-incremented. I have seen some posts on this using a while loop, but I cannot get my head around it and would like to wrap it in a recursive function.
Here is what I have so far. When testing with the print statement everything works well, but it does not return the new name back to the main program.
def checkfile(ii, new_name, old_name):
    if not os.path.exists(new_name):
        return new_name
    if os.path.exists(new_name):
        ii += 1
        new_name = os.path.join(os.path.split(old_name)[0], str(ii) + 'snap_' + os.path.split(old_name)[1])
        print new_name

old_name = "D:\Bar\foo"
new_name = os.path.join(os.path.split(old_name)[0], "output_" + os.path.split(old_name)[1])
checkfile(0, new_name, old_name)
While I wouldn't recommend using recursion for this (python's stack maxes out at about 1000 function calls deep), you're just missing a return for the recursive bit:
new_name = os.path.join(os.path.split(old_name)[0], "output_" + os.path.split(old_name)[1])
checkfile(0, new_name, old_name)
Should instead be:
new_name = os.path.join(os.path.split(old_name)[0], "output_" + os.path.split(old_name)[1])
return checkfile(ii, new_name, old_name)
But really, you can make this a whole lot simpler by re-writing it as:
def checkfile(path):
    path = os.path.expanduser(path)
    if not os.path.exists(path):
        return path

    root, ext = os.path.splitext(os.path.expanduser(path))
    dir = os.path.dirname(root)
    fname = os.path.basename(root)
    candidate = fname + ext
    index = 0
    ls = set(os.listdir(dir))
    while candidate in ls:
        candidate = "{}_{}{}".format(fname, index, ext)
        index += 1
    return os.path.join(dir, candidate)
This form also handles the fact that filenames have extensions, which your original code doesn't, at least not very clearly. It also avoids needless os.path.exists calls, which can be very expensive, especially if the path is a network location.
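A quick usage example (the path here is purely illustrative):

# If report.txt already exists in ~/data, this prints something like
# /home/user/data/report_0.txt; otherwise it prints the expanded path unchanged.
print(checkfile("~/data/report.txt"))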
