python variable m.start() from re.finditer does not get overwritten - python

I am currently writing a small Python program for manipulating text files. (I am a new programmer.)
First, I use re.finditer to find a specific string in lines1. Then I write the matches to a file and close it.
Next, I want to grab the first line and search for it in another text file. The first use of re.finditer worked great.
The problem is that m.start() always returns the last value from the first loop. It is not overwritten the way it was the first time I used re.finditer.
Could you help me understand why?
My code:
for m in re.finditer(finder1,lines1):
    end_of_line = lines1.find('\n',m.start())
    #print(m.start())
    found_tag = lines1[m.start()+lenfinder1:end_of_line]
    writefile.write(found_tag+'\n')
    lenfinder2 = len(found_tag)
input_file3 = open ('out.txt')
writefile.close()
num_of_lines3 = file_len('out.txt')
n=1
while (n < num_of_lines3):
    line = linecache.getline('out.txt', n)
    n = n+1
    re.finditer(line,lines2)
    #print(m.start())

You haven't declared or initialized line, which you use here:
re.finditer(line,lines2)
So, change :
linecache.getline('out.txt', n)
to
line = linecache.getline('out.txt', n)

Related

How do I break line at the range on ReportLab?

I am trying to write text to a PDF file, but if a line is wider than the page it overflows instead of automatically wrapping to the next line. How do I set a limit so that an overflowing line is automatically broken onto the next line?
Note: The text is automatically generated and may not always overflow.
Current Output:
What I want:
Code:
can = canvas.Canvas(packet, pagesize=A4, bottomup=0)
can.setFont('RobotoBold', 11)
can.drawString(40, 658, 'In Word:')
can.setFont('Roboto', 11)
can.drawString(84, 658, final_string)
You can break final_string into chunks by using list comprehension as follows.
n = 80 # or whatever length you find suitable.
finals_string_chunks = [final_string[i:i+n] for i in range(0, len(final_string), n)]
And then call can.drawString() within a for-each loop for each string in the list finals_string_chunks.
You might have to do some math for updating appropriate x, y coordinates on your drawString method as well.
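The coordinate math can be sketched roughly as follows. This is only an illustration: layout_chunks and the line_height of 14 are assumptions, not part of ReportLab, and the sample text is made up. Because the canvas in the question is created with bottomup=0, y grows downward, so each new line adds to y.

```python
def layout_chunks(text, n, x, y, line_height):
    """Return (x, y, chunk) tuples, one per line of at most n characters."""
    chunks = [text[i:i + n] for i in range(0, len(text), n)]
    # With bottomup=0 the y axis grows downward, so each new line adds to y.
    return [(x, y + k * line_height, chunk) for k, chunk in enumerate(chunks)]


# Made-up sample text standing in for the generated final_string.
final_string = "one hundred twenty three thousand four hundred fifty six only " * 3
placed = layout_chunks(final_string, n=80, x=84, y=658, line_height=14)

for x, y, chunk in placed:
    # In the real script this would be: can.drawString(x, y, chunk)
    print(x, y, repr(chunk))
```

Fixed-width slicing can cut a word in half. textwrap.wrap(final_string, width=n) from the standard library breaks on spaces instead and could replace the list comprehension if that matters.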

How can I increase the amount of array iterated during the run-time of script?

My script cleans arrays of unwanted strings like "##$!" and other stuff.
The script works as intended, but it is extremely slow when the Excel sheet has many rows.
I tried numpy to see if it could speed things up, but I'm not too familiar with it, so I might be using it incorrectly.
xls = pd.ExcelFile(path)
df = xls.parse("Sheet2")
TeleNum = np.array(df['telephone'].values)
def replace(orignstr): # removes the unwanted string from numbers
    for elem in badstr:
        if elem in orignstr:
            orignstr = orignstr.replace(elem, '')
    return orignstr
for UncleanNum in tqdm(TeleNum):
    newnum = replace(str(UncleanNum)) # calling replace function
    df['telephone'] = df['telephone'].replace(UncleanNum, newnum) # store string back in data frame
I also tried removing the function to see if that would help, placing everything in one block of code, but the speed remained the same.
for UncleanNum in tqdm(TeleNum):
    orignstr = str(UncleanNum)
    for elem in badstr:
        if elem in orignstr:
            orignstr = orignstr.replace(elem, '')
    print(orignstr)
    df['telephone'] = df['telephone'].replace(UncleanNum, orignstr)
TeleNum = np.array(df['telephone'].values)
The current speed of the script on an Excel file of 200,000 rows is around 70 it/s, and it takes around an hour to finish. That is not good enough, since this is just one function of many.
I'm not too advanced in Python. I'm just learning as I script, so any pointers would be appreciated.
Edit:
Most of the array elements I'm dealing with are numbers, but some have strings in them. I'm trying to remove all the strings from the array elements.
Ex.
FD3459002912
*345*9002912$
If you are trying to clear everything that isn't a digit from the strings you can directly use re.sub like this:
import re
string = "FD3459002912"
regex_result = re.sub(r"\D", "", string)  # raw string so \D reaches the regex engine unchanged
print(regex_result) # 3459002912
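To apply this to every value rather than a single string, one option is to compile the pattern once and map it over the values. A minimal sketch using only the standard library (clean_number and the raw list are illustrative names, and the integer entry is an assumed example of a cell that Excel stored as a number):

```python
import re

# Compile once; \D matches any character that is not a digit.
NON_DIGITS = re.compile(r"\D")


def clean_number(value):
    """Drop every character that is not a digit."""
    return NON_DIGITS.sub("", str(value))


raw = ["FD3459002912", "*345*9002912$", 3459002912]
cleaned = [clean_number(v) for v in raw]
print(cleaned)  # ['3459002912', '3459002912', '3459002912']
```

With the DataFrame from the question, df['telephone'] = df['telephone'].astype(str).str.replace(r'\D', '', regex=True) should do the same in one vectorized call. The likely reason the original loop is slow is that it calls df['telephone'].replace(...) once per row, and each call scans the whole column.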

Search text file and save to specific variables

I have looked at previous questions and answers on this topic but can't seem to get this to work. I am very new to Python, so I apologize in advance for this basic question.
I am running a Monte Carlo simulation in a separate piece of software. It creates a text file output that is quite lengthy. I want to retrieve 3 values that are under one heading. I have written the following code, which isolates the part of the text file I want.
f = open("/Users/scott/Desktop/test/difinp0.txt","rt")
difout = f.readlines()
f.close()
d = range(1,5)
for i, line in enumerate(difout):
    if "Chi-Square Test for Difference Testing" in line:
        for l in difout[i:i+5]: print(l)
This produces the following:
Chi-Square Test for Difference Testing
Value 12.958
Degrees of Freedom 10
P-Value 0.2261
Note: there is a blank line between the heading and the next line, titled "Value".
There are different statistics with the same labels elsewhere in the output, but I need the ones under the heading "Chi-Square Test for Difference Testing".
What I am looking for is to save the values into 3 variables for use later:
chivalue (which in this case would be 12.958)
chidf (which in this case would be 10)
chip (which in this case would be 0.2261)
I've tried to enumerate "l" and retrieve the values from there, but I can't seem to get it to work.
Any thoughts would be greatly appreciated. Again, apologies for such a basic question.
One option is to build a function that parses the input lines and returns the variables you want
def parser(text_lines):
    v, d, p = [None]*3
    for line in text_lines:
        if line.strip().startswith('Value'):
            v = float(line.strip().split(' ')[-1])
        if line.strip().startswith('Degrees'):
            d = float(line.strip().split(' ')[-1])
        if line.strip().startswith('P-Value'):
            p = float(line.strip().split(' ')[-1])
    return v,d,p
for i, line in enumerate(difout):
    if "Chi-Square Test for Difference Testing" in line:
        for l in difout[i:i+5]:
            print(l)
        value, degree, p_val = parser(difout[i:i+5])
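Here is a self-contained variant of the same idea that can be run without the simulation output. It is only a sketch: parse_chi_square is an illustrative name, the sample lines are typed in from the output shown in the question (the real file's spacing may differ), and the extra "Value 99.0" line is made up to show that only the five lines under the heading are read.

```python
def parse_chi_square(lines):
    """Return (value, degrees of freedom, p-value) from the given lines."""
    chivalue = chidf = chip = None
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip the blank line after the heading
        if parts[0] == "Value":
            chivalue = float(parts[-1])
        elif parts[0] == "Degrees":
            chidf = int(parts[-1])
        elif parts[0] == "P-Value":
            chip = float(parts[-1])
    return chivalue, chidf, chip


# Sample lines standing in for f.readlines() on the real output file.
difout = [
    "Chi-Square Test for Difference Testing\n",
    "\n",
    "Value 12.958\n",
    "Degrees of Freedom 10\n",
    "P-Value 0.2261\n",
    "Value 99.0\n",  # made-up later statistic with the same label
]

for i, line in enumerate(difout):
    if "Chi-Square Test for Difference Testing" in line:
        chivalue, chidf, chip = parse_chi_square(difout[i:i + 5])
        break  # stop at the first matching heading

print(chivalue, chidf, chip)  # 12.958 10 0.2261
```

Splitting on any whitespace with split() rather than split(' ') keeps the parsing working if the columns are padded with several spaces.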

Having an issue with using median function in numpy

I am having an issue with the median function in numpy. The code used to work on a previous computer, but when I tried to run it on my new machine I got the error "cannot perform reduce with flexible type". To try to fix this, I used the map() function to make sure my list was floating point, and got this error message instead: could not convert string to float: .
After some more debugging, it seems that the issue is with how I split the lines of my input file. The lines are of the form 2456893.248202,4.490 and I want to split on the ",". However, when I print out the list for the second column of that line, I get
4
.
4
9
0
so it seems to be splitting on each character somehow, though I'm not sure how. The relevant section of code is below. I appreciate any thoughts or ideas, and thanks in advance.
def curve_split(fn):
    with open(fn) as f:
        for line in f:
            line = line.strip()
            time,lc = line.split(",")
    #debugging stuff
    g=open('test.txt','w')
    l1=map(lambda x:x+'\n',lc)
    g.writelines(l1)
    g.close()
    #end debugging stuff
    return time,lc
if __name__ == '__main__':
    # place where I keep the lightcurve files from the image subtraction
    dirname = '/home/kuehn/m4/kepler/subtraction/detrending'
    files = glob.glob(dirname + '/*lc')
    print(len(files))
    # in order to create our lightcurve array, we need to know
    # the length of one of our lightcurve files
    lc0 = curve_split(files[0])
    lcarr = np.zeros([len(files),len(lc0)])
    # loop through every file
    for i,fn in enumerate(files):
        time,lc = curve_split(fn)
        lc = map(float, lc)
        # debugging
        print(fn[5:58])
        print(lc)
        print(time)
        # end debugging
        lcm = lc/np.median(float(lc))
        #lcm = ((lc[qual0]-np.median(lc[qual0]))/
        # np.median(lc[qual0]))
        lcarr[i] = lcm
        print(fn,i,len(files))
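A likely cause, judging from the code as posted: curve_split overwrites time and lc on every line and returns only the last pair, and both are still strings. Iterating over the string "4.490" (in map or writelines) yields one character at a time, and float(".") fails. A minimal sketch that accumulates each column as floats instead; the sample file contents after the first row are made up, and statistics.median stands in for np.median so that the sketch needs only the standard library:

```python
import os
import statistics
import tempfile


def curve_split(fn):
    """Read a two-column CSV file and return (times, fluxes) as float lists."""
    times, fluxes = [], []
    with open(fn) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore empty lines
            time, lc = line.split(",")
            times.append(float(time))
            fluxes.append(float(lc))
    return times, fluxes


# Write a small made-up lightcurve file to demonstrate the function.
with tempfile.NamedTemporaryFile("w", suffix=".lc", delete=False) as tmp:
    tmp.write("2456893.248202,4.490\n2456893.268636,4.510\n2456893.289070,4.470\n")
    path = tmp.name

times, lc = curve_split(path)
os.remove(path)

median = statistics.median(lc)
normalized = [value / median for value in lc]

# Iterating over a string gives single characters, which explains the
# one-character-per-line debug output in the question.
print(list("4.490"))  # ['4', '.', '4', '9', '0']
print(normalized)
```

With lists of floats like these, np.array(lc) / np.median(lc) should behave as the original code intended.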

Writing to a multi dimensional array with split

I am trying to use Python to parse a text file (stored in the var trackList) containing times and titles. It looks like this:
00:04:45 example text
00:08:53 more example text
12:59:59 the last bit of example text
My regular expression (rem) works, I am also able to split the string (i) into two parts correctly (as in I separate times and text) but I am unable to then add the arrays (using .extend) that the split returns to a large array I created earlier (sLines).
f=open(trackList)
count=0
sLines=[[0 for x in range(0)] for y in range(34)]
line=[]
for i in f:
    count+=1
    line.append(i)
    rem=re.match("\A\d\d\:\d\d\:\d\d\W",line[count-1])
    if rem:
        sLines[count-1].extend(line[count-1].split(' ',1))
    else:
        print("error on line: "+count)
That code should go through each line in the file trackList and test whether the line is as expected. If so, it should separate the time from the text and save the result as an array inside an array, at the index one less than the current line number. If not, it should print an error pointing me to the line.
I use array[count-1] because Python arrays are zero-indexed and file lines are not.
I use .extend() as I want both elements of the smaller array added to the larger array in the same iteration of the parent for loop.
So, you have some pretty confusing code there.
For instance doing:
[0 for x in range(0)]
Is a really fancy way of initializing an empty list:
>>> [] == [0 for x in range(0)]
True
Also, how do you know the matrix should be 34 lines long? You're also confusing yourself by calling the line 'i' in your for loop; that name is usually reserved as shorthand for an index, which you'd expect to be a numerical value. Appending i to line and then re-referencing it as line[count-1] is redundant when you already have the line itself (i).
Your overall code can be simplified to something like this:
# load the file and extract the lines
f = open(trackList)
lines = f.readlines()
f.close()
# create the expression (more optimized for loops)
expr = re.compile(r'^(\d\d:\d\d:\d\d)\s*(.*)$')  # raw string keeps the \d and \s escapes intact
sLines = []
# loop the lines collecting both the index (i) and the line (line)
for i, line in enumerate(lines):
    result = expr.match(line)
    # validate the line
    if ( not result ):
        print("error on line: " + str(i+1))
        # add an invalid list to the matrix
        sLines.append([]) # or whatever you want as your invalid line
        continue
    # add the list to the matrix
    sLines.append(result.groups())
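For a quick check without the real file, the same approach can be run on the sample lines from the question. This is a sketch: parse_tracks and TRACK_RE are illustrative names, the "not a track line" entry is made up to exercise the error branch, and the rows are stored as lists rather than tuples.

```python
import re

# Time stamp, at least one whitespace character, then the title.
TRACK_RE = re.compile(r'^(\d\d:\d\d:\d\d)\s+(.*)$')


def parse_tracks(lines):
    """Return one [time, title] list per line, or [] for a malformed line."""
    rows = []
    for number, line in enumerate(lines, start=1):
        match = TRACK_RE.match(line.strip())
        if match is None:
            print("error on line: " + str(number))
            rows.append([])
            continue
        rows.append(list(match.groups()))
    return rows


sample = [
    "00:04:45 example text\n",
    "00:08:53 more example text\n",
    "not a track line\n",
    "12:59:59 the last bit of example text\n",
]
rows = parse_tracks(sample)
print(rows)
```

Because rows grows with append, there is no need to know the number of lines (34 in the question) in advance.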
