I've only started Python recently but am stuck on a problem.
# function that tells how to read the urls and how to process the data the
# way I need it.
def htmlreader(i):
    # makes variable websites because it is used in a loop.
    pricedata = urllib2.urlopen(
        "http://website.com/" + (",".join(priceids.split(",")[i:i + 200]))).read()
    # here my information processing begins but that is fine.
    pricewebstring = pricedata.split("},{")
    # results in [[1234,2345,3456],[3456,4567,5678]] for example.
    array1 = [re.findall(r"\d+", a) for a in pricewebstring]
    # writes obtained array to my text file
    itemtxt2.write(str(array1) + '\n')
i = 0
while i <= totalitemnumber:
    htmlreader(i)
    i = i + 200
See the comments in the script as well.
This runs in a loop and gives me an array (array1) on each iteration.
Because I print each one to a text file, the result is a text file with separate arrays.
I need one big array, so the results of htmlreader(i) need to be merged.
So my output is something like:
[[1234,2345,3456],[3456,4567,5678]]
[[6789,4567,2345],[3565,1234,2345]]
But I want:
[[1234,2345,3456],[3456,4567,5678],[6789,4567,2345],[3565,1234,2345]]
Any ideas how I can approach this?
Since you want to gather all the elements in a single list, you can simply collect them in another list by extending it, like this:
def htmlreader(i, result):
    ...
    result.extend([re.findall(r"\d+", a) for a in pricewebstring])
i, result = 0, []
while i <= totalitemnumber:
    htmlreader(i, result)
    i = i + 200
itemtxt2.write(str(result) + '\n')
In this case, the list created by re.findall is added to the shared result list with extend. Finally, you write the entire list to the file as a whole.
If the method shown above is confusing, then change it like this:
def htmlreader(i):
    ...
    return [re.findall(r"\d+", a) for a in pricewebstring]
i, result = 0, []
while i <= totalitemnumber:
    result.extend(htmlreader(i))
    i = i + 200
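The write step is omitted in this second version; a minimal way to finish it (assuming itemtxt2 is the same open file object as in your script) is to write the merged list once, after the loop:

# write the merged list in one go once every chunk has been fetched
itemtxt2.write(str(result) + '\n')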
I am new to Python. In this script I am updating an array.
def do_smoothing(input_csv, original):
    print(input_csv)
    for ins, rw in input_csv.iterrows():
        if(rw.start or rw.end != -999999):
            order_value_equal = comp(rw.Previous_Three, rw.Next_Three)
            two_equal = eval_tuples(rw.Previous_Three, rw.Next_Three)
            check_fist_fourth = chcek_offset0_and_offset4(rw.Previous_Three, rw.Next_Three, ins)
            check_two_zeros_onEither_side_val = check_two_zeros_onEither_side(rw.Previous_Three, rw.Next_Three, ins)
        if order_value_equal:
            original = np.array(original)
            original[rw.start + 1: rw.end - 1] = get_maximum_value(rw.Previous_Three)
        if two_equal:
            original = np.array(original)
            original[rw.start + 1: rw.end - 1] = get_maximum_value(rw.Previous_Three)
        if check_fist_fourth:
            original = np.array(original)
            original[rw.start + 1: rw.end - 1] = get_firstvalue(rw.Previous_Three)
        if check_two_zeros_onEither_side_val:
            original = np.array(original)
            original[rw.start + 1: rw.end - 1] = get_firstvalue(rw.Previous_Three)
        else:
            pass
    return original
In this, I am updating the output array every time, but the function is not returning the updated array; it returns the same values as before. Can anyone help me with this?
Your question and code are somewhat unclear, so correct the following assumptions if they are incorrect.
input_csv is a pandas DataFrame object derived from a CSV file. This is assumed on the basis that the iterrows method is a pandas method for that data type.
original is a default Python list that is being converted into a numpy array.
Methods like comp, eval_tuples, etc. are custom methods that produce boolean values.
Methods like get_maximum_value and get_firstvalue are custom methods that produce an int, or another basic data structure.
Let's clean up the code a little bit to get a better idea of what's going on.
def do_smoothing_better(input_csv, original):
    print(input_csv)
    for ins, row in input_csv.iterrows():
        if row.start or row.end != -999999:
            order_value_equal = comp(row.Previous_Three, row.Next_Three)
            two_equal = eval_tuples(row.Previous_Three, row.Next_Three)
            check_fist_fourth = check_offset0_and_offset4(
                row.Previous_Three, row.Next_Three, ins
            )
            check_two_zeros_onEither_side_val = check_two_zeros_onEither_side(
                row.Previous_Three, row.Next_Three, ins
            )
        else:         # this fallback must be added so you don't end up
            continue  # with errors if no booleans are actually initialized
        if order_value_equal or two_equal:  # both of these can be combined
            original = np.array(original)   # as they resulted in the same expression
            original[row.start + 1 : row.end - 1] = get_maximum_value(
                row.Previous_Three
            )
        if check_fist_fourth or check_two_zeros_onEither_side_val:  # as can these
            original = np.array(original)
            original[row.start + 1 : row.end - 1] = get_firstvalue(row.Previous_Three)
        else:
            continue
    return original
Adding proper else branches and combining the conditionals this way should keep execution from dropping into the wrong block and thereby producing an inaccurate value.
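One additional usage note, not part of the answer above: because original is rebound to a new numpy array inside the function, the caller has to work with the returned value rather than the list it passed in. A minimal sketch, reusing the names from the question:

# use the returned array; the list passed in is not modified in place
smoothed = do_smoothing_better(input_csv, original)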
I'm building a large dictionary from XBRL data in order to automate the generation of custom financial ratios. The following code works fine, but I was curious if there is a better way to reference the dictionary items without having to write out the dictionary name every time I want to pull a variable from it.
FinStatItems = {'GainOnSaleOfRealEstate': 0, 'DepreciationAndAmortization': 104044000.0, 'NetIncome': -4086000.0, 'ImpairmentOnInvestmentsInRealEstate': 122472000.0}
NAREIT_FFO = FinStatItems['NetIncome'] + FinStatItems['DepreciationAndAmortization'] + FinStatItems['ImpairmentOnInvestmentsInRealEstate'] - FinStatItems['GainOnSaleOfRealEstate']
print('NAREIT FFO = ' + str(NAREIT_FFO))
Is there a better way to write this line:
NAREIT_FFO = FinStatItems['NetIncome'] + FinStatItems['DepreciationAndAmortization'] + FinStatItems['ImpairmentOnInvestmentsInRealEstate'] - FinStatItems['GainOnSaleOfRealEstate']
If all you are looking for is a little code aesthetics, then you can use operator.itemgetter(), which takes multiple keys as arguments, e.g.:
>>> import operator as op
>>> fn = op.itemgetter('NetIncome', 'DepreciationAndAmortization', 'ImpairmentOnInvestmentsInRealEstate')
>>> NAREIT_FFO = sum(fn(FinStatItems)) - FinStatItems['GainOnSaleOfRealEstate']
But this will be no more efficient than your original code, and it is only really useful if you want to reuse fn().
You could reference a list of the item names:
FinStatItems = {'GainOnSaleOfRealEstate': 0, 'DepreciationAndAmortization': 104044000.0, 'NetIncome': -4086000.0, 'ImpairmentOnInvestmentsInRealEstate': 122472000.0}
items = ['NetIncome', 'DepreciationAndAmortization', 'ImpairmentOnInvestmentsInRealEstate', 'GainOnSaleOfRealEstate']
NAREIT_FFO = sum(FinStatItems[item] for item in items[:-1]) - FinStatItems[items[-1]]
print('NAREIT FFO = ' + str(NAREIT_FFO))
I really need your help to simplify some code that lets me run a data analysis on Salesforce leads.
I have the following dataframes, which, as you can see, are split because of the limit on handling more than 550 objects in a single list.
Iterlist = list()
for x in range(0, int(len(List)/550)+1):
    m = List[550*x: 550*x+550]
    Iterlist.append(m)
Iterlist0 = pd.DataFrame(Iterlist[0])
Iterlist1 = pd.DataFrame(Iterlist[1])
Iterlist2 = pd.DataFrame(Iterlist[2])
...and so on until the initial longer list is split
...
These are converted into the following lists for formatting reasons:
A= Iterlist0["Id"].tolist()
mylistA = "".join(str(x)+ "','" for x in A)
mylistA = mylistA[:-2]
mylistA0 = "('" + mylistA + ")"
B = Iterlist1["Id"].tolist()
mylistB = "".join(str(x)+ "','" for x in B)
mylistB = mylistB[:-2]
mylistB1 = "('" + mylistB + ")"
C = Iterlist2["Id"].tolist()
mylistC = "".join(str(x)+ "','" for x in C)
mylistC = mylistC[:-2]
mylistC2 = "('" + mylistC + ")"
and so on...
...
I want to create a loop that allows me to query Salesforce for each of the lists, using the following code as a template, for example:
queryA='SELECT '+cols[1]+', '+cols[2]+', '+cols[3]+', '+cols[4]+', '+cols[5]+', '+cols[6]+', '+cols[7]+', '+cols[8]+' FROM LeadHistory WHERE LeadId IN '+mylistA0
and then finally:
sf = Salesforce(username='xxx', password='xxx', security_token='xxx')
leadhistory = sf.query_all(queryA)
I don't want to write out numerous dataframes, lists, and queries with specific names over and over in order to get to the result. I would like to have one line of code for each of the steps written above and let Python automatically update the naming according to the number of 550-element lists.
I am new to this programming language and any tip would help me a lot. I think it is possible to simplify this a lot, but I have no idea how it can be done.
Thanks in advance!
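For what it's worth, here is a minimal sketch of one way the chunking, ID formatting, and querying described above could be folded into a single loop. It reuses the List, cols, and query_all pieces from the question; the chunk, ids, and id_clause names are my own assumptions, not existing code, and cols is assumed to hold column-name strings:

sf = Salesforce(username='xxx', password='xxx', security_token='xxx')

all_results = []
for x in range(0, int(len(List) / 550) + 1):
    chunk = List[550 * x: 550 * x + 550]        # one group of at most 550 records
    if not chunk:
        continue
    ids = pd.DataFrame(chunk)["Id"].tolist()    # same "Id" column as above
    id_clause = "('" + "','".join(str(i) for i in ids) + "')"
    query = ("SELECT " + ", ".join(cols[1:9]) +
             " FROM LeadHistory WHERE LeadId IN " + id_clause)
    all_results.append(sf.query_all(query))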
I am currently writing a steganography program. I have the majority of the things I want working; however, I want to rebuild my message using multiple processes, which obviously means the bits returned from the processes need to be ordered. So currently I have:
OK, I'm home now, so I will put some actual code up.
def message_unhide(data):
    inp = cv.LoadImage(data[0]) #data[0] path to image
    steg = LSBSteg(inp)
    bin = steg.unhideBin()
    return bin
#code in main program underneath
count = 0
f = open(files[2], "wb") #files[2] = name of file to rebuild
fat = open("fat.txt", 'w+')
inp = cv.LoadImage(files[0][count]) # files[0] directory path of images
steg = LSBSteg(inp)
bin = steg.unhideBin()
fat.write(bin)
fat.close()
fat = open("fat.txt", 'rb')
num_files = fat.read() #amount of images message hidden across
fat.close()
count += 1
pool = Pool(5)
binary = []
''' Just something I was testing
for x in range(int(num_files)):
binary.append(0)
print (binary)
'''
while count <= int(num_files):
    data = [files[0][count], count]
    #f.write(pool.apply(message_unhide, args=(data, ))) #
    #binary[count - 1] = [pool.apply_async(message_unhide, (data, ))] #
    #again just another few ways i was trying to overcome
    binary = [pool.apply_async(message_unhide, (data, ))]
    count += 1
pool.close()
pool.join()
bits = [b.get() for b in binary]
print(binary)
#for b in bits:
# f.write(b)
f.close()
This method just overwrites binary
binary = [pool.apply_async(message_unhide, (data, ))]
This method fills the entire binary list; however, I lose the .get()
binary[count - 1] = [pool.apply_async(message_unhide, (data, ))]
Sorry for the sloppy coding; I am certainly no expert.
Your main issue has to do with overwriting binary in the loop. You only have one item in the list because you're throwing away the previous list and recreating it each time. Instead, you should use append to modify the existing list:
binary.append(pool.apply_async(message_unhide, (data, )))
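With that one-line change applied to your own loop, the results can then be collected in submission order once the pool has finished (this is just your existing structure with the fix in place):

while count <= int(num_files):
    data = [files[0][count], count]
    binary.append(pool.apply_async(message_unhide, (data, )))
    count += 1
pool.close()
pool.join()
bits = [b.get() for b in binary]  # results, in the order the jobs were submitted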
But you might have a much nicer time if you use pool.map instead of rolling your own version. It expects an iterable yielding a single argument to pass to the function on each iteration, and it returns a list of the return values. The map call blocks until all the values are ready, so you don't need any other synchronization logic.
Here's an implementation using a generator expression to build the data argument items on the fly. You could simplify things and just pass files[0] to map if you rewrote message_unhide to accept the filename as its argument directly, without indexing a list (you never use the index, it seems):
# no loop this time
binary = pool.map(message_unhide, ([file, i] for i, file in enumerate(files[0])))
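Since pool.map returns results in the same order as its input, rebuilding the message is then just a matter of writing the chunks out in sequence; a minimal sketch, assuming (as in your code) that f is the output file opened with "wb" and that message_unhide returns writable chunks:

# results come back in input order, so write them straight to the rebuilt file
for chunk in binary:
    f.write(chunk)
f.close()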
I am implementing a statistical program and have created a performance bottleneck; I was hoping the community could point me in the direction of an optimization.
I am creating a set for each row in a file and finding the intersection of that set with the set of every other row in the same file. I then use the size of that intersection to filter certain sets from the output. The problem is that I have a nested for loop (O(n²)) and the standard size of the files coming into the program is just over 20,000 lines. I have timed the algorithm: for under 500 lines it runs in about 20 minutes, but for the big files it takes about 8 hours to finish.
I have 16GB of RAM at my disposal and a fairly quick 4-core Intel i7 processor. I have noticed no significant difference in memory use when copying list1 and using a second list for comparison instead of opening the file again (maybe this is because I have an SSD?). I thought the 'with open' mechanism reads/writes directly to the HDD, which is slower, but I noticed no difference when using two lists. In fact, the program rarely uses more than 1GB of RAM during operation.
I am hoping that other people have used a certain datatype, or maybe understand multiprocessing in Python better, and might be able to help me speed things up. I appreciate any help, and I hope my code isn't too poorly written.
import ast, sys, os, shutil

list1 = []
end = 0
filterValue = 3

# creates output file with filterValue appended to name
with open(arg2 + arg1 + "/filteredSets" + str(filterValue), "w") as outfile:
    with open(arg2 + arg1 + "/file", "r") as infile:
        # create a list of sets of rows in file
        for row in infile:
            list1.append(set(ast.literal_eval(row)))

        infile.seek(0)
        for row in infile:
            # if file only has one row, no comparisons need to be made
            if not(len(list1) == 1):
                # get the first set from the list and...
                set1 = set(ast.literal_eval(row))
                # ...find the intersection of every other set in the file
                for i in range(0, len(list1)):
                    # don't compare the set with itself
                    if not(pos == i):
                        set2 = list1[i]
                        set3 = set1.intersection(set2)
                        # if the two sets have less than 3 items in common
                        if(len(set3) < filterValue):
                            # and you've reached the end of the file
                            if(i == len(list1)):
                                # append the row in outfile
                                outfile.write(row)
                            # increase position in infile
                            pos += 1
                        else:
                            break
            else:
                outfile.write(row)
Sample input would be a file with this format:
[userID1, userID2, userID3]
[userID5, userID3, userID9]
[userID10, userID2, userID3, userID1]
[userID8, userID20, userID11, userID1]
The output file if this were the input file would be:
[userID5, userID3, userID9]
[userID8, userID20, userID11, userID1]
...because the two sets removed contained three or more of the same user IDs.
This answer is not about how to split code into functions, name variables, etc. It's about a faster algorithm in terms of complexity.
I'd use a dictionary. I will not write exact code; you can do it yourself.
Sets = dict()
for rowID, row in enumerate(Rows):
    for userID in row:
        if Sets.get(userID) is None:
            Sets[userID] = set()
        Sets[userID].add(rowID)
So now we have a dictionary that can be used to quickly obtain the row numbers of all rows containing a given userID.
BadRows = set()
for rowID, row in enumerate(Rows):
    Intersections = dict()
    for userID in row:
        for rowID_cmp in Sets[userID]:
            if rowID_cmp != rowID:
                Intersections[rowID_cmp] = Intersections.get(rowID_cmp, 0) + 1
    # Now Intersections contains info about how many "times"
    # row numbered rowID_cmp intersects the current row
    filteredOut = False
    for rowID_cmp in Intersections:
        if Intersections[rowID_cmp] >= filterValue:
            BadRows.add(rowID_cmp)
            filteredOut = True
    if filteredOut:
        BadRows.add(rowID)
With the row numbers of all filtered-out rows saved in BadRows, we now iterate one last time:
for rowID, row in enumerate(Rows):
    if rowID not in BadRows:
        # output row
This works in 3 scans and in O(n log n) time. You may have to rework how the Rows array is iterated, because in your case it is a file, but that doesn't really change much.
I'm not sure about the Python syntax and details, but you get the idea behind my code.
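For concreteness, here is a self-contained sketch of the same inverted-index idea, adapted to the file format from the question; the file paths and variable names are assumptions for illustration, not part of the original answer:

import ast

filter_value = 3

# one scan: keep the raw lines (for output) and their parsed sets
with open("infile.txt", "r") as infile:            # assumed input path
    lines = infile.readlines()
rows = [set(ast.literal_eval(line)) for line in lines]

# inverted index: userID -> set of row numbers that contain it
rows_by_user = {}
for row_id, row in enumerate(rows):
    for user_id in row:
        rows_by_user.setdefault(user_id, set()).add(row_id)

# for each row, count how many users it shares with every row it co-occurs with
bad_rows = set()
for row_id, row in enumerate(rows):
    shared_counts = {}
    for user_id in row:
        for other_id in rows_by_user[user_id]:
            if other_id != row_id:
                shared_counts[other_id] = shared_counts.get(other_id, 0) + 1
    for other_id, shared in shared_counts.items():
        if shared >= filter_value:
            bad_rows.update((row_id, other_id))

# last scan: write out only the rows that never hit the threshold
with open("filteredSets3.txt", "w") as outfile:    # assumed output path
    for row_id, line in enumerate(lines):
        if row_id not in bad_rows:
            outfile.write(line)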
First of all, please pack your code into functions which each do one thing well.
def get_data(*args):
    # get the data.

def find_intersections_sets(list1, list2):
    # do the intersections part.

def loop_over_some_result(result):
    # insert assertions so that you don't end up looping in infinity:
    assert result is not None
    ...

def myfunc(*args):
    source1, source2 = args
    L1, L2 = get_data(source1), get_data(source2)
    intersects = find_intersections_sets(L1, L2)
    ...

if __name__ == "__main__":
    myfunc()
then you can easily profile the code using:
if __name__ == "__main__":
    import cProfile
    cProfile.run('myfunc()')
which gives you invaluable insight into your code behaviour and allows you to track down logical bugs. For more on cProfile, see How can you profile a python script?
An option to track down a logical flaw (we're all humans, right?) is to use a timeout function in a decorator like this (python2) or this (python3):
With this, myfunc can be changed to:
def get_data(*args):
    # get the data.

def find_intersections_sets(list1, list2):
    # do the intersections part.

def myfunc(*args):
    source1, source2 = args
    L1, L2 = get_data(source1), get_data(source2)
    #timeout(10) # seconds <---- the clever bit!
    intersects = find_intersections_sets(L1, L2)
    ...
...where the timeout operation will raise an error if it takes too long.
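The linked decorators are not reproduced here, but as a rough illustration of the idea, a minimal signal-based sketch (Unix-only, Python 3; my own assumption rather than the linked code) could look like this:

import signal
from functools import wraps

def timeout(seconds):
    # raise TimeoutError if the wrapped call runs longer than `seconds` (uses SIGALRM)
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            def handler(signum, frame):
                raise TimeoutError("call timed out after %d seconds" % seconds)
            old_handler = signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)                        # cancel the pending alarm
                signal.signal(signal.SIGALRM, old_handler)
        return wrapper
    return decorator

@timeout(10)  # seconds
def find_intersections_sets(list1, list2):
    # do the intersections part.
    ...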
Here is my best guess:
import ast

def get_data(filename):
    with open(filename, 'r') as fi:
        data = fi.readlines()
    return data

def get_ast_set(line):
    return set(ast.literal_eval(line))

def less_than_x_in_common(set1, set2, limit=3):
    if len(set1.intersection(set2)) < limit:
        return True
    else:
        return False

def check_infile(datafile, savefile, filtervalue=3):
    list1 = [get_ast_set(row) for row in get_data(datafile)]
    outlist = []
    for row in list1:
        # keep the row only if it shares fewer than filtervalue items with every
        # row kept so far (all() is True for the first row, when outlist is empty)
        if all(less_than_x_in_common(row, kept, limit=filtervalue) for kept in outlist):
            outlist.append(row)
    with open(savefile, 'w') as fo:
        fo.writelines(str(row) + '\n' for row in outlist)

if __name__ == "__main__":
    datafile = str(arg2 + arg1 + "/file")
    savefile = str(arg2 + arg1 + "/filteredSets" + str(filterValue))
    check_infile(datafile, savefile)