Most efficient way to iterate through list of lists - python

I'm currently collecting data from Quandl, and it is saved as a list of lists. Each list looks something like this (price data):
['2', 1L, datetime.date(1998, 1, 2), datetime.datetime(2016, 9, 26, 1, 35, 3, 830563), datetime.datetime(2016, 9, 26, 1, 35, 3, 830563), '82.1900', '83.6200', '81.7500', '83.5000', '28.5183', 1286500.0]
This is typically 1 of about 5000 lists, and every once in a while Quandl will spit back some NaN values that don't like being saved into the database.
['2', 1L, datetime.date(1998, 1, 2), datetime.datetime(2016, 9, 26, 1, 35, 3, 830563), datetime.datetime(2016, 9, 26, 1, 35, 3, 830563), 'nan', 'nan', 'nan', 'nan', 'nan', 0]
What would be the most efficient way of iterating through the list of lists to change 'nan' values into zeros?
I know I could do something like this, but it seems rather inefficient. This operation will need to be performed on 11 different values * 5000 different dates * 500 companies:
def screen_data(data):
    new_data = []
    for d in data:
        new_list = []
        for x in d:
            new_value = x
            if math.isnan(x):  # note: the function is math.isnan, not math.isNan
                new_value = 0
            new_list.append(new_value)
        new_data.append(new_list)
    return new_data
I would be interested in any solution that could reduce the time. I know DataFrames might work, but I'm not sure how they would solve the NaN issue.
Or if there is a way to include NaN values in an SQLServer5.6 database along with floats, changing the database is also a viable option.

Don't create a new list - rather, edit the old list in-place:
import math

def screenData(L):
    for subl in L:
        for i, n in enumerate(subl):
            if math.isnan(n):
                subl[i] = 0
The only way I can think of to make this faster would be with multiprocessing.
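A rough sketch of that multiprocessing idea (the names clean_row and screen_data_parallel are mine, not from the question). One caveat: in-place edits made inside a worker process don't survive the trip back across the process boundary, so the workers return cleaned copies instead. The string comparison also sidesteps the problem that math.isnan chokes on the question's string values:

```python
from multiprocessing import Pool

def clean_row(row):
    # Replace any 'nan' value (string or float) with 0; everything else
    # passes through unchanged.
    return [0 if str(v).lower() == 'nan' else v for v in row]

def screen_data_parallel(data, processes=4):
    # Each worker process cleans some of the rows; pool.map preserves order.
    with Pool(processes) as pool:
        return pool.map(clean_row, data)

if __name__ == '__main__':
    rows = [['82.1900', '83.6200', 1286500.0],
            ['nan', 'nan', 0]]
    print(screen_data_parallel(rows))
```

Whether this beats the single-process loop depends on row count; for only 5000 rows the pickling overhead may dominate.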

I haven't timed it, but have you tried using a nested list comprehension with conditional expressions?
For example:
import datetime

data = [
    ['2', 1, datetime.date(1998, 1, 2),
     datetime.datetime(2016, 9, 26, 1, 35, 3, 830563),
     datetime.datetime(2016, 9, 26, 1, 35, 3, 830563),
     '82.1900', '83.6200', '81.7500', '83.5000',
     '28.5183', 1286500.0],
    ['2', 1, datetime.date(1998, 1, 2),
     datetime.datetime(2016, 9, 26, 1, 35, 3, 830563),
     datetime.datetime(2016, 9, 26, 1, 35, 3, 830563),
     'nan', 'nan', 'nan', 'nan', 'nan', 0],
]
new_data = [[y if str(y).lower() != 'nan' else 0 for y in x] for x in data]
print(new_data)
I did not use math.isnan(y) because you would have to be sure that y is a float, or you'd get an error; that is hard to guarantee here, since almost every value has a string representation. But I still made sure to do a lower-case comparison to 'nan' (with .lower()), since 'NaN' or 'Nan' are also legal ways to express "Not a Number".

How about this:
import math

def clean_nan(data_list, value=0):
    for i, x in enumerate(data_list):
        if math.isnan(x):
            data_list[i] = value
    return data_list
(The return is optional, as the modification is made in place, but it is needed if the function is used with map or similar, assuming of course that data_list is indeed a list or similar container.)
How you get your data and how you work with it will determine how to use this. For instance, if you do something like this:
for data in (my database/Quandl/whatever):
    # do stuff with data
you can change it to
for data in (my database/Quandl/whatever):
    clean_nan(data)
    # do stuff with data
or use map (or imap if you are in Python 2):
for data in map(clean_nan, (my database/Quandl/whatever)):
    # do stuff with data
That way you get to work with your data as soon as it arrives from the database/Quandl/whatever, provided the source works as a generator, that is, it doesn't process the whole thing at once; and if it doesn't, try to change it into a generator if possible. In either case you get to work with your data as soon as possible.

This particular way of using .map() in python

I was reading an article and I came across the piece of code given below. I ran it and it worked for me:
x = df.columns
x_labels = [v for v in sorted(x.unique())]
x_to_num = {p[1]:p[0] for p in enumerate(x_labels)}
# till here it is okay. But I don't understand what is going on with this map.
x.map(x_to_num)
The final result from the map is given below:
Int64Index([ 0, 3, 28, 1, 26, 23, 27, 22, 20, 21, 24, 18, 10, 7, 8, 15, 19,
13, 14, 17, 25, 16, 9, 11, 6, 12, 5, 2, 4],
dtype='int64')
Can someone please explain to me how the .map() worked here. I searched online, but could not find anything related.
ps: df is a pandas dataframe.
Let's look what .map() function in general does in python.
>>> l = [1, 2, 3]
>>> list(map(str, l))
# ['1', '2', '3']
Here the list having numeric elements is converted to string elements.
So, whatever function we are trying to apply using map needs an iterable to work on.
You probably might have got confused because the general syntax of map (map(MappingFunction, IteratorObject)) is not used here and things still work.
The variable x takes the form of IteratorObject, while the dictionary x_to_num contains the mapping and hence takes the form of MappingFunction.
Edit: this scenario has nothing to do with pandas as such; x can be any iterator-type object.
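A minimal plain-Python sketch of the same idea, with no pandas involved (pandas' .map accepts a dict directly, but with the builtin map we pass the dict's .get method as the mapping function):

```python
x_labels = ['a', 'b', 'c']
# Same dict comprehension as in the question: label -> position.
x_to_num = {p[1]: p[0] for p in enumerate(x_labels)}  # {'a': 0, 'b': 1, 'c': 2}

# Each element of the iterable is looked up in the mapping.
mapped = list(map(x_to_num.get, ['c', 'a', 'b']))
print(mapped)  # [2, 0, 1]
```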

numpy: how to get the maximum value like this in numpy? [duplicate]

I need a fast way to keep a running maximum of a numpy array. For example, if my array was:
x = numpy.array([11,12,13,20,19,18,17,18,23,21])
I'd want:
numpy.array([11,12,13,20,20,20,20,20,23,23])
Obviously I could do this with a little loop:
def running_max(x):
    result = [x[0]]
    for val in x[1:]:  # start after the first element, which seeds the result
        if val > result[-1]:
            result.append(val)
        else:
            result.append(result[-1])
    return result
But my arrays have hundreds of thousands of entries and I need to call this many times. It seems like there's got to be a numpy trick to remove the loop, but I can't seem to find anything that will work. The alternative will be to write this as a C extension, but it seems like I'd be reinventing the wheel.
numpy.maximum.accumulate works for me.
>>> import numpy
>>> numpy.maximum.accumulate(numpy.array([11,12,13,20,19,18,17,18,23,21]))
array([11, 12, 13, 20, 20, 20, 20, 20, 23, 23])
As suggested, there is scipy.maximum.accumulate:
In [9]: x
Out[9]: [1, 3, 2, 5, 4]
In [10]: scipy.maximum.accumulate(x)
Out[10]: array([1, 3, 3, 5, 5])
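For completeness, if numpy isn't available, Python 3's itertools.accumulate accepts an optional binary function, so passing max gives the same running maximum in pure Python (slower than numpy, but with no explicit loop in your own code):

```python
from itertools import accumulate

x = [11, 12, 13, 20, 19, 18, 17, 18, 23, 21]
# accumulate folds max over the sequence, keeping every intermediate result.
running = list(accumulate(x, max))
print(running)  # [11, 12, 13, 20, 20, 20, 20, 20, 23, 23]
```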

Find lists which together contain all values from 0-23 in list of lists python

I have a list of lists. The lists within these list look like the following:
[0,2,5,8,7,12,16,18], [0,9,18,23,5,8,15,16], [1,3,4,17,19,6,13,23],
[9,22,21,10,11,20,14,15], [2,8,23,0,7,16,9,15], [0,5,8,7,9,11,20,16]
Every small list has 8 values from 0-23 and there are no value repeats within a small list.
What I need now are three lists which together contain all the values 0-23. It is possible that there are a couple of combinations that accomplish this, but I only need one.
In this particular case the output would be:
[0,2,5,8,7,12,16,18], [1,3,4,17,19,6,13,23], [9,22,21,10,11,20,14,15]
I thought about doing something with ordering, but I'm not a Python pro, so it is hard for me to handle all the lists within the list (to compare them all).
Thanks for your help.
The following appears to work:
from itertools import combinations, chain

lol = [[0,2,5,8,7,12,16,18], [0,9,18,23,5,8,15,16], [1,3,4,17,19,6,13,23],
       [9,22,21,10,11,20,14,15], [2,8,23,0,7,16,9,15], [0,5,8,7,9,11,20,16]]

for p in combinations(lol, 3):
    if len(set(chain.from_iterable(p))) == 24:
        print(p)
        break  # if only one is required
This displays the following:
([0, 2, 5, 8, 7, 12, 16, 18], [1, 3, 4, 17, 19, 6, 13, 23], [9, 22, 21, 10, 11, 20, 14, 15])
If it will always happen that 3 lists form the numbers 0-23, and you only want the first such combination, then this can be done by creating combinations of length 3 and checking set intersections:
>>> li = [[0,2,5,8,7,12,16,18], [0,9,18,23,5,8,15,16], [1,3,4,17,19,6,13,23], [9,22,21,10,11,20,14,15], [2,8,23,0,7,16,9,15], [0,5,8,7,9,11,20,16]]
>>> import itertools
>>> for t in itertools.combinations(li, 3):
...     if not set(t[0]) & set(t[1]) and not set(t[0]) & set(t[2]) and not set(t[1]) & set(t[2]):
...         print t
...         break
([0, 2, 5, 8, 7, 12, 16, 18], [1, 3, 4, 17, 19, 6, 13, 23], [9, 22, 21, 10, 11, 20, 14, 15])
Let's do a recursive solution.
We need a list of lists that contain these values:
target_set = set(range(24))
This is a function that recursively tries to find a list of lists that match exactly that set:
def find_covering_lists(target_set, list_of_lists):
    if not target_set:
        # Done
        return []
    if not list_of_lists:
        # Failed
        raise ValueError()
    # Two cases -- either the first element works, or it doesn't
    try:
        first_as_set = set(list_of_lists[0])
        if first_as_set <= target_set:
            # If it's a subset, call this recursively for the rest
            return [list_of_lists[0]] + find_covering_lists(
                target_set - first_as_set, list_of_lists[1:])
    except ValueError:
        pass  # The recursive call failed to find a solution
    # If we get here, the first element failed.
    return find_covering_lists(target_set, list_of_lists[1:])
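To tie it together, here is a condensed, runnable restatement of that recursive idea with a driver on the question's sample data (the slightly rearranged try placement is mine; behaviour is the same):

```python
def find_covering_lists(target_set, list_of_lists):
    if not target_set:
        return []                  # nothing left to cover: success
    if not list_of_lists:
        raise ValueError()         # no candidate lists left: dead end
    first_as_set = set(list_of_lists[0])
    if first_as_set <= target_set:
        try:
            # Include the first list, then cover the remainder recursively.
            return [list_of_lists[0]] + find_covering_lists(
                target_set - first_as_set, list_of_lists[1:])
        except ValueError:
            pass                   # backtrack: retry without the first list
    return find_covering_lists(target_set, list_of_lists[1:])

lol = [[0,2,5,8,7,12,16,18], [0,9,18,23,5,8,15,16], [1,3,4,17,19,6,13,23],
       [9,22,21,10,11,20,14,15], [2,8,23,0,7,16,9,15], [0,5,8,7,9,11,20,16]]
print(find_covering_lists(set(range(24)), lol))
```

On this input it finds the same three lists as the itertools answers above.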

Python: Specifying Positions in Lists?

I have three lists, of the form:
Day: [1, 1, 1, 2, 2, 2, 3, 3, 3, ..... n, n, n]
Wavelength: [10, 20, 30, 10, 20, 30, 10, 20, 30, ..... 10, 20, 30]
Flux: [1, 2, 3, 1, 2, 3, 1, 2, 3, ..... 1, 2, 3]
I want to split the lists so that the sections of the list with the "Day" value of 1 are separated and run through a function, and the process then repeats and does this all the way through until it has been done for all n days.
I've tried splitting them into lists and currently have:
x = []
y = []
z = []
for i in day:
    if Day[i] == Day[i+1]:
        x.append(Day(i))
        y.append(Wavelength[i])
        z.append(Flux[i])
        i += 1
    else:
        # integrate over the Wavelength/Flux values where the value of Day is 1
        i += 1
This doesn't work, and I'm not convinced I'm going about it the best way. I'm relatively new to programming so it still takes me ages to find and fix errors!
If you use zip() to combine the three lists into one list of tuples, you can then filter it for each day you care about. (This isn't particularly efficient if you have lots of data, and will require more memory than your approach, but has the advantage of being concise, fairly pythonic, and, I believe, readable.)
data = zip(day, wavelength, flux)
for d in range(min(day), max(day)+1):
    print d, [datum for datum in data if datum[0] == d]
Instead of print you could just pass that list (the output of the […] list comprehension) to whatever function you need to run over the data (possibly with d, the day you're dealing with at that time).
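Since the Day values in the example arrive already grouped in runs, itertools.groupby is another way to split them; a sketch (the print stands in for whatever integration routine you'd call per day):

```python
from itertools import groupby

day = [1, 1, 1, 2, 2, 2, 3, 3, 3]
wavelength = [10, 20, 30, 10, 20, 30, 10, 20, 30]
flux = [1, 2, 3, 1, 2, 3, 1, 2, 3]

# Group consecutive (day, wavelength, flux) triples by their day value.
for d, rows in groupby(zip(day, wavelength, flux), key=lambda row: row[0]):
    rows = list(rows)
    wl = [r[1] for r in rows]
    fl = [r[2] for r in rows]
    print(d, wl, fl)
```

Note that groupby only merges consecutive equal keys, so this relies on the Day list being sorted or naturally grouped, as it is here.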

Counting like elements in a list and appending list

I am trying to create a list in Python with values pulled from an active excel sheet. I want it to pull the step # value from the excel file and append it to the list while also including which number of that element it is. For example, 1_1 the first time it pulls 1, 1_2 the second time, 1_3 the third, etc. My code is as follows...
import win32com.client

xl = win32com.client.Dispatch("Excel.Application")
CellNum = xl.ActiveSheet.UsedRange.Rows.Count
Steps = []
for i in range(2, CellNum + 1):  # Create load and step arrays in abaqus after importing from excel
    if str(int(xl.Cells(i,1).value)) + ('_1' or '_2' or '_3' or '_4' or '_5' or '_6') in Steps:
        StepCount = 1
        for x in Steps:
            if x == str(int(xl.Cells(i,1).value)) + ('_1' or '_2' or '_3' or '_4' or '_5' or '_6'):
                StepCount += 1
        Steps.append(str(int(xl.Cells(i,1).value)) + '_' + str(StepCount))
    else:
        Steps.append(str(int(xl.Cells(i,1).value)) + '_1')
I understand that without the excel file, the program will not run for any of you, but I was just wondering if it is some simple error that I am missing. When I run this, the StepCount does not go higher than 2 so I receive a bunch of 1_2, 2_2, 3_2, etc elements. I've posted my resulting list below.
>>> Steps
['1_1', '2_1', '3_1', '4_1', '5_1', '6_1', '7_1', '8_1', '9_1', '10_1', '11_1', '12_1',
'13_1', '14_1', '1_2', '14_2', '13_2', '12_2', '11_2', '10_2', '2_2', '3_2', '9_2',
'8_2', '7_2', '6_2', '5_2', '4_2', '3_2', '2_2', '1_2', '2_2', '3_2', '4_2', '5_2',
'6_2', '7_2', '8_2', '9_2', '10_2', '11_2', '12_2', '13_2', '14_2', '1_2', '2_2']
EDIT #1: So, if the ('_1' or '_2' or '_3' or '_4' or '_5' or '_6') will ALWAYS only use _1, is it this line of code that is messing with my counter?
if x == str(int(xl.Cells(i,1).value))+('_1' or '_2' or '_3' or '_4' or '_5' or '_6'):
Since it is only using _1, it will only count 1_1 and not check 1_2, 1_3, 1_4, etc
EDIT #2: Now I am using the following code. My input list is also below.
from collections import defaultdict

StepsList = []
Steps = []
tracker = defaultdict(int)
for i in range(2, CellNum + 1):
    StepsList.append(int(xl.Cells(i,1).value))
>>> StepsList
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1, 14, 13, 12, 11, 10, 2, 3, 9, 8,
7, 6, 5, 4, 3, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1, 2]
for cell in StepsList:
    Steps.append('{}_{}'.format(cell, tracker[cell]+1))  # This is +1 because the tracker starts at 0
    tracker[cell] += 1
I get the following error from the for cell in StepsList: iteration block: ValueError: zero length field name in format. (That error comes from Python 2.6, where positional format fields must be numbered explicitly, e.g. '{0}_{1}'.)
EDIT #3: Got it working. For some reason it didn't like
Steps.append('{}_{}'.format(cell, tracker[cell]+1))
So I just changed it to
for cell in StepsList:
    tracker[cell] += 1
    Steps.append(str(cell) + '_' + str(tracker[cell]))
Thanks for all of your help!
This line:
if str(int(xl.Cells(i,1).value))+('_1' or '_2' or '_3' or '_4' or '_5' or '_6') in Steps:
does not do what you think it does. ('_1' or '_2' or '_3' or '_4' or '_5' or '_6') will always evaluate to '_1'; it does not iterate over that series of or values looking for a match.
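A quick demonstration of that short-circuit behaviour:

```python
# `or` returns its first truthy operand; every non-empty string is truthy,
# so a chain of non-empty string literals always collapses to the first one.
suffix = ('_1' or '_2' or '_3' or '_4' or '_5' or '_6')
print(suffix)        # _1
print('5' + suffix)  # 5_1
```

Only when earlier operands are falsy (e.g. an empty string) does `or` move on to the next one.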
Without seeing expected input vs. expected output, it's hard to point you in the correct direction to actually get what you want out of your code, but likely you'll want to leverage itertools.product or one of the other combinatoric methods from itertools.
Update
Based on your comments, I think that this is a way of solving your problem. Assuming an input list of the following:
in_list = [1, 1, 1, 2, 3, 3, 4]
You can do the following:
from collections import defaultdict

tracker = defaultdict(int)  # defaultdict is just a regular dict with a default value at new keys (in this case 0)
steps = []
for cell in in_list:
    steps.append('{}_{}'.format(cell, tracker[cell]+1))  # This is +1 because the tracker starts at 0
    tracker[cell] += 1
Result:
>>> steps
['1_1', '1_2', '1_3', '2_1', '3_1', '3_2', '4_1']
There are likely more efficient ways to do this using combinations of itertools, but this way is certainly the most straightforward.