I have a list of lists; each inner list contains four elements representing id, age, val1, and val2. I am manipulating each list so that its val1 and val2 values always depend on the most recent values seen in the previous lists. The previous lists for a given list are those whose age difference from it is not more than timeDelta. The list of lists is sorted by age.
My code works correctly but it is slow. I suspect the line marked ** generates too many copies of the list of lists and could be avoided by deleting lists from the beginning once I know that a list's age difference from the next list is more than timeDelta.
myList = [
[1, 20, '', 'x'],
[1, 25, 's', ''],
[1, 26, '', 'e'],
[1, 30, 'd', 's'],
[1, 50, 'd', 'd'],
[1, 52, 'f', 'g']
]
from functools import reduce  # needed on Python 3, where reduce moved to functools

age_Idx = 1
timeDelta = 10

def collapseListTogether(li, age_Idx, currage, timeDelta):
    finalList = []
    for xl in reversed(li):
        oldage = float(xl[age_Idx])
        if (currage - timeDelta) <= oldage < currage:
            finalList.append(xl)
        else:
            break
    # per column, keep the most recent non-empty value
    return [reduce(lambda a, b: b or a, tup) for tup in zip(*finalList[::-1])]

for i in range(1, len(myList)):
    newList = myList[:i+1]  # Subset of lists. #********
    respList = newList.pop(-1)
    currage = float(respList[age_Idx])
    retval = collapseListTogether(newList, age_Idx, currage, timeDelta)
    if len(retval) == 0:
        continue
    retval[0:2] = respList[0:2]
    print(retval)
Example
[1, 20, '', 'x'] ==> Not dependent on anything. Skip this list
[1, 25, 's', ''] == > [1, 25, '', 'x']
[1, 26, '', 'e'] ==> [1, 26, 's', 'x']
[1, 30, 'd', 's'] ==> [1, 30, 's', 'e']
[1, 50, 'd', 'd'] ==> Age difference (50 - 30 = 20) is more than 10. Skip this list
[1, 52, 'f', 'g'] ==> [1, 52, 'd', 'd']
I'm just rewriting your data structure and your code:
from collections import namedtuple
Record = namedtuple('Record', ['id', 'age', 'val1', 'val2'])
myList = [
Record._make([1, 20, '', 'x']),
Record._make([1, 25, 's', '']),
Record._make([1, 26, '', 'e']),
Record._make([1, 30, 'd', 's']),
Record._make([1, 50, 'd', 'd']),
Record._make([1, 52, 'f', 'g'])
]
timeDelta = 10
from functools import reduce  # needed on Python 3

def collapseListTogether(lst, age, tdelta):
    # keep only records within tdelta of, and older than, the current age
    finalLst = [ele for ele in lst
                if age - float(ele.age) <= tdelta and age > float(ele.age)]
    # per column, keep the most recent non-empty value
    return [reduce(lambda a, b: b or a, tup) for tup in zip(*finalLst[::-1])]

for i in range(1, len(myList)):
    subList = list(myList[:i+1])
    rec = subList.pop(-1)
    age = float(rec.age)
    retval = collapseListTogether(subList, age, timeDelta)
    if len(retval) == 0:
        continue
    retval[0:2] = [rec.id, rec.age]  # retval is a plain list, not a Record
    print(retval)
Your code is hard for me to read. I did not change the logic, just modified a few places for performance.
One way out is to replace your 4-element list with a tuple, or even better a namedtuple, which is a well-known lightweight container in Python. Also, list comprehensions are often faster than the equivalent for-loop in CPython, so use a comprehension where possible. Your list is not too large, so the time saved by faster iteration should outweigh the time lost by not breaking early.
To me, your code should not work as posted, but I am not sure.
Assuming your example is correct, I see no reason you can't do this in a single pass, since they're sorted by age. If the last sublist you inspected has too great a difference, you know nothing earlier will count, so you should just leave the current sublist unmodified.
previous_age = None
previous_val1 = ''
previous_val2 = ''
for sublist in myList:
    age = sublist[1]
    latest_val1 = sublist[2]
    latest_val2 = sublist[3]
    if previous_age is not None and ((age - previous_age) <= timeDelta):
        # there is at least one previous list within timeDelta
        sublist[2] = previous_val1
        sublist[3] = previous_val2
    previous_age = age
    previous_val1 = latest_val1 or previous_val1
    previous_val2 = latest_val2 or previous_val2
When testing, that code produces this modified value for your initial myList:
[[1, 20, '', 'x'],
[1, 25, '', 'x'],
[1, 26, 's', 'x'],
[1, 30, 's', 'e'],
[1, 50, 'd', 'd'],
[1, 52, 'd', 'd']]
It's a straightforward modification to build a new list rather than edit one in place, or to entirely omit the skipped lines rather than just leave them unchanged.
reduce and list comprehensions are powerful tools, but they're not right for all problems.
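As a sketch of that modification, here is the same single-pass logic rebuilt to append to a new list instead of editing myList in place (the variable names are made up for illustration):

```python
# Build a new list rather than editing myList in place; rows whose age gap
# exceeds timeDelta are kept unmodified, matching the output shown above.
myList = [
    [1, 20, '', 'x'],
    [1, 25, 's', ''],
    [1, 26, '', 'e'],
    [1, 30, 'd', 's'],
    [1, 50, 'd', 'd'],
    [1, 52, 'f', 'g'],
]
timeDelta = 10

result = []
previous_age = None
previous_val1 = ''
previous_val2 = ''
for rec_id, age, val1, val2 in myList:
    if previous_age is not None and (age - previous_age) <= timeDelta:
        # within timeDelta of the previous record: take the carried values
        result.append([rec_id, age, previous_val1, previous_val2])
    else:
        # gap too large (or first record): keep the row unmodified
        result.append([rec_id, age, val1, val2])
    previous_age = age
    # carry forward the latest non-empty value of each column
    previous_val1 = val1 or previous_val1
    previous_val2 = val2 or previous_val2

print(result)
```

Omitting the skipped rows instead of keeping them is a one-line change to the else branch.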
a = ['a', 'b', 'c']
b = [10, 20, 30]
output should be like [[a:10], [b:20], [c:30]]
I do know how to use zip to interweave two lists:
l = []
for x, y in zip(a, b):
    l.append([x, y])
And the output is [['a', 10], ['b', 20], ['c', 30]] instead of [[a:10], [b:20], [c:30]].
How should I make it like this, with ':'?
Thanks
Assuming that you mean to create a list of singleton dicts, you can zip the two lists, convert sequence of value pairs to singletons with another zip, and map the resulting sequence to the dict constructor:
list(map(dict, zip(zip(a, b))))
Or use a list comprehension:
[{i: j} for i, j in zip(a, b)]
Both return:
[{'a': 10}, {'b': 20}, {'c': 30}]
I have a data frame like this:
2pair counts
'A','B','C','D' 5
'A','B','K','D' 3
'A','B','P','R' 2
'O','Y','C','D' 1
'O','Y','CL','lD' 4
I want to make a nested list grouped by the first two elements: the first element of each group is the first two letters, and the rest are the other two letters together with the counts column. For example, for the above data the result should be:
[
[
['A','B'],
[['C','D'],5],
[['K','D'],3],
[['P','R'],2]
],
[
['O','Y'],
[['C','D'],1],
[['CL','lD'],4]
]
]
The following code does exactly what I want, but it is too slow. How can I make it faster?
pairs = []
trans = []
for i in range(df3.shape[0]):
    if df3['2pair'].values[i].split(',')[:2] not in trans:
        trans.append(df3['2pair'].values[i].split(',')[:2])
        sub = []
        sub.append(df3['2pair'].values[i].split(',')[:2])
        for j in range(df3.shape[0]):
            if df3['2pair'].values[i].split(',')[:2] == df3['2pair'].values[j].split(',')[:2]:
                sub.append([df3['2pair'].values[j].split(',')[2:], df3['counts'].values[j]])
        pairs.append(sub)
Here's one way using str.split to split the strings in 2pair column; then use groupby.apply + to_dict to create the lists:
df[['head', 'tail']] = [[(*x[:2],), x[2:]] for x in df['2pair'].str.split(',')]
out = [[[*k]] + v for k,v in (df.groupby('head')[['tail','counts']]
.apply(lambda x: x.to_numpy().tolist()).to_dict()
.items())]
Output:
[[['A', 'B'], [['C', 'D'], 5], [['K', 'D'], 3], [['P', 'R'], 2]],
[['O', 'Y'], [['C', 'D'], 1], [['CL', 'lD'], 4]]]
I am a total noob at coding, but am currently working on some stuff just to play around in Python, which is really cool!
Can you help me figure out a way to avoid the "ValueError: All arrays must be of the same length" in case some partial data (e.g. price) is not available on a website I want to crawl? I can see that the data is not available by printing the lengths: for example, the lists have lengths 10, 11, 11, which causes the error because the first is missing a row value. Everything else works just fine. For me it would be cool if the missing entry could simply be filled with something like "-" or "Not available".
I tried reading a lot and went through a lot of trial and error, so I would be really glad if someone could help me out. Here is my code:
#add parser
page_source = driver.page_source
soup = BeautifulSoup(driver.page_source, 'html.parser')
#add scrape info
images = []
for img in soup.findAll('div', {'class': 'gridContent'}):
    images.append(img.get('src'))
marke = [marke.text for marke in soup.findAll('span', {'class': 'ZZZ'})]
titel = [titel.text for titel in soup.findAll('h3', {'class': 'YYY'})]
preis = [preis.text for preis in soup.findAll('div', {'class': 'XXX'})]
#assign DF's
alle_daten = {'Zeitstempel:': timestamp_human, 'URL:': url, 'Marke:': marke, 'Titel:': titel, 'Preis:': preis}
df_all = pd.DataFrame(data=alle_daten)
df_scrape_all_clean = df_all.replace('\n', ' ',)
clean_stack = pd.concat([df_scrape_all_clean], axis=1)
df_all_urls = df_all_urls.append(df_all)
df_all_urls.to_excel("AAA.xlsx")
print(url)
I will walk you through a case study to understand why this error occurs and how to avoid it in the future.
Suppose we attempt to create the following pandas DataFrame:
import pandas as pd
#define arrays to use as columns in DataFrame
team = ['A', 'A', 'A', 'A', 'B', 'B', 'B']
position = ['G', 'G', 'F', 'F', 'G', 'G', 'F', 'F']
points = [5, 7, 7, 9, 12, 9, 9, 4]
#attempt to create DataFrame from arrays
df = pd.DataFrame({'team': team,
'position': position,
'points': points})
result:
ValueError: All arrays must be of the same length
We receive an error that tells us each array does not have the same length.
We can verify this by printing the length of each array:
#print length of each array
print(len(team), len(position), len(points))
result:
7 8 8
We see that the ‘team’ array only has 7 elements while the ‘position’ and ‘points’ arrays each have 8 elements.
How to Fix the Error
The easiest way to address this error is to simply make sure that each array we use has the same length:
import pandas as pd
#define arrays to use as columns in DataFrame
team = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
position = ['G', 'G', 'F', 'F', 'G', 'G', 'F', 'F']
points = [5, 7, 7, 9, 12, 9, 9, 4]
#create DataFrame from arrays
df = pd.DataFrame({'team': team,
'position': position,
'points': points})
#view DataFrame
df
team position points
0 A G 5
1 A G 7
2 A F 7
3 A F 9
4 B G 12
5 B G 9
6 B F 9
7 B F 4
Notice that each array has the same length this time.
Thus, when we use the arrays to create the pandas DataFrame we don’t receive an error because each column has the same length.
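Applied to the scraping question above, one possible sketch pads the shorter lists with a placeholder before building the DataFrame; the names mirror the question's marke/titel/preis lists, with made-up values standing in for scraped data:

```python
# Pad unequal scrape results with a placeholder so the DataFrame
# constructor no longer raises ValueError. The values are illustrative.
marke = ['m1', 'm2', 'm3']
titel = ['t1', 't2', 't3']
preis = ['9.99', '19.99']          # one price is missing

columns = {'Marke:': marke, 'Titel:': titel, 'Preis:': preis}
longest = max(len(v) for v in columns.values())
padded = {k: v + ['Not available'] * (longest - len(v))
          for k, v in columns.items()}

print(padded['Preis:'])
# `padded` can now be passed to pd.DataFrame without a length error
```

Note that padding only fixes the lengths; if a price is missing from the middle of the page rather than the end, the remaining prices may still line up against the wrong titles, so scraping each listing as one unit is the more robust fix.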
I'd like to merge a list of dictionaries with lists as values. Given
arr[0] = {'number':[1,2,3,4], 'alphabet':['a','b','c']}
arr[1] = {'number':[3,4], 'alphabet':['d','e']}
arr[2] = {'number':[6,7], 'alphabet':['e','f']}
the result I want would be
merge_arr = {'number':[1,2,3,4,3,4,6,7,], 'alphabet':['a','b','c','d','e','e','f']}
could you recommend any compact code?
If you know these are the only keys in the dict, you can hard code it. If it isn't so simple, show a complicated example.
from pprint import pprint
arr = [
{
'number':[1,2,3,4],
'alphabet':['a','b','c']
},
{
'number':[3,4],
'alphabet':['d','e']
},
{
'number':[6,7],
'alphabet':['e','f']
}
]
merged_arr = {
'number': [],
'alphabet': []
}
for d in arr:
    merged_arr['number'].extend(d['number'])
    merged_arr['alphabet'].extend(d['alphabet'])
pprint(merged_arr)
Output:
{'alphabet': ['a', 'b', 'c', 'd', 'e', 'e', 'f'],
'number': [1, 2, 3, 4, 3, 4, 6, 7]}
arr = [{'number':[1,2,3,4], 'alphabet':['a','b','c']},{'number':[3,4], 'alphabet':['d','e']},{'number':[6,7], 'alphabet':['e','f']}]
merged = {}
for k in arr[0].keys():
    merged[k] = sum([d[k] for d in arr], [])
print(merged)
output:
{'number': [1, 2, 3, 4, 3, 4, 6, 7], 'alphabet': ['a', 'b', 'c', 'd', 'e', 'e', 'f']}
Here is code that uses defaultdict to more easily collect the items. You could leave the result as a defaultdict but this version converts that to a regular dictionary. This code will work with any keys, and the keys in the various dictionaries can differ, as long as the values are lists. Therefore this answer is more general than the other answers given so far.
from collections import defaultdict
arr = [{'number':[1,2,3,4], 'alphabet':['a','b','c']},
{'number':[3,4], 'alphabet':['d','e']},
{'number':[6,7], 'alphabet':['e','f']},
]
merge_arr_default = defaultdict(list)
for adict in arr:
    for key, value in adict.items():
        merge_arr_default[key].extend(value)
merge_arr = dict(merge_arr_default)
print(merge_arr)
The printed result is
{'number': [1, 2, 3, 4, 3, 4, 6, 7], 'alphabet': ['a', 'b', 'c', 'd', 'e', 'e', 'f']}
EDIT: As noted by #pault, the solution below is of quadratic complexity, and therefore not recommended for large lists. There are more optimal ways to go around it.
However if you’re looking for compactness and relative simplicity, keep reading.
If you want a more functional form, this short comprehension will do:
from functools import reduce

arr = [{'number':[1,2,3,4], 'alphabet':['a','b','c']},{'number':[3,4], 'alphabet':['d','e']},{'number':[6,7], 'alphabet':['e','f']}]
keys = ['number', 'alphabet']
merge_arr = {key: reduce(list.__add__, [d[key] for d in arr]) for key in keys}
print(merge_arr)
Outputs:
{'alphabet': ['a', 'b', 'c', 'd', 'e', 'e', 'f'], 'number': [1, 2, 3, 4, 3, 4, 6, 7]}
This won't merge recursively.
If you want it to work with arbitrary keys, not necessarily present in each dict, use:
keys = {k for d in arr for k in d}
merge_arr = {key: reduce(list.__add__, [d.get(key, []) for d in arr]) for key in keys}
My problem is that I have a nested list:
l = [
['a','apple',1],
['b', 'banana', 0],
['a', 'artichoke', 'antenna'],
['b', 'brocolli', 'baton'],
['c', None, 22]
]
and I wanted to merge those lists that have a common value, without sorting the resulting list.
My preferred output:
[
['a','apple', 1, 'artichoke', 'antenna'],
['b', 'banana', 0, 'brocolli', 'baton'],
['c', None, 22]
]
I found solutions here and here.
But the output I'm getting is somehow sorted, which gives my current output:
[['c', None, 22], [1, 'antenna', 'apple', 'artichoke', 'a'], [0, 'b', 'banana', 'brocolli', 'baton']]
My code goes:
len_l = len(l)
i = 0
while i < (len_l - 1):
    for j in range(i + 1, len_l):
        # i,j iterate over all pairs of l's elements including new
        # elements from merged pairs. We use len_l because len(l)
        # may change as we iterate
        i_set = set(l[i])
        j_set = set(l[j])
        if len(i_set.intersection(j_set)) > 0:
            # Remove these two from list
            l.pop(j)
            l.pop(i)
            # Merge them and append to the orig. list
            ij_union = list(i_set.union(j_set))
            l.append(ij_union)
            # len(l) has changed
            len_l -= 1
            # adjust 'i' because elements shifted
            i -= 1
            # abort inner loop, continue with next l[i]
            break
    i += 1
print(l)
I would appreciate help here, and I'm also open to suggestions on how to do this in an easier way, since honestly I haven't used the union() or intersection() methods before.
Thanks
You can use a dictionary with the first element of each list as the key and extend a list each time as they're encountered in the list-of-lists, eg:
data = [
['a','apple',1],
['b', 'banana', 0],
['a', 'artichoke', 'antenna'],
['b', 'brocolli', 'baton'],
['c', None, 22]
]
Then we:
d = {}
for k, *vals in data:
    d.setdefault(k, []).extend(vals)
Optionally you can use d = collections.OrderedDict() here if you need to guarantee the keys keep the order seen in the list on Python versions before 3.7; from 3.7 onward, regular dicts already preserve insertion order.
Which gives you a d of:
{'a': ['apple', 1, 'artichoke', 'antenna'],
'b': ['banana', 0, 'brocolli', 'baton'],
'c': [None, 22]}
If you then want to unpack back to a lists of lists (although it's probably more useful being a dict) then you can do:
new_data = [[k, *v] for k, v in d.items()]
To get:
[['a', 'apple', 1, 'artichoke', 'antenna'],
['b', 'banana', 0, 'brocolli', 'baton'],
['c', None, 22]]