Randomly extract x items from a list using python

Starting with two lists such as:
lstOne = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
lstTwo = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
I want to have the user input how many items they want to extract, as a percentage of the overall list length, and the same indices from each list to be randomly extracted. For example, if I wanted 50%, the output would be:
newLstOne = ['8', '1', '3', '7', '5']
newLstTwo = ['8', '1', '3', '7', '5']
I have achieved this using the following code:
from random import randrange
lstOne = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
lstTwo = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
LengthOfList = len(lstOne)
print LengthOfList
PercentageToUse = input("What Percentage Of Reads Do you want to extract? ")
RangeOfListIndices = []
HowManyIndicesToMake = (float(PercentageToUse)/100)*float(LengthOfList)
print HowManyIndicesToMake
for x in lstOne:
    if len(RangeOfListIndices)==int(HowManyIndicesToMake):
        break
    else:
        random_index = randrange(0,LengthOfList)
        RangeOfListIndices.append(random_index)
print RangeOfListIndices
newlstOne = []
newlstTwo = []
for x in RangeOfListIndices:
    newlstOne.append(lstOne[int(x)])
for x in RangeOfListIndices:
    newlstTwo.append(lstTwo[int(x)])
print newlstOne
print newlstTwo
But I was wondering if there is a more efficient way of doing this; in my actual use case this is subsampling from 145,000 items. Furthermore, is randrange sufficiently free of bias at this scale?
Thank you

Q. I want to have the user input how many items they want to extract, as a percentage of the overall list length, and the same indices from each list to be randomly extracted.
A. The most straightforward approach directly matches your specification:
import random
# Python 2 shown; on Python 3 use input() and range() instead
percentage = float(raw_input('What percentage? '))
k = int(len(list1) * percentage // 100)  # sample() needs an integer count
indices = random.sample(xrange(len(list1)), k)
new_list1 = [list1[i] for i in indices]
new_list2 = [list2[i] for i in indices]
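For Python 3, the same idea might be wrapped in a function along these lines. This is only a sketch: the name subsample_pair and its seed parameter are made up for illustration, and it assumes both lists have the same length.

```python
import random

def subsample_pair(list1, list2, percentage, seed=None):
    """Pick the same random positions from two equal-length lists.

    Hypothetical helper for illustration; `seed` only makes runs repeatable.
    """
    if len(list1) != len(list2):
        raise ValueError("both lists must have the same length")
    rng = random.Random(seed)
    # Number of items to keep, rounded down to a whole number.
    k = int(len(list1) * percentage / 100)
    # sample() draws k distinct indices, so no position is picked twice.
    indices = rng.sample(range(len(list1)), k)
    return [list1[i] for i in indices], [list2[i] for i in indices]

lst_one = [str(n) for n in range(1, 11)]
lst_two = [str(n) for n in range(1, 11)]
new_one, new_two = subsample_pair(lst_one, lst_two, 50, seed=0)
print(new_one)
print(new_two)
```

Because range() is lazy in Python 3, sampling indices this way does not build a 145,000-element index list first.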
Q. in my actual use case this is subsampling from 145,000 items. Furthermore, is randrange sufficiently free of bias at this scale?
A. In Python 2 and Python 3, the random.randrange() function completely eliminates bias (it uses the internal _randbelow() method that makes multiple random choices until a bias-free result is found).
In Python 2, the random.sample() function is slightly biased but only in the round-off in the last of 53 bits. In Python 3, the random.sample() function uses the internal _randbelow() method and is bias-free.
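The "multiple random choices" idea can be sketched roughly as follows. This illustrates rejection sampling in general and is not a copy of CPython's _randbelow(); the function name randbelow_sketch is invented here.

```python
import random

def randbelow_sketch(n, rng=random):
    """Return a uniform integer in [0, n) by rejection sampling (n > 0)."""
    k = n.bit_length()          # enough bits to represent n
    r = rng.getrandbits(k)      # uniform over [0, 2**k)
    while r >= n:               # out of range: throw it away and redraw
        r = rng.getrandbits(k)
    return r

draws = [randbelow_sketch(145000) for _ in range(1000)]
print(min(draws), max(draws))
```

Rejecting out-of-range draws, rather than folding them back with a modulo, is what keeps every value equally likely.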

Just zip your two lists together, use random.sample to do your sampling, then zip again to transpose back into two lists.
import random
# list() is needed on Python 3, where zip() returns an iterator
_zips = random.sample(list(zip(lstOne, lstTwo)), 5)
new_list_1, new_list_2 = zip(*_zips)
demo:
list_1 = range(1,11)
list_2 = list('abcdefghij')
_zips = random.sample(list(zip(list_1, list_2)), 5)
new_list_1, new_list_2 = zip(*_zips)
new_list_1
Out[33]: (3, 1, 9, 8, 10)
new_list_2
Out[34]: ('c', 'a', 'i', 'h', 'j')

The way you are doing it looks mostly okay to me.
If you want to avoid sampling the same object several times, you could proceed as follows:
import random
n_items = int(HowManyIndicesToMake)  # number of items to extract, from the question's code
choose_from = list(range(len(lstOne)))  # one index per item in lstOne
random.shuffle(choose_from)
for i in choose_from[:n_items]:  # take the first n_items shuffled indices
    newlstOne.append(lstOne[i])  # same random positions in both lists,
    newlstTwo.append(lstTwo[i])  # so the two new lists stay aligned

Related

Parse a list into a list of lists in python

I am trying to figure out how to parse a list into a list of lists.
tileElements = browser.find_element(By.CLASS_NAME, 'tile-container')
tileHTML = (str(tileElements.get_attribute('innerHTML')))
tileNUMS = re.findall('\d+',tileHTML)
NumTiles = int(len(tileNUMS)/4)
#parse out list, each 4 list items are one tile
print(str(tileNUMS))
print(str(NumTiles))
TileList = [[i+j for i in range(len(tileNUMS))]for j in range (NumTiles)]
print(str(TileList))
The first part of this code works fine and gives me a list of tile numbers:
['2', '3', '1', '2', '2', '4', '4', '2']
However, what I need is a list of lists made out of this and that is where I am getting stuck.
The list of lists should be 4 elements long and look like this:
[['2', '3', '1', '2'] , ['2', '4', '4', '2']]
It should be able to do this for as many tiles as there are in the game (up to 19 I believe). It would be really nice if when the middle numbers are repeated that the two outside numbers are replaced with the latest value from the source list.

You can use a list comprehension to get slices from the list like so.
elements = ['2', '3', '1', '2', '2', '4', '4', '2']
size = 4
result = [elements[i:i+size] for i in range(0, len(elements), size)]
(By the way, there's no need to cast things into str to print them, and tileHTML is probably already a string, too.)
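Applied to the tile data from the question, the slicing approach might look like this. The helper name chunked is made up for illustration; note that the trailing group will be shorter if the list length is not a multiple of the size.

```python
def chunked(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

tile_nums = ['2', '3', '1', '2', '2', '4', '4', '2']
print(chunked(tile_nums, 4))
# → [['2', '3', '1', '2'], ['2', '4', '4', '2']]
print(chunked(tile_nums[:6], 4))
# → [['2', '3', '1', '2'], ['2', '4']]
```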

Converting string to 2D list

def convert_to_list(VertexList):
    VerticesList = []
    items = VertexList.split(';')
    for item in items:
        i = item.split(',')
        SubList = []
        for item in i:
            SubList.append(item)
        VerticesList.append(SubList)
    return VerticesList
This code converts a string in this format to a 2D list. However, I am sure it can be optimized.
Input -> '1,2,4,5,6,7;2,3,4,5,6,7,8;1,2,4,5,6,8'
Output -> [['1', '2', '4', '5', '6', '7'], ['2', '3', '4', '5', '6', '7', '8'], ['1', '2', '4', '5', '6', '8']]

Use a comprehension.
inp = '1,2,4,5,6,7;2,3,4,5,6,7,8;1,2,4,5,6,8'
print([s.split(',') for s in inp.split(';')])
Results in
[['1', '2', '4', '5', '6', '7'], ['2', '3', '4', '5', '6', '7', '8'], ['1', '2', '4', '5', '6', '8']]
This is smaller, easier-to-read code, which is part of the optimization I expect you were looking for. It doesn't loop through things any fewer times, but it executes fewer assignments, uses fewer temporary variables, and makes fewer function calls (i.e. append()). Maybe some of those calls are being made behind the scenes in the comprehension, but you should be taking advantage of whatever optimizations Python applies to its comprehensions in terms of what function calls are made.
--update--
Check out this answer for a performance analysis of the OP and this answer.
-- update 2 --
To convert all strings to int, you can use map or another comprehension.
inp = '1,2,4,5,6,7;2,3,4,5,6,7,8;1,2,4,5,6,8'
print([list(map(int, s.split(','))) for s in inp.split(';')])
or
inp = '1,2,4,5,6,7;2,3,4,5,6,7,8;1,2,4,5,6,8'
print([[int(c) for c in s.split(',')] for s in inp.split(';')])

This is not a solution, only a comparison of the actual performance of the two versions above:
from timeit import Timer
code1 = """\
def convert_to_list(VertexList):
    VerticesList = []
    items = VertexList.split(';')
    for item in items:
        i = item.split(',')
        SubList = []
        for item in i:
            SubList.append(item)
        VerticesList.append(SubList)
    return VerticesList
inp = '1,2,4,5,6,7;2,3,4,5,6,7,8;1,2,4,5,6,8'
convert_to_list(inp)
"""
code2 = """\
inp = '1,2,4,5,6,7;2,3,4,5,6,7,8;1,2,4,5,6,8'
out = [s.split(',') for s in inp.split(';')]
"""
t = Timer(stmt=code1)
time1 = t.timeit() # 1000000 iteration by default
print(f"Original time:{round(time1, 6)} sec.")
t = Timer(stmt=code2)
time2 = t.timeit() # 1000000 iteration by default
print(f"New time: {round(time2, 6)} sec.")
print(f'New solution faster in = {round(time1 / time2, 1)} times')
Output:
Original time:1.812856 sec.
New time: 0.741987 sec.
New solution faster in = 2.4 times

how to remove the first occurrence of an integer in a list

this is my code:
positions = []
for i in lines[2]:
    if i not in positions:
        positions.append(i)
print (positions)
print (lines[1])
print (lines[2])
the output is:
['1', '2', '3', '4', '5']
['is', 'the', 'time', 'this', 'ends']
['1', '2', '3', '4', '1', '5']
I would like the variable "positions" to be ['2','3','4','1','5'],
so instead of removing the second duplicate from "lines[2]" it should remove the first duplicate.

You can reverse your list, create the positions and then reverse it back, as mentioned by @tobias_k in the comments:
lst = ['1', '2', '3', '4', '1', '5']
positions = []
for i in reversed(lst):
    if i not in positions:
        positions.append(i)
list(reversed(positions))
# ['2', '3', '4', '1', '5']

You'll need to first detect what values are duplicated before you can build positions. Use a collections.Counter() object to test if a value has been seen more than once:
from collections import Counter
counts = Counter(lines[2])
positions = []
for i in lines[2]:
    counts[i] -= 1
    if counts[i] == 0:
        # only add if this is the 'last' value
        positions.append(i)
This'll work for any number of repetitions of values; only the last value to appear is ever used.
You could also reverse the list, and track what you have already seen with a set, which is faster than testing against the list:
positions = []
seen = set()
for i in reversed(lines[2]):
    if i not in seen:
        # only add if this is the first time we see the value
        positions.append(i)
        seen.add(i)
positions = positions[::-1] # reverse the output list
Both approaches require two iterations; the first to create the counts mapping, the second to reverse the output list. Which is faster will depend on the size of lines[2] and the number of duplicates in it, and whether or not you are using Python 3 (where Counter performance was significantly improved).
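As a quick sanity check, the two approaches can be run side by side on the sample data. This sketch assumes Python 3; the function names are invented for illustration.

```python
from collections import Counter

def keep_last_counter(values):
    """Keep only the last occurrence of each value, using a countdown."""
    remaining = Counter(values)
    result = []
    for v in values:
        remaining[v] -= 1
        if remaining[v] == 0:   # no later copies left, so this is the last one
            result.append(v)
    return result

def keep_last_reversed(values):
    """Keep only the last occurrence of each value, scanning from the end."""
    seen = set()
    result = []
    for v in reversed(values):
        if v not in seen:
            seen.add(v)
            result.append(v)
    return result[::-1]         # restore the original left-to-right order

data = ['1', '2', '3', '4', '1', '5']
print(keep_last_counter(data))   # → ['2', '3', '4', '1', '5']
print(keep_last_reversed(data))  # → ['2', '3', '4', '1', '5']
```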

You can use a dictionary to save the last position of each element and then build a new list from that information:
>>> data=['1', '2', '3', '4', '1', '5']
>>> temp={ e:i for i,e in enumerate(data) }
>>> sorted(temp, key=lambda x:temp[x])
['2', '3', '4', '1', '5']
>>>

How to work out an average from items within a dict.?

I am new to python so a simplified explanation would be much appreciated!
As of now I have a dictionary that looks like this:
names = {'Bob Smith': ['5', '6', '7', '5'], 'Fred Jones': ['8', '5', '7', '5', '9'], 'James Jackson': ['5','8','8','6','5']}
I need to do the following:
Take the last three items from each of the entries in the dict. e.g. 6, 7, 5 for bob smith.
Calculate an average based upon those values. e.g. Bob smith would be 6.
List the averages in order from highest to lowest (without the dict keys).
So far I have the following enclosed in an if statement:
if method == 2:
    for scores in names.items():
        score = scores[-1,-2,-3]
        average = sum(int(score)) / float(3)
        print(average)
I had a look at this thread too but I am still stuck.
Can anyone give me some pointers?

scores[-1,-2,-3] does not get the last three elements. It indexes with the tuple (-1,-2,-3), which would look up that key in a dictionary and raises an error on a list or tuple. scores[-3:] would get the last three elements.
When getting the scores, you need to use names.values() instead of names.items().
The int type constructor is not smart enough to handle lists of strings, only individual strings. Using map(int, score) or [int(i) for i in score] would fix that.
The variable score is also an extremely poor choice of name for a list of elements.
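Putting those points together, a corrected version of the loop might look like the following sketch. The variable names last_three and averages are chosen here for illustration, and it assumes every entry has at least three scores.

```python
names = {
    'Bob Smith': ['5', '6', '7', '5'],
    'Fred Jones': ['8', '5', '7', '5', '9'],
    'James Jackson': ['5', '8', '8', '6', '5'],
}

averages = []
for scores in names.values():                    # values only, no keys
    last_three = [int(s) for s in scores[-3:]]   # slice, then convert each string
    averages.append(sum(last_three) / 3.0)       # 3.0 forces float division

averages.sort(reverse=True)                      # highest to lowest
print(averages)
```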

In Python 3.4+, there is a statistics module:
>>> names = {'Bob Smith': ['5', '6', '7', '5'], 'Fred Jones': ['8', '5', '7', '5', '9'], 'James Jackson': ['5','8','8','6','5']}
>>> import statistics
>>> sorted((statistics.mean(map(int, x[-3:])) for x in names.values()), reverse=True)
[7.0, 6.333333333333333, 6.0]

names = {'Bob Smith': ['5', '6', '7', '5'], 'Fred Jones': ['8', '5', '7', '5', '9'], 'James Jackson': ['5','8','8','6','5']}
def avg(l):
    l = list(map(int, l))
    return sum(l[-3:]) / 3.0  # 3.0 avoids integer division on Python 2
avgs = []
for each in names.values():
    avgs.append(avg(each))
avgs.sort(reverse=True)
print(avgs)
Output:
[7.0, 6.333333333333333, 6.0]

maintain relative order in a set

I have a list that contains some repeated elements, and from it I want to generate a list that has no repeated elements AND also maintains their order in the list.
I tried set(['1','1','2','3','4','4','5','2','2','3','3','6']) and got the output as set(['1', '3', '2', '5', '4', '6'])
But I want the output as set(['1', '2', '3', '4', '5', '6']) i.e. maintain the relative order of the elements already present.
How to do this??? Thanks in advance...

One way to do this:
In [9]: x = ['1','1','2','3','4','4','5','2','2','3','3','6']
In [10]: s = set()
In [11]: y = []
In [12]: for i in x:
    ...:     if i not in s:
    ...:         y.append(i)
    ...:         s.add(i)
    ...:
In [13]: y
Out[13]: ['1', '2', '3', '4', '5', '6']
As noted by Martijn, a set is unordered by definition, so you need a list to store the result. See also this old question.
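On Python 3.7 and later, where dicts are guaranteed to preserve insertion order, the same result can be had more compactly. This is a sketch of an alternative, not what the loop above does:

```python
x = ['1', '1', '2', '3', '4', '4', '5', '2', '2', '3', '3', '6']

# dict keys are unique and keep first-insertion order (Python 3.7+),
# so this drops repeats while preserving the order of first appearance.
y = list(dict.fromkeys(x))
print(y)  # → ['1', '2', '3', '4', '5', '6']
```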
