Printing a groupby object in pandas - python

df.groupby("female").apply(display)
This displays all the groups in the dataset, but my dataset is very large and VS Code crashed when I ran it because the output was so long. How can I display only the first n groups, just to get an idea of what my grouped data looks like?

You can use a generator. For example, if you want only the first group, you can call next on the generator once.
…
group_gen = (group for group in df.groupby("female"))
# get the first group: next() returns a (key, DataFrame) tuple
g = next(group_gen)
# show the group's DataFrame
display(g[1])
Generators let you lazily load only what you need.
To load the first N groups, call next N times and concatenate the results:
gs = pd.concat((next(group_gen)[1] for _ in range(N)), ignore_index=True)
display(gs)

To follow up on the nice answer from @Prayson, you can use itertools.islice to get only the first few groups:
from itertools import islice
n = 5
top_n = list(islice(df.groupby('female'), n))
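Each element of top_n is a (key, DataFrame) pair, so a quick (hypothetical) way to peek at the result could be:
# Hypothetical usage: inspect only the first n groups
for key, frame in top_n:
    print(key)
    display(frame.head())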


Compare items in list with each other

I did try to figure this out, but I am always getting index out of bounds or skipping some cases.
I have list of times:
list = ["8:00:33","8:05:02","8:06:12","8:58:17","8:58:58","9:53:11","11:03:54","11:45:51","13:54:42"]
I want to split this list into smaller chunks (lists), where each chunk contains times within 15 minutes of its first entry.
Expected output:
list=[["8:00:33","8:05:02","8:06:12"],["8:58:17","8:58:58"],["9:53:11","11:03:54"],["11:45:51"]...]
I hope you get what I want; ask any questions, and sorry for my bad English.
Thank you for your time and help :)
I got this far:
start = list[0]
firstchunk = [list[0]]
result = []
for i in range(len(list) - 1):
    if time_diff(start, list[i+1]) < 900:  # time_diff returns the difference between two times in seconds
        firstchunk.append(list[i+1])
        print("Start: ", start, " End: ", list[i+1])
    else:
        start = list[i+1]
        print("Finished chunk")
        result.append(firstchunk)
        firstchunk = [start]
        if time_diff(start, list[i+1]) > 5400:  # ignore this part
            print("Start: ", start)
Can you help me with a better solution?
Edit:
Thanks for the comments and solutions. Special thanks to Alain T.
Fastest of cars and most of money for you my brother. Thank you once again to all good people of stack overflow. I wish you a good and long life <3
You could use groupby from itertools to form the groups for you. You simply need to provide it with a key function that returns the same value for times that are within 15 minutes of the last starting point. This can be computed using the accumulate function (also from itertools):
from itertools import accumulate, groupby

times = ["8:00:33","8:05:02","8:06:12","8:58:17","8:58:58","9:53:11",
         "10:03:54","11:45:51","13:54:42"]
groups = accumulate(times, lambda g, t: g if time_diff(g, t) < 900 else t)
grouped = [[*g] for _, g in groupby(times, key=lambda _: next(groups))]
print(grouped)
# [['8:00:33', '8:05:02', '8:06:12'], ['8:58:17', '8:58:58'],
# ['9:53:11', '10:03:54'], ['11:45:51'], ['13:54:42']]
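These snippets (and the question) assume a time_diff helper that returns the number of seconds between two "H:MM:SS" strings. Its implementation isn't shown in the original, but a minimal sketch could be:
def time_diff(t1, t2):
    # Hypothetical helper: seconds elapsed from t1 to t2, both "H:MM:SS" strings
    def to_seconds(t):
        h, m, s = map(int, t.split(":"))
        return h * 3600 + m * 60 + s
    return to_seconds(t2) - to_seconds(t1)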
If you don't want to use libraries, a simple loop over times can produce the same result by adding each time to the last group, or starting a new group, based on the difference from the first entry of the last group:
grouped = []
for t in times:
    if grouped and time_diff(grouped[-1][0], t) < 900:
        grouped[-1].append(t)   # add to the last group
    else:
        grouped.append([t])     # start a new group
[EDIT] Injecting empty groups for each chunk of 1:30 between groups:
Use zip() to get consecutive pairs of groups and compare the end of the first group with the start of the next one. For each full chunk of 1.5 hours (5400 seconds) between them, insert an empty group using a nested comprehension:
grouped[1:] = [g for g1, g2 in zip(grouped, grouped[1:])
                 for g in time_diff(g1[-1], g2[0]) // 5400 * [[]] + [g2]]
print(grouped)
# [['8:00:33', '8:05:02', '8:06:12'], ['8:58:17', '8:58:58'],
# ['9:53:11', '10:03:54'], [], ['11:45:51'], [], ['13:54:42']]
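If the nested comprehension is hard to read, here is an equivalent explicit loop (a sketch, using the same hypothetical time_diff helper):
result = [grouped[0]]
for g1, g2 in zip(grouped, grouped[1:]):
    gap = time_diff(g1[-1], g2[0])
    result.extend([] for _ in range(gap // 5400))  # one empty group per full 1.5 h gap
    result.append(g2)
grouped = result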

Why does `itertools.repeat` always generate the same random number?

Compare the outputs of these two functions:
import numpy as np
from itertools import repeat

def rand_list1():
    l = lambda: np.random.rand(3)
    return list(repeat(l(), 5))

def rand_list2():
    return [np.random.rand(3) for i in range(5)]
We see that rand_list1, which uses itertools.repeat, always generates the same 3 numbers. Why is this? Can it be avoided, so that each call of rand_list1() generates new numbers?
For example, the output of rand_list1():
[[0.07678796 0.22623777 0.07533145]
[0.07678796 0.22623777 0.07533145]
[0.07678796 0.22623777 0.07533145]
[0.07678796 0.22623777 0.07533145]
[0.07678796 0.22623777 0.07533145]]
and the output of rand_list2():
[[0.77863856 0.30345662 0.7007517 ]
[0.56422447 0.97138115 0.47976387]
[0.20576279 0.92875791 0.06518335]
[0.2992384 0.89726684 0.16917078]
[0.8440534 0.38016789 0.51691172]]
There is a basic misconception about how the language works in your question.
With the lambda expression, you simply create a new function named l.
The moment you write l(), Python calls the function and it returns a value: it is that returned value that is used in place of the expression l() in the rest of the larger expression. So, in this case, you are actually calling repeat with a single, already generated, array as its first parameter.
Functions that are passed as arguments, to be called at their destination and run anew each time, are an allowed construct in Python, but (1) they depend on the receiving function being able to use functions as arguments, which is not the case for repeat, and (2) more importantly, one has to pass the function name without typing the parentheses.
In this case, repeat is redundant: a list comprehension already lets you call a function multiple times while building a list, which does the repetition you thought repeat would do for you.
Just do:
return [l() for _ in range(5)]
This will call l() on each iteration of the loop.
(By the way, one should strongly avoid l as a variable or function name in any context, as in many fonts it is hard to distinguish l from 1.)
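Putting the advice together, a fixed version of rand_list1 might look like this (make_row is a renamed stand-in for the original l):
import numpy as np

def rand_list1():
    make_row = lambda: np.random.rand(3)
    # The comprehension calls make_row() once per iteration,
    # so every element is a fresh random array
    return [make_row() for _ in range(5)]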
The reason list(repeat(l(), 5)) repeats the same value is that itertools.repeat() just yields the same object over and over. l() is evaluated first, so repeat receives the result of that single call, not the function itself.
From the itertools.repeat() documentation:
repeat(10, 3) --> 10 10 10
So what exactly is going on?
stage 1
list(repeat(l(), 5))
stage 2
list(repeat([some numbers], 5))
stage 3
list(repeat([some numbers], 5)) --> [some numbers], [some numbers], [some numbers], [some numbers], [some numbers]
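If you do want repeat in the picture, one option is to repeat the function object itself (no parentheses) and call it once per iteration; a small sketch:
import numpy as np
from itertools import repeat

l = lambda: np.random.rand(3)
# repeat() yields the same function object 5 times; calling it
# on each iteration produces a fresh random array
fresh = [f() for f in repeat(l, 5)]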

What is the most optimal method for "slicing" path objects, specifically the iterdir() function? [duplicate]

I would like to loop over a "slice" of an iterator. I'm not sure if this is possible as I understand that it is not possible to slice an iterator. What I would like to do is this:
def f():
    for i in range(100):
        yield i

x = f()
for i in x[95:]:
    print(i)
This of course fails with:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-37-15f166d16ed2> in <module>()
4 x = f()
5
----> 6 for i in x[95:]:
7 print(i)
TypeError: 'generator' object is not subscriptable
Is there a pythonic way to loop through a "slice" of a generator?
Basically the generator I'm actually concerned with reads a very large file and performs some operations on it line by line. I would like to test slices of the file to make sure that things are performing as expected, but it is very time consuming to let it run over the entire file.
Edit:
As mentioned, I need to do this on a file. I was hoping there was a way of specifying this explicitly with the generator, for instance:
import itertools
import skbio

f = 'seqs.fna'
seqs = skbio.io.read(f, format='fasta')  # seqs is a generator object
for seq in itertools.islice(seqs, 30516420, 30516432):
    # do a bunch of stuff here
    pass
The above code does what I need, but it is still very slow, as the generator still loops through all of the lines. I was hoping to loop over only the specified slice.
In general, the answer is itertools.islice, but you should note that islice doesn't, and can't, actually skip values. It just grabs and throws away start values before it begins yielding values. So it's usually best to avoid islice when you need to skip a lot of values and/or the values being skipped are expensive to acquire or compute. If you can find a way to not generate the values in the first place, do so. In your (obviously contrived) example, you'd just adjust the start index for the range object.
In the specific case of trying to run on a file object, pulling a huge number of lines (particularly reading from a slow medium) may not be ideal. Assuming you don't need specific lines, one trick you can use to avoid actually reading huge blocks of the file, while still testing some distance into the file, is to seek to a guessed offset, read out to the end of the line (to discard the partial line you probably seeked into the middle of), then islice off however many lines you want from that point. For example:
import itertools

with open('myhugefile') as f:
    # Assuming roughly 80 characters per line, this seeks to somewhere roughly
    # around the 100,000th line without reading the data preceding it
    f.seek(80 * 100000)
    next(f)  # Throw away the partial line you probably landed in the middle of
    for line in itertools.islice(f, 100):  # Process 100 lines
        pass  # Do stuff with each line
For the specific case of files, you might also want to look at mmap, which can be used in similar ways (and is especially useful if you're processing blocks of data rather than lines of text, possibly jumping around randomly as you go).
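A minimal sketch of the same trick with mmap (reusing the hypothetical filename and offset from above):
import mmap

with open('myhugefile', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        mm.seek(80 * 100000)  # jump to a guessed byte offset
        mm.readline()         # discard the partial line landed in
        line = mm.readline()  # read the next complete line (as bytes)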
Update: From your updated question, you'll need to look at your API docs and/or data format to figure out exactly how to skip around properly. It looks like skbio offers some features for skipping using seq_num, but that will still read, if not process, most of the file. If the data was written out with equal sequence lengths, I'd look at the docs on Alignment; aligned data may be loadable without processing the preceding data at all, e.g. by using Alignment.subalignment to create new Alignments that skip the rest of the data for you.
islice is the pythonic way
from itertools import islice

g = (i for i in range(100))
for num in islice(g, 95, None):
    print(num)
You can't slice a generator object or iterator using normal slice operations.
Instead you need to use itertools.islice, as @jonrsharpe already mentioned in his comment.
import itertools

for i in itertools.islice(x, 95, None):
    print(i)
Also note that islice returns an iterator and consumes the underlying generator. So you will need to convert your data to a list, or create a new generator object if you need to go back and reuse it, or use the little-known itertools.tee to create a copy of your generator:
from itertools import tee
first, second = tee(f())
Let's clarify something first.
Suppose you want to extract the first values from your generator, based on the number of names you specify on the left of the assignment. Right away we face a choice, because in Python there are two alternatives for unpacking something.
Let's discuss these alternatives using the following example. Imagine you have the following list l = [1, 2, 3]
1) The first alternative is to NOT use the "star" expression:
a, b, c = l # a=1, b=2, c=3
This works great if the number of names on the left of the assignment (in this case, 3) is equal to the number of elements in the list.
But, if you try something like this
a, b = l # ValueError: too many values to unpack (expected 2)
This is because the list contains more values than the number of names on the left of the assignment.
2) The second alternative is to use the "star" expression; this solves the previous error:
a, b, *c = l # a=1, b=2, c=[3]
The starred argument acts like a buffer list.
The buffer can have three possible values:
a, b, *c = [1, 2] # a=1, b=2, c=[]
a, b, *c = [1, 2, 3] # a=1, b=2, c=[3]
a, b, *c = [1, 2, 3, 4, 5] # a=1, b=2, c=[3,4,5]
Note that the list must contain at least 2 values (in the above example). If not, an error will be raised
Now, jump to your problem. If you try something like this:
a, b, c = generator
This will work only if the generator yields exactly three values (the number of values from the generator must be the same as the number of names on the left). Otherwise, an error will be raised.
If you try something like this:
a, b, *c = generator
If the generator yields fewer than 2 values, an error will be raised, because the variables a and b must each get a value.
If the generator yields exactly 3 values, then a=<val_1>, b=<val_2>, c=[<val_3>].
If the generator yields more than 3 values, then a=<val_1>, b=<val_2>, c=[<val_3>, ...].
In this case, if the generator is infinite, the program will block trying to consume it.
What I propose for you is the following solution
# Create a dummy generator for this example
def my_generator():
    i = 0
    while i < 2:
        yield i
        i += 1

# Our generator unpacker
class GeneratorUnpacker:
    def __init__(self, generator):
        self.generator = generator

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self.generator)
        except StopIteration:
            return None  # When the generator ends, we return None as the value

if __name__ == '__main__':
    dummy_generator = my_generator()
    g = GeneratorUnpacker(dummy_generator)
    a, b, c = next(g), next(g), next(g)
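For comparison, the same pad-with-None behaviour can be sketched with itertools alone: chain pads the generator with an endless supply of Nones, and islice takes exactly three values to unpack:
from itertools import chain, islice, repeat

a, b, c = islice(chain(my_generator(), repeat(None)), 3)
# a=0, b=1, c=None for the dummy generator above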

How to remove a list from a nested list in python?

I'm exploring probability with Python and I want to solve this kind of problem: I have 5 couples, and from these couples I want to extract every group of three people that doesn't contain both members of any married couple.
import itertools as it

married_couples = [["Mary","John"],["Homer","Marge"],["Beauty","Beast"],["Abigail","George"],["Marco","Luisa"]]
all_persons = []
for couples in married_couples:
    for person in couples:
        all_persons.append(person)

sample_space = list(it.combinations(all_persons, 3))
# Better solution with generator:
sample_space = [s for t in it.combinations(married_couples, 3) for s in it.product(*t)]
Now I would like to exclude from the sample space all results that contain two people from the same married couple, and I tried:
correct_results = sample_space.copy()
for couple in married_couples:
    for res in sample_space:
        if couple[0] in res and couple[1] in res:
            if res in correct_results:
                correct_results.remove(res)
Edit: I edited the code above and also put the generator solution inside, so you can use this code for the purpose.
The problem is in
if couple[0] and couple[1] in res:
because it does not test that the couple is in res; it tests that couple[0] is truthy and that couple[1] is in res.
You should use:
if couple[0] in res and couple[1] in res:
Here's a much more efficient way to do this:
r = [s for t in itertools.combinations(married_couples, 3) for s in itertools.product(*t)]
If you just need to iterate over this, and don't need it as an actual expanded list, then you can instead create a generator:
triples = (s for t in itertools.combinations(married_couples, 3) for s in itertools.product(*t))
Then you can do:
for triple in triples:
...
Both solutions work as follows. There are two levels to it. At the top level, it calls itertools.combinations(married_couples, 3), which generates all triplets from the original set of married couples. Then, for each triplet, it uses itertools.product(*t) to generate all 2**3 = 8 combinations that take one person from each of the three couples in the triplet.
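As a quick sanity check, a hypothetical three-couple example shows the counting: there is C(3, 3) = 1 way to pick 3 couples out of 3, times 2**3 ways to pick one person from each:
import itertools

couples = [["A", "a"], ["B", "b"], ["C", "c"]]
triples = [s for t in itertools.combinations(couples, 3)
           for s in itertools.product(*t)]
print(len(triples))  # 8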

Counting number of yields by an iterator?

One can consider the functions
def f(iterator):
    count = 0
    for _ in iterator:
        count += 1
    return count

def g(iterator):
    return len(tuple(iterator))
I believe the only way they can differ is that g might run out of memory while f doesn't.
Assuming I'm right about that:
Is there a faster and/or otherwise better (roughly, shorter without becoming code golf) way to get f(iterator) while using less memory than tuple(iterator) takes up, preferably inline rather than as a function?
(If there is some other way for f and g to differ, then I believe that f is more likely than g to properly define the functionality I'm after. I already looked at the itertools documentation page, and don't see a solution there.)
You can use sum with a generator expression yielding 1 for every item in the iterator:
>>> it = (i for i in range(4))
>>> sum(1 for _ in it)
4
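Note that this consumes the iterator. A well-known constant-memory recipe (essentially what more_itertools.ilen does) pairs each item with a counter and discards the pairs through a zero-length deque:
from collections import deque
from itertools import count

def ilen(iterable):
    counter = count()
    # The zero-length deque stores nothing, so memory stays O(1)
    # while count() is advanced once per item
    deque(zip(iterable, counter), maxlen=0)
    return next(counter)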
