How can I improve this Pandas DataFrame construction? - python

I wrote this ugly piece of code. It does the job, but it is not elegant. Any suggestions to improve it?
The function returns a dict given i, j.
pairs = [dict({"i":i, "j":j}.items() + function(i, j).items()) for i,j in my_iterator]
pairs = pd.DataFrame(pairs).set_index(['i', 'j'])
The dict({}.items() + function(i, j).items()) is supposed to merge both dicts into one, since dict.update() does not return the merged dict.

A favourite trick* for returning a newly created, updated dictionary:
dict(i=i, j=j, **function(i, j))
*and of much debate on whether this is actually "valid"...
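For concreteness, here is a minimal sketch of the trick applied to the original snippet; function here is a hypothetical stand-in, since the asker's actual function isn't shown:
import pandas as pd

# Hypothetical stand-in for the asker's function returning a dict for (i, j).
def function(i, j):
    return {"total": i + j, "product": i * j}

my_iterator = [(1, 2), (3, 4)]

# One expression merges the index keys with the function's result.
pairs = [dict(i=i, j=j, **function(i, j)) for i, j in my_iterator]
df = pd.DataFrame(pairs).set_index(['i', 'j'])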
Perhaps also worth mentioning the DataFrame from_records method:
In [11]: my_iterator = [(1, 2), (3, 4)]
In [12]: df = pd.DataFrame.from_records(my_iterator, columns=['i', 'j'])
In [13]: df
Out[13]:
   i  j
0  1  2
1  3  4
I suspect there would be a more efficient method by vectorizing your function (but it's hard to say what makes more sense without more specifics of your situation)...
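As a rough illustration of what vectorizing could look like, assuming the function's outputs can be expressed as whole-array operations (the total and product columns here are hypothetical):
import pandas as pd

my_iterator = [(1, 2), (3, 4)]
df = pd.DataFrame(my_iterator, columns=['i', 'j']).set_index(['i', 'j'])
i = df.index.get_level_values('i')
j = df.index.get_level_values('j')

# Compute each output column on whole arrays instead of row by row.
df['total'] = i + j
df['product'] = i * j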

Related

Every way to split a list in half in python without duplicates

I have a list of unique items, such as this one:
myList = ['a','b','c','d','e','f','g','h','i','j']
I want to find every possible way to split this list in half. For example, this is one way:
A = ['g','b','j','d','e']
B = ['f','a','h','i','c']
The first thing I thought of was to find all the combinations of 5 items from the list, and make this be sub-list A, and then everything else would be sub-list B:
for combination in itertools.combinations(myList, 5):
    A = combination
    B = everything_else()
This however does not work, as I will get every result twice. For example, if one of the combinations is ['a','b','c','d','e'] then, from this loop, I will get:
A = ['a','b','c','d','e']
B = ['f','g','h','i','j']
But then later on, when the combination ['f','g','h','i','j'] comes up, I will also get:
A = ['f','g','h','i','j']
B = ['a','b','c','d','e']
For my purposes, two sets of combinations are the same, therefore I should only get this result once. How can I achieve this?
EDIT: And to clarify, I want every single possible way to split the list (without any element appearing in both A and B at the same time, of course).
Liberal application of sets can solve this quite easily:
import itertools

def split(items):
    items = frozenset(items)
    combinations = (frozenset(combination) for combination in itertools.combinations(items, len(items) // 2))
    return {frozenset((combination, items - combination)) for combination in combinations}
Which works as expected:
>>> split([1, 2, 3, 4])
{
frozenset({frozenset({2, 4}), frozenset({1, 3})}),
frozenset({frozenset({1, 4}), frozenset({2, 3})}),
frozenset({frozenset({3, 4}), frozenset({1, 2})})
}
This follows your basic idea: we use the combinations of five from the original large set of items, and then get the other elements (which is easy enough with a set difference). We can then collapse the duplicates by making the pairs sets as well, so the order within a pair doesn't matter and the two halves in either order are treated as equivalent. Finally, we make the outer structure a set, which means the duplicates are removed.
The use of frozenset over set here is because mutable sets can't be members of other sets. We don't need any mutation here though, so that isn't a problem.
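A quick demonstration of why frozenset is needed here (ordinary sets aren't hashable, so they can't be set members):
>>> {set()}
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'set'
>>> {frozenset()}
{frozenset()}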
Obviously this isn't the most efficient possible solution, as we are still generating the duplicates, but it is probably the easiest and most foolproof way of implementing it.
This also leads pretty clearly into a simple upgrade for the later extension to the problem you give in the comments:
def split(items, first_length=None):
    items = frozenset(items)
    if first_length is None:
        first_length = len(items) // 2
    combinations = (frozenset(combination) for combination in itertools.combinations(items, first_length))
    return {frozenset((combination, items - combination)) for combination in combinations}
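For example, asking for a first part of length 1 should yield one split per element (set display order may vary):
>>> split([1, 2, 3, 4], first_length=1)
{
 frozenset({frozenset({1}), frozenset({2, 3, 4})}),
 frozenset({frozenset({2}), frozenset({1, 3, 4})}),
 frozenset({frozenset({3}), frozenset({1, 2, 4})}),
 frozenset({frozenset({4}), frozenset({1, 2, 3})})
}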
Your basic idea was sound, but as you noted you were getting duplicate splits. The obvious and simplest correction is to record every split you compute and check each newly computed split against those already generated. Of course, the most efficient way to record and test splits is to keep them in a set:
import itertools

def split(myList):
    assert len(myList) % 2 == 0
    s = set(myList)
    seen = set()
    for combination in itertools.combinations(myList, len(myList) // 2):
        A = list(combination)
        A.sort()
        A = tuple(A)
        if A in seen:  # saw this split already
            continue
        B = list(s - set(A))
        B.sort()
        B = tuple(B)
        if B in seen:  # saw this split
            continue
        seen.add(A)  # record that we have seen this split
        seen.add(B)  # record that we have seen this split
        yield (A, B)  # yield the next split

for s in split(['a', 'b', 'c', 'd']):
    print(s)
Prints:
(('a', 'b'), ('c', 'd'))
(('a', 'c'), ('b', 'd'))
(('a', 'd'), ('b', 'c'))
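As a quick sanity check (an addition, not from the original answer; math.comb needs Python 3.8+): the number of distinct half-splits of a 2n-element list is C(2n, n) / 2, since combinations() counts each unordered pair {A, B} twice.
from math import comb

items = ['a', 'b', 'c', 'd']
# 3 splits for a 4-element list: C(4, 2) / 2 = 3
assert sum(1 for _ in split(items)) == comb(len(items), len(items) // 2) // 2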

Update a dataframe within apply after using groupby

I have a pandas dataframe that I want to group on and then update the original dataframe using iterrows and set_value. This doesn't appear to work.
Here is an example.
In [1]: def func(df, n):
   ...:     for i, row in df.iterrows():
   ...:         print("Updating {0} with value {1}".format(i, n))
   ...:         df.set_value(i, 'B', n)
In [2]: df = pd.DataFrame({"A": [1, 2], "B": [0, 0]})
In [3]: df
Out[3]:
   A  B
0  1  0
1  2  0
In [125]: func(df, 1)
Updating 0 with value 1
Updating 1 with value 1
In [126]: df
Out[126]:
   A  B
0  1  1
1  2  1
In [127]: df.groupby('A').apply(lambda df: func(df, 2))
Updating 0 with value 2
Updating 0 with value 2
Updating 1 with value 2
In [128]: df
Out[128]:
   A  B
0  1  1
1  2  1
I was hoping that B would have been updated to 2.
Why isn't this working, and what is the best way to achieve this result?
The way you have things written, you seem to want the function func(df, n) to modify df in place. But df.groupby('A') (in some sense) creates another set of dataframes (one for each group), so using func() as an argument to df.groupby('A').apply() only modifies these newly created dataframes and not the original df. Furthermore, the returned dataframe is a concatenation of the outputs of func() called with each group as an argument, which is why the returned dataframe is empty.
The shortest fix to your problem is to return df at the end of func:
def func(df, n):
    for i, row in df.iterrows():
        print("Updating {0} with value {1}".format(i, n))
        df.set_value(i, 'B', n)
    return df

df = df.groupby('A').apply(lambda df: func(df, 2))
I presume this is not exactly what you had in mind, because you're probably expecting to modify everything in place. If modifying everything in place is your intention, you'd need to use a combination of a for loop and .loc, but modifying your dataframe with .loc will be computationally expensive if you intend to call .loc many times.
I would also guess that your function to set values depends on a more complicated criterion, but usually you can vectorize things and avoid having to use .iterrows() altogether.
To avoid the XY problem, I'd suggest describing your function in more detail, because chances are that you can get everything done with a few lines incorporating the use of .loc and avoiding the need to iterate through every row in Python. Case in point: df['B'] = 2 (sans a print statement) is a one-liner solution to your problem.
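To illustrate, here is a hedged sketch of what a vectorized per-group update could look like; the rule "set B to 2" is hypothetical, standing in for whatever func() actually computes:
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [0, 0]})

# Compute the new value per group and assign it in one shot,
# instead of iterating over rows inside apply().
df['B'] = df.groupby('A')['B'].transform(lambda s: 2)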
This isn't working because you are altering the copied subsets of df delivered by the groupby object's get_group method. You are changing something, just not what you were expecting.
If that weren't reason enough not to do this, you'll notice you had 3 print statements. That's because pandas runs the first group once to test and infer output, then runs it again to actually do the work. If you altered things outside the scope, you may end up with unintended consequences.
Someone else can provide a better example of how to do it. I just wanted to explain why it wasn't working.
In some situations, if func() does things based on index, you could modify the original dataframe directly.
Instead of this:
def func(group, n):
    for i, row in group.iterrows():
        print("Updating {0} with value {1}".format(i, n))
        group.set_value(i, 'B', n)
    return group

df.groupby('A').apply(lambda group: func(group, 2))
You could do this:
for key, group in df.groupby('A'):
    n = 2
    for i, row in group.iterrows():
        print("Updating {0} with value {1}".format(i, n))
        df.set_value(i, 'B', n)
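A side note not from the original answer: DataFrame.set_value was deprecated in pandas 0.21 and removed in 1.0; .at is the modern scalar setter with the same effect:
for key, group in df.groupby('A'):
    n = 2
    for i, row in group.iterrows():
        df.at[i, 'B'] = n  # .at replaces the removed set_value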

Most efficient way with Pandas to check pair of values from 2 series?

Let's say I have a series/dataframe A that looks like
A = [3,2,1,5,4,...
A could also be sorted as it doesn't matter to me. I want to create a new series that keeps track of possible pairs. That is, I want the result to look like
B = [3_1, 3_2, 3_4, ..., 2_1, 2_4, ..., 1_4, 1_5,...
That is, I want to exclude 2_3, since 3_2 already exists. I figure I could create each element in B using something like
for i in A:
    for j in A:
        s = A[i].astype(str) + '_' + A[j].astype(str)
        B.append(pd.Series([s]))
But I'm not sure how to make sure the (i, j) pairing doesn't already exist, such as making sure 2_3 doesn't get added, as I mentioned above.
What is the most efficient way to deal with this?
import pandas as pd
from itertools import combinations

s = pd.Series([1, 2, 3, 4])
s2 = pd.Series("_".join([str(a), str(b)]) for a, b in combinations(s, 2))
>>> s2
0    1_2
1    1_3
2    1_4
3    2_3
4    2_4
5    3_4
dtype: object
I don't think this really has much to do with pandas, except for the values originating in (and possibly ending up in) a Series. Instead, I'd use itertools.
Say you have an iterable a of values. Then
import itertools
set((str(i) + '_' + str(j)) for (i, j) in itertools.product(a, a) if i < j)
will create a set of pairs where the value before the _ is smaller than the one after it, removing duplicates.
Example
>>> import itertools
>>> a = [1, 2, 3, 4, 6, 7]
>>> set((str(i) + '_' + str(j)) for (i, j) in itertools.product(a, a) if i < j)
{'1_2',
'1_3',
'1_4',
'1_6',
'1_7',
'2_3',
'2_4',
'2_6',
'2_7',
'3_4',
'3_6',
'3_7',
'4_6',
'4_7',
'6_7'}
This can be done via a list comprehension:
>>> a = [3, 2, 1, 5, 4]
>>> [(str(x)+'_'+str(y)) for x in a for y in a if y>x]
['3_5', '3_4', '2_3', '2_5', '2_4', '1_3', '1_2', '1_5', '1_4', '4_5']
Note that the ordering of the members within each pair is sorted because of the y > x condition, which is why we have '1_3' in the output instead of '3_1'.
While importing itertools and using combinations is a correct way to do this, I usually prefer not to import libraries if I only need one or two things from them that can also be easily accomplished by direct means.
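Since the question asks about efficiency specifically, here is a sketch of a numpy-based alternative (not from the answers above): the strict upper triangle of the index grid enumerates each unordered pair exactly once, without a Python-level loop:
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Row/column indices of the strict upper triangle give all i < j pairs.
left, right = np.triu_indices(len(s), k=1)
vals = s.to_numpy().astype(str)
pairs = pd.Series(np.char.add(np.char.add(vals[left], '_'), vals[right]))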

Pandas Equivalent of R's which()

Variations of this question have been asked before, but I'm still having trouble understanding how to actually slice a Python series/pandas dataframe based on conditions that I'd like to set.
In R, what I'm trying to do is:
df[which(df[,colnumber] > somenumberIchoose),]
The which() function finds indices of row entries in a column in the dataframe which are greater than somenumberIchoose, and returns this as a vector. Then, I slice the dataframe by using these row indices to indicate which rows of the dataframe I would like to look at in the new form.
Is there an equivalent way to do this in Python? I've seen references to enumerate, which I don't fully understand after reading the documentation. My attempt to get the row indices currently looks like this:
indexfuture = [ x.index(), x in enumerate(df['colname']) if x > yesterday]
However, I keep on getting an invalid syntax error. I can hack a workaround by looping through the values and manually doing the search myself, but that seems extremely non-Pythonic and inefficient.
What exactly does enumerate() do? What is the pythonic way of finding indices of values in a vector that fulfill desired parameters?
Note: I'm using Pandas for the dataframes
I may not understand the question clearly, but it looks like the answer is easier than you think:
using pandas DataFrame:
df['colname'] > somenumberIchoose
returns a pandas series with True / False values and the original index of the DataFrame.
Then you can use that boolean series on the original DataFrame and get the subset you are looking for:
df[df['colname'] > somenumberIchoose]
should be enough.
See http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
From what I know of R, you might be more comfortable working with numpy, a scientific computing package similar to MATLAB.
If you want the indices of an array whose values are divisible by two, the following would work.
arr = numpy.arange(10)
truth_table = arr % 2 == 0
indices = numpy.where(truth_table)
values = arr[indices]
It's also easy to work with multi-dimensional arrays
arr2d = arr.reshape(2, 5)
col_index = 1  # for example
col_indices = numpy.where(arr2d[:, col_index] % 2 == 0)[0]
col_values = arr2d[col_indices, col_index]
enumerate() returns an iterator that yields an (index, item) tuple in each iteration, so you can't (and don't need to) call .index() again.
Furthermore, your list comprehension syntax is wrong:
indexfuture = [(index, x) for (index, x) in enumerate(df['colname']) if x > yesterday]
Test case:
>>> [(index, x) for (index, x) in enumerate("abcdef") if x > "c"]
[(3, 'd'), (4, 'e'), (5, 'f')]
Of course, you don't need to unpack the tuple:
>>> [tup for tup in enumerate("abcdef") if tup[1] > "c"]
[(3, 'd'), (4, 'e'), (5, 'f')]
unless you're only interested in the indices, in which case you could do something like
>>> [index for (index, x) in enumerate("abcdef") if x > "c"]
[3, 4, 5]
And if you need an additional condition, pandas.Series allows you to do operations between Series (+, -, *, /).
Just multiply the boolean indexes:
idx1 = df['lat'] == 49
idx2 = df['lng'] > 15
idx = idx1 * idx2
new_df = df[idx]
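For what it's worth, the bitwise & operator is the more common idiom for combining boolean Series, and it does the same thing here; note the parentheses, which are needed because & binds more tightly than the comparison operators:
new_df = df[(df['lat'] == 49) & (df['lng'] > 15)]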
Instead of enumerate, I usually just use .iteritems. This saves a separate .index lookup. Namely,
[k for k, v in (df['c'] > t).iteritems() if v]
Otherwise, one has to do
df[df['c'] > t].index
This duplicates the typing of the data frame name, which can be very long and painful to type.
A nice, simple and neat way of doing this is the following:
SlicedData1 = df[df.colname > somenumber]
This can easily be extended to include other criteria, such as non-numeric data:
SlicedData2 = df[(df.colname1 > somenumber) & (df.colname2 == '24/08/2018')]
And so on...

Why can't I iterate through list of lists this way?

Sorry I'm fairly new to python, but I needed to take 6 individual lists and concatenate them such that they resemble a list of lists.
i.e. a1 from list A + b1 from list B + c1 from list C
and a2 from list A + b2.... etc
should become [[a1,b1,c1], [a2,b2,c2]...]
I tried this:
combList = [[0]*6]*len(lengthList)
for i in range(len(lengthList)):
    print i
    combList[i][0] = posList[i]
    combList[i][1] = widthList[i]
    combList[i][2] = heightList[i]
    combList[i][3] = areaList[i]
    combList[i][4] = perimList[i]
    combList[i][5] = lengthList[i]
    # i++
print combList
and then tried a variation where I appended instead:
for i in range(len(lengthList)):
    print i
    combList[i][0].append(posList[i])
    combList[i][1].append(widthList[i])
    combList[i][2].append(heightList[i])
    combList[i][3].append(areaList[i])
    combList[i][4].append(perimList[i])
    combList[i][5].append(lengthList[i])
    # i++
print combList
So I have two questions.
Why didn't either of those work? In my mind they should have. And I don't need to put i++ at the bottom, right? For some reason it just wasn't working, so I was just troubleshooting.
I ended up finding a solution, which is below, but I'd just like to understand what happened in the above two codes that failed so terribly.
for j in range(len(fNameList)):
    rows = [fNameList[j], widthList[j], heightList[j], areaList[j], perimeterList[j], lengthList[j]]
    print rows
    combList.append(rows)
print combList
The issue with what you did is that you created a list of references to the same thing.
[0]*6 will generate a list of 6 references to the same number (zero), and [[0]*6]*len(lengthList) will generate a list of references to the same [0]*6 list.
I think the function you want is zip:
A = ['a1','a2','a3']
B = ['b1','b2','b3']
C = ['c1','c2','c3']
print [x for x in zip(A,B,C)]
which gives:
[('a1', 'b1', 'c1'), ('a2', 'b2', 'c2'), ('a3', 'b3', 'c3')]
So in your case, this would work:
combList = [x for x in zip(fNameList, widthList, heightList, areaList, perimeterList, lengthList)]
a = [0]*6 defines a list with 6 references, all pointing to the number 0.
[a]*m defines a list with m references, all pointing to a, in this case [0]*6.
The code in your final example works because it adds references to new objects, rather than modifying an existing one repeatedly.
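A quick demonstration of that aliasing, along with the usual list-comprehension fix:
>>> row = [0] * 3
>>> grid = [row] * 2        # two references to the same row
>>> grid[0][0] = 99
>>> grid
[[99, 0, 0], [99, 0, 0]]
>>> grid = [[0] * 3 for _ in range(2)]   # independent rows
>>> grid[0][0] = 99
>>> grid
[[99, 0, 0], [0, 0, 0]]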
Other people recommended you use zip, and it is indeed the best solution, IMHO.
You are making a list of names all pointing at the same list of six zeros when you do:
combList = [[0]*6]*len(lengthList)
This is equivalent to doing this:
internal_list = [0] * 6
combList = [internal_list, internal_list, internal_list, internal_list, internal_list]
Instead, if you use zip you can get what you want in one pass:
zipped_list = zip(posList, widthList, heightList, areaList, perimList, lengthList)
You could do this with a comprehension across a zip, if the lists are all the same length:
[list(tup) for tup in zip(l1, l2, l3, ...)]
Depending on your version of Python and how big your lists are, you should use either zip or izip:
izip if you're running Python < 3 (you can use zip as well, but with really big lists a generator will be a whole heap faster and lighter on memory).
zip if you're running Python >= 3.
from itertools import izip

a = b = c = [1, 2, 3, 4, 5]  # example inputs implied by the output below
zipped_list = izip(a, b, c)
for item in zipped_list:
    print item
>> (1, 1, 1)
>> (2, 2, 2)
>> (3, 3, 3)
>> (4, 4, 4)
>> (5, 5, 5)
And just for a bit of tutoring on how to write good, clean-looking Python:
The loop you wrote, for i in range(len(lengthList)), could very easily be transformed into what's really Pythonic:
for item in lengthList:
Now you're thinking "what about my index, I can't access the index of the element".
Well, Python has a fix for that too: it's called enumerate, and you use it like so:
for index, item in enumerate(lengthList):
So translating your code down to a more Pythonic syntax:
for index, element in enumerate(lengthList):
    combList[index][0].append(posList[index])
    combList[index][1].append(widthList[index])
    combList[index][2].append(heightList[index])
    combList[index][3].append(areaList[index])
    combList[index][4].append(perimList[index])
    combList[index][5].append(lengthList[index])
