Slice pandas series with elements not in the index - python

I have a pandas series indexed by tuples, like this:
from pandas import Series
s = Series({(0, 0): 1, (0, 1): 2, (0, 3): 3, (1, 0): 1, (1, 2): 4, (3, 0): 5})
I want to slice such a series using keys that are also tuples (under lexicographic ordering), but not necessarily present in the index. Slicing works when I pass a key that is in the index:
s[:(1,0)]
(0, 0) 1
(0, 1) 2
(0, 3) 3
(1, 0) 1
dtype: int64
but if I try slicing with a key that is not in the index, I get an error:
s[:(1,1)]
...
ValueError: Index(...) must be called with a collection of some kind, 0 was passed
Ideally I'd like to get the series elements indexed by (0, 0), (0, 1), (0, 3), (1, 0), similar to what happens when slicing using dates in TimeSeries. Is there a simple way to achieve this?

This works if you have a MultiIndex rather than an index of tuples:
In [11]: s.index = pd.MultiIndex.from_tuples(s.index)
In [12]: s
Out[12]:
0 0 1
1 2
3 3
1 0 1
2 4
3 0 5
dtype: int64
In [13]: s[:(1,1)]
Out[13]:
0 0 1
1 2
3 3
1 0 1
dtype: int64
In a previous edit I had suggested this could be a bug, and had created an awful hack...
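Putting the answer together, the MultiIndex conversion can be sketched as a small runnable example (variable names are my own; sort_index is added because slicing a MultiIndex requires a lexsorted index):

```python
import pandas as pd

# Series indexed by tuples, as in the question
s = pd.Series({(0, 0): 1, (0, 1): 2, (0, 3): 3, (1, 0): 1, (1, 2): 4, (3, 0): 5})

# Convert the tuple index to a MultiIndex so lexicographic slicing works
s.index = pd.MultiIndex.from_tuples(s.index)
s = s.sort_index()  # slicing a MultiIndex requires a sorted index

# Slice up to (1, 1), even though (1, 1) itself is not in the index
sliced = s.loc[:(1, 1)]
print(sliced)
```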

Related

Get log(n, 2) features after dummy encoding

During dummy encoding (for example with OneHotEncoder) we can drop the first column (with the parameter drop='first'). That works because 2 columns are enough to encode 3 categories, like: (0, 0), (0, 1), (1, 0); and likewise 4 categories: (0, 0), (0, 1), (1, 0), (1, 1). So I noticed that to encode n categories it's enough to have math.ceil(log(n, 2)) columns. But I can't find a function in sklearn/pandas that does this. I'd appreciate your help.
What you're looking for is the bin() function, a standard Python built-in.
Suppose you happen to have a simple pandas df:
df = pd.DataFrame({"a":["a","b","c","x","a","c"]})
print(df)
a
0 a
1 b
2 c
3 x
4 a
5 c
Then you may proceed as follows:
# integer-encode the categories
df["enc"] = df["a"].astype("category").cat.codes
# number of binary digits needed for the largest code
max_enc_length = len(bin(df["enc"].max())[2:])
# zero-padded binary string for each code
df["enc"] = df["enc"].apply(lambda x: bin(x)[2:].zfill(max_enc_length))
# split each binary string into one column per digit
df = pd.concat([df["a"], df["enc"].apply(lambda x: pd.Series(list(x)))], axis=1)
print(df)
a 0 1
0 a 0 0
1 b 0 1
2 c 1 0
3 x 1 1
4 a 0 0
5 c 1 0
Note that linear models are unsuitable for this type of encoding.
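The same idea can be wrapped in a small helper. This is a sketch under my own naming (binary_encode is not a standard sklearn/pandas API):

```python
import math

import pandas as pd

def binary_encode(series):
    """Encode a categorical Series into ceil(log2(n)) binary columns."""
    codes = series.astype("category").cat.codes
    n = int(codes.max()) + 1                         # number of distinct categories
    width = max(1, math.ceil(math.log2(n)))          # bits needed to encode n categories
    bits = codes.apply(lambda c: bin(int(c))[2:].zfill(width))
    return bits.apply(lambda b: pd.Series([int(ch) for ch in b]))

df = pd.DataFrame({"a": ["a", "b", "c", "x", "a", "c"]})
encoded = binary_encode(df["a"])
print(encoded)
```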

Count the frequency that a combination occurs in a Dataframe column - Apriori algorithm

I'm having trouble computing the correct frequency of each combination.
This is my code:
import pandas as pd
import itertools
items = [1, 20, 1, 50]
combinations = []
for i in itertools.combinations(items, 2):
    combinations.append(i)
data = pd.DataFrame({'products': combinations})
data['frequency'] = data.groupby('products')['products'].transform('count')
print(data)
The output is:
products frequency
0 (1, 20) 1
1 (1, 1) 1
2 (1, 50) 2
3 (20, 1) 1
4 (20, 50) 1
5 (1, 50) 2
The problem is that (1, 20) and (20, 1) each get a frequency of 1, but they are the same combination, so the count should be 2. Is there a method that gives the correct result?
You can group by a modified version of the column, using apply and a lambda to sort each tuple:
import pandas as pd
import itertools
items = [1, 20, 1, 50]
combinations = []
for i in itertools.combinations(items, 2):
    combinations.append(i)
data = pd.DataFrame({'products': combinations})
# group by the sorted tuple so (1, 20) and (20, 1) are counted together
data['frequency'] = data.groupby(data['products'].apply(
    lambda i: tuple(sorted(i))))['products'].transform('count')
print(data)
The output will be
products frequency
0 (1, 20) 2
1 (1, 1) 1
2 (1, 50) 2
3 (20, 1) 2
4 (20, 50) 1
5 (1, 50) 2
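If you only need the counts (not a column on the frame), an alternative sketch is to canonicalize each pair with sorted() and count with collections.Counter from the standard library:

```python
import itertools
from collections import Counter

items = [1, 20, 1, 50]

# Canonicalize each pair so (1, 20) and (20, 1) count as the same combination
pairs = [tuple(sorted(p)) for p in itertools.combinations(items, 2)]
counts = Counter(pairs)
print(counts)
```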

Convert pandas dataframe elements to tuple

I have a dataframe:
>>> df = pd.DataFrame(np.random.random((3,3)))
>>> df
0 1 2
0 0.732993 0.611314 0.485260
1 0.935140 0.153149 0.065653
2 0.392037 0.797568 0.662104
What is the easiest way to convert each entry to a 2-tuple, with the first element taken from the current dataframe and the second element from the last column ('2')?
i.e. I want the final results to be:
0 1 2
0 (0.732993, 0.485260) (0.611314, 0.485260) (0.485260, 0.485260)
1 (0.935140, 0.065653) (0.153149, 0.065653) (0.065653, 0.065653)
2 (0.392037, 0.662104) (0.797568, 0.662104) (0.662104, 0.662104)
As of pandas 0.20, you can use df.transform:
In [111]: df
Out[111]:
0 1 2
0 1 3 4
1 2 4 5
2 3 5 6
In [112]: df.transform(lambda x: list(zip(x, df[2])))
Out[112]:
0 1 2
0 (1, 4) (3, 4) (4, 4)
1 (2, 5) (4, 5) (5, 5)
2 (3, 6) (5, 6) (6, 6)
Or, another solution using df.apply:
In [113]: df.apply(lambda x: list(zip(x, df[2])))
Out[113]:
0 1 2
0 (1, 4) (3, 4) (4, 4)
1 (2, 5) (4, 5) (5, 5)
2 (3, 6) (5, 6) (6, 6)
You can also use dict comprehension:
In [126]: pd.DataFrame({i : df[[i, 2]].apply(tuple, axis=1) for i in df.columns})
Out[126]:
0 1 2
0 (1, 4) (3, 4) (4, 4)
1 (2, 5) (4, 5) (5, 5)
2 (3, 6) (5, 6) (6, 6)
I agree with Corley's comment that you are better off leaving the data in the current format, and changing your algorithm to process data explicitly from the second column.
However, to answer your question, you can define a function that does what's desired and call it using apply.
I don't like this answer; it is ugly, and apply is syntactic sugar for a for loop, so you are definitely better off not using this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((3,3)))
df
0 1 2
0 0.847380 0.897275 0.462872
1 0.161202 0.852504 0.951304
2 0.093574 0.503927 0.986476
def make_tuple(row):
    n = len(row)
    row = [(x, row[n - 1]) for x in row]
    return row

df.apply(make_tuple, axis=1)
0 (0.847379908309, 0.462871875315) (0.897274903359, 0.462871875315)
1 (0.161202442072, 0.951303842798) (0.852504052133, 0.951303842798)
2 (0.0935742441563, 0.986475692614) (0.503927404884, 0.986475692614)
2
0 (0.462871875315, 0.462871875315)
1 (0.951303842798, 0.951303842798)
2 (0.986475692614, 0.986475692614)
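For reference, the column-wise zip approach from the accepted answer can be checked end to end on a small frame (a minimal sketch using the same data as the In [111] example):

```python
import pandas as pd

df = pd.DataFrame({0: [1, 2, 3], 1: [3, 4, 5], 2: [4, 5, 6]})

# Pair every element with the last-column value of its row
result = df.apply(lambda col: list(zip(col, df[2])))
print(result)
```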

Pandas: Conditionally Selecting Columns to perform Calculation based on Header of another Column

My dataframe looks like this:
(1, 2) (1, 3) (1, 4) (1, 5) (1, 6) (1, 7) (1, 8) (1, 9) (1, 10) (1, 11) ... 2 3 4 5 6 7 8 9 10 11
0 0 1 0 1 1 1 1 0 1 0 ... 0.612544 0.727393 0.366578 0.631451 0.722980 0.772853 0.964982 0.549801 0.406692 0.798083
1 0 0 0 0 0 0 0 0 0 0 ... 0.583228 0.698729 0.343934 0.602037 0.694230 0.745422 0.954682 0.521298 0.382381 0.771640
2 1 0 0 1 0 1 1 0 0 0 ... 0.481291 0.593353 0.271028 0.498949 0.588807 0.641602 0.901779 0.424495 0.303309 0.669657
3 1 1 0 1 0 1 1 0 0 1 ... 0.583228 0.698729 0.343934 0.602037 0.694230 0.745422 0.954682 0.521298 0.382381 0.771640
4 0 0 0 1 1 1 1 1 1 1 ... 0.612544 0.727393 0.366578 0.631451 0.722980 0.772853 0.964982 0.549801 0.406692 0.798083
where I have column headers that are tuples, like (1, 2), and column headers that are a single element, like 1. I want to perform a calculation on each tuple column based on the columns named by the elements of that tuple. For example, for the tuple (1, 2), I want to retrieve columns 1 and 2, multiply them together, then subtract the result from column (1, 2).
The solution I thought of was to create 55 new columns holding the products of the single-element columns (e.g. 1 and 2), and then do some sort of identity match using .where() and .all(). However, this seems computationally inefficient, since I'd be creating a whole other set of data rather than performing the calculation directly on the tuple columns. How would I go about this?
Not sure if this is faster, but here's a solution without needing where()/all()
import pandas as pd
# create some sample data
arr = [[1, 2, 3, 4, 5, 6, 7],
[7, 6, 5, 4, 3, 2, 1]]
df = pd.DataFrame(arr, columns=[('a', 'b'), ('c','d'), ('a', 'd'), 'a', 'b', 'c', 'd'])
# get all tuple headers
tuple_columns = [col for col in df.columns if isinstance(col, tuple)]
# put the results into a list of series and concat into a DataFrame
results = pd.concat([df[col] - df[col[0]] * df[col[1]] for col in tuple_columns], axis=1)
# rename the columns
results.columns = tuple_columns
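Running the snippet on the sample frame gives, for example, ('a', 'b') = 1 - 4*5 = -19 in the first row. A quick self-contained check (same data as above):

```python
import pandas as pd

arr = [[1, 2, 3, 4, 5, 6, 7],
       [7, 6, 5, 4, 3, 2, 1]]
df = pd.DataFrame(arr, columns=[('a', 'b'), ('c', 'd'), ('a', 'd'), 'a', 'b', 'c', 'd'])

# For each tuple column, subtract the product of its element columns
tuple_columns = [col for col in df.columns if isinstance(col, tuple)]
results = pd.concat([df[col] - df[col[0]] * df[col[1]] for col in tuple_columns], axis=1)
results.columns = tuple_columns
print(results)
```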

python- pandas- concatenate columns with a loop

I have a list of columns that I need to concatenate. An example table would be:
import random
import numpy as np
import pandas as pd
cats1 = ['T_JW', 'T_BE', 'T_FI', 'T_DE', 'T_AP', 'T_KI', 'T_HE']
data = np.array([random.sample(list(range(0, 2)) * 7, 7)] * 3)
df_ = pd.DataFrame(data, columns=cats1)
So I need to concatenate the values in each row (ideally with a blank space between values). I tried:
listaFin = [''] * 1000
for i in cats1:
    lista = list(df_[i])
    listaFin = zip(listaFin, lista)
But I get a list of tuples:
listaFin:
[((((((('', 0), 0), 1), 0), 1), 0), 1),
((((((('', 0), 0), 1), 0), 1), 0), 1),
((((((('', 0), 0), 1), 0), 1), 0), 1)]
And I need to get something like
[0 0 1 0 1 0 1,
0 0 1 0 1 0 1,
0 0 1 0 1 0 1]
How can I do this using only one loop or less (I don't want to use a double loop)?
Thanks.
I don't think you can have a list of space delimited integers in Python without them being in a string (I might be wrong). Having said that, the answer I have is:
output = []
for i in range(0, df_.shape[0]):
    output.append(' '.join(str(x) for x in list(df_.loc[i])))
print(output)
output looks like this:
['1 0 0 0 1 0 1', '1 0 0 0 1 0 1', '1 0 0 0 1 0 1']
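A loop-free alternative is a sketch using astype(str) plus agg, which are standard pandas operations and produce the same space-joined strings:

```python
import pandas as pd

df_ = pd.DataFrame([[1, 0, 0, 0, 1, 0, 1]] * 3,
                   columns=['T_JW', 'T_BE', 'T_FI', 'T_DE', 'T_AP', 'T_KI', 'T_HE'])

# Convert every cell to str, then join each row with spaces
output = df_.astype(str).agg(' '.join, axis=1).tolist()
print(output)
```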
