In-place Compression of IntervalIndex'd Pandas Series with Repeated Data - python

I have a pandas dataframe with an IntervalIndex. Some of the columns will have several connecting intervals with the same data, and I would like to break the dataframe into a dictionary of series and then do an in-place compression, merging adjacent intervals that have the same data value.
I have what I think is a pretty ugly solution, using a while loop and a list of compressed intervals as a hack around the fact that an IntervalIndex is immutable. I am not posting this on Code Review because I think it is just terrible; I am only posting it here because it shows what I am trying to do.
Is there some built-in function that I am unaware of or maybe a more pythonic combination I can use to do this?
import pandas as pd
# create data
index = pd.IntervalIndex.from_breaks([1, 2, 3, 4, 5])
data = {"a": [1, 2, 3, 4], "b": [5, 5, 6, 6], "c": [7, 7, 7, 7]}
df = pd.DataFrame(index=index, data=data)
# break dataframe into a dict of series so we can compress
dict_of_series = df.to_dict("series")
# combine overlapping bins with redundant data
def compress(s):
    # we need to build the compressed index rather than modify it in place, as an IntervalIndex is immutable
    compressed_index = [s.index[0]]
    i = 1
    while True:
        try:
            i0 = s.index[i - 1]
            i1 = s.index[i]
            d0 = s.iloc[i - 1]
            d1 = s.iloc[i]
        except IndexError:
            break
        # if intervals don't touch, we cannot combine bins
        if compressed_index[-1].right < i1.left:
            compressed_index.append(i1)
            i += 1
            continue
        # if intervals touch and the data is the same, combine bins
        if d0 == d1:
            s.drop(i1, inplace=True)
            compressed_index[-1] = pd.Interval(compressed_index[-1].left, i1.right)
            continue
        compressed_index.append(i1)
        i += 1
    # replace the index with the compressed index we built
    s.index = pd.IntervalIndex(compressed_index)
print("Original Data for 'a'")
print(dict_of_series["a"])
print()
print("Compressed Data for 'a'")
compress(dict_of_series["a"])
print(dict_of_series["a"])
print()
print("Original Data for 'b'")
print(dict_of_series["b"])
print()
print("Compressed Data for 'b'")
compress(dict_of_series["b"])
print(dict_of_series["b"])
print()
print("Original Data for 'c'")
print(dict_of_series["c"])
print()
print("Compressed Data for 'c'")
compress(dict_of_series["c"])
print(dict_of_series["c"])
Which gives the following result:
Original Data for 'a'
(1, 2] 1
(2, 3] 2
(3, 4] 3
(4, 5] 4
Name: a, dtype: int64
Compressed Data for 'a'
(1, 2] 1
(2, 3] 2
(3, 4] 3
(4, 5] 4
Name: a, dtype: int64
Original Data for 'b'
(1, 2] 5
(2, 3] 5
(3, 4] 6
(4, 5] 6
Name: b, dtype: int64
Compressed Data for 'b'
(1, 3] 5
(3, 5] 6
Name: b, dtype: int64
Original Data for 'c'
(1, 2] 7
(2, 3] 7
(3, 4] 7
(4, 5] 7
Name: c, dtype: int64
Compressed Data for 'c'
(1, 5] 7
Name: c, dtype: int64

Try Series.duplicated:
>>> df.loc[~df['a'].duplicated(), 'a']
(1, 2] 1
(2, 3] 2
(3, 4] 3
(4, 5] 4
Name: a, dtype: int64
>>> df.loc[~df['b'].duplicated(), 'b']
(1, 2] 5
(3, 4] 6
Name: b, dtype: int64
>>> df.loc[~df['c'].duplicated(), 'c']
(1, 2] 7
Name: c, dtype: int64
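Note that duplicated keeps only the first row of each repeated value but leaves the surviving interval unchanged; if you also want the bins widened to span the merged range, as in your compressed output, one option is to label consecutive runs of equal values and rebuild the IntervalIndex from each run's first left and last right edge. A minimal sketch, assuming the intervals are sorted and contiguous as in your example (the helper name merge_runs is mine):
import pandas as pd

def merge_runs(s):
    # label consecutive runs of equal values (a new label starts whenever the value changes)
    runs = (s != s.shift()).cumsum().to_numpy()
    left = pd.Series(s.index.left).groupby(runs).first()
    right = pd.Series(s.index.right).groupby(runs).last()
    values = s.groupby(runs).first()
    return pd.Series(values.to_numpy(),
                     index=pd.IntervalIndex.from_arrays(left, right),
                     name=s.name)

print(merge_runs(df['c']))
# (1, 5]    7
# Name: c, dtype: int64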

Related

Loss of numpy array dimensions when save and retrieve from csv file using pandas

I have a numpy.array data type and I want to write it to a .csv file with pandas, so I run this:
data = numpy.array([1, 2, 3, 4, 5, 6])
print(data)
print((data.shape))
df = pd.DataFrame(columns = ['content'])
df.loc[0, 'content'] = data
df.to_csv('data.csv', index = False)
print(df.head())
>>> [1 2 3 4 5 6]
>>> (6,)
>>> content
0 [1, 2, 3, 4, 5, 6]
As seen in the output, the shape of the numpy array is (6,).
But the problem is that when I retrieve it from the .csv file, the array's dimensions are lost and its shape changes to ().
data = pd.read_csv('data.csv')
val = numpy.array(data['content'][0])
print(val.shape)
print(val)
>>> ()
>>> [1 2 3 4 5 6]
Why is this happening? How can I solve this problem?
In [46]: import pandas as pd
In [47]: data = np.arange(1,7)
In [48]: data.shape
Out[48]: (6,)
The original dataframe:
In [49]: df = pd.DataFrame(columns = ['content'])
...: df.loc[0, 'content'] = data
In [50]: df
Out[50]:
content
0 [1, 2, 3, 4, 5, 6]
In [52]: df.to_numpy()
Out[52]: array([[array([1, 2, 3, 4, 5, 6])]], dtype=object)
to_numpy from a dataframe produces a 2d array, here with 1 element, and that element is an array itself.
In [54]: df.to_numpy()[0,0]
Out[54]: array([1, 2, 3, 4, 5, 6])
Look at the full file, not just the head:
In [55]: df.to_csv('data.csv', index = False)
In [56]: cat data.csv
content
[1 2 3 4 5 6]
That 2nd line is the str(data) display - with [] and no commas.
read_csv loads that as a string. It does not try to convert it to an array; it can't.
In [57]: d = pd.read_csv('data.csv')
In [58]: d
Out[58]:
content
0 [1 2 3 4 5 6]
In [59]: d.to_numpy()
Out[59]: array([['[1 2 3 4 5 6]']], dtype=object)
In [60]: d.to_numpy()[0,0]
Out[60]: '[1 2 3 4 5 6]'
csv is not a good format for saving a dataframe that contains objects such as arrays or lists as elements. It only works well for elements that are simple numbers and strings.
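If the goal is just to round-trip a single array, a plain numpy binary file sidesteps the string conversion entirely. A minimal sketch of that alternative (the file name data.npy is arbitrary):
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6])
np.save('data.npy', data)        # the binary .npy format preserves dtype and shape

restored = np.load('data.npy')
print(restored.shape)            # (6,)
print(restored)                  # [1 2 3 4 5 6]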

Dynamically updating the condition while indexing and selecting data in pandas Dataframe

While selecting data from a dataframe, I need to do it in such a way that the condition changes according to the length of the input list. This is my current code. In each tuple of the list, the first element is the name of the column and the second element is the value of the column.
import pandas as pd
list_1 = [('a', 2), ('b', 5)]
list_2 = [('a', 1), ('b', 2), ('c', 3)]
data = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 1, 1], [2, 5, 6]], columns=['a', 'b', 'c'])
def select_data(l, dataset):
    df = None
    i = len(l)
    if i == 2:
        df = dataset[(dataset[l[0][0]] == l[0][1]) & (dataset[l[1][0]] == l[1][1])]
    if i == 3:
        df = dataset[(dataset[l[0][0]] == l[0][1]) & (dataset[l[1][0]] == l[1][1]) & (dataset[l[2][0]] == l[2][1])]
    return df
print(select_data(list_1, data))
print(select_data(list_2, data))
There must be a cleaner way of doing this.
A short filtering approach with DataFrame.loc[<filtering_mask>, :]:
In [150]: data
Out[150]:
a b c
0 1 2 3
1 1 2 3
2 1 1 1
3 2 5 6
In [151]: def select_data(filter_lst, df):
...: d = dict(filter_lst)
...: res = df.loc[(df[list(d.keys())] == pd.Series(d)).all(axis=1)]
...: return res
...:
...:
In [152]: select_data(list_1, data)
Out[152]:
a b c
3 2 5 6
In [153]: select_data(list_2, data)
Out[153]:
a b c
0 1 2 3
1 1 2 3
I think you want to create a dynamic condition. To do that, you can write the condition as a string and then evaluate it with the eval function. To do so, you also need to know the dataframe's name.
Here the code assuming all the keys are in the given dataset:
import numpy as np
import pandas as pd

input1 = [('a', 1), ('b', 2)]
# input2 = [('a', 1), ('b', 2), ('c', 3)]  # Same output as for input1 for the below dataset
dataset = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 1, 1], [2, 5, 6]],
                       columns=['a', 'b', 'c'])

def create_query_condition(list_tuple, df_name):
    """Create a string condition from the keys and values in list_tuple.

    Arguments:
    :param list_tuple: 2D array: 1st column is key, 2nd column is value
    :param df_name: dataframe name
    """
    my_array = np.array(list_tuple)
    # Get keys - values
    keys = my_array[:, 0]
    values = my_array[:, 1]
    # Create string query
    query = ' & '.join(['({0}.{1} == {2})'.format(df_name, k, v)
                        for k, v in zip(keys, values)])
    return query
query = create_query_condition(input1, 'dataset')
print(query)
# (dataset.a == 1) & (dataset.b == 2)
print(dataset[eval(query)])
# a b c
# 0 1 2 3
# 1 1 2 3
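As an aside, if you build the condition string without the dataframe name, pandas.DataFrame.query can evaluate it directly instead of eval. A short sketch of that variant (the helper name build_query is mine):
def build_query(list_tuple):
    # produces e.g. '(a == 1) & (b == 2)' - no dataframe name needed
    return ' & '.join('({} == {})'.format(k, v) for k, v in list_tuple)

print(dataset.query(build_query(input1)))
#    a  b  c
# 0  1  2  3
# 1  1  2  3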

Using a tuple from df.itertuples(), how can I retrieve column values for each tuple element under a condition?

I have a pandas.DataFrame such as:
   1  2  3
1  1  0  0
2  0  1  0
3  0  0  1
Which has been created from a set containing relations such as:
{(1,1),(2,2),(3,3)}
I am trying to make the equivalence classes for this. Something like this:
[1] = {1}
[2] = {2}
[3] = {3}
I have done the following so far:
testGenerator = generatorTest(matrix)
indexCount = 1
while True:
    classRelation, loopCount = [], 1
    iterable = next(testGenerator)
    for i in iterable[1:]:
        if i == 1:
            classRelation.append(loopCount)
        loopCount += 1
    print("[", indexCount, "] = ", set(classRelation))
    indexCount += 1
Which, as you can see, is very messy. But I do get more or less the desired output:
[ 1 ] = {1}
[ 2 ] = {2}
[ 3 ] = {3}
How can I accomplish the same output, in a tidier and more pythonic fashion?
In this case you can use pandas.DataFrame.idxmax() like:
Code:
df.idxmax(axis=1)
Test code:
df = pd.DataFrame([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]],
columns=[1, 2, 3], index=[1, 2, 3, 4])
print(df.idxmax(axis=1))
Results:
1 1
2 2
3 3
4 2
dtype: int64
Consider the dataframe df
df = pd.DataFrame(np.eye(3, dtype=int), [1, 2, 3], [1, 2, 3])
numpy.where
i, j = np.where(df.values == 1)
list(zip(df.index[i], df.columns[j]))
[(1, 1), (2, 2), (3, 3)]
stack and compress
s = df.stack()
s.compress(s.astype(bool)).index.tolist()
[(1, 1), (2, 2), (3, 3)]
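Note that Series.compress was deprecated and removed in later pandas versions; on a recent pandas the same pairs can be recovered with plain boolean indexing on the stacked series, as in this short sketch:
s = df.stack()
s[s.astype(bool)].index.tolist()
# [(1, 1), (2, 2), (3, 3)]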

Combine multiple columns into 1 column [python,pandas]

I have a pandas data frame with 2 columns:
{'A':[1, 2, 3],'B':[4, 5, 6]}
I want to create a new column where:
{'C':[1 4,2 5,3 6]}
Setup
df = pd.DataFrame({'A':[1, 2, 3],'B':[4, 5, 6]})
Solution
Keep in mind, per your expected output, [1 4,2 5,3 6] isn't a thing. I'm interpreting you to mean either [(1, 4), (2, 5), (3, 6)] or ["1 4", "2 5", "3 6"]
First assumption
df.apply(lambda x: tuple(x.values), axis=1)
0 (1, 4)
1 (2, 5)
2 (3, 6)
dtype: object
Second assumption
df.apply(lambda x: ' '.join(x.astype(str)), axis=1)
0 1 4
1 2 5
2 3 6
dtype: object
If you don't mind a zip object, then you can use df['C'] = zip(df.A, df.B).
If you like tuples, then you can cast the zip object with list(). Please refer to this post. It's pretty handy to use zip in this kind of scenario.
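For reference, in Python 3 zip returns an iterator, so the assignment needs the list() cast mentioned above. A minimal sketch with the column names from the question:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df['C'] = list(zip(df.A, df.B))
print(df['C'].tolist())   # [(1, 4), (2, 5), (3, 6)]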

The most frequent pattern of specific columns in Pandas.DataFrame in python

I know how to get the most frequent element of a list of lists, e.g.
a = [[3,4], [3,4],[3,4], [1,2], [1,2], [1,1],[1,3],[2,2],[3,2]]
print max(a, key=a.count)
should print [3, 4] even though the most frequent number is 1 for the first element and 2 for the second element.
My question is how to do the same kind of thing with Pandas.DataFrame.
For example, I'd like to know the implementation of the following method get_max_freq_elem_of_df:
def get_max_freq_elem_of_df(df):
    # do some things
    return freq_list
df = pd.DataFrame([[3,4], [3,4],[3,4], [1,2], [1,2], [1,1],[1,3],[2,2],[4,2]])
x = get_max_freq_elem_of_df(df)
print x # => should print [3,4]
Please notice that the DataFrame.mode() method does not work. For the above example, df.mode() returns [1, 2], not [3, 4].
Update
have explained why DataFrame.mode() doesn't work.
You could use groupby.size and then find the max:
>>> df.groupby([0,1]).size()
0  1
1  1    1
   2    2
   3    1
2  2    1
3  4    3
4  2    1
dtype: int64
>>> df.groupby([0,1]).size().idxmax()
(3, 4)
In python you'd use Counter*:
In [11]: from collections import Counter
In [12]: c = Counter(df.itertuples(index=False))
In [13]: c
Out[13]: Counter({(3, 4): 3, (1, 2): 2, (1, 3): 1, (2, 2): 1, (4, 2): 1, (1, 1): 1})
In [14]: c.most_common(1) # get the top 1 most common items
Out[14]: [((3, 4), 3)]
In [15]: c.most_common(1)[0][0] # get the item (rather than the (item, count) tuple)
Out[15]: (3, 4)
* Note that your solution
max(a, key=a.count)
(although it works) is O(N^2), since on each iteration it needs to iterate through a (to get the count), whereas Counter is O(N).
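On newer pandas versions (1.1 and later), DataFrame.value_counts gives the same answer in one call; a short sketch, assuming such a version is available:
df = pd.DataFrame([[3, 4], [3, 4], [3, 4], [1, 2], [1, 2],
                   [1, 1], [1, 3], [2, 2], [4, 2]])
print(list(df.value_counts().idxmax()))   # [3, 4]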
