When selecting data from a dataframe, I need the condition to change according to the length of the input list. This is my current code. The first element of each tuple is the name of a column and the second element is the value that column must equal.
import pandas as pd
list_1 = [('a', 2), ('b', 5)]
list_2 = [('a', 1), ('b', 2), ('c', 3)]
data = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 1, 1], [2, 5, 6]], columns=['a', 'b', 'c'])
def select_data(l, dataset):
    df = None
    i = len(l)
    if i == 2:
        df = dataset[(dataset[l[0][0]] == l[0][1]) & (dataset[l[1][0]] == l[1][1])]
    if i == 3:
        df = dataset[(dataset[l[0][0]] == l[0][1]) & (dataset[l[1][0]] == l[1][1]) & (dataset[l[2][0]] == l[2][1])]
    return df
print(select_data(list_1, data))
print(select_data(list_2, data))
There must be a cleaner way of doing this.
Short filtering with the DataFrame.loc[<filtering_mask>, :] approach: compare the selected columns against a Series of target values (which aligns on column names) and keep the rows where all conditions hold.
In [150]: data
Out[150]:
a b c
0 1 2 3
1 1 2 3
2 1 1 1
3 2 5 6
In [151]: def select_data(filter_lst, df):
     ...:     d = dict(filter_lst)
     ...:     res = df.loc[(df[list(d.keys())] == pd.Series(d)).all(axis=1)]
     ...:     return res
     ...:
In [152]: select_data(list_1, data)
Out[152]:
a b c
3 2 5 6
In [153]: select_data(list_2, data)
Out[153]:
a b c
0 1 2 3
1 1 2 3
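The comparison against pd.Series(d) aligns on column names, so every listed column is checked against its target value, and .all(axis=1) keeps only the rows matching every condition. An equivalent standalone sketch builds the same mask explicitly with functools.reduce:
from functools import reduce

def select_data(filter_lst, df):
    # one boolean mask per (column, value) pair, ANDed together
    masks = (df[col] == val for col, val in filter_lst)
    return df[reduce(lambda m1, m2: m1 & m2, masks)]

print(select_data(list_1, data))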
I think you want to create a dynamic condition. To do so, you can write the condition as a string and then evaluate it with the eval function (be careful: eval executes arbitrary code, so only use it with trusted input). You also need to know the dataset's variable name.
Here is the code, assuming all the keys are present in the given dataset:
import numpy as np
import pandas as pd
input1 = [('a', 1), ('b', 2)]
# input2 = [('a', 1), ('b', 2), ('c', 3)] # Same output as for input1 for the below dataset
dataset = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 1, 1], [2, 5, 6]],
                       columns=['a', 'b', 'c'])
def create_query_condition(list_tuple, df_name):
    """Create a string condition from the keys and values in list_tuple.

    Arguments:
    :param list_tuple: 2D array: 1st column is key, 2nd column is value
    :param df_name: dataframe name
    """
    my_array = np.array(list_tuple)
    # Get keys - values
    keys = my_array[:, 0]
    values = my_array[:, 1]
    # Create string query
    query = ' & '.join(['({0}.{1} == {2})'.format(df_name, k, v)
                        for k, v in zip(keys, values)])
    return query
query = create_query_condition(input1, 'dataset')
print(query)
# (dataset.a == 1) & (dataset.b == 2)
print(dataset[eval(query)])
# a b c
# 0 1 2 3
# 1 1 2 3
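If you are building strings anyway, pandas has DataFrame.query, which evaluates a condition string against the frame's own columns, so the dataframe's variable name is not needed. A minimal sketch (assuming the column names are valid Python identifiers and the values are numeric):
query = ' & '.join('({} == {})'.format(k, v) for k, v in input1)
print(query)
# (a == 1) & (b == 2)
print(dataset.query(query))
#    a  b  c
# 0  1  2  3
# 1  1  2  3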
I am a Python newbie and have a question.
As a simple example, I have three variables:
a = 3
b = 10
c = 1
I'd like to create a data frame with three columns ('a', 'b', and 'c'), where each column takes its original value +/- a certain constant, restricted to values >0 and <=10.
If the constant is 1 then:
the possible values of 'a' will be 2, 3, 4
the possible values of 'b' will be 9, 10
the possible values of 'c' will be 1, 2
The final data frame will consist of all possible combination of a, b and c.
Do you know any Python code to do so?
Here is a code to start.
import pandas as pd
data = [[3 , 10, 1]]
df1 = pd.DataFrame(data, columns=['a', 'b', 'c'])
You may use itertools.product for this.
Create 3 separate lists with the accepted values. This can be done by calling a helper method that returns the list of possible values for a given input:
def list_of_values(n):
    if 1 < n < 10:
        return [n - 1, n, n + 1]
    elif n == 1:
        return [1, 2]
    elif n == 10:
        return [9, 10]
    return []
So you will have the following:
a = [2, 3, 4]
b = [9, 10]
c = [1,2]
Next, do the following:
from itertools import product
l = product(a, b, c)
data = list(l)
pd.DataFrame(data, columns=['a', 'b', 'c'])
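Putting it together for the original a, b, c values (a short sketch reusing the list_of_values helper from above):
a, b, c = 3, 10, 1
combos = list(product(list_of_values(a), list_of_values(b), list_of_values(c)))
df2 = pd.DataFrame(combos, columns=['a', 'b', 'c'])
print(len(df2))
# 12 (3 * 2 * 2 combinations)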
I have a dataframe which looks like (below 1 represents having a trait and 0 represents not having it):
Person Trait_1 Trait_2 Trait_3 Trait_4
A 1 1 1 1
B 0 1 1 0
C 0 1 0 0
D 1 1 0 1
E 0 0 0 1
I want a function which returns, for each person, the top 10 people with the most number of common traits.
So for person A the output can be:
D (3 traits), B (2 traits), C(1 trait), E(1 trait)
I thought that a matrix which encodes how many traits each person has in common with the others would be a good start:
A B C D E
A 4 2 1 3 1
B 2 4 1 1 0
C 1 1 4 1 0
D 3 1 1 4 1
E 1 0 0 1 4
But I am not sure how to achieve this or what this is called.
Create a DataFrame
import pandas as pd
df = pd.DataFrame(data=[[1, 1, 1, 1],
                        [0, 1, 1, 0],
                        [0, 1, 0, 0],
                        [1, 1, 0, 1],
                        [0, 0, 0, 1]],
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['Trait_1', 'Trait_2', 'Trait_3', 'Trait_4'])
Using matrix multiplication, create the matrix of common traits (the one you described); for binary rows, the (i, j) entry counts the traits persons i and j share:
common_traits = df @ df.T
For each person, print the string you want (the top 10 people with the most traits in common):
n = 10
for index, row in common_traits.iterrows():
    top10 = row.drop(index).nlargest(n)
    top10 = top10[top10 > 0]
    string = ', '.join(top10.index + top10.map(lambda x: f' ({x} trait{"s" if x != 1 else ""})'))
    print(f'{index}: {string}')
Output
A: D (3 traits), B (2 traits), C (1 trait), E (1 trait)
B: A (2 traits), C (1 trait), D (1 trait)
C: A (1 trait), B (1 trait), D (1 trait)
D: A (3 traits), B (1 trait), C (1 trait), E (1 trait)
E: A (1 trait), D (1 trait)
Try: build the co-trait matrix, stack it into long form, drop self-pairs and zero counts, then collect each person's matches (already sorted in descending order):
# if not done already
df.set_index("Person", inplace=True)
res = (df @ df.T).stack().reset_index(level=1)
res = res.loc[res["Person"].ne(res.index) & res[0].gt(0)].sort_values(0, ascending=False).groupby(level=0).apply(lambda x: list(x.values))
Outputs:
>>> res
Person
A [[D, 3], [B, 2], [C, 1], [E, 1]]
B [[A, 2], [C, 1], [D, 1]]
C [[A, 1], [B, 1], [D, 1]]
D [[A, 3], [B, 1], [C, 1], [E, 1]]
E [[A, 1], [D, 1]]
dtype: object
And your function (results are in descending order):
>>> res.loc['C']
[array(['A', 1], dtype=object), array(['B', 1], dtype=object), array(['D', 1], dtype=object)]
This is more of a linear algebra answer, but if you want the number of common traits between person A and person B, for example, you evaluate the scalar product of person A's row vector and person B's row vector (this only works because your matrix is a binary matrix).
I have no idea what framework/library you are using, but if you are using pandas, you can easily extract the row vectors, convert them to numpy arrays, and take the scalar product.
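A minimal sketch of that idea, assuming the DataFrame built above (persons as the index, binary trait columns):
import numpy as np

# the dot product of two binary row vectors counts the positions
# where both are 1, i.e. the traits the two people share
common = np.dot(df.loc['A'].to_numpy(), df.loc['B'].to_numpy())
print(common)
# 2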
I think you can't do it without an iterrows somewhere.
First get the matrix you talk about (A @ A.T), then get the top columns for each row with nlargest (and zero the diagonal, as you don't want to count a person as their own closest neighbour):
import pandas as pd
import numpy as np
NB_MOST_SIMILAR = 3
df = pd.DataFrame(
    data={'t_1': [1, 0, 0, 1, 0], 't_2': [1, 1, 1, 1, 0],
          't_3': [1, 1, 0, 0, 0], 't_4': [1, 0, 0, 1, 1]},
    index=['A', 'B', 'C', 'D', 'E']
)
prod = df.dot(df.T)
# zero the diagonal so a person is never ranked as their own neighbour
np.fill_diagonal(prod.values, 0)
common_traits = dict()
for person_name, com_traits in prod.iterrows():
    traits = [
        (neighbour, traits) for neighbour, traits
        in com_traits.nlargest(NB_MOST_SIMILAR).items()
        if traits > 0
    ]
    common_traits[person_name] = traits
common_traits
# {'A': [('D', 3), ('B', 2), ('C', 1)],
# 'B': [('A', 2), ('C', 1), ('D', 1)],
# 'C': [('A', 1), ('B', 1), ('D', 1)],
# 'D': [('A', 3), ('B', 1), ('C', 1)],
# 'E': [('A', 1), ('D', 1)]}
I have the following dataframe
df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1, 2, 4, 1, 2, 3, 4]})
I want a function that would output the following dataframe definition:
df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'b': [1, 2, 4, 1, 2, 3, 4],
                   'c': ['A_0', 'A_0', 'A_1', 'B_0', 'B_0', 'B_0', 'B_0']})
The logic is that, for each value of 'a' (each group), I create a value 'c' identifying a "continuous" run of 'b' values (a new run starts whenever 'b' jumps by more than 1).
So far, my code is the following:
def detection(dataset):
    def detect(series, avalue):
        _id = 0
        start = True
        visits = []
        prev_ = None
        for h in series:
            if start:
                start = False
                prev_ = h
            else:
                if h - prev_ > 1:
                    _id += 1
                prev_ = h
            visits.append(f"{avalue}_{_id}")
        return visits

    res = []
    gb = dataset.groupby("a")
    for avalue in gb.groups:
        dd = gb.get_group(avalue)
        dd["VISIT_ID"] = detect(dd["b"], avalue)
        res.append(dd)
    return pd.concat(res, axis=0)
The good news: it works perfectly. The bad news: it is extremely slow on a large dataset (7 million entries, 250k distinct 'a' values).
Is there something better to do?
You can compute the numeric part of column c using groupby with a cumulative sum of the gaps, then concatenate it onto the group label:
df['c'] = df.groupby('a').b.apply(lambda x: (x.diff() > 1).cumsum())
df['c'] = df['a'] + '_' + df['c'].astype(str)
a b c
0 A 1 A_0
1 A 2 A_0
2 A 4 A_1
3 B 1 B_0
4 B 2 B_0
5 B 3 B_0
6 B 4 B_0
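On newer pandas versions, groupby.apply may attach the group key as an extra index level; transform is a drop-in sketch that sidesteps this and stays aligned with the original rows:
df['c'] = df['a'] + '_' + df.groupby('a')['b'].transform(lambda x: (x.diff() > 1).cumsum()).astype(str)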
I have a pandas.DataFrame such as:
1 2 3
1 1 0 0
2 0 1 0
3 0 0 1
Which has been created from a set containing relations such as:
{(1,1),(2,2),(3,3)}
I am trying to make the equivalence classes for this. Something like this:
[1] = {1}
[2] = {2}
[3] = {3}
I have done the following so far:
testGenerator = generatorTest(matrix)
indexCount = 1
while True:
    classRelation, loopCount = [], 1
    iterable = next(testGenerator)
    for i in iterable[1:]:
        if i == 1:
            classRelation.append(loopCount)
        loopCount += 1
    print("[", indexCount, "] = ", set(classRelation))
    indexCount += 1
Which, as you can see, is very messy, but I do get more or less the desired output:
[ 1 ] = {1}
[ 2 ] = {2}
[ 3 ] = {3}
How can I accomplish the same output, in a tidier and more pythonic fashion?
In this case you can use pandas.DataFrame.idxmax(), which returns the column label of the (first) maximal value in each row:
Code:
df.idxmax(axis=1)
Test code:
df = pd.DataFrame([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]],
                  columns=[1, 2, 3], index=[1, 2, 3, 4])
print(df.idxmax(axis=1))
Results:
1 1
2 2
3 3
4 2
dtype: int64
Consider the dataframe df
df = pd.DataFrame(np.eye(3, dtype=int), [1, 2, 3], [1, 2, 3])
numpy.where
i, j = np.where(df.values == 1)
list(zip(df.index[i], df.columns[j]))
[(1, 1), (2, 2), (3, 3)]
stack and compress
s = df.stack()
s.compress(s.astype(bool)).index.tolist()
[(1, 1), (2, 2), (3, 3)]
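Note that Series.compress was deprecated and later removed from pandas, so on modern versions a plain boolean mask yields the same pairs (a small sketch):
s = df.stack()
print(s[s.astype(bool)].index.tolist())
# [(1, 1), (2, 2), (3, 3)]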
Suppose I have two tables A and B, where A is
AAA 5 3
BBB 6 1
DDD 10 2
CCC 2 4
and B is
AAA 5 2
DDD 2 1
BBB 5 4
(columns: label, value, quantity). I have to merge them in such a way that the new table looks like this:
AAA 5 5
BBB 5.2 5
CCC 2 4
DDD 7.333 3
For the common first elements in A and B, I'm taking the weighted average of middle elements of both rows having the common first elements. For example:
A and B have 'AAA' in common, so I'll compute the middle element using (5 * 3 + 5 * 2) / (3 + 2) = 5. Hence the first row of the third table becomes 'AAA', 5, 5 (quantity 3 + 2 = 5).
I know it can be done by iterating over all the elements if I use lists, but is there a faster way to do this?
edit from comments: I'm also searching for a simpler way using pandas.DataFrame
The simplest pure Python solution would be to use a dictionary-like data structure, with keys being your labels and values being pairs (value = quantity * weight, quantity):
from collections import defaultdict
A = [
    ('A', 5, 3),
    ('B', 6, 1),
    ('D', 10, 2),
    ('C', 2, 4),
]
B = [
    ('A', 5, 2),
    ('D', 2, 1),
    ('B', 5, 4),
]
# we need to calculate (value, quantity) for each label:
a = {key: [weight * quantity, quantity] for key, weight, quantity in A}
b = {key: [weight * quantity, quantity] for key, weight, quantity in B}
# defaultdict is a dictionary-like structure, able to create
# a new item if the key is not found, in our case a new [0, 0] list:
merged = defaultdict(lambda: [0, 0])
# sum value and quantity label by label:
for key, pair in list(a.items()) + list(b.items()):
    # add value and quantity respectively
    value, quantity = map(sum, zip(merged[key], pair))
    merged[key] = [value, quantity]
# now let's calculate the means:
for key, (value, quantity) in merged.items():
    mean = value / quantity
    merged[key] = [mean, quantity]
for item in merged.items():
    print(item)
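For reference, this prints (in dict insertion order), matching the pandas result below:
# ('A', [5.0, 5])
# ('B', [5.2, 5])
# ('D', [7.333333333333333, 3])
# ('C', [2.0, 4])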
It is even simpler using pandas:
import pandas as pd
# first let's create dataframes
colnames = 'label weight quantity'.split()
A = pd.DataFrame.from_records([
    ('A', 5, 3),
    ('B', 6, 1),
    ('D', 10, 2),
    ('C', 2, 4),
], columns=colnames)
B = pd.DataFrame.from_records([
    ('A', 5, 2),
    ('D', 2, 1),
    ('B', 5, 4),
], columns=colnames)
# we can just concatenate those DataFrames and do calculation:
df = pd.concat([A, B])
df['value'] = df.weight * df.quantity
# sum each group with the same label
df = df.groupby('label').sum()
del df['weight'] # it got messed up anyway and we don't need it
# and calculate means:
df['mean'] = df.value / df.quantity
print(df[['mean', 'quantity']])
# mean quantity
# label
# A 5.000000 5
# B 5.200000 5
# C 2.000000 4
# D 7.333333 3
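An equivalent sketch computing the weighted mean directly with numpy.average (assuming A, B, and colnames from above, plus numpy):
import numpy as np

raw = pd.concat([A, B])  # the unaggregated rows
res = raw.groupby('label').apply(
    lambda g: pd.Series({'mean': np.average(g['weight'], weights=g['quantity']),
                         'quantity': g['quantity'].sum()}))
print(res)
#            mean  quantity
# label
# A      5.000000       5.0
# B      5.200000       5.0
# C      2.000000       4.0
# D      7.333333       3.0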
You can do better than this, but here is a pandas solution
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df1 = pd.DataFrame({'AAA':np.array([5,3]),'BBB':np.array([6,1]),
.....: 'DDD':np.array([10,2]),'CCC':np.array([2,4])})
In [4]: df2 = pd.DataFrame({'AAA':np.array([5,2]),'DDD':np.array([2,1]),
.....: 'BBB':np.array([5,4])})
In [5]: df = pd.concat([df1,df2])
In [6]: df.transpose()
0 1 0 1
AAA 5 3 5 2
BBB 6 1 5 4
CCC 2 4 NaN NaN
DDD 10 2 2 1
In [7]: vals = np.nan_to_num(df.values)
In [8]: _mean = (vals[0,:]*vals[1,:]+vals[2,:]*vals[3,:])/(vals[1,:]+vals[3,:])
In [9]: _sum = (vals[1,:]+vals[3,:])
In [10]: result = pd.DataFrame(columns = df.columns,data = [_mean,_sum], index=['mean','sum'])
In [11]: result.transpose()
mean sum
AAA 5.000000 5
BBB 5.200000 5
CCC 2.000000 4
DDD 7.333333 3
It probably isn't the most elegant solution, but it gets the job done.