How to efficiently calculate co-existence of elements in pandas [closed] - python

I have a dataframe in which each row shows one transaction: the items purchased together. Here is what my dataframe looks like:
items
['A','B','C']
['A','C']
['C','F']
...
I need to create a dictionary which shows how many times items have been purchased together, something like below
{'A': [('B', 1), ('C', 5)], 'B': [('A', 1), ('C', 6)], ...}
Right now, I have defined a variable freq, and I loop through my dataframe and update the dictionary as I go, but it's taking very long.
What's an efficient way of calculating this without looping through the dataframe?

You can speed this up with sklearn's MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
Transform your data using:
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df['items']),
                  columns=mlb.classes_,
                  index=df.index)
to get it in the following format:
   A  B  C  F
0  1  1  1  0
1  1  0  1  0
2  0  0  1  1
Then you can define a trivial function like:
get_num_coexisting = lambda x, y: (df[x] & df[y]).sum()
And use it like so:
get_num_coexisting('A', 'C')
>>> 2
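
To build the full dictionary from the question, here is a minimal sketch (assuming the one-hot frame produced above) that computes all pairwise counts at once with a matrix product instead of a Python loop:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'items': [['A', 'B', 'C'], ['A', 'C'], ['C', 'F']]})
mlb = MultiLabelBinarizer()
onehot = pd.DataFrame(mlb.fit_transform(df['items']), columns=mlb.classes_)

# co[i][j] = number of transactions containing both item i and item j
co = onehot.T.dot(onehot)

# build the {item: [(other_item, count), ...]} dictionary, skipping
# self-counts and pairs that never co-occur
freq = {item: [(other, int(count))
               for other, count in co[item].items()
               if other != item and count > 0]
        for item in co.columns}
print(freq)
# {'A': [('B', 1), ('C', 2)], 'B': [('A', 1), ('C', 1)],
#  'C': [('A', 2), ('B', 1), ('F', 1)], 'F': [('C', 1)]}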

Related

Sum DF columns stored in dictionary [closed]

I have a dictionary with 3 DataFrames: {0: DataFrame, 1: DataFrame, 2: DataFrame}.
Each DataFrame has the same size: 6 variables, 25 rows.
I'd like to sum all the values in each DataFrame's 'Income' column and pass the sums to a list.
The result should look like this:
list_of_sums = [Sum of income DF0, Sum of income DF1, Sum of income DF2]
Try this:
list_of_sums = []
input_dict = {0: DataFrame, 1: DataFrame, 2: DataFrame}
for obj in input_dict:
    # sum the 'Income' column and append the result to the output list
    list_of_sums.append(input_dict[obj]['Income'].sum())
print(list_of_sums)
Not the most optimized approach, but this works:
import pandas as pd
d, dd, ddd = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()
d['Income'] = [3, 4, 5, 6]
dd['Income'] = [30, 40, 50, 60]
ddd['Income'] = [300, 400, 500, 600]
frames = [d, dd, ddd]  # a list of DataFrames (the original named this 'dicts')
for dc in frames:
    print(sum(s for s in dc['Income']))
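
A more concise sketch, assuming every value in the dictionary is a DataFrame with an 'Income' column:

# one pass over the dict values, using the vectorized Series sum
list_of_sums = [frame['Income'].sum() for frame in input_dict.values()]
print(list_of_sums)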

User entry as column names in pandas [closed]

I have the user input two lists, one for sizes and one for minutes. For example, they can input sizes 111, 121 and minutes 5, 10, 15.
I want the dataframe to have columns named by each size and minute pair. (I used a for loop to extract each size and minute.) For example, I want the columns to say 111,5; 111,10; 111,15; etc. I tried df[size+minute] = values (values is the data I want to put into each column), but the column name ends up being the two numbers added together, so I got a column named 116 instead of 111,5.
If you have two lists:
l = [111,121]
l2 = [5,10,15]
Then you can use a list comprehension to form your column names:
col_names = [str(x)+';'+str(y) for x in l for y in l2]
print(col_names)
['111;5', '111;10', '111;15', '121;5', '121;10', '121;15']
And create a dataframe with these column names using pandas.DataFrame:
df = pd.DataFrame(columns=col_names)
If we add a row of data (note that DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead):
row = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=col_names)
df = pd.concat([df, row], ignore_index=True)
We can see that our dataframe looks like this:
print(df)
  111;5 111;10 111;15 121;5 121;10 121;15
0     1      2      3     4      5      6
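
An equivalent sketch uses itertools.product, which pairs every size with every minute in the same order as the nested comprehension:

from itertools import product

l = [111, 121]
l2 = [5, 10, 15]
# one (size, minute) tuple per column, in row-major order
col_names = [f'{size};{minute}' for size, minute in product(l, l2)]
# ['111;5', '111;10', '111;15', '121;5', '121;10', '121;15']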

Pandas new column with calculation based on other existing column [closed]

I have a pandas DataFrame and want to do a calculation based on an existing column.
However, the apply function is not working for some reason.
It's something like, let's say,
df = pd.DataFrame({'Age': age, 'Input': input})
and the Input column is something like [1.10001, 1.49999, 1.60001].
Now I want to add a new column to the DataFrame that does the following:
Add 0.0001 to each element in the column
Multiply each value by 10
Convert each value of the new column to int
Use Series.add, Series.mul and Series.astype:
# 'input' is a Python builtin, so it's better not to use it as a variable name
inp = [1.10001, 1.49999, 1.60001]
age = [10,20,30]
df = pd.DataFrame({'Age': age, 'Input': inp})
df['new'] = df['Input'].add(0.0001).mul(10).astype(int)
print (df)
   Age    Input  new
0   10  1.10001   11
1   20  1.49999   15
2   30  1.60001   16
You could also make a simple function and then apply it by row:
def f(row):
    return int((row['Input'] + 0.0001) * 10)

df['new'] = df.apply(f, axis=1)
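
For completeness, the same calculation can be written with plain arithmetic operators; this sketch behaves identically to the Series.add/mul/astype chain above and avoids the slower row-wise apply:

# vectorized: operates on the whole column at once
df['new'] = ((df['Input'] + 0.0001) * 10).astype(int)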

Python: Select the rows for the most recent entry from multiple users [closed]

I have a dataframe df with 3 columns:
df = pd.DataFrame({
    'User': ['A', 'A', 'B', 'A', 'C', 'B', 'C'],
    'Values': ['x', 'y', 'z', 'p', 'q', 'r', 's'],
    'Date': [14, 11, 14, 12, 13, 10, 14]
})
I want to create a new dataframe containing, for each user, the rows with the highest value in the 'Date' column (the desired output was attached as an image).
Can anyone help me with this problem?
This answer keeps every row that ties for the maximum Date within each user:
In [10]: def get_max(group):
    ...:     return group[group.Date == group.Date.max()]
    ...:

In [12]: df.groupby('User').apply(get_max).reset_index(drop=True)
Out[12]:
   Date User Values
0    14    A      x
1    14    B      z
2    14    C      s
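
An alternative sketch uses idxmax to pick, for each user, the index of the row with the largest Date; note that, unlike the groupby/apply version above, it keeps exactly one row per user and drops ties:

# idxmax returns the index label of the first maximum per group
latest = df.loc[df.groupby('User')['Date'].idxmax()].reset_index(drop=True)
print(latest)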

How to perform Associative Mining on CSV data? [closed]

I have the following CSV data set. The data represents entities A, B, C, D, E, and F; the second column is the rule and the final column is that entity's rank for the rule.
A,Rule_1,1
B,Rule_1,1
C,Rule_1,2
D,Rule_1,1
E,Rule_1,2
F,Rule_1,3
A,Rule_2,3
B,Rule_2,1
C,Rule_2,2
D,Rule_2,1
E,Rule_2,2
F,Rule_2,1
I basically want to perform associative mining (with a maximum of 3 entities): count the entities that have rank i and rank j under a rule, and create a bucket_ij. Based on this, given the entities with rank 1, I want to find out which entities are most likely to have rank 2. So when A, B, D = 1, then C, E = 2. How can I perform this kind of associative mining, where certain entities have rank 1 and I find the entities likely to have rank 2?
You can use pandas.
First, add header names to your CSV file:
Entities,Rule,Rank
A,Rule_1,1
B,Rule_1,1
C,Rule_1,2
D,Rule_1,1
E,Rule_1,2
F,Rule_1,3
A,Rule_2,3
B,Rule_2,1
C,Rule_2,2
D,Rule_2,1
E,Rule_2,2
F,Rule_2,1
Then save it somewhere.
import pandas
pathToCsvFile = 'C:\\file.csv'  # for example
# DataFrame.from_csv was removed from pandas; read_csv is the current way to load a CSV
df = pandas.read_csv(pathToCsvFile)
df.groupby(['Entities', 'Rank']).count()
I think this will get you what you want: it counts how many times each entity had each rank.
Output:
               Rule
Entities Rank
A        1        1
         3        1
B        1        2
C        2        2
D        1        2
E        2        2
F        1        1
         3        1
Or:
from scipy import stats
df.groupby('Entities').agg(lambda x: stats.mode(x)[0]).Rank
This will get the modal rank for each entity.
Outputs:
Entities
A    1
B    1
C    2
D    1
E    2
F    1
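
To sketch the bucketing idea from the question itself (this goes beyond the counts above and assumes the header-named df loaded earlier): for each rule, collect the set of rank-1 entities and count which entities appear at rank 2 alongside them.

from collections import Counter, defaultdict

# map each frozen set of rank-1 entities to a count of rank-2 entities
rank2_given_rank1 = defaultdict(Counter)
for _, group in df.groupby('Rule'):
    rank1 = frozenset(group.loc[group['Rank'] == 1, 'Entities'])
    rank2 = group.loc[group['Rank'] == 2, 'Entities']
    rank2_given_rank1[rank1].update(rank2)

for rank1_set, rank2_counts in rank2_given_rank1.items():
    print(sorted(rank1_set), dict(rank2_counts))
# ['A', 'B', 'D'] {'C': 1, 'E': 1}   (Rule_1: A, B, D at rank 1 -> C, E at rank 2)
# ['B', 'D', 'F'] {'C': 1, 'E': 1}   (Rule_2)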
