I have the following CSV data set. Column 1 is the entity (A through F), column 2 is the rule, and the final column is the rank of that entity for that rule.
A,Rule_1,1
B,Rule_1,1
C,Rule_1,2
D,Rule_1,1
E,Rule_1,2
F,Rule_1,3
A,Rule_2,3
B,Rule_2,1
C,Rule_2,2
D,Rule_2,1
E,Rule_2,2
F,Rule_2,1
I basically want to perform association mining (with a maximum of 3 entities): for the number of entities that have rank i under one rule and rank j under another, create a bucket_ij. Based on this, given the entities with rank 1, I want to find out which entities are most likely to have rank 2. So when A, B, D = 1, then C, E = 2. How can I perform this association mining, where given certain entities with rank 1, I find the entities likely to have rank 2?
You can use pandas.
First you have to name the columns in your CSV file:
Entities,Rule,Rank
A,Rule_1,1
B,Rule_1,1
C,Rule_1,2
D,Rule_1,1
E,Rule_1,2
F,Rule_1,3
A,Rule_2,3
B,Rule_2,1
C,Rule_2,2
D,Rule_2,1
E,Rule_2,2
F,Rule_2,1
Then save it somewhere.
import pandas
pathToCsvFile = 'C:\\file.csv' #for example
df = pandas.read_csv(pathToCsvFile)  # DataFrame.from_csv is deprecated; use read_csv
df.groupby(['Entities', 'Rank']).count()
I think with this you can get what you want. It will count how many times each entity had each rank.
Output:
              Rule
Entities Rank
A        1       1
         3       1
B        1       2
C        2       2
D        1       2
E        2       2
F        1       1
         3       1
Or:
from scipy import stats
df.groupby(('Entities')).agg(lambda x:stats.mode(x)[0]).Rank
This will get the mode for each entity.
Outputs:
Entities
A 1
B 1
C 2
D 1
E 2
F 1
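To answer the original question directly (given the entities with rank 1, which entities have rank 2?), one approach is to bucket the entities per rule and rank. A minimal sketch, assuming the column names used above:

```python
import pandas as pd

df = pd.DataFrame({
    'Entities': list('ABCDEF') * 2,
    'Rule': ['Rule_1'] * 6 + ['Rule_2'] * 6,
    'Rank': [1, 1, 2, 1, 2, 3, 3, 1, 2, 1, 2, 1],
})

# For each (rule, rank) pair, collect the entities that share it
buckets = (df.groupby(['Rule', 'Rank'])['Entities']
             .apply(lambda s: sorted(s))
             .to_dict())

# buckets[('Rule_1', 1)] -> ['A', 'B', 'D']
# buckets[('Rule_1', 2)] -> ['C', 'E']
```

From these buckets you can then tabulate, across rules, how often a given rank-1 set co-occurs with a given rank-2 set.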
Is there any way to determine which row values in a column make the final expression true, and then output those values?
Example:
Equation
a && b || (c && d)
if
a = T, b = T, c = F, d = T
then the output should produce:
True Values: a, b
False Values: c, d
Is this possible in pandas or python?
df[((df["column_a"]==True) & (df["column_b"]==True)) | ((df["column_c"]==True) & (df["column_d"]==True))]
I guess this is what you're trying to say, but if you're trying to do something based on individual data points in the respective columns then you have to use for loop(s), depending on the problem.
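If the goal is the exact output in the question (variables of the satisfied OR-clause reported as true values, variables of the failed clause as false values), here is a hypothetical sketch, where the `clauses` list is an assumed hand-written encoding of the expression `a && b || (c && d)`:

```python
import pandas as pd

df = pd.DataFrame({'a': [True], 'b': [True], 'c': [False], 'd': [True]})
row = df.iloc[0]

# The expression a && b || (c && d), written as AND-clauses joined by OR
clauses = [('a', 'b'), ('c', 'd')]

true_vars, false_vars = [], []
for clause in clauses:
    if all(row[v] for v in clause):
        true_vars.extend(clause)   # this clause makes the expression true
    else:
        false_vars.extend(clause)  # this clause does not hold

# true_vars  -> ['a', 'b']
# false_vars -> ['c', 'd']
```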
I have a dataframe in which each row shows one transaction: items purchased together. Here is what my dataframe looks like:
items
['A','B','C']
['A','C']
['C','F']
...
I need to create a dictionary which shows how many times items have been purchased together, something like below
{'A': [('B', 1), ('C', 5)], 'B': [('A', 1), ('C', 6)], ...}
Right now, I have defined a variable freq and then loop through my dataframe to calculate/update the dictionary (freq), but it's taking very long.
What's an efficient way of calculating this without looping through the dataframe?
You can speed this up with sklearn's MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
Transform your data using:
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df['items']),
                  columns=mlb.classes_,
                  index=df.index)
to get it in the following format:
A B C F
0 1 1 1 0
1 1 0 1 0
2 0 0 1 1
Then you can define a trivial function like:
get_num_coexisting = lambda x, y: (df[x] & df[y]).sum()
And use as so:
get_num_coexisting('A', 'C')
>>> 2
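A follow-up note: once the data is one-hot encoded, all pairwise co-occurrence counts can be computed at once with a matrix product, which avoids calling the function once per pair. A sketch, assuming the binarized frame from above:

```python
import pandas as pd

# One-hot matrix as produced by MultiLabelBinarizer above (column names assumed)
onehot = pd.DataFrame({'A': [1, 1, 0], 'B': [1, 0, 0],
                       'C': [1, 1, 1], 'F': [0, 0, 1]})

# Item-by-item co-occurrence counts in one shot
cooc = onehot.T.dot(onehot)

# cooc.loc['A', 'C'] -> 2 (A and C bought together twice)
# cooc.loc['C', 'F'] -> 1
```

The diagonal of `cooc` holds each item's total count; the off-diagonal entries are the pair counts the dictionary in the question would contain.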
Good afternoon. I was wondering if it's possible to move the last character of column A to the first character of column B using Excel, or maybe even Python. I know how to remove the last character in Excel, but I don't know how to go about adding it to another column.
Ex.
Column A = ABCD1, Column B = 234567
Desired results:
Column A = ABCD, Column B = 1234567
In Excel:
Define column C with the following formula:
C1 = LEFT(A1, LEN(A1)-1)
Extend downwards along the entire column.
Define column D with the following formula:
D1 = RIGHT(A1, 1) & B1
Extend this in the same way down column D.
Copy columns C and D, use Paste Special > Paste Values over columns A and B.
You can now delete the temporary columns C and D.
In Column C, just use the formula:
=LEFT(A1,LEN(A1)-1)
and in D:
=RIGHT(A1,1)&B1
Then copy/paste Columns C and D as Values to "lock in" column C and D.
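Since the question also mentions Python, here is a pandas sketch of the same operation (assuming the columns are named A and B and hold strings):

```python
import pandas as pd

df = pd.DataFrame({'A': ['ABCD1'], 'B': ['234567']})

# Prepend the last character of A to B first, then strip it from A
df['B'] = df['A'].str[-1] + df['B']
df['A'] = df['A'].str[:-1]

# df.iloc[0] -> A == 'ABCD', B == '1234567'
```

The order matters: B must be updated before A loses its last character.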
I'd like to count the number of IDs in terms of how many times it appears in data.
Now I have:
U6492ea665413f304b323fea3e7f76739 7
Uf873b1e4dfc9f18d92758020dc1435c6 7
Ua30d2a8da85ac1144f9cbbf390c10d3c 7
Uf169ffec7dc767b89694a26cb057a258 7
U9e9c89c308d6c2f77dad28f8ec8e7993 7
...
The left is ID, and the right is how many times ID appears in data.
What I want to get is something like:
7 900
6 435
5 434
4 343
3 453
2 34
1 121
The left is the number of appearances. The right is the number of IDs.
uid = data['id']
col=uid.value_counts()
col
I think this is what you want to do: reset the index to get the IDs as a separate column, then group by the counts you previously got and count the IDs:
df = col.reset_index()
df.columns = ['uid', 'count']  # name the columns explicitly (the defaults vary across pandas versions)
df.groupby(by='count')['uid'].count()
uid = data['uid']
col=uid.value_counts()
col
num = col.value_counts()
num
Repeating value_counts() has resolved the issue.
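To make the accepted approach concrete, here is a small self-contained sketch (with made-up IDs):

```python
import pandas as pd

data = pd.DataFrame({'uid': ['u1', 'u1', 'u2', 'u2', 'u3']})

appearances = data['uid'].value_counts()    # how often each ID appears
distribution = appearances.value_counts()   # how many IDs share each appearance count

# distribution[2] -> 2 (two IDs appear twice: u1, u2)
# distribution[1] -> 1 (one ID appears once: u3)
```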
I have a file and want to count a few names in it. The problem is that one of the names appears in more than one form. What can I do to count them as one name rather than different names?
For example:
LR = lrr = LRr = lrrs are all the same thing, but when I count them they are treated as different names.
Thank you
It is not easy, and this solution is simplified: first read_csv, then convert all letters to lowercase, then replace one or more s characters at the end of each string with an empty string, then collapse repeated letters (a slightly modified version of this solution, reduced to single letters), and finally value_counts.
Note that words which legitimately end with s are altered too.
df = pd.read_csv('file.csv')
#sample DataFrame
df = pd.DataFrame({'names': ['LR','lrr','LRr','lrrs', 'lrss', 'lrsss']})
print (df)
names
0 LR
1 lrr
2 LRr
3 lrrs
4 lrss
5 lrsss
print (df.names.str.lower().str.replace('s{1,}$','').str.replace(r'(.)\1+', r'\1'))
0 lr
1 lr
2 lr
3 lr
4 lr
5 lr
Name: names, dtype: object
print (df.names.str.lower()
.str.replace('s{1,}$','')
.str.replace(r'(.)\1+', r'\1')
.value_counts())
lr 6
Name: names, dtype: int64
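One caveat for newer pandas versions: Series.str.replace treats the pattern as a literal string by default, so regex=True must be passed explicitly. The same pipeline would then be written as:

```python
import pandas as pd

df = pd.DataFrame({'names': ['LR', 'lrr', 'LRr', 'lrrs']})

normalized = (df['names'].str.lower()
                .str.replace(r's+$', '', regex=True)         # strip trailing s's
                .str.replace(r'(.)\1+', r'\1', regex=True))  # collapse repeated letters

counts = normalized.value_counts()
# counts['lr'] -> 4
```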