I have a huge dataset (a user log file) containing repeated data, and I would like to recognize recurring patterns of occurrences and make recommendations based on the user's downloads. Once a pattern is recognized, I need to recommend the best possible next value to the user.
For example following are the download logs based on time:
A C D F A C D A B D A C D F A C D A B D
I would like to recognize the pattern that exists between this dataset and display the result as:
A -> C = 4
C -> D = 4
D -> F = 2
F -> A = 2
D -> A = 3
A -> B = 1
B -> D = 1
A -> C -> D = 2
C -> D -> F = 2
D -> F -> A = 1
F -> A -> C = 1
C -> D -> A = 1
D -> A -> B = 1
A -> B -> D = 1
The number at the end represents how many times that pattern repeats.
When the user inputs "A", the best recommendation should be "C", and if the user inputs "A -> C", it should be "D".
Currently I am cleaning the data with pandas in Python, and for the pattern recognition I think scikit-learn might work (though I am not sure).
Is there a good library or algorithm I can use for this problem, or a good approach for this kind of problem?
Since the data size is very big, I am implementing this in Python.
This problem can be solved easily with n-grams. You can use CountVectorizer to find the n-grams and their frequencies in the text and generate the output you want.
from sklearn.feature_extraction.text import CountVectorizer
# token_pattern changed so that single-letter words are kept as tokens
# ngram_range=(2, 5) to identify 2-grams up to 5-grams
cv = CountVectorizer(ngram_range=(2, 5), token_pattern=r"(?u)\b\w\b",
                     lowercase=False)
# Wrap the data in a list, because CountVectorizer expects an iterable of documents
data = ['A C D F A C D A B D A C D F A C D A B D']
# Learn the n-gram vocabulary from the data
cv.fit(data)
# This is just to prettify the printing
import pandas as pd
# (use cv.get_feature_names_out() on scikit-learn >= 1.0)
df = pd.DataFrame(cv.get_feature_names(), columns=['pattern'])
# Add the frequencies
df['count'] = cv.transform(data).toarray()[0]  # <== converting to a dense array
df
#Output
pattern count
A B 2
A B D 2
A B D A 1
A B D A C 1
A C 4
A C D 4
A C D A 2
A C D A B 2
A C D F 2
A C D F A 2
B D 2
B D A 1
B D A C 1
B D A C D 1
C D 4
C D A 2
C D A B 2
C D A B D 2
C D F 2
C D F A 2
C D F A C 2
D A 3
D A B 2
D A B D 2
D A B D A 1
D A C 1
D A C D 1
D A C D F 1
D F 2
D F A 2
D F A C 2
D F A C D 2
F A 2
F A C 2
F A C D 2
F A C D A 2
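If you also want to turn these counts into the recommendation step, here is a minimal sketch (the in-memory event list and the helper name recommend_next are made up for illustration): for every observed prefix of length 1 to 5, count which event follows it, and answer a query with the most frequent follower.
from collections import Counter, defaultdict
events = 'A C D F A C D A B D A C D F A C D A B D'.split()
def recommend_next(events, query, max_len=5):
    # for every prefix of length 1..max_len, count the event that follows it
    followers = defaultdict(Counter)
    for n in range(1, max_len + 1):
        for i in range(len(events) - n):
            followers[tuple(events[i:i + n])][events[i + n]] += 1
    counts = followers.get(tuple(query))
    return counts.most_common(1)[0][0] if counts else None
print(recommend_next(events, ['A']))       # 'C'
print(recommend_next(events, ['A', 'C']))  # 'D'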
But I would also recommend trying recommender systems, pattern mining, and association rule mining (e.g. Apriori) algorithms, which may help you even more.
I tried looking for similar answers, but their solutions didn't work for me.
I have a dataframe with two columns: template (str) and content (str).
I also have a separate function, split_template_name, that takes a string and returns a tuple of 5 values, e.g.:
split_template_name(some_string) will return a tuple of 5 strings: ('str1', 'str2', 'str3', 'str4', 'str5')
I'm trying to process df['template'] with this function so that the dataframe gets 5 more columns holding the 5 outputs.
Tried
df['template'].apply(split_template_name), but it returns the full tuple as one column, which is not what I need.
Some Stack Overflow answers suggest adding result_type='expand', so I tried df['template'].apply(split_template_name, axis=1, result_type='expand'),
but that gives errors: split_template_name() got an unexpected keyword argument 'axis' or split_template_name() got an unexpected keyword argument 'result_type'.
Basically, the goal is to start with a dataframe with columns ['template', 'content'] and end with a dataframe with columns ['template', 'content', 'str1', 'str2', 'str3', 'str4', 'str5'].
This seems to work:
df[['str1', 'str2', 'str3', 'str4', 'str5']] = pd.DataFrame(
    df['template'].apply(split_template_name).tolist(), index=df.index)
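For reference, the errors in the question come from the fact that axis and result_type are keyword arguments of DataFrame.apply, not Series.apply (Series.apply forwards unknown keyword arguments to your function, hence the "unexpected keyword argument" messages). A sketch of that variant, which should be equivalent to the line above:
df[['str1', 'str2', 'str3', 'str4', 'str5']] = df.apply(
    lambda row: split_template_name(row['template']),
    axis=1, result_type='expand')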
If it's possible to split the column with a regular expression, you could use:
df.template.str.extract()
see this example:
import pandas as pd
df = pd.DataFrame({'sentences': ['how_are_you', 'hello_world_good']})
This is how the dataframe looks:
sentences
0 how_are_you
1 hello_world_good
using Series.str.extract()
df['sentences'].str.extract(r'(?P<first>\w+)_(?P<second>\w+)_(?P<third>\w+)')
output:
first second third
0 how are you
1 hello world good
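Applied to the five-part template column from the question, the same idea would look something like this (assuming, purely for illustration, that the parts are underscore-separated; adjust the pattern to the real template format):
pattern = (r'(?P<str1>\w+)_(?P<str2>\w+)_(?P<str3>\w+)_'
           r'(?P<str4>\w+)_(?P<str5>\w+)')
df[['str1', 'str2', 'str3', 'str4', 'str5']] = df['template'].str.extract(pattern)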
This worked for me.
df_dict = {"template" :["A B C D E","A B C D E","A B C D E","A B C D E","A
B C D E"], "content" : ["text1","text2","text3","text4","text5"]}
df = pd.DataFrame(df_dict)
print(df)
template content
0 A B C D E text1
1 A B C D E text2
2 A B C D E text3
3 A B C D E text4
4 A B C D E text5
def split_template_name(row):
    return row.split()
df[['A', 'B', 'C', 'D', 'E']] = df['template'].apply(split_template_name).tolist()
print(df)
    template content  A  B  C  D  E
0  A B C D E   text1  A  B  C  D  E
1  A B C D E   text2  A  B  C  D  E
2  A B C D E   text3  A  B  C  D  E
3  A B C D E   text4  A  B  C  D  E
4  A B C D E   text5  A  B  C  D  E
My CSV file's rows and columns look like this:
a a a a a
b b b b b
c c c c c
d d d d d
a b c d e
a d b c c
When a row looks like rows 1-5 (a recognised pattern), I want to return the value 0.
When a row looks like row 6, or contains random letters (not like rows 1-5), I want to return the value 1.
How do I do this using Python? It must be done using the CSV file.
You can read your CSV file into a pandas dataframe using:
import pandas as pd
df = pd.read_csv('your_file.csv', header=None)  # 'your_file.csv' stands in for your actual file path
output:
0 1 2 3 4
0 a a a a a
1 b b b b b
2 c c c c c
3 d d d d d
4 a b c d e
5 a d b c c
Then use nunique to count the number of unique values per row: if it is 1 (all values identical) or equal to the number of columns (all values distinct), the row matches the pattern and should get 0; otherwise it should get 1. between expresses exactly that middle range:
df.nunique(axis=1).between(2, len(df.columns) - 1).astype(int)
output:
0 0
1 0
2 0
3 0
4 0
5 1
dtype: int64
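Written out more explicitly, the rule is: a row gets 0 when its values are all identical or all distinct, and 1 otherwise. A small sketch of the same logic (the helper name flag_rows and the file name are made up):
import pandas as pd
def flag_rows(df):
    # 0 when a row is all-identical (nunique == 1) or all-distinct
    # (nunique == number of columns), 1 otherwise
    n = df.nunique(axis=1)
    return (~((n == 1) | (n == len(df.columns)))).astype(int)
flags = flag_rows(pd.read_csv('your_file.csv', header=None))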
Is there a function in Python that does what the R fct_lump function does (i.e. lumps all the groups that are too small into one 'Other' group)?
Example below:
library(dplyr)
library(forcats)
> x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
> x
[1] A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A B B B B B B B B
[49] B B C C C C C D D D D D D D D D D D D D D D D D D D D D D D D D D D E F G H I
Levels: A B C D E F G H I
> x %>% fct_lump_n(3)
[1] A A A A A A A A A A A A A A A A
[17] A A A A A A A A A A A A A A A A
[33] A A A A A A A A B B B B B B B B
[49] B B Other Other Other Other Other D D D D D D D D D
[65] D D D D D D D D D D D D D D D D
[81] D D Other Other Other Other Other
Levels: A B D Other
pip install siuba
# (run in your Python or Anaconda prompt/shell)
# then use the library:
from siuba.dply.forcats import fct_lump, fct_reorder
# just like R's fct_lump:
df['Your_column'] = fct_lump(df['Your_column'], n=10)
df['Your_column'].value_counts()  # check your levels
# this reduces the levels to 10 and lumps all the others into 'Other'
You may also want to try datar:
>>> from datar.all import factor, rep, LETTERS, c, fct_lump_n, fct_count
>>>
>>> x = factor(rep(LETTERS[:9], times=c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
>>> x >> fct_count()
f n
<category> <int64>
0 A 40
1 B 10
2 C 5
3 D 27
4 E 1
5 F 1
6 G 1
7 H 1
8 I 1
>>> x >> fct_lump_n(3) >> fct_count()
f n
<category> <int64>
0 A 40
1 B 10
2 D 27
3 Other 10
Disclaimer: I am the author of the datar package.
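If you would rather stay in plain pandas without an extra dependency, here is a minimal sketch of the same lumping idea (the helper name lump_n is made up, and unlike forcats it does not treat ties specially):
import pandas as pd
def lump_n(s, n, other='Other'):
    # keep the n most frequent levels and lump everything else into `other`
    top = s.value_counts().nlargest(n).index
    return s.where(s.isin(top), other)
x = pd.Series(list('A' * 40 + 'B' * 10 + 'C' * 5 + 'D' * 27 + 'EFGHI'))
print(lump_n(x, 3).value_counts())
# A        40
# D        27
# B        10
# Other    10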
I have a pandas data frame with 2 columns, user1 and user2, something like this:
Now I want to apply the transitive relation, so that if A is related to B, B to C, and C to D, then I want the output as a list like "A-B-C-D" in one group and "E-F-G" in another group.
Thanks
If you have just 2 groups, you can do it this way, but it only works for 2 groups and does not generalize:
x = []
y = []
x.append(df['user1'][0])
x.append(df['user2'][0])
for index, i in enumerate(df['user1']):
    if df['user1'][index] in x:
        x.append(df['user2'][index])
    else:
        y.append(df['user1'][index])
        y.append(df['user2'][index])
x = set(x)
y = set(y)
If you want to find all the transitive relationships, then most likely you need recursion. Perhaps the following piece of code may help:
import pandas as pd

data = {'user1': ['A', 'A', 'B', 'C', 'E', 'F'],
        'user2': ['B', 'C', 'C', 'D', 'F', 'G']}
df = pd.DataFrame(data)
print(df)

# this method is similar to a common table expression (CTE) in SQL
def cte(df_anchor, df_ref, level):
    if level == 0:
        df_anchor.insert(0, 'user_root', df_anchor['user1'])
        df_anchor['level'] = 0
        df_anchor['relationship'] = df_anchor['user1'] + '-' + df_anchor['user2']
        _df_anchor = df_anchor
    if level > 0:
        _df_anchor = df_anchor[df_anchor.level == level]
    _df = pd.merge(_df_anchor, df_ref, left_on='user2', right_on='user1',
                   how='inner', suffixes=('', '_x'))
    if not _df.empty:
        _df['relationship'] = _df['relationship'] + '-' + _df['user2_x']
        _df['level'] = _df['level'] + 1
        _df = _df[['user_root', 'user1_x', 'user2_x', 'level', 'relationship']].rename(
            columns={'user1_x': 'user1', 'user2_x': 'user2'})
        df_anchor_new = pd.concat([df_anchor, _df])
        return cte(df_anchor_new, df_ref, level + 1)
    else:
        return df_anchor

df_rel = cte(df, df, 0)
print("\nall relationship=\n", df_rel)
print("\nall relationship related to A=\n", df_rel[df_rel.user_root == 'A'])
user1 user2
0 A B
1 A C
2 B C
3 C D
4 E F
5 F G
all relationship=
user_root user1 user2 level relationship
0 A A B 0 A-B
1 A A C 0 A-C
2 B B C 0 B-C
3 C C D 0 C-D
4 E E F 0 E-F
5 F F G 0 F-G
0 A B C 1 A-B-C
1 A C D 1 A-C-D
2 B C D 1 B-C-D
3 E F G 1 E-F-G
0 A C D 2 A-B-C-D
all relationship related to A=
user_root user1 user2 level relationship
0 A A B 0 A-B
1 A A C 0 A-C
0 A B C 1 A-B-C
1 A C D 1 A-C-D
0 A C D 2 A-B-C-D
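If what you ultimately need is only the groups ("A-B-C-D" and "E-F-G") rather than every intermediate path, an alternative sketch is to treat the pairs as edges of an undirected graph and take its connected components (this uses networkx, an extra dependency):
import networkx as nx
import pandas as pd
data = {'user1': ['A', 'A', 'B', 'C', 'E', 'F'],
        'user2': ['B', 'C', 'C', 'D', 'F', 'G']}
df = pd.DataFrame(data)
# build an undirected graph from the user pairs and list its connected components
G = nx.from_pandas_edgelist(df, source='user1', target='user2')
groups = ['-'.join(sorted(component)) for component in nx.connected_components(G)]
print(groups)  # e.g. ['A-B-C-D', 'E-F-G'] (component order may vary)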
I have a file whose content looks like:
A B 2 4
C D 1 2
A D 3 4
A D 1 2
A B 4 7
and so on...
My objective is to get the final output as below:
A B 3 5.5
C D 1 2
A D 2 3
That is, for each unique combination of the first two columns, the result should be the column-wise average of the other two columns of the file. I tried using loops, but that just increases the complexity of the program. Is there any other way to achieve this?
Sample Code:
with open(r"C:\Users\priya\Desktop\test.txt") as f:
content = f.readlines()
content = [x.split() for x in content]
for i in range(len(content)):
valueofa=[content[i][2]]
valueofb=[content[i][3]]
for j in xrange(i+1,len(content)):
if content[i][0]==content[j][0] and content[i][1]==content[j][1]:
valueofa.append(content[j][2])
valueofb.append(content[j][3])
and I intended to take the average of both lists by index.
You can store each combination of letters as a tuple in a dictionary and then average at the end, e.g.:
d = {}
with open(r"C:\Users\priya\Desktop\test.txt") as f:
    for line in f:
        a, b, x, y = line.split()
        d.setdefault((a, b), []).append((int(x), int(y)))
for (a, b), v in d.items():
    xs, ys = zip(*v)
    print("{} {} {:g} {:g}".format(a, b, sum(xs) / len(v), sum(ys) / len(v)))
Output:
A B 3 5.5
C D 1 2
A D 2 3
If you can use pandas, it will be much simpler:
import pandas as pd
df = pd.read_csv(r"C:\Users\priya\Desktop\test.txt", names=['A','B','C','D'])
df
A B C D
0 A B 2 4
1 C D 1 2
2 A D 3 4
3 A D 1 2
4 A B 4 7
df.groupby(['A','B']).mean().reset_index()
A B C D
0 A B 3.0 5.5
1 A D 2.0 3.0
2 C D 1.0 2.0
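If you then need the result back on disk in the same whitespace-separated layout as the input, one possible follow-up (the output file name is made up):
out = df.groupby(['A', 'B'], as_index=False).mean()
out.to_csv(r"C:\Users\priya\Desktop\averages.txt", sep=' ', header=False, index=False)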