splitting custom function output in pandas to multiple columns - python

I tried looking for similar answers, but solutions didn't work for me.
I have a dataframe with two columns: template(str) and content(str).
I also have a separate function, split_template_name, that takes a string and returns a tuple of 5 values, e.g.:
split_template_name(some_string) will return a tuple of 5 strings ('str1', 'str2', 'str3', 'str4', 'str5')
I'm trying to process df['template'] with this function, so that the dataframe gets 5 more columns with the 5 outputs.
Tried
df['template'].apply(split_template_name) and it returns the full tuple as one column, which is not what I need.
Some Stack Overflow answers suggest adding result_type='expand', so I tried df['template'].apply(split_template_name, axis=1, result_type='expand'),
but that gives errors: split_template_name() got an unexpected keyword argument 'axis' or split_template_name() got an unexpected keyword argument 'result_type'
Basically the goal is to start with df['template', 'content'] and to end with dataframe that has df['template', 'content', 'str1', 'str2', 'str3', 'str4', 'str5']

This seems to work:
df[['str1', 'str2', 'str3', 'str4', 'str5']] = pd.DataFrame(
    df['template'].apply(split_template_name).tolist(), index=df.index)
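An equivalent alternative is to transpose the Series of tuples with zip. This is a sketch with a hypothetical stand-in for split_template_name, since the original function isn't shown:

```python
import pandas as pd

# Hypothetical stand-in for the OP's split_template_name
def split_template_name(s):
    return tuple(s.split('_')[:5])

df = pd.DataFrame({'template': ['a_b_c_d_e', 'v_w_x_y_z'],
                   'content': ['t1', 't2']})

# zip(*...) transposes the sequence of 5-tuples into 5 column sequences
df['str1'], df['str2'], df['str3'], df['str4'], df['str5'] = zip(
    *df['template'].map(split_template_name))
```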

If the column can be split with a regular expression, you could use:
df.template.str.extract()
see this example:
import pandas as pd
df = pd.DataFrame({'sentences': ['how_are_you', 'hello_world_good']})
This is how the dataframe looks:
sentences
0 how_are_you
1 hello_world_good
using Series.str.extract()
df['sentences'].str.extract(r'(?P<first>\w+)_(?P<second>\w+)_(?P<third>\w+)')
output:
first second third
0 how are you
1 hello world good

This worked for me.
df_dict = {"template": ["A B C D E", "A B C D E", "A B C D E",
                        "A B C D E", "A B C D E"],
           "content": ["text1", "text2", "text3", "text4", "text5"]}
df = pd.DataFrame(df_dict)
print(df)
template content
0 A B C D E text1
1 A B C D E text2
2 A B C D E text3
3 A B C D E text4
4 A B C D E text5
def split_template_name(row):
    return row.split()
# .tolist() turns the Series of lists into a list of rows, so each
# value lands in its own column
df[['A', 'B', 'C', 'D', 'E']] = df['template'].apply(split_template_name).tolist()
print(df)
template content A B C D E
0 A B C D E text1 A B C D E
1 A B C D E text2 A B C D E
2 A B C D E text3 A B C D E
3 A B C D E text4 A B C D E
4 A B C D E text5 A B C D E
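As for the errors in the question: axis and result_type are keywords of DataFrame.apply, not Series.apply, which forwards unknown keywords to the function itself. A sketch that uses them correctly (with a hypothetical stand-in for the splitter):

```python
import pandas as pd

# Hypothetical stand-in for the OP's split_template_name
def split_template_name(s):
    return tuple(s.split('_')[:5])

df = pd.DataFrame({'template': ['a_b_c_d_e', 'v_w_x_y_z'],
                   'content': ['t1', 't2']})

# result_type='expand' belongs to DataFrame.apply, so apply over rows
# (axis=1) and call the splitter on each row's 'template' value
df[['str1', 'str2', 'str3', 'str4', 'str5']] = df.apply(
    lambda row: split_template_name(row['template']),
    axis=1, result_type='expand')
```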

Related

Aggregate values pandas

I have a pandas dataframe like this:
Id A B C D
1 a b c d
2 a b d
2 a c d
3 a d
3 a b c
I want to aggregate the empty values for the columns B-C and D, using the values contained in the other rows, by using the information for the same Id.
The resulting data frame should be the following:
Id A B C D
1 a b c d
2 a b c d
3 a b c d
It is possible to have different values in the first column (A) for the same Id. In this case, instead of keeping the first instance, I prefer to put another value indicating this event.
So for e.g.
Id A B C D
1 a b c d
2 a b d
2 x c d
It becomes:
Id A B C D
1 a b c d
2 f b c d
IIUC, you can use groupby + agg:
>>> df.groupby('Id').agg({'A': lambda x: x.iloc[0] if len(x.unique()) == 1 else 'f',
...                       'B': 'first', 'C': 'first', 'D': 'first'})
A B C D
Id
1 a b c d
2 f b c d
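A runnable sketch of this approach on the first example; note it assumes the blanks are actually NaN, since 'first' skips missing values but would happily return an empty string:

```python
import pandas as pd
import numpy as np

# Reconstruction of the first example; blanks are assumed to be NaN
df = pd.DataFrame({'Id': [1, 2, 2, 3, 3],
                   'A': ['a', 'a', 'a', 'a', 'a'],
                   'B': ['b', 'b', np.nan, np.nan, 'b'],
                   'C': ['c', np.nan, 'c', np.nan, 'c'],
                   'D': ['d', 'd', 'd', 'd', np.nan]})

# 'first' takes the first non-null value per group; the lambda flags
# groups whose A values disagree with 'f'
out = df.groupby('Id').agg({'A': lambda x: x.iloc[0] if len(x.unique()) == 1 else 'f',
                            'B': 'first', 'C': 'first', 'D': 'first'})
```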
The best way I can think of to do this is to iterate through each unique Id, slice it out of the original dataframe, and construct a new row by merging the relevant rows:
def aggregate(df):
    ids = df['Id'].unique()
    rows = []
    for id in ids:
        relevant = df[df['Id'] == id]
        newrow = {c: "" for c in df.columns}
        for _, row in relevant.iterrows():
            for col in newrow:
                if row[col]:
                    if len(newrow[col]):
                        if newrow[col][-1] == row[col]:
                            continue
                    newrow[col] += row[col]
        rows.append(newrow)
    return pd.DataFrame(rows)

How to combine string from one column to another column at same index in pandas DataFrame?

I was doing a project in NLP.
My input is:
index name lst
0 a c
0 d
0 e
1 f
1 b g
I need output like this:
index name lst combine
0 a c a c
0 d a d
0 e a e
1 f b f
1 b g b g
How can I achieve this?
You can use groupby + transform('max') to fill the empty cells with the letter of each group, as any letter sorts above the empty string. The rest is a simple string concatenation per column:
df['combine'] = df.groupby('index')['name'].transform('max') + ' ' + df['lst']
Used input:
df = pd.DataFrame({'index': [0, 0, 0, 1, 1],
                   'name': ['a', '', '', '', 'b'],
                   'lst': list('cdefg'),
                   })
NB. I considered "index" to be a column here; if it is actually the index, you should use df.index in the groupby.
Output:
index name lst combine
0 0 a c a c
1 0 d a d
2 0 e a e
3 1 f b f
4 1 b g b g
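An alternative sketch that avoids relying on string ordering: treat the empty cells as missing and take each group's first non-null name:

```python
import pandas as pd

df = pd.DataFrame({'index': [0, 0, 0, 1, 1],
                   'name': ['a', '', '', '', 'b'],
                   'lst': list('cdefg'),
                   })

# Treat '' as missing, then broadcast the group's first non-null name
name = df['name'].replace('', pd.NA)
df['combine'] = name.groupby(df['index']).transform('first') + ' ' + df['lst']
```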

How to recognize repeated pattern over a data and make some recommendation?

I have a huge dataset with some repeated data (a user log file) and would like to do similar-pattern occurrence recognition and recommendation based on the user's downloads. Once the pattern recognition is done, I need to recommend the best possible value to the user.
For example following are the download logs based on time:
A C D F A C D A B D A C D F A C D A B D
I would like to recognize the pattern that exists between this dataset and display the result as:
A -> C = 4
C -> D = 4
D -> F = 2
F -> A = 2
D -> A = 3
A -> B = 1
B -> D = 1
A -> C -> D = 2
C -> D -> F = 2
D -> F -> A = 1
F -> A -> C = 1
C -> D -> A = 1
D -> A -> B = 1
A -> B -> D = 1
The number at the end represents the number of repetition of that pattern.
When the user inputs "A", the best recommendation should be "C", And if the user input is "A -> C", then it should be "D".
Currently I am doing data cleaning using pandas in Python and for pattern recognition, I think scikit-learn might work (not sure though).
Is there any good library or algorithm that I can make a use for this problem or is there any good approach for this kind of problem ?
Since the data size is very big, I am implementing it using Python.
The current problem can be easily solved with n-grams. You can use CountVectorizer to find the n-grams and their frequencies in the text and generate the output you want.
from sklearn.feature_extraction.text import CountVectorizer
# Changed the token_pattern to identify only single-letter words
# ngram_range=(2, 5), to identify from 2-grams up to 5-grams
cv = CountVectorizer(ngram_range=(2, 5), token_pattern=r"(?u)\b\w\b",
                     lowercase=False)
# Wrapped the data in a list, because CountVectorizer requires an iterable
data = ['A C D F A C D A B D A C D F A C D A B D']
# Learn the vocabulary from the data
cv.fit(data)
# This is just to prettify the printing
import pandas as pd
# Use cv.get_feature_names() on scikit-learn versions older than 1.0
df = pd.DataFrame(cv.get_feature_names_out(), columns=['pattern'])
# Add the frequencies
df['count'] = cv.transform(data).toarray()[0]  # <== converting to a dense array
df
#Output
pattern count
A B 2
A B D 2
A B D A 1
A B D A C 1
A C 4
A C D 4
A C D A 2
A C D A B 2
A C D F 2
A C D F A 2
B D 2
B D A 1
B D A C 1
B D A C D 1
C D 4
C D A 2
C D A B 2
C D A B D 2
C D F 2
C D F A 2
C D F A C 2
D A 3
D A B 2
D A B D 2
D A B D A 1
D A C 1
D A C D 1
D A C D F 1
D F 2
D F A 2
D F A C 2
D F A C D 2
F A 2
F A C 2
F A C D 2
F A C D A 2
But I would also recommend trying recommender systems, pattern mining, and association rule mining (Apriori) algorithms, which may help you more.
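For the next-item recommendation specifically, the bigram counts alone suffice; a minimal sketch without scikit-learn:

```python
from collections import Counter

logs = 'A C D F A C D A B D A C D F A C D A B D'.split()

# Count consecutive pairs; the best recommendation after an item is its
# most frequent successor
bigrams = Counter(zip(logs, logs[1:]))

def recommend(item):
    followers = {b: n for (a, b), n in bigrams.items() if a == item}
    return max(followers, key=followers.get)

print(recommend('A'))  # 'C' (A -> C occurs 4 times, A -> B only 2)
```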

finding transitive relation between two columns in pandas

I have a pandas data frame with 2 columns - user1 and user2
something like this
Now, I want to compute the transitive relation such that if A is related to B, B to C, and C to D, then I want the output as a list like "A-B-C-D" in one group and "E-F-G" in another group.
Thanks
If you have just 2 groups, you can do it this way. But it works only for 2 groups and cannot be generalized:
x = []
y = []
x.append(df['user1'][0])
x.append(df['user2'][0])
for index, i in enumerate(df['user1']):
    if df['user1'][index] in x:
        x.append(df['user2'][index])
    else:
        y.append(df['user1'][index])
        y.append(df['user2'][index])
x = set(x)
y = set(y)
If you want to find all the transitive relationships, then most likely you need to perform a recursion. Perhaps the following piece of code may help:
import pandas as pd
data = {'user1': ['A', 'A', 'B', 'C', 'E', 'F'],
        'user2': ['B', 'C', 'C', 'D', 'F', 'G']}
df = pd.DataFrame(data)
print(df)

# this method is similar to a common table expression (CTE) in SQL
def cte(df_anchor, df_ref, level):
    if level == 0:
        df_anchor.insert(0, 'user_root', df_anchor['user1'])
        df_anchor['level'] = 0
        df_anchor['relationship'] = df_anchor['user1'] + '-' + df_anchor['user2']
        _df_anchor = df_anchor
    if level > 0:
        _df_anchor = df_anchor[df_anchor.level == level]
    _df = pd.merge(_df_anchor, df_ref, left_on='user2', right_on='user1',
                   how='inner', suffixes=('', '_x'))
    if not _df.empty:
        _df['relationship'] = _df['relationship'] + '-' + _df['user2_x']
        _df['level'] = _df['level'] + 1
        _df = _df[['user_root', 'user1_x', 'user2_x', 'level', 'relationship']].rename(
            columns={'user1_x': 'user1', 'user2_x': 'user2'})
        df_anchor_new = pd.concat([df_anchor, _df])
        return cte(df_anchor_new, df_ref, level + 1)
    else:
        return df_anchor

df_rel = cte(df, df, 0)
print("\nall relationship=\n", df_rel)
print("\nall relationship related to A=\n", df_rel[df_rel.user_root == 'A'])
user1 user2
0 A B
1 A C
2 B C
3 C D
4 E F
5 F G
all relationship=
user_root user1 user2 level relationship
0 A A B 0 A-B
1 A A C 0 A-C
2 B B C 0 B-C
3 C C D 0 C-D
4 E E F 0 E-F
5 F F G 0 F-G
0 A B C 1 A-B-C
1 A C D 1 A-C-D
2 B C D 1 B-C-D
3 E F G 1 E-F-G
0 A C D 2 A-B-C-D
all relationship related to A=
user_root user1 user2 level relationship
0 A A B 0 A-B
1 A A C 0 A-C
0 A B C 1 A-B-C
1 A C D 1 A-C-D
0 A C D 2 A-B-C-D
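If only the final groups are needed (rather than every intermediate path), a non-recursive union-find pass over the edge list is an alternative; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'user1': ['A', 'A', 'B', 'C', 'E', 'F'],
                   'user2': ['B', 'C', 'C', 'D', 'F', 'G']})

# Union-find: nodes that end up with the same root form one transitive group
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
        x = parent[x]
    return x

for a, b in zip(df['user1'], df['user2']):
    parent[find(a)] = find(b)

groups = {}
for node in parent:
    groups.setdefault(find(node), []).append(node)

chains = ['-'.join(sorted(g)) for g in groups.values()]
print(chains)  # ['A-B-C-D', 'E-F-G']
```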

Adding spaces between strings after sum()

Assuming that I have the following pandas dataframe:
>>> data = pd.DataFrame({ 'X':['a','b'], 'Y':['c','d'], 'Z':['e','f']})
X Y Z
0 a c e
1 b d f
The desired output is:
0 a c e
1 b d f
When I run the following code, I get:
>>> data.sum(axis=1)
0 ace
1 bdf
So how do I add columns of strings with space between them?
Use apply per row with axis=1 and join:
a = data.apply(' '.join, axis=1)
print (a)
0 a c e
1 b d f
dtype: object
Another solution: add spaces, then sum, and finally str.rstrip:
a = data.add(' ').sum(axis=1).str.rstrip()
#same as
#a = (data + ' ').sum(axis=1).str.rstrip()
print (a)
0 a c e
1 b d f
dtype: object
You can do as follows:
data.apply(lambda x:x + ' ').sum(axis=1)
The output is :
0 a c e
1 b d f
dtype: object
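Series.str.cat is another option that takes the separator directly:

```python
import pandas as pd

data = pd.DataFrame({'X': ['a', 'b'], 'Y': ['c', 'd'], 'Z': ['e', 'f']})

# str.cat concatenates one or more other Series with a separator
out = data['X'].str.cat([data['Y'], data['Z']], sep=' ')
```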
