randomly shuffle multiple dataframes - python

I have a corpus of 400 conversations between two people, stored as strings (or, more precisely, as plain text files). A small example might be:
my_textfiles = ['john: hello \nmary: hi there \njohn: nice weather \nmary: yes',
'nancy: hello \nbill: hi there \nnancy: nice weather \nbill: yes',
'ringo: hello \npaul: hi there \nringo: nice weather \npaul: yes',
'michael: hello \nbubbles: hi there \nmichael: nice weather \nbubbles: yes',
'steve: hello \nsally: hi there \nsteve: nice weather \nsally: yes']
In addition to speaker names, I have also noted each speaker's role in the conversation (as a leader or follower, depending on whether they are the first or second speaker). I then have a simple script that converts each conversation into a dataframe by separating the speaker ID from the content:
import pandas as pd
import re
import numpy as np
import random
def convo_tokenize(tf):
    turnTokenize = re.split(r'\n(?=.*:)', tf, flags=re.MULTILINE)
    turnTokenize = [turn.split(':', 1) for turn in turnTokenize]
    dataframe = pd.DataFrame(turnTokenize, columns = ['speaker','turn'])
    return dataframe
df_list = [convo_tokenize(tf) for tf in my_textfiles]
The corresponding dataframe then forms the basis of a much longer piece of analysis. However, I would now like to be able to shuffle speakers so that I create entirely random (and likely nonsense) conversations. For instance, John, who is having a conversation with Mary in the first string, might be randomly assigned Paul (the second speaker in the third string). Crucially, I would need to maintain the order of speech within each speaker. It is also important that, when randomly assigning new speakers, I preserve a mix of leader/follower, such that I am not creating conversations from two leaders or two followers.
To begin, my thinking was to create a standardized speaker label (where 1 = leader, 2 = follower), split each DF into one sub-DF per role, and store the sub-DFs in role-specific lists:
def speaker_role(dataframe):
    leader = dataframe['speaker'].iat[0]
    dataframe['sp_role'] = np.where(dataframe['speaker'].eq(leader), 1, 2)
    return dataframe
df_list = [speaker_role(df) for df in df_list]
leader_df = []
follower_df = []
for df in df_list:
    is_leader = df['sp_role'] == 1
    is_follower = df['sp_role'] != 1
    leader_df.append(df[is_leader])
    follower_df.append(df[is_follower])
I have worked out that I can now simply shuffle one of the lists of sub-DFs, in this case follower_df:
follower_rand = random.sample(follower_df, len(follower_df))
Having got to this stage I'm not sure where to turn next. I suspect I will need some sort of zip function, but am unsure exactly what. I'm also unsure how I go about merging the turns together such that they form the same dataframe structure I initially had. Assuming Ringo (leader) is randomly assigned to Bubbles (follower) for one of the DFs, I would hope to have something like this...
speaker | turn | sp_role
------------------------------------
ringo hello 1
bubbles hi there 2
ringo nice weather 1
bubbles yes it is 2

Related

str.contains not working when there is not a space between the word and special character

I have a dataframe which includes the names of movie titles and TV Series.
From specific keywords I want to classify each row as Movie or Series. However, because there is no space between the brackets and the keywords, the keywords are not being picked up by the str.contains() function and I need a workaround.
This is my dataframe:
import pandas as pd
import numpy as np
watched_df = pd.DataFrame([['Love Death Robots (Episode 1)'],
['James Bond'],
['How I met your Mother (Avnsitt 3)'],
['random name'],
['Random movie 3 Episode 8383893']],
columns=['Title'])
watched_df.head()
To add the column that classifies the titles as TV series or Movies I have the following code.
watched_df["temporary_brackets_removed_title"] = watched_df['Title'].str.replace('(', '')
watched_df["Film_Type"] = np.where(watched_df.temporary_brackets_removed_title.astype(str).str.contains(pat = 'Episode | Avnsitt', case = False), 'Series', 'Movie')
watched_df = watched_df.drop('temporary_brackets_removed_title', axis=1)
watched_df.head()
Is there a simpler way to solve this without having to add and drop a column?
Maybe a str.contains-like function that does not look at a string being the exact same but just containing the given word? Similar to how in SQL you have the "Like" functionality?
You can use str.contains and then map the results:
watched_df['Film_Type'] = watched_df['Title'].str.contains(r'(?:Episode|Avnsitt)').map({True: 'Series', False: 'Movie'})
Output:
>>> watched_df
Title Film_Type
0 Love Death Robots (Episode 1) Series
1 James Bond Movie
2 How I met your Mother (Avnsitt 3) Series
3 random name Movie
4 Random movie 3 Episode 8383893 Series
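As a side note on why the original pattern missed a row: the spaces in 'Episode | Avnsitt' are part of the regular expression, so the second alternative only matches "Avnsitt" when a space comes directly before it, and "(Avnsitt 3)" has a bracket there instead. The short comparison below is a sketch on the same sample titles; the variable names with_spaces and without_spaces are introduced here for illustration.

```python
import pandas as pd

watched_df = pd.DataFrame({"Title": [
    "Love Death Robots (Episode 1)",
    "James Bond",
    "How I met your Mother (Avnsitt 3)",
    "random name",
    "Random movie 3 Episode 8383893",
]})

# Spaces inside the pattern are literal characters that must be present in the title
with_spaces = watched_df["Title"].str.contains("Episode | Avnsitt", case=False)
# Without the spaces, the keyword matches wherever it appears, brackets or not
without_spaces = watched_df["Title"].str.contains("Episode|Avnsitt", case=False)

print(pd.DataFrame({"with_spaces": with_spaces, "without_spaces": without_spaces}))
```

Only the third title differs between the two columns, which is the row the bracket-stripping workaround was compensating for.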

Check words from a list in column of dataframe and print the words in another created column of the same dataframe [Python]

I have a list of keywords, and I want to write a Python program that checks, for each row of a dataframe column, which words from the list occur in that row, and writes those words to another column of the same dataframe.
e.g.
keywords = ['registration', 'al', 'branch']
df = pd.DataFrame({'message': ['wonderful registration process', 'i hate this branch', 'this branch has a great registration process','I don't like this place']})
I want to check the matched words in the list with each row of the message in the data frame and print the matched words in another created column named "keywords" of the data frame.
So the output of the above code should be
df
message
0 wonderful registration process
1 i hate this branch
2 this branch has a great registration process
3 I don't like this place
df
message keywords
0 wonderful registration process registration
1 i hate this branch branch
2 this branch has a great registration process registration, branch
3 I don't like this place none
It will be great if anyone could guide me.
Here is a working solution:
import pandas as pd
keywords = ['registration', 'al', 'branch']
df = pd.DataFrame({'message': ["wonderful registration process", "i hate this branch", "this branch has a great registration process", "I don't like this place"]})
# When a string contains an apostrophe (as in don't), delimit it with double quotes ("") rather than single quotes ('')
def operation(element):
    # Collect every keyword that occurs as a substring of the message
    res = ", ".join([keyword for keyword in keywords if keyword in element])
    if res == "":
        return "none"  # no keyword matched
    else:
        return res
df.insert(1, "keywords", df["message"].map(operation))  # insert the new column
print(df)
Happy coding! If you have any questions, you can message me on Stack Overflow.
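One caveat with substring tests such as `keyword in element`: a short keyword like 'al' would also match inside longer words (for example "final"). If whole-word matching is what is wanted, a variation using regular-expression word boundaries could look like the sketch below. The function name matched_keywords is hypothetical, and whether whole-word matching is the desired behaviour is an assumption about the question.

```python
import re
import pandas as pd

keywords = ["registration", "al", "branch"]
df = pd.DataFrame({"message": [
    "wonderful registration process",
    "i hate this branch",
    "this branch has a great registration process",
    "I don't like this place",
]})

def matched_keywords(text):
    # \b anchors the keyword to word boundaries; re.escape guards against
    # keywords that contain regex metacharacters.
    hits = [kw for kw in keywords if re.search(rf"\b{re.escape(kw)}\b", text)]
    return ", ".join(hits) if hits else "none"

df["keywords"] = df["message"].map(matched_keywords)
print(df)
```

Because the list comprehension iterates over keywords, matches are reported in keyword-list order rather than in the order they appear in the message.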

Python categorize data in excel based on key words from another excel sheet

I have two excel sheets, one has four different types of categories with keywords listed. I am using Python to find the keywords in the review data and match them to a category. I have tried using pandas and data frames to compare but I get errors like "DataFrame objects are mutable, thus they cannot be hashed". I'm not sure if there is a better way but I am new to Pandas.
Here is an example:
Category sheet
Service   Experience
fast      bad
slow      easy
Data Sheet
Review #   Location   Review
1          New York   "The service was fast!"
2          Texas      "Overall it was a bad experience for me"
For the examples above I would expect the following as a result.
I would expect review 1 to match the category Service because of the word "fast" and I would expect review 2 to match category Experience because of the word "bad". I do not expect the review to match every word in the category sheet, and it is fine if one review belongs to more than one category.
Here is my code. Note that I am using a simple example: below, I am trying to find the review data that matches the Customer Service list of keywords.
import pandas as pd
# List of Categories
cat = pd.read_excel("Categories_List.xlsx")
# Data being used
data = pd.read_excel("Data.xlsx")
# Data Frame for review column
reviews = pd.DataFrame(data["reviews"])
# Data Frame for Categories
cs = pd.DataFrame(cat["Customer Service"])
be = pd.DataFrame(cat["Billing Experience"])
net = pd.DataFrame(cat["Network"])
out = pd.DataFrame(cat["Outcome"])
for i in reviews:
    if cs in reviews:
        print("True")
One approach would be to build a regular expression from the cat frame:
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])
(?P<Service>fast|slow)|(?P<Experience>bad|easy)
Alternatively replace cat with a list of columns to test:
cols = ['Service']
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cols])
(?P<Service>fast|slow|quick)
Then, to get the matches, use str.extractall, aggregate the result per review, and join it back onto the reviews frame:
Aggregated into List:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: list(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! [fast] [easy]
1 2 Texas Overall it was a bad experience for me [] [bad]
Aggregated into String:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: ', '.join(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! fast easy
1 2 Texas Overall it was a bad experience for me bad
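Alternatively, for an existence test, use any grouped on level=0 (the level argument of DataFrame.any was removed in pandas 2.0, so group explicitly):
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).notna().groupby(level=0).any()
)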
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True
Or iteratively over the columns and with str.contains:
cols = cat.columns
for col in cols:
    reviews[col] = reviews['Review'].str.contains('|'.join(cat[col].dropna()))
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True
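For anyone who wants to try the str.contains variant without the Excel files, here is a self-contained sketch. The two in-memory frames stand in for Categories_List.xlsx and Data.xlsx and use the sample rows from the question; escaping the keywords, adding word boundaries, and ignoring case are additions of this sketch, not part of the answer above.

```python
import re
import pandas as pd

# Stand-ins for the two Excel sheets (assumed layout: one column per category)
cat = pd.DataFrame({"Service": ["fast", "slow"], "Experience": ["bad", "easy"]})
reviews = pd.DataFrame({
    "Review #": [1, 2],
    "Location": ["New York", "Texas"],
    "Review": ["The service was fast!", "Overall it was a bad experience for me"],
})

for col in cat.columns:
    words = cat[col].dropna().astype(str)
    # Non-capturing group of escaped keywords, matched as whole words
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    reviews[col] = reviews["Review"].str.contains(pattern, case=False, regex=True)

print(reviews)
```

Each category column ends up as a boolean flag per review, so a review can belong to several categories at once, which the question says is acceptable.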

Python concatenate values in rows till empty cell and continue

I am struggling a little to do something like that:
to get this output:
The purpose of it, is to separate a sentence into 3 parts to make some manipulations after.
Any help is welcome
Select only the second line of each pair from the dataframe, which is the line
containing the separator, then use astype(str).apply(''.join, ...) to collapse the word,
which can sit in any value column of the original dataframe, into a single string.
Iterate over each row, splitting the title on the word[i] of that row; after the split,
reinsert the separator into the list, and build the desired dataframe from the
resulting lists.
Input used as data.csv
title,Value,Value,Value,Value,Value
Very nice blue car haha,Very,nice,,car,haha
Very nice blue car haha,,,blue,,
A beautiful green building,A,,green,building,lol
A beautiful green building,,beautiful,,,
import pandas as pd
df = pd.read_csv("data.csv")
# second line of each pair
d1 = df[1::2]
d1 = d1.fillna("").reset_index(drop=True)
# get separators
word = d1.iloc[:,1:].astype(str).apply(''.join, axis=1)
strings = []
for i in range(len(d1.index)):
    word_split = d1.iloc[i, 0].split(word[i])
    word_split.insert(1, word[i])
    strings.append(word_split)
dn = pd.DataFrame(strings)
dn.insert(0, "title", d1["title"])
print(dn)
Output from dn
title 0 1 2
0 Very nice blue car haha Very nice blue car haha
1 A beautiful green building A beautiful green building
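A slightly more compact variation of the same idea, as a sketch: str.partition returns the text before the separator, the separator itself, and the text after it in one call, so the split-then-reinsert step is not needed. The CSV is read from an in-memory string here so the example runs on its own; the column names before, separator and after, and the stripping of surrounding spaces, are choices made for this illustration.

```python
import io
import pandas as pd

csv_text = """title,Value,Value,Value,Value,Value
Very nice blue car haha,Very,nice,,car,haha
Very nice blue car haha,,,blue,,
A beautiful green building,A,,green,building,lol
A beautiful green building,,beautiful,,,
"""
df = pd.read_csv(io.StringIO(csv_text))

# Second line of each pair holds the separator word in one of the value columns
sep_rows = df.iloc[1::2].fillna("").reset_index(drop=True)
separators = sep_rows.iloc[:, 1:].astype(str).apply("".join, axis=1)

# partition() yields (before, separator, after); strip the leftover spaces
parts = [
    [piece.strip() for piece in title.partition(sep)]
    for title, sep in zip(sep_rows["title"], separators)
]
result = pd.DataFrame(parts, columns=["before", "separator", "after"])
result.insert(0, "title", sep_rows["title"])
print(result)
```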

How do I create pairs from a data frame given constraints found in another column?

I need to randomly match two staff emails from a list of emails. The staff in a pair cannot have the same manager and cannot have been paired before. What is the best way to go about achieving this? I'm not that great with Python, so I'm not even sure how to start. The other similar questions I found didn't help me much.
I have two datasets:
List of active members
Column A: Emails of staff
Column B: The staff's manager
Emails Managers
jessica#xyz.com Bob
alex#xyz.com Justin
lucy#xyz.com Justin
eric#xyz.com Zach
brandon#xyz.com Tony
dylan#xyz.com Patty
List of historical matches
Emails Managers
lucy#xyz.com Justin
eric#xyz.com Zach
What it might look like:
Emails1 Managers1 Emails2 Managers2
dylan#xyz.com Patty lucy#xyz.com Justin
eric#xyz.com Zach brandon#xyz.com Tony
...
What I have so far (lol):
# Dependencies and Setup
import pandas as pd
import numpy as np
import itertools
# Load file and read in the data
active_data = pd.read_csv("Active.csv")
historical_data = pd.read_csv("Historical.csv")
# Preview data
active_data.head(7)
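Try this and let me know if it works or not:
# Flag rows whose manager already appeared in an earlier row
active_data['if_duplicate'] = active_data.duplicated(subset=['Managers'])
# Keep the first row for each manager
unique_indices = [x for x in range(active_data.shape[0]) if not active_data.loc[x, 'if_duplicate']]
# Drop staff whose email already appears in the historical matches
unique_indices = [x for x in unique_indices if active_data.loc[x, 'Emails'] not in historical_data['Emails'].values]
# Pick two distinct positions at random
ab = np.random.choice(len(unique_indices), size=2, replace=False)
i, j = unique_indices[ab[0]], unique_indices[ab[1]]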
i and j are indices of two rows who
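If the goal is to pair up everyone at once rather than draw a single pair, one simple approach is shuffle-and-retry: shuffle the emails, pair neighbours, and accept the result only if every pair satisfies both constraints. The sketch below is not a definitive solution. The function name make_pairs, the dict and set input formats, and the reading of the historical list as "lucy and eric were paired before" are all assumptions, since the question does not say how past pairs are stored.

```python
import random

def make_pairs(staff, past_pairs, max_tries=1000, seed=None):
    """Pair emails at random so that no pair shares a manager or repeats a past pair.

    staff: dict mapping email -> manager (hypothetical input format)
    past_pairs: set of frozenset({email_a, email_b}) (hypothetical input format)
    Returns a list of (email_a, email_b) tuples, or None if no valid pairing
    was found within max_tries shuffles.
    """
    rng = random.Random(seed)
    emails = list(staff)
    for _ in range(max_tries):
        rng.shuffle(emails)
        pairs = list(zip(emails[::2], emails[1::2]))  # an odd one out is left unpaired
        if all(staff[a] != staff[b] and frozenset((a, b)) not in past_pairs
               for a, b in pairs):
            return pairs
    return None

staff = {
    "jessica#xyz.com": "Bob",
    "alex#xyz.com": "Justin",
    "lucy#xyz.com": "Justin",
    "eric#xyz.com": "Zach",
    "brandon#xyz.com": "Tony",
    "dylan#xyz.com": "Patty",
}
# Assumption: the two historical rows describe one earlier pair
past_pairs = {frozenset(("lucy#xyz.com", "eric#xyz.com"))}

pairs = make_pairs(staff, past_pairs, seed=0)
print(pairs)
```

Retrying is cheap for small lists; for large lists with tight constraints, a proper matching algorithm would be a better fit than random retries.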
