Extract first element from multi-dimensional array in pandas dataframe - python

I have a data frame with a Reviews column whose values are multi-dimensional arrays (lists of lists), and I would like to extract the first element of each row.
I want the first inner list of each row in a separate column. Here are three sample values for the column:
df['Reviews'] =
[['Just like home', 'A Warm Welcome to Wintry Amsterdam'], ['01/03/2018', '01/01/2018']]
[['Great food and staff', 'just perfect'], ['01/06/2018', '01/04/2018']]
[['Satisfaction', 'Delicious old school restaurant'], ['01/04/2018', '01/04/2018']]
Please help

If you need the first lists, use indexing by str[0] (after parsing the strings with ast.literal_eval):
import ast
df['Reviews'] = df['Reviews'].apply(ast.literal_eval).str[0]
If you need to join the lists by , into strings, add Series.str.join:
import ast
df['Reviews'] = df['Reviews'].apply(ast.literal_eval).str[0].str.join(',')
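A minimal reproduction of the first variant, assuming (as the ast.literal_eval call implies) that the column was read from CSV and therefore holds the lists as strings:
import ast
import pandas as pd

# sample frame mirroring the question's three rows, stored as strings
df = pd.DataFrame({'Reviews': [
    "[['Just like home', 'A Warm Welcome to Wintry Amsterdam'], ['01/03/2018', '01/01/2018']]",
    "[['Great food and staff', 'just perfect'], ['01/06/2018', '01/04/2018']]",
    "[['Satisfaction', 'Delicious old school restaurant'], ['01/04/2018', '01/04/2018']]",
]})

# parse each string into a real list, then take the first inner list
df['Reviews'] = df['Reviews'].apply(ast.literal_eval).str[0]
print(df['Reviews'].iloc[0])  # ['Just like home', 'A Warm Welcome to Wintry Amsterdam']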

Use either of the following, depending on your needs. Each creates a new column named output containing the first element:
Apply Function
df['output'] = df.Reviews.apply(lambda x: x[0])
Map Function
df.loc[:, 'output'] = df.Reviews.map(lambda x: x[0])
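If the column already holds real Python lists rather than strings, the pandas .str accessor indexes them directly, a one-line equivalent of the lambdas above:
df['output'] = df['Reviews'].str[0]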

I guess this should help. This worked for me.
df['Reviews'] = df['Reviews'].apply(lambda c: str(c[0]).strip('[]'))
This works fine if run once. If run again on the same column, it will split the text further, so I suggest commenting it out after use, or writing the result to a new column.
P.S.: You should include the code instead of screenshots so that it can be tested.
EDIT: Looks fine to me. Try once again, and remember that if you run it twice (in case you are not making a separate column), it will return None.

If you receive an error, you probably have some empty data in Reviews. You can drop those rows if such data is useless for you:
df.dropna(subset=['Reviews'], inplace=True)
or add a check on the type of the data:
a = [[['Just like home', 'A Warm Welcome to Wintry Amsterdam'], ['01/03/2018', '01/01/2018']],
     [['Great food and staff', 'just perfect'], ['01/06/2018', '01/04/2018']],
     [['Satisfaction', 'Delicious old school restaurant'], ['01/04/2018', '01/04/2018']]]
df = pd.DataFrame(columns=['Reviews', 'Review'])
df['Reviews'] = a
df
Reviews Review
0 [[Just like home, A Warm Welcome to Wintry Ams... NaN
1 [[Great food and staff, just perfect], [01/06/... NaN
2 [[Satisfaction, Delicious old school restauran... NaN
def get_review(reviews):
    if isinstance(reviews, list):
        return reviews[0]
    else:
        return None
df['Review'] = df['Reviews'].apply(get_review)
df
Reviews Review
0 [[Just like home, A Warm Welcome to Wintry Ams... [Just like home, A Warm Welcome to Wintry Amst...
1 [[Great food and staff, just perfect], [01/06/... [Great food and staff, just perfect]
2 [[Satisfaction, Delicious old school restauran... [Satisfaction, Delicious old school restaurant]
If you don't want column Review to be a list, simply convert it to a string with some divider:
def get_review(reviews):
    if isinstance(reviews, list):
        return ', '.join(reviews[0])
    else:
        return ''
df['Review'] = df['Reviews'].apply(get_review)
df
Reviews Review
0 [[Just like home, A Warm Welcome to Wintry Ams... Just like home, A Warm Welcome to Wintry Amste...
1 [[Great food and staff, just perfect], [01/06/... Great food and staff, just perfect
2 [[Satisfaction, Delicious old school restauran... Satisfaction, Delicious old school restaurant
If your input data is not of type list (i.e. you are reading it from CSV), you need to convert it to a list first:
import ast
def get_review(reviews):
    if pd.notna(reviews) and reviews != '':
        r_list = ast.literal_eval(reviews)[0]
        if len(r_list) > 0:
            return ', '.join(r_list)
        else:
            return ''
    else:
        return ''
df2['Review'] = df2['Reviews'].apply(get_review)
df2
Reviews Review
0 [['Just like home', 'A Warm Welcome to Wintry ... Just like home, A Warm Welcome to Wintry Amste...
1 [['Great food and staff', 'just perfect'], ['0... Great food and staff, just perfect
2 [['Satisfaction', 'Delicious old school restau... Satisfaction, Delicious old school restaurant

Related

str.contains not working when there is not a space between the word and special character

I have a dataframe which includes the names of movies and TV series.
I want to classify each row as Movie or Series according to specific keywords. However, because the brackets leave no space around the keywords, they are not being picked up by the str.contains() function, and I need a workaround.
This is my dataframe:
import pandas as pd
import numpy as np
watched_df = pd.DataFrame([['Love Death Robots (Episode 1)'],
                           ['James Bond'],
                           ['How I met your Mother (Avnsitt 3)'],
                           ['random name'],
                           ['Random movie 3 Episode 8383893']],
                          columns=['Title'])
watched_df.head()
To add the column that classifies the titles as TV series or movies, I have the following code.
watched_df["temporary_brackets_removed_title"] = watched_df['Title'].str.replace('(', '', regex=False)
watched_df["Film_Type"] = np.where(watched_df.temporary_brackets_removed_title.astype(str).str.contains(pat='Episode | Avnsitt', case=False), 'Series', 'Movie')
watched_df = watched_df.drop(columns='temporary_brackets_removed_title')
watched_df.head()
Is there a simpler way to solve this without having to add and drop a column?
Maybe a str.contains-like function that checks whether a string merely contains the given word rather than matching it exactly, similar to the LIKE functionality in SQL?
You can use str.contains and then map the results:
watched_df['Film_Type'] = watched_df['Title'].str.contains(r'(?:Episode|Avnsitt)').map({True: 'Series', False: 'Movie'})
Output:
>>> watched_df
Title Film_Type
0 Love Death Robots (Episode 1) Series
1 James Bond Movie
2 How I met your Mother (Avnsitt 3) Series
3 random name Movie
4 Random movie 3 Episode 8383893 Series
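Alternatively, since the question already uses np.where, the same classification works in one step without the temporary column. A sketch; str.contains treats the pattern as a regex by default, so the literal brackets in the titles never need to be stripped:
import numpy as np

# (?i) makes the pattern case-insensitive, mirroring case=False in the question
watched_df['Film_Type'] = np.where(
    watched_df['Title'].str.contains(r'(?i)Episode|Avnsitt'),
    'Series', 'Movie')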

Python categorize data in excel based on key words from another excel sheet

I have two Excel sheets; one has four different categories with keywords listed. I am using Python to find the keywords in the review data and match them to a category. I have tried using pandas and data frames to compare, but I get errors like "DataFrame objects are mutable, thus they cannot be hashed". I'm not sure if there is a better way, but I am new to Pandas.
Here is an example:
Category sheet
Service    Experience
fast       bad
slow       easy
Data Sheet
Review #    Location    Review
1           New York    "The service was fast!"
2           Texas       "Overall it was a bad experience for me"
For the examples above I would expect the following as a result.
I would expect review 1 to match the category Service because of the word "fast" and I would expect review 2 to match category Experience because of the word "bad". I do not expect the review to match every word in the category sheet, and it is fine if one review belongs to more than one category.
Here is my code; note that I am using a simple example. Below, I am trying to find the review data that would match the Customer Service list of keywords.
import pandas as pd
# List of Categories
cat = pd.read_excel("Categories_List.xlsx")
# Data being used
data = pd.read_excel("Data.xlsx")
# Data Frame for review column
reviews = pd.DataFrame(data["reviews"])
# Data Frame for Categories
cs = pd.DataFrame(cat["Customer Service"])
be = pd.DataFrame(cat["Billing Experience"])
net = pd.DataFrame(cat["Network"])
out = pd.DataFrame(cat["Outcome"])
for i in reviews:
    if cs in reviews:
        print("True")
One approach would be to build a regular expression from the cat frame:
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])
(?P<Service>fast|slow)|(?P<Experience>bad|easy)
Alternatively replace cat with a list of columns to test:
cols = ['Service']
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cols])
(?P<Service>fast|slow|quick)
Then, to get matches, use str.extractall, aggregate into a summary, and join the result back to the reviews frame:
Aggregated into List:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: list(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! [fast] [easy]
1 2 Texas Overall it was a bad experience for me [] [bad]
Aggregated into String:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: ', '.join(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! fast easy
1 2 Texas Overall it was a bad experience for me bad
Alternatively, for an existence test, check notna and aggregate with any per level=0 group (DataFrame.any(level=...) was removed in recent pandas versions):
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).notna().groupby(level=0).any()
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True
Or iterate over the columns with str.contains:
cols = cat.columns
for col in cols:
    reviews[col] = reviews['Review'].str.contains('|'.join(cat[col].dropna()))
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True
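For reference, a hedged reconstruction of the two sheets as DataFrames so the snippets above can be run as-is (the 'and easy' wording follows this answer's sample output rather than the original Data Sheet):
import pandas as pd

# hypothetical stand-ins for the two Excel sheets
cat = pd.DataFrame({'Service': ['fast', 'slow'],
                    'Experience': ['bad', 'easy']})
reviews = pd.DataFrame({'Review #': [1, 2],
                        'Location': ['New York', 'Texas'],
                        'Review': ['The service was fast and easy!',
                                   'Overall it was a bad experience for me']})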

searching keyword in text file and printing the entire row

OK, so I used an API and arranged a text file that looks like this:
TITLE,YEAR,IMDB
'Money Plane','2000','tt7286966'
'Mulan','2020','tt4566758'
'Secret Society of Second Born Royals','2020','tt10324122'
'Train to Busan Presents: Peninsula', 'year': '2020', 'imdb_id': 'tt8850222'
'After We Collided','2020','tt10362466'
'One Night in Bangkok','2020','tt12192190'
'Phineas and Ferb The Movie: Candace Against the Universe','2020','tt1817232'
'Cats & Dogs 3: Paws Unite','2020','tt12745164'
'The Crimes That Bind','2020','tt10915060'
'Scoob!','2020','tt3152592'
'We Bare Bears: The Movie','2020','tt10474606'
'Trolls World Tour', 'year': '2020', 'imdb_id': 'tt6587640'
'Birds of Prey (and the Fantabulous Emancipation of One Harley Quinn)','2020','tt7713068'
'Bad Boys for Life','2020','tt1502397'
'Greyhound','2020','tt6048922'
'The Old Guard','2020','tt7556122'
'Sonic the Hedgehog','2020','tt3794354'
'Dad Wanted','2020','tt12721188'
'Barbie: Princess Adventure','2020','tt12767498'
Now I want to be able to search for a movie by either title, year, or ID.
Basically, when I search for a title, I want it to show me the full details of the movie.
Example: search for Mulan, and my output will be:
'Mulan','2020','tt4566758'
My code so far looks like this:
import pandas
data=pandas.read_csv(r"C:\Users\Home\Documents\studying\newproject\moviedata.txt")
title=list(data["TITLE"])
year=list(data["YEAR"])
imdbid=list(data["IMDB"])
I've tried to use:
for t, y, imdb in zip(title, year, imdbid):
    if word in title:
        items = data[word]
        print(items)
It runs, but it does nothing.
You can use numpy.where here to find the value, and pandas loc to print the row.
Code:
import pandas as pd
import numpy as np
df = pd.read_csv('moviedata.txt')
row , col = np.where(df == "'Mulan'")
print(df.loc[row])
Output:
TITLE YEAR IMDB
1 'Mulan' '2020' 'tt4566758'
If you want the output in the form of a list:
row , col = np.where(df == "'Mulan'")
a = df.loc[row]
a = a.values.tolist()
print(a)
Output:
[["'Mulan'", "'2020'", "'tt4566758'"]]

Update one column's value based on another column's value in Pandas using regular expression

Suppose I have a dataframe like below:
>>> df = pd.DataFrame({'Category':['Personal Care', 'Home Care', 'Pharma', 'Pet'], 'SubCategory':['Shampoo', 'Floor Wipe', 'Veterinary', 'Animal Feed']})
>>> df
Category SubCategory
0 Personal Care Shampoo
1 Home Care Floor Wipe
2 Pharma Veterinary
3 Pet Animal Feed
I'd like to update the value in the 'Category' column whenever the 'SubCategory' column's value contains either 'Veterinary' or 'Animal' (case-insensitive). To do that, I devised a method like the one below:
def update_col1_values_based_on_values_in_col2_using_regex_mappings(
        df,
        col1_name: str,
        col2_name: str,
        dictionary_of_regex_mappings: dict):
    for pattern, new_str_value in dictionary_of_regex_mappings.items():
        mask = df[col2_name].str.contains(pattern)
        df.loc[mask, col1_name] = new_str_value
    return df
This method works as expected as shown below:
>>> df1 = update_col1_values_based_on_values_in_col2_using_regex_mappings(df, 'Category', 'SubCategory', {"(?i).*Veterinary.*": "Pet Related", "(?i).*Animal.*": "Pet Related"})
>>> df1
Category SubCategory
0 Personal Care Shampoo
1 Home Care Floor Wipe
2 Pet Related Veterinary
3 Pet Related Animal Feed
In practice, there will be more to map from than 'Veterinary' and 'Animal Feed', so some of the suggestions below, although they read elegantly, are not going to be practical for the actual use case. In other words, please assume that the mapping is going to be more like this:
{
    "(?i).*Veterinary.*": "Pet Related",
    "(?i).*Animal.*": "Pet Related",
    "(?i).*Pharma.*": "Pharmaceutical",
    "(?i).*Diary.*": "Other",
    ...  # lots and lots more mapping here
}
I'm wondering if there's a more elegant (Pandas-ish) way to accomplish this. Thank you in advance for your suggestions!
EDIT: I didn't clarify in the beginning that the mapping between 'Category' and 'Subcategory' columns wouldn't be restricted to just 'Veterinary' and 'Animal'.
You can use the following code, which is intuitive.
df['Category'] = df['SubCategory'].map(lambda x: "Pet Related" if "Animal" in x or "Veterinary" in x else x)
You could do it with pd.Series.where, using re to add the case-insensitive flag:
import re
df.Category.where(~df.SubCategory.str.contains('Veterinary|Animal', flags=re.IGNORECASE), 'Pet Related', inplace=True)
Output:
Category SubCategory
0 Personal Care Shampoo
1 Home Care Floor Wipe
2 Pet Related Veterinary
3 Pet Related Animal Feed
Not sure if this is the best way, but you can do this:
df.loc[df.SubCategory.str.contains('Veterinary|Animal'), 'Category'] = 'Pet Related'
If you need case-insensitive matching, str.contains() also supports regex:
pattern = r'(?i)veterinary|animal'
df.loc[df.SubCategory.str.contains(pattern, regex=True), 'Category'] = 'Pet Related'
And this is the result
In [3]: df
Out[3]:
Category SubCategory
0 Personal Care Shampoo
1 Home Care Floor Wipe
2 Pet Related Veterinary
3 Pet Related Animal Feed
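For the much larger mapping dict the question describes, a hedged sketch using Series.replace may scale better, since replace accepts a whole dict of regex patterns when regex=True (this assumes, as the .* anchors guarantee, that a matching pattern replaces the entire SubCategory value):
mapping = {"(?i).*Veterinary.*": "Pet Related",
           "(?i).*Animal.*": "Pet Related"}

# values matching a pattern become the new label; everything else passes through unchanged
new_cat = df['SubCategory'].replace(mapping, regex=True)

# keep the old Category wherever no pattern matched (value came through unchanged)
df['Category'] = new_cat.where(new_cat.ne(df['SubCategory']), df['Category'])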

randomly shuffle multiple dataframes

I have a corpus of conversations (400) between two people as strings (or, more precisely, as plain text files). A small example of this might be:
my_textfiles = ['john: hello \nmary: hi there \njohn: nice weather \nmary: yes',
                'nancy: hello \nbill: hi there \nnancy: nice weather \nbill: yes',
                'ringo: hello \npaul: hi there \nringo: nice weather \npaul: yes',
                'michael: hello \nbubbles: hi there \nmichael: nice weather \nbubbles: yes',
                'steve: hello \nsally: hi there \nsteve: nice weather \nsally: yes']
In addition to speaker names, I have also noted each speaker's role in the conversation (as a leader or follower, depending on whether they are the first or second speaker). I then have a simple script that converts each conversation into a data frame by separating the speaker ID from the content:
import pandas as pd
import re
import numpy as np
import random
def convo_tokenize(tf):
    turnTokenize = re.split(r'\n(?=.*:)', tf, flags=re.MULTILINE)
    turnTokenize = [turn.split(':', 1) for turn in turnTokenize]
    dataframe = pd.DataFrame(turnTokenize, columns=['speaker', 'turn'])
    return dataframe
df_list = [convo_tokenize(tf) for tf in my_textfiles]
The corresponding dataframe then forms the basis of a much longer piece of analysis. However, I would now like to be able to shuffle speakers so that I create entirely random (and likely nonsensical) conversations. For instance, John, who is having a conversation with Mary in the first string, might be randomly assigned Paul (the second speaker in the third string). Crucially, I would need to maintain the order of speech within each speaker. It is also important that, when randomly assigning new speakers, I preserve the mix of leader/follower, such that I am not creating conversations between two leaders or two followers.
To begin, my thinking was to create a standardized speaker label (where 1 = leader, 2 = follower) and separate each DF into sub-DFs stored in role-specific lists:
def speaker_role(dataframe):
    leader = dataframe['speaker'].iat[0]
    dataframe['sp_role'] = np.where(dataframe['speaker'].eq(leader), 1, 2)
    return dataframe
df_list = [speaker_role(df) for df in df_list]
leader_df = []
follower_df = []
for df in df_list:
    is_leader = df['sp_role'] == 1
    is_follower = df['sp_role'] != 1
    leader_df.append(df[is_leader])
    follower_df.append(df[is_follower])
I have worked out that I can now simply shuffle the list of sub-DFs, in this case follower_df:
follower_rand = random.sample(follower_df, len(follower_df))
Having got to this stage, I'm not sure where to turn next. I suspect I will need some sort of zip function, but I am unsure exactly what. I'm also unsure how to go about merging the turns together so that they form the same dataframe structure I initially had. Assuming Ringo (leader) is randomly assigned to Bubbles (follower) for one of the DFs, I would hope to have something like the table below (a possible approach is sketched after it):
speaker  | turn         | sp_role
---------------------------------
ringo      hello          1
bubbles    hi there       2
ringo      nice weather   1
bubbles    yes it is      2
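One hedged sketch of that zip-and-interleave step, assuming strictly alternating turns with the leader speaking first (the interleave helper is illustrative, not from the original post):
# pair each leader sub-df with a randomly chosen follower sub-df
follower_rand = random.sample(follower_df, len(follower_df))

def interleave(leader, follower):
    # give leader turns even positions (0, 2, ...) and follower turns
    # odd positions (1, 3, ...), then sort to weave them back together
    leader = leader.reset_index(drop=True)
    follower = follower.reset_index(drop=True)
    leader.index = leader.index * 2
    follower.index = follower.index * 2 + 1
    return pd.concat([leader, follower]).sort_index().reset_index(drop=True)

shuffled_convos = [interleave(l, f) for l, f in zip(leader_df, follower_rand)]
print(shuffled_convos[0])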
