Searching for a keyword in a text file and printing the entire row - Python

OK, so I used an API and arranged a text file that looks like this:
TITLE,YEAR,IMDB
'Money Plane','2000','tt7286966'
'Mulan','2020','tt4566758'
'Secret Society of Second Born Royals','2020','tt10324122'
'Train to Busan Presents: Peninsula', 'year': '2020', 'imdb_id': 'tt8850222'
'After We Collided','2020','tt10362466'
'One Night in Bangkok','2020','tt12192190'
'Phineas and Ferb The Movie: Candace Against the Universe','2020','tt1817232'
'Cats & Dogs 3: Paws Unite','2020','tt12745164'
'The Crimes That Bind','2020','tt10915060'
'Scoob!','2020','tt3152592'
'We Bare Bears: The Movie','2020','tt10474606'
'Trolls World Tour', 'year': '2020', 'imdb_id': 'tt6587640'
'Birds of Prey (and the Fantabulous Emancipation of One Harley Quinn)','2020','tt7713068'
'Bad Boys for Life','2020','tt1502397'
'Greyhound','2020','tt6048922'
'The Old Guard','2020','tt7556122'
'Sonic the Hedgehog','2020','tt3794354'
'Dad Wanted','2020','tt12721188'
'Barbie: Princess Adventure','2020','tt12767498'
Now I want to be able to search for a movie by either title, year, or ID.
Basically, I want to make it so that when I search for a title, it will show me all the details about the movie.
Example: search for Mulan, and my output will be:
'Mulan','2020','tt4566758'
My code so far looks like this:
import pandas
data=pandas.read_csv(r"C:\Users\Home\Documents\studying\newproject\moviedata.txt")
title=list(data["TITLE"])
year=list(data["YEAR"])
imdbid=list(data["IMDB"])
I've tried to use
for t, y, imdb in zip(title, year, imdbid):
    if word in title:
        items = data[word]
        print(items)
It runs, but it does nothing.

You can use numpy.where here to find the value and pandas .loc to print the row.
Code:
import pandas as pd
import numpy as np
df = pd.read_csv('moviedata.txt')
row , col = np.where(df == "'Mulan'")
print(df.loc[row])
Output:
TITLE YEAR IMDB
1 'Mulan' '2020' 'tt4566758'
If you want the output in the form of a list:
row , col = np.where(df == "'Mulan'")
a = df.loc[row]
a = a.values.tolist()
print(a)
Output:
[["'Mulan'", "'2020'", "'tt4566758'"]]

Related

str.contains not working when there is no space between the word and a special character

I have a dataframe which includes the names of movie titles and TV series.
Based on specific keywords, I want to classify each row as Movie or Series. However, because the brackets leave no space next to the keywords, the keywords are not being picked up by the str.contains() function, and I need a workaround.
This is my dataframe:
import pandas as pd
import numpy as np
watched_df = pd.DataFrame([['Love Death Robots (Episode 1)'],
                           ['James Bond'],
                           ['How I met your Mother (Avnsitt 3)'],
                           ['random name'],
                           ['Random movie 3 Episode 8383893']],
                          columns=['Title'])
watched_df.head()
To add the column that classifies the titles as TV series or movies, I have the following code.
watched_df["temporary_brackets_removed_title"] = watched_df['Title'].str.replace('(', '')
watched_df["Film_Type"] = np.where(watched_df.temporary_brackets_removed_title.astype(str).str.contains(pat = 'Episode | Avnsitt', case = False), 'Series', 'Movie')
watched_df = watched_df.drop('temporary_brackets_removed_title', 1)
watched_df.head()
Is there a simpler way to solve this without having to add and drop a column?
Maybe a str.contains-like function that does not require the string to match exactly, but just to contain the given word? Similar to the "LIKE" functionality in SQL?
You can use str.contains and then map the results:
watched_df['Film_Type'] = watched_df['Title'].str.contains(r'(?:Episode|Avnsitt)').map({True: 'Series', False: 'Movie'})
Output:
>>> watched_df
Title Film_Type
0 Love Death Robots (Episode 1) Series
1 James Bond Movie
2 How I met your Mother (Avnsitt 3) Series
3 random name Movie
4 Random movie 3 Episode 8383893 Series
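If you prefer to keep the np.where pattern from the question, the same idea works there as well; a minimal sketch, assuming the watched_df defined above:
import numpy as np

# No need to strip the brackets first: the regex matches the keywords anywhere
# in the title, and case=False makes the search case-insensitive.
watched_df['Film_Type'] = np.where(
    watched_df['Title'].str.contains(r'Episode|Avnsitt', case=False),
    'Series',
    'Movie',
)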

Extract the first element from a multi-dimensional array in a pandas dataframe

I have a data frame with a Reviews column which is a multi-dimensional array, and I would like to extract the first element as shown below.
Suppose df['Reviews'] consists of the following rows.
I want the output in a separate column, as shown below.
Please find three sample values for the column below:
df['Reviews'] =
[['Just like home', 'A Warm Welcome to Wintry Amsterdam'], ['01/03/2018', '01/01/2018']]
[['Great food and staff', 'just perfect'], ['01/06/2018', '01/04/2018']]
[['Satisfaction', 'Delicious old school restaurant'], ['01/04/2018', '01/04/2018']]
Please help
If you need the first lists, use indexing with str[0]:
import ast
df['Reviews'] = df['Reviews'].apply(ast.literal_eval).str[0]
If you need to join the lists with , into strings, add Series.str.join:
import ast
df['Reviews'] = df['Reviews'].apply(ast.literal_eval).str[0].str.join(',')
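As a quick check, here is a minimal round trip with two of the sample rows from the question, assuming the Reviews values are stored as strings (e.g. read from CSV):
import ast
import pandas as pd

df = pd.DataFrame({'Reviews': [
    "[['Just like home', 'A Warm Welcome to Wintry Amsterdam'], ['01/03/2018', '01/01/2018']]",
    "[['Great food and staff', 'just perfect'], ['01/06/2018', '01/04/2018']]",
]})

# Parse each string into a Python list, keep the first sub-list, join it into one string.
df['Reviews'] = df['Reviews'].apply(ast.literal_eval).str[0].str.join(',')
print(df['Reviews'].tolist())
# ['Just like home,A Warm Welcome to Wintry Amsterdam', 'Great food and staff,just perfect']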
You can add one of the following, depending on how you access your dataframe.
Either one creates a new column named output containing the first element of Reviews.
Apply Function
df['output'] = df.Reviews.apply(lambda x: x[0])
Map Function
df.loc[:, 'output'] = df.Reviews.map(lambda x: x[0])
I guess this should help; it worked for me.
df['Reviews'] = df['Reviews'].apply(lambda c: str(c[0]).strip('[]'))
This works fine if run once. If run again on the same data, it will split the text further, so I suggest commenting it out after using it, or writing the result to a new column.
P.S.: You should include the code instead of screenshots so that it can be tested first.
EDIT
Looks fine to me. Try once again, and remember that if you run it twice (in case you are not making a separate column), it will return None.
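Following the suggestion above to write into a separate column, a minimal sketch (the column name first_review is just an example):
# Writing to a new column leaves Reviews untouched, so re-running the cell is safe.
df['first_review'] = df['Reviews'].apply(lambda c: str(c[0]).strip('[]'))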
If you receive an error, you probably have some empty data in Reviews. You can drop those rows if such data is useless for you:
df.dropna(subset=['Reviews'], inplace=True)
or add a check on the type of the data:
a = [[['Just like home', 'A Warm Welcome to Wintry Amsterdam'], ['01/03/2018', '01/01/2018']], [['Great food and staff', 'just perfect'], ['01/06/2018', '01/04/2018']], [['Satisfaction', 'Delicious old school restaurant'], ['01/04/2018', '01/04/2018']]]
df = pd.DataFrame(columns=['Reviews', 'Review'])
df['Reviews'] = a
df
Reviews Review
0 [[Just like home, A Warm Welcome to Wintry Ams... NaN
1 [[Great food and staff, just perfect], [01/06/... NaN
2 [[Satisfaction, Delicious old school restauran... NaN
def get_review(reviews):
    if type(reviews) == list:
        return reviews[0]
    else:
        return None

df['Review'] = df['Reviews'].apply(get_review)
df
Reviews Review
0 [[Just like home, A Warm Welcome to Wintry Ams... [Just like home, A Warm Welcome to Wintry Amst...
1 [[Great food and staff, just perfect], [01/06/... [Great food and staff, just perfect]
2 [[Satisfaction, Delicious old school restauran... [Satisfaction, Delicious old school restaurant]
If you don't want the Review column to be a list, simply convert it to a string with some divider:
def get_review(reviews):
    if type(reviews) == list:
        return ', '.join(reviews[0])
    else:
        return ''

df['Review'] = df['Reviews'].apply(get_review)
df
Reviews Review
0 [[Just like home, A Warm Welcome to Wintry Ams... Just like home, A Warm Welcome to Wintry Amste...
1 [[Great food and staff, just perfect], [01/06/... Great food and staff, just perfect
2 [[Satisfaction, Delicious old school restauran... Satisfaction, Delicious old school restaurant
If your input data is not of type list (i.e. you are reading it from CSV), you need to convert it to a list first:
import ast

def get_review(reviews):
    if pd.notna(reviews) and reviews != '':
        r_list = ast.literal_eval(reviews)[0]
        if len(r_list) > 0:
            return ', '.join(r_list)
        else:
            return ''
    else:
        return ''

df2['Review'] = df2['Reviews'].apply(get_review)
df2
Reviews Review
0 [['Just like home', 'A Warm Welcome to Wintry ... Just like home, A Warm Welcome to Wintry Amste...
1 [['Great food and staff', 'just perfect'], ['0... Great food and staff, just perfect
2 [['Satisfaction', 'Delicious old school restau... Satisfaction, Delicious old school restaurant

Create a new Excel column with the number of times a value occurs in each row of a column with pandas

I have an Excel file with column A (names) and column B (description), in which I have a long description of the person's profile.
It looks like:
Name Description
James R A good systems developer...
I'm trying to count how many times, for example, the word 'good' appears in each row of the 'description' column, and create a new column with the number of repetitions. I have a lot of values, so I prefer to use pandas rather than Excel formulas.
The output should look like this:
Name Description Good
James R A good systems developer... 1
The python code that I develop is this:
In [1]: import collections
In [2]: import pandas as pd
In [3]: df=pd.read_excel('israel2013.xls')
In [4]: str1=df.description
In [5]: str2= 'good'
In [6]: for index, row in df.iterrows():
...: if str2 in str1:
...: counter=collections.Counter (r[0] for str2 in str1)
...: else:
...: print (0)
But I get all zeros from this, and I don't know what's wrong.
Thank you
Demo dataframe:
>>> data = [['James R', 'A good systems developer'], ['Bob C', 'a guy called Bob'], ['Alice R', 'Good teacher and a good runner']]
>>> df = pd.DataFrame(data, columns=['Name', 'Description'])
>>>
>>> df
Name Description
0 James R A good systems developer
1 Bob C a guy called Bob
2 Alice R Good teacher and a good runner
Solution:
>>> df['Good'] = df.Description.str.count(r'(?i)\bgood\b')
>>> df
Name Description Good
0 James R A good systems developer 1
1 Bob C a guy called Bob 0
2 Alice R Good teacher and a good runner 2
\b marks word boundaries, (?i) performs a case-insensitive search. Alternatively to using (?i), you could import re and supply flags=re.IGNORECASE as the second argument to count.
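For reference, the flags variant mentioned above could look like this (a sketch, assuming the same demo dataframe):
import re

df['Good'] = df.Description.str.count(r'\bgood\b', flags=re.IGNORECASE)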
Try:
df['Good'] = df['description'].str.findall('good').str.len()

Search through a dataframe for a partial string match and put the rows into a new dataframe with only their IDs

I have a dataframe of publications that has the following rows:
publication_ID , title, author_name, date
12344, Design style, Jake Kreath, 20071208
12334, Power of Why, Samantha Finn, 20150704
I ask the user for a string and use that string to search through the titles.
The goal: Search through the dataframe to see if the title contains the word the user provides and return the rows in a new dataframe with just the title and publication_ID.
This is my code so far:
import re
import pandas as pd
from pandas import DataFrame

publications = pd.read_csv(filepath, sep="|")
search_term = input('Enter the term you are looking for: ')

def stringDataFrame(publications, title, regex):
    newdf = pd.DataFrame()
    for idx, search_term in publications['title'].iteritems():
        if re.search(regex, search_term):
            newdf = pd.concat([publications[publications['title'] == search_term], newdf], ignore_index=True)
    return newdf

print(stringDataFrame(publications, 'title', search_term))
Use a combination of .str.contains and .loc
publications.loc[publications.title.str.contains(search_term), ['title', 'publication_ID']]
Just be careful, because if your title is 'nightlife' and someone searches for 'night' this will return a match. If that's not your desired behavior then you may need .str.split instead.
As jpp points out, str.contains is case sensitive. One simple fix is to just ensure everything is lowercase.
title_mask = publications.title.str.lower().str.contains(search_term.lower())
pmids = publications.loc[title_mask, ['title', 'publication_ID']]
Now Lord, LoRD, lord, and all other permutations will return a valid match, and your original DataFrame keeps its capitalization unchanged.
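str.contains also accepts a case argument, so the lower-casing can be pushed into the call itself; a minimal sketch, assuming the same publications frame:
# case=False makes the match case-insensitive; regex=False treats the
# search term as a literal string rather than a regular expression.
title_mask = publications.title.str.contains(search_term, case=False, regex=False)
pmids = publications.loc[title_mask, ['title', 'publication_ID']]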
Full example, but you should accept the answer above by ALollz:
import pandas as pd
# you publications dataframe
publications = pd.DataFrame({'title':['The Odyssey','The Canterbury Tales','Inferno','The Lord of The Rings', 'Lord of The Flies'],'publication_ID':[1,2,3,4,5]})
search_term = input('Enter the term you are looking for: ')
publications[['title','publication_ID']][publications['title'].str.contains(search_term)]
Enter the term you are looking for: Lord
title publication_ID
3 The Lord of The Rings 4
4 Lord of The Flies 5
Per your error, you can filter out all np.nan values as part of the logic using the new code below:
import pandas as pd
import numpy as np
publications = pd.DataFrame({'title':['The Odyssey','The Canterbury Tales','Inferno','The Lord of The Rings', 'Lord of The Flies',np.nan],'publication_ID':[1,2,3,4,5,6]})
search_term = input('Enter the term you are looking for: ')
publications[['title','publication_ID']][publications['title'].str.contains(search_term) & ~publications['title'].isna()]
Enter the term you are looking for: Lord
title publication_ID
3 The Lord of The Rings 4
4 Lord of The Flies 5
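str.contains also has an na parameter, which handles missing titles without a separate isna check; a minimal sketch under the same setup:
# na=False treats missing titles as "no match" instead of propagating NaN.
mask = publications['title'].str.contains(search_term, na=False)
print(publications.loc[mask, ['title', 'publication_ID']])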

Python/pandas - Add word in column based on words in another column

I'm working with an xlsx file with pandas and I would like to add the word "bodypart" in a column if the preceding column contains a word in a predefined list of bodyparts.
Original Dataframe:
Sentence Type
my hand NaN
the fish NaN
Result Dataframe:
Sentence Type
my hand bodypart
the fish NaN
Nothing I've tried works. I feel I'm missing something very obvious. Here's my last (failed) attempt:
import pandas as pd
import numpy as np
bodyparts = ['lip ', 'lips ', 'foot ', 'feet ', 'heel ', 'heels ', 'hand ', 'hands ']
df = pd.read_excel(file)
for word in bodyparts:
    if word in df["Sentence"]:
        df["Type"] = df["Type"].replace(np.nan, "bodypart", regex=True)
I also tried this, with "NaN" and NaN as variants for the first argument of str.replace:
if word in df['Sentence'] : df["Type"] = df["Type"].str.replace("", "bodypart")
Any help would be greatly appreciated!
You can create a regex to search on word boundaries and then use that as an argument to str.contains, e.g.:
import pandas as pd
import numpy as np
import re
bodyparts = ['lips?', 'foot', 'feet', 'heels?', 'hands?', 'legs?']
rx = re.compile('|'.join(r'\b{}\b'.format(el) for el in bodyparts))
df = pd.DataFrame({
    'Sentence': ['my hand', 'the fish', 'the rabbit leg', 'hand over', 'something', 'cabbage', 'slippage'],
    'Type': [np.nan] * 7
})
df.loc[df.Sentence.str.contains(rx), 'Type'] = 'bodypart'
Gives you:
Sentence Type
0 my hand bodypart
1 the fish NaN
2 the rabbit leg bodypart
3 hand over bodypart
4 something NaN
5 cabbage NaN
6 slippage NaN
A dirty solution would involve checking the intersection of two sets:
set A is your list of body parts (with the trailing spaces stripped), set B is the set of words in the sentence.
df['Sentence'].apply(lambda x: 'bodypart' if set(x.split()).intersection(bodyparts) else None)
The simplest way:
df.loc[df.Sentence.isin(bodyparts), 'Type'] = 'Bodypart'
First you must discard the spaces in bodyparts:
bodyparts = {'lip', 'lips', 'foot', 'feet', 'heel', 'heels', 'hand', 'hands'}
df.Sentence.isin(bodyparts) selects the matching rows, and Type is the column to set; .loc is the indexer which permits the modification.
