Pandas: How to do analysis on array-like field? - python

I'm doing analysis on movies, and each movie has a genre attribute that may contain several specific genres, like drama or comedy. The data looks like this:
movie_list = [
    {'name': 'Movie 1',
     'genre': 'Action, Fantasy, Horror'},
    {'name': 'Movie 2',
     'genre': 'Action, Comedy, Family'},
    {'name': 'Movie 3',
     'genre': 'Biography, Drama'},
    {'name': 'Movie 4',
     'genre': 'Biography, Drama, Romance'},
    {'name': 'Movie 5',
     'genre': 'Drama'},
    {'name': 'Movie 6',
     'genre': 'Documentary'},
]
The problem is: how do I do analysis on this? For example, how do I know how many action movies there are, and how do I query for the category Action? Specifically:
How do I get all the categories in this list, so I know how many movies each contains?
How do I query for a certain kind of movie, like action?
Do I need to turn the genre into an array?
Currently I can get away with the 2nd question using df[df['genre'].str.contains("Action")].describe(), but is there better syntax?

If your data isn't too huge, I would do some pre-processing and get 1 record per genre. That is, I would structure your data frame like this:
Name Genre
Movie 1 Action
Movie 1 Fantasy
Movie 1 Horror
...
Note the names should be repeated. While this may make your data set much bigger, if your system can handle it, it makes data analysis very easy.
Use the following code to do the transformation:
import pandas as pd

def reformat_movie_list(movies):
    # Build parallel name/genre lists with one row per (movie, genre) pair
    name = []
    genre = []
    result = pd.DataFrame()
    for movie in movies:
        movie_name = movie["name"]
        movie_genres = movie["genre"].split(",")
        for movie_genre in movie_genres:
            name.append(movie_name.strip())
            genre.append(movie_genre.strip())
    result["name"] = name
    result["genre"] = genre
    return result
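Applied to the movie_list above, that gives the movie_df used below:
movie_df = reformat_movie_list(movie_list)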
In this format, your 3 questions become
How do I get all the categories in this list, so I know how many movies each contains?
movie_df.groupby("genre").agg("count")
see How to count number of rows in a group in pandas group by object?
How do I query for a certain kind of movies, like action?
horror_movies = movie_df[movie_df["genre"] == "Horror"]
see pandas: filter rows of DataFrame with operator chaining
Do I need to turn the genre into array?
Your de-normalization of the data should take care of it.
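If you'd rather not hand-roll the loop, recent pandas versions can do the same de-normalization with str.split plus explode; a minimal sketch, assuming the DataFrame is built straight from movie_list:
df = pd.DataFrame(movie_list)
exploded = df.assign(genre=df['genre'].str.split(',')).explode('genre')
exploded['genre'] = exploded['genre'].str.strip()
exploded.groupby('genre')['name'].count()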

Related

str.contains not working when there is not a space between the word and special character

I have a dataframe which includes the names of movie titles and TV Series.
From specific keywords I want to classify each row as Movie or Series. However, because the brackets leave no space before the keywords, they are not picked up by the str.contains() function, and I need a workaround.
This is my dataframe:
import pandas as pd
import numpy as np

watched_df = pd.DataFrame([['Love Death Robots (Episode 1)'],
                           ['James Bond'],
                           ['How I met your Mother (Avnsitt 3)'],
                           ['random name'],
                           ['Random movie 3 Episode 8383893']],
                          columns=['Title'])
watched_df.head()
To add the column that classifies the titles as TV series or Movies I have the following code.
watched_df["temporary_brackets_removed_title"] = watched_df['Title'].str.replace('(', '')
watched_df["Film_Type"] = np.where(watched_df.temporary_brackets_removed_title.astype(str).str.contains(pat = 'Episode | Avnsitt', case = False), 'Series', 'Movie')
watched_df = watched_df.drop('temporary_brackets_removed_title', 1)
watched_df.head()
Is there a simpler way to solve this without having to add and drop a column?
Maybe there is a str.contains-like function that does not require an exact match but just checks whether the string contains the given word, similar to the LIKE functionality in SQL?
You can use str.contains and then map the results:
watched_df['Film_Type'] = watched_df['Title'].str.contains(r'(?:Episode|Avnsitt)').map({True: 'Series', False: 'Movie'})
Output:
>>> watched_df
Title Film_Type
0 Love Death Robots (Episode 1) Series
1 James Bond Movie
2 How I met your Mother (Avnsitt 3) Series
3 random name Movie
4 Random movie 3 Episode 8383893 Series
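If you prefer to stay with np.where as in your original attempt, the same idea works without the temporary column; a sketch using your existing column names:
watched_df['Film_Type'] = np.where(
    watched_df['Title'].str.contains('Episode|Avnsitt', case=False),
    'Series', 'Movie')
Since str.contains matches the pattern anywhere in the string, the brackets no longer matter.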

Populate 2 columns of dataframe at the same time using apply function [duplicate]

This question already has answers here:
Apply Python function to one pandas column and apply the output to multiple columns
(4 answers)
Closed 1 year ago.
I have some code which is (simplified) like this. The actual data lists are tens of thousands in size, not just 3.
There is a dictionary of staff which I make a DataFrame from.
There is a list of dictionary objects which contain additional staff information.
Also:
The staff list and the extra staff information (master_info_list) overlap but each has items that are unique to them.
The "index" I am using (StaffNumber) is actually prefixed with "SN_" in the extra staff information, so I can't compare them directly.
The duplication of StaffNumber in the master_info_list is intended (that's just how I receive it!).
What I want to do is populate two new columns into the dataframe which get their data from the extra staff information. I can do this by making 2 separate calls to get_department_and_manager, one for Department and one for Manager. That works. But, it "feels" like I should be able to take 2 fields from the output of get_department_and_manager and populate the dataframe in one go, but I'm struggling to get the syntax right. What is the correct syntax (if possible)? Also, iterating through the list the way I do (with a for loop) seems inefficient. Is there a better way?
The examples I have seen all seem to create new columns from existing data in the dataframe, or they are simple examples where no mashing of data is required before comparing the two "lists" (or list and dictionary).
import pandas as pd

def get_department_and_manager(row, master_list):
    dept = 'bbb'
    manager = 'aaa'
    for i in master_list:
        if i['StaffNumber'] == 'SN_' + row['StaffNumber']:
            dept = i['data']['Department']
            manager = i['data']['Manager']
            break
    return [dept, manager]

staff = {'Name': ['Alice', 'Bob', 'Dave'],
         'StaffNumber': ['001', '002', '004']}
master_info_list = [{'StaffNumber': 'SN_001', 'data': {'StaffNumber': 'SN_001', 'Department': 'Sales', 'Manager': 'Luke'}},
                    {'StaffNumber': 'SN_002', 'data': {'StaffNumber': 'SN_002', 'Department': 'Marketing', 'Manager': 'Mary'}},
                    {'StaffNumber': 'SN_003', 'data': {'StaffNumber': 'SN_003', 'Department': 'IT', 'Manager': 'Neal'}}]
df = pd.DataFrame(data=staff)
df[['Department']['Manager']] = df.apply(get_department_and_manager, axis='columns', args=[master_info_list])
print(df)
If I understand you correctly, you can use .merge:
x = pd.DataFrame([v["data"] for v in master_info_list])
x["StaffNumber"] = x["StaffNumber"].str.split("_").str[-1]
print(df.merge(x, on="StaffNumber", how="left"))
Prints:
Name StaffNumber Department Manager
0 Alice 001 Sales Luke
1 Bob 002 Marketing Mary
2 Dave 004 NaN NaN
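If you do want the apply route you were attempting, the syntax you were after is a two-column selection on the left-hand side plus result_type='expand' on the right; a sketch reusing your get_department_and_manager:
df[['Department', 'Manager']] = df.apply(get_department_and_manager, axis='columns',
                                         args=[master_info_list], result_type='expand')
That said, for tens of thousands of rows the merge above will be much faster, because apply runs the Python loop over master_info_list once per row.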

searching keyword in text file and printing the entire row

OK, so I used an API and arranged a text file that looks like this:
TITLE,YEAR,IMDB
'Money Plane','2000','tt7286966'
'Mulan','2020','tt4566758'
'Secret Society of Second Born Royals','2020','tt10324122'
'Train to Busan Presents: Peninsula', 'year': '2020', 'imdb_id': 'tt8850222'
'After We Collided','2020','tt10362466'
'One Night in Bangkok','2020','tt12192190'
'Phineas and Ferb The Movie: Candace Against the Universe','2020','tt1817232'
'Cats & Dogs 3: Paws Unite','2020','tt12745164'
'The Crimes That Bind','2020','tt10915060'
'Scoob!','2020','tt3152592'
'We Bare Bears: The Movie','2020','tt10474606'
'Trolls World Tour', 'year': '2020', 'imdb_id': 'tt6587640'
'Birds of Prey (and the Fantabulous Emancipation of One Harley Quinn)','2020','tt7713068'
'Bad Boys for Life','2020','tt1502397'
'Greyhound','2020','tt6048922'
'The Old Guard','2020','tt7556122'
'Sonic the Hedgehog','2020','tt3794354'
'Dad Wanted','2020','tt12721188'
'Barbie: Princess Adventure','2020','tt12767498'
Now I want to be able to search for a movie by either title, year, or ID.
Basically, when I search for a title, it should show me the entire details of the movie.
Example: search for Mulan, and my output will be:
'Mulan','2020','tt4566758'
My code so far looks like this:
import pandas
data=pandas.read_csv(r"C:\Users\Home\Documents\studying\newproject\moviedata.txt")
title=list(data["TITLE"])
year=list(data["YEAR"])
imdbid=list(data["IMDB"])
I've tried to use
for t, y, imdb in zip(title, year, imdbid):
    if word in title:
        items = data[word]
        print(items)
It runs, but it does nothing.
You can use numpy.where here to find the value and pandas .loc to print the row.
Code:
import pandas as pd
import numpy as np
df = pd.read_csv('moviedata.txt')
row , col = np.where(df == "'Mulan'")
print(df.loc[row])
Output:
TITLE YEAR IMDB
1 'Mulan' '2020' 'tt4566758'
If you want output in form of list
row , col = np.where(df == "'Mulan'")
a = df.loc[row]
a = a.values.tolist()
print(a)
Output:
[["'Mulan'", "'2020'", "'tt4566758'"]]

Pandas will not input data into dataframe in the correct order

I was initially trying to run the following code:
import pandas as pd
import numpy as np
f = ['Austria', '11m/18d/19yyy', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 1, 'JVI Residences', 2, 1, 'Brief additional comments.']
test = pd.DataFrame(f, columns={'Country', 'Date of last observation', 'ResRep/RTAC/RTC (Yes/No?)',
                                'Available Facilities (Yes/No?)', 'Funds Transferrable (Yes/No?)',
                                'Local Flight Help (Yes/No?)', 'Visa Help (Yes/No?)',
                                'Local Support Rating 1 Great - 3 Bad', 'Hotel Name',
                                'Hotel Rating 1 Great - 3 Bad',
                                'Travel Route Rating 1 Great - 3 Bad', 'Additional Overall Comments'})
That kept resulting in the following error:
ValueError: Shape of passed values is (12, 1), indices imply (12, 12)
So I attempted to correct that by putting brackets around f when creating the dataframe. However, this resulted in pandas returning a dataframe with the column names out of order and the data in categories it should not be in:
Available Facilities (Yes/No?) ... Country
0 Austria ... Brief additional comments.
[1 rows x 12 columns]
Index(['Available Facilities (Yes/No?)', 'Hotel Name',
'Local Support Rating 1 Great - 3 Bad', 'Additional Overall Comments',
'Date of last observation', 'Local Flight Help (Yes/No?)',
'ResRep/RTAC/RTC (Yes/No?)', 'Travel Route Rating 1 Great - 3 Bad',
'Visa Help (Yes/No?)', 'Hotel Rating 1 Great - 3 Bad',
'Funds Transferrable (Yes/No?)', 'Country'],
dtype='object')
Can someone advise how to get my data into the dataframe I am trying to make and keep it in the order I want?
You have given a set as the value of the columns parameter. Pandas will take the set and try to convert it to a list. But, as a set is an unordered data structure, the order of your column names cannot be preserved when it is converted from a set to a list. Simply use a list instead of a set.
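A sketch of the corrected call: wrap f in a list so it becomes one row, and pass the column names as a list so their order is preserved:
columns = ['Country', 'Date of last observation', 'ResRep/RTAC/RTC (Yes/No?)',
           'Available Facilities (Yes/No?)', 'Funds Transferrable (Yes/No?)',
           'Local Flight Help (Yes/No?)', 'Visa Help (Yes/No?)',
           'Local Support Rating 1 Great - 3 Bad', 'Hotel Name',
           'Hotel Rating 1 Great - 3 Bad',
           'Travel Route Rating 1 Great - 3 Bad', 'Additional Overall Comments']
test = pd.DataFrame([f], columns=columns)  # [f] makes a single row; the list keeps column order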

Mapping list of strings to 0 in Pandas Python

I am trying to study the effects of Alcohol and Drugs in car accidents using an Open BigQuery dataset. I have my dataset ready to go and am just refining it further. I want to categorize the string entries in the pandas columns.
The data frame has over 11,000 entries and there are about 44 unique values in each column. However, I just want to map the entries which say 'Alcohol Involvement' and 'Drugs (Illegal)' to 1, and map any other entry to 0.
I have created a list of all the entries which I don't care about and want to get rid of; they are as follows:
list_ign = ['Backing Unsafely',
'Turning Improperly', 'Other Vehicular',
'Driver Inattention/Distraction', 'Following Too Closely',
'Oversized Vehicle', 'Driver Inexperience', 'Brakes Defective',
'View Obstructed/Limited', 'Passing or Lane Usage Improper',
'Unsafe Lane Changing', 'Failure to Yield Right-of-Way',
'Fatigued/Drowsy', 'Prescription Medication',
'Failure to Keep Right', 'Pavement Slippery', 'Lost Consciousness',
'Cell Phone (hands-free)', 'Outside Car Distraction',
'Traffic Control Disregarded', 'Fell Asleep',
'Passenger Distraction', 'Physical Disability', 'Illness', 'Glare',
'Other Electronic Device', 'Obstruction/Debris', 'Unsafe Speed',
'Aggressive Driving/Road Rage',
'Pedestrian/Bicyclist/Other Pedestrian Error/Confusion',
'Reaction to Other Uninvolved Vehicle', 'Steering Failure',
'Traffic Control Device Improper/Non-Working',
'Tire Failure/Inadequate', 'Animals Action',
'Driverless/Runaway Vehicle']
What could I do to map just 'Alcohol Involvement' and 'Drugs (Illegal)' to 1 and set everything in the list shown to 0?
Say your source column is named Crime:
import numpy as np
df['Illegal'] = np.where(df['Crime'].isin(['Alcohol Involvement', 'Drugs (Illegal)']), 1, 0)
Or,
df['Crime'] = df['Crime'].isin(['Alcohol Involvement', 'Drugs (Illegal)']).astype(int)
The above-mentioned methods work fine. However, they were not tagging all the categories I wanted to remove later on, so I used this method:
for word in list_ign:
    df = df.replace(str(word), 'Replace')
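A single-pass alternative (a sketch, assuming the source column is still named Crime): map the two categories of interest to 1 and let everything else, including every entry in list_ign, fall through to 0:
df['Illegal'] = (df['Crime']
                 .map({'Alcohol Involvement': 1, 'Drugs (Illegal)': 1})
                 .fillna(0)
                 .astype(int))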
