I am trying to study the effects of alcohol and drugs in car accidents using an open BigQuery dataset. I have my dataset ready to go and am just refining it further. I want to categorize the string entries in the pandas columns.
The data frame has over 11,000 entries and there are about 44 unique values in each column. However, I only want to map the entries that say 'Alcohol Involvement' and 'Drugs (Illegal)' to 1; any other entry should map to 0.
I have created a list of all the entries I don't care about and want to get rid of:
list_ign = ['Backing Unsafely',
'Turning Improperly', 'Other Vehicular',
'Driver Inattention/Distraction', 'Following Too Closely',
'Oversized Vehicle', 'Driver Inexperience', 'Brakes Defective',
'View Obstructed/Limited', 'Passing or Lane Usage Improper',
'Unsafe Lane Changing', 'Failure to Yield Right-of-Way',
'Fatigued/Drowsy', 'Prescription Medication',
'Failure to Keep Right', 'Pavement Slippery', 'Lost Consciousness',
'Cell Phone (hands-free)', 'Outside Car Distraction',
'Traffic Control Disregarded', 'Fell Asleep',
'Passenger Distraction', 'Physical Disability', 'Illness', 'Glare',
'Other Electronic Device', 'Obstruction/Debris', 'Unsafe Speed',
'Aggressive Driving/Road Rage',
'Pedestrian/Bicyclist/Other Pedestrian Error/Confusion',
'Reaction to Other Uninvolved Vehicle', 'Steering Failure',
'Traffic Control Device Improper/Non-Working',
'Tire Failure/Inadequate', 'Animals Action',
'Driverless/Runaway Vehicle']
What could I do to just map 'Alcohol Involvement' and 'Drugs (Illegal)' to 1 and set everything in the list shown to 0?
Say your source column is named Crime:
import numpy as np
df['Illegal'] = np.where(df['Crime'].isin(['Alcohol Involvement', 'Drugs (Illegal)']), 1, 0)
Or,
df['Crime'] = df['Crime'].isin(['Alcohol Involvement', 'Drugs (Illegal)']).astype(int)
The above-mentioned methods work fine; however, they were not tagging all the categories I wanted to remove later on, so I used this method:
for word in list_ign:
    df = df.replace(str(word), 'Replace')
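A faster alternative, if the goal is just to flag the unwanted categories in one pass, is a single vectorized replace; the sketch below assumes the column is named 'Crime', as in the answer above:
# Replace every value in list_ign in one call instead of looping over the list
df['Crime'] = df['Crime'].replace(list_ign, 'Replace')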
I have two dataframes:
One with a single column of business names, which I call 'bus_names_2', with a column name of 'BUSINESS_NAME'.
One with an array of records and fields pulled from an RSS feed, which I call 'df_newsfeed'. The important field is the 'Description_2' field, which holds the RSS feed contents after scrubbing stopwords and symbols; the same scrubbing was applied to the 'bus_names_2' dataframe.
I am trying to look through each record in the 'df_newsfeed' dataframe's 'Description_2' field to see if it contains a business name from the 'bus_names_2' dataframe. This is easily done using the following:
def IdentityResolution_demo(bus_names, df, col='Description_2', upper=True):
    n_rows = df.shape[0]
    description_col = df.columns.get_loc(col)
    df['Company'] = ''
    company_col = df.columns.get_loc('Company')
    if upper:
        df.loc[:, col] = df.loc[:, col].str.upper()
    for ind in range(n_rows):
        businesses = []
        description = df.iloc[ind, description_col]
        for bus_name in bus_names:
            if bus_name in description:
                businesses.append(bus_name)
        if len(businesses) > 0:
            company = '|'.join(businesses)
            df.iloc[ind, company_col] = company
    df = df[['Source', 'RSS', 'Company', 'Title', 'PublishedDate', 'Description', 'Link']].drop_duplicates()
    return df
bus_names_3 = list(set(bus_names_2['BUSINESS_NAME'].tolist()))
test = IdentityResolution_demo(bus_names_3, df_newsfeed.iloc[:10])
test[test['Company']!='']
The issue with this, aside from the length of time it takes, is that it brings everything back in a 'contains' manner. I only want full-word matches. Meaning, if I have a company in my 'bus_names_2' dataframe called 'Bank of A', it should only bring that name into the 'Company' column when the full phrase 'Bank of A' exists in the 'Description_2' column of the 'df_newsfeed' dataframe, and not when 'Bank of America' shows up.
Essentially, I need something like this built into my function to produce the proper output for the 'Company' column, but I don't know how to implement it. The code below gets the point across.
import re

Description_2 = 'GUARDFORCE AI CO LIMITED AI GFAIW RIVERSOFT INC PEAKWORK COMPANY GFAIS CONCIERGE GUARDFORCE AI RIVERSOFT ROBOT TRAVEL AGENCY'
bus_name_2 = ['GUARDFORCE AI CO']
for i in bus_name_2:
    bus_name = re.compile(fr'\b{i}\b')
    print(f"{i if bus_name.search(Description_2) else ''}")
This would produce an output of 'GUARDFORCE AI CO' but if I change the bus_name_2 to:
bus_name_2 = ['GUARDFORCE AI C']
It would produce a null output.
This function is written in the way it is because comparing two dataframes turned into a very long query and so optimization required a non-dataframe format.
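One way to get full-word matches inside the function is to precompile a word-boundary pattern per business name and use re.search in place of the plain `in` check. The helper below is only a sketch, not the original function; the name identity_resolution_wholeword is made up for illustration, and re.escape guards against names that contain regex metacharacters:
import re

def identity_resolution_wholeword(bus_names, df, col='Description_2'):
    # One \b-bounded pattern per business name; \b only matches at a word edge,
    # so 'BANK OF A' will not fire inside 'BANK OF AMERICA'.
    patterns = {name: re.compile(rf'\b{re.escape(name)}\b') for name in bus_names}
    df = df.copy()
    df['Company'] = ''
    desc_col = df.columns.get_loc(col)
    company_col = df.columns.get_loc('Company')
    for ind in range(df.shape[0]):
        description = df.iloc[ind, desc_col]
        matches = [name for name, pat in patterns.items() if pat.search(description)]
        df.iloc[ind, company_col] = '|'.join(matches)
    return df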
I have a dataframe which includes the names of movie titles and TV series.
From specific keywords I want to classify each row as a Movie or a Series. However, because the brackets leave no space before the keywords, the keywords are not being picked up by the str.contains() function and I need a workaround.
This is my dataframe:
import pandas as pd
import numpy as np
watched_df = pd.DataFrame([['Love Death Robots (Episode 1)'],
['James Bond'],
['How I met your Mother (Avnsitt 3)'],
['random name'],
['Random movie 3 Episode 8383893']],
columns=['Title'])
watched_df.head()
To add the column that classifies the titles as TV series or Movies I have the following code.
watched_df["temporary_brackets_removed_title"] = watched_df['Title'].str.replace('(', '')
watched_df["Film_Type"] = np.where(watched_df.temporary_brackets_removed_title.astype(str).str.contains(pat = 'Episode | Avnsitt', case = False), 'Series', 'Movie')
watched_df = watched_df.drop('temporary_brackets_removed_title', axis=1)
watched_df.head()
Is there a simpler way to solve this without having to add and drop a column?
Maybe a str.contains-like function that does not check whether a string is exactly the same, but just whether it contains the given word? Similar to the "LIKE" functionality in SQL?
You can use str.contains and then map the results:
watched_df['Film_Type'] = watched_df['Title'].str.contains(r'(?:Episode|Avnsitt)').map({True: 'Series', False: 'Movie'})
Output:
>>> watched_df
Title Film_Type
0 Love Death Robots (Episode 1) Series
1 James Bond Movie
2 How I met your Mother (Avnsitt 3) Series
3 random name Movie
4 Random movie 3 Episode 8383893 Series
I was initially trying to run the following code:
import pandas as pd
import numpy as np
f = ['Austria', '11m/18d/19yyy', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 1, 'JVI Residences', 2, 1, 'Brief additional comments.']
test= pd.DataFrame(f, columns ={'Country' , 'Date of last observation' , 'ResRep/RTAC/RTC (Yes/No?)' ,
'Available Facilities (Yes/No?)' , 'Funds Transferrable (Yes/No?)' ,
'Local Flight Help (Yes/No?)' , 'Visa Help (Yes/No?)' ,
'Local Support Rating 1 Great - 3 Bad' , 'Hotel Name' ,
'Hotel Rating 1 Great - 3 Bad' ,
'Travel Route Rating 1 Great - 3 Bad' , 'Additional Overall Comments'})
that kept resulting in the following error:
ValueError: Shape of passed values is (12, 1), indices imply (12, 12)
So I attempted to correct that by putting brackets around f when creating the dataframe. However, this resulted in pandas returning a dataframe with the column names out of order and the data in columns it should not be in:
Available Facilities (Yes/No?) ... Country
0 Austria ... Brief additional comments.
[1 rows x 12 columns]
Index(['Available Facilities (Yes/No?)', 'Hotel Name',
'Local Support Rating 1 Great - 3 Bad', 'Additional Overall Comments',
'Date of last observation', 'Local Flight Help (Yes/No?)',
'ResRep/RTAC/RTC (Yes/No?)', 'Travel Route Rating 1 Great - 3 Bad',
'Visa Help (Yes/No?)', 'Hotel Rating 1 Great - 3 Bad',
'Funds Transferrable (Yes/No?)', 'Country'],
dtype='object')
Can someone advise how to get my data to go into the dataframe I am trying to make and keep it in the order I want?
You have given a set as the value of the columns parameter. Pandas will take the set and try to convert it to a list, but because a set is an unordered data structure, the order of your column names cannot be preserved in the conversion. Simply use a list instead of a set.
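A minimal sketch of the corrected call, using the row and column names from the question: wrap f in a list so pandas treats it as one row of twelve values, and pass the column names as a list so their order is preserved.
import pandas as pd

f = ['Austria', '11m/18d/19yyy', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 1, 'JVI Residences', 2, 1, 'Brief additional comments.']

# [f] makes a single row; a list of column names keeps the intended order
test = pd.DataFrame([f], columns=['Country', 'Date of last observation', 'ResRep/RTAC/RTC (Yes/No?)',
                                  'Available Facilities (Yes/No?)', 'Funds Transferrable (Yes/No?)',
                                  'Local Flight Help (Yes/No?)', 'Visa Help (Yes/No?)',
                                  'Local Support Rating 1 Great - 3 Bad', 'Hotel Name',
                                  'Hotel Rating 1 Great - 3 Bad',
                                  'Travel Route Rating 1 Great - 3 Bad', 'Additional Overall Comments'])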
My dataframe has a column called Borough which contains values like these:
"east toronto", "west toronto", "central toronto" and "west toronto", along with other region names.
Now I want a regular expression which gets me the data of every entry that ends with "toronto". How do I do that?
I tried this:
tronto_data = df_toronto[df_toronto['Borough'] = .*Toronto$].reset_index(drop=True)
tronto_data.head(7)
If the data is well formatted, you can split the string on the space and access the final word, comparing it to 'toronto'. For example:
df = pd.DataFrame({'column': ['west toronto', 'central toronto', 'some place']})
mask_df = df['column'].str.split(' ', expand=True)
which returns:
0 1
0 west toronto
1 central toronto
2 some place
You can then access the final column to work out which rows end with 'toronto':
toronto_df = df[mask_df[1]=='toronto']
Edit:
I did not know there was a string method .endswith, which is the better way to do this. However, this solution does provide two columns, which may be useful.
As #Code_10 points out in a comment, you can use str.endswith; try the below:
df = pd.DataFrame({'city': ['east toronto', 'west toronto', 'other', 'central toronto']})
df_toronto = df[df['city'].str.endswith('toronto')]
#df_toronto.head()
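Since the question asks for a regular expression, the same filter can also be written with str.contains and a $-anchored pattern; case=False covers both 'toronto' and 'Toronto'. A sketch using the df_toronto and Borough names from the question:
# Keep only the rows whose Borough value ends with 'toronto' (case-insensitive)
toronto_data = df_toronto[df_toronto['Borough'].str.contains(r'toronto$', case=False)].reset_index(drop=True)
toronto_data.head(7)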
I'm doing analysis on movies, and each movie has a genre attribute that can list several genres, like drama or comedy. The data looks like this:
movie_list = [
{'name': 'Movie 1',
'genre' :'Action, Fantasy, Horror'},
{'name': 'Movie 2',
'genre' :'Action, Comedy, Family'},
{'name': 'Movie 3',
'genre' :'Biography, Drama'},
{'name': 'Movie 4',
'genre' :'Biography, Drama, Romance'},
{'name': 'Movie 5',
'genre' :'Drama'},
{'name': 'Movie 6',
'genre' :'Documentary'},
]
The problem is: how do I do analysis on this? For example, how do I know how many action movies there are, and how do I query for the action category? Specifically:
How do I get all the categories in this list, so I know how many movies each contains?
How do I query for a certain kind of movie, like action?
Do I need to turn the genre into array?
Currently I can get away with the 2nd question using df[df['genre'].str.contains("Action")].describe(), but is there better syntax?
If your data isn't too huge, I would do some pre-processing and get 1 record per genre. That is, I would structure your data frame like this:
Name Genre
Movie 1 Action
Movie 1 Fantasy
Movie 1 Horror
...
Note that the names are repeated. While this may make your data set much bigger, if your system can handle it, it can make data analysis very easy.
Use the following code to do the transformation:
import pandas as pd
def reformat_movie_list(movies):
    name = []
    genre = []
    result = pd.DataFrame()
    for movie in movies:
        movie_name = movie["name"]
        movie_genres = movie["genre"].split(",")
        for movie_genre in movie_genres:
            name.append(movie_name.strip())
            genre.append(movie_genre.strip())
    result["name"] = name
    result["genre"] = genre
    return result
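Applied to the movie_list from the question, usage would look like this:
movie_df = reformat_movie_list(movie_list)
movie_df.head()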
In this format, your 3 questions become
How do I get all the categories in this list? So I know each contains how many movies?
movie_df.groupby("genre").agg("count")
see How to count number of rows in a group in pandas group by object?
How do I query for a certain kind of movie, like action?
horror_movies = movie_df[movie_df["genre"] == "Horror"]
see pandas: filter rows of DataFrame with operator chaining
Do I need to turn the genre into array?
Your de-normalization of the data should take care of it.
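As an alternative to the helper function, here is a sketch of the same de-normalization using pandas' built-in str.split and explode (available in pandas 0.25 and later):
movie_df = pd.DataFrame(movie_list)

# Split the comma-separated genre string into a list, then explode to one row per genre
movie_df['genre'] = movie_df['genre'].str.split(',').apply(lambda gs: [g.strip() for g in gs])
movie_df = movie_df.explode('genre')

# With one row per (movie, genre) pair, counting per genre is a single call
movie_df['genre'].value_counts()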