Collapsing categories in a CSV in Python

I have a dataframe 'locations' which contains the genre of some stores. It is cluttered with lots of different categories, so I want to combine some of them into fewer, simpler categories. How do I do this?
Example:
store type
mcdonalds fast-food
nandos sit-down-food
wetherspoons tech-pub
southsider pub-and-dine
I'd like to combine the categories fast-food and sit-down-food into just 'food', and tech-pub and pub-and-dine into just 'pub'. How do I do this?

My first instinct would be to use the pandas apply function to map the values as desired. Something along the lines of:
import pandas as pd

def nameMapper(name):
    if "food" in name:
        return "food"
    elif "pub" in name:
        return "pub"
    else:
        return "something else"

data = [
    ["mcdonalds", "fast-food"],
    ["nandos", "sit-down-food"],
    ["wetherspoons", "tech-pub"],
    ["southsider", "pub-and-dine"]
]
# Use a list for the column names; a set ({...}) has no guaranteed order
df = pd.DataFrame(data, columns=["store", "type"])
print(df)
print("---------------------------")
df["type"] = df["type"].apply(nameMapper)
print(df)
When I ran this, the following output was produced:
$ python3 answer.py
store type
0 mcdonalds fast-food
1 nandos sit-down-food
2 wetherspoons tech-pub
3 southsider pub-and-dine
---------------------------
store type
0 mcdonalds food
1 nandos food
2 wetherspoons pub
3 southsider pub

You can use a dict keyed by the types you want to replace, with the desired types as the values. Then set the column to a list comprehension that replaces the mapped types and keeps the ones that aren't in the dict.
# Dict specifying the types to replace
type_dict = {'fast-food': 'food', 'sit-down-food': 'food',
             'tech-pub': 'pub', 'pub-and-dine': 'pub'}
# Replace types that are dict keys but keep the values that aren't dict keys
df['type'] = [type_dict.get(i, i) for i in df['type']]
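A note: pandas can also do this lookup itself; Series.replace with the same type_dict is a common equivalent. A minimal sketch reusing the dict defined above:
# Equivalent one-liner: values found in type_dict are swapped,
# anything not in the dict is left untouched
df['type'] = df['type'].replace(type_dict)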

Related

Pandas dataframe: select list items in a column, then transform string on the items

One of the columns I'm importing into my dataframe is structured as a list. I need to pick out certain values from said list, transform the value and add it to one of two new columns in the dataframe. Before:
Name    Listed_Items
Tom     ["dr_md_coca_cola", "dr_od_water", "potatoes", "grass", "ot_other_stuff"]
Steve   ["dr_od_orange_juice", "potatoes", "grass", "ot_other_stuff", "dr_md_pepsi"]
Phil    ["dr_md_dr_pepper", "potatoes", "grass", "dr_od_coffee", "ot_other_stuff"]
From what I've read I can turn the column into a list
df["listed_items"] = df["listed_items"].apply(eval)
But then I cannot see how to find any list items that start with dr_md, extract the item, remove the starting dr_md, replace any underscores, capitalize the first letter and add that to a new MD column in the row. Then the same again for dr_od. There is only one item in each row's list that starts with dr_md and one that starts with dr_od. Desired output:
Name    MD          OD
Tom     Coca Cola   Water
Steve   Pepsi       Orange Juice
Phil    Dr Pepper   Coffee
What you need to do is write a function that does the processing for you and pass it into apply (or, in this case, map). Alternatively, you could expand your list column into multiple columns and then process them afterwards, but that will only work if your lists are always in the same order (see panda expand columns with list into multiple columns). Because you only have one input column, you can use map instead of apply.
def process_dr_md(l: list):
    for s in l:
        if s.startswith("dr_md_"):
            # You can process your string further here
            return s[6:]

def process_dr_od(l: list):
    for s in l:
        if s.startswith("dr_od_"):
            # You can process your string further here
            return s[6:]

df["listed_items"] = df["listed_items"].map(eval)
df["MD"] = df["listed_items"].map(process_dr_md)
df["OD"] = df["listed_items"].map(process_dr_od)
I hope that gets you on your way!
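One small caution on the eval step above: eval will execute arbitrary Python, so for a column that only ever holds list literals, ast.literal_eval from the standard library is a safer drop-in. A minimal sketch, assuming the same lowercase listed_items column name used in the answer:
from ast import literal_eval

# literal_eval only parses Python literals (lists, strings, numbers, ...),
# so a malformed or malicious cell cannot execute code the way eval could
df["listed_items"] = df["listed_items"].map(literal_eval)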
Use pivot_table
df = df.explode('Listed_Items')
df = df[df.Listed_Items.str.contains('dr_')]
df['Type'] = df['Listed_Items'].str.contains('dr_md').map({True: 'MD',
                                                           False: 'OD'})
df.pivot_table(values='Listed_Items',
               columns='Type',
               index='Name',
               aggfunc='first')
Type MD OD
Name
Phil dr_md_dr_pepper dr_od_coffee
Steve dr_md_pepsi dr_od_orange_juice
Tom dr_md_coca_cola dr_od_water
From here it's just a matter of beautifying your dataset as you wish.
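One possible version of that beautifying step, sketched here under the assumption that the pivoted frame is kept in a variable called pretty: strip the dr_md_/dr_od_ prefixes, swap the underscores for spaces and title-case the result.
# pretty is an assumed name for the pivoted frame from above
pretty = df.pivot_table(values='Listed_Items',
                        columns='Type',
                        index='Name',
                        aggfunc='first')
for col, prefix in (('MD', 'dr_md_'), ('OD', 'dr_od_')):
    # drop the prefix, then turn e.g. "coca_cola" into "Coca Cola"
    pretty[col] = (pretty[col]
                   .str.replace('^' + prefix, '', regex=True)
                   .str.replace('_', ' ', regex=False)
                   .str.title())
print(pretty)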
I took a slightly different approach from the previous answers.
Given a df of the form:
Name Items
0 Tom [dr_md_coca_cola, dr_od_water, potatoes, grass...
1 Steve [dr_od_orange_juice, potatoes, grass, ot_other...
2 Phil [dr_md_dr_pepper, potatoes, grass, dr_od_coffe...
and making the following assumptions:
only one item in a list matches the target mask
the target mask always appears at the start of the entry string
I created the following function to parse the list:
import re

def parse_Items(tgt_mask: str, itmList: list) -> str:
    p = re.compile(tgt_mask)
    for itm in itmList:
        if p.match(itm):
            return itm[p.search(itm).span()[1]:].replace('_', ' ')
Then you can modify your original data frame with the following:
df['MD'] = [parse_Items('dr_md_', x) for x in df['Items'].to_list()]
df['OD'] = [parse_Items('dr_od_', x) for x in df['Items'].to_list()]
df.pop('Items')
This produces the following:
Name MD OD
0 Tom coca cola water
1 Steve pepsi orange juice
2 Phil dr pepper coffee
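If the capitalised form from the question ('Coca Cola', 'Dr Pepper') is needed, chaining .title() onto the .replace('_', ' ') call inside parse_Items should cover it, since str.title() capitalises each word of the cleaned string.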
I would normalize the data before putting it into a dataframe:
import pandas as pd
from typing import Dict, List, Tuple

def clean_stuff(text: str):
    clean_text = text[6:].replace('_', ' ')
    return " ".join([
        word.capitalize()
        for word in clean_text.split(" ")
    ])

def get_md_od(stuffs: List[str]) -> Tuple[str, str]:
    md_od = [s for s in stuffs if s.startswith(('dr_md', 'dr_od'))]
    md_od = sorted(md_od)
    print(md_od)
    return clean_stuff(md_od[0]), clean_stuff(md_od[1])

dirty_stuffs = [{'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]},
                {'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]}
               ]

normalized_stuff: List[Dict[str, str]] = []
for stuff in dirty_stuffs:
    md, od = get_md_od(stuff['Listed_Items'])
    normalized_stuff.append({
        'Name': stuff['Name'],
        'MD': md,
        'OD': od,
    })

df = pd.DataFrame(normalized_stuff)
print(df)
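Note that get_md_od relies on sorted() for the ordering: because 'dr_md...' sorts before 'dr_od...' ('m' comes before 'o'), the MD entry always ends up first in the returned pair.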

Displaying an attribute from a csv file in Python when it contains 1 or 2 other named attributes

I'm new to python and would really appreciate some help please. I've created a file of car attributes in Excel and saved it as a csv file, called cars.csv like this:
Car make, colour, price, number of seats, automatic, petrol
Ford, black, 40000,5,yes,no
Tesla, white, 90000,4,yes,no
After the headings, I have 20 lines with different cars and their attributes.
Could someone help me with the Python code which returns all the makes of cars which have, say, 4 seats, or cost 40000, or both of these attributes? Thank you.
You can use Pandas:
# pip install pandas
import pandas as pd
df = pd.read_csv('cars.csv', skipinitialspace=True)
print(df)
# Output
Car make colour price number of seats automatic petrol
0 Ford black 40000 5 yes no
1 Tesla white 90000 4 yes no
Filter your dataframe
out = df[(df['number of seats'] == 4) | (df['price'] == 40000)]
print(out)
# Output
Car make colour price number of seats automatic petrol
0 Ford black 40000 5 yes no
1 Tesla white 90000 4 yes no
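Since the question only asks for the makes, the relevant column can be pulled out of the filtered frame afterwards; a small sketch, assuming the header really reads Car make as in the sample file:
# Just the makes of the matching cars, as a plain Python list
print(out['Car make'].tolist())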
If you don't want to use any library, you can use this approach:
# The return value of this function can be handled like a list, but it is called a generator object
# This function loads all cars from the csv file and yields their data
def load_cars():
    with open('Cars.csv', 'r') as f:
        for counter, line in enumerate(f):
            if counter > 0:  # skip the header row
                # split on the comma and strip the stray spaces around each field
                line_content = [part.strip() for part in line.strip().split(',')]
                producer, colour, price, number_of_seats, automatic, petrol = line_content
                yield producer, colour, price, number_of_seats, automatic, petrol  # This is like one item in the list

# The statement below is called a list comprehension
# load_cars() returns the list-like object and the comprehension iterates over each item, putting it in car_data
# The car is put in the list if the condition after the if keyword is true
# Each car which fulfills the condition ends up in the newly created list
filtered_cars = [car_data for car_data in load_cars() if car_data[3] == "4" and car_data[2] == "40000"]
filtered_car_names = [car_data[0] for car_data in filtered_cars]
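If you want to stay library-free but avoid hand-rolling the parsing, the standard library's csv module also handles the header row and the spaces after the commas; a sketch of the same filter, assuming the file and column names from the question:
import csv

def load_cars_csv(path='cars.csv'):
    # DictReader uses the header row for the keys; skipinitialspace drops
    # the spaces that follow the commas in the sample file
    with open(path, newline='') as f:
        yield from csv.DictReader(f, skipinitialspace=True)

# Makes of cars with 4 seats or a price of 40000 (csv values are read as strings)
makes = [row['Car make'] for row in load_cars_csv()
         if row['number of seats'] == '4' or row['price'] == '40000']
print(makes)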
You can use pandas' .loc[] indexer here. If you load the csv with pandas, you can use .loc to select only the rows that match your conditions.
import pandas as pd

# if the index was dropped when saving the csv
car_atts = pd.read_csv('path to file as string', skipinitialspace=True)
# if the index was not dropped
car_atts = pd.read_csv('path to file as string', skipinitialspace=True, index_col=0)

# rows with exactly 4 seats
four_seater_cars = car_atts.loc[car_atts['number of seats'] == 4]
# rows priced at 40000 (price is a regular column, so compare its values)
forty_k_cars = car_atts.loc[car_atts['price'] == 40000]

# you can use & (AND) to find rows that match both conditions
four_seater_forty_k_cars = car_atts.loc[
    (car_atts['number of seats'] == 4) &
    (car_atts['price'] == 40000)
]
# you can use | (OR) to find rows that match either condition
four_seater_or_forty_k_cars = car_atts.loc[
    (car_atts['number of seats'] == 4) |
    (car_atts['price'] == 40000)
]
Hope this answers your question.
Happy coding!!

How to split two first names that are joined together into two different words in python

I am trying to split misspelled first names. Most of them are joined together. I was wondering if there is any way to separate two first names that are together into two different words.
For example, if the misspelled name is trujillohernandez, it should be separated into trujillo hernandez.
I am trying to create a function that can do this for a whole column with thousands of misspelled names like the example above. However, I haven't been successful. Spell-checker libraries do not work, given that these are Hispanic first names.
I would be really grateful if you can help to develop some sort of function to make it happen.
As noted in the comments above, not having a list of possible names will cause a problem. However, while perhaps not perfect, here is something to try...
Given a dataframe example like...
Name
0 sofíagomez
1 isabelladelgado
2 luisvazquez
3 juanhernandez
4 valentinatrujillo
5 camilagutierrez
6 joséramos
7 carlossantana
Code (Python):
import pandas as pd
import requests
# longest list of hispanic surnames I could find in a table
url = r'https://namecensus.com/data/hispanic.html'
# download the table into a frame and clean up the header
page = requests.get(url)
table = pd.read_html(page.text.replace('<br />',' '))
df = table[0]
df.columns = df.iloc[0]
df = df[1:]
# move the frame of surnames to a list
last_names = df['Last name / Surname'].tolist()
last_names = [each_string.lower() for each_string in last_names]
# create a test dataframe of joined firstnames and lastnames
data = {'Name' : ['sofíagomez', 'isabelladelgado', 'luisvazquez', 'juanhernandez', 'valentinatrujillo', 'camilagutierrez', 'joséramos', 'carlossantana']}
df = pd.DataFrame(data, columns=['Name'])
# create new columns for the matched names
lastname = '({})'.format('|'.join(last_names))
df['Firstname'] = df.Name.str.replace(str(lastname)+'$', '', regex=True).fillna('--not found--')
df['Lastname'] = df.Name.str.extract(str(lastname)+'$', expand=False).fillna('--not found--')
# output the dataframe
print('\n\n')
print(df)
Outputs:
Name Firstname Lastname
0 sofíagomez sofía gomez
1 isabelladelgado isabella delgado
2 luisvazquez luis vazquez
3 juanhernandez juan hernandez
4 valentinatrujillo valentina trujillo
5 camilagutierrez camila gutierrez
6 joséramos josé ramos
7 carlossantana carlos santana
Further cleanup may be required but perhaps it gets the majority of names split.
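Since this only splits names whose surname appears in the downloaded list, it may also help to flag the rows that did not match so they can be handled by hand; a small sketch reusing the --not found-- placeholder set by the code above:
# Rows where no surname from the list matched the end of the name
unmatched = df[df['Lastname'] == '--not found--']
print(unmatched)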

Check if a string is present in multiple lists

I am trying to categorize a dataset based on the string that contains the name of the different objects of the dataset.
The dataset is composed of 3 columns, df['Name'], df['Category'] and df['Sub_Category'], the Category and Sub_Category columns are empty.
For each row I would like to check, against different lists of words, whether the name of the object contains at least one word from one of the lists. Based on this first check I would like to attribute a value to the category column. If it finds words in two different lists, I would like to attribute two values to the object in the category column.
Moreover, I would like to be able to identify which word has been checked in which list in order to attribute a value to the sub_category column.
Until now, I have been able to do it with only one list, but I am not able to identify which word has been matched, and the code takes very long to run.
Here is my code (where I added an example of names found in my dataset as df['Name']):
import pandas as pd
import numpy as np
df['Name'] = ['vitrine murale vintage','commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']
furniture_check = ['canape', 'chaise', 'buffet','table','commode','lit']
vehicle_check = ['solex','voiture','moto','scooter']
art_check = ['tableau','scuplture', 'tapisserie']
for idx, row in df.iterrows():
    for c in furniture_check:
        if c in row['Name']:
            df.loc[idx, 'Category'] = 'Meubles'
Any help would be appreciated
Here is an approach that expands lists, merges them and re-combines them.
df = pd.DataFrame({"name":['vitrine murale vintage','commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']})
furniture_check = ['canape', 'chaise', 'buffet','table','commode','lit']
vehicle_check = ['solex','voiture','moto','scooter']
art_check = ['tableau','scuplture', 'tapisserie']
# put the categories into a dataframe
dfcat = pd.DataFrame([{"category": "furniture", "values": furniture_check},
                      {"category": "vehicle", "values": vehicle_check},
                      {"category": "art", "values": art_check}])

# turn the space-delimited "name" column into a list
dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
             # explode the list so it can be used in a join; reset_index() keeps a copy of the original DF's index
             .explode("name").reset_index()
             # merge the exploded names against the exploded keyword lists
             .merge(dfcat.explode("values"), left_on="name", right_on="values")
             # where there are multiple categories, make it a list
             .groupby("index", as_index=False).agg({"category": lambda s: list(s)})
             # put the original index back...
             .set_index("index")
            )
# simple join to get the names and the list of associated categories
df.join(dfcatlist)
  name                    category
0 vitrine murale vintage  NaN
1 commode ancienne        ['furniture']
2 lustre antique          NaN
3 solex                   ['vehicle']
4 sculpture médievale     NaN
5 jante voiture           ['vehicle']
6 lit et matelas          ['furniture']
7 turbine moteur          NaN
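The question also asks for a Sub_Category holding the matched keyword. The merged frame above already carries that keyword in its "values" column before the groupby drops it, so the same pipeline can aggregate both; a sketch reusing df and dfcat from above (sub_category is just a chosen name for the extra column):
dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
             .explode("name").reset_index()
             .merge(dfcat.explode("values"), left_on="name", right_on="values")
             # collect both the categories and the matched keywords per original row
             .groupby("index", as_index=False).agg({"category": list, "values": list})
             .set_index("index")
             .rename(columns={"values": "sub_category"})
            )
df.join(dfcatlist)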

How to replace characters in a dataframe where a column may contain entries of different data types

New to Python here; I want to ask a quick question on how to replace multiple characters simultaneously, given that the entries may have different data types. I just want to change the strings and keep everything else as it is:
import pandas as pd

def test_me(text):
    replacements = [("ID", ""), ("u", "a")]
    return [text.replace(a, b) for a, b in replacements if type(text) == str]

cars = {'Brand': ['HonduIDCivic', 1, 3.2, 'CarIDA4'],
        'Price': [22000, 25000, 27000, 35000]}
df = pd.DataFrame(cars, columns=['Brand', 'Price'])
df['Brand'] = df['Brand'].apply(test_me)
resulting in
Brand Price
0 [HonduCivic, HondaIDCivic] 22000
1 [] 25000
2 [] 27000
3 [CarA4, CarIDA4] 35000
rather than
Brand Price
0 HondaCivic 22000
1 1 25000
2 3.2 27000
3 CarA4 35000
Appreciate any suggestions!
If the replacements never have identical search phrases, it will be easier to convert the list of tuples into a dictionary and then use
import re
#...

def test_me(text):
    replacements = dict([("ID", ""), ("u", "a")])
    if type(text) == str:
        return re.sub("|".join(sorted(map(re.escape, replacements.keys()), key=len, reverse=True)),
                      lambda x: replacements[x.group()], text)
    else:
        return text
The "|".join(sorted(map(re.escape, replacements.keys()),key=len,reverse=True)) part will create a regular expression out of re.escaped dictionary keys starting with the longest so as to avoid issues when handling nested search phrases that share the same prefix.
Pandas test:
>>> df['Brand'].apply(test_me)
0 HondaCivic
1 1
2 3.2
3 CarA4
Name: Brand, dtype: object
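For this particular pair of substitutions, pandas can also do the substring replacement itself while leaving the non-string entries alone; a sketch using Series.replace with regex=True (unlike the sequential re.sub above, overlapping search phrases would still need the longest-first treatment):
# String entries get both substitutions; non-string entries (1, 3.2) pass through unchanged
df['Brand'] = df['Brand'].replace({'ID': '', 'u': 'a'}, regex=True)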
