I am trying to categorize a dataset based on the string that contains the name of the different objects of the dataset.
The dataset is composed of 3 columns, df['Name'], df['Category'] and df['Sub_Category'], the Category and Sub_Category columns are empty.
For each row I would like to check in different lists of words if the name of the object contains at least one word in one of the list. Based on this first check I would like to attribute a value to the category column. If it finds more than 1 word in 2 different lists I would like to attribute 2 values to the object in the category column.
Moreover, I would like to be able to identify which word has been checked in which list in order to attribute a value to the sub_category column.
Until now, I have been able to do it with only one list, but I am not able to identity which word has been checked and the code is very long to run.
Here is my code (where I added an example of names found in my dataset as df['Name']) :
import pandas as pd
import numpy as np
df['Name'] = ['vitrine murale vintage','commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']
furniture_check = ['canape', 'chaise', 'buffet','table','commode','lit']
vehicle_check = ['solex','voiture','moto','scooter']
art_check = ['tableau','scuplture', 'tapisserie']
for idx, row in df.iterrows():
for c in furniture_check:
if c in row['Name']:
df.loc[idx, 'Category'] = 'Meubles'
Any help would be appreciated
Here is an approach that expands lists, merges them and re-combines them.
df = pd.DataFrame({"name":['vitrine murale vintage','commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']})
furniture_check = ['canape', 'chaise', 'buffet','table','commode','lit']
vehicle_check = ['solex','voiture','moto','scooter']
art_check = ['tableau','scuplture', 'tapisserie']
# put categories into a dataframe
dfcat = pd.DataFrame([{"category":"furniture","values":furniture_check},
{"category":"vechile","values":vehicle_check},
{"category":"art","values":art_check}])
# turn apace delimited "name" column into a list
dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
# explode list so it can be used as join. reset_index() to keep a copy of index of original DF
.explode("name").reset_index()
# merge exploded names on both side
.merge(dfcat.explode("values"), left_on="name", right_on="values")
# where there are multiple categoryies, make it a list
.groupby("index", as_index=False).agg({"category":lambda s: list(s)})
# but original index back...
.set_index("index")
)
# simple join and have names and list of associated categories
df.join(dfcatlist)
name
category
0
vitrine murale vintage
nan
1
commode ancienne
['furniture']
2
lustre antique
nan
3
solex
['vechile']
4
sculpture médievale
nan
5
jante voiture
['vechile']
6
lit et matelas
['furniture']
7
turbine moteur
nan
Related
The problem: I have two spreadsheets. Spreadsheet 1 has about 20,000 rows. Spreadsheet 2 has near 1 million rows. When a value from a row in spreadsheet 1 matches a value from a row in spreadsheet 2, the entire row from spreadsheet 2 is written to excel. The problem isn't too difficult, but with such a large number of rows, the run time is incredibly long.
Book 1 Example:
|Key |Value |
|------|------------------|
|397241|587727227839578000|
An example of book 2:
ID
a
b
c
587727227839578000
393
24
0.43
My current solution is:
g1 = openpyxl.load_workbook('path/to/sheet/sheet1.xlsx',read_only=True)
grid1 = g1.active
grid1_rows = list(grid1.rows)
g2 = openpyxl.load_workbook('path/to/sheet2/sheet2.xlsx',read_only=True)
grid2 = g2.active
grid2_rows = list(grid2.rows)
for row in grid1_rows:
value1 = int(row[1].value)
print(value1)
for row2 in grid2_rows:
value2 = int(row2[0].value)
if value1 == value2:
new_Name = int(row[0].value)
print("match")
output_file.write(str(new_Name))
output_file.write(",")
output_file.write(",".join(str(c.value) for c in row2[1:]))
output_file.write("\n")
This solution works, but again the runtime is absurd. Ideally I'd like to take value1 (which comes from the first sheet,) then perform a binary search for that value on the other sheet, then just like my current solution, if it matches, copy the entire row to a new file. then just
If there's an even faster method to do this I'm all ears. I'm not the greatest at python so any help is appreciated.
Thanks.
You are getting your butt kicked here because you are using an inappropriate data structure, which requires you to use the nested loop.
The below example uses sets to match indices from first sheet to those in the second sheet. This assumes there are no duplicates on either sheet, which would seem weird given your problem description. Once we make sets of the indices from both sheets, all we need to do is intersect the 2 sets to find the ones that are on sheet 2.
Then we have the matches, but we can do better. If we put the second sheet row data into dictionary with the indices as the keys, then we can hold onto the row data while we do the match, rather than have to go hunting for the matching indices after intersecting the sets.
I've also put in an enumeration, which may or may not be needed to identify which rows in the spreadsheet are the ones of interest. Probably not needed.
This should execute in the blink of an eye after things are loaded. If you start to have memory issues, you may want to just construct the dictionary at the start rather than the list and the dictionary.
Book 1:
Book 2:
Code:
import openpyxl
g1 = openpyxl.load_workbook('Book1.xlsx',read_only=True)
grid1 = g1.active
grid1_rows = list(grid1.rows)[1:] # exclude the header
g2 = openpyxl.load_workbook('Book2.xlsx',read_only=True)
grid2 = g2.active
grid2_rows = list(grid2.rows)[1:] # exclude the header
# make a set of the values in Book 1 that we want to search for...
search_items = {int(t[0].value) for t in grid1_rows}
#print(search_items)
# make a dictionary (key-value paring) for the items in the 2nd book, and
# include an enumeration so we can capture the row number
lookup_dict = {int(t[0].value) : (idx, t) for idx,t in enumerate(grid2_rows, start=1)}
#print(lookup_dict)
# now let's intersect the set of search items and key values to get the keys of the matches...
keys = search_items & lookup_dict.keys()
#print(keys)
for key in keys:
idx = lookup_dict.get(key)[0] # the row index, if needed
row_data = lookup_dict.get(key)[1] # the row data
print(f'row {idx} matched value {key} and has data:')
print(f' name: {row_data[1].value:10s} \t qty: {int(row_data[2].value)}')
Output:
row 3 matched value 202 and has data:
name: steak qty: 3
row 1 matched value 455 and has data:
name: dogfood qty: 10
I am trying to split misspelled first names. Most of them are joined together. I was wondering if there is any way to separate two first names that are together into two different words.
For example, if the misspelled name is trujillohernandez then to be separated to trujillo hernandez.
I am trying to create a function that can do this for a whole column with thousands of misspelled names like the example above. However, I haven't been successful. Spell-checkers libraries do not work given that these are first names and they are Hispanic names.
I would be really grateful if you can help to develop some sort of function to make it happen.
As noted in the comments above not having a list of possible names will cause a problem. However, and perhaps not perfect, but to offer something try...
Given a dataframe example like...
Name
0 sofíagomez
1 isabelladelgado
2 luisvazquez
3 juanhernandez
4 valentinatrujillo
5 camilagutierrez
6 joséramos
7 carlossantana
Code (Python):
import pandas as pd
import requests
# longest list of hispanic surnames I could find in a table
url = r'https://namecensus.com/data/hispanic.html'
# download the table into a frame and clean up the header
page = requests.get(url)
table = pd.read_html(page.text.replace('<br />',' '))
df = table[0]
df.columns = df.iloc[0]
df = df[1:]
# move the frame of surnames to a list
last_names = df['Last name / Surname'].tolist()
last_names = [each_string.lower() for each_string in last_names]
# create a test dataframe of joined firstnames and lastnames
data = {'Name' : ['sofíagomez', 'isabelladelgado', 'luisvazquez', 'juanhernandez', 'valentinatrujillo', 'camilagutierrez', 'joséramos', 'carlossantana']}
df = pd.DataFrame(data, columns=['Name'])
# create new columns for the matched names
lastname = '({})'.format('|'.join(last_names))
df['Firstname'] = df.Name.str.replace(str(lastname)+'$', '', regex=True).fillna('--not found--')
df['Lastname'] = df.Name.str.extract(str(lastname)+'$', expand=False).fillna('--not found--')
# output the dataframe
print('\n\n')
print(df)
Outputs:
Name Firstname Lastname
0 sofíagomez sofía gomez
1 isabelladelgado isabella delgado
2 luisvazquez luis vazquez
3 juanhernandez juan hernandez
4 valentinatrujillo valentina trujillo
5 camilagutierrez camila gutierrez
6 joséramos josé ramos
7 carlossantana carlos santana
Further cleanup may be required but perhaps it gets the majority of names split.
I have a dataframe 'locations' which contains the genre of some stores, and it is very cluttered with lots of different categories, so I want to combine some of the categories so there are less and more simple categories. How do I do this?
Example:
store type
mcdonalds fast-food
nandos sit-down-food
wetherspoons tech-pub
southsider pub-and-dine
Id like to combine categories fast-food and sit-down-food to become just 'food', and tech-pub and pub-and-dine to become just 'pub'. How do I do this?
My first instinct would be to use the pandas apply function to map the values as desired. Something along the lines of:
import pandas as pd
def nameMapper(name):
if "food" in name:
return "food"
elif "pub" in name:
return "pub"
else:
return "something else"
data = [
["mcdonalds", "fast-food"],
["nandos","sit-down-food"],
["wetherspoons","tech-pub"],
["southsider","pub-and-dine"]
]
df = pd.DataFrame(data, columns={"store", "type"})
print(df)
print("---------------------------")
df["type"] = df["type"].apply(nameMapper)
print(df)
When I ran this the following output was produced
$ python3 answer.py
store type
0 mcdonalds fast-food
1 nandos sit-down-food
2 wetherspoons tech-pub
3 southsider pub-and-dine
---------------------------
store type
0 mcdonalds food
1 nandos food
2 wetherspoons pub
3 southsider pub
You can use a dict keyed by the types you want to replace with the desired types as the values. Then set the column to a list comprehension that replaces the types but keeps the ones you want.
# Dict specifying the types to replace
type_dict = {'fast-food':'food','sit-down-food':'food',
'tech-pub':'pub','pub-and-dine':'pub'}
# Replace types that are dict keys but keep the values that aren't dict keys
df['type'] = [type_dict.get(i,i) for i in df['type']]
I have thousands of rows in a list like the one below that I would like to convert into a pandas table consisting of different columns.
2018-12-03 21:15:24 Sales:120 ID:534343 North America
2018-12-03 21:15:27 Sales:65 ID:534344 Europe
Ideally I would like to to create a pandas structure with the following columns: Date, Sale, ID, Region, and then fill it with values that fit the values.
E.g. so in the first row I have sales = 120, ID = 534343, region = North America and date = 2018-12-03 21:15:24.
Given that I have thousands of rows, what code could make this work?
Supposing your list is in a file, read it first into a string (or into a list already, in which case following code will differ) and then apply code.
To read into a string:
with open('/file/path/myfile.txt','r') as f:
s = f.read()
Code for parsing:
import re
import pandas as pd
s = """2018-12-03 21:15:24 Sales:120 ID:534343 North America
2018-12-03 21:15:27 Sales:65 ID:534344 Europe"""
sales_re = "Sales:([0-9]+)"
id_re = "ID:([0-9]+)"
lst = []
for line in s.split('\n'):
date = line[0:19]
sale = re.search(sales_re, line).groups()[0]
id = re.search(id_re, line).groups()[0]
region = line[line.rfind(":")+1+len(id)+1:] # Search from last ":", add one to go over ":" and 1 to skip space
x = [date, sale, id, region]
lst.append(x)
df = pd.DataFrame(lst)
df.columns = ['date', 'sale', 'id', 'region']
In the example above, I assume everything is loaded into a string. Then I use regular expressions to extract harder part of each line and append everything into a list that. Then I use the pandas.DataFrame constructor to convert into a dataframe.
I am working with the sklearn.datasets.fetch_20newsgroups() dataset. Here, there are some documents that belong to more than one news group. I want to treat those documents as two different entities that each belong to one news group. To do this, I've brought the document IDs and group names into a dataframe.
import sklearn
from sklearn import datasets
data = datasets.fetch_20newsgroups()
filepaths = data.filenames.astype(str)
keys = []
for path in filepaths:
keys.append(os.path.split(path)[1])
groups = pd.DataFrame(keys, columns = ['Document_ID'])
groups['Group'] = data.target
groups.head()
>> Document_ID Group
0 102994 7
1 51861 4
2 51879 4
3 38242 1
4 60880 14
print (len(groups))
>>11314
print (len(groups['Document_ID'].drop_duplicates()))
>>9840
print (len(groups['Group'].drop_duplicates()))
>>20
For each Document_ID, I want to change its value if it has more than one Group number assigned. Example,
groups[groups['Document_ID']=='76139']
>> Document_ID Group
5392 76139 6
5680 76139 17
I want this to become:
>> Document_ID Group
5392 76139 6
5680 12345 17
Here, 12345 is a random new ID that is not already in keys list.
How can I do this?
You can find all the rows that contain duplicate Document_ID after the first with the duplicated methdod. Then create a list of new id's beginning with one more than the max id. Use the loc indexing operator to overwrite the duplicate keys with the new ids.
groups['Document_ID'] = groups['Document_ID'].astype(int)
dupes = groups.Document_ID.duplicated(keep='first')
max_id = groups.Document_ID.max() + 1
new_id = range(max_id, max_id + dupes.sum())
groups.loc[dupes, 'Document_ID'] = new_id
Test case
groups.loc[[5392,5680]]
Document_ID Group
5392 76139 6
5680 179489 17
Ensure that no duplicates remain.
groups.Document_ID.duplicated(keep='first').any()
False
Kinda Hacky, but why not!
data = {"Document_ID": [102994,51861,51879,38242,60880,76139,76139],
"Group": [7,1,3,4,4,6,17],
}
groups = pd.DataFrame(data)
groupDict ={}
tempLst=[]
#Create a list of unique ID's
DocList = groups['Document_ID'].unique()
DocList.tolist()
#Build a dictionary and push all group ids to the correct doc id
DocDict = {}
for x in DocList:
DocDict[x] = []
for index, row in groups.iterrows():
DocDict[row['Document_ID']].append(row['Group'])
#For all doc Id's with multip entries create a new id with the group id as a decimal point.
groups['DupID'] = groups['Document_ID'].apply(lambda x: len(DocDict[x]))
groups["Document_ID"] = np.where(groups['DupID'] > 1, groups["Document_ID"] + groups["Group"]/10,groups["Document_ID"])
Hope that helps...