How to check whether first names and last names are English? - python

I have a csv file with two columns and about 9,000 rows. Column 1 contains the first name of a survey respondent and column 2 contains the last name, so each row is one observation.
These surveys were conducted in a very diverse place. I am trying to find a way to tell whether a respondent's first name is of English (British or American) origin or not, and the same for the last name.
This task is very far from my area of expertise. After reading interesting discussions online (here and here), I have thought about three ways:
1- Take a dataset of the most common triplets (families of 3 letters often found together in English) or quadruplets (families of 4 letters often found together in English) and check, for each first name and last name, whether it contains these families of letters (a minimal sketch is below).
2- Use a dataset of British names (say the X most common names in the UK in the early XX century) and match these names against my dataset based on proximity. These datasets could be good, I think: data1, data2, data3.
3- Use Python and an interface to detect what is (most likely) English from what is not.
If anyone has advice on this, or can share experience, that would be great!
I am attaching an example of the data (I made up the names) and of the expected output.
NB: Please note that I am perfectly aware that classifying names according to an English/non-English dichotomy is not without drawbacks and semantic issues.
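To make option 1 concrete, here is a minimal sketch (the trigram set is a made-up placeholder; a real list would be built from a corpus of English names):
ENGLISH_TRIGRAMS = {"son", "ing", "ton", "har", "ley", "ell"}  # placeholder set

def trigram_score(name):
    # Fraction of a name's letter triplets found in the English trigram set.
    s = name.lower()
    trigrams = [s[i:i + 3] for i in range(len(s) - 2)]
    if not trigrams:
        return 0.0
    return sum(t in ENGLISH_TRIGRAMS for t in trigrams) / len(trigrams)

print(trigram_score('Harrington'))  # relatively high
print(trigram_score('Nguyen'))      # relatively low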

I built something a while back that is quite similar. Summary below:
Created 2 source lists: a first-name list and a last-name list.
Created 4+ comparison lists (English first-name list, English last-name list, etc.).
Then used an in_array function to compare a source first name to a comparison first name.
Then I used a big if statement to check the lists against each other: Eng.First vs Src.First, American.First vs Src.First, Irish.First vs Src.First, and so on.
If you are thinking of using your first bullet as an option (e.g. parts and pieces of a name), I wrote a paper which includes some source code that may be able to help:
Ordered Match Ratio as a Method for Detecting Program Abuse / Fraud
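In Python, the in_array check boils down to a set-membership test. A minimal sketch of the list-comparison idea (list contents are hypothetical placeholders):
english_firstnames = {'james', 'john', 'mary'}    # placeholder comparison lists
english_lastnames = {'brown', 'smith', 'taylor'}

def is_english(name, comparison):
    # Membership test: the Python equivalent of the in_array call above.
    return name.lower() in comparison

for first, last in [('James', 'Brown'), ('Musa', 'Bemba')]:
    print(first, last, is_english(first, english_firstnames), is_english(last, english_lastnames))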

Although the best solution would probably be to train a classification model on top of BERT or a similar language model, a crude solution would be to use zero-shot classification. The example below uses transformers. It does a fairly decent job, although you see some semantic issues pop up: the classification of the name Black, for example, is likely distorted due to it also being a color.
import pandas as pd
from transformers import pipeline

data = [['James', 'Brown'], ['Gerhard', 'Schroeder'], ['Musa', 'Bemba'], ['Morris D.', 'Kemba'], ['Evelyne', 'Fontaine'], ['Max D.', 'Kpali Jr.'], ['Musa', 'Black']]
df = pd.DataFrame(data, columns=['firstname', 'name'])

# Zero-shot classification pipeline; with no model specified, transformers
# downloads its default NLI model.
classifier = pipeline("zero-shot-classification")

firstnames = df['firstname'].tolist()
lastnames = df['name'].tolist()

# The classifier scores each name against both candidate labels.
candidate_labels = ["English or American", "not English or American"]
hypothesis_template = "This name is {}."

results_firstnames = classifier(firstnames, candidate_labels, hypothesis_template=hypothesis_template)
results_lastnames = classifier(lastnames, candidate_labels, hypothesis_template=hypothesis_template)

# Labels are returned sorted by score, so labels[0] is the top prediction.
df['f_english'] = [1 if i['labels'][0] == 'English or American' else 0 for i in results_firstnames]
df['n_english'] = [1 if i['labels'][0] == 'English or American' else 0 for i in results_lastnames]
df
Output:
| | firstname | name | f_english | n_english |
|---:|:------------|:----------|------------:|------------:|
| 0 | James | Brown | 1 | 1 |
| 1 | Gerhard | Schroeder | 0 | 0 |
| 2 | Musa | Bemba | 0 | 0 |
| 3 | Morris D. | Kemba | 1 | 0 |
| 4 | Evelyne | Fontaine | 1 | 0 |
| 5 | Max D. | Kpali Jr. | 1 | 0 |
| 6 | Musa | Black | 0 | 0 |

Related

Python for Excel: while the unique value in a column is the same, grab the remaining row data to the right of the column and append it to a different WS

I am somewhat of a beginner with Python and have encountered the following problem while working with openpyxl. For example, I have the sample worksheet below:
Worksheet
| Boat ID | Emp ID | Emp Name | Start Date | Manager |
|---------|--------|----------|------------|---------|
| 1       | 16044  | Derrick  | ASAP       | Anthony |
| 1       | 16045  | John     | ASAP       | Anthony |
| 1       | 16046  | Bill     | ASAP       | Anthony |
| 1       | 16047  | Joe      | ASAP       | Anthony |
| 2       | 16048  | Justin   | ASAP       | Jacob   |
| 2       | 16049  | Sandy    | ASAP       | Jacob   |
| 2       | 16050  | Omar     | ASAP       | Jacob   |
| 3       | 16051  | Michael  | ASAP       | Nathan  |
| 3       | 16052  | Bill     | ASAP       | Nathan  |
What I am trying to do is loop through the Boat ID column and, while the cell values are the same, take the respective row data to the right, open a new worksheet/workbook, and copy the rows in columns B:E.
So in theory, for every Boat ID = 1 we would take every row unique to ID 1 from columns B:E, open a new workbook, and paste them accordingly. Next, for every Boat ID = 2 we would take the rows with ID = 2 in columns B:E, open a new workbook, and paste accordingly. Similarly, we would repeat the process for every Boat ID = 3.
P.S. To keep it simple I have ordered the table by Boat ID in ascending order, but if someone wants bonus points they could opine on how it would be done if the table was not ordered.
Any help here would be appreciated as I am still learning and a complex problem like this would be beneficial to further enhance my skills.
I know I am way off with what I have tried so far.
I used a dictionary to categorize all of the rows read from the original file according to their Boat ID: the keys are the boat IDs and the values are the rows for each boat.
As you can see, the code will work even if the original xlsx file is not sorted by Boat ID.
import openpyxl

wb = openpyxl.load_workbook("boats.xlsx", read_only=True)
ws = wb.active

# Group rows by Boat ID (column A); using a dict means this works even
# if the sheet is not sorted by Boat ID.
boat_dict = {}
for row in ws.iter_rows(min_row=2, values_only=True):  # min_row=2 skips the header
    boat_id = row[0]
    if boat_id in boat_dict:
        boat_dict[boat_id].append(row[1:])  # keep only columns B:E
    else:
        boat_dict[boat_id] = [row[1:]]

new_wb = openpyxl.Workbook()
for boat_id, boats in boat_dict.items():
    ws_out = new_wb.create_sheet(title="Boat id %s" % boat_id)
    for boat in boats:
        ws_out.append(boat)

new_wb.remove(new_wb["Sheet"])  # drop the default empty sheet
new_wb.save("boats_ans.xlsx")
hope I could help :)

cleaning my dataframe (similar lines and \xc3\x28 in the field)

I am working on dataframes with Python.
In my first dataframe df1 I have:
+-----+-------------------+--------------+-----------------+
| ID  | PUBLICATION TITLE | DATE         | JOURNAL         |
+-----+-------------------+--------------+-----------------+
| 1   | "a"               | "01/10/2000" | "book1"         |
| 2   | "b"               | "09/03/2005" | NaN             |
| NaN | "b"               | "09/03/2005" | "book2"         |
| 5   | "z"               | "21/08/1995" | "book4"         |
| 6   | "n"               | "15/04/1993" | "book9\xc3\x28" |
+-----+-------------------+--------------+-----------------+
Here I would like to clean my dataframe, but I don't know how to do it in this case. There are two points that block me.
The first is that lines 2 and 3 seem to be the same record, because the publication title is the same, and I think a publication title is unique to a journal.
The second is the \xc3\x28 in the last line.
How can I clean my dataframe in a smart way, so that I can reuse this code for other dataframes if possible?
First you should remove the row with ID = NaN. This can be done by:
df1 = df1[df1['ID'].notna()]
Then update the journal of the 2nd row:
df1.iloc[1, df1.columns.get_loc('JOURNAL')] = 'book2'
Finally, for the 'book9\xc3\x28' entry, you can update it; note that after dropping the NaN row the frame has only four rows, so this entry is at positional index 3, not 4:
df1.iloc[3, df1.columns.get_loc('JOURNAL')] = 'book9'
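For something reusable on other dataframes, a sketch of a more generic cleanup (assuming duplicate records share the same title and date, and that the \xc3\x28 debris appears as literal \xHH escape text in the field):
# Merge duplicate records: within each (title, date) group, first() keeps the
# first non-null value per column, so the ID from one row and the JOURNAL from
# its duplicate end up combined in a single row.
df1 = df1.groupby(['PUBLICATION TITLE', 'DATE'], as_index=False).first()

# Strip literal \xHH escape debris such as \xc3\x28 from the JOURNAL column.
df1['JOURNAL'] = df1['JOURNAL'].str.replace(r'(\\x[0-9a-fA-F]{2})+', '', regex=True)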
What type of encoding are you using? I recommend using "utf8" encoding for this purpose.
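For example, if the data comes from a CSV file (file name hypothetical):
import pandas as pd

# Reading with an explicit encoding avoids mis-decoded bytes like \xc3\x28.
df1 = pd.read_csv("publications.csv", encoding="utf-8")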

How to find similarity distance between user's preference vector and items description table (matrices that are not the same size) in python?

I have two different datasets.
Users' "taste" table:
+---------+--------+-----------+-----------+----------+-------+
| user_id | Action | Adventure | Animation | Children | Drama |
+---------+--------+-----------+-----------+----------+-------+
| 100     | 0      | 1         | 2         | 1        | 0     |
| 101     | 1      | 4         | 0         | 3        | 0     |
+---------+--------+-----------+-----------+----------+-------+
Movies' genre table:
+----------+--------+-----------+-----------+----------+-------+
| movie_id | Action | Adventure | Animation | Children | Drama |
+----------+--------+-----------+-----------+----------+-------+
| 1001     | 0      | 1         | 1         | 1        | 0     |
| 1001     | 0      | 1         | 0         | 1        | 0     |
+----------+--------+-----------+-----------+----------+-------+
I am trying to recommend to the user the N most similar movies based on his taste. What I thought of is to measure the similarity distance (cosine similarity / dot product) between the user and each movie and return the top N most similar ones. What is the right way to implement this in Python?
It is an easy question, but depending on the type of distance and the size of the data, the answer can be complex. I'll give you some hooks to start with.
sklearn has distance metrics implemented, which you can use right away to calculate the distance between items and, for instance with the help of argmax, find the best match. This is the naive approach, but it works fine on small datasets and you have the flexibility to use any metric you want (see the sketch below).
The distances also have pairwise implementations, made to easily calculate distance matrices and quickly find the best match. But you can imagine that for larger datasets this strategy won't work anymore.
When the data grows, you can use the BallTree algorithm to quickly find (1) the k nearest movies, or (2) all movies within a certain threshold. This algorithm is well implemented in sklearn, and I would advise starting with this approach, as it is a good balance between fast and easy to implement.
Another option is to use a specialised package like faiss or ann. I would only use these if the above fails in terms of speed or data size.
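A minimal sketch of the naive approach, using sklearn's cosine similarity on the two example rows from the tables above (top-N selection via argsort):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Genre columns shared by both tables (Action ... Drama).
users = np.array([[0, 1, 2, 1, 0],    # user 100
                  [1, 4, 0, 3, 0]])   # user 101
movies = np.array([[0, 1, 1, 1, 0],
                   [0, 1, 0, 1, 0]])
movie_ids = np.array([1001, 1001])

# Rows correspond to users, columns to movies.
sim = cosine_similarity(users, movies)

N = 1
for user_row in sim:
    top_n = movie_ids[np.argsort(user_row)[::-1][:N]]  # highest similarity first
    print(top_n)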

How can I make the groups in a specific column appear when the groupby method is used in order to merge?

x = (df.groupby(['id_gamer'])[['sucess', 'nb_games']]
       .shift(periods=1)
       .cumsum()
       .apply(lambda row: row.sucess / row.nb_games, axis=1))
In the code above, I do a groupby on a pandas.DataFrame to obtain a shifted column of results expressed as a ratio, for each gamer and each game; essentially his success rate given the number of games he has played.
It returns a pandas.core.series.Series object as:
+---------------+----------------+
| Index | Computed_ratio |
+---------------+----------------+
| id_game_date | NaN |
| id_game2_date | 0.30 |
| id_game3_date | 0.40 |
| id_game_date | NaN |
| id_game4_date | 0.50 |
| ... | ... |
+---------------+----------------+
You can see the NaN values as the delimitation between gamers. Note also that the first gamer and the second one met in one game: id_game_date. This is why I would prefer the id_gamer column to appear in the result, so I can merge it back with the dataframe the data comes from.
To be honest, I have an idea for a solution: just do not use the game IDs as the index; then each row will be indexed correctly and there will be no conflict when I do a merge, I guess. But I would like to know if it is possible with the current pattern shown here.
NB: I have already tried the solutions presented in this topic, but none of them works, probably because the functions shown there are aggregations, unlike mine: cumsum(). If I use an aggregating function like sum() (with a different code pattern; do not try it with the one I gave you or it will return an error), id_gamer appears, but that does not match my expectations.
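For what it's worth, a sketch of one way to keep id_gamer available (column names from the question; computing the cumulative sums per gamer is an assumption about the intent, since a plain cumsum() after groupby().shift() would run across gamers):
import pandas as pd

# Keep id_gamer as a regular column instead of using game ids as the index.
shifted = df.groupby('id_gamer')[['sucess', 'nb_games']].shift(periods=1)

# Cumulative sums restricted to each gamer (grouping by the original column).
cum = shifted.groupby(df['id_gamer']).cumsum()

# The result aligns on df's index, so it can be assigned (or merged) directly,
# and id_gamer stays available for the merge.
df['success_ratio'] = cum['sucess'] / cum['nb_games']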

Pandas mapping to multiple dictionary items for categorising data

I have a large dataframe containing a 'Description' column.
I've compiled a sizeable dictionary of lists, where the key is basically the Category, and the items are lists of possible (sub)strings contained in the description column.
I want to use the dictionary to classify each entry in the dataframe based on this description. Unfortunately, I can't figure out how to apply a dictionary of lists to map to a dataframe (it feels like it would be some concoction of map, isin and str.contains, but I have had no joy). I've included code to generate a model dataset below:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 1), columns=list('A'))
df['Description'] = ['White Ford Escort', 'Irish Draft Horse',
                     'Springer spaniel (dog)', 'Green Vauxhall Corsa',
                     'White Van', 'Labrador dog', 'Black horse', 'Blue Van',
                     'Red Vauxhall Corsa', 'Bear']
This model dataset would then ideally be somehow mapped against the following dictionary:
d = {'Car': ['Ford Escort', 'Vauxhall Corsa', 'Van'],
     'Animal': ['Dog', 'Horse']}
to generate a new column in the dataframe, with the result as such:
| | A | Description | Type |
|---|----------------------|------------------------|--------|
| 0 | -1.4120290137842615 | White Ford Escort | Car |
| 1 | -0.3141036399049358 | Irish Draft Horse | Animal |
| 2 | 0.49374344901643896 | Springer spaniel (dog) | Animal |
| 3 | 0.013654965767323723 | Green Vauxhall Corsa | Car |
| 4 | -0.18271952280002862 | White Van | Car |
| 5 | 0.9519081000007026 | Labrador dog | Animal |
| 6 | 0.403258571154998 | Black horse | Animal |
| 7 | -0.8647792960494813 | Blue Van | Car |
| 8 | -0.12429427259820519 | Red Vauxhall Corsa | Car |
| 9 | 0.7695980616520571 | Bear | - |
The numbers are obviously irrelevant here, but there are other columns in the dataframe and I wanted this reflected.
I'm happy to use regex, or perhaps to change my dictionary to a dataframe and do a join (I've considered multiple routes).
This feels similar to a recent question, but it's not the same, and its answer certainly hasn't helped me.
Sorry if I've been stupid somewhere and this is really simple; it does feel like it should be, but I'm missing something.
Thanks
You can use the fuzzywuzzy library to solve this. Make sure to install it via pip install fuzzywuzzy.
import numpy as np
import pandas as pd
from fuzzywuzzy import process

df = pd.DataFrame(np.random.randn(10, 1), columns=list('A'))
df['Description'] = ['White Ford Escort', 'Irish Draft Horse',
                     'Springer spaniel (dog)', 'Green Vauxhall Corsa',
                     'White Van', 'Labrador dog', 'Black horse', 'Blue Van',
                     'Red Vauxhall Corsa', 'Bear']
d = {'Car': ['Ford Escort', 'Vauxhall Corsa', 'Van'],
     'Animal': ['Dog', 'Horse']}

# Construct a long-format dataframe from the dictionary: one row per
# (Type, substring) pair.
df1 = pd.DataFrame([*d.values()], index=d.keys()).T.melt().dropna()

# Get the best fuzzy match for each description.
m = df.Description.apply(lambda x: process.extract(x, df1.value)[0])

# Keep only matches scoring above 80 and concat them with the original df.
df2 = pd.concat([df, m[m.apply(lambda x: x[1] > 80)].apply(lambda x: x[0])], axis=1)
df2.columns = [*df.columns, 'matches']

# Merge with df1 to pull in the Type for each matched substring.
df2 = df2.merge(df1, left_on='matches', right_on='value', how='left')

# Drop the helper columns and rename.
df2 = df2.drop(columns=['matches', 'value']).rename(columns={'variable': 'Type'})
print(df2)
A Description Type
0 -0.423555 White Ford Escort Car
1 0.294092 Irish Draft Horse Animal
2 1.949626 Springer spaniel (dog) Animal
3 -1.315937 Green Vauxhall Corsa Car
4 -0.250184 White Van Car
5 0.186645 Labrador dog Animal
6 -0.052433 Black horse Animal
7 -0.003261 Blue Van Car
8 0.418292 Red Vauxhall Corsa Car
9 0.241607 Bear NaN
Alternatively, consider the following approach:
1- Invert your dictionary first, while making everything lowercase.
2- Per row, split Description into words and make them lowercase, e.g. 'Springer spaniel (dog)' -> ['springer', 'spaniel', '(', 'dog', ')'].
3- For each lowercase word from (2), look it up in the inverted dictionary from (1), using apply; a sketch is below.
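A minimal sketch of this (assuming the df and d defined above; punctuation handling kept simple with a regex tokenizer):
import re

# Step (1): invert the dictionary, word by word, lowercased.
inverted = {w.lower(): cat
            for cat, phrases in d.items()
            for phrase in phrases
            for w in phrase.split()}

def classify(description):
    # Step (2): split into lowercase word tokens (drops punctuation like '(').
    words = re.findall(r'\w+', description.lower())
    # Step (3): return the first category hit, if any.
    for w in words:
        if w in inverted:
            return inverted[w]
    return None

df['Type'] = df['Description'].apply(classify)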
