Python - function similar to VLOOKUP (Excel)

I am trying to join two dataframes but cannot get my head around the possibilities Python has to offer.
First dataframe:
ID  MODEL    REQUESTS  ORDERS
1   Golf     123       4
2   Passat   34        5
3   Model 3  500       8
4   M3       5         0
Second dataframe:
MODEL    TYPE   MAKE
Golf     Sedan  Volkswagen
M3       Coupe  BMW
Model 3  Sedan  Tesla
What I want is to add another column in the first dataframe called "make" so that it looks like this:
ID  MODEL    MAKE        REQUESTS  ORDERS
1   Golf     Volkswagen  123       4
2   Passat   Volkswagen  34        5
3   Model 3  Tesla       500       8
4   M3       BMW         5         0
I already looked at merge, join and map but all examples just appended the required information at the end of the dataframe.

I think you can use insert together with map on a Series created from df2 (if a value from df1's MODEL column is missing in df2, you get NaN):
df1.insert(2, 'MAKE', df1['MODEL'].map(df2.set_index('MODEL')['MAKE']))
print(df1)
   ID    MODEL        MAKE  REQUESTS  ORDERS
0   1     Golf  Volkswagen       123       4
1   2   Passat         NaN        34       5
2   3  Model 3       Tesla       500       8
3   4       M3         BMW         5       0
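One caveat: set_index('MODEL') assumes the MODEL values in df2 are unique. If df2 could contain duplicates, a sketch of a safer lookup (dropping duplicate models first, so map does not raise on a non-unique index):
# keep only the first MAKE per MODEL so the lookup Series has a unique index
mapping = df2.drop_duplicates('MODEL').set_index('MODEL')['MAKE']
df1.insert(2, 'MAKE', df1['MODEL'].map(mapping))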

Although it is not the case here, there might be scenarios where df2 has more than two columns and you only want to add one of them to df1, based on a specific column as the key. Here is a generic snippet you may find useful.
df = pd.merge(df1, df2[['MODEL', 'MAKE']], on='MODEL', how='left')
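Since the question notes that merge appends the new column at the end, you can reorder the columns afterwards; a minimal follow-up using the column names from the example above:
# merge puts MAKE last; reorder the columns explicitly
df = df[['ID', 'MODEL', 'MAKE', 'REQUESTS', 'ORDERS']]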

The join method acts very similarly to a VLOOKUP. It joins a column in the first dataframe to the index of the second dataframe, so you must set MODEL as the index of the second dataframe and grab only the MAKE column.
df1.join(df2.set_index('MODEL')['MAKE'], on='MODEL')
Take a look at the documentation for join as it actually uses the word VLOOKUP.

I always found merge to be an easy way to do this:
df1.merge(df2[['MODEL', 'MAKE']], how='left')
However, I must admit it would not be as short and nice if you wanted to call the new column something other than 'MAKE'.
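If you did want a different name, renaming on the fly keeps it a one-liner; a sketch where 'BRAND' is just a placeholder for whatever name you want:
# rename the looked-up column before merging it in; merge still joins on MODEL
df1.merge(df2[['MODEL', 'MAKE']].rename(columns={'MAKE': 'BRAND'}), how='left')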

Related

Creating an ID for every row based on the observations in a variable

I want to create a system in Python where each observation in a variable maps to a number. The numbers from the (in this case) 5 different variables together form a unique code. The first number corresponds to the first variable. When an observation in a different row is the same as the first, the same number applies. As illustrated in the example, if Apple appears in rows 1 and 3, both IDs get a '1' as the first number.
The output should be a new column with the ID. If all the observations in two rows are the same, their IDs will be the same. The example below shows 5 variables leading to the unique ID on the right, which should be the output.
You can use pd.factorize:
# factorize each column (codes start at 0, so add 1), cast the codes to
# strings, then concatenate them row-wise into a single ID
df['UniqueID'] = (df.apply(lambda x: (1 + pd.factorize(x)[0]).astype(str))
                    .agg(''.join, axis=1))
print(df)
# Output
        Fruit     Toy Letter      Car Country UniqueID
0       Apple    Bear      A  Ferrari  Brazil    11111
1  Strawberry  Blocks      B  Peugeot   Chile    22222
2       Apple  Blocks      C  Renault   China    12333
3      Orange    Bear      D     Saab   China    31443
4      Orange    Bear      D  Ferrari   India    31414
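To see why this works: pd.factorize encodes each distinct value by order of first appearance. A quick illustration:
import pandas as pd

codes, uniques = pd.factorize(['Apple', 'Strawberry', 'Apple', 'Orange'])
print(codes)    # [0 1 0 2] -> adding 1 gives 1, 2, 1, 3
print(uniques)  # ['Apple' 'Strawberry' 'Orange']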

Pandas Dataframe : Using same category codes on different existing dataframes with same category

I have two pandas dataframes with some columns in common. These columns are of type category but unfortunately the category codes don't match for the two dataframes. For example I have:
>>> df1
artist song
0 The Killers Mr Brightside
1 David Guetta Memories
2 Estelle Come Over
3 The Killers Human
>>> df2
artist date
0 The Killers 2010
1 David Guetta 2012
2 Estelle 2005
3 The Killers 2006
But:
>>> df1['artist'].cat.codes
0 55
1 78
2 93
3 55
Whereas:
>>> df2['artist'].cat.codes
0 99
1 12
2 23
3 99
What I would like is for my second dataframe df2 to take the same category codes as the first one df1 without changing the category values. Is there any way to do this?
(Edit)
Here is a screenshot of my two dataframes. Essentially, I want song_tags to have the same category codes for artist_name and track_name as the songs dataframe. song_tags is created from a merge between songs and another tag dataframe (which contains song data and their tags, without the user information) and is then saved and loaded through pickle. It might also be relevant that I had to cast artist_name and track_name in song_tags from type object to type category.
I think essentially my question is: how to modify category codes of an existing dataframe column?
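One way to line up the codes, sketched under the assumption that both columns contain the same set of artists, is to copy df1's category list (and its ordering) onto df2 with cat.set_categories:
# reuse df1's categories so equal values get equal codes;
# any df2 value not present in df1's categories would become NaN
df2['artist'] = df2['artist'].cat.set_categories(df1['artist'].cat.categories)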

How to detect a string in dataframe column from a list of names in another dataframe column

I am trying to find out whether a news article contains the name of a company from a list I have already established as a dataframe column. I have one dataframe that contains the article text as a column, and another dataframe with the names of the companies. I would like to search each article's text to detect whether any name from the list appears in it, and create a separate variable containing the name of the company found in the text. Someone recommended using merge, but since I do not have a common identifier, that was not possible. I hope the following example illustrates the idea.
First Dataframe (Article):
Index  Text
1      Apple decided to launch new product....
2      Tesla is ...
3      IBM is paying dividend......
4      Amazon is relocating.....
...    ...
Second Dataframe with company name (Compname):
Index  Name
1      BP
2      Tesla
3      Bank of America
4      Amazon
5      JP Morgan
6      Apple
...    ...
What I want to see in the end would be the following:
Index  Text                                      Name_found
1      Apple decided to launch new product....   Apple
2      Tesla is ...                              Tesla
3      IBM is paying dividend......              NaN
4      Amazon is relocating.....                 Amazon
...    ...                                       ...
I tried something like the following, but it didn't quite get the job done:
for x in compname['Name']:
    Article['Name_found'] = Article['Text'].str.contains(x, na=False)
Thank you for your help. Truly appreciate it.
Do you want this?
# here df1 holds the company names and df2 the article text
pattern = r'(' + '|'.join(df1['Name'].to_list()) + ')'
df2['Text'] = df2['Text'].str.extract(pat=pattern)
print(df2)
The idea is to build a regex pattern with multiple OR conditions. In this case the pattern will look like this:
'(BP|Tesla|Bank of America|Amazon|JP Morgan|Apple)'
Output:
   Index    Text
0      1   Apple
1      2   Tesla
2      3     NaN
3      4  Amazon
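To keep the original article text and produce the desired Name_found column instead of overwriting Text, a small variation (a sketch; re.escape guards against company names containing regex metacharacters):
import re

# escape each name so characters like '.' are matched literally
pattern = r'(' + '|'.join(re.escape(n) for n in df1['Name']) + ')'
df2['Name_found'] = df2['Text'].str.extract(pattern, expand=False)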

How to link two dataframes based on the string similarity of one column

I have two dataframes; both have an ID column and a Name column that contains strings. They might look like this:
Dataframes:
DF-1                     DF-2
---------------------    ---------------------
    ID   Name                ID   Name
1   56   aaeessa         1   12   H.P paRt 1
2   98   1o7v9sM         2   76   aa3esza
3   175  HP. part 1      3   762  stakoverfl
4   2    stackover       4   2    lo7v9Sm
I would like to compute the string similarity (Ex: Jaccard, Levenshtein) between one element with all the others and select the one that has the highest score. Then match the two IDs so I can join the complete Dataframes later. The resulting table should look like this:
Result:
-----------------
   ID1  ID2
1   56   76
2   98    2
3  175   12
4    2  762
This could easily be achieved with a double for loop, but I'm looking for an elegant (and faster) way to accomplish this: maybe lambdas, a list comprehension, or some pandas tool. Maybe some combination of groupby and idxmax for the similarity score, but I can't quite come up with the solution by myself.
EDIT: The DataFrames are of different lengths; one of the purposes of this function is to determine which elements of the smaller dataframe appear in the larger dataframe and match those, discarding the rest. So the resulting table should only contain pairs of IDs that match, or pairs of ID1 - NaN (assuming DF-1 has more rows than DF-2).
Using the pandas dedupe package: https://pypi.org/project/pandas-dedupe/
You need to train the classifier with human input and then it will use the learned setting to match the whole dataframe.
First pip install pandas-dedupe, then try this:
import pandas as pd
import pandas_dedupe

df1 = pd.DataFrame({'ID': [56, 98, 175],
                    'Name': ['aaeessa', '1o7v9sM', 'HP. part 1']})

df2 = pd.DataFrame({'ID': [12, 76, 762, 2],
                    'Name': ['H.P paRt 1', 'aa3esza', 'stakoverfl ', 'lo7v9Sm']})

# initiate matching
df_final = pandas_dedupe.link_dataframes(df1, df2, ['Name'])

# reset index
df_final = df_final.reset_index(drop=True)

# print result
print(df_final)
    ID        Name  cluster id  confidence
0   98     1o7v9sm         0.0    1.000000
1    2     lo7v9sm         0.0    1.000000
2  175  hp. part 1         1.0    0.999999
3   12  h.p part 1         1.0    0.999999
4   56     aaeessa         2.0    0.999967
5   76     aa3esza         2.0    0.999967
6  762  stakoverfl         NaN         NaN
You can see that matched pairs are assigned a cluster and a confidence level, while unmatched rows get NaN. You can now analyse this information however you wish; perhaps only keep results with a confidence level above 80%, for example.
I suggest a library called the Python Record Linkage Toolkit.
Once you import the library, you must index the sources you intend to compare, something like this:
indexer = recordlinkage.Index()
# block on a key shared by both sources, here 'id'
indexer.block('id')
candidate_links = indexer.index(df_1, df_2)
c = recordlinkage.Compare()
Let's say you want to compare based on the similarity of strings that don't match exactly:
c.string('name', 'name', method='jarowinkler', threshold=0.85)
And if you want an exact match you should use:
c.exact('name')
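To actually produce the comparison results, you then run the compare step over the candidate pairs. A sketch continuing the snippet above (the filtering is deliberately simplistic; see the toolkit's docs for the full workflow):
# compute comparison vectors for all candidate pairs
features = c.compute(candidate_links, df_1, df_2)

# keep pairs where at least one comparison cleared its threshold
matches = features[features.sum(axis=1) > 0]
print(matches)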
Using my fuzzy_wuzzy function from the linked answer:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

mrg = fuzzy_merge(df1, df2, 'Name', 'Name', threshold=70)\
        .merge(df2, left_on='matches', right_on='Name', suffixes=['1', '2'])\
        .filter(like='ID')
Output
   ID1  ID2
0   56   76
1   98    2
2  175   12
3    2  762
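The fuzzy_merge helper itself lives in the linked answer, which is not reproduced here; a minimal sketch of what it might look like (my reconstruction, built on fuzzywuzzy's process.extract):
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=1):
    # for each value in df_1[key1], find the closest match(es) in df_2[key2]
    # and keep them only if the similarity score clears the threshold
    choices = df_2[key2].tolist()
    matches = df_1[key1].apply(lambda x: process.extract(x, choices, limit=limit))
    df_1['matches'] = matches.apply(
        lambda m: ', '.join(name for name, score in m if score >= threshold))
    return df_1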

sort pandas dataframe based on list

I would like to sort the following dataframe:
Region  LSE     North       South
0       Cn       33.330367    9.178917
1       Develd  -36.157025  -27.669988
2       Wetnds  -38.480206  -46.089908
3       Oands   -47.986764  -32.324991
4       Otherg  323.209834   28.486310
5       Soys     34.936147    4.072872
6       Wht       0.983977  -14.972555
I would like to sort it so the LSE column is reordered based on the list:
lst = ['Oands','Wetnds','Develd','Cn','Soys','Otherg','Wht']
Of course, the other columns will need to be reordered accordingly as well. Is there any way to do this in pandas?
The improved support for Categoricals in pandas version 0.15 allows you to do this easily:
df['LSE_cat'] = pd.Categorical(
    df['LSE'],
    categories=['Oands','Wetnds','Develd','Cn','Soys','Otherg','Wht'],
    ordered=True
)
df.sort('LSE_cat')
Out[5]:
   Region     LSE       North      South LSE_cat
3       3   Oands -47.986764 -32.324991   Oands
2       2  Wetnds -38.480206 -46.089908  Wetnds
1       1  Develd -36.157025 -27.669988  Develd
0       0      Cn  33.330367   9.178917      Cn
5       5    Soys  34.936147   4.072872    Soys
4       4  Otherg 323.209834  28.486310  Otherg
6       6     Wht   0.983977 -14.972555     Wht
If this is only a temporary ordering then keeping the LSE column as
a Categorical may not be what you want, but if this ordering is
something that you want to be able to make use of a few times
in different contexts, Categoricals are a great solution.
In later versions of pandas, sort has been replaced by sort_values, so you would instead need:
df.sort_values('LSE_cat')
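If you don't want the helper column at all, a set_index/reindex round trip achieves the same ordering; a sketch, assuming every value in lst actually occurs in the LSE column:
# reorder rows to follow lst, then restore LSE as a regular column
df_sorted = df.set_index('LSE').reindex(lst).reset_index()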
