Doing VLOOKUP-like things in Python with multiple lookup values

Many of us know that the syntax for a VLOOKUP function in Excel is as follows:
=vlookup([lookup value], [lookup table/range], [column selected], [approximate/exact match (optional)])
I want to do something similar in Python with a lookup table (in dataframe form) that looks something like this:
Name Date of Birth ID#
Jack 1/1/2003 0
Ryan 1/8/2003 1
Bob 12/2/2002 2
Jack 3/9/2003 3
...and so on. Note how the two Jacks are assigned different ID numbers because they are born on different dates.
Say I have something like a gradebook (again, in dataframe form) that looks like this:
Name Date of Birth Test 1 Test 2
Jack 1/1/2003 89 91
Ryan 1/8/2003 92 88
Jack 3/9/2003 93 79
Bob 12/2/2002 80 84
...
How do I make it so that the result looks like this?
ID# Name Date of Birth Test 1 Test 2
0 Jack 1/1/2003 89 91
1 Ryan 1/8/2003 92 88
3 Jack 3/9/2003 93 79
2 Bob 12/2/2002 80 84
...
It seems to me that the "lookup value" would involve multiple columns of data ('Name' and 'Date of Birth'). I kind of know how to do this in Excel, but how do I do it in Python?

Turns out that I can just do
pd.merge([lookup value], [lookup table], on=['Name', 'Date of Birth'])
which produces
Name Date of Birth Test 1 Test 2 ID#
Jack 1/1/2003 89 91 0
Ryan 1/8/2003 92 88 1
Jack 3/9/2003 93 79 3
Bob 12/2/2002 80 84 2
...
Then all that is left is to move the last column to the front.
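A minimal sketch of that merge on the sample data from the question. The variable names lookup and gradebook are made up for illustration, and how="left" is an assumption added here so that gradebook rows without a match would be kept (with a missing ID#) instead of being dropped.

```python
import pandas as pd

# Lookup table: the (Name, Date of Birth) pair identifies a student.
lookup = pd.DataFrame({
    "Name": ["Jack", "Ryan", "Bob", "Jack"],
    "Date of Birth": ["1/1/2003", "1/8/2003", "12/2/2002", "3/9/2003"],
    "ID#": [0, 1, 2, 3],
})

gradebook = pd.DataFrame({
    "Name": ["Jack", "Ryan", "Jack", "Bob"],
    "Date of Birth": ["1/1/2003", "1/8/2003", "3/9/2003", "12/2/2002"],
    "Test 1": [89, 92, 93, 80],
    "Test 2": [91, 88, 79, 84],
})

# Match on both columns at once; a left merge keeps the gradebook row order.
merged = pd.merge(gradebook, lookup, on=["Name", "Date of Birth"], how="left")

# Move the looked-up column to the front.
result = merged[["ID#"] + [c for c in merged.columns if c != "ID#"]]
print(result)
```

Selecting the columns in a new order returns a new dataframe and leaves merged untouched.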

Related

I want to separate a data frame based on marks and then download it

This is my data frame:
Name Age Stream Percentage
0 A 21 Math 88
1 B 19 Commerce 92
2 C 20 Arts 95
3 D 18 Biology 70
0 E 21 Math 88
1 F 19 Commerce 92
2 G 20 Arts 95
3 H 18 Biology 70
I want to download a different Excel file for each subject in one loop, so I should end up with 4 Excel files, one per subject.
I tried this, but it didn't work:
n=0
for subjects in df.stream:
    df.to_excel("sub"+ str(n)+".xlsx")
    n+=1
I think groupby is helpful here, and you can use enumerate to keep track of the index.
for i, (group, group_df) in enumerate(df.groupby('Stream')):
    group_df.to_excel('sub{}.xlsx'.format(i))
    # Alternatively, to name the file based on the stream...
    group_df.to_excel('sub{}.xlsx'.format(group))
group is going to be the name of the stream.
group_df is going to be a sub-dataframe containing all the data in that group.
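A small self-contained sketch of the groupby split, assuming the column is spelled Stream as in the table above. The files dictionary and the sub_ file-name pattern are illustrative choices, not part of the original answer. The to_excel call is left commented out because it needs an Excel engine such as openpyxl to be installed.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": list("ABCDEFGH"),
    "Age": [21, 19, 20, 18, 21, 19, 20, 18],
    "Stream": ["Math", "Commerce", "Arts", "Biology"] * 2,
    "Percentage": [88, 92, 95, 70, 88, 92, 95, 70],
})

# Map each output file name to the rows belonging to that stream.
files = {
    "sub_{}.xlsx".format(stream): stream_df
    for stream, stream_df in df.groupby("Stream")
}

for filename, stream_df in files.items():
    print(filename, len(stream_df))
    # Uncomment to write the files (requires an Excel engine, e.g. openpyxl):
    # stream_df.to_excel(filename, index=False)
```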

Removing from pandas dataframe all rows having less than 3 characters

I have this dataframe
Word Frequency
0 : 79
1 , 60
2 look 26
3 e 26
4 a 25
... ... ...
95 trump 2
96 election 2
97 step 2
98 day 2
99 university 2
I would like to remove all words having fewer than 3 characters.
I tried as follows:
df['Word']=df['Word'].str.findall('\w{3,}').str.join(' ')
but it does not remove them from my dataset.
Can you please tell me how to remove them?
My expected output would be:
Word Frequency
2 look 26
... ... ...
95 trump 2
96 election 2
97 step 2
98 day 2
99 university 2
Try:
df = df[df['Word'].str.len()>=3]
Instead of attempting a regular expression, you can use .str.len() to get the length of each string in your column. Then you can simply filter on that length being >= 3.
It should look like:
df.loc[df["Word"].str.len() >= 3]
Please try:
df[df.Word.str.len()>=3]
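For reference, a short runnable sketch of the length filter on a cut-down version of the sample data (only a few of the rows are reproduced here). The findall attempt in the question turns short words into empty strings but keeps their rows, whereas a boolean mask drops the rows themselves.

```python
import pandas as pd

df = pd.DataFrame({
    "Word": [":", ",", "look", "e", "a", "trump", "day"],
    "Frequency": [79, 60, 26, 26, 25, 2, 2],
})

# Boolean mask: True where the word has at least 3 characters.
mask = df["Word"].str.len() >= 3

# Indexing with the mask keeps only the matching rows (original index labels are preserved).
filtered = df[mask]
print(filtered)
```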

Divide a dataframe on a special threshold

I got a DataFrame as an example:
name age
Ashe 12
Ashe 13
Ashe 23
John 33
John 45
Karin 55
David 84
Zaki 34
Mano 45
My threshold is a number of distinct names: for example, I need 3 distinct names per DataFrame, so the first output should be:
name age
Ashe 12
Ashe 13
Ashe 23
John 33
John 45
Karin 55
and the second DF :
name age
David 84
Zaki 34
Zaki 23
Zaki 35
Mano 45
what can I do?
from itertools import islice
def chunk(lst, size):
    lst = iter(lst)
    return iter(lambda: tuple(islice(lst, size)), ())
name_groups = list(chunk(df.name.unique(),3))
data = {}
for i, group in enumerate(name_groups):
    data[f'df{i}'] = df[df.name.isin(group)]
The chunk function splits an iterable into chunks of the given size (3 in our case).
You can read more here: https://stackoverflow.com/a/22045226/13104290
name_groups contains a list of tuples with up to 3 elements each:
[('Ashe', 'John', 'Karin'), ('David', 'Zaki', 'Mano')]
Since we passed df.name.unique(), there are no duplicates.
Now we need to create each new dataframe dynamically. We do this by creating a dictionary and adding each new partition one at a time.
The dictionary now contains two dataframes, df0 and df1.
data['df0'] :
name age
0 Ashe 12
1 Ashe 13
2 Ashe 23
3 John 33
4 John 45
5 Karin 55
data['df1']:
name age
6 David 84
7 Zaki 34
8 Mano 45
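As an alternative sketch (not the approach of the answer above), the same split can be done without a helper function by numbering the distinct names with pd.factorize and integer-dividing by the chunk size. The names names_per_chunk, chunk_id, and parts are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ashe", "Ashe", "Ashe", "John", "John", "Karin", "David", "Zaki", "Mano"],
    "age": [12, 13, 23, 33, 45, 55, 84, 34, 45],
})

names_per_chunk = 3

# factorize numbers the distinct names 0, 1, 2, ... in order of first appearance.
codes, uniques = pd.factorize(df["name"])

# Integer division turns the name number into a chunk number for every row.
chunk_id = codes // names_per_chunk

# One sub-dataframe per chunk, in chunk order.
parts = [part for _, part in df.groupby(chunk_id)]

for part in parts:
    print(part, end="\n\n")
```

Because the chunk number is computed per row, repeated names always land in the same part.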

How to sum multiple values in a dataframe column, if they are corresponding to 1 value in an other column

I have a data frame like this:
Code Group Name Number
ABC Group_1_ABC Mike 40
Amber 60
Group_2_ABC Rachel 90
XYZ Group_1_XYZ Bob 30
Peter 75
Nikki 55
Group_2_XYZ Julia 23
Ross 80
LMN Group_1_LMN Paul 95
. . . .
. . . .
I created this data frame by grouping by code, group, and name and summing the number.
Now I want to calculate the percentage of each name within a particular code. For that, I want to sum all the numbers that belong to one code. This is what I was doing to calculate the percentage:
df['Percentage']= (df['Number']/df['??'])*100
I can't figure out how to calculate the total sum. I want the total for each code category in order to calculate the percentage.
So, for example, for Code ABC the total should be 40+60+90=190. Each user's number in ABC would then be divided by this 190 to calculate their percentage within their code category. So the Group and Name columns don't play any role in calculating the total for each code category.
Use GroupBy.transform with the first index level, or with the level name Code:
df['Percentage']= (df['Number']/df.groupby(level=0)['Number'].transform('sum'))*100
df['Percentage']= (df['Number']/df.groupby(level=['Code'])['Number'].transform('sum'))*100
Or, in recent pandas versions, it is not necessary to specify the level parameter:
df['Percentage']= (df['Number']/df.groupby('Code')['Number'].transform('sum'))*100
print (df)
Number Percentage
Code Group Name
ABC Group_1_ABC Mike 40 21.052632
Amber 60 31.578947
Group_2_ABC Rachel 90 47.368421
XYZ Group_1_XYZ Bob 30 11.406844
Peter 75 28.517110
Nikki 55 20.912548
Group_2_XYZ Julia 23 8.745247
Ross 80 30.418251
LMN Group_1_LMN Paul 95 100.000000
Detail:
print (df.groupby(level=0)['Number'].transform('sum'))
Code Group Name
ABC Group_1_ABC Mike 190
Amber 190
Group_2_ABC Rachel 190
XYZ Group_1_XYZ Bob 263
Peter 263
Nikki 263
Group_2_XYZ Julia 263
Ross 263
LMN Group_1_LMN Paul 95
Name: Number, dtype: int64
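To try this end to end, here is a self-contained sketch that rebuilds the sample data with a three-level index named Code, Group, and Name (an assumption based on the printed output above) and then applies the transform.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("ABC", "Group_1_ABC", "Mike"),
        ("ABC", "Group_1_ABC", "Amber"),
        ("ABC", "Group_2_ABC", "Rachel"),
        ("XYZ", "Group_1_XYZ", "Bob"),
        ("XYZ", "Group_1_XYZ", "Peter"),
        ("XYZ", "Group_1_XYZ", "Nikki"),
        ("XYZ", "Group_2_XYZ", "Julia"),
        ("XYZ", "Group_2_XYZ", "Ross"),
        ("LMN", "Group_1_LMN", "Paul"),
    ],
    names=["Code", "Group", "Name"],
)
df = pd.DataFrame({"Number": [40, 60, 90, 30, 75, 55, 23, 80, 95]}, index=index)

# transform('sum') returns one total per row, aligned with the original index,
# so it can be used directly as the denominator.
totals = df.groupby(level="Code")["Number"].transform("sum")
df["Percentage"] = df["Number"] / totals * 100
print(df)
```

Unlike a plain aggregation, transform keeps the shape of the input, which is what makes the row-by-row division possible.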

Pandas - Count the number of rows that would be true for a function - for each input row

I have a dataframe that needs a column added to it. That column needs to be a count of all the other rows in the table that meet a certain condition, and that condition needs to take input from both the "input" row and the "output" row.
For example, suppose it was a dataframe describing people, and I wanted to make a column that counted how many people were taller and lighter than the person in the current row.
I'd want the height and weight of the row, as well as the height and weight of the other rows, in a function, so I can do something like:
def example_function(height1, weight1, height2, weight2):
    if height1 > height2 and weight1 < weight2:
        return True
    else:
        return False
And it would just sum up all the True's and give that sum in the column.
Is something like this possible?
Thanks in advance for any ideas!
Edit: Sample input:
id name height weight country
0 Adam 70 180 USA
1 Bill 65 190 CANADA
2 Chris 71 150 GERMANY
3 Eric 72 210 USA
4 Fred 74 160 FRANCE
5 Gary 75 220 MEXICO
6 Henry 61 230 SPAIN
The result would need to be:
id name height weight country new_column
0 Adam 70 180 USA 1
1 Bill 65 190 CANADA 1
2 Chris 71 150 GERMANY 3
3 Eric 72 210 USA 1
4 Fred 74 160 FRANCE 4
5 Gary 75 220 MEXICO 1
6 Henry 61 230 SPAIN 0
I believe it will need to be some sort of function, as the actual logic I need to use is more complicated.
Edit 2: fixed typo
You can add booleans, like this:
count = ((df.height1 > df.height2) & (df.weight1 < df.weight2)).sum()
EDIT:
I tested it a bit and then changed the conditions, using a custom function:
def f(x):
    #check boolean mask
    #print ((df.height > x.height) & (df.weight < x.weight))
    return ((df.height < x.height) & (df.weight > x.weight)).sum()
df['new_column'] = df.apply(f, axis=1)
print (df)
id name height weight country new_column
0 0 Adam 70 180 USA 2
1 1 Bill 65 190 CANADA 1
2 2 Chris 71 150 GERMANY 3
3 3 Eric 72 210 USA 1
4 4 Fred 74 160 FRANCE 4
5 5 Gary 75 220 MEXICO 1
6 6 Henry 61 230 SPAIN 0
Explanation:
For each row, compare its values against all rows; to get the count, simply sum the True values.
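A self-contained sketch of the same idea on the sample data, so it can be run as is. The function name count_shorter_and_heavier is made up here, and the id and country columns are left out because they play no part in the count.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Adam", "Bill", "Chris", "Eric", "Fred", "Gary", "Henry"],
    "height": [70, 65, 71, 72, 74, 75, 61],
    "weight": [180, 190, 150, 210, 160, 220, 230],
})

def count_shorter_and_heavier(row):
    # Compare every row of df against the current row; the mask is a boolean Series.
    mask = (df["height"] < row["height"]) & (df["weight"] > row["weight"])
    # Summing booleans counts the True values.
    return int(mask.sum())

df["new_column"] = df.apply(count_shorter_and_heavier, axis=1)
print(df)
```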
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
As far as I understand, you want to assign to a new column something like
df['num_heigher_and_leighter'] = df.apply(lambda r: ((df.height > r.height) & (df.weight < r.weight)).sum(), axis=1)
However, your text description doesn't seem to match the outcome, which is:
0 2
1 3
2 0
3 1
4 0
5 0
6 6
dtype: int64
Edit
As in any other case, you can use a named function instead of a lambda:
df = ...
def foo(r):
    return ((df.height > r.height) & (df.weight < r.weight)).sum()
df['num_heigher_and_leighter'] = df.apply(foo, axis=1)
I'm assuming you had a typo and want to compare heights with heights and weights with weights. If so, you could count the number of persons who are both taller and heavier like so:
>>> for i,height,weight in zip(df.index,df.height, df.weight):
...     cnt = df.loc[((df.height>height) & (df.weight>weight)), 'height'].count()
...     df.loc[i,'thing'] = cnt
...
>>> df
name height weight country thing
0 Adam 70 180 USA 2.0
1 Bill 65 190 CANADA 2.0
2 Chris 71 150 GERMANY 3.0
3 Eric 72 210 USA 1.0
4 Fred 74 160 FRANCE 1.0
5 Gary 75 220 MEXICO 0.0
6 Henry 61 230 SPAIN 0.0
Here, for instance, no person is heavier than Henry, and no person is taller than Gary. If that's not what you intended, it should be easy to change the & above to a | or to switch the > to a <.
When you're more accustomed to pandas, I suggest you use Ami Tavory's excellent answer instead.
PS. For the love of god, use the Metric system for representing weight and height, and convert to whatever for presentation. These numbers are totally nonsensical for the world population at large. :)
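For larger tables, the row-by-row loops above can be replaced by a pairwise comparison with NumPy broadcasting. This is a sketch of a different technique from the answers above, assuming the "taller and lighter" condition from the question. The column name taller_and_lighter is made up, and the intermediate matrix needs memory proportional to the square of the number of rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Adam", "Bill", "Chris", "Eric", "Fred", "Gary", "Henry"],
    "height": [70, 65, 71, 72, 74, 75, 61],
    "weight": [180, 190, 150, 210, 160, 220, 230],
})

h = df["height"].to_numpy()
w = df["weight"].to_numpy()

# pairwise[i, j] is True when person j is taller AND lighter than person i.
# h[:, None] is a column (index i), h[None, :] is a row (index j).
pairwise = (h[None, :] > h[:, None]) & (w[None, :] < w[:, None])

# Summing along each row counts the matching people for person i.
df["taller_and_lighter"] = pairwise.sum(axis=1)
print(df)
```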
