Pandas series replace values - python

I have a pandas series with values below:
Bachelors Degree 639
Diploma 291
O - Level 264
Masters Degree 149
Certificate 126
A - Level 69
PGD 40
Bachelors Degree 28
A-Level 20
O-Level 15
Masters 10
Bachelors 6
diploma 5
certificate 5
Ph.D 4
A- Level 2
Post Graduate Diploma 1
Msc Environment 1
BBA 1
O- Level 1
Masters 1
PhD 1
The data comes from Excel.
I want to clean it with pandas, for example by replacing every variant of Masters with Master's Degree (I could do this in Excel, but I am learning pandas).
I have tried
mapp={"Bachelor's Degree":["Bachelors Degree","Bachelors","BBA","Bachelors Degree"],
"Ordinary Diploma":"diploma",
"Ordinary Level":["O - Level","O-Level","O- Level"],
"Master's Degree":["Masters Degree","Masters","Msc Environment","Masters"],
"Certificate":"certificate",
"Advanced Level":["A - Level","A-Level","- Level"],
"Post Graduate Diploma":["Post Graduate Diploma","PGD"],
"PHD":["Ph.D","PhD"]
}
df['EDUCATION_LEVEL']=df['EDUCATION_LEVEL'].map(mapp)
Results are returned only for the Certificate key, which has a single string value.
It seems I can't use a list as the value for a dictionary key.
Any suggestions on how to replace the values would be highly appreciated.
Ronald
This is how the actual data appears in the Excel column.
I have added an image showing the data in the column.
The challenge is how to replace the various variations of, say, "Masters Degree".

First, make a slight change to your mapp dict so that every value is a list:
mapp={"Bachelor's Degree":["Bachelors Degree","Bachelors","BBA","Bachelors Degree"],
"Ordinary Diploma":["diploma"],
"Ordinary Level":["O - Level","O-Level","O- Level"],
"Master's Degree":["Masters Degree","Masters","Msc Environment","Masters"],
"Certificate":["certificate"],
"Advanced Level":["A - Level","A-Level","- Level"],
"Post Graduate Diploma":["Post Graduate Diploma","PGD"],
"PHD":["Ph.D","PhD"]
}
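# invert the mapping: one small dict {variant: label} per label
mapp_new = [{l:k for l in v} for k,v in mapp.items()]
# flatten into a single lookup keyed by the lowercased variant
mapp_new = {k.lower(): v for d in mapp_new for k, v in d.items()}
# case-insensitive lookup; keep the original value when nothing matches
df.EDUCATION_LEVEL.apply(lambda x: mapp_new.get(x.lower(), x))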
0 Bachelor's Degree
1 Ordinary Diploma
2 Ordinary Level
3 Master's Degree
4 Certificate
5 Advanced Level
6 Post Graduate Diploma
7 Bachelor's Degree
8 Advanced Level
9 Ordinary Level
10 Master's Degree
11 Bachelor's Degree
12 Ordinary Diploma
13 Certificate
14 PHD
15 A- Level
16 Post Graduate Diploma
17 Master's Degree
18 Bachelor's Degree
19 Ordinary Level
20 Master's Degree
21 PHD

One idea is to convert the single-string values to one-element lists, e.g. "diploma" to ["diploma"]:
mapp1={"Bachelor's Degree":["Bachelors Degree","Bachelors","BBA","Bachelors Degree"],
"Ordinary Diploma":["diploma"],
"Ordinary Level":["O - Level","O-Level","O- Level"],
"Master's Degree":["Masters Degree","Masters","Msc Environment","Masters"],
"Certificate":["certificate"],
"Advanced Level":["A - Level","A-Level","- Level"],
"Post Graduate Diploma":["Post Graduate Diploma","PGD"],
"PHD":["Ph.D","PhD"]
}
#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d = {k.lower(): oldk for oldk, oldv in mapp1.items() for k in oldv}
df['EDUCATION_LEVEL']=df['EDUCATION_LEVEL'].str.lower().map(d)
print (df)
EDUCATION_LEVEL VAL
0 Bachelor's Degree 639
1 Ordinary Diploma 291
2 Ordinary Level 264
3 Master's Degree 149
4 Certificate 126
5 Advanced Level 69
6 Post Graduate Diploma 40
7 Bachelor's Degree 28
8 Advanced Level 20
9 Ordinary Level 15
10 Master's Degree 10
11 Bachelor's Degree 6
12 Ordinary Diploma 5
13 Certificate 5
14 PHD 4
15 NaN 2
16 Post Graduate Diploma 1
17 Master's Degree 1
18 Bachelor's Degree 1
19 Ordinary Level 1
20 Master's Degree 1
21 PHD 1
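If changing the dictionary is not possible, then use:
d = {}
for k, v in mapp.items():
    if isinstance(v, list):
        # several spellings for one label
        for x in v:
            d[x.lower()] = k
    else:
        # a single spelling given as a plain string
        d[v.lower()] = k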
df['EDUCATION_LEVEL']=df['EDUCATION_LEVEL'].str.lower().map(d)
print (df)
EDUCATION_LEVEL VAL
0 Bachelor's Degree 639
1 Ordinary Diploma 291
2 Ordinary Level 264
3 Master's Degree 149
4 Certificate 126
5 Advanced Level 69
6 Post Graduate Diploma 40
7 Bachelor's Degree 28
8 Advanced Level 20
9 Ordinary Level 15
10 Master's Degree 10
11 Bachelor's Degree 6
12 Ordinary Diploma 5
13 Certificate 5
14 PHD 4
15 NaN 2
16 Post Graduate Diploma 1
17 Master's Degree 1
18 Bachelor's Degree 1
19 Ordinary Level 1
20 Master's Degree 1
21 PHD 1
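Row 15 comes back as NaN in both variants because "A- Level" is not among the listed spellings. A minimal sketch of one way to keep such unmapped values visible instead of losing them, assuming a small stand-in Series (the sample values and the names s, variants, lookup and cleaned are illustrative, not taken from the question's workbook):

```python
import pandas as pd

# Stand-in for df['EDUCATION_LEVEL']; the values are a small illustrative sample.
s = pd.Series(["Masters", "diploma", "A- Level", "PhD"])

# Canonical label -> list of spellings, same shape as the dict above.
variants = {
    "Master's Degree": ["Masters Degree", "Masters"],
    "Ordinary Diploma": ["diploma"],
    "PHD": ["Ph.D", "PhD"],
}

# Invert to spelling -> label, lowercased so the lookup ignores case.
lookup = {v.lower(): label for label, vs in variants.items() for v in vs}

# map() yields NaN where no key matches; fillna(s) restores the original text.
cleaned = s.str.lower().map(lookup).fillna(s)
print(cleaned.tolist())
```

Anything still in its raw spelling afterwards can be listed with cleaned[~cleaned.isin(list(variants))] and then added to the dictionary.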


Trying to access variables while scraping website; trying to get var in script

I am trying to scrape information from this website: http://www.dexel.co.uk/shopping/tyre-results?width=205&profile=55&rim=16&speed=.
For context, I'm trying to find the tyre brand (Bridgestone, Michelin), pattern (e.g. Turanza T001, Ecopia EP500), tyre size (205/55. 16 V (91), 225/50. 16 W (100) XL), seasonality if available (Summer, Winter) and price.
My measurements for the tyre are Width – 205, Aspect Ratio – 55, Rim Size - 16.
I found all the info I need in the var allTyres section of the page. The problem is that I am struggling to extract the "manufacturer" (brand), "description" (which holds the pattern and size), "winter" (0 for no, 1 for yes), "summer" (same as winter) and "price".
Afterwards, I want to export the data in CSV format.
Thanks
To create a pandas DataFrame from the allTyres data, you can do the following (from the DataFrame you can then select the columns you want, save it to CSV, etc.):
import re
import json
import requests
import pandas as pd
url = "http://www.dexel.co.uk/shopping/tyre-results?width=205&profile=55&rim=16&speed="
data = json.loads(
    re.search(r"allTyres = (.*);", requests.get(url).text).group(1)
)
# uncomment to print all data:
# print(json.dumps(data, indent=4))
df = pd.DataFrame(data)
print(df.head())
Prints:
id ManufacturerID width profile rim speed load description part_no pattern manufacturer extra_load run_flat winter summer OEList price tyre_class rolling_resistance wet_grip Graphic noise_db noise_rating info pattern_name recommended rating
0 1881920 647 205 55 16 V 91 205/55VR16 BUDGET VR 2055516VBUD Economy N N 0 1 53.20 C1 G F BUD 73 3 0 1
1 3901788 647 205 55 16 H 91 205/55R16 BUDGET 91H 2055516HBUD Economy N N 0 1 53.20 C1 G F BUD 73 3 0 1
2 1881957 647 205 55 16 W 91 205/55ZR16 BUDGET ZR 2055516ZBUD Economy N N 0 1 53.54 C1 G F BUD 73 3 0 1
3 6022423 129 205 55 16 H 91 205/55R16 91H UROYAL RAINSPORT 5 2055516HUN09BGS RainSport 5 Uniroyal N N 0 1 70.46 C1 C A UNIRSP5 71 2 <p>The NEW RainSport 5 combines best-in-class wet performance, enhanced mileage, and superior steering control for maximum driving pleasure.</p>\n<ul>\n <li>Safe driving even in the most challenging wet conditions</li>\n <li>Extended tyre life for a long journey</li>\n <li>Excellent control and steering response for maximum driving pleasure.</li>\n</ul> RainSport 5 0 4
4 6022424 129 205 55 16 V 91 205/55R16 91V UROYAL RAINSPORT 5 2055516VUN09BGR RainSport 5 Uniroyal N N 0 1 70.81 C1 C A UNIRSP5 71 2 <p>The NEW RainSport 5 combines best-in-class wet performance, enhanced mileage, and superior steering control for maximum driving pleasure.</p>\n<ul>\n <li>Safe driving even in the most challenging wet conditions</li>\n <li>Extended tyre life for a long journey</li>\n <li>Excellent control and steering response for maximum driving pleasure.</li>\n</ul> RainSport 5 0 4
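The question also asks for a CSV containing only a few fields. A hedged sketch of that last step, run against a tiny inline stand-in for the page so that it works without network access (the html string and its two records are made up for illustration; the column names are the ones visible in the printed header above):

```python
import io
import json
import re

import pandas as pd

# Made-up stand-in for requests.get(url).text; only the shape mirrors the page.
html = """<script>
var allTyres = [{"manufacturer": "BrandA", "description": "205/55R16 91V EXAMPLE", "winter": "0", "summer": "1", "price": "70.81", "rim": "16"}, {"manufacturer": "BrandB", "description": "205/55R16 91H EXAMPLE", "winter": "1", "summer": "0", "price": "53.20", "rim": "16"}];
</script>"""

# Same extraction as above: grab the JSON literal assigned to allTyres.
data = json.loads(re.search(r"allTyres = (.*);", html).group(1))
df = pd.DataFrame(data)

# Keep only the fields the question asks about.
wanted = ["manufacturer", "description", "winter", "summer", "price"]
subset = df[wanted]

# Write CSV text to a buffer; pass a file path instead to save to disk.
buffer = io.StringIO()
subset.to_csv(buffer, index=False)
print(buffer.getvalue())
```

With the real page, the same two lines (column selection and to_csv) would be applied to the df built from the live request.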

New column based on existing string column in Python

My dataframe looks like:
School  Term            Students
A       summer 2020     324
B       spring 21       101
A       summer/spring   201
F       wintersem       44
C       fall trimester  98
E                       23
I need to add a new column Termcode that takes one of 6 values (summer, spring, fall, winter, multiple, none) based on the corresponding value in the Term column, viz:
School  Term            Students  Termcode
A       summer 2020     324       summer
B       spring 21       101       spring
A       summer/spring   201       multiple
F       wintersem       44        winter
C       fall trimester  98        fall
E                       23        none
You can use a regex with str.extractall, then fill in the values depending on the number of matches:
terms = ['summer', 'spring', 'fall', 'winter']
regex = r'('+'|'.join(terms)+r')'
# '(summer|spring|fall|winter)'
# extract values and set up grouper for next step
g = df['Term'].str.extractall(regex)[0].groupby(level=0)
# get the first match, replace with "multiple" if more than one
df['Termcode'] = g.first().mask(g.nunique().gt(1), 'multiple')
# fill the missing data (i.e. no match) with "none"
df['Termcode'] = df['Termcode'].fillna('none')
output:
School Term Students Termcode
0 A summer 2020 324 summer
1 B spring 21 101 spring
2 A summer/spring 201 multiple
3 F wintersem 44 winter
4 C fall trimester 98 fall
5 E NaN 23 none
Using Series.str.findall:
import numpy as np
l = ['summer', 'spring', 'fall', 'winter']
s = df['Term'].str.findall(fr"{'|'.join(l)}")
df['Termcode'] = np.where(s.str.len() > 1, 'multiple', s.str[0])
School Term Students Termcode
0 A summer 2020 324 summer
1 B spring 21 101 spring
2 A summer/spring 201 multiple
3 F wintersem 44 winter
4 C fall trimester 98 fall
5 E NaN 23 NaN
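Note that the findall variant leaves the last row as NaN, whereas the question asks for the label none. A small sketch of one way to close that gap, assuming the sample frame is rebuilt by hand as below (the names found and codes are illustrative):

```python
import numpy as np
import pandas as pd

# Sample frame retyped from the question; the missing Term is NaN.
df = pd.DataFrame({
    "School": ["A", "B", "A", "F", "C", "E"],
    "Term": ["summer 2020", "spring 21", "summer/spring",
             "wintersem", "fall trimester", np.nan],
    "Students": [324, 101, 201, 44, 98, 23],
})

terms = ["summer", "spring", "fall", "winter"]
# One list of matched terms per row (NaN where Term itself is missing).
found = df["Term"].str.findall("|".join(terms))

# More than one match -> "multiple"; otherwise the first match (NaN if none).
codes = np.where(found.str.len() > 1, "multiple", found.str[0])

# Wrap back into a Series so fillna can label the rows without any match.
df["Termcode"] = pd.Series(codes, index=df.index).fillna("none")
print(df)
```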

compare 2 dataframes simultaneously - 2 itertuples?

I'm comparing 2 dataframes, and I'd like to see whether the name matches on the address and, if so, pull the unique ID. Otherwise, I continue on and search for the best match (I'm using a fuzzy matcher for that part).
I was exploring itertools and wondered whether itertools.zip_longest would let me compare 2 items together simultaneously, rather than using 2 for loops (for example: for x in df1.itertuples(): do something... for y in df2.itertuples(): do something). Would something like this work?
result = itertools.zip_longest(df1.itertuples(), df2.itertuples())
Here are my 2 dataframes.
DF1:
NAME ADDRESS CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
0 OFFICE 1 123 road 2222277 84 170 265
1 OFFICE 2 15 lane 2222289 7 167 288
2 OFFICE 3 3 highway 1111111 1 2 286
3 OFFICE 4 2 street 1111171 95 193 299
4 OFFICE 5 1 place 1111191 9 193 298
DF2:
NAME ADDRESS CUSTOMER_SUPPLIER_NUMBER UNIQUE ID
0 OFFICE 1 123 road 2222277 014168
1 OFFICE 2 15 lane 2222289 131989
2 OFFICE 3 3 highway 1111111 149863
3 OFFICE 4 2 street 1111171 198664
4 OFFICE 5 1 place 1111191 198499
5 OFFICE 6 zzzz rd 165198 198791
6 OFFICE 7 5z st 19844 298791
7 OFFICE 8 34 hwy 981818 398791
8 OFFICE 9 81290 rd 899811 498791
9 OFFICE 10 59 rd 699161 598791
10 OFFICE 11 5141 bldvd 33211 698791
Then I run a for loop with some comparison if statements. I can access both items side by side, but how would I then loop over the items to do the check?
Right now I'm getting: "TypeError: 'NoneType' object is not subscriptable"
for yy in result:
    if yy[0][1] == yy[1][1]:
        print(yy)  # ......
If the column headers are the same in both DataFrames, just use merge:
dfmerge = pd.merge(df1, df2)
the output should be:
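As a rough, self-contained sketch of that merge: the frames below are retyped from the question (df2 is cut to its first six rows for brevity), and UNIQUE ID is kept as text so the leading zero survives, which is an assumption about the intended dtype.

```python
import pandas as pd

df1 = pd.DataFrame({
    "NAME": ["OFFICE 1", "OFFICE 2", "OFFICE 3", "OFFICE 4", "OFFICE 5"],
    "ADDRESS": ["123 road", "15 lane", "3 highway", "2 street", "1 place"],
    "CUSTOMER_SUPPLIER_NUMBER": [2222277, 2222289, 1111111, 1111171, 1111191],
    "Sales": [84, 7, 1, 95, 9],
    "Calls": [170, 167, 2, 193, 193],
    "Target": [265, 288, 286, 299, 298],
})
df2 = pd.DataFrame({
    "NAME": ["OFFICE 1", "OFFICE 2", "OFFICE 3", "OFFICE 4", "OFFICE 5", "OFFICE 6"],
    "ADDRESS": ["123 road", "15 lane", "3 highway", "2 street", "1 place", "zzzz rd"],
    "CUSTOMER_SUPPLIER_NUMBER": [2222277, 2222289, 1111111, 1111171, 1111191, 165198],
    "UNIQUE ID": ["014168", "131989", "149863", "198664", "198499", "198791"],
})

# With no "on" argument, merge joins on every column name the frames share
# (NAME, ADDRESS, CUSTOMER_SUPPLIER_NUMBER) and keeps only exact matches.
dfmerge = pd.merge(df1, df2)
print(dfmerge[["NAME", "ADDRESS", "UNIQUE ID"]])
```

Rows of df1 that find no exact partner can be kept for the fuzzy-matching step by passing how="left" and indicator=True to pd.merge. As for the TypeError in the question: zip_longest pads the shorter iterable with None, so once df1 runs out, yy[0] is None and yy[0][1] fails.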

Search through a text file in between two specified characters

I want to search a specified section of the attached text file, delimited by the '#' character. Basically, I want to look at all the data starting at a line with '#' and ending at the next line with '#'. Within these sections, I'd also like to look for a specified string. I am coding in Python.
TEXT FILE (the bolded events have a '#' symbol in front of the number, e.g. the real file reads '#1 Women 1000 Yd Free')
FAST - CA
HY-TEK's MEET MANAGER 7.0 - 11:32 PM 2/11/2018 Page 1
2018 Presidents' Day Senior Swimming Classic
Psych Sheet
#1 Women 1000 Yd Free
10:39.39 SECT
Name
Age
Team
1 Zamora Gallegos, Mariana
15 BC
2 Arzave, Juli
16
SBA-SI
Seed Time
1:00.21 SECT
10:02.58 SECT
3 Aguilar Ortega, Martha
18 Ruth
BC
10:05.72 SECT
4 Nowaski, Danielle 17
10:13.74 SECT
CAST-SI
5 Miranda Aguilar, Danitza
17 BC
10:23.30 SECT
6 Moreno Osuna, Ashely
16 Dariela
BC
10:23.70 SECT
7 Motekaitis, Mia
17
UCD-SN
10:24.96 SECT
8 Gardner, Amber
15
CROW-PC
10:27.86 SECT
9 Urias Quijas, Sophie14
AprilBC
53 Macias Ruiz, Alejandra
17 BC
11:23.81
45 Suehiro, Alex
54 Fuller, Monica
SMSC-CA
11:29.23
14 BC
46 Gallegos Portugal, Hanson
10:26.01
MP-PC
11:30.00
47 O'Connell, Daniel 17
CROW-PC
10:27.13
13 BC
56 Mendoza Camilo, Arely
11:30.58
48 Monge, Colin
PS-SI
10:27.63
57 DePaco, Lexi
15
CROW-PC
11:31.13
17 BC
49 Mejia Matamoros, Miguel
10:28.50
58 Morgan, Chloe
12
MRA-SI
11:57.36
50 Cebreros Gracia, Jorge
16
BC
10:30.06
59 Ferguson, Lizzie
18
MP-PC
12:12.40
51 Simpson, Aidan
ICAC-SI
10:30.20
60 Vera, Daniela
19
BERIM
52 Cordova Medina, Ivar
17
BC
10:30.57
53 Rascon, Esteban
SBA-SI
10:33.35
54 Mota Ezpinoza, Leonel
13
BC
10:34.09
55 Marsalek, Asa
16
SMSC-CA
10:34.14
56 Shitole, Viraj
17
DACA-PC
10:39.53
57 Friedrich, Aaron
15
SMSC-CA
10:40.67
58 Sokalzuk, Samuel 14
MP-PC
10:40.89
59 Jin, Lei
ICAC-SI
10:56.89
14
55 Duro, Dominique 18
9:12.23L SECT
#2 Men 1000 Yd Free
Begin by using string.find("#"), which returns -1 if the hash character doesn't exist. Once you find it, make an outer loop that iterates until you find the next hash. When you do find a hash, look at the next 4-5 elements: for example, check whether it says "women" or "men" and whether it contains a number for the distance. For the distance, the str.isdigit method is perfect.
Inside that loop, check against your list of names and confirm when you have a match, for example by using == to compare names. Once you find the name, take the next 4 pieces of information and copy them down as a string.
Given about 30 names and a list of about 3000 swimmers, that is close to 100,000 comparisons, so the program could take a while to run.
Outer loop (checks for #)
    if (#) --> check event, gender, etc., and reset the global variable to the event so that the swimmers can be saved to it
    else --> do name comparisons
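The outline above can be sketched roughly as follows, under a few assumptions: an event header is any line that starts with '#', a section runs until the next such line, and the search is a plain case-insensitive substring test. The function names split_sections and find_in_sections are illustrative, and the sample lines are a made-up arrangement in the style of the psych sheet.

```python
def split_sections(lines):
    """Group lines into {header: [body lines]} keyed by each '#' header line."""
    sections = {}
    current = None  # header of the section being filled; None before the first '#'
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # a new event starts here
            current = line
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def find_in_sections(sections, needle):
    """Return (header, line) pairs whose line contains needle, ignoring case."""
    needle = needle.lower()
    return [
        (header, line)
        for header, body in sections.items()
        for line in body
        if needle in line.lower()
    ]


# Made-up arrangement of a few lines, for illustration only.
sample = [
    "Psych Sheet",
    "#1 Women 1000 Yd Free",
    "1 Zamora Gallegos, Mariana",
    "15 BC",
    "#2 Men 1000 Yd Free",
    "45 Suehiro, Alex",
]
sections = split_sections(sample)
hits = find_in_sections(sections, "suehiro")
print(hits)
```

For a real file, an open file object can be passed directly to split_sections, since it only needs an iterable of lines.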

Pandas - Count the number of rows that would be true for a function - for each input row

I have a dataframe that needs a column added to it. That column needs to be a count of all the other rows in the table that meet a certain condition, and that condition needs to take input from both the "input" row and the "output" row.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
I'd want the height and weight of the row, as well as the height and weight of the other rows in a function, so I can do something like:
def example_function(height1, weight1, height2, weight2):
    if height1 > height2 and weight1 < weight2:
        return True
    else:
        return False
And it would just sum up all the True's and give that sum in the column.
Is something like this possible?
Thanks in advance for any ideas!
Edit: Sample input:
id name height weight country
0 Adam 70 180 USA
1 Bill 65 190 CANADA
2 Chris 71 150 GERMANY
3 Eric 72 210 USA
4 Fred 74 160 FRANCE
5 Gary 75 220 MEXICO
6 Henry 61 230 SPAIN
The result would need to be:
id name height weight country new_column
0 Adam 70 180 USA 1
1 Bill 65 190 CANADA 1
2 Chris 71 150 GERMANY 3
3 Eric 72 210 USA 1
4 Fred 74 160 FRANCE 4
5 Gary 75 220 MEXICO 1
6 Henry 61 230 SPAIN 0
I believe it will need to be some sort of function, as the actual logic I need to use is more complicated.
edit 2:fixed typo
You can add booleans, like this:
count = ((df.height1 > df.height2) & (df.weight1 < df.weight2)).sum()
EDIT:
I tested it a bit and then changed the conditions, using a custom function:
def f(x):
    # check boolean mask
    # print ((df.height > x.height) & (df.weight < x.weight))
    return ((df.height < x.height) & (df.weight > x.weight)).sum()
df['new_column'] = df.apply(f, axis=1)
print (df)
id name height weight country new_column
0 0 Adam 70 180 USA 2
1 1 Bill 65 190 CANADA 1
2 2 Chris 71 150 GERMANY 3
3 3 Eric 72 210 USA 1
4 4 Fred 74 160 FRANCE 4
5 5 Gary 75 220 MEXICO 1
6 6 Henry 61 230 SPAIN 0
Explanation:
For each row, compare the values and, to get the count, simply sum the True values.
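For larger tables, the per-row apply can be swapped for NumPy broadcasting, which compares every row against every other row in one step. This is a different technique from the apply above, shown only as a sketch on the sample data from the question; the variable names are illustrative, and it builds an n-by-n boolean matrix, so memory grows quadratically with the number of rows.

```python
import pandas as pd

# Sample data retyped from the question.
df = pd.DataFrame({
    "name": ["Adam", "Bill", "Chris", "Eric", "Fred", "Gary", "Henry"],
    "height": [70, 65, 71, 72, 74, 75, 61],
    "weight": [180, 190, 150, 210, 160, 220, 230],
})

h = df["height"].to_numpy()
w = df["weight"].to_numpy()

# shorter_and_heavier[i, j] is True when row j is shorter AND heavier than row i.
shorter_and_heavier = (h[None, :] < h[:, None]) & (w[None, :] > w[:, None])

# Count the True entries along each row i.
df["new_column"] = shorter_and_heavier.sum(axis=1)
print(df)
```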
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
As far as I understand, you want to assign to a new column something like
df['num_heigher_and_leighter'] = df.apply(lambda r: ((df.height > r.height) & (df.weight < r.weight)).sum(), axis=1)
However, your text description doesn't seem to match the outcome, which is:
0 2
1 3
2 0
3 1
4 0
5 0
6 6
dtype: int64
Edit
As in any other case, you can use a named function instead of a lambda:
df = ...
def foo(r):
    return ((df.height > r.height) & (df.weight < r.weight)).sum()
df['num_heigher_and_leighter'] = df.apply(foo, axis=1)
I'm assuming you had a typo and want to compare heights with heights and weights with weights. If so, you could count the number of persons who are both taller and heavier like so:
>>> for i, height, weight in zip(df.index, df.height, df.weight):
...     cnt = df.loc[((df.height>height) & (df.weight>weight)), 'height'].count()
...     df.loc[i,'thing'] = cnt
...
>>> df
name height weight country thing
0 Adam 70 180 USA 2.0
1 Bill 65 190 CANADA 2.0
2 Chris 71 150 GERMANY 3.0
3 Eric 72 210 USA 1.0
4 Fred 74 160 FRANCE 1.0
5 Gary 75 220 MEXICO 0.0
6 Henry 61 230 SPAIN 0.0
Here, for instance, no person is heavier than Henry, and no person is taller than Gary. If that's not what you intended, it should be easy to change the & above to a | or to switch the > to a <.
When you're more accustomed to Pandas, I suggest you use Ami Tavory's excellent answer instead.
PS. For the love of god, use the Metric system for representing weight and height, and convert to whatever for presentation. These numbers are totally nonsensical for the world population at large. :)
