My dataframe looks like:
School  Term            Students
A       summer 2020     324
B       spring 21       101
A       summer/spring   201
F       wintersem       44
C       fall trimester  98
E                       23
I need to add a new column, Termcode, that takes one of six values (summer, spring, fall, winter, multiple, none) based on the corresponding value in the Term column, viz.:
School  Term            Students  Termcode
A       summer 2020     324       summer
B       spring 21       101       spring
A       summer/spring   201       multiple
F       wintersem       44        winter
C       fall trimester  98        fall
E                       23        none
You can use a regex with str.extractall, then fill in the values depending on the number of matches:
terms = ['summer', 'spring', 'fall', 'winter']
regex = r'('+'|'.join(terms)+r')'
# '(summer|spring|fall|winter)'
# extract values and set up grouper for next step
g = df['Term'].str.extractall(regex)[0].groupby(level=0)
# get the first match, replace with "multiple" if more than one
df['Termcode'] = g.first().mask(g.nunique().gt(1), 'multiple')
# fill the missing data (i.e. no match) with "none"
df['Termcode'] = df['Termcode'].fillna('none')
output:
School Term Students Termcode
0 A summer 2020 324 summer
1 B spring 21 101 spring
2 A summer/spring 201 multiple
3 F wintersem 44 winter
4 C fall trimester 98 fall
5 E NaN 23 none
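Using Series.str.findall:
import numpy as np

l = ['summer', 'spring', 'fall', 'winter']
s = df['Term'].str.findall(fr"{'|'.join(l)}")
# more than one match -> 'multiple'; otherwise take the first match (NaN if there is none)
df['Termcode'] = np.where(s.str.len() > 1, 'multiple', s.str[0])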
School Term Students Termcode
0 A summer 2020 324 summer
1 B spring 21 101 spring
2 A summer/spring 201 multiple
3 F wintersem 44 winter
4 C fall trimester 98 fall
5 E NaN 23 NaN
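Note that the findall version leaves the last row as NaN rather than "none". As a minimal sketch of another way to get all six values in one pass, the classification can be written as a plain Python function and applied with Series.map. The names classify_term, TERMS and PATTERN are made up for this example, and matching case-insensitively is an assumption that goes beyond what the question asks for:

```python
import re

import numpy as np
import pandas as pd

TERMS = ["summer", "spring", "fall", "winter"]
# Assumption: match regardless of case, e.g. "Summer 2020"
PATTERN = re.compile("|".join(map(re.escape, TERMS)), flags=re.IGNORECASE)


def classify_term(term):
    """Return one term name, 'multiple', or 'none' for a single Term value."""
    if not isinstance(term, str):  # NaN / None: nothing to search
        return "none"
    found = {match.lower() for match in PATTERN.findall(term)}
    if not found:
        return "none"
    if len(found) > 1:
        return "multiple"
    return found.pop()


# Sample data copied from the question
df = pd.DataFrame({
    "School": list("ABAFCE"),
    "Term": ["summer 2020", "spring 21", "summer/spring",
             "wintersem", "fall trimester", np.nan],
    "Students": [324, 101, 201, 44, 98, 23],
})
df["Termcode"] = df["Term"].map(classify_term)
print(df)
```

This is slower than the vectorised string methods on large frames, but the rules are easy to read and extend.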
Trying to web scrape info from this website: http://www.dexel.co.uk/shopping/tyre-results?width=205&profile=55&rim=16&speed=.
For context, I'm trying to find the tyre brand (e.g. Bridgestone, Michelin), pattern (e.g. Turanza T001, Ecopia EP500), tyre size (205/55. 16 V (91), 225/50. 16 W (100) XL), seasonality if available (Summer, Winter) and price.
My measurements for the tyre are Width – 205, Aspect Ratio – 55, Rim Size - 16.
I found all the info I need in the page's var allTyres section. The problem is that I am struggling to extract the "manufacturer" (brand), "description" (which contains the pattern and size), "winter" (0 for no, 1 for yes), "summer" (same as winter) and "price" fields.
Afterwards, I want to export the data in CSV format.
Thanks
To create a pandas DataFrame from the allTyres data, you can do the following (from the DataFrame you can then select the columns you want, save it to CSV, etc.):
import re
import json
import requests
import pandas as pd
url = "http://www.dexel.co.uk/shopping/tyre-results?width=205&profile=55&rim=16&speed="
# the page embeds the data as a JavaScript assignment: "allTyres = [...];"
data = json.loads(
    re.search(r"allTyres = (.*);", requests.get(url).text).group(1)
)
# uncomment to print all data:
# print(json.dumps(data, indent=4))
df = pd.DataFrame(data)
print(df.head())
Prints:
id ManufacturerID width profile rim speed load description part_no pattern manufacturer extra_load run_flat winter summer OEList price tyre_class rolling_resistance wet_grip Graphic noise_db noise_rating info pattern_name recommended rating
0 1881920 647 205 55 16 V 91 205/55VR16 BUDGET VR 2055516VBUD Economy N N 0 1 53.20 C1 G F BUD 73 3 0 1
1 3901788 647 205 55 16 H 91 205/55R16 BUDGET 91H 2055516HBUD Economy N N 0 1 53.20 C1 G F BUD 73 3 0 1
2 1881957 647 205 55 16 W 91 205/55ZR16 BUDGET ZR 2055516ZBUD Economy N N 0 1 53.54 C1 G F BUD 73 3 0 1
3 6022423 129 205 55 16 H 91 205/55R16 91H UROYAL RAINSPORT 5 2055516HUN09BGS RainSport 5 Uniroyal N N 0 1 70.46 C1 C A UNIRSP5 71 2 <p>The NEW RainSport 5 combines best-in-class wet performance, enhanced mileage, and superior steering control for maximum driving pleasure.</p>\n<ul>\n <li>Safe driving even in the most challenging wet conditions</li>\n <li>Extended tyre life for a long journey</li>\n <li>Excellent control and steering response for maximum driving pleasure.</li>\n</ul> RainSport 5 0 4
4 6022424 129 205 55 16 V 91 205/55R16 91V UROYAL RAINSPORT 5 2055516VUN09BGR RainSport 5 Uniroyal N N 0 1 70.81 C1 C A UNIRSP5 71 2 <p>The NEW RainSport 5 combines best-in-class wet performance, enhanced mileage, and superior steering control for maximum driving pleasure.</p>\n<ul>\n <li>Safe driving even in the most challenging wet conditions</li>\n <li>Extended tyre life for a long journey</li>\n <li>Excellent control and steering response for maximum driving pleasure.</li>\n</ul> RainSport 5 0 4
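To cover the remaining part of the question (keeping only a few fields and exporting them as CSV), here is a rough sketch that runs without network access. The html string is a hand-written stand-in for the real page, with two records abridged from the output above; whether the site stores winter, summer and price as strings or numbers is an assumption, and wanted is just an illustrative name:

```python
import io
import json
import re

import pandas as pd

# Hypothetical stand-in for requests.get(url).text
html = (
    '<script>var allTyres = ['
    '{"manufacturer": "Economy", "description": "205/55VR16 BUDGET VR", '
    '"winter": "0", "summer": "1", "price": "53.20"}, '
    '{"manufacturer": "Uniroyal", "description": "205/55R16 91H UROYAL RAINSPORT 5", '
    '"winter": "0", "summer": "1", "price": "70.46"}'
    '];</script>'
)

# Same extraction step as in the answer: grab the JSON literal and parse it
data = json.loads(re.search(r"allTyres = (.*);", html).group(1))
df = pd.DataFrame(data)

# Keep only the fields the question asks for
wanted = ["manufacturer", "description", "winter", "summer", "price"]
subset = df[wanted]

# Write CSV to an in-memory buffer; pass a file path such as "tyres.csv" instead
buffer = io.StringIO()
subset.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

With the real page, replace html with requests.get(url).text as in the answer above.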
I'm new to Python and to data manipulation.
I have a dataframe
df3
Out[22]:
Breed Lifespan
0 New Guinea Singing Dog 18
1 Chihuahua 17
2 Toy Poodle 16
3 Jack Russell Terrier 16
4 Cockapoo 16
.. ... ...
201 Whippet 12--15
202 Wirehaired Pointing Griffon 12--14
203 Xoloitzcuintle 13
204 Yorkie--Poo 14
205 Yorkshire Terrier 14--16
As you can see above, some of the lifespans are given as a range, like 14--16. The type of df3['Lifespan'] is
type(df3['Lifespan'])
Out[24]: pandas.core.series.Series
I want it to reflect the average of these two numbers, i.e. 15. I do not want any ranges, just the average as a single number. How do I do this?
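Use str.split with expand=True, then take the row-wise mean:
import pandas as pd

df = pd.DataFrame({'Breed': ['Dog1', 'Dog2'],
                   'Lifespan': [12, '14--15']})
# split on '--' into separate columns; single values leave the second column empty
df['Lifespan'] = (df['Lifespan']
                  .astype(str).str.split('--', expand=True)
                  .astype(float).mean(axis=1)
                  )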
df
# Breed Lifespan
# 0 Dog1 12.0
# 1 Dog2 14.5
I have a pandas series with values below:
Bachelors Degree 639
Diploma 291
O - Level 264
Masters Degree 149
Certificate 126
A - Level 69
PGD 40
Bachelors Degree 28
A-Level 20
O-Level 15
Masters 10
Bachelors 6
diploma 5
certificate 5
Ph.D 4
A- Level 2
Post Graduate Diploma 1
Msc Environment 1
BBA 1
O- Level 1
Masters 1
PhD 1
I got the data from Excel.
I want to use pandas to clean the data, for example by replacing every variation of Masters with Master's Degree (I can do it in Excel, but I am learning pandas).
I have tried
mapp={"Bachelor's Degree":["Bachelors Degree","Bachelors","BBA","Bachelors Degree"],
"Ordinary Diploma":"diploma",
"Ordinary Level":["O - Level","O-Level","O- Level"],
"Master's Degree":["Masters Degree","Masters","Msc Environment","Masters"],
"Certificate":"certificate",
"Advanced Level":["A - Level","A-Level","- Level"],
"Post Graduate Diploma":["Post Graduate Diploma","PGD"],
"PHD":["Ph.D","PhD"]
}
df['EDUCATION_LEVEL']=df['EDUCATION_LEVEL'].map(mapp)
Results are returned only for the Certificate key, which has a single value.
It seems I can't use a list as the value for a dictionary key.
Any suggestions on how to replace the values would be highly appreciated.
Ronald
This is how the actual data appears in the Excel column.
I have added an image of how the data looks in the column.
The challenge is how to replace the various variations of, say, "Masters Degree".
First, make a slight change to your mapp dict so that all the values are lists:
mapp={"Bachelor's Degree":["Bachelors Degree","Bachelors","BBA","Bachelors Degree"],
"Ordinary Diploma":["diploma"],
"Ordinary Level":["O - Level","O-Level","O- Level"],
"Master's Degree":["Masters Degree","Masters","Msc Environment","Masters"],
"Certificate":["certificate"],
"Advanced Level":["A - Level","A-Level","- Level"],
"Post Graduate Diploma":["Post Graduate Diploma","PGD"],
"PHD":["Ph.D","PhD"]
}
mapp_new = [{l:k for l in v} for k,v in mapp.items()]
mapp_new = {k.lower(): v for d in mapp_new for k, v in d.items()}
df.EDUCATION_LEVEL.apply(lambda x: mapp_new.get(x.lower(), x))
0 Bachelor's Degree
1 Ordinary Diploma
2 Ordinary Level
3 Master's Degree
4 Certificate
5 Advanced Level
6 Post Graduate Diploma
7 Bachelor's Degree
8 Advanced Level
9 Ordinary Level
10 Master's Degree
11 Bachelor's Degree
12 Ordinary Diploma
13 Certificate
14 PHD
15 A- Level
16 Post Graduate Diploma
17 Master's Degree
18 Bachelor's Degree
19 Ordinary Level
20 Master's Degree
21 PHD
One idea is to convert single-element values to one-element lists, e.g. "diploma" to ["diploma"]:
mapp1={"Bachelor's Degree":["Bachelors Degree","Bachelors","BBA","Bachelors Degree"],
"Ordinary Diploma":["diploma"],
"Ordinary Level":["O - Level","O-Level","O- Level"],
"Master's Degree":["Masters Degree","Masters","Msc Environment","Masters"],
"Certificate":["certificate"],
"Advanced Level":["A - Level","A-Level","- Level"],
"Post Graduate Diploma":["Post Graduate Diploma","PGD"],
"PHD":["Ph.D","PhD"]
}
#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d = {k.lower(): oldk for oldk, oldv in mapp1.items() for k in oldv}
df['EDUCATION_LEVEL']=df['EDUCATION_LEVEL'].str.lower().map(d)
print (df)
EDUCATION_LEVEL VAL
0 Bachelor's Degree 639
1 Ordinary Diploma 291
2 Ordinary Level 264
3 Master's Degree 149
4 Certificate 126
5 Advanced Level 69
6 Post Graduate Diploma 40
7 Bachelor's Degree 28
8 Advanced Level 20
9 Ordinary Level 15
10 Master's Degree 10
11 Bachelor's Degree 6
12 Ordinary Diploma 5
13 Certificate 5
14 PHD 4
15 NaN 2
16 Post Graduate Diploma 1
17 Master's Degree 1
18 Bachelor's Degree 1
19 Ordinary Level 1
20 Master's Degree 1
21 PHD 1
If changing the dict is not possible, then use:
d = {}
for k, v in mapp.items():
    if isinstance(v, list):
        for x in v:
            d[x.lower()] = k
    else:
        d[v.lower()] = k
df['EDUCATION_LEVEL']=df['EDUCATION_LEVEL'].str.lower().map(d)
print (df)
EDUCATION_LEVEL VAL
0 Bachelor's Degree 639
1 Ordinary Diploma 291
2 Ordinary Level 264
3 Master's Degree 149
4 Certificate 126
5 Advanced Level 69
6 Post Graduate Diploma 40
7 Bachelor's Degree 28
8 Advanced Level 20
9 Ordinary Level 15
10 Master's Degree 10
11 Bachelor's Degree 6
12 Ordinary Diploma 5
13 Certificate 5
14 PHD 4
15 NaN 2
16 Post Graduate Diploma 1
17 Master's Degree 1
18 Bachelor's Degree 1
19 Ordinary Level 1
20 Master's Degree 1
21 PHD 1
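Both map-based versions turn any value that is missing from the lookup (such as "A- Level" here) into NaN. If you would rather keep such values unchanged, as the apply version does, one possible sketch is to fall back to the original with fillna. The helper names invert_mapping and normalise are invented for this example, the mapping is shortened, and stripping surrounding whitespace is an assumption about why some labels appear twice in the counts:

```python
import pandas as pd

# Shortened version of the question's mapping; values may be a list or a single string
mapp = {
    "Master's Degree": ["Masters Degree", "Masters", "Msc Environment"],
    "Certificate": "certificate",
    "PHD": ["Ph.D", "PhD"],
}


def invert_mapping(mapping):
    """Build {variant (stripped, lower-cased): canonical name}."""
    lookup = {}
    for canonical, variants in mapping.items():
        if isinstance(variants, str):
            variants = [variants]
        for variant in variants:
            lookup[variant.strip().lower()] = canonical
    return lookup


def normalise(series, mapping):
    """Map known variants to canonical names; keep unknown values as they are."""
    lookup = invert_mapping(mapping)
    key = series.str.strip().str.lower()
    return key.map(lookup).fillna(series)


levels = pd.Series(["Masters ", "certificate", "PhD", "A- Level", "Certificate"])
cleaned = normalise(levels, mapp)
print(cleaned)
```

Values left unchanged are then easy to spot with value_counts() and can be added to the mapping.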
I have a dataframe that needs a column added to it. That column needs to be a count of all the other rows in the table that meet a certain condition, and that condition needs to take input from both the "input" row and the "output" row.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
I'd want the height and weight of the row, as well as the height and weight of the other rows in a function, so I can do something like:
def example_function(height1, weight1, height2, weight2):
    if height1 > height2 and weight1 < weight2:
        return True
    else:
        return False
And it would just sum up all the True's and give that sum in the column.
Is something like this possible?
Thanks in advance for any ideas!
Edit: Sample input:
id name height weight country
0 Adam 70 180 USA
1 Bill 65 190 CANADA
2 Chris 71 150 GERMANY
3 Eric 72 210 USA
4 Fred 74 160 FRANCE
5 Gary 75 220 MEXICO
6 Henry 61 230 SPAIN
The result would need to be:
id name height weight country new_column
0 Adam 70 180 USA 1
1 Bill 65 190 CANADA 1
2 Chris 71 150 GERMANY 3
3 Eric 72 210 USA 1
4 Fred 74 160 FRANCE 4
5 Gary 75 220 MEXICO 1
6 Henry 61 230 SPAIN 0
I believe it will need to be some sort of function, as the actual logic I need to use is more complicated.
edit 2:fixed typo
You can add booleans, like this:
count = ((df.height1 > df.height2) & (df.weight1 < df.weight2)).sum()
EDIT:
I tested it a bit and then swapped the conditions, using a custom function:
def f(x):
    # check boolean mask
    # print((df.height > x.height) & (df.weight < x.weight))
    return ((df.height < x.height) & (df.weight > x.weight)).sum()

df['new_column'] = df.apply(f, axis=1)
print (df)
id name height weight country new_column
0 0 Adam 70 180 USA 2
1 1 Bill 65 190 CANADA 1
2 2 Chris 71 150 GERMANY 3
3 3 Eric 72 210 USA 1
4 4 Fred 74 160 FRANCE 4
5 5 Gary 75 220 MEXICO 1
6 6 Henry 61 230 SPAIN 0
Explanation:
For each row, compare its values against every row of the DataFrame; summing the resulting boolean mask counts the True values.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
As far as I understand, you want to assign to a new column something like
df['num_heigher_and_leighter'] = df.apply(lambda r: ((df.height > r.height) & (df.weight < r.weight)).sum(), axis=1)
However, your text description doesn't seem to match your expected output; the result of the code above is:
0 2
1 3
2 0
3 1
4 0
5 0
6 6
dtype: int64
Edit
As in any other case, you can use a named function instead of a lambda:
df = ...
def foo(r):
    return ((df.height > r.height) & (df.weight < r.weight)).sum()

df['num_heigher_and_leighter'] = df.apply(foo, axis=1)
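For frames that are not too large, the same "taller and lighter" count can also be computed without apply by comparing every row against every other row with NumPy broadcasting. This is only a sketch: the column name taller_and_lighter is invented here, and the intermediate arrays need memory proportional to the square of the row count:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "name": ["Adam", "Bill", "Chris", "Eric", "Fred", "Gary", "Henry"],
    "height": [70, 65, 71, 72, 74, 75, 61],
    "weight": [180, 190, 150, 210, 160, 220, 230],
})

h = df["height"].to_numpy()
w = df["weight"].to_numpy()

# taller[i, j] is True when person j is taller than person i
taller = h[None, :] > h[:, None]
# lighter[i, j] is True when person j is lighter than person i
lighter = w[None, :] < w[:, None]

# Count, for each row i, the people j who satisfy both conditions
df["taller_and_lighter"] = (taller & lighter).sum(axis=1)
print(df)
```

The counts agree with the apply-based result shown above (2, 3, 0, 1, 0, 0, 6). More complicated conditions can be built the same way, as long as they can be written as element-wise comparisons.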
I'm assuming you had a typo and want to compare heights with heights and weights with weights. If so, you could count the number of people who are both taller AND heavier like so:
>>> for i,height,weight in zip(df.index,df.height, df.weight):
...     cnt = df.loc[((df.height>height) & (df.weight>weight)), 'height'].count()
...     df.loc[i,'thing'] = cnt
...
>>> df
name height weight country thing
0 Adam 70 180 USA 2.0
1 Bill 65 190 CANADA 2.0
2 Chris 71 150 GERMANY 3.0
3 Eric 72 210 USA 1.0
4 Fred 74 160 FRANCE 1.0
5 Gary 75 220 MEXICO 0.0
6 Henry 61 230 SPAIN 0.0
Here, for instance, no person is heavier than Henry, and no person is taller than Gary. If that's not what you intended, it should be easy to change the & above to a |, or to swap the > for a <.
When you're more accustomed to pandas, I suggest you use Ami Tavory's excellent answer instead.
PS. For the love of god, use the Metric system for representing weight and height, and convert to whatever for presentation. These numbers are totally nonsensical for the world population at large. :)
Table
Roll Class Country Rights CountryAcc
1 x IND 23 US
1 x1 IND 32 Ind
2 s US 12 US
3 q IRL 33 CA
4 a PAK 12 PAK
4 e PAK 12 IND
5 f US 21 CA
5 g US 31 PAK
6 h US 21 BAN
I want to display only those Rolls whose CountryAcc is not in US or CA. For example: if Roll 1 has one CountryAcc in US then I don't want its other row with CountryAcc Ind and same goes with Roll 5 as it is having one row with CountryAcc as CA. So my final output would be:
Roll Class Country Rights CountryAcc
4 a PAK 12 PAK
4 e PAK 12 IND
6 h US 21 BAN
I tried getting that output the following way:
Home_Country = ['US', 'CA']
#First I saved two countries in a variable
Account_Other_Count = df.loc[~df.CountryAcc.isin(Home_Country)]
Account_Other_Count_Var = df.loc[~df.CountryAcc.isin(Home_Country)][['Roll']].values.ravel()
# Then I made two variables one with CountryAcc in US or CA and other variable with remaining and I got their Roll
Account_Home_Count = df.loc[df.CountryAcc.isin(Home_Country)]
Account_Home_Count_Var = df.loc[df.CountryAcc.isin(Home_Country)][['Roll']].values.ravel()
#Here I got the common Rolls
Common_ROLL = list(set(Account_Home_Count_Var).intersection(list(Account_Other_Count_Var)))
Final_Output = Account_Other_Count.loc[~Account_Other_Count.Roll.isin(Common_ROLL)]
Is there a better, more pandas-like or Pythonic way to do it?
One solution could be
In [37]: df.loc[~df['Roll'].isin(df.loc[df['CountryAcc'].isin(['US', 'CA']), 'Roll'])]
Out[37]:
Roll Class Country Rights CountryAcc
4 4 a PAK 12 PAK
5 4 e PAK 12 IND
8 6 h US 21 BAN
This is one way to do it:
sortdata = df[~df['CountryAcc'].isin(['US', 'CA'])].sort_index(axis=0)
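Another way to express the "drop every Roll that has at least one US or CA account" rule is a grouped any() broadcast back to the rows. The following is a minimal sketch on the question's sample data; the variable names home, is_home, has_home and result are illustrative only:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Roll": [1, 1, 2, 3, 4, 4, 5, 5, 6],
    "Class": ["x", "x1", "s", "q", "a", "e", "f", "g", "h"],
    "Country": ["IND", "IND", "US", "IRL", "PAK", "PAK", "US", "US", "US"],
    "Rights": [23, 32, 12, 33, 12, 12, 21, 31, 21],
    "CountryAcc": ["US", "Ind", "US", "CA", "PAK", "IND", "CA", "PAK", "BAN"],
})

home = ["US", "CA"]

# True for rows whose own CountryAcc is a home country
is_home = df["CountryAcc"].isin(home)

# True for every row of a Roll that has at least one home-country row
has_home = is_home.groupby(df["Roll"]).transform("any")

# Keep only the Rolls with no home-country account at all
result = df[~has_home]
print(result)
```

This keeps the original index labels (4, 5 and 8), matching the output of the isin-based answer.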