I was wondering how one would find estimated values based on several different categories. Two of the columns are categorical, another column contains two strings of interest, and the last contains numeric values.
I have a csv file called sports.csv
import pandas as pd
import numpy as np
#loading the data into data frame
df = pd.read_csv('sports.csv')
I'm trying to find a suggested price for a Gym that has both Baseball and Basketball, with enrollment from 240 to 260, given that it is from region 4 and of type 1.
Region Type enroll estimates price Gym
2 1 377 0.43 40 Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
4 2 100 0.26 37 Baseball|Tennis
4 1 347 0.65 61 Basketball|Baseball|Ballet
4 1 264 0.17 12 Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
1 1 286 0.74 78 Swimming|Basketball
0 1 210 0.13 29 Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
0 1 263 0.91 31 Tennis
2 2 271 0.39 54 Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
3 3 247 0.51 33 Baseball|Hockey|Swimming|Cycling
0 1 109 0.12 17 Football|Hockey|Volleyball
I don't know how to piece everything together. I apologize if the syntax is incorrect; I'm just beginning Python. So far I have:
import pandas as pd
import numpy as np
#loading the data into data frame
df = pd.read_csv('sports.csv')
#group 4th region and type 1 together where enrollment is in between 240 and 260
group = df[df['Region'] == 4] df[df['Type'] == 1] df[240>=df['Enrollment'] <=260 ]
#split by pipe chars to find gyms that contain both Baseball and Basketball
df['Gym'] = df['Gym'].str.split('|')
df['Gym'] = df['Gym'].str.contains('Baseball'& 'Basketball')
price = df.loc[df['Gym'], 'Price']
Should I do a groupby instead? If so, how would I include the conditions Type == 1, Region == 4, and enrollment from 240 to 260?
You can create a mask with all your conditions specified and then use the mask for subsetting:
mask = (df['Region'] == 4) & (df['Type'] == 1) & \
(df['enroll'] <= 260) & (df['enroll'] >= 240) & \
df['Gym'].str.contains('Baseball') & df['Gym'].str.contains('Basketball')
df['price'][mask]
# Series([], name: price, dtype: int64)
which returns empty, since there is no record satisfying all conditions as above.
I had to change one row (enroll 347 to 247) so that a record would actually meet your criteria; otherwise you will get an empty result. You want to use df.loc with conditions as follows:
In [1]: import pandas as pd, numpy as np, io
In [2]: in_string = io.StringIO("""Region Type enroll estimates price Gym
...: 2 1 377 0.43 40 Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
...: 4 2 100 0.26 37 Baseball|Tennis
...: 4 1 247 0.65 61 Basketball|Baseball|Ballet
...: 4 1 264 0.17 12 Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
...: 1 1 286 0.74 78 Swimming|Basketball
...: 0 1 210 0.13 29 Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
...: 0 1 263 0.91 31 Tennis
...: 2 2 271 0.39 54 Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
...: 3 3 247 0.51 33 Baseball|Hockey|Swimming|Cycling
...: 0 1 109 0.12 17 Football|Hockey|Volleyball""")
In [3]: df = pd.read_csv(in_string,delimiter=r"\s+")
In [4]: df.loc[df.Gym.str.contains(r"(?=.*Baseball)(?=.*Basketball)")
...: & (df.enroll <= 260) & (df.enroll >= 240)
...: & (df.Region == 4) & (df.Type == 1), 'price']
Out[4]:
2 61
Name: price, dtype: int64
Note I used a regex pattern for contains that essentially acts as an AND operator for regex. You could simply have done another conjunction of .contains conditions for Basketball and Baseball.
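Following that suggestion, here is a minimal self-contained sketch (with just two of the sample rows inlined) that uses two separate .contains conditions plus between for the enrollment range:

```python
import io
import pandas as pd

data = """Region Type enroll estimates price Gym
4 1 247 0.65 61 Basketball|Baseball|Ballet
4 2 100 0.26 37 Baseball|Tennis"""
df = pd.read_csv(io.StringIO(data), sep=r"\s+")

# Two plain-substring .contains conditions AND-ed together, plus
# between() for the inclusive enrollment range
mask = (df['Region'] == 4) & (df['Type'] == 1) \
    & df['enroll'].between(240, 260) \
    & df['Gym'].str.contains('Baseball') \
    & df['Gym'].str.contains('Basketball')
print(df.loc[mask, 'price'])  # the row with price 61
```

Both forms give the same result; the regex lookahead version is handy mainly when you want to keep the whole pattern in a single string.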
I currently have data which contains a location name, latitude, longitude, and a number value associated with each location. The final goal would be to get a dataframe that has, for each location, the sum of the values of all locations within specific distance ranges. A sample dataframe is below:
IDVALUE,Latitude,Longitude,NumberValue
ID1,44.968046,-94.420307,1
ID2,44.933208,-94.421310,10
ID3,33.755787,-116.359998,15
ID4,33.844843,-116.54911,207
ID5,44.92057,-93.44786,133
ID6,44.240309,-91.493619,52
ID7,44.968041,-94.419696,39
ID8,44.333304,-89.132027,694
ID9,33.755783,-116.360066,245
ID10,33.844847,-116.549069,188
ID11,44.920474,-93.447851,3856
ID12,44.240304,-91.493768,189
Firstly, I managed to get the distances between each of them using the haversine function. Using the code below I turned the latlongs into radians and then created a matrix where the diagonals are infinite values.
import math
import numpy as np
import pandas as pd
from sklearn.neighbors import DistanceMetric  # sklearn.metrics.DistanceMetric on newer versions

df_latlongs['LATITUDE'] = np.radians(df_latlongs['LATITUDE'])
df_latlongs['LONGITUDE'] = np.radians(df_latlongs['LONGITUDE'])
dist = DistanceMetric.get_metric('haversine')
latlong_df = pd.DataFrame(
    dist.pairwise(df_latlongs[['LATITUDE', 'LONGITUDE']].to_numpy()) * 6373,
    columns=df_latlongs.IDVALUE.unique(),
    index=df_latlongs.IDVALUE.unique())
np.fill_diagonal(latlong_df.values, math.inf)
This distance matrix is then in kilometres. What I'm struggling with next is to be able to filter the distances of each of the locations and get the total number of values within a range and link this to the original dataframe.
Below is the code I have used to filter the distance matrix to get all of the locations within 500 meters:
latlong_df_rows = latlong_df[latlong_df < 0.5]
latlong_df_rows = latlong_df_rows.dropna(how='all', axis=0)
latlong_df_rows = latlong_df_rows.dropna(how='all', axis=1)
My attempt was to then get, for each location, a list of the locations within this range using the code below:
within_range_df = latlong_df_rows.apply(lambda row: row[row < 0.05].index.tolist(), axis=1)
within_range_df = within_range_df.to_frame()
within_range_df = within_range_df.dropna(how='all', axis=0)
within_range_df = within_range_df.dropna(how='all', axis=1)
From here I was going to try and get the NumberValue from the original dataframe by looping through the list of values to obtain another column for the number for that location. Then sum all of them. The final dataframe would ideally look like the following:
IDVALUE,<500m,500-1000m,>1000m
ID1,x1,y1,z1
ID2,x2,y2,z2
ID3,x3,y3,z3
ID4,x4,y4,z4
ID5,x5,y5,z5
ID6,x6,y6,z6
ID7,x7,y7,z7
ID8,x8,y8,z8
ID9,x9,y9,z9
ID10,x10,y10,z10
ID11,x11,y11,z11
ID12,x12,y12,z12
Where x, y and z are the total number values of the locations within each distance range. I know this is probably really weird and overcomplicated, so if any tips to change the question or anything else are needed, I'll be happy to provide them. Cheers
I would define a helper function, making use of BallTree, e.g.
from sklearn.neighbors import BallTree
import pandas as pd
import numpy as np
df = pd.read_csv('input.csv')
We use query_radius() to get the IDs, then a list comprehension to get the values and sum them:
locations_radians = np.radians(df[["Latitude","Longitude"]].values)
tree = BallTree(locations_radians, leaf_size=12, metric='haversine')
def summed_numbervalue_for_radius(radius_in_m=100):
    distance_in_meters = radius_in_m
    earth_radius = 6371000
    radius = distance_in_meters / earth_radius
    ids_within_radius = tree.query_radius(locations_radians, r=radius, count_only=False)
    values_as_array = np.array(df.NumberValue)
    summed_values = [values_as_array[ix].sum() for ix in ids_within_radius]
    return np.array(summed_values)
With the helper function you can do, for instance:
df = df.assign( sum_100=summed_numbervalue_for_radius(100))
df = df.assign( sum_500=summed_numbervalue_for_radius(500))
df = df.assign( sum_1000=summed_numbervalue_for_radius(1000))
df = df.assign( sum_1000_to_5000=summed_numbervalue_for_radius(5000)-summed_numbervalue_for_radius(1000))
This will give you:
IDVALUE Latitude Longitude NumberValue sum_100 sum_500 sum_1000 \
0 ID1 44.968046 -94.420307 1 40 40 40
1 ID2 44.933208 -94.421310 10 10 10 10
2 ID3 33.755787 -116.359998 15 260 260 260
3 ID4 33.844843 -116.549110 207 395 395 395
4 ID5 44.920570 -93.447860 133 3989 3989 3989
5 ID6 44.240309 -91.493619 52 241 241 241
6 ID7 44.968041 -94.419696 39 40 40 40
7 ID8 44.333304 -89.132027 694 694 694 694
8 ID9 33.755783 -116.360066 245 260 260 260
9 ID10 33.844847 -116.549069 188 395 395 395
10 ID11 44.920474 -93.447851 3856 3989 3989 3989
11 ID12 44.240304 -91.493768 189 241 241 241
sum_1000_to_5000
0 10
1 40
2 0
3 0
4 0
5 0
6 10
7 0
8 0
9 0
10 0
11 0
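As a sanity check of the query_radius idea on a tiny made-up set of coordinates (two points at the same spot, one far away), summing values within 500 m can be sketched as:

```python
import numpy as np
from sklearn.neighbors import BallTree

# Hypothetical coordinates: points 0 and 1 coincide, point 2 is far away
lat = np.radians([44.9680, 44.9680, 33.7558])
lon = np.radians([-94.4203, -94.4203, -116.3600])
values = np.array([1, 39, 15])

tree = BallTree(np.c_[lat, lon], metric='haversine')
radius = 500 / 6371000  # 500 m expressed as a fraction of Earth's radius
neighbors = tree.query_radius(np.c_[lat, lon], r=radius)

# Each point counts itself, so the coinciding pair sums to 1 + 39 = 40
sums = np.array([values[ix].sum() for ix in neighbors])
print(sums)  # [40 40 15]
```

Note that query_radius always includes the query point itself (distance 0), which is why the isolated point still reports its own value.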
I have two Pandas DataFrames with different numbers of columns.
df1 is a single-row DataFrame:
a X0 b Y0 c
0 233 100 56 shark -23
df2, instead, is a multi-row DataFrame:
d X0 e f Y0 g h
0 snow 201 32 36 cat 58 336
1 rain 176 99 15 tiger 63 845
2 sun 193 81 42 dog 48 557
3 storm 100 74 18 shark 39 673 # <-- This row
4 cloud 214 56 27 wolf 66 406
I would like to verify whether df1's row is in df2, but considering the X0 AND Y0 columns only, ignoring all other columns.
In this example df1's row matches df2's row at index 3, which has 100 in X0 and 'shark' in Y0.
The output for this example is:
True
Note: True/False as output is enough for me, I don't care about index of matched row.
I found similar questions, but all of them check the entire row...
Use df.merge with an if condition check on len:
In [219]: if len(df1[['X0', 'Y0']].merge(df2)):
...: print(True)
...:
True
OR:
In [225]: not (df1[['X0', 'Y0']].merge(df2)).empty
Out[225]: True
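For a quick standalone check with made-up frames (column values here are just placeholders from the example), the merge-based test behaves like this:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [233], 'X0': [100], 'Y0': ['shark'], 'c': [-23]})
df2 = pd.DataFrame({'d': ['snow', 'storm'],
                    'X0': [201, 100],
                    'Y0': ['cat', 'shark']})

# Inner merge on the shared X0/Y0 columns; a non-empty result means
# at least one row of df2 matches df1 on both keys
found = not df1[['X0', 'Y0']].merge(df2).empty
print(found)  # True
```

Selecting only ['X0', 'Y0'] before merging is what restricts the comparison to those two columns.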
Try this:
df2[(df2.X0.isin(df1.X0))&(df2.Y0.isin(df1.Y0))]
Output:
d X0 e f Y0 g h
3 storm 100 74 18 shark 39 673
duplicated
df2.append(df1).duplicated(['X0', 'Y0']).iat[-1]
True
Save a tad bit of time
df2[['X0', 'Y0']].append(df1[['X0', 'Y0']]).duplicated().iat[-1]
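On newer pandas (2.0+), where DataFrame.append was removed, the same duplicated trick can be written with pd.concat (frames below are made up to match the example):

```python
import pandas as pd

df1 = pd.DataFrame({'X0': [100], 'Y0': ['shark']})
df2 = pd.DataFrame({'X0': [201, 100], 'Y0': ['cat', 'shark']})

# Stack df1's key columns under df2's and check whether the last row
# duplicates an earlier one on X0/Y0
match = pd.concat([df2[['X0', 'Y0']], df1[['X0', 'Y0']]],
                  ignore_index=True).duplicated().iat[-1]
print(match)  # True
```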
I am new in this field and stuck on this problem. I have two datasets
all_batsman_df, this df has 5 columns('years','team','pos','name','salary')
years team pos name salary
0 1991 SF 1B Will Clark 3750000.0
1 1991 NYY 1B Don Mattingly 3420000.0
2 1991 BAL 1B Glenn Davis 3275000.0
3 1991 MIL DH Paul Molitor 3233333.0
4 1991 TOR 3B Kelly Gruber 3033333.0
all_batting_statistics_df, this df has 31 columns
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on 'year' and 'name'. But the problem is that the two data frames spell some names differently: the first dataset has 'Glenn Davis' while the second has 'Glen Davis'.
Now, I want to know: how can I merge the two using the difflib library even though the names differ?
Any help will be appreciated ...
Thanks in advance.
I have used this code, which I got from a question asked on this platform, but it is not working for me. I am adding a new column after matching names in both of the datasets. I know this is not a good approach. Kindly suggest if I can do it in a better way.
df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year'] # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']
for comp_a, addr_a in df_a[['Year','Name']].values:
    for ixb, (comp_b, addr_b) in enumerate(df_b[['years','name']].values):
        if cdifflib.CSequenceMatcher(None,comp_a,comp_b).ratio() > .6:
            df_b.loc[ixb,'merge_year'] = comp_a # creates a merge key in df_b
        if cdifflib.CSequenceMatcher(None,addr_a, addr_b).ratio() > .6:
            df_b.loc[ixb,'merge_name'] = addr_a # creates a merge key in df_b
merged_df = pd.merge(df_a,df_b,on=['merge_name','merge_years'],how='inner')
You can do
import difflib
df_b['name'] = df_b['name'].apply(lambda x: \
difflib.get_close_matches(x, df_a['name'])[0])
to replace names in df_b with closest match from df_a, then do your merge. See also this post.
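A quick standalone illustration of what get_close_matches does with misspelled names (the name lists here are made up):

```python
import difflib

correct = ['Glenn Davis', 'Don Mattingly']
misspelled = ['Glen Davis', 'Don Matingly']

# For each misspelled name, take the closest match above difflib's
# default similarity cutoff of 0.6
fixed = [difflib.get_close_matches(name, correct)[0] for name in misspelled]
print(fixed)  # ['Glenn Davis', 'Don Mattingly']
```

Beware that [0] raises IndexError when nothing clears the cutoff, so in practice you may want to guard against an empty result before indexing.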
Let me approach your problem by assuming that you have to make a data set with 2 columns, the 2 columns being 1. 'year' and 2. 'name'.
Okay:
1. we will first rename all the names which are wrong
I hope you know all the wrong names from all_batting_statistics_df; using this:
all_batting_statistics_df = all_batting_statistics_df.replace(regex=r'^Glen Davis$', value='Glenn Davis')
Once you have corrected all the spellings, start from the smaller data set that has the names you know, so it doesn't take long.
2. we need both data sets to have the same columns i.e. only 'year' and 'name'
use this to drop the columns we don't need
all_batsman_df_1 = all_batsman_df.drop(['team','pos','salary'], axis=1)
all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk','Age','Tm','Lg','G','PA','AB','R','Summary'], axis=1)
I cannot see all 31 columns, so I left the rest out; you will have to add them to the above code
3. we need to change the column names to look the same, i.e. 'year' and 'name', using python dataframe rename
df_new_1 = all_batting_statistics_df_1.rename(columns={'Year': 'year', 'Name': 'name'})
4. next, to merge them
we will use this
all_batsman_df_1.merge(df_new_1, on=['year', 'name'])
FINAL THOUGHTS:
If you don't want to do all this, find a way to export the data set to Google Sheets or Microsoft Excel and edit it with those advanced tools. If you like pandas, it's not that difficult; you will find a way. All the best!
I'm working on this dataset with the following columns, N/A counts and example of a record:
Serial No. 0
GRE Score 0
TOEFL Score 0
University Rating 0
SOP 0
LOR 0
CGPA 0
Research 0
Chance of Admit 0
dtype: int64
0: 1 337 118 4 4.5 4.5 9.65 1 0.92
1: 2 324 107 4 4.0 4.5 8.87 1 0.76
The column Chance of Admit is a normalised value ranging from 0 to 1. What I wanted to do was take this column and output corresponding ordered values where chance would be bins (low, medium, high), (unlikely, doable, likely), etc.
What I have come across is that pandas has a built-in function named to_categorical; however, I don't understand it enough, and what I have read I still don't exactly get.
This dataset would be used for a decision tree where the labels would be the chance of admit
Thank you for your help
Since they are "normalized" values... why would you need to categorize them? A simple threshold should work, right?
i.e.
0-0.33 low
0.33-0.66 medium
0.66-1.0 high
The only reason you would want to use an automated method would probably be if your number of categories keeps changing.
To do the categorization, you could use pandas cut, but you will need to determine the range and the number of bins (categories). From the docs, this should work, I think:
In [6]: df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})
In [7]: labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]
In [8]: df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
In [9]: df.head(10)
Out[9]:
value group
0 65 60 - 69
1 49 40 - 49
2 56 50 - 59
3 43 40 - 49
4 43 40 - 49
5 91 90 - 99
6 32 30 - 39
7 87 80 - 89
8 36 30 - 39
9 8 0 - 9
You can then apply the same cut to your Chance of Admit column instead of value, filling in the necessary ranges for your discrete bins, either by threshold or automatically based on the number of bins.
For your reference:
https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
IIUC, you want to map a continuous variable to a categorical value based on ranges, for example:
0.96 -> high,
0.31 -> low
...
So pandas provides with a function for just that, cut, from the documentation:
Use cut when you need to segment and sort data values into bins. This
function is also useful for going from a continuous variable to a
categorical variable.
Setup
Serial No. GRE Score TOEFL Score ... CGPA Research Chance of Admit
0 1 337 118 ... 9.65 1 0.92
1 2 324 107 ... 8.87 1 0.76
2 2 324 107 ... 8.87 1 0.31
3 2 324 107 ... 8.87 1 0.45
[4 rows x 9 columns]
Assuming the above setup, you could use cut like this:
labels = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=['low', 'medium', 'high'])
print(labels)
Output
0 high
1 high
2 low
3 medium
Name: Chance of Admit, dtype: category
Categories (3, object): [low < medium < high]
Notice that we use 3 bins: [(0, 0.33], (0.33, 0.66], (0.66, 1.0]] and that the values of the column Chance of Admit are [0.92, 0.76, 0.31, 0.45]. If you want to change the label names, just change the value of the labels parameter, for example: labels=['unlikely', 'doable', 'likely']. If you need an ordinal value, do:
labels = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=list(range(3)))
print(labels)
Output
0 2
1 2
2 0
3 1
Name: Chance of Admit, dtype: category
Categories (3, int64): [0 < 1 < 2]
Finally to put all in perspective you could do the following to add it to your DataFrame:
df['group'] = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=['low', 'medium', 'high'])
print(df)
Output
Serial No. GRE Score TOEFL Score ... Research Chance of Admit group
0 1 337 118 ... 1 0.92 high
1 2 324 107 ... 1 0.76 high
2 2 324 107 ... 1 0.31 low
3 2 324 107 ... 1 0.45 medium
[4 rows x 10 columns]
I'm not getting my whole output, nor my column names, on my screen.
import sqlite3
import pandas as pd
hello = sqlite3.connect(r"C:\Users\ravjo\Downloads\Chinook.sqlite")
rs = hello.execute("SELECT * FROM PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000")
df = pd.DataFrame(rs.fetchall())
hello.close()
print(df.head())
actual result:
0 1 2 3 4 ... 6 7 8 9 10
0 1 3390 3390 One and the Same 271 ... 23 None 217732 3559040 0.99
1 1 3392 3392 Until We Fall 271 ... 23 None 230758 3766605 0.99
2 1 3393 3393 Original Fire 271 ... 23 None 218916 3577821 0.99
3 1 3394 3394 Broken City 271 ... 23 None 228366 3728955 0.99
4 1 3395 3395 Somedays 271 ... 23 None 213831 3497176 0.99
[5 rows x 11 columns]
expected result:
PlaylistId TrackId TrackId Name AlbumId MediaTypeId \
0 1 3390 3390 One and the Same 271 2
1 1 3392 3392 Until We Fall 271 2
2 1 3393 3393 Original Fire 271 2
3 1 3394 3394 Broken City 271 2
4 1 3395 3395 Somedays 271 2
GenreId Composer Milliseconds Bytes UnitPrice
0 23 None 217732 3559040 0.99
1 23 None 230758 3766605 0.99
2 23 None 218916 3577821 0.99
3 23 None 228366 3728955 0.99
4 23 None 213831 3497176 0.99
The ... in the middle indicates that some of the data has been omitted from the display. If you want to see the entire data, you should modify the pandas options. You can do so by using the pandas.set_option() method. Documentation here.
In your case, you should set display.max_columns to None so that pandas displays an unlimited number of columns. You will also have to read in the column names from the database or set them manually. Refer here on how to read in the column names from the database itself.
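For the column names, one sketch (using an in-memory database as a stand-in for Chinook.sqlite, with a made-up Track table) is to build them from the cursor's description attribute:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')  # stand-in for Chinook.sqlite
conn.execute("CREATE TABLE Track (TrackId INTEGER, Name TEXT, Milliseconds INTEGER)")
conn.execute("INSERT INTO Track VALUES (1, 'One and the Same', 217732)")

rs = conn.execute("SELECT * FROM Track WHERE Milliseconds < 250000")
# description holds one 7-tuple per result column; the first item is the name
df = pd.DataFrame(rs.fetchall(), columns=[d[0] for d in rs.description])
conn.close()
print(df.columns.tolist())  # ['TrackId', 'Name', 'Milliseconds']
```

Alternatively, pd.read_sql_query(sql, conn) returns a DataFrame with proper column names in a single call.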
To display all the columns please use below mentioned code snippet.
pd.set_option("display.max_columns",None)
By default, pandas limits the number of rows for display. However, you can change it as per your need. Here is a helper function I use whenever I need to print a full data-frame:
def print_full(df):
    import pandas as pd
    pd.set_option('display.max_rows', len(df))
    print(df)
    pd.reset_option('display.max_rows')
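An alternative, if you prefer not to set and reset options by hand, is pandas' option_context manager, which restores the options automatically when the block exits:

```python
import pandas as pd

df = pd.DataFrame({'x': range(100)})

# Lift the row/column display limits for this print only;
# the previous option values are restored on exit
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)
```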