I'm scraping multiple sports betting websites in order to compare the odds for each match across the websites.
My question is how to identify the match_id of a match that already exists in the DB but has its team names written in a different way.
Please feel free to add any approaches even if they don't use dataframes or SQLite.
The columns for matches table are:
match_id: int, sport: string, home_team: string, visitor_team: string, date: string (dd/mm/YYYY)
So for each new match I want to verify if it already exists in the DB.
New match = (sport_to_check, home_team_to_check, visitor_team_to_check, date_to_check)
My pseudo-code is like:
SELECT match_id FROM matches
WHERE sport = (sport_to_check)
AND date = (date_to_check)
AND (fuzz(home_team, home_team_to_check) > 80 OR fuzz(visitor_team, visitor_team_to_check) > 80) -- the fuzzy score check
If no match is found the new row would be inserted.
I believe there's no way to mix python with SQL like that so that's why I refer to it as "pseudo-code".
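(Side note: sqlite3 can in fact mix Python with SQL. Connection.create_function registers a Python callable as a SQL function, so the pseudo-code above is implementable almost verbatim. A minimal sketch, assuming fuzzywuzzy's token_set_ratio as the scorer and the schema described above; 'matches.db' is a hypothetical file name:)
import sqlite3
from fuzzywuzzy import fuzz

conn = sqlite3.connect('matches.db')  # hypothetical DB file
# expose fuzz.token_set_ratio to SQL as a two-argument function fuzz(a, b)
conn.create_function('fuzz', 2, fuzz.token_set_ratio)
cur = conn.execute(
    """SELECT match_id FROM matches
       WHERE sport = ? AND date = ?
         AND (fuzz(home_team, ?) > 80 OR fuzz(visitor_team, ?) > 80)""",
    (sport_to_check, date_to_check, home_team_to_check, visitor_team_to_check))
row = cur.fetchone()  # None -> not found, so insert the new match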
I can also pull the matches table into a Pandas dataframe and do the evaluation with it, if that works (how?...).
At any given time the matches table isn't expected to hold more than a couple of thousand records.
Let me give you some examples of the expected output, where the solution is represented by find(row).
Having matches table in DB as:
+----------+------------+-----------------------------+----------------------+------------+
| match_id | sport | home_team | visitor_team | date |
+----------+------------+-----------------------------+----------------------+------------+
| 84 | football | confianca | cuiaba esporte clube | 24/11/2020 |
| 209 | football | cs alagoana | operario pr | 24/11/2020 |
| 184 | football | grenoble foot 38 | as nancy lorraine | 24/11/2020 |
| 7 | football | sv turkgucu-ataspor munchen | saarbrucken | 24/11/2020 |
| 414 | handball | dinamo bucareste | usam nimes | 24/11/2020 |
| 846 | handball | benidorm | naturhouse la rioja | 25/11/2020 |
| 874 | handball | cegledi | ferencvarosi tc | 25/11/2020 |
| 418 | handball | lemvig-thyboron | kif kolding | 25/11/2020 |
| 740 | ice hockey | tps | kookoo | 25/11/2020 |
| 385 | football | stevenage | hull | 29/11/2020 |
+----------+------------+-----------------------------+----------------------+------------+
And new matches to evaluate:
+----------------+------------+---------------------+---------------------+------------+
| row (for demo) | sport | home_team | visitor_team | date |
+----------------+------------+---------------------+---------------------+------------+
| A | football | confianca-se | cuiaba mt | 24/11/2020 |
| B | football | csa | operario | 24/11/2020 |
| C | football | grenoble | nancy | 24/11/2020 |
| D | football | sv turkgucu ataspor | 1 fc saarbrucken | 24/11/2020 |
| E | handball | dinamo bucuresti | nimes | 24/11/2020 |
| F | handball | bm benidorm | bm logrono la rioja | 25/11/2020 |
| G | handball | cegledi kkse | ftc budapest | 25/11/2020 |
| H | handball | lemvig | kif kobenhavn | 25/11/2020 |
| I | ice hockey | turku ps | kookoo kouvola | 25/11/2020 |
| J | football | stevenage borough | hull city | 29/11/2020 |
| K | football | west brom | sheffield united | 28/11/2020 |
+----------------+------------+---------------------+---------------------+------------+
Outputs:
find(A) returns: 84
find(B) returns: 209
find(C) returns: 184
find(D) returns: 7
find(E) returns: 414
find(F) returns: 846
find(G) returns: 874
find(H) returns: 418
find(I) returns: 740
find(J) returns: 385
find(K) returns: (something like "not found" => I would then insert the new row)
Thanks!
Basically, I filter the original table down by the given date and sport, then use fuzzywuzzy to find the best match between the home and visitor teams among the remaining rows:
Setup:
import pandas as pd

cols = ['match_id', 'sport', 'home_team', 'visitor_team', 'date']
df1 = pd.DataFrame([
    ['84', 'football', 'confianca', 'cuiaba esporte clube', '24/11/2020'],
    ['209', 'football', 'cs alagoana', 'operario pr', '24/11/2020'],
    ['184', 'football', 'grenoble foot 38', 'as nancy lorraine', '24/11/2020'],
    ['7', 'football', 'sv turkgucu-ataspor munchen', 'saarbrucken', '24/11/2020'],
    ['414', 'handball', 'dinamo bucareste', 'usam nimes', '24/11/2020'],
    ['846', 'handball', 'benidorm', 'naturhouse la rioja', '25/11/2020'],
    ['874', 'handball', 'cegledi', 'ferencvarosi tc', '25/11/2020'],
    ['418', 'handball', 'lemvig-thyboron', 'kif kolding', '25/11/2020'],
    ['740', 'ice hockey', 'tps', 'kookoo', '25/11/2020'],
    ['385', 'football', 'stevenage', 'hull', '29/11/2020']], columns=cols)

cols = ['row', 'sport', 'home_team', 'visitor_team', 'date']
df2 = pd.DataFrame([
    ['A', 'football', 'confianca-se', 'cuiaba mt', '24/11/2020'],
    ['B', 'football', 'csa', 'operario', '24/11/2020'],
    ['C', 'football', 'grenoble', 'nancy', '24/11/2020'],
    ['D', 'football', 'sv turkgucu ataspor', '1 fc saarbrucken', '24/11/2020'],
    ['E', 'handball', 'dinamo bucuresti', 'nimes', '24/11/2020'],
    ['F', 'handball', 'bm benidorm', 'bm logrono la rioja', '25/11/2020'],
    ['G', 'handball', 'cegledi kkse', 'ftc budapest', '25/11/2020'],
    ['H', 'handball', 'lemvig', 'kif kobenhavn', '25/11/2020'],
    ['I', 'ice hockey', 'turku ps', 'kookoo kouvola', '25/11/2020'],
    ['J', 'football', 'stevenage borough', 'hull city', '29/11/2020'],
    ['K', 'football', 'west brom', 'sheffield united', '28/11/2020']], columns=cols)
Code:
import pandas as pd
from fuzzywuzzy import fuzz
import string

def calculate_ratio(row):
    return fuzz.token_set_ratio(row['col1'], row['col2'])

def find(df1, df2, row_search):
    alpha = df2.query('row == "{row_search}"'.format(row_search=row_search))
    sport = alpha.iloc[0]['sport']
    date = alpha.iloc[0]['date']
    home_team = alpha.iloc[0]['home_team']
    visitor_team = alpha.iloc[0]['visitor_team']
    beta = df1.query('sport == "{sport}" & date == "{date}"'.format(sport=sport, date=date))
    if len(beta) == 0:
        return 'Not found.'
    else:
        temp = pd.DataFrame({
            'match_id': list(beta['match_id']),
            'col1': list(beta['home_team'] + ' ' + beta['visitor_team']),
            'col2': [home_team + ' ' + visitor_team] * len(beta)})
        temp['score'] = temp.apply(calculate_ratio, axis=1)
        temp = temp.sort_values('score', ascending=False)
        outcome = temp.head(1).iloc[0]['match_id']
        return outcome

for row_alpha in string.ascii_uppercase[0:11]:
    outcome = find(df1, df2, row_alpha)
    print('{row_alpha} --> {outcome}'.format(row_alpha=row_alpha, outcome=outcome))
Output:
A --> 84
B --> 209
C --> 184
D --> 7
E --> 414
F --> 846
G --> 874
H --> 418
I --> 740
J --> 385
K --> Not found.
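One caveat worth noting: the score is only used for ranking, so if the date/sport filter leaves candidates but none of them is the true counterpart, the best (wrong) one is still returned. A small hedged tweak inside find(), reusing the temp frame and the 80 cutoff from the question's pseudo-code:
# guard against weak best matches; 80 is the question's cutoff, not a tuned value
best = temp.sort_values('score', ascending=False).iloc[0]
outcome = best['match_id'] if best['score'] > 80 else 'Not found.'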
My table in postgis is this with name samplecol:
 vessel_hash  | name_city   | station | speed | latitude    | longitude   | course | heading | timestamp                | the_geom
--------------+-------------+---------+-------+-------------+-------------+--------+---------+--------------------------+----------------------------------------------------
 103079215239 | newyork     | 841     | 5     | -5.41844510 | 36.12160900 | 314    | 511     | 2016-06-12T06:31:04.000Z | 0101000020E61000001BF33AE2900F424090AF4EDF7CAC15C0
 103079215239 | washangton  | 3008    | 0     | -5.41778710 | 36.12144900 | 117    | 511     | 2016-06-12T06:43:27.000Z | 0101000020E6100000E2900DA48B0F424042C3AC61D0AB15C0
 103079215239 | paris       | 841     | 17    | -5.42236900 | 36.12356900 | 259    | 511     | 2016-06-12T06:50:27.000Z | 0101000020E610000054E6E61BD10F42407C60C77F81B015C0
 103079215239 | room        | 841     | 17    | -5.41781710 | 36.12147900 | 230    | 511     | 2016-06-12T06:27:03.000Z | 0101000020E61000004D13B69F8C0F424097D6F03ED8AB15C0
 103079215239 | pensilvenia | 841     | 61    | -5.42201900 | 36.13256100 | 157    | 511     | 2016-06-12T06:08:04.000Z | 0101000020E6100000CFDC43C2F71042409929ADBF25B015C0
 103079215239 | jorjia      | 841     | 9     | -5.41834020 | 36.12225000 | 359    | 511     | 2016-06-12T06:33:03.000Z | 0101000020E6100000CFF753E3A50F42408D68965F61AC15C0
The output of the query must be like this:
 name_city   | latitude    | longitude
-------------+-------------+-------------
 newyork     | -5.41844510 | 36.12160900
 pensilvenia | -5.42201900 | 36.13256100
 jorjia      | -5.41834020 | 36.12225000
My code is this:
poisInpolygon = """SELECT samplecol.latitude, samplecol.name_city, samplecol.longitude
FROM samplecol
WHERE ST_Contains(samplecol.the_geom, ('POLYGON((-15.0292969 47.6357836, -15.2050781 47.5172007,
-16.2597656 29.3821751, 35.0683594 26.1159859, 38.0566406 47.6357836, -15.0292969 47.6357836))'));"""
cursor.execute(poisInpolygon)
exists1 = cursor.fetchall()
count1 = 0
for ex1 in exists1:
    count1 = count1 + 1
    print ex1, "\n"
print "points", count1
I tried this code and query but it returns 0. What is the problem? What is the correct query?
I believe it is the classic mix-up of long, lat. Based on your data sample, if we create the_geom using (longitude, latitude), your points land somewhere in Tanzania and therefore far away from your polygon:
SELECT ST_SetSRID(ST_MakePoint(longitude,latitude),4326) FROM samplecol
UNION ALL SELECT 'SRID=4326;POLYGON((-15.0292969 47.6357836, -15.2050781 47.5172007,
-16.2597656 29.3821751, 35.0683594 26.1159859, 38.0566406 47.6357836, -15.0292969 47.6357836))'::geometry;
If we switch the order to latitude,longitude your points land close to Gibraltar and the geometries do overlap:
SELECT ST_SetSRID(ST_MakePoint(latitude,longitude),4326) FROM samplecol
UNION ALL SELECT 'SRID=4326;POLYGON((-15.0292969 47.6357836, -15.2050781 47.5172007,
-16.2597656 29.3821751, 35.0683594 26.1159859, 38.0566406 47.6357836, -15.0292969 47.6357836))'::geometry;
So my guess is that you stored longitude and latitude values in the wrong columns ;)
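If renaming the columns isn't an option, a hedged sketch of a corrected query: it rebuilds each point in the (latitude, longitude) order that lands near Gibraltar, and passes the polygon as the first argument, since ST_Contains(container, contained) expects the container first (the original query had the point geometry in that slot):
poisInpolygon = """
SELECT name_city, latitude, longitude
FROM samplecol
WHERE ST_Contains(
    'SRID=4326;POLYGON((-15.0292969 47.6357836, -15.2050781 47.5172007,
     -16.2597656 29.3821751, 35.0683594 26.1159859,
     38.0566406 47.6357836, -15.0292969 47.6357836))'::geometry,
    ST_SetSRID(ST_MakePoint(latitude, longitude), 4326));"""
cursor.execute(poisInpolygon)
for row in cursor.fetchall():
    print(row)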
I have to compare 2 different sources and identify all the mismatches for all IDs
Source_excel table
+-----+-------------+------+----------+
| id | name | City | flag |
+-----+-------------+------+----------+
| 101 | Plate | NY | Ready |
| 102 | Back washer | NY | Sold |
| 103 | Ring | MC | Planning |
| 104 | Glass | NMC | Ready |
| 107 | Cover | PR | Ready |
+-----+-------------+------+----------+
Source_dw table
+-----+----------+------+----------+
| id | name | City | flag |
+-----+----------+------+----------+
| 101 | Plate | NY | Planning |
| 102 | Nut | TN | Expired |
| 103 | Ring | MC | Planning |
| 104 | Top Wire | NY | Ready |
| 105 | Bolt | MC | Expired |
+-----+----------+------+----------+
Expected result
+-----+-------------+----------+------------+----------+------------+---------+------------------+
| ID | excel_name | dw_name | excel_flag | dw_flag | excel_city | dw_city | RESULT |
+-----+-------------+----------+------------+----------+------------+---------+------------------+
| 101 | Plate | Plate | Ready | Planning | NY | NY | FLAG_MISMATCH |
| 102 | Back washer | Nut | Sold | Expired | NY | TN | NAME_MISMATCH |
| 102 | Back washer | Nut | Sold | Expired | NY | TN | FLAG_MISMATCH |
| 102 | Back washer | Nut | Sold | Expired | NY | TN | CITY_MISMATCH |
| 103 | Ring | Ring | Planning | Planning | MC | MC | ALL_MATCH |
| 104 | Glass | Top Wire | Ready | Ready | NMC | NY | NAME_MISMATCH |
| 104 | Glass | Top Wire | Ready | Ready | NMC | NY | CITY_MISMATCH |
| 107 | Cover | | Ready | | PR | | MISSING IN DW |
| 105 | | Bolt | | Expired | | MC | MISSING IN EXCEL |
+-----+-------------+----------+------------+----------+------------+---------+------------------+
I'm new to Python and I have tried the code below, but it is not giving the expected result.
import pandas as pd
source_excel = pd.read_csv('C:/Mypython/Newyork/excel.csv',encoding = "ISO-8859-1")
source_dw = pd.read_csv('C:/Mypython/Newyork/dw.csv',encoding = "ISO-8859-1")
comparison_result = pd.merge(source_excel,source_dw,on='ID',how='outer',indicator=True)
comparison_result.loc[(comparison_result['_merge'] == 'both') & (comparison_result['name_x'] != comparison_result['name_y']), 'Result'] = 'NAME_MISMATCH'
comparison_result.loc[(comparison_result['_merge'] == 'both') & (comparison_result['city_x'] != comparison_result['city_y']), 'Result'] = 'CITY_MISMATCH'
comparison_result.loc[(comparison_result['_merge'] == 'both') & (comparison_result['flag_x'] != comparison_result['flag_y']), 'Result'] = 'FLAG_MISMATCH'
comparison_result.loc[comparison_result['_merge'] == 'left_only', 'Result'] = 'Missing in dw'
comparison_result.loc[comparison_result['_merge'] == 'right_only', 'Result'] = 'Missing in excel'
comparison_result.loc[comparison_result['_merge'] == 'both', 'Result'] = 'ALL_Match'
csv_column = comparison_result[['ID','name_x','name_y','city_x','city_y','flag_x','flag_y','Result']]
print(csv_column)
Is there any other way I can check all the conditions and report each one in a separate row? If separate rows are not possible, I at least need all the mismatches in the same column, separated, something like FLAG_MISMATCH,CITY_MISMATCH.
You could do:
df = pd.merge(Source_excel, Source_dw, on = 'ID', how = 'left', suffixes = (None, '_dw'))
This will create a new dataframe like the one you want, although you'll have to reorder the columns as you want. Note that the '_dw' is a suffix and not a prefix in this case.
You can reorder the columns as you like by using this code:
#Complement with the order you want
df = df[['ID', 'excel_name']]
For the result column I think you'll have to create a column for each condition you're trying to check (at least that's the way I know how to). Here's an example:
#This will return 1 if there's a match and 0 otherwise
df['result_flag'] = df.apply(lambda x: 1 if x.excel_flag == x.flag_dw else 0, axis = 1)
Here is a way to do the scoring:
df['result'] = 0
# repeated mask / df.loc statements suggest a loop over a list of tuples
mask = df['excel_flag'] != df['dw_flag']
df.loc[mask, 'result'] += 1
mask = df['excel_name'] != df['dw_name']
df.loc[mask, 'result'] += 10
df['result'] = df['result'].map({0: 'all match',
                                 1: 'flag mismatch',
                                 10: 'name mismatch',
                                 11: 'all mismatch'})
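To get one row per mismatch, as the question asked, the loop over a list of tuples that the comment hints at can be made explicit. A sketch, assuming the merged frame df and the excel_/dw_ column naming used in this thread:
import pandas as pd

checks = [('excel_name', 'dw_name', 'NAME_MISMATCH'),
          ('excel_city', 'dw_city', 'CITY_MISMATCH'),
          ('excel_flag', 'dw_flag', 'FLAG_MISMATCH')]
parts = []
for excel_col, dw_col, label in checks:
    hit = df[df[excel_col] != df[dw_col]].copy()  # rows failing this check
    hit['RESULT'] = label
    parts.append(hit)
diff_rows = pd.concat(parts).sort_values('ID')  # one row per mismatch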
I am conducting a data analysis in both R and Python to compare their differences. Currently I am struggling to translate
data %>%
mutate(pct_leader = ballotsLeader/validBallots) %>%
group_by(community) %>%
mutate(mean_pct_leader = mean(pct_leader),
sd_pct_leader = sd(pct_leader),
up_pct_leader = mean_pct_leader+2*sd_pct_leader) %>%
filter(pct_leader > up_pct_leader) %>%
top_n(5, pct_leader)
into Python.
I have tried the following python code
grouped = data.assign(pct_leader = lambda x: x['ballotsLeader']/x['validBallots']).groupby('community').assign(mean_pct_leader = lambda x: mean(x['pct_leader']),
sd_pct_leader = lambda x: stdev(x['pct_leader']),
up_pct_leader = lambda x: x['mean_pct_leader']+2*x['sd_pct_leader']).query('pct_leader > up_pct_leader').pct_leader.nlargest(5)
but get a AttributeError: 'DataFrameGroupBy' object has no attribute 'assign' error.
I realize this is because the DataFrameGroupBy object does not have the assign method.
How can I preserve the order of the R code but translate it into Python?
Edit: Here is the data I am working with
| community | province | municipality | precinct | registeredVoters | emptyBallots | invalidBallots | validBallots | ballotsLeader |
|-----------|-----------|--------------|----------|------------------|--------------|----------------|--------------|---------------|
| GALICIA | Coruña, A | Ames | 001 B | 270 | 3 | 7 | 206 | 129 |
| GALICIA | Coruña, A | Ames | 004 A | 356 | 2 | 7 | 257 | 136 |
| GALICIA | Coruña, A | Ames | 002 C | 296 | 1 | 2 | 214 | 149 |
| GALICIA | Coruña, A | Ames | 010 U | 646 | 15 | 10 | 507 | 189 |
| GALICIA | Coruña, A | Ames | 012 B | 695 | 6 | 8 | 479 | 247 |
Without seeing some data it's hard to get this exactly right, but this should work. A grouped transform (rather than agg) mirrors R's grouped mutate: it broadcasts the group statistics back onto every row, so pct_leader is still available for the final filter:
(data.assign(pct_leader=data['ballotsLeader'] / data['validBallots'])
     .assign(mean_pct_leader=lambda d: d.groupby('community')['pct_leader'].transform('mean'),
             sd_pct_leader=lambda d: d.groupby('community')['pct_leader'].transform('std'),
             up_pct_leader=lambda d: d['mean_pct_leader'] + 2 * d['sd_pct_leader'])
     .query('pct_leader > up_pct_leader')
     .nlargest(5, 'pct_leader')
)
Using datar, it will be easy for you to translate your R code into python:
from datar.all import f, mutate, group_by, mean, sd, filter, slice_max

data >> \
    mutate(pct_leader = f.ballotsLeader/f.validBallots) >> \
    group_by(f.community) >> \
    mutate(mean_pct_leader = mean(f.pct_leader),
           sd_pct_leader = sd(f.pct_leader),
           up_pct_leader = f.mean_pct_leader + 2*f.sd_pct_leader) >> \
    filter(f.pct_leader > f.up_pct_leader) >> \
    slice_max(f.pct_leader, n=5)
# top_n() has been superseded in favour of slice_min()/slice_max()
I am the author of the package. Feel free to submit issues if you have any questions.
Is it possible with pandas to merge two (or three...) header rows containing merged Excel cells into one?
For example table:
| Report for .....                                   | Income               | Ordered                               |
| Brend          | Art of supplier | Contract        | pcs | income price   | pcs | rub (prize for selling)         |
|----------------|-----------------|-----------------|-----|----------------|-----|---------------------------------|
| Elena Chezelle | Y0060           | 0400-6752 Agent | 85  | 245,00         | 226 | 785,00                          |
| Amour Bridal   | ALWE-1199-WHITE | 0400-6752 Agent | 47  | 56,00          | 163 | 857,00                          |
into:
| Brend          | Art of supplier | Contract        | Income pcs | Income price | Ordered pcs | Ordered rub (prize for selling) |
|----------------|-----------------|-----------------|------------|--------------|-------------|---------------------------------|
| Elena Chezelle | Y0060           | 0400-6752 Agent | 85         | 245,00       | 226         | 785,00                          |
| Amour Bridal   | ALWE-1199-WHITE | 0400-6752 Agent | 47         | 56,00        | 163         | 857,00                          |
Until now I have done it with code using loops:
file = 'static/ExportToEXCELOPENXML - 2020-05-11T180206.635.xlsx'
df = pd.read_excel(file)
df_list = []
for x in df.columns:
    if 'Unnamed' in x or 'Report' in x:
        df_list.append('')
    else:
        df_list.append(x + ' ')
i = 0
while i < len(df_list):
    if df_list[i] == '':
        df_list[i] = df_list[i - 1]
    i += 1
print(df_list)
df.columns = df_list + df.iloc[0]
df_new = df.iloc[1:].reset_index(drop=True)
df_new.to_excel('static/test.xlsx', index=None)
Is it possible to do it with pandas alone (without loops)?
When you read the Excel file you can specify two rows to be used as the header:
df = pd.read_excel('static/ExportToEXCELOPENXML - 2020-05-11T180206.635.xlsx', header=[0,1])
If you want to join them together with a separator, you can do it after reading:
df.columns = df.columns.map('_'.join)
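Putting the two steps together, a sketch of the whole flow (pandas returns 'Unnamed: ...' placeholders for blank header cells, so those, plus the 'Report' spanner, are filtered out before joining, mirroring what the loop version above did):
import pandas as pd

df = pd.read_excel('static/ExportToEXCELOPENXML - 2020-05-11T180206.635.xlsx',
                   header=[0, 1])
# keep only meaningful header parts, then join the two levels with a space
df.columns = [' '.join(str(p) for p in col
                       if not str(p).startswith('Unnamed') and 'Report' not in str(p))
              for col in df.columns]
df.to_excel('static/test.xlsx', index=None)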
What is the best way to compare 2 dataframes with the same column names, row by row, and when a cell differs, report the Before and After values and which cell is different?
I know this question has been asked a lot, but none of the existing answers fit my use case. Speed is important. There is a package called datacompy, but it is not practical if I have to compare dataframes in a loop 5,000 times (I'm only comparing 2 at a time, but around 10,000 in total).
I don't want to join the dataframes on a column. I want to compare them row by row: row 1 with row 1, etc. If a column in row 1 is different, I only need to know the column name, the before, and the after. Perhaps if it is numeric I could also add a column with the absolute value of the difference.
The problem is there is sometimes an edge case where rows are out of order (only by 1 entry), and I don't want these to come up as false positives.
Example:
These dataframes would be created when I pass in race # (there are 5,000 race numbers)
df1
+-----+-------+------+----------+-------------+
| Id  | Speed | Name | Distance | Location    |
+-----+-------+------+----------+-------------+
| 181 | 10.3  | Joe  | 2        | New York    |
| 192 | 9.1   | Rob  | 1        | Chicago     |
| 910 | 1.0   | Fred | 5        | Los Angeles |
| 97  | 1.8   | Bob  | 8        | New York    |
| 88  | 1.2   | Ken  | 7        | Miami       |
| 99  | 1.1   | Mark | 6        | Austin      |
+-----+-------+------+----------+-------------+
df2:
+-----+-------+------+----------+-------------+
| Id  | Speed | Name | Distance | Location    |
+-----+-------+------+----------+-------------+
| 181 | 10.3  | Joe  | 2        | New York    |
| 192 | 9.4   | Rob  | 1        | Chicago     |
| 910 | 1.0   | Fred | 5        | Los Angeles |
| 97  | 1.5   | Bob  | 8        | New York    |
| 99  | 1.1   | Mark | 6        | Austin      |
| 88  | 1.2   | Ken  | 7        | Miami       |
+-----+-------+------+----------+-------------+
diff:
+-------+----------+--------+-------+
| Race# | Diff_col | Before | After |
+-------+----------+--------+-------+
| 123   | Speed    | 9.1    | 9.4   |
| 123   | Speed    | 1.8    | 1.5   |
+-------+----------+--------+-------+
An example of a false positive is in the last 2 rows, Ken and Mark.
I could summarize the differences in one line per race, but if the dataframe has 3,000 records and there are 1,000 differences (unlikely, but possible) then I would have tons of columns. I figured this way was easier, as I can export to Excel and then sort by race # to see all the differences, or by diff_col to see which columns differ.
import numpy as np

def DiffCol2(df1, df2, race_num):
    is_diff = False
    diff_cols_list = []
    row_coords, col_coords = np.where(df1 != df2)
    diffDf = []
    alldiffDf = []
    for y in set(col_coords):
        col_df1 = df1.iloc[:, y].name
        col_df2 = df2.iloc[:, y].name
        for index, row in df1.iterrows():
            if df1.loc[index, col_df1] != df2.loc[index, col_df2]:
                col_name = col_df1
                if col_df1 != col_df2:
                    col_name = (col_df1, col_df2)
                alldiffDf.append({'Race #': race_num, 'Column Name': col_name,
                                  'Before': df2.loc[index, col_df2],
                                  'After': df1.loc[index, col_df1]})
                try:
                    check_edge_case = df1.loc[index, col_df1] == df2.loc[index + 1, col_df1]
                except Exception:
                    check_edge_case = False
                try:
                    check_edge_case_two = df1.loc[index, col_df1] == df2.loc[index - 1, col_df1]
                except Exception:
                    check_edge_case_two = False
                if not (check_edge_case or check_edge_case_two):
                    col_name = col_df1
                    if col_df1 != col_df2:
                        # if for some reason the column names aren't the same
                        # (should never happen), I want to know both names
                        col_name = (col_df1, col_df2)
                    is_diff = True
                    diffDf.append({'Race #': race_num, 'Column Name': col_name,
                                   'Before': df2.loc[index, col_df2],
                                   'After': df1.loc[index, col_df1]})
    return diffDf, alldiffDf, is_diff
[apologies in advance for the weirdly formatted tables, I did my best given how annoying pasting tables into S/O is]
The code below works if the dataframes have the same column names and the same number of rows and columns, so it compares only the values in the tables.
Not sure where you want to get Race# from.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df2 = df1.copy(deep=True)
df2.loc[5, 'B'] = 100  # Creating difference
df2.loc[6, 'C'] = 100  # Creating difference

dif = []
for col in df1.columns:
    for bef, aft in zip(df1[col], df2[col]):
        if bef != aft:
            dif.append([col, bef, aft])
print(dif)
Results below
An alternative solution without loops:
df = df1.melt()
df.columns=['Column', 'Before']
df.insert(2, 'After', df2.melt().value)
df[df.Before!=df.After]
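To also neutralize the out-of-order edge case from the question (the Ken/Mark swap), one option is to sort both frames on a stable key before the same melt comparison; this assumes Id identifies the same entity in both frames:
# line the rows up on Id first so single-row swaps don't show up as diffs
a = df1.sort_values('Id').reset_index(drop=True)
b = df2.sort_values('Id').reset_index(drop=True)
out = a.melt()
out.columns = ['Diff_col', 'Before']
out.insert(2, 'After', b.melt().value)
print(out[out.Before != out.After])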