I have this massive dataset and I need to subset the data by using criteria. This is for illustration:
| Group | Name | Value |
|--------------------|-------------|-----------------|
| A | Bill| 256 |
| A | Jack| 268 |
| A | Melissa| 489 |
| B | Amanda | 787 |
| B | Eric| 485 |
| C | Matt| 1236 |
| C | Lisa| 1485 |
| D | Ben | 785 |
| D | Andrew| 985 |
| D | Cathy| 1025 |
| D | Suzanne| 1256 |
| D | Jim| 1520 |
I know how to handle this problem manually, such as:
import pandas as pd
df=pd.read_csv('Test.csv')
A=df[df.Group =="A "].to_numpy()
B=df[df.Group =="B "].to_numpy()
C=df[df.Group =="C "].to_numpy()
D=df[df.Group =="D "].to_numpy()
But considering the size of the data, it will take a lot of time if I handle it in this way.
With that in mind, I would like to know if it is possible to build an iteration with an IF statement that can look at the values in column “Group”(table above) . I was thinking, IF statement to see if the first value is the same with one the below if so, group them and create a new array/ dataframe.
I have the following Django code that is being run on PostgreSQL and Huey (an automatic scheduler). The problem is that, whenever the periodic task is run, Django removes the previous rows in a table instead of adding on to existent ones.
Scheduled code:
#periodic_task(crontab(minute='*/1'))
def scheduled():
team = nitrotype.Team('PR2W')
team_id = team.data["info"]["teamID"]
timestamp = datetime.datetime.now()
for members in team.data["members"]:
racer_id = members["userID"]
races = members["played"]
time = members["secs"]
typed = members["typed"]
errs = members["errs"]
rcd = RaceData(
racer_id=racer_id,
team_id=team_id,
timestamp=timestamp,
races=races,
time=time,
typed=typed,
errs=errs
)
rcd.save()
Basically, the above code is going to run every minute. Here's the database (PSQL) data that I started with:
nttracker=# TABLE data_racedata;
racer_id | team_id | timestamp | races | time | typed | errs
----------+---------+------------+--------+---------+----------+--------
35051013 | 765879 | 1623410530 | 4823 | 123226 | 793462 | 42975
35272676 | 765879 | 1623410530 | 8354 | 211400 | 1844434 | 38899
36690038 | 765879 | 1623410530 | 113 | 2849 | 16066 | 995
38486084 | 765879 | 1623410530 | 34448 | 903144 | 8043345 | 586297
38625235 | 765879 | 1623410530 | 108 | 2779 | 20919 | 1281
39018052 | 765879 | 1623410530 | 1908 | 48898 | 395187 | 24384
39114823 | 765879 | 1623410530 | 2441 | 64170 | 440503 | 32594
...
(50 rows)
Afterward, I run Huey, which executes scheduled() every one minute. Here's what I end up with after two minutes (in other words, two iterations):
nttracker=# TABLE data_racedata;
racer_id | team_id | timestamp | races | time | typed | errs
----------+---------+------------+--------+---------+----------+--------
35051013 | 765879 | 1623410992 | 4823 | 123226 | 793462 | 42975
35272676 | 765879 | 1623410992 | 8354 | 211400 | 1844434 | 38899
36690038 | 765879 | 1623410992 | 113 | 2849 | 16066 | 995
38486084 | 765879 | 1623410992 | 34448 | 903144 | 8043345 | 586297
38625235 | 765879 | 1623410992 | 108 | 2779 | 20919 | 1281
39018052 | 765879 | 1623410992 | 1908 | 48898 | 395187 | 24384
39114823 | 765879 | 1623410992 | 2441 | 64170 | 440503 | 32594
...
(50 rows)
Note: most of the data just happened to be the same, the timestamp's always different among data generated from automated executions.
I'd like 150 rows instead of 50 rows as I want the data to accumulate rather than replace previous ones. Can anyone please tell me where I've got wrong?
If anyone needs additional log outputs, please comment below.
EDIT Model
class RaceData(models.Model):
racer_id = models.IntegerField(primary_key=True)
team_id = models.IntegerField()
timestamp = UnixDateTimeField()
races = models.IntegerField()
time = models.IntegerField()
typed = models.IntegerField()
errs = models.IntegerField()
Thanks in advance.
I'm scraping multiple sports betting websites in order to compare the odds for each match across the websites.
My question is how to identify match_id from a match that already exists in the DB but has team names written in a different way.
Please feel free to add any approaches even if they don't use dataframes or SQLite.
The columns for matches table are:
match_id: int, sport: string, home_team: string, away_team: string, date: string (dd/mm/YYY)
So for each new match I want to verify if it already exists in the DB.
New match = (sport_to_check, home_team_to_check, away_team_to_check, date_to_check)
My pseudo-code is like:
SELECT match_id FROM matches
WHERE sport = (sport_to_check)
AND date = (date_to_check)
AND (fuzz(home_team, home_team_to_check) > 80 OR fuzz(away_team, away_team_to_check) > 80) //the fuzzy scores evaluation
If no match is found the new row would be inserted.
I believe there's no way to mix python with SQL like that so that's why I refer to it as "pseudo-code".
I can also pull the matches table into a Pandas dataframe and do the evaluation with it, if that works (how?...).
At any given time it isn't expected for matches table to have above a couple of thousand records.
Let me give you some examples of expected outputs. Where the solution is represented by "find(row)"
Having matches table in DB as:
+----------+------------+-----------------------------+----------------------+------------+
| match_id | sport | home_team | visitor_team | date |
+----------+------------+-----------------------------+----------------------+------------+
| 84 | football | confianca | cuiaba esporte clube | 24/11/2020 |
| 209 | football | cs alagoana | operario pr | 24/11/2020 |
| 184 | football | grenoble foot 38 | as nancy lorraine | 24/11/2020 |
| 7 | football | sv turkgucu-ataspor munchen | saarbrucken | 24/11/2020 |
| 414 | handball | dinamo bucareste | usam nimes | 24/11/2020 |
| 846 | handball | benidorm | naturhouse la rioja | 25/11/2020 |
| 874 | handball | cegledi | ferencvarosi tc | 25/11/2020 |
| 418 | handball | lemvig-thyboron | kif kolding | 25/11/2020 |
| 740 | ice hockey | tps | kookoo | 25/11/2020 |
| 385 | football | stevenage | hull | 29/11/2020 |
+----------+------------+-----------------------------+----------------------+------------+
And new matches to evaluate:
+----------------+------------+---------------------+---------------------+------------+
| row (for demo) | sport | home_team | visitor_team | date |
+----------------+------------+---------------------+---------------------+------------+
| A | football | confianca-se | cuiaba mt | 24/11/2020 |
| B | football | csa | operario | 24/11/2020 |
| C | football | grenoble | nancy | 24/11/2020 |
| D | football | sv turkgucu ataspor | 1 fc saarbrucken | 24/11/2020 |
| E | handball | dinamo bucuresti | nimes | 24/11/2020 |
| F | handball | bm benidorm | bm logrono la rioja | 25/11/2020 |
| G | handball | cegledi kkse | ftc budapest | 25/11/2020 |
| H | handball | lemvig | kif kobenhavn | 25/11/2020 |
| I | ice hockey | turku ps | kookoo kouvola | 25/11/2020 |
| J | football | stevenage borough | hull city | 29/11/2020 |
| K | football | west brom | sheffield united | 28/11/2020 |
+----------------+------------+---------------------+---------------------+------------+
Outputs:
find(A) returns: 84
find(B) returns: 209
find(C) returns: 184
find(D) returns: 7
find(E) returns: 414
find(F) returns: 846
find(G) returns: 874
find(H) returns: 418
find(I) returns: 740
find(J) returns: 385
find(K) returns: (something like "not found" => I would then insert the new row)
Thanks!
Basically I filter down the original table by the given date and sport. then use fuzzywuzzy to find the best match between the home and visitors between the rows remaining:
Setup:
import pandas as pd
cols = ['match_id','sport','home_team','visitor_team','date']
df1 = pd.DataFrame([
['84','football','confianca','cuiaba esporte clube','24/11/2020'],
['209','football','cs alagoana','operario pr','24/11/2020'],
['184','football','grenoble foot 38','as nancy lorraine','24/11/2020'],
['7','football','sv turkgucu-ataspor munchen','saarbrucken','24/11/2020'],
['414','handball','dinamo bucareste','usam nimes','24/11/2020'],
['846','handball','benidorm','naturhouse la rioja','25/11/2020'],
['874','handball','cegledi','ferencvarosi tc','25/11/2020'],
['418','handball','lemvig-thyboron','kif kolding','25/11/2020'],
['740','ice hockey','tps','kookoo','25/11/2020'],
['385','football','stevenage','hull','29/11/2020']], columns=cols)
cols = ['row','sport','home_team','visitor_team','date']
df2 = pd.DataFrame([
['A','football','confianca-se','cuiaba mt','24/11/2020'],
['B','football','csa','operario','24/11/2020'],
['C','football','grenoble','nancy','24/11/2020'],
['D','football','sv turkgucu ataspor','1 fc saarbrucken','24/11/2020'],
['E','handball','dinamo bucuresti','nimes','24/11/2020'],
['F','handball','bm benidorm','bm logrono la rioja','25/11/2020'],
['G','handball','cegledi kkse','ftc budapest','25/11/2020'],
['H','handball','lemvig','kif kobenhavn','25/11/2020'],
['I','ice hockey','turku ps','kookoo kouvola','25/11/2020'],
['J','football','stevenage borough','hull city','29/11/2020'],
['K','football','west brom','sheffield united','28/11/2020']], columns=cols)
Code:
import pandas as pd
from fuzzywuzzy import fuzz
import string
def calculate_ratio(row):
return fuzz.token_set_ratio(row['col1'],row['col2'] )
def find(df1, df2, row_search):
alpha = df2.query('row == "{row_search}"'.format(row_search=row_search))
sport = alpha.iloc[0]['sport']
date = alpha.iloc[0]['date']
home_team = alpha.iloc[0]['home_team']
visitor_team = alpha.iloc[0]['visitor_team']
beta = df1.query('sport == "{sport}" & date == "{date}"'.format(sport=sport,date=date))
if len(beta) == 0:
return 'Not found.'
else:
temp = pd.DataFrame({'match_id':list(beta['match_id']),'col1':list(beta['home_team'] + ' ' + beta['visitor_team']), 'col2':[home_team + ' ' + visitor_team]*len(beta)})
temp['score'] = temp.apply(calculate_ratio, axis=1)
temp = temp.sort_values('score', ascending=False)
outcome = temp.head(1).iloc[0]['match_id']
return outcome
for row_alpha in string.ascii_uppercase[0:11]:
outcome = find(df1, df2, row_alpha)
print ('{row_alpha} --> {outcome}'.format(row_alpha=row_alpha, outcome=outcome))
Output:
A --> 84
B --> 209
C --> 184
D --> 7
E --> 414
F --> 846
G --> 874
H --> 418
I --> 740
J --> 385
K --> Not found.
I have to compare 2 different sources and identify all the mismatches for all IDs
Source_excel table
+-----+-------------+------+----------+
| id | name | City | flag |
+-----+-------------+------+----------+
| 101 | Plate | NY | Ready |
| 102 | Back washer | NY | Sold |
| 103 | Ring | MC | Planning |
| 104 | Glass | NMC | Ready |
| 107 | Cover | PR | Ready |
+-----+-------------+------+----------+
Source_dw table
+-----+----------+------+----------+
| id | name | City | flag |
+-----+----------+------+----------+
| 101 | Plate | NY | Planning |
| 102 | Nut | TN | Expired |
| 103 | Ring | MC | Planning |
| 104 | Top Wire | NY | Ready |
| 105 | Bolt | MC | Expired |
+-----+----------+------+----------+
Expected result
+-----+-------------+----------+------------+----------+------------+---------+------------------+
| ID | excel_name | dw_name | excel_flag | dw_flag | excel_city | dw_city | RESULT |
+-----+-------------+----------+------------+----------+------------+---------+------------------+
| 101 | Plate | Plate | Ready | Planning | NY | NY | FLAG_MISMATCH |
| 102 | Back washer | Nut | Sold | Expired | NY | TN | NAME_MISMATCH |
| 102 | Back washer | Nut | Sold | Expired | NY | TN | FLAG_MISMATCH |
| 102 | Back washer | Nut | Sold | Expired | NY | TN | CITY_MISMATCH |
| 103 | Ring | Ring | Planning | Planning | MC | MC | ALL_MATCH |
| 104 | Glass | Top Wire | Ready | Ready | NMC | NY | NAME_MISMATCH |
| 104 | Glass | Top Wire | Ready | Ready | NMC | NY | CITY_MISMATCH |
| 107 | Cover | | Ready | | PR | | MISSING IN DW |
| 105 | | Bolt | | Expired | | MC | MISSING IN EXCEL |
+-----+-------------+----------+------------+----------+------------+---------+------------------+
I'm new to python and I have tried the below query but it not giving the expected result.
import pandas as pd
source_excel = pd.read_csv('C:/Mypython/Newyork/excel.csv',encoding = "ISO-8859-1")
source_dw = pd.read_csv('C:/Mypython/Newyork/dw.csv',encoding = "ISO-8859-1")
comparison_result = pd.merge(source_excel,source_dw,on='ID',how='outer',indicator=True)
comparison_result.loc[(comparison_result['_merge'] == 'both') & (name_x != name_y), 'Result'] = 'NAME_MISMATCH'
comparison_result.loc[(comparison_result['_merge'] == 'both') & (city_x != city_y), 'Result'] = 'CITY_MISMATCH'
comparison_result.loc[(comparison_result['_merge'] == 'both') & (flag_x != flag_y), 'Result'] = 'FLAG_MISMATCH'
comparison_result.loc[comparison_result['_merge'] == 'left_only', 'Result'] = 'Missing in dw'
comparison_result.loc[comparison_result['_merge'] == 'right_only', 'Result'] = 'Missing in excel'
comparison_result.loc[comparison_result['_merge'] == 'both', 'Result'] = 'ALL_Match'
csv_column = comparison_result[['ID','name_x','name_y','city_x','city_y','flag_x','flag_y','Result']]
print(csv_column)
Is there any other way I can check all the condition and report each in separate row. If separate row not possible, atleast i need in same column separated by all mismatches. something like FLAG_MISMATCH,CITY_MISMATCH
You could do:
df = pd.merge(Source_excel, Source_dw, on = 'ID', how = 'left', suffixes = (None, '_dw'))
This will create a new dataframe like the one you want, although you'll have to reorder the columns as you want. Note that the '_dw' is a suffix and not a prefix in this case.
You can reorder the columns as you like by using this code:
#Complement with the order you want
df = df[['ID', 'excel_name']]
For the result column I think you'll have to create a column for each condition you're trying to check (at least that's the way I know how to). Here's an example:
#This will return 1 if there's a match and 0 otherwise
df['result_flag'] = df.apply(lambda x: 1 if x.excel_flag == x.flag_dw else 0, axis = 1)
Here is a way to do the scoring:
df['result'] = 0
# repeated mask / df.loc statements suggests a loop, over a list of tuples
mask = df['excel_flag'] != df['df_flag']
df.loc[mask, 'result'] += 1
mask = df['excel_name'] != df['dw_name']
df.loc[mask, 'result'] += 10
df['result'] = df['result'].map({ 0: 'all match',
1: 'flag mismatch',
10: 'name mismatch',
11: 'all mismatch',})
I have 2 views as below:
experiments:
select * from experiments;
+--------+--------------------+-----------------+
| exp_id | exp_properties | value |
+--------+--------------------+-----------------+
| 1 | indicator:chemical | phenolphthalein |
| 1 | base | NaOH |
| 1 | acid | HCl |
| 1 | exp_type | titration |
| 1 | indicator:color | faint_pink |
+--------+--------------------+-----------------+
calculations:
select * from calculations;
+--------+------------------------+--------------+
| exp_id | exp_report | value |
+--------+------------------------+--------------+
| 1 | molarity:base | 0.500000000 |
| 1 | volume:acid:in_ML | 23.120000000 |
| 1 | volume:base:in_ML | 5.430000000 |
| 1 | moles:H | 0.012500000 |
| 1 | moles:OH | 0.012500000 |
| 1 | molarity:acid | 0.250000000 |
+--------+------------------------+--------------+
I managed to pivot each of these views individually as below:
experiments_pivot:
+-------+--------------------+------+------+-----------+----------------+
|exp_id | indicator:chemical | base | acid | exp_type | indicator:color|
+-------+--------------------+------+------+-----------+----------------+
| 1 | phenolphthalein | NaOH | HCl | titration | faint_pink |
+------+---------------------+------+------+-----------+----------------+
calculations_pivot:
+-------+---------------+---------------+--------------+-------------+------------------+-------------------+
|exp_id | molarity:base | molarity:acid | moles:H | moles:OH | volume:acid:in_ML| volume:base:in_ML |
+-------+---------------+---------------+--------------+-------------+------------------+-------------------+
| 1 | 0.500000000 | 0.250000000 | 0.012500000 | 0.012500000 | 23.120000000 | 5.430000000 |
+------+---------------------+------+------+-----------+----------------------------------------------------+
My question is how to get these two pivot results as a single row? Desired result is as below:
+-------+--------------------+------+------+-----------+----------------+--------------+---------------+--------------+-------------+------------------+------------------+
|exp_id | indicator:chemical | base | acid | exp_type | indicator:color|molarity:base | molarity:acid | moles:H | moles:OH | volume:acid:in_ML| volume:base:in_ML |
+-------+--------------------+------+------+-----------+----------------+--------------+---------------+--------------+-------------+------------------+------------------+
| 1 | phenolphthalein | NaOH | HCl | titration | faint_pink | 0.500000000 | 0.250000000 | 0.012500000 | 0.012500000 | 23.120000000 | 5.430000000 |
+------+---------------------+------+------+-----------+----------------+--------------+---------------+--------------+-------------+------------------+------------------+
Database Used: Mysql
Important Note: Each of these views can have increasing number of rows. Hence I considered "dynamic pivoting" for each of the view individually.
For reference -- Below is a prepared statement I used to pivot experiments in MySQL(and a similar statement to pivot the other view as well):
set #sql = Null;
SELECT
GROUP_CONCAT(DISTINCT
CONCAT(
'MAX(IF(exp_properties = ''',
exp_properties,
''', value, NULL)) AS ',
concat("`",exp_properties, "`")
)
)into #sql
from experiments;
set #sql = concat(
'select exp_id, ',
#sql,
' from experiment group by exp_id'
);
prepare stmt from #sql;
execute stmt;