How to select lat | long | name within a given POLYGON((...)) in PostGIS from Python

My PostGIS table, named samplecol, is this:
vessel_hash | name_city | station | speed | latitude | longitude | course | heading | timestamp | the_geom
--------------+-----------+---------+-------+-------------+-------------+--------+---------+--------------------------+----------------------------------------------------
103079215239 |newyork | 841 | 5 | -5.41844510 | 36.12160900 | 314 | 511 | 2016-06-12T06:31:04.000Z | 0101000020E61000001BF33AE2900F424090AF4EDF7CAC15C0
103079215239 |washangton | 3008 | 0 | -5.41778710 | 36.12144900 | 117 | 511 | 2016-06-12T06:43:27.000Z | 0101000020E6100000E2900DA48B0F424042C3AC61D0AB15C0
103079215239 |paris | 841 | 17 | -5.42236900 | 36.12356900 | 259 | 511 | 2016-06-12T06:50:27.000Z | 0101000020E610000054E6E61BD10F42407C60C77F81B015C0
103079215239 |room | 841 | 17 | -5.41781710 | 36.12147900 | 230 | 511 | 2016-06-12T06:27:03.000Z | 0101000020E61000004D13B69F8C0F424097D6F03ED8AB15C0
103079215239 |pensilvenia| 841 | 61 | -5.42201900 | 36.13256100 | 157 | 511 | 2016-06-12T06:08:04.000Z | 0101000020E6100000CFDC43C2F71042409929ADBF25B015C0
103079215239 |jorjia | 841 | 9 | -5.41834020 | 36.12225000 | 359 | 511 | 2016-06-12T06:33:03.000Z | 0101000020E6100000CFF753E3A50F42408D68965F61AC15C0
The query output must be like this:
name_city | latitude | longitude
-----------+-------------+----------
newyork |-5.41844510 | 36.12160900
pensilvenia|-5.42201900 | 36.13256100
jorjia |-5.41834020 | 36.12225000
My code is this:
poisInpolygon = """SELECT samplecol.latitude, samplecol.name_city, samplecol.longitude
FROM samplecol
WHERE ST_Contains(samplecol.the_geom, ('POLYGON((-15.0292969 47.6357836, -15.2050781 47.5172007,
-16.2597656 29.3821751, 35.0683594 26.1159859, 38.0566406 47.6357836, -15.0292969 47.6357836))'));"""
cursor.execute(poisInpolygon)
exists1 = cursor.fetchall()
count1 = 0
for ex1 in exists1:
    count1 = count1 + 1
    print ex1, "\n"
print "points", count1
I tried this code and query, but it returns 0 rows. What is the problem? What is the correct query?

I believe it is the classic mix-up of long, lat. Based on your data sample, if we create the_geom using longitude, latitude, your points land somewhere in Tanzania and therefore far away from your polygon:
SELECT ST_SetSRID(ST_MakePoint(longitude,latitude),4326) FROM samplecol
UNION ALL SELECT 'SRID=4326;POLYGON((-15.0292969 47.6357836, -15.2050781 47.5172007,
-16.2597656 29.3821751, 35.0683594 26.1159859, 38.0566406 47.6357836, -15.0292969 47.6357836))'::geometry;
If we switch the order to latitude, longitude, your points land close to Gibraltar and the geometries do overlap:
SELECT ST_SetSRID(ST_MakePoint(latitude,longitude),4326) FROM samplecol
UNION ALL SELECT 'SRID=4326;POLYGON((-15.0292969 47.6357836, -15.2050781 47.5172007,
-16.2597656 29.3821751, 35.0683594 26.1159859, 38.0566406 47.6357836, -15.0292969 47.6357836))'::geometry;
So my guess is that you stored longitude and latitude values in the wrong columns ;)
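If you keep the columns as they are, a corrected query would build the point as latitude, longitude and pass the polygon as the first argument, since ST_Contains(A, B) checks whether A contains B. A sketch (untested):
SELECT name_city, latitude, longitude
FROM samplecol
WHERE ST_Contains(
        ST_GeomFromText('POLYGON((-15.0292969 47.6357836, -15.2050781 47.5172007,
                                  -16.2597656 29.3821751, 35.0683594 26.1159859,
                                  38.0566406 47.6357836, -15.0292969 47.6357836))', 4326),
        ST_SetSRID(ST_MakePoint(latitude, longitude), 4326));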

Related

Subsetting a dataset by using an IF statement

I have a massive dataset and I need to subset the data using criteria. This is for illustration:
| Group | Name    | Value |
|-------|---------|-------|
| A     | Bill    | 256   |
| A     | Jack    | 268   |
| A     | Melissa | 489   |
| B     | Amanda  | 787   |
| B     | Eric    | 485   |
| C     | Matt    | 1236  |
| C     | Lisa    | 1485  |
| D     | Ben     | 785   |
| D     | Andrew  | 985   |
| D     | Cathy   | 1025  |
| D     | Suzanne | 1256  |
| D     | Jim     | 1520  |
I know how to handle this problem manually, such as:
import pandas as pd

df = pd.read_csv('Test.csv')
A = df[df.Group == "A"].to_numpy()
B = df[df.Group == "B"].to_numpy()
C = df[df.Group == "C"].to_numpy()
D = df[df.Group == "D"].to_numpy()
But considering the size of the data, handling it this way would take a lot of time.
With that in mind, I would like to know if it is possible to build an iteration with an IF statement that looks at the values in the "Group" column (table above): if the first value is the same as the one below it, group them and create a new array/dataframe.
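For what it's worth, pandas' groupby does this kind of grouping in a single pass; a minimal sketch, assuming the column is literally named Group as in the table above:
import pandas as pd

df = pd.read_csv('Test.csv')

# One NumPy array per distinct value in the Group column,
# without hard-coding "A", "B", "C", ...
arrays = {name: grp.to_numpy() for name, grp in df.groupby('Group')}

arrays['A']  # same content as df[df.Group == "A"].to_numpy()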

Prevent Django from Removing Previous Entries in PostgreSQL

I have the following Django code that is being run on PostgreSQL and Huey (an automatic scheduler). The problem is that, whenever the periodic task runs, Django removes the previous rows in the table instead of adding to the existing ones.
Scheduled code:
@periodic_task(crontab(minute='*/1'))
def scheduled():
    team = nitrotype.Team('PR2W')
    team_id = team.data["info"]["teamID"]
    timestamp = datetime.datetime.now()

    for members in team.data["members"]:
        racer_id = members["userID"]
        races = members["played"]
        time = members["secs"]
        typed = members["typed"]
        errs = members["errs"]
        rcd = RaceData(
            racer_id=racer_id,
            team_id=team_id,
            timestamp=timestamp,
            races=races,
            time=time,
            typed=typed,
            errs=errs
        )
        rcd.save()
Basically, the above code is going to run every minute. Here's the database (PSQL) data that I started with:
nttracker=# TABLE data_racedata;
racer_id | team_id | timestamp | races | time | typed | errs
----------+---------+------------+--------+---------+----------+--------
35051013 | 765879 | 1623410530 | 4823 | 123226 | 793462 | 42975
35272676 | 765879 | 1623410530 | 8354 | 211400 | 1844434 | 38899
36690038 | 765879 | 1623410530 | 113 | 2849 | 16066 | 995
38486084 | 765879 | 1623410530 | 34448 | 903144 | 8043345 | 586297
38625235 | 765879 | 1623410530 | 108 | 2779 | 20919 | 1281
39018052 | 765879 | 1623410530 | 1908 | 48898 | 395187 | 24384
39114823 | 765879 | 1623410530 | 2441 | 64170 | 440503 | 32594
...
(50 rows)
Afterward, I run Huey, which executes scheduled() every one minute. Here's what I end up with after two minutes (in other words, two iterations):
nttracker=# TABLE data_racedata;
racer_id | team_id | timestamp | races | time | typed | errs
----------+---------+------------+--------+---------+----------+--------
35051013 | 765879 | 1623410992 | 4823 | 123226 | 793462 | 42975
35272676 | 765879 | 1623410992 | 8354 | 211400 | 1844434 | 38899
36690038 | 765879 | 1623410992 | 113 | 2849 | 16066 | 995
38486084 | 765879 | 1623410992 | 34448 | 903144 | 8043345 | 586297
38625235 | 765879 | 1623410992 | 108 | 2779 | 20919 | 1281
39018052 | 765879 | 1623410992 | 1908 | 48898 | 395187 | 24384
39114823 | 765879 | 1623410992 | 2441 | 64170 | 440503 | 32594
...
(50 rows)
Note: most of the data just happened to be the same; the timestamp is always different across the automated executions.
I'd like 150 rows instead of 50 rows, as I want the data to accumulate rather than replace the previous rows. Can anyone please tell me where I've gone wrong?
If anyone needs additional log outputs, please comment below.
EDIT: Model
class RaceData(models.Model):
    racer_id = models.IntegerField(primary_key=True)  # racer_id is the table's primary key
    team_id = models.IntegerField()
    timestamp = UnixDateTimeField()
    races = models.IntegerField()
    time = models.IntegerField()
    typed = models.IntegerField()
    errs = models.IntegerField()
Thanks in advance.
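A likely culprit, given the model above: racer_id is declared with primary_key=True, and Django's Model.save() issues an UPDATE when a row with that primary key already exists. Since the team's 50 members keep the same racer_ids from run to run, each execution rewrites the same 50 rows. A minimal sketch of that behavior (field values taken from the table above, purely for illustration):
import datetime

t0 = datetime.datetime.now()

# First run: no row with pk=35051013 exists yet, so save() INSERTs.
RaceData(racer_id=35051013, team_id=765879, timestamp=t0,
         races=4823, time=123226, typed=793462, errs=42975).save()

# One minute later: a row with pk=35051013 already exists,
# so save() UPDATEs it in place instead of inserting a new row.
t1 = t0 + datetime.timedelta(minutes=1)
RaceData(racer_id=35051013, team_id=765879, timestamp=t1,
         races=4823, time=123226, typed=793462, errs=42975).save()

RaceData.objects.count()  # still 1 row for this racer, not 2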

Pandas dataframe or SQLite fuzzy search

I'm scraping multiple sports betting websites in order to compare the odds for each match across the websites.
My question is how to identify match_id from a match that already exists in the DB but has team names written in a different way.
Please feel free to add any approaches even if they don't use dataframes or SQLite.
The columns for matches table are:
match_id: int, sport: string, home_team: string, away_team: string, date: string (dd/mm/YYYY)
So for each new match I want to verify if it already exists in the DB.
New match = (sport_to_check, home_team_to_check, away_team_to_check, date_to_check)
My pseudo-code is like:
SELECT match_id FROM matches
WHERE sport = (sport_to_check)
AND date = (date_to_check)
AND (fuzz(home_team, home_team_to_check) > 80 OR fuzz(away_team, away_team_to_check) > 80) //the fuzzy scores evaluation
If no match is found, the new row would be inserted.
I'm not sure it's possible to mix Python with SQL like that, which is why I refer to it as "pseudo-code".
I can also pull the matches table into a Pandas dataframe and do the evaluation with it, if that works (how?...).
At any given time the matches table isn't expected to have more than a couple of thousand records.
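Incidentally, the sqlite3 module does let you register a Python function for use inside SQL via create_function, so something very close to the pseudo-code is runnable; a sketch, assuming the matches table lives in an SQLite file:
import sqlite3
from fuzzywuzzy import fuzz

conn = sqlite3.connect('matches.db')
# Expose fuzzywuzzy's scorer to SQL as a 2-argument function named fuzz.
conn.create_function('fuzz', 2, fuzz.token_set_ratio)

def find(sport, home, away, date):
    row = conn.execute(
        """SELECT match_id FROM matches
           WHERE sport = ? AND date = ?
             AND (fuzz(home_team, ?) > 80 OR fuzz(away_team, ?) > 80)""",
        (sport, date, home, away),
    ).fetchone()
    return row[0] if row else None  # None -> insert the new match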
Let me give you some examples of the expected outputs, where the solution is represented by find(row).
Having matches table in DB as:
+----------+------------+-----------------------------+----------------------+------------+
| match_id | sport | home_team | visitor_team | date |
+----------+------------+-----------------------------+----------------------+------------+
| 84 | football | confianca | cuiaba esporte clube | 24/11/2020 |
| 209 | football | cs alagoana | operario pr | 24/11/2020 |
| 184 | football | grenoble foot 38 | as nancy lorraine | 24/11/2020 |
| 7 | football | sv turkgucu-ataspor munchen | saarbrucken | 24/11/2020 |
| 414 | handball | dinamo bucareste | usam nimes | 24/11/2020 |
| 846 | handball | benidorm | naturhouse la rioja | 25/11/2020 |
| 874 | handball | cegledi | ferencvarosi tc | 25/11/2020 |
| 418 | handball | lemvig-thyboron | kif kolding | 25/11/2020 |
| 740 | ice hockey | tps | kookoo | 25/11/2020 |
| 385 | football | stevenage | hull | 29/11/2020 |
+----------+------------+-----------------------------+----------------------+------------+
And new matches to evaluate:
+----------------+------------+---------------------+---------------------+------------+
| row (for demo) | sport | home_team | visitor_team | date |
+----------------+------------+---------------------+---------------------+------------+
| A | football | confianca-se | cuiaba mt | 24/11/2020 |
| B | football | csa | operario | 24/11/2020 |
| C | football | grenoble | nancy | 24/11/2020 |
| D | football | sv turkgucu ataspor | 1 fc saarbrucken | 24/11/2020 |
| E | handball | dinamo bucuresti | nimes | 24/11/2020 |
| F | handball | bm benidorm | bm logrono la rioja | 25/11/2020 |
| G | handball | cegledi kkse | ftc budapest | 25/11/2020 |
| H | handball | lemvig | kif kobenhavn | 25/11/2020 |
| I | ice hockey | turku ps | kookoo kouvola | 25/11/2020 |
| J | football | stevenage borough | hull city | 29/11/2020 |
| K | football | west brom | sheffield united | 28/11/2020 |
+----------------+------------+---------------------+---------------------+------------+
Outputs:
find(A) returns: 84
find(B) returns: 209
find(C) returns: 184
find(D) returns: 7
find(E) returns: 414
find(F) returns: 846
find(G) returns: 874
find(H) returns: 418
find(I) returns: 740
find(J) returns: 385
find(K) returns: (something like "not found" => I would then insert the new row)
Thanks!
Basically, I filter the original table down by the given date and sport, then use fuzzywuzzy to find the best match between the combined home/visitor names among the remaining rows:
Setup:
import pandas as pd
cols = ['match_id','sport','home_team','visitor_team','date']
df1 = pd.DataFrame([
['84','football','confianca','cuiaba esporte clube','24/11/2020'],
['209','football','cs alagoana','operario pr','24/11/2020'],
['184','football','grenoble foot 38','as nancy lorraine','24/11/2020'],
['7','football','sv turkgucu-ataspor munchen','saarbrucken','24/11/2020'],
['414','handball','dinamo bucareste','usam nimes','24/11/2020'],
['846','handball','benidorm','naturhouse la rioja','25/11/2020'],
['874','handball','cegledi','ferencvarosi tc','25/11/2020'],
['418','handball','lemvig-thyboron','kif kolding','25/11/2020'],
['740','ice hockey','tps','kookoo','25/11/2020'],
['385','football','stevenage','hull','29/11/2020']], columns=cols)
cols = ['row','sport','home_team','visitor_team','date']
df2 = pd.DataFrame([
['A','football','confianca-se','cuiaba mt','24/11/2020'],
['B','football','csa','operario','24/11/2020'],
['C','football','grenoble','nancy','24/11/2020'],
['D','football','sv turkgucu ataspor','1 fc saarbrucken','24/11/2020'],
['E','handball','dinamo bucuresti','nimes','24/11/2020'],
['F','handball','bm benidorm','bm logrono la rioja','25/11/2020'],
['G','handball','cegledi kkse','ftc budapest','25/11/2020'],
['H','handball','lemvig','kif kobenhavn','25/11/2020'],
['I','ice hockey','turku ps','kookoo kouvola','25/11/2020'],
['J','football','stevenage borough','hull city','29/11/2020'],
['K','football','west brom','sheffield united','28/11/2020']], columns=cols)
Code:
import pandas as pd
from fuzzywuzzy import fuzz
import string

def calculate_ratio(row):
    return fuzz.token_set_ratio(row['col1'], row['col2'])

def find(df1, df2, row_search):
    alpha = df2.query('row == "{row_search}"'.format(row_search=row_search))
    sport = alpha.iloc[0]['sport']
    date = alpha.iloc[0]['date']
    home_team = alpha.iloc[0]['home_team']
    visitor_team = alpha.iloc[0]['visitor_team']
    beta = df1.query('sport == "{sport}" & date == "{date}"'.format(sport=sport, date=date))
    if len(beta) == 0:
        return 'Not found.'
    else:
        temp = pd.DataFrame({
            'match_id': list(beta['match_id']),
            'col1': list(beta['home_team'] + ' ' + beta['visitor_team']),
            'col2': [home_team + ' ' + visitor_team] * len(beta),
        })
        temp['score'] = temp.apply(calculate_ratio, axis=1)
        temp = temp.sort_values('score', ascending=False)
        outcome = temp.head(1).iloc[0]['match_id']
        return outcome

for row_alpha in string.ascii_uppercase[0:11]:
    outcome = find(df1, df2, row_alpha)
    print('{row_alpha} --> {outcome}'.format(row_alpha=row_alpha, outcome=outcome))
Output:
A --> 84
B --> 209
C --> 184
D --> 7
E --> 414
F --> 846
G --> 874
H --> 418
I --> 740
J --> 385
K --> Not found.
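One possible refinement: the code above returns the best-scoring row no matter how weak the match is, so applying the > 80 threshold from the question's pseudo-code lets a poor best match fall through to "Not found." A sketch, swapping the tail of find() for something like:
MIN_SCORE = 80  # threshold from the question's pseudo-code; tune as needed

best = temp.sort_values('score', ascending=False).head(1).iloc[0]
if best['score'] < MIN_SCORE:
    return 'Not found.'  # nothing matched well enough; the caller inserts the new match
return best['match_id']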

Check multiple conditions for the same row

I have to compare two different sources and identify all the mismatches for all IDs.
Source_excel table
+-----+-------------+------+----------+
| id | name | City | flag |
+-----+-------------+------+----------+
| 101 | Plate | NY | Ready |
| 102 | Back washer | NY | Sold |
| 103 | Ring | MC | Planning |
| 104 | Glass | NMC | Ready |
| 107 | Cover | PR | Ready |
+-----+-------------+------+----------+
Source_dw table
+-----+----------+------+----------+
| id | name | City | flag |
+-----+----------+------+----------+
| 101 | Plate | NY | Planning |
| 102 | Nut | TN | Expired |
| 103 | Ring | MC | Planning |
| 104 | Top Wire | NY | Ready |
| 105 | Bolt | MC | Expired |
+-----+----------+------+----------+
Expected result
+-----+-------------+----------+------------+----------+------------+---------+------------------+
| ID | excel_name | dw_name | excel_flag | dw_flag | excel_city | dw_city | RESULT |
+-----+-------------+----------+------------+----------+------------+---------+------------------+
| 101 | Plate | Plate | Ready | Planning | NY | NY | FLAG_MISMATCH |
| 102 | Back washer | Nut | Sold | Expired | NY | TN | NAME_MISMATCH |
| 102 | Back washer | Nut | Sold | Expired | NY | TN | FLAG_MISMATCH |
| 102 | Back washer | Nut | Sold | Expired | NY | TN | CITY_MISMATCH |
| 103 | Ring | Ring | Planning | Planning | MC | MC | ALL_MATCH |
| 104 | Glass | Top Wire | Ready | Ready | NMC | NY | NAME_MISMATCH |
| 104 | Glass | Top Wire | Ready | Ready | NMC | NY | CITY_MISMATCH |
| 107 | Cover | | Ready | | PR | | MISSING IN DW |
| 105 | | Bolt | | Expired | | MC | MISSING IN EXCEL |
+-----+-------------+----------+------------+----------+------------+---------+------------------+
I'm new to Python and I have tried the query below, but it is not giving the expected result.
import pandas as pd
source_excel = pd.read_csv('C:/Mypython/Newyork/excel.csv',encoding = "ISO-8859-1")
source_dw = pd.read_csv('C:/Mypython/Newyork/dw.csv',encoding = "ISO-8859-1")
comparison_result = pd.merge(source_excel,source_dw,on='ID',how='outer',indicator=True)
comparison_result.loc[(comparison_result['_merge'] == 'both') & (name_x != name_y), 'Result'] = 'NAME_MISMATCH'
comparison_result.loc[(comparison_result['_merge'] == 'both') & (city_x != city_y), 'Result'] = 'CITY_MISMATCH'
comparison_result.loc[(comparison_result['_merge'] == 'both') & (flag_x != flag_y), 'Result'] = 'FLAG_MISMATCH'
comparison_result.loc[comparison_result['_merge'] == 'left_only', 'Result'] = 'Missing in dw'
comparison_result.loc[comparison_result['_merge'] == 'right_only', 'Result'] = 'Missing in excel'
comparison_result.loc[comparison_result['_merge'] == 'both', 'Result'] = 'ALL_Match'
csv_column = comparison_result[['ID','name_x','name_y','city_x','city_y','flag_x','flag_y','Result']]
print(csv_column)
Is there any other way I can check all the conditions and report each in a separate row? If separate rows are not possible, I at least need them in the same column, separated, something like FLAG_MISMATCH,CITY_MISMATCH.
You could do:
df = pd.merge(Source_excel, Source_dw, on = 'ID', how = 'left', suffixes = (None, '_dw'))
This will create a new dataframe like the one you want, although you'll have to reorder the columns as you want. Note that the '_dw' is a suffix and not a prefix in this case.
You can reorder the columns as you like by using this code:
#Complement with the order you want
df = df[['ID', 'excel_name']]
For the result column I think you'll have to create a column for each condition you're trying to check (at least that's the way I know how to). Here's an example:
#This will return 1 if there's a match and 0 otherwise
df['result_flag'] = df.apply(lambda x: 1 if x.excel_flag == x.flag_dw else 0, axis = 1)
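A small aside: the same column can be built without apply, which is usually faster on large frames (a sketch using the same column names the lambda assumes):
import numpy as np

# 1 where the flags match, 0 otherwise, computed column-wise
df['result_flag'] = np.where(df['excel_flag'] == df['flag_dw'], 1, 0)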
Here is a way to do the scoring:
df['result'] = 0

# repeated mask / df.loc statements suggest a loop over a list of tuples
mask = df['excel_flag'] != df['dw_flag']
df.loc[mask, 'result'] += 1

mask = df['excel_name'] != df['dw_name']
df.loc[mask, 'result'] += 10

df['result'] = df['result'].map({
    0: 'all match',
    1: 'flag mismatch',
    10: 'name mismatch',
    11: 'all mismatch',
})
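Taking that comment's hint, a sketch of the looped version that also covers the city columns (column names assumed from the expected output; the weights are chosen so every combination of mismatches maps to a distinct code):
checks = [
    ('excel_flag', 'dw_flag', 1),
    ('excel_name', 'dw_name', 10),
    ('excel_city', 'dw_city', 100),
]

df['result'] = 0
for excel_col, dw_col, weight in checks:
    df.loc[df[excel_col] != df[dw_col], 'result'] += weight

# e.g. 110 = name + city mismatch, 0 = all match; map the codes to labels as above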

pivot multiple views into single result table/view

I have 2 views as below:
experiments:
select * from experiments;
+--------+--------------------+-----------------+
| exp_id | exp_properties | value |
+--------+--------------------+-----------------+
| 1 | indicator:chemical | phenolphthalein |
| 1 | base | NaOH |
| 1 | acid | HCl |
| 1 | exp_type | titration |
| 1 | indicator:color | faint_pink |
+--------+--------------------+-----------------+
calculations:
select * from calculations;
+--------+------------------------+--------------+
| exp_id | exp_report | value |
+--------+------------------------+--------------+
| 1 | molarity:base | 0.500000000 |
| 1 | volume:acid:in_ML | 23.120000000 |
| 1 | volume:base:in_ML | 5.430000000 |
| 1 | moles:H | 0.012500000 |
| 1 | moles:OH | 0.012500000 |
| 1 | molarity:acid | 0.250000000 |
+--------+------------------------+--------------+
I managed to pivot each of these views individually as below:
experiments_pivot:
+--------+--------------------+------+------+-----------+-----------------+
| exp_id | indicator:chemical | base | acid | exp_type  | indicator:color |
+--------+--------------------+------+------+-----------+-----------------+
| 1      | phenolphthalein    | NaOH | HCl  | titration | faint_pink      |
+--------+--------------------+------+------+-----------+-----------------+
calculations_pivot:
+--------+---------------+---------------+-------------+-------------+-------------------+-------------------+
| exp_id | molarity:base | molarity:acid | moles:H     | moles:OH    | volume:acid:in_ML | volume:base:in_ML |
+--------+---------------+---------------+-------------+-------------+-------------------+-------------------+
| 1      | 0.500000000   | 0.250000000   | 0.012500000 | 0.012500000 | 23.120000000      | 5.430000000       |
+--------+---------------+---------------+-------------+-------------+-------------------+-------------------+
My question is how to get these two pivot results as a single row. The desired result is as below:
+--------+--------------------+------+------+-----------+-----------------+---------------+---------------+-------------+-------------+-------------------+-------------------+
| exp_id | indicator:chemical | base | acid | exp_type  | indicator:color | molarity:base | molarity:acid | moles:H     | moles:OH    | volume:acid:in_ML | volume:base:in_ML |
+--------+--------------------+------+------+-----------+-----------------+---------------+---------------+-------------+-------------+-------------------+-------------------+
| 1      | phenolphthalein    | NaOH | HCl  | titration | faint_pink      | 0.500000000   | 0.250000000   | 0.012500000 | 0.012500000 | 23.120000000      | 5.430000000       |
+--------+--------------------+------+------+-----------+-----------------+---------------+---------------+-------------+-------------+-------------------+-------------------+
Database used: MySQL
Important note: each of these views can have a growing number of rows. Hence I considered "dynamic pivoting" for each of the views individually.
For reference, below is the prepared statement I used to pivot experiments in MySQL (a similar statement pivots the other view):
SET @sql = NULL;

SELECT GROUP_CONCAT(DISTINCT
         CONCAT(
           'MAX(IF(exp_properties = ''',
           exp_properties,
           ''', value, NULL)) AS ',
           CONCAT('`', exp_properties, '`')
         )
       )
INTO @sql
FROM experiments;

SET @sql = CONCAT(
  'select exp_id, ',
  @sql,
  ' from experiments group by exp_id'
);

PREPARE stmt FROM @sql;
EXECUTE stmt;
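For the combining step, one approach (a sketch, untested): generate the two pivot SELECTs as above into two variables, here hypothetically @sql_exp and @sql_calc, then join them on exp_id in one dynamic statement. With USING, MySQL returns the join column only once:
-- @sql_exp and @sql_calc are assumed to hold the two generated pivot SELECTs
SET @sql = CONCAT(
  'SELECT * FROM (', @sql_exp, ') AS e ',
  'JOIN (', @sql_calc, ') AS c USING (exp_id)'
);

PREPARE stmt FROM @sql;
EXECUTE stmt;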
