Combining GeoPandas Dataframe and Pandas Dataframe based on concentration - python
I have two spatial datasets that I want to compare. One of them has the polygon coordinates in it, but they are not converted into geometry points; it has about 700,000 points. The other only has the latitude and longitude of its coordinates, with no GIS point data; it has 6 million data points. I also have a shapefile, which I converted into a GeoPandas geodataframe, containing all the neighborhoods of the city I am studying. I am trying to study the concentration of points in each neighborhood. To do so, I added an extra column to my geodataframe and set its values to 0. I then looped through all points in the first dataset, used GeoPandas' point-in-polygon check to find which polygon contains each point, and incremented that row's new column. However, that took all day (so this solution is already not viable for the 6 million point dataset), and it did not work.
(Code is below)
How do you suggest I speed up the code or do this most effectively?
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, Polygon
import shapely.speedups
map_df = gpd.read_file("city.shp")
first_dataset = pd.read_csv(r'first_dataset.csv', header=0) # I have both datasets as csvs
map_df["data_count"] = 0
shapely.speedups.enable()
for index, rows in first_dataset.iterrows():
    for index_m, rows_m in map_df.iterrows():
        if rows_m['geometry'].contains(Point(float(rows['x_sp'].replace(',', '')), float(rows['y_sp'].replace(',', '')))):
            # note: iterrows() yields copies, so this increment never reaches
            # map_df itself -- which is why the counts stay at 0
            rows_m["data_count"] += 1
            print(rows_m)
    if index % 10000 == 0:
        print(index)
Example of dataframe rows from dataset 1:
id created_at latitude longitude x_sp y_sp
0 8/27/2015 40.723092 -73.844215 1,027,431.148 202,756.7687
1 09/03/2015 40.794111 -73.818679 1,034,455.701 228,644.8374
2 09/05/2015 40.717581 -73.936608 1,001,822.831 200,716.8913
Example of dataframe rows from dataset 2:
id created_at latitude longitude
0 8/27/2015 40.723092 -73.844215
1 09/03/2015 40.794111 -73.818679
2 09/05/2015 40.717581 -73.936608
Example of dataframe rows from the geodataframe:
0 POLYGON ((990897.9000244141 169268.1207885742, 990588.2515869141 169075.217590332, 990634.5867919922 168756.3862304688, 990675.9777832031 168471.1604003906, 990718.684387207 168203.7496337891, 990751.5944213867 167919.0817871094, 990769.1470336914 167817.549987793, 990787.1948242188 167713.2717895508, 990847.0239868164 167337.3375854492, 990561.9622192383 167292.2391967773, 990580.0333862305 167018.7095947266, 990604.2161865234 166962.7120361328, 990650.3184204102 166636.4661865234, 990680.6791992188 166428.1857910156, 990703.641784668 166257.7855834961, 990755.4462280273 165886.7659912109, 990648.6395874023 165753.7963867188, 990620.1052246094 165718.2723999023, 991080.934387207 165440.5859985352, 990684.0062255859 164677.5101928711, 990490.1318359375 164313.4031982422, 990399.3641967773 164171.6812133789, 991015.9805908203 163682.8403930664, 991087.1251831055 163613.3948364258, 991067.8176269531 163744.3165893555, 991057.1912231445 163816.3900146484, 991345.2709960938 163856.4456176758, 991646.3432006836 163900.2806396484, 991773.0786132813 163030.0364379883, 991470.4874267578 162985.1055908203, 991143.1979980469 163237.9592285156, 991198.5916137695 162853.2189941406, 991220.766784668 162677.7142333984, 991253.1356201172 162478.1116333008, 991133.2471923828 162567.9869995117, 990522.3430175781 163053.4155883789, 989903.0823974609 163543.8049926758, 989791.7080078125 163632.0856323242, 989558.5064086914 163816.9310302734, 989283.8712158203 164034.6274414063, 988673.0786132813 164518.5571899414, 988060.2022094727 165001.5054321289, 988221.6488037109 165207.3384399414, 987622.6669921875 165682.1539916992, 987000.0469970703 166175.3447875977, 986318.1618041992 166716.4169921875, 985920.5576171875 167032.3049926758, 985825.3380126953 167105.7139892578, 985667.7103881836 167149.0173950195, 985711.5521850586 167204.7424316406, 985735.1807861328 167234.7784423828, 985616.2537841797 167274.6865844727, 985141.4716186523 167646.0275878906, 985116.6848144531 167538.4523925781, 985078.2302246094 167367.5209960938, 985015.9666137695 167085.9205932617, 984772.9385986328 167277.9213867188, 984162.9234008789 167762.0607910156, 983552.0159912109 168246.4891967773, 983631.5764160156 168347.3909912109, 983714.2465820313 168449.9827880859, 983400.1986083984 168698.6110229492, 983226.8787841797 168835.8251953125, 983101.6134033203 168934.9949951172, 982490.5974121094 169419.4625854492, 982300.666809082 169571.1713867188, 982360.6928100586 169622.3165893555, 982412.7750244141 169667.8779907227, 982499.6558227539 169743.8810424805, 982705.4100341797 169925.6982421875, 982208.206237793 170319.3721923828, 982376.1818237305 170531.090637207, 982468.7208251953 170647.6915893555, 982537.7200317383 170734.6337890625, 982699.583190918 170939.6395874023, 982861.3120117188 171143.2908325195, 983023.1260375977 171347.2145996094, 983109.014831543 171455.7706298828, 983184.8184204102 171551.5745849609, 983347.2426147461 171754.9827880859, 983508.3842163086 171960.0018310547, 983592.399597168 172065.8973999023, 983669.7454223633 172163.3807983398, 983831.5626220703 172367.2941894531, 983993.1697998047 172571.6575927734, 984066.6146240234 172663.9800415039, 984155.3021850586 172775.4661865234, 984317.4638061523 172980.1575927734, 984478.7760009766 173183.559387207, 985089.9163818359 172698.5303955078, 985496.0212402344 172376.3232421875, 985694.4296264648 172550.7814331055, 985823.2385864258 172664.3868408203, 985893.815612793 172726.3685913086, 986092.0632324219 172900.3115844727, 986291.3049926758 173075.3837890625, 
986490.1168212891 173249.0444335938, 986688.6240234375 173424.299987793, 986887.4290161133 173598.6697998047, 987086.5100097656 173773.731628418, 987286.3489990234 173946.5759887695, 987483.0447998047 174107.9993896484, 987719.0842285156 173919.7717895508, 987932.4072265625 173748.016784668, 988163.9849853516 173563.5582275391, 988386.3762207031 173383.4208374023, 988527.3765869141 173270.1207885742, 988606.2772216797 173205.4880371094, 988880.1351928711 172984.799987793, 988969.5344238281 172928.8251953125, 989122.0604248047 173006.6860351563, 989233.424987793 173068.2640380859, 989458.4180297852 173191.2103881836, 989683.6900024414 173315.2449951172, 989779.9464111328 172642.8909912109, 989798.666809082 172522.3532104492, 989825.658203125 172332.0057983398, 989835.7268066406 172270.7457885742, 989890.3657836914 171886.9739990234, 989924.4572143555 171651.7941894531, 989946.274597168 171510.4783935547, 989971.1500244141 171328.6119995117, 989999.0960083008 171139.9083862305, 990047.4818115234 170785.1354370117, 990350.0830078125 170826.1293945313, 990664.5280151367 170870.7161865234, 990734.6127929688 170389.4440307617, 990758.5037841797 170225.3895874023, 990791.0989990234 170001.5607910156, 990897.9000244141 169268.1207885742))
1 POLYGON ((1038593.459228516 221913.3550415039, 1039369.281188965 221834.5889892578, 1040016.937194824 221767.3710327148, 1040050.687194824 221763.8671875, 1040133.272399902 221639.3005981445, 1040238.59362793 221481.9191894531, 1040275.303039551 221429.7963867188, 1040316.041015625 221380.5115966797, 1040360.440612793 221334.5405883789, 1040360.463439941 221334.5176391602, 1040360.48638916 221334.4979858398, 1040392.428588867 221306.1614379883, 1040408.153625488 221292.2112426758, 1040551.463806152 221188.8290405273, 1040607.487182617 221150.3577880859, 1040831.604187012 220992.8909912109, 1040848.265991211 220980.7042236328, 1040970.862792969 220849.400390625, 1041010.750793457 220801.116394043, 1041048.40838623 220751.07421875, 1041084.792785645 220697.3634033203, 1041264.418395996 220427.8726196289, 1041322.846984863 220337.4133911133, 1041520.730224609 220046.5382080078, 1041536.760437012 220023.0607910156, 1041527.849609375 219998.7427978516, 1041471.048583984 219365.9125976563, 1041419.457397461 218864.266784668, 1041325.684631348 217942.9957885742, 1041584.375 217916.5983886719, 1041530.431640625 217377.5538330078, 1041497.744628906 217069.0079956055, 1041473.059631348 216814.458190918, 1041472.780822754 216784.1069946289, 1041471.678405762 216664.5338134766, 1041472.728210449 216532.3522338867, 1041472.895629883 216511.0759887695, 1041718.423400879 216539.4061889648, 1041967.330383301 216566.1448364258, 1042215.098815918 216593.5889892578, 1042278.983215332 216006.1691894531, 1042341.099182129 215441.750793457, 1042090.027038574 215432.1575927734, 1041871.126586914 215424.8546142578, 1041840.965820313 215423.8474121094, 1041588.99798584 215415.458190918, 1041313.476623535 215407.4401855469, 1041061.948242188 215429.4609985352, 1040820.14440918 215535.7139892578, 1040705.075622559 215590.2645874023, 1040570.450012207 215654.0862426758, 1040321.418212891 215772.4523925781, 1040072.088012695 215890.9625854492, 1039822.682434082 216009.2297973633, 1039573.237426758 216119.8107910156, 1039310.012817383 216165.0956420898, 1039059.793640137 216222.4512329102, 1038804.708984375 216272.5297851563, 1038547.944213867 216324.5051879883, 1038293.778015137 216385.8469848633, 1038037.837036133 216449.0289916992, 1037782.71282959 216511.2695922852, 1037527.778991699 216573.4415893555, 1037274.272216797 216635.7349853516, 1037019.584411621 216696.1871948242, 1036761.95703125 216723.8743896484, 1036594.0078125 216737.932434082, 1036256.774414063 216632.6439819336, 1036017.460632324 216547.7261962891, 1035777.588989258 216462.2112426758, 1035476.270629883 216354.6691894531, 1034867.013427734 216136.2247924805, 1033940.263183594 215805.2145996094, 1033892.861816406 215830.4443969727, 1033793.351013184 216101.4118041992, 1033683.557800293 216410.233215332, 1033625.257385254 216572.4212036133, 1033613.161010742 216606.0693969727, 1033539.942199707 216807.2940063477, 1033392.729797363 217220.2963867188, 1033274.84161377 217550.5029907227, 1033147.43359375 217909.2789916992, 1033078.946411133 218336.7681884766, 1033016.190612793 218723.4501953125, 1032942.762023926 219190.3881835938, 1032877.883605957 219596.958984375, 1032814.523986816 220005.6489868164, 1032566.499816895 219959.740234375, 1032432.372619629 219933.1885986328, 1032291.398620605 219906.0463867188, 1032288.993591309 220089.8088378906, 1032285.696594238 220450.083984375, 1032285.26361084 220710.5983886719, 1032311.743225098 220991.5687866211, 1032372.694396973 221503.2936401367, 1032488.727600098 222011.6292114258, 1032754.957397461 
222240.6279907227, 1032903.536621094 222314.0530395508, 1032916.489196777 222404.0297851563, 1032933.516784668 222493.4227905273, 1032954.572998047 222581.9495849609, 1032979.589416504 222669.3442382813, 1033043.582214355 222903.4877929688, 1033059.717224121 222956.1351928711, 1033119.047790527 223149.7109985352, 1033183.440795898 223347.7189941406, 1033480.638183594 223316.1047973633, 1033734.718994141 223287.1185913086, 1034012.103637695 223260.5145874023, 1034153.652038574 223247.0729980469, 1034292.312988281 223233.9039916992, 1034854.191833496 223178.7626342773, 1035616.388427734 223101.1807861328, 1035580.289428711 222742.7860107422, 1035554.305236816 222487.2645874023, 1035546.32623291 222411.5527954102, 1035527.251586914 222225.7689819336, 1036129.015441895 222163.9714355469, 1037024.84362793 222073.1019897461, 1037817.207641602 221992.4197998047, 1038593.459228516 221913.3550415039))
2 POLYGON ((1022728.275024414 217530.8082275391, 1023052.64440918 216997.8765869141, 1023125.596984863 216889.2808227539, 1023273.037597656 216713.3721923828, 1023276.361022949 216661.2990112305, 1023320.054843883 216618.8505990705, 1023365.11109836 216577.851189134, 1023411.481757777 216538.3444855517, 1023459.117392412 216500.3726012749, 1023507.967224121 216463.9760131836, 1023557.979221938 216429.193600211, 1023609.100029511 216396.0623579916, 1023661.275153702 216364.6176033644, 1023714.448977505 216334.8928554269, 1023768.564819336 216306.9197998047, 1023946.881408691 216213.6618041992, 1023987.648620605 216191.0289916992, 1024036.671203613 216163.8128051758, 1024131.621826172 216098.0781860352, 1024206.487182617 216029.908996582, 1024272.221984863 215958.6986083984, 1024350.384399414 215887.7606201172, 1024401.58581543 215813.4290161133, 1024417.641784668 215790.1193847656, 1024520.187194824 215639.7949829102, 1024536.922790527 215611.208984375, 1024560.20098877 215571.5419921875, 1024523.26739502 215581.7974243164, 1024484.575622559 215589.4478149414, 1024445.806396484 215592.4146118164, 1024405.450195313 215590.1334228516, 1024336.880615234 215576.3468017578, 1024308.688415527 215593.5944213867, 1024252.044799805 215580.7860107422, 1023879.900024414 215517.4067993164, 1023657.876220703 215478.4797973633, 1023434.129821777 215436.2781982422, 1023343.286987305 215431.4096069336, 1023170.426391602 215423.5223999023, 1023055.702026367 215418.0599975586, 1022907.730224609 215410.4024047852, 1022758.914794922 215403.3388061523, 1022507.560424805 215391.9639892578, 1022258.003601074 215378.4998168945, 1022006.481994629 215365.9700317383, 1021755.898193359 215353.9393920898, 1021463.710632324 215330.1235961914, 1021215.643432617 215295.4747924805, 1020968.189819336 215260.8588256836, 1020685.559387207 215221.5936279297, 1020403.20098877 215182.217590332, 1020150.986816406 215147.0106201172, 1019898.86138916 215111.6365966797, 1019651.066589355 215077.348815918, 1019403.996826172 215043.3461914063, 1019252.57019043 215025.8591918945, 1019100.007995605 214999.4357910156, 1018842.210021973 214963.4545898438, 1018738.072998047 215708.6661987305, 1018655.300842285 216301.7557983398, 1018597.322021484 216715.5181884766, 1018742.200622559 216682.1550292969, 1018879.417785645 216650.5541992188, 1018789.959411621 217262.7443847656, 1018788.440429688 217285.70703125, 1018786.898620605 217303.4104003906, 1018728.470031738 217731.0209960938, 1018636.131225586 218394.3923950195, 1018559.18560791 218993.1937866211, 1018533.155395508 219135.4047851563, 1018530.182983398 219150.8411865234, 1018528.234191895 219164.5944213867, 1018489.773010254 219468.2059936523, 1018592.882995605 219766.1644287109, 1018590.461791992 219946.5151977539, 1018593.864013672 220008.2933959961, 1018602.453186035 220164.2609863281, 1018603.476806641 220212.774597168, 1018600.593017578 220261.3671875, 1018595.871398926 220294.948425293, 1018593.814819336 220309.5756225586, 1018582.728820801 220358.7814331055, 1018571.127807617 220392.7609863281, 1018591.321411133 220397.9415893555, 1018824.15222168 220444.0570068359, 1018939.43762207 220459.9885864258, 1019076.838806152 220472.3541870117, 1019203.705200195 220477.8526000977, 1019349.351623535 220477.8526000977, 1019502.75982666 220466.8685913086, 1019638.458435059 220452.1343994141, 1019777.109619141 220425.7139892578, 1019860.718383789 220409.033996582, 1020059.99621582 220341.6423950195, 1020213.319396973 220275.6680297852, 1020360.668212891 220202.0657958984, 1020414.034423828 
220172.3121948242, 1020564.690185547 220072.1777954102, 1020785.18182373 219898.716796875, 1021011.713623047 219708.8746337891, 1021348.228393555 219398.3112182617, 1021504.392822266 219234.4495849609, 1021618.96282959 219114.2304077148, 1021689.125793457 219032.9976196289, 1021717.223022461 219000.4672241211, 1021759.950012207 218950.9990234375, 1021908.460205078 218768.0629882813, 1022052.534790039 218590.5928344727, 1022088.194213867 218546.6691894531, 1022139.306213379 218472.1713867188, 1022238.390808105 218324.9241943359, 1022530.632385254 217851.1929931641, 1022728.275024414 217530.8082275391))
If I understand your question correctly, you have two .csv files containing points and one .shp file containing polygons, and you want to count the number of points in each polygon. If so, you indeed need a spatial join. It checks the geometric relationship (e.g., within) between each of your points and the polygons, and returns the ID of the polygon in which each point lies. After the join, your point dataframe will have a new column where each row holds a polygon ID. Before doing this, make sure your dataframe and geodataframe each have an ID column, with different names. Example code follows.
First, you need to convert your pandas dataframe to a geopandas geodataframe.
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
# loading polygons geodataframe
gdf_polygons = gpd.read_file('shape_file.shp')
# loading points dataframe
df = pd.read_csv('points_file.csv')
# converting longitude & latitude to geometry
df['coordinates'] = list(zip(df.longitude, df.latitude))
df.coordinates = df.coordinates.apply(Point)
# converting dataframe to geodataframe
gdf_points = gpd.GeoDataFrame(df, geometry='coordinates')
# note: assigning .crs only labels the points with the polygons' CRS;
# it does not reproject them (see the note below the code)
gdf_points.crs = gdf_polygons.crs
# spatial join
sjoin = gpd.sjoin(gdf_points, gdf_polygons, how='left')
# converting geodataframe to dataframe
df_sjoin = pd.DataFrame(sjoin)
# checking missing values
df_sjoin[df_sjoin.geoid.isnull()].shape
Please note that the order of zip(df.longitude, df.latitude) cannot be reversed, and the CRS of both geodataframes must be the same before the join. It's a good idea to check for missing values after the join; their count tells you how many points fall into none of the polygons. (The above code assumes you have a geoid column in your original polygons geodataframe.)
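If the two layers start out in different coordinate systems (plausible here, since the shapefile coordinates look projected while the CSV has plain longitude/latitude), the points should be reprojected rather than just relabeled. A minimal sketch, assuming the CSV coordinates are WGS84 and a recent geopandas:

# label the raw lon/lat points as WGS84, then reproject into the polygons' CRS
gdf_points = gdf_points.set_crs(epsg=4326)
gdf_points = gdf_points.to_crs(gdf_polygons.crs)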
Now I can think of two options to count the number of points in each polygon, and both of them use the .groupby() method.
Option 1: Create a new column and assign each row to 1, and then .groupby() the polygon ID (i.e., geoid) while summing that new column.
df_sjoin['obs'] = 1
counts = df_sjoin.groupby('geoid')['obs'].sum()
df = pd.DataFrame(counts).reset_index()
Option 2: .groupby() the polygon ID (geoid) while counting the non-null values in the original point dataframe's ID column (assuming it is named id).
counts = df_sjoin.groupby('geoid')['id'].count()
df = pd.DataFrame(counts).reset_index()
You should also check if the following code returns True. (This assumes you've dropped those missing values in the geoid column after the join.)
len(df_sjoin) == df.obs.sum() # if you use option 1
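The dropping mentioned above could look like the following (a minimal sketch, again assuming the polygon ID column is named geoid):

# discard points that fell outside every polygon
df_sjoin = df_sjoin.dropna(subset=['geoid'])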
If you have experience using other GIS software (e.g., QGIS, ArcGIS) for spatial joins, you will be surprised by how fast geopandas is.
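Finally, to get the counts back onto the polygons for mapping (the data_count idea from the original question), you can merge on the ID column. A minimal sketch, using the counts dataframe df from Option 1:

# attach per-polygon counts to the polygons geodataframe;
# polygons that received no points get a count of 0
map_counts = gdf_polygons.merge(df, on='geoid', how='left')
map_counts['obs'] = map_counts['obs'].fillna(0)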
Related
How do I calculate a large distance matrix with the haversine library in python?
I have a small set and a large set of locations and I need to know the geographic distance between the locations in these sets. An example of my datasets (they have the same structure, but one is larger):

location lat long
0 Gieten 53.003312 6.763908
1 Godlinze 53.372605 6.814674
2 Grijpskerk 53.263894 6.306134
3 Groningen 53.219065 6.568008

In order to calculate the distances, I am using the haversine library. The haversine function wants the input to look like this:

lyon = (45.7597, 4.8422)  # (lat, lon)
london = (51.509865, -0.118092)
paris = (48.8567, 2.3508)
new_york = (40.7033962, -74.2351462)
haversine_vector([lyon, london], [paris, new_york], Unit.KILOMETERS, comb=True)

after which the output looks like this:

array([[ 392.21725956,  343.37455271],
       [6163.43638211, 5586.48447423]])

How do I get the function to calculate a distance matrix with my two datasets without adding all the locations separately? I have tried using dictionaries and I have tried looping over the locations in both datasets, but I can't seem to figure it out. I am pretty new to Python, so if someone has a solution that is easy to understand but not very elegant I would prefer that over lambda functions and such. Thanks!
You are on the right track using haversine.haversine_vector. Since I'm not sure how you got your dataset, this is a self-contained example using CSV datasets, but so long as you get lists of cities and coordinates somehow, you should be able to work it out. Note that this does not compute distances between cities in the same array (e.g. not Helsinki <-> Turku) – if you want that too, you could concatenate your two datasets into one and pass it to haversine_vector twice.

import csv
import haversine

def read_csv_data(csv_data):
    cities = []
    locations = []
    for (city, lat, lng) in csv.reader(csv_data.strip().splitlines(), delimiter=";"):
        cities.append(city)
        locations.append((float(lat), float(lng)))
    return cities, locations

cities1, locations1 = read_csv_data(
    """
Gieten;53.003312;6.763908
Godlinze;53.372605;6.814674
Grijpskerk;53.263894;6.306134
Groningen;53.219065;6.568008
"""
)

cities2, locations2 = read_csv_data(
    """
Turku;60.45;22.266667
Helsinki;60.170833;24.9375
"""
)

distance_matrix = haversine.haversine_vector(locations1, locations2, comb=True)

distances = {}
for y, city2 in enumerate(cities2):
    for x, city1 in enumerate(cities1):
        distances[city1, city2] = distance_matrix[y, x]

print(distances)

This prints out e.g.

{
    ("Gieten", "Turku"): 1251.501257597515,
    ("Godlinze", "Turku"): 1219.2012174066822,
    ("Grijpskerk", "Turku"): 1251.3232414412073,
    ("Groningen", "Turku"): 1242.8700308545722,
    ("Gieten", "Helsinki"): 1361.4575055586013,
    ("Godlinze", "Helsinki"): 1331.2811273683897,
    ("Grijpskerk", "Helsinki"): 1364.5464743878606,
    ("Groningen", "Helsinki"): 1354.8847270142198,
}
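If you also need the distances within each set (e.g. Helsinki <-> Turku), here is a minimal sketch of the concatenation idea mentioned above, reusing the lists built by read_csv_data:

# compare the combined list against itself; the diagonal will be zeros
all_cities = cities1 + cities2
all_locations = locations1 + locations2
full_matrix = haversine.haversine_vector(all_locations, all_locations, comb=True)

# e.g. distance from the first to the last combined city
print(all_cities[0], "->", all_cities[-1], full_matrix[-1, 0])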
GeoPandas plot shapefile by ignoring some administrative areas
Shapefile data: the entire world (with 5 administrative levels) from https://gadm.org/data.html

import geopandas as gpd
World = gpd.read_file("~/gadm36.shp")
World = World[['NAME_0','NAME_1','NAME_2','geometry']]  # Keep only 3 columns
World.head()

In this GeoDataFrame, I have 60 columns (NAME_0 for the country name, NAME_1 for the region, ...). For now, I am interested in studying the number of users of my website in Germany:

Germany = World[World['NAME_0'].isin(['Germany']) == True]

Now here is my website users data by region (NAME_1); I renamed the first column to match the shapefile:

GER = pd.read_csv("~/GER.CSV", sep=";")
GER

Now I merge my data into the GeoDataFrame on NAME_1 to plot users by region:

merged_ger = Germany.merge(GER, on='NAME_1', how='left')
merged_ger['Users'] = merged_ger['Users'].fillna(0)

The problem here is that NAME_1 is repeated according to NAME_2. Thus, the total number of users in the merged data greatly exceeds the original number:

print(merged_ger['Users'].sum())
print(GER['Users'].sum())

7172411.0
74529

So plotting the data using this code

import matplotlib.pyplot as plt
merged_ger.plot(column='Users')

is obviously wrong. How can I merge the data in this case without duplication and without affecting the final plot? Or, how do I ignore the rest of the administrative areas in a shapefile?
Wouldn't mapping a dictionary of users by region help?

GER_users = dict(zip(GER.NAME_1, GER.Users))
Germany['Users'] = Germany['NAME_1'].map(GER_users)
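Another way to avoid the duplication (a sketch, not part of the answer above): dissolve the geometries up to the NAME_1 level first, so each region appears exactly once before the merge:

# one polygon per region: union all NAME_2 geometries within each NAME_1
germany_regions = Germany[['NAME_1', 'geometry']].dissolve(by='NAME_1').reset_index()
merged = germany_regions.merge(GER, on='NAME_1', how='left')
merged['Users'] = merged['Users'].fillna(0)
merged.plot(column='Users')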
Computing the least distance between coordinate pairs
The first dataframe df1 contains ids and their corresponding coordinate pairs. For each coordinate pair in the first dataframe, I have to loop through the second dataframe to find the one with the least distance. I tried taking the individual coordinates and finding the distance between them, but it does not work as expected. I believe the coordinates have to be taken as a pair when finding the distance between them. Not sure whether Python offers some methods to achieve this. For example:

df1
Id  Co1        Co2
334 30.371353  -95.384010
337 39.497448  -119.789623

df2
Id  Co1        Co2
339 40.914585  -73.892456
441 34.760395  -77.999260

dfloc3 = [[38.991512, -77.441536], [40.89869, -72.37637], [40.936115, -72.31452], [30.371353, -95.38401], [39.84819, -75.37162], [36.929306, -76.20035], [40.682342, -73.979645]]

dfloc4 = [[40.914585, -73.892456], [41.741543, -71.406334], [50.154522, -96.88806], [39.743565, -121.795761], [30.027597, -89.91014], [36.51881, -82.560844], [30.449587, -84.23629], [42.920475, -85.8208]]
Given you can get your points into a list like so...

df1 = [[30.371353, -95.384010], [39.497448, -119.789623]]
df2 = [[40.914585, -73.892456], [34.760395, -77.999260]]

Import math, then create a function to make finding the distance easier:

import math

def distance(pt1, pt2):
    return math.sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)

Then simply traverse your list, saving the closest points:

for pt1 in df1:
    closestPoints = [pt1, df2[0]]
    for pt2 in df2:
        if distance(pt1, pt2) < distance(closestPoints[0], closestPoints[1]):
            closestPoints = [pt1, pt2]
    print("Point: " + str(closestPoints[0]) + " is closest to " + str(closestPoints[1]))

Outputs:

Point: [30.371353, -95.38401] is closest to [34.760395, -77.99926]
Point: [39.497448, -119.789623] is closest to [34.760395, -77.99926]
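As an aside, the inner loop can also be written with Python's built-in min() and a key function (a sketch equivalent to the loop above, reusing the distance helper):

# for each point in df1, pick the df2 point minimizing the distance
for pt1 in df1:
    closest = min(df2, key=lambda pt2: distance(pt1, pt2))
    print("Point:", pt1, "is closest to", closest)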
The code below creates a new column in df1 showing the Id of the nearest point in df2. (I can't tell from the question if this is what you want.) I'm assuming the coordinates are in a Euclidean space, i.e., that the distance between points is given by the Pythagorean theorem. If not, you could easily use some other calculation instead of dist_squared.

import pandas as pd

df1 = pd.DataFrame(dict(Id=[334, 337], Co1=[30.371353, 39.497448], Co2=[-95.384010, -119.789623]))
df2 = pd.DataFrame(dict(Id=[339, 441], Co1=[40.914585, 34.760395], Co2=[-73.892456, -77.999260]))

def nearest(row, df):
    # calculate squared Euclidean distance from the given row to all rows of df
    dist_squared = (row.Co1 - df.Co1) ** 2 + (row.Co2 - df.Co2) ** 2
    # find the closest row of df
    smallest_idx = dist_squared.argmin()
    # return the Id for the closest row of df
    return df.loc[smallest_idx, 'Id']

near = df1.apply(nearest, args=(df2,), axis=1)
df1['nearest'] = near
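For larger tables, a nearest-neighbour index avoids the O(n*m) scan of both approaches above. A sketch using scipy's cKDTree (assumptions: scipy is available and the coordinates can be treated as planar):

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

df1 = pd.DataFrame(dict(Id=[334, 337], Co1=[30.371353, 39.497448], Co2=[-95.384010, -119.789623]))
df2 = pd.DataFrame(dict(Id=[339, 441], Co1=[40.914585, 34.760395], Co2=[-73.892456, -77.999260]))

# build a KD-tree on df2's coordinates and query it with df1's coordinates
tree = cKDTree(df2[['Co1', 'Co2']].to_numpy())
dist, idx = tree.query(df1[['Co1', 'Co2']].to_numpy(), k=1)
df1['nearest'] = df2['Id'].to_numpy()[idx]
df1['distance'] = dist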
Big data visualization for multiple sampled data points from a large log
I have a log file which I need to plot in Python, with different data points as a multi-line plot with a line for each unique point. The problem is that in some samples some points are missing and new points are added in others, as shown in this example with each line denoting a sample of n points, where n is variable:

2015-06-20 16:42:48,135 current stats=[ ('keypassed', 13), ('toy', 2), ('ball', 2), ('mouse', 1) ...]
2015-06-21 16:42:48,135 current stats=[ ('keypassed', 20), ('toy', 5), ('ball', 7), ('cod', 1), ('fish', 1) ... ]

In the above, 'mouse' is present in the first sample but absent in the second, while new data points like 'cod' and 'fish' are added. How can this be done in Python in the quickest and cleanest way? Are there any existing Python utilities which can help to plot this timed log file? Also, being a log file, the samples number in the thousands, so the visualization should be able to display them properly. I am interested in applying multivariate hexagonal binning to this, with a different colored hexagon for each unique column ("ball", "mouse", etc.). scikit offers hexagonal binning but I can't figure out how to render different colors for each hexagon based on the unique data point. Any other visualization technique would also help.
Getting the data into pandas:

import pandas as pd

# collect rows in a list (DataFrame.append was removed in pandas 2.0)
rows = []
with open(logfilepath) as f:
    for line in f.readlines():
        timestamp = line.split(',')[0]
        # the data part of each line can be evaluated directly as a Python list
        # (ast.literal_eval would be a safer alternative to eval)
        data = eval(line.split('=')[1])
        # convert the input data from wide format to long format
        for name, value in data:
            rows.append({'timestamp': timestamp, 'name': name, 'value': value})
df = pd.DataFrame(rows, columns=['timestamp', 'name', 'value'])

# convert from long format back to wide format, and fill null values with 0
df2 = df.pivot_table(index='timestamp', columns='name')
df2 = df2.fillna(0)

df2
Out[142]:
                    value
name                 ball cod fish keypassed mouse toy
timestamp
2015-06-20 16:42:48     2   0    0        13     1   2
2015-06-21 16:42:48     7   1    1        20     0   5

Plot the data:

import matplotlib.pylab as plt
df2.value.plot()
plt.show()
How to show lines connecting latitude and longitude points in world map?
I have a CSV file of longitudes and latitudes like the following (86 points in total):

Index lon lat
1 2.352222 48.85661
2 -72.922343 41.31632
3 108.926694 34.25005
4 -79.944163 40.44306
5 -117.328119 33.97329
6 -79.953423 40.4442
7 -84.396285 33.77562
8 -95.712891 37.09024

Now I want to plot a line from the point (32.06025, 118.7969) to all these points (lon, lat), like many arrow lines radiating from one point. I have tried to do all of this in R and ran into something strange. For example, if I use map('world2Hires'):

map('world2Hires')
for (j in 1:length(location$lon)) {
  inter <- gcIntermediate(c(lon_nj, lat_nj), c(location$lon[j], location$lat[j]), n=100, addStartEnd=TRUE)
  lines(inter, col="black", lwd=0.8)
}
View(location)

The resulting map would be fine if the lines pointing to the USA crossed the Pacific Ocean, but they don't. Do you have any idea how I can achieve this? Any tools are OK, although I have experience in Python and R. Thank you!
First, you have to add the argument breakAtDateLine=TRUE inside the function gcIntermediate(). This ensures that if a line crosses the date line, the function produces two segments instead of connecting the points with a straight line. All results of this calculation are stored in the list gg. This list contains a data frame for each line, or a list of data frames for a line consisting of two segments.

library(mapdata)
library(geosphere)

lon_nj <- 118.7969
lat_nj <- 32.06025
location <- structure(list(Index = 1:8,
    lon = c(2.352222, -72.922343, 108.926694, -79.944163, -117.328119, -79.953423, -84.396285, -95.712891),
    lat = c(48.85661, 41.31632, 34.25005, 40.44306, 33.97329, 40.4442, 33.77562, 37.09024)),
    .Names = c("Index", "lon", "lat"), class = "data.frame", row.names = c(NA, -8L))

gg <- lapply(1:length(location$lon), function(j) {
  gcIntermediate(c(lon_nj, lat_nj), c(location$lon[j], location$lat[j]),
                 n=100, breakAtDateLine=TRUE, addStartEnd=TRUE)
})

This changes the list so that each segment is in a separate data frame and not in a list of lists:

gg2 <- unlist(lapply(gg, function(x) if (class(x) == "list") x else list(x)), recursive=FALSE)

To plot those data you can again use the function lapply(). If you use map("world"), then do just:

map("world")
lapply(gg2, lines)

If you use map('world2Hires'), then this map is based on 0-360 longitudes, so you have to add 360 to those x coordinate values that are negative:

map('world2Hires')
lapply(gg2, function(x) lines(ifelse(x[,1]>0, x[,1], x[,1]+360), x[,2]))
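For the Python route the question also allows, here is a minimal sketch using cartopy (an assumption: cartopy is installed). Plotting with the Geodetic transform draws great-circle paths and handles the date line, and centering the projection on the Pacific keeps the USA-bound lines from wrapping the wrong way:

import matplotlib.pyplot as plt
import cartopy.crs as ccrs

# origin point from the question: (lat, lon) = (32.06025, 118.7969)
lon_nj, lat_nj = 118.7969, 32.06025
points = [(2.352222, 48.85661), (-72.922343, 41.31632), (108.926694, 34.25005),
          (-79.944163, 40.44306), (-117.328119, 33.97329), (-79.953423, 40.4442),
          (-84.396285, 33.77562), (-95.712891, 37.09024)]

# center the map on the Pacific so the great-circle lines read naturally
ax = plt.axes(projection=ccrs.PlateCarree(central_longitude=180))
ax.coastlines()
ax.set_global()
for lon, lat in points:
    # transform=Geodetic() makes plot() draw the great-circle path
    ax.plot([lon_nj, lon], [lat_nj, lat],
            transform=ccrs.Geodetic(), color="black", linewidth=0.8)
plt.show()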