How to map closest values from two dataframes:
I have two dataframes in the format below and want to map values based on o_lat, o_long from data1 and near_lat, near_lon from data2:
data1 ={'lat': [-0.659901, -0.659786, -0.659821],
'long':[2.530561, 2.530797, 2.530587],
'd':[0.4202, 1.0957, 0.6309],
'o_lat':[-37.8095,-37.8030,-37.8050],
'o_long':[145.0000,145.0077,145.0024]}
Where lat,long is coordinates of destination, d is the distance between origin and destination, o_lat,o_long is the coordinates of origin.
data2={'nearest_warehouse': ['Nickolson','Thompson','Bakers'],
'lat':[-37.8185,-37.8126,-37.8099],
'lon':[144.9695,144.9470,144.9952]}
I want to add a column to data1 that gives the nearest_warehouse for each row, chosen by the closest value, in the following format:
result={'lat': [-0.659901, -0.659786, -0.659821],
'long':[2.530561, 2.530797, 2.530587],
'd':[0.4202, 1.0957, 0.6309],
'o_lat':[-37.8095,-37.8030,-37.8050],
'o_long':[145.0000,145.0077,145.0024],
'nearest_warehouse':['Bakers','Thompson','Nickolson']}
I've tried the following code:
lat_diff=[]
long_diff=[]
min_distance=[]
for i in range(0,3):
    lat_diff.append(float(warehouse.near_lat[i])-lat_long_d.o_lat[0])
for j in range(0,3):
    long_diff.append(float(warehouse.near_lon[j])-lat_long_d.o_long[0])
min_distance=[min(lat_diff),min(long_diff)]
min_distance
This gives the following result, which is the minimum difference in latitude and in longitude for o_lat=-37.8095 and o_long=145.0000:
[-0.00897867136701791, -0.05300973586690816]
I don't think this approach is viable for mapping closest values over a large dataset.
I'm looking for a better approach.
For each row of the first dataframe, you can use apply with lambda x: to compare against every row of the second dataframe: take the absolute difference in latitude and add the absolute difference in longitude. This sum is the Manhattan distance in degrees, which serves as a simple proxy for the true distance.
What you are actually interested in is the index, i.e. the position, of the smallest such sum for each row. You can find it with idxmin(). For dataframe 1 this returns an index label, which you can merge against the index of dataframe 2 to pull in the closest warehouse:
setup:
import pandas as pd
data1 = pd.DataFrame({'lat': [-0.659901, -0.659786, -0.659821], 'long':[2.530561, 2.530797, 2.530587],
'd':[0.4202, 1.0957, 0.6309],
'o_lat':[-37.8095,-37.8030,-37.8050], 'o_long':[145.0000,145.0077,145.0024]})
data2= pd.DataFrame({'nearest_warehouse': ['Nickolson','Thompson','Bakers'],
'lat':[-37.818595, -37.812673, -37.809996], 'lon':[144.969551, 144.947069, 144.995232],
'near_lat':[-37.8185,-37.8126,-37.8099], 'near_lon':[144.9695,144.9470,144.9952]})
code:
data1['key'] = data1.apply(lambda x: ((x['o_lat'] - data2['near_lat']).abs()
+ (x['o_long'] - data2['near_lon']).abs()).idxmin(), axis=1)
data1 = pd.merge(data1, data2[['nearest_warehouse']], how='left', left_on='key', right_index=True).drop('key', axis=1)
data1
Out[1]:
        lat      long       d    o_lat    o_long nearest_warehouse
0 -0.659901  2.530561  0.4202 -37.8095  145.0000            Bakers
1 -0.659786  2.530797  1.0957 -37.8030  145.0077            Bakers
2 -0.659821  2.530587  0.6309 -37.8050  145.0024            Bakers
This result looks accurate if you combine the two dataframes into one and draw a basic scatterplot. As you can see, the Bakers warehouse is much closer to the three origins than the other warehouses (the last line of code makes the graph to scale):
import pandas as pd
import matplotlib.pyplot as plt
data1 = pd.DataFrame({'o_lat':[-37.8095,-37.8030,-37.8050], 'o_long':[145.0000,145.0077,145.0024],
'nearest_warehouse': ['0','1','2']})
data2= pd.DataFrame({'nearest_warehouse': ['Nickolson','Thompson','Bakers'],
'o_lat':[-37.8185,-37.8126,-37.8099], 'o_long':[144.9695,144.9470,144.9952]})
# Stack the origins and the warehouses into one dataframe
df = pd.concat([data1, data2])
y = df['o_lat'].to_list()
z = df['o_long'].to_list()
n = df['nearest_warehouse'].to_list()
fig, ax = plt.subplots()
ax.scatter(z, y)
for i, txt in enumerate(n):
    ax.annotate(txt, (z[i], y[i]))
# Equal aspect ratio keeps the plot to scale
plt.gca().set_aspect('equal', adjustable='box')
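For a large dataset, the row-wise apply above can be replaced by a broadcasted distance matrix. The following is a minimal sketch, not part of the original answer: it swaps the sum of absolute differences for the haversine great-circle distance, assumes a spherical Earth with a radius of 6371 km, and uses hypothetical names (haversine_km, origins, warehouses, distance_km).

```python
import numpy as np
import pandas as pd

origins = pd.DataFrame({'o_lat': [-37.8095, -37.8030, -37.8050],
                        'o_long': [145.0000, 145.0077, 145.0024]})
warehouses = pd.DataFrame({'nearest_warehouse': ['Nickolson', 'Thompson', 'Bakers'],
                           'near_lat': [-37.8185, -37.8126, -37.8099],
                           'near_lon': [144.9695, 144.9470, 144.9952]})

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km; inputs in degrees, broadcastable arrays."""
    lat1, lon1, lat2, lon2 = (np.radians(v) for v in (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Column vectors for origins, row vectors for warehouses: the result has
# shape (n_origins, n_warehouses)
dist = haversine_km(origins['o_lat'].to_numpy()[:, None],
                    origins['o_long'].to_numpy()[:, None],
                    warehouses['near_lat'].to_numpy()[None, :],
                    warehouses['near_lon'].to_numpy()[None, :])

# Position of the closest warehouse for every origin
nearest_idx = dist.argmin(axis=1)
origins['nearest_warehouse'] = warehouses['nearest_warehouse'].to_numpy()[nearest_idx]
origins['distance_km'] = dist.min(axis=1)
print(origins)
```

The distance matrix holds one value per origin and warehouse pair, so memory grows with the product of the two table sizes; for very large inputs the origins would probably need to be processed in chunks.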
I have a very large spatial dataset stored in a dataframe. I am taking a slice of that dataframe into a new smaller subset to run further calculations.
The data has x, y and z coordinates with a number of additional columns, some of which are text and some are numeric. The x and y coordinates are on a defined grid and have a known separation.
Data looks like this
x,y,z,text1,text2,text3,float1,float2
75000,45000,120,aa,bbb,ii,12,0.2
75000,45000,110,bb,bbb,jj,22,0.9
75000,45100,120,aa,bbb,ii,11,1.8
75000,45100,110,bb,bbb,jj,45,2.4
75000,45100,100,bb,ccc,ii,13.6,1
75100,45000,120,bb,ddd,jj,8.2,2.1
75100,45000,110,bb,ddd,ii,12,0.6
For each x and y pair I want to iterate over two series of text values and do three things in the z direction:
1. Calculate the average of one numeric value for all rows with a specific third text value.
2. Sum another numeric value for all rows with the same text value.
3. Write the resulting table of 'x, y, average, sum' to a csv.
My code does part three (albeit very slowly) but doesn't do parts 1 or 2, or at least the average and sum calculations don't appear in my output.
What have I done wrong, and how can I speed it up?
for text1 in text_list1:
    for text2 in text_list2:
        # Get the data into smaller dataframe
        df = data.loc[ (data["textfield1"] == text1) & (data["textfield2"] == text2 ) ]
        #Get the minimum and maximum x and y
        minXw = df['x'].min()
        maxXw = df['x'].max()
        minYw = df['y'].min()
        maxYw = df['y'].max()
        # dictionary for quicker printing
        dict_out = {}
        rows_list = []
        # Make output filename
        filenameOut = text1+"_"+text2+"_Values.csv"
        # Start looping through x values
        for x in np.arange(minXw, maxXw, x_inc):
            xcount += 1
            # Start looping through y values
            for y in np.arange(minYw, maxYw, y_inc):
                ycount += 1
                # calculate average and sum
                ave_val = df.loc[df['textfield3'] == 'text3', 'float1'].mean()
                sum_val = df.loc[df['textfield3'] == 'text3', 'float2'].sum()
                # Make Dictionary of output values
                dict_out = dict([('text1', text1),
                                 ('text2', text2),
                                 ('text3', df['text3']),
                                 ('x' , x-x_inc),
                                 ('y' , y-y_inc),
                                 ('ave' , ave_val),
                                 ('sum' , sum_val)])
                rows_list_c.append(dict_out)
        # Write csv
        columns = ['text1','text2','text3','x','y','ave','sum']
        with open(filenameOut, 'w') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=columns)
            writer.writeheader()
            for data in dict_out:
                writer.writerow(data)
My resultant csv gives me:
text1,text2,text3,x,y,ave,sum
text1,text2,,74737.5,43887.5,nan,0.0
text1,text2,,74737.5,43912.5,nan,0.0
text1,text2,,74737.5,43937.5,nan,0.0
text1,text2,,74737.5,43962.5,nan,0.0
It's not entirely clear what you're trying to do, but here is a starting point.
If you only need to process rows with a specific text3 value, start by filtering out the other rows:
df = df[df.text3=="my_value"]
If you don't need text3 after this point, you can also drop it:
df = df.drop(columns="text3")
You then process several sub-dataframes and write each of them to its own csv file. groupby is the perfect tool for that:
for (text1, text2), sub_df in df.groupby(["text1", "text2"]):
    filenameOut = text1+"_"+text2+"_Values.csv"
    # Process sub df
    output_df = process(sub_df)
    # Write sub df
    output_df.to_csv(filenameOut)
Note that if you keep your data as a DataFrame instead of converting it to a dict, you can use the DataFrame to_csv method to simply write the output csv.
Now let's have a look at the process function (note that you don't really need to make it a separate function; you could just as well put the function body in the for loop).
At this point, if I understand correctly, you want to compute the sum and the average over all rows that have the same x and y coordinates. Here again you can use groupby, together with the agg function, to compute the mean and the sum of each group.
def process(sub_df):
    # drop the text1 and text2 columns since they are in the filename anyway
    out = sub_df.drop(columns=["text1","text2"])
    # Compute mean and sum
    return out.groupby(["x", "y"]).agg(ave=("float1", "mean"), sum=("float2", "sum"))
And that's pretty much it.
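To see the groupby and agg step on the sample rows from the question, here is a small self-contained sketch. It folds text1 and text2 into the same groupby instead of looping over sub-dataframes and skips the text3 filter, so it is a variation on the approach above rather than the exact code; the names csv_text, sample and summary are made up for illustration.

```python
import io
import pandas as pd

# The sample rows quoted in the question
csv_text = """x,y,z,text1,text2,text3,float1,float2
75000,45000,120,aa,bbb,ii,12,0.2
75000,45000,110,bb,bbb,jj,22,0.9
75000,45100,120,aa,bbb,ii,11,1.8
75000,45100,110,bb,bbb,jj,45,2.4
75000,45100,100,bb,ccc,ii,13.6,1
75100,45000,120,bb,ddd,jj,8.2,2.1
75100,45000,110,bb,ddd,ii,12,0.6
"""
sample = pd.read_csv(io.StringIO(csv_text))

# One row per (text1, text2, x, y): mean of float1 and sum of float2 over z
summary = (sample.groupby(["text1", "text2", "x", "y"])
                 .agg(ave=("float1", "mean"), sum=("float2", "sum"))
                 .reset_index())
print(summary)
```

Only the (bb, ddd, 75100, 45000) group contains two rows in this sample, so it is the only one where the mean and the sum combine several z levels.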
Bonus: 2-liner version (but don't do that...)
for (text1, text2), sub_df in df[df.text3=="my_value"].drop(columns="text3").groupby(["text1", "text2"]):
    sub_df.drop(columns=["text1","text2"]).groupby(["x", "y"]).agg(ave=("float1", "mean"), sum=("float2", "sum")).to_csv(text1+"_"+text2+"_Values.csv")
To do this in an efficient way in pandas you will need to use groupby, agg and the in-built to_csv method rather than using for loops to construct lists of data and writing each one with the csv module. Something like this:
groups = data[data["text1"].isin(text_list1) & data["text2"].isin(text_list2)] \
    .groupby(["text1", "text2"])
for (text1, text2), group in groups:
    group.groupby("text3") \
        .agg({"float1": "mean", "float2": "sum"}) \
        .to_csv(f"{text1}_{text2}_Values.csv")
It's not clear exactly what you're trying to do with the incrementing of x and y values, which is also what makes your current code very slow. To present sums and averages of the floating point columns by intervals of x and y, you could make bin columns and group by those too.
data["x_bin"] = (data["x"] - data["x"].min()) // x_inc
data["y_bin"] = (data["y"] - data["y"].min()) // y_inc
groups = data[data["text1"].isin(text_list1) & data["text2"].isin(text_list2)] \
    .groupby(["text1", "text2"])
for (text1, text2), group in groups:
    group.groupby(["text3", "x_bin", "y_bin"]) \
        .agg({"x": "first", "y": "first", "float1": "mean", "float2": "sum"}) \
        .to_csv(f"{text1}_{text2}_Values.csv")
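As a quick check of the bin arithmetic, the sketch below computes x_bin and y_bin for a few coordinates. The spacing x_inc = y_inc = 100 is an assumption based on the spacing visible in the sample rows, and the grid dataframe and the *_bin_start columns are illustrative names only.

```python
import pandas as pd

x_inc = 100  # assumed grid spacing in x
y_inc = 100  # assumed grid spacing in y

grid = pd.DataFrame({"x": [75000, 75000, 75100, 75250],
                     "y": [45000, 45100, 45000, 45199]})

# Integer bin number counted from the smallest coordinate
grid["x_bin"] = (grid["x"] - grid["x"].min()) // x_inc
grid["y_bin"] = (grid["y"] - grid["y"].min()) // y_inc

# Lower edge of each bin, recovered from the bin number
grid["x_bin_start"] = grid["x"].min() + grid["x_bin"] * x_inc
grid["y_bin_start"] = grid["y"].min() + grid["y_bin"] * y_inc
print(grid)
```

Points that do not sit exactly on the grid (75250 and 45199 here) fall into the bin whose lower edge is at or below them, which is what floor division gives.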
I have a pandas dataframe, bike_path_df, that contains a few columns, one of which is called coordinates. The format of coordinates is a list of lists, where each inner list is a [longitude, latitude] pair, and the i'th and i+1'th elements of the list of lists denote a straight-line bike path segment connecting the two points i and i+1.
For example:
bike_path_df.iloc[0]['coordinates']
yields the following:
[[149.12482362501234, -35.17695800091904], # Point A of line 1
[149.12481244481404, -35.177008392939385], # Point B of line 1, point A of line 2
[149.12480556675655, -35.17703489702785], # Point B of line 2, point A of line 3
[149.12481021458206, -35.17706139012856], # etc...
[149.12483798252785, -35.17709736965295],
[149.12489568437493, -35.17714846206322]]
After some effort, I've written a clumsy loop that will allow me to pair each point with its neighbours:
all_list = []
for list_of_points in bike_path_df['coordinates']:
    result = [ [ list_of_points[i], list_of_points[i+1] ] for i,j in enumerate(list_of_points) if i+1 < len(list_of_points) ]
    all_list.append(result)
The output from the above looks something like:
[[[149.12482362501234, -35.17695800091904], [149.12481244481404, -35.177008392939385]],
[[149.12481244481404, -35.177008392939385], [149.12480556675655, -35.17703489702785]],
...]
But converting all_list to a pd.Series object can produce NaN when I try to add it back to the original dataframe (I believe because Series is expanding the list of lists, so the shapes are no longer equal).
Ideally I'd like to have each pair of points (four coordinate values) on its own dataframe row, with the other data for that path repeated on each row, such that it would resemble:
>>bike_path_df.head()
name coordinate_pair
Path1 [A1_long, A1_lat, B1_long, B1_lat]
Path1 [B1_long, B1_lat, C1_long, C1_lat]
Path1 [C1_long, C1_lat, D1_long, D1_lat]
Path1 [D1_long, D1_lat, E1_long, E1_lat]
Path2 [A2_long, A2_lat, B2_long, B2_lat]
Path2 [B2_long, B2_lat, C2_long, C2_lat]
...
Does anyone have any advice?
I've also uploaded a few rows of the actual data I'm working with in CSV format here: https://github.com/Ecaloota/BikePathInfrastructure-ACT as "bike_paths_progress.csv"
Thank you!
IIUC, use zip and explode:
df = pd.read_csv('bike_paths_progress.csv', index_col=0)
df['coordinates'] = pd.eval(df['coordinates'])
df = df.join(df['coordinates'].apply(lambda x: [[i[0], i[1], j[0], j[1]]
                                                for i, j in zip(x, x[1:])])
                              .explode().rename('coordinate_pair'))
Output:
>>> df.loc[81, 'coordinate_pair']
81 [149.12482362501234, -35.17695800091904, 149.12481244481404, -35.17700839293...
81 [149.12481244481404, -35.177008392939385, 149.12480556675655, -35.1770348970...
81 [149.12480556675655, -35.17703489702785, 149.12481021458206, -35.17706139012...
81 [149.12481021458206, -35.17706139012856, 149.12483798252785, -35.17709736965...
81 [149.12483798252785, -35.17709736965295, 149.12489568437493, -35.17714846206...
Name: coordinate_pair, dtype: object
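The zip and explode idea can also be tried without the CSV file. The sketch below is a minimal illustration on a tiny hand-made dataframe, assuming the coordinates column already holds Python lists rather than strings; the values, the helper to_segments and the name segments are invented for the example.

```python
import pandas as pd

# Two short made-up paths, each a list of [longitude, latitude] points
paths = pd.DataFrame({
    'name': ['Path1', 'Path2'],
    'coordinates': [
        [[149.1, -35.1], [149.2, -35.2], [149.3, -35.3]],
        [[150.0, -36.0], [150.5, -36.5]],
    ],
})

def to_segments(points):
    """Pair each point with its successor and flatten to [x1, y1, x2, y2]."""
    return [[*a, *b] for a, b in zip(points, points[1:])]

# explode turns each list of segments into one row per segment and
# repeats the other columns (here: name) on every new row
segments = (paths.assign(coordinate_pair=paths['coordinates'].apply(to_segments))
                 .explode('coordinate_pair')
                 .drop(columns='coordinates')
                 .reset_index(drop=True))
print(segments)
```

zip(points, points[1:]) stops at the shorter input, so a path with n points yields n - 1 segments without any index checks.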
I have a dataframe named SD_Apartments that has 3 variables: name (object), latitude (float64), longitude (float64). It's a list of apartment names, and their coordinates.
I have another dataframe named SD_Coffee that has 3 variables: name (object), latitude (float64), longitude (float64). It's a list of coffee shop names, and their coordinates.
I want to add another variable to SD_Apartments called coffee_count that holds the number of coffee shop locations in my SD_Coffee dataframe that are within x (for example, 300) meters of each apartment listed in SD_Apartments.
Here is a setup of the code I'm working with:
import pandas as pd
import geopy.distance
from geopy.distance import geodesic
data = [['Insomnia', 32.784782, -117.129130], ['Starbucks', 32.827521, -117.139966], ['Dunkin', 32.778519, -117.154720]]
data1 = [['DreamAPT', 32.822090, -117.184200], ['OKAPT', 32.748081, -117.130691], ['BadAPT', 32.786886, -117.097536]]
SD_Coffee = pd.DataFrame(data, columns = ['name', 'latitude', 'longitude'])
SD_Apartments = pd.DataFrame(data1, columns = ['name', 'latitude', 'longitude'])
Here is the code I'm attempting to use to accomplish my goal:
def geodesic_pd(df1, df2_row):
    return [(geodesic([tuple(x) for x in row.values], [tuple(x) for x in df2_row.values]).m for row in df1)]
SD_Apartments['coffee_count'] = pd.Series([(sum(geodesic_pd(SD_Coffee[['latitude', 'longitude']], row) < 300) for row in SD_Apartments[['latitude', 'longitude']])])
If you run it and print SD_Apartments, you will see that SD_Apartments looks like:
name ... coffee_count
0 DreamAPT ... <generator object <genexpr> at 0x000002E178849...
1 OKAPT ... NaN
2 BadAPT ... NaN
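This will probably help you. Once you have a geodesic distance for each row, nsmallest returns the rows with the smallest distances: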
import pandas as pd
df = pd.DataFrame({'geodesic': [1, 10, 8, 11, 20,2,2],'apartment': list('aaceeee')})
df.nsmallest(3, 'geodesic')
Another way of doing this is by using K-Nearest neighbors using the geodesic distance:
SKLearn-KNN
Assuming you are using pandas dataframes, something like this should work unless you have very large arrays:
import numpy as np
from geopy.distance import geodesic
def geodesic_pd(df1, df2_row):
    # Distance in meters from df2_row to every row of df1
    dist = []
    for _, row in df1.iterrows():
        dist.append(geodesic(tuple(row.values), tuple(df2_row.values)).m)
    return np.array(dist)
SD_Apartments['coffee_count'] = SD_Apartments.apply(lambda row: sum(geodesic_pd(SD_Coffee[['latitude', 'longitude']], row[['latitude', 'longitude']]) < 300), axis =1)
The geodesic_pd function extends the geodesic calculation from a single pair of tuples to every row of a dataframe, and the next statement counts the coffee shops within 300 meters of each apartment and stores the counts in a new column.
If you have large arrays, you should combine this with KNN so that the operation only runs over a subset of points.
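If the nested Python loops become the bottleneck, one option is to compute all apartment-to-shop distances at once with NumPy broadcasting. The sketch below is an assumption-laden alternative, not the answer's method: it replaces geopy's geodesic with the haversine formula on a sphere of radius 6371 km, which is slightly less accurate, and the helpers haversine_m and count_within are hypothetical names.

```python
import numpy as np
import pandas as pd

SD_Coffee = pd.DataFrame(
    [['Insomnia', 32.784782, -117.129130],
     ['Starbucks', 32.827521, -117.139966],
     ['Dunkin', 32.778519, -117.154720]],
    columns=['name', 'latitude', 'longitude'])
SD_Apartments = pd.DataFrame(
    [['DreamAPT', 32.822090, -117.184200],
     ['OKAPT', 32.748081, -117.130691],
     ['BadAPT', 32.786886, -117.097536]],
    columns=['name', 'latitude', 'longitude'])

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in meters; inputs in degrees, broadcastable."""
    lat1, lon1, lat2, lon2 = (np.radians(v) for v in (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_m * np.arcsin(np.sqrt(a))

def count_within(apartments, shops, max_m):
    """Number of shops closer than max_m meters to each apartment."""
    dist = haversine_m(apartments['latitude'].to_numpy()[:, None],
                       apartments['longitude'].to_numpy()[:, None],
                       shops['latitude'].to_numpy()[None, :],
                       shops['longitude'].to_numpy()[None, :])
    return (dist < max_m).sum(axis=1)

SD_Apartments['coffee_count'] = count_within(SD_Apartments, SD_Coffee, 300)
print(SD_Apartments)
```

With the three sample apartments and shops, every pair is kilometers apart, so all counts are 0 at a 300 meter radius; a larger radius is needed to see non-zero counts.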
I have several hundred pandas dataframes, and the number of rows is not exactly the same in all of them: some have 600 rows while others have only 540.
I have two samples with exactly the same number of dataframes, and I want to read all the dataframes (around 2000) from both samples. This is what the data looks like, followed by how I read the files:
5113.440 1 0.25846 0.10166 27.96867 0.94852 -0.25846 268.29305 5113.434129
5074.760 3 0.68155 0.16566 120.18771 3.02654 -0.68155 101.02457 5074.745627
5083.340 2 0.74771 0.13267 105.59355 2.15700 -0.74771 157.52406 5083.337081
5088.150 1 0.28689 0.12986 39.65747 2.43339 -0.28689 164.40787 5088.141849
5090.780 1 0.61464 0.14479 94.72901 2.78712 -0.61464 132.25865 5090.773443
import os
import pandas as pd
#first Sample
path_to_files = '/home/Desktop/computed_2d_blaze/'
lst = []
for filen in [x for x in os.listdir(path_to_files) if '.ares' in x]:
    df = pd.read_table(path_to_files+filen, skiprows=0, usecols=(0,1,2,3,4,8),names=['wave','num','stlines','fwhm','EWs','MeasredWave'],delimiter=r'\s+')
    df = df.sort_values('stlines', ascending=False)
    df = df.drop_duplicates('wave')
    df = df.reset_index(drop=True)
    lst.append(df)
#second sample
path_to_files1 = '/home/Desktop/computed_1d/'
lst1 = []
for filen in [x for x in os.listdir(path_to_files1) if '.ares' in x]:
    df1 = pd.read_table(path_to_files1+filen, skiprows=0, usecols=(0,1,2,3,4,8),names=['wave','num','stlines','fwhm','EWs','MeasredWave'],delimiter=r'\s+')
    df1 = df1.sort_values('stlines', ascending=False)
    df1 = df1.drop_duplicates('wave')
    df1 = df1.reset_index(drop=True)
    lst1.append(df1)
Now the data is stored in lists, and because the dataframes do not all have the same number of rows, I can't subtract them directly.
How can I subtract them correctly? After that, I want to take the average (mean) of the residuals to make a dataframe.
You shouldn't use apply. Just use Boolean masking:
mask = df['waves'].between(lower_outlier, upper_outlier)
df[mask].plot(x='waves', y='stlines')
One solution that comes to mind is to write a function that checks each value against upper and lower bounds, and then to slice the dataframes with the resulting Boolean index, e.g.
df1 = pd.DataFrame({'wave': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'stlines': [0.1, 0.2, 0.3, 0.4, 0.5]})
def outlier(value, upper, lower):
    """
    Find outliers based on upper and lower bound
    """
    # Check if input value is within bounds
    in_bounds = (value <= upper) and (value >= lower)
    return in_bounds
# True where the wave value in DF1 lies within the bounds [1, 4]
outlier_index = df1.wave.apply(lambda x: outlier(x, 4, 1))
# Return DF2 without the rows whose wave value is out of bounds
df2[outlier_index]
# Return DF1 without the rows whose wave value is out of bounds
df1[outlier_index]
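The same selection can be sketched without apply by building the mask with Series.between, which is inclusive on both ends by default. This reuses the toy dataframes above; the names in_bounds, kept_waves and kept_stlines are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'wave': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'stlines': [0.1, 0.2, 0.3, 0.4, 0.5]})

# Boolean mask: True where 1 <= wave <= 4
in_bounds = df1['wave'].between(1, 4)

# The mask shares its index with both dataframes, so it can filter either
kept_waves = df1[in_bounds]
kept_stlines = df2[in_bounds]
print(kept_waves)
print(kept_stlines)
```

This only works across two dataframes because their indexes line up row for row; if they did not, the mask would first have to be aligned to the second dataframe's index.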
I've got two csv files: df1, which has postcodes only, and df, which has postcodes and their corresponding longitude and latitude values.
import numpy as np
from math import radians, sqrt, sin, cos, atan2
import pandas as pd
df = pd.read_csv("C:/Users///UKPostcodes.csv")
df1 = pd.read_csv("C:/Users///postcode.csv")
X = df['outcode'].values
lat = df['latitude'].values
lon = df['longitude'].values
find = df1['Postcode District'].values
longitude = []
for i in range(0, len(find)):
    for j in range(0, len(X)):
        if find[i] == X[j]:
            print(find[i])
            #longitude.append(float(lon[j]));
I'm trying to loop through both files and find the longitude and latitude for every postcode in df1. At the moment it runs an infinite loop. Any idea how I can do this for the values in my df1 file only and terminate once that limit has been reached?
edit: example of files:
df1
df
If the values in df['outcode'] and df1['Postcode District'] have the same format, I think you can use merge to add two columns (latitude and longitude) for the Postcode District column of df1, such as:
df_output = df1.merge(df, how = 'left', left_on= 'Postcode District', right_on= 'outcode')
df1 is the left DF and df is the right DF; how = 'left' means you keep all the keys from df1. left_on= 'Postcode District' and right_on= 'outcode' define the column on which the merge happens for each DF. See the pandas documentation on merge for more details.
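A small sketch of what the left merge produces, including a district that has no match. The codes and coordinates below are placeholders invented for the example, not real postcode data, and indicator=True is an optional extra that adds a _merge column for spotting unmatched rows.

```python
import pandas as pd

# Placeholder lookup table (made-up codes and coordinates)
df = pd.DataFrame({'outcode': ['AA1', 'AA2', 'BB1'],
                   'latitude': [50.0, 51.0, 52.0],
                   'longitude': [-1.0, -2.0, -3.0]})
# Districts to look up; 'ZZ9' is deliberately absent from the lookup table
df1 = pd.DataFrame({'Postcode District': ['BB1', 'AA1', 'ZZ9']})

df_output = df1.merge(df, how='left',
                      left_on='Postcode District', right_on='outcode',
                      indicator=True)

# Rows of df1 that found no partner in df have NaN coordinates
unmatched = df_output.loc[df_output['_merge'] == 'left_only',
                          'Postcode District'].tolist()
print(df_output)
print(unmatched)
```

The output keeps one row per row of df1 in its original order (as long as outcode is unique in df), so no explicit loop or stopping condition is needed.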