I am using GeoPandas and Pandas.
I have a DataFrame, df, with, say, 300,000 rows and 4 columns plus the index column:
id lat lon geometry
0 2009 40.711174 -73.99682 0
1 536 40.741444 -73.97536 0
2 228 40.754601 -73.97187 0
However, there are only a handful of unique ids (~200).
I want to generate a shapely.geometry.point.Point object for each (lat, lon) combination, similar to what is shown here: http://nbviewer.ipython.org/gist/kjordahl/7129098
(see cell #5),
where the notebook loops through all rows of the dataframe; but for such a big dataset, I wanted to limit the loop to the much smaller number of unique ids.
Therefore, for a given id value, idvalue (e.g., 2009 from the first row), I want to create the GeoSeries once and assign it directly to ALL rows that have id == idvalue.
My code looks like:
for count, iunique in enumerate(df.id.unique()):
    sc_start = GeoSeries([Point(np.array(df[df.id == iunique].lon)[0],
                                np.array(df[df.id == iunique].lat)[0])])
    df.loc[iunique, ['geometry']] = sc_start
However, this doesn't work (the geometry field does not change), and I think it is because the indexes of sc_start don't match the indexes of df.
How can I solve this? Should I just stick to looping through the whole df?
I would take the following approach:
First find the unique ids and create a GeoSeries of Points for them:
unique_ids = df.groupby('id', as_index=False).first()
unique_ids['geometry'] = GeoSeries([Point(x, y) for x, y in zip(unique_ids['lon'], unique_ids['lat'])])
Then merge these geometries with the original dataframe on matching ids:
df.merge(unique_ids[['id', 'geometry']], how='left', on='id')
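Put together, a minimal runnable sketch of this approach (the sample data below is made up; only the ~200 unique Points are built, and the merge broadcasts them):
import pandas as pd
from geopandas import GeoSeries
from shapely.geometry import Point

# made-up sample with a repeated id to show the broadcast
df = pd.DataFrame({'id':  [2009, 536, 228, 2009],
                   'lat': [40.711174, 40.741444, 40.754601, 40.711174],
                   'lon': [-73.99682, -73.97536, -73.97187, -73.99682]})

# one row per unique id, keeping the first lat/lon seen
unique_ids = df.groupby('id', as_index=False).first()
unique_ids['geometry'] = GeoSeries([Point(x, y)
                                    for x, y in zip(unique_ids['lon'], unique_ids['lat'])])

# each id's Point is broadcast to every row with that id
df = df.merge(unique_ids[['id', 'geometry']], how='left', on='id')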
I would like to speed up a loop over a python Pandas Dataframe. Unfortunately, decades of using low-level languages mean I often struggle to find prepackaged solutions. Note: the data is private, but I will see if I can fabricate something and add it in an edit if that helps.
The code has three pandas dataframes: drugUseDF and tempDF, which hold the data, and tempDrugUse, which stores what's been retrieved. I loop over every row of tempDF (there will be several million rows), retrieving the prodcode identifier from each row and then using it to retrieve the corresponding value from the use1 column in drugUseDF. I've added comments to help navigate.
This is the structure of the dataframes:
tempDF
patid eventdate consid prodcode issueseq
0 20001 21/04/2005 2728 85 0
1 25001 21/10/2000 3939 40 0
2 25001 21/02/2001 3950 37 0
drugUseDF
index prodcode ... use1 use2
0 171 479 ... diabetes NaN
1 172 9105 ... diabetes NaN
2 173 5174 ... diabetes NaN
tempDrugUse
use1
0 NaN
1 NaN
2 NaN
This is the code:
dfList = []
# if the drug dataframe contains the use1 column. Can this be improved?
if sum(drugUseDF.columns.isin(["use1"])) == 1:
    # predefine the dataframe where we will store the results, same length as the main data dataframe
    tempDrugUse = DataFrame(data=None, index=range(len(tempDF.index)), dtype=np.str, columns=["use1"])
    # go through each row of the main data dataframe
    for ind in range(len(tempDF)):
        # retrieve the prodcode from the *ind* row of the main data dataframe
        prodcodeStr = tempDF.iloc[ind]["prodcode"]
        # get the corresponding value from the use1 column matching the prodcode column
        useStr = drugUseDF[drugUseDF.loc[:, "prodcode"] == prodcodeStr]["use1"].values[0]
        # update the storing dataframe
        tempDrugUse.iloc[ind]["use1"] = useStr
    print("[DEBUG] End of loop for use1")
    dfList.append(tempDrugUse)
The order of the data matters. I can't retrieve multiple rows by matching the prodcode because each row has a date column. Retrieving multiple rows and adding them to the tempDrugUse dataframe could mean that the rows are no longer in chronological date order.
When trying to combine data in two dataframes you should use merge (similar to JOIN in SQL-like languages). Performance-wise, you should never loop over the rows; use the pandas built-in methods whenever possible. Ordering can be achieved with the sort_values method.
If I understand you correctly, you want to map the prodcode between the two tables. You can do this via pd.merge (please note the example in the code below differs from your data):
tempDF = pd.DataFrame({'patid': [20001, 25001, 25001],
                       'prodcode': [101, 102, 103]})
drugUseDF = pd.DataFrame({'prodcode': [101, 102, 103],
                          'use1': ['diabetes', 'hypertonia', 'gout']})
merged_df = pd.merge(tempDF, drugUseDF, on='prodcode', how='left')
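On the ordering concern: with how='left' and one drugUseDF row per prodcode, the merged frame keeps tempDF's row order, so chronology is not disturbed. If you still want to sort explicitly, a sketch (the eventdate column is from your question, not the toy frames above; dates assumed day-first):
# hypothetical: parse the question's eventdate column, then sort per patient
merged_df['eventdate'] = pd.to_datetime(merged_df['eventdate'], dayfirst=True)
merged_df = merged_df.sort_values(['patid', 'eventdate']).reset_index(drop=True)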
I have an issue where I want to group by a date column, sort by a time column, and grab the resulting values in the values column.
The data looks something like this:
time value date
0 12.850000 19.195359 08-22-2019
1 9.733333 13.519543 09-19-2019
2 14.083333 9.191413 08-26-2019
3 16.616667 18.346598 08-19-2019
...
Where every date can occur multiple times, recording values at different points
during the day.
I wanted to group by date, and extract the minimum and maximum values of those groupings so I did this:
dayMin = df.groupby('date').value.min()
which gives me a Series object that is fairly easy to manipulate. The issue
comes up when I want to group by 'date', sort by 'time', then grab the 'value'.
What I did was:
dayOpen = df.groupby('date').apply(lambda g: g[g.time == g.time.min()])['value']
which almost worked, resulting in a DataFrame of:
date
08-19-2019 13344 17.573522
08-20-2019 12798 19.496609
08-21-2019 2009 20.033917
08-22-2019 5231 19.393700
08-23-2019 12848 17.784213
08-26-2019 417 9.717627
08-27-2019 6318 7.630234
I figured out how to clean up those nasty indexes on the left, name the column, and even concat with my dayMin Series to achieve my goal.
Ultimately my question is if there is a nicer way to perform these data manipulations that follow the general pattern of: "Group by column A, perform filtering or sorting operation on column B, grab resulting values from column C" for future applications.
Thank you in advance :)
You can sort the data frame before calling groupby:
first_of_day = df.sort_values('time').groupby('date').head(1)
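If you only want the opening value keyed by date (to line up with your dayMin Series), a small follow-up sketch:
# reduce the first-row-per-date frame to a date-indexed Series of values
day_open = df.sort_values('time').groupby('date').head(1).set_index('date')['value']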
This should work for you:
df.sort_values('time').groupby(['date'])['value'].agg([('Min' , 'min'), ('Max', 'max')])
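Combining both ideas gives the general "group by column A, sort by column B, grab column C" pattern you asked about. A self-contained sketch (numbers rounded from your sample): sorting once up front means 'first' inside the groupby picks the value at the earliest time of each day.
import pandas as pd

df = pd.DataFrame({'time':  [12.85, 9.73, 14.08, 16.62],
                   'value': [19.20, 13.52, 9.19, 18.35],
                   'date':  ['08-22-2019', '09-19-2019', '08-26-2019', '08-19-2019']})

# sort by B (time), group by A (date), then aggregate C (value);
# groupby preserves row order within each group, so 'first' is the opening value
summary = (df.sort_values('time')
             .groupby('date')['value']
             .agg([('Open', 'first'), ('Min', 'min'), ('Max', 'max')]))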
I have a dataframe df1 like below (lat_long values can be duplicated):
miles uid lat_long
12 235 (45,67)
13 234 (41.09,67)
14 233 (34,55)
15 236 (12.23,65.78)
16 239 (27,34)
I want to remove entries from df1 whose lat_long value is invalid. I am doing it like below, but it takes too much time:
all_lat_long = df1["lat_long"].tolist()  # list of tuples

def lat_long_check(each_coordnts):
    # drop the row if the lat-long does not match the expected "(float,float)" pattern
    match = re.match(r'^\((?P<lat>-?\d*(\.\d+)),(?P<long>-?\d*(\.\d+))\)$',
                     str(each_coordnts))
    if match is None:
        idx = df1[df1['lat_long'] == each_coordnts].index
        df1.drop(idx, inplace=True)

for each_coordnts in all_lat_long:
    lat_long_check(each_coordnts)
Is there any efficient way to do this for 1M records? Once the wrong lat-long entries are removed, I want to populate two new columns at the end of df1, "Latitude" and "Longitude", with the corresponding values.
I would proceed as follows:
Define a function validate_lat_long that returns a tuple of floats if the latitude/longitude values are correct. I assume this has to do with checking that the values are within expected intervals (-90 to 90 for latitude, etc). The function should return np.nan if the values are not correct.
Create a new column with correct values as follows:
df1["validated_lat_long"] = df1["lat_long"].apply(validate_lat_long)
Finally, in order to remove invalid values, use dropna on the new column and possibly make a new dataframe if you need to preserve the previous work:
new_df = df1.dropna(subset=["validated_lat_long"])
Your code is most probably slow because it iterates over dataframe rows. Applying a function with df.apply() should speed things up considerably. Also, I hope you can check floats instead of searching with a regex.
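A minimal sketch of such a validate_lat_long (range checks only; it assumes lat_long holds 2-tuples, per the question), followed by populating the two new columns:
import numpy as np
import pandas as pd

def validate_lat_long(pair):
    # return (lat, lon) as floats if within valid ranges, else np.nan
    try:
        lat, lon = float(pair[0]), float(pair[1])
    except (TypeError, ValueError, IndexError):
        return np.nan
    if -90 <= lat <= 90 and -180 <= lon <= 180:
        return (lat, lon)
    return np.nan

df1["validated_lat_long"] = df1["lat_long"].apply(validate_lat_long)
new_df = df1.dropna(subset=["validated_lat_long"]).copy()

# populate the "Latitude" and "Longitude" columns the question asks for
new_df[["Latitude", "Longitude"]] = pd.DataFrame(
    new_df["validated_lat_long"].tolist(), index=new_df.index)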
I am trying to aggregate a dataframe based on values found in two columns: rows that have some value X in either column A or column B should be aggregated together.
More concretely, I am trying to do something like this. Let's say I have a dataframe gameStats:
awayTeam homeTeam awayGoals homeGoals
Chelsea Barca 1 2
R. Madrid Barca 2 5
Barca Valencia 2 2
Barca Sevilla 1 0
... and so on
I want to construct a dataframe such that among my rows I would have something like:
team goalsFor goalsAgainst
Barca 10 5
One obvious solution, since the set of unique elements is small, is something like this:
for team in teamList:
    aggregateDf = gameStats[(gameStats['homeTeam'] == team) | (gameStats['awayTeam'] == team)]
    # do other manipulations of the data then append it to a final dataframe
However, going through a loop seems less elegant. And since I have had this problem before with many unique identifiers, I was wondering if there was a way to do this without using a loop as that seems very inefficient to me.
The solution has two steps: first compute the goals for each team when they are home and when they are away, then combine them. Something like:
goals_when_away = gameStats.groupby('awayTeam')[['awayGoals', 'homeGoals']].agg('sum').reset_index().sort_values('awayTeam')
goals_when_home = gameStats.groupby('homeTeam')[['homeGoals', 'awayGoals']].agg('sum').reset_index().sort_values('homeTeam')
then combine them
np_result = goals_when_away.iloc[:, 1:].values + goals_when_home.iloc[:, 1:].values
pd_result = pd.DataFrame(np_result, columns=['goal_for', 'goal_against'])
result = pd.concat([goals_when_away.iloc[:, :1], pd_result], axis=1, ignore_index=True)
Note the use of .values when summing (to get the result as a numpy array) and ignore_index=True in the concat; these avoid the pandas trap of aligning on column and index names.
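One caveat: the positional addition assumes every team appears both home and away, in the same sorted order; in the sample above, Chelsea and Valencia do not. A sketch of a rename-and-concat variant that drops that assumption:
# stack home and away appearances under a common schema, then one groupby
home = gameStats.rename(columns={'homeTeam': 'team',
                                 'homeGoals': 'goalsFor',
                                 'awayGoals': 'goalsAgainst'})
away = gameStats.rename(columns={'awayTeam': 'team',
                                 'awayGoals': 'goalsFor',
                                 'homeGoals': 'goalsAgainst'})
cols = ['team', 'goalsFor', 'goalsAgainst']
result = pd.concat([home[cols], away[cols]]).groupby('team', as_index=False).sum()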
I'm learning python and am currently trying to parse out the longitude and latitude from a "Location" column and assign them to the 'lat' and 'lon' columns. I currently have the following code:
def getlatlong(cell):
    dd['lat'] = cell.split('\n')[2].split(',')[0][1:]
    dd['lon'] = cell.split('\n')[2].split(',')[1][1:-1]

dd['Location'] = dd['Location'].apply(getlatlong)
dd.head()
The splitting portion of the code works. The problem is that this code copies the lat and lon from the last cell of the dataframe into all of the 'lat' and 'lon' rows. I want it to split the current row, assign that row's 'lat' and 'lon' values, and then do the same for every subsequent row.
I get that assigning dd['lat'] to the split value assigns it to the whole column, but I don't know how to assign to just the row currently being iterated over.
Data sample upon request:
Index,Location
0,"1554 FIRST ST\nLIVERMORE, CA 94550\n(37.67930642, -121.7765857)"
1,"1554 FIRST ST\nLIVERMORE, CA 94550\n(37.67931141, -121.7765988)"
2,"138 14TH ST\nOAKLAND, CA 94612\n(37.80140803, -122.26369831)"
3,"4014 MACARTHUR BLVD\nOAKLAND, CA 94619\n(37.78968061, -122.19690846)"
4,"4014 MACARTHUR BLVD\nOAKLAND, CA 94619\n(37.78968557, -122.19692165)"
Please see my approach below. It is based on creating a DataFrame with lat and lon columns and then adding it to the existing dataframe.
def getlatlong(x):
    return pd.Series([x.split('\n')[2].split(',')[0][1:],
                      x.split('\n')[2].split(',')[1][1:-1]],
                     index=["lat", "lon"])

df = pd.concat((df, df.Location.apply(getlatlong)), axis=1)
This describes another technique you can use to get the answer, but it isn't the exact code you need. If you add sample data I can tailor it.
Using pandas's built-in str methods you can save yourself some headache, as follows:
temp_df = df['Location'].str.split().apply(pd.Series)
The above splits the Location column on spaces, and then turns the split values into columns. You can then assign just the Latitude and Longitude columns to the original df.
df[['Latitude', 'Longitude']] = temp_df[[<selection1>, <selection2>]]
str.split() also has an expand parameter so that you can write .str.split("char", expand=True) to spread out the columns without the apply.
Update
Given your example, this works for your specific case:
df = pd.DataFrame({"Location": ["1554 FIRST ST\nLIVERMORE, CA 94550\n(37.67930642, -121.7765857)"]})
df[["Latitude", "Longitude"]] = (df['Location']
.str.split('\n')
.apply(pd.Series)[2] # Column 2 has the str (lat, long)
.str[1:-1] # Strip the ()
.str.split(",", expand=True) # Expand latitude and longitude into two columns
.astype(float)) # Make sure latitude and longitude are floats
Out:
Location Latitude Longitude
0 1554 FIRST ST\nLIVERMORE, CA 94550\n(37.679306... 37.679306 -121.776586
Update #2
@Abhishek Mishra's answer is faster (it takes only 55% of the time, since it goes through the data fewer times). Worth noting that the output from that example has strings in each column, so you might want to modify it to get the values back to floats.
for ind, row in dd.iterrows():
    dd['lat'].loc[ind] = dd['Location'].loc[ind].split(',')[0][1:]
    dd['lon'].loc[ind] = dd['Location'].loc[ind].split(',')[1][1:-1]
PS: iterrows() is slow.
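For instance, a one-line sketch of that float conversion (assuming the loop above has filled dd['lat'] and dd['lon'] with strings):
# convert the parsed string columns back to floats
dd[['lat', 'lon']] = dd[['lat', 'lon']].astype(float)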