Selecting rows from pandas dataframe based on cKDTree indices - python

I was trying to do some quick-and-dirty reverse geocoding.
I have the dataframe poi (around 50,000 rows), where each point of interest has a lat/lng coordinate.
I also have the dataframe postcode_existing (around 180,000 rows), which maps lat/lng coordinates to postcodes.
I pulled out the relevant coordinate columns and used cKDTree to determine, for each point of interest in poi, the nearest lat/lng coordinate in postcode_existing.
import pandas as pd
import numpy as np
from scipy.spatial import cKDTree
# read poi and postcode csv files
# Extract subset
postcode_existing_coordinates = postcode_existing[['Latitude', 'Longitude']]
# Extract subset
poi_coordinates = poi[['Latitude', 'Longitude']]
# Construct tree
tree = cKDTree(postcode_existing_coordinates)
# Query
distances, indices = tree.query(poi_coordinates)
I end up with the relevant indices. I am now looking to select the rows from the dataframe postcode_existing using those indices.
I tried postcode_existing.ix[indices], but this does not seem to select the correct rows.
For example:
>>> postcode_existing.ix[indices].head()
Postcode Latitude Longitude Easting Northing GridRef \
78579 HA3 0NS 51.57553 -0.304296 517605.0 187658.0 TQ176876
178499 NaN NaN NaN NaN NaN NaN
62392 NaN NaN NaN NaN NaN NaN
78662 HA3 0TA 51.58409 -0.288764 518659.0 188635.0 TQ186886
79470 NaN NaN NaN NaN NaN NaN
County District Ward DistrictCode ... Terminated \
78579 Greater London Brent Kenton E09000005 ... NaN
178499 NaN NaN NaN NaN ... NaN
62392 NaN NaN NaN NaN ... NaN
78662 Greater London Brent Kenton E09000005 ... NaN
79470 NaN NaN NaN NaN ... NaN
Parish NationalPark Population Households Built up area \
78579 NaN NaN 72.0 25.0 Greater London
178499 NaN NaN NaN NaN NaN
62392 NaN NaN NaN NaN NaN
78662 NaN NaN 152.0 39.0 Greater London
79470 NaN NaN NaN NaN NaN
Built up sub-division Lower layer super output area \
78579 Brent Brent 004D
178499 NaN NaN
62392 NaN NaN
78662 Brent Brent 003E
79470 NaN NaN
Rural/urban Region
78579 Urban major conurbation London
178499 NaN NaN
62392 NaN NaN
78662 Urban major conurbation London
79470 NaN NaN
[5 rows x 25 columns]
But:
>>> postcode_existing.iloc[78579]
Postcode NW1 3AU
Latitude 51.5237
Longitude -0.143188
Easting 528915
Northing 182163
GridRef TQ289821
County Greater London
District Westminster
Ward Marylebone High Street
DistrictCode E09000033
WardCode E05000641
Country England
CountyCode E11000009
Constituency Cities of London and Westminster
Introduced 1980-01-01
Terminated NaN
Parish NaN
NationalPark NaN
Population 7
Households 1
Built up area Greater London
Built up sub-division City of Westminster
Lower layer super output area Westminster 013A
Rural/urban Urban major conurbation
Region London
Name: 133733, dtype: object
Also:
>>> postcode_existing.iloc[178499]
Postcode WC1E 6JL
Latitude 51.5236
Longitude -0.135522
Easting 529447
Northing 182168
GridRef TQ294821
County Greater London
District Camden
Ward Bloomsbury
DistrictCode E09000007
WardCode E05000129
Country England
CountyCode E11000009
Constituency Holborn and St Pancras
Introduced 1980-01-01
Terminated NaN
Parish NaN
NationalPark NaN
Population 1
Households 1
Built up area Greater London
Built up sub-division Camden
Lower layer super output area Camden 026D
Rural/urban Urban major conurbation
Region London
Name: 307029, dtype: object
These appear to be correct.
Why does postcode_existing.ix[indices] not select the correct rows? What should I be using instead?

I solved the problem. The issue was a mismatch between row positions in the dataframe and the index labels, caused by the removal of certain rows.
To fix this, I simply reset the index:
postcode_existing.reset_index(inplace=True, drop=True)
I was then able to use loc to extract the relevant rows:
postcode_existing.loc[indices]
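Alternatively, here is a minimal sketch built on the dataframes from the question: since tree.query returns positional indices, iloc can select the matching rows without resetting postcode_existing's index, and the nearest postcode can then be attached back to poi:
# positional lookup of the nearest postcode rows, in the same order as poi
nearest = postcode_existing.iloc[indices].reset_index(drop=True)
# attach the matched postcode to each point of interest
poi_reverse_geocoded = poi.reset_index(drop=True).assign(Postcode=nearest['Postcode'])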

The problem is that your index is integer-based. That makes things ambiguous, because pandas has to keep track of both positional locations and labels, and ix tries to guess which one you mean. When an axis has an integer index, ix only does label-based lookup, so it interprets your cKDTree indices, which are positions, as index labels; labels that no longer exist in the index come back as rows of NaN. In this case, use iloc to index by position (or reset the index first, as the answer above does, so that labels and positions coincide). Note that in current pandas ix has been removed entirely, so loc and iloc are the only options anyway.
Documentation
DataFrame.ix
A primarily label-location based indexer, with integer position fallback.
.ix[] supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type.
.ix is the most general indexer and will support any of the inputs in .loc and .iloc. .ix also supports floating point label schemes. .ix is exceptionally useful when dealing with mixed positional and label based hierarchical indexes.
However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it’s usually better to be explicit and use .iloc or .loc.
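A small illustration of the label/position distinction on a gapped integer index (made-up data, not the question's):
import pandas as pd
df = pd.DataFrame({'x': [10, 20, 30]}, index=[0, 5, 7])
df.loc[5]   # label 5    -> the row where x == 20
df.iloc[2]  # position 2 -> the row where x == 30
df.loc[2]   # KeyError: 2 is not a label in this index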

Related

Merge two rows pandas dataframe

I have this data, and I need to merge the two selected columns into the other row, because the duplicated rows come from my code.
So, how could I do this?
Here is a way to do what your question asks:
df[['State_new', 'Solution_new']] = df[['Power State', 'Recommended Solution']].shift()
mask = ~df['State_new'].isna()
df.loc[mask, 'State'] = df.loc[mask, 'State_new']
df.loc[mask, 'Recommended Solutuin'] = df.loc[mask, 'Solution_new']
df = df.drop(columns=['State_new', 'Solution_new', 'Power State', 'Recommended Solution'])[~df['State'].isna()].reset_index(drop=True)
Explanation:
create versions of the important data from your code shifted down by one row
create a boolean mask indicating which of these shifted rows are not empty
use this mask to overwrite the content of the State and Recommended Solutuin columns (NOTE: using original column labels verbatim from OP's question) with the updated data from your code contained in the shifted columns
drop the columns used to perform the update as they are no longer needed
use reset_index to create a new integer range index without gaps.
In case it's helpful, here is sample code to pull the dataframe in from Excel:
import pandas as pd
df = pd.read_excel('TestBook.xlsx', sheet_name='TestSheet', usecols='AD:AM')
Here's the input dataframe:
MAC RLC RLC 2 PDCCH Down PDCCH Uplink Unnamed: 34 Recommended Solutuin State Power State Recommended Solution
0 122.9822 7119.503 125.7017 1186.507 784.9464 NaN Downtitlt antenna serving cell is overshooting NaN NaN
1 4.1000 7119.503 24.0000 11.000 51.0000 NaN Downtitlt antenna serving cell is overshooting NaN NaN
2 121.8900 2127.740 101.3300 1621.000 822.0000 NaN uptilt antenna bad coverage NaN NaN
3 86.5800 2085.250 94.6400 1650.000 880.0000 NaN uptilt antenna bad coverage NaN NaN
4 64.7500 1873.540 63.8600 1259.000 841.0000 NaN uptilt antenna bad coverage NaN NaN
5 84.8700 1735.070 60.3800 1423.000 474.0000 NaN uptilt antenna bad coverage NaN NaN
6 49.3400 1276.190 59.9600 1372.000 450.0000 NaN uptilt antenna bad coverage NaN NaN
7 135.0200 2359.840 164.1300 1224.000 704.0000 NaN NaN NaN Bad Power Check hardware etc.
8 135.0200 2359.840 164.1300 1224.000 704.0000 NaN uptilt antenna bad coverage NaN NaN
9 163.7200 1893.940 90.0300 1244.000 753.0000 NaN NaN NaN Bad Power Check hardware etc.
10 163.7200 1893.940 90.0300 1244.000 753.0000 NaN uptilt antenna bad coverage NaN NaN
11 129.6400 1163.140 154.3200 663.000 798.0000 NaN NaN NaN Bad Power Check hardware etc.
12 129.6400 1163.140 154.3200 663.000 798.0000 NaN uptilt antenna bad coverage NaN NaN
Here is the sample output:
MAC RLC RLC 2 PDCCH Down PDCCH Uplink Unnamed: 34 Recommended Solutuin State
0 122.9822 7119.503 125.7017 1186.507 784.9464 NaN Downtitlt antenna serving cell is overshooting
1 4.1000 7119.503 24.0000 11.000 51.0000 NaN Downtitlt antenna serving cell is overshooting
2 121.8900 2127.740 101.3300 1621.000 822.0000 NaN uptilt antenna bad coverage
3 86.5800 2085.250 94.6400 1650.000 880.0000 NaN uptilt antenna bad coverage
4 64.7500 1873.540 63.8600 1259.000 841.0000 NaN uptilt antenna bad coverage
5 84.8700 1735.070 60.3800 1423.000 474.0000 NaN uptilt antenna bad coverage
6 49.3400 1276.190 59.9600 1372.000 450.0000 NaN uptilt antenna bad coverage
7 135.0200 2359.840 164.1300 1224.000 704.0000 NaN Check hardware etc. Bad Power
8 163.7200 1893.940 90.0300 1244.000 753.0000 NaN Check hardware etc. Bad Power
9 129.6400 1163.140 154.3200 663.000 798.0000 NaN Check hardware etc. Bad Power
You can use groupby to combine the duplicated rows: group by the columns that the two rows share and aggregate the rest (column names below are taken from the sample data; adjust them to your actual dataframe):
df = pd.DataFrame(data)
new_df = df.groupby(['MAC', 'RLC', 'RLC 2', 'PDCCH Down', 'PDCCH Uplink']).sum()
new_df = new_df.reset_index()
You can do something like this (adjust the column names to match your dataframe):
import numpy as np
fill_cols = ['Power State', 'Recommended Solution 2']
dup_cols = ['MAC_UL', 'RLC_Through_1', 'RLC_Through_2', 'PDCCH Down', 'PDCCH Up']
m = df.duplicated(subset=dup_cols, keep=False)
df_fill = df.loc[m, fill_cols].replace('', np.nan)
df.loc[m, fill_cols] = df_fill.ffill()
Get the duplicated rows with duplicated
Replace the empty values with NaN
Then use ffill to copy each value down onto its matching duplicate row
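To end up with a single combined row per duplicated pair, as in the desired output, a possible follow-up (assuming the same dup_cols as above) is to keep only the last row of each pair after the forward fill:
df = df.drop_duplicates(subset=dup_cols, keep='last').reset_index(drop=True)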

Create dataframe pandas from dict where values are list of tuples and each column name is unique

I have two lists that I use to create a dictionary, where list1 has text data and list2 is a list of (text, float) tuples. I use these two lists to create a dictionary, and the goal is to build a dataframe where each row label comes from list1, each column is named after a unique text term from the first tuple element, and the cells hold the float values that connect them.
For example, here's the dictionary with keys {be, associate, induce, represent} and values such as [('prove', 0.583171546459198), ('serve', 0.4951282739639282), ...]:
{'be': [('prove', 0.583171546459198), ('serve', 0.4951282739639282), ('render', 0.4826732873916626), ('represent', 0.47748714685440063), ('lead', 0.47725602984428406), ('replace', 0.4695377051830292), ('contribute', 0.4529820680618286)],
'associate': [('interact', 0.8237789273262024), ('colocalize', 0.6831706762313843)],
'induce': [('suppress', 0.8159114718437195), ('provoke', 0.7866303324699402), ('elicit', 0.7509980201721191), ('inhibit', 0.7498961687088013), ('potentiate', 0.742023229598999), ('produce', 0.7384929656982422), ('attenuate', 0.7352016568183899), ('abrogate', 0.7260081768035889), ('trigger', 0.717864990234375), ('stimulate', 0.7136563658714294)],
'represent': [('prove', 0.6612186431884766), ('evoke', 0.6591314673423767), ('up-regulate', 0.6582908034324646), ('synergize', 0.6541063785552979), ('activate', 0.6512928009033203), ('mediate', 0.6494284272193909)]}
Desired Output
prove serve render represent
be 0.58 0.49 0.48 0.47
associate 0 0 0 0
induce 0.45 0.58 0.9 0.7
represent 0.66 0 0 1
What trips me up is that the verb prove can be found under more than one key (i.e. for the key be the score is 0.58, and for the key represent the score is 0.66).
If I use df = pd.DataFrame.from_dict(d, orient='index'), then the verb prove will appear as a column name twice, whereas I want each term to appear as a column only once.
Can someone help?
With the dictionary that you provided (as d), you can't use from_dict directly.
You either need to rework the dictionary to have elements as dictionaries:
pd.DataFrame.from_dict({k: dict(v) for k,v in d.items()}, orient='index')
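If you want zeros instead of NaN for the missing verb/score pairs, as in your desired output, you can chain fillna onto that, for example:
pd.DataFrame.from_dict({k: dict(v) for k, v in d.items()}, orient='index').fillna(0)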
Or you need to read it as a Series and to reshape:
(pd.Series(d).explode()
   .apply(pd.Series)
   .set_index(0, append=True)[1]
   .unstack(fill_value=0)
)
output:
prove serve render represent lead replace \
be 0.583172 0.495128 0.482673 0.477487 0.477256 0.469538
represent 0.661219 NaN NaN NaN NaN NaN
associate NaN NaN NaN NaN NaN NaN
induce NaN NaN NaN NaN NaN NaN
contribute interact colocalize suppress ... produce \
be 0.452982 NaN NaN NaN ... NaN
represent NaN NaN NaN NaN ... NaN
associate NaN 0.823779 0.683171 NaN ... NaN
induce NaN NaN NaN 0.815911 ... 0.738493
attenuate abrogate trigger stimulate evoke up-regulate \
be NaN NaN NaN NaN NaN NaN
represent NaN NaN NaN NaN 0.659131 0.658291
associate NaN NaN NaN NaN NaN NaN
induce 0.735202 0.726008 0.717865 0.713656 NaN NaN
synergize activate mediate
be NaN NaN NaN
represent 0.654106 0.651293 0.649428
associate NaN NaN NaN
induce NaN NaN NaN
[4 rows x 24 columns]

How to remove NaN values from corr() function output

EDITED TO SHOW EXAMPLE OF ORIGINAL DATAFRAME:
df.head(4)
shop category subcategory season
date
2013-09-04 abc weddings shoes winter
2013-09-04 def jewelry watches summer
2013-09-05 ghi sports sneakers spring
2013-09-05 jkl jewelry necklaces fall
I've successfully generated the following dataframe using get_dummies():
wedding_seasons = pd.get_dummies(df.loc[df['category']=='weddings',['category','season']],prefix = '', prefix_sep = '' )
wedding_seasons.head(3)
weddings winter summer spring fall
71654 1.0 0.0 1.0 0.0 0.0
72168 1.0 0.0 1.0 0.0 0.0
72080 1.0 0.0 1.0 0.0 0.0
The goal of the above is to help assess frequency of weddings across seasons, so I've used corr() to generate the following result:
weddings fall spring summer winter
weddings NaN NaN NaN NaN NaN
fall NaN 1.000000 0.054019 -0.331866 -0.012122
spring NaN 0.054019 1.000000 -0.857205 0.072420
summer NaN -0.331866 -0.857205 1.000000 -0.484578
winter NaN -0.012122 0.072420 -0.484578 1.000000
I'm unsure why the weddings column is generating NaN values, but my gut feeling is that it originates from how I originally created wedding_seasons. Any guidance would be greatly appreciated so that I can properly assess the column correlations.
I don't think what you're interested in seeing here is the "correlation".
All of the columns in the dataframe wedding_seasons contain floating point values; however, if my suspicions are correct, the rows in your original dataframe df contain something like transaction records, where each row corresponds to an individual.
Please tell me if I'm incorrect, but I'll proceed with my reasoning.
Correlation measures, intuitively, the tendency for values to vary together or against each other within the same observation (e.g. if X and Y are negatively correlated, then when we see X above its mean, we'd expect Y to be below its mean).
However, what you have here is data where, if one transaction is summer, then categorically it cannot be winter at the same time. When you create wedding_seasons, pandas creates dummy variables that are treated as floating point values when computing the correlation matrix; since no row can have a 1.0 in two season columns at the same time, the season columns are bound to be negatively correlated with one another.
As for the NaN values: the weddings column is constant (it is 1.0 for every row once you filter on category == 'weddings'), so it has zero variance, and correlation with a zero-variance column is undefined, which pandas reports as NaN. You could simply drop that column before calling corr():
wedding_seasons.drop(columns=['weddings']).corr()
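A minimal illustration of the zero-variance behaviour, using made-up data rather than the dataframe above:
import pandas as pd
demo = pd.DataFrame({'weddings': [1.0, 1.0, 1.0, 1.0], 'winter': [1.0, 0.0, 1.0, 0.0]})
print(demo.corr())  # the 'weddings' row and column are all NaN because that column never varies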

How to transform several lists of words to a pandas dataframe?

I have file .txt that contains a list of words like this:
5.91686268506 exclusively, catering, provides, arms, georgia, formal, purchase, choose
5.91560417296 hugh, senlis
5.91527936181 italians
5.91470429433 soil, cultivation, fertile
5.91468087491 increases, moderation
....
5.91440227412 farmers, descendants
I would like to transform this data into a pandas table that I intend to display in an HTML/Bootstrap template, as follows (*):
COL_A COL_B
5.91686268506 exclusively, catering, provides, arms, georgia, formal, purchase, choose
5.91560417296 hugh, senlis
5.91527936181 italians
5.91470429433 soil, cultivation, fertile
5.91468087491 increases, moderation
....
5.91440227412 farmers, descendants
So I tried the following with pandas:
import pandas as pd
df = pd.read_csv('file.csv', sep=' ', names=['Col_A', 'Col_B'])
df.head(20)
However, my table doesn't have the desired structure shown above:
COL_A COL_B
6.281426 engaged, chance, makes, meeting, nations, things, believe, tries, believing, knocked, admits, awkward
6.277438 sweden NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6.271190 artificial, ammonium NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6.259790 boats, prefix NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6.230612 targets, tactical, wing, missile, squadrons NaN NaN NaN NaN NaN NaN NaN
Any idea of how to get the data as the (*) tabular format?
That is because you have spaces between the words, so if you specify a space as the delimiter, it will naturally split on all of them. To get what you need, you can set sep to the regular expression (?<!,) . Here ?<! is negative look-behind syntax, meaning "split on a space only when it is not preceded by a comma", which should work for your case. Note that a regex separator requires the Python parsing engine (pandas falls back to it automatically, with a warning, unless you pass engine='python'):
pd.read_csv("~/test.csv", sep="(?<!,) ", names=['weight', 'topics'], engine='python')
# weight topics
#0 5.916863 exclusively, catering, provides, arms, georgia...
#1 5.915604 hugh, senlis
#2 5.915279 italians
#3 5.914704 soil, cultivation, fertile
#4 5.914681 increases, moderation
#5 5.914402 farmers, descendants

Python pandas - value_counts not working properly

Based on this post on Stack Overflow, I tried the value_counts approach like this:
df2 = df1.join(df1.genres.str.split(",").apply(pd.value_counts).fillna(0))
and it works, except that my data has 22 unique genres while after the split I get 42 values, which of course are not unique.
Data example:
Action Adventure Casual Design & Illustration Early Access Education Free to Play Indie Massively Multiplayer Photo Editing RPG Racing Simulation Software Training Sports Strategy Utilities Video Production Web Publishing Accounting Action Adventure Animation & Modeling Audio Production Casual Design & Illustration Early Access Education Free to Play Indie Massively Multiplayer Photo Editing RPG Racing Simulation Software Training Sports Strategy Utilities Video Production Web Publishing nan
0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1.0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
(i have pasted the head and the first row only)
I have a feeling that the problem is caused by my original data. My column (genres) was a list of lists which contained brackets,
for example: [Action, Indie]
so when Python read it, it treated [Action, Action and Action] as different values, and the output was 303 different values.
So what I did is this:
new = []
for i in df1['genres'].tolist():
    if str(i) != 'nan':
        i = i[1:-1]
        new.append(i)
    else:
        new.append('nan')
You have to remove the leading and trailing [] from the genres column with str.strip, and then remove the spaces with str.replace:
import pandas as pd
df = pd.read_csv('test/Copy of AppCrawler.csv', sep="\t")
df['genres'] = df['genres'].str.strip('[]')
df['genres'] = df['genres'].str.replace(' ', '')
df = df.join(df.genres.str.split(",").apply(pd.value_counts).fillna(0))
# temporarily display 30 rows and 60 columns
with pd.option_context('display.max_rows', 30, 'display.max_columns', 60):
    print df
# (output removed for clarity)
print df.columns
Index([u'Unnamed: 0', u'appid', u'currency', u'final_price', u'genres',
u'initial_price', u'is_free', u'metacritic', u'release_date',
u'Accounting', u'Action', u'Adventure', u'Animation&Modeling',
u'AudioProduction', u'Casual', u'Design&Illustration', u'EarlyAccess',
u'Education', u'FreetoPlay', u'Indie', u'MassivelyMultiplayer',
u'PhotoEditing', u'RPG', u'Racing', u'Simulation', u'SoftwareTraining',
u'Sports', u'Strategy', u'Utilities', u'VideoProduction',
u'WebPublishing'],
dtype='object')
