Sorting and ranking subsets of complex data - python

I have a large and complex GIS data file on road accidents in "cities" within "counties". Rows represent roads. Columns provide "City", "County" and "Sum of accidents in city". A city thus contains several roads (with the accident sum repeated on each of them), and a county contains several cities.
For each "County", I now want to rank cities by their number of accidents, so that within each "County" the city with the most accidents is ranked "1" and cities with fewer accidents get ranks "2", "3", and so on. This rank value shall be written into the original data file.
My original approach was to:
1. Sort the data by "County_ID" and "Accidents" (descending)
2. then calculate, for each row n:
if (County in row n+1 == County in row n) and (Accidents in row n+1 == Accidents in row n):
    return the rank of row n        ## maintain the same rank for cities within the county
elif (County in row n+1 == County in row n) and (Accidents in row n+1 < Accidents in row n):
    return the rank of row n + 1    ## increasing rank value within the county
elif (County in row n+1 != County in row n):
    return 1                        ## new county, i.e. start ranking from 1
else:
    return 0                        ## error
However, I could not figure out how to code this properly, and maybe this approach is not appropriate anyway; maybe a loop would do the trick?
Any recommendations?

I suggest using the Python Pandas module.
Fictitious Data
Create data with columns county, accidents, and city. You would use pandas.read_csv to load your actual data.
import pandas as pd
df = pd.DataFrame([
['a', 1, 'A'],
['a', 2, 'B'],
['a', 5, 'C'],
['b', 5, 'D'],
['b', 5, 'E'],
['b', 6, 'F'],
['b', 8, 'G'],
['c', 2, 'H'],
['c', 2, 'I'],
['c', 7, 'J'],
['c', 7, 'K']
], columns = ['county', 'accidents', 'city'])
Resultant Dataframe
df:
county accidents city
0 a 1 A
1 a 2 B
2 a 5 C
3 b 5 D
4 b 5 E
5 b 6 F
6 b 8 G
7 c 2 H
8 c 2 I
9 c 7 J
10 c 7 K
Group data rows by county, and rank rows within group by accidents
Ranking Code
# ascending=False causes cities with the most accidents to be ranked 1
df["rank"] = df.groupby("county")["accidents"].rank("dense", ascending=False)
Result
df:
county accidents city rank
0 a 1 A 3.0
1 a 2 B 2.0
2 a 5 C 1.0
3 b 5 D 3.0
4 b 5 E 3.0
5 b 6 F 2.0
6 b 8 G 1.0
7 c 2 H 2.0
8 c 2 I 2.0
9 c 7 J 1.0
10 c 7 K 1.0
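Note that rank returns float values. If you would rather store integer ranks, a one-line cast (not part of the original code) should work, since there are no missing ranks here:
df["rank"] = df["rank"].astype(int)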

I think the approach by @DarrylG is correct, but it doesn't consider that the environment is ArcGIS.
Since you tagged your question with Python, I came up with a workflow utilizing Pandas. There are other ways to do the same, using ArcGIS tools and/or the Field Calculator.
import arcpy  # only needed if you run this script outside ArcGIS
import pandas as pd

# change this to your actual shapefile, you might have to include a path
filename = "road_accidents"
sFields = ['County', 'City', 'SumOfAccidents']  # consider this to be your columns

# read everything in your file into a Pandas DataFrame with a SearchCursor
with arcpy.da.SearchCursor(filename, sFields) as sCursor:
    df = pd.DataFrame(data=[row for row in sCursor], columns=sFields)
df = df.drop_duplicates()  # since each row represents a street, we can remove duplicates

# we use this code from DarrylG to calculate a rank (most accidents = rank 1)
df['Rank'] = df.groupby('County')['SumOfAccidents'].rank('dense', ascending=False)

# set a multi-index, since there might be duplicate city names across counties
df = df.set_index(['County', 'City'])
dct = df.to_dict()  # convert the dataframe into a dictionary; inner keys are (County, City) tuples

# add a field to your shapefile
arcpy.AddField_management(filename, 'Rank', 'SHORT')

# now we can update the shapefile
uFields = ['County', 'City', 'Rank']
with arcpy.da.UpdateCursor(filename, uFields) as uCursor:  # open an UpdateCursor on the file
    for row in uCursor:  # for each row (street)
        # get the county/city combo
        County_City = (row[uFields.index('County')], row[uFields.index('City')])
        if County_City in dct['Rank']:  # see if it is in your dictionary (it should be)
            # give it the value from the dictionary
            row[uFields.index('Rank')] = dct['Rank'][County_City]
        else:
            # otherwise flag it
            row[uFields.index('Rank')] = 999
        uCursor.updateRow(row)  # update the row
You can run this code inside the ArcGIS Pro Python console or in a Jupyter notebook. Hope it helps!
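For clarity, here is a minimal sketch (with made-up county/city names, not your real data) of what the set_index + to_dict step produces, which is why the lookup key above is a (County, City) tuple:
import pandas as pd

ranks = pd.DataFrame({'County': ['a', 'a'], 'City': ['X', 'Y'], 'Rank': [1, 2]})
dct = ranks.set_index(['County', 'City']).to_dict()
print(dct)                      # {'Rank': {('a', 'X'): 1, ('a', 'Y'): 2}}
print(dct['Rank'][('a', 'X')])  # 1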

Related

Filter DataFrame for most matches

I have a list (list_to_match = ['a','b','c','d']) and a dataframe like this one below:
Index  One  Two  Three  Four
1      a    b    d      c
2      b    b    d      d
3      a    b    d
4      c    b    c      d
5      a    b    c      g
6      a    b    c
7      a    s    c      f
8      a    f    c
9      a    b
10     a    b    t      d
11     a    b    g
...    ...  ...  ...    ...
100    a    b    c      d
My goal is to filter for the rows with the most matches with the list in the corresponding positions (e.g. position 1 in the list has to match column 1, position 2 column 2, etc.).
In this specific case, excluding row 100, rows 5 and 6 would be the ones selected, since they match 'a', 'b' and 'c'; but if row 100 were included, row 100 and all the other rows matching all elements would be selected.
Also, the list might change in length, e.g. list_to_match = ['a','b'].
Thanks for your help!
I would use:
list_to_match = ['a','b','c','d']
# compute a mask of identical values
mask = df.iloc[:, :len(list_to_match)].eq(list_to_match)
# ensure we match values in order
mask2 = mask.cummin(axis=1).sum(axis=1)
# get the rows with max matches
out = df[mask2.eq(mask2.max())]
# or
# out = df.loc[mask2.nlargest(1, keep='all').index]
print(out)
Output (ignoring the input row 100):
One Two Three Four
Index
5 a b c g
6 a b c None
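To see why the cummin step enforces in-order matching, here is a tiny illustration on a hypothetical match mask (True where a cell equals the corresponding list value):
import pandas as pd

row_mask = pd.Series([True, True, False, True])  # matches in positions 1, 2 and 4, gap at 3
print(row_mask.sum())            # 3 matches overall
print(row_mask.cummin().sum())   # 2: cummin turns everything after the first False into False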
Here is my approach. Descriptions are commented below.
import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine
data = {'One': ['a', 'a', 'a', 'a'],
'Two': ['b', 'b', 'b', 'b'],
'Three': ['c', 'c', 'y', 'c'],
'Four': ['g', 'g', 'z', 'd']}
dataframe_ = pd.DataFrame(data)
#encoding Letters into numerical values so we can compute the cosine similarities
dataframe_[:] = dataframe_.to_numpy().astype('<U1').view(np.uint32)-64
#Our input data which we are going to compare with other rows
input_data = np.array(['a', 'b', 'c', 'd'])
#encode input data into numerical values
input_data = input_data.astype('<U1').view(np.uint32)-64
#compute cosine similarity for each row
dataframe_out = dataframe_.apply(lambda row: 1 - cosine(row, input_data), axis=1)
print(dataframe_out)
output:
0 0.999343
1 0.999343
2 0.973916
3 1.000000
Filtering rows based on their cosine similarities:
df_filtered = dataframe_out.where(dataframe_out > 0.99)
print(df_filtered)
0 0.999343
1 0.999343
2 NaN
3 1.000000
From here on you can easily find the rows with non-NaN values by their indexes.
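For example, a short sketch of pulling out those indexes (assuming the df_filtered from above):
matching_idx = df_filtered.dropna().index
print(matching_idx.tolist())  # [0, 1, 3]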

filtering rows which have highest consecutive values in pandas

I have a dataframe containing 3 million records. It has two columns, column A and column B which is the timestamp of data gathering. I am trying to split it into multiple dataframes according to the following conditions:
First, after grouping by Column A, any group with less than 3 members should be dropped.
Second, all records that have the same column A value and the count of records of column B, in which the timestamp was less than 7 seconds, is more than 4 should be saved as a CSV file.
Column A Column B
D 3
A 2
A 2
A 8
A 13
C 15
D 8
D 6
F 3
F 14
F 2
F 5
F 2
G 20
B 12
N 15
N 1
N 2
N 1
N 2
N 3
I developed the following code, which meets the first condition but not the second one.
# Filter out records whose Column A value is repeated less than four times.
from itertools import groupby

grouped = df.groupby('Column A')
for i in grouped.groups.keys():
    p = grouped.get_group(i)
    if len(p.index) > 3:
        # Filtering records whose Column B values are less than 6
        y = [len(list(g)) for k, g in groupby(p['Column B'] < 6) if k == True]
        if len(y) != 0:
            if max(y) > 4:
                p.to_csv('D:/TEST1/' f"{i}.csv", sep=';')
This code saves every group that contains at least four consecutive values less than 6, which means it doesn't filter out the records whose values are more than 6. For example, in the proposed data the only value that should be saved as a CSV file is 'N', but my code saves all rows of N, while I need to save only the last five consecutive values, which are less than 6.
I reproduced your example with the following code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
columnA = ['D', 'A', 'A', 'A', 'A', 'C', 'D', 'D', 'F', 'F', 'F', 'F', 'F', 'G', 'B', 'N', 'N', 'N', 'N', 'N', 'N']
columnB = [3, 2, 2, 8, 13, 15, 8, 6, 3, 14, 2, 5, 2, 20, 12, 15, 1, 2, 1, 2, 3]
df = pd.DataFrame(data = zip(columnA, columnB),
columns = ['columnA', 'columnB'])
First, after grouping by Column A, any group with less than 3 members should be dropped.
The groupby and filter functions are a clear and fast way to do this task.
# any group with less than 3 members should be dropped
df1 = df.groupby('columnA').filter(lambda x: len(x) >= 3)
print(df1)
Second, all records that have the same column A value and the count of records of column B, in which the timestamp was less than 7 seconds, is more than 4 should be saved as a CSV file.
This task was confusing, but my best interpretation of this was to keep all timestamps that were less than 7 (using the where function) and then only keep the rows that were part of a group of size more than 4.
My solution leaves the timestamp column as floats (the NaNs introduced by where force the column to float), but it is an easy fix if you need to change that; see the note after the CSV step below.
# keep rows with timestamp less than 7 and only keep rows that are a part of a group with more than 4 values
df2 = df1.where(df['columnB'] < 7).groupby('columnA').filter(lambda x: len(x) > 4)
print(df2)
Finally, this saves the dataframe to the csv.
# save to csv
df2.to_csv("mycsv.csv")
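If you do want integers back, a hedged sketch of that easy fix (drop the NaN rows introduced by where, then cast before saving; names assume the df2/columnB from above):
df2 = df2.dropna()
df2['columnB'] = df2['columnB'].astype(int)
df2.to_csv("mycsv.csv")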

How to compare a row of a dataframe against the whole table considering exact matches and column order importance

I have a pandas dataframe as follows:
col_1  col_2  col_3  col_4  col_5  col_6
a      b      c      d      e      f
a      b      c      h      j      f
a      b      c      k      e      l
x      b      c      d      e      f
And I want to get a score for all the rows to see which of them are most similar to the first one, considering:
First, the number of columns with the same values between the first row and the row we are considering
If two rows have the same number of equal values compared to the first row, consider digit importance; that is, go from left to right in column order and give more importance to those rows whose matching columns are the left-most ones.
In the example above, the scores should be in the following order:
4th row (the last one), as it has 4 column values in common with row 1
3rd row, as it has 3 elements in common with row 1 and the non-matching columns are columns 4 and 6, while in row 2 these non-matching columns are 4 and 5
2nd row, as it has the same number of matches as row 3, but column 5 matches in row 3 and not in row 2
I want to solve this using a lambda function that assigns the score to each row of the dataframe, given row one as a constant.
You could use np.lexsort for this. It allows a nested sort based on the count of columns that match row 0, with the sum of position weights of the matching columns as a tie-breaker, where leftmost matches are worth more.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col_1': ['a', 'a', 'a', 'x'],
                   'col_2': ['b', 'b', 'b', 'b'],
                   'col_3': ['c', 'c', 'c', 'c'],
                   'col_4': ['d', 'h', 'k', 'd'],
                   'col_5': ['e', 'j', 'e', 'e'],
                   'col_6': ['f', 'f', 'l', 'f']})
df.loc[np.lexsort(((df.eq(df.iloc[0]) * df.columns.get_indexer(df.columns)[::-1]).sum(1).values,
                   df.eq(df.iloc[0]).sum(1).values))[::-1], 'rank'] = range(len(df))
print(df)
Output
col_1 col_2 col_3 col_4 col_5 col_6 rank
0 a b c d e f 0.0
1 a b c h j f 3.0
2 a b c k e l 2.0
3 x b c d e f 1.0
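To make the nested sort easier to follow, here is a small breakdown of the two keys fed to np.lexsort, computed on the six original columns (so it works whether or not the rank column has been added yet):
cols = [f'col_{n}' for n in range(1, 7)]
matches = df[cols].eq(df[cols].iloc[0])
print(matches.sum(1).values)                           # [6 4 4 5]    primary key: count of matching columns
print((matches * np.arange(5, -1, -1)).sum(1).values)  # [15 12 13 10] tie-breaker: leftmost-weighted matches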
Set up the problem:
data = [list(s) for s in ["aaax", "bbbb", "cccc", "dhkd", "ejee", "fflf"]]
df = pd.DataFrame(data).T.set_axis([f"col_{n}" for n in range(1, 7)], axis=1)
Solution:
ranking = (df == df.loc[0]).sum(1).iloc[1:]
ties = ranking.index[ranking.duplicated(keep=False)]  # rows whose match counts are tied
ranking[~(ranking.duplicated('first') | ranking.duplicated('last'))] *= 10
ranking.update((df.loc[ties] == df.loc[0]).mul(np.arange(6, 0, -1)).sum(1))
ranking.argsort()[::-1]
Explanation:
We first calculate each row's similarity to the first row and rank them. Then we split ties and non-ties. The non-ties are multiplied by 10. The ties are recalculated but this time we weight them by a descending scale of weights, to give more weight to a column the further left it is. Then we sum the weights to get the score for each row and update our original ranking. We return the reverse argsort to show the desired order.
You don't need to use a lambda here, it will be slower.
Result:
3 2
2 1
1 0
The left is the row index, ranked in order.
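If you do want the lambda form the question asks for, here is a hedged sketch (not from the original answers, and it assumes the df built in the setup above): the match count dominates via a factor of 100, and leftmost columns break ties.
import numpy as np

ref = df.loc[0]
weights = np.arange(len(df.columns), 0, -1)  # [6, 5, 4, 3, 2, 1]
score = df.apply(lambda row: (row == ref).sum() * 100 + ((row == ref) * weights).sum(), axis=1)
print(score.iloc[1:].sort_values(ascending=False))  # index order: 3, 2, 1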

How do I compare each row with all the others and if it's the same I concatenate to a new dataframe? Python

I have a DataFrame with 2 columns:
import pandas as pd
data = {'Country': ['A', 'A', 'A' ,'B', 'B'],'Capital': ['CC', 'CD','CE','CF','CG'],'Population': [5, 35, 20,34,65]}
df = pd.DataFrame(data,columns=['Country', 'Capital', 'Population'])
I want to compare each row with all the others, and if a pair has the same Country, I would like to concatenate the pair into a new data frame (and export it to a new CSV).
new_data = {'Country': ['A', 'A','B'],'Capital': ['CC', 'CD','CF'],'Population': [5, 35,34],'Country_2': ['A', 'A' ,'B'],'Capital_2': ['CD','CE','CG'],'Population_2': [35, 20,65]}
df_new = pd.DataFrame(new_data,columns=['Country', 'Capital', 'Population','Country_2','Capital_2','Population_2'])
NOTE: This is a simplification of my data; I have more than 5000 rows and I would like to do it automatically.
I tried comparing dictionaries, and also comparing one row at a time, but I couldn't do it.
Thank you for your attention.
>>> df.join(df.groupby('Country').shift(-1), rsuffix='_2')\
... .dropna(how='any')
Country Capital Population Capital_2 Population_2
0 A CC 5 CD 35.0
1 A CD 35 CE 20.0
3 B CF 34 CG 65.0
This pairs every row with the next one using join + shift, but we restrict the shifting to rows within the same country using groupby. See what the groupby + shift does on its own:
>>> df.groupby('Country').shift(-1)
Capital Population
0 CD 35.0
1 CE 20.0
2 NaN NaN
3 CG 65.0
4 NaN NaN
Then once these values are added to the right of your data with the _2 suffix, the rows that have NaNs are dropped with dropna().
Finally, note that Country_2 is not repeated, as it is the same as Country, but it would be very easy to add (see the sketch below).
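For instance, a minimal sketch of adding that column back, reusing the join/shift expression above:
paired = df.join(df.groupby('Country').shift(-1), rsuffix='_2').dropna(how='any')
paired['Country_2'] = paired['Country']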
To get all combinations you can try:
from itertools import combinations, chain
import numpy as np

df = (
    pd.concat(
        [pd.DataFrame(
            np.array(list(chain(*combinations(k.values, 2)))).reshape(-1, len(df.columns) * 2),
            columns=df.columns.append(df.columns.map(lambda x: x + '_2')))
         for g, k in df.groupby('Country')]
    )
)
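This builds every within-country pair, not just consecutive ones, so for the sample data it should give something like (the per-group index restarts, so a reset_index may be useful):
  Country Capital Population Country_2 Capital_2 Population_2
0       A      CC          5         A        CD           35
1       A      CC          5         A        CE           20
2       A      CD         35         A        CE           20
0       B      CF         34         B        CG           65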

Merging two dataframes without creating suffix

I want to create a new dataframe by merging two separate dataframes. The data frames share a common key and some common columns. These common columns also contain some, but not all, of the same values. I would like to keep a single value where the two cells agree, and keep both values where they differ. My data looks something like this:
Left:
key1 valueZ valueX valueY
A bob 1 4
B jes 8 5
C joe 3 6
Right:
key1 valueZ valueX valueY valueK
A sam 7 4 hill town
B beth 8 11 market
C joe 9 12 mall
The expected output would be:
key1 valueZ valueX valueY valueK
A bob/sam 1/7 4 hill town
B jes/beth 8 5/11 market
C joe 3/9 6/12 mall
You will need to do this in a few steps.
Here is my setup for reference:
import pandas as pd
# define Data Frames
left = pd.DataFrame({
'key1': ['A', 'B', 'C'],
'valueZ': ['bob', 'jes', 'joe'],
'valueX': [1, 8, 3],
'valueY': [4, 5, 6]
})
right = pd.DataFrame({
'key1': ['A', 'B', 'C'],
'valueZ': ['sam', 'beth', 'joe'],
'valueX': [7, 8, 9],
'valueY': [4, 11, 12],
'valueK': ['hill town', 'market', 'mall']
})
Now I have two DataFrame objects. They are left and right and match your example.
In order to combine how you want, I will need to know which columns are in common between the two Data Frames, as well as the final list of columns. I also define the key column here for ease of configuration. You can do that like this:
# determine important columns
keyCol = 'key1'
commonCols = list(set(left.columns & right.columns))
finalCols = list(set(left.columns | right.columns))
print('Common = ' + str(commonCols) + ', Final = ' + str(finalCols))
Which gives:
Common = ['valueZ', 'valueX', 'valueY', 'key1'], Final = ['valueZ', 'key1', 'valueK', 'valueX', 'valueY']
Next, you will join the two Data Frames as normal, but give the columns in both Data Frames a suffix (documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
# join dataframes with suffixes
mergeDf = left.merge(right, how='left', on=keyCol, suffixes=('_left', '_right'))
Finally, you will combine the common columns using whatever logic you desire. Once they are combined, you can remove the suffixed columns from your Data Frame. Example below. You can do this in more efficient ways, but I wanted to break it down for clarity (a vectorized sketch follows the final output below).
# combine the common columns
for col in commonCols:
    if col != keyCol:
        for i, row in mergeDf.iterrows():
            leftVal = str(row[col + '_left'])
            rightVal = str(row[col + '_right'])
            print(leftVal + ',' + rightVal)
            if leftVal == rightVal:
                mergeDf.loc[i, col] = leftVal
            else:
                mergeDf.loc[i, col] = leftVal + '/' + rightVal
# only use the finalCols
mergeDf = mergeDf[finalCols]
This gives:
valueZ key1 valueK valueX valueY
0 bob/sam A hill town 1/7 4
1 jes/beth B market 8 5/11
2 joe C mall 3/9 6/12
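As a hedged sketch of the "more efficient way" mentioned above (not part of the original answer), the same combination can be done without iterrows by applying numpy.where to whole columns:
import numpy as np

# vectorized version of the combine step above
for col in commonCols:
    if col != keyCol:
        l = mergeDf[col + '_left'].astype(str)
        r = mergeDf[col + '_right'].astype(str)
        mergeDf[col] = np.where(l == r, l, l + '/' + r)

mergeDf = mergeDf[finalCols]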
