I am having trouble dynamically binning my dataset for further calculation. My goal is to have specific bins/labels for each individual row in my dataframe, based on a function, and to have the corresponding label assigned to the 'action' column.
My dataset is:
id value1 value2 type length amount
1 0.9 1.0 X 10 ['A', 'B']
2 2.0 1.6 Y 80 ['A']
3 0.3 0.5 X 29 ['A', 'C']
The function is as follows:
import numpy as np

def bin_label_generator(amount):
    if amount < 2:
        amount = 2
    lower_bound = 1.0 - (1.0 / amount)
    mid_bound = 1.0
    upper_bound = 1.0 + (1.0 / amount)
    thresholds = {
        'bins': [-np.inf, lower_bound, mid_bound, upper_bound, np.inf],
        'labels': [0, 1.0, 2.0, 3.0]
    }
    return thresholds
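For instance, since amounts below 2 are clamped to 2, the widest bins the function can produce are those for an amount of 2 (1/amount = 0.5):

bin_label_generator(2)
# -> {'bins': [-inf, 0.5, 1.0, 1.5, inf], 'labels': [0, 1.0, 2.0, 3.0]}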
This is my current code, but it requires me to specify a row in order to cut. I want this to happen automatically, using the dictionary specified in the row itself.
# filter on type
filter_type_series = df['type'].str.contains('X')
# get the number of items in each amount list
amount_series = df[filter_type_series]['amount'].str.len()
# generate bins for each row in the series
bins_series = amount_series.apply(bin_label_generator)
# get the max values to use for binning
max_values = df[filter_type_series].loc[:, ['value1', 'value2']].abs().max(1)
# the following line requires a row index, which is what I do not want
df['action'] = pd.cut(max_values, bins=bins_series[0]['bins'], labels=bins_series[0]['labels'])
I found a fix myself, by just iterating over every single row in the series and then adding the result to the column in the actual df.
type = 'X'
first_df = df.copy()
type_series = df['type'].str.contains(type)
# loop over every row to dynamically use pd.cut with the bins/labels from that specific row
for index, row in df[type_series].iterrows():
    # get the max value from the row
    max_val = row[['value1', 'value2']].abs().max()
    # get the amount of cables
    amount = len(row['amount'])
    # get the bins and labels for this specific row
    bins_label_dict = bin_label_generator(amount)
    bins = bins_label_dict['bins']
    labels = bins_label_dict['labels']
    # assign the label for the max value to the row
    first_df.loc[index, 'action'] = pd.cut([max_val], bins=bins, labels=labels)[0]
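For reference, here is a minimal sketch of the same per-row binning without the explicit loop, assuming the df, column names, and bin_label_generator above; label_row is a hypothetical helper name:

def label_row(row):
    # build the row-specific bins/labels from the length of the amount list
    thresholds = bin_label_generator(len(row['amount']))
    max_val = row[['value1', 'value2']].abs().max()
    # cut a one-element list and take the single resulting label
    return pd.cut([max_val], bins=thresholds['bins'], labels=thresholds['labels'])[0]

mask = df['type'].str.contains('X')
df.loc[mask, 'action'] = df[mask].apply(label_row, axis=1)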
How to map closest values from two dataframes:
I have two dataframes in the format below and am looking to map values based on o_lat, o_long from data1 and near_lat, near_lon from data2:
data1 ={'lat': [-0.659901, -0.659786, -0.659821],
'long':[2.530561, 2.530797, 2.530587],
'd':[0.4202, 1.0957, 0.6309],
'o_lat':[-37.8095,-37.8030,-37.8050],
'o_long':[145.0000,145.0077,145.0024]}
Here, lat, long are the coordinates of the destination, d is the distance between origin and destination, and o_lat, o_long are the coordinates of the origin.
data2={'nearest_warehouse': ['Nickolson','Thompson','Bakers'],
'lat':[-37.8185,-37.8126,-37.8099],
'lon':[144.9695,144.9470,144.9952]}
I want to produce another column in data1 which locates the nearest_warehouse, in the following format, based on the closest value:
result={'lat': [-0.659901, -0.659786, -0.659821],
'long':[2.530561, 2.530797, 2.530587],
'd':[0.4202, 1.0957, 0.6309],
'o_lat':[-37.8095,-37.8030,-37.8050],
'o_long':[145.0000,145.0077,145.0024],
'nearest_warehouse':['Bakers','Thompson','Nickolson']}
I've tried following code:
lat_diff = []
long_diff = []
min_distance = []
for i in range(0, 3):
    lat_diff.append(float(warehouse.near_lat[i]) - lat_long_d.o_lat[0])
for j in range(0, 3):
    long_diff.append(float(warehouse.near_lon[j]) - lat_long_d.o_long[0])
min_distance = [min(lat_diff), min(long_diff)]
min_distance
This gives the following result, which is the minimum value of the difference between latitude and longitude for o_lat=-37.8095 and o_long=145.0000:
[-0.00897867136701791, -0.05300973586690816].
I feel this approach is not viable for mapping closest values over a large dataset.
I am looking for a better approach in this regard.
From the first dataframe, you can go through each row with a lambda, compare it to all rows of the second dataframe, and return the absolute difference of latitude plus the absolute difference of longitude. This sum of absolute differences effectively acts as the distance.
Now, what you are interested in is the index, i.e. the position, of the minimum absolute difference of longitude plus absolute difference of latitude for each row. You can find this with idxmin(). In dataframe 1, this returns the index number, which you can use to merge against the index of dataframe 2 to pull in the closest warehouse:
setup:
import pandas as pd

data1 = pd.DataFrame({'lat': [-0.659901, -0.659786, -0.659821],
                      'long': [2.530561, 2.530797, 2.530587],
                      'd': [0.4202, 1.0957, 0.6309],
                      'o_lat': [-37.8095, -37.8030, -37.8050],
                      'o_long': [145.0000, 145.0077, 145.0024]})
data2 = pd.DataFrame({'nearest_warehouse': ['Nickolson', 'Thompson', 'Bakers'],
                      'lat': [-37.818595, -37.812673, -37.809996],
                      'lon': [144.969551, 144.947069, 144.995232],
                      'near_lat': [-37.8185, -37.8126, -37.8099],
                      'near_lon': [144.9695, 144.9470, 144.9952]})
code:
data1['key'] = data1.apply(lambda x: ((x['o_lat'] - data2['near_lat']).abs()
+ (x['o_long'] - data2['near_lon']).abs()).idxmin(), axis=1)
data1 = pd.merge(data1, data2[['nearest_warehouse']], how='left', left_on='key', right_index=True).drop('key', axis=1)
data1
Out[1]:
lat long d o_lat o_long nearest_warehouse
0 -0.659901 2.530561 0.4202 -37.8095 145.0000 Bakers
1 -0.659786 2.530797 1.0957 -37.8030 145.0077 Bakers
2 -0.659821 2.530587 0.6309 -37.8050 145.0024 Bakers
This result looks accurate if you append the two dataframes into one and do a basic scatterplot. As you can see, the Bakers warehouse is right there compared to the other points (the graph IS to scale, thanks to the last line of code):
import matplotlib.pyplot as plt

data1 = pd.DataFrame({'o_lat': [-37.8095, -37.8030, -37.8050],
                      'o_long': [145.0000, 145.0077, 145.0024],
                      'nearest_warehouse': ['0', '1', '2']})
data2 = pd.DataFrame({'nearest_warehouse': ['Nickolson', 'Thompson', 'Bakers'],
                      'o_lat': [-37.8185, -37.8126, -37.8099],
                      'o_long': [144.9695, 144.9470, 144.9952]})
df = pd.concat([data1, data2])  # DataFrame.append is deprecated in newer pandas
y = df['o_lat'].to_list()
z = df['o_long'].to_list()
n = df['nearest_warehouse'].to_list()
fig, ax = plt.subplots()
ax.scatter(z, y)
for i, txt in enumerate(n):
    ax.annotate(txt, (z[i], y[i]))
plt.gca().set_aspect('equal', adjustable='box')
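For larger datasets, one faster alternative (my own suggestion, not part of the original answer) is a KD-tree nearest-neighbour lookup via scipy, with data1 and data2 as defined in the setup section above. Note that cKDTree uses Euclidean rather than the Manhattan-style metric above, so near-ties could resolve differently:

from scipy.spatial import cKDTree

# build a tree over the warehouse coordinates
tree = cKDTree(data2[['near_lat', 'near_lon']].to_numpy())
# query the index of the nearest warehouse for every origin point
_, idx = tree.query(data1[['o_lat', 'o_long']].to_numpy(), k=1)
data1['nearest_warehouse'] = data2['nearest_warehouse'].to_numpy()[idx]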
I have a dataframe (portbase) that contains multiple signals (signalname) and their returns.
I want to subset every single signal, calculate the cumulative return, and then plot them all in a single figure.
I have done it step by step with one signal as an example:
ChInvIA = portbase[portbase['signalname'] == 'ChInvIA']
cum_perf_ChInvIA = ChInvIA['return'].cumsum() + 100
cum_perf_ChInvIA.plot()
plt.show()
With multiple signals this would take me way too long, and therefore I've tried to loop over my dataframe.
for i in signals:
    i = portbase[portbase['signalname'] == 'i']
    cum_perf_i = i['return'].cumsum() + 100
    cum_perf_i.plot()
    plt.show()
It doesn't work, and I haven't been able to find a solution.
You are using the name i both for the looping variable and for a variable inside the loop, and you are comparing signalname to the literal string 'i' instead of to the variable itself. You should do something like this instead:
for i in signals:
    signal_i = portbase[portbase['signalname'] == i]
    cum_perf_i = signal_i['return'].cumsum() + 100
    cum_perf_i.plot()
    plt.show()
To have all the plots in the same figure, you should use matplotlib's subplots function:
fig, ax = plt.subplots(len(signals))
for ind, i in enumerate(signals):
    signal_i = portbase[portbase['signalname'] == i]
    cum_perf_i = signal_i['return'].cumsum() + 100
    cum_perf_i.plot(ax=ax[ind])
plt.show()
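If instead you want all the curves overlaid on one set of axes (one reading of "a single figure"), here is a minimal variant of the same loop, assuming the portbase and signals names from the question:

fig, ax = plt.subplots()
for i in signals:
    signal_i = portbase[portbase['signalname'] == i]
    # plot each cumulative performance on the shared axes, labelled by signal
    (signal_i['return'].cumsum() + 100).plot(ax=ax, label=i)
ax.legend()
plt.show()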
I am trying to show relative percentage by group as well as total frequency in an sns barplot. The two groups I am comparing are very different in size, which is why I show the percentage by group in the function below.
Here is the syntax for a sample dataframe I created that has relative group sizes similar to my data ('groups') among the target categorical variable ('item'). 'rand' is just a variable I use to make the df.
# import pandas and seaborn
import pandas as pd
import seaborn as sns
import numpy as np

# create dataframe
foobar = pd.DataFrame(np.random.randn(100, 3), columns=('groups', 'item', 'rand'))

# get relative group sizes
for row, val in enumerate(foobar.rand):
    if val > -1.2:
        foobar.loc[row, 'groups'] = 'A'
    else:
        foobar.loc[row, 'groups'] = 'B'
    # assign categories that I am comparing graphically
    if row < 20:
        foobar.loc[row, 'item'] = 'Z'
    elif row < 40:
        foobar.loc[row, 'item'] = 'Y'
    elif row < 60:
        foobar.loc[row, 'item'] = 'X'
    elif row < 80:
        foobar.loc[row, 'item'] = 'W'
    else:
        foobar.loc[row, 'item'] = 'V'
Here is the function I wrote that compares relative frequencies by group. It has some default variables, but I've reassigned them for this question.
def percent_categorical(item, df=IA, grouper='Active Status'):
    # plot categorical responses to an item ('column name')
    # by percent by group ('diff column name w categorical data')
    # select a data frame (default is IA)
    # 'Active Status' is default grouper
    # create df of item grouped by status
    grouped = (df.groupby(grouper)[item]
               # convert to percentage by group rather than total count
               .value_counts(normalize=True)
               # rename column
               .rename('percentage')
               # multiply by 100 for easier interpretation
               .mul(100)
               # change order from value to name
               .reset_index()
               .sort_values(item))
    # create plot
    PercPlot = sns.barplot(x=item,
                           y='percentage',
                           hue=grouper,
                           data=grouped,
                           palette='RdBu'
                           ).set_xticklabels(labels=grouped[item].value_counts().index.tolist(),
                                             rotation=90)
    # show plot
    return PercPlot
The function and resulting graph follow:
percent_categorical('item', df=foobar, grouper='groups')
This is good, because it allows me to show the relative percentage by group. However, I also want to display the absolute numbers for each group, preferably in the legend. In this case, I would want it to show that there are 89 total members of group A and 11 total members of group B.
Thank you in advance for any help.
I solved this by splitting out the groupby operation: one to get the percentages and one to count the number of objects.
I adjusted your percent_categorical function as follows:
import matplotlib.pyplot as plt

def percent_categorical(item, df=IA, grouper='Active Status'):
    # plot categorical responses to an item ('column name')
    # by percent by group ('diff column name w categorical data')
    # select a data frame (default is IA)
    # 'Active Status' is default grouper
    # create groupby of item grouped by status
    groupbase = df.groupby(grouper)[item]
    # count the number of occurrences per group
    groupcount = groupbase.count()
    # convert to percentage by group rather than total count
    groupper = (groupbase.value_counts(normalize=True)
                # rename column
                .rename('percentage')
                # multiply by 100 for easier interpretation
                .mul(100)
                # change order from value to name
                .reset_index()
                .sort_values(item))
    # create plot
    fig, ax = plt.subplots()
    brplt = sns.barplot(x=item,
                        y='percentage',
                        hue=grouper,
                        data=groupper,
                        palette='RdBu',
                        ax=ax).set_xticklabels(labels=groupper[item].value_counts().index.tolist(),
                                               rotation=90)
    # get the handles and the labels of the legend
    # these are the bars and the corresponding text in the legend
    thehandles, thelabels = ax.get_legend_handles_labels()
    # for each label, add the total number of occurrences
    # you can get this from groupcount, as the labels in the figure have
    # the same names as the values in the grouper column of your df
    for counter, label in enumerate(thelabels):
        # the new label looks like this (dummy name and value): 'XYZ (42)'
        thelabels[counter] = label + ' ({})'.format(groupcount[label])
    # add the new legend to the figure
    ax.legend(thehandles, thelabels)
    # show plot
    return fig, ax, brplt
To get your figure:
fig, ax, brplt = percent_categorical('item', df=foobar, grouper='groups')
The resulting graph looks like this:
You can change the look of this legend however you want; I just added the parentheses as an example.
I have a script which looks at the row and column headers belonging to a group (REG_ID) and sums the values. The code runs on a matrix (a small subset is shown in the output images).
My code runs well for calculating the sum for all the IDs based on the rows and columns belonging to each internal group (REG_ID). For example, all row and column IDs which belong to REG_ID 1 are summed, so the total flow between region 1 and region 1 (the internal flow) is calculated, and so on for each region.
I wish to extend this code by calculating (summing) the flows between regions, for example region 1 to regions 2, 3, 4, 5...
I figure that I need to include another loop within the existing while loop, but would really appreciate some help figuring out where it should go and how to construct it.
My code, which currently calculates the internal flow sums (1-1, 2-2, 3-3, etc.), is as follows:
global index
index = 1
x = index
while index < len(idgroups):
    ward_list = idgroups[index]  #select list of ward ids for each region from list of lists
    df6 = mergedcsv.loc[ward_list]  #select rows with values in the list
    dfcols = mergedcsv.loc[ward_list, :]  #select columns with values in list
    ward_liststr = map(str, ward_list)  #convert ward_list to strings so that they can be used to select columns, won't work as integers.
    ward_listint = map(int, ward_list)
    #dfrowscols = mergedcsv.loc[ward_list, ward_listint]
    df7 = df6.loc[:, ward_liststr]
    print df7
    regflowsum = df7.values.sum()  #sum all values in dataframe
    intflow = [regflowsum]
    print intflow
    dfintflow = pd.DataFrame(intflow)
    dfintflow.reset_index(level=0, inplace=True)
    dfintflow.columns = ["RegID", "regflowsum"]
    dfflows.set_value(index, 'RegID', index)
    dfflows.set_value(index, 'RegID2', index)
    dfflows.set_value(index, 'regflow', regflowsum)
    mergedcsv.set_value(ward_list, 'TotRegFlows', regflowsum)
    index += 1  #increment index number
print dfflows
new_df = pd.merge(pairlist, dfflows, how='left', left_on=['origID', 'destID'], right_on=['RegID', 'RegID2'])
print new_df  #useful for checking dataframe merges
regionflows = r"C:\Temp\AllNI\regionflows.csv"
header = ["WardID", "LABEL", "REG_ID", "Total", "TotRegFlows"]
mergedcsv.to_csv(regionflows, columns=header, index=False)
regregflows = r"C:\Temp\AllNI\reg_regflows.csv"
headerreg = ["REG_ID_ORIG", "REG_ID_DEST", "FLOW"]
pairlistCSV = r"C:\Temp\AllNI\pairlist_regions.csv"
new_df.to_csv(pairlistCSV)
The output is as follows:
idgroups dataframe: (see image 1, second part)
df7 and intflow for each region REG_ID: (image 1, third part, on the right)
dfflows dataframe: (image 2, fourth part)
and the final output is new_df: (image 2, fifth part)
I wish to populate the sums for all possible combinations of flows between the regions, not just the internal ones.
I figure I need to add another loop inside the while loop, possibly with an enumerate function like:
while index < len(idgroups):
    #add line(s) to calculate flows between regions
    for index, item in enumerate(idgroups):
        ward_list = idgroups[index]
        print ward_list
        df6 = mergedcsv.loc[ward_list]  #select rows with values in the list
        dfcols = mergedcsv.loc[ward_list, :]  #select columns with values in list
        ward_liststr = map(str, ward_list)  #convert ward_list to strings so that they can be used to select columns, won't work as integers.
        ward_listint = map(int, ward_list)
        #dfrowscols = mergedcsv.loc[ward_list, ward_listint]
        df7 = df6.loc[:, ward_liststr]
        print df7
        regflowsum = df7.values.sum()  #sum all values in dataframe
        intflow = [regflowsum]
        print intflow
        dfintflow = pd.DataFrame(intflow)
        dfintflow.reset_index(level=0, inplace=True)
        dfintflow.columns = ["RegID", "regflowsum"]
        dfflows.set_value(index, 'RegID', index)
        dfflows.set_value(index, 'RegID2', index)
        dfflows.set_value(index, 'regflow', regflowsum)
        mergedcsv.set_value(ward_list, 'TotRegFlows', regflowsum)
    index += 1  #increment index number
I'm unsure how to integrate the item, so I am struggling to extend the code to all combinations. Any advice appreciated.
Update based on flows function:
w=pysal.rook_from_shapefile("C:/Temp/AllNI/NIW01_sort.shp",idVariable='LABEL')
Simil = pysal.open("C:/Temp/AllNI/simNI.csv")
Similarity = np.array(Simil)
db = pysal.open('C:\Temp\SQLite\MatrixCSV2.csv', 'r')
dbf = pysal.open(r'C:\Temp\AllNI\NIW01_sortC.dbf', 'r')
ids = np.array((dbf.by_col['LABEL']))
commuters = np.array((dbf.by_col['Total'],dbf.by_col['IDNO']))
commutersint = commuters.astype(int)
comm = commutersint[0]
floor = int(MIN_COM_CT + 100)
solution = pysal.region.Maxp(w=w,z=Similarity,floor=floor,floor_variable=comm)
regions = solution.regions
#print regions
writecsv = r"C:\Temp\AllNI\reg_output.csv"
csv = open(writecsv,'w')
csv.write('"LABEL","REG_ID"\n')
for i in range(len(regions)):
    for lines in regions[i]:
        csv.write('"' + lines + '","' + str(i+1) + '"\n')
csv.close()
flows = r"C:\Temp\SQLite\MatrixCSV2.csv"
regs = r"C:\Temp\AllNI\reg_output.csv"
wardflows = pd.read_csv(flows)
regoutput = pd.read_csv(regs)
merged = pd.merge(wardflows, regoutput)
#duplicate REG_ID column as the index to be used later
merged['REG_ID2'] = merged['REG_ID']
merged.to_csv("C:\Temp\AllNI\merged.csv", index=False)
mergedcsv = pd.read_csv("C:\Temp\AllNI\merged.csv",index_col='WardID_1') #index this dataframe using the WardID_1 column
flabelList = pd.read_csv("C:\Temp\AllNI\merged.csv", usecols = ["WardID", "REG_ID"]) #create list of all FLabel values
reg_id = "REG_ID"
ward_flows = "RegIntFlows"
flds = [reg_id, ward_flows] #create list of fields to be use in search
dict_ref = {} # create a dictionary with for each REG_ID a list of corresponding FLABEL fields
#group the dataframe by the REG_ID column
idgroups = flabelList.groupby('REG_ID')['WardID'].apply(lambda x: x.tolist())
print idgroups
idgrp_df = pd.DataFrame(idgroups)
csvcols = mergedcsv.columns
#create a list of column names to pass as an index to select columns
columnlist = list(mergedcsv.columns.values)
mergedcsvgroup = mergedcsv.groupby('REG_ID').sum()
mergedcsvgroup.describe()
idList = idgroups[2]
df4 = pd.DataFrame()
df5 = pd.DataFrame()
col_ids = idList #ward id no
regiddf = idgroups.index.get_values()
print regiddf
#total number of region ids
#print regiddf
#create pairlist combinations from region ids
#combinations with replacement allows for repeated items
#pairs = list(itertools.combinations_with_replacement(regiddf, 2))
pairs = list(itertools.product(regiddf, repeat=2))
#print len(pairs)
#create a new dataframe with pairlists and summed data
pairlist = pd.DataFrame(pairs,columns=['origID','destID'])
print pairlist.tail()
header_pairlist = ["origID","destID","flow"]
header_intflow = ["RegID", "RegID2", "regflow"]
dfflows = pd.DataFrame(columns=header_intflow)
print mergedcsv.index
print mergedcsv.dtypes
#mergedcsv = mergedcsv.select_dtypes(include=['int64'])
#print mergedcsv.columns
#mergedcsv.rename(columns = lambda x: int(x), inplace=True)
def flows():
    pass

#def flows(mergedcsv, region_a, region_b):
def flows(mergedcsv, ward_lista, ward_listb):
    """Return the sum of all the cells in the row/column intersections
    of ward_lista and ward_listb."""
    mergedcsv = mergedcsv.loc[:, mergedcsv.dtypes == 'int64']
    regionflows = mergedcsv.loc[ward_lista, ward_listb]
    regionflowsum = regionflows.values.sum()
    #grid = [ax, bx, regflowsuma, regflowsumb]
    gridoutput = [ax, bx, regionflowsum]
    print gridoutput
    return regflowsuma
    return regflowsumb
#print mergedcsv.index
#mergedcsv.columns = mergedcsv.columns.str.strip()
for ax, group_a in enumerate(idgroups):
    ward_lista = map(int, group_a)
    print ward_lista
    for bx, group_b in enumerate(idgroups[ax:], start=ax):
        ward_listb = map(int, group_b)
        #print ward_listb
        flow_ab = flows(mergedcsv, ward_lista, ward_listb)
        #flow_ab = flows(mergedcsv, group_a, group_b)
This results in KeyError: 'None of [[189, 197, 198, 201]] are in the [columns]'
I have also tried ward_lista = map(str, group_a) and map(int, group_a), but the list objects are not found by dataframe.loc.
The columns are of mixed datatypes, but all the columns containing the labels which should be sliced are of type int64.
I have tried many solutions around the datatypes, but to no avail. Any suggestions?
I can't speak to the computations you're doing, but it seems like you just want to arrange combinations of groups. The question is whether they are directed or undirected: that is, do you need to compute both flows(A,B) and flows(B,A), or just one?
If just one, you could do this:
for i, ward_list in enumerate(idgroups):
    for j, ward_list2 in enumerate(idgroups[i:], start=i):
This would iterate over i,j pairs like:
0,0  0,1  0,2  ...  0,n
     1,1  1,2  ...  1,n
          2,2  ...  2,n
which would serve in the undirected case.
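Equivalently, here is a short sketch with the standard library's itertools (your own code already has combinations_with_replacement commented out), which yields the same undirected index pairs:

from itertools import combinations_with_replacement

# each unordered (i, j) pair of group indices exactly once, including i == j
for i, j in combinations_with_replacement(range(len(idgroups)), 2):
    ward_list, ward_list2 = idgroups[i], idgroups[j]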
If you need to compute both flows(A,B) and flows(B,A), then you can simply push your code into a function called flows, and call it with reversed args, as shown. ;-)
Update
Let's define a function called flows:
def flows():
    pass
Now, what are the parameters?
Well, looking at your code, it gets data from a DataFrame. And you want two different wards, so let's start with those. The result seems to be a sum of the resulting grid.
def flows(df, ward_a, ward_b):
    """Return the sum of all the cells in the row/column intersections
    of ward_a and ward_b."""
    return 0
Now I'm going to copy lines of your code:
ward_list = idgroups[index]
print ward_list
df6 = mergedcsv.loc[ward_list] #select rows with values in the list
dfcols = mergedcsv.loc[ward_list, :] #select columns with values in list
ward_liststr = map(str, ward_list) #convert ward_list to strings so that they can be used to select columns, won't work as integers.
ward_listint = map(int, ward_list)
#dfrowscols = mergedcsv.loc[ward_list, ward_listint]
df7 = df6.loc[:, ward_liststr]
print df7
regflowsum = df7.values.sum() #sum all values in dataframe
intflow = [regflowsum]
print intflow
I think this is most of the flows function right here. Let's look.
The ward_list will obviously be either the ward_a or the ward_b parameter.
I'm not sure what df6 is for, because you sort of recompute it in df7. So that needs to be clarified.
regflowsum is our desired output, I think.
Rewriting this into the function:
def flows(df, ward_a, ward_b):
    """Return the sum of all the cells in the row/column intersections
    of ward_a and ward_b."""
    print "Computing flows from:"
    print "  ", ward_a
    print ""
    print "flows into:"
    print "  ", ward_b
    # Filter rows by ward_a, cols by ward_b:
    grid = df.loc[ward_a, ward_b]
    print "Grid:"
    print grid
    flowsum = grid.values.sum()
    print "Flows:", flowsum
    return flowsum
Now, I have assumed that the ward_a and ward_b values are already in the correct format. So we'll have to str-ify them or whatever outside the function. Let's do that:
for ax, group_a in enumerate(idgroups):
    ward_a = map(str, group_a)
    for bx, group_b in enumerate(idgroups[ax:], start=ax):
        ward_b = map(str, group_b)
        flow_ab = flows(mergedcsv, ward_a, ward_b)
        if ax != bx:
            flow_ba = flows(mergedcsv, ward_b, ward_a)
        else:
            flow_ba = flow_ab
        # Now what?
At this point, you have two numbers. They will be equal when the wards are the same (internal flow?). Beyond this, your original code stops being helpful, because it only deals with internal flows and not A->B flows, so I don't know what to do. But the values are in the variables, so ...
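If it helps, here is a rough sketch (not something your original code does) of collecting those values into the origID/destID/flow layout you already defined in header_pairlist; pairflows is a hypothetical name:

records = []
for ax, group_a in enumerate(idgroups):
    ward_a = map(str, group_a)
    for bx, group_b in enumerate(idgroups[ax:], start=ax):
        ward_b = map(str, group_b)
        records.append((ax, bx, flows(mergedcsv, ward_a, ward_b)))
        if ax != bx:
            # directed case: record the reverse flow as well
            records.append((bx, ax, flows(mergedcsv, ward_b, ward_a)))
# pairflows is a hypothetical result frame matching header_pairlist
pairflows = pd.DataFrame(records, columns=["origID", "destID", "flow"])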