I am using Python and I want to go through a dataset and highlight the most frequently used locations.
This is my dataset (but with 300,000+ records):
Longitude Latitude
14.28586 48.3069
14.28577 48.30687
14.28555 48.30678
14.28541 48.30673
First I add a density column:
df['Density'] = 0
And this is the code that I am using to increase the density value for each record:
for index in range(0, len(df)):
    for index2 in range(index + 1, len(df)):
        if df['Longitude'].loc[index] == df['Longitude'].loc[index2] and df['Latitude'].loc[index] == df['Latitude'].loc[index2]:
            df['Density'].loc[index] += 1
            df['Density'].loc[index2] += 1
            print("match")
    print(str(index) + "/" + str(len(df)))
The code above simply iterates through the dataframe, comparing each record against every later record (inner loop); when a match is found, both density values are incremented.
I want to find the Longitude/Latitude pairs that match and increase their density value.
The code is obviously very slow, and I am sure Python has a cool technique for doing something like this. Any ideas?
You can use duplicated, groupby, transform & sum to achieve this:
Let's create a sample dataset that actually has duplicates:
import numpy as np
import pandas as pd

df = pd.DataFrame({'lat': [0, 0, 0, 1, 1, 2, 2, 2],
                   'lon': [1, 1, 2, 1, 0, 2, 2, 2]})
First flag the duplicate rows based on lat & lon, then apply the transform to create a new column:
df['is_dup'] = df[['lat', 'lon']].duplicated()
df['dups'] = df.groupby(['lat','lon']).is_dup.transform(np.sum)
# df outputs:
   lat  lon  is_dup  dups
0    0    1   False     1
1    0    1    True     1
2    0    2   False     0
3    1    1   False     0
4    1    0   False     0
5    2    2   False     2
6    2    2    True     2
7    2    2    True     2
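Applied back to the original question, you can skip the is_dup intermediate entirely if all you want is the density count. A minimal sketch, assuming the columns are named Longitude and Latitude as in the sample data; transform('size') counts the rows in each coordinate group, and subtracting 1 reproduces the loop's semantics of counting the other matching rows:

# rows sharing this (Longitude, Latitude) pair, minus the row itself
df['Density'] = df.groupby(['Longitude', 'Latitude'])['Longitude'].transform('size') - 1

This is a single grouped pass over the data rather than an O(n^2) pairwise comparison, so it should cope with 300,000+ records.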
Related
I am trying to create a column with zeros and ones based on the values in the 1st column.
If the value of the upper cell is bigger, then write 1, else 0.
Example code would look like this:
df = pd.DataFrame({'col1': [1, 2, 1, 3, 0]})
df['col2'] = ...python version of the Excel formula IF(A2>A3, 1, 0)...
expected output:
   col1  col2
0     1     0
1     2     1
2     1     0
3     3     1
4     0     0
I have tried:
while True:
    for index, rows in df.iterrows():
        df['col1'] = np.where(df['col1'] > df['col1'][index+1], 1, 0)
but this is very slow and gives wrong results.
Thanks in advance!
You can use:
df['col2'] = df['col1'].shift().lt(df['col1']).astype(int)
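For the sample frame above, this gives (a quick check, assuming pandas is imported as pd):

df = pd.DataFrame({'col1': [1, 2, 1, 3, 0]})
df['col2'] = df['col1'].shift().lt(df['col1']).astype(int)
print(df)
#    col1  col2
# 0     1     0
# 1     2     1
# 2     1     0
# 3     3     1
# 4     0     0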
Here is the final solution I came up with:
df['col2'] = (df['col1'] < df['col1'].shift()).astype(int).shift(periods=-1).fillna(0)
(Note that fillna leaves the column as float; append .astype(int) if you need integers.)
I am new to programming and Python, could you help me?
I have a data frame which looks like this.
d = {'time': [4, 10, 15, 6, 0, 20, 40, 11, 9, 12, 11, 25],
     'value': [0, 0, 0, 50, 100, 0, 0, 70, 100, 0, 100, 20]}
df = pd.DataFrame(data=d)
I want to slice the data whenever value == 100 and then plot all the slices in one figure.
So my questions are: how do I slice or cut the data as described, and what is the best structure for saving the slices in order to plot them?
Note 1: the value column has no frequency that I can use; it varies from 0 to 100, and time is arbitrary.
Note 2: I already tried this solution, but I get the same table back:
decreased_value = df[df['value'] <= 100][['time', 'value']].reset_index(drop=True)
How can I slice one column in a dataframe into several series based on a condition?
Thanks in advance!
EDIT:
Here's a simpler way of handling my first answer (thanks to @aneroid for the suggestion).
Get the indices where value == 100 and add 1 so that these land at the bottom of each slice:
indices = df.index[df['value'] == 100] + 1
Then use numpy.split (thanks to this answer for that method) to make a list of dataframes:
df_list = np.split(df, indices)
Then do your plotting for each slice in a for loop:
for df in df_list:
    ...  # plot based on df here
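Putting the whole thing together, a minimal end-to-end sketch (assuming matplotlib for the plotting, which the question doesn't name explicitly):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

d = {'time': [4, 10, 15, 6, 0, 20, 40, 11, 9, 12, 11, 25],
     'value': [0, 0, 0, 50, 100, 0, 0, 70, 100, 0, 100, 20]}
df = pd.DataFrame(data=d)

# split right after each row where value == 100
indices = df.index[df['value'] == 100] + 1
df_list = np.split(df, indices)

# plot every slice into the same figure
for piece in df_list:
    plt.plot(piece['time'], piece['value'])
plt.show()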
VERBOSE / FROM SCRATCH METHOD:
You can get the indices for where value==100 like this:
indices = df.index[df.value==100]
Then add the smallest and largest indices in order to not leave out the beginning and end of the df:
indices = indices.insert(0,0).to_list()
indices.append(df.index[-1]+1)
Then cycle through a while loop to cut up the dataframe and put each slice into a list of dataframes:
i = 0
df_list = []
while i + 1 < len(indices):
    df_list.append(df.iloc[indices[i]:indices[i+1]])
    i += 1
I already solved the problem using a for loop, which slices and plots at the same time without using the np.split function, and also keeps the original data structure.
Thanks to the previous answer by @k_n_c, which helped me improve it:
slices = df.index[df['value'] == 100]
slices = slices + 1
slices = np.insert(slices, 0, 0, axis=0)
slices = np.append(slices, df.index[-1] + 1)

prev_ind = 0
for ind in slices:
    temp = df.iloc[prev_ind:ind, :]
    plt.plot(temp.time, temp.value)
    prev_ind = ind
plt.show()
I'm trying to use np.where() to classify elements of an array into three categories. My array is mean_house_value = [200.000, 120.000, 111.765, 326.234, 700.090, 99.345, 150.232, 250.000, 940.000, 177.000, 45.000, 42.000, 620.654]. The dataset is called housing, and house_value_cat is the new column in the dataset where I want to store my classification. The classification is the following:
mean_house_value < 200.000
200.000 < mean_house_value < 400.000
400.000 < mean_house_value
My code so far is the following:
housing["house_value_cat"] = np.ceil(housing["mean_house_value"] / 3)
housing["house_value_cat"].where((housing["house_value_cat"] < 200.000) & (housing["house_value_cat"] > 400.000))
print(housing["house_value_cat"])
How can I implement the second condition (200.000 < mean_house_value < 400.000) in my code?
My desired output is a category label (1, 2 or 3) in house_value_cat for each value.
NumPy has a function digitize() that does what you want:
>>> import numpy as np
>>> mean_house_value = [200.000, 120.000, 111.765, 326.234, 700.090, 99.345, 150.232, 250.000, 940.000, 177.000, 45.000, 42.000, 620.654]
>>> np.digitize(mean_house_value,[0.,200.,400.])
array([2, 1, 1, 2, 3, 1, 1, 2, 3, 1, 1, 1, 3])
You can create a new column in a dataframe with this result. Assuming you have already defined a dataframe called housing:
housing["house_value_cat"] = np.digitize(mean_house_value,[0.,200.,400.])
I have two nested NumPy arrays (dateValArr & searchDates). dateValArr contains all dates for May 2011 (1st to 31st) and an associated value for each date. searchDates contains 2 dates and an associated value as well (the 2 dates correspond to a date range).
Using the date ranges specified in the searchDates array, I want to find the matching dates in dateValArr. Next, for those selected dates in dateValArr, I want to find the value closest to the value specified in searchDates.
I have come up with this code, but the first part only works if a single value is specified.
#setup arrays ---------------------------------------------------------------------------
import numpy as np
import pandas as pd
from datetime import datetime

# Generate dates
st_date = '2011-05-01'
ed_date = '2011-05-31'
dates = pd.date_range(st_date, ed_date).to_numpy(dtype=object)

# Generate values
val_arr = np.random.uniform(1, 12, 31)

dateValLs = []
for i, j in zip(dates, val_arr):
    dateValLs.append((i, j))
dateValArr = np.asarray(dateValLs)
print(dateValArr)
#out:
[[Timestamp('2011-05-01 00:00:00', freq='D') 7.667399233149668]
[Timestamp('2011-05-02 00:00:00', freq='D') 5.906099813052642]
[Timestamp('2011-05-03 00:00:00', freq='D') 3.254485533826182]
...]
#Generate search dates
searchDates = np.array([(datetime(2011,5,11),datetime(2011,5,20),9),(datetime(2011,5,25),datetime(2011,5,29),2)])
print(searchDates)
#out:
[[datetime.datetime(2011, 5, 11, 0, 0) datetime.datetime(2011, 5, 20, 0, 0) 9]
[datetime.datetime(2011, 5, 25, 0, 0) datetime.datetime(2011, 5, 29, 0, 0) 2]]
#end setup ------------------------------------------------------------------------------
x = np.where(np.logical_and(dateValArr[:,0] > searchDates[0][0], dateValArr[:,0] < searchDates[0][1]))
print(x)
out: (array([11, 12, 13, 14, 15, 16, 17, 18], dtype=int64),)
However, the code works only if I select the first element of searchDates (searchDates[0][0]). It will not run for all values in searchDates. What I mean is, if I replace it with the following code:
x = np.where(np.logical_and(dateValArr[:,0] > searchDates[0], dateValArr[:,0] < searchDates[0]))
then I get the following error: operands could not be broadcast together with shapes (31,) (3,)
To find the closest value, I was hoping to somehow combine it with the following line of code:
n = (np.abs(dateValArr[:,1] - searchDates[:,2])).argmin()
Any ideas on how to solve it? Thanks in advance.
The only thing that came to my mind is a for loop:
result = np.array([])
for search_term in searchDates:
    mask = (dateValArr[:,0] > search_term[0]) & (dateValArr[:,0] < search_term[1])
    date_search_result = dateValArr[mask, :]
    # distance of each candidate value from this range's target value
    d = np.abs(date_search_result[:,1] - search_term[2])
    result = np.hstack([result, date_search_result[d.argmin()]])
print(result)
I kind of figured it out as well:
date_value = []
for i in searchDates:
    dateidx_arr = np.where(np.logical_and(dateValArr[:,0] >= i[0], dateValArr[:,0] <= i[1]))  # get the indices of the specified date range
    date_arr = dateValArr[dateidx_arr]  # based on the indices, get the dates and values
    value_arr = (np.abs(date_arr[:,1] - i[2])).argmin()  # for those dates, find the index of the closest value
    date_value.append(date_arr[value_arr])  # use that index to get the closest date and value
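Since the data starts out in pandas anyway, here is a sketch of the same idea working directly on a DataFrame (the date and value column names are hypothetical; the per-range loop remains, but the masking and closest-value lookup read more directly):

df = pd.DataFrame({'date': dates, 'value': val_arr})
closest = []
for start, end, target in searchDates:
    in_range = df[(df['date'] >= start) & (df['date'] <= end)]
    # row whose value is closest to this range's target value
    closest.append(in_range.loc[(in_range['value'] - target).abs().idxmin()])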
I wrote this function. The input and expected results are indicated in the docstring.
import numpy
from collections import defaultdict

def summarize_significance(sign_list):
    """Summarizes a series of individual significance data in a list of occurrences.
    For a group of e.g. 5 measurements and two different states, the input data
    has the form:
        sign_list = [[-1, 1],
                     [0, 1],
                     [0, 0],
                     [0, -1],
                     [0, -1]]
    where -1, 0, 1 indicate decrease, no change or increase respectively.
    The result is a list of 3-item lists indicating how many measurements
    decrease, do not change or increase (as list items 0, 1, 2 respectively) for each state:
    returns: [[1, 4, 0], [2, 1, 2]]
    """
    swapped = numpy.swapaxes(sign_list, 0, 1)
    summary = []
    for row in swapped:
        mydd = defaultdict(int)
        for item in row:
            mydd[item] += 1
        summary.append([mydd.get(-1, 0), mydd.get(0, 0), mydd.get(1, 0)])
    return summary
I am wondering if there is a more elegant, efficient way of doing the same thing. Any ideas?
Here's one that uses less code and is probably more efficient, because it iterates through sign_list just once, without calling swapaxes or building a bunch of dictionaries:
summary = [[0, 0, 0] for _ in sign_list[0]]
for row in sign_list:
    for index, sign in enumerate(row):
        summary[index][sign + 1] += 1
return summary
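Wrapped back into the original function signature, a quick check against the docstring example:

def summarize_significance(sign_list):
    summary = [[0, 0, 0] for _ in sign_list[0]]
    for row in sign_list:
        for index, sign in enumerate(row):
            summary[index][sign + 1] += 1
    return summary

print(summarize_significance([[-1, 1], [0, 1], [0, 0], [0, -1], [0, -1]]))
# [[1, 4, 0], [2, 1, 2]]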
No, just more complex ways of doing so.
import itertools
def summarize_significance(sign_list):
    res = []
    for s in zip(*sign_list):
        d = dict((x[0], len(list(x[1]))) for x in itertools.groupby(sorted(s)))
        res.append([d.get(x, 0) for x in (-1, 0, 1)])
    return res
For starters, you could do:
swapped = numpy.swapaxes(sign_list, 0, 1)
summary = []
for row in swapped:
    mydd = {-1: 0, 0: 0, 1: 0}
    for item in row:
        mydd[item] += 1
    summary.append([mydd[-1], mydd[0], mydd[1]])
return summary
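If the goal is elegance through NumPy itself, the counting can also be vectorized end to end. A sketch: transpose so there is one row per state, then count the matches for each sign:

import numpy as np

def summarize_significance(sign_list):
    arr = np.asarray(sign_list).T  # one row per state
    return [[int((row == s).sum()) for s in (-1, 0, 1)] for row in arr]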