Pandas assignment using nested loops leading to memory error - python

I am using pandas and trying to do an assignment using nested loops. I iterate over a dataframe and run a distance function when a row meets a certain criterion. I am faced with two problems:
1. SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
2. MemoryError: it doesn't work on large datasets, and I end up having to terminate the process.
How should I change my solution to ensure it can scale to a larger dataset of 60,000 rows?
for i, row in df.iterrows():
    listy = 0
    school = []
    if row['LS_Type'] == 'Primary (1-4)':
        a = row['Northing']
        b = row['Easting']
        LS_ID = row['LS_ID']
        for j, row2 in df.iterrows():
            if row2['LS_Type'] == 'Primary (1-8)':
                dist_km = distance(a, b, df.Northing[j], df.Easting[j])
                if listy == 0:
                    listy = dist_km
                    school.append([df.LS_Name[j], df.LS_ID[j]])
                else:
                    if dist_km < listy:
                        listy = dist_km
                        school[0] = [df.LS_Name[j], int(df.LS_ID[j])]
        df['dist_up_prim'][i] = listy
        df["closest_up_prim"][i] = school[0]
    else:
        df['dist_up_prim'][i] = 0

The double for loop is what's killing you here. See if you can break it up into two separate apply steps.
Here is a toy example of using df.apply() and partial to replace a nested for loop:
import math
import pandas as pd
import numpy as np
from functools import partial

df = pd.DataFrame.from_dict({'A': [1, 2, 3, 4, 5, 6, 7, 8],
                             'B': [1, 2, 3, 4, 5, 6, 7, 8]})

def myOtherFunc(row):
    # Inner step: B*A for rows with A <= 4, NaN otherwise.
    if row['A'] <= 4:
        return row['B'] * row['A']
    return np.nan

def myFunc(the_df, row):
    # Outer step: for rows with A <= 2, average myOtherFunc over the whole frame.
    if row['A'] <= 2:
        other_B = the_df.apply(myOtherFunc, axis=1)
        return other_B.mean()
    return np.nan

apply_myFunc_on_df = partial(myFunc, df)
df.apply(apply_myFunc_on_df, axis=1)
You can rewrite your code in this form, which will be much faster.
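Applied to the question itself, a minimal sketch of that pattern might look like this (assuming distance is your existing four-argument helper and df has the columns shown above). Assigning whole columns at once also avoids the SettingWithCopyWarning, which comes from the chained df['col'][i] indexing:
import pandas as pd
from functools import partial

def nearest_up_prim(the_df, row):
    # Rows that are not 'Primary (1-4)' get distance 0, as in your code.
    if row['LS_Type'] != 'Primary (1-4)':
        return 0
    candidates = the_df[the_df['LS_Type'] == 'Primary (1-8)']
    dists = candidates.apply(
        lambda r: distance(row['Northing'], row['Easting'],
                           r['Northing'], r['Easting']),
        axis=1)
    return dists.min()

df['dist_up_prim'] = df.apply(partial(nearest_up_prim, df), axis=1)
The closest_up_prim column can be filled the same way, using dists.idxmin() to recover the matching LS_Name and LS_ID instead of dists.min().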

Related

count how many times a record appears in a pandas dataframe and create a new feature with this counter

I have these two dataframes, df_t and df_u. I want to count how many times a record in the text feature appears, and I want to create a new feature in df_u that associates this counter to each id. So id_u = 1 and id_u = 2 will both have counter = 3, since "hello" appears 3 times in df_t and both published a post with "hello" in the text.
import pandas as pd
import numpy as np
df_t = pd.DataFrame({'id_t': [0, 1, 2, 3, 4], 'id_u': [1, 1, 3, 2, 2], 'text': ["hello", "hello", "friend", "hello", "my"]})
print(df_t)
df_u = pd.DataFrame({'id_u': [1, 2, 3]})
print()
print(df_u)
df_u_new = pd.DataFrame({'id_u': [1, 2, 3], 'counter': [3, 3, 1]})
print()
print(df_u_new)
The code I wrote so far is below, but it is very slow, and since I have a very large dataset it is impractical.
user_counter_dict = {}
tmp = dict(df_t["text"].value_counts())
# to speed up the process we set the text column as the index
df_t.set_index(["text"], inplace=True)
for text, counter in tmp.items():
    # this is slow and takes most of the time
    uniques_id = df_t.loc[[text], "id_u"].unique()
    for elem in uniques_id:
        value = user_counter_dict.setdefault(str(elem), counter)
        if value < counter:
            user_counter_dict[str(elem)] = counter
# and now I will put the data from the dict into a new column in df_u
Is there a very fast way to compute this?
You can do:
df_u_new = df_t.assign(counter=df_t["text"].map(df_t["text"].value_counts()))[
    ["id_u", "counter"]
].groupby("id_u", as_index=False).max()
Get the value_counts of text, map them onto each row, then groupby id_u and take the maximum value, which is what you were trying to get IIUC.
print(df_u_new)
   id_u  counter
0     1        3
1     2        3
2     3        1
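For readability, the same chain can be split into equivalent steps:
counts = df_t["text"].value_counts()        # "hello" -> 3, "friend" -> 1, "my" -> 1
df_t["counter"] = df_t["text"].map(counts)  # each row gets the count of its own text
df_u_new = df_t.groupby("id_u", as_index=False)["counter"].max()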

How to speed up nested loop and add condition?

I am trying to speed up my nested loop; it currently takes 15 minutes for 100k customers.
I am also having trouble adding an additional condition: only states A, B, and C should be multiplied by the lookup2 value, while the rest are multiplied by 1.
import numpy as np
import pandas as pd

customer_data = pd.DataFrame({"cust_id": [1, 2, 3, 4, 5, 6, 7, 8],
                              "state": ['B', 'E', 'D', 'A', 'B', 'E', 'C', 'A'],
                              "cust_amt": [1000, 300, 500, 200, 400, 600, 200, 300],
                              "year": [3, 3, 4, 3, 4, 2, 2, 4],
                              "group": [10, 25, 30, 40, 55, 60, 70, 85]})
state_list = ['A', 'B', 'C', 'D', 'E']
# All lookups should be dataframes with the year and/or group and the value, like these.
lookup1 = pd.DataFrame({'year': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                        'lim %': 0.1})
lookup2 = pd.concat([pd.DataFrame({'group': g, 'lookup_val': 0.1, 'year': range(1, 11)})
                     for g in customer_data['group'].unique()])
multi_data = np.arange(250).reshape(10, 5, 5)
lookups = [lookup1, lookup2]
# Preprocessing.
# Transform the state to a categorical code to use it as an array index.
customer_data['state'] = pd.Categorical(customer_data['state'],
                                        categories=state_list,
                                        ordered=True).codes
# Set index on lookups.
for i in range(len(lookups)):
    if 'group' in lookups[i].columns:
        lookups[i] = lookups[i].set_index(['year', 'group'])
    else:
        lookups[i] = lookups[i].set_index(['year'])
The calculation:
results = {}
for customer, state, amount, start, group in customer_data.itertuples(name=None, index=False):
    for year in range(start, len(multi_data) + 1):
        if year == start:
            results[customer] = [[amount * multi_data[year - 1, state, :]]]
        else:
            # '@' is matrix multiplication: roll last year's vector forward.
            results[customer].append([results[customer][-1][-1] @ multi_data[year - 1]])
        for lookup in lookups:
            if isinstance(lookup.index, pd.MultiIndex):
                value = lookup.loc[(year, group)].iat[0]
            else:
                value = lookup.loc[year].iat[0]
            results[customer][-1].append(value * results[customer][-1][-1])
Example of expected output:
{1: [[array([55000, 56000, 57000, 58000, 59000]),
      array([5500., 5600., 5700., 5800., 5900.]),
      array([550., 560., 570., 580., 590.])], ...
You could use multiprocessing if you have more than one CPU.
import multiprocessing as mp

def get_customer_data(data_tuple) -> dict:
    results = {}
    customer, state, amount, start, group = data_tuple
    for year in range(start, len(multi_data) + 1):
        if year == start:
            results[customer] = [[amount * multi_data[year - 1, state, :]]]
        else:
            results[customer].append([results[customer][-1][-1] @ multi_data[year - 1]])
        for lookup in lookups:
            if isinstance(lookup.index, pd.MultiIndex):
                value = lookup.loc[(year, group)].iat[0]
            else:
                value = lookup.loc[year].iat[0]
            results[customer][-1].append(value * results[customer][-1][-1])
    return results

p = mp.Pool(mp.cpu_count())
# Pool.map() takes a function and an iterable like a list or generator
results_list = p.map(get_customer_data,
                     [data_tuple for data_tuple in customer_data.itertuples(name=None, index=False)])
# results_list is a list of dicts
results_dict = {k: v for x in results_list for k, v in x.items()}
p.close()
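One caveat: on platforms that spawn processes rather than fork (e.g. Windows), the Pool setup must sit under a main guard, or the workers will re-execute the module on import:
if __name__ == "__main__":
    with mp.Pool(mp.cpu_count()) as p:
        results_list = p.map(get_customer_data,
                             list(customer_data.itertuples(name=None, index=False)))
    results_dict = {k: v for x in results_list for k, v in x.items()}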
Glad to see you posting this! As promised, my thoughts:
Pandas works with columns very well. What you need to do is remove the need for loops as much as possible (in your case I would get rid of the main loop over customers and keep the year and lookups loops).
To do this, forget about the results dict for now. You want to do the calculations directly on the DataFrame. For example, your first calculation would become something like the pseudo-code below (a runnable sketch follows at the end):
customer_data['meaningful_column_name'] = [[amount * multi_data[customer_data['year']-1, customer_data['state'], :]]]
For your lookups loop, just be aware that the if statement will be looking at entire columns.
Finally, as you seem to want your data as a list of arrays, you will need some formatting to extract it from the DataFrame structure.
I hope that makes some sense.
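As a runnable sketch of that first step (the names years, states, amounts and the column first_year are mine, and customer_data is the preprocessed frame from the question, where state already holds categorical codes): numpy fancy indexing picks one (year, state) slice of multi_data per customer, so the loop over customers disappears.
import numpy as np

years = customer_data['year'].to_numpy()
states = customer_data['state'].to_numpy()
amounts = customer_data['cust_amt'].to_numpy()

# One (5,) slice per customer, all selected in a single indexing operation.
first_year = amounts[:, None] * multi_data[years - 1, states, :]  # shape (n_customers, 5)
customer_data['first_year'] = list(first_year)  # one array per row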

Python Series Sum Resample

Given a list [5,2,4,5,1,2,4,5], how do I do a sum resample like pandas df.resample().sum() without the hassle of creating a DatetimeIndex?
inp = [5,2,4,5,1,2,4,5]
out = resample(inp, 2, how='sum')
>> [7, 9, 3, 9]
Note: this is because df.resample().sum() only accepts a datetime-like index. I have spent some time googling this topic but found nothing. Sorry if a similar question already exists.
Edit:
A manual solution might look like this:
import numpy as np

def resample_sum(inp, window):
    # vectorized version: reshape into (n_windows, window) and sum each row;
    # requires len(inp) to be a multiple of window
    return np.sum(np.reshape(inp, (len(inp) // window, window)), axis=1)

def resample(inp_list, window_size, how='sum'):
    output = []
    for i in range(0, len(inp_list), window_size):
        window = inp_list[i:i + window_size]
        if how == 'sum':
            output.append(sum(window))
        else:
            raise NotImplementedError  # replace this with other how's you want
    return output

inp = [5, 2, 4, 5, 1, 2, 4, 5]
out = resample(inp, 2, how='sum')
# [7, 9, 3, 9]
Edit 1: A vectorized numpy solution, which will perform better on a huge array. The idea is to reshape the array into a 2-d array where each row holds the values that should be summed together.
import numpy as np

def resample(inp_array, window_size, how='sum'):
    inp_array = np.asarray(inp_array)
    # check how many zeros need to be added to the end to make
    # the array length a multiple of window_size
    pad = (window_size - (inp_array.size % window_size)) % window_size
    if pad > 0:
        inp_array = np.r_[np.ndarray.flatten(inp_array), np.zeros(pad)]
    else:
        inp_array = np.ndarray.flatten(inp_array)
    # reshape so that the number of columns = window_size
    inp_windows = inp_array.reshape((inp_array.size // window_size, window_size))
    if how == 'sum':
        # sum across columns
        return np.sum(inp_windows, axis=1)
    else:
        raise NotImplementedError  # replace this with other how's you want

inp = [5, 2, 4, 5, 1, 2, 4, 5]
out = resample(inp, 2, how='sum')
# [7, 9, 3, 9]
Edit 2:
The closest thing to this I found in a popular library is skimage.measure.block_reduce.
You can treat your data as a 1-dimensional image, pass a block size, and pass the np.sum function.
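For example (assuming scikit-image is installed; block_reduce also zero-pads a trailing partial window, matching the manual solution above):
import numpy as np
from skimage.measure import block_reduce

inp = np.array([5, 2, 4, 5, 1, 2, 4, 5])
out = block_reduce(inp, block_size=(2,), func=np.sum)
# array([7, 9, 3, 9])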

Remove values from numpy array closer to each other

I actually want to remove elements from a numpy array that are too close to each other. For example, given the array [1,2,10,11,18,19], I need code that outputs [1,10,18], because 2 is too close to 1, and so on.
The following is an additional solution using numpy functionality, specifically np.ediff1d, which computes the differences between consecutive elements of an array. The threshold is the value of the th variable.
import numpy as np

a = np.array([1, 2, 10, 11, 18, 19])
th = 1
b = np.delete(a, np.argwhere(np.ediff1d(a) <= th) + 1)  # [1, 10, 18]
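Step by step, the +1 is what shifts each flagged gap onto the second element of the close pair, which is the element np.delete removes:
np.ediff1d(a)                         # array([1, 8, 1, 7, 1]): gaps between neighbours
np.argwhere(np.ediff1d(a) <= th)      # array([[0], [2], [4]]): positions of small gaps
np.argwhere(np.ediff1d(a) <= th) + 1  # indices 1, 3, 5 (values 2, 11, 19) get deleted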
Here is a simple function to find the first value of each run of consecutive values in a 1D numpy array.
import numpy as np

def find_consec(a, step=1):
    vals = []
    for i, x in enumerate(a):
        if i == 0:
            diff = a[i + 1] - x
            if diff == step:
                vals.append(x)
        elif i < a.size - 1:
            diff = a[i + 1] - x
            if diff > step:
                vals.append(a[i + 1])
    return np.array(vals)
a = np.array([1,2,10,11,18,19])
find_consec(a) # [1, 10, 18]
Welcome to Stack Overflow. Below is code that can answer your question:
def closer(arr, cozy):
    result = []
    result.append(arr[0])
    for i in range(1, len(arr)):
        if arr[i] - result[-1] > cozy:
            result.append(arr[i])
    print(result)
Example:
a = [6,10,7,20,21,16,14,3,2]
a.sort()
closer(a,1)
Output: [2, 6, 10, 14, 16, 20]
closer(a,3)
Output: [2, 6, 10, 14, 20]

filter a pandas data frame on all rows that do NOT meet a condition [duplicate]

This question already has answers here:
How can I obtain the element-wise logical NOT of a pandas Series?
(6 answers)
Closed 6 years ago.
This seems simple, but I can't figure it out. I know how to filter a pandas data frame to all rows that meet a condition, but when I want the opposite, I keep getting weird errors.
Here is the example. (Context: a simple board game where pieces sit on a grid; given a coordinate, I want to return all adjacent pieces, but NOT the piece at that exact coordinate.)
import pandas as pd
import numpy as np
df = pd.DataFrame([[5, 7, 'wolf'],
                   [5, 6, 'cow'],
                   [8, 2, 'rabbit'],
                   [5, 3, 'rabbit'],
                   [3, 2, 'cow'],
                   [7, 5, 'rabbit']],
                  columns=['lat', 'long', 'type'])
coords = [5, 7]  # the coordinate I'm testing, a wolf
view = df[((coords[0] - 1) <= df['lat']) & (df['lat'] <= (coords[0] + 1))
          & ((coords[1] - 1) <= df['long']) & (df['long'] <= (coords[1] + 1))]
view = view[not ((coords[0] == view['lat']) & (coords[1] == view['long']))]
print(view)
I thought the not should just negate the boolean inside the parentheses that followed, but this doesn't seem to be how it works.
I want it to return the cow at 5,6 but NOT the wolf at 5,7 (because that's the current piece). Just to double check my logic, I did
me = view[(coords[0] == view['lat']) & (coords[1] == view['long'])]
print(me)
and this returned just the wolf, as I'd expected. So why can't I just put a not in front of that and get everything else? Or, more importantly, what do I do instead to get everything else?
Since numpy (and therefore pandas) uses bitwise operators for element-wise logic, you should replace not with ~. This is also the reason you use & rather than and.
import pandas as pd
df = pd.DataFrame({'a': [1, 2]})
print(df[~(df['a'] == 1)])
>> a
1 2
And using your example:
import pandas as pd
import numpy as np
df = pd.DataFrame([[5, 7, 'wolf'],
                   [5, 6, 'cow'],
                   [8, 2, 'rabbit'],
                   [5, 3, 'rabbit'],
                   [3, 2, 'cow'],
                   [7, 5, 'rabbit']],
                  columns=['lat', 'long', 'type'])
coords = [5, 7]  # the coordinate I'm testing, a wolf
view = df[((coords[0] - 1) <= df['lat']) & (df['lat'] <= (coords[0] + 1))
          & ((coords[1] - 1) <= df['long']) & (df['long'] <= (coords[1] + 1))]
view = view[~((coords[0] == view['lat']) & (coords[1] == view['long']))]
print(view)
>> lat long type
1 5 6 cow
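One note on operator precedence: ~ binds more tightly than ==, so the parentheses around the comparison are required; naming the mask first keeps it readable:
mask = (coords[0] == view['lat']) & (coords[1] == view['long'])
view = view[~mask]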
