Change values of a column in pandas - Python

I want to update the values in the existing f['ECPM_medio'] column.
I modified the values so that each number is scaled by 0.8 or 0.9. The problem is that when I try to write these new numbers back to the existing column, the same number gets pasted into every row!
import pandas as pd

jf = pd.read_csv("Cliente_x_Pais_Sitio.csv", header=0, sep=",")
del jf['Fill_rate']
del jf['Importe_a_pagar_a_medio']
a = jf.sort_values(by=["Cliente", "Auth_domain", "Sitio", "Country"])
f = a.groupby(["Cliente", "Auth_domain", "Sitio", "Country"], as_index=False)['ECPM_medio'].min()
del a['Fecha']
del a['Subastas']
del a['Impresiones_exchange']
f.to_csv('Recom_Sitios.csv', index=False)

for item in f['ECPM_medio']:
    item = float(item)
    if item <= 0.5:
        item = item * 0.8
    else:
        item = item * 0.9
    item = float("{0:.2f}".format(item))
    item

for item in item:
    f['ECPM_medio'] = item

f.to_csv('Recom_Sitios22.csv', index=False)

It seems to me that you could also do something like this:
import numpy as np

f.loc[:, 'ECPM_medio'] = (f['ECPM_medio'] *
                          np.where(f['ECPM_medio'] <= 0.5, .8, .9)).round(2)
np.where(f['ECPM_medio'] <= 0.5, .8, .9) returns an array the length of your ECPM_medio column, with value .8 or .9 depending on the same-indexed value in f['ECPM_medio']. You can then multiply your DataFrame column by this array. Wrapping the whole expression in parentheses lets you take the resulting Series (i.e. your transformed f['ECPM_medio'] column) and tack on .round(2) to round the column's values to two places.
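To make the mechanics concrete, here is a quick demonstration on a made-up column (the values are arbitrary):

import numpy as np
import pandas as pd

f = pd.DataFrame({'ECPM_medio': [0.30, 0.45, 0.60, 1.20]})
factors = np.where(f['ECPM_medio'] <= 0.5, .8, .9)
# factors is array([0.8, 0.8, 0.9, 0.9])
(f['ECPM_medio'] * factors).round(2)
# 0    0.24
# 1    0.36
# 2    0.54
# 3    1.08
# Name: ECPM_medio, dtype: float64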

You should create a function and then apply it with a lambda.
Example:
def myfunc(item):
    item = float(item)
    if item <= 0.5:
        item = item * 0.8
    else:
        item = item * 0.9
    item = float("{0:.2f}".format(item))
    return item

f['ECPM_medio'] = f['ECPM_medio'].apply(lambda x: myfunc(x))
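Note that the lambda wrapper is optional here: since myfunc already takes a single value, f['ECPM_medio'].apply(myfunc) does the same thing.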

You can do this with a single vectorized operation:

import numpy as np

f['ECPM_medio'] = np.where(f['ECPM_medio'] <= 0.5, f['ECPM_medio'] * 0.8, f['ECPM_medio'] * 0.9)
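If you prefer to stay within pandas, Series.where expresses the same branch. A minimal sketch with the same logic (not from the original answer), operating on the question's f:

# Keep value * 0.8 where the condition holds, otherwise take value * 0.9
f['ECPM_medio'] = (f['ECPM_medio'] * 0.8).where(f['ECPM_medio'] <= 0.5,
                                                f['ECPM_medio'] * 0.9)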

Related

Remove following rows that are above or below the current row['x'] by X amount

I am calculating correlations, and the DataFrame I have needs to be filtered.
I want to remove the rows that follow the current row and are within X of its value, starting with the first row and looping through the DataFrame all the way to the last row.
Example:
df['y'] has the values 50, 51, 52, 53, 54, 55, 70, 71, 72, 73, 74, 75.
If X = 10, the scan would start at 50, see 51, 52, 53, 54, 55 as within the ±10 range, and delete those rows. 70 would stay, as it is not within that range, and the same test would restart at 70, where 71, 72, 73, 74, 75 and their respective rows would be deleted.
With X = 10 the filter would thus leave us with the rows containing 50 and 70.
That would leave me with a clean DataFrame that drops the instances linked to the first instance of what is essentially the same observed period. I tried coding a loop to do that, but I am left with the wrong result and am desperate at this point. Hopefully someone can correct the mistake or point me in the right direction.
df6['index'] = df6.index
df6.sort_values('index')
boom = len(dataframe1.index)/3
# Taking initial comparison values from first row
c = df6.iloc[0]['index']
# Including first row in result
filters = [True]
# Skipping first row in comparisons
for index, row in df6.iloc[1:].iterrows():
    if c - boom <= row['index'] <= c + boom:
        filters.append(False)
    else:
        filters.append(True)
        # Updating values to compare based on latest accepted row
        c = row['index']
df2 = df6.loc[filters].sort_values('correlation').drop('index', axis=1)
df2
(The original question attached before/after screenshots of the output, which are omitted here.)
IIUC, your main issue is filtering consecutive values within a threshold.
You can use a custom function that acts on a Series (= a column) and returns the list of valid indices:

def consecutive(s, threshold=10):
    prev = float('-inf')
    idx = []
    for i, val in s.items():  # Series.iteritems() was removed in pandas 2.0
        if val - prev > threshold:
            idx.append(i)
            prev = val
    return idx
Example of use:

import pandas as pd
df = pd.DataFrame({'y': [50, 51, 52, 53, 54, 55, 70, 71, 72, 73, 74, 75]})
df2 = df.loc[consecutive(df['y'])]

Output:

    y
0  50
6  70
Variant
If you prefer the function to return a boolean indexer, here is a variant (it assigns by position, so it assumes the default 0..n-1 RangeIndex):

def consecutive(s, threshold=10):
    prev = float('-inf')
    idx = [False] * len(s)
    for i, val in s.items():
        if val - prev > threshold:
            idx[i] = True
            prev = val
    return idx
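The boolean list works directly as a mask; a quick sketch on the same toy data:

import pandas as pd
df = pd.DataFrame({'y': [50, 51, 52, 53, 54, 55, 70, 71, 72, 73, 74, 75]})
df2 = df.loc[consecutive(df['y'])]  # boolean indexing keeps rows 0 and 6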

Subset a row based on the column with similar name

Assuming a pandas DataFrame like the one in the picture (not reproduced here), I would like to fill the NA values with the value of the other variables similar to it. To be more clear, my variables are
mean_1, mean_2, ..., std_1, std_2, ..., min_1, min_2, ...
So I would like to fill the NA values with the values of the other columns, but not all the columns, only those that represent the same metric. In the picture I highlighted 2 NA values: the first one I would like to fill with the mean obtained from the 'MEAN' variables at row 2, while the second NA I would like to fill with the mean obtained from the 'MIN' variables at row 9. Is there a way to do it?
You can find the unique prefixes, iterate through each, and do fillna for the subsets separately:

uniq_prefixes = set([x.split('_')[0] for x in df.columns])
for prfx in uniq_prefixes:
    mask = [col for col in df if col.startswith(prfx)]
    # Transpose is needed because row-wise fillna is not implemented yet
    df.loc[:, mask] = df[mask].T.fillna(df[mask].mean(axis=1)).T
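To see what that does, here is a tiny made-up frame with two metrics (hypothetical data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'mean_1': [1.0, np.nan], 'mean_2': [3.0, 4.0],
                   'std_1': [0.5, 0.6], 'std_2': [np.nan, 0.8]})
# After running the loop above: mean_1 in row 1 becomes 4.0 (the row mean of
# the mean_* columns) and std_2 in row 0 becomes 0.5 (the row mean of std_*).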
Yes, it is possible to do it with a loop. Below is a naive approach; even fancier ones do not offer much optimisation (at least I don't see any).

for i, row in df.iterrows():
    sum_means = 0
    n_means = 0
    sum_stds = 0
    n_stds = 0
    fill_mean_idxs = []
    fill_std_idxs = []
    for idx, item in row.items():
        if idx.startswith('mean') and pd.isna(item):
            fill_mean_idxs.append(idx)
        elif idx.startswith('mean'):
            sum_means += float(item)
            n_means += 1
        elif idx.startswith('std') and pd.isna(item):
            fill_std_idxs.append(idx)
        elif idx.startswith('std'):
            sum_stds += float(item)
            n_stds += 1
    ave_mean = sum_means / n_means
    ave_std = sum_stds / n_stds
    for idx in fill_mean_idxs:
        df.loc[i, idx] = ave_mean
    for idx in fill_std_idxs:
        df.loc[i, idx] = ave_std

Cover all columns using the least amount of rows in a pandas dataframe

I have a pandas DataFrame looking like the following picture (not reproduced here).
The goal here is to select the least amount of rows that together have a "1" in all columns. In this scenario, the final selection should be the two rows shown in the second picture.
The algorithm should work even if I add columns and rows. It should also work if I change the combination of 1 and 0 in any given row.
Use sum per row, then compare by Series.ge (>=) for greater-or-equal, and filter by boolean indexing:

df[df.sum(axis=1).ge(2)]

If you want to test for 1 or 0 values explicitly, first compare by DataFrame.eq for equal (==):

df[df.eq(1).sum(axis=1).ge(2)]
df[df.eq(0).sum(axis=1).ge(2)]
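For instance, on a small made-up frame (hypothetical data):

import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1], 'B': [1, 1, 0], 'C': [0, 1, 0]})
df[df.eq(1).sum(axis=1).ge(2)]  # keeps rows 0 and 1, which have at least two 1s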
For those interested, this is how I managed to do it:

def _getBestRowsFinalSelection(self, df, cols):
    """
    Get the selected rows for the final selection

    Parameters:
        1. df: DataFrame to use
        2. cols: Columns of the binary variables in the DataFrame object (df)

    RETURNS -> DataFrame : dfSelected
    """
    isOne = df.loc[df[df.loc[:, cols] == 1].sum(axis=1) > 0, :]
    lstIsOne = isOne.loc[:, cols].values.tolist()
    lstIsOne = [(x, lstItem) for x, lstItem in zip(isOne.index.values.tolist(), lstIsOne)]
    winningComb = None
    stopFlag = False
    for i in range(1, isOne.shape[0] + 1):
        if stopFlag:
            break
        combs = combinations(lstIsOne, i)  # from itertools
        for c in combs:
            data = [x[1] for x in c]
            index = [x[0] for x in c]
            dfTmp = pd.DataFrame(data=data, columns=cols, index=index)
            if (dfTmp.sum() > 0).all():
                dfTmp["Final Selection"] = "Yes"
                winningComb = dfTmp
                stopFlag = True
                break
    return winningComb
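A minimal usage sketch with hypothetical data (the function above is written as a class method, so drop self or call it on the owning object):

from itertools import combinations
import pandas as pd

# No single row covers all columns here, but rows 0 and 1 together do.
df = pd.DataFrame({'A': [1, 0], 'B': [0, 1], 'C': [1, 0]})
# winning = _getBestRowsFinalSelection(df, ['A', 'B', 'C'])

Note that this tries every combination of rows, so the cost grows exponentially with the number of candidate rows; it is exact, but only practical for small frames.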

Replacing 'NA's in a nested list

I am trying to do the following: identify whether there is an 'NA' value in a nested list and, if so, replace it with the average of the other elements of that list. The elements of the lists should be floats. For example:
[["1.2","3.1","0.2"],["44.0","NA","90.0"]]
should return
[[1.2, 3.1, 0.2], [44.0, 67.0, 90.0]]
The code below, albeit long and redundant, works:
def convert_data(data):
    first = []
    second = []
    third = []
    fourth = []
    count = 0
    for i in data:
        for y in i:
            if 'NA' not in i:
                y = float(y)
                first.append(y)
            elif 'NA' in i:
                a = i.index('NA')
                second.append(y)
    second[a] = 0
    for q in second:
        q = float(q)
        third.append(q)
        count += q
    length = len(third)
    count = count/(length-1)
    third[a] = count
    fourth.extend([first, third])
    return fourth
for example:
data = [["1.2","3.1","0.2"],["44.0","NA","90.0"]]
convert_data(data)
returns the desired output:
[[1.2, 3.1, 0.2], [44.0, 67.0, 90.0]]
but if the 'NA' is in the first list e.g.
data = [["1.2","NA","0.2"],["44.0","67.00","90.0"]]
then it doesn't. Can someone please explain how to fix this?
data_var = [["1.2", "3.1", "0.2"], ["44.0", "NA", "90.0"]]
def replace_na_with_mean(list_entry):
for i in range(len(list_entry)):
index_list = []
m = 0
while 'NA' in list_entry[i]:
index_list.append(list_entry[i].index('NA') + m)
del list_entry[i][list_entry[i].index('NA')]
if list_entry[i]:
for n in range(len(list_entry[i])):
list_entry[i][n] = float(list_entry[i][n])
if index_list:
if list_entry[i]:
avg = sum(list_entry[i]) / len(list_entry[i])
else:
avg = 0
for l in index_list:
list_entry[i].insert(l, avg)
return list_entry
print(replace_na_with_mean(data_var))
I'd suggest using pandas functionality, since these types of operations are exactly what pandas was developed for. One can achieve what you want in just a few lines of code:

import numpy as np
import pandas as pd

data = [["1.2","NA","0.2"],["44.0","67.00","90.0"]]
df = pd.DataFrame(data).T.replace("NA", np.nan).astype(float)
res = df.fillna(df.mean()).T.values.tolist()

which returns the wanted output:

[[1.2, 0.7, 0.2], [44.0, 67.0, 90.0]]
Btw, your code works just fine for me in this simple case:

convert_data(data)
> [[44.0, 67.0, 90.0], [1.2, 0.7, 0.2]]

It will definitely start failing or giving faulty results in more complicated cases, e.g. if you have more than one "NA" value in the nested list you will get a ValueError (you will be trying to convert the string into a float).
This should do the trick, using numpy:

import numpy as np

x = [["1.2","3.1","0.2"],["44.0","NA","90.0"]]
# Convert to float ("NA" becomes nan)
x = np.char.replace(np.array(x), "NA", "nan").astype(float)
# Replace nan-s with the row means
mask = np.isnan(x)
x[mask] = np.nanmean(x, axis=1)[mask.any(axis=1)]

Output:

[[ 1.2  3.1  0.2]
 [44.  67.  90. ]]
One reason why your code ended up a little overcomplicated is that you tried to start by solving the problem of a "nested list." But really, all you need is a function that processes a list of numeric strings with some "NA" values, and then you can just apply that function to every item in the list.
def float_or_average(list_of_num_strings):
    # First, convert every item that you can to a number. You need to do this
    # before you can handle even ONE "NA" value, because the "NA" values need
    # to be replaced with the average of all the numbers in the collection.
    # So for now, convert ["1.2", "NA", "2.0"] to [1.2, "NA", 2.0].
    parsed = []
    # While we're at it, let's record the sum of the floats and their count,
    # so that we can compute that average.
    numeric_sum = 0.0
    numeric_count = 0
    for item in list_of_num_strings:
        if item == "NA":
            parsed.append(item)
        else:
            floating_point_value = float(item)
            parsed.append(floating_point_value)
            numeric_sum += floating_point_value
            numeric_count += 1
    # Now we can calculate the average:
    average = numeric_sum / numeric_count
    # And replace the "NA" values with it.
    for i, item in enumerate(parsed):
        if item == "NA":
            parsed[i] = average
    return parsed
    # Or, with a list comprehension (replacing the previous four lines of
    # code):
    # return [number if number != "NA" else average for number in parsed]
# Using this function on a nested list is as easy as:
example_data = [["1.2", "3.1", "0.2"], ["44.0", "NA", "90.0"]]
parsed_nested_list = []
for sublist in example_data:
    parsed_nested_list.append(float_or_average(sublist))

# Or, using a list comprehension (replacing the previous three lines of code):
parsed_nested_list = [float_or_average(sublist) for sublist in example_data]
def convert_data(data):
    for lst in data:
        total = 0  # renamed from `sum` to avoid shadowing the built-in
        index_na = list()
        for elem in range(len(lst)):
            if lst[elem] != 'NA':
                total += float(lst[elem])
                lst[elem] = float(lst[elem])
            else:
                index_na.append(elem)
        if len(index_na) > 0:
            len_values = total / (len(lst) - len(index_na))
            for i in index_na:
                lst[i] = float("{0:.2f}".format(len_values))
    return data
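Checking it against the case the question says was failing (the "NA" in the first sublist):

data = [["1.2", "NA", "0.2"], ["44.0", "67.00", "90.0"]]
convert_data(data)
# [[1.2, 0.7, 0.2], [44.0, 67.0, 90.0]]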

Find pandas quartiles based on another column

I have a dataframe:
Av_Temp Tot_Precip
278.001 0
274 0.0751864
270.294 0.631634
271.526 0.229285
272.246 0.0652201
273 0.0840059
270.463 0.0602944
269.983 0.103563
268.774 0.0694555
269.529 0.010908
270.062 0.043915
271.982 0.0295718
I want to find the percentile values (25%, 50%, 75%) for the column: 'Tot_Precip' for each decile (top 10%, next 10% ...) of values from the column: Av_Temp. Currently, I am doing this:
import numpy, pandas, pdb

expl_var = 'Av_Temp'
cname = 'Tot_Precip'
num_samples = 10.0
max_val = df[expl_var].max()
min_val = df[expl_var].min()
expl_bins = numpy.linspace(min_val, max_val, num=int(num_samples))
for index, val in enumerate(expl_bins):
    print(index)
    if index < (len(expl_bins) - 1):
        cur_val = val
        nxt_val = expl_bins[index + 1]
        # Subset dataframe to rows with values of expl_var between
        # cur_val and nxt_val
        sub_ind_df = df[(df[expl_var] >= cur_val) & (df[expl_var] <= nxt_val)]
        sub_ind_df[cname + '_quartiles'] = pandas.qcut(sub_ind_df[cname], 4)
        # Merge with sub_df
        pdb.set_trace()
The answer could be something like:
Av_Temp_decile Tot_Precip_25 Tot_Precip_50 Tot_Precip_75
270 - 272 0.03 0.05 0.08
I'm only splitting your data into halves rather than deciles here, due to the small example dataset, but everything should work the same if you just increase the number of bins in the initial cut:

# Change this to 10 to get deciles
df['Temp_Halves'] = pd.qcut(df['Av_Temp'], 2)

def get_quartiles(group):
    # Add retbins=True to get the bin edges
    qs, bins = pd.qcut(group['Tot_Precip'], [.25, .5, .75], retbins=True)
    # Returning a Series from a function means groupby.apply() will
    # expand it into separate columns
    return pd.Series(bins, index=['Precip_25', 'Precip_50', 'Precip_75'])

df.groupby('Temp_Halves').apply(get_quartiles)
Out[21]:
                    Precip_25  Precip_50  Precip_75
Temp_Halves
[268.774, 270.995]   0.048010   0.064875   0.095036
(270.995, 278.001]   0.038484   0.070203   0.081801
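An alternative sketch (not from the original answer, and assuming the same df): SeriesGroupBy.quantile accepts a list of quantiles directly, which skips the inner qcut, and unstack() spreads them into columns:

df['Temp_Halves'] = pd.qcut(df['Av_Temp'], 2)
out = df.groupby('Temp_Halves')['Tot_Precip'].quantile([.25, .5, .75]).unstack()
# Columns come out as 0.25, 0.50, 0.75; rename them if you want Precip_25 etc.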
