Pandas for loop with 2 parameters and changing column values - python

I have 2 columns in a dataframe, one named "day_test" and one named "Temp Column". Some of my values in Temp Column are negative, and I want them to be 1 or 2. I've made a for loop with 2 if statements:
for (i,j) in zip(df['day_test'].astype(int), df['Temp Column'].astype(int)):
if i == 2 and j < 0:
j = 2
if i == 1 and j < 0:
j = 1
I tried printing j so I know the loops are working properly, but the values that I want to change in the dataframe are staying negative.
Thanks

Your code doesn't change the values inside the dataframe, it only changes the j value temporarily.
One way to do it is this:
df['day_test'] = df['day_test'].astype(int)
df['Temp Column'] = df['Temp Column'].astype(int)
df.loc[(df['day_test']==1) & (df['Temp Column']<0),'Temp Column'] = 1
df.loc[(df['day_test']==2) & (df['Temp Column']<0),'Temp Column'] = 2

Related

Subset a row based on the column with similar name

Assuming a pandas dataframe like the one in the picture, I would like to fill the na values based with the value of the other variable similar to it. To be more clear, my variables are
mean_1, mean_2 .... , std_1, std_2, ... min_1, min_2 ...
So I would like to fill the na values with the values of the other columns, but not all the columns, only those whose represent the same metric, in the picture i highligted 2 na values. The first one I would like to fill it with the mean obtain from the variables 'MEAN' at row 2, while the second na I would like to fill it with the mean obtain from variable 'MIN' at row 9. Is there a way to do it?
you can find the unique prefixes, iterate through each and do fillna for subsets seperately
uniq_prefixes = set([x.split('_')[0] for x in df.columns])
for prfx in uniq_prefixes:
mask = [col for col in df if col.startswith(prfx)]
# Transpose is needed because row wise fillna is not implemented yet
df.loc[:,mask] = df[mask].T.fillna(df[mask].mean(axis=1)).T
Yes, it is possible doing it using the loop. Below is the naive approach, but even for fancier ones, it is not much optimisation (at least I don't see them).
for i, row in df.iterrows():
sum_means = 0
n_means = 0
sum_stds = 0
n_stds = 0
fill_mean_idxs = []
fill_std_idxs = []
for idx, item in item.iteritems():
if idx.startswith('mean') and item is None:
fill_mean_idxs.append(idx)
elif idx.startswith('mean'):
sum_means += float(item)
n_means += 1
elif idx.startswith('std') and item is None:
fill_std_idxs.append(idx)
elif idx.startswith('std'):
sum_stds += float(item)
n_stds += 1
ave_mean = sum_means / n_means
std_mean = sum_stds / n_stds
for idx in fill_mean_idx:
df.loc[i, idx] = ave_mean
for idx in fill_std_idx:
df.loc[i, idx] = std_mean

Add a value in a column as a function of the timestamp and another column

The title may not be very clear, but with an example I hope it would make some sense.
I would like to create an output column (called "outputTics"), and put a 1 in it 0.21 seconds after a 1 appears in the "inputTics" column.
As you see, there is no value 0.21 seconds exactly after another value, so I'll put the 1 in the outputTics column two rows after : an example would be at the index 3, there is a 1 at 11.4 seconds so I'm putting an 1 in the output column at 11.6 seconds
If there is a 1 in the "inputTics" column 0.21 second of earlier, do not put a one in the output column : an example would be at the index 1 in the input column
Here is an example of the red column I would like to create.
Here is the code to create the dataframe :
A = pd.DataFrame({"Timestamp":[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,12.1,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9,13.0],
"inputTics":[0,1,0,1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,1,1],
"outputTics":[0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0]})
You can use pd.Timedelta if you can to avoid python rounded numbers if you want
Create the column with zeros.
df['outputTics'] = 0
Define a function set_output_tic in the following manner
def set_output_tic(row):
if row['inputTics'] == 0:
return 0
index = df[df == row].dropna().index
# check for a 1 in input within 0.11 seconds
t = row['Timestamp'] + pd.TimeDelta(seconds = 0.11)
indices = df[df.Timestamp <= t].index
c = 0
for i in indices:
if df.loc[i,'inputTics'] == 0:
c = c + 1
else:
c = 0
break
if c > 0:
df.loc[indices[-1] + 1, 'outputTics'] = 1
return 0
then call the above function using df.apply
temp = df.apply(set_output_tic, axis = 1) # temp is practically useless
This was actually kinda tricky, but by playing with indices in numpy you can do it.
# Set timestamp as index for a moment
A = A.set_index(['Timestamp'])
# Find the timestamp indices of inputTics and add your 0.11
input_indices = A[A['inputTics']==1].index + 0.11
# Iterate through the indices and find the indices to update outputTics
output_indices = []
for ii in input_indices:
# Compare indices to full dataframe's timestamps
# and return index of nearest timestamp
oi = np.argmax((A.index - ii)>=0)
output_indices.append(oi)
# Create column of output ticks with 1s in the right place
output_tics = np.zeros(len(A))
output_tics[output_indices] = 1
# Add it to dataframe
A['outputTics'] = outputTics
# Add condition that if inputTics is 1, outputTics is 0
A['outputTics'] = A['outputTics'] - A['inputTics']
# Clean up negative values
A[A['outputTic']<0] = 0
# The first row becomes 1 because of indexing; change to 0
A = A.reset_index()
A.at[0, 'outputTics'] = 0

Find value and index in panda series where the value increased 5 times

In a panda series it should go through the series and stop if one value has increased 5 times. With a simple example it works so far:
list2 = pd.Series([2,3,3,4,5,1,4,6,7,8,9,10,2,3,2,3,2,3,4])
def cut(x):
y = iter(x)
for i in y:
if x[i] < x[i+1] < x[i+2] < x[i+3] < x[i+4] < x[i+5]:
return x[i]
break
out = cut(list2)
index = list2[list2 == out].index[0]
So I get the correct Output of 1 and Index of 5.
But if I use a second list with series type and instead of (19,) which has (23999,) values then I get the Error:
pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 3489660928
You can do something like this:
# compare list2 with the previous values
s = list2.gt(list2.shift())
# looking at last 5 values
s = s.rolling(5).sum()
# select those equal 5
list2[s.eq(5)]
Output:
10 9
11 10
dtype: int64
The first index where it happens is
s.eq(5).idxmax()
# output 10
Also, you can chain them together:
(list2.gt(list2.shift())
.rolling(5).sum()
.eq(5).idxmax()
)

what can I do to make long to wide format in python

I have this long data. I like to sort this by 30 each and save separately.
Data print like this,
A292340
A291630
A278240
A267770
A267490
A261250
A261110
A253150
A252400
A253250
A243890
A243880
A236350
A233740
A233160
A225800
A225060
A225050
A225040
A225130
A219900
A204450
A204480
A204420
A196030
A196220
A167860
A152500
A123320
A122630
.
This is fairly simple question, but I need your help..
Thank you.
(And how can I make a list out of one results printed? list addtion?
I believe need create MultiIndex by modulo and floor divide np.arange by length of DataFrame and then unstack:
But if length modulo is not equal 0 (e.g. (30 % 12)), last values are not matched to last column and Nones are added:
N = 12
r = np.arange(len(df))
df.index = [r % N, r // N]
df = df['col'].unstack()
print (df)
0 1 2
0 A292340 A236350 A196030
1 A291630 A233740 A196220
2 A278240 A233160 A167860
3 A267770 A225800 A152500
4 A267490 A225060 A123320
5 A261250 A225050 A122630
6 A261110 A225040 None
7 A253150 A225130 None
8 A252400 A219900 None
9 A253250 A204450 None
10 A243890 A204480 None
11 A243880 A204420 None
Setup:
d = {'col': ['A292340', 'A291630', 'A278240', 'A267770', 'A267490', 'A261250', 'A261110', 'A253150', 'A252400', 'A253250', 'A243890', 'A243880', 'A236350', 'A233740', 'A233160', 'A225800', 'A225060', 'A225050', 'A225040', 'A225130', 'A219900', 'A204450', 'A204480', 'A204420', 'A196030', 'A196220', 'A167860', 'A152500', 'A123320', 'A122630']}
df = pd.DataFrame(d)
print (df.head())
col
0 A292340
1 A291630
2 A278240
3 A267770
4 A267490
If you don't have Pandas and Numpy modules you can use this:
Setup:
long_list = ['A292340', 'A291630', 'A278240', 'A267770', 'A267490', 'A261250', 'A261110', 'A253150', 'A252400',
'A253250', 'A243890', 'A243880', 'A236350', 'A233740', 'A233160', 'A225800', 'A225060', 'A225050',
'A225040', 'A225130', 'A219900', 'A204450', 'A204480', 'A204420', 'A196030', 'A196220', 'A167860',
'A152500', 'A123320', 'A122630', 'A292340', 'A291630', 'A278240', 'A267770', 'A267490', 'A261250',
'A261110', 'A253150', 'A252400', 'A253250', 'A243890', 'A243880', 'A236350', 'A233740', 'A233160',
'A225800', 'A225060', 'A225050', 'A225040', 'A225130', 'A219900', 'A204450', 'A204480', 'A204420',
'A196030', 'A196220', 'A167860', 'A152500', 'A123320', 'A122630']
Code:
number_elements_in_sublist = 30
sublists = []
sublists.append([])
sublist_index = 0
for index, element in enumerate(long_list):
sublists[sublist_index].append(element)
if index > 0:
if (index+1) % number_elements_in_sublist == 0:
if index == len(long_list)-1:
break
sublists.append([])
sublist_index += 1
for index, sublist in enumerate(sublists):
print("Sublist Nr." + str(index+1))
for element in sublist:
print(element)

Pandas: Index of last non equal row

I have a pandas data frame F with a sorted index I. I am interested in knowing about the last change in one of the columns, let's say A. In particular, I want to construct a series with the same index as F, namely I, whose value at i is j where j is the greatest index value less than i such that F[A][j] != F[A][i]. For example, consider the following frame:
A
1 5
2 5
3 6
4 2
5 2
The desired series would be:
1 NaN
2 NaN
3 2
4 3
5 3
Is there a pandas/numpy idiomatic way to construct this series?
Try this:
df['B'] = np.nan
last = np.nan
for index, row in df.iterrows():
if index == 0:
continue
if df['A'].iloc[index] != df['A'].iloc[index - 1]:
last = index
df['B'].iloc[index] = last
This will create a new column with the results. I believe that changing the rows as you pass through them is not a good idea, after that you can simply replace a column and delete the other if you wish.
np.argmax or pd.Series.argmax on Boolean data can help you find the first (or in this case, last) True value. You still have to loop over the series in this solution, though.
# Initiate source data
F = pd.DataFrame({'A':[5,5,6,2,2]}, index=list('fobni'))
# Initiate resulting Series to NaN
result = pd.Series(np.nan, index=F.index)
for i in range(1, len(F)):
value_at_i = F['A'].iloc[i]
values_before_i = F['A'].iloc[:i]
# Get differences as a Boolean Series
# (keeping the original index)
diffs = (values_before_i != value_at_i)
if diffs.sum() == 0:
continue
# Reverse the Series of differences,
# then find the index of the first True value
j = diffs[::-1].argmax()
result.iloc[i] = j

Categories