Slicing a DataFrame by checking consecutive elements [duplicate] - python

This question already has answers here:
Pandas: Drop consecutive duplicates
(8 answers)
Closed 4 years ago.
I have a DataFrame indexed by time, and one of its columns (with 2 distinct values) looks like [x,x,y,y,x,x,x,y,y,y,y,x]. I want to slice this DataFrame so that I get this column without repeated consecutive values. In this example: [x,y,x,y,x], where each value is the first of its run.
Still trying to figure it out...
Thanks!!

Assuming you have a df like below:
df=pd.DataFrame(['x','x','y','y','x','x','x','y','y','y','y','x'])
We use shift to compare each element with the previous one and keep only the rows where they differ:
df[df[0].shift()!=df[0]]
Out[142]:
0
0 x
2 y
4 x
7 y
11 x
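Since the question describes a time-indexed DataFrame with more columns, the same boolean mask can slice the whole frame rather than just the one column. A minimal sketch, assuming the column of interest is named 'col' (the names here are made up):
import pandas as pd

df = pd.DataFrame({'col': list('xxyyxxxyyyyx'),
                   'other': range(12)})      # stand-in for the real columns
mask = df['col'] != df['col'].shift()        # True at the first row of each run
first_of_each_run = df[mask]                 # keeps whole rows and the original index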

You can just loop through and save the last element used:
df = pd.DataFrame(['x','x','y','y','x','x','x','y','y','y','y','x'])
kept = [df[0].iloc[0]]            # the first element always starts a run
old = df[0].iloc[0]               # remember the last value kept
for item in df[0].iloc[1:]:       # walk down the column, not across the columns
    if item != old:               # a new run starts here
        kept.append(item)
        old = item
df2 = pd.DataFrame(kept)          # ['x', 'y', 'x', 'y', 'x']
EDIT:
Or, for a plain list, use itertools.groupby:
>>> L=[1,1,1,1,1,1,2,3,4,4,5,1,2]
>>> from itertools import groupby
>>> [x[0] for x in groupby(L)]
[1, 2, 3, 4, 5, 1, 2]
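The same idea carries back to the pandas column from the question, since groupby iterates the Series values directly (a small sketch, using the column named 0 as above):
>>> import pandas as pd
>>> from itertools import groupby
>>> df = pd.DataFrame(['x','x','y','y','x','x','x','y','y','y','y','x'])
>>> [k for k, _ in groupby(df[0])]
['x', 'y', 'x', 'y', 'x']
Note that unlike the shift approach, this returns plain values and loses the time index.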

Search common items in 2 lists in pandas dataframe [duplicate]

This question already has answers here:
Find intersection of two nested lists?
(21 answers)
Closed 2 years ago.
I have 2 columns (input & target) in a pandas DataFrame, each holding a list. The goal is to find how many items the two lists have in common and save the result in a new column "common".
For the 1st row, only 'ELITE' is in common. For the 2nd row, both 'ACURAWATCH' & 'PLUS' exist in both lists.
Input:
frame = pd.DataFrame({'input' : [['MINIVAN','TOURING','ELITE'], ['4D','SUV','ACURAWATCH','PLUS']], 'target' : [['MINI','TOUR','ELITE'], ['ACURAWATCH','PLUS']]})
Expected Output:
frame = pd.DataFrame({'input' : [['MINIVAN','TOURING','ELITE'], ['4D','SUV','ACURAWATCH','PLUS']], 'target' : [['MINI','TOUR','ELITE'], ['ACURAWATCH','PLUS']], 'common' :[1, 2]})
You can use set.intersection with df.apply:
In [4307]: frame["common"] = frame.apply(
               lambda x: len(set(x["input"]).intersection(set(x["target"]))), 1)
In [4308]: frame
Out[4308]:
input target common
0 [MINIVAN, TOURING, ELITE] [MINI, TOUR, ELITE] 1
1 [4D, SUV, ACURAWATCH, PLUS] [ACURAWATCH, PLUS] 2
You can apply a custom function with np.intersect1d:
import numpy as np
frame['common'] = frame.apply(lambda x: len(np.intersect1d(x['input'], x['target'])), axis=1)
You could also use a list comprehension; note this assumes the frame holds only the input and target columns:
frame['common'] = [len(set(x) & set(y)) for x, y in frame.to_numpy()]
print(frame)
Output
input target common
0 [MINIVAN, TOURING, ELITE] [MINI, TOUR, ELITE] 1
1 [4D, SUV, ACURAWATCH, PLUS] [ACURAWATCH, PLUS] 2

Pandas DataFrame: multiply values in a column, based on condition [duplicate]

This question already has an answer here:
Pandas: update a column with an if statement
(1 answer)
Closed 3 years ago.
Hi, I have a DataFrame column like the following:
dataframe['BETA'], which holds float numbers between 0 and 100.
I need all the numbers to have the same number of digits. Example:
dataframe['BETA']:
[0] 0.11 to [0] 110
[1] 1.54 to [1] 154
[2] 22.1 to [2] 221
I tried to change them one by one, but it's a super inefficient process:
for i in range(len(df_ld)):
    nbeta = df_ld['BETA'][i]
    if nbeta < 1:
        val = nbeta
        val = val * 1000
        df_ld.loc[i, 'BETA'] = val
    if (nbeta >= 1) and (nbeta <= 10):
        val = nbeta
        val = val * 100
        df_ld.loc[i, 'BETA'] = val
    if (nbeta > 10) and (nbeta <= 100):
        val = nbeta
        val = val * 10
        df_ld.loc[i, 'BETA'] = val
    # print('%.f >10, %.f new value' % (nbeta, val))
Note: the DataFrame has more than 80k elements.
Please help!
Edited: Solution
numpy.select
import numpy as np

x = df_ld['BETA']
condlist = [x < 1, (x >= 1) & (x < 10), (x >= 10) & (x < 100)]
choicelist = [x * 1000, x * 100, x * 10]
output = np.select(condlist, choicelist)
df_ld.insert(4, 'BETA3', output, True)
Thank you!
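As an aside, the three branches are just powers of ten picked by the value's magnitude, so floor(log10(x)) can collapse them into a single vectorized expression. A sketch, assuming every BETA value is strictly positive and below 100, as the question states:
import numpy as np

x = df_ld['BETA']
# 2 - floor(log10(x)) gives 3 for x < 1, 2 for 1 <= x < 10, 1 for 10 <= x < 100
df_ld['BETA3'] = x * 10.0 ** (2 - np.floor(np.log10(x)))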
Try this.
I'm guessing your dataframe is called df_ld and your target column is df_ld['BETA'].
def multiply(column):
    newcol = []
    for item in column:
        # elif keeps the branches exclusive; separate ifs would re-test the
        # already-multiplied value and append some items twice
        if item < 1:                 # e.g. 0.11 -> 110
            newcol.append(item * 1000)
        elif item < 10:              # e.g. 1.54 -> 154
            newcol.append(item * 100)
        elif item < 100:             # e.g. 22.1 -> 221
            newcol.append(item * 10)
        else:
            newcol.append(item)
    return newcol
# apply function and create new column
df_ld['newcol'] = multiply(df_ld['BETA'])

Get the index of the highest value inside a numpy array for each row? [duplicate]

This question already has answers here:
How to get the index of a maximum element in a NumPy array along one axis
(5 answers)
Closed 4 years ago.
I have a numpy array of 30 rows and 4 columns; for each row I need to get the index where the highest value is located.
So for an array like this
a = np.array([[0, 1, 2],[7, 4, 5]])
I would like to get a list with the indices 2 for the first row and 0 for the second row.
I tried with the numpy function argmax as follows:
results = []
for i in range(len(a)):
    results.append(np.argmax(a))  # np.argmax without axis= searches the flattened array
but I just get the global maximum. Does anyone know how to get around this?
Thank you very much for the help.
Use the argmax method with axis=1 in order to work on the rows.
>>> import numpy as np
>>> a = np.array([[0, 1, 2],[7, 4, 5]])
>>> a.argmax(axis=1)
array([2, 0])
There is also the module-level numpy.argmax function, which works just the same.
>>> np.argmax(a, axis=1)
array([2, 0])
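Since the question asks for a list, a tolist() call converts the resulting array:
>>> np.argmax(a, axis=1).tolist()
[2, 0]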

numpy sort arrays based on last column values [duplicate]

This question already has answers here:
Sorting arrays in NumPy by column
(16 answers)
Closed 4 years ago.
import numpy as np

a = np.array([[5, 9, 44],
              [5, 12, 43],
              [5, 33, 11]])
b = np.sort(a,axis=0)
print(b) #not well
# [[ 5 9 11]
# [ 5 12 43]
# [ 5 33 44]]
#desired output:
#[[5,33,11],
# [5,12,43],
# [5,9,44]]
What np.sort does here is sort every column individually (from lowest to highest, of course), which changes the rows completely. I would like to sort the rows based on the last column's value, while keeping each row's values untouched. Is there any pythonic way to do this?
Thanks
ind = np.argsort(a[:, -1])
b = a[ind]
EDIT
When you use axis in np.sort, it sorts every column individually. What you want is to get the indices that would sort the selected column (-1 means the last column), and then reorder your original array's rows with them.
a[a[:,-1].argsort()]
may work for you
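Applied to the array above, it produces exactly the desired output:
>>> a[a[:, -1].argsort()]
array([[ 5, 33, 11],
       [ 5, 12, 43],
       [ 5,  9, 44]])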

change value in pandas dataframe based on length of current value [duplicate]

This question already has answers here:
Pandas - Add leading "0" to string values so all values are equal len
(3 answers)
Closed 6 years ago.
I have a pandas dataframe with a certain column whose values should all have a length of four. If the length is three, I would like to add a '0' to the beginning of the value. For example:
a b c
1 2 0054
3 6 021
5 5 0098
8 2 012
So in column c I would like to change the second row to '0021' and the last row to '0012'. The values are already strings. I've tried doing:
df.loc[len(df['c']) == 3, 'c'] = '0' + df['c']
but it's not working out. Thanks for any help!
If the type of column c is int, you can do something like this:
df['c'] = df['c'].apply(lambda x: ('0' * (4 - len(str(x)))) + str(x) if len(str(x)) < 4 else str(x))
In the lambda function, I check whether the number of digits/characters in x is less than four. If so, I add zeros in front so that x ends up with four digits/characters (this is also known as padding). If not, I return the value as a string.
In case your type is string, you can remove the str() function calls, but it will work either way.
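For what it's worth, pandas also ships a vectorized zfill on the string accessor that does the same left-padding in one call. A short sketch, assuming the values in column c are already strings as the question states:
# Pad every value in column 'c' with leading zeros up to a width of 4.
df['c'] = df['c'].str.zfill(4)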
