Find the last value in a list in a dataframe - python

Please, I want to find the last gain value for a given client in a dataframe. How can I do it?
Example :
import pandas as pd

df = pd.DataFrame({'date':
['2018-06-13', '2018-06-14', '2018-06-15', '2018-06-16'],
'gain': [[10, 12, 15], [14, 11, 15], [9, 10, 12], [6, 4, 2]],
'how': [['customer1', 'customer2', 'customer3'],
['customer4', 'customer5', 'customer6'],
['customer7', 'customer8', 'customer9'],
['customer5', 'customer6', 'customer10']]})
df :
date gain how
0 2018-06-13 [10, 12, 15] [customer1, customer2, customer3]
1 2018-06-14 [14, 11, 15] [customer4, customer5, customer6]
2 2018-06-15 [9, 10, 12] [customer7, customer8, customer9]
3 2018-06-16 [6, 4, 2] [customer5, customer6, customer10]
I want to write a function that returns the last gain for a given customer in the dataframe.
Example:
for the customer5 = 6
for the customer4 = 14
for the customer20 = 'not found'
thank you so much

Using the unnesting function, then drop_duplicates:
newdf=unnesting(df,['gain','how']).drop_duplicates('how',keep='last')
newdf
Out[25]:
gain how date
0 10 customer1 2018-06-13
0 12 customer2 2018-06-13
0 15 customer3 2018-06-13
1 14 customer4 2018-06-14
2 9 customer7 2018-06-15
2 10 customer8 2018-06-15
2 12 customer9 2018-06-15
3 6 customer5 2018-06-16
3 4 customer6 2018-06-16
3 2 customer10 2018-06-16
Then look up your search list with reindex:
l=['customer5','customer6','customer20']
newdf.loc[newdf.how.isin(l)].set_index('how').reindex(l,fill_value='not_find')
Out[34]:
gain date
how
customer5 6 2018-06-16
customer6 4 2018-06-16
customer20 not_find not_find
Interesting reading about this type of question:
How do I unnest a column in a pandas DataFrame?
import numpy as np
import pandas as pd

def unnesting(df, explode):
    # repeat each row's index once per list element
    idx = df.index.repeat(df[explode[0]].str.len())
    # flatten every listed column, then line the flattened columns up side by side
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # re-attach the remaining (non-exploded) columns
    return df1.join(df.drop(explode, axis=1), how='left')
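As an aside, on pandas 1.3 or newer the built-in DataFrame.explode accepts several columns at once and can replace the helper entirely; a minimal sketch, assuming the lists in each row have matching lengths:
# pandas >= 1.3 supports exploding several equal-length list columns at once
newdf = df.explode(['gain', 'how']).drop_duplicates('how', keep='last')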

Related

delete redundant rows in a dataframe with set in columns

I have a dataframe df:
Cluster OsId BrowserId PageId VolumePred ConversionPred
0 0 11 11 {789615, 955761, 1149586, 955764, 955767, 1187... 147.0 71.0
1 0 11 12 {1184903, 955761, 1149586, 1158132, 955764, 10... 73.0 38.0
2 0 11 15 {1184903, 1109643, 955761, 955764, 1074581, 95... 72.0 40.0
3 0 11 16 {1123200, 1184903, 1109643, 1018637, 1005581, ... 7815.0 5077.0
4 0 11 17 {1184903, 789615, 1016529, 955761, 955764, 955... 52.0 47.0
... ... ... ... ... ... ...
307 {0, 4, 7, 9, 12, 15, 18, 21} 99 16 1154705 220.0 182.0
308 {18} 99 16 1155314 12.0 6.0
309 {9} 99 16 1158132 4.0 4.0
310 {0, 4, 7, 9, 12, 15, 18, 21} 99 16 1184903 966.0 539.0
This dataframe contains redundant rows that I need to delete, so I tried this:
df.drop_duplicates()
But I got this error: TypeError: unhashable type: 'set'
Any idea how to fix this error? Thanks!
Use frozenset to avoid the unhashable set type with DataFrame.duplicated, then filter with boolean indexing, inverting the mask with ~:
# convert sets in any column to hashable frozensets
df1 = df.applymap(lambda x: frozenset(x) if isinstance(x, set) else x)
df[~df1.duplicated()]
If no rows are removed, it means there are no duplicate rows (all columns are tested together).
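A minimal runnable sketch with made-up data, showing the idea end to end:
import pandas as pd

df = pd.DataFrame({'PageId': [{1, 2}, {1, 2}, {3}],
                   'VolumePred': [10.0, 10.0, 5.0]})
# frozensets are hashable, so duplicated() can compare them
# (pandas 2.1+ renames applymap to DataFrame.map)
df1 = df.applymap(lambda x: frozenset(x) if isinstance(x, set) else x)
print(df[~df1.duplicated()])  # keeps rows 0 and 2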

Display a column with an indicator/flag when conditions are met between different columns in pandas dataframe (no-merge)

Good day all of you. I have a dataframe with region, customer, some deliveries, and their prices. One column holds the type of purchase: the first and last purchases are marked 'FIRST' and 'LAST', and sometimes there are in-between deliveries marked 'DELIVERY'. I need to flag, in a new column, the customers and regions whose first and last purchase prices are equal. The whole data must be shown.
I've already solved the problem using merge, but I would like to know if there's a way to do it without merge, since it doesn't seem very efficient.
Thanks for your time.
Sample data:
import pandas as pd
data = [['NY', 'A','FIRST', 25], ['NY', 'A','DELIVERY', 20], ['NY', 'A','DELIVERY', 30], ['NY', 'A','LAST', 25],
['NY', 'B','FIRST', 15], ['NY', 'B','DELIVERY', 10], ['NY', 'B','LAST', 20],
['FL', 'A','FIRST', 15], ['FL', 'A','DELIVERY', 10], ['FL', 'A','DELIVERY', 12], ['FL', 'A','DELIVERY', 25], ['FL', 'A','LAST', 15],
['FL', 'C','FIRST', 15], ['FL', 'C','LAST', 10],
['FL', 'D','FIRST', 10], ['FL', 'D','DELIVERY', 20], ['FL', 'D','LAST', 30],
['FL', 'E','FIRST', 20], ['FL', 'E','LAST', 20]
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['region', 'customer', 'purchaseType', 'price'])
# print dataframe.
print(df)
region customer purchaseType price
0 NY A FIRST 25
1 NY A DELIVERY 20
2 NY A DELIVERY 30
3 NY A LAST 25
4 NY B FIRST 15
5 NY B DELIVERY 10
6 NY B LAST 20
7 FL A FIRST 15
8 FL A DELIVERY 10
9 FL A DELIVERY 12
10 FL A DELIVERY 25
11 FL A LAST 15
12 FL C FIRST 15
13 FL C LAST 10
14 FL D FIRST 10
15 FL D DELIVERY 20
16 FL D LAST 30
17 FL E FIRST 20
18 FL E LAST 20
Expected output:
region customer purchaseType price firstLastEqual
0 NY A FIRST 25 True
1 NY A DELIVERY 20 True
2 NY A DELIVERY 30 True
3 NY A LAST 25 True
4 NY B FIRST 15 False
5 NY B DELIVERY 10 False
6 NY B LAST 20 False
7 FL A FIRST 15 True
8 FL A DELIVERY 10 True
9 FL A DELIVERY 12 True
10 FL A DELIVERY 25 True
11 FL A LAST 15 True
12 FL C FIRST 15 False
13 FL C LAST 10 False
14 FL D FIRST 10 False
15 FL D DELIVERY 20 False
16 FL D LAST 30 False
17 FL E FIRST 20 True
18 FL E LAST 20 True
Answer using 'merge':
df_first = df[df['purchaseType'] == 'FIRST']
df_last = df[df['purchaseType'] == 'LAST']
df_compare = df_first.merge(df_last, how='inner', left_on=['region','customer'], right_on=['region','customer'])
df_compare = df_compare[df_compare['price_x'] == df_compare['price_y']]
df_compare['firstLastEqual'] = True
df = df.merge(df_compare, how='left', left_on=['region','customer'], right_on=['region','customer'])
df['firstLastEqual'] = df['firstLastEqual'].fillna(False)
df = df.drop(['purchaseType_x', 'price_x', 'purchaseType_y', 'price_y'], axis=1)
print(df)
I would like to know if it's possible without merging.
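One merge-free sketch using groupby().transform, assuming each region/customer pair has exactly one FIRST row and one LAST row:
# keep only the FIRST (or LAST) price per row; every other row becomes NaN
first_price = df['price'].where(df['purchaseType'].eq('FIRST'))
last_price = df['price'].where(df['purchaseType'].eq('LAST'))

keys = [df['region'], df['customer']]
# transform('first') broadcasts each group's single non-NaN price to all of its rows
df['firstLastEqual'] = (first_price.groupby(keys).transform('first')
                        == last_price.groupby(keys).transform('first'))
print(df)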

Pandas DataFrame find closest index in previous rows where condition is met

I have the following df1 dataframe:
t A
0 23:00 2
1 23:01 1
2 23:02 2
3 23:03 2
4 23:04 6
5 23:05 5
6 23:06 4
7 23:07 9
8 23:08 7
9 23:09 10
10 23:10 8
For each t (increments simplified here; not uniformly distributed in real life), I would like to find, if any, the most recent time tr within the previous 5 minutes where A(t) - A(tr) >= 4. I want to get:
t A tr
0 23:00 2
1 23:01 1
2 23:02 2
3 23:03 2
4 23:04 6 23:03
5 23:05 5 23:01
6 23:06 4
7 23:07 9 23:06
8 23:08 7
9 23:09 10 23:06
10 23:10 8 23:06
Currently, I can use shift(1) to compare each row to the previous row, like cond = df1['A'] >= df1['A'].shift(1) + 4.
How can I look further back in time?
Assuming your data is continuous by the minute, you can use ordinary shifts:
df1['t'] = pd.to_timedelta(df1['t'].add(':00'))
# one boolean column per lag i: does A drop by >= 4 over the last i minutes?
df = pd.DataFrame({i: df1.A - df1.A.shift(i) >= 4 for i in range(1, 5)})
# idxmax picks the smallest qualifying lag (the most recent tr); where() blanks rows with no hit
df1['t'] - pd.to_timedelta('1min') * df.idxmax(axis=1).where(df.any(axis=1))
Output:
0 NaT
1 NaT
2 NaT
3 NaT
4 23:03:00
5 23:01:00
6 NaT
7 23:06:00
8 NaT
9 23:06:00
10 23:06:00
dtype: timedelta64[ns]
I added a datetime index and used rolling(), which supports time-based windows beyond simple index-based windows (here seconds stand in for the question's minutes):
import pandas as pd
import numpy as np
import datetime
df1 = pd.DataFrame({'t' : [
datetime.datetime(2020, 5, 17, 23, 0, 0),
datetime.datetime(2020, 5, 17, 23, 0, 1),
datetime.datetime(2020, 5, 17, 23, 0, 2),
datetime.datetime(2020, 5, 17, 23, 0, 3),
datetime.datetime(2020, 5, 17, 23, 0, 4),
datetime.datetime(2020, 5, 17, 23, 0, 5),
datetime.datetime(2020, 5, 17, 23, 0, 6),
datetime.datetime(2020, 5, 17, 23, 0, 7),
datetime.datetime(2020, 5, 17, 23, 0, 8),
datetime.datetime(2020, 5, 17, 23, 0, 9),
datetime.datetime(2020, 5, 17, 23, 0, 10)
], 'A' : [2,1,2,2,6,5,4,9,7,10,8]}, columns=['t', 'A'])
df1.index = df1['t']
# compare each A to the oldest value in the trailing 5-second window, plus the threshold
cond = df1['A'] >= df1.rolling('5s')['A'].apply(lambda x: x.iloc[0] + 4)
result = df1[cond]
Gives
t A
2020-05-17 23:00:04 6
2020-05-17 23:00:05 5
2020-05-17 23:00:07 9
2020-05-17 23:00:09 10
2020-05-17 23:00:10 8
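Since the real timestamps are not uniformly spaced, here is a brute-force sketch, starting from the question's original df1, that scans each row's trailing window directly. It treats the window as strictly inside the previous 5 minutes (which matches the expected output above, e.g. the empty tr at 23:08) and assumes the frame is sorted by t:
import pandas as pd

def most_recent_drop(df, window='5min', thresh=4):
    # O(n^2) scan; fine for small frames
    out = []
    for _, row in df.iterrows():
        # strictly inside the previous window: t - 5min < tr < t
        win = df[(df['t'] > row['t'] - pd.Timedelta(window)) & (df['t'] < row['t'])]
        hits = win[row['A'] - win['A'] >= thresh]
        out.append(hits['t'].iloc[-1] if len(hits) else pd.NaT)
    return pd.Series(out, index=df.index)

df1['t'] = pd.to_timedelta(df1['t'].add(':00'))  # convert the question's '23:00' strings
df1['tr'] = most_recent_drop(df1)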

Can you explain the output: diff.sort_values(ascending=False).index.astype

Can anyone explain the following statement.
list(diff.sort_values(ascending=False).index.astype(int)[0:5])
Output: Int64Index([24, 26, 17, 2, 1], dtype='int64')
It sorts first, but what is the index doing, and how do I get 24, 26, 17, 2, 1?
diff is a Series:
ipdb> diff
1 0.017647
2 0.311765
3 -0.060000
4 -0.120000
5 -0.040000
6 -0.120000
7 -0.190000
8 -0.200000
9 -0.100000
10 -0.011176
11 -0.130000
12 0.008824
13 -0.060000
14 -0.090000
15 -0.060000
16 0.008824
17 0.341765
18 -0.140000
19 -0.050000
20 -0.060000
21 -0.040000
22 -0.210000
23 0.008824
24 0.585882
25 -0.060000
26 0.555882
27 -0.031176
28 -0.060000
29 -0.170000
30 -0.220000
31 -0.170000
32 -0.040000
dtype: float64
Your code returns a list of the index values of the top 5 values of the Series, sorted in descending order.
The first 'column' printed for a pandas Series is its index, so after sorting, your code converts the index values to integers and slices off the first five by position.
print (diff.sort_values(ascending=False))
24 0.585882
26 0.555882
17 0.341765
2 0.311765
1 0.017647
12 0.008824
23 0.008824
16 0.008824
10 -0.011176
27 -0.031176
32 -0.040000
21 -0.040000
5 -0.040000
19 -0.050000
15 -0.060000
3 -0.060000
13 -0.060000
25 -0.060000
28 -0.060000
20 -0.060000
14 -0.090000
9 -0.100000
6 -0.120000
4 -0.120000
11 -0.130000
18 -0.140000
31 -0.170000
29 -0.170000
7 -0.190000
8 -0.200000
22 -0.210000
30 -0.220000
Name: a, dtype: float64
print (diff.sort_values(ascending=False).index.astype(int))
Int64Index([24, 26, 17, 2, 1, 12, 23, 16, 10, 27, 32, 21, 5, 19, 15, 3, 13,
25, 28, 20, 14, 9, 6, 4, 11, 18, 31, 29, 7, 8, 22, 30],
dtype='int64')
print (diff.sort_values(ascending=False).index.astype(int)[0:5])
Int64Index([24, 26, 17, 2, 1], dtype='int64')
print (list(diff.sort_values(ascending=False).index.astype(int)[0:5]))
[24, 26, 17, 2, 1]
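As an aside, Series.nlargest does the sorting and slicing in one step:
print(list(diff.nlargest(5).index.astype(int)))
[24, 26, 17, 2, 1]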
Here's what's happening:
diff.sort_values(ascending=False) sorts the Series. By default ascending is True, but you've set it to False, so it returns the Series sorted in descending order.
pandas.Series.index returns the row labels of the Series (the numbers 1-32 in your case, now in sorted-value order).
.astype(int) casts the index labels to integers.
[0:5] picks the first five labels (positions 0 through 4).
Let me know if this helps!

Parsing a txt file in python

I have a txt file that contains accelerometer data, and I would like to parse it into columns.
The data is below. I want only these values as columns
(X value, Y value, Z value, time diff in ms), and I want to remove the file's headers and footers.
# Accelerometer Values
# filename: default__3.txt
# Saving start time: Sat Nov 15 11:09:33 GMT+03:30 2014
# sensor resolution: 0.1m/s^2
#Sensorvendor: ST Microelectronic, name: ST accelerometer, type: 1,version : 104, range 16.0
# X value, Y value, Z value, time diff in ms
-3.236 -4.726 8.982 1
-3.206 -4.716 8.884 10
-3.187 -4.716 8.816 10
-3.138 -4.716 8.757 10
-3.138 -4.746 8.757 1
-3.059 -4.815 8.816 9
-3.059 -4.864 8.825 10
-3.069 -5.021 8.865 10
-3.069 -4.903 8.865 1
-3.089 -4.864 8.924 9
-3.108 -4.903 9.051 13
-3.157 -4.903 9.247 8
-3.206 -4.893 9.404 9
-3.275 -4.883 9.581 11
-3.314 -4.726 9.62 10
-3.314 -4.805 9.62 1
-3.324 -4.756 9.512 9
-3.324 -4.667 9.335 11
-3.246 -4.589 9.247 9
-3.177 -4.56 9.041 11
-3.02 -4.56 8.855 9
-3.128 -4.54 8.855 1
-3.098 -4.628 8.708 10
-3.098 -4.628 8.62 9
-3.02 -4.687 8.62 1
-3.02 -4.687 8.541 9
-2.991 -4.775 8.541 1
-2.961 -4.805 8.512 10
# end
#Sat Nov 15 11:10:36 GMT+03:30 2014
Don't reinvent the wheel. Load and process the data with pandas:
>>> import pandas as pd
>>> data = pd.read_csv('data.txt', sep=' ', header=None, comment='#')
>>> data
0 1 2 3
0 -3.236 -4.726 8.982 1
1 -3.206 -4.716 8.884 10
2 -3.187 -4.716 8.816 10
3 -3.138 -4.716 8.757 10
4 -3.138 -4.746 8.757 1
5 -3.059 -4.815 8.816 9
6 -3.059 -4.864 8.825 10
7 -3.069 -5.021 8.865 10
8 -3.069 -4.903 8.865 1
9 -3.089 -4.864 8.924 9
10 -3.108 -4.903 9.051 13
11 -3.157 -4.903 9.247 8
12 -3.206 -4.893 9.404 9
13 -3.275 -4.883 9.581 11
14 -3.314 -4.726 9.620 10
15 -3.314 -4.805 9.620 1
16 -3.324 -4.756 9.512 9
17 -3.324 -4.667 9.335 11
18 -3.246 -4.589 9.247 9
19 -3.177 -4.560 9.041 11
20 -3.020 -4.560 8.855 9
21 -3.128 -4.540 8.855 1
22 -3.098 -4.628 8.708 10
23 -3.098 -4.628 8.620 9
24 -3.020 -4.687 8.620 1
25 -3.020 -4.687 8.541 9
26 -2.991 -4.775 8.541 1
27 -2.961 -4.805 8.512 10
To get a particular column as an array:
>>> data[2].values
array([8.982, 8.884, 8.816, 8.757, 8.757, 8.816, 8.825, 8.865, 8.865,
8.924, 9.051, 9.247, 9.404, 9.581, 9.62 , 9.62 , 9.512, 9.335,
9.247, 9.041, 8.855, 8.855, 8.708, 8.62 , 8.62 , 8.541, 8.541,
8.512])
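If you would rather have labeled columns than 0-3, read_csv accepts a names parameter (the labels below are only a suggestion):
data = pd.read_csv('data.txt', sep=' ', header=None, comment='#',
                   names=['x', 'y', 'z', 'time_diff_ms'])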
Since you have clearly defined comment lines in your file, it is fairly simple to filter them out.
Here's what I came up with:
with open("default__3.txt", "r") as f:
lines = f.readlines()
x_values = []
y_values = []
z_values = []
time_diffs = []
for line in lines:
if line.startswith('#'): # filter out comment lines
continue
tokens = line.split(' ')
if len(tokens) < 4: # filter out blank lines
continue
x_values.append(float(tokens[0]))
y_values.append(float(tokens[1]))
z_values.append(float(tokens[2]))
time_diffs.append(int(tokens[3].strip('\n'))) # remove carriage returns from last token
print(x_values)
print(y_values)
print(z_values)
print(time_diffs)
This puts your values into lists, which you can manipulate as you see fit. I used it to print the following:
[-3.236, -3.206, -3.187, -3.138, -3.138, -3.059, -3.059, -3.069, -3.069, -3.089, -3.108, -3.157, -3.206, -3.275, -3.314, -3.314, -3.324, -3.324, -3.246, -3.177, -3.02, -3.128, -3.098, -3.098, -3.02, -3.02, -2.991, -2.961]
[-4.726, -4.716, -4.716, -4.716, -4.746, -4.815, -4.864, -5.021, -4.903, -4.864, -4.903, -4.903, -4.893, -4.883, -4.726, -4.805, -4.756, -4.667, -4.589, -4.56, -4.56, -4.54, -4.628, -4.628, -4.687, -4.687, -4.775, -4.805]
[8.982, 8.884, 8.816, 8.757, 8.757, 8.816, 8.825, 8.865, 8.865, 8.924, 9.051, 9.247, 9.404, 9.581, 9.62, 9.62, 9.512, 9.335, 9.247, 9.041, 8.855, 8.855, 8.708, 8.62, 8.62, 8.541, 8.541, 8.512]
[1, 10, 10, 10, 1, 9, 10, 10, 1, 9, 13, 8, 9, 11, 10, 1, 9, 11, 9, 11, 9, 1, 10, 9, 1, 9, 1, 10]
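One simplification worth noting: calling line.split() with no argument splits on any run of whitespace and discards the trailing newline, so the strip('\n') call becomes unnecessary.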
