Delete wrong values from an array with the help of pandas - python

I have a problem with my data: it contains a number of displacements at different times. Unfortunately, in some cases there are 2 or 3 different displacements for the same time, of which only the highest value is acceptable.
I used pandas, and now I have all the values in my Python code and have converted them into a two-column array.
I need to write an algorithm that finds all duplicate t values, checks their x values, and deletes the whole row whose x is lower than in the other cases.
I would really appreciate any ideas.
t x
1 2
2 3
3 4
4 5
5 5
3 3
1 1
7 5
I have written an example above. Here, for the times equal to 3 and 1, I need to delete the whole row that has the lower x.

Let's say you have a pandas DataFrame such as:
import pandas as pd
dictionary = {'t': [1, 2, 3, 4, 5, 3, 1, 7],
              'x': [2, 3, 4, 5, 5, 3, 1, 5]}
dataframe = pd.DataFrame(dictionary)
You can select the rows where t equals 3 or 1 as follows:
dataframe.loc[dataframe['t'].isin([3, 1])].reset_index(drop=True)
Also, you can select the rows where t is neither 3 nor 1 as follows:
dataframe.loc[~dataframe['t'].isin([3, 1])].reset_index(drop=True)
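The question, though, asks to drop, for each duplicated t, every row except the one with the highest x. A minimal sketch of one way to do that with the DataFrame built above:
result = (dataframe.sort_values('x', ascending=False)          # sort so the highest x comes first
                   .drop_duplicates(subset='t', keep='first')  # keep one row per t
                   .sort_values('t')
                   .reset_index(drop=True))
print(result)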

Related

How can I create a column target based on two different columns?

I have the following DataFrame with the columns low_scarcity and high_scarcity (each row has a value in either high or low scarcity):
id  low_scarcity       high_scarcity
0   When I was five..
1                      I worked a lot...
2                      I went to parties...
3   1 week ago
4   2 months ago
5                      another story..
I want to create another column 'target' such that when there's an entry in the low_scarcity column, the value will be 0, and when there's an entry in the high_scarcity column, the value will be 1. Just like this:
id  low_scarcity       high_scarcity          target
0   When I was five..                         0
1                      I worked a lot...      1
2                      I went to parties...   1
3   1 week ago                                0
4   2 months ago                              0
5                      another story..        1
I tried first replacing the entries with no value with 0 and then creating a boolean condition; however, I can't use .replace('', 0) because the columns that are empty don't appear as empty values.
Supposing your dataframe is called df and that a value is in either high or low scarcity, the following line of code does it:
import numpy as np
df['target'] = 1*np.array(df['high_scarcity']!="")
in which the 1* performs an integer conversion of the boolean values.
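If you prefer to avoid the 1* trick, an equivalent spelling using pandas' own type conversion would be:
df['target'] = (df['high_scarcity'] != "").astype(int)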
If that is not the case, then a more complex approach should be taken:
# start with an empty placeholder per row, then fill it from each column's mask
res = np.array(["" for i in range(df.shape[0])])
res[df['high_scarcity'] != ""] = 1
res[df['low_scarcity'] != ""] = 0
df['target'] = res
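As a sketch of an alternative for that more complex case, np.select expresses both assignments in one call; the -1 default below is just an assumed sentinel for rows that match neither column:
conditions = [df['low_scarcity'] != "", df['high_scarcity'] != ""]
choices = [0, 1]
df['target'] = np.select(conditions, choices, default=-1)  # -1 marks rows with neither value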

Pandas - How to extract values from a large DF without any 'keys' using another DF's values?

I've got one large matrix as a pandas DF without any 'keys', just plain number labels on top. A smaller version of that, just to demonstrate the problem here, would be this input:
M=pd.DataFrame(np.random.rand(4,5))
What I want to accomplish is using another given DF as reference that has a structure like this
N=pd.DataFrame({'A':[2,2,2],'B':[2,3,4]})
...to extract the values from the large DF, where the values of 'A' correspond to the ROW number and the 'B' values to the COLUMN number of the large DF, so that the expected output would look like this:
Large DF
0 1 2 3 4
0 0.766275 0.910825 0.378541 0.775416 0.639854
1 0.505877 0.992284 0.720390 0.181921 0.501062
2 0.439243 0.416820 0.285719 0.100537 0.429576
3 0.243298 0.560427 0.162422 0.631224 0.033927
Small DF
A B
0 2 2
1 2 3
2 2 4
Expected Output:
A B extracted values
0 2 2 0.285719
1 2 3 0.100537
2 2 4 0.429576
So far I've tried different versions of something like this:
N['extracted'] = M.iloc[N['A'].astype(int):,N['B'].astype(int)]
...but it keeps failing with an error saying:
TypeError: cannot do positional indexing on RangeIndex with these indexers
[0 2
1 2
2 2
Which approach would be best?
Is this job better accomplished by converting the DFs into numpy arrays?
Thanks for the help!
I think you want to use the apply function. This goes row by row through your data set.
N['extracted'] = N.apply(lambda row: M.iloc[row['A'], row['B']], axis=1)
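If speed matters, a vectorized alternative (and an answer to the numpy question) is to index the underlying array directly; this sketch assumes the values in 'A' and 'B' are valid integer positions within M:
rows = N['A'].to_numpy(dtype=int)   # row positions
cols = N['B'].to_numpy(dtype=int)   # column positions
N['extracted'] = M.to_numpy()[rows, cols]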

How can I speed up data labelling for a large pandas dataframe?

I have a large pandas data frame which roughly looks like this:
Identity periods one two three Label
0 one 1 -0.462407 0.022811 -0.277357
1 one 1 -0.617588 1.667191 -0.370436
2 one 2 -0.604699 0.635473 -0.556088
3 one 2 -0.852943 1.087415 -0.784377
4 two 3 0.421453 2.390097 0.176333
5 two 3 -0.447321 -1.215280 -0.187156
6 two 4 0.398953 -0.334095 -1.194132
7 two 4 -0.324348 -0.842357 0.970825
I need to be able to categorise the data according to groupings in the various columns. For example, one of my categorisation criteria is to label each of the groups in the Identity column if there are between x and y periods in the periods column.
The code I have to categorise this looks like this, generating a final column:
for i in df['Identity'].unique():
    if 2 <= df[df['Identity'] == i]['periods'].max() <= 5:
        df.loc[df['Identity'] == i, 'label'] = 'label 1'
I have also tried a version using
df.groupby('Identity').apply().
But this is no quicker.
My data is approximately 2.8m rows at the moment, and there are about 900 unique identities. The code takes about 5 minutes to run, which to me suggests it's the code within the loop that is slow, rather than the looping making it slow.
Let's try to enhance performance by using vectorized pandas operations instead of loops or the .apply() function, which commonly just uses relatively slow Python loops internally.
Use .groupby() and .transform() to broadcast the max() of periods within each group, giving a Series to build the mask from. Then use .loc[] with the mask for the condition 2 <= max <= 5 and set the label for the rows fulfilling the mask.
This assumes the same label for all rows of the same Identity group whenever the max period within the group satisfies 2 <= max <= 5.
m = df.groupby('Identity')['periods'].transform('max')
df.loc[(m >=2) & (m <=5), 'Label'] = 'label 1'
print(df)
Identity periods one two three Label
0 one 1 -0.462407 0.022811 -0.277357 label 1
1 one 1 -0.617588 1.667191 -0.370436 label 1
2 one 2 -0.604699 0.635473 -0.556088 label 1
3 one 2 -0.852943 1.087415 -0.784377 label 1
4 two 3 0.421453 2.390097 0.176333 label 1
5 two 3 -0.447321 -1.215280 -0.187156 label 1
6 two 4 0.398953 -0.334095 -1.194132 label 1
7 two 4 -0.324348 -0.842357 0.970825 label 1
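If several period ranges should map to different labels (the question mentions this is only one of the categorisation criteria), np.select extends the same vectorized idea; the second range and the label names below are made up for illustration:
import numpy as np
m = df.groupby('Identity')['periods'].transform('max')
df['Label'] = np.select(
    [m.between(2, 5), m.between(6, 10)],   # hypothetical period ranges
    ['label 1', 'label 2'],
    default='unlabelled')                  # assumed fallback label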

Filtering after Multi Indexing in pandas iterable indexing

I want to make a subset of my DataFrame object, using pandas or any other Python library, with hierarchical indexing that can be iterated over depending on the number of rows I have in one of the columns.
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
df = pd.read_csv(address)
trajectory frame x y
1 1 447,956 2,219
1 2 447,839 2,327
1 3 449,183 1,795
1 4 450,444 1,833
1 5 448,514 1,708
1 6 451,532 1,832
1 7 448,471 1,759
1 8 450,028 2,097
1 9 448,215 2,203
1 10 449,311 2,063
1 11 451,745 1,76
1 12 450,827 2,264
1 13 448,991 2,208
1 14 452,829 3,106
1 15 448,688 1,77
1 16 449,844 1,951
1 17 450,044 1,991
1 18 449,835 1,901
1 19 450,793 3,49
1 20 449,618 2,354
2 1 445.936 7.219
2 2 442.879 3.327
3 1 441.283 9.795
4 1 447.956 2.219
4 3 447.839 2.327
4 6 449.183 1.795
In this DataFrame, let's say there are 4 columns, named 'trajectory', 'frame', 'x' and 'y'. The number of 'trajectory' values can differ from one dataframe to another. Each 'trajectory' can have multiple frames between 1 and 20, where they can be sequential from 1-20 or have some missing frames as well. Each frame has its own value in the columns 'x' and 'y'.
My aim is to create a new dataframe containing only those 'trajectory' values for which the 'frame' values are present for all 20 rows. As the number of rows in the 'trajectory' and 'frame' columns changes, I would like code that can be used in such conditions.
df_1 = df.set_index(['trajectory','frame'], drop=False)
Here, I did hierarchical indexing using 'trajectory' and 'frame' and then found that 'trajectory' numbers 1 and 6 have 20 frames in them, so I could manually select them using the following code.
df_1_subset = df_1[(df_1['trajectory'] == 1) | (df_1['trajectory'] == 6)]
However, I have multiple CSV files where, in each DataFrame, the 'trajectory' values that have 20 rows in the 'frame' column will be different, so I would have to do this manually. I think there must be a better way, but I just cannot seem to find it. I am very new to coding and I would really appreciate anybody's help. Thank you very much in advance.
If you need to filter the trajectory level for 1 or 6, use Index.get_level_values with Index.isin:
df_1_subset = df_1[df_1.index.get_level_values('trajectory').isin([1, 6])]
If you need to filter the trajectory level for 1 and the frame level for 6, select with DataFrame.loc and a tuple:
df_1_subset = df_1.loc[(1, 6)]
Alternative:
df_1_subset = df_1.loc[(df_1.index.get_level_values('trajectory') == 1) |
                       (df_1.index.get_level_values('frame') == 6)]
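To avoid finding trajectory numbers such as 1 and 6 by hand, you can keep every trajectory that has all 20 frames automatically. A sketch, assuming each frame number appears at most once per trajectory:
counts = df.groupby('trajectory')['frame'].transform('count')
df_full = df[counts == 20].reset_index(drop=True)
or, equivalently, with groupby().filter():
df_full = df.groupby('trajectory').filter(lambda g: len(g) == 20)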

Create a column based on multiple column distinct count pandas [duplicate]

I want to add an aggregate, grouped, nunique column to my pandas dataframe but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.
My df has track, type, and id columns. I want the number of unique ids for each track/type combination as a new column in the table (but not collapse track/type combos in the resulting df). Same number of rows, one more column.
Something like this isn't working:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()
nor is
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)
This last one works with some aggregating functions but not others. The following works (but is meaningless on my dataset):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)
In R this is easily done in data.table with
df[, n_unique_id := uniqueN(id), by = c('track', 'type')]
Thanks!
df.groupby(['track', 'type'])['id'].transform(nunique)
implies that there is a name nunique in the namespace that refers to some function. transform will take a function, or a string that it knows a function for. 'nunique' is definitely one of those strings.
As pointed out by @root, the methods that pandas uses to perform the transformations indicated by these strings are often optimized and should generally be preferred to passing your own functions. This is true even for passing numpy functions in some cases.
For example transform('sum') should be preferred over transform(sum).
Try this instead
df.groupby(['track', 'type'])['id'].transform('nunique')
demo
df = pd.DataFrame(dict(
    track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)
id track type
0 X 1 A
1 X 1 A
2 Y 1 A
3 Z 1 A
4 W 2 B
5 W 2 B
6 W 2 B
7 W 2 B
df.groupby(['track', 'type'])['id'].transform('nunique')
0 3
1 3
2 3
3 3
4 1
5 1
6 1
7 1
Name: id, dtype: int64
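Assigning that result back gives the requested extra column while keeping every row:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform('nunique')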
