Given an id column in a pandas DataFrame, how can I create a new column with an additional id that maxes out at a count of 5 rows for each ID, almost like "batches" of rows?
df = pd.DataFrame([[1, 1],
[2, 1],
[3, 1],
[4, 1],
[5, 1],
[6, 1],
[7, 1],
[8, 2],
[9, 2],
[10, 3],
[11, 3],
[12, 3],
[13, 4],
[14, 5],
[15, 5],
[16, 5],
[17, 5],
[18, 5],
[19, 5],
[20, 5]])
df.columns = ['ln_num', 'id']
print(df)
#expected output
expected = pd.DataFrame([[1, 1, 1],
[2, 1, 1],
[3, 1, 1],
[4, 1, 1],
[5, 1, 1],
[6, 1, 2],
[7, 1, 2],
[8, 2, 3],
[9, 2, 3],
[10, 3, 4],
[11, 3, 4],
[12, 3, 4],
[13, 4, 5],
[14, 5, 6],
[15, 5, 6],
[16, 5, 6],
[17, 5, 6],
[18, 5, 6],
[19, 5, 7],
[20, 5, 7]])
expected.columns = ['ln_num', 'id', 'grp_id']
print(expected)
So, for example, if I have 11 rows with ID=1, I need 3 different unique IDs for this subset of rows: 1. lines 1-5, 2. lines 6-10, 3. line 11.
The closest I've gotten so far is using a groupby with a +1 offset, which gives me a new grp_id for each id but doesn't limit it to batches of 5.
df = df.groupby('id').ngroup() + 1
I've also tried head() and nlargest(), but these don't sort ALL lines into batches, only the first or top 5.
I would start by getting all the points where you know the transition will happen (I'm using the integer column labels from before the rename here, so column 1 holds the id values):
# Show where column 1 differs from the previous row,
# then make it a boolean (True/False) mask
df[1].diff().astype(bool)
We can use this selection on the index of the dataframe to get the indices of rows that change:
df.index[df[1].diff().astype(bool)]
This gives output: Int64Index([0, 7, 9, 12, 13], dtype='int64') and we can check that rows 0, 7, 9, 12, and 13 are where column 1 changes.
Next, we need to break down any segments that are longer than 5 rows into smaller batches. We'll iterate through each pair of steps and use the range function to batch them (steps is defined in the full code below):
all_steps = []  # Start with an empty list of steps
for i, step in enumerate(steps[:-1]):
    all_steps += list(range(step, steps[i+1], 5))  # Add each step, plus any needed 5-steps in between
Last, we can use all_steps to assign values to the dataframe by index:
df['group'] = 0
for i, step in enumerate(all_steps[:-1]):
    df.loc[step:all_steps[i+1], 'group'] = i
Putting it all together, we also need to use len(df) a few times, so that the range function knows how long the interval is on the last group.
steps = df.index[df[1].diff().astype(bool)].tolist() + [len(df)]  # range needs to know how long the last interval is
all_steps = []
for i, step in enumerate(steps[:-1]):
    all_steps += list(range(step, steps[i+1], 5))
all_steps += [len(df)]  # needed for indexing
df['group'] = 0
for i, step in enumerate(all_steps[:-1]):
    df.loc[step:all_steps[i+1], 'group'] = i
Our final output:
0 1 group
0 1 1 0
1 2 1 0
2 3 1 0
3 4 1 0
4 5 1 0
5 6 1 1
6 7 1 1
7 8 2 2
8 9 2 2
9 10 3 3
10 11 3 3
11 12 3 3
12 13 4 4
13 14 5 5
14 15 5 5
15 16 5 5
16 17 5 5
17 18 5 5
18 19 5 6
19 20 5 6
If you want the groups to start at 1, use the start=1 keyword in the enumerate function.
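A more compact alternative (a sketch building on the ngroup() idea from the question, not part of the answer above, and using the renamed 'id' column): number each row within its id with cumcount(), integer-divide by 5 to get a within-id batch number, then enumerate the (id, batch) pairs.
batch = df.groupby('id').cumcount() // 5                # 0 for the first 5 rows of each id, 1 for the next 5, ...
df['grp_id'] = df.groupby(['id', batch]).ngroup() + 1   # enumerate (id, batch) pairs, starting at 1
On the example input this reproduces grp_id values 1 through 7.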
Let's assume I have the following data frame:
x y
1 -1.808909 0.093380
2 1.733595 -0.380938
3 -1.385898 0.714071
And I want to insert a value in the column after "y".
However, it's possible that I might insert more than one value.
So, I need to check if the cell after "y" is empty or not to avoid overwriting the cell.
So the expected output might look like:
x y
1 -1.808909 0.093380 5
2 1.733595 -0.380938 6 7
3 -1.385898 0.714071 8
Compared to the input above, I first need to check whether the cell is empty or not.
I thought I might use: x = df.iloc[1,:].last_valid_index()
but that method returns "y", not the positional index of "y", which is 1.
Later I'll use that index to insert "5":
x +=1
df.iloc[1,x] = 5
I want to use that approach of finding the last non-empty cell because of the 2nd row in the output.
You can see that I need to insert "6" and then "7".
If I always used the same hard-coded position, like this:
df.iloc[1,2] = 6
df.iloc[1,2] = 7
it would overwrite the "6" when inserting "7".
One more thing: I can't look up the position using something like (df['y'].iloc[2]).index, because later I'll have two "y" columns, so that might return a smaller index than the one I need.
It is easy to identify the position of the first zero in each row in a numpy array or a dataframe. Let's create a dataframe with zeros after a certain position:
df = pd.DataFrame([[4, 1, 4, 2, 6, 0, 0, 0, 0, 0],
                   [5, 4, 9, 5, 5, 4, 0, 0, 0, 0],
                   [6, 6, 6, 5, 4, 8, 6, 0, 0, 0],
                   [5, 3, 9, 5, 3, 9, 6, 3, 0, 0],
                   [3, 2, 7, 9, 7, 6, 6, 7, 5, 0]])
df
0 1 2 3 4 5 6 7 8 9
0 4 1 4 2 6 0 0 0 0 0
1 5 4 9 5 5 4 0 0 0 0
2 6 6 6 5 4 8 6 0 0 0
3 5 3 9 5 3 9 6 3 0 0
4 3 2 7 9 7 6 6 7 5 0
For instance, the code below will give you all positions in the dataframe where the value is 0
np.argwhere(df.values == 0)
array([[0, 5],
[0, 6],
[0, 7],
[0, 8],
[0, 9],
[1, 6],
[1, 7],
[1, 8],
[1, 9],
[2, 7],
[2, 8],
[2, 9],
[3, 8],
[3, 9],
[4, 9]], dtype=int64)
Or you can get the positions where the values are not zero:
np.argwhere(df.values != 0)
array([[0, 0],
[0, 1],
[0, 2],
[0, 3],
[0, 4],
[1, 0],
[1, 1],
[1, 2],
[1, 3],
[1, 4],
[1, 5],
[2, 0],
[2, 1],
[2, 2],
[2, 3],
[2, 4],
[2, 5],
[2, 6],
[3, 0],
[3, 1],
[3, 2],
[3, 3],
[3, 4],
[3, 5],
[3, 6],
[3, 7],
[4, 0],
[4, 1],
[4, 2],
[4, 3],
[4, 4],
[4, 5],
[4, 6],
[4, 7],
[4, 8]], dtype=int64)
I hope it helps.
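To connect this back to the original question, where the "empty" cells are missing values rather than zeros, here is one possible sketch (the column names and helper function are hypothetical, not from the question): count the non-empty cells in a row and write into the next position.
import numpy as np
import pandas as pd

# Hypothetical frame: values fill in from the left and "empty" cells are NaN
df = pd.DataFrame([[-1.808909, 0.093380, np.nan, np.nan],
                   [1.733595, -0.380938, 6.0, np.nan],
                   [-1.385898, 0.714071, np.nan, np.nan]],
                  columns=['x', 'y', 'extra1', 'extra2'])

def append_value(frame, row, value):
    # The count of non-NaN cells in the row is the position of the first empty cell
    col = frame.iloc[row].notna().sum()
    frame.iloc[row, col] = value

append_value(df, 1, 7)  # lands in 'extra2', because 'extra1' already holds 6.0
append_value(df, 0, 5)  # lands in 'extra1'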
I suggest this less complicated solution:
import random
nums = [0, 7, 78, 843, 34893, 0 , 2, 23, 4, 0]
random.shuffle(nums)
thg = [x for x in nums if x != 0]
print(thg[0])
What this does is shuffle the nums list and filter out all the zeros. Then it prints the first non-zero value.
I have this task where I need to create a dataset based on two other connected datasets.
df = pd.DataFrame(columns=['ID','P1','P2'],
data=[[1, 2, 0], [2,1,0], [3, 1, 2], [4, 2, 1],
[5, 1, 2], [6, 0, 1], [7, 1, 0]])
fp = pd.DataFrame(columns=['ID','FP'],
data=[[1, 'fp'], [2,'i'], [3, 'i'], [4, 'fp'],
[5, 'fp'], [6, 'fp'], [7, 'i']])
My task is to create a third dataset that only contains ID, P1, and P2 from the df dataset for the rows where the fp dataset's 'FP' column shows 'fp'.
I tried this:
df2 = np.where((df['ID'] == fp['ID']) & (fp['FP'] == 'fp'))
But it didn't work.
I am sure there is a better way than mine, but this is what I would do
import pandas as pd
df = pd.DataFrame(columns=['ID','P1','P2'],
data=[[1, 2, 0], [2,1,0], [3, 1, 2], [4, 2, 1],
[5, 1, 2], [6, 0, 1], [7, 1, 0]])
fp = pd.DataFrame(columns=['ID','FP'],
data=[[1, 'fp'], [2,'i'], [3, 'i'], [4, 'fp'],
[5, 'fp'], [6, 'fp'], [7, 'i']])
# Merging dataframes
res = df.merge(fp)
# Filtering
res = res[res['FP'] == 'fp'].drop(columns=['FP'])
res
Result
ID P1 P2
0 1 2 0
3 4 2 1
4 5 1 2
5 6 0 1
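The merge and the filter can also be chained into a single expression with query() (same logic as above):
res = df.merge(fp).query("FP == 'fp'").drop(columns='FP')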
You can use Series.isin with boolean indexing.
idx = fp['ID'][fp['FP'].eq('fp')]
df.loc[df['ID'].isin(idx)]
ID P1 P2
0 1 2 0
3 4 2 1
4 5 1 2
5 6 0 1
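Another variant (a sketch, not from the answer above, assuming the IDs in fp are unique): map each row's ID to its FP label and filter on it directly.
mask = df['ID'].map(fp.set_index('ID')['FP']).eq('fp')  # look up each row's FP label by ID
df[mask]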
I have a dataframe like this:
df_encoded.head()
Time Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 ... Q31 Q33 Q36 Q38 Q42 Q44 Q45 Q47 Q49 Q50
0 3746 0 3 56 3 1 7 7 0 4152 ... [1, 5, 9, 10] [6, 2, 0, 1, 3] [1, 11] 19 0 5 5 [54, 55, 97] [11, 8, 10] 8
1 3778 1 1 21 3 8 4 7 0 8541 ... 1 11 [10, 0, 13, 1] [9, 2] 1 [0, 1] [0, 5] 39 9 [8, 4]
2 4261 1 4 8 1 7 11 0 2 870 ... [1, 5, 9] 3 1 13 3 4 4 91 [18, 19, 5, 2, 1, 0, 7, 19, 5, 3, 7, 17, 6, 4,... [7, 1]
3 1180 1 0 21 3 7 11 16 0 4103 ... [4, 5, 8, 9] [2, 0, 1, 5, 10] [10, 4, 11] [19, 20, 9, 11] [5, 0] 4 [0, 4, 6] 54 [16, 12, 11, 9] 4
4 3823 1 3 19 3 2 17 15 7 3251 ... [5, 8, 9, 10] [2, 0, 1, 7, 1, 5, 4] 10 13 5 4 [4, 6] [54, 47, 97, 98] [19, 5, 2, 1, 0, 7, 12, 11, 8, 10] [8, 0]
The data type of all the columns is object. I can easily change the type from object to int or float for the columns that don't contain any lists. But as you can see in the data frame, some columns have lists in them, and I can't change their type from object to float. Is there any solution for this?
Finally, I want a correlation matrix, but with object-typed columns I can't call df_encoded.corr(). This correlation matrix is needed for making a heatmap.
What do you need to achieve?
If you definitely know you can only solve your problem by keeping rows that contain lists, and the objects in those lists need to be float, then you will probably need to iterate over every row. If you have a huge dataset, i.e. millions of rows, you might need to rethink what you are trying to achieve.
To simply convert the rows, you would need to use .apply, which iterates over each row in a pandas dataframe (with axis=1) and allows you to perform an action on that row, in this case changing the types in that row. A quick win may be to use numpy.array.
import numpy as np
df_encoded['Q31'] = df_encoded.apply(
    lambda x: np.array(x['Q31']).astype(float),
    axis=1
)
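If the end goal is simply a numeric frame so that df_encoded.corr() works for the heatmap, one option is to collapse each list to a single summary number. This is a sketch, not the only choice; it assumes that the mean of a list's entries is a meaningful summary for your data:
import numpy as np

def to_scalar(cell):
    # Lists/arrays are collapsed to their mean; everything else is cast to float
    if isinstance(cell, (list, np.ndarray)):
        return float(np.mean(cell))
    return float(cell)

numeric = df_encoded.applymap(to_scalar)  # apply the conversion to every cell
corr = numeric.corr()                     # numeric frame, usable for a heatmap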
If you have an x*n matrix, how do you check for a row that contains a certain number, and if so, how do you delete that row?
If you are using pandas, you can create a mask that you can use to index the dataframe, negating the mask with ~:
df = pd.DataFrame(np.arange(12).reshape(3, 4))
# 0 1 2 3
# 0 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
value = 2
If you want to check if the value is contained in a specific column:
df[~(df[2] == value)]
# 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
Or if it can be contained in any column:
df[~(df == value).any(axis=1)]
# 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
Just reassign it to df afterwards.
This also works if you are using just numpy:
x = np.arange(12).reshape(3, 4)
# array([[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11]])
x[~(x == value).any(axis=1)]
# array([[ 4, 5, 6, 7],
# [ 8, 9, 10, 11]])
And finally, if you are using plain Python and have a list of lists, use the built-in any in a list comprehension:
y = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
[row for row in y if not any(x == value for x in row)]
# [[4, 5, 6, 7], [8, 9, 10, 11]]
I have a 4x4 matrix like this:
ds1=
4 13 6 9
7 12 5 7
7 0 4 22
9 8 12 0
and another file with two columns:
ds2 =
4 1
5 3
6 1
7 2
8 2
9 3
12 1
13 2
22 3
ds1 = ds1.apply(lambda x: ds2_mean[1] if [condition])
What condition should be added to compare and check that elements from ds1 and ds2 are equal?
I want every value in matrix 1 that matches col1 of the 2nd matrix to be replaced by the corresponding col2 value, so the resulting matrix should look like:
1 2 1 3
2 1 3 2
2 0 1 3
3 2 1 0
Please see Replacing mean value from one dataset to another; this does not answer my question.
If you are working with numpy arrays, you could do this -
# Make a copy of ds1 to initialize output array
out = ds1.copy()
# Find out the row indices in ds2 that have intersecting elements between
# its first column and ds1
_,C = np.where(ds1.ravel()[:,None] == ds2[:,0])
# New values taken from the second column of ds2 to be put in output
newvals = ds2[C,1]
# Valid positions in output array to be changed
valid = np.in1d(ds1.ravel(),ds2[:,0])
# Finally make the changes to get desired output
out.ravel()[valid] = newvals
Sample input, output -
In [79]: ds1
Out[79]:
array([[ 4, 13, 6, 9],
[ 7, 12, 5, 7],
[ 7, 0, 4, 22],
[ 9, 8, 12, 0]])
In [80]: ds2
Out[80]:
array([[ 4, 1],
[ 5, 3],
[ 6, 1],
[ 7, 2],
[ 8, 2],
[ 9, 3],
[12, 1],
[13, 2],
[22, 3]])
In [81]: out
Out[81]:
array([[1, 2, 1, 3],
[2, 1, 3, 2],
[2, 0, 1, 3],
[3, 2, 1, 0]])
Here is another solution, using the DataFrame.replace() function (this assumes ds1 and ds2 are DataFrames with default integer column labels):
ds1.replace(to_replace=ds2[0].tolist(), value=ds2[1].tolist(), inplace=True)
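A dictionary-based variant of the same idea (again assuming ds1 and ds2 are pandas DataFrames with default integer column labels):
mapping = dict(zip(ds2[0], ds2[1]))  # e.g. {4: 1, 5: 3, 6: 1, 7: 2, ...}
out = ds1.replace(mapping)           # values not in the mapping (such as 0) are left unchanged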