I have a dataframe like this:
df_encoded.head()
Time Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 ... Q31 Q33 Q36 Q38 Q42 Q44 Q45 Q47 Q49 Q50
0 3746 0 3 56 3 1 7 7 0 4152 ... [1, 5, 9, 10] [6, 2, 0, 1, 3] [1, 11] 19 0 5 5 [54, 55, 97] [11, 8, 10] 8
1 3778 1 1 21 3 8 4 7 0 8541 ... 1 11 [10, 0, 13, 1] [9, 2] 1 [0, 1] [0, 5] 39 9 [8, 4]
2 4261 1 4 8 1 7 11 0 2 870 ... [1, 5, 9] 3 1 13 3 4 4 91 [18, 19, 5, 2, 1, 0, 7, 19, 5, 3, 7, 17, 6, 4,... [7, 1]
3 1180 1 0 21 3 7 11 16 0 4103 ... [4, 5, 8, 9] [2, 0, 1, 5, 10] [10, 4, 11] [19, 20, 9, 11] [5, 0] 4 [0, 4, 6] 54 [16, 12, 11, 9] 4
4 3823 1 3 19 3 2 17 15 7 3251 ... [5, 8, 9, 10] [2, 0, 1, 7, 1, 5, 4] 10 13 5 4 [4, 6] [54, 47, 97, 98] [19, 5, 2, 1, 0, 7, 12, 11, 8, 10] [8, 0]
The dtype of every column is object. I can easily change the type from object to int or float for the columns that don't contain any lists, but as you can see in the dataframe, some columns contain lists, and for those I cannot change the type from object to float. Is there any solution for this?
Finally, I want a correlation matrix, but I cannot call df_encoded.corr() while the columns hold object-type data. This correlation matrix is needed for making a heatmap.
What do you need to achieve?
If you definitely need rows that contain a list, and the objects in that list need to be float, then you will probably have to iterate over every row. If you have a huge dataset, i.e. millions of rows, you might need to rethink what you are trying to achieve.
To convert the rows you can use .apply, which iterates over each row of a pandas dataframe and lets you perform an action on that row, in this case changing the types in it. A quick win may be to use numpy.array:
import numpy as np
df_encoded['Q31'] = df_encoded.apply(
    lambda x: np.array(x['Q31']).astype(float),
    axis=1
)
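From there, to get a correlation matrix every cell ultimately has to be a single number. A minimal sketch, assuming it is acceptable to summarize each list by its mean for the purpose of the correlation (that choice, and the helper name to_scalar, are assumptions on my part):
import numpy as np

def to_scalar(value):
    # Assumption: summarize list/array cells by their mean; plain numbers pass through
    if isinstance(value, (list, np.ndarray)):
        return float(np.mean(value))
    return float(value)

df_numeric = df_encoded.applymap(to_scalar)
corr = df_numeric.corr()  # correlation matrix, ready for a heatmap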
Given an id column in a pandas dataframe, how can I create a new column with an additional id that maxes out at a count of 5 rows per ID, almost like "batches" of rows?
df = pd.DataFrame([[1, 1],
[2, 1],
[3, 1],
[4, 1],
[5, 1],
[6, 1],
[7, 1],
[8, 2],
[9, 2],
[10, 3],
[11, 3],
[12, 3],
[13, 4],
[14, 5],
[15, 5],
[16, 5],
[17, 5],
[18, 5],
[19, 5],
[20, 5]])
df.columns = ['ln_num', 'id']
print(df)
#expected output
expected = pd.DataFrame([[1, 1, 1],
[2, 1, 1],
[3, 1, 1],
[4, 1, 1],
[5, 1, 1],
[6, 1, 2],
[7, 1, 2],
[8, 2, 3],
[9, 2, 3],
[10, 3, 4],
[11, 3, 4],
[12, 1, 2],
[13, 1, 2],
[14, 1, 2],
[15, 1, 5],
[16, 4, 6],
[17, 4, 6],
[18, 4, 6],
[19, 3, 4],
[20, 3, 4]])
expected.columns = ['ln_num', 'id', 'grp_id']
print(expected)
So, for example, if I have 11 rows with ID=1, I need 3 different unique IDs for that subset of rows: 1. lines 1-5, 2. lines 6-10, 3. line 11.
The closest I've gotten so far is a groupby with a +1 offset, which gives me a new grp_id for each id but doesn't limit it to 5:
df = df.groupby('id').ngroup() + 1
I've also tried head() and nlargest(), but these don't sort ALL lines into batches, only the first or top 5.
I would start by getting all the points where you know the transition will happen (note that this indexes the id column by its default integer label, df[1], i.e. the dataframe before the columns are renamed; use df['id'] if you have renamed them):
(
    df[1].diff()      # Show where column 1 differs from the previous row
    .astype(bool)     # Make it a boolean (True/False)
)
We can use this selection on the index of the dataframe to get the indices of rows that change:
df.index[df[1].diff().astype(bool)]
This gives the output Int64Index([0, 7, 9, 12, 13], dtype='int64'), and we can check that rows 0, 7, 9, 12, and 13 are indeed where column 1 changes.
Next, we need to break down any segments that are longer than 5 rows into smaller batches. We'll iterate through each pair of steps and use the range function to batch them:
all_steps = []  # Start with an empty list of steps
for i, step in enumerate(steps[:-1]):
    all_steps += list(range(step, steps[i+1], 5))  # Add each step, but also any needed 5-steps
Last, we can use all_steps to assign values to the dataframe by index:
df['group'] = 0
for i, step in enumerate(all_steps[:-1]):
    df.loc[step:all_steps[i+1], 'group'] = i
Putting it all together, we also need to use len(df) a couple of times so that the range function knows how long the interval is for the last group:
steps = df.index[df[1].diff().astype(bool)].tolist() + [len(df)]  # range needs to know how long the last interval is

all_steps = []
for i, step in enumerate(steps[:-1]):
    all_steps += list(range(step, steps[i+1], 5))
all_steps += [len(df)]  # needed for indexing

df['group'] = 0
for i, step in enumerate(all_steps[:-1]):
    df.loc[step:all_steps[i+1], 'group'] = i
Our final output:
0 1 group
0 1 1 0
1 2 1 0
2 3 1 0
3 4 1 0
4 5 1 0
5 6 1 1
6 7 1 1
7 8 2 2
8 9 2 2
9 10 3 3
10 11 3 3
11 12 3 3
12 13 4 4
13 14 5 5
14 15 5 5
15 16 5 5
16 17 5 5
17 18 5 5
18 19 5 6
19 20 5 6
If you want the groups to start at 1, use the start=1 keyword in the enumerate function.
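A minimal sketch of that variant (the end boundary becomes all_steps[i], since i now runs one ahead of the list position):
df['group'] = 0
for i, step in enumerate(all_steps[:-1], start=1):
    df.loc[step:all_steps[i], 'group'] = i  # groups now run 1, 2, 3, ...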
I have a file with various columns. Say
1 2 3 4 5 6
2 4 5 6 7 4
3 4 5 6 7 6
2 0 1 5 6 0
2 4 6 8 9 9
I would like to select the rows whose value in column two is in the range [0, 2] and save them out to a new file.
The content of the new file should be
1 2 3 4 5 6
2 0 1 5 6 0
Kindly assist me. I prefer doing this with numpy in python.
For array a, you can use:
a[(a[:,1] <= 2) & (a[:,1] >= 0)]
Here, the boolean condition on the second column selects which rows are kept.
For your example:
>>> a
array([[1, 2, 3, 4, 5, 6],
[2, 4, 5, 6, 7, 4],
[3, 4, 5, 6, 7, 6],
[2, 0, 1, 5, 6, 0],
[2, 4, 6, 8, 9, 9]])
>>> a[(a[:,1] <= 2) & (a[:,1] >= 0)]
array([[1, 2, 3, 4, 5, 6],
[2, 0, 1, 5, 6, 0]])
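If your data lives in a whitespace-separated text file, a rough sketch of the load/filter/save round trip could look like this (the file names input.txt and output.txt are placeholders):
import numpy as np

a = np.loadtxt('input.txt')                      # read the whitespace-separated file
filtered = a[(a[:, 1] <= 2) & (a[:, 1] >= 0)]    # keep rows whose second column is in [0, 2]
np.savetxt('output.txt', filtered, fmt='%d')     # write the selected rows to a new file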
I am trying to apply a condition to the array: if the last value in a row is less than the first value in the next row, then increase every value in that next row by 10.
a = np.array([[0, 1, 5, 2, 3],[4, 2,2, 3, 4],[0, 3, 5,6, 8],[5,2,1,2,4],[7,8,2,3,6]])
for k in range(a):
    j = np.where(a[k,-1] < a[k+1,0], a[k+1]+10, a)
    print(j)
it just gave me an error message
TypeError: only integer scalar arrays can be converted to a scalar index
input
[[0 1 5 2 3]
[4 2 2 3 4]
[0 3 5 6 8]
[5 2 1 2 4]
[7 8 2 3 6]]
The required output is:
[[ 0 1 5 2 3]
[14 12 12 13 14]
[ 0 3 5 6 8]
[ 5 2 1 2 4]
[17 18 12 13 16]]
I only tried it on the first two rows, and the result was that it changed the whole array:
j=np.where(a[0,-1] < a[1,0], a[1]+10,a)
print(j)
my output:
[[14 12 12 13 14]
[14 12 12 13 14]
[14 12 12 13 14]
[14 12 12 13 14]
[14 12 12 13 14]]
I also tried it with if/else, but that didn't work.
You can try the following:
>>> a[1:][a[:-1,-1] < a[1:, 0]] *= 10
>>> a
array([[ 0, 1, 5, 2, 3],
[40, 20, 20, 30, 40],
[ 0, 3, 5, 6, 8],
[ 5, 2, 1, 2, 4],
[70, 80, 20, 30, 60]])
Where
>>> a[:-1,-1]  # Values in the last column, from the first row to the second-to-last
array([3, 4, 8, 4])
>>> a[1:, 0]   # Values in the first column, from the second row to the last
array([4, 0, 5, 7])
>>> a[:-1,-1] < a[1:, 0]  # Boolean mask: True where the condition holds, aligned with rows 2..last
array([ True, False, False, True])
# We then use this boolean mask to index the slice a[1:], i.e. rows 2 to last
NOTE: As pointed out by #GiovanniFrisson, you asked for +10, which would be a[1:][a[:-1,-1] < a[1:, 0]] += 10 and is what your required output shows; the example above uses *= 10, i.e. multiplication by 10, instead.
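For completeness, here is the += 10 variant run on the input from the question; it reproduces the required output exactly:
import numpy as np

a = np.array([[0, 1, 5, 2, 3],
              [4, 2, 2, 3, 4],
              [0, 3, 5, 6, 8],
              [5, 2, 1, 2, 4],
              [7, 8, 2, 3, 6]])

a[1:][a[:-1, -1] < a[1:, 0]] += 10  # add 10 to each row whose predecessor ends below its first value
print(a)
# [[ 0  1  5  2  3]
#  [14 12 12 13 14]
#  [ 0  3  5  6  8]
#  [ 5  2  1  2  4]
#  [17 18 12 13 16]]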
I have a dataframe that contains a string of varying length in each cell, e.g.
Num
(1,2,3,4,5)
(6,7,8)
(9)
(10,11,12)
I want to avoid attempting to perform str.split(',') on the cells that only have one number in them. However, I want all of the single numbers to be converted to a list of one element.
Here is what I have tried; it gives an error that says "'int' object is not callable":
if(df['Num'].size() > 1):
    df['Num'] = df['Num'].str.split(',')
update for clarification:
Index Num
0 2,6,7
1 1,3,6,7,8
2 2,4,7,8,9
3 3,5,8,9,10
4 4,9,10
5 1,2,7
6 1,2,3,6,8
7 2,3,4,7,9
8 3,4,5,8,10
9 4,5,9
10 2,3
11 1,3
12 1,2
13 2,3,4
14 1,3,4
15 1,2,4
16 1,2,3
17 2
18 1
I am trying to take this dataframe and convert each Num row from a string of numbers to a list. I want all of the indices that contain only one number (17 and 18) to be converted to a list containing a single element (itself).
The code below only works if every string contains more than one number separated by ','.
df['Adj'] = df['Adj'].str.split(',')
This is the output dataframe I get when I run the above code. Notice that the elements that only had one number are now NaN.
Index Num
0 [2, 6, 7]
1 [1, 3, 6, 7, 8]
2 [2, 4, 7, 8, 9]
3 [3, 5, 8, 9, 10]
4 [4, 9, 10]
5 [1, 2, 7]
6 [1, 2, 3, 6, 8]
7 [2, 3, 4, 7, 9]
8 [3, 4, 5, 8, 10]
9 [4, 5, 9]
10 [2, 3]
11 [1, 3]
12 [1, 2]
13 [2, 3, 4]
14 [1, 3, 4]
15 [1, 2, 4]
16 [1, 2, 3]
17 NaN
18 NaN
Assuming your column values are all strings and you just want the individual numbers as a list of str, this should do the trick:
df['Num'].str.strip('()').str.split(',')
# 0 [1, 2, 3, 4, 5]
# 1 [6, 7, 8]
# 2 [9]
# 3 [10, 11, 12]
# Name: Num, dtype: object
Since not all your data are str type, you'll need to coerce them into str first to ensure the string methods are called properly:
df['Num'].astype(str).str.split(',')
# 0 [2, 6, 7]
# 1 [1, 3, 6, 7, 8]
# 2 [2, 4, 7, 8, 9]
# ...
# 16 [1, 2, 3]
# 17 [2]
# 18 [1]
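And if you need the elements as integers rather than strings, one further step (a sketch, assuming the plain comma-separated format from your clarified example, without parentheses) would be:
df['Num'] = df['Num'].astype(str).str.split(',').apply(lambda lst: [int(x) for x in lst])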
If you have an x*n matrix, how do you check whether a row contains a certain number, and if so, how do you delete that row?
If you are using pandas, you can create a mask that you can use to index the dataframe, negating the mask with ~:
df = pd.DataFrame(np.arange(12).reshape(3, 4))
# 0 1 2 3
# 0 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
value = 2
If you want to check if the value is contained in a specific column:
df[~(df[2] == value)]
# 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
Or if it can be contained in any column:
df[~(df == value).any(axis=1)]
# 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
Just reassign it to df afterwards.
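For example:
df = df[~(df == value).any(axis=1)]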
This also works if you are using just numpy:
x = np.arange(12).reshape(3, 4)
# array([[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11]])
x[~(x == value).any(axis=1)]
# array([[ 4, 5, 6, 7],
# [ 8, 9, 10, 11]])
And finally, if you are using plain Python and have a list of lists, use the built-in any in a list comprehension:
y = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
[row for row in y if not any(item == value for item in row)]
# [[4, 5, 6, 7], [8, 9, 10, 11]]