Select every n values in a pandas column - python

I would like to select every run of 3 consecutive values from a column (a sliding window).
For example:
Input
12
73
56
33
16
Output
12
73
56
------
73
56
33
-----
56
33
16
I have tried to add a key column and group by it, but my data frame is too large to perform the grouping. Here is my attempt:
df.groupby('key').agg(lambda x: x.tolist())

If you use a plain list, you can do it like this:
lst = [12, 73, 56, 33, 16]
slide_size = 3
result = []
for i in range(len(lst) - slide_size + 1):
    result.append(lst[i:i + slide_size])
result
# output: [[12, 73, 56], [73, 56, 33], [56, 33, 16]]
After this, you can transform the list into a DataFrame.
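For larger data, the same sliding window can be built directly on the column without a Python loop. A minimal sketch, assuming NumPy 1.20 or newer for sliding_window_view and a Series s holding the column:

import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

s = pd.Series([12, 73, 56, 33, 16])
windows = sliding_window_view(s.to_numpy(), window_shape=3)
df_windows = pd.DataFrame(windows)  # one window per row
print(df_windows)
#     0   1   2
# 0  12  73  56
# 1  73  56  33
# 2  56  33  16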

Related

Slicing a list in sublists according to unknown indices

What is the best method to slice a list (here: lst_num) into (more than two) parts of variable length according to another list containing the indices?
A string of numbers has to be split into sublists that contain the numbers standing between all subsequent occurrences of a certain number. E.g.: "30 24 17 30 22 1 67 2 4 3 30 24 95 34 29 56 30 43 24" and "30" yields: [24, 17], [22, 1, 67, 2, 4, 3] and [24, 95, 34, 29, 56]
str_num="30 24 17 30 22 1 67 2 4 3 30 24 95 34 29 56 30 43 24"
lst_num=[int(x) for x in ciphtext.split()]
idx=[i for i, x in enumerate(lst_num) if x==30]
for i in idx: ???
To slice the list, the first argument should be "i+1", but how do I obtain the subsequent index from idx as the stop index? Is there a way to give each sublist a unique name in the iteration?
One little step to go:
>>> [lst_num[start+1:end] for start, end in zip(idx, idx[1:])]
[[24, 17], [22, 1, 67, 2, 4, 3], [24, 95, 34, 29, 56]]
Just zip the indexes into pairs of slice boundaries.
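Put together as a self-contained example:

str_num = "30 24 17 30 22 1 67 2 4 3 30 24 95 34 29 56 30 43 24"
lst_num = [int(x) for x in str_num.split()]
idx = [i for i, x in enumerate(lst_num) if x == 30]
sublists = [lst_num[start + 1:end] for start, end in zip(idx, idx[1:])]
print(sublists)
# [[24, 17], [22, 1, 67, 2, 4, 3], [24, 95, 34, 29, 56]]

Note that anything after the last occurrence of 30 (here 43 and 24) is not captured; append len(lst_num) to idx first if a trailing sublist is wanted.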

simple use case where numpy.delete() is not working

Here is some code:
c = np.delete(a,b)
print(len(a))
print(a)
print(len(b))
print(b)
print(len(c))
print(c)
It gives back:
24
[32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55]
20
[46, 35, 37, 54, 40, 49, 34, 48, 50, 38, 42, 47, 33, 52, 41, 36, 39, 44, 55,
51]
24
[32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55]
As you can see, all elements of b appear in a, but they are not being deleted. I cannot figure out why. Any ideas? Thank you.
numpy.delete does not remove the elements contained in b; it deletes a[b]. In other words, b needs to contain the indices of the elements to remove. Since your b contains only values larger than the length of a, no values are removed. Currently, out-of-bounds indices are ignored, but this will not be true in the future:
/usr/local/bin/ipython3:1: DeprecationWarning: in the future out of bounds indices will raise an error instead of being ignored by `numpy.delete`.
#!/usr/bin/python3
A pure Python solution would be to use set:
set_b = set(b)
c = np.array([x for x in a if x not in set_b])
# array([32, 43, 45, 53])
And using numpy broadcasting to create a mask to determine which values to delete:
c = a[~(a[None,:] == b[:, None]).any(axis=0)]
# array([32, 43, 45, 53])
They are about the same speed on the given example, but the numpy approach takes more memory (because it generates a 2D matrix that contains all combinations of a and b).
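A middle ground is np.isin, which builds only a boolean mask the size of a. A minimal sketch with the arrays from the question:

import numpy as np

a = np.arange(32, 56)
b = np.array([46, 35, 37, 54, 40, 49, 34, 48, 50, 38,
              42, 47, 33, 52, 41, 36, 39, 44, 55, 51])
c = a[~np.isin(a, b)]  # keep only the values of a that do not appear in b
print(c)
# [32 43 45 53]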

Pandas DataFrame: split data into two columns and convert numbers with commas to integers

I am currently running into two issues. My dataframe looks like this:
, male_female, no_of_students
0, 24 : 76, "81,120"
1, 33 : 67, "12,270"
2, 50 : 50, "10,120"
3, 42 : 58, "5,120"
4, 12 : 88, "2,200"
What I would like to achieve is this:
, male, female, no_of_students
0, 24, 76, 81120
1, 33, 67, 12270
2, 50, 50, 10120
3, 42, 58, 5120
4, 12, 88, 2200
Basically I want to convert male_female into two columns and no_of_students into a column of integers. I tried a bunch of things, such as converting the no_of_students column into another type with .astype, but nothing seems to work properly. I also couldn't find a smart way of splitting the male_female column.
Hopefully someone can help me out!
Use str.split with pop to create the new columns by separator, then strip the stray quotes, remove the commas with replace, and convert to integers:
df[['male','female']] = df.pop('male_female').str.split(' : ', expand=True)
df['no_of_students'] = df['no_of_students'].str.strip('" ').str.replace(',','').astype(int)
df = df[['male','female', 'no_of_students']]
print(df)
male female no_of_students
0 24 76 81120
1 33 67 12270
2 50 50 10120
3 42 58 5120
4 12 88 2200
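Note that str.split leaves male and female as string columns; if numbers are needed downstream they can be cast as well. And assuming the data originally comes from a CSV file, read_csv can strip the comma grouping at parse time:

df[['male', 'female']] = df[['male', 'female']].astype(int)

# alternatively, handle the thousands separator while reading
# (hypothetical file name):
# df = pd.read_csv('students.csv', thousands=',')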

How to remove a specific repeated line in pandas dataframe?

In this pandas dataframe:
df =
pos index data
21 36 a,b,c
21 36 a,b,c
23 36 c,d,e
25 36 f,g,h
27 36 g,h,k
29 39 a,b,c
29 39 a,b,c
31 39 .
35 39 c,k
36 41 g,h
38 41 k,l
39 41 j,k
39 41 j,k
I want to remove the repeated rows, but only within the same index group and only when they are in the head region of the subframe.
So, I did:
df_grouped = df.groupby(['index'], as_index=True)
now,
for i, sub_frame in df_grouped:
    sub_frame.apply(lambda g: ...)  # remove one duplicate row in the head region if the pos value is a repeat
I want to take this approach because some pos values are also repeated in the tail region, and those should not be removed. Any suggestions?
Expected output:
pos index data
removed
21 36 a,b,c
23 36 c,d,e
25 36 f,g,h
27 36 g,h,k
removed
29 39 a,b,c
31 39 .
35 39 c,k
36 41 g,h
38 41 k,l
39 41 j,k
39 41 j,k
If it doesn't have to be done in a single apply statement, then this code will only remove duplicates in the head region:
data = {'pos': [21, 21, 23, 25, 27, 29, 29, 31, 35, 36, 38, 39, 39],
        'idx': [36, 36, 36, 36, 36, 39, 39, 39, 39, 41, 41, 41, 41],
        'data': ['a,b,c', 'a,b,c', 'c,d,e', 'f,g,h', 'g,h,k', 'a,b,c', 'a,b,c',
                 '.', 'c,k', 'g,h', 'k,l', 'j,k', 'j,k']}
df = pd.DataFrame(data)
accum = []
for i, sub_frame in df.groupby('idx'):
    accum.append(pd.concat([sub_frame.iloc[:2].drop_duplicates(), sub_frame.iloc[2:]]))
df2 = pd.concat(accum)
print(df2)
EDIT2: The first version of the chained command that I posted was wrong and only worked for the sample data. This version provides a more general solution that removes the duplicate rows per the OP's request:
df.drop(df.groupby('idx')           # group by the index column
          .head(2)                  # select the first two rows of each group
          .duplicated()             # create a Series, True for duplicate rows
          .to_frame(name='duped')   # make the Series a dataframe
          .query('duped')           # select only the duplicate rows
          .index)                   # provide the index of the rows to drop
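An equivalent mask-based formulation, sketched with cumcount to define the head region (the first two rows of each group, as above):

in_head = df.groupby('idx').cumcount() < 2  # first two rows of each idx group
is_dupe = df.duplicated()                   # True for rows identical to an earlier row
df3 = df[~(in_head & is_dupe)]
print(df3)

Because idx is part of the row comparison, duplicated() can only match rows within the same group here, so the tail duplicates (the two j,k rows) survive.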

Indexing data by string

I have a large CSV file whose column headers encode an array name and an index, e.g.:
time, dataset1[0], dataset1[1], dataset1[2], dataset2[0], dataset2[1], dataset2[2]\n
0, 43, 35, 29, 21, 59, 39\n
1, 21, 59, 39, 43, 35, 29\n
You get the idea (obviously there is far more data in the arrays).
Any ideas how I can easily parse/strip this into an efficient dataframe?
[EDIT]
Ideally I'm after a structure like this:
time dataset1 dataset2
0 0 [43,35,29] [21,59,39]
1 1 [21,59,39] [43,35,29]
where the indices have been stripped from the labels and turned into positions within per-row arrays.
from pandas import read_csv
df = read_csv('data.csv')
print(df)
Gives as output:
time dataset1[0] dataset1[1] dataset1[2] dataset2[0] dataset2[1] \
0 0 43 35 29 21 59
1 1 21 59 39 43 35
dataset2[2]
0 39
1 29
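The thread stops there; one way to reach the desired structure is to strip the bracketed index from each header and collect the columns that share a prefix into per-row lists. A sketch under those assumptions (the file name data.csv and the name[index] header pattern are taken from the example above, not from a posted answer):

import re
import pandas as pd

df = pd.read_csv('data.csv')
df.columns = df.columns.str.strip()  # headers like ' dataset1[0]' may carry spaces

# map each 'name[i]' header to its prefix and numeric index
pattern = re.compile(r'(\w+)\[(\d+)\]$')
groups = {}
for col in df.columns:
    m = pattern.match(col)
    if m:
        groups.setdefault(m.group(1), []).append((int(m.group(2)), col))

result = pd.DataFrame({'time': df['time']})
for name, indexed_cols in groups.items():
    cols = [c for _, c in sorted(indexed_cols)]  # order columns by their index
    result[name] = df[cols].to_numpy().tolist()  # one list per row
print(result)
#    time      dataset1      dataset2
# 0     0  [43, 35, 29]  [21, 59, 39]
# 1     1  [21, 59, 39]  [43, 35, 29]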
