I have a large CSV file with columns whose labels encode the name and index of an array, e.g.:
time, dataset1[0], dataset1[1], dataset1[2], dataset2[0], dataset2[1], dataset2[2]\n
0, 43, 35, 29, 21, 59, 39\n
1, 21, 59, 39, 43, 35, 29\n
You get the idea (obviously there is far more data in the arrays).
Any ideas how I can easily parse/strip this into an efficient dataframe?
[EDIT]
Ideally I'm after a structure like this:
time dataset1 dataset2
0 0 [43,35,29] [21,59,39]
1 1 [21,59,39] [43,35,29]
where the indices have been stripped from the labels and turned into nparray indices.
from pandas import read_csv
df = read_csv('data.csv')
print(df)
Gives as output:
time dataset1[0] dataset1[1] dataset1[2] dataset2[0] dataset2[1] \
0 0 43 35 29 21 59
1 1 21 59 39 43 35
dataset2[2]
0 39
1 29
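One possible approach (a sketch, not tested against your full data): strip the [index] suffix from each label with a regex, group the columns by the remaining prefix, and collect each row's values into a numpy array per dataset.
import re
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv').set_index('time')

result = pd.DataFrame(index=df.index)
# Unique prefixes, e.g. 'dataset1[0]' -> 'dataset1'.
prefixes = sorted({re.match(r'(.+)\[\d+\]', c).group(1) for c in df.columns})
for name in prefixes:
    # Columns belonging to this dataset, ordered by their [index].
    cols = sorted((c for c in df.columns if c.startswith(name + '[')),
                  key=lambda c: int(re.search(r'\[(\d+)\]', c).group(1)))
    # One numpy array per row, holding that row's values for this dataset.
    result[name] = list(df[cols].to_numpy())
print(result)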
Related
I'm reading a CSV and the data is a little bit messy. Here's the code:
import pandas as pd
ocorrencias = pd.read_csv('data.csv', encoding="cp1252", header=None)
ocorrencias = ocorrencias.drop([0, 1, 2, 4, 10, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36], axis=1)
Output:
And I want to remove the column names from the rows and promote them to the headers, so the dataframe will look like:
Can anyone help me?
You can use split(': ') to keep only the part after ': ' in the cells:
df = df.apply(lambda x: x.str.split(': ', n=1).str[1])
You can also use split(': ') to get the column names from any row (i.e. from the first row, .iloc[0]):
df.columns = df.iloc[0].str.split(': ', n=1).str[0]
Minimal working code
First it gets the headers, before the names are removed from the cells.
I used random to generate the values, but with random.seed(0) you should get the same values as in my result.
I use n=1 in split(': ', n=1) to split only on the first ':', because there can be more colons if you have text values.
import pandas as pd
import random

random.seed(0)  # to get the same random values in every test

df = pd.DataFrame([f'{col}: {random.randint(0, 100)}'
                   for col in ['hello', 'world', 'of', 'python']]
                  for row in range(3))
print(df)

df.columns = df.iloc[0].str.split(': ', n=1).str[0]
print(df)

df = df.apply(lambda x: x.str.split(': ', n=1).str[1])
print(df)
Result:

            0          1       2           3
0   hello: 49  world: 97  of: 53   python: 5
1   hello: 33  world: 65  of: 62  python: 51
2  hello: 100  world: 38  of: 61  python: 45

0       hello      world      of      python
0   hello: 49  world: 97  of: 53   python: 5
1   hello: 33  world: 65  of: 62  python: 51
2  hello: 100  world: 38  of: 61  python: 45

0  hello  world  of  python
0     49     97  53       5
1     33     65  62      51
2    100     38  61      45
Suppose I have the following DataFrame:
df = pd.DataFrame({'id': [2, 4, 10, 12, 13, 14, 19, 20, 21, 22, 24, 25, 27, 29, 30, 31, 42, 50, 54],
                   'value': [37410.0, 18400.0, 200000.0, 392000.0, 108000.0, 423000.0, 80000.0,
                             307950.0, 50807.0, 201740.0, 182700.0, 131300.0, 282005.0, 428800.0,
                             56000.0, 412400.0, 1091595.0, 1237200.0, 927500.0]})
And I do the following:
df.sort_values(by='id').set_index('id').cumsum()
value
id
2 37410.0
4 55810.0
10 255810.0
12 647810.0
13 755810.0
14 1178810.0
19 1258810.0
20 1566760.0
21 1617567.0
22 1819307.0
24 2002007.0
25 2133307.0
27 2415312.0
29 2844112.0
30 2900112.0
31 3312512.0
42 4404107.0
50 5641307.0
54 6568807.0
I want to know the first element of id at which the cumulative sum exceeds 25% of the total. In this example, 25% of the total cumsum would be 1,642,201.75, and the first element to exceed it would be 22. I know it can be done with a for loop, but I think that would be pretty inefficient.
You could do:
percentile_25 = df['value'].sum() * 0.25
res = df[df['value'].cumsum() > percentile_25].head(1)
print(res)
Output
id value
9 22 201740.0
Or use searchsorted to do the search in O(log N) (valid here because a cumulative sum of non-negative values is non-decreasing):
percentile_25 = df['value'].sum() * 0.25
i = df['value'].cumsum().searchsorted(percentile_25)
res = df.iloc[i]
print(res)
Output
id 22.0
value 201740.0
Name: 9, dtype: float64
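Another equivalent idiom (a sketch; idxmax on a boolean Series returns the label of the first True value):
mask = df['value'].cumsum() > percentile_25
res = df.loc[mask.idxmax()]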
I am currently running into two issues.
My dataframe looks like this:
, male_female, no_of_students
0, 24 : 76, "81,120"
1, 33 : 67, "12,270"
2, 50 : 50, "10,120"
3, 42 : 58, "5,120"
4, 12 : 88, "2,200"
What I would like to achieve is this:
, male, female, no_of_students
0, 24, 76, 81120
1, 33, 67, 12270
2, 50, 50, 10120
3, 42, 58, 5120
4, 12, 88, 2200
Basically I want to convert male_female into two columns and no_of_students into a column of integers. I tried a bunch of things, such as converting the no_of_students column to another type with .astype, but nothing seems to work properly. I also couldn't really find a smart way of splitting the male_female column.
Hopefully someone can help me out!
Use str.split with pop to create the new columns by separator, then strip the quotes, remove the thousands separators with replace, and if necessary convert to integers:
df[['male','female']] = df.pop('male_female').str.split(' : ', expand=True)
df['no_of_students'] = df['no_of_students'].str.strip('" ').str.replace(',','').astype(int)
df = df[['male','female', 'no_of_students']]
print(df)
male female no_of_students
0 24 76 81120
1 33 67 12270
2 50 50 10120
3 42 58 5120
4 12 88 2200
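If you also need male and female as integers rather than strings, a possible follow-up (a sketch):
df[['male', 'female']] = df[['male', 'female']].astype(int)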
I would like to take every 3 consecutive values from a column (a sliding window).
For example:
Input
12
73
56
33
16
output
12
73
56
------
73
56
33
-----
56
33
16
I have tried to add a key column and group by it, but my data frame is too large to perform the grouping. Here is my attempt:
df.groupby('key').agg(lambda x: x.tolist())
If you use a list, you can do it like this:
lst = [12, 73, 56, 33, 16]
slide_size = 3

result = []
for i in range(len(lst) - slide_size + 1):
    result.append(lst[i:i+slide_size])

result
# output: [[12, 73, 56], [73, 56, 33], [56, 33, 16]]
After this, you can transform the list into a DataFrame, e.g. with pd.DataFrame(result).
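For a large column, a vectorised alternative (a sketch, assuming numpy >= 1.20 for sliding_window_view) avoids the Python loop entirely:
import numpy as np
import pandas as pd

lst = [12, 73, 56, 33, 16]
# Each row of the result is one length-3 window over the input.
windows = np.lib.stride_tricks.sliding_window_view(np.array(lst), 3)
df = pd.DataFrame(windows)
print(df)
#     0   1   2
# 0  12  73  56
# 1  73  56  33
# 2  56  33  16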
In this pandas dataframe:
df =
pos index data
21 36 a,b,c
21 36 a,b,c
23 36 c,d,e
25 36 f,g,h
27 36 g,h,k
29 39 a,b,c
29 39 a,b,c
31 39 .
35 39 c,k
36 41 g,h
38 41 k,l
39 41 j,k
39 41 j,k
I want to remove the repeated lines, but only within the same index group and only when they are in the head region of the subframe.
So, I did:
df_grouped = df.groupby(['index'], as_index=True)
now,
for i, sub_frame in df_grouped:
    sub_frame.apply(lambda g: ...)  # remove one duplicate line in the head region if the pos value is a repeat
I want to apply this method because some pos values are repeated in the tail region, and those should not be removed.
Any suggestions?
Expected output:
pos index data
removed
21 36 a,b,c
23 36 c,d,e
25 36 f,g,h
27 36 g,h,k
removed
29 39 a,b,c
31 39 .
35 39 c,k
36 41 g,h
38 41 k,l
39 41 j,k
39 41 j,k
If it doesn't have to be done in a single apply statement, then this code will only remove duplicates in the head region:
import pandas as pd

data = {'pos':  [21, 21, 23, 25, 27, 29, 29, 31, 35, 36, 38, 39, 39],
        'idx':  [36, 36, 36, 36, 36, 39, 39, 39, 39, 41, 41, 41, 41],
        'data': ['a,b,c', 'a,b,c', 'c,d,e', 'f,g,h', 'g,h,k', 'a,b,c', 'a,b,c',
                 '.', 'c,k', 'g,h', 'k,l', 'j,k', 'j,k']}
df = pd.DataFrame(data)
accum = []
for i, sub_frame in df.groupby('idx'):
    accum.append(pd.concat([sub_frame.iloc[:2].drop_duplicates(), sub_frame.iloc[2:]]))
df2 = pd.concat(accum)
print(df2)
EDIT2: The first version of the chained command that I posted was wrong and only worked for the sample data. This version provides a more general solution that removes the duplicate rows per the OP's request:
df.drop(df.groupby('idx')          # group by the index column
          .head(2)                 # select the first two rows of each group
          .duplicated()            # create a Series with True for duplicate rows
          .to_frame(name='duped')  # make the Series a dataframe
          .query('duped')          # select only the duplicate rows
          .index)                  # provide the index of the duplicated rows to drop
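The same logic can also be written without query/to_frame by building the boolean mask directly (a sketch):
head_dupes = df.groupby('idx').head(2).duplicated()  # True for duplicated head rows
df2 = df.drop(head_dupes[head_dupes].index)          # drop them from the full frame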