pandas: convert a CSV series into a data frame - python

I'm new to pandas so apologies for what I think is a trivial question, but I can't quite find the relevant function for this:
I've got a file which consists of essentially 12 different data series, with the nth element of each series grouped together; i.e.
series_A_data0
series_B_data0
series_C_data0
...
series_L_data0
series_A_data1
series_B_data1
series_C_data1
...
I can import this into pandas as a single-column data frame, but how can I get it into a 12-column data frame?
For reference, currently I'm doing:
data = pd.read_csv(file)
data.head(14)
0 17655029760
1 1529585664
2 1598763008
3 4936196096
4 2192232448
5 2119827456
6 2143997952
7 1549099008
8 1593683968
9 1361498112
10 1514512384
11 1346588672
12 17939451904
13 1544957952

Do you know that the series will always be in the same order? If so, I'd create a MultiIndex and then unstack from that. Just read in the Series like you've done. I'll work with this data frame:
In [31]: df = pd.DataFrame(np.random.randn(24))
In [32]: df
Out[32]:
0
0 -1.642765
1 1.369409
2 -0.732588
3 0.357242
4 -1.259126
5 0.851803
6 -1.582394
7 -0.508507
8 0.123032
9 0.421857
10 -0.524147
11 0.381085
12 1.286025
13 -0.983004
14 0.813764
15 -0.203370
16 -1.107230
17 1.855278
18 -2.041401
19 1.352107
20 -1.630252
21 -0.326678
22 -0.080991
23 0.438606
In [33]: import itertools as it
In [34]: series_id = it.cycle(list('abcdefghijkl')) # first 12 letters.
In [60]: idx = pd.MultiIndex.from_tuples(zip(series_id, df.index.repeat(12)[:len(df)]))
We need to repeat the index so that the first observation for each Series is at index 0. Now set that as the index and unstack.
In [61]: df.index = idx
In [62]: df
Out[62]:
0
a 0 -1.642765
b 0 1.369409
c 0 -0.732588
d 0 0.357242
e 0 -1.259126
f 0 0.851803
g 0 -1.582394
h 0 -0.508507
i 0 0.123032
j 0 0.421857
k 0 -0.524147
l 0 0.381085
a 1 1.286025
b 1 -0.983004
c 1 0.813764
d 1 -0.203370
e 1 -1.107230
f 1 1.855278
g 1 -2.041401
h 1 1.352107
i 1 -1.630252
j 1 -0.326678
k 1 -0.080991
l 1 0.438606
[24 rows x 1 columns]
In [74]: df.unstack(0)[0]
Out[74]:
a b c d e f g \
0 -1.642765 1.369409 -0.732588 0.357242 -1.259126 0.851803 -1.582394
1 1.286025 -0.983004 0.813764 -0.203370 -1.107230 1.855278 -2.041401
h i j k l
0 -0.508507 0.123032 0.421857 -0.524147 0.381085
1 1.352107 -1.630252 -0.326678 -0.080991 0.438606
[2 rows x 12 columns]
unstack(0) says to move the outer index level to the columns.
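For anyone who wants to run these steps end to end, here is a minimal self-contained sketch. It assumes the data holds a whole number of 12-value groups; the stand-in values from np.arange and the names n_series, obs and wide are illustrative, not from the question.

```python
import itertools as it

import numpy as np
import pandas as pd

n_series = 12
# Stand-in for the single-column frame read from the file: 2 observations x 12 series.
df = pd.DataFrame(np.arange(24, dtype=float))

# Series labels cycle a..l; the observation number repeats 12 times per block.
series_id = it.cycle("abcdefghijkl")
obs = df.index.repeat(n_series)[: len(df)]  # 0 twelve times, then 1 twelve times
df.index = pd.MultiIndex.from_tuples(list(zip(series_id, obs)))

# Move the series labels (outer level) to the columns, then select original column 0.
wide = df.unstack(0)[0]
print(wide)
```

Each row of wide is one observation and each column is one series.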

I don't know if there is a simpler method, but if you can construct a comparable series with the desired column names and index values, you can use pd.pivot:
Suppose you have 3 times the 12 values, creating a dummy example:
data = pd.Series(np.random.randn(12*3))
Now you can construct the desired columns and indices as follows:
col = pd.Series(np.tile(list('ABCDEFGHIJKL'),3))
idx = pd.Series(np.repeat(np.arange(3), 12))
And now:
In [18]: pd.pivot(index=idx, columns=col, values=data.values)
Out[18]:
A B C D E F G \
0 1.296702 0.270532 -0.645502 0.213300 -0.224421 -0.634656 -2.362567
1 -1.986403 1.006665 -1.167412 -0.697443 -1.394925 -0.365205 -1.468349
2 0.689492 -0.410681 0.378916 1.552068 0.144651 -0.419082 -0.433970
H I J K L
0 2.102229 0.538711 -0.839540 -0.066535 1.154742
1 -1.090374 -1.344588 0.515923 -0.050190 -0.163259
2 -0.235364 0.296751 0.456884 0.237697 1.089476
PS: for some reason just using data instead of data.values does not work.
You can also do it with unstack, as @TomAugspurger explained:
midx = pd.MultiIndex.from_tuples(zip(idx, col))
data.index = midx
data.unstack()
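In recent pandas versions pivot expects a DataFrame as its data argument, so the call shown in In [18] may not run as written. A rough equivalent, assuming a long-format frame with the made-up column names row, series and value, could look like this:

```python
import numpy as np
import pandas as pd

# Dummy data: 3 observations of 12 series, interleaved as in the question.
data = pd.Series(np.arange(36, dtype=float))

# Long format: one row per value, tagged with its observation number and series name.
long_df = pd.DataFrame({
    "row": np.repeat(np.arange(3), 12),
    "series": np.tile(list("ABCDEFGHIJKL"), 3),
    "value": data.to_numpy(),
})

wide = long_df.pivot(index="row", columns="series", values="value")
print(wide)
```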

Related

pandas - take last N rows from one subgroup

Let's suppose we have a dataframe that can be generated using this code:
import numpy as np
import pandas as pd
d = {'p1': np.random.rand(32),
     'a1': np.random.rand(32),
     'phase': [0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3, 0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3],
     'file_number': [1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 2,2,2,2, 2,2,2,2, 2,2,2,2, 2,2,2,2]
     }
df = pd.DataFrame(d)
For each file number I want to keep only the last N rows of phase number 3, so that the result for N==2 looks like this:
Currently I'm doing it in this way:
def phase_3_last_n_observations(df, n):
    result = []
    for fn in df['file_number'].unique():
        file_df = df[df['file_number']==fn]
        for phase in [0,1,2,3]:
            phase_df = file_df[file_df['phase']==phase]
            if phase == 3:
                phase_df = phase_df[-n:]
            result.append(phase_df)
    df = pd.concat(result, axis=0)
    return df
phase_3_last_n_observations(df, 2)
However, it is very slow and I have terabytes of data, so I need to worry about performance. Does anyone have any idea how to speed my solution up? Thanks!
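Filter the rows where phase is 3, then group by file_number and use tail to select the last two rows per group; finally concatenate them with the remaining rows to get the result (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
m = df['phase'].eq(3)
pd.concat([df[~m], df[m].groupby('file_number').tail(2)]).sort_index()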
p1 a1 phase file_number
0 0.223906 0.164288 0 1
1 0.214081 0.748598 0 1
2 0.567702 0.226143 0 1
3 0.695458 0.567288 0 1
4 0.760710 0.127880 1 1
5 0.592913 0.397473 1 1
6 0.721191 0.572320 1 1
7 0.047981 0.153484 1 1
8 0.598202 0.203754 2 1
9 0.296797 0.614071 2 1
10 0.961616 0.105837 2 1
11 0.237614 0.640263 2 1
14 0.500415 0.220355 3 1
15 0.968630 0.351404 3 1
16 0.065283 0.595144 0 2
17 0.308802 0.164214 0 2
18 0.668811 0.826478 0 2
19 0.888497 0.186267 0 2
20 0.199129 0.241900 1 2
21 0.345185 0.220940 1 2
22 0.389895 0.761068 1 2
23 0.343100 0.582458 1 2
24 0.182792 0.245551 2 2
25 0.503181 0.894517 2 2
26 0.144294 0.351350 2 2
27 0.157116 0.847499 2 2
30 0.194274 0.143037 3 2
31 0.542183 0.060485 3 2
This uses an idea from a deleted answer: for the rows where phase is 3, number them from the end of each file_number group with GroupBy.cumcount(ascending=False), collect the index labels of those that are not among the last N, and remove them with DataFrame.drop:
def phase_3_last_n_observations(df, N):
    df1 = df[df['phase'].eq(3)]
    idx = df1[df1.groupby('file_number').cumcount(ascending=False).ge(N)].index
    return df.drop(idx)
# the index is reset to the default first, because its labels are used to drop rows
df = phase_3_last_n_observations(df.reset_index(drop=True), 2)
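To see why the comparison with N works, it may help to look at what cumcount(ascending=False) returns on a tiny frame. The data below is made up for illustration and contains only phase 3 rows; rev and kept are hypothetical names.

```python
import pandas as pd

# Two files with 4 and 3 phase-3 rows respectively.
df = pd.DataFrame({
    "file_number": [1, 1, 1, 1, 2, 2, 2],
    "phase": [3, 3, 3, 3, 3, 3, 3],
})

# Counts down to 0 within each group, so the last row of every group gets 0.
rev = df.groupby("file_number").cumcount(ascending=False)
print(rev.tolist())

# Rows whose reverse count is below N are the last N rows of their group.
N = 2
kept = df[rev.lt(N)]
print(kept)
```

Rows with a reverse count of N or more are exactly the ones the function above drops.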
As an alternative to the existing answers, you can take the last elements of every phase group and then use .loc to pick out the group you need. The code below is for N==2; for N==3, use [-1, -2, -3]. Note that it groups by phase only, not by file_number.
result = df.groupby(['phase']).nth([-1, -2])
PHASE = 3
result.loc[PHASE]

Find symmetric pairs quickly in numpy

from itertools import product
import pandas as pd
df = pd.DataFrame.from_records(product(range(10), range(10)))
df = df.sample(90)
df.columns = "c1 c2".split()
df = df.sort_values(df.columns.tolist()).reset_index(drop=True)
# c1 c2
# 0 0 0
# 1 0 1
# 2 0 2
# 3 0 3
# 4 0 4
# .. .. ..
# 85 9 4
# 86 9 5
# 87 9 7
# 88 9 8
# 89 9 9
#
# [90 rows x 2 columns]
How do I quickly find, identify, and remove the last duplicate of all symmetric pairs in this data frame?
An example of symmetric pair is that '(0, 1)' is equal to '(1, 0)'. The latter should be removed.
The algorithm must be fast, so it is recommended to use numpy. Converting to python object is not allowed.
You can sort the values, then groupby:
a= np.sort(df.to_numpy(), axis=1)
df.groupby([a[:,0], a[:,1]], as_index=False, sort=False).first()
Option 2: If you have a lot of pairs c1, c2, groupby can be slow. In that case, we can assign new values and filter by drop_duplicates:
a= np.sort(df.to_numpy(), axis=1)
(df.assign(one=a[:,0], two=a[:,1]) # one and two can be changed
.drop_duplicates(['one','two']) # taken from above
.reindex(df.columns, axis=1)
)
One way is using np.unique with return_index=True and use the result to index the dataframe:
a = np.sort(df.values)
_, ix = np.unique(a, return_index=True, axis=0)
print(df.iloc[ix, :])
c1 c2
0 0 0
1 0 1
20 2 0
3 0 3
40 4 0
50 5 0
6 0 6
70 7 0
8 0 8
9 0 9
11 1 1
21 2 1
13 1 3
41 4 1
51 5 1
16 1 6
71 7 1
...
Using frozenset:
mask = pd.Series(map(frozenset, zip(df.c1, df.c2))).duplicated()
df[~mask]
I would do:
df[~pd.DataFrame(np.sort(df.values,1)).duplicated().values]
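Several of the answers above rely on the same idea: sorting each row puts both orientations of a pair into one canonical form, after which ordinary duplicate detection applies. A small sketch on made-up data (canon, dup and out are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c1": [0, 1, 2, 1, 3], "c2": [1, 0, 2, 3, 1]})

# Sort within each row so that (1, 0) and (0, 1) both become (0, 1).
canon = np.sort(df.to_numpy(), axis=1)

# Mark every later occurrence of an already seen canonical pair.
dup = pd.DataFrame(canon).duplicated().to_numpy()

out = df[~dup]
print(out)
```

The first occurrence of each pair is kept with its original orientation and index label.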
Using pandas crosstab and NumPy triu (np.bool was removed in NumPy 1.24, so the built-in bool is used):
s=pd.crosstab(df.c1,df.c2)
s=s.mask(np.triu(np.ones(s.shape)).astype(bool) & s==0).stack().reset_index()
Here's one NumPy based one for integers -
def remove_symm_pairs(df):
    a = df.to_numpy(copy=False)
    b = np.sort(a,axis=1)
    idx = np.ravel_multi_index(b.T,(b.max(0)+1))
    sidx = idx.argsort(kind='mergesort')
    p = idx[sidx]
    m = np.r_[True,p[:-1]!=p[1:]]
    a_out = a[np.sort(sidx[m])]
    df_out = pd.DataFrame(a_out)
    return df_out
If you want to keep the index data as it is, use return df.iloc[np.sort(sidx[m])].
For generic numbers (ints/floats, etc.), we will use a view-based one -
# https://stackoverflow.com/a/44999009/ @Divakar
def view1D(a): # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()
and simply replace the step to get idx with idx = view1D(b) in remove_symm_pairs.
If this needs to be fast, and if your variables are integer, then the following trick may help: let v,w be the columns of your vector; construct [v+w, np.abs(v-w)] =: [x, y]; then sort this matrix lexicographically, remove duplicates, and finally map it back to [v, w] = [(x+y), (x-y)]/2.
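A minimal sketch of that trick, assuming integer columns and made-up data. Note that mapping back yields each pair as (larger, smaller) and in sorted order, so the original orientation and row order are not preserved.

```python
import numpy as np

v = np.array([0, 1, 2, 1, 3])
w = np.array([1, 0, 2, 3, 1])

# Sum and absolute difference are identical for (v, w) and (w, v).
x = v + w
y = np.abs(v - w)

# np.unique with axis=0 sorts the rows lexicographically and drops duplicates.
xy = np.unique(np.column_stack([x, y]), axis=0)

# Map back: x + y = 2 * max(v, w) and x - y = 2 * min(v, w), so both are even.
larger = (xy[:, 0] + xy[:, 1]) // 2
smaller = (xy[:, 0] - xy[:, 1]) // 2
pairs = np.column_stack([larger, smaller])
print(pairs)
```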

Pandas Counting Each Column with its Specific Thresholds

If I have the following dataframe:
A B C D E
1 1 2 0 1 0
2 0 0 0 1 -1
3 1 1 3 -5 2
4 -3 4 2 6 0
5 2 4 1 9 -1
T 1 2 2 4 1
The last row holds my threshold value for each column. For each column, I want to count how many values are lower than that column's threshold, using pandas.
The desired output is:
A B C D E
Count 2 2 3 3 4
But, I need to figure it out with a general solution, not for these specific columns. Because I have a large dataset. I cannot specify a column name for each of them in the code.
Could you please help me with this?
Select all rows except the last by indexing and compare them with the last row using DataFrame.lt, then sum and convert the resulting Series to a one-row DataFrame with Series.to_frame and a transpose via DataFrame.T:
df = df.iloc[:-1].lt(df.iloc[-1]).sum().to_frame('count').T
print (df)
A B C D E
count 2 2 3 3 4
Numpy alternative with DataFrame constructor:
arr = df.values
df = pd.DataFrame([np.sum(arr[:-1] < arr[-1], axis=0)], columns=df.columns, index=['count'])
print (df)
A B C D E
count 2 2 3 3 4
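For reference, here is the first approach applied to the exact table from the question, rebuilt by hand. The comparison aligns the threshold row with the columns by name, so no column has to be listed explicitly.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 0, 1, -3, 2, 1],
        "B": [2, 0, 1, 4, 4, 2],
        "C": [0, 0, 3, 2, 1, 2],
        "D": [1, 1, -5, 6, 9, 4],
        "E": [0, -1, 2, 0, -1, 1],
    },
    index=[1, 2, 3, 4, 5, "T"],
)

values = df.iloc[:-1]      # every row except the threshold row
thresholds = df.iloc[-1]   # the threshold row, as a Series indexed by column name

# lt compares each column with its own threshold; sum counts the True values.
counts = values.lt(thresholds).sum().to_frame("count").T
print(counts)
```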

Finding min of values across multiple columns in pandas

I am trying to find min of values across columns in a pandas data frame where cols are ranged and split. For example, I have the dataframe in pandas as shown in the image.
I am iterating over the dataframe for more logic and would like to get the min of values in columns between T3:T6 and T11:T14 in separate variables.
Tried print(df.iloc[2,2:,2:4].min(axis=1))
I expect 9 and 13 for Row1 when I iterate.
create a simple dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
A B C D
0 2 0 5 1
1 9 7 5 5
2 5 5 3 0
3 0 6 3 8
4 4 4 4 0
5 8 2 1 4
6 4 1 1 8
7 6 5 2 9
8 2 4 3 0
9 4 7 1 8
use the min() function:
df.min()
result:
A 0
B 0
C 1
D 0
and if you wish to select specific columns, use the loc:
df.loc[:,'B':'C'].min()
B 0
C 1
Bonus: Take pandas to another level - paint the minimum:
df.style.apply(lambda x: ['background-color : red; font-size: 16px' if v==x.min() else 'font-size: 16px' for _,v in enumerate(x) ],axis=0)
print(df[['T'+str(x) for x in range(3,7)]].min(axis=1))
print(df[['T'+str(x) for x in range(11,15)]].min(axis=1))
should print the row-wise minimums over T3, T4, T5, T6 and over T11, T12, T13, T14 separately
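Since the question's data frame is only available as an image, the sketch below assumes columns named T1 to T14 in that order and fills them with made-up numbers. It uses label slicing with loc, which includes both endpoints, as another way to pick the two column ranges.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 2 rows, columns T1..T14, values 0..27.
cols = [f"T{i}" for i in range(1, 15)]
df = pd.DataFrame(np.arange(28).reshape(2, 14), columns=cols)

# Row-wise minimum over each block of columns.
min_t3_t6 = df.loc[:, "T3":"T6"].min(axis=1)
min_t11_t14 = df.loc[:, "T11":"T14"].min(axis=1)

# The two Series share the frame's index, so they can be read row by row.
for row_label in df.index:
    print(row_label, min_t3_t6[row_label], min_t11_t14[row_label])
```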
For test dataframe:
df = pd.DataFrame({'A':[x for x in range(100)], 'B': [x for x in range(10,110)], 'C' : [x for x in range(20,120)] })
Create a function that can be applied to each row to find the minimum:
def test(row):
    print(row[['A','B']].min())
Then use apply to run the function on each row:
df.apply(lambda row: test(row), axis=1)
This will print the minimum of whichever columns you put in the "test function"

Modifying DataFrames in loop

Given this data frame:
import pandas as pd
df=pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
df
A B C
0 1 4 7
1 2 5 8
2 3 6 9
I'd like to create 3 new data frames; one from each column.
I can do this one at a time like this:
a=pd.DataFrame(df[['A']])
a
A
0 1
1 2
2 3
But instead of doing this for each column, I'd like to do it in a loop.
Here's what I've tried:
a=b=c=df.copy()
dfs=[a,b,c]
fields=['A','B','C']
for d,f in zip(dfs,fields):
    d=pd.DataFrame(d[[f]])
...but when I then print each one, I get the whole original data frame as opposed to just the column of interest.
a
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Update:
My actual data frame will have some columns that I do not need and the columns will not be in any sort of order, so I need to be able to get the columns by name.
Thanks in advance!
A simple list comprehension should be enough.
In [68]: df_list = [df[[x]] for x in df.columns]
Printing out the list, this is what you get:
In [69]: for d in df_list:
    ...:     print(d)
    ...:     print('-' * 5)
    ...:
A
0 1
1 2
2 3
-----
B
0 4
1 5
2 6
-----
C
0 7
1 8
2 9
-----
Each element in df_list is its own data frame, corresponding to one column of the original. Furthermore, you don't even need fields; use df.columns instead.
Or you can try this: instead of creating copies of df, this method assigns each result to its own DataFrame variable rather than collecting them in a list. However, I think saving the DataFrames in a list is better.
dfs=['a','b','c']
fields=['A','B','C']
variables = locals()
for d,f in zip(dfs,fields):
    variables["{0}".format(d)] = df[[f]]
a
Out[743]:
A
0 1
1 2
2 3
b
Out[744]:
B
0 4
1 5
2 6
c
Out[745]:
C
0 7
1 8
2 9
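If the goal is simply to reach each single-column frame by name, a plain dictionary is a possible alternative to writing into locals(). This is only a sketch: the names wanted and frames are made up, and the extra column unused stands in for the columns the asker does not need.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9], "unused": [0, 0, 0]})

# Columns are picked by name, in any order; double brackets keep each result a DataFrame.
wanted = ["C", "A"]
frames = {name: df[[name]] for name in wanted}

print(frames["A"])
```

Looking up frames["A"] then plays the role of the separate variable a.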
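loc selects by label, so for positional access you should use iloc; passing a list keeps each result a data frame:
a = df.iloc[:, [0]]
and then loop through like
dfs = []
for i in range(df.columns.size):
    dfs.append(df.iloc[:, [i]])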
