Pandas: Get out values from a list in a DataFrame - python

From one of my scripts I end up with a big DataFrame in pandas,
and one of its columns looks like this:
13 [1705916]
14 [116242799]
15 [17865718]
...
9551 [74736013]
9553 []
9620 [92090990]
9666 [113455]
9667 [327478610]
9733 [52782791]
9838 []
9951 [229462842]
9952 []
10070 []
When I do type(df.column_of_interest)
I get back <class 'pandas.core.series.Series'>.
So my question is: is it possible to extract the data from the lists in a DataFrame while keeping the rows whose list is empty?
expected output:
13 1705916
14 116242799
15 17865718
...
9551 74736013
9553
9620 92090990
9666 113455
9667 327478610
9733 52782791
9838
9951 229462842
9952
10070

If each cell of the column currently holds a list of integers, you can use Series.apply along with str.join() to get what you want. Example:
In [42]: df = pd.DataFrame([[1,[2]],[2,[3]],[3,[]],[4,[5,6]]], columns=['A','B'])
In [43]: df
Out[43]:
A B
0 1 [2]
1 2 [3]
2 3 []
3 4 [5, 6]
In [44]: df['B'] = df['B'].apply(lambda x:','.join([str(i) for i in x]))
In [45]: df
Out[45]:
A B
0 1 2
1 2 3
2 3
3 4 5,6
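The join approach above turns every cell into a string. If the values should stay numeric, two other options may be worth trying; this is a sketch rather than part of the original answer. It assumes pandas 0.25 or later (for DataFrame.explode), and the small frame and the names exploded and first are made up for illustration.

```python
import pandas as pd

# Toy frame mirroring the answer's example: column B holds lists, one of them empty.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [[2], [3], [], [5, 6]]})

# Option 1: one row per list element. An empty list becomes a single NaN row,
# so the row is kept; a list with two items produces two rows.
exploded = df.explode("B")

# Option 2: keep exactly one row per original row and take only the first
# element of each list, leaving a missing value where the list is empty.
first = df["B"].apply(lambda items: items[0] if items else None)

print(exploded)
print(first)
```

Option 1 repeats the original index label for multi-element lists, while option 2 silently discards everything after the first element, so which one fits depends on whether the lists can ever hold more than one value.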

Related

Apply a for loop to multiple DataFrames in Pandas

I have multiple DataFrames that I want to do the same thing to.
First I create a list of the DataFrames. All of them have the same column called 'result'.
df_list = [df1,df2,df3]
I want to keep only the rows in all the DataFrames with value 'passed' so I use a for loop on my list:
for df in df_list:
    df = df[df['result'] == 'passed']
...this does not work, the values are not filtered out of each DataFrame.
If I filter each one separately then it does work.
df1 =df1[df1['result'] == 'passed']
df2 =df2[df2['result'] == 'passed']
df3 =df3[df3['result'] == 'passed']
This is because every time you take a subset like df[<whatever>], you get back a new dataframe. Assigning it to the loop variable df only rebinds that name; the dataframes in the list are untouched, and the name is overwritten on the next iteration (although you do keep the last one). This is similar to slicing lists:
>>> list1 = [1,2,3,4]
>>> list2 = [11,12,13,14]
>>> for lyst in list1,list2:
...     lyst = lyst[1:-1]
...
>>> list1, list2
([1, 2, 3, 4], [11, 12, 13, 14])
>>> lyst
[12, 13]
Usually, you need to use a mutator method if you want to actually modify the lists in place. Equivalently, with a dataframe, you could assign through an indexer (.loc, .iloc; the older .ix indexer has since been removed from pandas) in combination with the .dropna method, being careful to pass the inplace=True argument. Suppose I have three dataframes and I want to keep only the rows where my second column is positive:
Warning: this approach is not ideal; see the edit further down for a better way.
In [11]: df1
Out[11]:
0 1 2 3
0 0.957288 -0.170286 0.406841 -3.058443
1 1.762343 -1.837631 -0.867520 1.666193
2 0.618665 0.660312 -1.319740 -0.024854
3 -2.008017 -0.445997 -0.028739 -0.227665
4 0.638419 -0.271300 -0.918894 1.524009
5 0.957006 1.181246 0.513298 0.370174
6 0.613378 -0.852546 -1.778761 -1.386848
7 -1.891993 -0.304533 -1.427700 0.099904
In [12]: df2
Out[12]:
0 1 2 3
0 -0.521018 0.407258 -1.167445 -0.363503
1 -0.879489 0.008560 0.224466 -0.165863
2 0.550845 -0.102224 -0.575909 -0.404770
3 -1.171828 -0.912451 -1.197273 0.719489
4 -0.887862 1.073306 0.351835 0.313953
5 -0.517824 -0.096929 -0.300282 0.716020
6 -1.121527 0.183219 0.938509 0.842882
7 0.003498 -2.241854 -1.146984 -0.751192
In [13]: df3
Out[13]:
0 1 2 3
0 0.240411 0.795132 -0.305770 -0.332253
1 -1.162097 0.055346 0.094363 -1.254859
2 -0.493466 -0.717872 1.090417 -0.591872
3 1.021246 -0.060453 -0.013952 0.304933
4 -0.859882 -0.947950 0.562609 1.313632
5 0.917199 1.186865 0.354839 -1.771787
6 -0.694799 -0.695505 -1.077890 -0.880563
7 1.088068 -0.893466 -0.188419 -0.451623
In [14]: for df in df1, df2, df3:
....:     df.loc[:,:] = df.loc[df[1] > 0,:]
....:     df.dropna(inplace = True,axis =0)
....:
In [15]: df1
Out[15]:
0 1 2 3
2 0.618665 0.660312 -1.319740 -0.024854
5 0.957006 1.181246 0.513298 0.370174
In [16]: df2
Out[16]:
0 1 2 3
0 -0.521018 0.407258 -1.167445 -0.363503
1 -0.879489 0.008560 0.224466 -0.165863
4 -0.887862 1.073306 0.351835 0.313953
6 -1.121527 0.183219 0.938509 0.842882
In [17]: df3
Out[17]:
0 1 2 3
0 0.240411 0.795132 -0.305770 -0.332253
1 -1.162097 0.055346 0.094363 -1.254859
5 0.917199 1.186865 0.354839 -1.771787
Edited to Add:
I think I found a better way just using the .drop method.
In [21]: df1
Out[21]:
0 1 2 3
0 -0.804913 -0.481498 0.076843 1.136567
1 -0.457197 -0.903681 -0.474828 1.289443
2 -0.820710 1.610072 0.175455 0.712052
3 0.715610 -0.178728 -0.664992 1.261465
4 -0.297114 -0.591935 0.487698 0.760450
5 1.035231 -0.108825 -1.058996 0.056320
6 1.579931 0.958331 -0.653261 -0.171245
7 0.685427 1.447411 0.001002 0.241999
In [22]: df2
Out[22]:
0 1 2 3
0 1.660864 0.110002 0.366881 1.765541
1 -0.627716 1.341457 -0.552313 0.578854
2 0.277738 0.128419 -0.279720 -1.197483
3 -1.294724 1.396698 0.108767 1.353454
4 -0.379995 0.215192 1.446584 0.530020
5 0.557042 0.339192 -0.105808 -0.693267
6 1.293941 0.203973 -3.051011 1.638143
7 -0.909982 1.998656 -0.057350 2.279443
In [23]: df3
Out[23]:
0 1 2 3
0 -0.002327 -2.054557 -1.752107 -0.911178
1 -0.998328 -1.119856 1.468124 -0.961131
2 -0.048568 0.373192 -0.666330 0.867719
3 0.533597 -1.222963 0.119789 -0.037949
4 1.203075 -0.773511 0.475809 1.352943
5 -0.984069 -0.352267 -0.313516 0.138259
6 0.114596 0.354404 2.119963 -0.452462
7 -1.033029 -0.787237 0.479321 -0.818260
In [25]: for df in df1,df2,df3:
....:     df.drop(df.index[df[1] < 0],axis=0,inplace=True)
....:
In [26]: df1
Out[26]:
0 1 2 3
2 -0.820710 1.610072 0.175455 0.712052
6 1.579931 0.958331 -0.653261 -0.171245
7 0.685427 1.447411 0.001002 0.241999
In [27]: df2
Out[27]:
0 1 2 3
0 1.660864 0.110002 0.366881 1.765541
1 -0.627716 1.341457 -0.552313 0.578854
2 0.277738 0.128419 -0.279720 -1.197483
3 -1.294724 1.396698 0.108767 1.353454
4 -0.379995 0.215192 1.446584 0.530020
5 0.557042 0.339192 -0.105808 -0.693267
6 1.293941 0.203973 -3.051011 1.638143
7 -0.909982 1.998656 -0.057350 2.279443
In [28]: df3
Out[28]:
0 1 2 3
2 -0.048568 0.373192 -0.666330 0.867719
6 0.114596 0.354404 2.119963 -0.452462
Certainly faster:
In [8]: timeit.Timer(stmt="df.loc[:,:] = df.loc[df[1] > 0, :];df.dropna(inplace = True,axis =0)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[8]: 23.69621358400036
In [9]: timeit.Timer(stmt="df.drop(df.index[df[1] < 0],axis=0,inplace=True)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[9]: 11.476448250003159
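If modifying the frames in place is not a requirement, a simpler route is to build the filtered frames and store them back, which sidesteps the rebinding problem described above. The following is a minimal sketch with made-up data; the score column is hypothetical and only there to give the frames a second column.

```python
import pandas as pd

# Hypothetical frames sharing a 'result' column, as in the question.
df1 = pd.DataFrame({"result": ["passed", "failed", "passed"], "score": [1, 2, 3]})
df2 = pd.DataFrame({"result": ["failed", "failed"], "score": [4, 5]})
df3 = pd.DataFrame({"result": ["passed"], "score": [6]})

df_list = [df1, df2, df3]

# Each boolean-mask selection returns a new frame; the comprehension collects
# those new frames into a new list instead of discarding them.
df_list = [df[df["result"] == "passed"] for df in df_list]

# Rebind the original names as well, if they are still used later on.
df1, df2, df3 = df_list

print([len(df) for df in df_list])
```

The original df1, df2 and df3 objects are left untouched by the comprehension; only the names are pointed at the filtered copies.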
If several columns each need a pass/fail label, for example values of 40 or more count as passed and anything lower as failed, you can label them in a loop:
import numpy as np
col_list = ['result1', 'result2']
df_list = [df1, df2, df3]
for d in df_list:
    for c in col_list:
        # put your own condition here
        d[c] = np.where(d[c] >= 40, 'Passed', 'failed')
Then you can filter each individual dataframe by a condition:
df1 = df1[df1['col_name'] == 'xyz']
I will update the answer if the for loop works. To be continued...

Python - Get group names from aggregated results in pandas

I have a dataframe like this:
minute values
0 1 3
1 2 4
2 1 1
3 4 6
4 3 7
5 2 2
When I apply
df.groupby('minute').sum().sort_values('values', ascending=False)
This gives:
values
minute
3 7
2 6
4 6
1 4
I want to get the first two values in the minute column as an array like [3,2]. How can I access the values in the minute column?
If what you want is the values from the minute column of the grouped dataframe (which is also its index), you can use DataFrame.index to access them. Example:
grouped = df.groupby('minute').sum().sort_values('values', ascending=False)
grouped.index[:2]
If you really want it as a list, you can call .tolist() on the result. Example:
grouped.index[:2].tolist()
Demo -
In [3]: df
Out[3]:
minute values
0 1 3
1 2 4
2 1 1
3 4 6
4 3 7
5 2 2
In [4]: grouped = df.groupby('minute').sum().sort_values('values', ascending=False)
In [5]: grouped.index[:2]
Out[5]: Int64Index([3, 2], dtype='int64', name='minute')
In [6]: grouped.index[:2].tolist()
Out[6]: [3, 2]
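For reference, here is a self-contained version of the same idea, written as a sketch for recent pandas releases where sort_values is the sorting method. The data comes from the question; the names totals and top_two are illustrative.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({"minute": [1, 2, 1, 4, 3, 2],
                   "values": [3, 4, 1, 6, 7, 2]})

# Sum 'values' per minute; the group keys become the index of the result.
totals = df.groupby("minute")["values"].sum()

# Sort descending and read the first two group keys off the index.
top_two = totals.sort_values(ascending=False).index[:2].tolist()

print(top_two)
```

Minutes 2 and 4 both sum to 6, so which of them takes second place depends on how the sort breaks ties; the first entry is always 3.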

Pandas: remove rows of dataframe with unique index value

I am trying to remove the rows of my dataframe (df) whose index value is unique, i.e. appears only once. This is my df:
A B
1 3.803 4.797
1 3.276 3.878
2 5.181 6.342
3 6.948 9.186
3 8.762 10.136
4 10.672 12.257
4 8.266 13.252
5 13.032 14.656
6 15.021 17.681
6 16.426 15.07
I would like to remove the rows with index=2,5 to get a new dataframe (df_new) as follows:
A B
1 3.803 4.797
1 3.276 3.878
3 6.948 9.186
3 8.762 10.136
4 10.672 12.257
4 8.266 13.252
6 15.021 17.681
6 16.426 15.07
Is there some handy function in pandas to do that?
Thank you
Use get_duplicates:
In [36]:
df.loc[df.index.get_duplicates()]
Out[36]:
A B
1 3.803 4.797
1 3.276 3.878
3 6.948 9.186
3 8.762 10.136
4 10.672 12.257
4 8.266 13.252
6 15.021 17.681
6 16.426 15.070
get_duplicates returns an array of the duplicated indices:
In [37]:
df.index.get_duplicates()
Out[37]:
[1, 3, 4, 6]
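Index.get_duplicates was deprecated and later removed from pandas, so the call above may fail on current releases. A comparable result can be sketched with Index.duplicated and keep=False, which flags every occurrence of a repeated label. The frame below is rebuilt from the question's data, and mask is just an illustrative name.

```python
import pandas as pd

# Frame from the question; index labels 2 and 5 occur only once.
df = pd.DataFrame(
    {"A": [3.803, 3.276, 5.181, 6.948, 8.762, 10.672, 8.266, 13.032, 15.021, 16.426],
     "B": [4.797, 3.878, 6.342, 9.186, 10.136, 12.257, 13.252, 14.656, 17.681, 15.07]},
    index=[1, 1, 2, 3, 3, 4, 4, 5, 6, 6],
)

# keep=False marks all rows whose label appears more than once,
# not only the second and later occurrences.
mask = df.index.duplicated(keep=False)

# Boolean indexing keeps the flagged rows and drops the unique ones.
df_new = df[mask]

print(df_new)
```

Because this filters with a boolean mask rather than looking labels up with .loc, the rows keep their original order.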

Increase index of pandas DataFrame by one

I'd like to have my dataframe start with index 1 instead of 0. But somehow I am not getting it:
In[1]: df = pd.DataFrame([[4,7],[10,11],[7,2]],columns=['one', 'two'])
In[2]: df
Out[2]:
one two
0 4 7
1 10 11
2 7 2
In[3]: df.reindex(range(1,len(df)+1))
Out[3]:
one two
1 10 11
2 7 2
3 NaN NaN
Where did my first row go? What am I getting wrong about reindex()?
To increase your index by 1, you can simply modify the index in place: df.index += 1.
Full example:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[4,7],[10,11],[7,2]],columns=['one', 'two'])
In [3]: df
Out[3]:
one two
0 4 7
1 10 11
2 7 2
In [4]: df.index += 1
In [5]: df
Out[5]:
one two
1 4 7
2 10 11
3 7 2
reindex does not relabel the existing rows; it selects rows by the labels you pass. To assign new labels while preserving the row order, set the index directly:
In [25]:
df.index = range(1,len(df)+1)
df
Out[25]:
one two
1 4 7
2 10 11
3 7 2
The docs show that reindex conforms your data to the new index: rows whose labels are missing from the new index are dropped (which is why you lost row 0), and new labels with no matching row are filled with NaN (row 3). This is why reindex has fill_value and method parameters.
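The difference between the two operations can be seen side by side in a short sketch. It uses the frame from the question; the names reindexed and relabelled are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[4, 7], [10, 11], [7, 2]], columns=["one", "two"])

# reindex looks up the labels 1, 2, 3 in the existing index 0, 1, 2:
# label 0 is not requested, so that row disappears, and label 3 does not
# exist, so it is filled with NaN.
reindexed = df.reindex(range(1, len(df) + 1))

# Assigning to .index relabels the rows in place and keeps every row.
relabelled = df.copy()
relabelled.index = range(1, len(df) + 1)

print(reindexed)
print(relabelled)
```

The copy is only there so that both results can be compared against the same starting frame.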

pandas: convert a CSV series into a data frame

I'm new to pandas so apologies for what I think is a trivial question, but I can't quite find the relevant function for this:
I've got a file which consists of essentially 12 different data series, with the nth element of each series grouped together; i.e.
series_A_data0
series_B_data0
series_C_data0
...
series_L_data0
series_A_data1
series_B_data1
series_C_data1
...
I can import this into pandas as a single column data frame, but how can I get it into a 12-column data series?
For reference, currently I'm doing:
data = pd.read_csv(file)
data.head(14)
0 17655029760
1 1529585664
2 1598763008
3 4936196096
4 2192232448
5 2119827456
6 2143997952
7 1549099008
8 1593683968
9 1361498112
10 1514512384
11 1346588672
12 17939451904
13 1544957952
Do you know that the series will always be in the same order? If so, I'd create a MultiIndex and then unstack from that. Just read in the Series like you've done. I'll work with this data frame:
In [31]: df = pd.DataFrame(np.random.randn(24))
In [32]: df
Out[32]:
0
0 -1.642765
1 1.369409
2 -0.732588
3 0.357242
4 -1.259126
5 0.851803
6 -1.582394
7 -0.508507
8 0.123032
9 0.421857
10 -0.524147
11 0.381085
12 1.286025
13 -0.983004
14 0.813764
15 -0.203370
16 -1.107230
17 1.855278
18 -2.041401
19 1.352107
20 -1.630252
21 -0.326678
22 -0.080991
23 0.438606
In [33]: import itertools as it
In [34]: series_id = it.cycle(list('abcdefghijkl')) # first 12 letters.
In [60]: idx = pd.MultiIndex.from_tuples(zip(series_id, df.index.repeat(12)[:len(df)]))
We need to repeat the index so that the first observation for each Series is at index 0. Now set that as the index and unstack.
In [61]: df.index = idx
In [62]: df
Out[62]:
0
a 0 -1.642765
b 0 1.369409
c 0 -0.732588
d 0 0.357242
e 0 -1.259126
f 0 0.851803
g 0 -1.582394
h 0 -0.508507
i 0 0.123032
j 0 0.421857
k 0 -0.524147
l 0 0.381085
a 1 1.286025
b 1 -0.983004
c 1 0.813764
d 1 -0.203370
e 1 -1.107230
f 1 1.855278
g 1 -2.041401
h 1 1.352107
i 1 -1.630252
j 1 -0.326678
k 1 -0.080991
l 1 0.438606
[24 rows x 1 columns]
In [74]: df.unstack(0)[0]
Out[74]:
a b c d e f g \
0 -1.642765 1.369409 -0.732588 0.357242 -1.259126 0.851803 -1.582394
1 1.286025 -0.983004 0.813764 -0.203370 -1.107230 1.855278 -2.041401
h i j k l
0 -0.508507 0.123032 0.421857 -0.524147 0.381085
1 1.352107 -1.630252 -0.326678 -0.080991 0.438606
[2 rows x 12 columns]
The unstack(0) call says to move the outer index level to the columns.
I don't know if there is a simpler method, but if you can construct a comparable series with the desired column names and index values, you can use pd.pivot:
Suppose you have 3 times the 12 values, creating a dummy example:
data = pd.Series(np.random.randn(12*3))
Now you can construct the desired columns and indices as follows:
col = pd.Series(np.tile(list('ABCDEFGHIJKL'),3))
idx = pd.Series(np.repeat(np.arange(3), 12))
And now:
In [18]: pd.pivot(index=idx, columns=col, values=data.values)
Out[18]:
A B C D E F G \
0 1.296702 0.270532 -0.645502 0.213300 -0.224421 -0.634656 -2.362567
1 -1.986403 1.006665 -1.167412 -0.697443 -1.394925 -0.365205 -1.468349
2 0.689492 -0.410681 0.378916 1.552068 0.144651 -0.419082 -0.433970
H I J K L
0 2.102229 0.538711 -0.839540 -0.066535 1.154742
1 -1.090374 -1.344588 0.515923 -0.050190 -0.163259
2 -0.235364 0.296751 0.456884 0.237697 1.089476
PS: for some reason just using data instead of data.values does not work.
You can also do it with unstack as @TomAugspurger explained:
midx = pd.MultiIndex.from_tuples(zip(idx, col))
data.index = midx
data.unstack()
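If the file really is a strict repetition of the same 12 series in the same order, a plain NumPy reshape may be the shortest route. This is a different technique from the pivot and unstack answers above, shown as a sketch; it assumes the number of values is an exact multiple of 12, and the column names and the dummy data are made up.

```python
import numpy as np
import pandas as pd

n_series = 12
names = list("ABCDEFGHIJKL")  # illustrative column names

# Dummy stand-in for the single column read from the file:
# 3 observations of each of the 12 series, interleaved A, B, ..., L, A, B, ...
flat = pd.Series(np.arange(n_series * 3))

# Each consecutive run of 12 values becomes one row of the wide frame.
wide = pd.DataFrame(flat.to_numpy().reshape(-1, n_series), columns=names)

print(wide)
```

reshape raises a ValueError if the length is not divisible by 12, which acts as a check that the file is not truncated.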
