pandas index skipping values - python

I'm reading in two csv files, selecting data from a specific column, dropping NA/nulls, and then using the data that fits some condition in one file to print the associated data in another:
data1 = pandas.read_csv(filename1, usecols=['X', 'Y', 'Z']).dropna()
data2 = pandas.read_csv(filename2, usecols=['X', 'Y', 'Z']).dropna()
i = 0
for item in data1['Y']:
    if item > -20:
        print data2['X'][i]
    i += 1
But this throws me an error:
File "hashtable.pyx", line 381, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:7035)
File "hashtable.pyx", line 387, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6976)
KeyError: 6L
It turns out that when I print data2['X'], I see numbers missing from the row index:
0 -1.953779
1 -2.010039
2 -2.562191
3 -2.723993
4 -2.302720
5 -2.356181
7 -1.928778
...
How do I fix this and renumber the index values? Or is there a better way?

Found a solution in another question from here: Reindexing dataframes
.reset_index(drop=True) does the trick!
0 -1.953779
1 -2.010039
2 -2.562191
3 -2.723993
4 -2.302720
5 -2.356181
6 -1.928778
7 -1.925359
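For completeness, a minimal sketch of the fix applied to the original code (assuming both frames should be renumbered after dropna()):
data1 = pandas.read_csv(filename1, usecols=['X', 'Y', 'Z']).dropna().reset_index(drop=True)
data2 = pandas.read_csv(filename2, usecols=['X', 'Y', 'Z']).dropna().reset_index(drop=True)
After this, positions 0..n-1 match the index labels again, so the loop above no longer raises a KeyError.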

Are your two files/dataframes the same length? If so, you can leverage boolean masks and do this (and it avoids a for loop):
data2['X'][data1['Y'] > -20]
Edit: in response to the comment
What happens in between:
In [16]: df1
Out[16]:
X Y
0 0 0
1 1 2
2 2 4
3 3 6
4 4 8
In [17]: df2
Out[17]:
Y X
0 64 75
1 65 73
2 36 44
3 13 58
4 92 54
# creates a pandas Series object of True/False, which you can then use as a "mask"
In [18]: df2['Y'] > 50
Out[18]:
0 True
1 True
2 False
3 False
4 True
Name: Y, dtype: bool
# mask is applied element-wise to (in this case) the column of your DataFrame you want to filter
In [19]: df1['X'][ df2['Y'] > 50 ]
Out[19]:
0 0
1 1
4 4
Name: X, dtype: int64
# same as doing this (where the mask is applied to the whole dataframe, and then you grab your column)
In [20]: df1[ df2['Y'] > 50 ]['X']
Out[20]:
0 0
1 1
4 4
Name: X, dtype: int64
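One caveat worth adding (my note, not from the original answer): the mask data1['Y'] > -20 is aligned to data2['X'] by index label, not by position. If the two files dropped different rows in dropna(), the labels no longer correspond and the mask can select the wrong rows or raise. A sketch of a safer version, assuming the two files describe the same rows in the same order:
x = data2['X'].reset_index(drop=True)
mask = (data1['Y'] > -20).reset_index(drop=True)
x[mask]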

Related

Not able to split value of a column in dataframe on condition

I have an input data frame which looks like:
0 1
0 0 10,30
1 1 10,40
2 2 20,50
Now I am trying to split the second column and store the value in a new column. If the value in column A is divisible by 2, take the first value from column B, otherwise the second value, like below:
A B C
0 0 10,30 10
1 1 10,40 10
2 2 20,50 50
My Code :
import pandas as pd
import numpy as np
df = pd.DataFrame([(0, '10,30'), (1, '10,40'), (2, '20,50')])
df['n'] = np.where(df[0] % 2 == 0, df[0], 0)
df[2] = (df[1]).str.split(',').str[df['n'].fillna(0)]
print(df)
It's throwing an error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I believe you need lookup on the column split into a DataFrame, with the boolean mask cast to int so that 0 selects the first column and 1 selects the second:
df[2] = df[1].str.split(',', expand=True).lookup(df.index, (df[0] % 2 == 0).astype(int))
print (df)
0 1 2
0 0 10,30 30
1 1 10,40 10
2 2 20,50 50
print (df[0] % 2 == 0)
0 True
1 False
2 True
Name: 0, dtype: bool
#select second, first, second column
print ((df[0] % 2 == 0).astype(int))
0 1
1 0
2 1
Name: 0, dtype: int32
A similar solution with the condition flipped:
df[2] = df[1].str.split(',', expand=True).lookup(df.index, (df[0] % 2 != 0).astype(int))
print (df)
0 1 2
0 0 10,30 10
1 1 10,40 40
2 2 20,50 20
print (df[0] % 2 != 0)
0 False
1 True
2 False
Name: 0, dtype: bool
#select first, second, first column
print ((df[0] % 2 != 0).astype(int))
0 0
1 1
2 0
Name: 0, dtype: int32
print (df[1].str.split(',', expand=True))
0 1
0 10 30 <-first 10
1 10 40 <-second 40
2 20 50 <-first 20
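A side note (mine, not from the original answer): DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on recent versions the same row-wise pick can be done with plain numpy indexing:
import numpy as np
import pandas as pd

df = pd.DataFrame([(0, '10,30'), (1, '10,40'), (2, '20,50')])
split = df[1].str.split(',', expand=True)           # two string columns, 0 and 1
cols = (df[0] % 2 == 0).astype(int).to_numpy()      # 1 -> second column, 0 -> first
df[2] = split.to_numpy()[np.arange(len(df)), cols]  # pick one cell per row
print(df)                                           # gives 30, 10, 50 as above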
I think you can also achieve this with the apply method.
First, let's put the split column 1 together with the target index into a new dataframe df1:
df1 = pd.concat({i:df[1].str.split(',').str.get(i) for i in range(2)}, axis=1)
df1['ind'] = df[0] % 2
df1
0 1 ind
0 10 30 0
1 10 40 1
2 20 50 0
Next, you can put the new values into column 2 with:
df[2] = df1.apply(lambda p: p.loc[p["ind"]], axis=1)
df[2]
0 10
1 40
2 20
dtype: object
If you don't want to create a new data frame, you can also do the following to get the same result:
df[2] = df.apply(lambda p: p.loc[1].split(",")[p.loc[0] % 2], axis=1)
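Putting that together, a minimal self-contained sketch of the one-liner:
import pandas as pd

df = pd.DataFrame([(0, '10,30'), (1, '10,40'), (2, '20,50')])
df[2] = df.apply(lambda p: p.loc[1].split(',')[p.loc[0] % 2], axis=1)
print(df)
#    0      1   2
# 0  0  10,30  10
# 1  1  10,40  40
# 2  2  20,50  20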

Replace zeros in one dataframe with values from another dataframe

I have two dataframes df1 and df2:
df1 is shown here:
age
0 42
1 52
2 36
3 24
4 73
df2 is shown here:
age
0 0
1 0
2 1
3 0
4 0
I want to replace all the zeros in df2 with their corresponding entries in df1. In more technical words, if the element at a certain index in df2 is zero, then I would want this element to be replaced by the corresponding entry in df1.
Hence, I want df2 to look like:
age
0 42
1 52
2 1
3 24
4 73
I tried using the replace method but it is not working. Please help :)
Thanks in advance.
You could use where:
In [19]: df2.where(df2 != 0, df1)
Out[19]:
age
0 42
1 52
2 1
3 24
4 73
Above, df2 != 0 is a boolean DataFrame.
In [16]: df2 != 0
Out[16]:
age
0 False
1 False
2 True
3 False
4 False
df2.where(df2 != 0, df1) returns a new DataFrame. Where df2 != 0 is True, the corresponding value of df2 is used. Where it is False, the corresponding value of df1 is used.
Another alternative is to make an assignment with df.loc:
df2.loc[df2['age'] == 0, 'age'] = df1['age']
df.loc[mask, col] selects the rows of df where the boolean Series mask is True, in the column labeled col.
In [17]: df2.loc[df2['age'] == 0, 'age']
Out[17]:
0 0
1 0
3 0
4 0
Name: age, dtype: int64
When used in an assignment, such as df2.loc[df2['age'] == 0, 'age'] = df1['age'],
Pandas performs automatic index label alignment. (Notice the index labels above are 0,1,3,4 -- with 2 being skipped.) So the values in df2.loc[df2['age'] == 0, 'age'] are replaced by the corresponding values from df1['age']. Even though df1['age'] is a Series with index labels 0,1,2,3, and 4, the 2 is ignored because there is no corresponding index label on the left-hand side.
In other words,
df2.loc[df2['age'] == 0, 'age'] = df1.loc[df2['age'] == 0, 'age']
would work as well, but the added restriction on the right-hand side is unnecessary.
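A minimal runnable sketch of that alignment behaviour, reconstructing the frames above:
import pandas as pd

df1 = pd.DataFrame({'age': [42, 52, 36, 24, 73]})
df2 = pd.DataFrame({'age': [0, 0, 1, 0, 0]})

# the left-hand side selects labels 0, 1, 3, 4; the right-hand side
# aligns on those labels, so df1's row 2 is simply ignored
df2.loc[df2['age'] == 0, 'age'] = df1['age']
print(df2)  # age: 42, 52, 1, 24, 73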
Another option is mask combined with combine_first (the result is upcast to float, because masking introduces NaNs):
In [30]: df2.mask(df2==0).combine_first(df1)
Out[30]:
age
0 42.0
1 52.0
2 1.0
3 24.0
4 73.0
or "negating" beautiful #unutbu's solution:
In [46]: df2.mask(df2==0, df1)
Out[46]:
age
0 42
1 52
2 1
3 24
4 73
Or try mul (this works here only because the nonzero entries in df2 are exactly 1):
df1.mul(np.where(df2==1,0,1)).replace({0:1})
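Another common pattern (my addition, not from the answers above) is numpy.where, which avoids relying on the nonzero entries being exactly 1; note it matches rows by position, so both frames must have the same length and row order:
import numpy as np

df2['age'] = np.where(df2['age'] == 0, df1['age'], df2['age'])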

Apply a for loop to multiple DataFrames in Pandas

I have multiple DataFrames that I want to do the same thing to.
First I create a list of the DataFrames. All of them have the same column called 'result'.
df_list = [df1,df2,df3]
I want to keep only the rows in all the DataFrames with value 'passed' so I use a for loop on my list:
for df in df_list:
    df = df[df['result'] == 'passed']
...this does not work, the values are not filtered out of each DataFrame.
If I filter each one separately then it does work.
df1 = df1[df1['result'] == 'passed']
df2 = df2[df2['result'] == 'passed']
df3 = df3[df3['result'] == 'passed']
This is because every time you take a subset like this, df[<whatever>], you are returning a new dataframe and assigning it to the df looping variable, which gets obliterated each time you go to the next iteration (although you do keep the last one). This is similar to slicing lists:
>>> list1 = [1,2,3,4]
>>> list2 = [11,12,13,14]
>>> for lyst in list1,list2:
... lyst = lyst[1:-1]
...
>>> list1, list2
([1, 2, 3, 4], [11, 12, 13, 14])
>>> lyst
[12, 13]
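In the spirit of that analogy, a simple fix (my addition, not part of the original answer) is to rebuild the list instead of rebinding the loop variable:
# rebuild the whole list with a comprehension...
df_list = [df[df['result'] == 'passed'] for df in df_list]

# ...or reassign each slot explicitly
for i, df in enumerate(df_list):
    df_list[i] = df[df['result'] == 'passed']
Note that either way the original names df1, df2, df3 still point at the unfiltered frames; only the list entries are replaced.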
Usually, you need to use a mutator method if you want to actually modify the lists in place. Equivalently, with a dataframe, you could use assignment on an indexer, e.g. .loc/.iloc etc., in combination with the .dropna method, being careful to pass the inplace=True argument. Suppose I have three dataframes and I want to keep only the rows where my second column is positive:
Warning: this way is not ideal; see the edit below for a better way.
In [11]: df1
Out[11]:
0 1 2 3
0 0.957288 -0.170286 0.406841 -3.058443
1 1.762343 -1.837631 -0.867520 1.666193
2 0.618665 0.660312 -1.319740 -0.024854
3 -2.008017 -0.445997 -0.028739 -0.227665
4 0.638419 -0.271300 -0.918894 1.524009
5 0.957006 1.181246 0.513298 0.370174
6 0.613378 -0.852546 -1.778761 -1.386848
7 -1.891993 -0.304533 -1.427700 0.099904
In [12]: df2
Out[12]:
0 1 2 3
0 -0.521018 0.407258 -1.167445 -0.363503
1 -0.879489 0.008560 0.224466 -0.165863
2 0.550845 -0.102224 -0.575909 -0.404770
3 -1.171828 -0.912451 -1.197273 0.719489
4 -0.887862 1.073306 0.351835 0.313953
5 -0.517824 -0.096929 -0.300282 0.716020
6 -1.121527 0.183219 0.938509 0.842882
7 0.003498 -2.241854 -1.146984 -0.751192
In [13]: df3
Out[13]:
0 1 2 3
0 0.240411 0.795132 -0.305770 -0.332253
1 -1.162097 0.055346 0.094363 -1.254859
2 -0.493466 -0.717872 1.090417 -0.591872
3 1.021246 -0.060453 -0.013952 0.304933
4 -0.859882 -0.947950 0.562609 1.313632
5 0.917199 1.186865 0.354839 -1.771787
6 -0.694799 -0.695505 -1.077890 -0.880563
7 1.088068 -0.893466 -0.188419 -0.451623
In [14]: for df in df1, df2, df3:
....: df.loc[:,:] = df.loc[df[1] > 0,:]
....: df.dropna(inplace = True,axis =0)
....:
In [15]: df1
Out[15]:
0 1 2 3
2 0.618665 0.660312 -1.319740 -0.024854
5 0.957006 1.181246 0.513298 0.370174
In [16]: df2
Out[16]:
0 1 2 3
0 -0.521018 0.407258 -1.167445 -0.363503
1 -0.879489 0.008560 0.224466 -0.165863
4 -0.887862 1.073306 0.351835 0.313953
6 -1.121527 0.183219 0.938509 0.842882
In [17]: df3
Out[17]:
0 1 2 3
0 0.240411 0.795132 -0.305770 -0.332253
1 -1.162097 0.055346 0.094363 -1.254859
5 0.917199 1.186865 0.354839 -1.771787
Edited to Add:
I think I found a better way just using the .drop method.
In [21]: df1
Out[21]:
0 1 2 3
0 -0.804913 -0.481498 0.076843 1.136567
1 -0.457197 -0.903681 -0.474828 1.289443
2 -0.820710 1.610072 0.175455 0.712052
3 0.715610 -0.178728 -0.664992 1.261465
4 -0.297114 -0.591935 0.487698 0.760450
5 1.035231 -0.108825 -1.058996 0.056320
6 1.579931 0.958331 -0.653261 -0.171245
7 0.685427 1.447411 0.001002 0.241999
In [22]: df2
Out[22]:
0 1 2 3
0 1.660864 0.110002 0.366881 1.765541
1 -0.627716 1.341457 -0.552313 0.578854
2 0.277738 0.128419 -0.279720 -1.197483
3 -1.294724 1.396698 0.108767 1.353454
4 -0.379995 0.215192 1.446584 0.530020
5 0.557042 0.339192 -0.105808 -0.693267
6 1.293941 0.203973 -3.051011 1.638143
7 -0.909982 1.998656 -0.057350 2.279443
In [23]: df3
Out[23]:
0 1 2 3
0 -0.002327 -2.054557 -1.752107 -0.911178
1 -0.998328 -1.119856 1.468124 -0.961131
2 -0.048568 0.373192 -0.666330 0.867719
3 0.533597 -1.222963 0.119789 -0.037949
4 1.203075 -0.773511 0.475809 1.352943
5 -0.984069 -0.352267 -0.313516 0.138259
6 0.114596 0.354404 2.119963 -0.452462
7 -1.033029 -0.787237 0.479321 -0.818260
In [25]: for df in df1,df2,df3:
....: df.drop(df.index[df[1] < 0],axis=0,inplace=True)
....:
In [26]: df1
Out[26]:
0 1 2 3
2 -0.820710 1.610072 0.175455 0.712052
6 1.579931 0.958331 -0.653261 -0.171245
7 0.685427 1.447411 0.001002 0.241999
In [27]: df2
Out[27]:
0 1 2 3
0 1.660864 0.110002 0.366881 1.765541
1 -0.627716 1.341457 -0.552313 0.578854
2 0.277738 0.128419 -0.279720 -1.197483
3 -1.294724 1.396698 0.108767 1.353454
4 -0.379995 0.215192 1.446584 0.530020
5 0.557042 0.339192 -0.105808 -0.693267
6 1.293941 0.203973 -3.051011 1.638143
7 -0.909982 1.998656 -0.057350 2.279443
In [28]: df3
Out[28]:
0 1 2 3
2 -0.048568 0.373192 -0.666330 0.867719
6 0.114596 0.354404 2.119963 -0.452462
Certainly faster:
In [8]: timeit.Timer(stmt="df.loc[:,:] = df.loc[df[1] > 0, :];df.dropna(inplace = True,axis =0)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[8]: 23.69621358400036
In [9]: timeit.Timer(stmt="df.drop(df.index[df[1] < 0],axis=0,inplace=True)", setup="import pandas as pd,numpy as np; df = pd.DataFrame(np.random.random((8,4)))").timeit(10000)
Out[9]: 11.476448250003159
If multiple columns contain the pass condition (for example, a value of 40 or more means 'Passed', otherwise 'failed'):
col_list_contains_passed = ['result1', 'result2']
df_list = [df1, df2, df3]
for d in df_list:
    for c in col_list_contains_passed:
        d[c] = np.where(d[c] >= 40, 'Passed', 'failed')  # you can put your own condition here
Then you can filter each dataframe individually by condition:
df1 = df1[df1['col_name'] == 'xyz']
I will update the answer if the for loop works. To be continued...

output from summing two dataframes

I have two dataframes, df1 (1 row, 10 columns) and df2 (7 rows, 10 columns). Now I want to add values from those two dataframes:
final = df1[0] + df2[0][0]
print final
output:
0 23.8458
Name: 0, dtype: float64
I believe 23.8458 is the value I want, but I don't understand what the "0" stands for, or what the datatype of final is. I just want to keep 23.8458 as a float. How can I do that? Thanks.
What you are printing is a Series with a single entry, whose index label is 0 and whose value is 23.8458. If you want just the value, then print final[0] will give you what you want.
You can see the type if you do print(type(final)).
Example:
In [42]:
df1 = pd.DataFrame(np.random.randn(1,7))
print(df1)
df2 = pd.DataFrame(np.random.randn(7,10))
print(df2)
0 1 2 3 4 5 6
0 -1.575662 0.725688 -0.442072 0.622824 0.345227 -0.062732 0.197771
0 1 2 3 4 5 6 \
0 -0.658687 0.404817 -0.786715 -0.923070 0.479590 0.598325 -0.495963
1 1.482873 -0.240854 -0.987909 0.085952 -0.054789 -0.576887 -0.940805
2 1.173126 0.190489 -0.809694 -1.867470 0.500751 -0.663346 -1.777718
3 -0.111570 -0.606001 -0.755202 -0.201915 0.933065 -0.833538 2.526979
4 1.537778 -0.739090 -1.813050 0.601448 0.296994 -0.966876 0.459992
5 -0.936997 -0.494562 0.365359 -1.351915 -0.794753 1.552997 -0.684342
6 0.128406 0.016412 0.461390 -2.411903 3.154070 -0.584126 0.136874
7 8 9
0 -0.483917 -0.268557 1.386847
1 0.379854 0.205791 -0.527887
2 -0.307892 -0.915033 0.017231
3 -0.672195 0.110869 1.779655
4 0.241685 1.899335 -0.334976
5 -0.510972 -0.733894 0.615160
6 2.094503 -0.184050 1.328208
In [43]:
final = df1[0] + df2[0][0]
print(final)
0 -2.234349
Name: 0, dtype: float64
In [44]:
print(type(final))
<class 'pandas.core.series.Series'>
In [45]:
final[0]
Out[45]:
-2.2343491912631328
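If you just want a plain float rather than a Series, a minimal sketch (my addition) using positional indexing:
final = df1.iloc[0, 0] + df2.iloc[0, 0]
print(type(final))  # a scalar (numpy.float64), not a Series
print(final)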

Pandas column.sum() without having the index values multiply

I have a DataFrame like the one reproduced in the answer below.
When I take the .sum() of the columns, it looks as though Pandas is multiplying each row entry by the index value.
I need just a raw count at the end of each column, not a "sum" per se. What is the best way?
To find the sum of the values, use .sum(). To find a count of the non-empty cells, use .count(). To find a count of the cells which have a value greater than 0, try df[df>0].count().
In [29]: df=pd.read_table('data.csv', delim_whitespace=True)
In [30]: df
Out[30]:
BPC B-S
0 2 1
1 5 2
2 0 1
3 0 0
4 0 0
5 2 1
6 8 3
7 38 12
[8 rows x 2 columns]
In [31]: df.sum()
Out[31]:
BPC 55
B-S 20
dtype: int64
In [32]: df[df>0].count()
Out[32]:
BPC 5
B-S 6
dtype: int64
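Equivalently (my note, not part of the original answer), you can sum the boolean mask directly, since each True counts as 1:
In [33]: (df > 0).sum()
Out[33]:
BPC    5
B-S    6
dtype: int64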
