pandas not setting column correctly - python

I have the following program in Python:
# input
import pandas as pd
import numpy as np
data = pd.DataFrame({'a':pd.Series([1.,2.,3.]), 'b':pd.Series([4.,np.nan,6.])})
Here the data is:
In: print data
a b
0 1 4
1 2 NaN
2 3 6
Now I want an isnull column indicating whether the row has any NaN:
# create data
data['isnull'] = np.zeros(len(data))
data['isnull'][pd.isnull(data).any(axis=1)] = 1
The output is not correct (the second row should be 1):
In: print data
a b isnull
0 1 4 0
1 2 NaN 0
2 3 6 0
However, if I execute the exact command again, the output will be correct:
data['isnull'][pd.isnull(data).any(axis=1)] = 1
print data
a b isnull
0 1 4 0
1 2 NaN 1
2 3 6 0
Is this a bug with pandas or am I missing something obvious?
My Python version is 2.7.6, pandas is 0.12.0, and NumPy is 1.8.0.

You're using chained indexing, which doesn't give reliable results in pandas. I would do the following:
data['isnull'] = pd.isnull(data).any(axis=1).astype(int)
print data
a b isnull
0 1 4 0
1 2 NaN 1
2 3 6 0
For more on the problems with chained indexing, see here:
http://pandas-docs.github.io/pandas-docs-travis/indexing.html#indexing-view-versus-copy
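With a modern pandas the same flag column can also be built with a single .loc assignment, which avoids chained indexing entirely. A sketch (the question used pandas 0.12 / Python 2; this uses current syntax):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1., 2., 3.], 'b': [4., np.nan, 6.]})

# One .loc assignment instead of chained indexing: initialize the flag,
# then set it from the row-wise NaN check in a single step.
data['isnull'] = 0
data.loc[pd.isnull(data).any(axis=1), 'isnull'] = 1
print(data)
```

Because the row selector and the column label go into one .loc call, pandas writes into the original frame rather than into a possibly temporary copy.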

Related

Use drop duplicates in Pandas DF but choose keep column based on a preference list

I have a dataframe with many columns. There is a datetime column, and there are duplicated entries for the datetime, with data for those duplicates coming from different sources. I would like to drop the duplicates based on column "dt", but I want to keep the result based on what is in column "Pref". I have provided simplified data below; the reason for this is that I also have a value column, and the "Pref" column is the data source. I prefer certain data sources, but I only need one entry per date (column "dt"). I would also like the code to work without my having to provide a complete list of preferences.
Artificial Data Code
import pandas as pd
import numpy as np
df=pd.DataFrame({'dt':[1,1,1,2,2,3,3,4,4,5],
"Pref":[1,2,3,2,3,1,3,1,2,3],
"Value":np.random.normal(size=10),
"String_col":['A']*10})
df
Out[1]:
dt Pref Value String_col
0 1 1 -0.479593 A
1 1 2 0.553963 A
2 1 3 0.194266 A
3 2 2 0.598814 A
4 2 3 -0.909138 A
5 3 1 -0.297539 A
6 3 3 -1.100855 A
7 4 1 0.747354 A
8 4 2 1.002964 A
9 5 3 0.301373 A
Desired Output 1 (CASE 1):
In this case my preference list matters all the way down: I prefer data source 2 the most, followed by 1, but will take 3 if that is all I have.
preference_list=[2,1,3]
Out[2]:
dt Pref Value String_col
1 1 2 0.553963 A
3 2 2 0.598814 A
5 3 1 -0.297539 A
8 4 2 1.002964 A
9 5 3 0.301373 A
Desired Output 2 (CASE 2)
In this case I just want to look for data source 1. If it is not present I don't actually care what the other data source is.
preference_list2=[1]
Out[3]:
dt Pref Value String_col
0 1 1 -0.479593 A
3 2 2 0.598814 A
5 3 1 -0.297539 A
7 4 1 0.747354 A
9 5 3 0.301373 A
I can imagine doing this in a really slow and complicated loop, but I feel like there should be a command to accomplish this. Another important thing: I need to keep some other text columns in the data frame, so .agg may cause issues for that metadata. I have experimented with sorting and using the keep argument in drop_duplicates, but with no success.
You are actually looking for sorting by category, which can be done with an ordered pd.Categorical:
df["Pref"] = pd.Categorical(df["Pref"], categories=preference_list, ordered=True)
print (df.sort_values(["dt","Pref"]).drop_duplicates("dt"))
dt Pref Value String_col
1 1 2 -1.004362 A
3 2 2 -1.316961 A
5 3 1 0.513618 A
8 4 2 -1.859514 A
9 5 3 1.199374 A
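The same Categorical trick should cover CASE 2 as well: values missing from categories become NaN, which sorts last, so any leftover source is accepted for dates where the preferred source never appears. A sketch (the helper column name _key is my own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'dt': [1, 1, 1, 2, 2, 3, 3, 4, 4, 5],
                   'Pref': [1, 2, 3, 2, 3, 1, 3, 1, 2, 3],
                   'Value': np.random.normal(size=10),
                   'String_col': ['A'] * 10})

preference_list2 = [1]

# Sources missing from `categories` turn into NaN, which sorts last,
# so any remaining source is kept for dates where 1 never appears.
key = pd.Categorical(df['Pref'], categories=preference_list2, ordered=True)
out = (df.assign(_key=key)
         .sort_values(['dt', '_key'])
         .drop_duplicates('dt')
         .drop(columns='_key'))
print(out)
```

This keeps the original Pref column intact instead of converting it to a categorical in place.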
Here is a very simple and efficient way to keep only the preferred data sources; I hope it helps!
import pandas as pd
import numpy as np
df=pd.DataFrame({'dt':[1,1,1,2,2,3,3,4,4,5],
"Pref":[1,2,3,2,3,1,3,1,2,3],
"Value":np.random.normal(size=10),
"String_col":['A']*10})
preference_list = [2,3]
df_clean = df[df['Pref'].isin(preference_list)]
print(df)
print(df_clean)
Output:
dt Pref Value String_col
0 1 1 1.404505 A
1 1 2 0.840923 A
2 1 3 -1.509667 A
3 2 2 -1.431240 A
4 2 3 -0.576142 A
5 3 1 -1.208514 A
6 3 3 -0.456773 A
7 4 1 0.574463 A
8 4 2 -1.682750 A
9 5 3 0.719394 A
dt Pref Value String_col
1 1 2 0.840923 A
2 1 3 -1.509667 A
3 2 2 -1.431240 A
4 2 3 -0.576142 A
6 3 3 -0.456773 A
8 4 2 -1.682750 A
9 5 3 0.719394 A
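An alternative for CASE 1 that picks exactly one row per date without touching the Pref column: map each source to its rank in the preference list and keep the row with the lowest rank in each date group. A sketch (the rank mapping is my own construction):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'dt': [1, 1, 1, 2, 2, 3, 3, 4, 4, 5],
                   'Pref': [1, 2, 3, 2, 3, 1, 3, 1, 2, 3],
                   'Value': np.random.normal(size=10),
                   'String_col': ['A'] * 10})

preference_list = [2, 1, 3]
rank = {p: i for i, p in enumerate(preference_list)}  # lower rank = more preferred

# idxmin returns, per date group, the index label of the best-ranked row.
best = df.loc[df['Pref'].map(rank).groupby(df['dt']).idxmin()]
print(best)
```

All other columns (Value, String_col) come along untouched because the result is just a row selection on the original frame.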

pivot table to expanded dataframe in Python/Pandas

I want to build on a previous question of mine.
Let's look at some Python code.
import numpy as np
import pandas as pd
mat = np.array([[1,2,3],[4,5,6]])
df_mat = pd.DataFrame(mat)
df_mat_tidy = (df_mat.stack()
               .rename_axis(index=['V1', 'V2'])
               .rename('value')
               .reset_index()
               .reindex(columns=['value', 'V1', 'V2']))
df_mat_tidy
This takes me from a pivot table (mat) to a "tidy" (in the Tidyverse sense) version of the data that gives one variable as the column from which the number came, one variable as the row from which the number came, and one variable as the number in the pivot table at the row-column position.
Now I want to expand on that to get the row-column pair repeated the number of times the pivot table specifies. In other words, if position 1,1 has value 3 and position 2,1 has value 4, I want the data frame to go
col row
1 1
1 1
1 1
1 2
1 2
1 2
1 2
instead of
col row value
1 1 3
1 2 4
I think I know how to loop over the rows of the second example and produce that, but I want something faster.
Is there a way to "melt" the pivot table the way that I am describing?
Have a look at the parts of pandas' docs entitled "Reshaping and pivot tables".
.pivot(), .pivot_table(), and .melt() are all existing functions. It looks like you are reinventing some wheels.
You could just rebuild a DataFrame from a comprehension:
pd.DataFrame([i for j in [[[rec['V1'], rec['V2']]] * rec['value']
                          for rec in df_mat_tidy.to_dict(orient='records')]
              for i in j], columns=['col', 'row'])
It gives as expected:
col row
0 0 0
1 0 1
2 0 1
3 0 2
4 0 2
5 0 2
6 1 0
7 1 0
8 1 0
9 1 0
10 1 1
11 1 1
12 1 1
13 1 1
14 1 1
15 1 2
16 1 2
17 1 2
18 1 2
19 1 2
20 1 2
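The same expansion can be done without any Python-level loop by repeating each row value times with Index.repeat. A sketch (I rename V1/V2 to col/row to match the desired output):

```python
import numpy as np
import pandas as pd

mat = np.array([[1, 2, 3], [4, 5, 6]])
df_mat_tidy = (pd.DataFrame(mat).stack()
               .rename_axis(index=['V1', 'V2'])
               .rename('value')
               .reset_index()
               .reindex(columns=['value', 'V1', 'V2']))

# Repeat each (V1, V2) pair `value` times, then drop the count column.
expanded = (df_mat_tidy.loc[df_mat_tidy.index.repeat(df_mat_tidy['value']),
                            ['V1', 'V2']]
            .reset_index(drop=True)
            .rename(columns={'V1': 'col', 'V2': 'row'}))
print(expanded)
```

Since the repetition happens in NumPy, this should stay fast even for large pivot tables.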

Delete a row if it doesn't contain a specified integer value (Pandas)

I have a Pandas dataset that I want to clean up prior to applying my ML algorithm. I am wondering if it was possible to remove a row if an element of its columns does not match a set of values. For example, if I have the dataframe:
a b
0 1 6
1 4 7
2 2 4
3 3 7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
a b
0 1 6
1 3 7
...
Currently, my implementation is not working, as some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. That is why I would like to remove anything that is not an integer of that value.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing, combining two isin calls with &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
a b
0 1 6
3 3 7
Or use numpy.in1d:
df1 = df[(np.in1d(df['a'], [1,3])) & (np.in1d(df['b'], [6,7])) ]
print (df1)
a b
0 1 6
3 3 7
But if you need to remove all rows with non-numeric values, use to_numeric with errors='coerce', which returns NaN for unparseable values, and then filter with notnull:
df = pd.DataFrame({'a':['1abc','2','3'],
'b':['4','5','dsws7']})
print (df)
a b
0 1abc 4
1 2 5
2 3 dsws7
mask = (pd.to_numeric(df['a'], errors='coerce').notnull() &
        pd.to_numeric(df['b'], errors='coerce').notnull())
df1 = df[mask].astype(int)
print (df1)
a b
1 2 5
If you need to check whether some value is NaN or None:
df = pd.DataFrame({'a':['1abc',None,'3'],
'b':['4','5',np.nan]})
print (df)
a b
0 1abc 4
1 None 5
2 3 NaN
print (df[df.isnull().any(axis=1)])
a b
1 None 5
2 3 NaN
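The two ideas combine naturally: coerce first, then filter on the allowed values, since NaN never matches anything in the isin list. A sketch with made-up sample data:

```python
import pandas as pd

df = pd.DataFrame({'a': ['1abc', '1', '3', '4'],
                   'b': ['6', '7', '7', '6']})

# '1abc' coerces to NaN, which isin() never matches, so malformed rows
# drop out together with rows whose values are outside the allowed sets.
a = pd.to_numeric(df['a'], errors='coerce')
b = pd.to_numeric(df['b'], errors='coerce')
clean = df[a.isin([1, 3]) & b.isin([6, 7])]
print(clean)
```

This does the value check and the "is it really a number" check in one pass, with no intermediate dataframe.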
You can use pandas isin()
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
a b
0 1 6
3 3 7

Generating new columns as a full-combination of other columns

Could not find similar cases here.
Suppose, i have a DataFrame
df = pd.DataFrame({'A':[2,2,1,2],
'B':[2,2,3,3],
'C':[3,3,3,4],
'I':[1,0,0,1],
'II':[0,1,0,1]})
So it is:
A B C I II
0 2 2 3 1 0
1 2 2 3 0 1
2 1 3 3 0 0
3 2 3 4 1 1
I want to make a full pairwise combination between {A,B,C} and {I,II}, so I get {I-A,I-B,I-C,II-A,II-B,II-C}.
Each new column is just the elementwise product of the corresponding base columns:
I-A I-B I-C II-A II-B II-C
0 2 2 3 0 0 0
1 0 0 0 2 2 3
2 0 0 0 0 0 0
3 2 3 4 2 3 4
At the moment I don't have any working solution. I am trying to use loops (not succeeding so far), but I hope there's a more efficient way.
It's pretty simple, really. You have two sets of columns that you want to combine pairwise. I won't even bother with permutation tools:
>>> new_df = pd.DataFrame()
>>>
>>> for i in ["I", "II"]:
...     for a in ["A", "B", "C"]:
...         new_df[i+"-"+a] = df[i] * df[a]
>>> new_df
I-A I-B I-C II-A II-B II-C
0 2 2 3 0 0 0
1 0 0 0 2 2 3
2 0 0 0 0 0 0
3 2 3 4 2 3 4
Of course you could obtain the lists of column names as slices off df.columns, or in whatever other way is convenient. E.g. for your example dataframe you could write
>>> for i in df.columns[3:]:
...     for a in df.columns[:3]:
...         new_df[i+"-"+a] = df[i] * df[a]
If you prefer loops, you can use this code. It's definitely not the most elegant solution, but it should work for your purpose. It only requires that you specify the columns you'd like to use for the pairwise multiplication, and it is quite readable, which is something you may want.
def element_wise_mult(first, second):
    result = []
    for i, el in enumerate(first):
        result.append(el * second[i])
    return result

if __name__ == '__main__':
    import pandas as pd
    df = pd.DataFrame({'A': [2, 2, 1, 2],
                       'B': [2, 2, 3, 3],
                       'C': [3, 3, 3, 4],
                       'I': [1, 0, 0, 1],
                       'II': [0, 1, 0, 1]})
    fs = ['I', 'II']
    sc = ['A', 'B', 'C']
    series = []
    names = []
    for i in fs:
        for j in sc:
            names.append(i + '-' + j)
            series.append(pd.Series(element_wise_mult(df[i], df[j])))  # each product stored as a pandas Series
    print(pd.DataFrame(series, index=names).T)  # reconstruct the dataframe from the stored series and names
Returns:
I-A I-B I-C II-A II-B II-C
0 2 2 3 0 0 0
1 0 0 0 2 2 3
2 0 0 0 0 0 0
3 2 3 4 2 3 4
Here is a solution without for loops for your specific example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[2,2,1,2],
'B':[2,2,3,3],
'C':[3,3,3,4],
'I':[1,0,0,1],
'II':[0,1,0,1]})
cross_vals = np.tile(df[df.columns[:3]].values, (1, 2)) * np.repeat(df[df.columns[3:]].values, 3, axis=1)
cross_cols = np.repeat(df.columns[3:].values, 3) + np.array('-') + np.tile(df.columns[:3].values, (1, 2))
new_df = pd.DataFrame(cross_vals, columns=cross_cols[0])
Then new_df is
I-A I-B I-C II-A II-B II-C
0 2 2 3 0 0 0
1 0 0 0 2 2 3
2 0 0 0 0 0 0
3 2 3 4 2 3 4
You could generalize it to any size as long as the columns A,B,C,... are consecutive and similarly the columns I,II,... are consecutive.
For the general case, if the columns are not necessarily consecutive, you can do the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[2,2,1,2],
'B':[2,2,3,3],
'C':[3,3,3,4],
'I':[1,0,0,1],
'II':[0,1,0,1]})
let=np.array(['A','B','C'],dtype=object)
num=np.array(['I','II'],dtype=object)
cross_vals = np.tile(df[let].values, (1, len(num))) * np.repeat(df[num].values, len(let), axis=1)
cross_cols = np.repeat(num, len(let)) + np.array('-') + np.tile(let, (1, len(num)))
new_df = pd.DataFrame(cross_vals, columns=cross_cols[0])
And the result is the same as above.
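For completeness, the same product table can be built with a single NumPy broadcast instead of tile/repeat. A sketch assuming the same two column groups as in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 1, 2],
                   'B': [2, 2, 3, 3],
                   'C': [3, 3, 3, 4],
                   'I': [1, 0, 0, 1],
                   'II': [0, 1, 0, 1]})

num = ['I', 'II']
let = ['A', 'B', 'C']

# (n, 2, 1) * (n, 1, 3) broadcasts to (n, 2, 3): every I/II column times
# every A/B/C column, row by row. Reshape flattens it back to 6 columns.
vals = df[num].values[:, :, None] * df[let].values[:, None, :]
cols = [f'{i}-{a}' for i in num for a in let]
new_df = pd.DataFrame(vals.reshape(len(df), -1), columns=cols)
print(new_df)
```

The column order produced by the reshape (row-major) matches the order of the f-string comprehension, so names and values line up.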

Conditional statement and split in a Dataframe

I am looking for a conditional statement in Python to look for certain information in a specified column and put the results in a new column.
Here is an example of my dataset:
OBJECTID CODE_LITH
1 M4,BO
2 M4,BO
3 M4,BO
4 M1,HP-M7,HP-M1
and what I want as results:
OBJECTID CODE_LITH M4 M1
1 M4,BO 1 0
2 M4,BO 1 0
3 M4,BO 1 0
4 M1,HP-M7,HP-M1 0 1
What I have done so far:
import pandas as pd
import numpy as np
lookup = ['M4']
df.loc[df['CODE_LITH'].str.isin(lookup),'M4'] = 1
df.loc[~df['CODE_LITH'].str.isin(lookup),'M4'] = 0
Since there are multiple values per row in "CODE_LITH", it seems the script is not able to find only "M4"; it finds "M4,BO" and puts 1 or 0 in the new column.
I have also tried:
if ('M4') in df['CODE_LITH']:
df['M4'] = 0
else:
df['M4'] = 1
With the same results.
Thanks for your help.
PS. The dataframe contains about 2.6 millions rows and I need to do this operation for 30-50 variables.
I think this is the Pythonic way to do it:
for mn in ['M1', 'M4']:  # Add other "M#" as needed
    df[mn] = df['CODE_LITH'].map(lambda x: int(mn in x))
Use str.contains accessor:
>>> for key in ('M4', 'M1'):
...     df.loc[:, key] = df['CODE_LITH'].str.contains(key).astype(int)
>>> df
OBJECTID CODE_LITH M4 M1
0 1 M4,BO 1 0
1 2 M4,BO 1 0
2 3 M4,BO 1 0
3 4 M1,HP-M7,HP-M1 0 1
I was able to do:
for index, data in enumerate(df['CODE_LITH']):
    if "I1" in data:
        df['Plut_Felsic'][index] = 1
    else:
        df['Plut_Felsic'][index] = 0
It does work, but takes quite some time to calculate.
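If exact code matching matters (so that, say, a bare M1 is not confused with a substring of a longer code), splitting each row on its delimiters first is a safer variant. A sketch; I assume codes are separated by ',' and '-' as in the sample data:

```python
import re

import pandas as pd

df = pd.DataFrame({'OBJECTID': [1, 2, 3, 4],
                   'CODE_LITH': ['M4,BO', 'M4,BO', 'M4,BO', 'M1,HP-M7,HP-M1']})

# Tokenize each row on ',' and '-', then flag exact token membership per code.
for code in ['M4', 'M1']:
    df[code] = df['CODE_LITH'].apply(lambda s: int(code in re.split(r'[,\-]', s)))
print(df)
```

Because membership is tested against whole tokens rather than raw substrings, codes like "M11" would not trigger a false positive for "M1".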
