I'm observing behavior that's weird to me. Can anyone tell me how I can define a filter once and re-use it throughout my code?
>>> df = pd.DataFrame([1,2,3], columns=['A'])
>>> my_filter = df.A == 2
>>> df.loc[1] = 5
>>> df[my_filter]
A
1 5
I expected df[my_filter] to return an empty DataFrame, since none of the values in column A are equal to 2 anymore.
I'm thinking about making a function that returns the filter and re-using that, but is there a more Pythonic (and pandas-idiomatic) way of doing this?
def get_my_filter(df):
    return df.A == 2

df[get_my_filter(df)]
# ... change df ...
df[get_my_filter(df)]
Masks are not dynamic; they are evaluated once, at the moment you define them.
So if you still need to change the DataFrame values, you should swap the second and third lines of your example: modify the DataFrame first, then build the mask.
That would work.
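A minimal sketch of that reordering, using the same data as in the question:

import pandas as pd

df = pd.DataFrame([1, 2, 3], columns=['A'])
df.loc[1] = 5                  # change the DataFrame first
my_filter = df.A == 2          # then build the mask from the current values
df[my_filter]                  # now an empty DataFrame, as expected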
The mask was computed at the moment you applied the filter in the first place. Changing a value in the row afterwards won't help.
df = pd.DataFrame([1,2,3], columns=['A'])
my_filter = df.A == 2
print(my_filter)
'''
0    False
1     True
2    False
Name: A, dtype: bool
'''
As you can see, it returns a Series. If you change the data after this point, the mask will not update, because it represents the first version of the df. But you can define the filter as a string instead: you can achieve what you want if you evaluate the string filter inside the eval() function.
df = pd.DataFrame([1,2,3], columns=['A'])
my_filter = 'df.A == 2'
df.loc[1] = 5
df[eval(my_filter)]
'''
Out[205]:
Empty DataFrame
Columns: [A]
Index: []
'''
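If you'd rather avoid eval (which executes arbitrary strings), a hedged alternative is to store the filter as a callable and rebuild the mask each time you index; this is just a sketch of the function idea from the question:

import pandas as pd

df = pd.DataFrame([1, 2, 3], columns=['A'])
my_filter = lambda d: d.A == 2   # re-evaluated on every call

df.loc[1] = 5
df[my_filter(df)]                # empty DataFrame
df.loc[my_filter]                # same result: .loc accepts a callable and passes it the DataFrame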
I'm trying to do some data cleaning using pandas. Imagine I have a data frame with a column called "Number" containing data like: "1203.10", "4221", "3452.11", etc. I want to add an "M" before the numbers that have a decimal point and end with a zero. For this example, that would be turning "1203.10" into "M1203.10".
I know how to obtain a data frame containing the numbers with a point and ending with zero.
Suppose the data frame is called "df".
pointzero = '[0-9]+[.][0-9]+[0]$'
pz = df[df.Number.str.match(pointzero)]
But I'm not sure how to add the "M" at the beginning once I have "pz". The only way I know is a for loop, but I think there is a better way. Any suggestions would be great!
You can use boolean indexing:
pointzero = '[0-9]+[.][0-9]+[0]$'
m = df.Number.str.match(pointzero)
df.loc[m, 'Number'] = 'M' + df.loc[m, 'Number']
Alternatively, using str.replace and a slightly different regex:
pointzero = '([0-9]+[.][0-9]+[0]$)'
df['Number'] = df['Number'].str.replace(pointzero, r'M\1', regex=True)
Example:
Number
0 M1203.10
1 4221
2 3452.11
You should include a sample DataFrame or Series in the answer. For example:
s1 = pd.Series(["1203.10", "4221","3452.11"])
s1
0    1203.10
1 4221
2 3452.11
dtype: object
str.contains + boolean masking
cond1 = s1.str.contains('[0-9]+[.][0-9]+[0]$')
s1.mask(cond1, 'M'+s1)
output:
0 M1203.10
1 4221
2 3452.11
dtype: object
I have a dataframe in Python called df1 with 4 dichotomous variables called Ordering_1, Ordering_2, Ordering_3, Ordering_4, holding True/False values.
I need to create a variable called Clean, which is based on the 4 other variables. Meaning, when Ordering_1 == True, then Clean == Ordering_1; when Ordering_2 == True, then Clean == Ordering_2. Clean would then be a combination of all the true values from Ordering_1, Ordering_2, Ordering_3 and Ordering_4.
Here is an example of how I would like the variable Clean to be:
I have tried the below code but it does not work:
df1[Clean] = df1[Ordering_1] + df1[Ordering_1] + df1[Ordering_1] + df1[Ordering_1]
Would anyone be able to help me do this in Python?
Universal solution if there are multiple Trues per row - filter the Ordering_ columns with DataFrame.filter and then use DataFrame.dot for matrix multiplication:
df1 = df.filter(like='Ordering_')
df['Clean'] = df1.dot(df1.columns + ',').str.strip(',')
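A quick worked example of what dot builds (sample data assumed here, since the question's example table isn't shown):

import pandas as pd

df = pd.DataFrame({
    'Ordering_1': [True, False, False],
    'Ordering_2': [False, True, False],
    'Ordering_3': [False, True, True],
    'Ordering_4': [False, False, False],
})

df1 = df.filter(like='Ordering_')
# dot concatenates the column name of every True cell, comma-separated
df['Clean'] = df1.dot(df1.columns + ',').str.strip(',')
print(df['Clean'])
# 0               Ordering_1
# 1    Ordering_2,Ordering_3
# 2               Ordering_3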
If there is only one "True" value per row, you can use the booleans of each column "Ordering_1", "Ordering_2", etc. together with df1.loc.
Note that this is what you get with df1.Ordering_1:
0 True
1 False
2 False
3 False
Name: Ordering_1, dtype: bool
With df1.loc you can use that mask to filter on the "True" rows, in this case only row 0. So you can code this:
Create a new blank "clean" column:
df1["clean"]=""
Set the rows where the series df.Ordering_1 = True to "Ordering_1":
df1.loc[df1.Ordering_1,["clean"]] = "Ordering_1"
Proceed with the remaining columns in the same way, as sketched below.
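A compact version of that column-by-column approach (a sketch, assuming the four column names from the question):

df1["clean"] = ""
for col in ["Ordering_1", "Ordering_2", "Ordering_3", "Ordering_4"]:
    # wherever this column is True, record its name in "clean"
    df1.loc[df1[col], "clean"] = col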
I have a dataframe with headers 'Category', 'Factor1', 'Factor2', 'Factor3', 'Factor4', 'UseFactorA', 'UseFactorB'.
The values of 'UseFactorA' and 'UseFactorB' are each one of the strings ['Factor1', 'Factor2', 'Factor3', 'Factor4'], keyed based on the value in 'Category'.
I want to generate a column, 'Result', which equals dataframe[UseFactorA]/dataframe[UseFactorB]
Take the below dataframe as an example:
Category  Factor1  Factor2  Factor3  Factor4  UseFactorA  UseFactorB
A         1        2        5        8        'Factor1'   'Factor3'
B         2        7        4        2        'Factor3'   'Factor1'
The 'Result' series should be [0.2, 2].
However, I cannot figure out how to feed the values of UseFactorA and UseFactorB into an index to make this happen. If the columns to use were fixed, I would just give
df['Result'] = df['Factor1']/df['Factor2']
However, when I try to give
df['Results'] = df[df['UseFactorA']]/df[df['UseFactorB']]
I get the error
ValueError: Wrong number of items passed 3842, placement implies 1
Is there a method for doing what I am trying here?
Probably not the prettiest solution (because of the iterrows), but what comes to mind is to iterate through the sets of factors and set the 'Result' value at each index:
for i, factors in df[['UseFactorA', 'UseFactorB']].iterrows():
    # look up this row's two factor columns by name, then divide
    df.loc[i, 'Result'] = df.loc[i, factors['UseFactorA']] / df.loc[i, factors['UseFactorB']]
Edit:
Another option:
def factor_calc_for_row(row):
    factorA = row['UseFactorA']
    factorB = row['UseFactorB']
    return row[factorA] / row[factorB]

df['Result'] = df.apply(factor_calc_for_row, axis=1)
Here's the one liner:
df['Results'] = [df[df['UseFactorA'][x]][x]/df[df['UseFactorB'][x]][x] for x in range(len(df))]
How it works:
df[df['UseFactorA']]
returns a DataFrame,
df[df['UseFactorA'][x]]
returns a Series, and
df[df['UseFactorA'][x]][x]
pulls a single value from the Series.
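For larger frames, a vectorized alternative (a sketch, assuming the four Factor columns from the question) is to translate each row's factor name into a column position and index the underlying NumPy array directly:

import numpy as np
import pandas as pd

factor_cols = ['Factor1', 'Factor2', 'Factor3', 'Factor4']
values = df[factor_cols].to_numpy()
rows = np.arange(len(df))

# positions of each row's chosen numerator and denominator columns
a = values[rows, pd.Index(factor_cols).get_indexer(df['UseFactorA'])]
b = values[rows, pd.Index(factor_cols).get_indexer(df['UseFactorB'])]
df['Result'] = a / b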
I am trying to split my dataframe into two based on medical_plan_id: if it is empty, into df1; if not empty, into df2.
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
df2 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] is not ""]
The code below works, but if there are no empty fields, my code raises TypeError("invalid type comparison").
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
How do I handle such a situation?
My df_with_medicalplanid looks like below:
wellthie_issuer_identifier ... medical_plan_id
0 UHC99806 ... None
1 UHC99806 ... None
Use ==, not is, to test equality
Likewise, use != instead of is not for inequality.
is has a special meaning in Python. It returns True if two variables point to the same object, while == checks if the objects referred to by the variables are equal. See also Is there a difference between == and is in Python?.
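A quick illustration of the difference (plain Python, no pandas needed):

a = [1, 2]
b = [1, 2]
print(a == b)   # True: the two lists have equal contents
print(a is b)   # False: they are two distinct objects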
Don't repeat mask calculations
The Boolean masks you are creating are the most expensive part of your logic. It's also logic you want to avoid repeating manually as your first and second masks are inverses of each other. You can therefore use the bitwise inverse ~ ("tilde"), also accessible via operator.invert, to negate an existing mask.
Empty strings are different to null values
Equality versus empty strings can be tested via == '', but testing for null values requires a specialized method: pd.Series.isnull. This is because null values are represented by np.nan in the NumPy arrays pandas uses, and np.nan != np.nan by design.
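A short demonstration of that pitfall:

import numpy as np
import pandas as pd

np.nan == np.nan                    # False, so == cannot find nulls
pd.Series([np.nan, '']).isnull()    # True, False: only np.nan counts as null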
If you want to replace empty strings with null values, you can do so:
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
Conceptually, it makes sense for missing values to be null (np.nan) rather than empty strings. But the opposite of the above process, i.e. converting null values to empty strings, is also possible:
df['medical_plan_id'] = df['medical_plan_id'].fillna('')
If the difference matters, you need to know your data and apply the appropriate logic.
Semi-final solution
Assuming you do indeed have null values, calculate a single Boolean mask and its inverse:
mask = df['medical_plan_id'].isnull()
df1 = df[mask]
df2 = df[~mask]
Final solution: avoid extra variables
Creating additional variables is something you should, as a programmer, look to avoid. In this case, there's no need to create two new variables: you can use GroupBy with dict to get a dictionary of dataframes, with False (== 0) and True (== 1) keys corresponding to your masks:
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
Then dfs[0] represents df2 and dfs[1] represents df1 (see also this related answer). As a variant of the above, you can forgo the dictionary construction and use the Pandas GroupBy methods directly:
dfs = df.groupby(df['medical_plan_id'].isnull())
dfs.get_group(0) # equivalent to dfs[0] from dict solution
dfs.get_group(1) # equivalent to dfs[1] from dict solution
Example
Putting all the above in action:
df = pd.DataFrame({'medical_plan_id': [np.nan, '', 2134, 4325, 6543, '', np.nan],
'values': [1, 2, 3, 4, 5, 6, 7]})
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
print(dfs[0], dfs[1], sep='\n'*2)
medical_plan_id values
2 2134.0 3
3 4325.0 4
4 6543.0 5
medical_plan_id values
0 NaN 1
1 NaN 2
5 NaN 6
6 NaN 7
Another variant is to unpack df.groupby, which returns an iterator of tuples (the first item being the group key and the second being the dataframe).
Like this for instance:
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
In Python, _ is used by convention for variables whose values you don't intend to keep. I have split the code across two lines for readability.
Full example
import pandas as pd
df_with_medicalplanid = pd.DataFrame({
    'medical_plan_id': ['214212', '', '12251', '12421', ''],
    'value': 1
})
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
print(df1)
Returns:
medical_plan_id value
0 214212 1
2 12251 1
3 12421 1
I import a CSV as a DataFrame using:
import numpy as np
import pandas as pd
df = pd.read_csv("test.csv")
Then I'm trying to do a simple replace based on IDs:
df.loc[df.ID == 103, ['fname', 'lname']] = 'Michael', 'Johnson'
I get the following error:
AttributeError: 'list' object has no attribute 'loc'
Note: when I print pd.__version__ I get 0.12.0, so it's not a problem (at least as far as I understand) of having a pre-0.11 version. Any ideas?
To pick up from the comment: "I was doing this:"
df = [df.hc== 2]
What you create there is a list containing a "mask": an array of booleans that says which part of the index fulfilled your condition. Since the result is a plain list rather than a DataFrame, the subsequent .loc call fails.
To filter your dataframe on your condition you want to do this:
df = df[df.hc == 2]
A bit more explicit is this:
mask = df.hc == 2
df = df[mask]
If you want to keep the entire dataframe and only replace specific values, there are methods such as replace: Python pandas equivalent for replace. Another (performance-wise great) method is creating a separate DataFrame with the from/to values as columns and using pd.merge to combine it into the existing DataFrame. Using your mask with .loc to set values is also possible:
df.loc[mask, 'fname'] = 'Johnson'
(Note that chained indexing like df[mask]['fname'] = ... assigns to a copy and won't reliably modify df.) But for a larger set of replacements you would want to use one of the two other methods, or use apply with a lambda function (for value transformations). Last but not least: you can use .fillna('bla') to rapidly fill up NA values.
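A brief sketch of those alternatives, with hypothetical data and column names just to show the shape of each call:

import pandas as pd

df = pd.DataFrame({'fname': ['Mike', None, 'Ann'], 'hc': [2, 1, 2]})

# replace: map specific old values to new ones
df['fname'] = df['fname'].replace({'Mike': 'Michael'})

# fillna: fill missing values in one go
df['fname'] = df['fname'].fillna('Unknown')

# merge: bring in new values from a small from/to mapping table
mapping = pd.DataFrame({'fname': ['Ann'], 'fname_new': ['Anna']})
df = df.merge(mapping, on='fname', how='left')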
The traceback indicates that df is a list, not a DataFrame, as your line of code expects.
It means that somewhere between df = pd.read_csv("test.csv") and df.loc[df.ID == 103, ['fname', 'lname']] = 'Michael', 'Johnson' there are other lines of code that assign a list object to df. Review that piece of code to find your bug.
@Boud's answer is correct. loc assignment works fine if the right-hand-side list matches the number of elements being replaced:
In [56]: df = pd.DataFrame(dict(A=[1, 2, 3], B=[4, 5, 6], C=[7, 8, 9]))
In [57]: df
Out[57]:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
In [58]: df.loc[1,['A','B']] = -1,-2
In [59]: df
Out[59]:
A B C
0 1 4 7
1 -1 -2 8
2 3 6 9