I am attempting to run a simple query in Power BI using Python. Sadly, most Python libraries are not supported in Power BI, so I'm limited to pandas and numpy. The dataset is a set of projects that are either in pipeline or active. I want to filter the dataset down to rows that are just in pipeline, based on a set of OR conditions. So it would look something like:
dataframe = pd.DataFrame(where project = 'Pipeline', <set of other conditions to filter pipeline launches by>)
Is that possible in python, similar to a nested where statement?
You can use isin to look up multiple values in a column. If you want to filter on multiple columns, combine the boolean conditions with & and | (whatever .loc or .iloc returns is itself a dataframe, so you can also chain calls, as long as the second mask is built from the already-filtered frame). Below is a working example of both conditions.
>>> import pandas as pd
>>> import numpy as np
>>> d = {'col1': [1, 2, 2, 3], 'col2': [3, 4, 5, 6], 'col3': ['NULL','NULL', np.nan, 'virgo']}
>>> df = pd.DataFrame(data=d)
We will query the dataframe to get rows where col3 is either virgo or NULL
>>> df.loc[df['col3'].isin(['virgo','NULL'])]
col1 col2 col3
0 1 3 NULL
1 2 4 NULL
3 3 6 virgo
Now we will query the dataframe to get rows where col3 is NULL and col2 is 4
>>> df.loc[(df['col3'] == 'NULL') & (df['col2'] == 4)]
col1 col2 col3
1 2 4 NULL
If all the conditions are on a single column, do:
df [ df.column_name.isin([value_1, value_2, value_n]) ]
If the conditions span multiple columns, do:
df [ (df.column_1 == "value_1") & (df.column_2 == "value_2") & (df.column_n.isin([val_1, val_2, val_3])) ]
Note:
& means AND: the result is True only if both sides (left and right) are True, otherwise it is False.
If you need an OR condition, use | instead.
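Applied back to the original Power BI question, a minimal sketch could look like the following. The column names 'project', 'region' and 'launch_year' are assumptions, since the real schema wasn't shown, and 'dataframe' stands for whatever table Power BI hands to the Python script.
pipeline = dataframe[
    (dataframe['project'] == 'Pipeline')
    & ((dataframe['region'] == 'EMEA') | (dataframe['launch_year'] == 2024))
]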
I have a dataset, df, with some empty values in the second column, col2.
So I created a new table with the same column names, whose length equals the number of missing values in col2 of df. I call the new dataframe df2.
df[df['col2'].isna()] = df2
But this returns NaN for the entire rows where col2 was missing, which means those rows are now missing in every column, not only in col2.
Why is that, and how can I fix it?
Assuming that by df2 you really mean a Series, rename it s. (The original assignment fills the selected rows with NaN because pandas aligns the right-hand side on both index and columns, and df2's labels don't line up with df's.)
df.loc[df['col2'].isna(), 'col2'] = s.values
Example
nan = float('nan')
df = pd.DataFrame({'col1': [1,2,3], 'col2': [nan, 0, nan]})
s = pd.Series([10, 11])
df.loc[df['col2'].isna(), 'col2'] = s.values
>>> df
col1 col2
0 1 10.0
1 2 0.0
2 3 11.0
Note
I don't like this, because it relies on the number of NaNs in df being exactly the length of s. It would be better to know how you create the missing values; with that information, we could probably propose a better and more robust solution.
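For instance, if the replacement values can be keyed by the row labels that are missing, fillna aligns them on the index instead of relying on positional order. A small sketch with assumed replacement values:
replacements = pd.Series({0: 10, 2: 11})  # assumed: values keyed by the rows that are NaN
df['col2'] = df['col2'].fillna(replacements)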
5 columns (col1 - col5) in a 10-column dataframe (df) should be either blank or have text values only. If any row in these 5 columns has an all-numeric value, I need to trigger an error. I wrote the following code to identify rows where the value in 'col1' is all numeric (I will cycle through all 5 columns using the same code):
df2 = df[df['col1'].str.isnumeric()]
I get the following error: ValueError: cannot mask with array containing NA / NaN values
This is triggered because the blank values produce NaN instead of False. I see this when I build the mask separately using the following:
lst = df['col1'].str.isnumeric()
Any suggestions on how to solve this? Thanks
Try this to work around the NaN
import pandas as pd
df = pd.DataFrame([{'col1':1}, {'col1': 'a'}, {'col1': None}])
lst = df['col1'].astype(str).str.isnumeric()
if lst.any():
    raise ValueError()
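Extending that to all five columns from the question (the column names col1 - col5 are taken from the question, not from the one-column example above), a sketch might be:
cols = ['col1', 'col2', 'col3', 'col4', 'col5']
# astype(str) turns NaN into 'nan', which is not numeric, so the mask contains no NaN
numeric_mask = df[cols].astype(str).apply(lambda s: s.str.isnumeric())
if numeric_mask.any().any():
    bad_cols = numeric_mask.any()
    raise ValueError("All-numeric values found in: " + ", ".join(bad_cols[bad_cols].index))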
Here's a way to do it:
import string
df['flag'] = (df
              .applymap(lambda x: any(i in string.digits for i in x))
              .apply(lambda x: f'Fail: {",".join(df.columns[x].tolist())} is numeric', axis=1))
print(df)
col1 col2 flag
0 a 2.04 Fail: col2 is numeric
1 2.02 b Fail: col1 is numeric
2 c c Fail: is numeric
3 d e Fail: is numeric
Explanation:
We iterate through each value of the dataframe, check whether it contains a digit, and return a boolean value.
We use that boolean value to subset the column names
Sample Data
df = pd.DataFrame({'col1': ['a','2.02','c','d'],
'col2' : ['2.04','b','c','e']})
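Note that the string.digits check above flags any cell that merely contains a digit (e.g. 'abc1'). If "numeric" should mean that the whole cell parses as a number, a stricter sketch on the same sample data would be:
numeric_cells = df[['col1', 'col2']].apply(lambda s: pd.to_numeric(s, errors='coerce').notna())
if numeric_cells.any().any():
    raise ValueError("Numeric value found in a text-only column")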
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
def calculation(text):
    return text*2

for idx, row in df.iterrows():
    df.at[idx, 'col3'] = dict(cats=calculation(row['col1']))
df
So as you can see from the code above I have tried a few different things.
Basically I am trying to get the dictionary into col3.
However, when you run it for the first time on a new dataframe, you get:
col1 col2 col3
0 1 3 cats
1 2 4 {'cats': 4}
If you run the for loop again on the same dataframe, you get what I am looking for, which is:
col1 col2 col3
0 1 3 {'cats': 2}
1 2 4 {'cats': 4}
How do I go straight to having the dictionary in there to start without having to run the loop again?
I have tried other ways like df.loc and others, still no joy.
Try to stay away from df.iterrows().
You can use df.apply instead:
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
def calculation(text):
    return text*2

def calc_dict(row):
    return dict(cats=calculation(row['col1']))

df['col3'] = df.apply(calc_dict, axis=1)
df
Which outputs the result you expect.
The error seems to creep in with the creation and assignment of an object dtype to column col3. I tried to pre-allocate to NaNs with df['col3'] = pd.np.NaN, which did not have an effect (inspect with print(df.dtypes)). Anyway, this seems like buggy behaviour. Use df.apply instead; it's faster and less prone to these types of issues.
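If you do want to keep the explicit loop, a small sketch that seems to avoid the first-pass problem is to pre-create col3 as an object column, so .at can store a dict straight away:
df['col3'] = None  # object dtype, so each cell can hold a dict
for idx, row in df.iterrows():
    df.at[idx, 'col3'] = dict(cats=calculation(row['col1']))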
I am using this code
searchfor = ["s", 'John']
df = df[~df.iloc[1].astype(str).str.contains('|'.join(searchfor),na=False)]
This returns the error
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
However, this works fine when run as a column search:
df = df[~df.iloc[:,1].astype(str).str.contains('|'.join(searchfor), na=False)]
I am trying to remove a row based on whether the row contains a certain phrase.
To drop rows
Create a mask which returns True or False depending on whether each cell contains any of your strings:
search_for = ["s", "John"]
mask = df.applymap(lambda x: any(s in str(x) for s in search_for))
Then use .any(axis=1) to check for at least one True per row, and with boolean indexing keep only the rows where no True was found:
df_filtered = df[~mask.any(axis=1)]
To drop columns
search_for = ["s", "John"]
mask = df.applymap(lambda x: any(s in str(x) for s in search_for))
Use axis=0 instead of 1 to check each column:
columns_analysis = mask.any(axis=0)
Get the labels where the value is True, i.e. the columns to drop:
columns_to_drop = columns_analysis[columns_analysis].index.tolist()
df_filtered = df.drop(columns_to_drop, axis=1)
This is related to the way you are slicing your data.
In the first statement, you are asking pandas to give you the second row of your dataframe (index 1 is the second row; if you want the first, use index 0), while in the second case you are asking for the second column, and in your dataframe rows and columns have different shapes. See this example:
d = {'col1': [1, 2], 'col2': [3, 4], 'col3':[23,23]}
df = pd.DataFrame(data=d)
print(df)
   col1  col2  col3
0     1     3    23
1     2     4    23
First row:
df.iloc[0]
col1 1
col2 3
col3 23
Name: 0, dtype: int64
First column:
df.iloc[:, 0]
0    1
1    2
Name: col1, dtype: int64
Try this out.
Good luck.
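In other words, the mask built from df.iloc[1] is indexed by the column names, so it can only be used to select columns, not rows. A sketch of what that row-based mask can legitimately do (reusing the original searchfor list):
searchfor = ["s", "John"]
mask = df.iloc[1].astype(str).str.contains('|'.join(searchfor), na=False)
df_without_matching_columns = df.loc[:, ~mask]  # drops the columns whose value in the second row matches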
Here is a very simple dataframe:
df = pd.DataFrame({'col1' :[1,2,3],
'col2' :[1,3,3] })
I'm trying to remove rows where there are duplicate values (e.g., row 3)
This doesn't work,
df = df[(df.col1 != 3 & df.col2 != 3)]
and the documentation is pretty clear about why, which makes sense.
But I still don't know how to delete that row.
Does anyone have any ideas? Thanks. Monica.
If I understand your question correctly, I think you were close.
Starting from your data:
In [20]: df
Out[20]:
col1 col2
0 1 1
1 2 3
2 3 3
And doing this:
In [21]: df = df[df['col1'] != df['col2']]
Returns:
In [22]: df
Out[22]:
col1 col2
1 2 3
What about:
In [43]: df = pd.DataFrame({'col1' :[1,2,3],
'col2' :[1,3,3] })
In [44]: df[df.max(axis=1) != df.min(axis=1)]
Out[44]:
col1 col2
1 2 3
[1 rows x 2 columns]
We want to remove rows where the same value shows up in every column; in other words, all the values are equal, so the row's minimum and maximum are equal. This method works on a DataFrame with any number of columns. Applying the above removes rows 0 and 2.
Any row with all the same values will have a standard deviation of zero. One way to filter them out is as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [1, 2, 3, np.nan],
                   'col2': [1, 3, 3, np.nan]})
>>> df.loc[df.std(axis=1, skipna=False) > 0]
   col1  col2
1   2.0   3.0
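A related one-liner, as a sketch, counts the distinct values per row with nunique; rows whose columns all agree have exactly one distinct value (all-NaN rows count zero, so they are dropped as well):
>>> df[df.nunique(axis=1) > 1]
   col1  col2
1   2.0   3.0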