Streamlit / Pandas query / filtering for existing values and ignoring non-existing - python

I can imagine this question has been asked a lot, but I can't find the approach I need:
I built a dashboard in Streamlit and everything works fine. I want to show data based on some filters:
Date = st.sidebar.multiselect(
    "Please select the month:",
    options=df["Date"].unique(),
    default=month
)
Type = st.sidebar.multiselect(
    "Please select the type:",
    options=df["Type"].unique(),
    default=df["Type"].unique()
)
df_selection = df.query(
    "Date == @Date & Type == @Type"
)
In column "Type" there are only three types: "A", "B", "C". Some months have only types "A" and "B" and some all three.
If I select a month that has only "A" and "B", Streamlit throws an error:
KeyError: 'C'
File "app.py", line 106, in <module>
c = typen.loc["C"]
I avoided it with:
if 'C' in types.index:
    c = types.loc["C"]
else:
    c = 0
a = types.loc["A"]
b = types.loc["B"]
I tried to do the same for the other types ("A" and "B"), but it does not really work. Maybe I need a combination of a for loop and try/except to make the multiselect more flexible.
So basically I need a way to scan my DataFrame for missing data based on the filters, ignore it or replace it with 0, and return the existing values.
But somehow I can't get any further and need some advice :)
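One way to get there (a minimal sketch, assuming `types` holds per-type counts such as the output of `value_counts()` on the filtered frame) is to `reindex` against the full set of expected labels; missing labels are filled with 0 instead of raising a `KeyError`:

```python
import pandas as pd

# Hypothetical filtered result: this month only has types "A" and "B"
types = pd.Series({"A": 10, "B": 5})

# reindex inserts any missing labels with fill_value instead of raising KeyError
counts = types.reindex(["A", "B", "C"], fill_value=0)
a, b, c = counts["A"], counts["B"], counts["C"]
# c is 0 here, with no KeyError and no per-type if/else needed
```

This scales to any number of types without a loop of try/except blocks.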

Related

How to apply if condition to two different columns and put the result to a new column

I have a data frame df2 and want to generate a new column called 'tag' based on if-logic over two existing columns.
import pandas as pd
df2 = pd.DataFrame({
    'NOTES': ["PREPAID_HOME_SCREEN_MAMO", "SCREEN_MAMO",
              "> Unable to connect internet>4G Compatible>Set",
              "No>Not Barred>Active>No>Available>Others>",
              "Internet Not Working>>>Unable To Connect To"],
    'col_1': ["voice", "voice", "data", "other", "voice"],
    'col_2': ["DATA", "voice", "VOICE", "VOICE", "voice"]})
The logic and my attempt are:
df2['Tag'] =
if df['col_1'] == 'data':
    return "Yes"
elif df['col_2']:
    return "Yes"
else:
    return "No"
But I got a syntax error:
The problem is that you are trying to assign a value with an if-statement, which causes the syntax error.
There are many ways to do this; here is one using pandas.DataFrame.apply (note that the if/elif above amounts to an or, and that Python uses or, not &&):
trans_fn = lambda row: "Yes" if row['col_1'] == 'data' or row['col_2'] else "No"
df2['tag'] = df2.apply(trans_fn, axis=1)  # apply trans_fn to each row
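For larger frames, a vectorized alternative (a sketch using the same hypothetical column names, with simplified data) replaces the row-wise apply with numpy.where:

```python
import pandas as pd
import numpy as np

df2 = pd.DataFrame({'col_1': ["voice", "data", "other"],
                    'col_2': ["DATA", "", ""]})

# "Yes" when col_1 equals 'data' or col_2 is non-empty, else "No"
df2['tag'] = np.where((df2['col_1'] == 'data') | (df2['col_2'] != ''), "Yes", "No")
```

np.where evaluates the whole boolean mask at once instead of calling a Python function per row.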

Is it possible to check if a column exists or not, inside a pyspark select dataframe?

I have a JSON where some columns are sometimes not present in the structure. I'm trying to add a condition for that, but it's giving an error.
My code is:
v_id_row = df.schema.simpleString().find('id:')
df1 = df.select('name', 'age', 'city', 'email', when(v_id_row > 0, 'id').otherwise(lit("")))
I am getting the following error:
TypeError: condition should be a Column
How can I do this validation?
Could you try:
df1 = df.select(col('name'), col('age'), col('city'), col('email'), when(col('v_id_row') > 0, col('id')).otherwise(lit("")))
That works for me.
You can do it like this:
v_id_row = df.schema.simpleString().find('id:')
col_list = ['name', 'age', 'city', 'email']
if v_id_row > 0:
    col_list.append("id")
df1 = df.select(col_list)
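A more direct guard than scanning the schema string is a membership test on `df.columns`; the pattern is the same in PySpark and pandas (sketched here with pandas so it runs standalone, with hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({'name': ['a'], 'age': [30], 'city': ['x'], 'email': ['a@x']})

col_list = ['name', 'age', 'city', 'email']
# 'id' is only selected when the column actually exists in the frame
if 'id' in df.columns:
    col_list.append('id')
df1 = df[col_list]
```

In PySpark the check would read the same (`if 'id' in df.columns:`) followed by `df.select(col_list)`, and it avoids false hits that a substring search on the schema string can produce.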

how to replace values on a dataframe using pandas and streamlit in python?

I have a Python script that reads a DataFrame using pandas and displays its content using Streamlit.
What I want is to replace a current value with a new value based on user input: the user selects the required column, enters the current value in one text field and the new value in a second text field; when the Replace button is pressed, the old value is replaced by the new value and the new DataFrame is displayed.
The problem is that when it displays the DataFrame, nothing is changed.
code:
import pandas as pd
import streamlit as st

df = pd.DataFrame({
    "source_number":
        [11199, 11328, 11287, 32345, 12342, 1232, 13456, 123244, 13456],
    "location":
        ["loc2", "loc1", "loc3", "loc1", "loc2", "loc2", "loc3", "loc2", "loc1"],
    "category":
        ["cat1", "cat2", "cat1", "cat3", "cat3", "cat3", "cat2", "cat3", "cat2"],
})
columns = st.selectbox("Select column", df.columns)
old_values = st.multiselect("Current Values", list(df[columns].unique()), list(df[columns].unique()))
col1, col2 = st.beta_columns(2)
with col1:
    old_val = st.text_input("old value")
with col2:
    new_val = st.text_input("new value")
if st.button("Replace"):
    df[columns] = df[columns].replace({old_val: new_val})
st.dataframe(df)
There is a little error in your code:
df[columns] = df[columns].replace({old_val: new_val})
When you look at the pandas docs, the examples include code like s.replace({'a': None}), which replaces the value 'a' with None.
What that means for your code is that you are trying to replace a value that is a list with another list, but it does not work like that: your column doesn't have a list as an element, so you can't change it that way.
When I ran your code in Jupyter I got an error that a list is unhashable:
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-41a02888936d> in <module>
30 oldVal = [11199,11328,11287,32345]
31
---> 32 df["source_number"] = df["source_number"].replace({oldVal:newVal})
33 df
TypeError: unhashable type: 'list'
And this is the reason why it doesn't change anything for you.
If you want to change all values using lists you will have to write it like that:
df[column] = df[column].replace(old_values, new_values)
This code works just fine.
I hope I was clear enough and that it works for you.
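For reference, here is that list-to-list form as a minimal standalone sketch (hypothetical values; the two lists are mapped positionally):

```python
import pandas as pd

s = pd.Series(["loc1", "loc2", "loc3"])

# Positional list-to-list mapping: loc1 -> east, loc2 -> west;
# values not listed (loc3) are left unchanged
s2 = s.replace(["loc1", "loc2"], ["east", "west"])
```

The dict form `s.replace({"loc1": "east", "loc2": "west"})` is equivalent and often easier to read.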
Your code works for the text columns (location and category). It doesn't work for the numeric source_number column because you're trying to replace one string with another.
For numeric columns you'll need to use st.number_input instead of st.text_input:
import pandas as pd
from pandas.api.types import is_numeric_dtype
import streamlit as st

df = pd.DataFrame({
    "source_number":
        [11199, 11328, 11287, 32345, 12342, 1232, 13456, 123244, 13456],
    "location":
        ["loc2", "loc1", "loc3", "loc1", "loc2", "loc2", "loc3", "loc2", "loc1"],
    "category":
        ["cat1", "cat2", "cat1", "cat3", "cat3", "cat3", "cat2", "cat3", "cat2"],
})
columns = st.selectbox("Select column", df.columns)
old_values = st.multiselect("Current Values", list(df[columns].unique()), list(df[columns].unique()))
col1, col2 = st.beta_columns(2)
st_input = st.number_input if is_numeric_dtype(df[columns]) else st.text_input
with col1:
    old_val = st_input("old value")
with col2:
    new_val = st_input("new value")
if st.button("Replace"):
    df[columns] = df[columns].replace({old_val: new_val})
st.dataframe(df)
Update as per comment: you could cache the df to prevent re-initialization upon each widget interaction (you'll have to manually clear the cache to start over):
@st.cache(allow_output_mutation=True)
def get_df():
    return pd.DataFrame({
        "source_number":
            [11199, 11328, 11287, 32345, 12342, 1232, 13456, 123244, 13456],
        "location":
            ["loc2", "loc1", "loc3", "loc1", "loc2", "loc2", "loc3", "loc2", "loc1"],
        "category":
            ["cat1", "cat2", "cat1", "cat3", "cat3", "cat3", "cat2", "cat3", "cat2"],
    })
df = get_df()

Pandas 'None' type does not apply to all empty columns

I have some code that uses the 'openpyexcel' package to read Excel spreadsheets from a file. I need to know which columns are empty, but have found that for some reason not all empty fields are of 'None' type; some instead just appear blank. I've been trying to clear out whatever is in there using .replace, but haven't had much luck. I'm not sure if my regular expression is wrong or just doesn't apply to something in the column. I'm still working on getting comfortable with NumPy data types and am a little lost.
Regex I've been trying to use to replace the invisible data:
df = df.replace(r'^[?!a-zA-Z0-9_]+$', r'', regex=True)
Function to create a list of empty columns:
def lister(table, table_name):
    try:
        lst = []
        table.replace(to_replace='', value=None)
        for c in table.columns:
            if c == 'NOT USED':
                continue
            elif table[c].isna().all() == True:
                print(c)
                lst.append(c)
            else:
                continue
        return lst
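One way to attack this (a sketch with hypothetical data, not the original spreadsheet) is to normalize blank and whitespace-only strings to `NaN` first, then test each column with `isna().all()`. Note two things the snippet above misses: `replace` returns a new frame rather than modifying in place, and a `\s*` regex also catches cells that contain only spaces:

```python
import pandas as pd
import numpy as np

# Hypothetical frame: 'a' holds only blank/whitespace strings, 'b' has real data
df = pd.DataFrame({'a': ['', '  ', None],
                   'b': ['x', '', None],
                   'NOT USED': [None, None, None]})

# Turn empty or whitespace-only strings into NaN (assign the result back!),
# then collect the columns that are entirely missing
cleaned = df.replace(r'^\s*$', np.nan, regex=True)
empty_cols = [c for c in cleaned.columns
              if c != 'NOT USED' and cleaned[c].isna().all()]
```

Here `empty_cols` contains only 'a', since 'b' still has a real value after cleaning.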

Pandas Python user input an attribute for dataframe object

I'm trying to allow the user to input the attribute for the dataframe object.
I've tried converting my input into a string. I've also tried using my input saved to a variable. Neither option works.
data = pd.read_csv('2019FallEnrollees.csv')
input1_col = input("Enter comparison group A: ")
input2_col = input("Enter comparison group B: ")
input1_str= str(input1_col)
input2_str = str(input2_col)
test = data[['CUM_GPA', input1_str, input2_str]]
# error here! 'test' does not have attribute 'input1_str' or 'input1_col'
df_1 = test[(test.input1_str == 0) & (test.input2_str == 0)]
df_2 = test[(test.input1_col == 1) & (test.input2_col == 0)]
print(stats.ttest_ind(df_1.CUM_GPA, df_2.CUM_GPA, equal_var = False))
The error message says:
"AttributeError: 'DataFrame' object has no attribute 'input1_str'"
or
"AttributeError: 'DataFrame' object has no attribute 'input1_col'"
Welcome!
To access a column whose name is stored in a variable, you cannot use data.column in pandas. Try data[column], or in your case test[input1_col].
Before you do so, make sure the column exists and that the user is not entering a nonexistent column name. Sometimes a column name can be an integer, so converting to a string may also be a concern.
You can get a list of all the DataFrame columns by running data.columns (for a regular list: list(data.columns)), and in fact you can rename the columns by running data.columns = ["Column Header 1", "Column Header 2", etc.]
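Putting that together, a minimal sketch (with hypothetical data and column names standing in for the CSV and the input() calls) validates the names and uses bracket indexing throughout:

```python
import pandas as pd

# Hypothetical data standing in for pd.read_csv('2019FallEnrollees.csv')
data = pd.DataFrame({'CUM_GPA': [3.1, 2.5, 3.8],
                     'GROUP_A': [0, 1, 0],
                     'GROUP_B': [0, 0, 1]})

# These would come from input(); hard-coded here for the sketch
input1_str = 'GROUP_A'
input2_str = 'GROUP_B'

# Validate before indexing so a typo fails with a clear message
for col in (input1_str, input2_str):
    if col not in data.columns:
        raise ValueError(f"Unknown column: {col}")

test = data[['CUM_GPA', input1_str, input2_str]]
# Bracket indexing with the variable, instead of test.input1_str
df_1 = test[(test[input1_str] == 0) & (test[input2_str] == 0)]
df_2 = test[(test[input1_str] == 1) & (test[input2_str] == 0)]
```

With the sample data, df_1 and df_2 each select one row, ready for the t-test from the original code.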
