KeyError warning from pandas dataframe inside plotly dash chained callback - python

I have multiple dropdowns that I'd like to populate depending on what the user chooses in the previous dropdown. I populate the first dropdown with:
schools_requests = requests.get("http://ipwhatever:portwhatever/list_all_schools")
schools_data = schools_requests.json()
df = pd.DataFrame(schools_data)
nome = df['nome'].tolist()
It gives me the names of the schools I got listed. I then send it (nome) to the first dropdown like this:
html.Label('Escola'),
dcc.Dropdown(
options = nome,
id = "escola",
)
The first callback works fine and it's the one down below:
@callback(
    Output('id_school', 'children'),
    Input('escola', 'value')
)
def find_id_school(school_name):
    all_schools = requests.get(
        "http://ipwhatever:portwhatever/list_all_schools")
    all_schools_data = all_schools.json()
    for element in all_schools_data:
        if school_name == element['nome']:
            id_school = element['id_escola']
            return id_school
It basically searches for the corresponding school id given the name of the school the user chose in the first dropdown and stores this id in a hidden html.Div.
Now comes the second callback, where I use pandas and don't understand why it's different from the first time.
@callback(
    Output('ano', 'options'),
    Input('id_school', 'children')
)
def render_grade_from_school(chosen_id):
    grade = requests.get(
        "http://ipwhatever:portwhatever/grade?school_id="+str(chosen_id))
    grade_data = grade.json()
    indices = list(range(0, len(grade_data)))
    df = pd.DataFrame(grade_data, index=indices)
    ano = df['serie'].tolist()
    return ano
So it takes the school id, requests the grades from another endpoint and basically does the same thing as the first time I used pandas in the code.
The only difference is the index argument. pandas started complaining about a missing index, so I take the length of the JSON list, generate a list of indices like [0, 1, 2, ...] and pass it as the index argument to the DataFrame. That stopped the complaint.
But now I get a KeyError: 'serie'. The traceback highlights return self._engine.get_loc(casted_key) as the source, and I don't know why. Still, the dropdown 'ano' (grade) updates correctly and shows the grades. But the warning never goes away.
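A possible explanation, offered as an assumption since the post does not show the endpoint's responses: Dash runs every callback once when the page loads, so render_grade_from_school is first called with chosen_id set to None. If the endpoint then returns an error object (a single dict) instead of a list of records, pd.DataFrame raises "If using all scalar values, you must pass an index", and once an index is forced the frame has no 'serie' column, hence the KeyError. Later calls with a real id succeed, which would match the dropdown still updating correctly. The sketch below uses a hypothetical helper, extract_series, to show one way of guarding against that case without pandas:

```python
def extract_series(grade_data):
    """Return the 'serie' values from a list of records.

    Hypothetical helper: anything that is not a list of dicts
    (None, an error object, ...) yields an empty list instead of raising.
    """
    if not isinstance(grade_data, list):
        return []
    return [
        row["serie"]
        for row in grade_data
        if isinstance(row, dict) and "serie" in row
    ]


# Normal payload: a list of records
print(extract_series([{"serie": "1A"}, {"serie": "2B"}]))

# Assumed error payload when no school id was sent: a single dict
print(extract_series({"detail": "school not found"}))
```

Inside the callback, the request could also be skipped entirely when chosen_id is None, for example by returning an empty options list before calling the endpoint.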


Understanding streamlit data flow and how to submit form in a sequential way

Below is a simple reproducible example that works to illustrate the problem in its simple form. You can jump to the code and expected behaviour as the problem description can be long.
The main concept
There are 3 dataframes stored in a list, and a form on the sidebar shows the supplier_name and po_number from the relevant dataframe. When the user clicks the Next button, the information inside the supplier_name and po_number text_input will be saved (in this example, they basically got printed out on top of the sidebar).
Problem
This app works well when the user doesn't change anything inside the text_input, but if the user changes something, it breaks the app. For example, when I change the po_number to somethingrandom, the saved information is not somethingrandom but P123 from the first dataframe.
What's more, if the information from the next dataframe is the same as the first dataframe, the changed value inside the text_input will be unchanged for the next display. For example, because the first and second dataframe's supplier name are both S1, if I change the supplier name to S10, then click next, the supplier_name is still S10 on the second dataframe, while the second dataframe's supplier_name should be S1. But if the supplier name for the next dataframe changed, the information inside the text_input will be changed.
Justification
If you are struggling to understand why I want to do this, the original use case is a sidebar input area that shows information extracted from each PDF. When the user confirms the information is all correct, they click next to review the next PDF. If something is wrong, they can change the information inside the text_input, then click next, and the changed value will be recorded. For the next PDF, the displayed information should reflect what was extracted from that PDF. I did this in R shiny quite simply, but can't figure out how the data flow works here in streamlit, please help.
Reproducible Example
import streamlit as st
import pandas as pd
# 3 dataframes that are stored in a list
data1 = {
"supplier_name": ["S1"],
"po_number": ["P123"],
}
data2 = {
"supplier_name": ["S1"],
"po_number": ["P124"],
}
data3 = {
"supplier_name": ["S2"],
"po_number": ["P125"],
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
list1 = [df1, df2, df3]
# initiate a page session state, every time next button is clicked
# it will go to the next dataframe in the list
if 'page' not in st.session_state:
    st.session_state.page = 0
def next_page():
    st.sidebar.write(f"Submitted! supplier_name: {supplier_name} po_number: {po_number}")
    st.session_state.page += 1
supplier_name_value = list1[st.session_state.page]["supplier_name"][0]
po_number_value = list1[st.session_state.page]["po_number"][0]
# main area
list1[st.session_state.page]
# sidebar form
with st.sidebar.form("form"):
    supplier_name = st.text_input(label="Supplier Name", value=supplier_name_value)
    po_number = st.text_input(label="PO Number", value=po_number_value)
    next_button = st.form_submit_button("Next", on_click=next_page)
Expected behaviour
The dataframe's info are extracted into the sidebar input area. The user can change the input if they wish, then click next, and the values inside the input areas will be saved. When it goes to the next dataframe, the values inside the text input will be refreshed to extract from the next dataframe, and repeats.
I'm not totally sure what you're going for, but after some messing around, the only way I was able to achieve this sort of sequential form submission handling is with st.experimental_rerun(). I hate to resort to that since it may be removed any time, so hopefully there's a better way.
Without experimental_rerun(), forms take two submits to actually update state. I wasn't able to find a "correct" way to achieve an immediate update to support the expected behavior.
Here's my attempt:
import pandas as pd # 1.5.1
import streamlit as st # 1.18.1
def initialize_state():
    data = [
        {
            "supplier_name": ["S1"],
            "po_number": ["P123"],
        },
        {
            "supplier_name": ["S1"],
            "po_number": ["P124"],
        },
        {
            "supplier_name": ["S2"],
            "po_number": ["P125"],
        },
    ]
    state.dfs = state.get("dfs", [pd.DataFrame(x) for x in data])
    first_vals = [{x: df[x][0] for x in df.columns} for df in state.dfs]
    state.selections = state.get("selections", first_vals)
    state.pages_expanded = state.get("pages_expanded", 0)
    state.current_page = state.get("current_page", 0)
    state.just_modified_page = state.get("just_modified_page", -1)

def handle_submit(i):
    st.session_state.selections[i] = {
        "supplier_name": state.new_supplier_name,
        "po_number": state.new_po_number,
    }
    state.current_page = i
    state.just_modified_page = i
    if i < len(state.dfs) - 1 and state.pages_expanded == i:
        state.pages_expanded += 1
    st.experimental_rerun()

def render_form(i):
    with st.sidebar.form(key=f"form-{i}"):
        supplier_name = state.selections[i]["supplier_name"]
        po_number = state.selections[i]["po_number"]
        if i == state.just_modified_page:
            st.sidebar.write(
                f"Submitted! supplier_name: {supplier_name} "
                f"po_number: {po_number}"
            )
            state.just_modified_page = -1
        state.new_supplier_name = st.text_input(
            label="Supplier Name",
            value=supplier_name,
        )
        state.new_po_number = st.text_input(
            label="PO Number",
            value=po_number,
        )
        if st.form_submit_button("Next"):
            handle_submit(i)

state = st.session_state
initialize_state()
for i in range(state.pages_expanded + 1):
    render_form(i)
# debug
st.write("state.pages_expanded", state.pages_expanded)
st.write("state.current_page", state.current_page)
st.write("state.just_modified_page", state.just_modified_page)
st.write("state.dfs[state.current_page]", state.dfs[state.current_page])
st.write("state.selections", state.selections)
I'm assuming you want to keep track of the user's selections, but not actually modify the dataframes. If you do want to modify the dataframes, that's simpler: replace state.selections with actual writes to dfs by index and column:
# ...
def handle_submit(i):
    st.session_state.dfs[i]["supplier_name"] = state.new_supplier_name
    st.session_state.dfs[i]["po_number"] = state.new_po_number
    #st.session_state.selections[i] = {
    #    "supplier_name": state.new_supplier_name,
    #    "po_number": state.new_po_number,
    #}
    # ...
def render_form(i):
    with st.sidebar.form(key=f"form-{i}"):
        supplier_name = state.dfs[i]["supplier_name"][0]
        po_number = state.dfs[i]["po_number"][0]
        #supplier_name = state.selections[i]["supplier_name"]
        #po_number = state.selections[i]["po_number"]
        # ...
Now, it's possible to make this 100% dynamic, but I hardcoded supplier_name and po_number to avoid premature generalization that you may not need. If you do want to generalize, use df.columns like initialize_state does throughout the code.
I'm not sure I quite understand what you're trying to accomplish, but it seems like you're never updating the supplier name in list1 after the user updates the name via the text input widget.

How can I align columns if rows have different number of values?

I am scraping data with python. I get a csv file and can split it into columns in excel later. But I am encountering an issue I have not been able to solve. Sometimes the scraped items have two statuses and sometimes just one. The second status is thus moving the other values in the columns to the right and as a result the dates are not all in the same column which would be useful to sort the rows.
Do you have any idea how to make the columns merge if there are two statuses for example or other solutions?
Maybe it is also an issue that I still need to separate the values into columns manually with Excel.
Here is my code
#call packages
import random
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd
# define driver etc.
service_obj = Service("C:\\Users\\joerg\\PycharmProjects\\dynamic2\\chromedriver.exe")
browser = webdriver.Chrome(service=service_obj)
# create loop
initiative_list = []
for i in range(0, 2):
    url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page='+str(i)
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
    initiatives = [item.text for item in initiative_item]
    initiative_list.extend(initiatives)
df = pd.DataFrame(initiative_list)
#create csv
print(df)
df.to_csv('Initiativen.csv')
df.columns = ['tosplit']
new_df = df['tosplit'].str.split('\n', expand=True)
print(new_df)
new_df.to_csv('Initiativennew.csv')
I tried to merge the columns if there are two statuses.
make the columns merge if there are two statuses for example or other solutions
[If by "statuses" you mean the yellow labels ending in OPEN/UPCOMING/etc, then] it should be taken care of by the following parts of the getDetails_iiaRow (below the dividing line):
labels = cssSelect(iiaEl, 'div.field span.label')
and then
'labels': ', '.join([l.text.strip() for l in labels])
So, multiple labels will be separated by commas (or any other separator you apply .join to).
initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
initiatives = [item.text for item in initiative_item]
Instead of doing it like this and then having to split and clean things, you should consider extracting each item in a more specific manner and have each "row" be represented as a dictionary (with the column-names as the keys, so nothing gets mis-aligned later). If you wrap it as a function:
def cssSelect(el, sel): return el.find_elements(By.CSS_SELECTOR, sel)

def getDetails_iiaRow(iiaEl):
    title = cssSelect(iiaEl, 'div.search-result-title')
    labels = cssSelect(iiaEl, 'div.field span.label')
    iiarDets = {
        'title': title[0].text.strip() if title else None,
        'labels': ', '.join([l.text.strip() for l in labels])
    }
    cvSel = 'div[translate]+div:last-child'
    for c in cssSelect(iiaEl, f'div:has(>{cvSel})'):
        colName = cssSelect(c, 'div[translate]')[0].text.strip()
        iiarDets[colName] = cssSelect(c, cvSel)[0].text.strip()
    link = iiaEl.get_attribute('href')
    # turn a site-relative link into an absolute one
    if link and link[:1] == '/':
        link = f'https://ec.europa.eu{link}'
    iiarDets['link'] = link
    return iiarDets
then you can simply loop through the pages like:
initiative_list = []
for i in range(0, 2):
    url = f'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page={i}'
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_list += [
        getDetails_iiaRow(iia) for iia in
        cssSelect(browser, 'initivative-item>article>a ')
    ]
and then, since it's all cleaned already, you can directly save the data with
pd.DataFrame(initiative_list).to_csv('Initiativen.csv', index=False)
The output I got for the first 3 pages looks like:
I think it is worth working a little bit harder to get your data rationalised before putting it in the csv rather than trying to unpick the damage once ragged data has been exported.
A quick look at each record in the page suggests that there are five main items that you want to export and these correspond to the five top-level divs in the a element.
The complexity (as you note) comes because there are sometimes two statuses specified, and in that case there is sometimes a separate date range for each and sometimes a single date range.
I have therefore chosen to put the three ever present fields as the first three columns, followed next by the status + date range columns as pairs. Finally I have removed the field names (these should effectively become the column headings) to leave only the variable data in the rows.
initiatives = [processDiv(item) for item in initiative_item]
def processDiv(item):
    divs = item.find_elements(By.XPATH, "./article/a/div")
    if "\n" in divs[0].text:
        statuses = divs[0].text.split("\n")
        if len(divs) > 5:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1], statuses[0], divs[4].text.split("\n")[1], statuses[1], divs[5].text.split("\n")[1]]
        else:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1], statuses[0], divs[4].text.split("\n")[1], statuses[1], divs[4].text.split("\n")[1]]
    else:
        return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1], divs[0].text, divs[4].text.split("\n")[1]]
The above approach sticks as close to yours as I can. You will clearly need to rework the pandas code to reflect the slightly altered data structure.
Personally, I would invest even more time in clearly identifying the best definitions for the fields that represent each piece of data that you wish to retrieve (rather than as simply divs 0-5), and extract the text directly from them (rather than messing around with split). In this way you are far more likely to create robust code that can be maintained over time (perhaps not your goal).
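To illustrate why dictionary rows keep the columns aligned, here is a minimal sketch with made-up rows (the titles, labels and the "Feedback period" column name are invented for the example, not scraped). pandas matches values to columns by key, so a row that lacks a field gets an empty cell instead of shifting its neighbours:

```python
import pandas as pd

# Hypothetical rows in the shape a per-item extraction function might return
rows = [
    {"title": "Initiative A", "labels": "OPEN", "Feedback period": "01 Jan - 01 Feb"},
    {"title": "Initiative B", "labels": "OPEN, UPCOMING"},  # no date field here
]

df = pd.DataFrame(rows)

# Columns come from the dictionary keys; missing values become NaN
print(df.columns.tolist())
print(df)
```

Because each status list is joined into a single 'labels' string, an item with two statuses still occupies one cell, and the date column stays sortable.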

Python (Datapane) : How to pass dynamic variables into a datapane report function

I am working on a charting module where I can pass on dataframe and the module will create reports based on plots generated by calling few functions as mentioned below.
I am using Altair for plotting and "Datapane" for creating the report, the documentation of the same can be found here : https://datapane.github.io/datapane/
My DataFrame looks like this
d = {'Date': ['2021-01-01', '2021-01-01','2021-01-01','2021-01-01','2021-01-02','2021-01-03'],
'country': ['IND','IND','IND','IND','IND','IND' ],
'channel': ['Organic','CRM','Facebook','referral','CRM','CRM' ],
'sessions': [10000,8000,4000,2000,7000,6000 ],
'conversion': [0.1,0.2,0.1,0.05,0.12,0.11 ],
}
country_channel = pd.DataFrame(d)
Plotting functions :
def plot_chart(source,Y_axis_1,Y_axis_2,chart_caption):
    base = alt.Chart(source).encode(
        alt.X('Date:T', axis=alt.Axis(title="Date"))
    )
    line_1 = base.mark_line(opacity=1, color='#5276A7').encode(
        alt.Y(Y_axis_1,
              axis=alt.Axis( titleColor='#5276A7'))
    )
    line_2 = base.mark_line(opacity=0.3,color='#57A44C', interpolate='monotone').encode(
        alt.Y(Y_axis_2,
              axis=alt.Axis( titleColor='#57A44C'))
    )
    chart_ae=alt.layer(line_1, line_2).resolve_scale(
        y = 'independent'
    ).interactive()
    charted_plot = dp.Plot(chart_ae , caption=chart_caption)
    return charted_plot

def channel_plot_split(filter_1,filter_2,country,channel):
    channel_split_data = country_channel[(country_channel[filter_1]==country.upper())]
    channel_split_data =channel_split_data[(channel_split_data[filter_2].str.upper()==channel.upper())]
    channel_split_data=channel_split_data.sort_values(by='Date',ascending = True)
    channel_split_data=channel_split_data.reset_index(drop=True)
    channel_split_data.head()
    plot_channel_split = plot_chart(source=channel_split_data,Y_axis_1='sessions:Q',Y_axis_2='conversion:Q',chart_caption="Sessions-Conversion Plot for Country "+country.upper()+" and channel :"+ channel)
    channel_plot=dp.Group(dp.HTML("<div class='center'> <h3> Country : "+country.upper()+" & Channel : "+channel.upper()+"</h3></div>"),plot_channel_split,rows=2)
    return channel_plot

def grpplot(plot_1,plot_2):
    gp_plot = dp.Group(plot_1,plot_2,columns=2)
    return gp_plot
The above functions when called, will filter the dataframe, create plot for each filters and group 2 plots in a row.
row_1 = grpplot(channel_plot_split('country','channel','IND','Organic'),channel_plot_split('country','channel','IND','CRM'))
row_2 = grpplot(channel_plot_split('country','channel','IND','Facebook'),channel_plot_split('country','channel','IND','referral'))
I can now generate a report by calling datapane.Report() function as follows
r= dp.Report(row_1,row_2)
Problem: This works fine when I know how many channels are present, but my channel list is dynamic. I am thinking of using a "for" loop to generate the rows, but I'm not sure how to pass these rows as arguments to the dp.Report() function. For example, if I have 10 channels, I need to pass 10 rows dynamically.
I had a similar problem and solved it as follows
Create a list to store the pages or elements of the report, such as
report_pages=[]
report_pages.append(dp.Page)
report_pages.append(dp.Table)
report_pages.append(dp.Plot)
At the end, generate the report by unpacking the list with *:
dp.Report(*report_pages)
In your case, I think you can do the following
create a list
rows=[]
add the rows to the list
rows.append(row_1)
rows.append(row_2)
and then create the report with
r= dp.Report(*rows)
I found this solution on datapane's GitHub and in this notebook in the last line of code.
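The unpacking step can be tried without datapane. In the sketch below, make_report is a hypothetical stand-in for dp.Report that simply records the positional arguments it receives, and pair_up is an illustrative helper that groups a channel list of any length into rows of two:

```python
def make_report(*blocks):
    """Hypothetical stand-in for dp.Report: records the blocks it receives."""
    return list(blocks)


def pair_up(items):
    """Group items into rows of two; a trailing odd item gets its own row."""
    return [tuple(items[i:i + 2]) for i in range(0, len(items), 2)]


# The channel list can have any length, known only at run time
channels = ["Organic", "CRM", "Facebook", "referral", "Email"]

rows = pair_up(channels)

# *rows spreads the list into separate positional arguments,
# equivalent to make_report(rows[0], rows[1], rows[2])
report = make_report(*rows)
print(report)
```

In the real report, each tuple would presumably be replaced by a grpplot(...) call, with the odd row handled as a single plot.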
So here is how I solved this problem.
channel_graph_list=[]
for i in range(0,len(unique_channels),1):
    channel_1_name = unique_channels[i]
    filtered_data = filter_the_data(source=channel_data,filter_1='channel',fv_1=channel_1_name)
    get_chart = plot_chart(filtered_data,Y_axis_1='sessions:Q',Y_axis_2='conversion:Q',chart_title='Session & Conv. Chart for '+channel_1_name)
    #This is where the trick starts - The below code creates a dynamic variable
    vars() ["channel_row_"+str(i)] = get_chart
    channel_graph_list.append("dp.Plot(channel_row_"+str(i)+",label='"+channel_1_name+"')")
#convert the list to a string
channel_graph_row = ','.join(channel_graph_list)
# assign the code you want to run
code="""channel_graph = dp.Select(blocks=["""+channel_graph_row+ """],type=dp.SelectType.TABS)"""
#execute the code
exec(code)
Hope the above solution helps others looking to pass dynamically generated parameters into any function.

Using a Python list to populate in a drop down validation

I am using openpyxl to manipulate a spreadsheet from Python.
I am trying to create a drop-down validation in a workbook tab called organisation. Is it possible to use a Python list to populate the elements in the drop down selection?
When I hardcode the drop down options into the DataValidation line like so:
dv = DataValidation(type="list", formula1='"The,earth,revolves,around,sun"', allow_blank=True)
The drop down is created in the spreadsheet tab and populated with the options as expected.
However when I try to add the drop down options using Python list and then pass to the DataValidation line like so:
valid = ['"The,earth,revolves,around,sun"']
dv = DataValidation(type="list", formula1=valid, allow_blank=True)
the drop down list is not created.
For extra information please see the full script:
def addValidationDropDowns(path):
    valid = ['"The,earth,revolves,around,sun"']
    wb = openpyxl.load_workbook(path)
    ws = wb['organisation']
    dv = DataValidation(type="list", formula1=valid, allow_blank=True)
    ws.add_data_validation(dv)
    for x in range(0, 3):
        dv.add(ws["A"+str(x+10)])
    wb.save(path)
    return
I struggled with this the first time I did it. It is curious: when the 'type' parameter of DataValidation is "list", you think "ok, let's use a list!", but no, it is expecting a string. I think your example will work if you remove the square brackets from the 'valid' variable.
valid = '"The,earth,revolves,around,sun"'
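If the options really do start out as a Python list, they can be joined into that string form first. A minimal sketch, where list_to_formula is a hypothetical helper name and not part of openpyxl:

```python
def list_to_formula(options):
    """Join options into the quoted, comma-separated string
    that a list-type data validation expects as formula1."""
    return '"' + ",".join(options) + '"'


options = ["The", "earth", "revolves", "around", "sun"]
formula = list_to_formula(options)
print(formula)

# With openpyxl, the string would then be passed instead of the list:
# dv = DataValidation(type="list", formula1=formula, allow_blank=True)
```

This only works for options that contain no commas themselves, since the comma is the separator inside the formula string.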

Python : Separating a .txt file into columns and finding the most frequent data item in one of the columns

I read from a file and stored the contents in artists_tags, with column names.
Now this file has multiple columns and I need to generate a new data structure which has 2 columns from the artists_tag as it is and the most frequent value from the 'Tag' column as the 3rd column value.
Here is what I have written as of now:
import pandas as pd
from collections import Counter
def parse_artists_tags(filename):
    df = pd.read_csv(filename, sep="|", names=["ArtistID", "ArtistName", "Tag", "Count"])
    return df
def parse_user_artists_matrix(filename):
    df = pd.read_csv(filename)
    return df
# artists_tags = parse_artists_tags(DATA_PATH + "\\artists-tags.txt")
artists_tags = parse_artists_tags("C:\\Users\\15-J001TX\\Documents\\ml_task\\artists-tags.txt")
#print(artists_tags)
user_art_mat = parse_user_artists_matrix("C:\\Users\\15-J001TX\\Documents\\ml_task\\userart-mat-training.csv")
#print ("Number of tags {0}".format(len(artists_tags))) # Change this line. Should be 952803
#print ("Number of artists {0}".format(len(user_art_mat))) # Change this line. Should be 17119
# TODO Implement this. You can change the function arguments if necessary
# Return a data structure that contains (artist id, artist name, top tag) for every artist
def calculate_top_tag(all_tags):
    temp = all_tags.Tag
    a = Counter(temp)
    a = a.most_common()
    print (a)
    top_tags = all_tags.ArtistID,all_tags.ArtistName,a
    return top_tags
top_tags = calculate_top_tag(artists_tags)
# Print the top tag for Nirvana
# Artist ID for Nirvana is 5b11f4ce-a62d-471e-81fc-a69a8278c7da
# Should be 'Grunge'
print ("Top tag for Nirvana is {0}".format(top_tags)) # Complete this line
In the last method calculate_top_tag I don't understand how to choose the most frequent value from the 'Tag' column and put it as the third column for top_tags before returning it.
I am new to python and my knowledge of syntax and data structures is limited. I did try the various solutions mentioned for finding the most frequent value from the list but they seem to display the entire column and not one particular value. I know this is some trivial syntax issue but after having searched for long I still cannot figure out how to get this one.
edit 1 :
I need to find the most common tag for a particular artist and not the most common overall.
But again, I don't know how to.
edit 2 :
here is the link to the data files:
https://github.com/amplab/datascience-sp14/raw/master/hw2/hw2data.tar.gz
I'm sure there is a more succinct way of doing it, but this should get you started:
# sum up the counts per (ArtistID, Tag) pair
tag_counts = artists_tags.groupby(['ArtistID', 'Tag'])['Count'].sum().reset_index()
# sort in descending order of count
tag_counts = tag_counts.sort_values('Count', ascending=False)
# keep only the top ranking tag per artist
top_tags = tag_counts.groupby('ArtistID').first()
# top_tags is now a dataframe which contains the top tag for every artist
# We can simply look up the top tag for Nirvana via its index:
top_tags.loc['5b11f4ce-a62d-471e-81fc-a69a8278c7da', 'Tag']
# 'Grunge'
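To return the (artist id, artist name, top tag) structure the question asks for, the same idea can be wrapped in calculate_top_tag. This is a sketch on a small invented sample, not the real artists-tags.txt, and it uses idxmax instead of sort-then-first to pick the highest-count row per artist:

```python
import pandas as pd

# Hypothetical sample with the same columns as artists-tags.txt
artists_tags = pd.DataFrame({
    "ArtistID": ["a1", "a1", "a1", "a2", "a2"],
    "ArtistName": ["Nirvana", "Nirvana", "Nirvana", "Muse", "Muse"],
    "Tag": ["grunge", "rock", "grunge", "rock", "alternative"],
    "Count": [10, 7, 5, 3, 9],
})


def calculate_top_tag(all_tags):
    # Total count per (artist, tag); duplicate tag rows are summed
    totals = all_tags.groupby(
        ["ArtistID", "ArtistName", "Tag"], as_index=False
    )["Count"].sum()
    # Row label of the highest total within each artist
    best = totals.groupby("ArtistID")["Count"].idxmax()
    return totals.loc[best].set_index("ArtistID")[["ArtistName", "Tag"]]


top_tags = calculate_top_tag(artists_tags)
print(top_tags)
print("Top tag for Nirvana is {0}".format(top_tags.loc["a1", "Tag"]))
```

With the real data, the lookup key would be Nirvana's artist id from the question rather than the placeholder "a1".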
