pandas groupby multiple columns with python and streamlit


I have a groupby call in which I want to group by multiple columns so that I can plot a chart later.
The dataframe's columns are dynamic: the user selects them from selectbox and multiselect widgets.
The problem is that right now I can only take the first or the last item from the multiselect widget, like so:
some_columns_df = df.loc[:,['gender','country','city','hoby','company','status']]
some_collumns = some_columns_df.columns.tolist()
select_box_var= st.selectbox("Choose X Column",some_collumns)
multiselect_var= st.multiselect("Select Columns To GroupBy",some_collumns)
test_g3 = df.groupby([select_box_var,multiselect_var[0]]).size().reset_index(name='count')
If the user selects more than one item from the multiselect, say four items, I would like it to behave like this (pseudocode):
test_g3 = df.groupby([select_box_var,multiselect_var[0,1,2,3]]).size().reset_index(name='count')
Is this possible?

multiselect_var is a list, while select_box_var is a single value. Put select_box_var inside a list and concatenate the two lists.
Try this:
test_g3 = df.groupby([select_box_var] + multiselect_var).size().reset_index(name='count')

According to the Streamlit docs for multiselect, the API always returns a list. Your selectbox returns a string, because its options are a list of strings.
So your code can be modified to:
df.groupby([select_box_var] + multiselect_var).size().reset_index(name='count')
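The list concatenation can be tried without running Streamlit at all. The following is a minimal sketch, assuming made-up data and plain variables standing in for the widget return values; the duplicate-column guard and the name group_cols are illustrative additions, not part of the original answers.

```python
import pandas as pd

# Toy data standing in for the real dataframe (values are made up).
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "country": ["US", "US", "FR", "US"],
    "city": ["NYC", "NYC", "Paris", "LA"],
})

# Stand-ins for the widget return values: st.selectbox gives a single
# string here, st.multiselect gives a list of strings.
select_box_var = "gender"
multiselect_var = ["country", "city"]

# Build the grouping keys. The filter is a hypothetical guard that skips
# the selectbox column if the user also picked it in the multiselect.
group_cols = [select_box_var] + [c for c in multiselect_var if c != select_box_var]

# One row per distinct combination of the selected columns, with its size.
test_g3 = df.groupby(group_cols).size().reset_index(name="count")
print(test_g3)
```

If the multiselect is empty, group_cols contains only the selectbox column, so the same line still works for a single-column grouping.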

Related

How to divide a pandas dataframe into different dataframes based on unique values from one column and iterate over them?

I have a dataframe with three columns.
The first column has 3 unique values. I used the code below to create one dataframe per unique value, but I am unable to iterate over those dataframes and am not sure how to use them in a loop.
df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:,0].unique()) ### lets assume Unique values are 0,1,2
mtlist = []
for index, value in enumerate(unique_groups):
    globals()['df%s' % index] = df[df.iloc[:,0] == value]
    mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
For example, let's say I want to find the length of the first unique dataframe.
If I manually type the name of the dataframe, I get the correct output:
len(df0)
O/P
35
But I am trying to automate the code, so I want to find the length of, and iterate over, each dataframe just as I would by typing its name.
What I'm looking for is this: if I run the code below
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me on how to do this?
I have also tried to create a dictionary using the code below, but I can't figure out how to iterate over the dictionary when the dataframe has more than two columns, where the key would be the unique group and the value contains the other two columns of the same row.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group Description Document
Group A Text to be updated on the ticket 1 doc1.pdf
Group B Text to be updated on the ticket 2 doc2.pdf
Group A Text to be updated on the ticket 3 doc3.pdf
Group B Text to be updated on the ticket 4 doc4.pdf
Group A Text to be updated on the ticket 5 doc5.pdf
Group B Text to be updated on the ticket 6 doc6.pdf
Group C Text to be updated on the ticket 7 doc7.pdf
Group C Text to be updated on the ticket 8 doc8.pdf
Let's assume there are 100 rows of data.
I'm trying to automate ServiceNow ticket creation with the above data.
My end goal is that Group A tickets go to one group. A unique task has to be created for each description, but we can bundle 10 tasks at a time and submit them as one request. If I split the dataframe into separate dataframes based on Assignment_group, it would be easier to iterate over (that's the only idea I could think of).
For example, let's say we have REQUEST001.
Within that request there will be multiple sub-tasks such as STASK001, STASK002 ... STASK010.
Hope this helps.

Your problem is easily solved by groupby, one of the most useful tools in pandas:
length_of_unique_groups = df.groupby('Assignment Group').size()
You can apply all kinds of operations (sum, count, std, etc.) to the remaining columns, such as getting the mean price for each group if price were a column.
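Iterating over a groupby object also yields each group's name together with its sub-dataframe, which avoids globals() and string-built variable names. A rough sketch under stated assumptions: the rows are made up in the shape of the sample data, and BATCH_SIZE, batches and group_df are hypothetical names chosen for illustration.

```python
import pandas as pd

# Made-up rows in the shape of the sample data from the question.
df = pd.DataFrame({
    "Assignment_group": ["Group A", "Group B"] * 12 + ["Group C"],
    "Description": [f"Text to be updated on the ticket {i}" for i in range(1, 26)],
    "Document": [f"doc{i}.pdf" for i in range(1, 26)],
})

BATCH_SIZE = 10  # at most 10 tasks per request, as described in the question

# Collect (group name, sub-dataframe) pairs, one pair per request.
batches = []
for name, group_df in df.groupby("Assignment_group"):
    # Slice each group's rows into consecutive chunks of BATCH_SIZE.
    for start in range(0, len(group_df), BATCH_SIZE):
        batches.append((name, group_df.iloc[start:start + BATCH_SIZE]))

for name, batch in batches:
    print(name, len(batch))
```

Each batch is an ordinary dataframe, so len(batch) and batch.iterrows() work on it directly.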

I think you want to try something like len(eval('df%s' % 0))

How to insert multiple values into specific treeview columns?

I have a database returning the totals of several columns, and I am trying to display them in a treeview. If I do
for i in backend2.calc_total()[0]:
    treeviewtotal.insert("", END, values=i)
I get a result that is not what I want: I want everything to start from the "Food" column onwards. I can't use the date as an iid, because I already have an iid that references my database.
If I do
list2 = ['Date', 'Food', 'Transport', 'Insurance', 'Installments', 'Others']
for i in range(len(backend2.calc_total()[0][0])):
    treeviewtotal.insert("", 0, list2[i+1], values=backend2.calc_total()[0][0][i])
then instead all the totals get stacked into one column (which is scrollable).
Is there any way to place each total in its respective column, all in the same row? Thanks!

With reference to the first attempt, the following solves the problem:
for i in backend2.calc_total()[0]:
    treeviewtotal.insert("", END, values=([], *i))
values= takes a sequence. We therefore add an empty first entry with [], and since i is itself already a sequence, we need to unpack it into the surrounding tuple with *i.
Please correct me if I used any part of the code wrongly. Still trying to learn =)
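The unpacking step can be checked on its own, without a Tk window. A small sketch, assuming a made-up totals_row in place of one row from backend2.calc_total()[0]; it uses an empty string as the placeholder for the first column, which is a swapped-in alternative to the [] used above.

```python
# Hypothetical totals row standing in for one row of backend2.calc_total()[0]:
# Food, Transport, Insurance, Installments, Others.
totals_row = (120.5, 40.0, 75.0, 200.0, 15.25)

# Prepend an empty string so the first (Date) column stays blank and the
# totals line up from the second column onwards. The * unpacks the row
# into the new tuple instead of nesting it.
values = ("", *totals_row)
print(values)

# The tuple would then be passed along as:
# treeviewtotal.insert("", END, values=values)
```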

I have a list where I want each element of the list to be a single row

I have a list of lists, and I want to assign each list to a specific column. I have created the columns of the DataFrame, but in each column the elements come through as a single list. I want each element of that list to be a separate row of that particular column.
Here's what I did:
df = pd.DataFrame([np.array(dataset).T],columns=list1)
print(df)
Attached screenshot for the output.
I want each element of that list to be a row, as my output.

This should do the work for you:
import pandas as pd
Fasteners = ['Screws & Bolts', 'Threaded Rods & Studs', 'Eyebolts', 'U-Bolts']
Adhesives_and_Tape = ['Adhesives','Tape','Hook & Loop']
Weld_Braz_Sold = ['Electrodes & Wire','Gas Regulators','Welding Gloves','Welding Helmets & Glasses','Protective Screens']
df = pd.DataFrame({'Fastener': pd.Series(Fasteners), 'Adhesives_and_Tape': pd.Series(Adhesives_and_Tape), 'Weld_Braz_Sold': pd.Series(Weld_Braz_Sold)})
print(df)
Please provide the structure of the database you are starting from, or the structure of the respective lists. I can then give you a more focused answer to your specific problem.
If the structure gets larger, you can also iterate through all the lists when generating the data frame. This is just the basic process to solve your question.
Feel free to comment for further help.
EDIT
If you want to loop through a database of lists, additionally use the following code:
for i in range(len(list1)): df.iloc[:,i] = pd.Series(dataset[i])
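The same idea can be written as a single constructor call once the column names and the lists are available together. A minimal sketch, assuming list1 holds the column names and dataset holds one inner list per column; the values shown are placeholders borrowed from the example above.

```python
import pandas as pd

# Hypothetical stand-ins for `list1` (column names) and `dataset`
# (one inner list per column; the lengths may differ).
list1 = ["Fastener", "Adhesives_and_Tape"]
dataset = [
    ["Screws & Bolts", "Eyebolts", "U-Bolts"],
    ["Adhesives", "Tape"],
]

# Wrapping each list in a Series lets pandas align on the row index and
# pad the shorter columns with NaN instead of raising a length error.
df = pd.DataFrame({name: pd.Series(values) for name, values in zip(list1, dataset)})
print(df)
```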

Creating a column variable taking the mean of a variable conditional on two other variables

I have a data frame that shows the mean 'dwdime' for each of the given conditions:
DIMExCand_means = DIMExCand.groupby(['cycle', 'coded_state', 'party.orig', 'comtype']).mean()
I have created a pivot table from DIMExCand_means with the following command and output:
DIMExCand_master = pd.pivot_table(DIMExCand_means,index=["Cycle","State"])
However, some data gets lost in the process. I would like to add columns to the 'DIMExCand_master' dataframe that contain the mean 'dwdime' score for each possible combination of 'party.orig' and 'comtype', as this will allow me to have one entry per 'cycle'-'coded_state' pair.

Let's try:
DIMExCand_means = DIMExCand_means.reset_index()
DIMExCand_master = DIMExCand_master.reset_index()
pd.merge(DIMExCand_means, DIMExCand_master, left_on=['cycle','coded_state'], right_on=['Cycle','State'])

Thanks!
I ended up going with:
DIMExCand_dime = pd.pivot_table(DIMExCand, values='dwdime', index=["Cycle","State"], columns='ID', aggfunc=np.mean)
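For the layout asked for in the question (one row per cycle and state, one column per party and committee-type combination), pivot_table also accepts a list for columns. A sketch under stated assumptions: the data is made up, the column names are the ones from the groupby call in the question, and the flattened names such as "D_H" are an illustrative choice.

```python
import pandas as pd

# Made-up data using the column names from the question's groupby call.
DIMExCand = pd.DataFrame({
    "cycle": [2000, 2000, 2000, 2002],
    "coded_state": ["CA", "CA", "CA", "NY"],
    "party.orig": ["D", "D", "R", "D"],
    "comtype": ["H", "H", "S", "H"],
    "dwdime": [-0.5, -0.3, 0.8, -0.1],
})

# Mean dwdime per (cycle, state) row and (party, comtype) column.
wide = pd.pivot_table(
    DIMExCand,
    values="dwdime",
    index=["cycle", "coded_state"],
    columns=["party.orig", "comtype"],
    aggfunc="mean",
)

# Flatten the two-level column index into plain names like "D_H".
wide.columns = ["_".join(col) for col in wide.columns]
wide = wide.reset_index()
print(wide)
```

Combinations that never occur for a given cycle and state show up as NaN in that row.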

Python pandas - trying to turn a dict of df's into a panel (or loop the df items into a panel)

I have stock data in a dataframe with column headings like AAPL, AAPL_ma, MSFT, MSFT_ma -- and would like to somehow get the data into a panel with items = stock symbols (so AAPL item would include AAPL and AAPL_ma).
I am new to pandas and am struggling to come up with a coherent plan. I can't figure out if I should be: (1) working through MultiIndex functionality, (2) looping through lists to write data into new df's named as stock symbols, or (3) splitting the existing dataframe by symbol (eg, 'AAPL' in 'AAPL_ma').
Any direction would be MUCH appreciated. Thanks in advance!
UPDATE:
On EdChum's advice, I am using the following to create a dict of the column headings for my df in string form. Unsure if this is what you meant - work in progress.
y = [df['Date']]
dict_stocks = {}
# create dict for multiindexing
for stock in list_stocks:
    i=0
    x=[df[stock]]
    for heading in list_headings:
        data_series = df[stock + list_headings[i]]
        i = i + 1
        x.append(data_series)
    dict_stocks[stock] = y + x
The above produces a dict of df's, though the axes are not what I expected. However, I am having no luck with either of:
my_panel = pd.Panel(df)
my_panel = pd.Panel.from_dict(dict_stocks)
which generate errors:
--PandasError: Panel constructor not properly called!
--AttributeError: 'list' object has no attribute 'shape'

Your easiest way to the promised land would be to create a multi-index dictionary whose keys are tuples like (aapl, aapl) and (aapl, aapl_ma), and then call pandas.DataFrame() on the dictionary. http://pandas.pydata.org/pandas-docs/dev/advanced.html
If you want a panel, I would recommend going with EdChum's answer of creating a dict of dataframes keyed by symbol; you can then use that dict to create a panel with pandas.Panel().
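pd.Panel is not available in recent pandas releases, so the MultiIndex route is the one that still runs there. A rough sketch of that route, assuming made-up prices and the flat column layout from the question; the "price" label and the name panel_like are hypothetical.

```python
import pandas as pd

# Made-up prices with the flat column layout from the question.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-02", "2020-01-03"]),
    "AAPL": [75.0, 74.3],
    "AAPL_ma": [74.0, 74.2],
    "MSFT": [160.6, 158.6],
    "MSFT_ma": [159.0, 159.5],
}).set_index("Date")

list_stocks = ["AAPL", "MSFT"]
list_headings = ["_ma"]  # column suffixes, as in the question's loop

# One small dataframe per symbol, with the symbol prefix stripped from
# the column names ("price" is a hypothetical label for the bare symbol).
dict_stocks = {}
for stock in list_stocks:
    cols = [stock] + [stock + h for h in list_headings]
    dict_stocks[stock] = df[cols].rename(
        columns=lambda c, s=stock: c[len(s):].lstrip("_") or "price"
    )

# Concatenating a dict along axis=1 uses its keys as the outer column level.
panel_like = pd.concat(dict_stocks, axis=1)
print(panel_like["AAPL"])
```

Selecting panel_like["AAPL"] returns the per-symbol dataframe, which covers the "items = stock symbols" access pattern described in the question.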
