Pandas - Iterate through lists / dictionaries for calculations - python

I am new to coding & I am looking for a pythonic way to implement the following code. Here is a sample dataframe with code:
import numpy as np
import pandas as pd

np.random.seed(1111)
df2 = pd.DataFrame({
    'Product': np.random.choice(['Prod 1', 'Prod 2', 'Prod 3', 'Prod 4', 'Prod 5',
                                 'Prod 6', 'Box 1', 'Box 2', 'Box 3'], 10000),
    'Transaction_Type': np.random.choice(['Produced', 'Transferred', 'Scrapped', 'Sold'], 10000),
    'Quantity': np.random.randint(1, 100, size=(10000)),
    'Date': np.random.choice(pd.date_range('1/1/2017', '12/31/2018', freq='D'), 10000)})
idx = pd.IndexSlice
In the data set, each 'Box' ('Box 1', 'Box 2', etc.) is a raw material that corresponds to multiple products. For example, 'Box 1' is used for 'Prod 1' & 'Prod 2', 'Box 2' is used for 'Prod 3' & 'Prod 4', & 'Box 3' is used for 'Prod 5' & 'Prod 6'.
The data set I'm working with is much larger, but I have these groupings stored as lists, for example 'Box 1' = ['Prod 1', 'Prod 2', 'Prod 3']. If need be, I could store them as a dictionary mapping each box to a tuple of products, like Box1 = {'Box 1': ('Prod 1', 'Prod 2')} - whatever is best.
For each grouping, I'm looking to calculate the total number of boxes used, which is the sum of 'Produced' + 'Scrapped' inventory. To get this value, I'm currently doing a manual filter on a groupby of each product. You can see I'm manually writing a list of the products in the second assign statement.
For example, to calculate how much of 'Box 1' to relieve from inventory, each month, you would sum the values of 'Box 1' that was produced & scrapped. Then, you would calculate the values of 'Prod 1' through 'Prod 3' (since they use 'Box 1') that were produced & scrapped & add them all together to get a total 'Box 1' used & scrapped for each time frame. Here's an example of what I'm currently doing:
box1 = ['Box 1', 'Prod 1', 'Prod 2']
df2[df2['Transaction_Type'].isin(['Produced', 'Scrapped'])]\
    .groupby([pd.Grouper(key='Date', freq='A'), 'Product', 'Transaction_Type'])\
    .agg({'Quantity': 'sum'})\
    .unstack()\
    .loc[idx[:, box1], idx[:]]\
    .assign(Box_1=lambda x: 'Box 1')\
    .assign(List_of_Products=lambda x: 'Box 1, Prod 1, Prod 2')\
    .reset_index()\
    .set_index(['Box_1', 'List_of_Products', 'Date', 'Product'])\
    .groupby(level=[0, 1, 2]).sum()
I'd then have to do the same clunky manual exercise for 'Box 2', etc.
Is there a more pythonic way? I would like to complete this analysis each month going forward. The actual data is much more complex, with roughly 20 different 'Boxes' that each have a varying number of associated products. I'm not sure if I should be looking to create a function or use a dictionary vs. lists, but I would appreciate any help along the way. As a last request, I'd love to have the flexibility to write each of these 'Box' groupings to a different Excel worksheet.
Thanks in advance!

Not sure how you want the result at the end, but since each Prod uses only one Box, you can replace each Prod with its Box and then do the groupby as you already do. Let's suppose you have a dictionary such as:
box_dict = {'Box 1': ('Prod 1', 'Prod 2'),
            'Box 2': ('Prod 3', 'Prod 4'),
            'Box 3': ('Prod 5', 'Prod 6')}
then you want to reverse it to get the prod as the key and the box as the value:
dict_prod = {prod: box for box, l_prod in box_dict.items() for prod in l_prod}
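For the example dictionary above, the inversion produces (printed here just to make the mapping concrete):
print(dict_prod)
# {'Prod 1': 'Box 1', 'Prod 2': 'Box 1', 'Prod 3': 'Box 2',
#  'Prod 4': 'Box 2', 'Prod 5': 'Box 3', 'Prod 6': 'Box 3'}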
Now you can use replace:
print(df2[df2['Transaction_Type'].isin(['Produced', 'Scrapped'])]
      .replace({'Product': dict_prod})  # map each Prod to the Box it uses
      .groupby([pd.Grouper(key='Date', freq='A'), 'Product', 'Transaction_Type'])['Quantity']
      .sum().unstack())
                   Quantity
Transaction_Type   Produced  Scrapped
Date       Product
2017-12-31 Box 1      20450     19152
           Box 2      20848     21145
           Box 3      22475     21518
2018-12-31 Box 1      19404     16964
           Box 2      21655     20753
           Box 3      21343     21576

I think I would first filter my source dataframe down to just what I need to query, then do the grouping and aggregations:
df2.query('Transaction_Type in ["Produced","Scrapped"] and Product in ["Box 1","Prod 1","Prod 2"]')\
   .groupby([pd.Grouper(key='Date', freq='A'), 'Product', 'Transaction_Type'])['Quantity'].sum()\
   .unstack().reset_index(level=1).groupby(level=0)\
   .agg({'Product': lambda x: ', '.join(x), 'Produced': 'sum', 'Scrapped': 'sum'})
Output:
                          Product  Produced  Scrapped
Date
2017-12-31  Box 1, Prod 1, Prod 2     20450     19152
2018-12-31  Box 1, Prod 1, Prod 2     19404     16964

I do not understand why such a long expression is needed. If I am not totally wrong, it seems you only care about the total number of rows satisfying the condition.
d = {'Box 1': ('Box 1', 'Prod 1', 'Prod 2')}
d_type = {'Box 1': ('Produced', 'Scrapped')}
selected = df2[df2['Product'].isin(d['Box 1']) & df2['Transaction_Type'].isin(d_type['Box 1'])]
print(len(selected))
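If the goal is actually the total quantity rather than the row count (my reading of the original question), the same selection also supports, for example:
print(selected['Quantity'].sum())  # total quantity produced or scrapped for Box 1 and its products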
For your Excel exporting needs, something like the code below would work.
writer = pd.ExcelWriter("test.xlsx")
selected.to_excel(writer, 'Sheet1')
writer.save()
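Building on that, to write each box's results to its own worksheet, as requested at the end of the question, here is a minimal sketch. It assumes the box_dict from the first answer, a newer pandas where ExcelWriter is used as a context manager, and an installed Excel engine such as openpyxl; the file and sheet names are illustrative:
# Sketch only: box_dict maps each box to the products that use it.
with pd.ExcelWriter('box_usage.xlsx') as writer:
    for box, prods in box_dict.items():
        subset = df2[df2['Product'].isin((box,) + tuple(prods)) &
                     df2['Transaction_Type'].isin(['Produced', 'Scrapped'])]
        result = (subset.groupby([pd.Grouper(key='Date', freq='A'), 'Transaction_Type'])['Quantity']
                        .sum().unstack())
        result.to_excel(writer, sheet_name=box)  # one worksheet per box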

Related

How to calculate statistics from dataframe and insert them into new dataframe, matching by name?

I have a dataframe (df1) of every single NBA shot taken with columns of time, shot location, player, etc. (there are duplicates of player names because every shot taken is its own row) and I want to create a new dataframe with calculated figures from the original dataframe. The new dataframe will have one row per player with various statistics like "Total Shots" and "Make %".
I have created the new dataframe with just the names of players:
df_names = df1[['Player Name']].drop_duplicates()
And now I would like to know how to go through df1, count the shots taken per player, and insert that into my new df_names as a new column.
Welcome to StackOverflow!
The feature of pandas that I'd recommend diving into is .groupby() (pandas documentation for .groupby()). Rather than doing these in two different operations, you can do them as one:
data = {'Player Name': ['Player 1', 'Player 1', 'Player 1', 'Player 2', 'Player 2', 'Player 2'],
        'Shot Time': ['1/1/2022 10:00:0000', '1/1/2022 10:01:0000', '1/1/2022 10:02:0000',
                      '1/1/2022 10:03:0000', '1/1/2022 10:04:0000', '1/1/2022 10:05:0000'],
        'Shot Made': [True, True, False, False, True, False]}
df = pd.DataFrame(data)
df_agg = df.groupby(['Player Name'],as_index=False)['Shot Time'].count()
You can group by Player Name and do various aggregations over your columns, such as counts of shot times for number of shots (as above), mode of position, average time between shots, etc.
Eventually, you may need to join multiple of these dataframes together. If so, you would use the .join() or .merge() function. Here's the pandas documentation about various ways to smoosh data together: Pandas Join, Merge, Concatenate and Compare
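For example, here is a minimal sketch of attaching a per-player shot count back onto the df_names frame from the question (the 'Shot Time' column name is assumed from the example above; substitute whatever column df1 actually has):
# Hypothetical: count rows (shots) per player in df1, then merge onto df_names.
shot_counts = (df1.groupby('Player Name', as_index=False)['Shot Time']
                  .count()
                  .rename(columns={'Shot Time': 'Total Shots'}))
df_names = df_names.merge(shot_counts, on='Player Name', how='left')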
You should accept K. Thorspear's solution, as I am just adding to it.
You can aggregate on a groupby of the player name. Count the 'Shot Time' column (which will simply count how many times a player took a shot), then sum the 'Shot Made' column (since True = 1 and False = 0, the sum gives the total made).
Then just divide those out to get a shot percent:
import pandas as pd
data = {'Player Name': ['Player 1', 'Player 1', 'Player 1', 'Player 2', 'Player 2', 'Player 2'],
        'Shot Time': ['1/1/2022 10:00:0000', '1/1/2022 10:01:0000', '1/1/2022 10:02:0000',
                      '1/1/2022 10:03:0000', '1/1/2022 10:04:0000', '1/1/2022 10:05:0000'],
        'Shot Made': [True, True, False, False, True, False]}
df = pd.DataFrame(data)
df = df.groupby(['Player Name']).agg({'Shot Time':'count', 'Shot Made':'sum'}).rename(columns={'Shot Time':'Shot Attempt'})
df['Shot %'] = df['Shot Made'] / df['Shot Attempt']
Output:
print(df)
             Shot Attempt  Shot Made    Shot %
Player Name
Player 1                3          2  0.666667
Player 2                3          1  0.333333
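As an aside (my addition, not part of the answer above): because 'Shot Made' is boolean, taking its mean on the original, pre-aggregation df gives the make percentage directly, skipping the division step:
df.groupby('Player Name')['Shot Made'].mean()  # fraction of True values per player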

Convert categorical rows to column header with string values (no aggregation)

I would like to transform my categorical rows to columns, without any aggregation. I have tried it with pivot, but I get NaNs.
This is the data frame:
data = pd.DataFrame({'Art': ['blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red'],
                     'Description': ['Some text 1', 'Some text 2', 'Some text 3', 'Some text 4',
                                     'Some text 5', 'Some text 6', 'Some text 7', 'Some text 8']})
When I try to pivot:
data.pivot(columns='Art')
I get a frame where each 'Art' value becomes a column header but every other entry is NaN, since pivot keeps each value on its original row. I have solved the NaN problem like this:
data.pivot(columns='Art').apply(lambda x: pd.Series(x.dropna().values))
This is the desired outcome: one column per 'Art' class ('blue' and 'red'), with the corresponding descriptions stacked under each header and no NaNs.
However, I would like to know if there is a smarter way to simply get my classes as column headers.
Thank you!
Your solution already seems pretty clear and concise; with one line you have the result:
data.pivot(columns='Art').apply(lambda x: pd.Series(x.dropna().values))
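If you would rather avoid the apply, one alternative sketch (my addition, not from the answer) is to build an explicit row counter per class with cumcount and pivot on that:
data.assign(idx=data.groupby('Art').cumcount())\
    .pivot(index='idx', columns='Art', values='Description')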

How to filter a pandas column by list of strings?

The standard code for filtering through pandas would be something like:
output = df['Column'].str.contains('string')
strings = ['string 1', 'string 2', 'string 3']
Instead of 'string' though, I want to filter such that it goes through a collection of strings in list, "strings". So I tried something such as
output = df['Column'].str.contains('*strings')
This is the closest solution I could find, but it did not work:
How to filter pandas DataFrame with a list of strings
Edit: I should note that I'm aware of the | (or) operator. However, I'm wondering how to tackle all cases where the list of strings changes, since the end goal is to loop through varying lists of different lengths.
You can create a regex string and search using this string.
Like this:
df['Column'].str.contains('|'.join(strings),regex=True)
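One caveat worth adding (my note, not the answerer's): if the strings are meant literally and could contain regex metacharacters such as '.' or '+', escape them before joining:
import re
pattern = '|'.join(re.escape(s) for s in strings)
df['Column'].str.contains(pattern, regex=True)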
You probably should look into using the isin() function (pandas.Series.isin). Note that isin matches whole cell values exactly, whereas str.contains matches substrings, so choose whichever fits your data. Check the code below:
df = pd.DataFrame({'Column':['string 1', 'string 1', 'string 2', 'string 2', 'string 3', 'string 4', 'string 5']})
strings = ['string 1', 'string 2', 'string 3']
output = df.Column.isin(strings)
df[output]
output:
Column
0 string 1
1 string 1
2 string 2
3 string 2
4 string 3

Access dynamically created data frames

Hello Python community,
I have a problem with my code creation.
I wrote code that dynamically creates dataframes in a for loop. The problem is that I don't know how to access them. Here is part of the code:
list = ['Group 1', 'Group 2', 'Group 3']
for i in list:
    exec('df{} = pd.DataFrame()'.format(i))
for i in list:
    print(df+i)
The dataframes are created, but I cannot access them.
Could someone help me please?
Thank you in advance
I'm not sure exactly how your data is stored/accessed but you could create a dictionary to pair your list items with each dataframe as follows:
list_ = ['Group 1', 'Group 2', 'Group 3']
dataframe_dict = {}
for i in list_:
    data = np.random.rand(3, 3)  # create the data for each dataframe here
    dataframe_dict[i] = pd.DataFrame(data, columns=["your_column_one", "two", "etc"])
You can then retrieve each dataframe by using its associated group name as the key of the dictionary, as follows:
for key in dataframe_dict.keys():
    print(key)
    print(dataframe_dict[key])
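As a side note (my addition): the same dictionary can be built in one line with a comprehension, avoiding exec entirely:
dataframe_dict = {name: pd.DataFrame() for name in list_}
dataframe_dict['Group 1']  # access any single dataframe by its group name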

Web2Py - using starred expression for rendering HTML table

This question is an extension of: Web2Py - rendering AJAX response as HTML table
Basically, I come up with a dynamic list of response rows that I need to display on the UI as an HTML table.
Essentially, the code looks like this:
response_results = []
row_one = ['1', 'Col 11', 'Col 12', 'Col 13']
response_results.append(row_one)
row_two = ['2', 'Col 21', 'Col 22', 'Col 23']
response_results.append(row_two)
html = DIV(TABLE(THEAD(TR(TH('Row #'), TH('Col 1'), TH('Col 2'), TH('Col 3')), _id=0),
                 TR([*response for response in response_results]),
                 _id='records_table', _class='table table-bordered'),
           _class='table-responsive')
return html
When I use this kind of code: TR([request.vars[input] for input in inputs]) or TR(*the_list), it works fine.
However, I have come up with a need to use a hybrid of these two, i.e. TR([*response for response in response_results]). But it fails, giving an error message:
"Python version 2.7 does not support this syntax. Starred expressions are not allowed as assignment targets in Python 2."
When I run this code instead i.e. without a '*': TR([response for response in response_results]) it runs fine but puts all the columns of my row together in the first column of the generated HTML table, leaving all other columns blank.
Can someone kindly help me resolve this issue and guide on how can I achieve the required result of displaying each column of the rows at their proper spots in the generated HTML table?
You need to generate a TR for each item in response_results, which means you need a list of TR elements with which you can then use Python argument expansion (i.e., the * syntax) to treat each TR as a positional argument to TABLE.
html = DIV(TABLE(THEAD(TR(TH('Row #'), TH('Col 1'), TH('Col 2'), TH('Col 3')), _id=0),
                 *[TR(response) for response in response_results],
                 _id='records_table', _class='table table-bordered'),
           _class='table-responsive')
Note, because each response is itself a list, you could also use argument expansion within the TR:
*[TR(*response) for response in response_results]
But that is not necessary, as TR optionally takes a list, converting each item in the list into a table cell.
Another option is to make response_results a list of TR elements, starting with the THEAD element, and then just pass that list to TABLE:
response_results = [THEAD(TR(TH('Row #'), TH('Col 1'), TH('Col 2'), TH('Col 3')), _id=0)]
row_one = ['1', 'Col 11', 'Col 12', 'Col 13']
response_results.append(TR(row_one))
row_two = ['2', 'Col 21', 'Col 22', 'Col 23']
response_results.append(TR(row_two))
html = DIV(TABLE(response_results, _id='records_table', _class='table table-bordered'),
           _class='table-responsive')
Again, you could do TABLE(*response_results, ...), but the * is not necessary, as TABLE can take a list of row elements.
