Access dynamically created data frames - python

Hello Python community,
I have a problem with my code.
I wrote code that dynamically creates dataframes in a for loop. The problem is that I don't know how to access them.
Here is part of the code:
list = ['Group 1', 'Group 2', 'Group 3']
for i in list:
    exec('df{} = pd.DataFrame()'.format(i))
for i in list:
    print(df + i)
The dataframes are created, but I cannot access them.
Could someone help me, please?
Thank you in advance

I'm not sure exactly how your data is stored and accessed, but you could create a dictionary to pair your list items with each dataframe, as follows:
import numpy as np
import pandas as pd

list_ = ['Group 1', 'Group 2', 'Group 3']
dataframe_dict = {}
for i in list_:
    data = np.random.rand(3, 3)  # create the data for each dataframe here
    dataframe_dict[i] = pd.DataFrame(data, columns=["your_column_one", "two", "etc"])
You can then retrieve each dataframe by using its associated group name as the dictionary key:
for key in dataframe_dict.keys():
    print(key)
    print(dataframe_dict[key])
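As an aside, if the dataframes start out empty as in the question, the same dictionary can be built in one line with a dict comprehension:
dataframe_dict = {name: pd.DataFrame() for name in list_}  # one empty dataframe per group name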

Related

for loop dataframe last row of a group

I'm struggling with a for loop over a dataframe.
I want a function that loops through a dataframe of object names and their properties.
Suppose the dataframe looks like this:
data = [['object 1', 'property 1'], ['object 1', 'property 11'],
        ['object 2', 'property 2'], ['object 2', 'property 22'],
        ['object 3', 'property 3'], ['object 3', 'property 33']]
I want to generate a string where the last row of each object doesn't end with a comma and all the other rows do.
def addProperties(objects):
    obj = objects
    for index, row in obj.iterrows():
        if row['label'] != ...:  # how do I check for the last element?
            string = row['label'] + row['attribuutLabel'] + ','
        else:
            string = row['label'] + row['attribuutLabel']
    return string
Output should be something like this:
string = 'object 1 property 1, property 11 object 2 property 2, property 22 object 3 property 3, property 33'
I'm quite new to Python, so I don't know what the best way is to achieve this.
Can someone help out?
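One way to avoid detecting the last element by hand is to let groupby and join place the commas. A minimal sketch, assuming the columns are named 'label' and 'attribuutLabel' as in the function above:
import pandas as pd

data = [['object 1', 'property 1'], ['object 1', 'property 11'],
        ['object 2', 'property 2'], ['object 2', 'property 22'],
        ['object 3', 'property 3'], ['object 3', 'property 33']]
df = pd.DataFrame(data, columns=['label', 'attribuutLabel'])

# Join each object's properties with ", ", then join the per-object strings with spaces
parts = df.groupby('label', sort=False)['attribuutLabel'].apply(', '.join)
string = ' '.join('{} {}'.format(label, props) for label, props in parts.items())
print(string)
# object 1 property 1, property 11 object 2 property 2, property 22 object 3 property 3, property 33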

Pandas iterrows not working within function

I'm creating a small-scale scheduling script and having some issues with iterrows. These are very small dataframes, so time is minimal (6 rows and maybe 7 or 8 columns), although I'm guessing these loops are not the most efficient - I am pretty new to this!
Here's what I have already:
data = {'Staff 1': ['9-5', '9-5', '9-5', '9-5', '9-5'],
        'Staff 2': ['9-5', '9-5', '9-5', '9-5', '9-5'],
        'Staff 3': ['9-5', '9-5', '9-5', '9-5', '9-5']}
dataframe_1 = pd.DataFrame.from_dict(data, orient='index',
                                     columns=['9/2/19', '9/3/19', '9/4/19', '9/5/19', '9/6/19'])
data2 = {'Name': ['Staff 1', 'Staff 2', 'Staff 3'], 'Site': ['2', '2', '2'], 'OT': ['yes', 'yes', 'no'],
         'Days off': ['', '9/4/19', '9/4/19'], '': ['', '', '9/5/19']}
dataframe_2 = pd.DataFrame.from_dict(data2)
def annual_leave(staff, df):
    df = df.reset_index(drop=True)
    for index, row in df.iterrows():
        days_off = []
        if df.loc[index, 'Name'] == '{}'.format(staff):
            for cell in row:
                days_off.append(cell)
            del days_off[0:3]
        else:
            pass
        return days_off

for index, row in dataframe_1.iterrows():
    print(annual_leave(index, dataframe_2))
I added a few 'print(index)' statements in places to see if I could work out where it was going wrong.
I found that the bottom iterrows loop runs through each row. However, the iterrows loop in the function only looks at the first row, and I don't understand why.
I am trying to go through each staff name (the index) in dataframe_1 and check that staff name against the 'Name' column in dataframe_2. I then want to get rid of the first 3 columns of that particular row in dataframe_2 (hence the list and del days_off[0:3]).
However, in this example the bottom iterrows loop (outside of the function) runs for 'Staff 1', 'Staff 2', and 'Staff 3', but the iterrows loop inside the function only checks against the 'Staff 1' name.
This means it only works for 'Staff 1': when the function is called for 'Staff 2', it only checks for 'Staff 2' in the first row of dataframe_2 - and doesn't find it, because it's in the second row.
Does this make any sense?
Any help is greatly appreciated.
Have you tried calling the function? Executing the code you have given will just define the function; you need to call it with the proper arguments to see the output. I couldn't find any other mistakes in the code. Please correct me if I am wrong.
It is because your return is inside the iterrows for loop. The return needs to be unindented so that it is outside the for loop.
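For illustration, a minimal sketch of the same function with the return de-indented; the matching logic is otherwise unchanged:
def annual_leave(staff, df):
    df = df.reset_index(drop=True)
    days_off = []
    for index, row in df.iterrows():
        if df.loc[index, 'Name'] == staff:
            days_off = list(row)[3:]  # drop the first three columns
    return days_off  # returned only after every row has been checked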

How to filter a pandas column by list of strings?

The standard code for filtering through pandas would be something like:
output = df['Column'].str.contains('string')
strings = ['string 1', 'string 2', 'string 3']
Instead of 'string', though, I want the filter to go through the collection of strings in the list strings. So I tried something like
output = df['Column'].str.contains('*strings')
This is the closest solution I could find, but it did not work:
How to filter pandas DataFrame with a list of strings
Edit: I should note that I'm aware of the | (or) operator. However, I'm wondering how to tackle the general case, since the end goal is to loop through varying lists of strings with changing lengths.
You can create a single regex string from the list and search using that.
Like this:
df['Column'].str.contains('|'.join(strings), regex=True)
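One caveat: if the strings can contain regex metacharacters (., *, (, and so on), escape them before joining so they are matched literally:
import re

pattern = '|'.join(map(re.escape, strings))  # each string matched literally
df['Column'].str.contains(pattern, regex=True)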
You should probably look into the isin() function (pandas.Series.isin).
check the code below:
df = pd.DataFrame({'Column': ['string 1', 'string 1', 'string 2', 'string 2',
                              'string 3', 'string 4', 'string 5']})
strings = ['string 1', 'string 2', 'string 3']
output = df.Column.isin(strings)
df[output]
output:
     Column
0  string 1
1  string 1
2  string 2
3  string 2
4  string 3
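Note that the two answers differ in matching semantics: isin() requires an exact, whole-value match, while str.contains matches substrings. A quick sketch of the difference:
df = pd.DataFrame({'Column': ['string 1 extra', 'string 2']})
print(df['Column'].isin(['string 1', 'string 2']).tolist())     # [False, True]: exact match only
print(df['Column'].str.contains('string 1|string 2').tolist())  # [True, True]: substring match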

Pandas - Iterate through lists / dictionaries for calculations

I am new to coding & I am looking for a pythonic way to implement the following code. Here is a sample dataframe with code:
import numpy as np
import pandas as pd

np.random.seed(1111)
df2 = pd.DataFrame({
    'Product': np.random.choice(['Prod 1', 'Prod 2', 'Prod 3', 'Prod 4', 'Prod 5',
                                 'Prod 6', 'Box 1', 'Box 2', 'Box 3'], 10000),
    'Transaction_Type': np.random.choice(['Produced', 'Transferred', 'Scrapped', 'Sold'], 10000),
    'Quantity': np.random.randint(1, 100, size=10000),
    'Date': np.random.choice(pd.date_range('1/1/2017', '12/31/2018', freq='D'), 10000)})
idx = pd.IndexSlice
In the data set, each 'Box' ('Box 1', 'Box 2', etc.) is a raw material that corresponds to multiple products. For example, 'Box 1' is used for 'Prod 1' & 'Prod 2', 'Box 2' is used for 'Prod 3' & 'Prod 4', & 'Box 3' is used for 'Prod 5' & 'Prod 6'.
The data set I'm working with is much larger, but I have these mappings stored as lists; for example, I have 'Box 1' = ['Prod 1', 'Prod 2']. If need be, I could store them in a dictionary with a tuple, like Box1 = {'Box 1': ('Prod 1', 'Prod 2')} - whatever is best.
For each grouping, I'm looking to calculate the total number of boxes used, which is the sum of the 'Produced' and 'Scrapped' inventory. To get this value, I'm currently doing a manual filter on a groupby of each product; you can see I'm manually writing a list of the products in the second assign statement.
For example, to calculate how much of 'Box 1' to relieve from inventory each month, you would sum the values of 'Box 1' that were produced & scrapped. Then you would calculate the values of 'Prod 1' & 'Prod 2' (since they use 'Box 1') that were produced & scrapped, and add them all together to get the total 'Box 1' used & scrapped for each time frame. Here's an example of what I'm currently doing:
box1 = ['Box 1', 'Prod 1', 'Prod 2']
df2[df2['Transaction_Type'].isin(['Produced', 'Scrapped'])]\
    .groupby([pd.Grouper(key='Date', freq='A'), 'Product', 'Transaction_Type'])\
    .agg({'Quantity': 'sum'})\
    .unstack()\
    .loc[idx[:, box1], idx[:]]\
    .assign(Box_1=lambda x: 'Box 1')\
    .assign(List_of_Products=lambda x: 'Box 1, Prod 1, Prod 2')\
    .reset_index()\
    .set_index(['Box_1', 'List_of_Products', 'Date', 'Product'])\
    .groupby(level=[0, 1, 2]).sum()
I'd then have to repeat the same clunky manual exercise for 'Box 2', and so on.
Is there a more pythonic way? I would like to repeat this analysis each month going forward. The actual data is much more complex, with roughly 20 different 'Boxes' that each have a varying number of associated products. I'm not sure if I should create a function or use a dictionary vs. lists, but I would appreciate any help along the way. As a last request, I'd love the flexibility to write each of these 'Box' groupings to a different Excel worksheet.
Thanks in advance!
Not sure how you want the result at the end, but since each Prod uses only one Box, you can replace each Prod with its Box and then do the groupby as you do now. Suppose you have a dictionary such as:
box_dict = {'Box 1': ('Prod 1', 'Prod 2'),
            'Box 2': ('Prod 3', 'Prod 4'),
            'Box 3': ('Prod 5', 'Prod 6')}
then you reverse it to get each prod as a key and its box as the value:
dict_prod = {prod: box for box, l_prod in box_dict.items() for prod in l_prod}
Now you can use replace:
print(df2[df2['Transaction_Type'].isin(['Produced', 'Scrapped'])]
      .replace({'Product': dict_prod})  # map each Prod to the Box it uses
      .groupby([pd.Grouper(key='Date', freq='A'), 'Product', 'Transaction_Type'])['Quantity']
      .sum().unstack())
                     Quantity
Transaction_Type     Produced  Scrapped
Date       Product
2017-12-31 Box 1        20450     19152
           Box 2        20848     21145
           Box 3        22475     21518
2018-12-31 Box 1        19404     16964
           Box 2        21655     20753
           Box 3        21343     21576
I think I would filter the source dataframe down to just what I need to query first, then do the grouping and aggregations:
df2.query('Transaction_Type in ["Produced", "Scrapped"] and Product in ["Box 1", "Prod 1", "Prod 2"]')\
   .groupby([pd.Grouper(key='Date', freq='A'), 'Product', 'Transaction_Type'])['Quantity'].sum()\
   .unstack().reset_index(level=1)\
   .groupby(level=0).agg({'Product': lambda x: ', '.join(x), 'Produced': 'sum', 'Scrapped': 'sum'})
Output:
                          Product  Produced  Scrapped
Date
2017-12-31  Box 1, Prod 1, Prod 2     20450     19152
2018-12-31  Box 1, Prod 1, Prod 2     19404     16964
I do not understand why such a long expression is needed. It seems you only care about the total number of rows satisfying the condition, if I am not totally wrong.
d = {'Box 1': ('Box 1', 'Prod 1', 'Prod 2')}
d_type = {'Box 1': ('Produced', 'Scrapped')}
selected = df2[df2['Product'].isin(d['Box 1']) & df2['Transaction_Type'].isin(d_type['Box 1'])]
print(len(selected))
For your Excel exporting needs, something like the below would work:
with pd.ExcelWriter("test.xlsx") as writer:
    selected.to_excel(writer, sheet_name='Sheet1')
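Building on the dictionary answer above, a sketch of the "one worksheet per box" request; this assumes box_dict and dict_prod from that answer, and 'boxes.xlsx' is just a hypothetical file name:
# box_dict / dict_prod come from the answer above
grouped = (df2[df2['Transaction_Type'].isin(['Produced', 'Scrapped'])]
           .replace({'Product': dict_prod})
           .groupby([pd.Grouper(key='Date', freq='A'), 'Product', 'Transaction_Type'])['Quantity']
           .sum().unstack())

with pd.ExcelWriter('boxes.xlsx') as writer:
    for box in box_dict:
        # xs() pulls one box's rows out of the (Date, Product) MultiIndex
        grouped.xs(box, level='Product').to_excel(writer, sheet_name=box)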

Convert Pandas dataframe to list of list with index, data, and columns

I have a Pandas dataframe that I'd like to convert to a list of lists, where each sublist is a row in the dataframe. How would I also include the index values, so that I can later output it to a PDF table with ReportLab?
import pandas as pd

df = pd.DataFrame(index=['Index 1', 'Index 2'],
                  data=[[1, 2], [3, 4]],
                  columns=['Column 1', 'Column 2'])
list = [df.columns[:,].values.astype(str).tolist()] + df.values.tolist()
print list
output:
[['Column 1', 'Column 2'], [1L, 2L], [3L, 4L]]
desired output:
[['Column 1', 'Column 2'], ['Index 1', 1L, 2L], ['Index 2', 3L, 4L]]
In [29]:
[df.columns.tolist()] + df.reset_index().values.tolist()
Out[29]:
[['Column 1', 'Column 2'], ['Index 1', 1L, 2L], ['Index 2', 3L, 4L]]
You could possibly also include the index name (df.index.name), so that all rows have the same number of columns:
[[df.index.name] + df.columns.tolist()] + df.reset_index().values.tolist()
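For example, with the sample frame above (whose index has no name yet), setting a hypothetical name first gives every row the same width; under Python 3 the integers print without the L suffix:
df.index.name = 'Row'  # hypothetical name; the sample frame has none
print([[df.index.name] + df.columns.tolist()] + df.reset_index().values.tolist())
# [['Row', 'Column 1', 'Column 2'], ['Index 1', 1, 2], ['Index 2', 3, 4]]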
add a list comprehension in this line:
list = [df.columns[:,].values.astype(str).tolist()] + [[index] + vals for index, vals in zip(df.index.tolist(), df.values.tolist())]
Also, because your first item has item[colindex] = the column at colindex, I would maybe change ['Index 1', x1, y1] to [x1, y1, 'Index 1']? I don't know if the position of the index item matters, but this seems to make more sense so that the columns line up. Although I don't know how you are using your data, so maybe not :)
EDITED: df.index.tolist() is better than df.index.values.tolist(), and I think it returns just the items, so you need to wrap the index in a list ([index]) instead of using index on its own.
