Convert categorical rows to column header with string values (no aggregation) - python

I would like to transform my categorical rows into columns, without any aggregation. I have tried it with pivot, but I get NaNs.
This is the data frame:
data = pd.DataFrame({'Art':['blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red'],
'Description':['Some text 1', 'Some text 2', 'Some text 3', 'Some text 4', 'Some text 5', 'Some text 6', 'Some text 7', 'Some text 8']})
When I try to pivot:
data.pivot(columns='Art')
I get:
I have solved the NaN problem like this:
data.pivot(columns='Art').apply(lambda x: pd.Series(x.dropna().values))
This is the desired outcome:
However, I would like to know if there is a smarter way to simply get my classes as column headers.
Thank you!

Your solution already seems clear and concise. One line gives you the result:
data.pivot(columns='Art').apply(lambda x: pd.Series(x.dropna().values))
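If you would rather avoid creating the NaNs in the first place, one possible alternative (a sketch, not the method from the answer above) is to number the rows within each class and pivot on that counter. The helper column name `row` is made up for this example.

```python
import pandas as pd

data = pd.DataFrame({
    "Art": ["blue", "red", "blue", "red", "blue", "red", "blue", "red"],
    "Description": [f"Some text {i}" for i in range(1, 9)],
})

# cumcount() numbers the rows inside each 'Art' group: 0, 1, 2, ...
# Pivoting on that counter lines the values up without gaps.
wide = (
    data.assign(row=data.groupby("Art").cumcount())
        .pivot(index="row", columns="Art", values="Description")
)
print(wide)
```

This only yields a gap-free table when every class has the same number of rows; classes with fewer rows are padded with NaN at the bottom.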

Related

How to extract data from list with RegExp OR in Python3?

I have the following list
['Amount', 'amount', 'Line Total', 'lineTotal', 'AMOUNT DUE']
and I'm trying to extract 'Amount', 'amount', 'Line Total', 'lineTotal' but not 'AMOUNT DUE' field from that list.
Here is how I do that:
>>> lst = ['Amount', 'amount', 'Line Total', 'lineTotal', 'AMOUNT DUE']
>>> import re
>>> r = re.compile("(line*|amount\b)", re.IGNORECASE)
I'm using a regexp OR expression because in the real use case the values differ (for example, one list may contain only 'Amount' plus something else, another only 'Line Total', and so on), and I want to extract exactly the one value that is present.
So in example above I expect to get
['Amount', 'amount', 'Line Total', 'lineTotal']
But instead I'm getting
>>> list(filter(r.match, lst))
['Line Total', 'lineTotal']
Why is this happening, and how can I fix it? Thank you in advance!
Try (regex101):
import re
lst = ["Amount", "amount", "Line Total", "lineTotal", "AMOUNT DUE"]
r = re.compile(r"(?=line|amount$).*", flags=re.IGNORECASE)
lst = list(filter(r.match, lst))
print(lst)
Prints:
['Amount', 'amount', 'Line Total', 'lineTotal']
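As for why the original pattern fails: in a normal (non-raw) string literal, "\b" is the backspace character, not the regex word-boundary token, so the amount branch can never match these strings. The sketch below shows that, together with a simpler raw-string pattern. The names `broken` and `fixed` are only for illustration.

```python
import re

lst = ["Amount", "amount", "Line Total", "lineTotal", "AMOUNT DUE"]

# Without the r prefix, "\b" is the backspace character (\x08).
assert "\b" == "\x08"

# The pattern from the question: the 'amount' branch needs a literal backspace.
broken = re.compile("(line*|amount\b)", re.IGNORECASE)

# Raw string, with $ so that 'amount' must be the whole value.
fixed = re.compile(r"(line|amount$)", re.IGNORECASE)

print([s for s in lst if broken.match(s)])  # ['Line Total', 'lineTotal']
print([s for s in lst if fixed.match(s)])   # ['Amount', 'amount', 'Line Total', 'lineTotal']
```

Note that simply switching to r"amount\b" would not be enough here, because there is a word boundary between 'AMOUNT' and the space in 'AMOUNT DUE', so that value would match as well.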

How to filter a pandas column by list of strings?

The standard code for filtering through pandas would be something like:
output = df['Column'].str.contains('string')
strings = ['string 1', 'string 2', 'string 3']
Instead of 'string' though, I want to filter such that it goes through a collection of strings in list, "strings". So I tried something such as
output = df['Column'].str.contains('*strings')
This is the closest solution I could find, but did not work
How to filter pandas DataFrame with a list of strings
Edit: I should note that I'm aware of the | or operator. However, I'm wondering how to tackle all cases in the instance list strings is changing and I'm looping through varying lists of changing lengths as the end goal.
You can create a regex string and search using this string.
Like this:
df['Column'].str.contains('|'.join(strings),regex=True)
You should probably look into the isin() method (pandas.Series.isin). Note that isin() tests for exact matches against the list, whereas str.contains() tests for substrings.
Check the code below:
df = pd.DataFrame({'Column':['string 1', 'string 1', 'string 2', 'string 2', 'string 3', 'string 4', 'string 5']})
strings = ['string 1', 'string 2', 'string 3']
output = df.Column.isin(strings)
df[output]
output:
     Column
0  string 1
1  string 1
2  string 2
3  string 2
4  string 3
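To make the difference between the two answers concrete, here is a small sketch with made-up data. It also escapes each string with re.escape before joining, which matters if the list can contain regex metacharacters such as parentheses or dots.

```python
import re
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"Column": ["string 1", "a string 2 here", "string 4", "cost (usd)"]})
strings = ["string 1", "string 2", "(usd)"]

# Substring match: build one alternation pattern from the escaped strings.
pattern = "|".join(re.escape(s) for s in strings)
substring_mask = df["Column"].str.contains(pattern, regex=True)

# Exact match: the whole cell must equal one of the strings.
exact_mask = df["Column"].isin(strings)

print(df[substring_mask])
print(df[exact_mask])
```

Because the pattern is rebuilt from whatever `strings` currently holds, the same two lines work inside a loop over lists of varying length.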

Access dynamically created data frames

Hello Python community,
I have a problem with my code.
I wrote code that dynamically creates dataframes in a for loop. The problem is that I don't know how to access them.
Here is part of the code:
list = ['Group 1', 'Group 2', 'Group 3']
for i in list:
    exec('df{} = pd.DataFrame()'.format(i))
for i in list:
    print(df+i)
The dataframes are created, but I cannot access them.
Could someone help me please?
Thank you in advance
I'm not sure exactly how your data is stored or accessed, but you could create a dictionary that pairs each list item with its dataframe, as follows:
import numpy as np
import pandas as pd

list_ = ['Group 1', 'Group 2', 'Group 3']
dataframe_dict = {}
for i in list_:
    data = np.random.rand(3, 3)  # create the data for each dataframe here
    dataframe_dict[i] = pd.DataFrame(data, columns=["your_column_one", "two", "etc"])
You can then retrieve each dataframe by using its group name as the dictionary key, as follows:
for key in dataframe_dict.keys():
    print(key)
    print(dataframe_dict[key])

Pandas - Iterate through lists / dictionaries for calculations

I am new to coding & I am looking for a pythonic way to implement the following code. Here is a sample dataframe with code:
import numpy as np
import pandas as pd

np.random.seed(1111)
df2 = pd.DataFrame({
    'Product': np.random.choice(['Prod 1','Prod 2','Prod 3', 'Prod 4','Prod 5','Prod 6','Box 1','Box 2','Box 3'], 10000),
    'Transaction_Type': np.random.choice(['Produced','Transferred','Scrapped','Sold'], 10000),
    'Quantity': np.random.randint(1,100, size=(10000)),
    'Date': np.random.choice(pd.date_range('1/1/2017','12/31/2018', freq='D'), 10000)})
idx = pd.IndexSlice
In the data set, each 'Box' ('Box 1', 'Box 2', etc.) is a raw material that corresponds to multiple products. For example, 'Box 1' is used for 'Prod 1' & 'Prod 2', 'Box 2' is used for 'Prod 3' & 'Prod 4', & 'Box 3' is used for 'Prod 5' & 'Prod 6'.
The data set I'm working with is much larger, but I have these groupings stored as lists, for example 'Box 1' = ['Prod 1', 'Prod 2']. If need be, I could store them as a dictionary of tuples, like Box1 = {'Box 1': ('Prod 1', 'Prod 2')}, whichever is best.
For each grouping, I'm looking to calculate the total number of boxes used, which is the sum of 'Produced' + 'Scrapped' inventory. To get this value, I'm currently doing a groupby on each product and filtering manually. You can see I'm manually writing the list of products in the second assign statement.
For example, to calculate how much of 'Box 1' to relieve from inventory each month, you would sum the values of 'Box 1' that were produced and scrapped. Then you would calculate the values of 'Prod 1' and 'Prod 2' (since they use 'Box 1') that were produced and scrapped, and add them all together to get the total 'Box 1' used and scrapped for each time frame. Here's an example of what I'm currently doing:
box1 = ['Box 1','Prod 1','Prod 2']
df2[df2['Transaction_Type'].isin(['Produced','Scrapped'])].groupby([pd.Grouper(key='Date',freq='A' ),'Product','Transaction_Type']).agg({'Quantity':'sum'})\
.unstack()\
.loc[idx[:,box1],idx[:]]\
.assign(Box_1 = lambda x: 'Box 1')\
.assign(List_of_Products = lambda x: 'Box 1, Prod 1, Prod 2')\
.reset_index()\
.set_index(['Box_1','List_of_Products','Date','Product'])\
.groupby(level=[0,1,2]).sum()
I'd then have to repeat the same clunky manual exercise for 'Box 2', etc.
Is there a more pythonic way? I would like to complete this analysis each month going forward. The actual data is much more complex with roughly 20 different 'Boxes' that have a varying number of products associated with each. I'm not sure if I should be looking to create a function or use a dictionary vs. lists, but would appreciate any help along the way. As a last request, I'd love to have the flexibility to write each of these 'Box_1' to a different excel worksheet.
Thanks in advance!
I'm not sure how you want the final result to look, but since each Prod uses only one Box, you can replace each Prod with its Box and then do the groupby as you do now. Let's suppose you have a dictionary such as:
box_dict = {'Box 1': ('Prod 1', 'Prod 2'),
'Box 2': ('Prod 3', 'Prod 4'),
'Box 3': ('Prod 5', 'Prod 6')}
then you want to reverse it to get the prod as the key and the box as the value:
dict_prod = { prod:box for box, l_prod in box_dict.items() for prod in l_prod}
Now you can use replace:
print (df2[df2['Transaction_Type'].isin(['Produced','Scrapped'])]
.replace({'Product':dict_prod}) #here to change the prod to the box used
.groupby([pd.Grouper(key='Date',freq='A' ),'Product','Transaction_Type'])['Quantity']
.sum().unstack())
                        Quantity
Transaction_Type        Produced Scrapped
Date       Product
2017-12-31 Box 1           20450    19152
           Box 2           20848    21145
           Box 3           22475    21518
2018-12-31 Box 1           19404    16964
           Box 2           21655    20753
           Box 3           21343    21576
I think I would first filter the source dataframe down to just what I need with query, and then do the grouping and aggregations:
df2.query('Transaction_Type in ["Produced","Scrapped"] and Product in ["Box 1","Prod 1","Prod 2"]')\
.groupby([pd.Grouper(key='Date',freq='A'),'Product','Transaction_Type'])['Quantity'].sum()\
.unstack().reset_index(level=1).groupby(level=0).agg({'Product':lambda x: ', '.join(x),'Produced':'sum','Scrapped':'sum'})
Output:
                          Product  Produced  Scrapped
Date
2017-12-31  Box 1, Prod 1, Prod 2     20450     19152
2018-12-31  Box 1, Prod 1, Prod 2     19404     16964
I do not understand why such a long expression is needed. It seems you only care about the total number of rows satisfying the condition, if I am not totally wrong.
d = {'Box 1': ('Box 1', 'Prod 1', 'Prod 2')}
d_type = {'Box 1': ('Produced', 'Scrapped')}
selected = df2[df2['Product'].isin(d['Box 1']) & df2['Transaction_Type'].isin(d_type['Box 1'])]
print(len(selected))
For your Excel export, something like the following would work.
# The context manager saves the file on exit (writer.save() was removed in pandas 2.0).
with pd.ExcelWriter("test.xlsx") as writer:
    selected.to_excel(writer, sheet_name='Sheet1')
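Putting the pieces from the answers above together, the per-box repetition and the one-worksheet-per-box request could be handled with a loop over a dictionary. This is only a sketch under assumptions: the function names `usage_per_box` and `write_sheets`, the small sample frame, and the choice to group by calendar year via `dt.year` are all made up for illustration, and writing the file needs an Excel engine such as openpyxl to be installed.

```python
import pandas as pd

# Assumed mapping of each box to the products that consume it.
box_dict = {
    "Box 1": ("Prod 1", "Prod 2"),
    "Box 2": ("Prod 3", "Prod 4"),
    "Box 3": ("Prod 5", "Prod 6"),
}

def usage_per_box(df, box_dict):
    """Return {box: yearly Produced/Scrapped totals} for the box and its products."""
    used = df[df["Transaction_Type"].isin(["Produced", "Scrapped"])]
    results = {}
    for box, prods in box_dict.items():
        subset = used[used["Product"].isin([box, *prods])]
        year = subset["Date"].dt.year.rename("Year")
        results[box] = (
            subset.groupby([year, "Transaction_Type"])["Quantity"]
            .sum()
            .unstack(fill_value=0)
        )
    return results

def write_sheets(results, path):
    """Write each box's table to its own worksheet (not called below)."""
    with pd.ExcelWriter(path) as writer:
        for box, table in results.items():
            table.to_excel(writer, sheet_name=box)

# Tiny hand-made sample instead of the 10,000-row frame from the question.
sample = pd.DataFrame({
    "Product": ["Box 1", "Prod 1", "Prod 2", "Prod 3", "Box 2", "Prod 1", "Prod 5"],
    "Transaction_Type": ["Produced", "Scrapped", "Sold", "Produced",
                         "Scrapped", "Produced", "Produced"],
    "Quantity": [10, 5, 7, 3, 2, 4, 6],
    "Date": pd.to_datetime(["2017-01-05", "2017-03-01", "2017-04-01", "2018-02-01",
                            "2018-06-01", "2018-07-01", "2017-09-01"]),
})
results = usage_per_box(sample, box_dict)
print(results["Box 1"])
```

Calling `write_sheets(results, "boxes.xlsx")` would then produce one sheet per box; adding a new box only requires a new entry in `box_dict`.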

Convert Pandas dataframe to list of list with index, data, and columns

I have a Pandas dataframe that I'd like to convert to a list of lists, where each sublist is a row in the dataframe. How would I also include the index values, so that I can later output it to a PDF table with ReportLab?
import pandas as pd
df = pd.DataFrame(index=['Index 1', 'Index 2'],
data=[[1,2],[3,4]],
columns=['Column 1', 'Column 2'])
list = [df.columns[:,].values.astype(str).tolist()] + df.values.tolist()
print list
output:
[['Column 1', 'Column 2'], [1L, 2L], [3L, 4L]]
desired output:
[['Column 1', 'Column 2'], ['Index 1', 1L, 2L], ['Index 2', 3L, 4L]]
In [29]:
[df.columns.tolist()] + df.reset_index().values.tolist()
Out[29]:
[['Column 1', 'Column 2'], ['Index 1', 1L, 2L], ['Index 2', 3L, 4L]]
Possibly also include the index name (df.index.name) so that all rows have the same number of columns.
[[df.index.name] + df.columns.tolist()] + df.reset_index().values.tolist()
Add a list comprehension to this line:
list = [df.columns[:,].values.astype(str).tolist()] + [[index] + vals for index, vals in zip(df.index.tolist(), df.values.tolist())]
Also, because the first item holds your column names, so that item[colindex] is the column at colindex, I would maybe change ['Index 1', x1, y1] to [x1, y1, 'Index 1']. I don't know whether the position of the index item matters, but this seems to make more sense so that the columns line up. Then again, I don't know how you are using your data, so maybe not :)
Edited: df.index.tolist() is better than df.index.values.tolist(), and I think it returns just the items, so you need to wrap each one as [index] rather than using index on its own.
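For reference, here is a Python 3 sketch of the same idea that keeps every row the same length by adding a header cell for the index. The names `header` and `rows` are made up for this example, and the empty-string fallback is an assumption for the case where the index has no name.

```python
import pandas as pd

df = pd.DataFrame(index=["Index 1", "Index 2"],
                  data=[[1, 2], [3, 4]],
                  columns=["Column 1", "Column 2"])

# df.index.name is None here, so fall back to an empty header cell.
header = [df.index.name or ""] + df.columns.tolist()

# reset_index() turns the index into the first column of every row.
rows = [header] + df.reset_index().values.tolist()
print(rows)
```

Each sublist now has three items, which is the rectangular shape a table layout generally expects.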
