Python, Pandas, Pyomo change shape according to index - python

I have written an optimization model and now I want to generate some output files (xlsx) for the different variables. I have put the whole data of the variables in one DataFrame with the following code:
block_vars = []
for var in model.component_data_objects(Var):
    block_vars.append(var.parent_component())
block_vars = list(set(block_vars))

dc = {(str(bv).split('.')[0], str(bv).split('.')[-1], i): bv[i].value
      for bv in block_vars for i in getattr(bv, '_index')}
df = pd.DataFrame(list(dc.items()), columns=['tuple', 'value'])
df['variable_name'] = df['tuple'].str[-2]
df['variable_index'] = df['tuple'].str[-1]
df.drop('tuple', axis=1, inplace=True)
This works fine (even though it is probably not the smoothest way).
Now I am filtering the different variables with a block as follows:
writer = pd.ExcelWriter('UC.xlsx')

conditions = {'variable_name': 'vCommit'}
df_uc = df.copy()
df_uc = df_uc[(df_uc[list(conditions)] == pd.Series(conditions)).all(axis=1)].drop('variable_name', axis=1)
df_uc.to_excel(writer, 'Tabelle1')
This works as well. Now comes the part I am struggling with.
Those variables are indexed (with 2 or 3 indexes, depending on the variable), and I would like the output to be something like:
index1  index2  value
1       1       1
1       2       0
...
but those indexes are in a tuple in one row of the DataFrame and I am not sure how to access them and reshape the DataFrame correspondingly.
Does anybody know a way to do that? Thanks for your help!!!

I would expand out the index into multiple columns when first creating the DataFrame. You can try to look at the code here for inspiration: https://github.com/gseastream/pyomo/blob/fa9b8f20a0f9afafa7cbd4607baa8b4963a96f42/pyomo/repn/plugins/excel_writer.py
Grant was working on an interface to Excel, but development priorities shifted elsewhere.
Also, a quick note: you can use model.component_objects(Var) instead of what you have with list(set(block_vars)).
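If you want to split those index tuples after building df as above, here is a minimal sketch (assuming the 'variable_index' column holds the original Pyomo index, which may be a scalar or a tuple; the index1/index2/... column names are just placeholders):

import pandas as pd

# Normalize every index to a tuple so all rows expand consistently.
idx = df['variable_index'].apply(lambda i: i if isinstance(i, tuple) else (i,))

# One column per index level; variables with fewer levels get NaN padding.
expanded = pd.DataFrame(idx.tolist(), index=df.index)
expanded.columns = ['index%d' % (k + 1) for k in range(expanded.shape[1])]

df_out = pd.concat([expanded, df[['variable_name', 'value']]], axis=1)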


Basic python defining function

I'm having difficulty applying my knowledge of defining functions with def to my own function.
I want to create a function that filters my data frame by: 1. dropping the columns I specify (and their axis), and 2. applying .dropna.
I've used it on one of my data frames like this:
total_adj_gross = ((gross.drop(columns = ['genre','rating', 'total_gross'], axis = 1)).dropna())
I've also used it on another data frame like this:
vill = (characters.drop(columns = ['hero','song'], axis = 1)).dropna(axis = 0)
Can I make a function using def so I can easily do this to any data frame?
If so, would I go about it like this?

def filtered_df(data, col_name, N=1):
    frame = data.drop(columns = [col_name], axis = N)
    frame.dropna(axis = N)
    return frame
I can already feel that this function would go wrong, because what if I have different N's, like in my vill object?
BTW I am a very new beginner as you can tell -- I haven't been exposed to any complex functions. Any help would be appreciated!
Update, since I don't know how to format code in comments:
Thank you all for your help in creating my function,
but now how do I insert it into my code?
Do I have to make a script (.py) and then call my function?
Can I test it within my actual code?
Right now, if I just copy + paste the code in and fill in the column name, I get an error saying the specific column "is not found in the axis".
Based on what you want to achieve, you don't need to pass any axis parameter. Also, you want to pass a list of columns as a parameter so you can drop several columns at once (columns= implies axis=1 for drop(), and axis=0 is the default for dropna()). And finally, dropna() is not in place by default: you have to store the returned value in a frame, like you did on the line above.
Your function should look like this:
def filtered_df(data, col_names):
    frame = data.drop(columns = col_names)
    result = frame.dropna()
    return result
Overall, the code looks good. I'd suggest 3 minor changes:
1. Pass column names as a list; do not convert them to a list within the function.
2. Pass 2 variables for working with axis. From what I see in your examples, your axis values change between drop and dropna. If you want 2 different axis values for drop() and dropna(), use 2 different variables, e.g. drop_axis and dropna_axis.
3. Assign the modified frame / use a single-line operation.
So, the code would look something like this:

def filtered_df(data, col_name, drop_axis=1, dropna_axis=0):
    frame = data.drop(columns = col_name, axis = drop_axis).dropna(axis = dropna_axis)
    return frame
Your call to it can look like:
modified_df = filtered_df(data, ["x_col","y_col"], 0, 0)
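To the update: you don't need a separate .py file. Define the function near the top of your script (or in an earlier notebook cell) and call it below. A minimal sketch with hypothetical data:

import pandas as pd

def filtered_df(data, col_names):
    # Drop the given columns, then drop rows containing NaN.
    return data.drop(columns=col_names).dropna()

# Hypothetical example frame; substitute your own data and column names.
gross = pd.DataFrame({'title': ['A', 'B', 'C'],
                      'genre': ['x', 'y', 'z'],
                      'total_gross': [1.0, None, 3.0]})

total_adj_gross = filtered_df(gross, ['genre'])  # drops 'genre', then the NaN row

The "is not found in the axis" error from the update means one of the names passed in col_names does not exactly match a column of that particular frame; spelling and case both matter.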

how to append a dataframe to an existing dataframe inside a loop

I made a simple DataFrame named middle_dataframe in Python, which looks like this and only has one row of data:
[image: display of the existing dataframe]
And I want to append a new dataframe generated each time in a loop to this existing dataframe. This is my program:
k = 2
for k in range(2, 32021):
    header = whole_seq_data[k]
    if header.startswith('>'):
        id_name = get_ucsc_ids(header)
        (chromosome, start_p, end_p) = get_chr_coordinates_from_string(header)
        if whole_seq_data[k + 1].startswith('[ATGC]'):
            seq = whole_seq_data[k + 1]
            df_temp = pd.DataFrame(
                {
                    "ucsc_id": [id_name],
                    "chromosome": [chromosome],
                    "start_position": [start_p],
                    "end_position": [end_p],
                    "whole_sequence": [seq]
                }
            )
            middle_dataframe.append(df_temp)
    k = k + 2
My iterations in the for loop seem to be fine, and I checked that the variables stored the correct values after using regular expressions. But middle_dataframe doesn't show any changes, and I cannot figure out why.
The DataFrame.append method returns the result of the append rather than appending in-place (see the official docs for DataFrame.append). The fix is to replace this line:
middle_dataframe.append(df_temp)
with this:
middle_dataframe = middle_dataframe.append(df_temp)
Depending on how that works with your data, you might also need to pass in the parameter ignore_index=True.
The docs warn that appending one row at a time to a DataFrame can be more computationally intensive than building a python list and converting it into a DataFrame all at once. That's something to look into if your current approach ends up too slow for your purposes.
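A minimal sketch of that list-based pattern, reusing the question's hypothetical helpers (get_ucsc_ids, get_chr_coordinates_from_string) and data:

import pandas as pd

rows = []
for k in range(2, 32021):
    header = whole_seq_data[k]
    if header.startswith('>'):
        id_name = get_ucsc_ids(header)
        chromosome, start_p, end_p = get_chr_coordinates_from_string(header)
        # Note: str.startswith matches literally, not as a regex, so the
        # original '[ATGC]' check would need re.match to work as intended.
        rows.append({
            "ucsc_id": id_name,
            "chromosome": chromosome,
            "start_position": start_p,
            "end_position": end_p,
            "whole_sequence": whole_seq_data[k + 1],
        })

# Build the DataFrame once, after the loop.
middle_dataframe = pd.concat([middle_dataframe, pd.DataFrame(rows)],
                             ignore_index=True)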

Drop Pandas DataFrame lines according to a GroupBy property

I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say:
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping over [my_df1, my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []
for Sample in SampleList:
    for n in Sample[1]['Group']:
        # This is inelegant, but trying to work with Sample[1] in the for doesn't work
        df = Sample[0].loc[Sample[0]['Group'] == n]
        if (df['Value'].max() > my_max):
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see if any of the 'Values' in your input data meet your specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'

my_df1 = pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]], columns=['Group','Value'])
my_df1_Group = pd.DataFrame([[1,57],[2,63]], columns=['Group','Group_Value'])

grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x, grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
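As an aside (my suggestion, not from the answers above), the same filtering can be done without apply at all: collect the groups that violate the criterion once, then drop them with isin.

my_max = 16

bad_groups = my_df1.loc[my_df1['Value'] > my_max, 'Group'].unique()
my_df1_Group = my_df1_Group[~my_df1_Group['Group'].isin(bad_groups)]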

Applying the output of a function to two columns using .apply

I'm working on a script that takes in an address and spits out two values: coordinates (as a list) and result (whether the geocoding was successful or not). This works fine, but since the data is returned as a list, I then have to assign new columns based on the indices of that list, which works but raises a warning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy.
EDIT: Just to be clear, I think I understand from that page that I should be using .loc to access the nested values. My question is more along the lines of generating two columns directly from a function as opposed to this workaround of having to dig the information out later.
I'd like to know the correct way to approach problems like these, as I actually have this problem twice in this project.
The actual specifics of the problem aren't important, so here's a simple example of how I've been approaching it:
def geo(address):
    location = geocode(address)
    result = location.result
    coords = location.coords
    return coords, result

df['output'] = df['address'].apply(geo)
Since this puts a nested list into my df column, I then extract that into new columns as such:
df['coordinates'] = None
df['gps_status'] = None

for index, row in df.iterrows():
    df['coordinates'][index] = df['output'][index][0]
    df['gps_status'][index] = df['output'][index][1]
And again, I get the warning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Any advice on the correct way to do this would be appreciated.
Usually you want to avoid iterrows() since it is faster to operate on an entire column at once. You can assign the result from output directly to a new column.
import pandas as pd

def geo(x):
    return x*2, x*3

df = pd.DataFrame({'address': [1, 2, 3]})

output = df['address'].apply(geo)
df['a'] = [x[0] for x in output]
df['b'] = [x[1] for x in output]
gives you
   address  a  b
0        1  2  3
1        2  4  6
2        3  6  9
with no copy warning.
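As a small aside (not from the answer above), the two list comprehensions can be collapsed into one unpacking step:

df['a'], df['b'] = zip(*output)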
Your function should return a Series:

def geo(address):
    location = geocode(address)
    result = location.result
    coords = location.coords
    return pd.Series([coords, result], ['coordinates', 'gps_status'])

df[['coordinates', 'gps_status']] = df['address'].apply(geo)

(Note that apply on a Series-returning function yields a DataFrame, so the result is assigned to two columns rather than one.)
That said, this may be better written as a merge.
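For completeness, a sketch of that merge/join flavor (instead of the two-column assignment above): apply on a Series-returning function yields a DataFrame whose columns come from the Series index, which can then be joined back:

geo_cols = df['address'].apply(geo)  # DataFrame with 'coordinates' and 'gps_status'
df = df.join(geo_cols)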

pandas SparseDataFrame insertion

I would like to create a pandas SparseDataFrame with the dimensions 250,000 x 250,000. In the end my aim is to come up with a big adjacency matrix.
So far, creating that data frame is no problem:
df = SparseDataFrame(columns=arange(250000), index=arange(250000))
But when I try to update the DataFrame, I run into massive memory/runtime problems:
index = 1000
col = 2000
value = 1
df.set_value(index, col, value)
I checked the source:
def set_value(self, index, col, value):
    """
    Put single value at passed column and index

    Parameters
    ----------
    index : row label
    col : column label
    value : scalar value

    Notes
    -----
    This method *always* returns a new object. It is currently not
    particularly efficient (and potentially very expensive) but is provided
    for API compatibility with DataFrame
    ...
That last sentence describes exactly my problem with using pandas here. I really would like to keep using pandas, but this way it is totally impossible!
Does someone have an idea how to solve this problem more efficiently?
My next idea is to work with something like nested lists/dicts or so...
Thanks for your help!
Do it this way:

df = pd.SparseDataFrame(columns=np.arange(250000), index=np.arange(250000))

s = df[2000].to_dense()
s[1000] = 1
df[2000] = s

In [11]: df.ix[1000, 2000]
Out[11]: 1.0
So the procedure is to swap out an entire series at a time. The SDF will convert the passed-in series to a SparseSeries. (You can do it yourself to see what they look like with s.to_sparse().) The SparseDataFrame is basically a dict of these SparseSeries, which themselves are immutable. Sparseness will have some changes in 0.12 to better support these types of operations (e.g. setting will work efficiently).
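As an aside (my suggestion, not part of the original answer): if the end goal is just a large adjacency matrix, a scipy.sparse format such as dok_matrix supports cheap single-cell assignment and may fit incremental construction better than a SparseDataFrame. A minimal sketch:

import scipy.sparse as sp

# DOK (dictionary-of-keys) storage allows efficient incremental writes.
adj = sp.dok_matrix((250000, 250000), dtype=int)
adj[1000, 2000] = 1

# Convert to CSR once construction is finished, for fast arithmetic and slicing.
adj = adj.tocsr()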
