How to append a DataFrame to an existing DataFrame inside a loop - Python

I made a simple DataFrame named middle_dataframe in Python which has only one row of data:
[screenshot of the existing DataFrame]
And I want to append a new dataframe generated each time in a loop to this existing dataframe. This is my program:
k = 2
for k in range(2, 32021):
    header = whole_seq_data[k]
    if header.startswith('>'):
        id_name = get_ucsc_ids(header)
        (chromosome, start_p, end_p) = get_chr_coordinates_from_string(header)
        if whole_seq_data[k + 1].startswith('[ATGC]'):
            seq = whole_seq_data[k + 1]
        df_temp = pd.DataFrame(
            {
                "ucsc_id": [id_name],
                "chromosome": [chromosome],
                "start_position": [start_p],
                "end_position": [end_p],
                "whole_sequence": [seq]
            }
        )
        middle_dataframe.append(df_temp)
        k = k + 2
My iterations in the for loop seem to be fine, and I checked that the variables hold the correct values after applying the regular expressions. But middle_dataframe never changes, and I cannot figure out why.

The DataFrame.append method returns the result of the append rather than appending in place (see the official docs on DataFrame.append). The fix is to replace this line:
middle_dataframe.append(df_temp)
with this:
middle_dataframe = middle_dataframe.append(df_temp)
Depending on how that works with your data, you might need also to pass in the parameter ignore_index=True.
The docs warn that appending one row at a time to a DataFrame can be more computationally intensive than building a python list and converting it into a DataFrame all at once. That's something to look into if your current approach ends up too slow for your purposes.
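A minimal sketch of that list-based approach, using toy records in place of the parsed whole_seq_data from the question (the helper functions and real data are assumed, so stand-in values are used here):

```python
import pandas as pd

# Toy stand-in for the parsed records; in the question these would come
# from get_ucsc_ids / get_chr_coordinates_from_string on whole_seq_data.
records = [
    ("uc001abc", "chr1", 100, 200, "ATGC"),
    ("uc001abd", "chr1", 300, 400, "GGCC"),
]

rows = []
for id_name, chromosome, start_p, end_p, seq in records:
    rows.append({
        "ucsc_id": id_name,
        "chromosome": chromosome,
        "start_position": start_p,
        "end_position": end_p,
        "whole_sequence": seq,
    })

# Build the DataFrame once, after the loop, instead of appending row by row
middle_dataframe = pd.DataFrame(rows)
print(len(middle_dataframe))  # 2
```

Accumulating plain dicts is cheap; the expensive DataFrame construction then happens only once.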


Calling a dataframe column name in the parameters of a function

I have a dataframe with 8 columns, and I would like to run the code below (I tested that it works on a single column) as a function to map/apply over all 8 columns.
all_adj_noun = []
for i in range(len(bigram_df)):
    if len([bigram_df['adj_noun'][i]]) >= 1:
        for j in range(len(bigram_df['adj_noun'][i])):
            all_adj_noun.append(bigram_df['adj_noun'][i][j])
However, when I tried to define it as a function, the code returns an empty result even though the column is not empty.
def combine_bigrams(df_name, col_name):
    all_bigrams = []
    for i in range(len(df_name)):
        if len([df_name[col_name][i]]) >= 1:
            for j in range(len(df_name[col_name][i])):
                return all_bigrams.append(df_name[col_name][i][j])
I call the function by
combine_bigrams(bigram_df, 'adj_noun')
May I know if there is anything I might be doing wrong here?
The problem is that you are returning the result of .append, which is None.
However, there is a better (and faster) way to do this. To return a list with all the values present in the columns, you can leverage Series.agg:
col_name = 'adj_noun'
all_bigrams = bigram_df[col_name].agg(sum)
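For the function itself, the immediate fix is to move the return outside the loops so it runs after all rows are processed. A sketch, with toy data standing in for bigram_df:

```python
import pandas as pd

def combine_bigrams(df_name, col_name):
    all_bigrams = []
    for i in range(len(df_name)):
        for j in range(len(df_name[col_name][i])):
            all_bigrams.append(df_name[col_name][i][j])
    return all_bigrams  # return once, after both loops finish

# Toy stand-in for bigram_df from the question
bigram_df = pd.DataFrame({'adj_noun': [['big cat', 'old dog'], ['red car']]})
print(combine_bigrams(bigram_df, 'adj_noun'))
# ['big cat', 'old dog', 'red car']
```

Note that `return all_bigrams.append(...)` in the original returned the value of list.append, which is always None, on the very first iteration.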

How to separate tuple into independent pandas columns?

I am working with matching two separate dataframes on first name using HMNI's fuzzymerge.
On output each row returns a key like: (May, 0.9905315373004635)
I am trying to separate the Name and Score into their own columns. I tried the code below but don't quite get the right output: every row ends up with the exact same name/score in the new columns.
for i, v in enumerate(matched.key):
    matched['MatchedNameFinal'] = (matched.key[i][0][0])
    matched['MatchedNameScore'] = (matched.key[i][0][1])
matched[['consumer_name_first', 'key','MatchedNameFinal', 'MatchedNameScore']]
First, when going over rows in pandas it is better to use apply:
matched['MatchedNameFinal'] = matched.key.apply(lambda x: x[0][0])
matched['MatchedNameScore'] = matched.key.apply(lambda x: x[0][1])
And in your case, I think you are missing an indentation level in the for loop:
for i, v in enumerate(matched.key):
    matched['MatchedNameFinal'] = (matched.key[i][0][0])
    matched['MatchedNameScore'] = (matched.key[i][0][1])
Generally, you want to avoid using enumerate for pandas because pandas functions are vectorized and much faster to execute.
So this solution won't iterate using enumerate.
First, turn each one-element list into a single tuple per row:
matched.key.explode()
Then use zip(*...) to split the tuples into 2 columns:
matched['col1'], matched['col2'] = zip(*tuples)
Or do it all in one line:
matched['MatchedNameFinal'], matched['MatchedNameScore'] = zip(*matched.key.explode())
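A minimal runnable sketch of that one-liner, with toy data standing in for the fuzzymerge output (assumed shape: each key cell holds a one-element list containing a (name, score) tuple):

```python
import pandas as pd

# Toy stand-in for the matched DataFrame from the question
matched = pd.DataFrame({'key': [[('May', 0.99)], [('Ann', 0.87)]]})

# explode() unwraps each one-element list into its tuple; zip(*...) then
# transposes the (name, score) tuples into two parallel sequences
matched['MatchedNameFinal'], matched['MatchedNameScore'] = zip(*matched.key.explode())
print(matched['MatchedNameFinal'].tolist())  # ['May', 'Ann']
```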

Create new data frame with the name from loop number

I tried to create a loop in Python; the code can be seen below.
df = pd.DataFrame.copy(mef_list)
form = ['', '_M3', '_M6', '_M9', '_M12', '_LN', '_C']
for i in range(0, len(form)):
    df = pd.DataFrame.copy(mef_list)
    df['Variable_new'] = df['Variable'] + str(form[i])
When I run the code, only the result of the last iteration survives (variable + '_C'). I think this is because the data frame df is replaced each time a new iteration starts. To avoid this, I thought that if df could be renamed by appending the loop number, the problem would be solved.
I used the str function, hoping to get df0, df1, ..., df6, but that doesn't work for a data frame name. Please suggest how to change the name of the data frame by adding the loop number; I am also open to any alternative approach.
Thanks!
This isn't a Pythonic thing to do; have you thought about creating a list of dataframes instead?
df = pd.DataFrame.copy(mef_list)
form = ['', '_M3', '_M6', '_M9', '_M12', '_LN', '_C']
list_of_df = list()
for i in range(0, len(form)):
    df = pd.DataFrame.copy(mef_list)
    df['Variable_new'] = df['Variable'] + str(form[i])
    list_of_df.append(df)
Then you can access 'df0' as list_of_df[0]
You also don't need to iterate through a range; you can just loop through the form list itself:
form = ['', '_M3', '_M6', '_M9', '_M12', '_LN', '_C']
list_of_df = list()
for i in form:
    df = pd.DataFrame.copy(mef_list)
    df['Variable_new'] = df['Variable'] + str(i)  # you can remove str() if everything in form is already a string
    list_of_df.append(df)
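A runnable sketch of that idea, with a toy DataFrame standing in for mef_list (assumed to have a 'Variable' column); a dict keyed by suffix is a small variation that makes lookups more readable than df0, df1, ...:

```python
import pandas as pd

# Toy stand-in for mef_list from the question
mef_list = pd.DataFrame({'Variable': ['UR', 'CPI']})

form = ['', '_M3', '_M6']
dfs = {}  # dict keyed by suffix instead of numbered names
for suffix in form:
    df = mef_list.copy()
    df['Variable_new'] = df['Variable'] + suffix
    dfs[suffix] = df

print(dfs['_M3']['Variable_new'].tolist())  # ['UR_M3', 'CPI_M3']
```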
mef_list = ["UR", "CPI", "CEI", "Farm", "PCI", "durable", "C_CVM"]
form = ['', '_M3', '_M6', '_M9', '_M12', '_LN', '_C']
Variable_new = []
foo = 0
for variable in form:
    Variable_new.append(mef_list[foo] + variable)
    foo += 1
print(Variable_new)

Python, Pandas, Pyomo change shape according to index

I have written an optimization model and now I want to generate some output files (xlsx) for the different variables. I have put all the variable data in one DataFrame with the following code:
block_vars = []
for var in model.component_data_objects(Var):
    block_vars.append(var.parent_component())
block_vars = list(set(block_vars))
dc = {(str(bv).split('.')[0], str(bv).split('.')[-1], i): bv[i].value
      for bv in block_vars for i in getattr(bv, '_index')}
df = pd.DataFrame(list(dc.items()), columns=['tuple', 'value'])
df['variable_name'] = df['tuple'].str[-2]
df['variable_index'] = df['tuple'].str[-1]
df.drop('tuple', axis=1, inplace=True)
This works fine (even though it probably is not the smoothest way).
Now I am filtering the different variables with a block as follows:
writer = pd.ExcelWriter('UC.xlsx')
conditions = {'variable_name':'vCommit'}
df_uc = df.copy()
df_uc = df_uc[(df_uc[list(conditions)] == pd.Series(conditions)).all(axis=1)].drop('variable_name', 1)
df_uc.to_excel(writer, 'Tabelle1')
This works as well. Now comes the part I am struggling with.
Those variables are indexed (with 2 or 3 indexes, depending on the variable), and I would like the output to be something like:
index1  index2  value
     1       1      1
     1       2      0
...
but those indexes are in a tuple in one row of the DataFrame and I am not sure how to access them and reshape the DataFrame correspondingly.
Does anybody know a way to do that? Thanks for your help!!!
I would expand out the index into multiple columns when first creating the DataFrame. You can try to look at the code here for inspiration: https://github.com/gseastream/pyomo/blob/fa9b8f20a0f9afafa7cbd4607baa8b4963a96f42/pyomo/repn/plugins/excel_writer.py
Grant was working on an interface to Excel, but development priorities shifted elsewhere.
Also, a quick note: you can use model.component_objects(Var) instead of what you have with list(set(block_vars)).
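As a sketch of that suggestion, assuming (as in the dc dict built above) that each key is a (block, variable_name, index) triple, with the index itself a tuple for a doubly indexed variable:

```python
import pandas as pd

# Toy stand-in for the dc dict built above; here each Pyomo index is a
# 2-tuple, as it would be for a variable indexed over two sets.
dc = {('b', 'vCommit', (1, 1)): 1.0,
      ('b', 'vCommit', (1, 2)): 0.0}

# Expand the index tuple into its own columns while building the DataFrame
rows = [{'variable_name': name, 'index1': idx[0], 'index2': idx[1], 'value': val}
        for (block, name, idx), val in dc.items()]
df = pd.DataFrame(rows)
print(df[['index1', 'index2', 'value']])
```

With the index components as ordinary columns, the per-variable filtering and to_excel steps above work unchanged.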

Tricky str value replacement within PANDAS DataFrame

Problem Overview:
I am attempting to clean stock data loaded from CSV file into Pandas DataFrame. The indexing operation I perform works. If I call print, I can see the values I want are being pulled from the frame. However, when I try to replace the values, as shown in the screenshot, PANDAS ignores my request. Ultimately, I'm just trying to extract a value out of one column and move it to another. The PANDAS documentation suggests using the .replace() method, but that doesn't seem to be working with the operation I'm trying to perform.
Here's a pic of the code and data before and after code is run.
And the for loop (as referenced in the pic):
for i, j in zip(all_exchanges['MarketCap'], all_exchanges['MarketCapSym']):
    if 'M' in i: j = j.replace('n/a','M')
    elif 'B' in i: j = j.replace('n/a','M')
The problem is that j is a string, thus immutable.
You're replacing data, but not in the original dataset.
You have to do it another way, less elegant, without zip (I simplified your test, by the way, since both branches did the same thing):
aem = all_exchanges['MarketCap']
aems = all_exchanges['MarketCapSym']
for i in range(min(len(aem), len(aems))):  # like zip: stop at the shorter of the two
    if 'M' in aem[i] or 'B' in aem[i]:
        aems[i] = aems[i].replace('n/a', 'M')
Now you're replacing in the original dataset.
If both columns are in the same dataframe, all_exchanges, iterate over the rows.
for i, row in all_exchanges.iterrows():
    # get whatever you want from row
    # using the index you should be able to set a value
    all_exchanges.loc[i, 'columnname'] = xyz
That should be the syntax, if I remember correctly ;)
Here is a quite exhaustive tutorial on missing values and pandas. I suggest using fillna():
df['MarketCap'].fillna('M', inplace=True)
df['MarketCapSym'].fillna('M', inplace=True)
Avoid iterating if you can. As already pointed out, you're not modifying the original data. Index on the MarketCap column and perform the replacement as follows.
# overwrites any data in the MarketCapSym column
all_exchanges.loc[all_exchanges['MarketCap'].str.contains('M|B'),
                  'MarketCapSym'] = 'M'
# only replaces 'n/a'
all_exchanges.loc[all_exchanges['MarketCap'].str.contains('M|B'),
                  'MarketCapSym'].replace({'n/a': 'M'}, inplace=True)
Thanks to all who posted. After thinking about your solutions and the problem a bit longer, I realized there might be a different approach. Instead of initializing a MarketCapSym column with 'n/a', I instead created that column as a copy of MarketCap and then extracted anything that wasn't an "M" or "B".
I was able to get the solution down to one line:
all_exchanges['MarketCapSymbol'] = [ re.sub('[$.0-9]', '', i) for i in all_exchanges.loc[:,'MarketCap'] ]
A breakdown of the solution is as follows:
all_exchanges['MarketCapSymbol'] = - make a new column on the DataFrame called 'MarketCapSymbol'.
all_exchanges.loc[:,'MarketCap'] - initialize the values in the new column to those in 'MarketCap'.
re.sub('[$.0-9]', '', i) for i in - since all I want is the 'M' or 'B', apply re.sub() on each element, stripping out [$.0-9] and leaving only the M or B.
Using a list comprehension this way seemed a bit more natural / readable to me in my limited experience with PANDAS. Let me know what you think!
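A runnable version of that one-liner, with a few toy market-cap strings standing in for the CSV data from the question:

```python
import re
import pandas as pd

# Toy stand-in for the loaded stock data
all_exchanges = pd.DataFrame({'MarketCap': ['$1.2B', '$300.5M', '$45M']})

# Strip '$', '.', and digits from each value, leaving only the magnitude letter
all_exchanges['MarketCapSymbol'] = [
    re.sub('[$.0-9]', '', i) for i in all_exchanges.loc[:, 'MarketCap']
]
print(all_exchanges['MarketCapSymbol'].tolist())  # ['B', 'M', 'M']
```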
