I'm new to Python.
I am trying to add a prefix to the serial number elements of a data frame using a for loop, as part of data cleaning/preparation before analysis.
The code is:
a = pd.read_excel('C:/Users/HP/Desktop/WFH/PowerBI/CMM data.xlsx', 'CMM_unclean')
a['Serial Number'] = a['Serial Number'].apply(str)
print(a.iloc[72, 1])
for index, row in a.iterrows():
    if len(row['Serial Number']) == 6:
        row['Serial Number'] = 'SR0' + row['Serial Number']
        print(row['Serial Number'])
print(a.iloc[72, 1])
The output is:
C:\Users\HP\anaconda3\envs\test\python.exe C:/Users/HP/PycharmProjects/test/first.py
101306
SR0101306
101306
I don't understand why this is happening: inside the for loop the value changes, but outside it stays the same.
This will never change the actual dataframe named a.
TL;DR: The rows you get back from iterrows are copies that are no longer connected to the original data frame, so edits don't change your dataframe. However, you can use the index to access and edit the relevant row of the dataframe.
The solution is this:
import pandas as pd

a = pd.read_excel("Book1.xlsx")
a['Serial Number'] = a['Serial Number'].apply(str)
a.head()
#    ID Serial Number
# 0   1        101306
# 1   2       1101306
print(a.iloc[0, 1])
# 101306
for index, row in a.iterrows():
    if len(row['Serial Number']) == 6:
        # use the index and the .loc method to alter the dataframe itself
        a.loc[index, 'Serial Number'] = 'SR0' + row['Serial Number']
print(a.iloc[0, 1])
# SR0101306
In the documentation, I read (emphasis theirs):
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
Maybe this means that in your case a copy is made and no reference is used. So the change applies only to the copy, not to the data in the data frame.
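To see this in isolation, here is a minimal sketch (with a toy frame, not your data) showing that writing to the row returned by iterrows leaves the dataframe untouched:
import pandas as pd
df = pd.DataFrame({'x': [1, 2]})
for index, row in df.iterrows():
    row['x'] = 99           # writes only to the copy returned by iterrows
print(df['x'].tolist())     # [1, 2] -- the original dataframe is unchanged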
Since you're already using apply, you could do this straight inside the function you call apply with:
def fix_serial(n):
    n_s = str(n)
    if len(n_s) == 6:
        n_s = 'SR0' + n_s
    return n_s

a['Serial Number'] = a['Serial Number'].apply(fix_serial)
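Alternatively, to avoid a Python-level function altogether, here is a vectorized sketch using pandas string methods (it assumes the column already holds strings, which your .apply(str) guarantees):
mask = a['Serial Number'].str.len() == 6
a.loc[mask, 'Serial Number'] = 'SR0' + a.loc[mask, 'Serial Number']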
Related
I have two CSV files which I'm using in a loop. In one of the files there is a column called "Availability Score"; is there a way to make the loop iterate through the records in descending order of this column? I thought I could use Ob.sort_values(by=['AvailabilityScore'], ascending=False) to change the order of the dataframe first, so that when the loop starts it will already be in the right order. I've tried this out and it doesn't seem to make a difference.
# import the data
CF = pd.read_csv(r'CustomerFloat.csv')
Ob = pd.read_csv(r'Orderbook.csv')
# convert to dataframes
CF = pd.DataFrame(CF)
Ob = pd.DataFrame(Ob)
# remove subassemblies
Ob.drop(Ob[Ob['SubAssembly'] != 0].index, inplace=True)
# sort the data by their IDs
Ob.sort_values(by=['CustomerFloatID'])
CF.sort_values(by=['FloatID'])
# sort the orderbook by its availability score
Ob.sort_values(by=['AvailabilityScore'], ascending=False)
# loop for urgent values
for i, rowi in CF.iterrows():
    count = 0
    urgent_value = 1
    for j, rowj in Ob.iterrows():
        if rowi['FloatID'] == rowj['CustomerFloatID'] and count < rowi['Urgent Deficit']:
            Ob.at[j, 'CustomerFloatPriority'] = urgent_value
            count += rowj['Qty']
You need to add inplace=True, like this:
Ob.sort_values(by=['AvailabilityScore'],ascending=False, inplace=True)
sort_values() (like most pandas functions nowadays) is not in-place by default. You should assign the result back to the variable that holds the DataFrame:
Ob = Ob.sort_values(by=['CustomerFloatID'], ascending=False)
# ...
BTW, while you can pass inplace=True as an argument to sort_values(), I do not recommend it. Generally speaking, inplace=True is often considered bad practice.
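For example, you can rebind the name instead, optionally chaining further steps (a sketch; the reset_index here is just an assumption that you want sequential row labels after the sort):
Ob = (Ob.sort_values(by=['AvailabilityScore'], ascending=False)
        .reset_index(drop=True))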
I have created a new dataframe:
import pandas as pd
### creating a dataframe:
bb = pd.DataFrame(columns=['INDate', 'INCOME', 'EXDate', 'EXPENSE'])
bb.to_excel('/py/deleteafter/bb_black_book.xlsx')
bb.head()
I can see the new dataframe, without rows:
Then I need to add a new value to one of the columns in a loop.
income_value = message.text  ### depends on the user input
for i in range(len(bb)):
    print(bb['INCOME'][i])
    if bb['INCOME'][i] != 'NaN':
        i += 1
        # print('NOT_EMPTY_CELL')
    else:
        # print('ive found an empty cell=)')
        bb['INCOME'][i] = income_value
        break
And here I hit an error, because my df has zero length:
print(range(len(bb)))
range(0, 0)
I'm not sure that my solution is right, and I suspect there is a simpler one. Overall, my main idea is:
How can I find the next empty cell in a certain column (in my case, the 'INCOME' column) and add the value to this FREE cell?
Or more simply: I need to add a value to the next unfilled cell. =)
I will be glad for your replies.
To find the last valid value you can use last_valid_index(). It returns None when the column has no valid values, so you could do:
idx = bb["INCOME"].last_valid_index()
if idx is None:
    bb.loc[0, "INCOME"] = income_value
else:
    bb.loc[idx + 1, "INCOME"] = income_value
There is a much simpler way to append a row, e.g.:
row = {'INCOME': 20000, 'EXDate': '14/09/2020'}
bb = bb.append(row, ignore_index=True)
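Note that in pandas 2.0 and later DataFrame.append has been removed, so on newer installs the equivalent pattern uses pd.concat:
row = {'INCOME': 20000, 'EXDate': '14/09/2020'}
bb = pd.concat([bb, pd.DataFrame([row])], ignore_index=True)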
If you want to add data in one column only, at an index label that doesn't exist yet, use .loc.
Change the line
bb['INCOME'][i] = income_value
to
bb.loc[i, 'INCOME'] = income_value
and it should work fine.
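A quick sketch of why this works even though your frame starts with zero rows: .loc performs "setting with enlargement" when the row label does not exist yet:
import pandas as pd
bb = pd.DataFrame(columns=['INDate', 'INCOME', 'EXDate', 'EXPENSE'])
bb.loc[0, 'INCOME'] = 42   # creates row 0 on the fly
print(len(bb))             # 1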
I made a simple DataFrame named middle_dataframe in Python, which looks like this and only has one row of data:
[image: display of the existing dataframe]
And I want to append a new dataframe, generated in each pass of a loop, to this existing dataframe. This is my program:
k = 2
for k in range(2, 32021):
    header = whole_seq_data[k]
    if header.startswith('>'):
        id_name = get_ucsc_ids(header)
        (chromosome, start_p, end_p) = get_chr_coordinates_from_string(header)
        if whole_seq_data[k + 1].startswith('[ATGC]'):
            seq = whole_seq_data[k + 1]
            df_temp = pd.DataFrame(
                {
                    "ucsc_id": [id_name],
                    "chromosome": [chromosome],
                    "start_position": [start_p],
                    "end_position": [end_p],
                    "whole_sequence": [seq]
                }
            )
            middle_dataframe.append(df_temp)
    k = k + 2
My iterations in the for loop seem to be fine, and I checked that the variables stored the correct values after applying the regular expressions. But middle_dataframe doesn't show any changes, and I cannot figure out why.
The DataFrame.append method returns the result of the append, rather than appending in-place (link to the official docs on append). The fix should be to replace this line:
middle_dataframe.append(df_temp)
with this:
middle_dataframe = middle_dataframe.append(df_temp)
Depending on how that works with your data, you might also need to pass in the parameter ignore_index=True.
The docs warn that appending one row at a time to a DataFrame can be more computationally intensive than building a python list and converting it into a DataFrame all at once. That's something to look into if your current approach ends up too slow for your purposes.
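For illustration, a minimal sketch of that list-first pattern with made-up values (substitute the fields you parse from whole_seq_data):
import pandas as pd
records = []
for i in range(3):                       # stand-in for your parsing loop
    records.append({
        "ucsc_id": f"uc{i:03d}",         # hypothetical values
        "chromosome": "chr1",
        "start_position": i * 100,
        "end_position": i * 100 + 50,
        "whole_sequence": "ATGC",
    })
middle_dataframe = pd.DataFrame(records)  # build the frame once, at the end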
This piece of code returns 10, which is what I would expect:
for i in range(5):
    if i == 0:
        output = i
    else:
        output += i
print(output)
Why does this code only return the dataframe created in the if branch of the statement (i.e. when i == 0)?
for i in range(5):
    if i == 0:
        output = pd.DataFrame(np.random.randn(5, 2))
    else:
        output.append(pd.DataFrame(np.random.randn(5, 2)))
print('final', output)
The above is an MVCE of an issue I am having with the below code. More context if interested:
for index, row in per_dmd_df.iterrows():
    if index == 0:
        output = pd.DataFrame(dmd_flow(row.balance, dt.date(2018, 1, 31), 12, .05, 0, .03, 'monthly'))
    else:
        output.append(pd.DataFrame(dmd_flow(row.balance, dt.date(2018, 1, 31), 12, .05, 0, .03, 'monthly')))
print(output)
I have an input DataFrame with one row per product, holding balances, rates, etc. I want the data in each row to be passed to the dmd_flow function (which returns a generator that, when called within pd.DataFrame(), yields a 12-month forward-looking balance forecast) to forecast changes in the balance of each product based on the parameters in the dmd_flow function. I would then add up all of the changes to come up with the net change in balance (done using a group by on the date and summing balances).
Each call like this creates the new DataFrame I need:
pd.DataFrame(dmd_flow(row.balance, dt.date(2018,1,31),12,.05,0,.03,'monthly'))
but the append doesn't work to expand the output DataFrame.
Because (unlike list.append) DataFrame.append is not an in-place operation. See the docs for more information. You're supposed to assign the result back:
df = df.append(...)
Although, in this case, I'd advise using something like apply if you are unable to vectorize your function:
df['balance'].apply(
dmd_flow, args=(dt.date(2018,1,31), 12, .05, 0, .03, 'monthly')
)
This hides the loop, so you don't need to worry about the index. Make sure your function is written in such a way as to support scalar arguments.
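For illustration, a minimal sketch of the scalar-argument pattern (flow_for_balance is a hypothetical stand-in for dmd_flow):
import pandas as pd
def flow_for_balance(balance, rate):
    # receives one scalar balance per call
    return balance * rate
df = pd.DataFrame({'balance': [100.0, 250.0]})
print(df['balance'].apply(flow_for_balance, args=(.05,)))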
I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say:
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group only the first group should be left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping over [my_df1, my_df2], but how do you do that in Python?), I built a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []
for Sample in SampleList:
    for n in Sample[1]['Group']:
        # This is inelegant, but trying to work with Sample[1] in the for doesn't work
        df = Sample[0].loc[Sample[0]['Group'] == n]
        if df['Value'].max() > my_max:
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor does it modify my DataFrame (but the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see if any of the 'Values' in your input data meet your specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'

my_df1 = pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]], columns=['Group','Value'])
my_df1_Group = pd.DataFrame([[1,57],[2,63]], columns=['Group','Group_Value'])
grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x, grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
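From there, if you want to actually discard the flagged groups, one way (a sketch) is to filter on the new column and drop the helper afterwards:
my_df1_Group = my_df1_Group[my_df1_Group['Bad_Row'] == 'Good Row'].drop(columns='Bad_Row')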
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df):
    if (grouped_df.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)