Pandas not saving changes when iterating rows - python

Let's say I have the following dataframe:
   Shots  Goals   StG
0      1      2   0.5
1      3      1  0.33
2      4      4     1
Now I want to multiply the variable Shots by a random value (multiplier in the code) and recalculate the StG variable, which is nothing but Shots/Goals. The code I used is:
for index,row in df.iterrows():
    multiplier = (np.random.randint(1,5+1))
    row['Shots'] *= multiplier
    row['StG']=float(row['Shots'])/float(row['Goals'])
Then I saved the .csv and it was identical to the original one, so after the for loop I simply used print(df) and obtained:
   Shots  Goals   StG
0      1      2   0.5
1      3      1  0.33
2      4      4     1
If I print the values row by row during the for iteration I can see them change, but it's as if they are not saved in the df.
I think it is because I'm only accessing the values, not the actual dataframe.
I should add something like df.row[], but that returns "DataFrame has no row property".
Thanks for the help.
____EDIT____
for index,row in df.iterrows():
    multiplier = (np.random.randint(1,5+1))
    row['Impresions']*=multiplier
    row['Clicks']*=(np.random.randint(1,multiplier+1))
    row['Ctr']= float(row['Clicks'])/float(row['Impresions'])
    row['Mult']=multiplier
    #print (row['Clicks'],row['Impresions'],row['Ctr'],row['Mult'])
The main condition is that the number of Clicks can never be higher than the number of Impressions.
Then I recalculate the ratio Clicks/Impressions as the CTR.
I am not sure if multiplying the entire column is the best choice to maintain the condition that Impressions >= Clicks for each row, hence I went row by row.

From the pandas docs on iterrows() (pandas.DataFrame.iterrows):
"You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect."
The good news is you don't need to iterate over rows - you can perform the operations on columns:
# Generate an array of random integers of same length as your DataFrame
multipliers = np.random.randint(1, 5+1, size=len(df))
# Multiply corresponding elements from df['Shots'] and multipliers
df['Shots'] *= multipliers
# Recalculate df['StG']
df['StG'] = df['Shots']/df['Goals']
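The case from the question's edit, where Clicks must never exceed Impressions, can be handled the same way. The following is only a sketch under assumptions: the sample values are made up, the column names (including the spelling Impresions) are taken from the question, and a seeded NumPy Generator is used instead of np.random.randint purely for repeatability. Because each row's click multiplier is drawn from 1 up to that row's impression multiplier, the condition is preserved whenever it held in the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names follow the question's edit.
df = pd.DataFrame({"Impresions": [10, 20, 30], "Clicks": [1, 5, 30]})

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# One impression multiplier per row, drawn from 1..5 inclusive.
multipliers = rng.integers(1, 5 + 1, size=len(df))
# Per-row click multiplier drawn from 1..multiplier, so it never exceeds it.
click_multipliers = rng.integers(1, multipliers + 1)

df["Impresions"] *= multipliers
df["Clicks"] *= click_multipliers
df["Ctr"] = df["Clicks"] / df["Impresions"]
df["Mult"] = multipliers
```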

Define a function that returns a series:
def f(x):
    m = np.random.randint(1,5+1)
    return pd.Series([x.Shots * m, x.Shots/x.Goals * m])
Apply the function to the data frame row-wise. It returns another data frame, which can be used to replace columns in the existing data frame or to create new ones:
df[['Shots', 'StG']] = df.apply(f, axis=1)
This approach is very flexible as long as the new column values depend only on other values in the same row.
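If an explicit row-by-row loop is still preferred, the writes have to go through the DataFrame itself rather than through the row object, which may be a copy. A minimal sketch, assuming the sample data from the question and using df.at for single-cell access:

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"Shots": [1, 3, 4], "Goals": [2, 1, 4], "StG": [0.5, 0.33, 1.0]})

for index in df.index:
    multiplier = np.random.randint(1, 5 + 1)
    # Write through the DataFrame by label, not through a row copy.
    df.at[index, "Shots"] *= multiplier
    df.at[index, "StG"] = df.at[index, "Shots"] / df.at[index, "Goals"]

print(df)
```

This is usually slower than the column-wise version, so it mainly makes sense for small frames or logic that is hard to vectorise.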

Related

Apply if else condition in specific pandas column by location

I am trying to apply a condition to a pandas column by location and am not quite sure how. Here is some sample data:
data = {'Pop': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967],
        'Pop2': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967]}
PopDF = pd.DataFrame(data)
remainder = 6
#I would like to subtract 1 from PopDF['Pop2'] column cells 0-remainder.
#The remaining cells in the column I would like to stay as is (retain original pop values).
PopDF['Pop2']= PopDF['Pop2'].iloc[:(remainder)]-1
PopDF['Pop2'].iloc[(remainder):] = PopDF['Pop'].iloc[(remainder):]
The first line subtracts 1 in the correct locations, but the remaining cells become NaN. The second line of code does not work; the error is:
ValueError: Length of values (1) does not match length of index (8)
Instead of selecting the first N rows and subtracting from them, subtract 1 from the entire column and assign only the first 6 values of the result. Note that .loc slices by label and includes the end label, so the slice has to stop at remainder - 1:
PopDF.loc[:remainder - 1, 'Pop2'] = PopDF['Pop2'] - 1
Output:
>>> PopDF
      Pop    Pop2
0  728375  728374
1  733355  733354
2  695395  695394
3  734658  734657
4  732811  732810
5  789396  789395
6  727761  727761
7  751967  751967
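The NaN values in the question appear because assigning a shorter Series to a column aligns on the index, and rows missing from the right-hand side are filled with NaN. A positional alternative, shown here only as a sketch, is to update the slice in place with .iloc, whose slices exclude the end position, so :remainder covers exactly the first remainder rows:

```python
import pandas as pd

data = {'Pop': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967],
        'Pop2': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967]}
PopDF = pd.DataFrame(data)
remainder = 6

# .iloc needs an integer column position, so look it up from the column name.
pop2_pos = PopDF.columns.get_loc('Pop2')
# Subtract 1 from rows 0..remainder-1 only; the other rows are left untouched.
PopDF.iloc[:remainder, pop2_pos] -= 1

print(PopDF)
```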

Calculate the mean from excel sheet for specific rows

Hello guys! I am struggling to calculate the mean of certain rows from an Excel sheet using Python. In particular, I would like to calculate the mean for every three rows starting from the first three and then moving to the next three and so on. My Excel sheet contains 156 rows of data.
My data sheet looks like this:
And this is my code:
import numpy
import pandas as pd
df = pd.read_excel("My Excel.xlsx")
x = df.iloc[[0,1,2], [9,10,11]].mean()
print(x)
To sum up, I am trying to calculate the mean of Part 1 Measurements 1 (rows 1,2,3) and the mean of Part 2 Measurements 1 (rows 9,10,11) using one line of code, or some kind of index. I am expecting to receive two lists of numbers, one that stands for the mean of Part 1 Measurement 1 (rows 1,2,3) and the other for the mean of Part 2 Measurements 1 (rows 10,11,12). I am also familiar with the fact that Python counts row number one as 0. The index should have a form of n+1.
Thank you in advance.
You could (e.g.) generate a list for each mean you want to calculate:
x1, x2 = list(df.iloc[[0,1,2]].mean()), list(df.iloc[[9,10,11]].mean())
Or you could also generate a list of lists:
x = [list(df.iloc[[0,1,2]].mean()), list(df.iloc[[9,10,11]].mean())]
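For the more general request in the question, a mean for every consecutive block of three rows, one possible sketch is to group rows by their position divided by three. The DataFrame below is a made-up stand-in for the spreadsheet (the real column names are not shown in the question), and it assumes the columns are numeric:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel data: 6 rows, 2 numeric columns.
df = pd.DataFrame({"m1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   "m2": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]})

# Rows 0-2 get group 0, rows 3-5 get group 1, and so on.
group_ids = np.arange(len(df)) // 3
block_means = df.groupby(group_ids).mean()

print(block_means)
```

With 156 rows this would give 52 rows of means, one per block of three.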

How to optimally update cells based on previous cell value / How to elegantly spread values of cell to other cells?

I have a "large" DataFrame table with index being country codes (alpha-3) and columns being years (1900 to 2000) imported via a pd.read_csv(...) [as I understand, these are actually string so I need to pass it as '1945' for example].
The values are 0,1,2,3.
I need to "spread" these values until the next non-0 for each row.
example : 0 0 1 0 0 3 0 0 2 1
becomes: 0 0 1 1 1 3 3 3 2 1
I understand that I should not use iterations (current implementation is something like this, as you can see, using 2 loops is not optimal, I guess I could get rid of one by using apply(row) )
def spread_values(df):
    for idx in df.index:
        previous_v = 0
        for t_year in range(min_year, max_year):
            current_v = df.loc[idx, str(t_year)]
            if current_v == 0 and previous_v != 0:
                df.loc[idx, str(t_year)] = previous_v
            else:
                previous_v = current_v
However, I am told I should use the apply() function, vectorisation or a list comprehension instead, because this is not optimal.
The apply function, however, regardless of the axis, does not let me dynamically get the index/column (which I need in order to conditionally update the cell). I think the core reason I can't make the vectorised or list-comprehension options work is that I do not have a small fixed set of column names but a wide range (all the examples I see use a handful of named columns...).
What would be a more optimal / more elegant solution here?
Or are DataFrames not suited to my data at all? What should I use instead?
You can use df.replace(to_replace=0, method='ffill'). This fills every zero in your dataframe (except zeros at the start of a column) with the previous non-zero value in that column.
If you want to do it row-wise, the .replace() function unfortunately does not accept an axis argument. But you can transpose your dataframe, replace the zeros and transpose it back: df.T.replace(0, method='ffill').T
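An alternative that does not rely on the method argument of replace (which, as far as I know, is deprecated in recent pandas releases) is to turn the zeros into missing values and forward-fill along the rows. This is a sketch on the example row from the question; the country code and year labels are made up:

```python
import pandas as pd

# One row of example values from the question; index and columns are hypothetical.
years = [str(y) for y in range(1900, 1910)]
df = pd.DataFrame([[0, 0, 1, 0, 0, 3, 0, 0, 2, 1]], index=["AAA"], columns=years)

spread = (
    df.mask(df == 0)   # zeros become NaN
      .ffill(axis=1)   # carry the last non-zero value forward along each row
      .fillna(0)       # leading zeros have nothing to carry, so restore them
      .astype(int)
)

print(spread)
```

No explicit column names are needed, so this should work the same for the full 1900 to 2000 range.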

Iterating on Pandas DataFrame to pass data into API

I am creating a script that reads a GoogleSheet, transforms the data and passes it into my ERP API to automate the creation of Purchase Orders.
I have got as far as outputting the data in a dataframe but I need help on how I can iterate through this and pass it in the correct format to the API.
DataFrame Example (dfRow):
productID vatrateID amount price
0 46771 2 1 1.25
1 46771 2 1 2.25
2 46771 2 2 5.00
Formatting of the API data:
vatrateID1=dfRow.vatrateID[0],
amount1=dfRow.amount[0],
price1=dfRow.price[0],
productID1=dfRow.productID[0],
vatrateID2=dfRow.vatrateID[1],
amount2=dfRow.amount[1],
price2=dfRow.price[1],
productID2=dfRow.productID[1],
vatrateID3=dfRow.vatrateID[2],
amount3=dfRow.amount[2],
price3=dfRow.price[2],
productID3=dfRow.productID[2],
I would like to create a function that would iterate thru the DataFrame and return the data in the correct format to pass to the API.
I'm new at Python and struggle most with iterating / loops so any help is much appreciated!
First, you can always loop over the rows of a dataframe using df.iterrows(). Each step through this iterator yields a tuple containing the row index and the row contents as a pandas Series object. So, for example, this would do the trick:
for ix, row in df.iterrows():
    for column in row.index:
        print(f"{column}{ix}={row[column]}")
You can also do it without resorting to loops. This is great if you need performance, but if performance isn't a concern then it is really just a matter of taste.
# first, "melt" the data, which puts all of the variables on their own row
x = df.reset_index().melt(id_vars='index')
# now join the columns together to produce the rows that we want
s = x['variable'] + x['index'].map(str) + '=' + x['value'].map(str)
print(s)
0 productID0=46771.0
1 productID1=46771.0
2 productID2=46771.0
3 vatrateID0=2.0
...
10 price1=2.25
11 price2=5.0
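Both variants above number the rows from 0 and only print strings, while the format in the question starts at 1 and passes values as named arguments. One way this could be bridged, as a sketch, is to build a dictionary of keyword arguments. The helper name build_payload is made up for illustration, and how the dictionary is handed to the ERP API is an assumption, since the actual call is not shown in the question:

```python
import pandas as pd

# Example data from the question.
dfRow = pd.DataFrame({
    "productID": [46771, 46771, 46771],
    "vatrateID": [2, 2, 2],
    "amount": [1, 1, 2],
    "price": [1.25, 2.25, 5.00],
})

def build_payload(frame):
    """Flatten a DataFrame into {'<column><row number>': value}, rows numbered from 1."""
    payload = {}
    for position, record in enumerate(frame.to_dict("records"), start=1):
        for column, value in record.items():
            payload[f"{column}{position}"] = value
    return payload

payload = build_payload(dfRow)
print(payload)
# The dict could then be unpacked into the API call as keyword arguments,
# e.g. some_api_call(**payload), where some_api_call is hypothetical.
```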

Calculating running total

I have data frame df and I would like to keep a running total of names that occur in a column of that data frame. I am trying to calculate the running total column:
name running total
a 1
a 2
b 1
a 3
c 1
b 2
There are two ways I thought to do this:
1. Loop through the dataframe and use a separate dictionary containing name and current count. The current count for the relevant name would increase by 1 each time the loop is carried out, and that value would be copied into my dataframe.
2. Change the count in field for each value in the dataframe. In Excel I would use a countif combined with a drag-down formula A$1:A1 to fix the first value but make the second value relative, so that the range I am looking in changes with the row.
The problem is I am not sure how to implement these. Does anyone have any ideas on which is preferable and how these could be implemented?
@bunji is right. I'm assuming you're using pandas and that your data is in a dataframe called df. To add the running totals to your dataframe, you could do something like this:
df['running total'] = df.groupby(['name']).cumcount() + 1
The + 1 gives you a 1 for your first occurrence instead of 0, which is what you would get otherwise.
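As a quick check against the table in the question, here is a small self-contained sketch that rebuilds the name column and applies the same expression:

```python
import pandas as pd

# The name column from the question's example table.
df = pd.DataFrame({"name": ["a", "a", "b", "a", "c", "b"]})

# cumcount numbers the rows within each name group starting at 0, in original row order.
df["running total"] = df.groupby(["name"]).cumcount() + 1

print(df)
```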
