How to iterate through cells in Excel from Python

I want to write the average of two columns (Max and Min) into another column (Mean) for each row.
Specifically, as it iterates through the rows, it should take the mean of the first two cells and write it into the cell of the third column.
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
# sheet._cell_overwrite_ok = True  # 'sheet' is not defined at this point
df = pd.read_excel('tempMean.xlsx', sheet_name='tempMeanSheet')
listMax = df['Max']
listMin = df['Min']
listMean = df['Mean']
for index, row in df.iterrows():
    print('Max', row.Max, 'Min', row.Min, 'Mean', row.Mean)
Current Results:
Max 29.7 Min 20.5 Mean nan
Max 29.2 Min 20.2 Mean nan
Max 29.1 Min 21.2 Mean nan
Results I want:
Max 29.7 Min 20.5 Mean 24.95
Max 29.2 Min 20.2 Mean 24.7
Max 29.1 Min 21.2 Mean 25.15
I have been able to iterate through the rows, as seen in the code.
However, I am not sure how to apply the equation to find the mean for each of these rows.
Consequently, the Mean column has no data.
Let me know if anything doesn't make sense.

Try this:
df = pd.read_excel('tempMean.xlsx', sheet_name='tempMeanSheet')
mean = [(row["Min"] + row["Max"]) / 2 for index, row in df.iterrows()]
df = df.assign(Mean=mean)
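If you also need the result written back into the Excel file, here is a minimal follow-up sketch, assuming the same file and sheet names as in the question:
# Rewrite the sheet so the output file contains the filled-in Mean column
df.to_excel('tempMean.xlsx', sheet_name='tempMeanSheet', index=False)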

Consider calculating the column beforehand, adding dummy columns for your Max, Min, and Mean labels, and outputting with to_string, avoiding any loops:
# VECTORIZED CALCULATION OF MEAN
df['Mean'] = (df['Max'] + df['Min']) / 2
# ADD LABEL COLUMNS AND RE-ORDER COLUMNS
df = (df.assign(MaxLabel='Max', MinLabel='Min', MeanLabel='Mean')
        .reindex(['MaxLabel', 'Max', 'MinLabel', 'Min', 'MeanLabel', 'Mean'], axis='columns')
      )
# OUTPUT TO SCREEN IN ONE CALL
print(df.to_string())

Related

Issue in executing a specific type of nested 'for' loop on columns of a pandas dataframe

I have a pandas dataframe with values like below, though in reality I am working with many more columns and historical data.
AUD USD JPY EUR
0 0.67 1 140 1.05
I want to iterate over the columns to create a dataframe with columns AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR and JPYEUR, where, for example, AUDUSD is calculated as the product of the AUD column and the USD column.
I tried the below:
for col in df:
    for cols in df:
        cf[col+cols] = df[col] * df[cols]
But it generates a table with unnecessary values like AUDAUD and USDUSD, and duplicate values like AUDUSD and USDAUD. I think if I could somehow set "cols = col+1 till end of df" in the second for loop, I should be able to resolve the issue, but I don't know how to do that.
The result I am looking for is a table with the columns below and their values:
AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR, JPYEUR
You can use itertools.combinations with pandas.Series.mul and pandas.concat.
Try this:
from itertools import combinations

combos = list(combinations(df.columns, 2))

out = pd.concat([df[col[1]].mul(df[col[0]]) for col in combos], axis=1, keys=combos)

out.columns = out.columns.map("".join)
# Output:
print(out)
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
# Used input :
df = pd.DataFrame({'AUD': [0.67], 'USD': [1], 'JPY': [140], 'EUR': [1.05]})
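If you prefer to build the result frame in one go, an equivalent sketch of the same idea using a dict comprehension over the column pairs:
from itertools import combinations

# Each pair (a, b) appears only once, so AUDAUD and the duplicate USDAUD never show up
out = pd.DataFrame({a + b: df[a] * df[b] for a, b in combinations(df.columns, 2)})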
I thought it intuitive that your first instinct was an inner/outer loop, and I think this solution works in the same spirit:
# Added a second row for testing
df = pd.DataFrame(
    {'AUD': [0.67, 0.91], 'USD': [1, 1], 'JPY': [140, 130], 'EUR': [1.05, 1]},
)
# Instantiate the second DataFrame
cf = pd.DataFrame()
# Use the integer index of the columns
for i in range(len(df.columns)):
    # Start the inner index at i + 1, so you aren't looking at the same column twice
    # Also, limit the range to the length of your columns
    for j in range(i + 1, len(df.columns)):
        print(f'{df.columns[i]}' + f'{df.columns[j]}')  # VERIFY
        # Create a variable of the column names mashed together
        combine = f'{df.columns[i]}' + f'{df.columns[j]}'
        # Assign the rows to be a product of the mashed column series
        cf[combine] = df[df.columns[i]] * df[df.columns[j]]
print(cf)  # VERIFY
The console log looks like this:
AUDUSD
AUDJPY
AUDEUR
USDJPY
USDEUR
JPYEUR
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
1 0.91 118.3 0.9100 130 1.00 130.0

Concatenate arrays into a single table using pandas

I have a .csv file; I group it by year so that it gives me the maximum, minimum, and average values as a result:
import pandas as pd
DF = pd.read_csv("PJME_hourly.csv")
for i in range(2002, 2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW": ['max', 'min', 'mean']})
    print(i, pd.concat([dateframe], axis=0, sort=False))
The output is as follows:
2002 PJME_MW
max 55934.000000
min 19247.000000
mean 31565.617106
2003 PJME_MW
max 53737.000000
min 19414.000000
mean 31698.758621
2004 PJME_MW
max 51962.000000
min 19543.000000
mean 32270.434867
I would like to know how I can join it all into a single column (PJME_MW), where each group of operations (max, min, mean) is identified by the year it corresponds to.
If you convert the dates with to_datetime(), you can group them using the dt.year accessor:
df = pd.read_csv('PJME_hourly.csv')
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])
Toy example:
df = pd.DataFrame({'Datetime': ['2019-01-01','2019-02-01','2020-01-01','2020-02-01','2021-01-01'], 'PJME_MV': [3,5,30,50,100]})
# Datetime PJME_MV
# 0 2019-01-01 3
# 1 2019-02-01 5
# 2 2020-01-01 30
# 3 2020-02-01 50
# 4 2021-01-01 100
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])
# PJME_MV
# min max mean
# Datetime
# 2019 3 5 4
# 2020 30 50 40
# 2021 100 100 100
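If you really want everything in a single PJME column, labeled by year and statistic, one possible follow-up (a sketch, reusing the toy df above) is to stack the aggregated frame:
single = df.groupby(df.Datetime.dt.year)['PJME_MV'].agg(['min', 'max', 'mean']).stack()
print(single)
# 2019  min      3.0
#       max      5.0
#       mean     4.0
# ...
Each value now sits in one Series, with a (year, statistic) index identifying it.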
The code could be optimized, but here is how it works now. Change this part of your code:
for i in range(2002, 2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW": ['max', 'min', 'mean']})
    print(i, pd.concat([dateframe], axis=0, sort=False))
Use this instead:
aggs = ['max', 'min', 'mean']
df_group = df.groupby('Datetime')['PJME_MW'].agg(aggs).reset_index()
out_columns = ['agg_year', 'PJME_MW']
out = []
for agg in aggs:
    # Build a fresh frame each iteration so the appended frames stay independent
    aux = pd.DataFrame(columns=out_columns)
    aux['agg_year'] = agg + '_' + df_group['Datetime']
    aux['PJME_MW'] = df_group[agg]
    out.append(aux)
df_out = pd.concat(out)
Edit: Concatenation form has been changed
Final edit: I didn't understand the whole problem, sorry. You don't need the code after the groupby function.

Python/Pandas For Loop Time Series

I am working with panel time-series data and am struggling to create a fast for loop to sum up the past 50 numbers at the current i. The data is around 600k rows, and it starts to churn at around 30k. Is there a way to use pandas or NumPy to do the same in a fraction of the time?
The change column is of type float, with 4 decimals.
Index Change
0 0.0410
1 0.0000
2 0.1201
... ...
74327 0.0000
74328 0.0231
74329 0.0109
74330 0.0462
SEQ_LEN = 50
for i in range(SEQ_LEN, len(df)):
    df.at[i, 'Change_Sum'] = sum(df['Change'][i-SEQ_LEN:i])
Any help would be highly appreciated! Thank you!
I tried this with 600k rows and the average time was
20.9 ms ± 1.35 ms
This will return a series with the rolling sum of the last 50 Change values in the df:
df['Change'].rolling(50).sum()
you can add it to a new column like so:
df['change50'] = df['Change'].rolling(50).sum()
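One caveat worth noting: the original loop sums df['Change'][i-SEQ_LEN:i], which excludes row i itself, while rolling(50).sum() includes the current row. If you need to reproduce the loop exactly, a small adjustment is to shift the result by one (stored here in an illustrative change50_prev column):
# Sum of the previous 50 rows, not counting the current one
df['change50_prev'] = df['Change'].rolling(50).sum().shift(1)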
Disclaimer: This solution cannot compete with .rolling(). Also, if this is a .groupby() case, just do df.groupby("group")["Change"].rolling(50).sum() and then reset the index. Therefore, please accept the other answer.
The explicit for loop can be avoided by translating your recursive partial sum into a difference of cumulative sums (cumsum). The formula:
Sum[x-50:x] = Sum[:x] - Sum[:x-50] = Cumsum[x] - Cumsum[x-50]
Code
For showcase purposes, I have shortened len(df["Change"]) to 10 and SEQ_LEN to 5. A million records completed almost immediately this way.
import pandas as pd
import numpy as np
# data
SEQ_LEN = 5
np.random.seed(111) # reproducibility
df = pd.DataFrame(
    data={
        "Change": np.random.normal(0, 1, 10)  # 10 rows for the showcase
    }
)
# step 1. Do cumsum
df["Change_Cumsum"] = df["Change"].cumsum()
# Step 2. calculate diff of cumsum: Sum[x-50:x] = Sum[:x] - Sum[:x-50]
df["Change_Sum"] = np.nan # or zero as you wish
df.loc[SEQ_LEN:, "Change_Sum"] = df["Change_Cumsum"].values[SEQ_LEN:] - df["Change_Cumsum"].values[:(-SEQ_LEN)]
# add idx=SEQ_LEN-1
df.at[SEQ_LEN-1, "Change_Sum"] = df.at[SEQ_LEN-1, "Change_Cumsum"]
Output
df
Out[30]:
Change Change_Cumsum Change_Sum
0 -1.133838 -1.133838 NaN
1 0.384319 -0.749519 NaN
2 1.496554 0.747035 NaN
3 -0.355382 0.391652 NaN
4 -0.787534 -0.395881 -0.395881
5 -0.459439 -0.855320 0.278518
6 -0.059169 -0.914489 -0.164970
7 -0.354174 -1.268662 -2.015697
8 -0.735523 -2.004185 -2.395838
9 -1.183940 -3.188125 -2.792244
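As a quick sanity check (a sketch assuming the same df as above), the cumsum difference agrees with the built-in rolling sum from index SEQ_LEN-1 onward, where full windows exist:
# Both series cover full windows only from row SEQ_LEN-1 on
expected = df["Change"].rolling(SEQ_LEN).sum()
assert np.allclose(df["Change_Sum"].iloc[SEQ_LEN-1:], expected.iloc[SEQ_LEN-1:])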

A pandas-y way to do simple calculations on rows selected from DataFrame

Suppose I have the following data:
import pandas as pd
boxes = {'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
         'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle', 'Square', 'Square', 'Square', 'Rectangle'],
         'Price': [10, 15, 5, 5, 10, 15, 15, 5]
         }
df = pd.DataFrame(boxes, columns=['Color', 'Shape', 'Price'])
How do I find the average price of every color (ignoring shape) without for-loops? Or the difference between the maximum and minimum price of every color?
In short, I want the following outcome:
Mean Range
Green 10.00 10
Blue 7.50 5
Red 11.67 10
This example has only three colors, but if we had 1000 colors, is the method still the same/the most efficient one?
You can use the following:
import numpy as np

df = df.groupby('Color')['Price'].agg([np.mean, np.ptp])
df.columns = ['Mean', 'Range']
And you will get the expected result.
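If you also want the mean rounded to two decimals, as in the desired output, one extra cosmetic line:
df['Mean'] = df['Mean'].round(2)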
Pandas groupby can use multiple aggregation functions. The easiest way to proceed is by using the dataframe's native functions such as .mean() or .max(). One can also use .agg() and pass a list of functions to apply, such as NumPy functions, or even a lambda function.
Group by the color column, get the aggregates, and for the range, take the difference between the max and min:
result = (df.groupby("Color")[["Price"]]
          .agg(["mean", "max", "min"])
          .droplevel(0, axis=1)
          # access the max column with brackets
          # rather than dot access
          # as it is a built-in function
          .assign(Range=lambda x: x['max'] - x['min'],
                  mean=lambda x: x['mean'].round(2)
                  )
          .iloc[:, [0, -1]]
          )
result
result
mean Range
Color
Blue 7.50 5
Green 10.00 10
Red 11.67 10
g = df.groupby('Color')['Price']
df = pd.concat([g.mean(), g.max() - g.min()], axis=1)
df.columns = ['Mean', 'Range']
print(df)
Prints:
Mean Range
Color
Blue 7.500000 5
Green 10.000000 10
Red 11.666667 10
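For completeness, a possible alternative (assuming pandas 0.25 or later) is named aggregation, which produces the desired column names directly without a rename step:
result = df.groupby('Color')['Price'].agg(
    Mean='mean',
    Range=lambda s: s.max() - s.min(),  # peak-to-peak range per color
)
print(result.round(2))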

During Transpose of Spark dataframe. Column names are not getting converted to Row headers

I have a DataFrame named 'tbl':
summary col1 col2 col3 col200
count 20000 20000 20000 20000
mean 3.02 789.83 8379.02 20.03
std dev 1.02 2.03 0.8 0.56
I did the transpose using the code below:
header = [i[0] for i in tbl.select("summary").rdd.map(tuple).collect()]
tt = tbl.select([c for c in tbl.columns if c not in ["summary"]])
rtt = tt.rdd.map(tuple)
# Index each row, then emit (row_index, col_index, value) triples
rtt1 = rtt.zipWithIndex().flatMap(lambda xi: [(xi[1], j, e) for (j, e) in enumerate(xi[0])])
# Key by column index, group, and sort so each group becomes one transposed row
rtt2 = rtt1.map(lambda ije: (ije[1], (ije[0], ije[2]))).groupByKey().sortByKey()
rtt3 = rtt2.map(lambda ix: sorted(list(ix[1]), key=lambda ie: ie[0]))
rtt4 = rtt3.map(lambda x: [y for (i, y) in x])
Question:
On the transpose I am able to generate columns such as:
count Mean Std dev
20000 3.02 1.02
20000 789.83 2.03
But this transformation loses the original column names, so I cannot tell which variable each transposed row belongs to. I have a DataFrame of 3 x 42000 dimensions, and all columns are unique; I am looking for a way to carry the column names over as row headers in the transpose.
How about using Pandas:
df = sc.parallelize([(-1.0, 2.0, -3.0), (4.4, 5.1, -6.4)]).toDF()
pdf = df.describe().toPandas()
pdf.T[1:].rename(columns=pdf.T.iloc[0])
count mean stddev min max
_1 2 1.7000000000000002 3.818376618407357 -1.0 4.4
_2 2 3.55 2.192031021678297 2.0 5.1
_3 2 -4.7 2.4041630560342617 -6.4 -3.0
It is not like you need Spark to handle 120,000 values...
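If the transposed result needs to go back into Spark afterwards, a minimal sketch (assuming an active SparkSession named spark); createDataFrame accepts a pandas DataFrame, and reset_index() keeps the original column names as an ordinary column:
transposed = pdf.T[1:].rename(columns=pdf.T.iloc[0]).reset_index()
sdf = spark.createDataFrame(transposed)
sdf.show()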
