getting column name using iloc in dataframe - python

Is there a way to get the column name as a value using iloc or other functions?
I have a for loop here:
for i in range(0, 18):
    coef, pval = pearsonr(x.iloc[:, i], y)
    print('pval of ', x.iloc[?, i], ' and allStar: ', pval)
where I want to print 'pval of column_name and allStar: pval'.
Is there a value I can replace ? with so that it fetches the column name for each of the columns, or do I have to use another function?

If x is your dataframe, try converting the column name to a column index:
col_idx = x.columns.get_loc('column_name')
Now this index can be passed to iloc method.
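For instance, on a minimal made-up frame (the column names here are illustrative only):

```python
import pandas as pd

x = pd.DataFrame({'alpha': [1, 2], 'beta': [3, 4]})  # made-up frame
col_idx = x.columns.get_loc('beta')
print(col_idx, x.iloc[:, col_idx].tolist())  # 1 [3, 4]
```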

The short answer for your direct question is to use x.columns.
for i in range(0, 18):
    coef, pval = pearsonr(x.iloc[:, i], y)
    print('pval of ', x.columns[i], ' and allStar: ', pval)
A cleaner approach would be to simply iterate over the columns:
for c in x.columns:
    coef, pval = pearsonr(x[c], y)
    print('pval of ', c, ' and allStar: ', pval)
Bonus notes (mainly to avoid the loop...):
To get the correlation coefficients (and not the pvalues just yet) of each column with y, you can simply use corrwith:
r = x.corrwith(pd.Series(y), axis=0)
To obtain the pvalues that correspond to those Pearson coefficients, you can simply calculate them directly, as follows:
dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2) # n == len(y)
p = 2*dist.cdf(-abs(r)) # <= the pvalues!
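Putting those bonus notes together, here is a self-contained sketch (with made-up x and y) that cross-checks the beta-distribution p-values against scipy.stats.pearsonr:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up stand-ins for the asker's x and y.
rng = np.random.default_rng(0)
x = pd.DataFrame(rng.normal(size=(50, 3)), columns=['a', 'b', 'c'])
y = pd.Series(rng.normal(size=50))

n = len(y)
r = x.corrwith(y)                                    # Pearson r per column
dist = stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
p = 2 * dist.cdf(-abs(r))                            # two-sided p-values

# Cross-check one column against scipy's pearsonr.
coef, pval = stats.pearsonr(x['a'], y)
print(np.isclose(r['a'], coef), np.isclose(p[0], pval))  # True True
```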

Related

'Oversampling' cartesian data in a dataframe without for loop?

I have 3D data in a pandas dataframe that I would like to 'oversample'/smooth by replacing the value at each x,y point with the average value of all the points that are within 5 units of that point. I can do it using a for loop like this (starting with a dataframe with three columns X, Y, Z):
import pandas as pd
Z_OS = []
X_OS = []
Y_OS = []
for index, row in df.iterrows():
    Z_OS += [df[(df['X'] > row['X']-5) & (df['X'] < row['X']+5) &
                (df['Y'] > row['Y']-5) & (df['Y'] < row['Y']+5)]['Z'].mean()]
    X_OS += [row['X']]
    Y_OS += [row['Y']]
data = {
    'X': X_OS,
    'Y': Y_OS,
    'Z': Z_OS
}
OSdf = pd.DataFrame.from_dict(data)
but this method is very slow for large datasets and feels very 'unpythonic'. How could I do this without for loops? Is it possible via complex use of the groupby function?
xy = df[['X', 'Y']]
df['smoothed Z'] = df[['Z']].apply(
    lambda row: df['Z'][(xy - xy.loc[row.name]).abs().lt(5).all(1)].mean(),
    axis=1
)
Here I used df[['Z']] to get the column 'Z' as a dataframe. We need the index of a row, i.e. row.name, when we apply a function to this column.
.abs().lt(5).all(1) reads as: absolute values which are all less than 5 along the row.
Update
The code below is actually the same but seems more consistent as it addresses directly the index:
df.index.to_series().apply(lambda i: df.loc[(xy - xy.loc[i]).abs().lt(5).all(1), 'Z'].mean())
If a simple fixed-size rolling window over consecutive rows is enough, pandas also has this built in:
df['column_name'].rolling(rolling_window).mean()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
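A minimal runnable version of the index-based approach above, on made-up data (three points and the 5-unit window from the question):

```python
import pandas as pd

# Made-up stand-in for the asker's dataframe: three (X, Y) points with Z values.
df = pd.DataFrame({'X': [0, 1, 10], 'Y': [0, 1, 10], 'Z': [1.0, 3.0, 100.0]})

xy = df[['X', 'Y']]
# For each row i, average Z over all points within 5 units of (X_i, Y_i).
df['smoothed Z'] = df.index.to_series().apply(
    lambda i: df.loc[(xy - xy.loc[i]).abs().lt(5).all(1), 'Z'].mean()
)
print(df['smoothed Z'].tolist())  # [2.0, 2.0, 100.0]
```

The first two points fall in each other's window, so they both average to 2.0; the third point only sees itself.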

How to iterate over each column of a dataframe and reference it dynamically?

It is possible to reference a dataframe column if you know its name (for example: dataframe1.time). I want to iterate over each column of a dataframe and reference it dynamically, like: 'for each x, dataframe1.x'.
numeric_vars = ['CO_Mean', 'CO_Min', 'CO_Max', 'CO_Std', 'NO2_Mean', 'NO2_Min',
                'NO2_Max', 'NO2_Std', 'O3_Mean', 'O3_Min', 'O3_Max', 'O3_Std',
                'PM2.5_Mean', 'PM2.5_Min', 'PM2.5_Max', 'PM2.5_Std', 'PM10_Mean',
                'PM10_Min', 'PM10_Max', 'PM10_Std', 'SO2_Mean', 'SO2_Min', 'SO2_Max', 'SO2_Std']
for num_var in new_dataset_2[numeric_vars]:
    mean, std = np.mean(new_dataset_2[num_var]), np.std(new_dataset_2[num_var])
    cut_off = std * 3
    lower, upper = mean - cut_off, mean + cut_off
    #outliers = [new_dataset_2.index[new_dataset_2[num_var] == x].tolist() for x in new_dataset_2[num_var] if x < lower or x > upper]
    new_dataset_2 = new_dataset_2[(new_dataset_2.num_var > lower) & (new_dataset_2.num_var < upper)]
    #for outlier in outliers:
    #    new_dataset_2.drop(outlier)
The line new_dataset_2.num_var does not work. Is there a way to iterate over each column and reference it dynamically?
Short answer:
You can replace new_dataset_2.num_var with new_dataset_2[num_var].
The notation with [] is equivalent, but supports passing a variable.
Longer answer:
When using pandas, the first reflex should be to avoid looping over columns (or rows, which is worse) and to use pandas built-in functions.
For your example, you can probably use the apply() function to apply the same function to all columns; see the pandas apply() doc.
Since you seem to want to apply it only to numeric columns, you can filter inside the function you pass to apply with an if.
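Applying the bracket-notation fix to the asker's loop, a minimal sketch on made-up data (column names and sizes are assumptions):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for new_dataset_2 with two of the numeric columns.
rng = np.random.default_rng(1)
new_dataset_2 = pd.DataFrame({'CO_Mean': rng.normal(size=100),
                              'NO2_Mean': rng.normal(size=100)})
numeric_vars = ['CO_Mean', 'NO2_Mean']

for num_var in numeric_vars:
    mean = new_dataset_2[num_var].mean()
    std = new_dataset_2[num_var].std()
    lower, upper = mean - 3 * std, mean + 3 * std
    # Bracket notation accepts the loop variable; attribute access does not.
    new_dataset_2 = new_dataset_2[(new_dataset_2[num_var] > lower) &
                                  (new_dataset_2[num_var] < upper)]

print(len(new_dataset_2))
```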

sum columns in dataframe python (different columns each row) [duplicate]

This question already has answers here:
Dynamically evaluate an expression from a formula in Pandas
(2 answers)
Closed 2 years ago.
I have a dataframe with 3 columns a, b, c like below:
df = pd.DataFrame({'a':[1,1,5,3], 'b':[2,0,6,1], 'c':[4,3,1,4]})
I want to add a column d which is a sum of some columns in df, but not the same columns for each row. For example,
only rows 1 and 3 sum the same columns; rows 0 and 2 sum other columns.
What I found on Stack Overflow always uses fixed columns for the whole dataframe, but in this case it is different.
What is the best way to do it?
Because column d is calculated from different columns per row, one way is to do each row separately (using .loc to avoid chained-assignment warnings):
df['d'] = 0
df.loc[0, 'd'] = df.loc[0, 'b']
df.loc[1, 'd'] = df.loc[1, 'a'] + df.loc[1, 'c']
df.loc[2, 'd'] = df.loc[2, 'a']
df.loc[3, 'd'] = df.loc[3, 'a'] + df.loc[3, 'c']
If rows 1 and 3 follow a rule:
odd = df.index % 2 == 1
df.loc[odd, 'd'] = df.loc[odd, 'a'] + df.loc[odd, 'c']
Also, with a for-loop:
for i in range(0, 4):
    if i % 2 == 1:
        df.loc[i, 'd'] = df.loc[i, 'a'] + df.loc[i, 'c']
The dynamic way uses pd.eval(), as per this solution [1]. It evaluates each row's formula individually, which allows df['formula'] to differ on each row, and nothing is hardcoded in your code. There's a huge amount going on in this one-liner; see the explanation in Notes below.
df.apply(lambda row: pd.eval(row['formula'], local_dict=row.to_dict()), axis=1)
0 2
1 4
2 5
3 4
# ^--- this is the result
and if you want to assign that result to a dataframe column, say df['z']:
df['z'] = df.apply(lambda row: pd.eval(row['formula'], local_dict=row.to_dict()), axis=1)
Alternatively, you could use pd.eval(..., inplace=True), but then the formula would need to contain an actual assignment, e.g. 'z=a+b', and the 'z' column would need to have been declared already: df['z'] = np.nan. That part is slightly annoying to implement, so I didn't.
NOTES:
we use pd.eval(...) to dynamically evaluate the ['formula'] column
...using the pd.eval(.., local_dict=...) argument to pass in the variables for that row
to evaluate an expression on each dataframe row, we use df.apply(..., axis=1). We have to provide some lambda function to tell it what to evaluate.
So how does pd.eval() know how to map the strings a,b,c to their values on that individual row?
When we call df.apply(..., axis=1) row-wise like that, each row gets passed in as an individual Series, so within our apply(... axis=1), we can no longer reference the dataframe as df or its columns as df['a'], df['b'], ...
So instead we need to pass in that row as a Python dict, hence the local_dict=row.to_dict() argument to pd.eval, inside the lambda function.
The pd.eval() approach can handle arbitrarily complicated formulas in the variables, not just simple sums; it can handle e.g. (a + c**2)/(b+c). You could reference external constants, or external functions e.g. log10.
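Since the answer assumes a 'formula' column without showing how it was built, here is a self-contained sketch using the question's a/b/c data plus a hypothetical per-row formula column (the formulas are my guess at ones reproducing the result shown above):

```python
import pandas as pd

# The question's data, plus a hypothetical per-row 'formula' column.
df = pd.DataFrame({'a': [1, 1, 5, 3], 'b': [2, 0, 6, 1], 'c': [4, 3, 1, 4],
                   'formula': ['b', 'a+c', 'a', 'c']})

# Evaluate each row's formula against that row's own values.
df['d'] = df.apply(lambda row: pd.eval(row['formula'], local_dict=row.to_dict()),
                   axis=1)
print(df['d'].tolist())  # [2, 4, 5, 4]
```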
References:
[1]: Compute dataframe columns from a string formula in variables?

using python read a column 'H' from csv and implement this function SUM(H16:H$280)/H$14*100

Using python read a column 'H' from a dataframe and implement this function:
CDF = {SUM(H1:H$266)/G$14}*100
Where:
H$266 is the last element of the column, and
G$14 is the total sum of the column H.
In sum(), the first variable iterates (H1, H2, H3 ... H266) but the last value remains the same (H$266). So the first value of CDF is obviously 100 and then it goes on decreasing downwards.
I want to implement this using dataframe.
As an example, you could do this:
from pandas import Series

s = Series([1, 2, 3])  # H1:H266 data
sum_of_s = s.sum()     # G14

def calculus(subset, total_sum):
    return subset.sum() / total_sum * 100

result = Series([calculus(s.iloc[i:], sum_of_s) for i in range(len(s))])
print(result)
You should adapt it to your dataset, but that's the basic idea. Let me know if it works.
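The per-row call above can also be vectorized with a reversed cumulative sum, using the same toy data:

```python
from pandas import Series

s = Series([1, 2, 3])  # H1:H266 data
# sum(s[i:]) for each i, via a reversed cumulative sum, scaled by the total.
cdf = s[::-1].cumsum()[::-1] / s.sum() * 100
print(cdf.tolist())  # [100.0, 83.33, 50.0] (approximately)
```

As expected, the first value is 100 and it decreases downwards.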

How to get a value from a cell of a dataframe by position or label?

I read the questions "How to get a value from a cell of a dataframe?" and "How to select the last column of dataframe". I am a bit confused as to what is the accepted method in 2018 for getting a value from the cell of a pandas dataframe, as get_value has been deprecated and the documentation on at is somewhat sparse. I want to select the row by label and the column by position/integer. Is my method below in which I use iloc the accepted approach in 2018? I have this code:
import pandas as pd
import random
import string
import csv
import io
## Make a table with random column labels and rows in random order.
s = '\n'.join((
    ## Separate columns by comma.
    ','.join((
        ## Join letters without a separator.
        ''.join((
            ## Randomly choose 3 characters for the labels from ascii_letters.
            random.choice(
                string.ascii_letters) for i in range(3))) for j in range(3))),
    ## Insert a header to distinguish rows with identical index keys x and y.
    '"Header I, II and III"',
    ## Randomly shuffle the rows under the first header.
    '\n'.join(random.sample(('x,0,1', 'y,2,3',), 2)),
    ## Insert a header to distinguish rows with identical index keys x and y.
    '"Header IV, V and VI"',
    ## Randomly shuffle the rows under the second header.
    '\n'.join(random.sample(('x,4,5', 'y,6,7'), 2)),
))
def preprocess(s):
    header = ''
    for l in csv.reader(io.StringIO(s)):
        if len(l) == 1:
            header = l[0]
            continue
        ## Append the header to distinguish identical labels x and y.
        l[0] = header + '; ' + l[0]
        yield l

print(s, end='\n\n')
## Preprocess the string to avoid duplicate row index keys x and y.
df = pd.DataFrame(preprocess(s))
print(df, end='\n\n')
## Set the index to be that of the first column.
df = df.set_index(0)
## First select the column by index using iloc
## and then select the relevant row by index label.
value = df.iloc[:, -1]['Header I, II and III; x']
print(value)
It generates a string s looking like this:
YuT,Uva,AsE
"Header I, II and III"
y,2,3
x,0,1
"Header IV, V and VI"
y,6,7
x,4,5
Which is converted to a dataframe by the function preprocess looking like this:
0 1 2
0 ; YuT Uva AsE
1 Header I, II and III; y 2 3
2 Header I, II and III; x 0 1
3 Header IV, V and VI; y 6 7
4 Header IV, V and VI; x 4 5
It is the value (the integer 1) from the last column of the row with the label Header I, II and III; x, which I'm interested in. Is this the correct approach in 2018?
value = df.iloc[:,-1]['Header I, II and III; x']
I just read some very interesting questions "Loc vs. iloc vs. ix vs. at vs. iat?" from February 2015 and "pandas iloc vs ix vs loc explanation?" from July 2015. It would be great, if I could do something like this, but I can't:
value = df.at['Header I, II and III; x', -1]
value = df['Header I, II and III; x'][-1]
First things first: ix is deprecated. It allowed you to mix labels and positional indexers, and made a lot of guesses about what it was passed.
In today's day and age (with the current stable release being v0.22), stick to explicit label-based or positional indexers: loc for label-based slicing, iloc for positional slicing; at for label-based item access, and iat for positional item access.
If you know what your labels are, use at to access a single item -
df.at['Header I, II and III; x', df.columns[-1]]
If you know the position, use iat -
df.iat[2, -1]
In general, use *at when you want to access a single element, and *loc when you want to access a row/column slice.
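To make the distinction concrete, here is a tiny made-up frame (names are illustrative only) where all four indexers fetch the same element:

```python
import pandas as pd

# Tiny made-up frame to contrast the four indexers.
df = pd.DataFrame({'a': [10, 20], 'b': [30, 40]}, index=['x', 'y'])

print(df.at['x', 'b'])    # label-based single item          -> 30
print(df.iat[0, 1])       # position-based single item       -> 30
print(df.loc['x', 'b'])   # label-based, also handles slices -> 30
print(df.iloc[0, -1])     # position-based, also slices      -> 30
```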
This solution definitely works:
value = df.at['Header I, II and III; x', df.columns[-1]]
For those like myself who like old-fashioned dictionaries, it is also possible to do the following, provided the dataframe does not contain duplicate row labels (which would prevent the transpose-to-dict round trip from working):
d = df.transpose().to_dict('list')
value = d['Header I, II and III; x'][-1]
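As a quick illustration of that dictionary round trip on a made-up frame:

```python
import pandas as pd

# Made-up frame: row labels become dict keys after the transpose.
df = pd.DataFrame({'a': [0, 2], 'b': [1, 3]}, index=['x', 'y'])

d = df.transpose().to_dict('list')
print(d)           # {'x': [0, 1], 'y': [2, 3]}
print(d['x'][-1])  # 1
```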
