Pandas: calculate average


I have a CSV dataset and want to calculate the average for every row. The average should be computed from the data starting at column 14. This is what I have done so far, but I am still not getting the average value. Can someone help me with this?
I am also confused about how the axis argument works.
import pandas as pd

file = 'dataset.csv'
df = pd.read_csv(file)
d_col = df[df.columns[14:]]
mean_value = d_col['mean'] = d_col.mean(axis=1, skipna=True, numeric_only=True)
print(mean_value)
d_col.to_csv('out.csv')

That's a rather unusual indexing syntax. A clearer way is:
d_col = df.iloc[:, 14:]
axis=0 computes the average of each column (collapsing the rows), and axis=1 computes the average of each row (collapsing the columns), which is what you appear to be doing correctly. I'm not sure what exactly you mean by not getting the average: d_col should contain your original data plus a new column named "mean" holding the result.
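To make the axis convention concrete, here is a small sketch on a made-up DataFrame. The column names and values are assumptions for illustration only, not the asker's data.

```python
import pandas as pd

# Hypothetical data: one text column followed by three numeric columns.
df = pd.DataFrame({
    "id": ["x", "y"],
    "a": [1.0, 2.0],
    "b": [3.0, 4.0],
    "c": [5.0, 9.0],
})

numeric = df.iloc[:, 1:]          # every column from position 1 onward

col_means = numeric.mean(axis=0)  # one value per column, indexed by column name
row_means = numeric.mean(axis=1)  # one value per row, indexed by row label

print(col_means)
print(row_means)
```

The row-wise result has the same index as df, so it can be assigned straight back as a new column; the column-wise result is indexed by column name and cannot.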

Because you didn't provide sample data, here is some sample code. The first column is a text column that should be ignored, whereas the remaining columns of the DataFrame df are the ones used to calculate the mean.
import numpy as np
import pandas as pd

# prepare a sample dataset: one text column followed by 10 numeric columns
letters = 'abcdefghijklmnopqrstuvwxyz'
rows = 10
col1 = np.array(list(letters))[np.random.permutation(len(letters))[:rows]]
df = pd.concat([pd.DataFrame(col1), pd.DataFrame(np.random.randn(rows, 10))], axis=1)
# row-wise mean over every column except the first
result = df.iloc[:, 1:].mean(axis=1)
The result then looks like this:
0 0.693024
1 -0.356701
2 0.082385
3 -0.115622
4 -0.060414
5 0.104119
6 -0.435787
7 0.023327
8 -0.144272
9 0.363254
dtype: float64
Edit: I changed the answer above to use df.iloc instead of df[df.columns[...]], as the latter causes problems when two columns have the same name. Please mark peidaqi's answer as the correct one.

The issue was here: I was saving d_col as the output CSV file instead of mean_value. It's silly, but I guess that's how you learn to pick things up. Thanks @peidaqi and others for your explanations.

Related

Why is dataframe.sum(axis=0) getting NAN's when every value in every column is a real number?

All column values in the selected measure_cols of the dfm DataFrame are real numbers; in fact, all are in the range [-1.0, 1.0] inclusive.
The following gives False for every Series/column in the dfc DataFrame:
[print(f"{c}: {dfc[c].hasnans}") for c in dfc.columns]
Results: all False
But all row sums in dfc['shap_sum'] are coming up as NaN. Why would this be?
dfc['shap_sum'] = dfc[measure_cols].sum(axis=0)
Update: the following has the correct results, as seen in the debugger:
dfc[measure_cols].sum(axis=0)
But when assigned to a new column in the DataFrame, the values turn into NaN.
Why is this happening?
dfc['shap_sum'] = dfc[measure_cols].sum(axis=0)

Oh, I made the mistake of using axis=0 when I intended to do row sums. It's axis=1 that does row sums. I will never agree with that decision on polarity.
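The reason the values turn into NaN rather than raising an error is index alignment: sum(axis=0) returns a Series indexed by column name, and assigning a Series to a DataFrame column matches on index labels. None of the column names match the row labels, so every row gets NaN. A minimal sketch with made-up data (the column names m1 and m2 are placeholders, not the asker's):

```python
import pandas as pd

# Hypothetical measure columns with values in [-1.0, 1.0].
dfc = pd.DataFrame({"m1": [0.1, -0.2, 0.3], "m2": [0.5, 0.5, -1.0]})
measure_cols = ["m1", "m2"]

col_sums = dfc[measure_cols].sum(axis=0)  # indexed by 'm1', 'm2'
row_sums = dfc[measure_cols].sum(axis=1)  # indexed by 0, 1, 2

# Labels 'm1'/'m2' do not match row labels 0..2, so alignment yields all NaN.
dfc["wrong"] = col_sums
# Row sums share the DataFrame's index, so they line up as intended.
dfc["shap_sum"] = row_sums

print(dfc)
```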

Pandas - Lookup value for each item in list

I am relatively new to Python and Pandas. I have two dataframes. One contains a column of comma-separated codes; the number of codes in each list can vary, and an entry can instead contain a string such as 'Not Applicable' or be blank. The other is a lookup table of codes and values. I want to look up the value of each individual code in each list and calculate the maximum value within that list. For example, ['H302','H304'] would be [18,11], and the maximum of those two would be 18. I then want to return the maximum value of each list as a new column in df2. If an entry contains anything else, return blank.
This process was originally written in VBA. I solved the problem there by splitting each set of codes by the delimiter into new columns, then dynamically running index/matches against each code to return the value. Then it would calculate the maximum value and delete all the generated columns. I thought at the time it was a messy way to do it, and I don't want to replicate this in the Python version.
I would post what I've tried, but I can't figure out how I'd go about this. Any help is appreciated!
import pandas as pd
df1 = [['H302',18],
['H312',17],
['H315',16],
['H316',15],
['H319',14],
['H320',13],
['H332',12],
['H304',11]]
df1 = pd.DataFrame(df1, columns=['Code', 'Value'])
df2 = [['H302,H304'],
['H332,H319,H312,H320,H316,H315,H302,H304'],
['H315,H312,H316'],
['H320,H332,H316,H315,H304,H302,H312'],
['H315,H319,H312,H316,H332'],
['H312'],
['Not Applicable'],
['']]
df2 = pd.DataFrame(df2, columns=['Code'])

df3 = []
for i in range(len(df2)):
    df3.append(df2['Code'][i].split(","))
max_values = []
for i in range(len(df3)):
    for j in range(len(df3[i])):
        for index in range(len(df1)):
            if df1['Code'][index] == df3[i][j]:
                df3[i][j] = df1['Value'][index]
    max_values.append(max(df3[i]))
df2["Max Value"] = max_values

First, df2 seems to be defined incorrectly (each code needs its own quotes, separated by commas). Also, don't turn it into a data frame, since each row needs to be free to hold any number of elements.
Second, you need to set the codes as the index so that you can look up elements in the data frame. So, you would define the data frame as:
df1 = pd.DataFrame(df1, columns=['Code', 'Value']).set_index('Code')
Third, you need to loop through the second list of lists and select the elements you want with .loc before calculating the maximum. You also need to filter out the codes that are not in the first data frame.
result = []
for codes in df2:
    c = [_ for _ in codes if _ in df1.index]
    result.append(df1.loc[c,'Value'].max())

Try:
df2.join(df2['Code'].str.split(',')\
    .explode()\
    .map(df1.set_index('Code')['Value']).groupby(level=0).max().rename('Value'))
Output:
                                      Code  Value
0                                H302,H304   18.0
1  H332,H319,H312,H320,H316,H315,H302,H304   18.0
2                           H315,H312,H316   17.0
3       H320,H332,H316,H315,H304,H302,H312   18.0
4                 H315,H319,H312,H316,H332   17.0
5                                     H312   17.0
6                           Not Applicable    NaN
7                                             NaN
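For anyone who wants to run the chain above end to end, here is a self-contained sketch using the data from the question, with the one-liner split into named steps. The intermediate variable names (lookup, codes, values, max_per_row) are mine, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame(
    [["H302", 18], ["H312", 17], ["H315", 16], ["H316", 15],
     ["H319", 14], ["H320", 13], ["H332", 12], ["H304", 11]],
    columns=["Code", "Value"],
)
df2 = pd.DataFrame({"Code": [
    "H302,H304",
    "H332,H319,H312,H320,H316,H315,H302,H304",
    "H315,H312,H316",
    "H320,H332,H316,H315,H304,H302,H312",
    "H315,H319,H312,H316,H332",
    "H312",
    "Not Applicable",
    "",
]})

lookup = df1.set_index("Code")["Value"]        # Series: code -> value
codes = df2["Code"].str.split(",").explode()   # one code per row, original index repeated
values = codes.map(lookup)                     # codes missing from the lookup become NaN
max_per_row = values.groupby(level=0).max().rename("Value")  # collapse back to one row each
result = df2.join(max_per_row)

print(result)
```

Because explode keeps the original row label on every piece, groupby(level=0) is enough to fold the pieces back into one value per original row.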

Converting a column of credit ratings like AAA BB CC to a numeric category of AAA = 1, BB = .75 etc in python?

I have a column in a dataframe called 'CREDIT RATING', with one row per company. I need to assign a numerical category to ratings from AAA to DDD, running from 1 (AAA) to 0 (DDD). Is there a quick, simple way to do this and create a new column with numbers from 1 down to 0 in steps of 0.1? Thanks!

You could use replace:
df['CREDIT RATING NUMERIC'] = df['CREDIT RATING'].replace({'AAA':1, ... , 'DDD':0})

The easiest way is to simply create a dictionary mapping:
mymap = {"AAA":1.0, "AA":0.9, ... "DDD":0.0}
and then apply it to the dataframe:
df["CREDIT MAPPING"] = df["CREDIT RATING"].replace(mymap)
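One difference worth knowing when choosing the call: replace leaves values that are not in the dictionary unchanged, while map turns them into NaN, which makes unmapped ratings easier to spot. A minimal sketch, assuming a made-up rating scale and evenly spaced scores (neither comes from the question, so substitute your own ordering):

```python
import numpy as np
import pandas as pd

# Hypothetical rating scale, listed from best to worst.
ordered_ratings = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC", "CC", "C", "DDD"]

# Evenly spaced scores from 1.0 (best) down to 0.0 (worst).
scores = np.linspace(1.0, 0.0, num=len(ordered_ratings))
rating_to_score = dict(zip(ordered_ratings, scores))

df = pd.DataFrame({"CREDIT RATING": ["AAA", "BB", "DDD", "NR"]})

# map() returns NaN for ratings missing from the dictionary ("NR" here).
df["CREDIT RATING NUMERIC"] = df["CREDIT RATING"].map(rating_to_score)

print(df)
```

Writing the ratings out in their real order avoids relying on an alphabetical sort, which would not reflect how the ratings actually rank.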

OK, this was kind of tough with nothing to work with, but here we go:
import numpy as np
import pandas as pd

# First, take a list of ratings (from Wikipedia) and put it into a dataframe to replicate your scenario
ratings = ['AAA' ,'AA1' ,'AA2' ,'AA3' ,'A1' ,'A2' ,'A3' ,'BAA1' ,'BAA2' ,'BAA3' ,'BA1' ,'BA2' ,'BA3' ,'B1' ,'B2' ,'B3' ,'CAA' ,'CA' ,'C' ,'C' ,'E' ,'WR' ,'UNSO' ,'SD' ,'NR']
df_credit_ratings = pd.DataFrame({'Ratings_id':ratings})
df_credit_ratings = pd.concat([df_credit_ratings,df_credit_ratings]) # just to replicate duplicate records
# set() returns the unique values
unique_ratings = set(df_credit_ratings['Ratings_id'])
number_of_ratings = len(unique_ratings) # counting how many unique ratings there are
number_of_ratings_by_tenth = number_of_ratings/10 # because going from 0 to 1 in steps of 0.1 takes 10 positions
# NumPy's arange fills in the values of a range (first two numbers) using the given step (third number)
dec = list(np.arange(0.0, number_of_ratings_by_tenth, 0.1))
After this you'll need to match the unique ratings to their weights:
df_ratings_unique = pd.DataFrame({'Ratings_id':list(unique_ratings)}) # list so it gets one value per row
EDIT: as Thomas suggested in a comment on another answer, this sort probably won't suit you, because it won't give the real order of importance of the ratings. So you'll probably need to create the dataframe with the ratings already in order, with no need to sort.
df_ratings_unique.sort_values(by='Ratings_id', ascending=True, inplace=True) # sorting so it matches the order of our weights above
Continuing with the solution:
df_ratings_unique['Weigth'] = dec # adding the weights to the DF
df_ratings_unique.set_index('Ratings_id', inplace=True) # setting the ratings as the index so we can map the values below
# now this is the magic: we create a new column in the original dataframe and map it by `Ratings_id` using our unique dataframe
df_credit_ratings['Weigth'] = df_credit_ratings['Ratings_id'].map(df_ratings_unique.Weigth)

Calculating running total

I have data frame df and I would like to keep a running total of names that occur in a column of that data frame. I am trying to calculate the running total column:
name running total
a 1
a 2
b 1
a 3
c 1
b 2
There are two ways I thought to do this:
Loop through the dataframe and use a separate dictionary containing name and current count. The current count for the relevant name would increase by 1 each time the loop is carried out, and that value would be copied into my dataframe.
Change the count field for each value in the dataframe. In Excel I would use a COUNTIF combined with a drag-down formula over A$1:A1, which fixes the first cell but keeps the second relative, so that the range I am looking in grows with the row.
The problem is I am not sure how to implement these. Does anyone have any ideas on which is preferable and how these could be implemented?

@bunji is right. I'm assuming you're using pandas and that your data is in a dataframe called df. To add the running totals to your dataframe, you could do something like this:
df['running total'] = df.groupby(['name']).cumcount() + 1
The + 1 gives you a 1 for your first occurrence instead of 0, which is what you would get otherwise.
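Here is that line applied to the example from the question, as a quick runnable check. The DataFrame is built by hand from the names listed there.

```python
import pandas as pd

# The 'name' column from the question's example.
df = pd.DataFrame({"name": ["a", "a", "b", "a", "c", "b"]})

# cumcount() numbers the rows within each group starting at 0, in order of appearance.
df["running total"] = df.groupby("name").cumcount() + 1

print(df)
```

This avoids both the explicit loop with a dictionary and the Excel-style expanding range, since the per-name counter is maintained by the groupby.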

highest number in each row in a DF and return column of numbers

Hi, I have a list of stock prices and have calculated 5 moving averages.
I want to find the max number in each ROW, but the code is returning the max number for the entire array.
Here is the code:
# For stock in df:
# Create 10, 30, 50, 100 and 200D MAvgs
MA10D = stock.rolling(10).mean()
MA30D = stock.rolling(30).mean()
MA50D = stock.rolling(50).mean()
MA100D = stock.rolling(100).mean()
MA200D = stock.rolling(200).mean()
max_line = pd.concat([MA10D, MA30D, MA50D, MA100D, MA200D],axis=0).max()
I want to create a new column with the max number (whichever of the 10D, 30D, 50D, 100D or 200D MA is largest), so I should get a value on each row.
Right now all I get is the max number of the entire array. I tried axis=1 and that did not work either.
It seems like a simple question, but I cannot get it written properly. Please let me know if you can help. Thanks.

The axis=0 in your code refers to the concatenation. You need to make that axis=1 so that each moving average becomes a separate column, then use axis=1 in your call to max as well. It should look like this:
max_line = pd.concat([MA10D, MA30D, MA50D, MA100D, MA200D], axis=1).max(1)
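A scaled-down sketch of the same idea, using made-up prices and two short windows (2 and 3 instead of 10 through 200) so the result is easy to check by hand. The keys argument only labels the columns and is my addition.

```python
import pandas as pd

# Hypothetical price series, for illustration only.
stock = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0])

MA2D = stock.rolling(2).mean()
MA3D = stock.rolling(3).mean()

# axis=1 places the moving averages side by side, one column each.
averages = pd.concat([MA2D, MA3D], axis=1, keys=["MA2D", "MA3D"])

# axis=1 again: take the largest value across the columns of each row.
max_line = averages.max(axis=1)

print(averages.assign(max_line=max_line))
```

Rows where only some windows are filled use the averages that exist, because max skips NaN by default; a row where every average is NaN stays NaN.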
