How to find average of values in columns within iterrows in python

I have a dataframe with 100+ columns, where all columns after col10 are of type float. What I would like to do is find the average of a certain range of columns within a loop. Here is what I have tried so far:
for index, row in df.iterrows():
    a = row.iloc[col30:col35].mean(axis=0)
This unfortunately returns unexpected values, and I'm not able to get the average of col30 through col35 for every row. Can someone please help?

try:
df.iloc[:, 30:35].mean(axis=1)
You may need to adjust 30:35 to 29:35 (you can remove the .mean and experiment to get a feel for how .iloc works). Generally in pandas you want to avoid loops as much as possible. The .iloc method lets you select rows and columns by their positional index. Then .mean() with axis=1 averages across the columns for each row.
You really should include a small reproducible example in your question. Please see the one below, where the solution mentioned in the comments works.
import pandas as pd
df = pd.DataFrame({i: i for i in range(100)}, index=list(range(100)))
for i, row in df.iterrows():
    a = row.iloc[29:35].mean()  # a is 31.5 for each row (mean of 29..34)
    print(a)
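To see the loop and the vectorized call side by side, here is a minimal sketch with made-up data (the column names and values are invented for illustration):

```python
import pandas as pd

# Small made-up frame: 4 rows, 40 columns named c0..c39,
# where every cell in column j holds the value j.
df = pd.DataFrame({f"c{j}": [j] * 4 for j in range(40)})

# Average the columns at positions 30..34 for every row in one call;
# no iterrows needed.
row_means = df.iloc[:, 30:35].mean(axis=1)
print(row_means)  # every row's mean is 32.0 (mean of 30..34)
```

The loop version and this one-liner give the same numbers; the vectorized call is much faster on large frames.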

Related

Map Value to Specific Row and Column - Python Pandas

I have a data set where I want to match the index row and change the value of a column within that row.
I have looked at map and loc, and using df.loc I have been able to locate the data, but it filters the data down; all I want to do is change the value in a column of that row when that row is found.
What is the best approach - my original post can be found here:
Original post
It's simple to do in excel but struggling with Pandas.
Edit:
I have this so far, which seems to work, but the result includes a lot of extra numbers after the total calculation, along with dtype: int64:
import pandas as pd
df = pd.read_csv(r'C:\Users\david\Documents\test.csv')
multiply = {2.1: df['Rate'] * df['Quantity']}
df['Total'] = df['Code'].map(multiply)
df.head()
How do I get around this?
The pandas method mask is likely a good option here. Mask takes two main arguments: a condition and something with which to replace values that meet that condition.
If you're trying to replace values with a formula that draws on values from multiple dataframe columns, you'll also want to pass in an additional axis argument.
The condition: this would be something like, for instance:
df['Code'] == 2.1
The replacement value: this can be a single value, a series/dataframe, or (most valuable for your purposes) a function/callable. For example:
df['Rate'] * df['Quantity']
The axis: Because you're passing a Series as the replacement argument, you need to tell mask() how to align those values. It might look something like this:
axis=0
So all together, the code would read like this:
df['Total'] = df['Code'].mask(
    df['Code'] == 2.1,
    df['Rate'] * df['Quantity'],
    axis=0
)
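Put together as a runnable sketch (the Code/Rate/Quantity column names come from the question; the data values are made up):

```python
import pandas as pd

# Made-up data using the column names from the question.
df = pd.DataFrame({
    "Code": [2.1, 3.0, 2.1],
    "Rate": [10.0, 20.0, 30.0],
    "Quantity": [2, 3, 4],
})

# Total starts from Code; rows where Code == 2.1 are overwritten
# with Rate * Quantity, all other rows keep their Code value.
df["Total"] = df["Code"].mask(
    df["Code"] == 2.1,
    df["Rate"] * df["Quantity"],
    axis=0,
)
print(df["Total"].tolist())  # [20.0, 3.0, 120.0]
```

Rows where the condition is False keep the original value; with several codes you would chain masks or reach for something like numpy.select instead.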

How to insert data into an existing dataframe, replacing values according to a conditional

I'm looking to insert information into an existing dataframe. This dataframe's shape is 2001 rows × 13 columns; however, only the first column has information.
I have 12 more columns, but these are not the same dimension as the main dataframe, so I'd like to insert these additional columns into the main one using a conditional.
Example dataframe:
This is an example: I want to insert the var column into the 2001 × 13 dataframe, using the date as a conditional; in case there is no date, it skips the row or simply adds a 0.
I'm really new to python and programming in general.
Without a minimal working example it is hard to provide you with clear recommendations, but I think what you are looking for is the .loc method of a pd.DataFrame. What I would recommend doing is the following:
Selection of rows with .loc works better in your case if the dates are first converted to date-time, so a first step is to make this conversion as:
# Pandas is quite smart about guessing date format. If this fails, please check the
# documentation https://docs.python.org/3/library/datetime.html to learn more about
# format strings.
df['date'] = pd.to_datetime(df['date'])
# Make this the index of your data frame.
df.set_index('date', inplace=True)
It is not clear how you intend to use the conditionals or what the content of your other columns is. Using .loc, this is pretty straightforward:
# On Feb 1, 2020, add a value to column 'var'.
df.loc['2020-02-01', 'var'] = 0.727868
This could also be used for ranges:
# Assuming you have a second `df2` which has a datetime column 'date' with the
# data you wish to add to `df`. This will only work if all df2['date'] are found
# in df.index. You can work out the logic for your case.
df.loc[df2['date'], 'var2'] = df2['vals']
If the logic is too complex and the dataframe is not too large, iterating with .iterrows could be easier, especially if you are beginning with Python.
for idx, row in df.iterrows():
    if idx in list_of_other_dates:
        df.loc[idx, 'var'] = (some code here)
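As a self-contained sketch of that fallback loop (the dates, the column name var, and the fill value are invented for illustration):

```python
import pandas as pd

# Made-up frame indexed by dates, plus a second set of dates to match.
df = pd.DataFrame({"var": [0.0] * 4},
                  index=pd.to_datetime(["2020-01-01", "2020-02-01",
                                        "2020-03-01", "2020-04-01"]))
other_dates = pd.to_datetime(["2020-02-01", "2020-04-01"])

# For each row, write a value only when its date appears in other_dates;
# unmatched rows keep their 0.
for idx, row in df.iterrows():
    if idx in other_dates:
        df.loc[idx, "var"] = 0.727868

print(df)
```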
Please clarify your problem a bit and you will get better answers. Do not forget to check the documentation.

How to deal with missing values in Pandas DataFrame?

I have a Pandas Dataframe that has some missing values. I would like to fill the missing values with something that doesn't influence the statistics that I will do on the data.
As an example, if in Excel you try to average a cell that contains 5 and an empty cell, the average will be 5. I'd like to have the same in Python.
I tried to fill with NaN but if I sum a certain column, for example, the result is NaN.
I also tried to fill with None but I get an error because I'm summing different datatypes.
Can somebody help? Thank you in advance.
There are many possible answers to your two questions.
Here is a solution to the first one:
If you wish to insert a value into the NaN entries of your Dataframe that won't alter your statistics, then I would suggest using the mean value of that data.
Example:
df # your dataframe with NaN values
df.fillna(df.mean(), inplace=True)
For the second question:
If you need to check descriptive statistics from your dataframe, and that descriptive stats should not be influenced by the NaN values, here are two solutions for it:
Option 1 (note that filling with the mean leaves the mean unchanged, but it will shrink the standard deviation, since the filled values contribute zero deviation):
df # your dataframe with NaN values
df.fillna(df.mean(), inplace=True)
df.mean()
df.std()
# or even:
df.describe()
Option 2:
Use the numpy nan-aware functions, such as numpy.nansum, numpy.nanmean, and numpy.nanstd:
df.apply(numpy.nansum)
df.apply(numpy.nanstd) #...
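A minimal sketch of the difference (single made-up column A):

```python
import numpy as np
import pandas as pd

# A column with one missing value.
df = pd.DataFrame({"A": [5.0, np.nan, 3.0]})

# Plain numpy.sum propagates the NaN; the nan-aware variants skip it.
print(np.sum(df["A"].values))      # nan
print(np.nansum(df["A"].values))   # 8.0
print(np.nanmean(df["A"].values))  # 4.0
```

Note that pandas' own reductions such as df["A"].sum() and df["A"].mean() already skip NaN by default (skipna=True), so the Excel-like behavior described in the question is in fact pandas' default for numeric columns.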
The answer to your question is that missing values work differently in Pandas than in Excel. You can read about the technical reasons for that here. Basically, there is no magic number that we can fill a df with that will cause Pandas to just overlook it. Depending on our needs, we will sometimes choose to fill the missing values, sometimes to drop them (either permanently or for the duration of a calculation), or sometimes to use methods that can work with them (e.g. numpy.nansum, as Philipe Riskalla Leal mentioned).
You can use df.fillna(). Here is an example of how you can do the same.
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, 1, np.nan],
                   [2, np.nan, 3, 4],
                   [4, np.nan, np.nan, 3],
                   [np.nan, 2, 1, np.nan]], columns=list('ABCD'))
df.fillna(0.0)
Generally, filling missing values with something like 0 will affect the statistics you compute on your data.
Filling with the mean of the data makes sure the mean itself is unaffected.
So, use df.fillna(df.mean()) instead.
If you want to change the datatype of a specific column so that its missing values become NaN for statistical operations, you can simply use the line of code below. It converts all the values of that column to numeric type, replaces anything non-numeric with NaN automatically, and will not affect your statistical operations.
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')
If you want to do the same for all the columns in dataframe you can use:
for i in df.columns:
    df[i] = pd.to_numeric(df[i], errors='coerce')
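A small sketch of that coercion loop on made-up data (the column names and values are invented):

```python
import pandas as pd

# Made-up frame with string columns containing some non-numeric entries.
df = pd.DataFrame({"a": ["1", "x", "3"], "b": ["4", "5", "bad"]})

# Coerce every column: non-numeric entries become NaN, which pandas'
# statistics then skip by default.
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.mean())  # a -> 2.0 (mean of 1 and 3), b -> 4.5 (mean of 4 and 5)
```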

Looping through DataFrame via zip

I'm using this code to loop through a dataframe:
for r in zip(df['Name']):
    # statements
How do I identify a particular row in the dataframe? For example, I want to assign a new value to each row of the Name column while looping through. How do I do that?
I've tried this:
for r in zip(df['Name']):
    df['Name'] = time.time()
The problem is that every single row is getting the same value instead of different values.
The main problem is in the assignment:
df['Name']= time.time()
This says to grab the current time and assign it to every cell in the Name column. You reference the column vector, rather than a particular row. Note your iteration statement:
for r in zip(df['Name']):
Here, r is the row, but you never refer to it. That makes it highly unlikely that anything you do within the loop will affect an individual row.
Putting on my "teacher" hat ...
Look up examples of how to iterate through the rows of a Pandas data frame.
Within those, see how individual cells are referenced: that technique looks a lot like indexing a nested list.
Now, alter your code so that you put the current time in one cell at a time, one on each iteration. It will look something like
df.at[idx, 'Name'] = time.time()
or
row['Name'] = time.time()
depending on how you define idx or row in your iteration.
Does that get you to a solution?
The following also works:
import pandas as pd
import time
# example df
df = pd.DataFrame(data={'name': ['Bob', 'Dylan', 'Rachel', 'Mark'],
                        'age': [23, 27, 30, 35]})
# iterate through each row in the data frame
col_idx = df.columns.get_loc('name')  # this is so we can use iloc
for i in df.itertuples():
    df.iloc[i[0], col_idx] = time.time()
So, essentially we use the index of the dataframe as the indicator of the position of the row. The first index points to the first row in the dataframe, and so on.
EDIT: as pointed out in the comment, using .index to iterate rows is not a good practice. So, let's use the number of rows of the dataframe itself. This can be obtained via df.shape which returns a tuple (row, column) and so, we only need the row df.shape[0].
2nd EDIT: using df.itertuples() for performance gain and .iloc for integer based indexing.
Additionally, the official pandas docs recommend using loc for assignment to a pandas dataframe, to avoid the pitfalls of chained indexing. More information here: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
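Following that recommendation, here is a label-based variant of the loop above (same made-up names/ages data); each .loc call writes one cell:

```python
import time

import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'Dylan', 'Rachel', 'Mark'],
                   'age': [23, 27, 30, 35]})

# Assign one cell per iteration via .loc with the row label,
# avoiding chained indexing entirely.
for idx in df.index:
    df.loc[idx, 'name'] = time.time()

print(df)
```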

I have multiple columns in csv. How do I match row values to a reference column using python?

I have a csv file with 367 columns. The first column has 15 unique values, and each subsequent column has some subset of those 15 values. No unique value is ever found more than once in a column. Each column is sorted. How do I get the rows to line up? My end goal is to make a presence/absence heat map, but I need to get the data matrix in the right format first, which I am struggling with.
Here is a small example of the type of data I have:
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,
I need the rows to match the reference but stay in the same column like so:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
My thought was to use the pandas library, but I could not figure out how to approach this problem, as I am very new to using python. I am using python2.7.
So your problem is definitely solvable via pandas:
Code:
# Create the sample data into a data frame
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')

# set the first column as an index
df = df.set_index([0])

# create a frame which we will build up
results = pd.DataFrame(index=df.index)

# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])

# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
Results:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
How does it work?:
Pandas can be little daunting when first approached, even for someone who knows python well, so I will try to walk through this. And I encourage you to do what you need to get over the learning curve, because pandas is ridiculously powerful for this sort of data manipulation.
Get the data into a frame:
This first bit of code does nothing but get your sample data into a pandas.DataFrame. Your data format was not specified, so I will assume that you can get it into a frame; if you cannot, ask another question here on SO about getting the data into a frame.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
Build a result frame:
Start with a result frame that is just the index
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
For each column in the source data, see if the value is in the index
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
That's it, with three lines of code, we have calculated our results.
Output the results:
Now iterate through each row, which contains booleans, and output the values in the desired format (as ints)
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
This outputs the index value first, and then for each True value outputs the index again, and for False values outputs an empty string.
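For completeness, the printing loop itself can also be vectorized. This sketch repeats the setup above, then multiplies the boolean frame by its index (True becomes the index value, False becomes 0) and blanks the zeros:

```python
import pandas as pd
from io import StringIO

# Same sample data and setup as in the answer above.
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=True).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
df = df.set_index([0])

results = pd.DataFrame(index=df.index)
for col in df.columns:
    results[col] = df.index.isin(df[col])

# Multiply each boolean row by its index value, then blank the zeros
# to get the presence/absence matrix without a Python-level loop.
out = results.mul(results.index.to_series(), axis=0).replace(0, "")
print(out.to_csv(header=False).strip())
```

This keeps the result as a DataFrame, which is handy if the next step is a heat map rather than CSV text.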
Postscript:
There are quite a few people here on SO who are way better at pandas than I am, but since you did not tag your question with the pandas keyword, they likely did not notice it. That allows me to take my cut at answering before they do. The pandas tag is very well covered for well-formed questions, so I am pretty sure that if this answer is not optimal, someone else will come by and improve it. So in the future, be sure to tag your question with pandas to get the best response.
Also, you mentioned that you are new to python, so I will just put in a plug to make sure that you are using a good IDE. I use PyCharm, and it and other good IDEs can make working in python even more powerful, so I highly recommend them.
