Looping through DataFrame via zip - python

I'm using this code to loop through a dataframe:
for r in zip(df['Name']):
    # statements
How do I identify a particular row in the dataframe? For example, I want to assign a new value to each row of the Name column while looping through. How do I do that?
I've tried this:
for r in zip(df['Name']):
    df['Name'] = time.time()
The problem is that every single row is getting the same value instead of different values.

The main problem is in the assignment:
df['Name']= time.time()
This says to grab the current time and assign it to every cell in the Name column. You reference the column vector, rather than a particular row. Note your iteration statement:
for r in zip(df['Name']):
Here, r is a one-element tuple holding that row's Name value, but you never refer to it. That makes it highly unlikely that anything you do within the loop will affect an individual row.
Putting on my "teacher" hat ...
Look up examples of how to iterate through the rows of a Pandas data frame.
Within those, see how individual cells are referenced: that technique looks a lot like indexing a nested list.
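Now, alter your code so that you put the current time in one cell at a time, one on each iteration. It will look something like
df.at[row, 'Name'] = time.time()
where row is the index label of the current row in your iteration. Note that assigning to row['Name'] on the Series objects that iterrows() yields is not reliable, because each of those is typically a copy of the row rather than a view into the frame.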
Does that get you to a solution?
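A minimal sketch of the one-cell-per-iteration idea, assuming a small made-up frame whose Name column has object dtype (the sample names are illustrative only, not from the question). It walks the index labels and writes a single cell on each pass with df.at:

```python
import time
import pandas as pd

# Hypothetical sample data; object dtype so the column can hold floats as well as strings.
df = pd.DataFrame({"Name": pd.Series(["Bob", "Dylan", "Rachel"], dtype=object)})

for label in df.index:
    # .at addresses exactly one cell: (row label, column name)
    df.at[label, "Name"] = time.time()

print(df)
```

Each row now holds the timestamp taken on its own iteration, rather than one value broadcast to the whole column.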

The following also works:
import pandas as pd
import time
# example df
df = pd.DataFrame(data={'name': ['Bob', 'Dylan', 'Rachel', 'Mark'],
                        'age': [23, 27, 30, 35]})
# iterate through each row in the data frame
col_idx = df.columns.get_loc('name')  # this is so we can use iloc
for i in df.itertuples():
    df.iloc[i[0], col_idx] = time.time()
So, essentially we use the index of the dataframe as the indicator of the position of the row. The first index points to the first row in the dataframe, and so on. (This relies on the default integer index, where each label equals the row's position.)
EDIT: as pointed out in the comment, using .index to iterate rows is not a good practice. So, let's use the number of rows of the dataframe itself. This can be obtained via df.shape which returns a tuple (row, column) and so, we only need the row df.shape[0].
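A rough sketch of the position-based variant, assuming a made-up frame with a non-default index to show that only row positions are used. The frame contents and the increment are illustrative assumptions:

```python
import pandas as pd

# Hypothetical data with labels that do NOT match row positions.
df = pd.DataFrame({"name": ["Bob", "Dylan", "Rachel"], "age": [23, 27, 30]},
                  index=[10, 20, 30])

col_idx = df.columns.get_loc("age")  # integer position of the column, for .iloc

# df.shape[0] is the number of rows, so pos runs over 0, 1, 2
for pos in range(df.shape[0]):
    df.iloc[pos, col_idx] = df.iloc[pos, col_idx] + 1

print(df)
```

Because .iloc is purely positional, this keeps working even when the index labels are not 0, 1, 2, and so on.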
2nd EDIT: using df.itertuples() for performance gain and .iloc for integer based indexing.
Additionally, the official pandas doc recommends the use of loc for variable assignment to a pandas dataframe due to potential chained indexing. More information here http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
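As an illustration of the .loc recommendation, here is a small sketch on made-up data. The condition and the replacement value are assumptions chosen only to show the single-step form of the assignment:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["Bob", "Dylan"], "age": [23, 27]})

# Chained indexing such as df[df["age"] > 25]["age"] = 0 may write to a
# temporary copy and leave df unchanged. A single .loc call selects the rows
# and the column in one step, so the assignment reaches df itself.
df.loc[df["age"] > 25, "age"] = 0

print(df)
```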

Related

Splitting a Pandas DataFrame based on whether the index name appears in a list

This should hopefully be a straightforward question but I'm new to Pandas.
I've got a DataFrame called RawData, and a list of permissible indexes called AllowedIndexes.
What I want to do is split the DataFrame into two new ones:
one DataFrame only with the indexes that appear on the AllowedIndexes list.
one DataFrame only with the indexes that don't appear on the
AllowedIndexes list, for data cleaning purposes.
I've provided a simplified version of the actual data I'm using which in reality contains several series.
import pandas as pd
RawData = pd.DataFrame({'Quality':['#000000', '#FF0000', '#FFFFFF', '#PURRRR','#123Z']}, index = ['Black','Red','White', 'Cat','Blcak'])
AllowedIndexes = ['Black','White','Yellow','Red']
Thanks!
.index takes the index for each row of the RawData dataframe.
.isin() checks if the element exists in the AllowedIndexes list.
allowed = RawData[(RawData.index.isin(AllowedIndexes))==True]
not_allowed = RawData[(RawData.index.isin(AllowedIndexes))==False]
Another way without checking if True, or False:
allowed = RawData[RawData.index.isin(AllowedIndexes)]
not_allowed = RawData[~(RawData.index.isin(AllowedIndexes))]
~ is the element-wise logical NOT operator in pandas.
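Put together with the sample data from the question, the split might look like the sketch below. The variable name mask is introduced here for illustration only:

```python
import pandas as pd

RawData = pd.DataFrame(
    {'Quality': ['#000000', '#FF0000', '#FFFFFF', '#PURRRR', '#123Z']},
    index=['Black', 'Red', 'White', 'Cat', 'Blcak'])
AllowedIndexes = ['Black', 'White', 'Yellow', 'Red']

# Boolean array: True where the row's index label is in the allowed list.
mask = RawData.index.isin(AllowedIndexes)

allowed = RawData[mask]       # rows whose label is allowed
not_allowed = RawData[~mask]  # ~ flips every True/False in the mask

print(allowed)
print(not_allowed)
```

Computing the mask once and reusing it avoids calling isin() twice.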

How to access a row in pandas?

Could you please explain the difference between these two:
#1
for index, row in df.iterrows():
#2
for x in df['city']:
Should I always use one or the other when accessing data in pandas:
for index, row in df.iterrows():
for x in df['city']:
Or will specifying the column name, as in the second example, be enough in some cases?
Thank you
There are more ways to iterate than the ways you described. It all comes down to how simple your iteration is and the "efficiency" of it.
The second way is enough if you only need to iterate over the values of a single column.
Also bear in mind that, depending on the method of iteration, you get back different types of objects. You can read about them all in the pandas docs.
This is an interesting article explaining the different methods regarding performance https://medium.com/#rtjeannier/pandas-101-cont-9d061cb73bfc
for index, row in df.iterrows():
    print(row['city'])
Explanation: It helps us to iterate over a data frame row-wise with row variable having values for each column of that row & 'index' having an index of that row. To access any value for that row, mention the column name as above
for x in df['city']:
    print(x)
Explanation: It helps us to iterate over a Series df['city'] & not other columns in df.
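To make the difference concrete, here is a small sketch on a made-up frame. The pop_m column and its values are hypothetical, added only so that the frame has a second column:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"city": ["Paris", "Oslo"], "pop_m": [2.1, 0.7]})

# iterrows(): each row is a Series, so any column can be read by name.
rows = [(index, row["city"], row["pop_m"]) for index, row in df.iterrows()]

# Iterating over one column: each x is just that column's value.
cities = [x for x in df["city"]]

print(rows)
print(cities)
```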

How to add values to a new column in pandas dataframe?

I want to create a new named column in a Pandas dataframe, insert a first value into it, and then add more values to the same column:
Something like:
import pandas
df = pandas.DataFrame()
df['New column'].append('a')
df['New column'].append('b')
df['New column'].append('c')
etc.
How do I do that?
If I understand correctly, you want to append values to an existing column in a pandas data frame. With DataFrames you need to maintain a matrix-like shape, so the number of rows must be equal for each column. What you can do is add a column with a default value and then update this value with
for index, row in df.iterrows():
    df.at[index, 'new_column'] = new_value
Don't do it, because it's slow:
updating an empty frame a-single-row-at-a-time. I have seen this method used WAY too much. It is by far the slowest. It is probably common place (and reasonably fast for some python structures), but a DataFrame does a fair number of checks on indexing, so this will always be very slow to update a row at a time. Much better to create new structures and concat.
Better to create a list of data and create the DataFrame with the constructor:
vals = ['a','b','c']
df = pandas.DataFrame({'New column':vals})
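A short sketch of the "collect first, build once" approach, assuming the values arrive one at a time in a loop. The second batch named more is a hypothetical addition to show pandas.concat:

```python
import pandas

# Accumulate values in a plain Python list; appending to a list is cheap.
vals = []
for letter in "abc":
    vals.append(letter)

# Build the frame once, from the finished list.
df = pandas.DataFrame({'New column': vals})

# If more rows arrive later, build a second frame and concatenate.
more = pandas.DataFrame({'New column': ['d', 'e']})  # hypothetical extra batch
df = pandas.concat([df, more], ignore_index=True)

print(df)
```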
If you need to add random values to the newly created column, you could also use (assuming numpy is imported as np)
df['new_column'] = np.random.randint(1, 9, len(df))

Python: How to get a specific column value while looping through a row of data

I am trying to extract data from Quandl and I want to get the Date and 'Open' value (respectively) for each row. However, I am not sure how I should do this. I have tried different methods, but none of them has worked. Below is an example:
data = quandl.get("EOD/PG", trim_start="2011-12-12", trim_end="2011-12-30",
                  authtoken=quandl.ApiConfig.api_key)
data = data.reset_index()
sta = data[['Date','Open']]
for row in sta:
    price = row.iloc[:,1]
    date = row.iloc[:, 0]
What you're doing with the code you have provided is iterating through the column names, i.e. you get 'Date' on the first iteration, and 'Open' on the next (and last).
To iterate through a dataframe by row, you can use either the .iterrows() or the .itertuples() method. (.iteritems(), called .items() in current pandas, iterates over columns rather than rows.)
For example,
for row in data.itertuples():
    price = row.Open
    date = row.Date
Having said so, iterating through a pandas dataframe is really slow. Chances are, whatever you intend to do could be done faster by making use of pandas' vectorization, i.e. without a loop.
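As a rough illustration of the loop versus the vectorized route, the sketch below uses a tiny made-up frame in place of the Quandl download. The dates, prices, and the Open_change column are all hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the downloaded data.
data = pd.DataFrame({
    "Date": pd.to_datetime(["2011-12-12", "2011-12-13", "2011-12-14"]),
    "Open": [10.0, 12.0, 11.5],
})

# Row-by-row access, as in the answer above.
pairs = [(row.Date, row.Open) for row in data.itertuples(index=False)]

# Vectorized alternative: operate on the whole column at once, no loop.
data["Open_change"] = data["Open"].diff()

print(pairs)
print(data)
```

Whether a loop is needed at all depends on what is done with each price and date; column-wise operations like diff() cover many common cases.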

I have multiple columns in csv. How do I match row values to a reference column using python?

I have a csv file with 367 columns. The first column has 15 unique values, and each subsequent column has some subset of those 15 values. No unique value is ever found more than once in a column. Each column is sorted. How do I get the rows to line up? My end goal is to make a presence/absence heat map, but I need to get the data matrix in the right format first, which I am struggling with.
Here is a small example of the type of data I have:
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,
I need the rows to match the reference but stay in the same column like so:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
My thought was to use the pandas library, but I could not figure out how to approach this problem, as I am very new to using python. I am using python2.7.
So your problem is definitely solvable via pandas:
Code:
# Create the sample data into a data frame
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
Results:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
How does it work?:
Pandas can be a little daunting when first approached, even for someone who knows python well, so I will try to walk through this. And I encourage you to do what you need to get over the learning curve, because pandas is ridiculously powerful for this sort of data manipulation.
Get the data into a frame:
This first bit of code does nothing but get your sample data into a pandas.DataFrame. Your data format was not specified, so I will assume that you can get it into a frame or, if you cannot, that you will ask another question here on SO about doing so.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
Build a result frame:
Start with a result frame that is just the index
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
For each column in the source data, see if the value is in the index
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
That's it, with three lines of code, we have calculated our results.
Output the results:
Now iterate through each row, which contains booleans, and output the values in the desired format (as ints)
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
This outputs the index value first, and then for each True value outputs the index again, and for False values outputs an empty string.
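Since the stated end goal is a presence/absence heat map, a 0/1 matrix may be more convenient than the printed text. The sketch below is one possible way to build it from the same sample data. The names raw, reference, presence, and matrix are introduced here for illustration and are not part of the answer above:

```python
import pandas as pd
from io import StringIO

# Same sample data as in the question; empty cells become NaN, then 0.
raw = pd.read_csv(StringIO("1,2,1,2\n2,3,2,5\n3,4,3,\n4,,5,\n5,,,\n"),
                  header=None).fillna(0).astype(int)

reference = raw[0]  # first column holds the reference values

# For every other column, flag which reference values occur in it.
presence = pd.DataFrame({col: reference.isin(raw[col])
                         for col in raw.columns[1:]})
presence.index = pd.Index(reference, name="reference")

# True/False -> 1/0, a shape most heat map tools accept directly.
matrix = presence.astype(int)
print(matrix)
```

The fill value 0 is safe here only because 0 never appears among the reference values.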
Postscript:
There are quite a few people here on SO who are way better at pandas than I am, but since you did not tag your question with the pandas keyword, they likely did not notice it. That allows me to take my cut at answering before they do. The pandas keyword is very well covered for well-formed questions, so I am pretty sure that if this answer is not optimal, someone else will come by and improve it. So in the future, be sure to tag your question with pandas to get the best response.
Also, you mentioned that you are new to python, so I will just put in a plug to make sure that you are using a good IDE. I use PyCharm, and it and other good IDEs can make working in python even more powerful, so I highly recommend them.
