Could you please explain the difference between these two:
#1
for index, row in df.iterrows():
#2
for x in df['city']:
Should I always use for index, row in df.iterrows(): when accessing data in pandas, or is specifying the column name, as in the second example, enough in some cases?
Thank you
There are more ways to iterate than the two you described. Which one to use comes down to how simple your iteration is and how efficient it needs to be.
The second example is enough if you just want to iterate over the values of a single column.
Also bear in mind that, depending on the method of iteration, you get back different types. You can read about them all in the pandas docs.
This is an interesting article explaining the different methods regarding performance https://medium.com/#rtjeannier/pandas-101-cont-9d061cb73bfc
for index, row in df.iterrows():
    print(row['city'])
Explanation: This iterates over the DataFrame row by row. On each iteration, row is a Series holding the values of every column for that row, and index is that row's index label. To access a value in the row, use the column name as above.
for x in df['city']:
    print(x)
Explanation: This iterates over the values of the Series df['city'] only, not over the other columns in df.
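To make the comparison concrete, here is a minimal sketch that collects the city values three ways. The DataFrame and its visits column are made up for illustration, and itertuples() is shown only as one of the other iteration methods pandas offers.

```python
import pandas as pd

# Hypothetical data; the column names and values are illustrative only.
df = pd.DataFrame({"city": ["Oslo", "Lima", "Pune"], "visits": [3, 1, 2]})

# 1. iterrows(): each row is a Series, so any column can be read by name.
from_rows = [row["city"] for index, row in df.iterrows()]

# 2. Iterating the column: yields the values of that one Series only.
from_column = [x for x in df["city"]]

# 3. itertuples(): each row is a namedtuple; fields are read as attributes.
from_tuples = [t.city for t in df.itertuples(index=False)]

print(from_rows == from_column == from_tuples)
```

If only one column is needed, the second form is the simplest. The first and third are useful when several columns of the same row are needed together.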
Related
I want to create a new column, V, in an existing DataFrame, df. I would like the value of the new column to be the difference between the value in the 'x' column in that row, and the value of the 'x' column in the row below it.
As an example, in the picture below, I want the value of the new column to be
93.244598 - 93.093285 = 0.151313.
I know how to create a new column based on existing columns in Pandas, but I don't know how to reference other rows using this method. Is there a way to do this that doesn't involve iterating over the rows in the dataframe? (since I have read that this is generally a bad idea)
You can use pandas.DataFrame.shift for your use case.
The last row has no row below it to subtract, so the value in that cell will be NaN.
df['temp_x'] = df['x'].shift(-1)
df['new_col'] = df['x'] - df['temp_x']
Or as a one-liner:
df['new_col'] = df['x'] - df['x'].shift(-1)
The column new_col will contain the expected data.
An ideal solution is to use diff:
df['new'] = df['x'].diff(-1)
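As a quick check that the two approaches agree, here is a small sketch. The first two x values are the ones quoted in the question; the third is made up so that there is a last row to inspect.

```python
import pandas as pd

# The first two values come from the question; the third is made up for illustration.
df = pd.DataFrame({"x": [93.244598, 93.093285, 92.9]})

# Each row minus the row below it, computed two ways.
via_shift = df["x"] - df["x"].shift(-1)
via_diff = df["x"].diff(-1)

df["V"] = via_diff
print(df)
```

The last row of V is NaN in both versions because there is no row below it.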
I have a dataframe with 100+ columns where all columns after col10 are of type float. What I would like to do is find the average of a certain range of columns within a loop. Here is what I tried so far:
for index,row in df.iterrows():
    a = row.iloc[col30:col35].mean(axis=0)
This unfortunately returns unexpected values, and I'm not able to get the average of col30, col31, col32, col33, col34, col35 for every row. Can someone please help?
Try:
df.iloc[:, 30:35].mean(axis=1)
You may need to adjust 30:35 to 29:35 (you can remove the .mean and play around to get an idea of how .iloc works). Generally in pandas you want to avoid loops as much as possible. The .iloc indexer lets you select rows and columns by their integer position, and the end of a slice is exclusive. You can then use .mean() with axis=1 to average across the columns, which gives one mean per row.
You really should include a small reproducible example. Please see the one below, where the solution mentioned in the comments works.
import pandas as pd
df = pd.DataFrame({i:val for i,val in enumerate(range(100))}, index=list(range(100)))
for i,row in df.iterrows():
    a = row.iloc[29:35].mean() # a should be 31.5 for each row
    print(a)
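For comparison, the same result can be computed without a loop. This is a sketch on the same toy frame, assuming the six columns of interest sit at positions 29 to 34.

```python
import pandas as pd

# Same toy frame as above: column i holds the constant value i in all 100 rows.
df = pd.DataFrame({i: val for i, val in enumerate(range(100))}, index=list(range(100)))

# Positions 29..34 inclusive (the slice end, 35, is excluded); axis=1 averages per row.
row_means = df.iloc[:, 29:35].mean(axis=1)
print(row_means.head())
```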
I have a dataset that consists of tokenized, POS-tagged phrases as one column of a dataframe:
Current Dataframe
I want to create a new column in the dataframe, consisting only of the proper nouns in the previous column:
Desired Solution
Right now, I'm trying something like this for a single row:
if 'NNP' in df['Description_POS'][96][0:-1]:
    df['Proper Noun'] = df['Description_POS'][96]
But then I don't know how to loop this for each row, and how to obtain the tuple which contains the proper noun.
I'm very new right now and at a loss for what to use, so any help would be really appreciated!
Edit: I tried the solution recommended, and it seems to work, but there is an issue.
this was my dataframe:
Original dataframe
After implementing the code recommended
df['Proper Nouns'] = df['POS_Description'].apply(
    lambda row: [i[0] for i in row if i[1] == 'NNP'])
it looks like this:
Dataframe after creating a proper nouns column
You can use the apply method, which, as the name suggests, applies the given function to every element of a Series (or to every row or column of a DataFrame). This returns a Series, which you can add as a new column to your dataframe:
df['Proper Nouns'] = df['POS_Description'].apply(
    lambda row: [i[0] for i in row if i[1] == 'NNP'])
I am assuming each value in POS_Description is a list of (token, tag) tuples.
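Here is a self-contained sketch of that approach on made-up data. The phrases and the helper name proper_nouns are hypothetical, and each cell is assumed to be a list of (token, tag) tuples as stated above.

```python
import pandas as pd

# Made-up tagged phrases; each cell is a list of (token, tag) tuples.
df = pd.DataFrame({
    "POS_Description": [
        [("Alice", "NNP"), ("visited", "VBD"), ("Paris", "NNP")],
        [("the", "DT"), ("red", "JJ"), ("car", "NN")],
    ]
})


def proper_nouns(tagged):
    """Return the tokens whose tag is exactly 'NNP'."""
    return [token for token, tag in tagged if tag == "NNP"]


# apply() calls proper_nouns once per cell and returns a new Series.
df["Proper Nouns"] = df["POS_Description"].apply(proper_nouns)
print(df["Proper Nouns"].tolist())
```

Rows without any proper noun get an empty list, which may explain empty-looking cells in the new column.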
I'm using this code to loop through a dataframe:
for r in zip(df['Name']):
    #statements
How do I identify a particular row in the dataframe? For example, I want to assign a new value to each row of the Name column while looping through. How do I do that?
I've tried this:
for r in zip(df['Name']):
    df['Name']= time.time()
The problem is that every single row is getting the same value instead of different values.
The main problem is in the assignment:
df['Name']= time.time()
This says to grab the current time and assign it to every cell in the Name column. You reference the column vector, rather than a particular row. Note your iteration statement:
for r in zip(df['Name']):
Here, r is a one-element tuple holding the current value of Name, but you never refer to it. That makes it highly unlikely that anything you do within the loop will affect an individual row.
Putting on my "teacher" hat ...
Look up examples of how to iterate through the rows of a Pandas data frame.
Within those, see how individual cells are referenced: that technique looks a lot like indexing a nested list.
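Now, alter your code so that you put the current time in one cell at a time, one on each iteration. It will look something like
df.at[row, 'Name'] = time.time()
or
row['Name'] = time.time()
depending on how you define row in your iteration. Note that the second form only updates df if row refers to the data in df; a row yielded by iterrows() is generally a copy, so writing to it may not change df.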
Does that get you to a solution?
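One way this could look is sketched below. The frame is hypothetical, and the Name column is created with dtype=object as an assumption, so that it can hold floats after the update.

```python
import time
import pandas as pd

# Hypothetical frame; dtype=object lets the column hold floats after the update.
df = pd.DataFrame({"Name": pd.Series(["Bob", "Dylan", "Rachel"], dtype=object)})

# df.index yields each row label; df.at[label, column] addresses a single cell.
for label in df.index:
    df.at[label, "Name"] = time.time()

print(df)
```

Each iteration writes to exactly one cell, so every row gets the timestamp taken on its own pass through the loop.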
The following also works:
import pandas as pd
import time
# example df
df = pd.DataFrame(data={'name': ['Bob', 'Dylan', 'Rachel', 'Mark'],
                        'age': [23, 27, 30, 35]})
# iterate through each row in the data frame
col_idx = df.columns.get_loc('name') # this is so we can use iloc
for i in df.itertuples():
    # i[0] is the row's index label, which equals its position for a default index
    df.iloc[i[0], col_idx] = time.time()
So, essentially we use the index of the dataframe as the indicator of the position of the row. The first index points to the first row in the dataframe, and so on.
EDIT: as pointed out in the comment, using .index to iterate rows is not a good practice. So, let's use the number of rows of the dataframe itself. This can be obtained via df.shape which returns a tuple (row, column) and so, we only need the row df.shape[0].
2nd EDIT: using df.itertuples() for performance gain and .iloc for integer based indexing.
Additionally, the official pandas doc recommends the use of loc for variable assignment to a pandas dataframe due to potential chained indexing. More information here http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Hello, I've been struggling with this problem. I'm trying to iterate over rows, select data from them, and assign it to variables. This is the first time I'm using pandas and I'm not sure how to select the data.
reader = pd.read_csv(file_path, sep="\t" ,lineterminator='\r', usecols=[0,1,2,9,10],)
for row in reader:
    print(row)
    #id_number = row[0]
    #name = row[2]
    #ip_address = row[1]
    #latitude = row[9]
and this is the output from the row that I want to assign to the variables:
050000
129.240.228.138
planetlab2.simula.no
59.93
Edit: Perhaps this is not a problem for pandas but for general Python. I am fairly new to python and what I'm trying to achieve is to parse tab separated file line by line and assign data to the variables and print them in one loop.
this is the input file sample:
050263 128.2.211.113 planetlab-1.cmcl.cs.cmu.edu NA US Allegheny County Pittsburgh http://www.cs.cmu.edu/ Carnegie Mellon University 40.4446 -79.9427 unknown
050264 128.2.211.115 planetlab-3.cmcl.cs.cmu.edu NA US Allegheny County Pittsburgh http://www.cs.cmu.edu/ Carnegie Mellon University 40.4446 -79.9427 unknown
The general workflow you're describing is: you want to read in a csv, find a row in the file with a certain ID, and unpack all the values from that row into variables. This is simple to do with pandas.
It looks like the CSV file has at least 10 columns in it. Providing the usecols arg should filter out the columns that you're not interested in, and read_csv will ignore them when loading into the pandas DataFrame object (which you've called reader).
Steps to do what you want:
1. Read the data file using pd.read_csv(). You've already done this, but I recommend calling this variable df instead of reader, as read_csv returns a DataFrame object, not a Reader object. You'll also find it convenient to use the names argument to read_csv to assign column names to the dataframe. It looks like you want names=['id', 'ip_address', 'name', 'latitude', 'longitude'] to get those as columns. (This assumes col10 is longitude, which makes sense if columns 9 and 10 are a lat/long pair.)
2. Query the dataframe object for the row with the ID that you're interested in. There are a variety of ways to do this. One is using the query syntax. It is hard to know why you want that specific row without more details, but you can look up more information about index lookups in pandas. Example: row = df.query("id == 50000")
3. Given a single row, extract the row values into variables. This is easy if you've assigned column names to your dataframe. Note that query returns a DataFrame, so take the first match (for example with .iloc[0]) to get a single row. You can then treat the row like a dictionary of values, e.g. lat = row['latitude'] and lon = row['longitude'].
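The steps above can be sketched as follows. This is only an illustration under several assumptions:

- The two sample lines from the question are embedded as a string instead of being read from file_path.
- The field boundaries in the sample are assumed to be tabs, placed where they put latitude at column 9.
- Column names are assigned after reading. Passing names= to read_csv, as suggested above, is the alternative.

```python
import io
import pandas as pd

# Two lines from the question's sample file; tab positions are an assumption.
sample = (
    "050263\t128.2.211.113\tplanetlab-1.cmcl.cs.cmu.edu\tNA\tUS\tAllegheny County\t"
    "Pittsburgh\thttp://www.cs.cmu.edu/\tCarnegie Mellon University\t40.4446\t-79.9427\tunknown\n"
    "050264\t128.2.211.115\tplanetlab-3.cmcl.cs.cmu.edu\tNA\tUS\tAllegheny County\t"
    "Pittsburgh\thttp://www.cs.cmu.edu/\tCarnegie Mellon University\t40.4446\t-79.9427\tunknown\n"
)

# header=None because the file has no header row; usecols keeps five of the twelve fields.
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, usecols=[0, 1, 2, 9, 10])
df.columns = ["id", "ip_address", "name", "latitude", "longitude"]

# The id column is parsed as an integer, so the leading zero is dropped.
matches = df.query("id == 50263")

# query() returns a DataFrame; .iloc[0] picks the first matching row as a Series.
row = matches.iloc[0]
id_number = row["id"]
ip_address = row["ip_address"]
name = row["name"]
latitude = row["latitude"]
print(id_number, ip_address, name, latitude)
```

If the leading zeros in the ID matter, reading that column as a string would be one option, at the cost of comparing against "050263" in the query.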
You can use iterrows():
import pandas
df = pandas.read_csv(file_path, sep=',')
for index, row in df.iterrows():
    value = row['col_name']
Or, if you want to access a column by its position:
df = pandas.read_csv(file_path, sep=',')
for index, row in df.iterrows():
    value = row.iloc[0]  # .ix was removed in pandas 1.0; use .iloc for positions
Are the values you need to add the same for each row, or does determining the value to add require processing each row? If the value is consistent, you can apply the sum with a single vectorized pandas operation on the dataset. If it requires processing row by row, the above solution is the correct one for sure. If it is a table of values that must be added row by row, you can put them all into a column aligned with your dataset, do the addition with pandas, and simply print out the complete dataframe. Assume you have three columns to add, and you put the result into a new column, e:
df['e'] = df.a + df.b + df.d
or, if it is a constant:
df['e'] = df.a + df.b + {constant}
Then drop the columns you don't need (e.g. df['a'] and df['b'] in the above).
Obviously, then, if you need to calculate based on unique values for each row, put the values into another column and sum as above.
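A minimal sketch of the column-wise approach, using made-up values for a, b and d and an arbitrary constant of 5. The extra column f is introduced here only to show the constant case next to e.

```python
import pandas as pd

# Hypothetical columns a, b and d, matching the names used above.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "d": [100, 200, 300]})

# Column-wise addition is vectorized: no explicit loop over rows.
df["e"] = df["a"] + df["b"] + df["d"]

# Adding a constant broadcasts it to every row (5 is an arbitrary example value).
df["f"] = df["a"] + df["b"] + 5

# Drop the inputs that are no longer needed.
result = df.drop(columns=["a", "b"])
print(result)
```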