I am using a dataframe (df1) inside a loop to store information read from another dataframe (df2). df1 can have a different number of rows in every iteration, and I store the data row by row using df1.loc[row_number]. Here is an example:
a b c
0 9 2 3
1 8 5 6
2 3 8 9
Then I need to read the value of the first column and the first row, which I perform as
df1['a'].iloc[0]
9
The problem arises when df1 is a one row dataframe:
a 9
b 2
c 3
Name: 0, dtype: int64
It seems that with only one row, pandas stores the dataframe as a pandas Series object. Trying to access the value the same way (df1['a'].iloc[0]) I get the error:
AttributeError: 'numpy.int64' object has no attribute 'iloc'
Is there a way to solve this in a general case, with no need to handle the 1-row dataframe separately?
When df1 is a Series rather than a DataFrame, df1['a'] is no longer a column lookup: 'a' is an index label, so it returns a scalar (a numpy.int64), which has no .iloc attribute. Try using df1.iloc[0] directly instead.
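A minimal sketch of both cases (the variable names df_multi, row, and df_one are illustrative, not from the question). One general pattern is to select rows with a list of labels, which always returns a DataFrame, so the same access code works for any number of rows:

```python
import pandas as pd

# A normal multi-row DataFrame: column access then positional access works.
df_multi = pd.DataFrame({"a": [9, 8, 3], "b": [2, 5, 8], "c": [3, 6, 9]})
print(df_multi['a'].iloc[0])   # 9

# Selecting a single row with a scalar label yields a Series, where 'a' is
# now an index label, so ['a'] returns a scalar and .iloc would fail on it.
row = df_multi.loc[0]          # Series: a=9, b=2, c=3
print(type(row['a']))          # a numpy scalar, not a Series

# Selecting with a *list* of labels keeps a one-row DataFrame, so the
# original access pattern keeps working unchanged.
df_one = df_multi.loc[[0]]
print(df_one['a'].iloc[0])     # 9
```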
I have the following dataframe from which I want to retrieve cell values using index and column names. The left column holds the index values, and the column names run from 1 to 5. This is a dummy dataframe that looks small, but going forward I will be using this code on a dataframe with 100+ columns, and it is not possible to know the column names beforehand.
     1  2  3  4  5
t_1  1  0  0  0  1
t_2  1  1  0  0  0
t_3  1  0  0  0  0
t_4  1  0  1  0  1
To retrieve the values from this dataframe I am using itertuples() to loop over it. Please note that this can easily be done with iterrows(), but iterrows() is much slower, which is why I want to avoid it. Here is the code snippet that iterates over the dataframe:
for row in input_df.itertuples():
    print(row.Index)
    for col in input_df.columns[1:]:
        print(row.col)
Since I won't be knowing the column names beforehand I want to get the column names from the dataframe list and then use it to fetch the cell values. For example, row t_1 column 1 should return 1. However, with the above code I am getting the following error:
AttributeError: 'Pandas' object has no attribute 'col'
If I mention the exact column name in place of col with row, I get the result without any error. Please help me understand what I am doing wrong here to get this error. Is there any other solution apart from iterrows() to get cell values by column name?
Simply change row.col to getattr(row, col):
for row in input_df.itertuples():
    print(row.Index)
    for col in input_df.columns[1:]:
        print(getattr(row, col))
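A small self-contained sketch of the fix (the frame below is dummy data standing in for input_df; note the column names here are valid Python identifiers, which attribute-style access requires):

```python
import pandas as pd

# Dummy frame standing in for input_df, indexed by the t_* labels.
input_df = pd.DataFrame(
    {"id": ["t_1", "t_2"], "c1": [1, 1], "c2": [0, 1]}
).set_index("id")

for row in input_df.itertuples():
    print(row.Index)                   # the index label, e.g. 't_1'
    for col in input_df.columns:
        print(col, getattr(row, col))  # dynamic lookup by column name

# Caveat: column names that are not valid identifiers (like the 1..5 in
# the question) get renamed to _1, _2, ... by itertuples(); in that case
# index the tuple positionally instead, e.g. row[i + 1] for column i.
```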
When we make a new column in a dataset in pandas
df["Max"] = df.iloc[:, 5:7].sum(axis=1)
If we are only taking the columns from index 5 to index 7, why do we also need to pass : for the rows?
pandas.DataFrame.iloc is purely integer-location based indexing for selection by position (read here for documentation). The : means all rows in the selected columns, here column indices 5 and 6 (iloc does not include the end of the slice).
You are using .iloc to take a slice out of the dataframe and apply an aggregate function across the columns of that slice.
Consider an example:
df = pd.DataFrame({"a":[0,1,2],"b":[2,3,4],"c":[4,5,6]})
df
would produce the following dataframe
a b c
0 0 2 4
1 1 3 5
2 2 4 6
You are using iloc to avoid dealing with named columns, so that
df.iloc[:,1:3]
would look as follows
b c
0 2 4
1 3 5
2 4 6
Now a slight modification of your code would get you a new column containing sums across columns
df.iloc[:,1:3].sum(axis=1)
0 6
1 8
2 10
Alternatively you could use function application:
df.apply(lambda x: x.iloc[1:3].sum(), axis=1)
0 6
1 8
2 10
Here you explicitly tell apply to sum across columns. However, your syntax is more succinct and preferable to explicit function application. The result is the same either way.
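For completeness, the same row-wise sum can be written with label-based selection; a minimal sketch using the example frame above (note that .loc slices include both endpoints, unlike .iloc):

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2], "b": [2, 3, 4], "c": [4, 5, 6]})

# Row-wise sum over columns b and c, by label and by position.
by_label = df.loc[:, "b":"c"].sum(axis=1)    # .loc is endpoint-inclusive
by_position = df.iloc[:, 1:3].sum(axis=1)    # .iloc excludes the end

print(by_label.equals(by_position))  # True
```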
Is there any way to select the row by index (i.e. integer) and column by column name in a pandas data frame?
I tried using loc but it returns an error, and I understand iloc only works with integer positions.
Here are the first rows of the data frame df. I want to select the first row of the column named 'Volume', and tried using df.loc[0,'Volume']
Use the get_loc method of Index to get the integer location of a column name.
Suppose this dataframe:
>>> df
A B C
10 1 2 3
11 4 5 6
12 7 8 9
You can use .iloc like this:
>>> df.iloc[1, df.columns.get_loc('B')]
5
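A runnable sketch of this, plus the mirror-image option of translating the row position into a label so you can use label-based scalar access (df.at) instead:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]},
                  index=[10, 11, 12])

# Option 1: map the column name to a position, then use .iloc throughout.
print(df.iloc[1, df.columns.get_loc("B")])   # 5

# Option 2: map the row position to a label, then use .at (fast scalar access).
print(df.at[df.index[1], "B"])               # 5
```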
I am going through the 'Python for Data Analysis' book and having trouble in the 'Example: 2012 Federal Election Commission Database' section reading the data into a DataFrame. The trouble is that one of the columns of data is always being set as the index column, even when the index_col argument is set to None.
Here is the link to the data : http://www.fec.gov/disclosurep/PDownload.do.
Here is the loading code (to save time in the checking, I set the nrows=10):
import pandas as pd
fec = pd.read_csv('P00000001-ALL.csv',nrows=10,index_col=None)
To keep it short I am excluding the data column outputs, but here is my output (please note the Index values):
In [20]: fec
Out[20]:
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, C00410118 to C00410118
Data columns:
...
dtypes: float64(4), int64(3), object(11)
And here is the book's output (again with data columns excluded):
In [13]: fec = read_csv('P00000001-ALL.csv')
In [14]: fec
Out[14]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1001731 entries, 0 to 1001730
...
dtypes: float64(1), int64(1), object(14)
The Index values in my output are actually the first column of data in the file, which shifts all the rest of the data to the left by one column. Would anyone know how to prevent this column of data from being used as the index? I would like the index to just be increasing integers.
I am fairly new to python and pandas, so I apologize for any inconvenience. Thanks.
Quick Answer
Use index_col=False instead of index_col=None when you have delimiters at the end of each line; this turns off index column inference and discards the last column.
More Detail
After looking at the data, there is a comma at the end of each line. And this quote (the documentation has been edited since the time this post was created):
index_col: column number, column name, or list of column numbers/names, to use as the index (row labels) of the resulting DataFrame. By default, it will number the rows without using any column, unless there is one more data column than there are headers, in which case the first column is taken as the index.
from the documentation shows that pandas believes you have n headers and n+1 data columns and is treating the first column as the index.
EDIT 10/20/2014 - More information
I found another valuable entry that is specifically about trailing delimiters and how to simply ignore them:
If a file has one more column of data than the number of column names, the first column will be used as the DataFrame’s row names: ...
Ordinarily, you can achieve this behavior using the index_col option.
There are some exception cases when a file has been prepared with delimiters at the end of each data line, confusing the parser. To explicitly disable the index column inference and discard the last column, pass index_col=False: ...
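A small reproduction of the documented behavior (using io.StringIO in place of a real file, so the sketch is self-contained):

```python
import io
import pandas as pd

# A CSV with a trailing comma on every data line: 2 header names, 3 fields.
data = "A,B\n1,4,\n2,5,\n3,6,\n"

# Default inference sees one more data column than headers and takes the
# first column as the index -- the bug described in the question.
df_default = pd.read_csv(io.StringIO(data))
print(df_default.index.tolist())   # [1, 2, 3] -- the data became the index

# index_col=False disables that inference and discards the dangling column.
df_fixed = pd.read_csv(io.StringIO(data), index_col=False)
print(df_fixed)
#    A  B
# 0  1  4
# 1  2  5
# 2  3  6
```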
Re: craigts's response, for anyone having trouble with using either False or None parameters for index_col, such as in cases where you're trying to get rid of a range index, you can instead use an integer to specify the column you want to use as the index. For example:
df = pd.read_csv('file.csv', index_col=0)
The above will set the first column as the index (and not add a range index in my "common case").
Update
Given the popularity of this answer, I thought I'd add some context and a demo:
# Setting up the dummy data
In [1]: df = pd.DataFrame({"A":[1, 2, 3], "B":[4, 5, 6]})
In [2]: df
Out[2]:
A B
0 1 4
1 2 5
2 3 6
In [3]: df.to_csv('file.csv', index=None)
The resulting file.csv contains:
A,B
1,4
2,5
3,6
Reading without index_col or with None/False will all result in a range index:
In [4]: pd.read_csv('file.csv')
Out[4]:
A B
0 1 4
1 2 5
2 3 6
# Note that this is the default behavior, so the same as In [4]
In [5]: pd.read_csv('file.csv', index_col=None)
Out[5]:
A B
0 1 4
1 2 5
2 3 6
In [6]: pd.read_csv('file.csv', index_col=False)
Out[6]:
A B
0 1 4
1 2 5
2 3 6
However, if we specify that "A" (the 0th column) is actually the index, we can avoid the range index:
In [7]: pd.read_csv('file.csv', index_col=0)
Out[7]:
B
A
1 4
2 5
3 6
If pandas is treating your first row as a header, you can use header=None as such:
df = pd.read_csv("csv-file.csv", header=None)
This way pandas will treat your first row like any other row.
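A quick demonstration of the difference (again with io.StringIO standing in for the file):

```python
import io
import pandas as pd

# A headerless file: every line is data.
data = "1,4\n2,5\n3,6\n"

# Default: the first line is consumed as the header row.
df_bad = pd.read_csv(io.StringIO(data))
print(df_bad.columns.tolist())  # ['1', '4'] -- data promoted to headers

# header=None keeps every line as data and numbers the columns 0, 1, ...
df_ok = pd.read_csv(io.StringIO(data), header=None)
print(df_ok)
#    0  1
# 0  1  4
# 1  2  5
# 2  3  6
```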
I have two different dataframes with one similar column. I am trying to apply a conditional statement to the following data.
df
a b
1 5
2 4
3 5.5
4 4.2
5 3.1
df1
a c
1 9
2 3
3 5.1
4 4.8
5 3
I am writing the below code
df.loc['comparison'] = df['b'] > df1['c']
and get the following error:
can only compare identically-labeled Series objects.
Please advise how can I fix this issue.
Your dataframe indices (not displayed in your question) are not aligned. In addition, you are attempting to add a column incorrectly: pd.DataFrame.loc with one indexer refers to a row index rather than a column.
To overcome these issues, you can reindex one of your series and use df[col] to create a new series:
df['comparison'] = df['b'] > df1['c'].reindex(df.index)
See Indexing and Selecting Data to understand how to index data in a dataframe.
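A self-contained sketch of the fix; here df1 is deliberately given shifted index labels (1..5 instead of 0..4) so the two Series are not identically labeled, which reproduces the error:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                   "b": [5, 4, 5.5, 4.2, 3.1]})        # index 0..4
df1 = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                    "c": [9, 3, 5.1, 4.8, 3]},
                   index=[1, 2, 3, 4, 5])              # index 1..5

# df['b'] > df1['c'] would raise; reindex aligns c onto df's labels first
# (labels missing from df1 become NaN, and NaN comparisons evaluate False).
df["comparison"] = df["b"] > df1["c"].reindex(df.index)
print(df["comparison"].tolist())   # [False, False, True, False, False]

# If you meant to compare by position rather than by label, drop the
# index entirely instead of reindexing:
positional = df["b"] > df1["c"].to_numpy()
```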