Python CSV problem: using a variable in .loc code

I imported my CSV file and I want to use variables in my .loc code, like this:
a = 0; b = 1
df.loc[int(a+2),int(2*b-2)]
But it raises an error.
How do I fix it?

If you want the value at position (a+2, 2*b-2), use .iloc instead.
.loc selects data by index label, not by position.
Try this:
df.iloc[a+2,2*b-2]
pandas docs
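A minimal sketch of the difference, assuming a small made-up DataFrame in place of the imported CSV (the column names "x" and "y" are hypothetical). It shows the same cell reached by position with .iloc and by label with .loc:

```python
import pandas as pd

# Hypothetical stand-in for the imported CSV: default integer row labels,
# string column names.
df = pd.DataFrame({"x": [10, 11, 12], "y": [20, 21, 22]})

a, b = 0, 1
row, col = a + 2, 2 * b - 2  # row position 2, column position 0

by_position = df.iloc[row, col]          # third row, first column
by_label = df.loc[row, df.columns[col]]  # row label 2, column label "x"
```

With named columns, df.loc[row, col] fails because 0 is not a column label; translating the position to a label with df.columns[col] is one way around that if .loc is really needed.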

It seems that you want to use integers to select a piece of data. A pandas.DataFrame can be indexed with df.loc and df.iloc.
Allowed inputs of df.loc are:
A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
A list or array of labels, e.g. ['a', 'b', 'c'].
A slice object with labels, e.g. 'a':'f'.
A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
An alignable boolean Series.
An alignable Index.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above)
Allowed inputs of df.iloc are:
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
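Both accept a row indexer and a column indexer separated by a comma, but they interpret integers differently: df.loc treats them as labels, df.iloc as positions. df.loc[int(a+2),int(2*b-2)] therefore looks for a column labelled 0, which a CSV with named columns normally does not have.
So use df.iloc[a+2, 2*b-2] to get the single value at that position. Note that a list, as in df.iloc[[int(a+2),int(2*b-2)]], selects the whole rows at those positions rather than one cell.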
For more details, see the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
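To make the input rules above concrete, here is a small sketch on a made-up 3x3 DataFrame (the column names are hypothetical). It contrasts two comma-separated integers, which pick one cell, with a list of integers, which picks whole rows:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["p", "q", "r"])

cell = df.iloc[2, 0]    # one value: row position 2, column position 0
rows = df.iloc[[2, 0]]  # two whole rows, at positions 2 and 0
```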

Related

Python iloc slice range from dictionary value

I am trying to use a dictionary value to define the slice ranges for the iloc function but I keep getting the error -- Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] . The excel sheet is built for visual information and not in any kind of real table format (not mine so I can’t change it) so I have to slice the specific ranges without column labels.
The code I tried, which raises the error:
cr_dict= {'AA':'[42:43,32:65]', 'BB':'[33:34, 32:65]'}
df = my_df.iloc[cr_dict['AA']]
the results I want would be similar to
df = my_df.iloc[42:43,32:65]
I know I could change the dictionary and use the following but it looks convoluted and not as easy to read– is there a better way?
Code
cr_dict = {'AA': [42, 43, 32, 65], 'BB': [33, 34, 32, 65]}
df = my_df.iloc[cr_dict['AA'][0]:cr_dict['AA'][1], cr_dict['AA'][2]:cr_dict['AA'][3]]
Define your dictionaries slightly differently.
cr_dict= {'AA':[42,43]+list(range(32,65)),
'BB':[33,34]+list(range(32,65))}
Then you can index your DataFrame with the lists (the first argument selects row positions, the second selects column positions):
>>> my_df.iloc[cr_dict["AA"], cr_dict["BB"]].sort_index()
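Another option, offered only as a sketch and not taken from the answer above, is to store Python slice objects in the dictionary. .iloc accepts a (row slice, column slice) tuple directly, so the dictionary value can be passed as is. The DataFrame below is a hypothetical stand-in for the Excel sheet:

```python
import numpy as np
import pandas as pd

# Hypothetical sheet-like frame, large enough for the ranges in the question.
my_df = pd.DataFrame(np.arange(50 * 70).reshape(50, 70))

# Each entry is a (row slice, column slice) tuple.
cr_dict = {
    "AA": (slice(42, 43), slice(32, 65)),
    "BB": (slice(33, 34), slice(32, 65)),
}

aa = my_df.iloc[cr_dict["AA"]]  # same as my_df.iloc[42:43, 32:65]
```

If the slice(...) calls look noisy, np.s_[42:43, 32:65] builds the same tuple using ordinary slice syntax.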

Selecting by index -1 in a df column / time series throws error

Let's assume we have a simple dataframe like this:
df = pd.DataFrame({'col1':[1,2,3], 'col2':[10,20,30]})
Then I can select elements like this
df.col2[0] or df.col2[1]
But if I want to select the last element with df.col2[-1] it results in the error message:
KeyError: -1
I know there are workarounds: I could use, for example, df.col2[len(df)-1] or df.iloc[-1,1]. But why isn't the much simpler form, indexing directly with -1, allowed? Am I missing another simple way to select the last element? Thanks.
The index labels of your DataFrame are [0,1,2]. Your code df.col2[1] is equivalent to using .loc, as in df['col2'].loc[1] (or df.col2.loc[1]). Your index does not contain a label -1, which is why you get the KeyError.
For positional indexing, use .iloc (available on a pandas Series as well as a DataFrame): df['col2'].iloc[-1] (or df.col2.iloc[-1]).
As you can see, you can combine label-based ('col2') and position-based (-1) indexing; you don't have to choose one or the other as in df.iloc[-1,1] or df.col2[len(df)-1] (which is equivalent to df.loc[len(df)-1,'col2']).
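A short sketch using the DataFrame from the question; the variable names are only illustrative. It selects the last element once by position and once by first looking up the last label:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 20, 30]})

last_by_position = df["col2"].iloc[-1]        # position -1 is the last row
last_by_label = df.loc[df.index[-1], "col2"]  # fetch the last label, then use it
```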

pandas df.apply returns series of the same list (like map) where should return one list

I have a function that takes a row of the dataframe (pd.Series) and returns one list. The idea is to apply it to the dataframe and generate a new pd.Series of lists, one per row:
sale_candidats = closings.apply(get_candidates_3, axis=1,
sales=sales_ts,
settings=settings,
reduce=True)
However, it seems that pandas tries to map the list it returns (for the first row, probably) onto the original row, and raises an error (despite reduce=True):
ValueError: Shape of passed values is (10, 8), indices imply (10, 23)
When I convert function to return set instead of the list, the whole thing starts working - except returning a data frame with the same shape and index/columns name as an original data frame, except that every cell is filled with corresponding row's set().
Looks a lot like a bug to me... how can I return one pd.Series instead?
This behaviour does indeed seem to be a bug in the latest version of pandas. Take a look at the issue:
https://github.com/pandas-dev/pandas/pull/18577
You could just apply the function in a for loop, because that's all that apply does. You wouldn't notice a large speed penalty.
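The loop suggestion could be sketched as follows. The DataFrame and the function are hypothetical placeholders for closings and get_candidates_3, whose real definitions are not shown in the question:

```python
import pandas as pd

# Hypothetical stand-ins for `closings` and `get_candidates_3`.
closings = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})


def get_candidates(row, offset=0):
    """Return a list built from one row (placeholder logic)."""
    return [row["a"] + offset, row["b"] + offset]


# Call the function once per row and collect the lists in a Series
# that shares the original index.
candidates = pd.Series(
    [get_candidates(row, offset=10) for _, row in closings.iterrows()],
    index=closings.index,
)
```

Building the Series from a plain Python list keeps each returned list as a single cell, so pandas never tries to align it with the original columns.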

What is Pandas doing here that my indexes [0] and [1] refer to the same value?

I have a dataframe with these indices and values:
df[df.columns[0]]
1 example
2 example1
3 example2
When I access df[df.columns[0]][2], I get "example1". Makes sense. That's how indices work.
When I access df[df.columns[0]][0], however, I get "example", and I get "example" when I access df[df.columns[0]][1] as well. So for
df[df.columns[0]][0]
df[df.columns[0]][1]
I get "example".
Strangely, I can delete "row" 0, and the result is that 1 is deleted:
gf = df.drop(df.index[[0]])
gf
exampleDF
2 example1
3 example2
But when I delete row 1, then
2 example1
is deleted, as opposed to example.
This is a bit confusing to me; are there inconsistent standards in Pandas regarding row indices, or am I missing something / made an error?
You are probably causing pandas to switch between .iloc (position-based) and .loc (label-based) indexing.
Positions in Python sequences are 0-indexed, and I notice that the index labels in your DataFrame start from 1. So when you run df[df.columns[0]][0], pandas finds no index label 0 and falls back to positional lookup, as .iloc does. It therefore returns what it finds at the first position, which is 'example'.
When you run df[df.columns[0]][1], however, pandas finds an index label 1 and uses label lookup, as .loc does, which returns the value at that label; that again happens to be 'example'.
When you delete the first row, your DataFrame no longer has index labels 0 or 1. So when you look up those keys this way, pandas does not return None but falls back on positional indexing and returns the elements at positions 0 and 1.
To force pandas to use one of the two indexing techniques, use .iloc or .loc explicitly. .loc is label-based and raises KeyError for df[df.columns[0]].loc[0]. .iloc is position-based and returns 'example' for df[df.columns[0]].iloc[0].
Additional note
These commands are bad practice: df[col_label].iloc[row_index]; df[col_label].loc[row_label].
Please use df.loc[row_label, col_label] or df.iloc[row_index, col_index]. (df.ix[row_label_or_index, col_label_or_index] did the same in older pandas, but .ix was deprecated in 0.20 and removed in 1.0.)
See Different Choices for Indexing for more information.
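A small sketch of the explicit accessors on a Series shaped like the one in the question (index labels starting at 1). The variable names are illustrative only:

```python
import pandas as pd

# Index labels start at 1, as in the question.
s = pd.Series(["example", "example1", "example2"], index=[1, 2, 3])

first_by_position = s.iloc[0]  # position 0
first_by_label = s.loc[1]      # label 1, which is the same row here

try:
    s.loc[0]                   # no label 0 exists
    label_zero_found = True
except KeyError:
    label_zero_found = False
```

Using .loc or .iloc explicitly removes the guesswork: a missing label is reported as a KeyError instead of being silently reinterpreted as a position.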

Extracting just a string element from a pandas dataframe

Okay, so say I have a pandas dataframe x, and I'm interested in extracting a value from it:
> x.loc[bar==foo]['variable_im_interested_in']
Let's say that returns the following, of type pandas.core.series.Series:
24 Boss
Name: ep_wb_ph_brand, dtype: object
But all I want is the string 'Boss'. Wrapping the first line of code in str() doesn't help either, I just get:
'24 Boss\nName: ep_wb_ph_brand, dtype: object'
How do I just extract the string?
Based on your comments, this code is returning a length-1 pandas Series:
x.loc[bar==foo]['variable_im_interested_in']
If you assign this value to a variable, then you can just access the 0th element to get what you're looking for:
my_value_as_series = x.loc[bar==foo]['variable_im_interested_in']
# Assumes the index to get is number 0, but from your example, it might
# be 24 instead.
plain_value = my_value_as_series[0]
# Likewise, .loc needs the actual index label, not necessarily 0
# (.ix, used in older pandas, was removed in 1.0).
also_plain_value = my_value_as_series.loc[0]
# This one works with zero, since `values` is a new ndarray.
plain_value_too = my_value_as_series.values[0]
You don't have to assign to a variable to do this, so you could just write x.loc[bar==foo]['variable_im_interested_in'][0] (or similar for the other options), but cramming more and more accessor and fancy indexing syntax onto a single expression is usually a bad idea.
Also note that you can directly index the column of interest inside of the call to loc:
x.loc[bar==foo, 'variable_im_interested_in'][24]
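If the index label is not known in advance, positional access avoids having to hard-code 24. The following is a sketch on a made-up DataFrame; the "bar" column and its values are assumptions, since the question does not show how bar and foo are defined:

```python
import pandas as pd

# Hypothetical frame standing in for `x`; the matching row has index label 24.
x = pd.DataFrame(
    {"bar": ["other", "foo"], "ep_wb_ph_brand": ["Hugo", "Boss"]},
    index=[7, 24],
)

matches = x.loc[x["bar"] == "foo", "ep_wb_ph_brand"]  # length-1 Series

first = matches.iloc[0]  # first match by position, whatever its label
only = matches.item()    # raises ValueError unless exactly one match
```

.iloc[0] is the more forgiving choice when several rows may match, while .item() doubles as a check that the filter returned exactly one row.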
Code to get the last value of an array (run in a Jupyter notebook, noted with the >s):
> import pandas
> df = pandas.DataFrame(data=['a', 'b', 'c'], columns=['name'])
> df
name
0 a
1 b
2 c
> df.tail(1)['name'].values[0]
'c'
You could also use the string split method on the str() output:
>>> s = '24 Boss\nName: ep_wb_ph_brand, dtype: object'
>>> s.split()[1]
'Boss'
