I am trying to use a dictionary value to define the slice ranges for iloc, but I keep getting the error: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]. The Excel sheet is built for visual presentation rather than as a real table (it isn't mine, so I can't change it), so I have to slice specific ranges without column labels.
Code I tried (which raises the error):
cr_dict= {'AA':'[42:43,32:65]', 'BB':'[33:34, 32:65]'}
df = my_df.iloc[cr_dict['AA']]
the results I want would be similar to
df = my_df.iloc[42:43,32:65]
I know I could change the dictionary and use the following, but it looks convoluted and hard to read. Is there a better way?
Code
cr_dict = {'AA': [42, 43, 32, 65], 'BB': [33, 34, 32, 65]}
df = my_df.iloc[cr_dict['AA'][0]:cr_dict['AA'][1], cr_dict['AA'][2]:cr_dict['AA'][3]]
Define your dictionaries slightly differently.
cr_dict= {'AA':[42,43]+list(range(32,65)),
'BB':[33,34]+list(range(32,65))}
Then you can slice your DataFrame like so:
>>> my_df.iloc[cr_dict["AA"], cr_dict["BB"]].sort_index()
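Another option (an assumption on my part, not part of the answer above) is to store `slice` objects in the dictionary. `iloc` accepts a tuple of slices, so the asker's desired `my_df.iloc[cr_dict['AA']]` syntax then works as-is:

```python
import numpy as np
import pandas as pd

# Hypothetical alternative: keep (row_slice, col_slice) tuples in the dict
# instead of strings, so iloc can consume them directly.
cr_dict = {'AA': (slice(42, 43), slice(32, 65)),
           'BB': (slice(33, 34), slice(32, 65))}

my_df = pd.DataFrame(np.arange(100 * 100).reshape(100, 100))  # dummy data

df = my_df.iloc[cr_dict['AA']]  # equivalent to my_df.iloc[42:43, 32:65]
print(df.shape)
```

This keeps the dictionary readable and avoids indexing into it four times per slice.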
Let's assume we have a simple dataframe like this:
df = pd.DataFrame({'col1':[1,2,3], 'col2':[10,20,30]})
Then I can select elements like this
df.col2[0] or df.col2[1]
But if I want to select the last element with df.col2[-1] it results in the error message:
KeyError: -1
I know that there are workarounds. I could do, for example, df.col2[len(df)-1] or df.iloc[-1,1]. But why isn't the much simpler version of indexing directly with -1 allowed? Am I maybe missing another simple way to select the last element? Thanks
The index labels of your DataFrame are [0, 1, 2]. Your code df.col2[1] is equivalent to using loc, as df['col2'].loc[1] (or df.col2.loc[1]). You can see that your index does not contain the label -1, which is why you get the KeyError.
For positional indexing you need to use an iloc function (which you can use on Pandas Series as well as DataFrame), so you could do df['col2'].iloc[-1] (or df.col2.iloc[-1]).
As you can see, you can combine label-based ('col2') and position-based (-1) indexing; you don't need to choose one or the other as in df.iloc[-1, 1] or df.col2[len(df)-1] (which would be equivalent to df.loc[len(df)-1, 'col2']).
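A minimal sketch of the distinction, using the question's DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

# df.col2[1] is label-based: it looks up index label 1, not position 1.
assert df.col2[1] == df['col2'].loc[1] == 20

# Position-based access needs iloc; negative positions work as usual.
assert df['col2'].iloc[-1] == 30
assert df.iloc[-1, 1] == 30               # row position, column position
assert df.loc[len(df) - 1, 'col2'] == 30  # the label-based equivalent
```

With the default RangeIndex the labels and positions happen to coincide for 0, 1, 2, which is exactly why the -1 case is the first place the difference shows.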
So I have a pandas DataFrame that has several columns that contain values I'd like to use to create new columns using a function I've defined. I'd been planning on doing this using Python's List Comprehension as detailed in this answer. Here's what I'd been trying:
df['NewCol1'], df['NewCol2'] = [myFunction(x=row[0], y=row[1]) for row in zip(df['OldCol1'], df['OldCol2'])]
This runs correctly until it comes time to assign the values to the new columns, at which point it fails, I believe because it hasn't been iteratively assigning the values and instead tries to assign a constant value to each column. I feel like I'm close to doing this correctly, but I can't quite figure out the assignment.
EDIT:
The data are all strings, and the function performs a fetching of some different information from another source based on those strings like so:
def myFunction(x, y):
    # read file based on the value of x
    # search the file for values a and b based on the value of y
    return a, b
I know this is a little vague, but the helper function is fairly complicated to explain.
The error received is:
ValueError: too many values to unpack (expected 4)
You can use zip() to transpose the list of result tuples into one sequence per new column:
df['NewCol1'], df['NewCol2'] = zip(*[myFunction(x=row[0], y=row[1]) for row in zip(df['OldCol1'], df['OldCol2'])])
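A runnable sketch of why this works; myFunction here is a stand-in, since the real helper wasn't shown:

```python
import pandas as pd

df = pd.DataFrame({'OldCol1': ['a', 'b'], 'OldCol2': ['x', 'y']})

# Stand-in for the question's (unspecified) helper: returns a pair per row.
def myFunction(x, y):
    return x.upper(), y.upper()

# The comprehension yields one (a, b) tuple per row; zip(*...) transposes that
# into two tuples, one per new column, so the two-target assignment unpacks.
df['NewCol1'], df['NewCol2'] = zip(*[myFunction(x=r[0], y=r[1])
                                     for r in zip(df['OldCol1'], df['OldCol2'])])
print(df)
```

Without the outer zip(*...), the assignment tries to unpack the whole list of row tuples into the two column targets, which is the unpacking error the asker saw.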
Here's what I have in my dataframe-
RecordType Latitude Longitude Name
L 28.2N 70W Jon
L 34.3N 56W Dan
L 54.2N 72W Rachel
Note: The dtype of all the columns is object.
Now, in my final dataframe, I only want to include those rows in which the Latitude and Longitude fall in a certain range (say 24 < Latitude < 30 and 79 < Longitude < 87).
My idea is to apply a function to all the values in the Latitude and Longitude columns to first get float values like 28.2, etc. and then to compare the values to see if they fall into my range. So I wrote the following-
def numbers(value):
    return float(value[:-1])
result[u'Latitude'] = result[u'Latitude'].apply(numbers)
result[u'Longitude'] = result[u'Longitude'].apply(numbers)
But I get the following warning-
Warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I'm having a hard time understanding this since I'm new to Pandas. What's the best way to do this?
If you don't want to modify df, I would suggest getting rid of the apply and vectorising this. One option is using eval.
u = df.assign(Latitude=df['Latitude'].str[:-1].astype(float))
u['Longitude'] = df['Longitude'].str[:-1].astype(float)
df[u.eval("24 < Latitude < 30 and 79 < Longitude < 87")]
You have more options using Series.between:
u = df['Latitude'].str[:-1].astype(float)
v = df['Longitude'].str[:-1].astype(float)
# inclusive="neither" excludes both endpoints (on pandas < 1.3, use inclusive=False)
df[u.between(24, 30, inclusive="neither") & v.between(79, 87, inclusive="neither")]
As for why Pandas threw that particular A value is trying to be set on a copy of a slice... warning and how to avoid it:
First, using this syntax should prevent the error message:
result.loc[:,'Latitude'] = result['Latitude'].apply(numbers)
Pandas gave you the warning because your .apply() function may be attempting to modify a temporary copy of Latitude/Longitude columns in your dataframe. Meaning, the column is copied to a new location in memory before the operation is performed on it. The article you referenced (http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy) gives examples of why this could potentially cause unexpected problems in certain situations.
Pandas recommends syntax that guarantees you are modifying your DataFrame's column directly with the .apply() operation, so that the DataFrame ends up modified the way you expect. The .loc code above tells Pandas to access and set the contents of that column in place, which keeps Pandas from raising the warning.
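A minimal sketch of the warning-safe pattern, using made-up sample data; the explicit .copy() after filtering is an extra precaution beyond the .loc syntax, making it unambiguous that the filtered frame is independent of the original:

```python
import pandas as pd

def numbers(value):
    return float(value[:-1])  # strip the trailing N/S/E/W and convert

df = pd.DataFrame({'Latitude': ['28.2N', '34.3N'], 'Longitude': ['70W', '56W']})

# A filter like this produces the ambiguous view-or-copy the warning is about;
# calling .copy() makes the result an independent frame.
result = df[df['Latitude'] != ''].copy()

# Writing through .loc assigns into result itself, not a temporary copy.
result.loc[:, 'Latitude'] = result['Latitude'].apply(numbers)
print(result['Latitude'].tolist())
```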
It's probably something very stupid, but I can't find a way to avoid printing the index when executing the code. My code goes:
Reading the excel file and choosing a specific component
df= pd.read_excel('Components.xlsx')
component_name = 'Name'
Forcing the index to be a certain column
df = df.set_index(['TECHNICAL DATA'])
Selecting data in a cell with df.loc
component_lifetime = df.loc[['Life time of Full unit'], component_name]
print(component_lifetime)
What I get is:
TECHNICAL DATA
Life time of Full unit 20
Is it possible to hide all the index data and only print 20? Thank you ^^
Use pd.DataFrame.at for scalar access by label:
res = df.at['Life time of Full unit', 'Name']
A short guide to indexing:
Use iat / at for scalar access / setting by integer position or label respectively.
Use iloc / loc for non-scalar access / setting by integer position or label respectively.
You can also extract the NumPy array via values, but this is rarely necessary.
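The guide above, demonstrated on a one-row frame shaped like the question's data (the labels here are assumed from the question):

```python
import pandas as pd

df = pd.DataFrame({'Name': [20]}, index=['Life time of Full unit'])
df.index.name = 'TECHNICAL DATA'

# Scalar access: .at by label, .iat by integer position - both return a bare value.
assert df.at['Life time of Full unit', 'Name'] == 20
assert df.iat[0, 0] == 20

# Non-scalar access: a list selector with .loc returns a Series, index included -
# which is exactly why the original print showed the 'TECHNICAL DATA' label.
print(df.loc[['Life time of Full unit'], 'Name'])  # prints with the index
print(df.at['Life time of Full unit', 'Name'])     # prints just 20
```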
Problem Overview:
I am attempting to clean stock data loaded from a CSV file into a Pandas DataFrame. The indexing operation I perform works: if I call print, I can see the values I want being pulled from the frame. However, when I try to replace the values, as shown in the screenshot, Pandas ignores my request. Ultimately, I'm just trying to extract a value out of one column and move it to another. The Pandas documentation suggests using the .replace() method, but that doesn't seem to work for the operation I'm trying to perform.
Here's a pic of the code and data before and after code is run.
And the for loop (as referenced in the pic):
for i, j in zip(all_exchanges['MarketCap'], all_exchanges['MarketCapSym']):
    if 'M' in i:
        j = j.replace('n/a', 'M')
    elif 'B' in i:
        j = j.replace('n/a', 'M')
The problem is that j is a string, thus immutable.
You're replacing data, but not in the original dataset.
You have to do it another way, less elegant, without zip (I simplified your test BTW since it did the same on both conditions):
aem = all_exchanges['MarketCap']
aems = all_exchanges['MarketCapSym']
for i in range(min(len(aem), len(aems))):  # like zip: stops at the shortest
    if 'M' in aem[i] or 'B' in aem[i]:
        aems[i] = aems[i].replace('n/a', 'M')
Now you're replacing in the original dataset.
If both columns are in the same dataframe, all_exchanges, iterate over the rows.
for i, row in all_exchanges.iterrows():
    # get whatever you want from row
    # using the index you should be able to set a value
    all_exchanges.loc[i, 'columnname'] = xyz
That should be the syntax, if I remember correctly ;)
Here is a quite exhaustive tutorial on missing values and pandas. I suggest using fillna() (note this fills real NaN values; if the column holds the literal string 'n/a', read the CSV with na_values='n/a' first):
df['MarketCap'].fillna('M', inplace=True)
df['MarketCapSym'].fillna('M', inplace=True)
Avoid iterating if you can. As already pointed out, you're not modifying the original data. Index on the MarketCap column and perform the replace as follows.
# overwrites any data in the MarketCapSym column
all_exchanges.loc[all_exchanges['MarketCap'].str.contains('M|B'),
                  'MarketCapSym'] = 'M'
# only replaces 'n/a'
mask = all_exchanges['MarketCap'].str.contains('M|B')
all_exchanges.loc[mask, 'MarketCapSym'] = \
    all_exchanges.loc[mask, 'MarketCapSym'].replace('n/a', 'M')
Thanks to all who posted. After thinking about your solutions and the problem a bit longer, I realized there might be a different approach. Instead of initializing a MarketCapSym column with 'n/a', I instead created that column as a copy of MarketCap and then extracted anything that wasn't an "M" or "B".
I was able to get the solution down to one line:
all_exchanges['MarketCapSymbol'] = [ re.sub('[$.0-9]', '', i) for i in all_exchanges.loc[:,'MarketCap'] ]
A breakdown of the solution is as follows:
all_exchanges['MarketCapSymbol'] = - Make a new column on the DataFrame called 'MarketCapSymbol'.
all_exchanges.loc[:,'MarketCap'] - Initialize the values in the new column to those in 'MarketCap'.
re.sub('[$.0-9]', '', i) for i in - Since all I want is the 'M' or 'B', apply re.sub() on each element, stripping the characters matched by [$.0-9] and leaving only the M or B.
Using a list comprehension this way seemed a bit more natural / readable to me in my limited experience with Pandas. Let me know what you think!
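For completeness, the one-liner needs `import re`; here it is runnable on sample data (the dollar-amount strings are my assumption about the screenshot's format):

```python
import re
import pandas as pd

all_exchanges = pd.DataFrame({'MarketCap': ['$1.2M', '$3.4B']})

# Strip dollar signs, dots, and digits, keeping only the M/B suffix.
all_exchanges['MarketCapSymbol'] = [re.sub('[$.0-9]', '', i)
                                    for i in all_exchanges.loc[:, 'MarketCap']]
print(all_exchanges['MarketCapSymbol'].tolist())
```

A vectorised equivalent would be all_exchanges['MarketCap'].str.replace('[$.0-9]', '', regex=True).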