Check whether column values are within range - python

Here's what I have in my dataframe-
RecordType  Latitude  Longitude  Name
L           28.2N     70W        Jon
L           34.3N     56W        Dan
L           54.2N     72W        Rachel
Note: The dtype of all the columns is object.
Now, in my final dataframe, I only want to include those rows in which the Latitude and Longitude fall in a certain range (say 24 < Latitude < 30 and 79 < Longitude < 87).
My idea is to apply a function to all the values in the Latitude and Longitude columns to first get float values like 28.2, etc. and then to compare the values to see if they fall into my range. So I wrote the following-
def numbers(value):
    return float(value[:-1])

result[u'Latitude'] = result[u'Latitude'].apply(numbers)
result[u'Longitude'] = result[u'Longitude'].apply(numbers)
But I get the following warning-
Warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I'm having a hard time understanding this since I'm new to Pandas. What's the best way to do this?

If you don't want to modify df, I would suggest getting rid of the apply and vectorising this. One option is using eval.
u = df.assign(Latitude=df['Latitude'].str[:-1].astype(float))
u['Longitude'] = df['Longitude'].str[:-1].astype(float)
df[u.eval("24 < Latitude < 30 and 79 < Longitude < 87")]
You have more options using Series.between:
u = df['Latitude'].str[:-1].astype(float)
v = df['Longitude'].str[:-1].astype(float)
df[u.between(24, 30, inclusive=False) & v.between(79, 87, inclusive=False)]
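(Note: on pandas 1.3 and later, the boolean form of inclusive is deprecated; pass a string instead, e.g. u.between(24, 30, inclusive='neither').)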

As for why Pandas threw that particular "A value is trying to be set on a copy of a slice..." warning and how to avoid it:
First, using this syntax should prevent the warning:
result.loc[:, 'Latitude'] = result['Latitude'].apply(numbers)
Pandas gave you the warning because your .apply() call may be modifying a temporary copy of the Latitude/Longitude column rather than the original dataframe; that is, the column may have been copied to a new location in memory before the operation was performed on it. The page the warning links to (http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy) gives examples of why this can cause unexpected problems in certain situations.
Pandas recommends that you instead use syntax that makes it explicit you are assigning into the original dataframe. Writing through .loc, as above, tells Pandas to set the contents of that column on the dataframe itself, which ensures the dataframe is modified in the manner you expect and keeps the warning from being raised.
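For context, here is a minimal sketch (an assumption, since the question doesn't show it) of the chained pattern that typically produces this warning, and the explicit .copy() that avoids it:
import pandas as pd

df = pd.DataFrame({'RecordType': ['L', 'L'],
                   'Latitude': ['28.2N', '34.3N'],
                   'Longitude': ['70W', '56W']})

# 'result' was probably produced by filtering another frame, e.g.:
result = df[df['RecordType'] == 'L']
result['Latitude'] = result['Latitude'].apply(numbers)  # SettingWithCopyWarning

# taking an explicit copy makes ownership unambiguous, so no warning:
result = df[df['RecordType'] == 'L'].copy()
result['Latitude'] = result['Latitude'].apply(numbers)  # fine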

Related

Python iloc slice range from dictionary value

I am trying to use a dictionary value to define the slice ranges for the iloc function but I keep getting the error -- Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] . The excel sheet is built for visual information and not in any kind of real table format (not mine so I can’t change it) so I have to slice the specific ranges without column labels.
Code I tried (which raises the error):
cr_dict= {'AA':'[42:43,32:65]', 'BB':'[33:34, 32:65]'}
df = my_df.iloc[cr_dict['AA']]
the results I want would be similar to
df = my_df.iloc[42:43,32:65]
I know I could change the dictionary and use the following but it looks convoluted and not as easy to read– is there a better way?
Code
cr_dict = {'AA': [42, 43, 32, 65], 'BB': [33, 34, 32, 65]}
df = my_df.iloc[cr_dict['AA'][0]:cr_dict['AA'][1], cr_dict['AA'][2]:cr_dict['AA'][3]]
Define your dictionaries slightly differently.
cr_dict = {'AA': [42, 43] + list(range(32, 65)),
           'BB': [33, 34] + list(range(32, 65))}
Then you can slice your DataFrame like so:
>>> my_df.iloc[cr_dict["AA"], cr_dict["BB"]].sort_index()
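If the goal is really to reproduce my_df.iloc[42:43, 32:65] verbatim, another option (a sketch, not from the answer above) is to store slice objects in the dictionary, since iloc accepts a tuple of slices:
# slice(42, 43) is the programmatic equivalent of 42:43
cr_dict = {'AA': (slice(42, 43), slice(32, 65)),
           'BB': (slice(33, 34), slice(32, 65))}

df = my_df.iloc[cr_dict['AA']]  # same as my_df.iloc[42:43, 32:65]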

Perform logical operations and add new columns of dataframe at the same time?

I'm new to and still learning Python. I want to collect the rows of a dataframe that satisfy a logical condition and add a label to them, but right now this takes several lines of code.
For example:
df = df[(df['this_col'] >= 10) & (df['anth_col'] < 100)]
result_df = df.copy()
result_df['label'] = 'medium'
I really wonder if there's a way to do this in one line of code without applying a function. If it can't be done in one line, why not?
Cheers!
query returns a copy, always.
result_df = df.query("this_col >= 10 and anth_col < 100").assign(label='medium')
Assuming your columns can pass off as valid identifier names in python, this will do.
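If they can't, note that pandas 0.25+ lets you backtick-quote awkward column names inside query. A sketch assuming a hypothetical column named this col (with a space):
result_df = df.query("`this col` >= 10 and anth_col < 100").assign(label='medium')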

Python: How to hide index when executing

It's probably something very stupid, but I can't find a way to avoid printing the index when executing the code. My code goes:
# Reading the excel file and choosing a specific component
df = pd.read_excel('Components.xlsx')
component_name = 'Name'
# Forcing the index to be a certain column
df = df.set_index(['TECHNICAL DATA'])
# Selecting data in a cell with df.loc
component_lifetime = df.loc[['Life time of Full unit'], component_name]
print(component_lifetime)
What I get is:
TECHNICAL DATA
Life time of Full unit 20
Is it possible to hide all the index data and only print 20? Thank you ^^
Use pd.DataFrame.at for scalar access by label:
res = df.at['Life time of Full unit', 'Name']
A short guide to indexing:
Use iat / at for scalar access / setting by integer position or label respectively.
Use iloc / loc for non-scalar access / setting by integer position or label respectively.
You can also extract the NumPy array via values, but this is rarely necessary.
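For example, a small sketch using the frame from the question:
res = df.at['Life time of Full unit', 'Name']
print(res)  # 20, a bare scalar with no index attached

# by contrast, a list indexer with .loc returns a Series,
# which is why the index was printed in the question:
print(df.loc[['Life time of Full unit'], 'Name'])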

Python aggregate functions (e.g. sum) not working on object dtypes, and won't work when they're converted either?

I'm importing data from a CSV file which has text, date and numeric columns. I'm using pandas.read_csv() to read it in, but I'm not specifying what each column's dtype should be. Here's a cut of that csv file (apologies for the shoddy formatting).
Now these two columns (total_imp_pma, char_value_aa503) are imported very differently. I import all the number fields and create a new dataframe called base_varlist4, which only contains the number columns.
When I run base_varlist4.dtypes, I get:
total_imp_pma object
char_value_aa503 float64
So as you can see, total_imp_pma was imported as an object. The problem is that if I run this:
#calculate max, and group by obs_date
output_max_temp = base_varlist4.groupby('obs_date').max(skipna=True)
#reset obs_date to be treated as a column rather than an index
output_max_temp = output_max_temp.reset_index()
#reshape temporary output to have 2 columns corresponding to variable and value
output_max = pd.melt(output_max_temp, id_vars='obs_date', value_vars=varlist4)
Where varlist4 is just my list of columns, I get the wrong max value for total_imp_pma but the correct max value for char_value_aa503.
Logically, this means I should change the object total_imp_pma to either a float or an integer. However, when I run:
base_varlist4[varlist4] = base_varlist4[varlist4].apply(pd.to_numeric, errors='coerce')
And then proceed to do the max value, I still get an incorrect result.
What's going on here? Why does pandas.read_csv() import some columns as an object dtype, and others as an int64 or float64 dtype? Why does conversion not work?
I have a theory but I'm not sure how to work around it. The only difference I see between the two columns in my source data is that total_imp_pma has mixed-type cells all the way down. For example, 66979 is a General cell, while a cell a little further down holds 1,760.60 as a number.
I think the mixed cell types in certain columns is causing pandas.read_csv() to be confused and just say "whelp, dunno what this is, import it as an object".
... how do I fix this?
EDIT: Here's an MCVE as per the request below.
Data in CSV is:
Char_Value_AA503  Total_IMP_PMA
1293              19.9
1831              0.9
                  1.2
243               2,666.50
Code is:
import pandas as pd
loc = r"xxxxxxxxxxxxxx"
source_data_name = 'import_problem_example.csv'
reporting_date = '01Feb2018'
source_data = pd.read_csv(loc + source_data_name)
source_data.columns = source_data.columns.str.lower()
varlist4 = ["char_value_aa503","total_imp_pma"]
base_varlist4 = source_data[varlist4]
base_varlist4['obs_date'] = reporting_date
base_varlist4[varlist4] = base_varlist4[varlist4].apply(pd.to_numeric, errors='coerce')
output_max_temp = base_varlist4.groupby('obs_date').max(skipna=True)
#reset obs_date to be treated as a column rather than an index
output_max_temp = output_max_temp.reset_index()
#reshape temporary output to have 2 columns corresponding to variable and value
output_max = pd.melt(output_max_temp, id_vars='obs_date', value_vars=varlist4)
""" Test some stuff """
source_data.dtypes
output_max
As you can see, the max value of total_imp_pma comes out as 19.9, when it should be 2666.50.
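Judging from the sample data, a likely culprit is the thousands separator: '2,666.50' cannot be parsed as a number, so read_csv leaves the whole column as object, and pd.to_numeric(errors='coerce') later turns that value into NaN, leaving 19.9 as the max. A sketch of a fix under that assumption:
# tell read_csv about the separator so '2,666.50' parses as 2666.50
source_data = pd.read_csv(loc + source_data_name, thousands=',')

# or strip the commas from the already-imported object column first
base_varlist4['total_imp_pma'] = pd.to_numeric(
    base_varlist4['total_imp_pma'].str.replace(',', ''), errors='coerce')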

Tricky str value replacement within PANDAS DataFrame

Problem Overview:
I am attempting to clean stock data loaded from CSV file into Pandas DataFrame. The indexing operation I perform works. If I call print, I can see the values I want are being pulled from the frame. However, when I try to replace the values, as shown in the screenshot, PANDAS ignores my request. Ultimately, I'm just trying to extract a value out of one column and move it to another. The PANDAS documentation suggests using the .replace() method, but that doesn't seem to be working with the operation I'm trying to perform.
Here's a pic of the code and data before and after code is run.
And the for loop (as referenced in the pic):
for i, j in zip(all_exchanges['MarketCap'], all_exchanges['MarketCapSym']):
    if 'M' in i: j = j.replace('n/a','M')
    elif 'B' in i: j = j.replace('n/a','M')
The problem is that j is a string, thus immutable.
You're replacing data, but not in the original dataset.
You have to do it another way, less elegant, without zip (I simplified your test BTW since it did the same on both conditions):
aem = all_exchanges['MarketCap']
aems = all_exchanges['MarketCapSym']
for i in range(min(len(aem), len(aems))):  # like zip: shortest of both
    if 'M' in aem[i] or 'B' in aem[i]:
        aems[i] = aems[i].replace('n/a', 'M')
now you're replacing in the original dataset.
If both columns are in the same dataframe, all_exchanges, iterate over the rows.
for i, row in all_exchanges.iterrows():
    # get whatever you want from row
    # using the index you should be able to set a value
    all_exchanges.loc[i, 'columnname'] = xyz
That should be the syntax, if I remember ;)
Here is a quite exhaustive tutorial on missing values in pandas. I suggest using fillna():
df['MarketCap'].fillna('M', inplace=True)
df['MarketCapSym'].fillna('M', inplace=True)
Avoid iterating if you can. As already pointed out, you're not modifying the original data. Index on the MarketCap column and perform the replace as follows.
# overwrites any data in the MarketCapSym column
all_exchanges.loc[all_exchanges['MarketCap'].str.contains('M|B'),
                  'MarketCapSym'] = 'M'
# only replaces 'n/a' -- assign the result back, since calling
# replace(..., inplace=True) on a .loc selection modifies a copy
mask = all_exchanges['MarketCap'].str.contains('M|B')
all_exchanges.loc[mask, 'MarketCapSym'] = all_exchanges.loc[mask, 'MarketCapSym'].replace({'n/a': 'M'})
Thanks to all who posted. After thinking about your solutions and the problem a bit longer, I realized there might be a different approach. Instead of initializing a MarketCapSym column with 'n/a', I created that column as a copy of MarketCap and then stripped out everything that wasn't an "M" or "B".
I was able to get the solution down to one line:
import re  # the one-liner relies on re.sub
all_exchanges['MarketCapSymbol'] = [re.sub('[$.0-9]', '', i) for i in all_exchanges.loc[:, 'MarketCap']]
A breakdown of the solution is as follows:
all_exchanges['MarketCapSymbol'] = - Make a new column on the DataFrame called 'MarketCapSymbol'.
for i in all_exchanges.loc[:,'MarketCap'] - Iterate over the values in 'MarketCap'.
re.sub('[$.0-9]', '', i) - Since all I want is the 'M' or 'B', apply re.sub() on each element, removing the characters [$.0-9] and leaving only the M or B.
Using a list comprehension this way seemed a bit more natural / readable to me in my limited experience with PANDAS. Let me know what you think!
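For what it's worth, the same substitution can also be done without the list comprehension via the vectorised .str accessor (a sketch; regex=True is required on recent pandas):
all_exchanges['MarketCapSymbol'] = all_exchanges['MarketCap'].str.replace(r'[$.0-9]', '', regex=True)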
