While discovering Python, I found myself stuck trying to select rows (food items) based on the values of a column (macronutrients). My condition uses a relational operator and the output is not correct
(particularly with the > and < operators; I don't have the problem with the == operator).
data.loc[data['protein']=='10']
Result of my code sample
The result is correct because all the returned rows (food items) have a protein value of 10.
data.loc[data['protein']>'10']
Result of my code sample
The result is incorrect because rows appear that do not respect the given condition (there are rows with protein < 10 as well as rows with protein > 10).
Any thoughts on the issue? Do you think it's related to the file format (see code sample below)? If so, how can I get around the problem?
data = pd.read_excel('Documents/test.xlsx', names=col_names, usecols="D,E,F,G,H,J,M,N,P,Q,R,T,Y,Z,AA", index_col=[3])
Thanks in advance and Happy Holidays!!
[EDITED]
So I did more digging; indeed I am comparing two different things. @Daniel Mesejo: the dtype of protein is object. Since I want the protein column in float format, I decided to convert it into a string first and then into a float. Unfortunately, converting it to a string using .astype(str) didn't work.
Use data['protein'] = data['protein'].astype('int64') to convert the strings to integers and then retry what you were doing.
Your problem is that you are comparing strings rather than integers. Change data.loc[data['protein']>'10'] to data.loc[data['protein'].astype(int) > 10].
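For instance, a minimal sketch of the conversion (pd.to_numeric is used here instead of astype as an extra precaution, in case the column contains decimals or stray non-numeric text; the small DataFrame is just a stand-in for the real Excel data):

import pandas as pd

# hypothetical stand-in for the data read from the Excel file
data = pd.DataFrame({'protein': ['5', '10', '12.5', '20']})

# convert the object (string) column to numbers; errors='coerce' turns
# anything that cannot be parsed into NaN instead of raising an error
data['protein'] = pd.to_numeric(data['protein'], errors='coerce')

# the comparison is now numeric, so > behaves as expected
print(data.loc[data['protein'] > 10])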
After a lot of errors, exceptions and high blood pressure, I finally came up with a solution that works for what I needed: basically, I need to find all the column values that respect a specific condition.
So, let's say I have a list of strings like
vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
I want to count which values appear more than 2 times.
Consider that the column name of the dataframe based upon the list is 'veh'.
So, this piece of code works:
df['veh'].value_counts()[df['veh'].value_counts() > 2]
The question is: why does the [df['veh'].value_counts() > 2] part come right after the "()" of value_counts()? There is no "." or any other linking sign that could mean something.
If I use the code
df['classi'].value_counts() > 1
(which would be the logical syntax my limited brain can abstract), it returns boolean values.
Can someone, please, help me understanding the logic behind pandas?
I am pretty sure that pandas is awesome and the problem lies on this side of mine, but I really want to understand it. I've read a lot of material (documentation included), but could not find a solution to this gap of mine.
Thank you in advance!
The following line of code
df['veh'].value_counts()
Returns a pandas Series whose index holds the unique values and whose values are the number of occurrences.
Everything between square brackets [] is a filter on the index of a pandas Series. So
df['veh'].value_counts()['car']
Should return the number of occurrences of the word 'car' in the column 'veh', which is equivalent to the value for the key 'car' in the Series df['veh'].value_counts()
A pandas Series also accepts a list of keys, so
df['veh'].value_counts()[['car','boat']]
Should return the number of occurrences for the words 'car' and 'boat' respectively
Furthermore, the Series accepts a list of booleans as a key if it is of the same length as the Series. That is, it accepts a boolean mask.
When you write
df['veh'].value_counts() > 2
You compare each value in df['veh'].value_counts() with the number 2. This returns a boolean for each value, that is, a boolean mask.
So you can use the boolean mask as a filter on the series you created. Thus
df['veh'].value_counts()[df['veh'].value_counts() > 2]
Returns all the occurrences for the keys whose count is greater than 2
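To make this concrete, a small sketch using the vehicle list from the question:

import pandas as pd

vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
df = pd.DataFrame({'veh': vehicle})

counts = df['veh'].value_counts()   # Series: car -> 3, tank -> 2, boat -> 1, ...
mask = counts > 2                   # boolean Series aligned with counts
print(counts[mask])                 # only 'car', which appears 3 times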
The logic is that you can slice a series with a boolean series of the same size:
s[bool_series]
or equivalently
s.loc[bool_series]
This is also referred to as boolean indexing.
Now, your code is equivalent to:
s = df['veh'].value_counts()
bool_series = s > 2
And then use either of the two forms above, e.g. s[s > 2]
I'm a less-than-a-week beginner in Python and data science, so please forgive me if these questions seem obvious.
I've scraped data on a website, but the result is unfortunately not very well formatted and I can't use it without transformation.
My Data
I have a string column which contains a lot of features that I would like to convert into dummy variables.
Example of string : "8 équipements & optionsextérieur et châssisjantes aluintérieurBluetoothfermeture électrique5 placessécuritékit téléphone main libre bluetoothABSautreAPPUI TETE ARclimatisation"
What I would like to do
I would like to create a dummy colum "Bluetooth" which would be equal to one if the pattern "bluetooth" is contained in the string, and zero if not.
I would like to create another dummy column "Climatisation" which would be equal to one if the pattern "climatisation" is contained in the string, and zero if not.
...etc
And do it for 5 or 6 patterns which interest me.
What I have tried
I wanted to use a match test with regular expressions and to combine it with the pd.get_dummies method.
import re
import pandas as pd
def match(My_pattern, My_strng):
    m = re.search(My_pattern, My_strng)
    if m:
        return True
    else:
        return False

pd.get_dummies(df["My messy strings colum"], ...)
I haven't succeeded in finding how to set the pd.get_dummies arguments to apply the test I would like on the column.
I was even wondering if it's the best strategy, and whether it wouldn't be easier to create other parallel columns and apply a match.group() on my messy strings to populate them.
Not sure I would know how to program that anyway.
Thanks for your help
I think one way to do this would be:
df.loc[df['My messy strings colum'].str.contains("bluetooth", na=False),'Bluetooth'] = 1
df.loc[~(df['My messy strings colum'].str.contains("bluetooth", na=False)),'Bluetooth'] = 0
df.loc[df['My messy strings colum'].str.contains("climatisation", na=False),'Climatisation'] = 1
df.loc[~(df['My messy strings colum'].str.contains("climatisation", na=False)),'Climatisation'] = 0
The tilde (~) represents not, so the condition is reversed in this case to string does not contain.
na=False means that if your messy column contains any null values, these will not cause an error; they will just be treated as not meeting the condition.
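A slightly more compact variant of the same idea (a sketch only; the sample rows are made up, and case=False is an extra assumption so that "Bluetooth" with a capital B is matched as well):

import pandas as pd

# hypothetical stand-in for the scraped column
df = pd.DataFrame({'My messy strings colum': [
    'jantes alu Bluetooth 5 places climatisation',
    'jantes alu ABS',
    None,
]})

for pattern, col in [('bluetooth', 'Bluetooth'), ('climatisation', 'Climatisation')]:
    # case=False makes the match case-insensitive, na=False maps nulls to False
    df[col] = df['My messy strings colum'].str.contains(pattern, case=False, na=False).astype(int)

print(df[['Bluetooth', 'Climatisation']])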
I have a problem with this function:
print(df.loc[df['Kriterij1'] == '63'])
and I also tried (with the same result)
df[df.Kriterij1.isin(['aaa','63'])]
When I try to filter by numbers, the output is only the header (empty cells); it only works for the word 'aaa'.
Or maybe I can use another function?
I think you need to change '63' (string) to 63 (number), since numeric values are mixed with strings:
print(df.loc[df['Kriterij1'] == 63])
print(df[df.Kriterij1.isin(['aaa',63])])
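A minimal sketch of the mixed-type situation (hypothetical data; such a column has dtype object, so each element keeps its own Python type and '63' and 63 are different values):

import pandas as pd

df = pd.DataFrame({'Kriterij1': ['aaa', 63, 'bbb', 63]})

print(df.loc[df['Kriterij1'] == '63'])        # empty: there is no string '63'
print(df.loc[df['Kriterij1'] == 63])          # matches the numeric 63 rows
print(df[df['Kriterij1'].isin(['aaa', 63])])  # matches both kinds of values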
A minor problem is doing my head in. I have a dataframe similar to the following:
Number      Title
12345678    A
34567890-S  B
11111111    C
22222222-L  D
This is read from an Excel file using pandas in Python; the index is then set to the first column:
db = db.set_index(['Number'])
I then look up the Title based on the Number:
lookup = "12345678"
title = str(db.loc[lookup, 'Title'])
However, whilst anything postfixed with "-Something" works, anything without it doesn't find a location (e.g. 12345678 will not find anything, 34567890-S will). My only hunch is that it's to do with looking up as either strings or ints, but I've tried a few things (converting the table to all strings, changing loc to iloc, ix, etc.) with no luck so far.
Any ideas? Thanks :)
UPDATE: Trying this from scratch doesn't exhibit the same behaviour (creating a test db presumably just sets everything as strings); however, importing from CSV results in the above, and...
Searching for "12345678" (as a string) doesn't find it, but 12345678 as an int does. Likewise the opposite for the others. So the dataframe is only matching the purely numeric entries in the index as ints, and anything else as strings.
Also, I can't just drop the postfix from the search, as I have multiple rows with differing postfixes, e.g. 34567890-S, 34567890-L, 34567890-X.
If you want to cast all entries to one particular type, you can use pandas.Series.astype:
db["Number"] = df["Number"].astype(str)
db = db.set_index(['Number'])
lookup = "12345678"
title = db.loc[lookup, 'Title']
Interestingly this is actually slower than using pandas.Index.map:
import numpy as np
import pandas as pd

x1 = [pd.Series(np.arange(n)) for n in np.logspace(1, 4, dtype=int)]
x2 = [pd.Index(np.arange(n)) for n in np.logspace(1, 4, dtype=int)]

def series_astype(x1):
    return x1.astype(str)

def index_map(x2):
    return x2.map(str)
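A sketch of how these could be timed with the x1/x2 lists above (illustrative only; the actual numbers depend on your machine and pandas version):

import timeit

for s, idx in zip(x1, x2):
    t_astype = timeit.timeit(lambda: series_astype(s), number=100)
    t_map = timeit.timeit(lambda: index_map(idx), number=100)
    print(len(s), round(t_astype, 4), round(t_map, 4))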
Consider treating all the indices as strings, since at least some of them are not numbers. If you want to look up a specific item that could have a postfix, you can match it by comparing the start of the strings with .str.startswith:
lookup = db.index.str.startswith("34567890")
title = db.loc[lookup, "Title"]
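Alternatively, the mixed types can be avoided at read time by forcing the column to string with the dtype argument (a sketch; the file name is a placeholder, and the same argument also works with pd.read_excel):

import pandas as pd

# dtype forces 'Number' to be read as strings, so 12345678 and 34567890-S
# end up with the same type in the index
db = pd.read_csv('data.csv', dtype={'Number': str})
db = db.set_index(['Number'])

title = db.loc['12345678', 'Title']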
All I would like to do is add .001 to each value that isn't a 0 in one column (say column 7 for example) in my csv file.
So instead of being 35, the value would be changed to 35.001, for example. I need to do this to make my ArcMap script work, because if a whole number is read first, the column is assigned as a short integer when it needs to be read as a float.
As of right now, I have:
writer.writerow([f if f.strip() =='0' else f+.001 for f in row])
This raises a concatenation error, however, and does not yet address the specific column I need this to work on.
Any help would be greatly appreciated.
Thank you.
The easiest thing to do is to just mutate the row in place, i.e.
if row[7].strip() != '0' and '.' not in row[7]:
    row[7] = row[7] + '.001'
writer.writerow(row)
The concatenation error is caused by trying to add a float to a string; you just have to wrap the extra decimals in quotes.
The extra condition on the if ensures that you don't accidentally end up with a number with two decimal points.
It's pretty standard for numbers of the form 35.0 to be treated as floats even though the value is a whole number. Check whether ArcMap follows this convention; then you can avoid reducing the accuracy of your numbers by just appending '.0'.
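Putting it together, a sketch of the full read/modify/write loop (the file names and the column index 7 are assumptions taken from the question):

import csv

with open('input.csv', newline='') as src, open('output.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # only touch non-zero values that don't already contain a decimal point
        if row[7].strip() != '0' and '.' not in row[7]:
            row[7] = row[7] + '.001'
        writer.writerow(row)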