I'm fairly new to Python. This seems like a really simple question, but I can't find any information about it.
I have a list of strings, and for each string I want to check whether it is present in a dataframe (actually, in a particular column of the dataframe). Not whether a substring is present, but whether the whole exact string is.
So my dataframe is something like the following:
A=pd.DataFrame(["ancestry","time","history"])
I should simply be able to use the "string in dataframe" method, as in
"time" in A
This returns False however.
If I run
"time" == A.iloc[1]
it returns "True", but annoyingly as part of a series, and this depends on knowing where in the dataframe the corresponding string is.
Is there some way I can just use the string in df method, to easily find out whether the strings in my list are in the dataframe?
Add .to_numpy() to the end:
'time' in A.to_numpy()
As you've noticed, the x in pandas.DataFrame syntax doesn't produce the result you want. But .to_numpy() transforms the dataframe into a NumPy array, and x in numpy_array checks the values, as you expect.
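For the list-of-strings case in the question, a minimal sketch (the words list here is made up for illustration):
import pandas as pd

A = pd.DataFrame(["ancestry", "time", "history"])
words = ["time", "spacetime"]  # hypothetical strings to look for
found = {w: w in A.to_numpy() for w in words}
print(found)  # {'time': True, 'spacetime': False}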
The way to deal with this is to compare the whole dataframe with "time". That returns a mask that is True wherever a value of the DF equals "time" and False otherwise. Then you can use .any() to check whether there are any True values:
>>> A = pd.DataFrame(["ancestry","time","history"])
>>> A
          0
0  ancestry
1      time
2   history
>>> A == "time" # or A.eq("time")
0
0 False
1 True
2 False
>>> (A == "time").any()
0 True
dtype: bool
Notice in the above output that (A == "time").any() returns a Series with one entry per column, indicating whether that column contained "time". If you want to check the entire dataframe (across all columns), call .any() twice:
>>> (A == "time").any().any()
True
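The same pattern extends to checking a whole list of strings, e.g. (a small sketch):
for word in ["time", "spacetime"]:  # hypothetical list of strings
    print(word, (A == word).any().any())
# time True
# spacetime False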
I believe (myseries == mystr).any() will do what you ask. The special __contains__ method of DataFrames (which governs the behavior of in) checks whether your string is a column of the DataFrame, e.g.
>>> A = pd.DataFrame({"c": [0,1,2], "d": [3,4,5]})
>>> 'c' in A
True
>>> 0 in A
False
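Applied to the question's frame, where the single column is labelled 0, that Series-based check looks like this (a quick sketch):
>>> A = pd.DataFrame(["ancestry", "time", "history"])
>>> (A[0] == "time").any()
True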
I would slightly modify your dataframe and use .str.contains to check where the string is present in your Series.
df = pd.DataFrame()
df['A'] = pd.Series(["ancestry", "time", "history"])
df['A'].str.contains("time")
Trying to conditionally execute a query, only when ColumnA = 'New' and ColumnB = 'Left' (in each individual row). I know that str.contains() works when I only have one condition; however, with two conditions I'm getting a ValueError ("ValueError: The truth value of a Series is ambiguous..."). Can this approach not be successfully applied in my scenario? Alternatively, is there a better approach?
Current code:
if df1['ColumnA'].str.contains('New') and df1['ColumnB'].str.contains('Left'):
    # do something...
Very basic example of the dataframe:
ColumnA  ColumnB
New      Left
Used     Right
Scrap    Down
New      Right
First row would be the desired row to carry forward (since it meets the criteria).
You have the right idea; however, your code isn't expressing exactly what you want to do.
df1['ColumnA'].str.contains('New')
will return a Series of True/False values, one per row, not a single True or False for whether the entire column contains 'New'. To check the whole column, consider something like the following:
'new' in df['ColumnA'].values
If you are trying to do it on a row-by-row basis, then you must combine the boolean Series element-wise with the bitwise operator (&).
This will return a boolean like you expected. Hopefully this helps (:
Use bitwise & on two mask arrays and generate another column.
>>> import pandas as pd
>>> df = pd.DataFrame({'A':['New','Used','Scrap','New'], 'B':['Left','Right','Down','Right']})
>>> df
A B
0 New Left
1 Used Right
2 Scrap Down
3 New Right
>>> df['C'] = df['A'].str.contains('New') & df['B'].str.contains('Left')
>>> df
       A      B      C
0    New   Left   True
1   Used  Right  False
2  Scrap   Down  False
3    New  Right  False
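To then keep only the rows that satisfied both conditions, you can index with the new boolean column (a small follow-up sketch):
>>> df[df['C']]
     A     B     C
0  New  Left  True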
I have two dataframes, one with a list of dates and their corresponding holiday (df2), and another one with a list of transactions (df1). I'm trying to use the first one to flag holidays on the second one, but whenever I try to create a function and apply it, it just returns empty values for everything.
The function I'm using is as follows:
def isHoliday(t, holiday_list):
    f = t.strftime('%Y-%m-%d')
    if f in holiday_list:
        return 1
    else:
        return 0
And when I try to apply it:
df1.insert(3, 'isHoliday', df1['DATE'].apply(lambda x: isHoliday(x, df2['DATE'])))
The new column only contains 0's. I've looked up date-to-date comparison, and the answer I found was to compare the dates as strings, hence why the function is structured that way.
What am I doing wrong? I've already preformatted the df2['DATE'] column as strings with the same strftime() format.
The only direct alternative I can think of is using df.lookup from one df to the other, but I'm not sure how to do it.
For the if statement to do what you're expecting, you need a list or a NumPy array rather than the Series returned by df2['DATE']: the in operator on a Series checks the index labels, not the values. You can get at the values either with the .values attribute or by converting the Series to a list with list(df2['DATE']):
import pandas as pd
df2 = pd.DataFrame(data=[['2014-01-02'], ['2014-01-03']], columns=['DATE'])
print('2014-01-02' in df2['DATE']) # false
print('2014-01-02' in df2['DATE'].values) # true
print('2014-01-02' in list(df2['DATE'])) # true
Alternatively, the .str.contains() method can compare all the strings and then any() will find if there was a match.
any(df2['DATE'].str.contains('2014-01-02', regex=False)) # true
Converting your series to a list should solve your problem:
def isHoliday(t, holiday_list):
    f = t.strftime('%Y-%m-%d')
    if f in list(holiday_list):  # convert Series to list
        return 1
    else:
        return 0
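A self-contained sketch of the corrected function in use (the small df1 and df2 here are made up for illustration):
import pandas as pd

def isHoliday(t, holiday_list):
    f = t.strftime('%Y-%m-%d')
    return 1 if f in list(holiday_list) else 0

df2 = pd.DataFrame({'DATE': ['2014-01-01', '2014-12-25']})  # holiday strings
df1 = pd.DataFrame({'DATE': pd.to_datetime(['2014-01-01', '2014-06-15'])})
df1['isHoliday'] = df1['DATE'].apply(lambda x: isHoliday(x, df2['DATE']))
print(df1)
#         DATE  isHoliday
# 0 2014-01-01          1
# 1 2014-06-15          0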
I want to perform a lookup in a dataframe via pandas, but it would be built from a series of nested if/else statements, similar to those outlined in Pandas dataframe add a field based on multiple if statements.
But I want to use up to 13 different variables. That seems to pretty quickly result in chaos. Is there some notation or other nice feature that allows me to specify such long, nested conditions in pandas?
So far np.where() http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html might be my best option.
Is there a shortcut if I only match for equality in all conditions?
Am I forced to write out each conditional filter, or could I have a single expression that chooses the (single) lookup value to produce?
Edit: Ideally I would not want to match
df.loc[df['column_name'] == some_value]
for each value, i.e. 13 × the number of categorical levels (let's assume 7) would be a lot of different values, especially if chained combinations of conditions like df.loc[df['first'] == some_value][df['second'] == otherValue1] occur.
Edit: a minimal example:
df = pd.DataFrame({'ageGroup': [1, 2, 2, 1],
                   'first2DigitsOfPostcode': ['12', '23', '12', '12'],
                   'valueOfProduct': ['low', 'medium', 'high', 'low'],
                   'lookup_join_value': ['foo', 'bar', 'foo', 'baz']})
defines the lookup table, which was generated by a SQL query grouping by all the columns and aggregating the values (so, due to the Cartesian product, all value combinations should be represented in the lookup table).
A new record could look like
new_values = pd.DataFrame({'ageGroup': [1],
                           'first2DigitsOfPostcode': ['12'],
                           'valueOfProduct': ['low']})
How can I automate the lookup over all the conditions, assuming every condition requires a match by equality (if that makes it easier)?
I found pd.lookup (Vectorized look-up of values in Pandas dataframe), which seems to work for a single column/condition.
Maybe a merge could be a solution? Python Pandas: DataFrame as a Lookup Table, but that does not really produce the desired lookup result.
Edit 2: The second answer seems pretty interesting, but
mask = df.drop('lookup_join_value', axis=1).isin(new_values)
print(mask)
print(df[mask])
print(df[mask]['lookup_join_value'])
will unfortunately just return NaN for the lookup value.
Now that I better know what you're after, a dataframe merge is likely a much better choice:
IN:  df.merge(new_values, how='inner')
OUT:
   ageGroup first2DigitsOfPostcode lookup_join_value valueOfProduct
0         1                     12               foo            low
1         1                     12               baz            low
Certainly shorter than the other answer I gave! I'll leave the old one though in case it inspires someone else.
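For reference, a runnable version of that merge, using the frames from the question:
import pandas as pd

df = pd.DataFrame({'ageGroup': [1, 2, 2, 1],
                   'first2DigitsOfPostcode': ['12', '23', '12', '12'],
                   'valueOfProduct': ['low', 'medium', 'high', 'low'],
                   'lookup_join_value': ['foo', 'bar', 'foo', 'baz']})
new_values = pd.DataFrame({'ageGroup': [1],
                           'first2DigitsOfPostcode': ['12'],
                           'valueOfProduct': ['low']})
# merge joins on all shared column names by default, so this is an
# equality match on every condition column at once
print(df.merge(new_values, how='inner'))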
I think df.isin() is along the lines of what you're looking for.
Using your example df, and these two:
exists = pd.DataFrame({'ageGroup': [1],
                       'first2DigitsOfPostcode': ['12'],
                       'valueOfProduct': 'low'})
new = pd.DataFrame({'ageGroup': [1],
                    'first2DigitsOfPostcode': ['12'],
                    'valueOfProduct': 'high'})
Then you can check to see what values match, if all, or just some:
df.isin(exists.values[0])
Out[46]:
  ageGroup first2DigitsOfPostcode valueOfProduct
0     True                   True           True
1    False                  False          False
2    False                   True          False
3     True                   True           True

df.isin(new.values[0])
Out[47]:
  ageGroup first2DigitsOfPostcode valueOfProduct
0     True                   True          False
1    False                  False          False
2    False                   True           True
3     True                   True          False
If your "query" wasn't a dataframe but instead a list it wouldn't need the ".values[0]" bit. The problem with a dictionary is it tries to match the index as well.
It's not clear to me from your question exactly what you want returned, but you could then subset based on whether all (or some) of the rows are the same:
# Returns matching rows
df[df.isin(exists.values[0]).all(axis=1)]
# Returns rows where the first two columns match and the third does not
matches = df.isin(new.values[0]).values
df[[item == [True, True, False] for item in matches.tolist()]]
...There might be a smarter way to write that last one.
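One possibly tidier way to write that last filter, using NumPy broadcasting over the same mask (a sketch):
matches = df.isin(new.values[0]).values
df[(matches == [True, True, False]).all(axis=1)]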
df.query is an option if you can write the query as an expression using the column names:
So you can do:
query_string = 'some long (but valid) boolean query'
example from pandas:
>>> from numpy.random import randn
>>> from pandas import DataFrame
>>> df = DataFrame(randn(10, 2), columns=list('ab'))
>>> df.query('a > b')
# similar to this
>>> df[df.a > df.b]
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
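For the multi-condition equality lookup in the earlier question, the query string can simply chain the conditions with and (a sketch using the question's column names):
df.query('ageGroup == 1 and first2DigitsOfPostcode == "12" and valueOfProduct == "low"')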
I am trying to create a series in my dataframe (sdbfile) whose values are based on several nested conditional statements using elements from the sdbfile dataframe. The reins_code series is filled with string values.
The statement below works; however, I need it to say if reins_code begins with 'R' rather than equals a specific 'R#':
sdbfile['product'] = np.where(sdbfile.reins_code == 'R2', 'HiredPlant', 'Trad')
It doesn't like the string function startswith(), as it's a Series rather than a string?
Can anybody help please? I have waded through the documentation but cannot see a reference to this problem.
Use the vectorised str.startswith to return a boolean mask:
In [6]:
df = pd.DataFrame({'a':['R1asda','R2asdsa','foo']})
df
Out[6]:
         a
0   R1asda
1  R2asdsa
2      foo
In [8]:
df['a'].str.startswith('R2')
Out[8]:
0    False
1     True
2    False
Name: a, dtype: bool
In [9]:
df[df['a'].str.startswith('R2')]
Out[9]:
         a
1  R2asdsa
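Plugged back into the np.where call from the question, that would look something like this (a sketch using the question's names):
import numpy as np
sdbfile['product'] = np.where(sdbfile.reins_code.str.startswith('R'), 'HiredPlant', 'Trad')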
Use the pandas str attribute. http://pandas.pydata.org/pandas-docs/stable/text.html
Series and Index are equipped with a set of string processing methods
that make it easy to operate on each element of the array. Perhaps
most importantly, these methods exclude missing/NA values
automatically. These are accessed via the str attribute and generally
have names matching the equivalent (scalar) built-in string methods:
sdbfile['product'] = np.where(sdbfile.reins_code.str[0] == 'R', 'HiredPlant','Trad')
I am working with a pandas Series and I am trying to use the isin() method to find some of the members of the series. However, for pandas timestamp objects, this function does not appear to be working correctly.
import pandas
data = pandas.date_range('jan-01-2013', 'jan-05-2013')
s = pandas.Series(data)
print(s.iloc[0] == data[0])  # prints True (correct)
print(s.isin(data[0:2]))     # prints a Series of all False values (incorrect)
Obviously, for the second print statement, the expected result is that the first two members of the series are True and everything else is False. However, it returns all False values. Is this a bug, or am I using isin() incorrectly?
It works like this:
s.isin(data[0:2].values)
Slicing data[0:2] gives a DatetimeIndex, and in older pandas versions isin did not match its entries against the Timestamp values in the series; .values converts it to a plain datetime64 array, which compares correctly.
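A runnable sketch of the fix, with unambiguous ISO-style dates:
import pandas as pd

data = pd.date_range('2013-01-01', '2013-01-05')
s = pd.Series(data)
print(s.isin(data[0:2].values))
# 0     True
# 1     True
# 2    False
# 3    False
# 4    False
# dtype: bool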