Pandas: How to compare DateTime64 and Datetime [duplicate] - python

I have this excruciatingly annoying problem (I'm quite new to Python):
df=pd.DataFrame({'col1':['1','2','3','4']})
col1=df['col1']
Why does col1[1] in col1 return False?

To check values, use boolean indexing:
#get value where index is 1
print (col1[1])
2
#more common with loc
print (col1.loc[1])
2
print (col1 == '2')
0 False
1 True
2 False
3 False
Name: col1, dtype: bool
And if you need to get the matching rows:
print (col1[col1 == '2'])
1 2
Name: col1, dtype: object
To check for multiple values (logical or), use isin:
print (col1.isin(['2', '4']))
0 False
1 True
2 False
3 True
Name: col1, dtype: bool
print (col1[col1.isin(['2', '4'])])
1 2
3 4
Name: col1, dtype: object
And here is what the docs say about using in to test membership:
Using the Python in operator on a Series tests for membership in the index, not membership among the values.
If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and Series are dict-like. To test for membership in the values, use the method isin():
For DataFrames, likewise, in applies to the column axis, testing for membership in the list of column names.
#1 is in index
print (1 in col1)
True
#5 is not in index
print (5 in col1)
False
#string 2 is not in index
print ('2' in col1)
False
#number 2 is in index
print (2 in col1)
True
You are trying to find the string '2' among the index values:
print (col1[1])
2
print (type(col1[1]))
<class 'str'>
print (col1[1] in col1)
False

I might be missing something, and this is years later, but as I read the question, you are trying to get the in keyword to work on your pandas Series? So you probably want to do:
col1[1] in col1.values
Because as mentioned above, pandas is looking through the index, and you need to specifically ask it to look at the values of the series, not the index.
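The difference between the index check and the values check can be sketched as follows. This is a minimal example that rebuilds the DataFrame from the question; the variable names in_index, in_values and in_isin are made up for illustration.

```python
import pandas as pd

# Rebuild the Series from the question: string values, default integer index.
df = pd.DataFrame({"col1": ["1", "2", "3", "4"]})
col1 = df["col1"]

# `in` on a Series looks at the index labels (0, 1, 2, 3), not the values.
in_index = col1[1] in col1            # the string '2' is not an index label

# Looking at the underlying values gives the membership test the asker wanted.
in_values = col1[1] in col1.values

# isin() is the pandas way to test membership among the values.
in_isin = bool(col1.isin([col1[1]]).any())

print(in_index, in_values, in_isin)
```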

Related

Create a new variable based on 4 other variables

I have a dataframe in Python called df1 where I have 4 dichotomous variables called Ordering_1, Ordering_2, Ordering_3 and Ordering_4 with True/False values.
I need to create a variable called Clean, which is based on the 4 other variables. Meaning, when Ordering_1 == True, then Clean == Ordering_1, when Ordering_2 == True, then Clean == Ordering_2. Then Clean would be a combination of all the true values from Ordering_1, Ordering_2, Ordering_3 and Ordering_4.
Here is an example of how I would like the variable Clean to be:
I have tried the below code but it does not work:
df1[Clean] = df1[Ordering_1] + df1[Ordering_1] + df1[Ordering_1] + df1[Ordering_1]
Would anyone please be able to help me how to do this in python?
A general solution that also works if there are multiple True values per row: select the columns with DataFrame.filter and then use DataFrame.dot for matrix multiplication:
df1 = df.filter(like='Ordering_')
df['Clean'] = df1.dot(df1.columns + ',').str.strip(',')
If there is only one True value per row, you can use the boolean values of each column (Ordering_1, Ordering_2, etc.) as a mask with df1.loc.
Note that this is what you get with df1.Ordering_1:
0 True
1 False
2 False
3 False
Name: Ordering_1, dtype: bool
With df1.loc you can use that boolean Series to select only the rows where it is True, in this case only row 0.
So you can code this as follows.
Create a new blank "clean" column:
df1["clean"]=""
Set the rows where the Series df1.Ordering_1 is True to "Ordering_1":
df1.loc[df1.Ordering_1,["clean"]] = "Ordering_1"
Proceed with the remaining columns in the same way.
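The step-by-step approach above can be written as a loop over the columns. This is a minimal sketch that assumes exactly one True per row; the sample data and the name ordering_cols are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: one True value per row.
df1 = pd.DataFrame({
    "Ordering_1": [True, False, False, False],
    "Ordering_2": [False, True, False, False],
    "Ordering_3": [False, False, True, False],
    "Ordering_4": [False, False, False, True],
})

# Collect the flag columns before adding the new column.
ordering_cols = [c for c in df1.columns if c.startswith("Ordering_")]

# Start with a blank column, then fill in the column name wherever the flag is True.
df1["clean"] = ""
for col in ordering_cols:
    df1.loc[df1[col], "clean"] = col

print(df1["clean"].tolist())
```

If a row had more than one True, a later column would overwrite an earlier one here, which is why the filter/dot approach is the more general option.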

Python: Can df['a'].str.contains() have multiple conditions?

I have 4 types of value in my df column A example shown below
123
123/123/123/123/123
123,,123,,123
1234-1234-1234
I want the index of those values which do not have any type of separator in them.
I tried it like this but failed to get results:
mask = df["A"].str.contains(',','/' na=False)
Any help would be appreciated
If possible, invert the logic: select the rows that contain only digits, without any separator, using ^\d+$. Here ^ means start of string, \d+ means one or more digits, and $ means end of string, so together they match values made up only of digits:
mask = df["A"].str.contains(r'^\d+$', na=False)
print (mask)
0 True
1 False
2 False
3 False
Name: A, dtype: bool
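To get from the mask to the index the question asks for, and to show the non-inverted version with several separators in one pattern, here is a minimal sketch. The sample column mirrors the four values in the question; the character class [,/-] is an assumption about which separators matter.

```python
import pandas as pd

df = pd.DataFrame({"A": ["123", "123/123/123/123/123", "123,,123,,123", "1234-1234-1234"]})

# Inverted logic: keep rows that consist only of digits.
digits_only = df["A"].str.contains(r"^\d+$", na=False)
idx = df.index[digits_only].tolist()

# Direct logic: several separators go into one regex character class,
# then the mask is negated with ~.
has_separator = df["A"].str.contains(r"[,/-]", na=False)
idx_direct = df.index[~has_separator].tolist()

print(idx, idx_direct)
```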

Get amount of a certain unique value within a column [duplicate]

I am trying to find the number of times a certain value appears in one column.
I have made the dataframe with data = pd.DataFrame.from_csv('data/DataSet2.csv')
and now I want to find the number of times something appears in a column. How is this done?
I thought it was the below, where I am looking in the education column and counting the number of times ? occurs.
The code below shows that I am trying to find the number of times 9th appears and the error is what I am getting when I run the code
Code
missing2 = df.education.value_counts()['9th']
print(missing2)
Error
KeyError: '9th'
You can create a subset of the data with your condition and then use shape or len:
print(df)
col1 education
0 a 9th
1 b 9th
2 c 8th
print(df.education == '9th')
0 True
1 True
2 False
Name: education, dtype: bool
print(df[df.education == '9th'])
col1 education
0 a 9th
1 b 9th
print(df[df.education == '9th'].shape[0])
2
print(len(df[df['education'] == '9th']))
2
Performance is interesting: the fastest solution is to compare the NumPy array and sum:
Code:
import string
import numpy as np
import pandas as pd
import perfplot
np.random.seed(123)
# each kernel counts the rows where education == 'a'
def shape(df):
    return df[df.education == 'a'].shape[0]
def len_df(df):
    return len(df[df['education'] == 'a'])
def query_count(df):
    return df.query('education == "a"').education.count()
def sum_mask(df):
    return (df.education == 'a').sum()
def sum_mask_numpy(df):
    return (df.education.values == 'a').sum()
# build a test frame with n random letters
def make_df(n):
    L = list(string.ascii_letters)
    df = pd.DataFrame(np.random.choice(L, size=n), columns=['education'])
    return df
perfplot.show(
    setup=make_df,
    kernels=[shape, len_df, query_count, sum_mask, sum_mask_numpy],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
Couple of ways using count or sum
In [338]: df
Out[338]:
col1 education
0 a 9th
1 b 9th
2 c 8th
In [335]: df.loc[df.education == '9th', 'education'].count()
Out[335]: 2
In [336]: (df.education == '9th').sum()
Out[336]: 2
In [337]: df.query('education == "9th"').education.count()
Out[337]: 2
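The KeyError in the question comes from indexing the value_counts() result with a label that is not present. One way to sidestep it, sketched here with the small sample frame from the answers above, is Series.get with a default of 0; the names n_9th and n_7th are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b", "c"], "education": ["9th", "9th", "8th"]})

counts = df["education"].value_counts()

# .get returns the default instead of raising KeyError when the label is missing.
n_9th = counts.get("9th", 0)
n_7th = counts.get("7th", 0)

print(n_9th, n_7th)
```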
An elegant way to count the occurrences of '?' or any other symbol in every column is to use the isin method of a DataFrame.
Suppose that we have loaded the 'Automobile' dataset into df.
We do not know which columns contain the missing-value marker ('?'), so let's do:
df.isin(['?']).sum(axis=0)
DataFrame.isin(values) official document says:
it returns boolean DataFrame showing whether each element in the DataFrame
is contained in values
Note that isin accepts an iterable as input, thus we need to pass a list containing the target symbol to this function. df.isin(['?']) will return a boolean dataframe as follows.
symboling normalized-losses make fuel-type aspiration-ratio ...
0 False True False False False
1 False True False False False
2 False True False False False
3 False False False False False
4 False False False False False
5 False True False False False
...
To count the number of occurrences of the target symbol in each column, take the sum over all the rows of the above dataframe by passing axis=0.
The final (truncated) result shows what we expect:
symboling 0
normalized-losses 41
...
bore 4
stroke 4
compression-ratio 0
horsepower 2
peak-rpm 2
city-mpg 0
highway-mpg 0
price 4
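The same idea on a tiny frame, as a minimal sketch. The data below is made up and is not the Automobile dataset; only the isin/sum pattern is taken from the answer.

```python
import pandas as pd

# Hypothetical data with '?' used as a missing-value marker.
df = pd.DataFrame({
    "make": ["audi", "?", "bmw"],
    "price": ["?", "?", "16500"],
})

# Boolean frame of matches, then a column-wise sum to count them.
counts = df.isin(["?"]).sum(axis=0)

print(counts.to_dict())
```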
Try this:
(df['education']=='9th').sum()
easy but not efficient:
list(df.education).count('9th')
Simple example to count occurrences (unique values) in a column in Pandas data frame:
import pandas as pd
# URL to .csv file
data_url = 'https://yoursite.com/Arrests.csv'
# Reading the data
df = pd.read_csv(data_url, index_col=0)
# pandas count distinct values in column
df['education'].value_counts()
Outputs:
Education 47516
9th 41164
8th 25510
7th 25198
6th 25047
...
3rd 2
2nd 2
1st 2
Name: name, Length: 190, dtype: int64
To find the count of a specific value in a column, you can use the code below:
df.col_name.value_counts().Value_you_are_looking_for
Take the Titanic dataset as an example:
df.Sex.value_counts().male
This gives a count of all males on the ship.
Attribute access like this only works when the value is a valid Python identifier, so it fails for numeric values.
For those cases you can use a second method, filtering the data frame:
#this is an example method of counting on a data frame
df[(df['Survived']==1)&(df['Sex']=='male')].count()
This is not as efficient as value_counts(), but it will help if you want to count rows of a data frame that match several conditions.
Hope this helps.
EDIT --
If you want to look for a value with a space in it, index the result of value_counts() with the string instead:
df.country.value_counts()['united states']
I believe this should solve the problem.
I think this could be a more easy solution. Suppose you have the following data frame.
DATE LANG POSTS
2008-07-01 c# 3
2008-08-01 assembly 8
2008-08-01 javascript 2
2008-08-01 c 85
2008-08-01 python 11
2008-07-01 c# 3
2008-08-01 assembly 8
2008-08-01 javascript 62
2008-08-01 c 85
2008-08-01 python 14
You can find the sum for each LANG item like this:
df.groupby('LANG').sum()
and you will have the sum of each individual language
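Note that a grouped sum totals the POSTS values rather than counting how often each language appears. The sketch below shows both, using the LANG and POSTS columns from the table above (DATE is left out for brevity); selecting the POSTS column explicitly is my choice so that only the numeric column is summed.

```python
import pandas as pd

df = pd.DataFrame({
    "LANG": ["c#", "assembly", "javascript", "c", "python",
             "c#", "assembly", "javascript", "c", "python"],
    "POSTS": [3, 8, 2, 85, 11, 3, 8, 62, 85, 14],
})

# Total number of posts per language.
posts_per_lang = df.groupby("LANG")["POSTS"].sum()

# Number of rows (occurrences) per language.
rows_per_lang = df.groupby("LANG").size()

print(posts_per_lang.to_dict())
print(rows_per_lang.to_dict())
```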

Validate strings using regex in pandas

I need a bit of help.
I'm pretty new to Python (I use version 3.0 bundled with Anaconda) and I want to use regex to validate/return a list of only valid numbers that match a criterion (say \d{11} for 11 digits). I'm getting the list using Pandas:
df = pd.DataFrame(columns=['phoneNumber','count'], data=[
['08034303939',11],
['08034382919',11],
['0802329292',10],
['09039292921',11]])
When I return all the items using
for row in df.iterrows(): # dataframe.iterrows() returns tuple
    print(row[1][0])
it returns all items without regex validation, but when I try to validate with this
for row in df.iterrows(): # dataframe.iterrows() returns tuple
    print(re.compile(r"\d{11}").search(row[1][0]).group())
it raises an AttributeError (since search() returns None for non-matching values).
How can I work around this, or is there an easier way?
If you want to validate, you can use str.match and convert the result to a boolean mask with astype(bool):
x = df['phoneNumber'].str.match(r'\d{11}').astype(bool)
x
0 True
1 True
2 False
3 True
Name: phoneNumber, dtype: bool
You can use boolean indexing to return only rows with valid phone numbers.
df[x]
phoneNumber count
0 08034303939 11
1 08034382919 11
3 09039292921 11
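One caveat: str.match only anchors the pattern at the start of the string, so a number with more than 11 digits would also pass. If exactly 11 digits are required, Series.str.fullmatch is one option; this sketch assumes pandas 1.1 or newer, and the 12-digit sample row is made up for illustration.

```python
import pandas as pd

# Hypothetical sample: the last number has 12 digits and should be rejected.
df = pd.DataFrame(
    columns=["phoneNumber", "count"],
    data=[
        ["08034303939", 11],
        ["0802329292", 10],
        ["080343039391", 12],
    ],
)

# match() checks only that the string starts with 11 digits.
starts_with_11 = df["phoneNumber"].str.match(r"\d{11}")

# fullmatch() requires the whole string to be exactly 11 digits.
exactly_11 = df["phoneNumber"].str.fullmatch(r"\d{11}")

valid = df[exactly_11]
print(valid)
```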

Fail to filter pandas dataframe by categorical column

pandas 0.16.1
I converted all columns in the dataframe to categoricals so it takes MUCH less space when dumped to disk. Now I want to filter the dataframe. It's OK with == and .isin but fails on <, <=, etc. with "Unordered Categoricals can only compare equality or not":
data[data["MONTH COLUMN"]<=3]
If I comment out the following lines in categorical.py everything works fine. Is it a bug in pandas?
if not self.ordered:
if op in ['__lt__', '__gt__','__le__','__ge__']:
raise TypeError("Unordered Categoricals can only compare equality or not")
(I think it was a good idea to use the Categorical datatype on a column which has only 12 unique values in ~1'400'000 rows.)
The documentation states:
Note: New categorical data are NOT automatically ordered. You must explicitly pass ordered=True to indicate an ordered Categorical.
When you first create a category you want to be ordered, just specify this:
In [1]: import pandas as pd
In [3]: s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
In [5]: s
Out[5]:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a < b < c]
In [4]: s > 'a'
Out[4]:
0 False
1 True
2 True
3 False
dtype: bool
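The astype('category', ordered=True) call above matches the pandas version named in the question (0.16.1). In later pandas releases the ordering is specified through a CategoricalDtype instead; a minimal sketch of that form follows. The name letter_type and the month-like sample values are made up for illustration.

```python
import pandas as pd

# Ordered categorical via an explicit dtype (newer pandas versions).
letter_type = pd.CategoricalDtype(categories=["a", "b", "c"], ordered=True)
s = pd.Series(["a", "b", "c", "a"]).astype(letter_type)
mask = s > "a"

# An existing unordered categorical can be marked as ordered after the fact.
unordered = pd.Series([1, 2, 3, 12], dtype="category")
ordered = unordered.cat.as_ordered()
early = ordered <= 3

print(mask.tolist(), early.tolist())
```

With the column ordered, a filter such as data[data["MONTH COLUMN"] <= 3] no longer raises the TypeError from the question.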
