I have a very messy dataframe imported from excel with only some rows containing a date in the first column (index 0, no headers). How do I drop all the rows that don't contain a date?
I would use pd.to_datetime with errors='coerce', then drop the null dates by indexing:
For example:
>>> df
x y
0 2011-02-03 1
1 x 2
2 1 3
3 2012-03-03 4
>>> df[pd.to_datetime(df.x, errors='coerce').notnull()]
x y
0 2011-02-03 1
3 2012-03-03 4
Note: this can lead to problems if you have mixed date formats in your column.
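If all of your dates share one layout, you can pass format= so that only strings matching that layout are parsed as dates (the '%Y-%m-%d' below is an assumption based on the example data):
>>> pd.to_datetime(df.x, format='%Y-%m-%d', errors='coerce')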
Explanation:
Using pd.to_datetime with errors='coerce' will try to parse each value as a date and return NaT (which is null) where parsing fails:
>>> pd.to_datetime(df.x, errors='coerce')
0 2011-02-03
1 NaT
2 NaT
3 2012-03-03
Name: x, dtype: datetime64[ns]
Therefore, you can get all the non-null values using notnull:
>>> pd.to_datetime(df.x, errors='coerce').notnull()
0 True
1 False
2 False
3 True
Name: x, dtype: bool
And use that as a mask on your original dataframe.
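Putting it together for the case in the question (dates in the first column, no headers), a minimal self-contained sketch; the file name is a placeholder:
import pandas as pd

# read the messy sheet without headers, so columns get integer labels
df = pd.read_excel('messy.xlsx', header=None)

# keep only rows whose first column (label 0) parses as a date
df = df[pd.to_datetime(df[0], errors='coerce').notnull()]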
I have big data, and I want to compute the count, sum, and average for each row, but only for values within a specific range.
df = pd.DataFrame({'id0':[10.3,20,30,50,108,110],'id1':[100.5,0,300,570,400,140], 'id2':[-2.6,-3,5,12,44,53], 'id3':[-100.1,4,6,22,12,42]})
id0 id1 id2 id3
0 10.3 100.5 -2.6 -100.1
1 20.0 0.0 -3.0 4.0
2 30.0 300.0 5.0 6.0
3 50.0 570.0 12.0 22.0
4 108.0 400.0 44.0 12.0
5 110.0 140.0 53.0 42.0
For example, I want to count the occurrences of values between 10 and 100 for each row, which should give:
0 1
1 1
2 1
3 3
4 2
5 2
Name: count_10-100, dtype: int64
Currently I do this by iterating over each row, traversing it, and using groupby, but this takes a long time because I have ~500 columns and 500,000 rows.
You can apply the conditions with AND between them and then sum along the rows (axis=1):
((df >= 10) & (df <= 100)).sum(axis=1)
Output:
0 1
1 1
2 1
3 3
4 2
5 2
dtype: int64
For sum and mean, you can apply the conditions with where:
df.where((df >= 10) & (df <= 100)).sum(axis=1)
df.where((df >= 10) & (df <= 100)).mean(axis=1)
Credit for this goes to #anky, who posted it first as a comment :)
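If you want the count, sum, and mean together, a small sketch building on the same mask (the result column names are just illustrative):
mask = (df >= 10) & (df <= 100)
in_range = df.where(mask)

result = pd.DataFrame({
    'count_10-100': mask.sum(axis=1),
    'sum_10-100': in_range.sum(axis=1),
    'mean_10-100': in_range.mean(axis=1),
})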
Below summarizes the different situations in which you'd want to count something in a DataFrame (or Series, for completeness), along with the recommended method(s).
DataFrame.count returns counts for each column as a Series since the non-null count varies by column.
DataFrameGroupBy.size returns a Series, since all columns in the same group share the same row-count.
DataFrameGroupBy.count returns a DataFrame, since the non-null count could differ across columns in the same group.
To get the group-wise non-null count for a specific column, use df.groupby(...)['x'].count() where "x" is the column to count.
Code Examples
df = pd.DataFrame({
'A': list('aabbc'), 'B': ['x', 'x', np.nan, 'x', np.nan]})
s = df['B'].copy()
df
A B
0 a x
1 a x
2 b NaN
3 b x
4 c NaN
s
0 x
1 x
2 NaN
3 x
4 NaN
Name: B, dtype: object
Row Count of a DataFrame: len(df), df.shape[0], or len(df.index)
len(df)
# 5
df.shape[0]
# 5
len(df.index)
# 5
Of the three methods above, len(df.index) (as mentioned in other answers) is the fastest.
Note
All the methods above are constant time operations as they are simple attribute lookups.
df.shape (similar to ndarray.shape) is an attribute that returns a tuple of (# Rows, # Cols).
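For the example df above:
df.shape
# (5, 2)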
Column Count of a DataFrame: df.shape[1], len(df.columns)
df.shape[1]
# 2
len(df.columns)
# 2
Analogous to len(df.index), len(df.columns) is the faster of the two methods (but takes more characters to type).
Row Count of a Series:
len(s), s.size, len(s.index)
len(s)
# 5
s.size
# 5
len(s.index)
# 5
s.size and len(s.index) are about the same in terms of speed. But I recommend len(s).
size is an attribute, and it returns the number of elements (=count of rows for any Series). DataFrames also define a size attribute which returns the same result as
df.shape[0] * df.shape[1].
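With the example df:
df.size
# 10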
Non-Null Row Count: DataFrame.count and Series.count
The methods described here only count non-null values (meaning NaNs are ignored).
Calling DataFrame.count will return non-NaN counts for each column:
df.count()
A 5
B 3
dtype: int64
For Series, use Series.count to similar effect:
s.count()
# 3
Group-wise Row Count: GroupBy.size
For DataFrames, use DataFrameGroupBy.size to count the number of rows per group.
df.groupby('A').size()
A
a 2
b 2
c 1
dtype: int64
Similarly, for Series, you'll use SeriesGroupBy.size.
s.groupby(df.A).size()
A
a 2
b 2
c 1
Name: B, dtype: int64
In both cases, a Series is returned.
Group-wise Non-Null Row Count: GroupBy.count
Similar to above, but use GroupBy.count, not GroupBy.size. Note that size always returns a Series, while count returns a Series if called on a specific column, or else a DataFrame.
The following methods return the same thing:
df.groupby('A')['B'].size()
df.groupby('A').size()
A
a 2
b 2
c 1
Name: B, dtype: int64
df.groupby('A').count()
B
A
a 2
b 1
c 0
df.groupby('A')['B'].count()
A
a 2
b 1
c 0
Name: B, dtype: int64
There's a neat way to do this with aggregation and pandas comparison methods. It can be read as "aggregate by row (axis=1), counting where x is greater than or equal to 10 and less than or equal to 100".
df.agg(lambda x : (x.ge(10) & x.le(100)).sum(), axis=1)
Something like this should help:
df["n_values_in_range"] = df.apply(
func=lambda row: count_values_in_range(row, range_min, range_max), axis=1)
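This assumes a helper named count_values_in_range and the bounds range_min / range_max are already defined; a minimal sketch consistent with that call could be:
def count_values_in_range(series, range_min, range_max):
    # count how many of the row's values fall inside [range_min, range_max]
    return series.between(range_min, range_max).sum()

range_min, range_max = 10, 100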
Try this:
df.apply(lambda x: x.between(10, 100), axis=1).sum(axis=1)
Output:
0 1
1 1
2 1
3 3
4 2
5 2
I have a DataFrame which contains data from last year, but the date column has some missing dates:
date
0 2019-10-21
1 2019-10-29
2 2019-11-01
3 2019-11-04
4 2019-11-05
I want to create a dictionary of the gaps between dates, where the keys are gap start dates and the values are gap end dates, something like:
dates_gaps = {2019-10-21:2019-10-29, 2019-10-29:2019-11-01,2019-11-01:2019-11-04 ...}
So I created a column to indicate whether a gap exists, using the following:
df['missing_dates'] = df[DATE].diff().dt.days > 1
which outputs the following:
# True indicates a gap between this date and the previous one
0 2019-10-21 False
1 2019-10-29 True
2 2019-11-01 True
3 2019-11-04 True
4 2019-11-05 False
and I'm having trouble going forward from here
You can add a condition that also keeps rows where the diff is missing (the first row), convert the date column to strings with Series.dt.strftime, and finally create the dictionary with zip:
diff = df['date'].diff()
s = df.loc[(diff.dt.days > 1) | diff.isna(), 'date'].dt.strftime('%Y-%m-%d')
print (s)
0 2019-10-21
1 2019-10-29
2 2019-11-01
3 2019-11-04
Name: date, dtype: object
d = dict(zip(s, s.shift(-1)[:-1]))
print (d)
{'2019-10-21': '2019-10-29', '2019-10-29': '2019-11-01', '2019-11-01': '2019-11-04'}
Just convert the dates to datetime and find the difference between two adjacent dates:
a = pd.to_datetime('1900-01-01', format='%Y-%m-%d')
b = pd.to_datetime('1900-02-01', format='%Y-%m-%d')
c = a-b
c.days # -31
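Applied to the DataFrame in the question, the same idea per adjacent rows might look like this (a sketch, assuming the column is named date):
dates = pd.to_datetime(df['date'])
gap_days = dates.diff().dt.days   # days between each date and the previous one
has_gap = gap_days > 1            # True where at least one date is missing in between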
I have a dataframe (df) whose head looks like:
Date
0 01/04/2015
1 01/09/1996
2 N/A
3 12/05/1992
4 NOT KNOWN
Is there a way to remove the non-date values (but not the rows)? With this example, the resulting frame would look like:
Date
0 01/04/2015
1 01/09/1996
2
3 12/05/1992
4
All the examples I can see want me to drop the rows and I'd like to keep them.
pd.to_datetime
With errors='coerce'
df.assign(Date=pd.to_datetime(df.Date, errors='coerce'))
Date
0 2015-01-04
1 1996-01-09
2 NaT
3 1992-12-05
4 NaT
You can fill those NaT with empty strings if you'd like (though I don't recommend it):
df.assign(Date=pd.to_datetime(df.Date, errors='coerce').fillna(''))
Date
0 2015-01-04 00:00:00
1 1996-01-09 00:00:00
2
3 1992-12-05 00:00:00
4
If you want to preserve whatever values were in your dataframe and simply replace the ones that don't look like dates with '':
df.assign(Date=df.Date.mask(pd.to_datetime(df.Date, errors='coerce').isna(), ''))
Date
0 01/04/2015
1 01/09/1996
2
3 12/05/1992
4
One more simple way around:
>>> df
Date
0 01/04/2015
1 01/09/1996
2 N/A
3 12/05/1992
4 NOT KNOWN
>>> df['Date'] = pd.to_datetime(df['Date'], errors='coerce').fillna('')
>>> df
Date
0 2015-01-04 00:00:00
1 1996-01-09 00:00:00
2
3 1992-12-05 00:00:00
4
There are many questions on reindexing; I tried the solutions, but they didn't work for my code (maybe I got something wrong). I have a dataset with two variables, patnum (ID) and vrddat (date), and I'm using the code below to get a data frame after applying certain conditions.
data_3 = data_2.loc[(((data_2.groupby('patnum').first()['vrddat']> datetime.date(2012,1,1)) &
(data_2.groupby('patnum').first()['vrddat']> datetime.date(2012,3,31)))),['patnum','vrddat','drug']].reset_index(drop = True)
The above code throws the error below.
IndexingError
IndexingError: Unalignable boolean Series key provided
How do I get a new data frame with all the variables from the input data after applying the conditions? The conditions themselves work, but when I use loc to build a new data frame with all the variables it throws the indexing error. I used reset_index as well, but it didn't work.
Thanks.
The problem is that you are using boolean indexing on DataFrame data_2 with a mask created from the Series s, which has a different index. Instead, use isin to check the values in column vrddat against vals:
data_2 = pd.DataFrame({'patnum':[1,2,3,3,1],
                       'vrddat':pd.date_range('2012-01-10', periods=5, freq='M'),
                       'drug':[7,8,9,7,5],
                       'zzz':[1,3,5,6,7]})
print (data_2)
drug patnum vrddat zzz
0 7 1 2012-01-31 1
1 8 2 2012-02-29 3
2 9 3 2012-03-31 5
3 7 3 2012-04-30 6
4 5 1 2012-05-31 7
s = data_2.groupby('patnum')['vrddat'].first()
print (s)
patnum
1 2012-01-31
2 2012-02-29
3 2012-03-31
Name: vrddat, dtype: datetime64[ns]
mask = (s > datetime.date(2012,1,1)) & (s < datetime.date(2012,3,31))
print (mask)
patnum
1 True
2 True
3 False
Name: vrddat, dtype: bool
vals = s[mask]
print (vals)
patnum
1 2012-01-31
2 2012-02-29
Name: vrddat, dtype: datetime64[ns]
data_3 = (data_2.loc[data_2['vrddat'].isin(vals), ['patnum','vrddat','drug']]
                .reset_index(drop=True))
print (data_3)
patnum vrddat drug
0 1 2012-01-31 7
1 2 2012-02-29 8
Another, faster way to build s is with drop_duplicates:
s = data_2.drop_duplicates(['patnum'])['vrddat']
print (s)
0 2012-01-31
1 2012-02-29
2 2012-03-31
Name: vrddat, dtype: datetime64[ns]
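The rest of the solution stays the same; a sketch of the full flow with this variant:
s = data_2.drop_duplicates(['patnum'])['vrddat']
mask = (s > datetime.date(2012,1,1)) & (s < datetime.date(2012,3,31))
data_3 = (data_2.loc[data_2['vrddat'].isin(s[mask]), ['patnum','vrddat','drug']]
                .reset_index(drop=True))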
shift converts my column from integer to float. It turns out that np.nan is float only. Is there any way to keep the shifted column as integer?
df = pd.DataFrame({"a":range(5)})
df['b'] = df['a'].shift(1)
df['a']
# 0 0
# 1 1
# 2 2
# 3 3
# 4 4
# Name: a, dtype: int64
df['b']
# 0 NaN
# 1 0
# 2 1
# 3 2
# 4 3
# Name: b, dtype: float64
Solution for pandas under 0.24:
The problem is that you get a NaN value, which is a float, so the int column is converted to float - see NA type promotions.
One possible solution is to convert the NaN values to some value like 0, after which it is possible to convert to int:
df = pd.DataFrame({"a":range(5)})
df['b'] = df['a'].shift(1).fillna(0).astype(int)
print (df)
a b
0 0 0
1 1 0
2 2 1
3 3 2
4 4 3
Solution for pandas 0.24+ - check Series.shift:
fill_value object, optional
The scalar value to use for newly introduced missing values. the default depends on the dtype of self. For numeric data, np.nan is used. For datetime, timedelta, or period data, etc. NaT is used. For extension dtypes, self.dtype.na_value is used.
Changed in version 0.24.0.
df['b'] = df['a'].shift(fill_value=0)
Another solution, available starting from pandas version 0.24.0: simply provide a value for the fill_value parameter:
df['b'] = df['a'].shift(1, fill_value=0)
You can construct a numpy array by prepending a 0 to all but the last element of column a:
df.assign(b=np.append(0, df.a.values[:-1]))
a b
0 0 0
1 1 0
2 2 1
3 3 2
4 4 3
As of pandas 1.0.0 I believe you have another option, which is to first use convert_dtypes. This converts the dataframe columns to dtypes that support pd.NA, avoiding the issues with NaN.
df = pd.DataFrame({"a":range(5)})
df = df.convert_dtypes()
df['b'] = df['a'].shift(1)
print(df['a'])
# 0 0
# 1 1
# 2 2
# 3 3
# 4 4
# Name: a, dtype: Int64
print(df['b'])
# 0 <NA>
# 1 0
# 2 1
# 3 2
# 4 3
# Name: b, dtype: Int64
Another solution is to use the replace() function and a type cast:
df['b'] = df['a'].shift(1).replace(np.NaN,0).astype(int)
I don't like the other answers, which may change the original dtypes: what if you have float or str columns in your data?
Since we don't need the first NaN row, why not skip it?
I would keep all the dtypes and cast back:
dt = df.dtypes
df = df.shift(1).iloc[1:].astype(dt)
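For example, with the df from the question (assigning to out here just for illustration):
df = pd.DataFrame({"a": range(5)})
dt = df.dtypes
out = df.shift(1).iloc[1:].astype(dt)
print(out)
#    a
# 1  0
# 2  1
# 3  2
# 4  3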