Removing the words DateTimeIndex from a list of dates - python

I have multiple lists of dates in a pandas DataFrame, in this format:
col1 col2
1 [DatetimeIndex(['2018-10-01', '2018-10-02',
'2018-10-03', '2018-10-04'],
dtype='datetime64[ns]', freq='D')
I would like to take off the words DatetimeIndex and dtype='datetime64[ns]', freq='D' and turn the list into a set. The format I would be looking for is:
{'2018-10-01', '2018-10-02', '2018-10-03', '2018-10-04'}

Pandas is not designed to hold collections within series values, so what you are looking to do is strongly discouraged. A much better idea, especially if you have a consistent number of values in each DatetimeIndex series value, is to join extra columns:
import pandas as pd

D = pd.DatetimeIndex(['2018-10-01', '2018-10-02', '2018-10-03', '2018-10-04'],
                     dtype='datetime64[ns]', freq='D')
df = pd.DataFrame({'col1': [1], 'col2': [D]})
# pop col2, expand each DatetimeIndex into its own columns, and join them back
df = df.join(pd.DataFrame(df.pop('col2').values.tolist()))
print(df)
col1 0 1 2 3
0 1 2018-10-01 2018-10-02 2018-10-03 2018-10-04
If you really want a set as each series value, you can do so via map + set:
df['col2'] = list(map(set, df['col2'].values))
print(df)
col1 col2
0 1 {2018-10-01 00:00:00, 2018-10-02 00:00:00, 201...

Have you tried:
set(index_object.tolist())
I suspect this will return a set of Timestamp objects rather than strings, so whether it is what you want depends on your use case.
If it's the strings you want, you can modify the code as follows (a DatetimeIndex exposes strftime directly; the .dt accessor only exists on a Series):
set(index_object.strftime("%Y-%m-%d").tolist())
For your specific format (which I don't necessarily approve of!) you can try this:
import itertools
string_lists = col2.apply(lambda x: x.strftime("%Y-%m-%d").tolist())
unique_set = set(itertools.chain.from_iterable(string_lists.tolist()))
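Putting the pieces of this answer together, here is a minimal sketch. The two-row frame and the column name col2_sets are made up for illustration; it assumes every col2 cell holds a DatetimeIndex, as in the question.

```python
import itertools

import pandas as pd

# Hypothetical data: each col2 cell holds a DatetimeIndex.
df = pd.DataFrame({
    "col1": [1, 2],
    "col2": [
        pd.date_range("2018-10-01", periods=4, freq="D"),
        pd.date_range("2018-10-03", periods=3, freq="D"),
    ],
})

# One set of 'YYYY-MM-DD' strings per row.
df["col2_sets"] = df["col2"].apply(lambda idx: set(idx.strftime("%Y-%m-%d")))

# A single set with every distinct date across all rows.
all_dates = set(itertools.chain.from_iterable(df["col2_sets"]))

print(sorted(df["col2_sets"].iloc[0]))
print(sorted(all_dates))
```

Sets are unordered, so sort them when you need a stable display order.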

Related

How to properly remove na_values that were read with a specific format?

I am trying to remove rows with a specific NA format using the .dropna() method from pandas; however, when I apply it, the method returns a None object.
import pandas as pd
# importing data #
df = pd.read_csv(path, sep=',', na_values='NA')
# this is how the df looks like
df = {'col1': [1, 2], 'col2': ['NA', 4]}
df=pd.DataFrame(df)
# trying to drop NA
d= df.dropna(how='any', inplace=True)
This code returns a None object; the expected output would look like this:
# col1 col2
#0 2 4
How could I adjust this method?
Is there any simpler method to accomplish this task?
import numpy as np
import pandas as pd
First, replace the 'NA' strings in your dataframe with actual NaN values using the replace() method:
df=df.replace('NA',np.nan,regex=True)
Finally:
df.dropna(how='any', inplace=True)
Now if you print df you will get your desired output:
col1 col2
1 2 4.0
If you want exactly the output shown in the question, just use the reset_index() method:
df=df.reset_index(drop=True)
Now if you print df you will get:
col1 col2
0 2 4.0
Alternatively, remove the records that contain the string 'NA':
df[~df.eq('NA').any(axis=1)]
col1 col2
1 2 4
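The None in the question comes from inplace=True: with that flag, dropna modifies the frame and returns nothing, so assigning its result stores None. A minimal sketch of both behaviours, using the question's sample frame (the variable names in_place_result and cleaned are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": ["NA", 4]})

# Turn the literal string 'NA' into a real missing value.
df = df.replace("NA", np.nan)

# With inplace=True the call mutates its target and returns None.
in_place_result = df.copy().dropna(how="any", inplace=True)

# Without inplace, dropna returns the cleaned frame, which can be assigned.
cleaned = df.dropna(how="any").reset_index(drop=True)

print(in_place_result)
print(cleaned)
```

So either call dropna with inplace=True and keep using df, or drop the flag and assign the result, but not both.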

Parsing Column names as DateTime

Is there a way of parsing the column names themselves as datetimes? My column names look like this:
Name SizeRank 1996-06 1996-07 1996-08 ...
I know that I can convert values for a column to datetime values, e.g for a column named datetime, I can do something like this:
temp = pd.read_csv('data.csv', parse_dates=['datetime'])
Is there a way of converting the column names themselves? I have 285 columns, i.e. my data spans 1996 to 2019.
There's no way of doing that immediately while reading the data from a file afaik, but you can fairly simply convert the columns to datetime after you've read them in. You just need to watch out that you don't pass columns that don't actually contain a date to the function.
Could look something like this, assuming all columns after the first two are dates (as in your example):
dates = pd.to_datetime(df.columns[2:])
You can then do whatever you need to do with those datetimes.
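As a rough sketch of writing the converted names back, assuming (as in the question) that the first two columns are not dates and the remaining names follow a year-month pattern; the sample row and the explicit format string are assumptions:

```python
import pandas as pd

# Hypothetical frame with the question's column layout.
df = pd.DataFrame(
    [["A", 1, 10.0, 11.0, 12.0]],
    columns=["Name", "SizeRank", "1996-06", "1996-07", "1996-08"],
)

# Parse only the date-like names; an explicit format avoids guessing.
date_cols = pd.to_datetime(df.columns[2:], format="%Y-%m")

# Rebuild the column labels: plain strings first, then Timestamps.
df.columns = list(df.columns[:2]) + list(date_cols)

print(df.columns.tolist())
```

The resulting column index has mixed label types (strings and Timestamps), which pandas allows but which can make later label-based selection less convenient.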
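You could do something like this (both pieces are converted to lists first, because + on two Index objects is element-wise rather than a concatenation):
df.columns = df.columns[:2].tolist() + pd.to_datetime(df.columns[2:]).tolist()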
It seems pandas will accept a datetime object as a column name...
import pandas as pd
from datetime import datetime
import re
columns = ["Name", "2019-01-01","2019-01-02"]
data = [["Tom", 1,0], ["Dick",1,1], ["Harry",0,0]]
df = pd.DataFrame(data, columns = columns)
print(df)
newcolumns = {}
for col in df.columns:
    # raw string so that \d is passed to the regex engine unchanged
    if re.search(r"\d+-\d+-\d+", col):
        newcolumns[col] = pd.to_datetime(col)
    else:
        newcolumns[col] = col
print(newcolumns)
df.rename(columns = newcolumns, inplace = True)
print("--------------------")
print(df)
print("--------------------")
for col in df.columns:
    print(type(col), col)
OUTPUT:
Name 2019-01-01 2019-01-02
0 Tom 1 0
1 Dick 1 1
2 Harry 0 0
{'Name': 'Name', '2019-01-01': Timestamp('2019-01-01 00:00:00'), '2019-01-02': Timestamp('2019-01-02 00:00:00')}
--------------------
Name 2019-01-01 00:00:00 2019-01-02 00:00:00
0 Tom 1 0
1 Dick 1 1
2 Harry 0 0
--------------------
<class 'str'> Name
<class 'pandas._libs.tslibs.timestamps.Timestamp'> 2019-01-01 00:00:00
<class 'pandas._libs.tslibs.timestamps.Timestamp'> 2019-01-02 00:00:00
For brevity you can use...
newcolumns = {col:(pd.to_datetime(col) if re.search(r"\d+-\d+-\d+", col) else col) for col in df.columns}
df.rename(columns = newcolumns, inplace = True)

Change date format of pandas column (month-day-year to day-month-year)

I've got the following issue.
I have a column in my pandas DataFrame with some dates and some empty values.
Example:
1 - 3-20-2019
2 -
3 - 2-25-2019
etc
I want to convert the format from month-day-year to day-month-year, and when a value is empty, I just want to keep it empty.
What is the fastest approach?
Thanks!
You can initialize the day column with strings, convert the strings to datetimes, and then format the datetimes as needed when printing.
I will use a different output format (with dots as separators) so that the conversion between the steps is clear.
Sample code first:
import pandas as pd
data = {'day': ['3-20-2019', None, '2-25-2019'] }
df = pd.DataFrame( data )
df['day'] = pd.to_datetime(df['day'])
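df['day'] = df['day'].dt.strftime('%d.%m.%Y')
df[ df == 'NaT' ] = ''  # older pandas formats the missing date as the string 'NaT'
df = df.fillna('')      # newer pandas may leave NaN there instead, so blank that as well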
Comments on the above.
In the IPython interpreter, df['day'] initially looks like this:
In [56]: df['day']
Out[56]:
0 3-20-2019
1 None
2 2-25-2019
Name: day, dtype: object
After the conversion to datetime:
In [58]: df['day']
Out[58]:
0 2019-03-20
1 NaT
2 2019-02-25
Name: day, dtype: datetime64[ns]
so that we have
In [59]: df['day'].dt.strftime('%d.%m.%Y')
Out[59]:
0 20.03.2019
1 NaT
2 25.02.2019
Name: day, dtype: object
That NaT causes problems. So we replace all its occurrences with the empty string.
In [73]: df[ df=='NaT' ] = ''
In [74]: df
Out[74]:
day
0 20.03.2019
1
2 25.02.2019
Not sure if this is the fastest way to get it done. Anyway,
df = pd.DataFrame({'Date': {0: '3-20-2019', 1:"", 2:"2-25-2019"}}) #your dataframe
df['Date'] = pd.to_datetime(df.Date) #convert to datetime format
df['Date'] = [d.strftime('%d-%m-%Y') if not pd.isnull(d) else '' for d in df['Date']]
Output:
Date
0 20-03-2019
1
2 25-02-2019
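For reference, a compact sketch that combines the two answers above. It assumes the input is always month-day-year, so the format is passed explicitly, and it blanks missing rows using a mask taken from the parsed dates, which does not depend on how a given pandas version renders NaT. The name parsed is illustrative.

```python
import pandas as pd

# Sample column with an empty string and a None among the dates.
df = pd.DataFrame({"Date": ["3-20-2019", "", "2-25-2019", None]})

# Parse month-day-year explicitly; anything unparseable becomes NaT.
parsed = pd.to_datetime(df["Date"], format="%m-%d-%Y", errors="coerce")

# Format as day-month-year and blank out the rows that had no date.
df["Date"] = parsed.dt.strftime("%d-%m-%Y").where(parsed.notna(), "")

print(df)
```

Because both steps are vectorized, this should scale better than a Python-level loop over the rows, although it has not been benchmarked here.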

How to apply pandas.DataFrame.replace on selected columns with inplace = True?

import pandas as pd
df = pd.DataFrame({
'col1':[99,99,99],
'col2':[4,5,6],
'col3':[7,None,9]
})
col_list = ['col1','col2']
df[col_list].replace(99,0,inplace=True)
This generates a Warning and leaves the dataframe unchanged.
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
I want to be able to apply the replace method on a subset of the columns specified by the user. I also want to use inplace = True to avoid making a copy of the dataframe, since it is huge. Any ideas on how this can be accomplished would be appreciated.
When you select the columns for replacement with df[col_list], a slice (a copy) of your dataframe is created. The copy is updated, but never written back into the original dataframe.
You should either replace one column at a time or use nested dictionary mapping:
df.replace(to_replace={'col1' : {99 : 0}, 'col2' : {99 : 0}},
inplace=True)
The nested dictionary for to_replace can be generated automatically:
d = {col : {99:0} for col in col_list}
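A short sketch of passing the generated dictionary to replace, using a slightly modified version of the question's frame (col3 gets a 99 here purely to show that unlisted columns are left alone; the name mapping is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [99, 99, 99],
    "col2": [4, 5, 6],
    "col3": [7, None, 99],
})
col_list = ["col1", "col2"]

# Build {column: {old_value: new_value}} for the user-selected columns only.
mapping = {col: {99: 0} for col in col_list}

# The replacement is applied to df itself, so no slice is involved.
df.replace(to_replace=mapping, inplace=True)

print(df)
```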
You can use replace with loc. Here is a slightly modified version of your sample df:
d = {'col1':[99,99,9],'col2':[99,5,6],'col3':[7,None,99]}
df = pd.DataFrame(data=d)
col_list = ['col1','col2']
df.loc[:, col_list] = df.loc[:, col_list].replace(99,0)
You get
col1 col2 col3
0 0 0 7.0
1 0 5 NaN
2 9 6 99.0
Here is a nice explanation for similar issue.

Count the number of observations that occur per day

I have a pandas dataframe indexed by time. I want to know the total number of observations (i.e. dataframe rows) that happen each day.
Here is my dataframe:
import pandas as pd
data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994', '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592', '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109', '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'],
'value': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data, columns = ['date', 'value'])
print(df)
What I want is a dataframe (or series) that looks like this:
date value
0 2014-05-01 2
1 2014-05-02 3
2 2014-05-03 2
3 2014-05-04 2
After reading a bunch of StackOverflow questions, the closest I can get is:
df['date'].groupby(df.index.map(lambda t: t.day))
But that doesn't produce anything of use.
Use resampling. You'll need the date column to be of datetime data type (as is, the values are strings) and you'll need to set it as the index to use resampling.
In [13]: df['date'] = pd.to_datetime(df['date'])
In [14]: df.set_index('date').resample('D').count()
Out[14]:
value
date
2014-05-01 2
2014-05-02 4
2014-05-03 2
2014-05-04 2
Besides count(), the resampler offers other built-in aggregations such as sum(), and its agg() method accepts an arbitrary function or a function name given as a string, such as 'count' or 'sum'.
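A self-contained sketch of the resampling approach, using a shortened, made-up sample of timestamps in the question's format and reshaping the result into the date/value layout the question asks for (the names daily and result are illustrative):

```python
import pandas as pd

# Shortened sample in the same format as the question's data.
data = {
    "date": [
        "2014-05-01 18:47:05.069722",
        "2014-05-01 18:47:05.119994",
        "2014-05-02 18:47:05.178768",
        "2014-05-02 18:47:05.230071",
        "2014-05-03 18:47:05.332662",
    ],
    "value": [1, 1, 1, 1, 1],
}
df = pd.DataFrame(data)

# Resampling needs real datetimes on the index.
df["date"] = pd.to_datetime(df["date"])
daily = df.set_index("date").resample("D")["value"].count()

# Back to a two-column frame with a default integer index.
result = daily.reset_index()
print(result)
```

Note that resampling also emits a row with a count of 0 for any calendar day inside the range that has no observations.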
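Wow, @Jeff wins (on current pandas the how= argument is gone, so call count() on the resampler instead):
df.set_index('date').resample('D').count()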
My worse answer:
The first problem is that your date column is strings, not datetimes. Using code from this thread:
import dateutil.parser  # the parser submodule has to be imported explicitly
df['date'] = df['date'].apply(dateutil.parser.parse)
Then it's trivial, and you had the right idea:
grouped = df.groupby(df['date'].apply(lambda x: x.date()))
grouped['value'].count()
I know nothing about pandas, but in Python you could do something like:
data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994', '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592', '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109', '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'],
'value': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
import datetime
dates = [datetime.datetime.strptime(ts, '%Y-%m-%d %H:%M:%S.%f').strftime('%Y-%m-%d') for ts in data['date']]
cnt = {}
for d in dates:
    cnt[d] = cnt.get(d, 0) + 1
for i, k in enumerate(sorted(cnt)):
    print("%d %s %d" % (i, k, cnt[k]))
Which would output:
0 2014-05-01 2
1 2014-05-02 4
2 2014-05-03 2
3 2014-05-04 2
If you didn't care about parsing and reformatting your datetime strings, I suppose something like
dates = [d[0:10] for d in data['date']]
could replace the longer dates=... line, but it seems less robust.
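The same tally can be written with collections.Counter from the standard library. This is only a sketch on a shortened, made-up sample, and it uses the string-slicing shortcut, so it assumes every timestamp starts with a YYYY-MM-DD date:

```python
from collections import Counter

# Shortened sample in the same format as the question's data.
timestamps = [
    "2014-05-01 18:47:05.069722",
    "2014-05-01 18:47:05.119994",
    "2014-05-02 18:47:05.178768",
    "2014-05-02 18:47:05.230071",
    "2014-05-02 18:47:05.280592",
    "2014-05-03 18:47:05.332662",
]

# The first 10 characters of each timestamp are its calendar date.
counts = Counter(ts[:10] for ts in timestamps)

for i, day in enumerate(sorted(counts)):
    print(i, day, counts[day])
```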
As exp1orer mentions, you'll need to convert the date strings to a date format. Or, if you simply want to count observations and don't care about the date format, you can take the first 10 characters of the date column and then use the value_counts() method. (Personally, I prefer this to groupby + sum for simple observation counts.)
You can achieve what you need with a one-liner:
In [93]: df.date.str[:10].value_counts()
Out[93]:
2014-05-02 4
2014-05-04 2
2014-05-01 2
2014-05-03 2
dtype: int64
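As the output above shows, value_counts orders the result by frequency rather than by date. A small sketch of restoring chronological order with sort_index, on a made-up sample Series (the names dates and per_day are illustrative):

```python
import pandas as pd

# Made-up sample of timestamp strings in the question's format.
dates = pd.Series(
    [
        "2014-05-02 18:47:05.178768",
        "2014-05-01 18:47:05.069722",
        "2014-05-02 18:47:05.230071",
        "2014-05-03 18:47:05.332662",
        "2014-05-01 18:47:05.119994",
        "2014-05-02 18:47:05.280592",
    ],
    name="date",
)

# Count per calendar day, then sort by the day label instead of by count.
per_day = dates.str[:10].value_counts().sort_index()
print(per_day)
```

Sorting the labels as strings works here only because the YYYY-MM-DD layout sorts the same way lexically and chronologically.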
