Insert rows to fill gaps in year column in Pandas DataFrame - python

I have the following DataFrame:
import pandas as pd

data = {'id': ['A', 'A', 'B', 'C'],
        'location': ['loc1', 'loc2', 'loc1', 'loc3'],
        'year_data': [2013, 2015, 2014, 2015],
        'c': [10.5, 13.5, 12.3, 9.75]}
data = pd.DataFrame(data)
For each groupby(['id', 'location']) group, I want to insert rows so that year_data runs from the group's minimum year through 2015.
The desired output:
data = {'id': ['A', 'A', 'A', 'A', 'B', 'B', 'C'],
        'location': ['loc1', 'loc1', 'loc1', 'loc2', 'loc1', 'loc1', 'loc3'],
        'year_data': [2013, 2014, 2015, 2015, 2014, 2015, 2015],
        'c': [10.5, 10.5, 10.5, 13.5, 12.3, 12.3, 9.75]}
data = pd.DataFrame(data)

Use a lambda function that takes the minimal year from the index created by DataFrame.set_index and passes it to range for Series.reindex with method='ffill', applied per group:
f = lambda x: x.reindex(range(x.index.min(), 2016), method='ffill')
df = data.set_index('year_data').groupby(['id', 'location'])['c'].apply(f).reset_index()
print(df)
id location year_data c
0 A loc1 2013 10.50
1 A loc1 2014 10.50
2 A loc1 2015 10.50
3 A loc2 2015 13.50
4 B loc1 2014 12.30
5 B loc1 2015 12.30
6 C loc3 2015 9.75
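Putting the pieces together, a minimal runnable version of this approach (note the upper bound 2016 is exclusive, so years run through 2015):

```python
import pandas as pd

data = pd.DataFrame({'id': ['A', 'A', 'B', 'C'],
                     'location': ['loc1', 'loc2', 'loc1', 'loc3'],
                     'year_data': [2013, 2015, 2014, 2015],
                     'c': [10.5, 13.5, 12.3, 9.75]})

# reindex each group's years from its minimum up to 2015, forward-filling c
f = lambda x: x.reindex(range(x.index.min(), 2016), method='ffill')
df = (data.set_index('year_data')
          .groupby(['id', 'location'])['c']
          .apply(f)
          .reset_index())
```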

Related

Sort dates in mm/dd/yy and dd/mm/yy where I know the month they are from

I have a column of date strings; in this case I know the dates all fall between January and February 2020. I want to sort them in ascending order. However, they are in different formats: some in mm/dd/yy, some in dd/mm/yy. How can I sort them?
data = {
    'date': ['1/1/2020', '20/1/2020', '1/1/2020', '1/28/2020',
             '21/1/2020', '1/25/2020', '29/1/2020'],
}
df = pd.DataFrame(data)
print(df)
Edit
Another sample of dates I'd like to be sorted
import pandas as pd
import pandas as pd

data = {'Tgl': {1: '1/1/2023', 2: '1/1/2023', 3: '1/3/2023',
                4: '1/5/2023', 5: '1/5/2023', 6: '1/9/2023',
                7: '10/1/2023', 8: '12/1/2023', 9: '16/1/2023'}}
df = pd.DataFrame(data)
# attempts so far:
df = pd.to_datetime(df['Tgl'])
df = pd.to_datetime(df['Tgl'], dayfirst=True)
In the provided example there is limited ambiguity, as you don't have cases in which a day ≤ 12 differs from the month.
So you can use pd.to_datetime(df['date']) to convert to a clean datetime, or, to sort while keeping the original strings:
df.sort_values(by='date', key=pd.to_datetime)
Output:
date
0 1/1/2020
2 1/1/2020
1 20/1/2020
4 21/1/2020
5 1/25/2020
3 1/28/2020
6 29/1/2020
If you have ambiguous dates (like 1/2/2020) you can choose to give priority to days/months with the dayfirst parameter:
df.sort_values(by='date', key=lambda x: pd.to_datetime(x, dayfirst=True))
Example:
date
2 2/1/2020 # Jan 2nd
1 20/1/2020
4 21/1/2020
5 1/25/2020
3 1/28/2020
6 29/1/2020
0 1/2/2020 # Feb 1st
custom logic
Let's assume the first number is the day, unless the day-first parse yields a month greater than 2 (impossible here, since all dates fall in January–February), in which case we treat the first number as the month.
def custom_date(s):
    return (pd.to_datetime(s, dayfirst=True)
              .mask(lambda x: x.dt.month > 2,
                    pd.to_datetime(s, dayfirst=False)))
df.sort_values(by='date', key=custom_date)
Output (with an additional column to see the result of the custom conversion):
date converted
2 2/1/2020 2020-01-02
7 10/1/2020 2020-01-10 # both converted
8 1/10/2020 2020-01-10 # to Jan 10
1 20/1/2020 2020-01-20
4 21/1/2020 2020-01-21
5 1/25/2020 2020-01-25
3 1/28/2020 2020-01-28
6 29/1/2020 2020-01-29
0 1/2/2020 2020-02-01

Change a column that contains date and time into two columns containing date and time separately

I have a column in a dataset that contains both date and time, and my goal is to split it into two separate columns, one with the date and one with the time.
Example:
Name Dataset: A
Starting
Name column: Cat
12/01/2021 20:15:06
02/01/2021 12:15:07
01/01/2021 15:05:03
01/01/2021 15:05:03
Goal
Name column: Cat1
12/01/2021
02/01/2021
01/01/2021
01/01/2021
Name Column: Cat2
20:15:06
12:15:07
15:05:03
15:05:03
I assume that you're using pandas and that you want to keep the new columns in the same dataframe.
# df = A (?)
df['Cat1'] = [d.date() for d in df['Cat']]
df['Cat2'] = [d.time() for d in df['Cat']]
Working example:
import pandas as pd
from datetime import datetime, timedelta
df = pd.DataFrame.from_dict(
    {'A': [1, 2, 3],
     'B': [4, 5, 6],
     'Datetime': [datetime.strftime(datetime.now() - timedelta(days=_),
                                    "%m/%d/%Y, %H:%M:%S") for _ in range(3)]},
    orient='index',
    columns=['A', 'B', 'C']).T
df['Datetime'] = pd.to_datetime(df['Datetime'], format="%m/%d/%Y, %H:%M:%S")
# A B Datetime
# A 1 4 2021-03-05 14:07:59
# B 2 5 2021-03-04 14:07:59
# C 3 6 2021-03-03 14:07:59
df['Cat1'] = [d.date() for d in df['Datetime']]
df['Cat2'] = [d.time() for d in df['Datetime']]
# A B Datetime Cat1 Cat2
# A 1 4 2021-03-05 14:07:59 2021-03-05 14:07:59
# B 2 5 2021-03-04 14:07:59 2021-03-04 14:07:59
# C 3 6 2021-03-03 14:07:59 2021-03-03 14:07:59
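On a datetime64 column, the same split can also be done with the vectorized .dt accessor instead of a list comprehension. A small sketch, assuming the sample strings are day-first (hence the explicit format):

```python
import pandas as pd

df = pd.DataFrame({'Cat': pd.to_datetime(
    ['12/01/2021 20:15:06', '02/01/2021 12:15:07',
     '01/01/2021 15:05:03', '01/01/2021 15:05:03'],
    format='%d/%m/%Y %H:%M:%S')})

df['Cat1'] = df['Cat'].dt.date   # datetime.date objects
df['Cat2'] = df['Cat'].dt.time   # datetime.time objects
```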

pandas time-series data preprocessing

I have a dataframe that looks like this:
> dt
text timestamp
0 a 2016-06-13 18:00
1 b 2016-06-20 14:08
2 c 2016-07-01 07:41
3 d 2016-07-11 19:07
4 e 2016-08-01 16:00
And I want to summarise every month's data like:
> dt_month
count timestamp
0 2 2016-06
1 2 2016-07
2 1 2016-08
The original dataset (dt) can be generated by:
import pandas as pd

data = {'text': ['a', 'b', 'c', 'd', 'e'],
        'timestamp': ['2016-06-13 18:00', '2016-06-20 14:08', '2016-07-01 07:41',
                      '2016-07-11 19:07', '2016-08-01 16:00']}
dt = pd.DataFrame(data)
And is there any way to plot a time-frequency plot from dt_month?
You can groupby the timestamp column converted with to_period and aggregate size (the sample timestamps are strings, so convert them with to_datetime first):
dt['timestamp'] = pd.to_datetime(dt['timestamp'])
print(dt.text.groupby(dt.timestamp.dt.to_period('m'))
        .size()
        .rename('count')
        .reset_index())
timestamp count
0 2016-06 2
1 2016-07 2
2 2016-08 1
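End to end, including the string-to-datetime conversion the sample data needs; the last line sketches the frequency plot the question asks about, left commented out since it requires matplotlib:

```python
import pandas as pd

dt = pd.DataFrame({'text': ['a', 'b', 'c', 'd', 'e'],
                   'timestamp': ['2016-06-13 18:00', '2016-06-20 14:08',
                                 '2016-07-01 07:41', '2016-07-11 19:07',
                                 '2016-08-01 16:00']})
dt['timestamp'] = pd.to_datetime(dt['timestamp'])

# count rows per calendar month
dt_month = (dt['text'].groupby(dt['timestamp'].dt.to_period('M'))
                      .size()
                      .rename('count')
                      .reset_index())

# dt_month.set_index('timestamp')['count'].plot(kind='bar')  # needs matplotlib
```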

How to store a list of Pandas data frames for easy access

I have a list of data frames:
df1 =
Stock Year Profit CountPercent
AAPL 2012 1 38.77
AAPL 2013 1 33.33
df2 =
Stock Year Profit CountPercent
GOOG 2012 1 43.47
GOOG 2013 1 32.35
df3 =
Stock Year Profit CountPercent
ABC 2012 1 40.00
ABC 2013 1 32.35
The output of a function is a list like [df1, df2, df3, ...].
All the data frames have the same columns, but the rows differ.
How can I store these on disk and retrieve them as a list again in the fastest and most efficient way?
If the values in column Stock are the same within each frame, you can drop this column with iloc and use a dict comprehension (the key is the first value of column Stock in each df):
dfs = {df['Stock'].iat[0]: df.iloc[:, 1:] for df in [df1, df2, df3]}
print (dfs['AAPL'])
Year Profit CountPercent
0 2012 1 38.77
1 2013 1 33.33
print (dfs['ABC'])
Year Profit CountPercent
0 2012 1 40.00
1 2013 1 32.35
print (dfs['GOOG'])
Year Profit CountPercent
0 2012 1 43.47
1 2013 1 32.35
For storing on disk I think the best option is HDF5 (PyTables).
If the values in each Stock column are the same within a frame, you can concat all the frames and then store the result:
df = pd.concat([df1.set_index('Stock'), df2.set_index('Stock'), df3.set_index('Stock')])
print (df)
Year Profit CountPercent
Stock
AAPL 2012 1 38.77
AAPL 2013 1 33.33
GOOG 2012 1 43.47
GOOG 2013 1 32.35
ABC 2012 1 40.00
ABC 2013 1 32.35
store = pd.HDFStore('store.h5')
store['df'] = df
print (store)
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df frame (shape->[1,4])
I think if all your DFs have the same shape, then it would be more natural to store your data as a pandas.Panel instead of a list of DFs; this is how pandas_datareader works. (Note that Panel was deprecated in pandas 0.20 and removed in 0.25, so this only runs on older versions.)
import io
import pandas as pd
df1 = pd.read_csv(io.StringIO("""
Stock,Year,Profit,CountPercent
AAPL,2012,1,38.77
AAPL,2013,1,33.33
"""
))
df2 = pd.read_csv(io.StringIO("""
Stock,Year,Profit,CountPercent
GOOG,2012,1,43.47
GOOG,2013,1,32.35
"""
))
df3 = pd.read_csv(io.StringIO("""
Stock,Year,Profit,CountPercent
ABC,2012,1,40.0
ABC,2013,1,32.35
"""
))
store = pd.HDFStore('c:/temp/stocks.h5')
# I had to drop the `Stock` column and make it the Panel axis, because of this error
# when saving a Panel to HDFStore:
# TypeError: Cannot serialize the column [%s] because its data contents are [mixed-integer] object dtype
p = pd.Panel({df.iat[0, 0]: df.drop('Stock', axis=1) for df in [df1, df2, df3]})
store.append('stocks', p, data_columns=True, mode='w')
store.close()
# read panel from HDFStore
store = pd.HDFStore('c:/temp/stocks.h5')
p = store.select('stocks')
Store:
In [18]: store
Out[18]:
<class 'pandas.io.pytables.HDFStore'>
File path: c:/temp/stocks.h5
/stocks wide_table (typ->appendable,nrows->6,ncols->3,indexers->[major_axis,minor_axis],dc->[AAPL,ABC,GOOG])
Panel dimensions:
In [19]: p['AAPL']
Out[19]:
Year Profit CountPercent
0 2012.0 1.0 38.77
1 2013.0 1.0 33.33
In [20]: p[:, :, 'Profit']
Out[20]:
AAPL ABC GOOG
0 1.0 1.0 1.0
1 1.0 1.0 1.0
In [21]: p[:, 0]
Out[21]:
AAPL ABC GOOG
Year 2012.00 2012.0 2012.00
Profit 1.00 1.0 1.00
CountPercent 38.77 40.0 43.47
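Since Panel no longer exists in current pandas, here is a sketch of the same idea using a MultiIndex DataFrame instead; to_pickle is used only to keep the example dependency-free (to_hdf needs the optional pytables package):

```python
import os
import tempfile
import pandas as pd

df1 = pd.DataFrame({'Stock': ['AAPL', 'AAPL'], 'Year': [2012, 2013],
                    'Profit': [1, 1], 'CountPercent': [38.77, 33.33]})
df2 = pd.DataFrame({'Stock': ['GOOG', 'GOOG'], 'Year': [2012, 2013],
                    'Profit': [1, 1], 'CountPercent': [43.47, 32.35]})

# one frame keyed by symbol, replacing the Panel's item axis
combined = pd.concat({d['Stock'].iat[0]: d.drop(columns='Stock')
                      for d in [df1, df2]})

path = os.path.join(tempfile.mkdtemp(), 'stocks.pkl')
combined.to_pickle(path)              # write to disk
restored = pd.read_pickle(path)       # read back
aapl = restored.xs('AAPL')            # recover one stock's frame
```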

Pandas Python- can datetime be used with vectorized inputs

My pandas dataframe has year, month and day in the first 3 columns. To convert them into a datetime type, I use a for loop that loops over each row, taking the contents of the first 3 columns of each row as inputs to the datetime function. Is there any way I can avoid the for loop here and get the dates as a datetime?
I'm not sure there's a vectorized hook, but you can use apply, anyhow:
>>> df = pd.DataFrame({"year": [1992, 2003, 2014], "month": [2,3,4], "day": [10,20,30]})
>>> df
day month year
0 10 2 1992
1 20 3 2003
2 30 4 2014
>>> import datetime
>>> df["Date"] = df.apply(lambda x: datetime.datetime(x['year'], x['month'], x['day']), axis=1)
>>> df
day month year Date
0 10 2 1992 1992-02-10 00:00:00
1 20 3 2003 2003-03-20 00:00:00
2 30 4 2014 2014-04-30 00:00:00
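In later pandas versions there is in fact a vectorized hook: to_datetime accepts a DataFrame whose columns are named year, month and day (plus optional hour, minute, and so on):

```python
import pandas as pd

df = pd.DataFrame({"year": [1992, 2003, 2014],
                   "month": [2, 3, 4],
                   "day": [10, 20, 30]})

# assembles one datetime64 column from the three component columns
df["Date"] = pd.to_datetime(df[["year", "month", "day"]])
```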
