How to add rows with no values for some columns - python

I am using Python 3.6.4 and pandas 0.23.0. I have checked the pandas 0.23.0 documentation for the DataFrame constructor and append; it does not mention anything about handling non-existent values, and I didn't find any similar example.
Consider following code:
import pandas as pd
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
index_yrs = [2016, 2017, 2018]
r2016 = [26, 27, 25, 22, 20, 23, 22, 20, 20, 18, 18, 19]
r2017 = [20, 21, 18, 16, 15, 15, 15, 15, 13, 13, 14, 15]
r2018 = [16, 18, 18, 18, 17]
df = pd.DataFrame([r2016], columns = months, index = [index_yrs[0]])
df = df.append(pd.DataFrame([r2017], columns = months, index = [index_yrs[1]]))
Now, how do I add r2018, which only has data through the month of May?

I agree with RafaelC that padding your 2018 list with NaNs for the missing values is the best way to do this. You can use np.nan from NumPy (which you will already have installed, since you have pandas) to generate the NaNs.
import pandas as pd
import numpy as np
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
index_yrs = [2016, 2017, 2018]
As a small change to your code, I've put the data for all three years into a years list, which we can pass as the data parameter of pd.DataFrame. This eliminates the need to append each row to the previous ones.
r2016 = [26, 27, 25, 22, 20, 23, 22, 20, 20, 18, 18, 19]
r2017 = [20, 21, 18, 16, 15, 15, 15, 15, 13, 13, 14, 15]
r2018 = [16, 18, 18, 18, 17]
years = [r2016] + [r2017] + [r2018]
This is what years looks like: [[26, 27, 25, 22, 20, 23, 22, 20, 20, 18, 18, 19],
[20, 21, 18, 16, 15, 15, 15, 15, 13, 13, 14, 15],
[16, 18, 18, 18, 17]].
As for padding your 2018 data with NaNs, something like this should do the trick. We are just ensuring that if a year only has values for the first n months, the remaining months are filled with NaNs.
for year in years:
    if len(year) < 12:
        year.extend([np.nan] * (12 - len(year)))
Finally, we can create your DataFrame using the one-liner below instead of appending row by row.
df = pd.DataFrame(years, columns=months, index=index_yrs).astype(float)
Output:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2016 26.0 27.0 25.0 22.0 20.0 23.0 22.0 20.0 20.0 18.0 18.0 19.0
2017 20.0 21.0 18.0 16.0 15.0 15.0 15.0 15.0 13.0 13.0 14.0 15.0
2018 16.0 18.0 18.0 18.0 17.0 NaN NaN NaN NaN NaN NaN NaN
You may notice that I converted the values in the DataFrame to float using .astype(float). I did this so that all of your columns have the same dtype. If we don't call .astype(float), then Jan-May will be dtype int64 and Jun-Dec will be dtype float64 (because NaN is a float).
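If you want to verify this yourself, here is a quick (hypothetical) check of the dtypes with and without the conversion, using the padded years list from above:
df_mixed = pd.DataFrame(years, columns=months, index=index_yrs)
print(df_mixed.dtypes)                # Jan-May: int64, Jun-Dec: float64 (NaN forces float)
print(df_mixed.astype(float).dtypes)  # all columns float64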

You can add a row with pd.DataFrame.loc via a Series, so you only need to convert your list into a pd.Series object before adding the row:
df.loc[index_yrs[2]] = pd.Series(r2018, index=df.columns[:len(r2018)])
print(df)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2016 26.0 27.0 25.0 22.0 20.0 23.0 22.0 20.0 20.0 18.0 18.0 19.0
2017 20.0 21.0 18.0 16.0 15.0 15.0 15.0 15.0 13.0 13.0 14.0 15.0
2018 16.0 18.0 18.0 18.0 17.0 NaN NaN NaN NaN NaN NaN NaN
However, I strongly recommend you form a list of lists (with padding) before a single append. This is because list.append, or construction via a list comprehension, is cheap relative to repeated pd.DataFrame.append or pd.DataFrame.loc.
The above solution is recommended if you absolutely must add one row at a time.
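For reference, a minimal sketch of that batched construction, assuming the months, index_yrs, and r20xx lists from the question and NumPy imported as np:
import numpy as np
rows = [r2016, r2017, r2018 + [np.nan] * (12 - len(r2018))]  # pad the short year
df = pd.DataFrame(rows, columns=months, index=index_yrs)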

Related

Multiple date formats to a single date pattern in pandas dataframe

I have a pandas date column in the following format
Date
0 March 13, 2020, March 13, 2020
1 3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021
2 NaN
3 May 20, 2022, May 21, 2022
I tried to convert the dates to a single pattern and store the result in a new column.
import pandas as pd
import dateutil.parser
# initialise data of lists.
data = {'Date':['March 13, 2020, March 13, 2020', '3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021', 'NaN','May 20, 2022, May 21, 2022']}
# Create DataFrame
df = pd.DataFrame(data)
df["FormattedDate"] = df.Date.apply(lambda x: dateutil.parser.parse(x.strftime("%Y-%m-%d") ))
But I am getting an error:
AttributeError: 'str' object has no attribute 'strftime'
Desired Output
Date DateFormatted
0 March 13, 2020, March 13, 2020 2020-03-13, 2020-03-13
1 3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021 2021-03-09, 2021-03-09, 2021-03-09, 2021-09-03
2 NaN NaN
3 May 20, 2022, May 21, 2022 2022-05-20, 2022-05-21
The AttributeError occurs because strftime is called on the raw string before it has been parsed. I was the author of the previous solution, so a possible approach is to change it as well, so that it copes with , being both the separator between dates and a character inside the date strings themselves: Series.str.extractall pulls out each date, the matches are converted to datetimes, and finally the results are aggregated with join:
format_list = ["[0-9]{1,2}(?:\,|\.|\/|\-)(?:\s)?[0-9]{1,2}(?:\,|\.|\/|\-)(?:\s)?[0-9]{2,4}",
"[0-9]{1,2}(?:\.)(?:\s)?(?:(?:(?:j|J)a)|(?:(?:f|F)e)|(?:(?:m|M)a)|(?:(?:a|A)p)|(?:(?:m|M)a)|(?:(?:j|J)u)|(?:(?:a|A)u)|(?:(?:s|S)e)|(?:(?:o|O)c)|(?:(?:n|N)o)|(?:(?:d|D)e))\w*(?:\s)?[0-9]{2,4}",
"(?:(?:(?:j|J)an)|(?:(?:f|F)eb)|(?:(?:m|M)ar)|(?:(?:a|A)pr)|(?:(?:m|M)ay)|(?:(?:j|J)un)|(?:(?:j|J)ul)|(?:(?:a|A)ug)|(?:(?:s|S)ep)|(?:(?:o|O)ct)|(?:(?:n|N)ov)|(?:(?:d|D)ec))\w*(?:\s)?(?:\n)?[0-9]{1,2}(?:\s)?(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}",
"[0-9]{1,2}(?:\.)?(?:\s)?(?:\n)?(?:(?:(?:j|J)a)|(?:(?:f|F)e)|(?:(?:m|M)a)|(?:(?:a|A)p)|(?:(?:m|M)a)|(?:(?:j|J)u)|(?:(?:a|A)u)|(?:(?:s|S)e)|(?:(?:o|O)c)|(?:(?:n|N)o)|(?:(?:d|D)e))\w*(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}"]
import pandas as pd
import numpy as np
import dateutil.parser

# initialise data of lists.
data = {'Name':['Today is 09 September 2021', np.nan, '25 December 2021 is christmas', '01/01/2022 is newyear and will be holiday on 02.01.2022 also']}
# Create DataFrame
df = pd.DataFrame(data)
f = lambda x: dateutil.parser.parse(x).strftime("%Y-%m-%d")
df['DateFormatted'] = df['Name'].str.extractall(f'({"|".join(format_list)})')[0].apply(f).groupby(level=0).agg(','.join)
print (df)
Name DateFormatted
0 Today is 09 September 2021 2021-09-09
1 NaN NaN
2 25 December 2021 is christmas 2021-12-25
3 01/01/2022 is newyear and will be holiday on 0... 2022-01-01,2022-02-01
Another alternative is to process the lists of matches, after removing missing values, in a generator comprehension with join:
import dateutil.parser
f = lambda x: dateutil.parser.parse(x).strftime("%Y-%m-%d")
df['Date'] = df['Name'].str.findall("|".join(format_list)).dropna().apply(lambda y: ','.join(f(x) for x in y))
print (df)
Name Date
0 Today is 09 September 2021 2021-09-09
1 NaN NaN
2 25 December 2021 is christmas 2021-12-25
3 01/01/2022 is newyear and will be holiday on 0... 2022-01-01,2022-02-01
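As a sketch (not verified against every possible format), the same pipeline could be applied directly to the question's Date column; format_list is the regex list defined above, and the missing row (np.nan here) simply produces no matches and stays NaN:
import pandas as pd
import numpy as np
import dateutil.parser
f = lambda x: dateutil.parser.parse(x).strftime("%Y-%m-%d")
df = pd.DataFrame({'Date': ['March 13, 2020, March 13, 2020',
                            '3.9.2021, 3.9.2021, 03.09.2021, 3. September 2021',
                            np.nan,
                            'May 20, 2022, May 21, 2022']})
df['DateFormatted'] = (df['Date'].str.extractall(f'({"|".join(format_list)})')[0]
                                 .apply(f)
                                 .groupby(level=0)
                                 .agg(', '.join))
print(df)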

How to calculate the difference of timestamps based on a multi level index?

I've got the following pandas.DataFrame and would like to calculate a new column containing the timedelta between consecutive timestamps in the multi-index level Timestamp:
import pandas as pd
import numpy as np
data = {'Timestamp': [12, 12, 12, 22, 22, 22, 44, 44, 66, 102],
        'Customer': ['bmw', 'vw', 'vw', 'bmw', 'vw', 'vw', 'vw', 'vw', 'bmw', 'bmw'],
        'Series': ['series1', 'series1', 'series2', 'series1', 'series1', 'series2', 'series1', 'series2', 'series2', 'series1'],
        'time_delta': [np.nan, np.nan, np.nan, 10, 10, 10, 22, 22, 22, 36]}
df = pd.DataFrame(data).set_index(['Timestamp', 'Customer', 'Series'])
The column time_delta is the desired output I would like to achieve. I am somewhat stuck since I cannot use the pandas.Series.diff() function, as the periods are not consistent. I want to do the timestamp delta calculation on the Timestamp level of the dataframe, but broadcast the result to all rows of that level. So for the first Timestamp level value 12 there is no preceding timestamp, thus all rows for this timestamp are filled with np.nan. For the next timestamp 22, I can take the delta to 12 (which is 10) and fill it for all rows of timestamp 22.
Let's try extracting the level values and calculating the difference from there:
df['time_delta'] = df.index.get_level_values('Timestamp')
s = df['time_delta'].diff()
df['time_delta'] = s.where(s>0).ffill()
Output:
time_delta
Timestamp Customer Series
12 bmw series1 NaN
vw series1 NaN
series2 NaN
22 bmw series1 10.0
vw series1 10.0
series2 10.0
44 vw series1 22.0
series2 22.0
66 bmw series2 22.0
102 bmw series1 36.0
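An equivalent sketch that computes the diff once per unique Timestamp and maps it back onto the rows; it assumes the Timestamp level values appear in increasing order, as in the example:
ts = df.index.get_level_values('Timestamp')
unique_ts = ts.unique()
delta_by_ts = pd.Series(unique_ts, index=unique_ts).diff()  # one delta per unique timestamp
df['time_delta'] = ts.map(delta_by_ts)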

interpolate pandas frame using time index from another data frame

So, I have 2 data frames where the first one has the following structure:
'ds', '1_sensor_id', '1_val_1', '1_val_2'
0 2019-09-13 12:40:00 33469 30 43
1 2019-09-13 12:45:00 33469 43 43
The second one has the following structure:
'ds', '2_sensor_id', '2_val_1', '2_val_2'
0 2019-09-13 12:42:00 20006 6 50
1 2019-09-13 12:47:00 20006 5 80
What I want to do is merge the two frames through interpolation: the merged frame should have values defined at the timestamps (ds) of frame 1, with the 2_val_1 and 2_val_2 columns interpolated, so that there is a row for each value in the ds column of frame 1. What would be the best way to do this in pandas? I tried the merge_asof function, but that does nearest-neighbour matching rather than interpolation, and I did not get all the timestamps back.
You can append one frame to the other and use interpolate(). For example:
import datetime
import pandas as pd
df1 = pd.DataFrame(columns=['ds', '1_sensor_id', '1_val_1', '1_val_2'],
                   data=[[datetime.datetime(2019, 9, 13, 12, 40, 00), 33469, 30, 43],
                         [datetime.datetime(2019, 9, 13, 12, 45, 00), 33469, 33, 43]])
df2 = pd.DataFrame(columns=['ds', '2_sensor_id', '2_val_1', '2_val_2'],
                   data=[[datetime.datetime(2019, 9, 13, 12, 42, 00), 20006, 6, 50],
                         [datetime.datetime(2019, 9, 13, 12, 47, 00), 20006, 5, 80]])
df = df1.append(df2, sort=False)
df.set_index('ds', inplace=True)
df.interpolate(method = 'time', limit_direction='backward', inplace=True)
print(df)
1_sensor_id 1_val_1 ... 2_val_1 2_val_2
ds ...
2019-09-13 12:40:00 33469.0 30.0 ... 6.0 50.0
2019-09-13 12:45:00 33469.0 33.0 ... 5.4 68.0
2019-09-13 12:42:00 NaN NaN ... 6.0 50.0
2019-09-13 12:47:00 NaN NaN ... 5.0 80.0
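If you only want rows at frame 1's timestamps, as the question asks, a sketch along the same lines (untested, assuming the df1 and df2 defined above) would sort the combined index before interpolating and then select frame 1's timestamps back out:
merged = pd.concat([df1.set_index('ds'), df2.set_index('ds')], sort=False).sort_index()
merged = merged.interpolate(method='time', limit_direction='backward')
result = merged.loc[df1['ds']]  # keep only the timestamps from frame 1
print(result)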

How to find the row index in pandas column?

I am very new to pandas and am trying to get the row index for any value higher than lprice. Can someone give me a quick idea of what I am doing wrong?
Dataframe
StrikePrice
0 40.00
1 50.00
2 60.00
3 70.00
4 80.00
5 90.00
6 100.00
7 110.00
8 120.00
9 130.00
10 140.00
11 150.00
12 160.00
13 170.00
14 180.00
15 190.00
16 200.00
17 210.00
18 220.00
19 230.00
20 240.00
Now I am trying to figure out how to get the row index for any value which is higher than lprice:
lprice = 99
for strike in df['StrikePrice']:
    strike = float(strike)
    # print(strike)
    if strike >= lprice:
        print('The high strike is:' + str(strike))
        ce_1 = strike
print(df.index['StrikePrice' == ce_1])
The above gives 0 as the index
I am not sure what I am doing wrong here.
Using the index attribute after boolean slicing.
lprice = 99
df[df.StrikePrice >= lprice].index
Int64Index([6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype='int64')
If you insist on iterating and finding when you've found it, you can modify your code:
lprice = 99
for idx, strike in df['StrikePrice'].iteritems():
    strike = float(strike)
    # print(strike)
    if strike >= lprice:
        print('The high strike is:' + str(strike))
        ce_1 = strike
        print(idx)
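If you only need the first strike at or above lprice, here is a small sketch using idxmax on the boolean mask; note it assumes at least one value satisfies the condition, otherwise idxmax simply returns the first label:
lprice = 99
mask = df['StrikePrice'] >= lprice
first_idx = mask.idxmax()  # index label of the first True
print(first_idx, df.loc[first_idx, 'StrikePrice'])  # 6 100.0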
I think the best approach is to filter the index with boolean indexing:
a = df.index[df['StrikePrice'] >= 99]
#alternative
#a = df.index[df['StrikePrice'].ge(99)]
Your code could be changed to something similar:
lprice = 99
for strike in df['StrikePrice']:
    if strike >= lprice:
        print('The high strike is:' + str(strike))
        print(df.index[df['StrikePrice'] == strike])
numpy.where(condition[, x, y]) does exactly this if we specify only the condition: np.where() then returns the tuple condition.nonzero(), i.e. the indices where the condition is True.
In [36]: np.where(df.StrikePrice >= lprice)[0]
Out[36]: array([ 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype=int64)
PS: thanks @jezrael for the hint -- np.where() returns numerical (positional) indices instead of DataFrame index values:
In [41]: df = pd.DataFrame({'val':np.random.rand(10)}, index=pd.date_range('2018-01-01', freq='9999S', periods=10))
In [42]: df
Out[42]:
val
2018-01-01 00:00:00 0.459097
2018-01-01 02:46:39 0.148380
2018-01-01 05:33:18 0.945564
2018-01-01 08:19:57 0.105181
2018-01-01 11:06:36 0.570019
2018-01-01 13:53:15 0.203373
2018-01-01 16:39:54 0.021001
2018-01-01 19:26:33 0.717460
2018-01-01 22:13:12 0.370547
2018-01-02 00:59:51 0.462997
In [43]: np.where(df['val']>0.5)[0]
Out[43]: array([2, 4, 7], dtype=int64)
workaround:
In [44]: df.index[np.where(df['val']>0.5)[0]]
Out[44]: DatetimeIndex(['2018-01-01 05:33:18', '2018-01-01 11:06:36', '2018-01-01 19:26:33'], dtype='datetime64[ns]', freq=None)

Create DataFrame from list of Dicts - Where values are lists themselves [duplicate]

This question already has answers here:
Pandas column of lists, create a row for each list element
(10 answers)
Closed 5 years ago.
Hi, I want to create a DataFrame from a list of dicts where the values are lists. When the values are scalars (see test below), the call to pd.DataFrame works as expected:
test = [{'points': 40, 'time': '5:00', 'year': 2010},
        {'points': 25, 'time': '6:00', 'month': "february"},
        {'points': 90, 'time': '9:00', 'month': 'january'},
        {'points_h1': 20, 'month': 'june'}]
pd.DataFrame(test)
month points points_h1 time year
0 NaN 40.0 NaN 5:00 2010.0
1 february 25.0 NaN 6:00 NaN
2 january 90.0 NaN 9:00 NaN
3 june NaN 20.0 NaN NaN
However, if the items are lists themselves, I get what seems to be an unexpected result:
test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]},
        {'points': [25], 'time': ['6:00'], 'month': ["february"]},
        {'points': [90], 'time': ['9:00'], 'month': ['january']},
        {'points_h1': [20], 'month': ['june']}]
pd.DataFrame(test)
month points points_h1 time year
0 NaN [40, 50] NaN [5:00, 4:00] [2010, 2011]
1 february 25 NaN 6:00 NaN
2 january 90 NaN 9:00 NaN
3 june NaN 20.0 NaN NaN
To solve this, I use: pd.concat([pd.DataFrame(z) for z in test]), but this is relatively slow because you have to create a new dataframe for each element in the list, which requires significant overhead. Am I missing something?
Although this is possible within pandas itself, it appears to be easier using plain Python, at least if you have the raw data.
import pandas as pd
test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]}, {'points': [25], 'time': ['6:00'], 'month': ["february"]}, {'points':[90], 'time': ['9:00'], 'month': ['january']}, {'points_h1': [20], 'month': ['june']}]
newtest = []
for t in test:
    newtest.extend([{k: v for (k, v) in zip(t.keys(), values)} for values in zip(*t.values())])
df = pd.DataFrame(newtest)
print (df)
Result:
month points points_h1 time year
0 NaN 40.0 NaN 5:00 2010.0
1 NaN 50.0 NaN 4:00 2011.0
2 february 25.0 NaN 6:00 NaN
3 january 90.0 NaN 9:00 NaN
4 june NaN 20.0 NaN NaN
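The same reshaping can also be written as a single list comprehension, a sketch equivalent to the loop above:
rows = [dict(zip(t.keys(), values)) for t in test for values in zip(*t.values())]
df = pd.DataFrame(rows)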
With pandas it is possible to use a combination of methods to get your data into shape, but as you found out it can get quite heavy. My recommendation is to pad your data before passing it on to pandas:
import pandas as pd
test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]},
        {'month': ['february'], 'points': [25], 'time': ['6:00']},
        {'month': ['january'], 'points': [90], 'time': ['9:00']},
        {'month': ['june'], 'points_h1': [20]}]
def pad_data(data):
    # Set a dictionary with all the keys
    result = {k: [] for i in data for k in i.keys()}
    for i in data:
        # Determine the longest value as padding for NaNs
        pad = max([len(j) for j in i.values()])
        # Create padding dictionary and update current
        padded = {key: [pd.np.nan]*pad for key in result.keys() if key not in i.keys()}
        i.update(padded)
        # Finally extend to result dictionary
        for key, val in i.items():
            result[key].extend(val)
    return result
# Padded data looks like this:
#
# {'month': [nan, nan, 'february', 'january', 'june'],
# 'points': [40, 50, 25, 90, nan],
# 'points_h1': [nan, nan, nan, nan, 20],
# 'time': ['5:00', '4:00', '6:00', '9:00', nan],
# 'year': [2010, 2011, nan, nan, nan]}
df = pd.DataFrame(pad_data(test), dtype='O')
print(df)
# month points points_h1 time year
# 0 NaN 40 NaN 5:00 2010
# 1 NaN 50 NaN 4:00 2011
# 2 february 25 NaN 6:00 NaN
# 3 january 90 NaN 9:00 NaN
# 4 june NaN 20 NaN NaN
