Pandas dataframe custom formatting string to time - python

I have a dataframe that looks like this
DEP_TIME
0 1851
1 1146
2 2016
3 1350
4 916
...
607341 554
607342 633
607343 657
607344 705
607345 628
I need to get every value in this column DEP_TIME to have the format hh:mm.
All cells are of type string and can remain that type.
Some cells are only missing the colon (rows 0 to 3), others are also missing the leading 0 (rows 4+).
Some cells are empty and should ideally have string value of 0.
I need an efficient way to do this, since I have a few million records. How do I do it?

Use to_datetime with Series.dt.strftime:
df['DEP_TIME'] = (pd.to_datetime(df['DEP_TIME'], format='%H%M', errors='coerce')
                    .dt.strftime('%H:%M')
                    .fillna('00:00'))
print(df)
DEP_TIME
0 18:51
1 11:46
2 20:16
3 13:50
4 09:16
...
607341 05:54
607342 06:33
607343 06:57
607344 07:05
607345 06:28
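A pure string-based alternative (a sketch, assuming the column really does contain only strings) skips datetime parsing entirely, which can be noticeably faster on a few million rows: pad every value to four digits with zfill, then slice in the colon. Empty cells map to '00:00' here, matching the accepted answer's fill.

```python
import pandas as pd

df = pd.DataFrame({'DEP_TIME': ['1851', '916', '', '554']})

# Treat empty cells as '0', pad to 4 digits, then slice into hh:mm
s = df['DEP_TIME'].replace('', '0').str.zfill(4)
df['DEP_TIME'] = s.str[:2] + ':' + s.str[2:]
print(df['DEP_TIME'].tolist())  # ['18:51', '09:16', '00:00', '05:54']
```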

import re
import numpy as np
import pandas as pd

d = [['1851'],
     ['1146'],
     ['2016'],
     ['916'],
     ['814'],
     [''],
     [np.nan]]
df = pd.DataFrame(d, columns=['DEP_TIME'])
df['DEP_TIME'] = df['DEP_TIME'].fillna('0')
df['DEP_TIME'] = df['DEP_TIME'].apply(
    lambda y: '0' if y == '' else re.sub(r'(\d{1,2})(\d{2})$',
                                         lambda x: x[1].zfill(2) + ':' + x[2], y))
df
df
DEP_TIME
0 18:51
1 11:46
2 20:16
3 09:16
4 08:14
5 0
6 0


Pandas Dataframe - How to transpose one value for the row n to the row n-5 [duplicate]

I would like to shift a column in a Pandas DataFrame, but I haven't found a method in the documentation to do it without rewriting the whole DataFrame. Does anyone know how to do it?
DataFrame:
## x1 x2
##0 206 214
##1 226 234
##2 245 253
##3 265 272
##4 283 291
Desired output:
## x1 x2
##0 206 nan
##1 226 214
##2 245 234
##3 265 253
##4 283 272
##5 nan 291
In [18]: a
Out[18]:
x1 x2
0 0 5
1 1 6
2 2 7
3 3 8
4 4 9
In [19]: a['x2'] = a.x2.shift(1)
In [20]: a
Out[20]:
x1 x2
0 0 NaN
1 1 5
2 2 6
3 3 7
4 4 8
You need to use df.shift here.
df.shift(i) shifts the entire dataframe by i units down.
So, for i = 1:
Input:
x1 x2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Output:
x1 x2
0 NaN NaN
1 206 214
2 226 234
3 245 253
4 265 272
So, run this script to get the expected output:
import pandas as pd

df = pd.DataFrame({'x1': [206, 226, 245, 265, 283],
                   'x2': [214, 234, 253, 272, 291]})
print(df)
df['x2'] = df['x2'].shift(1)
print(df)
Let's define the dataframe from your example by
>>> df = pd.DataFrame([[206, 214], [226, 234], [245, 253], [265, 272], [283, 291]],
columns=[1, 2])
>>> df
1 2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Then you could shift the index of the second column (on a copy, so the original frame is untouched) by
>>> s2 = df[2].copy()
>>> s2.index = s2.index + 1
and finally re-combine the single columns
>>> pd.concat([df[1], s2], axis=1)
1 2
0 206.0 NaN
1 226.0 214.0
2 245.0 234.0
3 265.0 253.0
4 283.0 272.0
5 NaN 291.0
Perhaps not fast but simple to read. Consider setting variables for the column names and the actual shift required.
Edit: In general you can shift with df[2].shift(1), as already posted; however, that would cut off the carried-over value at the end.
If you don't want to lose the columns you shift past the end of your dataframe, simply append the required number first:
offset = 5
DF = DF.append([np.nan for x in range(offset)])
DF = DF.shift(periods=offset)
DF = DF.reset_index()  # only works with a sequential index
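Note that DataFrame.append was removed in pandas 2.0; a sketch of the same pad-then-shift idea using pd.concat instead (column name x2 and an offset of 2 assumed for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x2': [214, 234, 253, 272, 291]})
offset = 2

# Pad with `offset` all-NaN rows, then shift everything down
pad = pd.DataFrame(np.nan, index=range(offset), columns=df.columns)
out = pd.concat([df, pad], ignore_index=True).shift(offset)
print(out['x2'].tolist())  # [nan, nan, 214.0, 234.0, 253.0, 272.0, 291.0]
```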
Assuming the usual imports:
import pandas as pd
import numpy as np
First append new row with NaN, NaN,... at the end of DataFrame (df).
s1 = df.iloc[0].copy()  # copy 1st row to a new Series s1
s1[:] = np.nan          # set all values to NaN
df2 = pd.concat([df, s1.to_frame().T], ignore_index=True)  # add s1 to the end of df (DataFrame.append was removed in pandas 2.0)
This creates a new DataFrame df2. Maybe there is a more elegant way, but this works.
Now you can shift it:
df2.x2 = df2.x2.shift(1) # shift what you want
While trying to solve a similar problem of my own, I found something in the pandas docs that I think answers this question:
DataFrame.shift(periods=1, freq=None, axis=0)
Shift index by desired number of periods with an optional time freq
Notes
If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
Hope this helps future readers with the same question.
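To illustrate that note: with freq given, only the index moves and the values are not realigned (the dates below are made up for the example):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=pd.date_range('2024-01-01', periods=3, freq='D'))

# Shift the index one day forward; the data stays attached to its row
shifted = s.shift(1, freq='D')
print(shifted.index[0])  # 2024-01-02 00:00:00
print(shifted.tolist())  # [1, 2, 3]
```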
df3
1 108.210 108.231
2 108.231 108.156
3 108.156 108.196
4 108.196 108.074
... ... ...
2495 108.351 108.279
2496 108.279 108.669
2497 108.669 108.687
2498 108.687 108.915
2499 108.915 108.852
df3['yo'] = df3['yo'].shift(-1)
yo price
0 108.231 108.210
1 108.156 108.231
2 108.196 108.156
3 108.074 108.196
4 108.104 108.074
... ... ...
2495 108.669 108.279
2496 108.687 108.669
2497 108.915 108.687
2498 108.852 108.915
2499 NaN 108.852
This is how I do it:
df_ext = pd.DataFrame(index=pd.date_range(df.index[-1], periods=8, closed='right'))  # in pandas >= 1.4 use inclusive='right' instead of closed='right'
df2 = pd.concat([df, df_ext], axis=0, sort=True)
df2["forecast"] = df2["some column"].shift(7)
Basically I am generating an empty dataframe with the desired index and then just concatenate them together. But I would really like to see this as a standard feature in pandas so I have proposed an enhancement to pandas.
I'm new to pandas, and I may not be understanding the question, but this solution worked for my problem:
# Shift contents of column 'x2' down 1 row
df['x2'] = df['x2'].shift()
Or, to create a new column with contents of 'x2' shifted down 1 row
# Create new column with contents of 'x2' shifted down 1 row
df['x3'] = df['x2'].shift()
I had a read of the official docs for shift() while trying to figure this out, but it doesn't make much sense to me, and has no examples referencing this specific behavior.
Note that the last row of column 'x2' is effectively pushed off the end of the Dataframe. I expected shift() to have a flag to change this behaviour, but I can't find anything.
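One flag that does exist (though it addresses the vacated rows, not the pushed-off one) is fill_value, which replaces the NaN that shift would otherwise introduce:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.shift(1).tolist())                # [nan, 1.0, 2.0]
print(s.shift(1, fill_value=0).tolist())  # [0, 1, 2]
```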

How to convert the data type from object to numeric and then find the mean for each row in pandas? E.g. convert '<17,500, >=15,000' to 16250 (the mean value)

data['family_income'].value_counts()
>=35,000 2517
<27,500, >=25,000 1227
<30,000, >=27,500 994
<25,000, >=22,500 833
<20,000, >=17,500 683
<12,500, >=10,000 677
<17,500, >=15,000 634
<15,000, >=12,500 629
<22,500, >=20,000 590
<10,000, >= 8,000 563
< 8,000, >= 4,000 402
< 4,000 278
Unknown 128
I want the column to show the mean value instead of a range:
data['family_income']
0 <17,500, >=15,000
1 <27,500, >=25,000
2 <30,000, >=27,500
3 <15,000, >=12,500
4 <30,000, >=27,500
...
10150 <30,000, >=27,500
10151 <25,000, >=22,500
10152 >=35,000
10153 <10,000, >= 8,000
10154 <27,500, >=25,000
Name: family_income, Length: 10155, dtype: object
Output: as mean imputed value
0 16250
1 26250
3 28750
...
10152 35000
10153 9000
10154 26500
data['family_income']=data['family_income'].str.replace(',', ' ').str.replace('<',' ')
data[['income1','income2']] = data['family_income'].apply(lambda x: pd.Series(str(x).split(">=")))
data['income1']=pd.to_numeric(data['income1'], errors='coerce')
data['income1']
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
10150 NaN
10151 NaN
10152 NaN
10153 NaN
10154 NaN
Name: income1, Length: 10155, dtype: float64
In this case, conversion of datatype from object to numeric doesn't seem to work since all the values are returned as NaN. So, how to convert to numeric data type and find mean imputed values?
You can use the following snippet:
# Importing Dependencies
import pandas as pd
import string
# Replicating Your Data
data = ['<17,500, >=15,000', '<27,500, >=25,000', '< 4,000 ', '>=35,000']
df = pd.DataFrame(data, columns = ['family_income'])
# Removing punctuation from family_income column
df['family_income'] = df['family_income'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
# Splitting ranges into two columns A and B (strip first, so '< 4,000 ' doesn't produce an empty field)
df[['A', 'B']] = df['family_income'].str.strip().str.split(' ', n=1, expand=True)
# Converting cols A and B to float
df[['A', 'B']] = df[['A', 'B']].apply(pd.to_numeric)
# Creating mean column from A and B
df['mean'] = df[['A', 'B']].mean(axis=1)
# Input DataFrame
family_income
0 <17,500, >=15,000
1 <27,500, >=25,000
2 < 4,000
3 >=35,000
# Result DataFrame
mean
0 16250.0
1 26250.0
2 4000.0
3 35000.0
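A regex-based alternative (a sketch, with bracket strings assumed as in the question) pulls every number out of each cell and averages per row, which handles both two-bound and single-bound rows; cells with no number at all, like 'Unknown', simply drop out and can be reindexed back in as NaN.

```python
import pandas as pd

s = pd.Series(['<17,500, >=15,000', '< 4,000', '>=35,000', 'Unknown'])

# Extract all numbers per cell, then average them per original row
nums = (s.str.replace(',', '', regex=False)
          .str.extractall(r'(\d+)')[0]
          .astype(float))
means = nums.groupby(level=0).mean()
print(means.tolist())  # [16250.0, 4000.0, 35000.0]
```

Calling `means.reindex(s.index)` afterwards restores a NaN slot for the 'Unknown' row.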

DataError: No numeric types to aggregate when creating plot in loop

I want to draw multiple lines in a lineplot with a loop like this, but it returns DataError: No numeric types to aggregate. Why does it return that error, and how can I fix it?
plt.figure()
cases = pd.DataFrame(data=covid[['date','acc_released','acc_deceased','acc_negative','acc_confirmed']])
for col in cases.columns:
    sns.lineplot(x=cases['date'], y=covid[col], data=cases)
Without loop it will be like this, which is not efficient but works fine
plt.figure()
sns.lineplot(x=covid['date'], y=covid['acc_confirmed'])
sns.lineplot(x=covid['date'], y=covid['acc_deceased'])
sns.lineplot(x=covid['date'], y=covid['acc_negative'])
sns.lineplot(x=covid['date'], y=covid['acc_released'])
plt.xticks(rotation=90)
plt.legend(['acc_confirmed', 'acc_deceased', 'acc_negative', 'acc_released'],
loc='upper left')
plt.title('Numbers of cases')
This is my data
date acc_released acc_deceased acc_negative acc_confirmed
0 2020-03-02 0 0 335 2
1 2020-03-03 0 0 337 2
2 2020-03-04 0 0 356 2
3 2020-03-05 0 0 371 2
4 2020-03-06 0 0 422 4
5 2020-03-07 0 0 422 4
It's supposed to look this way (plot image not shown).
If you set the date as your index you can pass the df to data:
sns.lineplot(data=cases)
To change the index:
df.index = df['Time']
Then you can drop the time column:
df = df.drop(columns=['Time'])
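Putting that together (the data below is a stand-in mirroring the question's columns): once date is the index, every remaining numeric column becomes its own line, and the loop, which tripped over the non-numeric date column, is no longer needed.

```python
import pandas as pd

covid = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-02', '2020-03-03', '2020-03-04']),
    'acc_confirmed': [2, 2, 2],
    'acc_negative': [335, 337, 356],
})

cases = covid.set_index('date')  # date becomes the shared x-axis
# sns.lineplot(data=cases) now draws one line per remaining column
print(list(cases.columns))  # ['acc_confirmed', 'acc_negative']
```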

Python Pandas Fill Dataframe with another DataFrame

I have a dataframe
x = pd.DataFrame(index = ['wkdy','hr'],columns=['c1','c2','c3'])
This leads to 168 rows of data in the dataframe. 7 weekdays and 24 hours in each day.
I have another dataframe
dates = pd.date_range('20090101',periods = 10000, freq = 'H')
y = DataFrame(np.random.randn(10000, 3), index = dates, columns = ['c1','c2','c3'])
y['hr'] = y.index.hour
y['wkdy'] = y.index.weekday
I want to fill 'y' with data from 'x', so that each weekday and hour has the same data but with a datestamp attached to it.
The only way i know is to loop through the dates and fill values. Is there a faster, more efficient way to do this?
My Solution (rather crude to say the least) iterates over the entire dataframe y row by row and tries to fill from dataframe x through a lookup.
for r in range(0, len(y)):
    h = int(y.iloc[r]['hr'])
    w = int(y.iloc[r]['wkdy'])
    y.iloc[r] = x.loc[(w, h)]
Your dataframe x doesn't have 168 rows but looks like
c1 c2 c3
wkdy NaN NaN NaN
hr NaN NaN NaN
and you can't index it using a tuple like in x.loc[(w,h)]. What you probably had in mind was something like
x = pd.DataFrame(
    index=pd.MultiIndex.from_product(
        [range(7), range(24)], names=['wkdy', 'hr']),
    columns=['c1', 'c2', 'c3'],
    data=np.arange(3 * 168).reshape(3, 168).T)
x
c1 c2 c3
wkdy hr
0 0 0 168 336
1 1 169 337
... ... ... ... ...
6 22 166 334 502
23 167 335 503
168 rows × 3 columns
Now your loop will work, although a pythonic representation would look like this:
for idx, row in y.iterrows():
    y.loc[idx, 'c1':'c3'] = x.loc[(row.wkdy, row.hr)]
However, iterating through dataframes is very expensive and you should look for a vectorized solution by simply merging the 2 frames and removing the unwanted columns:
y = (x.merge(y.reset_index(), on=['wkdy', 'hr'])
      .set_index('index')
      .sort_index()
      .iloc[:, :-3])
y
wkdy hr c1_x c2_x c3_x
index
2009-01-01 00:00:00 3 0 72 240 408
2009-01-01 01:00:00 3 1 73 241 409
... ... ... ... ... ...
2010-02-21 14:00:00 6 14 158 326 494
2010-02-21 15:00:00 6 15 159 327 495
10000 rows × 5 columns
Now y is a dataframe with columns c1_x, c2_x, c3_x having data from dataframe x where y.wkdy==x.wkdy and y.hr==x.hr. Merging here is 1000 times faster than looping.
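A self-contained version of the merge (shrunk to 100 timestamps, with x filled by a running counter as above; here y carries only the wkdy/hr keys, so no _x/_y suffixes appear):

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(
    index=pd.MultiIndex.from_product([range(7), range(24)],
                                     names=['wkdy', 'hr']),
    columns=['c1', 'c2', 'c3'],
    data=np.arange(3 * 168).reshape(3, 168).T)

dates = pd.date_range('20090101', periods=100, freq='h')
y = pd.DataFrame({'hr': dates.hour, 'wkdy': dates.weekday}, index=dates)

# Merge on the wkdy/hr keys instead of looping row by row
merged = (x.merge(y.reset_index(), on=['wkdy', 'hr'])
            .set_index('index')
            .sort_index())
print(merged.loc[merged.index[0], 'c1'])  # 72  (2009-01-01 is a Thursday: wkdy=3, hr=0)
```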

Conditional update on two columns on Pandas Dataframe

I have a pandas dataframe where I'm trying to append two column values if the value of the second column is not NaN. Importantly, after appending the two values I need the value from the second column set to NaN. I have managed to concatenate the values but cannot update the second column to NaN.
This is what I start with for ldc_df[['ad_StreetNo', 'ad_StreetNo2']].head(5):
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196 198
4 227 NaN
This is what I currently have after appending:
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196-198 198
4 227 NaN
But here is what I am trying to obtain:
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196-198 NaN
4 227 NaN
Where the value for ldc_df['ad_StreetNo2'].loc[3] should be changed to NaN.
This is the code I am using currently:
def street_check(street_number_one, street_number_two):
    if pd.notnull(street_number_one) and pd.notnull(street_number_two):
        return str(street_number_one) + '-' + str(street_number_two)
    else:
        return street_number_one

ldc_df['ad_StreetNo'] = ldc_df[['ad_StreetNo', 'ad_StreetNo2']].apply(lambda x: street_check(*x), axis=1)
Does anyone have any advice as to how I can obtain my expected output?
# Convert the Street numbers to a string so that you can append the '-' character.
ldc_df['ad_StreetNo'] = ldc_df['ad_StreetNo'].astype(str)
# Create a mask of those addresses having an additional street number.
mask = ldc_df['ad_StreetNo2'].notnull()
# Use the mask to append the additional street number.
ldc_df.loc[mask, 'ad_StreetNo'] += '-' + ldc_df.loc[mask, 'ad_StreetNo2'].astype(str)
# Set the additional street number to NaN.
ldc_df.loc[mask, 'ad_StreetNo2'] = np.nan
Alternative Solution
ldc_df['ad_StreetNo'] = (
    ldc_df['ad_StreetNo'].astype(str)
    + ['' if np.isnan(n) else '-{}'.format(int(n))
       for n in ldc_df['ad_StreetNo2']]
)
ldc_df['ad_StreetNo2'] = np.nan
pd.DataFrame.stack folds a dataframe with a single level column index into a series object. Along the way, it drops any null values by default. We can then group by the previous index levels and join with '-'.
df.stack().astype(str).groupby(level=0).apply('-'.join)
0 284
1 51
2 136
3 196-198
4 227
dtype: object
I then use assign to create a copy of df while overwriting the two columns.
df.assign(
    ad_StreetNo=df.stack().astype(str).groupby(level=0).apply('-'.join),
    ad_StreetNo2=np.nan
)
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196-198 NaN
4 227 NaN
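For reference, the mask approach above, run end-to-end on the sample rows from the question (the int cast strips the trailing '.0' that the float column would otherwise produce):

```python
import numpy as np
import pandas as pd

ldc_df = pd.DataFrame({'ad_StreetNo': [284, 51, 136, 196, 227],
                       'ad_StreetNo2': [np.nan, np.nan, np.nan, 198, np.nan]})

ldc_df['ad_StreetNo'] = ldc_df['ad_StreetNo'].astype(str)
mask = ldc_df['ad_StreetNo2'].notnull()
ldc_df.loc[mask, 'ad_StreetNo'] += '-' + ldc_df.loc[mask, 'ad_StreetNo2'].astype(int).astype(str)
ldc_df.loc[mask, 'ad_StreetNo2'] = np.nan
print(ldc_df['ad_StreetNo'].tolist())  # ['284', '51', '136', '196-198', '227']
```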
