multiple columns to single datetime dataframe column - python

I have a data frame that contains (among others) columns for the time of day (00:00-23:59:59), day (1-7), month (1-12), and year (2000-2019). How can I combine the values of each of these columns, row by row, into a new DateTime object, and then store these new DateTimes in a new column? I've read other posts pertaining to such a task, but they all seem to involve converting one date column into one DateTime column, whereas I have 4 columns that need to be combined into a DateTime. Any help is appreciated!
e.g.
| 4:30:59 | 1 | 1 | 2000 | TO 2000/1/1 4:30:59
This is the only code I have so far, which probably doesn't do anything:
# creating datetime object (MISC)
import pandas as pd

data = pd.read_csv('road_accidents_data_clean.csv', delimiter=',')
df = pd.DataFrame(data)
fmt = '%Y-%m-%d %H:%M:%S'  # unused so far ('format' would shadow the builtin)
n = 0
# note: this assigns a tuple of the row-n scalars to every row, not a datetime
df['datetime'] = data.loc[n, 'Crash_Day'], data.loc[n, 'Crash_Month'], data.loc[n, 'Year']
My DataFrame is laid out as follows. NOTE: the dataframe is 13 columns wide; I couldn't fit them all on one line, so Crash_Type continues to the right of Gender:

Index | Age | Year | Crash_Month | Crash_Day | Crash_Time | Road_User  | Gender |
0     | 37  | 2000 | 1           | 1         | 4:30:59    | DRIVER     | MALE   |
1     | 42  | 2000 | 1           | 1         | 7:45:10    | DRIVER     | MALE   |
2     | 25  | 2000 | 1           | 1         | 10:15:30   | PEDESTRIAN | FEMALE |

Crash_Type   | Injury_Severity | Crash_LGA | Crash_Area_Type | Datetime |
UNKNOWN      | 1               | YARRA     | MELBOURNE       | NaN      |
OVERTAKING   | 1               | YARRA     | MELBOURNE       | NaN      |
ADJACENT DIR | 0               | MONASH    | MELBOURNE       | NaN      |
Below is the code I've been suggested to use, along with my adaptation of it:
df = pd.DataFrame(dict(
    Crash_Time=['4:30:59', '4:20:00'],
    Crash_Day=[1, 20],
    Crash_Month=[1, 4],
    Year=[2000, 2020],
))
data['Datetime'] = df['Datetime'] = pd.to_datetime(
    np.sum([
        df['Year'].astype(str),
        '-',
        df['Crash_Month'].astype(str),
        '-',
        df['Crash_Day'].astype(str),
        ' ',
        df['Crash_Time'],
    ]),
    format='%Y-%m-%d %H:%M:%S',
)
I've adapted this code in order to combine the values for the datetime column with my original dataframe.

Combine the columns into a single series of strings using + (converting to str where needed with the pandas.Series.astype method), then pass that new series to pd.to_datetime before assigning it to a new column in your df:
import pandas as pd

df = pd.DataFrame(dict(time=['4:30:59'], date=[1], month=[1], year=[2000]))
df['datetime'] = pd.to_datetime(
    df['year'].astype(str) + '-' + df['month'].astype(str) + '-' + df['date'].astype(str) + ' ' + df['time'],
    format='%Y-%m-%d %H:%M:%S',
)
print(df)
edit: You can also use numpy.sum to make that one long line of column additions easier on the eyes:
import pandas as pd
import numpy as np

df = pd.DataFrame(dict(
    time=['4:30:59', '4:20:00'],
    date=[1, 20],
    month=[1, 4],
    year=[2000, 2020],
))
df['datetime'] = pd.to_datetime(
    np.sum([
        df['year'].astype(str),
        '-',
        df['month'].astype(str),
        '-',
        df['date'].astype(str),
        ' ',
        df['time'],
    ]),
    format='%Y-%m-%d %H:%M:%S',
)
edit 2: Using your actual column names, it should be something like this:
import pandas as pd
import numpy as np
'''
Index | Age | Year | Crash_Month | Crash_Day | Crash_Time | Road_User | Gender |
0       37    2000   1             1           4:30:59      DRIVER      MALE
Crash_Type | Injury_Severity | Crash_LGA | Crash_Area_Type | Datetime |
UNKNOWN      1                 YARRA       MELBOURNE         NaN
'''
df = pd.DataFrame(dict(
    Crash_Time=['4:30:59', '4:20:00'],
    Crash_Day=[1, 20],
    Crash_Month=[1, 4],
    Year=[2000, 2020],
))
df['Datetime'] = pd.to_datetime(
    np.sum([
        df['Year'].astype(str),
        '-',
        df['Crash_Month'].astype(str),
        '-',
        df['Crash_Day'].astype(str),
        ' ',
        df['Crash_Time'],
    ]),
    format='%Y-%m-%d %H:%M:%S',
)
print(df)
One thing to note: double-check whether your csv file is separated by just a comma, or by a comma plus a space. If there is an extra space separating the data in addition to the comma, you may need to load it with df = pd.read_csv('road_accidents_data_clean.csv', sep=', '); you don't want that extra space in your data.
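As a side note (my addition, not part of the original answer): pandas can also assemble datetimes without string concatenation. pd.to_datetime accepts a DataFrame whose columns are named year/month/day, and pd.to_timedelta parses 'HH:MM:SS' strings, so a rename-based variant is possible. A minimal sketch using the same toy columns as above:
import pandas as pd

df = pd.DataFrame(dict(
    Crash_Time=['4:30:59', '4:20:00'],
    Crash_Day=[1, 20],
    Crash_Month=[1, 4],
    Year=[2000, 2020],
))

# to_datetime assembles dates from a frame with year/month/day columns,
# and to_timedelta turns the 'HH:MM:SS' strings into time offsets
date_parts = df[['Year', 'Crash_Month', 'Crash_Day']].rename(
    columns={'Year': 'year', 'Crash_Month': 'month', 'Crash_Day': 'day'})
df['Datetime'] = pd.to_datetime(date_parts) + pd.to_timedelta(df['Crash_Time'])
print(df)
This keeps everything numeric until the end, which sidesteps any zero-padding concerns with the string approach.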

Related

Problems extracting data from pandas data frame as a result of Grouping

In order to plot the frequency of tornadoes every 10 days, I have grouped the data into groups of 10 days using
df_grouped = pd.DataFrame()
df_grouped['COUNT'] = df.groupby(pd.Grouper(key='DATE', freq='10D'))['DATE'].count().to_frame()
However, the column DATE does not exist in the result, as shown when I run:
>>> df_grouped.shape
(1041,1)
despite the fact that I am able to view and plot the dates in the Jupyter notebook GUI.
This is an issue because I wish to access this data later for other purposes, and I am unable to when using:
year = pd.to_datetime(df_grouped['DATE'], dayfirst = True, errors='coerce').dt.year.values
df_grouped['year'] = year
It states that there is an invalid indexing error since the column no longer exists. Does anyone know what I can do to access the data?
MINIMUM REPRODUCIBLE EXAMPLE
import pandas as pd
df = pd.DataFrame(pd.date_range(start='1994-01-01', end='1994-01-21'), columns=['DATE'])
df_grouped = pd.DataFrame()
df_grouped['COUNT'] = df.groupby(pd.Grouper(key='DATE', freq='10D'))['DATE'].count().to_frame()
expected output
|DATE |COUNT |
|1994-01-01|10 |
|1994-01-11|10 |
|1994-01-21|10 |
|1994-01-31|01 |
actual output
| |COUNT |
|DATE | |
|1994-01-01|10 |
|1994-01-11|10 |
|1994-01-21|10 |
|1994-01-31|01 |
What happens is that pd.Grouper moves DATE into the index of the grouped result, so df_grouped only receives the COUNT values, and df_grouped['DATE'] raises the indexing error. Build the counts, then call reset_index() to turn DATE back into a regular column:
import pandas as pd
df = pd.DataFrame(pd.date_range(start='1994-01-01', end='1994-01-21'), columns=['DATE'])
df = (df.assign(COUNT=lambda x: 1)
        .groupby(pd.Grouper(key='DATE', freq='10D')).count()
        .reset_index())
print(df)
#         DATE  COUNT
# 0 1994-01-01     10
# 1 1994-01-11     10
# 2 1994-01-21      1
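With DATE restored as a column, the follow-up from the question then works directly. A short continuation (my addition):
# DATE is a regular datetime64 column again, so the .dt accessor works
df['year'] = df['DATE'].dt.year
print(df)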

How to round decimals when writing a dataframe in streamlit

import streamlit as st
import pandas as pd

data = {'Days': ["Sunday", "Wednesday", "Friday"],
        'Predictions': [433.11, 97.9, 153.65]}
df = pd.DataFrame(data)
st.write(df)
Streamlit writes the dataframe with four decimals by default, but I expected two decimals. print() produces the expected output, but st.write() produces the output below:
    Days      | Predictions
--------------|-------------
0   Sunday    | 433.1100
1   Wednesday | 97.9000
2   Friday    | 153.6500
I tried:
df.round(2)
but it didn't help.
Desired output format:
    Days      | Predictions
--------------|-------------
0   Sunday    | 433.11
1   Wednesday | 97.90
2   Friday    | 153.65
With the df.style.format() method, you can specify the desired number of decimal places to override the default display behavior of pandas dataframes in Streamlit.
Note: this approach only changes the displayed output; for any further processing you should keep working with the original dataframe df.
formatted_df = df.style.format({"Predictions": "{:.2f}".format})
st.write(formatted_df)
You could also choose to write the dataframe without having to initialize a variable, like so:
st.write(df.style.format({"Predictions": "{:.2f}"}))
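An additional option (my addition, assuming a reasonably recent Streamlit, roughly 1.23+): st.dataframe accepts a column_config mapping, and NumberColumn takes a printf-style format string, which likewise only affects the display:
import streamlit as st
import pandas as pd

df = pd.DataFrame({'Days': ["Sunday", "Wednesday", "Friday"],
                   'Predictions': [433.11, 97.9, 153.65]})

# format is applied to the rendered table only; df itself is unchanged
st.dataframe(df, column_config={
    'Predictions': st.column_config.NumberColumn(format='%.2f'),
})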

pandas: how to groupby / pivot retaining the NaNs? Converting float to str then back to float works but seems convoluted

I am tracking in which "month" a certain event has taken place. If it hasn't, the "month" field is a NaN. The starting table looks like this:
+-------+----------+---------+
| Month | Category | Balance |
+-------+----------+---------+
| 1 | a | 100 |
| nan | a | 300 |
| 2 | a | 200 |
+-------+----------+---------+
I am trying to build a crosstab like this:
+-------+----------------------------------+
| Month | Category a - cumulative % amount |
+-------+----------------------------------+
| 1 | 0.16 |
| 2 | 0.50 |
+-------+----------------------------------+
In month 1, the event has happened for 100/600, ie for 16%
In month 2, the event has happened, cumulatively, for (100 + 200) / 600 = 50%, where 100 is in month 1 and 200 in month 2.
My issue is with NaNs. Pandas automatically removes NaNs from any groupby / pivot / crosstab. I could convert the month field to string, so that grouping it won't remove the NaNs, but then pandas sorts by the month as if it were a string, ie it would sort: 10, 48, 5, 6.
Any suggestions?
The following works but seems extremely convoluted:
1. Convert "month" to string
2. Do a crosstab
3. Convert "month" back to float (can I do it without moving the index to a column, and then the column back to the index?)
4. Sort again
5. Do the cumsum
Code:
import numpy as np
import pandas as pd

df = pd.DataFrame()
mylen = int(10e3)
df['ix'] = np.arange(0, mylen)
df['amount'] = np.random.uniform(10e3, 20e3, mylen)
df['category'] = np.where(df['ix'] <= 4000, 'a', 'b')
df['month'] = np.random.uniform(3, 48, mylen)
df['month'] = np.where(df['ix'] <= 1000, np.nan, df['month'])
df['month rounded'] = np.ceil(df['month'])

ct = pd.crosstab(df['month rounded'].astype(str), df['category'],
                 values=df['amount'], aggfunc='sum', margins=True,
                 normalize='columns', dropna=False)
# the index is 'month rounded'
ct = ct.reset_index()
ct['month rounded'] = ct['month rounded'].astype('float32')
ct = ct.sort_values('month rounded')
ct = ct.set_index('month rounded')
ct2 = ct.cumsum(axis=0)
Use:
new_df = df.assign(cumulative=df['Balance'].mask(df['Month'].isna())
                              .groupby(df['Category'])
                              .cumsum()
                              .div(df.groupby('Category')['Balance']
                                     .transform('sum'))).dropna()
print(new_df)
   Month Category  Balance  cumulative
0    1.0        a      100    0.166667
2    2.0        a      200    0.500000
If you want to create a DataFrame for each Category, you could build a dict:
df_category = {i: group for i, group in new_df.groupby('Category')}
df['Category a - cumulative % amount'] = (
    df.groupby(by=df.Month.fillna(np.inf))
      .apply(lambda x: x.Balance.cumsum().div(df.Balance.sum()))
      .reset_index(level=0, drop=True)
)
df.dropna()
   Month Category  Balance  Category a - cumulative % amount
0      1        a      100                          0.166667
2      2        a      200                          0.333333
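Worth noting (my addition, not part of the original answers): since pandas 1.1, groupby accepts dropna=False, so the NaN months can be kept as a group without the string round-trip. A minimal sketch on the toy table from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Month': [1, np.nan, 2],
                   'Category': ['a', 'a', 'a'],
                   'Balance': [100, 300, 200]})

# dropna=False keeps the NaN key instead of silently discarding those rows
sums = df.groupby('Month', dropna=False)['Balance'].sum()

# cumulative % per month: the NaN rows stay out of the numerator but
# remain in the denominator
happened = sums[sums.index.notna()]
print(happened.cumsum() / sums.sum())  # Month 1 -> 0.166667, Month 2 -> 0.5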

Using Pandas to join and append columns in a loop

I want to append columns from tables generated in a loop to a dataframe. I was hoping to accomplish this using pandas.merge, but it doesn't seem to be working out for me.
My code:
from datetime import date
from datetime import timedelta
import pandas
import numpy
import pyodbc

date1 = date(2017, 1, 1)   # Starting Date
date2 = date(2017, 1, 10)  # Ending Date
DateDelta = date2 - date1
DateAdd = DateDelta.days
StartDate = date1
count = 1

# Create the holding table
conn = pyodbc.connect('Server Information')
basetable = pandas.read_sql("SELECT....", conn)

while count <= DateAdd:
    print(StartDate)
    datatable = pandas.read_sql("SELECT...WHERE Date = " + str(StartDate) + "...", conn)
    finaltable = basetable.merge(datatable, how='left',
                                 left_on='OrganizationName',
                                 right_on='OrganizationName')
    StartDate = StartDate + timedelta(days=1)
    count = count + 1

print(finaltable)
Shortened the select statements for brevity's sake, but the tables produced look like this:
Basetable:
School_District
---------------
District_Alpha
District_Beta
...
District_Zed

Datatable:
School_District|2016-01-01|
---------------|----------|
District_Alpha |   400    |
District_Beta  |   300    |
...            |   200    |
District_Zed   |   100    |
I have the datatable written so the column takes the name of the date selected for that particular loop, so column names will be unique once I get this up and running. My problem, however, is that the above code only produces one column of data. I have a good guess as to why: only the last merge is being kept. I thought using pandas.append would be the way around that, but append doesn't join the way merge does. Is there some other way to accomplish a sort of join-and-append using Pandas? My goal is to keep this flexible so that other dates can easily be input depending on our data needs.
In the end, what I want to see is:
School_District|2016-01-01|2016-01-02|... |2016-01-10|
---------------|----------|----------|-----|----------|
District_Alpha | 400 | 1 | | 45 |
District_Beta | 300 | 2 | | 33 |
... | 200 | 3 | | 5435 |
District_Zed | 100 | 4 | | 333 |
Your error is in the statement finaltable = basetable.merge(datatable, ...). On each loop iteration you merge the original basetable with the new datatable, store the result in finaltable, and then overwrite it on the next iteration, so only the last day's merge survives. What you need is basetable = basetable.merge(datatable, ...) so the columns accumulate across iterations. No finaltable needed.
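A minimal sketch of the corrected loop, continuing from the setup in the question (the SQL is elided there, so it stays elided here; names follow the question's code):
while count <= DateAdd:
    datatable = pandas.read_sql("SELECT...WHERE Date = " + str(StartDate) + "...", conn)
    # accumulate: each day's column is merged onto the growing basetable
    basetable = basetable.merge(datatable, how='left', on='OrganizationName')
    StartDate = StartDate + timedelta(days=1)
    count = count + 1

print(basetable)
Using on='OrganizationName' is just shorthand for the left_on/right_on pair in the question; since both sides share the column name, either spelling works.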

How to parse all the values in a column of a DataFrame?

DataFrame df has a column called amount
import pandas as pd
df = pd.DataFrame(['$3,000,000.00','$3,000.00', '$200.5', '$5.5'], columns = ['Amount'])
df:
ID | Amount
0 | $3,000,000.00
1 | $3,000.00
2 | $200.5
3 | $5.5
I want to parse all the values in column amount and extract the amount as a number and ignore the decimal points. End result is DataFrame that looks like this:
ID | Amount
0 | 3000000
1 | 3000
2 | 200
3 | 5
How do I do this?
You can use str.replace with double casting by astype (pass regex=True explicitly; since pandas 2.0, str.replace no longer treats the pattern as a regex by default):
df['Amount'] = df.Amount.str.replace(r'[\$,]', '', regex=True).astype(float).astype(int)
print(df)
    Amount
0  3000000
1     3000
2      200
3        5
You need to use the map function on the column and reassign to the same column:
import locale
locale.setlocale( locale.LC_ALL, 'en_US.UTF-8' )
df.Amount = df.Amount.map(lambda s: int(locale.atof(s[1:])))
PS: This uses the code from How do I use Python to convert a string to a number if it has commas in it as thousands separators? to convert a string representing a number with thousands separator to an int
Code -
import pandas as pd

def format_amount(x):
    # drop the '$', keep only the integer part, then strip the commas
    x = x[1:].split('.')[0]
    return int(''.join(x.split(',')))

df = pd.DataFrame(['$3,000,000.00', '$3,000.00', '$200.5', '$5.5'],
                  columns=['Amount'])
df['Amount'] = df['Amount'].apply(format_amount)
print(df)
Output -
    Amount
0  3000000
1     3000
2      200
3        5
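One more variant (my addition): chaining two literal replacements sidesteps the regex-default question entirely, at the cost of an extra pass over the strings:
import pandas as pd

df = pd.DataFrame(['$3,000,000.00', '$3,000.00', '$200.5', '$5.5'],
                  columns=['Amount'])

# literal (non-regex) replacements, then truncate the decimals via the int cast
df['Amount'] = (df['Amount']
                .str.replace('$', '', regex=False)
                .str.replace(',', '', regex=False)
                .astype(float)
                .astype(int))
print(df)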
