I have concatenated several CSV files into one dataframe to make a combined CSV file. One of the columns contains both date and time (e.g. 02:33:01 21-Jun-2018) after being converted to datetime format. However, when I call
new_dataframe = old_dataframe.sort_values(by = 'Time')
it sorts the dataframe by time, completely ignoring the date:
Index Time Depth(ft) Pit Vol(bbl) Trip Tank(bbl)
189147 00:00:00 03-May-2018 2283.3578 719.6753 54.2079
3875 00:00:00 07-May-2018 5294.7308 1338.7178 29.5781
233308 00:00:00 20-May-2018 8073.7988 630.7964 41.3574
161789 00:00:01 05-May-2018 122.2710 353.6866 58.9652
97665 00:00:01 01-May-2018 16178.8666 769.1328 66.0688
How do I get it to sort by date and then time, so that April's days come first and everything is in chronological order?
To sort by date first and then time, your Time column needs to be ordered as date followed by time. Currently, it's the opposite.
You can do this:
df['Time'] = df['Time'].str.split(' ').str[::-1].apply(lambda x: ' '.join(x))
df['Time'] = pd.to_datetime(df['Time'])
Now sort your df by Time like this:
df.sort_values('Time')
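Alternatively (a sketch, not part of the answer above): since the strings always look like "time date", you can hand pd.to_datetime an explicit format and skip the string reversal entirely.

```python
import pandas as pd

# Sketch of an alternative: parse "time first, date second" directly
# with an explicit format string instead of reversing the string.
df = pd.DataFrame({'Time': ['02:33:01 21-Jun-2018', '00:00:00 03-May-2018']})
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S %d-%b-%Y')
df = df.sort_values('Time')
print(df['Time'].tolist())  # the May row now comes before the June row
```

An explicit format is also faster on large frames, because pandas skips format inference.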
Related
I'm a beginner in python. I have an excel file. This file shows the rainfall amount between 2016-1-1 and 2020-6-30. It has 2 columns. The first column is date, another column is rainfall. Some dates are missed in the file (The rainfall didn't estimate). For example there isn't a row for 2016-05-05 in my file. This a sample of my excel file.
Date rainfall (mm)
1/1/2016 10
1/2/2016 5
.
.
.
12/30/2020 0
I want to find the missing dates but my code doesn't work correctly!
import pandas as pd
from datetime import datetime, timedelta
from matplotlib import dates as mpl_dates
from matplotlib.dates import date2num
df=pd.read_excel ('rainfall.xlsx')
a= pd.date_range(start = '2016-01-01', end = '2020-06-30' ).difference(df.index)
print(a)
Here's a beginner-friendly way of doing it.
First, make sure that the Date column in your dataframe is really a date and not a string or object.
Type (or print) df.info().
The date column should show up as datetime64[ns].
If not, df['Date'] = pd.to_datetime(df['Date'], dayfirst=False) fixes that. (Use dayfirst to tell pandas whether the day or the month comes first in your date strings, because it can't always tell. Month first is the default, so omitting the argument would also work here.)
For the task of finding missing days, there are many ways to solve it. Here's one.
Turn the full date range into a series:
all_dates = pd.Series(pd.date_range(start = '2016-01-01', end = '2020-06-30' ))
Then print all dates from that series which are not in your dataframe "Date" column. The ~ sign means "not".
print(all_dates[~all_dates.isin(df['Date'])])
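Putting those steps together as a self-contained sketch, with a made-up Date column standing in for the Excel data:

```python
import pandas as pd

# Sketch of the approach above: a tiny Date column that is missing
# 2016-01-03 and 2016-01-05.
df = pd.DataFrame({'Date': ['2016-01-01', '2016-01-02', '2016-01-04']})
df['Date'] = pd.to_datetime(df['Date'])
all_dates = pd.Series(pd.date_range(start='2016-01-01', end='2016-01-05'))
missing = all_dates[~all_dates.isin(df['Date'])]
print(missing.tolist())  # the two missing dates
```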
Try:
df = pd.read_excel('rainfall.xlsx', usecols=[0])
a = pd.date_range(start = '2016-01-01', end = '2020-06-30').difference([l[0] for l in df.values])
print(a)
And the dates in the file must be in a format like 2016/1/1.
As an alternative without Python: to find the missing dates from a list, you can apply the Conditional Formatting function in Excel, which highlights the positions of the missing dates. Note: the last date in the date list will also be highlighted.
I have a pandas dataframe with a column which contains unix timestamps, but I think they're in milliseconds because each time has 3 extra 0's at the end. For example, the first data point is 1546300800000, when it should be just 1546300800. I need to convert this column to readable times, so right now I have:
df = pd.read_csv('data.csv')
df['Time'] = pd.to_datetime(df['Time'])
df.to_csv('data.csv', index=False)
Instead of giving me the correct time, it gives me a time in 1970. For example, 1546300800000 gives me 1970-01-01 00:25:46.301100 when it should be 2019-01-01 00:00:00. It does this for every timestamp in the column, which has over 20K rows.
Data:
df=pd.DataFrame({'UNIX':['1349720105','1546300800']})
Conversion:
df['UNIX']=pd.to_datetime(df['UNIX'], unit='s')
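Note that the question's values are millisecond timestamps, so for that data unit='ms' is the one that applies. A sketch using the question's example value:

```python
import pandas as pd

# The question's timestamps carry three extra zeros (milliseconds),
# so unit='ms' is needed for them:
df = pd.DataFrame({'Time': [1546300800000]})
df['Time'] = pd.to_datetime(df['Time'], unit='ms')
print(df['Time'].iloc[0])  # 2019-01-01 00:00:00
```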
I have a DateTime column in which date and time are combined. However, I would like to separate it into two columns: Date and Time.
The time is at fifteen-minute intervals, and I need to make it hourly.
First off, make sure your datetime column is in datetime format
df['datetime'] = pd.to_datetime(df['datetime'])
You can then easily extract the date and hour from this using:
df['Date'] = df['datetime'].dt.date
df['Hour'] = df['datetime'].dt.hour
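For the second part (turning the fifteen-minute rows into hourly ones), one option is resample. This is only a sketch: the numeric column name value is an assumption, since the question doesn't show its data columns.

```python
import pandas as pd

# Sketch: aggregate 15-minute readings into hourly sums.
# The 'value' column is hypothetical.
idx = pd.date_range('2021-01-01 00:00', periods=8, freq='15min')
df = pd.DataFrame({'datetime': idx, 'value': [1] * 8})
hourly = df.resample('H', on='datetime')['value'].sum()
print(hourly.tolist())  # [4, 4]
```

Use .mean() instead of .sum() if the hourly figure should be an average rather than a total.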
I have a dataframe that contains the columns company_id, seniority, join_date and quit_date. I am trying to extract the number of days between join date and quit date. However, I get NaNs.
If I drop all the columns in the dataframe except quit_date and join_date and run the same code again, I get what I expect. However, with all the columns, I get NaNs.
Here's my code:
df['join_date'] = pd.to_datetime(df['join_date'])
df['quit_date'] = pd.to_datetime(df['quit_date'])
df['days'] = df['quit_date'] - df['join_date']
df['days'] = df['days'].astype(str)
df1 = pd.DataFrame(df.days.str.split(' ').tolist(), columns = ['days', 'unwanted', 'stamp'])
df['numberdays'] = df1['days']
This is what I get:
days numberdays
585 days 00:00:00 NaN
340 days 00:00:00 NaN
I want 585 from the 'days' column in the 'numberdays' column. Similarly for every such row.
Can someone help me with this?
Thank you!
Instead of converting to string, extract the number of days from the timedelta value using the dt accessor.
import pandas as pd
df = pd.DataFrame({'join_date': ['2014-03-24', '2013-04-29', '2014-10-13'],
'quit_date':['2015-10-30', '2014-04-04', '']})
df['join_date'] = pd.to_datetime(df['join_date'])
df['quit_date'] = pd.to_datetime(df['quit_date'])
df['days'] = df['quit_date'] - df['join_date']
df['number_of_days'] = df['days'].dt.days
Note: Mohammad Yusuf Ghazi points out that dt.day is what's needed to get the day number when working with datetime data rather than timedelta, where dt.days applies.
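Running the approach end to end on the sample frame shows the timedelta collapsing to plain day counts, with NaN where quit_date is empty:

```python
import pandas as pd

# The answer's sample data: dt.days pulls the integer day count
# straight from the timedelta column; the empty quit_date becomes NaT.
df = pd.DataFrame({'join_date': ['2014-03-24', '2013-04-29', '2014-10-13'],
                   'quit_date': ['2015-10-30', '2014-04-04', '']})
df['join_date'] = pd.to_datetime(df['join_date'])
df['quit_date'] = pd.to_datetime(df['quit_date'])
df['number_of_days'] = (df['quit_date'] - df['join_date']).dt.days
print(df['number_of_days'].tolist())  # [585.0, 340.0, nan]
```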
I have a large pandas dataframe that has hourly data associated with it. I then want to parse that into "monthly" data that sums the hourly data. However, the months aren't necessarily calendar months, they typically start in the middle of one month and end in the middle of the next month.
I could build a list of the "months" that each of these date ranges fall into and loop through it, but I would think there is a much better way to do this via pandas.
Here's my current code, the last line throws an error and is the crux of the question:
dates = pd.Series(pd.date_range('1/1/2015 00:00','3/31/2015 23:45',freq='1H'))
nums = np.random.randint(0,100,dates.count())
df = pd.DataFrame({'date':dates, 'num':nums})
month = pd.DataFrame({'start':['1/4/2015 00:00','1/24/2015 00:00'], 'end':['1/23/2015 23:00','2/23/2015 23:00']})
month['start'] = pd.to_datetime(month['start'])
month['end'] = pd.to_datetime(month['end'])
month['num'] = df['num'][(df['date'] >= month['start']) & (df['date'] <= month['end'])].sum()
I would want an output similar to:
start end num
0 2015-01-04 2015-01-23 23:00:00 33,251
1 2015-01-24 2015-02-23 23:00:00 39,652
but of course, I'm not getting that.
pd.merge_asof is only available with pandas 0.19 or later.
combination of pd.merge_asof + query + groupby
pd.merge_asof(df, month, left_on='date', right_on='start') \
.query('date <= end').groupby(['start', 'end']).num.sum().reset_index()
explanation
pd.merge_asof
From docs
For each row in the left DataFrame, we select the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key. Both DataFrames must be sorted by the key.
But this only takes into account the start date.
query
I take care of the end date with query, since end is now conveniently in the dataframe after pd.merge_asof.
groupby
I trust this part is obvious.
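Here is the whole pattern as a self-contained sketch, using tiny deterministic frames instead of the question's random data so the sums are reproducible:

```python
import pandas as pd

# merge_asof attaches the latest window whose start <= date; query
# drops rows past that window's end; groupby sums per window.
df = pd.DataFrame({
    'date': pd.date_range('2015-01-04', periods=6, freq='D'),
    'num': [10, 20, 30, 40, 50, 60],
})
month = pd.DataFrame({
    'start': pd.to_datetime(['2015-01-04', '2015-01-07']),
    'end': pd.to_datetime(['2015-01-06', '2015-01-09']),
})
out = (pd.merge_asof(df, month, left_on='date', right_on='start')
         .query('date <= end')
         .groupby(['start', 'end']).num.sum()
         .reset_index())
print(out.num.tolist())  # [60, 150]
```

Both frames must already be sorted on the join keys, which is the case here.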
Maybe you can add a number of days and then convert to a period:
# create data
dates = pd.Series(pd.date_range('1/1/2015 00:00','3/31/2015 23:45',freq='1H'))
nums = np.random.randint(0,100,dates.count())
df = pd.DataFrame({'date':dates, 'num':nums})
# offset days and then create period
df['periods'] = (df.date + pd.tseries.offsets.Day(23)).dt.to_period('M')
# group and sum
df.groupby('periods')['num'].sum()
Output
periods
2015-01 10051
2015-02 34229
2015-03 37311
2015-04 26655
You can then shift the dates back and make new columns.
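A sketch of that shift-back step, assuming the same 23-day offset as above: convert each period back to a timestamp and subtract the offset to recover the custom month's start date.

```python
import pandas as pd

# Sketch: recover the start date of each custom "month" by undoing
# the 23-day offset applied before to_period.
periods = pd.PeriodIndex(['2015-01', '2015-02'], freq='M')
starts = periods.to_timestamp() - pd.tseries.offsets.Day(23)
print(list(starts))  # [Timestamp('2014-12-09 00:00:00'), Timestamp('2015-01-09 00:00:00')]
```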