I'm trying to create an empty DataFrame to which I will then constantly append rows, using the timestamp at which the data arrives as the index.
This is the code I have so far:
import pandas as pd
import datetime
df = pd.DataFrame(columns=['a','b'],index=pd.DatetimeIndex(freq='s'))
df.loc[event.get_datetime()] = event.get_data()
The problem I'm having is with freq in the DatetimeIndex: the data does not arrive at any predefined interval, only when some event triggers. Also, in the code above I would need to specify a start and end date for the index, which I don't want; I just want to be able to append rows whenever they arrive.
Set up an empty index with pd.to_datetime:
df = pd.DataFrame(columns=['a','b'], index=pd.to_datetime([]))
Then do this
df.loc[pd.Timestamp('now')] = pd.Series([1, 2], ['a', 'b'])
df
a b
2018-06-10 20:52:52.025426 1 2
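A minimal usage sketch tying this back to the question's event pattern (event_stream, get_datetime(), and get_data() are the question's hypothetical API, not real objects):
import pandas as pd

df = pd.DataFrame(columns=['a', 'b'], index=pd.to_datetime([]))
# Append each event's data as it arrives, keyed by its timestamp.
for event in event_stream:
    df.loc[event.get_datetime()] = event.get_data()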
The first argument of DatetimeIndex is data. Try setting data to an empty list. If you want to define the start time, end time, or frequency, take a look at the other arguments of DatetimeIndex.
df = pd.DataFrame(columns=['a','b'], index=pd.DatetimeIndex([], name='startime'))
If you're trying to index on time delta values, also consider
df = pd.DataFrame(columns=['a','b'], index=pd.TimedeltaIndex([]))
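A quick usage sketch for the timedelta variant (the values here are made up):
import pandas as pd

df = pd.DataFrame(columns=['a', 'b'], index=pd.TimedeltaIndex([]))
# Rows are keyed by elapsed time rather than wall-clock time.
df.loc[pd.Timedelta(seconds=1.5)] = [1, 2]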
Related
I am trying to fill an existing dataframe in pandas by adding several rows at a time; the number of rows depends on a comprehension list, so it is variable. The initial dataframe is filled as follows:
import pandas as pd
import portion as P
columns = ['chr', 'Start', 'End', 'type']
x = pd.DataFrame(columns=columns)
RANGE = [(212, 222),(866, 888),(152, 158)]
INTERVAL= P.Interval(*[P.closed(x, y) for x, y in RANGE])
def fill_df(df, junction, chr, type):
    df['Start'] = [x.lower for x in junction]
    df['End'] = [x.upper for x in junction]
    df['chr'] = chr
    df['type'] = type
    return df
z = fill_df(x, INTERVAL, 1, 'DUP')
The idea is to keep appending rows to the dataframe from different intervals (so a variable number of rows each time).
Here I have found different ways to add several rows, but none of them are easy to apply unless I write a function to convert my data into tuples or lists, which I am not sure would be efficient. I have also tried pandas append, but I was not able to do it for a bunch of lines at once.
Is there any simple way to do this?
Thanks a lot!
Have you tried wrapping the list comprehension in pd.Series?
df['Start.pos'] = pd.Series([x.lower for x in junction])
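Why that helps: assigning a Series aligns on the frame's index instead of requiring an exact length match, so a shorter batch fills the remaining rows with NaN (a minimal sketch with made-up values):
import pandas as pd

df = pd.DataFrame({'chr': [1, 1, 1]})
df['Start'] = pd.Series([212, 866])  # shorter than df: row 2 gets NaN for Start
print(df)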
If you want to append several elements at once, you can create a second DataFrame and concatenate it to the first one (DataFrame.append was removed in pandas 2.0, so pd.concat is used below). It looks like this:
import intvalpy as ip
import pandas as pd
inf = [1, 2, 3]
sup = [4, 5, 6]
intervals = ip.Interval(inf, sup)
add_intervals = ip.Interval([-10, -20], [10,20])
df = pd.DataFrame(data={'start': intervals.a, 'end': intervals.b})
df2 = pd.DataFrame(data={'start': add_intervals.a, 'end': add_intervals.b})
df = pd.concat([df, df2], ignore_index=True)
print(df.head(10))
The intvalpy library, which specializes in classical and full interval arithmetic, is used here. To define an interval or intervals, use the Interval function, where the first argument is the left end and the second is the right end of the intervals.
The ignore_index parameter lets the indexing of the first table continue instead of keeping each frame's own index.
In case you want to add one row at a time, you can do it as follows:
for k in range(len(intervals)):
    # DataFrame.append was removed in pandas 2.0; loc-based insertion adds one row in place
    df.loc[len(df)] = {'start': intervals[k].a, 'end': intervals[k].b}
print(df.head(10))
I purposely did it with a loop to show that you can do without creating a second table when you only want to add a few rows.
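Since appending inside a loop rebuilds the frame on every iteration, for many rows it is usually faster to collect the pieces first and concatenate once (a sketch reusing the intervals variable from above):
rows = [{'start': intervals[k].a, 'end': intervals[k].b} for k in range(len(intervals))]
df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)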
Goal:
From an Excel file, I want to get all the records whose dates fall within a range and write them to a new Excel file. The infile I'm working with has 500K+ rows and 21 columns.
What I've tried:
I've read the infile into a Pandas dataframe, then created a DatetimeIndex. If I print the range variable I get the desired date range.
import pandas as pd
in_excel_file = r'path\to\infile.xlsx'
out_excel_file = r'path\to\outfile.xlsx'
df = pd.read_excel(in_excel_file)
range = (pd.date_range(start='1910-1-1', end='2021-1-1'))
print(range)
# prints
DatetimeIndex(['1910-01-01', '1910-01-02', '1910-01-03', '1910-01-04',
               '1910-01-05', '1910-01-06', '1910-01-07', '1910-01-08',
               '1910-01-09', '1910-01-10',
               ...
               '2020-12-23', '2020-12-24', '2020-12-25', '2020-12-26',
               '2020-12-27', '2020-12-28', '2020-12-29', '2020-12-30',
               '2020-12-31', '2021-01-01'],
              dtype='datetime64[ns]', length=40544, freq='D')
Where I'm having trouble is getting the above DatetimeIndex to the outfile. The following gives me an error:
range.to_excel(out_excel_file, index=False)
AttributeError: 'DatetimeIndex' object has no attribute 'to_excel'
I'm pretty sure that when writing to Excel it has to be a dataframe. So, my question is: how do I get the range variable into a dataframe object?
You could use an indexing operation to select only the data you need from the original DataFrame and save the result in an Excel file.
To do that, first check whether the date column in your original DataFrame has already been converted to a datetime object:
import numpy as np
date_column = "date" # Suppose this is your date column name
if not np.issubdtype(df[date_column].dtype, np.datetime64):
    df.loc[:, date_column] = pd.to_datetime(df[date_column], format="%Y-%m-%d")
Now you can use a regular indexing operation to get all values you need:
mask = (df[date_column] >= '1910-01-01') & (df[date_column] <= '2021-01-01') # Creates mask for date range
out_dataframe = df.loc[mask] # Here we select the indices using our mask
out_dataframe.to_excel(out_excel_file)
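An equivalent mask can also be written with Series.between, which is inclusive on both ends by default (a sketch using the same date_column as above):
mask = df[date_column].between('1910-01-01', '2021-01-01')
out_dataframe = df.loc[mask]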
You can try to create a dataframe from the DatetimeIndex before writing it to Excel, as follows:
range_df = pd.DataFrame(index=range).rename_axis(index='range').reset_index()
or, as suggested by @guimorg, we can also do it as:
range_df = range.to_frame(index=False, name='range')
Then, continue with your code to write it to Excel:
range_df.to_excel(out_excel_file, index=False)
I have a dataframe on which I would like to perform some analysis. An easy example of what I would like to achieve is, given the dataframe:
import pandas as pd

data = ['2017-02-13', '2017-02-13', '2017-02-13', '2017-02-15', '2017-02-16']
df = pd.DataFrame(data = data, columns = ['date'])
I would like to create a new dataframe from this. The new dataframe should contain 2 columns: the entire date span (so it should also include 2017-02-14) and the number of times each date appears in the original data.
I managed to construct a dataframe that includes all the dates, like so:
dates = pd.to_datetime(df['date'], format = "%Y-%m-%d")
dateRange = pd.date_range(start = dates.min(), end = dates.max()).tolist()
df2 = pd.DataFrame(data = dateRange, columns = ['datum'])
My question is: how would I add the counts of each date from df to df2? I've been messing around trying to write my own functions but have not managed to achieve it. I am assuming this needs to be done quite often and that I am overthinking it...
Try this:
df2['counts'] = df2['datum'].map(pd.to_datetime(df['date']).value_counts()).fillna(0).astype(int)
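The same result, unpacked into steps (using the df2 and datum names from the question):
counts = pd.to_datetime(df['date']).value_counts()   # occurrences per date
df2['counts'] = df2['datum'].map(counts)             # align counts onto the full span
df2['counts'] = df2['counts'].fillna(0).astype(int)  # dates absent from df appear 0 times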
I cannot manage to update a cell value when the dataframe index is a sub-second time series. For example:
import numpy as np
import pandas as pd
t0 = '2019-01-05 22:00:00.000'
t1 = '2019-01-05 22:00:05.000'
df_times = pd.date_range(t0, t1, freq = '500L')
df = pd.DataFrame()
df['datetime'] = df_times
df['Value']=[20,21,22,23,24,25,26,27,28,29,30]
df['Target'] = range(len(df_times))
df = df.set_index('datetime')
df
will result in this dataframe:
(screenshot: dataframe contents)
If I try to update the 'Target' cell at index '2019-01-05 22:00:02.000', I end up also updating the 'Target' cell at index '2019-01-05 22:00:02.500'.
(screenshot: two cells updated instead of one)
How can I work around this?
This will do the trick:
df.loc[pd.to_datetime('2019-01-05 22:00:02.000'), 'Target']=57
Passing a plain string is apparently interpreted at a lower (second-level) precision than the pandas Timestamp, so it matches both sub-second rows, while pd.to_datetime produces an exact timestamp.
Also, using .loc[] is better in this case.
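For comparison, a sketch against the df built above showing both behaviors as the question reports them (values are arbitrary):
# A plain string triggers partial string indexing at second precision,
# so both 22:00:02.000 and 22:00:02.500 are updated.
df.loc['2019-01-05 22:00:02', 'Target'] = 99
# An exact Timestamp updates only the one matching row.
df.loc[pd.to_datetime('2019-01-05 22:00:02.000'), 'Target'] = 57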
I need to filter out data with specific hours. The DataFrame method between_time seems to be the proper way to do that; however, it only works on the index column of the dataframe, but I need to keep the data in its original format (e.g. pivot tables will expect the datetime column under its proper name, not as the index).
This means that each filter looks something like this:
df.set_index(keys='my_datetime_field').between_time('8:00','21:00').reset_index()
Which implies that there are two reindexing operations every time such a filter is run.
Is this a good practice or is there a more appropriate way to do the same thing?
Create a DatetimeIndex, but store it in a variable, not the DataFrame.
Then call its indexer_between_time method. This returns an integer array which can then be used to select rows from df using iloc:
import pandas as pd
import numpy as np
N = 100
df = pd.DataFrame(
    {'date': pd.date_range('2000-1-1', periods=N, freq='H'),
     'value': np.random.random(N)})
index = pd.DatetimeIndex(df['date'])
df.iloc[index.indexer_between_time('8:00','21:00')]
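An alternative sketch that avoids building a separate index is to compare the column's time-of-day directly; the boundaries here mirror between_time's default of including both endpoints:
from datetime import time

t = df['date'].dt.time
df[(t >= time(8, 0)) & (t <= time(21, 0))]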