pandas assigns column names randomly - python

I extracted some dates and temperature values from an XML file and wanted to make a DataFrame out of them. Inside a loop I defined the variables date and temperature and appended their values to a list defined outside the loop (placeholder). Then I built a DataFrame from that list and assigned the column names directly in the constructor. But I realized that every time I run my code the column names get assigned correctly or incorrectly at random.
Here is my code:
placeholder = []
for timeserie in timeseries:
    date = re.findall('<entryisIntraday\D*(\d*.\d*.\d*)', timeserie)
    temperature = re.findall('<value>(.*)<\/value>', timeserie)[0]
    placeholder.append([date, temperature])
print(placeholder)

df = pd.DataFrame(placeholder, columns={"DATE", "TEMP"})
print(df)
After running the code, sometimes the result looks like this:
[['2019-10-29', '4.4'], ['2019-10-30', '3.6'], ['2019-10-31', '2.1'] ...
TEMP DATE
0 2019-10-29 4.4
1 2019-10-30 3.6
2 2019-10-31 2.1
and sometimes like this:
[['2019-10-29', '4.4'], ['2019-10-30', '3.6'], ['2019-10-31', '2.1'], ...
DATE TEMP
0 2019-10-29 4.4
1 2019-10-30 3.6
2 2019-10-31 2.1
I didn't have this problem when I assigned the column names after building the DataFrame:
df = pd.DataFrame(placeholder)
df = df.rename(columns={0: "DATE", 1: "TEMP"})
How can I solve this problem?

The columns argument of the DataFrame constructor should be a list, not a set. {"DATE", "TEMP"} is a set literal, and Python sets have no defined order, so the two names land on the columns in an arbitrary order from run to run:
df = pd.DataFrame(placeholder, columns=["DATE", "TEMP"])
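For reference, a minimal sketch of the corrected snippet (my addition; it assumes re, pandas and the timeseries list from the question, and takes the first regex match for the date so each row holds plain strings, matching the printed list above):
import re
import pandas as pd

placeholder = []
for timeserie in timeseries:  # timeseries comes from the XML parsing in the question
    # findall returns a list of matches; take the first one, as is done for temperature
    date = re.findall(r'<entryisIntraday\D*(\d*.\d*.\d*)', timeserie)[0]
    temperature = re.findall(r'<value>(.*)</value>', timeserie)[0]
    placeholder.append([date, temperature])

# a list (not a set) keeps the column names in the order they are written
df = pd.DataFrame(placeholder, columns=["DATE", "TEMP"])
print(df)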

Related

How to interpolate only over a specific window?

I have a dataset with a weekly index, and a list of dates that I need to get interpolated data for. For example, I have the following df with weekly aggregation:
data value
1/01/2021 10
7/01/2021 10
14/01/2021 10
28/01/2021 10
and a list of dates that do not coincide with the df indexed dates, for example:
list_dates = [12/01/2021, 13/01/2021 ...]
I need to get what the interpolated values would be for every date in list_dates, but within a given window (for example: using only 4 values in the df to calculate the interpolation, split between before and after, i.e. the 2 closest dates before the list date and the 2 closest dates after it).
To get the interpolated value for the list date 12/01/2021, I would need to use:
1/1/2021
7/1/2021
14/1/2021
28/1/2021
The output would then be:
data value
1/01/2021 10
7/01/2021 10
12/01/2021 10
13/01/2021 10
14/01/2021 10
28/01/2021 10
I have successfully coded an example of this, but it fails when there are multiple consecutive NaNs (for example 12/01 and 13/01). I also can't concat an interpolated value before running the next one in the list, as that would use the interpolated date to calculate the new interpolated date (for example, using 12/01 to calculate 13/01).
Any advice on how to do this?
Use interpolate to get the expected outcome, but first you have to prepare your DataFrame as shown below.
I slightly modified your input data to demonstrate interpolation with a DatetimeIndex (method='time'):
# Input data
df = pd.DataFrame({'data': ['1/01/2021', '7/01/2021', '14/01/2021', '28/01/2021'],
                   'value': [10, 10, 17, 10]})
list_dates = ['12/01/2021', '13/01/2021']

# Conversion of dates
df['data'] = pd.to_datetime(df['data'], format='%d/%m/%Y')
new_dates = pd.to_datetime(list_dates, format='%d/%m/%Y')

# Set datetime column as index and append new dates
df = df.set_index('data')
df = df.reindex(df.index.append(new_dates)).sort_index()

# Interpolate with method='time'
df['value'] = df['value'].interpolate(method='time')
Output:
>>> df
value
2021-01-01 10.0
2021-01-07 10.0
2021-01-12 15.0 # <- time interpolation
2021-01-13 16.0 # <- time interpolation
2021-01-14 17.0 # <- changed from 10 to 17
2021-01-28 10.0
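If you then only need the values for the dates in list_dates, they can be picked out of the interpolated result; a small follow-up sketch (my addition), assuming the df and new_dates built above:
# select just the newly added, interpolated dates
print(df.loc[new_dates])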

Split dataframe into many sub-dataframes based on timestamp

I have a large csv with the following format:
timestamp,name,age
2020-03-01 00:00:01,nick
2020-03-01 00:00:01,john
2020-03-01 00:00:02,nick
2020-03-01 00:00:02,john
2020-03-01 00:00:04,peter
2020-03-01 00:00:05,john
2020-03-01 00:00:10,nick
2020-03-01 00:00:12,john
2020-03-01 00:00:54,hank
2020-03-01 00:01:03,peter
I load the csv into a dataframe with:
df = pd.read_csv("/home/test.csv")
and then I want to split it into multiple dataframes, one per 2-second interval. For example:
df1 contains:
2020-03-01 00:00:01,nick
2020-03-01 00:00:01,john
2020-03-01 00:00:02,nick
2020-03-01 00:00:02,john
df2 contains:
2020-03-01 00:00:04,peter
2020-03-01 00:00:05,john
and so on.
I managed to split the timestamps with the command below:
full_idx = pd.date_range(start=df['timestamp'].min(), end = df['timestamp'].max(), freq ='0.2T')
but how can I store these split dataframes? How can I split a dataset based on timestamps into multiple dataframes?
That question can probably help us: Pandas: Timestamp index rounding to the nearest 5th minute
import numpy as np
import pandas as pd

df = pd.read_csv("test.csv")
df['timestamp'] = pd.to_datetime(df['timestamp'])

ns2sec = 2 * 1000000000  # 2 seconds in nanoseconds
# next we round our timestamp to every 2nd second with rounding down
timestamp_rounded = df['timestamp'].astype(np.int64) // ns2sec
df['full_idx'] = pd.to_datetime((timestamp_rounded - timestamp_rounded % 2) * ns2sec)

# store array for each unique value of your idx
store_array = []
for value in df['full_idx'].unique():
    store_array.append(df[df['full_idx'] == value][['timestamp', 'name', 'age']])
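As a side note (not part of the original answer), the same grouping can be written more compactly with groupby once the full_idx column exists; a minimal sketch assuming the df prepared above:
# one sub-dataframe per rounded timestamp, in chronological order
store_array = [group[['timestamp', 'name', 'age']]
               for _, group in df.groupby('full_idx', sort=True)]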
How about .resample()?
#first loading your data
>>> import pandas as pd
>>>
>>> df = pd.read_csv('dates.csv', index_col='timestamp', parse_dates=True)
>>> df.head()
name age
timestamp
2020-03-01 00:00:01 nick NaN
2020-03-01 00:00:01 john NaN
2020-03-01 00:00:02 nick NaN
2020-03-01 00:00:02 john NaN
2020-03-01 00:00:04 peter NaN
#resampling it at a frequency of 2 seconds
>>> resampled = df.resample('2s')
>>> type(resampled)
<class 'pandas.core.resample.DatetimeIndexResampler'>
#iterating over the resampler object and storing the sliced dfs in a dictionary
>>> df_dict = {}
>>> for i, (timestamp, df) in enumerate(resampled):
...     df_dict[i] = df
>>> df_dict[0]
name age
timestamp
2020-03-01 00:00:01 nick NaN
2020-03-01 00:00:01 john NaN
Now for some explanation...
resample() is great for rebinning DataFrames based on time (I use it often for downsampling time series data), but it can also be used simply to cut up the DataFrame, as you want to do. Iterating over the resampler object produced by df.resample() yields tuples of (name of the bin, df corresponding to that bin): e.g. the first tuple is (timestamp of the start of the first bin, data corresponding to the first 2 seconds). So to get the DataFrames out, we can loop over this object and store them somewhere, like a dict.
Note that this will produce every 2-second interval from the start to the end of the data, so many will be empty given your data. But you can add a step to filter those out if needed.
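A minimal sketch of such a filtering step (my addition), assuming the resampled object from above:
# keep only the 2-second bins that actually contain rows
non_empty = [group for timestamp, group in resampled if not group.empty]
df_dict = {i: group for i, group in enumerate(non_empty)}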
Additionally, you could manually assign each sliced DataFrame to a variable, but this would be cumbersome (you would probably need to write a line for each 2 second bin, rather than a single small loop). Rather with a dictionary, you can still associate each DataFrame with a callable name. You could also use an OrderedDict or list or whatever collection.
A couple points on your script:
setting freq to "0.2T" is 12 seconds (0.2 * 60); you can instead do freq="2s"
The example df1 and df2 are "out of phase": by that I mean one is binned into 2 seconds starting on odd numbers (1-2 seconds), while the other starts on evens (4-5 seconds). So the date_range you mentioned wouldn't create those bins; it would create dfs from either 0-1s, 2-3s, 4-5s, ... OR 1-2s, 3-4s, 5-6s, ... depending on which timestamp it started on.
For the latter point, you can use the base argument of .resample() (replaced by the offset/origin arguments in newer pandas versions) to set the "phase" of the resampling. So in the case above, base=0 would start bins on even numbers, and base=1 would start bins on odds.
This is assuming you are okay with that type of binning - if you really want 1-2 seconds and 4-5 seconds to be in different bins, you would have to do something more complicated I believe.

Python / Pandas: How to create an empty multi-index DataFrame, and then start filling it?

I would like to store a summary of a local set of DataFrames in a "meta DataFrame", using pd.MultiIndex.
Basically, the row axis has two levels, and so does the column axis.
In the class managing the set of DataFrames, I define this "meta DataFrame" as a class variable.
import pandas as pd
row_axis = pd.MultiIndex(levels=[[],[]], codes=[[],[]], names=['Data', 'Period'])
column_axis = pd.MultiIndex(levels=[[],[]], codes=[[],[]], names=['Data', 'Extrema'])
MD = pd.DataFrame(index=row_axis, columns=column_axis)
It seems to work.
MD.index
>>> MultiIndex([], names=['Data', 'Period'])
MD.columns
>>> MultiIndex([], names=['Data', 'Extrema'])
Now, each time I process an individual DataFrame id, I want to update this "meta DataFrame" accordingly. id has a DatetimeIndex with a '5m' period.
id.index[0]
>>> Timestamp('2020-01-01 08:00:00')
id.index[-1]
>>> Timestamp('2020-01-02 08:00:00')
For instance, I want to keep its first and last index values in MD.
MD.loc[[('id', '5m')],[('Timestamp', 'First')]] = id.index[0]
MD.loc[[('id', '5m')],[('Timestamp', 'Last')]] = id.index[-1]
This doesn't work; I get the following error message:
TypeError: unhashable type: 'list'
In the end, the result I would like is to have the following type of info in MD (I have other id DataFrames with different periods):
Timestamp
First Last
id 5m 2020-01-01 08:00:00 2020-01-02 08:00:00
10m 2020-01-05 08:00:00 2020-01-06 18:00:00
Ultimately, I will also keep the min and max of some columns in id.
For instance, if id has a column 'Temperature':
Timestamp Temperature
First Last Min Max
id 5m 2020-01-01 08:00:00 2020-01-02 08:00:00 -2.5 10
10m 2020-01-05 08:00:00 2020-01-06 18:00:00 4 15
These values will be recorded when I record id.
I am aware that initializing a DataFrame cell by cell is not time efficient, but it will not be done that often.
Besides, I don't see how I can manage this organization of information in a dict, which is why I am considering doing it with a multi-level DataFrame.
I will then dump it to a csv file to store this "meta data".
Please, what is the right way to initialize each of these values in MD?
I thank you for your help!
Bests,
Instead of filling an empty DataFrame, you can store the data in a dict of dicts. A MultiIndex uses tuples as its index values, so we make the keys of each dictionary tuples.
The outer dictionary uses the column MultiIndex tuples as keys; its values are inner dictionaries with the row MultiIndex tuples as keys and the cell values as values.
d = {('Score', 'Min'): {('id1', '5m'): 72, ('id1', '10m'): -18},
     ('Timestamp', 'First'): {('id1', '5m'): 1, ('id1', '10m'): 2},
     ('Timestamp', 'Last'): {('id1', '5m'): 10, ('id1', '10m'): 20}}
# outer keys:   column MultiIndex labels
# inner keys:   row MultiIndex labels
# inner values: cell values

pd.DataFrame(d)
Score Timestamp
Min First Last
id1 5m 72 1 10
10m -18 2 20
Creating that dict will depend upon how you get the values. You can extend a dict with update().
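For the question's use case, a minimal sketch of building such a dict incrementally with update (my addition; the id DataFrame, its 'Temperature' column and the ('id', '5m') label come from the question, everything else is illustrative):
d = {}
# record first/last index values and temperature extrema for one processed DataFrame
d.setdefault(('Timestamp', 'First'), {}).update({('id', '5m'): id.index[0]})
d.setdefault(('Timestamp', 'Last'), {}).update({('id', '5m'): id.index[-1]})
d.setdefault(('Temperature', 'Min'), {}).update({('id', '5m'): id['Temperature'].min()})
d.setdefault(('Temperature', 'Max'), {}).update({('id', '5m'): id['Temperature'].max()})

MD = pd.DataFrame(d)
MD.to_csv('meta_data.csv')  # dump the "meta data", as mentioned in the question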

Creating new rows and adding them to empty data frame

I have a data frame (df1) of start and end dates that looks like this.
Start Date End Date
1875-01-01 1877-09-30
1881-07-01 1886-03-31
1888-01-01 1889-06-30
1890-10-01 1890-12-31
.
.
.
2016-10-01 2018-12-31
I have a different data frame (df2) which consists of a daily time series. For example:
Date Value
1875-01-01 7.21
1875-01-02 7.23
1875-01-03 7.22
1875-01-04 7.12
.
.
.
2018-12-31 3.12
I set dates as an index for df2.
I am trying to make a stats table based on df2 using df1.
First I created an empty data frame to add the values. For example,
outputtable = pd.DataFrame(columns=('Max', 'Min', 'Ave'))
for i in df1.index:
    try:
        df3 = df2.loc[df1['Start Date'][i]:df1['End Date'][i]]
        minimum = df3['Value'].min()
        maximum = df3['Value'].max()
        average = df3['Value'].mean()
        outputtable[-1] = [minimum, maximum, average]
    except:
        pass
I used try because some of the dates in df1 are not in df2. In that case, I want the code to ignore that row and move on to the next set of dates.
I want the code to go through every row of df1, do the stats (min, max and mean) and put them into outputtable for further calculations. So far the code above is not working. Help would be much appreciated.
Desired output
Start Date End Date Min Max Ave
1875-01-01 1877-09-30 7 8 7.2
1881-07-01 1886-03-31 1 4 2.2
1888-01-01 1889-06-30 2 6.5 3
1890-10-01 1890-12-31 3 5 4.2
.
.
.
2016-10-01 2018-12-31 1 2 1.7
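A minimal sketch of one way to build the desired table (my addition, not from the original thread; it assumes df2 has a sorted DatetimeIndex and a 'Value' column as described):
rows = []
for i in df1.index:
    # slice the daily series between the start and end dates of this row
    df3 = df2.loc[df1['Start Date'][i]:df1['End Date'][i]]
    if df3.empty:
        continue  # skip ranges with no matching dates in df2
    rows.append({'Start Date': df1['Start Date'][i],
                 'End Date': df1['End Date'][i],
                 'Min': df3['Value'].min(),
                 'Max': df3['Value'].max(),
                 'Ave': df3['Value'].mean()})

outputtable = pd.DataFrame(rows, columns=['Start Date', 'End Date', 'Min', 'Max', 'Ave'])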

merge different dataframes and add other columns from base dataframe

I am trying to merge different dataframes.
Assume these two melted dataframes.
melted_dfs[0]=
Date Code delta_7
0 2014-04-01 GWA 0.08
1 2014-04-02 TVV -0.98
melted_dfs[1] =
Date Code delta_14
0 2014-04-01 GWA nan
1 2014-04-02 XRP -1.02
I am looking to merge both of the above dataframes along with the Volume & GR columns from my base dataframe.
base_df =
Date Code Volume GR
0 2014-04-01 XRP 74,776.48 482.76
1 2014-04-02 TRR 114,052.96 460.19
I tried to use Python's built-in reduce function by converting all dataframes into a list, but it throws an error:
abt = reduce(lambda x,y: pd.merge(x,y,on=['Date', 'Code']), feature_dfs)
# feature_dfs is a list which contains all the above dfs.
ValueError: You are trying to merge on object and datetime64[ns] columns. If you wish to proceed you should use pd.concat
Any help is appreciated. Thanks!
This should work. As the error states, some of the DataFrames' Date columns are not in datetime format:
feature_dfs=[x.assign(Date=pd.to_datetime(x['Date'])) for x in feature_dfs]
abt = reduce(lambda x,y: pd.merge(x,y,on=['Date', 'Code']), feature_dfs)
One of your dataframes in feature_dfs probably has a Date column with a non-datetime dtype.
Try printing the datatypes and index of the DataFrames:
for i, df in enumerate(feature_dfs):
    print('DataFrame index: {}'.format(i))
    print(df.info())
    print('-' * 72)
I would assume that one of the DataFrames is going to show a line like:
Date X non-null object
indicating that you don't have a datetime datatype for Date. That DataFrame is the culprit, and you will have its index from the print above.
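Once the culprit is identified, converting only that DataFrame's Date column (the same idea as the first answer) should let the merge go through; a small sketch, where culprit_idx stands for the hypothetical index reported by the loop above:
from functools import reduce  # reduce lives in functools in Python 3

culprit_idx = 1  # hypothetical: index printed by the diagnostic loop
feature_dfs[culprit_idx]['Date'] = pd.to_datetime(feature_dfs[culprit_idx]['Date'])

abt = reduce(lambda x, y: pd.merge(x, y, on=['Date', 'Code']), feature_dfs)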
