Fit Data in Pandas DataFrame - python

I am querying a database for a few variables from an experiment, one at a time and storing the data in a Pandas DataFrame. I can get the data that I need, looks as below for instance:
file time variableid data
0 1 1503657 1 11
1 1 1503757 1 22
There is data for several variables that I will be grabbing like this, and then I will be combining them into a single DataFrame to be output to a csv. Each variable's data column will be added as a new column with the corresponding name of the variable (as the file_id should always be the same). The time column values might be different (one DF could be longer than the other, the data wasn't sampled at all of the same times, etc), but if I merge the tables on the time (and file) column, then any discrepancies are filled in with NaN (and I will fill them in with DF.fillna(0)) and the DF can be resorted by the time.
What I need though is a way to filter the data so that it fits a certain rate, such as every 100 milliseconds (1503700,1503800,...). The datapoint itself doesn't have to fit that rate exactly (and in fact the data rarely falls on a time that ends in 00 for instance), but it should be the closest matching data for that time (it could be the closest before or after that time actually, as long as it is consistent throughout).
I thought about iterating over all the values in the time column and adding the row with the closest time one by one (I would first create a blank DF with the desired times), but there are sometimes 50,000+ rows in a sample table. I found an answer about interpolating (link below), but I don't really want to add or modify any of the data itself, just pull the rows that most closely match the rate that I want to sample the data (one reason is some of the data is binary and I wouldn't want to end up with something like 0.5 because the before desired time and after desired time values were 0 and 1). Any help is greatly appreciated, thanks.
combining pandas dataframes of different sampling rates

Related

In python pandas, How do you eliminate rows of data that fail to meet a condition of grouped data?

I have a data set that contains hourly data of marketing campaigns. There are several campaigns and not all of them are active during the 24 hours of the day. My goal is to eliminate all rows of active hour campaigns where I don't have the 24 data rows of a single day.
The raw data contains a lot of information like this:
Original Data Set
I created a dummy variable with ones to be able to count single instance of rows. This is the code I applied to be able to see the results I want to get.
tmp = df.groupby(['id','date']).count()
tmp.query('Hour' > 23)
I get the following results:
Results of two lines of code
These results illustrate exactly the data that I want to keep in my data frame.
How can I eliminate the data per campaign per day that does not reach 24? The objective is not the count but the real data. Therefore ungrouped from what I present in the second picture.
I appreciate the guidance.
Use transform to broadcast the count over all rows of your dataframe the use loc as replacement of query:
out = df.loc[df.groupby(['id', 'date'])['Hour'].transform('count')
.loc[lambda x: x > 23].index]
drop the data you don't want before you do the groupby
you can use .loc or .drop, I am unfamiliar with .query

Pandas convert timeframes to indexes

can you please suggest me an easy way to convert time periods to the corresponding indexes?
I have a function that picks entries from data frames based on numerical indexes (from 10th to 20th row) that I can not change. At the same time my data frame has time indexes and I have picked parts of it based on timestamps. How to convert those timestamps to the corresponding numerical indexes?
Thanks a lot
Alex
Adding some examples:
small_df.index[1]
Out[894]: Timestamp('2019-02-08 07:53:33.360000')
small_df.index[10]
Out[895]: Timestamp('2019-02-08 07:54:00.149000') # instead of time stamps.
These are the time period I want to pick from a second data frame that has time indexing as well. But I want to do that with numerical indexing
That means then
1. Find which numerical indexes correspond to the time period above
Based on the comment above this might be quite close on what I need:
start=second_dataframe.index.get_loc(pd.Timestamp(small_df.index[1]))
end=second_dataframe.index.get_loc(pd.Timestamp(small_df.index[10]))
picked_rows= second_dataframe[start:end]
Is there a better way to do that?
I believe you need Index.get_loc if need position:
small_df.index.get_loc(pd.Timestamp('2019-02-08 07:53:33.360000'))
1
EDIT: If values always matched, is possible get timestamp form first and extract second rows by DataFrame.loc:
start = small_df.index[1]
end = small_df.index[10]
picked_rows = second_dataframe.loc[start:end]
OrL
start=pd.Timestamp(small_df.index[1])
end=pd.Timestamp(small_df.index[10])
picked_rows = second_dataframe.loc[start:end]

Pandas Dataframes merging thru iterations. How to avoid lists and rows of headers

I code just once in a while and I am super basic at the moment. Might be a silly question, but it got me stuck in for a bit too much now.
Background
I have a function (get_profiles) that plots points every 5m along one transect line (100m long) and extracts elevation (from a geotiff).
The arguments are:
dsm (digital surface model)
transect_file (geopackage, holds many LineStrings with different transect_ID)
transect_id (int, extracted from transect_file)
step (int, number of meters to extract elevation along transect lines)
The output for one transect line is a dataframe like in the picture, which is what I expected, and I like it!
However, the big issue is when I iterate the function over the transect_ids (the transect_files has 10 Shapely LineStrings), like this:
tr_list = np.arange(1,transect_file.shape[0]-1)
geodb_transects= []
for i in tr_list:
temp=get_profiles(dsm,transect_file,i,5)
geodb_transects.append(temp)
I get a list. It might be here the error, but I don't know how to do in another way.
type(geodb_transects)
output:list
And, what's worse, I get headers (distance, z, tr_id, date) every time a new iteration starts.
How to get a clean pandas dataframe, just like the output of 1 iteration (20rows) but with all the tr_id chunks of 20row each aligned and without headers?
If your output is a DataFrame then you’re simply looking to concatenate the incremental DataFrame into some growing DataFrame.
It’s not the most efficient but something like
import pandas
df = pandas.DataFrame()
for i in range(7) :
df = df.concat( df_ret_func(i))
You may also be interested in the from_records function if you have a list of elements that are all records of the same form and can be converted into the rows of a DataFrame.

pandas inserting rows in a monotonically increasing dataframe using itertuples

I've been searching for a solution to this for a while, and I'm really stuck! I have a very large text file, imported as a panda dataframe containing just two columns but with hundreds of thousands to millions of rows. The columns contain packet dumps: one is the data of the packets formatted as ascii representations of monotonically increasing integers, and the second the packet time.
I want to go through this dataframe, and make sure that the dataframe is monotonically increasing, and if there are missing data, to insert a new rows in order to make the list monotonically increasing. i.e the 'data' column should be filled in with the appropriate value but the time should be changed to 'NaN' or 'NULL', etc.
The following is a sample of the data:
data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303400 1527986052.506439335
So I have two questions:
1) I've been trying to loop through the dataframe using itertuples to try to get the next row do a comparison with the current row and if the difference s more than the 100 to add a new row, but unfortunately I've struggled with this since, there doesn't seem to be a good way to retreive the row after the one called.
2) Is there a better way (faster) way to do this other than the way I've proposed?
This may be trivial, though I've really struggled with it. Thank you in advance for your help.
A problem at a time. You can do a verbatim check df.data.is_monotonic_increasing.
Inserting new indices: it is better to go the other way around. You already know the index you want. It is given by range(min_val, max_val+1, 100). You can create a blank DataFrame with this index and update it using your data.
This may be memory intensive so you may need to go over your data in chunks. In that case, you may need to provide index range ahead of time.
import pandas as pd
# test data
df = pd.read_csv(
pd.compat.StringIO(
"""data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303500 1527986052.506439335"""
),
sep=r" +",
)
# check if the data is increasing
assert df.data.is_monotonic_increasing
# desired index range
rng = range(df.data.iloc[0], df.data.iloc[-1] + 1, 100)
# blank frame with full index
df2 = pd.DataFrame(index=rng, columns=["frame_time_epoch"])
# update with existing data
df2.update(df.set_index("data"))
# result
# frame_time_epoch
# 303030303030303000 1.52799e+09
# 303030303030303100 1.52799e+09
# 303030303030303200 1.52799e+09
# 303030303030303300 1.52799e+09
# 303030303030303400 NaN
# 303030303030303500 1.52799e+09
Just for examination: Did you try sth like
delta = df['data'].diff()
delta[delta>0]
delta[delta<100]

Dataframe - Sum Arrays in Cell - Memory Issue

I have a very large dataframe (close to 1 million rows), which has a couple of meta data columns and one single column that contains a long string of triples. One string could look like this:
0,0,123.63;10,360,2736.11;30,270,98.08;...
That is, three values separated by comma and then separated by semicolon. Let us refer to the three values as IN, OUT, MEASURE. Effectively i want to group my data by the original columns + the IN & OUT columns and then sum over the MEASURE column. Since each long string contains roughly 30 triples my dataframe would grow to be ~30 million rows if i simply unstacked the data. Obviously this is not feasible.
So given a set of columns (which may in- or exclude the IN & OUT columns) over which I want to group and then sum my MEASURE data, how would I efficiently strip out the relevant data and sum everything up without blowing up my memory?
My current solution simply loops over each row and then over each triple and keeps a running total of each group I specified. This is very slow, so I am looking for something faster, perhaps vectorised. Any help would be appreciated.
Edit: Sample data below (columns separated by pipe)
DATE|REGION|PRIORITY|PARAMETERS
10-Oct-2016|UK|High|0,0,77.82;30,90,7373.70;
10-Oct-2016|US|Low|0,30,7.82;30,90,733.70;
11-Oct-2016|UK|High|0,0,383.82;40,90,713.75;
12-Oct-2016|NA|Low|40,90,937.11;30,180,98.23;
where PARAMETERS has the form 'IN,OUT,MEASURE;IN,OUT,MEASURE;...'
I basically want to (as an example) create a pivot table where
values=MEASURE
index=DATE, IN
columns=PRIORITY

Categories