Append multiple columns into two columns python - python

I have a csv file which contains approximately 100 columns of data. Each column represents temperature values taken every 15 minutes throughout the day for each of the 100 days. The header of each column is the date for that day. I want to convert this into two columns, the first being the date time (I will have to create this somehow), and the second being the temperatures stacked on top of each other for each day.
My attempt:
import numpy as np

with open("original_file.csv") as ofile:
    stack_vec = []
    next(ofile)  # skip the header row
    for line in ofile:
        columns = line.split(',')  # get all the columns
        for i in range(len(columns)):
            stack_vec.append(columns[i])
np.savetxt("converted.csv", stack_vec, delimiter=",", fmt='%s')
In my attempt, I am trying to create a new vector with each column appended to the end of it. However, the code is extremely slow and likely not working! Once I have this step figured out, I then need to take the date from each column and add 15 minutes to the date time for each row. Any help would be greatly appreciated.

If I got this correct, you have a csv with 96 rows and 100 columns and want to stack it day after day into one vector with 9600 entries, right?
An easy approach would be to use numpy:
import numpy as np
# skip_header=1 skips the row of dates, which would otherwise be read as NaN
x = np.genfromtxt('original_file.csv', delimiter=',', skip_header=1)
data = x.ravel(order='F')
Note that numpy is a third-party library, but it is the go-to library for numerical work.
The genfromtxt line reads the csv into an ndarray, which is like a matrix (even though it behaves differently for some mathematical operations).
Then ravel flattens it into a vector. order='F' walks through the array column by column, so the days are stacked one after another. (Leave it as default / blank if you want time point after time point.)
For your date problem see "How can I make a python numpy arange of datetime"; I don't think I could give a better example.
Once you have these two arrays, you can turn each into a column with data.reshape(9600, 1) and then stack them with np.concatenate([dates, data], axis=1), with dates being your date vector. Note that a single numpy array has one dtype, so the dates and temperatures may need to be cast to a common type (for example object or string) first.
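As a rough sketch of the whole conversion, here is one way to build both columns. It swaps in pandas for the final table (so the datetimes and the temperatures can keep their own dtypes) and uses a tiny made-up file with 2 days and 4 readings per day in place of original_file.csv; the output column names are assumptions as well.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical miniature file: 2 days (columns), 4 readings per day (rows)
csv_text = "2021-01-01,2021-01-02\n1.0,5.0\n2.0,6.0\n3.0,7.0\n4.0,8.0\n"

df = pd.read_csv(io.StringIO(csv_text))
n_rows = len(df)

# Column-major flattening: all of day 1, then all of day 2, ...
temps = df.to_numpy().ravel(order="F")

# One timestamp per reading: the column's date plus a multiple of 15 minutes
offsets = pd.to_timedelta(np.arange(n_rows) * 15, unit="min")
stamps = np.concatenate(
    [(pd.Timestamp(day) + offsets).to_numpy() for day in df.columns]
)

out = pd.DataFrame({"datetime": stamps, "temperature": temps})
print(out)
```

With the real file, pd.read_csv("original_file.csv") would replace the StringIO call, and out.to_csv("converted.csv", index=False) would write the result.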

Related

How to select every 4th row in a pandas dataframe and calculate the rolling average

I have a pandas dataframe that you can see in the screenshot. The dataframe has a time resolution of 15 minutes (it is generation data). I would like to reduce this time resolution to 1 hour, meaning that I should take every 4th row, and the value in every 4th row should be the average of the last 4 rows (including this one). So it should be a rolling average with non-overlapping horizons.
I tried the following for one column (wind offshore):
df_generation = pd.read_csv("C:/Users/Desktop/Data/generation_data.csv", sep =",")
df_generation_2 = df_generation
df_generation_2['Wind Offshore Average'] = df_generation_2['Wind Offshore'].rolling(4).mean()
But this is not what I really want. As you can see in the screenshot, my code just created a further column with the average of the last 4 entries for every timeslot. Here the rolling average has overlapping horizons. What I want is a new dataframe that only has an entry after every hour (after 4 timeslots of the original array). Do you have an idea how I can do that? I'd appreciate every comment.
Judging by your index, the .resample method is what you are looking for (the documentation has many examples for specific uses): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
For example:
new = df_generation['Wind Offshore'].resample('1H').mean()
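To make the difference from rolling(4) concrete, here is a small sketch on made-up data. It assumes the frame has a DatetimeIndex (which resample requires); the values are invented, and "60min" is used instead of "1H" only because recent pandas versions deprecate the uppercase "H" alias in favour of "h". A purely positional alternative with groupby is included for frames without a time index.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two hours of 15-minute readings
idx = pd.date_range("2021-01-01", periods=8, freq="15min")
df = pd.DataFrame(
    {"Wind Offshore": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]}, index=idx
)

# Time-based: one value per hour, each the mean of that hour's four readings
hourly = df["Wind Offshore"].resample("60min").mean()

# Position-based alternative: mean of each consecutive block of 4 rows
blocks = df["Wind Offshore"].groupby(np.arange(len(df)) // 4).mean()

print(hourly)
print(blocks)
```

By default resample labels each hourly bin with its start time, so the first value is stamped 00:00 rather than 00:45; the label and closed parameters of resample change that if needed.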

Partition matrix into smaller matrices based on multiple values

So I have this giant matrix (~1.5 million rows x 7 columns) and am trying to figure out an efficient way to split it up. For simplicity, I'll work with a much smaller matrix as an example. The 7 columns consist of (in this order): item number, an x and y coordinate, 1st label (non-numeric), data #1, data #2, and 2nd label (non-numeric). So using pandas, I've imported from an excel sheet my matrix called A that looks like this:
What I need to do is partition this based on both labels (i.e. so I have one matrix that is all the 13G + Aa together, another matrix that is 14G + Aa, and another one that is 14G + Ab together -- this would have me wind up with 3 separate 2x7 matrices). The reason for this is because I need to run a bunch of statistics on the dataset of numbers of the "Marker" column for each individual matrix (e.g. in this example, break the 6 "marker" numbers into three sets of 2 "marker" numbers, and then run statistics on each set of two numbers). Since there are going to be hundreds of these smaller matrices on the real data set I have, I was trying to figure out some way to make the smaller matrices be labeled something like M1, M2, ..., M500 (or whatever number it ends up being) so that way later, I can use some loops to apply statistics to each individual matrix all at once without having to write it 500+ times.
What I've done so far is to use pandas to import my data set into python as a matrix with the command:
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\path\cancerdata.csv")
A = df.to_numpy() #Convert excel sheet to matrix (as_matrix() was removed in pandas 1.0)
A = np.delete(A, (0),axis=0) #Delete header row
Unfortunately, I haven't come across many resources for how to do what I want, which is why I wanted to ask here to see if anyone knows how to split up a matrix into smaller matrices based on multiple labels.
Your question has many implications, so instead of giving you a straight answer I'll try to give you some pointers on how to tackle this problem.
First off, don't transform your DataFrame into a Matrix. DataFrames are well-optimised for slicing and indexing operations (a Pandas Series object is in reality a fancy Numpy array anyway), so you only lose functionality by converting it to a Matrix.
You could probably convert your label columns into a MultiIndex. This way, you'll be able to access slices of your original DataFrame using df.loc, with a syntax similar to df.loc[label1].loc[label2].
A MultiIndex may sound confusing at first, but it really isn't. Try executing this code block and see for yourself what the resulting DataFrame looks like:
df = pd.read_csv(r"C:\path\cancerdata.csv")
labels01 = df["Label 1"].unique()
labels02 = df["Label 2"].unique()
# Move both label columns into a two-level MultiIndex
df = df.set_index(["Label 1", "Label 2"]).sort_index()
print(df)
Here, we extracted all unique values in the columns "Label 1" and "Label 2" (they are used for looping further down) and turned the two columns into a MultiIndex. In the df.set_index line, we extracted those columns from the DataFrame - now they act as indices for your other columns. For example, in order to access the DataFrame slice from your original DataFrame whose Label 1 = 13G and Label 2 = Aa, you can simply write:
sliced_df = df.loc["13G"].loc["Aa"]
And perform whatever calculations/statistics you need with it.
Lastly, instead of saving each sliced DataFrame into a list or dictionary, and then iterating over them to perform the calculations, consider rearranging your code so that, as soon as you create your sliced DataFrame, you perform the calculations, save them to an output/results file/DataFrame, and move on to the next slicing operation. Something like:
for L1 in labels01:
    for L2 in labels02:
        if (L1, L2) not in df.index:
            continue  # this label combination does not occur in the data
        sliced_df = df.loc[L1].loc[L2]
        results = perform_calculations(sliced_df)
        save_results(results)
This will both improve memory consumption and performance, which may be important considering your large dataset.
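A different route from the MultiIndex loop above, offered only as a sketch, is to let groupby do the partitioning. The column names ("Label 1", "Label 2", "Marker") and the numbers are assumptions standing in for the real spreadsheet; only label combinations that actually occur produce a group.

```python
import pandas as pd

# Hypothetical stand-in for the real data; column names are assumptions
df = pd.DataFrame({
    "Label 1": ["13G", "13G", "14G", "14G", "14G", "14G"],
    "Marker": [1.0, 3.0, 2.0, 4.0, 10.0, 20.0],
    "Label 2": ["Aa", "Aa", "Aa", "Aa", "Ab", "Ab"],
})

# One row of statistics per (Label 1, Label 2) combination that occurs
stats = df.groupby(["Label 1", "Label 2"])["Marker"].agg(["mean", "std", "count"])

# If the sub-tables themselves are needed, key them by label pair
# instead of naming them M1, M2, ..., M500
pieces = {key: group for key, group in df.groupby(["Label 1", "Label 2"])}

print(stats)
print(pieces[("14G", "Ab")])
```

Keying the pieces by their label pair keeps the link between each sub-table and the labels it came from, which numbered names like M1 would lose.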

Pandas convert timeframes to indexes

Can you please suggest an easy way to convert time periods to the corresponding indexes?
I have a function that picks entries from data frames based on numerical indexes (from 10th to 20th row) that I can not change. At the same time my data frame has time indexes and I have picked parts of it based on timestamps. How to convert those timestamps to the corresponding numerical indexes?
Thanks a lot
Alex
Adding some examples:
small_df.index[1]
Out[894]: Timestamp('2019-02-08 07:53:33.360000')
small_df.index[10]
Out[895]: Timestamp('2019-02-08 07:54:00.149000') # instead of time stamps.
These are the time period I want to pick from a second data frame that has time indexing as well. But I want to do that with numerical indexing
That means then
1. Find which numerical indexes correspond to the time period above
Based on the comment above this might be quite close on what I need:
start=second_dataframe.index.get_loc(pd.Timestamp(small_df.index[1]))
end=second_dataframe.index.get_loc(pd.Timestamp(small_df.index[10]))
picked_rows= second_dataframe[start:end]
Is there a better way to do that?
I believe you need Index.get_loc if you need the position:
small_df.index.get_loc(pd.Timestamp('2019-02-08 07:53:33.360000'))
1
EDIT: If the values always match, it is possible to take the timestamps from the first DataFrame and select the rows of the second one with DataFrame.loc:
start = small_df.index[1]
end = small_df.index[10]
picked_rows = second_dataframe.loc[start:end]
Or:
start=pd.Timestamp(small_df.index[1])
end=pd.Timestamp(small_df.index[10])
picked_rows = second_dataframe.loc[start:end]
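One detail worth checking when switching between the two styles: slicing by label with .loc includes both endpoints, while positional slicing excludes the end. The sketch below uses a made-up second_dataframe with one row per minute; searchsorted is shown as a possible fallback for timestamps that are not exactly present in the index, which get_loc would reject.

```python
import pandas as pd

# Hypothetical frame with a sorted DatetimeIndex, one row per minute
idx = pd.date_range("2019-02-08 07:50", periods=6, freq="min")
second_dataframe = pd.DataFrame({"value": range(6)}, index=idx)

start_ts, end_ts = idx[1], idx[4]

# Label-based slicing includes both endpoints
by_label = second_dataframe.loc[start_ts:end_ts]

# Positional equivalent; iloc excludes the end, hence the + 1
start = second_dataframe.index.get_loc(start_ts)
end = second_dataframe.index.get_loc(end_ts)
by_position = second_dataframe.iloc[start:end + 1]

# For a timestamp that is not in the index, searchsorted returns the
# position at which it would be inserted
pos = second_dataframe.index.searchsorted(pd.Timestamp("2019-02-08 07:52:30"))

print(start, end, pos)
```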

Pandas Dataframes merging thru iterations. How to avoid lists and rows of headers

I code just once in a while and I am super basic at the moment. Might be a silly question, but it got me stuck in for a bit too much now.
Background
I have a function (get_profiles) that plots points every 5m along one transect line (100m long) and extracts elevation (from a geotiff).
The arguments are:
dsm (digital surface model)
transect_file (geopackage, holds many LineStrings with different transect_ID)
transect_id (int, extracted from transect_file)
step (int, number of meters to extract elevation along transect lines)
The output for one transect line is a dataframe like in the picture, which is what I expected, and I like it!
However, the big issue is when I iterate the function over the transect_ids (the transect_files has 10 Shapely LineStrings), like this:
tr_list = np.arange(1, transect_file.shape[0]-1)
geodb_transects = []
for i in tr_list:
    temp = get_profiles(dsm, transect_file, i, 5)
    geodb_transects.append(temp)
I get a list. The error might be here, but I don't know how to do it another way.
type(geodb_transects)
output:list
And, what's worse, I get headers (distance, z, tr_id, date) every time a new iteration starts.
How to get a clean pandas dataframe, just like the output of 1 iteration (20rows) but with all the tr_id chunks of 20row each aligned and without headers?
If your output is a DataFrame then you’re simply looking to concatenate the incremental DataFrame into some growing DataFrame.
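It’s not the most efficient, but something like this (df_ret_func stands for your function):
import pandas as pd
df = pd.DataFrame()
for i in range(7):
    # concat is a pandas function, not a DataFrame method
    df = pd.concat([df, df_ret_func(i)], ignore_index=True)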
You may also be interested in the from_records function if you have a list of elements that are all records of the same form and can be converted into the rows of a DataFrame.
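Since the loop in the question already collects the per-transect DataFrames in a list, a single pd.concat call on that list may be all that is missing. The sketch below uses a hypothetical stub in place of get_profiles (the real function needs the dsm and the transect file); the column names follow the ones mentioned in the question.

```python
import pandas as pd


def get_profiles_stub(tr_id, n_points=3, step=5):
    """Hypothetical stand-in for get_profiles: one small DataFrame per transect."""
    return pd.DataFrame({
        "distance": [k * step for k in range(n_points)],
        "z": [0.0] * n_points,
        "tr_id": tr_id,
    })


# Collect the pieces in a list, then concatenate once at the end
frames = [get_profiles_stub(i) for i in range(1, 4)]
geodb = pd.concat(frames, ignore_index=True)

print(type(geodb))
print(geodb)
```

The result is one DataFrame with a single header and a continuous index; concatenating once also avoids recopying the growing frame on every iteration.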

Fit Data in Pandas DataFrame

I am querying a database for a few variables from an experiment, one at a time and storing the data in a Pandas DataFrame. I can get the data that I need, looks as below for instance:
   file     time  variableid  data
0     1  1503657           1    11
1     1  1503757           1    22
There is data for several variables that I will be grabbing like this, and then I will be combining them into a single DataFrame to be output to a csv. Each variable's data column will be added as a new column with the corresponding name of the variable (as the file_id should always be the same). The time column values might be different (one DF could be longer than the other, the data wasn't sampled at all of the same times, etc), but if I merge the tables on the time (and file) column, then any discrepancies are filled in with NaN (and I will fill them in with DF.fillna(0)) and the DF can be resorted by the time.
What I need though is a way to filter the data so that it fits a certain rate, such as every 100 milliseconds (1503700,1503800,...). The datapoint itself doesn't have to fit that rate exactly (and in fact the data rarely falls on a time that ends in 00 for instance), but it should be the closest matching data for that time (it could be the closest before or after that time actually, as long as it is consistent throughout).
I thought about iterating over all the values in the time column and adding the row with the closest time one by one (I would first create a blank DF with the desired times), but there are sometimes 50,000+ rows in a sample table. I found an answer about interpolating (link below), but I don't really want to add or modify any of the data itself, just pull the rows that most closely match the rate that I want to sample the data (one reason is some of the data is binary and I wouldn't want to end up with something like 0.5 because the before desired time and after desired time values were 0 and 1). Any help is greatly appreciated, thanks.
combining pandas dataframes of different sampling rates
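One possible way to pick the closest existing rows without interpolating, sketched here under assumptions, is pd.merge_asof with direction="nearest" against a grid of the desired times. The sample values, the extra rows, and the sample_time column name are made up for illustration; both frames must be sorted by their time keys and the keys must share a dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the layout shown above
df = pd.DataFrame({
    "file": 1,
    "time": [1503657, 1503757, 1503790, 1503905],
    "variableid": 1,
    "data": [11, 22, 33, 44],
})

# Target grid at the desired rate (every 100 time units)
grid = pd.DataFrame({"time": np.arange(1503700, 1504000, 100, dtype="int64")})

# For each grid time, take the existing row whose time is closest;
# the data values are copied as they are, never averaged or interpolated
sampled = pd.merge_asof(
    grid,
    df.rename(columns={"time": "sample_time"}),
    left_on="time",
    right_on="sample_time",
    direction="nearest",
)

print(sampled)
```

Using direction="backward" or "forward" instead would always take the closest row before or after the target time, and the tolerance parameter can leave a grid time empty when no sample is near enough.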
