resample dataframe on values rather than time / merge dataframes - python

I have two dataframes which I would like to merge. Both dataframes were obtained from measurements, and read in from stored CSV files. Below is the minimal code:
import pandas as pd
df1 = pd.read_csv('file_charge.csv') #battery charging file
print(df1.head())
soc volt
0 0.000052 46.489062
1 0.000246 46.600472
2 0.000537 46.714063
3 0.000833 46.823437
4 0.001125 46.919929
print(len(df1))
3052
df2 = pd.read_csv('file_discharge.csv') #battery discharging file
print(df2.head())
volt soc
0 56.064844 0.999797
1 55.608464 0.999458
2 55.236909 0.999117
3 54.908865 0.998753
4 54.639002 0.998398
print(len(df2))
2962
With timeseries data, I have found that resampling and then using the datetime index to merge/join/concat works great. What I want to do is create an overall file that contains the following:
soc | volt_charge | volt_discharge | volt_average
The issue I see at the moment is that the dataframes are of different length, which I don't know how to easily address. (How) Is it possible to downsample (or even upsample) a dataframe with a numeric index?
So far, my attempts at combining/merging the dataframes have failed. Using pd.merge produces an empty dataframe, whereas pd.merge_ordered gives a df that has only 2 columns (soc and volt) rather than the desired 3 (soc, volt_1, volt_2), from which the additional fourth column (volt_avg = mean(volt_1, volt_2)) could be made.
In graphical terms: it is possible to graph both df1 and df2 on the same x- and y-axes, yet I don't know how the "visual" average of df1 and df2 could be created.
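One way this could be approached (a sketch, not a tested answer to the original question) is to treat soc as the shared "index": build a common soc grid covering the overlap of the two measurements, interpolate each voltage curve onto it with numpy, and then average the two interpolated columns. The grid size of 1000 below is arbitrary; column names follow the question.

import numpy as np
import pandas as pd

# common state-of-charge grid covering the overlap of both measurements
soc_grid = np.linspace(max(df1['soc'].min(), df2['soc'].min()),
                       min(df1['soc'].max(), df2['soc'].max()),
                       1000)

# np.interp expects increasing x-values, so sort each frame by soc first
charge = df1.sort_values('soc')
discharge = df2.sort_values('soc')

combined = pd.DataFrame({
    'soc': soc_grid,
    'volt_charge': np.interp(soc_grid, charge['soc'], charge['volt']),
    'volt_discharge': np.interp(soc_grid, discharge['soc'], discharge['volt']),
})
combined['volt_average'] = combined[['volt_charge', 'volt_discharge']].mean(axis=1)

This is effectively the "visual" average of the two curves: both are evaluated at the same soc values, so the mean is well defined even though the original dataframes have different lengths.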


Speed up operations over Python Pandas dataframes

I would like to speed up a loop over a python Pandas Dataframe. Unfortunately, decades of using low-level languages mean I often struggle to find prepackaged solutions. Note: data is private, but I will see if I can fabricate something and add it into an edit if it helps.
The code has three pandas dataframes: drugUseDF and tempDF, which hold the data, and tempDrugUse, which stores what's been retrieved. I loop over every row of tempDF (there will be several million rows), retrieving the prodcode from each row and then using it to retrieve the corresponding value from the use1 column in drugUseDF. I've added comments to help navigate.
This is the structure of the dataframes:
tempDF
patid eventdate consid prodcode issueseq
0 20001 21/04/2005 2728 85 0
1 25001 21/10/2000 3939 40 0
2 25001 21/02/2001 3950 37 0
drugUseDF
index prodcode ... use1 use2
0 171 479 ... diabetes NaN
1 172 9105 ... diabetes NaN
2 173 5174 ... diabetes NaN
tempDrugUse
use1
0 NaN
1 NaN
2 NaN
This is the code:
dfList = []
# if the drug dataframe contains the use1 column. Can this be improved?
if sum(drugUseDF.columns.isin(["use1"])) == 1:
    # predefine a dataframe to store the results, the same length as the main data dataframe
    tempDrugUse = DataFrame(data=None, index=range(len(tempDF.index)), dtype=str, columns=["use1"])
    # go through each row of the main data dataframe
    for ind in range(len(tempDF)):
        # retrieve the prodcode from the *ind* row of the main data dataframe
        prodcodeStr = tempDF.iloc[ind]["prodcode"]
        # get the corresponding value from the use1 column matching the prodcode column
        useStr = drugUseDF[drugUseDF.loc[:, "prodcode"] == prodcodeStr]["use1"].values[0]
        # update the storing dataframe
        tempDrugUse.iloc[ind]["use1"] = useStr
    print("[DEBUG] End of loop for use1")
    dfList.append(tempDrugUse)
The order of the data matters. I can't retrieve multiple rows by matching the prodcode because each row has a date column. Retrieving multiple rows and adding them to the tempDrugUse dataframe could mean that the rows are no longer in chronological date order.
When trying to combine data from two dataframes you should use merge (similar to JOIN in SQL-like languages). Performance-wise, you should never loop over the rows; use the pandas built-in methods whenever possible. Ordering can be achieved with the sort_values method.
If I understand you correctly, you want to look up use1 by matching prodcode across both tables. You can do this via pd.merge (please note the example data below differs from yours):
tempDF = pd.DataFrame({'patid': [20001, 25001, 25001],
                       'prodcode': [101, 102, 103]})
drugUseDF = pd.DataFrame({'prodcode': [101, 102, 103],
                          'use1': ['diabetes', 'hypertonia', 'gout']})
merged_df = pd.merge(tempDF, drugUseDF, on='prodcode', how='left')
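If only the use1 column is needed, a Series map is a lightweight alternative that also preserves tempDF's original row order (a sketch, assuming prodcode is unique in drugUseDF):

# look up use1 by prodcode without merging; the row order of tempDF is untouched
tempDF['use1'] = tempDF['prodcode'].map(drugUseDF.set_index('prodcode')['use1'])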

Can you assign a dataframe to one dataframe cell? [duplicate]

Hello, I want to store a dataframe in another dataframe's cell.
I have data that looks like this:
I have daily data consisting of date, steps, and calories. In addition, I have minute-by-minute HR data for a specific date. Obviously it would be easy to put the minute-by-minute data in a 2-dimensional list, but I fear that would be harder to analyze later.
What would be the best practice when I want to have both datasets in one dataframe? Is it even possible to nest dataframes?
Any better ideas? Thanks!
Yes, it is possible to nest dataframes, but I would recommend rethinking how you want to structure your data, which depends on your application and the analyses you want to run on it afterwards.
How to "nest" dataframes into another dataframe
Your dataframe containing your nested "sub-dataframes" won't be displayed very nicely. However, just to show that it is possible to nest your dataframes, take a look at this mini-example:
Here we have 3 random dataframes:
>>> df1
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
>>> df2
0 1 2
0 0.090917 0.457668 0.598548
1 0.748639 0.729935 0.680409
2 0.301244 0.024004 0.361283
>>> df3
0 1 2
0 0.200375 0.059798 0.665323
1 0.086708 0.320635 0.594862
2 0.299289 0.014134 0.085295
We can make a main dataframe that includes these dataframes as values in individual "cells":
df = pd.DataFrame({'idx':[1,2,3], 'dfs':[df1, df2, df3]})
We can then access these nested dataframes as we would access any value in any other dataframe:
>>> df['dfs'].iloc[0]
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
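As an alternative to nesting, one common restructuring (a sketch with made-up column names, not part of the original answer) is to keep the minute-level HR data in its own long-format dataframe keyed by date, and join it to the daily summary only when an analysis needs both:

import pandas as pd

# hypothetical daily summary and minute-level heart-rate data, both keyed by date
daily = pd.DataFrame({'date': pd.to_datetime(['2021-01-01', '2021-01-02']),
                      'steps': [8000, 10500],
                      'calories': [2100, 2400]})
hr_minutes = pd.DataFrame({'timestamp': pd.to_datetime(['2021-01-01 00:00', '2021-01-01 00:01']),
                           'hr': [62, 64]})
hr_minutes['date'] = hr_minutes['timestamp'].dt.normalize()

# attach the daily columns to the minute rows whenever both are needed together
merged = hr_minutes.merge(daily, on='date', how='left')

This keeps every cell scalar, so groupby, resample and plotting all keep working as usual.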

How do I groupby a dataframe based on values that are common to multiple columns?

I am trying to aggregate a dataframe based on values that are found in two columns. I am trying to aggregate the dataframe such that the rows that have some value X in either column A or column B are aggregated together.
More concretely, I am trying to do something like this. Let's say I have a dataframe gameStats:
awayTeam homeTeam awayGoals homeGoals
Chelsea Barca 1 2
R. Madrid Barca 2 5
Barca Valencia 2 2
Barca Sevilla 1 0
... and so on
I want to construct a dataframe such that among my rows I would have something like:
team goalsFor goalsAgainst
Barca 10 5
One obvious solution, since the set of unique elements is small, is something like this:
for team in teamList:
    aggregateDf = gameStats[(gameStats['homeTeam'] == team) | (gameStats['awayTeam'] == team)]
    # do other manipulations of the data, then append it to a final dataframe
However, going through a loop seems less elegant. And since I have had this problem before with many unique identifiers, I was wondering if there was a way to do this without using a loop as that seems very inefficient to me.
The solution has two steps: first compute the goals for each team when they play home and when they play away, then combine them. Something like:
goals_when_away = gameStats.groupby(['awayTeam'])[['awayGoals', 'homeGoals']].agg('sum').reset_index().sort_values('awayTeam')
goals_when_home = gameStats.groupby(['homeTeam'])[['homeGoals', 'awayGoals']].agg('sum').reset_index().sort_values('homeTeam')
then combine them
np_result = goals_when_away.iloc[:, 1:].values + goals_when_home.iloc[:, 1:].values
pd_result = pd.DataFrame(np_result, columns=['goal_for', 'goal_against'])
result = pd.concat([goals_when_away.iloc[:, :1], pd_result], axis=1, ignore_index=True)
Note the .values when summing (so the addition happens on plain numpy arrays) and ignore_index=True in the concat; both avoid the pandas trap of aligning on column and index labels when summing.
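An alternative sketch (same column names as the question) that avoids relying on the two groupby results lining up row for row is to stack a "home view" and an "away view" of the table and aggregate once:

# one row per (team, game), with goals for/against given the same names in both halves
home = gameStats.rename(columns={'homeTeam': 'team', 'homeGoals': 'goalsFor',
                                 'awayGoals': 'goalsAgainst'})[['team', 'goalsFor', 'goalsAgainst']]
away = gameStats.rename(columns={'awayTeam': 'team', 'awayGoals': 'goalsFor',
                                 'homeGoals': 'goalsAgainst'})[['team', 'goalsFor', 'goalsAgainst']]
result = pd.concat([home, away]).groupby('team', as_index=False).sum()

For the sample rows above this yields Barca with goalsFor 10 and goalsAgainst 5, and it also copes with teams that only ever appear as home or away.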

Pandas merge dataframes based on most similar values

I am attempting to merge 2 pandas dataframes; however, the values are not exactly the same in the merge columns.
I am using the command
pd.merge(D_data, L_data,on="R_Time")
however, in D_data my R_Time column looks like
4.316667, 4.320834, 4.325000
and in my L_data column my data looks like:
4.31000, 4.32000, ...
Essentially, what I am trying to do is take every item in the first set and match it to the closest element in the second set. I've done this with the VLOOKUP function in Excel, but I'm not entirely sure how to get the same functionality with pandas DataFrame objects.
Given Data:
D_data:
4.316667
4.320834
4.325
4.329167
4.333334
4.3375
4.341667
4.345834
4.35
4.354167
4.358334
L_Data
4.316667
4.318667
4.320667
4.322667
4.324667
4.326667
4.328667
4.330667
4.332667
4.334667
4.336667
I want to produce a pairing between exactly these elements, even though they are not exactly identical in most cases.
One option is pandas' merge_asof(). Alternatively, first create a column in L_data holding the closest value from D_data (the one with the smallest absolute difference), and then merge:
import pandas as pd

D_data = pd.DataFrame({"R_Time": [4.316667, 4.320834, 4.325, 4.329167, 4.333334, 4.3375,
                                  4.341667, 4.345834, 4.35, 4.354167, 4.358334]})
L_data = pd.DataFrame({"_R_Time": [4.316667, 4.318667, 4.320667, 4.322667, 4.324667, 4.326667,
                                   4.328667, 4.330667, 4.332667, 4.334667, 4.336667]})
L_data["R_Time"] = L_data.apply(lambda x: D_data["R_Time"][abs(D_data["R_Time"] - x["_R_Time"]).idxmin()], axis=1)
pd.merge(D_data, L_data, on="R_Time")
Result:
R_Time _R_Time
0 4.316667 4.316667
1 4.316667 4.318667
2 4.320834 4.320667
3 4.320834 4.322667
4 4.325000 4.324667
5 4.325000 4.326667
6 4.329167 4.328667
7 4.329167 4.330667
8 4.333334 4.332667
9 4.333334 4.334667
10 4.337500 4.336667
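For completeness, the same nearest-value pairing can also be written with merge_asof directly (a sketch, using L_data as originally constructed, before the helper column is added; both frames must be sorted on their key columns):

# match each row of L_data to the closest R_Time in D_data, without the helper column
result = pd.merge_asof(L_data.sort_values('_R_Time'),
                       D_data.sort_values('R_Time'),
                       left_on='_R_Time', right_on='R_Time',
                       direction='nearest')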

Create different dataframes from 1 excel file using selected columns

I have a large data file with dates in the first column and then stock names at the top, with one column of price data per stock.
Date        Stock 1    Stock 2    Stock 3
1/2/2001    2.77       6.00       11.00
1/3/2001    2.89       6.08       11.10
1/4/2001    2.86       6.33       11.97
1/5/2001    2.80       6.58       12.40
What I want to do is make multiple dataframes from this one file, each holding the date and the price data of one stock. So essentially in this example you would have 3 dataframes (the file has more than 1000 of them, so this is just a sample). The dataframes would be:
DF1 = Date and Stock 1
DF2 = Date and Stock 2
DF3 = Date and Stock 3
I am then going to take each dataframe and add more columns to each of them once they are created.
I was reading through previous questions and came across usecols, but I can't seem to get the syntax right. Can someone help me out? Also, if there is a better way to do this, please advise. Since I have more than 1000 stocks, speed in running through the file is important.
This is what I have so far but I am not sure I am heading down the most efficient path. It gives the following error (among others it seems):
ValueError: The elements of 'usecols' must either be all strings or all integers
df2 = pd.read_csv('file.csv')
# read in the CSV file once just to get the column headers
for i in df2:
    a = 0
    # always want the 1st (date) column as the 1st column in each DF
    d = pd.read_csv('file.csv', usecols=[a, i])
    # read in the file with the proper columns: always the first column,
    # plus column 1, next loop cols 0,2, next loop 0,3, etc.
    dataf[i] = pd.DataFrame(d)  # actually create the DataFrame
It also seems inefficient to have to read in the file each time. Maybe there is a way to read the file in once and then create the dataframes. Any help would be appreciated.
Consider building a list of integer pairings ([0,1], [0,2], [0,3], etc.) to slice the master dataframe by columns. Then iteratively append the dataframes to a list; a single container of similarly structured elements is preferable to 1000s of dataframes flooding your global environment.
from datetime import datetime

dateparse = lambda x: datetime.strptime(x, '%m/%d/%Y')
masterdf = pd.read_csv("DataFile.csv", parse_dates=[0], date_parser=dateparse)
colpairs = [[0, i] for i in range(1, len(masterdf.columns))]
dfList = []
for cols in colpairs:
    dfList.append(masterdf.iloc[:, cols])
print(len(dfList))
print(dfList[0].head())
print(dfList[1].head())
Alternatively, consider a dictionary of dataframes with stock names as keys for a container, where colpairs carry string literal pairings as opposed to integers:
colpairs = [['Date', masterdf.columns[i]] for i in range(1, len(masterdf.columns))]
dfDict = {}
for cols in colpairs:
dfDict[cols[1]] = masterdf[cols]
print(len(dfDict))
print(dfDict['Stock 1'].head())
print(dfDict['Stock 2'].head())
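The same dictionary can also be built in one line with a comprehension over the stock columns (a sketch, using the 'Date' column name from the sample data):

# one sub-dataframe per stock, each keeping the shared Date column
dfDict = {col: masterdf[['Date', col]] for col in masterdf.columns[1:]}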
