Hello, I want to store a dataframe in another dataframe's cell.
My data looks like this: I have daily data consisting of date, steps, and calories. In addition, I have minute-by-minute HR data for one specific date. Obviously it would be easy to put the minute-by-minute data in a 2-dimensional list, but I fear that would be harder to analyze later.
What would be the best practice when I want to have both datasets in one dataframe? Is it even possible to nest dataframes?
Any better ideas? Thanks!
Yes, it is possible to nest dataframes, but I would instead recommend rethinking how you want to structure your data; the right structure depends on your application and the analyses you want to run afterwards.
How to "nest" dataframes into another dataframe
Your dataframe containing your nested "sub-dataframes" won't be displayed very nicely. However, just to show that it is possible to nest your dataframes, take a look at this mini-example:
Here we have 3 random dataframes:
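They could be created like this, for instance (the exact values will differ on each run):
import numpy as np
import pandas as pd

# three 3x3 dataframes of uniform random numbers
df1, df2, df3 = (pd.DataFrame(np.random.rand(3, 3)) for _ in range(3))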
>>> df1
          0         1         2
0  0.614679  0.401098  0.379667
1  0.459064  0.328259  0.592180
2  0.916509  0.717322  0.319057
>>> df2
          0         1         2
0  0.090917  0.457668  0.598548
1  0.748639  0.729935  0.680409
2  0.301244  0.024004  0.361283
>>> df3
          0         1         2
0  0.200375  0.059798  0.665323
1  0.086708  0.320635  0.594862
2  0.299289  0.014134  0.085295
We can make a main dataframe that includes these dataframes as values in individual "cells":
df = pd.DataFrame({'idx':[1,2,3], 'dfs':[df1, df2, df3]})
We can then access these nested dataframes as we would access any value in any other dataframe:
>>> df['dfs'].iloc[0]
          0         1         2
0  0.614679  0.401098  0.379667
1  0.459064  0.328259  0.592180
2  0.916509  0.717322  0.319057
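That said, a flat structure is usually easier to analyze. If the sub-dataframes share the same columns, one common alternative (a sketch) is to concatenate them under named keys, which gives a single ordinary dataframe with a MultiIndex instead of nested objects:
# stack the sub-dataframes under a key rather than nesting them
combined = pd.concat({'df1': df1, 'df2': df2, 'df3': df3}, names=['source'])
combined.loc['df1']   # recovers the first sub-dataframe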
I need to compare 2 DataFrames (which should be identical) and output an Excel sheet that shows the comparison between them, with any mismatched values highlighted. This was the format requested by the analysts working with the reports.
I'm currently using df.compare() to do this, which gives a result like the below, where orig is the original df and new is the new df.
In the below, both values in col_1 at index 3 should be highlighted, because they didn't match between the dataframes:
      col_1      col_2      col_3
       orig new   orig new   orig new
index
1         1   1      2   2      3   3
2         1   1      2   2      3   3
3         1   2      2   2      3   3
While I can do this on my own, the dataframes could be very large, and there will be hundreds of comparisons. So I need your help in doing it efficiently!
My idea was to do
orig.compare(new, keep_equal=False)
and use that to create a mask. This would work because keep_equal=False returns only the values that differ; all other cells are NaN. Then I could run the comparison again with keep_equal=True, which populates all cells, and finally apply the mask using
df.style.apply
to highlight the values that didn't match.
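In code, my idea would be roughly this (untested sketch; the yellow background is just a placeholder style):
import numpy as np

diff = orig.compare(new, keep_equal=False)   # only differing values; equal cells are NaN
full = orig.compare(new, keep_equal=True)    # same shape, but every cell populated
mask = diff.notna()                          # True where the dataframes disagreed
styled = full.style.apply(
    lambda _: np.where(mask, 'background-color: yellow', ''), axis=None)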
Is there a faster way to do this? My approach requires processing all the cells in the df several times.
Thanks for any help you can provide.
orig and new are the two dataframes you want to compare.
Use:
import numpy as np

def highlight_diffs(orig, props=''):
    # new is taken from the enclosing scope; apply props wherever the values differ
    return np.where(orig != new, props, '')

orig.style.apply(highlight_diffs, props='color:white;background-color:darkblue', axis=None)
Reference: the pandas Styler documentation ("Styler Functions: Acting on Data").
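Since the end goal is an Excel sheet, note that the styled result can be written straight to a file; Styler.to_excel should keep the highlighting (this assumes openpyxl is installed):
styled = orig.style.apply(highlight_diffs, props='color:white;background-color:darkblue', axis=None)
styled.to_excel('comparison.xlsx', engine='openpyxl')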
I have two dataframes. One is very large, with over 4 million rows of data, while the other has about 26k. I'm trying to create a dictionary whose keys are the strings of the smaller dataframe. That dataframe (df1) contains substrings or incomplete names, and the larger dataframe (df2) contains full names/strings; I want to check whether each substring from df1 occurs in the strings of df2 and build my dict from the matches.
No matter what I try, my code takes long, and I keep looking for faster ways to iterate through the dataframes.
org_dict = {}
for rowi in df1.itertuples():
    part = rowi.part_name
    full_list = []
    for rowj in df2.itertuples():
        if part in rowj.full_name:
            full_list.append(rowj.full_name)
    org_dict[part] = full_list
Am I missing a break or is there a faster way to iterate through really large dataframes of way over 1 million rows?
Sample data:
df1
  part_name
0       aaa
1        bb
2       856
3      cool
4       man
5        a0
df2
    full_name
0   aaa35688d
1     coolbbd
2     8564578
3     coolaaa
4  man4857684
5      a03567
expected output:
{'aaa': ['aaa35688d', 'coolaaa'],
 'bb': ['coolbbd'],
 '856': ['8564578'],
 ...}
The issue here is that nested Python for loops scale very badly as the data grows: the runtime is proportional to len(df1) * len(df2). Luckily, pandas lets us vectorise the inner scan of df2 with a string operation, so only the (much smaller) loop over the part names remains.
I can't properly test without having access to a sample of your data, but I believe this does the trick and performs much faster:
org_dict = {
    # str.contains vectorises the inner loop; regex=False treats the
    # part name as a literal substring rather than a regular expression
    substr: df2.full_name[df2.full_name.str.contains(substr, regex=False)].tolist()
    for substr in df1.part_name
}
I was struggling with how to word the question, so I will provide an example of what I am trying to do below. I have a dataframe that looks like this:
      ID   CODE    COST
0  60086  V2401  105.38
1  60142  V2500  221.58
2  60086  V2500  105.38
3  60134  V2750      35
4  60134  V2020       0
I am trying to create a dataframe that has the ID as rows, the CODE as columns, and the COST as values, since the cost for the same code differs per ID. How can I do this in pandas?
This seems like a classic "long to wide" problem, and there are several ways to do it. You can try pivot_table, for example:
df.pivot_table(index='ID', columns='CODE', values='COST')
(assuming that the dataframe is df; pivot_table is used rather than pivot so that duplicate ID/CODE pairs are aggregated, by the mean by default, instead of raising an error).
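On the sample above this produces (NaN where an ID has no cost for a code):
CODE   V2020   V2401   V2500  V2750
ID
60086    NaN  105.38  105.38    NaN
60134    0.0     NaN     NaN   35.0
60142    NaN     NaN  221.58    NaN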
I am creating a script that reads a Google Sheet, transforms the data, and passes it to my ERP API to automate the creation of Purchase Orders.
I have got as far as outputting the data in a dataframe but I need help on how I can iterate through this and pass it in the correct format to the API.
DataFrame Example (dfRow):
   productID  vatrateID  amount  price
0      46771          2       1   1.25
1      46771          2       1   2.25
2      46771          2       2   5.00
Formatting of the API data:
vatrateID1=dfRow.vatrateID[0],
amount1=dfRow.amount[0],
price1=dfRow.price[0],
productID1=dfRow.productID[0],
vatrateID2=dfRow.vatrateID[1],
amount2=dfRow.amount[1],
price2=dfRow.price[1],
productID2=dfRow.productID[1],
vatrateID3=dfRow.vatrateID[2],
amount3=dfRow.amount[2],
price3=dfRow.price[2],
productID3=dfRow.productID[2],
I would like to create a function that iterates through the DataFrame and returns the data in the correct format to pass to the API.
I'm new to Python and struggle most with iterating / loops, so any help is much appreciated!
First, you can always loop over the rows of a dataframe using df.iterrows(). Each step through this iterator yields a tuple containing the row index and the row contents as a pandas Series object. So, for example, this would do the trick (note the + 1: your API fields are numbered from 1, while the dataframe index starts at 0):
for ix, row in df.iterrows():
    for column in row.index:
        print(f"{column}{ix + 1}={row[column]}")
You can also do it without resorting to loops. This is great if you need performance, but if performance isn't a concern then it is really just a matter of taste.
# first, "melt" the data, which puts all of the variables on their own row
x = df.reset_index().melt(id_vars='index')
# now join the columns together to produce the rows that we want
# (again shifting the index by 1 to match the API's numbering)
s = x['variable'] + (x['index'] + 1).map(str) + '=' + x['value'].map(str)
print(s)
0     productID1=46771.0
1     productID2=46771.0
2     productID3=46771.0
3         vatrateID1=2.0
...
10           price2=2.25
11            price3=5.0
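If the API ultimately takes these as keyword arguments, the same loop can build a dict instead of printing; to_api_params below is a hypothetical helper name:
def to_api_params(df):
    # flatten the dataframe into {'productID1': ..., 'vatrateID1': ..., ...},
    # using the 1-based numbering shown in the API example above
    params = {}
    for ix, row in df.iterrows():
        for column in row.index:
            params[f"{column}{ix + 1}"] = row[column]
    return params

api_kwargs = to_api_params(dfRow)
# then e.g. create_purchase_order(**api_kwargs), where create_purchase_order
# stands in for your ERP client's call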
I am trying to aggregate a dataframe based on values found in two columns: rows that have some value X in either column A or column B should be aggregated together.
More concretely, I am trying to do something like this. Let's say I have a dataframe gameStats:
 awayTeam   homeTeam  awayGoals  homeGoals
  Chelsea      Barca          1          2
R. Madrid      Barca          2          5
    Barca   Valencia          2          2
    Barca    Sevilla          1          0
... and so on
I want to construct a dataframe such that among my rows I would have something like:
 team  goalsFor  goalsAgainst
Barca        10             5
One obvious solution, since the set of unique elements is small, is something like this:
for team in teamList:
    aggregateDf = gameStats[(gameStats['homeTeam'] == team) | (gameStats['awayTeam'] == team)]
    # do other manipulations of the data then append it to a final dataframe
However, going through a loop seems less elegant. And since I have had this problem before with many unique identifiers, I was wondering if there was a way to do this without using a loop as that seems very inefficient to me.
The solution has two steps: first compute each team's goals for and against when they are away and when they are home, then combine the two. Something like:
goals_when_away = gameStats.groupby('awayTeam')[['awayGoals', 'homeGoals']].sum().reset_index().sort_values('awayTeam')
goals_when_home = gameStats.groupby('homeTeam')[['homeGoals', 'awayGoals']].sum().reset_index().sort_values('homeTeam')
then combine them:
np_result = goals_when_away.iloc[:, 1:].values + goals_when_home.iloc[:, 1:].values
pd_result = pd.DataFrame(np_result, columns=['goal_for', 'goal_against'])
result = pd.concat([goals_when_away.iloc[:, :1], pd_result], axis=1)
Note the .values when summing: it turns both operands into plain numpy arrays, which stops pandas from aligning on column names (awayGoals + homeGoals would otherwise produce all-NaN columns). Note also that the column selection must be a list ([['awayGoals', 'homeGoals']]); a bare pair of names is no longer accepted by groupby. The combination lines the two frames up purely by position, so it assumes every team appears at least once both home and away.
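A sketch of an alternative that avoids the positional alignment entirely: build a home view and an away view of each game under a common schema, stack them, and run a single groupby. This also handles teams that only ever appear on one side:
# each game from the home team's perspective
home = gameStats.rename(columns={'homeTeam': 'team', 'homeGoals': 'goalsFor',
                                 'awayGoals': 'goalsAgainst'})[['team', 'goalsFor', 'goalsAgainst']]
# the same games from the away team's perspective
away = gameStats.rename(columns={'awayTeam': 'team', 'awayGoals': 'goalsFor',
                                 'homeGoals': 'goalsAgainst'})[['team', 'goalsFor', 'goalsAgainst']]
# one row per team with total goals scored and conceded
result = pd.concat([home, away]).groupby('team', as_index=False).sum()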