I have to make a bar plot of data from a MultiIndex pandas DataFrame. This DataFrame has the following structure:
value
1 2 25
3 96
4 -12
...
2 3 -25
4 -30
...
3 4 541
5 396
6 14
...
Note that there is a value for index entry (1,2) but no value for (2,1). There is always an index entry (x,y) with y > x, and I'd like to create an entry (y,x) with the same value for every entry (x,y). Basically, I'd like to make my dataframe matrix symmetric. I've tried switching the index levels and then concatenating the results into a new dataframe, but I can't get the result I want. Maybe I could do it with a for loop, but I'm pretty sure there is a better way. Do you know how to do that efficiently?
Try using pd.concat and swaplevel:
pd.concat([df, df.swaplevel(0,1)])
Output:
value
x y
1 2 25
3 96
4 -12
2 3 -25
4 -30
3 4 541
5 396
6 14
2 1 25
3 1 96
4 1 -12
3 2 -25
4 2 -30
3 541
5 3 396
6 3 14
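As a minimal sketch of the concat and swaplevel approach above, here is a self-contained version on a small hypothetical frame (three rows borrowed from the question). Restoring the level names and calling sort_index are optional additions, not part of the original answer; swaplevel also swaps the level names, so they are reset before concatenating.

```python
import pandas as pd

# Hypothetical sample: upper-triangle pairs (x, y) with y > x, as in the question.
idx = pd.MultiIndex.from_tuples([(1, 2), (1, 3), (2, 3)], names=["x", "y"])
df = pd.DataFrame({"value": [25, 96, -25]}, index=idx)

# swaplevel turns every (x, y) label into (y, x) while keeping the values.
mirrored = df.swaplevel(0, 1)
# The level names were swapped too; put them back so both halves agree.
mirrored.index = mirrored.index.set_names(df.index.names)

# Stack the original and mirrored halves, then sort for readability.
sym = pd.concat([df, mirrored]).sort_index()
print(sym)
```

Because no missing values are introduced along the way, the value column keeps its integer dtype.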
You can unstack, transpose, stack again, and concat to the original series:
new_df = pd.concat( (df.value, df.value.unstack(level=1).T.stack()))
Toy data:
import numpy as np
import pandas as pd
idx = [(a,b) for b in range(1,4) for a in range(1, b)]
idx = pd.MultiIndex.from_tuples(idx)
np.random.seed(10)
df = pd.DataFrame({'value': np.random.randint(-100,100, len(idx))}, index=idx)
df.sort_index(inplace=True)
# df:
# value
# 1 2 -91
# 3 25
# 2 3 -85
Output (new_df):
1 2 -91.0
3 25.0
2 3 -85.0
1 -91.0
3 1 25.0
2 -85.0
dtype: float64
I want to make a new column by calculating existing columns.
For example df
df
no data1 data2
1 10 15
2 51 46
3 36 20
......
I want to make this:
new_df
no data1 data2 data1/-2 data1/2 data2/-2 data2/2
1 10 15 -5 5 -7.5 7.5
2 51 46 -25.5 25.5 -23 23
3 36 20 -18 18 -9 9
But I don't know how to do this as efficiently as possible.
To create a new df column based on the calculations of two or more other columns, you would have to define a new column and set it equal to your equation. For example:
df['new_col'] = df['col_1'] * df['col_2']
Is this what you mean?
import pandas as pd
number = [[1,2],[3,4],[5,6],[7,8],[9,10]]
df = pd.DataFrame(number)
df['Data 1/2'] = df[0] / df[1]
And the output:
0 1 Data 1/2
0 1 2 0.500000
1 3 4 0.750000
2 5 6 0.833333
3 7 8 0.875000
4 9 10 0.900000
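Applied to the table in the question, the same column-assignment idea could look like the sketch below. The loop over columns and divisors and the f-string column names are assumptions chosen to reproduce the requested headers; each assignment is still a vectorized operation on a whole column.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({"no": [1, 2, 3], "data1": [10, 51, 36], "data2": [15, 46, 20]})

new_df = df.copy()
for col in ["data1", "data2"]:
    for divisor in (-2, 2):
        # One vectorized division per new column, e.g. "data1/-2".
        new_df[f"{col}/{divisor}"] = df[col] / divisor

print(new_df)
```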
I have two dataframes of different sizes: the first has more rows, but the second has more columns.
I'm having trouble adding values from one dataframe to the other where the rows share the same value in a common column, which in this case is id.
Here is some dummy data and how I was trying to solve it:
import pandas as pd
df1 = pd.DataFrame([(1,2,3),(3,4,5),(5,6,7),(7,8,9),(100,10,12),(100,10,12),(100,10,12)], columns=['id','value','c'])
df2 = pd.DataFrame([(1,200,3,4,6),(3,400,3,4,6),(5,600,3,4,6),(5,620,3,4,6)], columns=['id','value','x','y','z'])
So where the id in df1 and df2 is the same, the value column of df2 should be increased by the matching value from df1.
data
df1:
id value c
1 2 3
3 4 5
5 6 7
7 8 9
100 10 12
100 10 12
100 10 12
df2:
id value x y z
1 200 3 4 6
3 400 3 4 6
5 600 3 4 6
5 620 3 4 6
expected:
Out:
id value x y z
1 202 3 4 6
3 404 3 4 6
5 606 3 4 6
5 626 3 4 6
tried:
for each in df1.a:
    if(df2.loc[df2['a'] == each]):
        df2['a']+=df['a']
This throws the error "The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().", which is confusing to me because I tried:
df2.loc[df2['a']==1]
outside of the loop and it works.
Once you set both data frames to have same index:
df1 = df1.set_index("id")
df2 = df2.set_index("id")
You can do one very simple operation:
mask = df1.index.isin(df2.index)
df2["value"] += df1.loc[mask, "value"]
Output:
value x y z
id
1 202 3 4 6
3 404 3 4 6
5 606 3 4 6
5 626 3 4 6
You can always do df2.reset_index() to get back to original setting.
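On the error message from the question: `if` needs a single True or False, but `df2.loc[...]` returns a whole DataFrame, so pandas refuses to guess. The sketch below reproduces this with the question's df2 (using the real column name id rather than a) and shows two explicit checks. The variable names are hypothetical.

```python
import pandas as pd

df2 = pd.DataFrame(
    [(1, 200, 3, 4, 6), (3, 400, 3, 4, 6), (5, 600, 3, 4, 6), (5, 620, 3, 4, 6)],
    columns=["id", "value", "x", "y", "z"],
)

matches = df2.loc[df2["id"] == 5]  # a DataFrame with two rows, not a bool

message = ""
try:
    if matches:  # pandas raises ValueError here
        pass
except ValueError as err:
    message = str(err)

# Explicit alternatives that do produce a single truth value:
has_match = not matches.empty        # is there at least one matching row?
any_match = (df2["id"] == 5).any()   # same question, asked of the boolean mask

print(message)
print(has_match, any_match)
```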
You can use set_index with add, then follow with reindex:
df1.set_index('id').add(df2.set_index('id'),fill_value=0).dropna(axis=0).reset_index().reindex(columns=df2.columns)
Out[193]:
id value x y z
0 1 202.0 3.0 4.0 6.0
1 3 404.0 3.0 4.0 6.0
2 5 606.0 3.0 4.0 6.0
3 5 626.0 3.0 4.0 6.0
Here is code I came up with. It uses a dict to look up the value for each id in df1. Map can then be used to look up the value for each id in df2, creating a series that is then added to df2['value'] to produce the desired result.
df1_lookup = dict(df1.set_index('id')['value'].items())
df2['value'] += df2['id'].map(lambda x: df1_lookup.get(x, 0))
Here is a one-liner.
df2.loc[:, 'value'] += [df1.set_index('id').loc[i, 'value'] for i in df2.id]
print(df2)
>>>
id value x y z
0 1 202 3 4 6
1 3 404 3 4 6
2 5 606 3 4 6
3 5 626 3 4 6
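A variant of the lookup idea above, sketched with a Series instead of a dict. It assumes that repeated ids in df1 (such as 100) always carry the same value, so keeping the first occurrence is safe; ids missing from df1 contribute 0.

```python
import pandas as pd

df1 = pd.DataFrame(
    [(1, 2, 3), (3, 4, 5), (5, 6, 7), (7, 8, 9), (100, 10, 12), (100, 10, 12), (100, 10, 12)],
    columns=["id", "value", "c"],
)
df2 = pd.DataFrame(
    [(1, 200, 3, 4, 6), (3, 400, 3, 4, 6), (5, 600, 3, 4, 6), (5, 620, 3, 4, 6)],
    columns=["id", "value", "x", "y", "z"],
)

# Assumption: duplicate ids in df1 hold identical values, so keep the first.
lookup = df1.drop_duplicates("id").set_index("id")["value"]

# map aligns on id; ids absent from df1 become NaN and are treated as 0.
to_add = df2["id"].map(lookup).fillna(0).astype(int)
result = df2.assign(value=df2["value"] + to_add)

print(result)
```

Using assign returns a new frame and leaves df2 untouched, which makes the step easy to rerun.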
After a groupby, when using agg, if a dict of columns:functions is passed, the functions will be applied in the corresponding columns. Nevertheless this syntax doesn't work with transform. Is there another way to apply several functions in transform?
Let's give an example:
import numpy as np
import pandas as pd
df_test = pd.DataFrame([[1,2,3],[1,20,30],[2,30,50],[1,2,33],[2,4,50]],columns = ['a','b','c'])
Out[1]:
a b c
0 1 2 3
1 1 20 30
2 2 30 50
3 1 2 33
4 2 4 50
def my_fct1(series):
    return series.mean()
def my_fct2(series):
    return series.std()
df_test.groupby('a').agg({'b':my_fct1,'c':my_fct2})
Out[2]:
c b
a
1 16.522712 8
2 0.000000 17
The previous example shows how to apply different functions to different columns in agg, but if we want to transform the columns without aggregating them, agg can't be used anymore. The obvious attempt with transform fails:
df_test.groupby('a').transform({'b':np.cumsum,'c':np.cumprod})
Out[3]:
TypeError: unhashable type: 'dict'
How can we perform such an action with the following expected output:
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
You can still use a dict, but with a bit of a hack:
df_test.groupby('a').transform(lambda x: {'b': x.cumsum(), 'c': x.cumprod()}[x.name])
Out[427]:
b c
0 2 3
1 22 90
2 30 50
3 24 2970
4 34 2500
If you need to keep column a, you can do:
df_test.set_index('a')\
.groupby('a')\
.transform(lambda x: {'b': x.cumsum(), 'c': x.cumprod()}[x.name])\
.reset_index()
Out[429]:
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
Another way is to use an if else to check column names:
df_test.set_index('a')\
.groupby('a')\
.transform(lambda x: x.cumsum() if x.name=='b' else x.cumprod())\
.reset_index()
I think that for now (pandas 0.20.2), transform does not accept a dict mapping column names to functions the way agg does.
If the functions return a Series of the same length:
df1 = df_test.set_index('a').groupby('a').agg({'b':np.cumsum,'c':np.cumprod}).reset_index()
print (df1)
a c b
0 1 3 2
1 1 90 22
2 2 50 30
3 1 2970 24
4 2 2500 34
But if the functions aggregate to a different length, a join is needed:
df2 = df_test[['a']].join(df_test.groupby('a').agg({'b':my_fct1,'c':my_fct2}), on='a')
print (df2)
a c b
0 1 16.522712 8
1 1 16.522712 8
2 2 0.000000 17
3 1 16.522712 8
4 2 0.000000 17
With the updates to pandas, you can use the assign method along with transform to either append new columns or replace existing columns with new values:
grouper = df_test.groupby("a")
df_test.assign(b=grouper["b"].transform("cumsum"),
c=grouper["c"].transform("cumprod"))
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
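If the column-to-function mapping is already held in a dict, as in the agg call from the question, the assign approach can be driven from that dict. This is only a sketch: the name funcs is hypothetical, and it relies on transform accepting the string names "cumsum" and "cumprod" exactly as in the snippet above.

```python
import pandas as pd

df_test = pd.DataFrame(
    [[1, 2, 3], [1, 20, 30], [2, 30, 50], [1, 2, 33], [2, 4, 50]],
    columns=["a", "b", "c"],
)

# Hypothetical mapping: column name -> transformation to apply within each group.
funcs = {"b": "cumsum", "c": "cumprod"}

grouped = df_test.groupby("a")
# One transform per column, collected into keyword arguments for assign.
result = df_test.assign(
    **{col: grouped[col].transform(func) for col, func in funcs.items()}
)

print(result)
```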
I have a list of values that are found in a large pandas dataframe:
value_list = [1, 4, 5, 6, 54]
Example DataFrame df is below:
column x
0 1 3
1 4 6
2 5 8
3 6 19
4 8 21
5 12 97
6 54 102
I would like to create a subset of the data frame using only these values:
df_new = df[df['column'] is in value_list] # pseudo code
Is this possible?
You might be looking for the isin operation.
In [60]: df[df['column'].isin(value_list)]
Out[60]:
column x
0 1 3
1 4 6
2 5 8
3 6 19
6 54 102
Also, you can use query like
In [63]: df.query('column in @value_list')
Out[63]:
column x
0 1 3
1 4 6
2 5 8
3 6 19
6 54 102
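Both approaches can be checked side by side with the data from the question. In a query string, a local Python variable is referenced with the `@` prefix. The `~` negation at the end is an extra not asked for in the question, included only to show the complementary selection.

```python
import pandas as pd

df = pd.DataFrame(
    {"column": [1, 4, 5, 6, 8, 12, 54], "x": [3, 6, 8, 19, 21, 97, 102]}
)
value_list = [1, 4, 5, 6, 54]

# Boolean mask: True where the cell value is one of the listed values.
by_isin = df[df["column"].isin(value_list)]

# Same filter written as a query; @value_list refers to the local variable.
by_query = df.query("column in @value_list")

# Complement: rows whose value is NOT in the list.
excluded = df[~df["column"].isin(value_list)]

print(by_isin)
print(excluded)
```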
You missed a for loop:
df_new = df[[elem in value_list for elem in df['column']]]
I have 2 pandas data frames, one of which consists of modified versions of selected rows of the other (they have the same columns).
For simplicity, the frames below illustrate the problem.
df1 =
   A  B  C
0  1  2  3
1  2  3  4
2  3  4  5
3  4  5  6
df2 =
    A   B   C
1  20  30  40
3  40  50  60
Is there a more efficient and pythonic way than the code below to embed df2 into df1 by overwriting values? (I am working with high-dimensional frames.)
for index, row in df2.iterrows():
    df1.ix[index,:] = df2.ix[index, :]
which results in:
df1 =
A B C
0 1 2 3
1 20 30 40
2 3 4 5
3 40 50 60
You can use update to update a df with another df: wherever the row and column labels agree, the values are overwritten. You will need to cast back to int with astype, because the dtype changes to float due to missing values:
In [21]:
df1.update(df2)
df1 = df1.astype(int)
df1
Out[21]:
A B C
0 1 2 3
1 20 30 40
2 3 4 5
3 40 50 60
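A self-contained sketch of the update call on the frames from the question. The label-based .loc assignment shown first is an alternative not mentioned above; it assumes every row and column label of df2 already exists in df1. Note that update modifies df1 in place and returns None.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 3, 4, 5], "C": [3, 4, 5, 6]})
df2 = pd.DataFrame({"A": [20, 40], "B": [30, 50], "C": [40, 60]}, index=[1, 3])

# Alternative (assumption: all labels of df2 exist in df1): assign by label.
via_loc = df1.copy()
via_loc.loc[df2.index, df2.columns] = df2

# update overwrites matching labels in place; do not assign its return value.
df1.update(df2)
df1 = df1.astype(int)  # harmless if the dtype was already integer

print(df1)
```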