I have a dataset with columns a_x,b_x,c_x,d_x, a_z,b_z,c_z,d_z
df=pd.DataFrame({'a_x':['a','b','c'],'b_x':['a','b','c'] ,'c_x':['a','b','c'],'d_x':['a','b','c'],'a_z':['a','b','i'],'b_z':['a','t','c'] ,'c_z':['c','c','c'],'d_z':['a','b','c']})
I have another dataset with columns : original,_x,_z.
header_comp=pd.DataFrame({'original':['a','b','c','d'],'_x':['a_x','b_x','c_x','d_x'],'_z':['a_z','b_z','c_z','d_z']})
I'm trying to create a loop using the header_comp to compare the _x columns to the corresponding _z columns such that new columns are created in the original df dataset: a_comp, b_comp, c_comp, d_comp.
Each of these columns will compare if i_x is equal to i_z and spit out either 1 or 0.
output should therefore look like this:
df=pd.DataFrame({'a_x':['a','b','c'],'b_x':['a','b','c'] ,'c_x':['a','b','c'],'d_x':['a','b','c'],'a_z':['a','b','i'],'b_z':['a','t','c'] ,'c_z':['c','c','c'],'d_z':['a','b','c'],'a_comp':[1,1,0],'b_comp':[1,0,1] ,'c_comp':[0,0,1],'d_comp':[1,1,1]})
So far, my code looks like this
for i in range(0, len(header_match)):
df[header_matrch.iloc[i,0] + ' comp'] = (df[header_match.iloc[i,1]==df[header_match.iloc[i,2]]).astype(int)
however, this is not working, with an error of 'Pivotrelease_x'. Is anyone able to troubleshoot this for me?
If I just use the code for individual columns outside of the for loop, there are no problems. e.g.
df[header_matrch.iloc[1,0] + ' comp'] = (df[header_match.iloc[1,1]==df[header_match.iloc[1,2]]).astype(int)
Thanks.
You can just use the values in header_comp to index the values in df:
df[header_comp['original'] + '_comp'] = (df[header_comp['_x']].to_numpy() == df[header_comp['_z']]).astype(int)
Output:
>>> df
a_x b_x c_x d_x a_z b_z c_z d_z a_comp b_comp c_comp d_comp
0 a a a a a a c a 1 1 0 1
1 b b b b b t c b 1 0 0 1
2 c c c c i c c c 0 1 1 1
Related
I'm new in python.
I have data frame (DF) example:
id
type
1
A
1
B
2
C
2
B
I would like to add a column example A_flag group by id.
In the end I have data frame (DF):
id
type
A_flag
1
A
1
1
B
1
2
C
0
2
B
0
I can do this in two step:
DF['A_flag_tmp'] = [1 if x.type=='A' else 0 for x in DF.itertuples()]
DF['A_flag'] = DF.groupby(['id'])['A_flag_tmp'].transform(np.max)
It's working, but it's very slowy for big data frame.
Is there any way to optimize this case ?
Thank's for help.
Change your codes with slow iterative coding to fast vectorized coding by replacing your first step to generate a boolean series by Pandas built-in functions, e.g.
df['type'].eq('A')
Then, you can attach it to the groupby statement for second step, as follows:
df['A_flag'] = df['type'].eq('A').groupby(df['id']).transform('max').astype(int)
Result
print(df)
id type A_flag
0 1 A 1
1 1 B 1
2 2 C 0
3 2 B 0
In general, if you have more complicated conditions, you can also define it in vectorized way, eg. define the boolean series m by:
m = df['type'].eq('A') & df['type1'].gt(1) | (df['type2'] != 0)
Then, use it in step 2 as follows:
m.groupby(df['id']).transform('max').astype(int)
I'd need a little suggestion on a procedure using pandas, I have a 2-columns dataset that looks like this:
A 0.4533
B 0.2323
A 1.2343
A 1.2353
B 4.3521
C 3.2113
C 2.1233
.. ...
where first column contains strings and the second one floats. I would like to save the minimum value for each group of unique strings in order to have the associated minimum with A, B, C. Does anybody have any suggestions on that? It could help me also storing somehow all the values for each string they are associated.
Many thanks,
James
Input data:
>>> df
0 1
0 A 0.4533
1 B 0.2323
2 A 1.2343
3 A 1.2353
4 B 4.3521
5 C 3.2113
6 C 2.1233
Use groupby before min:
out = df.groupby(0).min()
Output result:
>>> out
1
0
A 0.4533
B 0.2323
C 2.1233
Update:
filter out all the values in the original dataset that are more than 20% different from the minimum
out = df[df.groupby(0)[1].apply(lambda x: x <= x.min() * 1.2)]
>>> out
0 1
0 A 0.4533
1 B 0.2323
6 C 2.1233
You can simply do it by
min_A=min(df[df["column_1"]=="A"]["value"])
min_B=min(df[df["column_1"]=="B"]["value"])
min_C=min(df[df["column_1"]=="C"]["value"])
where df = Dataframe column_1 and value are the names of the columns of the dataframe
You can also do it by using the pre-defined function of pandas i.e. groupby()
>> df.groupby(["column_1"]).min()
The Above will also give the same results.
I searched and I couldn't find a problem like mine. So if there is and somehow I couldn't find please let me know. So I can delete this post.
I stuck with a problem to split pandas dataframe into different data frames (df) by a value.
I have a dataset inside a text file and I store them as pandas dataframe that has only one column. There are more than one sets of information inside the dataset and a certain value defines the end of that set, you can see a sample below:
The Sample Input
In [8]: df
Out[8]:
var1
0 a
1 b
2 c
3 d
4 endValue
5 h
6 f
7 b
8 w
9 endValue
So I want to split this df into different data frames. I couldn't find a way to do that but I'm sure there must be an easy way. The format I display in sample output can be a wrong format. So, If you have a better idea I'd love to see. Thank you for help.
The sample output I'd like
var1
{[0 a
1 b
2 c
3 d
4 endValue]},
{[0 h
1 f
2 b
3 w
4 endValue]}
You could check where var1 is endValue, take the cumsum, and use the result as a custom grouper. Then Groupby and build a dictionary from the result:
d = dict(tuple(df.groupby(df.var1.eq('endValue').cumsum().shift(fill_value=0.))))
Or for a list of dataframes (effectively indexed in the same way):
l = [v for _,v in df.groupby(df.var1.eq('endValue').cumsum().shift(fill_value=0.))]
print(l[0])
var1
0 a
1 b
2 c
3 d
4 endValue
One idea with unique index values is replace non matched values to NaNs and backfilling them, last loop groupby object for list of DataFrames:
g = df.index.to_series().where(df['var1'].eq('endValue')).bfill()
dfs = [a for i, a in df.groupby(g, sort=False)]
print (dfs)
[ var1
0 a
1 b
2 c
3 d
4 endValue, var1
5 h
6 f
7 b
8 w
9 endValue]
Let's say I have a very simple pandas dataframe, containing a single indexed column with "initial values". I want to read in a loop N other dataframes to fill a single "comparison" column, with matching indices.
For instance, with my inital dataframe as
Initial
0 a
1 b
2 c
3 d
and the following two dataframes to read in a loop
Comparison
0 e
1 f
Comparison
2 g
3 h
4 i <= note that this index doesn't exist in Initial so won't be matched
I would like to produce the following result
Initial Comparison
0 a e
1 b f
2 c g
3 d h
Using merge, concat or join, I only ever seem to be able to create a new column for each iteration of the loop, filling the blanks with NaN.
What's the most pandas-pythonic way of achieving this?
Below an example from the proposed duplicate solution:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([['a'],['b'],['c'],['d']]), columns=['Initial'])
print df1
df2 = pd.DataFrame(np.array([['e'],['f']]), columns=['Compare'])
print df2
df3 = pd.DataFrame(np.array([[2,'g'],[3,'h'],[4,'i']]), columns=['','Compare'])
df3 = df3.set_index('')
print df3
print df1.merge(df2,left_index=True,right_index=True).merge(df3,left_index=True,right_index=True)
>>
Initial
0 a
1 b
2 c
3 d
Compare
0 e
1 f
Compare
2 g
3 h
4 i
Empty DataFrame
Columns: [Initial, Compare_x, Compare_y]
Index: []
Second edit: #W-B, the following seems to work, but it can't be the case that there isn't a simpler option using proper pandas methods. It also requires turning off warnings, which might be dangerous...
pd.options.mode.chained_assignment = None
df1["Compare"]=pd.Series()
for ind in df1.index.values:
if ind in df2.index.values:
df1["Compare"][ind]=df2.T[ind]["Compare"]
if ind in df3.index.values:
df1["Compare"][ind]=df3.T[ind]["Compare"]
print df1
>>
Initial Compare
0 a e
1 b f
2 c g
3 d h
Ok , since Op need more info
Data input
import functools
df1 = pd.DataFrame(np.array([['a'],['b'],['c'],['d']]), columns=['Initial'])
df1['Compare']=np.nan
df2 = pd.DataFrame(np.array([['e'],['f']]), columns=['Compare'])
df3 = pd.DataFrame(np.array(['g','h','i']), columns=['Compare'],index=[2,3,4])
Solution
newdf=functools.reduce(lambda x,y: x.fillna(y),[df1,df2,df3])
newdf
Out[639]:
Initial Compare
0 a e
1 b f
2 c g
3 d h
I have been trying to select a subset of a correlation matrix using the Pandas Python library.
For instance, if I had a matrix like
0 A B C
A 1 2 3
B 2 1 4
C 3 4 1
I might want to select a matrix where some of the variables in the original matrix are correlated with some of the other variables, like :
0 A C
A 1 3
C 3 1
To do this, I tried using the following code to slice the original correlation matrix using the names of the desired variables in a list, transpose the correlation matrix, reassign the original column names, and then slice again.
data = pd.read_csv("correlationmatrix.csv")
initial_vertical_axis = pd.DataFrame()
for x in var_list:
a = data[x]
initial_vertical_axis = initial_vertical_axis.append(a)
print initial_vertical_axis
initial_vertical_axis = pd.DataFrame(data=initial_vertical_axis, columns= var_list)
initial_matrix = pd.DataFrame()
for x in var_list:
a = initial_vertical_axis[x]
initial_matrix = initial_matrix.append(a)
print initial_matrix
However, this returns an empty correlation matrix with the right row and column labels but no data like
0 A C
A
C
I cannot find the error in my code that would lead to this. If there is a simpler way to go about this, I am open to suggestions.
Suppose data contains your matrix,
In [122]: data
Out[122]:
A B C
0
A 1 2 3
B 2 1 4
C 3 4 1
In [123]: var_list = ['A','C']
In [124]: data.loc[var_list,var_list]
Out[124]:
A C
0
A 1 3
C 3 1