Pandas counts with different levels - python

I have raw data that looks like df below; how can I transform df into df1?
df = pd.DataFrame({"A":["A","A","B","B","B"],
"B":["N","N","N","S","S"],
"C":["E","E","NE","NE","NE"],
"D":["y","n","n","y","n"]})
df1 = pd.DataFrame({"A":["A","N","E"],
"B":["N","N","S"],
"C":[2,1,2],
"D":[1,0,1]})

IIUC (I am guessing the first column A in df1 should be ['A','B','B']):
df.groupby(['A','B']).agg({'C': 'count', 'D': lambda x: sum(x == 'y')}).reset_index()
Out[285]:
   A  B  C  D
0  A  N  2  1
1  B  N  1  0
2  B  S  2  1
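For reference, a sketch of the same aggregation using named aggregation (available since pandas 0.25):
out = df.groupby(['A', 'B'], as_index=False).agg(
    C=('C', 'count'),                      # rows per group
    D=('D', lambda s: (s == 'y').sum()))   # 'y' count per group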

How to replace data in one pandas df by the data of another one?

I want to replace some rows of some columns in a bigger pandas df with data from a smaller pandas df. The column names are the same in both.
I tried using combine_first, but it only updates the null values.
For example, let's say df1.shape is (100, 25) and df2.shape is (10, 5):
df1
A  B    C      D  E       F  G ... X  Y  Z
1  abc  10.20  0  pd.NaT
df2
A  B    C      D  E
1  abc  15.20  1  10
Now after replacing, df1 should look like:
A  B    C      D  E   F  G ... X  Y  Z
1  abc  15.20  1  10  ...
To replace values in df1, the condition is where df1.A == df2.A and df1.B == df2.B.
How can this be achieved in the most pythonic way? Any help will be appreciated.
I don't know if I really understood your question, but does this solve your problem?
df1 = pd.DataFrame(data={'A': [1], 'B': [2], 'C': [3], 'D': [4]})
df2 = pd.DataFrame(data={'A': [1], 'B': [2], 'C': [5], 'D': [6]})
new_df = pd.concat([df1, df2]).drop_duplicates(['A', 'B'], keep='last')
print(new_df)
output:
   A  B  C  D
0  1  2  5  6
You could play with a MultiIndex.
First let us create the dataframes you are working with:
from string import ascii_uppercase
import numpy as np
import pandas as pd

cols = pd.Index(list(ascii_uppercase))
vals = np.arange(100 * len(cols)).reshape(100, len(cols))
df = pd.DataFrame(vals, columns=cols)
df1 = pd.DataFrame(vals[:10, :5], columns=cols[:5])
Then turn A and B into indices:
df = df.set_index(["A","B"])
df1 = df1.set_index(["A","B"])*1.5 # multiply just to make the other values different
df.loc[df1.index, df1.columns] = df1
df = df.reset_index()
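An alternative sketch for the same setup: DataFrame.update aligns on index and columns and overwrites df in place with df1's non-NA values (assuming df1 is still indexed by A and B as above):
df = df.set_index(["A", "B"])
df.update(df1)  # overwrites the overlapping cells in place
df = df.reset_index()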

How to apply a function on a series of columns, based on the values in a corresponding series of columns?

I have a df with several columns where, based on the value (1-6) in each of these columns, I want to assign a value (0-1) to its corresponding column. I can do it on a column-by-column basis but would like to make it a single operation. Below is some example code:
import pandas as pd

df = pd.DataFrame({'col1': [1,3,6,3,5,2], 'col2': [4,5,6,6,1,3], 'col3': [3,6,5,1,1,6],
                   'colA': [0,0,0,0,0,0], 'colB': [0,0,0,0,0,0], 'colC': [0,0,0,0,0,0]})
(col1 corresponds with colA, col2 with colB, col3 with colC)
This code works on a column by column basis:
df.loc[(df.col1 != 1) & (df.col1 < 6), 'colA'] = df['colA'] + 1
But I would like to be able to have a list of columns, so to speak, and have it correspond with another. Something like this (but something that actually works):
m = df['col1' : 'col3'] != 1 & df['col1' : 'col3'] < 6
df.loc[m, 'colA' : 'colC'] += 1
Thank You!
The idea is to select both column blocks with DataFrame.loc, build the boolean mask from the first block, rename its columns to match the second block via a dict, and finally use DataFrame.add on those columns:
df1 = df.loc[:, 'col1':'col3']
df2 = df.loc[:, 'colA':'colC']
d = dict(zip(df1.columns, df2.columns))
df1 = ((df1 != 1) & (df1 < 6)).rename(columns=d)
df[df2.columns] = df[df2.columns].add(df1)
print(df)
   col1  col2  col3  colA  colB  colC
0     1     4     3     0     1     1
1     3     5     6     1     1     0
2     6     6     5     0     0     1
3     3     6     1     1     0     0
4     5     1     1     1     0     0
5     2     3     6     1     1     0
Here's what I would do:
# split up dataframe
sub_df = df.iloc[:, :3]
abc = df.iloc[:, 3:]
# make numpy array truth table (for values in 1-6, "> 1" is equivalent to "!= 1")
truth_table = (sub_df.to_numpy() > 1) & (sub_df.to_numpy() < 6)
# redefine abc based on numpy truth table
new_abc = pd.DataFrame(truth_table.astype(int), columns=['colA', 'colB', 'colC'])
# join the updated dataframe subgroups
new_df = pd.concat([sub_df, new_abc], axis=1)
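A fully vectorized sketch of the same idea on the question's original df, keeping the != 1 condition and incrementing in place (assuming the column blocks are contiguous, as in the example):
src = df.loc[:, 'col1':'col3']
mask = (src != 1) & (src < 6)
df.loc[:, 'colA':'colC'] += mask.to_numpy().astype(int)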

count unique values in groups pandas

I have a dataframe like this:
data = {'id': [1, 1, 1, 2, 2, 3],
        'value': ['a', 'a', 'a', 'b', 'b', 'c'],
        'obj_id': [1, 2, 3, 3, 3, 4]}
df = pd.DataFrame(data, columns=['id', 'value', 'obj_id'])
I would like to get the unique counts of obj_id grouped by id and value:
1 a 3
2 b 1
3 c 1
But when I do:
result = df.groupby(['id', 'value'])['obj_id'].nunique().reset_index(name='obj_counts')
the result I got was:
1 a 2
1 a 1
2 b 1
3 c 1
so the first two rows with the same id and value don't group together.
How can I fix this? Many thanks!
For me, your solution works fine with the sample data.
As @YOBEN_S mentioned in the comments, the likely problem is trailing whitespace; the solution is to add Series.str.strip:
data = {'id': [1, 1, 1, 2, 2, 3],
        'value': ['a ', 'a', 'a', 'b', 'b', 'c'],
        'obj_id': [1, 2, 3, 3, 3, 4]}
df = pd.DataFrame(data, columns=['id', 'value', 'obj_id'])
df['value'] = df['value'].str.strip()
df = df.groupby(['id', 'value'])['obj_id'].nunique().reset_index(name='obj_counts')
print(df)
   id value  obj_counts
0   1     a           3
1   2     b           1
2   3     c           1
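To confirm that hidden whitespace is the culprit, a quick sanity check (a sketch, run before stripping) is to look at the raw representations:
print([repr(v) for v in df['value'].unique()])  # exposes 'a ' vs 'a'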

Show complete rows highlighting differences between dataframes df1, df2, but only when a difference in a row's cell exists

I have two dataframes, df1 and df2, with the same index and the same column names.
How can I construct a dataframe which shows the differences, but only rows which have at least one differing cell?
If a row has differing cells but some are the same, keep the same cells intact.
example:
df1 = pd.DataFrame({1: ['a', 'a'], 2: ['c', 'c']})
df2 = pd.DataFrame({1: ['a', 'a'], 2: ['d', 'c']})
output needed:
pd.DataFrame({1: ['a'], 2: ['c->d']}, index=[0])
The output in this example should be a one-row dataframe, not a dataframe including identical rows.
NB: the output should only contain full rows which have at least one difference in a cell.
I'd like an efficient solution without iterating over rows and without creating special strings in the DataFrame.
You can use this brilliant solution:
def report_diff(x):
    return x[0] if x[0] == x[1] else '{}->{}'.format(*x)

In [70]: pd.Panel(dict(df1=df1, df2=df2)).apply(report_diff, axis=0)
Out[70]:
   1     2
0  a  c->d
1  a     c
For slightly more complex DataFrames:
In [73]: df1
Out[73]:
   A  B  C
0  a  c  1
1  a  c  2
2  1  2  3
In [74]: df2
Out[74]:
   A  B  C
0  a  d  1
1  a  c  2
2  1  2  4
In [75]: pd.Panel(dict(df1=df1, df2=df2)).apply(report_diff, axis=0)
Out[75]:
   A     B     C
0  a  c->d     1
1  a     c     2
2  1     2  3->4
UPDATE: showing only changed/different rows:
In [54]: mask = df1.ne(df2).any(axis=1)
In [55]: mask
Out[55]:
0     True
1    False
2     True
dtype: bool
In [56]: pd.Panel(dict(df1=df1[mask], df2=df2[mask])).apply(report_diff, axis=0)
Out[56]:
   A     B     C
0  a  c->d     1
2  1     2  3->4
How about a good ole list comprehension on the flattened contents...
import pandas as pd
import numpy as np

df1 = pd.DataFrame({1: ['a', 'a'], 2: ['c', 'c']})
df2 = pd.DataFrame({1: ['a', 'a'], 2: ['d', 'c']})

rows_different_mask = (df1 != df2).any(axis=1)
pairs = zip(df1.values.reshape(1, -1)[0], df2.values.reshape(1, -1)[0])
new_elems = ["%s->%s" % (old, new) if (old != new) else new for old, new in pairs]
df3 = pd.DataFrame(np.reshape(new_elems, df1.values.shape))
print(df3)
   0     1
0  a  c->d
1  a     c
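Note that rows_different_mask is computed above but never applied; to keep only the changed rows, as the question asks, a one-line follow-up (a sketch) would be:
print(df3[rows_different_mask.to_numpy()])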

python: using .iterrows() to create columns

I am trying to use a loop function to create a matrix of whether a product was seen in a particular week.
Each row in the df (representing a product) has a close_date (the date the product closed) and a week_diff (the number of weeks the product was listed).
import pandas

mydata = [{'subid': 'A', 'Close_date_wk': 25, 'week_diff': 3},
          {'subid': 'B', 'Close_date_wk': 26, 'week_diff': 2},
          {'subid': 'C', 'Close_date_wk': 27, 'week_diff': 2}]
df = pandas.DataFrame(mydata)
My goal is to see how many alternative products were listed for each product in each date_range
I have set up the following loop:
for index, row in df.iterrows():
    i = 0
    max_range = row['Close_date_wk']
    min_range = int(row['Close_date_wk'] - row['week_diff'])
    for i in range(min_range, max_range):
        col_head = 'job_week_' + str(i)
        row[col_head] = 1
Can you please help explain why the row[col_head] = 1 line is neither adding a column nor adding a value to that column for that row?
For example, if:
row A has date range 1,2,3
row B has date range 2,3
row C has date range 3,4,5
then ideally I would like to end up with:
row A has 0 alternative products in week 1,
          1 alternative product in week 2,
          2 alternative products in week 3
row B has 1 alternative product in week 2,
          2 alternative products in week 3
etc.
You can't mutate df using row here to add a new column; you'd either refer to the original df or use .loc, .iloc, or .ix. For example:
In [29]:
df = pd.DataFrame(columns=list('abc'), data=np.random.randn(5, 3))
df
Out[29]:
          a         b         c
0 -1.525011  0.778190 -1.010391
1  0.619824  0.790439 -0.692568
2  1.272323  1.620728  0.192169
3  0.193523  0.070921  1.067544
4  0.057110 -1.007442  1.706704
In [30]:
for index, row in df.iterrows():
    df.loc[index, 'd'] = np.random.randint(0, 10)
df
Out[30]:
          a         b         c  d
0 -1.525011  0.778190 -1.010391  9
1  0.619824  0.790439 -0.692568  9
2  1.272323  1.620728  0.192169  1
3  0.193523  0.070921  1.067544  0
4  0.057110 -1.007442  1.706704  9
You can modify existing rows:
In [31]:
# reset the df by slicing
df = df[list('abc')]
for index, row in df.iterrows():
    row['b'] = np.random.randint(0, 10)
df
Out[31]:
          a  b         c
0 -1.525011  8 -1.010391
1  0.619824  2 -0.692568
2  1.272323  8  0.192169
3  0.193523  2  1.067544
4  0.057110  3  1.706704
But adding a new column using row won't work:
In [35]:
df = df[list('abc')]
for index, row in df.iterrows():
    row['d'] = np.random.randint(0, 10)
df
Out[35]:
          a  b         c
0 -1.525011  8 -1.010391
1  0.619824  2 -0.692568
2  1.272323  8  0.192169
3  0.193523  2  1.067544
4  0.057110  3  1.706704
Regarding row[col_head] = 1: try the line below instead, which writes through to the original df:
df.at[index, col_head] = 1
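Putting that into the original loop, as a sketch (a new column is created with NaN for the other rows, so a final fillna(0) is likely wanted):
for index, row in df.iterrows():
    max_range = row['Close_date_wk']
    min_range = int(row['Close_date_wk'] - row['week_diff'])
    for i in range(min_range, max_range):
        df.at[index, 'job_week_' + str(i)] = 1  # creates the column on first use
df = df.fillna(0)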
