Using .iterrows() to create columns in pandas

I am trying to use a loop function to create a matrix of whether a product was seen in a particular week.
Each row in the df (representing a product) has a close_date (the date the product closed) and a week_diff (the number of weeks the product was listed).
import pandas
mydata = [{'subid': 'A', 'Close_date_wk': 25, 'week_diff': 3},
          {'subid': 'B', 'Close_date_wk': 26, 'week_diff': 2},
          {'subid': 'C', 'Close_date_wk': 27, 'week_diff': 2}]
df = pandas.DataFrame(mydata)
My goal is to see how many alternative products were listed for each product in each date range.
I have set up the following loop:
for index, row in df.iterrows():
    i = 0
    max_range = row['Close_date_wk']
    min_range = int(row['Close_date_wk'] - row['week_diff'])
    for i in range(min_range, max_range):
        col_head = 'job_week_' + str(i)
        row[col_head] = 1
Can you please help explain why the "row[col_head] = 1" line is neither adding a column nor setting a value in that column for that row?
For example, if:
row A has date range 1,2,3
row B has date range 2,3
row C has date range 3,4,5
then ideally I would like to end up with
row A has 0 alternative products in week 1
1 alternative products in week 2
2 alternative products in week 3
row B has 1 alternative products in week 2
2 alternative products in week 3
&c..

You can't mutate the df through row here to add a new column; you'd either refer back to the original df or use .loc, .iloc, or .ix (note that .ix is deprecated and removed in modern pandas). Example:
In [29]:
df = pd.DataFrame(columns=list('abc'), data = np.random.randn(5,3))
df
Out[29]:
a b c
0 -1.525011 0.778190 -1.010391
1 0.619824 0.790439 -0.692568
2 1.272323 1.620728 0.192169
3 0.193523 0.070921 1.067544
4 0.057110 -1.007442 1.706704
In [30]:
for index, row in df.iterrows():
    df.loc[index, 'd'] = np.random.randint(0, 10)
df
Out[30]:
a b c d
0 -1.525011 0.778190 -1.010391 9
1 0.619824 0.790439 -0.692568 9
2 1.272323 1.620728 0.192169 1
3 0.193523 0.070921 1.067544 0
4 0.057110 -1.007442 1.706704 9
You can modify existing rows (though note that iterrows() yields copies, so writing through row is not guaranteed to propagate to the df in all pandas versions):
In [31]:
# reset the df by slicing
df = df[list('abc')]
for index, row in df.iterrows():
    row['b'] = np.random.randint(0, 10)
df
Out[31]:
a b c
0 -1.525011 8 -1.010391
1 0.619824 2 -0.692568
2 1.272323 8 0.192169
3 0.193523 2 1.067544
4 0.057110 3 1.706704
But adding a new column using row won't work:
In [35]:
df = df[list('abc')]
for index, row in df.iterrows():
    row['d'] = np.random.randint(0, 10)
df
Out[35]:
a b c
0 -1.525011 8 -1.010391
1 0.619824 2 -0.692568
2 1.272323 8 0.192169
3 0.193523 2 1.067544
4 0.057110 3 1.706704

Instead of row[col_head] = 1, please try the line below:
df.at[index, col_head] = 1
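Applied to the frame from the question, the fix looks like this (a minimal sketch; the job_week_ column names follow the question's own loop):

```python
import pandas as pd

mydata = [{'subid': 'A', 'Close_date_wk': 25, 'week_diff': 3},
          {'subid': 'B', 'Close_date_wk': 26, 'week_diff': 2},
          {'subid': 'C', 'Close_date_wk': 27, 'week_diff': 2}]
df = pd.DataFrame(mydata)

for index, row in df.iterrows():
    min_range = int(row['Close_date_wk'] - row['week_diff'])
    max_range = row['Close_date_wk']
    for i in range(min_range, max_range):
        # write through the DataFrame (df.at), not the row copy
        df.at[index, 'job_week_' + str(i)] = 1
```

Cells that were never touched come out as NaN; a final df.fillna(0) would turn the result into a 0/1 matrix.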

Related

Is there a way to vectorize the following python (pandas) code for faster speed?

I have 2 DataFrames
A_df = pd.DataFrame(data=np.arange(2, 103, 10) + np.random.randn(11),
                    columns=['Time(s)'])
B_df = pd.DataFrame(data=list(zip(range(1, 102), np.random.randn(101))),
                    columns=['Time(s)', 'Value'])
A_df.head()
Time(s)
0 2.751352
1 12.028663
2 20.638388
3 29.821199
4 42.516302
B_df.head()
Time(s) Value
0 1 1.075801
1 2 0.890754
2 3 -0.015543
3 4 0.085298
4 5 0.208645
I want to add a new column, 'Value of B at T', to A_df, holding the latest value of B (from B_df) at or before each time t in A_df.
Currently I am doing this by using apply() method as follows:
A_df['Value of B at T'] = A_df.apply(lambda x: B_df.loc[B_df['Time(s)'] <= x['Time(s)'], 'Value'].values[-1], axis = 1)
A_df
Time(s) Value of B at T
0 2.751352 -0.891782
1 12.028663 2.416335
2 20.638388 -0.186364
3 29.821199 -0.148716
4 42.516302 0.821272
I was wondering if there is a way to vectorize the functionality being carried out by apply() method to speed up the code?
You can try the following:
A_df = A_df.assign(
    b_at_t=B_df
    .set_index('Time(s)')
    .reindex(A_df['Time(s)'], method='ffill')
    .values)
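If both frames are sorted by 'Time(s)' (the question's data is), pd.merge_asof does this "last value at or before t" lookup directly; here is a small deterministic sketch with made-up values:

```python
import numpy as np
import pandas as pd

A_df = pd.DataFrame({'Time(s)': [2.7, 12.0, 20.6]})
B_df = pd.DataFrame({'Time(s)': np.arange(1.0, 6.0),
                     'Value': [10.0, 20.0, 30.0, 40.0, 50.0]})

# for each A_df row, take the last B_df row whose Time(s) <= A's Time(s)
out = pd.merge_asof(A_df, B_df, on='Time(s)', direction='backward')
```

merge_asof requires both inputs sorted on the key and the key columns to share a dtype, hence the float times on both sides.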

Element-wise Comparison of Two Pandas Dataframes

I am trying to compare two columns in pandas. I know I can do:
# either using Pandas' equals()
df1[col].equals(df2[col])
# or this
df1[col] == df2[col]
However, what I am looking for is to compare these columns element-wise and, when they do not match, print out both values. I have tried:
if df1[col] != df2[col]:
    print(df1[col])
    print(df2[col])
where I get the error 'The truth value of a Series is ambiguous'.
I believe this is because the column is treated as a series of boolean values for the comparison which causes the ambiguity. I also tried various forms of for loops which did not resolve the issue.
Can anyone point me to how I should go about doing what I described?
This might work for you:
import pandas as pd
df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'col1': [1, 2, 9, 4, 7]})
if not df2[df2['col1'] != df1['col1']].empty:
    print(df1[df1['col1'] != df2['col1']])
    print(df2[df2['col1'] != df1['col1']])
Output:
col1
2 3
4 5
col1
2 9
4 7
You need to get hold of the index where the column values are not matching. Once you have that index then you can query the individual DFs to get the values.
Please try the following and see if this helps:
for ind in df1.loc[df1['col1'] != df2['col1']].index:
    x = df1.loc[df1.index == ind, 'col1'].values[0]
    y = df2.loc[df2.index == ind, 'col1'].values[0]
    print(x, y)
Solution
Try this. You could use any of the following one-line solutions.
# Option-1
df.loc[df.apply(lambda row: row[col1] != row[col2], axis=1), [col1, col2]]
# Option-2
df.loc[df[col1]!=df[col2], [col1, col2]]
Logic:
Option-1: We use pandas.DataFrame.apply() to evaluate the target columns row by row; the resulting boolean mask is passed to df.loc[mask, [col1, col2]], which returns the rows where col1 != col2.
Option-2: We build the same boolean mask directly with df[col1] != df[col2]; the rest of the logic is the same as Option-1.
Dummy Data
I made the dummy data such that for indices: 2,6,8 we will find column 'a' and 'c' to be different. Thus, we want only those rows returned by the solution.
import numpy as np
import pandas as pd
a = np.arange(10)
c = a.copy()
c[[2,6,8]] = [0,20,40]
df = pd.DataFrame({'a': a, 'b': a**2, 'c': c})
print(df)
Output:
a b c
0 0 0 0
1 1 1 1
2 2 4 0
3 3 9 3
4 4 16 4
5 5 25 5
6 6 36 20
7 7 49 7
8 8 64 40
9 9 81 9
Applying the solution to the dummy data
We see that the solution proposed returns the result as expected.
col1, col2 = 'a', 'c'
result = df.loc[df.apply(lambda row: row[col1] != row[col2], axis=1), [col1, col2]]
print(result)
Output:
a c
2 2 0
6 6 20
8 8 40

How to combine two pandas dataframes value by value

I have 2 dataframes: players (only has PlayerID) and dates (only has Date). I want a new dataframe that contains every date for every player. In my case, the players df has about 2,600 rows and the dates df has 1,100 rows. I used 2 for loops to do this, but it is really slow. Is there a way to do it faster via some function? Thanks!
my loop:
player_elo = pd.DataFrame(columns=['PlayerID', 'Date'])
for row in players.itertuples():
    idx = row.Index
    pl = players.at[idx, 'PlayerID']
    for i in dates.itertuples():
        idd = i.Index
        dt = dates.at[idd, 0]
        new = {'PlayerID': [pl], 'Date': [dt]}
        new = pd.DataFrame(new)
        player_elo = player_elo.append(new)
If you have a key that is repeated for each df, you can come up with the cartesian product you are looking for using pd.merge().
import pandas as pd
players = pd.DataFrame([['A'], ['B'], ['C']], columns=['PlayerID'])
dates = pd.DataFrame([['12/12/2012'],['12/13/2012'],['12/14/2012']], columns=['Date'])
dates['Date'] = pd.to_datetime(dates['Date'])
players['key'] = 1
dates['key'] = 1
print(pd.merge(players, dates,on='key')[['PlayerID', 'Date']])
Output
PlayerID Date
0 A 2012-12-12
1 A 2012-12-13
2 A 2012-12-14
3 B 2012-12-12
4 B 2012-12-13
5 B 2012-12-14
6 C 2012-12-12
7 C 2012-12-13
8 C 2012-12-14
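On pandas 1.2 or newer the dummy key is unnecessary: merge supports how='cross' for a cartesian product. A sketch with the same toy data:

```python
import pandas as pd

players = pd.DataFrame({'PlayerID': ['A', 'B', 'C']})
dates = pd.DataFrame({'Date': pd.to_datetime(['2012-12-12', '2012-12-13'])})

# every PlayerID paired with every Date, no 'key' column needed
combos = players.merge(dates, how='cross')
```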

Anti-Join Pandas

I have two tables and I would like to append them so that all the data in table A is retained and data from table B is added only if its key is unique (key values are unique within table A and within table B; however, in some cases a key will occur in both tables).
I think the way to do this will involve some sort of filtering join (anti-join) to get values in table B that do not occur in table A then append the two tables.
I am familiar with R and this is the code I would use to do this in R.
library("dplyr")
## Filtering join to remove values already in "TableA" from "TableB"
FilteredTableB <- anti_join(TableB,TableA, by = "Key")
## Append "FilteredTableB" to "TableA"
CombinedTable <- bind_rows(TableA,FilteredTableB)
How would I achieve this in python?
Passing indicator=True to the merge command tells you where each row came from by creating a new column _merge with three possible values:
left_only
right_only
both
Keep right_only and left_only. That is it.
outer_join = TableA.merge(TableB, how = 'outer', indicator = True)
anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis = 1)
easy!
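A quick runnable check of those two lines on toy frames (note that merging without on= joins on all shared columns, so a row only counts as 'both' when every column matches):

```python
import pandas as pd

TableA = pd.DataFrame({'Key': ['a', 'b', 'c'], 'Val': [1, 2, 3]})
TableB = pd.DataFrame({'Key': ['a', 'd'], 'Val': [1, 9]})

outer_join = TableA.merge(TableB, how='outer', indicator=True)
# drop rows present in both tables, then drop the helper column
anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis=1)
```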
Here is a comparison with a solution from piRSquared:
1) When run on this example, matching on one column, piRSquared's solution is faster.
2) But it only works for matching on one column. If you want to match on several columns, my solution works just as well as with one column.
So it's up to you to decide.
Consider the following dataframes
TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableA
TableB
This is one way to do what you want
Method 1
# Identify what values are in TableB and not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)
# Slice TableB accordingly and append to TableA
TableA.append(TableB[where_diff], ignore_index=True)
Method 2
rows = []
for i, row in TableB.iterrows():
    if row.Key not in TableA.Key.values:
        rows.append(row)
pd.concat([TableA.T] + rows, axis=1).T
Timing
4 rows with 2 overlapping keys: Method 1 is much quicker.
10,000 rows with 5,000 overlapping: loops are bad.
I had the same problem. This answer using how='outer' and indicator=True of merge inspired me to come up with this solution:
import pandas as pd
import numpy as np
TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
print('TableA', TableA, sep='\n')
print('TableB', TableB, sep='\n')
TableB_only = pd.merge(
    TableA, TableB,
    how='outer', on='Key', indicator=True, suffixes=('_foo', '')).query(
    '_merge == "right_only"')
print('TableB_only', TableB_only, sep='\n')
Table_concatenated = pd.concat((TableA, TableB_only), join='inner')
print('Table_concatenated', Table_concatenated, sep='\n')
Which prints this output:
TableA
Key A B C
0 a 0.035548 0.344711 0.860918
1 b 0.640194 0.212250 0.277359
2 c 0.592234 0.113492 0.037444
3 d 0.112271 0.205245 0.227157
TableB
Key A B C
0 a 0.754538 0.692902 0.537704
1 e 0.499092 0.864145 0.004559
2 c 0.082087 0.682573 0.421654
3 f 0.768914 0.281617 0.924693
TableB_only
Key A_foo B_foo C_foo A B C _merge
4 e NaN NaN NaN 0.499092 0.864145 0.004559 right_only
5 f NaN NaN NaN 0.768914 0.281617 0.924693 right_only
Table_concatenated
Key A B C
0 a 0.035548 0.344711 0.860918
1 b 0.640194 0.212250 0.277359
2 c 0.592234 0.113492 0.037444
3 d 0.112271 0.205245 0.227157
4 e 0.499092 0.864145 0.004559
5 f 0.768914 0.281617 0.924693
Easiest answer imaginable:
tableB = pd.concat([tableB, pd.Series(1)], axis=1)
mergedTable = tableA.merge(tableB, how="left", on="key")
answer = mergedTable[mergedTable.iloc[:, -1].isnull()][tableA.columns.tolist()]
Should be the fastest proposed as well.
One liner
TableA.append(TableB.loc[~TableB.Key.isin(TableA.Key)], ignore_index=True)
%%timeit gives about the same timing as the accepted answer.
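Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the same one-liner expressed with pd.concat (a sketch on toy frames):

```python
import pandas as pd

TableA = pd.DataFrame({'Key': ['a', 'b'], 'Val': [1, 2]})
TableB = pd.DataFrame({'Key': ['a', 'c'], 'Val': [3, 4]})

# keep all of TableA, plus only the TableB rows whose Key is new
combined = pd.concat(
    [TableA, TableB.loc[~TableB.Key.isin(TableA.Key)]],
    ignore_index=True)
```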
You'll have two tables, TableA and TableB, where each table's 'Key' values are unique within that table, but some keys occur in both tables.
Then, we want to append to TableA the rows of TableB whose 'Key' does not match any row in TableA. The concept is to picture it as comparing two series of variable length, and keeping a row of series sB only if its value does not appear in series sA. The following code solves this exercise:
import pandas as pd
TableA = pd.DataFrame([[2, 3, 4], [5, 6, 7], [8, 9, 10]])
TableB = pd.DataFrame([[1, 3, 4], [5, 7, 8], [9, 10, 0]])
removeTheseIndexes = []
keyColumnA = TableA.iloc[:, 1]  # your 'Key' column here
keyColumnB = TableB.iloc[:, 1]  # same
for i in range(0, len(keyColumnA)):
    firstValue = keyColumnA[i]
    for j in range(0, len(keyColumnB)):
        copycat = keyColumnB[j]
        if firstValue == copycat:
            removeTheseIndexes.append(j)
TableB.drop(removeTheseIndexes, inplace=True)
TableA = TableA.append(TableB)
TableA = TableA.reset_index(drop=True)
Note this affects TableB's data as well. You can use inplace=False and re-assign it to a newTable, then TableA.append(newTable) alternatively.
# Table A
0 1 2
0 2 3 4
1 5 6 7
2 8 9 10
# Table B
0 1 2
0 1 3 4
1 5 7 8
2 9 10 0
# Set 'Key' column = 1
# Run the script after the loop
# Table A
0 1 2
0 2 3 4
1 5 6 7
2 8 9 10
3 5 7 8
4 9 10 0
# Table B
0 1 2
1 5 7 8
2 9 10 0
Based on one of the other suggestions, here's a function that should do it. Using only pandas functions, no looping. You can use multiple columns as the key as well. If you change the line output = merged.loc[merged.dummy_col.isna(),tableA.columns.tolist()]
to output = merged.loc[~merged.dummy_col.isna(),tableA.columns.tolist()]
you have a semi_join.
def anti_join(tableA, tableB, on):
    # if joining on the index, make it into a column
    if tableB.index.name is not None:
        dummy = tableB.reset_index()[on]
    else:
        dummy = tableB[on]
    # create a dummy column of 1s
    if isinstance(dummy, pd.Series):
        dummy = dummy.to_frame()
    dummy.loc[:, 'dummy_col'] = 1
    # preserve the index of tableA if it has one
    if tableA.index.name is not None:
        idx_name = tableA.index.name
        tableA = tableA.reset_index(drop=False)
    else:
        idx_name = None
    # do a left-join
    merged = tableA.merge(dummy, on=on, how='left')
    # keep only the non-matches
    output = merged.loc[merged.dummy_col.isna(), tableA.columns.tolist()]
    # reset the index (if applicable)
    if idx_name is not None:
        output = output.set_index(idx_name)
    return output

pandas dataframe - add row comment

How do we add an extra row of strings to a pandas dataframe?
Minimal working example:
In [65]: header_list = ['MU', 'Ars', 'THo']
    ...: team = ['MU', 'Ars', 'THot']
    ...: w = [1, 4, 5]
    ...: data = []
    ...: for n, line in enumerate(w, 1):
    ...:     temp = []
    ...:     temp.append(w[n-1])
    ...:     temp.append(w[n-1]+1)
    ...:     temp.append(w[n-1]-1)
    ...:     data.append(temp)
    ...: pd = pandas.DataFrame(data, team, header_list)
The dataframe will be saved to a csv file.
pd.to_csv(os.path.join(new_directory, base_filename), index=True, sep=',',
          doublequote=True, escapechar=None, decimal='.')
Out[65]:
MU Ars THo
MU 1 2 0
Ars 4 5 3
THot 5 6 4
How can I append an extra row of strings to the dataframe, for example: "the average is:" followed by the numerical value of the average of the third column?
Since pandas version 0.13, you can use loc:
d.loc['Avg'] = d.mean()
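A minimal runnable sketch of that line, using the matchup table from the question:

```python
import pandas as pd

d = pd.DataFrame({'MU': [1, 4, 5], 'Ars': [2, 5, 6], 'THo': [0, 3, 4]},
                 index=['MU', 'Ars', 'THot'])

# append a row labelled 'Avg' holding each column's mean
d.loc['Avg'] = d.mean()
```

Note that d.mean() is computed before the new row exists, so the averages cover only the original three rows.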
You have to use loc as radar suggested. Here is complete sample code to calculate only the third column's average:
test_df = pd.DataFrame(np.arange(9).reshape(3, 3))
mean = test_df[2].mean()  # 2 is the 3rd column's name
avg_row = ['The average is', '', mean]
test_df.loc[len(test_df)] = avg_row
test_df
Out[113]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 The average is 5
