How to for-each loop through two columns in a dataframe? - python

I have a dataframe containing 7 columns and I want to simultaneously loop through two of them to compare the values in each row. This is my for loop header, where watchCol and diaryCol are column numbers:
for watch, diary in df.iloc[:, watchCol], df.iloc[:, diaryCol]:
When I run this, I get the following error on that line:
ValueError: too many values to unpack (expected 2)
What am I doing wrong?
Thanks
EDIT:
Both columns contain datetimes. I need to compare the two values, and if the difference is within a certain range, I copy the value from the watchCol into another column, otherwise I move to the next row.

If you're trying to compare entries row by row, try this:
import pandas as pd

df = pd.DataFrame({"a": [2, 2, 2, 2, 2], "b": [4, 3, 2, 1, 0]})
df["a greater than b"] = df.apply(lambda x: x.a > x.b, axis=1)
print(df)
   a  b  a greater than b
0  2  4             False
1  2  3             False
2  2  2             False
3  2  1              True
4  2  0              True
That said, if you did want to loop through the elements row by row:
for a, b in zip(df.iloc[:, 0], df.iloc[:, 1]):
    print(a, b)
2 4
2 3
2 2
2 1
2 0
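Given the edit about datetimes, the comparison itself can also be done without an explicit loop. Here is a minimal sketch; the column names watch_time, diary_time, and result, and the five-minute tolerance, are assumptions for illustration:
import pandas as pd

df = pd.DataFrame({
    "watch_time": pd.to_datetime(["2020-01-01 10:00", "2020-01-01 11:00"]),
    "diary_time": pd.to_datetime(["2020-01-01 10:03", "2020-01-01 12:00"]),
})

# copy watch_time where the two timestamps differ by at most 5 minutes;
# rows that fail the check are left empty (NaT/NaN)
tolerance = pd.Timedelta(minutes=5)
mask = (df["watch_time"] - df["diary_time"]).abs() <= tolerance
df.loc[mask, "result"] = df.loc[mask, "watch_time"]
print(df)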

Related

How can I find and store how many columns it takes to reach a value greater than the first value in each row?

import pandas as pd
df = {'a': [3, 4, 5], 'b': [1, 2, 3], 'c': [4, 3, 3], 'd': [1, 5, 4], 'e': [9, 4, 6]}
df1 = pd.DataFrame(df, columns=['a', 'b', 'c', 'd', 'e'])
dg = {'b': [2, 3, 4]}
df2 = pd.DataFrame(dg, columns=['b'])
Original dataframe is df1. For each row, I want to find the first time a value is bigger than the value in the first column and store it in a new dataframe.
df1
   a  b  c  d  e
0  3  1  4  1  9
1  4  2  3  5  4
2  5  3  3  4  6
df2 is the resulting dataframe. For example, in the first row of df1 the first value is 3, and the first value bigger than 3 is 4 (column c); hence in the first row of df2 we store 2 (column c is two columns after column a). In the second row of df1 the first value is 4, and the first value bigger than 4 is 5 (column d); hence we store 3. In the third row the first value is 5, and the first value bigger than 5 is 6 (column e); hence we store 4.
df2
   b
0  2
1  3
2  4
I would appreciate the help.
In your case we can use sub; where the difference is greater than 0, we take the first such column with idxmax and turn it into a position with get_indexer:
s = df1.columns.get_indexer(df1.drop('a', axis=1).sub(df1['a'], axis=0).gt(0).idxmax(axis=1))
# array([2, 3, 4])
df2 = pd.DataFrame({'b': s})
You can get the column names by comparing every column of the DataFrame element-wise against the first column, replacing False values with NaN, and applying first_valid_index row-wise, e.g.:
names = (
    df1.gt(df1.iloc[:, 0], axis=0)
       .replace(False, pd.NA)  # or use np.nan
       .apply(pd.Series.first_valid_index, axis=1)
)
That'll give you:
0    c
1    d
2    e
Then you can convert those to offsets:
offsets = df1.columns.get_indexer(names)
# array([2, 3, 4])
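One caveat, in case a row never exceeds its first value: idxmax on an all-False row still returns the first column's label, and first_valid_index returns None. A minimal sketch that masks such rows (assuming NaN is an acceptable marker for them):
gt = df1.drop(columns='a').gt(df1['a'], axis=0)
# idxmax would report column 'b' even when nothing qualifies,
# so blank out rows where no value is greater than the first
labels = gt.idxmax(axis=1).where(gt.any(axis=1))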

Change all index of pandas series to one value

I'm trying to change all index values of a pandas series to one value. I have 200k+ rows and the index is a number from 0 to 200k+. I want the index to be a single string, for example 'Token'. Is this possible with pandas? I've tried reindex but that doesn't seem to work; I think it would only work if I passed a 200k-element list of 'token' as the argument, which is not what I want to do.
Use insert and set_index, as in the example below:
df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
df
Out:
   B  C
0  1  4
1  2  5
2  3  6
idx = 0
index_string = 'token'
df.insert(loc=idx, column='A', value=index_string)
df.set_index('A', inplace=True)
df
Out:
       B  C
A
token  1  4
token  2  5
token  3  6
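Alternatively, if you don't want the extra column, you can assign the index directly; a minimal sketch:
df.index = ['token'] * len(df)  # every row gets the same index label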

Match rows between dataframes and preserve order

I work in python and pandas.
Let's suppose that I have a dataframe like that (INPUT):
   A  B  C
0  2  8  6
1  5  2  5
2  3  4  9
3  5  1  1
I want to process it to finally get a new dataframe which looks like that (EXPECTED OUTPUT):
   A  B    C
0  2  7  NaN
1  5  1    1
2  3  3  NaN
3  5  0  NaN
To manage this I do the following:
import numpy as np
import pandas as pd

columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
df_2 = df_1
df_2['B'] -= 1
df_2['C'] = np.nan
df_2 looks like that for now:
   A  B    C
0  2  7  NaN
1  5  1  NaN
2  3  3  NaN
3  5  0  NaN
Now I want to do a matching/merging between df_1 and df_2, using columns A and B as keys.
I tried with isin() to do this:
df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]
df_2.iloc[df_temp.index] = df_temp
but it gives me back the same df_2 as before without matching the common row 5 1 1 for A, B, C respectively:
   A  B    C
0  2  7  NaN
1  5  1  NaN
2  3  3  NaN
3  5  0  NaN
How can I do this properly?
By the way, just to be clear, the matching should not be done like
1st row of df1 - 1st row of df2
2nd row of df1 - 2nd row of df2
3rd row of df1 - 3rd row of df2
...
But it has to be done as:
any row of df1 - any row of df2
based on the specified columns as keys.
I think this is why isin() in my code above does not work: it does the filtering/matching in the former way.
On the other hand, .merge() can do the matching in the latter way, but it does not preserve the order of the rows in the way I want, and fixing that is tricky or inefficient.
Finally, keep in mind that my actual dataframes use far more than 2 columns (e.g. 15) as keys for the matching, so please suggest something concise that works even for bigger dataframes.
P.S.
See my answer below.
Here's my suggestion using a lambda function in apply. Should be easily scalable to more columns to compare (just adjust cols_to_compare accordingly). By the way, when generating df_2, be sure to copy df_1, otherwise changes in df_2 will carry over to df_1 as well.
So generating the data first:
import numpy as np
import pandas as pd

columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
df_2 = df_1.copy()  # Be sure to create a copy here
df_2['B'] -= 1
df_2['C'] = np.nan
and now we 'scan' df_1 for the rows of interest:
cols_to_compare = ['A', 'B']
df_2['C'] = df_2.apply(
    lambda x: 1 if any((df_1[cols_to_compare].values == x[cols_to_compare].values).all(1)) else np.nan,
    axis=1,
)
What it does is check, for each row of df_2, whether the same values appear together in any row of the relevant columns of df_1.
The output is:
   A  B    C
0  2  7  NaN
1  5  1  1.0
2  3  3  NaN
3  5  0  NaN
Someone (I do not remember their username) suggested the following, which I think works, and then deleted their post for some reason:
df_2=df_2.set_index(['A','B'])
temp = df_1.set_index(['A','B'])
df_2.update(temp)
df_2.reset_index(inplace=True)
You can accomplish this using two for loops:
for idx, row in df_2.iterrows():
    for _, row2 in df_1.iterrows():
        if [row['A'], row['B']] == [row2['A'], row2['B']]:
            df_2.loc[idx, 'C'] = row2['C']  # assign via .loc so the change sticks
Just modify your line below:
df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]
with:
df_temp = df_1[df_1['A'].isin(df_2['A']) & df_1['B'].isin(df_2['B'])]
It works fine for this example, though note that it checks membership in A and in B independently rather than as (A, B) pairs.
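For what it's worth, a left merge also preserves the row order of the left frame, so the whole matching can be done in one step, and it scales to any number of key columns. A minimal sketch (assuming df_1 has no duplicate key rows, which would otherwise duplicate rows in the result):
keys = ['A', 'B']  # extend to 15 key columns as needed
df_2 = df_2.drop(columns='C').merge(df_1, on=keys, how='left')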

Is there any way to replace values with different values in the same row if they meet a certain condition using a for loop?

I need to replace the values of a certain cell with values from another cell if a certain condition is met.
for r in df:
    if df['col1'] > 1:
        df['col2']
    else:
I am hoping for every value in column 1 to be replaced with its respective value in column 2 whenever the value in column 1 is greater than 1.
No need to loop through the entire dataframe:
idx = df['col1'] > 1
df.loc[idx, 'col1'] = df.loc[idx, 'col2']
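Equivalently, as a one-liner with numpy.where (a sketch, assuming numpy is imported as np):
df['col1'] = np.where(df['col1'] > 1, df['col2'], df['col1'])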
Using a for loop:
for idx, row in df.iterrows():
    if row['col1'] > 1:
        df.loc[idx, 'col1'] = row['col2']  # write back via .loc; assigning to row only changes a copy
    # elif some_condition:
    #     put assignment here
    # else:
    #     put assignment here
Here is an example
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 4, 6]})
print(df)
print('----------')
# the condition here is A^2 == B
df.loc[df['A'] * df['A'] == df['B'], 'A'] = df['B']
print(df)
output
   A  B
0  1  4
1  2  4
2  3  6
----------
   A  B
0  1  4
1  4  4
2  3  6
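If you have several conditions, like the elif/else branches sketched in the loop above, numpy.select is the vectorized counterpart. A minimal sketch with made-up conditions and values:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [0, 2, 5], 'col2': [10, 20, 30]})
conditions = [df['col1'] > 4, df['col1'] > 1]  # checked in order
choices = [df['col2'] * 2, df['col2']]         # value to use per condition
df['col1'] = np.select(conditions, choices, default=df['col1'])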

Selecting rows from pandas DataFrame using a list

I have a list of lists as below
[[1, 2], [1, 3]]
The DataFrame is similar to
   A  B  C
0  1  2  4
1  0  1  2
2  1  3  0
I would like a DataFrame containing the rows where the value in column A equals the first element of any of the nested lists and the value in column B of the same row equals the second element of that same nested list.
Thus the resulting DataFrame should be
   A  B  C
0  1  2  4
2  1  3  0
The code below does what you need:
import pandas

tmp_filter = pandas.DataFrame()  # the dataframe you want

# Create your list and your dataframe
tmp_list = [[1, 2], [1, 3]]
tmp_df = pandas.DataFrame([[1, 2, 4], [0, 1, 2], [1, 3, 0]], columns=['A', 'B', 'C'])

# This function walks through the df column by column and
# only keeps the rows with the values you want
def pass_true_df(df, cond):
    for i, c in enumerate(cond):
        df = df[df.iloc[:, i] == c]
    return df

# Pass through your list and collect the rows you want to keep
for i in tmp_list:
    tmp_filter = pandas.concat([tmp_filter, pass_true_df(tmp_df, i)])
import pandas

df = pandas.DataFrame([[1, 2, 4], [0, 1, 2], [1, 3, 0], [0, 2, 5], [1, 4, 0]],
                      columns=['A', 'B', 'C'])
filt = pandas.DataFrame([[1, 2], [1, 3], [0, 2]],
                        columns=['A', 'B'])
accum = []
# group the frame to be filtered by column A
data_g = df.groupby('A')
for k2, v2 in data_g:
    accum.append(v2[v2.B.isin(filt.B[filt.A == k2])])
print(pandas.concat(accum))
result:
   A  B  C
3  0  2  5
0  1  2  4
2  1  3  0
(I made the data and filter a little more complicated as a test.)
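A more concise alternative is to treat the key pairs as index tuples; a sketch reusing the tmp_list and tmp_df names from the first answer:
mask = tmp_df.set_index(['A', 'B']).index.isin([tuple(p) for p in tmp_list])
print(tmp_df[mask])  # keeps the original row order and index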
