How to iterate over every cell in a pandas DataFrame? - python

How to iterate over every cell of a pandas DataFrame?
Something like:
for every_cell in df.iterrows():
    print(every_cell)
Printing is of course not the goal; the cell values of the df should be updated in a MongoDB.

If it has to be a for loop, you can do it like this:
def up_da_ter3(df):
    columns = df.columns.tolist()
    for _, i in df.iterrows():
        for c in columns:
            print(i[c])
        print("############")

You can use applymap. It will iterate down each column, starting with the leftmost. But in general you almost never need to iterate over every value of a DataFrame; pandas has much more performant ways to accomplish most calculations.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(6).reshape(-1, 2), columns=['A', 'B'])
#    A  B
# 0  0  1
# 1  2  3
# 2  4  5
df.applymap(lambda x: print(x))
0
2
4
1
3
5
If you need it to go through the DataFrame row by row instead, you can transpose first:
df.T.applymap(lambda x: print(x))
0
1
2
3
4
5
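Note that on newer pandas versions (2.1+), applymap is deprecated in favour of DataFrame.map, which behaves the same way for this purpose:
df.map(lambda x: print(x))  # pandas >= 2.1 replacement for applymap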

Another option is a list comprehension with two for clauses, such as
for i in [df[j][k] for k in range(0, len(df)) for j in df.columns]:
    print(i)
This iterates from the first column to the last column of the first row, then repeats the same process for each subsequent row.
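If all you need is every cell value in row-major order (first row left to right, then the second row, and so on), a shorter sketch is to flatten the underlying array:
for value in df.to_numpy().ravel():  # ravel() flattens the values row by row by default
    print(value)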

Related

Pandas: add number of unique values to other dataset (as shown in picture)

I need to add the number of unique values in column C (right table) to the related row in the left table based on the values in common column A (as shown in the picture).
Thank you in advance.
Group by column A in the second dataset and calculate the count of unique values in column C. Merge it with the first dataset on column A. Rename column C to C-count if needed:
>>> count_df = df2.groupby('A', as_index=False).C.nunique()
>>> output = pd.merge(df1, count_df, on='A')
>>> output.rename(columns={'C':'C-count'}, inplace=True)
>>> output
   A   B  C-count
0  2  22        3
1  3  23        2
2  5  21        1
3  1  24        1
4  6  21        1
Use DataFrameGroupBy.nunique with Series.map for new column in df1:
df1['C-count'] = df1['A'].map(df2.groupby('A')['C'].nunique())
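A minimal sketch of the same idea with made-up data (the original tables are only shown in the picture, so the values below are assumptions):
import pandas as pd
df1 = pd.DataFrame({'A': [2, 3, 5], 'B': [22, 23, 21]})
df2 = pd.DataFrame({'A': [2, 2, 2, 3, 3, 5], 'C': [1, 2, 3, 4, 5, 7]})
df1['C-count'] = df1['A'].map(df2.groupby('A')['C'].nunique())
print(df1)
#    A   B  C-count
# 0  2  22        3
# 1  3  23        2
# 2  5  21        1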
The following may not be the most efficient way of doing this, so be careful if your tables are large.
Define the following function:
def c_value(a_value, right_table):
    c_ids = []
    for index, row in right_table.iterrows():
        if row['A'] == a_value:
            if row['C'] not in c_ids:
                c_ids.append(row['C'])
    return len(c_ids)
For this function I'm assuming that right_table is a pandas.DataFrame.
Now, build the new column as follows (assuming that the left table is also a pandas.DataFrame):
new_column = []
for index, row in left_table.iterrows():
    new_column.append(c_value(row['A'], right_table))
left_table["C-count"] = new_column
After this, the left_table DataFrame should be the desired one (as far as I understand what you need).

Idiomatic way to create pandas dataframe as concatenation of function of another's rows

Say I have one dataframe
import pandas as pd
input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))
Also I have a function f that maps each row to another dataframe. Here's an example of such a function. Note that in general the function could take any form so I'm not looking for answers that use agg to reimplement the f below.
def f(row):
    return pd.DataFrame(dict(x=[row['a'] * row['b'], row['a'] + row['b']],
                             y=[row['a']**2, row['b']**2]))
I want to create one dataframe that is the concatenation of the function applied to each of the first dataframe's rows. What is the idiomatic way to do this?
output_df = pd.concat([f(row) for _, row in input_df.iterrows()])
I thought I should be able to use apply or similar for this purpose but nothing seemed to work.
   x  y
0  2  1
1  3  4
0  6  4
1  5  9
You can use DataFrame.agg to calculate prod and sum, and numpy.ndarray.reshape together with df.pow(2) or np.square to calculate the squares.
import numpy as np
out = pd.DataFrame({'x': input_df.agg(['prod', 'sum'], axis=1).to_numpy().reshape(-1),
                    'y': np.square(input_df).to_numpy().reshape(-1)})
out
   x  y
0  2  1
1  3  4
2  6  4
3  5  9
You should avoid iterating over rows (see How to iterate over rows in a DataFrame in Pandas).
Instead try:
df = df.assign(product=df.a*df.b, sum=df.sum(axis=1),
               asq=df.a**2, bsq=df.b**2)
Then:
out = [[[p, s], [asq, bsq]] for p, s, asq, bsq in df[['product', 'sum', 'asq', 'bsq']].to_numpy()]
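An alternative to iterrows that keeps the original f unchanged is to iterate over plain dicts, which is usually faster and matches the expected output above:
# to_dict('records') yields one dict per row; f(row) still works because
# it only accesses row['a'] and row['b']
output_df = pd.concat([f(row) for row in input_df.to_dict('records')])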

How do I efficiently assign a single value per groupby group in Pandas

I have a Pandas DataFrame with a column of non-unique numbers. I want to return a different random number for each of the non-unique values, but return the same random number at each row the non-unique value appears i.e. so the shape of the output dataframe of random numbers matches that of the ungrouped data frame.
I can do this like:
df.groupby('NonUnique').transform(lambda x: np.random.rand())
This returns a different random number for each group, as desired.
However, this is slow for large dataframes, but np.random.rand(df.size) is very fast. Is there any way to achieve what I want in a more efficient way? I can't seem to find a way to vectorise the assignment per group...
Create an array whose length is the number of unique values, then use factorize with numpy indexing to repeat each group's value:
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame({'A': list('aaabbb')})
a = np.random.rand(len(df['A'].unique()))
df['B'] = a[pd.factorize(df.A)[0]]
print(df)
   A         B
0  a  0.696469
1  a  0.696469
2  a  0.696469
3  b  0.286139
4  b  0.286139
5  b  0.286139
Detail:
print(pd.factorize(df.A)[0])
[0 0 0 1 1 1]
If you're grouping anyway, you can just use ngroup():
df.groupby('column').ngroup()
or
df.groupby('column').transform('ngroup')
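To turn those group numbers into one random number per group (the actual goal of the question), one sketch, using the df with column 'A' from the example above and assuming numpy is imported as np:
codes = df.groupby('A').ngroup().to_numpy()        # group number for every row
df['B'] = np.random.rand(codes.max() + 1)[codes]   # one random value per group, repeated per row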

What's the fastest way to select values from columns based on keys in another columns in pandas?

I need a fast way to extract the right values from a pandas dataframe:
Given a dataframe with (a lot of) data in several named columns and an additional column whose values only contain names of the other columns, how do I select values from the data columns using the additional column as keys?
It's simple to do via an explicit loop, but this is extremely slow with something like .iterrows() directly on the DataFrame. If converting to numpy-arrays, it's faster, but still not fast. Can I combine methods from pandas to do it even faster?
Example: This is the kind of DataFrame structure, where columns A and B contain data and column keys contains the keys to select from:
import pandas
df = pandas.DataFrame(
    {'A': [1, 2, 3, 4],
     'B': [5, 6, 7, 8],
     'keys': ['A', 'B', 'B', 'A']},
)
print(df)
output:
Out[1]:
   A  B keys
0  1  5    A
1  2  6    B
2  3  7    B
3  4  8    A
Now I need some fast code that returns a DataFrame like
Out[2]:
   val_keys
0         1
1         6
2         7
3         4
I was thinking something along the lines of this:
tmp = df.melt(id_vars=['keys'], value_vars=['A', 'B'])
out = tmp.loc[tmp['keys'] == tmp['variable']]
which produces:
Out[2]:
  keys variable  value
0    A        A      1
3    A        A      4
5    B        B      6
6    B        B      7
but doesn't have the right order or index. So it's not quite a solution.
Any suggestions?
See if either of these works for you (assuming numpy is imported as np):
df['val_keys']= np.where(df['keys'] =='A', df['A'],df['B'])
or
df['val_keys']= np.select([df['keys'] =='A', df['keys'] =='B'], [df['A'],df['B']])
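If there are many data columns and you don't want to spell each one out in np.where/np.select, a more general sketch is a factorize-based lookup (again assuming numpy is imported as np):
idx, cols = pd.factorize(df['keys'])              # idx: column code per row, cols: unique column names
lookup = df.reindex(cols, axis=1).to_numpy()      # data columns in the same order as the codes
df['val_keys'] = lookup[np.arange(len(df)), idx]  # pick each row's value from its key column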
No need to specify the column names explicitly for the code below:
def value(row):
    a = row.name       # the row's index label
    b = row['keys']    # name of the column to pick from
    c = df.loc[a, b]   # look up that cell
    return c
df.apply(value, axis=1)
Have you tried filtering then mapping:
df_A = df[df['keys'] == 'A']
df_B = df[df['keys'] == 'B']
A_dict = dict(zip(df_A.index, df_A['A']))   # row index -> value taken from column A
B_dict = dict(zip(df_B.index, df_B['B']))   # row index -> value taken from column B
df['val_keys'] = df.index.map(A_dict)       # fill the 'A' rows
df['val_keys'] = df['val_keys'].fillna(B_dict)  # non-exhaustive mapping for the remaining ('B') rows
Your df['val_keys'] column will now contain the result as in your val_keys output.
If you want you can just retain that column as in your expected output by:
df = df[['val_keys']]
Hope this helps :))

How to modify data after replicate in Pandas?

I am trying to edit values after making duplicate rows in Pandas.
I want to edit only one column ("code"), but I see that since it has duplicates, it will affect the entire rows.
Is there any method to first create duplicates and then modify only the data of the duplicates created?
import pandas as pd
df=pd.read_excel('so.xlsx',index=False)
a = df['code'] == 1234
b = df[a]
df=df.append(b)
print('\n\nafter replicate')
print(df)
Current output after making duplicates is as below:
  coun  code name
0    A   123   AR
1    F   123   AD
2    N     7   AR
3    I     0   AA
4    T    10   AS
2    N     7   AR
3    I     7   AA
Now I expect to change values only on the duplicates created, in this case the bottom two rows. But now I see that the indexes are duplicated as well.
You can avoid the duplicate indices by using the ignore_index argument to append.
df=df.append(b, ignore_index=True)
You may also find it easier to modify your data in b, before appending it to the frame.
import pandas as pd
df = pd.read_excel('so.xlsx', index=False)
a = df['code'] == 3
b = df[a].copy()             # work on a copy so the original rows are untouched
b.loc[2, "region"] = "N"     # modify the copy (avoids chained-assignment issues)
df = df.append(b, ignore_index=True)
print('\n\nafter replicate')
print(df)
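On pandas 2.0 and later DataFrame.append has been removed, so the same idea would use pd.concat instead. A sketch (the new code value 9999 is just an example, not from the question):
b = df[df['code'] == 1234].copy()
b['code'] = 9999                            # modify only the duplicated rows; example value
df = pd.concat([df, b], ignore_index=True)  # replacement for the removed df.append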
