I created a list as the mean of two other columns; the length of the list is the same as the number of rows in the DataFrame. But when I try to add that list as a column to the DataFrame, the entire list gets assigned to each row instead of only the corresponding value.
glucose_mean = []
for i in range(len(df)):
    mean = (df['h1_glucose_max'] + df['h1_glucose_min']) / 2
    glucose_mean.append(mean)
df['glucose'] = glucose_mean
(screenshot: data after adding the list)
I think you overcomplicated it. You don't need a for loop; one line is enough:
df['glucose'] = (df['h1_glucose_max'] + df['h1_glucose_min']) / 2
EDIT:
If you want to work with every row separately, you can use .apply():
def func(row):
    return (row['h1_glucose_max'] + row['h1_glucose_min']) / 2

df['glucose'] = df.apply(func, axis=1)
And if you really need a for loop, you can use .iterrows() (or similar functions):
glucose_mean = []
for index, row in df.iterrows():
    mean = (row['h1_glucose_max'] + row['h1_glucose_min']) / 2
    glucose_mean.append(mean)
df['glucose'] = glucose_mean
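.itertuples() is one of those similar functions and is usually faster than .iterrows(); a minimal sketch of the same loop, assuming the column names are valid Python identifiers:

glucose_mean = []
for row in df.itertuples(index=False):
    # each row is a namedtuple, so the columns are attributes
    glucose_mean.append((row.h1_glucose_max + row.h1_glucose_min) / 2)
df['glucose'] = glucose_mean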
Minimal working example:
import pandas as pd

data = {
    'h1_glucose_min': [1, 2, 3],
    'h1_glucose_max': [4, 5, 6],
}
df = pd.DataFrame(data)

# - version 1 -
df['glucose_1'] = (df['h1_glucose_max'] + df['h1_glucose_min']) / 2

# - version 2 -
def func(row):
    return (row['h1_glucose_max'] + row['h1_glucose_min']) / 2

df['glucose_2'] = df.apply(func, axis=1)

# - version 3 -
glucose_mean = []
for index, row in df.iterrows():
    mean = (row['h1_glucose_max'] + row['h1_glucose_min']) / 2
    glucose_mean.append(mean)
df['glucose_3'] = glucose_mean

print(df)
You do not need to iterate over your frame. Use this instead (example with a mock data frame):
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8], 'col2': [10, 9, 8, 7, 6, 5, 4, 100]})
df['mean_col1_col2'] = df[['col1', 'col2']].mean(axis=1)
df
-----------------------------------
col1 col2 mean_col1_col2
0 1 10 5.5
1 2 9 5.5
2 3 8 5.5
3 4 7 5.5
4 5 6 5.5
5 6 5 5.5
6 7 4 5.5
7 8 100 54.0
-----------------------------------
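One thing worth knowing about mean(axis=1): it ignores NaN by default, so a row with a missing value still gets the mean of the remaining columns. If you would rather propagate the NaN, skipna=False does that (a small sketch on the frame above; the new column names are just for illustration):

df['mean_default'] = df[['col1', 'col2']].mean(axis=1)               # NaN values are skipped
df['mean_strict'] = df[['col1', 'col2']].mean(axis=1, skipna=False)  # any NaN makes the row mean NaN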
As you can see in the following example, your code appends an entire column (a whole Series) on every pass through the for loop, so when you assign the glucose_mean list as a column, each element is a Series instead of a single value:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [2, 3, 4, 5]})

glucose_mean = []
for i in range(len(df)):
    glucose_mean.append(df['col1'])  # appends the whole column on every iteration

print(glucose_mean[0])               # the first "element" is an entire Series

df['col2'] = [5, 6, 7, 8]            # a flat list of scalars assigns one value per row
print(df)
Output:
0 1
1 2
2 3
3 4
Name: col1, dtype: int64
col1 col2
0 1 5
1 2 6
2 3 7
3 4 8
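For completeness, a minimal sketch of how the original loop could be fixed so that a single value is appended per iteration, by pulling out scalars with .iloc[i] (using the column names from the question):

glucose_mean = []
for i in range(len(df)):
    # .iloc[i] extracts a single value instead of the whole column
    mean = (df['h1_glucose_max'].iloc[i] + df['h1_glucose_min'].iloc[i]) / 2
    glucose_mean.append(mean)
df['glucose'] = glucose_mean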
How can I create a Pandas DataFrame that shows the relative position of each value, when those values are sorted from low to high for each column?
So in this case, how can you transform 'df' into 'dfOut'?
import pandas as pd
import numpy as np
#create DataFrame
df = pd.DataFrame({'A': [12, 18, 9, 21, 24, 15],
                   'B': [18, 22, 19, 14, 14, 11],
                   'C': [5, 7, 7, 9, 12, 9]})

# How to assign a value to the order in the column, when sorted from low to high?
dfOut = pd.DataFrame({'A': [2, 4, 1, 5, 6, 3],
                      'B': [3, 5, 4, 2, 2, 1],
                      'C': [1, 2, 2, 3, 4, 3]})
If you need to map the same values to the same output, try using the rank method of a DataFrame. Like this:
>> dfOut = df.rank(method="dense").astype(int) # Type transformation added to match your output
>> dfOut
A B C
0 2 3 1
1 4 5 2
2 1 4 2
3 5 2 3
4 6 2 4
5 3 1 3
The rank method computes the rank for each column following a specific criterion. According to the Pandas documentation, the "dense" method ensures that "rank always increases by 1 between groups", and that might match your use case.
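To illustrate what "dense" changes compared with the other methods, here is a small comparison on column B from the question; with method="min", tied values share a rank but the following rank is skipped, whereas "dense" keeps the ranks consecutive:

>> df['B'].rank(method="min").astype(int).tolist()    # [4, 6, 5, 2, 2, 1] - rank 3 is skipped after the tie
>> df['B'].rank(method="dense").astype(int).tolist()  # [3, 5, 4, 2, 2, 1] - ranks stay consecutive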
Original answer: In case repeated numbers are not required to map to the same output value, np.argsort could be applied to each column to retrieve the position of each value that would sort the column. Combine this with the apply method of a DataFrame to apply the function on each column and you have this:
>> dfOut = df.apply(lambda column: np.argsort(column.values))
>> dfOut
A B C
0 2 5 0
1 0 3 1
2 5 4 2
3 1 0 3
4 3 2 5
5 4 1 4
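Note that argsort gives, for each output position, the index of the value that would land there, which is not quite a rank. If ties do not have to share a value, applying argsort twice gives each value's 1-based position in sorted order (a sketch, roughly equivalent to df.rank(method="first"); the order within ties may differ):

>> dfOut = df.apply(lambda column: np.argsort(np.argsort(column.values)) + 1)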
Here is my attempt using some functions:
def sorted_idx(l, num):
    x = sorted(list(set(l)))
    for i in range(len(x)):
        if x[i] == num:
            return i + 1

def output_list(l):
    ret = [sorted_idx(l, elem) for elem in l]
    return ret

dfOut = df.apply(lambda column: output_list(column))
print(dfOut)
I reduce the original list to its unique values and sort them. Then I return index + 1 of the position where each element of the original list appears in this unique, sorted list, which gives the values in your expected output.
Output:
A B C
0 2 3 1
1 4 5 2
2 1 4 2
3 5 2 3
4 6 2 4
5 3 1 3
The following is the dataframe,
a b
0 1 3
1 2 4
2 3 5
3 4 6
4 5 7
5 6 8
6 7 9
I want to add a new column, call it sum, which takes the sum of its respective row values.
Expected output
a b sum
0 1 3 4
1 2 4 6
2 3 5 8
3 4 6 10
4 5 7 12
5 6 8 14
6 7 9 16
How can I achieve this using the pandas map, apply, and applymap functions?
My Code
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7],
    'b': [3, 4, 5, 6, 7, 8, 9]
})

def sum(df):
    return df['a'] + df['b']

# Methods I tried
df['sum'] = df.apply(sum(df))
df['sum'] = df[['a', 'b']].map(sum)
df['sum'] = df.apply(lambda x: x['a'] + x['b'])
Note: This is just dummy code. The original code has a function which returns a different output for each individual row, and it isn't as simple as applying a sum function. So I request you to write a custom sum function and implement those methods, so that I can learn and apply the same to my code.
You can use the pandas sum function like below:
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7], "b": [3, 4, 5, 6, 7, 8, 9]})
df["sum"] = df.sum(axis=1)
print(df)
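If the frame has more columns than just a and b, df.sum(axis=1) would include them as well; selecting the columns first keeps the sum restricted to the ones you care about (a small sketch):

df['sum'] = df[['a', 'b']].sum(axis=1)  # only sum the selected columns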
And if you have to use lambda with apply you can try:
import pandas as pd
def add(a, b):
    return a + b

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7],
    'b': [3, 4, 5, 6, 7, 8, 9]
})

df['sum'] = df.apply(lambda row: add(row['a'], row['b']), axis=1)
print(df)
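As for applymap from the question: it applies a function to every individual element rather than to whole rows, so it cannot combine two columns; it is meant for elementwise transformations, for example (newer pandas versions also offer DataFrame.map for the same elementwise behavior):

df_doubled = df.applymap(lambda x: x * 2)  # elementwise, not row-wise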
Hi, I would like to change the names of some of the columns in my dataframe. When I print just the part I want to change, palColAdj.iloc[:, 73:].columns.str[:-2], I see the outcome I would like to see, but when I try to change it in my original dataframe I don't see the change.
So if I write either
palColAdj.iloc[:, 73:].columns=palColAdj.iloc[:, 73:].columns.str[:-2]
or
prodColAdj.iloc[:, 39:].columns=prodColAdj.iloc[:, 39:].columns.str[:-2].to_list()
and afterwards I print
prodColAdj.head()
I still see the original column names. How can this be?
Here's a way to do it.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['aaaa', 'bbbb', 'ccccc'])
#    aaaa  bbbb  ccccc
# 0     1     2      3
# 1     4     5      6
# 2     7     8      9

# build a mapping from old name to new name (last two characters dropped)
cols = df.columns.values
col_map = {}
for col in cols:
    col_map[col] = col[:-2]
df.rename(col_map, axis=1, inplace=True)
#    aa  bb  ccc
# 0   1   2    3
# 1   4   5    6
# 2   7   8    9
To pick specific cols, edit this:
cols = df.columns.values[0:2]
# array(['aaaa', 'bbbb'], dtype=object)
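Applied to the question itself, where only the columns from position 73 onward should lose their last two characters, a dict comprehension over that slice of the columns can feed rename (a sketch, assuming palColAdj and the offset 73 from the question):

mapping = {col: col[:-2] for col in palColAdj.columns[73:]}
palColAdj = palColAdj.rename(columns=mapping)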
Let's say I have a csv where a sample row looks like [' ', 1, 2, 3, 4, 5], where ' ' indicates an empty cell. I want to iterate through all of the rows in the .csv and replace the value in the first column of each row with another value, i.e. [100, 1, 2, 3, 4, 5]. How could this be done? It's also worth noting that the columns don't have labels (they were converted from an .xlsx).
Currently, I'm trying this:
for i, row in test.iterrows():
    value = randomFunc(x, row)
    test.loc[test.index[i], 0] = value
But this adds a column at the end with the label 0.
Use iloc to select the first column by position, and replace with a regex matching zero or more whitespace characters:
import pandas as pd

df = pd.DataFrame({
    0: ['', 20, ' '],
    1: [20, 10, 20]
})

df.iloc[:, 0] = df.iloc[:, 0].replace(r'^\s*$', 100, regex=True)
print(df)
     0   1
0  100  20
1   20  10
2  100  20
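If the "empty" cells actually come back as NaN rather than whitespace strings (which is common when a csv is read), a regex replace will not match them; fillna on the same positional selection handles that case (a sketch):

df.iloc[:, 0] = df.iloc[:, 0].fillna(100)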
You don't need a for loop when using pandas and numpy.
Just an example below, where b and c are empty and get replaced by the replace method:
import pandas as pd
import numpy as np
>>> df
0
a 1
b
c
>>> df.replace('', 100, inplace=True)
>>> df
0
a 1
b 100
c 100
Example of replacing the empty cells in a specific column:
In the example below we have two columns, col1 and col2, where col1 has empty cells at index 2 and 4.
>>> df
col1 col2
0 1 6
1 2 7
2
3 4
4 10
To replace only the above-mentioned empty cells in col1:
Note that df.col1 refers to the entire column, i.e. all the rows down that column, which is what makes this handy.
>>> df.col1.replace('', 100, inplace=True)
>>> df
col1 col2
0 1 6
1 2 7
2 100
3 4
4 100 10
Another way, selecting the DataFrame column explicitly:
>>> df['col1'] = df.col1.replace('', 100, regex=True)
>>> df
col1 col2
0 1 6
1 2 7
2 100
3 4
4 100 10
Why don't you do something like this:
import random as rd
import pandas as pd

df = pd.DataFrame([1, ' ', 2, 3, ' ', 5, 5, 5, 6, 7, 7])
df[df[0] == " "] = rd.randint(0, 100)  # randint is evaluated once, so every blank row gets the same value
The output is:
0
0 1
1 67
2 2
3 3
4 67
5 5
6 5
7 5
8 6
9 7
10 7
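Because rd.randint(0, 100) is evaluated once, every blank row receives the same number. If each blank row should get its own random value, one option is to draw as many numbers as there are matches (a sketch using numpy on the same df and column 0 as above):

import numpy as np

mask = df[0] == " "
df.loc[mask, 0] = np.random.randint(0, 100, size=mask.sum())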
Here is a solution using the csv module:
import csv
your_value = 100  # value that you want to replace with

# newline='' keeps csv.writer from inserting blank lines on Windows
with open('input.csv', 'r') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        row[0] = your_value
        writer.writerow(row)
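If only the blank first-column cells should be replaced, rather than the first column of every row, a small condition inside the same loop covers that (a sketch):

import csv

your_value = 100
with open('input.csv', 'r') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        if not row[0].strip():  # only touch empty or whitespace-only cells
            row[0] = your_value
        writer.writerow(row)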
I am filling a DataFrame by transposing some numpy arrays:
for symbol in syms[:5]:
    price_p = Share(symbol)
    closes_p = [c['Close'] for c in price_p.get_historical(startdate_s, enddate_s)]
    dump = np.array(closes_p)
    na_price_ar.append(dump)
    print symbol

df = pd.DataFrame(na_price_ar).transpose()
df, the DataFrame, is filled correctly; however, the column names are 0, 1, 2, … and I would like to rename them with the values in syms[:5]. I googled it and found this:
for symbol in syms[:5]:
    df.rename(columns={'' + str(i) + '': symbol}, inplace=True)
    i = i + 1
But if I check the variable df I still have the same column names.
Any ideas?
Instead of using a list of arrays and transposing, you could build the DataFrame from a dict whose keys are symbols and whose values are arrays of column values:
import numpy as np
import pandas as pd

np.random.seed(2016)
syms = 'abcde'
na_price_ar = {}
for symbol in syms[:5]:
    # price_p = Share(symbol)
    # closes_p = [c['Close'] for c in price_p.get_historical(startdate_s, enddate_s)]
    # dump = np.array(closes_p)
    dump = np.random.randint(10, size=3)
    na_price_ar[symbol] = dump
    print(symbol)

df = pd.DataFrame(na_price_ar)
print(df)
yields
a b c d e
0 3 3 8 2 4
1 7 8 7 6 1
2 2 4 9 3 9
You can use:
na_price_ar = [['A','B','C'],[0,2,3],[1,2,4],[5,2,3],[8,2,3]]
syms = ['q','w','e','r','t','y','u']
df = pd.DataFrame(na_price_ar, index=syms[:5]).transpose()
print (df)
q w e r t
0 A 0 1 5 8
1 B 2 2 2 2
2 C 3 4 3 3
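If the frame from the question has already been built with pd.DataFrame(na_price_ar).transpose(), the simplest relabeling is to overwrite the column index directly (a minimal sketch; the number of labels has to match the number of columns):

df.columns = syms[:5]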
You can use df.columns[number] as the dictionary key in the .rename() method:
dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1], 'd': [4, 1, 3, 1], 'e': [5, 2, 6, 0]}
df = pd.DataFrame(dic)

number = 0
for symbol in syms[:5]:  # syms is assumed to be defined as in the question
    df.rename(columns={df.columns[number]: symbol}, inplace=True)
    number = number + 1
and the result is
i f g h i
0 4 4 5 4 5
1 1 2 7 1 2
2 3 1 9 3 6
3 1 4 1 1 0