Replace value in existing column .csv pandas - python

Let's say I have a csv where a sample row looks like: [' ', 1, 2, 3, 4, 5], where ' ' indicates an empty cell. I want to iterate through all of the rows in the .csv and replace the value in the first column of each row with another value, i.e. [100, 1, 2, 3, 4, 5]. How could this be done? It's also worth noting that the columns don't have labels (they were converted from an .xlsx).
Currently, I'm trying this:
for i, row in test.iterrows():
    value = randomFunc(x, row)
    test.loc[test.index[i], 0] = value
But this adds a column at the end with the label 0.

Use iloc to select the first column by position, then replace with a regex that matches cells that are empty or contain only whitespace:
df = pd.DataFrame({
    0: ['', 20, ' '],
    1: [20, 10, 20]
})
# raw string so that \s is passed to the regex engine unchanged
df.iloc[:, 0] = df.iloc[:, 0].replace(r'^\s*$', 100, regex=True)
print(df)
     0   1
0  100  20
1   20  10
2  100  20
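A likely reason the loop in the question adds a column labelled 0 is that the frame has no column with that label (for example, because the first data row was read as the header), so .loc creates one. The following is a minimal sketch of that idea, not the asker's actual setup: the CSV text is made up, and the row-wise sum is only a stand-in for the question's randomFunc.

```python
import io
import pandas as pd

# Hypothetical headerless CSV; the first cell of every row is empty.
csv_text = ",1,2,3,4,5\n,6,7,8,9,10\n"

# header=None keeps the first row as data and labels the columns 0..5.
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Stand-in for randomFunc: one value per row, here the sum of the other cells.
# iloc selects the first column by position, so no new column is created.
df.iloc[:, 0] = df.iloc[:, 1:].sum(axis=1)

print(df)
```

Because the assignment goes through iloc, it works the same way whatever the column labels happen to be.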

You don't need a for loop when using pandas and NumPy.
In the example below, rows b and c are empty and are filled in by the replace method:
import pandas as pd
import numpy as np
>>> df
0
a 1
b
c
>>> df.replace('', 100, inplace=True)
>>> df
0
a 1
b 100
c 100
Example of replacing the empty cells in a specific column:
In the example below there are two columns, col1 and col2, and col1 has empty cells at index 2 and 4.
>>> df
col1 col2
0 1 6
1 2 7
2
3 4
4 10
To replace the empty cells mentioned above in col1 only:
Selecting col1 applies the replacement to every row of that column, which is handy.
>>> df.col1.replace('', 100, inplace=True)
>>> df
col1 col2
0 1 6
1 2 7
2 100
3 4
4 100 10
Another way is to select the specific DataFrame column and assign the result back to it:
>>> df['col1'] = df.col1.replace('', 100, regex=True)
>>> df
col1 col2
0 1 6
1 2 7
2 100
3 4
4 100 10

Why don't you do something like this:
import random as rd  # assumption: rd is the standard random module
import pandas as pd
df = pd.DataFrame([1, ' ', 2, 3, ' ', 5, 5, 5, 6, 7, 7])
df[df[0] == " "] = rd.randint(0,100)
The output is:
0
0 1
1 10
2 2
3 3
4 67
5 5
6 5
7 5
8 6
9 7
10 7
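Note that a single call to randint produces one number, and a boolean mask without a column selects whole rows. If the goal is a separate random value for each blank cell in one column only, a sketch along these lines may be closer; the second column and the variable names are made up for illustration.

```python
import random
import pandas as pd

# Hypothetical frame: column 0 has blanks, column 1 must stay untouched.
df = pd.DataFrame({0: [1, ' ', 2, 3, ' ', 5], 1: list('abcdef')})

# Boolean mask marking the blank cells of column 0.
blank = df[0] == ' '

# Draw one random integer per blank cell and assign only to column 0.
df.loc[blank, 0] = [random.randint(0, 100) for _ in range(blank.sum())]

print(df)
```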

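Here is a solution using the csv module:
import csv
your_value = 100  # value that you want to replace with
# newline='' is what the csv module documentation recommends for files given to csv.reader and csv.writer
with open('input.csv', 'r', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        row[0] = your_value  # overwrite the first cell of every row
        writer.writerow(row)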

Related

Complete list assigned to each row in python

I created a list as a mean of 2 other columns, the length of the list is same as the number of rows in the dataframe. But when I try to add that list as a column to the dataframe, the entire list gets assigned to each row instead of only corresponding values of the list.
glucose_mean = []
for i in range(len(df)):
    mean = (df['h1_glucose_max']+df['h1_glucose_min'])/2
    glucose_mean.append(mean)
df['glucose'] = glucose_mean
[image: data after adding list]
I think you overcomplicated it. You don't need a for-loop, only one line:
df['glucose'] = (df['h1_glucose_max'] + df['h1_glucose_min']) / 2
EDIT:
If you want to work with every row separately, you can use .apply()
def func(row):
    return (row['h1_glucose_max'] + row['h1_glucose_min']) / 2
df['glucose'] = df.apply(func, axis=1)
And if you really need a for-loop, you can use .iterrows() (or similar functions)
glucose_mean = []
for index, row in df.iterrows():
    mean = (row['h1_glucose_max'] + row['h1_glucose_min']) / 2
    glucose_mean.append(mean)
df['glucose'] = glucose_mean
Minimal working example:
import pandas as pd
data = {
    'h1_glucose_min': [1,2,3],
    'h1_glucose_max': [4,5,6],
}
df = pd.DataFrame(data)
# - version 1 -
df['glucose_1'] = (df['h1_glucose_max'] + df['h1_glucose_min']) / 2
# - version 2 -
def func(row):
    return (row['h1_glucose_max'] + row['h1_glucose_min']) / 2
df['glucose_2'] = df.apply(func, axis=1)
# - version 3 -
glucose_mean = []
for index, row in df.iterrows():
    mean = (row['h1_glucose_max'] + row['h1_glucose_min']) / 2
    glucose_mean.append(mean)
df['glucose_3'] = glucose_mean
print(df)
You do not need to iterate over your frame. Use this instead (example for a pseudo data frame):
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8], 'col2': [10, 9, 8, 7, 6, 5, 4, 100]})
df['mean_col1_col2'] = df[['col1', 'col2']].mean(axis=1)
df
-----------------------------------
col1 col2 mean_col1_col2
0 1 10 5.5
1 2 9 5.5
2 3 8 5.5
3 4 7 5.5
4 5 6 5.5
5 6 5 5.5
6 7 4 5.5
7 8 100 54.0
-----------------------------------
As you can see in the following example, your code is appending an entire column each time the for loop executes, so when you assign glucose_mean list as a column, each element is a list instead of a single element:
import pandas as pd
df = pd.DataFrame({'col1':[1, 2, 3, 4], 'col2':[2, 3, 4, 5]})
glucose_mean = []
for i in range(len(df)):
    glucose_mean.append(df['col1'])
print(glucose_mean[0])
df['col2'] = [5, 6, 7, 8]
print(df)
Output:
0 1
1 2
2 3
3 4
Name: col1, dtype: int64
col1 col2
0 1 5
1 2 6
2 3 7
3 4 8

how to insert a list value into a dataframe by row and column number?

How do I insert a list value to a dataframe on a specific row and column?
For example say I have the dataframe
source col 1 col 2
0 a xxx xxx
1 b xxx xxx
2 c xxx xxx
3 a xxx xxx
My list is
list_value = [5,"text"]
How do I insert this list to the dataframe at row 1 and column 1 (col 1)
source col 1 col 2
0 a xxx xxx
1 b 5 xxx
2 c text xxx
3 a xxx xxx
EDIT
@Dev Arora
When I run your code I get this error.
d = {'col1': [1, 2,3,5,6,7], 'col2': [3, 4,5,"",5,6]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 3
1 2 4
2 3 5
3 5
4 6 5
5 7 6
list_value = [5,"text"]
df.at[1, 'col2'] = list_value
df
col1 col2
0 1 3
1 2 [5, 'text']
2 3 5
3 5
4 6 5
5 7 6
Instead I want it to be
col1 col2
0 1 3
1 2 5
2 3 'text'
3 5
4 6 5
5 7 6
Assuming we're looking at pandas dataframes:
I think the df.at operator is what you're looking for:
df = pd.read_csv("./test.csv")
list_value = [5,"text"]
string_to_input = ""
for val in list_value:
    string_to_input += str(val) + " "
df.at[<row_num>, "<col_name>"] = string_to_input
EDIT: If you're looking to add the values in just as a list you can also do
df = pd.read_csv("./test.csv")
list_value = [5,"text"]
df.at[<row_num>, "<col_name>"] = list_value
EDIT: Hmm okay, let's take this from the top. As per the post, i.e. how to insert a value into a dataframe by row and column number, we're looking at df.at. What df.at does is set a single value in a dataframe at the given row label and column label. So in the example:
d = {'col1': [1, 2,3,5,6,7], 'col2': [3, 4,5,"",5,6]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 3
1 2 4
2 3 5
3 5
4 6 5
5 7 6
list_value = [5,"text"]
df.at[1, 'col2'] = list_value
df
col1 col2
0 1 3
1 2 [5, 'text']
2 3 5
3 5
4 6 5
5 7 6
That is exactly what has happened. This is not an error.
The command df.at[1, 'col2'] = list_value specifies that at row 1 and col2 insert the list_value which is [5, 'text'].
If you want a dataframe that looks like this by specifically indicating the desired row and column for each insertion:
col1 col2
0 1 3
1 2 5
2 3 'text'
3 5
4 6 5
5 7 6
Something like this is required:
df.at[1, "col2"] = 5
df.at[2, "col2"] = 'text'
The above code specifies that at row 1, col2 insert 5, and at row 2 col2 insert 'text'. Hope this helps!
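If the list should be spread down consecutive rows of one column in a single step, a .loc slice can take the whole list at once. This is only a sketch under the assumption that the frame has the default 0..n-1 index, as in the example above; start is a name introduced here for illustration.

```python
import pandas as pd

d = {'col1': [1, 2, 3, 5, 6, 7], 'col2': [3, 4, 5, "", 5, 6]}
df = pd.DataFrame(data=d)
list_value = [5, "text"]

start = 1  # row label of the first cell to overwrite (hypothetical name)

# .loc slices include the end label, so this selects exactly
# len(list_value) rows of col2 and writes one list element per row.
df.loc[start:start + len(list_value) - 1, 'col2'] = list_value

print(df)
```

With a non-default index, the same idea works by position through df.iloc and df.columns.get_loc('col2').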

How to identify unique elements in two dataframes and append with a new row

I am trying to write a function that takes in two dataframes with a different number of rows, finds the elements that are unique to each dataframe in the first column, and then appends a new row that only contains the unique element to the dataframe where it does not exist. For example:
>>> d1 = {'col1': [1, 2, 5], 'col2': [3, 4, 6]}
>>> df1 = pd.DataFrame(data=d1)
>>> df1
col1 col2
0 1 3
1 2 4
2 5 6
>>> d2 = {'col1': [1, 2, 6], 'col2': [3, 4, 7]}
>>> df2 = pd.DataFrame(data=d2)
>>> df2
col1 col2
0 1 3
1 2 4
2 6 7
>>> standarized_unique_elems(df1, df2)
>>> df1
col1 col2
0 1 3
1 2 4
2 5 6
3 6 NaN
>>> df2
col1 col2
0 1 3
1 2 4
2 6 7
3 5 NaN
Before posting this question, I gave it my best shot, but can't figure out a good way to append a new row at the bottom of each dataframe with the unique element. Here is what I have so far:
def standardize_shape(df1, df2):
    unique_elements = list(set(df1.iloc[:, 0]).symmetric_difference(set(df2.iloc[:, 0])))
    for elem in unique_elements:
        if elem not in df1.iloc[:, 0].tolist():
            # append a new row with the unique element with rest of values NaN
            pass
        if elem not in df2.iloc[:, 0].tolist():
            # append a new row with the unique element with rest of values NaN
            pass
    return (df1, df2)
I am still new to Pandas, so any help would be greatly appreciated!
We can do
out1 = pd.concat([df1,pd.DataFrame({'col1':df2.loc[~df2.col1.isin(df1.col1),'col1']})])
Out[269]:
col1 col2
0 1 3.0
1 2 4.0
2 5 6.0
2 6 NaN
#out2 = pd.concat([df2,pd.DataFrame({'col1':df1.loc[~df1.col1.isin(df2.col1),'col1']})])
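The two concat calls above can be wrapped into the function the question asks for. The following is a rough sketch rather than a definitive implementation: it assumes both frames use the same name for their first column, and it returns new frames instead of modifying the inputs in place.

```python
import pandas as pd

def standardize_shape(df1, df2):
    # Assumption: the first column has the same label in both frames.
    key = df1.columns[0]
    # Rows (key column only) whose key is absent from the other frame.
    missing_in_df1 = df2.loc[~df2[key].isin(df1[key]), [key]]
    missing_in_df2 = df1.loc[~df1[key].isin(df2[key]), [key]]
    # concat fills the columns that the appended rows lack with NaN.
    out1 = pd.concat([df1, missing_in_df1], ignore_index=True)
    out2 = pd.concat([df2, missing_in_df2], ignore_index=True)
    return out1, out2

df1 = pd.DataFrame({'col1': [1, 2, 5], 'col2': [3, 4, 6]})
df2 = pd.DataFrame({'col1': [1, 2, 6], 'col2': [3, 4, 7]})
out1, out2 = standardize_shape(df1, df2)
print(out1)
print(out2)
```

ignore_index=True renumbers the rows, which avoids the repeated index label 2 seen in the output above.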

Iterating over dataframe and get columns as new dataframes

I'm trying to create a set of dataframes from one big dataframe. These dataframes consist of the columns of the original dataframe in this manner:
1st dataframe is the 1st column of the original one,
2nd dataframe is the 1st and 2nd columns of the original one,
and so on.
I use this code to iterate over the dataframe:
for i, data in enumerate(x):
    data = x.iloc[:,:i]
    print(data)
This works, but I also get an empty dataframe at the beginning and an index vector I don't need.
Any suggestions on how to remove those two?
Thanks
You are not using the values that enumerate yields, only the index, so there is no need to enumerate the dataframe. Iterate over the range from 1 to the number of columns plus one and take the slice df.iloc[:, :i] for each value of i. A list comprehension does this:
>>> [df.iloc[:, :i] for i in range(1,df.shape[1]+1)]
[ A
0 1
1 2
2 3,
A B
0 1 2
1 2 4
2 3 6]
The equivalent traditional loop would look something like this:
for i in range(1,df.shape[1]+1):
    print(df.iloc[:, :i])
A
0 1
1 2
2 3
A B
0 1 2
1 2 4
2 3 6
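The empty dataframe in the question comes from the first pass of the loop: enumerate starts counting at 0, and iloc[:, :0] selects all rows but no columns. A small sketch of this, using a made-up two-column frame like the one in the output above:

```python
import pandas as pd

# Hypothetical frame matching the output shown above.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 4, 6]})

# What the original loop produces for i == 0: every row, zero columns.
first = df.iloc[:, :0]
print(first.shape)

# Starting the counter at 1 skips the empty slice and keeps the last column.
prefixes = [df.iloc[:, :i] for i, _ in enumerate(df.columns, start=1)]
for p in prefixes:
    print(p)
```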
You can also do something like this:
import numpy as np
import pandas as pd
data = {
    'col_1': np.random.randint(0, 10, 5),
    'col_2': np.random.randint(10, 20, 5),
    'col_3': np.random.randint(0, 10, 5),
    'col_4': np.random.randint(10, 20, 5),
}
df = pd.DataFrame(data)
# key each prefix dataframe by the name of its last column
all_df = {col: df.iloc[:, :i] for i, col in enumerate(df, start=1)}
# For example, we can print the last one
print(all_df['col_4'])
col_1 col_2 col_3 col_4
0 1 13 5 10
1 8 16 1 18
2 6 11 5 18
3 3 11 1 10
4 7 14 8 12

Pandas - calculate new column with variable column input

Here's the problem. Imagine the following dataframe as an example:
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [3, 4, 5, 6, 7],'col3': [3, 4, 5, 6, 7],'col4': [1, 2, 3, 3, 2]})
Now, I would like to add another column "col5", which is calculated as follows:
If the value of "col4" is 1, then give me the corresponding value in the column with index 1 (i.e. "col2" in this case); if "col4" is 2, give me the corresponding value in the column with index 2 (i.e. "col3" in this case), etc.
I have tried the below and variations of it, but I can't seem to get the right result:
df["col5"] = df.apply(lambda x: df.iloc[x,df[df.columns[df["col4"]]]])
Any help is much appreciated!
If your 'col4' is the indicator of column index, this will work:
df['col5'] = df.apply(lambda x: x[df.columns[x['col4']]], axis=1)
df
# col1 col2 col3 col4 col5
#0 1 3 3 1 3
#1 2 4 4 2 4
#2 3 5 5 3 3
#3 4 6 6 3 3
#4 5 7 7 2 7
You can use fancy indexing with NumPy and avoid a Python-level loop altogether:
df['col5'] = df.iloc[:, :4].values[np.arange(df.shape[0]), df['col4']]
print(df)
col1 col2 col3 col4 col5
0 1 3 3 1 3
1 2 4 4 2 4
2 3 5 5 3 3
3 4 6 6 3 3
4 5 7 7 2 7
You should see significant performance benefits for larger dataframes:
df = pd.concat([df]*10**4, ignore_index=True)
%timeit df.apply(lambda x: x[df.columns[x['col4']]], axis=1) # 2.36 s per loop
%timeit df.iloc[:, :4].values[np.arange(df.shape[0]), df['col4']] # 1.01 ms per loop
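To make the fancy-indexing step easier to follow, here it is unpacked into named pieces. This is a sketch of the same idea, with rows, cols and values as names introduced for illustration: NumPy pairs the i-th entry of the row index array with the i-th entry of the column index array and returns one element per pair.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [3, 4, 5, 6, 7],
                   'col3': [3, 4, 5, 6, 7], 'col4': [1, 2, 3, 3, 2]})

values = df.to_numpy()        # 2-D array of the four original columns
rows = np.arange(len(df))     # 0, 1, 2, ... one entry per row
cols = df['col4'].to_numpy()  # column position to read in each row

# Element i of the result is values[rows[i], cols[i]].
df['col5'] = values[rows, cols]

print(df)
```

The array is taken before col5 is added, so the slice to the first four columns is not needed here.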
