Python/Pandas - Writing empty cells to a csv file (instead of zeros)

Running Python 3.8.1, 64 bit, on Windows 10.
I have a csv file with two columns. The first column does not have a numeric value on every row (there are empty cells between cells with values), while the second column has a numeric value on every row.
column_1 column_2
              200
13            201
              202
              203
              204
              205
129           206
16            207
              208
I read the csv file (shown above) with Pandas:
import pandas as pd

df = pd.read_csv("old.csv")
I make modifications to the dataframe and write it to a new csv file with Pandas, without the index column:
df.to_csv("new.csv", sep=',', encoding='utf-8', index=False)
The result is a csv file that has zeros in place of the original empty cells.
column_1,column_2
0,200
13,201
0,202
0,203
0,204
0,205
129,206
16,207
0,208
My question: how do I modify my script so that it writes empty cells instead of zeros (0) to the csv file (i.e. on the rows where the column_2 value is 200, 202, 203, 204, 205, or 208)?

You can replace the zeros with missing values using Series.mask and, to keep the column integer, convert it to the nullable Int64 dtype (available in pandas 0.24+; plain NaN would otherwise force the column to float):
import pandas as pd

df = pd.DataFrame({'column_1': [0, 13, 0, 0, 0, 0, 129, 16, 0],
                   'column_2': [200, 201, 202, 203, 204, 205, 206, 207, 208]})
print (df)
   column_1  column_2
0         0       200
1        13       201
2         0       202
3         0       203
4         0       204
5         0       205
6       129       206
7        16       207
8         0       208
df['column_1'] = df['column_1'].mask(df['column_1'].eq(0)).astype('Int64')
print (df)
   column_1  column_2
0       NaN       200
1        13       201
2       NaN       202
3       NaN       203
4       NaN       204
5       NaN       205
6       129       206
7        16       207
8       NaN       208
df.to_csv("new.csv", sep=',', encoding='utf-8', index=False)
column_1,column_2
,200
13,201
,202
,203
,204
,205
129,206
16,207
,208
Another idea is to replace the zeros with empty strings:
df['column_1'] = df['column_1'].mask(df['column_1'].eq(0), '')
print (df)
  column_1  column_2
0                200
1       13       201
2                202
3                203
4                204
5                205
6      129       206
7       16       207
8                208
df.to_csv("new.csv", sep=',', encoding='utf-8', index=False)
column_1,column_2
,200
13,201
,202
,203
,204
,205
129,206
16,207
,208
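A small aside, not part of the original answer: once the zeros have been turned into missing values, to_csv already writes them as empty fields, and its na_rep parameter controls the placeholder if you ever want something else:
# Missing values are written as empty fields by default; na_rep overrides that.
df.to_csv("new.csv", index=False, na_rep='')        # empty cells (the default)
df.to_csv("new_na.csv", index=False, na_rep='NA')   # write 'NA' instead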

Related

Dynamically differencing columns in a pandas dataframe using similar column names

The following are the first few columns of a data frame. I calculate V1_x - V1_y, V2_x - V2_y, V3_x - V3_y, etc.; the names of each pair of columns to difference differ only in the last character (either x or y).
import pandas as pd
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Address': ['xx', 'yy', 'zz','ww'], 'V1_x': [20, 21, 19, 18], 'V2_x': [233, 142, 643, 254], 'V3_x': [343, 543, 254, 543], 'V1_y': [20, 21, 19, 18], 'V2_y': [233, 142, 643, 254], 'V3_y': [343, 543, 254, 543]}
df = pd.DataFrame(data)
df
     Name Address  V1_x  V2_x  V3_x  V1_y  V2_y  V3_y
0     Tom      xx    20   233   343    20   233   343
1  Joseph      yy    21   142   543    21   142   543
2   Krish      zz    19   643   254    19   643   254
3    John      ww    18   254   543    18   254   543
I currently do the calculation by manually defining the column names:
new_df = pd.DataFrame()
new_df['Name'] = df['Name']
new_df['Address'] = df['Address']
new_df['Col1'] = df['V1_x'] - df['V1_y']
new_df['Col2'] = df['V2_x'] - df['V2_y']
new_df['Col3'] = df['V3_x'] - df['V3_y']
Is there an approach I can use to check whether column names differ only in the last character, and difference them if so?
Try creating a multiindex header using .str.split, then reshape the dataframe and use pd.DataFrame.eval for the calculation, then reshape back to the original form with the additional columns. Lastly, flatten the multiindex header using a list comprehension with f-string formatting:
dfi = df.set_index(['Name', 'Address'])
dfi.columns = dfi.columns.str.split('_', expand=True)
dfs = dfi.stack(0).eval('diff=x-y').unstack()
dfs.columns = [f'{j}_{i}' for i, j in dfs.columns]
dfs
Output:
                V1_x  V2_x  V3_x  V1_y  V2_y  V3_y  V1_diff  V2_diff  V3_diff
Name   Address
John   ww         18   254   543    18   254   543        0        0        0
Joseph yy         21   142   543    21   142   543        0        0        0
Krish  zz         19   643   254    19   643   254        0        0        0
Tom    xx         20   233   343    20   233   343        0        0        0
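For comparison, here is a plain-loop sketch (my addition, not from the original answer) that pairs up columns differing only in the _x/_y suffix and writes each difference into a new _diff column:
# Collect the stubs (V1, V2, V3) from the *_x columns and subtract the
# matching *_y column for each one; assumes every *_x has a *_y partner.
out = df.copy()
stubs = [c[:-2] for c in df.columns if c.endswith('_x')]
for stub in stubs:
    out[f'{stub}_diff'] = df[f'{stub}_x'] - df[f'{stub}_y']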

How can I change values in a column by checking the next row's value in another column [duplicate]

This question already has answers here:
How do I create a new column from the output of pandas groupby().sum()?
(4 answers)
Closed 1 year ago.
I would like to ask how I can iterate through the dataframe, check where the ID column has the same value, and then sum the prices for those rows.
I tried it with the following code:
import pandas as pd

d = {'ID': [126, 126, 148, 148, 137, 137], 'price': [100, 50, 120, 40, 160, 30]}
df = pd.DataFrame(data=d)
So the dataframe looks like this:
    ID  price
0  126    100
1  126     50
2  148    120
3  148     40
4  137    160
5  137     30
for index in df.index:
    if df.loc[index, "ID"] == df.loc[index + 1, "ID"]:
        df.at[index, "price"] = df.loc[index, "price"] + df.loc[index + 1, "price"]
        df.at[index + 1, "price"] = df.loc[index, "price"] + df.loc[index + 1, "price"]
# note: this still fails on the last row (index + 1 is out of range) and
# double-counts, since the second assignment reuses the already-updated price
I would like to have a result like this:
    ID  price
0  126    150
1  126    150
2  148    160
3  148    160
4  137    190
5  137    190
Please help if someone knows how to do it. :)
Try groupby + transform:
df['price'] = df.groupby('ID')['price'].transform('sum')
Output:
    ID  price
0  126    150
1  126    150
2  148    160
3  148    160
4  137    190
5  137    190
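A note added for clarity: transform('sum') returns a result aligned to the original index, so it can be assigned straight back as a column, whereas a plain aggregation collapses the frame to one row per group. Run on the original frame:
# Aggregation: one row per ID (the IDs become the index).
df.groupby('ID')['price'].sum()
# ID
# 126    150
# 137    190
# 148    160
# Name: price, dtype: int64
# Transform: one value per input row, same shape as df['price'].
df.groupby('ID')['price'].transform('sum')
# 0    150
# 1    150
# 2    160
# 3    160
# 4    190
# 5    190
# Name: price, dtype: int64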

Math operations across all columns of a pandas dataframe, regardless of its size

import pandas as pd
import numpy as np
d = {'col1': [100, 198, 495, 600, 50], 'col2': [99, 200, 500, 594, 100], 'col3': [101, 202, 505, 606, 150]}
df = pd.DataFrame(data=d)
df
From this I get a simple table:
   col1  col2  col3
0   100    99   101
1   198   200   202
2   495   500   505
3   600   594   606
4    50   100   150
From this I would like to take the %CV of all values in the first row, then the second row, and so on...
I would like it to work regardless of how many columns the table has.
I could do this with a few lines of code:
df_shape = df.shape
CV_list = []
for i in range(df_shape[0]):
    CV = np.std(df.iloc[i, :], ddof=1) / np.mean(df.iloc[i, :]) * 100
    CV_list.append(str(round(CV, 3)) + ' %')
df["CV"] = CV_list
df
Output:
   col1  col2  col3      CV
0   100    99   101   1.0 %
1   198   200   202   1.0 %
2   495   500   505   1.0 %
3   600   594   606   1.0 %
4    50   100   150  50.0 %
But I wonder if Pandas has a built-in function for this (which I could not find so far).
You can operate across an entire row by specifying axis=1. So get the Series of standard deviations and means (for each row) and divide.
df['CV'] = df.std(axis=1, ddof=1)/df.mean(axis=1)*100
   col1  col2  col3    CV
0   100    99   101   1.0
1   198   200   202   1.0
2   495   500   505   1.0
3   600   594   606   1.0
4    50   100   150  50.0
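If you want the percent-sign strings from the loop version, the same one-liner can feed a string format; a sketch matching the question's output, not part of the original answer (compute it before any CV column is added, so only the numeric columns are involved):
# Round to 3 decimals and append ' %' to mirror the loop's string output.
cv = df.std(axis=1, ddof=1) / df.mean(axis=1) * 100
df['CV'] = cv.round(3).astype(str) + ' %'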

Pandas: How to (cleanly) unpivot two columns with same category?

I'm trying to unpivot two columns inside a pandas dataframe. The transformation I seek would be the inverse of this question.
We start with a dataset that looks like this:
import pandas as pd
import numpy as np
df_orig = pd.DataFrame(data=np.random.randint(255, size=(4, 5)),
                       columns=['accuracy', 'time_a', 'time_b', 'memory_a', 'memory_b'])
df_orig
   accuracy  time_a  time_b  memory_a  memory_b
0         6     118     170       102       239
1       241       9     166       159       162
2       164      70      76       228       121
3       228     121     135       128        92
I wish to unpivot both the memory and time columns, obtaining this dataset as a result:
df
    accuracy  memory category  time
0          6     102        a   118
1        241     159        a     9
2        164     228        a    70
3        228     128        a   121
12         6     239        b   170
13       241     162        b   166
14       164     121        b    76
15       228      92        b   135
So far I have managed to get my desired output using df.melt() twice plus some extra commands:
df = df_orig.copy()
# Unpivot memory columns
df = df.melt(id_vars=['accuracy', 'time_a', 'time_b'],
             value_vars=['memory_a', 'memory_b'],
             value_name='memory',
             var_name='mem_cat')
# Unpivot time columns
df = df.melt(id_vars=['accuracy', 'memory', 'mem_cat'],
             value_vars=['time_a', 'time_b'],
             value_name='time',
             var_name='time_cat')
# Keep only the 'a'/'b' as categories
df.mem_cat = df.mem_cat.str[-1]
df.time_cat = df.time_cat.str[-1]
# Keep only the rows whose categories match (DIRTY!)
df = df[df.mem_cat==df.time_cat]
# Removing the duplicated category column.
df = df.drop(columns='time_cat').rename(columns={"mem_cat":'category'})
Given how easy it was to solve the inverse question, I believe my code is way too complex. Can anyone do it better?
Use wide_to_long:
np.random.seed(123)
df_orig = pd.DataFrame(data=np.random.randint(255, size=(4, 5)),
                       columns=['accuracy', 'time_a', 'time_b', 'memory_a', 'memory_b'])
df = (pd.wide_to_long(df_orig.reset_index(),
                      stubnames=['time', 'memory'],
                      i='index',
                      j='category',
                      sep='_',
                      suffix=r'\w+')
        .reset_index(level=1)
        .reset_index(drop=True)
        .rename_axis(None))
print (df)
  category  accuracy  time  memory
0        a       254   109      66
1        a        98   230      83
2        a       123    57     225
3        a       113   126      73
4        b       254   126     220
5        b        98    17     106
6        b       123   214      96
7        b       113    47      32
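An alternative sketch, my addition rather than part of the original answer, reusing the split-header/stack idea from the earlier multiindex answer on this page; it assumes the category is always the last underscore-separated token:
# Split the headers into (measure, category) pairs, then stack the
# category level into the index.
dfi = df_orig.set_index('accuracy')
dfi.columns = dfi.columns.str.rsplit('_', n=1, expand=True)
df = (dfi.stack(1)
         .rename_axis(['accuracy', 'category'])
         .reset_index())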

How to show only the rows where the data of one variable matches the column data of another variable

So I want to show only the rows in which the x and y values of the other IDs match the x and y of ID 0. For example, show ID 0 and ID 250017920 (row 9), as their x and y match within the first 20 rows. This process would need to be repeated for all rows, so that all we have left are the rows where the x and y match those of ID 0 as its x and y change.
import pandas as pd

d = {'ID': [0, 2398794, 3987694, 987957, 9875987, 76438739, 2474654, 1983209, 2874050, 250017920, 38764902],
     'x': [-46, 8769, 432, 426, 132, 93, 124, 475, 857, -46, 67],
     'y': [2562, 987, 987, 252, 234, 123, 765, 1452, 542, 2562, 5876],
     'z': [5, 7, 6, 2, 7, 7, 4, 5, 1, 9, 3]}
data = pd.DataFrame(data=d)
           ID     x     y  z
0           0   -46  2562  5
1     2398794  8769   987  7
2     3987694   432   987  6
3      987957   426   252  2
4     9875987   132   234  7
5    76438739    93   123  7
6     2474654   124   765  4
7     1983209   475  1452  5
8     2874050   857   542  1
9   250017920   -46  2562  9
10   38764902    67  5876  3
For the following dataframe:
d = {'ID': [999, 2398794, 3987694, 987957, 9875987],
     'x': [132, 8769, 432, 132, 132],
     'y': [563, 987, 987, 563, 234],
     'z': [5, 7, 6, 2, 7]}
data = pd.DataFrame(data=d)
print(data)
        ID     x    y  z
0      999   132  563  5
1  2398794  8769  987  7
2  3987694   432  987  6
3   987957   132  563  2
4  9875987   132  234  7
Get the index of the ID value for which you want to match the values in columns x and y. Here, let's say ID 999.
# [0] selects the first occurrence of this ID, in case the same ID appears multiple times
ind = data.index[data.ID == 999][0]
Now get the rows where the x and y values match the x and y values for ID = 999.
data[(data['x'] == data['x'][ind]) & (data['y'] == data['y'][ind])]
        ID    x    y  z
0      999  132  563  5
3   987957  132  563  2
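Applied to the frame from the top of the question (matching against ID 0 instead of 999), the same pattern reads, as a sketch:
# Take the first row carrying ID 0 as the reference...
ref = data.loc[data['ID'] == 0].iloc[0]
# ...and keep every row whose x and y both equal the reference values.
data[(data['x'] == ref['x']) & (data['y'] == ref['y'])]
#            ID   x     y  z
# 0           0  -46  2562  5
# 9   250017920  -46  2562  9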
