Pandas read csv replacing #DIV/0! and #VALUE! with NaN - python

I am new to Pandas for Python and am busy reading a csv file. Unfortunately the Excel file has some cells with #VALUE! and #DIV/0! in them. I cannot fix this in Excel because the data is pulled from other sheets. Pandas turns these columns into objects instead of numeric dtypes like float64, so I cannot plot from them. I want to replace the #VALUE! and #DIV/0! strings with NaN entries in Pandas, but I cannot find how to do this. I have tried the following (my code runs, but it changes nothing):
import pandas as pd
import numpy as np
df = pd.read_csv('2013AllData.csv')
df.replace('#DIV/0!', np.nan)  # runs, but the returned DataFrame is never assigned

Rather than replacing after loading, just set the na_values parameter when reading the csv and the offending strings will be converted to NaN as the DataFrame is created:
df = pd.read_csv('2013AllData.csv', na_values=['#VALUE!', '#DIV/0!'])
Check the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv
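For completeness: the attempt in the question runs but changes nothing because DataFrame.replace returns a new DataFrame instead of modifying the existing one in place. A minimal post-hoc fix, assuming the same file and both error strings:
import pandas as pd
import numpy as np
df = pd.read_csv('2013AllData.csv')
# assign the result back; replace() is not in-place by default
df = df.replace(['#VALUE!', '#DIV/0!'], np.nan)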

Related

Treat everything as raw string (even formulas) when reading into pandas from excel

So, I am actually handling text responses from surveys, and it is common to have responses that start with -, for example: -I am sad today.
Excel interprets such a cell as a formula and displays the #NAME? error.
So when I import the excel file into pandas using read_excel, it shows NaN.
Is there any method to force Excel to retain these as raw strings instead of interpreting them at the formula level?
I created a VBA macro that clicks through all the cells in the text column, which is slow when there are ten thousand++ rows.
I was hoping to do it at the Python level instead, any idea?
I hope this works for your situation: use openpyxl to extract the Excel data and then convert it into a pandas DataFrame.
from openpyxl import load_workbook
import pandas as pd
# take the active worksheet; load_workbook defaults to data_only=False,
# so formula cells come back as their raw strings
ws = load_workbook(filename='./formula_contains_raw.xlsx').active
rows = list(ws.values)
# first row becomes the header, the remaining rows the data
df = pd.DataFrame(rows[1:], columns=rows[0])
df.head()
It works for me using a CSV instead of an excel file.
In the CSV file (opened in excel) I need to select the option Formulas / Show Formulas, then save the file.
pd.read_csv('draft.csv')
Output:
      Col1
0    hello
1  =-hello
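If the goal is the raw survey text rather than the escaped formula, a small post-processing step can strip the leading = that Excel added. A sketch, assuming the Col1 column name from the output above:
import pandas as pd
df = pd.read_csv('draft.csv')
# '=-hello' -> '-hello'; values without a leading '=' are left untouched
df['Col1'] = df['Col1'].str.lstrip('=')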

Force computation of calculated columns in .xlsx imported via pd.ExcelFile

Suppose my data.xlsx's first sheet contains some computed columns.
I'm trying to pull out a pd.DataFrame of that sheet that holds the computed values.
But try as I may, I cannot achieve this.
Fails:
🔸
# > pip install openpyxl
import pandas as pd
df = pd.read_excel('data.xlsx', 'firstSheetName')
# NOTE: adding `, engine='openpyxl'` makes no difference
df.head()
This gives NaN in all calculated fields.
🔸
xl = pd.ExcelFile('data.xlsx')
df = xl.parse('firstSheetName')
df.head()
Same.
🔸
how to read xlsx as pandas dataframe with formulas as strings
from openpyxl import load_workbook
wb = load_workbook(filename = f'data.xlsx')
ws = wb['mySheetName']
df = pd.DataFrame(ws.values)
df.head()
Now this is giving the formulae: =H2, =H3 etc. in the cells.
An attempt to 'type-convert' these columns failed:
df[12][2:].astype(float)
# ValueError: could not convert string to float: '=H3'
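One avenue worth trying before the xlwings route below: openpyxl can return the values Excel cached on the last save instead of the formula strings, via data_only=True. A sketch, with the caveat that cells come back as None if the file was written by a tool that never computed the formulas:
from openpyxl import load_workbook
import pandas as pd
# data_only=True -> cached computed values instead of '=H2' strings
ws = load_workbook(filename='data.xlsx', data_only=True)['mySheetName']
df = pd.DataFrame(ws.values)
df.head()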
🔸 How to force pandas to evaluate formulas of xlsx and not read them as NaN? might offer a solution, which involves saving and reloading the .xlsx. However I can't get it working:
import pandas as pd, xlwings as xw
def df_from_excel(path):
    book = xw.Book(path)  # open the workbook in Excel so formulas get computed
    book.save()           # save, writing the computed values back to disk
    return pd.read_excel(path, header=0)
df = df_from_excel('nodal0.xlsx')
This gives XlwingsError: Make sure to have "appscript" and "psutil", dependencies of xlwings, installed.
And pip install appscript psutil says they're both already installed.
Note: Same idea here: Pandas read_excel with formulas and get values
🔸🔸🔸
I'm trying to find a way for it to render into a dataframe, which will then contain numeric values.
Is there any way to do it?
🔸🔸🔸
EDIT:
Here's what I'm dealing with:
The raw .xlsx is shown in the screenshot below. I've double-clicked a calculated cell, revealing the underlying =H2.
Notice that the corresponding cell of the dataframe (generated from this .xlsx) shows NaN.

Delete rows in CSV file after being read by pandas

So I want to have 1 script writing continually to a CSV file, and another script reading periodically from that same CSV file.
What I'm looking for is a way to delete the rows I've just read in from the CSV file (not from my pandas dataframe).
Can anybody help?
# Read data in to dataframe
deviceInfo = pd.read_csv("sampleData.csv", nrows = 100)
# Somehow delete those 100 rows from the CSV file
@JoseAngelSanchez is correct that you might want to read the whole csv into a dataframe, but I think this way lets you get a dataframe with the first 100 rows and still delete them from the csv file.
import pandas as pd
df = pd.read_csv("sampleData.csv")
deviceInfo = df.iloc[:100]
df.iloc[100:].to_csv("sampleData.csv")
Note: if you're doing this repetitively then you'll probably want to write to_csv(..., index=False), or a new index column will be created in the .csv file on each iteration.
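Putting that together, one polling cycle might look like the sketch below (same hypothetical file name; note this is not safe against the other script appending mid-rewrite, so real use would need file locking):
import pandas as pd
df = pd.read_csv("sampleData.csv")
deviceInfo = df.iloc[:100]    # the batch to process
# keep the remainder, without adding an index column
df.iloc[100:].to_csv("sampleData.csv", index=False)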
You should read the whole document and then delete the rows you don't want
import pandas as pd
df = pd.read_csv("sampleData.csv")
df = df.iloc[100:]
df.to_csv("sampleData.csv")

Pandas excel to python for long column

So I'm very new to python and I'm using Pandas to read an excel file. My file's column has 197 values in it, but when I read them with Pandas I don't get all of the values, "as shown in the picture":
not the full excel sheet appears
import pandas as pd
xl = pd.ExcelFile('test.xlsx')
sheet1 = xl.parse()  # parse the first sheet into a DataFrame
z = str(sheet1)      # str() captures only the truncated repr, not all 197 rows
z = z.replace('212/', "")
z = z.replace('/1', "")
print(z)
Thanks for helping.
Is your question how to show those values? What you see is normal behavior: pandas truncates the printed output of a large DataFrame. If you want to see specific rows, try loc or iloc.
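A short sketch of both options, assuming the test.xlsx file from the question:
import pandas as pd
df = pd.ExcelFile('test.xlsx').parse()
print(df.iloc[50:60])                    # select specific rows by position
pd.set_option('display.max_rows', None)  # or lift the print truncation entirely
print(df)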

Split multiple times?

So I'm currently transferring a txt file into a csv. It's mostly cleaned up, but even after splitting there are still empty columns between some of my data.
Below is my messy CSV file
And here is my current code:
import csv
Sat_File = '/Users'
output = '/Users2'
with open(Sat_File, 'r') as sat:
    with open(output, 'w') as outfile:
        writer = csv.writer(outfile)  # create the writer once, outside the loop
        for line in sat:              # iterate over the raw text lines
            if "2004" in line:
                writer.writerow(line.split(' '))
Basically, I'm just trying to eliminate those gaps between columns in the CSV picture I've provided. Thank you!
You can use the python Pandas library to clear out the empty columns:
import pandas as pd
df = pd.read_csv('path_to_csv_file').dropna(axis=1, how='all')
df.to_csv('path_to_clean_csv_file')
Basically we:
Import the pandas library.
Read the csv file into a variable called df (short for data frame).
Then use the dropna function, which discards empty columns/rows: axis=1 means drop columns (0 would mean rows) and how='all' means drop only columns in which all of the values are empty.
Save the clean data frame df to a new, clean csv file.
$$$ Pr0f!t $$$
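One small caveat on the snippet above: to_csv also writes the DataFrame index as an extra leading column by default. Passing index=False keeps the cleaned file shaped like the original:
import pandas as pd
df = pd.read_csv('path_to_csv_file').dropna(axis=1, how='all')
df.to_csv('path_to_clean_csv_file', index=False)  # no extra index column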
