pandas read_excel doesn't read all rows - python

I have a problem with pandas read_excel. This is my code:
import pandas as pd
df = pd.read_excel('myExcelfile.xlsx', 'Table1', engine='openpyxl', header=1)
print(len(df))
If I run this code in PyCharm on a Windows PC, I get the correct DataFrame length, which is 28757,
but if I run it on my Linux server, I get only 26645.
Any idea what the reason is?
Thanks

Try this way:
import pandas as pd
data = pd.read_excel('Advertising.xlsx')
data.head()

I got the solution.
The problem was an empty first row in my .xlsx file.
The file is generated automatically by another program, so I used openpyxl to delete the first row and save a new .xlsx file.
import openpyxl
path = 'myExcelFile.xlsx'
book = openpyxl.load_workbook(path)
sheet = book['Tabelle1']
# openpyxl rows are 1-indexed: delete 1 row starting at row 1
sheet.delete_rows(1, 1)
# save to a new file
book.save('myExcelFile_new.xlsx')
Attention: this code sample doesn't check whether the first row is empty!
It deletes the first row whether or not it has content.
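The missing emptiness check could be added along these lines. This is only a sketch: the helper name drop_first_row_if_empty and the demo cells are made up for illustration, and it uses an in-memory workbook instead of the real file.

```python
from openpyxl import Workbook


def drop_first_row_if_empty(sheet):
    """Delete row 1 of an openpyxl worksheet only if every cell in it is empty.

    Returns True if the row was deleted, False otherwise.
    """
    first_row = next(sheet.iter_rows(min_row=1, max_row=1, values_only=True), ())
    if all(value is None for value in first_row):
        sheet.delete_rows(1, 1)  # openpyxl rows are 1-indexed
        return True
    return False


# In-memory demonstration with hypothetical data (no file needed).
book = Workbook()
sheet = book.active
sheet["A2"] = "name"   # the header sits in row 2, row 1 is empty
sheet["A3"] = "Alice"
deleted = drop_first_row_if_empty(sheet)
print(deleted, sheet["A1"].value)
```

With a real file you would get the sheet from openpyxl.load_workbook(path) and call book.save(...) afterwards, as in the answer above.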

Related

Trouble writing to Excel

I am new to Python and am trying to write into a merged cell in Excel. I can read the data already stored in this cell/row, so I know it's there. However, when I try to overwrite it, nothing happens.
I have tried changing the index and header settings as well, but nothing seems to work.
import pandas as pd
from openpyxl import load_workbook
# Read the Excel file into a pandas DataFrame
df = pd.read_excel('file here', sheet_name='Sheet1')
print(df.iloc[8, 2])
# Make the changes to the DataFrame
df.iloc[8, 2] = "Bob Smith"
# Load the workbook
book = load_workbook('file here')
writer = pd.ExcelWriter('file here', engine='openpyxl')
writer.book = book
# Write the DataFrame to the first sheet
df.to_excel(writer, index=False)
# Save the changes to the Excel file
writer.save()
import pandas as pd
from openpyxl import load_workbook
file = "C:/Users/OneDrive/Bureau/draftExcel.xlsx"
df = pd.read_excel(file, sheet_name='sheet1')
df.iat[5, 0] = 'cell is updated'
print(df)  # check in the terminal first that the cell content is updated
book = load_workbook(file)
writer = pd.ExcelWriter(file, engine='openpyxl')
df.to_excel(writer, sheet_name='sheet1', index=False)
writer.close()
I tried to make an example from what you explained because you didn't show your code, so I hope it is helpful.
Instead of .iloc I used .iat, which reads or writes a single cell by integer row and column position rather than by label.
Remember that the file must be closed in Excel while you edit it with Python; if it is open, you will get an error.
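For the merged cell itself, another option is to skip the pandas round trip and write with openpyxl directly, since writing a DataFrame back with to_excel does not carry over merged ranges. In openpyxl only the top-left cell of a merged range stores a value. A minimal sketch, assuming that is what you need; the helper name write_to_merged and the cells C10:D10 are hypothetical:

```python
from openpyxl import Workbook


def write_to_merged(sheet, cell_ref, value):
    """Write value to cell_ref; if the cell belongs to a merged range,
    write to the range's top-left cell, the only one that stores a value."""
    target = sheet[cell_ref]
    row, col = target.row, target.column
    for rng in sheet.merged_cells.ranges:
        if rng.min_row <= row <= rng.max_row and rng.min_col <= col <= rng.max_col:
            row, col = rng.min_row, rng.min_col
            break
    sheet.cell(row=row, column=col).value = value


# In-memory demonstration with hypothetical cells.
book = Workbook()
sheet = book.active
sheet["C10"] = "old name"
sheet.merge_cells("C10:D10")
write_to_merged(sheet, "D10", "Bob Smith")
print(sheet["C10"].value)
```

With a real workbook you would open it with openpyxl.load_workbook(path), call the helper, and then book.save(path).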

Pandas "usecols" doesn't seem to work perfectly

I am reading the Excel file below with the Python code below, but I have no idea why the first column header has ".1" appended even though the first column is set to be ignored. Any ideas? Many thanks in advance.
Python script:
import pandas as pd
import os
os.system('cls')
df = pd.read_excel(r'test_1\Book1.xlsx', 'sheet1', header=0, skiprows=1, usecols='B:D', index_col=0, nrows=5)
print(df)
I am very confused about the ".1" in the first column header, "PIA IM Equity.1", in the result below.
Your file has two columns named "PIA IM Equity". When reading, pandas renames the second identical column by appending .1 to its name; a third column with the same name would get .2. Your result suggests the renaming happens before usecols filters the columns, which is why the suffix appears even though column A is excluded.
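The renaming is easy to reproduce without the original workbook. The sketch below uses read_csv on an in-memory string instead of read_excel, on the assumption that duplicate headers are handled the same way; the column "Other" and the values are made up.

```python
import io

import pandas as pd

# Two columns share the same header, as in the question's sheet.
csv_text = "PIA IM Equity,PIA IM Equity,Other\n1,2,3\n"

df = pd.read_csv(io.StringIO(csv_text))
columns = list(df.columns)
print(columns)  # ['PIA IM Equity', 'PIA IM Equity.1', 'Other']

# Selecting only the second and third columns, to inspect the names that remain.
subset = pd.read_csv(io.StringIO(csv_text), usecols=[1, 2])
print(list(subset.columns))
```

If the suffix is unwanted, renaming the columns after reading (for example with df.rename(columns=...)) is one way to restore the original header.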

saving a dataframe to csv file (python)

I am trying to restructure the way my precipitation data is organized in an Excel file. To do this, I've written the following code:
import pandas as pd
df = pd.read_excel('El Jem_Souassi.xlsx', sheet_name=None, header=None)
data = df["El Jem"]
T = []
for column in range(1, 56):
    liste = data[column].tolist()
    for row in range(1, len(liste)):
        liste[row] = str(liste[row])
        if liste[row] != 'nan':
            T.append(liste[row])
result = pd.DataFrame(T)
result
This code works fine and through Jupyter I can see that the result is good
However, I am facing a problem when attempting to save this dataframe to a csv file.
result.to_csv("output.csv")
The resulting file contains the vertical index column, and I can't seem to address a specific cell.
(Hopefully, someone can help me with this problem)
Many thanks !!
It's all in the docs.
You are interested in skipping the index column, so do:
result.to_csv("output.csv", index=False)
If you also want to skip the header add:
result.to_csv("output.csv", index=False, header=False)
I don't know what your input data looks like (it is a good idea to include it in your question). But note that currently you can obtain the same result just by doing:
import pandas as pd
df = pd.DataFrame([0]*16)
df.to_csv('results.csv', index=False, header=False)
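To see what the two flags change, it can help to compare the CSV text directly. In this sketch to_csv is called without a path so that it returns a string; the three values are hypothetical.

```python
import pandas as pd

result = pd.DataFrame(["12.5", "0.0", "3.1"])  # hypothetical values

# Default: an index column plus a header row (the column is named 0).
with_index = result.to_csv()
# Only the data values, one per line.
values_only = result.to_csv(index=False, header=False)

print(with_index.splitlines())   # [',0', '0,12.5', '1,0.0', '2,3.1']
print(values_only.splitlines())  # ['12.5', '0.0', '3.1']
```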

read in csv and changing first value from 'ID' then write csv in python3

I am trying to import a CSV, change the first value in the file, and write the result to another CSV. I am doing this because Excel opens CSV files as SYLK files if the first value is 'ID'. I therefore intend to change 'ID' to 'Value_ID'. I can't figure out how to change the value with something like s[0][0] = 'Value_ID'. Any help would be greatly appreciated.
import csv
with open('input1.csv', 'r') as file1:
    reader = csv.reader(file1)
    s = ('output1.csv')
    filewriter = csv.writer(open(s,'w',newline= '\n'))
    for row in reader:
        filewriter.writerow(row)
    s=[0][0] = 'Match_ID'
You can use pandas for this and many other operations quite efficiently and easily.
To install pandas:
pip install pandas
This also installs its dependencies.
Once this is done, open the Python shell:
import pandas as pd
df = pd.read_csv('input1.csv')
df.at[index, col] = value  # set_value was removed in pandas 1.0; .at sets a single cell
df.to_csv('Output1.csv', index=False)
In the above snippet, replace index with the row label (the row number by default) and col with the column name. Note that read_csv treats the first line as the header, so if 'ID' is the first header it is a column name, not a cell; in that case use df = df.rename(columns={'ID': 'Value_ID'}) instead.
If you are unsure what the row and column names are, type
df.head(5)
This shows the first 5 rows of the DataFrame together with the column names.
Happy coding. Cheers!
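The same change can also be made with only the standard-library csv module, which is closer to the code in the question. A minimal sketch, assuming the value to change is the first cell of the first row; the helper name rename_first_header and the sample rows are made up, and in-memory buffers stand in for input1.csv and output1.csv.

```python
import csv
import io


def rename_first_header(src, dst, new_name="Value_ID"):
    """Copy CSV rows from src to dst, replacing the first cell of the first row."""
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for i, row in enumerate(reader):
        if i == 0 and row:
            row[0] = new_name
        writer.writerow(row)


# In-memory demonstration; with real files, open both with newline="".
source = io.StringIO("ID,score\n1,10\n2,20\n")
target = io.StringIO()
rename_first_header(source, target)
output_lines = target.getvalue().splitlines()
print(output_lines)  # ['Value_ID,score', '1,10', '2,20']
```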

pandas read excel values not formulas

Is there a way to have pandas read in only the values from Excel and not the formulas? It reads the formulas as NaN unless I manually open and save the Excel file before running the code. I am just using the basic read_excel function of pandas:
import pandas as pd
df = pd.read_excel(filename, sheet_name="Sheet1")
This reads the values if I have opened and saved the file before running the code. But after running the code to update a new sheet, if I don't open and save the file again before the next run, it reads the formulas as NaN instead of the values. Does anyone know a workaround that reads just the values from Excel with pandas?
That is strange. The normal behaviour of pandas is to read values, not formulas. The problem is likely in your Excel files: probably your formulas point to other files, or they return a value that pandas sees as NaN.
In the first case, the sheet needs to be updated and there is nothing pandas can do about that (but read on).
In the second case, you could solve it by setting explicit NaN values in read_excel:
pd.read_excel(path, sheet_name="Sheet1", na_values = [your na identifiers])
As for the first case, as a workaround to make your work easier, you can automate what you are doing by hand using xlwings:
import pandas as pd
import xlwings as xl
def df_from_excel(path):
    app = xl.App(visible=False)
    book = app.books.open(path)
    book.save()
    app.kill()
    return pd.read_excel(path)
df = df_from_excel('path to your file')
If you want to keep those formulas in your Excel file, just save the file to a different location (book.save(different location)). You can then remove the temporary files with shutil.
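Another cause worth checking, assuming the sheet is written by a library such as openpyxl rather than by Excel itself: such a library stores the formula text but does not calculate it, so the file has no cached result until Excel recalculates and saves it. pandas asks openpyxl for the cached values, so it gets nothing back. The sketch below reproduces this; the file name and cell contents are hypothetical.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Write a workbook containing a formula, without ever opening it in Excel.
path = os.path.join(tempfile.mkdtemp(), "formula_demo.xlsx")
book = Workbook()
sheet = book.active
sheet["A1"] = 2
sheet["A2"] = 3
sheet["A3"] = "=A1+A2"
book.save(path)

# Default load: the formula text is returned.
formula_text = load_workbook(path).active["A3"].value

# data_only=True asks for the cached result, which does not exist yet.
cached_value = load_workbook(path, data_only=True).active["A3"].value

print(formula_text)
print(cached_value)
```

This would explain why opening and saving the file in Excel, by hand or through xlwings, makes the values appear.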
I had this problem and resolved it by moving a graph below the first row I was reading. It looks like the position of graphs may cause problems.
You can use xlrd to read the values.
First, refresh your Excel sheet if you are also updating the values automatically with Python. You can use the function below:
file = 'myxl.xls'
import xlrd
import win32com.client
import os
def refresh_file(file):
    xlapp = win32com.client.DispatchEx("Excel.Application")
    path = os.path.abspath(file)
    wb = xlapp.Workbooks.Open(path)
    wb.RefreshAll()
    xlapp.CalculateUntilAsyncQueriesDone()
    wb.Save()
    xlapp.Quit()
After the file is refreshed, you can start reading the content.
workbook = xlrd.open_workbook(file)
worksheet = workbook.sheet_by_index(0)
for rowid in range(worksheet.nrows):
    row = worksheet.row(rowid)
    for colid, cell in enumerate(row):
        print(cell.value)
You can loop through the data however you need and apply conditions while reading, which gives you a lot more flexibility.
