python version: 3.7.11
pandas version: 1.1.3
IDE: Jupyter Notebook
Software for opening and resaving the .csv file: Microsoft Excel
I have a .csv file. You can download it from here: https://icedrive.net/0/35CvwH7gqr
In the .csv file, I looked for rows that have blank cells, and after finding those rows I deleted them. To do this I followed the instructions below:
I opened the .csv file with Microsoft Excel.
I pressed F5, entered "A1:E9030" in the "Reference" field, and clicked OK.
I pressed F5 again, clicked the "Special..." button, selected "Blanks", and clicked OK.
On the "Home" tab, in the "Cells" group, I clicked "Delete", then "Delete Sheet Rows".
I saved the file and closed it.
This is the file after deleting some rows: https://icedrive.net/0/cfG1dT6bBr
But when I run the code below, it seems that extra columns were added after deleting those rows:
import pandas as pd
# The file doesn't have any header.
my_file = pd.read_csv(path_to_my_file, header=None)
my_file.head()
print(my_file.shape)
The output:
(9024, 244)
You can also see the difference by opening the file with Notepad:
.csv file before deleting the rows: (screenshot omitted)
.csv file after deleting the rows: (screenshot omitted)
Before deleting the rows, my_file.shape showed 5 columns, but after deleting the rows it shows 244 columns.
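To check what those extra columns actually contain, here is a quick sketch (path_to_my_file is the same placeholder as in the code above): if Excel wrote out trailing delimiters, everything beyond the first five columns should be empty.
import pandas as pd

my_file = pd.read_csv(path_to_my_file, header=None)
# True means columns 5..243 hold nothing but NaN, i.e. empty padding
# left behind as trailing commas.
print(my_file.iloc[:, 5:].isna().all().all())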
Question:
How can I remove the rows in Excel (or in some other way) so that I won't end up with this problem?
Note: I can't remove these rows with pandas, because pandas doesn't take those rows into account automatically, so I have to do this manually.
Thanks in advance for any help.
I am not familiar with the operation you are carrying out in the first part of your question, but I suggest a different solution. Pandas will recognize only np.nan objects as null. So, in this case, we could start by loading the .csv file into pandas and replacing the empty cells with np.nan values:
>>> import pandas as pd
>>> import numpy as np
>>> my_file = pd.read_csv(path_to_my_file, header=None)
>>> my_file = my_file.replace('', np.nan)
Then, we could ask pandas to drop all the rows containing np.nan:
>>> my_file = my_file.dropna()
This should give you the desired output. I think it is a good habit to work on data frames directly from your IDE. Hope this helped!
Related
I am trying to merge a large number of .csv files. They all have the same table format, with 60 columns each. My merged table results in the data coming out fine, except that the first row consists of 640 columns instead of 60; the remainder of the merged .csv has the desired 60-column format. I am unsure where in the merge process it went wrong.
The first item in the problematic row is the first item in 20140308.export.CSV while the second (starting in column 61) is the first item in 20140313.export.CSV. The first .csv file is 20140301.export.CSV the last is 20140331.export.CSV (YYYYMMDD.export.csv), for a total of 31 .csv files. This means that the problematic row consists of the first item from different .csv files.
The Data comes from http://data.gdeltproject.org/events/index.html. In particular the dates of March 01 - March 31, 2014. Inspecting the download of each individual .csv file shows that each file is formatted the same way, with tab delimiters and comma separated values.
The code I used is below. If there is anything else I can post, please let me know. All of this was run through Jupyter Lab through Google Cloud Platform. Thanks for the help.
import glob
import pandas as pd
file_extension = '.export.CSV'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
combined_csv_data = pd.concat([pd.read_csv(f, delimiter='\t', encoding='UTF-8', low_memory=False) for f in all_filenames])
combined_csv_data.to_csv('2014DataCombinedMarch.csv')
I used the following bash code to download the data:
!curl -LO http://data.gdeltproject.org/events/[20140301-20140331].export.CSV.zip
I used the following code to unzip the data:
!unzip -a "********".export.CSV.zip
I used the following code to transfer to my storage bucket:
!gsutil cp 2014DataCombinedMarch.csv gs://ddeltdatabucket/2014DataCombinedMarch.csv
Looks like these CSV files have no header on them, so Pandas is trying to use the first row in the file as a header. Then, when Pandas tries to concat() the dataframes together, it's trying to match the column names which it has inferred for each file.
I figured out how to suppress that behavior:
import glob
import pandas as pd
def read_file(f):
    names = [f"col_{i}" for i in range(58)]
    return pd.read_csv(f, delimiter='\t', encoding='UTF-8', low_memory=False, names=names)
file_extension = '.export.CSV'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
combined_csv_data = pd.concat([read_file(f) for f in all_filenames])
combined_csv_data.to_csv('2014DataCombinedMarch.csv')
You can supply your own column names to Pandas through the names parameter. Here, I'm just supplying col_0, col_1, col_2, etc for the names, because I don't know what they should be. If you know what those columns should be, you should change that names = line.
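For example, a sketch of that change (only the first two names below are real GDELT 1.0 event column names; the rest are still placeholders):
def read_file(f):
    # Swap in the real schema names where you know them.
    names = ["GLOBALEVENTID", "SQLDATE"] + [f"col_{i}" for i in range(2, 58)]
    return pd.read_csv(f, delimiter='\t', encoding='UTF-8', low_memory=False, names=names)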
I tested this script, but only with 2 data files as input, not all 31.
PS: Have you considered using Google BigQuery to get the data? I've worked with GDELT before through that interface and it's way easier.
So I want to have 1 script writing continually to a CSV file, and another script reading periodically from that same CSV file.
What I'm looking for is a way to delete the rows I've just read in from the CSV file (not from my pandas dataframe).
Can anybody help?
# Read data in to dataframe
deviceInfo = pd.read_csv("sampleData.csv", nrows=100)
# Somehow delete those 100 rows from the CSV file
@JoseAngelSanchez is correct that you might want to read the whole csv into a dataframe, but I think this way lets you get a dataframe with the first 100 rows and still delete them from the csv file.
import pandas as pd
df = pd.read_csv("sampleData.csv")
deviceInfo = df.iloc[:100]
df.iloc[100:].to_csv("sampleData.csv")
Note: if you're doing this repetitively then you'll probably want to write to_csv(...,index=None) or a new index column will be created in the .csv file on each iteration.
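For instance, the same snippet with the index suppressed (a minimal sketch, reusing the file name from the question):
import pandas as pd

df = pd.read_csv("sampleData.csv")
deviceInfo = df.iloc[:100]
# index=False prevents a fresh index column from being written on each pass.
df.iloc[100:].to_csv("sampleData.csv", index=False)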
You should read the whole document and then delete the rows you don't want:
import pandas as pd
df = pd.read_csv("sampleData.csv")
df = df.iloc[100:]
df.to_csv("sampleData.csv")
So I'm very new to Python and I'm using pandas to read an Excel file. One of my file's columns has 197 values in it, but when I read it with pandas I don't get all of the values "as shown in the picture":
(screenshot omitted: not the full Excel sheet is appearing)
import pandas as pd
xl = pd.ExcelFile('test.xlsx')
sheet1 = xl.parse()
z = str(sheet1)
z = z.replace('212/', "")
z = z.replace('/1', "")
print(z)
Thanks for helping.
Is your question how to show those values? What you see is normal behavior: pandas truncates the printed output of a large DataFrame for display. If you want to see specific rows, try loc or iloc.
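For instance, a short sketch (reusing the file name from the question; display.max_rows is a standard pandas display option, added here beyond what the answer above mentions):
import pandas as pd

sheet1 = pd.ExcelFile('test.xlsx').parse()

# Positional selection: rows 50-59 only.
print(sheet1.iloc[50:60])

# Or lift the display truncation so all 197 rows are printed.
pd.set_option('display.max_rows', None)
print(sheet1)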
Is there a way to have pandas read in only the values from Excel and not the formulas? It reads the formulas in as NaN unless I go in and manually save the Excel file before running the code. I am just working with the basic read_excel function of pandas:
import pandas as pd
df = pd.read_excel(filename, sheetname="Sheet1")
This will read the values if I have gone in and saved the file prior to running the code. But after running the code to update a new sheet, if I don't go in and save the file and then try to run this again, it will read the formulas as NaN instead of just the values. Is there a workaround that anyone knows of that will just read values from Excel with pandas?
That is strange. The normal behaviour of pandas is to read values, not formulas. Likely, the problem is in your Excel files: probably your formulas point to other files, or they return a value that pandas sees as NaN.
In the first case, the sheet needs to be updated and there is nothing pandas can do about that (but read on).
In the second case, you could solve by setting explicit nan values in read_excel:
pd.read_excel(path, sheetname="Sheet1", na_values=[your na identifiers])
As for the first case, and as a workaround solution to make your work easier, you can automate what you are doing by hand using xlwings:
import pandas as pd
import xlwings as xl

def df_from_excel(path):
    app = xl.App(visible=False)
    book = app.books.open(path)
    book.save()
    app.kill()
    return pd.read_excel(path)

df = df_from_excel('path to your file')
If you want to keep the formulas in your Excel file, just save the recalculated copy to a different location (book.save(different_location)). Then you can get rid of the temporary files with shutil.
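A minimal sketch of that variant, assuming the standard tempfile and shutil modules for the temporary copy (the helper name is made up):
import os
import shutil
import tempfile
import pandas as pd
import xlwings as xl

def df_from_excel_keep_formulas(path):
    # Save a recalculated copy into a temporary folder so the original
    # workbook, formulas included, stays untouched.
    tmp_dir = tempfile.mkdtemp()
    tmp_path = os.path.join(tmp_dir, os.path.basename(path))
    app = xl.App(visible=False)
    book = app.books.open(path)
    book.save(tmp_path)
    app.kill()
    df = pd.read_excel(tmp_path)
    shutil.rmtree(tmp_dir)  # get rid of the temporary copy
    return df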
I had this problem and I resolved it by moving a graph below the first row I was reading. It looks like the position of graphs may cause problems.
You can use xlrd to read the values.
First you should refresh your Excel sheet, since you are also updating the values automatically with Python. You can use the function below:
import os
import xlrd
import win32com.client

file = "myxl.xls"

def refresh_file(file):
    xlapp = win32com.client.DispatchEx("Excel.Application")
    path = os.path.abspath(file)
    wb = xlapp.Workbooks.Open(path)
    wb.RefreshAll()
    xlapp.CalculateUntilAsyncQueriesDone()
    wb.Save()
    xlapp.Quit()
After the file refresh, you can start reading the content:
workbook = xlrd.open_workbook(file)
worksheet = workbook.sheet_by_index(0)
for rowid in range(worksheet.nrows):
    row = worksheet.row(rowid)
    for colid, cell in enumerate(row):
        print(cell.value)
You can loop through the data however you need and apply conditions while you are reading it, which gives you a lot more flexibility.
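For instance, a small sketch (the condition is made up) that keeps only numeric cells from the first column of the worksheet opened above:
# Collect only numeric cells from column 0 (made-up condition).
values = []
for rowid in range(worksheet.nrows):
    cell = worksheet.cell(rowid, 0)
    if cell.ctype == xlrd.XL_CELL_NUMBER:
        values.append(cell.value)
print(values)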
First question: please be kind.
I am having trouble loading a CSV file into a DataFrame in Spyder, using IPython. When I load an XLS file, there is no problem and the new DataFrame variable appears in the Variable Explorer.
For example:
import pandas as pd
energy = pd.read_excel('file.xls', skiprows=17)
The above returns a DataFrame, named energy, populated in the variable explorer (i.e. I can actually see the DataFrame).
However, when I try to load a CSV file using the same method, it seems to read the file, but it does not appear in the Variable Explorer.
For example:
import pandas as pd
GDP = pd.read_csv('file.csv')
When I run the above line, I don't get an error message, but the new DataFrame, GDP, does not appear in the Variable Explorer. If I print GDP I get the values (268 rows x 60 columns). Am I not saving the new DataFrame correctly as a variable?
Thanks!
The problem is not with the variable, but with the way the Variable Explorer filters what it shows. Go to "Tools > Preferences", select "Variable explorer", and uncheck the option "Exclude all-uppercase references".