I am running Python through conda and my terminal. I was given a script that should run without error. The script takes a URL and reads it as a CSV. This is what I have been given:
url = 'https://www.aoml.noaa.gov/hrd/hurdat/hurdat2.html'
data, storm, stormList = readHURDAT2(url)
columnnames= ['a,b,c,etc']
The error begins with the next line:
for line in pd.read_csv(url, header=None, names=columnnames, chunksize=1):
The computer runs several iterations before outputting this error message:
Too many columns specified: expected 20 and found 1
This happens because the data at https://www.aoml.noaa.gov/hrd/hurdat/hurdat2.html is in HTML format. I'd recommend you copy and paste the data into a local CSV file and read_csv from that. Also, because the file has a specific format that splits the document into HEADER LINES and DATA LINES, each with a different number of columns, you'd need to set engine='python' to read it. Finally, there is a maximum of 21 columns, not 20.
The code should look something like this:
for line in pd.read_csv('hurdat2.csv',  # <- Here
                        engine='python',  # <- Here
                        header=None,
                        names=columnnames,
                        chunksize=1,
                        ):
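For completeness, a minimal sketch of how the HEADER/DATA split could be handled while iterating. The startswith check assumes HURDAT2 header lines begin with a basin code such as AL, EP, or CP (data lines begin with a date), and the column names below are placeholders, not the real schema:

import pandas as pd

columnnames = [f'c{i}' for i in range(21)]  # placeholder names; 21 fields max

for line in pd.read_csv('hurdat2.csv', engine='python', header=None,
                        names=columnnames, chunksize=1):
    first = str(line.iloc[0, 0]).strip()
    if first.startswith(('AL', 'EP', 'CP')):  # header line: starts a new storm
        pass  # e.g. remember the storm ID and name
    else:                                     # data line: one observation
        pass  # e.g. append the row to the current storm's track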
My goal is to import a table of astrophysical data that I have saved to my computer (obtained from matching 2 other tables in TOPCAT, if you know it), and extract certain relevant columns. I hope to then do further manipulations on these columns. I am a complete beginner in python, so I apologise for basic errors. I've done my best to try and solve my problem on my own but I'm a bit lost.
This is the script I have written so far:
import pandas as pd
input_file = "location\\filename"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The file that I'm trying to import is listed as having file type "File" in my drive. I've looked at this file in Notepad and it has a lot of descriptive bumf in the first few rows, so to try to get rid of this I've used "skiprows", as you can see. The data in the file is separated column-wise by vertical lines, at least that's how it appears in Notepad.
The problem is that when I try to extract the first column using "usecols", it instead returns what appears to be the first row in the command window, as well as a load of vertical bars between each value. I assume it is somehow not interpreting the table correctly, not understanding what's a column and what's a row.
What I've tried: Modifying the file and saving it in a different filetype. This gives the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'location\\filename'
Despite the fact that the new file is saved in exactly the same location.
I've tried using "pd.read_table" instead of csv, but this doesn't seem to change anything (nor does it give me an error).
When I've tried to extract multiple columns (i.e. "usecols=[1,2]"), I get the following error:
ValueError: Usecols do not match columns, columns expected but not found: [1, 2]
My hope is that someone with experience can give some insight into what's likely going on to cause these problems.
Maybe you can try dataset.iloc[:, 0]. With iloc you can extract a column or row by its integer position. [:, 0] selects all rows of the first column.
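For example, on a small hypothetical frame standing in for the question's dataset:

import pandas as pd

# hypothetical stand-in for the question's dataset
dataset = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

first_col = dataset.iloc[:, 0]  # all rows of the first column
first_row = dataset.iloc[0, :]  # all columns of the first row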
The file is incorrectly named.
I expect that you are reading a .csv, .xlsx, or .txt file. So the (Windows) path would look similar to this:
import pandas as pd
input_file = "C:\\python\\tests\\test_csv.csv"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The error message tells you this:
No such file or directory: 'location\\filename'
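A quick way to confirm the path is right before calling read_csv, using the example path from above:

import os

input_file = "C:\\python\\tests\\test_csv.csv"
print(os.path.exists(input_file))  # False means the path or file name is wrong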
First of all, I have found several questions with the same title/topic here and I have tried the solutions that have been suggested, but none has worked for me
Here is the issue:
I want to extract a sample of workers from a huge .txt file (> 50 GB)
I am using an HPC cluster for this purpose.
Every row in the data represents a worker and carries many pieces of information (the column variables). The idea is to extract a subsample of workers based on the first two letters of the ID variable:
df = pd.read_csv('path-to-my-txt-file', encoding= 'ISO-8859-1', sep = '\t', low_memory=False, error_bad_lines=False, dtype=str)
df = df.rename(columns = {'Worker ID' : 'worker_id'})
# extract subsample based on the first 2 letters in worker id
new_df = df[df.worker_id.str.startswith('DK', na=False)]
new_df.to_csv('DK_worker.csv', index = False)
The problem is that the resulting .csv file has only 10-15% of the number of rows that should be there (I have another source of information on the approximate number of rows I should expect).
I think the data has some encoding issues. I have tried something like 'utf-8', 'latin_1' .. nothing has changed.
Do you see anything wrong in this code that may cause this problem? Have I missed some argument?
I am not a Python expert :)
Many thanks in advance.
You can't load a 50 GB file into your computer's RAM; it would not be possible to store that much data. And I doubt the csv module can handle files of that size. What you need to do is open the file in small pieces, then process each piece.
def process_data(piece):
    ...  # process the chunk

def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open('path-to-my-txt-file.csv') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
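Since the question already uses pandas, here is a sketch of the same filter using pandas' own chunking via the chunksize argument. It assumes the file is tab-separated as in the question, and it deliberately drops error_bad_lines=False, which silently discards malformed rows:

import pandas as pd

chunks = pd.read_csv('path-to-my-txt-file', encoding='ISO-8859-1',
                     sep='\t', dtype=str, chunksize=1_000_000)

header = True
for chunk in chunks:
    chunk = chunk.rename(columns={'Worker ID': 'worker_id'})
    dk = chunk[chunk.worker_id.str.startswith('DK', na=False)]
    dk.to_csv('DK_worker.csv', mode='a', header=header, index=False)
    header = False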
I am new to Python/pandas and I am trying to import the following file in a Jupyter notebook via pd.read_
Initial file lines:
Either pd.read_excel or pd.read_csv returned an error.
Eliminating the first row allowed me to read the file, but the CSV data were not separated into columns.
Could you share the line of code you have used so far to import the data?
Maybe try this one here:
data = pd.read_csv(filename, delimiter=',')
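Since you say eliminating the first row let the file load, a variant that skips it programmatically may also be worth trying; the separator here is a guess, because the file contents are not shown:

import pandas as pd

# skiprows=1 drops the problematic first line; swap ';' for whatever
# actually separates your values
data = pd.read_csv('yourfile.csv', skiprows=1, delimiter=';')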
It is always easier for people to help you if you share the relevant code accompanied by the error you are getting.
I have encountered a problem extracting data from a database1.csv file. My database1.csv file contains about a million rows, and I need to extract certain columns of data from it. The following figure shows my code, and I found an error when running it. The error I got is Error: unknown dialect.
For your information:
1) I need to extract the entire column that contains the information "GWM" from the database1.csv file.
2) After I have extracted the data, I need to put all of it into a new file, result.csv.
3) The word "GWM" is the word I selected to pick out the entire column to extract.
Do you have any suggestions to improve and edit my code? Thanks.
Make sure you have Python 3 (most recent version) installed and have a command line window open in the folder your file is in.
Install pandas via pip or pip3, whichever works. (pip install pandas)
The code below, if saved and run in the same directory as your .xlsx file, will extract all your columns to .dat files, the filenames being the first row in said columns. From there, just choose the file you want.
import pandas as pd

xlsxname = input('File: ')
datacols = pd.read_excel(xlsxname)  # read_excel has no low_memory option
cols = list(datacols)
lencols = len(cols)
countup = 0
while countup != lencols:
    colstemp = cols[countup]
    data = pd.read_excel(xlsxname, usecols=[colstemp])
    colsname = f'{colstemp}.dat'
    # to_csv rather than to_excel: the Excel writer rejects a .dat extension
    data.to_csv(colsname, index=False, header=False)
    countup = countup + 1
It may be ugly, it may be an idiotic and poorly-coded solution (why not just select a specific column?), but hey, it works.
(in Excel)
...You could also left-click the letter at the top of the column you want, press Ctrl-C, and paste it into a text editor, but hey...
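If you want to stay with the CSV the question describes, here is a minimal sketch of the same idea without Excel; it assumes "GWM" appears in the column header, which the question (shown only as an image) does not make explicit:

import pandas as pd

df = pd.read_csv('database1.csv')
gwm_cols = [c for c in df.columns if 'GWM' in str(c)]  # headers containing GWM
df[gwm_cols].to_csv('result.csv', index=False)         # write them to result.csv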
I am new to python and actually started with R.
My problem is that I am unable to debug key errors from my pandas dataframes. Here is part of the code:
I read in a data frame from Excel with the following commands.
import os
import pandas as pd

cwd = os.getcwd()
os.chdir(directorytofile)  # directorytofile holds the folder path
os.listdir('.')
file = dataset             # dataset holds the Excel file name
xl = pd.ExcelFile(file)
df1 = xl.parse('Sheet1')   # was cl.parse: cl is undefined, a typo for xl
Now when I want to select a column whose header contains blank spaces, like
Lieferung angelegt am
("delivery created on"; it's German, sorry for that)
I get the key error. I tried different ways to remove blank spaces from my headers when building the dataframe, like:
sep='\s*,\s*'
But the error still occurs. Is there a way for me to see where the problem happens?
Obviously it's about the blank spaces, because headers without them work fine.
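One way to see where the problem happens is to print the parsed headers with repr(), which makes leading and trailing spaces visible, and then strip them; a sketch with a placeholder file name:

import pandas as pd

xl = pd.ExcelFile('dataset.xlsx')      # placeholder file name
df1 = xl.parse('Sheet1')

print([repr(c) for c in df1.columns])  # reveals hidden spaces in the names
df1.columns = df1.columns.str.strip()  # normalize the headers

print(df1['Lieferung angelegt am'].head())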