Renaming index leads to data corruption in Python Pandas

I'm reading a csv and writing it back out as a compressed csv. I use the following code:
inp = pd.read_csv(inp_path)
inp.to_csv(filename, compression='gzip', encoding='utf-8')
and it works fine.
I need to rename the index to rownum, and I use the following code:
inp = pd.read_csv(inp_path)
inp.index.names = ['rownum']
inp.to_csv(filename, compression='gzip', encoding='utf-8')
This leads to an error when reading the written file back:
Compressed file ended before the end-of-stream marker was reached
I'm doing this for 4 files, but I run into this issue for just one of them.
Is there something wrong with what I'm doing, or is this possibly a data issue?
Or is there another way I can do this rename that will let me bypass this problem?
EDIT
As suggested in the comments, I tried the following code:
inp = pd.read_csv(inp_path)
row_count = len(inp)
index_row = range(0, row_count)
inp.insert(0, "rownum", index_row)
inp.to_csv(filename, compression='gzip', encoding='utf-8', index=False)
I still run into the same error mentioned above when trying to read the file.
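For what it's worth, to_csv can also label the index column directly via its index_label parameter, which avoids mutating the index in place. A minimal sketch, using the same inp_path and filename placeholders as the question (whether this sidesteps the error depends on what is actually truncating the gzip stream):
import pandas as pd

# Name the index column at write time instead of renaming the index first.
inp = pd.read_csv(inp_path)
inp.to_csv(filename, compression='gzip', encoding='utf-8', index_label='rownum')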

Related

Picking out a specific column in a table

My goal is to import a table of astrophysical data that I have saved to my computer (obtained by matching 2 other tables in TOPCAT, if you know it), and to extract certain relevant columns. I hope to then do further manipulations on these columns. I am a complete beginner in Python, so I apologise for basic errors. I've done my best to try and solve my problem on my own, but I'm a bit lost.
This is the script I have written so far:
import pandas as pd
input_file = "location\\filename"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The file that I'm trying to import is listed as having file type "File" in my drive. I've looked at this file in Notepad, and it has a lot of descriptive bumf in the first few rows, so to try and get rid of this I've used "skiprows", as you can see. The data in the file is separated into columns by vertical lines, at least that's how it appears in Notepad.
The problem is that when I try to extract the first column using "usecols", it instead returns what appears to be the first row in the command window, as well as a load of vertical bars between each value. I assume it is somehow not interpreting the table correctly, not understanding what's a column and what's a row.
What I've tried: modifying the file and saving it as a different file type. This gives the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'location\\filename'
despite the fact that the new file is saved in exactly the same location.
I've tried using "pd.read_table" instead of csv, but this doesn't seem to change anything (nor does it give me an error).
When I've tried to extract multiple columns (i.e. "usecols=[1,2]"), I get the following error:
ValueError: Usecols do not match columns, columns expected but not found: [1, 2]
My hope is that someone with experience can give some insight into what's likely going on to cause these problems.
Maybe you can try dataset.iloc[:, 0]. With iloc you can extract a column or row by integer position: [:, 0] selects all rows of the first column.
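A short sketch putting that together with a delimiter guess: the "vertical bars" described in the question suggest the file is pipe-separated, so passing sep='|' may let pandas split the columns properly (the path and skiprows value are taken from the question; adjust sep if the real delimiter differs):
import pandas as pd

# Assumption: the file is pipe-delimited, based on the vertical bars
# described in the question.
dataset = pd.read_csv("location\\filename", sep="|", skiprows=12)
first_column = dataset.iloc[:, 0]  # all rows of the first column
print(first_column.head())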
The file is incorrectly named.
I expect that you are reading a csv, xlsx, or txt file, so the (Windows) path would look similar to this:
import pandas as pd
input_file = "C:\\python\\tests\\test_csv.csv"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The error message tells you this:
No such file or directory: 'location\\filename'

Everything in csv file converted to int64?

I'm trying to work with a csv file (link: https://github.com/mwaskom/seaborn-data/blob/master/tips.csv) in a Jupyter notebook. When I call .dtypes on it, it only returns dtype('int64') and nothing else. How can I get the other data types, such as float64 and object, for every column?
Also, I'm using the file after uploading the csv file to the Jupyter notebook. I used the code below to read it.
df = pd.read_csv('tips.csv')
It's weird, because when I ran the exact same code yesterday it showed the data types for each column. Does anyone know what the problem may be?
The local file on your laptop isn't what you think it is; check the output of
import csv

with open('tips.csv', newline='') as f:
    reader = csv.reader(f)
    for _ in range(2):
        print(next(reader))
and confirm that your local copy of 'tips.csv' is missing data. Re-download the file from your linked source if needed.
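Once the file is intact, .dtypes should come back as a Series with one dtype per column rather than a single dtype. A quick check, assuming the re-downloaded copy matches the linked seaborn file:
import pandas as pd

# For the seaborn tips data this should print a mix of float64,
# object, and int64 columns.
df = pd.read_csv('tips.csv')
print(df.dtypes)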

Prevent New Line Formation in json_normalize Data Frames

I am working to flatten some tweets into a wide data frame. I simply use the pandas.json_normalize function on my tweets to perform this.
I then save this data frame to a CSV file. In the saved CSV, some rows end up split across multiple lines that belong with the row above, rather than holding all the data on a single row. I discovered this issue when uploading the CSV into R and into Domo.
When I run the following command in a Jupyter notebook, the CSV loads fine:
sb_2019 = pd.read_csv('flat_tweets.csv', lineterminator='\n', low_memory=False)
Without the lineterminator I see this error:
Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
Needs:
I am looking for a post-processing step to eliminate the need for the lineterminator argument. I need to open the CSV in platforms and languages that do not have this option. How might I go about doing this?
Note:
I am working with over 700k tweets. The json_normalize function works great on the small pieces of my data where the issues are being found. It is when I run json_normalize on the whole dataset that I find this issue.
Try using '\r\n' or '\r' as the lineterminator, and not '\n'.
Opening the file in universal-newline mode can also help (Python 3 text mode already uses universal newlines; the old 'rU' flag was removed in Python 3.11):
sb_2019 = pd.read_csv(open('flat_tweets.csv', encoding='utf-8'), low_memory=False)
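If an actual post-processing step is wanted, as the question asks, one hedged sketch is to load the file once with the explicit lineterminator, strip any newline characters embedded inside the text fields, and rewrite the CSV so that other tools can read it without special options (file name as used above; flat_tweets_clean.csv is a hypothetical output name):
import pandas as pd

# Load with the terminator that is known to work.
sb_2019 = pd.read_csv('flat_tweets.csv', lineterminator='\n', low_memory=False)

# Replace carriage returns / newlines embedded in every string column.
text_cols = sb_2019.select_dtypes(include='object').columns
sb_2019[text_cols] = sb_2019[text_cols].replace({r'\r': ' ', r'\n': ' '}, regex=True)

# Rewrite; the cleaned copy should load in R, Domo, etc. with default options.
sb_2019.to_csv('flat_tweets_clean.csv', index=False)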

Use python to extract information and data from excel file

I have encountered a problem extracting data from a database1.csv file. My database1.csv file contains a million rows of data, and I need to extract certain columns from it. My code (posted as a screenshot in the original question, not reproduced here) fails when run with the error: Error: unknown dialect.
For your information:
1) I need to extract the entire column that contains the information "GWM" from the database1.csv file.
2) After I extract the data, I need to put all of it into a new file, result.csv.
3) The word "GWM" is the marker I selected to identify which entire column to extract.
Any recommended suggestions to improve and edit my code? Thanks.
Make sure you have a recent version of Python 3 installed and have a command-line window open in the folder your file is in.
Install pandas via pip or pip3, whichever works (pip install pandas).
The code below, if saved and run in the same directory as your .xlsx file, will extract all of your columns to .dat files, with each filename taken from the first row of the corresponding column. From there, just choose the file you want.
import pandas as pd

xlsxname = input('File: ')

# Read once to discover the column names.
datacols = pd.read_excel(xlsxname)
cols = list(datacols)
lencols = len(cols)
countup = 0
while countup != lencols:
    colstemp = cols[countup]
    # Re-read just this one column...
    data = pd.read_excel(xlsxname, usecols=[colstemp])
    # ...and write it to its own plain-text file named after the column.
    colsname = f'{colstemp}.dat'
    data.to_csv(colsname, index=False, header=False)
    countup = countup + 1
It may be ugly, it may be an idiotic and poorly-coded solution (why not just select the specific column?), but hey, it works.
(in Excel)
...You could also left-click the letter at the top of the column you want, press Ctrl-C, and paste it into a text editor, but hey...
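Closer to the original goal (filter columns by the "GWM" marker and write them to result.csv), here is a pandas sketch under the question's own file names; it assumes "GWM" appears in the column headers, so adjust the test if the marker actually lives in the data rows:
import pandas as pd

# database1.csv and result.csv are the file names from the question.
df = pd.read_csv('database1.csv', low_memory=False)

# Keep only the columns whose header mentions "GWM".
gwm_cols = [c for c in df.columns if 'GWM' in str(c)]
df[gwm_cols].to_csv('result.csv', index=False)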

pd.read_excel does recognize the file but does not actually read it

I've been busy working on some code, and one part of it is importing an Excel file. I've been using the code below. Now, on one PC it works, but on another it does not (I did change the paths, though). Python does recognize the Excel file and does not give an error when loading it, but when I print the table it says:
Empty DataFrame
Columns: []
Index: []
Just to be sure, I checked the file path, which seems to be correct. I also checked the sheet name, but that is all good too.
df = pd.read_excel(book_filepath, sheet_name='Potentie_alles')
description = df["#"].map(str)
This gives a KeyError: '#' (# is the header of the first column of the sheet).
Does anyone know how to fix this?
Kind regards,
iCookieMonster
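One way to narrow this down is to ask pandas what it actually sees in the workbook. A diagnostic sketch reusing the sheet name from the question (book_filepath is a placeholder for the real path in the script):
import pandas as pd

book_filepath = r'path\to\workbook.xlsx'  # placeholder: use the real path

# Confirm the sheet really is named 'Potentie_alles' in this copy of the file.
xls = pd.ExcelFile(book_filepath)
print(xls.sheet_names)

# Read the sheet without assuming a header row and preview the raw cells;
# if this is also empty, the sheet itself has no data in this copy.
df = pd.read_excel(book_filepath, sheet_name='Potentie_alles', header=None)
print(df.head())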
