Prevent New Line Formation in json_normalize Data Frames - python

I am working to flatten some tweets into a wide data frame. I simply use the pandas.json_normalize function on my to perform this.
I then save this data frame into a CSV file. The CSV format when uploaded produces some rows that are associated with the above, rather than holding all the data on a single row. I discovered this issue when uploading the CSV into R and into Domo.
When I run the following command in a jupyter notebook the CSV loads fine,
sb_2019 = pd.read_csv('flat_tweets.csv',lineterminator='\n',low_memory=False)
Without the lineterminator I see this error:
Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
Needs:
I am looking for a post-processing step to eliminate the need for a the lineterminator. I need to open the CSV in platforms and languages that do not have this specification. How might I go about doing this?
Note:
I am working with over 700k tweets. The json_normalize function works great on small pieces of my data where issues are being found. When I run json_normalize on the whole dataset I am finding this issue.

Try using '\r\n' or '\r' as lineterminator, and not '\n'.
This solution would be helpful too, opening in universal-new-line mode:
sb_2019 = pd.read_csv(open('flat_tweets.csv','rU'), encoding='utf-8', low_memory=False)

Related

Picking out a specific column in a table

My goal is to import a table of astrophysical data that I have saved to my computer (obtained from matching 2 other tables in TOPCAT, if you know it), and extract certain relevant columns. I hope to then do further manipulations on these columns. I am a complete beginner in python, so I apologise for basic errors. I've done my best to try and solve my problem on my own but I'm a bit lost.
This script I have written so far:
import pandas as pd
input_file = "location\\filename"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The file that I'm trying to import is listed as having file type "File", in my drive. I've looked at this file in Notepad and it has a lot of descriptive bumf in the first few rows, so to try and get rid of this I've used "skiprows" as you can see. The data in the file is separated column-wise by lines--at least that's how it appears in Notepad.
The problem is when I try to extract the first column using "usecol" it instead returns what appears to be the first row in the command window, as well as a load of vertical bars between each value. I assume it is somehow not interpreting the table correctly? Not understanding what's a column and what's a row.
What I've tried: Modifying the file and saving it in a different filetype. This gives the following error:
FileNotFoundError: \[Errno 2\] No such file or directory: 'location\\filename'
Despite the fact that the new file is saved in exactly the same location.
I've tried using "pd.read_table" instead of csv, but this doesn't seem to change anything (nor does it give me an error).
When I've tried to extract multiple columns (ie "usecol=[1,2]") I get the following error:
ValueError: Usecols do not match columns, columns expected but not found: \[1, 2\]
My hope is that someone with experience can give some insight into what's likely going on to cause these problems.
Maybie you can try dataset.iloc[:,0] . With iloc you can extract the column or line you want by index(not only). [:,0] for all the lines of 1st column.
The file is incorrectly named.
I expect that you are reading a csv file or an xlsx or txt file. So the (windows) path would look similar to this:
import pandas as pd
input_file = "C:\\python\\tests\\test_csv.csv"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The error message tell you this:
No such file or directory: 'location\\filename'

Writing results of several cells in jupyter to one file (no overwriting)

I have a jupyter notebook where I run the same simulation using many different combinations of parameters (essentially, to simulate different versions of environment and their effect on the results). Let's say that the result of each run is an image and a 2d array of all relevant metrics for my system. I want to be able to keep the images in notebook, but save the arrays all in one place, so that I can work with them later on if needed.
Ideally I would save them into an external file with the following format:
'Experiment environment version i' (or some other description)
2d array
and every time I would run a new simulation (a new cell) the results would be added into this file until I close it.
Any ideas how to end up with such external summary file?
If you have excel available to you then you could use pandas to write the results to a spreadsheet (or you could use pandas to write to a csv). See the documentation here, but essentially you would do the following when appending and/or using a new sheet:
import pandas as pd
for i in results:
with pd.ExcelWriter('results.xlsx', mode='a') as writer:
df.to_excel(writer, sheet_name='Result'+i)
You will need to have your array in dataframe 'df', there are lots of tutorials on how to put an array into pandas.
After a bit of try and error, here is a general answer how to write to txt (without pandas, otherwise see #jaybeesea's answer)
with open("filename.txt", "a+") as f:
f.write("Comment 1 \n")
f.write("%s \n" %np.array2string(array, separator=' , '))
Every time you run it, it adds to the file "f".

How can I have Pandas recognize the structure of my data properly?

I have some data saved in ".txt" files. this is how they are stored:
I used the code below to read the data and save it in a data frame object: (no need to mention that I'm using pandas library of python):
new_df = pd.read_csv(location, sep='\t', lineterminator='\n', names=None)
the problem is that when I get the shape of my data frame with new_df.shape I end up with: (123,1). It does not recognize that the data have 4 columns. How can I fix this?
It seems you don't have tab but spaces - use sep="\s+"
From your screenshot, your data appear to be in fixed width format.
Try to use pandas.read_fwf to read your data file:
pd.read_fwf(location)
You may pass the colspecs=... argument to tell it in which column each of the data are, but the routine is smart enough to figure this out automagically.

pandas transform a csv into a h5 file avoiding memory error

I have this simple code
data = pd.read_csv(file_path + 'PSI_TS_clean.csv', nrows=None,
names=None, usecols=None)
data.to_hdf(file_path + 'PSI_TS_clean.h5', 'table')
but my data is too big and I run into memory issues.
What is a clean way to do this chunk by chunk?
If the csv is really big split the file using a method such as detailed here : chunking-data-from-a-large-file-for-multiprocessing
then iterate through the files and use pd.read_csv on each then use the pd.to_hdf method
for to_hdf check the parameters here: DataFrame.to_hdf you need to ensure mode 'a' and consider append.
Without knowing further detail about the dataframe structure its difficult to comment further.
also for read_csv there is the param: low_memory=False

Writing Fortran unformatted files with Python

I have some single-precision little-endian unformatted data files written by Fortran77. I am reading these files using Python using the following commands:
import numpy as np
original_data = np.dtype('float32')
f = open(file_name,'rb')
original_data = np.fromfile(f,dtype='float32',count=-1)
f.close()
After some data manipulation in Python, I (am trying to) write them back in the original format using Python using the following commands:
out_file = open(output_file,"wb")
s = struct.pack('f'*len(manipulated_data), *manipulated_data)
out_file.write(s)
out_file.close()
But it doesn't seem to be working. Any ideas what is the right way of writing the data using Python back in the original fortran unformatted format?
Details of the problem:
I am able to read the final file with manipulated data from Fortran. However, I want to visualize these data using a software (Paraview). For this I convert the unformatted data files in the *h5 format. I am able to convert both the original and manipulated data in h5 format using h5 utilities. But while Paraview is able to read the *h5 files created from original data, Paraview is not able to read the *h5 files created from the manipulated data. I am guessing something is being lost in translation.
This is how I am opening the file written by Python in Fortran (single precision data):
open (in_file_id,FILE=in_file,form='unformatted',access='direct',recl=4*n*n*n)
And this is I am writing the original unformatted data by Fortran:
open(out_file_id,FILE=out_file,form="unformatted")
Is this information sufficient?
Have you tried using the .tofile method of the manipulated data array? It will write the array in C order but is capable of writing plain binary.
The documentation for .tofile also suggests this is the same as:
with open(outfile, 'wb') as fout:
fout.write(manipulated_data.tostring())
this is creating an unformatted sequential access file:
open(out_file_id,FILE=out_file,form="unformatted")
Assuming you are writing a single array real a(n,n,n) using simply write(out_file_id)a you should see a file size 4*n^3+8 bytes. The extra 8 bytes being a 4 byte integer (=4n^3) repeated at the start and end of the record.
the second form:
open (in_file_id,FILE=in_file,form='unformatted',access='direct',recl=4*n*n*n)
opens direct acess, which does not have those headers. For writing now you'd have write(unit,rec=1)a. If you read your sequential access file using direct acess it will read without error but you'll get that integer header read as a float (garbage) as the (1,1,1) array value, then everything else is shifted. You say you can read with fortran ,but are you looking to see that you are really reading what you expect?
The best fix to this is to fix your original fortran code to use unformatted,direct access for both reading and writing. This gives you an 'ordinary' raw binary file, no headers.
Alternately in your python you need to first read that 4 byte integer, then your data. On output you could put the integer headers back or not depending on what your paraview filter is expecting.
---------- here is python to read/modify/write an unformatted sequential fortran file containing a single record:
import struct
import numpy as np
f=open('infile','rb')
recl=struct.unpack('i',f.read(4))[0]
numval=recl/np.dtype('float32').itemsize
data=np.fromfile(f,dtype='float32',count=numval)
endrec=struct.unpack('i',f.read(4))[0]
if endrec is not recl: print "error unexpected end rec"
f.close()
f=open('outfile')
f.write(struct.pack('i',recl))
for i in range(0,len(data)):data[i] = data[i]**2 #example data modification
data.tofile(f)
f.write(struct.pack('i',recl)
just loop for multiple records.. note that the data here is read as a vector and assumed to be all floats. Of course you need to know the actuall data type to make use if it..
Also be aware you may need to deal with byte order issues depending on platform.

Categories