pandas randomly reads one NaN? - python

I have a series of .csv files that I'm reading with pandas.read_csv. Out of a bunch of columns, I only read two (the 2nd and 15th columns).
datafiles = glob.glob(mypath)
for dfile in datafiles:
    data = pd.read_csv(dfile, header=6, usecols=['Reading', 'Value'])
The CSV looks like this, with a few lines of header at the top. Every once in a while, pandas reads one of these numbers off as a NaN. Excel has no trouble reading these values, and visually inspecting the file I don't see what causes the problem. Specifically, in this case the row indexed as 265 in the file (263 in the data frame) reads a NaN in the 'Value' column when it should be ~27.4.
>>>data['Value'][264]
nan
This problem is consistent and doesn't change with the number of files I read. In many of the files, the problem is not present. In the rest, only one seemingly random number is read as a NaN, in either one of the columns. I've tried changing from the automatic float64 to np.float128 using dtype, but this doesn't fix it. Any ideas on how to fix this?
Update: A grep search shows that the newline character is ^M, with only 4 exceptions: lines at the beginning of every file, before the header. On further inspection, this specific point [264] is treated differently across the failing files: in 5/12 files it's fine, in 2/12 files it's read as 27.0, in 3/12 it's read as nan, and in 2/12 files it's read as 2.0. One of the files (one that reads out a 27.0) is available for download here

It looks like you have null characters scattered randomly throughout your csv files, and they are causing the problem. What you need to do to fix this is replace \0 with nothing before handing the text to pandas.
Here's an example of how to do so. The StringIO import is needed because the cleaned text is loaded from a string instead of from a file.
import sys
import glob
import pandas as pd

if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

datafiles = glob.glob(mypath)
for dfile in datafiles:
    st = ''
    with open(dfile, 'r') as f:
        for line in f:
            line = line.replace('\0', '')   # drop embedded null characters
            st += line
    data = pd.read_csv(StringIO(st), header=6, usecols=['Reading', 'Value'])
It would be cool if pandas had a function to do this by default when you load data into the DataFrame, but it appears that there is no function like that as of now.
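A slightly more compact variant of the same idea (just a sketch, assuming each file is small enough to read into memory in one go):

import glob
import pandas as pd
from io import StringIO

for dfile in glob.glob(mypath):
    with open(dfile, 'r') as f:
        cleaned = f.read().replace('\0', '')   # strip the embedded null characters in one pass
    data = pd.read_csv(StringIO(cleaned), header=6, usecols=['Reading', 'Value'])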

Related

How to write a large .txt file to a csv for Big Query dump?

I have a dataset that is 86 million rows x 20 columns with a header, and I need to convert it to a csv in order to dump it into Big Query (and add multiple tags from there). The obvious solution is reading the .txt file with pd.read_csv, but I don't have enough memory on my device to hold 86 million rows, and it will crash Jupyter.
I'm aware of other threads such as (How to convert a tab delimited text file to a csv file in Python) but my issue seems rather niche.
Is there a way I could go about this? I thought about Vaex, but I'm completely unfamiliar with the toolkit, and it doesn't seem to have a CSV writer built in.
Current thoughts would be:
import csv

csv_path = r'csv_test.csv'
txt_path = r'txt_test.txt'

with open(txt_path, "r") as in_text:
    in_reader = csv.reader(in_text, delimiter="|", skipinitialspace=True)
    with open(csv_path, "w") as out_csv:
        out_writer = csv.writer(out_csv, delimiter=',')
        for row in in_reader:
            out_writer.writerow(row)
Currently, I am receiving an error stating:
Error: field larger than field limit (131072)
It seems this refers to the maximum size of a single field, so I'm quite a bit over the limit.
I've gotten a csv to generate from smaller batches (only using 3 of the 35 total .txt files), but when I attempt to use all of them, it fails with the code above.
Update: I have expanded the field size limit to sys.maxsize and am still receiving this same error.
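(For reference, raising that limit is usually done through csv.field_size_limit; a minimal sketch of what "expanding to sys.maxsize" means here:)

import csv
import sys

# Raise the csv module's per-field size cap as far as the platform allows.
limit = sys.maxsize
while True:
    try:
        csv.field_size_limit(limit)
        break
    except OverflowError:
        limit = limit // 10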
I have no way to verify if this works due to the sheer size of the dataset, but it seems like it /should/ work. Trying to read it with Vaex would work if I wasn't getting parsing errors due to there being commas within the data.
So I have 3 questions:
Is there a way I can write a larger sized csv?
Is there a way to dump in the large pipe delimited .txt file to Big Query in chunks as different csv's?
Can I dump 35 csv's into Big Query in one upload?
Edit:
here is a short dataframe sample:
|CMTE_ID| AMNDT_IND| RPT_TP| TRANSACTION_PGI| IMAGE_NUM| TRANSACTION_TP| ENTITY_TP| NAME| CITY| STATE| ZIP_CODE| EMPLOYER| OCCUPATION| TRANSACTION_DT| TRANSACTION_AMT| OTHER_ID| TRAN_ID| FILE_NUM| MEMO_CD| MEMO_TEXT| SUB_ID
0|C00632562|N|M4|P|202204139496092475|15E|IND|NAME, NAME|PALO ALTO|CA|943012820.0|NOT EMPLOYED|RETIRED|3272022|5|C00401224|VTEKDYJ78M3|1581595||* EARMARKED CONTRIBUTION: SEE BELOW|4041920221470955005
1|C00632562|N|M4|P|202204139496092487|15E|IND|NAME, NAME|DALLAS|TX|752054324.0|SELF EMPLOYED|PHOTOGRAPHER|3272022|500|C00401224|VTEKDYJ7BD4|1581595||* EARMARKED CONTRIBUTION: SEE BELOW|4041920221470955041
I think there is a bit of a red herring going on here:
Is there a way I can write a larger sized csv?
Yes, the reader-and-writer iterator style should be able to handle a file of any length; it steps through incrementally and at no stage attempts to read the whole file into memory. Something else is going wrong in your example.
Is there a way to dump in the large tab-delimited .txt file to Big Query in chunks as different csv's?
You shouldn't need to.
Can I dump 35 csv's into Big Query in one upload?
That's more of a Big Query API question, so I won't attempt to answer it here.
In your code, your text delimiter is set to a pipe, but in your question number 2, you describe it as being tab delimited. If you're giving the wrong delimiter to the code, it might try to read more content into a field than it's expecting, and fail when it hits some field-size limit. This sounds like it might be what's going on in your case.
Also, watch out when piping your file out and changing delimiters: in the data sample you posted there are commas embedded in the text, which might result in a corrupted file when it comes to reading it back in on the other side. Take some time to think about your target CSV dialect, in terms of text quoting, chosen delimiters, etc.
Try replacing the | with \t and see if that helps.
If you're only changing the delimiter from one thing to another, is that even a useful step? Maybe forget the CSV nature of the file entirely: read lines iteratively and write them out without modifying them; you could use readline and write for this, probably speeding things up in the process. Again, because this is iterative, you won't have to worry about loading the whole file into RAM, and can just stream from your source to your target. Beware how long this might take, and if you have a patchy network it can all go horribly wrong. But at least it's a different error!
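A minimal sketch of that streaming approach, reusing the txt_path and csv_path names from the question (the delimiter swap is optional and only safe if '\t' never appears inside a field):

with open(txt_path, 'r') as src, open(csv_path, 'w') as dst:
    for line in src:
        # Stream line by line; never hold the whole file in memory.
        dst.write(line.replace('|', '\t'))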

Merge CSV columns with irregular timestamps and different header names per file

I have long CSV files with different headers in every file.
The first column is always a timestamp, but the timings are irregular, so the timestamps rarely match across files.
file1.csv
time,L_pitch,L_roll,L_yaw
2020-08-21T09:58:07.570,-0.0,-6.1,0.0
2020-08-21T09:58:07.581,-0.0,-6.1,0.0
2020-08-21T09:58:07.591,-0.0,-6.1,0.0
....
file2.csv
time,R_pitch,R_roll,R_yaw
2020-08-21T09:58:07.591,1.3,-5.7,360.0
2020-08-21T09:58:07.607,1.3,-5.7,360.0
2020-08-21T09:58:07.617,1.3,-5.7,360.0
....
file3.csv
time,L_accel_lat,L_accel_long,L_accel_vert
2020-08-21T09:58:07.420,-0.00,-0.00,0.03
2020-08-21T09:58:07.430,-0.00,0.00,0.03
2020-08-21T09:58:07.440,-0.00,0.00,0.03
....
At the moment there can be up to 6 CSV files in that format in a folder.
I would like to merge these CSVs into one file where all columns are kept and rows are sorted by timestamp. When timestamps match, the data is merged into the same line. When a timestamp has no match, it gets its own line with empty fields.
The result should look like this.
time,L_pitch,L_roll,L_yaw,R_pitch,R_roll,R_yaw,L_accel_lat,L_accel_long,L_accel_vert
2020-08-21T09:58:07.420,,,,,,,-0.00,-0.00,0.03
2020-08-21T09:58:07.430,,,,,,,-0.00,0.00,0.03
2020-08-21T09:58:07.440,,,,,,,-0.00,0.00,0.03
....
2020-08-21T09:58:07.581,-0.0,-6.1,0.0,,,,,,
2020-08-21T09:58:07.591,-0.0,-6.1,0.0,1.3,-5.7,360.0,,,
The last line is an example of a matching timestamp, where the data from both files is merged into one line.
So far I have tried this GitHub link, but it merges the filenames into the CSV and does no sorting.
Pandas in Python seems to be up to the task, but my skills are not. I also tried some Python scripts from GitHub...
This one seemed the most promising after adapting it to my case, but it runs without ever finishing (files too big?).
Is it possible to do this in a PowerShell ps1 or a somewhat (for me) "easy" Python script?
I would build this into a batch file to work in several folders.
Thanks in advance
goam
As you mentioned, you can solve your problems rather conveniently using pandas.
import pandas as pd
import glob

tmp = []
for f in glob.glob("file*"):
    print(f)
    tmp.append(pd.read_csv(f, index_col=0, parse_dates=True))
pd.concat(tmp, axis=1, sort=True).to_csv('merged.csv')
Some explanation:
Here, we use glob to get the list of files matching the wildcard pattern file*. We loop over this list and read each file using pandas read_csv. Note that we parse the dates of the time column (converting it to dtype datetime64[ns]) and use that column as the index of the dataframe. We store the dataframes in a list called tmp. Finally, we concatenate the individual dataframes (one per file) in tmp using concat and immediately write the result to a file called merged.csv using pandas to_csv.
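Since you mention wanting to run this over several folders from a batch file, here is a rough sketch of the same merge wrapped per folder (the folder names are only placeholders):

import os

for folder in ['folder1', 'folder2']:          # placeholder folder names
    frames = [pd.read_csv(f, index_col=0, parse_dates=True)
              for f in glob.glob(os.path.join(folder, 'file*'))]
    if frames:
        pd.concat(frames, axis=1, sort=True).to_csv(os.path.join(folder, 'merged.csv'))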

Read a csv into pandas that has commas *within* first/index cells of the csv rows without changing value

Ok, I get this error...:
"pandas.errors.ParserError: Error tokenizing data. C error: Expected 6 fields in line 12, saw 7"
...when trying to import a csv into a python script with pandas.read_csv():
path,Drawing_but_no_F5,Paralell_F5,Fixed,Needs_Attention,Errors
R:\13xx Original Ranch Buildings\1301 Stonehouse\1301-015\F - Bid Documents and Contract Award,Yes,No,No,No,No
R:\13xx Original Ranch Buildings\1302 Carriage House\1302-026A Carriage House, Redo North Side Landscape\F - Bid Document and Contract Award,Yes,No,No,No,No
R:\13xx Original Ranch Buildings\1302 Carriage House\1302-028\F - Bid Documents and Contract Award,Yes,No,No,No,No
R:\13xx Original Ranch Buildings\1302 Carriage House\1302-029\F - Bid Documents and Contract Award,Yes,No,No,No,No
Obviously, in the above entries, it is the third data line that throws the error. Caveats: I have to use that column as a path to process files there, so changing the entry is not allowed. The CSV is created elsewhere; I get it as-is.
I do want to preserve the column header.
This filepath column is used later as an index, so I would like to preserve that.
Many, many similar issues, but solutions seem very specific and I cannot get them to cooperate for my use case:
Pandas, read CSV ignoring extra commas
Solutions seem to change entry values or rely on the cells being in the last column
Commas within CSV Data
Solution involves sql tools methinks. I don't want to read the csv into sql tables...
The csv file is already delimited by commas, so I don't think changing the sep value will work (I cannot get it to work -- yet).
Problems reading CSV file with commas and characters in pandas
Solution throws an error at "for line in reader": _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
Not too optimistic, since the OP had the cell value in quotes, whereas I do not.
Here is a solution which is a minor modification of the accepted answer by @DSM in the last thread you linked to (Problems reading CSV file with commas and characters in pandas).
import csv

with open('original.csv', 'r') as infile, open('fixed.csv', 'w') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for line in reader:
        # Re-join everything left of the last five fields back into a single path field.
        newline = [','.join(line[:-5])] + line[-5:]
        writer.writerow(newline)
After running the above preprocessing code, you should be able to read fixed.csv using pd.read_csv().
This solution depends on knowing how many of the rightmost columns are always formatted correctly. In your example data, the rightmost five columns are always good, so we treat everything to the left of these columns as a single field, which csv.writer() wraps in double quotes.
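As a follow-up, reading the repaired file back should then work as usual; a minimal sketch, assuming you want the path column from the header as the index:

import pandas as pd

df = pd.read_csv('fixed.csv', index_col='path')   # 'path' is the first header in the sample data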

Using Python v3.5 to load a tab-delimited file, omit some rows, and output max and min floating numbers in a specific column to a new file

I've tried for several hours to research this, but every possible solution hasn't suited my particular needs.
I have written the following in Python (v3.5) to download a tab-delimited .txt file.
#!/usr/bin/env /Library/Frameworks/Python.framework/Versions/3.5/bin/python3.5
import urllib.request
import time
timestr = time.strftime("%Y-%m-%d %H-%M-%S")
filename="/data examples/"+ "ace-magnetometer-" + timestr + '.txt'
urllib.request.urlretrieve('http://services.swpc.noaa.gov/text/ace-magnetometer.txt', filename=filename)
This downloads the file from here and renames it based on the current time. It works perfectly.
I am hoping that I can then use the "filename" variable to then load the file and do some things to it (rather than having to write out the full file path and file name, because my ultimate goal is to do the following to several hundred different files, so using a variable will be easier in the long run).
This using-the-variable idea seems to work, because adding the following to the above prints the contents of the file to STDOUT... (so it's able to find the file without any issues):
import csv

with open(filename, 'r') as f:
    reader = csv.reader(f, dialect='excel', delimiter='\t')
    for row in reader:
        print(row)
As you can see from the file, the first 18 lines are informational.
Line 19 provides the actual column names. Then there is a line of dashes.
The actual data I'm interested in starts on line 21.
I want to find the minimum and maximum numbers in the "Bt" column (third column from the right). One of the possible solutions I found would only work with integers, and this dataset has floating numbers.
Another possible solution involved importing the pyexcel module, but I can't seem to install that correctly...
import pyexcel as pe
data = pe.load(filename, name_columns_by_row=19)
min(data.column["Bt"])
I'd like to be able to print the minimum Bt and maximum Bt values into two separate files called minBt.txt and maxBt.txt.
I would appreciate any pointers anyone may have, please.
This is meant to be a comment on your latest question to Apoc, but I'm new, so I'm not allowed to comment. One thing that might create problems is that bz_values (and bt_values, for that matter) might be a list of strings (at least it was when I tried to run Apoc's script on the example file you linked to). You could solve this by substituting this:
min_bz = min([float(x) for x in bz_values])
max_bz = max([float(x) for x in bz_values])
for this:
min_bz = min(bz_values)
max_bz = max(bz_values)
The following will work as long as all the files are formatted in the same way, i.e. data starting 21 lines in, the same number of columns, and so on. Also, the file that you linked did not appear to be tab delimited, so I've simply used the string split method on each row instead of the csv reader. The column is read from the file into a list, and that list is used to calculate the maximum and minimum values:
from itertools import islice

# Line that data starts from, zero-indexed.
START_LINE = 20
# The column containing the data in question, zero-indexed.
DATA_COL = 10
# The value present when a measurement failed.
FAILED_MEASUREMENT = '-999.9'

with open('data.txt', 'r') as f:
    bt_values = []
    for val in (row.split()[DATA_COL] for row in islice(f, START_LINE, None)):
        if val != FAILED_MEASUREMENT:
            bt_values.append(float(val))

min_bt = min(bt_values)
max_bt = max(bt_values)

with open('minBt.txt', 'a') as minFile:
    print(min_bt, file=minFile)
with open('maxBt.txt', 'a') as maxFile:
    print(max_bt, file=maxFile)
I have assumed that since you are doing this to multiple files you are looking to accumulate multiple max and min values in the maxBt.txt and minBt.txt files, and hence I've opened them in 'append' mode. If this is not the case, please swap out the 'a' argument for 'w', which will overwrite the file contents each time.
Edit: Updated to include workaround for failed measurements, as discussed in comments.
Edit 2: Updated to fix problem with negative numbers, also noted by Derek in separate answer.
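Since the question mentions doing this to several hundred files via the filename variable, the same logic can be wrapped in a small function; a sketch under the same assumptions (column 10, data from line 21, -999.9 as the failure marker):

from itertools import islice

def bt_min_max(path, data_col=10, start_line=20, failed='-999.9'):
    """Return (min, max) of the Bt column for one ACE magnetometer file."""
    values = []
    with open(path, 'r') as f:
        for row in islice(f, start_line, None):
            val = row.split()[data_col]
            if val != failed:
                values.append(float(val))
    return min(values), max(values)

# e.g. min_bt, max_bt = bt_min_max(filename)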

Text file manipulation with Python

First off, I am very new to Python. When I started this it seemed very simple, but now I am at a complete loss.
I want to take a text file with as many as 90k entries and put each group of data on a single line, separated by ';'. My examples are below. Keep in mind that the groups of data vary in size: they could be two entries, or 100 entries.
Raw Data
group1
data
group2
data
data
data
group3
data
data
data
data
data
data
data
data
data
data
data
data
group4
data
data
Formatted Data
group1;data;
group2;data;data;data;
group3;data;data;data;data;data;data;data;data;data;data;data;data;
group4;data;data;
Try something like the following (untested... you can learn a bit of Python by debugging!).
Create a Python file "parser.py":
import sys

f = open('filename.txt', 'r')
for line in f:
    txt = line.strip()
    if txt == '':
        sys.stdout.write('\n\n')
        sys.stdout.flush()
    sys.stdout.write(txt + ';')
    sys.stdout.flush()
f.close()
and in a shell, type:
python parser.py > output.txt
and see if output.txt is what you want.
Assuming the groups are separated with an empty line, you can use the following one-liner:
>>> print "\n".join([item.replace('\n', ';') for item in open('file.txt').read().split('\n\n')])
group1;data
group2;data;data;data
group3;data;data;data;data;data;data;data;data;data;data;data;data
group4;data;data;
where file.txt contains
group1
data
group2
data
data
data
group3
data
data
data
data
data
data
data
data
data
data
data
data
group4
data
data
First the file content (open().read()) is split on empty lines with split('\n\n') to produce a list of blocks; then, in each block ([item ... for item in list]), newlines are replaced with semicolons; finally, all blocks are joined with newlines ("\n".join(list)) and printed.
Note that the above is not safe for production; it is code you would write for an interactive data transformation, not for production-level scripts.
What have you tried? What is the text file for, or from? File manipulation is one of the last "basic" things I plan on learning. I'm saving it for when I understand the nuances of for loops, while loops, dictionaries, lists, appending, and a million other handy functions out there. That's after 2-3 months of research, coding and creating GUIs, by the way.
Anyways here's some basic suggestions.
';'.join(group) will put a ";" between each item in a group, effectively creating one long (semicolon-delimited) string.
group.replace("SPACE CHARACTER", ";") : This will replace any spaces or specified character (like a newline) within a group with a semi-colon.
There are a lot of other approaches that involve loading the txt file into a Python script, .append() calls, putting the groups into lists, dictionaries, or matrices, etc.
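A rough sketch of how those pieces could fit together (this assumes every group header starts with the literal text 'group', which may not hold for the real data):

groups = []
with open('raw.txt') as f:                 # 'raw.txt' is a placeholder file name
    for line in f:
        line = line.strip()
        if line.startswith('group'):
            groups.append([line])          # start a new group
        elif line and groups:
            groups[-1].append(line)        # attach data lines to the current group

with open('formatted.txt', 'w') as out:
    for g in groups:
        out.write(';'.join(g) + ';\n')     # e.g. "group1;data;"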
These are my bits to throw on the problem:
from collections import defaultdict
import codecs
import csv

res = defaultdict(list)
cgroup = ''
with codecs.open('tmp.txt', encoding='UTF-8') as f:
    for line in f:
        if line.startswith('group'):
            cgroup = line.strip()
            continue
        res[cgroup].append(line.strip())

with codecs.open('out.txt', 'w', encoding='UTF-8') as f:
    w = csv.writer(f, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for k in res:
        w.writerow([k] + res[k])
Let me explain a bit about why I did things the way I did. First, I used the codecs module to open the data file with an explicit encoding, since data should always be treated properly rather than by guessing what it might be. Then I used a defaultdict, which has nice documentation online, because it's more pythonic, at least according to Mr. Hettinger; it is one of the patterns that can be learned if you use Python.
Lastly, I used a csv writer to generate the output, because writing CSV files is not as easy as one might think. To meet the right criteria and get the data into correct csv format, it is better to use what many eyes have already seen instead of reinventing the wheel.
