Best format for saving list of JSON strings - python

I'm trying to run an ETL over a lot of XML files that exist in my data lake. The first step is to translate those XML files into JSON files, because it's much easier to load JSON files into databases than XML strings.
I'm trying to understand which format is better:
Format 1
[{'key':val}, {'key':val}, {'key':val}, {'key':val}]
Format 2:
{'key':val}, {'key':val}, {'key':val}, {'key':val}
Format 3 (as a one-column CSV):
{'key':val}
{'key':val}
{'key':val}
The advantage of Format 1 is that I'm able to load the file back with json.load, which I can't do with the second format (I get json.decoder.JSONDecodeError: Extra data).
The advantage of Format 2 is that I can load the file as it is into many databases.
Another option is saving all three versions of the file, but that feels like a waste of space if there is another good solution.
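For illustration, here is a minimal sketch of Format 3 (one JSON object per line, often called JSON Lines / NDJSON): it can be written with plain file writes and read back with json.loads per line, which avoids the "Extra data" error you get from json.load. The file name and records below are made up.

import json

records = [{"key": 1}, {"key": 2}, {"key": 3}]

# Write one JSON object per line (Format 3 / JSON Lines).
with open("records.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read it back by parsing each line individually instead of using json.load.
with open("records.jsonl") as f:
    loaded = [json.loads(line) for line in f]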

Related

How to write a large .txt file to a csv for Big Query dump?

I have a dataset that is 86 million rows x 20 columns with a header, and I need to convert it to a CSV in order to dump it into Big Query (adding multiple tags from that). The logical solution is reading the .txt file with pd.read_csv, but I don't have enough memory for 86 million rows on my device and it will crash Jupyter.
I'm aware of other threads such as (How to convert a tab delimited text file to a csv file in Python) but my issue seems rather niche.
Is there a way I could go about this? I thought about Vaex, but I'm completely unfamiliar with the toolkit, and it doesn't seem to have a writer built in.
Current thoughts would be:
import csv

csv_path = r'csv_test.csv'
txt_path = r'txt_test.txt'

with open(txt_path, "r") as in_text:
    in_reader = csv.reader(in_text, delimiter="|", skipinitialspace=True)
    # newline="" prevents extra blank rows being written on Windows
    with open(csv_path, "w", newline="") as out_csv:
        out_writer = csv.writer(out_csv, delimiter=",")
        for row in in_reader:
            out_writer.writerow(row)
Currently, I am receiving an error stating:
Error: field larger than field limit (131072)
It seems it's the maximum row count in a single column, so I'm quite a bit off.
I've gotten a CSV to generate from smaller inputs (using only 3 of the 35 total .txt files), but when I attempt to use them all, it fails with the code above.
Update: I have raised the field size limit via csv.field_size_limit(sys.maxsize) and am still receiving this same error.
I have no way to verify whether this works due to the sheer size of the dataset, but it seems like it should work. Trying to read it with Vaex would work if I weren't getting parsing errors due to there being commas within the data.
So I have 3 questions:
Is there a way I can write a larger sized csv?
Is there a way to dump the large tab-delimited .txt file into Big Query in chunks as different csv's?
Can I dump 35 csv's into Big Query in one upload?
Edit:
here is a short dataframe sample:
|CMTE_ID| AMNDT_IND| RPT_TP| TRANSACTION_PGI| IMAGE_NUM| TRANSACTION_TP| ENTITY_TP| NAME| CITY| STATE| ZIP_CODE| EMPLOYER| OCCUPATION| TRANSACTION_DT| TRANSACTION_AMT| OTHER_ID| TRAN_ID| FILE_NUM| MEMO_CD| MEMO_TEXT| SUB_ID
0|C00632562|N|M4|P|202204139496092475|15E|IND|NAME, NAME|PALO ALTO|CA|943012820.0|NOT EMPLOYED|RETIRED|3272022|5|C00401224|VTEKDYJ78M3|1581595||* EARMARKED CONTRIBUTION: SEE BELOW|4041920221470955005
1|C00632562|N|M4|P|202204139496092487|15E|IND|NAME, NAME|DALLAS|TX|752054324.0|SELF EMPLOYED|PHOTOGRAPHER|3272022|500|C00401224|VTEKDYJ7BD4|1581595||* EARMARKED CONTRIBUTION: SEE BELOW|4041920221470955041
I think there is a bit of a red herring going on here:
Is there a way I can write a larger sized csv?
Yes, the csv reader and writer work as iterators: they step through the file incrementally and at no stage attempt to read the whole file into memory, so they can handle a file of any length. Something else is going wrong in your example.
Is there a way to dump in the large tab-delimited .txt file to Big Query in chunks as different csv's?
You shouldn't need to.
Can I dump 35 csv's into Big Query in one upload?
That's more of a Big Query API question, so I won't attempt to answer it here.
In your code, your text delimiter is set to a pipe, but in your question number 2 you describe the file as tab-delimited. If you give the wrong delimiter to the code, it may try to read far more content into a single field than expected and fail when it hits the field-size limit. This sounds like it might be what's going on in your case.
Also, watch out when piping your file out and changing delimiters: in the data sample you posted there are commas embedded in the text, which could produce a corrupted file when it comes to reading it back in on the other side. Take some time to think about your target CSV dialect, in terms of text quoting, chosen delimiter, and so on.
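As a rough sketch of what that could look like (reusing the file names from the question), quoting every field keeps commas inside values such as "NAME, NAME" from breaking the comma-delimited output:

import csv

txt_path = r'txt_test.txt'   # pipe-delimited source
csv_path = r'csv_test.csv'

with open(txt_path, "r", newline="") as in_text, open(csv_path, "w", newline="") as out_csv:
    in_reader = csv.reader(in_text, delimiter="|", skipinitialspace=True)
    # QUOTE_ALL wraps every field in quotes, so embedded commas survive.
    out_writer = csv.writer(out_csv, delimiter=",", quoting=csv.QUOTE_ALL)
    for row in in_reader:
        out_writer.writerow(row)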
Try replacing the | with \t and see if that helps.
If you're only changing the delimiter from one thing to another, is that even a useful step? You could forget the CSV nature of the file entirely, read lines iteratively, and write them out unmodified (for example with readline and write), probably speeding things up in the process. Again, because this is iterative you won't have to worry about loading the whole file into RAM; you just stream from the source to your target. Beware how long it might take, and if you have a patchy network it can all go horribly wrong. But at least it's a different error!
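A minimal sketch of that streaming approach, assuming no per-field processing is needed (the output file name is made up):

txt_path = r'txt_test.txt'
out_path = r'txt_test_copy.txt'

# Iterating the file object streams it line by line; the whole file
# is never loaded into RAM.
with open(txt_path, "r") as src, open(out_path, "w") as dst:
    for line in src:
        dst.write(line)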

Pandas read_excel get only last row

I have an Excel file that is generated daily and can have 50k+ rows. Is there a way to read only the last row (which is the sum of the columns)?
Right now I am reading the entire sheet and keeping only the last row, but it takes a huge amount of runtime.
my code:
df = pd.read_excel(filepath, header=1, usecols="O:AC")
df = df.tail(1)
Pandas is quite slow, especially with large amounts of data in memory. You could consider a lazy loading approach, for example with Dask.
Otherwise you can read the file using open and take the last line:
with open(filepath, "r") as file:
    last_line = file.readlines()[-1]
I don't think there is a way to decrease runtime when reading an Excel file.
When you read an Excel file (or one sheet of it), all of its data is loaded into memory first. Even if you use the skiprows argument of pd.read_excel, the rows are only dropped after everything has been loaded, so it doesn't reduce the runtime.
If you really want to decrease the read time, save the data in another format, such as .csv or .txt.
Also, you generally can't read Microsoft Excel files as text files using methods like readlines or read. You should either convert the files to another format first (.csv is a good choice, since it can be read by the csv module) or use dedicated Python modules like pyexcel or openpyxl to read .xlsx files directly.
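If the files really are .xlsx, one option is to stream the sheet with openpyxl in read-only mode and keep only the final row. A sketch, assuming openpyxl is installed and that the question's column range O:AC maps to columns 15-29 (the file name is made up):

from openpyxl import load_workbook

wb = load_workbook("daily_report.xlsx", read_only=True)  # read_only streams rows lazily
ws = wb.active

last_row = None
for row in ws.iter_rows(min_col=15, max_col=29, values_only=True):  # columns O:AC
    last_row = row  # keep overwriting until only the final row remains

wb.close()
print(last_row)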

Looping through an xlsx file

I have two questions regarding reading data from a file in .xlsx format.
Is it possible to convert an .xlsx file to .csv without actually opening the file in pandas or using xlrd? When I have to open many files this is quite slow, and I was trying to speed it up.
Is it possible to use some sort of for loop to loop through decoded xlsx lines? I put an example below.
xlsx_file = 'some_file.xlsx'
with open(xlsx_file) as lines:
    for line in lines:
        # <do something like I would do for a normal string>
I would like to know if this is possible without the well known xlrd module.
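An .xlsx file is a zipped binary format, so it can't be iterated with open() like a text file. As a sketch of an alternative, assuming openpyxl (rather than xlrd) is an acceptable dependency, you can loop over rows and, if needed, write them straight to a CSV; the output file name is made up:

import csv
from openpyxl import load_workbook

xlsx_file = 'some_file.xlsx'
csv_file = 'some_file.csv'

wb = load_workbook(xlsx_file, read_only=True)  # read-only mode streams rows
ws = wb.active

with open(csv_file, "w", newline="") as f:
    writer = csv.writer(f)
    for row in ws.iter_rows(values_only=True):
        # row is a plain tuple of cell values, usable like any other sequence
        writer.writerow(row)

wb.close()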

How to parse strings with newlines from a csv file in tensorflow?

The official TensorFlow tutorial suggests parsing CSV files by using a tf.TextLineReader to read the file line by line and then using tf.decode_csv (source). However, this does not work for CSV records containing strings with newlines, since such a record gets split up by the line reader.
What is the best way to parse these types of files?
pandas.read_csv() can parse such CSV files properly, as long as the strings are quoted correctly:
CSV:
a,b,c
1,"text which includes
line
breaks",100
2,another line,200
3,yet another line,300
import pandas as pd

df = pd.read_csv(r'D:\temp\1.csv')
result:
In [21]: df
Out[21]:
   a                                      b    c
0  1  text which includes\r\nline\r\nbreaks  100
1  2                           another line  200
2  3                       yet another line  300
tf.decode_csv expects CSV files in RFC 4180 format, and according to RFC 4180, line breaks (CRLF) are indeed supposed to delimit records.
TensorFlow 1.8 introduced the API tf.contrib.data.make_csv_dataset for reading CSV files into a dataset.
I don't know if it solves your problem, but it is worth trying.
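A rough sketch of what that could look like in TensorFlow 1.8-style code (the file name and batch size are placeholders; in newer releases the same function lives under tf.data.experimental.make_csv_dataset):

import tensorflow as tf  # 1.8+, where tf.contrib.data.make_csv_dataset was introduced

dataset = tf.contrib.data.make_csv_dataset("1.csv", batch_size=4)

iterator = dataset.make_one_shot_iterator()
batch = iterator.get_next()  # a dict mapping column names to tensors

with tf.Session() as sess:
    print(sess.run(batch))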

Python xmlutils, formatting xml to csv conversion

I am converting a generated XML file to a CSV file using xmlutils. However, the nodes that I tagged in the XML file sometimes have an extra child node, which messes up the formatting of the converted CSV file.
For instance,
<issue>
    <name>project1</name>
    <key>733</key>
</issue>
<issue>
    <name>project2</name>
    <key>123</key>
    <debt>233</debt>
</issue>
I tagged "issue" and the xml file was converted to a csv. However, once I opened the csv file, the formatting is wrong. Since there was an extra "debt" node in the second issue element, the columns for the second were shifted.
For instance,
name key
project1 733
project2 123 233
How can I tell xmlutils to generate a new column "debt"?
Also, if xmlutils cannot do the job, can you recommend a program that is better suited?
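If xmlutils can't be told to add the column, a fallback sketch with the standard library is below. It assumes the <issue> elements are wrapped in a single root element so the file is valid XML, and the file names are made up. The idea is to collect the union of all child tag names first, then write with csv.DictWriter so a missing <debt> becomes an empty cell instead of shifting the columns:

import csv
import xml.etree.ElementTree as ET

tree = ET.parse("issues.xml")
issues = tree.getroot().findall(".//issue")

# First pass: union of all child tag names, so optional nodes like <debt>
# get their own column.
fieldnames = []
for issue in issues:
    for child in issue:
        if child.tag not in fieldnames:
            fieldnames.append(child.tag)

with open("issues.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for issue in issues:
        writer.writerow({child.tag: child.text for child in issue})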
