I have a series of Python scripts doing Oracle SQL database checks and writing the results to individual JSON files, one per output. That's all working fine, although there are probably better ways to do it.
I then merge the JSON files into a single JSON file using the following, which I found on here:
import json
import glob

result = []
for f in glob.glob("*.json"):
    with open(f, "rb") as infile:
        result.append(json.load(infile))

with open("merged_file.json", "w") as outfile:
    json.dump(result, outfile)
This works roughly how I wanted the output to be, the only issue I'm facing is the "names" of the result sets (I'm sure my lack of terminology knowledge is what's made this hard).
To visualize what I mean, this is the output result without the data (I realise this isn't the correct output format, this is from a gui viewer of the file to better see formatting):
[]JSON
    {}0
        {}RMTID
        {}FLIGHTS
        {}HOURLY
        {}WEATHER
        {}CALIBRATION
        {}GAPS
    {}1
        {}RMTID
        {}FLIGHTS
        {}HOURLY
        {}WEATHER
        {}CALIBRATION
        {}GAPS
I'm looking to have the 0 and 1 be the labels of the result set, so that the output would look like:
[]JSON
    {}LAX
        {}RMTID
        {}FLIGHTS
        {}HOURLY
        {}WEATHER
        {}CALIBRATION
        {}GAPS
    {}LGW
        {}RMTID
        {}FLIGHTS
        {}HOURLY
        {}WEATHER
        {}CALIBRATION
        {}GAPS
Is that possible with Python? Or should I be looking at alternative solutions?
I've seen answers to similar questions that suggest writing the outputs directly into one file rather than merging multiple files. However, the results come from different database connections, and finding a solution for "queuing" database connections and storing outputs has been above my skill level.
Thanks!
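One possible approach, sketched below. A JSON array only has positional indices (the 0 and 1 shown in the viewer), so named entries need a JSON object, i.e. a Python dict, at the top level. The sketch assumes each input file is named after its label (for example LAX.json and LGW.json); that naming scheme, the function name merge_json_files and its parameters are assumptions for illustration, not something stated in the question.

```python
import glob
import json
import os


def merge_json_files(directory=".", output_name="merged_file.json"):
    """Merge every *.json file in `directory` into one dict keyed by file name."""
    merged = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.json"))):
        # Skip the merged output itself so a second run does not nest it.
        if os.path.basename(path) == output_name:
            continue
        # Assumption: the file name without ".json" is the label, e.g. "LAX".
        label = os.path.splitext(os.path.basename(path))[0]
        with open(path, "r") as infile:
            merged[label] = json.load(infile)

    with open(os.path.join(directory, output_name), "w") as outfile:
        json.dump(merged, outfile, indent=2)
    return merged
```

If the label is stored inside each file instead (for instance under a key of its own), the same dict can be keyed on that value rather than on the file name.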
I have a dataset that is 86 million rows x 20 columns with a header, and I need to convert it to a csv in order to dump it into big query (adding multiple tags from that). The logical solution is reading the .txt file with pd.read_csv but I don't have 86 million rows of memory on my device and it will crash jupyter.
I'm aware of other threads such as (How to convert a tab delimited text file to a csv file in Python) but my issue seems rather niche.
Is there a way I could go about this? I thought about Vaex but I have total unfamiliarity with the toolkit, and it doesn't seem to have a writer within.
Current thoughts would be:
import csv

csv_path = r'csv_test.csv'
txt_path = r'txt_test.txt'

with open(txt_path, "r") as in_text:
    in_reader = csv.reader(in_text, delimiter="|", skipinitialspace=True)
    with open(csv_path, "w") as out_csv:
        out_writer = csv.writer(out_csv, delimiter=',')
        # Stream row by row; the whole file is never held in memory.
        for row in in_reader:
            out_writer.writerow(row)
Currently, I am receiving an error stating:
Error: field larger than field limit (131072)
It seems it's the maximum row count in a single column, so I'm quite a bit off.
I've gotten a csv of smaller files to generate (only using 3 of the 35 total .txt files) but when I attempt to use all, it fails with code above.
Update: I have expanded the sys.maxsize and am still receiving this same error
I have no way to verify if this works due to the sheer size of the dataset, but it seems like it /should/ work. Trying to read it with Vaex would work if I wasn't getting parsing errors due to there being commas within the data.
So I have 3 questions:
Is there a way I can write a larger sized csv?
Is there a way to dump in the large pipe delimited .txt file to Big Query in chunks as different csv's?
Can I dump 35 csv's into Big Query in one upload?
Edit:
here is a short dataframe sample:
|CMTE_ID| AMNDT_IND| RPT_TP| TRANSACTION_PGI| IMAGE_NUM| TRANSACTION_TP| ENTITY_TP| NAME| CITY| STATE| ZIP_CODE| EMPLOYER| OCCUPATION| TRANSACTION_DT| TRANSACTION_AMT| OTHER_ID| TRAN_ID| FILE_NUM| MEMO_CD| MEMO_TEXT| SUB_ID
0|C00632562|N|M4|P|202204139496092475|15E|IND|NAME, NAME|PALO ALTO|CA|943012820.0|NOT EMPLOYED|RETIRED|3272022|5|C00401224|VTEKDYJ78M3|1581595||* EARMARKED CONTRIBUTION: SEE BELOW|4041920221470955005
1|C00632562|N|M4|P|202204139496092487|15E|IND|NAME, NAME|DALLAS|TX|752054324.0|SELF EMPLOYED|PHOTOGRAPHER|3272022|500|C00401224|VTEKDYJ7BD4|1581595||* EARMARKED CONTRIBUTION: SEE BELOW|4041920221470955041
I think there are a few red herrings here:
Is there a way I can write a larger sized csv?
Yes. The iterator-style reader and writer should be able to handle a file of any length: they step through it incrementally and at no stage attempt to read the whole file. Something else is going wrong in your example.
Is there a way to dump in the large tab-delimited .txt file to Big Query in chunks as different csv's?
You shouldn't need to.
Can I dump 35 csv's into Big Query in one upload?
That's more of a Big Query API question, so I won't attempt to answer it here.
In your code, your text delimiter is set to a pipe, but in your question number 2, you describe it as being tab delimited. If you're giving the wrong delimiter to the code, it might try to read more content into a field than it's expecting, and fail when it hits some field-size limit. This sounds like it might be what's going on in your case.
Also, watch out when piping your file out and changing delimiters - in the data sample you post, there are some commas embedded in the text, this might result in a corrupted file when it comes to reading it in again on the other side. Take some time to think about your target CSV dialect, in terms of text quoting, chosen delimiters etc.
Try replacing the | with \t and see if that helps.
If you're only changing the delimiter from one thing to another, is that a useful process? Maybe forget the whole CSV nature of the file: read lines iteratively and write them out unmodified. You could use readline and write for this, probably speeding things up in the process. Again, because they're iterative, you won't have to worry about loading the whole file into RAM; you just stream from your source to your target. Beware how long this might take, and if you have a patchy network, it can all go horribly wrong. But at least it's a different error!
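As a sketch of the streaming approach described above: one possible cause of the field-limit error (an assumption here, since the full data is not shown) is a stray double-quote character, which with default quoting makes csv.reader treat everything up to the next quote as a single field. The version below turns quote handling off on the reader, raises the field limit as a safety net, and lets the writer quote any field that contains a comma. The function name convert_pipe_to_csv is made up for illustration.

```python
import csv
import sys


def convert_pipe_to_csv(txt_path, csv_path):
    """Stream a pipe-delimited text file into a comma-delimited CSV."""
    # Raise the per-field limit; shrink the value if it overflows a C long
    # (which can happen with sys.maxsize on some platforms).
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            break
        except OverflowError:
            limit //= 10

    rows = 0
    with open(txt_path, "r", newline="") as src, \
            open(csv_path, "w", newline="") as dst:
        # QUOTE_NONE: treat quote characters in the input as ordinary text.
        reader = csv.reader(src, delimiter="|", quoting=csv.QUOTE_NONE,
                            skipinitialspace=True)
        # QUOTE_MINIMAL: quote only fields containing commas, quotes or newlines.
        writer = csv.writer(dst, delimiter=",", quoting=csv.QUOTE_MINIMAL)
        for row in reader:
            writer.writerow(row)
            rows += 1
    return rows
```

Calling it once per input file, for example convert_pipe_to_csv("txt_test.txt", "csv_test.csv"), keeps memory use flat regardless of file size, because only one row is held at a time.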
I don't need the entire code, just a push to help me on my way. I've been searching the internet for clues on how to start writing a function like this, but I haven't gotten any further than the name of the function.
I haven't got the slightest clue how to start, and I don't know how to work with text files. Any tips?
These text files are CSV (Comma Separated Values). It is a simple file format used to store tabular data.
You may explore Python's built-in module called csv.
The following code snippet is an example of loading a .csv file in Python:
import csv

filename = 'us_population.csv'
with open(filename, 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    # Each row comes back as a list of strings, one per column.
    for row in csvreader:
        print(row)
Recently I've been using REPL to host my Python code, but whenever I'm offline, any information stored in the JSON file is rolled back after a while. I know from some research that this is a REPL-specific problem, but is there any way I can fix it? My code is quite a few lines long, so I would rather not switch to a completely different storage method.
To successfully store data in json files in replit.com, it's important to load and dump it the correct way.
An example of storing data in json files:
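import json

# Load the current contents of the file.
with open("sample.json", "r") as file:
    sample = json.load(file)

sample["item"] = "Value"

# Write the updated data back to the same file.
with open("sample.json", "w") as file:
    json.dump(sample, file)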
Let me know if you've already followed these steps.
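The load-modify-dump cycle above can also be wrapped in two small helpers, sketched here. This does not change how the hosting platform persists files; it only makes the cycle tolerant of a missing file and avoids leaving a half-written file if the program stops mid-write. The helper names load_json and save_json are made up for illustration.

```python
import json
import os
import tempfile


def load_json(path, default=None):
    """Return the parsed file, or `default` (an empty dict) if it is missing."""
    try:
        with open(path, "r") as file:
            return json.load(file)
    except FileNotFoundError:
        return {} if default is None else default


def save_json(path, data):
    """Write to a temporary file first, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as file:
            json.dump(data, file)
        # os.replace overwrites the target in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        os.unlink(tmp_path)
        raise
```

Usage mirrors the original snippet: sample = load_json("sample.json"), then sample["item"] = "Value", then save_json("sample.json", sample).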
I'm working on a script to parse txt files and store them into a pandas dataframe that I can export to a CSV.
My script worked fine when I was using fewer than 100 of my files, but now that I'm trying to run it on the full sample, I'm running into a lot of issues.
I'm dealing with ~8000 .txt files with an average size of 300 KB, so about 2.5 GB in total.
I was wondering if I could get tips on how to make my code more efficient.
For opening and reading files, I use:
import os

filenames = os.listdir('.')
dict = {}
for file in filenames:
    with open(file) as f:
        contents = f.read()
    dict[file.replace(".txt", "")] = contents
Doing print(dict) crashes (at least it seems like it) my python.
Is there a better way to handle this?
Additionally, I also convert all the values in my dict to lowercase, using:
def lower_dict(d):
    lcase_dict = dict((k, v.lower()) for k, v in d.items())
    return lcase_dict

lower = lower_dict(dict)
I haven't tried this yet (I can't get past the opening/reading stage), but I was wondering if this would cause problems?
Now, before I am marked as duplicate, I did read this: How can I read large text files in Python, line by line, without loading it into memory?
however, that user seemed to be working with 1 very large file which was 5GB, whereas I am working with multiple small files totalling 2.5GB (and actually my ENTIRE sample is something like 50GB and 60,000 files). So I was wondering if my approach would need to be different.
Sorry if this is a dumb question, unfortunately, I am not well versed in the field of RAM and computer processing methods.
Any help is very much appreciated.
thanks
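A minimal sketch of one alternative, assuming the goal is one CSV row per text file: handle one file at a time and write it out immediately, so that only a single file's contents is in memory at any moment. The function name texts_to_csv and the column names are made up for illustration. Note also that naming a variable dict, as in the code above, hides the built-in dict type, so the later call dict(...) inside lower_dict would fail; the sketch avoids that name.

```python
import csv
import os


def texts_to_csv(folder, csv_path):
    """Write one lower-cased (name, text) row per .txt file in `folder`."""
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["name", "text"])
        for entry in sorted(os.listdir(folder)):
            # Ignore anything that is not a .txt file (scripts, the CSV itself).
            if not entry.endswith(".txt"):
                continue
            name = os.path.splitext(entry)[0]
            with open(os.path.join(folder, entry), encoding="utf-8") as f:
                # Only this one file's text is held in memory here.
                writer.writerow([name, f.read().lower()])
            count += 1
    return count
```

Because each row holds a whole file, reading the result back with the csv module would need csv.field_size_limit raised above its default of 131072 characters.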
I believe the thing slowing your code down the most is the .replace() method you are using. I believe this is because the built-in replace method is iterative and, as a result, very inefficient. Try using the re module in your for loops. Here is an example of how I recently used the module to replace the characters "T", ":" and "-" with "", which in this case removed them from the file:
import re

# `lines` is assumed to be an iterable of strings read from the file.
for line in lines:
    line = re.sub('[T:-]', '', line)
Let me know if this helps!
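Whether str.replace or re.sub is faster for a given input can be measured rather than assumed. The sketch below times both on a made-up sample string using the standard timeit module; the function names and the sample are assumptions for illustration, and the timings will vary by machine and input.

```python
import re
import timeit

PATTERN = re.compile("[T:-]")


def strip_with_replace(line):
    """Remove 'T', ':' and '-' with chained str.replace calls."""
    return line.replace("T", "").replace(":", "").replace("-", "")


def strip_with_regex(line):
    """Remove the same characters with one precompiled regular expression."""
    return PATTERN.sub("", line)


# Hypothetical sample input: a timestamp repeated many times.
sample = "2022-04-13T09:49:09" * 1000

if __name__ == "__main__":
    for func in (strip_with_replace, strip_with_regex):
        seconds = timeit.timeit(lambda: func(sample), number=1000)
        print(func.__name__, seconds)
```

Running the same comparison on a representative line from the real files shows which approach is cheaper there, and whether either one matters compared with the cost of reading the files.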
I am using the 64-bit version of Enthought Python to process data across multiple HDF5 files. I'm using h5py version 1.3.1 (HDF5 1.8.4) on 64-bit Windows.
I have an object that provides a convenient interface to my specific data hierarchy, but testing h5py.File(fname, 'r') independently yields the same results. I am iterating through a long list (~100 files at a time) and attempting to pull specific pieces of information out of the files. The problem I'm having is that I'm getting the same information out of several files! My loop looks something like:
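import csv
from glob import glob
import h5py as hdf5

files = glob(r'path\*.h5')
# 'wb', not 'rb': the file is being written (Python 2 csv expects binary mode).
out_csv = csv.writer(open('output_file.csv', 'wb'))
for filename in files:
    handle = hdf5.File(filename, 'r')
    data = extract_data_from_handle(handle)
    for row in data:
        out_csv.writerow((filename, ) + row)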
When I inspect the files using something like hdfview, I know the internals are different. However, the csv I get seems to indicate that all the files contain the same data. Has anyone seen this behavior before? Any suggestions where I could go to start debugging this issue?
I've concluded that this is a strange manifestation of the issue described in "Perplexing assignment behavior with h5py object as instance variable". I rewrote my code so that each file is handled within a function call and the variable is not reused. Using this approach, I don't see the same strange behavior, and it seems to work much better. For clarity, the solution looks more like:
files = glob(r'path\*.h5')
# 'wb', not 'rb': the file is being written (Python 2 csv expects binary mode).
out_csv = csv.writer(open('output_file.csv', 'wb'))

def extract_data_from_filename(filename):
    # The handle lives only inside this call, so it is never reused.
    return extract_data_from_handle(hdf5.File(filename, 'r'))

for filename in files:
    data = extract_data_from_filename(filename)
    for row in data:
        out_csv.writerow((filename, ) + row)
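A variation on the same idea, sketched in Python 3 syntax: open each file in a with block so the handle is closed before the next file is opened, and copy the rows out before closing. This assumes the file class can be used as a context manager, which is the case for current h5py releases but has not been checked against 1.3.1. The opener and the extraction function are passed in as arguments; the names rows_from_file and write_all are made up for illustration.

```python
import csv


def rows_from_file(filename, open_file, extract):
    """Open one file, copy its rows into plain tuples, then close it."""
    with open_file(filename, "r") as handle:
        # Materialize the rows while the file is still open.
        return [tuple(row) for row in extract(handle)]


def write_all(filenames, csv_path, open_file, extract):
    """Write (filename, *row) for every row of every input file."""
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        for filename in filenames:
            for row in rows_from_file(filename, open_file, extract):
                writer.writerow((filename,) + row)
```

With the names from the code above, the call would look like write_all(files, 'output_file.csv', hdf5.File, extract_data_from_handle).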