How to fix uploading a CSV file to BigQuery using Python

While uploading a CSV file to BigQuery through Cloud Storage, I am getting the error below:
CSV table encountered too many errors, giving up. Rows: 5; errors: 1. Please look into the error stream for more details.
In the schema, I set every field as STRING.
The CSV file contains the following data:
It's Time. Say "I Do" in my style.
I am not able to upload a CSV file containing the above sentence to BigQuery.

Does the CSV file have the exact same structure as the dataset schema? Both must match for the upload to be successful.
If your CSV file has only one sentence in the first row of the first column, then your schema must have a table with exactly one field of type STRING. If there is content in the second column of the CSV, the schema must have a second field for it, and so on. Conversely, if your schema has, say, two fields set as STRING, there must be data in the first two columns of the CSV.
The data location must also match: if your BigQuery dataset is in the US, then your Cloud Storage bucket must be in the US too for the upload to work.
Check here for details on uploading CSV files into BigQuery.
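For reference, here is a minimal sketch of a Cloud Storage to BigQuery CSV load with the Python client; the bucket URI, table ID, and column name are placeholders, and allow_quoted_newlines is only needed if quoted fields may contain line breaks:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=[bigquery.SchemaField("sentence", "STRING")],  # one STRING column matching the CSV
    quote_character='"',             # default CSV quote character
    allow_quoted_newlines=True,      # tolerate newlines inside quoted fields
)

# Dataset and bucket must be in the same location for the load to work.
load_job = client.load_table_from_uri(
    "gs://my-bucket/sentences.csv",     # placeholder URI
    "my-project.my_dataset.my_table",   # placeholder table ID
    job_config=job_config,
)
load_job.result()  # wait for the job and surface any errors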

Thanks to all for the responses.
Here is my solution to this problem:
with open('/path/to/csv/file', 'r') as f:
    text = f.read()

converted_text = text.replace('"', "'")
print(converted_text)

with open('/path/to/csv/file', 'w') as f:
    f.write(converted_text)

Related

pyspark dataframe is saved in s3 bucket with junk data

While trying to save a PySpark DataFrame as CSV directly to an S3 bucket, the file is getting saved but it contains junk data, and all the file sizes are 1 B.
Please help me find where I am going wrong.
Python code:
df.write.options("header","true").csv("s3a://example/csv")
I also tried this code:
df.coalesce(1).write.format("csv").option("header", "true").option("path", "s3://example/test.csv").save()
But I am not getting a proper CSV in the S3 bucket.
[screenshot: junk data in the CSV file]
I think you are saving your DataFrame as Parquet, which is the default format.
df.write.format("csv") \
    .option("header", "true") \
    .option("encoding", "UTF-8") \
    .option("path", "s3a://example/csv") \
    .save()
Note: the method is also option, not options.
Update
As @Samkart mentioned, you should check whether your encoding is correct. I have updated my answer to include the encoding option. You can check here for the encoding options in PySpark.

How to read JSON files in a directory separately with a for loop and perform a calculation

Update: Sorry, it seems my question wasn't asked properly. I am analyzing a transportation network consisting of more than 5000 links. All the data is included in a big CSV file. I have several JSON files, each of which contains a subset of this network. I am trying to loop through all the JSON files individually (i.e. not trying to concatenate them), read each JSON file, extract the corresponding information from the CSV file, perform a calculation, and save the result along with the name of the file in a new dataframe. Something like this:
This is the code I wrote, but I am not sure if it's efficient enough:
import glob
import json
import os
import pandas as pd

name = []
percent_of_truck = []
path_to_json = r'\\directory'  # placeholder: directory containing the JSON files

z = glob.glob(os.path.join(path_to_json, '*.json'))
for i in z:
    with open(i, 'r') as myfile:
        l = json.load(myfile)
    name.append(i)
    d_2019 = final.loc[final['LINK_ID'].isin(l)]  # retrieve data from main CSV file
    avg_m = (d_2019['AADTT16'] / d_2019['AADT16'] * d_2019['Length']).sum() / d_2019['Length'].sum()  # calculation
    percent_of_truck.append(avg_m)

f = pd.DataFrame()
f['Name'] = name
f['% of truck'] = percent_of_truck
I'm assuming here you just want a dictionary of all the JSON. If so, use the json library (import json); this code may be of use:
import json

def importSomeJSONFile(f):
    return json.load(open(f))

# make sure the file exists in the same directory
example = importSomeJSONFile("example.json")
print(example)

# access a value within it, replacing "name" with whatever key you want
print(example["name"])
Since you haven't added any schema or other specific requirements, you can follow this approach to solve your problem in any language you prefer (a Python sketch follows the list):
1) Get the directory of the JSON files that need to be read
2) Get the list of all files present in the directory
3) For each file name returned in step 2:
- Read the file
- Parse the JSON from the string
- Perform the required calculation
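A minimal sketch of those steps, assuming (as in the question's code) that final is the DataFrame already loaded from the big CSV and that each JSON file holds a list of LINK_ID values; the paths and file names are placeholders:
import glob
import json
import os

import pandas as pd

final = pd.read_csv('network.csv')  # the big CSV from the question (placeholder name)
path_to_json = r'\\directory'       # placeholder: folder containing the JSON files
records = []

for json_path in glob.glob(os.path.join(path_to_json, '*.json')):
    with open(json_path, 'r') as f:
        link_ids = json.load(f)  # subset of link IDs for this file
    subset = final.loc[final['LINK_ID'].isin(link_ids)]
    avg_m = (subset['AADTT16'] / subset['AADT16'] * subset['Length']).sum() / subset['Length'].sum()
    records.append({'Name': os.path.basename(json_path), '% of truck': avg_m})

result = pd.DataFrame(records)
Collecting each row as a dict and building the DataFrame once at the end avoids keeping two parallel lists in sync.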

Parsing deeply nested JSON data present in a .dms file

I am trying to parse deeply nested JSON data which is saved as a .dms file. I saved some transactions from the file as a .json file. When I use the json.load() function to read the .json file, I get the error:
JSONDecodeError: Extra data: line 2 column 1 (char 4392)
Opening the .dms file in a text editor, I copied 3 transactions from it and saved them as a .json file. The transactions in the file are not separated by commas; they are separated by new lines. When I used 1 transaction of it as a .json file and used the json.load() function, it was read successfully. But when I try the JSON file with 3 transactions, it shows the error.
import json

d = json.load(open('t3.json'))

# or

with open('t3.json') as f:
    data = json.load(f)
print(data)
The example transaction is:
{
  "header":{
    "msgType":"SOURCE_EVENT",
  },
  "content":{
    "txntype":"ums",
    "ISSUE":{
      "REQUEST":{
        "messageTime":"2019-06-06 21:54:11.492",
        "Code":"655400",
      },
      "RESPONSE":{
        "Time":"2019-06-06 21:54:11.579",
      }
    },
    "DATA":{
      "UserId":"021",
    },
{header:{.....}}}
{header:{......}}}
This is how my JSON data from an API looks. I wrote it out in a readable way, but in the file it is all written continuously, and whenever a header starts it starts on a new line. The .dms file has 3500 transactions. Two transactions are not even separated by commas; they are separated by new lines. But within a transaction there can be extra spaces in a value, e.g. "company": "Target Chips 123 CA".
The output I need:
I need to make a CSV by extracting the values of the keys messageType, messageTime, and userid from the data for each transaction.
Please help me clear the error and suggest ways to extract the data I need from every transaction and put it in a .csv file, so I can do further analysis and machine learning modeling.
If each object is contained within a single line, then read one line at a time and decode each line separately:
import json

with open(fileName, 'r') as file_to_read:
    for line in file_to_read:
        json_line = json.loads(line)
If objects are spread over multiple lines, then ideally try and fix the source of the data, otherwise use my library jsonfinder. Here is an example answer that may help.
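Building on that, here is a sketch of the CSV extraction requested, assuming each transaction sits on its own line and follows the key paths from the sample above (header.msgType, content.ISSUE.REQUEST.messageTime, content.DATA.UserId); the file names are placeholders:
import csv
import json

with open('transactions.dms', 'r') as src, open('transactions.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    writer.writerow(['msgType', 'messageTime', 'UserId'])
    for line in src:
        line = line.strip()
        if not line:
            continue
        txn = json.loads(line)  # one transaction per line
        writer.writerow([
            txn.get('header', {}).get('msgType'),
            txn.get('content', {}).get('ISSUE', {}).get('REQUEST', {}).get('messageTime'),
            txn.get('content', {}).get('DATA', {}).get('UserId'),
        ])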

JSON format strings in a .TXT file that I want to parse in Python

I am currently extracting Tweets from Twitter using Twitter IDs. The tool I am using comes with the dataset (Twitter IDs) that I have downloaded online and will be using for my masters dissertation. The tool takes the Twitter IDs and extracts the information from the Tweets, storing each Tweet as a JSON string in a .TXT file.
Below is a link to my OneDrive, where I have 2 files:
https://1drv.ms/f/s!At39YLF-U90fhJwCdEuzAc2CGLC_fg
1) Extracted Tweet information, each as a JSON string in a .txt file
2) Extracted Tweet information, each as a JSON string in what I believe is a .json file. The reason I say 'believe' is that the tool I am using automatically creates a file whose name ends in '.json' but in a .TXT format; I have simply renamed the file by removing '.txt' from the end.
Below is code I have written (it is simple but the more I look for alternative code online, the more confused I become):
import pandas as pd
dftest = pd.read_json('test.json', lines=True)
The following error appears when I run the code:
ValueError: Unexpected character found when decoding array value (2)
I have run the first few Tweet arrays through a free online JSON parser and it breaks out the features of the Tweets exactly how I am wishing it to (to my knowledge this confirms the Tweet arrays are in JSON format).
I would be grateful if people could:
1) Confirm the extracted Tweets are in fact in a JSON string format
2) Confirm whether, if the file is automatically saved as 'text.json.txt' and I remove '.txt' from the filename, it becomes a .json file
3) Suggest how to get my very short Python script to work. The ultimate aim is to get the features I want from each Tweet (e.g. "created_at", "text", "hashtags", "location", etc.) into a DataFrame, so I can then save it to a .csv file.
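One possible approach, sketched under the assumption that each line of the .txt file is a complete tweet JSON object: decode the file line by line with json.loads and collect the wanted fields into a DataFrame. The field paths below follow the standard Twitter payload and the file name is a placeholder:
import json

import pandas as pd

rows = []
with open('test.json.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        rows.append({
            'created_at': tweet.get('created_at'),
            'text': tweet.get('text'),
            'hashtags': [h.get('text') for h in tweet.get('entities', {}).get('hashtags', [])],
            'location': tweet.get('user', {}).get('location'),
        })

dftest = pd.DataFrame(rows)
dftest.to_csv('tweets.csv', index=False)
If pd.read_json(..., lines=True) raises ValueError on the same file, that may mean at least one line is not valid standalone JSON; the line-by-line loop makes it easier to pinpoint, since json.loads will raise at the offending line.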

Issue in accessing specific columns of a CSV file read as an S3 object with boto3

I am reading a CSV file from S3 using boto3 and want to access specific columns of that CSV. I have this code where I read the CSV file into an S3 object using boto3, but I am having trouble accessing specific columns out of it:
import boto3

s3 = boto3.resource('s3', aws_access_key_id=keyId, aws_secret_access_key=sKeyId)
obj = s3.Object(bucketName, srcFileName)
filedata = obj.get()["Body"].read()
print(filedata.decode('utf8'))

for row in filedata.decode('utf8'):
    print(row[1])  # Get the column at index 1
When I execute the above, print(filedata.decode('utf8')) prints the following on my output console:
51350612,Gary Scott
10100063,Justin Smith
10100162,Annie Smith
10100175,Lisa Shaw
10100461,Ricardo Taylor
10100874,Ricky Boyd
10103593,Hyman Cordero
But the line print(row[1]) inside the for loop throws IndexError: string index out of range.
How can I remove this error and access specific columns of a CSV file from S3 using boto3?
boto3's obj.get()["Body"].read() retrieves the whole file as a bytes object. Your filedata.decode('utf8') only converts that bytes object into a string; no parsing happens there, so iterating over it yields one character at a time, which is why row[1] fails. Here is a shameless copy from another answer:
import csv

# ...... code snipped .... insert your boto3 code here

# Parse your file correctly
lines = response[u'Body'].read().decode('utf8').splitlines()

# now iterate over those lines
for row in csv.DictReader(lines):
    # here you get a sequence of dicts
    # do whatever you want with each line here
    print(row)
If you just have a simple CSV file, a quick and dirty fix will do:
for row in filedata.decode('utf8').splitlines():
    items = row.split(',')
    print(items[0], items[1])
How do I read a csv stored in S3 with csv.DictReader?
To read from the CSV properly, import the csv Python module and use one of its readers.
Documentation: https://docs.python.org/2/library/csv.html
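Putting the two answers together, here is a sketch of reading the same S3 object with csv.DictReader; bucketName and srcFileName are the placeholders from the question, and fieldnames is supplied only because the sample file has no header row:
import csv
import io

import boto3

s3 = boto3.resource('s3')
obj = s3.Object(bucketName, srcFileName)
body = obj.get()['Body'].read().decode('utf8')

# Wrap the decoded text in StringIO so csv.DictReader can iterate over it line by line
reader = csv.DictReader(io.StringIO(body), fieldnames=['id', 'name'])
for row in reader:
    print(row['name'])  # access a specific column by name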
