I am not able to differentiate the datatypes while profiling a CSV file; every field is given as string only.
I have tried the code below:
import csv
rdd = sc.textFile(file)
header = rdd.first()
rdd = rdd.filter(lambda x: x != header)
rdd1 = rdd.mapPartitions(lambda x: csv.reader(x))
spark_df = rdd1.toDF(header.split(','))
After profiling the CSV file, I am getting all the fields as strings only; they are not identified as numeric or date.
The function textFile() does not support schema inference.
If you are reading from a structured source (such as CSV), use spark.read.csv instead, which supports schema inference.
Your code would be:
df = spark.read.option("header", "true").option("inferSchema", "true").csv(file)
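To verify that the inference worked, you can print the resulting schema (a minimal check; the column name in the comment is just a hypothetical example):
df.printSchema()
# numeric columns should now appear as integer/double instead of string,
# e.g.  |-- amount: double (nullable = true)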
I have a MariaDB database that contains CSVs in the form of BLOB objects. I wanted to read these into pandas, but it appears that each CSV is stored as a text file in its own cell, like this:
Name | Data
csv1 | col1, col2, ...
csv2 | col1, col2, ...
How can I read the cells in the Data column as their own CSVs into a pandas DataFrame?
This is what I have tried:
import pandas as pd
from io import StringIO

raw = pd.read_sql_query(query, engine)
cell_as_string = raw.to_string(index=False)
converted_string = StringIO(cell_as_string)
rawdf = pd.read_csv(converted_string, sep=',')
rawdf
However, rawdf is just the string with spaces, not a dataframe.
How can I ... read the cells ... into a pandas dataframe

Why is this even interesting? It appears you already have the answer. You are able to SELECT each item, open a file for writing, transfer the data, and then ask .read_csv for a DataFrame. But perhaps the requirement was to avoid spurious disk I/O.

OK. The read_csv function accepts a file-like input, and several libraries offer such data objects. If the original question were reproducible, it would include code that started like this:
import pandas as pd
from io import BytesIO, StringIO

default = "n,square\n2,4\n3,9"
blob = do_query() or default.encode("utf-8")  # do_query() stands in for the OP's unshown database read
assert isinstance(blob, bytes)
Then with a binary BLOB in hand it's just a matter of:
f = StringIO(blob.decode("utf-8"))
df = pd.read_csv(f)
print(df.set_index("n"))
Sticking with bytes we might prefer the equivalent:
f = BytesIO(blob)
df = pd.read_csv(f)
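To cover the whole Data column from the query result, each BLOB can be parsed into its own DataFrame. This is a minimal sketch assuming the column names Name and Data from the table above, UTF-8 encoded CSV text in the BLOBs, and the query/engine objects from the question:
import pandas as pd
from io import BytesIO

raw = pd.read_sql_query(query, engine)
dfs = {
    row["Name"]: pd.read_csv(BytesIO(row["Data"]))  # one DataFrame per stored CSV
    for _, row in raw.iterrows()
}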
So I'm working on taking a txt file and converting it into a csv data table.
I have managed to convert the data into a csv file and put it into a table, but I have a problem with extracting the numbers. In the data table that I made, it's giving me text as well as the value (intensity = 12345).
How do I only put the numerical values into the table?
I tried using regular expressions, but I couldn't get it to work. I would also like to delete all the lines that contain saturated, fragmented and merged. I initially wrote code that deleted every uneven line, but this code will be used for several files, and the odd lines in other files might contain different data. How would I go about doing that?
This is the code that I currently have, plus a picture of what the output looks like.
import pandas as pd
parameters = pd.read_csv("ScanHeader1.txt", header=None)
parameters.columns = ['Packet Number', 'Intensity','Mass/Position']
parameters.to_csv('ScanHeader1.csv', index=None)
df = pd.read_csv('ScanHeader1.csv')
print(df)
I would really appreciate some tips or pointers on how I can do this. Thanks :)
You can try this:
def fun_eq(x):
    # keep only the value after ' = ', e.g. 'intensity = 12345' -> '12345'
    x = x.split(' = ')
    return x[1]

def fun_hash(x):
    # keep only the value after ' # '
    x = x.split(' # ')
    return x[1]

df = df.iloc[::2]  # keep every other row
df['Intensity'] = df['Intensity'].apply(fun_eq)
df['Mass/Position'] = df['Mass/Position'].apply(fun_eq)
df['Packet Number'] = df['Packet Number'].apply(fun_hash)
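If dropping every other row is too fragile across files, a keyword-based filter may be more robust. This is a sketch under the assumption that the words saturated, fragmented and merged appear somewhere in the row text:
# drop rows whose text contains any of the unwanted keywords
mask = df.apply(
    lambda row: row.astype(str).str.contains('saturated|fragmented|merged', case=False).any(),
    axis=1
)
df = df[~mask]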
I am trying to read an xlsx file into a pandas DataFrame using the following code:
df = pd.read_excel(file_url, dtype=str, engine='openpyxl', index_col='SNo')
However, I am getting the following error
from_ISO8601\n dt = datetime.time(parts['hour'], parts['minute'], parts['second'])\nKeyError: 'hour'\n"
Two columns in my file are date values (column names: manufactured, expired). It seems like there's some problem with them. When I removed all the values in these two columns, it worked. But with them, it does not.
I don't really care about reading them as DateTime values; I just want to read them as strings. But setting dtype to str, or setting converters as shown below, did not work:
df = pd.read_excel(file_url, dtype=str, engine='openpyxl', index_col='SNo', converters = {'manufactured': lambda x: str(x) , 'expired': lambda x: str(x)})
I also tried dtype="string" and dtype={'manufactured':'string', 'expired':'string'}, but none of them worked.
Update:
My Excel file has a few empty rows at the end. When I removed these and exported it to a new Excel file, it worked. Why is this happening? How do I load the file without such manual intervention?
I have a nested JSON dict that I need to convert to spark dataframe. This JSON dict is present in a dataframe column. I have been trying to parse the dict present in dataframe column using "from_json" and "get_json_object", but have been unable to read the data. Here's the smallest snippet of the source data that I've been trying to read:
{"value": "\u0000\u0000\u0000\u0000/{\"context\":\"data\"}"}
I need to extract the nested dict value. I used the code below to clean the data and read it into a dataframe:
from pyspark.sql.functions import *
from pyspark.sql.types import *
input_path = '/FileStore/tables/enrl/context2.json'  # path for the above file
schema1 = StructType([StructField("context",StringType(),True)]) #Schema I'm providing
raw_df = spark.read.json(input_path)
cleansed_df = raw_df.withColumn("cleansed_value",regexp_replace(raw_df.value,'/','')).select('cleansed_value') #Removed extra '/' in the data
cleansed_df.select(from_json('cleansed_value',schema=schema1)).show(1, truncate=False)
I get a null dataframe each time I run the above code. Please help.
I tried the below and it didn't work:
PySpark: Read nested JSON from a String Type Column and create columns
I also tried writing it to a JSON file and reading that back; it didn't work either:
reading a nested JSON file in pyspark
The null chars \u0000 affect the parsing of the JSON. You can replace them as well:
from pyspark.sql import functions as F

df = spark.read.json('path')
df2 = df.withColumn(
    'cleansed_value',
    F.regexp_replace('value', '[\u0000/]', '')  # strip the null chars and the leading '/'
).withColumn(
    'parsed',
    F.from_json('cleansed_value', 'context string')
)
df2.show(20, 0)
+-----------------------+------------------+------+
|value                  |cleansed_value    |parsed|
+-----------------------+------------------+------+
|    /{"context":"data"}|{"context":"data"}|[data]|
+-----------------------+------------------+------+
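As a follow-up sketch, once the JSON is parsed into a struct you can pull the nested value out as a regular column:
df2.select(F.col('parsed.context').alias('context')).show()
# +-------+
# |context|
# +-------+
# |   data|
# +-------+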
I have a resulting RDD labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions). This has output in this format:
[(0.0, 0.08482142857142858), (0.0, 0.11442786069651742),.....]
What I want is to create a CSV file with one column for the labels (the first part of each tuple in the output above) and one for the predictions (the second part). But I don't know how to write to a CSV file in Spark using Python.
How can I create a CSV file with the above output?
Just map the lines of the RDD (labelsAndPredictions) into strings (the lines of the CSV) then use rdd.saveAsTextFile().
def toCSVLine(data):
    return ','.join(str(d) for d in data)

lines = labelsAndPredictions.map(toCSVLine)
lines.saveAsTextFile('hdfs://my-node:9000/tmp/labels-and-predictions.csv')
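Note that saveAsTextFile writes a directory of part files rather than a single CSV. If a single file is needed, one option (a sketch; it funnels all data through one partition) is to coalesce first:
# produces a single part file inside the output directory
lines.coalesce(1).saveAsTextFile('hdfs://my-node:9000/tmp/labels-and-predictions.csv')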
I know this is an old post, but to help someone searching for the same, here's how I write a two-column RDD to a single CSV file in PySpark 1.6.2.
The RDD:
>>> rdd.take(5)
[(73342, u'cells'), (62861, u'cell'), (61714, u'studies'), (61377, u'aim'), (60168, u'clinical')]
Now the code:
# First I convert the RDD to a DataFrame
# (sqlContext is already available in the PySpark shell)
df = sqlContext.createDataFrame(rdd, ['count', 'word'])
The DF:
>>> df.show()
+-----+-----------+
|count| word|
+-----+-----------+
|73342| cells|
|62861| cell|
|61714| studies|
|61377| aim|
|60168| clinical|
|59275| 2|
|59221| 1|
|58274| data|
|58087|development|
|56579| cancer|
|50243| disease|
|49817| provided|
|49216| specific|
|48857| health|
|48536| study|
|47827| project|
|45573|description|
|45455| applicant|
|44739| program|
|44522| patients|
+-----+-----------+
only showing top 20 rows
Now write to CSV
# Write CSV (I have HDFS storage)
df.coalesce(1).write.format('com.databricks.spark.csv').options(header='true').save('file:///home/username/csv_out')
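On Spark 2.x and later the external Databricks package is no longer needed; a roughly equivalent call with the built-in CSV writer (output path is just illustrative) would be:
df.coalesce(1).write.option('header', 'true').csv('file:///home/username/csv_out')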
P.S.: I am just a beginner learning from posts here on Stack Overflow, so I don't know whether this is the best way. But it worked for me, and I hope it helps someone!
It's not good to just join by commas because if fields contain commas, they won't be properly quoted, e.g. ','.join(['a', 'b', '1,2,3', 'c']) gives you a,b,1,2,3,c when you'd want a,b,"1,2,3",c. Instead, you should use Python's csv module to convert each list in the RDD to a properly-formatted csv string:
# python 3
import csv, io

def list_to_csv_str(x):
    """Given a list of strings, returns a properly-csv-formatted string."""
    output = io.StringIO("")
    csv.writer(output).writerow(x)
    return output.getvalue().strip()  # remove extra newline

# ... do stuff with your rdd ...
rdd = rdd.map(list_to_csv_str)
rdd.saveAsTextFile("output_directory")
Since the csv module only writes to file objects, we have to create an empty "file" with io.StringIO("") and tell the csv.writer to write the csv-formatted string into it. Then, we use output.getvalue() to get the string we just wrote to the "file". To make this code work with Python 2, just replace io with the StringIO module.
If you're using the Spark DataFrames API, you can also look into the DataBricks save function, which has a csv format.
def toCSV(row):
    # join the fields of one RDD record into a single comma-separated line
    return ','.join(str(field) for field in row)

rows_of_csv = RDD.map(toCSV)
rows_of_csv.saveAsTextFile('/FileStore/tables/name_of_csv_file.csv')
# choose your path based on your distributed file system