I am new to PySpark and I want to convert a txt file into a DataFrame in PySpark. I am trying to make tidy data in PySpark. Any help? Thanks
I've already tried converting it to an RDD and then into a dataframe, but that is not working for me, so I decided to convert the txt file directly into a dataframe.
I was trying this, but it has not worked yet.
# read input text file to RDD
lines = sc.textFile("/home/h110-3/workspace/spark/weather01.txt")
# collect the RDD to a list
llist = lines.collect()
# print the list
for line in llist:
    print(line)
I have not been able to convert it into a DataFrame. Help please.
You can do it via the text reader ... example here:
! cat sample.txt
hello there
loading line by line
via apache spark
text df api
print(spark.version)
df = spark.read.text("sample.txt")
df.printSchema()
df.show()
df.selectExpr("split(value, ' ') as rows").show(3, False)
2.4.3
root
|-- value: string (nullable = true)
+--------------------+
| value|
+--------------------+
| hello there|
|loading line by line|
| via apache spark|
| text df api|
+--------------------+
+-------------------------+
|rows |
+-------------------------+
|[hello, there] |
|[loading, line, by, line]|
|[via, apache, spark] |
+-------------------------+
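If the goal is tidy data with one column per field, the split array can be pulled apart into named columns; a minimal sketch, assuming each line has a fixed number of space-separated fields (the column names here are illustrative):
from pyspark.sql import functions as f

# Minimal sketch: split each line and pull the pieces out into named columns.
# Assumes every line has at least two space-separated fields; adjust the
# indices and names to match the actual file.
df = spark.read.text("sample.txt")
tokens = f.split(f.col("value"), " ")
tidy = df.select(tokens.getItem(0).alias("field_1"),
                 tokens.getItem(1).alias("field_2"))
tidy.show()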
Related
I am loading a dataframe of tweets as JSON objects in PySpark.
I am trying to split the text into individual words, and then select all the words that include a #. I want to avoid using regular Python functions and try to stick with what is available inside PySpark.
I am running the code in a Jupyter notebook, but this is the code overall.
import findspark
findspark.init()
from pyspark.sql import SQLContext, SparkSession
spark = SparkSession \
    .builder \
    .appName("Jupyter Spark shell") \
    .getOrCreate()
sc = spark.sparkContext
folder = 'tweet-id-text-345'
tweets = spark.read.format("json").option("delimeter", "\t").load(folder)
tweets.count()
I am very unsure how to do this. The desired result would be to get an array of all the different words, and another one of only the words that include a #. These would be two separate lists.
Here is what the content looks like
+------------------------+
| text|
+------------------------+
| โปรทุนน้อย สุดประ...|
| RT #sOLehOXClj1XE...|
|RT #rkayama: 論文「関...|
| SixTONES OneSTのグッ...|
| मुख्यमंत्री #mlkh...|
+------------------------+
only showing top 5 rows
Assuming that the JSON follows this format, the text of the tweet is stored in a field called text.
The text column is split into single words and the resulting array is filtered using rlike:
from pyspark.sql import functions as f
df=spark.read.option("multiline", "true").json(<...>).select("text")
df.withColumn("all_words", f.split("text", " "))\
.withColumn("only_hash", f.expr("filter(all_words, w -> rlike(w, '.*#.*'))")) \
.show(truncate=False)
If the original text was hello #world how a#re you today# the output would be
+--------------------------------+---------------------------------------+----------------------+
|text |all_words |only_hash |
+--------------------------------+---------------------------------------+----------------------+
|hello #world how a#re you today#|[hello, #world, how, a#re, you, today#]|[#world, a#re, today#]|
+--------------------------------+---------------------------------------+----------------------+
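If the end goal is two flat Python lists rather than per-row arrays, the word arrays can be exploded and collected to the driver; a minimal sketch, assuming the data is small enough to collect:
# Minimal sketch, assuming `df` is the dataframe above and the result
# fits in the driver's memory.
words_df = df.withColumn("all_words", f.split("text", " "))
all_words = [r[0] for r in words_df.select(f.explode("all_words")).collect()]
hash_words = [w for w in all_words if "#" in w]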
In Spark, with PySpark, I want to create one dataframe from a path that is actually a folder in S3 and contains multiple csv files with some common columns and some different columns.
To put it more simply, I want a single dataframe from multiple csv files with different headers.
One file can have the header "raw_id, title, civility", and another file the header "raw_id, first_name, civility".
This is my code in Python 3:
df = spark.read.load(
    s3_bucket + 'data/contacts/normalized' + '/*/*/*/*',
    format = 'csv',
    delimiter = '|',
    encoding = 'utf-8',
    header = 'true',
    quote = ''
)
This is an example of file_1.csv :
|raw_id|title|civility|
|1 |M |male |
And an example of file2.csv :
|raw_id|first_name|civility|
|2 |Tom |male |
The result I expect in my dataframe is:
|raw_id|first_name|title|civility|
|1 | |M |male |
|2 |Tom | |male |
But what is happening is that I get all the columns united, yet the data is not in the right place after the first file.
Do you know how to do this?
Thank you very much in advance.
You need to load each of them into a different dataframe and join them together on the raw_id column.
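A minimal sketch of that approach; the file names are illustrative, and since civility appears in both files one copy is renamed and the values coalesced:
from pyspark.sql import functions as f

# Minimal sketch: read each header layout into its own dataframe, then join
# on raw_id. The file names below are illustrative.
df1 = spark.read.csv(s3_bucket + 'data/contacts/normalized/file_1.csv',
                     sep='|', header=True)
df2 = spark.read.csv(s3_bucket + 'data/contacts/normalized/file2.csv',
                     sep='|', header=True) \
           .withColumnRenamed('civility', 'civility_2')

result = df1.join(df2, on='raw_id', how='full_outer') \
            .withColumn('civility', f.coalesce('civility', 'civility_2')) \
            .drop('civility_2')
result.show()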
I have a script that collates sets of tags from other dataframes, converts them into a comma-separated string and adds all of this to a new dataframe. If I use pd.read_csv to generate the dataframe, the first entry is what I expect it to be. However, if I use the df_empty script (below), then I get a copy of the headers in that first row instead of the data I want. The only difference I have made is generating a new dataframe instead of loading one.
The resultData = pd.read_csv() reads a .csv file with the following headers and no additional information:
Sheet, Cause, Initiator, Group, Effects
The df_empty script is as follows:
def df_empty(columns, dtypes, index=None):
    assert len(columns) == len(dtypes)
    df = pd.DataFrame(index=index)
    for c, d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df
# https://stackoverflow.com/a/48374031
# Usage: df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
My script contains the following line to create the dataframe:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],[np.str,np.int64,np.str,np.str,np.str])
I've also used the following with no differences:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],['object','int64','object','object','object'])
My script to collate the data and add it to my dataframe is as follows:
data = {'Sheet': sheetNum, 'Cause': causeNum, 'Initiator': initTag, 'Group': grp, 'Effects': effectStr}
count = len(resultData)
resultData.at[count,:] = data
When I run display(data), I get the following in Jupyter:
{'Sheet': '0001',
'Cause': 1,
'Initiator': 'Tag_I1',
'Group': 'DIG',
'Effects': 'Tag_O1, Tag_O2,...'}
What I want to see with both options / what I get when reading the csv:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| 0001 | 1 | Tag_I1 | DIG | Tag_O1, Tag_O2,... |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
What I see when generating a dataframe with df_empty:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
Any ideas on what might be causing the generated dataframe to copy my headers into the first row, and whether it is possible for me to not have to read an otherwise empty csv?
Thanks!
Why? Because you've inserted the first row as data. The magic behaviour of using the first row as the header lives in read_csv(); if you create your dataframe without read_csv, the first row is not treated specially.
Solution? Skip the first row when inserting into the data frame generated by df_empty.
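A minimal sketch of that fix; `collated_rows` is a hypothetical name for the iterable the rows come from:
import pandas as pd

# Minimal sketch; `collated_rows` stands for the iterable of row dicts being
# collated, whose first element holds the header values (as described above).
resultData = df_empty(['Sheet', 'Cause', 'Initiator', 'Group', 'Effects'],
                      ['object', 'int64', 'object', 'object', 'object'])

for i, data in enumerate(collated_rows):
    if i == 0:
        continue  # skip the header row; df_empty already defines the columns
    resultData.loc[len(resultData)] = pd.Series(data)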
I've created a dataframe as:
ratings = imdb_data.sort('imdbRating').select('imdbRating').filter('imdbRating is NOT NULL')
Upon doing ratings.show() as shown below, I can see that
the imdbRating field has a mix of data such as random strings, movie titles, movie urls and actual ratings. So the dirty data looks like this:
+--------------------+
| imdbRating|
+--------------------+
|Mary (TV Episode...|
| Paranormal Activ...|
| Sons (TV Episode...|
| Spion (2011)|
| Winter... und Fr...|
| and Gays (TV Epi...|
| grAs - Die Serie...|
| hat die Wahl (2000)|
| 1.0|
| 1.3|
| 1.4|
| 1.5|
| 1.5|
| 1.5|
| 1.6|
| 1.6|
| 1.7|
| 1.9|
| 1.9|
| 1.9|
+--------------------+
only showing top 20 rows
Is there any way I can filter out the unwanted strings and just get the ratings? I tried using a UDF as:
ratings_udf = udf(lambda imdbRating: imdbRating if isinstance(imdbRating, float) else None)
and tried calling it as:
ratings = imdb_data.sort('imdbRating').select('imdbRating')
filtered = ratings.withColumn('imdbRating', ratings_udf(ratings.imdbRating))
The problem with the above is that, since the udf is called on each row, each row of the dataframe is mapped to a Row type and hence None is returned for all the values.
Is there any straightforward way to filter out that data?
Any help will be much appreciated. Thank you
Finally, I was able to resolve it. The problem was that there was some corrupt data with not all fields present. First, I tried using pandas by reading the csv file in pandas as:
pd_frame = pd.read_csv('imdb.csv', error_bad_lines=False)
This skipped/dropped the corrupt rows which had fewer columns than the rest. I then tried to read the above pandas dataframe, pd_frame, into Spark using:
imdb_data= spark.createDataFrame(pd_frame)
but got an error because of a mismatch while inferring the schema. It turns out the Spark csv reader has something similar which drops the corrupt rows:
imdb_data = spark.read.csv('imdb.csv', header='true', mode='DROPMALFORMED')
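Casting the column is another way to drop the non-numeric rows directly on the dataframe; a minimal sketch of that alternative (not what was used above, just an option):
from pyspark.sql import functions as f

# Minimal sketch of an alternative: casting a non-numeric string to float
# yields null, so those rows are dropped by the filter.
ratings = imdb_data.select(f.col('imdbRating').cast('float').alias('imdbRating')) \
                   .filter(f.col('imdbRating').isNotNull()) \
                   .sort('imdbRating')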
My dataframe output is as below,
DF.show(2)
+--------------+
|col1|col2|col3|
+--------------+
| 10| 20| 30|
| 11| 21| 31|
+--------------+
After saving it as a textfile with DF.rdd.saveAsTextFile("path"), the output looks like this:
Row(col1=u'10', col2=u'20', col3=u'30')
Row(col1=u'11', col2=u'21', col3=u'31')
The dataframe has millions of rows and 20 columns. How can I save it as a textfile as below, i.e., without column names and the Python unicode prefixes?
10|20|30
11|21|31
While creating the initial RDD I used the code below to remove the unicode prefixes, though I am still getting them:
data = sc.textFile("file.txt")
trans = data.map(lambda x: x.encode("ascii", "ignore").split("|"))
Thanks in advance !
I think you can just do
DF.rdd.map(lambda l: l[0] + '|' + l[1] + '|' + l[2]).saveAsTextFile(...)
In spark 2.0 you can write dataframes out directly to csv, which is all I think you need here. See: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
So in your case, you could just do something like
df.write.option("sep", "|").option("header", "false").csv("some/path/")
There is a Databricks plugin that provides this functionality in Spark 1.x:
https://github.com/databricks/spark-csv
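A minimal usage sketch with that package, assuming it has been added to the job (for example via --packages com.databricks:spark-csv_2.10:1.5.0):
# Minimal sketch for the spark-csv package on Spark 1.x; assumes the package
# is on the classpath (e.g. added with --packages).
df.write \
  .format("com.databricks.spark.csv") \
  .option("delimiter", "|") \
  .option("header", "false") \
  .save("some/path/")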
As far as converting your unicode strings to ascii, see this question: Convert a Unicode string to a string in Python (containing extra symbols)