In Spark with PySpark, I want to create one dataframe from a path that is actually a folder in S3 containing multiple CSV files with some common columns and some different columns.
To put it more simply, I want a single dataframe built from multiple CSV files with different headers.
For example, one file can have the header "raw_id, title, civility" and another file the header "raw_id, first_name, civility".
This is my code in Python 3:
df = spark.read.load(
    s3_bucket + 'data/contacts/normalized' + '/*/*/*/*',
    format='csv',
    delimiter='|',
    encoding='utf-8',
    header='true',
    quote=''
)
This is an example of file_1.csv:
|raw_id|title|civility|
|1 |M |male |
And an example of file2.csv:
|raw_id|first_name|civility|
|2 |Tom |male |
The result I expect in my dataframe is:
|raw_id|first_name|title|civility|
|1 | |M |male |
|2 |Tom | |male |
But what actually happens is that I get the union of all the columns, yet the data is not in the right place after the first file.
Do you know how to do this?
Thank you very much in advance.
You need to load each file into a separate dataframe and join them together on the raw_id column.
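A minimal sketch of that approach, assuming the spark session and s3_bucket from the question, with two explicit (hypothetical) file paths standing in for the real S3 layout:

# Read each file into its own dataframe (the paths here are placeholders).
df1 = spark.read.csv(s3_bucket + 'data/contacts/normalized/file_1.csv',
                     sep='|', header=True, encoding='utf-8')
df2 = spark.read.csv(s3_bucket + 'data/contacts/normalized/file2.csv',
                     sep='|', header=True, encoding='utf-8')

# Outer join on all the columns the two files share (raw_id plus civility here),
# so rows present in only one file are kept and shared columns are not duplicated.
merged = df1.join(df2, on=['raw_id', 'civility'], how='outer')
merged.show()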
I have a list of data which I want to write into a specific column (column B, starting at cell B2) in Excel.
Input example:
mydata =[12,13,14,15]
Desired Output in Excel:
A2| B2|
| 12|
| 13|
| 14|
| 15|
I have tried using openpyxl to access the specific sheet (which works fine) and the specific cell (B2), but it throws an error when writing to the Excel file because the value is a list. It works fine if I assign a single value, as shown in the code extract below:
mydata= my_wb['sheet2']['B2'] = 4
Can anyone point me in the right direction?
Iterate over the list and paste each value into the desired row in column B:
for i, n in enumerate(mydata):
    my_wb["sheet2"].cell(row=i + 2, column=2).value = n
I have many thousands of huge files in a folder.
Each file has two header rows and a trailer row.
file1
H|*|F|*|TYPE|*|EXTRACT|*|Stage_|*|2021.04.18 07:35:26|##|
H|*|TYP_ID|*|TYP_DESC|*|UPD_USR|*|UPD_TSTMP|##|
E|*||*|CONNECTOR|*|2012.06.01 09:03:11|##|
H|*|Tracking|*|asdasasd|*|2011.03.04 11:50:51|##|
S|*|Tracking|*|asdasdas|*|2011.03.04 11:51:06|##|
T|*|3|*|2021.04.18 07:35:43|##|
file 2
H|*|F|*|PA__STAT|*|EXTRACT|*|Folder|*|2021.04.18 07:35:26|##|
H|*|STAT_ID|*|STAT_DESC|*|UPD_USR|*|UPD_TSTMP|##|
A|*|Active / Actif|*|1604872|*|2018.06.25 15:12:35|##|
D|*||*|CONNECTOR|*|2012.04.06 10:49:09|##|
I|*|Intermittent Leave|*|asdasda|*|2021.04.09 13:14:00|##|
L|*|On Leave|*|asdasasd|*|2011.03.04 11:49:40|##|
P|*|Paid Leave|*|asdasd|*|2011.03.04 11:49:56|##|
T|*|Terminated / Terminé|*|1604872|*|2018.06.25 15:13:06|##|
U|*||*|CONNECTOR|*|2012.06.16 09:04:14|##|
T|*|7|*|2021.04.18 07:35:55|##|
file3
H|*|K|*|PA_CPN|*|EXTRACT|*|SuccessFactors|*|2021.04.22 23:09:26|##|
H|*|COL_NUM|*|CPNT_TYP_ID|*|CPNT_ID|*|REV_DTE|##|
40|*|OLL|*|asdasdas|*|2019.01.21 14:07:00|##|
40|*|OLL|*|asdasda|*|2019.01.21 14:18:00|##|
40|*|OLL|*|asdasdas|*|2019.01.21 14:20:00|##|
T|*|3|*|2021.04.22 23:27:17|##|
I am applying a filter on lines starting with H|*| and T|*|, but it is rejecting the data for a few rows.
df_cleanse=spark.sql("select replace(replace(replace(value,'~','-'),'|*|','~'),'|##|','') as value from linenumber3 where value not like 'T|*|%' and value not like 'H|*|%'")
I know we can use zipWithIndex, but then I have to read the files one by one, apply the zip index, and then filter the rows.
for file in files:   # 'files' stands for the list of input files
    df = spark.read.text(file)
    # Adding an index column: each row gets its row number. Spark distributes
    # the data, so we need this to maintain the order of the rows in the file.
    df_1 = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
    df_1.createOrReplaceTempView("linenumber")
    spark.sql("select * from linenumber where index > 1 and value.value not like 'T|*|%'")
Please let me know the optimal solution for this. I do not want to run an extensive program; all I need is to remove three lines per file. Even a regex to remove the rows is fine, since we need to process TBs of files in this format.
Unix commands and sed are ruled out due to the file sizes.
In the meantime, try this to remove the first two lines and the last one:
from pyspark.sql.window import Window
import pyspark.sql.functions as f

df = spark.read.csv('your_path', schema='value string')
df = df.withColumn('filename', f.input_file_name())   # remember which file each line came from
df = df.repartition('filename')
df = df.withColumn('index', f.monotonically_increasing_id())

# Flag the last line (max index) and the first two lines (index below min + 2) of each file.
w = Window.partitionBy('filename')
df = (df
      .withColumn('remove', (f.col('index') == f.max('index').over(w)) |
                            (f.col('index') < f.min('index').over(w) + f.lit(2)))
      .where(~f.col('remove'))
      .select('value'))

df.show(truncate=False)
Output
+-------------------------------------------------------------+
|value |
+-------------------------------------------------------------+
|E|*||*|CONNECTOR|*|2012.06.01 09:03:11|##| |
|H|*|Tracking|*|asdasasd|*|2011.03.04 11:50:51|##| |
|S|*|Tracking|*|asdasdas|*|2011.03.04 11:51:06|##| |
|A|*|Active / Actif|*|1604872|*|2018.06.25 15:12:35|##| |
|D|*||*|CONNECTOR|*|2012.04.06 10:49:09|##| |
|I|*|Intermittent Leave|*|asdasda|*|2021.04.09 13:14:00|##| |
|L|*|On Leave|*|asdasasd|*|2011.03.04 11:49:40|##| |
|P|*|Paid Leave|*|asdasd|*|2011.03.04 11:49:56|##| |
|T|*|Terminated / Terminé|*|1604872|*|2018.06.25 15:13:06|##||
|U|*||*|CONNECTOR|*|2012.06.16 09:04:14|##| |
|40|*|OLL|*|asdasdas|*|2019.01.21 14:07:00|##| |
|40|*|OLL|*|asdasda|*|2019.01.21 14:18:00|##| |
|40|*|OLL|*|asdasdas|*|2019.01.21 14:20:00|##| |
+-------------------------------------------------------------+
I have a script that collates sets of tags from other dataframes, converts them into a comma-separated string, and adds all of this to a new dataframe. If I use pd.read_csv to generate the dataframe, the first entry is what I expect it to be. However, if I use the df_empty script (below), then I get a copy of the headers in that first row instead of the data I want. The only difference I have made is generating a new dataframe instead of loading one.
The resultData = pd.read_csv() line reads a .csv file with the following headers and no additional information:
Sheet, Cause, Initiator, Group, Effects
The df_empty script is as follows:
def df_empty(columns, dtypes, index=None):
    assert len(columns) == len(dtypes)
    df = pd.DataFrame(index=index)
    for c, d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df
# https://stackoverflow.com/a/48374031
# Usage: df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
My script contains the following line to create the dataframe:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],[np.str,np.int64,np.str,np.str,np.str])
I've also used the following with no differences:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],['object','int64','object','object','object'])
My script to collate the data and add it to my dataframe is as follows:
data = {'Sheet': sheetNum, 'Cause': causeNum, 'Initiator': initTag, 'Group': grp, 'Effects': effectStr}
count = len(resultData)
resultData.at[count,:] = data
When I run display(data), I get the following in Jupyter:
{'Sheet': '0001',
'Cause': 1,
'Initiator': 'Tag_I1',
'Group': 'DIG',
'Effects': 'Tag_O1, Tag_O2,...'}
What I want to see with both options / what I get when reading the csv:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| 0001 | 1 | Tag_I1 | DIG | Tag_O1, Tag_O2,... |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
What I see when generating a dataframe with df_empty:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
Any ideas on what might be causing the generated dataframe to copy my headers into the first row, and whether it is possible for me to avoid having to read an otherwise empty csv?
Thanks!
Why? Because you've inserted the first row as data. The magic behaviour of using the first row as the header lives in read_csv(); if you create your dataframe without read_csv, the first row is not treated specially.
Solution? Skip the first row when inserting into the dataframe generated by df_empty.
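A rough sketch of that fix, reusing the insert line from the question; collated_rows is a hypothetical stand-in for whatever iterable the script walks while collating tags, whose first entry is the header-like row:

for i, data in enumerate(collated_rows):
    if i == 0:
        # The first collated entry repeats the column names, so skip it.
        continue
    resultData.at[len(resultData), :] = data   # same insert as in the question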
I've created a dataframe as:
ratings = imdb_data.sort('imdbRating').select('imdbRating').filter('imdbRating is NOT NULL')
Upon doing ratings.show() as shown below, I can see that
the imdbRating field has mixed data such as random strings, movie titles, movie URLs, and actual ratings. So the dirty data looks like this:
+--------------------+
| imdbRating|
+--------------------+
|Mary (TV Episode...|
| Paranormal Activ...|
| Sons (TV Episode...|
| Spion (2011)|
| Winter... und Fr...|
| and Gays (TV Epi...|
| grAs - Die Serie...|
| hat die Wahl (2000)|
| 1.0|
| 1.3|
| 1.4|
| 1.5|
| 1.5|
| 1.5|
| 1.6|
| 1.6|
| 1.7|
| 1.9|
| 1.9|
| 1.9|
+--------------------+
only showing top 20 rows
Is there any way I can filter out the unwanted strings and just get the ratings? I tried using a UDF:
ratings_udf = udf(lambda imdbRating: imdbRating if isinstance(imdbRating, float) else None)
and tried calling it as:
ratings = imdb_data.sort('imdbRating').select('imdbRating')
filtered = ratings.withColumn('imdbRating', ratings_udf(ratings.imdbRating))
The problem with the above is that, since the UDF is called on each row, each row of the dataframe is mapped to a Row type, and hence None is returned for all the values.
Is there any straightforward way to filter out that data?
Any help will be much appreciated. Thank you.
Finally, I was able to resolve it. The problem was that there was some corrupt data with not all fields present. First, I tried using pandas, reading the csv file as:
pd_frame = pd.read_csv('imdb.csv', error_bad_lines=False)
This skipped/dropped the corrupt rows which had fewer columns than expected. I then tried to load the above pandas dataframe, pd_frame, into Spark using:
imdb_data= spark.createDataFrame(pd_frame)
but got an error because of a mismatch while inferring the schema. It turns out the Spark CSV reader has something similar that drops the corrupt rows:
imdb_data = spark.read.csv('imdb.csv', header='true', mode='DROPMALFORMED')
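Putting the pieces together, a sketch using the column name from the question:

imdb_data = spark.read.csv('imdb.csv', header='true', mode='DROPMALFORMED')
ratings = (imdb_data
           .filter('imdbRating is NOT NULL')
           .select('imdbRating')
           .sort('imdbRating'))
ratings.show()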
I have constructed a Spark dataframe from a query. What I wish to do is print the dataframe to a text file with all information delimited by '|', like the following:
+-------+----+----+----+
|Summary|col1|col2|col3|
+-------+----+----+----+
|row1 |1 |14 |17 |
|row2 |3 |12 |2343|
+-------+----+----+----+
How can I do this?
You can try writing to CSV, choosing | as the delimiter:
df.write.option("sep","|").option("header","true").csv(filename)
This would not be 100% the same but would be close.
Alternatively, you can collect to the driver and format it yourself, e.g.:
myprint(df.collect())
or
myprint(df.take(100))
df.collect and df.take return a list of rows.
Lastly, you can collect to the driver using toPandas() and use pandas tools.
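For the "do it yourself" option, one possible hand-rolled version could look like this (the output path is a placeholder):

rows = df.collect()                      # list of Row objects on the driver
with open('output.txt', 'w') as fh:      # placeholder local path
    fh.write('|'.join(df.columns) + '\n')
    for row in rows:
        fh.write('|'.join(str(v) for v in row) + '\n')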
In Spark 2.0+, you can use the built-in CSV writer. The delimiter is , by default, and you can set it to |:
df.write \
    .format('csv') \
    .options(delimiter='|') \
    .save('target/location')
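Note that either way Spark writes a directory of part files rather than a single text file. If one file is required and the data comfortably fits in a single partition, a common pattern (not part of the original answer) is to coalesce before writing:

df.coalesce(1) \
    .write \
    .option('sep', '|') \
    .option('header', 'true') \
    .csv('target/location')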