Processing .txt file using wholeTextFiles & wanting to extract filename - python

I am reading .txt files using wholeTextFiles() in Python Spark. I know that after reading with wholeTextFiles(), the resulting RDD is of the format (filepath, content). I have multiple files to read. I want to cut the file name out of the filepath, save it to a Spark dataframe, and use part of the filename as a date folder in the HDFS output location. But while saving, I am not getting the corresponding filenames. Is there any way to do this? Below is my code:
base_data = sc.wholeTextFiles("/user/nikhil/raw_data/")
data1 = base_data.map(lambda x: x[0]).flatMap(lambda x: x.split('/')).filter(lambda x: x.startswith('CH'))
data2 = data1.flatMap(lambda x: x.split('F_')).filter(lambda x: x.startswith('2'))
print(data1.collect())
print(data2.collect())
df.repartition(1).write.mode('overwrite').parquet(outputLoc + "/xxxxx/" + data2)
logdf = sqlContext.createDataFrame(
    [(data1, pstrt_time, pend_time, 'DeltaLoad Completed')],
    ["filename", "process_start_time", "process_end_time", "status"])
Output:
data1: ['CHNC_P0BCDNAF_20200217', 'CHNC_P0BCDNAF_20200227', 'CHNC_P0BCDNAF_20200615', 'CHNC_P0BCDNAF_20200925']
data2: ['20200217', '20200227', '20200615', '20200925']

Here is a Scala version that is easily convertible to PySpark by your good self:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
val files = sc.wholeTextFiles("/FileStore/tables/*ZZ.txt",0)
val res1 = files.map { case (path, content) => (path, content.split("\n").flatMap(_.split(" "))) }
val res2 = res1.flatMap { case (path, words) => words.map(word => (path, word)) }
val res3 = res2.map(line => (line._1, line._1.split("/")(3), line._2))
val df = res3.toDF()
val df2 = df.withColumn("s", split($"_1", "/"))
.withColumn("f1", $"s"(3))
.withColumn("f2", $"f1".cast(StringType)) // avoid issues with split subsequently
.withColumn("filename", substring_index(col("f2"), ".", 1))
df2.show(false)
df2.repartition($"filename").write.mode("overwrite").parquet("my_parquet") // note the default of 200 shuffle partitions applies here; add partitionBy on your `write` as well for good measure.
Some sample data; the helper columns you don't need can be stripped away via .drop or by using select:
+--------------------------------+---------+-------+-------------------------------------+---------+---------+--------+
|_1 |_2 |_3 |s |f1 |f2 |filename|
+--------------------------------+---------+-------+-------------------------------------+---------+---------+--------+
|dbfs:/FileStore/tables/AAAZZ.txt|AAAZZ.txt|wwww |[dbfs:, FileStore, tables, AAAZZ.txt]|AAAZZ.txt|AAAZZ.txt|AAAZZ |
|dbfs:/FileStore/tables/AAAZZ.txt|AAAZZ.txt|wwww |[dbfs:, FileStore, tables, AAAZZ.txt]|AAAZZ.txt|AAAZZ.txt|AAAZZ |
|dbfs:/FileStore/tables/AAAZZ.txt|AAAZZ.txt|rrr |[dbfs:, FileStore, tables, AAAZZ.txt]|AAAZZ.txt|AAAZZ.txt|AAAZZ |
|dbfs:/FileStore/tables/AAAZZ.txt|AAAZZ.txt| |[dbfs:, FileStore, tables, AAAZZ.txt]|AAAZZ.txt|AAAZZ.txt|AAAZZ |
|dbfs:/FileStore/tables/AAAZZ.txt|AAAZZ.txt|4445
...
The usual punctuation removal and trimming of spaces still apply. You will need to adapt this for your filename situation, of course, which I cannot see.
The issue in your code is that you cannot split a value that has already been split.
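Since the question is in PySpark, here is roughly what the conversion might look like, using substring_index to carve the filename and the date out of the path. Treat it as a minimal, untested sketch; the paths, output location and column names are taken from the question or assumed for illustration.
# Minimal PySpark sketch (paths and column names are illustrative)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring_index

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

base_data = sc.wholeTextFiles("/user/nikhil/raw_data/")      # RDD of (filepath, content)
df = base_data.toDF(["filepath", "content"])

df = (df
      .withColumn("filename", substring_index(col("filepath"), "/", -1))    # last path segment
      .withColumn("filename", substring_index(col("filename"), ".", 1))     # drop the extension
      .withColumn("file_date", substring_index(col("filename"), "_", -1)))  # e.g. 20200217

# Each row now carries its own filename and date, so they stay aligned with the
# content; partitionBy writes each date into its own folder.
output_loc = "/user/nikhil/output"   # placeholder for your outputLoc
df.write.mode("overwrite").partitionBy("file_date").parquet(output_loc + "/xxxxx")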

Related

How to read empty delta partitions without failing in Azure Databricks?

I'm looking for a workaround. Sometimes our automated framework will read delta partitions that do not exist. It then fails because there are no parquet files in that partition.
I don't want it to fail.
What I currently do is:
spark_read.format('delta').option("basePath",location) \
    .load('/mnt/water/green/date=20221209/object=34')
Instead, I want it to return an empty dataframe, i.e. a dataframe with no records.
I did that, but found it a bit cumbersome, and was wondering if there is a better way.
df = spark_read.format('delta').load(location)
folder_partition = '/date=20221209/object=34'.strip("/").split("/")
for folder_pruning_token in folder_partition:
    folder_pruning_token_split = folder_pruning_token.split("=")
    column_name = folder_pruning_token_split[0]
    column_value = folder_pruning_token_split[1]
    df = df.filter(df[column_name] == column_value)
You really don't need to do that trick with Delta Lake tables. That trick was primarily used for Parquet and other file formats to avoid scanning files on HDFS or cloud storage, which is very expensive.
You just need to load the data and filter it using where/filter. It's similar to what you do:
df = spark_read.format('delta').load(location) \
.filter("date = '20221209' and object = 34")
If you need to, you can of course extract those values automatically, with maybe slightly simpler code:
df = spark_read.format('delta').load(location)
folder_partition = '/date=20221209/object=34'.strip("/").split("/")
cols = [f"{s[0]} = '{s[1]}'"
        for s in [f.split('=') for f in folder_partition]
        ]
df = df.filter(" and ".join(cols))
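If you want to reuse this, one option (a minimal sketch under the assumption that the partition path always looks like /col=value/col=value) is to wrap it in a small helper that returns the filtered DataFrame, which will simply be empty when the partition has no rows:
def load_partition(spark, location, partition_path):
    """Load a Delta table and narrow it to one partition path.

    Returns an empty DataFrame (with the full schema) when no rows match.
    """
    df = spark.read.format("delta").load(location)
    tokens = [t for t in partition_path.split("/") if t]   # drop empty segments
    condition = " and ".join(
        f"{k} = '{v}'" for k, v in (t.split("=", 1) for t in tokens)
    )
    return df.filter(condition)

# Usage:
# df = load_partition(spark, location, "/date=20221209/object=34")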

How to work with Rows/Columns from CSV files?

I have about 10 columns of data in a CSV file that I want to get statistics on using Python. I am currently using the csv module to open the file and read the contents. But I also want to look at two particular columns to compare data and get a percentage of accuracy based on the data.
Although I can open the file and parse through the rows, I cannot figure out how, for example, to compare:
Row[i] Column[8] with Row[i] Column[10]
My pseudo code would be something like this:
category = Row[i] Column[8]
label = Row[i] Column[10]
if category != label:
    difference += 1
    totalChecked += 1
else:
    correct += 1
    totalChecked += 1
The only thing I am able to do is read an entire row. But I want to get the exact row and column of my two variables, category and label, and compare them.
How do I work with specific rows/columns for an entire Excel sheet?
Convert both to pandas DataFrames and compare them, similar to the example below. Whatever dataset you're working on, loading it with pandas (alongside any other relevant modules) and transforming the data into lists and DataFrames is, imo, the first step to working with it.
I've delved into this myself as it will be useful to me going forward. The columns don't have to be the same length at all in this example, so that's good. I've tested the code below (Python 3.8) and it works successfully.
With only slight adaptations it can be used for your specific data columns, objects and purposes.
import pandas as pd

A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv')   # dropped the s from _sequences
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')

print(A.columns)
print(B.columns)

my_unknown_id = A['Unknown_sample_no'].tolist()
my_unknown_seq = A['Unknown_sample_seq'].tolist()
Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist()   # it was Reference_sequences

# Build lookup dictionaries: species -> reference sequence, sample id -> unknown sequence
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1))
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)

import re

filename = 'seq_match_compare2.csv'
f = open(filename, 'a')   # in his example it was 'w'
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
f.write(headers)

# For every unknown sequence, look for it inside each reference sequence
for ID, seq in Unknown_dict.items():
    for species, seq1 in Ref_dict.items():
        m = re.search(seq, seq1)
        if m:
            match = m.group()
            pos = m.start() + 1
            f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')
f.close()
And I did it myself too, assuming your columns contain integers, and according to your specifications as best I can at the moment. It's my first attempt without web scraping, so go easy. You could use my code below as a benchmark for how to move forward on your question.
Basically it does what you want (gives you the skeleton): it imports the CSV in Python using the pandas module, converts it to DataFrames, works on the specific columns in those DataFrames, makes new result columns, prints the results alongside the original data in the terminal, and saves them to a new CSV. It's as messy as my Python is, but it works. This is still a work in progress and there are some redundant lines left in it, but you can see how to convert your columns and rows into lists and DataFrames, start to do calculations with them in Python, and get your results back out to a new CSV.
import pandas as pd

A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')

# Fill in any missing values so the comparisons don't choke on NaNs
A["Category"].fillna("empty data - missing value", inplace=True)
#A["Blank1"].fillna("empty data - missing value", inplace=True)
# ...etc

print(A.columns)

# Pull the relevant columns out as lists / dictionaries for later comparisons
MyCat = A['Category'].tolist()
MyLab = A['Label'].tolist()
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()
Ref_dict = dict(zip(My_Cats, My_Labs))
print(Ref_dict)

print("Given Dataframe :\n", A)

# Difference between the two numeric columns, added as a new results column
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of Category1 and Label1 :\n", A)

# You can do other matches, comparisons and calculations here and add them to the output

# Save the original data plus the new results column to a new CSV
A.to_csv('some_name5523.csv', index=False)
Yes, I know it's by no means perfect at all, but I wanted to give you the heads-up about pandas and DataFrames for doing what you want moving forward.
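For the narrower question actually asked (compare column 8 with column 10 and tally an accuracy percentage), a much shorter pandas sketch could look like the following. The filename and the column positions are assumptions; adjust them to your CSV.
import pandas as pd

df = pd.read_csv("your_file.csv")        # placeholder filename

category = df.iloc[:, 8]                 # "Column[8]" from the pseudo code (0-based)
label = df.iloc[:, 10]                   # "Column[10]" from the pseudo code

total_checked = len(df)
correct = (category == label).sum()
difference = total_checked - correct

print("correct:", correct, "different:", difference,
      "accuracy:", round(correct / total_checked * 100, 2), "%")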

df.replace() not being converted into the text or csv file

When I use:
df = df.replace(oldvalue, newvalue)
it performs the replacement, but when I try to put the new dataframe into either a text file or a CSV file, the output does not update and continues to show the original data from before the replace.
I am getting the data from two files and trying to add them together. Right now I am trying to change the formatting to match the original formatting.
I have tried altering the placement of the replacement, as well as editing my df.replace command numerous times to include regex=True, to_replace, value=, and other small changes. Below is a small sample of the code.
drdf['adg'] = adgvals  # adds adg values into dataframe
for column, valuex in drdf.iteritems():
    #value = value.replace('444.000', '444.0')
    for indv in valuex:
        valuex = valuex.replace('444.000', '444.0')
    for difindv in valuex:
        fourspace = '    '  # four spaces
        if len(difindv) == 2:
            indv1 = difindv + fourspace
            value1 = valuex.replace(difindv, indv1)
            drdf = drdf.replace(to_replace=valuex, value=value1)

# Transfers new dataframe into new text file
np.savetxt(r'/Users/username/test.txt', drdf.values, fmt='%s', delimiter='')
drdf.to_csv(r'/Users/username/089010219.tot')
It should be replacing the values (for example, 40 with 40 followed by four spaces). It does this within the Spyder interface, but the change does not carry through into the files that are being created.
Did you try:
df.replace(old, new, inplace=True)
inplace essentially puts the new value 'in place' of the old in some cases. However, I do not claim to know all the inner technical workings of inplace.
This is how I would do it with map:
drdf['adg'] = adgvals  # adds adg values into dataframe
for column, valuex in drdf.iteritems():
    #value = value.replace('444.000', '444.0')
    for indv in valuex:
        valuex = valuex.map({'444.000': '444.0'})
    for difindv in valuex:
        fourspace = '    '  # four spaces
        if len(difindv) == 2:
            indv1 = difindv + fourspace
            value1 = valuex.map({difindv: indv1})
            drdf = drdf.replace(valuex, value1)

# Transfers new dataframe into new text file
np.savetxt(r'/Users/username/test.txt', drdf.values, fmt='%s', delimiter='')
drdf.to_csv(r'/Users/username/089010219.tot')
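For what it's worth, the usual cause of "the replace shows up in Spyder but not in the file" is that df.replace returns a new DataFrame, so the result has to be assigned back (or inplace=True used) before writing, and the object you write out has to be the one you modified. A minimal sketch with toy data; the column name and output path are placeholders:
import pandas as pd

drdf = pd.DataFrame({"adg": ["444.000", "40", "12.5"]})   # toy data

drdf["adg"] = drdf["adg"].replace("444.000", "444.0")                        # assign the result back
drdf["adg"] = drdf["adg"].apply(lambda v: v + "    " if len(v) == 2 else v)  # pad 2-char values with four spaces

drdf.to_csv("output.csv", index=False)   # write *after* the replacements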

Spark: Load multiple files, analyze individually, merge results, and save

I'm new to Spark and not quite sure how to ask this (which terms to use, etc.), so here's an outline of what I'm conceptually trying to accomplish:
I have lots of small, individual .txt "ledger" files (e.g., line-delimited files with a timestamp and attribute values at that time).
I'd like to:
Read each "ledger" file into individual data frames (read: NOT combining into one, big data frame);
Perform some basic calculations on each individual data frame, which result in a row of new data values; and then
Merge all the individual result rows into a final object & save it to disk in a line-delimited file.
It seems like nearly every answer I find (when googling related terms) is about loading multiple files into a single RDD or DataFrame, but I did find this Scala code:
val data = sc.wholeTextFiles("HDFS_PATH")
val files = data.map { case (filename, content) => filename }

def doSomething(file: String) = {
  println(file)
  // your logic of processing a single file comes here
  val logData = sc.textFile(file)
  val numAs = logData.filter(line => line.contains("a")).count()
  println("Lines with a: %s".format(numAs))
  // save rdd of single file processed data to hdfs comes here
}

files.collect.foreach(filename => {
  doSomething(filename)
})
... but:
A. I can't tell if this parallelizes the read/analyze operation, and
B. I don't think it provides for merging the results into a single object.
Any direction or recommendations are greatly appreciated!
Update
It seems like what I'm trying to do (run a script on multiple files in parallel and then combine results) might require something like thread pools (?).
For clarity, here's an example of the calculation I'd like to perform on the DataFrame created by reading in the "ledger" file:
from dateutil.relativedelta import relativedelta
from datetime import datetime
from pyspark.sql.functions import to_timestamp

# Read "ledger file"
df = spark.read.json("/path/to/ledger-filename.txt")

# Convert string ==> timestamp & sort
df = (df.withColumn("timestamp", to_timestamp(df.timestamp, 'yyyy-MM-dd HH:mm:ss'))).sort('timestamp')

columns_with_age = ("location", "status")
columns_without_age = ("wh_id",)  # trailing comma keeps this a tuple

# Get the most-recent values (from the last row of the df)
row_count = df.count()
last_row = df.collect()[row_count - 1]

# Create an empty "final row" dictionary
final_row = {}

# For each column for which we want to calculate an age value ...
for c in columns_with_age:
    # Initialize loop values
    target_value = last_row[c]
    final_row[c] = target_value
    timestamp_at_lookback = last_row["timestamp"]
    look_back = 1
    different = False
    # Walk backwards through the rows until the value changes
    while not different:
        previous_row = df.collect()[row_count - 1 - look_back]
        if previous_row[c] == target_value:
            timestamp_at_lookback = previous_row["timestamp"]
            look_back += 1
        else:
            different = True
    # At this point, a difference has been found, so calculate the age
    final_row["days_in_{}".format(c)] = relativedelta(datetime.now(), timestamp_at_lookback).days
Thus, a ledger like this:
+---------+------+-------------------+-----+
| location|status| timestamp|wh_id|
+---------+------+-------------------+-----+
| PUTAWAY| I|2019-04-01 03:14:00| 20|
|PICKABLE1| X|2019-04-01 04:24:00| 20|
|PICKABLE2| X|2019-04-01 05:33:00| 20|
|PICKABLE2| A|2019-04-01 06:42:00| 20|
| HOTPICK| A|2019-04-10 05:51:00| 20|
| ICEXCEPT| A|2019-04-10 07:04:00| 20|
| ICEXCEPT| X|2019-04-11 09:28:00| 20|
+---------+------+-------------------+-----+
Would reduce to (assuming the calculation was run on 2019-04-14):
{ '_id': 'ledger-filename', 'location': 'ICEXCEPT', 'days_in_location': 4, 'status': 'X', 'days_in_status': 3, 'wh_id': 20 }
Using wholeTextFiles is not recommended, as it loads each full file into memory at once. If you really want to create an individual data frame per file, you could simply use the full path instead of a directory. However, this is not recommended and will most likely lead to poor resource utilisation. Instead, consider using input_file_name: https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/functions.html#input_file_name--
For example:
spark
  .read
  .textFile("path/to/files")
  .withColumn("file", input_file_name())
  .filter($"value" like "%a%")
  .groupBy($"file")
  .agg(count($"value"))
  .show(10, false)
+----------------------------+------------+
|file |count(value)|
+----------------------------+------------+
|path/to/files/1.txt |2 |
|path/to/files/2.txt |4 |
+----------------------------+------------+
so the files can be processed individually and then later combined.
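Since the question is in PySpark, a rough equivalent of the snippet above might look like this (a sketch; the path and the "%a%" filter are the same illustrative placeholders):
from pyspark.sql.functions import col, count, input_file_name

(spark.read.text("path/to/files")
      .withColumn("file", input_file_name())
      .filter(col("value").like("%a%"))
      .groupBy("file")
      .agg(count("value"))
      .show(10, truncate=False))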
You could fetch the file paths in HDFS:
import org.apache.hadoop.fs.{FileSystem, Path}
val files = FileSystem.get(sc.hadoopConfiguration)
  .listStatus(new Path(your_path))
  .map(x => x.getPath)
  .map(x => "hdfs://" + x.toUri().getRawPath())
Then create a separate dataframe for each path:
val arr_df = files.map(spark.read.format("csv").option("delimiter", ",").option("header", true).load(_))
Apply your filter, or any other transformation, before unioning them into one dataframe:
val df = arr_df.map(x => x.where(your_filter)).reduce(_ union _)
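A PySpark sketch of the same idea, going through the JVM Hadoop FileSystem API to list the files; the path, CSV options and filter are placeholders rather than a definitive implementation:
from functools import reduce

jvm = spark.sparkContext._jvm
conf = spark.sparkContext._jsc.hadoopConfiguration()
Path = jvm.org.apache.hadoop.fs.Path
fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)

# List the files under the input directory
paths = [status.getPath().toString()
         for status in fs.listStatus(Path("hdfs:///your_path"))]

# One dataframe per file, transformed individually, then unioned
dfs = [spark.read.option("header", True).csv(p) for p in paths]
dfs = [df.where("your_filter") for df in dfs]   # placeholder per-file logic
result = reduce(lambda a, b: a.union(b), dfs)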

merge all pairs of objects within a folder based on (pairwise) string pattern (R or Python)

I have two types of tables containing time series. One type contains data referring to population and is stored in files with a particular pattern at the end of the name. The other type contains data regarding resources. Furthermore, I have files for different farms (hundreds). Thus, the content of the folder is:
Farm01_population
Farm01_resources
Farm02_population
Farm02_resources
Farm03_population
Farm03_resources
Farm04_population
Farm04_resources
........
And so on.
I also must do computations within each file. So far, I've started the task by first performing the calculations separately for population and resources.
population_files <- list.files("path",pattern="population.txt$")
resources_files <- list.files("path",pattern="resources.txt$")
for(i in 1:length(population_files)){......}
for(j in 1:length(resources_files)){......}
How could I now merge every pair of tables referring to the same farm, thus obtaining:
Farm01_finaltable
Farm02_finaltable
Farm03_finaltable
Farm04_finaltable
......
And so on.
As the number of farms is very large, I cannot write a specific string as the pattern at the beginning of each file name. What I need to express is that tables sharing the same pattern at the beginning must be merged, whatever that pattern (farm) is.
I am using R but solutions with Python are also welcomed.
Just keep everything together in one table:
library(dplyr)
library(tidyr)   # for extract()
library(rex)

file_regex =
  rex(capture(digits),
      "_",
      capture(anything))

catalog =
  data_frame(file = list.files("path")) %>%
  extract(file,
          c("ID", "type"),
          file_regex,
          remove = FALSE)

population =
  catalog %>%
  filter(type == "population") %>%
  group_by(ID) %>%
  do(.$file %>% first %>% read.csv)

resources =
  catalog %>%
  filter(type == "resources") %>%
  group_by(ID) %>%
  do(.$file %>% first %>% read.csv)

together = full_join(population, resources)
Assuming the files are in CSV format, consider the following base R and Python 3 (pandas) solutions. Both use regex patterns to find the corresponding population and resources files and then merge them into a final table using a linked Farm ID. Do note that if you need to iterate past 99 files, be sure to adjust the regex digit count {2} to {3} (for Python, do not change the string format operator {0}).
R
path = "C:/Path/To/Files"
numberoffiles = 2
for (i in (1:numberoffiles)) {
if (i < 10) { i = paste0('0', i) } else { i = as.character(i) }
filespop <- list.files(path, pattern=sprintf("^[a-zA-Z]*[%s]{2}_population.csv$", i))
dfpop <- read.csv(paste0(path, "/", filespop[[1]]))
filesres <- list.files(path, pattern=sprintf("^[a-zA-Z]*[%s]{2}_resources.csv$", i))
dfres <- read.csv(paste0(path, "/", filesres[[1]]))
farm <- gsub(sprintf("[%s]{2}_population.csv", i), "", filespop[[1]])
mergedf <- merge(dfpop, dfres, by=c('FarmID'), all=TRUE)
write.csv(mergedf, paste0(path, "/", farm,
sprintf("%s_FinalTable_r.csv", i)), row.names=FALSE)
}
Python
import os
import re
import pandas as pd

# CURRENT DIRECTORY OF SCRIPT
cd = os.path.dirname(os.path.abspath(__file__))
numberoffiles = 2

for item in os.listdir(cd):
    for i in range(1, numberoffiles + 1):
        i = '0' + str(i) if i < 10 else str(i)

        filepop = re.match("^[a-zA-Z]*[{0}]{{2}}_population.csv$".format(i), item, flags=0)
        fileres = re.match("^[a-zA-Z]*[{0}]{{2}}_resources.csv$".format(i), item, flags=0)

        if filepop:
            dfpop = pd.read_csv(os.path.join(cd, item))
        if fileres:
            dfres = pd.read_csv(os.path.join(cd, item))
            farm = item.replace("{0}_resources.csv".format(i), "")

            mergedf = pd.merge(dfpop, dfres, on=['FarmID'])
            mergedf.to_csv(os.path.join(cd, "{0}{1}_FinalTable_py.csv".format(farm, i)),
                           index=False)
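An alternative Python sketch that pairs the files by their shared prefix instead of counting them, so nothing needs adjusting as the number of farms grows. The file naming and the FarmID merge key follow the examples above; treat it as a sketch, not a definitive implementation.
import glob
import os
import pandas as pd

path = "C:/Path/To/Files"   # placeholder

for pop_file in glob.glob(os.path.join(path, "*_population.csv")):
    prefix = os.path.basename(pop_file).replace("_population.csv", "")   # e.g. Farm01
    res_file = os.path.join(path, prefix + "_resources.csv")
    if not os.path.exists(res_file):
        continue                                   # no matching resources file

    dfpop = pd.read_csv(pop_file)
    dfres = pd.read_csv(res_file)
    merged = pd.merge(dfpop, dfres, on="FarmID", how="outer")
    merged.to_csv(os.path.join(path, prefix + "_finaltable.csv"), index=False)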
