Is it possible to modify output data file names in pySpark? - python

Simplified case.
Given that I have 5 input files in directory data_directory:
data_2020-01-01.txt,
data_2020-01-02.txt,
data_2020-01-03.txt,
data_2020-01-04.txt,
data_2020-01-05.txt
I read them all into a pySpark RDD and perform some operation on them that doesn't involve any shuffling.
spark = SparkSession.builder.appName("Clean Data").getOrCreate()
sparkContext = spark.sparkContext
input_rdd = sparkContext.textFile("data_directory/*")
result = input_rdd.mapPartitions(lambda x: remove_corrupted_rows(x))
Now I want to save data:
result.saveAsTextFile(
"results",
compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)
And I get 5 files whose names all contain "part", so I've lost the information about which input file each output file came from:
._SUCCESS.crc
.part-00000.gz.crc
.part-00001.gz.crc
.part-00002.gz.crc
.part-00003.gz.crc
.part-00004.gz.crc
_SUCCESS
part-00000.gz
part-00001.gz
part-00002.gz
part-00003.gz
part-00004.gz
Is there any way to keep the input file names or introduce my own naming pattern in this case?
Expected desired result:
._SUCCESS.crc
.data_2020-01-01.gz.crc
.data_2020-01-02.gz.crc
.data_2020-01-03.gz.crc
.data_2020-01-04.gz.crc
.data_2020-01-05.gz.crc
_SUCCESS
data_2020-01-01.gz
data_2020-01-02.gz
data_2020-01-03.gz
data_2020-01-04.gz
data_2020-01-05.gz

You could use pyspark.sql.functions.input_file_name() (docs here https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=functions#pyspark.sql.functions.input_file_name) and then partition your dataframe by the column created.
This way, 5 input files should give you a categorical column with 5 different values and partitioning on it should split your output into 5 parts.
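For example, a minimal sketch of that approach with the DataFrame API (the cleaning step is omitted, and the regexp_extract pattern and the source_file column name are my own assumptions; note that the output lands in one subdirectory per input file, e.g. results/source_file=data_2020-01-01/, rather than in renamed part files):
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract

spark = SparkSession.builder.appName("Clean Data").getOrCreate()

df = spark.read.text("data_directory/*")
# Record which input file each row came from, keeping only the file stem
df = df.withColumn("source_file", regexp_extract(input_file_name(), r"([^/]+)\.txt$", 1))

# One subdirectory per input file under results/
(df.write
    .partitionBy("source_file")
    .option("compression", "gzip")
    .text("results"))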
Alternatively, if you wish to have a full naming pattern, then functionally split the dataframe on the input_file_name() column (here into 5 dataframes), repartition (e.g. to 1 using coalesce(1)) and then save with custom logic (e.g. a dict mapping, or by extracting the filename from the column and passing it to DataFrameWriter.csv() as the name).
N.B.: When changing to 1 partition, be sure that all the data fits into memory!
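A rough sketch of that second approach (the helper names and the results/<stem> layout are only illustrative; Spark still writes a part-*.gz file inside each output directory, so an extra filesystem rename step would be needed to end up with exactly data_2020-01-01.gz):
from pyspark.sql.functions import input_file_name

df = spark.read.text("data_directory/*").withColumn("src", input_file_name())

# Collect the distinct source paths once, then do one filtered write per input file
for src in [r.src for r in df.select("src").distinct().collect()]:
    stem = src.rsplit("/", 1)[-1].replace(".txt", "")  # e.g. data_2020-01-01
    (df.filter(df.src == src)
        .drop("src")
        .coalesce(1)
        .write
        .option("compression", "gzip")
        .text("results/" + stem))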

Related

Best way to compare two huge dataframes in python

I have a use case where I need to compare a file from s3 bucket with the output of a sql query.
Below is how I am reading the s3 file.
if s3_data is None:
    s3_data = pd.read_feather(BytesIO(obj.get()['Body'].read()))
else:
    s3_data = s3_data.append(pd.read_feather(BytesIO(obj.get()['Body'].read())), ignore_index=True)
Below is how I am reading from the database.
db_data=pandas.read_sql
Now I need to compare s3_data with db_data. Both of these dataframes are huge with as much as 2 million rows of data each.
The format of the data is somewhat like below.
name | Age | Gender
--------------------
Peter| 30 | Male
Tom | 24 | Male
Riya | 28 | Female
Now I need to validate whether exact same rows with same column data exist in both s3 file and db.
I tried using dataframe.compare(), but the kind of results it gives is not what I am looking for. The position of the row in the dataframe is not relevant for me. So if 'Tom' appears in row 1 or 3 in one of the dataframes and at a different position in the other, it should still pass the equality validation, as long as the record itself with the same column values is present in both. dataframe.compare() is not helping me in this case.
The alternate approach which I took is to use csv_diff. I merged all the columns into one in the following way and saved it as a csv locally, creating two csv files - one for the s3 data and one for the db data.
data
------------
Peter+30+Male
Tom+24+Male
Riya+28+Female
Then, using below code, I am comparing the files.
from csv_diff import load_csv, compare

s3_file = load_csv(open("s3.csv"), key="data")
db_file = load_csv(open("db.csv"), key="data")
diff = compare(s3_file, db_file)
This works, but it is not very performant, as I have to first write huge csv files as big as 500 MB to local disk and then read and compare them this way.
Is there a better way to handle the comparison part without having to write files to local disk, while still ensuring that I am able to compare entire records (with each column value compared for a given row) for equality, irrespective of the position of the row in the dataframe?
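One possible way to do that whole-row comparison in memory, without writing CSVs, is to count identical rows on each side and compare the counts, so that row order is irrelevant and duplicated rows are still accounted for. A sketch only, using the example columns from the question (DataFrame.value_counts needs pandas >= 1.1):
cols = ['name', 'Age', 'Gender']
s3_counts = s3_data[cols].value_counts()
db_counts = db_data[cols].value_counts()

# Rows that appear a different number of times on the two sides
diff = s3_counts.sub(db_counts, fill_value=0)
mismatches = diff[diff != 0]
print(mismatches)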

Finding all files associated with an id within a folder of images?

I'm trying to populate a dataframe based on a class label and images in a folder.
I have a folder with over 10,000 images with the following name structure: ['leaflet_10000_1.jpg', 'leaflet_10000_2.jpg', 'leaflet_10001_1.jpg', 'leaflet_10001_2.jpg', 'leaflet_10002_1.jpg', 'leaflet_10002_2.jpg', 'leaflet_10003_1.jpg', 'leaflet_10003_2.jpg']
And an accompanying csv file of the structure:
ID,Location,Party,Representative/Candidate,Date
1000,Glasgow North,Liberal Democrats,,02-Apr-10
1001,Erith and Thamesmead,Labour Party,,02-Apr-10
I want to create a new csv file which has the paths of all the images for a said Party. I can separate a certain party from the full csv file using the commands:
df_ = df.loc[df["Party"] == "Labour Party"]
This will give me the party I am interested in, but how do I create a FULL list of all images associated with it? From the image list shared above, it can be seen that ID 1001 has 2 images associated with it. This is not a fixed number; some IDs have 3 to 5 images associated with them.
How do I get this new dataframe populated with all the required paths?
My thought process is to apply str.split(name, '_') on each file name and then search every ID against all the results, but where do I go from there?
You're on the right track!
If all IDs are unique and you want an output dataframe with just the party and image number, you can do something like:
from pathlib import Path
import numpy as np
import pandas as pd
partySer = df.loc[:, ['ID', 'Party']].set_index('ID')
# Get image names
imgFiles = list(Path('./<your-image-path>/').glob('*.jpg'))
imgFiles_str = np.array([str(f) for f in imgFiles])
# Grab just the integer ID from each image name
imgIds = np.array([int(f.stem.split('_')[1]) for f in imgFiles])
# Build dataframe with matching ids
outLst = []
for curId, row in partySer.iterrows():
    matchingImgIdxs = imgIds == curId
    matchingImgs = imgFiles_str[matchingImgIdxs]
    outLst.append({'Party': row['Party'], 'images': matchingImgs})
outDf = pd.DataFrame(outLst)
I haven't tested this code, but it should lead you on the right track.
Let's create a dataframe of your images and extract the ID.
from pathlib import Path
img_df = pd.DataFrame({'imgs': [i.name for i in Path(your_images).glob('*.jpg')]})
img_df['ID'] = img_df['imgs'].str.split('_', expand=True)[1].astype(int)
img_dfg = img_df.groupby('ID', as_index=False).agg(list)
ID imgs
0 10000 [leaflet_10000_1.jpg, leaflet_10000_2.jpg]
1 10001 [leaflet_10001_1.jpg, leaflet_10001_2.jpg]
2 10002 [leaflet_10002_1.jpg, leaflet_10002_2.jpg]
3 10003 [leaflet_10003_1.jpg, leaflet_10003_2.jpg]
Then we just need to merge on the ID column.
df_merged = pd.merge(df,img_dfg,on='ID',how='left')
You can then do any further operations to group or list your images.
What do you want in your DataFrame? You said here that you wanted to populate your df with the required paths? If so, then using str.split(name, '_') would allow you to get the following information for every file: its ID and its number.
You can now insert elements into your dataframe using both of these characteristics, adding any other characteristic obtained from the related .csv file that you described. In the end, filtering the dataframe to get all elements that match a given criterion should give you what you are looking for.
You seem to think that one ID will mean one line inside the dataframe, but that's incorrect, as each line is described by an (ID, number) pair in your case, and thus your function would already give you the full list of all images associated with the party/ID/other characteristic.
If you want to reduce the size of your dataframe, since all images related to the same ID differ by only one characteristic, you could also have a "Files" column containing a list of all images related to this ID (and thus drop the "number" column), or just the numbers associated with them, since their paths consist of the main path followed by "_number.jpg". This solution would be a lot more efficient.

Storing processed text in pandas dataframe

I've used gensim for text summarizing in Python. I want my summarized output to be stored in a different column in the same dataframe.
I've used this code:
for n, row in df_data_1.iterrows():
    text = df_data_1['Event Description (SAP)']
    print(text)
    df_data_1['Summary'] = summarize(text)
    print(df_data_1['Summary'])
The error comes from line 4 of this code and states: TypeError: expected string or bytes-like object.
How do I store the processed text in the pandas dataframe?
If it's not a string or bytes-like object, what is it? You could check the type coming out of your summarize function and move forward from there:
test_text = df_data_1['Event Description (SAP)'].iloc[0]
print(type(summarize(test_text)))
Another remark: typically you'd want to avoid looping over a dataframe (see discussion). If you want to apply a function to an entire column, use df.apply() as follows:
df_data_1['Summary'] = df_data_1['Event Description (SAP)'].apply(lambda x: summarize(x))
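A minimal, self-contained version of that apply pattern might look like this (assuming summarize comes from gensim.summarization, which exists in gensim versions before 4.0, and that every row holds a non-empty, multi-sentence string; anything else would reproduce an error from summarize):
from gensim.summarization import summarize  # gensim < 4.0

# Summarize each description individually and store the result in a new column
df_data_1['Summary'] = df_data_1['Event Description (SAP)'].apply(summarize)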

How can I implement functions like mean, median and variance if I have a dictionary with 2 keys in Python?

I have many files in a folder that look like this one:
[screenshot of one of the input files]
and I'm trying to build a dictionary from the data. I want it to have 2 keys (the first one is the http address and the second is the third field, the plugin used, like adblock). The values refer to different metrics, so my intention is to compute, for each site and plugin, the mean, median and variance of each metric once the dictionary has been built. For the mean, for example, my intention is to consider all the 4th-field values in the file, and so on. I tried to write this code but, first of all, I'm not sure that it is correct.
[screenshot of the attempted dictionary-building code]
I read other posts but none solved my problem, since they treat only one key, or they don't show how to access the different values inside the dictionary to compute the mean, median and variance.
The problem is simple: assuming the dictionary implementation is OK, how do I access the different values for key1: www.google.it -> key2: adblock?
Any kind of help is appreciated, and I'm happy to provide more details.
You can do what you want using a dictionary, but you should really consider using the Pandas library. This library is centered around tabular data structure called "DataFrame" that excels in column-wise and row-wise calculations such as the one that you seem to need.
To get you started, here is the Pandas code that reads one text file using the read_fwf() method. It also displays the mean and variance for the fourth column:
# import the Pandas library:
import pandas as pd
# Read the file 'table.txt' into a DataFrame object. Assume
# a header-less, fixed-width file like in your example:
df = pd.read_fwf("table.txt", header=None)
# Show the content of the DataFrame object:
print(df)
# Print the fourth column (zero-indexed):
print(df[3])
# Print the mean for the fourth column:
print(df[3].mean())
# Print the variance for the fourth column:
print(df[3].var())
There are different ways of selecting columns and rows from a DataFrame object. The square brackets [ ] in the previous examples selected a column in the data frame by column number. If you want to calculate the mean of the fourth column only from those rows that contain adblock in the third column, you can do it like so:
# Print those rows from the data frame that have the value 'adblock'
# in the third column (zero-indexed):
print(df[df[2] == "adblock"])
# Print only the fourth column (zero-indexed) from that data frame:
print(df[df[2] == "adblock"][3])
# Print the mean of the fourth column from that data frame:
print(df[df[2] == "adblock"][3].mean())
EDIT:
You can also calculate the mean or variance for more than one column at the same time:
# Use a list of column numbers to calculate the mean for all of them
# at the same time:
l = [3, 4, 5]
print(df[l].mean())
END EDIT
If you want to read the data from several files and do the calculations for the concatenated data, you can use the concat() method. This method takes a list of DataFrame objects and concatenates them (by default, row-wise). Use the following line to create a DataFrame from all *.txt files in your directory:
import glob

df = pd.concat([pd.read_fwf(file, header=None) for file in glob.glob("*.txt")],
               ignore_index=True)
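Once the data from all files is in one DataFrame as above, the two-key part of the question (site plus plugin) maps naturally onto a groupby. A sketch, assuming the site is in the first column and the plugin in the third (zero-indexed 0 and 2):
# Mean, median and variance of the fourth column (zero-indexed 3)
# for every (site, plugin) pair:
stats = df.groupby([0, 2])[3].agg(['mean', 'median', 'var'])
print(stats)

# Look up one (site, plugin) pair by its two keys:
print(stats.loc[('www.google.it', 'adblock')])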

parsing CSV to pandas dataframes (one-to-many unmunge)

I have a csv file imported to a pandas dataframe. It probably came from a database export that combined a one-to-many parent and detail table. The format of the csv file is as follows:
header1, header2, header3, header4, header5, header6
sample1, property1,,,average1,average2
,,detail1,detail2,,
,,detail1,detail2,,
,,detail1,detail2,,
sample2, ...
,,detail1,detail2,,
,,detail1,detail2,,
...
(i.e. line 0 is the header, line 1 is record 1, lines 2 through n are details, line n+1 is record 2 and so on...)
What is the best way to extricate (renormalize?) the details into separate DataFrames that can be referenced using values in the sample# records? The number of detail records is different for each sample.
I can use:
samplelist = df.header2[pd.notnull(df.header2)]
to get the starting index of each sample so that I can grab samplelist.index[0] to samplelist.index[1] and put it in a smaller dataframe. Detail records by themselves have no reference to which sample they came from, so that has to be inferred from the order of the csv file (notice that there is no intersection of filled/empty fields in my example).
Should I make a list of dataframes, a dict of dataframes, or a panel of dataframes?
Can I somehow create variables from the sample1 record fields and somehow attach them to each dataframe that has only detail records (like a collection of objects that have several scalar members and one dataframe each)?
Eventually I will create statistics on data from each detail record grouping and plot them against values in the sample records (e.g. sampletype, day or date, etc. vs. mystatistic). I will create intermediate Series to also be attached to the sample grouping like a kernel density estimation PDF or histogram.
Thanks.
You can use the fact that the first column is empty unless it's a new sample record: .fillna(method='ffill') fills the sample name down, and .groupby('header1') then gives you all the separate groups. On these, you can calculate statistics right away or store them as separate DataFrames. A high-level sketch follows:
df.header1 = df.header1.fillna(method='ffill')
for sample, data in df.groupby('header1'):
    print(sample)  # access to the sample name
    data = ...     # process the sample's records
The answer above got me going in the right direction. With further work, the following was used. It turns out I needed to use two columns as a compound key to uniquely identify samples.
df.header1 = df.header1.fillna(method='ffill')
df.header2 = df.header2.fillna(method='ffill')
grouped = df.groupby(['header1','header2'])
samplelist = []
dfParent = pd.DataFrame()
dfDetail = pd.DataFrame()
for sample, data in grouped:
    samplelist.append(sample)
    dfParent = dfParent.append(grouped.get_group(sample).head(n=1), ignore_index=True)
    dfDetail = dfDetail.append(data[1:], ignore_index=True)
dfParent = dfParent.drop(['header3','header4',etc...], axis=1)  # remove columns only used in detail records
dfDetail = dfDetail.drop(['header5','header6',etc...], axis=1)  # remove columns only used once per sample
# Now details can be extracted by sample number in the sample list
# (e.g. the first 10 for sample 0)
samplenumber = 0
dfDetail[
    (dfDetail['header1'] == samplelist[samplenumber][0]) &
    (dfDetail['header2'] == samplelist[samplenumber][1])
].header3[:10]
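As a side note, DataFrame.append has since been removed (in pandas 2.0); on newer versions the same split can be written by collecting the pieces and concatenating once. A sketch under the same assumptions about the header columns:
samplelist = []
parent_rows = []
detail_rows = []
for sample, data in grouped:
    samplelist.append(sample)
    parent_rows.append(data.head(n=1))  # first row of each group -> parent record
    detail_rows.append(data[1:])        # remaining rows -> detail records

dfParent = pd.concat(parent_rows, ignore_index=True)
dfDetail = pd.concat(detail_rows, ignore_index=True)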
Useful links were:
Pandas groupby and get_group
Pandas append to DataFrame
