I have an input data file in HDFS. I read that file and perform some validations like the ones below. After performing the validations I get the result shown below. I want to change the delimiter from a comma to '\t' using PySpark and store the output in HDFS. Can anyone help me with this? (No CSV answers, please.) Thanks in advance.
Validation Code:
dc = data_f.filter("age > 25").filter(data_f.mar == '"married"').groupBy("job","edu").avg("bal","age").sort(data_f.job.desc(),"edu").rdd.map(list).collect()
Result:
[[u'"unknown"', u'"primary"', 1515.974358974359, 48.61538461538461],
[u'"unknown"', u'"secondary"', 1314.2045454545455, 47.84090909090909],
[u'"unknown"', u'"tertiary"', 2328.64, 51.84],
[u'"unknown"', u'"unknown"', 1977.1157894736841, 51.694736842105264],
[u'"unemployed"', u'"primary"', 1685.6097560975609, 44.957317073170735],
[u'"unemployed"', u'"secondary"', 1472.3518072289157, 43.8433734939759],
[u'"unemployed"', u'"tertiary"', 1865.968992248062, 41.031007751937985],
[u'"unemployed"', u'"unknown"', 859.1875, 45.375],
[u'"technician"', u'"primary"', 1512.704, 47.912]]
If you need to avoid the
df.write.csv
method, you could just use this snippet on the RDD:
def concatenate_row(row):
    concatenated_row = ""
    for col in row:
        concatenated_row += str(col) + "\t"
    return concatenated_row
result = rdd.map(lambda row : concatenate_row(row))
and then just call the
saveAsTextFile
method on it.
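To put the pieces together, here is a minimal end-to-end sketch (the output path is a placeholder, and the DataFrame is the one built by the validation step above, without the final .collect() so the data stays distributed):

dc_rdd = (data_f.filter("age > 25")
                .filter(data_f.mar == '"married"')
                .groupBy("job", "edu")
                .avg("bal", "age")
                .sort(data_f.job.desc(), "edu")
                .rdd
                .map(list))

# Join each row's columns with tabs and write plain text files to HDFS
(dc_rdd.map(concatenate_row)
       .saveAsTextFile("hdfs:///user/output/tab_delimited"))  # placeholder path

Note that concatenate_row leaves a trailing tab on each line; "\t".join(str(col) for col in row) avoids that if it matters.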
I am relatively new to Python and Scanpy, and recently I generated a list of differentially expressed genes by using the
sc.tl.rank_genes_groups
function in scanpy. I can then get these genes listed in the console by carrying out this set of commands:
result = adata_subset.uns['rank_genes_groups']
groups = result['names'].dtype.names
pd.DataFrame(
    {group + '_' + key[:1]: result[key][group]
     for group in groups for key in ['names','logfoldchanges','pvals','pvals_adj']})
However, I want this to be accessible via a CSV file, as this list doesn't show all the genes...
Any help would be much appreciated!
Thanks in advance
Once you've created the dataframe, you simply need to use the to_csv function:
result = adata_subset.uns['rank_genes_groups']
groups = result['names'].dtype.names
df = pd.DataFrame(
{group + '_' + key[:1]: result[key][group]
for group in groups for key in ['names','logfoldchanges','pvals','pvals_adj']})
df.to_csv('path/to/file.csv')
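As a side note, recent scanpy versions also ship a convenience helper that builds this dataframe for you (treat the exact signature as an assumption and check your installed version), which you can then write out the same way:

import scanpy as sc

# Assumes a reasonably recent scanpy; 'group' takes one of the group names from above
df = sc.get.rank_genes_groups_df(adata_subset, group=groups[0])
df.to_csv('path/to/file.csv')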
I have a pandas dataframe with strings that I'm using to query an API and return the results.
I'm trying to call the API using a function and .apply and then save the results from the api call into a csv file. The problem is that I'm trying to do 10000+ requests and my kernel/notebook crashes. Basically I'm trying to do a big operation and I'm guessing I'm running out of memory. So I'm trying to think of a way I can do these api calls and save the results and not have it all crash. My version with .apply works with a small amount of data but not once it gets larger.
So my notebook code currently looks something like this.
df = pd.read_csv('bigstringlist.csv')
df = df.loc[0:3000]
My function looks something like this.
from time import sleep
import urllib.parse
import requests

def api_fetch_func(address):
    sleep(.2)
    API_PRIVATE = 'awewaefawefawef'
    encoded = urllib.parse.quote(address)
    query = 'https://apitocall' + str(encoded) + \
        '.json?limit=1&key=' \
        + API_PRIVATE
    response = requests.get(query)
    while True:
        try:
            jsonResponse = response.json()
            break
        except:
            response = requests.get(query)
    try:
        return jsonResponse['results']
    except:
        return
    else:
        return  # unreachable because the try block returns; kept from the original
Then I'm calling the function like so
df['response_col'] = df['string_col'].apply(api_fetch_func)
Something tells me that .apply isn't the right thing to do here. Would it be better if I just pushed the API responses into an array or another dataframe?
Should I just use .iterrows to loop over the list of strings and call the function? Something tells me .apply tries to jam too much into memory and that's why this doesn't work.
So I was going to try
results = []
for index, row in df.iterrows():
    # call API
    # push results to array
Or is there another way to do this?
If it's a memory issue, what I'd do is write the API-calling function as a generator with the yield statement. Then you can loop through the generator and save smaller dataframes to CSV files rather than holding everything in memory in one go.
for idx, response in api_fetch_generator():
    if idx % 500 == 0:
        df = create_df()  # create a fresh df as you did above with 'string_col'.
    df['response_col'] = df['string_col'].apply(response)
    if (idx % 500 == 0) and idx != 0:
        # Save the df using idx to control the file name
        df.to_csv(f"response_batch_{idx / 500}.csv")
# Combine the csv's after everything is saved.
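The generator itself isn't shown above, so here is one minimal sketch of how it could look (the name api_fetch_generator and the batching details are assumptions, reusing api_fetch_func and df from the question):

def api_fetch_generator(addresses):
    # Yield (index, result) pairs one at a time instead of building one huge column
    for idx, address in enumerate(addresses):
        yield idx, api_fetch_func(address)

batch = []
for idx, result in api_fetch_generator(df['string_col']):
    batch.append({'string_col': df['string_col'].iloc[idx], 'response_col': result})
    if len(batch) == 500:
        pd.DataFrame(batch).to_csv(f"response_batch_{idx // 500}.csv", index=False)
        batch = []  # drop the saved rows so memory stays bounded
if batch:
    pd.DataFrame(batch).to_csv("response_batch_final.csv", index=False)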
I have a Tensor dataset that is a list of file names and a Pandas dataframe that contains metadata for each file.
filename_ds = tf.data.Dataset.list_files(path + "/*.bmp")
metadata_df = pandas.read_csv(path + "/metadata.csv")
File names contain an idx that references a line in the metadata dataframe, like "3_data.bmp" where 3 is the idx. I hoped to call filename_ds.map(combine_data).
It appears not to be as simple as parsing the file name and doing a dataframe lookup. The following fails because filename is a Tensor: since eager execution is not active inside a Dataset.map() call, methods like .numpy() are unavailable, so I cannot get a string value from the filename to do my regex and dataframe lookup.
def combine_data(filename):
    idx = re.findall(r"(\d+)_data.bmp", filename)[0]
    val = metadata_df.loc[metadata_df["idx"] == idx]["test-col"]
    ...
New to Tensorflow, and I suspect I'm going about this in an odd way. What would be the correct way to go about this? I could list my files and concatenate a dataset for each file, but I'm wondering if I'm just missing the "Tensorflow way" of doing it.
One way of iterating is through as_numpy_iterator():
dataset_list = list(filename_ds.as_numpy_iterator())
for each_file in dataset_list:
    file_name = each_file.decode('utf-8')  # this will contain the abs path /user/me/so/file_1.png
    try:
        idx = re.findall("(\d+).*.png", file_name)[0]  # changed for my case
    except:
        print("Exception==>")
    print(f"File:{file_name},idx:{idx}")
I have about 10 columns of data in a CSV file that I want to get statistics on using Python. I am currently using the csv module to open the file and read the contents. But I also want to look at two particular columns to compare the data and get a percentage of accuracy based on it.
Although I can open the file and parse through the rows, I cannot figure out, for example, how to compare:
Row[i] Column[8] with Row[i] Column[10]
My pseudo code would be something like this:
category = Row[i] Column[8]
label = Row[i] Column[10]
if(category!=label):
    difference+=1
    totalChecked+=1
else:
    correct+=1
    totalChecked+=1
The only thing I am able to do is to read the entire row. But I want to get the exact Row and Column of my 2 variables category and label and compare them.
How do I work with specific row/columns for an entire excel sheet?
Convert both to pandas dataframes and compare them similarly to this example. Whatever dataset you're working on, loading it with the pandas module (alongside any other necessary modules) and transforming the data into lists and dataframes would be the first step to working with it, imo.
I've taken the liberty and the time/effort to delve into this myself, as it will be useful to me going forward. The columns don't have to have the same lengths at all in this example, which is good. I've tested the code below (Python 3.8) and it works successfully.
With only slight adaptations it can be used for your specific data columns, objects and purposes.
import pandas as pd
A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv') #dropped the S from _sequences
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')
print(A.columns)
print(B.columns)
my_unknown_id = A['Unknown_sample_no'].tolist() #Unknown_sample_no
my_unknown_seq = A['Unknown_sample_seq'].tolist() #Unknown_sample_seq
Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist() #it was Reference_sequences
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1)) #it was Reference_sequences
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1))
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)
import re
filename = 'seq_match_compare2.csv'
f = open(filename, 'a') #in his eg it was 'w'
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
f.write(headers)
for ID, seq in Unknown_dict.items():
    for species, seq1 in Ref_dict.items():
        m = re.search(seq, seq1)
        if m:
            match = m.group()
            pos = m.start() + 1
            f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')
f.close()
And I did it myself too, assuming your columns contain integers, and according to your specifications (as best as I can at the moment). It's my first try [my first attempt without webscraping, so go easy]. You could use my code below as a benchmark for how to move forward on your question.
Basically it does what you want (gives you the skeleton): it imports the CSV in Python using the pandas module, converts it to dataframes, works on specific columns only in those dataframes, makes new result columns, prints the results alongside the original data in the terminal, and saves everything to a new CSV. It's as messy as my Python is, but it works! Personally (and professionally) speaking this is a milestone for me, and I will hopefully be working on it at a later date to improve its readability, scope, functionality and abilities [as the days go by (from next weekend)].
# This is a work in progress (although it does work and does a job), and it does that for you. There are redundant lines of code in it, even among the lines not hashed out (because I'm a self-teaching newbie on my weekends). I was just finishing up getting the results printed to a new CSV file (done too). You can see how you could convert your columns & rows into lists with pandas dataframes, start to do calculations with them in Python, and get your results back out to a new CSV. It's a start on how you can answer your question going forward.
# It's for her to do much more & better on, but it does in basic terms what she asked for.
import pandas as pd
from pandas import DataFrame
import csv
import itertools #redundant now?
A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')
A["Category"].fillna("empty data - missing value", inplace = True)
#A["Blank1"].fillna("empty data - missing value", inplace = True)
# ...etc
print(A.columns)
MyCat=A['Category'].tolist()
MyLab=A['Label'].tolist()
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()
#Ref_dict0 = zip(My_Labs, My_Cats) #good to compare whole columns as block, Enumerate ZIP 19:06 01/06/2020 FORGET THIS FOR NOW, WAS PART OF A LATER ATTEMPT TO COMPARE TEXT & MISSED TEXT WITH INTEGER FIELDS. DOESN'T AFFECT PROGRAM
Ref_dict = dict(zip(My_Labs, My_Cats))
Compareprep = dict(zip(My_Cats, My_Labs))
Ref_dict = dict(zip(My_Cats, My_Labs))
print(Ref_dict)
import re #this is for string matching & comparison. Redundant in my example here, but you'll need it to compare tables of strings.
#filename = 'CATS&LABS64.csv' # when i got to exporting part, this is redundant now
#csvfile = open(filename, 'a') #when i tried to export results/output it first time - redundant
print("Given Dataframe :\n", A)
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of score1 and score2 :\n", A)
#YOU CAN DO OTHER MATCHES, COMPARISONS AND CALCULATIONS YOURSELF HERE AND ADD THEM TO THE OUTPUT
result = A  # A now holds the original columns plus the new 'Lab-Cat_diff' column
print("\nDifference of score1 and score2 :\n", result)

output_df = DataFrame(result)
output_df.to_csv('some_name5523.csv')
Yes, I know, it's by no means perfect at all, but I wanted to give you a heads up about pandas and dataframes for doing what you want moving forward.
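For the specific accuracy check in the original question, a much shorter sketch of the same idea could look like this (the column names 'Category' and 'Label' and the file name are assumptions; swap in whichever two columns you are actually comparing):

import pandas as pd

df = pd.read_csv('category_labels.csv')  # hypothetical file name

correct = (df['Category'] == df['Label']).sum()  # rows where the two columns agree
total_checked = len(df)
accuracy = correct / total_checked * 100

print(f"{correct} of {total_checked} rows match ({accuracy:.1f}% accuracy)")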
I'm trying to read a retrosheet event file into Spark. The event file is structured as such:
id,TEX201403310
version,2
info,visteam,PHI
info,hometeam,TEX
info,site,ARL02
info,date,2014/03/31
info,number,0
info,starttime,1:07PM
info,daynight,day
info,usedh,true
info,umphome,joycj901
info,attendance,49031
start,reveb001,"Ben Revere",0,1,8
start,rollj001,"Jimmy Rollins",0,2,6
start,utlec001,"Chase Utley",0,3,4
start,howar001,"Ryan Howard",0,4,3
start,byrdm001,"Marlon Byrd",0,5,9
id,TEX201404010
version,2
info,visteam,PHI
info,hometeam,TEX
As you can see, for each game the file loops back to a new id record.
I've read the file into an RDD, and then via a second for loop added a key for each iteration, which appears to work. But I was hoping to get some feedback on whether there is a cleaner way to do this using Spark methods.
logFile = '2014TEX.EVA'
event_data = (sc
              .textFile(logFile)
              .collect())

idKey = 0
newevent_list = []
for line in event_data:
    if line.startswith('id'):
        idKey += 1
        newevent_list.append((idKey, line))
    else:
        newevent_list.append((idKey, line))

event_data = sc.parallelize(newevent_list)
PySpark since version 1.1 supports Hadoop input formats. You can use the textinputformat.record.delimiter option to set a custom record delimiter, as below:
from operator import itemgetter
retrosheet = sc.newAPIHadoopFile(
    '/path/to/retrosheet/file',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\nid,'}
)

(retrosheet
    .filter(itemgetter(1))
    .values()
    .filter(lambda x: x)
    .map(lambda v: (
        v if v.startswith('id') else 'id,{0}'.format(v)).splitlines()))
Since Spark 2.4 you can also read the data into a DataFrame using the text reader:
spark.read.option("lineSep", '\nid,').text('/path/to/retrosheet/file')
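As with the RDD version, the custom delimiter is consumed, so (as a hedged follow-up sketch rather than part of the original answer) you may want to put the 'id,' prefix back on every record before splitting it into lines:

from pyspark.sql import functions as F

games = spark.read.option("lineSep", '\nid,').text('/path/to/retrosheet/file')

# Re-attach the 'id,' prefix that the delimiter stripped from all but the first record
games = games.select(
    F.when(F.col("value").startswith("id,"), F.col("value"))
     .otherwise(F.concat(F.lit("id,"), F.col("value")))
     .alias("value"))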