How can I add a value to a row in pyspark? - python

I have a dataframe that looks like this:
preds.take(1)
[Row(_1=0, _2=Row(val1=False, val2=1, val3='high_school'))]
I want the whole thing to be one row, without the nested row in there. So, the first value would get a name and be a part of the one row object. If I wanted to name it "ID", it would look like this:
preds.take(1)
[Row(ID=0, val1=False, val2=1, val3='high_school')]
I've tried various things within a map, but nothing is producing what I'm looking for (or getting errors). I've tried:
preds.map(lambda point: (point._1, point._2))
preds.map(lambda point: point._2.append(point._1))
preds.map(lambda point: point._2['ID']=point._1)
preds.map(lambda point: (point._2).ID=point._1)

Since Row is a tuple and tuples are immutable you can only create a new object. Using plain tuples:
from pyspark.sql import Row
r = Row(_1=0, _2=Row(val1=False, val2=1, val3='high_school'))
r[:1] + r[1]
## (0, False, 1, 'high_school')
or preserving __fields__:
Row(*r.__fields__[:1] + r[1].__fields__)(*r[:1] + r[1])
## Row(_1=0, val1=False, val2=1, val3='high_school')
In practice, operating directly on rows should be avoided in favor of using the DataFrame DSL, which does not fetch data into the Python interpreter:
df = sc.parallelize([r]).toDF()
df.select("_1", "_2.val1", "_2.val2", "_2.val3")
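The "build a new object" idea can be tried without a Spark session. The sketch below uses collections.namedtuple as a stand-in for Row, since both are immutable tuples with named fields. The names Outer, Inner and flatten are hypothetical, and Outer uses the field names first and second because namedtuple does not allow names such as _1 that start with an underscore.

```python
from collections import namedtuple

# Stand-ins for pyspark Rows: namedtuples are also immutable tuples with named fields.
Inner = namedtuple("Inner", ["val1", "val2", "val3"])
Outer = namedtuple("Outer", ["first", "second"])


def flatten(record, id_name="ID"):
    """Build a new flat record: the first value named id_name, then the nested fields."""
    nested = record[1]
    Flat = namedtuple("Flat", [id_name, *nested._fields])
    return Flat(record[0], *nested)


r = Outer(0, Inner(False, 1, "high_school"))
flat = flatten(r)
print(flat)
# Flat(ID=0, val1=False, val2=1, val3='high_school')
```

With real Row objects, the analogous step inside a map would presumably be something like Row(ID=point._1, **point._2.asDict()). That line is untested here, and on Spark versions before 3.0 Row sorts keyword fields alphabetically, so the field order may differ.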

Related

How can I align columns if rows have different number of values?

I am scraping data with python. I get a csv file and can split it into columns in excel later. But I am encountering an issue I have not been able to solve. Sometimes the scraped items have two statuses and sometimes just one. The second status is thus moving the other values in the columns to the right and as a result the dates are not all in the same column which would be useful to sort the rows.
Do you have any idea how to make the columns merge if there are two statuses for example or other solutions?
Maybe it is also an issue that I still need to separate the values into columns manually with Excel.
Here is my code
#call packages
import random
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd
# define driver etc.
service_obj = Service("C:\\Users\\joerg\\PycharmProjects\\dynamic2\\chromedriver.exe")
browser = webdriver.Chrome(service=service_obj)
# create loop
initiative_list = []
for i in range(0, 2):
    url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page='+str(i)
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
    initiatives = [item.text for item in initiative_item]
    initiative_list.extend(initiatives)
df = pd.DataFrame(initiative_list)
#create csv
print(df)
df.to_csv('Initiativen.csv')
df.columns = ['tosplit']
new_df = df['tosplit'].str.split('\n', expand=True)
print(new_df)
new_df.to_csv('Initiativennew.csv')
I tried to merge the columns if there are two statuses.
make the columns merge if there are two statuses for example or other solutions
[If by "statuses" you mean the yellow labels ending in OPEN/UPCOMING/etc, then] it should be taken care of by the following parts of the getDetails_iiaRow (below the dividing line):
labels = cssSelect(iiaEl, 'div.field span.label')
and then
'labels': ', '.join([l.text.strip() for l in labels])
So, multiple labels will be separated by commas (or any other separator you apply .join to).
initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
initiatives = [item.text for item in initiative_item]
Instead of doing it like this and then having to split and clean things, you should consider extracting each item in a more specific manner and have each "row" be represented as a dictionary (with the column-names as the keys, so nothing gets mis-aligned later). If you wrap it as a function:
def cssSelect(el, sel):
    return el.find_elements(By.CSS_SELECTOR, sel)
def getDetails_iiaRow(iiaEl):
    title = cssSelect(iiaEl, 'div.search-result-title')
    labels = cssSelect(iiaEl, 'div.field span.label')
    iiarDets = {
        'title': title[0].text.strip() if title else None,
        # multiple labels (statuses) end up in one comma-separated cell
        'labels': ', '.join([l.text.strip() for l in labels])
    }
    # each remaining field is a name div followed by a value div
    cvSel = 'div[translate]+div:last-child'
    for c in cssSelect(iiaEl, f'div:has(>{cvSel})'):
        colName = cssSelect(c, 'div[translate]')[0].text.strip()
        iiarDets[colName] = cssSelect(c, cvSel)[0].text.strip()
    link = iiaEl.get_attribute('href')
    # make site-relative links absolute
    if link and link[:1] == '/':
        link = f'https://ec.europa.eu{link}'
    iiarDets['link'] = link
    return iiarDets
then you can simply loop through the pages like:
initiative_list = []
for i in range(0, 2):
    url = f'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page={i}'
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_list += [
        getDetails_iiaRow(iia) for iia in
        cssSelect(browser, 'initivative-item>article>a')
    ]
and then, since it's all cleaned already, you can directly save the data with
pd.DataFrame(initiative_list).to_csv('Initiativen.csv', index=False)
The output I got for the first 3 pages looks like:
I think it is worth working a little bit harder to get your data rationalised before putting it in the csv rather than trying to unpick the damage once ragged data has been exported.
A quick look at each record in the page suggests that there are five main items that you want to export and these correspond to the five top-level divs in the a element.
The complexity (as you note) comes because there are sometimes two statuses specified, and in that case there is sometimes a separate date range for each and sometimes a single date range.
I have therefore chosen to put the three ever present fields as the first three columns, followed next by the status + date range columns as pairs. Finally I have removed the field names (these should effectively become the column headings) to leave only the variable data in the rows.
initiatives = [processDiv(item) for item in initiative_item]
def processDiv(item):
    divs = item.find_elements(By.XPATH, "./article/a/div")
    if "\n" in divs[0].text:
        statuses = divs[0].text.split("\n")
        if len(divs) > 5:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1], statuses[0], divs[4].text.split("\n")[1], statuses[1], divs[5].text.split("\n")[1]]
        else:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1], statuses[0], divs[4].text.split("\n")[1], statuses[1], divs[4].text.split("\n")[1]]
    else:
        return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1], divs[0].text, divs[4].text.split("\n")[1]]
The above approach sticks as close to yours as I can. You will clearly need to rework the pandas code to reflect the slightly altered data structure.
Personally, I would invest even more time in clearly identifying the best definitions for the fields that represent each piece of data that you wish to retrieve (rather than as simply divs 0-5), and extract the text directly from them (rather than messing around with split). In this way you are far more likely to create robust code that can be maintained over time (perhaps not your goal).
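To illustrate why one dictionary per row keeps the columns aligned, here is a minimal sketch using only the standard library. The rows, the dates and the "Feedback period" column name are made-up sample data, not scraped output. csv.DictWriter writes each value under its own key and fills missing keys with restval, so a second status or an absent field cannot shift the other values sideways.

```python
import csv
import io

# Hypothetical scraped rows: one item has two statuses, one has a single status,
# and one is missing a field entirely.
rows = [
    {"title": "Initiative A", "labels": "OPEN, UPCOMING", "Feedback period": "01 Jan 2023 - 01 Mar 2023"},
    {"title": "Initiative B", "labels": "CLOSED", "Feedback period": "05 Feb 2022 - 05 Apr 2022"},
    {"title": "Initiative C", "labels": "UPCOMING"},
]

# Collect every column name that occurs, preserving first-seen order.
fieldnames = []
for row in rows:
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)

buffer = io.StringIO()  # stands in for open("Initiativen.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Because the two statuses are joined into one quoted cell, the date always lands in the third column and the file can be sorted by it directly.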

How to combine a Pandas dataframe with a Tensor dataset?

I have a Tensor dataset that is a list of file names and a Pandas dataframe that contains metadata for each file.
filename_ds = tf.data.Dataset.list_files(path + "/*.bmp")
metadata_df = pandas.read_csv(path + "/metadata.csv")
File names contain an idx that references a line in the metadata dataframe, like "3_data.bmp" where 3 is the idx. I hoped to call filename_ds.map(combine_data).
It turns out not to be as simple as parsing the file name and doing a dataframe lookup. The following fails because filename is a Tensor: the function passed to Dataset.map() runs in graph mode rather than eagerly, so I do not have access to eager-only methods like .numpy() and cannot get a string value from the filename to do my regex and df lookup.
def combine_data(filename):
    idx = re.findall(r"(\d+)_data\.bmp", filename)[0]
    val = metadata_df.loc[metadata_df["idx"] == idx]["test-col"]
    ...
New to Tensorflow, and I suspect I'm going about this in an odd way. What would be the correct way to go about this? I could list my files and concatenate a dataset for each file, but I'm wondering if I'm just missing the "Tensorflow way" of doing it.
One way to iterate over the file names as plain Python values is as_numpy_iterator():
dataset_list = list(filename_ds.as_numpy_iterator())
for each_file in dataset_list:
    file_name = each_file.decode('utf-8')  # this will contain the abs path /user/me/so/file_1.png
    try:
        idx = re.findall(r"(\d+).*\.png", file_name)[0]  # changed for my case
    except IndexError:
        # no number found in this file name, so skip it
        print("Exception==>")
        continue
    print(f"File:{file_name},idx:{idx}")
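Another option, sketched below under some assumptions, is to resolve the metadata in plain Python before any dataset is built, so that no string parsing has to happen inside Dataset.map(). The paths and the metadata dictionary are hypothetical stand-ins for the file listing and for the "idx" and "test-col" columns of metadata.csv, and pair_with_metadata is an illustrative helper, not a TensorFlow function. The sketch also assumes that idx is stored as an integer.

```python
import os
import re

# Hypothetical stand-ins: in the question these would come from listing
# path + "/*.bmp" and from pandas.read_csv(path + "/metadata.csv").
file_paths = ["/data/3_data.bmp", "/data/10_data.bmp", "/data/notes.txt"]
metadata = {3: 0.25, 10: 0.75}  # idx -> value of "test-col"

pattern = re.compile(r"(\d+)_data\.bmp$")


def pair_with_metadata(paths, lookup):
    """Return (path, value) pairs for files whose idx has a metadata entry."""
    pairs = []
    for p in paths:
        match = pattern.search(os.path.basename(p))
        if match is None:
            continue  # not a data file
        idx = int(match.group(1))
        if idx in lookup:
            pairs.append((p, lookup[idx]))
    return pairs


pairs = pair_with_metadata(file_paths, metadata)
print(pairs)
# [('/data/3_data.bmp', 0.25), ('/data/10_data.bmp', 0.75)]

# With TensorFlow available, the two columns could then be combined into one dataset, e.g.
# tf.data.Dataset.from_tensor_slices(([p for p, _ in pairs], [v for _, v in pairs]))
```

The resulting dataset would yield (filename, value) tuples, so a later map only needs to load the image and never touches the dataframe.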

pyspark - dynamically type cast and alias using metadata defined in json variable

Using Apache Spark 3.0.1
I have an incoming data frame df_src like the following.
df_src = spark.createDataFrame(
    [
        ('1', 'foo', '2020-02-03', '2020-03-24T09:21:20+00:00'),
        ('2', 'bar', '2019-01-29', '2020-01-15T17:00:20+00:00'),
    ],
    ['a_col', 'b_col', 'c_col', 'd_col']
)
I also have metadata (a JSON array of objects) that I would like to apply to df_src.
I need to do 3 things:
1. select only the needed columns (projection)
2. apply the data type
3. apply the alias
I have tried the following and got steps 1 and 2.
import json
metadata_json = """
[
{"source_field":"a_col", "alias":"x_col", "datatype":"string"}
,{"source_field":"b_col", "alias":"y_col", "datatype":"timestamp"}
,{"source_field":"c_col", "alias":"z_col", "datatype":"string"}
]
"""
# Transform json input to python objects
metadata_dict = json.loads(metadata_json)
# Filter python objects with list comprehensions
source_fields = [x['source_field'] for x in metadata_dict]
print(source_fields)
# 1) projection - DONE i.e. my incoming data frame had d_col which I do not need - so doing select here based on metadata_json
df = df_src.select(source_fields)
# 2) apply data type - DONE i.e. from metadata_json I am selecting fields that need to be timestamp and casting.
from pyspark.sql.functions import col
df_dest = df
for column in metadata_dict:
    if column['datatype'] == 'timestamp':
        # reassign df_dest so that casts from earlier iterations are kept
        df_dest = df_dest.withColumn(column['source_field'], col(column['source_field']).cast("timestamp"))
# 3) apply alias ?
After step 2 my destination data frame df_dest looks as above.
Now, how do I apply alias dynamically based on metadata_json above using pyspark? (Also please suggest if there is an elegant way to do all the 3 steps, I cannot change the metadata_json)
Given your input object (and straightforward strings), consider something like this:
import pyspark.sql.functions as F
# backticks protect the column names against "." and other special characters
# (df_src is the incoming data frame from the question)
df_src.select(
    *[
        F.col(f"`{x['source_field']}`").cast(x["datatype"]).alias(x["alias"])
        for x in metadata_dict
    ]
)
If your strings become a little more complex, a simple cast() may not be enough. If that's the case, consider wrapping the entire F.col().cast().alias() expression in a function that implements a simple strategy pattern (or an if ... elif ... else switch) and can handle more complex logic.
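As a variation on the same idea, the metadata can also be turned into SQL expression strings, which makes the three steps (projection, cast, alias) easy to inspect before Spark is involved. This is only a sketch: build_select_exprs is a hypothetical helper, and it assumes the metadata is trusted, because the values are inserted into the expressions without escaping.

```python
import json

metadata_json = """
[
 {"source_field":"a_col", "alias":"x_col", "datatype":"string"}
,{"source_field":"b_col", "alias":"y_col", "datatype":"timestamp"}
,{"source_field":"c_col", "alias":"z_col", "datatype":"string"}
]
"""


def build_select_exprs(metadata):
    """Turn each metadata entry into a 'CAST(`src` AS type) AS `alias`' expression."""
    return [
        "CAST(`{source_field}` AS {datatype}) AS `{alias}`".format(**entry)
        for entry in metadata
    ]


exprs = build_select_exprs(json.loads(metadata_json))
for e in exprs:
    print(e)
# CAST(`a_col` AS string) AS `x_col`
# CAST(`b_col` AS timestamp) AS `y_col`
# CAST(`c_col` AS string) AS `z_col`

# With a Spark session available, these could be passed to df_src.selectExpr(*exprs).
```

Only the columns listed in the metadata appear in the expressions, so d_col is dropped without a separate select step.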

Loop through list of dataframes and save as new dataframe name

I'm trying to loop through a list of dataframes and perform operations on them. In the final command I want to rename the dataframe as the original key plus '_rand_test'. I'm getting the error:
SyntaxError: cannot assign to operator
Is there a way to do this?
segments = [main_h, main_m, main_l]
seg_name = ['main_h', 'main_m', 'main_l']
for i in segments:
    control = pd.DataFrame(i.groupby('State', group_keys=False).apply(lambda x : x.sample(frac = .1)))
    control['segment'] = 'control'
    test= i[~i.index.isin(control.index)]
    test['segment'] = 'test'
    seg_name[i]+'_rand_test' = pd.concat([control,test])
The error is because you are trying to perform addition on the left side of an = sign, which you can never do. If you want to rename the dataframe you could just do it on the next line. I'm unsure of what exactly you're trying to rename based off of the code, but if it's just the corresponding string in the seg_name list then the next line would look like this:
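seg_name[segments.index(i)] += '_rand_test'
The reason for the segments.index(i) is that you're looping over the elements in segments, not their indexes, so you need to look up the index of the element. Looping with for idx, i in enumerate(segments) gives you that index directly and avoids searching the list for a DataFrame.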
Maybe this will work for you?
Create an empty list before you run the loop and fill it with the append function. Then build the new names from seg_name, so that new_list[k] is the result that belongs to new_names_list[k].
segments = [main_h, main_m, main_l]
seg_name = ['main_h', 'main_m', 'main_l']
new_list = []
for i in segments:
    control = pd.DataFrame(i.groupby('State', group_keys=False).apply(lambda x: x.sample(frac=.1)))
    control['segment'] = 'control'
    # copy so that adding a column does not write into a slice of i
    test = i[~i.index.isin(control.index)].copy()
    test['segment'] = 'test'
    new_list.append(pd.concat([control, test]))
new_names_list = [name + '_rand_test' for name in seg_name]
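Since an assignment target cannot be built from a string, a common substitute is a dictionary keyed by the new name. The sketch below shows that pattern with plain lists of dicts standing in for the DataFrames. split_control_test is a hypothetical helper that samples from the whole list rather than per State, so it only illustrates the naming pattern, not the exact stratified sampling from the question.

```python
import random

# Stand-ins for the DataFrames main_h, main_m, main_l: lists of row dicts.
segments = {
    "main_h": [{"id": n, "State": "CA"} for n in range(20)],
    "main_m": [{"id": n, "State": "NY"} for n in range(30)],
    "main_l": [{"id": n, "State": "TX"} for n in range(10)],
}


def split_control_test(rows, frac=0.1, seed=0):
    """Tag a random `frac` of the rows as 'control' and the rest as 'test'."""
    rng = random.Random(seed)
    control_ids = set(rng.sample(range(len(rows)), k=round(len(rows) * frac)))
    return [
        {**row, "segment": "control" if pos in control_ids else "test"}
        for pos, row in enumerate(rows)
    ]


# One dictionary entry per result, keyed by the original name plus the suffix.
results = {
    f"{name}_rand_test": split_control_test(rows)
    for name, rows in segments.items()
}

print(sorted(results))
print(sum(r["segment"] == "control" for r in results["main_h_rand_test"]))
```

Each result is then available as results['main_h_rand_test'] and so on, which plays the role of the dynamically named variable.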

How to work with Rows/Columns from CSV files?

I have about 10 columns of data in a CSV file that I want to get statistics on using python. I am currently using the import csv module to open the file and read the contents. But I also want to look at 2 particular columns to compare data and get a percentage of accuracy based on the data.
Although I can open the file and parse through the rows I cannot figure out for example how to compare:
Row[i] Column[8] with Row[i] Column[10]
My pseudo code would be something like this:
category = Row[i] Column[8]
label = Row[i] Column[10]
if(category!=label):
    difference+=1
    totalChecked+=1
else:
    correct+=1
    totalChecked+=1
The only thing I am able to do is to read the entire row. But I want to get the exact Row and Column of my 2 variables category and label and compare them.
How do I work with specific row/columns for an entire excel sheet?
Convert both columns to pandas dataframes and compare them in a similar way to this example. Whatever dataset you're working on, loading it with the pandas module (alongside any other relevant modules) and transforming the data into lists and dataframes would be the first step to working with it, in my opinion.
I've taken the time and effort to delve into this myself, as it will be useful to me going forward. Columns don't have to have the same length in this example, so that's good. I've tested the code below (Python 3.8) and it works.
With only slight adaptations it can be used for your specific data columns, objects and purposes.
import re
import pandas as pd
A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv')
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')
print(A.columns)
print(B.columns)
my_unknown_id = A['Unknown_sample_no'].tolist()
my_unknown_seq = A['Unknown_sample_seq'].tolist()
Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist()
# map each ID to its sequence
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1))
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)
filename = 'seq_match_compare2.csv'
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
# 'a' appends to an existing file (the original example used 'w')
with open(filename, 'a') as f:
    f.write(headers)
    for ID, seq in Unknown_dict.items():
        for species, seq1 in Ref_dict.items():
            m = re.search(seq, seq1)
            if m:
                match = m.group()
                pos = m.start() + 1  # 1-based start position
                f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')
I also tried it myself, assuming your columns contain integers and following your specification as best I can at the moment. It's my first attempt without web scraping, so go easy. You could use my code below as a benchmark for how to move forward on your question.
Basically it gives you the skeleton of what you want: it imports the CSV in Python using the pandas module, converts it to a dataframe, works on specific columns only, makes a new column of results, prints the results alongside the original data in the terminal, and saves them to a new CSV. It's as messy as my Python is, but it works! Personally (and professionally) speaking it is a milestone for me, and I will hopefully be working on it at a later date to improve its readability, scope and functionality.
# This is a work in progress (although it does run and does the job). There are redundant lines of code in it, even among the lines that are not commented out, because I'm a self-taught newbie learning on my weekends. It shows how you could convert your columns and rows into lists with pandas dataframes, start doing calculations with them in Python, and get your results back out to a new CSV. It's a start on how you can answer your question going forward.
# It is for the asker to extend and improve, but in basic terms it does what was asked for.
import pandas as pd
from pandas import DataFrame
import csv
import itertools #redundant now'?
A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')
A["Category"] = A["Category"].fillna("empty data - missing value")
#A["Blank1"].fillna("empty data - missing value", inplace = True)
# ...etc
print(A.columns)
MyCat=A['Category'].tolist()
MyLab=A['Label'].tolist()
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()
#Ref_dict0 = zip(My_Labs, My_Cats) #good to compare whole columns as block, Enumerate ZIP 19:06 01/06/2020 FORGET THIS FOR NOW, WAS PART OF A LATTER ATTEMPT TO COMPARE TEXT & MISSED TEXT WITH INTERGER FIELDS. DOESNT EFFECT PROGRAM
Ref_dict = dict(zip(My_Labs, My_Cats))
Compareprep = dict(zip(My_Cats, My_Labs))
Ref_dict = dict(zip(My_Cats, My_Labs))
print(Ref_dict)
import re #this is for string matching & comparison. redundant in my example here but youll need it to compare tables if strings.
#filename = 'CATS&LABS64.csv' # when i got to exporting part, this is redundant now
#csvfile = open(filename, 'a') #when i tried to export results/output it first time - redundant
print("Given Dataframe :\n", A)
# new column: element-wise difference between the two numeric columns
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of score1 and score2 :\n", A)
# You can do other matches, comparisons and calculations here and add them to the output.
# print() returns None, so save the dataframe itself rather than the result of print()
A.to_csv('some_name5523.csv')
Yes, I know it's by no means perfect, but I wanted to give you a heads-up about pandas and dataframes for doing what you want moving forward.
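For comparison, the pseudocode from the question can also be written with the csv module alone. This is a minimal sketch on made-up data: the sample rows and the column positions are assumptions, and it is not clear from the question whether columns 8 and 10 are counted from 0 or from 1, so the two constants should be adjusted to the real file.

```python
import csv
import io

# Hypothetical data standing in for the real file; with a file on disk,
# replace io.StringIO(...) with open("data.csv", newline="").
sample = io.StringIO(
    "id,category,label\n"
    "1,cat,cat\n"
    "2,dog,cat\n"
    "3,bird,bird\n"
    "4,dog,dog\n"
)

CATEGORY_COL = 1  # zero-based positions in this sample; the question's
LABEL_COL = 2     # columns 8 and 10 would be 7 and 9 if counted from 1

reader = csv.reader(sample)
next(reader)  # skip the header row

correct = difference = 0
for row in reader:
    # each row is a list, so a single cell is row[column_index]
    if row[CATEGORY_COL] == row[LABEL_COL]:
        correct += 1
    else:
        difference += 1

total_checked = correct + difference
accuracy = 100 * correct / total_checked if total_checked else 0.0
print(f"{correct}/{total_checked} match ({accuracy:.1f}% accuracy)")
# 3/4 match (75.0% accuracy)
```

The reader yields one list per row, so "Row[i] Column[8]" is simply the ninth element of the list for that row (or the eighth, if the count starts at 1).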
