Convert XML data to pandas dataframe via pyspark.sql.dataframe - python

My background: long-time SAS and R user, trying to figure out how to do some elementary things in Azure Databricks using Python and Spark. Sorry for the lack of a reproducible example below; I'm not sure how to create one like this.
I'm trying to read data from a complicated XML file. I've reached this point, where I have a pyspark.sql.dataframe (call it xml1) with this arrangement:
RESPONSE:array
element:array
element:struct
VALUE:string
VARNAME:string
The xml1 dataframe looks like this:
[Row(RESPONSE=[[Row(VALUE='No', VARNAME='PROV_U'), Row(VALUE='Included', VARNAME='ADJSAMP'), Row(VALUE='65', VARNAME='AGE'), ...
When I use xml2=xml1.toPandas(), I get this:
RESPONSE
0 [[(No, PROV_U), (Included, ADJSAMP), (65, AGE)...
1 [[(Included, ADJSAMP), (71, AGE), ...
...
At a minimum, I would like to convert this to a Pandas dataframe with two columns VARNAME and VALUE. A better solution would be a dataframe with columns named with VARNAME values (such as PROV_U, ADJSAMP, AGE), with one row per RESPONSE. Helpful hints with names of correct Python terms in intermediate steps are appreciated!

To deal with array of structs explode is your answer. Here is link on how to use explode https://hadoopist.wordpress.com/2016/05/16/how-to-handle-nested-dataarray-of-structures-or-multiple-explodes-in-sparkscala-and-pyspark/

Related

why and how to solve the data lost when multi encode in python pandas

Download the Data Here
Hi, I have a data something like below, and would like to multi label the data.
something to like this: target
But the problem here is data lost when multilabel it, something like below:
issue
using the coding of:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(sparse_output=True)
df_enc = df.drop('movieId', 1).join(df.movieId.str.join('|').str.get_dummies())
Someone can help me, feel free to download the dataset, thank you.
So that column when read in with pandas will be stored as a string. So first we'd need to convert that to an actual list.
From there use .explode() to expand out that list into a series (where the index will match the index it came from, and the column values will be the values in that list).
Then crosstab that from a series into each row and column being the value.
Then join that back up with the dataframe on the index values.
Keep in mind, when you do one-hot-encoding with high cardinality, you're table will blow up into a huge wide table. I just did this on the first 20 rows, and ended up with 233 columns. with the 225,000 + rows, it'll take a while (maybe a minute or so) to process and you end up with close to 1300 columns. This may be too complex for machine learning to do anything useful with it (although maybe would work with deep learning). You could still try it though and see what you get. What I would suggest to test out is find a way to simplify it a bit to make it less complex. Perhaps find a way to combine movie ids in a set number of genres or something like that? But then test to see if simplifying it improves your model/performance.
import pandas as pd
from ast import literal_eval
df = pd.read_csv('ratings_action.csv')
df.movieId = df.movieId.apply(literal_eval)
s = df['movieId'].explode()
df = df[['userId']].join(pd.crosstab(s.index, s))

How can I iterate over columns of a csv file to split it into several files?

I am completely new to Python (I started last week!), so while I looked at similar questions, I have difficulty understanding what's going on and even more difficulty adapting them to my situation.
I have a csv file where rows are dates and columns are different regions (see image 1). I would like to create a file that has 3 columns: Date, Region, and Indicator where for each date and region name the third column would have the correct indicator (see image 2).
I tried turning wide into long data, but I could not quite get it to work, as I said, I am completely new to Python. My second approach was to split it up by columns and then merge it again. I'd be grateful for any suggestions.
This gives your solution using stack() in pandas:
import pandas as pd
# In your case, use pd.read_csv instead of this:
frame = pd.DataFrame({
'Date': ['3/24/2020', '3/25/2020', '3/26/2020', '3/27/2020'],
'Algoma': [None,0,0,0],
'Brant': [None,1,0,0],
'Chatham': [None,0,0,0],
})
solution = frame.set_index('Date').stack().reset_index(name='Indicator').rename(columns={'level_1':'Region'})
solution.to_csv('solution.csv')
This is the inverse of doing a pivot, as explained here: Doing the opposite of pivot in pandas Python. As you can see there, you could also consider using the melt function as an alternative.
first, you're region column is currently 'one hot encoded'. What you are trying to do is to "reverse" one hot encode your region column. Maybe check if this link answers your question:
Reversing 'one-hot' encoding in Pandas.

Create an array from data in an excel file in python

I'm really new to python and pandas so would you please help me answer this seemingly simple question? I already have an excel file containing my data, now I want to create an array containing those data in python. For example, I have data in excel that look like this:
I want from those data to create a matrix of the form like the python code below:
Actually, my data is much longer so is there any way that I can take advantage of pandas to put the data from my excel file into a matrix in python similar to the simple example above?
Thank you!
you can put all your values into a=np.array([40,56,87,98,58,98,56,63]), then a.reshape(4,2) but in your case a.reshape(3,9). Hope you get my point.
You can use pandas.read_excel()
In the documentation there is also some examples like:
pd.read_excel('tmp.xlsx', index_col=0)
Name Value
0 string1 1
1 string2 2
2 #Comment 3

How to get data from object in Python

I want to get the discord.user_id, I am VERY new to python and just need help getting this data.
I have tried everything and there is no clear answer online.
currently, this works to get a data point in the attributes section
pledge.relationship('patron').attribute('first_name')
You should try this :
import pandas as pd
df = pd.read_json(path_to_your/file.json)
The ourput will be a DataFrame which is a matrix, in which the json attributes will be the names of the columns. You will have to manipulate it afterwards, which is preferable, as the operations on DataFrames are optimized in terms of processing time.
Here is the official documentation, take a look.
Assuming the whole object is call myObject, you can obtain the discord.user_id by calling myObject.json_data.attributes.social_connections.discord.user_id

How to do a calculation on each line of a pandas dataframe in python?

I am brand new to python, I am attempting to convert the function I made in R to Python, R function described here:
How to optimize this process?
From my reading it looks like the best way to do this in python would be to use a for loop that would take the following form
for line 1 in probe test
find user in U_lookup
find movie in M_lookup
take the value found in U_lookup and retrieve that line number from knn_text
take the values found in that row of knn_text, and retrieve the line numbers from dfm
for those line numbers in dfm, retrieve column=U_lookup
take the average of the non zero values found
save value into pandas datafame in new column for that line
Is this the most efficient (in terms of speed of calculation) way to complete an operation like this? Coming from R so I wasn't sure if there was better functionality for something like this within the pandas package or not.
As a followup, is there an equivalent in python to the function dput() in R? dput essentially provides code to easily share a subset of data for questions like this.
You can use df.apply(my_func, axis=1) to apply the function/calculation to each row of a dataframe.
Where, my_func would contain the required calculations

Categories