Problems joining two Pandas Dataframes - python

I'm trying to create a report of the cards I have in Trello through Rest API, where I need to show in the same report the card data and the names of the members assigned to each card.
The problem is that the Trello JSON is very cumbersome, and I need to make several queries, and then merge the different data frames.
I'm currently stuck, trying to add the cardmember names to the main card data frame.
I'm sending you a summary of the problem:
I have created the main data frame (trello_dataframe), where I have card-level information from Trello, including the "ID Members" column (trello_dataframe['ID Members']), stored as lists, which I need to merge with another data frame.
More info about trello_dataframe: https://prnt.sc/boC6OL50Glwu
The second data frame (df_response_members) results from the query at the board member level, where I have 3 columns: ID Members (df_response_members['ID Members']), FullName (df_response_members['Member (Full Name)']), and Username (df_response_members['Member (Username)']).
More info about "df_response_members": https://prnt.sc/x6tmzI04rohs
Now I want to merge these two data frames on df_response_members['ID Members'], so that the full name and username of the card members appear in the card data frame (the main one).
The problem occurs when I try to merge the two data frames with the following code:
trello_dataframe = pd.merge(df_response_members, trello_dataframe, on="ID Members", how='outer')
which fails with the error:
TypeError: unhashable type: 'list'
Here is how I would like to see the main data frame: https://prnt.sc/7PSTmG2zahZO
Thank you in advance!

You can't do that, for two reasons: A) as the error says, lists aren't hashable, and DataFrame merge operations typically don't work on unhashable data types, and B) you are trying to merge a list column with a string column; both columns need the same type for a merge to work.
A solution could be to first use DataFrame.explode() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html) on your first DataFrame trello_dataframe using the 'ID Members' column; this will generate an independent row for each 'ID Member' in each list. Now you can perform your merge with this DataFrame.
To convert back to your desired format you can use GroupBy, as described here: How to implode (reverse of pandas explode) based on a column.
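A minimal sketch of that explode → merge → group-back round trip, using made-up card data and the column names from the screenshots:
import pandas as pd

trello_dataframe = pd.DataFrame({
    "Card Name": ["Card A", "Card B"],
    "ID Members": [["m1", "m2"], ["m2"]],          # list-valued column
})
df_response_members = pd.DataFrame({
    "ID Members": ["m1", "m2"],
    "Member (Full Name)": ["Alice Smith", "Bob Jones"],
    "Member (Username)": ["asmith", "bjones"],
})

# 1. explode the list column so each member id gets its own row
exploded = trello_dataframe.explode("ID Members")

# 2. both key columns now hold scalars, so the merge works
merged = exploded.merge(df_response_members, on="ID Members", how="left")

# 3. optionally "implode" back to one row per card, collecting lists again
result = (merged.groupby("Card Name", as_index=False)
                .agg({"ID Members": list,
                      "Member (Full Name)": list,
                      "Member (Username)": list}))
print(result)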

Related

Converting for loop to numpy calculation for pandas dataframes

So I have a Python script that compares two dataframes and finds any rows that are not in both. It currently iterates through a for loop, which is slow.
I want to improve the speed of the process, and I know that iteration is the problem. However, I haven't had much luck with various numpy/pandas methods such as merge and where.
Couple of caveats:
The column names from my file sources aren't the same, so I set their names into variables and use the variable names to compare.
I want to only use the column names from one of the dataframes.
df_new represents new information to be checked against what is currently on file (df_current).
My current code:
set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)
for i in range(len(df_new.index)):
    # if the row entry is new, we add it to our dataset
    if not df_new[new_col_name][i] in set_current:
        df_out.loc[len(df_out)] = df_new.iloc[i]
    # if the row entry is a match, then we aren't going to do anything with it
    else:
        continue
# create a xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Here are some simple examples of dataframes I would be working with:
df_current
|partno|description|category|cost|price|upc|brand|color|size|year|
|:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---|
|123|Logo T-Shirt||25|49.99||apple|red|large|2021|
|456|Knitted Shirt||35|69.99||apple|green|medium|2021|
df_new
|mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year|
|:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---|
|456|Knitted Shirt||35|69.99|||apple|green|medium|2021|
|789|Logo Vest||20|39.99|||apple|yellow|small|2022|
There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.
I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.
It's still a bit unclear what you want without examples of each dataframe. But if you want to test unique IDs in differently named columns in two different dataframes, try an approach like this.
Find the IDs that exist in the second dataframe:
test_ids = df2['cola_id'].unique().tolist()
Then filter the first dataframe for those IDs:
df1[df1['keep_id'].isin(test_ids)]
Here is the answer that works; it was supplied to me by someone much smarter.
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
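A small demonstration of that one-liner, using the example rows from the question (columns shortened):
import pandas as pd

df_current = pd.DataFrame({"partno": [123, 456],
                           "description": ["Logo T-Shirt", "Knitted Shirt"]})
df_new = pd.DataFrame({"mfgr_num": [456, 789],
                       "desc": ["Knitted Shirt", "Logo Vest"]})

current_col_name = "partno"
new_col_name = "mfgr_num"

# keep only the rows of df_new whose key is NOT present in df_current
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
print(df_out)  # only the 789 / Logo Vest row remains
# df_out.to_excel("data/new_products_to_examine.xlsx", index=False)  # as in the original script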

Break a dictionary out of a StringType column in a spark dataframe

I have a spark table that I want to read in Python (I'm using Python 3 in Databricks). In effect the structure is below. The log data is stored in a single string column but is a dictionary.
How do I break out the dictionary items to read them?
dfstates = spark.createDataFrame([
    ['{"EVENT_ID":"123829:0","EVENT_TS":"2020-06-22T10:16:01.000+0000","RECORD_INDEX":0}', 'texas', '24', '01/04/2019'],
    ['{"EVENT_ID":"123829:1","EVENT_TS":"2020-06-22T10:16:01.000+0000","RECORD_INDEX":1}', 'colorado', '13', '01/07/2019'],
    ['{"EVENT_ID":"123828:0","EVENT_TS":"2020-06-20T21:17:39.000+0000","RECORD_INDEX":0}', 'maine', '14', '']
]).toDF('LogData', 'State', 'Orders', 'OrdDate')
What I want to do is read the spark table into a dataframe, find the max event timestamp, find the rows with that timestamp, then count and read just those rows into a new dataframe with the data columns, and, from the log data, add columns for the event id (without the record index), event date and record index.
Downstream I'll be validating the data, converting from StringType to appropriate data type and filling in missing or invalid values as appropriate. All along I'll be asserting that row counts = original row counts.
The only thing I'm stuck on though is how to read this log data column and change it to something I can work with. Something in spark like pandas.series()?
You can split a single struct-type column into multiple columns using dfstates.select('LogData.*'). Refer to this answer: How to split a list to multiple columns in Pyspark?
Once you have separate columns, you can do standard pyspark operations like filtering.
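Since LogData here is a plain string rather than a struct, it likely needs to be parsed first; a rough sketch using from_json, assuming the dfstates frame from the question and the three keys shown in the log strings:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

log_schema = StructType([
    StructField("EVENT_ID", StringType()),
    StructField("EVENT_TS", StringType()),
    StructField("RECORD_INDEX", IntegerType()),
])

parsed = (dfstates
          .withColumn("LogData", F.from_json("LogData", log_schema))   # string -> struct
          .select("State", "Orders", "OrdDate", "LogData.*"))          # flatten the struct

# derive the extra columns the question asks for
parsed = (parsed
          .withColumn("EVENT_ID_BASE", F.split("EVENT_ID", ":").getItem(0))
          .withColumn("EVENT_DATE", F.to_date(F.col("EVENT_TS").substr(1, 10))))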

pyspark merge import column from one df to another based on matching data

I am extremely new to working with data frames.
I have two frames.
One is called new; the other is called existing.
new has a single column called ID. existing has three columns: ID, color, size.
I want to operate on these frames such that when a row can be found in new with the same ID as a row in existing, we add the value of the color column (but not the size column) to the new data frame. If no match is found, I would like to assign a random value to the color column of new.
It occurs to me that I can do this with rdd.map but I am trying to restrict myself to working with frames because I'm told it's more efficient.
What you are looking for is a join, a left join to be precise:
from pyspark.sql import functions as f
new_df = new_df.join(existing_df, "id", "left_outer") \
    .select(new_df.id, f.coalesce(f.col("color"), f.rand()))
The coalesce function will give you color if it is not null (i.e. if there is a match) or a random number. You probably need to map the random number to your color spectrum somehow (dependent on what representation you have there).
As a general note: Using dataframes and the spark-sql API is way faster than doing RDD operations
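A self-contained sketch of that join with made-up rows (the rand() value is cast to string here so coalesce works on one type; map it to an actual color however you like):
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()
new_df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
existing_df = spark.createDataFrame([(1, "red", "L"), (3, "blue", "M")],
                                    ["id", "color", "size"])

result = (new_df.join(existing_df, "id", "left_outer")
                .select("id", f.coalesce(f.col("color"),
                                         f.rand().cast("string")).alias("color")))
result.show()
# ids 1 and 3 keep their color; id 2 gets a random number to be mapped to a color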

Graphlab SFrames: Error in using SFrames with the dataset

In Graphlab,
I am working with a small set of fitness data, to use recommender functions that could provide recommendations. The dataset has a userid column but no item ids; instead, the different items are arranged in columns, with their respective ratings in the rows corresponding to each userid. In order to use any graphlab recommender method, I need to have userids and item ids. Here is what I did:
v = graphlab.SFrame.read_csv('Data.csv')
userId = v["user_id"]
itemId = v["x","y","z","x1","y1","z1"]  # x,y,z,x1,y1,z1 are activities that are actually the columns in Data and contain the corresponding ratings given by each user
sf= graphlab.SFrame({'UserId':userId,'ItemId':itemId})
print sf.head(5)
Basically, I extracted the user_id column from Data and tried making a column for ItemId using the x, y, z, etc. columns extracted from the same data, in order to make another sframe with just these 2 columns. This code results in a tabular-format sframe with 2 columns as expected, but not arranged in the same order I pass the arguments to SFrame. So the output gives ItemId as the first column and then UserId. Even though I tried to change the order of passing these 2 to SFrame, it still gives the same output. Does anyone know the reason why?
This is creating a problem further when using any recommender method as it gives the error: Column name user_id does not exist.
The reason for the column ordering is because you are passing a Python dictionary to the SFrame constructor. Dictionaries in Python will not keep keys in the order they are specified; they have their own order. If you prefer "UserId" to be first, you can call sf.swap_columns('UserId','ItemId').
The order of the columns does not affect the recommender method though. The Column name 'user_id' does not exist error will appear if you don't have a column named exactly user_id AND don't specify what the name of the user_id column is. In your case, you would want to do: graphlab.recommender.create(sf, user_id='UserId', item_id='ItemId').
Also, you may want to look at the stack method, which could help get your data in to the form the recommender method expects. Your current SFrame sf I think will have a column of dictionaries where the item id is the key and the rating is the value. I believe this would work in this case:
sf.stack('ItemId', new_column_name=['ItemId','Rating'])
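A rough illustration of what stack does on a dictionary column (GraphLab Create API from memory, with invented ratings, so treat it as a sketch):
import graphlab

sf = graphlab.SFrame({
    'UserId': ['u1', 'u2'],
    'ItemId': [{'x': 5, 'y': 3}, {'x': 4, 'z': 1}],   # item -> rating per user
})

# each (key, value) pair becomes its own row: one row per user/item/rating
stacked = sf.stack('ItemId', new_column_name=['ItemId', 'Rating'])
print stacked.head(5)
# UserId | ItemId | Rating
# u1     | x      | 5
# u1     | y      | 3
# ...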

multiple features in collaborative filtering- spark

I have a CSV file that looks like:
customer_ID, location, ....other info..., item-bought, score
I am trying to build a collaborative filtering recommender in Spark. Spark takes data of the form:
userID, itemID, value
but my data has more columns; I want all of a user's info to be used instead of just the userID. I tried grouping the columns into one column as:
(customerID,location,....),itemID,score
but the ALS.train gives me this error:
TypeError: int() argument must be a string or a number, not 'tuple'
How can I let spark take multiple key/values and not only three columns?
thanks
For each customer, identify the columns which you would like to use to distinguish these user-entities. Create a table (e.g. in SQL) in which each row contains the information for one user-entity, and use the row number in this table as the userID.
Do the same for your items if necessary, and provide these IDs to your classifier.
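One way to do that numbering in Spark itself (a sketch with hypothetical file and column names, since the real ones aren't shown): build a table of distinct user-entities, give each a consecutive integer, and join it back before handing (userID, itemID, score) triples to ALS.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("ratings.csv", header=True)   # customer_ID, location, ..., item_bought, score

# columns that together define one "user-entity"
user_cols = ["customer_ID", "location"]

# one row per distinct user-entity, numbered consecutively
users = (df.select(*user_cols).distinct()
           .withColumn("userID", F.row_number().over(Window.orderBy(*user_cols)) - 1))

# attach the integer userID back and keep only the triple ALS expects
ratings = (df.join(users, on=user_cols)
             .select("userID",
                     F.col("item_bought").alias("itemID"),
                     F.col("score").cast("double").alias("value")))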
