How to count variable in one dataframe base on another dataframe? - python

I have two df one called 'order' and another called 'asian_food'. Two table have a common column 'product_id'. I want to know how many time each of the product in the 'asian_food' table was ordered in the 'order' table.
'order' table:
'asian_food' table:
I've tried the following code:
asian['frequency'] = asian['product_id'].map(order_copy['product_id'].value_counts()).fillna(0).astype(int)
but it returns a error saying:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.Try using .loc[row_indexer,col_indexer] = value instead
How can I use .loc to get what I want? Thank you in advance.

Could you do something like this?
def get_order_count_totals(order_df, asian_food_df):
"""
Function returns a dataframe with the following columns:
| product_id | product_name | total_orders |
|------------:|:--------------:|:--------------:|
| 14 | Asian Food | 1 |
"""
df = order_df.merge(asian_food_df, on="product_id")
df = df.groupby(["product_id", "product_name"])["order_id"].count().reset_index()
df.rename(columns={"order_id": "total_orders"}, inplace=True)
return df

This would do the job,
asian_indices = [orders_df[orders_df["product_id"] == product_id].index[0] for product_id in asian_orders_df["product_id"]]
new_df = orders_df.loc[asian_indices, :]

Related

How to use qcut in a dataframe with conditions value from columns

I have the following scenario in a sales dataframe, each row being a distinct sale:
Category Product Purchase Value | new_column_A new_column_B
A C 30 |
B B 50 |
C A 100 |
I'm trying to find in the qcut documentation but can't find it anywhere, how to add a series of columns based on the following logic:
df['new_column_A'] = when category = A and product = A then df['new_column_A'] = pd.qcut(df['Purchase_Value'], q=4)
df['new_column_B' = when category = A and product = B then
df['new_column_B'] = pd.qcut(df['Purchase_Value'], q=4)
Preferably i would like for this new column of percentile cut to be created in the same original dataframe.
The first thing that comes to my mind is to split the dataframe into separate ones by doing the filtering I need, but i would like to keep all these columns in the original dataframe.
Does anyone knows if this is possible and how I can do it?

Pyspark - how to merge transformed columns with an original DataFrame?

I created a function to test transformations on a DataFrame. This returns only the transformed columns.
def test_concat(df: sd.DataFrame, col_names: list) -> sd.DataFrame:
return df.select(*[F.concat(df[column].cast(StringType()), F.lit(" new!")).alias(column) for column in col_names])
How can I replace the existing columns with the transformed once in the original DF and return the whole DF?
Example DF:
test_df = self.spark.createDataFrame([(1, 'metric1', 10), (2, 'metric2', 20), (3, 'metric3', 30)], ['id', 'metric', 'score'])
cols = ["metric"]
new_df = perform_concat(test_df, cols)
new_df.show()
Expected result:
|metric | score |
+-------------+--------+
|metric1 new! | 10 |
|metric2 new! | 20 |
|metric3 new! | 30 |
It looks like I can drop the original columns from the DF and then somehow append the transformed. But not sure that it's the right way to achieve this.
I can see you have only adding a keyword in metric column , the same can be achieved using inbuilt spark function as below
The withColumn has two functionality
If the column is not present it will create a new clumn
If the column is there, it will perform the operation on the same column
Logic to Concat
from pyspark.sql import functions as F
df = df.withColumn('metric', F.concat(F.col('metric'), F.lit(' '), F.lit('new!')))
df = df.select('metric', 'score')
df.show()
Output---------
|metric | score |
+-------------+--------+
|metric1 new! | 10 |
|metric2 new! | 20 |
|metric3 new! | 30 |
If you want to do it for many columns you would make a foldLeft call.
#dsk has the right approach.
You probably want to avoid joins in this case since there is no need to decouple operation you are describing from original dataframe (this is based on the examples you provided, if you have different needs in real case then maybe different example is needed).
columnsToTransform.foldLeft(df)(
(acc, next) => acc.withColumn(next, concat(col(next), lit("new !")))
)
Edit: Just realised what I am proposing only works for scala and that your snippet is in python.
For python similar will still work just instead of fold you will do a for:
df = yourOriginalDf
for(next in columnsToTransform):
df = df.withColumn(next, concat(col(next), lit("new !")))
Create a new dataframe with updated column values and a monotonically increasing id
new_df = test_concat(test_df, cols).withColumn("index", F.monotonically_increasing_id())
Drop the list of columns from first dataframe and a monotonically increasing id
test_df_upt = test_df.drop(*cols).withColumn("index", F.monotonically_increasing_id())
Join the above 2 dataframes and drop the index colum
test_df_upt.join(new_df, "index").drop("index").show()

Count elements satisfying an extra condition on another column when group-bying in pyspark

The following pyspark command
df = dataFrame.groupBy("URL_short").count().select("URL_short", col("count").alias("NumOfReqs"))
created the following result.
|URL_short |NumOfReqs|
+-----------------------------------------------------------------------------------------+---------+
|http1 | 500 |
|http4 | 500 |
|http2 | 500 |
|http3 | 500 |
In the original DataFrame dataFrame I have a column named success whose type is text. The value can be "true" or "false".
In the result I would like to have an additional column named for example NumOfSuccess which counts the elements having entry "true" in the original column success per category URL_short.
How can I modify
df = dataFrame.groupBy("URL_short").count().select("URL_short", col("count").alias("NumOfReqs"))
to output also the column satisfying the condition success=="trueperURL_short` category?
One way to do it is to add another aggregation expression (also turn the count into an agg expression):
import pyspark.sql.functions as f
dataFrame.groupBy("URL_short").agg(
f.count('*').alias('NumOfReqs'),
f.sum(f.when(f.col('success'), 1).otherwise(0)).alias('CountOfSuccess')
).show()
Note this assumes your success column is boolean type, if it's string, change the expression to f.sum(f.when(f.col('success') == 'true', 1).otherwise(0)).alias('CountOfSuccess')

Appending to a Pandas dataframe and specifying the row index

I'm a little confused about the workings of the Pandas dataframe.
I have a pd.dataframe that looks like this:
index | val1 | val2
-----------------------------------
20-11-2017 22:33:20 | 0.33 | 05.43
23-11-2017 23:34:14 | 4.23 | 09.43
I'd like to append a row to it, and be able to specify the index, which in my case is a date and time.
I have tried the following methods:
dataframe = pd.DataFrame(columns=['val1', 'val2'])
dataframe.loc[someDate] = [someVal, someVal]
This seems to overwrite if the index already exists, but I want to be able to have duplicate indices.
dataframe = pd.DataFrame(columns=['val1', 'val2'])
record = pd.Series(
index=[someDate],
data=[someVal, someVal]
)
dataframe.append(record)
This causes the application to hang without returning an exception or error.
Am I missing something? Is this the correct way of doing the thing I want to achieve?

Fill in missing boolean rows in Pandas

I have a MySQL query that is doing a groupby and returning data in the following form:
ID | Boolean | Count
Sometimes there isn't data in the table for one of the boolean states, so data for a single ID might be returned like this:
1234 | 0 | 10
However I need it in this form for downstream analysis:
1234 | 0 | 10
1234 | 1 | 0
with an index on [ID, Boolean].
From querying Google and SO, it seems like getting MySQL to do this transform is a bit of a pain. Is there a simple way to do this in Pandas? I haven't been able to find anything useful in the docs or the Pandas cookbook.
You can assume that I've already loaded the data into a Pandas dataframe with no indexes.
Thanks.
I would set the index of your dataframe to the ID and Boolean columns, and the construct an new index from the Cartesian product of the unique values.
That would look like this:
import pandas
indexcols = ['ID', 'Boolean']
data = pandas.read_sql_query(engine, querytext)
full_index = pandas.MultiIndex.from_product(
[data['ID'].unique(), [0, 1]],
names=indexcols
)
data = (
data.set_index(indexcols)
.reindex(full_index)
.fillna(0)
.reset_index()
)

Categories