I have a MySQL query that is doing a groupby and returning data in the following form:
ID | Boolean | Count
Sometimes there isn't data in the table for one of the boolean states, so data for a single ID might be returned like this:
1234 | 0 | 10
However I need it in this form for downstream analysis:
1234 | 0 | 10
1234 | 1 | 0
with an index on [ID, Boolean].
From querying Google and SO, it seems like getting MySQL to do this transform is a bit of a pain. Is there a simple way to do this in Pandas? I haven't been able to find anything useful in the docs or the Pandas cookbook.
You can assume that I've already loaded the data into a Pandas dataframe with no indexes.
Thanks.
I would set the index of your dataframe to the ID and Boolean columns, and then construct a new index from the Cartesian product of the unique values.
That would look like this:
import pandas

indexcols = ['ID', 'Boolean']

data = pandas.read_sql_query(querytext, engine)

full_index = pandas.MultiIndex.from_product(
    [data['ID'].unique(), [0, 1]],
    names=indexcols
)

data = (
    data.set_index(indexcols)
        .reindex(full_index)
        .fillna(0)
        .reset_index()
)
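As a quick usage illustration on toy data (the column names follow the question; the numbers are made up):

import pandas

# ID 1234 only has a row for Boolean == 0
data = pandas.DataFrame({'ID': [1234, 5678, 5678],
                         'Boolean': [0, 0, 1],
                         'Count': [10, 3, 7]})

indexcols = ['ID', 'Boolean']
full_index = pandas.MultiIndex.from_product(
    [data['ID'].unique(), [0, 1]],
    names=indexcols
)

# After reindexing, the missing (1234, 1) combination appears with Count 0
data = data.set_index(indexcols).reindex(full_index).fillna(0).reset_index()
print(data)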
I have two dataframes, one called 'order' and another called 'asian_food'. The two tables share a common column, 'product_id'. I want to know how many times each product in the 'asian_food' table was ordered in the 'order' table.
'order' table:
'asian_food' table:
I've tried the following code:
asian['frequency'] = asian['product_id'].map(order_copy['product_id'].value_counts()).fillna(0).astype(int)
but it returns an error saying:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
How can I use .loc to get what I want? Thank you in advance.
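For the warning itself, a minimal sketch of the usual fix (assuming asian was produced by slicing another dataframe) is to take an explicit copy before assigning:

asian = asian.copy()  # work on an explicit copy of the slice
asian['frequency'] = (
    asian['product_id']
    .map(order_copy['product_id'].value_counts())
    .fillna(0)
    .astype(int)
)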
Could you do something like this?
def get_order_count_totals(order_df, asian_food_df):
"""
Function returns a dataframe with the following columns:
| product_id | product_name | total_orders |
|------------:|:--------------:|:--------------:|
| 14 | Asian Food | 1 |
"""
df = order_df.merge(asian_food_df, on="product_id")
df = df.groupby(["product_id", "product_name"])["order_id"].count().reset_index()
df.rename(columns={"order_id": "total_orders"}, inplace=True)
return df
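A possible usage sketch (the frame names order and asian_food come from the question; a product_name column is assumed to exist in asian_food):

totals = get_order_count_totals(order, asian_food)
print(totals.sort_values("total_orders", ascending=False).head())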
This would do the job:
# keep every order row whose product_id appears in the asian_food table
asian_indices = orders_df.index[orders_df["product_id"].isin(asian_orders_df["product_id"])]
new_df = orders_df.loc[asian_indices, :]
I have the following scenario in a sales dataframe, each row being a distinct sale:
Category | Product | Purchase Value | new_column_A | new_column_B
-----------------------------------------------------------------
A        | C       | 30             |              |
B        | B       | 50             |              |
C        | A       | 100            |              |
I've searched the qcut documentation but can't find anywhere how to add a series of columns based on the following logic:
df['new_column_A'] = when Category = A and Product = A then pd.qcut(df['Purchase_Value'], q=4)
df['new_column_B'] = when Category = A and Product = B then pd.qcut(df['Purchase_Value'], q=4)
Preferably, I would like this new column of percentile cuts to be created in the same original dataframe.
The first thing that comes to mind is to split the dataframe into separate ones by doing the filtering I need, but I would like to keep all these columns in the original dataframe.
Does anyone know if this is possible and how I can do it?
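One possible sketch: assign through .loc with a boolean mask, so the quantile labels land directly in the original dataframe. The column names and sample values below are assumptions based on the example table above.

import pandas as pd

# Toy data shaped like the example (values are made up)
df = pd.DataFrame({
    'Category': ['A'] * 8,
    'Product': ['A'] * 4 + ['B'] * 4,
    'Purchase Value': [30, 50, 70, 90, 20, 40, 60, 80],
})

# Quartile labels (0-3) only for rows where Category == 'A' and Product == 'A';
# all other rows stay NaN, so the new column lives in the original frame.
mask_a = (df['Category'] == 'A') & (df['Product'] == 'A')
df.loc[mask_a, 'new_column_A'] = pd.qcut(df.loc[mask_a, 'Purchase Value'], q=4, labels=False)

# Same idea for the second condition
mask_b = (df['Category'] == 'A') & (df['Product'] == 'B')
df.loc[mask_b, 'new_column_B'] = pd.qcut(df.loc[mask_b, 'Purchase Value'], q=4, labels=False)

A groupby(['Category', 'Product']) with transform would scale the same idea to many combinations without writing one mask per column.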
I created a function to test transformations on a DataFrame. This returns only the transformed columns.
def test_concat(df: sd.DataFrame, col_names: list) -> sd.DataFrame:
return df.select(*[F.concat(df[column].cast(StringType()), F.lit(" new!")).alias(column) for column in col_names])
How can I replace the existing columns with the transformed ones in the original DF and return the whole DF?
Example DF:
test_df = self.spark.createDataFrame([(1, 'metric1', 10), (2, 'metric2', 20), (3, 'metric3', 30)], ['id', 'metric', 'score'])
cols = ["metric"]
new_df = test_concat(test_df, cols)
new_df.show()
Expected result:
|metric | score |
+-------------+--------+
|metric1 new! | 10 |
|metric2 new! | 20 |
|metric3 new! | 30 |
It looks like I can drop the original columns from the DF and then somehow append the transformed ones, but I'm not sure that's the right way to achieve this.
I can see you are only appending a keyword to the metric column; the same can be achieved using the built-in Spark functions, as below.
withColumn has two behaviours:
If the column is not present, it will create a new column.
If the column is already there, it will perform the operation on that same column, overwriting it.
Logic to Concat
from pyspark.sql import functions as F
df = df.withColumn('metric', F.concat(F.col('metric'), F.lit(' '), F.lit('new!')))
df = df.select('metric', 'score')
df.show()
Output:
|metric | score |
+-------------+--------+
|metric1 new! | 10 |
|metric2 new! | 20 |
|metric3 new! | 30 |
If you want to do it for many columns you would make a foldLeft call.
#dsk has the right approach.
You probably want to avoid joins in this case, since there is no need to decouple the operation you are describing from the original dataframe (this is based on the examples you provided; if your real case has different needs, then maybe a different example is needed).
columnsToTransform.foldLeft(df)(
(acc, next) => acc.withColumn(next, concat(col(next), lit("new !")))
)
Edit: Just realised that what I am proposing only works for Scala and that your snippet is in Python.
For Python, something similar will still work; instead of a fold you use a for loop:
df = yourOriginalDf
for column in columnsToTransform:
    df = df.withColumn(column, F.concat(F.col(column), F.lit("new !")))
Create a new dataframe with updated column values and a monotonically increasing id
new_df = test_concat(test_df, cols).withColumn("index", F.monotonically_increasing_id())
Drop the list of columns from the first dataframe and add a monotonically increasing id
test_df_upt = test_df.drop(*cols).withColumn("index", F.monotonically_increasing_id())
Join the above two dataframes and drop the index column
test_df_upt.join(new_df, "index").drop("index").show()
I want to find all rows where a certain value is present inside the column's list value.
So imagine I have a dataframe set up like this:
| placeID | users |
------------------------------------------------
| 134986| [U1030, U1017, U1123, U1044...] |
| 133986| [U1034, U1011, U1133, U1044...] |
| 134886| [U1031, U1015, U1133, U1044...] |
| 134976| [U1130, U1016, U1133, U1044...] |
How can I get all rows where 'U1030' exists in the users column?
Or... is the real problem that I should not have my data arranged like this, and I should instead explode that column to have a row for each user?
What's the right way to approach this?
The way you have stored the data looks fine to me. You do not need to change the format.
Try this:
df1 = df[df['users'].str.contains("U1030")]
print(df1)
This will give you all the rows containing the specified user, as a dataframe.
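One caveat: .str.contains matches when the users values are stored as strings; if they are actual Python lists, a per-row membership test is one possible alternative (a sketch using the same df and user as above):

df1 = df[df['users'].apply(lambda users: "U1030" in users)]
print(df1)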
When you want to check whether a value exists inside a column whose values are lists, it's helpful to use the map function.
Implementing it as below, with an inline lambda, the list stored in the 'users' column is mapped to the value u, and userID is tested for membership in it...
Really the answer is pretty straightforward when you look at the code below:
# e.g. userID = 'U1030'
# user_filter marks the rows where 'userID' appears in the
# 'users' column (the value of which is a list type)
user_filter = df['users'].map(lambda u: userID in u)

# the same pattern works for any other list-valued column,
# e.g. a 'cuisines' column and a cuisine of interest
cuisine_filter = df['cuisines'].map(lambda c: cuisine in c)

# Display the rows that pass both filters
df[user_filter & cuisine_filter]
I'm a little confused about the workings of the Pandas dataframe.
I have a pd.DataFrame that looks like this:
index | val1 | val2
-----------------------------------
20-11-2017 22:33:20 | 0.33 | 05.43
23-11-2017 23:34:14 | 4.23 | 09.43
I'd like to append a row to it, and be able to specify the index, which in my case is a date and time.
I have tried the following methods:
dataframe = pd.DataFrame(columns=['val1', 'val2'])
dataframe.loc[someDate] = [someVal, someVal]
This seems to overwrite if the index already exists, but I want to be able to have duplicate indices.
dataframe = pd.DataFrame(columns=['val1', 'val2'])
record = pd.Series(
index=[someDate],
data=[someVal, someVal]
)
dataframe.append(record)
This causes the application to hang without returning an exception or error.
Am I missing something? Is this the correct way of doing the thing I want to achieve?
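For completeness, a minimal sketch of one way to keep duplicate datetime indices: build a one-row frame and use pd.concat (DataFrame.append was deprecated and later removed in newer pandas versions). The values below are placeholders.

import pandas as pd

some_date = pd.Timestamp('2017-11-20 22:33:20')
dataframe = pd.DataFrame({'val1': [0.33], 'val2': [5.43]}, index=[some_date])

# Append another row with the *same* index value; pd.concat keeps both rows.
new_row = pd.DataFrame({'val1': [4.23], 'val2': [9.43]}, index=[some_date])
dataframe = pd.concat([dataframe, new_row])
print(dataframe)  # two rows now share the same datetime index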