I want to find all rows where a certain value is present inside the column's list value.
So imagine I have a dataframe set up like this:
| placeID | users                           |
|---------|---------------------------------|
| 134986  | [U1030, U1017, U1123, U1044...] |
| 133986  | [U1034, U1011, U1133, U1044...] |
| 134886  | [U1031, U1015, U1133, U1044...] |
| 134976  | [U1130, U1016, U1133, U1044...] |
How can I get all rows where 'U1030' exists in the users column?
Or... is the real problem that I should not have my data arranged like this, and I should instead explode that column to have a row for each user?
What's the right way to approach this?
The way you have stored the data looks fine to me; you do not need to change the storage format.
Try this:
df1 = df[df['users'].str.contains("U1030")]
print(df1)
This will give you all the rows containing the specified user, as a DataFrame.
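Note that str.contains works on string values; if the users column holds actual Python lists rather than strings, a minimal sketch using a row-wise membership test instead:
# Hedged sketch: assumes each 'users' value is a Python list, not a string
mask = df['users'].apply(lambda users: 'U1030' in users)
df1 = df[mask]
print(df1)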
When you want to check whether a value exists inside a column whose values are lists, it's helpful to use the map function.
Implemented as below with an inline lambda, each list stored in the 'users' column is bound to u, and userID is tested for membership in it.
Really the answer is pretty straightforward when you look at the code below:
# user_filter filters the dataframe to all the rows where
# 'userID' is NOT in the 'users' column (the value of which
# is a list type)
user_filter = df['users'].map(lambda u: userID not in u)
# cuisine_filter filters the dataframe to only the rows
# where 'cuisine' exists in the 'cuisines' column (the value
# of which is a list type)
cuisine_filter = df['cuisines'].map(lambda c: cuisine in c)
# Display the result, applying both filters
df[user_filter & cuisine_filter]
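Applied to the question above, a minimal sketch that keeps only the rows whose list contains the user (note the positive 'in' test, versus the 'not in' shown in the first filter):
userID = 'U1030'
print(df[df['users'].map(lambda u: userID in u)])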
I have the following scenario in a sales dataframe, each row being a distinct sale:
| Category | Product | Purchase Value | new_column_A | new_column_B |
|----------|---------|----------------|--------------|--------------|
| A        | C       | 30             |              |              |
| B        | B       | 50             |              |              |
| C        | A       | 100            |              |              |
I've looked in the qcut documentation but can't find anywhere how to add a series of columns based on the following logic (pseudocode):
when Category == 'A' and Product == 'A': df['new_column_A'] = pd.qcut(df['Purchase_Value'], q=4)
when Category == 'A' and Product == 'B': df['new_column_B'] = pd.qcut(df['Purchase_Value'], q=4)
Preferably, I would like these new percentile-cut columns to be created in the same original dataframe.
The first thing that comes to mind is to split the dataframe into separate ones with the filtering I need, but I would like to keep all these columns in the original dataframe.
Does anyone know if this is possible and how I can do it?
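A hedged sketch of one way this could be done with boolean masks and .loc (assuming the dataframe is named df and the value column is named Purchase_Value): pd.qcut is applied only to the rows matching each condition, and the other rows are left as NaN.
import pandas as pd

mask_a = (df['Category'] == 'A') & (df['Product'] == 'A')
mask_b = (df['Category'] == 'A') & (df['Product'] == 'B')

# Quartile labels computed only over the matching rows; non-matching rows stay NaN
df.loc[mask_a, 'new_column_A'] = pd.qcut(df.loc[mask_a, 'Purchase_Value'], q=4)
df.loc[mask_b, 'new_column_B'] = pd.qcut(df.loc[mask_b, 'Purchase_Value'], q=4)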
I have a data frame, and I have a dictionary with the EC2 instance details.
Now I want to add a new column 'Instance Name' and populate it based on a condition: if an instance ID from the dictionary appears in the 'ResourceId' column, then, depending on what is in the Name field of the dictionary for that instance ID, populate the new column value for each matching entry.
Finally, I want to create separate data frames for my specific use cases, e.g. to get only Box-Usage results. Something like this:
box_usage = df[df['lineItem/UsageType'].str.contains('BoxUsage')]
print(box_usage.groupby('Instance Name')['lineItem/BlendedCost'].sum())
The new column value is not lining up against the respective ResourceId as I want; instead it is being filled in sequentially.
I have tried a bunch of things, including what I show in the code above, but no result yet. Any help?
After struggling through several options, I used the .apply() way and it did the trick
# Add the new column with a default value of 'Other'
df.insert(loc=17, column='Instance_Name', value='Other')

def update_col(x):
    # Look the resource id up in the ec2info dictionary and map the
    # instance's Name to a friendlier label
    for key, val in ec2info.items():
        if x == key:
            if ('MyAgg' in val['Name']) | ('MyAgg-AutoScalingGroup' in val['Name']):
                return 'SharkAggregator'
            if ('MyColl AS Group' in val['Name']) | ('MyCollector-AutoScalingGroup' in val['Name']):
                return 'SharkCollector'
            if ('MyMetric AS Group' in val['Name']) | ('MyMetric-AutoScalingGroup' in val['Name']):
                return 'Metric'

df['Instance_Name'] = df.ResourceId.apply(update_col)
df.Instance_Name.fillna(value='Other', inplace=True)
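A hedged alternative sketch: since the loop only looks x up as a dictionary key, the same mapping can be precomputed once and applied with Series.map (reusing update_col from above); unmatched ResourceIds fall back to 'Other':
# Precompute an instance-id -> label map, then apply it in one pass
name_map = {instance_id: update_col(instance_id) for instance_id in ec2info}
df['Instance_Name'] = df['ResourceId'].map(name_map).fillna('Other')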
The following pyspark command
df = dataFrame.groupBy("URL_short").count().select("URL_short", col("count").alias("NumOfReqs"))
created the following result.
+---------+---------+
|URL_short|NumOfReqs|
+---------+---------+
|    http1|      500|
|    http4|      500|
|    http2|      500|
|    http3|      500|
+---------+---------+
In the original DataFrame dataFrame I have a column named success whose type is text. The value can be "true" or "false".
In the result I would like to have an additional column named for example NumOfSuccess which counts the elements having entry "true" in the original column success per category URL_short.
How can I modify
df = dataFrame.groupBy("URL_short").count().select("URL_short", col("count").alias("NumOfReqs"))
to also output a column that counts the entries satisfying the condition success == "true" per URL_short category?
One way to do it is to add another aggregation expression (also turn the count into an agg expression):
import pyspark.sql.functions as f

dataFrame.groupBy("URL_short").agg(
    f.count('*').alias('NumOfReqs'),
    f.sum(f.when(f.col('success'), 1).otherwise(0)).alias('CountOfSuccess')
).show()
Note this assumes your success column is of boolean type. If it is a string, change the expression to f.sum(f.when(f.col('success') == 'true', 1).otherwise(0)).alias('CountOfSuccess').
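An equivalent hedged variant counts the non-null results of when() directly (count skips the nulls produced when the condition is false), assuming success holds the strings 'true'/'false':
import pyspark.sql.functions as f

dataFrame.groupBy("URL_short").agg(
    f.count('*').alias('NumOfReqs'),
    f.count(f.when(f.col('success') == 'true', True)).alias('NumOfSuccess')
).show()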
I have a dataframe rounds (the result of deleting a column from another dataframe) with the following structure (can't post pics, sorry):
----------------------------
|type|N|D|NATC|K|iters|time|
----------------------------
rows of data
----------------------------
I use groupby so I can then get the mean of the groups, like so:
rounds = results.groupby(['type','N','D','NATC','K','iters'])
results_mean = rounds.mean()
I get the means I wanted, but there is a problem with the keys. The results_mean dataframe has the following structure:
----------------------------
| | | | | | |time|
|type|N|D|NATC|K|iters| |
----------------------------
rows of data
----------------------------
The only key recognized is time (I executed results_mean.keys()).
What did I do wrong? How can I fix it?
In your aggregated data, time is the only column. The other ones are indices.
groupby has a parameter as_index. From the documentation:
as_index : boolean, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
So you can get the desired output by calling
rounds = results.groupby(['type','N','D','NATC','K','iters'], as_index=False)
results_mean = rounds.mean()
Or, if you want, you can always convert indices to keys by using reset_index. Using
rounds = results.groupby(['type','N','D','NATC','K','iters'])
results_mean = rounds.mean().reset_index()
should have the desired effect as well.
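A tiny illustrative sketch with made-up data (assumption: the remaining column is numeric), showing that both routes leave the group labels as ordinary columns:
import pandas as pd

results = pd.DataFrame({'type': ['a', 'a', 'b'],
                        'N':    [1, 1, 2],
                        'time': [0.5, 0.7, 1.2]})

# Group labels stay as ordinary columns instead of becoming a MultiIndex
flat = results.groupby(['type', 'N'], as_index=False).mean()
print(flat.keys())   # Index(['type', 'N', 'time'], dtype='object')

# Equivalent: keep the default behaviour and flatten afterwards
flat2 = results.groupby(['type', 'N']).mean().reset_index()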
I've had the same problem of losing the dataframe's keys due to the use of the groupby() function, and the answer I found for that problem was to convert the DataFrame to a CSV file and then read that file back in.
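A minimal sketch of that CSV round-trip (reset_index, shown above, is usually the simpler fix):
results_mean.to_csv('results_mean.csv')
results_mean = pd.read_csv('results_mean.csv')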
I have an SQLite database that I need to query from Python.
The database only has two columns, "key" and "Value", and the Value column contains a dictionary with multiple values. What I want to do is create a query that uses some of those known dictionary keys as column headers, with the corresponding data under each column.
Is it possible to do that all in a query, or will I have to process the dictionary in python afterwards?
Example data (values obviously have been changed) that I want to query.
| key                        | Value                                                                                                                  |
|----------------------------|------------------------------------------------------------------------------------------------------------------------|
| /auth/user_data/fb_me_user | {"uid":"100008112345597","first_name":"Tim","last_name":"Robins","name":"Tim Robins","emails":["t.robins#gmail.com"]} |
There are lots of other key / value combinations, but this is one of the ones I am interested in.
I would like to query this to produce the following;
| UID             | Name       | Email              |
|-----------------|------------|--------------------|
| 100008112345597 | Tim Robins | t.robins#gmail.com |
Is that possible just in a query?
Thanks
After querying, you get a value like the one below; from it you can extract what you need:
value='''{"uid":"100008112345597","first_name":"Tim","last_name":"Robins","name":"Tim Robins","emails":["t.robins#gmail.com"]}'''
import ast

# Parse the stored dictionary string into a Python dict
details = ast.literal_eval(value)
print(details['uid'], details['name'], ','.join(details['emails']))
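If the SQLite build includes the JSON1 extension, a hedged sketch of doing it entirely in the query from Python (assumptions: the table is named kv, the database file is mydatabase.db, and the Value column holds valid JSON; adjust names to the actual schema):
import sqlite3

conn = sqlite3.connect('mydatabase.db')
query = """
    SELECT json_extract(Value, '$.uid')       AS UID,
           json_extract(Value, '$.name')      AS Name,
           json_extract(Value, '$.emails[0]') AS Email
    FROM   kv
    WHERE  key = '/auth/user_data/fb_me_user'
"""
for uid, name, email in conn.execute(query):
    print(uid, name, email)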