I'm trying to obtain the "DESIRED OUTCOME" shown in my image below. I have a somewhat messy way of doing it that I came up with, but I was hoping there is a more efficient way to do this using Pandas. Please advise, and thank you in advance!
The problem is pretty standard, and so is its solution: group by the first column and join the data in the second column. Note that the function join is not called but passed to apply as a parameter.
df.groupby('Name')['Food'].apply(';'.join)
#Name
#Gary Oranges;Pizza
#John Tacos
#Matt Chicken;Steak
You can group by the Name column and then aggregate with the ';'.join function:
df.groupby('Name').agg({'Food': ';'.join})
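If you want the result back as a regular DataFrame (rather than a Series indexed by Name), either version can be followed by reset_index(). A minimal sketch on the same hypothetical Name/Food frame:

out = df.groupby('Name')['Food'].apply(';'.join).reset_index()
#   Name           Food
# 0 Gary  Oranges;Pizza
# 1 John          Tacos
# 2 Matt  Chicken;Steak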
I have a column with a name and a number, and I would like to extract the information and create 3 separate columns with that info in pandas using Python. I'd also like to drop the original column. What is the most efficient way to do it? Is it feasible in a single line?
The number has brackets [] around it, which I also want to drop.
Thanks so much!
I'm a noob and don't have much experience with stripping/slicing or lambda functions within pandas.
We can use str.extract here, with a capture group for each of the 3 components:
df[["last", "first", "number"]] = df["last_first_number"].str.extract(r'(\w+), (\w+) \[(\d+)\]')
I have a Spark DataFrame (sdf) where each row shows an IP visiting a URL. I want to count distinct IP-URL pairs in this data frame and the most straightforward solution is sdf.groupBy("ip", "url").count(). However, since the data frame has billions of rows, precise counts can take quite a while. I'm not particularly familiar with PySpark -- I tried replacing .count() with .approx_count_distinct(), which was syntactically incorrect.
I searched "how to use .approx_count_distinct() with groupBy()" and found this answer. However, the solution suggested there (something along those lines: sdf.groupby(["ip", "url"]).agg(F.approx_count_distinct(sdf.url).alias("distinct_count"))) doesn't seem to give me the counts that I want. The method .approx_count_distinct() can't take two columns as arguments, so I can't write sdf.agg(F.approx_count_distinct(sdf.ip, sdf.url).alias("distinct_count")), either.
My question is, is there a way to get .approx_count_distinct() to work on multiple columns and count distinct combinations of these columns? If not, is there another function that can do just that and what's an example usage of it?
Thank you so much for your help in advance!
Group with expressions and alias as needed. Let's try:
df.groupBy("ip", "url").agg(expr("approx_count_distinct(ip)").alias('ip_count'),expr("approx_count_distinct(url)").alias('url_count')).show()
Your code sdf.groupby(["ip", "url"]).agg(F.approx_count_distinct(sdf.url).alias("distinct_count")) will give a value of 1 to every group, since you are counting distinct values of one of the grouping columns, url.
If you want to count distinct IP-URL pairs using the approx_count_distinct function, you can combine them into an array and then apply the function. It would be something like this:
sdf.selectExpr("approx_count_distinct(array(ip, url)) as distinct_count")
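For reference, a sketch of the same idea written with the DataFrame API instead of selectExpr, assuming sdf and the ip/url columns from the question:

from pyspark.sql import functions as F

# Pack each (ip, url) pair into an array, then take the approximate distinct count
sdf.select(F.approx_count_distinct(F.array("ip", "url")).alias("distinct_count")).show()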
I am completely new to Python (I started last week!), so while I looked at similar questions, I have difficulty understanding what's going on and even more difficulty adapting them to my situation.
I have a csv file where rows are dates and columns are different regions (see image 1). I would like to create a file that has 3 columns: Date, Region, and Indicator where for each date and region name the third column would have the correct indicator (see image 2).
I tried turning the wide data into long data, but I could not quite get it to work; as I said, I am completely new to Python. My second approach was to split it up by columns and then merge it again. I'd be grateful for any suggestions.
Here is a solution using stack() in pandas:
import pandas as pd
# In your case, use pd.read_csv instead of this:
frame = pd.DataFrame({
'Date': ['3/24/2020', '3/25/2020', '3/26/2020', '3/27/2020'],
'Algoma': [None,0,0,0],
'Brant': [None,1,0,0],
'Chatham': [None,0,0,0],
})
# Stack the region columns into rows, name the value column, and rename the generated level
solution = frame.set_index('Date').stack().reset_index(name='Indicator').rename(columns={'level_1': 'Region'})
solution.to_csv('solution.csv')
This is the inverse of doing a pivot, as explained here: Doing the opposite of pivot in pandas Python. As you can see there, you could also consider using the melt function as an alternative.
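For comparison, a minimal melt-based sketch of the same reshaping on the hypothetical frame above; note that unlike stack(), melt() keeps rows where the indicator is missing (NaN) unless you drop them afterwards:

solution_melt = frame.melt(id_vars='Date', var_name='Region', value_name='Indicator')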
First, your region data is currently 'one-hot encoded' across columns. What you are trying to do is "reverse" that one-hot encoding. Maybe check if this link answers your question:
Reversing 'one-hot' encoding in Pandas.
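For context, a minimal sketch of what the linked reversal typically looks like, assuming a strictly one-hot frame (exactly one 1 per row), which is not quite the indicator layout in this question:

import pandas as pd

onehot = pd.DataFrame({'Algoma': [1, 0], 'Brant': [0, 1]})
onehot['Region'] = onehot.idxmax(axis=1)  # the column holding the 1 becomes the label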
I have a dataframe df with a column var, and another dataframe df2 with columns var and var2. The var columns in the two dataframes are exactly the same.
In my example, df['var'].map(df2) and df.var.map(df2) yield the same result. I would like to ask whether this is just a coincidence in my particular dataset, or whether it always holds.
Thank you so much!
Update: In my example, the two snippets below also produce the same result.
df.groupby('parent_id')['parent_id'].transform('count').tolist()
and
df.groupby('parent_id').parent_id.transform('count').tolist()
This gives me a feeling that df.groupby('parent_id')['parent_id'] and df.groupby('parent_id').parent_id produce the same result.
Yes (as long as the column exists in your data). It's syntactic sugar called attribute access. See the pandas documentation here.
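A minimal sketch of the equivalence, with the usual caveat: attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, otherwise bracket access is the only option:

import pandas as pd

df = pd.DataFrame({'var': [1, 2, 2], 'my col': [3, 4, 5]})
assert df['var'].equals(df.var)  # same Series either way
# df.my col is a syntax error; df['my col'] is the only way to select that column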
I started learning Python recently and I need your help. I have a dataframe with the following structure.
I need to make a transformation: all values in column 2 (product_id) that have the same order_id (column 1) must be combined into a single row, with the values separated by commas.
Like this:
How can I make this transformation? Can somebody help me?
Thanks !
You can get your desired result using the following code:
df.groupby(['order_id'])['product_id'].apply(','.join).reset_index()
You can refer to the following answer for more applications: https://stackoverflow.com/a/27298308/6908282
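A quick usage sketch on a hypothetical order_id/product_id frame (names assumed), showing what the groupby/join produces:

import pandas as pd

df = pd.DataFrame({'order_id': [1, 1, 2], 'product_id': ['A', 'B', 'C']})
out = df.groupby(['order_id'])['product_id'].apply(','.join).reset_index()
#    order_id product_id
# 0         1        A,B
# 1         2          C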
You might also want to refer to this question, which is basically what you are asking for.