Merge rows with the same values Pandas

Merge rows with the same values Pandas - python

I have a pandas dataframe as follows:
You will note here that there are many rows with the same code_module,code_presentation,id_student combination
What I want to do is merge all of these duplicate rows, and in so sum the sum_clicks with each group
An example of this is for the top rows they would be merged into one row looking as follows:
code_module code_presentation id_student sum_click
0 AAA 2013J 28400 18
In SQL terms, the private key should be a code_module,code_presentation,id_student combination
In my progress on this, I tried to use groupby in the following way:
groupby(['id_student','code_presentation','code_module']).aggregate({'sum_click': 'sum',})
But this didn't work as it gave student ids that aren't even in my dataset, which I don't understand why
Also, groupby doesn't seem to be quite what I'm looking for as it has a datastructure different to a standard pandas dataframe, which is what I would be looking for.
The problem can be seen in the following output
sum_click
id_student code_presentation code_module
6516 2014J AAA 2791
8462 2013J DDD 646
2014J DDD 10
11391 2013J AAA 934
Row 1 and 2 (indexing from 0) should be distinct rows, instead of the group as they are

Try this -
df.groupby(['code_module', 'code_presentation', 'id_student']).agg(sum_clicks=('sum_click', 'sum')).reset_index()

Related

What is the best way to aggregate 100 columns in pandas?

I have a data frame with 101 columns currently. The first column is called "Country/Region" and the other 100 are dates in MM/DD/YY format from 1/22/20 to 4/30/20 like the example below. I would like to combine repeat country entries such as 'Australia' below and have its values in the date columns to be added together so that there is one row per country. I would like to keep ALL date columns.I have tried to use the groupby() and agg() functions but I do not know how to sum() together that many columns without calling every single one. Is there a way to do this without calling all 100 columns individually?
Country/Region | 1/22/20 | 1/23/20 | ... | 4/29/20 | 4/30/20
Afghanistan 0 0 ... 1092 1176
Australia 0 0 10526 12065
Australia 0 0 ... 56289 4523

This should work:
df.pivot_table(index='Country/Region', aggfunc='sum')

Did you already try this? It should also give the expected result.
df.groupby('Country/Region').sum()

You can do this:
df.iloc[:,1:].sum(axis=1)

Iterate over certain columns with unique values and generate plots python

New to pandas and much help would be appreciated. I'm currently analyzing some Airbnb data and have over 50 different columns. Some of these columns have tens of thousands of unique values while some have very few unique values (categorical).
How do I loop over the columns that have less than 10 unique values to generate plots for them?
Count of unique values in each column:
id 38185
last_scraped 3
name 36774
description 34061
neighborhood_overview 18479
picture_url 37010
host_since 4316
host_location 1740
host_about 14178
host_response_time 4
host_response_rate 78
host_acceptance_rate 101
host_is_superhost 2
host_neighbourhood 486
host_total_listings_count 92
host_verifications 525
host_has_profile_pic 2
host_identity_verified 2
neighbourhood_cleansed 222
neighbourhood_group_cleansed 5
property_type 80
room_type 4
The above is stored through unique_vals = df.nunique()
Apologies if this is a repeat question, the closest answer I could find was Iterate through columns to generate separate plots in python but it pertained to the entire data set
Thanks!

You can filter the columns using df.columns[ unique_vals < 10 ]
You can also pass the df.nunique() call directly if you wish:
unique_columns = df.columns[ df.nunique() < 10 ]

How to drop rows in pandas dataframe, when there is similar values?

I have a python pandas dataframe of stock data, and I'm trying to filter some of those tickers.
There are companies that have 2 or more tickers (different types of shares when a share is preferred and the other not).
I want to drop the lines of those additional share values, and let just the share with the higher volume. In the dataframe I also have the company name, so maybe there is a way of using it to make some condition and then drop it when comparing the volume of the same company? How can I do this?

Use groupby and idxmax:
Suppose this dataframe:
>>> df
ticker volume
0 CEBR3 123
1 CEBR5 456
2 CEBR6 789 # <- keep for group CEBR
3 GOAU3 23 # <- keep for group GOAU
4 GOAU4 12
5 CMIN3 135 # <- keep for group CMIN3
>>> df.loc[df.groupby(df['ticker'].str.extract(r'^(.*)\d', expand=False),
sort=False)['volume'].idxmax().tolist()]
ticker volume
2 CEBR6 789
3 GOAU3 23
5 CMIN3 135

How to link two dataframes based on the string similarity of one column

I have two dataframes, both have an ID and a Column Name that contains Strings. They might look like this:
Dataframes:
DF-1 DF-2
--------------------- ---------------------
ID Name ID Name
1 56 aaeessa 1 12 H.P paRt 1
2 98 1o7v9sM 2 76 aa3esza
3 175 HP. part 1 3 762 stakoverfl
4 2 stackover 4 2 lo7v9Sm
I would like to compute the string similarity (Ex: Jaccard, Levenshtein) between one element with all the others and select the one that has the highest score. Then match the two IDs so I can join the complete Dataframes later. The resulting table should look like this:
Result:
Result
-----------------
ID1 ID2
1 56 76
2 98 2
3 175 12
4 2 762
This could be easily achieved using a double for loop, but I'm looking for an elegant (and faster way) to accomplish this, maybe lambdas list comprehension, or some pandas tool. Maybe some combination of groupby and idxmax for the similarity score but I can't quite come up with the soltution by myself.
EDIT: The DataFrames are of different lenghts, one of the purposes of this function is to determine which elements of the lesser dataframe appear in the greater dataframe and match those, discarding the rest. So in the resulting table should only appear pairs of IDs that match, or pairs of ID1 - NaN (assuming DF-1 has more rows than DF-2).

Using the pandas dedupe package: https://pypi.org/project/pandas-dedupe/
You need to train the classifier with human input and then it will use the learned setting to match the whole dataframe.
first pip install pandas-dedupe and try this:
import pandas as pd
import pandas_dedupe
df1=pd.DataFrame({'ID':[56,98,175],
'Name':['aaeessa', '1o7v9sM', 'HP. part 1']})
df2=pd.DataFrame({'ID':[12,76,762,2],
'Name':['H.P paRt 1', 'aa3esza', 'stakoverfl ', 'lo7v9Sm']})
#initiate matching
df_final = pandas_dedupe.link_dataframes(df1, df2, ['Name'])
# reset index
df_final = df_final.reset_index(drop=True)
# print result
print(df_final)
ID Name cluster id confidence
0 98 1o7v9sm 0.0 1.000000
1 2 lo7v9sm 0.0 1.000000
2 175 hp. part 1 1.0 0.999999
3 12 h.p part 1 1.0 0.999999
4 56 aaeessa 2.0 0.999967
5 76 aa3esza 2.0 0.999967
6 762 stakoverfl NaN NaN
you can see matched pairs are assigned a cluster and confidence level. unmatched are nan. you can now analyse this info however you wish. perhaps only take results with a confidence level above 80% for example.

I suggest you a library called Python Record Linkage Toolkit.
Once you import the library, you must index the sources you intend to compare, something like this:
indexer = recordlinkage.Index()
#using url as intersection
indexer.block('id')
candidate_links = indexer.index(df_1, df_2)
c = recordlinkage.Compare()
Let's say you want to compare based on the similiraties of strings, but they don't match exactly:
c.string('name', 'name', method='jarowinkler', threshold=0.85)
And if you want an exact match you should use:
c.exact('name')

Using my fuzzy_wuzzy function from the linked answer:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
mrg = fuzzy_merge(df1, df2, 'Name', 'Name', threshold=70)\
.merge(df2, left_on='matches', right_on='Name', suffixes=['1', '2'])\
.filter(like='ID')
Output
ID1 ID2
0 56 76
1 98 2
2 175 12
3 2 762

Combining and rearranging two dataframes in pandas

I'm having two dataframes with each of them looking like
date country value
20100101 country1 1
20100102 country1 2
20100103 country1 3
date country value
20100101 country2 4
20100102 country2 5
20100103 country2 6
I want to merge them into one dataframe looking like
date country1 country2
20100101 1 4
20100102 2 5
20100103 3 6
Is there any clever way to do this in pandas?

This looks like pivot table, which in Pandas is called unstack for some bizarre reason.
Example analogous to the one used in Wes McKinley's "python for data analysis" book:
bytz = df.groupby(['tz', opersystem])
counts = bytz.size().unstack().fillna(0)
(groupby operating system in rows which is then pivoted so that operating system becomes columns, just like your "country*" values).
P.S. for catting dataframes you can use pandas.concat. It's also often good to do .reset_index on resulting dataframe, bc in some (many?) cases duplicate values in index can make pandas go haywire, throwing strange exceptions on .apply used on dataframe and the like.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Merge rows with the same values Pandas - python

Try this - df.groupby(['code_module', 'code_presentation', 'id_student']).agg(sum_clicks=('sum_click', 'sum')).reset_index()

Related

What is the best way to aggregate 100 columns in pandas?

Iterate over certain columns with unique values and generate plots python

How to drop rows in pandas dataframe, when there is similar values?

How to link two dataframes based on the string similarity of one column

Combining and rearranging two dataframes in pandas

Categories

Resources