I have a pandas DataFrame of stock data, and I'm trying to filter some of those tickers.
There are companies that have two or more tickers (different share classes, e.g. one preferred and one not).
I want to drop the rows for those additional tickers and keep only the one with the highest volume. The dataframe also contains the company name, so maybe there is a way to use it in a condition and then drop rows by comparing the volumes for the same company? How can I do this?
Use groupby and idxmax:
Suppose this dataframe:
>>> df
  ticker  volume
0  CEBR3     123
1  CEBR5     456
2  CEBR6     789  # <- keep for group CEBR
3  GOAU3      23  # <- keep for group GOAU
4  GOAU4      12
5  CMIN3     135  # <- keep for group CMIN
>>> df.loc[df.groupby(df['ticker'].str.extract(r'^(.*)\d', expand=False),
                      sort=False)['volume'].idxmax().tolist()]
ticker volume
2 CEBR6 789
3 GOAU3 23
5 CMIN3 135
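An equivalent route, assuming every ticker is a company prefix followed by a single class digit, is to sort by volume and keep the last row per prefix:

# extract the company prefix (everything before the trailing digit)
company = df['ticker'].str.extract(r'^(.*)\d', expand=False)

(df.assign(company=company)
   .sort_values('volume')                    # highest volume ends up last
   .drop_duplicates('company', keep='last')  # one row per company
   .drop(columns='company')
   .sort_index())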
I'm new to pandas and any help would be appreciated. I'm currently analyzing some Airbnb data and have over 50 different columns. Some of these columns have tens of thousands of unique values, while some have very few unique values (categorical).
How do I loop over the columns that have less than 10 unique values to generate plots for them?
Count of unique values in each column:
id 38185
last_scraped 3
name 36774
description 34061
neighborhood_overview 18479
picture_url 37010
host_since 4316
host_location 1740
host_about 14178
host_response_time 4
host_response_rate 78
host_acceptance_rate 101
host_is_superhost 2
host_neighbourhood 486
host_total_listings_count 92
host_verifications 525
host_has_profile_pic 2
host_identity_verified 2
neighbourhood_cleansed 222
neighbourhood_group_cleansed 5
property_type 80
room_type 4
The counts above are stored in unique_vals = df.nunique().
Apologies if this is a repeat question; the closest answer I could find was "Iterate through columns to generate separate plots in python", but it pertained to the entire data set.
Thanks!
You can filter the columns using df.columns[unique_vals < 10].
You can also pass the df.nunique() call directly if you wish:
unique_columns = df.columns[df.nunique() < 10]
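To then loop over those columns and draw one plot per column, a minimal sketch (assuming bar charts of the value counts are what you're after):

import matplotlib.pyplot as plt

for col in df.columns[df.nunique() < 10]:
    # one bar chart of value counts per low-cardinality column
    df[col].value_counts().plot(kind='bar', title=col)
    plt.show()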
df
SKU  Comp  Brand  Jan_Sales  Feb_Sales  Mar_sales  Apr_sales  ...  Dec_sales
A    AC    BA     122        100        50         200             300
B    BC    BB     100        50         80         90              250
C    CC    BC     40         30         100        10              11
and so on
Now I want a graph that plots Jan sales, Feb sales, and so on through Dec in one line for SKU A; similarly, one line on the same graph for SKU B, and the same for SKU C.
I read a few answers which say that I need to transpose my data, something like below:
df.T.plot()
However, my first column is SKU, and I want to plot based on that; the rest of the columns are numeric. So each line should be labelled with the SKU name, and the plotting should be row-wise.
EDIT (added after receiving some answers, as I am facing this issue in a few other datasets):
Let's say I don't want the columns Comp, Brand, etc.; then what should I do?
Use DataFrame.set_index to convert SKU to the index and then transpose:
df.set_index('SKU').T.plot()
Use set_index then transpose:
df.set_index("SKU").T.plot()
Output: a line chart with one line per SKU.
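For the edit (excluding columns like Comp and Brand), one sketch, assuming the sales columns are the only numeric ones:

import matplotlib.pyplot as plt

# keep SKU as the line labels, drop the non-numeric columns
df.set_index('SKU').select_dtypes('number').T.plot()
plt.show()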
I have two dataframes; both have an ID column and a Name column that contains strings. They might look like this:
Dataframes:
DF-1 DF-2
--------------------- ---------------------
ID Name ID Name
1 56 aaeessa 1 12 H.P paRt 1
2 98 1o7v9sM 2 76 aa3esza
3 175 HP. part 1 3 762 stakoverfl
4 2 stackover 4 2 lo7v9Sm
I would like to compute the string similarity (e.g. Jaccard, Levenshtein) between one element and all the others, and select the one that has the highest score. Then match the two IDs so I can join the complete dataframes later. The resulting table should look like this:
Result:
Result
-----------------
ID1 ID2
1 56 76
2 98 2
3 175 12
4 2 762
This could be easily achieved using a double for loop, but I'm looking for an elegant (and faster) way to accomplish this, maybe lambdas, a list comprehension, or some pandas tool. Maybe some combination of groupby and idxmax for the similarity score, but I can't quite come up with the solution by myself.
EDIT: The dataframes are of different lengths. One of the purposes of this function is to determine which elements of the smaller dataframe appear in the larger dataframe and match those, discarding the rest. So the resulting table should only contain pairs of IDs that match, or pairs of ID1 - NaN (assuming DF-1 has more rows than DF-2).
Use the pandas-dedupe package: https://pypi.org/project/pandas-dedupe/
You need to train the classifier with human input, and then it will use the learned settings to match the whole dataframe.
First pip install pandas-dedupe and try this:
import pandas as pd
import pandas_dedupe

df1 = pd.DataFrame({'ID': [56, 98, 175],
                    'Name': ['aaeessa', '1o7v9sM', 'HP. part 1']})

df2 = pd.DataFrame({'ID': [12, 76, 762, 2],
                    'Name': ['H.P paRt 1', 'aa3esza', 'stakoverfl ', 'lo7v9Sm']})

# initiate matching
df_final = pandas_dedupe.link_dataframes(df1, df2, ['Name'])

# reset index
df_final = df_final.reset_index(drop=True)

# print result
print(df_final)
ID Name cluster id confidence
0 98 1o7v9sm 0.0 1.000000
1 2 lo7v9sm 0.0 1.000000
2 175 hp. part 1 1.0 0.999999
3 12 h.p part 1 1.0 0.999999
4 56 aaeessa 2.0 0.999967
5 76 aa3esza 2.0 0.999967
6 762 stakoverfl NaN NaN
You can see that matched pairs are assigned a cluster and a confidence level, while unmatched rows get NaN. You can now analyse this information however you wish; perhaps only take results with a confidence level above 80%, for example.
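For instance, assuming the column names shown in the output above, filtering to high-confidence matches might look like:

# keep only links the model is at least 80% confident about
high_conf = df_final[df_final['confidence'] > 0.8]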
I suggest a library called the Python Record Linkage Toolkit.
Once you import the library, you must index the sources you intend to compare, something like this:
import recordlinkage

indexer = recordlinkage.Index()
# block on a column shared by both frames to limit the candidate pairs
indexer.block('id')
candidate_links = indexer.index(df_1, df_2)

c = recordlinkage.Compare()
Let's say you want to compare based on the similiraties of strings, but they don't match exactly:
c.string('name', 'name', method='jarowinkler', threshold=0.85)
And if you want an exact match, you should use:
c.exact('name', 'name')
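The answer stops before actually running the comparison; a minimal sketch of the remaining steps, assuming the objects defined above:

# compute the comparison vectors for all candidate pairs
features = c.compute(candidate_links, df_1, df_2)

# keep the pairs that agree on every comparison
matches = features[features.sum(axis=1) == features.shape[1]]
print(matches)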
Using my fuzzy_merge function from the linked answer:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
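The fuzzy_merge helper itself is defined in the linked answer; a sketch consistent with the call below (the signature and defaults here are assumptions) is:

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=1):
    # best fuzzy match in df_2[key2] for every value of df_1[key1]
    choices = df_2[key2].tolist()
    raw = df_1[key1].apply(lambda x: process.extract(x, choices, limit=limit))
    # keep only matches scoring at or above the threshold
    df_1['matches'] = raw.apply(
        lambda ms: ', '.join(m[0] for m in ms if m[1] >= threshold))
    return df_1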
mrg = fuzzy_merge(df1, df2, 'Name', 'Name', threshold=70)\
.merge(df2, left_on='matches', right_on='Name', suffixes=['1', '2'])\
.filter(like='ID')
Output
ID1 ID2
0 56 76
1 98 2
2 175 12
3 2 762
I have a dataframe like this:
name time session1 session2 session3
Alex 135 10 3 5
Lee 136 2 6 4
I want to make multiple dataframes based on each session. For example, dataframe one should have name, time, and session1; dataframe two should have name, time, and session2; and dataframe three should have name, time, and session3. I'd like to use a for loop (or any better way), but I don't know how to select columns 1, 2, 3 one time and then columns 1, 2, 4, etc. Does anyone have an idea? The data is saved in a pandas dataframe; I just didn't know how to type it here on Stack Overflow. Thank you.
I don't think you need to create a new dictionary for that.
Just slice your dataframe directly whenever needed:
df[['name', 'time', 'session1']]
If it suits your design, you can also set name and time as the index (df.set_index(['name', 'time'])) and then simply use
df['session1']
Organize it into a dictionary of dataframes:
dict_of_dfs = {f'df {i}': df[['name', 'time', i]] for i in df.columns[2:]}
Then you can access each dataframe as you would any other dictionary values:
>>> dict_of_dfs['df session1']
name time session1
0 Alex 135 10
1 Lee 136 2
>>> dict_of_dfs['df session2']
name time session2
0 Alex 135 3
1 Lee 136 6
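And since the original question asked about a for loop, you can iterate over the dictionary like any other:

for name, d in dict_of_dfs.items():
    print(name)
    print(d)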
I want to divide rows in my dataframe using specific columns.
That is, I have a 'ticker' column, and each ticker has a 'date' and a 'price'.
I want to divide price[i+2] by price[i], where i and i+2 just mean the DAY and the DAY + 2 for the price of that ticker. The dates are also in proper datetime format for operations using pandas.
The data looks like:
date | ticker | price |
2002-01-30 A 20
2002-01-31 A 21
2002-02-01 A 21.4
2002-02-02 A 21.3
.
.
That means I want to select the price based on the ticker, and on the DAY and the DAY + 2 for each ticker, to calculate the ratio price[i+2]/price[i].
I've considered using iloc, but I'm not sure how to select only specific tickers to do the math on.
Use groupby with transform:
df.groupby('ticker')['price'].transform(lambda x: x / x.shift(2))
0 NaN
1 NaN
2 1.070000
3 1.014286
Name: price, dtype: float64
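A close alternative, if you prefer a built-in over a lambda, is GroupBy.pct_change, which computes x / x.shift(n) - 1 within each group, so adding 1 gives the same ratio:

df.groupby('ticker')['price'].pct_change(periods=2) + 1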