I have a dataframe derived from a massive list of market tickers from a crypto exchange.
The list includes all pairs, but I only need the tickers that are quoted against USD stablecoins.
The first 15 entries of the original dataframe:
Asset Price
0 1INCHBTC 0.00009650
1 1INCHBUSD 5.74340000
2 1INCHUSDT 5.74050000
3 AAVEBKRW 164167.00000000
4 AAVEBNB 0.77600000
5 AAVEBTC 0.00615200
6 AAVEBUSD 365.00200000
7 AAVEDOWNUSDT 2.02505200
8 AAVEETH 0.17212000
9 AAVEUPUSDT 81.89500000
10 AAVEUSDT 365.57600000
11 ACMBTC 0.00018420
12 ACMBUSD 10.91700000
13 ACMUSDT 10.89500000
14 ADAAUD 1.59600000
Now, there are many USD stablecoins, but not every ticker has a pair with one.
So I used the most popular ones to make sure every asset has at least one match.
df = df.loc[(df.Asset.str[-3:] == 'DAI') |
            (df.Asset.str[-4:] == 'USDT') |
            (df.Asset.str[-4:] == 'BUSD') |
            (df.Asset.str[-4:] == 'TUSD')]
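A more compact way to write the same filter, assuming pandas >= 1.4 (where Series.str.endswith accepts a tuple of suffixes):
# equivalent filter: keep rows whose Asset ends in any known stablecoin suffix
stable = ('DAI', 'USDT', 'BUSD', 'TUSD')
df = df.loc[df.Asset.str.endswith(stable)]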
The first 15 entries of the new but 'messy' dataframe:
Asset Price
0 1INCHBUSD 5.74340000
1 1INCHUSDT 5.74050000
2 AAVEBUSD 365.00200000
3 AAVEDOWNUSDT 2.02505200
4 AAVEUPUSDT 81.89500000
5 AAVEUSDT 365.57600000
6 ACMBUSD 10.91700000
7 ACMUSDT 10.89500000
8 ADABUSD 1.21439000
9 ADADOWNUSDT 3.46482700
10 ADATUSD 1.21284000
11 ADAUPUSDT 76.12900000
12 ADAUSDT 1.21394000
13 AERGOBUSD 0.43012000
14 AIONBUSD 0.07210000
How do I filter/merge entries in this dataframe so that duplicates are removed?
I also need the stablecoin substring removed from the end, so I'm left with just the asset and the USD price.
It should look something like this...
Asset Price
0 1INCH 5.74340000
2 AAVE 365.00200000
3 AAVEDOWN 2.02505200
4 AAVEUP 81.89500000
6 ACM 10.91700000
8 ADA 1.21439000
9 ADADOWN 3.46482700
11 ADAUP 76.12900000
13 AERGO 0.43012000
14 AION 0.07210000
This is for a portfolio tracker.
Also, if there is a better way to do this without the middle step, I'm all ears.
According to your expected output, you want to remove duplicates but keep the first item:
df.Asset = df.Asset.str.replace(r"(DAI|USDT|BUSD|TUSD)$", "", regex=True)
df = df.drop_duplicates(subset="Asset", keep="first")
print(df)
Prints:
Asset Price
0 1INCH 5.743400
2 AAVE 365.002000
3 AAVEDOWN 2.025052
4 AAVEUP 81.895000
6 ACM 10.917000
8 ADA 1.214390
9 ADADOWN 3.464827
11 ADAUP 76.129000
13 AERGO 0.430120
14 AION 0.072100
EDIT: To group and average:
df.Asset = df.Asset.str.replace(r"(DAI|USDT|BUSD|TUSD)$", "", regex=True)
df = df.groupby("Asset")["Price"].mean().reset_index()
print(df)
Prints:
Asset Price
0 1INCH 5.741950
1 AAVE 365.289000
2 AAVEDOWN 2.025052
3 AAVEUP 81.895000
4 ACM 10.906000
5 ADA 1.213723
6 ADADOWN 3.464827
7 ADAUP 76.129000
8 AERGO 0.430120
9 AION 0.072100
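As for doing it without the middle step: a minimal sketch that filters and strips the suffix in a single pass with str.extract, using the same suffix list as above (rows without a known suffix become NaN and are dropped):
import pandas as pd

# capture the base asset only when a known stablecoin suffix ends the ticker
pat = r"^(?P<base>.+?)(?:DAI|USDT|BUSD|TUSD)$"
out = (df.assign(Asset=df.Asset.str.extract(pat)["base"])
         .dropna(subset=["Asset"])              # rows that matched no suffix
         .drop_duplicates("Asset", keep="first"))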
Just do:
import numpy as np

con1 = df.Asset.str[-3:] == 'DAI'
con2 = df.Asset.str[-4:] == 'USDT'
con3 = df.Asset.str[-4:] == 'BUSD'
con4 = df.Asset.str[-4:] == 'TUSD'

# np.select takes the boolean Series themselves, not their names as strings;
# 'new' holds the base asset with the matched suffix stripped off
# (default='' is only hit by rows the mask below filters out anyway)
df['new'] = np.select([con1, con2, con3, con4],
                      [df.Asset.str[:-3], df.Asset.str[:-4],
                       df.Asset.str[:-4], df.Asset.str[:-4]],
                      default='')
out = df[con1 | con2 | con3 | con4].groupby('new').head(1)
or
df[con1 | con2 | con3 | con4].drop_duplicates('new')
I have a dataframe as shown below
Token Label StartID EndID
0 Germany Country 0 2
1 Berlin Capital 6 9
2 Frankfurt City 15 18
3 four million Number 21 24
4 Sweden Country 26 27
5 United Kingdom Country 32 34
6 ten million Number 40 45
7 London Capital 50 55
I am trying to get a row based on a certain condition, i.e. associate the label Number with the closest Capital, i.e. Berlin:
3 four million Number 21 24 -> 1 Berlin Capital 6 9
or something like:
df[row3] -> df[row1]
A pseudo-logic:
First check for the rows with the label Number; the assumption is that the matching city is always '2 rows' above or below and has the label Capital. Also, a Capital row is always located after a Country row.
What I have done until now:
import pandas as pd

columnsName = ['Token', 'Label', 'StartID', 'EndID']
df = pd.read_csv('resources/testcsv.csv', index_col=0, skip_blank_lines=True, header=0)
print(df)

# case=False already makes the match case-insensitive, so str.lower() is not needed
key_number = 'Number'
df_with_number = df[df['Label'].str.contains(r"\b{}\b".format(key_number), case=False, regex=True)]
print(df_with_number)

key_capital = 'Capital'
df_with_capitals = df[df['Label'].str.contains(r"\b{}\b".format(key_capital), case=False, regex=True)]
print(df_with_capitals)

key_country = 'Country'
df_with_country = df[df['Label'].str.contains(r"\b{}\b".format(key_country), case=False, regex=True)]
print(df_with_country)
The logic is to compare the indexes and then make the possible relations, i.e.
df[row3] -> [df[row1], df[row7]]
You could use merge_asof with the parameter direction='nearest', for example:
df_nb_cap = pd.merge_asof(df_with_number.reset_index(),
                          df_with_capitals.reset_index(),
                          on='index',
                          suffixes=('_nb', '_cap'),
                          direction='nearest')
print (df_nb_cap)
index Token_nb Label_nb StartID_nb EndID_nb Token_cap Label_cap \
0 3 four_million Number 21 24 Berlin Capital
1 6 ten_million Number 40 45 London Capital
StartID_cap EndID_cap
0 6 9
1 50 55
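If you also need the nearest preceding Country for each Number, a sketch along the same lines (assuming the df_with_country frame from the question; direction='backward' picks the closest earlier row):
df_nb_country = pd.merge_asof(df_with_number.reset_index(),
                              df_with_country.reset_index(),
                              on='index',
                              suffixes=('_nb', '_ctry'),
                              direction='backward')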
from io import StringIO

import numpy as np
import pandas as pd

# adjusted sample data (extra duplicate Number rows appended to exercise the masks)
s = """Token,Label,StartID,EndID
Germany,Country,0,2
Berlin,Capital,6,9
Frankfurt,City,15,18
four million,Number,21,24
Sweden,Country,26,27
United Kingdom,Country,32,34
ten million,Number,40,45
London,Capital,50,55
ten million,Number,40,45
ten million,Number,40,45"""
df = pd.read_csv(StringIO(s))
# create a mask for Number rows where Capital is 2 rows above or below
# and where Country is three rows above or one row below
mask = (df['Label'] == 'Number') & \
       ((df['Label'].shift(2) == 'Capital') |
        (df['Label'].shift(-2) == 'Capital')) & \
       ((df['Label'].shift(3) == 'Country') |
        (df['Label'].shift(-1) == 'Country'))
# create a mask for Capital rows where Number is 2 rows above or below
# and where Country is one row above
mask2 = (df['Label'] == 'Capital') & \
        ((df['Label'].shift(2) == 'Number') |
         (df['Label'].shift(-2) == 'Number')) & \
        (df['Label'].shift(1) == 'Country')
# hstack the two masked frames side by side and build a new frame
new_df = pd.DataFrame(np.hstack([df[mask].to_numpy(), df[mask2].to_numpy()]))
print(new_df)
0 1 2 3 4 5 6 7
0 four million Number 21 24 Berlin Capital 6 9
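The hstacked frame loses the column labels (hence the 0-7 header above); a small sketch to restore them, reusing the _nb/_cap suffix idea from the merge_asof answer:
# label the stacked columns: left half from the Number row, right half from the Capital row
cols = [f'{c}_nb' for c in df.columns] + [f'{c}_cap' for c in df.columns]
new_df = pd.DataFrame(np.hstack([df[mask].to_numpy(), df[mask2].to_numpy()]),
                      columns=cols)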
I am new to this field and stuck on this problem. I have two datasets.
all_batsman_df: this df has 5 columns ('years', 'team', 'pos', 'name', 'salary')
years team pos name salary
0 1991 SF 1B Will Clark 3750000.0
1 1991 NYY 1B Don Mattingly 3420000.0
2 1991 BAL 1B Glenn Davis 3275000.0
3 1991 MIL DH Paul Molitor 3233333.0
4 1991 TOR 3B Kelly Gruber 3033333.0
all_batting_statistics_df: this df has 31 columns
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on 'year' and 'name'. But the problem is that the two data frames spell some names differently: the first dataset has 'Glenn Davis' while the second has 'Glen Davis'.
Now, I want to know how I can merge them using the difflib library even though the names differ.
Any help will be appreciated.
Thanks in advance.
I have used this code, which I got from a question asked on this platform, but it is not working for me. I am adding new merge-key columns after matching names in both of the datasets. I know this is not a good approach. Kindly suggest if I can do it in a better way.
import cdifflib  # faster C implementation of difflib's SequenceMatcher
import pandas as pd

df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year']  # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']

for comp_a, addr_a in df_a[['Year', 'Name']].values:
    for ixb, (comp_b, addr_b) in enumerate(df_b[['years', 'name']].values):
        if cdifflib.CSequenceMatcher(None, comp_a, comp_b).ratio() > .6:
            df_b.loc[ixb, 'merge_year'] = comp_a  # creates a merge key in df_b
        if cdifflib.CSequenceMatcher(None, addr_a, addr_b).ratio() > .6:
            df_b.loc[ixb, 'merge_name'] = addr_a  # creates a merge key in df_b

merged_df = pd.merge(df_a, df_b, on=['merge_name', 'merge_year'], how='inner')
You can do:
import difflib

df_b['name'] = df_b['name'].apply(
    lambda x: difflib.get_close_matches(x, df_a['name'])[0])
to replace names in df_b with the closest match from df_a, then do your merge. See also this post.
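Note that get_close_matches returns an empty list when nothing clears its similarity cutoff, so the [0] above can raise an IndexError. A guarded sketch (closest_name is just an illustrative helper):
import difflib

def closest_name(x, candidates, cutoff=0.6):
    # hypothetical helper: fall back to the original name when nothing matches
    matches = difflib.get_close_matches(x, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else x

df_b['name'] = df_b['name'].apply(lambda x: closest_name(x, df_a['name']))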
Let me get to your problem by assuming that you need to end up with a data set keyed on 2 columns, 1. 'year' and 2. 'name'.
1. We will first fix all the names that are wrong.
I hope you know all the wrong names in all_batting_statistics_df; correct them with something like this (note that replace returns a new frame, so assign it back):
all_batting_statistics_df = all_batting_statistics_df.replace(regex=r'^Glen Davis$', value='Glenn Davis')
Once you have corrected all the spellings, start from the smaller data set that has the names you know, so it doesn't take long.
2. We need both data sets to keep only the key columns, i.e. 'year' and 'name'.
Use this to drop the columns we don't need (note that drop needs columns=, and 'Name' must be kept because we merge on it):
all_batsman_df_1 = all_batsman_df.drop(columns=['team', 'pos', 'salary'])
all_batting_statistics_df_1 = all_batting_statistics_df.drop(columns=['Rk', 'Age', 'Tm', 'Lg', 'G', 'PA', 'AB', 'R', 'Summary'])
I cannot see all 31 columns, so I left the rest out; you will have to add them to the above code.
3. We need to change the column names to match, i.e. 'year' and 'name', using DataFrame.rename:
df_new_1 = all_batting_statistics_df_1.rename(columns={'Year': 'year', 'Name': 'name'})
4. Next, to merge them we will use this (the first frame's 'years' column must be renamed to 'year' as well, and the merge goes on the shared keys rather than mismatched left_on/right_on):
all_batsman_df_1 = all_batsman_df_1.rename(columns={'years': 'year'})
all_batsman_df_1.merge(df_new_1, on=['year', 'name'])
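Putting the four steps together, a sketch (assuming the drop list is completed for the full 31 columns):
import pandas as pd

# 1. fix known misspellings
stats = all_batting_statistics_df.replace(regex=r'^Glen Davis$', value='Glenn Davis')
# 2. keep only the merge keys (extend the drop list as needed)
stats = stats.drop(columns=['Rk', 'Age', 'Tm', 'Lg', 'G', 'PA', 'AB', 'R', 'Summary'])
batsman = all_batsman_df.drop(columns=['team', 'pos', 'salary'])
# 3. align the column names on both sides
stats = stats.rename(columns={'Year': 'year', 'Name': 'name'})
batsman = batsman.rename(columns={'years': 'year'})
# 4. merge on the shared keys
merged = batsman.merge(stats, on=['year', 'name'])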
FINAL THOUGHTS:
If you don't want to do all this, find a way to export the data sets to Google Sheets or Microsoft Excel and edit them with those tools. If you like pandas, it's not that difficult; you will find a way. All the best!
I have searched a lot but couldn't find a solution to this particular case. I want to remove any rows whose lists contain too few strings or items (3 or fewer, as shown further down).
I'm preparing LDA topic modelling on a large Swedish database in pandas and have limited the test case to 1000 rows. I'm only concerned with a specific column, and my approach so far has been as follows:
import sqlite3

import pandas as pd

con = sqlite3.connect('/Users/mo/EXP/NAV/afm.db')
sql = """
select * from stillinger limit 1000
"""
dfs = pd.read_sql(sql, con)

plb = """
select PLATSBESKRIVNING from stillinger limit 1000
"""
dfp = pd.read_sql(plb, con)
dfp
Then I defined a regular expression where the first pattern removes any meta characters while keeping the Swedish- and Norwegian-specific letters, and the second removes words of three characters or fewer:
rep = {
    'PLATSBESKRIVNING': {
        r'[^A-Za-zÅåÄäÖöÆØÅæøå]+': ' ',
        r'\W*\b\w{1,3}\b': ' '}
}
p0 = (pd.DataFrame(dfp['PLATSBESKRIVNING'].str.lower())
        .replace(rep, regex=True)
        .drop_duplicates('PLATSBESKRIVNING')
        .reset_index(drop=True))
p0
PLATSBESKRIVNING
0 medrek rekrytering söker uppdrag manpower h...
1 familj barn tjejer kille söker pair ...
2 uppgift blir tillsammans medarbetare leda ...
3 behov operasjonssykepleiere langtidsoppdr...
4 detta perfekta jobbet arbetstiderna vardaga...
5 familj paris barn söker älskar barn v...
6 alla inom cafe restaurang förekommande arbets...
.
.
Creating a pandas Series:
s0 = p0['PLATSBESKRIVNING']
Then:
ts = s0.str.lower().str.split()
ts
0 [medrek, rekrytering, söker, uppdrag, manpower...
1 [familj, barn, tjejer, kille, söker, pair, vil...
2 [uppgift, blir, tillsammans, medarbetare, leda...
3 [behov, operasjonssykepleiere, langtidsoppdrag...
4 [detta, perfekta, jobbet, arbetstiderna, varda...
5 [familj, paris, barn, söker, älskar, barn, vil...
6 [alla, inom, cafe, restaurang, förekommande, a...
7 [diskare, till, cafe, dubbel, sökes, arbetet, ...
8 [diskare, till, thelins, konditori, sökes, arb...
Removing the stop words from the database (mswl is my Swedish stop word list):
r = s0.str.split().apply(lambda x: [item for item in x if item not in mswl])
r
0 [uppdrag, bemanningsföretag, erbjuds, tillägg,...
1 [föräldrarna, citycentre, stort, tomt, mamman,...
2 [utveckla, övergripande, strategiska, frågor, ...
3 [erfaring, sykepleier, legitimasjon]
4 [arbetstiderna, vardagar, härliga, människor, ...
5 [paris, utav, badrum, båda, yngsta, endast, fö...
6 [förekommande, emot, utbildning]
7 []
8 [thelins]
9 [paris, baby, månader, våning, delar, badrum, ...
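Note: mswl above is defined elsewhere in the script; a minimal sketch of building such a list, assuming NLTK and its stopwords corpus are installed:
# hypothetical construction of mswl (requires a prior nltk.download('stopwords'))
from nltk.corpus import stopwords
mswl = set(stopwords.words('swedish'))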
Creating a new DataFrame and removing the empty brackets:
dr = pd.DataFrame(r)
dr0 = dr[dr.astype(str)['PLATSBESKRIVNING'] != '[]'].reset_index(drop=True)
dr0
PLATSBESKRIVNING
0 [uppdrag, bemanningsföretag, erbjuds, tillägg,...
1 [föräldrarna, citycentre, stort, tomt, mamman,...
2 [utveckla, övergripande, strategiska, frågor, ...
3 [erfaring, sykepleier, legitimasjon]
4 [arbetstiderna, vardagar, härliga, människor, ...
5 [paris, utav, badrum, båda, yngsta, endast, fö...
6 [förekommande, emot, utbildning]
7 [thelins]
8 [paris, baby, månader, våning, delar, badrum, ...
Converting each list to its string representation:
dr1 = dr0['PLATSBESKRIVNING'].apply(str)
len(dr1), type(dr1), dr1
0 ['uppdrag', 'bemanningsföretag', 'erbjuds', 't...
1 ['föräldrarna', 'citycentre', 'stort', 'tomt',...
2 ['utveckla', 'övergripande', 'strategiska', 'f...
3 ['erfaring', 'sykepleier', 'legitimasjon']
4 ['arbetstiderna', 'vardagar', 'härliga', 'männ...
5 ['paris', 'utav', 'badrum', 'båda', 'yngsta', ...
6 ['förekommande', 'emot', 'utbildning']
7 ['thelins']
8 ['paris', 'baby', 'månader', 'våning', 'delar'...
My issue now is that I want to remove any rows whose lists contain 3 or fewer strings, e.g. rows 3, 6 and 7. The desired result would be like this:
0 ['uppdrag', 'bemanningsföretag', 'erbjuds', 't...
1 ['föräldrarna', 'citycentre', 'stort', 'tomt',...
2 ['utveckla', 'övergripande', 'strategiska', 'f...
3 ['arbetstiderna', 'vardagar', 'härliga', 'männ...
4 ['paris', 'utav', 'badrum', 'båda', 'yngsta', ...
5 ['paris', 'baby', 'månader', 'våning', 'delar'...
.
.
How can I obtain this? I'm also wondering if this could be done in a neater way; my approach seems clumsy and cumbersome.
I would also like to drop both the index and the column name for the LDA topic modelling, so that I can write the column to a text file without the header and the index digits. I have tried:
dr1.to_csv('LDA1.txt', header=None, index=False)
But this wraps each list of strings in the file in quotation marks: "['word1', 'word2', 't.. ]".
Any suggestions would be much appreciated.
Best regards
Mo
Just measure the number of items in each list and keep the rows with more than 3 items (which matches your expected output):
dr0['length'] = dr0['PLATSBESKRIVNING'].apply(len)
cond = dr0['length'] > 3
dr0 = dr0[cond]
You can apply len and then select the data, storing it in whatever dataframe variable you like, i.e.
df[df['PLATSBESKRIVNING'].apply(len) > 3]
Output:
PLATSBESKRIVNING
0 [uppdrag, bemanningsföretag, erbjuds, nice]
1 [föräldrarna, citycentre, stort, tomt]
2 [utveckla, övergripande, strategiska, fince]
4 [arbetstiderna, vardagar, härliga, männ]
5 [paris, utav, badrum, båda, yngsta]
8 [paris, baby, månader, våning, delar]
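To also address the to_csv part of the question: writing each token list as a plain space-separated line avoids the csv quoting entirely. A sketch, assuming dr0 is the filtered frame and its column still holds lists (not their str form as in dr1):
# one space-separated document per line; no header, index, or quoting
docs = dr0['PLATSBESKRIVNING'].apply(' '.join)
with open('LDA1.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(docs) + '\n')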