How to separate a tuple into independent pandas columns? - python

I am working with matching two separate dataframes on first name using HMNI's fuzzymerge.
In the output, each row returns a key like: (May, 0.9905315373004635)
I am trying to separate the Name and Score into their own columns. I tried the code below but don't quite get the right output - every row ends up with the exact same name/score in the new columns.
for i, v in enumerate(matched.key):
matched['MatchedNameFinal'] = (matched.key[i][0][0])
matched['MatchedNameScore'] = (matched.key[i][0][1])
matched[['consumer_name_first', 'key','MatchedNameFinal', 'MatchedNameScore']]

First, when going over rows in pandas it is better to use apply:
matched['MatchedNameFinal'] = matched.key.apply(lambda x: x[0][0])
matched['MatchedNameScore'] = matched.key.apply(lambda x: x[0][1])
and in your case I think you are also missing a tab in the for loop. Note that even with the indentation, assigning a scalar to matched['MatchedNameFinal'] overwrites the whole column on every iteration (which is why every row ends up identical), so assign row by row with .loc:
for i, v in enumerate(matched.key):
    matched.loc[i, 'MatchedNameFinal'] = matched.key[i][0][0]
    matched.loc[i, 'MatchedNameScore'] = matched.key[i][0][1]

Generally, you want to avoid enumerate with pandas because pandas functions are vectorized and much faster to execute.
So this solution won't iterate using enumerate.
First, turn the list into a single tuple per row:
matched.key.explode()
Then use zip to split the tuples into 2 columns:
matched['col1'], matched['col2'] = zip(*tuples)
Or do it all in one line:
matched['MatchedNameFinal'], matched['MatchedNameScore'] = zip(*matched.key.explode())
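For reference, here is a minimal, self-contained sketch of the explode-and-zip approach, with a made-up matched frame standing in for the fuzzymerge output (the names and scores are invented):
import pandas as pd

# Hypothetical stand-in for the fuzzymerge output: each key is a
# one-element list containing a (name, score) tuple.
matched = pd.DataFrame({'key': [[('May', 0.9905)], [('June', 0.8713)]]})

# explode() unwraps each one-element list into its tuple; zip(*...)
# transposes those tuples into a sequence of names and a sequence of scores.
matched['MatchedNameFinal'], matched['MatchedNameScore'] = zip(*matched.key.explode())
print(matched)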

Related

Remove unwanted characters from Dataframe values in Pandas

I have the following DataFrame, full of locus/gene names from a multiple genome alignment.
I am trying to get a list of only the locus names, without the coordinates.
Tuberculosis_locus Smagmatis_locus H37RA_locus Bovis_locus
0 0:Rv0001:1-1524 1:MSMEG_RS33460:6986600-6988114 2:MRA_RS00005:1-1524 3:BQ2027_RS00005:1-1524
1 0:Rv0002:2052-3260 1:MSMEG_RS00005:499-1692 2:MRA_RS00010:2052-3260 3:BQ2027_RS00010:2052-3260
2 0:Rv0003:3280-4437 1:MSMEG_RS00015:2624-3778 2:MRA_RS00015:3280-4437 3:BQ2027_RS00015:3280-4437
To avoid issues with empty cells, I am filling cells with 'N/A' and then stripping the unwanted characters. But it gives back the exact same result; nothing seems to be happening.
for value in orthologs['Tuberculosis_locus']:
orthologs['Tuberculosis_locus'] = orthologs['Tuberculosis_locus'].fillna("N/A")
orthologs['Tuberculosis_locus'] = orthologs['Tuberculosis_locus'].map(lambda x: x.lstrip('\d:').rstrip(':\d+'))
Any idea on what I am doing wrong? I'd like the following output:
Tuberculosis_locus Smagmatis_locus H37RA_locus Bovis_locus
0 Rv0001 MSMEG_RS33460 MRA_RS00005 BQ2027_RS00005
1 Rv0002 MSMEG_RS00005 MRA_RS00010 BQ2027_RS00010
2 Rv0003 MSMEG_RS00015 MRA_RS00015 BQ2027_RS00015
Your lstrip/rstrip arguments are treated as sets of characters to strip, not regex patterns, which is why nothing changes. Instead, split by : with a maximum of two splits and take the 2nd element, e.g.:
df.applymap(lambda v: v.split(':', 2)[1])
def clean(x):
    x = x.split(':')[1].strip()
    return x

orthologs = orthologs.applymap(clean)
should work.
Explanation:
applymap applies a function element-wise to the whole DataFrame, while apply works on a single column (a Series).
clean is the function you want to apply to every entry of the DataFrame. Note that you pass clean itself, without (x), when you use it with applymap or apply.
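As a quick, self-contained check, here is a sketch using two rows of the question's data (column selection trimmed for brevity):
import pandas as pd

orthologs = pd.DataFrame({
    'Tuberculosis_locus': ['0:Rv0001:1-1524', '0:Rv0002:2052-3260'],
    'Bovis_locus': ['3:BQ2027_RS00005:1-1524', '3:BQ2027_RS00010:2052-3260'],
})

# Element-wise: split on ':' and keep the middle field (the locus name).
# On pandas 2.1+, DataFrame.map is the non-deprecated spelling of applymap.
cleaned = orthologs.applymap(lambda v: v.split(':')[1])
print(cleaned)
#   Tuberculosis_locus     Bovis_locus
# 0             Rv0001  BQ2027_RS00005
# 1             Rv0002  BQ2027_RS00010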

How can I access a row by its index using the pandas itertuples() method?

This is my code, where I need to access a certain tuple from the DataFrame df. Can you please help me with this? I can't find any answer regarding this issue.
import pandas as pd
import openpyxl
df_sheet_index = pd.read_excel("path/to/excel/file.xlsx")
df = df_sheet_index.itertuples()
for tuple in df:
    print(tuple)
This is the output
Pandas(Index=0, _1=nan, _2=nan, _3=nan, _4=nan, _5=nan, _6=nan, _7=nan, _8=nan, _9=nan, _10=nan, _11=nan, _12=nan, _13=nan, _14=nan, _15=nan, _16=nan, _17=nan, _18=nan, _19=nan, _20=nan, _21=nan, _22=nan)
Pandas(Index=1, _1=nan, _2=nan, _3=nan, _4=nan, _5=nan, _6=nan, _7=nan, _8=nan, _9=nan, _10=nan, _11=nan, _12=nan, _13=nan, _14=nan, _15=nan, _16=nan, _17=nan, _18=nan, _19=nan, _20=nan, _21=nan, _22=nan)
Pandas(Index=2, _1=nan, _2=nan, _3=nan, _4=nan, _5=nan, _6=nan, _7=nan, _8=nan, _9=nan, _10=nan, _11=nan, _12=nan, _13=nan, _14=nan, _15=nan, _16=nan, _17=nan, _18=nan, _19=nan, _20=nan, _21=nan, _22=nan)
...
EDIT: As a general rule, you should use pandas' built-in functions to search a DataFrame rather than iterating over it. It's more efficient and more readable.
But if you really want to access the tuples:
target_index = 10
for tu in df.itertuples():
    if tu[0] == target_index:
        print(tu)
In a more general view, this is a regular tuple, so you can access each element by its position: the index is tuple[0], your first column is tuple[1], the second tuple[2], etc.
NOTE: do not use tuple as a variable name; it is the name of Python's built-in tuple type, and shadowing it may create issues (on top of not being good practice).
If you are trying to get an element at a specific place, you can use .iloc.
It takes two parameters, the row and the column:
df.iloc[-1]["column"]
This will get the last row's element at that column.
For label-based access, use df.loc["row", "column"].
Note that df.loc[["row"]] returns a DataFrame while df.loc["row"] returns a Series.
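A minimal sketch (with a made-up frame standing in for the Excel sheet) showing direct access without iterating:
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': ['x', 'y', 'z']})

print(df.iloc[-1]['b'])  # positional: last row, column 'b' -> 'z'
print(df.loc[1, 'a'])    # label-based: row label 1, column 'a' -> 20
print(df.loc[[1]])       # double brackets -> one-row DataFrame
print(df.loc[1])         # single brackets -> Series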

How to drop element from a list inside a pandas column in Python?

I have a column in a dataframe whose cells contain lists. My dataframe column is:
[],
['NORM'],
['NORM'],
['NORM'],
['NORM'],
['MI', 'STTC'],
As you can see, I have an empty list and also a list with two elements. How can I change the lists with two elements so that only one element is kept (I don't care which one)?
I tried df.column.explode(), but this just adds more rows, and I don't want more rows; I just need to keep one of the elements.
Thank you so much.
You can use Series.map with a custom mapping function which transforms each element of the column as required:
df['col'] = df['col'].map(lambda l: l[:1])
Result:
# print(df['col'])
0 []
1 [NORM]
2 [NORM]
3 [NORM]
4 [NORM]
5 [MI]
Here i and j are the row and column labels of the cell you need to access; this will print the first element of the list:
list_ = df.loc[i][j]
if len(list_) > 0:
    print(list_[0])
Since you store lists in a pandas column, I assume you are not worried about vectorization, so you could just use a list comprehension:
df[col] = [i[:1] for i in df[col]]
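A quick end-to-end check of both approaches on the question's data (a sketch):
import pandas as pd

df = pd.DataFrame({'col': [[], ['NORM'], ['MI', 'STTC']]})

# map-based version: keep at most the first element; empty lists stay empty.
print(df['col'].map(lambda l: l[:1]).tolist())  # [[], ['NORM'], ['MI']]

# The list-comprehension version gives the same result.
print([l[:1] for l in df['col']])               # [[], ['NORM'], ['MI']]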

How to drop a series of rows from dataframe in a faster way

I have a data set and I want to drop some rows in a faster way; I tried the following code but it took a long time.
I want to drop every user who makes fewer than 3 operations.
Every operation is stored in its own row; user_id is a column of my data, not its index.
undesirable_users = []
for i in range(len(operations_per_user)):
    if operations_per_user.get_value(operations_per_user.index[i]) <= 3:
        undesirable_users.append(operations_per_user.index[i])

for i in range(len(undesirable_users)):
    data = data.drop(data[data.user_id == undesirable_users[i]].index)
data is a dataframe and operations_per_user is a Series created by: operations_per_user = data['user_id'].value_counts().
Why not just filter them? You don't need to loop at all.
You can get the filtered indexes by:
operations_per_user.index[operations_per_user <= 3]
And then you can filter these indexes from the df, making the solution:
data = data[~data['user_id'].isin(operations_per_user.index[operations_per_user <= 3])]
EDIT
My understanding is that you want to remove any user that occurs fewer than 3 times in the data. You don't need to create a value_counts Series for that; you can do a groupby, find the counts, and then filter on that basis.
filtered_user_ids = data.groupby('user_id').filter(lambda x: len(x) <= 3)['user_id'].tolist()
data = data[~data['user_id'].isin(filtered_user_ids)]
If data is a pandas DataFrame, and it contains both user_id and operations_per_user as columns, you should perform the drop with:
data = data.drop(data.loc[data['operations_per_user'] <= 3].index)
Edit
Instead of creating a separate Series, you could add operations_per_user as a column of data by mapping each row's user_id to its count:
data['operations_per_user'] = data['user_id'].map(data['user_id'].value_counts())
You could either perform the drop as above or perform the selection with the inverse logical condition:
data = data.loc[data['operations_per_user'] > 3]
Original
It would be preferable if you could supply some more information about the variables used in your code.
If operations_per_user is a pandas Series, your first loop could be improved with:
undesirable_users = []
for i in operations_per_user.index:
    if operations_per_user.loc[i] <= 3:
        undesirable_users.append(i)
The function get_value() is deprecated; use loc or iloc instead (the pandas documentation has a good summary of both).
You can iterate over Python lists directly; for your second loop:
for user in undesirable_users:
    data = data.drop(data.loc[data['user_id'] == user].index)
Rather than dropping, you can simply select the rows you want to keep by inverting the logical condition.
First, select only the users to keep.
Then build a boolean list whose length equals the number of rows of data.
Finally, select the rows to keep.
keepusers = operations_per_user.loc[operations_per_user > 3]
tokeep = [uid in keepusers for uid in data['user_id']]
newdata = data.loc[tokeep]
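For comparison, a fully vectorized sketch (with a small hypothetical frame) that keeps only users appearing more than 3 times:
import pandas as pd

# One row per operation; user 1 has four operations, the others fewer.
data = pd.DataFrame({'user_id': [1, 1, 1, 1, 2, 2, 3]})

counts = data['user_id'].value_counts()
data = data[data['user_id'].isin(counts.index[counts > 3])]
print(data['user_id'].tolist())  # [1, 1, 1, 1]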

Tricky str value replacement within PANDAS DataFrame

Problem Overview:
I am attempting to clean stock data loaded from a CSV file into a Pandas DataFrame. The indexing operation I perform works: if I call print, I can see the values I want are being pulled from the frame. However, when I try to replace the values, as shown in the screenshot, Pandas ignores my request. Ultimately, I'm just trying to extract a value out of one column and move it to another. The Pandas documentation suggests using the .replace() method, but that doesn't seem to work with the operation I'm trying to perform.
Here's a pic of the code and data before and after the code is run.
And the for loop (as referenced in the pic):
for i, j in zip(all_exchanges['MarketCap'], all_exchanges['MarketCapSym']):
    if 'M' in i: j = j.replace('n/a','M')
    elif 'B' in i: j = j.replace('n/a','M')
The problem is that j is a string, thus immutable.
You're replacing data, but not in the original dataset.
You have to do it another way, less elegant, without zip (I simplified your test BTW since it did the same thing on both conditions):
aem = all_exchanges['MarketCap']
aems = all_exchanges['MarketCapSym']
for i in range(min(len(aem), len(aems))):  # like zip: shortest of both
    if 'M' in aem[i] or 'B' in aem[i]:
        aems[i] = aems[i].replace('n/a', 'M')
Now you're replacing in the original dataset.
If both columns are in the same dataframe, all_exchanges, iterate over the rows with iterrows (a plain for loop over a DataFrame yields column names, not rows):
for i, row in all_exchanges.iterrows():
    # get whatever you want from row
    # using the index you should be able to set a value
    all_exchanges.loc[i, 'columnname'] = xyz
That should be the syntax, as far as I remember ;)
There are quite exhaustive tutorials on missing values in pandas; I suggest using fillna():
df['MarketCap'].fillna('M', inplace=True)
df['MarketCapSym'].fillna('M', inplace=True)
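One caveat: fillna only fills genuine missing values (NaN/None); if the column holds the literal string 'n/a', as the question's data suggests, you also need replace. A quick sketch:
import pandas as pd

s = pd.Series(['n/a', None])
print(s.fillna('M').tolist())                      # ['n/a', 'M'] -- the string survives
print(s.replace('n/a', 'M').fillna('M').tolist())  # ['M', 'M']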
Avoid iterating if you can. As already pointed out, you're not modifying the original data. Index on the MarketCap column and perform the replacement as follows.
# overwrites any data in the MarketCapSym column
all_exchanges.loc[all_exchanges['MarketCap'].str.contains('M|B'),
                  'MarketCapSym'] = 'M'
# only replaces 'n/a'
mask = all_exchanges['MarketCap'].str.contains('M|B')
all_exchanges.loc[mask, 'MarketCapSym'] = all_exchanges.loc[mask, 'MarketCapSym'].replace('n/a', 'M')
Thanks to all who posted. After thinking about your solutions and the problem a bit longer, I realized there might be a different approach. Instead of initializing a MarketCapSym column with 'n/a', I instead created that column as a copy of MarketCap and then stripped out everything that wasn't an "M" or "B".
I was able to get the solution down to one line (plus an import re):
all_exchanges['MarketCapSymbol'] = [ re.sub('[$.0-9]', '', i) for i in all_exchanges.loc[:,'MarketCap'] ]
A breakdown of the solution is as follows:
all_exchanges['MarketCapSymbol'] = - Make a new column on the DataFrame called 'MarketCapSymbol'.
all_exchanges.loc[:,'MarketCap'] - Initialize the values in the new column to those in 'MarketCap'.
re.sub('[$.0-9]', '', i) for i in - Since all I want is the 'M' or 'B', apply re.sub() to each element, removing the characters [$.0-9] and leaving only the M or B.
Using a list comprehension this way seemed a bit more natural/readable to me in my limited experience with Pandas. Let me know what you think!
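A runnable sketch of that one-liner on a few invented market-cap strings:
import re
import pandas as pd

# Hypothetical slice of the stock data.
all_exchanges = pd.DataFrame({'MarketCap': ['$7.91B', '$1.2M', '$800.5M']})

# Strip '$', '.', and digits, leaving only the scale letter.
all_exchanges['MarketCapSymbol'] = [re.sub('[$.0-9]', '', i)
                                    for i in all_exchanges.loc[:, 'MarketCap']]
print(all_exchanges['MarketCapSymbol'].tolist())  # ['B', 'M', 'M']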
