Pandas df loop + merge - python

Hello, I need your wisdom.
I'm still new to Python and pandas, and I'm trying to achieve the following.
df = pd.DataFrame({'code': [125, 265, 128, 368, 4682, 12, 26, 12, 36, 46, 1, 2, 1, 3, 6],
                   'parent': [12, 26, 12, 36, 46, 1, 2, 1, 3, 6, 'a', 'b', 'a', 'c', 'f'],
                   'name': ['unknow'] * 10 + ['g1', 'g2', 'g1', 'g3', 'g6']})
ds = pd.DataFrame({'code': [125, 265, 128, 368, 4682],
                   'name': ['Eagle', 'Cat', 'Koala', 'Panther', 'Dophin']})
I would like to add a new column to the ds dataframe with the name of the topmost parent.
As an example, for the first row:
code | name | category
125 | Eagle | a
"a" is the result of following the chain between df.code and df.parent: 125 > 12 > 1 > a.
Since the last parent is not a number but a letter, I think I must use a regex and then .merge from pandas to populate the ds['category'] column. Maybe an apply function would also work, but that seems a little above my current knowledge.
Could anyone help me with this?
Regards,

The following is certainly not the fastest solution, but it works if your dataframes are not too big. First create a dictionary from the parent codes of df, then apply this dict repeatedly until you reach a code that has no parent.
p = df[['code', 'parent']].set_index('code').to_dict()['parent']

def get_parent(code):
    # follow the parent chain until the code has no parent entry
    while par := p.get(code):
        code = par
    return code

ds['category'] = ds.code.apply(get_parent)
Result:
code name category
0 125 Eagle a
1 265 Cat b
2 128 Koala a
3 368 Panther c
4 4682 Dophin f
PS: get_parent uses an assignment expression (Python >= 3.8); for older versions of Python you could use:
def get_parent(code):
    while True:
        par = p.get(code)
        if par:
            code = par
        else:
            return code
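One caveat: if the parent data ever contained a cycle (a code that is, directly or indirectly, its own parent), both versions would loop forever. A minimal sketch of a guarded variant, under the assumption that returning the last code seen is acceptable when a cycle is detected:

def get_parent_safe(code):
    seen = set()
    while (par := p.get(code)) is not None:
        if code in seen:
            # cycle detected: stop and return the current code
            break
        seen.add(code)
        code = par
    return code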

Related

Need to extract specific word from text

I am trying to run a data-cleaning process in Python, and one of the columns, which has a great many rows, is as follows:
|Website |
|:------------------|
|m.google.com |
|uk.search.yahoo |
|us.search.yahoo.com|
|google.co.in |
|m.youtube |
|youtube.com |
I want to extract the company name from the text.
The output should be as follows:
|Website |Company|
|:------------------|:------|
|m.google.com |google |
|uk.search.yahoo |yahoo |
|us.search.yahoo.com|yahoo |
|google.co.in |google |
|m.youtube |youtube|
|youtube.com |youtube|
The data is too big to do this manually, and being a beginner, I have tried everything I learned. Please help!
Not bullet-proof, but maybe a feasible heuristic:
import pandas as pd
d = {'Website': {0: 'm.google.com', 1: 'uk.search.yahoo', 2: 'us.search.yahoo.com', 3: 'google.co.in', 4: 'm.youtube', 5: 'youtube.com'}}
df = pd.DataFrame(data=d)
df['Website'].str.split('.').map(lambda l: [e for e in l if len(e)>3][-1])
0 google
1 yahoo
2 yahoo
3 google
4 youtube
5 youtube
Name: Website, dtype: object
Explanation:
Split the string on ., filter out substrings with three or fewer characters, then take the rightmost element that wasn't filtered out.
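If installing a third-party package is an option, the tldextract library is a more robust alternative: it consults the public-suffix list, so cases like google.co.in are handled without length heuristics. A sketch, assuming tldextract is installed:

import tldextract

# domain is the registrable name with subdomains and the public suffix stripped
df['Company'] = df['Website'].map(lambda url: tldextract.extract(url).domain)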
I applied this trick on a large Kaggle dataset and it worked for me. Assuming that you already have a pandas DataFrame named df:
company = df['Website']
ext_list = ['www.', '.com', '.edu', '.net', '.org', '.gov', '.mil', 'm.', 'uk.search.', 'us.search.', '.co.in', '.af']
for extension in ext_list:
    # regex=False treats '.' literally instead of as a regex wildcard
    company = company.str.replace(extension, '', regex=False)
df['company'] = company
df['company'].head(15)
Now look at your data carefully, either at the head or tail, and try to find any extension missing from the list; if you find one, add it to ext_list.
You can also verify the result using:
df['company'].unique()
Here is a way of checking the running time as well. The approach is linear in the number of rows (for a fixed extension list), so it also performs well on large datasets.
import time

def time_it(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(func.__name__ + " took " + str((end - start) * 1000) + " milliseconds")
        return result
    return wrapper

@time_it
def specific_word(col_name, ext_list):
    for extension in ext_list:
        col_name = col_name.str.replace(extension, '', regex=False)
    return col_name

if __name__ == '__main__':
    company = df['Website']
    extensions = ['www.', '.com', '.edu', '.net', '.org', '.gov', '.mil', 'm.', 'uk.search.', 'us.search.', '.co.in', '.af']
    result = specific_word(company, extensions)
    print(result.head())
I applied this to an estimated 10,000 values.

Pandas to modify values in csv file based on function

I have a CSV file that looks like the below; this is the same as my last question, but this time using pandas.
Group Sam Dan Bori Son John Mave
A 0.00258844 0.983322 1.61479 1.2785 1.96963 10.6945
B 0.0026034 0.983305 1.61198 1.26239 1.9742 10.6838
C 0.0026174 0.983294 1.60913 1.24543 1.97877 10.6729
D 0.00263062 0.983289 1.60624 1.22758 1.98334 10.6618
E 0.00264304 0.98329 1.60332 1.20885 1.98791 10.6505
I have a function like below
def getnewno(value):
    value = value + 30
    if value > 40:
        value = value - 20
    else:
        value = value
    return value
I want to send all these values to the getnewno function, get new values back, and update the CSV file. How can this be accomplished with pandas?
Expected output:
Group Sam Dan Bori Son John Mave
A 30.00258844 30.983322 31.61479 31.2785 31.96963 20.6945
B 30.0026034 30.983305 31.61198 31.26239 31.9742 20.6838
C 30.0026174 30.983294 31.60913 31.24543 31.97877 20.6729
D 30.00263062 30.983289 31.60624 31.22758 31.98334 20.6618
E 30.00264304 30.98329 31.60332 31.20885 31.98791 20.6505
The following should give you what you desire.
Applying a function
Your function can be simplified and here expressed as a lambda function.
It's then a matter of applying your function to all of the columns. There are a number of ways to do so. The first idea that comes to mind is to loop over df.columns. However, we can do better than this by using the applymap or transform methods:
import pandas as pd

# Read in the data from file
df = pd.read_csv('data.csv',
                 sep=r'\s+',
                 index_col=0)

# Simplified function with which to transform data
getnewno = lambda value: value + 10 if value > 10 else value + 30

# Looping over columns
#for col in df.columns:
#    df[col] = df[col].apply(getnewno)

# Apply to all columns without a loop
df = df.applymap(getnewno)

# Write out updated data
df.to_csv('data_updated.csv')
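Note that in pandas 2.1+ DataFrame.applymap is deprecated in favour of the equivalent DataFrame.map, so on a current install the apply step would be:

df = df.map(getnewno)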
Using broadcasting
You can achieve your result using broadcasting and a little boolean logic. This avoids looping over any columns, and should ultimately prove faster and less memory intensive (although if your dataset is small any speed-up would be negligible):
import pandas as pd

df = pd.read_csv('data.csv',
                 sep=r'\s+',
                 index_col=0)

df += 30
make_smaller = df > 40
df[make_smaller] -= 20
First of all, your getnewno function looks too complicated... it can be simplified to e.g.:
def getnewno(value):
    if value + 30 > 40:
        return value + 10
    else:
        return value + 30
you can even change value + 30 > 40 to value > 10.
Or even a one-liner if you want:
getnewno = lambda value: value + 10 if value > 10 else value + 30
Having the function, you can apply it to specific values/columns. For example, if you want to create a column Mark_updated based on a Mark column, it would look like this (I assume your pandas DataFrame is called df):
df['Mark_updated'] = df['Mark'].apply(getnewno)
Use the mask function for an if-else solution before writing the data to csv:
res = (df
       .select_dtypes('number')
       .add(30)
       # the if-else comes in here:
       # if any entry in the dataframe is greater than 40, subtract 20 from it,
       # else leave it as is
       .mask(lambda x: x > 40, lambda x: x.sub(20))
       )

# insert the Group column back
res.insert(0, 'Group', df.Group.array)

# write to csv
res.to_csv(filename)
Group Sam Dan Bori Son John Mave
0 A 30.002588 30.983322 31.61479 31.27850 31.96963 20.6945
1 B 30.002603 30.983305 31.61198 31.26239 31.97420 20.6838
2 C 30.002617 30.983294 31.60913 31.24543 31.97877 20.6729
3 D 30.002631 30.983289 31.60624 31.22758 31.98334 20.6618
4 E 30.002643 30.983290 31.60332 31.20885 31.98791 20.6505

Dataframe getting all data in one 'cell'

I'm having some problems putting the data I want into a specific df.
When I print the value outside a df I get:
(link to an image, because the answer doesn't show up if I type it here)
Then I try to use a pandas DataFrame to hold it, and I get:
dinner = pd.DataFrame([dinner])
dinner.head()
Home Made - Tuna Poke, 472 gm (4 Ounces) {'cal...
So basically everything ends up in just one cell. I would like to get something like:
A | Calories | carbohydrates
Home made - tuna poke | 592 | 8
Does anyone know how can I do it?
dinner looks like a string parsed from HTML text. If that is the case and the data follows a regular pattern, then the following code may work.
nutritions = dinner.split('{')[1].split('}')[0].split(', ')
menu = dinner.split('{')[0].strip('<').strip()
dict_dinner = {}
for n in nutritions:
    item, qty = n.split(': ')
    dict_dinner[item.strip("'")] = qty
df = pd.DataFrame(dict_dinner, index=[menu])
print(df)
This outputs a one-row DataFrame with the menu item as the index and one column per nutrient.
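As a concrete check, the snippet above can be exercised on a made-up dinner string; the field names here are assumptions, since the original value was only shown truncated:

# hypothetical input, modelled on the truncated value in the question
dinner = "Home Made - Tuna Poke, 472 gm (4 Ounces) {'calories': 592, 'carbohydrates': 8}"

With this input, the code produces a single-row DataFrame indexed by the menu name, with 'calories' and 'carbohydrates' columns holding '592' and '8' (as strings, so convert with astype(int) if you need numbers).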

Returning Max value grouping by N attributes

I am coming from a Java background and learning Python by applying it in my work environment whenever possible. I have a piece of functioning code that I would really like to improve.
Essentially I have a list of namedtuples with 3 numerical values and 1 time value.
from collections import namedtuple

complete = []
uniquecomplete = set()
screenedPartitions = namedtuple('screenedPartitions', ['feedID', 'partition', 'date', 'screeeningMode'])
I parse a log, and after this is populated I want to create a reduced set that keeps only the most recently dated member where feedID, partition and screeeningMode are identical. So far I can only get there with a nasty nested loop.
for a in complete:
    max = a
    for b in complete:
        if a.feedID == b.feedID and a.partition == b.partition and \
           a.screeeningMode == b.screeeningMode and a.date < b.date:
            max = b
    uniquecomplete.add(max)
Could anyone give me advice on how to improve this? It would be great to work it out with whats available in the stdlib, as I guess my main task here is to get me thinking about it with the map/filter functionality.
The data looks akin to
FeedID | Partition | Date | ScreeningMode
68 | 5 |10/04/2017 12:40| EPEP
164 | 1 |09/04/2017 19:53| ISCION
164 | 1 |09/04/2017 20:50| ISCION
180 | 1 |10/04/2017 06:11| ISAN
128 | 1 |09/04/2017 21:16| ESAN
So after the code is run, the second data row would be removed, as the third row is a more recent version.
TL;DR: what would this SQL be in Python?
SELECT feedID, partition, screeeningMode, MAX(date)
FROM Complete
GROUP BY feedID, partition, screeeningMode
Try something like this:
import pandas as pd
df = pd.DataFrame(screenedPartitions, columns=screenedPartitions._fields)
df = df.groupby(['feedID','partition','screeeningMode']).max()
It really depends on how your date is represented, but if you provide data I think we can work something out.
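Since the question asks about the stdlib, here is a sketch without pandas, assuming date is a comparable type such as datetime: keep a dict keyed by the grouping attributes and retain the row with the greatest date.

best = {}
for row in complete:
    key = (row.feedID, row.partition, row.screeeningMode)
    # keep only the most recently dated row per key
    if key not in best or best[key].date < row.date:
        best[key] = row
uniquecomplete = set(best.values())

This is a single pass, so O(n) instead of the nested loop's O(n^2).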

Replace WhiteSpace with a 0 in Pandas (Python 3)

Simple question here -- how do I replace all of the whitespace in a column with a zero?
For example:
Name Age
John 12
Mary
Tim 15
into
Name Age
John 12
Mary 0
Tim 15
I've been trying something like this, but I am unsure how pandas actually reads whitespace:
merged['Age'].replace(" ", 0).bfill()
Any ideas?
merged['Age'] = merged['Age'].apply(lambda x: 0 if x == ' ' else x)
Use the built-in method convert_objects and set the param convert_numeric=True:
In [12]:
# convert objects will handle multiple whitespace, this will convert them to NaN
# we then call fillna to convert those to 0
df.Age = df[['Age']].convert_objects(convert_numeric=True).fillna(0)
df
Out[12]:
Name Age
0 John 12
1 Mary 0
2 Tim 15
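Note that convert_objects was deprecated and has since been removed from pandas; on a modern install, pd.to_numeric with errors='coerce' achieves the same effect:

import pandas as pd

# non-numeric entries (including whitespace) become NaN, then 0
df['Age'] = pd.to_numeric(df['Age'], errors='coerce').fillna(0)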
Here's an answer modified from this, more thorough question. I'll make it a little bit more Pythonic and resolve your basestring issue.
def ws_to_zero(maybe_ws):
    try:
        if maybe_ws.isspace():
            return 0
        else:
            return maybe_ws
    except AttributeError:
        # non-string values have no .isspace(); pass them through unchanged
        return maybe_ws

d.applymap(ws_to_zero)
where d is your dataframe.
If you want to use NumPy, then you can use the snippet below:
import numpy as np
df['column_of_interest'] = np.where(df['column_of_interest']==' ',0,df['column_of_interest']).astype(float)
While Paulo's response is excellent, my snippet above may be useful when multiple criteria are required during advanced data manipulation.
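One more option: your own replace attempt nearly works; a regex form should catch cells containing any amount of whitespace rather than only a single literal space:

# match cells that are empty or all whitespace and replace them with 0
merged['Age'] = merged['Age'].replace(r'^\s*$', 0, regex=True)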
