I am trying to run a data cleaning process in Python, and one of the columns, which has a very large number of rows, looks like this:
|Website |
|:------------------|
|m.google.com |
|uk.search.yahoo |
|us.search.yahoo.com|
|google.co.in |
|m.youtube |
|youtube.com |
I want to extract the company name from the text. The output should look like this:
|Website |Company|
|:------------------|:------|
|m.google.com |google |
|uk.search.yahoo |yahoo |
|us.search.yahoo.com|yahoo |
|google.co.in |google |
|m.youtube |youtube|
|youtube.com |youtube|
The data is too big to do this manually, and being a beginner, I have tried everything I know. Please help!
Not bullet-proof but maybe a feasible heuristic:
import pandas as pd
d = {'Website': {0: 'm.google.com', 1: 'uk.search.yahoo', 2: 'us.search.yahoo.com', 3: 'google.co.in', 4: 'm.youtube', 5: 'youtube.com'}}
df = pd.DataFrame(data=d)
df['Website'].str.split('.').map(lambda l: [e for e in l if len(e)>3][-1])
0 google
1 yahoo
2 yahoo
3 google
4 youtube
5 youtube
Name: Website, dtype: object
Explanation:
Split the string on ., filter out substrings with three characters or fewer, then take the rightmost element that wasn't filtered out.
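If you also want the result stored in the Company column the question asks for, the same expression can simply be assigned back to the frame:
df['Company'] = df['Website'].str.split('.').map(lambda l: [e for e in l if len(e) > 3][-1])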
I applied this trick to a large Kaggle dataset and it works for me. This assumes you already have a pandas DataFrame named df.
company = df['Website']
ext_list = ['www.','.com','.edu','.net','.org','.gov','.mil','m.','uk.search.','us.search.','.co.in','.af']
for extension in ext_list:
    # regex=False so '.' is treated literally rather than as a regex wildcard
    company = company.str.replace(extension, '', regex=False)
df['company'] = company
df['company'].head(15)
Now look carefully at your data, either by checking the head or the tail, and see whether any extension is missing from the list; if you find one, add it to ext_list.
Now you can also verify it using
df['company'].unique()
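One rough way (a sketch, not bullet-proof either) to spot extensions the list missed is to look for cleaned values that still contain a dot:
# Values that still contain a dot probably need another entry in ext_list
leftover = df.loc[df['company'].str.contains('.', regex=False), 'company']
print(leftover.unique())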
Here is a way of checking its running time. Its Big O complexity is O(n), so it also performs well on large datasets.
import time
def time_it(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(func.__name__ + " took " + str((end - start) * 1000) + " milliseconds")
        return result
    return wrapper

@time_it
def specific_word(col_name, ext_list):
    for extension in ext_list:
        # regex=False so '.' is treated literally rather than as a regex wildcard
        col_name = col_name.str.replace(extension, '', regex=False)
    return col_name

if __name__ == '__main__':
    company = df['Website']
    extensions = ['www.', '.com', '.edu', '.net', '.org', '.gov', '.mil', 'm.', 'uk.search.', 'us.search.', '.co.in', '.af']
    result = specific_word(company, extensions)
    print(result.head())
Here it is applied to an estimated 10,000 values.
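If you want to reproduce a rough timing yourself, a minimal sketch on synthetic data could look like this (the ~10,000 websites are just the sample values repeated; specific_word and @time_it are the functions defined above):
import pandas as pd

# Build a synthetic Series of roughly 10,000 website strings for timing purposes
sample = ['m.google.com', 'uk.search.yahoo', 'us.search.yahoo.com', 'google.co.in', 'm.youtube', 'youtube.com']
big = pd.Series(sample * 1700, name='Website')  # ~10,200 values

extensions = ['www.', '.com', '.edu', '.net', '.org', '.gov', '.mil', 'm.', 'uk.search.', 'us.search.', '.co.in', '.af']
result = specific_word(big, extensions)  # @time_it prints the elapsed milliseconds
print(result.head())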
Related
I am trying to find a string in an Excel spreadsheet, but my code only checks the first row and neglects to search the rest.
In my code I am using Tkinter to get input from the user and a link_url() function to match it against each cell of a column in the Excel sheet; if it matches, I want to capture the value of another column in the same row.
Here is a sample of the Excel sheet's contents:
Name Link
0 ABC www.linkname1.com
1 DEF www.linkname2.com
2 GHI www.linkname3.com
3 JKL www.linkname4.com
4 MNO www.linkname5.com
5 PQR www.linkname6.com
6 STU www.linkname7.com
7 VWX www.linkname8.com
8 YZZ www.linkname9.com
9 123 www.linkname10.com
I created the following function to search for the input:
def link_url():
    df = pd.read_excel('Links.xlsx')
    for i in df.index:
        # print(df['Name'])
        # print(e.get())
        if e.get() in df['Name'][i]:
            print(df['Name'][i])
            link_url = df['Link'][i]
            known.append(e.get())
            return link_url
        else:
            unknown.append(e.get())
            unknown_request = "I will search and return back to you"
            return unknown_request
My Question
When I search for ABC it returns www.linkname1.com as requested, but when I search for DEF it returns "I will search and return back to you". Why is that happening and how can I fix it?
I may be misunderstanding the question a bit (Henry Ecker is right about the direct issue you are running into), but the solution with Pandas feels a bit weird to me.
I guess I'd personally do something more like this to filter a data frame down to a specific row. I generally avoid for-looping through data frames as much as I can.
import pandas as pd

my_data = pd.DataFrame(
    {'Name': ['ABC', 'DEF', 'GHI'],
     'Link': ['www.linkname1.com', 'www.linkname2.com', 'www.linkname3.com']}
)

keep = my_data.Name.eq('DEF')
result = my_data[keep]

if len(result) > 0:
    print(result.Link.values)
else:
    print("I will search and return back to you")
Hello guys, I need your wisdom.
I'm still new to Python and pandas, and I'm trying to achieve the following.
df = pd.DataFrame({
    'code': [125, 265, 128, 368, 4682, 12, 26, 12, 36, 46, 1, 2, 1, 3, 6],
    'parent': [12, 26, 12, 36, 46, 1, 2, 1, 3, 6, 'a', 'b', 'a', 'c', 'f'],
    'name': ['unknow', 'unknow', 'unknow', 'unknow', 'unknow', 'unknow', 'unknow',
             'unknow', 'unknow', 'unknow', 'g1', 'g2', 'g1', 'g3', 'g6']
})
ds = pd.DataFrame({
    'code': [125, 265, 128, 368, 4682],
    'name': ['Eagle', 'Cat', 'Koala', 'Panther', 'Dophin']
})
I would like to add a new column to the ds dataframe with the name of the highest-level parent.
As an example, for the first row:
code | name | category
125 | Eagle | a
"a" is the result of a loop between df.code and df.parent 125 > 12 > 1 > a
Since the last parent is not a number but a letter i think I must use a regex and than .merge from pandas to populate the ds['category'] column. Also maybe use an apply function but it seems a little bit above my current knowledge.
Could anyone help me with this?
Regards,
The following is certainly not the fastest solution, but it works if your dataframes are not too big. First create a dictionary from the parent codes in df and then apply this dict repeatedly until you reach the end of the chain.
p = df[['code','parent']].set_index('code').to_dict()['parent']
def get_parent(code):
    while par := p.get(code):
        code = par
    return code
ds['category'] = ds.code.apply(get_parent)
Result:
code name category
0 125 Eagle a
1 265 Cat b
2 128 Koala a
3 368 Panther c
4 4682 Dophin f
PS: get_parent uses an assignment expression (Python >= 3.8); for older versions of Python you could use:
def get_parent(code):
    while True:
        par = p.get(code)
        if par:
            code = par
        else:
            return code
I have a CSV sheet with data like this:
| not used | Day 1 | Day 2 |
|:---------|:------|:------|
| Person 1 | Score | Score |
| Person 2 | Score | Score |
But with a lot more rows and columns. Every day I get each person's progress as a dictionary where the keys are names and the values are score amounts.
The thing is, sometimes that dictionary will include new people and leave out already existing ones. If a new person appears, they should get 0 for every previous day, and if the dict doesn't include an already existing person, that person should get a 0 score for that day.
My idea for solving this is to do lines = file.readlines() on that CSV file and make a new list of people's names with
for line in lines:
    names.append(line.split(",")[0])
then making a copy of lines (newLines = lines)
and going through the dict's keys, checking whether each person is already in the CSV and, if so, appending the value followed by a comma.
But I'm stuck at the part of adding the scores of 0.
Any help or contributions would be appreciated
EXAMPLE: Before, I have this:
-,day1,day2,day3
Mark,1500,0,1660
John,1800,1640,0
Peter,1670,1680,1630
Hannah,1480,1520,1570
And I have this dictionary to add
{'Mark': 1750, 'Hannah':1640, 'Brian':1780}
The result should be
-,day1,day2,day3,day4
Mark,1500,0,1660,1750
John,1800,1640,0,0
Peter,1670,1680,1630,0
Hannah,1480,1520,1570,1640
Brian,0,0,0,1780
See how Brian is in the dict but not in the before CSV, and he got added with 0 for every other day. I figured out that splitting one line with .split(',') gives a list of N elements, where N - 2 is the number of zero scores to add before that person's first day.
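For illustration, a tiny sketch of that padding arithmetic (assuming the header line already contains the new day column):
header = "-,day1,day2,day3,day4"
n = len(header.split(","))        # 5 elements
zeros = ["0"] * (n - 2)           # 3 zeros for day1..day3
row = ",".join(["Brian"] + zeros + ["1780"])
print(row)                        # Brian,0,0,0,1780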
This is easy to do in pandas as an outer join. Read the CSV into a dataframe and generate a new dataframe from the dictionary. The join is almost what you want, except that not-a-number values are inserted for empty cells, so you need to fill the NaNs with zero and reconvert everything to integer.
The one potential problem is that the resulting CSV is sorted; the new rows are not simply appended to the bottom.
import pandas as pd
import errno
import os
INDEX_COL = "-"
def add_days_score(filename, colname, scores):
    try:
        df = pd.read_csv(filename, index_col=INDEX_COL)
    except OSError as e:
        if e.errno == errno.ENOENT:
            # file doesn't exist, create empty df
            df = pd.DataFrame([], columns=[INDEX_COL])
            df = df.set_index(INDEX_COL)
        else:
            raise
    new_df = pd.DataFrame.from_dict({colname: scores})
    merged = df.join(new_df, how="outer").fillna(0).astype(int)
    try:
        merged.to_csv(filename + ".tmp", index_label=[INDEX_COL])
    except:
        raise
    else:
        os.rename(filename + ".tmp", filename)
    return merged
#============================================================================
# TEST
#============================================================================
test_file = "this_is_a_test.csv"
before = """-,day1,day2,day3
Mark,1500,0,1660
John,1800,1640,0
Peter,1670,1680,1630
Hannah,1480,1520,1570
"""
after = """-,day1,day2,day3,day4
Brian,0,0,0,1780
Hannah,1480,1520,1570,1640
John,1800,1640,0,0
Mark,1500,0,1660,1750
Peter,1670,1680,1630,0
"""
test_dicts = [
    ["day4", {'Mark': 1750, 'Hannah': 1640, 'Brian': 1780}],
]

open(test_file, "w").write(before)

for name, scores in test_dicts:
    add_days_score(test_file, name, scores)
    print("want\n", after, "\n")
    got = open(test_file).read()
    print("got\n", got, "\n")
    if got != after:
        print("FAILED")
I have a df that contains data including geocoordinates and postcodes.
lat | lon | postcode
54.3077 | 12.7 | 18314
51.898 | 9.26 | 32676
I need a new column with the NUTS2 region, so in this case the resulting df should look something like this:
lat | lon | postcode | NUTS_ID
54.3077 | 12.7 | 18314 | DE80
51.898 | 9.26 | 32676 | DEA4
I found this package: https://github.com/vis4/pyshpgeocode, which I managed to run. My first approach is the following two functions:
def get_gc(nuts='geoc\\shapes\\nuts2\\nuts2.shp'):
    """
    nuts -> path to nuts file
    """
    gc = geocoder(nuts, filter=lambda r: r['LEVL_CODE'] == 2)
    return gc

def add_nuts_to_df(df):
    """
    df must have lon/lat geocoordinates in it.
    This function will add a column ['NUTS_ID'] with the corresponding
    NUTS region.
    """
    start_time = time.time()
    for idx, row in df.iterrows():
        df.loc[idx, 'NUTS_ID'] = get_gc().geocode(row.lat,
                                                  row.lon,
                                                  filter=lambda r: r['NUTS_ID'][:2] == 'DE')['NUTS_ID']
        print('Done with index {}\nTime since start: {}s'.format(idx,
              round(time.time() - start_time, 0)))
    return df
And this does work! However, it takes ~0.6 s per entry, and some of my dataframes have more than a million entries. Since my original dataframes usually contain postcodes, I was thinking about aggregating them using a combination of groupby / apply / transform.
Or is there any other (more efficient) way of doing this?
I am very grateful for any help and look forward to receiving replies.
If I understand your code correctly you are re-creating the gc object for every single request from the same input file. I don't understand why.
One possibility therefore could be to do the following:
def add_nuts_to_df(df):
    """
    df must have lon/lat geocoordinates in it.
    This function will add a column ['NUTS_ID'] with the corresponding
    NUTS region.
    """
    nuts = 'geoc\\shapes\\nuts2\\nuts2.shp'
    # Create the geocoder once, outside the loop
    gc = geocoder(nuts, filter=lambda r: r['LEVL_CODE'] == 2)
    start_time = time.time()
    for idx, row in df.iterrows():
        df.loc[idx, 'NUTS_ID'] = gc.geocode(row.lat,
                                            row.lon,
                                            filter=lambda r: r['NUTS_ID'][:2] == 'DE')['NUTS_ID']
        print('Done with index {}\nTime since start: {}s'.format(idx,
              round(time.time() - start_time, 0)))
    return df
Maybe it would speed up the process even more if you try to use the df.apply() method and pass your geocode logic in a function there.
Something like:
nuts='geoc\\shapes\\nuts2\\nuts2.shp'
gc = geocoder(nuts, filter=lambda r: r['LEVL_CODE'] == 2)
def get_nuts_id(row):
    return gc.geocode(row.lat, row.lon,
                      filter=lambda r: r['NUTS_ID'][:2] == 'DE')['NUTS_ID']

df["NUTS_ID"] = df.apply(get_nuts_id, axis=1)
I didn't try this out though so beware of typos.
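Building on the idea of aggregating from the question, another option (just a sketch, untested) would be to geocode each distinct coordinate pair only once and map the result back, which avoids repeated geocode calls when many rows share the same postcode/coordinates:
nuts = 'geoc\\shapes\\nuts2\\nuts2.shp'
gc = geocoder(nuts, filter=lambda r: r['LEVL_CODE'] == 2)

# Geocode every distinct (lat, lon) pair once ...
unique_coords = df[['lat', 'lon']].drop_duplicates()
unique_coords['NUTS_ID'] = unique_coords.apply(
    lambda row: gc.geocode(row.lat, row.lon,
                           filter=lambda r: r['NUTS_ID'][:2] == 'DE')['NUTS_ID'],
    axis=1)

# ... and map the results back onto the full dataframe
df = df.merge(unique_coords, on=['lat', 'lon'], how='left')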
I'm trying to find the correlation between the open and close prices of 150 cryptocurrencies using pandas.
Each cryptocurrency data is stored in its own CSV file and looks something like this:
| Date                | Open       | Close      |
|:--------------------|:-----------|:-----------|
| 2019-02-01 00:00:00 | 0.00001115 | 0.00001119 |
| 2019-02-01 00:05:00 | 0.00001116 | 0.00001119 |
| ...                 | ...        | ...        |
I would like to find the correlation between the Close and Open column of every cryptocurrency.
As of right now, my code looks like this:
temporary_dataframe = pandas.DataFrame()
for csv_path, coin in zip(all_csv_paths, coin_name):
    data_file = pandas.read_csv(csv_path)
    temporary_dataframe[f"Open_{coin}"] = data_file["Open"]
    temporary_dataframe[f"Close_{coin}"] = data_file["Close"]
# Create all_open based on temporary_dataframe data.
corr_file = all_open.corr()
print(corr_file.unstack().sort_values().drop_duplicates())
Here is a part of the output (the output has a shape of (43661,)):
Open_QKC_BTC Close_QKC_BTC 0.996229
Open_TNT_BTC Close_TNT_BTC 0.996312
Open_ETC_BTC Close_ETC_BTC 0.996423
The problem is that I don't want to see the following correlations:
- between columns starting with Close_ and Close_ (e.g. Close_USD_BTC and Close_ETH_BTC)
- between columns starting with Open_ and Open_ (e.g. Open_USD_BTC and Open_ETH_BTC)
- between the same coin (e.g. Open_USD_BTC and Close_USD_BTC)
In short, the perfect output would look like this:
Open_TNT_BTC Close_QKC_BTC 0.996229
Open_ETH_BTC Close_TNT_BTC 0.996312
Open_ADA_BTC Close_ETC_BTC 0.996423
(PS: I'm pretty sure this is not the most elegant way to do what I'm doing. If anyone has any suggestions on how to make this script better, I would be more than happy to hear them.)
Thank you very much in advance for your help!
This is quite messy, but it at least shows you an option.
Here I am generating some random data and have made the suffixes (coin names) simpler than in your case.
import string
import numpy as np
import pandas as pd
#Generate random data
prefix = ['Open_','Close_']
suffix = string.ascii_uppercase #All uppercase letter to simulate coin-names
var1 = [None] * 100
var2 = [None] * 100
for i in range(len(var1)):
    var1[i] = prefix[np.random.randint(0, len(prefix))] + suffix[np.random.randint(0, len(suffix))]
    var2[i] = prefix[np.random.randint(0, len(prefix))] + suffix[np.random.randint(0, len(suffix))]
df = pd.DataFrame(data = {'var1': var1, 'var2':var2 })
df['DropScenario_1'] = False
df['DropScenario_2'] = False
df['DropScenario_3'] = False
df['DropScenario_Final'] = False
df['DropScenario_1'] = df.apply(lambda row: bool(prefix[0] in row.var1) and (prefix[0] in row.var2), axis=1) #Both are Open_
df['DropScenario_2'] = df.apply(lambda row: bool(prefix[1] in row.var1) and (prefix[1] in row.var2), axis=1) #Both are Close_
df['DropScenario_3'] = df.apply(lambda row: bool(row.var1[len(row.var1)-1] == row.var2[len(row.var2)-1]), axis=1) #Both suffixes are the same
#Combine all scenarios
df['DropScenario_Final'] = df['DropScenario_1'] | df['DropScenario_2'] | df['DropScenario_3']
#Keep only the part of the df that we want
df = df[df['DropScenario_Final'] == False]
#Drop our messy columns
df = df.drop(['DropScenario_1','DropScenario_2','DropScenario_3','DropScenario_Final'], axis = 1)
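Applied to the actual unstacked correlation Series from the question, the same three exclusion rules can also be expressed as a mask on the index (a sketch, assuming the Open_<coin>/Close_<coin> column naming shown above):
pairs = corr_file.unstack().sort_values().drop_duplicates()

def keep_pair(a, b):
    # Keep only Open_ vs Close_ pairs of two different coins
    different_prefix = a.startswith('Open_') != b.startswith('Open_')
    different_coin = a.split('_', 1)[1] != b.split('_', 1)[1]
    return different_prefix and different_coin

mask = [keep_pair(a, b) for a, b in pairs.index]
print(pairs[mask])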
Hope this helps!
P.S. If you find the secret key to trading bitcoins without ending up on r/wallstreetbets, I'll take 5% ;)