I am trying to replace a value inside a string column, where the value sits between two specific markers.
For example, I want to change this dataframe
df
seller_name url
Lucas http://sanyo.mapi/s3/e42390aac371?item_title=Branded%20boys%20Clothing&seller_name=102392852&buyer_item=106822419_1056424990
To this
url
http://sanyo.mapi/s3/e42390aac371?item_title=Branded%20boys%20Clothing&seller_name=Lucas&buyer_item=106822419_1056424990
Look at the seller_name= part of the URL: I replaced the numbers with the real name.
I imagine something like replacing everything from seller_name= up to the first & that follows it.
This is just an example of what I want to do; really I have many rows in my dataframe, and the length of the numbers inside the seller name is not always the same.
Use apply and replace the string with the seller name.
Sample df
import pandas as pd
df=pd.DataFrame({'seller_name':['Lucas'],'url':['http://sanyo.mapi/s3/e42390aac371?item_title=Branded%20boys%20Clothing&seller_name=102392852&buyer_item=106822419_1056424990']})
import re

def myfunc(row):
    # Swap the numeric code after seller_name= for the row's real seller name
    return re.sub(r'seller_name=\d+', 'seller_name=' + row.seller_name, row.url)

df['url'] = df.apply(myfunc, axis=1)
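The url column then contains:
http://sanyo.mapi/s3/e42390aac371?item_title=Branded%20boys%20Clothing&seller_name=Lucas&buyer_item=106822419_1056424990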
seller_name = 'Lucas'
url = 'http://sanyo.mapi/s3/e42390aac371?item_title=Branded%20boys%20Clothing&seller_name=102392852&buyer_item=106822419_1056424990'
a = url.index('seller_name=')  # start of the parameter name
b = url.index('&', a)          # next delimiter after it
out = url.replace(url[a+12:b], seller_name)  # 12 == len('seller_name=')
print(out)
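This prints:
http://sanyo.mapi/s3/e42390aac371?item_title=Branded%20boys%20Clothing&seller_name=Lucas&buyer_item=106822419_1056424990
One caveat: str.replace substitutes every occurrence of the extracted substring, so this relies on the ID between seller_name= and & not appearing anywhere else in the URL.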
Try this one:
This solution doesn't assume the order of your query parameters, or the length of the ID you're replacing. All it assumes is that your query string is &-delimited and that the seller_name parameter is present.
split_by_amps = url.split('&')
for i in range(len(split_by_amps)):
    if split_by_amps[i].startswith('seller_name'):
        # Replace the whole parameter, not append to it
        split_by_amps[i] = 'seller_name=' + 'Lucas'
        break
result = '&'.join(split_by_amps)
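As a further option (not from the original answer), the standard library's urllib.parse can do the splitting and reassembly for you. A sketch, with the caveat that urlencode may re-encode spaces as + rather than %20:
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def replace_seller_name(url, name):
    # Split the URL, rewrite the seller_name query parameter, reassemble.
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query['seller_name'] = [name]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))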
You can use regular expressions to substitute the code for the name:
import pandas as pd
import re
# For example, use a dictionary to map codes to names.
# Note the keys are strings, because re.search returns the code as a string.
seller_dic = {'102392852': 'Lucas'}
for i in range(len(df['url'])):
    # Be very careful with this: if a url doesn't have this structure it will
    # throw an error, so you may want to handle exceptions.
    code = re.search(r'seller_name=\d+&', df['url'][i]).group(0)
    code = code.replace("seller_name=", "")
    code = code.replace("&", "")
    name = 'seller_name=' + seller_dic[code] + '&'
    url = re.sub(r'seller_name=\d+&', name, df['url'][i])
    df.loc[i, 'url'] = url  # .loc avoids pandas' chained-assignment warning
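If you prefer to avoid the explicit loop, here is a sketch of a vectorized variant (same string-keyed seller_dic assumption, with a .get fallback that keeps the original code when it is not in the dictionary), using str.replace with a callable replacement:
df['url'] = df['url'].str.replace(
    r'seller_name=(\d+)',
    lambda m: 'seller_name=' + seller_dic.get(m.group(1), m.group(1)),
    regex=True,
)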
Trying to get a list of filtered items using regex. I am trying to get specific location codes out of the results. I am able to get the results from a JSON file, but I am stuck at figuring out how I can use multiple regex values to filter the results from the JSON file.
This is how far I am:
import json
import re
file_path = './response.json'
result = []
with open(file_path) as f:
    data = json.loads(f.read())
for d in data:
    result.append(d['location_code'])
result = list(dict.fromkeys(result))
re_list = ['.*dk*', '.*se*', '.*fi*', '.*no*']
matches = []
for r in re_list:
    matches += re.findall(r, result)
# r = re.compile('.*denmark*', '', '', '')
# filtered_list = list(filter(r.match, result))
print(matches)
Output from the first JSON pass is below. I need to filter on country initials like dk, no, lv, fi, ee etc. and leave only the rows that include the specific country codes.
[
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|74',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|87',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.western-europe.nl.na.amsterdam|firefox|28',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|74',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|87',
...
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.western-europe.de.na.frankfurt.amazon|chromium|87'
]
Would appreciate any help. Thanks!
In that case, here is a way that could work:
Set up multiple fields.
For the first pattern you could use:
"2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|([^"]+)"
or
"2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|*"
For text:
.*?text"\s?:\s?"([\w\s]+)
For names:
.*?name"\s?:\s?"([\w\s]+)
Let me know if you are able to get it working.
This looks like a case where regex won't be the best tool; for example, .*fi.* will match sofia, which is probably not wanted. Even if we insist on periods before and after, all of the example rows contain .na., but probably shouldn't match a search for Namibia.
Probably a better way would be to parse the string more carefully, using one or more of (a) the csv module (if the data can contain quoting and escaping in the fields), (b) the split method, and/or (c) regular expressions, to retrieve the country code from each row. Once we have the country code, we can compare it explicitly.
For example, using the split method:
DATA = [
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|74',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|87',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.western-europe.nl.na.amsterdam|firefox|28',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|74',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|87',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.western-europe.de.na.frankfurt.amazon|chromium|87'
]
COUNTRIES = ['dk', 'se', 'fi', 'no']
def extract_country(row):
    geo = row.split('|')[1]
    country = geo.split('.')[2]
    return country
filtered = [
    row for row in DATA
    if extract_country(row) in COUNTRIES
]
print(filtered)
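With the sample data above, this prints the two copenhagen rows:
['2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|74', '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|87']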
or, if you prefer one-liners, you can skip the extract_country function:
filtered = [
    row for row in DATA
    if row.split('|')[1].split('.')[2] in COUNTRIES
]
Both of these split the row on | and take the second column to get the geographical area, then split the geo area on . and take the third item, which seems to be the country code. If you have documentation for your data source, you will be able to check whether this is true.
One additional check might be to verify that the extracted country code has exactly two letters, as a partial check for irregularities in the data:
import re

def extract_country(row):
    geo = row.split('|')[1]
    country = geo.split('.')[2]
    if not re.match('^[a-z]{2}$', country):
        raise ValueError(
            'Expected a two-letter country code, got "%s" in row %s'
            % (country, row)
        )
    return country
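With this stricter version, a row that doesn't follow the expected layout raises a clear ValueError instead of silently yielding a wrong country code.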
I am doing some data mining. I have a database that looks like this (pulling out three lines):
100324822$10032482$1$PS$BENICAR$OLMESARTAN MEDOXOMIL$1$Oral$UNK$$$Y$$$$021286$$$TABLET$
1014687010$10146870$2$SS$BENICAR HCT$HYDROCHLOROTHIAZIDE\OLMESARTAN MEDOXOMIL$1$Oral$1/2 OF 40/25MG TABLET$$$Y$$$$$.5$DF$FILM-COATED TABLET$QD
115700162$11570016$5$C$Olmesartan$OLMESARTAN$1$Unknown$UNK$$$U$U$$$$$$$
My code looks like this:
with open('DRUG20Q4.txt') as fileDrug20Q4:
    drugTupleList20Q4 = [tuple(map(str, i.split('$'))) for i in fileDrug20Q4]
drug20Q4 = []
for entryDrugPrimaryID20Q4 in drugTupleList20Q4:
    drug20Q4.append((entryDrugPrimaryID20Q4[0], entryDrugPrimaryID20Q4[3], entryDrugPrimaryID20Q4[5]))
drugNameDataFrame20Q4 = pd.DataFrame(drug20Q4, columns=['PrimaryID', 'Role', 'Drug Name'])
drugNameDataFrame20Q4 = pd.DataFrame(drugNameDataFrame20Q4.loc[drugNameDataFrame20Q4['Drug Name'] == 'OLMESARTAN'])
Currently the code pulls out only entries with the exact name "OLMESARTAN". How do I capture all the variations, for instance "OLMESARTAN MEDOXOMIL" etc.? I can't simply list all the varieties, as there are endless variations, so I need something that captures anything with the term "OLMESARTAN" in it.
Thanks!
You can use str.contains to get what you are looking for.
Here's an example (using some string I found in the documentation):
import pandas as pd
df = pd.DataFrame()
item = 'Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.'
df['test'] = item.split(' ')
df[df['test'].str.contains('de')]
This outputs:
test
4 Index
22 Index.
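Applied to the dataframe from your question, that would be something like:
drugNameDataFrame20Q4[drugNameDataFrame20Q4['Drug Name'].str.contains('OLMESARTAN')]
This keeps 'OLMESARTAN' itself as well as 'OLMESARTAN MEDOXOMIL' and any other value containing the term.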
I have about 10 columns of data in a CSV file that I want to get statistics on using Python. I am currently using the csv module to open the file and read the contents. But I also want to look at 2 particular columns to compare data and get a percentage of accuracy based on the data.
Although I can open the file and parse through the rows, I cannot figure out, for example, how to compare:
Row[i] Column[8] with Row[i] Column[10]
My pseudo code would be something like this:
category = Row[i] Column[8]
label = Row[i] Column[10]
if (category != label):
    difference += 1
    totalChecked += 1
else:
    correct += 1
    totalChecked += 1
The only thing I am able to do is read the entire row. But I want to get the exact row and column of my 2 variables, category and label, and compare them.
How do I work with specific rows/columns for an entire Excel sheet?
Convert both to pandas dataframes and compare them, similar to the example below. Whatever dataset you're working on, loading it with the pandas module (alongside any other relevant modules) and transforming the data into lists and dataframes is the first step to working with it, in my opinion.
I've taken the time and effort to delve into this myself, as it will be useful to me going forward. Columns don't have to have the same lengths at all in this example, which is good. I've tested the code below (Python 3.8) and it works successfully.
With only slight adaptations it can be used for your specific data columns, objects and purposes.
import pandas as pd

A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv')  # dropped the s from _sequences
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')
print(A.columns)
print(B.columns)
my_unknown_id = A['Unknown_sample_no'].tolist()
my_unknown_seq = A['Unknown_sample_seq'].tolist()
Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist()
# Build lookup dictionaries: reference species -> sequence, unknown sample id -> sequence
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1))
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)
import re
filename = 'seq_match_compare2.csv'
f = open(filename, 'a')  # in the original example it was 'w'
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
f.write(headers)
# For every unknown sample, search for its sequence inside each reference sequence
for ID, seq in Unknown_dict.items():
    for species, seq1 in Ref_dict.items():
        m = re.search(seq, seq1)
        if m:
            match = m.group()
            pos = m.start() + 1
            f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')
f.close()
And I did it myself too, assuming your columns contain integers, and following your specifications as best I can. It's my first attempt at this kind of thing, so go easy on me. You could use the code below as a benchmark for how to move forward on your question.
Basically it does what you want (gives you the skeleton): it imports the CSV into Python using the pandas module, converts it to a dataframe, works on the specific columns of that dataframe, makes a new results column, prints the results alongside the original data in the terminal, and saves everything to a new CSV. It's as messy as my Python is, but it works. It is still work in progress, and I will hopefully be improving its readability, scope and functionality at a later date.
import pandas as pd

A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')
A["Category"].fillna("empty data - missing value", inplace=True)
# A["Blank1"].fillna("empty data - missing value", inplace=True)
# ...etc
print(A.columns)
# Pull the relevant columns out as plain Python lists
MyCat = A['Category'].tolist()
MyLab = A['Label'].tolist()
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()
# Pair categories with labels so whole columns can be compared as a block
Ref_dict = dict(zip(My_Cats, My_Labs))
print(Ref_dict)
print("Given Dataframe :\n", A)
# Subtract the label column from the category column, row by row
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of score1 and score2 :\n", A)
# You can do other matches, comparisons and calculations here and add them to the output
# Finally, save the dataframe (original data plus the new results column) to a new CSV
A.to_csv('some_name5523.csv')
Yes, I know it's by no means perfect at all, but I wanted to give you a heads-up about pandas and dataframes for doing what you want moving forward.
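For completeness, here is a minimal sketch of just the comparison from your pseudocode, assuming the CSV loads cleanly and that columns 8 and 10 are the ones to compare (the filename is a placeholder):
import pandas as pd

df = pd.read_csv('yourfile.csv')  # placeholder filename
category = df.iloc[:, 8]          # Column[8] from your pseudocode
label = df.iloc[:, 10]            # Column[10] from your pseudocode
correct = (category == label).sum()
totalChecked = len(df)
print('Accuracy: %.2f%%' % (100 * correct / totalChecked))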
I am trying this code:
s='{"mail":vip#a.com,"type":"a","r_id":"1312","level":307},{"mail":vipx#a.com,"type":"b","r_id":"1111"}'
data_raw=re.split(r'[\{\}]',s)
data_raw=data_raw[1::2]
data=pd.DataFrame(data_raw)
data[0]=str(data[0])
data['r_id']=data[0].apply(lambda x:re.search(r'(r_id)',data[0]))
data['level']=data[0].apply(lambda x:re.search(r'(level)',data[0]))
print(data)
I wish I could get the result:
r_id level
1312 307
1111 NAN
But it shows the error: expected string or bytes-like object.
So how can I use re.search in pandas, or how else can I get this result?
My two cents...
import re
pattern = re.compile(r'^.*?id\":\"(\d+)\",\"level\":(\d+).*id\":\"(\d+).*$')
string = r'{"mail":vip#a.com,"type":"a","r_id":"1312","level":307},{"mail":vipx#a.com,"type":"b","r_id":"1111"}'
data = pattern.findall(string)
data
Which returns a list with one tuple:
[('1312', '307', '1111')]
And you can access items with, for example:
data[0][2]
Regex demo: https://regex101.com/r/Inv4gp/1
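Note that this pattern is hard-wired to your two sample records (three capture groups in a fixed order); for a variable number of records you would need to match record by record instead, for example with re.finditer.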
The below works for me. The type problem arises because you cannot change the type of all the rows like that; you would need a lambda function for that too.
There is an additional problem: the regex and the missing-value handling won't work like that. I propose a solution for this below, but you might want to consider a different regex if you want this to work for other columns.
I'm a novice with regex, so there might be a more general-purpose solution for your problem.
import re
import pandas as pd
import numpy as np
s='{"mail":vip#a.com,"type":"a","r_id":"1312","level":307},{"mail":vipx#a.com,"type":"b","r_id":"1111"}'
data_raw=re.split(r'[\{\}]',s)
data_raw=data_raw[1::2]
data=pd.DataFrame(data_raw)
# This is a regex wrapper which takes a row of our pandas dataframe and the
# name of the field we want to extract.
def regex_wrapper(row, column):
    match = re.search(r'"' + column + r'":"?(\d+)"?', str(row))
    if match:
        return match.group(1)
    else:
        return np.nan
data['r_id'] = data[0].apply(lambda row: regex_wrapper(row,"r_id"))
data['level'] = data[0].apply(lambda row: regex_wrapper(row,"level"))
del data[0]
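Printing data then gives the result you asked for:
   r_id level
0  1312   307
1  1111   NaN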
I have a dataframe (sample_emails) that provides a list of emails, and I would like to extract only the workplace from each email. For example, from an email such as person1@uber.com it should return only the string "uber". I tried writing code for this but I keep getting a variety of errors:
extract_company = extract_company.find(email[ start['@', end['.']]
def extract_company(email):
    return
The extracted value should be returned into the df extract_company.
Use pandas.Series.str.extract:
import pandas as pd
extract_company = pd.Series(['a@google.com', 'b@facebook.com'])
extract_company.str.extract(r'@(.+)\.')
Output:
0
0 google
1 facebook
You can then assign it back to your df, for example:
df['extract_company'] = extract_company.str.extract(r'@(.+)\.')
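One caveat: (.+) is greedy, so for an address with a subdomain, say c@mail.google.com (a made-up example), it would capture mail.google. If you only want the part immediately after the @, a stricter sketch stops at the first dot:
df['extract_company'] = extract_company.str.extract(r'@([^.]+)\.')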