I have a dataframe (sample_emails) that contains a list of emails, and I would like to extract only the workplace from each email. For example, from an email such as person1#uber.com, it should return only the string "uber". I tried writing the code for this, but I keep getting a variety of errors.
extract_company = extract_company.find(email[ start['#', end['.']]
def extract_company(email):
return
The extracted value should be returned into the df extract_company
Use pandas.Series.str.extract:
import pandas as pd
extract_company = pd.Series(['a#google.com', 'b#facebook.com'])
extract_company.str.extract(r'#([^.]+)\.')
Output:
0
0 google
1 facebook
You can then assign it back to your df, for example:
df['extract_company'] = extract_company.str.extract(r'#([^.]+)\.')
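As a rough sketch of the same idea on slightly messier input: the pattern below uses `[^.]+` rather than `.+`, so the capture stops at the first dot after the `#`. The sample addresses and the column name `email` are made up for illustration, and `expand=False` is used so that str.extract returns a Series instead of a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample data; the '#' separator follows the question's examples.
df = pd.DataFrame({'email': ['person1#uber.com',
                             'person2#mail.google.com',
                             'no-separator-here']})

# [^.]+ captures everything after '#' up to (not including) the first dot.
# expand=False returns a Series, which can be assigned directly to a column.
df['extract_company'] = df['email'].str.extract(r'#([^.]+)\.', expand=False)

print(df)
```

With a greedy `.+`, the second address would yield mail.google instead of mail. Rows that do not match the pattern come back as NaN.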
I recently discovered the investpy library, and I would like to know whether there is a function to which I can pass a stock's code and receive, among other data, its ISIN. I'm trying the function search_quotes, but the ISIN doesn't come back.
import investpy as ip
search = ip.search_quotes(text='CPLE11')
for s in search:
    print(s)
If anyone knows of any other library where I can pass the code and receive the ISIN, that helps too.
You can get the ISIN by symbol as follows:
import investpy
symbol = "TSLA"
df = investpy.search_stocks(by='symbol', value=symbol)
df = df[df.symbol == symbol]
print(df)
However the code does not work for the symbol you provided. You have to find the correct symbol by mapping it.
EDIT: It might also be the case that this specific company cannot be found using this API. I tried to find it with the following code:
stocks = investpy.stocks.get_stocks()
matches = [row for _, row in stocks.iterrows() if row["name"].startswith("Companhia")]
print([m["name"] for m in matches])
Your stock is not among the matches.
You can use the openfigi API
https://www.openfigi.com/api
Value: ID_ISIN
Description: ISIN - International Securities Identification Number.
Example:
[{"idType":"ID_ISIN","idValue":"XX1234567890"}]
I am doing some data mining. I have a database that looks like this (pulling out three lines):
100324822$10032482$1$PS$BENICAR$OLMESARTAN MEDOXOMIL$1$Oral$UNK$$$Y$$$$021286$$$TABLET$
1014687010$10146870$2$SS$BENICAR HCT$HYDROCHLOROTHIAZIDE\OLMESARTAN MEDOXOMIL$1$Oral$1/2 OF 40/25MG TABLET$$$Y$$$$$.5$DF$FILM-COATED TABLET$QD
115700162$11570016$5$C$Olmesartan$OLMESARTAN$1$Unknown$UNK$$$U$U$$$$$$$
My code looks like this:
with open('DRUG20Q4.txt') as fileDrug20Q4:
    drugTupleList20Q4 = [tuple(map(str, i.split('$'))) for i in fileDrug20Q4]
drug20Q4 = []
for entryDrugPrimaryID20Q4 in drugTupleList20Q4:
    drug20Q4.append((entryDrugPrimaryID20Q4[0], entryDrugPrimaryID20Q4[3], entryDrugPrimaryID20Q4[5]))
drugNameDataFrame20Q4 = pd.DataFrame(drug20Q4, columns = ['PrimaryID', 'Role', 'Drug Name'])
drugNameDataFrame20Q4 = pd.DataFrame(drugNameDataFrame20Q4.loc[drugNameDataFrame20Q4['Drug Name'] == 'OLMESARTAN'])
Currently the code pulls out only entries with the exact name "OLMESARTAN". How do I capture all the variations, for instance "OLMESARTAN MEDOXOMIL"? I can't simply list every variety, as there are too many to enumerate, so I need something that captures any name containing the term "OLMESARTAN".
Thanks!
You can use str.contains to get what you are looking for.
Here's an example (using some string I found in the documentation):
import pandas as pd
df = pd.DataFrame()
item = 'Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.'
df['test'] = item.split(' ')
df[df['test'].str.contains('de')]
This outputs:
test
4 Index
22 Index.
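Applied to the drug data from the question, this might look like the sketch below. The rows are taken from the three sample lines in the question plus one made-up non-matching row, and the column names follow the question's dataframe. The `case`, `regex` and `na` arguments are optional choices, not requirements.

```python
import pandas as pd

# Rows taken from the three sample lines in the question (fields 0, 3 and 5).
drug_df = pd.DataFrame(
    [('100324822', 'PS', 'OLMESARTAN MEDOXOMIL'),
     ('1014687010', 'SS', 'HYDROCHLOROTHIAZIDE\\OLMESARTAN MEDOXOMIL'),
     ('115700162', 'C', 'OLMESARTAN'),
     ('999', 'PS', 'ASPIRIN')],   # hypothetical non-matching row
    columns=['PrimaryID', 'Role', 'Drug Name'])

# regex=False treats the search term as a literal substring;
# case=False ignores case; na=False treats missing names as non-matches.
mask = drug_df['Drug Name'].str.contains('OLMESARTAN', case=False, regex=False, na=False)
olmesartan_df = drug_df[mask]
print(olmesartan_df)
```

The boolean mask keeps the three OLMESARTAN variants and drops the unrelated row.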
I need to convert a Google Cloud Datastore query result to a dataframe, to create a chart from the retrieved data. The query:
def fetch_times(limit):
    start_date = '2019-10-08'
    end_date = '2019-10-19'
    query = datastore_client.query(kind='ParticleEvent')
    query.add_filter(
        'published_at', '>', start_date)
    query.add_filter(
        'published_at', '<', end_date)
    query.order = ['-published_at']
    times = query.fetch(limit=limit)
    return times
creates a json like string of the results for each entity returned by the query:
<Entity('ParticleEvent', 5942717456580608) {'gc_pub_sub_id': '438169950283983', 'data': '605', 'event': 'light intensity', 'published_at': '2019-10-11T14:37:45.407Z', 'device_id': 'e00fce6847be7713698287a1'}>
Thought I found something that would translate to json which I could convert to dataframe, but get an error that the properties attribute does not exist:
def to_json(gql_object):
    result = []
    for item in gql_object:
        result.append(dict([(p, getattr(item, p)) for p in item.properties()]))
    return json.dumps(result, cls=JSONEncoder)
Is there a way to iterate through the query results to get them into a dataframe either directly to a dataframe or by converting to json then to dataframe?
Datastore entities can be treated as Python base dictionaries! So you should be able to do something as simple as...
df = pd.DataFrame(datastore_entities)
...and pandas will do all the rest.
If you needed to convert the entity key, or any of its attributes to a column as well, you can pack them into the dictionary separately:
for e in entities:
    e['entity_key'] = e.key
    e['entity_key_name'] = e.key.name # for example
df = pd.DataFrame(entities)
The output shown is not valid JSON (it uses single quotes and has an Entity(...) prefix), so pd.read_json will not parse it directly.
Assuming the output is the string that you have shared above, the following approach can work instead.
#Extracting the beginning of the dictionary
startPos = line.find("{")
df = pd.DataFrame([eval(line[startPos:-1])])
Output looks like :
gc_pub_sub_id data event published_at \
0 438169950283983 605 light intensity 2019-10-11T14:37:45.407Z
device_id
0 e00fce6847be7713698287a1
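Here, line[startPos:-1] is the dictionary portion of the input string: everything from the opening brace up to, but not including, the trailing '>'. Using eval, we can convert it into an actual dictionary, which is then easily converted into a dataframe object. Note that eval executes whatever code the string contains, so it should only be used on trusted input.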
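A possible variant of the same parsing step, sketched with ast.literal_eval from the standard library in place of eval. The string below reproduces the printed entity from the question, and the slice uses rfind('}') so that it does not depend on exactly one trailing character.

```python
import ast
import pandas as pd

# The printed entity from the question, reproduced as a plain string.
line = ("<Entity('ParticleEvent', 5942717456580608) "
        "{'gc_pub_sub_id': '438169950283983', 'data': '605', "
        "'event': 'light intensity', "
        "'published_at': '2019-10-11T14:37:45.407Z', "
        "'device_id': 'e00fce6847be7713698287a1'}>")

# Slice from the first '{' to the last '}' so the trailing '>' is excluded.
start = line.find('{')
end = line.rfind('}') + 1

# literal_eval only accepts Python literals, so it cannot run arbitrary code.
record = ast.literal_eval(line[start:end])
df = pd.DataFrame([record])
print(df)
```

This only works while every value in the printed dictionary is a plain literal (strings, numbers, lists and so on); other objects in the repr would make literal_eval raise an error.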
Original poster found a workaround, which is to convert each item in the query result object to string, and then manually parse the string to extract the needed data into a list.
The return value of the fetch function is google.cloud.datastore.query.Iterator which behaves like a List[dict] so the output of fetch can be passed directly into pd.DataFrame.
import pandas as pd
df = pd.DataFrame(fetch_times(10))
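Since running this requires a Datastore project, here is an offline sketch of the same conversion. `FakeEntity` and `fake_fetch` are hypothetical stand-ins (a dict subclass with a `.key` attribute, and a generator in place of `query.fetch()`), the integer keys stand in for real key objects, and the second record is invented. The type conversions at the end are an assumption about what a chart would need.

```python
import pandas as pd

class FakeEntity(dict):
    """Hypothetical stand-in for a Datastore entity: a dict with a .key attribute."""
    def __init__(self, key, **properties):
        super().__init__(**properties)
        self.key = key

def fake_fetch():
    # Stand-in for query.fetch(); yields entities lazily like an iterator.
    yield FakeEntity(5942717456580608, data='605', event='light intensity',
                     published_at='2019-10-11T14:37:45.407Z')
    yield FakeEntity(5942717456580609, data='611', event='light intensity',
                     published_at='2019-10-11T14:38:45.407Z')

# Build plain dicts so the key can be added without mutating the entities.
rows = [{**entity, 'entity_key': entity.key} for entity in fake_fetch()]
df = pd.DataFrame(rows)

# Convert types for charting; the values arrive as strings.
df['data'] = pd.to_numeric(df['data'])
df['published_at'] = pd.to_datetime(df['published_at'])
print(df.dtypes)
```

Collecting the rows into a list first also means the iterator is consumed exactly once, which matters if the result is needed again later.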
This is similar to @bkitej's answer, but I added the use of the original poster's function.
I am trying to replace a value inside a string column that sits between two specific pieces of text.
For example, from this dataframe I want to change
df
seller_name url
Lucas http://sanyo.mapi/s3/e42390aac371?item_title=Branded%20boys%20Clothing&seller_name=102392852&buyer_item=106822419_1056424990
To this
url
http://sanyo.mapi/s3/e42390aac371?item_title=Branded%20boys%20Clothing&seller_name=Lucas&buyer_item=106822419_1056424990
Notice that in the seller_name= part of the URL, I replaced the numbers with the real name.
I imagine something like replacing everything from seller_name= up to the first & that follows it.
This is just an example of what I want to do; in reality I have many rows in my dataframe, and the length of the number after seller_name= is not always the same.
Use apply and replace the string with seller name
Sample df
import pandas as pd
df=pd.DataFrame({'seller_name':['Lucas'],'url':['http://sanyo.mapi/s3/e42390aac371?item_title=Branded%20boys%20Clothing&seller_name=102392852&buyer_item=106822419_1056424990']})
import re
def myfunc(row):
    # Replace the digits after 'seller_name=' with this row's seller name
    return re.sub(r'seller_name=\d+', 'seller_name=' + row.seller_name, row.url)
df['url'] = df.apply(myfunc, axis=1)
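One way to sketch the same substitution without a row-wise apply is shown below. It assumes the value to replace is always a run of digits directly after `seller_name=`. The second row is a made-up example, and passing a function as the replacement is a defensive choice so that any backslashes in a name are inserted literally.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'seller_name': ['Lucas', 'Ana'],  # 'Ana' row is a made-up second example
    'url': ['http://sanyo.mapi/s3/e42390aac371?item_title=Branded%20boys%20Clothing'
            '&seller_name=102392852&buyer_item=106822419_1056424990',
            'http://sanyo.mapi/s3/abc?seller_name=77&buyer_item=1'],
})

# Match the digits only when they directly follow 'seller_name='.
pattern = re.compile(r'(?<=seller_name=)\d+')

# A function replacement inserts the name literally, so backslashes or
# group references inside a name are not interpreted by re.sub.
df['url'] = [pattern.sub(lambda m, name=name: name, url)
             for name, url in zip(df['seller_name'], df['url'])]

print(df['url'].tolist())
```

Because the pattern uses `\d+`, the length of the numeric ID does not matter.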
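Try this one:
seller_name = 'Lucas'
url = 'http://sanyo.mapi/s3/e42390aac371?item_title=Branded%20boys%20Clothing&seller_name=102392852&buyer_item=106822419_1056424990'
# Locate the value between 'seller_name=' and the next '&'
a = url.index('seller_name=') + len('seller_name=')
b = url.index('&', a)
# Splice by position so that only this one occurrence is replaced
out = url[:a] + seller_name + url[b:]
print(out)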
This solution doesn't assume the order of your query parameters, or the length of the ID you're replacing. All it assumes is that your query is &-delimited, that the seller_name parameter is present, and that it is not the first parameter after the ?.
split_by_amps = url.split('&')
for i in range(len(split_by_amps)):
    if split_by_amps[i].startswith('seller_name='):
        # Overwrite the whole parameter rather than appending to it
        split_by_amps[i] = 'seller_name=' + 'Lucas'
        break
result = '&'.join(split_by_amps)
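The same order-independent idea can also be sketched with the standard library's urllib.parse, which handles the `?` and `&` splitting itself. `set_query_param` is a hypothetical helper name, not something from the question. Note that the query string is decoded and re-encoded, so escaping in other parameters may be normalised (for example a `+` would come back as `%20`).

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, quote

def set_query_param(url, key, value):
    """Return url with the query parameter `key` set to `value`."""
    parts = urlsplit(url)
    # parse_qsl keeps parameter order and decodes percent-escapes.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    pairs = [(k, value if k == key else v) for k, v in pairs]
    # quote_via=quote re-encodes spaces as %20 rather than '+'.
    return parts._replace(query=urlencode(pairs, quote_via=quote)).geturl()

url = ('http://sanyo.mapi/s3/e42390aac371?item_title=Branded%20boys%20Clothing'
       '&seller_name=102392852&buyer_item=106822419_1056424990')
print(set_query_param(url, 'seller_name', 'Lucas'))
```

This also works when seller_name is the first or the last parameter in the query string.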
You can use regular expressions to substitute the code for the name:
import pandas as pd
import re
# For example, use a dictionary to map codes to names
seller_dic = {102392852: 'Lucas'}
for i in df.index:
    # Be very careful with this: if a url doesn't have this structure,
    # re.search returns None and .group raises an AttributeError,
    # so you may want to handle exceptions
    code = re.search(r'seller_name=(\d+)&', df.loc[i, 'url']).group(1)
    # The dictionary keys are integers, so convert the matched digits
    name = 'seller_name=' + seller_dic[int(code)] + '&'
    # Assign with .loc to avoid chained assignment
    df.loc[i, 'url'] = re.sub(r'seller_name=\d+&', name, df.loc[i, 'url'])
I have a dataframe in Python using pandas. It has 2 columns called 'dropoff_latitude' and 'pickup_latitude'. I want to make a function that will create a 3rd column based on these 2 variables (runs them through an api).
So I wrote a function:
def dropoff_info(row):
    dropoff_latitude = row['dropoff_latitude']
    dropoff_longitude = row['dropoff_longitude']
    dropoff_url2 = "http://data.fcc.gov/api/block/find?format=json&latitude=%s&longitude=%s&showall=true" %(dropoff_latitude,dropoff_longitude)
    dropoff_resp2 = requests.get(dropoff_url2)
    dropoff_results2 = json.loads(dropoff_resp2.text)
    dropoffinfo = dropoff_results2["Block"]["FIPS"][2:11]
    return dropoffinfo
then I would run it as
df['newcolumn'] = dropoff_info(df)
However it doesn't work.
Upon troubleshooting I find that when I print dropoff_latitude it looks like this:
0 40.773345947265625
1 40.762149810791016
2 40.770393371582031
...
And so I think that the URL can't get generated. I want dropoff_latitude to look like this when printed:
40.773345947265625
40.762149810791016
40.770393371582031
...
And I don't know how to specify that I want just the actual content part.
When I tried
dropoff_latitude = row['dropoff_latitude'][1]
dropoff_longitude = row['dropoff_longitude'][1]
It just gave me the values from the 1st row so that obviously didn't work.
Ideas please? I am very new to working with dataframes... Thank you!
Alex - with pandas we typically like to avoid loops, but in your particular case, the need to ping a remote server for data pretty much requires it. So I'd do something like the following:
import json
import requests

fips_codes = []
for i in df.index:
    # .loc[i, column] returns the single value for this row, not the whole column
    dropoff_latitude = df.loc[i, 'dropoff_latitude']
    dropoff_longitude = df.loc[i, 'dropoff_longitude']
    dropoff_url2 = "http://data.fcc.gov/api/block/find?format=json&latitude=%s&longitude=%s&showall=true" %(dropoff_latitude,dropoff_longitude)
    dropoff_resp2 = requests.get(dropoff_url2)
    dropoff_results2 = json.loads(dropoff_resp2.text)
    fips_codes.append(dropoff_results2["Block"]["FIPS"][2:11])
df['new'] = fips_codes
The key here is the .loc[i, ...] bit that gives you the ability to go through each row one by one, and call out the associated column to create the variables to send to your API.
Regarding your question about a drain on your memory - that's a little above my pay-grade, but I really don't think you have any other options in this case (unless your API has some kind of batch request that allows you to pull a larger data set in one call).
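To illustrate the row-by-row pattern without hitting the network, here is a small sketch in which the request is replaced by a stub. `add_block_column` and `fake_lookup` are hypothetical names, and the longitudes are made up. In real use the stub would be swapped for a function that performs the HTTP call and returns the FIPS slice.

```python
import pandas as pd

def add_block_column(df, lookup, column='new'):
    """Call lookup(lat, lon) once per row and store the results in `column`."""
    result = df.copy()
    result[column] = [lookup(lat, lon)
                      for lat, lon in zip(df['dropoff_latitude'], df['dropoff_longitude'])]
    return result

def fake_lookup(lat, lon):
    # Hypothetical stand-in for the HTTP request; returns a label built from
    # the scalar coordinates so the example runs offline.
    return f"{lat:.3f},{lon:.3f}"

df = pd.DataFrame({'dropoff_latitude': [40.773345947265625, 40.762149810791016],
                   'dropoff_longitude': [-73.95, -73.98]})  # longitudes are made up
out = add_block_column(df, fake_lookup)
print(out['new'].tolist())
```

Each call to the lookup receives one latitude and one longitude as plain numbers, which is the difference from passing the whole dataframe to the function as in the question.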