Code segment in Python keeps changing Dataframe content despite not being referenced - python

I have the following Dataframe in Python, where "data" = the full dataset composed of 2 columns of strings, 'Description' and 'Category'.
"dataTrain" is a subset of "data"
"catBag" is a list of all the words used in the 'Description' from rows of a specific 'Category'
"catDict" is a list of all the words used in the 'Description' from rows of all the other Categories.
"catUnique" returns me all the words that are unique to a specific category.
The nested loop replaces the 'Description' text with only words that are unique to the row's category.
classNames = sorted(list(set(dataTrain['Category'])))
catUnique = [[] for _ in range(len(classNames))]
dataTemp = dataTrain
for i in range(len(classNames)):
    catBag = set()
    data2 = dataTrain.loc[data['Category'] == classNames[i]]
    data2['Description'].str.lower().str.split().apply(catBag.update)
    catDict = set()
    data3 = dataTrain.loc[data['Category'] != classNames[i]]
    data3['Description'].str.lower().str.split().apply(catDict.update)
    catUnique[i] = list(catBag-catDict)
    for j in range(len(data2)):
        if len(catUnique[i]) > 0:
            data22 = data2
            dataTemp.at[data22.index[j], 'Description'] = " ".join(list(set(data22.at[data22.index[j], 'Description'].lower().split()) & set(catUnique[i])))
However, running this code updates dataTrain's Description text despite not being referenced. Even when I change it so that dataTrain isn't used as an input, it still gets updated.
This issue means that more words are missing from "data3" as non-unique words are stripped from previously processed Categories.
I think it's to do with the data2['Description'].str.lower().str.spl...... lines but not sure how to fix it.

In your last line, you are updating dataTemp, which refers to the same object as dataTrain, so the change shows up under both names.
To make an independent copy of dataTrain, use
dataTemp = dataTrain.copy()
In Python, dataTemp = dataTrain only creates a new variable that references the same object.
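The difference between binding a second name and making a copy can be checked directly. This is a minimal sketch using a made-up two-row frame; the names data_train, alias and independent are illustrative and do not come from the question.

```python
import pandas as pd

# Toy stand-in for dataTrain; the contents are made up for illustration.
data_train = pd.DataFrame({
    "Description": ["red apple", "green pear"],
    "Category": ["fruit_a", "fruit_b"],
})

alias = data_train               # second name bound to the same object
independent = data_train.copy()  # separate object with its own data

# Writing through the alias changes data_train as well.
alias.at[0, "Description"] = "changed via alias"
# Writing to the copy leaves data_train untouched.
independent.at[1, "Description"] = "changed via copy"

print(alias is data_train)        # True
print(independent is data_train)  # False
print(data_train["Description"].tolist())
```

After running this, row 0 of data_train carries the text written through alias, while row 1 still holds its original value.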

Related

no returned results from pubmed query

I am using the following code to search and extract research documents on chemical compounds from PubMed. I am interested in the author, name of document, abstract, etc. When I run the code, I only get results for the last item on my list (see the example data in the code below). Yet when I do a manual search (i.e. one at a time), I get results for all of them.
#example data list
data={'IUPACName':['ethenyl(trimethoxy)silane','sodium;prop-2-enoate','2-methyloxirane;oxirane','2-methylprop-1-ene;styrene','terephthalic acid', 'styrene' ]}
df=pd.DataFrame(data)
df_list = []
import time
from pymed import PubMed
pubmed = PubMed(tool="PubMedSearcher", email="thomas.heiman#fda.hhs.gov")
data = []
for index, row in df.iterrows():
    ## PUT YOUR SEARCH TERM HERE ##
    search_term =row['IUPACName']
    time.sleep(3) #because I dont want to slam them with requests
    #search_term = '3-hydroxy-2-(hydroxymethyl)-2-methylpropanoic '
    results = pubmed.query(search_term, max_results=500)
articleList = []
articleInfo = []
for article in results:
    # Print the type of object we've found (can be either PubMedBookArticle or PubMedArticle).
    # We need to convert it to dictionary with available function
    articleDict = article.toDict()
    articleList.append(articleDict)
# Generate list of dict records which will hold all article details that could be fetch from
#PUBMED API
for article in articleList:
    #Sometimes article['pubmed_id'] contains list separated with comma - take first pubmedId in that list - thats article pubmedId
    pubmedId = article['pubmed_id'].partition('\n')[0]
    # Append article info to dictionary
    try:
        articleInfo.append({u'pubmed_id':pubmedId,
                            u'title':article['title'],
                            u'keywords':article['keywords'],
                            u'journal':article['journal'],
                            u'abstract':article['abstract'],
                            u'conclusions':article['conclusions'],
                            u'methods':article['methods'],
                            u'results': article['results'],
                            u'copyrights':article['copyrights'],
                            u'doi':article['doi'],
                            u'publication_date':article['publication_date'],
                            u'authors':article['authors']})
    except KeyError as e:
        continue
# Generate Pandas DataFrame from list of dictionaries
articlesPD = pd.DataFrame.from_dict(articleInfo)
#Add the query to the first column
articlesPD.insert(loc=0, column='Query', value=search_term)
df_list.append(articlesPD)
data = pd.concat(df_list, axis=1)
all_export_csv = data.to_csv (r'C:\Users\Thomas.Heiman\Documents\pubmed_output\all_export_dataframe.csv', index = None, header=True)
#Print first 10 rows of dataframe
#print(all_export_csv.head(10))
Any ideas on what I am doing wrong? Thank you!
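One possible cause, offered as a guess: if the per-article processing and the df_list.append call run after the for loop rather than inside it, only the results of the last query are ever turned into a DataFrame. Separately, pd.concat with axis=1 places frames side by side, whereas axis=0 stacks them as rows. The sketch below shows the accumulate-inside-the-loop pattern. fetch_articles is a hypothetical stand-in for pubmed.query plus the toDict conversion, so that the example runs without network access.

```python
import pandas as pd

def fetch_articles(search_term):
    """Hypothetical stand-in for pubmed.query(...) followed by article.toDict()."""
    return [
        {"pubmed_id": f"{search_term}-{n}", "title": f"Paper {n} on {search_term}"}
        for n in range(2)
    ]

search_terms = ["terephthalic acid", "styrene"]

frames = []
for term in search_terms:
    # Everything that depends on the current term stays inside the loop.
    article_info = fetch_articles(term)
    articles = pd.DataFrame(article_info)
    articles.insert(loc=0, column="Query", value=term)
    frames.append(articles)

# axis=0 stacks the per-query frames as rows of one table.
all_articles = pd.concat(frames, axis=0, ignore_index=True)
print(all_articles)
```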

Loop through list of dataframes and save as new dataframe name

I'm trying to loop through a list of dataframes and perform operations on them. In the final command I want to rename the dataframe as the original key plus '_rand_test'. I'm getting the error:
SyntaxError: cannot assign to operator
Is there a way to do this?
segments = [main_h, main_m, main_l]
seg_name = ['main_h', 'main_m', 'main_l']
for i in segments:
    control = pd.DataFrame(i.groupby('State', group_keys=False).apply(lambda x : x.sample(frac = .1)))
    control['segment'] = 'control'
    test= i[~i.index.isin(control.index)]
    test['segment'] = 'test'
    seg_name[i]+'_rand_test' = pd.concat([control,test])
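The error occurs because you are trying to perform addition on the left side of an = sign, which is never allowed: the left side must be an assignable target such as a name or an indexed element. If you want to rename the dataframe you could just do it on the next line. I'm unsure of what exactly you're trying to rename based on the code, but if it's just the corresponding string in the seg_name list then the next line would look like this:
seg_name[segments.index(i)] += '_rand_test'
The segments.index(i) is needed because you're looping over the elements in segments, not their indexes, so you need to get the index of the element. Note that list.index compares elements with ==, which does not produce a single True or False for DataFrames and can raise a ValueError; looping with for idx, i in enumerate(segments): and using seg_name[idx] avoids the lookup.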
Maybe this will work for you?
Create an empty list before you run the loop and fill it with the append function. Then build the new names from seg_name.
segments = [main_h, main_m, main_l]
seg_name = ['main_h', 'main_m', 'main_l']
new_list= []
for i in segments:
    control = pd.DataFrame(i.groupby('State', group_keys=False).apply(lambda x : x.sample(frac = .1)))
    control['segment'] = 'control'
    test= i[~i.index.isin(control.index)]
    test['segment'] = 'test'
    new_list.append(pd.concat([control,test]))
new_names_list=[item +'_rand_test' for item in seg_name]
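A dictionary keyed by the new name is a common alternative to creating variable names inside a loop. The sketch below assumes each frame has a 'State' column as in the question; the toy frames, the helper split_control_test and the seed parameter are made up for illustration. It uses DataFrameGroupBy.sample, which needs pandas 1.1 or later.

```python
import pandas as pd

def split_control_test(df, frac=0.1, seed=0):
    """Sample `frac` of each State as control and label the rest as test."""
    control = df.groupby("State", group_keys=False).sample(frac=frac, random_state=seed)
    control = control.assign(segment="control")
    test = df[~df.index.isin(control.index)].assign(segment="test")
    return pd.concat([control, test])

# Toy stand-ins for main_h, main_m and main_l.
main_h = pd.DataFrame({"State": ["CA"] * 10 + ["NY"] * 10, "value": range(20)})
main_m = pd.DataFrame({"State": ["TX"] * 10, "value": range(10)})
main_l = pd.DataFrame({"State": ["WA"] * 20, "value": range(20)})

segments = {"main_h": main_h, "main_m": main_m, "main_l": main_l}

# One labeled frame per input, stored under '<name>_rand_test'.
results = {
    name + "_rand_test": split_control_test(df)
    for name, df in segments.items()
}

print(sorted(results))
print(results["main_h_rand_test"]["segment"].value_counts())
```

Each result is then looked up by key, for example results['main_h_rand_test'].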

How do I fix the For Loop to return a certain character from a DataFrame?

I have imported an Excel file into a DataFrame and iterated over a column called "Title" to pick out titles with certain keywords. I have the list of titles as match_titles. What I want to do now is create a for loop that returns the value in the column before "Title" (the Asin) for each title in match_titles. I'm not sure why the code is not working. Any help would be appreciated.
import pandas as pd
data = pd.read_excel(r'C:\Users\bryanmccormack\Downloads\asin_list.xlsx')
df = pd.DataFrame(data, columns=['Track','Asin','Title'])
excludes = ["Chainsaw", "Diaper pail", "Leaf Blower"]
my_excludes = [set(key_word.lower().split()) for key_word in excludes]
match_titles = [e for e in df.Title if
                any(keywords.issubset(e.lower().split()) for keywords in my_excludes)]
a = []
for i in match_titles:
    a.append(df['Asin'])
print(a)
In your for loop you are appending the whole, unfiltered column df['Asin'] to your list a once for every value in match_titles; df itself is never filtered.
One solution would be to add a boolean match_titles column, then filter on that column and return the Asin column:
# make a function to perform your match analysis.
def is_match(title, excludes=["Chainsaw", "Diaper pail", "Leaf Blower"]):
    my_excludes = [set(key_word.lower().split()) for key_word in excludes]
    if any(keywords.issubset(title.lower().split()) for keywords in my_excludes):
        return True
    return False
# Make a new boolean column for the matches. This applies your
# function to each value in df['Title'] and puts the output in
# the new column.
df['match_titles'] = df['Title'].apply(is_match)
# Filter the df to only matches and return the column you want.
# Because the match_titles column is boolean it can be used as
# an index.
result = df[df['match_titles']]['Asin']
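The same filter can be written without adding a column, by keeping the boolean mask as a separate Series and passing it to .loc. This is a small sketch using made-up rows in place of the Excel file; the Asin values and titles are invented for illustration.

```python
import pandas as pd

# Made-up rows standing in for asin_list.xlsx.
df = pd.DataFrame({
    "Track": [1, 2, 3],
    "Asin": ["B001", "B002", "B003"],
    "Title": [
        "Electric Chainsaw 16 inch",
        "Baby Bottle",
        "Odorless diaper pail refills",
    ],
})

excludes = ["Chainsaw", "Diaper pail", "Leaf Blower"]
keyword_sets = [set(key_word.lower().split()) for key_word in excludes]

def is_match(title):
    """True if every word of at least one keyword phrase appears in the title."""
    words = set(title.lower().split())
    return any(keywords <= words for keywords in keyword_sets)

mask = df["Title"].apply(is_match)            # boolean Series aligned with df
matching_asins = df.loc[mask, "Asin"].tolist()
print(matching_asins)  # ['B001', 'B003']
```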

How to zip values into a table with uneven lists? (DataNitro)

I'm attempting to get the last 5 orders from currency exchanges through their respective JSON API. Everything is working except for the fact there are some coins that have less than 5 orders (ask/bid) which causes some errors in the table write to Excel.
Here is what I have now:
import grequests
import json
import itertools
active_sheet("Livecoin Queries")
urls3 = [
    'https://api.livecoin.net/exchange/order_book?currencyPair=RBIES/BTC&depth=5',
    'https://api.livecoin.net/exchange/order_book?currencyPair=REE/BTC&depth=5',
]
requests = (grequests.get(u) for u in urls3)
responses = grequests.map(requests)
CellRange("B28:DJ48").clear()
def make_column(catalog_response, name):
    column = []
    catalog1 = catalog_response.json()[name]
    quantities1, rates1 = zip(*catalog1)
    for quantity, rate in zip(quantities1, rates1):
        column.append(quantity)
        column.append(rate)
    return column
bid_table = []
ask_table = []
for response in responses:
    try:
        bid_table.append(make_column(response,'bids'))
        ask_table.append(make_column(response,'asks'))
    except (KeyError,ValueError,AttributeError):
        continue
Cell(28, 2).table = zip(*ask_table)
Cell(39, 2).table = zip(*bid_table)
I've isolated the list of links down to just two with "REE" coin being the issue here.
I've tried:
for i in itertools.izip_longest(*bid_table):
    #Cell(28, 2).table = zip(*ask_table)
    #Cell(39, 2).table = zip(*i)
    print(i)
Which prints out nicely in the terminal:
[screenshot: itertools terminal output]
NOTE: As of right now "REE" has zero bid orders so it ends up creating an empty list:
[screenshot: empty list terminal output]
When printing to excel I get a lot of strange outputs. None of which resemble what it looks like in the terminal. The way the information is set up in Excel requires it to be Cell(X,X).table
My question is, how do I make zipping with uneven lists play nice with tables in DataNitro?
EDIT1:
The problem is arising at catalog_response.json()[name]
def make_column(catalog_response, name):
    column = []
    catalog1 = catalog_response.json()[name]
    #quantities1, rates1 = list(itertools.izip_longest(*catalog1[0:5]))
    print(catalog1)
    #for quantity, rate in zip(quantities1, rates1):
    #    column.append(quantity)
    #    column.append(rate)
    #return column
Since there are zero bids there is not even an empty list created which is why I'm unable to zip them together.
ValueError: need more than 0 values to unpack
I suggest that you build the structure myTable that you intend to write back to Excel.
It should be a list of lists:
myTable = []
myRow = []
…build each myRow from your code…
If myRow is too short, pad it with the proper number of None elements.
In your case, if len(myRow) is 0, you need to append two None items:
myRow.append(None)
myRow.append(None)
Then add the row to the output table:
myTable.append(myRow)
When ready, you have a well-formed nn x n table to output via:
Cell(nn, n).table = myTable
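The padding step can be sketched in plain Python, leaving out the DataNitro calls. This example assumes each order book is a list of two-item pairs, as in the question's make_column; the sample numbers and the helper columns_to_table are made up. It is written for Python 3, where the function is itertools.zip_longest (the question's Python 2 code uses izip_longest).

```python
from itertools import zip_longest

def make_column(orders):
    """Flatten [[a, b], [c, d], ...] into [a, b, c, d, ...]; an empty book gives []."""
    column = []
    for quantity, rate in orders:
        column.extend([quantity, rate])
    return column

def columns_to_table(columns, fill=None):
    """Turn uneven columns into equal-length rows, padding gaps with `fill`."""
    return [list(row) for row in zip_longest(*columns, fillvalue=fill)]

# Made-up order books: the second coin has no bids at all.
bids_per_coin = [
    [[1.0, 0.5], [2.0, 0.6]],
    [],
]

bid_table = columns_to_table([make_column(book) for book in bids_per_coin])
for row in bid_table:
    print(row)
```

Iterating over the pairs directly, instead of unpacking zip(*catalog1), also avoids the "need more than 0 values to unpack" error when a book is empty. Every row of the resulting list of lists has the same length, which is the shape the answer above describes for Cell(...).table.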

Python : Separating a .txt file into columns and finding the most frequent data item in one of the columns

I read a file and stored it in artists_tags with column names.
This file has multiple columns, and I need to generate a new data structure that keeps 2 columns from artists_tags as they are and uses the most frequent value from the 'Tag' column as the 3rd column value.
Here is what I have written as of now:
import pandas as pd
from collections import Counter
def parse_artists_tags(filename):
    df = pd.read_csv(filename, sep="|", names=["ArtistID", "ArtistName", "Tag", "Count"])
    return df
def parse_user_artists_matrix(filename):
    df = pd.read_csv(filename)
    return df
# artists_tags = parse_artists_tags(DATA_PATH + "\\artists-tags.txt")
artists_tags = parse_artists_tags("C:\\Users\\15-J001TX\\Documents\\ml_task\\artists-tags.txt")
#print(artists_tags)
user_art_mat = parse_user_artists_matrix("C:\\Users\\15-J001TX\\Documents\\ml_task\\userart-mat-training.csv")
#print ("Number of tags {0}".format(len(artists_tags))) # Change this line. Should be 952803
#print ("Number of artists {0}".format(len(user_art_mat))) # Change this line. Should be 17119
# TODO Implement this. You can change the function arguments if necessary
# Return a data structure that contains (artist id, artist name, top tag) for every artist
def calculate_top_tag(all_tags):
    temp = all_tags.Tag
    a = Counter(temp)
    a = a.most_common()
    print (a)
    top_tags = all_tags.ArtistID,all_tags.ArtistName,a;
    return top_tags
top_tags = calculate_top_tag(artists_tags)
# Print the top tag for Nirvana
# Artist ID for Nirvana is 5b11f4ce-a62d-471e-81fc-a69a8278c7da
# Should be 'Grunge'
print ("Top tag for Nirvana is {0}".format(top_tags)) # Complete this line
In the last method calculate_top_tag I don't understand how to choose the most frequent value from the 'Tag' column and put it as the third column for top_tags before returning it.
I am new to python and my knowledge of syntax and data structures is limited. I did try the various solutions mentioned for finding the most frequent value from the list but they seem to display the entire column and not one particular value. I know this is some trivial syntax issue but after having searched for long I still cannot figure out how to get this one.
edit 1 :
I need to find the most common tag for a particular artist and not the most common overall.
But again, I don't know how to.
edit 2 :
here is the link to the data files:
https://github.com/amplab/datascience-sp14/raw/master/hw2/hw2data.tar.gz
I'm sure there is a more succinct way of doing it, but this should get you started:
# returns a df grouped by ArtistID and Tag
tag_counts = artists_tags.groupby(['ArtistID', 'Tag'])
# sum up tag counts and sort in descending order
tag_counts = tag_counts[['Count']].sum().sort_values('Count', ascending=False).reset_index()
# keep only the top ranking tag per artist
top_tags = tag_counts.groupby('ArtistID').first()
# top_tags is now a dataframe which contains the top tag for every artist
# We can simply look up the top tag for Nirvana via its index:
top_tags.loc['5b11f4ce-a62d-471e-81fc-a69a8278c7da', 'Tag']
# 'Grunge'
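If the artist name should be kept alongside the top tag, as the TODO in the question asks, one option is to pick each artist's highest-count row with idxmax. The sketch below uses a made-up four-row table with short placeholder IDs. It assumes the frame has a unique index, that each (artist, tag) pair appears once, and that Count has no missing values.

```python
import pandas as pd

# Made-up rows in the same column layout as artists-tags.txt.
artists_tags = pd.DataFrame({
    "ArtistID": ["a1", "a1", "a2", "a2"],
    "ArtistName": ["Nirvana", "Nirvana", "Miles Davis", "Miles Davis"],
    "Tag": ["rock", "grunge", "jazz", "fusion"],
    "Count": [50, 100, 80, 20],
})

def calculate_top_tag(all_tags):
    """Return one row per ArtistID holding the tag with the highest Count."""
    # idxmax gives, per artist, the index label of the row with the largest Count.
    top_rows = all_tags.loc[all_tags.groupby("ArtistID")["Count"].idxmax()]
    return top_rows.set_index("ArtistID")[["ArtistName", "Tag"]]

top_tags = calculate_top_tag(artists_tags)
print(top_tags.loc["a1", "Tag"])  # grunge
```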
