I have an empty Pandas dataframe and I'm trying to add a row to it. Here's what I mean:
text_img_count = len(BeautifulSoup(html, "lxml").find_all('img'))
print('img count:', text_img_count)
keys = ['text_img_count', 'text_vid_count', 'text_link_count', 'text_par_count', 'text_h1_count',
        'text_h2_count', 'text_h3_count', 'text_h4_count', 'text_h5_count', 'text_h6_count',
        'text_bold_count', 'text_italic_count', 'text_table_count', 'text_word_length', 'text_char_length',
        'text_capitals_count', 'text_sentences_count', 'text_middles_count', 'text_rows_count',
        'text_nb_digits', 'title_char_length', 'title_word_length', 'title_nb_digits']
values = [text_img_count, text_vid_count, text_link_count, text_par_count, text_h1_count,
          text_h2_count, text_h3_count, text_h4_count, text_h5_count, text_h6_count,
          text_bold_count, text_italic_count, text_table_count, text_word_length,
          text_char_length, text_capitals_count, text_sentences_count, text_middles_count,
          text_rows_count, text_nb_digits, title_char_length, title_word_length, title_nb_digits]
numeric_df = pd.DataFrame()
for key, value in zip(keys, values):
    numeric_df[key] = value
print(numeric_df.head())
However, the output is this:
img count: 2
Empty DataFrame
Columns: [text_img_count, text_vid_count, text_link_count, text_par_count, text_h1_count, text_h2_count, text_h3_count, text_h4_count, text_h5_count, text_h6_count, text_bold_count, text_italic_count, text_table_count, text_word_length, text_char_length, text_capitals_count, text_sentences_count, text_middles_count, text_rows_count, text_nb_digits, title_char_length, title_word_length, title_nb_digits]
Index: []
[0 rows x 23 columns]
This makes it seem like numeric_df is still empty, even though I just assigned a value to each of its columns.
What's going on?
Thanks for the help!
What I usually do to add a column to an empty data frame is to append the information to a list and then give it a data frame structure. For example:
df=pd.DataFrame()
L=['a','b']
df['SomeName']=pd.DataFrame(L)
And use pd.Series() if the list is made of numbers.
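For the original question, the reason the frame stays empty is that assigning a scalar as a column of a zero-row DataFrame broadcasts it over zero rows. A minimal sketch of one fix, with dummy counts standing in for the values computed in the question, builds the frame from a single row instead:

```python
import pandas as pd

# Dummy counts standing in for the values computed in the question.
keys = ['text_img_count', 'text_vid_count', 'text_link_count']
values = [2, 0, 5]

# Wrap the values in a list so they form one row; a bare scalar
# assigned to an empty frame is broadcast over its zero rows.
numeric_df = pd.DataFrame([values], columns=keys)
print(numeric_df.head())  # one row, one column per key
```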
Related
I got a table with no header in which the first column is followed by around 50 other columns with some NaN values and some values that appear more than 5 times.
I would like to let the values from the 2nd to the last column point to the values in the first column.
For example, my dataframe:
|no header|x|x|x|x|x|
|-|-|-|-|-|-|
|1|NaN|2|2|4|6|
|2|3|3|NaN|7|7|
|3|1|1|9|5|5|
Values don't appear in more than one row, but they may appear more than once in the same row.
This should produce:
|value|linked to|
|-----|---------|
|1|3|
|2|1|
|3|2|
|4|1|
|5|3|
|6|1|
|7|2|
|8|NaN|
|9|3|
df.values.tolist()
returns a list of lists and does not skip the first column.
I wrote some quick and dirty code to help myself.
df_list = frame1.values.tolist()
z = []
for x in df_list:
    z.append(set(x))
q = []
for b in z:
    q.append({x for x in b if x == x})  # x == x filters out NaN
r = []
for w in q:
    r.append(list(w))
dict_values = {i + 1: r[i] for i in range(len(r))}
dict2 = {}
for key, values in dict_values.items():
    for i in values:
        dict2[i] = key
final = pd.DataFrame(dict2.items())
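The same mapping can be sketched more compactly with pandas itself: set the first column as the index, stack the remaining columns into one long Series, then drop NaNs and duplicates. This is a sketch against a hypothetical frame1 mirroring the small example table, not the asker's 50-column data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the example table: column 0 holds the
# target value, the remaining columns hold values that point back to it.
frame1 = pd.DataFrame([[1, np.nan, 2, 2, 4, 6],
                       [2, 3, 3, np.nan, 7, 7],
                       [3, 1, 1, 9, 5, 5]])

pairs = (frame1.set_index(0)      # first column becomes the "linked to" key
               .stack()           # one row per (key, cell value) pair
               .dropna()
               .reset_index(name='value')
               .drop_duplicates('value'))
final = (pairs[['value', 0]]
         .rename(columns={0: 'linked to'})
         .sort_values('value'))
```

Values that never appear anywhere (like 8 in the example) simply have no row here, rather than a NaN entry.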
I have a dataframe which looks like this (It contains dummy data) -
I want to remove the text which occurs after "_________" identifier in each of the cells. I have written the code as follows (Logic: Adding a new column containing NaN and saving the edited values in that column) -
import pandas as pd
import numpy as np
df = pd.read_excel(r'Desktop\Trial.xlsx')
NaN = np.nan
df["Body2"] = NaN
substring = "____________"
for index, row in df.iterrows():
    if substring in row["Body"]:
        split_string = row["Body"].split(substring, 1)
        row["Body2"] = split_string[0]
print(df)
But the Body2 column still displays NaN and not the edited values.
Any help would be much appreciated!
for index, row in df.iterrows():
    if substring in row["Body"]:
        split_string = row["Body"].split(substring, 1)
        # row["Body2"] = split_string[0]  # instead use the line below
        df.at[index, 'Body2'] = split_string[0]
Make use of df.at to modify the value; assigning to the row returned by iterrows() only changes a copy, not the DataFrame itself.
Instead of iterating through the rows, do the operation on all rows at once. You can use the expand option of str.split to split the values into multiple columns, which I think is what you want.
substring = "____________"
df = pd.DataFrame({'Body': ['a____________b', 'c____________d', 'e____________f', 'gh']})
df[['Body1', 'Body2']] = df['Body'].str.split(substring, expand=True)
print(df)
# Body Body1 Body2
# 0 a____________b a b
# 1 c____________d c d
# 2 e____________f e f
# 3 gh gh None
So I have a dataframe called reactions_drugs
and I want to create a table called new_r_d where I keep track of how often I see a symptom for a given medication, like
Here is the code I have but I am running into errors such as "Unable to coerce to Series, length must be 3 given 0"
new_r_d = pd.DataFrame(columns=['drugname', 'reaction', 'count'])
for i in range(len(reactions_drugs)):
    name = reactions_drugs.drugname[i]
    drug_rec_act = reactions_drugs.drug_rec_act[i]
    for rec in drug_rec_act:
        row = new_r_d.loc[(new_r_d['drugname'] == name) & (new_r_d['reaction'] == rec)]
        if row == []:
            # create new row
            new_r_d.append({'drugname': name, 'reaction': rec, 'count': 1})
        else:
            new_r_d.at[row, 'count'] += 1
Assuming each row of your reactions column (drug_rec_act) contains a single string enclosed in a list, you can convert the values in that column to lists of strings (by splitting each string on the comma delimiter), then use explode() and value_counts() to get your desired result:
df['drug_rec_act'] = df['drug_rec_act'].apply(lambda x: x[0].split(','))
df_long = df.explode('drug_rec_act')
result = df_long.groupby('drugname')['drug_rec_act'].value_counts().reset_index(name='count')
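A self-contained sketch of that approach on dummy data (the drug names, reaction strings, and the one-string-in-a-list shape are assumptions about the original frame):

```python
import pandas as pd

# Dummy data in the assumed shape: each row's reactions are a single
# comma-separated string wrapped in a list.
df = pd.DataFrame({
    'drugname': ['aspirin', 'ibuprofen'],
    'drug_rec_act': [['nausea,headache,nausea'], ['rash']],
})

df['drug_rec_act'] = df['drug_rec_act'].apply(lambda x: x[0].split(','))
df_long = df.explode('drug_rec_act')           # one row per (drug, reaction)
result = (df_long.groupby('drugname')['drug_rec_act']
                 .value_counts()
                 .reset_index(name='count'))
```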
I have a pandas dataframe of 7000 rows, below is a sample
I need to fill in the missing branch_type column; the missing info is available in other rows. For the first row, I search the dataframe's ['link_name'] column for B-A and use that row's root_type as the branch type.
After the extraction I want to delete the row I extracted the root_type from to have an output like this:
I tried the below code, but it doesn't work properly
count = 0
missing = 0
errored_links=[]
for i, j in bmx.iterrows():
    try:
        spn = bmx[bmx.link_name == j.link_reverse_name].root_type.values[0]
        index_t = bmx[bmx.link_name == j.link_reverse_name].root_type.index[0]
        bmx.drop(bmx.index[index_t], inplace=True)
        count += 1
        bmx.at[i, 'branch_type'] = spn
    except:
        bmx.at[i, 'branch_type'] = 'missing'
        missing += 1
        errored_links.append(j)
print('Iterations: ',count)
print('Missing: ', missing)
Build up a list of indices to be removed and drop them only after iterating over all rows. Don't use if/else inside the loop: set every branch_type to "missing" at the start, then overwrite the ones that have a match.
bmx = pd.DataFrame({'link_name': ["A-B", "C-D", "B-A", "D-C"],
                    'root_type': ["type1", "type2", "type6", "type1"],
                    'branch_type': ["", "", "", ""],
                    'link_reverse_name': ["B-A", "D-C", "A-B", "C-D"]},
                   columns=['link_name', 'root_type', 'branch_type', 'link_reverse_name'])
bmx["branch_type"] = "missing"  # set all to missing up front, no ifs needed
to_remove = []
for i, j in bmx.iterrows():
    if i in to_remove:
        continue  # skip rows already marked for removal
    link = bmx[bmx.link_name == j.link_reverse_name].root_type.values[0]
    idx = bmx[bmx.link_name == j.link_reverse_name].index
    if link:
        bmx.at[i, 'branch_type'] = link  # write to the frame itself; j is only a copy
        to_remove.append(idx[0])         # remember the index of the reverse row
bmx.drop(to_remove, inplace=True)
print(bmx)
We get the desired output:
link_name root_type branch_type link_reverse_name
0 A-B type1 type6 B-A
1 C-D type2 type1 D-C
Of course I assume that all entries are unique, otherwise this will produce some duplicates. I left out the columns that are not relevant to the problem for simplicity.
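The loop can also be avoided entirely with a self-merge: join the frame to itself on link_reverse_name → link_name to pull each reverse row's root_type across, then keep one direction of each pair. A sketch on the same toy data; the lexicographic dedup rule is an assumption that happens to hold for names like A-B/B-A:

```python
import pandas as pd

bmx = pd.DataFrame({'link_name': ["A-B", "C-D", "B-A", "D-C"],
                    'root_type': ["type1", "type2", "type6", "type1"],
                    'link_reverse_name': ["B-A", "D-C", "A-B", "C-D"]})

# Lookup table: reverse-link name -> that row's root_type.
lookup = bmx[['link_name', 'root_type']].rename(
    columns={'link_name': 'link_reverse_name', 'root_type': 'branch_type'})

merged = bmx.merge(lookup, on='link_reverse_name', how='left')
merged['branch_type'] = merged['branch_type'].fillna('missing')

# Keep only one direction of each pair, e.g. the row whose link_name
# sorts before its reverse name.
result = merged[merged['link_name'] < merged['link_reverse_name']]
```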
I have imported an excel file into a DataFrame and iterated over a column called "Title" to pull out titles containing certain keywords; I have the list of titles as match_titles. What I want to do now is create a for loop that returns the column before "Title" for each title in match_titles. I'm not sure why the code is not working. Any help would be appreciated.
import pandas as pd
data = pd.read_excel(r'C:\Users\bryanmccormack\Downloads\asin_list.xlsx')
df = pd.DataFrame(data, columns=['Track','Asin','Title'])
excludes = ["Chainsaw", "Diaper pail", "Leaf Blower"]
my_excludes = [set(key_word.lower().split()) for key_word in excludes]
match_titles = [e for e in df.Title if
any(keywords.issubset(e.lower().split()) for keywords in my_excludes)]
a = []
for i in match_titles:
    a.append(df['Asin'])
print(a)
In your for loop you are appending the unfiltered column df['Asin'] to your list a as many times as there are values in match_titles. But there isn't any filtering of df.
One solution is to create a boolean column marking the matches; then you can filter on that column and return the Asin column:
# make a function to perform your match analysis.
def is_match(title, excludes=["Chainsaw", "Diaper pail", "Leaf Blower"]):
    my_excludes = [set(key_word.lower().split()) for key_word in excludes]
    if any(keywords.issubset(title.lower().split()) for keywords in my_excludes):
        return True
    return False
# Make a new boolean column for the matches. This applies your
# function to each value in df['Title'] and puts the output in
# the new column.
df['match_titles'] = df['Title'].apply(is_match)
# Filter the df to only matches and return the column you want.
# Because the match_titles column is boolean it can be used as
# an index.
result = df[df['match_titles']]['Asin']
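Put together on a small hypothetical frame (the Asin values and titles are made up to stand in for the Excel data):

```python
import pandas as pd

# Hypothetical data standing in for the Excel file.
df = pd.DataFrame({
    'Asin': ['B001', 'B002', 'B003'],
    'Title': ['Super Chainsaw Deluxe', 'Quiet Desk Fan', 'Leaf Blower Pro'],
})

def is_match(title, excludes=["Chainsaw", "Diaper pail", "Leaf Blower"]):
    my_excludes = [set(k.lower().split()) for k in excludes]
    return any(kw.issubset(title.lower().split()) for kw in my_excludes)

df['match_titles'] = df['Title'].apply(is_match)
result = df[df['match_titles']]['Asin']
print(list(result))  # Asins whose titles contain an excluded keyword
```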