How can I connect specific rows in a Pandas dataframe? - python

I would like to connect specific rows in a Pandas dataframe.
I have a column "text" and another column "name". Each entry of the column "text" holds a string. Some entries of the column "name" are empty, so I would like to merge row n, which has an empty entry in the column "name", into row n-1. If row n-1 also has an empty entry in the column "name", both rows should be merged into the nearest preceding row that has an entry in the column "name".
For example:
Input:
Text=["Abc","def","ghi","jkl","mno","pqr","stu"]
Name=["a","b","c","","","f","g"]
Expected Output:
Text= ["Abc","def","ghijklmno","pqr","stu"]
Name = ["a","b","c","f","g"]
I'd like to make my question more understandable:
I have two lists:
index = [3,6,8,9,10,12,15,17,18,19]
text = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
new = []
for i in range(0,len(text)):
    if i not in index:
        if i+1 not in index:
            new.append(text[i])
    if i in index:
        new.append(text[i-1]+' '+ text[i])
The list index shows the false splits of the text (when column name has no value).
Therefore, I'd like to append e.g. text[3] to text[2]. So I'll get a new entry 'c d'.
Finally, the output should be:
new = ['a','b','c d','e','f g','hijk','lm','n','op','qrst','u','v','w','x','y','z']
These lists are just a simplified example for my large textlist. I don't know how many entries I have to connect together. My algorithm works only when I have to connect an entry n with the entry n-1. But it's also possible that I have to connect the entry n with the entries until n-10, so I get one large entry.
I hope my question is now more understandable.

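Replace empty strings with NaN and forward fill. Then group by the Name column and aggregate.
import numpy as np
import pandas as pd
# Series.replace swaps whole empty-string values for NaN; ffill then copies the last valid name down
df['Name'] = df['Name'].replace('', np.nan).ffill()
out_df = df.groupby('Name').agg({'Text': ' '.join})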
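One caveat with grouping on the forward-filled names is that two separate runs that happen to share a name would end up in the same group. Here is a minimal sketch of a variant that avoids this, assuming the example data from the question and that missing names are empty strings. It numbers each block with a cumulative count of the non-empty names (the `block` label is just an illustrative name, not part of the answer above) and joins the text without a separator, as in the expected output.

```python
import pandas as pd

df = pd.DataFrame({
    "Text": ["Abc", "def", "ghi", "jkl", "mno", "pqr", "stu"],
    "Name": ["a", "b", "c", "", "", "f", "g"],
})

# Every non-empty name starts a new block; rows with an empty name
# keep the block number of the row above them.
block = df["Name"].ne("").cumsum().rename("block")

out_df = (
    df.groupby(block)
      .agg(Name=("Name", "first"), Text=("Text", "".join))
      .reset_index(drop=True)
)
print(out_df)
```

Because the grouping key is a running block number rather than the name itself, the original row order is kept and repeated names in different places stay separate.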

by using defaultdict
Name=["a","b","c",None,None,None,"f","g"]
Text=["Abc","def","ghi","jkl","mno","pqr","stu"]
lst=list(zip(Name,Text))
from collections import defaultdict
d=defaultdict(str)
for i, v in lst:
    d[i] += v
print(list(d.values()))
['Abc', 'def', 'ghi', 'jklmnopqr', 'stu']

I have a solution now (the code doesn't look good, but the output is what I expected):
for i in range(0,len(text)):
    if i not in index:
        if i+1 not in index:
            new.append(text[i])
        elif i+1 in index:
            if i+2 not in index:
                new.append(text[i]+text[i+1])
            elif i+2 in index:
                if i+3 not in index:
                    new.append(text[i]+text[i+1]+text[i+2])
                elif i+3 in index:
                    if i+4 not in index:
                        new.append(text[i]+text[i+1]+text[i+2]+text[i+3])
                    elif i+4 in index:
                        if i+5 not in index:
                            new.append(text[i]+text[i+1]+text[i+2]+text[i+3]+text[i+4])
I have to add a few more if conditions... but for the simplified example above, the code works perfectly.
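The nested conditions can probably be avoided altogether. Below is a minimal sketch, assuming the same `text` and `index` lists as in the question, that handles runs of any length by appending each marked entry to the last element already in the result. The function name `merge_marked` is made up for illustration, and entries are joined without a space, as in the code above.

```python
def merge_marked(text, index):
    """Append every entry whose position is in `index` to the entry before it."""
    marked = set(index)  # set membership checks are fast
    merged = []
    for i, item in enumerate(text):
        if i in marked and merged:
            merged[-1] += item   # continue the current run
        else:
            merged.append(item)  # start a new entry
    return merged


index = [3, 6, 8, 9, 10, 12, 15, 17, 18, 19]
text = list("abcdefghijklmnopqrstuvwxyz")
new = merge_marked(text, index)
print(new)
# ['a', 'b', 'cd', 'e', 'fg', 'hijk', 'lm', 'n', 'op', 'qrst', 'u', 'v', 'w', 'x', 'y', 'z']
```

Since each marked entry is glued onto whatever came last, it does not matter whether a run has one marked entry or ten.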

Related

Link values in table to (first) column - pandas

I got a table with no header in which the first column is followed by around 50 other columns with some nan-values and some values that appear more than 5 times.
I would like to let the values from the 2nd to the last column point to the values in the first column.
For example, my dataframe:
|no header|x|x|x|x|x|
|-|-|-|-|-|-|
|1|NaN|2|2|4|6|
|2|3|3|NaN|7|7|
|3|1|1|9|5|5|
Values don't appear in more than one row, but they may appear more than once in the same row.
Makes:
|value|linked to|
|-----|---------|
|1|3|
|2|1|
|3|2|
|4|1|
|5|3|
|6|1|
|7|2|
|8|NaN|
|9|3|
df.values.tolist()
Results in a list within a list, not skipping the first column.
I wrote some quick and dirty code to help myself.
df_list = frame1.values.tolist()
z=[]
for x in df_list:
    z.append(set(x))
q=[]
for b in z:
    q.append({x for x in b if x==x})
r=[]
for w in q:
    r.append(list(w))
dict_values = { i+1 : r[i] for i in range(0, len(r) ) }
dict2= {}
for keys,values in dict_values.items():
    for i in values:
        dict2[i]=keys
final = pd.DataFrame(dict2.items())
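The same mapping can be built in one pass over the rows. This is only a sketch, assuming the rows come from something like `frame1.values.tolist()`, that all non-NaN values are whole numbers, and that each value appears in only one row. The function name `link_to_first_column` is hypothetical.

```python
import math


def link_to_first_column(rows):
    """Map each value in columns 2..n to the first-column value of its row."""
    links = {}
    for first, *rest in rows:
        for value in rest:
            if isinstance(value, float) and math.isnan(value):
                continue  # skip NaN cells
            links[int(value)] = first  # assumes whole-number values
    return links


nan = float("nan")
rows = [
    [1, nan, 2, 2, 4, 6],
    [2, 3, 3, nan, 7, 7],
    [3, 1, 1, 9, 5, 5],
]
links = link_to_first_column(rows)
for value in sorted(links):
    print(value, links[value])
```

Unlike the version above, this takes the key from the first column itself instead of the row position, so it should also work when the first column is not simply 1, 2, 3.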

Counting combinations in Dataframe create new Dataframe

So I have a dataframe called reactions_drugs
and I want to create a table called new_r_d where I keep track of how often I see a symptom for a given medication.
Here is the code I have but I am running into errors such as "Unable to coerce to Series, length must be 3 given 0"
new_r_d = pd.DataFrame(columns = ['drugname', 'reaction', 'count'])
for i in range(len(reactions_drugs)):
    name = reactions_drugs.drugname[i]
    drug_rec_act = reactions_drugs.drug_rec_act[i]
    for rec in drug_rec_act:
        row = new_r_d.loc[(new_r_d['drugname'] == name) & (new_r_d['reaction'] == rec)]
        if row == []:
            # create new row
            new_r_d.append({'drugname': name, 'reaction': rec, 'count': 1})
        else:
            new_r_d.at[row,'count'] += 1
Assuming the rows in your current reactions (drug_rec_act) column contain one string enclosed in a list, you can convert the values in that column to lists of strings (by splitting each string on the comma delimiter) and then utilize the explode() function and value_counts() to get your desired result:
df['drug_rec_act'] = df['drug_rec_act'].apply(lambda x: x[0].split(','))
df_long = df.explode('drug_rec_act')
result = df_long.groupby('drugname')['drug_rec_act'].value_counts().reset_index(name='count')
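To make the steps concrete, here is a small end-to-end sketch of the same approach. The drug names and reactions are invented sample data, and it assumes, as above, that each cell holds a one-element list containing a comma-separated string.

```python
import pandas as pd

# Hypothetical sample data in the assumed layout.
df = pd.DataFrame({
    "drugname": ["aspirin", "aspirin", "ibuprofen"],
    "drug_rec_act": [["nausea,headache"], ["nausea"], ["rash,nausea"]],
})

# Turn each one-element list into a list of individual reactions.
df["drug_rec_act"] = df["drug_rec_act"].apply(lambda x: x[0].split(","))

# One row per (drug, reaction) pair.
df_long = df.explode("drug_rec_act")

# Count how often each reaction occurs for each drug.
result = (
    df_long.groupby("drugname")["drug_rec_act"]
    .value_counts()
    .reset_index(name="count")
)
print(result)
```

Here "nausea" should be counted twice for "aspirin" and every other pair once.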

iterate over pandas dataframe, update value from data in another row, and delete that other row

I have a pandas dataframe of 7000 rows, below is a sample
I need to fill in the missing branch type column; the missing info is available in the rows below. For the first row, I search the data frame's ['link_name'] column for B-A and use that row's root_type as the branch type.
After the extraction I want to delete the row I extracted the root_type from to have an output like this:
I tried the below code, but it doesn't work properly
count = 0
missing = 0
errored_links=[]
for i,j in bmx.iterrows():
    try:
        spn = bmx[bmx.link_name ==j.link_reverse_name].root_type.values[0]
        index_t = bmx[bmx.link_name ==j.link_reverse_name].root_type.index[0]
        bmx.drop(bmx.index[index_t],inplace=True)
        count+=1
        bmx.at[i,'branch_type']=spn
    except:
        bmx.at[i,'branch_type']='missing'
        missing+=1
        errored_links.append(j)
print('Iterations: ',count)
print('Missing: ', missing)
Build up a list of the indices to be removed while iterating, and drop those rows only after the loop has finished. Instead of an if/else in the loop, set every branch_type to 'missing' at the start and then overwrite it for the rows that have a matching reverse link.
import pandas as pd
bmx=pd.DataFrame({'link_name':["A-B","C-D","B-A","D-C"],
                  'root_type':["type1", "type2", "type6", "type1"],
                  'branch_type':["","","",""],
                  'link_reverse_name':["B-A","D-C","A-B","C-D"]},
                 columns=['link_name','root_type','branch_type','link_reverse_name'])
bmx["branch_type"]="missing" # default for every row, overwritten below when a reverse link exists
to_remove = []
for i,j in bmx.iterrows():
    if i in to_remove:
        continue # skip rows already marked for removal
    match = bmx[bmx.link_name == j.link_reverse_name]
    if not match.empty:
        # write through bmx.at: the row j from iterrows() may be a copy
        bmx.at[i,'branch_type'] = match.root_type.values[0]
        to_remove.append(match.index[0]) # remember the reverse row for removal
bmx.drop(to_remove, inplace=True)
print(bmx)
We get the desired output:
  link_name root_type branch_type link_reverse_name
0       A-B     type1       type6               B-A
1       C-D     type2       type1               D-C
Of course, I assume that all entries are unique; otherwise this will produce some duplicates. For simplicity, I left out the columns that are not relevant to the problem.
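For 7000 rows the loop is probably fine, but the lookup can also be done without iterating. The following is a rough sketch under the same uniqueness assumption (every link_name appears once): it maps each reverse name to its root_type and then keeps only the first row of every A-B / B-A pair. The names `root_by_link` and `pair_key` are illustrative helpers, not from the answer above.

```python
import pandas as pd

bmx = pd.DataFrame({
    "link_name": ["A-B", "C-D", "B-A", "D-C"],
    "root_type": ["type1", "type2", "type6", "type1"],
    "link_reverse_name": ["B-A", "D-C", "A-B", "C-D"],
})

# Look up the root_type of the reverse link; links without a reverse row
# become "missing". This requires link_name to be unique.
root_by_link = bmx.set_index("link_name")["root_type"]
bmx["branch_type"] = bmx["link_reverse_name"].map(root_by_link).fillna("missing")

# A-B and B-A get the same order-independent key, so only the first row
# of each pair is kept.
pair_key = pd.Series(
    [tuple(sorted(pair)) for pair in zip(bmx["link_name"], bmx["link_reverse_name"])],
    index=bmx.index,
)
result = bmx[~pair_key.duplicated()]
print(result)
```

A link whose reverse never appears is kept with branch_type "missing", which mirrors the default used in the loop version.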

How to retrieve the column name and row name with a condition satisfied in a dataframe?

I need to check whether the sum of a column is 1 and, if it is, retrieve the column name and row number in a dictionary.
The output should be list1=({8:1004},{9:1001}).
I have tried some python code but couldn't move forward with the code.
list1=[]
for Emp in SkillsA:
    sum_row = (SkillsA.sum(axis=0))
    #print(sum_row)
    # print((Skills_A[0]))
    if sum_row[Emp] == 1:
        #print(Emp)
        for ws in SkillsA:
            # if SkillsA[ws][Emp] == 1:
            print(SkillsA[ws][Emp])
            #list1.update({Emp:ws})
With pandas you can do it like this:
import pandas as pd
# Import data
df = pd.read_excel("location_file")
# Create a dictionary (named result so the built-in dict is not shadowed)
result = {}
# Iterate over the columns
for i in df.columns:
    if df[i].sum() == 1:
        result[i] = df.Employee_No[df[i] == 1] # Add the filtered data to the dictionary
Based on describing the problem as
Find a mapping of columns labels to row labels for the 1s in columns having only a single 1,
it can also be done with a one-line function:
def indices_of_single_ones_by_column(df):
    return [{col: df[col].idxmax()} for col in df.columns if df[col].sum() == 1]
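A quick usage sketch for this function follows. The table is invented, since the original data is not shown, and it assumes the employee numbers are the index and the cells hold only 0s and 1s.

```python
import pandas as pd


def indices_of_single_ones_by_column(df):
    return [{col: df[col].idxmax()} for col in df.columns if df[col].sum() == 1]


# Hypothetical skills matrix: rows are employee numbers, columns are workstations.
skills = pd.DataFrame(
    {7: [1, 1, 0, 0], 8: [0, 0, 0, 1], 9: [1, 0, 0, 0]},
    index=[1001, 1002, 1003, 1004],
)

list1 = indices_of_single_ones_by_column(skills)
print(list1)
```

Column 7 sums to 2 and is skipped, while columns 8 and 9 each contain a single 1, so `idxmax()` returns the row label of that 1.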

Pandas: Trouble setting value for each column

I have an empty Pandas dataframe and I'm trying to add a row to it. Here's what I mean:
text_img_count = len(BeautifulSoup(html, "lxml").find_all('img'))
print('img count:', text_img_count)
keys = ['text_img_count', 'text_vid_count', 'text_link_count', 'text_par_count', 'text_h1_count',
'text_h2_count', 'text_h3_count', 'text_h4_count', 'text_h5_count', 'text_h6_count',
'text_bold_count', 'text_italic_count', 'text_table_count', 'text_word_length', 'text_char_length',
'text_capitals_count', 'text_sentences_count', 'text_middles_count', 'text_rows_count',
'text_nb_digits', 'title_char_length', 'title_word_length', 'title_nb_digits']
values = [text_img_count, text_vid_count, text_link_count, text_par_count, text_h1_count,
text_h2_count, text_h3_count, text_h4_count, text_h5_count, text_h6_count,
text_bold_count, text_italic_count, text_table_count, text_word_length,
text_char_length, text_capitals_count, text_sentences_count, text_middles_count,
text_rows_count, text_nb_digits, title_char_length, title_word_length, title_nb_digits]
numeric_df = pd.DataFrame()
for key, value in zip(keys, values):
    numeric_df[key] = value
print(numeric_df.head())
However, the output is this:
img count: 2
Empty DataFrame
Columns: [text_img_count, text_vid_count, text_link_count, text_par_count, text_h1_count, text_h2_count, text_h3_count, text_h4_count, text_h5_count, text_h6_count, text_bold_count, text_italic_count, text_table_count, text_word_length, text_char_length, text_capitals_count, text_sentences_count, text_middles_count, text_rows_count, text_nb_digits, title_char_length, title_word_length, title_nb_digits]
Index: []
[0 rows x 23 columns]
This makes it seem like numeric_df is empty after I just assigned values for each of its columns.
What's going on?
Thanks for the help!
What I usually do to add a column to the empty data frame is to append the information into a list and then give it a data frame structure. For example:
df=pd.DataFrame()
L=['a','b']
df['SomeName']=pd.DataFrame(L)
And you have to use pd.Series() if the list is made of numbers.
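As for what is going on in the question: as far as I can tell, assigning a scalar to a column broadcasts it over the existing rows, and an empty DataFrame has no rows, so the columns get created but stay empty. Here is a small sketch that shows the effect and one possible fix, building the single row from a dict. The three keys and their counts are shortened, made-up stand-ins for the 23 values in the question.

```python
import pandas as pd

keys = ["text_img_count", "text_vid_count", "text_link_count"]
values = [2, 0, 5]  # hypothetical counts

# Reproduces the problem: each scalar is broadcast over zero rows.
empty_df = pd.DataFrame()
for key, value in zip(keys, values):
    empty_df[key] = value
print(empty_df.shape)    # columns exist, but there are no rows

# One way around it: build the row as a dict and wrap it in a list.
numeric_df = pd.DataFrame([dict(zip(keys, values))])
print(numeric_df.head())
```

Wrapping each value in a list (`numeric_df[key] = [value]`) should have the same effect, since a list of length one gives the frame its first row.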
