How can I connect specific rows in a Pandas dataframe?


I would like to merge specific rows in a pandas DataFrame.
I have a column "text" and another column "name". Every entry in "text" is a string. Some entries in "name" are empty, so I would like to merge row n, which has an empty "name", into row n-1. If row n-1 also has an empty "name", both rows should be merged into the nearest preceding row that does have a "name".
For example:
Input:
Text=["Abc","def","ghi","jkl","mno","pqr","stu"]
Name=["a","b","c","","","f","g"]
Expected Output:
Text= ["Abc","def","ghijklmno","pqr","stu"]
Name = ["a","b","c","f","g"]
I'd like to make my question more understandable:
I have two lists:
index = [3,6,8,9,10,12,15,17,18,19]
text = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
new = []
for i in range(0,len(text)):
    if i not in index:
        if i+1 not in index:
            new.append(text[i])
    if i in index:
        new.append(text[i-1]+' '+ text[i])
The list index marks the positions where the text was split incorrectly (where the column name has no value).
Therefore, I'd like to append e.g. text[3] to text[2], so that I get a new entry 'c d'.
Finally, the output should be:
new = ['a','b','c d','e','f g','hijk','lm','n','op','qrst','u','v','w','x','y','z']
These lists are just a simplified example of my large text list. I don't know in advance how many entries I have to merge. My algorithm only works when entry n has to be merged with entry n-1. But it is also possible that entry n has to be merged with all entries back to n-10, giving one large entry.
I hope my question is now more understandable.

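Replace empty strings with NaN and forward fill. Then group by the Name column and aggregate.
import numpy as np
import pandas as pd
# treat empty names as missing and carry the last real name forward
df['Name'] = df['Name'].replace('', np.nan).ffill()
# sort=False keeps the groups in their original order
out_df = df.groupby('Name', sort=False).agg({'Text': ' '.join})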
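If the same name can occur again in a later, non-adjacent row, grouping by the forward-filled name would also merge those distant rows. A minimal sketch that groups by position instead, assuming empty names are empty strings as in the question's example (the helper name `block` is made up for this illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Text": ["Abc", "def", "ghi", "jkl", "mno", "pqr", "stu"],
    "Name": ["a", "b", "c", "", "", "f", "g"],
})

# Every non-empty Name starts a new block; rows with an empty Name
# keep the running block number of the row above them.
block = df["Name"].ne("").cumsum().rename("block")

out_df = (
    df.groupby(block)
      .agg(Name=("Name", "first"), Text=("Text", "".join))
      .reset_index(drop=True)
)
print(out_df)
```

Here the texts are joined without a separator to match the expected output of the first example; pass `" ".join` instead if a space is wanted.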

by using defaultdict
from collections import defaultdict
Name=["a","b","c",None,None,"f","g"]
Text=["Abc","def","ghi","jkl","mno","pqr","stu"]
d=defaultdict(str)
last=None
for name, text in zip(Name,Text):
    if name:
        last=name  # remember the most recent non-empty name
    d[last] += text  # rows without a name are appended to the previous name
print(list(d.values()))
['Abc', 'def', 'ghijklmno', 'pqr', 'stu']

I have a solution now (the code doesn't look good, but the output is what I expected):
for i in range(0,len(text)):
    if i not in index:
        if i+1 not in index:
            new.append(text[i])
        elif i+1 in index:
            if i+2 not in index:
                new.append(text[i]+text[i+1])
            elif i+2 in index:
                if i+3 not in index:
                    new.append(text[i]+text[i+1]+text[i+2])
                elif i+3 in index:
                    if i+4 not in index:
                        new.append(text[i]+text[i+1]+text[i+2]+text[i+3])
                    elif i+4 in index:
                        if i+5 not in index:
                            new.append(text[i]+text[i+1]+text[i+2]+text[i+3]+text[i+4])
I have to add a few more if conditions... but for the simplified example above, the code works perfectly.
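The nested conditions can be avoided by appending each marked entry to whatever was collected last, which handles runs of any length. A minimal sketch under that reading of the problem (the function name `merge_marked` and its `sep` parameter are made up for this illustration):

```python
import string

def merge_marked(text, index, sep=""):
    """Append every entry whose position is in `index` to the previous entry."""
    marked = set(index)  # set lookup is faster than scanning a list each time
    merged = []
    for i, item in enumerate(text):
        if i in marked and merged:
            merged[-1] += sep + item  # continue the entry that is already open
        else:
            merged.append(item)       # start a new entry
    return merged

index = [3, 6, 8, 9, 10, 12, 15, 17, 18, 19]
text = list(string.ascii_lowercase)
new = merge_marked(text, index)
print(new)
```

With the default `sep=""` the third entry comes out as 'cd'; calling `merge_marked(text, index, sep=" ")` gives 'c d' instead.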

Related

Link values in table to (first) column - pandas

I have a table with no header in which the first column is followed by around 50 other columns that contain some NaN values and some values that appear more than 5 times.
I would like to map each value in the 2nd through the last column to the value in the first column of its row.
For example, my dataframe:
|no header|x|x|x|x|x|
|-|-|-|-|-|-|
|1|NaN|2|2|4|6|
|2|3|3|NaN|7|7|
|3|1|1|9|5|5|
Values don't appear in more than one row, but they may appear more than once in the same row.
The expected result:
|value|linked to|
|-----|---------|
|1|3|
|2|1|
|3|2|
|4|1|
|5|3|
|6|1|
|7|2|
|8|NaN|
|9|3|
df.values.tolist()
This results in a list of lists and does not skip the first column.

I wrote some quick and dirty code to help myself.
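df_list = frame1.values.tolist()
# collapse each row to its set of distinct values
z=[]
for x in df_list:
    z.append(set(x))
# drop NaN, the only value for which x == x is False
q=[]
for b in z:
    q.append({x for x in b if x==x})
r=[]
for w in q:
    r.append(list(w))
# key each row's values by its 1-based row number
dict_values = { i+1 : r[i] for i in range(0, len(r) ) }
# invert the mapping: value -> row number
dict2= {}
for keys,values in dict_values.items():
    for i in values:
        dict2[i]=keys
final = pd.DataFrame(list(dict2.items()))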
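The same mapping can be built more directly by keeping the first column out of the values that get linked. A minimal sketch using the example table from the question, assuming each value occurs in only one row as stated there (the names `links` and `result` are made up for this illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1, np.nan, 2, 2, 4, 6],
    [2, 3, 3, np.nan, 7, 7],
    [3, 1, 1, 9, 5, 5],
])

# For every row, point each non-NaN value from column 1 onwards
# at the value stored in column 0 of that row.
links = {
    int(value): int(row.iloc[0])
    for _, row in df.iterrows()
    for value in row.iloc[1:].dropna()
}

result = pd.Series(links, name="linked to").sort_index()
print(result)
```

A value that never occurs, such as 8 in the example, is simply absent from the result rather than mapped to NaN.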

Counting combinations in Dataframe create new Dataframe

I have a dataframe called reactions_drugs, and I want to create a table called new_r_d that keeps track of how often I see a symptom for a given medication.
Here is the code I have, but I am running into errors such as "Unable to coerce to Series, length must be 3 given 0":
new_r_d = pd.DataFrame(columns = ['drugname', 'reaction', 'count'])
for i in range(len(reactions_drugs)):
    name = reactions_drugs.drugname[i]
    drug_rec_act = reactions_drugs.drug_rec_act[i]
    for rec in drug_rec_act:
        row = new_r_d.loc[(new_r_d['drugname'] == name) & (new_r_d['reaction'] == rec)]
        if row == []:
            # create new row
            new_r_d.append({'drugname': name, 'reaction': rec, 'count': 1})
        else:
            new_r_d.at[row,'count'] += 1

Assuming the rows in your current reactions (drug_rec_act) column contain one string enclosed in a list, you can convert the values in that column to lists of strings (by splitting each string on the comma delimiter) and then utilize the explode() function and value_counts() to get your desired result:
df['drug_rec_act'] = df['drug_rec_act'].apply(lambda x: x[0].split(','))
df_long = df.explode('drug_rec_act')
result = df_long.groupby('drugname')['drug_rec_act'].value_counts().reset_index(name='count')
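To see the steps above end to end, here is a small runnable sketch. The sample rows are invented for illustration, and the layout of drug_rec_act (a list holding one comma-separated string) is the same assumption the answer makes:

```python
import pandas as pd

# Invented sample data; the real reactions_drugs frame is not shown in the question.
reactions_drugs = pd.DataFrame({
    "drugname": ["aspirin", "aspirin", "ibuprofen"],
    "drug_rec_act": [["nausea,headache"], ["nausea"], ["rash"]],
})

df = reactions_drugs.copy()
# Turn ["nausea,headache"] into ["nausea", "headache"].
df["drug_rec_act"] = df["drug_rec_act"].apply(lambda x: x[0].split(","))
# One row per (drug, reaction) pair.
df_long = df.explode("drug_rec_act")
# Count how often each reaction occurs per drug.
new_r_d = (
    df_long.groupby("drugname")["drug_rec_act"]
           .value_counts()
           .reset_index(name="count")
)
print(new_r_d)
```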

iterate over pandas dataframe, update value from data in another row, and delete that other row

I have a pandas dataframe of 7000 rows, below is a sample
I need to fill in the missing branch type column; the missing information is available in rows further down. For the first row, I search the data frame's ['link_name'] column for B-A and use that row's root_type as the branch type.
After the extraction I want to delete the row I extracted the root_type from to have an output like this:
I tried the below code, but it doesn't work properly
count = 0
missing = 0
errored_links=[]
for i,j in bmx.iterrows():
    try:
        spn = bmx[bmx.link_name ==j.link_reverse_name].root_type.values[0]
        index_t = bmx[bmx.link_name ==j.link_reverse_name].root_type.index[0]
        bmx.drop(bmx.index[index_t],inplace=True)
        count+=1
        bmx.at[i,'branch_type']=spn
    except:
        bmx.at[i,'branch_type']='missing'
        missing+=1
        errored_links.append(j)
print('Iterations: ',count)
print('Missing: ', missing)

Build up a list of the indices to be removed while iterating, and drop those rows only after the loop has finished. Instead of an if/else in the loop, set every branch_type to "missing" at the start and then overwrite the rows for which a reverse link is found.
import pandas as pd
bmx=pd.DataFrame({'link_name':["A-B","C-D","B-A","D-C"],
                  'root_type':["type1", "type2", "type6", "type1"],
                  'branch_type':["","","",""],
                  'link_reverse_name':["B-A","D-C","A-B","C-D"]},
                 columns=['link_name','root_type','branch_type','link_reverse_name'])
bmx["branch_type"]="missing" #set all to be missing by start, get rid of ifs :)
to_remove = []
for i,j in bmx.iterrows():
    if(i in to_remove):
        continue #just skip if we marked the row for removal already
    match = bmx[bmx.link_name == j.link_reverse_name]
    if not match.empty:
        #write to the frame itself; j is only a copy of the row
        bmx.at[i,'branch_type'] = match.root_type.values[0]
        to_remove.append(match.index[0]) #append the index to the list
bmx.drop(to_remove, inplace=True)
print(bmx)
We get the desired output:
  link_name root_type branch_type link_reverse_name
0       A-B     type1       type6               B-A
1       C-D     type2       type1               D-C
Of course, this assumes that all entries are unique; otherwise it will produce some duplicates. For simplicity, I left out the columns that are not relevant to the problem.
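For a frame of several thousand rows, the row-by-row loop can also be replaced by a lookup. A minimal sketch, assuming every link_name is unique and that the first row of each pair is the one to keep (the names `root_by_link`, `pair_key` and `result` are made up for this illustration):

```python
import pandas as pd

bmx = pd.DataFrame({
    "link_name": ["A-B", "C-D", "B-A", "D-C"],
    "root_type": ["type1", "type2", "type6", "type1"],
    "link_reverse_name": ["B-A", "D-C", "A-B", "C-D"],
})

# Look up the root_type of each row's reverse link; rows without a
# matching reverse link end up as "missing".
root_by_link = bmx.set_index("link_name")["root_type"]
bmx["branch_type"] = bmx["link_reverse_name"].map(root_by_link).fillna("missing")

# Both rows of a pair share the same sorted (link, reverse) key,
# so dropping duplicated keys keeps only the first row of each pair.
pair_key = pd.Series(
    [tuple(sorted(pair)) for pair in zip(bmx["link_name"], bmx["link_reverse_name"])],
    index=bmx.index,
)
result = bmx[~pair_key.duplicated()]
print(result)
```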

How to retrieve the column name and row name with a condition satisfied in a dataframe?

I need to check whether the sum of a column is 1 and, if it is, retrieve the column name and the row number in a dictionary.
The output should be list1=({8:1004},{9:1001}).
I have tried some python code but couldn't move forward with the code.
list1=[]
for Emp in SkillsA:
    sum_row = (SkillsA.sum(axis=0))
    #print(sum_row)
    # print((Skills_A[0]))
    if sum_row[Emp] == 1:
        #print(Emp)
        for ws in SkillsA:
            # if SkillsA[ws][Emp] == 1:
            print(SkillsA[ws][Emp])
            #list1.update({Emp:ws})

With pandas you can do it like this:
import pandas as pd
# Import data
df = pd.read_excel("location_file")
# Create a dictionary (named result so that the built-in dict is not shadowed)
result = {}
# Iterate over the columns
for i in df.columns:
    if df[i].sum() == 1:
        result[i] = df.Employee_No[df[i] == 1] # Add filtered data to the dictionary

If the problem is described as
"find a mapping of column labels to row labels for the 1s in columns that contain only a single 1",
it can also be done with a one-line function:
def indices_of_single_ones_by_column(df):
    return [{col: df[col].idxmax()} for col in df.columns if df[col].sum() == 1]
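A short usage sketch for the function above. The sample frame is invented, and it assumes the employee numbers are the row index and every cell is 0 or 1, which the question does not show explicitly:

```python
import pandas as pd

def indices_of_single_ones_by_column(df):
    # idxmax returns the row label of the first maximum, which is the
    # position of the only 1 when the column sums to 1.
    return [{col: df[col].idxmax()} for col in df.columns if df[col].sum() == 1]

# Invented sample data: rows are employees, columns are skill ids.
skills = pd.DataFrame(
    {7: [1, 1, 0, 0], 8: [0, 0, 0, 1], 9: [1, 0, 0, 0]},
    index=pd.Index([1001, 1002, 1003, 1004], name="Employee_No"),
)

list1 = indices_of_single_ones_by_column(skills)
print(list1)
```

Column 7 is skipped because it contains two 1s; columns 8 and 9 map to employees 1004 and 1001.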

Pandas: Trouble setting value for each column

I have an empty Pandas dataframe and I'm trying to add a row to it. Here's what I mean:
text_img_count = len(BeautifulSoup(html, "lxml").find_all('img'))
print('img count: ', text_img_count)
keys = ['text_img_count', 'text_vid_count', 'text_link_count', 'text_par_count', 'text_h1_count',
'text_h2_count', 'text_h3_count', 'text_h4_count', 'text_h5_count', 'text_h6_count',
'text_bold_count', 'text_italic_count', 'text_table_count', 'text_word_length', 'text_char_length',
'text_capitals_count', 'text_sentences_count', 'text_middles_count', 'text_rows_count',
'text_nb_digits', 'title_char_length', 'title_word_length', 'title_nb_digits']
values = [text_img_count, text_vid_count, text_link_count, text_par_count, text_h1_count,
text_h2_count, text_h3_count, text_h4_count, text_h5_count, text_h6_count,
text_bold_count, text_italic_count, text_table_count, text_word_length,
text_char_length, text_capitals_count, text_sentences_count, text_middles_count,
text_rows_count, text_nb_digits, title_char_length, title_word_length, title_nb_digits]
numeric_df = pd.DataFrame()
for key, value in zip(keys, values):
    numeric_df[key] = value
print(numeric_df.head())
However, the output is this:
img count: 2
Empty DataFrame
Columns: [text_img_count, text_vid_count, text_link_count, text_par_count, text_h1_count, text_h2_count, text_h3_count, text_h4_count, text_h5_count, text_h6_count, text_bold_count, text_italic_count, text_table_count, text_word_length, text_char_length, text_capitals_count, text_sentences_count, text_middles_count, text_rows_count, text_nb_digits, title_char_length, title_word_length, title_nb_digits]
Index: []
[0 rows x 23 columns]
This makes it seem like numeric_df is empty after I just assigned values for each of its columns.
What's going on?
Thanks for the help!

What I usually do to add a column to an empty data frame is to collect the information in a list and then give it a data frame structure. For example:
df=pd.DataFrame()
L=['a','b']
df['SomeName']=pd.DataFrame(L)
And you have to use pd.Series() if the list is made of numbers.
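As for what is going on in the question: assigning a scalar to a column spreads it over the rows that already exist, and an empty DataFrame has none, so only the column names are added. A minimal sketch that shows this and builds a one-row frame instead, using three of the question's keys with made-up counts:

```python
import pandas as pd

keys = ["text_img_count", "text_vid_count", "text_link_count"]
values = [2, 0, 5]  # made-up counts for illustration

# Scalar assignment on an empty frame: columns appear, rows do not.
empty = pd.DataFrame()
for key, value in zip(keys, values):
    empty[key] = value

# Building the frame from a list with one dict creates exactly one row.
numeric_df = pd.DataFrame([dict(zip(keys, values))])
print(len(empty), len(numeric_df))
print(numeric_df.head())
```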
