Hi all, I have a dataframe of approximately 400k rows with a column of interest. I would like to map each element in the column to a category (LU, HU, etc.). The categories come from a smaller dataframe whose column names are the categories. The function below, however, runs very slow for only 400k rows, and I'm not sure why. For the 5 examples below it is of course fast.
cwp_sector_mapping = {
'LU': ['C2P34', 'C2P35', 'C2P36'],
'HU': ['C2P37', 'C2P38', 'C2P39'],
'EH': ['C2P40', 'C2P41', 'C2P42'],
'EL': ['C2P43', 'C2P44', 'C2P45'],
'WL': ['C2P12', 'C2P13', 'C2P14'],
'WH': ['C2P15', 'C2P16', 'C2P17'],
'NL': ['C2P18', 'C2P19', 'C2P20'],
}
df_cwp = pd.DataFrame.from_dict(cwp_sector_mapping)
columns = df_cwp.columns
ls = pd.Series(['C2P44', 'C2P43', 'C2P12', 'C2P1'])
temp = list(map(lambda pos: columns[df_cwp.eq(pos).any()][0]
                if columns[df_cwp.eq(pos).any()].size != 0 else 'UN', ls))
Use the next with iter trick to get the first matched column name; if there is no match, return the default value UN:
temp = [next(iter(columns[df_cwp.eq(pos).any()]), 'UN') for pos in ls]
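If speed is the main concern, it may be faster still to invert the small mapping into a plain dict once and use Series.map, so each lookup is a hash lookup instead of a scan over df_cwp. A minimal sketch, reusing cwp_sector_mapping and ls from the question:

# reverse lookup built once: position code -> category
pos_to_cat = {pos: cat
              for cat, positions in cwp_sector_mapping.items()
              for pos in positions}

# vectorized lookup; codes with no category become NaN, then filled with 'UN'
temp = ls.map(pos_to_cat).fillna('UN').tolist()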
I have a pandas dataframe of 7000 rows, below is a sample
I need to fill in the missing branch_type column; the missing info is available in the rows below. For the first row, I search the dataframe's ['link_name'] column for B-A and use that row's root_type as the branch_type.
After the extraction I want to delete the row I extracted the root_type from to have an output like this:
I tried the below code, but it doesn't work properly
count = 0
missing = 0
errored_links = []
for i, j in bmx.iterrows():
    try:
        spn = bmx[bmx.link_name == j.link_reverse_name].root_type.values[0]
        index_t = bmx[bmx.link_name == j.link_reverse_name].root_type.index[0]
        bmx.drop(bmx.index[index_t], inplace=True)
        count += 1
        bmx.at[i, 'branch_type'] = spn
    except:
        bmx.at[i, 'branch_type'] = 'missing'
        missing += 1
        errored_links.append(j)
print('Iterations: ', count)
print('Missing: ', missing)
Build up a list of indices to be removed, do the work, and remove the unneeded rows only after iterating over all rows. Don't use if/else in the loop; simply set all rows to missing at the start and then overwrite those that have a branch type.
bmx = pd.DataFrame({'link_name': ["A-B", "C-D", "B-A", "D-C"],
                    'root_type': ["type1", "type2", "type6", "type1"],
                    'branch_type': ["", "", "", ""],
                    'link_reverse_name': ["B-A", "D-C", "A-B", "C-D"]},
                   columns=['link_name', 'root_type', 'branch_type', 'link_reverse_name'])
bmx["branch_type"] = "missing"  # set all to missing at the start, get rid of the ifs :)
to_remove = []
for i, j in bmx.iterrows():
    if i in to_remove:
        continue  # just skip if we marked the row for removal already
    link = bmx[bmx.link_name == j.link_reverse_name].root_type.values[0]
    idx = bmx[bmx.link_name == j.link_reverse_name].index
    if link:
        bmx.at[i, 'branch_type'] = link  # write to the frame itself; mutating j would not persist
        to_remove.append(idx[0])  # remember the index of the reverse row
bmx.drop(to_remove, inplace=True)
print(bmx)
We get the desired output:
link_name root_type branch_type link_reverse_name
0 A-B type1 type6 B-A
1 C-D type2 type1 D-C
Of course I assume that all entries are unique, otherwise this will produce duplicates. For simplicity I left out the columns that are not relevant to the problem.
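As an aside, the same lookup can be done without an explicit loop via a self-merge; a sketch under the same column names (note it only fills branch_type, so dropping the reverse rows remains a separate step):

# match each row's link_reverse_name against link_name in one vectorized pass
merged = bmx.merge(bmx[['link_name', 'root_type']],
                   left_on='link_reverse_name', right_on='link_name',
                   how='left', suffixes=('', '_rev'))
bmx['branch_type'] = merged['root_type_rev'].fillna('missing').values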
I would like to connect specific rows in a Pandas dataframe.
I have a column "text" and another column "name". Each entry of the column "text" contains a string. Some entries of the column "name" are empty, so I would like to connect each row n that has an empty entry in "name" to row (n-1). If row (n-1) also has an empty entry in "name", both rows should be connected to the nearest previous row that has an entry in "name".
For example:
Input:
Text=["Abc","def","ghi","jkl","mno","pqr","stu"]
Name=["a","b","c",““,““,"f","g"]
Expected Output:
Text= ["Abc","def","ghijklmno","pqr","stu"]
Name = ["a","b","c","f","g"]
I'd like to make my question more understandable:
I have two lists:
index = [3,6,8,9,10,12,15,17,18,19]
text = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
new = []
for i in range(0, len(text)):
    if i not in index:
        if i+1 not in index:
            new.append(text[i])
    if i in index:
        new.append(text[i-1] + ' ' + text[i])
The list index marks the false splits of the text (the positions where the name column has no value).
Therefore, I'd like to append e.g. text[3] to text[2]. So I'll get a new entry 'c d'.
Finally, the output should be:
new = ['a','b','c d','e','f g','hijk','lm','n','op','qrst','u','v','w','x','y','z']
These lists are just a simplified example of my large text list. I don't know in advance how many entries I have to connect together. My algorithm works only when I have to connect an entry n to the entry n-1, but it's also possible that I have to connect entry n to all entries down to n-10, so that I get one large entry.
I hope my question is now more understandable.
Replace empty strings with NaN and forward fill. Then group by the Name column and aggregate.
import pandas as pd
import numpy as np

# Series.replace (not str.replace) turns empty strings into NaN so ffill can work
df.Name = df.Name.replace('', np.nan).ffill()
# sort=False keeps the original row order; ''.join gives "ghijklmno" as in the expected output
out_df = df.groupby('Name', sort=False).agg({'Text': ''.join})
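On the sample data from the question (building the frame from the two lists), this gives the expected grouping:

df = pd.DataFrame({'Text': ["Abc", "def", "ghi", "jkl", "mno", "pqr", "stu"],
                   'Name': ["a", "b", "c", "", "", "f", "g"]})
df.Name = df.Name.replace('', np.nan).ffill()
out_df = df.groupby('Name', sort=False).agg({'Text': ''.join})
print(out_df['Text'].tolist())
# ['Abc', 'def', 'ghijklmno', 'pqr', 'stu']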
by using defaultdict
Name=["a","b","c",None,None,None,"f","g"]
Text=["Abc","def","ghi","jkl","mno","pqr","stu"]
lst=list(zip(Name,Text))
from collections import defaultdict
d=defaultdict(str)
for i, v in lst:
    d[i] += v
print(list(d.values()))
['Abc', 'def', 'ghi', 'jklmnopqr', 'stu']
I have a solution now (the code doesn't look good, but the output is what I expected):
for i in range(0, len(text)):
    if i not in index:
        if i+1 not in index:
            new.append(text[i])
        elif i+1 in index:
            if i+2 not in index:
                new.append(text[i] + text[i+1])
            elif i+2 in index:
                if i+3 not in index:
                    new.append(text[i] + text[i+1] + text[i+2])
                elif i+3 in index:
                    if i+4 not in index:
                        new.append(text[i] + text[i+1] + text[i+2] + text[i+3])
                    elif i+4 in index:
                        if i+5 not in index:
                            new.append(text[i] + text[i+1] + text[i+2] + text[i+3] + text[i+4])
I have to add a few more if conditions... but for the simplified example above, the code works perfectly.
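The growing chain of if conditions can be collapsed into a single while loop that keeps absorbing positions as long as they appear in index, which also handles arbitrarily long runs; a sketch with the same index and text lists (joining without spaces, as in the code above):

new = []
i = 0
while i < len(text):
    j = i + 1
    # extend the group while the next position is marked as a false split
    while j < len(text) and j in index:
        j += 1
    new.append(''.join(text[i:j]))
    i = j
print(new)
# ['a', 'b', 'cd', 'e', 'fg', 'hijk', 'lm', 'n', 'op', 'qrst', 'u', 'v', 'w', 'x', 'y', 'z']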
I have a data frame df1; its "transactions" column holds an array of ints.
id transactions
1 [1,2,3]
2 [2,3]
And a data frame df2; its "items" column holds an array of ints.
items cost
[1,2] 2.0
[2] 1.0
[2,4] 4.0
I need to check whether all elements of items are present in each transaction and, if so, sum up the costs.
Expected Result
id transaction score
1 [1,2,3] 3.0
2 [2,3] 1.0
I did the following
# ---------- cross join ----------
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()],
                         right.values[ib2.ravel()]]))
out = cartesian_product_simplified(df1, df2)
# assign column names
out.columns = ['id', 'transactions', 'cost', 'items']
# convert pandas Series to lists
t = out["transactions"].tolist()
item = out["items"].tolist()
# ---------- check whether one list is present in another ----------
def check(trans, itm):
    out_list = list()
    for row in trans:
        ret = np.all(np.in1d(itm, row))
        out_list.append(ret)
    return out_list
# ---------- if true: group and sum ----------
a = check(t, item)
for i in a:
    if i:
        print(out.groupby(['id', 'transactions']))['cost'].sum()
    else:
        print("no")
This throws TypeError: 'NoneType' object is not subscriptable. I am new to Python and don't know how to put all of this together. How do I group by and sum the cost when all items of one list are in another list?
The simplest way is just to check all items for all transactions:
# df1 and df2 are initialized
def sum_score(transaction):
    score = 0
    for _, row in df2.iterrows():
        if all(item in transaction for item in row["items"]):
            score += row["cost"]
    return score

df1["score"] = df1["transactions"].map(sum_score)
It will be extremely slow at a big scale. If this is a problem, we should iterate not over every item but preselect only the possible ones. If you have enough memory, it can be done like this: for each item we remember all the row numbers in df2 where it appears; then for each transaction we take its items, collect all the possible lines, and check only those.
import collections

# df1 and df2 are initialized
def get_sum_score_precalculated_func(items_cost_df):
    # create a dict of possible indexes to search for an item
    items_search_dict = collections.defaultdict(set)
    for i, (_, row) in enumerate(items_cost_df.iterrows()):
        for item in row["items"]:
            items_search_dict[item].add(i)

    def sum_score(transaction):
        possible_indexes = set()
        for i in transaction:
            possible_indexes |= items_search_dict[i]
        score = 0
        for i in possible_indexes:
            row = items_cost_df.iloc[i]
            if all(item in transaction for item in row["items"]):
                score += row["cost"]
        return score

    return sum_score

df1["score"] = df1["transactions"].map(get_sum_score_precalculated_func(df2))
Here I use:
set, which is an unordered store of unique values (it helps to join possible line numbers and avoid double counting).
collections.defaultdict, which is a usual dict, but when you access an uninitialized key it fills it with the given default (an empty set in my case). It helps to avoid if x not in my_dict: my_dict[x] = set(). I also use a so-called "closure", which means the sum_score function keeps access to items_cost_df and items_search_dict, which were in scope where sum_score was declared, even after get_sum_score_precalculated_func has returned.
That should be much faster if the items are fairly unique and appear in only a few lines of df2.
If you have relatively few unique transactions, each repeated many times, you'd better calculate the score for each unique transaction first and then just join the result.
transaction_score = []
# note: for .unique() and the merge below the transactions must be hashable,
# e.g. tuples rather than lists
for transaction in df1["transactions"].unique():
    score = sum_score(transaction)
    transaction_score.append([transaction, score])
transaction_score = pd.DataFrame(
    transaction_score,
    columns=["transactions", "score"])

df1 = df1.merge(transaction_score, on="transactions", how="left")
Here I use sum_score from the first code example.
P.S. The Python error message should come with a line number, which helps a lot in understanding the problem.
# convert df_1 to dictionary for iteration
df_1_dict = dict(zip(df_1["id"], df_1["transactions"]))
# convert df_2 to list for iteration as there is no unique column
df_2_list = df_2.values.tolist()

# iterate through each combination to find valid ones
new_data = []
for rows in df_2_list:
    items = rows[0]
    costs = rows[1]
    for key, value in df_1_dict.items():
        # find common items in both
        common = set(value).intersection(set(items))
        # keep the pair only if every item appears in the transaction
        if len(common) == len(items):
            new_row = {"id": key, "transactions": value, "costs": costs}
            new_data.append(new_row)

merged_df = pd.DataFrame(new_data)
merged_df = merged_df[["id", "transactions", "costs"]]

# group the data by id to get the total cost for each id
merged_df = (
    merged_df
    .groupby(["id"])
    .agg({"costs": "sum"})
    .reset_index()
)
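If the transactions column is also wanted in the result (as in the expected output above), the aggregated costs can be merged back onto df_1; a small follow-up sketch:

# bring the summed costs back next to id and transactions; unmatched ids get 0
result = df_1.merge(merged_df, on="id", how="left").fillna({"costs": 0})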
I have an empty Pandas dataframe and I'm trying to add a row to it. Here's what I mean:
text_img_count = len(BeautifulSoup(html, "lxml").find_all('img'))
print 'img count: ', text_img_count
keys = ['text_img_count', 'text_vid_count', 'text_link_count', 'text_par_count', 'text_h1_count',
'text_h2_count', 'text_h3_count', 'text_h4_count', 'text_h5_count', 'text_h6_count',
'text_bold_count', 'text_italic_count', 'text_table_count', 'text_word_length', 'text_char_length',
'text_capitals_count', 'text_sentences_count', 'text_middles_count', 'text_rows_count',
'text_nb_digits', 'title_char_length', 'title_word_length', 'title_nb_digits']
values = [text_img_count, text_vid_count, text_link_count, text_par_count, text_h1_count,
text_h2_count, text_h3_count, text_h4_count, text_h5_count, text_h6_count,
text_bold_count, text_italic_count, text_table_count, text_word_length,
text_char_length, text_capitals_count, text_sentences_count, text_middles_count,
text_rows_count, text_nb_digits, title_char_length, title_word_length, title_nb_digits]
numeric_df = pd.DataFrame()
for key, value in zip(keys, values):
    numeric_df[key] = value
print numeric_df.head()
However, the output is this:
img count: 2
Empty DataFrame
Columns: [text_img_count, text_vid_count, text_link_count, text_par_count, text_h1_count, text_h2_count, text_h3_count, text_h4_count, text_h5_count, text_h6_count, text_bold_count, text_italic_count, text_table_count, text_word_length, text_char_length, text_capitals_count, text_sentences_count, text_middles_count, text_rows_count, text_nb_digits, title_char_length, title_word_length, title_nb_digits]
Index: []
[0 rows x 23 columns]
This makes it seem like numeric_df is empty after I just assigned values for each of its columns.
What's going on?
Thanks for the help!
What I usually do to add a column to an empty data frame is to append the information to a list and then give it a data frame structure. For example:
df=pd.DataFrame()
L=['a','b']
df['SomeName']=pd.DataFrame(L)
And you have to use pd.Series() if the list is made of numbers.
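For the original problem, it may be simplest to skip the empty frame entirely: assigning a scalar to a column of an empty frame produces no rows, because the frame's index is empty. A one-row frame can be built directly from the two lists (a sketch using keys and values from the question):

# one row, one column per key
numeric_df = pd.DataFrame([values], columns=keys)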