So here is my code, which updates many column values based on a condition on the split values of the 'location' column. The code works, but because it iterates row by row it's not efficient enough. Can anyone help me make this code run faster, please?
for index, row in df.iterrows():
    print(index)
    location_split = row['location'].split(':')
    after_county = False
    after_province = False
    for l in location_split:
        if l.strip().endswith('ED'):
            df.loc[index, 'electoral_district'] = l
        elif l.strip().startswith('County'):
            df.loc[index, 'county'] = l
            after_county = True
        elif after_province:
            if l.strip() != 'Ireland':
                df.loc[index, 'dublin_postal_district'] = l
        elif after_county:
            df.loc[index, 'province'] = l.strip()
            after_province = True
'map' was what I needed :)
def fill_county(column):
    res = ''
    location_split = column.split(':')
    for l in location_split:
        if l.strip().startswith('County'):
            res = l.strip()
            break
    return res

df['county'] = list(map(fill_county, df['location']))
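In case you are on Python 3, where map returns an iterator, the pandas-native spelling of the same idea is Series.map (or apply) on the column; a minimal equivalent sketch, reusing the fill_county function defined above:

df['county'] = df['location'].map(fill_county)  # same result, no intermediate list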
Hi community, I have a sorted pandas DataFrame that looks like the following:
I want to merge rows that have overlapping values in the start and end columns: if the end value of a row is bigger than the start value of the next row (or of any later row), they should be merged into one row. Examples are rows 3, 4 and 5. The output I would expect is:
To do so, I am trying to implement a recursive function that loops over the dataframe while the condition holds and then returns a number I can use to locate the end row.
However, the function I am trying to implement returns an empty dataframe. Could you please help me see where I should pay attention, or what alternative I can build if recursion is not a solution?
def row_merger(pd_df):
    counter = 0
    new_df = pd.DataFrame(columns=pd_df.columns)
    for i in range(len(pd_df) - 1):
        def recursion_inside(pd_df, counter=0):
            counter = 0
            if pd_df.iloc[i + 1 + counter]["q.start"] <= pd_df.iloc[i]["q.end"]:
                counter = counter + 1
                recursion_inside(pd_df, counter)
            else:
                return counter
        new_row = {"name": pd_df["name"][i],
                   "q.start": pd_df.iloc[i]["q.start"],
                   "q.end": pd_df.iloc[i + counter]["q.start"]}
        new_df.append(new_row, ignore_index=True)
    return new_df
I don't see the benefit of using recursion here, so I would just iterate over the rows instead, building up the rows for the output dataframe one by one, e.g. like this:
def row_merger(df_in):
    if len(df_in) <= 1:
        return df_in
    rows_out = []
    current_row = df_in.iloc[0].values
    for next_row in df_in.iloc[1:].values:
        if next_row[1] > current_row[2]:
            # no overlap: emit the current row and start a new one
            rows_out.append(current_row)
            current_row = next_row
        else:
            # overlap: extend the current row's end
            current_row[2] = max(current_row[2], next_row[2])
    rows_out.append(current_row)
    return pd.DataFrame(rows_out, columns=df_in.columns)
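A quick sanity check with made-up values (hypothetical data, shaped like the question's name/q.start/q.end columns):

import pandas as pd

df = pd.DataFrame({"name": ["a", "a", "a"],
                   "q.start": [1, 4, 10],
                   "q.end": [5, 8, 12]})
print(row_merger(df))
# the first two rows overlap and merge into (1, 8); (10, 12) stays separate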
I have some strings in a column and I want to explode the words out only if they are not within brackets. The column looks like this
pd.DataFrame(data={'a': ['first,string','(second,string)','third,string (another,string,here)']})
and I want the output to look like this
pd.DataFrame(data={'a': ['first','string','(second,string)','third','string','(another,string,here)']})
This sort of works, but I would like to not have to put the row number in each time:
re.split(r',(?![^()]*\))', x['a'][0])
re.split(r',(?![^()]*\))', x['a'][1])
re.split(r',(?![^()]*\))', x['a'][2])
I thought I could do it with a lambda function, but I cannot get it to work. Thanks for checking this out.
x['a'].apply(lambda i: re.split(r',(?![^()]*\))', i))
It is not clear to me if the elements in your DataFrame may have multiple groups between brackets. Given that doubt, I have implemented the following:
import pandas as pd
import re

df = pd.DataFrame(data={'a': ['first,string', '(second,string)', 'third,string (another,string,here)']})
pattern = re.compile(r"([^\(]*)([\(]?.*[\)]?)(.*)", re.IGNORECASE)

def findall(ar, res=None):
    if res is None:
        res = []
    m = pattern.findall(ar)[0]
    if len(m[0]) > 0:
        res.extend(m[0].split(","))
    if len(m[1]) > 0:
        res.append(m[1])
    if len(m[2]) > 0:
        # keep scanning the text after the bracketed group
        return findall(m[2], res=res)
    else:
        return res

res = []
for x in df["a"]:
    res.extend(findall(x))

print(pd.DataFrame(data={"a": res}))
Essentially, you recursively scan the part of the string left after the bracketed group until no text remains. If order were not an issue, the solution would be easier.
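For what it's worth, here is a sketch of that order-insensitive variant (my own illustration, not part of the original answer): collect the bracketed groups with one pattern, then split whatever is left on commas:

import re

def split_ignoring_order(s):
    # grab the (...) groups whole, then comma-split the remainder
    brackets = re.findall(r'\([^)]*\)', s)
    rest = re.sub(r'\([^)]*\)', '', s)
    words = [w.strip() for w in rest.split(',') if w.strip()]
    return words + brackets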
I am trying to group the indices of tuples from a list of tuples into sublists whenever any of their elements is shared, keeping the indices of unique tuples in their own sublists. A tuple is unique when no element of the tuple matches the element in the same position of any other tuple in the list.
Example: a list which groups the same company together, with "same company" defined as same name, same registration number, or same CEO name.
company_list = [("companyA",0002,"ceoX"),
("companyB"),0002,"ceoY"),
("companyC",0003,"ceoX"),
("companyD",004,"ceoZ")]
The desired output would be:
[[0,1,2],[3]]
Does anyone know of a solution for this problem?
The companies form a graph. You want to create clusters from connected companies.
Try this:
company_list = [
    ("companyA", 2, "ceoX"),
    ("companyB", 2, "ceoY"),
    ("companyC", 3, "ceoX"),
    ("companyD", 4, "ceoZ")
]

# Prepare indexes
by_name = {}
by_number = {}
by_ceo = {}
for i, t in enumerate(company_list):
    if t[0] not in by_name:
        by_name[t[0]] = []
    by_name[t[0]].append(i)
    if t[1] not in by_number:
        by_number[t[1]] = []
    by_number[t[1]].append(i)
    if t[2] not in by_ceo:
        by_ceo[t[2]] = []
    by_ceo[t[2]].append(i)

# BFS to propagate group to connected companies
groups = list(range(len(company_list)))
for i in range(len(company_list)):
    g = groups[i]
    queue = [g]
    while queue:
        x = queue.pop(0)
        groups[x] = g
        t = company_list[x]
        for y in by_name[t[0]]:
            if g < groups[y]:
                queue.append(y)
        for y in by_number[t[1]]:
            if g < groups[y]:
                queue.append(y)
        for y in by_ceo[t[2]]:
            if g < groups[y]:
                queue.append(y)

# Assemble result
result = []
current = None
last = None
for i, g in enumerate(groups):
    if g != last:
        if current:
            result.append(current)
        current = []
        last = g
    current.append(i)
if current:
    result.append(current)

print(result)
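For the four sample companies this prints [[0, 1, 2], [3]], matching the desired output in the question.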
Fafl's answer is definitely more performant. If you're not worried about performance, here is a brute-force solution that might be easier to read. I tried to make it clear with some comments.
def find_index(res, target_index):
    for index, sublist in enumerate(res):
        if target_index in sublist:
            # yes, it's present
            return index
    return None  # not present

def main():
    company_list = [
        ('companyA', '0002', 'CEOX'),
        ('companyB', '0002', 'CEOY'),
        ('companyC', '0003', 'CEOX'),
        ('companyD', '0004', 'CEOZ'),
        ('companyE', '0004', 'CEOM'),
    ]
    res = []
    for index, company_detail in enumerate(company_list):
        # check if this `index` is already present in a sublist in `res`
        # if the `index` is already present in a sublist in `res`, then we need to add to that sublist
        # otherwise we will start a new sublist in `res`
        index_to_add_to = None
        if find_index(res, index) is None:
            # does not exist
            res.append([index])
            index_to_add_to = len(res) - 1
        else:
            # exists
            index_to_add_to = find_index(res, index)
        for c_index, c_company_detail in enumerate(company_list):
            # inner loop to compare company details with the outer loop
            if c_index == index:
                # same, ignore
                continue
            if company_detail[0] == c_company_detail[0] or company_detail[1] == c_company_detail[1] or company_detail[2] == c_company_detail[2]:
                # something matches, so append
                res[index_to_add_to].append(c_index)
        res[index_to_add_to] = list(set(res[index_to_add_to]))  # make it unique
    print(res)

if __name__ == '__main__':
    main()
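For the five sample companies this prints [[0, 1, 2], [3, 4]]: A, B and C are linked through the registration number and the CEO, while D and E share a registration number.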
Check this out, I tried a lot to get it right. I may be missing some test cases, but performance-wise I think it's good.
I have used set() and popped the tuples which lie in one group.
company_list = [
    ("companyA", 2, "ceoX"),
    ("companyB", 2, "ceoY"),
    ("companyC", 3, "ceoX"),
    ("companyD", 4, "ceoZ"),
    ("companyD", 3, "ceoW")
]

index = {val: key for key, val in enumerate(company_list)}
res = []
while len(company_list):
    new_idx = 0
    temp = []
    val = company_list.pop(new_idx)
    temp.append(index[val])
    while new_idx < len(company_list):
        # fewer than 6 distinct elements means the two 3-tuples share a value
        if len(set(val + company_list[new_idx])) < 6:
            temp.append(index[company_list.pop(new_idx)])
        else:
            new_idx += 1
    res.append(temp)

print(res)
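For this five-company list it prints [[0, 1, 2], [3, 4]]. Note that the last companyD also shares registration number 3 with companyC, so transitive links like that are among the test cases this approach can miss.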
I am trying to create some lag features by subtracting a month from each date in my datetime column and then assigning a column value from the past date to the current one.
This is my code:
for row_index in range(0, len(merger)):
    date = merger.loc[merger.index[row_index], 'datetime']
    prev = subtract_one_month(date)
    inde = merger.loc[merger['datetime'] == str(prev), 'count'].index.values.astype(int)
    if inde == []:
        continue
    else:
        inde = inde[0]
        merger.loc[merger.index[row_index], 'count_lag_month'] = merger.loc[merger.index[inde], 'count']
The inner if/else block is meant to deal with cases where the date I'm looking for doesn't exist.
The code above simply gives me a list of NaNs. I would appreciate any help.
I've changed my code to the following:
first = []
mean = []
wrkday = []
count = []
for row_index in range(0, len(merger)):
    print(row_index)
    date = merger.loc[merger.index[row_index], 'datetime']
    prev = subtract_one_month(date)
    inde = merger.loc[merger['datetime'] == str(prev)].index.values.astype(int)
    if inde.size == 0:
        first.append(0)
        mean.append(0)
        wrkday.append(0)
        continue
    else:
        inde = inde[0]
        first.append(merger.loc[merger.index[inde], 'count'])
        mean.append(merger.loc[merger.index[inde], 'monthly_mean_count'])
        wrkday.append(merger.loc[merger.index[inde], 'monthly_wrkday_mean_count'])
    prev_day = subtract_one_day(date)
    inde = merger.loc[merger['datetime'] == str(prev_day)].index.values.astype(int)
    if inde.size == 0:
        count.append(0)
        continue
    else:
        inde = inde[0]
        count.append(merger.loc[merger.index[inde], 'count'])

merger['count_lag_month'] = first
merger['monthly_mean_count_lag_month'] = mean
merger['monthly_wrkday_mean_count_lag_month'] = wrkday
merger['count_lag_day'] = count
It uses lists instead and it seems to run at a decent speed. I'm not sure if it's the best approach though.
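If you want to avoid the Python-level loop entirely, one alternative is a vectorised lookup: index the frame by datetime and map the shifted dates onto it. A minimal sketch for the month lag only, assuming the datetime values are unique, the types are comparable, and reusing your subtract_one_month helper:

lookup = merger.set_index('datetime')['count']           # count keyed by date
prev_month = merger['datetime'].map(subtract_one_month)  # shifted key per row
merger['count_lag_month'] = prev_month.map(lookup).fillna(0)

The other three lag columns would follow the same pattern with their respective source columns.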
Q6
4;99
3;4;8;9;14;18
2;3;8;12;18
2;3;11;18
2;3;8;18
2;3;4;5;6;7;8;9;11;12;15;16;17;18
2;3;4;8;9;10;11;13;18
1;3;4;5;6;7;13;16;17
2;3;4;5;6;7;8;9;11;12;14;15;18
3;11;18
2;3;5;8;9;11;12;13;15;16;17;18
2;5;11;18
1;2;3;4;5;8;9;11;17;18
3;7;8;11;13;14
2;3;8;18
2;13
2;3;5;8;9;11;12;13;18
2;3;4;9;11;12;18
2;3;5;9;11;18
1;2;3;4;5;6;7;8;9;11;14;15;16;17;18
2;3;8;11;13;18
import pandas as pd

df_1 = pd.read_csv('amazon_final 29082018.csv')
list_6 = list(df_1["Q6"])
list_6 = list(map(str, list_6))
list_7 = list(zip(list_6))
tem_list = []
for x in list_6:
    if '3' in x[0]:
        tem_list.append('Fire')
    else:
        tem_list.append(None)
df_1.to_csv('final.csv', index=False)
I have many such columns in my data.
I want to extract the value '3' from this, but the code I wrote gives me 3 along with 13, 23, 33 and so on. I only want the count of rows having the value 3.
You need to break up the rows and convert each value to an integer. At the moment you are looking for the presence of the string "3" which is why strings like "2;13" pass the test. Try something like this:
list_6 = ["4;99", "3;4;8;9;14;18", "2;3;8;12;18", "2;3;11;18", "2;3;8;18",
"2;3;4;5;6;7;8;9;11;12;15;16;17;18", "2;3;4;8;9;10;11;13;18",
"1;3;4;5;6;7;13;16;17", "2;3;4;5;6;7;8;9;11;12;14;15;18", "3;11;18",
"2;3;5;8;9;11;12;13;15;16;17;18", "2;5;11;18", "1;2;3;4;5;8;9;11;17;18",
"3;7;8;11;13;14", "2;3;8;18", "2;13", "2;3;5;8;9;11;12;13;18",
"2;3;4;9;11;12;18", "2;3;5;9;11;18",
"1;2;3;4;5;6;7;8;9;11;14;15;16;17;18", "2;3;8;11;13;18"]
temp_list = []
for x in list_6:
    numbers = [int(num_string) for num_string in x.split(';')]
    if 3 in numbers:
        temp_list.append('Fire')
    else:
        temp_list.append('None')
print(temp_list)
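If all you actually need is the count of rows containing the value 3, you can get it directly from temp_list, or in one pass by comparing whole tokens:

print(temp_list.count('Fire'))                   # count via the flag list
print(sum('3' in x.split(';') for x in list_6))  # direct token comparison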