I'm looking for the difference between a column in a DataFrame and the data in a list.
I'm doing it this way:
# pickled_data => list of dicts
pickled_names = [d['company'] for d in pickled_data] # get values from dictionary to list
diff = df[~df['company_name'].isin(pickled_names)]
This works fine, but I realized that I need to check not only company_name but also place, because there could be two companies with the same name.
df also contains a place column, and each dictionary in pickled_data has a place key.
I would like to be able to do something like this:
pickled_data = [(d['company'], d['place']) for d in pickled_data]
diff = df[~df['company_name', 'place'].isin(pickled_data)] # For two values in same row
You can convert the list of tuples to a MultiIndex with MultiIndex.from_tuples, then turn both columns into a MultiIndex as well and compare the two with isin:
pickled_data = [(d['company'], d['place']) for d in pickled_data]
mux = pd.MultiIndex.from_tuples(pickled_data)
diff = df[~df.set_index(['company_name', 'place']).index.isin(mux)]
Sample:
data = {'company_name':['A1','A2','A2','A1','A1','A3'],
'place':list('sdasas')}
df = pd.DataFrame(data)
pickled_data = [('A1','s'),('A2','d')]
mux = pd.MultiIndex.from_tuples(pickled_data)
diff = df[~df.set_index(['company_name', 'place']).index.isin(mux)]
print (diff)
  company_name place
2           A2     a
4           A1     a
5           A3     s
You can form a set of tuples from your pickled_data for faster lookup. Then a list comprehension over the company_name and place columns of the frame gives a boolean list saying whether each (company, place) pair is missing from that set. We use this list to index into the frame:
comps_and_places = set((d["company"], d["place"]) for d in pickled_data)
not_in_list = [(c, p) not in comps_and_places
for c, p in zip(df.company_name, df.place)]
diff = df[not_in_list]
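For reference, here is a minimal runnable sketch of the set-of-tuples approach above. The sample frame is the one from the MultiIndex answer; the name `known` and the two sample dictionaries are only assumptions for illustration.

```python
import pandas as pd

# Sample frame taken from the MultiIndex answer above
df = pd.DataFrame({
    "company_name": ["A1", "A2", "A2", "A1", "A1", "A3"],
    "place": list("sdasas"),
})

# Assumed shape of pickled_data: a list of dicts with 'company' and 'place' keys
pickled_data = [
    {"company": "A1", "place": "s"},
    {"company": "A2", "place": "d"},
]

# A set gives constant-time membership tests for each (company, place) pair
known = {(d["company"], d["place"]) for d in pickled_data}

# True where the row's pair is NOT present in pickled_data
mask = [(c, p) not in known for c, p in zip(df["company_name"], df["place"])]
diff = df[mask]
print(diff)
```

On this sample it should keep the same rows (index 2, 4 and 5) as the MultiIndex version.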
I have been able to get the calculation to work, but now I am having trouble appending the results back into the DataFrame e3. You can see from the picture that the values are printing out.
brand_list = list(e3["Brand Name"])
product_segment_list = list(e3['Product Segment'])
# Create a list of tuples: data
data = list(zip(brand_list, product_segment_list))
for i in data:
    step1 = e3.loc[(e3['Brand Name']==i[0]) & (e3['Product Segment']==i[1])]
    Delta_Price = (step1['Price'].diff(1).div(step1['Price'].shift(1),axis=0).mul(100.0))
    print(Delta_Price)
It's easier to use groupby. In each iteration, r is the group of rows from the e3 DataFrame for one (Brand Name, Product Segment) combination and i is the group key.
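new_df = []
for i, r in e3.groupby(['Brand Name', 'Product Segment']):
    r = r.copy()  # work on a copy so the assignment below does not trigger SettingWithCopyWarning
    price_num = r["Price"].diff(1).values
    price_den = r["Price"].shift(1).values
    r['Price Delta'] = price_num / price_den
    new_df.append(r)
# The groups hold different rows of the same columns, so stack them row-wise (axis=0)
e3_ = pd.concat(new_df, axis=0)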
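If the goal is only to add the percentage change as a new column, a sketch without the explicit loop could use groupby with transform, which returns a result aligned to the original index. The sample data below is made up for illustration, and the multiplication by 100 follows the question's formula rather than the loop above.

```python
import pandas as pd

# Hypothetical sample data with the question's column names
e3 = pd.DataFrame({
    "Brand Name": ["A", "A", "B", "B", "A"],
    "Product Segment": ["x", "x", "y", "y", "x"],
    "Price": [10.0, 12.0, 20.0, 15.0, 9.0],
})

grouped_price = e3.groupby(["Brand Name", "Product Segment"])["Price"]

# Percentage change from the previous row within each group;
# the first row of every group has no predecessor and becomes NaN
e3["Price Delta"] = grouped_price.transform(
    lambda s: s.diff(1).div(s.shift(1)).mul(100.0)
)
print(e3)
```

Because transform keeps the original row order, no concat step is needed afterwards.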
Is there a way to use a list comprehension to create a list of tuples with two different conditions?
I am iterating through a pandas DataFrame and I want to return an entire row as a tuple if it matches either condition. The first is if the row has NaN values in any column.
The other is if the column called ODFS_FILE_CREATE_DATETIME doesn't match the regex pattern for the date. The date column is supposed to hold values that look like this: 2005242132, i.e. exactly 10 digits. So if the df contains something like 2004dg, it should be picked up as an error and the row should be added to my list of tuples.
My sad pathetic attempt:
[tuple(x) for x in odfscsv_df[odfscsv_df.isna().any(1)].values or x in odfscdate_re.search(str(odfscsv_df['ODFS_FILE_CREATE_DATETIME'])) ]
Full function that contains the two separate lists of tuples:
def process_csv_formatting(csv):
    odfscsv_df = pd.read_csv(csv, header=None,names=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE'])
    odfscsv_df['CSV_FILENAME'] = csv.name
    odfscdate_re = re.compile(r"\d{10}")
    #print(odfscsv_df)
    #odfscsv_df = odfscsv_df.replace('', np.nan)
    errortup = [(odfsname, "Bad_ODFS_FILE_CREATE_DATETIME= " + str(cdatetime), csv.name) for odfsname,cdatetime in zip(odfscsv_df['ODFS_LOG_FILENAME'], odfscsv_df['ODFS_FILE_CREATE_DATETIME']) if not odfscdate_re.search(str(cdatetime))]
    emptypdf = pd.DataFrame(columns=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE'])
    print([tuple(x) for x in odfscsv_df[odfscsv_df.isna().any(1)].values])
    [tuple(x) for x in odfscsv_df[odfscsv_df.isna().any(1)].values or x in odfscdate_re.search(str(odfscsv_df['ODFS_FILE_CREATE_DATETIME'])) ]
    #print(odfscsv_df[(odfscsv_df[column_name].notnull()) & (odfscsv_df[column_name] != u'')].index)
    for index, row in odfscsv_df.iterrows():
        #print((row['WAFER_SCRIBE']))
        print((row['ODFS_FILE_CREATE_DATETIME']))
    #errortup = [x for x in odfscsv_df['ODFS_FILE_CREATE_DATETIME']]
    if len(errortup) != 0:
        #print(errortup) #put this in log file statement somehow
        #print(errortup[0][2])
        return emptypdf
    else:
        return odfscsv_df
Sample CSV data. The commas delineate the cells:
2005091432_943SK1J.00J.SK1J-23.FPD.FMGN520.Jx6D36ny5EO53qAtX4.log,,W943SK10,MGN520,0Z0RK072TCD2
2005230137_014SF1J.00J.SF1J-23.WCPC.FMGN520.XlwHcgyP5eFCpZm5cf.log,,W014SF10,MGN520,DM4MU129SEC1
2005240909_001914J.E0J.914J-15.WRO3PC.FMGN520.nZKn7OvjGKw1i4pxiu.log,,K001914E,MGN520,DM3FZ226SEE3
2005242132_001914J.E0J.914J-15.WRO4PC.FMGN520.V8dcLhEgygRj2rP2Df.log,2005242132,K001914E,MGN520,DM3FZ226SEE3
2005251037_001914J.E0J.914J-15.WRO4PC.FMGN520.dyixmQ5r4SvbDFkivY.log,2005251037,K001914E,MGN520,DM3FZ226SEE3
2005251215_949949J.E0J.949J-21.WRO2PP.FMGN520.yp1i4e7a7D1ighkdB7.log,2005251215,K949949E,MGN520,DG2KV122SEF6
2005251231_949949J.E0J.949J-25.WRO2PP.FMGN520.oLQGhc2whAlhC3dSuR.log,2005251231,K949949E,MGN520,DG2KV333SEF3
2005260105_001914J.E0J.914J-15.WRO4PC.FMGN520.wOQMUOfZgkQK9iHJS5.log,2005260105,K001914E,MGN520,DM3FZ226SEE3
2006111130_950909J.00J.909J-22.FPC.FMGN520.UuqeGtw9xP6lLDUW9N.log,2006111130,K9509090,MGN520,DG7LW031SEE7
2006111612_950909J.00J.909J-22.FPC.FMGN520.hoDl3QSNPKhcs4oA2N.log,2006111612,K9509090,MGN520,DG7LW031SEE7
2006120638_006914J.E0J.914J-15.CZPC.FMGN520.qCgFUH2H21ieT641i9.log,2006120638,K006914E,MGN520,DM8KJ568SEC3
2006122226_006914J.E0J.914J-15.CZPC.FMGN520.nSHSp7klxjrQlVTcCu.log,2006122226,K006914E,MGN520,DM8KJ568SEC3
2006130919_006914J.E0J.914J-15.CZPC.FMGN520.Zd6DrMUsCjuEVBFwvn.log,2006130919,K006914E,MGN520,DM8KJ568SEC3
2006140457_007911J.E0J.911J-25.RDR2PC.FMGN520.QPX9r59TnXObXyfibv.log,2006140457,K007911E,MGN520,DN4AU351SED1
2006141722_007911J.E0J.911J-25.WCPC.FMGN520.dNQLkvQlPTplEjJspB.log,2006141722,K007911E,MGN520,DN4AU351SED1
2006160332_007911J.E0J.911J-25.WCPC.FMGN520.DQiH82Ze9fCoaLVbDE.log,2006160332,K007911E,MGN520,DN4AU351SED1
2006170539_007911J.E0J.911J-25.WCPC.FMGN520.TjakhXkmhmlGhfLheo.log,2006170539,K007911E,MGN520,DN4AU351SED1
Add the dtype parameter to import 'ODFS_FILE_CREATE_DATETIME' as a string when you call read_csv, then build one boolean mask per condition and combine them:
odfscsv_df = pd.read_csv(csv, header=None,
names=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE'],
dtype={'ODFS_FILE_CREATE_DATETIME': str})
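m1 = odfscsv_df.isna().any(axis=1)  # rows with NaN in any column (axis must be a keyword in pandas 2.x)
s = odfscsv_df['ODFS_FILE_CREATE_DATETIME']
m2 = ~s.astype(str).str.isnumeric()  # value contains something other than digits
m3 = s.astype(str).str.len().ne(10)  # value is not exactly 10 characters long
[tuple(x) for x in odfscsv_df[m1 | m2 | m3].values]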
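As a variation on the masks above, the digit check and the length check could be folded into one regex test with Series.str.fullmatch (available since pandas 1.1). This is only a sketch: the three CSV rows are shortened, made-up stand-ins for the sample data, and the names `has_nan`, `bad_date` and `bad_rows` are assumptions.

```python
import io
import pandas as pd

# Shortened, hypothetical stand-in for the sample CSV
sample = io.StringIO(
    "a.log,,W943SK10,MGN520,0Z0RK072TCD2\n"
    "b.log,2005242132,K001914E,MGN520,DM3FZ226SEE3\n"
    "c.log,2004dg,K001914E,MGN520,DM3FZ226SEE3\n"
)
cols = ['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE']
df = pd.read_csv(sample, header=None, names=cols,
                 dtype={'ODFS_FILE_CREATE_DATETIME': str})

# Condition 1: any NaN in the row
has_nan = df.isna().any(axis=1)

# Condition 2: the date is not exactly ten digits; na=False treats missing values as non-matching
bad_date = ~df['ODFS_FILE_CREATE_DATETIME'].str.fullmatch(r"\d{10}", na=False)

bad_rows = [tuple(x) for x in df[has_nan | bad_date].values]
print(bad_rows)
```

Unlike re.search with \d{10}, fullmatch also rejects values that merely contain ten digits somewhere, such as a longer number.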
I have a large DataFrame of URLs and a smaller second DataFrame that contains columns of strings, which I want to use to merge the two DataFrames together. Data from the second df will be used to populate the larger first df.
The matching strings can contain * wildcards (more than one), but the order of the parts still matters; so "path/*path2" would match "exsample.com/eg_path/extrapath2.html" but not "exsample.com/eg_path2/path/test.html". How can I use the strings in the second DataFrame to merge the two DataFrames together? There can be more than one matching string in the second DataFrame.
import pandas as pd
urls = {'url':['https://stackoverflow.com/questions/56318782/','https://www.google.com/','https://en.wikipedia.org/wiki/Python_(programming_language)','https://stackoverflow.com/questions/'],
'hits':[1000,500,300,7]}
metadata = {'group':['group1','group2'],
'matching_string_1':['google','wikipedia*Python_'],
'matching_string_2':['stackoverflow*questions*56318782','']}
result = {'url':['https://stackoverflow.com/questions/56318782/','https://www.google.com/','https://en.wikipedia.org/wiki/Python_(programming_language)','https://stackoverflow.com/questions/'],
'hits':[1000,500,300,7],
'group':['group2','group1','group1','']}
df1 = pd.DataFrame(urls)
df2 = pd.DataFrame(metadata)
what_I_am_after = pd.DataFrame(result)
Not very robust but gives the correct answer for my example.
import pandas as pd
urls = {'url':['https://stackoverflow.com/questions/56318782/','https://www.google.com/','https://en.wikipedia.org/wiki/Python_(programming_language)','https://stackoverflow.com/questions/'],
'hits':[1000,500,300,7]}
metadata = {'group':['group1','group2'],
'matching_string_1':['google','wikipedia*Python_'],
'matching_string_2':['stackoverflow*questions*56318782','']}
result = {'url':['https://stackoverflow.com/questions/56318782/','https://www.google.com/','https://en.wikipedia.org/wiki/Python_(programming_language)','https://stackoverflow.com/questions/'],
'hits':[1000,500,300,7],
'group':['group2','group1','group1','']}
df1 = pd.DataFrame(urls)
df2 = pd.DataFrame(metadata)
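matches = []
for index, row in df2.iterrows():
    for pattern in row.iloc[1:]:
        parts = pattern.split('*')
        # Turn each '*' wildcard into the regex '.*'
        rx = "".join([part + ".*" if len(part) > 0 else '' for part in parts])
        if rx == "":
            continue
        mask = df1['url'].str.contains(rx, na=False, regex=True)
        if mask.any():
            # assign() returns a copy, which avoids SettingWithCopyWarning
            matches.append(df1[mask].assign(group=row['group']))
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concat once
results = pd.concat(matches) if matches else pd.DataFrame(columns=['url','hits','group'])
d3 = df1.merge(results,how='outer',on=['url','hits'])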
I have a script which looks at the row and column headers belonging to a group (REG_ID) and sums the values. The code runs on a matrix (small subset) as follows:
Outputs
My code works well for calculating the sum for all the IDs based on the rows and columns belonging to each internal group (REG_ID). For example, all row and column IDs which belong to REG_ID 1 are summed, so the total flow between region 1 and region 1 (internal flow) is calculated, and so on for each region.
I wish to extend this code by summing the flows between regions, for example from region 1 to regions 2, 3, 4, 5 and so on.
I figure that I need to include another loop within the existing while loop, but would really appreciate some help figuring out where it should go and how to construct it.
My code, which currently computes the internal flow sums (1-1, 2-2, 3-3 etc.), is as follows:
global index
index = 1
x = index
while index < len(idgroups):
    ward_list = idgroups[index] #select list of ward ids for each region from list of lists
    df6 = mergedcsv.loc[ward_list] #select rows with values in the list
    dfcols = mergedcsv.loc[ward_list, :] #select columns with values in list
    ward_liststr = map(str, ward_list) #convert ward_list to strings so that they can be used to select columns, won't work as integers.
    ward_listint = map(int, ward_list)
    #dfrowscols = mergedcsv.loc[ward_list, ward_listint]
    df7 = df6.loc[:, ward_liststr]
    print df7
    regflowsum = df7.values.sum() #sum all values in dataframe
    intflow = [regflowsum]
    print intflow
    dfintflow = pd.DataFrame(intflow)
    dfintflow.reset_index(level=0, inplace=True)
    dfintflow.columns = ["RegID", "regflowsum"]
    dfflows.set_value(index, 'RegID', index)
    dfflows.set_value(index, 'RegID2', index)
    dfflows.set_value(index, 'regflow', regflowsum)
    mergedcsv.set_value(ward_list, 'TotRegFlows', regflowsum)
    index += 1 #increment index number
print dfflows
new_df = pd.merge(pairlist, dfflows, how='left', left_on=['origID','destID'], right_on = ['RegID', 'RegID2'])
print new_df #useful for checking dataframe merges
regionflows = r"C:\Temp\AllNI\regionflows.csv"
header = ["WardID","LABEL","REG_ID","Total","TotRegFlows"]
mergedcsv.to_csv(regionflows, columns = header, index=False)
regregflows = r"C:\Temp\AllNI\reg_regflows.csv"
headerreg = ["REG_ID_ORIG", "REG_ID_DEST", "FLOW"]
pairlistCSV = r"C:\Temp\AllNI\pairlist_regions.csv"
new_df.to_csv(pairlistCSV)
The output is as follows:
idgroups dataframe: (see image 1 - second part of image 1)
df7 and intflows for each region Reg_ID:(third part of image 1 - on the right)
dfflows dataframe:(fourth part of image 2)
and the final output is new_df:(fifth part of image 2)
I wish to populate the sums for all possible combinations of flows between the regions not just internal.
I figure I need to add another loop into the while loop. So possibly add an enumerate function like:
while index < len(idgroups):
    #add line(s) to calculate flows between regions
    for index, item in enumerate(idgroups):
        ward_list = idgroups[index]
        print ward_list
        df6 = mergedcsv.loc[ward_list] #select rows with values in the list
        dfcols = mergedcsv.loc[ward_list, :] #select columns with values in list
        ward_liststr = map(str, ward_list) #convert ward_list to strings so that they can be used to select columns, won't work as integers.
        ward_listint = map(int, ward_list)
        #dfrowscols = mergedcsv.loc[ward_list, ward_listint]
        df7 = df6.loc[:, ward_liststr]
        print df7
        regflowsum = df7.values.sum() #sum all values in dataframe
        intflow = [regflowsum]
        print intflow
        dfintflow = pd.DataFrame(intflow)
        dfintflow.reset_index(level=0, inplace=True)
        dfintflow.columns = ["RegID", "regflowsum"]
        dfflows.set_value(index, 'RegID', index)
        dfflows.set_value(index, 'RegID2', index)
        dfflows.set_value(index, 'regflow', regflowsum)
        mergedcsv.set_value(ward_list, 'TotRegFlows', regflowsum)
    index += 1 #increment index number
I'm unsure how to integrate the item, so I'm struggling to extend the code to all combinations. Any advice appreciated.
Update based on flows function:
w=pysal.rook_from_shapefile("C:/Temp/AllNI/NIW01_sort.shp",idVariable='LABEL')
Simil = pysal.open("C:/Temp/AllNI/simNI.csv")
Similarity = np.array(Simil)
db = pysal.open(r'C:\Temp\SQLite\MatrixCSV2.csv', 'r')
dbf = pysal.open(r'C:\Temp\AllNI\NIW01_sortC.dbf', 'r')
ids = np.array((dbf.by_col['LABEL']))
commuters = np.array((dbf.by_col['Total'],dbf.by_col['IDNO']))
commutersint = commuters.astype(int)
comm = commutersint[0]
floor = int(MIN_COM_CT + 100)
solution = pysal.region.Maxp(w=w,z=Similarity,floor=floor,floor_variable=comm)
regions = solution.regions
#print regions
writecsv = r"C:\Temp\AllNI\reg_output.csv"
csv = open(writecsv,'w')
csv.write('"LABEL","REG_ID"\n')
for i in range(len(regions)):
    for lines in regions[i]:
        csv.write('"' + lines + '","' + str(i+1) + '"\n')
csv.close()
flows = r"C:\Temp\SQLite\MatrixCSV2.csv"
regs = r"C:\Temp\AllNI\reg_output.csv"
wardflows = pd.read_csv(flows)
regoutput = pd.read_csv(regs)
merged = pd.merge(wardflows, regoutput)
#duplicate REG_ID column as the index to be used later
merged['REG_ID2'] = merged['REG_ID']
merged.to_csv(r"C:\Temp\AllNI\merged.csv", index=False)
mergedcsv = pd.read_csv(r"C:\Temp\AllNI\merged.csv",index_col='WardID_1') #index this dataframe using the WardID_1 column
flabelList = pd.read_csv(r"C:\Temp\AllNI\merged.csv", usecols = ["WardID", "REG_ID"]) #create list of all FLabel values
reg_id = "REG_ID"
ward_flows = "RegIntFlows"
flds = [reg_id, ward_flows] #create list of fields to be use in search
dict_ref = {} # create a dictionary with for each REG_ID a list of corresponding FLABEL fields
#group the dataframe by the REG_ID column
idgroups = flabelList.groupby('REG_ID')['WardID'].apply(lambda x: x.tolist())
print idgroups
idgrp_df = pd.DataFrame(idgroups)
csvcols = mergedcsv.columns
#create a list of column names to pass as an index to select columns
columnlist = list(mergedcsv.columns.values)
mergedcsvgroup = mergedcsv.groupby('REG_ID').sum()
mergedcsvgroup.describe()
idList = idgroups[2]
df4 = pd.DataFrame()
df5 = pd.DataFrame()
col_ids = idList #ward id no
regiddf = idgroups.index.get_values()
print regiddf
#total number of region ids
#print regiddf
#create pairlist combinations from region ids
#combinations with replacement allows for repeated items
#pairs = list(itertools.combinations_with_replacement(regiddf, 2))
pairs = list(itertools.product(regiddf, repeat=2))
#print len(pairs)
#create a new dataframe with pairlists and summed data
pairlist = pd.DataFrame(pairs,columns=['origID','destID'])
print pairlist.tail()
header_pairlist = ["origID","destID","flow"]
header_intflow = ["RegID", "RegID2", "regflow"]
dfflows = pd.DataFrame(columns=header_intflow)
print mergedcsv.index
print mergedcsv.dtypes
#mergedcsv = mergedcsv.select_dtypes(include=['int64'])
#print mergedcsv.columns
#mergedcsv.rename(columns = lambda x: int(x), inplace=True)
def flows():
    pass
#def flows(mergedcsv, region_a, region_b):
def flows(mergedcsv, ward_lista, ward_listb):
    """Return the sum of all the cells in the row/column intersections
    of ward_lista and ward_listb."""
    mergedcsv = mergedcsv.loc[:, mergedcsv.dtypes == 'int64']
    regionflows = mergedcsv.loc[ward_lista, ward_listb]
    regionflowsum = regionflows.values.sum()
    #grid = [ax, bx, regflowsuma, regflowsumb]
    gridoutput = [ax, bx, regionflowsum]
    print gridoutput
    return regflowsuma
    return regflowsumb
#print mergedcsv.index
#mergedcsv.columns = mergedcsv.columns.str.strip()
for ax, group_a in enumerate(idgroups):
    ward_lista = map(int, group_a)
    print ward_lista
    for bx, group_b in enumerate(idgroups[ax:], start=ax):
        ward_listb = map(int, group_b)
        #print ward_listb
        flow_ab = flows(mergedcsv, ward_lista, ward_listb)
        #flow_ab = flows(mergedcsv, group_a, group_b)
This results in KeyError: 'None of [[189, 197, 198, 201]] are in the [columns]'
I have also tried ward_lista = map(str, group_a) and map(int, group_a), but the list items are still not found by DataFrame.loc.
The columns have mixed datatypes, but all the columns containing the labels which should be sliced are of type int64.
I have tried many solutions around the datatypes, but to no avail. Any suggestions?
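One possible cause, not confirmed anywhere in this thread: the dtype of a column's values and the type of its label are two separate things. When a frame is loaded with read_csv, the header labels are strings even if they look like numbers, while an index built with index_col from a numeric column holds integers. The tiny CSV below is a made-up stand-in for the merged file, used only to show the mismatch.

```python
import io
import pandas as pd

# Hypothetical miniature of the flow matrix: ward IDs as row index and as column headers
csv_text = "WardID_1,189,197\n189,1,2\n197,3,4\n"
df = pd.read_csv(io.StringIO(csv_text), index_col="WardID_1")

wards = [189, 197]

# Row labels are integers, column labels are strings, so integer column labels are not found
try:
    df.loc[wards, wards]
    int_lookup_failed = False
except KeyError:
    int_lookup_failed = True

# Integers for the rows, strings for the columns
total = df.loc[wards, [str(w) for w in wards]].values.sum()
print(int_lookup_failed, total)
```

If that is what is happening here, the row selector and the column selector would need different types in the call to .loc.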
I can't speak to the computations you're doing, but it seems like you just want to arrange combinations of groups. The question is whether they are directed or undirected; that is, do you need to compute flows(A,B) and flows(B,A), or just one?
If just one, you could do this:
for i,ward_list in enumerate(idgroups):
    for j,ward_list2 in enumerate(idgroups[i:],start=i):
        pass  # compute the flow between ward_list and ward_list2 here
This would iterate over i,j pairs like:
0,0 0,1 0,2 ... 0,n
    1,1 1,2 ... 1,n
        2,2 ... 2,n
which would serve in the undirected case.
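To make the iteration order concrete, here is a small sketch with a plain list standing in for idgroups (the region names are placeholders). It also shows that the same index pairs come out of itertools.combinations_with_replacement.

```python
from itertools import combinations_with_replacement

# Placeholder groups; in the question these would be the ward lists per region
groups = ["region1", "region2", "region3"]

# Nested enumerate: the inner loop starts at the outer position, so each
# unordered pair (including i == j) is visited exactly once
undirected = [
    (i, j)
    for i, _ in enumerate(groups)
    for j, _ in enumerate(groups[i:], start=i)
]

# The standard library produces the same index pairs
same_pairs = list(combinations_with_replacement(range(len(groups)), 2))
print(undirected)
```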
If you need to compute both flows(A,B) and flows(B,A), then you can simply push your code into a function called flows, and call it with reversed args, as shown. ;-)
Update
Let's define a function called flows:
def flows():
    pass
Now, what are the parameters?
Well, looking at your code, it gets data from a DataFrame. And you want two different wards, so let's start with those. The result seems to be a sum of the resulting grid.
def flows(df, ward_a, ward_b):
    """Return the sum of all the cells in the row/column intersections
    of ward_a and ward_b."""
    return 0
Now I'm going to copy lines of your code:
ward_list = idgroups[index]
print ward_list
df6 = mergedcsv.loc[ward_list] #select rows with values in the list
dfcols = mergedcsv.loc[ward_list, :] #select columns with values in list
ward_liststr = map(str, ward_list) #convert ward_list to strings so that they can be used to select columns, won't work as integers.
ward_listint = map(int, ward_list)
#dfrowscols = mergedcsv.loc[ward_list, ward_listint]
df7 = df6.loc[:, ward_liststr]
print df7
regflowsum = df7.values.sum() #sum all values in dataframe
intflow = [regflowsum]
print intflow
I think this is most of the flow function right here. Let's look.
The ward_list will obviously be either the ward_a or ward_b parameters.
I'm not sure what df6 is for, because you more or less recompute it in df7. So that needs to be clarified.
regflowsum is our desired output, I think.
Rewriting this into the function:
def flows(df, ward_a, ward_b):
    """Return the sum of all the cells in the row/column intersections
    of ward_a and ward_b."""
    print "Computing flows from:"
    print " ", ward_a
    print ""
    print "flows into:"
    print " ", ward_b
    # Filter rows by ward_a, cols by ward_b:
    grid = df.loc[ward_a, ward_b]
    print "Grid:"
    print grid
    flowsum = grid.values.sum()
    print "Flows:", flowsum
    return flowsum
Now, I have assumed that the ward_a and ward_b values are already in the correct format. So we'll have to str-ify them or whatever outside the function. Let's do that:
for ax, group_a in enumerate(idgroups):
    ward_a = map(str, group_a)
    for bx, group_b in enumerate(idgroups[ax:], start=ax):
        ward_b = map(str, group_b)
        flow_ab = flows(mergedcsv, ward_a, ward_b)
        if ax != bx:
            flow_ba = flows(mergedcsv, ward_b, ward_a)
        else:
            flow_ba = flow_ab
        # Now what?
At this point, you have two numbers. They will be equal when the wards are the same (internal flow?). At this point your original code stops being helpful because it only deals with internal flows, and not A->B flows, so I don't know what to do. But the values are in the variables, so ...
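One way to continue from here, sketched in Python 3 and under assumptions: the 4x4 matrix and the region-to-ward mapping below are invented, and collecting the results as (origID, destID, flow) rows is only a guess at the desired output, modelled on the pairlist frame in the question.

```python
import pandas as pd

# Hypothetical ward-to-ward flow matrix: rows are origins, columns are destinations
wards = ["1", "2", "3", "4"]
matrix = pd.DataFrame(
    [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]],
    index=wards, columns=wards,
)

# Hypothetical grouping: REG_ID -> list of ward labels
idgroups = {1: ["1", "2"], 2: ["3", "4"]}

def flows(df, ward_a, ward_b):
    """Sum of the cells at the intersection of rows ward_a and columns ward_b."""
    return df.loc[ward_a, ward_b].values.sum()

# Directed case: visit every (origin, destination) pair of regions
records = []
for reg_a, ward_a in idgroups.items():
    for reg_b, ward_b in idgroups.items():
        records.append({
            "origID": reg_a,
            "destID": reg_b,
            "flow": flows(matrix, ward_a, ward_b),
        })

region_flows = pd.DataFrame(records)
print(region_flows)
```

Building a list of records and creating the frame once at the end sidesteps the row-by-row set_value calls from the original loop.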