Dict to Pandas Dataframe - python

I'm creating my first .csv export from a dict. My dict has the following structure:
dict_all[key] = {
    "id_ja": None,
    "id_nein": None,
    "ZUW_ja": set(),
    "ZUW_nein": set(),
    "missing_ZUW_ja": set(),
    "missing_ZUW_nein": set()
}
My .csv should have the following columns: ID_yes/ID_no, ZUW (this needs to be "ZUW" in every row), missing_ZUW_yes/missing_ZUW_nein, and Relation (which needs to be -1 in every row).
For missing_ZUW_yes/missing_ZUW_nein I need to write a single row for each entry in the set. That means the other three columns need to be duplicated for as long as there is an ID inside my missing_ZUW_yes/missing_ZUW_nein set.
Probably the easiest approach is to iterate over id_yes first and, inside this loop, add a row for each entry in missing_ZUW_yes. Once the first half is done, it might be easiest to continue with id_no and missing_ZUW_no. Am I right about that?
My relevant dict entries look like:
dict["LM_Doctor"]= {"id_ja": 122344, "id_nein":122345, "missing_ZUW_ja": 123,132,143,12, "missing_ZUW_ja": 432,64,321}
and in the csv it should look like:
row 0 (header): Term ID 1, ZUW, Term ID 2, RV
row 1: 122344, ZUW, 123 (the first id from missing_ZUW_yes), -1
Row 2 should look the same, except that it should contain 132 for missing_ZUW_yes.
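A minimal sketch of one way to build those rows (my own reading of the layout above, so treat the column names and the export.csv file name as assumptions): iterate over the ja entries first, then the nein entries, emit one row per ID in the corresponding missing set, and let pandas write the CSV.
import pandas as pd

rows = []
for key, entry in dict_all.items():
    # one row per ID in missing_ZUW_ja, duplicating the other columns
    for missing_id in entry["missing_ZUW_ja"]:
        rows.append({"Term ID 1": entry["id_ja"], "ZUW": "ZUW",
                     "Term ID 2": missing_id, "RV": -1})
    # then the same for the nein side
    for missing_id in entry["missing_ZUW_nein"]:
        rows.append({"Term ID 1": entry["id_nein"], "ZUW": "ZUW",
                     "Term ID 2": missing_id, "RV": -1})

df = pd.DataFrame(rows, columns=["Term ID 1", "ZUW", "Term ID 2", "RV"])
df.to_csv("export.csv", index=False)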

Related

Openpyxl and Binary Search

The problem: I have two spreadsheets. Spreadsheet 1 has about 20,000 rows. Spreadsheet 2 has near 1 million rows. When a value from a row in spreadsheet 1 matches a value from a row in spreadsheet 2, the entire row from spreadsheet 2 is written to excel. The problem isn't too difficult, but with such a large number of rows, the run time is incredibly long.
Book 1 Example:
|Key |Value |
|------|------------------|
|397241|587727227839578000|
An example of book 2:
|ID                |a  |b |c   |
|------------------|---|--|----|
|587727227839578000|393|24|0.43|
My current solution is:
g1 = openpyxl.load_workbook('path/to/sheet/sheet1.xlsx', read_only=True)
grid1 = g1.active
grid1_rows = list(grid1.rows)

g2 = openpyxl.load_workbook('path/to/sheet2/sheet2.xlsx', read_only=True)
grid2 = g2.active
grid2_rows = list(grid2.rows)

# output_file is assumed to be a file object opened for writing earlier (not shown)
for row in grid1_rows:
    value1 = int(row[1].value)
    print(value1)
    for row2 in grid2_rows:
        value2 = int(row2[0].value)
        if value1 == value2:
            new_Name = int(row[0].value)
            print("match")
            output_file.write(str(new_Name))
            output_file.write(",")
            output_file.write(",".join(str(c.value) for c in row2[1:]))
            output_file.write("\n")
This solution works, but again the runtime is absurd. Ideally I'd like to take value1 (which comes from the first sheet), then perform a binary search for that value on the other sheet, and then, just like in my current solution, copy the entire row to a new file if it matches.
If there's an even faster method to do this, I'm all ears. I'm not the greatest at Python, so any help is appreciated.
Thanks.
You are getting your butt kicked here because you are using an inappropriate data structure, which forces you into the nested loop.
The example below uses sets to match indices from the first sheet to those in the second sheet. This assumes there are no duplicates on either sheet, which would seem odd given your problem description. Once we make sets of the indices from both sheets, all we need to do is intersect the two sets to find the ones that are present on sheet 2.
Then we have the matches, but we can do better. If we put the second sheet's row data into a dictionary with the indices as the keys, then we can hold onto the row data while we do the match, rather than having to go hunting for the matching indices after intersecting the sets.
I've also put in an enumeration, which may or may not be needed to identify which rows in the spreadsheet are the ones of interest. Probably not needed.
This should execute in the blink of an eye after things are loaded. If you start to have memory issues, you may want to just construct the dictionary at the start rather than the list and the dictionary.
Book 1: (example workbook; screenshot not reproduced here)
Book 2: (example workbook; screenshot not reproduced here)
Code:
import openpyxl
g1 = openpyxl.load_workbook('Book1.xlsx',read_only=True)
grid1 = g1.active
grid1_rows = list(grid1.rows)[1:] # exclude the header
g2 = openpyxl.load_workbook('Book2.xlsx',read_only=True)
grid2 = g2.active
grid2_rows = list(grid2.rows)[1:] # exclude the header
# make a set of the values in Book 1 that we want to search for...
search_items = {int(t[0].value) for t in grid1_rows}
#print(search_items)
# make a dictionary (key-value pairing) for the items in the 2nd book, and
# include an enumeration so we can capture the row number
lookup_dict = {int(t[0].value) : (idx, t) for idx,t in enumerate(grid2_rows, start=1)}
#print(lookup_dict)
# now let's intersect the set of search items and key values to get the keys of the matches...
keys = search_items & lookup_dict.keys()
#print(keys)
for key in keys:
    idx = lookup_dict.get(key)[0]       # the row index, if needed
    row_data = lookup_dict.get(key)[1]  # the row data
    print(f'row {idx} matched value {key} and has data:')
    print(f'    name: {row_data[1].value:10s} \t qty: {int(row_data[2].value)}')
Output:
row 3 matched value 202 and has data:
name: steak qty: 3
row 1 matched value 455 and has data:
name: dogfood qty: 10
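If memory does become an issue, here is a minimal sketch of the variant mentioned above (my own addition, assuming the same Book1.xlsx/Book2.xlsx layout as in the example): build only the lookup dictionary and stream the rows with iter_rows instead of materializing the lists first.
import openpyxl

g2 = openpyxl.load_workbook('Book2.xlsx', read_only=True)
grid2 = g2.active
# build only the dictionary; iter_rows streams rows instead of building a list
lookup_dict = {int(t[0].value): t for t in grid2.iter_rows(min_row=2)}

g1 = openpyxl.load_workbook('Book1.xlsx', read_only=True)
grid1 = g1.active
for row in grid1.iter_rows(min_row=2):
    key = int(row[0].value)
    if key in lookup_dict:
        row_data = lookup_dict[key]
        print(f'value {key} matched and has name: {row_data[1].value}')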

Return a copy of a row in a table

If I have a table (CSV) of data and the first column is an ID, I want to write code to search for the ID in the first column and then return a copy of the entire row of data for that corresponding ID. Sorry if it is a simple question; I am new to Python and struggling with this. I thought maybe you would search column one for the row number of the ID and then, from there, return the whole row. Any help or tips would be appreciated.
Table Example:
ID,LAST NAME,FIRST NAME,DOB
A001,Smith,Bob,1995-07-23
A002,Jones,John,1962-05-15
A003,Walker,Willy,1984-01-12
A004,Kelly,Sara,2001-12-01
def get_id_row(id, table):
    """id: id you want to search
    table: is a 2 dim table"""
    pass
This is pseudo code which might help you. I wrote it on my mobile phone, so it might be a bit dirty.
IDs = [1, 2, 3]
values = [101, 102, 103]
myTable = [IDs, values]

def get_id_row(id, table):
    rowInd = 0
    for tempId in table[0]:
        if tempId == id:
            return rowInd
        rowInd += 1

print("Id:1 from the row number : ", end="")
print(get_id_row(1, myTable))
print("Id:2 from the row number : ", end="")
print(get_id_row(2, myTable))
print("Id:3 from the row number : ", end="")
print(get_id_row(3, myTable))
The output:
Id:1 from the row number : 0
Id:2 from the row number : 1
Id:3 from the row number : 2
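Since the question asks for a copy of the entire row from a CSV file, here is a minimal sketch using the standard csv module (the file name people.csv is an assumption; the table layout is the one from the question):
import csv

def get_id_row(search_id, csv_path):
    """Return a copy of the first row whose ID column matches search_id, or None."""
    with open(csv_path, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row (ID,LAST NAME,FIRST NAME,DOB)
        for row in reader:
            if row[0] == search_id:
                return list(row)  # copy of the matching row
    return None

print(get_id_row("A003", "people.csv"))
# ['A003', 'Walker', 'Willy', '1984-01-12']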

Replace row cell of table with row cell of another table using Python

In Python, I'm trying to look at 2 cells from a row of a table, then look at 2 cells from a row of a second table, and then, if the value of [Table1.RowCell1] is equal to the value of [Table2.RowCell1], populate [Table2.RowCell2] with [Table1.RowCell2].
What would the code for something like this look like?
Would it require nested for loops?
Here's the code that I have right now:
# count is assumed to be initialized (e.g. count = 0) before this block
with arcpy.da.SearchCursor(sj, ["TARGET_FID", field2]) as scursor:
    for srow in scursor:
        with arcpy.da.UpdateCursor(fl, ["OBJECTID", field]) as ucursor:
            for urow in ucursor:
                if srow[0] == urow[0]:
                    arcpy.AddWarning("You updated FID:" + str(srow[0]))
                    urow[1] = srow[1]
                    ucursor.updateRow(urow)
                    count = count + 1
Does this logic make sense?
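The logic works, but the nested cursors are what makes it slow. A minimal sketch of a single-pass alternative (my own suggestion, using the same sj, fl, field and field2 variables as above, and assuming arcpy is available): build a lookup dictionary from the first table, then update the second table in one pass.
# build a lookup from TARGET_FID to the value we want to copy over
lookup = {}
with arcpy.da.SearchCursor(sj, ["TARGET_FID", field2]) as scursor:
    for srow in scursor:
        lookup[srow[0]] = srow[1]

# single pass over the second table: update a row only when its OBJECTID matches
count = 0
with arcpy.da.UpdateCursor(fl, ["OBJECTID", field]) as ucursor:
    for urow in ucursor:
        if urow[0] in lookup:
            urow[1] = lookup[urow[0]]
            ucursor.updateRow(urow)
            count += 1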

How to identify string repetition throughout rows of a column in a Pandas DataFrame?

I'm trying to think of a way to best handle this. If I have a data frame like this:
|Module  |Line Item  |Formula                                         |repetition?|What repeated                                |Where repeated                            |
|--------|-----------|------------------------------------------------|-----------|---------------------------------------------|------------------------------------------|
|Module 1|Line Item 1|hello[SUM: hello2]                              |yes        |hello[SUM: hello2]                           |Module 1 Line Item 2                      |
|Module 1|Line Item 2|goodbye[LOOKUP: blue123] + hello[SUM: hello2]   |yes        |hello[SUM: hello2], goodbye[LOOKUP: blue123] |Module 1 Line Item 1, Module 2 Line Item 1|
|Module 2|Line Item 1|goodbye[LOOKUP: blue123] + some other line item |yes        |goodbye[LOOKUP: blue123]                     |Module 1 Line Item 2                      |
How would I go about setting up a search and find to locate and identify repetition in the middle or on edges or complete strings?
Sorry the formatting looks bad
Basically I have the module, line item, and formula columns filled in, but I need to figure out some sort of search function that I can apply to each of the last 3 columns. I'm not sure where to start with this.
I want to match any repetition of 3 or more words. For example, if a formula was 1 + 2 + 3 + 4 and that occurred 4 times in the Formula column, I'd want to put a yes in the boolean "repetition?" column, put 1 + 2 + 3 + 4 in the "What repeated" column, and list every module/line item combination where it occurred in the last column. I'm sure I can tweak it more to fit my needs once I get started.
This one was a bit messy; there is surely a more straightforward way to do some of the steps, but it worked for your data.
Step 1: I just reset_index() (assuming index uses row numbers) to get row numbers into a column.
df.reset_index(inplace=True)
I then wrote a for loop whose aim was to check, for each given value, whether that value appears anywhere in the given column (using the .str.contains() function) and, if so, where, and then store that information in a dictionary. Note that here I used + to split the various values you search by, as that looked to be a valid separator in your dataset, but you can adjust this accordingly.
# the dictionary will have a key containing row number and the value we searched for
# the value will contain the module and line item values
result = {}
# create a rownumber variable so we know where in the dataset we are
rownumber = -1
# now we just iterate over every row of the Formula series
for row in df['Formula']:
    rownumber += 1
    # and also every relevant value within that cell
    for value in row.split('+'):
        # we clean the value from trailing/preceding whitespace
        value = value.strip()
        # and then we build our key and value and update our dictionary
        key = 'row:|:' + str(rownumber) + ':|:' + value
        value = (df.loc[((df.Formula.str.contains(value, regex=False))) & (df.index != rownumber), ['Module', 'Line Item']])
        result.update({key: value})
We can now unpack the dictionary into lists, keeping only the entries where we had a match:
where_raw = []
what_raw = []
rows_raw = []

for key, value in zip(result.keys(), result.values()):
    if 'Empty' in str(value):
        continue
    else:
        where_raw.append(list(value['Module'] + ' ' + value['Line Item']))
        what_raw.append(key.split(':|:')[2])
        rows_raw.append(int(key.split(':|:')[1]))

tempdf = pd.DataFrame({'row': rows_raw, 'where': where_raw, 'what': what_raw})
tempdf now contains one row per match; however, we want one row per original row in the df, so we combine all matches for each main row into one:
where = []
what = []
rows = []

for row in tempdf.row.unique():
    where.append(list(tempdf.loc[tempdf.row == row, 'where']))
    what.append(list(tempdf.loc[tempdf.row == row, 'what']))
    rows.append(row)
Lastly, we can get the result by merging this back onto our original dataframe (using the index column that reset_index created earlier):
result = df.merge(pd.DataFrame({'index': rows, 'where': where, 'what': what}), how='left', on='index').drop('index', axis=1)
We can then add the repeated column. After a left merge, rows without a match hold NaN in the what column, so notna() is the safer check:
result['repeated'] = result['what'].notna()
print(result)
Module Line Item Formula what where
Module 1 Line Item 1 hello[SUM: hello2] ['hello[SUM: hello2]'] [['Module 1 Line Item 2']]
Module 1 Line Item 2 goodbye[LOOKUP: blue123] + hello[SUM: hello2] ['goodbye[LOOKUP: blue123]', 'hello[SUM: hello2]'] [['Module 2 Line Item 1'], ['Module 1 Line Item 1']]
Module 2 Line Item 1 goodbye[LOOKUP: blue123] + some other line item ['goodbye[LOOKUP: blue123]'] [['Module 1 Line Item 2']]
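As a possible shortening of the combine-and-merge steps (my own sketch, not part of the answer above), the per-row aggregation can also be done with groupby:
# equivalent to the where/what/rows loop plus the merge above
agg = (tempdf.groupby('row')
              .agg(list)
              .reset_index()
              .rename(columns={'row': 'index'}))
result = df.merge(agg, how='left', on='index').drop('index', axis=1)
result['repeated'] = result['what'].notna()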

Pandas - Change value in column based on its relationship with another column

I am working with the sklearn.datasets.fetch_20newsgroups() dataset. Here, there are some documents that belong to more than one news group. I want to treat those documents as two different entities that each belong to one news group. To do this, I've brought the document IDs and group names into a dataframe.
import os
import pandas as pd
from sklearn import datasets

data = datasets.fetch_20newsgroups()
filepaths = data.filenames.astype(str)
keys = []
for path in filepaths:
    keys.append(os.path.split(path)[1])
groups = pd.DataFrame(keys, columns=['Document_ID'])
groups['Group'] = data.target
groups.head()
>> Document_ID Group
0 102994 7
1 51861 4
2 51879 4
3 38242 1
4 60880 14
print (len(groups))
>>11314
print (len(groups['Document_ID'].drop_duplicates()))
>>9840
print (len(groups['Group'].drop_duplicates()))
>>20
For each Document_ID, I want to change its value if it has more than one Group number assigned. Example,
groups[groups['Document_ID']=='76139']
>> Document_ID Group
5392 76139 6
5680 76139 17
I want this to become:
>> Document_ID Group
5392 76139 6
5680 12345 17
Here, 12345 is a random new ID that is not already in keys list.
How can I do this?
You can find all the rows that contain a duplicate Document_ID after the first with the duplicated method. Then create a list of new IDs beginning with one more than the max ID. Use the loc indexing operator to overwrite the duplicate keys with the new IDs.
groups['Document_ID'] = groups['Document_ID'].astype(int)
dupes = groups.Document_ID.duplicated(keep='first')
max_id = groups.Document_ID.max() + 1
new_id = range(max_id, max_id + dupes.sum())
groups.loc[dupes, 'Document_ID'] = new_id
Test case
groups.loc[[5392,5680]]
Document_ID Group
5392 76139 6
5680 179489 17
Ensure that no duplicates remain.
groups.Document_ID.duplicated(keep='first').any()
False
Kinda Hacky, but why not!
data = {"Document_ID": [102994,51861,51879,38242,60880,76139,76139],
"Group": [7,1,3,4,4,6,17],
}
groups = pd.DataFrame(data)
groupDict ={}
tempLst=[]
#Create a list of unique ID's
DocList = groups['Document_ID'].unique()
DocList.tolist()
#Build a dictionary and push all group ids to the correct doc id
DocDict = {}
for x in DocList:
DocDict[x] = []
for index, row in groups.iterrows():
DocDict[row['Document_ID']].append(row['Group'])
#For all doc Id's with multip entries create a new id with the group id as a decimal point.
groups['DupID'] = groups['Document_ID'].apply(lambda x: len(DocDict[x]))
groups["Document_ID"] = np.where(groups['DupID'] > 1, groups["Document_ID"] + groups["Group"]/10,groups["Document_ID"])
Hope that helps...
