How to Combine Two Referential Lists With Python - python

I have stupid data coming out of a system, and it needs to be flattened.
The main csv has these columns: hostname, program_name, version_name
However, there is only one row per host, so the last two fields look like this:
program_name contents:
Word
Excel
Cognos
Mozilla
version contents (not real, just for illustrative purposes):
2.3.2
121.3.0
build 22
What's the best way to ensure things match up, and to do this more concisely and Pythonically?
Here is what the real code looks like, the above is mainly for demo purposes:
for row in tan_output.programs:
    names = row["Name"].splitlines()
    versions = row["Version"].splitlines()
    if(len(names) != len(versions)):
        print("NAME and VERSION from tan_programs are not equal... Exiting")
        exit()
    else:
        for name in names:
            #tan_programs.append({"Count": row["Count"], "Hostname": row["Hostname"], "Name": row["Name"], "Version": row["Version"]})
I am stuck on the bottom for loop because I feel like I should be looping thru both lists simultaneously instead of looping thru one, and then what I was going to do, use a counter to reference the second one and form the flattened data.
PS, the file is 7 gigs... so the more efficient the better e.g. if I have to use the counter, I know from experience i += 1 is 100 times more efficient than i = i + 1

Just use the counter... unless someone has a better idea:
tan_programs = []
for row in tan_output.programs:
    names = row["Name"].splitlines()
    versions = row["Version"].splitlines()
    if(len(names) != len(versions)):
        print("NAME and VERSION from tan_programs are not equal... Exiting")
        exit()
    else:
        i = 0
        for name in names:
            tan_programs.append({"Hostname": row["Hostname"], "Name": name, "Version": versions[i]})
            i += 1
It's actually very fast... the slow part is inserting 8 million records into a DB on another server over the network.
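For looping through both lists at the same time, the built-in zip() is the usual alternative to a manual counter. The following is only a sketch: flatten_programs and sample_rows are hypothetical names, and the sample row stands in for one entry of tan_output.programs. Writing it as a generator also avoids holding every flattened record in memory, which may matter for a 7 GB input.

```python
def flatten_programs(rows):
    """Yield one flat record per (name, version) pair in each host row."""
    for row in rows:
        names = row["Name"].splitlines()
        versions = row["Version"].splitlines()
        if len(names) != len(versions):
            raise ValueError(
                "Name/Version line counts differ for host %r" % row["Hostname"]
            )
        # zip() walks both lists in step, so no index counter is needed
        for name, version in zip(names, versions):
            yield {"Hostname": row["Hostname"], "Name": name, "Version": version}


# Hypothetical sample row standing in for one entry of tan_output.programs
sample_rows = [
    {"Hostname": "host1", "Name": "Word\nExcel", "Version": "2.3.2\n121.3.0"},
]
flat_records = list(flatten_programs(sample_rows))
```

The length check stays in place because zip() silently stops at the shorter list, which would hide mismatched rows.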

Related

Finding the line number of specific values in pandas dataframe - python

I am working on a school project and I am trying to simulate a library's catalogue system. I have .csv files that hold all the data I need but I am having a problem with checking if an inputted title, author, bar code, etc. is in the data set. I have searched around for quite a while trying different solutions but nothing is working.
The idea that I have right now is that if I can find which line the inputted data is on, then I can use .loc[] to get the needed info.
Is this the right track? Is there another, more efficient way to do this?
import pandas
mainData = pandas.read_csv("mainData.csv")
barcodes = mainData["Barcode"]
authors = mainData["Author"]
titles = mainData["Title/Subtitle"]
callNumbers = mainData["Call Number"]
k = "Han, Jenny,"
for i in authors:
    if k == i:
        print("Success")
        k = authors.index[k]
        print(authors[k])
    else:
        print("Fail" + k)
# Please note: this code only checks for an author match and leaves out all other fields, as I thought it was too inefficient to extend to the rest of them. The code also does not find the line on which the matches are located, so .loc[] cannot be used to print out all the data found.
This is the code I am using right now. It outputs the result along with an error, IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices, and it is very slow. I would like the code to be able to output the books and their respective info. I have found that the .loc[] feature (mentioned above) outputs the info quite nicely. Here is the data I am using.
Edit: I have been able to reduce the time it takes for the program to run and made a functional "prototype"
authorFirst = input("Enter author's first name: ")
authorFirst = authorFirst.lower()
authorFirst = authorFirst.title()
authorFirst += ","
authorSecond = input("Enter author's last name: ")
authorSecond = authorSecond.lower()
authorSecond = authorSecond.title()
authorSecond += ", "
authorInput = authorSecond + authorFirst
print(mainData[mainData["Author"].isin([authorInput])])
bookChoice = input("Please Enter the number to the left of the barcode to select a book: ")
print(mainData.loc[int(bookChoice)])
It provides the functionality that I am looking for, but I feel that there has to be a better way of doing it (one that does not ask the user to input the row number). I don't know if this is possible, though.
I am new to Python and this is my first time using pandas, so I'm sorry if this is really shitty and hurts your brain.
Thank you so much for your time!
Pandas does not really need to find the numeric index of something to do indexing.
Since you have not provided any starting point or data, I'll just provide a few pointers here, as there are many ways to match and index things in pandas.
import pandas as pd
# build a library
library = pd.DataFrame({
    "Author": ["H.G. Wells", "Hubert Selby Jr.", "Ken Kesey"],
    "Title": [
        "The War of the Worlds",
        "Requiem for a Dream",
        "One Flew Over the Cuckoo's Nest",
    ],
    "Published": [1898, 1979, 1962],
})
# find on some characteristics
mask_wells = library.Author.str.contains("Wells")
mask_rfad = library["Title"] == "Requiem for a Dream"
mask_xixth = library["Published"] < 1900
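Boolean masks like these can be passed straight to the DataFrame to pull out the matching rows, so no row number has to be typed in. The sketch below rebuilds the same small example frame so that it runs on its own; the result names (wells_books, old_titles, matching_labels) are illustrative only.

```python
import pandas as pd

library = pd.DataFrame({
    "Author": ["H.G. Wells", "Hubert Selby Jr.", "Ken Kesey"],
    "Title": [
        "The War of the Worlds",
        "Requiem for a Dream",
        "One Flew Over the Cuckoo's Nest",
    ],
    "Published": [1898, 1979, 1962],
})

# A boolean mask selects whole rows directly
wells_books = library[library["Author"].str.contains("Wells")]

# .loc with a mask and a column list returns only the requested fields
old_titles = library.loc[library["Published"] < 1900, ["Title", "Published"]]

# If the row labels are wanted after all, they are on the filtered index
matching_labels = library.index[library["Title"] == "Requiem for a Dream"].tolist()
```

In the library catalogue case, the same pattern would apply to a mask such as mainData["Author"] == authorInput, assuming the column names from the question.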

Is it possible to use variables values in if statements in Python?

I have a database table about people. They have variables like age, height, weight, etc.
I also have another database table about characteristics of the people. This has three fields:
Id: Just a running number
Condition: For example "Person is teenager" or "Person is overweight"
Formula: For example for the "Person is teenager" the formula is "age > 12 and age < 20" or for the overweight "weight / height * height > 30"
There are more than 50 conditions like these. When I want to determine the characteristics of a person, I would need to write an if statement for each of these conditions, which makes the code quite messy and also hard to maintain (whenever I add a new condition to the database, I also need to add a new if statement in the code).
If I type the formulas directly into the database, is it possible to use those as if statements directly? As in if(print(characteristic['formula']) etc.
What I am looking is something like this, I am using Python.
In this code
Person is one person already fetched from database as a dict
Characteristics are all the characteristics fetched from the database as a list of dictionaries
def getPersonCharacteristics(person, characteristics):
    age = person['age']
    weight = person['weight']  # etc...
    personCharacteristics = []
    for x in characteristics:
        if(x['formula']):
            personCharacteristics.append(x['condition'])
    return personCharacteristics
Now in this part, if(x['formula']), instead of checking whether the variable is true, it should "print" the variable's value and run the if statement against that, e.g. if(age > 12 and age < 20):
Is this possible in some way? Again the whole point of this is that when I come up with new conditions I could just add a new row to the database without altering any code and adding yet another if statement.
Do you mean like this?
#
#Example file for working with conditional statement
#
def main():
    x,y =2,8
    if(x < y):
        st= "x is less than y"
    print(st)
if __name__ == "__main__":
    main()
This is possible using Python's eval function:
if eval(x['formula']):
    ...
However, this is usually discouraged, as it can make your program harder to understand and can cause security problems if you're not very careful about how your database is accessed and what can end up in there.
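One way to keep the formulas in the database without calling eval on arbitrary text is to parse each formula with the standard ast module and accept only a small whitelist of operations. This is a minimal sketch, not a hardened sandbox: it assumes Python 3.8 or later, evaluate_formula and get_person_characteristics are hypothetical names, and the sample data is invented. The overweight formula here uses explicit parentheses, since weight / height * height is evaluated left to right and comes out as roughly weight.

```python
import ast
import operator

# Whitelisted operators; anything else in a formula is rejected
BINARY_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
              ast.Mult: operator.mul, ast.Div: operator.truediv}
COMPARE_OPS = {ast.Lt: operator.lt, ast.LtE: operator.le, ast.Gt: operator.gt,
               ast.GtE: operator.ge, ast.Eq: operator.eq, ast.NotEq: operator.ne}


def evaluate_formula(formula, variables):
    """Evaluate a formula string against a dict of variable values."""

    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in BINARY_OPS:
            return BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.BoolOp):
            results = [walk(value) for value in node.values]
            return all(results) if isinstance(node.op, ast.And) else any(results)
        if isinstance(node, ast.Compare) and all(type(op) in COMPARE_OPS for op in node.ops):
            left = walk(node.left)
            for op, comparator in zip(node.ops, node.comparators):
                right = walk(comparator)
                if not COMPARE_OPS[type(op)](left, right):
                    return False
                left = right
            return True
        # Function calls, attribute access, etc. all end up here
        raise ValueError("Unsupported expression in formula: %r" % formula)

    return bool(walk(ast.parse(formula, mode="eval").body))


def get_person_characteristics(person, characteristics):
    """Return the conditions whose formula holds for this person."""
    return [c["condition"] for c in characteristics
            if evaluate_formula(c["formula"], person)]


# Invented sample data in the shape described in the question
person = {"age": 15, "weight": 95, "height": 1.7}
characteristics = [
    {"id": 1, "condition": "Person is teenager", "formula": "age > 12 and age < 20"},
    {"id": 2, "condition": "Person is overweight", "formula": "weight / (height * height) > 30"},
]
matched = get_person_characteristics(person, characteristics)
```

New rows added to the characteristics table are then picked up without code changes, while a formula containing a function call or attribute access raises ValueError instead of running.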

Python - program for searching for relevant cells in excel does not work correctly

I've written some code to search for relevant cells in an Excel file. However, it does not work as well as I had hoped.
In pseudocode, this is what it should do:
Ask for input excel file
Ask for input textfile containing keywords to search for
Convert input textfile to list containing keywords
For each keyword in list, scan the excelfile
If the keyword is found within a cell, write it into a new excelfile
Repeat with next word
The code works, but some keywords are not found while they are present within the input excelfile. I think it might have something to do with the way I iterate over the list, since when I provide a single keyword to search for, it works correctly. This is my whole code: https://pastebin.com/euZzN3T3
This is the part I suspect is not working correctly. Splitting the textfile into a list works fine (I think).
#IF TEXTFILE
elif btext == True:
    #Split each line of textfile into a list
    file = open(txtfile, 'r')
    #Keywords in list
    for line in file:
        keywordlist = file.read().splitlines()
    nkeywords = len(keywordlist)
    print(keywordlist)
    print(nkeywords)
    #Iterate over each string in list, look for match in .xlsx file
    for i in range(1, nkeywords):
        nfound = 0
        ws_matches.cell(row = 1, column = i).value = str.lower(keywordlist[i-1])
        for j in range(1, worksheet.max_row + 1):
            cursor = worksheet.cell(row = j, column = c)
            cellcontent = str.lower(cursor.value)
            if match(keywordlist[i-1], cellcontent) == True:
                ws_matches.cell(row = 2 + nfound, column = i).value = cellcontent
                nfound = nfound + 1
and my match() function:
def match(keyword, content):
    """Check if the keyword is present within the cell content, return True if found, else False"""
    if content.find(keyword) == -1:
        return False
    else:
        return True
I'm new to Python so my apologies if the way I code looks like a warzone. Can someone help me see what I'm doing wrong (or could be doing better?)? Thank you for taking the time!
Splitting the textfile into a list works fine (I think).
This is something you should actually test (hint: it does but is inelegant). The best way to make easily testable code is to isolate functional units into separate functions, i.e. you could make a function that takes the name of a text file and returns a list of keywords. Then you can easily check if that bit of code works on its own. A more pythonic way to read lines from a file (which is what you do, assuming one word per line) is as follows:
with open(filename) as f:
    # splitlines() drops the trailing newlines that readlines() would keep
    keywords = f.read().splitlines()
The rest of your code may actually work better than you expect. I'm not able to test it right now (and don't have your spreadsheet to try it on anyway), but if you're relying on nfound to give you an accurate count for all keywords, you've made a small but significant mistake: it's set to zero inside the loop, and thus you only get a count for the last keyword. Move nfound = 0 outside the loop.
In Python, the way to iterate over lists - or just about anything - is not to increment an integer and then use that integer to index the value in the list. Rather loop over the list (or other iterable) itself:
for keyword in keywordlist:
    ...
As a hint, you shouldn't need nkeywords at all.
I hope this gets you on the right track. When asking questions in future, it'd be a great help to provide more information about what goes wrong, and preferably enough to be able to reproduce the error.
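The two suggestions above (a separate, testable function for reading keywords, and looping over the list itself) can be sketched as follows. load_keywords and find_matches are hypothetical names, and plain strings stand in for the spreadsheet cells so the example runs without the Excel file. Iterating directly also sidesteps two index hazards in the question's snippet: range(1, nkeywords) never reaches the last keyword, and calling file.read() inside for line in file likely loses the first line under Python 3.

```python
import os
import tempfile


def load_keywords(path):
    """Return the non-empty lines of a text file as lowercase keywords."""
    with open(path) as f:
        return [line.strip().lower() for line in f if line.strip()]


def find_matches(keywords, cell_values):
    """Map each keyword to the (lowercased) cell strings that contain it."""
    matches = {}
    for keyword in keywords:  # no index arithmetic, so no keyword is skipped
        matches[keyword] = [
            str(value).lower() for value in cell_values
            if value is not None and keyword in str(value).lower()
        ]
    return matches


# Small self-contained demo with a temporary keyword file
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as tmp:
    tmp.write("Alpha\nbeta\n\ngamma\n")
keywords = load_keywords(tmp.name)
os.remove(tmp.name)

cells = ["Alpha centauri", "no hit", None, "BETA test", "gamma ray and alpha"]
found = find_matches(keywords, cells)
```

With the real workbook, cell_values would come from the worksheet column being scanned; empty cells are skipped because calling a string method on None would otherwise fail.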

Compiling a dictionary by pulling data from other dictionaries

I am doing a project in which I extract data from three different data sets and combine it to look at campaign contributions. To do this I turned the relevant data from two of the sets into dictionaries (canDict and otherDict) with ID numbers as keys and the information I need (party affiliation) as values. Then I wrote a program to pull party information based on the key (my third set included these ID numbers as well) and match them with the employer of the donating party, and the amount donated. That was a long winded explanation, but I thought it would help with understanding this chunk of code.
My problem is that, for some reason, my third dictionary (employerDict) won't compile. By the end of this step I should have a dictionary containing employers as keys, and a list of tuples as values, but after running it, the dictionary remains blank. I've been over this line by line a dozen times and I'm pulling my hair out - I can't for the life of me think why it won't work, which is making it hard to search for answers. I've commented almost every line to try to make it easier to understand out of context. Can anyone spot my mistake?
Update: I added a counter, n, to the outermost for loop to see if the program was iterating at all.
Update 2: I added another if statement in the creation of the variable party, in case the ID at data[0] did not exist in canDict or in otherDict. I also added some already suggested fixes from the comments.
n=0
with open(path3) as f: # path3 is a txt file
    for line in f:
        n+=1
        if n % 10000 == 0:
            print(n)
        data = line.split("|") # Splitting each line into its entries (delimited by the symbol |)
        party = canDict.get(data[0]) # data[0] is an ID number. canDict and otherDict contain these IDs as keys with party affiliations as values
        if party is None:
            party = otherDict[data[0]] # If there is no matching ID number in canDict, search otherDict
            if party is None:
                party = 'Other'
        else:
            print('ERROR: party is None')
        x = (party, int(data[14])) # Creating a tuple of the party (found through the loop) and an integer amount from the file path3
        employer = data[11] # Index 11 in path3 is the employer of the person
        if employer != '':
            value = employerDict.get(employer) # If the employer field is not blank, see if this employer is already a key in employerDict
            if value is None:
                employerDict[employer] = [x] # If the key does not exist, create it and add a list including the tuple x as its value
            else:
                employerDict[employer].append(x) # If it does exist, add the tuple x to the existing value
        else:
            print('ERROR: employer == ''')
Thanks for all the input, everyone. However, it looks like it's a problem with my data file, not a problem with the program. Dangit.
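For reference, the grouping step (employers as keys, lists of (party, amount) tuples as values) can be written more compactly with collections.defaultdict, and counting the lines that do not parse helps surface a bad data file early. This is only a sketch under assumptions: group_by_employer and make_line are hypothetical helpers, the field positions (0, 11, 14) are taken from the question, and the sample lines are invented.

```python
from collections import defaultdict


def group_by_employer(lines, can_dict, other_dict):
    """Return ({employer: [(party, amount), ...]}, number of skipped lines)."""
    employer_dict = defaultdict(list)
    skipped = 0
    for line in lines:
        data = line.rstrip("\n").split("|")
        # Skip lines that are too short or have no employer
        if len(data) < 15 or not data[11]:
            skipped += 1
            continue
        try:
            amount = int(data[14])
        except ValueError:
            skipped += 1
            continue
        # Look in can_dict first, then other_dict, then fall back to 'Other'
        party = can_dict.get(data[0]) or other_dict.get(data[0]) or "Other"
        employer_dict[data[11]].append((party, amount))
    return dict(employer_dict), skipped


def make_line(id_number, employer, amount):
    """Build an invented 15-field, pipe-delimited sample line."""
    fields = [""] * 15
    fields[0], fields[11], fields[14] = id_number, employer, amount
    return "|".join(fields) + "\n"


sample_lines = [
    make_line("C1", "ACME", "100"),
    make_line("X9", "ACME", "50"),
    make_line("C2", "", "25"),   # blank employer, skipped
    "broken|line\n",             # too few fields, skipped
]
grouped, skipped_count = group_by_employer(sample_lines, {"C1": "DEM"}, {"C2": "REP"})
```

A skipped count close to the total number of lines would point at the input file rather than at the loop.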

In SPSS Python essentials, can I get the value of an SPSS variable returned to Python for further use?

I have a database where each case holds info about handwritten digits, eg:
Digit1Seq : when in the sequence of 12 digits the "1" was drawn
Digit1Ht: the height of the digit "1"
Digit1Width: its width
Digit2Seq: same info for digit "2"
on up to digit "12"
I find I now need the information organized a little differently as well. In particular, I want new variables with the height and width of the first digit written, then the height and width of the second, etc., as SPSS vars
FirstDigitHt
FirstDigitWidth ...
TwelvthDigitWidth
Here's a Python program I wrote to do within SPSS what ought to be a very simple computation, but it runs into a sort of namespace problem:
BEGIN PROGRAM PYTHON.
import spss
indices = ["1", "2", "3","4","5", "6", "7", "8", "9", "10", "11", "12"]
seq=0
for i in indices:
    spss.Submit("COMPUTE seq = COMDigit" + i + "Seq.")
    spss.Submit("EXECUTE.")
    spss.Submit("COMPUTE COM" + indices[seq] + "thWidth = COMDigit" + i + "Width.")
    spss.Submit("COMPUTE COM" + indices[seq] + "thHgt = COMDigit" + i + "Hgt.")
    spss.Submit("EXECUTE.")
END PROGRAM.
It's clear what's wrong here: the value of seq in the first COMPUTE command doesn't get back to Python, so the right thing can't happen in the next two COMPUTE commands. Python's value of seq doesn't change, so I end up with SPSS code that gives me only two variables (COM1thWidth and COM1thHgt), into which COMDigit1Width, COMDigit2Width, etc. get written.
Is there any way to get Python to access SPSS's value of seq each time so that the string concatenation will create the correct COMPUTE? Or am I just thinking about this incorrectly?
Have googled extensively, but find no way to do this.
As I'm new to using Python in SPSS (and not all that much of a wiz with SPSS), there may well be a far easier way to do this.
All suggestions most welcome.
Probably the easiest way to get your SPSS variable data into Python variables for manipulation is with the spss.Dataset class.
To do this, you will need:
1.) the dataset name of your SPSS Dataset
2.) either the name of the variable you want to pull data from or its index in your dataset.
If the name of the variable you want to extract data from is named 'seq' (as I believe it was in your question), then you can use something like:
BEGIN PROGRAM PYTHON.
from __future__ import with_statement
import spss
with spss.DataStep():
    #the lines below create references to your dataset,
    #to its variable list, and to its case data
    lv_dataset = spss.Dataset(name = <name of your SPSS dataset>)
    lv_caseData = lv_dataset.cases
    lv_variables = lv_dataset.varlist
    #the line below extracts all the data from the SPSS variable named 'seq' in the dataset referenced above into a list
    #to make use of an SPSS cases object, you specify in square brackets which rows and which variables to extract from, such as:
    #lv_theData = lv_caseData[rowStartIndex:rowEndIndex, columnStartIndex:columnEndIndex]
    #Each row you request to be extracted will be returned as a list of values, one value for each variable you request data for
    #This means that if you want to get data for one variable across many rows of data, you will get a list for each row of data, but each row's list will have only one value in it, hence in the code below, we grab the first element of each list returned
    lv_variableData = [itm[0] for itm in lv_caseData[0:len(lv_caseData), lv_variables['seq'].index]]
END PROGRAM.
There are lots of ways to process the case data held by Statistics via Python, but the case data has to be read explicitly using the spss.Cursor, spssdata.Spssdata, or spss.Dataset class. It does not live in the Python namespace.
In this case the simplest thing to do would be to just substitute the formula for seq into the later references. There are many other ways to tackle this.
Also, get rid of those EXECUTE calls. They just force unnecessary data passes. Statistics will automatically pass the data when it needs to based on the command stream.
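Substituting the formula for seq could look roughly like this: rather than reading seq back into Python, generate one conditional SPSS IF command for every (drawing position, digit) pair and let Statistics pick the matching one per case. This is a sketch under assumptions: build_reorder_syntax is a hypothetical helper, the variable names follow the pattern in the question (COMDigit1Seq, COM1thWidth, COM1thHgt), and the Submit call is shown only as a comment because the spss module is available only inside Statistics.

```python
def build_reorder_syntax(n_digits=12):
    """Return SPSS IF commands that copy each digit's size by drawing order."""
    commands = []
    for position in range(1, n_digits + 1):    # order in which a digit was drawn
        for digit in range(1, n_digits + 1):   # which digit it was
            condition = "IF (COMDigit%dSeq = %d) " % (digit, position)
            commands.append(
                condition + "COM%dthWidth = COMDigit%dWidth." % (position, digit)
            )
            commands.append(
                condition + "COM%dthHgt = COMDigit%dHgt." % (position, digit)
            )
    return commands


syntax = build_reorder_syntax()
# Inside a BEGIN PROGRAM PYTHON block, the commands could then be run with:
#     import spss
#     spss.Submit("\n".join(syntax))
```

Printing a few entries of syntax before submitting is a cheap way to check that the generated commands match the real variable names.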
Hi, I just stumbled across this, and you've probably moved on, but it might help other folks. I don't think you actually need to have Python access the SPSS values. I think something like this might work:
BEGIN PROGRAM PYTHON.
import spss
for i in range(1,13):
    k = "COMPUTE seq = COMDigit" + str(i) + "Seq."
    l = "Do if seq = " + str(i)+ "."
    m = "COMPUTE COM" + str(i) + "thWidth = COMDigit" + str(i) + "Width."
    n = "COMPUTE COM" + str(i) + "thHgt = COMDigit" + str(i) + "Hgt."
    o = "End if."
    print(k)
    print(l)
    print(m)
    print(n)
    print(o)
    spss.Submit(k)
    spss.Submit(l)
    spss.Submit(m)
    spss.Submit(n)
    spss.Submit(o)
    spss.Submit("EXECUTE.")
END PROGRAM.
But I'd have to see the data to make sure I'm understanding your problem correctly. Also, the print stuff makes the code look ugly, but it's the only way I can keep a handle on what's going on under the hood. Cheerio!
