Python/django - compare list with database rows - python

I have a database with rows of image locations, and a list of image locations that is generated when I run the script. I want to keep the database in sync with the generated list. I would just drop the table, but it holds vote information that I need to keep. If I delete an image, I don't want its entry to remain, but if I add an image, I want to keep the votes for all of the other images.
Example:
[db]
Name | Path | vote_count
image1 | path/to/image1.jpg | 1
image2 | path/to/image2.jpg | 4
image3 | path/to/image3.jpg | 2
[list]
path/to/image1.jpg
path/to/image2.jpg
path/to/image3.jpg
path/to/image4.jpg
I want to compare the list to the database and if there is an added image I want to see the db do the following:
[db]
Name | Path | vote_count
image1 | path/to/image1.jpg | 1
image2 | path/to/image2.jpg | 4
image3 | path/to/image3.jpg | 2
image4 | path/to/image4.jpg | 0
What is a good way to accomplish this?
I have this so far:
def ScanImages(request):
    files = []
    fileRoots = []
    for root, directories, filenames in os.walk('/web/static/web/images/'):
        for filename in filenames:
            files.append(os.path.join(root,filename))
            fileRoots.append(root)

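Assuming you have a Django model VotableImage, you can get a list of the values of one of its fields by calling db_path_list = VotableImage.objects.values_list('Path', flat=True), and then check each value for presence in the files list that your script created. Paths that are in files but not in db_path_list need a new row; paths that are in db_path_list but not in files can be deleted.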
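The comparison described above can be sketched with plain set differences. This is a minimal sketch, not a definitive implementation: diff_paths is a hypothetical helper name, and the Django calls in the trailing comments assume the VotableImage model and its Path field from the answer.

```python
def diff_paths(db_paths, disk_paths):
    """Return (to_add, to_delete) so that the database ends up matching disk_paths."""
    db_set = set(db_paths)
    disk_set = set(disk_paths)
    to_add = sorted(disk_set - db_set)      # on disk, but no row yet
    to_delete = sorted(db_set - disk_set)   # row exists, but the file is gone
    return to_add, to_delete


# Stand-in data mirroring the example in the question.
db_paths = ["path/to/image1.jpg", "path/to/image2.jpg", "path/to/image3.jpg"]
disk_paths = db_paths + ["path/to/image4.jpg"]
to_add, to_delete = diff_paths(db_paths, disk_paths)

# In a Django view this could then be applied roughly as follows
# (assumes the VotableImage model and Path field named in the answer):
#   VotableImage.objects.filter(Path__in=to_delete).delete()
#   VotableImage.objects.bulk_create([VotableImage(Path=p) for p in to_add])
```

Rows for unchanged paths are never touched, so their vote counts survive the sync.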

Related

How to create a new column in Spark using Python, based on another column?

My database contains one column of strings. I want to create a new column based on part of the string in the other column. For example:
content | other column
The father has two dogs | father
One cat stay at home of my mother | mother
etc. | etc.
I thought of creating an array with the words that interest me. For example:
people=[mother,father,etc.]
Then I iterate over the column "content" and extract the word to insert into the new column:
def extract_people(df):
    column=[]
    people=[mother,father,etc.]
    for row in df.select("content").collect():
        for word in people:
            if str(row).find(word):
                column.append(word)
                break
    return pd.Series(column)
f_pyspark = df_pyspark.withColumn('people', extract_people(df_pyspark))
This code doesn't work and gives me this error on the collect():
22/01/26 11:34:04 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 36)
java.lang.OutOfMemoryError: Java heap space
Maybe this is because my file is too large: it has 15 million rows.
How can I create the new column in a different way?
Using the following dataframe as an example
+---------------------------------+
|content |
+---------------------------------+
|Thefatherhas two dogs |
|The fatherhas two dogs |
|Thefather has two dogs |
|Thefatherhastwodogs |
|One cat stay at home of my mother|
|One cat stay at home of mymother |
|Onecatstayathomeofmymother |
|etc. |
|my feet smell |
+---------------------------------+
You can do the following
from pyspark.sql import functions
arr = ["father", "mother", "etc."]
expression = (
    "CASE " +
    "".join(["WHEN content LIKE '%{}%' THEN '{}' ".format(val, val) for val in arr]) +
    "ELSE 'None' END")
df = df.withColumn("other_column", functions.expr(expression))
df.show()
+---------------------------------+------------+
|content |other_column|
+---------------------------------+------------+
|Thefatherhas two dogs |father |
|The fatherhas two dogs |father |
|Thefather has two dogs |father |
|Thefatherhastwodogs |father |
|One cat stay at home of my mother|mother |
|One cat stay at home of mymother |mother |
|Onecatstayathomeofmymother |mother |
|etc. |etc. |
|my feet smell |None |
+---------------------------------+------------+
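To make the generated SQL easier to inspect, the string building from the answer can be pulled into a small function. This is only a sketch: build_case_expression is a hypothetical helper, not part of PySpark, and rejecting words that contain quotes or LIKE wildcards is an assumption added here to keep the generated expression safe.

```python
def build_case_expression(column, words, default="None"):
    """Build a SQL CASE expression that returns the first word found in `column`."""
    for word in words:
        # Quotes would break the string literal; % and _ are LIKE wildcards.
        if any(ch in word for ch in "'%_\\"):
            raise ValueError("unsupported character in word: {!r}".format(word))
    branches = "".join(
        "WHEN {col} LIKE '%{w}%' THEN '{w}' ".format(col=column, w=word)
        for word in words
    )
    return "CASE " + branches + "ELSE '{}' END".format(default)


expression = build_case_expression("content", ["father", "mother"])
print(expression)
```

The resulting string can be passed to functions.expr(...) exactly as in the answer. Because the whole test runs inside Spark, nothing is collected to the driver, which avoids the heap error from the question.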

django get record which is latest and other column value

I have a model with several columns, two of which are equipment_id (a CharField) and date_saved (a DateTimeField).
I have multiple rows with the same equipment_id but different date_saved values (each time the user saves the record, I store the current date and time).
I want to retrieve the record that has a specific equipment_id and is the latest saved, i.e.:
| Equipment_id | Date_saved           |
| ------------ | -------------------- |
| 1061a        | 26-DEC-2020 10:10:23 |
| 1061a        | 26-DEC-2020 10:11:52 |
| 1061a        | 26-DEC-2020 10:22:03 |
| 1061a        | 26-DEC-2020 10:31:15 |
| 1062a        | 21-DEC-2020 10:11:52 |
| 1062a        | 25-DEC-2020 10:22:03 |
| 1073a        | 20-DEC-2020 10:31:15 |
I want to retrieve, for example, the latest record with equipment_id 1061a.
I have tried various approaches without success:
prg = Program.objects.filter(equipment_id=id)
program = Program.objects.latest('date_saved')
When I use program, I get the latest record saved overall, with no relation to the previous filter.
You can chain the filter with latest(), so that latest() only looks at the filtered rows:
result = Program.objects.filter(equipment_id=id).latest('date_saved')

Pythonic way to find hierarchy in a given list of accounts

I have a MySQL column with the following information:
codes = [
    "[1]",
    "[1-1]",
    "[1-1-01]",
    "[1-1-01-01]",
    "[1-1-01-02]",
    "[1-1-01-03]",
    "[1-1-02]",
    "[1-1-02-01]",
    "[1-1-02-02]",
    "[1-2]",
    "[1-2-01]",
    "[1-2-01-01]",
    "[2]",
    "[2-1]",
    "[2-1-01-01]",
    "[2-1-01-02]"
]
This is a hierarchical structure for accounts, and I need to know, for each account, which is parent and which is the child to add to a secondary table called AccountsTree.
My models are:
Accounts:
    id = db.Column(Integer)
    account = db.Column(Integer)
    ...
    numbers = db.Column(Integer)
AccountsTree:
    id = db.Column(Integer)
    parent = db.Column(Integer, db.ForeignKey('Accounts.id'))
    child = db.Column(Integer, db.ForeignKey('Accounts.id'))
I started coding something like:
For each_element in code_list:
    replace "[" and "]"
    split strings in '-' and
    make a new list of lists: each list element is a list of codes, each element of the inner list is a level
But that's starting not to look very good, and it looks as if I'm adding unnecessary complexity.
My workflow is:
(1) Import XLS from front end
(2) Parse XLS and add information to Accounts table
(3) Find out hierarchy of accounts and add information to AccountsTree table
I'm currently struggling with the 3rd step. So, after adding information to my Accounts table, how do I find out the hierarchical structure to fill my AccountsTree table?
My desired result is:
ParentID | ChildID
1 | 2
2 | 3
3 | 4
3 | 4
3 | 5
3 | 6
Has anybody gone through a similar challenge and can share an efficient approach?
Thanks
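One possible approach, sketched here under stated assumptions rather than taken from an accepted answer: the parent of a code is the same code with its last segment removed, so no list of lists is needed. The function names are hypothetical, the IDs are simply list positions standing in for Accounts.id, and falling back to the nearest existing ancestor (for codes such as "[2-1-01-01]" whose direct parent "[2-1-01]" is not in the list) is an assumption about the desired behaviour.

```python
def find_parent(code, known_codes):
    """Return the nearest existing ancestor of `code`, or None for a root."""
    parts = code.strip("[]").split("-")
    while len(parts) > 1:
        parts.pop()  # drop the last level and try again
        candidate = "[" + "-".join(parts) + "]"
        if candidate in known_codes:
            return candidate
    return None


def build_tree(codes):
    """Return (parent_code, child_code) pairs in the order of `codes`."""
    known_codes = set(codes)
    pairs = []
    for code in codes:
        parent = find_parent(code, known_codes)
        if parent is not None:
            pairs.append((parent, code))
    return pairs


codes = ["[1]", "[1-1]", "[1-1-01]", "[1-1-01-01]", "[2]", "[2-1]", "[2-1-01-01]"]
pairs = build_tree(codes)

# Hypothetical IDs: position in the list, standing in for Accounts.id.
ids = {code: i for i, code in enumerate(codes, start=1)}
id_pairs = [(ids[parent], ids[child]) for parent, child in pairs]
print(id_pairs)
```

Each entry of id_pairs corresponds to one (parent, child) row for AccountsTree. In a real import, the ids mapping would come from querying the Accounts table after step (2).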

Python nested loop error when inserting to DB

I have code that scrapes data from https://vancouver.craigslist.org/search/ela using the selenium library. When I execute it, I get the error 'list' object has no attribute 'get_attribute' on the line asdf = images.get_attribute("src"). What I want is to insert the image URL, taken from the list named images, into my table, but I cannot. What is wrong with my code? I am not yet familiar with Python, which is why I am asking. Thanks a lot for your consideration.
Current code
x = driver.find_elements_by_class_name('hdrlnk')
y = driver.find_elements_by_xpath('//p[@class="result-info"]/span[@class="result-meta"]//span[@class="result-price"]')
images = driver.find_elements_by_xpath('//*[@id="sortable-results"]/ul/li/a/img')
for img in images:
    print(img.get_attribute('src'))
    for i in range(len(x)):
        asdf = images.get_attribute("src")
        prod = (x[i].text)
        price = (y[i].text)
        image = asdf
        sql = """INSERT INTO products (name,price,image) VALUES (%s,%s,%s)"""
        mycursor.execute(sql,(prod,price,image))
        mydb.commit()
When I comment out these lines
for img in images:
    print(img.get_attribute('src'))
and remove the asdf and image variables, I am able to insert the data. And when I instead comment out the following code and keep the print loop for images,
#for i in range(len(x)):
#    asdf = images.get_attribute("src")
#    prod = (x[i].text)
#    price = (y[i].text)
#    image = asdf
#    sql = """INSERT INTO products (name,price,image) VALUES (%s,%s,%s)"""
#    mycursor.execute(sql,(prod,price,image))
#    mydb.commit()
I get the result I want, which looks like this:
https://images.craigslist.org/00z0z_4cqgwC5PIXs_300x300.jpg
https://images.craigslist.org/00J0J_f6AnAonGjXd_300x300.jpg
https://images.craigslist.org/00606_mtKNjKREOO_300x300.jpg
https://images.craigslist.org/00U0U_l5t0QnjZEPt_300x300.jpg
https://images.craigslist.org/00505_gIXt1C8aeqk_300x300.jpg
https://images.craigslist.org/00N0N_6P1GmSiL2vI_300x300.jpg
Sample data for x and y variable in i loop:
x = Spigen Magnetic Car Phone Mount
y= $20
What do I need to do in order to insert the image URL together with the product name and price in a single row? TIA.
EDIT. I tried @terahertz's answer and rewrote my code like this:
x = driver.find_elements_by_class_name('hdrlnk')
y = driver.find_elements_by_xpath('//p[@class="result-info"]/span[@class="result-meta"]//span[@class="result-price"]')
images = driver.find_elements_by_xpath('//*[@id="sortable-results"]/ul/li/a/img')
for img in images:
    # print(img.get_attribute('src'))
    for i in range(len(x)):
        asdf = img.get_attribute("src")
        prod = (x[i].text)
        price = (y[i].text)
        image = asdf
        sql = """INSERT INTO products (name,price,image) VALUES (%s,%s,%s)"""
        mycursor.execute(sql,(prod,price,image))
        mydb.commit()
Current DB data
+-----+------------------------------------------------------------------------+--------+-------------------------------------------------------------+
| id | name | price | image |
+-----+------------------------------------------------------------------------+--------+-------------------------------------------------------------+
| 1 | Spigen Magnetic Car Phone Mount | $20 | https://images.craigslist.org/00i0i_7PvHxDMvR2o_300x300.jpg |
| 2 | Netgear Nighthawk x6 r8000 wireless router | $120 | https://images.craigslist.org/00i0i_7PvHxDMvR2o_300x300.jpg |
| 3 | iPod Touch 8gb 2nd generation - Loaded with Classic Rock | $60 | https://images.craigslist.org/00i0i_7PvHxDMvR2o_300x300.jpg |
| 4 | 3 plug 3.1A fast USB wallplugs | $10 | https://images.craigslist.org/00i0i_7PvHxDMvR2o_300x300.jpg |
| 5 | Audio and Video Cables | $3 | https://images.craigslist.org/00i0i_7PvHxDMvR2o_300x300.jpg |
| 6 | Like New Samsung 50" HD TV ForSale | $400 | https://images.craigslist.org/00i0i_7PvHxDMvR2o_300x300.jpg |
| 7 | SONY Alarm Clock | $20 | https://images.craigslist.org/00i0i_7PvHxDMvR2o_300x300.jpg |
| 8 | Bowers & Wilkins P7 Wireless MINT | $450 | https://images.craigslist.org/00i0i_7PvHxDMvR2o_300x300.jpg |
+-----+------------------------------------------------------------------------+--------+-------------------------------------------------------------+
Now I can insert into my database, BUT the problem is that the image column has the same value in every row, as if only one image URL were inserted. And when I visit the link, the product name and the image don't match.
Change asdf = images.get_attribute("src") to asdf = img.get_attribute("src").
Your outer loop accesses each item in the images list through the variable img. But in your inner loop, you call get_attribute on the images list itself, and a list has no such method.
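The follow-up problem in the edit (every row getting the same image) comes from nesting the two loops, so each product is combined with a single img per pass. One way around it, sketched here with a hypothetical helper build_rows and stand-in data instead of a live selenium session, is to walk the three lists in parallel with zip. The sketch assumes every listing has exactly one name, one price and one image, which may not hold on the real page.

```python
def build_rows(names, prices, image_urls):
    """Pair the i-th name, price and image URL into one row tuple."""
    # zip stops at the shortest input, which would silently drop rows,
    # so a length mismatch is reported instead.
    if not (len(names) == len(prices) == len(image_urls)):
        raise ValueError("names, prices and image_urls must have the same length")
    return list(zip(names, prices, image_urls))


# Stand-in data; the URLs are placeholders, not real listing images.
names = ["Spigen Magnetic Car Phone Mount", "SONY Alarm Clock"]
prices = ["$20", "$20"]
image_urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]
rows = build_rows(names, prices, image_urls)

# With the question's variables this would look roughly like:
#   rows = build_rows([e.text for e in x],
#                     [e.text for e in y],
#                     [img.get_attribute("src") for img in images])
#   mycursor.executemany(sql, rows)
#   mydb.commit()
```

If some listings have no price or no image, the three lists will differ in length and the pairing has to be done per listing element instead.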

How can I store a 2D array in python, like table in MySQL?

I have a 2D array (similar to table in MySQL), for instance,
+------------------+------------------+-------------------+------------------+
| trip_id | service_id | route_id | shape_id |
+------------------+------------------+-------------------+------------------+
| 4503599630773892 | 4503599630773892 | 11821949021891677 | 4503599630773892 |
| 4503599630773894 | 4503599630773892 | 11821949021891677 | 4503599630773892 |
| 4503599630773896 | 4503599630773892 | 11821949021891677 | 4503599630773892 |
| 4503599630773898 | 4503599630773892 | 11821949021891677 | 4503599630773892 |
| 4503599630773900 | 4503599630773892 | 11821949021891677 | 4503599630773892 |
| 4503599630773902 | 4503599630773892 | 11821949021891677 | 4503599630773892 |
| 4503599630810392 | 4503599630773892 | 11821949021891678 | 4503599630810392 |
| 4503599630810394 | 4503599630810394 | 11821949021891678 | 4503599630810392 |
| 4503599630810396 | 4503599630773892 | 11821949021891678 | 4503599630810392 |
| 4503599630810398 | 4503599630773892 | 11821949021891678 | 4503599630810392 |
+------------------+------------------+-------------------+------------------+
How can I store a 2D array in python, like table in MySQL?
The first solution that came to my mind is to use a dict: the key is trip_id (the first column) and the value is a list ([service_id, route_id, shape_id]).
Another solution is to use SQLite.
Which one is recommended, or are there other solutions?
PS: I want to store rows (say, [trip_id, service_id, route_id, shape_id]) that are crawled from webpages. This requires dozens of insert or append operations. The order of the entries does not matter, but each entry should be unique.
Depending on your use cases, the usual solution of using a list of lists (or list of tuples) may be the most efficient and readable alternative:
my_table= [
    ("trip_id", "service_id", "route_id", "shape_id"),
    (4503599630773892, 4503599630773892, 11821949021891677, 4503599630773892),
    ...
]
first_trip_id= my_table[1][0]
You can then simply add new rows using my_table.append( (1,2,3,4) ), which is as efficient as it gets (in python).
There's a couple of tricks you can use to make accesses to this kind of structure efficient and readable.
You can opt to exclude the header from it, which may help you make sense of the indexes. If you want both versions, just store the header in it, make a copy, and pop the headers:
from copy import deepcopy
my_table_no_headers= deepcopy(my_table)
my_table_no_headers.pop(0)
first_trip_id= my_table_no_headers[0][0]
Something that may also help writing readable code is declaring constants for the column names:
trip_id, service_id, route_id, shape_id= range(4)
first_trip_id= my_table_no_headers[0][trip_id]
If you want to obtain, for example, a list of all trip_ids, you can simply invert the indexing order:
my_table_no_headers = list(zip(*my_table_no_headers))  # in Python 3, zip returns an iterator, so wrap it in list() before indexing
first_trip_id= my_table_no_headers[trip_id][0]
all_trip_ids= my_table_no_headers[trip_id] #note this is not a copy!
This is the same as transposing a matrix. Note this last indexing order is not suitable for adding rows.
You would have to be more specific on your requirements to say objectively whether or not you should use an sqlite db. (Although I would tend to lean towards yes, if you will be storing more than one instance of this kind of data).
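For comparison, here is a minimal sketch of the SQLite option using the standard-library sqlite3 module with an in-memory database. The table name trips and the choice of trip_id as primary key are assumptions; INSERT OR IGNORE is used so that a row crawled twice is stored only once, which covers the uniqueness requirement from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path instead to persist the data
conn.execute(
    """CREATE TABLE trips (
           trip_id    INTEGER PRIMARY KEY,
           service_id INTEGER,
           route_id   INTEGER,
           shape_id   INTEGER
       )"""
)

crawled = [
    (4503599630773892, 4503599630773892, 11821949021891677, 4503599630773892),
    (4503599630773894, 4503599630773892, 11821949021891677, 4503599630773892),
    # Same trip_id as the first row: ignored because of the primary key.
    (4503599630773892, 4503599630773892, 11821949021891677, 4503599630773892),
]
conn.executemany("INSERT OR IGNORE INTO trips VALUES (?, ?, ?, ?)", crawled)
conn.commit()

stored = conn.execute(
    "SELECT trip_id, route_id FROM trips ORDER BY trip_id"
).fetchall()
print(stored)
```

For a few dozen rows this is more machinery than a list or dict needs, but it pays off once the data has to be queried by several columns or kept between runs.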
You should be aware, however, that in Python versions before 3.7 a plain dict does not guarantee item order (you would need OrderedDict for that), and that in any version a dict is not accessible by positional index.
I would actually suggest you make your table a list of objects rather than a dict of columns where you would need to look up matching values across lists.
trips = [
    {
        "trip_id": "4503599630773892",
        "service_id": "4503599630773892",
        "route_id": "4503599630773892",
        "shape_id": "4503599630773892"
    },
    {
        "trip_id": "4503599630773892",
        "service_id": "4503599630773892",
        "route_id": "4503599630773892",
        "shape_id": "4503599630773892"
    }
]
etc.
The reason being that lookup would be much easier, using filter() or just a for loop. The equivalent process for the structure you have now would involve filtering a single column, finding matching values by index, then basically compiling this data structure yourself every time anyway (And you would have to worry about maintaining order very strictly and avoiding mismatching column lengths).
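As a small illustration of that lookup, here is a sketch using a list comprehension over a list of dicts. The helper name find_trips is hypothetical, and the sample rows are cut down to two columns with values taken from the table in the question.

```python
trips = [
    {"trip_id": 4503599630773892, "route_id": 11821949021891677},
    {"trip_id": 4503599630810392, "route_id": 11821949021891678},
    {"trip_id": 4503599630810394, "route_id": 11821949021891678},
]


def find_trips(trips, **criteria):
    """Return every trip whose fields match all the given key=value criteria."""
    return [
        trip for trip in trips
        if all(trip.get(key) == value for key, value in criteria.items())
    ]


on_route = find_trips(trips, route_id=11821949021891678)
trip_ids = [trip["trip_id"] for trip in on_route]
print(trip_ids)
```

The same selection could be written with filter() and a lambda; the comprehension is just the more common spelling.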
