Create a dataframe of missing data from XML in Python

I am extremely new to Python (< 1 week), and I hope to read three "variables" - PMID, AbstractText, and Mesh - into a dataframe. My .xml file is 10 GB.
Right now the following code produces a list of the PMIDs and the AbstractTexts. How would I convert it into a dataframe with 3 variables - PMID, AbstractText, and Mesh - in which each DescriptorName within the Mesh from the XML is separated by a comma (for example: Adenocarcinoma, Antineoplastic Agents, Colorectal Neoplasms)? Please note that the snippet below shows only 1 PMID; there are about 1.8 million in total.
Please also note that some PMIDs contain no AbstractText or Mesh at all; in that case, I would like NA or "" to stand in for the missing value in that row.
import xml.etree.cElementTree as etree

# read in all PMIDs and Abstract Texts - got too scared to parse in Mesh
# incorrectly since it's very time consuming to re-run
pmid_abstract = []
for event, element in etree.iterparse("pubmed_result.xml"):
    if element.tag in ["PMID", "AbstractText"]:
        pmid_abstract.append(element.text)
    element.clear()
The following shows only the relevant tags in the .xml, for one PMID:
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">29310420</PMID>
<Article PubModel="Print">
<Abstract>
<AbstractText Label="RATIONALE" NlmCategory="BACKGROUND">Regorafenib is the new standard third-line therapy in metastatic colorectal cancer (mCRC). However, the reported 1-year overall survival rate does not exceed 25%.</AbstractText>
<AbstractText Label="PATIENT CONCERNS" NlmCategory="UNASSIGNED">A 55-year-old man affected by mCRC, treated with regorafenib combined with stereotactic body radiotherapy (SBRT), showing a durable response.</AbstractText>
</Abstract>
</Article>
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D000230" MajorTopicYN="N">Adenocarcinoma</DescriptorName>
<QualifierName UI="Q000000981" MajorTopicYN="N">diagnostic imaging</QualifierName>
<QualifierName UI="Q000628" MajorTopicYN="Y">therapy</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D000970" MajorTopicYN="N">Antineoplastic Agents</DescriptorName>
<QualifierName UI="Q000627" MajorTopicYN="Y">therapeutic use</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName UI="D015179" MajorTopicYN="N">Colorectal Neoplasms</DescriptorName>
<QualifierName UI="Q000000981" MajorTopicYN="N">diagnostic imaging</QualifierName>
<QualifierName UI="Q000628" MajorTopicYN="Y">therapy</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
</PubmedArticle>

This might be what you want.
I appended a copy of the complete PubmedArticle element that you posted to the end of the xml file, and then enclosed the two elements in a single PubmedArticles (plural) element, to demonstrate the principle involved in processing such files. Because your file is so large, I chose to put temporary results into a sql database table and then import them from there into pandas.
The first time through the loop there is no record to process. After that, each PMID element encountered implies that the previous PubmedArticle has been completely processed and is ready to be stored in the database. As other elements are encountered, they are simply inserted into the dictionary representing the current article.
from xml.etree import ElementTree
import sqlite3
import pandas as pd

conn = sqlite3.connect('ragtime.db')
c = conn.cursor()
c.execute('DROP TABLE IF EXISTS ragtime')
c.execute('''CREATE TABLE ragtime (PMID text, AbstractText text, DescriptorName text)''')

record = None
for ev, el in ElementTree.iterparse('ragtime.xml'):
    if el.tag == 'PMID':
        # a new PMID means the previous article is complete; store it
        if record:
            c.execute('INSERT INTO ragtime VALUES (?, ?, ?)',
                      [record['PMID'], ' '.join(record['AbstractText']),
                       ','.join(record['DescriptorName'])])
        record = {'PMID': el.text, 'AbstractText': [], 'DescriptorName': []}
    elif el.tag == 'AbstractText':
        record['AbstractText'].append(el.text or '')  # guard against empty elements
    elif el.tag == 'DescriptorName':
        record['DescriptorName'].append(el.text or '')
    else:
        pass

# the last record has no following PMID to trigger its INSERT; store it here
c.execute('INSERT INTO ragtime VALUES (?, ?, ?)',
          [record['PMID'], ' '.join(record['AbstractText']),
           ','.join(record['DescriptorName'])])
conn.commit()

df = pd.read_sql_query('SELECT * FROM ragtime', conn)
print(df.head())
conn.close()
It produces the following printed result.
       PMID                                       AbstractText  \
0  29310420  Regorafenib is the new standard third-line the...
1  29310425  Regorafenib is the new standard third-line the...

                                      DescriptorName
0  Adenocarcinoma,Antineoplastic Agents,Colorecta...
1  Adenocarcinoma,Antineoplastic Agents,Colorecta...
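With about 1.8 million articles, reading the whole table back into a single DataFrame may strain memory. A hedged sketch of a chunked read instead (the chunksize value is an arbitrary assumption):

import sqlite3
import pandas as pd

conn = sqlite3.connect('ragtime.db')
# chunksize makes read_sql_query yield DataFrames of up to 100000 rows each
for chunk in pd.read_sql_query('SELECT * FROM ragtime', conn, chunksize=100000):
    ...  # process each chunk here
conn.close()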


sqlite3.OperationalError: table A has X columns but Y values were supplied

Firstly, I've read all the similar questions and applied the solutions listed there: [1] SQLite with Python "Table has X columns but Y were supplied", [2] "Table has X columns but Y values were supplied" error, [3] sqlite3.OperationalError: table card_data has 11 columns but 10 values were supplied, [4] sqlite3.OperationalError: table book has 6 columns but 5 values were supplied, [5] Sqlite3 Table users has 7 columns but 6 values were supplied. Yet none of them worked out.
I am creating 26 tables and successfully inserting data using this:
im.execute("""CREATE TABLE IF NOT EXISTS C_Socket7 (Date_Requested, Time_Requested, Time_Passed, Energy_Consumption, Cumulative_Consumption, Hourly_Consumption, Daily_Consumption, Weekly_Consumption, Monthly_Consumption)""")
string = """INSERT INTO {} VALUES ('{}','{}','{}','{}','{}','{}','{}','{}')""".format('C_'+tables[7], tstr, dict4DB['timePassed'], dict4DB['Socket7Consp'],DummyConsumptions['Cumulative'], DummyConsumptions['Hourly'], DummyConsumptions['Daily'], DummyConsumptions['Weekly'], DummyConsumptions['Monthly'])
and this code:
im.execute("""CREATE TABLE IF NOT EXISTS A_Status (Time_Requested, Socket0Status, Socket1Status, Socket2Status, Socket3Status, Socket4Status, Socket5Status, Socket6Status, Socket7Status)""")
string = """INSERT INTO {} VALUES ('{}','{}','{}','{}','{}','{}','{}','{}','{}')""".format('A_'+tables[7], tstr, dict4DB['Socket0Stat'], dict4DB['Socket1Stat'],dict4DB['Socket2Stat'], dict4DB['Socket3Stat'], dict4DB['Socket4Stat'], dict4DB['Socket5Stat'], dict4DB['Socket6Stat'], dict4DB['Socket7Stat'])
But when it comes to this table:
im.execute("""CREATE TABLE IF NOT EXISTS A_Environment (Date_Requested, Time_Requested, Voltage, Frequency, In_Temperature, In_Humidity, Ext_Temperature, Ext_Humidity, Door_Switch, Door_Relay)""")
string = """INSERT INTO A_Environment(Date_Requested, Time_Requested, Voltage, Frequency, In_Temperature, In_Humidity, Ext_Temperature, Ext_Humidity, Door_Switch, Door_Relay) VALUES ('{}','{}', '{}','{}','{}','{}','{}','{}','{}','{}')""".format(d_str, h_str, dict4DB['Voltage'], dict4DB['Frequency'],dict4DB['InTemp'], dict4DB['InHumidity'], dict4DB['ExtTemp'], dict4DB['ExtHumidity'], dict4DB['DoorSwitch'], dict4DB['DoorRelay'])
It gives the error:
sqlite3.OperationalError: table A_Environment has 10 columns but 9 values were supplied
When I check the database using "DB Browser for SQLite", I can't see the column names. When I create columns manually using the interface, I get the same error.
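One way to see which columns SQLite actually recorded, independent of DB Browser, is to ask the database itself. A minimal sketch (the database path is an assumption):

import sqlite3

conn = sqlite3.connect('my_database.db')  # path is an assumption
# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column
for cid, name, coltype, notnull, dflt_value, pk in conn.execute(
        "PRAGMA table_info(A_Environment)"):
    print(cid, name, coltype)
conn.close()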
It was working fine when the structure was like this:
im.execute("""CREATE TABLE IF NOT EXISTS A_Environment (Time_Requested, Voltage, Frequency, In_Temperature, In_Humidity, Ext_Temperature, Ext_Humidity, Door_Switch, Door_Relay)""")
string = """INSERT INTO {} VALUES ('{}','{}','{}','{}','{}','{}','{}','{}','{}')""".format('A_'+tables[6], tstr, dict4DB['Voltage'], dict4DB['Frequency'],dict4DB['InTemp'], dict4DB['InHumidity'], dict4DB['ExtTemp'], dict4DB['ExtHumidity'], dict4DB['DoorSwitch'], dict4DB['DoorRelay'])
I've replaced 'tstr' with 'd_str' and 'h_str', separating the date and time into different columns.
I've tried these:
- Created the columns manually. Got the same result.
- Dropped the table and let "CREATE TABLE IF NOT EXISTS" recreate it. It created the table, but again with no columns in it. Added the columns manually. Did not work.
- Deleted the whole database and let it be created from scratch. Got the same result.
- Cast the values provided to .format() as str(d_str) and str(h_str). That did not work either.
I was planning to build a minimal working snippet to see which value could not be read, but after seeing that the columns are not being created, I no longer think the problem is with the data being written.
I've read the documentation but could not find something I can use.
What do you recommend? What should I do?
PS: SQLite is running on a Raspberry Pi, Raspbian OS, Python 3.7, SQLite3.
You should never use .format() (or %, or f-strings) to generate SQL statements. Your issue possibly stems (hard to tell, since we don't see your input data) from an unescaped value being wrongly interpreted by SQLite - in effect, you're your own SQL injection vulnerability.
Instead, use the placeholder system provided by your database driver, e.g. like this for sqlite3:
im.execute(
    """
    INSERT INTO
        A_Environment(
            Date_Requested,
            Time_Requested,
            Voltage,
            Frequency,
            In_Temperature,
            In_Humidity,
            Ext_Temperature,
            Ext_Humidity,
            Door_Switch,
            Door_Relay
        )
    VALUES
        (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """,
    (
        d_str,
        h_str,
        dict4DB["Voltage"],
        dict4DB["Frequency"],
        dict4DB["InTemp"],
        dict4DB["InHumidity"],
        dict4DB["ExtTemp"],
        dict4DB["ExtHumidity"],
        dict4DB["DoorSwitch"],
        dict4DB["DoorRelay"],
    ),
)

function to reduce redundancy for reading database in sqlite3

Hi guys :) I'm a newbie at programming and would like to ask for help in creating a function to reduce redundancy in my code. I have successfully created a database holding 5 different tables, one for each country's data. All tables have the same structure (see the attached screenshots for reference). My objective is to calculate the sum of all rows within each table for a particular parameter (type of pollution).
I have managed to write code that selects only the particular data I need for one country. (I tried writing code to calculate the summation but couldn't figure it out, so I decided to just select the data and calculate the values manually with a calculator - I know that sort of defeats the purpose of programming, but at my beginner level it feels like the only way I can do it.) My issue is that I have five countries, so I don't want to repeat the same block of code for each one. This is my code for one country:
import sqlite3

def read_MaltaData():
    conn = sqlite3.connect('FinalProjectDatabase.sqlite3')
    Malta = conn.cursor()
    Malta.execute("SELECT * FROM MaltaData WHERE AirPollutant = 'PM10'")
    result = Malta.fetchall()
    print(result)
my result is this:
[('Malta', 'Valletta', 'MT00005', 'Msida', 'PM10', 64.3, 'ug/m3', 'Traffic', 'urban', 14.489985999999998, 35.895835999489535, 2.0), ('Malta', None, etc.
(I am going to manually calculate the data I require -in this case 64.3 + the value from the next row- as I don't know how to do it in python)
To clarify, my aim isn't to have a sum total across all the tables as one whole value (i.e. I don't want to add the values of all the countries all together). My desired output should look something like this:
Malta summation value
italy summation value
france summation value
and not like this
countries all together = one whole value (i.e. all summation values added together)
I would greatly appreciate any help I can get. Unfortunately I am not able to share the database with you, which is why I am sharing screenshots of it instead.
(Screenshots omitted: one showing all 5 tables in the database, one showing a single table - all tables look the same, just with different values.)
You can use UNION ALL to get a row for each country:
SELECT 'France' country, SUM(AirPollutionLevel) [summation value] FROM FranceData WHERE AirPollutant = 'PM10'
UNION ALL
SELECT 'Germany' country, SUM(AirPollutionLevel) [summation value] FROM GermanyData WHERE AirPollutant = 'PM10'
UNION ALL
SELECT 'Italy' country, SUM(AirPollutionLevel) [summation value] FROM ItalyData WHERE AirPollutant = 'PM10'
UNION ALL
SELECT 'Malta' country, SUM(AirPollutionLevel) [summation value] FROM MaltaData WHERE AirPollutant = 'PM10'
UNION ALL
SELECT 'Poland' country, SUM(AirPollutionLevel) [summation value] FROM PolandData WHERE AirPollutant = 'PM10'
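To run the combined query from Python, a minimal sketch (only two of the five SELECTs shown; the database file name is taken from the question) could look like this:

import sqlite3

conn = sqlite3.connect('FinalProjectDatabase.sqlite3')
query = """
SELECT 'France' country, SUM(AirPollutionLevel) total
  FROM FranceData WHERE AirPollutant = 'PM10'
UNION ALL
SELECT 'Malta' country, SUM(AirPollutionLevel) total
  FROM MaltaData WHERE AirPollutant = 'PM10'
"""
# each row is one (country, summation value) pair
for country, total in conn.execute(query):
    print(country, total)
conn.close()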
If you pass the country name as an argument to the data-retrieval function, you can generate the table names dynamically (note the f-strings in execute and print):
First draft
import sqlite3

def print_CountryData(country):
    conn = sqlite3.connect('FinalProjectDatabase.sqlite3')
    cur = conn.cursor()
    cur.execute(f"SELECT SUM(AirPollutionLevel) FROM {country}Data WHERE AirPollutant = 'PM10'")
    sumVal = cur.fetchone()[0]
    print(f"{country} {sumVal}")

# example call:
for country in ('France', 'Germany', 'Italy', 'Malta', 'Poland'):
    print_CountryData(country)
While building query strings yourself with plain string functions is discouraged in the sqlite3 documentation for security reasons, in your case, where you have total control of the actual arguments, I'd consider it safe.
This answer adapts the summation from the great answer given by forpas, but declines to move the repetition into SQL. It also shows integration with Python and output formatting.
MRE-style version
This is an improved version of my first answer, transformed into a Minimal, Reproducible Example and combined with output. Some performance improvements were also made, for instance opening the database only once.
import sqlite3
import random  # to simulate actual pollution values

# Countries we have data for
countries = ('France', 'Germany', 'Italy', 'Malta', 'Poland')

# There is one table for each country
def tableName(country):
    return country + 'Data'

# Generate minimal versions of the tables, filled with random data
def setup_CountryData(cur):
    for country in countries:
        cur.execute(f'''CREATE TABLE {tableName(country)}
                        (AirPollutant text, AirPollutionLevel real)''')
        for i in range(5):
            cur.execute(f"""INSERT INTO {tableName(country)} VALUES
                            ('PM10', {100*random.random()})""")

# Print the summed-up pollution data for each country
def print_CountryData(cur):
    for country in countries:
        cur.execute(f"""SELECT SUM(AirPollutionLevel) FROM
                        {tableName(country)} WHERE AirPollutant = 'PM10'""")
        sumVal = cur.fetchone()[0]
        print(f"{country:10} {sumVal:9.5f}")

# For testing, we use an in-memory database
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
setup_CountryData(cur)

# The functionality actually required
print_CountryData(cur)
Sample output:
France     263.79430
Germany    245.20942
Italy      225.72068
Malta      167.72690
Poland     290.64190
It's often hard to evaluate a solution without actually trying it. That's why questioners on StackOverflow are constantly encouraged to ask in this style: it makes it much more likely that someone will understand and fix the problem ... quickly.
If the database is not too big you could use pandas.
This approach is less efficient than using SQL queries directly, but it can be used if you want to explore the data interactively, in a notebook for example.
You can create a dataframe from your SQLite db using pandas.read_sql_query and then perform your calculation with pandas.DataFrame methods, which are designed for this type of task.
For your specific case:
import sqlite3
import pandas as pd

db_file = 'FinalProjectDatabase.sqlite3'  # your database file
conn = sqlite3.connect(db_file)

query = "SELECT * FROM MaltaData WHERE AirPollutant = 'PM10'"
df = pd.read_sql_query(query, conn)

# check dataframe content
print(df.head())
If I understood correctly, you then want to compute the sum of the values in a given column:
s = df['AirPollutionLevel'].sum()
If you have missing values you might want to fill them with 0s before summing:
s = df['AirPollutionLevel'].fillna(0).sum()
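If you want the per-country output from pandas as well, here is a hedged sketch that loads all five tables (table and column names taken from the question) and prints one sum per country:

import sqlite3
import pandas as pd

conn = sqlite3.connect('FinalProjectDatabase.sqlite3')
frames = []
for country in ('France', 'Germany', 'Italy', 'Malta', 'Poland'):
    q = f"SELECT AirPollutionLevel FROM {country}Data WHERE AirPollutant = 'PM10'"
    frame = pd.read_sql_query(q, conn)
    frame['country'] = country  # tag each row with its source table
    frames.append(frame)

all_data = pd.concat(frames, ignore_index=True)
print(all_data.groupby('country')['AirPollutionLevel'].sum())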

Using ? within INSERT INTO of a executemany() in Python with Sqlite

I am trying to submit data to a Sqlite db through Python with executemany(). I am reading data from a JSON file and then placing it into the db. My problem is that the JSON creation is not under my control, and depending on who I get the file from, the order of its values is not the same each time. The keys are correct and correlate with the keys in the db, but I can't just toss the values at the executemany() function and have the data appear in the correct columns each time.
Here is what I need to be able to do.
keyTuple = (name, address, telephone)
listOfTuples = [(name1, address1, telephone1),
                (name2, address2, telephone2),
                (...)]
cur.executemany("INSERT INTO myTable(?,?,?)", keysTuple"
                "VALUES(?,?,?)", listOfTuples)
The problem I have is that some JSON files have the order "name, telephone, address" or some other order. I need to be able to feed my keysTuple into the INSERT portion of the command so I can keep my relations straight, no matter what order the JSON file comes in, without having to completely rebuild listOfTuples. I know there has got to be a way, but what I have written doesn't match the right syntax for the INSERT portion. The VALUES line works just fine; it uses each element in listOfTuples.
Sorry if I am not asking with the correct verbiage. FNG here and this is my first post. I have looked all over the web, but it only produces examples of using ? in the VALUES portion, never in the INSERT INTO portion.
You cannot use SQL parameters (?) for table/column names.
But when you already have the column names in the correct order, you can simply join them in order to be able to insert them into the SQL command string:
>>> keyTuple = ("name", "address", "telephone")
>>> "INSERT INTO MyTable(" + ",".join(keyTuple) + ")"
'INSERT INTO MyTable(name,address,telephone)'
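Putting the two halves together - joined (and whitelisted) column names for the INSERT part, ? placeholders for the values - a hedged sketch, with table and column names following the example above:

import sqlite3

allowed = {'name', 'address', 'telephone'}   # known columns of MyTable
keyTuple = ('name', 'address', 'telephone')  # order as found in this JSON file
if not set(keyTuple) <= allowed:
    raise ValueError('unexpected column name in input')

sql = 'INSERT INTO MyTable({}) VALUES ({})'.format(
    ','.join(keyTuple), ','.join('?' * len(keyTuple)))
# sql == 'INSERT INTO MyTable(name,address,telephone) VALUES (?,?,?)'

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE MyTable(name text, address text, telephone text)')
listOfTuples = [('name1', 'address1', 'telephone1'),
                ('name2', 'address2', 'telephone2')]
conn.executemany(sql, listOfTuples)
conn.commit()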
Try this
Example: if you have a table named products with the following fields:
Prod_Name Char( 30 )
UOM Char( 10 )
Reference Char( 10 )
Const Float
Price Float
import sqlite3

c = sqlite3.connect('products.db').cursor()  # database file name assumed

list_products = [('Garlic', '5 Gr.', 'Can', 1.10, 2.00),
                 ('Beans', '8 On.', 'Bag', 1.25, 2.25),
                 ('Apples', '1 Un.', 'Unit', 0.25, 0.30),
                 ]
c.executemany('Insert Into products Values (?,?,?,?,?)', list_products)

List/Dict structure issue

I'm confused about how to structure a list/dict I need. I have scraped three pieces of info off ESPN: Conference, Team, and the link to each team's homepage for future stat scraping.
When the program first runs, I'd like to build a dictionary/list so that one can type in a school and it prints the conference the school is in, OR one can select an entire conference and it prints the corresponding list of schools. The end user doesn't need to know about the link associated with each school, but it is important that the correct link is associated with the correct school so that future stats from that specific school can be scraped.
For example, the info scraped is:
SEC, UGA, www.linka.com
ACC, FSU, www.linkb.com
etc...
I know I could create a list of dictionaries like:
sec_list=[{UGA: www.linka.com, Alabama: www.linkc.com, etc...}]
acc_list=[{FSU: www.linkb.com, etc...}]
The problem is I'd have to create about 26 lists here to hold every conference, which sounds excessive. Is there a way to lump everything into one structure but still have the ability to extract the schools in a specific conference, or to search for a school and get the correct conference back? Of course, the link must also correspond to the correct school.
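One way to lump everything into a single structure, sketched here from the sample rows above, is one dict keyed by school:

# one dict keyed by school; each value holds that school's conference and link
schools = {
    'UGA': {'conf': 'SEC', 'link': 'www.linka.com'},
    'FSU': {'conf': 'ACC', 'link': 'www.linkb.com'},
}

# school -> conference
print(schools['UGA']['conf'])  # SEC

# conference -> schools
print([s for s, info in schools.items() if info['conf'] == 'SEC'])  # ['UGA']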
Python ships with sqlite3 to handle database problems, and it has a :memory: mode for in-memory databases. I think it will solve your problem directly and with clear code.
import sqlite3
from pprint import pprint

# Load the data as a list of tuples in the form [(conf, school, link), ...]
data = [('SEC', 'UGA', 'www.linka.com'),  # sample rows; in practice this
        ('ACC', 'FSU', 'www.linkb.com')]  # comes from your scraper

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE Espn (conf text, school text, link text)')
c.execute('CREATE INDEX Cndx ON Espn (conf)')
c.execute('CREATE INDEX Sndx ON Espn (school)')
c.executemany('INSERT INTO Espn VALUES (?, ?, ?)', data)
conn.commit()

# Run queries
pprint(c.execute("SELECT * FROM Espn WHERE conf = 'Big10'").fetchall())
pprint(c.execute("SELECT * FROM Espn WHERE school = 'Alabama'").fetchall())
In-memory databases are so easy to create and query that they are often the simplest solution to the problem of having multiple lookup keys and doing analytics on relational data. Trying to use dicts and lists for this kind of work just makes the problem unnecessarily complicated.
It's true you can do this with a list of dictionaries, but you might find it easier to be able to look up information with named fields. In that case, I'd recommend storing your scraped data in a Pandas DataFrame.
You want it so that "one can type in a school and it would print the conference the school is in OR one could select an entire conference and it would print the corresponding list of schools".
Here's an example of what that would look like, using Pandas and a couple of convenience functions.
First, some example data:
confs = ['ACC', 'Big10', 'BigEast', 'BigSouth', 'SEC',
         'ACC', 'Big10', 'BigEast', 'BigSouth', 'SEC']
teams = ['school{}'.format(x) for x in range(10)]
links = ['www.{}.com'.format(x) for x in range(10)]
scrape = list(zip(confs, teams, links))  # list() so the result can be reused
[('ACC', 'school0', 'www.0.com'),
('Big10', 'school1', 'www.1.com'),
('BigEast', 'school2', 'www.2.com'),
('BigSouth', 'school3', 'www.3.com'),
('SEC', 'school4', 'www.4.com'),
('ACC', 'school5', 'www.5.com'),
('Big10', 'school6', 'www.6.com'),
('BigEast', 'school7', 'www.7.com'),
('BigSouth', 'school8', 'www.8.com'),
('SEC', 'school9', 'www.9.com')]
Now convert to DataFrame:
import pandas as pd
df = pd.DataFrame.from_records(scrape, columns=['conf','school','link'])
conf school link
0 ACC school0 www.0.com
1 Big10 school1 www.1.com
2 BigEast school2 www.2.com
3 BigSouth school3 www.3.com
4 SEC school4 www.4.com
5 ACC school5 www.5.com
6 Big10 school6 www.6.com
7 BigEast school7 www.7.com
8 BigSouth school8 www.8.com
9 SEC school9 www.9.com
Type in school, get conference:
def get_conf(df, school):
    return df.loc[df.school == school, 'conf'].values

get_conf(df, school='school1')
['Big10']
Type in conference, get schools:
def get_schools(df, conf):
    return df.loc[df.conf == conf, 'school'].values

get_schools(df, conf='Big10')
['school1' 'school6']
It's unclear from your question whether you also want the links associated with schools returned when searching by conference. If so, just update get_schools() to:
def get_schools(df, conf):
    return df.loc[df.conf == conf, ['school', 'link']].values
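A hedged example of calling the updated function against the example frame above:

get_schools(df, conf='Big10')
# array([['school1', 'www.1.com'],
#        ['school6', 'www.6.com']], dtype=object)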

Storing a List into Python Sqlite3

I am trying to scrape form field IDs using Beautiful Soup, like this:
for link in BeautifulSoup(content, parseOnlyThese=SoupStrainer('input')):
    if link.has_key('id'):
        print link['id']
Let us assume that it returns something like
username
email
password
passwordagain
terms
button_register
I would like to write this into a Sqlite3 DB.
What I will be doing down the line in my application is... use these form fields' IDs and try to do a POST, maybe. The problem is there are plenty of sites like this whose form field IDs I have scraped. So the relation is like this...
Domain1 - First list of Form Fields for this Domain1
Domain2 - Second list of Form Fields for this Domain2
.. and so on
What I am unsure about here is... how should I design my columns for this kind of purpose? Will it be OK if I just create a table with two columns - say
COL 1 - Domain URL (as TEXT)
COL 2 - List of Form Field IDs (as TEXT)
One thing to be remembered is... Down the line in my application I will need to do something like this...
Pseudocode
If Domain is "http://somedomain.com":
    For every item in COL2 (which is a list of form field ids):
        Assign some set of values to each of the form fields & then make a POST request
Can any one guide, please?
EDITed on 22/07/2011 - Is My Below Database Design Correct?
I have decided to have a solution like this. What do you guys think?
I will be having three tables like below
Table 1
Key Column (Auto Generated Integer) - Primary Key
Domain as TEXT
Sample Data would be something like:
1 http://url1.com
2 http://url2.com
3 http://url3.com
Table 2
Domain (Here I will be using the Key Number from Table 1)
RegLink - This will have the registration link (as TEXT)
Form Fields (as Text)
Sample Data would be something like:
1 http://url1.com/register field1
1 http://url1.com/register field2
1 http://url1.com/register field3
2 http://url2.com/register field1
2 http://url2.com/register field2
2 http://url2.com/register field3
3 http://url3.com/register field1
3 http://url3.com/register field2
3 http://url3.com/register field3
Table 3
Domain (Here I will be using the Key Number from Table 1)
Status (as TEXT)
User (as TEXT)
Pass (as TEXT)
Sample Data would be something like:
1 Pass user1 pass1
2 Fail user2 pass2
3 Pass user3 pass3
Do you think this table design is good? Or are there any improvements that can be made?
There is a normalization problem in your table.
Using 2 tables with

TABLE domains
    int id primary key
    text name

TABLE field_ids
    int id primary key
    int domain_id foreign key ref domains
    text value

is a better solution.
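A minimal sqlite3 sketch of that two-table layout (names follow the schema above) might be:

import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory for illustration
c = conn.cursor()
c.execute('''CREATE TABLE domains (
                 id INTEGER PRIMARY KEY,
                 name TEXT)''')
c.execute('''CREATE TABLE field_ids (
                 id INTEGER PRIMARY KEY,
                 domain_id INTEGER REFERENCES domains(id),
                 value TEXT)''')

# one row per domain, one row per field id
c.execute("INSERT INTO domains (name) VALUES (?)", ('http://url1.com',))
domain_id = c.lastrowid
c.executemany("INSERT INTO field_ids (domain_id, value) VALUES (?, ?)",
              [(domain_id, f) for f in ('field1', 'field2', 'field3')])
conn.commit()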
Proper database design would suggest you have a table of URLs and a table of fields, each referencing a URL record. But depending on what you want to do with them, you could pack the lists into a single column. See the docs for how to go about that.
Is sqlite a requirement? It might not be the best way to store the data. E.g. if you need random-access lookups by URL, the shelve module might be a better bet. If you just need to record them and iterate over the sites, it might be simpler to store them as CSV.
Try this to get the ids:
ids = (link['id'] for link in
       BeautifulSoup(content, parseOnlyThese=SoupStrainer('input'))
       if link.has_key('id'))
And this should show you how to save them, load them, and do something to each. This uses a single table and just inserts one row for each field for each domain. It's the simplest solution, and perfectly adequate for a relatively small number of rows of data.
from itertools import repeat  # izip is Python 2; plain zip suffices in Python 3
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('''create table domains
             (domain text, linkid text)''')

domain_to_insert = 'domain_name'
ids = ['id1', 'id2']
c.executemany("""insert into domains
                 values (?, ?)""", zip(repeat(domain_to_insert), ids))
conn.commit()

domain_to_select = 'domain_name'
c.execute("""select * from domains where domain=?""", (domain_to_select,))

# this is just an example
def some_function_of_row(row):
    return row[1] + ' value'

fields = dict((row[1], some_function_of_row(row)) for row in c)
print(fields)
c.close()
