Delete index in list if multiple strings are matched - python

I've scraped a website containing a table and I want to format the headers for my desired final output.
headers = []
for row in table.findAll('tr'):
    for item in row.findAll('th'):
        for link in item.findAll('a', text=True):
            headers.append(link.contents[0])
print headers
Which returns:
[u'Rank ', u'University Name ', u'Entry Standards', u'Click here to read more', u'Student Satisfaction', u'Click here to read more', u'Research Quality', u'Click here to read more', u'Graduate Prospects', u'Click here to read more', u'Overall Score', u'Click here to read more', u'\r\n 2016\r\n ']
I don't want the 'Click here to read more' or '2016' headers, so I've done the following:
for idx, i in enumerate(headers):
    if 'Click' in i:
        del headers[idx]
for idx, i in enumerate(headers):
    if '2016' in i:
        del headers[idx]
Which returns:
[u'Rank ', u'University Name ', u'Entry Standards', u'Student Satisfaction', u'Research Quality', u'Graduate Prospects', u'Overall Score']
Perfect. But is there a better/neater way of removing the unwanted items? Thanks!

headers = filter(lambda h: not 'Click' in h and not '2016' in h, headers)
If you want to be more generic:
banned = ['Click', '2016']
headers = filter(lambda h: not any(b in h for b in banned), headers)

You can consider using list comprehension to get a new, filtered list, something like:
new_headers = [header for header in headers if '2016' not in header]

If you can be sure that '2016' will always be last:
>>> [x for x in headers[:-1] if 'Click here' not in x]
['Rank ', 'University Name ', 'Entry Standards', 'Student Satisfaction', 'Research Quality', 'Graduate Prospects', 'Overall Score']

import re
pattern = '^Click|^2016'
new = [x for x in headers if not re.match(pattern, str(x).strip())]
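One caveat about the loop in the question: deleting from a list while iterating over it with enumerate() shifts the remaining items left, so an unwanted item that directly follows another unwanted item is skipped. It works on the scraped headers only because no two unwanted entries are adjacent. A minimal sketch with made-up data (the sample list and the helper name drop_banned are hypothetical, not from the question), written for Python 3:

```python
# Hypothetical sample data: two unwanted entries sit next to each other.
items = ['Rank ', 'Click here', 'Click here', 'Overall Score']

# The in-place approach from the question: the second 'Click here' survives,
# because after the deletion it moves into the index that was just visited.
in_place = list(items)
for idx, item in enumerate(in_place):
    if 'Click' in item:
        del in_place[idx]
print(in_place)


def drop_banned(values, banned):
    """Return a new list without the entries that contain any banned substring."""
    return [v for v in values if not any(b in v for b in banned)]


print(drop_banned(items, ['Click', '2016']))
```

Building a new list, as the answers above do, avoids the problem because the list being iterated is never modified.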

Related

How to format list data and write to csv file in selenium python?

I'm getting data from a website and storing it in list variables. Now I need to write this data to a CSV file.
The data scraped from the website is printed below.
['Company Name: PATRY PLC', 'Contact Name: Jony Deff', 'Company ID: 234567', 'CS ID: 236789', 'MI/MC:', 'Road Code:']
['Mailing Address:', 'Street: 19700 I-45 Spring, TX 77373', 'City: SPRING', 'State: TX', 'Postal Code: 77388', 'Country: US']
['Physical Address:', 'Street: 1500-1798 Runyan Ave Houston, TX 77039, USA', 'City: HOUSTON', 'State: TX', 'Postal Code: 77039', 'Country: US']
['Registration Period', 'Registration Date/Time', 'Registration ID', 'Status']
['2020-2025', 'MAY-10-2020 15:54:12', '26787856889l', 'Active']
I'm using a for loop to collect this data with the code below:
listdata6 = []
for c6 in cells6:
    listdata6.append(c6.text)
Now I have all the data in five list variables. How can I write it to a CSV file in the format below?
You seem to want two header rows.
But I'm afraid your CSV interpreter (which seems to be MS Excel) won't be able to merge cells the way your screenshot shows.
Given the structure of your data (five lists where keys and values are mixed), you will probably have to construct both header rows semi-manually.
Here is the code:
company_info = ['Company Name: PATRY PLC', 'Contact Name: Jony Deff', 'Company ID: 234567', 'CS ID: 236789', 'MI/MC:', 'Road Code:']
mailaddr_info = ['Mailing Address:', 'Street: 19700 I-45 Spring, TX 77373', 'City: SPRING', 'State: TX', 'Postal Code: 77388', 'Country: US']
physaddr_info = ['Physical Address:', 'Street: 1500-1798 Runyan Ave Houston, TX 77039, USA', 'City: HOUSTON', 'State: TX', 'Postal Code: 77039', 'Country: US']
reg_data = ['Registration Period', 'Registration Date/Time', 'Registration ID', 'Status']
status_data = ['2020-2025', 'MAY-10-2020 15:54:12', '26787856889l', 'Active']
# composing 1st header's row
header1 = ''.join(',' for i in range(len(company_info))) # add commas
header1 += mailaddr_info[0].strip(':') # adds 1st item which is header of that data
header1 += ''.join(',' for i in range(1, len(mailaddr_info)))
header1 += physaddr_info[0].strip(':') # adds 1st item which is header of that data
header1 += ''.join(',' for i in range(1, len(physaddr_info)))
header1 += ''.join(',' for i in range(len(reg_data))) # add commas
# composing 2nd header's row
header2 = ','.join( item.split(':')[0].strip(' ') for item in company_info) + ','
header2 += ','.join( item.split(':')[0].strip(' ') for item in mailaddr_info[1:]) + ','
header2 += ','.join( item.split(':')[0].strip(' ') for item in physaddr_info[1:]) + ','
header2 += ','.join( item.split(':')[0].strip(' ') for item in reg_data)
# finally, the data row. Commas inside items are removed because some items contain commas,
# which would otherwise be read as separators. You can instead wrap comma-containing items
# in double quotes "", which CSV interpreters treat as text.
# Note: the section titles ('Mailing Address:', 'Physical Address:') have nothing after the colon,
# so each contributes an empty string and the join supplies the comma that separates the blocks.
data_row = ','.join( item.split(':')[-1].strip(' ') for item in company_info)
data_row += ','.join( item.split(':')[-1].strip(' ').replace(',','') for item in mailaddr_info)
data_row += ','.join( item.split(':')[-1].strip(' ').replace(',','') for item in physaddr_info)+ ','
data_row += ','.join( item for item in status_data)
# writing the data to CSV file
with open("test_file.csv", "w") as f:
    f.write(header1 + '\n')
    f.write(header2 + '\n')
    f.write(data_row + '\n')
If you import that file into MS Excel and set 'Comma' as the separator in the text import wizard, you will get the two header rows followed by the data row.
You can wrap it in a helper class that takes these five lists and exposes a write_csv() method to the outside world.
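The quoting idea mentioned in the code comments can be left to Python's csv module, which quotes any field containing a comma automatically, so the commas in the street addresses no longer have to be removed. This is a minimal sketch of that variant, not the method used above; the helper names split_items and build_rows are made up for illustration, and it writes to an in-memory buffer instead of test_file.csv:

```python
import csv
import io

company_info = ['Company Name: PATRY PLC', 'Contact Name: Jony Deff', 'Company ID: 234567', 'CS ID: 236789', 'MI/MC:', 'Road Code:']
mailaddr_info = ['Mailing Address:', 'Street: 19700 I-45 Spring, TX 77373', 'City: SPRING', 'State: TX', 'Postal Code: 77388', 'Country: US']
physaddr_info = ['Physical Address:', 'Street: 1500-1798 Runyan Ave Houston, TX 77039, USA', 'City: HOUSTON', 'State: TX', 'Postal Code: 77039', 'Country: US']
reg_data = ['Registration Period', 'Registration Date/Time', 'Registration ID', 'Status']
status_data = ['2020-2025', 'MAY-10-2020 15:54:12', '26787856889l', 'Active']


def split_items(items):
    """Split 'Key: value' strings into parallel lists of keys and values."""
    pairs = [item.partition(':') for item in items]
    keys = [key.strip() for key, _, _ in pairs]
    values = [value.strip() for _, _, value in pairs]
    return keys, values


def build_rows():
    """Return the two header rows and the data row as lists of fields."""
    header1, header2, data = [], [], []
    for title, items in (('', company_info),
                         (mailaddr_info[0].strip(':'), mailaddr_info[1:]),
                         (physaddr_info[0].strip(':'), physaddr_info[1:])):
        keys, values = split_items(items)
        # The block title goes above the first column of its block only.
        header1 += [title] + [''] * (len(keys) - 1)
        header2 += keys
        data += values
    header1 += [''] * len(reg_data)
    header2 += reg_data
    data += status_data
    return [header1, header2, data]


buffer = io.StringIO()
csv.writer(buffer, lineterminator='\n').writerows(build_rows())
print(buffer.getvalue())
```

To write a real file, pass a file opened with open('test_file.csv', 'w', newline='') to csv.writer instead of the buffer.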

Parse list to get new list with same structure

I applied some earlier code to a log and got the following list:
log = ['',
'',
'ABC KLSC: XYZ',
'',
'some text',
'some text',
'%%ABC KLSC: XYZ',
'some text',
'',
'ID = 5',
'TME = KRE',
'DDFFLE = SOFYU',
'QWWRTYA = GRRZNY',
'',
'some text',
'-----------------------------------------------',
'',
'QUWERW WALS RUSZ CRORS ELME',
'P <NULL> R 98028',
'P <NULL> R 30310',
'',
'',
'Some text',
'',
'Some text',
'',
'--- FINISH'
]
and I want to filter those lines to get a list containing only the lines that contain "=" and the lines that are laid out in columns (those below the headers QUWERW, WALS, RUSZ, CRORS). Additionally, for the column-formatted lines, I want to store each value with its corresponding header.
I was able to filter the desired lines with the code below (I'm not sure whether there is a better condition for filtering the column lines):
d1 = [line for line in log if len(line) > 50 or " = " in line]
d1
>>
[
'ID = 5',
'TME = KRE',
'DDFFLE = SOFYU',
'QWWRTYA = GRRZNY',
'QUWERW WALS RUSZ CRORS ELME',
'P <NULL> R 98028',
'P <NULL> R 30310',
]
But I don't know how to get the output I'm looking for, shown below. Thanks for any help.
[
'ID = 5',
'TME = KRE',
'DDFFLE = SOFYU',
'QWWRTYA = GRRZNY',
'QUWERW = P',
'WALS = <NULL>',
'RUSZ = R',
'CRORS = 98028',
'QUWERW = P',
'WALS = <NULL>',
'RUSZ = R',
'CRORS = 30310'
]
Finding the lines with = is straightforward. One way to find the column values is to identify the header row that contains the headings, and then zip it with each following row after splitting on whitespace.
items_list = []
for item in log:
    if '=' in item:
        items_list.append(item)
    elif len(item.split()) > 3:
        splits = item.split()
        if all(header in splits for header in ['QUWERW', 'WALS', 'RUSZ', 'CRORS']):
            headers = splits
        else:
            for lhs,rhs in zip(headers,splits):
                items_list.append(f'{lhs} = {rhs}')
print('\n'.join(items_list))

New Pandas Series longer than original dataset?

I have a data set with user, date, and post columns. I'm trying to generate a column of the calories contained in the foods mentioned in the post column for each user. The dataset has a length of 21. The code below finds the food words, gets their calorie values, appends them to that user's calorie list, and appends that list to the new column. The new column, however, somehow ends up with a length of 25:
Current data: 21
New column: 25
Does anybody know why this occurs? Here is the code below and samples of what the original dataset and the new column look like:
while len(col) < len(data['post']):
    for post, api_id, api_key in zip(data['post'], ids_keys.keys(), ids_keys.values()): # cycles through text data & api keys
        headers = {
            'Content-Type': 'application/x-www-form-urlencoded',
            'x-app-id': api_id,
            'x-app-key': api_key,
            'x-remote-user-id': '0'
        }
        calories = []
        print('Current data:', len(data['post']), '\n New column: ', len(col)) # prints length of post vs new cal column
        for word in eval(post):
            if word not in food:
                continue
            else:
                print('Detected Word: ', word)
                query = {'query': '{}'.format(word)}
                try:
                    response = requests.request("POST", url, headers=headers, data=query)
                except KeyError as ke:
                    print(ke, 'Out of calls, next key...')
                    ids_keys.pop(api_id) # drop current api id & key from dict if out of calls
                    print('API keys left:', len(ids_keys))
                finally:
                    stats = response.json()
                    print('Food Stats: \n', stats)
                    print('Calories in food: ', stats['foods'][0]['nf_calories'])
                    calories.append(stats['foods'][0]['nf_calories'])
                    print('Current Key', api_id, ':', api_key)
        col.append(calories)
    if len(col) == len(data['post']):
        break
I attempted to use the while loop to only append up to the length of the dataset, but to no avail.
Original Data Set:
pd.DataFrame({'user':['avskk', 'janejellyn', 'firlena227','...'],
'date': ['October 22', 'October 22', 'October 22','...'],
'post': [['autumn', 'fully', 'arrived', 'cooking', 'breakfast', 'toaster','...'],
['breakfast', 'chinese', 'sticky', 'rice', 'tempeh', 'sausage', 'cucumber', 'salad', 'lunch', 'going', 'lunch', 'coworkers', 'probably', 'black', 'bean', 'burger'],
['potato', 'bean', 'inspiring', 'food', 'day', 'today', '...']]
})
New Column:
pd.DataFrame({'Calories': [[22,33,45,32,2,5,7,9,76],
[43,78,54,97,32,56,97],
[23,55,32,22,7,99,66,98,54,35,33]]
})
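A possible explanation, assuming col.append(calories) runs once per post inside the for loop: zip() stops at its shortest input and starts again from the first post on every pass of the while loop, so the number of lists appended per pass follows the number of API keys rather than the number of posts, and the total can overshoot. A minimal sketch of that effect, followed by a loop that yields exactly one calorie list per post. The posts, key names, calorie values and lookup_calories are hypothetical stand-ins for the real data and API call:

```python
# Hypothetical stand-ins: three posts, two API keys, and a local calorie table
# with made-up values in place of the API response.
posts = [['rice', 'lunch'], ['bean', 'burger'], ['potato', 'day']]
ids_keys = {'id-1': 'key-1', 'id-2': 'key-2'}
CALORIES = {'rice': 206, 'bean': 31, 'burger': 354, 'potato': 163}


def lookup_calories(word):
    """Pretend API call: return the calories for a food word."""
    return CALORIES[word]


# Pattern from the question: each pass of the while loop restarts zip() at the
# first post and appends one list per (post, key) pair, so col grows by
# min(len(posts), len(ids_keys)) per pass and can end up longer than posts.
col = []
while len(col) < len(posts):
    for post, api_id in zip(posts, ids_keys):
        col.append([lookup_calories(w) for w in post if w in CALORIES])
print(len(posts), len(col))

# One list per post: iterate over the posts only.
calories_col = [[lookup_calories(w) for w in post if w in CALORIES] for post in posts]
print(len(posts), len(calories_col))
```

In this shape, API key rotation would be handled inside the lookup function rather than by zipping keys against posts.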

Convert a csv into category-subcategory using array

Above is the input table I have in CSV.
I am trying to use arrays and while loops in Python; I am new to this language. The loop should run twice to give the Category\sub-category\sub-category_1 order. I am trying to use split(). The output should be like below:
import csv
with open('D:\\test.csv', 'rb') as f:
    reader = csv.reader(f, delimiter='',quotechar='|')
    data = []
    for name in reader:
        data[name] = []
If you read the lines of your CSV into a dictionary, you can manipulate the data however you want later.
cats = {}
with open('my.csv', "r") as ins:
    # check each line of the file
    for line in ins:
        # remove double quotes: replace('"', '')
        # remove line break: rstrip()
        a = str(line).replace('"', '').rstrip().split('|')
        if a[0] != 'CatNo':
            cats[int(a[0])] = a[1:]
for p in cats:
    print 'cat_id: %d, value: %s' % (p, cats[p])
# you can access the value by the int ID
print cats[1001]
the output:
cat_id: 100, value: ['Best Sellers', 'Best Sellers']
cat_id: 1001, value: ['New this Month', 'New Products\\New this Month']
cat_id: 10, value: ['New Products', 'New Products']
cat_id: 1003, value: ['Previous Months', 'New Products\\Previous Months']
cat_id: 110, value: ['Promotional Material', 'Promotional Material']
cat_id: 120, value: ['Discounted Products & Special Offers', 'Discounted Products & Special Offers']
cat_id: 1002, value: ['Last Month', 'New Products\\Last Month']
['New this Month', 'New Products\\New this Month']
Updated script for your question:
categories = {}
def get_parent_category(cat_id):
    if len(cat_id) <= 2:
        return ''
    else:
        return cat_id[:-1]
with open('my.csv', "r") as ins:
    for line in ins:
        # remove double quotes: replace('"', '')
        # remove line break: rstrip()
        a = str(line).replace('"', '').rstrip().split('|')
        cat_id = a[0]
        if cat_id != 'CatNo':
            categories[cat_id] = {
                'parent': get_parent_category(cat_id),
                'desc': a[1],
                'long_desc': a[2]
            }
print 'Categories relations:'
for p in categories:
    parent = categories[p]['parent']
    output = categories[p]['desc']
    while parent != '':
        output = categories[parent]['desc'] + ' \\ ' + output
        parent = categories[parent]['parent']
    print '\t', output
output:
Categories relations:
New Products
New Products \ Best Sellers
New Products \ Discounted Products & Special Offers
New Products \ Best Sellers \ Previous Months
New Products \ Best Sellers \ Last Month
New Products \ Best Sellers \ New this Month
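Since the question started from csv.reader, here is a minimal Python 3 sketch of how that attempt could be made to work: the delimiter is '|' and the quote character is '"', matching what the scripts above strip by hand. It assumes the third column already holds the backslash-separated path, as the first output suggests. The sample rows are reconstructed from that output, and the header names after CatNo and the function name read_categories are assumptions:

```python
import csv
import io

# Sample rows reconstructed from the first answer's output; the header names
# after CatNo are assumptions.
SAMPLE = '''"CatNo"|"Description"|"LongDescription"
"10"|"New Products"|"New Products"
"1001"|"New this Month"|"New Products\\New this Month"
"1002"|"Last Month"|"New Products\\Last Month"
"100"|"Best Sellers"|"Best Sellers"
'''


def read_categories(lines):
    """Map each CatNo to its list of path parts (category, sub-category, ...)."""
    reader = csv.reader(lines, delimiter='|', quotechar='"')
    next(reader)  # skip the header row
    return {row[0]: row[2].split('\\') for row in reader if row}


categories = read_categories(io.StringIO(SAMPLE))
for cat_no, parts in categories.items():
    print(cat_no, ' \\ '.join(parts))
```

For a real file, pass open('my.csv', newline='') to read_categories instead of the in-memory sample.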

Failing to append to dictionary. Python

I am experiencing a strange faulty behaviour, where a dictionary is only appended once and I can not add more key value pairs to it.
My code reads in a multi-line string and extracts substrings via split(), to be added to a dictionary. I make use of conditional statements. Strangely only the key:value pairs under the first conditional statement are added.
Therefore I can not complete the dictionary.
How can I solve this issue?
Minimal code:
#I hope the '\n' is sufficient or use '\r\n'
example = "Name: Bugs Bunny\nDOB: 01/04/1900\nAddress: 111 Jokes Drive, Hollywood Hills, CA 11111, United States"
def format(data):
    dic = {}
    for line in data.splitlines():
        #print('Line:', line)
        if ':' in line:
            info = line.split(': ', 1)[1].rstrip() #does not work with files
            #print('Info: ', info)
            if ' Name:' in info: #middle name problems! /maiden name
                dic['F_NAME'] = info.split(' ', 1)[0].rstrip()
                dic['L_NAME'] = info.split(' ', 1)[1].rstrip()
            elif 'DOB' in info: #overhang
                dic['DD'] = info.split('/', 2)[0].rstrip()
                dic['MM'] = info.split('/', 2)[1].rstrip()
                dic['YY'] = info.split('/', 2)[2].rstrip()
            elif 'Address' in info:
                dic['STREET'] = info.split(', ', 2)[0].rstrip()
                dic['CITY'] = info.split(', ', 2)[1].rstrip()
                dic['ZIP'] = info.split(', ', 2)[2].rstrip()
    return dic
if __name__ == '__main__':
    x = format(example)
    for v, k in x.iteritems():
        print v, k
Your code doesn't work, at all. You split off the name before the colon and discard it, looking only at the value after the colon, stored in info. That value never contains the names you are looking for; Name, DOB and Address are all part of the line before the :.
Python lets you assign to multiple names at once; make use of this when splitting:
def format(data):
    dic = {}
    for line in data.splitlines():
        if ':' not in line:
            continue
        name, _, value = line.partition(':')
        name = name.strip()
        if name == 'Name':
            dic['F_NAME'], dic['L_NAME'] = value.split(None, 1) # strips whitespace for us
        elif name == 'DOB':
            dic['DD'], dic['MM'], dic['YY'] = (v.strip() for v in value.split('/', 2))
        elif name == 'Address':
            dic['STREET'], dic['CITY'], dic['ZIP'] = (v.strip() for v in value.split(', ', 2))
    return dic
I used str.partition() here rather than limiting str.split() to just one split; it is slightly faster that way.
For your sample input this produces:
>>> format(example)
{'CITY': 'Hollywood Hills', 'ZIP': 'CA 11111, United States', 'L_NAME': 'Bunny', 'F_NAME': 'Bugs', 'YY': '1900', 'MM': '04', 'STREET': '111 Jokes Drive', 'DD': '01'}
>>> from pprint import pprint
>>> pprint(format(example))
{'CITY': 'Hollywood Hills',
'DD': '01',
'F_NAME': 'Bugs',
'L_NAME': 'Bunny',
'MM': '04',
'STREET': '111 Jokes Drive',
'YY': '1900',
'ZIP': 'CA 11111, United States'}
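For reference, a small sketch of how str.partition() behaves on lines like the ones in the sample. It always returns a 3-tuple, which is what makes the name, _, value unpacking above safe. The 'Time' and no_colon inputs are made up for illustration:

```python
line = 'DOB: 01/04/1900'
name, sep, value = line.partition(':')
print(repr(name), repr(sep), repr(value))   # 'DOB' ':' ' 01/04/1900'

# Only the first colon splits the line, so colons inside the value survive.
time_parts = 'Time: 15:54:12'.partition(':')
print(time_parts)                           # ('Time', ':', ' 15:54:12')

# Without a separator the whole string comes back first,
# followed by two empty strings.
no_colon = 'no separator here'
print(no_colon.partition(':'))              # ('no separator here', '', '')
```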
