From set to set of set? - python

I have a set of elements that I want to convert to a list of lists of 5 elements each.
i.e.
I want the set below
symbol_list = set([u'DIVISLAB', u'TITAN', u'JINDALSTEL', u'ENDURANCE', u'PGHH', u'GMRINFRA', u'UNIONBANK', u'RAMCOCEM', u'GAIL', u'ICICIGI', u'L&TFH', u'HINDUNILVR', u'SBIN', u'PRESTIGE', u'BERGEPAINT', u'LT', u'HINDPETRO', u'RELIANCE', u'GODREJCP', u'GRAPHITE', u'RELINFRA', u'NBCC', u'MCDOWELL-N', u'SYNGENE', u'IOC', u'PETRONET', u'SUNPHARMA', u'GRASIM', u'FEDERALBNK', u'GRUH', u'CANBK', u'BBTC', u'FCONSUMER', u'MFSL', u'MRF', u'TATACHEM', u'IDFCFIRSTB', u'FRETAIL', u'OIL', u'DBL', u'PFIZER', u'BANKINDIA', u'CHOLAFIN', u'MARUTI', u'HDFC', u'EXIDEIND', u'VOLTAS', u'PAGEIND', u'RELCAPITAL', u'HDFCAMC', u'INDHOTEL', u'INDIGO', u'BHARATFORG', u'BPCL', u'MOTHERSUMI', u'COLPAL', u'LTTS', u'BAJAJHLDNG', u'GICRE', u'KOTAKBANK', u'ABCAPITAL', u'CADILAHC', u'PIDILITIND', u'APOLLOTYRE', u'AUBANK', u'TCS', u'NATCOPHARM', u'AMARAJABAT', u'EICHERMOT', u'QUESS', u'SBILIFE', u'HCLTECH', u'SHREECEM', u'UPL', u'ESCORTS', u'DLF', u'BRITANNIA', u'MPHASIS', u'LUPIN', u'ONGC', u'GSPL', u'TATAGLOBAL', u'DISHTV', u'NIACL', u'NMDC', u'VARROC', u'SUNTV', u'IGL', u'GLENMARK', u'WIPRO', u'MARICO', u'COROMANDEL', u'TORNTPHARM', u'ASHOKLEY', u'MRPL', u'OBEROIRLTY', u'BIOCON', u'HINDALCO', u'SAIL', u'MGL', u'ICICIBANK', u'NTPC', u'BAJFINANCE', u'ACC', u'CONCOR', u'IDEA', u'RBLBANK', u'PEL', u'MUTHOOTFIN', u'M&MFIN', u'JUBILANT', u'OFSS', u'EDELWEISS', u'HEXAWARE', u'BEL', u'ADANIPORTS', u'DRREDDY', u'CROMPTON', u'ASIANPAINT', u'JSWSTEEL', u'AJANTPHARM', u'AXISBANK', u'SPARC', u'APOLLOHOSP', u'RECLTD', u'GODREJAGRO', u'JSWENERGY', u'ADANIPOWER', u'SRF', u'BANKBARODA', u'IDBI', u'HEG', u'ENGINERSIN', u'TATAMTRDVR', u'LTI', u'IBVENTURES', u'NHPC', u'BATAINDIA', u'HEROMOTOCO', u'ZEEL', u'AUROPHARMA', u'HDFCBANK', u'NAUKRI', u'ULTRACEMCO', u'ITC', u'HUDCO', u'TORNTPOWER', u'INFY', u'MINDTREE', u'IBULHSGFIN', u'BHARTIARTL', u'TATASTEEL', u'GODREJIND', u'AMBUJACEM', u'M&M', u'POWERGRID', u'HDFCLIFE', u'MANAPPURAM', u'DHFL', u'RPOWER', u'BALKRISIND', u'ABFRL', u'PNBHOUSING', u'HINDZINC', u'STRTECH', u'RAJESHEXPO', u'TATAMOTORS', u'TATAPOWER', u'DMART', u'CIPLA', u'HAVELLS', u'COALINDIA', u'LICHSGFIN', u'JUBLFOOD', u'BAJAJ-AUTO', u'DABUR', u'CUMMINSIND', u'NATIONALUM', u'INFRATEL', u'ABB', u'VEDL', u'BHEL', u'UBL', u'BOSCHLTD', u'BAJAJFINSV', u'TECHM', u'INDIANB', u'CASTROLIND', u'PIIND', u'PFC', u'PNB', u'BANDHANBNK', u'YESBANK', u'ALKEM', u'INDUSINDBK', u'SIEMENS', u'TVSMOTOR', u'GSKCONS', u'SRTRANSFIN', u'ICICIPRULI', u'VGUARD'])
to be of the form
convertedset = ([[u'DIVISLAB', u'TITAN', u'JINDALSTEL', u'ENDURANCE', u'PGHH'], [u'GMRINFRA', u'UNIONBANK', u'RAMCOCEM', u'GAIL', u'ICICIGI'],[u'L&TFH', u'HINDUNILVR', u'SBIN'...]])

Try this:
symbol_list = [u'DIVISLAB', u'TITAN', u'JINDALSTEL', u'ENDURANCE', u'PGHH', u'GMRINFRA', u'UNIONBANK', u'RAMCOCEM', u'GAIL', u'ICICIGI', u'L&TFH', u'HINDUNILVR', u'SBIN', u'PRESTIGE', u'BERGEPAINT', u'LT', u'HINDPETRO', u'RELIANCE', u'GODREJCP', u'GRAPHITE', u'RELINFRA', u'NBCC', u'MCDOWELL-N', u'SYNGENE', u'IOC', u'PETRONET', u'SUNPHARMA', u'GRASIM', u'FEDERALBNK', u'GRUH', u'CANBK', u'BBTC', u'FCONSUMER', u'MFSL', u'MRF', u'TATACHEM', u'IDFCFIRSTB', u'FRETAIL', u'OIL', u'DBL', u'PFIZER', u'BANKINDIA', u'CHOLAFIN', u'MARUTI', u'HDFC', u'EXIDEIND', u'VOLTAS', u'PAGEIND', u'RELCAPITAL', u'HDFCAMC', u'INDHOTEL', u'INDIGO', u'BHARATFORG', u'BPCL', u'MOTHERSUMI', u'COLPAL', u'LTTS', u'BAJAJHLDNG', u'GICRE', u'KOTAKBANK', u'ABCAPITAL', u'CADILAHC', u'PIDILITIND', u'APOLLOTYRE', u'AUBANK', u'TCS', u'NATCOPHARM', u'AMARAJABAT', u'EICHERMOT', u'QUESS', u'SBILIFE', u'HCLTECH', u'SHREECEM', u'UPL', u'ESCORTS', u'DLF', u'BRITANNIA', u'MPHASIS', u'LUPIN', u'ONGC', u'GSPL', u'TATAGLOBAL', u'DISHTV', u'NIACL', u'NMDC', u'VARROC', u'SUNTV', u'IGL', u'GLENMARK', u'WIPRO', u'MARICO', u'COROMANDEL', u'TORNTPHARM', u'ASHOKLEY', u'MRPL', u'OBEROIRLTY', u'BIOCON', u'HINDALCO', u'SAIL', u'MGL', u'ICICIBANK', u'NTPC', u'BAJFINANCE', u'ACC', u'CONCOR', u'IDEA', u'RBLBANK', u'PEL', u'MUTHOOTFIN', u'M&MFIN', u'JUBILANT', u'OFSS', u'EDELWEISS', u'HEXAWARE', u'BEL', u'ADANIPORTS', u'DRREDDY', u'CROMPTON', u'ASIANPAINT', u'JSWSTEEL', u'AJANTPHARM', u'AXISBANK', u'SPARC', u'APOLLOHOSP', u'RECLTD', u'GODREJAGRO', u'JSWENERGY', u'ADANIPOWER', u'SRF', u'BANKBARODA', u'IDBI', u'HEG', u'ENGINERSIN', u'TATAMTRDVR', u'LTI', u'IBVENTURES', u'NHPC', u'BATAINDIA', u'HEROMOTOCO', u'ZEEL', u'AUROPHARMA', u'HDFCBANK', u'NAUKRI', u'ULTRACEMCO', u'ITC', u'HUDCO', u'TORNTPOWER', u'INFY', u'MINDTREE', u'IBULHSGFIN', u'BHARTIARTL', u'TATASTEEL', u'GODREJIND', u'AMBUJACEM', u'M&M', u'POWERGRID', u'HDFCLIFE', u'MANAPPURAM', u'DHFL', u'RPOWER', u'BALKRISIND', u'ABFRL', u'PNBHOUSING', u'HINDZINC', u'STRTECH', u'RAJESHEXPO', u'TATAMOTORS', u'TATAPOWER', u'DMART', u'CIPLA', u'HAVELLS', u'COALINDIA', u'LICHSGFIN', u'JUBLFOOD', u'BAJAJ-AUTO', u'DABUR', u'CUMMINSIND', u'NATIONALUM', u'INFRATEL', u'ABB', u'VEDL', u'BHEL', u'UBL', u'BOSCHLTD', u'BAJAJFINSV', u'TECHM', u'INDIANB', u'CASTROLIND', u'PIIND', u'PFC', u'PNB', u'BANDHANBNK', u'YESBANK', u'ALKEM', u'INDUSINDBK', u'SIEMENS', u'TVSMOTOR', u'GSKCONS', u'SRTRANSFIN', u'ICICIPRULI', u'VGUARD']
convertedlist = [symbol_list[i:i+5] for i in range(0, len(symbol_list), 5)]
Output:
[['DIVISLAB', 'TITAN', 'JINDALSTEL', 'ENDURANCE', 'PGHH'],
['GMRINFRA', 'UNIONBANK', 'RAMCOCEM', 'GAIL', 'ICICIGI'],
['L&TFH', 'HINDUNILVR', 'SBIN', 'PRESTIGE', 'BERGEPAINT'],
['LT', 'HINDPETRO', 'RELIANCE', 'GODREJCP', 'GRAPHITE'], ...]
Note:
There is no need to convert symbol_list to a set, since it already contains only unique elements: len(symbol_list) == len(set(symbol_list)), and both have 201 elements.
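If your data really does start out as a set (as in the question), keep in mind that sets are unordered and cannot be sliced, so materialise a list first; a minimal sketch (the order of elements inside the chunks is then arbitrary):
symbols = list(symbol_list)  # symbol_list being the set from the question
convertedlist = [symbols[i:i + 5] for i in range(0, len(symbols), 5)]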

This can also be handled by the code below, without using any library:
split_size = 5
converted_list = []   # output list
split_list = []       # child list, part of the output list
for index, elt in enumerate(symbol_list, start=1):
    split_list.append(elt)
    # fill the child list until it holds split_size elements,
    # then append it to the output list and start a new one
    if index % split_size == 0:
        converted_list.append(split_list)
        split_list = []
# append any trailing elements that did not fill a complete chunk
if split_list:
    converted_list.append(split_list)
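For completeness, the same chunking can also be done with the standard library; a small sketch using itertools.islice that keeps any trailing partial chunk:
from itertools import islice

def chunks(iterable, size):
    # repeatedly pull `size` items from a single iterator until it is exhausted
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

convertedlist = list(chunks(symbol_list, 5))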

Extract letters after $ symbol using Pandas

I am trying to extract just the data up to and including the $ symbol from a spreadsheet.
I have isolated the column containing the data; what I am trying to do now is extract any and all symbols that follow a $ sign.
For example:
$AAPL, $LOW, $TSLA and so on from the entire dataset, but I don't need or want $1000, $600 and so on; only letters (a-z), followed by either a period or a space.
I haven't been successful with the full extraction and my code is starting to get messy, so I'll provide the code that brings back the data for you to see for yourself. I am using Jupyter Notebook.
import mysql.connector
import pandas
googleSheedID = '15fhpxqWDRWkNtEFhi9bQyWUg8pDn4B-R2N18s1xFYTU'
worksheetName = 'Sheet1'
URL = 'https://docs.google.com/spreadsheets/d/{0}/gviz/tq?tqx=out:csv&sheet={1}'.format(
googleSheedID,
worksheetName
)
df = pandas.read_csv(URL)
del df['DATE']
del df['USERNAME']
del df['LINK']
del df['LINK2']
df[df["TWEET"].str.contains("RT")==False]
print(df)
Not sure if I understand what you want correctly, but the following code gives all the elements that come after a $ and before a blank space.
import mysql.connector
import pandas
googleSheedID = '15fhpxqWDRWkNtEFhi9bQyWUg8pDn4B-R2N18s1xFYTU'
worksheetName = 'Sheet1'
URL = 'https://docs.google.com/spreadsheets/d/{0}/gviz/tq?tqx=out:csv&sheet={1}'.format(
googleSheedID,
worksheetName
)
df = pandas.read_csv(URL)
del df['DATE']
del df['USERNAME']
del df['LINK']
del df['LINK2']
unique_results = []
for i in range(len(df['TWEET'])):
    if 'RT' in df["TWEET"][i]:
        continue
    else:
        for j in range(len(df['TWEET'][i]) - 1):
            if df['TWEET'][i][j] == '$':
                # skip dollar amounts like $600: ignore a '$' followed by a digit
                if df['TWEET'][i][j+1] in '0123456789':
                    continue
                else:
                    start = j
                    for k in range(start, len(df['TWEET'][i])):
                        if df['TWEET'][i][k] == ' ' or df['TWEET'][i][k:k+1] == '\n':
                            end = k
                            break
                    results = df['TWEET'][i][start:end]
                    if results not in unique_results:
                        unique_results.append(results)
print(unique_results)
edit: fixed the code
The outputs are:
['$GME', '$SNDL', '$FUBO', '$AMC', '$LOTZ', '$CLOV', '$USAS', '$AIHS', '$PLM', '$LODE', '$TTNP', '$IMTE', '', '$NAK.', '$NAK', '$CRBP', '$AREC', '$NTEC', '$NTN', '$CBAT', '$ZYNE', '$HOFV', '$GWPH', '$KERN', '$ZYNE,', '$AIM', '$WWR', '$CARV', '$VISL', '$SINO', '$NAKD', '$GRPS', '$RSHN', '$MARA', '$RIOT', '$NXTD', '$LAC', '$BTC', '$ITRM', '$CHCI', '$VERU', '$GMGI', '$WNBD', '$KALV', '$EGOC', '$Veru', '$MRNA', '$PVDG', '$DROP', '$EFOI', '$LLIT', '$AUVI', '$CGIX', '$RELI', '$TLRY', '$ACB', '$TRCH', '$TRCH.', '$TSLA', '$cciv', '$sndl', '$ANCN', '$TGC', '$tlry', '$KXIN', '$AMZN', '$INFI', '$LMND', '$COMS', '$VXX', '$LEDS', '$ACY', '$RHE', '$SINO.', '$GPL', '$SPCE', '$OXY', '$CLSN', '$FTFT', '$FTFT.....', '$BIEI', '$EDRY', '$CLEU', '$FSR', '$SPY', '$NIO', '$LI', '$XPEV,', '$UL', '$RGLG', '$SOS', '$QS', '$THCB', '$SUNW', '$MICT', '$BTC.X', '$T', '$ADOM', '$EBON', '$CLPS', '$HIHO', '$ONTX', '$WNRS', '$SOLO', '$Mara,', '$Riot,', '$SOS,', '$GRNQ,', '$RCON,', '$FTFT,', '$BTBT,', '$MOGO,', '$EQOS,', '$CCNC', '$CCIV', '$tsla', '$fsr', '$wkhs', '$ride', '$nio', '$NETE', '$DPW', '$MOSY', '$SSNT', '$PLTR', '$GSAH:', '$EQOS', '$MTSL', '$CMPS', '$CHIF', '$MU', '$HST', '$SNAP', '$CTXR', '$acy', '$FUBOTV', '$DPBE', '$HYLN', '$SPOT', '$NSAV', '$HYLN,', '$aabb', '$AAL', '$BBIG', '$ITNS', '$CTIB', '$AMPG', '$ZI', '$NUVI', '$INTC', '$TSM', '$AAPL', '$MRJT', '$RCMT', '$IZEA', '$BBIG,', '$ARKK', '$LIAUTO', '$MARA:', '$SOS:', '$XOM', '$ET', '$BRNW', '$SYPR', '$LCID', '$QCOM', '$FIZZ', '$TRVG', '$SLV', '$RAFA', '$TGCTengasco,', '$BYND', '$XTNT', '$NBY', '$sos', '$KMPH', '$', '$(0.60)', '$(0.64)', '$BIDU', '$rkt', '$GTT', '$CHUC', '$CLF', '$INUV', '$RKT', '$COST', '$MDCN', '$HCMC', '$UWMC', '$riot', '$OVID', '$HZON', '$SKT', '$FB', '$PLUG', '$BA', '$PYPL', '$PSTH.', '$NVDA', '$AMPG.', '$aese.', '$spy', '$pltr', '$MSFT', '$AMD', '$QQQ', '$LTNC', '$WKHS', '$EYES', '$RMO', '$GNUS', '$gme', '$mdmp', '$kern', '$AEI', '$BABA', '$YALA', '$TWTR', '$WISH', '$GE', '$ORCL', '$JUPW', '$TMBR', '$SSYS', '$NKE', '$AMPGAmpliTech', '$$$', '$$', '$RGLS', '$HOGE', '$GEGR', '$nclh', '$IGAC', '$FCEL', '$TKAT', '$OCG', '$YVR', '$IPDN.', '$IPDN', "$SINO's", '$WIMI', '$TKAT.', '$BAC', '$LZR', '$LGHL', '$F', '$GM', '$KODK', '$atvk', '$ATVK', '$AIKI', '$DS', '$AI', '$WTII', '$oxy', '$DYAI', '$DSS', '$ZKIN', '$MFH', '$WKEY', '$MKGI', '$DLPN', '$PSWW', '$SNOW', '$ALYA', '$AESE', '$CSCW', '$CIDM', '$HOFV.', '$LIVX', '$FNKO', '$HPR', '$BRQS', '$GIGM', '$APOP', '$EA', '$CUEN', '$TMBR?', '$FLNT,', '$APPS', '$METX', '$STG', '$WSRC', '$AMHC', '$VIAC', '$MO', '$UAVL', '$CS', '$MDT', '$GYST', '$CBBT', '$ASTC', '$AACG', '$WAFU.', '$WAFU', '$CASI', '$mmmw', '$MVIS', '$SNOA', '$C', '$KR', '$EWZ', '$VALE', '$EWZ.', '$CSCO', '$PINS', '$XSPA', '$VPRX', '$CEMI', '$M', '$BMRA', '$SPX', '$akt', '$SURG', '$NCLH', '$ARSN', '$ODT', '$SGBX', '$CRWD.', '$TGRR', '$PENN', '$BB', '$XOP', '$XL', '$FREQ', '$IDRA', '$DKNG', '$COHN', '$ADHC', '$ISWH', '$LEGO', '$OTRA', '$NAAC', '$HCAR', '$PPGH', '$SDAC', '$PNTM', '$OUST', '$IO', '$HQGE', '$HENC', '$KYNC', '$ATNF', '$BNSO', '$HDSN', '$AABB', '$SGH', '$BMY', '$VERY', '$EARS', '$ROKU', '$PIXY', '$APRE', '$SFET', '$SQ', '$EEIQ', '$REDU', '$CNWT', '$NFLX', '$RGBPP', '$RGBP', '$SHOP', '$VITL', '$RAAS', '$CPNG', '$JKS', '$COMP', '$NAFS']
You can use regular expressions.
\$[a-zA-Z]+
After reading the df, execute the code below:
import re

# Create empty lists for the results
results = []
final_results = []
for row_num in range(len(df['TWEET'])):
    string_to_check = df['TWEET'][row_num]
    # Check for RT at the beginning of the string only.
    # if 'RT' in df["TWEET"][row_num] would have found the "RT" anywhere in the string.
    if re.match(r"^RT", string_to_check):
        continue
    else:
        # Check for all words starting with $ and followed by only alphabets.
        # This will find $FOOBAR but not $600, $6FOOBAR & $FOO6BAR
        rel_text_l = re.findall(r"\$[a-zA-Z]+", string_to_check)
        # Check for empty list
        if rel_text_l:
            # Add elements of list to another list directly
            results.extend(rel_text_l)

# Making a list of the set of the list to remove duplicates
final_results = list(set(results))
print(results)
print(final_results)
The results are
['$GME', '$FOOBAR', '$FOO', '$SNDL', '$FUBO', '$AMC', '$GME', '$LOTZ', '$CLOV', '$USAS', '$GOBLIN', '$LTNC']
['$LTNC', '$GOBLIN', '$AMC', '$FOO', '$FOOBAR', '$LOTZ', '$CLOV', '$SNDL', '$GME', '$USAS', '$FUBO']
Notice that the duplicate $GME appears only once in final_results.
If you were not bothered about removing tweets starting with RT, all this could be achieved in one line of code.
direct_result = list(set(re.findall(r"\$[a-zA-Z]+", str(df['TWEET']))))
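A vectorised pandas variant is also possible; this is only a sketch (it assumes a reasonably recent pandas with Series.explode, and that retweets are the tweets starting with "RT"):
# keep non-retweets, pull every $TICKER-style token, then flatten and deduplicate
mask = ~df['TWEET'].str.startswith('RT', na=False)
tickers = df.loc[mask, 'TWEET'].str.findall(r'\$[a-zA-Z]+')
unique_tickers = sorted(set(tickers.explode().dropna()))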

multiple separator in a string python

text="Brand.*/Smart Planet.#/Color.*/Yellow.#/Type.*/Sandwich Maker.#/Power Source.*/Electrical."
I have this kind of string and I am having trouble splitting it into 2 lists. The output should look approximately like this:
name = ['Brand','Color','Type','Power Source']
value = ['Smart Plane','Yellow','Sandwich Maker','Electrical']
Is there any solution for this?
name = []
value = []
text = text.split('.#/')
for i in text:
    i = i.split('.*/')
    name.append(i[0])
    value.append(i[1])
This is one approach using re.split and list slicing.
Ex:
import re
text="Brand.*/Smart Planet.#/Color.*/Yellow.#/Type.*/Sandwich Maker.#/Power Source.*/Electrical."
data = [i for i in re.split(r"[^A-Za-z\s]+", text) if i]
name = data[::2]
value = data[1::2]
print(name)
print(value)
Output:
['Brand', 'Color', 'Type', 'Power Source']
['Smart Planet', 'Yellow', 'Sandwich Maker', 'Electrical']
You can use regex to split the text, and populate the lists in a loop.
Using regex you protect your code from invalid input.
import re

name, value = [], []
for ele in re.split(r'\.#\/', text):
    k, v = ele.split('.*/')
    name.append(k)
    value.append(v)
>>> print(name, value)
['Brand', 'Color', 'Type', 'Power Source'] ['Smart Planet', 'Yellow', 'Sandwich Maker', 'Electrical.']
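Note the trailing period on 'Electrical.'; if that is unwanted, one option (a small sketch) is to strip it from the values afterwards:
value = [v.rstrip('.') for v in value]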
text="Brand.*/Smart Planet.#/Color.*/Yellow.#/Type.*/Sandwich Maker.#/Power Source.*/Electrical."
name=[]
value=[]
word=''
for i in range(len(text)):
    temp = i
    if text[i] != '.' and text[i] != '/' and text[i] != '*' and text[i] != '#':
        word = word + ''.join(text[i])
    elif temp + 1 < len(text) and temp + 2 <= len(text):
        if text[i] == '.' and text[temp+1] == '*' and text[temp+2] == '/':
            name.append(word)
            word = ''
        elif text[i] == '.' and text[temp+1] == '#' and text[temp+2] == '/':
            value.append(word)
            word = ''
else:
    # for/else: runs once the loop finishes, flushing the last accumulated word
    value.append(word)
print(name)
print(value)
This will work...

Trouble getting right values against each item

I'm trying to parse the item names and their corresponding values from the snippet below. The dt tags hold the names and the dd tags contain the values. A few dt tags do not have corresponding values, so not all names have values. What I wish to do is keep the value blank for any name that doesn't have one.
These are the elements I would like to scrape data from:
content="""
<div class="movie_middle">
<dl>
<dt>Genres:</dt>
<dt>Resolution:</dt>
<dd>1920*1080</dd>
<dt>Size:</dt>
<dd>1.60G</dd>
<dt>Quality:</dt>
<dd>1080p</dd>
<dt>Frame Rate:</dt>
<dd>23.976 fps</dd>
<dt>Language:</dt>
</dl>
</div>
"""
I've tried the following:
soup = BeautifulSoup(content,"lxml")
title = [item.text for item in soup.select(".movie_middle dt")]
result = [item.text for item in soup.select(".movie_middle dd")]
vault = dict(zip(title,result))
print(vault)
It gives me messy results (wrong pairs):
{'Genres:': '1920*1080', 'Resolution:': '1.60G', 'Size:': '1080p', 'Quality:': '23.976 fps'}
My expected result:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p','Frame Rate:':'23.976 fps','Language:':''}
Any help on fixing the issue will be highly appreciated.
You can loop through the elements inside dl. If the current element is dt and the next element is dd, then store the value as the next element, else set the value as empty string.
dl = soup.select('.movie_middle dl')[0]
elems = dl.find_all()  # Returns the list of dt and dd
data = {}
for i, el in enumerate(elems):
    if el.name == 'dt':
        key = el.text.replace(':', '')
        # check if the next element is a `dd`
        if i < len(elems) - 1 and elems[i+1].name == 'dd':
            data[key] = elems[i+1].text
        else:
            data[key] = ''
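Printing data for the snippet in the question should then give something like the following (the colons are stripped from the keys by the replace above):
print(data)
# {'Genres': '', 'Resolution': '1920*1080', 'Size': '1.60G', 'Quality': '1080p', 'Frame Rate': '23.976 fps', 'Language': ''}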
You can use BeautifulSoup to parse the dl structure, and then write a function to create the dictionary:
from bs4 import BeautifulSoup as soup
import re
def parse_result(d):
    while d:
        a, *_d = d
        if _d:
            if re.findall('\<dt', a) and re.findall('\<dd', _d[0]):
                yield [a[4:-5], _d[0][4:-5]]
                d = _d[1:]
            else:
                yield [a[4:-5], '']
                d = _d
        else:
            yield [a[4:-5], '']
            d = []

print(dict(parse_result(list(filter(None, str(soup(content, 'html.parser').find('dl')).split('\n')))[1:-1])))
Output:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p', 'Frame Rate:': '23.976 fps', 'Language:': ''}
For a slightly longer, although cleaner solution, you can create a decorator to strip the HTML tags of the output, thus removing the need for the extra string slicing in the main parse_result function:
def strip_tags(f):
    def wrapper(data):
        return {a[4:-5]: b[4:-5] for a, b in f(data)}
    return wrapper

@strip_tags
def parse_result(d):
    while d:
        a, *_d = d
        if _d:
            if re.findall('\<dt', a) and re.findall('\<dd', _d[0]):
                yield [a, _d[0]]
                d = _d[1:]
            else:
                yield [a, '']
                d = _d
        else:
            yield [a, '']
            d = []

print(parse_result(list(filter(None, str(soup(content, 'html.parser').find('dl')).split('\n')))[1:-1]))
Output:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p', 'Frame Rate:': '23.976 fps', 'Language:': ''}
from collections import defaultdict

test = soup.text.split('\n')
d = defaultdict(list)
for i in range(len(test)):
    if (':' in test[i]) and (':' not in test[i+1]):
        d[test[i]] = test[i+1]
    elif ':' in test[i]:
        d[test[i]] = ''
d
defaultdict(list,
{'Frame Rate:': '23.976 fps',
'Genres:': '',
'Language:': '',
'Quality:': '1080p',
'Resolution:': '1920*1080',
'Size:': '1.60G'})
The logic here is that you know every key will have a colon. Knowing this, you can write an if/else statement to capture the unique combinations, whether that is a key followed by a key or a key followed by a value.
Edit:
In case you want to clean your keys, the code below replaces the : in each one:
d1 = { x.replace(':', ''): d[x] for x in d.keys() }
d1
{'Frame Rate': '23.976 fps',
'Genres': '',
'Language': '',
'Quality': '1080p',
'Resolution': '1920*1080',
'Size': '1.60G'}
The problem is that empty elements are not present. Since there is no hierarchy between the <dt> and the <dd>, I'm afraid you'll have to craft the dictionary yourself.
vault = {}
category = ""
for item in soup.find("dl").findChildren():
    if item.name == "dt":
        if category == "":
            category = item.text
        else:
            # the previous <dt> had no <dd>, so record it with an empty value
            vault[category] = ""
            category = item.text
    elif item.name == "dd":
        vault[category] = item.text
        category = ""
# a trailing <dt> with no <dd> still needs an entry
if category:
    vault[category] = ""
Basically this code iterates over the child elements of the <dl> and fills the vault dictionary with the values.

insert a list 2xn obtained from a json in python

Hi, I'm trying to read a JSON file into a list so I can do a sort of append and create a PDF with ReportLab. I have the following code, but I have several problems; the first is that I would like a 2xN list, so that there are always two columns and the rows are dynamic according to the JSON.
If anyone can help me I would be very grateful.
import json
# ReportLab imports needed for SimpleDocTemplate, Paragraph and the stylesheet used below
from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph

json_data = []
attributesName = []
testTable = {"attributes": []}
attributesValue = []
path = "prueba2.pdf"
doc = SimpleDocTemplate(path, pagesize=letter)
styleSheet = getSampleStyleSheet()
text = []
with open("prueba.json") as json_file:
    document = json.load(json_file)
    for item in document:
        for data_item in item['data']:
            attributesName.append([str(data_item['name'])])
            attributesValue.append([data_item['value']])
            testTable[attributesName].extend({data_item['name'], data_item['value']})
print attributesName[0]
print testTable[0]
parts = []
p = Paragraph('''<para align=left fontsize=9>{0}</para>'''.format(text), styleSheet["BodyText"])
parts.append(p)
doc.build(parts)
I implemented the following, but it prints the list:
[[['RFC', 'NOMBRE', 'APELLIDO PATERNO', 'APELLIDO MATERNO', 'FECHA NACIMIENTO', 'CALLE', 'No. EXTERI
OR', 'No. INTERIOR', 'C.P.', 'ENTIDAD', 'MUNICIPIO', 'COLONIA', 'DOCUMENTO']], [['MORR910304FL2', 'R
JOSE', 'MONTIEL', 'ROBLES', '1992-02-04', 'AMOR', '4', '2', '55064', 'EDO DE MEX', 'ECATEPEC', 'INDUSTRIAL', 'Documento']]]
I want something like this:
[['RFC'], ['22232446']]
[['NOMBRE'], ['22239952']]
[['APELLIDO'], ['22245430']]
If you change your code to the following (note that tabla needs to be initialised first):
tabla = []
with open("prueba.json") as json_file:
    document = json.load(json_file)
    for item in document:
        for data_item in item['data']:
            attributesName.append(str(data_item["name"]))
            attributesValue.append(str(data_item["value"]))
            tabla.append([[attributesName], [attributesValue]])
print attributesName
print attributesValue
for Y in tabla:
    print(Y)
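If the goal is really one [[name], [value]] pair per attribute, as in the desired output above, a more direct sketch (assuming the same prueba.json structure as in the question) would be:
import json

pairs = []  # one [[name], [value]] entry per attribute
with open("prueba.json") as json_file:
    document = json.load(json_file)
    for item in document:
        for data_item in item['data']:
            pairs.append([[str(data_item['name'])], [str(data_item['value'])]])

for pair in pairs:
    print(pair)  # e.g. [['RFC'], ['MORR910304FL2']]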

Pass list of elements to named tuple

I'm supposed to create a namedtuple which has 27 field_names. Since it has so many field_names, I created a list called sub that holds the items to use as field_names. result is my reference to the namedtuple.
from collections import namedtuple

sub = [
'MA9221', 'MC9211', 'MC9212', 'MC9213', 'MC9214',
'MC9215', 'MC9222', 'MC9223', 'MC9224', 'MC9225',
'MC9231', 'MC9232', 'MC9233', 'MC9234', 'MC9235',
'MC9241', 'MC9242', 'MC9243', 'MC9244', 'MC9251',
'MC9252', 'MC9273', 'MC9277', 'MC9283', 'MC9285']
result = namedtuple('result', ['rollno', 'name'] + sub)
Result values:
rollno = 123123
name = "Sam"
sub_value = [
1,0,0,0,0,
0,0,1,1,1,
1,1,1,0,0,
1,1,0,0,1,
1,1,1,0,1]
Now, I don't know how to pass the elements of sub_value to result(rollno, name, ...).
This line actually defines the type itself:
result = namedtuple('result', ['rollno', 'name'] + sub)
To create an instance, you now need to call result(...).
>>> result(rollno, name, *sub_value)
result(rollno=123123, name='Sam', MA9221=1, MC9211=0, MC9212=0, MC9213=0, MC9214=0, MC9215=0, MC9222=0, MC9223=1, MC9224=1, MC9225=1, MC9231=1, MC9232=1, MC9233=1, MC9234=0, MC9235=0, MC9241=1, MC9242=1, MC9243=0, MC9244=0, MC9251=1, MC9252=1, MC9273=1, MC9277=1, MC9283=0, MC9285=1)
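If the values already sit in one flat list, the namedtuple helper _make can also build the instance directly from an iterable:
>>> result._make([rollno, name] + sub_value)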
