Storing key text as headers and value text as rows in a DataFrame in Python using Beautiful Soup

for imo in imos:
    ...
    ...
    keys_div = soup.find_all("div", {"class": "col-4 keytext"})
    values_div = soup.find_all("div", {"class": "col-8 valuetext"})
    for key, value in zip(keys_div, values_div):
        print(key.text + ": " + value.text)
'......
Output:
Ship Name: MAERSK ADRIATIC
Shiptype: Chemical/Products Tanker
IMO/LR No.: 9636632
Gross: 23,297
Call Sign: 9V3388
Deadweight: 37,538
MMSI No.: 566429000
Year of Build: 2012
Flag: Singapore
Status: In Service/Commission
Operator: Handytankers K/S
Shipbuilder: Hyundai Mipo Dockyard Co Ltd
ShipType: Chemical/Products Tanker
Built: 2012
GT: 23,297
Deadweight: 37,538
Length Overall: 184.000
Length (BP): 176.000
Length (Reg): 177.460
Bulbous Bow: Yes
Breadth Extreme: 27.430
Breadth Moulded: 27.400
Draught: 11.500
Depth: 17.200
Keel To Mast Height: 46.900
Displacement: 46565
T/CM: 45.0
This is the output for one IMO. I want to store this output in a DataFrame and write it to a CSV file. The CSV should have the key text as the header and the value text as rows, one row per IMO. Please help me with how to do it.

All you have to do is add the results to a list and then output that list to a dataframe.
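import pandas as pd
filepath = r"C:\users\test\test_file.csv"
output_data = []
for imo in imos:
    # ... fetch and parse the page for this IMO into `soup`, as in the question ...
    keys_div = [i.text for i in soup.find_all("div", {"class": "col-4 keytext"})]
    values_div = [i.text for i in soup.find_all("div", {"class": "col-8 valuetext"})]
    # One dict per IMO: key text -> value text
    dict1 = dict(zip(keys_div, values_div))
    output_data.append(dict1)
# Each dict becomes one row; the dict keys become the column headers
df = pd.DataFrame(output_data)
df.to_csv(filepath, index=False)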
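As a rough, self-contained illustration of the list-of-dicts step, the sketch below uses two hypothetical records instead of scraped pages (the second ship and its values are made up) and writes to an in-memory buffer rather than a file. It assumes pandas is installed.

```python
import io

import pandas as pd

# Hypothetical scraped records: one dict per ship, keyed by the "keytext" labels.
output_data = [
    {"Ship Name": "MAERSK ADRIATIC", "Gross": "23,297", "Flag": "Singapore"},
    {"Ship Name": "EXAMPLE VESSEL", "Gross": "10,000", "Call Sign": "TEST1"},
]

# Keys become columns; a key missing from a record is filled with NaN.
df = pd.DataFrame(output_data)

# Write to an in-memory buffer here; passing a file path works the same way.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

One thing to watch: dict(zip(...)) keeps only the last value for a repeated label, so a label that appears twice on a page (such as Deadweight in the output above) ends up as a single column.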

Related

Determining most common name from web scraped birth name data

My task is to scrape this page: https://www.ssa.gov/cgi-bin/popularnames.cgi. It lists the most common birth names. I now have to find the most common name shared by girls and boys in a given year (in other words, the exact same name is used for both genders), but I don't know how to do that. The code below solves the previous task, printing the list for a given year, but I have no clue how to modify it to get the most common name that both girls and boys have.
import requests
import lxml.html as lh
url = 'https://www.ssa.gov/cgi-bin/popularnames.cgi'
string = input("Year: ")
r = requests.post(url, data=dict(year=string, top="1000", number="n"))
doc = lh.fromstring(r.content)
tr_elements = doc.xpath('//table[2]//td[2]//tr')
cols = []
for col in tr_elements[0]:
    name = col.text_content()
    number = col.text_content()
    cols.append((number, []))
count = 0
for row in tr_elements[1:]:
    i = 0
    for col in row:
        val = col.text_content()
        cols[i][1].append(val)
        i += 1
        if count < 4:
            print(val, end=' ')
            count += 1
        else:
            count = 0
            print(val)
Here's one approach. The first step is to group the data by name and record how many genders have used the name and their aggregate total. After that, we can filter the structure by names with more than one gender using it. Finally, we sort this multi-gender list by counts and take the 0-th element. This is our most popular multi-gender name for the year.
import requests
import lxml.html as lh
url = "https://www.ssa.gov/cgi-bin/popularnames.cgi"
year = input("Year: ")
response = requests.post(url, data=dict(year=year, top="1000", number="n"))
doc = lh.fromstring(response.content)
tr_elements = doc.xpath("//table[2]//td[2]//tr")
column_names = [col.text_content() for col in tr_elements[0]]
names = {}
most_common_shared_names_by_year = {}
# Skip the header row and the last row
for row in tr_elements[1:-1]:
    row = [cell.text_content() for cell in row]
    # Column 1 holds the male name and column 3 the female name;
    # each count sits in the column right after its name
    for i, gender in ((1, "male"), (3, "female")):
        if row[i] not in names:
            names[row[i]] = {"count": 0, "genders": set()}
        names[row[i]]["count"] += int(row[i+1].replace(",", ""))
        names[row[i]]["genders"].add(gender)
# Keep only the names used by both genders
shared_names = [
    (name, data) for name, data in names.items() if len(data["genders"]) > 1
]
# Sort by aggregate count, largest first
most_common_shared_names = sorted(shared_names, key=lambda x: -x[1]["count"])
print("%s => %s" % most_common_shared_names[0])
If you're curious, here are the results since 2000:
2000 => Tyler, 22187
2001 => Tyler, 19842
2002 => Tyler, 18788
2003 => Ryan, 20171
2004 => Madison, 20829
2005 => Ryan, 18661
2006 => Ryan, 17116
2007 => Jayden, 17287
2008 => Jayden, 19040
2009 => Jayden, 19053
2010 => Jayden, 18641
2011 => Jayden, 18064
2012 => Jayden, 16952
2013 => Jayden, 15462
2014 => Logan, 14478
2015 => Logan, 13753
2016 => Logan, 12099
2017 => Logan, 15117
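The group-then-filter approach described above can be tried without any network access. The sketch below is only an illustration: the rows are hypothetical (not real SSA figures), the function name is made up, and it uses max() in place of a full sort since only the top entry is needed.

```python
# Hypothetical rows in the same shape as the scraped table:
# (rank, male name, male count, female name, female count)
rows = [
    ("1", "Alex", "1,200", "Emma", "1,500"),
    ("2", "Jordan", "900", "Jordan", "400"),
    ("3", "Riley", "300", "Alex", "150"),
]


def most_common_shared_name(rows):
    """Return (name, total) for the top name used by both genders, or None."""
    names = {}
    for row in rows:
        for i, gender in ((1, "male"), (3, "female")):
            entry = names.setdefault(row[i], {"count": 0, "genders": set()})
            # Counts arrive as strings with thousands separators.
            entry["count"] += int(row[i + 1].replace(",", ""))
            entry["genders"].add(gender)
    # Keep only the names recorded for more than one gender.
    shared = [(n, d["count"]) for n, d in names.items() if len(d["genders"]) > 1]
    return max(shared, key=lambda item: item[1]) if shared else None


print(most_common_shared_name(rows))  # → ('Alex', 1350)
```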

How to convert a vertical list to a pandas DataFrame?

I have a list from a web scraper that writes its log file in a vertical list form.
example:
21-Oct-19 14:46:14 - Retrieving data from https://www.finn.no/bap/forsale/search.html?category=0.93&page=1&product_category=2.93.3904.69&sub_category=1.93.3904
0 21-Oct-19 14:46:14 - Found:
1 Title: Nesten ubrukt Canon 17-40 mm vidvinkell...
2 Price: 4 900 kr
3 Link: https://www.finn.no/bap/forsale/ad.html?...
4 21-Oct-19 14:46:14 - Found:
5 Title: Nesten ubrukt Canon 17-40 mm vidvinkell...
6 Price: 4 900 kr
7 Link: https://www.finn.no/bap/forsale/ad.html?...
8 21-Oct-19 14:46:14 - Found:
9 Title: Nesten ubrukt Canon 17-40 mm vidvinkell...
10 Price: 4 900 kr
11 Link: https://www.finn.no/bap/forsale/ad.html?...
12 21-Oct-19 14:46:14 - Found:
13 Title: Nesten ubrukt Canon 17-40 mm vidvinkell...
Can I convert it into a readable DataFrame for pandas?
example:
title price link
canon 100mm 6900kr https
canon 50mm 100r https
canon 17mm 63530kr https
My code right now looks like this:
import pandas as pd
data = pd.read_csv('finn.no-2019-10-21-.log', sep ="Line", engine='python')
df = pd.DataFrame(data)
title = 1,5,9,13,17,21
price = 2,6,10,14,18,22
link = 3,7,11,15,19,23
print(df)
Can I do anything with the numbers in the original rows to convert this to a more traditional DataFrame?
This should do it for you:
import pandas as pd
with open('finn.no-2019-10-21-.log') as f:
    lines = f.readlines()
clean = [line.strip() for line in lines]
title = [j.split('Title: ')[1] for j in clean if j.startswith('Title: ')]
price = [k.split('Price: ')[1] for k in clean if k.startswith('Price: ')]
link = [l.split('Link: ')[1] for l in clean if l.startswith('Link: ')]
# Pair up the fields so that each listing becomes one DataFrame row
df = pd.DataFrame(list(zip(title, price, link)), columns=['Title', 'Price', 'Link'])
With help from @zipa I got it right:
import pandas as pd
with open('finn.no-2019-10-22-.log') as f:
    lines = f.readlines()
clean = [line.strip() for line in lines]
titles = [j.split('Title: ')[1] for j in clean if j.startswith('Title: ')]
prices = [k.split('Price: ')[1] for k in clean if k.startswith('Price: ')]
links = [l.split('Link: ')[1] for l in clean if l.startswith('Link: ')]
output = []
for title, price, link in zip(titles, prices, links):
    articles = {}
    articles['titles'] = title
    articles['prices'] = price
    articles['links'] = link
    output.append(articles)
df = pd.DataFrame(data=output)
print(df)
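Collecting titles, prices and links into three separate lists and zipping them relies on every listing having all three fields; if one block in the log is incomplete, zip() either drops the tail or pairs fields from different listings. A possible alternative, sketched here with hypothetical sample lines and a made-up function name, is to build one record per "Found:" block instead.

```python
def parse_listings(lines):
    """Group Title/Price/Link lines into one dict per 'Found:' block."""
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line.endswith("Found:"):
            # A new listing starts; keep the previous one if it had any fields.
            if current:
                records.append(current)
            current = {}
        elif current is not None:
            for field in ("Title", "Price", "Link"):
                prefix = field + ": "
                if line.startswith(prefix):
                    current[field] = line[len(prefix):]
    # Do not lose the final block.
    if current:
        records.append(current)
    return records


# Hypothetical log lines; the last listing is cut off after its title.
sample = [
    "21-Oct-19 14:46:14 - Retrieving data from https://example.invalid/search",
    "21-Oct-19 14:46:14 - Found:",
    "Title: Canon 17-40 mm",
    "Price: 4 900 kr",
    "Link: https://example.invalid/ad/1",
    "21-Oct-19 14:46:14 - Found:",
    "Title: Canon 50 mm",
]
records = parse_listings(sample)
print(records)
```

The resulting list of dicts can be passed to pd.DataFrame(records); fields missing from a block should then show up as NaN rather than shifting later rows.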

Extracting Author name from XML tags using ElementTree

Following is the link to access the XML document:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=%2726161999%27&retmode=xml
I'm trying to extract each author's name (LastName + ForeName) and build a single string containing only the author names. So far I am only able to extract the details separately.
Following is the code that I have tried
import requests
import xml.etree.ElementTree as et
r = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id='26161999'&retmode=xml")
root = et.fromstring(r.content)
authordata = []
for elem in root.findall(".//ForeName"):
    elem_ = elem.text
    auth_name = list(elem_.split(" "))
    authordata.append(auth_name)
val = [item if isinstance(item, str) else " ".join(item) for item in authordata]  # flatten the nested list into strings
seen = set()
val = [x for x in val if x not in seen and not seen.add(x)]
author = ' '.join(val)
print(author)
The output obtained from the above code is:
Elisa Riccardo Mirco Laura Valentina Antonio Sara Carla Borri Barbara
The expected output is a combination of firstname + Lastname:
Elisa Oppici Riccardo Montioli Mirco Dindo Laura Maccari Valentina Porcari Antonio Lorenzetto Chellini Sara Carla Borri Voltattorni Barbara Cellini
From your question I understand that you want a concatenation of ForeName and LastName for each author. You can achieve that by querying those fields directly on each Author element in the tree and concatenating the corresponding text:
import xml.etree.ElementTree as et
import requests
r = requests.get(
    'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id="26161999"&retmode=xml'
)
root = et.fromstring(r.content)
author_names = []
for author in root.findall(".//Author"):
    fore_name = author.find('ForeName').text
    last_name = author.find('LastName').text
    author_names.append(fore_name + ' ' + last_name)
print(author_names)
# or to get your exact output format:
print(' '.join(author_names))
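Calling .text on the result of find() raises an AttributeError when an Author element has no ForeName or LastName child. Assuming such entries can occur (for example, a group listed under a collective name), a more defensive variant might use findtext(), which returns None for a missing child. The XML below is a hypothetical, heavily trimmed stand-in for the real response, and the function name is made up.

```python
import xml.etree.ElementTree as et

# Hypothetical, heavily trimmed stand-in for a PubMed efetch response.
SAMPLE_XML = """
<PubmedArticleSet>
  <AuthorList>
    <Author><LastName>Oppici</LastName><ForeName>Elisa</ForeName></Author>
    <Author><LastName>Montioli</LastName><ForeName>Riccardo</ForeName></Author>
    <Author><CollectiveName>Example Study Group</CollectiveName></Author>
  </AuthorList>
</PubmedArticleSet>
"""


def extract_author_names(xml_text):
    """Return 'ForeName LastName' for each Author that has both parts."""
    root = et.fromstring(xml_text)
    names = []
    for author in root.findall(".//Author"):
        fore = author.findtext("ForeName")
        last = author.findtext("LastName")
        # Skip entries without a personal name instead of failing on None.
        if fore and last:
            names.append(f"{fore} {last}")
    return names


print(extract_author_names(SAMPLE_XML))  # → ['Elisa Oppici', 'Riccardo Montioli']
```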

Printing a list in dataframe format

I am trying to print a two-dimensional list in pandas data frame format.
The result of printing the data as a pandas DataFrame:
[image: Pandas Data Frame]
My Code
cols = ["prod_id", "description", "cost"]
data = [["p01", "Domaxx Geniune Leather RFID Blocking Trifold Wallets-Made Genuine Soft Leather Large Classic Pocket Wallet,Holding 9 Cards Photo ID Coin Pocket and 2 Note compartments-Black Surface/Orange Inner", "10.00"],
        ["p02", "Neck Wallet, Passport Holder with RFID Blocking Anti-Theft Travel Pouch Security Wallet for Credit Cards and Passport - Silver", "15.00"]]
temp_str = ''
for item in cols:
    temp_str += "\t " + item
print(temp_str)
i = 0
for row in data:
    print(str(i) + "\t" + row[0] + "\t" + row[1] + "\t" + row[2])
    i += 1
Print result:
[image: normal List]
I am not sure if I understood your input, but if you want this:
prod_id description cost
0 p01 Domaxx Geniune Leather RFID Blocking Trifold W... 10.00
1 p02 Neck Wallet, Passport Holder with RFID Blockin... 15.00
from:
cols = ["prod_id", "description", "cost"]
data = [["p01", "Domaxx Geniune Leather RFID Blocking Trifold Wallets-Made Genuine Soft Leather Large Classic Pocket Wallet,Holding 9 Cards Photo ID Coin Pocket and 2 Note compartments-Black Surface/Orange Inner", "10.00"], ["p02","Neck Wallet, Passport Holder with RFID Blocking Anti-Theft Travel Pouch Security Wallet for Credit Cards and Passport - Silver","15.00"]]
Just do this:
import pandas as pd
df = pd.DataFrame(data, columns=cols)
EDIT
Here is an example with shorter column fields:
data = [["p01", "Domaxx Geniune Leather RFID Blocking Trifold Wallets-Made", "10.00"],
        ["p02", "Neck Wallet, Passport Holder with RFID Blocking Anti-Theft Travel Pouch Security Wallet f", "15.00"]]
fmt = '{:<4}{:<10}{:<100}{}'
# Transpose rows into columns; list() is needed because map() returns an iterator in Python 3
data1 = list(map(list, zip(*data)))
print(fmt.format('', "prod_id", "description", "cost")) # your columns here
for i, (x, y, z) in enumerate(zip(data1[0], data1[1], data1[2])):
    print(fmt.format(i, x, y, z))
You can play with the values in the format string to get the result that works best for you.
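If the goal is to imitate the way pandas shortens long cells with "...", the fixed widths in the format string can be replaced by widths computed from the data. The helper below is a hypothetical sketch using only the standard library; its name, the max_width parameter and the two-space column gap are arbitrary choices, not anything pandas itself exposes.

```python
def format_table(cols, rows, max_width=30):
    """Return aligned text lines, truncating long cells with '...'."""
    def clip(text):
        return text if len(text) <= max_width else text[: max_width - 3] + "..."

    clipped = [[clip(str(cell)) for cell in row] for row in rows]
    # Each column is as wide as its header or its longest clipped cell.
    widths = [
        max([len(col)] + [len(row[i]) for row in clipped])
        for i, col in enumerate(cols)
    ]
    index_width = len(str(len(rows)))
    header = "  ".join(col.ljust(w) for col, w in zip(cols, widths))
    lines = [" " * index_width + "  " + header]
    for i, row in enumerate(clipped):
        cells = "  ".join(cell.ljust(w) for cell, w in zip(row, widths))
        lines.append(str(i).ljust(index_width) + "  " + cells)
    return lines


cols = ["prod_id", "description", "cost"]
data = [
    ["p01", "Domaxx Geniune Leather RFID Blocking Trifold Wallets-Made", "10.00"],
    ["p02", "Neck Wallet, Passport Holder", "15.00"],
]
for line in format_table(cols, data):
    print(line)
```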

Organize by Twitter unique identifier using python

I have a CSV file with each line containing information pertaining to a particular tweet (i.e. each line contains Lat, Long, User_ID, tweet and so on). I need to read the file and organize the tweets by the User_ID. I am trying to end up with a given User_ID attached to all of the tweets with that specific ID.
Here is what I want:
user_id: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
user_id2: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
and so on...
This is a snippet of my code that reads in the CSV file and creates a list:
import csv
UID = []
myID = []
ID = []
with open(csv_in, newline='') as f:
    myreader = csv.reader(f, delimiter=',')
    for row in myreader:
        # Assign columns in csv to variables.
        latitude = row[0]
        longitude = row[1]
        user_id = row[2]
        user_name = row[3]
        date = row[4]
        time = row[5]
        tweet = row[6]
        flag = row[7]
        compound = row[8]
        Vote = row[9]
        # Read variables into separate lists.
        UID.append(user_id + ', ' + latitude + ', ' + longitude + ', ' + user_name + ', ' + date + ', ' + time + ', ' + tweet + ', ' + flag + ', ' + compound)
myID = ', '.join(UID)
ID = myID.split(', ')
I'd suggest you use pandas for this. It will allow you not only to list your tweets by user_id, as in your question, but also to do many other manipulations quite easily.
As an example, take a look at this Python notebook from NLTK. At the end of it, you see an operation very close to yours, reading a CSV file containing tweets:
In [25]:
import pandas as pd
tweets = pd.read_csv('tweets.20150430-223406.tweet.csv', index_col=2, header=0, encoding="utf8")
You can also find a simple operation: looking for the tweets of a certain user,
In [26]:
tweets.loc[tweets['user.id'] == 557422508]['text']
Out[26]:
id
593891099548094465 VIDEO: Sturgeon on post-election deals http://...
593891101766918144 SNP leader faces audience questions http://t.c...
Name: text, dtype: object
For listing the tweets by user_id, you would simply do something like the following (this is not in the original notebook),
In [9]:
tweets.set_index('user.id')[0:4]
Out[9]:
created_at favorite_count in_reply_to_status_id in_reply_to_user_id retweet_count retweeted text truncated
user.id
107794703 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #KirkKus: Indirect cost of the UK being in ... False
557422508 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False VIDEO: Sturgeon on post-election deals http://... False
3006692193 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #LabourEoin: The economy was growing 3 time... False
455154030 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #GregLauder: the UKIP east lothian candidat... False
Hope it helps.
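For comparison, the grouping the question asks for (each user_id mapped to all of its tweets) can also be sketched with the standard library alone, as an alternative to pandas. The sample rows and the function name below are hypothetical; the column order follows the question's code.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows using the column order from the question:
# lat, long, user_id, user_name, date, time, tweet, flag, compound, vote
SAMPLE_CSV = """\
40.1,-75.2,111,alice,2015-04-30,21:34:06,first tweet,0,0.5,pos
41.3,-74.0,222,bob,2015-04-30,21:35:10,"hello, world",0,0.1,neu
40.2,-75.3,111,alice,2015-04-30,21:40:00,second tweet,0,-0.2,neg
"""


def group_by_user(csv_file):
    """Map each user_id to a list of (lat, long, tweet) tuples."""
    grouped = defaultdict(list)
    for row in csv.reader(csv_file):
        latitude, longitude, user_id, tweet = row[0], row[1], row[2], row[6]
        grouped[user_id].append((latitude, longitude, tweet))
    return dict(grouped)


tweets_by_user = group_by_user(io.StringIO(SAMPLE_CSV))
for user_id, tweets in tweets_by_user.items():
    print(user_id, tweets)
```

Keeping each row as a tuple, rather than joining everything with ', ' and splitting again as in the question, avoids breaking tweets that themselves contain a comma followed by a space.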
