Determining most common name from web scraped birth name data - python

I have been given the task of scraping this page: https://www.ssa.gov/cgi-bin/popularnames.cgi. There you can find a list of the most common birth names. Now I have to find the most common name that both girls and boys have for a given year (in other words, the exact same name appears in both gender columns), but I don't know how to do that. With the code below I solved the previous task of printing the list for a given year, but I have no idea how to modify it so I get the most common name that both girls and boys share.
import requests
import lxml.html as lh

url = 'https://www.ssa.gov/cgi-bin/popularnames.cgi'
string = input("Year: ")
r = requests.post(url, data=dict(year=string, top="1000", number="n"))
doc = lh.fromstring(r.content)
tr_elements = doc.xpath('//table[2]//td[2]//tr')

cols = []
for col in tr_elements[0]:
    name = col.text_content()
    number = col.text_content()
    cols.append((number, []))

count = 0
for row in tr_elements[1:]:
    i = 0
    for col in row:
        val = col.text_content()
        cols[i][1].append(val)
        i += 1
        if count < 4:
            print(val, end=' ')
            count += 1
        else:
            count = 0
            print(val)

Here's one approach. First, group the data by name, recording how many genders use each name and its aggregate total. Then filter that structure down to names used by more than one gender. Finally, sort the multi-gender list by count, descending, and take the 0th element: that is the most popular multi-gender name for the year.
import requests
import lxml.html as lh

url = "https://www.ssa.gov/cgi-bin/popularnames.cgi"
year = input("Year: ")
response = requests.post(url, data=dict(year=year, top="1000", number="n"))
doc = lh.fromstring(response.content)
tr_elements = doc.xpath("//table[2]//td[2]//tr")

names = {}
for row in tr_elements[1:-1]:  # skip the header row and the trailing note row
    row = [cell.text_content() for cell in row]
    # Columns 1 and 3 hold the male and female names; the count for each
    # name sits in the column immediately after it.
    for i, gender in ((1, "male"), (3, "female")):
        if row[i] not in names:
            names[row[i]] = {"count": 0, "genders": set()}
        names[row[i]]["count"] += int(row[i + 1].replace(",", ""))
        names[row[i]]["genders"].add(gender)

shared_names = [
    (name, data) for name, data in names.items() if len(data["genders"]) > 1
]
most_common_shared_names = sorted(shared_names, key=lambda x: -x[1]["count"])
name, data = most_common_shared_names[0]
print("%s => %s, %s" % (year, name, data["count"]))
If you're curious, here are the results since 2000 (a sketch that can reproduce this table follows the list):
2000 => Tyler, 22187
2001 => Tyler, 19842
2002 => Tyler, 18788
2003 => Ryan, 20171
2004 => Madison, 20829
2005 => Ryan, 18661
2006 => Ryan, 17116
2007 => Jayden, 17287
2008 => Jayden, 19040
2009 => Jayden, 19053
2010 => Jayden, 18641
2011 => Jayden, 18064
2012 => Jayden, 16952
2013 => Jayden, 15462
2014 => Logan, 14478
2015 => Logan, 13753
2016 => Logan, 12099
2017 => Logan, 15117
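For reference, here is a sketch of the loop behind that table, with the scraping and aggregation above wrapped in a helper. The name most_common_shared_name is made up for illustration; it is not part of the original answer.

import requests
import lxml.html as lh

def most_common_shared_name(year):
    """Return (name, count) for the most popular name used by both genders."""
    url = "https://www.ssa.gov/cgi-bin/popularnames.cgi"
    response = requests.post(url, data=dict(year=str(year), top="1000", number="n"))
    doc = lh.fromstring(response.content)
    names = {}
    for row in doc.xpath("//table[2]//td[2]//tr")[1:-1]:
        cells = [cell.text_content() for cell in row]
        for i in (1, 3):  # column 1 = male name, column 3 = female name
            entry = names.setdefault(cells[i], {"count": 0, "genders": set()})
            entry["count"] += int(cells[i + 1].replace(",", ""))
            entry["genders"].add(i)  # the column index doubles as a gender marker
    shared = [(n, d["count"]) for n, d in names.items() if len(d["genders"]) > 1]
    return max(shared, key=lambda pair: pair[1])

for year in range(2000, 2018):
    name, count = most_common_shared_name(year)
    print("%s => %s, %s" % (year, name, count))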

Related

Storing key text as header and value text as rows using a dataframe in Python using Beautiful Soup

for imo in imos:
    ...
    keys_div = soup.find_all("div", {"class": "col-4 keytext"})
    values_div = soup.find_all("div", {"class": "col-8 valuetext"})
    for key, value in zip(keys_div, values_div):
        print(key.text + ": " + value.text)
Output:
Ship Name: MAERSK ADRIATIC
Shiptype: Chemical/Products Tanker
IMO/LR No.: 9636632
Gross: 23,297
Call Sign: 9V3388
Deadweight: 37,538
MMSI No.: 566429000
Year of Build: 2012
Flag: Singapore
Status: In Service/Commission
Operator: Handytankers K/S
Shipbuilder: Hyundai Mipo Dockyard Co Ltd
ShipType: Chemical/Products Tanker
Built: 2012
GT: 23,297
Deadweight: 37,538
Length Overall: 184.000
Length (BP): 176.000
Length (Reg): 177.460
Bulbous Bow: Yes
Breadth Extreme: 27.430
Breadth Moulded: 27.400
Draught: 11.500
Depth: 17.200
Keel To Mast Height: 46.900
Displacement: 46565
T/CM: 45.0
This is the output for one imo. I want to store this output in a dataframe and write it to CSV, with the key text as the header and the value text as rows, for all the IMOs. Please help me with how to do it.
All you have to do is add the results to a list and then output that list to a dataframe.
import pandas as pd

filepath = r"C:\users\test\test_file.csv"

output_data = []
for imo in imos:
    keys_div = [i.text for i in soup.find_all("div", {"class": "col-4 keytext"})]
    values_div = [i.text for i in soup.find_all("div", {"class": "col-8 valuetext"})]
    dict1 = dict(zip(keys_div, values_div))
    output_data.append(dict1)

df = pd.DataFrame(output_data)
df.to_csv(filepath, index=False)
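Note that this assumes soup is re-parsed inside the loop for each imo; otherwise every appended dict comes from the same page. A sketch of that fetch step, with a made-up URL pattern standing in for the real ship-detail endpoint:

import requests
from bs4 import BeautifulSoup
import pandas as pd

imos = ["9636632"]  # example IMO number taken from the question's output

output_data = []
for imo in imos:
    # Hypothetical URL pattern -- substitute the site actually being scraped.
    page = requests.get("https://example.com/ships/" + imo)
    soup = BeautifulSoup(page.content, "html.parser")
    keys_div = [i.text for i in soup.find_all("div", {"class": "col-4 keytext"})]
    values_div = [i.text for i in soup.find_all("div", {"class": "col-8 valuetext"})]
    output_data.append(dict(zip(keys_div, values_div)))

pd.DataFrame(output_data).to_csv("ships.csv", index=False)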

How do I transform a non-CSV text file into a CSV using Python/Pandas?

I have a text file that looks like this:
Id Number: 12345678
Location: 1234561791234567090-8.9
Street: 999 Street AVE
Buyer: john doe
Id Number: 12345688
Location: 3582561791254567090-8.9
Street: 123 Street AVE
Buyer: Jane doe # buyer % LLC
Id Number: 12345689
Location: 8542561791254567090-8.9
Street: 854 Street AVE
Buyer: Jake and Bob: Owner%LLC: Inc
I'd like the file to look like this:

Id Number    Location                   Street          Buyer
12345678     1234561791234567090-8.9    999 Street AVE  john doe
12345688     3582561791254567090-8.9    123 Street AVE  Jane doe # buyer % LLC
12345689     8542561791254567090-8.9    854 Street AVE  Jake and Bob: Owner%LLC: Inc
I have tried the following:
# 1 Read text file and ignore bad lines (lines with extra colons thus reading as extra fields).
tr = pd.read_csv('C:\\File Path\\test.txt', sep=':', header=None, error_bad_lines=False)
# 2 Convert into a dataframe/pivot table.
ndf = pd.DataFrame(tr.pivot(index=None, columns=0, values=1))
# 3 Clean up the pivot table to remove NaNs and reset the index (line by line).
nf2 = ndf.apply(lambda x: x.dropna().reset_index(drop=True))
Here is where I got the last line (#3): https://stackoverflow.com/a/62481057/10448224
When I do the above and export to CSV the headers are arranged like the following:
(index)
Street
Buyer
Id Number
Location
The data is filled in nicely, but at some point the Buyer field becomes inaccurate, while the rest of the fields stay accurate through the entire DF.
My guesses:
When I run #1 part of my script I get the following errors 507 times:
b'Skipping line 500: expected 2 fields, saw 3\nSkipping line 728: expected 2 fields, saw 3\
At the tail end of the new DF I am missing exactly 507 entries for the Buyer field. So I think that when I drop the bad lines, the remaining data gets pushed up into the wrong rows.
Pain Points:
The Buyer field will sometimes have extra colons and other odd characters. So when I try to use a colon as a delimiter I run into problems.
I am new to Python and I am very new to using functions. I primarily use Pandas to manipulate data at a somewhat basic level. So in the words of the great Michael Scott: "Explain it to me like I'm five." Many many thanks to anyone willing to help.
Here's what I meant by reading in and using split. Very similar to the other answers. Untested, and I don't recall whether inputline includes the EOL, so I stripped it too.
import pandas as pd

with open('myfile.txt') as f:
    data = []    # holds the database
    record = {}  # holds the record being built up
    for inputline in f:
        key, value = inputline.strip().split(':', 1)
        if key == "Id Number":  # new record starting
            if len(record):
                data.append(record)  # write out the previous record
                record = {}
        record.update({key: value})
    if len(record):
        data.append(record)  # write out the final record

df = pd.DataFrame(data)
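To get from there to the CSV the question asks for, one more line should do (the output path here is just an example):

df.to_csv('myfile.csv', index=False)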
This is a minimal example that demonstrates the basics:
cat split_test.txt
Id Number: 12345678
Location: 1234561791234567090-8.9
Street: 999 Street AVE
Buyer: john doe
Id Number: 12345688
Location: 3582561791254567090-8.9
Street: 123 Street AVE
Buyer: Jane doe # buyer % LLC
Id Number: 12345689
Location: 8542561791254567090-8.9
Street: 854 Street AVE
Buyer: Jake and Bob: Owner%LLC: Inc
import csv

with open("split_test.txt", "r") as f:
    id_val = "Id Number"
    list_var = []
    for line in f:
        # Note: a bare split(':') also splits on colons inside values, which
        # is why "Jake and Bob: Owner%LLC: Inc" gets truncated in the output
        # below; split(':', 1) would avoid that.
        split_line = line.strip().split(':')
        print(split_line)
        if split_line[0] == id_val:
            d = {}
            d[split_line[0]] = split_line[1]
            list_var.append(d)
        else:
            d.update({split_line[0]: split_line[1]})
list_var
[{'Id Number': ' 12345689',
'Location': ' 8542561791254567090-8.9',
'Street': ' 854 Street AVE',
'Buyer': ' Jake and Bob'},
{'Id Number': ' 12345678',
'Location': ' 1234561791234567090-8.9',
'Street': ' 999 Street AVE',
'Buyer': ' john doe'},
{'Id Number': ' 12345688',
'Location': ' 3582561791254567090-8.9',
'Street': ' 123 Street AVE',
'Buyer': ' Jane doe # buyer % LLC'}]
with open("split_ex.csv", "w") as csv_file:
field_names = list_var[0].keys()
csv_writer = csv.DictWriter(csv_file, fieldnames=field_names)
csv_writer.writeheader()
for row in list_var:
csv_writer.writerow(row)
I would try reading the file line by line, splitting the key-value pairs into a list of dicts to look something like:
data = [
    {
        "Id Number": 12345678,
        "Location": 1234561791234567090-8.9,
        ...
    },
    {
        "Id Number": ...
    }
]

# easy to create the dataframe from here
your_df = pd.DataFrame(data)

How to number from an SQL database in Python?

How do I get the numbers 1 to 10 next to the SQL table contents from the Chinook database in a good format? I can't get the loop from 1 to 10 to print next to the other three elements from the database file. The output I want:
1 Chico Buarque Minha Historia 27
2 Lenny Kravitz Greatest Hits 26
3 Eric Clapton Unplugged 25
4 Titãs Acústico 22
5 Kiss Greatest Kiss 20
6 Caetano Veloso Prenda Minha 19
7 Creedence Clearwater Revival Chronicle, Vol. 2 19
8 The Who My Generation - The Very Best Of The Who 19
9 Green Day International Superhits 18
10 Creedence Clearwater Revival Chronicle, Vol. 1 18
My code:

import sqlite3

try:
    conn = sqlite3.connect(r'C:\Users\Just\Downloads\chinook.db')
except Exception as e:
    print(e)

cur = conn.cursor()
cur.execute('''SELECT artists.Name, albums.Title, count(albums.AlbumId) AS AlbumAmountListened
               FROM albums
               INNER JOIN tracks ON albums.AlbumId = tracks.AlbumId
               INNER JOIN invoice_items ON tracks.TrackId = invoice_items.TrackId
               INNER JOIN artists ON albums.ArtistId = artists.ArtistId
               GROUP BY albums.AlbumId
               ORDER BY AlbumAmountListened DESC
               LIMIT 10''')
top_10_albums = cur.fetchall()

def rank():
    for item in top_10_albums:
        name = item[0]
        artist = item[1]
        album_played = item[2]
        def num():
            for i in range(1, 11):
                print(i)
            return i
        print(num(), '\t', name, '\t', artist, '\t', album_played, '\t')

print(rank())
My 1-10 numbering loops like this:
1
2
3
4
5
6
7
8
9
10
10 Chico Buarque Minha Historia 27
1
2
3
4
5
6
7
8
9
10
10 Lenny Kravitz Greatest Hits 26
And so on. How do I correctly combine my range object?
You can use enumerate() to provide the numbers for you as you iterate over the rows:
top_10_albums = cur.fetchall()

for i, item in enumerate(top_10_albums, start=1):
    name = item[0]
    artist = item[1]
    album_played = item[2]
    print(f'{i}\t{name}\t{artist}\t{album_played}')
You don't even have to unpack the item into variables; just reference them directly in the f-string:
for i, item in enumerate(top_10_albums, start=1):
    print(f'{i}\t{item[0]}\t{item[1]}\t{item[2]}')
But this is perhaps nicer:
for i, (name, artist, album_played) in enumerate(top_10_albums, start=1):
    print(f'{i}\t{name}\t{artist}\t{album_played}')
This uses tuple unpacking to bind the fields from the row to descriptively named variables, which makes the code self-documenting.
You just need to keep an index (i) within the for loop, such as:
top_10_albums = cur.fetchall()

i = 0
for item in top_10_albums:
    name = item[0]
    artist = item[1]
    album_played = item[2]
    i += 1
    print(i, '\t', name, '\t', artist, '\t', album_played, '\t')
In your case, the inner loop prints all 10 numbers on every step of the outer loop.
Numbered Version
def rowView(strnum, row, flen_align=[(30, "l"), (30, "r"), (5, "r")]):
    i = 0
    line = ""
    for k, v in row.items():
        flen, align = flen_align[i]
        strv = str(v)
        spaces = "_" * abs(flen - len(strv))
        if align == "l":
            line += strv + spaces
        if align == "r":
            line += spaces + strv
        i += 1
    return strnum + line

dlist = [
    {"name": "Chico Buarque", "title": "Minha Historia", "AAL": 27},
    {"name": "Lenny Kravit", "title": "Greatest Hits", "AAL": 26},
    {"name": "Eric Clapton", "title": "Unplugged", "AAL": 25},
    {"name": "Titã", "title": "Acústico", "AAL": 22},
    {"name": "Kis", "title": "Greatest Kiss", "AAL": 20},
    {"name": "Caetano Velos", "title": "Prenda Minha", "AAL": 19},
    {"name": "Creedence Clearwater Reviva", "title": "Chronicle,Vol.2", "AAL": 19},
    {"name": "TheWho My Generation", "title": "The Very Best Of The Who", "AAL": 19},
    {"name": "Green Da", "title": "International Superhits", "AAL": 18},
    {"name": "Creedence Clearwater Reviva", "title": "Chronicle,Vol.1", "AAL": 18}
]

for num, row in enumerate(dlist, start=1):
    strnum = str(num)
    strnum += "_" * (5 - len(strnum))
    print(rowView(strnum, row))
Or using record id directly
def rowView(row, flen_align=[(5, "l"), (30, "l"), (30, "r"), (5, "r")]):
    i, line = 0, ""
    for k, v in row.items():
        flen, align = flen_align[i]
        strv = str(v)
        spaces = "_" * abs(flen - len(strv))
        if align == "l":
            line += strv + spaces
        if align == "r":
            line += spaces + strv
        i += 1
    return line

dlist = [
    {"id": 1, "name": "Chico Buarque", "title": "Minha Historia", "AAL": 27},
    {"id": 2, "name": "Lenny Kravit", "title": "Greatest Hits", "AAL": 26},
    {"id": 3, "name": "Eric Clapton", "title": "Unplugged", "AAL": 25},
    {"id": 4, "name": "Titã", "title": "Acústico", "AAL": 22},
    {"id": 5, "name": "Kis", "title": "Greatest Kiss", "AAL": 20},
    {"id": 6, "name": "Caetano Velos", "title": "Prenda Minha", "AAL": 19},
    {"id": 7, "name": "Creedence Clearwater Reviva", "title": "Chronicle,Vol.2", "AAL": 19},
    {"id": 8, "name": "TheWho My Generation", "title": "The Very Best Of The Who", "AAL": 19},
    {"id": 9, "name": "Green Da", "title": "International Superhits", "AAL": 18},
    {"id": 10, "name": "Creedence Clearwater Reviva", "title": "Chronicle,Vol.1", "AAL": 18}
]

for row in dlist:
    print(rowView(row))
same output for both versions:
1____Chico Buarque_________________________________Minha Historia___27
2____Lenny Kravit___________________________________Greatest Hits___26
3____Eric Clapton_______________________________________Unplugged___25
4____Titã________________________________________________Acústico___22
5____Kis____________________________________________Greatest Kiss___20
6____Caetano Velos___________________________________Prenda Minha___19
7____Creedence Clearwater Reviva__________________Chronicle,Vol.2___19
8____TheWho My Generation________________The Very Best Of The Who___19
9____Green Da_____________________________International Superhits___18
10___Creedence Clearwater Reviva__________________Chronicle,Vol.1___18

Grab useful data between strings from a file using Python

I have the following raw data from an HTML source file:
{$deletedFields:[courses,projects,description,degreeName,recommendations,honors,entityLocale,activities,grade,fieldOfStudyUrn,testScores,degreeUrn],entityUrn:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717),school:urn:li:fs_miniSchool:11709,timePeriod:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717),timePeriod,schoolName:Charles University in Prague,fieldOfStudy:Economics, Politics,schoolUrn:urn:li:fs_miniSchool:11709,$type:com.linkedin.voyager.identity.profile.Education},
{$deletedFields:[courses,projects,description,recommendations,honors,entityLocale,activities,grade,fieldOfStudyUrn,testScores,degreeUrn],entityUrn:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),school:urn:li:fs_miniSchool:17888,timePeriod:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),timePeriod,degreeName:BA,schoolName:Occidental College,fieldOfStudy:Economics,schoolUrn:urn:li:fs_miniSchool:17888,$type:com.linkedin.voyager.identity.profile.Education},
{$deletedFields:[],profileId:ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,elements:[urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717)],paging:urn:li:fs_profileView:ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,educationView,paging,$type:com.linkedin.voyager.identity.profile.EducationView,$id:urn:li:fs_profileView:ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,educationView},
{$deletedFields:[],start:501,end:1000,$type:com.linkedin.voyager.identity.profile.EmployeeCountRange,$id:urn:li:fs_position:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,323432440),company,employeeCountRange}
{$deletedFields:[month,day],year:2007,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717),timePeriod,startDate},
{$deletedFields:[month,day],year:2004,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),timePeriod,startDate},
{$deletedFields:[month,day],year:2008,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,26812055),timePeriod,endDate},
{$deletedFields:[month,day],year:2007,$type:com.linkedin.common.Date,$id:urn:li:fs_education:(ACoAAAIUJvgBC7QTHSmLEjgtomzxvwceeM71E1c,75863717),timePeriod,endDate},
What I need is to extract some data out of it using:
schoolname = re.findall(r',schoolname:(.*?),' , page_html)
fieldofstudy = skills = re.findall(r'fieldOfStudy:(.*?),s' , page_html)
degreename = re.findall(r'degreeName:(.*?),' , page_html)
Needed Output
schoolName:Charles University in Prague
fieldOfStudy:Economics, Politics
Start : Year 2007
End : 2007
schoolName:Occidental College
fieldOfStudy:Economics
degreeName:BA
start : 2004
End : 2008
Question: What I need is to extract some data out of it

Define a data container class School:
class School(object):
    def __init__(self, raw_data):
        key = None
        year = '?'
        for kv in raw_data:
            i = kv.find(':')
            if i >= 0:
                key = kv[0:i]
                value = kv[i + 1:]
                if key in ['schoolName', 'fieldOfStudy', 'startDate', 'endDate', 'degreeName']:
                    object.__setattr__(self, key, value)
                if key in ['year']:
                    year = value
            else:
                if key in ['entityUrn', '$id']:
                    if kv[:-1].isdigit():
                        self.entity = kv[:-1]
                elif key in ['fieldOfStudy']:
                    self.fieldOfStudy += ', ' + kv
                elif kv in ['startDate', 'endDate']:
                    object.__setattr__(self, kv, year)
                    key = ''
        if not hasattr(self, 'degreeName'):
            self.degreeName = 'unknown'

    def __repr__(self):
        return "entity:\t\t{s.entity:>28}\n" \
               "schoolName:\t{s.schoolName:>28}\n" \
               "fieldOfStudy:{s.fieldOfStudy:>27}\n" \
               "degreeName:\t{s.degreeName:>28}\n" \
               "startDate:\t{s.startDate:>28}\n" \
               "endDate:\t{s.endDate:>28}\n".format(s=self)
Read the file line by line:
import re

with open('<path to file>') as fh:
    degreeUrn = {}
    for line in fh:
        match = re.findall(r'\{(.*?)\:\[(.*?)\],(.*)\}', line)
        m2 = match[0][2].split(',')
        school = School(m2)
        if hasattr(school, 'entity'):
            if hasattr(school, 'startDate'):
                degreeUrn[school.entity].startDate = school.startDate
                del school
            elif hasattr(school, 'endDate'):
                degreeUrn[school.entity].endDate = school.endDate
                del school
            elif hasattr(school, 'schoolName'):
                degreeUrn[school.entity] = school
        else:
            del school

for entity in degreeUrn:
    print(degreeUrn[entity])
Output:
entity: 75863717
schoolName: Charles University in Prague
fieldOfStudy: Economics, Politics
degreeName: unknown
startDate: 2007
endDate: 2007
entity: 26812055
schoolName: Occidental College
fieldOfStudy: Economics
degreeName: BA
startDate: 2004
endDate: 2008
Tested with Python: 3.4.2
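As an aside, the patterns in the question never match because they use schoolname with a lowercase n while the data says schoolName, and re is case-sensitive by default. A minimal corrected sketch for just the simple fields, assuming the raw text has been read into page_html:

import re

with open('<path to file>') as fh:  # same placeholder path as above
    page_html = fh.read()

schoolname = re.findall(r'schoolName:(.*?),', page_html)
# fieldOfStudy values may contain commas ("Economics, Politics"),
# so anchor the match on the key that follows it in the data.
fieldofstudy = re.findall(r'fieldOfStudy:(.*?),schoolUrn', page_html)
degreename = re.findall(r'degreeName:(.*?),', page_html)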

Organize by Twitter unique identifier using python

I have a CSV file with each line containing information pertaining to a particular tweet (i.e. each line contains Lat, Long, User_ID, tweet and so on). I need to read the file and organize the tweets by the User_ID. I am trying to end up with a given User_ID attached to all of the tweets with that specific ID.
Here is what I want:
user_id: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
user_id2: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
and so on...
This is a snippet of my code that reads in the CSV file and creates a list:
UID = []
myID = []
ID = []
f = None

with open(csv_in, 'rU') as f:
    myreader = csv.reader(f, delimiter=',')
    for row in myreader:
        # Assign columns in csv to variables.
        latitude = row[0]
        longitude = row[1]
        user_id = row[2]
        user_name = row[3]
        date = row[4]
        time = row[5]
        tweet = row[6]
        flag = row[7]
        compound = row[8]
        Vote = row[9]
        # Read variables into separate lists.
        UID.append(user_id + ', ' + latitude + ', ' + longitude + ', ' + user_name + ', ' + date + ', ' + time + ', ' + tweet + ', ' + flag + ', ' + compound)

myID = ', '.join(UID)
ID = myID.split(', ')
I'd suggest you use pandas for this. It will allow you not only to list your tweets by user_id, as in your question, but also to perform many other manipulations quite easily.
As an example, take a look at this Python notebook from NLTK. At the end of it, you'll see an operation very close to yours, reading a CSV file containing tweets:
In [25]:
import pandas as pd

tweets = pd.read_csv('tweets.20150430-223406.tweet.csv', index_col=2, header=0, encoding="utf8")
You can also find a simple operation: looking for the tweets of a certain user,
In [26]:
tweets.loc[tweets['user.id'] == 557422508]['text']
Out[26]:
id
593891099548094465 VIDEO: Sturgeon on post-election deals http://...
593891101766918144 SNP leader faces audience questions http://t.c...
Name: text, dtype: object
For listing the tweets by user_id, you would simply do something like the following (this is not in the original notebook),
In [9]:
tweets.set_index('user.id')[0:4]
Out[9]:
created_at favorite_count in_reply_to_status_id in_reply_to_user_id retweet_count retweeted text truncated
user.id
107794703 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #KirkKus: Indirect cost of the UK being in ... False
557422508 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False VIDEO: Sturgeon on post-election deals http://... False
3006692193 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #LabourEoin: The economy was growing 3 time... False
455154030 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #GregLauder: the UKIP east lothian candidat... False
Hope it helps.
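For the exact shape asked for in the question (one user_id heading its lat/long/tweet rows), a groupby sketch along these lines might work; the column names are taken from the question's layout, the file is assumed to have no header row, and csv_in is the question's own path variable:

import pandas as pd

# Column names taken from the question's CSV layout.
cols = ['latitude', 'longitude', 'user_id', 'user_name', 'date',
        'time', 'tweet', 'flag', 'compound', 'vote']
tweets = pd.read_csv(csv_in, header=None, names=cols)

# One group per user_id, each holding that user's lat/long/tweet rows.
for user_id, group in tweets.groupby('user_id'):
    print(user_id)
    print(group[['latitude', 'longitude', 'tweet']].to_string(index=False))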
