Related
I am a self-taught data science student, currently doing my first big Python portfolio project in several steps. The first step is using pandas to work with IMDb (Internet Movie Database)'s rather oddly structured .tsv files, in an effort to create a fully searchable big data repository of all IMDb data (the officially supported searches, and even APIs like OMDB (Open Movie Database), don't allow the kinds of detailed queries I need for the larger project).
IMDb's public files include all data on movies, TV shows, episodes, actors, directors, crew, the whole business, scattered rather haphazardly across seven massive tsv files. I've confirmed that pandas can read in all of this data and that my computer's memory can handle it, but what I want to do is merge the seven tsv files into a single DataFrame object that can then be exported to (preferably) a SQL database, or even one huge spreadsheet or TSV file.
Each item in the database (movie, actor, individual TV episode) has a tconst identifier, a string, which one file labels "titleId" and every other file labels "tconst". I'm going to need to rename titleId to tconst when I read that file in; this is one of several challenges I haven't got to yet.
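I expect something like this minimal sketch will cover the rename (assuming title.akas.tsv is the file that uses titleId):
import pandas as pd

# read the file that uses titleId and rename it to match the other files
akas = pd.read_table("title.akas.tsv", sep="\t", low_memory=False)
akas = akas.rename(columns={"titleId": "tconst"})
Here is what I have so far: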
import pandas as pd

# set pandas formatting parameters
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 25)

# read in the data tables provided
showbiz_core = pd.read_table("name.basics.tsv", sep='\t')

# temporary hack - print the entire dataframe as a test
print(showbiz_core)
This works, but I'm not sure exactly how to proceed. I want to import each of the other tsv files to reconstruct the IMDb database locally. That means I don't want duplicate tconst strings; instead, new information about a tconst entry (like a film) should be appended to it as new columns.
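What I picture is each new file being merged onto the core frame on tconst, roughly like this sketch (using title.ratings as an example):
import pandas as pd

titles = pd.read_table("title.basics.tsv", sep="\t", na_values="\\N", low_memory=False)
ratings = pd.read_table("title.ratings.tsv", sep="\t", na_values="\\N")

# left-join so every title keeps its row and simply gains the new rating columns
titles = titles.merge(ratings, on="tconst", how="left")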
Should I be looking to do a "for i in [new file]" type loop somehow? How would you go about this?
The IMDb files are actually highly structured, and looping is a poor approach for merging data. My approach:
- structured data sourcing - I used wget rather than downloading the files manually
- the files are large, so work with a subset for modelling purposes; I used popular movies and actors as the driver
- the comma-separated columns in the tsv files are really sub-tables - treat them as such; I build a reference entity nmi to do this
- there are other associative relationships there as well (primaryProfession, genres)
- finally, join (merge) everything together from OMDB and IMDb, taking the first row where many items associate to a title
I have left the data as tsv for now; clearly it would be very simple to load it into a database using the to_sql() method. The main point is the sourcing and transformation, i.e. ETL (which has become an unfashionable term). This can be further supplemented with web scraping; I looked at Box Office Mojo, but that would require selenium to scrape because it is dynamic HTML.
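As a rough sketch of that load step (assuming a local SQLite file and the dfs dict of frames built in the transform step below):
import sqlite3

# hypothetical target database - any SQLAlchemy engine or sqlite3 connection works with to_sql()
con = sqlite3.connect("imdb_subset.db")
for name, frame in dfs.items():
    frame.to_sql(name.replace(".", "_"), con, if_exists="replace", index=False)
con.close()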
IMDb sourcing
import requests, json, re, urllib.parse
from bs4 import BeautifulSoup
import pandas as pd
import wget, gzip
from pathlib import Path
import numpy as np

# find what IMDb has to give ...
resp = requests.get("https://datasets.imdbws.com")
soup = BeautifulSoup(resp.content.decode(), "html.parser")
files = {}
for f in soup.find_all("a", href=True):
    if f["href"].endswith('gz'):
        u = urllib.parse.urlparse(f["href"])
        fn = Path().cwd().joinpath(u.path.strip("/"))
        files[Path(fn.stem).stem] = fn.name
        if not fn.is_file():
            wget.download(f["href"])
IMDb transform
Set alldata=True on the first run to prep the data; on the second run set it to False and you have a manageable subset.
alldata = False
subsetdata = True
dfs = {}

# work with a subset of data to speed up modelling and iterations. Take a few major actors and titles
# as criteria to build a manageable representative set of data
l = ["Tom Hanks","Will Smith","Clint Eastwood","Leonardo DiCaprio","Johnny Depp","Meryl Streep","Bruce Willis"]
tm = {'tconst': ['tt0111161','tt0468569','tt1375666','tt0137523','tt0110912','tt0109830','tt0944947','tt0133093','tt0120737','tt0167260','tt0068646'],
      'averageRating': [9.3, 9.0, 8.8, 8.8, 8.9, 8.8, 9.3, 8.7, 8.8, 8.9, 9.2],
      'numVotes': [2275837,2237966,1997918,1805137,1777920,1752954,1699318,1630083,1618100,1602417,1570167]}

# work with subset for modelling purpose
k = "name.basics"
if alldata:
    dfs[k] = pd.read_csv(gzip.open(files[k]), sep="\t").replace({"\\N":np.nan})
    if subsetdata:
        # manage down size of nmi
        dfs[k] = dfs[k].loc[(dfs[k]["primaryName"].isin(l)
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][0])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][1])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][2])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][3])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][4])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][5])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][6])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][7])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][8])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][9])
                             )
                            & dfs[k]["knownForTitles"].str.contains("tt")]
        dfs[k].to_csv(f"{files[k]}_subset.tsv", sep="\t", index=False)
else:
    dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv", sep="\t")
dfs[k] = dfs[k].astype({c:"Int64" for c in dfs[k].columns}, errors="ignore")
# birth year is a lot but getting data issues...
# dfs[k] = dfs[k].dropna(subset=["primaryProfession","birthYear"])

# comma separated - not good for joins and merges. rename for consistency
dfs["nmi"] = (dfs["name.basics"].loc[:,["nconst","knownForTitles"]]
              .assign(knownForTitles=lambda x: x["knownForTitles"].str.split(","))
              .explode("knownForTitles")
              ).rename(columns={"knownForTitles":"tconst"}).drop_duplicates()
# already extracted known titles so can drop and de-dup - e.g. Tom Hanks
dfs[k] = dfs[k].drop(columns=["knownForTitles"]).drop_duplicates()

for k in [k for k in files.keys() if k not in ["name.basics","omdb.titles"]]:
    if alldata:
        dfs[k] = pd.read_csv(gzip.open(files[k]), sep="\t").replace({"\\N":np.nan})
        if k=="title.akas": dfs[k]=dfs[k].rename(columns={"titleId":"tconst"})
        # subset titles to those we have names
        if subsetdata:
            c = "tconst" if k!= "title.episode" else "parentTconst"
            try:
                (dfs[k].loc[dfs[k][c].isin(dfs["nmi"]["tconst"])]
                 .to_csv(f"{files[k]}_subset.tsv", sep="\t", index=False))
            except KeyError as e:
                print(k, dfs[k].columns, e)
    else:
        dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv", sep="\t")
    dfs[k] = dfs[k].astype({c:"Int64" for c in dfs[k].columns}, errors="ignore")

dfs["name.and.titles"] = dfs["nmi"].merge(dfs["name.basics"], on="nconst").merge(dfs["title.basics"], on="tconst")
OMDB sourcing
omdbcols = ['Title', 'Year', 'Rated', 'Released', 'Runtime', 'Genre', 'Director', 'Writer', 'Actors', 'Plot', 'Language', 'Country', 'Awards', 'Poster', 'Ratings', 'Metascore', 'imdbRating', 'imdbVotes', 'imdbID', 'Type', 'DVD', 'BoxOffice', 'Production', 'Website', 'Response']
omdbk = "omdb.titles"
files[omdbk] = f"{omdbk}.tsz"
if not Path().cwd().joinpath(files[omdbk]).is_file():
    dfs[omdbk] = pd.DataFrame(columns=omdbcols)
else:
    dfs[omdbk] = pd.read_csv(files[omdbk], sep="\t", thousands=",")
dfs[omdbk] = dfs[omdbk].astype({c:"Int64" for c in dfs[omdbk].columns}, errors="ignore")

k = "title.basics"
# limited to 1000 API calls a day, so only fetch if have not done already
for tconst in dfs[k].loc[~(dfs[k]["tconst"].isin(dfs[omdbk]["imdbID"]))]["tconst"].values:
    # tt0109830 movie Forrest Gump
    # http://www.omdbapi.com/?i=tt3896198&apikey=xxx
    params = {"apikey": apikey, "i": tconst, "plot": "full"}
    res = requests.get("http://www.omdbapi.com/", params=params)
    if res.status_code != 200:
        print("breached API limit")
        break
    else:
        dfs[omdbk] = pd.concat([dfs[omdbk], pd.json_normalize(res.json())])
    dfs[omdbk].to_csv(files[omdbk], index=False, sep="\t")
sample analysis
# The Dark Knight tt0468569
# Game of Thrones tt0944947
# for demo purposes - just pick the first association when there are many
mask = dfs[omdbk]["imdbID"].isin(["tt0468569","tt0944947"])
demo = (dfs[omdbk].loc[mask]
        .rename(columns={c: f"OMDB{c}" for c in dfs[omdbk].columns})
        .rename(columns={"OMDBimdbID": "tconst"})
        .merge(dfs["title.basics"], on="tconst")
        .merge(dfs["title.ratings"], on="tconst")
        .merge(dfs["title.akas"].groupby("tconst", as_index=False).first(), on="tconst")
        .merge(dfs["title.crew"].groupby("tconst", as_index=False).first(), on="tconst")
        .merge(dfs["title.principals"].groupby("tconst", as_index=False).first(), on="tconst")
        .merge(dfs["title.episode"].groupby("parentTconst", as_index=False).first(),
               left_on="tconst", right_on="parentTconst", how="left", suffixes=("","_ep"))
        .merge(dfs["nmi"]
               .merge(dfs["name.basics"], on="nconst")
               .groupby(["tconst"], as_index=False).first(), on="tconst", suffixes=("","_name"))
       ).T
output
0 1
OMDBTitle The Dark Knight Game of Thrones
OMDBYear 2008 2011–2019
OMDBRated PG-13 TV-MA
OMDBReleased 18 Jul 2008 17 Apr 2011
OMDBRuntime 152 min 57 min
OMDBGenre Action, Crime, Drama, Thriller Action, Adventure, Drama, Fantasy, Romance
OMDBDirector Christopher Nolan NaN
OMDBWriter Jonathan Nolan (screenplay), Christopher Nolan (screenplay), Christopher Nolan (story), David S. Goyer (story), Bob Kane (characters) David Benioff, D.B. Weiss
OMDBActors Christian Bale, Heath Ledger, Aaron Eckhart, Michael Caine Peter Dinklage, Lena Headey, Emilia Clarke, Kit Harington
OMDBLanguage English, Mandarin English
OMDBCountry USA, UK USA, UK
OMDBAwards Won 2 Oscars. Another 153 wins & 159 nominations. Won 1 Golden Globe. Another 374 wins & 602 nominations.
OMDBPoster https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw##._V1_SX300.jpg https://m.media-amazon.com/images/M/MV5BYTRiNDQwYzAtMzVlZS00NTI5LWJjYjUtMzkwNTUzMWMxZTllXkEyXkFqcGdeQXVyNDIzMzcwNjc#._V1_SX300.jpg
OMDBRatings [{'Source': 'Internet Movie Database', 'Value': '9.0/10'}, {'Source': 'Rotten Tomatoes', 'Value': '94%'}, {'Source': 'Metacritic', 'Value': '84/100'}] [{'Source': 'Internet Movie Database', 'Value': '9.3/10'}]
OMDBMetascore 84 <NA>
OMDBimdbRating 9 9.3
OMDBimdbVotes 2234169 1679892
tconst tt0468569 tt0944947
OMDBType movie series
OMDBDVD 09 Dec 2008 NaN
OMDBBoxOffice $533,316,061 NaN
OMDBProduction Warner Bros. Pictures/Legendary NaN
OMDBWebsite <NA> <NA>
OMDBResponse 1 1
OMDBtotalSeasons <NA> 8
titleType movie tvSeries
primaryTitle The Dark Knight Game of Thrones
originalTitle The Dark Knight Game of Thrones
isAdult 0 0
startYear 2008 2011
endYear <NA> 2019
runtimeMinutes 152 57
genres Action,Crime,Drama Action,Adventure,Drama
averageRating 9 9.3
numVotes 2237966 1699318
ordering_x 10 10
title The Dark Knight Taht Oyunları
region GB TR
language en tr
types imdbDisplay imdbDisplay
attributes fake working title literal title
isOriginalTitle 0 0
directors nm0634240 nm0851930,nm0551076,nm0533713,nm0336241,nm1888967,nm1047532,nm0764601,nm0007008,nm0617042,nm0787687,nm0687964,nm0070474,nm1125275,nm0638354,nm0002399,nm0806252,nm0755261,nm0887700,nm0590889
writers nm0634300,nm0634240,nm0333060,nm0004170 nm1125275,nm0552333,nm1888967,nm4984276,nm2643685,nm7260047,nm2977599,nm0961827,nm0260870
ordering_y 10 10
nconst nm0746273 nm0322513
category producer actor
job producer creator
characters ["Bruce Wayne"] ["Jorah Mormont"]
parentTconst NaN tt0944947
tconst_ep NaN tt1480055
seasonNumber <NA> 1
episodeNumber <NA> 1
nconst_name nm0000198 nm0000293
primaryName Gary Oldman Sean Bean
birthYear 1958 1959
deathYear 1998 2020
primaryProfession actor,soundtrack,producer actor,producer,animation_department
I'm using tabula to concatenate all the tables in the following pdf file into one table in Excel format.
Here's my code:
from tabula import read_pdf
import pandas as pd
allin = []
for page in range(1, 115):
    table = read_pdf("goal.pdf", pages=page,
                     pandas_options={'header': None})[0]
    allin.append(table)
new = pd.concat(allin)
new.to_excel("out.xlsx", index=False)
I also tried the following:
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='all', pandas_options={'header': None})
new = pd.concat(table, ignore_index=True)
new.to_excel("out.xlsx", index=False)
Current output: check
But the issue I'm facing is that from page 91 onward, the data is not formatted correctly in the Excel file.
I've debugged the page individually and I couldn't figure out why it's formatted wrongly, especially since it uses the same layout.
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='91', pandas_options={'header': None})[0]
print(table)
Example:
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='90-91', pandas_options={'header': None})
new = pd.concat(table, ignore_index=True)
new.to_excel("out.xlsx", index=False)
Here I've run the code for two pages, 90 and 91.
Starting from row 48 you will see the difference:
You will notice that the name and address are placed into one cell, and the city and state are placed into one cell as well.
I dug into the source code and found that it has the option columns, which lets you manually define column boundaries. When you set columns you have to use guess=False.
tabula-py uses the program tabula-java, and in its documentation I found that it needs values in percents or points (not pixels), so I used the program inkscape to measure the boundaries in points.
from tabula import read_pdf
import pandas as pd
# display all columns in dataframe
pd.set_option('display.width', None)
columns = [210, 350, 420, 450] # boundaries in points
#columns = ['210,350,420,450'] # boundaries in points
pages = '90-92'
#pages = [90,91,92]
#pages = list(range(90,93))
#pages = 'all' # read all pages
tables = read_pdf("goal.pdf",
pages=pages,
pandas_options={'header': None},
columns=columns,
guess=False)
df = pd.concat(tables).reset_index(drop=True)
#df.rename(columns=df.iloc[0], inplace=True) # convert first row to headers
#df.drop(df.index[0], inplace=True) # remove first row with headers
# display
#for x in range(0, len(df), 20):
# print(df.iloc[x:x+20])
# print('----------')
print(df.iloc[45:50])
#df.to_csv('output-pdf.csv')
#print(df[ df['State'].str.contains(' ') ])
#print(df[ df.iloc[:,3].str.contains(' ') ])
Result:
0 1 2 3 4
45 JARRARD, GARY 930 FORT WORTH DRIVE DENTON TX (940) 565-6548
46 JARRARD, GARY 2219 COLORADO BLVD DENTON TX (940) 380-1661
47 MASON HARRISON, RATLIFF ENTERPRISES 1815 W. UNIVERSITY DRIVE DENTON TX (940) 387-5431
48 MASON HARRISON, RATLIFF ENTERPRISES 109 N. LOOP #288 DENTON TX (940) 484-2904
49 MASON HARRISON, RATLIFF ENTERPRISES 930 FORT WORTH DRIVE DENTON TX (940) 565-6548
EDIT:
It may also need the option area (also in points) to skip headers (a sketch follows), or you will have to remove the first row on the first page.
I didn't check all rows but it may need some changes in column boundaries.
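For example, a sketch of passing area together with columns (hypothetical coordinates in points; measure your own, e.g. with inkscape):
from tabula import read_pdf
import pandas as pd

# hypothetical values in points: [top, left, bottom, right]; assumes the header band ends ~80pt down
area = [80, 0, 792, 612]
columns = [210, 350, 420, 450]

tables = read_pdf("goal.pdf",
                  pages='90-92',
                  pandas_options={'header': None},
                  area=area,
                  columns=columns,
                  guess=False)
df = pd.concat(tables).reset_index(drop=True)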
EDIT:
A few rows cause problems, probably because the text in City is too long.
col3 = df.iloc[:,3]
print(df[ col3.str.contains(' ') ])
Result:
0 1 2 3 4
1941 UMSTATTD RESTAURANTS, LLC 120 WEST US HIGHWAY 54 EL DORADO SPRING MS O (417) 876-5755
2079 SIMONS, GARY 1412 BURLINGTON NORTH KANSAS CIT MY O (816) 421-5941
2763 GRISHAM, ROBERT (RB) 403 WEST COURT STREET WASHINGTON COU ORTH HOU S(E740) 335-7830
2764 STAUFFER, JACOB 403 WEST COURT STREET WASHINGTON COU ORTH HOU S(E740) 335-7830
I converted json data from a single folder into a pandas dataframe, but the list didn't come out in sequential order. Does anybody know how to sort the data?
This is the output of json_files:
['BuzzFeed_Real_5-Webpage.json',
'BuzzFeed_Fake_9-Webpage.json',
'BuzzFeed_Fake_6-Webpage.json',
'BuzzFeed_Fake_5-Webpage.json',
'BuzzFeed_Fake_8-Webpage.json',
'BuzzFeed_Real_6-Webpage.json',
'BuzzFeed_Real_7-Webpage.json',
'BuzzFeed_Real_8-Webpage.json',
'BuzzFeed_Real_9-Webpage.json',
'BuzzFeed_Real_2-Webpage.json',
'BuzzFeed_Real_4-Webpage.json',
'BuzzFeed_Real_1-Webpage.json',
'BuzzFeed_Real_10-Webpage.json',
'BuzzFeed_Fake_4-Webpage.json',
'BuzzFeed_Fake_10-Webpage.json',
'BuzzFeed_Fake_1-Webpage.json',
'BuzzFeed_Fake_2-Webpage.json',
'BuzzFeed_Real_3-Webpage.json',
'BuzzFeed_Fake_3-Webpage.json',
'BuzzFeed_Fake_7-Webpage.json']
However, my label is sequential as follows:
Label
label
0 BuzzFeed_Real_1
1 BuzzFeed_Real_2
2 BuzzFeed_Real_3
3 BuzzFeed_Real_4
4 BuzzFeed_Real_5
5 BuzzFeed_Real_6
6 BuzzFeed_Real_7
7 BuzzFeed_Real_8
8 BuzzFeed_Real_9
9 BuzzFeed_Real_10
10 BuzzFeed_Fake_1
11 BuzzFeed_Fake_2
12 BuzzFeed_Fake_3
13 BuzzFeed_Fake_4
14 BuzzFeed_Fake_5
15 BuzzFeed_Fake_6
16 BuzzFeed_Fake_7
17 BuzzFeed_Fake_8
18 BuzzFeed_Fake_9
19 BuzzFeed_Fake_10
Does anybody know how to sort the data based on the label? Thank you
Here is my code:
import os, json
import pandas as pd
import numpy as np

path_to_json = 'data/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('json')]
print(json_files)

# Here I define my pandas dataframe with the columns I want to get from the json
jsons_data = pd.DataFrame(columns=['text','title'])

# We need both the json and an index number so use enumerate()
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
        json_text = json.load(json_file)
        # the same structure
        text = json_text['text']
        title = json_text['title']
        # Here I push a list of data into the pandas DataFrame at the row given by 'index'
        jsons_data.loc[index] = [text, title]

# Now that we have the pertinent json data in our DataFrame
print(jsons_data)
and this is the output of jsons_data:
text title
0 Story highlights Obams reaffirms US commitment... Obama in NYC: 'We all have a role to play' in ...
1 Well THAT’S Weird. If the Birther movement is ... The AP, In 2004, Said Your Boy Obama Was BORN ...
2 The man arrested Monday in connection with the... Bombing Suspect Filed Anti-Muslim Discriminati...
3 The Haitians in the audience have some newswor... 'Reporters' FLEE When Clintons Get EXPOSED!
4 Chicago Environmentalist Scumbags\n\nLeftists ... The Black Sphere with Kevin Jackson
5 Obama weighs in on the debate\n\nPresident Bar... Obama weighs in on the debate
6 Story highlights Ted Cruz refused to endorse T... Donald Trump's rise puts Ted Cruz in a bind
7 Last week I wrote an article titled “Donald Tr... More Milestone Moments for Donald Trump! – Eag...
8 Story highlights Trump has 45%, Clinton 42% an... Georgia poll: Donald Trump, Hillary Clinton in...
9 Story highlights "This, though, is certain: to... Hillary Clinton on police shootings: 'too many...
10 McCain Criticized Trump for Arpaio’s Pardon… S... NFL Superstar Unleashes 4 Word Bombshell on Re...
11 On Saturday, September 17 at 8:30 pm EST, an e... Another Terrorist Attack in NYC…Why Are we STI...
12 Less than a day after protests over the police... Donald Trump: Drugs a 'Very, Very Big Factor' ...
13 Dolly Kyle has written a scathing “tell all” b... HILLARY ON DISABLED CHILDREN During Easter Egg...
14 Former President Bill Clinton and his Clinton ... Charity: Clinton Foundation Distributed “Water...
15 I woke up this morning to find a variation of ... Proof The Mainstream Media Is Manipulating The...
16 Thanks in part to the declassification of Defe... Declassified Docs Show That Obama Admin Create...
17 Critical Counties is a CNN series exploring 11... Critical counties: Wake County, NC, could put ...
18 The Democrats are using an intimidation tactic... Why is it “RACIST” to Question Someone’s Birth...
19 Back when the news first broke about the pay-t... Clinton Foundation Spent 5.7% on Charity; Rest...
You can use the solution from this, with the split values: the Fake and Real strings are sorted descending and the numbers are sorted ascending:
L = ['BuzzFeed_Real_5-Webpage.json',
'BuzzFeed_Fake_9-Webpage.json',
'BuzzFeed_Fake_6-Webpage.json',
'BuzzFeed_Fake_5-Webpage.json',
'BuzzFeed_Fake_8-Webpage.json',
'BuzzFeed_Real_6-Webpage.json',
'BuzzFeed_Real_7-Webpage.json',
'BuzzFeed_Real_8-Webpage.json',
'BuzzFeed_Real_9-Webpage.json',
'BuzzFeed_Real_2-Webpage.json',
'BuzzFeed_Real_4-Webpage.json',
'BuzzFeed_Real_1-Webpage.json',
'BuzzFeed_Real_10-Webpage.json',
'BuzzFeed_Fake_4-Webpage.json',
'BuzzFeed_Fake_10-Webpage.json',
'BuzzFeed_Fake_1-Webpage.json',
'BuzzFeed_Fake_2-Webpage.json',
'BuzzFeed_Real_3-Webpage.json',
'BuzzFeed_Fake_3-Webpage.json',
'BuzzFeed_Fake_7-Webpage.json']
class reversor:
    def __init__(self, obj):
        self.obj = obj

    def __eq__(self, other):
        return other.obj == self.obj

    def __lt__(self, other):
        return other.obj < self.obj

a = sorted(L, key=lambda x: (reversor(x.split('_')[1]), int(x.split('_')[2].split('-')[0])))
print(a)
['BuzzFeed_Real_1-Webpage.json', 'BuzzFeed_Real_2-Webpage.json',
'BuzzFeed_Real_3-Webpage.json', 'BuzzFeed_Real_4-Webpage.json',
'BuzzFeed_Real_5-Webpage.json', 'BuzzFeed_Real_6-Webpage.json',
'BuzzFeed_Real_7-Webpage.json', 'BuzzFeed_Real_8-Webpage.json',
'BuzzFeed_Real_9-Webpage.json', 'BuzzFeed_Real_10-Webpage.json',
'BuzzFeed_Fake_1-Webpage.json', 'BuzzFeed_Fake_2-Webpage.json',
'BuzzFeed_Fake_3-Webpage.json', 'BuzzFeed_Fake_4-Webpage.json',
'BuzzFeed_Fake_5-Webpage.json', 'BuzzFeed_Fake_6-Webpage.json',
'BuzzFeed_Fake_7-Webpage.json', 'BuzzFeed_Fake_8-Webpage.json',
'BuzzFeed_Fake_9-Webpage.json', 'BuzzFeed_Fake_10-Webpage.json']
Another similar idea with pandas: split the values into new columns and then sort with DataFrame.sort_values:
df = pd.DataFrame({'a':L})
df = df.join(df['a'].str.split('_', expand=True))
df['num'] = df[2].str.extract(r'(\d+)', expand=False).astype(int)
df = df.sort_values([1, 'num'], ascending=[False, True])
print (df)
a 0 1 2 num
11 BuzzFeed_Real_1-Webpage.json BuzzFeed Real 1-Webpage.json 1
9 BuzzFeed_Real_2-Webpage.json BuzzFeed Real 2-Webpage.json 2
17 BuzzFeed_Real_3-Webpage.json BuzzFeed Real 3-Webpage.json 3
10 BuzzFeed_Real_4-Webpage.json BuzzFeed Real 4-Webpage.json 4
0 BuzzFeed_Real_5-Webpage.json BuzzFeed Real 5-Webpage.json 5
5 BuzzFeed_Real_6-Webpage.json BuzzFeed Real 6-Webpage.json 6
6 BuzzFeed_Real_7-Webpage.json BuzzFeed Real 7-Webpage.json 7
7 BuzzFeed_Real_8-Webpage.json BuzzFeed Real 8-Webpage.json 8
8 BuzzFeed_Real_9-Webpage.json BuzzFeed Real 9-Webpage.json 9
12 BuzzFeed_Real_10-Webpage.json BuzzFeed Real 10-Webpage.json 10
15 BuzzFeed_Fake_1-Webpage.json BuzzFeed Fake 1-Webpage.json 1
16 BuzzFeed_Fake_2-Webpage.json BuzzFeed Fake 2-Webpage.json 2
18 BuzzFeed_Fake_3-Webpage.json BuzzFeed Fake 3-Webpage.json 3
13 BuzzFeed_Fake_4-Webpage.json BuzzFeed Fake 4-Webpage.json 4
3 BuzzFeed_Fake_5-Webpage.json BuzzFeed Fake 5-Webpage.json 5
2 BuzzFeed_Fake_6-Webpage.json BuzzFeed Fake 6-Webpage.json 6
19 BuzzFeed_Fake_7-Webpage.json BuzzFeed Fake 7-Webpage.json 7
4 BuzzFeed_Fake_8-Webpage.json BuzzFeed Fake 8-Webpage.json 8
1 BuzzFeed_Fake_9-Webpage.json BuzzFeed Fake 9-Webpage.json 9
14 BuzzFeed_Fake_10-Webpage.json BuzzFeed Fake 10-Webpage.json 10
a = df['a'].tolist()
print (a)
['BuzzFeed_Real_1-Webpage.json', 'BuzzFeed_Real_2-Webpage.json',
'BuzzFeed_Real_3-Webpage.json', 'BuzzFeed_Real_4-Webpage.json',
'BuzzFeed_Real_5-Webpage.json', 'BuzzFeed_Real_6-Webpage.json',
'BuzzFeed_Real_7-Webpage.json', 'BuzzFeed_Real_8-Webpage.json',
'BuzzFeed_Real_9-Webpage.json', 'BuzzFeed_Real_10-Webpage.json',
'BuzzFeed_Fake_1-Webpage.json', 'BuzzFeed_Fake_2-Webpage.json',
'BuzzFeed_Fake_3-Webpage.json', 'BuzzFeed_Fake_4-Webpage.json',
'BuzzFeed_Fake_5-Webpage.json', 'BuzzFeed_Fake_6-Webpage.json',
'BuzzFeed_Fake_7-Webpage.json', 'BuzzFeed_Fake_8-Webpage.json',
'BuzzFeed_Fake_9-Webpage.json', 'BuzzFeed_Fake_10-Webpage.json']
This should give you what you need to create an index from the filenames. Let me know if you need help setting the index and whether you want a dual index or to combine it into a single index:
import os, json
import re
import pandas as pd
import numpy as np

path_to_json = 'data/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('json')]
print(json_files)

# Here I define my pandas dataframe with the columns I want to get from the json
jsons_data = pd.DataFrame(columns=['text','title'])

# We need both the json and an index number so use enumerate()
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
        json_text = json.load(json_file)
        # the same structure
        text = json_text['text']
        title = json_text['title']
        # Here I push a list of data into the pandas DataFrame at the row given by 'index'
        jsons_data.loc[index] = [text, title]

# Add a column to your data frame containing the 'json_files' list values
jsons_data['json_files'] = json_files

# Create regex to identify 'Fake' or 'Real' BuzzFeed
news_type = r"(Fake|Real)"
# Create regex to extract the numeric count
news_type_count = r"(\d+)"

# Extract the news type to a column
jsons_data['news_type'] = jsons_data['json_files'].str.extract(pat=news_type)
# Extract the numeric count to a column
jsons_data['news_type_count'] = jsons_data['json_files'].str.extract(pat=news_type_count)
# Convert the numeric count to integer
jsons_data['news_type_count'] = jsons_data['news_type_count'].astype(int)

# Sort the dataframe by 'news_type' and 'news_type_count'
jsons_data = jsons_data.sort_values(by=['news_type', 'news_type_count'])

# Print the head of the dataframe
print(jsons_data.head())
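For example, to then set the index from those columns (dual index), or to rebuild a single label from the filename, something like this would work:
# dual (hierarchical) index from the two extracted columns
jsons_data_dual = jsons_data.set_index(['news_type', 'news_type_count'])

# single index rebuilt from the filename, e.g. "BuzzFeed_Real_1"
jsons_data_single = jsons_data.set_index(
    jsons_data['json_files'].str.replace('-Webpage.json', '', regex=False)
)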
I need to create a pandas dataframe in Python by reading in an Excel spreadsheet that contains almost 50,000 rows and 81 columns. The file contains information about medical professionals of all kinds: physicians, nurses, nurse practitioners, etc. I want to read in only rows where a column 'PROFTYPE' has value of 'NURSEPRACT'.
I'm using Python 3.7.3. I've been reading in the entire file and then trimming it down by the PROFTYPE column afterward, but reading it all in takes too long. I'd like to read in only those rows where PROFTYPE == 'NURSEPRACT'.
import pandas as pd

df_np = pd.read_excel(SourceFile, sheet_name='Data', header=0)
df_np = df_np[df_np['PROFTYPE'] == 'NURSEPRACT']
This code actually works, but that's because I'm reading in the entire file first. I'm actually interested in reading in only those that meet the condition of PROFTYPE = 'NURSEPRACT'.
One idea is that you can:
1. load only the 'PROFTYPE' column,
2. identify the non-nurse-practitioner rows,
3. load the full table again, skipping those rows so that only the nurse practitioner rows remain.
Here is that strategy in action:
df = pd.read_excel(SourceFile,
                   sheet_name='Data',
                   header=0,
                   usecols=['PROFTYPE'])  # <-- Load just 'PROFTYPE' of the following table
# ID PROFTYPE YEARS_IN_PRACTICE
# 1234 NURSEPRACT 12
# 43 NURSE 32
# 789 NURSEPRACT 4
# 34 PHYSICIAN 2
# 93 NURSEPRACT 13
row_numbers = [x+1 for x in df[df['PROFTYPE'] != 'NURSEPRACT'].index]
df = pd.read_excel(SourceFile, sheet_name='Data', header=0, skiprows=row_numbers)
# ID PROFTYPE YEARS_IN_PRACTICE
# 1234 NURSEPRACT 12
# 789 NURSEPRACT 4
# 93 NURSEPRACT 13
I have a dataset provided as properties.csv (4000 rows and 6 columns). The csv file includes many features; some of these features are numerical and some are nominal (they contain text). Suppose the features in this dataset are:
id
F1
F2
F3
F4
Price
Examples of the content of each feature:
id (row 1 to 3 in CSV File) ---> 44525
44859
45465
F1 (row 1 to 3 in CSV File) ---> "Stunning 6 bedroom villa in the heart of the
Golden Mile, Marbella"
"Villa for sale in Rocio de Nagüeles, Marbella
Golden Mile"
"One level 5 bedroom villa for sale in
Nagüeles"
F2 (row 1 to 3 in CSV File) ---> "Fireplace, Elevator, Terrace, Mountain view,
Freight Elevator, Air conditioning, Patio,
Guest toilet, Garden, Balcony, Sea/lake view,
Built-in kitchen"
"Mountain view"
"Elevator, Terrace, Alarm system, Mountain
view, Swimming pool, Air conditioning,
Basement, Sea/lake view"
F3 (row 1 to 3 in CSV File) - contains numerical values ---> 0
0
0
F4 (row 1 to 3 in CSV File) - contains numerical values ---> 393
640
4903
Price (row 1 to 3 in CSV File) - contains numerical values ---> 4400000
2400000
1900000
In F1, I am looking to do the following:
1- Extract the type of the property (‘apartment’, ‘house’ or ‘villa’) and put it in a separate feature (independent variable) called "Type" in the CSV file. After that, I want to separate them into groups (apartments, houses, villas) and calculate the mean price of each type group.
2- Extract the location of each property (locations can be: Alenquer, Quinta da Marinha, Golden Mile, Nagüeles) and put it in a separate feature (independent variable) called "Location" in the csv file.
I am a beginner in NLP. I tried to write this code to extract the information "Apartment" from F1, but it does not work properly:
import pandas as pd
from pandas import DataFrame
import re

properties = pd.read_csv(r'C:/Users/User/Desktop/properties.csv')

# Extract "Apartment" from F1
Title = DataFrame(properties, columns=['F1'])
for line in Title['F1']:
    # return list of apartments in that line
    x = re.findall("\apartment", line)
    # if a match is found
    if len(x) != 0:
        print(x)
I need your help to fix this code, and to know what I should do to extract the other types, ‘house’ and ‘villa’, from F1.
After that, I want to create a property dataset in this format and save it as a csv file:
id
Location (Information extracted from F1)
type (information extracted from F1 in groups "apartments’, ‘houses’, ‘Villas’")
F1
F2
F3
F4
Price
In case F1 does not contain the type for some properties (a blank field, no text), what should I do to deal with those blank fields in F1 and still work out the type of those properties?
Here is a solution:
import pandas as pd
import re
df = pd.read_csv('appt_info.csv', delimiter=';')
def extract_housing_type(text):
    # Do a regular expression search for the pattern
    match = re.search('(apartment|house|villa)s?', text, flags=re.I)
    if match is not None:
        return match.group(0)  # return the value of the match
    return 'Unknown'  # return a default value if there is no match

df['Type'] = df.F1.apply(lambda x: extract_housing_type(x))  # assign the output to a new column
This should give you a dataframe that looks like this:
id F1 \
0 44525 Stunning 6 bedroom villa in the heart of the G...
1 44859 Villa for sale in Rocio de Nageles, Marbella G...
2 45465 One level 5 bedroom villa for sale in Nageles
F2 F3 F4 Price Type
0 Fireplace, Elevator, Terrace, Mountain view, F... 0 393 4400000 villa
1 Mountain view 0 640 2400000 Villa
2 Elevator, Terrace, Alarm system, Mountain view... 0 4903 1900000 villa
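From there, the grouping and mean price per type the question asks for could look like this (a sketch, assuming Price is numeric; Location can be extracted the same way over the known place names):
# normalise the extracted type so 'Villa' and 'villa' fall into the same group
df['Type'] = df['Type'].str.lower()

# mean price per property type
print(df.groupby('Type')['Price'].mean())

# hypothetical location extraction over the known place names
locations = ['Alenquer', 'Quinta da Marinha', 'Golden Mile', 'Nagüeles']
df['Location'] = df['F1'].str.extract('(' + '|'.join(locations) + ')', expand=False)

df.to_csv('properties_enriched.csv', index=False)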