I'm trying to scrape Wikipedia for data on some famous people. I have no problem getting the data, but when I try to export it to csv there are always a few entries causing a major issue. Basically, the output csv is formatted fine for most entries, except a few that produce random line-breaks I can't seem to overcome. Here is sample data and code:
# 1. pull out wiki pages
sample_names_list = [{'name': 'Mikhail Fridman', 'index': 11.0}, #will work fine
{'name': 'Roman Abramovich', 'index': 12.0}, #will cause issue
{'name': 'Marit Rausing', 'index': 13.0}] #has no wiki page, hence 'try' in loops below
# 1.1 get page title for each name in list
import wikipedia as wk
for person in sample_names_list:
    try:
        wiki_page = person['name']
        person['wiki_page'] = wk.page(title=wiki_page, auto_suggest=True)
    except: pass
# 1.2 get page content for each page title in list
for person in sample_names_list:
    try:
        person_page = person['wiki_page']
        person['wiki_text'] = person_page.content
    except: pass
# 2. convert to dataframe
import pandas as pd
sample_names_data = pd.DataFrame(sample_names_list)
sample_names_data.drop('wiki_page', axis = 1, inplace= True) #drop unnecessary col
# 3. export csv
sample_names_data.to_csv('sample_names_data.csv')
Here is a screenshot of the output where, as you can see, random line-breaks are inserted in one of the entries and dispersed throughout with no apparent pattern:
I've tried fiddling with the data types in sample_names_list, I've tried messing with to_csv's parameters, I've tried other ways to export the csv. None of these approaches worked. I'm new to python so it could well be a very obvious solution. Any help much appreciated!
The wikipedia content has newlines in it, which are hard to reliably represent in a line-oriented format such as CSV.
You can use Excel's Open dialog (not just double-clicking the file) and select "Text file" as the format, which lets you choose how to interpret various delimiters and quoted strings... but preferably just don't use CSV for data interchange at all.
If you need to work with Excel, use .to_excel() in Pandas.
If you need to just work with Pandas, use e.g. .to_pickle().
If you need interoperability with other software, .to_json() would be a decent choice.
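For illustration, a minimal sketch of those alternatives (the file names here are just placeholders, and .to_excel() assumes an engine such as openpyxl is installed):

import pandas as pd

sample = pd.DataFrame([{'name': 'Example Person', 'wiki_text': 'First line.\nSecond line.'}])

# Excel handles embedded newlines inside cells, so the text stays in one row
sample.to_excel('sample_names_data.xlsx', index=False)

# JSON keeps the newlines escaped as \n, so any consumer can read them back reliably
sample.to_json('sample_names_data.json', orient='records')

# Pickle round-trips the DataFrame exactly, but is Python-only
sample.to_pickle('sample_names_data.pkl')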
Environment:
python 3.8.5
ipython 7.20.0
jupyterlab 3.0.7
ubuntu 20.04 LTS
pandas 1.2.2
openpyxl 3.0.10
googledrive (spreadsheet)
I'm doing...
import pandas as pd
ef = pd.ExcelFile('myfile.xlsx')
df = ef.parse(ef.sheet_names[0], headers=None)
display(df)
I'm parsing an xlsx file exported from a Google spreadsheet into a dataframe.
The spreadsheet's content is as follows:
The Problem
It always parses A1 (=1-1) as pd.Timestamp('2022-01-01 00:00:00').
But I want the string value "1-1".
I think the original value was already stored as a datetime type.
I tried
Most SO answers suggest the following.
So I tried that.
df1 = ef.parse(ef.sheet_names[0], headers=None)
df1.columns #=[0,1,2,3,4,5]
df = ef.parse(ef.sheet_names[0], headers=None, converters={c:str for c in df1.columns})
display(df.iloc[0][0])
But it shows a string value of "2022-01-01 00:00:00", not "1-1".
Constraints
The spreadsheet's writer (the operator) tells me, "I typed exactly 1-1 into the spreadsheet".
And there are many spreadsheet writers.
So they won't type '1-1 instead of 1-1, and I can't strictly check whether each value was really entered as a string or as a datetime.
The Google Spreadsheet API (not the Drive API) returns the value '1-1', so that works. But that API's quota is too small (60 calls per minute, and each sub-spreadsheet consumes one call), so I have to use the Google Drive API and export the file instead.
That's why I can't use the Spreadsheet API even though it actually works.
Hope
I export the xlsx file from the Google Drive API in the following way:
request = _google_drive_client.files().export_media(fileId=file_id,
    mimeType='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
file = io.BytesIO()
downloader = MediaIoBaseDownload(file, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
dn = file.getvalue()
with open('test.xlsx', 'wb') as f:
    f.write(dn)
And Apple Numbers shows me that the '1-1' information is still alive in that xlsx file.
So I hope I can get '1-1' back in Python as well.
Question
Is there any way to load the xlsx file into Python memory with its displayed values? (I mean '1-1', not the underlying origin value of datetime 2022-01-01 00:00:00, nor some other parsed form.)
Or, simply, I want '1-1' out of the parsing step.
Help me please!
I hope I understood your question right. If you want to know how you can display 1-1 in its actual state as straight text after exporting it to Excel, I think it's best to use pandas with xlwings:
import pandas as pd
import xlwings as xw
df = pd.DataFrame([
    (1, "1-1", 8),
    (4, 0, "=1-1")],
    columns=["A", "B", "C"]
)

with xw.App(visible=False) as app:
    wb = xw.Book()
    ws = wb.sheets[0]
    ws.api.Cells.NumberFormat = "#"  # Or instead: ws.range("A1:Z100").number_format = "#"
    ws["A1"].value = df
    wb.save("test.xlsx")
    wb.close()
The crucial point is to set the NumberFormat or rather number_format property before loading the values of df into the cells. This ensures that 1-1 appears as straight text.
As a side note: ws.api.Cells.NumberFormat = "#" changes the format of the whole sheet to text. If you prefer to change only a certain range, use ws.range("A1:Z100").number_format = "#".
Please also note that xlwings requires an excel installation.
The problem is not pandas, of course. The problem is Excel. Unless told otherwise, it interprets 1-1 as a date. If you want to override that, start with a single quote: '1-1. The quote won't show, but Excel will treat the cell as a string.
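If the values are written programmatically rather than typed, the same apostrophe trick can be applied through xlwings, since it goes through a real Excel instance. A sketch only; the file name is made up:

import xlwings as xw

with xw.App(visible=False) as app:
    wb = xw.Book()
    ws = wb.sheets[0]
    ws["A1"].value = "'1-1"  # leading apostrophe: Excel stores the text "1-1"
    wb.save("quoted_text.xlsx")
    wb.close()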
I'm answering my own question because I found a solution.
I stopped using the pandas DataFrame Excel parser.
I tried many mimeTypes with the Google Drive API and finally dropped the odf and xlsx exports.
In the end I used the 'zip' export, which turns every sheet into an html file plus one css file.
I downloaded the zip, extracted it, and found that the html contains the content exactly as it is displayed in the Google spreadsheet.
My solution is the following:
import io
from zipfile import ZipFile
from bs4 import BeautifulSoup as soup

def extract_zip(input_zip):
    input_zip = ZipFile(input_zip)
    return {name: input_zip.read(name) for name in input_zip.namelist()}

def read_json_from_zip_bytearray(file_bytearray):
    do = extract_zip(io.BytesIO(file_bytearray))  # one html file per sheet, plus one css file
    # sheet name -> list of rows, each row a list of cell texts
    dw = {k[:-5]: [[cell.text for cell in row.find_all('td')]
                   for row in soup(v, 'html.parser').find('table').find_all('tr')
                   if row.find_all('td')]
          for k, v in do.items() if k[-5:] == '.html'}
    # flatten each sheet's rows into a list of non-empty cell values
    return {k: [cell for row in rows for cell in row if cell] for k, rows in dw.items()}

read_json_from_zip_bytearray(downloaded_bytearray_from_googledrive_api_and_zip_mimetype)
#the value is
{'sheetname1': ['1-1',
'1-2',
'1-3',
'1-4',
'1-5',
'other...'],
'sheetname2': ['2-1', '2-2', '2-3'],
...}
#anyway you can make dataframe with pd.DataFrame() method
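For example, turning that dict into DataFrames might look like this (a sketch; the sheet names and the downloaded bytearray are the placeholders from above, and the per-sheet lists can have different lengths, so one frame is built per sheet):

import pandas as pd

cells = read_json_from_zip_bytearray(downloaded_bytearray_from_googledrive_api_and_zip_mimetype)
frames = {sheet: pd.DataFrame({'value': values}) for sheet, values in cells.items()}
print(frames['sheetname1'].head())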
I appreciate all the answers!
And I hope this helps anyone who uses the Google Drive API and wants to parse the values while keeping the way they are displayed.
So I developed a script that would pull data from a live-updated site tracking coronavirus data. I set it up to pull data every 30 minutes but recently tested it on updates every 30 seconds.
The idea is that it creates the request to the site, pulls the html, creates a list of all of the data I need, then restructures into a dataframe (basically it's the country, the cases, deaths, etc.).
Then it will take each row and append to the rows of each of the 123 excel files that are for the various countries. This will work well for, I believe, somewhere in the range of 30-50 iterations before it either causes file corruptions or weird data entries.
I have my code below. I know it's poorly written (my initial reasoning was I felt confident I could set it up quickly and I wanted to collect data quickly.. unfortunately I overestimated my abilities but now I want to learn what went wrong). Below my code I'll include sample output.
PLEASE note that this 30-second interval is only for quick testing; I don't normally intend to send requests that frequently over months. I just wanted to see what the issue was. Originally it was set to pull every 30 minutes when I detected this issue.
See below for the code:
import schedule
import time
def RecurringProcess2():
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import datetime
    import numpy as np
    from os import listdir
    import os
    try:
        extractTime = datetime.datetime.now()
        extractTime = str(extractTime)
        print("Access Initiated at " + extractTime)
        link = 'https://www.worldometers.info/coronavirus/'
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'html.parser').findAll('td')  # [1107].get_text()
        table = pd.DataFrame(columns=['Date and Time', 'Country', 'Total Cases', 'New Cases', 'Total Deaths', 'New Deaths', 'Total Recovered', 'Active Cases', 'Serious Critical', 'Total Cases/1M pop'])
        soupList = []
        for i in range(1107):
            value = soup[i].get_text()
            soupList.insert(i, value)
        table = np.reshape(soupList, (123, -1))
        table = pd.DataFrame(table)
        table.columns = ['Country', 'Total Cases', 'New Cases (+)', 'Total Deaths', 'New Deaths (+)', 'Total Recovered', 'Active Cases', 'Serious Critical', 'Total Cases/1M pop']
        table['Date & Time'] = extractTime
        # Below code is run once to generate the initial files. That's it.
        # for i in range(122):
        #     fileName = table.iloc[i, 0] + '.xlsx'
        #     table.iloc[i:i+1, :].to_excel(fileName)
        FilesDirectory = 'D:\\Professional\\Coronavirus'
        fileType = '.csv'
        filenames = listdir(FilesDirectory)
        DataFiles = [filename for filename in filenames if filename.endswith(fileType)]
        for file in DataFiles:
            countryData = pd.read_csv(file, index_col=0)
            MatchedCountry = table.loc[table['Country'] == str(file)[:-4]]
            if file == ' USA .csv':
                print("Country Data Rows: ", len(countryData))
                if os.stat(file).st_size < 1500:
                    print("File Size under 1500")
            countryData = countryData.append(MatchedCountry)
            countryData.to_csv(FilesDirectory + '\\' + file, index=False)
    except:
        pass
    print("Process Complete!")
    return

schedule.every(30).seconds.do(RecurringProcess2)

while True:
    schedule.run_pending()
    time.sleep(1)
When I check the output after some number of iterations (usually successful for around 30-50), a file has either kept only 2 rows and lost all the others, or it keeps appending while the row above loses one entry, the row two above loses two entries, and so on (essentially forming a triangle of sorts).
Above that image would be a few hundred empty rows. Does anyone have an idea of what is going wrong here? I'd consider this a failed attempt but would still like to learn from this attempt. I appreciate any help in advance.
Hi, as per my understanding the webpage only has one table element. My suggestion would be to use pandas' read_html method, as it returns a clean and structured table.
Try the code below; you can modify it to run on the same schedule:
import requests
import pandas as pd
url = 'https://www.worldometers.info/coronavirus/'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
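If it helps, here is a rough sketch of hooking this into the same 30-second schedule (the output file name and the single-file append are just assumptions on my part, not the only way to store the snapshots):

import requests
import pandas as pd
import schedule
import time

def pull_table():
    html = requests.get('https://www.worldometers.info/coronavirus/').content
    df = pd.read_html(html)[-1]
    # append this run's snapshot to a single file instead of 123 per-country files
    df.to_csv('worldometers_snapshots.csv', mode='a', header=False, index=False)

schedule.every(30).seconds.do(pull_table)
while True:
    schedule.run_pending()
    time.sleep(1)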
Disclaimer: I'm still evaluating this solution. So far it works almost perfectly for 77 rows.
Originally I had set the script up to run for .xlsx files. I converted everything to .csv but retained the index column code:
countryData = pd.read_csv(file,index_col=0)
I started realizing that things were being ordered differently every time the script ran. I have since removed that from the code and so far it works. Almost.
Unnamed: 0 Unnamed: 0.1
0 7
7
For some reason I have the above output in every file. I don't know why. But it's in the first 2 columns yet it still seems to be reading and writing correctly. Not sure what's going on here.
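For what it's worth, a small sketch of where those Unnamed columns typically come from (the file name is just an example): writing a DataFrame with its index and then reading the file back without index_col turns the index into an 'Unnamed: 0' data column, and each further round-trip adds another one.

import pandas as pd

df = pd.DataFrame({'Country': ['USA'], 'Total Cases': [100]})
df.to_csv('USA.csv')                             # writes the index as an unlabeled first column
print(pd.read_csv('USA.csv').columns.tolist())   # ['Unnamed: 0', 'Country', 'Total Cases']

df.to_csv('USA.csv', index=False)                # index=False (or index_col=0 on read) avoids it
print(pd.read_csv('USA.csv').columns.tolist())   # ['Country', 'Total Cases']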
I'm doing some research on Cambridge Analytica and wanted to have as much news articles as I can from some news outlets.
I was able to scrape them and now have a bunch of JSON files in a folder.
Some of them have only this [] written in them while others have the data I need.
Using pandas I used the following and got every webTitle in the file.
df = pd.read_json(json_file)
df['webTitle']
The thing is that whenever there's an empty file it won't even let me assign df['webTitle'] to a variable.
Is there a way for me to check if it is empty and if it is just go to the next file?
I want to make this into a spreadsheet with a few of the keys as columns and the values as rows for each news article.
My files are organized by day and I've used TheGuardian API to get the data.
I did not write much yet but just in case here's the code as it is:
import pandas as pd
import os
def makePathToFile(path):
    # collect the full path of every JSON file under the given folder
    pathToJson = []
    for root, sub, filename in os.walk(path):
        for i in filename:
            pathToJson.append(os.path.join(path, i))
    return pathToJson

def readJsonAndWriteCSV(pathToJson):
    for json_file in pathToJson:
        df = pd.read_json(json_file)
Thanks!
You can set up a Google Alert for the news keywords you want, then scrape the results in Python using https://pypi.org/project/galerts/
I'm parsing tweet data which is in JSON format and compressed with gzip.
Here's my code:
### Preprocessing
## Importing:
import os
import gzip
import json
import pandas as pd
from pandas.io.json import json_normalize

## Variables:
# tweets: DataFrame for merging. empty
tweets = pd.DataFrame()
idx = 0
# Parser provides parsing the input data and return as pd.DataFrame format

### Directory reading:
## Reading whole directory from
for root, dirs, files in os.walk('D:/twitter/salathe-us-twitter/11April1'):
    for file in files:
        # file tracking, # Memory Checker:
        print(file, tweets.memory_usage())
        # ext represent the extension.
        ext = os.path.splitext(file)[-1]
        if ext == '.gz':
            with gzip.open(os.path.join(root, file), "rt") as tweet_file:
                # print(tweet_file)
                for line in tweet_file:
                    try:
                        temp = line.partition('|')
                        date = temp[0]
                        tweet = json.loads(temp[2])
                        if tweet['user']['lang'] == 'en' and tweet['place']['country_code'] == 'US':
                            # Mapping for memory.
                            # The index must be sequence like series.
                            # temporary solve by listlizing int values: id, retweet-count.
                            # print(tweet)
                            temp_dict = {"id": tweet["user"]["id"],
                                         "text": tweet["text"],
                                         "hashtags": tweet["entities"]["hashtags"][0]["text"],
                                         "date": [int(date[:8])]}
                            # idx for DataFrame ix
                            temp_DF = pd.DataFrame(temp_dict, index=[idx])
                            tweets = pd.concat([tweets, temp_DF])
                            idx += 1
                    except:
                        continue
        else:
            with open(os.path.join(root, file), "r") as tweet_file:
                # print(tweets_file)
                for line in tweet_file:
                    try:
                        temp = line.partition('|')
                        # date
                        date = temp[0]
                        tweet = json.loads(temp[2])
                        if tweet['user']['lang'] == 'en' and tweet['place']['country_code'] == 'US':
                            # Mapping for memory.
                            # The index must be sequence like series.
                            # temporary solve by listlizing int values: id, retweet-count.
                            # print(tweet)
                            temp_dict = {"id": [tweet["user"]["id"]],
                                         "text": tweet["text"],
                                         "hashtags": tweet["entities"]["hashtags"][0]["text"],
                                         "date": [int(date[:8])]}
                            temp_DF = pd.DataFrame(temp_dict, index=[idx])
                            tweets = pd.concat([tweets, temp_DF])
                            idx += 1
                    except:
                        continue

## STORING PROCESS.
store = pd.HDFStore('D:/Twitter_project/mydata.h5')
store['11April1'] = tweets
store.close()
My code can be divided into 3 parts: reading, processing (selecting columns), and storing.
What I'm interested in is parsing it faster.
So here are my questions:
It's too slow. How could I make it much faster? Should I read with the pandas JSON reader?
Well, I guess that would be much faster than plain json.loads...
But my raw tweet data has nested, multi-level values,
so pandas read_json didn't work.
And overall, I'm not sure I implemented my code well.
Are there any problems, or is there a better way? I'm kind of new to programming,
so please teach me how to do this much better.
P.S. The computer just turned off while the code was running. Why does this happen?
A memory problem?
Thanks for reading.
P.P.S. Here is a sample of the raw data:
20110331010003954|{"text":"#Honestly my toe still aint healed im suppose to be in that boot still!!!","truncated":false,"in_reply_to_user_id":null,"in_reply_to_status_id":null,"favorited":false,"source":"web","in_reply_to_screen_name":null,"in_reply_to_status_id_str":null,"id_str":"53320627431550976","entities":{"hashtags":[{"text":"Honestly","indices":[0,9]}],"user_mentions":[],"urls":[]},"contributors":null,"retweeted":false,"in_reply_to_user_id_str":null,"place":{"country_code":"US","country":"United States","bounding_box":{"type":"Polygon","coordinates":[[[-84.161625,35.849573],[-83.688543,35.849573],[-83.688543,36.067417],[-84.161625,36.067417]]]},"attributes":{},"full_name":"Knoxville, TN","name":"Knoxville","id":"6565298bcadb82a1","place_type":"city","url":"http:\/\/api.twitter.com\/1\/geo\/id\/6565298bcadb82a1.json"},"retweet_count":0,"created_at":"Thu Mar 31 05:00:02 +0000 2011","user":{"notifications":null,"profile_use_background_image":true,"default_profile":true,"profile_background_color":"C0DEED","followers_count":161,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/1220577968\/RoadRunner_normal.jpg","is_translator":false,"profile_background_image_url":"http:\/\/a3.twimg.com\/a\/1301071706\/images\/themes\/theme1\/bg.png","default_profile_image":false,"description":"Cool & Calm Basically Females are the way of life and key to my heart...","screen_name":"FranklinOwens","verified":false,"time_zone":"Central Time (US & Canada)","friends_count":183,"profile_text_color":"333333","profile_sidebar_fill_color":"DDEEF6","location":"","id_str":"63499713","show_all_inline_media":true,"follow_request_sent":null,"geo_enabled":true,"profile_background_tile":false,"contributors_enabled":false,"lang":"en","protected":false,"favourites_count":8,"created_at":"Thu Aug 06 18:24:50 +0000 2009","profile_link_color":"0084B4","name":"Franklin","statuses_count":5297,"profile_sidebar_border_color":"C0DEED","url":null,"id":63499713,"listed_count":0,"following":null,"utc_offset":-21600},"id":53320627431550976,"coordinates":null,"geo":null}
That's just one line. I have more than 200GB of data compressed into gzip files like this. I guess the number at the very beginning refers to the date. I hope that's clear.
First of all, my congratulations. You get better as a software engineer when you face real world challenges like this one.
Now, talking about your solution.
Every software works in 3 phases.
Input data.
Process data.
Output data. (response)
Input data
1.1. boring stuff
The information should preferably be in one format. To achieve that we write parsers, APIs, wrappers, adapters. The idea behind all of them is to transform the data into the same format. This helps avoid issues when working with different data sources: if one of them breaks, you fix only that one adapter and that's it; everything else, including your parser, still works.
1.2. your case
You have data coming in with the same schema but in different file formats. You can either convert it all to one format before reading (json, txt), or extract the transformation logic into a separate function or module and reuse/call it in both places.
example:
def process_data(tweet_file):
    for line in tweet_file:
        ...  # do your stuff

with gzip.open(os.path.join(root, file), "rt") as tweet_file:
    process_data(tweet_file)

with open(os.path.join(root, file), "r") as tweet_file:
    process_data(tweet_file)
2. Process data
2.1 boring stuff
Most likely this is the bottleneck. Here your goal is to transform data from the given format into the desired format and perform whatever actions are required. This is where you get all the exceptions, all the performance issues, all the business logic. This is where the software-engineering craft comes in handy: you create an architecture and you decide how many bugs to put in it.
2.2 your case
The simplest way to deal with an issue is to know how to find it. If it is performance, put in timestamps to track it. With experience, it gets easier to spot the issues. In this case, pd.concat most likely causes the performance hit: with each call it copies all the data into a new instance, so you have 2 memory objects when you need only 1. Try to avoid concat; gather all the data into a list and then put it into a DataFrame once (there is a sketch of this after the example below).
For instance, I would not put all the data into the DataFrame on the start, you can gather it and put into a csv file and then build a DataFrame from it, pandas deals with csv files really well. Here is an example:
import json
import pandas as pd
from pandas.io.json import json_normalize
import csv

source_file = '11April1.txt'
result_file = 'output.csv'

with open(source_file) as source:
    with open(result_file, 'wb') as result:
        writer = csv.DictWriter(result, fieldnames=['id', 'text', 'hashtags', 'date', 'idx'])
        writer.writeheader()
        # get index together with a line
        for index, line in enumerate(source):
            # a handy way to get data in 1 func call.
            date, data = line.split('|')
            tweet = json.loads(data)
            if tweet['user']['lang'] != 'en' or tweet['place']['country_code'] != 'US':
                continue
            item = {"id": tweet["user"]["id"],
                    "text": tweet["text"],
                    "hashtags": tweet["entities"]["hashtags"][0]["text"],
                    "date": [int(date[:8])],
                    "idx": index}
            # either write it to the csv or save into the array
            # tweets.append(item)
            writer.writerow(item)

print "done"
3. Output data.
3.1. boring stuff
After your data is processed and in the right format, you need to see the results, right? This is where HTTP responses and page loads happen, where pandas builds graphs etc. You decide what kind of output you need, that's why you created software, to get what you want from the format you did not want to go through by yourself.
3.2 your case
You have to find an efficient way to get the desired output from the processed files. Maybe you need to put data into HDF5 format and process it on Hadoop, in this case, your software output becomes someone's software input, sexy right? :D
Jokes aside, gather all the processed data from the csv or arrays and put it into HDF5 in chunks. This is important, as you cannot load everything into RAM; RAM is called temporary memory for a reason: it is fast and very limited, so use it wisely. That is why your PC turned off, in my opinion. Or there may be memory corruption due to the nature of some C libraries, which happens from time to time.
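A sketch of that chunked HDF5 write, assuming the intermediate output.csv from the example above (the chunk size, key name, and min_itemsize are guesses):

import pandas as pd

store = pd.HDFStore('D:/Twitter_project/mydata.h5')
for chunk in pd.read_csv('output.csv', chunksize=100000):
    # format='table' allows appending; min_itemsize reserves room for long text values
    store.append('tweets_11April1', chunk, format='table', min_itemsize={'text': 500})
store.close()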
Overall, try to experiment and get back to StackOverflow if anything.
As it says on the tin, I'm writing a Python (2.7) script that looks at two spreadsheets (saved as .csv), comparing certain columns to find the rows containing the same person's information, and extracting certain data from each spreadsheet for that person and putting that data on another spreadsheet.
What I've written is included below.
The problem I'm encountering is that when I run this program, only the first person and one person in the middle of the GO_Notification spreadsheet are found and reported. None of the others are, despite the fact that I can manually look through the spreadsheet, compare the columns myself, and find the person.
It's not that they just aren't being reported, as I included a print function at the point where a match has been found, and it only prints 2--once for each of the aforementioned people.
Have I done something bizarre here? Any help would be great.
# This script will take data from Matrix.csv and GO_Notification.csv, find the matching
# data sets, and report the data in an output folder.
import csv

# Will read data from Matrix_CSV, goFile_CSV
# Will write data to AutoGeneratorOutput.csv
f = open('AutoGeneratorOutput.csv', 'wb')
g = open('GO_Notification.csv', 'rb')
h = open('Matrix.csv', 'rb')

matrixFile_reader = csv.reader(h)
goFile_reader = csv.reader(g)
outputFile_writer = csv.writer(f)

# Create the headings in the output file
headings = ['Employee #', 'Name', 'Last 4 of SS', 'Hired', 'GO Date',
            'PL Description', 'Department Name', 'Title', 'Supervisor',
            'Accudose', 'Intellishelf', 'Vocera']
outputFile_writer.writerow(headings)

matrixFile_reader.next()
goFile_reader.next()

while 1:
    for goRow in goFile_reader:
        goLine = goRow
        h.seek(0)  # Return to the top of the matrixFile for the next iteration
        for matrixRow in matrixFile_reader:
            try:
                matrixLine = matrixRow
                # Compare the departments, job numbers, and PLs to find a match
                if goLine[9].strip() == matrixLine[1].strip() and goLine[11].strip() == matrixLine[5].strip() \
                        and goLine[12].strip() == matrixLine[3].strip():
                    # Here's a match
                    output = [goLine[0], goLine[1], '', goLine[2], goLine[3], goLine[9],
                              goLine[11], goLine[13], goLine[15], matrixLine[20], matrixLine[21],
                              matrixLine[22], matrixLine[23]]
                    outputFile_writer.writerow(output)
                    print(goLine[1])
            except StopIteration:
                pass
    break

# Close the files when finished
f.close()
g.close()
h.close()
print('Finished')
It's a little hard to tell what you mean without sample input data. You've also got some confusing unnecessary code and removing that is the first step.
while 1:
    for foo:
        goLine = goRow
        [etcetera]
    break

does the same thing as

for foo:
    goLine = goRow
    [etcetera]

so you can get rid of your "while" and "break" lines.
Also, I'm not sure why you're catching StopIteration. Delete your try / catch lines.
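Applied to your script, the trimmed loop might look roughly like this (same logic as your code, just without the while/break and the try/except):

for goRow in goFile_reader:
    h.seek(0)  # return to the top of the matrix file for each person
    for matrixRow in matrixFile_reader:
        # Compare the departments, job numbers, and PLs to find a match
        if (goRow[9].strip() == matrixRow[1].strip()
                and goRow[11].strip() == matrixRow[5].strip()
                and goRow[12].strip() == matrixRow[3].strip()):
            output = [goRow[0], goRow[1], '', goRow[2], goRow[3], goRow[9],
                      goRow[11], goRow[13], goRow[15], matrixRow[20], matrixRow[21],
                      matrixRow[22], matrixRow[23]]
            outputFile_writer.writerow(output)
            print(goRow[1])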
Load each file as a table in a DB, then query using a join ;)
... well, not that stupid since Python has support for Sqlite3 + in-memory DB.
See Importing a CSV file into a sqlite3 database table using Python
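A rough sketch of that idea with sqlite3's in-memory database (the column counts and the generated column names c0, c1, ... are placeholders; the join condition mirrors the comparison in the question):

import csv
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

def load_csv(path, table, n_cols):
    # create a table with generic column names and copy the CSV rows into it
    cols = ', '.join('c%d' % i for i in range(n_cols))
    cur.execute('CREATE TABLE %s (%s)' % (table, cols))
    with open(path, 'rb') as source:             # 'rb' to match the Python 2.7 csv usage above
        reader = csv.reader(source)
        next(reader)                             # skip the header row
        placeholders = ', '.join('?' * n_cols)
        for row in reader:
            padded = (row + [''] * n_cols)[:n_cols]
            cur.execute('INSERT INTO %s VALUES (%s)' % (table, placeholders), padded)

load_csv('GO_Notification.csv', 'go_notification', 16)   # column counts are guesses
load_csv('Matrix.csv', 'matrix', 24)

# join on the same three columns the script compares
cur.execute("""
    SELECT g.c0, g.c1, g.c2, g.c3, g.c9, g.c11, g.c13, g.c15,
           m.c20, m.c21, m.c22, m.c23
    FROM go_notification g
    JOIN matrix m
      ON trim(g.c9) = trim(m.c1)
     AND trim(g.c11) = trim(m.c5)
     AND trim(g.c12) = trim(m.c3)
""")
for match in cur.fetchall():
    print(match)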