Scraping data in parallel + batch processing - Python

I am working on a task that requires scraping. I have a dataset with IDs, and for each ID I need to scrape some new information. The dataset has around 4 million rows. Here is my code:
import pandas as pd
import numpy as np
import semanticscholar as sch
import time

# dataset with ids
df = pd.read_csv('paperIds-1975-2005-2015-2.tsv', sep='\t', names=["id"])

# columns that will be produced
cols = ['id', 'abstract', 'arxivId', 'authors',
        'citationVelocity', 'citations',
        'corpusId', 'doi', 'fieldsOfStudy',
        'influentialCitationCount', 'is_open_access',
        'is_publisher_licensed', 'paperId',
        'references', 'title', 'topics',
        'url', 'venue', 'year']

# a new dataframe that we will append the scraped results to
new_df = pd.DataFrame(columns=cols)

# a counter so we know when every 100000 papers are scraped
c = 0
i = 0
while i < df.shape[0]:
    try:
        paper = sch.paper(df.id[i], timeout=10)                        # scrape the paper
        new_df = new_df.append([df.id[i]] + paper, ignore_index=True)  # append to the new dataframe
        new_df.to_csv('abstracts_impact.csv', index=False)             # save it
        if i % 100000 == 0:  # to check how much we did
            print(c)
            c += 1
        i += 1
    except:
        time.sleep(60)
The problem is that the dataset is pretty big and this approach is not working. I left it running for two days and it scraped around 100,000 IDs, then it suddenly froze, and all the data that had been saved was just empty rows.
I was thinking that the best solution would be to parallelize and use batch processing. I have never done this before and I am not familiar with these concepts. Any help would be appreciated. Thank you!

Okay, so first of all there is no data :( so I am just taking a sample ID from the semanticscholar docs. Looking at your code, I can see plenty of mistakes:
Don't always stick to pd.DataFrame for your work! DataFrames are great, but they are also slow! You only need the IDs from 'paperIds-1975-2005-2015-2.tsv', so you can either read the file using file.readline() or load the column into a list/array:
data = pd.read_csv('paperIds-1975-2005-2015-2.tsv', sep='\t', names=["id"]).id.values
From the code flow, what I understand is that you want to save all the scraped data into a single CSV file, right? So why are you appending to the dataframe and rewriting the whole file on every iteration? That alone makes the code orders of magnitude slower!
I also don't really understand the purpose of the time.sleep(60) you have added. If there is an error, you should log it and move on - why wait?
For checking progress, you can use the tqdm library, which shows a nice progress bar for your code!
Taking these into consideration, I have modified your code as follows:
import pandas as pd
import semanticscholar as sch
from tqdm import tqdm as TQ # for progress bar
data = ['10.1093/mind/lix.236.433', '10.1093/mind/lix.236.433'] # using list or np.ndarray looks more logical!
print(data)
>> ['10.1093/mind/lix.236.433', '10.1093/mind/lix.236.433']
Once you have done this, you can go ahead and scrape the data. One more thing: a pandas DataFrame is basically a dictionary with extra features, so for our purpose we will first collect all the information in a dictionary and only then create the dataframe. I personally prefer this approach - it gives me more control if any changes need to be made.
cols = ['id', 'abstract', 'arxivId', 'authors', 'citationVelocity', 'citations',
'corpusId', 'doi', 'fieldsOfStudy', 'influentialCitationCount', 'is_open_access',
'is_publisher_licensed', 'paperId', 'references', 'title', 'topics', 'url', 'venue', 'year']
outputData = dict((k, []) for k in cols)
print(outputData)
{'id': [],
'abstract': [],
'arxivId': [],
'authors': [],
'citationVelocity': [],
'citations': [],
'corpusId': [],
'doi': [],
'fieldsOfStudy': [],
'influentialCitationCount': [],
'is_open_access': [],
'is_publisher_licensed': [],
'paperId': [],
'references': [],
'title': [],
'topics': [],
'url': [],
'venue': [],
'year': []}
Now you can simply fetch the data and save it into your dataframe as below:
for _paperID in TQ(data):
    paper = sch.paper(_paperID, timeout=10)  # scrape the paper
    for key in cols:
        try:
            outputData[key].append(paper[key])
        except KeyError:
            outputData[key].append(None)  # if there is no data, append None
            print(f"{key} not found for {_paperID}")

pd.DataFrame(outputData).to_csv('output_file_name.csv', index=False)
This is the output that I have obtained:
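On the parallelization and batch-processing part of the question: one way to combine the two is a thread pool plus periodic batch saves, so that a crash never costs more than one batch and the CSV is not rewritten on every row. Below is a rough sketch only (untested against the live API; the worker count, batch size, fetch_one helper and output file names are arbitrary choices of mine, not anything from the semanticscholar docs), and mind the API's rate limits when picking max_workers.
import pandas as pd
import semanticscholar as sch
from concurrent.futures import ThreadPoolExecutor

def fetch_one(paper_id):
    # one worker call; return None on failure so the pool keeps going
    try:
        return sch.paper(paper_id, timeout=10)
    except Exception:
        return None

BATCH_SIZE = 1000  # tune to taste

data = pd.read_csv('paperIds-1975-2005-2015-2.tsv', sep='\t', names=["id"]).id.values

with ThreadPoolExecutor(max_workers=8) as pool:
    for start in range(0, len(data), BATCH_SIZE):
        batch_ids = data[start:start + BATCH_SIZE]
        results = pool.map(fetch_one, batch_ids)
        rows = [r for r in results if r is not None]
        # one file per batch; concatenate them at the end
        pd.DataFrame(rows).to_csv(f'scraped_{start}.csv', index=False)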


Convert string from database to dataframe

My database has a column where every cell holds a string of data. There are around 15-20 variables; the information is assigned to the variables with an "=" and separated by a space. The number and names of the variables can differ between individual cells... The issue I face is that the data is separated by spaces and so are some of the variable names. The variable names appear in every cell, so I can't just make the headers and add the values to the data frame like a CSV. The solution also needs to handle all new data in the database automatically.
Example:
Cell 1: TITLE="Brothers Karamazov" AUTHOR="Fyodor Dostoevsky" PAGES="520"... RELEASED="1880".
Cell 2: TITLE="Moby Dick" AUTHOR="Herman Melville" PAGES="655"... MAIN CHARACTER="Ishmael".
I want to convert these strings of data into a structured dataframe like this:
TITLE               AUTHOR             PAGES  RELEASED  MAIN CHARACTER
Brothers Karamazov  Fyodor Dostoevsky  520    1880      NaN
Moby Dick           Herman Melville    655    NaN       Ishmael
Any tips on how to move forward? I have thought about converting it into a JSON format using the replace() function before turning it into a dataframe, but I have not yet succeeded. Any tips or ideas are much appreciated.
Thanks,
I guess this sample is what you need.
import pandas as pd
# Helper function
def str_to_dict(cell) -> dict:
    normalized_cell = cell.replace('" ', '\n').replace('"', '').split('\n')
    temp = {}
    for x in normalized_cell:
        key, value = x.split('=')
        temp[key] = value
    return temp

list_of_cell = [
    'TITLE="Brothers Karamazov" AUTHOR="Fyodor Dostoevsky" PAGES="520" RELEASED="1880"',
    'TITLE="Moby Dick" AUTHOR="Herman Melville" PAGES="655" MAIN CHARACTER="Ishmael"'
]
dataset = [str_to_dict(i) for i in list_of_cell]
print(dataset)
"""
[{'TITLE': 'Brothers Karamazov', 'AUTHOR': 'Fyodor Dostoevsky', 'PAGES': '520', 'RELEASED': '1880'}, {'TITLE': 'Moby Dick', 'AUTHOR': 'Herman Melville', 'PAGES': '655', 'MAIN CHARACTER': 'Ishmael'}]
"""
df = pd.DataFrame(dataset)
df.head()
"""
TITLE AUTHOR PAGES RELEASED MAIN CHARACTER
0 Brothers Karamazov Fyodor Dostoevsky 520 1880 NaN
1 Moby Dick Herman Melville 655 NaN Ishmael
"""
The pandas library can read them from a .csv file and build a data frame - try this:
import pandas as pd
file = 'xx.csv'
data = pd.read_csv(file)
print(data)
Create a Python dictionary from your database rows.
Then create the pandas DataFrame using the function pandas.DataFrame.from_dict.
Something like this:
import pandas as pd
# Assumed data from DB, structure it like this
data = [
    {
        'TITLE': 'Brothers Karamazov',
        'AUTHOR': 'Fyodor Dostoevsky'
    },
    {
        'TITLE': 'Moby Dick',
        'AUTHOR': 'Herman Melville'
    }
]
# Dataframe as per your requirements
dt = pd.DataFrame.from_dict(data)

pandas, dataframe: If you need to process data row by row, how to do it faster than itertuples

I know that .itertuples() and .iterrows() are slow, but how can I speed things up if I need to process the data one row at a time, as shown below?
df = pd.read_csv('example.csv')
posts = []
for row in df.itertuples():
    post = Post(title=row.title, text=row.text, ...)
    posts.append(post)
You can use a list comprehension and keyword-argument unpacking (**kwargs) if your DataFrame columns have the same names as your class attributes. An example is shown below.
df = pd.DataFrame({"title": ["fizz", "buzz"], "text": ["aaaa", "bbbb"]})
posts = [Post(**kwargs) for kwargs in df.to_dict("records")]
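The question never shows the Post class, so here is a minimal hypothetical stand-in (my own definition, purely so the snippet above runs end to end):
from dataclasses import dataclass
import pandas as pd

@dataclass
class Post:
    title: str
    text: str

df = pd.DataFrame({"title": ["fizz", "buzz"], "text": ["aaaa", "bbbb"]})
posts = [Post(**kwargs) for kwargs in df.to_dict("records")]
print(posts)  # [Post(title='fizz', text='aaaa'), Post(title='buzz', text='bbbb')]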
What I usually do is use the apply function.
import pandas as pd
df = pd.DataFrame(dict(title=["title1", "title2", "title3"],text=["text1", "text2", "text3"]))
df["Posts"] = df.apply(lambda x: dict(title=x["title"], text=x["text"]), axis=1)
posts = list(df["Posts"])
print(posts)
Output:
[{'title': 'title1', 'text': 'text1'}, {'title': 'title2', 'text': 'text2'}, {'title': 'title3', 'text': 'text3'}]
It's better to avoid a for loop when you have other methods available to do the job.

How to normalize the JSON below using pandas in Django

Using this view.py query, my output looks like the JSON below. As you can see, the choices field contains multiple arrays, and I want to normalize them into rows one by one. Here is my JSON:
{"pages":[{"name":"page1","title":"SurveyWindow Pvt. Ltd. Customer Feedback","description":"Question marked * are compulsory.",
"elements":[{"type":"radiogroup","name":"question1","title":"Do you like our product? *","isRequired":true,
"choices":[{"value":"Yes","text":"Yes"},{"value":"No","text":"No"}]},{"type":"checkbox","name":"question2","title":"Please Rate Our PM Skill","isRequired":false,"choices":[{"value":"High","text":"High"},{"value":"Low","text":"Low"},{"value":"Medium","text":"Medium"}]},{"type":"radiogroup","name":"question3","title":"Do you like our services? *","isRequired":true,"choices":[{"value":"Yes","text":"Yes"},{"value":"No","text":"No"}]}]}]}
this is my view.py
jsondata=SurveyMaster.objects.all().filter(survey_id='1H2711202014572740')
q = jsondata.values('survey_json_design')
qs_json = pd.DataFrame.from_records(q)
datatotable = pd.json_normalize(qs_json['survey_json_design'], record_path=['pages','elements'])
qs_json = datatotable.to_html()
Based on your comments and picture, here's what I would do to go from the picture to something more SQL-friendly (what you refer to as "normalization"), but keep in mind this might blow up if you don't have sufficient memory.
Create a new list which you'll fill with the new data, then iterate over the pandas table's rows, and inside that loop iterate over every item in the choices list. For every iteration of the inner loop, use the data from the row minus the column you're iterating over; for convenience I added it as the last element.
import pandas as pd

# Example data
df = pd.DataFrame({"choices": [[{"text": "yes", "value": "yes"},
                                {"text": "no", "value": "no"}],
                               [{"ch1": 1, "ch2": 2}, {"ch3": "ch3"}]],
                   "name": ["kostas", "rajesh"]})

data = []
for i, row in df.iterrows():
    for val in row["choices"]:
        data.append((*row.drop("choices").values, val))

df = pd.DataFrame(data, columns=["names", "choices"])
print(df)
    names                          choices
0  kostas  {'text': 'yes', 'value': 'yes'}
1  kostas    {'text': 'no', 'value': 'no'}
2  rajesh             {'ch1': 1, 'ch2': 2}
3  rajesh                   {'ch3': 'ch3'}
This is where I guess you want to go. All that's left is to modify the column / variable names to match your own data.
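As a side note, newer pandas versions (0.25+ for explode, 1.1+ for the ignore_index flag) can replace the manual double loop with DataFrame.explode. A sketch of the same idea, assuming the example df above:
import pandas as pd

df = pd.DataFrame({"choices": [[{"text": "yes", "value": "yes"},
                                {"text": "no", "value": "no"}],
                               [{"ch1": 1, "ch2": 2}, {"ch3": "ch3"}]],
                   "name": ["kostas", "rajesh"]})

# one row per element of each choices list, other columns repeated
exploded = df.explode("choices", ignore_index=True)
print(exploded)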

Processing a large amount of tweets for exploratory data analysis such as number of unique tweets, and histogram of tweet counts per user

I have 14M tweets in a single tweets.txt file (given to me) in which the entire JSON of each tweet is one line of the txt file. I want to get some basic statistics such as the number of unique tweets, the number of unique users, a histogram of retweet counts per tweet, and a histogram of tweets per user. Later I am interested in perhaps more intricate analysis.
I have the following code but it is extremely slow. I left it running for the entire day and it is only at 200,000 tweets processed. Can the current code be fixed somehow so it can be sped up? Is the current idea of creating a pandas dataframe of 14M tweets even a good idea or feasible for exploratory data analysis? My current machine has 32GB RAM and 12 CPUs. If this is not feasible on this machine, I also have access to shared cluster at my university.
import pandas as pd
import json
from pprint import pprint

tweets = open('tweets.txt')

columns = ['coordinates', 'created_at', 'favorite_count', 'favorited', 'tweet_id', 'lang', 'quote_count',
           'reply_count', 'retweet_count', 'retweeted', 'text', 'timestamp_ms', 'user_id', 'user_description',
           'user_followers_count', 'user_favorite_count', 'user_following_count', 'user_friends_count',
           'user_location', 'user_screenname', 'user_statuscount', 'user_profile_image', 'user_name', 'user_verified']
#columns = ['coordinates', 'created_at']

df = pd.DataFrame()
count = 0
for line in tweets:
    count += 1
    print(count)
    #print(line)
    #print(type(line))
    tweet_obj = json.loads(line)
    #pprint(tweet_obj)
    #print(tweet_obj['id'])
    #print(tweet_obj['user']['id'])
    df = df.append({'coordinates': tweet_obj['coordinates'],
                    'created_at': tweet_obj['created_at'],
                    'favorite_count': tweet_obj['favorite_count'],
                    'favorited': tweet_obj['favorited'],
                    'tweet_id': tweet_obj['id'],
                    'lang': tweet_obj['lang'],
                    'quote_count': tweet_obj['quote_count'],
                    'reply_count': tweet_obj['reply_count'],
                    'retweet_count': tweet_obj['retweet_count'],
                    'retweeted': tweet_obj['retweeted'],
                    'text': tweet_obj['text'],
                    'timestamp_ms': tweet_obj['timestamp_ms'],
                    'user_id': tweet_obj['user']['id'],
                    'user_description': tweet_obj['user']['description'],
                    'user_followers_count': tweet_obj['user']['followers_count'],
                    'user_favorite_count': tweet_obj['user']['favourites_count'],
                    'user_following': tweet_obj['user']['following'],
                    'user_friends_count': tweet_obj['user']['friends_count'],
                    'user_location': tweet_obj['user']['location'],
                    'user_screen_name': tweet_obj['user']['screen_name'],
                    'user_statuscount': tweet_obj['user']['statuses_count'],
                    'user_profile_image': tweet_obj['user']['profile_image_url'],
                    'user_name': tweet_obj['user']['name'],
                    'user_verified': tweet_obj['user']['verified']
                    }, ignore_index=True)

df.to_csv('tweets.csv')
One significant speed increase would come from appending each dictionary to a list instead of using df.append, and then creating the dataframe once, outside the loop. Something like:
count = 0
l_tweets = []
for line in tweets:
    count += 1
    tweet_obj = json.loads(line)
    # append to a list
    l_tweets.append({'coordinates': tweet_obj['coordinates'],
                     # ... copy same as yours
                     'user_verified': tweet_obj['user']['verified']
                     })

df = pd.DataFrame(l_tweets, columns=columns)
As for whether 14M tweets can be handled by your RAM, I can't really say. On a cluster it usually can, but how best to process the data depends on the cluster's configuration, I think.
Or, if you keep the order of the elements the same as in your columns list, a list instead of a dictionary would work too:
count = 0
l_tweets = []
for line in tweets:
    count += 1
    tweet_obj = json.loads(line)
    # append to a list
    l_tweets.append([tweet_obj['coordinates'], tweet_obj['created_at'],
                     # ... copy just the values here in the right order
                     tweet_obj['user']['name'], tweet_obj['user']['verified']
                     ])

df = pd.DataFrame(l_tweets, columns=columns)
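If memory does become a problem, the statistics you listed don't strictly need the full dataframe at all; a rough sketch of a single streaming pass is below (the field names follow the question's own code; the counters and variable names are my own choices):
import json
from collections import Counter

unique_tweets = set()
tweets_per_user = Counter()
retweet_counts = []

with open('tweets.txt') as fh:
    for line in fh:
        tweet = json.loads(line)
        unique_tweets.add(tweet['id'])
        tweets_per_user[tweet['user']['id']] += 1
        retweet_counts.append(tweet['retweet_count'])

print("unique tweets:", len(unique_tweets))
print("unique users:", len(tweets_per_user))
# tweets_per_user.values() and retweet_counts can be fed straight into a histogram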

Issues Converting Dictionary to DataFrame in Python

I am scraping data from my Facebook Business account and am having issues converting the dictionary I created from the cursor object I got from the FB API connection into a pandas DataFrame. Specifically, using pd.DataFrame(dict) is only returning data from the most recent day of the time series, even though my dict contains the full length of the series.
I have tried different specifications of pd.DataFrame(), but I keep getting the same output.
Obviously I expect pandas to convert the entire dictionary into a df, not just the last chunk... very strange that pd.DataFrame() isn't working for me. Does anyone have a solution, or has anyone encountered a similar issue before?
params = {
    'time_range': {
        'since': "2019-08-11",
        'until': "2019-09-09"
    },
    'fields': [
        AdsInsights.Field.campaign_id,
        AdsInsights.Field.campaign_name,
        AdsInsights.Field.adset_name,
        AdsInsights.Field.ad_name,
        AdsInsights.Field.spend,
        AdsInsights.Field.impressions,
        AdsInsights.Field.clicks,
        AdsInsights.Field.buying_type,
        AdsInsights.Field.objective,
        AdsInsights.Field.actions
    ],
    'breakdowns': ['country'],
    'level': 'ad',
    'time_increment': 1
}
#get insights for campaign
campaign = AdCampaign('act_xxxx')
insights = campaign.get_insights(params=params)
print(insights)
#check that your output is a cursor object
type(insights)
#iterate over cursor object to convert into a dictionary
for item in insights:
    data = dict(item)
    print(data)

df = pd.DataFrame(data)
export_csv = df.to_csv('.../Documents/data.csv', header=True)
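One likely cause, judging from the snippet: data is rebound on every pass through the loop, so pd.DataFrame(data) only ever sees the last cursor item. A minimal sketch of the usual fix - collect each row into a list first - assuming the same insights cursor as above (the output file name here is arbitrary):
import pandas as pd

rows = []
for item in insights:
    rows.append(dict(item))  # keep every row, not just the last one

df = pd.DataFrame(rows)
df.to_csv('data.csv', index=False)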
