My database has a column where every cell holds a string of data. Each string contains around 15-20 variables, with values assigned to the variable names by an "=" and pairs separated by spaces. The number and names of the variables can differ between cells... The issue I face is that the pairs are separated by spaces, but so are the words inside some variable names and values. The variable names appear in every cell, so I can't just create the headers once and append the values to the data frame as I would with a CSV. The solution also needs to run automatically for all new data in the database.
Example:
Cell 1: TITLE="Brothers Karamazov" AUTHOR="Fyodor Dostoevsky" PAGES="520"... RELEASED="1880".
Cell 2: TITLE="Moby Dick" AUTHOR="Herman Melville" PAGES="655"... MAIN CHARACTER="Ishmael".
I want to convert these strings of data into a structured dataframe like this:
TITLE               AUTHOR             PAGES  RELEASED  MAIN CHARACTER
Brothers Karamazov  Fyodor Dostoevsky  520    1880      NaN
Moby Dick           Herman Melville    655    NaN       Ishmael
Any tips on how to move forward? I have thought about converting it into JSON format using the replace() function before turning it into a dataframe, but I have not yet succeeded. Any tips or ideas are much appreciated.
Thanks,
I guess this sample is what you need.
import pandas as pd

# Helper function: turn 'KEY="value" ...' into a dict
def str_to_dict(cell) -> dict:
    # '" ' (closing quote + space) only occurs between pairs, so it is a safe separator
    normalized_cell = cell.replace('" ', '\n').replace('"', '').split('\n')
    temp = {}
    for x in normalized_cell:
        key, value = x.split('=', 1)  # maxsplit=1 keeps any '=' inside a value intact
        temp[key] = value
    return temp
list_of_cell = [
    'TITLE="Brothers Karamazov" AUTHOR="Fyodor Dostoevsky" PAGES="520" RELEASED="1880"',
    'TITLE="Moby Dick" AUTHOR="Herman Melville" PAGES="655" MAIN CHARACTER="Ishmael"'
]
dataset = [str_to_dict(i) for i in list_of_cell]
print(dataset)
"""
[{'TITLE': 'Brothers Karamazov', 'AUTHOR': 'Fyodor Dostoevsky', 'PAGES': '520', 'RELEASED': '1880'}, {'TITLE': 'Moby Dick', 'AUTHOR': 'Herman Melville', 'PAGES': '655', 'MAIN CHARACTER': 'Ishmael'}]
"""
df = pd.DataFrame(dataset)
df.head()
"""
TITLE AUTHOR PAGES RELEASED MAIN CHARACTER
0 Brothers Karamazov Fyodor Dostoevsky 520 1880 NaN
1 Moby Dick Herman Melville 655 NaN Ishmael
"""
The pandas library can read them from a .csv file and make a data frame. Try this:
import pandas as pd
file = 'xx.csv'
data = pd.read_csv(file)
print(data)
Create a Python dictionary from your database rows.
Then create a pandas DataFrame using the function pandas.DataFrame.from_dict.
Something like this:
import pandas as pd
# Assumed data from DB, structure it like this
data = [
    {
        'TITLE': 'Brothers Karamazov',
        'AUTHOR': 'Fyodor Dostoevsky'
    },
    {
        'TITLE': 'Moby Dick',
        'AUTHOR': 'Herman Melville'
    }
]
# Dataframe as per your requirements
dt = pd.DataFrame.from_dict(data)
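For the two sample rows above, dt should look like this:
                TITLE             AUTHOR
0  Brothers Karamazov  Fyodor Dostoevsky
1           Moby Dick    Herman Melville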
I am doing a task that requires scraping. I have a dataset with ids, and for each id I need to scrape some new information. This dataset has around 4 million rows. Here is my code:
import pandas as pd
import numpy as np
import semanticscholar as sch
import time
# dataset with ids
df = pd.read_csv('paperIds-1975-2005-2015-2.tsv', sep='\t', names=["id"])
# columns that will be produced
cols = ['id', 'abstract', 'arxivId', 'authors',
        'citationVelocity', 'citations',
        'corpusId', 'doi', 'fieldsOfStudy',
        'influentialCitationCount', 'is_open_access',
        'is_publisher_licensed', 'paperId',
        'references', 'title', 'topics',
        'url', 'venue', 'year']
# a new dataframe that we will append the scraped results
new_df = pd.DataFrame(columns=cols)
# a counter so we know when every 100000 papers are scraped
c = 0
i = 0
while i < df.shape[0]:
    try:
        paper = sch.paper(df.id[i], timeout=10)  # scrape the paper
        new_df = new_df.append([df.id[i]]+paper, ignore_index=True)  # append to the new dataframe
        new_df.to_csv('abstracts_impact.csv', index=False)  # save it
        if i % 100000 == 0:  # to check how much we did
            print(c)
            c += 1
        i += 1
    except:
        time.sleep(60)
The problem is that the dataset is pretty big, and this approach is not working. I left it running for 2 days; it scraped around 100000 ids, then suddenly froze, and all the data that had been saved was just empty rows.
I was thinking that the best solution would be to parallelize and batch processing. I never have done this before and I am not familiar with these concepts. Any help would be appreciated. Thank you!
Okay, so first of all there is no data :( so I am just taking a sample ID from the semanticscholar documentation. Looking at your code, I can see several mistakes:
Don't always stick to pd.DataFrame for your work! DataFrames are great, but they are also slow! You just need the IDs from 'paperIds-1975-2005-2015-2.tsv', so you can either read the file using file.readline() or save the data into a list:
data = pd.read_csv('paperIds-1975-2005-2015-2.tsv', sep='\t', names=["id"]).id.values
From the code flow, what I understand is that you want to save the scraped data into a single CSV file, right? So why are you appending the data and rewriting the whole file on every iteration? That makes the code orders of magnitude slower!
I really don't understand the purpose of the time.sleep(60) you have added. If there is an error, you should log it and move on. Why wait?
For checking the progress, you can use the tqdm library, which shows a nice progress bar for your code!
Taking these into consideration, I have modified your code as follows:
import pandas as pd
import semanticscholar as sch
from tqdm import tqdm as TQ  # for progress bar
data = ['10.1093/mind/lix.236.433', '10.1093/mind/lix.236.433'] # using list or np.ndarray looks more logical!
print(data)
>> ['10.1093/mind/lix.236.433', '10.1093/mind/lix.236.433']
Once you have done this, you can go and scrape the data. Before that, note that a pandas DataFrame is basically a dictionary with advanced features. So, for our purpose, we will first add all the information to a dictionary and then create the dataframe. I personally prefer this process: it gives me more control if any changes need to be made.
cols = ['id', 'abstract', 'arxivId', 'authors', 'citationVelocity', 'citations',
        'corpusId', 'doi', 'fieldsOfStudy', 'influentialCitationCount', 'is_open_access',
        'is_publisher_licensed', 'paperId', 'references', 'title', 'topics', 'url', 'venue', 'year']
outputData = dict((k, []) for k in cols)
print(outputData)
{'id': [],
'abstract': [],
'arxivId': [],
'authors': [],
'citationVelocity': [],
'citations': [],
'corpusId': [],
'doi': [],
'fieldsOfStudy': [],
'influentialCitationCount': [],
'is_open_access': [],
'is_publisher_licensed': [],
'paperId': [],
'references': [],
'title': [],
'topics': [],
'url': [],
'venue': [],
'year': []}
Now you can simply fetch the data and save it into your dataframe as below:
for _paperID in TQ(data):
    paper = sch.paper(_paperID, timeout=10)  # scrape the paper
    for key in cols:
        try:
            outputData[key].append(paper[key])  # indexing raises KeyError when a field is missing
        except KeyError:
            outputData[key].append(None)  # if there is no data, append None
            print(f"{key} not found for {_paperID}")
pd.DataFrame(outputData).to_csv('output_file_name.csv', index=False)
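The question also asks about batch processing. As a hedged sketch (assuming, as above, that sch.paper returns a plain dict), you could checkpoint every batch_size papers so a crash never loses more than one batch; batch_size and the output file names are made up for this illustration:
import pandas as pd
import semanticscholar as sch

batch_size = 1000
buffer, batch_no = [], 0
for n, _paperID in enumerate(data, start=1):
    try:
        buffer.append(sch.paper(_paperID, timeout=10))
    except Exception as err:
        print(f"skipping {_paperID}: {err}")  # log and move on instead of sleeping
    if n % batch_size == 0 or n == len(data):  # flush a full batch or the final partial one
        batch_no += 1
        pd.DataFrame(buffer).to_csv(f'batch_{batch_no:05d}.csv', index=False)
        buffer = []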
I have some students, and I created several dicts to store their information. Each dict has the same keys with different values. The code is as follows.
dict_score = {
    'Mike': 100,
    'John': 98,
    'HuxiaoMing': 78,
    'Mechele': 66}
dict_sex = {
    'Mike': "male",
    'John': "male",
    'HuxiaoMing': "female",
    'Mechele': "male"}
dict_height = {
    'Mike': 170,
    'John': 198,
    'HuxiaoMing': 178,
    'Mechele': 166}

def get_degree(score):
    if score > 90:
        return "good"
    elif score < 60:
        return "bad"
    else:
        return "normal"

dict_degree = {
    'Mike': get_degree(dict_score['Mike']),
    'John': get_degree(dict_score['John']),
    'HuxiaoMing': get_degree(dict_score['HuxiaoMing']),
    'Mechele': get_degree(dict_score['Mechele'])}
......
There are a couple of things which annoy me:
If I want to add a new student, I need to change every dict, and sometimes I forget some of the dicts.
If I want to include new features for each student (like their age, or favorite books), I need to create a new dict and type their names again.
Is there any more elegant way to store and use this information? I hope it could be as easy as it is in Excel.
I would recommend using pandas for this. Pandas dataframes let you manage data like this with ultimate flexibility:
import numpy as np
import pandas as pd
dict_score = {
    'Mike': 100,
    'John': 98,
    'HuxiaoMing': 78,
    'Mechele': 66}
dict_sex = {
    'Mike': "male",
    'John': "male",
    'HuxiaoMing': "female",
    'Mechele': "male"}
dict_height = {
    'Mike': 170,
    'John': 198,
    'HuxiaoMing': 178,
    'Mechele': 166}
df = pd.DataFrame({'Score': dict_score,'Sex': dict_sex, 'Height': dict_height}).reset_index()
Here our dataframe (df) will look like:
index Score Sex Height
0 Mike 100 male 170
1 John 98 male 198
2 HuxiaoMing 78 female 178
3 Mechele 66 male 166
Now you can easily apply the get_degree function and create a new column like this:
df['Degree'] = df['Score'].apply(get_degree)
Now our df will look like this:
index Score Sex Height Degree
0 Mike 100 male 170 good
1 John 98 male 198 good
2 HuxiaoMing 78 female 178 normal
3 Mechele 66 male 166 normal
Let's work more!
To get any column (each column is a pandas Series), use:
df['COLUMN_NAME']
Now, say we want the average of the students' scores:
df['Score'].mean()
Get descriptive statistics:
df['Score'].describe()
And using .apply() you can apply your own functions, in powerful combination with lambda.
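For instance, a made-up derived column (the TallAndGood name is invented for this illustration):
# flag tall students whose degree is "good"
df['TallAndGood'] = df.apply(lambda row: row['Height'] > 175 and row['Degree'] == 'good', axis=1)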
Write a function (once) that updates all the dictionaries every time you need to add a student.
def f(name, sex=None, height=None, score=None):
    dict_score[name] = score
    dict_sex[name] = sex
    dict_height[name] = height
    dict_degree[name] = get_degree(dict_score[name])
Make sure you define the function in the same scope as the dictionaries so it can see them.
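Adding a student then becomes a single call, for example:
f('Laura', sex='female', height=155, score=92)  # example values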
Workflow: csv -> python -> csv
Below I show a csv -> python -> csv workflow, which I think would be most convenient. You load the input data from input.csv and then save the output data as out.csv. This way you can create your input data in e.g. Excel and read the output in Excel, too. The heavy lifting and any calculations are done in Python.
The API would look like this:
sc = StudentContainer("input.csv")
sc.add_student(name="Laura", sex="female", height=155, score=92) # Could also add rows programmatically, if wanted
sc.df.to_csv("out.csv", index=False)
The input.csv would look like this: (edited in Excel)
name,score,sex,height
Mike,100,male,170
John,98,male,198
HuxiaoMing,78,female,178
Mechele,66,male,166
and out.csv like this: (opened in Excel, if wanted)
name,score,sex,height,degree
Mike,100,male,170,good
John,98,male,198,good
HuxiaoMing,78,female,178,normal
Mechele,66,male,166,normal
Laura,92,female,155,good
Code
import pandas as pd

def degree(score):
    if score > 90:
        return "good"
    elif score < 60:
        return "bad"
    else:
        return "normal"

class StudentContainer:
    def __init__(self, file=None):
        self.df = pd.DataFrame() if file is None else self.from_csv(file)

    def add_student(self, **data):
        # calculate new columns before adding to self.df
        data["degree"] = degree(data["score"])
        # note: DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        self.df = pd.concat([self.df, pd.DataFrame([data])], ignore_index=True)

    def from_csv(self, file):
        df = pd.read_csv(file)
        df["degree"] = df["score"].apply(degree)
        return df
StudentContainer
Class for managing a group of students that belong together. You could have different groups.
Attribute sc.df holds a pandas.DataFrame. A .csv or .xlsx file can easily be made from the output with sc.df.to_csv() or sc.df.to_excel().
You could extend this with other convenience methods. As an example, there is sc.from_csv() to get the data from a csv file.
So I think the comments have gotten at the key issue here, which is that you've oriented your data in such a way that doing what you want (adding a new student) is harder than it would be if the data were oriented differently.
If there's some reason you need to load data in the way you've done, but then subsequently work with names, here's a handy bit of code for you:
import pandas as pd
pd.concat([pd.DataFrame.from_dict(dict_score, orient='index', columns=["score"]),
           pd.DataFrame.from_dict(dict_sex, orient='index', columns=["sex"]),
           pd.DataFrame.from_dict(dict_height, orient='index', columns=["height"])],
          axis=1).T.to_dict()
This provides a transformation that you may find useful when adding additional data, and/or moving back and forth:
{'Mike': {'score': 100, 'sex': 'male', 'height': 170},
'John': {'score': 98, 'sex': 'male', 'height': 198},
'HuxiaoMing': {'score': 78, 'sex': 'female', 'height': 178},
'Mechele': {'score': 66, 'sex': 'male', 'height': 166}}
You can move back and forth, like this:
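For instance, a minimal sketch going back from the nested dict to a DataFrame, assuming the dict printed above is bound to a name such as students:
import pandas as pd

students = {'Mike': {'score': 100, 'sex': 'male', 'height': 170},
            'John': {'score': 98, 'sex': 'male', 'height': 198}}

# orient='index' turns the outer keys back into the row index
df = pd.DataFrame.from_dict(students, orient='index')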
How can I turn a nested list containing dicts into extra columns in a dataframe in Python?
I received information within a dict from an API,
{'orders':
 [
  {'orderId': '2838168630',
   'dateTimeOrderPlaced': '2020-01-22T18:37:29+01:00',
   'orderItems': [{'orderItemId': 'BFC0000361764421',
                   'ean': '234234234234234',
                   'cancelRequest': False,
                   'quantity': 1}
                  ]},
  {'orderId': '2708182540',
   'dateTimeOrderPlaced': '2020-01-22T17:45:36+01:00',
   'orderItems': [{'orderItemId': 'BFC0000361749496',
                   'ean': '234234234234234',
                   'cancelRequest': False,
                   'quantity': 3}
                  ]},
  {'orderId': '2490844970',
   'dateTimeOrderPlaced': '2019-08-17T14:21:46+02:00',
   'orderItems': [{'orderItemId': 'BFC0000287505870',
                   'ean': '234234234234234',
                   'cancelRequest': True,
                   'quantity': 1}
                  ]}
 ]}
which I managed to turn into a simple dataframe by doing this:
pd.DataFrame(received_data.get('orders'))
output:
orderId date oderItems
1 1-12 [{orderItemId: 'dfs13', 'ean': '34234'}]
2 etc.
...
I would like to have something like this:
orderId date oderItemId ean
1 1-12 dfs13 34234
2 etc.
...
I already tried to single out the orderItems column with iloc and then turn it into a list, so I can try to extract the values again. However, I then still end up with a list from which I need to extract another list, which has the dict in it.
# Load the dataframe as you have already done.
temp_df = df['orderItems'].apply(pd.Series)
# concat temp_df and the original df side by side (axis=1 adds columns)
final_df = pd.concat([df, temp_df], axis=1)
# drop columns if required
Hope it works for you.
Cheers
By combining the answers on this question I reached my end goal. I did the following:
# unlist the orderItems column
temp_df = df['orderItems'].apply(pd.Series)
# put the items in orderItems into separate columns
temp_df_json = json_normalize(temp_df[0])
# join the tables
final_df = df.join(temp_df_json)
# drop the old orderItems column for a clean table
final_df = final_df.drop(["orderItems"], axis=1)
Also, instead of .concat() I applied .join() to join both tables based on the existing index.
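A hedged one-step alternative (assuming pandas >= 1.0, where json_normalize is available as pd.json_normalize) flattens the raw API response directly, without the intermediate apply(pd.Series):
import pandas as pd

# received_data is the raw dict shown in the question
df = pd.json_normalize(received_data['orders'],
                       record_path='orderItems',
                       meta=['orderId', 'dateTimeOrderPlaced'])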
Just to make it clear: you are receiving JSON from the API, so you can try to use the json_normalize function.
Try this:
import pandas as pd
from pandas.io.json import json_normalize
# DataFrame initialization
df = pd.DataFrame({"orderId": [1], "date": ["1-12"], "oderItems": [{ 'orderItemId': 'dfs13', 'ean': '34234'}]})
# Serializing inner dict
sub_df = json_normalize(df["oderItems"])
# Dropping the unserialized column
df = df.drop(["oderItems"], axis=1)
# joining both dataframes.
df.join(sub_df)
So the output is:
orderId date ean orderItemId
0 1 1-12 34234 dfs13
My goal here is to clean up address data from individual CSV files, using a dictionary for each individual column, sort of like automating the find-and-replace feature from Excel. The addresses are divided into columns: house numbers, street names, directions, and street types each in their own column. I used the following code to process the whole document.
missad = {
    'Typo goes here': 'Corrected typo goes here'}

def replace_all(text, dic):
    for i, j in dic.items():  # iterate over the dict that was passed in, not the global
        text = text.replace(i, j)
    return text

with open('original.csv', 'r') as csvfile:
    text = csvfile.read()
text = replace_all(text, missad)
with open('cleanfile.csv', 'w') as cleancsv:
    cleancsv.write(text)
While the code works, I need to have separate dictionaries, as some columns need specific typo fixes. For example, housenum for the house-number column, stdir for the street direction, and so on, each with its own column-specific typos:
housenum = {
    'One': '1',
    'Two': '2'
}
stdir = {
    'NULL': ''}
I have no idea how to proceed; I feel it's something simple, or that I would need pandas, but I am unsure how to continue. I would appreciate any help! Also, is there any way to group several typos together with one corrected typo? I tried the following but got an unhashable type error.
missad = {
    ['Typo goes here', 'Typo 2 goes here', 'Typo 3 goes here']: 'Corrected typo goes here'}
Is something like this what you are looking for?
import pandas as pd
df = pd.read_csv(filename, index_col=False)  # using pandas to read in the CSV file
# let's say in this dataframe you want to do corrections on the 'columnforcorrection' column
correctiondict = {
    'one': 1,
    'two': 2
}
df['columnforcorrection'] = df['columnforcorrection'].replace(correctiondict)
and use this idea for other columns of interest.
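As for grouping several typos under one correction: a list is unhashable, so it cannot be a dict key, but you can keep a {correction: [typos]} table and invert it into a flat mapping. A hedged sketch (grouped and flat are made-up names):
grouped = {
    'Corrected typo goes here': ['Typo goes here', 'Typo 2 goes here', 'Typo 3 goes here']
}
flat = {typo: fix for fix, typos in grouped.items() for typo in typos}
df['columnforcorrection'] = df['columnforcorrection'].replace(flat)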
I'm trying to import some data from a webpage into a dataframe.
Data: a block of text in the following format
[{"ID":0,"Name":"John","Location":"Chicago","Created":"2017-04-23"}, ... ]
I am successfully making the request to the server and can return the data in text form, but cannot seem to convert this to a DataFrame.
E.g.:
import requests

r = requests.get(url)
people = r.text
print(people)
So from this point, I am a bit confused about how to structure this text as a DataFrame. Most tutorials online seem to demonstrate importing csv, excel, or html files.
If people is a list of dicts in string format, you can use json.loads to convert it to a list of dicts and then create a DataFrame easily:
>>> import json
>>> import pandas as pd
>>> people='[{"ID":0,"Name":"John","Location":"Chicago","Created":"2017-04-23"}]'
>>> json.loads(people)
[{'ID': 0, 'Name': 'John', 'Location': 'Chicago', 'Created': '2017-04-23'}]
>>>
>>> data=json.loads(people)
>>> pd.DataFrame(data)
Created ID Location Name
0 2017-04-23 0 Chicago John
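A hedged alternative sketch: let pandas parse the JSON text itself with pd.read_json (newer pandas versions expect a file-like object rather than a bare string, hence the StringIO wrapper):
>>> from io import StringIO
>>> df = pd.read_json(StringIO(people))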