Create datasets based on authors from another dataset - python

I have a dataset in the following format:

dt =
   text   author   title
0  text0  author0  title0
1  text1  author1  title1
...

and I would like to create separate datasets, each containing only the texts of one author. For example, the dataset named dt1 would contain the texts of author1, dt2 the texts of author2, etc.
I would be grateful if you could help me with this using python.
Update:

dt =
                              text  author       title
0  I would like to go to the beach  George       Beach
1       I was in park few days ago    Nick        Park
2        I would like to go in uni   Peter  University
3    I have be in the airport at 8   Maria     Airport

Try this; it is what I understand you require:
import pandas as pd

data = {
    'text':   ['text0', 'text1', 'text2'],
    'author': ['author0', 'author1', 'author1'],
    'title':  ['Comunicación', 'Administración', 'Ventas']
}
df = pd.DataFrame(data)

df1 = df[df["author"] == "author0"]
df2 = df[df["author"] == "author1"]
print(df1)
print(df2)
Update:

import pandas as pd

data = {
    'text':   ['text0', 'text1', 'text2'],
    'author': ['author0', 'author1', 'author1'],
    'title':  ['Comunicación', 'Administración', 'Ventas']
}
df = pd.DataFrame(data)

df1 = df[df["author"] == "author0"]
df2 = df[df["author"] == "author1"]

# print one filtered frame per unique author
list_author = df['author'].unique().tolist()
for x in list_author:
    a = df[df["author"] == x]
    print(a)
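If you want one dataset per author without hard-coding df1, df2, ... by hand, a dict keyed by author name is a common pattern. A minimal sketch using groupby on the same df as above (the dict name frames is my own choice, not from the question):

import pandas as pd

data = {
    'text':   ['text0', 'text1', 'text2'],
    'author': ['author0', 'author1', 'author1'],
    'title':  ['Comunicación', 'Administración', 'Ventas']
}
df = pd.DataFrame(data)

# build one independent DataFrame per author, keyed by the author's name
frames = {author: group.reset_index(drop=True)
          for author, group in df.groupby('author')}

print(frames['author1'])  # the texts of author1 only

Each value in frames is a separate DataFrame, so frames['author1'] plays the role of dt1 without creating a new variable for every author.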

Related

Storing key text as header and value text as rows using a data frame in Python with Beautiful Soup

for imo in imos:
    ...
    ...
    keys_div = soup.find_all("div", {"class": "col-4 keytext"})
    values_div = soup.find_all("div", {"class": "col-8 valuetext"})
    for key, value in zip(keys_div, values_div):
        print(key.text + ": " + value.text)
Output:
Ship Name: MAERSK ADRIATIC
Shiptype: Chemical/Products Tanker
IMO/LR No.: 9636632
Gross: 23,297
Call Sign: 9V3388
Deadweight: 37,538
MMSI No.: 566429000
Year of Build: 2012
Flag: Singapore
Status: In Service/Commission
Operator: Handytankers K/S
Shipbuilder: Hyundai Mipo Dockyard Co Ltd
ShipType: Chemical/Products Tanker
Built: 2012
GT: 23,297
Deadweight: 37,538
Length Overall: 184.000
Length (BP): 176.000
Length (Reg): 177.460
Bulbous Bow: Yes
Breadth Extreme: 27.430
Breadth Moulded: 27.400
Draught: 11.500
Depth: 17.200
Keel To Mast Height: 46.900
Displacement: 46565
T/CM: 45.0
This is the output for one imo. I want to store this output in a dataframe and write it to CSV, where the CSV has the key text as headers and the value text as rows for all the IMOs. Please help me on how to do it.
All you have to do is add the results to a list and then output that list to a dataframe.
import pandas as pd

filepath = r"C:\users\test\test_file.csv"

output_data = []
for imo in imos:
    # (fetch and parse the page for this imo into soup here, as in your loop)
    keys_div = [i.text for i in soup.find_all("div", {"class": "col-4 keytext"})]
    values_div = [i.text for i in soup.find_all("div", {"class": "col-8 valuetext"})]
    dict1 = dict(zip(keys_div, values_div))
    output_data.append(dict1)

df = pd.DataFrame(output_data)
df.to_csv(filepath, index=False)

How do I transform a non-CSV text file into a CSV using Python/Pandas?

I have a text file that looks like this:
Id Number: 12345678
Location: 1234561791234567090-8.9
Street: 999 Street AVE
Buyer: john doe
Id Number: 12345688
Location: 3582561791254567090-8.9
Street: 123 Street AVE
Buyer: Jane doe # buyer % LLC
Id Number: 12345689
Location: 8542561791254567090-8.9
Street: 854 Street AVE
Buyer: Jake and Bob: Owner%LLC: Inc
I'd like the file to look like this:
Id Number  Location                 Street          Buyer
12345678   1234561791234567090-8.9  999 Street AVE  john doe
12345688   3582561791254567090-8.9  123 Street AVE  Jane doe # buyer % LLC
12345689   8542561791254567090-8.9  854 Street AVE  Jake and Bob: Owner%LLC: Inc
I have tried the following:
# 1 Read text file and ignore bad lines (lines with extra colons thus reading as extra fields).
tr = pd.read_csv('C:\\File Path\\test.txt', sep=':', header=None, error_bad_lines=False)
# 2 Convert into a dataframe/pivot table.
ndf = pd.DataFrame(tr.pivot(index=None, columns=0, values=1))
# 3 Clean up the pivot table to remove NaNs and reset the index (line by line).
nf2 = ndf.apply(lambda x: x.dropna().reset_index(drop=True))
Here is where got the last line (#3): https://stackoverflow.com/a/62481057/10448224
When I do the above and export to CSV, the headers are arranged like the following:
(index), Street, Buyer, Id Number, Location
The data is filled in nicely but at some point the Buyer field becomes inaccurate but the rest of the fields are accurate through the entire DF.
My guesses:
When I run part #1 of my script I get the following error 507 times:
b'Skipping line 500: expected 2 fields, saw 3\nSkipping line 728: expected 2 fields, saw 3\
At the tail end of the new DF I am missing exactly 507 entries for the Buyer field. So I think that when I drop my bad lines, the field is pushing my data up.
Pain Points:
The Buyer field will sometimes have extra colons and other odd characters. So when I try to use a colon as a delimiter I run into problems.
I am new to Python and I am very new to using functions. I primarily use Pandas to manipulate data at a somewhat basic level. So in the words of the great Michael Scott: "Explain it to me like I'm five." Many many thanks to anyone willing to help.
Here's what I meant by reading in and using split. Very similar to the other answers. Untested, and I don't recall whether inputline includes the EOL, so I stripped it too.
import pandas as pd

with open('myfile.txt') as f:
    data = []    # holds database
    record = {}  # holds built up record
    for inputline in f:
        key, value = inputline.strip().split(':', 1)
        if key == "Id Number":  # new record starting
            if len(record):
                data.append(record)  # write previous record
                record = {}
        record.update({key: value})
    if len(record):
        data.append(record)  # write out final record

df = pd.DataFrame(data)
This is a minimal example that demonstrates the basics:
cat split_test.txt
Id Number: 12345678
Location: 1234561791234567090-8.9
Street: 999 Street AVE
Buyer: john doe
Id Number: 12345688
Location: 3582561791254567090-8.9
Street: 123 Street AVE
Buyer: Jane doe # buyer % LLC
Id Number: 12345689
Location: 8542561791254567090-8.9
Street: 854 Street AVE
Buyer: Jake and Bob: Owner%LLC: Inc
import csv

with open("split_test.txt", "r") as f:
    id_val = "Id Number"
    list_var = []
    for line in f:
        split_line = line.strip().split(':')
        print(split_line)
        if split_line[0] == id_val:
            d = {}
            d[split_line[0]] = split_line[1]
            list_var.append(d)
        else:
            d.update({split_line[0]: split_line[1]})
list_var
[{'Id Number': ' 12345689',
'Location': ' 8542561791254567090-8.9',
'Street': ' 854 Street AVE',
'Buyer': ' Jake and Bob'},
{'Id Number': ' 12345678',
'Location': ' 1234561791234567090-8.9',
'Street': ' 999 Street AVE',
'Buyer': ' john doe'},
{'Id Number': ' 12345688',
'Location': ' 3582561791254567090-8.9',
'Street': ' 123 Street AVE',
'Buyer': ' Jane doe # buyer % LLC'}]
with open("split_ex.csv", "w") as csv_file:
field_names = list_var[0].keys()
csv_writer = csv.DictWriter(csv_file, fieldnames=field_names)
csv_writer.writeheader()
for row in list_var:
csv_writer.writerow(row)
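Note that split(':') with no limit is what truncates the Buyer field in the output above ("Jake and Bob: Owner%LLC: Inc" became ' Jake and Bob'), because only the first two pieces are used. Capping the split at the first colon avoids that; a one-line change to the loop above:

        # split only on the first colon so colons inside the value survive
        split_line = line.strip().split(':', 1)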
I would try reading the file line by line, splitting the key-value pairs into a list of dicts to look something like:
data = [
    {
        "Id Number": 12345678,
        "Location": 1234561791234567090-8.9,
        ...
    },
    {
        "Id Number": ...
    }
]

# easy to create the dataframe from here
your_df = pd.DataFrame(data)

How to use values from a PANDAS data frame as filter params in regex?

I would like to use the values from a Pandas df as filter params in a SPARQL query.
By reading the data from an excel file I'm creating the pandas dataframe:
xls = pd.ExcelFile ('excel/dataset_nuovo.xlsx')
df1 = pd.read_excel(xls, 'Sheet1')
print(df1)
Here is the resulting dataframe:
oggetto descrizione lenght label tipologia
0 #iccd4580759# Figure: putto. Oggetti: ghirlanda di fiori 6 Bad OpereArteVisiva
1 #iccd3636719# Decorazione plastica. 2 Bad OpereArteVisiva
2 #iccd3641475# Scultura.. Figure: angelo 3 Bad OpereArteVisiva
3 #iccd8282504# Custodia di reliquiario in legno intagliato e ... 8 Good OpereArteVisiva
4 #iccd3019633# Portale. 1 Bad OpereArteVisiva
... ... ... ... ... ...
59995 #iccd2274873# Ciotola media a larga tesa. Decorazione in cob... 35 Good OpereArteVisiva
59996 #iccd11189887# Il medaglione bronzeo, sormontato da un'aquila... 85 Good OpereArteVisiva
59997 #iccd4545324# Tessuto di fondo rosaceo. Disegno a fiori e fo... 49 Good OpereArteVisiva
59998 #iccd2934870# Sculture a tutto tondo in legno dipinto di bia... 28 Good OpereArteVisiva
59999 #iccd2685205# Calice con piede a base circolare e nodo ovoid... 14 Bad OpereArteVisiva
Then I need to use the values from the oggetto column as a filter to retrieve, for each record, the corresponding subject from a SPARQL endpoint.
By using this SPARQL query:
SELECT ?object ?description (group_concat(?subject;separator="|") as ?subjects)
WHERE { ?object a crm:E22_Man-Made_Object;
crm:P3_has_note ?description;
crm:P129_is_about ?concept;
crm:P2_has_type ?type.
?concept a crm:E28_Conceptual_Object;
rdfs:label ?subject.
filter( regex(str(?object), "#iccd4580759#" ))
}
I'm able to filter one single record.
object.type object.value ... subjects.type subjects.value
0 uri http://dati.culturaitalia.it/resource/oai-oaic... ... literal Putto con ghirlanda di fiori|Putto con ghirlan..
Since the dataset is 60k records, I would like to automate the process by looping through the dataframe and using each value as a filter, to obtain a new df with a corresponding subject column.
oggetto descrizione subject lenght label tipologia
0 #iccd4580759# Figure: putto. Oggetti: ghirlanda di fiori Putto con ghirlanda di fiori|Putto con ghirlan.. 6 Bad OpereArteVisiva
Here is the entire script I wrote:
import xlrd
import pandas as pd
from pandas import json_normalize
from SPARQLWrapper import SPARQLWrapper, JSON

xls = pd.ExcelFile('excel/dataset_nuovo.xlsx')
df1 = pd.read_excel(xls, 'Sheet1')
print(df1)

def query_ci(sparql_query, sparql_service_url):
    sparql = SPARQLWrapper(sparql_service_url)
    sparql.setQuery(sparql_query)
    sparql.setReturnFormat(JSON)
    # ask for the result
    result = sparql.query().convert()
    return json_normalize(result["results"]["bindings"])
sparql_query = """ SELECT ?object ?description (group_concat(?subject;separator="|") as ?subjects)
WHERE { ?object a crm:E22_Man-Made_Object;
crm:P3_has_note ?description;
crm:P129_is_about ?concept;
crm:P2_has_type ?type.
?concept a crm:E28_Conceptual_Object;
rdfs:label ?subject.
filter( regex(str(?object), "#iccd4580759#" ))
}
"""
sparql_service_url = "http://dati.culturaitalia.it/sparql"
result_table = query_ci(sparql_query, sparql_service_url)
print (result_table)
result_table.to_excel("output.xlsx")
Is it possible to do that?
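Yes. A minimal sketch, reusing the query_ci function from the script above: turn the query into a template, substitute each oggetto value into the regex filter, and attach the collected subjects to df1. The column name 'subjects.value' matches the json_normalize output shown earlier; the output filename is my own choice. Running 60k sequential queries will be slow, so test on a slice of df1 first.

query_template = """ SELECT ?object ?description (group_concat(?subject;separator="|") as ?subjects)
WHERE { ?object a crm:E22_Man-Made_Object;
crm:P3_has_note ?description;
crm:P129_is_about ?concept;
crm:P2_has_type ?type.
?concept a crm:E28_Conceptual_Object;
rdfs:label ?subject.
filter( regex(str(?object), "%s" ))
}
"""

subjects = []
for oggetto in df1['oggetto']:
    result = query_ci(query_template % oggetto, sparql_service_url)
    # keep the concatenated subjects when the query matched, else an empty string
    subjects.append(result['subjects.value'].iloc[0] if not result.empty else '')

df1['subject'] = subjects
df1.to_excel("output_with_subjects.xlsx")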

Sorting an API response with Python to Excel or CSV

I'm trying to sort the free UK Police API response into a readable format, CSV or Excel.
I'm using the Requests library. My initial code gets the response in JSON format:
import requests

r = requests.get('https://data.police.uk/api/crimes-street/all-crime?poly=51.169,-0.633:51.186,-0.5436:51.226,-0.6224&date=2019-12')
r_json = r.json()
for i in r_json:
    for key, value in i.items():
        print(key, ":", value)
The code above produces as follows:
category : anti-social-behaviour
location_type : Force
location : {'latitude': '51.196818', 'street': {'id': 1147343, 'name': 'On or near Parking Area'}, 'longitude': '-0.605146'}
context :
outcome_status : None
persistent_id :
id : 79955592
location_subtype :
month : 2019-12
How can I create a table with correct headers for the response I get? Headers would be 'category', 'latitude', 'street', 'name', 'longitude', 'month'.
You need to go deeper into the dictionary tree to get some of the data, like latitude. Results are collected into a list of lists, then loaded into a data frame and saved as a CSV file.
import requests
import pandas as pd

r = requests.get('https://data.police.uk/api/crimes-street/all-crime?poly=51.169,-0.633:51.186,-0.5436:51.226,-0.6224&date=2019-12')
r_json = r.json()

# collect data into list of lists
collected_data = []
for data in r_json:
    category = data.get('category')
    month = data.get('month')
    latitude = ''
    longitude = ''
    street = ''
    for key, value in data.items():
        if key == 'location':
            latitude = value.get('latitude')
            longitude = value.get('longitude')
            street = value.get('street').get('name')
    collected_data.append([category, latitude, longitude, street, month])

# load data into data frame
df = pd.DataFrame(collected_data, columns=['Category', 'Latitude', 'Longitude', 'Street', 'Month'])

# save data frame into csv
df.to_csv('data.csv')
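If you are on a reasonably recent pandas (1.0+), json_normalize can do the flattening in one call; a shorter sketch of the same idea, where the dotted column names are what normalize produces from the nested location dicts:

import requests
import pandas as pd

r = requests.get('https://data.police.uk/api/crimes-street/all-crime?poly=51.169,-0.633:51.186,-0.5436:51.226,-0.6224&date=2019-12')

# nested dicts become dotted columns, e.g. 'location.latitude'
df = pd.json_normalize(r.json())
df = df[['category', 'location.latitude', 'location.longitude',
         'location.street.name', 'month']]
df.to_csv('data.csv', index=False)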

Python 3: XML Tag Value not being written to csv file

My python 3 script takes an xml file and creates a csv file.
Small excerpt of xml file:
<?xml version="1.0" encoding="UTF-8" ?>
<metadata>
<dc>
<title>Golden days for boys and girls, 1895-03-16, v. XVI #17</title>
<subject>Children's literature--Children's periodicals</subject>
<description>Archives & Special Collections at the Thomas J. Dodd Research Center, University of Connecticut Libraries</description>
<publisher>James Elverson, 1880-</publisher>
<date>1895-06-15</date>
<type>Text | periodicals</type>
<format>image/jp2</format>
<handle>http://hdl.handle.net/11134/20002:860074494</handle>
<accessionNumber/>
<barcode/>
<identifier>20002:860074494 | local: 868010272 | local: 997186613502432 | local: 39153019382870 | hdl:  | http://hdl.handle.net/11134/20002:860074494</identifier>
<rights>These Materials are provided for educational and research purposes only. The University of Connecticut Libraries hold the copyright except where noted. Permission must be obtained in writing from the University of Connecticut Libraries and/or theowner(s) of the copyright to publish reproductions or quotations beyond "fair use." | The collection is open and available for research.</rights>
<creator/>
<relation/>
<coverage/>
<language/>
</dc>
</metadata>
Python3 code:
import csv
import xml.etree.ElementTree as ET

tree = ET.ElementTree(file='ctda_set1_uniqueTags.xml')
doc = ET.parse("ctda_set1_uniqueTags.xml")
root = tree.getroot()

oaidc_data = open('ctda_set1_uniqueTags.csv', 'w', encoding='utf-8')
titles = 'dc/title'
subjects = 'dc/subject'
csvwriter = csv.writer(oaidc_data)
oaidc_head = ['Title', 'Subject', 'Description', 'Publisher', 'Date', 'Type', 'Format', 'Handle', 'Accession Number', 'Barcode', 'Identifiers', 'Rights', 'Creator', 'Relation', 'Coverage', 'Language']
count = 0
for member in root.findall('dc'):
    if count == 0:
        csvwriter.writerow(oaidc_head)
        count = count + 1
    dcdata = []
    titles = member.find('title').text
    dcdata.append(titles)
    subjects = member.find('subject').text
    dcdata.append(subjects)
    descriptions = member.find('description').text
    dcdata.append(descriptions)
    publishers = member.find('publisher').text
    dcdata.append(publishers)
    dates = member.find('date').text
    dcdata.append(dates)
    types = member.find('type').text
    dcdata.append(types)
    formats = member.find('format').text
    dcdata.append(formats)
    handle = member.find('handle').text
    dcdata.append(handle)
    accessionNo = member.find('accessionNumber').text
    dcdata.append(accessionNo)
    barcodes = member.find('barcode').text
    dcdata.append(barcodes)
    identifiers = member.find('identifier').text
    dcdata.append(identifiers)
    rt = member.find('rights').text
    print(member.find('rights').text)
    dcdata.append('rt')
    ct = member.find('creator').text
    dcdata.append('ct')
    rt = member.find('relation').text
    dcdata.append('rt')
    ce = member.find('coverage').text
    dcdata.append('ce')
    lang = member.find('language').text
    dcdata.append('lang')
    csvwriter.writerow(dcdata)
oaidc_data.close()
Everything works as expected except for rt, ct, ce, and lang. What happens is that in the csv all the data is written with the comma delimiter, but for rt the value is always the literal rt, for ce it is always ce, for lang it is lang, and so on.
Here's a snippet of the output:
Title,Subject,Description,Publisher,Date,Type,Format,Handle,Accession Number,Barcode,Identifiers,Rights,Creator,Relation,Coverage,Language
"Golden days for boys and girls, 1895-03-16, v. XVI #17",Children's literature--Children's periodicals,"Archives & Special Collections at the Thomas J. Dodd Research Center, University of Connecticut Libraries","James Elverson, 1880-",1895-06-15,Text | periodicals,image/jp2,hdl.handle.net/11134/20002:860074494,,,20002:860074494 | local: 868010272 | local: 997186613502432 | local: 39153019382870,**rt,ct,rt,ce,lang**
Some of the rights statements get very long - perhaps that's the issue. That's why I added the print(member.find('rights').text) to see the output. The text prints just fine; it just isn't written to the csv. What I'd like is to have the value or text written for these xml tags. Any help would be appreciated.
Thanks.
Jennifer
In the line dcdata.append('rt') there is no need for the quotes: as written, you append the literal string 'rt' rather than the value of the variable rt. Try dcdata.append(rt). Similarly, remove the unnecessary quotes in the ct, ce, and lang lines.
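For clarity, here is the tail of the loop with the quotes removed, using the same variables as in the question:

    rt = member.find('rights').text
    dcdata.append(rt)
    ct = member.find('creator').text
    dcdata.append(ct)
    rt = member.find('relation').text  # reuses the name rt, as in the original
    dcdata.append(rt)
    ce = member.find('coverage').text
    dcdata.append(ce)
    lang = member.find('language').text
    dcdata.append(lang)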
