How to use values from a pandas DataFrame as filter params in a regex? - python

I would like to use the values from a pandas DataFrame as filter params in a SPARQL query.
I create the DataFrame by reading the data from an Excel file:
xls = pd.ExcelFile('excel/dataset_nuovo.xlsx')
df1 = pd.read_excel(xls, 'Sheet1')
print(df1)
Here is the resulting DataFrame:
oggetto descrizione lenght label tipologia
0 #iccd4580759# Figure: putto. Oggetti: ghirlanda di fiori 6 Bad OpereArteVisiva
1 #iccd3636719# Decorazione plastica. 2 Bad OpereArteVisiva
2 #iccd3641475# Scultura.. Figure: angelo 3 Bad OpereArteVisiva
3 #iccd8282504# Custodia di reliquiario in legno intagliato e ... 8 Good OpereArteVisiva
4 #iccd3019633# Portale. 1 Bad OpereArteVisiva
... ... ... ... ... ...
59995 #iccd2274873# Ciotola media a larga tesa. Decorazione in cob... 35 Good OpereArteVisiva
59996 #iccd11189887# Il medaglione bronzeo, sormontato da un'aquila... 85 Good OpereArteVisiva
59997 #iccd4545324# Tessuto di fondo rosaceo. Disegno a fiori e fo... 49 Good OpereArteVisiva
59998 #iccd2934870# Sculture a tutto tondo in legno dipinto di bia... 28 Good OpereArteVisiva
59999 #iccd2685205# Calice con piede a base circolare e nodo ovoid... 14 Bad OpereArteVisiva
Then I need to use the values from the oggetto column as filters to retrieve, for each record, the related subject from a SPARQL endpoint.
By using this SPARQL query:
SELECT ?object ?description (group_concat(?subject; separator="|") as ?subjects)
WHERE {
    ?object a crm:E22_Man-Made_Object;
        crm:P3_has_note ?description;
        crm:P129_is_about ?concept;
        crm:P2_has_type ?type.
    ?concept a crm:E28_Conceptual_Object;
        rdfs:label ?subject.
    filter( regex(str(?object), "#iccd4580759#") )
}
I'm able to filter a single record:
object.type object.value ... subjects.type subjects.value
0 uri http://dati.culturaitalia.it/resource/oai-oaic... ... literal Putto con ghirlanda di fiori|Putto con ghirlan..
Since the dataset has 60k records, I would like to automate the process by looping through the DataFrame and using each value as the filter, to obtain a new df with an added subject column:
oggetto descrizione subject lenght label tipologia
0 #iccd4580759# Figure: putto. Oggetti: ghirlanda di fiori Putto con ghirlanda di fiori|Putto con ghirlan.. 6 Bad OpereArteVisiva
Here is the entire script I wrote:
import xlrd
import pandas as pd
from pandas import json_normalize
from SPARQLWrapper import SPARQLWrapper, JSON

xls = pd.ExcelFile('excel/dataset_nuovo.xlsx')
df1 = pd.read_excel(xls, 'Sheet1')
print(df1)

def query_ci(sparql_query, sparql_service_url):
    sparql = SPARQLWrapper(sparql_service_url)
    sparql.setQuery(sparql_query)
    sparql.setReturnFormat(JSON)
    # ask for the result
    result = sparql.query().convert()
    return json_normalize(result["results"]["bindings"])

sparql_query = """SELECT ?object ?description (group_concat(?subject; separator="|") as ?subjects)
WHERE {
    ?object a crm:E22_Man-Made_Object;
        crm:P3_has_note ?description;
        crm:P129_is_about ?concept;
        crm:P2_has_type ?type.
    ?concept a crm:E28_Conceptual_Object;
        rdfs:label ?subject.
    filter( regex(str(?object), "#iccd4580759#") )
}
"""
sparql_service_url = "http://dati.culturaitalia.it/sparql"
result_table = query_ci(sparql_query, sparql_service_url)
print(result_table)
result_table.to_excel("output.xlsx")
Is it possible to do that?
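A minimal sketch of one way to do it, building on the script above. The QUERY_TEMPLATE name, the fetch_subjects helper, and the assumption that each identifier returns at most one binding are mine, not from the original; also note that 60k sequential HTTP requests will be slow, so batching identifiers into one query with a VALUES clause would cut round trips considerably:
from SPARQLWrapper import SPARQLWrapper, JSON

# same query as above, with the identifier left as a placeholder
QUERY_TEMPLATE = """SELECT ?object ?description (group_concat(?subject; separator="|") as ?subjects)
WHERE {
    ?object a crm:E22_Man-Made_Object;
        crm:P3_has_note ?description;
        crm:P129_is_about ?concept;
        crm:P2_has_type ?type.
    ?concept a crm:E28_Conceptual_Object;
        rdfs:label ?subject.
    filter( regex(str(?object), "%s") )
}
"""

def fetch_subjects(oggetto, sparql_service_url="http://dati.culturaitalia.it/sparql"):
    # hypothetical helper: substitute the current identifier into the filter
    sparql = SPARQLWrapper(sparql_service_url)
    sparql.setQuery(QUERY_TEMPLATE % oggetto)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    # return the concatenated subjects, or None when nothing matched
    return bindings[0]["subjects"]["value"] if bindings else None

df1["subject"] = df1["oggetto"].apply(fetch_subjects)
df1.to_excel("output.xlsx")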

Related

Create datasets based on authors from another dataset

I have a dataset in the following format
dt =
    text    author    title
    -----------------------------
    text0   author0   title0
    text1   author1   title1
    ...     ...       ...
and I would like to create separate datasets, each containing only the texts of one author. For example, the dataset dt1 contains the texts of author1, dt2 contains the texts of author2, etc.
I would be grateful if you could help me with this using Python.
Update:
dt =
text author title
-------------------------------------------------------------------------
0 I would like to go to the beach George Beach
1 I was in park few days ago Nick Park
2 I would like to go in uni Peter University
3 I have be in the airport at 8 Maria Airport
Please try this; it is what I understand you require.
import pandas as pd

data = {
    'text':   ['text0', 'text1', 'text2'],
    'author': ['author0', 'author1', 'author1'],
    'title':  ['Comunicación', 'Administración', 'Ventas']
}
df = pd.DataFrame(data)
df1 = df[df["author"] == "author0"]
df2 = df[df["author"] == "author1"]
print(df1)
print(df2)
Update:
import pandas as pd

data = {
    'text':   ['text0', 'text1', 'text2'],
    'author': ['author0', 'author1', 'author1'],
    'title':  ['Comunicación', 'Administración', 'Ventas']
}
df = pd.DataFrame(data)
df1 = df[df["author"] == "author0"]
df2 = df[df["author"] == "author1"]

list_author = df['author'].unique().tolist()
for x in list_author:
    a = df[df["author"] == x]
    print(a)
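If the goal is to keep each author's rows as a separate object rather than just print them, a dictionary keyed by author avoids creating numbered variables by hand (a sketch; the datasets name is mine):
# one DataFrame per author, keyed by the author's name
datasets = {author: group.reset_index(drop=True)
            for author, group in df.groupby('author')}

print(datasets['author1'])  # only the texts of author1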

Check whether data columns are the same or not with Pandas

I want to check whether the columns 'tokenizing' and 'lemmatization' are the same or not, as in the table below, but it gives me an error.
tokenizing                                      lemmatization                          check
[pergi, untuk, melakukan, penanganan, banjir]   [pergi, untuk, laku, tangan, banjir]   False
[baca, buku, itu, asik]                         [baca, buku, itu, asik]                True
from spacy.lang.id import Indonesian
import pandas as pd

nlp = Indonesian()
nlp.add_pipe('lemmatizer')
nlp.initialize()

data = [
    'pergi untuk melakukan penanganan banjir',
    'baca buku itu asik'
]
df = pd.DataFrame({'text': data})

# Tokenization
def tokenizer(words):
    return [token for token in nlp(words)]

# Lemmatization
def lemmatizer(token):
    return [lem.lemma_ for lem in token]

df['tokenizing'] = df['text'].apply(tokenizer)
df['lemmatization'] = df['tokenizing'].apply(lemmatizer)

# Check similarity
df.to_clipboard(sep='\s\s+')
df['check'] = df['tokenizing'].eq(df['lemmatization'])
df
How to compare?
Result before the error, from df.to_clipboard():
text tokenizing lemmatization
0 pergi untuk melakukan penanganan banjir [pergi, untuk, melakukan, penanganan, banjir] [pergi, untuk, laku, tangan, banjir]
1 baca buku itu asik [baca, buku, itu, asik] [baca, buku, itu, asik]
Update:
The error is fixed; it was caused by a typo. After fixing the typo, the result is all False. What I want is like the table above.
Based on your code, you forgot the i in df['lemmatizaton'].
So change
df['lemmatizaton'] = df['tokenizing'].apply(lemmatizer)
to
df['lemmatization'] = df['tokenizing'].apply(lemmatizer)
Then it should work.
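As for the all-False result in the update: the tokenizing column holds spaCy Token objects while the lemmatization column holds plain strings, so an element-wise equality check compares different types and never matches. A sketch of a comparison on the token texts instead (my addition, not part of the answer above):
# compare the token texts, not the Token objects, to the lemma strings
df['check'] = df.apply(
    lambda row: [t.text for t in row['tokenizing']] == row['lemmatization'],
    axis=1
)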

Pandas - Filter DataFrame with Logic

I have a Pandas DataFrame like this,
Employee ID ActionCode ActionReason ConcatenatedOutput
1 TER DEA TER_DEA
1 RET ABC RET_ABC
1 RET DEF RET_DEF
2 TER DEA TER_DEA
2 ABC ABC ABC_ABC
2 DEF DEF DEF_DEF
3 RET FGH RET_FGH
3 RET EFG RET_EFG
4 PLA ABC PLA_ABC
4 TER DEA TER_DEA
And I want to filter it with the logic below and change it to something like this,
Employee ID ConcatenatedOutput Context
1 RET_ABC RET or TER Found
2 TER_DEA RET or TER Found
3 RET_FGH RET or TER Found
4 PLA_ABC RET or TER Not Found
Logic:
1) If the first record of an employee is TER_DEA, we go into that employee's records and see whether the employee has any others. If the employee has another RET record, we pick the first available RET record; otherwise we stick with the TER_DEA record.
2) If the first record of an employee is anything other than TER_DEA, we stick with that record.
3) Context is conditional: if the record has a RET or TER, we say "RET or TER Found"; otherwise it is not found.
Note: the final output will have only one record per employee ID.
The data is below:
employee_id = [1,1,1,2,2,2,3,3,4,4]
action_code = ['TER','RET','RET','TER','ABC','DEF','RET','RET','PLA','TER']
action_reason = ['DEA','ABC','DEF','DEA','ABC','DEF','FGH','EFG','ABC','DEA']
concatenated_output = ['TER_DEA', 'RET_ABC', 'RET_DEF', 'TER_DEA', 'ABC_ABC', 'DEF_DEF', 'RET_FGH', 'RET_EFG', 'PLA_ABC', 'TER_DEA']

df = pd.DataFrame({
    'Employee ID': employee_id,
    'ActionCode': action_code,
    'ActionReason': action_reason,
    'ConcatenatedOutput': concatenated_output,
})
I'd recommend you go with a bool in that field instead. To get the test data I used this:
import pandas as pd

employee_id = [1,1,1,2,2,2,3,3,4,4]
action_code = ['TER','RET','RET','TER','ABC','DEF','RET','RET','PLA','TER']
action_reason = ['DEA','ABC','DEF','DEA','ABC','DEF','FGH','EFG','ABC','DEA']
concatenated_output = ['TER_DEA', 'RET_ABC', 'RET_DEF', 'TER_DEA', 'ABC_ABC', 'DEF_DEF', 'RET_FGH', 'RET_EFG', 'PLA_ABC', 'TER_DEA']

df = pd.DataFrame({
    'Employee ID': employee_id,
    'ActionCode': action_code,
    'ActionReason': action_reason,
    'ConcatenatedOutput': concatenated_output,
})
You can then do a group by on the employee ID and apply a function to perform your specific program logic in there.
def myfunc(data):
    if data.iloc[0]['ConcatenatedOutput'] == 'TER_DEA':
        if len(data.loc[data['ActionCode'] == 'RET']) > 0:
            located_record = data.loc[data['ActionCode'] == 'RET'].iloc[[0]]
        else:
            located_record = data.iloc[[0]]
    else:
        located_record = data.iloc[[0]]
    located_record['RET or TER Context'] = data['ActionCode'].str.contains('|'.join(['RET', 'TER']))
    return located_record

df.groupby(['Employee ID']).apply(myfunc)
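If the literal strings from the question are wanted rather than the bool, a small follow-up step can map them on afterwards (a sketch on top of the answer above; the label values are taken from the question):
result = df.groupby(['Employee ID']).apply(myfunc).reset_index(drop=True)
# turn the bool into the labels asked for in the question
result['Context'] = result['RET or TER Context'].map(
    {True: 'RET or TER Found', False: 'RET or TER Not Found'}
)
print(result[['Employee ID', 'ConcatenatedOutput', 'Context']])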

pandas dataframe apply function over column creating multiple columns

I have the pandas df below, with a few columns, one of which is ip_address.
df.head()
my_id someother_id created_at ip_address state
308074 309115 2859690 2014-09-26 22:55:20 67.000.000.000 rejected
308757 309798 2859690 2014-09-30 04:16:56 173.000.000.000 approved
309576 310619 2859690 2014-10-02 20:13:12 173.000.000.000 approved
310347 311390 2859690 2014-10-05 04:16:01 173.000.000.000 approved
311784 312827 2859690 2014-10-10 06:38:39 69.000.000.000 approved
For each ip_address, I'm trying to return the description, city, and country.
I wrote the function below and tried to apply it:
from ipwhois import IPWhois

def returnIP(ip):
    obj = IPWhois(str(ip))
    result = obj.lookup_whois()
    description = result["nets"][len(result["nets"]) - 1]["description"]
    city = result["nets"][len(result["nets"]) - 1]["city"]
    country = result["nets"][len(result["nets"]) - 1]["country"]
    return [description, city, country]

# ---
suspect['ipwhois'] = suspect['ip_address'].apply(returnIP)
My problem is that this returns a list; I want three separate columns.
Any help is greatly appreciated. I'm new to pandas/Python, so if there's a better way to write the function and use pandas, that would be very helpful.
from ipwhois import IPWhois

def returnIP(ip):
    obj = IPWhois(str(ip))
    result = obj.lookup_whois()
    description = result["nets"][len(result["nets"]) - 1]["description"]
    city = result["nets"][len(result["nets"]) - 1]["city"]
    country = result["nets"][len(result["nets"]) - 1]["country"]
    return (description, city, country)

# zip(*...) transposes the Series of tuples into three sequences, one per column
suspect['description'], suspect['city'], suspect['country'] = \
    zip(*suspect['ip_address'].apply(returnIP))
I was able to solve it with another stackoverflow solution:
cols = ['description', 'city', 'country']
for n, col in enumerate(cols):
    suspect[col] = suspect['ipwhois'].apply(lambda ipwhois: ipwhois[n])
If there's a more elegant way to solve this, please share!
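One more compact variant (a sketch; it assumes every lookup returned a three-element list and that the ipwhois column from above already exists):
# expand the list column into three named columns in one step
suspect[['description', 'city', 'country']] = pd.DataFrame(
    suspect['ipwhois'].tolist(), index=suspect.index
)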

Dealing with special characters in a pandas DataFrame's column name

I am importing an Excel worksheet that has the following column name:
N° Pedido
1234
6424
4563
The column name has a special character (°). Because of that, I can't merge this with another DataFrame or rename the column. I don't get any error message; the name just stays the same. What should I do?
This is the code I am using and the resulting DataFrames:
import pandas as pd
import numpy as np

# importing the spreadsheets
CRM = pd.ExcelFile(r'C:\Users\Michel\Desktop\Relatorio de Vendas\relatorio_vendas_CRM.xlsx', encoding='utf-8')
protheus = pd.ExcelFile(r'C:\Users\Michel\Desktop\Relatorio de Vendas\relatorio_vendas_protheus.xlsx', encoding='utf-8')

# turning them into DataFrames
df_crm = CRM.parse('190_pedido_export (33)')
df_protheus = protheus.parse('Relatorio de Pedido de Venda')

# converting protheus fields to float
def turn_to_float(x):
    return np.float(x)

df_protheus["TES"] = df_protheus["TES"].apply(turn_to_float)
df_protheus["Qtde"] = df_protheus["Qtde"].apply(turn_to_float)
df_protheus["Valor"] = df_protheus["Valor"].apply(turn_to_float)

# removing non-sale TES from protheus
# removing values with the wrong code
df_protheus_1 = df_protheus[df_protheus.TES != 513.0]
df_protheus_2 = df_protheus_1[df_protheus_1.TES != 576.0]

df_crm.columns = df_crm.columns.str.replace('N° Pedido', 'teste')
df_crm.columns
Orçamento Origem N° Pedido Nº Pedido ERP Estabelecimento Tipo de
Pedido Classificação(Tipo) Aplicação Conta CNPJ/CPF Contato ...
Aprovação Parcial Antecipa Entrega Desconto da Tabela de Preço
Desconto do Cliente Desconto Informado Observações Observações NF Vl
Total Bruto Vl Total Completo
0 20619.0 23125 NaN Optitex 1 - Venda NaN Industrialização/Revenda
XAVIER E ARAUJO LTDA ME 7970626000170 NaN ... N N 0 0 0
Note that I tried other code for those last two lines, with the same result:
# renaming the column for the merge
#df_crm['proc'] = df_crm['N\xc2\xb0 Pedido']
#df_crm['N Pedido'] = df_crm['N° Pedido']
#df_crm.drop('N° Pedido',inplace=True,axis=1)
#df_crm
#df_crm['N Pedido'] = df_crm['N° Pedido']
#df.drop('N° Pedido',inplace=True,axis=1)
#df_crm
#df_crm_1 = df_crm.rename(columns={"N°Pedido": "teste"})
#df_crm_1
Thanks for posting the link to the Google Sheet. I downloaded it and loaded it via pandas:
df = pd.read_excel(r'~\relatorio_vendas_CRM.xlsx', encoding = 'utf-8')
df.columns = df.columns.str.replace('°', '')
df.columns = df.columns.str.replace('º', '')
Note that the two replace statements are replacing different characters, although they look very similar.
Help from: Why do I get a SyntaxError for a Unicode escape in my file path?
I was able to copy the values into another column. You could try that:
df['N Pedido'] = df['N° Pedido']
df.drop('N° Pedido',inplace=True,axis=1)
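A more general cleanup, if several headers carry stray symbols (a sketch: it strips every non-ASCII character, which would also remove accents from names like 'Aplicação', so only use it when that is acceptable):
import unicodedata

# drop any character that does not survive an ASCII round-trip
df.columns = [
    unicodedata.normalize('NFKD', col).encode('ascii', 'ignore').decode('ascii').strip()
    for col in df.columns
]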
