Dataframe group by and multi aggregate - python

Given a dataframe from a pd.read_sql_query structured like:
tabla | nombre_razon | periodos
Bancos | ALIAGA ORTIZ LILIA ROXANA | [201801,201902]
Bancos | CIELO PLAST EIRL | [201702]
Bancos | COCHACHIN AGUIRRE ELIAS PABLO | [201801,201902,202001]
Bancos | COPLASTICA SAC | [202203, 202102, 202110, 202105, 202206]
Bancos | ECOPET PERU SAC | [201801,201902]
Ventas | ALIAGA ORTIZ LILIA ROXANA | [202201, 202202, 202109, 202107]
Ventas | GRUPO ELIAPAC SAC | [202207, 202209, 202205, 202203, 202109]
Ventas | COPLASTICA SAC | [201801,201902]
Ventas | ECOPET PERU SAC | [201801,201902]
Ventas | KENTHIVAS SAC | [202208, 202201, 202112, 202202]
Compras | ALIAGA ORTIZ LILIA ROXANA | [201801,201902]
Compras | CIELO PLAST EIRL | [202204, 202201, 202202, 202209]
Compras | COCHACHIN AGUIRRE ELIAS PABLO | [201801,201902]
Compras | ECOPET PERU SAC | [202201, 202107, 202108, 202109]
Compras | KENTHIVAS SAC | [201801,201902]
And I would like to transform it to the following List:
[['Bancos','Ventas','Compras'],[['ALIAGA ORTIZ LILIA ROXANA','CIELO PLAST EIRL','COCHACHIN AGUIRRE ELIAS PABLO','COPLASTICA SAC','ECOPET PERU SAC'],['ALIAGA ORTIZ LILIA ROXANA','GRUPO ELIAPAC SAC','COPLASTICA SAC','ECOPET PERU SAC','KENTHIVAS SAC'],['ALIAGA ORTIZ LILIA ROXANA','CIELO PLAST EIRL','COCHACHIN AGUIRRE ELIAS PABLO','ECOPET PERU SAC','KENTHIVAS SAC']],[[['201801','201902'],['201702'],['201801','201902','202001'],['202203','202102','202110','202105', '202206'],['201801','201902']],[['202201','202202','202109','202107'],['202207','202209','202205','202203','202109'],['201801','201902'],['201801','201902'],['202208','202201','202112','202202']],[['201801','201902'],['202204','202201', '202202','202209'],['201801','201902'],['202201','202107','202108','202109'],['201801','201902']]]]
I've tried ways like this:
dataFrame.groupby(['tabla', 'nombre_razon','periodos'])
or
comboGeneral2['periodo_tributario']=comboGeneral2['periodo_tributario'].apply(str)
comboGeneral1=comboGeneral3.groupby('tabla')['nombre_razon','periodo_tributario'].agg(lambda x: list(x)).reset_index()
without success

You can aggregate lists per group and then transpose, converting to a numpy array and then to lists; after the transpose, each original column becomes one row of the nested output:
import ast
# if necessary (when the period lists come back from SQL as strings):
#dataFrame['periodos'] = dataFrame['periodos'].apply(ast.literal_eval)
L = (dataFrame.groupby('tabla', sort=False)[['nombre_razon','periodos']]
.agg(list)
.reset_index()
.T
.to_numpy()
.tolist())
print (L)
[['Bancos', 'Ventas', 'Compras'],
[['ALIAGA ORTIZ LILIA ROXANA', 'CIELO PLAST EIRL',
'COCHACHIN AGUIRRE ELIAS PABLO', 'COPLASTICA SAC', 'ECOPET PERU SAC'],
['ALIAGA ORTIZ LILIA ROXANA', 'GRUPO ELIAPAC SAC', 'COPLASTICA SAC',
'ECOPET PERU SAC', 'KENTHIVAS SAC'],
['ALIAGA ORTIZ LILIA ROXANA', 'CIELO PLAST EIRL',
'COCHACHIN AGUIRRE ELIAS PABLO', 'ECOPET PERU SAC', 'KENTHIVAS SAC']],
[[[201801, 201902], [201702], [201801, 201902, 202001],
[202203, 202102, 202110, 202105, 202206], [201801, 201902]],
[[202201, 202202, 202109, 202107], [202207, 202209, 202205, 202203, 202109],
[201801, 201902], [201801, 201902], [202208, 202201, 202112, 202202]],
[[201801, 201902], [202204, 202201, 202202, 202209], [201801, 201902],
[202201, 202107, 202108, 202109], [201801, 201902]]]]
If you need the periods as strings:
dataFrame['periodos'] = [[str(y) for y in x] for x in dataFrame['periodos']]
L = (dataFrame.groupby('tabla', sort=False)[['nombre_razon','periodos']]
.agg(list)
.reset_index()
.T
.to_numpy()
.tolist())
To verify your expected output:
# [['Bancos','Ventas','Compras'],
# [['ALIAGA ORTIZ LILIA ROXANA','CIELO PLAST EIRL',
# 'COCHACHIN AGUIRRE ELIAS PABLO','COPLASTICA SAC','ECOPET PERU SAC'],
# ['ALIAGA ORTIZ LILIA ROXANA','GRUPO ELIAPAC SAC',
# 'COPLASTICA SAC','ECOPET PERU SAC','KENTHIVAS SAC'],
# ['ALIAGA ORTIZ LILIA ROXANA','CIELO PLAST EIRL',
# 'COCHACHIN AGUIRRE ELIAS PABLO','ECOPET PERU SAC','KENTHIVAS SAC']],
# [[['201801','201902'],['201702'],['201801','201902','202001'],
# ['202203','202102','202110','202105', '202206'],['201801','201902']],
# [['202201','202202','202109','202107'],
# ['202207','202209','202205','202203','202109'],['201801','201902'],
# ['201801','201902'],['202208','202201','202112','202202']],
# [['201801','201902'],['202204','202201', '202202','202209'],
# ['201801','201902'],['202201','202107','202108','202109'],
# ['201801','201902']]]]
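For reference, a minimal hand-built frame (only a subset of the rows above, since the original comes from pd.read_sql_query) that the snippets can run against:
import pandas as pd

dataFrame = pd.DataFrame({
    'tabla': ['Bancos', 'Bancos', 'Ventas'],
    'nombre_razon': ['ALIAGA ORTIZ LILIA ROXANA', 'CIELO PLAST EIRL', 'KENTHIVAS SAC'],
    'periodos': [[201801, 201902], [201702], [202208, 202201, 202112, 202202]],
})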


How to iterate through column values of a PySpark dataframe

I have a PySpark dataframe. I want to check the address column of each row: if it contains the substring "india", I need to add another column set to true, else false. I also want to check whether the substring is present in the column value string and print "yes" or "no" accordingly; this has to run for every row in the dataframe. Something like:
if "india" or "karnataka" is in sparkDF["address"]:
print("yes")
else:
print("no")
I'm getting the wrong results, as it checks each character instead of the substring. How can I achieve this?
You can utilise contains or like for this.
Data Preparation
s = StringIO("""
user,address
rishi,XYZ Bangalore Karnataka
kirthi,ABC Pune India
tushar,ASD Orissa India
"""
)
df = pd.read_csv(s,delimiter=',')
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+------+-----------------------+
|user |address |
+------+-----------------------+
|rishi |XYZ Bangalore Karnataka|
|kirthi|ABC Pune India |
|tushar|ASD Orissa India |
+------+-----------------------+
Contains
sparkDF = sparkDF.withColumn('result',F.lower(F.col('address')).contains("india"))
sparkDF.show(truncate=False)
+------+-----------------------+------+
|user |address |result|
+------+-----------------------+------+
|rishi |XYZ Bangalore Karnataka|false |
|kirthi|ABC Pune India |true |
|tushar|ASD Orissa India |true |
+------+-----------------------+------+
Like - Multiple Search Patterns
sparkDF = sparkDF.withColumn(
    'result',
    F.lower(F.col('address')).like("%india%")
    | F.lower(F.col('address')).like("%karnataka%")
)
sparkDF.show(truncate=False)
+------+-----------------------+------+
|user |address |result|
+------+-----------------------+------+
|rishi |XYZ Bangalore Karnataka|true |
|kirthi|ABC Pune India |true |
|tushar|ASD Orissa India |true |
+------+-----------------------+------+
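As a side note (not part of the original answer), Column.rlike can express both search patterns in a single regular expression:
sparkDF = sparkDF.withColumn(
    'result',
    # regex alternation replaces the two like clauses above
    F.lower(F.col('address')).rlike("india|karnataka")
)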

Changing values in a column based on a match

I have a Pandas DataFrame which contains names of Brazilian universities, but sometimes I have these names in a short form and sometimes in a long form (for example, the Universidade Federal do Rio de Janeiro is sometimes identified as UFRJ).
The DataFrame look like this:
| college |
|----------------------------------------|
| Universidade Federal do Rio de Janeiro |
| UFRJ |
| Universidade de Sao Paulo |
| USP |
| Catholic University of Minas Gerais |
And I have another one which has, in separate columns, the short name and the long name of SOME (not all) of those universities. It looks like this:
| long_name | short_name |
|----------------------------------------|------------|
| Universidade Federal do Rio de Janeiro | UFRJ |
| Universidade de Sao Paulo | USP |
What I want is: substitute all short names by long names, so in this context, the first dataframe would have the college column changed to this:
| college |
|----------------------------------------|
| Universidade Federal do Rio de Janeiro |
| Universidade Federal do Rio de Janeiro |
| Universidade de Sao Paulo |
| Universidade de Sao Paulo |
| Catholic University of Minas Gerais | <--- note: this one does not have a match, so it stays the same
Is there a way to do that using pandas and numpy (or any other library)?
Use Series.map with a mapping built from the second DataFrame; values with no match become missing values, so Series.fillna is added to keep the original:
df1['college'] = (df1['college'].map(df2.set_index('short_name')['long_name'])
.fillna(df1['college']))
print (df1)
college
0 Universidade Federal do Rio de Janeiro
1 Universidade Federal do Rio de Janeiro
2 Universidade de Sao Paulo
3 Universidade de Sao Paulo
4 Catholic University of Minas Gerais
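For reference, a minimal hand-built setup for the snippet above (df1 and df2 named as the answer assumes):
import pandas as pd

df1 = pd.DataFrame({'college': ['Universidade Federal do Rio de Janeiro', 'UFRJ',
                                'Universidade de Sao Paulo', 'USP',
                                'Catholic University of Minas Gerais']})
df2 = pd.DataFrame({'long_name': ['Universidade Federal do Rio de Janeiro',
                                  'Universidade de Sao Paulo'],
                    'short_name': ['UFRJ', 'USP']})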

Remove text between 2 tags in Python

I have scraped data from Wikipedia and created a dataframe. df[0] contains:
{{Infobox_President |name = Mohammed Anwar Al Sadat < br / > محمد أنورالسادات |nationality = Al Menofeia, Mesir |image = Anwar Sadat cropped.jpg |order = Presiden Mesir ke-3 |term_start = 20 Oktober 1970 |term_end = 6 Oktober 1981 |predecessor = Gamal Abdel Nasser |successor = Hosni Mubarak |birth_date =|birth_place = Mit Abu Al-Kum, Al-Minufiyah, Mesir |death_place = Kairo, Mesir |death_date =|spouse = Jehan Sadat |party = Persatuan Arab Sosialis < br / > (hingga 1977) < br / > Partai Nasional Demokratik (Mesir)|Partai Nasional Demokratik < br / > (dari 1977) |vicepresident =|constituency =}} Jenderal Besar Mohammed Anwar Al Sadat () adalah seorang tentara dan politikus Mesir. Ia menjabat sebagai Presiden Mesir|Presiden ketiga Mesir pada periode 15 Oktober 1970 hingga terbunuhnya pada 6 Oktober 1981. Oleh dunia Barat ia dianggap sebagai orang yang sangat berpengaruh di Mesir dan di Timur Tengah dalam sejarah modern.
I want to remove:
{{Infobox_President |name = Mohammed Anwar Al Sadat < br / > محمد أنورالسادات |nationality = Al Menofeia, Mesir |image = Anwar Sadat cropped.jpg |order = Presiden Mesir ke-3 |term_start = 20 Oktober 1970 |term_end = 6 Oktober 1981 |predecessor = Gamal Abdel Nasser |successor = Hosni Mubarak |birth_date =|birth_place = Mit Abu Al-Kum, Al-Minufiyah, Mesir |death_place = Kairo, Mesir |death_date =|spouse = Jehan Sadat |party = Persatuan Arab Sosialis < br / > (hingga 1977) < br / > Partai Nasional Demokratik (Mesir)|Partai Nasional Demokratik < br / > (dari 1977) |vicepresident =|constituency =}}
How can I do this? I have tried
df['Body'] = df['Body'].replace('< ref >.< \/ref > | {{.}} | {{.*=}}','', regex = True)
df['Body'] = df['Body'].str.replace('\'\'\' | \n | [ | ] | \'\'','',regex=True)
but it doesn't work.
This shall do the trick:
import re
re.sub('^{{.*}}','', text)
You can apply this function to the column of your dataframe and it will transform the column.
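For instance, a minimal sketch of applying it to the question's Body column (assuming each cell starts with the {{...}} template, as in the sample):
import re

df['Body'] = df['Body'].apply(lambda t: re.sub(r'^\{\{.*\}\}', '', t))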
You were very close; it did not work because of the extra spacing in your regex pattern: | {{.*=}} treats the space before the curly braces as part of the pattern. As suggested in the other answer, you can use the special operator ^, which anchors the match at the start of the string.
Otherwise, to apply a regex replace that matches that exact pattern, remove the whitespace from your pattern:
text = '{{Infobox_President |name = Mohammed Anwar Al Sadat < br / > محمد أنورالسادات |nationality = Al Menofeia, Mesir |image = Anwar Sadat cropped.jpg |order = Presiden Mesir ke-3 |term_start = 20 Oktober 1970 |term_end = 6 Oktober 1981 |predecessor = Gamal Abdel Nasser |successor = Hosni Mubarak |birth_date =|birth_place = Mit Abu Al-Kum, Al-Minufiyah, Mesir |death_place = Kairo, Mesir |death_date =|spouse = Jehan Sadat |party = Persatuan Arab Sosialis < br / > (hingga 1977) < br / > Partai Nasional Demokratik (Mesir)|Partai Nasional Demokratik < br / > (dari 1977) |vicepresident =|constituency =}} Jenderal Besar Mohammed Anwar Al Sadat () adalah seorang tentara dan politikus Mesir. Ia menjabat sebagai Presiden Mesir|Presiden ketiga Mesir pada periode 15 Oktober 1970 hingga terbunuhnya pada 6 Oktober 1981. Oleh dunia Barat ia dianggap sebagai orang yang sangat berpengaruh di Mesir dan di Timur Tengah dalam sejarah modern.'
df = pd.DataFrame({'text':[text]})
new_df = df.replace('< ref >.< \/ref >|{{.*}}','', regex = True)
new_df.text[0]
Output:
' Jenderal Besar Mohammed Anwar Al Sadat () adalah seorang tentara dan politikus Mesir. Ia menjabat sebagai Presiden Mesir|Presiden ketiga Mesir pada periode 15 Oktober 1970 hingga terbunuhnya pada 6 Oktober 1981. Oleh dunia Barat ia dianggap sebagai orang yang sangat berpengaruh di Mesir dan di Timur Tengah dalam sejarah modern.'

Pandas: How to map the values of a Dataframe to another Dataframe?

I am totally new to Python and just learning with some use cases I have.
I have 2 DataFrames: the main one, where I need to fill the values in the Country column, and a reference one, which has the values in a column named 'Countries' that need to be mapped into the main DataFrame by matching on the column named 'Data'.
(Please accept my apology if this question has already been answered)
Below is the Main DataFrame:
Name Data | Country
----------------------------- | ---------
Arjun Kumar Reddy las Vegas |
Divya london Khosla |
new delhi Pragati Kumari |
Will London Turner |
Joseph Mascurenus Bombay |
Jason New York Bourne |
New york Vice Roy |
Joseph Mascurenus new York |
Peter Parker California |
Bruce (istanbul) Wayne |
Below is the Referenced DataFrame:
Data | Countries
-------------- | ---------
las Vegas | US
london | UK
New Delhi | IN
London | UK
bombay | IN
New York | US
New york | US
new York | US
California | US
istanbul | TR
Moscow | RS
Cape Town | SA
And what I want in the result will look like below:
Name Data | Country
----------------------------- | ---------
Arjun Kumar Reddy las Vegas | US
Divya london Khosla | UK
new delhi Pragati Kumari | IN
Will London Turner | UK
Joseph Mascurenus Bombay | IN
Jason New York Bourne | US
New york Vice Roy | US
Joseph Mascurenus new York | US
Peter Parker California | US
Bruce (istanbul) Wayne | TR
Please note, Both the dataframes are not same in size.
I thought of using map or a fuzzywuzzy approach but couldn't really achieve the result.
Find the country key that matches in the reference dataframe and extract it.
import re

regex = '(' + ')|('.join(ref_df['Data']) + ')'
df['key'] = df['Name Data'].str.extract(regex, flags=re.I).bfill(axis=1)[0]
>>> df
Name Data key
0 Arjun Kumar Reddy las Vegas las Vegas
1 Bruce (istanbul) Wayne istanbul
2 Joseph Mascurenus new York new York
>>> ref_df
Data Country
0 las Vegas US
1 new York US
2 istanbul TR
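One caveat (an addition, not in the original answer): if any value in ref_df['Data'] contains regex metacharacters, escape them when building the pattern:
# re.escape guards against metacharacters in the reference values
regex = '(' + ')|('.join(map(re.escape, ref_df['Data'])) + ')'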
Merge both the dataframes on key extracted.
pd.merge(df, ref_df, left_on='key', right_on='Data')
Name Data key Data Country
0 Arjun Kumar Reddy las Vegas las Vegas las Vegas US
1 Bruce (istanbul) Wayne istanbul istanbul TR
2 Joseph Mascurenus new York new York new York US
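To end up with just the original columns plus Country, a small follow-up (not part of the original answer) drops the helper columns:
result = pd.merge(df, ref_df, left_on='key', right_on='Data').drop(columns=['key', 'Data'])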
It looks like everything is sorted, so you can merge on index:
mdf.merge(rdf, left_index=True, right_index=True)

PySpark find if pattern in one column is present in another column

I have two PySpark data frames. One contains a FullAddress field (say col1) and the other contains the name of the city/town/suburb in one of its columns (say col2). I want to compare col2 with col1 and return col2 if there is a match.
Additionally, the suburb name could be a list of suburb name.
Dataframe1 that contains full address
+--------+--------+----------------------------------------------------------+
|Postcode|District|City/ Town/ Suburb |
+--------+--------+----------------------------------------------------------+
|2000 |Sydney |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks |
|2001 |Sydney |Sydney |
|2113 |Sydney |North Ryde |
+--------+--------+----------------------------------------------------------+
+-----------------------------------------------------------+
|FullAddress |
+-----------------------------------------------------------+
|BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |
| HAY STREET HAYMARKET 2000, NSW, Australia |
| SMART STREET FAIRFIELD 2165, NSW, Australia |
|CLARENCE STREET SYDNEY 2000, NSW, Australia |
+-----------------------------------------------------------+
I would like to have something like this
+-----------------------------------------------------------++-----------+
|FullAddress |suburb |
+-----------------------------------------------------------++-----------+
|BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |NORTH RYDE |
| HAY STREET HAYMARKET 2000, NSW, Australia |HAYMARKET |
| SMART STREET FAIRFIELD 2165, NSW, Australia |NULL |
|CLARENCE STREET SYDNEY 2000, NSW, Australia |SYDNEY |
+-----------------------------------------------------------++-----------+
There are two DataFrames -
DataFrame 1: DataFrame containing the complete address.
DataFrame 2: DataFrame containing the base data - Postcode, District & City / Town / Suburb.
The aim of the problem is to extract the appropriate suburb for DataFrame 1 from DataFrame 2. Though the OP has not explicitly specified the key on which to join the two DataFrames, Postcode seems to be the only reasonable choice.
# Importing requisite functions
from pyspark.sql.functions import col,regexp_extract,split,udf
from pyspark.sql.types import StringType
Let's create DataFrame 1 as df. In this DataFrame we need to extract the Postcode. In Australia all post codes are 4 digits long, so we use regexp_extract() to extract the 4-digit number from the string column.
df = sqlContext.createDataFrame([('BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia ',),
('HAY STREET HAYMARKET 2000, NSW, Australia',),
('SMART STREET FAIRFIELD 2165, NSW, Australia',),
('CLARENCE STREET SYDNEY 2000, NSW, Australia',)],
('FullAddress',))
df = df.withColumn('Postcode', regexp_extract('FullAddress', "(\\d{4})" , 1 ))
df.show(truncate=False)
+---------------------------------------------+--------+
|FullAddress |Postcode|
+---------------------------------------------+--------+
|BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |2113 |
|HAY STREET HAYMARKET 2000, NSW, Australia |2000 |
|SMART STREET FAIRFIELD 2165, NSW, Australia |2165 |
|CLARENCE STREET SYDNEY 2000, NSW, Australia |2000 |
+---------------------------------------------+--------+
Now, that we have extracted the Postcode, we have created the key to join the two DataFrames. Let's create the DataFrame 2, from which we need to extract respective suburb.
df_City_Town_Suburb = sqlContext.createDataFrame([(2000,'Sydney','Dawes Point, Haymarket, Millers Point, Sydney, The Rocks'),
(2001,'Sydney','Sydney'),(2113,'Sydney','North Ryde')],
('Postcode','District','City_Town_Suburb'))
df_City_Town_Suburb.show(truncate=False)
+--------+--------+--------------------------------------------------------+
|Postcode|District|City_Town_Suburb |
+--------+--------+--------------------------------------------------------+
|2000 |Sydney |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
|2001 |Sydney |Sydney |
|2113 |Sydney |North Ryde |
+--------+--------+--------------------------------------------------------+
Joining the two DataFrames with left join -
df = df.join(df_City_Town_Suburb.select('Postcode','City_Town_Suburb'), ['Postcode'],how='left')
df.show(truncate=False)
+--------+---------------------------------------------+--------------------------------------------------------+
|Postcode|FullAddress |City_Town_Suburb |
+--------+---------------------------------------------+--------------------------------------------------------+
|2113 |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |North Ryde |
|2165 |SMART STREET FAIRFIELD 2165, NSW, Australia |null |
|2000 |HAY STREET HAYMARKET 2000, NSW, Australia |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
|2000 |CLARENCE STREET SYDNEY 2000, NSW, Australia |Dawes Point, Haymarket, Millers Point, Sydney, The Rocks|
+--------+---------------------------------------------+--------------------------------------------------------+
Splitting the column City_Town_Suburb into an array using split() function -
df = df.select('Postcode','FullAddress',split(col("City_Town_Suburb"), ",\s*").alias("City_Town_Suburb"))
df.show(truncate=False)
+--------+---------------------------------------------+----------------------------------------------------------+
|Postcode|FullAddress |City_Town_Suburb |
+--------+---------------------------------------------+----------------------------------------------------------+
|2113 |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |[North Ryde] |
|2165 |SMART STREET FAIRFIELD 2165, NSW, Australia |null |
|2000 |HAY STREET HAYMARKET 2000, NSW, Australia |[Dawes Point, Haymarket, Millers Point, Sydney, The Rocks]|
|2000 |CLARENCE STREET SYDNEY 2000, NSW, Australia |[Dawes Point, Haymarket, Millers Point, Sydney, The Rocks]|
+--------+---------------------------------------------+----------------------------------------------------------+
Finally, create a UDF to check each element of the array City_Town_Suburb against the column FullAddress. If a match is found, it is returned immediately; otherwise None is returned.
def suburb(FullAddress, City_Town_Suburb):
    # Guard against rows with no array, otherwise we would get an error
    if City_Town_Suburb is None:
        return None
    # Check each array element against 'FullAddress';
    # if a match is found, it is returned immediately.
    for sub in City_Town_Suburb:
        if sub.strip().upper() in FullAddress:
            return sub.upper()
    return None
suburb_udf = udf(suburb,StringType())
Applying this UDF -
df = df.withColumn('suburb', suburb_udf(col('FullAddress'),col('City_Town_Suburb'))).drop('City_Town_Suburb')
df.show(truncate=False)
+--------+---------------------------------------------+----------+
|Postcode|FullAddress |suburb |
+--------+---------------------------------------------+----------+
|2113 |BADAJOZ ROAD NORTH RYDE 2113, NSW, Australia |NORTH RYDE|
|2165 |SMART STREET FAIRFIELD 2165, NSW, Australia |null |
|2000 |HAY STREET HAYMARKET 2000, NSW, Australia |HAYMARKET |
|2000 |CLARENCE STREET SYDNEY 2000, NSW, Australia |SYDNEY |
+--------+---------------------------------------------+----------+
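As an aside (my addition; it assumes Spark 2.4+ for higher-order SQL functions), the same lookup can be written without a Python UDF:
from pyspark.sql import functions as F

# Keep the first array element whose trimmed, upper-cased value occurs in FullAddress
df = df.withColumn(
    'suburb',
    F.upper(F.expr(
        "filter(City_Town_Suburb, s -> instr(FullAddress, upper(trim(s))) > 0)[0]"
    ))
)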
