I have the following data:
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]}
df = pd.DataFrame(data=d)
I want to partition the data by 'col1', but I don't want the 'col1' variable to be in the final data. Is this possible?
The below code would partition by col1, but how do I ensure 'col1' doesn't appear in the final data?
from pyspark.sql.functions import *
df.write.partitionBy("col1").mode("overwrite").csv("file_path/example.csv", header=True)
Final data would be two files that look like:
d1 = {'col2': [3], 'col3': [5]}
df1 = pd.DataFrame(data=d1)
d2 = {'col2': [4], 'col3': [6]}
df2 = pd.DataFrame(data=d2)
It seems simple, but I can't figure out how to partition the data while leaving the partitioning variable out of the final CSV files.
Thanks
Once you partition the data using
df.write.partitionBy("col1").mode("overwrite").csv("file_path/example.csv", header=True)
there will be partitions based on your col1 (one directory per value).
Now, while reading the dataframe back, you can specify which columns you want to keep, like:
df = spark.read.csv('path', header=True).select('col2', 'col3')
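Note that when you read the partitioned output back, Spark's partition discovery re-adds col1 as a column derived from the col1=<value> directory names, even though the part files themselves don't contain it. A minimal round-trip sketch (assuming df is the Spark DataFrame being written and file_path/example.csv is the output path from the question):
# write: col1 ends up only in the directory names (col1=1/, col1=2/), not inside the files
df.write.partitionBy("col1").mode("overwrite").csv("file_path/example.csv", header=True)
# read: partition discovery brings col1 back from the directory names,
# so drop it (or select only the other columns) if you don't want it
df_back = spark.read.csv("file_path/example.csv", header=True).drop("col1")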
Below is the code for Spark 2.4.0 using the Scala API:
val df = sqlContext.createDataFrame(
  sc.parallelize(Seq(Row(1, 3, 5), Row(2, 4, 6))),
  StructType(Seq.range(1, 4).map(f => StructField("col" + f, DataTypes.IntegerType))))
df.write.partitionBy("col1")
.option("header", true)
.mode(SaveMode.Overwrite)
.csv("/<path>/test")
It creates two partition directories, as below:
col1=1, with the actual partition file containing:
col2,col3
3,5
col1=2, with the actual partition file containing:
col2,col3
4,6
I'm not seeing col1 inside the files; it only appears in the directory names.
In Python:
from pyspark.sql import Row
import os
import tempfile

df = spark.createDataFrame([Row(col1=1, col2=3, col3=5),
                            Row(col1=2, col2=4, col3=6)])
df.write.partitionBy('col1').mode('overwrite').csv(
    os.path.join(tempfile.mkdtemp(), 'data'), header=True)
api doc - https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
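If you want to double-check that col1 only shows up in the directory names and never inside the part files, something like the sketch below could be used (out_dir and the os.walk listing are illustrative, reusing the temporary-path idea from the snippet above):
out_dir = os.path.join(tempfile.mkdtemp(), 'data')
df.write.partitionBy('col1').mode('overwrite').csv(out_dir, header=True)
# list everything that was written; the partition value is encoded in the
# directory names (col1=1, col1=2), so the part files only contain col2 and col3
for root, _, files in os.walk(out_dir):
    for name in files:
        print(os.path.join(root, name))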
I'm struggling with updating values in a dataframe with values from another dataframe, using the row index as the key. The dataframes are not identical in terms of number of columns, so updating can only occur for matching columns. With the code below, df3 should yield the same result as df4; however, df3 is a None object.
Can anyone point me in the right direction? It doesn't seem very complicated, but I can't seem to get it right.
P.S. In reality the two dataframes are a lot larger than the ones in this example (both in terms of rows and columns).
import pandas as pd
data1 = {'A': [1, 2, 3,4],'B': [4, 5, 6,7],'C':[7,8,9,10]}
df1 = pd.DataFrame(data1,index=['I_1','I_2','I_3','I_4'])
print(df1)
data2 = {'A': [10, 40], 'B': [40, 70]}
df2 = pd.DataFrame(data2 ,index=['I_1','I_4'])
print(df2)
df3 = df1.update(df2)
print(df3)
data4 = {'A': [10, 2, 3,40],'B': [40, 5, 6,70],'C':[7,8,9,10]}
df4 = pd.DataFrame(data4 ,index=['I_1','I_2','I_3','I_4'])
print(df4)
pandas.DataFrame.update returns None. The method directly modifies the calling object in place.
Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.update.html
For your example this means two things:
update returns None, hence df3 is None.
df1 is changed when df3 = df1.update(df2) is called. In your case, df1 looks like df4 from that point on.
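A quick way to see both effects with the df1/df2 from the question (just a check, not the fix):
result = df1.update(df2)
print(result)  # None: update() has no return value
print(df1)     # df1 itself now holds the updated values (same content as df4)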
To build df3 and leave df1 untouched, you could do this:
import pandas as pd
data1 = {'A': [1, 2, 3,4],'B': [4, 5, 6,7],'C':[7,8,9,10]}
df1 = pd.DataFrame(data1,index=['I_1','I_2','I_3','I_4'])
print(df1)
data2 = {'A': [10, 40], 'B': [40, 70]}
df2 = pd.DataFrame(data2 ,index=['I_1','I_4'])
print(df2)
# make a copy of df1 (deep by default) so df1 is not affected by the update
df3 = df1.copy()
df3.update(df2)
print(df3)
data4 = {'A': [10, 2, 3,40],'B': [40, 5, 6,70],'C':[7,8,9,10]}
df4 = pd.DataFrame(data4 ,index=['I_1','I_2','I_3','I_4'])
print(df4)
I have a DataFrame which has a column containing repeated values within each row. It was the result of an inverse "explode" operation:
trello_dataframe = trello_dataframe.groupby(['Card ID', 'ID List'], as_index=True).agg({'Member (Full Name)': lambda x: x.tolist()})
How do I remove duplicate values in each row of the column?
I attach more information: https://prnt.sc/RjGazPcMBX47
I would like to have the data frame like this: https://prnt.sc/y0VjKuewp872
Thanks in advance!
You will need to target the column and apply np.unique to each row:
import pandas as pd
import numpy as np
data = {
    'Column1': ['A', 'B', 'C'],
    'Column2': [[5, 0, 5, 0, 5], [5, 0, 5], [5]]
}
df = pd.DataFrame(data)
df['Column2'] = df['Column2'].apply(lambda x : np.unique(x))
df
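Keep in mind that np.unique also sorts the values. If the members should stay in their original order of first appearance instead, a variant using pd.unique (which preserves order) could look like this:
# pd.unique keeps the order of first appearance, unlike np.unique which sorts
df['Column2'] = df['Column2'].apply(lambda x: list(pd.unique(x)))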
I've got a pandas DataFrame that contains NumPy arrays in some columns:
import numpy as np, pandas as pd
data = {'col1': [np.array([1, 2]), np.array([3, 4])],
        'col2': [np.array([5, 6]), np.array([7, 8])],
        'col3': [9, 10]}
df = pd.DataFrame(data)
I need to store a large frame like this one in a CSV file, but the arrays have to be strings that look like this:
col1,col2,col3
"(1, 2)","(5, 6)",9
"(3, 4)","(7, 8)",10
What I'm currently doing to achieve this result is to iterate over each column and each row of the DataFrame, but my solution doesn't seem efficient.
This is my current solution:
pd.options.mode.chained_assignment = None
array_columns = [column for column in df.columns if isinstance(df[column].iloc[0], np.ndarray)]
for index, row in df.iterrows():
    for column in array_columns:
        # Here 'tuple' is only used to replace brackets with parentheses
        df[column][index] = str(tuple(row[column]))
I tried using apply, although I've heard it's usually not an efficient alternative:
def array_to_str(array):
    return str(tuple(array))

df[array_columns] = df[array_columns].apply(array_to_str)
But my arrays become NaN:
col1 col2 col3
0 NaN NaN 9
1 NaN NaN 10
I tried other similar solutions, but the error:
ValueError: Must have equal len keys and value when setting with an iterable
appeared quite often.
Is there a more efficient way of performing the same operation? My real dataframes can contain many columns and thousands of rows.
Try this:
tupcols = ['col1', 'col2']
df[tupcols] = df[tupcols].apply(lambda col: col.apply(tuple)).astype('str')
df.to_csv()
You would need to convert the arrays into tuples for the correct representation. To do so, you can apply the tuple function to columns with object dtype.
to_save = df.apply(lambda x: x.map(lambda y: tuple(y)) if x.dtype=='object' else x)
to_save.to_csv(index=False)
Output:
col1,col2,col3
"(1, 2)","(5, 6)",9
"(3, 4)","(7, 8)",10
Note: This would be dangerous if you have other columns, e.g. string type.
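If your real dataframe does have other object-dtype columns (plain strings, for example), one way around that caveat is to detect the array columns explicitly, the same way the question already does, and convert only those. A sketch under that assumption:
# only touch columns whose first value is an ndarray; string columns are left alone
array_cols = [c for c in df.columns if isinstance(df[c].iloc[0], np.ndarray)]
to_save = df.copy()
to_save[array_cols] = to_save[array_cols].apply(lambda col: col.map(tuple)).astype(str)
to_save.to_csv(index=False)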
data = {'col1': [np.array([1, 2]), np.array([3, 4])],
        'col2': [np.array([5, 6]), np.array([7, 8])],
        'col3': [9, 10]}
df = pd.DataFrame(data)
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(lambda x: tuple(x))
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(lambda x: ''' "{}" '''.format(x))
col1 col2 col3
0 "(1, 2)" "(5, 6)" 9
1 "(3, 4)" "(7, 8)" 10
I have column names in a dictionary and would like to select those columns from a dataframe.
In the example below, how do I select the columns named by the dictionary values 'b' and 'c' and save them into df_out?
import pandas as pd
ds = {'cols': ['b', 'c']}
d = {'a': [2, 3], 'b': [3, 4], 'c': [4, 5]}
df_in = pd.DataFrame(data=d)
print(ds)
print(df_in)
df_out = df_in[[ds['cols']]]
print(df_out)
TypeError: unhashable type: 'list'
Remove nested list - []:
df_out = df_in[ds['cols']]
print(df_out)
b c
0 3 4
1 4 5
According to the reference, you just need to drop one set of brackets.
df_out = df_in[ds['cols']]
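An equivalent label-based selection with .loc, in case you prefer to make the row selection explicit:
df_out = df_in.loc[:, ds['cols']]
print(df_out)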
Different from creating an empty dataframe and populating rows later, I have many, many dataframes that need to be concatenated.
If there were only two data frames, I can do this:
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df1.append(df2, ignore_index=True)
Imagine I have millions of dataframes that need to be appended/concatenated each time I read a new file into a DataFrame object.
But when I tried to initialize an empty dataframe and then adding the new dataframes through a loop:
import os
import pandas as pd

alldf = pd.DataFrame(columns=list('AB'))
for filename in os.listdir(indir):
    df = pd.read_csv(indir+filename, delimiter=' ')
    alldf.append(df, ignore_index=True)
This would return an empty alldf with only the header row, e.g.
alldf = pd.DataFrame(columns=list('AB'))
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
for df in [df1, df2]:
    alldf.append(df, ignore_index=True)
pd.concat() over a list of dataframes is probably the way to go, especially for clean CSVs. But in case you suspect your CSVs are either dirty or could get recognized by read_csv() with mixed types between files, you may want to explicitly create each dataframe in a loop.
You can initialize a dataframe for the first file, and then for each subsequent file start with an empty dataframe based on the first:
df2 = pd.DataFrame(data=None, columns=df1.columns, index=df1.index)
This takes the structure of dataframe df1 but no data, and creates df2. If you want to force data types on the columns, you can do it to df1 when it is created, before its structure is copied.
more details
From @DSM's comment, this works:
import os
import pandas as pd

dfs = []
for filename in os.listdir(indir):
    df = pd.read_csv(indir + filename, delimiter=' ')
    dfs.append(df)
alldf = pd.concat(dfs)
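The same idea can also be written as a single expression; ignore_index=True is worth adding so the row index doesn't repeat across files (a sketch, assuming the same indir of space-delimited files as in the question):
import os
import pandas as pd

alldf = pd.concat(
    (pd.read_csv(os.path.join(indir, f), delimiter=' ') for f in os.listdir(indir)),
    ignore_index=True,
)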