Remove duplicate values in each row of the column - python

I have a DataFrame with a column that contains repeated values. It is the result of an inverse "explode" operation:
trello_dataframe = trello_dataframe.groupby(['Card ID', 'ID List'], as_index=True).agg({'Member (Full Name)': lambda x: x.tolist()})
How do I remove the duplicate values in each row of that column?
I attach more information: https://prnt.sc/RjGazPcMBX47
I would like to have the data frame like this: https://prnt.sc/y0VjKuewp872
Thanks in advance!

You will need to target the column and apply np.unique to each row's list:
import pandas as pd
import numpy as np

data = {
    'Column1': ['A', 'B', 'C'],
    'Column2': [[5, 0, 5, 0, 5], [5, 0, 5], [5]]
}
df = pd.DataFrame(data)

# np.unique returns the sorted unique values of each row's list
df['Column2'] = df['Column2'].apply(lambda x: np.unique(x))
df
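
If you instead want to avoid the duplicates during the original aggregation, here is a minimal sketch using the column names from the question (the sample data below is made up):
import pandas as pd

# Hypothetical data shaped like the question's frame
trello_dataframe = pd.DataFrame({
    'Card ID': [1, 1, 1, 2],
    'ID List': ['To Do', 'To Do', 'To Do', 'Done'],
    'Member (Full Name)': ['Ann', 'Ann', 'Bob', 'Ann'],
})

# pd.unique drops duplicates while keeping the original order
trello_dataframe = (
    trello_dataframe
    .groupby(['Card ID', 'ID List'], as_index=True)
    .agg({'Member (Full Name)': lambda x: list(pd.unique(x))})
)
print(trello_dataframe)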

Pyspark partitionBy: How do I partition my data and then select columns

I have the following data:
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]}
df = pd.DataFrame(data=d)
I want to partition the data by 'col1', but I don't want the 'col1' variable to be in the final data. Is this possible?
The below code would partition by col1, but how do I ensure 'col1' doesn't appear in the final data?
from pyspark.sql.functions import *
df.write.partitionBy("col1").mode("overwrite").csv("file_path/example.csv", header=True)
Final data would be two files that look like:
d1 = {'col2': [3], 'col3': [5]}
df1 = pd.DataFrame(data=d1)
d2 = {'col2': [4], 'col3': [6]}
df2 = pd.DataFrame(data=d2)
Seems simple, but I can't figure out how to partition the data while leaving the partitioning variable out of the final CSV.
Thanks
Once you partition the data using
df.write.partitionBy("col1").mode("overwrite").csv("file_path/example.csv", header=True)
there will be one partition directory per value of col1.
Now, while reading the dataframe back, you can specify which columns you want to use, like this:
df = spark.read.csv('path').select('col2', 'col3')
Below is the code for Spark 2.4.0 using the Scala API:
import org.apache.spark.sql.{Row, SaveMode}
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

val df = sqlContext.createDataFrame(
  sc.parallelize(Seq(Row(1, 3, 5), Row(2, 4, 6))),
  StructType(Seq.range(1, 4).map(f => StructField("col" + f, DataTypes.IntegerType))))

df.write.partitionBy("col1")
  .option("header", true)
  .mode(SaveMode.Overwrite)
  .csv("/<path>/test")
It creates two partition directories, as below:
col1=1, with the actual partition file containing:
col2,col3
3,5
col1=2, with the actual partition file containing:
col2,col3
4,6
I'm not seeing col1 in the files.
In Python:
import os
import tempfile
from pyspark.sql import Row

df = spark.createDataFrame([Row(col1=1, col2=3, col3=5),
                            Row(col1=2, col2=4, col3=6)])
df.write.partitionBy('col1').mode('overwrite').csv(os.path.join(tempfile.mkdtemp(), 'data'))
API doc - https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
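
For completeness, a small read-back sketch (my own addition, not part of the original answers): when you read the partitioned directory back, Spark recovers col1 from the "col1=..." directory names, so drop it if you really only want col2 and col3.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Partition discovery adds col1 back from the directory names
df_back = spark.read.csv("/<path>/test", header=True)
df_back.drop("col1").show()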

Multilevel index won't go away

I have a dataframe, which consists of summary statistics of another dataframe:
df = sample[['Place','Lifeexp']]
df = df.groupby('Place').agg(['count','mean', 'max','min']).reset_index()
df = df.sort_values([('Lifeexp', 'count')], ascending=False)
When looking at the structure, the dataframe has a multi index, which makes plot creation difficult:
df.columns
MultiIndex(levels=[['Lifeexp', 'Place'], ['count', 'mean', 'max', 'min', '']],
           labels=[[1, 0, 0, 0, 0], [4, 0, 1, 2, 3]])
I tried the solutions from different questions here (e.g. this), but somehow don't get the desired result. I want df to have Place, count, mean, max, min as column names, with Lifeexp dropped, so that I can create simple plots, e.g. df.plot.bar(x="Place", y='count').
I think the solution is to simply select the column after the groupby, which prevents a MultiIndex in the columns:
df = df.groupby('Place')['Lifeexp'].agg(['count','mean', 'max','min']).reset_index()
df = df.sort_values('count', ascending=False)
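
A minimal, self-contained sketch (the sample DataFrame here is made up, since the original data isn't shown) illustrating that the columns stay flat and the plot call then works directly:
import pandas as pd

sample = pd.DataFrame({
    'Place': ['A', 'A', 'B'],
    'Lifeexp': [70, 75, 80],
})

df = sample.groupby('Place')['Lifeexp'].agg(['count', 'mean', 'max', 'min']).reset_index()
df = df.sort_values('count', ascending=False)

print(df.columns.tolist())   # ['Place', 'count', 'mean', 'max', 'min']
df.plot.bar(x='Place', y='count')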

Select columns of pandas dataframe using a dictionary list value

I have column names in a dictionary and would like to select those columns from a dataframe.
In the example below, how do I select the dictionary values 'b' and 'c' and save those columns into a new dataframe?
import pandas as pd
ds = {'cols': ['b', 'c']}
d = {'a': [2, 3], 'b': [3, 4], 'c': [4, 5]}
df_in = pd.DataFrame(data=d)
print(ds)
print(df_in)
df_out = df_in[[ds['cols']]]
print(df_out)
TypeError: unhashable type: 'list'
Remove the nested list - you are wrapping the list in an extra pair of brackets:
df_out = df_in[ds['cols']]
print(df_out)
   b  c
0  3  4
1  4  5
According to the ref, you just need to drop one set of brackets.
df_out = df_in[ds['cols']]
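
An equivalent alternative (my own addition, not from the original answers) is to use .loc, which reads explicitly as "all rows, these columns":
df_out = df_in.loc[:, ds['cols']]
print(df_out)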

deleting pandas dataframe column

I am trying to delete a column of a pandas dataframe and I get the following error: ValueError: labels [' 5'] not contained in axis. However, print df.columns returns Int64Index([0, 1, 2, 3, 4, 5, 6], dtype='int64'). See below the code as well:
df = pd.read_csv(StringIO(data),skiprows=186,sep=";",header=None)
#df.drop(' 5', inplace=True)
b= df.columns.tolist()
print df.columns
The error happens because drop defaults to axis=0 (row labels) and your column labels are integers, not the string ' 5', so pass the integer label and axis=1:
df = df.drop(0, axis=1)                    # to delete/remove a single column
df = df.drop([0, 2, 3, 4, 5, 6], axis=1)   # to delete columns 0, 2, 3, 4, 5, 6
df.drop(columns=['B', 'C'])                # to delete columns by name
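
A minimal runnable sketch (with made-up data, since the CSV from the question isn't available) showing the drop by integer label:
import pandas as pd

# Seven columns with integer labels 0..6, like the question's frame
df = pd.DataFrame([[10, 20, 30, 40, 50, 60, 70]])
print(df.columns.tolist())   # [0, 1, 2, 3, 4, 5, 6]

df = df.drop(5, axis=1)      # works: 5 is an integer label, ' 5' is not
print(df.columns.tolist())   # [0, 1, 2, 3, 4, 6]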

expand each row to multiple rows in pandas using dataframe.apply (similar to MapReduce)

Here's a simplified version of my problem.
I have a DataFrame that has start and end locations of trips. I want to end up with a DataFrame that has, for each station, the number of arrivals and departures. I am familiar with MapReduce-like workflows, where in the Map phase I can take in one row and output multiple rows, and then aggregate over all rows in the reduce phase. Here's the code that I have now, that DOES NOT work.
import pandas as pd
import numpy as np

def expand_row(row):
    return pd.Series(
        {'station': [row['start_station'], row['end_station']],
         'departures': [1, 0],
         'arrivals': [0, 1],
         },
    )

trips = pd.DataFrame({
    'start_station': ['a', 'c'],
    'end_station': ['b', 'a'],
})

expanded = trips.apply(expand_row, axis=1)
aggregated = expanded.groupby('station').aggregate(np.sum)
What I want as my final DataFrame is
desired_df = pd.DataFrame({
    'station': ['a', 'b', 'c'],
    'departures': [1, 0, 1],
    'arrivals': [1, 1, 0]
})
desired_df.index = desired_df.pop('station')
Many thanks.
import pandas as pd

trips = pd.DataFrame({
    'start_station': ['a', 'c'],
    'end_station': ['b', 'a'],
})
trips.apply(pd.value_counts).fillna(0)
The result is:
   end_station  start_station
a            1              1
b            1              0
c            0              1
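
To match the desired column names from the question, a small follow-up sketch (my own addition, not part of the original answer):
counts = trips.apply(pd.value_counts).fillna(0).astype(int)
counts = counts.rename(columns={'start_station': 'departures',
                                'end_station': 'arrivals'})
counts.index.name = 'station'
print(counts)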
