Dropping duplicate values in a column - python

I have a frame like:
df = pd.DataFrame({'America':["24,23,24,24","10","AA,AA, XY"]})
I tried converting it to a list, a set, etc., but couldn't get it to work.
How can I drop the duplicates?

Use a custom function with split and set:
df['America'] = df['America'].apply(lambda x: set(x.split(',')))
Another solution is to use a list comprehension:
df['America'] = [set(x.split(',')) for x in df['America']]
print (df)
America
0 {23, 24}
1 {10}
2 {AA, XY}

This is one approach using str.split.
Ex:
import pandas as pd
df = pd.DataFrame({'America':["24,23,24,24","10","AA,AA, XY"]})
print(df["America"].str.split(",").apply(set))
Output:
0 {24, 23}
1 {10}
2 {AA, XY}
Name: America, dtype: object
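If you want the column to stay a comma-separated string rather than become a set, a small variation drops the duplicates while preserving the original order. This is a minimal sketch using dict.fromkeys (the whitespace stripping is an assumption, prompted by the " XY" entry in the sample data):
df['America'] = df['America'].apply(
    lambda x: ','.join(dict.fromkeys(v.strip() for v in x.split(','))))
print(df)
  America
0   24,23
1      10
2   AA,XY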

Aggregating pandas series into a DataFrame and visual representation

I have three pandas Series, called: Col_data, C_PV_data and C_elec_data. Each one has these values:
Col_data:
0 625814.205486
1 782267.756857
2 938721.308229
Name: 7, dtype: object
C_PV_data:
0 2039032.206909
1 2548790.258636
2 3058548.310363
Name: 3, dtype: object
C_elec_data:
0 1337523.743009
1 1671904.678761
2 2006285.614513
Name: 0, dtype: object
I would like to aggregate them into a single DataFrame and export it to a .xlsx file, with each column named after the variable. For instance:
Col_data        C_PV_data         C_elec_data
625814.205486   2039032.206909    1337523.743009
782267.756857   2548790.258636    1671904.678761
938721.308229   3058548.310363    2006285.614513
Finally, I would like to represent each column with a graph in which the central value is a line, and two dots over that line for the lowest and highest values. For instance, the graph would be something like this:
Sure, here you go:
Init
import pandas as pd

Col_data = pd.Series([
    625814.205486,
    782267.756857,
    938721.308229])
C_PV_data = pd.Series([
    2039032.206909,
    2548790.258636,
    3058548.310363])
C_elec_data = pd.Series([
    1337523.743009,
    1671904.678761,
    2006285.614513])
As a df
df = pd.concat(
    [Col_data, C_PV_data, C_elec_data], axis=1,
    keys=['Col_data', 'C_PV_data', 'C_elec_data'])
>>> df
Col_data C_PV_data C_elec_data
0 625814.205486 2.039032e+06 1.337524e+06
1 782267.756857 2.548790e+06 1.671905e+06
2 938721.308229 3.058548e+06 2.006286e+06
Side note: I always dislike repeats. The following alternative to the above is DRY (Don't Repeat Yourself), but less clear perhaps:
keys = ['Col_data', 'C_PV_data', 'C_elec_data']
d = locals() # just for DRY...
df = pd.concat([d[k] for k in keys], axis=1, keys=keys)
To xlsx
Assuming you have openpyxl installed:
df.to_excel('foo.xlsx', index=False)
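If you later need to control the sheet name or write several frames into one workbook, pd.ExcelWriter is the usual route; a minimal sketch (the sheet name 'costs' is just an example):
with pd.ExcelWriter('foo.xlsx') as writer:
    df.to_excel(writer, sheet_name='costs', index=False)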
Box plot
Edit: (and save as PNG)
ax = df.loc[[0,1,1,1,2]].plot.box()
ax.figure.savefig('costs.png')
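If you want a plot closer to the one described in the question (a line for the central value and two dots for the lowest and highest values), a minimal matplotlib sketch could look like the following. It assumes row 1 holds the central value and rows 0 and 2 the extremes, as in the data above:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x = range(len(df.columns))
# central value of each column drawn as a short horizontal line
ax.hlines(df.iloc[1], [i - 0.2 for i in x], [i + 0.2 for i in x])
# lowest and highest values drawn as dots
ax.scatter(list(x) + list(x), list(df.iloc[0]) + list(df.iloc[2]))
ax.set_xticks(list(x))
ax.set_xticklabels(df.columns)
fig.savefig('costs_points.png')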

Parallel Processing using Multiprocessing in Python

I'm new to doing parallel processing in Python. I have a large dataframe with names and the list of countries that the person lived in. A sample dataframe is this:
I have a chunk of code that takes in this dataframe and splits the countries to separate columns. The code is this:
def split_country(data):
    d_list = []
    for index, row in data.iterrows():
        for value in str(row['Country']).split(','):
            d_list.append({'Name': row['Name'],
                           'value': value})
    data = data.append(d_list, ignore_index=True)
    data = data.groupby('Name')['value'].value_counts()
    data = data.unstack(level=-1).fillna(0)
    return data
The final output is something like this:
I'm trying to parallelize the above process by passing my dataframe (df) using the following:
import multiprocessing as mp

result = []
pool = mp.Pool(mp.cpu_count())
result.append(pool.map(split_country, [row for row in df]))
But the processing does not stop even with a toy dataset like the one above. I'm completely new to this, so I would appreciate any help.
multiprocessing is probably not required here. Using pandas vectorized methods will be sufficient to quickly produce the desired result.
For a test DataFrame with 1M rows, the following code took 1.54 seconds.
First, use pandas.DataFrame.explode on the column of lists.
If the column contains strings, first use ast.literal_eval to convert them to lists:
df.countries = df.countries.apply(ast.literal_eval)
If the data is read from a CSV file, use df = pd.read_csv('test.csv', converters={'countries': literal_eval}).
For this question, it's better to use pandas.get_dummies to get a count of each country per name, then pandas.DataFrame.groupby on 'name' and aggregate with .sum:
import pandas as pd
from ast import literal_eval
# sample data
data = {'name': ['John', 'Jack', 'James'], 'countries': [['USA', 'UK'], ['China', 'UK'], ['Canada', 'USA']]}
# create the dataframe
df = pd.DataFrame(data)
# if the countries column is strings, evaluate to lists; otherwise skip this line
df.countries = df.countries.apply(literal_eval)
# explode the lists
df = df.explode('countries')
# use get_dummies and groupby name and sum
df_counts = pd.get_dummies(df, columns=['countries'], prefix_sep='', prefix='').groupby('name', as_index=False).sum()
# display(df_counts)
name Canada China UK USA
0 Jack 0 1 1 0
1 James 1 0 0 1
2 John 0 0 1 1
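For reference, once the lists are exploded, the same count table can also be built with pandas.crosstab, which replaces the get_dummies/groupby pair; a minimal sketch, assuming df has already been exploded as above:
df_counts = pd.crosstab(df['name'], df['countries']).reset_index()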

Column in pandas dataframe has lists as values. How do I create a version of this column but with only the first value in the list?

My current solution is below:
prices_real = []
for item in sparkline['prices']:
    prices_real.append(item[0])
sparkline['prices_real'] = prices_real
But I'm wondering if there is an easier way or a method I don't know about?
There are 2 aspects to your problem:
Extracting the first (and only) element of each list within your series.
Converting your series to numeric.
So you can use the str accessor followed by pd.to_numeric:
df = pd.DataFrame({'x': [['0.12312'], ['-5.32454'], ['0.563412'], ['-3.918324']]})
df['x'] = pd.to_numeric(df['x'].str[0])
print(df, df.dtypes, sep='\n'*2)
x
0 0.123120
1 -5.324540
2 0.563412
3 -3.918324
x float64
dtype: object
You can use pandas.Series.apply:
sparkline = pd.DataFrame({"prices": [[1], [4]]})
sparkline
# prices
# 0 [1]
# 1 [4]
sparkline["prices"] = sparkline["prices"].apply(lambda x: x[0])
sparkline
# prices
# 0 1
# 1 4
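Combining the two ideas above for the asker's frame, the list access and the numeric conversion can be done in one line; a minimal sketch, assuming each list in 'prices' holds a single numeric value:
import pandas as pd

sparkline = pd.DataFrame({'prices': [['0.12312'], ['-5.32454']]})
sparkline['prices_real'] = pd.to_numeric(sparkline['prices'].str[0])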

Dictionary in Pandas DataFrame, how to split the columns

I have a DataFrame that consists of one column ('Vals') which is a dictionary. The DataFrame looks more or less like this:
In[215]: fff
Out[213]:
Vals
0 {u'TradeId': u'JP32767', u'TradeSourceNam...
1 {u'TradeId': u'UUJ2X16', u'TradeSourceNam...
2 {u'TradeId': u'JJ35A12', u'TradeSourceNam...
When looking at an individual row the dictionary looks like this:
In[220]: fff['Vals'][100]
Out[218]:
{u'BrdsTraderBookCode': u'dffH',
u'Measures': [{u'AssetName': u'Ie0',
u'DefinitionId': u'6dbb',
u'MeasureValues': [{u'Amount': -18.64}],
u'ReportingCurrency': u'USD',
u'ValuationId': u'669bb'}],
u'SnapshotId': 12739,
u'TradeId': u'17304M',
u'TradeLegId': u'31827',
u'TradeSourceName': u'xxxeee',
u'TradeVersion': 1}
How can I split the columns and create a new DataFrame, so that I get one column with TradeId and another one with MeasureValues?
try this:
l = []
for idx, row in df['Vals'].iteritems():
    temp_df = pd.DataFrame(row['Measures'][0]['MeasureValues'])
    temp_df['TradeId'] = row['TradeId']
    l.append(temp_df)
pd.concat(l, axis=0)
Here's a way to get TradeId and MeasureValues (using twice your sample row above to illustrate the iteration):
new_df = pd.DataFrame()
for id, data in fff.iterrows():
    d = {'TradeId': data.ix[0]['TradeId']}
    d.update(data.ix[0]['Measures'][0]['MeasureValues'][0])
    new_df = pd.concat([new_df, pd.DataFrame.from_dict(d, orient='index').T])
Amount TradeId
0 -18.64 17304M
0 -18.64 17304M
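On newer pandas versions, pd.json_normalize can flatten the nested dictionaries in one call. This is only a sketch, assuming every dictionary in 'Vals' has the same structure as the sample row above (non-empty 'Measures' and 'MeasureValues' lists):
import pandas as pd

# one row per MeasureValues entry, with the top-level TradeId carried along
new_df = pd.json_normalize(
    fff['Vals'].tolist(),
    record_path=['Measures', 'MeasureValues'],
    meta=['TradeId'])
# new_df now has the columns 'Amount' and 'TradeId'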

Change one value based on another value in pandas

I'm trying to reproduce my Stata code in Python, and I was pointed in the direction of Pandas. I am, however, having a hard time wrapping my head around how to process the data.
Let's say I want to iterate over all values in the column 'ID'. If an ID matches a specific number, then I want to change the two corresponding values FirstName and LastName.
In Stata it looks like this:
replace FirstName = "Matt" if ID==103
replace LastName = "Jones" if ID==103
So this sets FirstName to "Matt" (and LastName to "Jones") for every row where ID == 103.
In Pandas, I'm trying something like this:
df = read_csv("test.csv")
for i in df['ID']:
    if i == 103:
        ...
Not sure where to go from here. Any ideas?
One option is to use Python's slicing and indexing features to logically evaluate the places where your condition holds and overwrite the data there.
Assuming you can load your data directly into pandas with pandas.read_csv then the following code might be helpful for you.
import pandas
df = pandas.read_csv("test.csv")
df.loc[df.ID == 103, 'FirstName'] = "Matt"
df.loc[df.ID == 103, 'LastName'] = "Jones"
As mentioned in the comments, you can also do the assignment to both columns in one shot:
df.loc[df.ID == 103, ['FirstName', 'LastName']] = 'Matt', 'Jones'
Note that you'll need pandas version 0.11 or newer to use loc for this kind of overwrite assignment. In older versions such as 0.8, chained assignment was the usual way to do it, which is why it is still worth knowing about, even though it should be avoided in more modern versions of pandas.
Another way to do it is to use what is called chained assignment. The behavior of this is less stable and so it is not considered the best solution (it is explicitly discouraged in the docs), but it is useful to know about:
import pandas
df = pandas.read_csv("test.csv")
df['FirstName'][df.ID == 103] = "Matt"
df['LastName'][df.ID == 103] = "Jones"
You can use map; it can map values from a dictionary or even apply a custom function.
Suppose this is your df:
ID First_Name Last_Name
0 103 a b
1 104 c d
Create the dicts:
fnames = {103: "Matt", 104: "Mr"}
lnames = {103: "Jones", 104: "X"}
And map:
df['First_Name'] = df['ID'].map(fnames)
df['Last_Name'] = df['ID'].map(lnames)
The result will be:
ID First_Name Last_Name
0 103 Matt Jones
1 104 Mr X
Or use a custom function:
names = {103: ("Matt", "Jones"), 104: ("Mr", "X")}
df['First_Name'] = df['ID'].map(lambda x: names[x][0])
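Note that IDs missing from the dictionary become NaN with map (and raise a KeyError with the lambda version). If the frame already contains names that should be kept for the other IDs, one possible sketch is to fill the gaps from the existing columns:
df['First_Name'] = df['ID'].map(fnames).fillna(df['First_Name'])
df['Last_Name'] = df['ID'].map(lnames).fillna(df['Last_Name'])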
The original question addresses a specific narrow use case. For those who need more generic answers here are some examples:
Creating a new column using data from other columns
Given the dataframe below:
import pandas as pd
import numpy as np
df = pd.DataFrame([['dog', 'hound', 5],
                   ['cat', 'ragdoll', 1]],
                  columns=['animal', 'type', 'age'])
In [1]: df
Out[1]:
animal type age
----------------------
0 dog hound 5
1 cat ragdoll 1
Below we are adding a new description column as a concatenation of other columns by using the + operator, which is overridden for Series. Fancy string formatting, f-strings etc. won't work here, since the + is applied element-wise to whole Series rather than to scalar values:
df['description'] = 'A ' + df.age.astype(str) + ' years old ' \
                    + df.type + ' ' + df.animal
In [2]: df
Out[2]:
animal type age description
-------------------------------------------------
0 dog hound 5 A 5 years old hound dog
1 cat ragdoll 1 A 1 years old ragdoll cat
We get 1 years for the cat (instead of 1 year), which we will fix below using conditionals.
Modifying an existing column with conditionals
Here we are replacing the original animal column with values from other columns, and using np.where to set a conditional substring based on the value of age:
# append 's' to 'year' when age is greater than 1
df.animal = df.animal + ", " + df.type + ", " + \
            df.age.astype(str) + " year" + np.where(df.age > 1, 's', '')
In [3]: df
Out[3]:
animal type age
-------------------------------------
0 dog, hound, 5 years hound 5
1 cat, ragdoll, 1 year ragdoll 1
Modifying multiple columns with conditionals
A more flexible approach is to call .apply() on an entire dataframe rather than on a single column:
def transform_row(r):
    r.animal = 'wild ' + r.type
    r.type = r.animal + ' creature'
    r.age = "{} year{}".format(r.age, r.age > 1 and 's' or '')
    return r

df.apply(transform_row, axis=1)
In[4]:
Out[4]:
animal type age
----------------------------------------
0 wild hound dog creature 5 years
1 wild ragdoll cat creature 1 year
In the code above the transform_row(r) function receives a Series object representing a given row (selected with axis=1; the default axis=0 would instead pass each column as a Series). This simplifies processing since you can access the actual 'primitive' values in the row by column name and also see the other cells in that row.
This question might still be visited often enough that it's worth offering an addendum to Mr Kassies' answer. The dict built-in class can be sub-classed so that a default is returned for 'missing' keys. This mechanism works well for pandas. But see below.
In this way it's possible to avoid key errors.
>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> class SurnameMap(dict):
...     def __missing__(self, key):
...         return ''
...
>>> surnamemap = SurnameMap()
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap[x])
>>> df
ID Surname
0 101 Mohanty
1 201
2 301 Drake
3 401
The same thing can be done more simply in the following way. The use of the 'default' argument for the get method of a dict object makes it unnecessary to subclass a dict.
>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> surnamemap = {}
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap.get(x, ''))
>>> df
ID Surname
0 101 Mohanty
1 201
2 301 Drake
3 401
df['FirstName'] = df['ID'].apply(lambda x: 'Matt' if x == 103 else '')
df['LastName'] = df['ID'].apply(lambda x: 'Jones' if x == 103 else '')
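Be aware that this overwrites FirstName and LastName with an empty string for every ID other than 103. A sketch that only touches the matching rows and keeps the existing values elsewhere (assuming the columns already exist) uses numpy.where instead:
import numpy as np

df['FirstName'] = np.where(df['ID'] == 103, 'Matt', df['FirstName'])
df['LastName'] = np.where(df['ID'] == 103, 'Jones', df['LastName'])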
In case someone is looking for a way to change the values of multiple rows based on some logical condition of each row itself, using .apply() with a function is the way to go.
df = pd.DataFrame({'col_a':[0,0], 'col_b':[1,2]})
col_a col_b
0 0 1
1 0 2
def func(row):
    if row.col_a == 0 and row.col_b <= 1:
        row.col_a = -1
        row.col_b = -1
    return row
df.apply(func, axis=1)
col_a col_b
0 -1 -1 # Modified row
1 0 2
Although .apply() is typically used to add a new row/column to a dataframe, it can be used to modify the values of existing rows/columns.
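For this particular condition, a vectorized boolean mask achieves the same result without a row-wise apply; a minimal sketch:
mask = (df['col_a'] == 0) & (df['col_b'] <= 1)
df.loc[mask, ['col_a', 'col_b']] = -1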
I found it much easier to debug by printing out where each row meets the condition:
for n in df.columns:
    if (df[n] == 103).any():
        print(n)
        print(df[df[n] == 103].index)
