Removing certain items in a pandas.core.series - python

Given a pandas.core.series, is it possible to remove certain items from each index and retain the ones that I want to keep? For example:
country=['China','USA','Asia','Brazil']
contents={'Tags': ['China;Panda;Brazil;Plug','USA;China;Asia','Brazil;Peanut']}
df=pd.DataFrame(contents)
tags=df["Tags"]
I wish to discard the values that are not in the country list and keep the rest. So for tags[0] the result should be ['China', 'Brazil'], for tags[1] all the values remain, and for tags[2] the result should be ['Brazil'].
tags = tags.str.split(';')
I have attempted to split on the ; between each value, but I am uncertain how to proceed.

[[val for val in a_list if val in country] for a_list in df.Tags.str.split(";")]
For each list after splitting, keep only the values that are in the country list
to get
[['China', 'Brazil'], ['USA', 'China', 'Asia'], ['Brazil']]
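A minimal runnable sketch of the same approach, assigning the filtered lists back into the DataFrame (using the df and country from the question):
import pandas as pd

country = ['China', 'USA', 'Asia', 'Brazil']
df = pd.DataFrame({'Tags': ['China;Panda;Brazil;Plug', 'USA;China;Asia', 'Brazil;Peanut']})

# keep only the tags that appear in the country list
df['Tags'] = [[val for val in tag_list if val in country]
              for tag_list in df['Tags'].str.split(';')]
print(df['Tags'].tolist())
# [['China', 'Brazil'], ['USA', 'China', 'Asia'], ['Brazil']]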

Exploding and regrouping is also an option:
import pandas as pd
country = ['China', 'USA', 'Asia', 'Brazil']
contents = {'Tags': ['China;Panda;Brazil;Plug',
'USA;China;Asia',
'Brazil;Peanut']}
df = pd.DataFrame(contents)
e = df.Tags.str.split(';').explode()
tags = e[e.isin(country)].groupby(level=0).agg(list).tolist()
print(tags)
tags:
[['China', 'Brazil'], ['USA', 'China', 'Asia'], ['Brazil']]
An option with merge rather than isin:
import pandas as pd
country = ['China', 'USA', 'Asia', 'Brazil']
contents = {'Tags': ['China;Panda;Brazil;Plug',
'USA;China;Asia',
'Brazil;Peanut']}
df = pd.DataFrame(contents)
e = df.Tags.str.split(';').explode().reset_index()
tags = e.merge(
    pd.DataFrame(country, columns=['Tags']),
    on='Tags'
).groupby('index')['Tags'].agg(list).tolist()
print(tags)
tags:
[['China', 'Brazil'], ['China', 'USA', 'Asia'], ['Brazil']]

You can create a function to do the parsing and copy the information to a new column. So, for example, something like:
from functools import partial
import numpy as np

def pars_func(countries, content_str):
    ret_str = ''
    for tag in content_str.split(';'):
        if tag in countries:
            ret_str = ret_str + ';' + tag
    if ret_str == '':
        return np.nan
    else:
        return ret_str[1:].split(';')

df['Tags'] = df['Tags'].apply(partial(pars_func, country))

Faster way to iterate over columns in pandas

I have the following task.
I have this data:
import pandas
import numpy as np
data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
'phone': [912341.0, np.nan , 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
' email': ['todd#gmail.com', 'chris#gmail.com', np.nan, 'ben#gmail.com', np.nan ,np.nan , 'joe#gmail.com', 'rick#gmail.com'],
'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
'most_visited_place': ['Turkey', 'Spain',np.nan , 'Germany', 'Germany', 'Spain',np.nan , 'Spain']
}
df = pandas.DataFrame(data)
What I have to do is, for every feature column (most_visited_airport etc.) and each of its values (Heathrow, Beijing, Tokyo), generate personal information and output it to a file.
E.g. If we look at most_visited_airport and Heathrow
I need to output three files containing the names, emails and phones of the people who visited the airport the most.
Currently, I have this code to do the operation for both columns and all the values:
columns_to_iterate = [x for x in df.columns if 'most' in x]
for each in df[columns_to_iterate]:
    values = df[each].dropna().unique()
    for i in values:
        df1 = df.loc[df[each] == i, 'name']
        df2 = df.loc[df[each] == i, ' email']
        df3 = df.loc[df[each] == i, 'phone']
        df1.to_csv(f'{each}_{i}_{df1.name}.csv')
        df2.to_csv(f'{each}_{i}_{df2.name}.csv')
        df3.to_csv(f'{each}_{i}_{df3.name}.csv')
Is it possible to do this in a more elegant and maybe faster way? Currently I have a small dataset, but I am not sure this code will perform well with big data. My particular concern is the nested loops.
Thank you in advance!
You could replace the call to unique with a groupby, which would not only get the unique values, but split up the dataframe for you:
for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        # note: the email column in the sample data is named ' email' (leading space)
        for attr in ('name', 'phone', ' email'):
            group[attr].dropna().to_csv(f'{column}_{key}_{attr}.csv')
You can do it this way.
cols = df.filter(regex='most').columns.values

def func_current_cols_to_csv(most_col):
    place = [i for i in df[most_col].dropna().unique().tolist()]
    csv_cols = ['name', 'phone', ' email']
    result = [df[df[most_col] == i][j].dropna().to_csv(f'{most_col}_{i}_{j}.csv', index=False)
              for i in place for j in csv_cols]
    return result

[func_current_cols_to_csv(i) for i in cols]
Also, among the options when writing to CSV, you can keep the index, but do not forget to reset it before writing.
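For example, a small sketch of that (the column and value names are taken from the question's data and are otherwise arbitrary):
# keep a clean 0..n index in the output file instead of the original row labels
subset = df.loc[df['most_visited_airport'] == 'Heathrow', 'name'].dropna()
subset.reset_index(drop=True).to_csv('most_visited_airport_Heathrow_name.csv')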

How to apply a class function to replace NaN with the mean within a subset of pandas df columns?

The class is composed of a set of attributes and functions including:
Attributes:
df : a pandas dataframe.
numerical_feature_names: df columns with a numeric value.
label_column_names: df string columns to be grouped.
Functions:
mean(nums): takes a list of numbers as input and returns the mean
fill_na(df, numerical_feature_names, label_columns): takes class attributes as inputs and returns a transformed df.
And here's the class:
class PLUMBER():
    def __init__(self):
        ################# attributes ################
        self.df = df
        # specify label and numerical features names:
        self.numerical_feature_names = numerical_feature_names
        self.label_column_names = label_column_names

    ##################### mean ##############################
    def mean(self, nums):
        total = 0.0
        for num in nums:
            total = total + num
        return total / len(nums)

    ############ fill the numerical features ##################
    def fill_na(self, df, numerical_feature_names, label_column_names):
        # declaring parameters:
        df = self.df
        numerical_feature_names = self.numerical_feature_names
        label_column_names = self.label_column_names
        # now replacing NaN with group mean
        for numerical_feature_name in numerical_feature_names:
            df[numerical_feature_name] = df.groupby([label_column_names]).transform(lambda x: x.fillna(self.mean(x)))
        return df
When trying to apply it to a pandas df:
if __name__ == "__main__":
    # initialize class
    plumber = PLUMBER()
    # replace NaN with group mean
    df = plumber.fill_na(df=df, numerical_feature_names=numerical_feature_names, label_column_names=label_column_names)
The following error arises:
ValueError: Grouper and axis must be same length
data and class parameters
import pandas as pd
import numpy as np

d = {'month': ['01/01/2020', '01/02/2020', '01/03/2020', '01/01/2020', '01/02/2020', '01/03/2020'],
     'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
     'level': ['A01', 'A01', 'A01', 'A00', 'A00', 'A00'],
     'job title': ['Insights Manager', 'Insights Manager', 'Insights Manager', 'Sales Director', 'Sales Director', 'Sales Director'],
     'number': [np.nan, 450, 299, np.nan, 19, 29],
     'age': [np.nan, 30, 28, np.nan, 29, 18]}
df = pd.DataFrame(d)
# headers
column_names=df.columns.values.tolist()
column_names= [column_name.strip() for column_name in column_names]
# label_column_names (to be grouped)
label_column_names=['country', 'level', 'job title']
# numerical_features:
numerical_feature_names = [x for x in column_names if x not in label_column_names]
numerical_feature_names.remove('month')
How could I change the class in order to get the transformed df (i.e. the one that replaces np.nan with its group mean)?
First, the error occurs because label_column_names is already a list, so in the groupby you don't need the [] around it. It should be df.groupby(label_column_names)... instead of df.groupby([label_column_names])....
Now, to actually solve your problem, in the fill_na function of your class, replace the for loop (you don't actually need it) with
df[numerical_feature_names] = (
    df[numerical_feature_names]
    .fillna(
        df.groupby(label_column_names)
        [numerical_feature_names].transform('mean')
    )
)
which fills the NaN in the numerical_feature_names columns with the per-group means of those columns computed by groupby.transform.
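Putting it together, a minimal self-contained sketch of the corrected fill_na (the constructor here takes the data explicitly instead of relying on globals, which is an adaptation of the original class, and the sample frame comes from the question):
import numpy as np
import pandas as pd

class PLUMBER:
    def __init__(self, df, numerical_feature_names, label_column_names):
        self.df = df
        self.numerical_feature_names = numerical_feature_names
        self.label_column_names = label_column_names

    def fill_na(self):
        df = self.df
        # pass the list directly to groupby (no extra brackets) and fill NaN with per-group means
        df[self.numerical_feature_names] = df[self.numerical_feature_names].fillna(
            df.groupby(self.label_column_names)[self.numerical_feature_names].transform('mean')
        )
        return df

d = {'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
     'number': [np.nan, 450, 299, np.nan, 19, 29],
     'age': [np.nan, 30, 28, np.nan, 29, 18]}
plumber = PLUMBER(pd.DataFrame(d), ['number', 'age'], ['country'])
print(plumber.fill_na())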

Finding the min of a column across multiple lists in python

I need to find the minimum and maximum of a given column from a csv file. Currently the value is a string, but I need it to be an integer. Right now, my output after I have split all the lines into lists looks like this:
['FRA', 'Europe', 'France', '14/06/2020', '390', '10\n']
['FRA', 'Europe', 'France', '11/06/2020', '364', '27\n']
['FRA', 'Europe', 'France', '12/06/2020', '802', '28\n']
['FRA', 'Europe', 'France', '13/06/2020', '497', '24\n']
And from that line, along with its many others, I want to find the minimum of the 5th column, but currently when I do
min(column[4])
it just gives the min of each individual list, which is just the number in that column, rather than grouping them all up and getting the minimum of all of them.
P.S.: I am very new to Python and coding in general; I also have to do this without importing any modules.
For you, Azro:
def main(csvfile, country, analysis):
    infile = csvfile
    datafile = open(infile, "r")
    country = country.capitalize()
    if analysis == "statistics":
        for line in datafile.readlines():
            column = line.split(",")
            if column[2] == country:
You may use pandas, which allows you to read a csv file and manipulate it as a DataFrame; then it's very easy to retrieve the min/max of a column:
import pandas as pd
df = pd.read_csv("test.txt", sep=',')
mini = df['colName'].min()
maxi = df['colName'].max()
print(mini, maxi)
Then, if you have already read your data into a list of lists, you may use the builtin min and max:
# use rstrip() when reading line, to remove leading \n
values = [
    ['FRA', 'Europe', 'France', '14/06/2020', '390', '10'],
    ['FRA', 'Europe', 'France', '14/06/2020', '395', '10']
]
mini = min(values, key=lambda x: int(x[4]))[4]
maxi = max(values, key=lambda x: int(x[4]))[4]
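If you need the result as an integer rather than the original string, one sketch of the same idea converts while scanning:
mini = min(int(row[4]) for row in values)   # 390
maxi = max(int(row[4]) for row in values)   # 395
print(mini, maxi)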
Take a look at the library pandas and especially the DataFrame class. This is probably the go-to method for handling .csv files and tabular data in general.
Essentially, your code would be something like this:
import pandas as pd
df = pd.read_csv('my_file.csv') # Construct a DataFrame from a csv file
print(df.columns) # check to see which column names the dataframe has
print(df['My Column'].min())
print(df['My Column'].max())
There are shorter ways to do this. But this example goes step by step:
# After you read a CSV file, you'll have a bunch of rows.
rows = [
    ['A', '390', '...'],
    ['B', '750', '...'],
    ['C', '207', '...'],
]
# Grab a column that you want.
col = [row[1] for row in rows]
# Convert strings to integers.
vals = [int(s) for s in col]
# Print max.
print(max(vals))
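One of the shorter ways mentioned above, collapsed into a single expression:
print(max(int(row[1]) for row in rows))   # 750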

Add same value to multiple sets of rows. The value changes based on condition

I have a dataframe that is dynamically created.
I create my first set of rows as:
df['tourist_spots'] = pd.Series(<A list of tourist spots in a city>)
To this df I add:
df['city'] = <City Name>
So far so good. A bunch of rows are created with the same city name for multiple tourist spots.
I want to add a new city. So I do:
df['tourist_spots'].append(pd.Series(<new data>))
Now, when I append a new city with:
df['city'].append('new city')
the previously updated city data is gone. It is as if the rows are replaced each time rather than appended.
Here's an example of what I want:
Step 1:
df['tourist_spot'] = pd.Series('Golden State Bridge' + a bunch of other spots)
For all the rows created by the above data I want:
df['city'] = 'San Francisco'
Step 2:
df['tourist_spot'].append(pd.Series('Times Square' + a bunch of other spots))
For all the rows created by the above data, I want:
df['city'] = 'New York'
How can I achieve this?
Use a dictionary to add rows to your data frame; it is a faster method.
Here is an example.
STEP 1
Create a list of dictionaries (one per row):
dict_df = [{'tourist_spots': 'Jones LLC', 'City': 'Boston'},
           {'tourist_spots': 'Alpha Co', 'City': 'Boston'},
           {'tourist_spots': 'Blue Inc', 'City': 'Singapore'}]
STEP 2
Convert it to a dataframe:
df = pd.DataFrame(dict_df)
STEP 3
Add new entries to the dataframe in dictionary format:
df = df.append({'tourist_spots': 'New_Blue', 'City': 'Singapore'}, ignore_index=True)
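Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on newer versions a sketch of the same step with pd.concat would be:
new_row = pd.DataFrame([{'tourist_spots': 'New_Blue', 'City': 'Singapore'}])
df = pd.concat([df, new_row], ignore_index=True)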
References:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_dict.html

Pythonic way to handle complex nested dict structures

My codebase relies on managing data that's currently in a very deeply nested dictionary. Example:
{'USA': {
    'Texas': {
        'Austin': {
            '2017-01-01': 169,
            '2017-02-01': 231
        },
        'Houston': {
            '2017-01-01': 265,
            '2017-02-01': 310
        }
    }
}}
This extends for multiple countries, states/regions, cities, and dates.
I encounter a problem when trying to access values since I need to have a deeply nested for-loop to iterate over each country, state, city, and date to apply some kind of operation. I'm looking for some kind of alternative.
Assuming the nested dict structure is the same, is there an alternative to so many loops? Perhaps using map, reduce or lambda?
Is there a better way to store all of this data without using nested dicts?
You can use a Pandas DataFrame object (Pandas DataFrame Documentation), which can store your data in a tabular format, similar to a spreadsheet. In that case, your DataFrame should have a column to represent each key in your nested data (one column for Country, another for State, and so on).
Pandas DataFrames also support filtering, grouping and other useful operations based on your records (rows) for each column. Let's say you want to filter your data to return only the rows from Texas dated after '2018-02-01' (df is your DataFrame). This could be achieved with something like this:
df[(df['State'] == 'Texas') & (df['Date'] > '2018-02-01')]
To build these DataFrame objects, you could start from your data formatted as a collection of records:
data = [['USA', 'Texas', 'Austin', '2017-01-01', 169],
        ['USA', 'Texas', 'Austin', '2017-02-01', 231],
        ['USA', 'Texas', 'Houston', '2017-01-01', 265],
        ['USA', 'Texas', 'Houston', '2017-02-01', 310]]
and then build them like this:
df = pd.DataFrame(data, columns=['Country', 'State', 'City', 'Date', 'Value'])
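If your data currently lives in the nested dict from the question, a minimal sketch flattening it into that record layout (the nested iteration then only runs once, when loading):
import pandas as pd

nested = {'USA': {'Texas': {'Austin': {'2017-01-01': 169, '2017-02-01': 231},
                            'Houston': {'2017-01-01': 265, '2017-02-01': 310}}}}

# walk country -> state -> city -> date once and emit one record per leaf value
records = [(country, state, city, date, value)
           for country, states in nested.items()
           for state, cities in states.items()
           for city, dates in cities.items()
           for date, value in dates.items()]

df = pd.DataFrame(records, columns=['Country', 'State', 'City', 'Date', 'Value'])
print(df)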
If DataFrame objects are not an option, and you do not want to use nested loops, you could also access inner data using list comprehensions with nested predicates and filters:
[
    d[country][state][city][date]
    for country in d.keys()
    for state in d[country].keys()
    for city in d[country][state].keys()
    for date in d[country][state][city].keys()
    if country == 'USA' and state == 'Texas' and city == 'Houston'
]
However, I can not see much difference in that approach over the nested loops, and there is a penalty in code readability, imho.
Using the collection-of-records approach shown earlier (data), instead of a nested structure, you could filter your rows using:
[r for r in data if r[2] == 'Houston']
For improved readability, you could use a list of namedtuple objects as your list of records. Your data would be:
from collections import namedtuple
record = namedtuple('Record', 'country state city date value')
data = [
    record('USA', 'Texas', 'Austin', '2017-01-01', 169),
    record('USA', 'Texas', 'Austin', '2017-02-01', 231),
    record('USA', 'Texas', 'Houston', '2017-01-01', 265),
    record('USA', 'Texas', 'Houston', '2017-02-01', 310)
]
and your filtering would be improved, e.g.:
Getting specific records
[r for r in data if r.city == 'Houston']
returning
[
    Record(country='USA', state='Texas', city='Houston', date='2017-01-01', value=265),
    Record(country='USA', state='Texas', city='Houston', date='2017-02-01', value=310)
]
Getting only the values for those specific records
[r.value for r in data if r.city == 'Houston']
returning
[265, 310]
This last approach can also deal with custom object instances, considering that namedtuple objects can store them easily.
You can create a class that implements the __getitem__ overloading method and uses recursion:
d = {'USA': {
        'Texas': {
            'Austin': {
                '2017-01-01': 169,
                '2017-02-01': 231
            },
            'Houston': {
                '2017-01-01': 265,
                '2017-02-01': 310
            }
        }
    }
}
class StateData:
    def __init__(self, structure):
        self.structure = structure
        self.levels = {'country': 0, 'state': 1, 'city': 2, 'date': 3}

    def get_level(self, d, target, current=0):
        total_listing = [((a, b) if target == 3 else a) if current == target
                         else self.get_level(b, target, current + 1)
                         for a, b in d.items()]
        return [i for b in total_listing for i in b] if all(isinstance(i, list) for i in total_listing) else total_listing

    def __getitem__(self, val):
        return self.get_level(self.structure, self.levels[val])
s = StateData(d)
print(s['city'])
print(s['date'])
Output:
['Austin', 'Houston']
[('2017-01-01', 169), ('2017-02-01', 231), ('2017-01-01', 265), ('2017-02-01', 310)]
It may be best to store your data as a list of lists, which will then make it possible for you to group according to the needs of each individual operation. For instance:
state_data = [['USA', 'Texas', 'Austin', '2017-01-01', 169],
              ['USA', 'Texas', 'Austin', '2017-02-01', 231],
              ['USA', 'Texas', 'Houston', '2017-01-01', 265],
              ['USA', 'Texas', 'Houston', '2017-02-01', 310]]
There's also pandas' json_normalize.
It allows you to flatten a dict into a pandas dataframe like this:
import json
from pandas.io.json import json_normalize

d = json.load(f)  # f is an open file handle to your JSON file
# Parent node of dict d is 'programs'
n = json_normalize(d['programs'])
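On pandas 1.0+ json_normalize is exposed at the top level as pd.json_normalize. A rough sketch, assuming the JSON has already been reshaped into record-oriented form (the 'dates' field below is a hypothetical nesting, only there to show how nested keys become dotted column names):
import pandas as pd

programs = [{'country': 'USA', 'state': 'Texas', 'city': 'Austin',
             'dates': {'2017-01-01': 169, '2017-02-01': 231}}]
df = pd.json_normalize(programs)
# nested keys are flattened into columns such as 'dates.2017-01-01'
print(df.columns.tolist())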
