Faster way to iterate over columns in pandas - python

I have the following task.
I have this data:
import pandas
import numpy as np
data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
        'phone': [912341.0, np.nan, 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
        ' email': ['todd#gmail.com', 'chris#gmail.com', np.nan, 'ben#gmail.com', np.nan, np.nan, 'joe#gmail.com', 'rick#gmail.com'],
        'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
        'most_visited_place': ['Turkey', 'Spain', np.nan, 'Germany', 'Germany', 'Spain', np.nan, 'Spain']}
df = pandas.DataFrame(data)
For every feature column (most_visited_airport etc.) and each of its values (Heathrow, Beijing, Tokyo), I have to collect personal information and write it to a file.
E.g. for most_visited_airport and Heathrow,
I need to output three files containing the names, emails and phones of the people who visited that airport the most.
Currently, I have this code to do the operation for both columns and all the values:
columns_to_iterate = [x for x in df.columns if 'most' in x]
for each in df[columns_to_iterate]:
    values = df[each].dropna().unique()
    for i in values:
        df1 = df.loc[df[each] == i, 'name']
        df2 = df.loc[df[each] == i, ' email']
        df3 = df.loc[df[each] == i, 'phone']
        df1.to_csv(f'{each}_{i}_{df1.name}.csv')
        df2.to_csv(f'{each}_{i}_{df2.name}.csv')
        df3.to_csv(f'{each}_{i}_{df3.name}.csv')
Is it possible to do this in a more elegant and maybe faster way? Currently I have a small dataset, but I'm not sure this code will perform well on big data. My particular concern is the nested loops.
Thank you in advance!

You could replace the call to unique with a groupby, which not only gets the unique values but also splits up the dataframe for you (note that the email column in your data is named ' email', with a leading space):
for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        for attr in ('name', 'phone', ' email'):
            group[attr].dropna().to_csv(f'{column}_{key}_{attr}.csv')
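To see what groupby yields here, a minimal sketch with a toy frame standing in for df:

```python
import pandas as pd

toy = pd.DataFrame({'name': ['Todd', 'Chris', 'Jackie'],
                    'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow']})

# groupby deduplicates and splits in one pass: one (key, sub-frame) pair
# per unique non-NaN value in the column.
groups = {key: group for key, group in toy.groupby('most_visited_airport')}
print(sorted(groups))                       # ['Beijing', 'Heathrow']
print(groups['Heathrow']['name'].tolist())  # ['Todd', 'Jackie']
```

Rows where the grouping column is NaN are dropped automatically, which is why the explicit dropna/unique pair is no longer needed.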

You can do it this way:
cols = df.filter(regex='most').columns.values

def func_current_cols_to_csv(most_col):
    place = df[most_col].dropna().unique().tolist()
    csv_cols = ['name', 'phone', ' email']
    return [df[df[most_col] == i][j].dropna().to_csv(f'{most_col}_{i}_{j}.csv', index=False)
            for i in place for j in csv_cols]

[func_current_cols_to_csv(i) for i in cols]
You can also keep the index when writing to CSV, but don't forget to reset it before writing.
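On that last point: a boolean-filtered Series keeps its original row labels, so resetting the index before writing gives a clean 0..n-1 index in the file. A small sketch:

```python
import pandas as pd

# A filtered Series typically has gaps in its index (e.g. rows 0 and 4 survived)
s = pd.Series(['Todd', 'Richard'], index=[0, 4], name='name')

# reset_index(drop=True) renumbers the surviving rows 0..n-1
clean = s.reset_index(drop=True)
print(clean.index.tolist())  # [0, 1]
```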

Related

Add a calculated column to a pivot table in pandas

Hi, I am trying to add new columns to a multi-indexed pandas pivot table to do a countif statement (similar to Excel), depending on whether a level of the index contains a specific string. This is the sample data:
import pandas as pd

df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover', 'Adak', 'Denver', 'Houston', 'Adak', 'Denver'],
                   'State': ['Texas', 'Texas', 'Alabama', 'Alaska', 'Colorado', 'Texas', 'Alaska', 'Colorado'],
                   'Name': ['Aria', 'Penelope', 'Niko', 'Susan', 'Aria', 'Niko', 'Aria', 'Niko'],
                   'Unit': ['Sales', 'Marketing', 'Operations', 'Sales', 'Operations', 'Operations', 'Sales', 'Operations'],
                   'Assigned': ['Yes', 'No', 'Maybe', 'No', 'Yes', 'Yes', 'Yes', 'Yes']},
                  columns=['City', 'State', 'Name', 'Unit', 'Assigned'])
pivot = df.pivot_table(index=['City', 'State'], columns=['Name', 'Unit'], values=['Assigned'],
                       aggfunc=lambda x: ', '.join(set(x)), fill_value='')
and this is the desired output (in screenshot). Thanks in advance!
try:
temp = pivot[('Assigned', 'Aria', 'Sales')].str.len() > 0
pivot['new col'] = temp.astype(int)
the result:
Based on your edit:
import numpy as np

temp = pivot.xs('Sales', level=2, drop_level=False, axis=1).apply(
    lambda x: np.sum([1 if y != '' else 0 for y in x]), axis=1)
pivot[('', 'total sales', 'count how many...')] = temp
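The apply above just counts non-empty cells per row; the same count can be computed vectorized by comparing against the empty string. A sketch on a reduced version of the sample data:

```python
import pandas as pd

df = pd.DataFrame({'City': ['Houston', 'Adak'],
                   'State': ['Texas', 'Alaska'],
                   'Name': ['Aria', 'Susan'],
                   'Unit': ['Sales', 'Sales'],
                   'Assigned': ['Yes', 'No']})
pivot = df.pivot_table(index=['City', 'State'], columns=['Name', 'Unit'],
                       values=['Assigned'],
                       aggfunc=lambda x: ', '.join(set(x)), fill_value='')

# Boolean comparison + sum replaces the per-row Python loop
counts = (pivot.xs('Sales', level=2, axis=1) != '').sum(axis=1)
print(counts.tolist())
```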

Removing certain items in a pandas.core.series

Given a pandas.core.series, is it possible to remove certain items from each index and retain the ones that I want to keep? For example:
country=['China','USA','Asia','Brazil']
contents={'Tags': ['China;Panda;Brazil;Plug','USA;China;Asia','Brazil;Peanut']}
df=pd.DataFrame(contents)
tags=df["Tags"]
I wish to discard the values that are not in the country list and keep the rest. So for tags[0] the result should be ['China', 'Brazil']; for tags[1] all the values remain; and for tags[2] the result should be ['Brazil'].
tags = tags.str.split(';')
I have attempted to split on the ';' between each value but I am uncertain how to proceed.
For each list after splitting, keep only the values that are in the country list:
[[val for val in a_list if val in country] for a_list in df.Tags.str.split(";")]
to get
[['China', 'Brazil'], ['USA', 'China', 'Asia'], ['Brazil']]
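A side note on scaling: with a four-element country list this hardly matters, but membership tests against a set are O(1) rather than O(n), so converting first can help on large data. A sketch:

```python
country = ['China', 'USA', 'Asia', 'Brazil']
tags = ['China;Panda;Brazil;Plug', 'USA;China;Asia', 'Brazil;Peanut']

country_set = set(country)  # O(1) lookups instead of scanning the list
kept = [[v for v in t.split(';') if v in country_set] for t in tags]
print(kept)  # [['China', 'Brazil'], ['USA', 'China', 'Asia'], ['Brazil']]
```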
Exploding and regrouping is also an option:
import pandas as pd
country = ['China', 'USA', 'Asia', 'Brazil']
contents = {'Tags': ['China;Panda;Brazil;Plug',
                     'USA;China;Asia',
                     'Brazil;Peanut']}
df = pd.DataFrame(contents)
e = df.Tags.str.split(';').explode()
tags = e[e.isin(country)].groupby(level=0).agg(list).tolist()
print(tags)
tags:
[['China', 'Brazil'], ['USA', 'China', 'Asia'], ['Brazil']]
An option with merge rather than isin:
import pandas as pd
country = ['China', 'USA', 'Asia', 'Brazil']
contents = {'Tags': ['China;Panda;Brazil;Plug',
                     'USA;China;Asia',
                     'Brazil;Peanut']}
df = pd.DataFrame(contents)
e = df.Tags.str.split(';').explode().reset_index()
tags = e.merge(
    pd.DataFrame(country, columns=['Tags']),
    on='Tags'
).groupby('index')['Tags'].agg(list).tolist()
print(tags)
tags:
[['China', 'Brazil'], ['China', 'USA', 'Asia'], ['Brazil']]
You can create a function to do the parsing and copy the information to a new column. So, for example, something like:
from functools import partial
import numpy as np

def pars_func(countries, content_str):
    ret_str = ''
    for tag in content_str.split(';'):
        if tag in countries:
            ret_str = ret_str + ';' + tag
    if ret_str == '':
        return np.nan
    else:
        return ret_str[1:].split(';')

df['Tags'] = df['Tags'].apply(partial(pars_func, country))

Finding the min of a column across multiple lists in python

I need to find the minimum and maximum of a given column from a csv file, and currently the value is a string but I need it to be an integer. Right now, after I have split all the lines into lists, my output looks like this:
['FRA', 'Europe', 'France', '14/06/2020', '390', '10\n']
['FRA', 'Europe', 'France', '11/06/2020', '364', '27\n']
['FRA', 'Europe', 'France', '12/06/2020', '802', '28\n']
['FRA', 'Europe', 'France', '13/06/2020', '497', '24\n']
And from those lines I want to find the minimum of the 5th column, but currently when I do
min(column[4])
it just gives the min of each individual list, which is just the number in that column, rather than grouping them all up and getting the overall minimum.
P.S: I am very new to python and coding in general, I also have to do this without any importing of modules.
For you Azro.
def main(csvfile, country, analysis):
    infile = csvfile
    datafile = open(infile, "r")
    country = country.capitalize()
    if analysis == "statistics":
        for line in datafile.readlines():
            column = line.split(",")
            if column[2] == country:
                ...
You may use pandas, which allows you to read csv files and manipulate them as DataFrames; then it's very easy to retrieve the min/max of a column:
import pandas as pd
df = pd.read_csv("test.txt", sep=',')
mini = df['colName'].min()
maxi = df['colName'].max()
print(mini, maxi)
Then, if you have already read your data into a list of lists, you may use the builtin min and max:
# use rstrip() when reading line, to remove leading \n
values = [
['FRA', 'Europe', 'France', '14/06/2020', '390', '10'],
['FRA', 'Europe', 'France', '14/06/2020', '395', '10']
]
mini = min(values, key=lambda x: int(x[4]))[4]
maxi = max(values, key=lambda x: int(x[4]))[4]
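Note that min with a key returns the whole matching row, and indexing [4] still hands back a string. When only the numbers matter, converting first avoids comparing strings at all ('90' sorts after '802' lexicographically). A sketch:

```python
values = [
    ['FRA', 'Europe', 'France', '14/06/2020', '390', '10'],
    ['FRA', 'Europe', 'France', '12/06/2020', '802', '28'],
    ['FRA', 'Europe', 'France', '13/06/2020', '497', '24'],
]

# Convert the 5th column to int once, then take min/max as numbers
nums = [int(row[4]) for row in values]
print(min(nums), max(nums))  # 390 802
```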
Take a look at the library pandas and especially the DataFrame class. This is probably the go-to method for handling .csv files and tabular data in general.
Essentially, your code would be something like this:
import pandas as pd
df = pd.read_csv('my_file.csv') # Construct a DataFrame from a csv file
print(df.columns) # check to see which column names the dataframe has
print(df['My Column'].min())
print(df['My Column'].max())
There are shorter ways to do this. But this example goes step by step:
# After you read a CSV file, you'll have a bunch of rows.
rows = [
    ['A', '390', '...'],
    ['B', '750', '...'],
    ['C', '207', '...'],
]
# Grab a column that you want.
col = [row[1] for row in rows]
# Convert strings to integers.
vals = [int(s) for s in col]
# Print max.
print(max(vals))

Python list of dictionaries aggregate values

Here is an example input:
[{'name': 'susan', 'wins': 1, 'team': 'team1'},
 {'name': 'jack', 'wins': 1, 'team': 'team2'},
 {'name': 'susan', 'wins': 1, 'team': 'team1'}]
Desired output
[{'name': 'susan', 'wins': 2, 'team': 'team1'},
 {'name': 'jack', 'wins': 1, 'team': 'team2'}]
I have lots of these dictionaries and want to add up only the 'wins' values, grouped by the 'name' value, while keeping the 'team' values.
I've tried to use Counter, but the result was
{'name': 'all the names added together',
 'wins': 'all the wins added together'}
I was able to use defaultdict which seemed to work
result = defaultdict(int)
for d in data:
    result[d['name']] += d['wins']
but the result was something like
{'susan': 2, 'jack': 1}
Here it added the values correctly but didn't keep the 'team' key.
I guess I'm confused about defaultdict and how it works.
Any help is very appreciated.
Did you consider using pandas?
import pandas as pd

dicts = [
    {'name': 'susan', 'wins': 1, 'team': 'team1'},
    {'name': 'jack', 'wins': 1, 'team': 'team2'},
    {'name': 'susan', 'wins': 1, 'team': 'team1'},
]

agg_by = ["name", "team"]
df = pd.DataFrame(dicts)
df = df.groupby(agg_by)["wins"].sum()
df = df.reset_index()
aggregated_dict = df.to_dict("records")
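If you'd rather stay in pure Python: defaultdict(int) only stores the counter, which is why 'team' disappeared. Keeping the whole record in a dict keyed by name preserves it (a sketch):

```python
data = [{'name': 'susan', 'wins': 1, 'team': 'team1'},
        {'name': 'jack', 'wins': 1, 'team': 'team2'},
        {'name': 'susan', 'wins': 1, 'team': 'team1'}]

merged = {}
for d in data:
    if d['name'] in merged:
        merged[d['name']]['wins'] += d['wins']  # accumulate wins only
    else:
        merged[d['name']] = dict(d)  # copy so the input isn't mutated
result = list(merged.values())
print(result)
```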

Python pandas: appending information from a dictionary to rows while looping through dataframe

I would like to know a better way to append information to a dataframe while in a loop, specifically, to add COLUMNS of information to a dataframe from a dictionary. The code below technically works, but in subsequent analyses I would like to preserve the data classifications of numpy/pandas to be able to efficiently classify missing data or odd values as np.nan or null. Any tips would be great.
raw_data = {'first_name': ['John', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 17, 16, 24, '']}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age'])
headers = df.columns.values
count = 0
adults = {'John': True, 'Molly': False}
for index, row in df.iterrows():
    count += 1
    if str(row['first_name']) in adults:
        adult = adults[str(row['first_name'])]
    else:
        adult = 'null'
    headers = np.append(headers, 'ADULT')
    vals = np.append(row.values, adult)
    if count == 1:
        print(','.join(headers.tolist()))
    print(str(vals.tolist()).replace('[', '').replace(']', '').replace("'", ""))
Output:
first_name,last_name,age,ADULT
John, Miller, 42, True
Molly, Jacobson, 20, True
Tina, Ali, 16, NA
Jake, Milner, 24, NA
Amy, Cooze, , NA
Instead of a loop, I think you can simply use apply with a lambda containing an if/else condition:
df['ADULT'] = df['first_name'].apply(lambda v: adults[v] if v in adults else np.nan)
print(df.to_csv(index=False, na_rep='NA'))
# Output is:
# first_name,last_name,age,ADULT
# John,Miller,42,True
# Molly,Jacobson,17,False
# Tina,Ali,16,NA
# Jake,Milner,24,NA
# Amy,Cooze,,NA
In the above, adults[v] if v in adults else np.nan checks whether v (the first_name of each row) is in the dictionary; if it is, its value is kept for the new column, otherwise np.nan is used.
You can use to_csv to print in the above format: without a filename it returns a comma-separated string, and na_rep specifies the string to use for missing values.
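The same lookup can also be written with dict.get, which takes a default for missing keys. A sketch on a reduced frame:

```python
import numpy as np
import pandas as pd

adults = {'John': True, 'Molly': False}
df = pd.DataFrame({'first_name': ['John', 'Molly', 'Tina']})

# dict.get(v, default) avoids the explicit `if v in adults` test
df['ADULT'] = df['first_name'].apply(lambda v: adults.get(v, np.nan))
print(df['ADULT'].tolist())  # [True, False, nan]
```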
