Sum up data based on multiple factors in a dataframe - python

I have a data frame of different stores, each with multiple departments (1-99, but varying). I want to sum up the revenue of all departments for each shop for each week.
Is there a more elegant way than using for loops and if statements? I'm using Python with pandas.
Here's a photo of the table:
merged = walmart.merge(stores, how='left').merge(features, how='left')
testing_merged = testing.merge(stores, how='left').merge(features, how='left')
df = pd.DataFrame(data={
    "Store": merged.Store, "Dept": merged.Dept, "Date": merged.Date,
    "Weekly_Sales": merged.Weekly_Sales, "IsHoliday": merged.IsHoliday,
    "Type": merged.Type, "Size": merged.Size, "Temperature": merged.Temperature,
    "Fuel_Price": merged.Fuel_Price, "MarkDown1": merged.MarkDown1,
    "MarkDown2": merged.MarkDown2, "MarkDown3": merged.MarkDown3,
    "MarkDown4": merged.MarkDown4, "MarkDown5": merged.MarkDown5,
    "CPI": merged.CPI, "Unemployment": merged.Unemployment})

Use groupby like the following, grouping by shop and week and summing the sales over all departments (with the column names above):
df.groupby(['Store', 'Date'])['Weekly_Sales'].sum()
Documentation : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
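A minimal runnable sketch of that grouped sum, using toy data with the column names assumed above (Store, Date, Dept, Weekly_Sales):

```python
import pandas as pd

# Toy frame: two stores, two weeks, two departments each
df = pd.DataFrame({
    "Store": [1, 1, 1, 1, 2, 2],
    "Date": ["2012-01-06", "2012-01-06", "2012-01-13",
             "2012-01-13", "2012-01-06", "2012-01-06"],
    "Dept": [1, 2, 1, 2, 1, 2],
    "Weekly_Sales": [100.0, 50.0, 80.0, 70.0, 30.0, 20.0],
})

# Sum revenue over all departments, per store and per week
weekly = df.groupby(["Store", "Date"], as_index=False)["Weekly_Sales"].sum()
print(weekly)
```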

Related

Detecting duplicates in pandas when a column contains lists

Is there a reasonable way to detect duplicates in a Pandas dataframe when a column contains lists or numpy nd arrays, like the example below? I know I could convert the lists into strings, but the act of converting back and forth feels... wrong. Plus, lists seem more legible and convenient given how I got here (online code) and where I'm going after.
import pandas as pd

df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "ingredients": [
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredC"],
            ["ingredA", "ingredB", "ingredD"],
            ["ingredA", "ingredB", "ingredD", "ingredE"],
            ["ingredB", "ingredC", "ingredF"],
        ],
    }
)
# Traditional find duplicates
# df[df.duplicated(keep=False)]
# Avoiding pandas duplicated function (question 70596016 solution)
i = [hash(tuple(r.values())) for r in df.to_dict(orient="records")]
j = [i.count(k) > 1 for k in i]
df[j]
Both methods (the latter from this alternative find duplicates answer) result in
TypeError: unhashable type: 'list'.
They would work, of course, if the dataframe looked like this:
df = pd.DataFrame(
    {
        "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
        "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
        "recipe": [
            "recipeC",
            "recipeC",
            "recipeD",
            "recipeE",
            "recipeF",
        ],
    }
)
Which made me wonder if something like integer encoding might be reasonable? It's not that different from converting to/from strings, but at least it's legible. Alternatively, suggestions for converting to a single string of ingredients per row directly from the starting dataframe in the code link above would be appreciated (i.e., avoiding lists altogether).
With map tuple (converting the list column, here ingredients, to hashable tuples before calling duplicated):
out = df[df.assign(ingredients=df['ingredients'].map(tuple)).duplicated(keep=False)]
Out[295]:
   author        date                  ingredients
0  Jefe98  1423112400  [ingredA, ingredB, ingredC]
1  Jefe98  1423112400  [ingredA, ingredB, ingredC]
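Put together as a runnable sketch, using the same sample data as the question:

```python
import pandas as pd

df = pd.DataFrame({
    "author": ["Jefe98", "Jefe98", "Alex", "Alex", "Qbert"],
    "date": [1423112400, 1423112400, 1603112400, 1423115600, 1663526834],
    "ingredients": [
        ["ingredA", "ingredB", "ingredC"],
        ["ingredA", "ingredB", "ingredC"],
        ["ingredA", "ingredB", "ingredD"],
        ["ingredA", "ingredB", "ingredD", "ingredE"],
        ["ingredB", "ingredC", "ingredF"],
    ],
})

# Tuples are hashable, so duplicated() works on the converted copy;
# indexing the original frame keeps the list column intact in the output.
dupes = df[df.assign(ingredients=df["ingredients"].map(tuple)).duplicated(keep=False)]
print(dupes)
```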

Remove duplicate values while group by in pandas data frame

given input data frame
Required output
I am able to achieve this using groupby as df_ = (df.groupby("entity_label", sort=True)["entity_text"].apply(tuple).reset_index(name="entity_text")), but duplicates remain in the output tuple.
You can use SeriesGroupBy.unique() to get the unique values of entity_text before applying tuple to the list, as follows:
(df.groupby("entity_label", sort=False)["entity_text"]
   .unique()
   .apply(tuple)
   .reset_index(name="entity_text")
)
Result:
entity_label entity_text
0 job_title (Full Stack Developer, Senior Data Scientist, Python Developer)
1 country (India, Malaysia, Australia)
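A self-contained version of that chain, with an abbreviated stand-in for the sample data (the column names entity_label and entity_text come from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "entity_label": ["job_title", "job_title", "job_title",
                     "country", "country", "country"],
    "entity_text": ["Full Stack Developer", "Python Developer", "Python Developer",
                    "India", "Malaysia", "India"],
})

out = (df.groupby("entity_label", sort=False)["entity_text"]
         .unique()                # drops duplicates, keeps first-seen order
         .apply(tuple)
         .reset_index(name="entity_text"))
print(out)
```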
Try this:
import pandas as pd
df = pd.DataFrame({
    'entity_label': ["job_title", "job_title", "job_title", "job_title",
                     "country", "country", "country", "country", "country"],
    'entity_text': ["full stack developer", "senior data scientiest", "python developer",
                    "python developer", "Inida", "Malaysia", "India", "Australia", "Australia"],
})
df.drop_duplicates(inplace=True)
df['entity_text'] = df.groupby('entity_label')['entity_text'].transform(lambda x: ','.join(x))
df.drop_duplicates().reset_index().drop(['index'], axis='columns')
output:
entity_label entity_text
0 job_title full stack developer,senior data scientiest,py...
1 country Inida,Malaysia,India,Australia

Convert two CSV tables with one-to-many relation to JSON with embedded list of subdocuments

I have two CSV files which have one-to-many relation between them.
main.csv:
"main_id","name"
"1","foobar"
attributes.csv:
"id","main_id","name","value","updated_at"
"100","1","color","red","2020-10-10"
"101","1","shape","square","2020-10-10"
"102","1","size","small","2020-10-10"
I would like to convert this to JSON of this structure:
[
{
"main_id": "1",
"name": "foobar",
"attributes": [
{
"id": "100",
"name": "color",
"value": "red",
"updated_at": "2020-10-10"
},
{
"id": "101",
"name": "shape",
"value": "square",
"updated_at": "2020-10-10"
},
{
"id": "103",
"name": "size",
"value": "small",
"updated_at": "2020-10-10"
}
]
}
]
I tried using Python and Pandas like:
import pandas

def transform_group(group):
    group.reset_index(inplace=True)
    group.drop('main_id', axis='columns', inplace=True)
    return group.to_dict(orient='records')

main = pandas.read_csv('main.csv')
attributes = pandas.read_csv('attributes.csv', index_col=0)
attributes = attributes.groupby('main_id').apply(transform_group)
attributes.name = "attributes"
main = main.merge(
    right=attributes,
    on='main_id',
    how='left',
    validate='m:1',
    copy=False,
)
main.to_json('out.json', orient='records', indent=2)
It works. But the issue is that it does not seem to scale. When running on my whole dataset I have, I can load individual CSV files without problems, but when trying to modify data structure before calling to_json, memory usage explodes.
So is there a more efficient way to do this transformation? Maybe there is some Pandas feature I am missing? Or is there some other library to use? Moreover, use of apply seems to be pretty slow here.
This is a tough problem and we have all felt your pain.
There are three ways I would attack this problem. First, groupby is slower if you allow pandas to do the break out.
import pandas as pd
import numpy as np
from collections import defaultdict
df = pd.DataFrame({'id': np.random.randint(0, 100, 5000),
                   'name': np.random.randint(0, 100, 5000)})
now if you do the standard groupby
groups = []
for k, rows in df.groupby('id'):
    groups.append(rows)
you will find that
groups = defaultdict(list)
for id, name in df.values:
    groups[id].append((id, name))
is about 3 times faster.
The second method is to change it to use Dask and its parallelization. A discussion about Dask: what is Dask and how is it different from pandas.
The third is algorithmic: load up the main file, then go ID by ID, loading only the data for that ID, balancing what is in memory against what is on disk, and saving out partial results as they become available.
So in my case I was able to load the original tables into memory, but doing the embedding exploded the size so that it did not fit into memory anymore. So I ended up still using Pandas to load the CSV files, but then I generate row by row iteratively, saving each row into a separate JSON document. This means I do not hold one large data structure in memory for one large JSON.
Another important realization was that it is important to make the related column an index, and that it has to be sorted, so that querying it is fast (because generally there are duplicate entries in the related column).
I made the following two helper functions:
def get_related_dict(related_table, label):
    assert related_table.index.is_unique
    if pandas.isna(label):
        return None
    row = related_table.loc[label]
    assert isinstance(row, pandas.Series), label
    result = row.to_dict()
    result[related_table.index.name] = label
    return result

def get_related_list(related_table, label):
    # Important to be more performant when selecting non-unique labels.
    assert related_table.index.is_monotonic_increasing
    try:
        # We use this syntax to always get a DataFrame, not a Series, when only one row matches.
        return related_table.loc[[label], :].to_dict(orient='records')
    except KeyError:
        return []
And then I do:
import json
import sys

main = pandas.read_csv('main.csv', index_col=0)
attributes = pandas.read_csv('attributes.csv', index_col=1)
# We sort the index to be more performant when selecting non-unique labels. We use a stable sort.
attributes.sort_index(inplace=True, kind='mergesort')
columns = [main.index.name] + list(main.columns)
for row in main.itertuples(index=True, name=None):
    assert len(columns) == len(row)
    data = dict(zip(columns, row))
    data['attributes'] = get_related_list(attributes, data['main_id'])
    json.dump(data, sys.stdout, indent=2)
    sys.stdout.write("\n")
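The sorted-index lookup at the heart of this approach can be exercised in a self-contained sketch, with inline frames standing in for main.csv and attributes.csv:

```python
import pandas as pd

# Inline stand-ins for main.csv and attributes.csv
main = pd.DataFrame({"main_id": [1], "name": ["foobar"]}).set_index("main_id")
attributes = pd.DataFrame({
    "id": [100, 101, 102],
    "main_id": [1, 1, 1],
    "name": ["color", "shape", "size"],
    "value": ["red", "square", "small"],
}).set_index("main_id").sort_index(kind="mergesort")  # stable sort, as above

rows = []
for main_id, row in main.iterrows():
    data = {"main_id": int(main_id), **row.to_dict()}
    try:
        # .loc[[label]] always yields a DataFrame, even for a single match
        data["attributes"] = attributes.loc[[main_id]].to_dict(orient="records")
    except KeyError:
        data["attributes"] = []
    rows.append(data)

print(rows)
```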

Getting a unique count within a range of months in a Pandas DataFrame

I have a Dataframe which contains customer names, amount of orders and the date in which they ordered.
I want to know, how many customers I had in a range of months. So, count the unique customer names within June to October, to put an example.
I have tried Cust_per_month = raw_data[['Customer']].groupby(raw_data.PDate.dt.month).nunique()
But this returns a series with counts for each individual month, whereas I need to know in ranges, from June to October, then from June to December.
I was thinking about creating a condition where it would only count a customer if any of the integers related to the months I'm interested in would show up, but this seems pretty clunky in my book.
I would mask the original DataFrame then calculate. groupby is more useful with unique, non-overlapping groups, or with a fixed window (groupby.rolling), neither of which are applicable here.
Sample Data
import string
import pandas as pd
import numpy as np

np.random.seed(42)
raw_data = pd.DataFrame({'PDate': pd.date_range('2010-01-01', freq='45D', periods=50),
                         'Customer': np.random.choice(list(string.ascii_lowercase), 50)})
Code
m1 = raw_data.PDate.dt.month.between(6, 10, inclusive=True)  # [June, October]; in pandas >= 1.3 use inclusive="both"
m2 = raw_data.PDate.dt.month.between(6, 12, inclusive=True)  # [June, December]
raw_data[m1].Customer.nunique()
# 14
raw_data[m2].Customer.nunique()
# 17
If you need a more generic approach, you can use something like this:
# Create list of ranges you want to see (it can also be a dataframe, using
# the df.iterrows function in the next step
month_range_list = [
    {"name": "June - October",
     "lower": 6,
     "upper": 10},
    {"name": "June - December",
     "lower": 6,
     "upper": 12}
]
# Expand the list into all elements within the range you defined (you can also use
# dates, and then expand them with datetime.timedelta)
range_df = pd.concat([
    pd.DataFrame({"name": [p["name"]], "month": [month]})
    for p in month_range_list
    for month in range(p["lower"], p["upper"] + 1)
])
# Merge the range_df with your raw_data, and calculate for each grouping the
# number of unique customers
raw_data \
    .assign(month=raw_data.PDate.dt.month) \
    .merge(range_df, on="month") \
    .groupby("name") \
    ["Customer"] \
    .nunique()
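End to end, using the same synthetic data as the first answer (PDate and Customer column names assumed):

```python
import string
import numpy as np
import pandas as pd

np.random.seed(42)
raw_data = pd.DataFrame({
    "PDate": pd.date_range("2010-01-01", freq="45D", periods=50),
    "Customer": np.random.choice(list(string.ascii_lowercase), 50),
})

month_range_list = [
    {"name": "June - October", "lower": 6, "upper": 10},
    {"name": "June - December", "lower": 6, "upper": 12},
]

# One row per (range name, month) pair
range_df = pd.concat([
    pd.DataFrame({"name": [p["name"]], "month": [m]})
    for p in month_range_list
    for m in range(p["lower"], p["upper"] + 1)
])

# Rows falling in several ranges are duplicated by the merge,
# then counted once per range by nunique()
counts = (raw_data.assign(month=raw_data.PDate.dt.month)
          .merge(range_df, on="month")
          .groupby("name")["Customer"]
          .nunique())
print(counts)
```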

Merge Multiple CSV with different column name but same definition

I have different sources (CSV) of a similar data set which I want to merge into a single data frame and write to my DB. Since the data comes from different sources, they use different headers in their CSVs; I want to merge these columns by logical meaning.
So far, I have tried reading all headers first and re-reading the files to get all the data into a single data frame, then using if/else to merge the columns with the same meaning. Ideally I would like to create a mapping file with all possible column names per column and then read the CSVs using that mapping. The data is not ordered or sorted between files. The number of columns might differ too, but they all have the columns I am interested in.
Sample data:
File 1:
id, name, total_amount...
1, "test", 123 ..
File 2:
member_id, tot_amnt, name
2, "test2", 1234 ..
i want this to look like
id, name, total_amount...
1, "test", 123...
2, "test2", 1234...
...
I can't think of an elegant way to do this, would be great to get some direction or help with this.
Thanks
Use skiprows and header=None to skip the header, names to specify your own list of column names, and concat to merge into a single df. i.e.
import pandas as pd
pd.concat([
    pd.read_csv('file1.csv', skiprows=1, header=None, names=['a', 'b', 'c']),
    pd.read_csv('file2.csv', skiprows=1, header=None, names=['a', 'b', 'c'])]
)
Edit: If the different files differ only by column order you can specify different column orders to names and if you want to select a subset of columns use usecols. But you need to do this mapping in advance, either by probing the file, or some other rule.
This requires mapping files to handlers somehow
i.e.
file1.csv
id, name, total_amount
1, "test", 123
file2.csv
member_id, tot_amnt, ignore, name
2, 1234, -1, "test2"
The following selects the common 3 columns and renames / reorders.
import pandas as pd
pd.concat([
pd.read_csv('file1.csv',skiprows=1,header=None,names=['id','name','value'],usecols=[0,1,2]),
pd.read_csv('file2.csv',skiprows=1,header=None,names=['id','value','name'],usecols=[0,1,3])],
sort=False
)
Edit 2:
And a nice way to apply this is to use lambdas and a mapping, i.e.
parsers = {
    "schema1": lambda f: pd.read_csv(f, skiprows=1, header=None, names=['id', 'name', 'value'], usecols=[0, 1, 2]),
    "schema2": lambda f: pd.read_csv(f, skiprows=1, header=None, names=['id', 'value', 'name'], usecols=[0, 1, 3])
}
schema_map = {
    "file2.csv": "schema2",
    "file1.csv": "schema1"}
pd.concat([parsers[v](k) for k, v in schema_map.items()], sort=False)
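That pattern can be exercised end to end with in-memory files (io.StringIO standing in for file1.csv and file2.csv, with the sample contents from above):

```python
import io
import pandas as pd

file1 = io.StringIO("id,name,total_amount\n1,test,123\n")
file2 = io.StringIO("member_id,tot_amnt,ignore,name\n2,1234,-1,test2\n")

parsers = {
    "schema1": lambda f: pd.read_csv(f, skiprows=1, header=None,
                                     names=["id", "name", "value"], usecols=[0, 1, 2]),
    "schema2": lambda f: pd.read_csv(f, skiprows=1, header=None,
                                     names=["id", "value", "name"], usecols=[0, 1, 3]),
}
sources = {file1: "schema1", file2: "schema2"}

# Each file is read with its own schema, then aligned by column name
merged = pd.concat([parsers[schema](f) for f, schema in sources.items()],
                   sort=False, ignore_index=True)
print(merged)
```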
This is what I ended up doing and found to be the cleanest solution. Thanks David for your help.
dict1 = {'member_number': 'id', 'full name': 'name', …}
dict2 = {'member_id': 'id', 'name': 'name', …}
parsers = {
    "schema1": lambda f, colmap: pd.read_csv(f, index_col=False, usecols=list(colmap.keys())),
    "schema2": lambda f, colmap: pd.read_csv(f, index_col=False, usecols=list(colmap.keys()))
}
schema_map = {
    'schema1': ('a_file.csv', dict1),
    'schema2': ('b_file.csv', dict2)
}
total = []
for k, v in schema_map.items():
    d = parsers[k](v[0], v[1])
    d.rename(columns=v[1], inplace=True)
    total.append(d)
final_df = pd.concat(total, sort=False)
