I'm new to Pandas and I'd like to know what I'm doing wrong in the following example.
I found an example here explaining how to get a data frame after applying a group by instead of a series.
df1 = pd.DataFrame( {
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Baires", "Caracas", "Baires", "Caracas"] })
df1['size'] = df1.groupby(['City']).transform(np.size)
df1.dtypes #Why is size an object? shouldn't it be an integer?
df1[['size']] = df1[['size']].astype(int) #convert to integer
df1['avera'] = df1.groupby(['City'])['size'].transform(np.mean) #group by again
Basically, I want to apply the same transformation to a huge data set I'm working on now, but I'm getting an error message:
budgetbid['meanpb']=budgetbid.groupby(['jobid'])['probudget'].transform(np.mean) #can't upload this data for the sake of explanation
ValueError: Length mismatch: Expected axis has 5564 elements, new values have 78421 elements
Thus, my questions are:
How can I overcome this error?
Why do I get an object type when applying group by with size, instead of an integer type?
Let us say that I want to get a data frame from df1 with unique cities and their respective count(*). I know I can do something like
newdf=df1.groupby(['City']).size()
Unfortunately, this is a series, but I want a data frame with two columns, City and the brand new variable, let's say countcity. How can I get a data frame from a group-by operation like the one in this example?
Could you give me an example of a select distinct equivalence here in pandas?
Question 2: Why does df1['size'] have dtype object?
groupby/transform returns a DataFrame with a dtype for each column which is compatible with both the original column's dtype and the result of the transformation. Since Name has dtype object,
df1.groupby(['City']).transform(np.size)
is converted to dtype object as well.
I'm not sure why transform is coded to work like this; there might be some use case which demands this to ensure correctness in some sense.
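One way to sidestep the upcast is to select a single column before transforming; with the string alias 'size' (supported in recent pandas versions) the result stays integer. A small sketch:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Baires", "Caracas", "Baires", "Caracas"],
})

# Selecting one column before transform avoids the whole-frame upcast to
# object; 'size' broadcasts each group's row count back to every row
sizes = df1.groupby('City')['Name'].transform('size')
print(sizes.tolist())
```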
Questions 1 & 3: Why do I get ValueError: Length mismatch, and how can I avoid it?
There are probably NaNs in the column being grouped. For example, suppose we change one of the values in City to NaN:
df2 = pd.DataFrame( {
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : [np.nan, "Seattle", "Baires", "Caracas", "Baires", "Caracas"] })
grouped = df2.groupby(['City'])
then
In [86]: df2.groupby(['City']).transform(np.size)
ValueError: Length mismatch: Expected axis has 5 elements, new values have 6 elements
Groupby does not group the NaNs:
In [88]: [city for city, grp in df2.groupby(['City'])]
Out[88]: ['Baires', 'Caracas', 'Seattle']
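In newer pandas (1.1 and later), groupby also accepts dropna=False, which keeps the NaN keys as their own group so the transform lines up again. A sketch, assuming a recent enough pandas:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": [np.nan, "Seattle", "Baires", "Caracas", "Baires", "Caracas"],
})

# dropna=False keeps NaN keys in their own group, so transform returns
# one value per input row and the lengths match again
df2['size'] = df2.groupby('City', dropna=False)['Name'].transform('size')
```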
To work around this, use groupby/agg:
countcity = grouped.agg('count').rename(columns={'Name':'countcity'})
# countcity
# City
# Baires 2
# Caracas 2
# Seattle 1
and then merge the result back into df2:
result = pd.merge(df2, countcity, left_on=['City'], right_index=True, how='outer')
print(result)
yields
City Name countcity
0 NaN Alice NaN
1 Seattle Bob 1
2 Baires Mallory 2
4 Baires Bob 2
3 Caracas Mallory 2
5 Caracas Mallory 2
Question 4: Do you mean what is the Pandas equivalent of the SQL select distinct statement?
If so, perhaps you are looking for
Series.unique
or perhaps iterate through the keys in the Groupby object, as was done in
[city for city, grp in df2.groupby(['City'])]
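To make the analogy concrete, a minimal sketch: unique covers a single column, and drop_duplicates covers whole rows (the closer analogue of SELECT DISTINCT *):

```python
import pandas as pd

df2 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Baires", "Caracas", "Baires", "Caracas"],
})

distinct_cities = df2['City'].unique()   # like SELECT DISTINCT City
distinct_rows = df2.drop_duplicates()    # like SELECT DISTINCT * (whole rows)
```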
3.)
Just call pd.DataFrame() again:
newdf = pd.DataFrame(df1.City.value_counts())
or
newdf = pd.DataFrame(df1.groupby(['City']).size())
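Another common idiom (a sketch) is reset_index with a name, which turns the size() Series directly into the two-column frame the question asks for:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Baires", "Caracas", "Baires", "Caracas"],
})

# size() gives a Series indexed by City; reset_index promotes the index
# to a column and names the counts, yielding a two-column DataFrame
newdf = df1.groupby('City').size().reset_index(name='countcity')
```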
4.) I think the select distinct equivalent would just be using more than one column in your groupby. So for example,
df1.groupby(['City', 'Name']).size() would return the grouped counts:
City Name
Baires Bob 1
Mallory 1
Caracas Mallory 2
Seattle Alice 1
Bob 1
dtype: int64
Related
Say I have two tables:
Table A:
state value
0 A 100
Table B:
state 100 200
0 A 1 4
1 B 2 5
2 C 3 6
I want to create a new field for Table A called "Factor" that returns the respective value from Table B:
state value factor
0 A 100 1
How would I do this in Python/Pandas?
In Excel, I would do: INDEX('Table B'!B2:C4, MATCH('Table A'!A2, 'Table B'!A:A, 0), MATCH('Table A'!B2, 'Table B'!B1:C1, 0))
Pivot dfB from wide to long format using melt. Ensure the column names get converted to a numeric type.
merge the longform data to dfA
melted = dfB.melt(id_vars=['state'], var_name='value', value_name='factor')
melted['value'] = melted['value'].astype(int)
dfA = dfA.merge(melted, on=['state', 'value'])
Result:
state value factor
0 A 100 1
This maybe feels like overkill for this example, but could be helpful for larger lookups.
You can use loc for this. Consider the following example:
Assuming Table A (dfA) and Table B (dfB) are both pandas dataframes:
>>> A = {
'state' : ['A'],
'value' : [100],
}
>>> B = {
'state' : ['A','B','C'],
'100' : [1,2,3],
'200' : [4,5,6]
}
>>> dfA = pd.DataFrame(A)
>>> dfB = pd.DataFrame(B)
which gives you your tables as pandas dataframes
>>> dfA
state value
0 A 100
>>> dfB
state 100 200
0 A 1 4
1 B 2 5
2 C 3 6
Then we search a specific column in the dataframe for a specific value and return the value of a different column from that same row.
extracted_value = dfB.loc[dfB['state'] == 'A', '100'].iloc[0]
We create a column and set that value for it: dfA['factor'] = extracted_value
and we now have that extracted value in the appropriate column
>>> dfA
state value factor
0 A 100 1
Now I'm sure you want to do this in a loop to address a particular list of values, and I hope this helps you start. Note that dfA['factor'] = extracted_value sets every value in that column to extracted_value, and you might want different values per row. To accomplish that, use .loc again to assign values based on the row index.
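A sketch of that loop over all rows of dfA at once, using at-based lookups on an indexed copy of dfB (this assumes the column labels of dfB are strings, as in the example above):

```python
import pandas as pd

dfA = pd.DataFrame({'state': ['A'], 'value': [100]})
dfB = pd.DataFrame({'state': ['A', 'B', 'C'],
                    '100': [1, 2, 3],
                    '200': [4, 5, 6]})

# Index dfB by state so each (state, value) pair becomes a direct lookup,
# then pull one factor per row of dfA
b = dfB.set_index('state')
dfA['factor'] = [b.at[s, str(v)] for s, v in zip(dfA['state'], dfA['value'])]
```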
Assume I have the following simple pandas DataFrame:
df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
"country": ["Netherlands", "Germany", "United_States", "England", "Canada"]})
and a dictionary with abbreviations for the values in the country column:
abr = {"Netherlands": "NL",
"Germany": "GE",
"United_States": "US",
"England": "EN",
"Canada": "CA"
}
I want to change the values in the country column of the DataFrame to the lookup values in the dictionary. The result would look like this:
id country
0 1 NL
1 2 GE
2 3 US
3 4 EN
4 5 CA
I tried to do it using
df["country"] = abr[df["country"]]
but that gives the following error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I understand why this error happens (the code tries to hash an object instead of the string value in the column), but is there a way to solve this?
You can use the pandas function replace(), designed especially for these scenarios. Careful not to confuse it with .str.replace(), which doesn't take dictionaries.
Try with:
df['country'] = df['country'].replace(abr)
df["country"] = df["country"].map(abr)
print(df)
Prints:
id country
0 1 NL
1 2 GE
2 3 US
3 4 EN
4 5 CA
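One difference between the two worth knowing, shown in a small sketch: map() sends values missing from the dictionary to NaN, while replace() leaves them unchanged.

```python
import pandas as pd

s = pd.Series(["Netherlands", "Belgium"])
abr = {"Netherlands": "NL"}

mapped = s.map(abr)        # unmatched values become NaN
replaced = s.replace(abr)  # unmatched values are kept as-is
```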
I am new to Python and will appreciate any help on this!
Suppose I have a bunch of columns in a dataset with categorical values, say gender, marital status, etc.
While doing input validation of the dataset, I need to check whether the values of the columns are within an acceptable range.
For instance, if the column is gender, the acceptable values are male and female. If the column is marital status, the acceptable values are single, married, and divorced.
If, for instance, the user inputs a dataset with values for these variables outside the acceptable range, I need to write a function to point it out.
How do I do this?
Suppose I create a static acceptable-value mapping list like below, for all datasets:
dataset variable acceptable_values
demographics gender male,female
demographics marital status single,married,divorced
purchase region south,east,west,north
Ideally, the code should go through all variables in all datasets listed in the above mapping file and check whether the values are in the "acceptable_values" list.
Suppose below are new datasets; the code should throw an output saying:
unacceptable values found for dataset: demographics, for variable: gender - Boy,Other,missing,(blank)
unacceptable values found for dataset: demographics, for variable: maritalstatus - separated
demographics:
id gender maritalstatus
1 male single
2 male single
3 Boy single
4 Other married
5 missing divorced
6 (blank) separated
Let me know how this can be achieved; it looks fairly complicated to me.
It would be great if the code could also convert the "new"/"unacceptable" values to NaN or 0 or something like that, but this is a nice-to-have.
You could do something like the following, where we assume that you're storing your data frames in a dictionary called df_dict, and the collection of accepted values in a data frame called df_accepted:
# First, use the dataset and variable name as indices in df_accepted
# to make it easier to perform lookups
df_accepted.set_index(['dataset', 'variable'], inplace=True)

# Loop over all data frames
for name, df in df_dict.items():
    # Loop over all columns in the current data frame
    for c in df:
        # Find the indices for the given column for which the values
        # do /not/ belong to the list of accepted values for this column.
        try:
            mask = ~df[c].isin(df_accepted.loc[name, c].acceptable_values.split(','))
            # Print the values that did not belong to the list
            print(f'Bad values for {c} in {name}: {", ".join(df[c][mask])}')
            # Convert them into NaNs (use .loc to avoid chained assignment)
            df.loc[mask, c] = np.nan
        except KeyError:
            print(f'Skipping validation of {c} in {name}')
With your given input:
In [200]: df_accepted
Out[200]:
dataset variable acceptable_values
0 demographics gender male,female
1 demographics maritalstatus single,married,divorced
2 purchase region south,east,west,north
In [201]: df_dict['demographics']
Out[201]:
gender maritalstatus
id
1 male single
2 male single
3 Boy single
4 Other married
5 missing divorced
6 (blank) separated
In [202]: df_dict['purchase']
Out[202]:
region count
0 south 60
1 west 90210
2 north-east 10
In [203]: df_accepted.set_index(['dataset', 'variable'], inplace=True)
...:
...: for name, df in df_dict.items():
...: for c in df:
...: try:
...: mask = ~df[c].isin(df_accepted.loc[name, c].acceptable_values.split(','))
...: print(f'Bad values for {c} in {name}: {", ".join(df[c][mask])}')
...: df[c][mask] = np.nan
...: except KeyError:
...: print(f'Skipping validation of {c} in {name}')
...:
Bad values for gender in demographics: Boy, Other, missing, (blank)
Bad values for maritalstatus in demographics: separated
Bad values for region in purchase: north-east
Skipping validation of count in purchase
In [204]: df_accepted
Out[204]:
acceptable_values
dataset variable
demographics gender male,female
maritalstatus single,married,divorced
purchase region south,east,west,north
In [205]: df_dict['demographics']
Out[205]:
gender maritalstatus
id
1 male single
2 male single
3 NaN single
4 NaN married
5 NaN divorced
6 NaN NaN
In [206]: df_dict['purchase']
Out[206]:
region count
0 south 60
1 west 90210
2 NaN 10
There might be a simpler way to do this, but this solution works:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['region', 'number'], data=[['north',0],['south',-4],['hello',15]])
valid_values = {'region': {'north','south','west','east'}}
df = df.apply(lambda column:
column.apply(lambda x: x if x in valid_values[column.name] else np.nan)
if column.name in valid_values else column)
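The same masking can also be written column by column with isin, which some may find more direct. A sketch under the same valid_values assumption:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['region', 'number'],
                  data=[['north', 0], ['south', -4], ['hello', 15]])
valid_values = {'region': {'north', 'south', 'west', 'east'}}

# Replace out-of-range entries with NaN, one validated column at a time;
# columns without an entry in valid_values are left untouched
for col, allowed in valid_values.items():
    df.loc[~df[col].isin(allowed), col] = np.nan
```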
I have a DataFrame with a multiindex in the columns and would like to use dictionaries to append new rows.
Let's say that each row in the DataFrame is a city. The columns contains "distance" and "vehicle". And each cell would be the percentage of the population that chooses this vehicle for this distance.
I'm constructing an index like this:
index_tuples=[]
for distance in ["near", "far"]:
for vehicle in ["bike", "car"]:
index_tuples.append([distance, vehicle])
index = pd.MultiIndex.from_tuples(index_tuples, names=["distance", "vehicle"])
Then I'm creating a dataframe:
dataframe = pd.DataFrame(index=["city"], columns = index)
The structure of the dataframe looks good, although pandas has added NaNs as default values?
Now I would like to set up a dictionary for the new city and add it:
my_home_city = {"near":{"bike":1, "car":0},"far":{"bike":0, "car":1}}
dataframe["my_home_city"] = my_home_city
But this fails:
ValueError: Length of values does not match length of index
Here is the complete error message (pastebin)
UPDATE:
Thank you for all the good answers. I'm afraid I've oversimplified the problem in my example. Actually my index is nested with 3 levels (and it could become more).
So I've accepted the universal answer of converting my dictionary into a list of tuples. This might not be as clean as the other approaches but works for any multiindex setup.
A MultiIndex is a list of tuples, so we just need to modify your dict; then we can directly assign the values:
d = {(x,y):my_home_city[x][y] for x in my_home_city for y in my_home_city[x]}
df.loc['my_home_city',:]=d
df
Out[994]:
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
More Info
d
Out[995]:
{('far', 'bike'): 0,
('far', 'car'): 1,
('near', 'bike'): 1,
('near', 'car'): 0}
df.columns.values
Out[996]: array([('near', 'bike'), ('near', 'car'), ('far', 'bike'), ('far', 'car')], dtype=object)
You can append to your dataframe like this:
my_home_city = {"near":{"bike":1, "car":0},"far":{"bike":0, "car":1}}
dataframe.append(pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city'))
Output:
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
The trick is to create the dataframe row with from_dict, then unstack to get the structure of your original dataframe with multiindex columns, then rename to set the index label, and append.
Or, if you don't want to create the empty dataframe first, you can use this method to create the dataframe with the new data.
pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city').to_frame().T
Output:
far near
bike car bike car
my_home_city 0 1 1 0
Explained:
pd.DataFrame.from_dict(my_home_city)
far near
bike 0 1
car 1 0
Now, let's unstack to create the multiindex and get that new dataframe into the structure of the original dataframe.
pd.DataFrame.from_dict(my_home_city).unstack()
far bike 0
car 1
near bike 1
car 0
dtype: int64
We use rename to give that series a name which becomes the index label of that dataframe row when appended to the original dataframe.
far bike 0
car 1
near bike 1
car 0
Name: my_home_city, dtype: int64
Now, if you converted that series to a frame and transposed it, it would look very much like a new row. However, there is no need to do this: pandas does intrinsic data alignment, so appending this series to the dataframe will auto-align and add the new record.
dataframe.append(pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city'))
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
I don't think you even need to initialise an empty dataframe. With your d, I can get your desired output with unstack and a transpose:
pd.DataFrame(d).unstack().to_frame().T
far near
bike car bike car
0 0 1 1 0
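Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat achieves the same row addition. A sketch of the same idea with concat:

```python
import pandas as pd

# Empty frame with the multiindex columns from the question
index = pd.MultiIndex.from_product([["near", "far"], ["bike", "car"]],
                                   names=["distance", "vehicle"])
dataframe = pd.DataFrame(index=pd.Index([], name="city"), columns=index)

my_home_city = {"near": {"bike": 1, "car": 0}, "far": {"bike": 0, "car": 1}}

# Build the row as before (from_dict -> unstack -> rename), then make it
# a one-row frame and let concat align it against the existing columns
row = pd.DataFrame(my_home_city).unstack().rename("my_home_city").to_frame().T
dataframe = pd.concat([dataframe, row])
```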
Initialize your empty dataframe using MultiIndex.from_product.
distances = ['near', 'far']
vehicles = ['bike', 'car']
df = pd.DataFrame([], columns=pd.MultiIndex.from_product([distances, vehicles]),
index=pd.Index([], name='city'))
Your dictionary results in a square matrix (distance by vehicle), so unstack it (which will result in a Series), then convert it into a dataframe row by calling to_frame with the relevant city name and transposing the column into a row.
>>> df.append(pd.DataFrame(my_home_city).unstack().to_frame('my_home_city').T)
far near
bike car bike car
city
my_home_city 0 1 1 0
Just to add to all of the answers, this is just another (maybe not too different) simple example, represented in a more reproducible way:
import itertools as it
from IPython.display import display # this is just for displaying output purpose
import numpy as np
import pandas as pd
col_1, col_2 = ['A', 'B'], ['C', 'D']
arr_size = len(col_2)
col = pd.MultiIndex.from_product([col_1, col_2])
tmp_df = pd.DataFrame(columns=col)
display(tmp_df)
for s in range(3):  # number of rows to add to tmp_df
tmp_dict = {x : [np.random.random_sample(1)[0] for i in range(arr_size)] for x in range(arr_size)}
tmp_ser = pd.Series(it.chain.from_iterable([tmp_dict[x] for x in tmp_dict]), index=col)
# display(tmp_dict, tmp_ser)
tmp_df = tmp_df.append(tmp_ser[tmp_df.columns], ignore_index=True)
display(tmp_df)
Some things to note about above:
The number of items to add should always match len(col_1)*len(col_2), that is, the product of the lengths of the lists your multi-index is made from.
list(it.chain.from_iterable([[2, 3], [4, 5]])) simply does this [2,3,4,5]
try this workaround
append to dict
then convert to pandas data frame
at the very last step select desired columns to create multi-index with set_index()
d = dict()
for g in predictor_types:
for col in predictor_types[g]:
tot = len(ames) - ames[col].count()
if tot:
d.setdefault('type',[]).append(g)
d.setdefault('predictor',[]).append(col)
d.setdefault('missing',[]).append(tot)
pd.DataFrame(d).set_index(['type','predictor']).style.bar(color='DodgerBlue')
I have a data frame with two columns :
state total_sales
AL 16714
AR 6498
AZ 107296
CA 33717
Now I want to map the strings in the state column to ints from 1 to N (where N is the number of rows, here 4) based on decreasing order of values in total_sales. The result should be stored in another column (say, label). That is, I want a result like this:
state total_sales label
AL 16714 3
AR 6498 4
AZ 107296 1
CA 33717 2
Please suggest a vectorised implementation.
You can use rank with cast to int:
df['label'] = df['total_sales'].rank(method='dense', ascending=False).astype(int)
print (df)
state total_sales label
0 AL 16714 3
1 AR 6498 4
2 AZ 107296 1
3 CA 33717 2
One option for converting a column of values to integers is pandas.Categorical.
This actually groups identical values; in a case like this, where all values are unique, each "group" has only one value. The resulting object has a codes attribute, which is a Numpy array of integers indicating which group each input value is in.
Applied to this problem, if you have
In [12]: data = pd.DataFrame({
'state': ['AL', 'AR', 'AZ', 'CA'],
'total_sales': [16714, 6498, 107296, 33717]
})
you can add the label column as described using
In [13]: data['label'] = len(data) - pd.Categorical(data.total_sales, ordered=True).codes
In [14]: print(data)
state total_sales label
0 AL 16714 3
1 AR 6498 4
2 AZ 107296 1
3 CA 33717 2
For this example it is not as fast as jezrael's answer, but it has a wide range of applications, and it was faster for a larger series I was encoding to integers. It should be noted that if there are two identical values in the total_sales column, it will assign them the same label.