I want to rename a few columns. With inplace=False I can view the result. Why does setting inplace=True result in None? (I restarted the runtime before running with inplace=True.)
ckd_renamed = ckd_data.rename(
    columns={"id": "Id",
             "age": "Age",
             "bp": "blood_pressure"},
    inplace=True)
The dataset that I use is
"https://raw.githubusercontent.com/dphi-official/Datasets/master/Chronic%20Kidney%20Disease%20(CKD)%20Dataset/ChronicKidneyDisease.csv"
Because when you set inplace=True, there is no new copy created, and ckd_data is modified, well, in-place.
The convention of returning None for in-place modifications follows built-in Python methods such as list.sort().
After
ckd_data.rename(columns = {"id": "Id", "age": "Age", "bp": "blood_pressure"}, inplace=True)
you'd continue to refer to the dataframe, now with modified column names, as ckd_data.
EDIT: If you're using Colab and you want the cell to also display the new value, you should be able to add an additional expression statement with just the name of the variable.
ckd_data.rename(columns = {"id": "Id", "age": "Age", "bp": "blood_pressure"}, inplace=True)
ckd_data
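To make the difference concrete, here is a minimal sketch on a toy frame. The three rows of values are made up and only stand in for the CKD dataset; the column names are the ones from the question.

```python
import pandas as pd

# Toy stand-in for the CKD dataset (values are invented for illustration).
ckd_data = pd.DataFrame({"id": [1, 2, 3], "age": [48, 53, 61], "bp": [80, 90, 70]})
mapping = {"id": "Id", "age": "Age", "bp": "blood_pressure"}

# Without inplace, rename returns a new DataFrame and leaves ckd_data alone.
renamed = ckd_data.rename(columns=mapping)
columns_before = list(ckd_data.columns)

# With inplace=True, ckd_data itself is changed and the call returns None.
returned = ckd_data.rename(columns=mapping, inplace=True)
columns_after = list(ckd_data.columns)

print(returned)
print(columns_after)
```

So assigning the result of the inplace=True call to ckd_renamed stores None; the renamed columns live on ckd_data.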
Is it possible to validate a column based on another column using Pandera?
My dataframe looks like this:
df = pd.DataFrame({
"Name": ["Thomas","",""],
"Address": ["Address 1", "Address 1", "Address 3"],
"Zip": ["65989", "65989", "65954"],
"External": [False, True, False],
})
I would like to validate the "Name" column based on the "External" column. If external = True then Name can be empty. In this example, the third record should be invalid because external = False and the name is missing.
A related question here suggests using wide checks. However, that way all the columns of the third record are reported as invalid (Name, Address, Zip and External), whereas I need only Name to be invalid and the rest to be ignored.
schema_ = pa.DataFrameSchema({
    "Name": pa.Column(str),
    "Address": pa.Column(str),
    "Zip": pa.Column(str),
    "External": pa.Column(str),
    },
    checks=[pa.Check.is_external()])
@extensions.register_check_method(check_type="element_wise",)
def is_external(pandas_obj: pd.Series):
    if (pandas_obj["external"] == True) and (len(pandas_obj["Name"]))<1:
        return False
    else:
        return True
I also tried something like this in the schema:
"Name": pa.Column(str, checks=[
    pa.Check.is_external(df["external"]),
]),
But in this case all the values of the column ([False, True, False]) are passed to the function at once, and I am not sure how to compare them to the corresponding values of the Name column.
Is it possible to do this kind of check in Pandera?
Thanks in advance!
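For reference, the rule described in the question can be written as a row-wise boolean condition in plain pandas. This is not Pandera code and does not claim to be how Pandera should be configured; it only spells out what a check would have to encode, using the DataFrame from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Thomas", "", ""],
    "Address": ["Address 1", "Address 1", "Address 3"],
    "Zip": ["65989", "65989", "65954"],
    "External": [False, True, False],
})

# A name is only required when the record is not external.
name_missing = df["Name"].str.len() == 0
name_invalid = name_missing & ~df["External"]

# Index labels of the rows whose Name fails the rule.
invalid_rows = df.index[name_invalid].tolist()
print(invalid_rows)
```

Only the third record (index 2) is flagged, and only its Name is considered.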
I have not seen any posts about this on here. I have a DataFrame whose values I would like to replace with the values found in a dictionary. This could be done with .replace alone, but I want to keep it dynamic and reference the df column names through a paired dictionary map.
import pandas as pd
data = [['Alphabet', 'Indiana']]
df = pd.DataFrame(data, columns=['letters', 'State'])
replace_dict = {
    "states":
        {"Illinois": "IL", "Indiana": "IN"},
    "abc":
        {"Alphabet": "ABC", "Alphabet end": "XYZ"}}
def replace_dict():
    return
df_map = {
    "letters": [replace_dict],
    "State": [replace_dict]
}
# replace the df values with the replace_dict values
I hope this makes sense. To explain further: I want to replace the data under the columns 'letters' and 'State' with the values found in replace_dict, referencing the column names from the keys in df_map. I know this is overcomplicated for this example, but I wanted to provide an example that is easy to understand.
I need help writing the function 'replace_dict' to do the operations above.
Expected output is:
data=[['ABC', 'IN']]
df=pd.DataFrame(data,columns=['letters','State'])
by creating a function and then running it with something along these lines:
for i in df_map:
    for j in df_map[i]:
        df = j(i, df)
How would I create a function to run these operations? I have not seen anyone try to do this with multiple dictionary keys in the replace_dict.
I'd keep the replace_dict keys the same as the column names.
def map_from_dict(data: pd.DataFrame, cols: list, mapping: dict) -> pd.DataFrame:
    return pd.DataFrame([data[x].map(mapping.get(x)) for x in cols]).transpose()
df = pd.DataFrame({
    "letters": ["Alphabet"],
    "states": ["Indiana"]
})
replace_dict = {
    "states": {"Illinois": "IL", "Indiana": "IN"},
    "letters": {"Alphabet": "ABC", "Alphabet end": "XYZ"}
}
final_df = map_from_dict(df, ["letters", "states"], replace_dict)
print(final_df)
  letters states
0     ABC     IN
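Once the dictionary keys match the column names, DataFrame.replace also accepts the nested dictionary directly, read as {column: {old: new}}. The sketch below assumes that layout; the second row ("Alphabet end", "Ohio") is added here purely to show what happens to a value that has no mapping.

```python
import pandas as pd

df = pd.DataFrame({
    "letters": ["Alphabet", "Alphabet end"],
    "states": ["Indiana", "Ohio"],  # "Ohio" is deliberately missing from the mapping
})
replace_dict = {
    "states": {"Illinois": "IL", "Indiana": "IN"},
    "letters": {"Alphabet": "ABC", "Alphabet end": "XYZ"},
}

# Nested dict: for each column key, replace old values with new ones.
replaced = df.replace(replace_dict)

# For comparison: Series.map turns unmapped values into NaN instead of keeping them.
mapped_states = df["states"].map(replace_dict["states"])

print(replaced)
```

The practical difference is that replace leaves unmapped values such as "Ohio" untouched, while map yields NaN for them.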
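import pandas as pd
data = [['Alphabet', 'Indiana']]
df = pd.DataFrame(data, columns=['letters', 'State'])
dict_ = {
    "states":
        {"Illinois": "IL", "Indiana": "IN"},
    "abc":
        {"Alphabet": "ABC", "Alphabet end": "XYZ"}}
def replace_dict(df, dict_):
    # Try every mapping against every column, so the dictionary keys
    # do not need to match the column names.
    for d in dict_.values():
        for val in d:
            for c in df.columns:
                # .loc avoids chained assignment, which may write to a copy.
                df.loc[df[c] == val, c] = d[val]
    return df
df = replace_dict(df, dict_)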
Suppose the columns of my DataFrame look like this:
print(df.columns)
Index([ "CustomerID", "CompanyName", "ContactName", "ContactTitle",
"Address", "City", "Region", "PostalCode",
"Country", "Phone", "Fax"],
dtype='object')
I want to remove the quotes from the column names.
This is what I have tried:
for cols in list(df.columns):
    new_col = str(cols).replace('"','').replace('"','')
    cols.rename(index={cols: new_col})
Any solution will be helpful.
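The entries of .columns are strings, so the quotes are part of how they are displayed. Without the quotes the output would no longer be a valid string representation.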
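If, on the other hand, the quote characters really are stored inside the names (an assumption; the question does not show how the file was read), they can be stripped with the string methods of the column Index. The two-column frame below is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical frame whose column names literally contain double quotes.
df = pd.DataFrame([[1, "Acme"]], columns=['"CustomerID"', '"CompanyName"'])

# Index.str.strip removes the given characters from both ends of each name.
df.columns = df.columns.str.strip('"')

print(list(df.columns))
```

Assigning to df.columns replaces all names at once, which avoids calling rename on the individual strings as in the attempted loop.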
I have two CSV files which have a one-to-many relation between them.
main.csv:
"main_id","name"
"1","foobar"
attributes.csv:
"id","main_id","name","value","updated_at"
"100","1","color","red","2020-10-10"
"101","1","shape","square","2020-10-10"
"102","1","size","small","2020-10-10"
I would like to convert this to JSON of this structure:
[
  {
    "main_id": "1",
    "name": "foobar",
    "attributes": [
      {
        "id": "100",
        "name": "color",
        "value": "red",
        "updated_at": "2020-10-10"
      },
      {
        "id": "101",
        "name": "shape",
        "value": "square",
        "updated_at": "2020-10-10"
      },
      {
        "id": "102",
        "name": "size",
        "value": "small",
        "updated_at": "2020-10-10"
      }
    ]
  }
]
I tried using Python and Pandas like:
import pandas
def transform_group(group):
    group.reset_index(inplace=True)
    group.drop('main_id', axis='columns', inplace=True)
    return group.to_dict(orient='records')
main = pandas.read_csv('main.csv')
attributes = pandas.read_csv('attributes.csv', index_col=0)
attributes = attributes.groupby('main_id').apply(transform_group)
attributes.name = "attributes"
main = main.merge(
    right=attributes,
    on='main_id',
    how='left',
    validate='m:1',
    copy=False,
)
main.to_json('out.json', orient='records', indent=2)
It works, but it does not seem to scale. On my whole dataset I can load the individual CSV files without problems, but when I try to modify the data structure before calling to_json, memory usage explodes.
So is there a more efficient way to do this transformation? Maybe there is some Pandas feature I am missing? Or is there some other library to use? Moreover, the use of apply seems to be pretty slow here.
This is a tough problem and we have all felt your pain.
There are three ways I would attack this problem. First, groupby is slower if you let pandas do the splitting.
import pandas as pd
import numpy as np
from collections import defaultdict
df = pd.DataFrame({'id': np.random.randint(0, 100, 5000),
                   'name': np.random.randint(0, 100, 5000)})
Now, if you do the standard groupby
groups = []
for k, rows in df.groupby('id'):
    groups.append(rows)
you will find that
groups = defaultdict(list)
for row_id, name in df.values:
    groups[row_id].append((row_id, name))
is about 3 times faster.
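Before swapping one for the other, it is worth checking that the manual loop produces the same groups as groupby. The sketch below does that on seeded random data; the seed, the use of itertuples and the variable names are choices made here, and no timing is claimed (that can be measured separately, for example with timeit).

```python
from collections import defaultdict

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
df = pd.DataFrame({"id": rng.integers(0, 100, 5000),
                   "name": rng.integers(0, 100, 5000)})

# Manual grouping: one pass over the rows, appending to a per-id list.
manual = defaultdict(list)
for row_id, name in df.itertuples(index=False, name=None):
    manual[row_id].append(name)

# Reference grouping with pandas.
via_groupby = {key: rows["name"].tolist() for key, rows in df.groupby("id")}

same_groups = dict(manual) == via_groupby
print(same_groups)
```

Both approaches keep the original row order inside each group, which is why the per-id lists can be compared directly.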
Second, I would switch to Dask and its parallelization. For background, see the discussion "What is Dask and how is it different from pandas?".
The third is algorithmic. Load the main file, then for each ID load only the data for that ID, taking multiple bites at what is in memory and what is on disk, and save out each partial result as it becomes available.
In my case I was able to load the original tables into memory, but embedding the related rows exploded the size so that it no longer fit. I ended up still using Pandas to load the CSV files, but then I generate the output iteratively, row by row, and save each row as a separate JSON document. This means I never hold one large data structure in memory for a single large JSON.
Another important realization was that the related column should be made the index and that the index has to be sorted, so that querying it is fast (because there are generally duplicate entries in the related column).
I made the following two helper functions:
def get_related_dict(related_table, label):
    assert related_table.index.is_unique
    if pandas.isna(label):
        return None
    row = related_table.loc[label]
    assert isinstance(row, pandas.Series), label
    result = row.to_dict()
    result[related_table.index.name] = label
    return result
def get_related_list(related_table, label):
    # A sorted index makes selecting non-unique labels more performant.
    assert related_table.index.is_monotonic_increasing
    try:
        # Passing a list of labels always returns a DataFrame, never a Series,
        # even when only one row matches.
        return related_table.loc[[label], :].to_dict(orient='records')
    except KeyError:
        return []
And then I do:
import json
import sys
import pandas
main = pandas.read_csv('main.csv', index_col=0)
attributes = pandas.read_csv('attributes.csv', index_col=1)
# We sort index to be more performant when selecting non-unique labels. We use stable sort.
attributes.sort_index(inplace=True, kind='mergesort')
columns = [main.index.name] + list(main.columns)
for row in main.itertuples(index=True, name=None):
    assert len(columns) == len(row)
    data = dict(zip(columns, row))
    data['attributes'] = get_related_list(attributes, data['main_id'])
    json.dump(data, sys.stdout, indent=2)
    sys.stdout.write("\n")
I am using the Facebook API (v2.10), from which I've extracted the data I need; 95% of it is perfect. My problem is the 'actions' metric, which is returned as a dictionary within a list within another dictionary.
At present, all the data is in a DataFrame, however, the 'actions' column is a list of dictionaries that contain each individual action for that day.
{
  "actions": [
    {
      "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
      "value": "7"
    },
    {
      "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
      "value": "3"
    },
    {
      "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
      "value": "144"
    },
    {
      "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
      "value": "34"
    }]}
All this appears in one cell (row) within the DataFrame.
What is the best way to:
Get the action type and create a new column that uses "action_type" as the column name?
List the correct value under this column?
It looks like JSON, but when I look at the type, it's a pandas Series (stored as an object).
For those willing to help (thank you, I greatly appreciate it): can you either point me to the right material so I can read it and work it out on my own (I'm not entirely sure what to look for) or, if you decide this is an easy problem, explain how and why you solved it this way? I don't just want the answer.
I have tried the following (with help from a friend) and it kind of works, but I have issues running it in my script, i.e. if it runs within a bigger code block, I get the error shown below:
for i in range(df.shape[0]):
    line = df.loc[i, 'Conversions']
    L = ast.literal_eval(line)
    for l in L:
        cid = l['action_type']
        value = l['value']
        df.loc[i, cid] = value
If I save the DataFrame as a CSV and load it again with pd.read_csv, it executes properly, but not within the script. No idea why.
Error:
ValueError: malformed node or string: [{'value': '1', 'action_type': 'offsite_conversion.custom.xxxxx}]
Any help would be greatly appreciated.
Thanks,
Adrian
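One likely explanation for the error, judging from the message: ast.literal_eval expects a string, but inside the script the cell already holds a Python list. After a round trip through CSV the same cell is read back as text, which is why the loop works there. The helper below is a hypothetical sketch (as_action_list is not from the original post) that accepts both forms; the action type "a" is a placeholder.

```python
import ast


def as_action_list(cell):
    """Return a list of action dicts from either a list or its string form."""
    if isinstance(cell, str):
        return ast.literal_eval(cell)  # text, e.g. after pd.read_csv
    return cell  # already a list, e.g. straight from the API response


from_api = [{"action_type": "a", "value": "7"}]
from_csv = "[{'action_type': 'a', 'value': '7'}]"

# Passing the list itself to literal_eval reproduces the ValueError.
try:
    ast.literal_eval(from_api)
    error_message = ""
except ValueError as exc:
    error_message = str(exc)

print(error_message)
print(as_action_list(from_api) == as_action_list(from_csv))
```

With such a guard the rest of the loop can stay as it is, whether the DataFrame comes from the API or from a saved CSV.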
You can use json_normalize (available as pd.json_normalize since pandas 1.0; older versions expose it as pd.io.json.json_normalize):
In [11]: d  # e.g. the dict returned by json.load
Out[11]:
{'actions': [{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx',
   'value': '7'},
  {'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '3'},
  {'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '144'},
  {'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '34'}]}
In [12]: pd.json_normalize(d, record_path="actions")
Out[12]:
                             action_type value
0  offsite_conversion.custom.xxxxxxxxxxx     7
1  offsite_conversion.custom.xxxxxxxxxxx     3
2  offsite_conversion.custom.xxxxxxxxxxx   144
3  offsite_conversion.custom.xxxxxxxxxxx    34
You can use df.join(pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)).
Explanation:
df['Conversions'].tolist() returns a list of dictionaries. This list is then transformed into a DataFrame using pd.DataFrame. Then, you can use the pivot function to pivot the table into the shape that you want.
Lastly, you can join the table with your original DataFrame. Note that this only works if your DataFrame's index is the default (i.e., integers starting from 0). If this is not the case, you can do this instead:
df2 = pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)
for col in df2.columns:
    # Assign by position; a plain Series assignment would align on the index again.
    df[col] = df2[col].to_numpy()
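If each cell of 'Conversions' holds a list of several action dictionaries, as in the question's sample, one alternative is to turn every list into a single {action_type: value} mapping and let pandas build the columns. This is a sketch under assumptions: the dates and the action names link_click and post_like are invented, and action types are taken to be distinct within a row (duplicates would overwrite each other).

```python
import pandas as pd

# Hypothetical data: one row per day, each with a list of action dicts.
df = pd.DataFrame({
    "date": ["2017-08-01", "2017-08-02"],
    "Conversions": [
        [{"action_type": "link_click", "value": "7"},
         {"action_type": "post_like", "value": "3"}],
        [{"action_type": "link_click", "value": "144"}],
    ],
}, index=["a", "b"])  # non-default index on purpose

# One dict per row: action_type becomes the key, value the cell content.
wide = pd.DataFrame(
    [{a["action_type"]: a["value"] for a in actions} for actions in df["Conversions"]],
    index=df.index,  # reuse the original index so the join lines up
)
result = df.join(wide)

print(result.drop(columns="Conversions"))
```

Actions that do not occur on a given day end up as NaN in that row, and reusing df.index means the join does not depend on a default integer index.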