How to obtain output files labeled with dictionary keys - python

I am a Python/pandas user and I have multiple dataframes like df1, df2, df3, ...
I want to name them A, B, C, ..., so I wrote the following:
df_dict = {"A": df1, "B": df2, "C": df3, ...}
Each dataframe has a "Price" column, and I want the output of the following formula:
frequency = df.groupby("Price").size() / len(df)
I wrote the following function and want to obtain the output for each dataframe:
def Price_frequency(df, keys=["Price"]):
    frequency = df.groupby(keys).size() / len(df)
    return frequency.reset_index().to_csv("Output_%s.txt" % (df), sep='\t')
As a first trial, I did
Price_frequency(df1,keys=["Price"])
but this did not work. It seems %s is wrong.
Ideally, I want output files named as "Output_A.txt", "Output_B.txt"...
I would be very grateful for any help.

A couple of points:
%s requires a string as input. But in Python 3.6+ you can use formatted string literals (f-strings), which you may find more readable.
Your function doesn't need to return anything here; you are using it to write CSV files in a loop. Don't feel the need to add a return statement if it doesn't serve a purpose.
So you can do the following:
def price_frequency(df_dict, df_name, keys=['Price']):
    frequency = df_dict[df_name].groupby(keys).size() / len(df_dict[df_name].index)
    frequency.reset_index().to_csv(f'Output_{df_name}.txt', sep='\t')

df_dict = {'A': df1, 'B': df2, 'C': df3}

for df_name in df_dict:
    price_frequency(df_dict, df_name, keys=['Price'])
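To see the whole flow end to end, here is a minimal, self-contained sketch; the sample dataframes and their values are made up for illustration:

```python
import pandas as pd

# Made-up sample data standing in for df1/df2 from the question
df1 = pd.DataFrame({"Price": [10, 10, 20, 30]})
df2 = pd.DataFrame({"Price": [5, 5, 5, 15]})

df_dict = {"A": df1, "B": df2}

def price_frequency(df_dict, df_name, keys=["Price"]):
    df = df_dict[df_name]
    frequency = df.groupby(keys).size() / len(df.index)
    # The f-string builds "Output_A.txt", "Output_B.txt", ... from the dict key
    frequency.reset_index().to_csv(f"Output_{df_name}.txt", sep="\t")

for df_name in df_dict:
    price_frequency(df_dict, df_name)
```

Because the loop iterates over the dict keys, the filename always matches the key the dataframe was stored under.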

Iterating through the columns will also produce one output file per column:
def Price_frequency(df):
    for col in df.columns[2:]:
        frequency = df.groupby(col).size() / len(df)
        frequency.reset_index().to_csv("Output_%s.txt" % col, sep='\t')
Note that the to_csv call has to stay inside the loop; returning on the first iteration would stop after one column.
Reference: Pandas: Iterate through columns and starting at one column
Note: I haven't been able to test this yet

Related

Selecting specific columns in where condition using Pandas

I have a below Dataframe with 3 columns:
df = DataFrame(query, columns=["Processid", "Processdate", "ISofficial"])
In the code below, I get Processdate based on Processid == 204 (without column names):
result = df[df.Processid == 204].Processdate.to_string(index=False)
But I want the same result for two columns at once without column names, something like the code below:
result = df[df.Processid == 204].df["Processdate","ISofficial"].to_string(index=False)
I know how to get the above result, but I don't want column names, index, or data type.
Can someone help?
I think you are looking for the header argument of to_string. Set it to False.
df[df.Processid==204][['Processdate', 'ISofficial']].to_string(index=False, header=False)
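A quick, self-contained check of that answer; the rows below are invented to match the question's three columns:

```python
import pandas as pd

# Invented rows matching the question's columns
df = pd.DataFrame({
    "Processid": [204, 301],
    "Processdate": ["2021-01-01", "2021-02-01"],
    "ISofficial": ["Y", "N"],
})

# index=False drops the row index, header=False drops the column names
result = df[df.Processid == 204][["Processdate", "ISofficial"]].to_string(
    index=False, header=False
)
print(result)
```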

Merge two dataframes where only one column differs. Need to append that column to the new dataframe. Please check below for a detailed view

A little new to Python, I am trying to merge two dataframes with similar columns. The second dataframe has one extra column that needs to be appended to the new dataframe.
Detailed view of dataframes
Code used:
df3 = pd.merge(df, df1[['Id', 'Value_data']], on='Id')
df3 = pd.merge(df, df1[['Id', 'Value_data']], on='Id', how='outer')
The output CSV I get is:
Unnamed: 0 Id_x Number_x Class_x Section_x Place_x Name_x Executed_Date_x Version_x Value PartDateTime_x Cycles_x Id_y Number_y Class_y Section_y Place_y Name_y Executed_Date_y Version_y Value_data PartDateTime_y Cycles_y
whereas I don't want _x & _y; I want the output to be:
Id Number Class Section Place Name Executed_Date Version Value Value_data PartDateTime Cycles
If I use df2 = pd.concat([df, df1], axis=0, ignore_index=True),
then I get values in the format below in all columns except Value_data, which ends up as an empty column.
Id Number Class Section Place Name Executed_Date Version Value Value_data PartDateTime Cycles
Please help me with a solution for this. Thanks for your time.
I think the easiest path is to make a temporary df, let's call it df_temp2, which is a copy of df_2 with the differing column renamed, then append it to df_1:
df_temp2 = df_2.copy()
df_temp2.columns = ['..', '..', ..., 'value', ...]
then
df_total = df_1.append(df_temp2)
This gives you a total DataFrame with all the rows of df_1 and df_2. The append() method supports a few arguments; check the docs for more details.
--- Added --------
Another possible approach is to use the pd.concat() function, which can work in the same way as the .append() method:
result = pd.concat([df_1, df_temp2])
In your case the two approaches lead to similar performance. You can think of append() as a method written on top of pd.concat() but applied to a DataFrame itself. (Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat is the safer choice going forward.)
Full docs about concat() here: pd.concat() docs
Hope this was helpful.
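A minimal sketch of the rename-then-concat idea; the frames and column names here are made up:

```python
import pandas as pd

# Made-up frames: df_2 carries the same data under a differently named column
df_1 = pd.DataFrame({"Id": [1, 2], "Value": [0.1, 0.2]})
df_2 = pd.DataFrame({"Id": [3, 4], "Val": [0.3, 0.4]})

# Rename df_2's columns to match df_1, then stack the rows
df_temp2 = df_2.copy()
df_temp2.columns = ["Id", "Value"]
df_total = pd.concat([df_1, df_temp2], ignore_index=True)

print(df_total)
```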
import pandas as pd

df = pd.read_csv('C:/Users/output_2.csv')
df1 = pd.read_csv('C:/Users/output_1.csv')
df1_temp = df1[['Id', 'Cycles', 'Value_data']].copy()
df3 = pd.merge(df, df1_temp, on=['Id', 'Cycles'], how='inner')
df3 = df3.drop(columns="Unnamed: 0")
df3.to_csv('C:/Users/output.csv')
This worked.
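The reason this avoids the _x/_y suffixes is that only the join keys plus the one new column are taken from df1, so no other names collide. A tiny sketch with made-up data:

```python
import pandas as pd

# Made-up frames sharing every column name except 'Value_data'
df = pd.DataFrame({"Id": [1, 2], "Number": [10, 20], "Value": [0.1, 0.2]})
df1 = pd.DataFrame({"Id": [1, 2], "Number": [10, 20], "Value_data": ["a", "b"]})

# Merging the full frames would produce Number_x / Number_y;
# selecting only ['Id', 'Value_data'] from df1 keeps the names unique
df3 = pd.merge(df, df1[["Id", "Value_data"]], on="Id", how="inner")

print(list(df3.columns))
```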

efficiently mapping values in pandas from a 2nd dataframe

I'm looking to best understand how to use a 2nd file/dataframe to efficiently map values when these values are provided as encoded and there is a label I want to map to it. Think of this 2nd file as a data dictionary that translates the values in the first dataframe.
For example
import pandas as pd
dataset = pd.read_csv('https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv')
data_dictionary = pd.DataFrame({'columnname' : ['vs','vs', 'am','am'], 'code' : [0,1,0,1], 'label':['vs_is_0','vs_is_1','am_is_0','am_is_1'] })
Now, I want to be able to replace the values in those columns of the first dataset according to the 'code'-to-'label' mapping. If a value is found in one and not the other, nothing happens.
Currently my approach is as follows, but I feel it is very inefficient and suboptimal. Keep in mind I could have 30-40 columns, each with 2-200 values I'd want replaced with this VLOOKUP-like replacement:
for each_colname in dataset.columns.tolist():
    lookup_values = data_dictionary.query("columnname=={}".format(each_colname))
    # and then doing a merge...
Any help is much appreciated!
First you can create a mapper dict and then apply it to your dataset:
mapper = (
    data_dictionary.groupby('columnname')
    .apply(lambda x: dict(x[['code', 'label']].values.tolist()))
    .to_dict()
)

for col in mapper.keys():
    dataset[col] = dataset[col].map(mapper[col]).combine_first(dataset[col])
Update to handle mismatched datatypes:
mapper = (
    data_dictionary.groupby('columnname')
    .apply(lambda x: dict(x[['code', 'label']].astype(str).values.tolist()))
    .to_dict()
)

for col in mapper.keys():
    dataset[col] = dataset[col].astype(str).map(mapper[col]).combine_first(dataset[col])
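Put together with a toy dataset (the values below are invented, standing in for the mtcars columns), the mapper approach looks like this:

```python
import pandas as pd

# Invented stand-ins for the question's dataset and data dictionary
dataset = pd.DataFrame({"vs": [0, 1, 0], "am": [1, 1, 0], "mpg": [21.0, 22.8, 18.1]})
data_dictionary = pd.DataFrame({
    "columnname": ["vs", "vs", "am", "am"],
    "code": [0, 1, 0, 1],
    "label": ["vs_is_0", "vs_is_1", "am_is_0", "am_is_1"],
})

# Build a nested dict {column: {code: label}} from the data dictionary
mapper = (
    data_dictionary.groupby("columnname")
    .apply(lambda x: dict(x[["code", "label"]].values.tolist()))
    .to_dict()
)

# Map each listed column; combine_first keeps values with no matching code
for col in mapper:
    dataset[col] = dataset[col].map(mapper[col]).combine_first(dataset[col])

print(dataset[["vs", "am"]])
```

Columns not mentioned in the data dictionary (here, "mpg") are left untouched.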

Rename last column in a dataframe passed along in method chain

How can I rename the last column in a dataframe that was passed along in a method chain? Consider the following example (the real use case is more complex). How can the rename call refer to the dataframe that it processes (which is different from the "table" dataframe)? Is there something like the following? Unfortunately "self" does not exist here.
result = table.iloc[:, 2:-1].rename(columns={self.columns[-1]: "Text"})
Use pipe():
result = table.iloc[:,2:-1].pipe(lambda df: df.rename(columns={df.columns[-1]: "Text"}))
I think that you can just do the following:
result = table.iloc[:, 2:-1]
result.columns = list(result.columns[:-1]) + ["Text"]
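For completeness, a runnable sketch of the pipe() version with an invented table:

```python
import pandas as pd

# Invented table; the real one has more columns
table = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4], "e": [5]})

# pipe() hands the intermediate frame to the lambda,
# so the lambda can inspect its columns and rename the last one
result = table.iloc[:, 2:-1].pipe(
    lambda df: df.rename(columns={df.columns[-1]: "Text"})
)
print(list(result.columns))
```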

Changing all the indexes of all dataframes in a dict

I have a dictionary of dataframes, each with a column in datetime format (column name "Datetime Format"). I am attempting to set the index of each dataframe to that column, and am having difficulty.
I've simplified the issue and tried to find a solution, but my technique is not sticking:
def test_func(dataframe):
    dataframe = dataframe.set_index('Datetime Format')
    return dataframe

test_dict = {'DF_1': df1, 'DF_2': df2}

for k, v in test_dict.items():
    v = test_func(v)
Looking at the resulting test_dict, and at each individual dataframe (df1 and df2), the index was not set to the 'Datetime Format' column.
I know when I do:
df1 = df1.set_index('Datetime Format')
it works correctly.
Please advise on how to get this to work across a list (or, in this case, a dict).
Thank you!
The set_index function returns a new DataFrame by default, which is why your changes aren't sticking.
There are two ways around this: you could re-assign the dict value with the DataFrame returned by the function.
for k, v in test_dict.items():
    test_dict[k] = test_func(v)
Or you could pass the inplace argument when calling set_index:
def test_func(dataframe):
    dataframe.set_index('Datetime Format', inplace=True)
This modifies the original DataFrame in place, without creating a new version; note that there is then nothing to return or re-assign.
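With some invented frames, the re-assignment fix can be sketched like this (a dict comprehension does the same thing as the loop, more compactly):

```python
import pandas as pd

# Invented dataframes with a 'Datetime Format' column
df1 = pd.DataFrame({"Datetime Format": pd.to_datetime(["2021-01-01", "2021-01-02"]),
                    "val": [1, 2]})
df2 = pd.DataFrame({"Datetime Format": pd.to_datetime(["2021-02-01"]),
                    "val": [3]})

test_dict = {"DF_1": df1, "DF_2": df2}

# Re-assign each dict value to the new DataFrame returned by set_index
test_dict = {k: v.set_index("Datetime Format") for k, v in test_dict.items()}

print(test_dict["DF_1"].index.name)
```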