I have a CSV with 4 columns and I want to pass it as a parameter to a function in Python. The 'key' should be the first column of the CSV.
df = pd.DataFrame({'Country': ['US','France','Germany'], 'daycount': ['Actual360','Actual365','ActualFixed'], 'frequency': ['Annual','Semi','Quart'], 'calendar': ['United','FRA','Ger']})
From the above dataframe, I want to populate the following variables with the corresponding values, using 'Country' as the key. I need a function or loop to do this; the values will be used later in the next program.
day_count = Actual360
comp_frequency = Annual
gl_calendar = UnitedStates
If I understood correctly:
def retrieve_value(attribute, country, df):  # attribute and country are passed as str
    return df.loc[df['Country'] == country, attribute].iloc[0]
Ex:
retrieve_value('daycount', 'Germany', df) -> 'ActualFixed'
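To fill the three variables from the question, you could then call it once per attribute (a small sketch reusing the function above):
day_count = retrieve_value('daycount', 'US', df)
comp_frequency = retrieve_value('frequency', 'US', df)
gl_calendar = retrieve_value('calendar', 'US', df)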
This?
def populate(df, country):
    # .iloc[0] takes the first matching row regardless of its index label
    day_count = df[df['Country'] == country]['daycount'].iloc[0]
    comp_frequency = df[df['Country'] == country]['frequency'].iloc[0]
    gl_calendar = df[df['Country'] == country]['calendar'].iloc[0]
    return (day_count, comp_frequency, gl_calendar)
populate(df,'US')
Out: ('Actual360', 'Annual', 'United')
I'm not sure I got your question, let me try to reformulate it.
You have a pandas DataFrame with 4 columns, one of which (Country) acts as an index (=primary key in DB language). You would like to iterate on all the rows, and retrieve for each row the corresponding values in the other 3 columns.
If I didn't misread your intent, here is code that'll do the job. Note the DataFrame.set_index(<column_name>) call: it tells pandas that this column should be used to index the rows (instead of the default numeric index).
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'Country': ['US','France','Germany'],'daycount':['Actual360','Actual365','ActualFixed'],'frequency':['Annual','Semi','Quart'], 'calendar':['United','FRA','Ger']}).set_index('Country')
In [3]: df
Out[3]:
            daycount frequency calendar
Country
US         Actual360    Annual   United
France     Actual365      Semi      FRA
Germany  ActualFixed     Quart      Ger
In [4]: for country, attributes in df.iterrows():
   ...:     day_count = attributes['daycount']
   ...:     comp_frequency = attributes['frequency']
   ...:     # idem for the last value
   ...:     print(f"{country} -> {day_count}, {comp_frequency}")
   ...:
US -> Actual360, Annual
France -> Actual365, Semi
Germany -> ActualFixed, Quart
In [5]: df.loc['US', 'daycount'] # use df.loc[<country>, <attribute>] to retrieve specific value
Out[5]: 'Actual360'
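With the Country index in place, you can also grab all three values for one country in a single lookup and unpack them (a minimal sketch):
In [6]: day_count, comp_frequency, gl_calendar = df.loc['US', ['daycount', 'frequency', 'calendar']]
In [7]: day_count
Out[7]: 'Actual360'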
I'm using numpy select with multiple conditions to assign a category based on a text string in a transaction description.
Part of the code is below
import numpy as np
conditions = [
    df2['description'].str.contains('AEGON', na=False),
    df2['description'].str.contains('IB/PVV', na=False),
    df2['description'].str.contains('Picnic', na=False),
    df2['description'].str.contains('Jumbo', na=False),
]
values = [
    'Hypotheek',
    'Hypotheek',
    'Boodschappen',
    'Boodschappen']
df2['Classificatie'] = np.select(conditions, values, default='unknown')
I have many conditions, which are only partly shown here.
I want to create a table / dataframe instead of including every separate condition and value in the code. So for instance the following dataframe:
import pandas as pd
Conditions = {'Condition': ['AEGON','IB/PVV','Picnic','Jumbo'],
              'Value': ['Hypotheek','Hypotheek','Boodschappen','Boodschappen']
              }
df_conditions = pd.DataFrame(Conditions, columns= ['Condition','Value'])
How can I adjust the condition to look (in the str.contains) for a text string as listed in df_conditions['Condition'], and to apply the Value column to df2['Classificatie']?
The values are already a list in the variable explorer, but I can't find a way to have str.contains look for a value in a list / dataframe.
Desired outcome:
In [3]: iwantthis
Out[3]:
                 Description Classificatie
0    groceries Jumbo on date  boodschappen
1   mortgage payment Aegon.      Hypotheek
2           transfer picnic.  Boodschappen
The first column is the input data frame; the second column is what I'm looking for.
Please note that my current code already allows me to create this column, but I want a more automated way using the df_conditions table.
I'm not yet really familiar with Python and I can't find anything on this online.
Try:
import re
df_conditions["Condition"] = df_conditions["Condition"].str.lower()
df_conditions = df_conditions.set_index("Condition")
tmp = df["Description"].str.extract(
"(" + "|".join(re.escape(c) for c in df_conditions.index) + ")",
flags=re.I,
)
df["Classificatie"] = tmp[0].str.lower().map(df_conditions["Value"])
print(df)
Prints:
               Description Classificatie
0  groceries Jumbo on date  Boodschappen
1  mortgage payment Aegon.     Hypotheek
2         transfer picnic.  Boodschappen
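One difference from the np.select version: rows that match no condition come back from .map() as NaN rather than 'unknown'. If you want the old default back, a one-line follow-up restores it (a sketch using the same column name):
df["Classificatie"] = df["Classificatie"].fillna("unknown")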
I have 2 dataframes:
df1
ID       Type
2456-AA  Coolant
2457-AA  Elec
df2
ID                  Task
[2456-AA, 5656-BB]  Check AC
[2456-AA, 2457-AA]  Check Equip.
I'm trying to return the matched IDs' 'Type' from df1 to df2, with the result looking something like this:
df2
ID                  Task          Type
[2456-AA, 5656-BB]  Check AC      [Coolant]
[2456-AA, 2457-AA]  Check Equip.  [Coolant, Elec]
I tried the following for loop. I understand it isn't the fastest, but I'm struggling to work out a faster alternative:
def type_identifier(type):
    df = df1.copy()
    device_type = []
    for value in df1.ID:
        for x in type:
            if x == value:
                device_type.append(df1.Type.tolist())
            else:
                None
    return device_type

df2['test'] = df2['ID'].apply(lambda x: type_identifier(x))
Could somebody help me out, and also refer me to material that could help me better approach problems like these?
Thank you,
Use pandas' to_dict to convert df1 to a dictionary, so we can efficiently translate an ID to its Type.
Then apply a lambda that, for each list of IDs in df2, converts it to the right types, and assign the result to the test column as you wished.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'ID': ['2456-AA', '2457-AA'],
                    'Type': ['Coolant', 'Elec']})
df2 = pd.DataFrame({'ID': [['2456-AA', '5656-BB'], ['2456-AA', '2457-AA']],
                    'Task': ['Check AC', 'Check Equip.']})
# Use to dict to convert df1 ids to types
id_to_type = df1.set_index('ID').to_dict()['Type']
# {'2456-AA': 'Coolant', '2457-AA': 'Elec'}
print(id_to_type)
# Apply a lambda that, for each `ID` list in `df2`, converts it to the right types
df2['test'] = df2['ID'].apply(lambda x: [id_to_type[t] for t in x if t in id_to_type])
print(df2)
Output:
                   ID          Task             test
0  [2456-AA, 5656-BB]      Check AC        [Coolant]
1  [2456-AA, 2457-AA]  Check Equip.  [Coolant, Elec]
I'm new to doing parallel processing in Python. I have a large dataframe with names and the list of countries that the person lived in. A sample dataframe is this:
I have a chunk of code that takes in this dataframe and splits the countries to separate columns. The code is this:
def split_country(data):
    d_list = []
    for index, row in data.iterrows():
        for value in str(row['Country']).split(','):
            d_list.append({'Name': row['Name'],
                           'value': value})
    data = data.append(d_list, ignore_index=True)
    data = data.groupby('Name')['value'].value_counts()
    data = data.unstack(level=-1).fillna(0)
    return (data)
The final output is something like this:
I'm trying to parallelize the above process by passing my dataframe (df) using the following:
import multiprocessing as mp

result = []
pool = mp.Pool(mp.cpu_count())
result.append(pool.map(split_country, [row for row in df]))
But the processing does not stop even with a toy dataset like the above. I'm completely new to this, so I would appreciate any help.
multiprocessing is probably not required here. Using pandas vectorized methods will be sufficient to quickly produce the desired result.
For a test DataFrame with 1M rows, the following code took 1.54 seconds.
First, use pandas.DataFrame.explode on the column of lists
If the column is strings, first use ast.literal_eval to convert them to list type
df.countries = df.countries.apply(ast.literal_eval)
If the data is read from a CSV file, use df = pd.read_csv('test.csv', converters={'countries': literal_eval})
For this question, it's better to use pandas.get_dummies to get a count of each country per name, then pandas.DataFrame.groupby on 'name', and aggregate with .sum
import pandas as pd
from ast import literal_eval
# sample data
data = {'name': ['John', 'Jack', 'James'], 'countries': [['USA', 'UK'], ['China', 'UK'], ['Canada', 'USA']]}
# create the dataframe
df = pd.DataFrame(data)
# if the countries column is strings, evaluate to lists; otherwise skip this line
df.countries = df.countries.apply(literal_eval)
# explode the lists
df = df.explode('countries')
# use get_dummies and groupby name and sum
df_counts = pd.get_dummies(df, columns=['countries'], prefix_sep='', prefix='').groupby('name', as_index=False).sum()
# display(df_counts)
    name  Canada  China  UK  USA
0   Jack       0      1   1    0
1  James       1      0   0    1
2   John       0      0   1    1
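As a side note, the same counts can also be produced after the explode step with pandas.crosstab (an alternative sketch, not what the 1.54-second timing above measured):
df_counts = pd.crosstab(df['name'], df['countries']).reset_index()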
Values in my Pandas dataframe are mixed and shifted, but each column has its own characteristics for the values in it. How can I rearrange the values into their proper positions?
'floor_no' have to contain values with ' / ' substring in it.
'room_count' is maximum 2 values digit long.
'sq_m_count' have to contain ' m²' substring in it.
'price_sq' have to contain ' USD/m²' in it.
'bs_state' have to contain one of 'Have' or 'Do not have' values.
Adding part of pandas dataframe.
Consider the following approach:
In [90]: dfs = []
In [91]: url = 'https://ru.bina.az/items/565674'
In [92]: dfs.append(pd.read_html(url)[0].set_index(0).T)
In [93]: url = 'https://ru.bina.az/items/551883'
In [94]: dfs.append(pd.read_html(url)[0].set_index(0).T)
In [95]: df = pd.concat(dfs, ignore_index=True)
In [96]: df
Out[96]:
0    Категория Площадь Количество комнат Купчая
0  Дом / Вилла  376 м²                 6   есть
1  Дом / Вилла  605 м²                 6    нет
I figured out a solution that is a bit "+18 and perverted".
I wrote a loop that checks whether each of these columns contains some string that identifies the column it belongs to, and copies this value to a new column. Then I simply substituted the new column for the old one.
I did this with each of the 'mixed' columns. This code filled my needs and fixed all the problems. I understand how 'perverted' the code is and will write a function that is much shorter and more professional.
for index in bina_az_df.itertuples():
    bina_az_df.loc[bina_az_df['bs_state'].str.contains(" m²|sot"), 'new_sq_m_count'] = bina_az_df['bs_state']
    bina_az_df.loc[bina_az_df['sq_m_count'].str.contains(" m²|sot"), 'new_sq_m_count'] = bina_az_df['sq_m_count']
    bina_az_df.loc[bina_az_df['floor_no'].str.contains(" m²|sot"), 'new_sq_m_count'] = bina_az_df['floor_no']
    bina_az_df.loc[bina_az_df['price_sq'].str.contains(" m²|sot"), 'new_sq_m_count'] = bina_az_df['price_sq']
    bina_az_df.loc[bina_az_df['room_count'].str.contains(" m²|sot"), 'new_sq_m_count'] = bina_az_df['room_count']

bina_az_df['sq_m_count'] = bina_az_df['new_sq_m_count']  # substitutes the new column for the old one
del bina_az_df['new_sq_m_count']  # deletes the unnecessary temp column
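For reference, a shorter version of the same idea could look like this (a minimal sketch; the pull_matching name is made up here, but the column list and the ' m²|sot' pattern come from the loop above):
def pull_matching(df, pattern, target):
    # copy values matching `pattern` from every candidate column into a temp
    # column, then overwrite `target` with it (same logic as the loop above)
    tmp = '_new_' + target
    for col in ['bs_state', 'sq_m_count', 'floor_no', 'price_sq', 'room_count']:
        mask = df[col].str.contains(pattern, na=False)
        df.loc[mask, tmp] = df.loc[mask, col]
    df[target] = df[tmp]
    return df.drop(columns=tmp)

bina_az_df = pull_matching(bina_az_df, ' m²|sot', 'sq_m_count')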
How do I get the name of a DataFrame and print it as a string?
Example:
boston (var name assigned to a csv file)
import pandas as pd
boston = pd.read_csv('boston.csv')
print('The winner is team A based on the %s table.' % boston)
You can name the dataframe with the following, and then call the name wherever you like:
import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.ones([4, 4]))
df.name = 'Ones'
print(df.name)
>>>
Ones
Sometimes df.name doesn't work.
You might get an error message:
'DataFrame' object has no attribute 'name'
Try the below function:
def get_df_name(df):
    name = [x for x in globals() if globals()[x] is df][0]
    return name
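For example (assuming the dataframe is defined at module scope, since the lookup goes through globals()):
boston = pd.read_csv('boston.csv')
print(get_df_name(boston))
>>>
boston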
In many situations, a custom attribute attached to a pd.DataFrame object is not necessary. In addition, note that pandas-object attributes may not serialize. So pickling will lose this data.
Instead, consider creating a dictionary with appropriately named keys and access the dataframe via dfs['some_label'].
df = pd.DataFrame()
dfs = {'some_label': df}
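Access and iteration then stay explicit (a small usage sketch with the dfs dict above):
dfs['some_label'].head()          # retrieve a frame by its label

for label, frame in dfs.items():  # or iterate over all labeled frames
    print(label, frame.shape)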
From here, my understanding is that DataFrames are:
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
And Series are:
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
Series have a name attribute which can be accessed like so:
In [27]: s = pd.Series(np.random.randn(5), name='something')
In [28]: s
Out[28]:
0    0.541
1   -1.175
2    0.129
3    0.043
4   -0.429
Name: something, dtype: float64
In [29]: s.name
Out[29]: 'something'
EDIT: Based on OP's comments, I think OP was looking for something like:
>>> df = pd.DataFrame(...)
>>> df.name = 'df' # making a custom attribute that DataFrame doesn't intrinsically have
>>> print(df.name)
'df'
DataFrames don't have names, but you have an (experimental) attribute dictionary you can use. For example:
df.attrs['name'] = "My name" # Can be retrieved later
Attributes are retained through some operations.
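A quick round-trip (a minimal sketch; attrs is available in pandas >= 1.0 and is still documented as experimental):
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
df.attrs['name'] = 'My name'
print(df.attrs['name'])  # -> My name
print(df.copy().attrs)   # attrs propagate through copy() in recent pandas versions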
Here is a sample function:
df.name = file : see the sixth line in the code below.
def df_list():
    filename_list = current_stage_files(PATH)
    df_list = []
    for file in filename_list:
        df = pd.read_csv(PATH + file)
        df.name = file
        df_list.append(df)
    return df_list
I am working on a module for feature analysis and I had the same need as yours, as I would like to generate a report with the name of the pandas.Dataframe being analyzed. To solve this, I used the same solution presented by #scohe001 and #LeopardShark, originally in https://stackoverflow.com/a/18425523/8508275, implemented with the inspect library:
import inspect

def aux_retrieve_name(var):
    callers_local_vars = inspect.currentframe().f_back.f_back.f_locals.items()
    return [var_name for var_name, var_val in callers_local_vars if var_val is var]
Note the additional .f_back term since I intend to call it from another function:
def header_generator(df):
    print('--------- Feature Analyzer ----------')
    print('Dataframe name: "{}"'.format(aux_retrieve_name(df)[0]))  # [0]: take the first matching name
    print('Memory usage: {:03.2f} MB'.format(df.memory_usage(deep=True).sum() / 1024 ** 2))
    return
Running this code with a given dataframe, I get the following output:
header_generator(trial_dataframe)
--------- Feature Analyzer ----------
Dataframe name: "trial_dataframe"
Memory usage: 63.08 MB