I have a Python function that cleans up my dataframe column names (it replaces whitespace with _ and prefixes a _ if a column name begins with a number):
These dataframes started out as JSONs that were converted to dataframes to make them easier to work with.
import re

def prepare_json(df):
    # prefix column names that start with a digit with an underscore
    df = df.rename(lambda x: '_' + x if re.match(r'([0-9])\w+', x) else x, axis=1)
    # replace spaces in column names with underscores
    df = df.rename(lambda x: x.replace(' ', '_'), axis=1)
    return df
This works for simple jsons like the following:
{"123asd":"test","test json":"test"}
Output:
{"_123asd":"test","test_json":"test"}
However, when I try it with a more complex dataframe, it does not work anymore.
Here is an example:
{"SETDET":[{"SETPRTY":[{"DEAG":{"95R":[{"Data Source Scheme":"SCOM","Proprietary Code":"CH123456"}]}},{"SAFE":{"97A":[{"Account Number":"123456789"}]},"SELL":{"95P":[{"Identifier Code Location Code":"AB","Identifier Code Country Code":"AB","Identifier Code Logical Terminal":"XXX","Identifier Code Bank Code":"ABCD"}]}},{"PSET":{"95P":[{"Identifier Code Location Code":"ZZ","Identifier Code Country Code":"CH","Identifier Code Logical Terminal":"","Identifier Code Bank Code":"INSE"}]}}],"SETR":{"22F":[{"Data Source Scheme":"","Indicator":"TRAD"}]}}],"TRADDET":[{"Other":{"35B":[{"Identification of Security":"CH0012138530","Description of Security":"CREDIT SUISSE GROUP"}]},"SETT":{"98A":[{"Date":"20181127"}]},"TRAD":{"98A":[{"Date":"20181123"}]}}],"FIAC":[{"SAFE":{"97A":[{"Account Number":"0123-1234567-05-001"}]},"SETT":{"36B":[{"Quantity":"10,","Quantity Type Code":"UNIT"}]}}],"GENL":[{"SEME":{"20C":[{"Reference":"1234567890123456"}]},"Other":{"23G":[{"Subfunction":"","Function":"NEWM"}]},"PREP":{"98C":[{"Date":"20181123","Time":"165256"}]}}]}
Trying it out with this, I get the following error when trying to write the dataframe to BigQuery:
Invalid field name "97A". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 300 characters long. with loading dataframe
Maybe my solution helps you:
1. convert your dictionary to a string
2. find all keys of the dictionary with a regex
3. replace spaces in the keys with _ and add _ before keys that start with a digit
4. convert the string back to a dictionary with ast.literal_eval(dict_string)
Try this:
import re
import ast
from copy import deepcopy

def my_replace(match):
    # insert '_' between the opening "{'" and the leading digit of the key
    return match.group()[0] + match.group()[1] + "_" + match.group()[2]
dct = {"SETDET":[{"SETPRTY":[{"DEAG":{"95R":[{"Data Source Scheme":"SCOM","Proprietary Code":"CH123456"}]}},{"SAFE":{"97A":[{"Account Number":"123456789"}]},"SELL":{"95P":[{"Identifier Code Location Code":"AB","Identifier Code Country Code":"AB","Identifier Code Logical Terminal":"XXX","Identifier Code Bank Code":"ABCD"}]}},{"PSET":{"95P":[{"Identifier Code Location Code":"ZZ","Identifier Code Country Code":"CH","Identifier Code Logical Terminal":"","Identifier Code Bank Code":"INSE"}]}}],"SETR":{"22F":[{"Data Source Scheme":"","Indicator":"TRAD"}]}}],"TRADDET":[{"Other":{"35B":[{"Identification of Security":"CH0012138530","Description of Security":"CREDIT SUISSE GROUP"}]},"SETT":{"98A":[{"Date":"20181127"}]},"TRAD":{"98A":[{"Date":"20181123"}]}}],"FIAC":[{"SAFE":{"97A":[{"Account Number":"0123-1234567-05-001"}]},"SETT":{"36B":[{"Quantity":"10,","Quantity Type Code":"UNIT"}]}}],"GENL":[{"SEME":{"20C":[{"Reference":"1234567890123456"}]},"Other":{"23G":[{"Subfunction":"","Function":"NEWM"}]},"PREP":{"98C":[{"Date":"20181123","Time":"165256"}]}}]}
# find all keys of the stringified dictionary, e.g. "{'SETDET': " or " 'Account Number': "
keys = re.findall(r"{\'.*?\': | \'.*?\': ", str(dct))
keys_bfr_chng = deepcopy(keys)
# replace whitespace inside keys with '_'
keys = [re.sub(r"\s+(?=\w)", '_', key) for key in keys]
# add '_' before keys that start with a digit
keys = [re.sub(r"{\'\d", my_replace, key) for key in keys]
dct = str(dct)
for i in range(len(keys)):
    dct = dct.replace(keys_bfr_chng[i], keys[i])
dct = ast.literal_eval(dct)
print(dct)
type(dct)
output:
{'SETDET': [{'SETPRTY': [{'DEAG': {'_95R': [{'Data_Source_Scheme': 'SCOM', 'Proprietary_Code': 'CH123456'}]}}, {'SAFE': {'_97A': [{'Account_Number': '123456789'}]}, 'SELL': {'_95P': [{'Identifier_Code_Location_Code': 'AB', 'Identifier_Code_Country_Code': 'AB', 'Identifier_Code_Logical_Terminal': 'XXX', 'Identifier_Code_Bank_Code': 'ABCD'}]}}, {'PSET': {'_95P': [{'Identifier_Code_Location_Code': 'ZZ', 'Identifier_Code_Country_Code': 'CH', 'Identifier_Code_Logical_Terminal': '', 'Identifier_Code_Bank_Code': 'INSE'}]}}], 'SETR': {'_22F': [{'Data_Source_Scheme': '', 'Indicator': 'TRAD'}]}}], 'TRADDET': [{'Other': {'_35B': [{'Identification_of_Security': 'CH0012138530', 'Description_of_Security': 'CREDIT SUISSE GROUP'}]}, 'SETT': {'_98A': [{'Date': '20181127'}]}, 'TRAD': {'_98A': [{'Date': '20181123'}]}}], 'FIAC': [{'SAFE': {'_97A': [{'Account_Number': '0123-1234567-05-001'}]}, 'SETT': {'_36B': [{'Quantity': '10,', 'Quantity_Type_Code': 'UNIT'}]}}], 'GENL': [{'SEME': {'_20C': [{'Reference': '1234567890123456'}]}, 'Other': {'_23G': [{'Subfunction': '', 'Function': 'NEWM'}]}, 'PREP': {'_98C': [{'Date': '20181123', 'Time': '165256'}]}}]}
dict
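For reference, the same cleanup could also be done without the str()/ast.literal_eval round trip by walking the nested structure recursively. This is only a sketch of an alternative (the clean_keys function is mine, not part of the answer above):

import re

def clean_keys(obj):
    # recursively copy obj, replacing spaces in keys with '_' and
    # prefixing keys that start with a digit with '_', at every nesting level
    if isinstance(obj, dict):
        return {('_' + k if re.match(r'\d', k) else k).replace(' ', '_'): clean_keys(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [clean_keys(item) for item in obj]
    return obj

clean_keys(dct)  # yields the same renamed structure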
I have a dataframe, built as follows, with one string column and one int column.
import random
import pandas as pd

columns=['EG','EC','FI', 'ED', 'EB', 'FB', 'FCY', 'ECY', 'FG', 'FUR', 'E', '[ED']
choices_str = random.choices(columns, k=200)
choices_int = random.choices(range(1, 8), k=200)
my_df = pd.DataFrame({'column_A': choices_str, 'column_B': choices_int})
I would like to end up with a dictionary of lists that stores all the values of column_B grouped by column_A.
What I did to achieve this was to use a groupby to get the number of occurrences for column_B:
group_by = my_df.groupby(['column_A','column_B'])['column_B'].count().unstack().fillna(0).T
group_by
And then I use some list comprehensions to build the list for each column_A value by hand and add it to the dictionary.
Is there any way to get this more directly using a groupby?
I am not aware of a method that achieves this within the groupby statement itself, but I think you could try something like this as an alternative:
import random
import pandas as pd
columns=['EG','EC','FI', 'ED', 'EB', 'FB', 'FCY', 'ECY', 'FG', 'FUR', 'E', '[ED']
choices_str = random.choices(columns, k=200)
choices_int = random.choices(range(1, 8), k=200)
my_df = pd.DataFrame({'column_A': choices_str, 'column_B': choices_int})
final_dict = {val: my_df.loc[my_df['column_A'] == val, 'column_B'].values.tolist() for val in my_df['column_A'].unique()}
This dict comprehension is a one-liner: for each unique column_A value it takes all the corresponding column_B values and stores them as a list in the dict, with the column_A values as keys.
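To make the behaviour concrete, here is a small deterministic example of the same comprehension (the data is made up; only the column names match the question):

import pandas as pd

demo = pd.DataFrame({'column_A': ['EG', 'EC', 'EG', 'EC', 'FI'],
                     'column_B': [1, 2, 3, 4, 5]})
demo_dict = {val: demo.loc[demo['column_A'] == val, 'column_B'].values.tolist()
             for val in demo['column_A'].unique()}
print(demo_dict)  # {'EG': [1, 3], 'EC': [2, 4], 'FI': [5]}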
I'm new to Python and trying to work out what the following line is doing; any help would be greatly appreciated:
new = old.rename(index={element: (re.sub(' Equity', '', element)) for element in old.index.tolist()})
Assume that the source CSV file has the following content (note the leading space before Equity):
c1,c2
Abc,A,10
 Equity,B,20
Cex,C,30
Dim,D,40
If you run
old = pd.read_csv('input.csv', index_col=[0])
then old will have the following content:
c1 c2
Abc A 10
Equity B 20
Cex C 30
Dim D 40
Let's look at each part of your code.
old.index.tolist() contains: ['Abc', ' Equity', 'Cex', 'Dim'].
When you run {element: re.sub(' Equity', '', element) for element in old.index}
(a dictionary comprehension), you will get:
{'Abc': 'Abc', ' Equity': '', 'Cex': 'Cex', 'Dim': 'Dim'}
so each value is equal to its key with one exception: The value for
' Equity' key is an empty string.
Note that neither tolist() nor the parentheses surrounding re.sub(...)
are needed (the result is the same).
And the last step:
new = old.rename(index=...) returns a copy of old with a renamed index,
substituting ' Equity' with an empty string, and the result is saved
in the new variable.
That's all.
Assuming old is a pandas DataFrame, the code is renaming the index (see rename) by removing the substring ' Equity' from each of its labels, for example:
import pandas as pd
import re
old = pd.DataFrame(list(enumerate(['Some Equity', 'No Equity', 'foo', 'foobar'])), columns=['id', 'equity'])
old = old.set_index('equity')
print(old)
Output (Before)
id
equity
Some Equity 0
No Equity 1
foo 2
foobar 3
Then if you run:
new = old.rename(index={element: (re.sub(' Equity', '', element)) for element in old.index.tolist()})
Output (After)
id
equity
Some 0
No 1
foo 2
foobar 3
The following expression is known as a dictionary comprehension:
{element: (re.sub(' Equity', '', element)) for element in old.index.tolist()}
For the data of the example above, it creates the following dictionary:
{'Some Equity': 'Some', 'No Equity': 'No', 'foo': 'foo', 'foobar': 'foobar'}
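As a side note, the same rename can be written without building the dictionary at all, since rename also accepts a callable for the index labels. A minimal sketch of that variant (my addition, reusing old and re from the example above):

new = old.rename(index=lambda label: re.sub(' Equity', '', label))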
I have a German csv file that was incorrectly encoded. I want to convert the characters back to utf-8 using a dictionary. I thought what I was doing was correct, but when I print the DF, nothing has changed. Here's my code:
import os
import pandas as pd

DATA_DIR = 'C:\\...'

translations = {
    'ö': 'oe',
    'ü': 'ue',
    'ß': 'ss',
    'ä': 'ae',
    '€': '€',
    'Ä': 'Ae',
    'Ö': 'Oe',
    'Ü': 'Ue'
}

def cleanup():
    for file in os.listdir(os.path.join(DATA_DIR)):
        if not file.lower().endswith('.csv'):
            continue
        data_utf = pd.read_csv(os.path.join(DATA_DIR, file), header=3, index_col=None, skiprows=0-2)
        data_utf.replace(translations, inplace=True)
        print(data_utf)

if __name__ == '__main__':
    cleanup()
I also tried
for before, after in translations.items():
    data_utf.replace(before, after)
within the function, and directly putting the translations in the replace itself. This process works if I specify the column in which to replace the characters, however. What do I need to do to apply these translations to the whole dataframe, as well as to the dataframe column headers? Thanks!
Add regex=True so replace also works on substrings; for the columns, it is possible to convert the values to a Series with Index.to_series and then use replace:
import pandas as pd

data_utf = pd.DataFrame({'raÜing':['ösaüs','Ä dd Ö','ÖÄ']})
translations = {
'ö': 'oe',
'ü': 'ue',
'ß': 'ss',
'ä': 'ae',
'€': '€',
'Ä': 'Ae',
'Ö': 'Oe',
'Ü': 'Ue'
}
data_utf.replace(translations, inplace=True, regex=True)
data_utf.columns = data_utf.columns.to_series().replace(translations, regex=True)
print(data_utf)
raUeing
0 oesaues
1 Ae dd Oe
2 OeAe
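To see why regex=True matters here, a small sketch contrasting the two calls (this example is mine, not part of the answer above): without regex=True, DataFrame.replace only swaps cells that match a key exactly, so substrings are left alone.

import pandas as pd

df = pd.DataFrame({'city': ['Köln', 'ö']})
print(df.replace({'ö': 'oe'}))               # only the cell that is exactly 'ö' changes
print(df.replace({'ö': 'oe'}, regex=True))   # 'Köln' becomes 'Koeln' as well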
How would you implement the following using pandas?
part 1:
I want to create a new conditional column in input_dataframe. Each row of input_dataframe will be matched against a regex. If at least one element in the row matches, then the element for this row in the new column will contain the matched value(s).
part 2: A more complete version would be:
The source of the regex is the value of each element originating from another series (i.e. I want to know whether each row of input_dataframe contains a value(s) from the passed series).
part 3: An even more complete version would be:
Instead of passing a series, I'd pass another dataframe, regex_dataframe. For each column in it, I would apply the same process as in part 2 above. (Thus, the result would be a new column in input_dataframe for each column in regex_dataframe.)
example input:
input_df = pd.DataFrame({
'a':['hose','dog','baby'],
'b':['banana','avocado','mango'],
'c':['horse','dog','cat'],
'd':['chease','cucumber','orange']
})
example regex_dataframe:
regex_dataframe = pd.DataFrame({
'e':['ho','ddddd','ccccccc'],
'f':['wwwwww','ado','kkkkkkkk'],
'g':['fffff','mmmmmmm','cat'],
'i':['heas','ber','aaaaaaaa']
})
example result:
result_dataframe = pd.DataFrame({
'a': ['hose', 'dog', 'baby'],
'b': ['banana', 'avocado', 'mango'],
'c': ['horse', 'dog', 'cat'],
'd': ['chease', 'cucumber', 'orange'],
'e': ['ho', '', ''],
'f': ['', 'ado', ''],
'g': ['', '', 'cat'],
'i': ['heas', 'ber', '']
})
First of all, rename the columns of regex_dataframe so that individual cells correspond to each other in both dataframes.
input_df = pd.DataFrame({
'a':['hose','dog','baby'],
'b':['banana','avocado','mango'],
'c':['horse','dog','cat'],
'd':['chease','cucumber','orange']
})
regex_dataframe = pd.DataFrame({
'a':['ho','ddddd','ccccccc'],
'b':['wwwwww','ado','kkkkkkkk'],
'c':['fffff','mmmmmmm','cat'],
'd':['heas','ber','aaaaaaaa']
})
Apply the method DataFrame.combine(other, func, fill_value=None, overwrite=True) to get pairs of corresponding columns (which are Series).
Apply Series.combine(other, func, fill_value=nan) to get pairs of corresponding cells.
Apply the regex to the cells.
import re

def process_cell(text, reg):
    # return the matched substring, or '' if the regex does not match
    res = re.search(reg, text)
    return res.group() if res else ''

def process_column(col_t, col_r):
    # combine a text column with its regex column, cell by cell
    return col_t.combine(col_r, lambda text, reg: process_cell(text, reg))

input_df.combine(regex_dataframe, lambda col_t, col_r: process_column(col_t, col_r))
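As a possible follow-up (my addition, reusing the objects defined above): the combine call returns, per cell, the matched substring or '', and it can be joined back onto input_df under the regex frame's original column names to mirror the result_dataframe from the question:

matches = input_df.combine(regex_dataframe, lambda col_t, col_r: process_column(col_t, col_r))
result = input_df.join(matches.rename(columns={'a': 'e', 'b': 'f', 'c': 'g', 'd': 'i'}))
print(result)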
I'm trying to get unique values from the column 'name' for every distinct value in column 'gender'.
Here's sample data:
sample input_file_data:
index,name,gender,alive
1,Adam,Male,Y
2,Bella,Female,N
3,Marc,Male,Y
1,Adam,Male,N
I could get it when I hardcode a value for 'gender', for example "Male" in the code below:
filtered_data = filter(lambda person: person["gender"] == "Male", input_file_data)
reader = (dict((k, v.strip()) for k, v in row.items() if v) for row in filtered_data)
countt = [rec["gender"] for rec in reader]
final1 = input_file_name + ".txt", "gender", "Male"
output1 = str(final1).replace("(", "").replace(")", "").replace("'","").replace(", [{", " -- [").replace("}", "")
final2 = set(re.findall(r"name': '(.*?)'", str(filtered_data)))
final_count = len(final2)
output = str(final_count) + " occurrences", str(final2)
output2 = output1, str(output)
output_final = str(output2).replace('\\', "").replace('"',"").replace(']"', "]").replace("set", "").replace("(", "").replace(")", "").replace("'","").replace(", [{", " -- [").replace("}", "")
output_final = output_final + "\n"
current output:
input_file_name.txt, gender, Male, 2 occurrences, [Adam,Marc]
Expected output:
input_file_name.txt, gender, Male, 2 occurrences, [Adam,Marc], Female, 1 occurrences [Bella]
which should show all the unique occurrences of names for every distinct gender value (without hardcoding). Also, I do not want to use Pandas. Any help is highly appreciated.
PS- I have multiple files and not all files have the same columns. So I can't hardcode them. Also, all the files have a 'name' column, but not all files have a 'gender' column. And this script should work for any other column like 'index' or 'alive' or anything else for that matter and not just gender.
I would use the csv module along with the defaultdict from collections for this. Say this is stored in a file called test.csv:
>>> import csv
>>> from collections import defaultdict
>>> with open('test.csv', newline='') as fin:
...     data = list(csv.reader(fin))[1:]  # skip the header row
...
>>> gender_dict = defaultdict(set)
>>> for idx, name, gender, alive in data:
...     gender_dict[gender].add(name)
...
>>> gender_dict
defaultdict(<class 'set'>, {'Male': {'Adam', 'Marc'}, 'Female': {'Bella'}})
You now have a dictionary. Each key is a unique value from the gender column. Each value is a set, so you'll only get unique items. Notice that we added 'Adam' twice, but only see one in the resulting set.
You don't need defaultdict, but it allows you to use less boilerplate code to check if a key exists.
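For comparison, this is roughly what the same loop looks like with a plain dict and an explicit check (a sketch, reusing the data list read above):

>>> gender_dict = {}
>>> for idx, name, gender, alive in data:
...     if gender not in gender_dict:
...         gender_dict[gender] = set()
...     gender_dict[gender].add(name)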
EDIT: It might help to have better visibility into the data itself. Given your code, I can make the following assumptions:
input_file_data is an iterable (list, tuple, something like that) containing dictionaries.
Each dictionary contains a 'gender' key. If it didn't include at least 'gender', you would get a key error when trying to filter it.
Each dictionary has a 'name' key, it looks like.
Rather than doing all of that regex, what about this?
>>> gender_dict = {'Male': set(), 'Female': set()}
>>> for item in input_file_data:
gender_dict[item['gender']].add(item['name'])
You can use item.get('name') instead of item['name'] if not every entry will have a name.
Edit #2: Ok, the first thing you need to do is get your data into a consistent state. We can absolutely get to a point where you have a column name (gender, index, alive, whatever you want) and a set of unique names corresponding to those columns. Something like this:
data_dict = {'gender':
                 {'Male': ['Adam', 'Marc'],
                  'Female': ['Bella']},
             'alive':
                 {'Y': ['Adam', 'Marc'],
                  'N': ['Bella', 'Adam']},
             'index':
                 {1: ['Adam'],
                  2: ['Bella'],
                  3: ['Marc']}
            }
If that's what you want, you could try this:
>>> data_dict = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
>>> for element in input_file_data:
for key, value in element.items():
if key != 'name':
data_dict[key][value].add(element[name])
That should get you what you want, I think? I can't test as I don't have your data, but give it a try.
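If it helps, here is a small self-contained check of that loop, using the four sample rows from the question written out as dictionaries (my reconstruction of input_file_data, since the real data isn't shown):

from collections import defaultdict

input_file_data = [
    {'index': '1', 'name': 'Adam',  'gender': 'Male',   'alive': 'Y'},
    {'index': '2', 'name': 'Bella', 'gender': 'Female', 'alive': 'N'},
    {'index': '3', 'name': 'Marc',  'gender': 'Male',   'alive': 'Y'},
    {'index': '1', 'name': 'Adam',  'gender': 'Male',   'alive': 'N'},
]

data_dict = defaultdict(lambda: defaultdict(set))
for element in input_file_data:
    for key, value in element.items():
        if key != 'name':
            data_dict[key][value].add(element['name'])

# data_dict['gender'] -> {'Male': {'Adam', 'Marc'}, 'Female': {'Bella'}}
# data_dict['alive']  -> {'Y': {'Adam', 'Marc'}, 'N': {'Adam', 'Bella'}}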