Python: .Rename & index - What is this doing?

I'm new to Python and trying to work out what the following line is doing. Any help would be greatly appreciated:
new = old.rename(index={element: (re.sub(' Equity', '', element)) for element in old.index.tolist()})

Assume that the source CSV file has the following content (note the leading space before Equity in the second data row):
c1,c2
Abc,A,10
 Equity,B,20
Cex,C,30
Dim,D,40
If you run
old = pd.read_csv('input.csv', index_col=[0])
then old will have the following content:
        c1  c2
Abc      A  10
 Equity  B  20
Cex      C  30
Dim      D  40
Let's look at each part of your code.
old.index.tolist() contains: ['Abc', ' Equity', 'Cex', 'Dim'].
When you run {element: re.sub(' Equity', '', element) for element in old.index}
(a dictionary comprehension), you will get:
{'Abc': 'Abc', ' Equity': '', 'Cex': 'Cex', 'Dim': 'Dim'}
so each value is equal to its key with one exception: the value for the
' Equity' key is an empty string.
Note that neither tolist() nor the parentheses surrounding re.sub(...)
are needed (the result is the same).
And the last step:
new = old.rename(index=...) returns a copy of old with the renamed index,
substituting ' Equity' with an empty string; the result is saved in the
new variable (old itself is left unchanged).
That's all.

Assuming old is a pandas DataFrame, the code is renaming the index (see DataFrame.rename) by removing the word Equity from each of its strings. For example:
import pandas as pd
import re
old = pd.DataFrame(list(enumerate(['Some Equity', 'No Equity', 'foo', 'foobar'])), columns=['id', 'equity'])
old = old.set_index('equity')
print(old)
Output (before):
             id
equity
Some Equity   0
No Equity     1
foo           2
foobar        3
Then if you run:
new = old.rename(index={element: (re.sub(' Equity', '', element)) for element in old.index.tolist()})
Output (after):
        id
equity
Some     0
No       1
foo      2
foobar   3
The following expression is known as a dictionary comprehension:
{element: (re.sub(' Equity', '', element)) for element in old.index.tolist()}
For the data of the example above, it creates the following dictionary:
{'Some Equity': 'Some', 'No Equity': 'No', 'foo': 'foo', 'foobar': 'foobar'}
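As a side note (not part of the original answers): the same substring removal can be done without building the mapping dictionary at all, using the index's .str accessor. A minimal sketch, assuming the example data above:

```python
import pandas as pd

# Same data as the example above
old = pd.DataFrame({'id': [0, 1, 2, 3]},
                   index=pd.Index(['Some Equity', 'No Equity', 'foo', 'foobar'],
                                  name='equity'))

# Strip ' Equity' from every index label in one vectorized call;
# set_axis returns a copy, just like rename does
new = old.set_axis(old.index.str.replace(' Equity', '', regex=False))
print(new.index.tolist())  # ['Some', 'No', 'foo', 'foobar']
```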

Related

How to read through keys in a dictionary and check if they are in the values of a column then assign another column with the values of the dictionary?

I am attempting to read through the keys in a dictionary; if a key appears in the values of Column A, I want Column B to be filled with the dictionary values that match that key (or keys).
For example:
Column A      Column B
KSTRSHRASA    NaN
someWord-hi   NaN
Dictionary:
dict = {'ks': 'Killos',
        'RAS': 'Round Point System',
        'hi': 'Hello World',
        'vc': 'VVCCR'}
Resulting in the below dataframe:
Column A      Column B
KSTRSHRASA    ['Killos', 'Round Point System']
someWord-hi   ['Hello World']
I have tried to do the following:
for k, v in dict.items():
    if k.lower() in str(dataframe['Column A']).lower():
        v
        dataframe['Column B'] = [v]
but it results in the below error:
ValueError: Length of values (1) does not match length of index (6483)
I wrote a code that can perform this if the input is one text but I am unable to apply it to the entire column. I have also tried to convert the dictionary to a dataframe and do it that way but still no luck.
The code I wrote for a singular text input:
text = str(input('Please insert hostname here: '))
results = list()
for k, v in codes.items():
    if k.lower() in text.lower():
        v
        results.append(v)
    else:
        continue
    #print(v)
print('The type of this device is most likely a/an: ', results)
so the question is:
Is there a way to have Python read through the keys in the dictionary and return the corresponding values in Column B of a dataframe if those keys are contained in (exist in) the items in Column A?
This is what I currently have.
dataframe = {'Column A': ['KSTRSHRASA', 'someWord-hi'], 'Column B': [np.nan, np.nan]}
df = pd.DataFrame(dataframe)
dictionary = {'ks': 'Killos',
              'RAS': 'Round Point System',
              'hi': 'Hello World',
              'vc': 'VVCCR'}
# convert dictionary key to lower case
dictionary = {k.lower(): v for k, v in dictionary.items()}
# keep a copy of original dataframe
df_original = df.copy()
# convert dataframe column A to lower case
df['Column A'] = df['Column A'].str.lower()
# if dataframe column A contains dictionary key, append it to column B as a list
df['Column B'] = df['Column A'].apply(lambda x: [dictionary[i] for i in dictionary if i in x])
# convert Column A back to original case
df['Column A'] = df_original['Column A']
OUTPUT:
Column A Column B
0 KSTRSHRASA [Killos, Round Point System]
1 someWord-hi [Hello World]
Hope this helps!
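For completeness, the case-insensitive matching can also be done inside the apply itself, which avoids having to lowercase Column A and restore it afterwards. A small sketch (the name codes mirrors the question's dictionary and is my own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column A': ['KSTRSHRASA', 'someWord-hi'],
                   'Column B': [np.nan, np.nan]})
codes = {'ks': 'Killos', 'RAS': 'Round Point System',
         'hi': 'Hello World', 'vc': 'VVCCR'}

# Lowercase both sides inside the lambda, so Column A is never modified
df['Column B'] = df['Column A'].apply(
    lambda x: [v for k, v in codes.items() if k.lower() in x.lower()])
print(df['Column B'].tolist())
# [['Killos', 'Round Point System'], ['Hello World']]
```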

Extract Keys from a string representation of dictionaries stored within a pandas dataframe

I have the following dataframe, which contains string representations of dictionaries in every row of the columns summary_in and summary_out:
import pandas as pd
df_vals = [[0,
'Person1',
"['xyz', 'abc', 'Jim']",
"['jkl', 'efg', 'Smith']",
1134,
1180,
46,
'sample text',
"{'xyz_key': ['xyz', 756.0], 'abc_key': ['abc', 378.0], 'Jim_key': ['Jim', 0]}",
"{'jkl_key': ['jkl', 395.0], 'efg_key': ['efg', 785.0], 'Smith_key': ['Smith', 0]}"],
[1,
'Person2',
"['lmn', 'opq', 'Mick']",
"['rst', 'uvw', 'Smith']",
1134,
1180,
46,
'sample tex2',
"{'lmn_key': ['lmn', 756.0], 'opq_key': ['opq', 378.0], 'Mick_key': ['Mick', 0]}",
"{'rst_key': ['rst', 395.0], 'uvw_key': ['uvw', 785.0], 'Smith_key': ['Smith', 0]}"]]
df = pd.DataFrame(data=df_vals, columns =['row','Person','in','out','val1','val2','diff','note','summary_in','summary_out'] )
df
What I am trying to do is iterate over every row in the dataframe and print each key that exists in summary_in for each Person row.
After running this code to test datatypes:
# create dict of column
dict_from_dataframe = df['summary_in'].to_dict()
print(type(dict_from_dataframe))
for k in dict_from_dataframe.items():
    d = k[1]
    print(type(d))
    print(d)
I get the following output, which shows that once I hit the next level, the dictionary (d) is now a string and cannot be accessed as a dictionary normally would be:
<class 'dict'>
<class 'str'>
{'xyz_key': ['xyz', 756.0], 'abc_key': ['abc', 378.0], 'Jim_key': ['Jim', 0]}
<class 'str'>
{'lmn_key': ['lmn', 756.0], 'opq_key': ['opq', 378.0], 'Mick_key': ['Mick', 0]}
Any ideas on what I have done wrong here?
My expected output is to loop over the df to print the following
Person1
xyz_key
abc_key
Jim_key
Person2
lmn_key
opq_key
Mick_key
Any help would be much appreciated! Thanks
IIUC, you could use a custom function. You need to convert the string representation to a dictionary with ast.literal_eval.
from ast import literal_eval

def print_infos(s):
    print(s['Person'])
    d = literal_eval(s['summary_in'])
    for k in d:
        print(k)

for _, r in df.iterrows():
    print_infos(r)
output:
Person1
xyz_key
abc_key
Jim_key
Person2
lmn_key
opq_key
Mick_key
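The same output can also be produced without iterrows by parsing the whole column once with Series.apply; a minimal sketch on a trimmed-down version of the frame above (only the two columns the loop touches):

```python
from ast import literal_eval

import pandas as pd

# Trimmed-down version of the question's frame
df = pd.DataFrame({
    'Person': ['Person1', 'Person2'],
    'summary_in': [
        "{'xyz_key': ['xyz', 756.0], 'abc_key': ['abc', 378.0], 'Jim_key': ['Jim', 0]}",
        "{'lmn_key': ['lmn', 756.0], 'opq_key': ['opq', 378.0], 'Mick_key': ['Mick', 0]}",
    ],
})

# Parse the whole column once; parsed is a Series of real dicts
parsed = df['summary_in'].apply(literal_eval)
for person, d in zip(df['Person'], parsed):
    print(person)
    for k in d:
        print(k)
```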

How to rename columns of a nested dataframe?

I have a Python function that cleans up my dataframe (it replaces whitespace with _ and prepends _ if a column name begins with a number):
These dataframes were jsons that have been converted to dataframes to easily work with them.
def prepare_json(df):
    df = df.rename(lambda x: '_' + x if re.match(r'([0-9])\w+', x) else x, axis=1)
    df = df.rename(lambda x: x.replace(' ', '_'), axis=1)
    return df
This works for simple jsons like the following:
{"123asd":"test","test json":"test"}
Output:
{"_123asd":"test","test_json":"test"}
However, when I try it with a more complex dataframe it does not work anymore.
Here is an example:
{"SETDET":[{"SETPRTY":[{"DEAG":{"95R":[{"Data Source Scheme":"SCOM","Proprietary Code":"CH123456"}]}},{"SAFE":{"97A":[{"Account Number":"123456789"}]},"SELL":{"95P":[{"Identifier Code Location Code":"AB","Identifier Code Country Code":"AB","Identifier Code Logical Terminal":"XXX","Identifier Code Bank Code":"ABCD"}]}},{"PSET":{"95P":[{"Identifier Code Location Code":"ZZ","Identifier Code Country Code":"CH","Identifier Code Logical Terminal":"","Identifier Code Bank Code":"INSE"}]}}],"SETR":{"22F":[{"Data Source Scheme":"","Indicator":"TRAD"}]}}],"TRADDET":[{"Other":{"35B":[{"Identification of Security":"CH0012138530","Description of Security":"CREDIT SUISSE GROUP"}]},"SETT":{"98A":[{"Date":"20181127"}]},"TRAD":{"98A":[{"Date":"20181123"}]}}],"FIAC":[{"SAFE":{"97A":[{"Account Number":"0123-1234567-05-001"}]},"SETT":{"36B":[{"Quantity":"10,","Quantity Type Code":"UNIT"}]}}],"GENL":[{"SEME":{"20C":[{"Reference":"1234567890123456"}]},"Other":{"23G":[{"Subfunction":"","Function":"NEWM"}]},"PREP":{"98C":[{"Date":"20181123","Time":"165256"}]}}]}
Trying it out with this, I get the following error when trying to write the dataframe to BigQuery:
Invalid field name "97A". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 300 characters long. with loading dataframe
Maybe my solution helps you:
1. convert the dictionary to a string
2. find all keys of the dictionary with a regex
3. replace spaces in keys with _ and add _ before keys that start with a digit
4. convert the string back to a dictionary with ast.literal_eval(dict_string)
try this:
import re
import ast
from copy import deepcopy
def my_replace(match):
    return match.group()[0] + match.group()[1] + "_" + match.group()[2]
dct = {"SETDET":[{"SETPRTY":[{"DEAG":{"95R":[{"Data Source Scheme":"SCOM","Proprietary Code":"CH123456"}]}},{"SAFE":{"97A":[{"Account Number":"123456789"}]},"SELL":{"95P":[{"Identifier Code Location Code":"AB","Identifier Code Country Code":"AB","Identifier Code Logical Terminal":"XXX","Identifier Code Bank Code":"ABCD"}]}},{"PSET":{"95P":[{"Identifier Code Location Code":"ZZ","Identifier Code Country Code":"CH","Identifier Code Logical Terminal":"","Identifier Code Bank Code":"INSE"}]}}],"SETR":{"22F":[{"Data Source Scheme":"","Indicator":"TRAD"}]}}],"TRADDET":[{"Other":{"35B":[{"Identification of Security":"CH0012138530","Description of Security":"CREDIT SUISSE GROUP"}]},"SETT":{"98A":[{"Date":"20181127"}]},"TRAD":{"98A":[{"Date":"20181123"}]}}],"FIAC":[{"SAFE":{"97A":[{"Account Number":"0123-1234567-05-001"}]},"SETT":{"36B":[{"Quantity":"10,","Quantity Type Code":"UNIT"}]}}],"GENL":[{"SEME":{"20C":[{"Reference":"1234567890123456"}]},"Other":{"23G":[{"Subfunction":"","Function":"NEWM"}]},"PREP":{"98C":[{"Date":"20181123","Time":"165256"}]}}]}
keys = re.findall(r"{\'.*?\': | \'.*?\': ", str(dct))
keys_bfr_chng = deepcopy(keys)
keys = [re.sub(r"\s+(?=\w)", '_', key) for key in keys]
keys = [re.sub(r"{\'\d", my_replace, key) for key in keys]
dct = str(dct)
for i in range(len(keys)):
    dct = dct.replace(keys_bfr_chng[i], keys[i])
dct = ast.literal_eval(dct)
print(dct)
type(dct)
output:
{'SETDET': [{'SETPRTY': [{'DEAG': {'_95R': [{'Data_Source_Scheme': 'SCOM', 'Proprietary_Code': 'CH123456'}]}}, {'SAFE': {'_97A': [{'Account_Number': '123456789'}]}, 'SELL': {'_95P': [{'Identifier_Code_Location_Code': 'AB', 'Identifier_Code_Country_Code': 'AB', 'Identifier_Code_Logical_Terminal': 'XXX', 'Identifier_Code_Bank_Code': 'ABCD'}]}}, {'PSET': {'_95P': [{'Identifier_Code_Location_Code': 'ZZ', 'Identifier_Code_Country_Code': 'CH', 'Identifier_Code_Logical_Terminal': '', 'Identifier_Code_Bank_Code': 'INSE'}]}}], 'SETR': {'_22F': [{'Data_Source_Scheme': '', 'Indicator': 'TRAD'}]}}], 'TRADDET': [{'Other': {'_35B': [{'Identification_of_Security': 'CH0012138530', 'Description_of_Security': 'CREDIT SUISSE GROUP'}]}, 'SETT': {'_98A': [{'Date': '20181127'}]}, 'TRAD': {'_98A': [{'Date': '20181123'}]}}], 'FIAC': [{'SAFE': {'_97A': [{'Account_Number': '0123-1234567-05-001'}]}, 'SETT': {'_36B': [{'Quantity': '10,', 'Quantity_Type_Code': 'UNIT'}]}}], 'GENL': [{'SEME': {'_20C': [{'Reference': '1234567890123456'}]}, 'Other': {'_23G': [{'Subfunction': '', 'Function': 'NEWM'}]}, 'PREP': {'_98C': [{'Date': '20181123', 'Time': '165256'}]}}]}
dict
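An alternative worth noting (my addition, not from the original answer): instead of round-tripping the nested structure through a string and regex-replacing keys, the keys can be sanitized recursively on the structure itself. A sketch, where clean_keys is a name I made up:

```python
def clean_keys(obj):
    """Recursively sanitize dict keys: spaces become '_', and keys that
    start with a digit get a leading '_' (BigQuery-safe field names)."""
    if isinstance(obj, dict):
        return {('_' + k if k[:1].isdigit() else k).replace(' ', '_'): clean_keys(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [clean_keys(v) for v in obj]
    return obj

# A small slice of the question's structure
sample = {'SAFE': {'97A': [{'Account Number': '123456789'}]}}
print(clean_keys(sample))  # {'SAFE': {'_97A': [{'Account_Number': '123456789'}]}}
```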

Aggregate strings across consecutive non-NaN cells in pandas column but not across whole column

I am working on an NLP problem where I have to analyze strangely formatted Excel files.
There is one column with text, where each document spans multiple cells. Documents themselves are separated by empty cells. There are other columns with scores that I want to predict from the text data.
This is what it looks like
I have imported the sheets to a pandas dataframe and now I am trying to aggregate the cells belonging to each document while preserving the scores.
This is the goal state
I have started to play around with nested loops, but I feel like it is much more complicated than necessary.
How would you approach this? Each document covers a different number of cells and documents are separated by different numbers of empty cells. To make it more complicated the scores in the columns to the right are sometimes in the same row as the first and sometimes in the same row as the last cell of the corresponding document.
I would greatly appreciate your help! There must be a simple solution.
Just a simple example of how it could work:
import pandas as pd

# setting up the DataFrame with sample data
df = pd.DataFrame({'Document': ['This is ', 'first', None, 'This is ', 'second',
                                None, 'this ', 'is ', 'third'],
                   'Score': [None, 1, None, None, 2, None, None, 3, None]})

# DataFrame.append was removed in pandas 2.0, so collect rows in a plain
# list and build the result DataFrame once at the end
rows = []
doc = ''
for index, row in df.iterrows():
    if pd.notnull(row['Score']):
        # any non-NaN value within the processed document is its score
        score = row['Score']
    if row['Document']:
        # build up the doc string until an empty (None) cell is reached
        doc += row['Document']
    else:
        rows.append({'Document': doc, 'Score': score})
        doc = ''
if doc:
    # the last document may not be followed by an empty cell; save it too
    rows.append({'Document': doc, 'Score': score})
result_df = pd.DataFrame(rows)
Output (result_df):
Document Score
0 This is first 1.0
1 This is second 2.0
2 This is third 3.0
Using @Lukas' setup:
df = pd.DataFrame({'Document': ['This is ', 'first', None, 'This is ', 'second',
                                None, 'this ', 'is ', 'third'],
                   'Score': [None, 1, None, None, 2, None, None, 3, None]})

df.groupby(df['Document'].isna().cumsum(), as_index=False)\
  .apply(lambda x: pd.Series([''.join(x['Document'].dropna()),
                              x.loc[x['Score'].notna(), 'Score'].values[0]],
                             index=['Document', 'Score']))
Output:
Document Score
0 This is first 1.0
1 This is second 2.0
2 this is third 3.0
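The heart of the groupby approach is the grouping key: isna().cumsum() increments a counter at every empty cell, so all rows belonging to one document share the same label. The trick in isolation:

```python
import pandas as pd

s = pd.Series(['This is ', 'first', None, 'This is ', 'second', None,
               'this ', 'is ', 'third'])
# The counter bumps on each None, labelling the three documents 0, 1 and 2
print(s.isna().cumsum().tolist())  # [0, 0, 1, 1, 1, 2, 2, 2, 2]
```

Note the separator rows themselves carry the label of the following group, which is why the lambda above drops NaNs from Document before joining.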

joining a list and a list of lists in python

I have what should be a simple problem, but three hours into trying different things I can't solve it.
I have pymysql returning results from a query. I can't share the exact example, but this straw man should do:
cur.execute("select name, address, phonenum from contacts")
This returns results perfectly, which I grab with
results = cur.fetchall()
and then convert to a list object exactly as I want it
data = list(results)
Unfortunately this doesn't include the header, but you can get it with cur.description (which contains metadata including, but not limited to, the header). I push this into a list:
header = []
for n in cur.description:
    header.append(str(n[0]))
so my header looks like:
['name','address','phonenum']
and my results look like:
[['Tom','dublin','12345'],['Bob','Kerry','56789']]
I want to create a dataframe in pandas and then pivot it, but that needs column headers to work properly. I had previously been importing a completed CSV into a pandas DF, which included the header, so this all worked smoothly. Now I need to get this data directly from the DB, so I figured that's easy: I just join the two lists and hey presto, I have what I am looking for. But when I try to append, I actually wind up with this:
['name','address','phonenum',['Tom','dublin','12345'],['Bob','Kerry','56789']]
when i need this
[['name','address','phonenum'],['Tom','dublin','12345'],['Bob','Kerry','56789']]
Anyone any ideas?
Much appreciated!
Addition of lists concatenates contents:
In [17]: [1] + [2,3]
Out[17]: [1, 2, 3]
This is true even if the contents are themselves lists:
In [18]: [[1]] + [[2],[3]]
Out[18]: [[1], [2], [3]]
So:
In [13]: header = ['name','address','phonenum']
In [14]: data = [['Tom','dublin','12345'],['Bob','Kerry','56789']]
In [15]: [header] + data
Out[15]:
[['name', 'address', 'phonenum'],
['Tom', 'dublin', '12345'],
['Bob', 'Kerry', '56789']]
In [16]: pd.DataFrame(data, columns=header)
Out[16]:
name address phonenum
0 Tom dublin 12345
1 Bob Kerry 56789
Note that loading a DataFrame with data from a database can also be done with pandas.read_sql.
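A sketch of that read_sql route, using sqlite3 in place of pymysql so it runs standalone (the table and data mirror the question's straw-man example):

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the MySQL database in the question
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE contacts (name TEXT, address TEXT, phonenum TEXT)")
conn.executemany("INSERT INTO contacts VALUES (?, ?, ?)",
                 [('Tom', 'dublin', '12345'), ('Bob', 'Kerry', '56789')])

# read_sql pulls the column names from the cursor metadata for you
df = pd.read_sql("SELECT name, address, phonenum FROM contacts", conn)
print(df.columns.tolist())  # ['name', 'address', 'phonenum']
```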
Is that what you are looking for?
first = ['name','address','phonenum']
second = [['Tom','dublin','12345'],['Bob','Kerry','56789']]
second = [first] + second
print(second)
[['name', 'address', 'phonenum'], ['Tom', 'dublin', '12345'], ['Bob', 'Kerry', '56789']]
Other possibilities:
You could insert it into data at position 0 as a list:
header = ['name','address','phonenum']
data = [['Tom','dublin','12345'],['Bob','Kerry','56789']]
data.insert(0, header)
print(data)
[['name', 'address', 'phonenum'], ['Tom', 'dublin', '12345'], ['Bob', 'Kerry', '56789']]
But if you are going to manipulate the header variable later, you can insert a shallow copy of it:
header = ['name','address','phonenum']
data = [['Tom','dublin','12345'],['Bob','Kerry','56789']]
data.insert(0, header[:])
print(data)
[['name', 'address', 'phonenum'], ['Tom', 'dublin', '12345'], ['Bob', 'Kerry', '56789']]
