Assistance with splitting data frame to new columns - python

I'm having trouble splitting a data frame by _ and creating new columns from it.
The original string, stored in section:
AMAT_0000006951_10Q_20200726_Item1A_excerpt.txt
My current code:
df = pd.DataFrame(myList, columns=['section', 'text'])
#df['text'] = df['text'].str.replace('•', '')
df['section'] = df['section'].str.replace('Item1A', 'Filing Section: Risk Factors')
df['section'] = df['section'].str.replace('Item2_', 'Filing Section: Management Discussion and Analysis')
df['section'] = df['section'].str.replace('excerpt.txt', '').str.replace(r'\d{10}_|\d{8}_', '', regex=True)
df.to_csv("./SECParse.csv", encoding='utf-8-sig', sep=',', index=False)
Output:
section text
AMAT_10Q_Filing Section: Risk Factors_ The COVID-19 pandemic and global measures taken in response
thereto have adversely impacted, and may continue to adversely
impact, Applied’s operations and financial results.
AMAT_10Q_Filing Section: Risk Factors_ The COVID-19 pandemic and measures taken in response by
governments and businesses worldwide to contain its spread,
AMAT_10Q_Filing Section: Risk Factors_ The degree to which the pandemic ultimately impacts Applied’s
financial condition and results of operations and the global
economy will depend on future developments beyond our control
I would really like to split 'section' in a way that puts the pieces into new columns based on '_'.
I've tried many different variations of regex to split 'section', and all of them either gave me headings with no fill or added columns after section and text, which isn't useful. I should also add that there will be around 100,000 observations.
Desired result:
Ticker Filing type Section Text
AMAT 10Q Filing Section: Risk Factors The COVID-19 pandemic and global measures taken in response
Any guidance would be appreciated.

If you always know the number of splits, you can do something like:
import pandas as pd
df = pd.DataFrame({ "a": [ "test_a_b", "test2_c_d" ] })
# Split column by "_"
items = df["a"].str.split("_")
# Pop the last item from each split list and place it in "b"
df["b"] = items.apply(list.pop)
# Pop the new last item (originally the middle one) and place it in "c"
df["c"] = items.apply(list.pop)
# Pop the remaining first item and place it in "d"
df["d"] = items.apply(list.pop)
That way, the dataframe will turn into
a b c d
0 test_a_b b a test
1 test2_c_d d c test2
Since you want the columns to be in a certain order, you can reorder the dataframe's columns as below:
>>> df = df[[ "d", "c", "b", "a" ]]
>>> df
d c b a
0 test a b test_a_b
1 test2 c d test2_c_d
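For the section strings in the question, pandas can also do the split and the column assignment in one step with str.split(expand=True). A minimal sketch, assuming the cleaned strings always look like Ticker_FilingType_Section, possibly with a trailing underscore left over from the earlier replacements:
import pandas as pd

df = pd.DataFrame({
    "section": ["AMAT_10Q_Filing Section: Risk Factors_"],
    "text": ["The COVID-19 pandemic and global measures taken in response..."],
})
# strip the leftover trailing underscore, then split on the first two
# underscores only, so any underscore inside the section name survives
parts = df["section"].str.rstrip("_").str.split("_", n=2, expand=True)
df[["Ticker", "Filing type", "Section"]] = parts
df = df.rename(columns={"text": "Text"})[["Ticker", "Filing type", "Section", "Text"]]
print(df)
Because expand=True returns a frame directly, this avoids the repeated apply(list.pop) passes, which should matter at the ~100,000-row scale mentioned in the question.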

Related

Openpyxl and Binary Search

The problem: I have two spreadsheets. Spreadsheet 1 has about 20,000 rows. Spreadsheet 2 has nearly 1 million rows. When a value from a row in spreadsheet 1 matches a value from a row in spreadsheet 2, the entire row from spreadsheet 2 is written to Excel. The problem isn't too difficult, but with such a large number of rows the runtime is incredibly long.
Book 1 Example:
|Key |Value |
|------|------------------|
|397241|587727227839578000|
An example of book 2:
|ID                |a  |b  |c   |
|------------------|---|---|----|
|587727227839578000|393|24 |0.43|
My current solution is:
import openpyxl

g1 = openpyxl.load_workbook('path/to/sheet/sheet1.xlsx', read_only=True)
grid1 = g1.active
grid1_rows = list(grid1.rows)
g2 = openpyxl.load_workbook('path/to/sheet2/sheet2.xlsx', read_only=True)
grid2 = g2.active
grid2_rows = list(grid2.rows)
output_file = open('output.csv', 'w')  # not shown in the original; any writable file works
for row in grid1_rows:
    value1 = int(row[1].value)
    print(value1)
    for row2 in grid2_rows:
        value2 = int(row2[0].value)
        if value1 == value2:
            new_Name = int(row[0].value)
            print("match")
            output_file.write(str(new_Name))
            output_file.write(",")
            output_file.write(",".join(str(c.value) for c in row2[1:]))
            output_file.write("\n")
This solution works, but again the runtime is absurd. Ideally I'd like to take value1 (which comes from the first sheet), perform a binary search for that value on the other sheet, and then, just like in my current solution, copy the entire row to a new file when it matches.
If there's an even faster method to do this I'm all ears. I'm not the greatest at Python, so any help is appreciated.
Thanks.
You are getting your butt kicked here because you are using an inappropriate data structure, which requires you to use the nested loop.
The below example uses sets to match indices from first sheet to those in the second sheet. This assumes there are no duplicates on either sheet, which would seem weird given your problem description. Once we make sets of the indices from both sheets, all we need to do is intersect the 2 sets to find the ones that are on sheet 2.
Then we have the matches, but we can do better: if we put the second sheet's row data into a dictionary with the indices as keys, we can hold onto the row data while we do the match, rather than having to go hunting for the matching rows after intersecting the sets.
I've also put in an enumeration, which may or may not be needed to identify which rows in the spreadsheet are the ones of interest. Probably not needed.
This should execute in the blink of an eye after things are loaded. If you start to have memory issues, you may want to just construct the dictionary at the start rather than the list and the dictionary.
Book 1 and Book 2: (example workbooks were shown as screenshots; Book 2's rows hold an id, a name, and a qty)
Code:
import openpyxl

g1 = openpyxl.load_workbook('Book1.xlsx', read_only=True)
grid1 = g1.active
grid1_rows = list(grid1.rows)[1:]  # exclude the header
g2 = openpyxl.load_workbook('Book2.xlsx', read_only=True)
grid2 = g2.active
grid2_rows = list(grid2.rows)[1:]  # exclude the header

# make a set of the values in Book 1 that we want to search for...
search_items = {int(t[0].value) for t in grid1_rows}
#print(search_items)

# make a dictionary (key-value pairing) for the items in the 2nd book, and
# include an enumeration so we can capture the row number
lookup_dict = {int(t[0].value): (idx, t) for idx, t in enumerate(grid2_rows, start=1)}
#print(lookup_dict)

# now let's intersect the set of search items and key values to get the keys of the matches...
keys = search_items & lookup_dict.keys()
#print(keys)

for key in keys:
    idx = lookup_dict[key][0]       # the row index, if needed
    row_data = lookup_dict[key][1]  # the row data
    print(f'row {idx} matched value {key} and has data:')
    print(f'    name: {row_data[1].value:10s} \t qty: {int(row_data[2].value)}')
Output:
row 3 matched value 202 and has data:
name: steak qty: 3
row 1 matched value 455 and has data:
name: dogfood qty: 10
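If pandas is available, the whole match can also be written as an inner merge, which builds a hash table internally and avoids the manual set/dictionary bookkeeping. A sketch under the same assumptions, with the column names taken from the examples above ('Value' in Book 1, 'ID' in Book 2) and illustrative file names:
import pandas as pd

# read both workbooks; the first row of each sheet is treated as the header
book1 = pd.read_excel('Book1.xlsx')
book2 = pd.read_excel('Book2.xlsx')

# keep every Book 2 row whose ID appears among Book 1's values
matches = book1.merge(book2, left_on='Value', right_on='ID', how='inner')
matches.to_csv('matches.csv', index=False)
Note that read_excel needs openpyxl installed to read .xlsx files.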

How to split two first names that together in two different words in python

I am trying to split misspelled first names. Most of them are joined together. I was wondering if there is any way to separate two names that have been run together into two different words.
For example, if the misspelled name is trujillohernandez, it should be separated into trujillo hernandez.
I am trying to create a function that can do this for a whole column with thousands of misspelled names like the example above. However, I haven't been successful. Spell-checker libraries do not work, given that these are first names and they are Hispanic names.
I would be really grateful if you could help develop some sort of function to make this happen.
As noted in the comments above, not having a list of possible names is a problem. Still, to offer something, even if not perfect, try the following.
Given a dataframe example like...
Name
0 sofíagomez
1 isabelladelgado
2 luisvazquez
3 juanhernandez
4 valentinatrujillo
5 camilagutierrez
6 joséramos
7 carlossantana
Code (Python):
import pandas as pd
import requests

# longest list of Hispanic surnames I could find in a table
url = r'https://namecensus.com/data/hispanic.html'

# download the table into a frame and clean up the header
page = requests.get(url)
table = pd.read_html(page.text.replace('<br />', ' '))
df = table[0]
df.columns = df.iloc[0]
df = df[1:]

# move the frame of surnames to a list
last_names = df['Last name / Surname'].tolist()
last_names = [each_string.lower() for each_string in last_names]

# create a test dataframe of joined firstnames and lastnames
data = {'Name': ['sofíagomez', 'isabelladelgado', 'luisvazquez', 'juanhernandez', 'valentinatrujillo', 'camilagutierrez', 'joséramos', 'carlossantana']}
df = pd.DataFrame(data, columns=['Name'])

# create new columns for the matched names
lastname = '({})'.format('|'.join(last_names))
df['Firstname'] = df.Name.str.replace(lastname + '$', '', regex=True).fillna('--not found--')
df['Lastname'] = df.Name.str.extract(lastname + '$', expand=False).fillna('--not found--')

# output the dataframe
print('\n\n')
print(df)
Outputs:
Name Firstname Lastname
0 sofíagomez sofía gomez
1 isabelladelgado isabella delgado
2 luisvazquez luis vazquez
3 juanhernandez juan hernandez
4 valentinatrujillo valentina trujillo
5 camilagutierrez camila gutierrez
6 joséramos josé ramos
7 carlossantana carlos santana
Further cleanup may be required, but perhaps this gets the majority of names split.
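The same lookup can also be done in a single pass with str.extract and named groups, reusing the last_names list built above; a small sketch, filling '--not found--' where no surname from the list matches:
# non-greedy first part, surname alternation anchored to the end of the string
pattern = '^(?P<Firstname>.+?)(?P<Lastname>{})$'.format('|'.join(last_names))
df[['Firstname', 'Lastname']] = df['Name'].str.extract(pattern).fillna('--not found--')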

Check if a string is present in multiple lists

I am trying to categorize a dataset based on the string that contains the name of each object in the dataset.
The dataset is composed of 3 columns, df['Name'], df['Category'] and df['Sub_Category']; the Category and Sub_Category columns are empty.
For each row I would like to check, against several lists of words, whether the name of the object contains at least one word from one of the lists, and based on that check attribute a value to the Category column. If it finds words in 2 different lists, I would like to attribute both values to the object in the Category column.
Moreover, I would like to be able to identify which word was matched in which list, in order to attribute a value to the Sub_Category column.
Until now I have only been able to do it with a single list, I am not able to identify which word was matched, and the code takes very long to run.
Here is my code (with an example of names found in my dataset as df['Name']):
import pandas as pd
import numpy as np

df = pd.DataFrame()  # in the real dataset this frame also has empty Category and Sub_Category columns
df['Name'] = ['vitrine murale vintage', 'commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']
furniture_check = ['canape', 'chaise', 'buffet', 'table', 'commode', 'lit']
vehicle_check = ['solex', 'voiture', 'moto', 'scooter']
art_check = ['tableau', 'scuplture', 'tapisserie']
for idx, row in df.iterrows():
    for c in furniture_check:
        if c in row['Name']:
            df.loc[idx, 'Category'] = 'Meubles'
Any help would be appreciated
Here is an approach that expands lists, merges them and re-combines them.
df = pd.DataFrame({"name":['vitrine murale vintage','commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']})
furniture_check = ['canape', 'chaise', 'buffet','table','commode','lit']
vehicle_check = ['solex','voiture','moto','scooter']
art_check = ['tableau','scuplture', 'tapisserie']
# put categories into a dataframe
dfcat = pd.DataFrame([{"category":"furniture","values":furniture_check},
{"category":"vechile","values":vehicle_check},
{"category":"art","values":art_check}])
# turn apace delimited "name" column into a list
dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
# explode list so it can be used as join. reset_index() to keep a copy of index of original DF
.explode("name").reset_index()
# merge exploded names on both side
.merge(dfcat.explode("values"), left_on="name", right_on="values")
# where there are multiple categoryies, make it a list
.groupby("index", as_index=False).agg({"category":lambda s: list(s)})
# but original index back...
.set_index("index")
)
# simple join and have names and list of associated categories
df.join(dfcatlist)
which produces:
                     name       category
0  vitrine murale vintage            NaN
1        commode ancienne  ['furniture']
2          lustre antique            NaN
3                   solex    ['vehicle']
4     sculpture médievale            NaN
5           jante voiture    ['vehicle']
6          lit et matelas  ['furniture']
7          turbine moteur            NaN
(Note that 'sculpture médievale' stays unmatched because the art list misspells 'scuplture'.)
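Since the question also asks which word matched (for the Sub_Category column), the same merge already carries that information in the "values" column; a sketch extending the aggregation above, assuming the df and dfcat frames as built:
dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
             .explode("name").reset_index()
             .merge(dfcat.explode("values"), left_on="name", right_on="values")
             # keep both the category and the word that triggered the match
             .groupby("index", as_index=False).agg({"category": list, "values": list})
             .set_index("index")
             .rename(columns={"values": "matched_words"})
             )
df.join(dfcatlist)
The matched words can then be mapped to whatever Sub_Category values are needed.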

How to create a column in a data frame based on the values of another two columns?

I am pre-formatting some data for a tax filing and using Python to automate some of the Excel work. I have a data frame with three columns: Account, Opposite Account, Amount. I only have the names of the opposite accounts and the amounts, but the amounts for a given account / opposite-account pair are exactly the same except for sign. For example:
Account   Opposite Acc.   Amount
          Cash           -240.56
          Supplies        240.56
          Dentist         -10.45
          Gum              10.45
From that, I can deduce that Cash is the opposite of Supplies and Dentist is the opposite of Gum, so I would like my output to be:
Account   Opposite Acc.   Amount
Supplies  Cash           -240.56
Cash      Supplies        240.56
Gum       Dentist         -10.45
Dentist   Gum              10.45
Right now I'm doing this manually using str.contains:
df = df.assign(en_accounts = df['Opposite Acc.'])
df['Account'] = df['Account'].fillna("0")
df.loc[df['Account'].str.contains('Cash'), 'Account'] = 'Supplies'
But there are many variables, and I wonder if there is a way to automate this process in Python. One strategy could be: if two rows add up to 0, the accounts are a match; therefore, when item A (such as Supplies) appears in "Opposite Acc.", item B (such as Cash) is put in the same row but in "Account". (A sketch implementing this idea follows the code below.)
This is what I have so far:
df['Amount'] = np.abs(df["Amount"])
c1 = df['Amount']
c2 = df['Opposing Acc.']
for i in range(1, len(c1)-1):
    p = c1[i-1]
    x = c1[i]
    n = c1[i+1]
    if p == x:
        for i in range(1, len(c2)-1):
            a = c2[i-1]
            df.loc[df['en_account']] = a
But I get the following error: "None of [Index[....]\n dtype='object', length=28554)] are in the [index]"
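The "two rows that add up to 0 are a match" strategy described above can be vectorized with a groupby on the absolute amount. A minimal sketch, assuming each absolute amount belongs to exactly one two-row pair:
import pandas as pd

df = pd.DataFrame({
    "Opposite Acc.": ["Cash", "Supplies", "Dentist", "Gum"],
    "Amount": [-240.56, 240.56, -10.45, 10.45],
})
# rows whose amounts cancel share the same absolute amount; within each
# pair, the partner row's opposite-account name becomes this row's account
df["Account"] = (df.groupby(df["Amount"].abs())["Opposite Acc."]
                   .transform(lambda s: s.iloc[::-1].values))
print(df[["Account", "Opposite Acc.", "Amount"]])
This reproduces the desired output above; if the same absolute amount could occur in more than one pair, the rows would first need a secondary key (for example, order of appearance) to keep the pairs apart.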

Detection of variable length pattern in pandas dataframe column

The last 2 columns of a timeseries-indexed dataframe identify the start ('A', 'AA' or 'AAA'), end ('F', 'FF' or 'FFF') and duration (number of rows between start and end) of a physical process. (A screenshot of the two flag columns was attached here.) The A-F sequences, and the runs of 'n' rows between them, are of variable length.
How can I identify these patterns and for each of them calculate averages of other columns for the corresponding rows?
What I, very badly, tried to do is the following:
import pandas as pd
import xlrd
##### EXCEL LOAD
filepath = 'H:\\CCGT GE startup.xlsx'
df = pd.read_excel(filepath, sheet_name='Sheet1', header=0, skiprows=0, usecols='A:CO', index_col=0)
df = df.sort_index()  # set increasing time index, source data is time decreasing
gas = []
for i, row in df.iterrows():
    if df['FLAG STARTUP TG1'] is not 'n':
        while 'F' not in df['FLAG STARTUP TG1']:
            gas.append(df['PORTATA GREZZA TG1 - m3/h'])
            gas.append(i)
But the script gets stuck on the first if (it never matches the 'n' condition and keeps appending the same row, i pair). Additionally, my method is also wrong in excluding the last 'F' row, which still pertains to the same process and should be considered part of it!
p.s. the first 1000 rows df is here http://www.filedropper.com/ccgtgestartup1000
p.p.s. The 2 columns refer to 2 different processes/machines and are almost unrelated (more on this later); I want to do the same analysis on both, using different columns' averages for each. The first 'A' marks the beginning of the process and is repeated until the last timestamp, which is marked with an 'F'. In the original file the timestamps are descending, which is why I used the sort_index() method. The string length depends on other columns' values, but the only obvious correlation between the FLAG columns is in the 3-character strings 'AAA' & 'FFF', because these should occur only if the 2 processes start within ±1 timestamp of each other.
This is how I managed to get the desired results (N.B. I later decided that only the single-character 'A'-->'F' sequences are of interest):
import pandas as pd
import numpy as np

##### EXCEL LOAD
filepath = 'H:\\CCGT GE startup.xlsx'
df = pd.read_excel(filepath, sheet_name='Sheet1', header=0, skiprows=0, usecols='A:CO', index_col=0)
df = df.sort_index()  # set increasing time index, source data is time decreasing

tg1 = pd.DataFrame(index=df.index.copy(),
                   columns=['counter', 'flag', 'gas', 'p', 'raw_p', 'tv_p', 'lhv', 'fs'])
k = 0
for i, row in df.iterrows():
    flag = str(row['FLAG STARTUP TG1'])
    if flag in ('A', 'F'):  # row belongs to a startup sequence
        tg1.loc[i, 'flag'] = flag
        tg1.loc[i, 'gas'] = row['Portata gas naturale']
        tg1.loc[i, 'counter'] = k
        tg1.loc[i, 'fs'] = row['1FIRED START COUNT - N°']
        tg1.loc[i, 'p'] = row['POTENZA ATTIVA MONTANTE 1 SU 400 KV - MW']
        tg1.loc[i, 'raw_p'] = row['POTENZA ATTIVA MONTANTE 1 SU 15 KV - MW']
        tg1.loc[i, 'tv_p'] = row['POTENZA ATTIVA MONTANTE TV - MW']
        tg1.loc[i, 'lhv'] = row['LHV - MJ/Sm3']
    if flag == 'F':  # an 'F' closes the sequence, move to the next counter
        k += 1

tg1 = tg1.dropna(axis=0)
tg1 = tg1[tg1['gas'] != 0]  # drop rows where the gas flow measurement is missing
# convert_objects() is deprecated; coerce the numeric columns explicitly instead
num_cols = ['counter', 'gas', 'p', 'raw_p', 'tv_p', 'lhv', 'fs']
tg1[num_cols] = tg1[num_cols].apply(pd.to_numeric, errors='coerce')

# timestamp count for each startup, for the duration calculation
counts = tg1['counter'].value_counts().rename('duration').to_frame()
counts['start'] = counts.index
counts = counts.set_index(np.arange(len(counts)))
tg1 = tg1.merge(counts, how='inner', left_on='counter', right_on='start')

# filter out non-pertinent startups (too long or too short)
tg1 = tg1[tg1['duration'].isin([6, 7])]

# calculate thermal input per start (process)
table = tg1.groupby(['counter']).mean(numeric_only=True)
table['t_in'] = table.apply(lambda row: row['gas'] * row['duration'] * 0.25 * row['lhv'] / 3600, axis=1)
Any improvements and suggestions to do the calculations within the iteration and avoid all the prep work after it are welcome.
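In that spirit, one way to avoid the row loop entirely is to derive the counter column vectorially: a new sequence starts wherever an 'A' follows a non-'A' row, and the cumulative sum of those starts labels every row of each sequence. A sketch, assuming the single-character 'A'...'F' pattern and the df loaded above:
flags = df['FLAG STARTUP TG1'].astype(str)
in_seq = flags.isin(['A', 'F'])                   # rows that belong to a startup
starts = (flags == 'A') & (flags.shift() != 'A')  # first 'A' of each sequence
counter = starts.cumsum()                         # sequence label for every row

g = df[in_seq].groupby(counter[in_seq])
durations = g.size()                              # rows per startup, including the final 'F'
gas_means = g['Portata gas naturale'].mean()      # per-startup average of any column
The same groupby object yields the averages of all numeric columns at once via g.mean(numeric_only=True), and the duration filter becomes a boolean mask on durations.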
