Match string value to dataframe value and add to string - python

I have a string of column names and their datatypes, called cols, below:
_LOAD_DATETIME datetime,
_LOAD_FILENAME string,
_LOAD_FILE_ROW_NUMBER int,
_LOAD_FILE_TIMESTAMP datetime,
ID int
Next I make a df from a gsheet I'm reading, reproduced below:
import pandas as pd
output = [['table_name', 'schema_name', 'column_name', 'data_type', 'null?', 'default', 'kind', 'expression', 'comment', 'database_name', 'autoincrement', 'DateTime Comment Added'], ['ACCOUNT', 'SO', '_LOAD_DATETIME', '{"type":"TIMESTAMP_LTZ","precision":0,"scale":9,"nullable":true}', 'TRUE', '', 'COLUMN', '', '', 'V'], ['ACCOUNT', 'SO', '_LOAD_FILENAME', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}', 'TRUE', '', 'COLUMN', '', '', 'VE'], ['B_ACCOUNT', 'SO', '_LOAD_FILE_ROW_NUMBER', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}', 'TRUE', '', 'COLUMN', '', '', 'V'], ['ACCOUNT', 'SO', '_LOAD_FILE_TIMESTAMP', '{"type":"TIMESTAMP_NTZ","precision":0,"scale":9,"nullable":true}', 'TRUE', '', 'COLUMN', '', 'TEST', 'VE', '', '2022-02-16'], ['ACCOUNT', 'SO', 'ID', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":false,"fixed":false}', 'NOT_NULL', '', 'COLUMN', '', 'ID of Account', 'V', '', '2022-02-16'],]
df = pd.DataFrame(output)
df.columns = df.iloc[0]
df = df[1:]
last_2_days = '2022-02-15'
query_list = []
for index, row in df.iterrows():
    if row['comment'] is not None and row['comment'] != '' and (row['DateTime Comment Added'] >= last_2_days):
        comment_data = row['column_name'], row['comment']
        query_list.append(comment_data)
When I print query_list it looks like this, which is the correct data, since I only want the column_name and comment when the DateTime Comment Added value falls within the last 2 days of today:
[('_LOAD_FILE_TIMESTAMP', 'TEST'), ('ID', 'ID of Account')]
What I want to do next (and I'm having trouble figuring out how) is to take my cols string from earlier, attach each comment from query_list to the matching column name in cols, and add the word COMMENT before the actual comment, so cols should then look like this:
_LOAD_DATETIME datetime,
_LOAD_FILENAME string,
_LOAD_FILE_ROW_NUMBER int,
_LOAD_FILE_TIMESTAMP datetime COMMENT 'TEST',
ID int COMMENT 'ID of Account'
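One way to do this (a minimal sketch, assuming cols is the multi-line string shown above and query_list is the list of (column_name, comment) tuples) is to turn query_list into a dict and rebuild cols line by line:
comments = dict(query_list)

new_lines = []
for line in cols.splitlines():
    tokens = line.strip().split()
    if tokens and tokens[0] in comments:    # first token is the column name
        base = line.rstrip().rstrip(',')    # drop a trailing comma, if any
        suffix = ',' if line.rstrip().endswith(',') else ''
        line = f"{base} COMMENT '{comments[tokens[0]]}'{suffix}"
    new_lines.append(line)

cols = '\n'.join(new_lines)
print(cols)
This re-attaches the trailing comma only on lines that had one, which reproduces the desired output above.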

How to sort nested dict based on second value

I have a dict like this (more than 1000 records):
{"Device1": [["Device1", "TenGigabitEthernet1/0/12", "SHUT", "", "", "IDF03"], ["Device1", "TenGigabitEthernet1/0/11", "SHUT", "", "", "IDF03", "#f76f6f"]], "Device2": [["Device2", "TenGigabitEthernet1/0/12", "SHUT", "", "", "IDF03"], ["Device2", "TenGigabitEthernet1/0/11", "SHUT", "", "", "IDF03", "#f76f6f"]]}
The problem is, I don't know how to sort the dict based on the portName, which would be TenGigabitEthernet1/0/* or GigabitEthernet1/0/*.
I have the following code but it's not doing it right:
import ast
import json
from collections import OrderedDict
from natsort import natsorted

with open("data-dic.txt", 'r') as dic:
    data = dic.read()
dataDic = json.loads(data)
dataDic = ast.literal_eval(json.dumps(dataDic))
d2 = OrderedDict({ k : dataDic[k] for k in natsorted(dataDic) })
print(d2)
It is sorting by the keys, which are Device1, Device2, ...
How can I sort the dict based on the second value of each nested list, which would be all the portNames?
import pandas as pd
from itertools import chain
header = ['Name', 'Connection', 'Type', 'Col_3', 'Col_4', 'ID', 'Color']
df = pd.DataFrame(chain(*data.values()), columns=header).fillna('')
print(df)
This looks like:
      Name                Connection  Type Col_3 Col_4     ID    Color
0  Device1  TenGigabitEthernet1/0/12  SHUT              IDF03
1  Device1  TenGigabitEthernet1/0/11  SHUT              IDF03  #f76f6f
2  Device2  TenGigabitEthernet1/0/12  SHUT              IDF03
3  Device2  TenGigabitEthernet1/0/11  SHUT              IDF03  #f76f6f
Overkill for this issue... but if you are going to be doing other manipulation of this data, you may want to consider pandas. To sort, extract the numeric port into a helper column, sort on it, then rebuild the dict:
df['Port'] = df.Connection.str.extract('.*/(.*)').astype(int)
out = (df.sort_values('Port')
         .drop('Port', axis=1)
         .groupby('Name', sort=False)
         .apply(lambda x: x.apply(list, axis=1).tolist())
         .to_dict())
print(out)
print(out)
Output:
{'Device1': [['Device1', 'TenGigabitEthernet1/0/11', 'SHUT', '', '', 'IDF03', '#f76f6f'], ['Device1', 'TenGigabitEthernet1/0/12', 'SHUT', '', '', 'IDF03', '']],
'Device2': [['Device2', 'TenGigabitEthernet1/0/11', 'SHUT', '', '', 'IDF03', '#f76f6f'], ['Device2', 'TenGigabitEthernet1/0/12', 'SHUT', '', '', 'IDF03', '']]}
Use a dict comprehension and sort each sublist by the port number at the end of the connection name (converted to int so the sort is numeric rather than lexicographic):
output = {k: sorted(v, key=lambda x: int(x[1].split("/", 2)[-1])) for k, v in data.items()}
print(output)
{'Device1': [['Device1', 'TenGigabitEthernet1/0/11', 'SHUT', '', '', 'IDF03', '#f76f6f'], ['Device1', 'TenGigabitEthernet1/0/12', 'SHUT', '', '', 'IDF03']], 'Device2': [['Device2', 'TenGigabitEthernet1/0/11', 'SHUT', '', '', 'IDF03', '#f76f6f'], ['Device2', 'TenGigabitEthernet1/0/12', 'SHUT', '', '', 'IDF03']]}
You can do this:
import json
from collections import OrderedDict

with open("data-dic.txt", 'r') as dic:
    data = json.load(dic)
d2 = OrderedDict(sorted(data.items(), key=lambda x: int(x[1][0][1].rsplit('/', 1)[-1])))
print(d2)
This sorts the devices based on value.first_row.second_element.rsplit('/', 1)[-1], i.e. the port number at the end of each device's first connection name.
NB:
dataDic = ast.literal_eval(json.dumps(dataDic))
This line of code does absolutely nothing: it converts dict -> str -> dict. It is also not good practice to use ast.literal_eval to parse JSON.
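Since the question already imports natsorted, a natural-sort variant is also possible (a sketch, assuming data is the parsed dict from the answers above). natsort handles multi-digit port numbers and mixed Gigabit/TenGigabit prefixes without manual splitting:
from natsort import natsorted

# Sort each device's rows by the full port name, naturally.
d2 = {k: natsorted(v, key=lambda row: row[1]) for k, v in data.items()}
print(d2)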

Pandas dataframe, get the row number for a column meeting certain conditions

This is similar to a question I asked before, but I needed to make some changes to my condition statement.
I have the below output that I make a dataframe from. I then check each row for whether Status is empty and comment is not empty. Next, I want to get the row number of each row that meets those conditions:
output = [['table_name', 'schema_name', 'column_name', 'data_type', 'null?', 'default', 'kind', 'expression', 'comment', 'database_name', 'autoincrement', 'Status'], ['ACCOUNT', 'SO', '_LOAD_DATETIME', '{"type":"TIMESTAMP_LTZ","precision":0,"scale":9,"nullable":true}', 'TRUE', '', 'COLUMN', '', 'date and time when table was loaded', 'VICE_DEV'], ['ACCOUNT', 'SO', '_LOAD_FILENAME', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}', 'TRUE', '', 'COLUMN', '', '', 'DEV'], ['ACCOUNT', 'SO', '_LOAD_FILE_ROW_NUMBER', '{"type":"TEXT","length":16777216,"byteLength":16777216,"nullable":true,"fixed":false}', 'TRUE', '', 'COLUMN', '', '', 'DEV']]
df = pd.DataFrame(output)
df.columns = df.iloc[0]
df = df[1:]
query_list = []
for index, row in df.iterrows():
    if row['Status'] is None and row['comment'] is not None and row['comment'] != '':
        empty_status = df[df['Status'].isnull()].index.tolist()
I've tried with the empty_status var above, but I see:
empty_status = [1, 2, 3]
when I would like it to be the below, because only the first row has a non-empty comment and an empty/null Status:
empty_status = [1]
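A vectorized approach (a sketch, assuming the df built above, where the short rows leave the Status column as None) is to combine both conditions into a single boolean mask instead of looping:
# Rows where Status is null AND comment is present and non-empty.
mask = df['Status'].isnull() & df['comment'].notna() & (df['comment'] != '')
empty_status = df.index[mask].tolist()
print(empty_status)  # [1]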

How do I split a name in a list then reinsert the split items?

I have lists inside a list read from a CSV:
newlist = [
['id', 'name', 'lastContactedTime', 'email', 'phone_phones', 'home_phones', 'mobile_phones', 'work_phones', 'fax_phones', 'other_phones', 'address_1', 'address_2', 'address_3', 'city', 'state', 'postal_code', 'country', 'tags'],
['12-contacts', 'Courtney James', '', 'courtney#forlanchema.com', '+1 3455463849', '', '', '', '', '', '654 Rodney Franklin street', '', '', 'Birmingham', 'AL', '45678', 'US', ''],
['4-contacts', 'Joe Malcoun', '2019-08-13 14:41:12', 'ceo#nutshell.com', '', '', '', '', '', '', '212 South Fifth Ave', '', '', 'Ann Arbor', 'MI', '48103', 'US', ''],
['8-contacts', 'Rafael Acosta', '', 'racosta#forlanchema.com', '+1 338551534', '', '', '', '', '', '13 Jordan Avenue SW', '', '', 'Birmingham', 'AL', '45302', 'US', '']
]
I want a repeatable process that splits the name (like "Courtney James") in each list and adds it to a new list.
I have tried to split and append each name separately to a list using a while loop, but it did not work out:
#Splitting an item in the list and adding it to a new list
m = 1
while newlist[m][1] != None:
    splitter = newlist[m][1].split()
    namelist = splitter
    m+1
    print(namelist)
else:
    break
I get errors, or the code does not run. I expect the names to be split and added to a new list.
My desired output would be one list per contact, so that I can add them to a new Excel worksheet using xlsxwriter:
Headers= ['Lastname','First name','Company','Title','Willing To share', 'Willing to introduce', 'Work phone', 'Work email', 'Work street', 'Work City', ' Work State', 'Work Zip', 'Personal Street', 'Personal City', 'Personal State', 'Personal Zip', 'Mobile Phone', 'Personal email', 'Note', 'Note Category']
List1= ['Doe1', 'John', 'company1', 'CIO', 'Yes', 'Yes', '999-999-999', 'email#email.com', '123 work street', 'workville', 'IL', '12345', '1234 personal street', 'peronville', 'Il', '12345', '999-999-999', 'personemail#email.com', 'public note visible to everyone', 'Public']
List2=
List3=
When you split each name, you get a list such as [first_name, last_name]. Assuming you wanted to build up a "list of these lists", then you want to do the following using your code as a basis:
namelist = []  # new, empty list
for i in range(1, len(newlist)):
    names = newlist[i][1].split()  # this yields [first_name, last_name]
    #print(names)
    namelist.append([names[1], names[0]])  # [last_name, first_name]
range(1, len(newlist)) generates the numbers 1, 2, ... length of newlist - 1
namelist.append([names[1], names[0]]) appends the split names to our new list
The result:
[['James', 'Courtney'], ['Malcoun', 'Joe'], ['Acosta', 'Rafael']]
What you are looking for is a more complicated list with other elements in it. But at least the above code shows how to properly loop through your original list.
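To get closer to the desired output rows above, a sketch along the same lines (the positions in the target Headers row are assumptions read off the example List1; everything not present in newlist stays blank):
rows = []
for rec in newlist[1:]:
    names = rec[1].split()
    out = [''] * len(Headers)   # start with an all-blank row
    out[0] = names[-1]          # Lastname
    out[1] = names[0]           # First name
    out[7] = rec[3]             # Work email   <- email column (assumed mapping)
    out[16] = rec[4]            # Mobile Phone <- phone_phones column (assumed mapping)
    rows.append(out)
Each entry in rows can then be written out as one worksheet row with xlsxwriter.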

NoneType Error when trying to parse Table using BeautifulSoup

Here's my code:
import urllib.request
import bs4 as bs

source = urllib.request.urlopen('http://nflcombineresults.com/nflcombinedata_expanded.php?year=2015&pos=&college=').read()
soup = bs.BeautifulSoup(source, 'lxml')
table = soup.table
table = soup.find(id='datatable')
table_rows = table.find_all('tr')
#print(table_rows)
year = []
name = []
college = []
pos = []
height = []
weight = []
hand_size = []
arm_length = []
wonderlic = []
fortyyrd = []
for row in table_rows[1:]:
    col = row.find_all('td')
    #row = [i.text for i in td]
    #print(col[4])
    # Create a variable of the string inside each <td> tag pair,
    column_1 = col[0].string.strip()
    # and append it to each variable
    year.append(column_1)
    column_2 = col[1].string.strip()
    name.append(column_2)
    column_3 = col[2].string.strip()
    college.append(column_3)
    column_4 = col[3].string.strip()
    pos.append(column_4)
    #print(col[4])
    column_5 = col[4].string.strip()
    height.append(column_5)
There are several more columns in the table I want to add, but whenever I try to run these last two lines, I get an error saying:
"AttributeError: 'NoneType' object has no attribute 'strip'"
when I print col[4] right above this line, I get:
<td><div align="center">69</div></td>
I originally thought this is due to missing data, but the first instance of missing data in the original table on the website is in the 9th column (Wonderlic) of the first row, not the 4th column.
There are several other columns not included in this snippet of code that I want to add to my dataframe and I'm getting the NoneType error with them as well despite there being an entry in that cell.
I'm fairly new to parsing tables from a site using BeautifulSoup, so this could be a stupid question, but why is this object NoneType, and how can I fix it so I can put this table into a pandas dataframe?
Alternatively, if you want to try it with pandas, you can do it like so:
import pandas as pd
df = pd.read_html("http://nflcombineresults.com/nflcombinedata_expanded.php?year=2015&pos=&college=")[0]
df.head()
The actual error,
AttributeError: 'NoneType' object has no attribute 'strip'
is happening on the last row of the table, which has a single cell. Here is its HTML:
<tr style="background-color:#333333;"><td colspan="15"> </td></tr>
Just slice it:
for row in table_rows[1:-1]:
As for improving the overall quality of the code, you can/should follow @宏杰李's answer below.
import requests
from bs4 import BeautifulSoup

r = requests.get('http://nflcombineresults.com/nflcombinedata_expanded.php?year=2015&pos=&college=')
soup = BeautifulSoup(r.text, 'lxml')
for tr in soup.table.find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    print(row)
Output:
['Year', 'Name', 'College', 'POS', 'Height (in)', 'Weight (lbs)', 'Hand Size (in)', 'Arm Length (in)', 'Wonderlic', '40 Yard', 'Bench Press', 'Vert Leap (in)', 'Broad Jump (in)', 'Shuttle', '3Cone', '60Yd Shuttle']
['2015', 'Ameer Abdullah', 'Nebraska', 'RB', '69', '205', '8.63', '30.00', '', '4.60', '24', '42.5', '130', '3.95', '6.79', '11.18']
['2015', 'Nelson Agholor', 'Southern California', 'WR', '73', '198', '9.25', '32.25', '', '4.42', '12', '', '', '', '', '']
['2015', 'Malcolm Agnew', 'Southern Illinois', 'RB', '70', '202', '', '', '', '*4.59', '', '', '', '', '', '']
['2015', 'Jay Ajayi', 'Boise State', 'RB', '73', '221', '10.00', '32.00', '24', '4.57', '19', '39.0', '121', '4.10', '7.10', '11.10']
['2015', 'Brandon Alexander', 'Central Florida', 'FS', '74', '195', '', '', '', '*4.59', '', '', '', '', '', '']
['2015', 'Kwon Alexander', 'Louisiana State', 'OLB', '73', '227', '9.25', '30.25', '', '4.55', '24', '36.0', '121', '4.20', '7.14', '']
['2015', 'Mario Alford', 'West Virginia', 'WR', '68', '180', '9.38', '31.25', '', '4.43', '13', '34.0', '121', '4.07', '6.64', '11.22']
['2015', 'Detric Allen', 'East Carolina', 'CB', '73', '200', '', '', '', '*4.59', '', '', '', '', '', '']
['2015', 'Javorius Allen', 'Southern California', 'RB', '73', '221', '9.38', '31.75', '12', '4.53', '11', '35.5', '121', '4.28', '6.96', '']
As you can see, there are a lot of empty fields in the table. The better way is to put all the fields of a row in a list and then unpack them, or use a namedtuple; this will make your code more stable.
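For instance, a namedtuple sketch (assuming the soup object from the answer above; the field names are made up from the header row in the printed output):
from collections import namedtuple

Player = namedtuple('Player', ['year', 'name', 'college', 'pos', 'height', 'weight',
                               'hand_size', 'arm_length', 'wonderlic', 'forty',
                               'bench', 'vert_leap', 'broad_jump', 'shuttle',
                               'three_cone', 'sixty_shuttle'])

players = []
for tr in soup.table.find_all('tr')[1:]:   # skip the header row
    cells = [td.text for td in tr.find_all('td')]
    if len(cells) == len(Player._fields):  # also skips the colspan footer row
        players.append(Player(*cells))

print(players[0].name, players[0].height)  # Ameer Abdullah 69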

python replace \n from text

Here is a text file sample:
'15235457345', '', '\n\nR\n\nE\nM\nO\n\nV\nE\nD\n', '1445133666', 'nick', '', '1236500', 'git', '', '', '123face', '2015-10-18 ', '2015-10-23 ', 'name', 'great', 'sha', '\n\nB\n\nU\n\nT\n\nH\nO\nW\n', '1445123147'
I want to remove some pieces like
\n\nR\n\nE\nM\nO\n\nV\nE\nD\n
and
\n\nB\n\nU\n\nT\n\nH\nO\nW\n
I use REMOVED and BUTHOW here to illustrate the problem, but in real practice these are other words, timestamps, etc.
le = ['15235457345', '', '\n\nR\n\nE\nM\nO\n\nV\nE\nD\n', '1445133666', 'nick', '', '1236500', 'git', '', '', '123face', '2015-10-18 ', '2015-10-23 ', 'name', 'great', 'sha', '\n\nB\n\nU\n\nT\n\nH\nO\nW\n', '1445123147']
print([value for value in le if '\n' not in value])
Output:
['15235457345', '', '1445133666', 'nick', '', '1236500', 'git', '', '', '123face', '2015-10-18 ', '2015-10-23 ', 'name', 'great', 'sha', '1445123147']
s='15235457345', '', '\n\nR\n\nE\nM\nO\n\nV\nE\nD\n', '1445133666', 'nick', '', '1236500', 'git', '', '', '123face', '2015-10-18 ', '2015-10-23 ', 'name', 'great', 'sha', '\n\nB\n\nU\n\nT\n\nH\nO\nW\n', '1445123147'
for item in s:
    print(item.replace('\n', ''))
Output:
15235457345
REMOVED
1445133666
nick
1236500
git
123face
2015-10-18
2015-10-23
name
great
sha
BUTHOW
1445123147
Hope this is what you are looking for.
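If you want to keep the list structure (rather than dropping the multi-line items entirely, as in the first answer), a one-line sketch:
cleaned = [value.replace('\n', '') for value in le]
print(cleaned)
This keeps 'REMOVED' and 'BUTHOW' in place instead of filtering them out.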
