How to convert a vertical list to a pandas DataFrame? - python

I have a log file from a web scraper in which every record is written as a vertical list.
Example:
21-Oct-19 14:46:14 - Retrieving data from https://www.finn.no/bap/forsale/search.html?category=0.93&page=1&product_category=2.93.3904.69&sub_category=1.93.3904
0 21-Oct-19 14:46:14 - Found:
1 Title: Nesten ubrukt Canon 17-40 mm vidvinkell...
2 Price: 4 900 kr
3 Link: https://www.finn.no/bap/forsale/ad.html?...
4 21-Oct-19 14:46:14 - Found:
5 Title: Nesten ubrukt Canon 17-40 mm vidvinkell...
6 Price: 4 900 kr
7 Link: https://www.finn.no/bap/forsale/ad.html?...
8 21-Oct-19 14:46:14 - Found:
9 Title: Nesten ubrukt Canon 17-40 mm vidvinkell...
10 Price: 4 900 kr
11 Link: https://www.finn.no/bap/forsale/ad.html?...
12 21-Oct-19 14:46:14 - Found:
13 Title: Nesten ubrukt Canon 17-40 mm vidvinkell...
Can I convert it into a readable pandas DataFrame?
Example:
title price link
canon 100mm 6900kr https
canon 50mm 100r https
canon 17mm 63530kr https
My code right now looks like this:
import pandas as pd
data = pd.read_csv('finn.no-2019-10-21-.log', sep ="Line", engine='python')
df = pd.DataFrame(data)
title = 1,5,9,13,17,21
price = 2,6,10,14,18,22
link = 3,7,11,15,19,23
print(df)
Can I do anything with the numbers in the original rows to convert them to a more traditional DataFrame?

This should do it for you:
import pandas as pd

with open('finn.no-2019-10-21-.log') as f:
    lines = f.readlines()
clean = [line.strip() for line in lines]
title = [j.split('Title: ')[1] for j in clean if j.startswith('Title: ')]
price = [k.split('Price: ')[1] for k in clean if k.startswith('Price: ')]
link = [l.split('Link: ')[1] for l in clean if l.startswith('Link: ')]
# zip pairs the three lists row by row, so each advert becomes one row
df = pd.DataFrame(list(zip(title, price, link)), columns=['Title', 'Price', 'Link'])

With help from @zipa I got it working:
import pandas as pd

with open('finn.no-2019-10-22-.log') as f:
    lines = f.readlines()
clean = [line.strip() for line in lines]
titles = [j.split('Title: ')[1] for j in clean if j.startswith('Title: ')]
prices = [k.split('Price: ')[1] for k in clean if k.startswith('Price: ')]
links = [l.split('Link: ')[1] for l in clean if l.startswith('Link: ')]
output = []
# build one dict per advert; zip stops at the shortest list
for title, price, link in zip(titles, prices, links):
    articles = {}
    articles['titles'] = title
    articles['prices'] = price
    articles['links'] = link
    output.append(articles)
df = pd.DataFrame(data=output)
print(df)
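Both answers above filter the three field types separately and rely on the lists lining up. If an advert is incomplete (the sample log ends with a Title that has no Price or Link), a record-by-record pass may be more robust. The sketch below is only an illustration: `parse_records` is a hypothetical helper, the sample lines are made up in the shape of the log, and it assumes every advert starts with a line ending in "- Found:".

```python
def parse_records(lines):
    """Group 'Title:', 'Price:' and 'Link:' lines into one dict per 'Found:' marker."""
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line.endswith("- Found:"):
            # A new advert starts here; store the previous one first.
            if current:
                records.append(current)
            current = {}
        elif current is not None:
            for field in ("Title", "Price", "Link"):
                prefix = field + ": "
                if line.startswith(prefix):
                    current[field] = line[len(prefix):]
    if current:
        records.append(current)
    return records


# Made-up sample in the same shape as the log; the last advert is incomplete.
sample = [
    "21-Oct-19 14:46:14 - Found:",
    "Title: Canon 17-40 mm",
    "Price: 4 900 kr",
    "Link: https://www.finn.no/bap/forsale/ad.html?...",
    "21-Oct-19 14:46:14 - Found:",
    "Title: Canon 50 mm",
]
records = parse_records(sample)
print(records)
```

Passing `records` to `pd.DataFrame(records)` should then give one row per advert, with the missing fields of the incomplete advert left as NaN.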

Related

Storing keyvalue as header and value text as rows using data frame in python using beautiful soup

for imo in imos:
    ...
    ...
    keys_div = soup.find_all("div", {"class": "col-4 keytext"})
    values_div = soup.find_all("div", {"class": "col-8 valuetext"})
    for key, value in zip(keys_div, values_div):
        print(key.text + ": " + value.text)
    ...
Output:
Ship Name: MAERSK ADRIATIC
Shiptype: Chemical/Products Tanker
IMO/LR No.: 9636632
Gross: 23,297
Call Sign: 9V3388
Deadweight: 37,538
MMSI No.: 566429000
Year of Build: 2012
Flag: Singapore
Status: In Service/Commission
Operator: Handytankers K/S
Shipbuilder: Hyundai Mipo Dockyard Co Ltd
ShipType: Chemical/Products Tanker
Built: 2012
GT: 23,297
Deadweight: 37,538
Length Overall: 184.000
Length (BP): 176.000
Length (Reg): 177.460
Bulbous Bow: Yes
Breadth Extreme: 27.430
Breadth Moulded: 27.400
Draught: 11.500
Depth: 17.200
Keel To Mast Height: 46.900
Displacement: 46565
T/CM: 45.0
This is the output for one IMO. I want to store this output in a DataFrame and write it to a CSV file, with the key text as the header and the value text as the rows, for all the IMOs. Please help me with how to do it.
All you have to do is add the results to a list and then output that list to a dataframe.
import pandas as pd

filepath = r"C:\users\test\test_file.csv"
output_data = []
for imo in imos:
    # fetch and parse the page for this IMO into `soup` here (elided in the question)
    keys_div = [i.text for i in soup.find_all("div", {"class": "col-4 keytext"})]
    values_div = [i.text for i in soup.find_all("div", {"class": "col-8 valuetext"})]
    # one dict per IMO: key text -> value text
    dict1 = dict(zip(keys_div, values_div))
    output_data.append(dict1)
df = pd.DataFrame(output_data)
df.to_csv(filepath, index=False)
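The list-of-dicts idea also works when ships expose different keys: the header is the union of all keys and missing values stay blank. A minimal sketch of that behaviour with the standard `csv` module follows; the second ship and its values are invented for illustration, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

# Hypothetical scraped results: one dict per IMO, keys taken from the "keytext" divs.
output_data = [
    {"Ship Name": "MAERSK ADRIATIC", "IMO/LR No.": "9636632", "Flag": "Singapore"},
    {"Ship Name": "EXAMPLE SHIP", "IMO/LR No.": "0000000", "Bulbous Bow": "Yes"},
]

# Collect every key in first-seen order so the header covers all ships.
fieldnames = []
for record in output_data:
    for key in record:
        if key not in fieldnames:
            fieldnames.append(key)

buffer = io.StringIO()  # stands in for a real file opened with newline=""
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(output_data)
csv_text = buffer.getvalue()
print(csv_text)
```

Note that a dict keeps only one value per key, so a label that appears twice on the page (such as Deadweight in the output above) ends up as a single column.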

Extracting Author name from XML tags using ElementTree

Following is the link to access the XML document:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=%2726161999%27&retmode=xml
I'm trying to extract each author's name (ForeName + LastName) and build a single string containing only the author names. So far I have only been able to extract the parts separately.
Following is the code that I have tried
r = requests.get(
    'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=%2726161999%27&retmode=xml')
root = et.fromstring(r.content)
authordata = []
for elem in root.findall(".//ForeName"):
    elem_ = elem.text
    auth_name = list(elem_.split(" "))
    authordata.append(auth_name)
# flatten the nested list by joining each inner list into a string
val = [item if isinstance(item, str) else " ".join(item) for item in authordata]
# drop duplicates while keeping the original order
seen = set()
val = [x for x in val if x not in seen and not seen.add(x)]
author = ' '.join(val)
print(author)
The output obtained from the above code is:
Elisa Riccardo Mirco Laura Valentina Antonio Sara Carla Borri Barbara
The expected output is a combination of firstname + Lastname:
Elisa Oppici Riccardo Montioli Mirco Dindo Laura Maccari Valentina Porcari Antonio Lorenzetto Chellini Sara Carla Borri Voltattorni Barbara Cellini
From your question I understand that you want a concatenation of ForeName and LastName for each author. You can achieve that by querying directly for those fields for each Author element in the tree and concatenate the corresponding text fields:
import xml.etree.ElementTree as et
import requests

r = requests.get(
    'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=26161999&retmode=xml'
)
root = et.fromstring(r.content)
author_names = []
for author in root.findall(".//Author"):
    # findtext returns None when the child element is missing
    fore_name = author.findtext('ForeName')
    last_name = author.findtext('LastName')
    if fore_name and last_name:
        author_names.append(fore_name + ' ' + last_name)
print(author_names)
# or to get your exact output format:
print(' '.join(author_names))
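To try the per-Author lookup without a network call, the same loop can be run against a small inline document. The XML below is a hypothetical, trimmed-down stand-in for the PubMed response (two author names are taken from the expected output in the question; the collective-author entry is invented to show why a missing ForeName or LastName needs a guard).

```python
import xml.etree.ElementTree as et

# Hypothetical, trimmed-down stand-in for the PubMed response.
xml_text = """
<PubmedArticleSet>
  <AuthorList>
    <Author><LastName>Oppici</LastName><ForeName>Elisa</ForeName></Author>
    <Author><LastName>Montioli</LastName><ForeName>Riccardo</ForeName></Author>
    <Author><CollectiveName>Example Study Group</CollectiveName></Author>
  </AuthorList>
</PubmedArticleSet>
"""

root = et.fromstring(xml_text)
author_names = []
for author in root.findall(".//Author"):
    # Look up both parts inside the same Author element so they stay paired.
    fore_name = author.findtext("ForeName")
    last_name = author.findtext("LastName")
    if fore_name and last_name:
        author_names.append(f"{fore_name} {last_name}")
print(" ".join(author_names))
```

Searching for `.//ForeName` across the whole tree, as in the question, loses the link between a forename and its surname; iterating over `Author` elements keeps them together.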

XML to CSV in PYTHON: Extract series of subnodes for every node

My goal is to convert an .XML file into a .CSV file.
This part of the code is already functional.
However, I also want to extract the sub-sub-nodes of one of the "father" nodes.
Maybe an example would be more self-explanatory.
Here is the structure of my XML:
<nedisCatalogue>
<headerInfo>
<feedVersion>1-0</feedVersion>
<dateCreated>2018-01-22T23:37:01+0100</dateCreated>
<supplier>Nedis_BENED</supplier>
<locale>nl_BE</locale>
</headerInfo>
<productList>
<product>
<nedisPartnr><![CDATA[VS-150/63BA]]></nedisPartnr>
<nedisArtlid>17005</nedisArtlid>
<vendorPartnr><![CDATA[TONFREQ-ELKOS / BIPOL 150, 5390]]></vendorPartnr>
<brand><![CDATA[Visaton]]></brand>
<EAN>4007540053905</EAN>
<intrastatCode>8532220000</intrastatCode>
<UNSPSC>52161514</UNSPSC>
<headerText><![CDATA[Crossover Foil capacitor]]></headerText>
<internetText><![CDATA[Bipolaire elco met een ruwe folie en een zeer goede prijs/kwaliteits-verhouding voor de bouw van cross-overs. 63 Vdc, 10% tolerantie.]]></internetText>
<generalText><![CDATA[Dimensions 16 x 35 mm
]]></generalText>
<images>
<image type="2" order="15">767736.JPG</image>
</images>
<attachments>
</attachments>
<categories>
<tree name="Internet_Tree_ISHP">
<entry depth="001" id="1067858"><![CDATA[Audio]]></entry>
<entry depth="002" id="1067945"><![CDATA[Speakers]]></entry>
<entry depth="003" id="1068470"><![CDATA[Accessoires]]></entry>
</tree>
</categories>
<properties>
<property id="360" multiplierID="" unitID="" valueID=""><![CDATA[...]]></property>
</properties>
<status>
<code status="NORMAL"></code>
</status>
<packaging quantity="1" weight="8"></packaging>
<introductionDate>2015-10-26</introductionDate>
<serialnumberKeeping>N</serialnumberKeeping>
<priceLevels>
<normalPricing from="2017-02-13" to="2018-01-23">
<price level="1" moq="1" currency="EUR">2.48</price>
</normalPricing>
<specialOfferPricing></specialOfferPricing>
<goingPriceInclVAT currency="EUR" quantity="1">3.99</goingPriceInclVAT>
</priceLevels>
<tax>
</tax>
<stock>
<inStockLocal>25</inStockLocal>
<inStockCentral>25</inStockCentral>
<ATP>
<nextExpectedStockDateLocal></nextExpectedStockDateLocal>
<nextExpectedStockDateCentral></nextExpectedStockDateCentral>
</ATP>
</stock>
</product>
....
</nedisCatalogue>
And here is the code that I have now:
import xml.etree.ElementTree as ET
import csv
tree = ET.parse("/Users/BE07861/Documents/nedis_catalog_2018-01-23_nl_BE_32191_v1-0_xml")
root = tree.getroot()
f = open('/Users/BE07861/Documents/test2.csv', 'w')
csvwriter = csv.writer(f, delimiter='ç')
count = 0
head = ['Nedis Part Number', 'Nedis Article ID', 'Vendor Part Number', 'Brand', 'EAN', 'Header text', 'Internet Text', 'General Text', 'categories']
prdlist = root[1]
prdct = prdlist[5]
cat = prdct[12]
tree1=cat[0]
csvwriter.writerow(head)
for time in prdlist.findall('product'):
    row = []
    nedis_number = time.find('nedisPartnr').text
    row.append(nedis_number)
    nedis_art_id = time.find('nedisArtlid').text
    row.append(nedis_art_id)
    vendor_part_nbr = time.find('vendorPartnr').text
    row.append(vendor_part_nbr)
    Brand = time.find('brand').text
    row.append(Brand)
    ean = time.find('EAN').text
    row.append(ean)
    header_text = time.find('headerText').text
    row.append(header_text)
    internet_text = time.find('internetText').text
    row.append(internet_text)
    general_text = time.find('generalText').text
    row.append(general_text)
    categ = time.find('categories').find('tree').find('entry').text
    row.append(categ)
    csvwriter.writerow(row)
f.close()
If you run the code, you'll see that I only retrieve the first "entry" of the categories/tree; which is normal. However, I don't know how to create a loop that, for every node "categories", creates new columns such as categories1, categories2 & categories3 with the values: "entry".
My result should look like this
Nedis Part Number Nedis Article ID Vendor Part Number
VS-150/63BA 17005 TONFREQ-ELKOS / BIPOL 150, 5390
Brand EAN Header text Internet Text
Visaton 4,00754E+12 Crossover Foil capacitor Bipolaire elco …
General Text Category1 Category2 Category3
Dimensions 16 x 35 mm Audio Speakers Accessoires
I've really tried my best but didn't manage to find the solution.
Any help would be very much appreciated!!! :)
Thanks a lot,
Allan
I think this is what you're looking for:
for child in time.find('categories').find('tree'):
    categ = child.text
    row.append(categ)
Here's a solution that loops through the xml once to figure out how many headers to add, adds the headers, and then loops through each product's category list:
Updated to iterate through images in addition to categories. This is the biggest difference:
for child in time.find('categories').find('tree'):
    categ = child.text
    row.append(categ)
    curcat += 1
while curcat < maxcat:
    row.append('')
    curcat += 1
It figures out the maximum number of categories on a single record and then adds that many columns. If a particular record has fewer categories, this code inserts blank values as placeholders so the column headers always line up with the data.
For instance:
Cat1 Cat2 Cat3 Img1 Img2 Img3
A B C 1 2 3
D E <blank> 4 <blank> <blank>
Here's the full solution:
import xml.etree.ElementTree as ET
import csv

tree = ET.parse("c:\\python\\xml.xml")
root = tree.getroot()
# newline='' stops the csv module from writing blank rows on Windows
f = open('c:\\python\\xml.csv', 'w', newline='')
csvwriter = csv.writer(f, delimiter=',')
head = ['Nedis Part Number', 'Nedis Article ID', 'Vendor Part Number', 'Brand', 'EAN', 'Header text', 'Internet Text', 'General Text']
prdlist = root[1]

# First pass: find the largest number of categories on any product
maxcat = 0
for time in prdlist.findall('product'):
    cur = 0
    for child in time.find('categories').find('tree'):
        cur += 1
    if cur > maxcat:
        maxcat = cur
for cnt in range(0, maxcat):
    head.append('Category ' + str(cnt + 1))

# Same again for images
maximg = 0
for time in prdlist.findall('product'):
    cur = 0
    for child in time.find('images'):
        cur += 1
    if cur > maximg:
        maximg = cur
for cnt in range(0, maximg):
    head.append('Image ' + str(cnt + 1))
csvwriter.writerow(head)

# Second pass: write one row per product
for time in prdlist.findall('product'):
    row = []
    nedis_number = time.find('nedisPartnr').text
    row.append(nedis_number)
    nedis_art_id = time.find('nedisArtlid').text
    row.append(nedis_art_id)
    vendor_part_nbr = time.find('vendorPartnr').text
    row.append(vendor_part_nbr)
    Brand = time.find('brand').text
    row.append(Brand)
    ean = time.find('EAN').text
    row.append(ean)
    header_text = time.find('headerText').text
    row.append(header_text)
    internet_text = time.find('internetText').text
    row.append(internet_text)
    general_text = time.find('generalText').text
    row.append(general_text)
    curcat = 0
    for child in time.find('categories').find('tree'):
        categ = child.text
        row.append(categ)
        curcat += 1
    # pad with blanks so the image columns stay aligned
    while curcat < maxcat:
        row.append('')
        curcat += 1
    curimg = 0
    for img in time.find('images'):
        image = img.text
        row.append(image)
        curimg += 1
    while curimg < maximg:
        row.append('')
        curimg += 1
    csvwriter.writerow(row)
f.close()
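The two-pass idea (measure the widest category list, then pad every row to that width) can also be written more compactly. The following is a sketch under assumptions, not a replacement for the solution above: the two-product catalogue is a made-up reduction of the real feed, the second part number is invented, and an in-memory buffer stands in for the output file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical two-product catalogue, reduced to the fields needed here.
xml_text = """
<nedisCatalogue>
  <headerInfo/>
  <productList>
    <product>
      <nedisPartnr>VS-150/63BA</nedisPartnr>
      <categories><tree>
        <entry>Audio</entry><entry>Speakers</entry><entry>Accessoires</entry>
      </tree></categories>
    </product>
    <product>
      <nedisPartnr>EXAMPLE-1</nedisPartnr>
      <categories><tree><entry>Audio</entry></tree></categories>
    </product>
  </productList>
</nedisCatalogue>
"""

root = ET.fromstring(xml_text)
products = root.findall("./productList/product")

# Collect each product's category texts once, then pad to the longest list.
rows = []
for product in products:
    categories = [entry.text for entry in product.findall("./categories/tree/entry")]
    rows.append((product.findtext("nedisPartnr"), categories))
maxcat = max((len(cats) for _, cats in rows), default=0)

buffer = io.StringIO()  # stands in for open(path, "w", newline="")
writer = csv.writer(buffer)
writer.writerow(["Nedis Part Number"] + [f"Category {i + 1}" for i in range(maxcat)])
for part_number, cats in rows:
    writer.writerow([part_number] + cats + [""] * (maxcat - len(cats)))
csv_text = buffer.getvalue()
print(csv_text)
```

Looking elements up by path rather than by position (`root[1]`) keeps the code working if the feed gains or reorders top-level elements.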

How to read a particular line of interest from a text file?

Here I have a text file. I want to read the Adress, Beneficiary, Beneficiary Bank, Acc Nbr, Total US$, the date at the top, RUT and BOX. I tried writing some code myself, but it does not reliably extract the required information, and if the length of a field changes the output will be wrong. How should I do this so that each required value ends up in its own string?
The main problem is that my slicing is fragile. For example, I am using line[31:] for Acc Nbr, but if the address changes, the slice offset will be wrong.
My Text.txt
2014-11-09 BOX 1531 20140908123456 RUT 21 654321 0123
Girry S.A. CONTADO
G 5 Y Serie A
NO 098765
11 al Rayo 321 - Oqwerty 108 Monteaudio - Gruguay
Pharm Cosco, Inc - Britania PO Box 43215
Dirección Hot Springs AR 71903 - Estados Unidos
Oescripción Importe
US$
DO 7640183 - 50% of the Production Degree 246,123
Beneficiary Bank: Bankue Heritage (Gruguay) S.A Account Nbr: 1234563 Swift: MANIUYMM
Adress: Tencon 108 Monteaudio, Gruguay.
Beneficiary: Girry SA Acc Nbr: 1234567
Servicios prestados en el exterior, exentos de IVA o IRAE
Subtotal US$ 102,500
Iva US$ ---------------
Total US$ 102,500
I.V.A AL DIA Fecha de Vencimiento
IMPRENTA IRIS LTDA. - RUT 210161234015 - 0/40987 17/11/2015
CONSTANCIA N9 1234559842 -04/2013
CONTADO A 000.001/ A 000.050 x 2 VIAS
QWERTYAS ZXCVBIZADA
R. U.T. Bamprador Asdfumldor Final
Fecha 12/12/2014
1º ORIGINAL CLLLTE (Blanco) 2º CASIA AQWERVO (Rosasd)
My Code:
txt = 'Text.txt'
lines = [line.rstrip('\n') for line in open(txt)]
for line in lines:
    if 'BOX' in line:
        Date = line.split("BOX")[0]
        BOX = line.split('BOX ', 1)[-1].split("RUT")[0]
        RUT = line.split('RUT ',1)[-1]
        print('Date : ' + Date)
        print('BOX : ' + BOX)
        print('RUT : ' + RUT)
    if 'Adress' in line:
        Adress = line[8:]
        print('Adress : ' + Adress)
    if 'NO ' in line:
        Invoice_No = line.split('NO ',1)[-1]
        print('Invoice_No : ' + Invoice_No)
    if 'Swift:' in line:
        Swift = line.split('Swift: ',1)[-1]
        print('Swift : ' + Swift)
    if 'Fecha' in line and '/' in line:
        Invoice_Date = line.split('Fecha ',1)[-1]
        print('Invoice_Date : ' + Invoice_Date)
    if 'Beneficiary Bank' in line:
        Beneficiary_Bank = line[18:]
        Ben_Acc_Nbr = line.split('Nbr: ', 1)[-1]
        print('Beneficiary_Bank : ' + Beneficiary_Bank.split("Acc")[0])
        print('Ben_Acc_Nbr : ' + Ben_Acc_Nbr.split("Swift")[0])
    if 'Beneficiary' in line and 'Beneficiary Bank' not in line:
        Beneficiary = line[13:]
        print('Beneficiary : ' + Beneficiary.split("Acc")[0])
    if 'Acc Nbr' in line:
        Acc_Nbr = line.split('Nbr: ', 1)[-1]
        print('Acc_Nbr : ' + Acc_Nbr)
    if 'Total US$' in line:
        Total_US = line.split('US$ ', 1)[-1]
        print('Total_US : ' + Total_US)
Output:
Date : 2014-11-09
BOX : 1531 20140908123456
RUT : 21 654321 0123
Invoice_No : 098765
Swift : MANIUYMM
Beneficiary_Bank : Bankue Heritage (Gruguay) S.A
Ben_Acc_Nbr : 1234563
Adress : Tencon 108 Monteaudio, Gruguay.
Beneficiary : Girry SA
Acc_Nbr : 1234567
Total_US : 102,500
Invoice_Date : 12/12/2014
Some Code Changes
I have made some changes, but I am still not convinced, as I also need to include the spaces in the split strings.
I would recommend using regular expressions to extract the information you need. That avoids having to count character offsets.
import re

with open(r'C:\Quad.txt') as f:
    for line in f:
        # \S+ captures the run of non-space characters after the label
        match = re.search(r"Acc Nbr: (\S+)", line)
        if match is not None:
            Acc_Nbr = match.group(1)
            print(Acc_Nbr)
        # etc...
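You can also search for the label to obtain its index. For example:
label = "Acc Nbr: "
if label in line:
    Acc_Nbr = line[line.find(label) + len(label):]
    print(Acc_Nbr)
Note that find gives you the index of the first character of the string you searched for, so the label's length is added to skip past it.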
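Extending the regular-expression suggestion to several fields at once, one option is a table of one pattern per field. The sketch below uses a few lines copied from the sample Text.txt; the patterns themselves are assumptions about how the labels delimit each value and would need adjusting for other invoice layouts.

```python
import re

# A few lines copied from the sample Text.txt in the question.
text = """2014-11-09 BOX 1531 20140908123456 RUT 21 654321 0123
Beneficiary Bank: Bankue Heritage (Gruguay) S.A Account Nbr: 1234563 Swift: MANIUYMM
Adress: Tencon 108 Monteaudio, Gruguay.
Beneficiary: Girry SA Acc Nbr: 1234567
Total US$ 102,500
"""

# One pattern per field; each capture group stops at the next label or line end.
patterns = {
    "Date": r"^(\d{4}-\d{2}-\d{2}) BOX",
    "BOX": r"BOX (.+?) RUT",
    "RUT": r"RUT (.+)$",
    "Beneficiary_Bank": r"^Beneficiary Bank: (.+?) Account Nbr:",
    "Adress": r"^Adress: (.+)$",
    "Beneficiary": r"^Beneficiary: (.+?) Acc Nbr:",
    "Acc_Nbr": r"Acc Nbr: (\S+)",
    "Total_US": r"^Total US\$ (\S+)",
}

fields = {}
for name, pattern in patterns.items():
    # MULTILINE lets ^ and $ match at every line boundary in the text.
    match = re.search(pattern, text, flags=re.MULTILINE)
    if match:
        fields[name] = match.group(1).strip()

for name, value in fields.items():
    print(f"{name} : {value}")
```

Because each value is delimited by its surrounding labels rather than by fixed offsets, a longer address or bank name does not shift the other fields.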

Organize by Twitter unique identifier using python

I have a CSV file with each line containing information pertaining to a particular tweet (i.e. each line contains Lat, Long, User_ID, tweet and so on). I need to read the file and organize the tweets by the User_ID. I am trying to end up with a given User_ID attached to all of the tweets with that specific ID.
Here is what I want:
user_id: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
user_id2: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
and so on...
This is a snippet of my code that reads in the CSV file and creates a list:
UID = []
myID = []
ID = []
# 'rU' mode was removed in Python 3.11; plain 'r' works
with open(csv_in, 'r') as f:
    myreader = csv.reader(f, delimiter=',')
    for row in myreader:
        # Assign columns in csv to variables.
        latitude = row[0]
        longitude = row[1]
        user_id = row[2]
        user_name = row[3]
        date = row[4]
        time = row[5]
        tweet = row[6]
        flag = row[7]
        compound = row[8]
        Vote = row[9]
        # Read variables into separate lists.
        UID.append(user_id + ', ' + latitude + ', ' + longitude + ', ' + user_name + ', ' + date + ', ' + time + ', ' + tweet + ', ' + flag + ', ' + compound)
myID = ', '.join(UID)
ID = myID.split(', ')
I'd suggest you use pandas for this. It will allow you not only to list your tweets by user_id, as in your question, but also to do many other manipulations quite easily.
As an example, take a look at this Python notebook from NLTK. At the end of it, you will see an operation very close to yours: reading a CSV file containing tweets.
In [25]:
import pandas as pd
tweets = pd.read_csv('tweets.20150430-223406.tweet.csv', index_col=2, header=0, encoding="utf8")
You can also find a simple operation: looking for the tweets of a certain user,
In [26]:
tweets.loc[tweets['user.id'] == 557422508]['text']
Out[26]:
id
593891099548094465 VIDEO: Sturgeon on post-election deals http://...
593891101766918144 SNP leader faces audience questions http://t.c...
Name: text, dtype: object
For listing the tweets by user_id, you would simply do something like the following (this is not in the original notebook),
In [9]:
tweets.set_index('user.id')[0:4]
Out[9]:
created_at favorite_count in_reply_to_status_id in_reply_to_user_id retweet_count retweeted text truncated
user.id
107794703 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #KirkKus: Indirect cost of the UK being in ... False
557422508 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False VIDEO: Sturgeon on post-election deals http://... False
3006692193 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #LabourEoin: The economy was growing 3 time... False
455154030 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #GregLauder: the UKIP east lothian candidat... False
Hope it helps.
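For the exact layout asked for in the question (each user ID followed by all of its lat, long, tweet triples), a plain-Python grouping may be enough; in pandas the corresponding operation would be a `groupby` on the user ID column. The sketch below assumes the column order given in the question, and the rows themselves are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the column order the question describes:
# latitude, longitude, user_id, user_name, date, time, tweet, flag, compound, vote
csv_text = """\
55.9,-3.2,557422508,alice,2015-04-30,21:34:06,first tweet,0,0.1,1
51.5,-0.1,107794703,bob,2015-04-30,21:34:06,another tweet,0,0.0,0
55.9,-3.2,557422508,alice,2015-04-30,21:35:10,second tweet,0,0.3,1
"""

# Map each user_id to the list of (latitude, longitude, tweet) tuples it posted.
tweets_by_user = defaultdict(list)
for row in csv.reader(io.StringIO(csv_text)):
    latitude, longitude, user_id, tweet = row[0], row[1], row[2], row[6]
    tweets_by_user[user_id].append((latitude, longitude, tweet))

for user_id, tweets in tweets_by_user.items():
    print(user_id)
    for latitude, longitude, tweet in tweets:
        print(f"    {latitude}, {longitude}, {tweet}")
```

Keeping the fields as tuples, rather than joining them into one comma-separated string as in the question, avoids breaking tweets that themselves contain commas.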
