Scrapy - Getting nan when getting data from a CSV

Here is my code snippet to get the data I need from the CSV:
import pandas as pd

pathName = 'pathName'
# pathName: path to the CSV file
# skiprows: the first row holds the title, which we don't need
export = pd.read_csv(pathName, skiprows=[0], header=None)
omsList = export.values.T[1].tolist()  # transpose the matrix and take the second column
for omsID in omsList:
    productOMS = omsID
Here is how I'm yielding said item:
item['productOMS'] = productOMS
yield item
Here is the column I am trying to get data from
When I run my spider I get nan as the output for omsID, which, after some research, I found out means "not a number". That would make sense if the values are being treated as strings. How would I adjust my program so that these fields are read in as strings, or alternatively read in as ints?

You need to use Python's type conversion (casting): for example, int(my_numerical_string) tells Python to interpret the text as an integer. You can also use type(my_var) to find out the type of a variable.
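As a rough illustration of the casting advice above, the sketch below inspects the type of each value and converts it with int(), treating NaN and non-numeric text as missing. The helper name to_int_or_none and the sample values are assumptions made up for this example; they do not come from the question.

```python
import math

def to_int_or_none(value):
    """Return value as an int, or None if it is missing (NaN) or not numeric."""
    # Empty cells typically arrive as float('nan'), which int() cannot convert
    if isinstance(value, float) and math.isnan(value):
        return None
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

raw_values = ["1042", 1043.0, float("nan"), "n/a"]  # made-up sample data
print([type(v).__name__ for v in raw_values])   # shows which values are str and which are float
print([to_int_or_none(v) for v in raw_values])  # ints where possible, None otherwise
```

Checking the types first makes it easier to see whether a nan comes from an empty cell or from a conversion that went wrong.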

This turned out to be a silly problem that I did not see coming: I had to increase the width of the target column in Excel so that the values could actually be read in.

Related

How would I be able to remove this part of the variable?

I am writing a guessing game. The data for the game is in a CSV file, so I decided to use pandas. I have tried to use pandas to import my CSV file, pick a random row, and put the data into variables so I can use it in the rest of the code, but I can't figure out how to format the data in the variables correctly.
I've tried to split the string with split(), but I am quite lost.
ar = pandas.read_csv('names.csv')
ar.columns = ["Song Name","Artist","Intials"]
randomsong = ar.sample(1)
songartist = randomsong["Artist"]
songname = (randomsong["Song Name"])
songintials = randomsong["Intials"]
print(songname)
My CSV file looks like this.
Song Name,Artist,Intials
Someone you loved,Lewis Capaldi,SYL
Bad Guy,Billie Eilish,BG
Ransom,Lil Tecca,R
Wow,Post Malone, W
I expect the output to be the name of the song from the CSV file. For example:
Bad Guy
Instead the output is
1 Bad Guy
Name: Song Name, dtype:object
If anyone knows the solution please let me know. Thanks

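You're getting a Series object as output. You can try the following, where index=False drops the row label (the leading 1) from the printed text:
randomsong["Song Name"].to_string(index=False)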

Use df['column'].values to get the values of a column.
In your case, songartist = randomsong["Artist"].values[0], because you want only the first element of the returned array.
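For comparison, the same pick-a-random-row idea can be sketched without pandas, using only the standard library's csv and random modules. This is an alternative technique, not the approach used in the question or the answers; the inline CSV text and the function name pick_random_song are assumptions for illustration.

```python
import csv
import io
import random

# Inline copy of a few rows from the question's CSV, so the sketch runs without a file
CSV_TEXT = """Song Name,Artist,Intials
Someone you loved,Lewis Capaldi,SYL
Bad Guy,Billie Eilish,BG
Ransom,Lil Tecca,R
"""

def pick_random_song(csv_text):
    """Return one random row as a plain dict of column name -> string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return random.choice(rows)

song = pick_random_song(CSV_TEXT)
print(song["Song Name"])  # a bare string, with no index or dtype footer
```

Because each row is an ordinary dict of strings, there is nothing to unwrap before using the values in the rest of the game.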

Using an IF THEN loop with nested JSON files in Python

I am currently writing a program which uses the Companies House API to return a JSON file containing information about a certain company.
I am able to retrieve the data easily using the following commands:
r = requests.get('https://api.companieshouse.gov.uk/company/COMPANY-NO/filing-history', auth=('API-KEY', ''))
data = r.json()
With that information I can do an awful lot, but I've run into a problem which I was hoping you could possibly help me with. What I aim to do is go through every nested entry in the JSON file and check whether the values of certain keys match certain criteria; if the values of two keys match, other code is executed.
One of the keys is the date of an entry, and I would like to ignore results that are older than a certain date. I have attempted to do this with the following:
date_threshold = datetime.date.today() - datetime.timedelta(days=30)
for each in data["items"]:
    date = ['date']
    type = ['type']
    if date < date_threshold and type is "RM01":
        print("wwwwww")
In case it isn't clear, what I'm attempting to do (albeit very badly) is assign each of the entries to a variable, which then gets tested against certain criteria.
However, this doesn't work; Python raises a type error:
TypeError: unorderable types: list() < datetime.date()
This makes me think the date is being stored as a string, so I can't compare it to the datetime value set earlier. But when I check the API documentation (https://developer.companieshouse.gov.uk/api/docs/company/company_number/filing-history/filingHistoryItem-resource.html), it clearly says that the 'date' entry is returned as a date type.
What am I doing wrong? It's very clear that I'm extremely new to Python, given what I presume is the atrocity of my code, but in my head it seems to make at least a little sense. In case none of this is clear, I basically want to go through all the entries in the JSON file, and if the date and type match a certain description, then other code can be executed (in this case I have just printed random text).
Any help is greatly appreciated! Let me know if you need anything cleared up.
:)
EDIT
After tweaking my code to the below:
for each in data["items"]:
    date = each['date']
    type = each['type']
    if date is '2016-09-15' and type is "RM01":
        print("wwwwww")
The code executes without any errors, but the words aren't printed, even though I know there is an entry in the JSON file with that exact date and that exact type. Any thoughts?
SOLUTION:
Thanks to everyone for helping me out. I had made a couple of very basic errors; the code that works as expected is below:
for each in data["items"]:
    date = each['date']
    typevariable = each['type']
    if date == '2016-09-15' and typevariable == "RM01":
        print("wwwwww")
This prints the word "wwwwww" 3 times, which is correct seeing as there are 3 entries in the JSON that fulfil those criteria.

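You are comparing a list with a date: date = ['date'] creates a new one-element list instead of reading the entry, so date < date_threshold compares a list with the datetime.date stored in date_threshold. Read the value with each['date'] instead.
The value in the JSON is a string, so you then need to convert it to a date using datetime.strptime() before comparing it with date_threshold.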
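A minimal sketch of the date comparison, assuming the date strings use the YYYY-MM-DD form seen in the question's edit ('2016-09-15'). The sample entries, the function name recent_items, and the fixed threshold are made up for illustration. The sketch keeps entries on or after the threshold, following the stated goal of ignoring older results.

```python
import datetime

# Made-up entries shaped like the "items" list in the question
data = {
    "items": [
        {"date": "2016-09-15", "type": "RM01"},
        {"date": "2016-09-20", "type": "AA"},
        {"date": "2016-01-02", "type": "RM01"},
    ]
}

def recent_items(items, wanted_type, threshold):
    """Return the entries of wanted_type dated on or after threshold."""
    matches = []
    for entry in items:
        # strptime returns a datetime; .date() makes it comparable with a date
        entry_date = datetime.datetime.strptime(entry["date"], "%Y-%m-%d").date()
        # Compare values with ==, not identity with "is"
        if entry_date >= threshold and entry["type"] == wanted_type:
            matches.append(entry)
    return matches

# A fixed threshold keeps the example reproducible; the question uses
# datetime.date.today() - datetime.timedelta(days=30) instead.
threshold = datetime.date(2016, 9, 1)
print(recent_items(data["items"], "RM01", threshold))
```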

Parsing multiple occurrences of an item into a dictionary

I am attempting to parse several separate image links from JSON data in Python, but I am having trouble drilling down to the right level, which I believe is due to having a list of strings.
For the majority of the items, I've had success with the below example, pulling back everything I need. Outside of this instance, everything is a 1:1 ratio of keys:values, but for this one, there are multiple values associated with one key.
resultsdict['item_name'] = item['attribute_key']
I've been adding it all to a resultsdict={}, but am only able to get to the below sample string when I print.
INPUT:
for item in data['Item']:
    resultsdict['images'] = item['Variations']['Pictures']
OUTPUT (only relevant section):
'images': [{u'VariationSpecificPictureSet': [{u'PictureURL': [u'http//imagelink1'], u'VariationSpecificValue': u'color1'}, {u'PictureURL': [u'http//imagelink2'], u'VariationSpecificValue': u'color2'}, {u'PictureURL': [u'http//imagelink3'], u'VariationSpecificValue': u'color3'}, {u'PictureURL': [u'http//imagelink4'], u'VariationSpecificValue': u'color4'}]
I feel like I could add ['VariationSpecificPictureSet']['PictureURL'] at the end of my initial input, but that throws an error because list indices must be integers, not strings.
Ideally, I would like to see the output as a simple comma-separated list of just the URLs, as follows:
OUTPUT:
'images': http//imagelink1, http//imagelink2, http//imagelink3, http//imagelink4

This answers your comment, since it needs a bit of code.
When using
for item in data['Item']:
    resultsdict['images'] = item['Variations']['Pictures']
you get a list with one element, so I recommend using this
for item in data['Item']:
    resultsdict['images'] = item['Variations']['Pictures'][0]
now you can use
for image in resultsdict['images']['VariationSpecificPictureSet']:
    print(image['PictureURL'])

Thanks for the help, @uzzee, it's appreciated. I kept tinkering with it and was able to pull a single flat list of all the image URLs with the following code.
resultsdict['images'] = sum([x['PictureURL'] for x in item['Variations']['Pictures'][0]['VariationSpecificPictureSet']], [])
Without the sum it looks like this and pulls in the whole list of lists...
resultsdict['images'] = [x['PictureURL'] for x in item['Variations']['Pictures'][0]['VariationSpecificPictureSet']]
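As an alternative to sum(..., []), the nested URL lists can also be flattened with itertools.chain.from_iterable and then joined into the comma-separated string the question asks for. This is a sketch on made-up data shaped like the output shown in the question; the variable names are assumptions.

```python
from itertools import chain

# Made-up data shaped like the output shown in the question
pictures = [
    {
        "VariationSpecificPictureSet": [
            {"PictureURL": ["http//imagelink1"], "VariationSpecificValue": "color1"},
            {"PictureURL": ["http//imagelink2"], "VariationSpecificValue": "color2"},
            {"PictureURL": ["http//imagelink3"], "VariationSpecificValue": "color3"},
        ]
    }
]

picture_sets = pictures[0]["VariationSpecificPictureSet"]

# Each PictureURL value is itself a list, so chain the inner lists together
urls = list(chain.from_iterable(entry["PictureURL"] for entry in picture_sets))

# Join into one comma-separated string
images = ", ".join(urls)
print(images)
```

Both approaches give the same flat list; chain.from_iterable simply avoids building a new intermediate list for every addition.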

Python: Using str.split and getting list index out of range

I just started using Python and am trying to convert some of my R code into Python. The task is relatively simple: I have many CSV files, each with a variable name (in this case cell lines) and values (IC50s). I need to pull out all the variables, and their values, that are shared among all the files. Some of these files share the same variables but format them differently. For example, in some files a variable is just "Cell_line" and in others it is "MEL:Cell_line". So, to make a direct string comparison, I first need to format them the same way, and hence am trying to use str.split() to do so. There is probably a much better way to do this, but for now I am using the following code:
import csv
import os
# Change working directory
os.chdir("/Users/joshuamannheimer/downloads")
file_name="NCI60_Bleomycin.csv"
with open(file_name) as csvfile:
    NCI_data=csv.reader(csvfile, delimiter=',')
    alldata={}
    for row in NCI_data:
        name_str=row[0]
        splt=name_str.split(':')
        n_name=splt[1]
        alldata[n_name]=row[1]
name_str.split(':') returns a list of length 2. Since the portion I want is after the ":", I want the second element, which should be indexed as splt[1], as splt[0] is the first in Python. However, when I run the code I get this error message: "IndexError: list index out of range".
I'm trying to get the second element out of a list of length 2, so I have no idea why it is out of range. Any help or suggestions would be appreciated.

I am pretty sure that there are some rows where name_str does not contain a ':'. From your own example, if name_str is Cell_line, splt has only one element and splt[1] fails.
If you are sure that there is at most one ':' in name_str, or if there are several and you want the part after the last one, use splt[-1] instead of splt[1]. The -1 index takes the last element of the list (unless it is empty).
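The effect of the -1 index can be seen in a short sketch. The helper name cell_line_name is an assumption for illustration; the two sample names are the ones mentioned in the question.

```python
def cell_line_name(name_str):
    """Return the part after the last ':', or the whole string if there is none."""
    # split(':') always yields at least one element, so [-1] never raises IndexError
    return name_str.split(":")[-1]

print(cell_line_name("MEL:Cell_line"))  # prefixed form
print(cell_line_name("Cell_line"))      # bare form
```

Both calls return the same name, which is what makes the direct string comparison across files possible.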

The simple answer is that sometimes the data doesn't follow the specification you assumed when you wrote this code (i.e. that there is a colon and two fields).
The easiest way to deal with this is to add an if block, if len(splt)==2:, and do the subsequent lines within that block.
Optionally, add an else: and print the lines that are not to spec, or save them somewhere so you can diagnose the problem.
Like this:
import csv
import os
# Change working directory
os.chdir("/Users/joshuamannheimer/downloads")
file_name="NCI60_Bleomycin.csv"
with open(file_name) as csvfile:
    NCI_data=csv.reader(csvfile, delimiter=',')
    alldata={}
    for row in NCI_data:
        name_str=row[0]
        splt=name_str.split(':')
        if len(splt)==2:
            n_name=splt[1]
            alldata[n_name]=row
        else:
            print("invalid name: "+name_str)
Alternatively, you can use try/except, which in this case is a bit more robust because we can handle an IndexError raised anywhere, in either row[0] or splt[1], with the one exception handler, and we don't have to require that splitting on ':' yields exactly 2 fields.
In addition, we can explicitly check that there actually is a ':' before splitting, and assign the name appropriately.
import csv
import os
# Change working directory
os.chdir("/Users/joshuamannheimer/downloads")
file_name="NCI60_Bleomycin.csv"
with open(file_name) as csvfile:
    NCI_data=csv.reader(csvfile, delimiter=',')
    alldata={}
    for row in NCI_data:
        try:
            name_str=row[0]
            if ':' in name_str:
                splt=name_str.split(':')
                n_name=splt[1]
            else:
                n_name = name_str
            alldata[n_name]=row
        except IndexError:
            print("bad row:"+str(row))

Search a single column for a particular value in a CSV file and return an entire row

Issue
The code does not correctly identify the input (item). It simply dumps to my failure message even if such a value exists in the CSV file. Can anyone help me determine what I am doing wrong?
Background
I am working on a small program that asks for user input (function not given here), searches a specific column in a CSV file (Item) and returns the entire row. The CSV data format is shown below. I have shortened the data from the actual amount (49 field names, 18000+ rows).
Code
import csv
from collections import namedtuple
from contextlib import closing

def search():
    item = 1000001
    raw_data = 'active_sanitized.csv'
    failure = 'No matching item could be found with that item code. Please try again.'
    check = False
    with closing(open(raw_data, newline='')) as open_data:
        read_data = csv.DictReader(open_data, delimiter=';')
        item_data = namedtuple('item_data', read_data.fieldnames)
        while check == False:
            for row in map(item_data._make, read_data):
                if row.Item == item:
                    return row
                else:
                    return failure
CSV structure
active_sanitized.csv
Item;Name;Cost;Qty;Price;Description
1000001;Name here:1;1001;1;11;Item description here:1
1000002;Name here:2;1002;2;22;Item description here:2
1000003;Name here:3;1003;3;33;Item description here:3
1000004;Name here:4;1004;4;44;Item description here:4
1000005;Name here:5;1005;5;55;Item description here:5
1000006;Name here:6;1006;6;66;Item description here:6
1000007;Name here:7;1007;7;77;Item description here:7
1000008;Name here:8;1008;8;88;Item description here:8
1000009;Name here:9;1009;9;99;Item description here:9
Notes
My experience with Python is relatively little, but I thought this would be a good problem to start with in order to learn more.
I determined the methods to open (and wrap in a close function) the CSV file, read the data via DictReader (to get the field names), and then create a named tuple to be able to quickly select the desired columns for the output (Item, Cost, Price, Name). Column order is important, hence the use of DictReader and namedtuple.
While there is the possibility of hard-coding each of the field names, I felt that if the program can read them on file open, it would be much more helpful when working on similar files that have the same column names but different column organization.
Research
CSV header and named tuple: What is the pythonic way to read CSV file data as rows of namedtuples?
Converting CSV data to tuple: How to split a CSV row so row[0] is the name and any remaining items are a tuple?
There were additional links of research, but I cannot post more than two.

You have three problems with this:
You return on the first failure, so it will never get past the first line.
You are reading strings from the file, and comparing to an int.
_make iterates over the dictionary keys, not the values, producing the wrong result (item_data(Item='Name', Name='Price', Cost='Qty', Qty='Item', Price='Cost', Description='Description')).
for row in (item_data(**data) for data in read_data):
    if row.Item == str(item):
        return row
return failure
This fixes the issues at hand - we check against a string, and we only return if none of the items matched (although you might want to begin converting the strings to ints in the data rather than this hackish fix for the string/int issue).
I have also changed the way you are looping - using a generator expression makes for a more natural syntax, using the normal construction syntax for named attributes from a dict. This is cleaner and more readable than using _make and map(). It also fixes problem 3.
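Putting the three fixes together, a complete version of the function might look like the sketch below. It is only one possible arrangement: passing the item and an open file object as parameters, the FAILURE constant, and the inline sample data (read through io.StringIO so the example runs without a file) are assumptions made for illustration.

```python
import csv
import io
from collections import namedtuple

# Inline sample in the question's format, so the sketch runs without a file
CSV_TEXT = """Item;Name;Cost;Qty;Price;Description
1000001;Name here:1;1001;1;11;Item description here:1
1000002;Name here:2;1002;2;22;Item description here:2
"""

FAILURE = 'No matching item could be found with that item code. Please try again.'

def search(item, csv_file):
    """Return the first row whose Item column equals item, else FAILURE."""
    read_data = csv.DictReader(csv_file, delimiter=';')
    item_data = namedtuple('item_data', read_data.fieldnames)
    for data in read_data:
        # Build the namedtuple by field name, so values land in the right fields
        row = item_data(**data)
        # Values read from a CSV are strings, so compare against str(item)
        if row.Item == str(item):
            return row
    # Only reached when no row matched
    return FAILURE

print(search(1000002, io.StringIO(CSV_TEXT)))
print(search(42, io.StringIO(CSV_TEXT)))
```

With a real file, the same function could be called as search(1000001, open_data) inside a with open('active_sanitized.csv', newline='') as open_data: block.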
