Scrapy handle missing path - python
I am building a forum-scraper for a university project. The page of the forum that I am using is the following: https://www.eurobricks.com/forum/index.php?/forums/topic/163541-lego-ninjago-2019/&tab=comments#comment-2997338.
I am able to extract all the information that I need except for the location. This information is stored inside the following markup:

    <li class="ipsType_light">
      <span class="fc">Country_name</span>
    </li>

The problem is that sometimes this element does not exist for a comment, and my current solution cannot handle that.
Here is the code I wrote to extract the location:
    location_path = "//span[@class='fc']/text()"

    def parse_thread(self, response):
        comments = response.xpath("//*[@class='cPost_contentWrap ipsPad']")
        username = response.xpath(self.user_path).extract()
        x = len(username)
        if x > 0:
            score = response.xpath(self.score_path).extract()
            content = ["".join(comment.xpath(".//*[@data-role='commentContent']/p/text()").extract()) for comment in comments]
            date = response.xpath(self.date_path).extract()
            location = response.xpath(self.location_path).extract()
            for i in range(x):
                yield {
                    "title": title,
                    "category": category,
                    "user": username[i],
                    "score": score[i],
                    "content": content[i],
                    "date": date[i],
                    "location": location[i]
                }
One possible solution I tried was to check the length of the location list, but it is not working.
Right now the code produces the following (sample data):
Title | category | test1 | 502 | 22 june 2020 | correct country
Title | category | test2 | 470 | 22 june 2020 | wrong country (it takes the next user country)
Title | category | test3 | 502 | 28 june 2020 | correct country
And what I would like to achieve is:
Title | category | test1 | 502 | 22 june 2020 | correct country
Title | category | test2 | 470 | 22 june 2020 | Not available
Title | category | test3 | 502 | 28 june 2020 | correct country
The solution to my problem was, instead of selecting each specific piece of information one by one, to first select the entire block that contains all the pieces of information for a single comment, and only then pick out each individual field relative to that block, as sketched below.
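A minimal sketch of that per-comment approach, using the selectors quoted above; the extraction of the other fields is only indicated, since their original XPaths are not shown in the question:

    def parse_thread(self, response):
        # Select one block per comment, then extract every field relative to
        # that block, so a missing location only affects its own comment.
        for comment in response.xpath("//*[@class='cPost_contentWrap ipsPad']"):
            yield {
                "content": "".join(comment.xpath(".//*[@data-role='commentContent']/p/text()").getall()),
                # .get(default=...) returns the fallback when the node is missing
                "location": comment.xpath(".//span[@class='fc']/text()").get(default="Not available"),
                # user, score and date would be extracted the same way, using
                # relative versions of the original self.*_path expressions
            }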
Related
SAS Programming: How to replace missing values in multiple columns using one column?
Background

I have a large dataset in SAS that has 17 variables, of which four are numeric and 13 are character/string. The original dataset that I am using can be found here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data.

Character: cylinders, condition, drive, paint_color, type, manufacturer, title_status, model, fuel, transmission, description, region, state
Numeric: price, posting_date, odometer, year

After applying specific filters to the numeric columns, there are no missing values for any numeric variable. However, there are thousands to hundreds of thousands of missing values for the remaining 13 character/string variables.

Request

Similar to the Towards Data Science blog post shown here (https://towardsdatascience.com/end-to-end-data-science-project-predicting-used-car-prices-using-regression-1b12386c69c8), specifically under the Feature Engineering section, how can I write the equivalent SAS code that uses regex on the description column to fill missing values of the other string/char columns with categorical values such as cylinders, condition, drive, paint_color, and so on?

Here is the Python code from the blog post.

    import re

    manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover | lexus \
    | honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
    mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
    | hennessey)'
    condition = '(excellent | good | fair | like new | salvage | new)'
    fuel = '(gas | hybrid | diesel |electric)'
    title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
    transmission = '(automatic | manual)'
    drive = '(4x4 | awd | fwd | rwd | 4wd)'
    size = '(mid-size | full-size | compact | sub-compact)'
    type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
    paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
    cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'

    keys = ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive', 'size', 'type', 'paint_color', 'cylinders']
    columns = [manufacturer, condition, fuel, title_status, transmission, drive, size, type_, paint_color, cylinders]

    for i, column in zip(keys, columns):
        database[i] = database[i].fillna(
            database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()

    database.drop('description', axis=1, inplace=True)

What would be the equivalent SAS code for the Python code shown above?
It's basically just doing a word search of sorts. A simplified example in SAS:

    data want;
        set have;
        array _fuel(*) $ _temporary_ ("gas", "hybrid", "diesel", "electric");
        do i = 1 to dim(_fuel);
            if find(description, _fuel(i), 'it') > 0 then fuel = _fuel(i); *does not deal with multiple finds so the last one found will be kept;
        end;
    run;

You can expand this by creating an array for each variable and then looping through your lists. I think you can replace the loop with a regex function in SAS as well, but regex requires too much thinking, so someone else will have to provide that answer.
Python program that reorganizes Excel formatting?
I am working on a Python program that aims to take Excel data that is vertical and make it horizontal. For example, the data is shaped something like this:

    County  | State | Number | Date
    Oakland | MI    | 19     | 1/12/10
    Oakland | MI    | 32     | 1/19/10
    Wayne   | MI    | 9      | 1/12/10
    Wayne   | MI    | 6      | 1/19/10

But I want it like this (purposefully excluding the state):

    County  | 1/12/10 | 1/19/10
    Oakland | 19      | 32
    Wayne   | 9       | 6

(And for the actual data, it’s quite long.)

My logic so far:

    Read in the Excel file
    Loop through the counties
    If the county name is the same, place the number in row 1?
    Make a new Excel file?

Any ideas of how to write this out? I think I am a little stuck on the syntax here.
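The reshaping described is essentially a pivot. A minimal sketch with pandas, using the column names from the example above; the file names are placeholders:

    import pandas as pd

    # Read the vertical data, turn each Date into its own column, and write it back out.
    # "input.xlsx" and "output.xlsx" are placeholder file names.
    df = pd.read_excel("input.xlsx")
    wide = df.pivot(index="County", columns="Date", values="Number")
    wide.to_excel("output.xlsx")

If a county/date pair can appear more than once, pivot_table with an aggregation function would be the safer choice.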
Web scraping using beautiful soup is giving inaccurate results
So I am using Beautiful Soup and trying to get the list of companies from the website https://www.weps.org/companies. The function I have made simply takes the URL "https://www.weps.org/companies?combine=&field_sector_target_id=All&field_company_type_value=All&field_number_of_employees_value=All&field_region_target_id=All&field_country_target_id=All&page=0" and increments the last digit until it is 310 to get the list from all the pages. Then a simple get_text() is used to get the data, which is saved to CSV. I got an almost complete list, but some entries are not in chronological order and some are repeated. I think 95% or more of the data is accurate, but some of it is altered. What could be the reason? This is my code:

    #!/usr/bin/python3
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    company = []
    types = []

    requrl = "https://www.weps.org/companies?combine=&field_sector_target_id=All&field_company_type_value=All&field_number_of_employees_value=All&field_region_target_id=All&field_country_target_id=All&page=0"
    reqlist = list(requrl)

    j = 0
    for i in range(0, 310):
        reqlist[-1] = j
        j = j + 1
        listToStr = ''.join([str(elem) for elem in reqlist])
        page = requests.get(listToStr)
        soup = BeautifulSoup(page.content, 'html.parser')
        company_only = soup.select(".field-content .skiptranslate")
        company = company + [cm.get_text() for cm in company_only]
        types_only = soup.select(".views-field-nothing .field-content")
        types = types + [tp.get_text() for tp in types_only]

    data = pd.DataFrame({
        'Name': company,
        'Type | Location | Date': types  # 'Type | Location | Data': types
    })

    data.to_csv(r'finalfile.csv', index=False)
I tried tidying your code and using requests.session(). Your range is wrong: it only goes to page 309. I also stripped whitespace to make it easier to parse.

    #!/usr/bin/python3
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    session = requests.session()

    company = []
    types = []

    base_url = "https://www.weps.org/companies?combine=&field_sector_target_id=All&field_company_type_value=All&field_number_of_employees_value=All&field_region_target_id=All&field_country_target_id=All&page="

    # The last page with data on it is 310, so use range(0, 311).
    for i in range(0, 311):
        page = session.get(f'{base_url}{i}')
        soup = BeautifulSoup(page.content, 'html.parser')
        company_only = soup.select(".field-content .skiptranslate")
        company = company + [cm.get_text().strip() for cm in company_only]
        types_only = soup.select(".views-field-nothing .field-content")
        types = types + [tp.get_text().strip() for tp in types_only]

    data = pd.DataFrame({
        'Name': company,
        'Type | Location | Date': types  # 'Type | Location | Data': types
    })

    data.to_csv(r'finalfile.csv', index=False)

I then counted the lines in the file:

    cat finalfile.csv | wc -l
    3104

The website was reporting 3103 companies; with the header row in the CSV file, that count is correct. Then I counted the unique lines in the file:

    cat finalfile.csv | sort -u | wc -l
    3091

Some companies are repeated, so I printed the duplicates:

    cat finalfile.csv | sort | uniq -d
    Banco Amazonas S.A.,Banks | Americas and the Caribbean | Ecuador | 09 May 2019
    Careem,Software & Computer Services | Arab States | Qatar | 13 May 2018
    Careem,Software & Computer Services | Asia and the Pacific | Pakistan | 13 May 2018
    Hong Kong Exchanges and Clearing Limited,"Financial Services | Asia and the Pacific | China, Hong Kong SAR |"
    H?TAY PLAZA,General Retailers | Europe and Central Asia | Turkey | 06 March 2019
    "Kowa Co., Ltd.",Health Care Equipment & Services | Asia and the Pacific | Japan | 17 September 2010
    Madrigal Sports,General Industrials | Asia and the Pacific | Pakistan | 05 December 2017
    Novartis Corporativo S.A. de C.V.,Health Care Providers | Global | Mexico | 07 February 2020
    Poppins Corporation,Support Services | Asia and the Pacific | Japan | 17 September 2010
    Procter & Gamble Japan K.K.,Food & Drug Retailers | Asia and the Pacific | Japan | 17 September 2010
    "Shiseido Co., Ltd.",Personal Goods | Asia and the Pacific | Japan | 17 September 2010
    Tesco PLC,Food & Drug Retailers | Europe and Central Asia | United Kingdom of Great Britain and Northern Ireland | 06 March 2019
    Xiaohongshu,Internet | Asia and the Pacific | China | 05 March 2020

I repeated running the script and the bash commands and got the same result, so I conclude that the 3103 companies listed on the website include duplicates and that none are missing from the results. Just to check, I searched for the keyword "Careem" and got duplicated results.
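If the repeated rows are not wanted in the CSV, a small addition (not part of the original answer) is to drop exact duplicates before saving:

    # Remove exact duplicate rows, mirroring the duplicates found on the website.
    data = data.drop_duplicates()
    data.to_csv(r'finalfile.csv', index=False)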
Filling out a website form using Python requests
I'm trying to programmatically fill out a form on a page using Python requests. I wrote some code to do that: #!/usr/bin/python import requests URL = 'https://www.acgov.org/ptax_pub_app/RealSearch.do' payload = { 'displayApn': '1-123-1', 'showHistory': 'y', } s = requests.session() r = s.post(URL, data=payload) print r.status_code print r.cookies print r.text However, the output isn't coming out as expected. The status code returned is 200 The cookies are printing out as <RequestsCookieJar[]> And the text of the response has html headers but it's just a bunch of jumbled up javascript: <!DOCTYPE html> <html><head> <meta http-equiv="Pragma" content="no-cache"/> <meta http-equiv="Expires" content="-1"/> <meta http-equiv="CacheControl" content="no-cache"/> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> <link rel="shortcut icon" href="data:;base64,iVBORw0KGgo="/> <script> (function(){ window["bobcmn"] = "111110101010102000000022000000052000000012744f9810200000096300000021application/x-www-form-urlencoded300000000300000006/TSPD/300000008TSPD_101300000005https3000000b008ae96f08bab2000f746485dcaefc4a635c0beff477f241b9355c916986257756d516313dd184676085e51d6fb0a280088bb71708ecac997cbd3b91abf62403b987812f208f2d2cfcb59631333f545e4de4c55cc4d2f00b230000002ashowHistory%3dy%26displayApn%3d1%2d123%2d1200000000"; window.yfma=!!window.yfma;try{(function(){(function(){})();var _s=59;try{var js,ls,Os=S(840)?0:1,zs=S(798)?0:1,sS=S(200)?1:0,SS=S(659)?0:1,_S=S(223)?1:0,LS=S(478)?1:0;for(var OS=(S(787),0);OS<ls;++OS)Os+=(S(125),2),zs+=(S(260),2),sS+=S(567)?2:1,SS+=(S(515),2),_S+=(S(835),2),LS+=(S(127),3);js=Os+zs+sS+SS+_S+LS;window.lJ===js&&(window.lJ=++js)}catch(S_){window.lJ=js}var __=!0;function I(s,_){s+=_;return s.toString(36)} function I_(s){var _=53;!s||document[l(_,171,158,168,158,151,158,161,158,169,174,136,169,150,169,154)]&&document[L(_,171,158,168,158,151,158,161,158,169,174,136,169,150,169,154)]!==I(68616527613,_)||(__=!1);return __}function l(s){var _=arguments.length,J=[];for(var z=1;z<_;++z)J.push(arguments[z]-s);return String.fromCharCode.apply(String,J)}function j_(){}I_(window[j_[L(_s,169,156,168,160)]]===j_);I_(typeof ie9rgb4!==l(_s,161,176,169,158,175,164,170,169)); I_(RegExp("\x3c")[I(1372146,_s)](function(){return"\x3c"})&!RegExp(l(_s,179,110,159))[I(1372146,_s)](function(){return"'x3'+'d';"})); var l_=window[L(_s,156,175,175,156,158,163,128,177,160,169,175)]||RegExp(l(_s,168,170,157,164,183,156,169,159,173,170,164,159),I(-41,_s))[L(_s,175,160,174,175)](window["\x6e\x61vi\x67a\x74\x6f\x72"]["\x75\x73e\x72A\x67\x65\x6et"]),O_=+new Date+(S(33)?6E5:615140),Z_,Si,ii,Ii=window[l(_s,174,160,175,143,164,168,160,170,176,175)],Ji=l_?S(99)?3E4:21582:S(85)?6E3:5497; document[L(_s,156,159,159,128,177,160,169,175,135,164,174,175,160,169,160,173)]&&document[L(_s,156,159,159,128,177,160,169,175,135,164,174,175,160,169,160,173)](l(_s,177,164,174,164,157,164,167,164,175,180,158,163,156,169,162,160),function(s){var _=48;document[l(_,166,153,163,153,146,153,156,153,164,169,131,164,145,164,149)]&&(document[l(_,166,153,163,153,146,153,156,153,164,169,131,164,145,164,149)]===I(1058781935,_)&&s[L(_,153,163,132,162,165,163,164,149,148)]?ii=!0:document[L(_,166,153,163,153, 146,153,156,153,164,169,131,164,145,164,149)]===I(68616527618,_)&&(Z_=+new Date,ii=!1,Li()))});function L(s){var _=arguments.length,J=[],z=1;while(z<_)J[z-1]=arguments[z++]-s;return String.fromCharCode.apply(String,J)}function Li(){if(!document[l(39,152,156,140,153,160,122,140,147,140,138,155,150,153)])return!0;var s=+new 
Date;if(s>O_&&(S(386)?6E5:758599)>s-Z_)return I_(!1);var _=I_(Si&&!ii&&Z_+Ji<s);Z_=s;Si||(Si=!0,Ii(function(){Si=!1},S(477)?1:0));return _}Li(); var oi=[S(626)?17972802:17795081,S(388)?27611931586:2147483647,S(830)?1862183071:1558153217];function Zi(s){var _=11;s=typeof s===l(_,126,127,125,116,121,114)?s:s[L(_,127,122,94,127,125,116,121,114)](S(475)?36:48);var J=window[s];if(!J[L(_,127,122,94,127,125,116,121,114)])return;var z=""+J;window[s]=function(s,_){Si=!1;return J(s,_)};window[s][l(_,127,122,94,127,125,116,121,114)]=function(){return z}}for(var sI=(S(965),0);sI<oi[L(_s,167,160,169,162,175,163)];++sI)Zi(oi[sI]); I_(!1!==window[L(_s,180,161,168,156)]);window.Jl={oL:"089e4a9f79017800e36ff59ba1e5d6d5e1f93b16b5b458d18a09540515a45f4c2fa1cb5ea167a407bc42c2be8a0eeaf8c16869b5dd03a199749963ce5b01e899032b244489e7c78f8618c6a53a224b50de13cacbe6346167e00de073de7b15625d0451b8a5cd04cb0895c8cb503536a54c9e0c5e860626b71fc398289ea1aada"};function iI(s){var _=+new Date,J;!document[l(48,161,165,149,162,169,131,149,156,149,147,164,159,162,113,156,156)]||_>O_&&(S(347)?6E5:514364)>_-Z_?J=I_(!1):(J=I_(Si&&!ii&&Z_+Ji<_),Z_=_,Si||(Si=!0,Ii(function(){Si=!1},S(468)?1:0)));return!(arguments[s]^J)}function S(s){return 568>s} (function(){var s=/(\A([0-9a-f]{1,4}:){1,6}(:[0-9a-f]{1,4}){1,1}\Z)|(\A(([0-9a-f]{1,4}:){1,7}|:):\Z)|(\A:(:[0-9a-f]{1,4}){1,7}\Z)/ig,_=document.getElementsByTagName("head")[0],J=[];_&&(_=_.innerHTML.slice(0,1E3));while(_=s.exec(""))J.push(_)})();})();}catch(x){ }finally{ie9rgb4=void(0);};function ie9rgb4(a,b){return a>>b>>0}; })(); </script> <script type="text/javascript" src="/TSPD/08ae96f08bab2000d96246327d838c6fa30bb9c4f41390f6fbd80de23adbed5ac22558a0c0007168?type=7"></script> <noscript>Please enable JavaScript to view the page content.<br/>Your support ID is: 183979068942220394.</noscript> </head><body> </body></html> That's obviously not what I want. I wanna get the contents of the page that renders when I submit the form manually on the browser. 
After some browser inspection, when I send the form manually the following request headers are being posted to the server: POST /ptax_pub_app/RealSearch.do HTTP/1.1 Host: www.acgov.org User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:66.0) Gecko/20100101 Firefox/66.0 Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 Accept-Language: en-US,en;q=0.5 Accept-Encoding: gzip, deflate, br Referer: https://www.acgov.org/ptax_pub_app/RealSearch.do Content-Type: multipart/form-data; boundary=---------------------------5784378851470632262085445332 Content-Length: 304 Connection: keep-alive Cookie: TS744f9810_75=TS744f9810_rc=1&TS744f9810_id=2&TS744f9810_cr=08ae96f08bab280047871c302267d274621ba715eb672bba8c4e6326721d39c4e9275ba2573dd8ecb04e5fd2ed8b14de:08e8846af6032000365890ddfe7c40338b1c71881c3aa160e9b7511f898e727042a17ecd4e549128&TS744f9810_ef=&TS744f9810_pg=0&TS744f9810_ct=application/x-www-form-urlencoded&TS744f9810_bg=08ae96f08bab20007ed7e7334af2c3a0ddc2a737a8f76402a06229c2abec9c180de6732a86a9648608ba63d37c0a28007e212e36225cb10a4cd776ce268b7178b1d33e9bc0271ac4819eb499a739f93571208168c1d71d9c&TS744f9810_rf=https%3a%2f%2fwww.acgov.org%2fptax_pub_app%2fRealSearchInit.do%3fshowSearchParmsFromLookup%3dtrue; _ga=GA1.2.1302812812.1549499581; TSPD_101=08ae96f08bab280047871c302267d274621ba715eb672bba8c4e6326721d39c4e9275ba2573dd8ecb04e5fd2ed8b14de:; JSESSIONID=0000Im6xKN_53mKz4Iw5KNO5gR0:16hgu6tbb; TS01ed31ee=0129191c7e5fb1688bfcca5087fec2a194712c77706b9ba0027f29d8162a79cfc6c4aefe2136c8ca6d34cd2a1622154e5765f831e0e88ce369724f44b0e9f3ebe5c827a6011131434eedec5e04b97f4977a6091f7d; TS01ed31ee_77=08ae96f08bab2800dd88029ca6fb0fa267ec2a5e40e37cef6351b9876c3e34f6bb42cae44bc0afadbb819ab098f6e9b408de561ace82400034a3a6b4be45a224cb4595200fc21d5c6f05b9f72090ad9bf8cf1db9cef92af4944728ce98cc9906ca77cf3a81dbe502fadd7ae968c030f5b7e5f37a743d021e; ASP.NET_SessionId=db12w03jxf5pelnstiyf35jh; _gid=GA1.2.879815811.1551480793 Upgrade-Insecure-Requests: 1 Pragma: no-cache Cache-Control: no-cache I doubt my code is sending all of those headers. I'm not even sure what some of them mean or how I could replicate that in my script. Any ideas?
You are simply missing a single element the site is looking for when you post a request. When you actually use the intended form page, the form includes a submit button:

    <input type="submit" name="searchBills" tabindex="9" value="Search" class="btcommon">

You need to include that button in your POST data, because it is the presence of that field that the site uses to detect that you made an actual search:

    payload = {
        'displayApn': '1-123-1',
        'showHistory': 'y',
        'searchBills': 'Search',
    }

With that one addition, the returned page contains the looked-for search results:

    >>> import requests
    >>> from bs4 import BeautifulSoup
    >>> URL = 'https://www.acgov.org/ptax_pub_app/RealSearch.do'
    >>> payload = {
    ...     'displayApn': '1-123-1',
    ...     'showHistory': 'y',
    ...     'searchBills': 'Search',
    ... }
    >>> response = requests.post(URL, data=payload)
    >>> soup = BeautifulSoup(response.content, 'lxml')
    >>> for row in soup.select('#pplresultcontent3 tr'):
    ...     text = row.get_text(': ', strip=True)
    ...     if text: print(text)
    ...
    Property Summary
    APN: 1-123-1
    Property Address: 424 M L KING JR WAY, OAKLAND 94607-3536
    >>> for row in soup.select('#pplresultcontent4 tr'):
    ...     text = row.get_text(' | ', strip=True)
    ...     if text: print(text)
    ...
    Tax Type | Bill Year | Tracer | Total Amount | Options
    Installment | Due Date | Installment Amount | Status/Status Date
    Secured | 2018-2019 | 01009500 | $8,773.64 | View Bill | Pay Bill
    1st Installment | 12/10/2018 | $4,386.82 | Paid Oct 31, 2018
    2nd Installment | 04/10/2019 | $4,386.82

The history (the pplresultcontent5 table) is not included until you use a capital Y for the showHistory option:

    >>> payload['showHistory'] = 'Y'
    >>> response = requests.post(URL, data=payload)
    >>> soup = BeautifulSoup(response.content, 'lxml')
    >>> for row in soup.select('#pplresultcontent5 tr'):
    ...     text = row.get_text(' | ', strip=True)
    ...     if text: print(text)
    ...
    Tax Type | Bill Year | Tracer | Total Amount | Options
    Installment | Due Date | Installment Amount | Status/Status Date
    Secured | 2017-2018 | 01009500 | $8,303.42 | View Bill
    1st Installment | 12/10/2017 | $4,151.71 | Paid Dec 8, 2017
    2nd Installment | 04/10/2018 | $4,151.71 | Paid Apr 6, 2018
    Secured | 2016-2017 | 01009500 | $7,983.02 | View Bill
    1st Installment | 12/10/2016 | $3,991.51 | Paid Dec 8, 2016
    2nd Installment | 04/10/2017 | $3,991.51 | Paid Mar 30, 2017
    Secured | 2015-2016 | 01009400 | $7,864.14 | View Bill
    1st Installment | 12/10/2015 | $3,932.07 | Paid Dec 9, 2015
    2nd Installment | 04/10/2016 | $3,932.07 | Paid Apr 8, 2016
    Secured | 2014-2015 | 01009400 | $7,691.52 | View Bill
    1st Installment | 12/10/2014 | $3,845.76 | Paid Dec 10, 2014
    2nd Installment | 04/10/2015 | $3,845.76 | Paid Apr 7, 2015
    Secured | 2013-2014 | 01009400 | $7,655.08 | View Bill
    1st Installment | 12/10/2013 | $3,827.54 | Paid Dec 4, 2013
    2nd Installment | 04/10/2014 | $3,827.54 | Paid Apr 9, 2014
    Secured | 2012-2013 | 01009400 | $6,102.96 | View Bill
    1st Installment | 12/10/2012 | $3,051.48 | Paid Dec 7, 2012
    2nd Installment | 04/10/2013 | $3,051.48 | Paid Apr 8, 2013
    Secured | 2011-2012 | 01009400 | $6,213.30 | View Bill
    1st Installment | 12/10/2011 | $3,106.65 | Paid Dec 9, 2011
    2nd Installment | 04/10/2012 | $3,106.65 | Paid Apr 10, 2012
    Secured | 2010-2011 | 01069800 | $5,660.56 | View Bill
    1st Installment | 12/10/2010 | $2,830.28 | Paid Dec 9, 2010
    2nd Installment | 04/10/2011 | $2,830.28 | Paid Apr 10, 2011
    Secured | 2009-2010 | 01070300 | $5,917.10 | View Bill
    1st Installment | 12/10/2009 | $2,958.55 | Paid Dec 10, 2009
    2nd Installment | 04/10/2010 | $2,958.55 | Paid Apr 10, 2010
    Secured | 2008-2009 | 01070300 | $5,547.66 | View Bill
    1st Installment | 12/10/2008 | $2,773.83 | Paid Dec 10, 2008
    2nd Installment | 04/10/2009 | $2,773.83 | Paid Apr 10, 2009
    Secured | 2007-2008 | 01069100 | $5,423.06 | View Bill
    1st Installment | 12/10/2007 | $2,711.53 | Paid Dec 10, 2007
    2nd Installment | 04/10/2008 | $2,711.53 | Paid Apr 10, 2008
    Secured | 2006-2007 | 01069000 | $5,387.94 | View Bill
    1st Installment | 12/10/2006 | $2,693.97 | Paid Dec 10, 2006
    2nd Installment | 04/10/2007 | $2,693.97 | Paid Apr 10, 2007
    Secured | 2005-2006 | 01069100 | $5,243.04 | View Bill
    1st Installment | 12/10/2005 | $2,621.52 | Paid Dec 9, 2005
    2nd Installment | 04/10/2006 | $2,621.52 | Paid Apr 10, 2006
    Secured | 2004-2005 | 01068900 | $4,855.00 | View Bill
    1st Installment | $2,427.50 | Paid Dec 10, 2004
    2nd Installment | $2,427.50 | Paid Apr 10, 2005
Having problems downloading a CSV file
So I was trying to make a function that downloads a CSV file using the CSV download link and then prints it, splitting it into lines, but I'm having problems when I have to save it:

    def download_data(csv_url):
        response = request.urlopen(csv_url)
        csv = response.read()
        csv_str = str(csv)
        lines = csv_str.split("\\n")
        dest_url = r'data.csv'
        fx = open(dest_url, 'r')
        for line in lines:
            fx.write(line + '/n')
        fx.close()

When I give it the CSV link, it tells me it can't find the file/directory "data.csv", even though I should have downloaded it. I am running macOS.
You're reading the file. Change the 'r' in fx = open(dest_url, 'r') to 'w':

    fx = open(dest_url, 'w')

As a side note, you really should be using a with statement. with will make the file object close the connection once the code leaves the with block's scope, so you don't have to worry about closing the connection:

    def download_data(csv_url):
        response = request.urlopen(csv_url)
        with open('data.csv', 'w') as f:
            f.write(str(response.read()))

Though really there isn't any need to save the file at all if you're just going to read it and display the contents on the screen; just have download_data return csv_str. Finally, take a look at the built-in csv module. It makes life easy.

    import csv
    from io import StringIO

    import requests


    def download_data(csv_url):
        return csv.reader(
            StringIO(
                requests.get(csv_url)
                .text
            ),
            delimiter=','
        )


    for row in download_data('https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv'):
        print("| {} |".format(str(' | '.join(row))))

    # Prints:
    #
    # | John | Doe | 120 jefferson st. | Riverside | NJ | 08075 |
    # | Jack | McGinnis | 220 hobo Av. | Phila | PA | 09119 |
    # | John "Da Man" | Repici | 120 Jefferson St. | Riverside | NJ | 08075 |
    # | Stephen | Tyler | 7452 Terrace "At the Plaza" road | SomeTown | SD | 91234 |
    # | | Blankman | | SomeTown | SD | 00298 |
    # | Joan "the bone", Anne | Jet | 9th, at Terrace plc | Desert City | CO | 00123 |