CSV being read with no rows - python
I'm trying to read a CSV from a web page into pandas, but I get an empty DataFrame with 0 rows and 155 columns. I feel like I'm missing a step; in general I'm struggling to use data from the web as opposed to data on my machine.
url ="https://data.world/exercises/data-wrangling-exercise-1/workspace/file?filename=Crime_2015.csv"
crimex= pd.read_csv(url)
print(crimex)
The output is as follows:
Empty DataFrame
Columns: [ data.world Loading...?Feedback{"dataset":{"hasError":false, loadedDatasets:{}, usersDatasets:{}, ... <remainder of the data.world page's HTML/JavaScript state omitted> ...]
Index: []
[0 rows x 155 columns]
You need to do what it says on the data.world site: click the Download link (at the top right), choose Share URL, and use the secure download link it gives you:

    https://query.data.world/s/xxxxxxxlink_codexxxxxxxx

To import:

    import pandas as pd
    df = pd.read_csv('https://query.data.world/s/xxxxxxxlink_codexxxxxxxx')

Note: the links here have been edited so they do not work; substitute your own link.
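If read_csv still comes back with page markup instead of data, a quick sanity check (a minimal sketch, reusing the placeholder link above) is to fetch the URL with requests and inspect the response before parsing it:

    import requests
    import pandas as pd
    from io import StringIO

    # Placeholder link from above -- substitute your own share URL
    url = 'https://query.data.world/s/xxxxxxxlink_codexxxxxxxx'

    r = requests.get(url)
    print(r.headers.get('Content-Type'))  # expect something like text/csv, not text/html
    print(r.text[:200])                   # should look like CSV rows, not HTML/JavaScript

    crimex = pd.read_csv(StringIO(r.text))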
Related
How to split CSV data
I have a problem where I've got CSV data like this:

    AgeGroup   Where do you hear our company from?   How long have you using our platform?
    18-24      Word of mouth; Supermarket Product    0-1 years
    36-50      Social Media; Word of mouth           1-2 years
    18-24      Advertisement                         +4 years

and I'm trying to reshape the file into this format, through either Jupyter Notebook or Excel:

    AgeGroup   Where do you hear our company from?   How long have you using our platform?
    18-24      Word of mouth                         0-1 years
    18-24      Supermarket Product                   0-1 years
    36-50      Social Media                          1-2 years
    36-50      Word of mouth                         1-2 years
    18-24      Advertisement                         +4 years

Let's say the CSV file is Untitled form.csv and I import the data into Jupyter Notebook:

    data = pd.read_csv('Untitled form.csv')

Can anyone tell me how I should do it? I have tried Excel's text-to-columns feature, but of course that only separates the data into columns, while what I want is for the values to be separated into rows while still keeping the data from the other columns.
Anyway... I found another way to do it, which is more roundabout. First I edit the file through Power Query in Excel and save it to a different file... and then, if a utf-8 encoding error appears, I just add encoding='cp1252'. So it becomes:

    import pandas as pd

    data_split = pd.read_csv('Untitled form split.csv',
                             skipinitialspace=True,
                             usecols=range(1, 7),
                             encoding='cp1252')

However, if there's a more efficient way, please let me know. Thanks
I'm not 100% sure about your question, since I think it might be two separate issues, but hopefully this should fix it:

    import pandas as pd

    data = pd.read_fwf('Untitled form.csv')
    cols = data.columns
    data_long = pd.DataFrame(columns=data.columns)
    for idx, row in data.iterrows():
        hear_from = row['Where do you hear our company from?'].split(';')
        hear_from_fmt = list(map(lambda x: x.strip(), hear_from))
        n_items = len(hear_from_fmt)
        d = {
            cols[0]: [row[0]] * n_items,
            cols[1]: hear_from_fmt,
            cols[2]: [row[2]] * n_items,
        }
        data_long = pd.concat([data_long, pd.DataFrame(d)], ignore_index=True)

Let's break it down. This line:

    data = pd.read_fwf('Untitled form.csv')

reads the file inferring the spacing between columns. This is only useful because I am not sure your file is a proper CSV; if it is, you can open it normally, and if not, this might help.

Now for the rest. We iterate through each row and select the ways someone could have heard of your company. These are split on ; and then stripped to ensure there are no stray spaces. A new temporary dataframe is created where the first and last columns repeat their values, with as many rows as there are elements in the hear_from_fmt list. The dataframes are then concatenated together.

There might be a more efficient solution, but this should work.
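As a possible alternative, assuming pandas 0.25 or newer, Series.str.split plus DataFrame.explode does the same reshaping without an explicit loop; a sketch using the same file and column name as above:

    import pandas as pd

    data = pd.read_csv('Untitled form.csv')
    col = 'Where do you hear our company from?'

    # Turn "Word of mouth; Supermarket Product" into a list, then emit one
    # row per list element; the values in the other columns are repeated
    data[col] = data[col].str.split(';')
    data_long = data.explode(col).reset_index(drop=True)
    data_long[col] = data_long[col].str.strip()  # drop the stray spaces around ';'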
Can I loop the same analysis across multiple CSV dataframes, then concatenate the results from each into one table?
Newbie Python learner here! I have 20 participant CSV files (P01.csv to P20.csv) containing dataframes of Stroop test data. The important columns in each are the condition column, which has a random mix of incongruent and congruent conditions, the reaction time column for each condition, and the column for whether the response was correct, true or false. Here is an example of the dataframe for P01 (I'm not sure if this counts as a code snippet?):

    trialnum,colourtext,colourname,condition,response,rt,correct
    1,blue,red,incongruent,red,0.767041,True
    2,yellow,yellow,congruent,yellow,0.647259,True
    3,green,blue,incongruent,blue,0.990185,True
    4,green,green,congruent,green,0.720116,True
    5,yellow,yellow,congruent,yellow,0.562909,True
    6,yellow,yellow,congruent,yellow,0.538918,True
    7,green,yellow,incongruent,yellow,0.693017,True
    8,yellow,red,incongruent,red,0.679368,True
    9,yellow,blue,incongruent,blue,0.951432,True
    10,blue,blue,congruent,blue,0.633367,True
    11,blue,green,incongruent,green,1.289047,True
    12,green,green,congruent,green,0.668142,True
    13,blue,red,incongruent,red,0.647722,True
    14,red,blue,incongruent,blue,0.858307,True
    15,red,red,congruent,red,1.820112,True
    16,blue,green,incongruent,green,1.118404,True
    17,red,red,congruent,red,0.798532,True
    18,red,red,congruent,red,0.470939,True
    19,red,blue,incongruent,blue,1.142712,True
    20,red,red,congruent,red,0.656328,True
    21,red,yellow,incongruent,yellow,0.978830,True
    22,green,red,incongruent,red,1.316182,True
    23,yellow,yellow,congruent,green,0.964292,False
    24,green,green,congruent,green,0.683949,True
    25,yellow,green,incongruent,green,0.583939,True
    26,green,blue,incongruent,blue,1.474140,True
    27,green,blue,incongruent,blue,0.569109,True
    28,green,green,congruent,blue,1.196470,False
    29,red,red,congruent,red,4.027546,True
    30,blue,blue,congruent,blue,0.833177,True
    31,red,red,congruent,red,1.019672,True
    32,green,blue,incongruent,blue,0.879507,True
    33,red,red,congruent,red,0.579254,True
    34,red,blue,incongruent,blue,1.070518,True
    35,blue,yellow,incongruent,yellow,0.723852,True
    36,yellow,green,incongruent,green,0.978838,True
    37,blue,blue,congruent,blue,1.038232,True
    38,yellow,green,incongruent,yellow,1.366425,False
    39,green,red,incongruent,red,1.066038,True
    40,blue,red,incongruent,red,0.693698,True
    41,red,blue,incongruent,blue,1.751062,True
    42,blue,blue,congruent,blue,0.449651,True
    43,green,red,incongruent,red,1.082267,True
    44,blue,blue,congruent,blue,0.551023,True
    45,red,blue,incongruent,blue,1.012258,True
    46,yellow,green,incongruent,yellow,0.801443,False
    47,blue,blue,congruent,blue,0.664119,True
    48,red,green,incongruent,yellow,0.716189,False
    49,green,green,congruent,yellow,0.630552,False
    50,green,yellow,incongruent,yellow,0.721917,True
    51,red,red,congruent,red,1.153943,True
    52,blue,red,incongruent,red,0.571019,True
    53,yellow,yellow,congruent,yellow,0.651611,True
    54,blue,blue,congruent,blue,1.321344,True
    55,green,green,congruent,green,1.159240,True
    56,blue,blue,congruent,blue,0.861646,True
    57,yellow,red,incongruent,red,0.793069,True
    58,yellow,yellow,congruent,yellow,0.673190,True
    59,yellow,red,incongruent,red,1.049320,True
    60,red,yellow,incongruent,yellow,0.773447,True
    61,red,yellow,incongruent,yellow,0.693554,True
    62,red,red,congruent,red,0.933901,True
    63,blue,blue,congruent,blue,0.726794,True
    64,green,green,congruent,green,1.046116,True
    65,blue,blue,congruent,blue,0.713565,True
    66,blue,blue,congruent,blue,0.494177,True
    67,green,green,congruent,green,0.626399,True
    68,blue,blue,congruent,blue,0.711896,True
    69,blue,blue,congruent,blue,0.460420,True
    70,green,green,congruent,yellow,1.711978,False
    71,blue,blue,congruent,blue,0.634218,True
    72,yellow,blue,incongruent,yellow,0.632482,False
    73,yellow,yellow,congruent,yellow,0.653813,True
    74,green,green,congruent,green,0.808987,True
    75,blue,blue,congruent,blue,0.647117,True
    76,green,red,incongruent,red,1.791693,True
    77,red,yellow,incongruent,yellow,1.482570,True
    78,red,red,congruent,red,0.693132,True
    79,red,yellow,incongruent,yellow,0.815830,True
    80,green,green,congruent,green,0.614441,True
    81,yellow,red,incongruent,red,1.080385,True
    82,red,green,incongruent,green,1.198548,True
    83,blue,green,incongruent,green,0.845769,True
    84,yellow,blue,incongruent,blue,1.007089,True
    85,green,blue,incongruent,blue,0.488701,True
    86,green,green,congruent,yellow,1.858272,False
    87,yellow,yellow,congruent,yellow,0.893149,True
    88,yellow,yellow,congruent,yellow,0.569597,True
    89,yellow,yellow,congruent,yellow,0.483542,True
    90,yellow,red,incongruent,red,1.669842,True
    91,blue,green,incongruent,green,1.158416,True
    92,blue,red,incongruent,red,1.853055,True
    93,green,yellow,incongruent,yellow,1.023785,True
    94,yellow,blue,incongruent,blue,0.955395,True
    95,yellow,yellow,congruent,yellow,1.303260,True
    96,blue,yellow,incongruent,yellow,0.737741,True
    97,yellow,green,incongruent,green,0.730972,True
    98,green,red,incongruent,red,1.564596,True
    99,yellow,yellow,congruent,yellow,0.978911,True
    100,blue,yellow,incongruent,yellow,0.508151,True
    101,red,green,incongruent,green,1.821969,True
    102,red,red,congruent,red,0.818726,True
    103,yellow,yellow,congruent,yellow,1.268222,True
    104,yellow,yellow,congruent,yellow,0.585495,True
    105,green,green,congruent,green,0.673404,True
    106,blue,yellow,incongruent,yellow,1.407036,True
    107,red,red,congruent,red,0.701050,True
    108,red,green,incongruent,red,0.402334,False
    109,red,green,incongruent,green,1.537681,True
    110,green,yellow,incongruent,yellow,0.675118,True
    111,green,green,congruent,green,1.004550,True
    112,yellow,blue,incongruent,blue,0.627439,True
    113,yellow,yellow,congruent,yellow,1.150248,True
    114,blue,yellow,incongruent,yellow,0.774452,True
    115,red,red,congruent,red,0.860966,True
    116,red,red,congruent,red,0.499595,True
    117,green,green,congruent,green,1.059725,True
    118,red,red,congruent,red,0.593180,True
    119,green,yellow,incongruent,yellow,0.855915,True
    120,blue,green,incongruent,green,1.335018,True

But I am only interested in the 'condition', 'rt', and 'correct' columns. I need to create a table with the mean reaction time for the congruent conditions and for the incongruent conditions, and the percentage correct for each condition. And I want an overall table of these results for each participant. I am aiming for something like this as an output table:

    Participant  Stimulus Type  Mean Reaction Time  Percentage Correct
    01           Congruent      0.560966            80
    01           Incongruent    0.890556            64
    02           Congruent      0.460576            89
    02           Incongruent    0.956556            55

Etc. for all 20 participants. This is just an example of my ideal output, because later I'd like to plot a graph of the means from each condition across the participants. But if anyone thinks that table does not make sense or is inefficient, I'm open to any advice!

I want to use pandas, but I don't know where to begin finding the rt means for each condition when there are two different conditions in the same column in each dataframe. And I'm assuming I need to do it in some kind of loop that can run over each participant CSV file and then concatenate the results into a table for all the participants.
Initially, after struggling to figure out the loop I would need and looking on the web, I ran this code, which worked to concatenate all of the participants' dataframes. I hoped this would let me run the same analysis on all of them at once, but the problem is that it doesn't identify the individual participant for each of the rows (there are 120 rows per participant, like the example above) in the combined table:

    import os
    import glob
    import pandas as pd

    # set working directory
    os.chdir('data')

    # find all csv files in the folder:
    # use glob pattern matching -> extension = 'csv'
    # save the result in a list -> all_filenames
    extension = 'csv'
    all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
    # print(all_filenames)

    # combine all files in the list
    combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

    # export to csv
    combined_csv.to_csv("combined_csv.csv", index=False, encoding='utf-8-sig')

Perhaps I could add a participant column to identify each participant's data set in the concatenated table, and then compute the means and percentage correct for the two conditions per participant in that big table? Or would it be better to do the analysis first and then loop it over all of the individual participant CSV files? I'm sorry if this is a really obvious process; I'm new to Python and trying to learn to analyse my data more efficiently. I've been scouring the Internet and pandas tutorials, but I'm stuck. Any help is welcome! I've also never used Stack Overflow before, so sorry if I haven't formatted things correctly here, but thanks for the feedback about including examples of the input data, code I've tried, and desired output data; I really appreciate the help.
Try this:

    import pandas as pd
    from pathlib import Path

    # Use the Path class to represent a path. It offers more
    # functionality when performing operations on paths.
    path = Path("./data").resolve()

    # Create a dictionary whose keys are the participant IDs
    # (the `01` in `P01.csv`, etc.) and whose values are the
    # data frames initialized from the CSV files
    data = {
        p.stem[1:]: pd.read_csv(p) for p in path.glob("*.csv")
    }

    # Create a master data frame by combining the individual
    # data frames from each CSV file
    df = pd.concat(data, keys=data.keys(), names=["participant", None])

    # Calculate the statistics
    result = (
        df.groupby(["participant", "condition"]).agg(**{
            "Mean Reaction Time": ("rt", "mean"),
            "correct": ("correct", "sum"),
            "size": ("trialnum", "size"),
        }).assign(**{
            # multiply by 100 so the fraction becomes a percentage,
            # matching the desired output
            "Percentage Correct": lambda x: x["correct"] / x["size"] * 100
        }).drop(columns=["correct", "size"])
        .reset_index()
    )
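A side note on the percentage: since the correct column is boolean, its mean is already the fraction of correct responses, so the sum/size/drop steps can be collapsed. An equivalent sketch (not from the original answer), reusing the df built above:

    result_alt = (
        df.groupby(["participant", "condition"])
          .agg(**{
              "Mean Reaction Time": ("rt", "mean"),
              "Percentage Correct": ("correct", "mean"),  # True counts as 1, False as 0
          })
          .reset_index()
    )
    result_alt["Percentage Correct"] *= 100  # fraction -> percentage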
Python JSON to a dataframe
I am using a Yahoo Finance Python library to grab accounting/financial data to do some basic analysis. All of the financial statement data comes in JSON format. I want the data in tabular form, as I typically see in a pandas dataframe, but there are several wrappers around the data and I'm not sure how to remove them so that I can get my data into a simple rows-and-columns dataframe. Here is what the JSON looks like:

    {
        "incomeStatementHistory": {
            "F": [
                {
                    "2019-12-31": {
                        "researchDevelopment": "None",
                        "effectOfAccountingCharges": "None",
                        "incomeBeforeTax": -640000000,
                        "minorityInterest": 45000000,
                        "netIncome": 47000000,
                        "sellingGeneralAdministrative": 10218000000,
                        "grossProfit": 12876000000,
                        "ebit": 2658000000,
                        "operatingIncome": 2658000000,
                        "otherOperatingExpenses": "None",
                        "interestExpense": -1049000000,
                        "extraordinaryItems": "None",
You don't have the full response here, so it's difficult to tell if this will be what you want:

    import pandas as pd

    d = {
        "incomeStatementHistory": {
            "F": [
                {
                    "2019-12-31": {
                        "researchDevelopment": "None",
                        "effectOfAccountingCharges": "None",
                        "incomeBeforeTax": -640000000,
                        "minorityInterest": 45000000,
                        "netIncome": 47000000,
                        "sellingGeneralAdministrative": 10218000000,
                        "grossProfit": 12876000000,
                        "ebit": 2658000000,
                        "operatingIncome": 2658000000,
                        "otherOperatingExpenses": "None",
                        "interestExpense": -1049000000,
                        "extraordinaryItems": "None",
                    }
                }
            ]
        }
    }

    pd.json_normalize(d['incomeStatementHistory']['F'])

Output:

      2019-12-31.researchDevelopment 2019-12-31.effectOfAccountingCharges  2019-12-31.incomeBeforeTax  ...  2019-12-31.otherOperatingExpenses  2019-12-31.interestExpense 2019-12-31.extraordinaryItems
    0                           None                                 None                  -640000000  ...                               None                 -1049000000                          None

    [1 rows x 12 columns]
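If you then want one row per line item rather than one very wide row, a follow-up sketch (reusing d from above; pd.NA assumes pandas 1.0+) splits the dotted column labels apart and transposes:

    import pandas as pd

    df = pd.json_normalize(d['incomeStatementHistory']['F'])
    df = df.replace("None", pd.NA)  # the feed spells missing values as the string "None"

    # Split "2019-12-31.netIncome"-style labels into (period, line item) pairs
    df.columns = pd.MultiIndex.from_tuples(
        tuple(c.split('.', 1)) for c in df.columns
    )

    # Transpose so each line item becomes a row instead of a very wide column set
    tidy = df.T
    print(tidy)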
You should use pandas. Here is a tutorial on how to do that with pandas, and you could also check this related question.
How to web scrape a nested table from HTML with Python Selenium
I want to scrape a nested web table with Python Selenium. The table has 4 columns x 10 rows; the 4th column of each row has an inner cell containing 6 spans, each storing an image. My problem is that I can only scrape the first 3 columns, and I cannot get the 4th column's 6 image src values in the correct row order.

    row = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div')
    column = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div[1]/div')
    column_4th = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div/div[4]')
    innercell_column_4th = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div/div[4]/span[1]/img')

    span_1 = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div/div/span[1]/img')
    span_2 = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div/div/span[2]/img')

    for new_span_1 in span_1:
        span_1_img = new_span_1.get_attribute('src')
    for new_span_2 in span_2:
        span_2_img = new_span_2.get_attribute('src')

    for new_row in row:
        print(new_row.text, span_1_img, span_2_img)
I would recommend using Selenium together with BeautifulSoup. When you construct the BeautifulSoup object, give it the page source from Selenium (driver.page_source) instead of fetching the page with the requests module, which can't execute JavaScript.
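A minimal sketch of that hand-off, reusing the resultMainTable id from the question; the URL is a placeholder and the exact div nesting may differ on the real page:

    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get('https://example.com/results')  # placeholder URL

    # Parse the JavaScript-rendered DOM that Selenium sees,
    # not a raw requests response
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    table = soup.find(id='resultMainTable')
    outer = table.find('div')                            # //*[@id="resultMainTable"]/div
    for row in outer.find_all('div', recursive=False):   # .../div/div -> one per table row
        # Collect every img src inside this row, so the 4th-column
        # images stay grouped with the row they belong to
        srcs = [img.get('src') for img in row.find_all('img')]
        print(row.get_text(' ', strip=True), srcs)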
Loading a generic Google Spreadsheet in Pandas
When I try to load a Google Spreadsheet in pandas:

    from StringIO import StringIO
    import requests

    r = requests.get('https://docs.google.com/spreadsheet/ccc?key=<some_long_code>&output=csv')
    data = r.content
    df = pd.read_csv(StringIO(data), index_col=0)

I get the following:

    CParserError: Error tokenizing data. C error: Expected 1316 fields in line 73, saw 1386

Why? I would think that one could identify the spreadsheet's set of rows and columns with data and use the spreadsheet's rows and columns as the dataframe index and columns respectively (with NaN for anything empty). Why does it fail?
This question of mine shows how: Getting Google Spreadsheet CSV into A Pandas Dataframe

As one of the commentators noted, you have not asked for the data in CSV format; you have the "edit" request at the end of the URL. You can use this code and see it work on the spreadsheet (which, by the way, needs to be public). It is possible to do private sheets as well, but that is another topic.

    from StringIO import StringIO  # got moved to io in Python 3, if you're using that
    import requests

    r = requests.get('https://docs.google.com/spreadsheet/ccc?key=0Ak1ecr7i0wotdGJmTURJRnZLYlV3M2daNTRubTdwTXc&output=csv')
    data = r.content

    In [10]: df = pd.read_csv(StringIO(data), index_col=0, parse_dates=['Quradate'])

    In [11]: df.head()
    Out[11]:
              City                                            region     Res_Comm  \
    0       Dothan  South_Central-Montgomery-Auburn-Wiregrass-Dothan  Residential
    10       Foley                              South_Mobile-Baldwin  Residential
    12  Birmingham      North_Central-Birmingham-Tuscaloosa-Anniston   Commercial
    38       Brent      North_Central-Birmingham-Tuscaloosa-Anniston  Residential
    44      Athens                 North_Huntsville-Decatur-Florence  Residential

              mkt_type             Quradate  National_exp  Alabama_exp  Sales_exp
    0            Rural  2010-01-15 00:00:00             2            2          3
    10  Suburban_Urban  2010-01-15 00:00:00             4            4          4
    12  Suburban_Urban  2010-01-15 00:00:00             2            2          3
    38           Rural  2010-01-15 00:00:00             3            3          3
    44  Suburban_Urban  2010-01-15 00:00:00             4            5          4

The new Google Spreadsheets URL format for getting the CSV output is:

    https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&id

Well, they changed the URL format slightly again; now you need:

    https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&gid=0  # for the 1st sheet

I also found I needed the following slight revision of the above to deal with Python 3:

    from io import StringIO

and to get the file:

    guid = 0  # for the 1st sheet
    act = requests.get('https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&gid=%s' % guid)
    dataact = act.content.decode('utf-8')  # convert to a string for StringIO
    actdf = pd.read_csv(StringIO(dataact), index_col=0, parse_dates=[0], thousands=',').sort()

actdf is now a full pandas dataframe with headers (column names).
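For what it's worth, recent pandas versions can usually read such an export URL directly, skipping requests and StringIO entirely; a sketch against the same public sheet as above:

    import pandas as pd

    # Same export URL as in the answer above, passed straight to read_csv
    url = ('https://docs.google.com/spreadsheets/d/'
           '177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&gid=0')
    df = pd.read_csv(url, index_col=0, parse_dates=['Quradate'], thousands=',')
    print(df.head())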
Warning: this solution will make your data accessible by anyone.

In Google Sheets click File > Publish to the web. Then select what you need to publish and select the export format .csv. You'll get a link something like:

    https://docs.google.com/spreadsheets/d/<your sheets key here>/pub?gid=1317664180&single=true&output=csv

Then simply:

    import pandas as pd

    pathtoCsv = r'https://docs.google.com/spreadsheets/d/<sheets key>/pub?gid=1317664180&single=true&output=csv'
    dev = pd.read_csv(pathtoCsv)
    print(dev)
Did you share the sheet? Click the "Share" button in the top-right corner of your document, then under "Get link" pick "Anyone with the link". This solved the problem for me. If you don't share the sheet, Google Sheets returns an error page, which causes the pandas error. (The fact that the URL works and returns a CSV when you open or paste it in the browser is because you are logged in.)
The current Google Drive URL to export as CSV is:

    https://drive.google.com/uc?export=download&id=EnterIDHere

So:

    import pandas as pd

    pathtocsv = r'https://drive.google.com/uc?export=download&id=EnterIDHere'
    df = pd.read_csv(pathtocsv)