I have turned a list of dicts into a dataframe using this:
import pandas as pd
d = pd.read_csv('/path/to/file.csv')
res_df = pd.DataFrame(d)
res_df.head()
Pandas did not interpret the format successfully; I'm guessing because there were no quotes around the keys of the dicts. It looked like this:
[{location:'playroom',state:'NY',zip:10011},{..}]
As a workaround, I stripped out "'", "{}", and "[]" to make the file standard CSV. However, when I pass the names argument to pd.read_csv, I have two issues: 1 - the columns given by names are blank, and 2 - I end up with a dataframe that is one row with thousands of columns. res_df.transpose() did not work.
If my CSV has no header row, and assuming it has the same number of fields in each record, why can't I give pandas my column names and have it create the dataframe rows accordingly?
What is the quicker/better way to do this?
Update: here is a snippet of the CSV file:
websitedotcomcom/,Jim,jim#testdotcom,777-444-5555,Long City, NY,1,http://document-url,,another_field,,,true,12 Feb 2015 (18:17),http://website.com/,Jim,jim#test.com,777-444-5555,Long City, NY,1,http://document-url,,another_field,,,true,12 Feb 2015 (18:17)
This looks like JSON rather than CSV. You should use pandas' read_json method.
df = pd.read_json('/path/to/file.json')
Note that it is sensitive to valid JSON, so you may have to do some string manipulation first (e.g. replacing ' with ").
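For example, a minimal cleanup sketch, assuming the file is one list of dicts with unquoted keys exactly as in the snippet above (the path and the key pattern are assumptions, not tested against your real file):
import re
import json
import pandas as pd
with open('/path/to/file.csv') as f:
    raw = f.read()
# quote the bare keys: {location: -> {"location":
raw = re.sub(r'([{,])\s*(\w+)\s*:', r'\1"\2":', raw)
# switch single quotes to double quotes so the text is valid JSON
raw = raw.replace("'", '"')
df = pd.DataFrame(json.loads(raw))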
I have an Excel document that has the information of 3 columns squeezed into one, separated by ",". I want to separate the columns during pd.read_excel(). I tried to use usecols but it did not work. I would also like to name the columns while calling pd.read_excel().
The text inside your Excel file is comma-separated. One way to do this is to simply convert the Excel file to text before reading it, like so.
Your Excel file:
a,b,c
0 1,2,3
1 4,5,6
Convert to text & read again.
import pandas as pd
with open('file.txt', 'w') as file:
    pd.read_excel('file.xlsx').to_string(file, index=False)  # dump the sheet as plain text
df = pd.read_csv('file.txt', sep=',')  # re-read, now splitting on the commas
print(df)
Which prints:
a b c
0 1 2 3
1 4 5 6
Pandas provides a method to split a string around a passed separator/delimiter. The result can be stored as a list in a Series, or it can be used to create multiple dataframe columns from a single separated string. It works similarly to Python's built-in split() method, but while that can only be applied to an individual string, pandas' str.split() can be applied to a whole Series. The .str prefix has to be used every time before calling this method, to differentiate it from Python's default function; otherwise it will throw an error.
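A short sketch of that approach on the example above (the column name 'a,b,c' is assumed from the screenshot):
import pandas as pd
df = pd.read_excel('file.xlsx')  # one column holding '1,2,3', '4,5,6', ...
split_df = df['a,b,c'].str.split(',', expand=True)  # expand=True gives one column per part
split_df.columns = ['a', 'b', 'c']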
Not sure how your .xlsx file is formatted, but it looks like you should be using pandas.read_csv() instead.
So maybe something like pandas.read_csv(filename, sep=',', names=['Name', 'Number', 'Gender'])
I have a JSON file loaded into a dataframe with 12 columns; however, I only want to read columns 2 and 5, which are named "name" and "score".
Currently, the code I have is:
df = pd.read_json("path",orient='columns', lines=True)
print(df.head())
What that does is display every column, as would be expected.
After reading through the documentation here:
https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
I can't find any real way to parse only certain columns from JSON, compared to CSV, where you can select columns using usecols=[]
Pass a list of columns to index the dataframe after reading:
df[["name","score"]]
I want to read a file 'tos_year.csv' into a Pandas dataframe, such that all values are in one single column. I will later use pd.concat() to add this column to an existing dataframe.
The CSV file holds 80 entries in the form of years, i.e. "... 1966,1966,1966,1966,1967,1967,... "
What I can't figure out is how to read the values into one column with 80 rows, instead of 80 columns with one row.
This is probably quite basic but I'm new to this. Here's my code:
import pandas as pd
tos_year = pd.read_csv('tos_year.csv').T
tos_year.reset_index(inplace=True)
tos_year.columns = ['Year']
As you can see, I tried reading it in and then transposing the dataframe, but when the file is first read in, the year numbers are interpreted as column names. Since there cannot be several columns with identical names, pandas deduplicates them by appending suffixes, and I end up with a dataframe that holds str values like
...
1966
1966.1
1966.2
1966.3
1967
1967.1
...
which is not what I want. So clearly, it's preferable to read it in correctly from the start.
Thanks for any advice!
Add header=None to avoid parsing the years as column names, then transpose and rename the column, e.g. with DataFrame.set_axis:
tos_year = pd.read_csv('tos_year.csv', header=None).T.set_axis(['Year'], axis=1)
Or:
tos_year = pd.read_csv('tos_year.csv', header=None).T
tos_year.columns = ['Year']
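As a sketch of the pd.concat() step you mention (existing_df is a placeholder name for your other dataframe, not something from the question):
# align the row indexes, then attach the new column side by side
combined = pd.concat([existing_df.reset_index(drop=True), tos_year], axis=1)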
I am trying to convert a DBF of about 3233 records created from a shapefile of US counties to a dataframe; then I want to take two of the columns from that dataframe and convert to a dictionary where column1 is the key and column2 is the value. However, the resulting dictionary doesn't have the same number of records as my dataframe.
I use arcpy to call in the shapefile for all US Counties. When I use arcpy.GetCount_management(county_shapefile), this returns a feature count of 3233 records.
In order to convert to a dataframe, I converted to a dbf first with arcpy.TableToTable_conversion(); this returns a dbf with 3233 records.
After converting to a df using Dbf5 from simpledbf, I get a df with 3233 records.
I then convert the first two columns to a dictionary, which returns only 56 records. Can anyone tell me what's going on here? (I recently switched to Python 3 from Python 2; could that be part of the issue?)
Code:
county_shapefile = "U:/Shapefiles/tl_2018_us_county/tl_2018_us_county.shp"
dbf = arcpy.TableToTable_conversion(county_shapefile,"U:/","county_data.dbf")
from simpledbf import Dbf5
dbfile = Dbf5(str(dbf))
df = dbfile.to_dataframe()
df_dict = {row[0]:row[1] for row in df.values}
I have also tried doing this with the .to_dict() function, but I'm not getting the desired dictionary structure {column1:column2,column1:column2...}
from simpledbf import Dbf5
dbfile=Dbf5(str(dbf))
df=dbfile.to_dataframe()
subset=df[["STATEFP","COUNTYFP"]]
subset=subset.set_index("COUNTYFP")
dict=subset.to_dict()
In the end, I'm hoping to create a dictionary where the key is the County FIPS code (COUNTYFP) and the value is the State FIPS code (STATEFP). I do not want to have any nested dictionaries, just a simple dictionary with the format...
dict={
COUNTYFP1:STATEFP1,
COUNTYFP2:STATEFP2,
COUNTYFP3:STATEFP3,
....
}
Are you sure that column1 has no duplicates? Dictionaries in Python do not support duplicate keys: when a key repeats, each later value silently overwrites the earlier one, which is why 3233 rows collapse into 56 entries, one per unique key. If you want to preserve all the values in column1 as keys, you'll have to find a workaround.
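A minimal sketch of both points, using the column names from your question (note that COUNTYFP values repeat across states, so they cannot serve as unique keys on their own):
import pandas as pd
df = pd.DataFrame({"STATEFP": ["01", "01", "02"],
                   "COUNTYFP": ["001", "003", "001"]})  # "001" repeats across states
# flat {COUNTYFP: STATEFP} dict; later duplicates silently overwrite earlier ones
fips = dict(zip(df["COUNTYFP"], df["STATEFP"]))
print(len(df), len(fips))  # 3 2 -- one row lost to the duplicate key
# equivalent flat dict via a Series, avoiding the nested to_dict() structure
fips = df.set_index("COUNTYFP")["STATEFP"].to_dict()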
I have a very large, deeply nested JSON, from which I need only some key-value pairs, not all of them. Because it is so deeply nested, it is not convenient to create a pandas dataframe directly from the JSON, since the values I need would not all end up in columns.
I need to create pandas dataframe that should look like:
Groupe Id MotherName FatherName
Advanced 56 Laure James
Middle 11 Ann Nicolas
Advanced 6 Helen Franc
From my complex JSON I extract the values I need as follows (all the values are extracted correctly, so there is no error here):
jfile=json.loads(data)
groupe=jfile['location']['groupe']
id=jfile['id']
MotherName=jfile['Mother']['MotherName']
FatherName=jfile['Father']['FatherName']
jfile looks like:
{"location":{"town":"Rome","groupe":"Advanced",
"school":{"SchoolGroupe":"TrowMet", "SchoolName":"VeronM"}},
"id":"145",
"Mother":{"MotherName":"Helen","MotherAge":"46"},"NGlobalNote":2,
"Father":{"FatherName":"Peter","FatherAge":"51"},
"Teacher":["MrCrock","MrDaniel"],"Field":"Marketing","season":["summer","spring"]}
Then I create an empty dataframe and try to fill it with these values:
df=pd.DataFrame(columns=['group', 'id', 'Father', 'Mother'])
for index,row in df.iterrows():
row['groupe']=jfile['location']['groupe']
row['id']=jfile['id']
row['MotherName']=jfile['Father']['FatherName']
row['id']=jfile['Mother']['MotherName']
but when I try to print it, it says the dataframe is empty:
Empty DataFrame Columns: [groupe, id, MotherName, FatherName] Index:
[]
Why is it empty with this method, and how can I fill it properly?
The loop body never runs: df.iterrows() iterates over the rows the dataframe already has, and yours was created empty, so nothing is ever assigned. If you're just trying to add a single row to your dataframe, you can use
df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
               ignore_index=True)
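Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions the same single-row idea can be written with pd.concat (the variables come from the extraction code above):
row = {"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName}
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)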