I'm attempting to create a DataFrame with the following:
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd
import sys
# The initial set of baby names and birth counts
names =['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
#Now we will zip them together
BabyDataSet = zip(names,births)
#we have to wrap it in list() for Python 3.x
print (list(BabyDataSet))
#create the DataFrame
df = DataFrame(BabyDataSet, columns = ['Names', 'Births'] )
print (df)
When I run the program I get the following error: 'data type can't be an iterator'.
I read the following, 'What does the "yield" keyword do in Python?', but I do not understand how that applies to what I'm doing. Any help and further understanding would be greatly appreciated.
In Python 3, zip returns an iterator, not a list as it does in Python 2. Just convert it to a list as you construct the DataFrame, like this:
df = DataFrame(list(BabyDataSet), columns = ['Names', 'Births'] )
You can also create the DataFrame using an alternate syntax that avoids the zip/iterator issue entirely:
df = DataFrame({'Names': names, 'Births': births})
Read the documentation on initializing DataFrames. Pandas simply takes the dictionary and creates one column per entry, with the key as the column name and the value as the column data:
Dict can contain Series, arrays, constants, or list-like objects
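Putting both fixes together, here is a minimal runnable sketch (assuming Python 3; the names and counts are the sample values from the question):
import pandas as pd
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]
# Option 1: materialize the zip iterator as a list of tuples
df = pd.DataFrame(list(zip(names, births)), columns=['Names', 'Births'])
# Option 2: build from a dict, one key per column
df = pd.DataFrame({'Names': names, 'Births': births})
print(df)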
Related
I have the following code where I talk to a service to get data. The dictionary returned from this service looks like this:
{'karnatakaCount': 44631, 'bangaloreCount': 20870, 'mangaloreCount': 985, 'kodaguCount': 743, 'udupiCount': 556, 'kasargodCount': 673, 'mangaloreVaccinations': 354095, 'karnatakaVaccinations': 9892349, 'keralaVaccinations': 7508437, 'mangaloreDeath': 4, 'indiaDailyConfirmed': 382602}
I want to store these values in a CSV file, so I created a script using Pandas that saves them to a CSV file. While creating the file, I get multiple duplicate rows.
The code looks like this:
import pandas as pd
from getData import getData
from datetime import datetime
data = getData()
print(data)
dateToday=str(datetime.now().day)+","+str(datetime.now().strftime('%B'))+""+str(datetime.now().year)
df = pd.DataFrame(data, index=list(data.keys()))
df['Date'] = dateToday
# df.drop_duplicates(subset='Date', inplace=True)
df.to_csv('data1.csv',index=False)
To remove the duplicates, I have added the following code.
df.drop_duplicates(subset='Date', inplace=True)
How do I avoid this line and why is my data repeating in the DataFrame?
The problem is in df = pd.DataFrame(data, index=list(data.keys())).
pandas.DataFrame() can accept a list of dictionaries or a dictionary whose values are list-like objects.
A list of dictionaries:
[{'karnatakaCount': 44631, 'bangaloreCount': 20870}, {'karnatakaCount': 44631, 'bangaloreCount': 20870}]
A dictionary whose values are list-like objects:
{'karnatakaCount': [44631], 'bangaloreCount': [20870]}
However, your data is a dictionary whose values are all scalars:
{'karnatakaCount': 44631, 'bangaloreCount': 20870}
Pandas fails to determine how many rows to create. If you do df = pd.DataFrame(data) in your example, pandas will give you an error
ValueError: If using all scalar values, you must pass an index
The index determines how many rows to create: the data is repeated once for each index label. In your example you passed list(data.keys()) as the index, so the single set of values is repeated once per key. That's why you got so many duplicated rows.
You can use df = pd.DataFrame([data]) to create a one-row DataFrame.
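A minimal sketch of that fix, using a trimmed-down version of the dictionary from the question (the Date format is simplified here):
import pandas as pd
from datetime import datetime
data = {'karnatakaCount': 44631, 'bangaloreCount': 20870}  # trimmed sample from the question
df = pd.DataFrame([data])  # wrapping the dict in a list gives a single row
df['Date'] = datetime.now().strftime('%d,%B %Y')
df.to_csv('data1.csv', index=False)
print(df)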
Just try the from_dict method.
df = pd.DataFrame.from_dict(data)
And then create a new column with the keys and assign it as the index.
df.set_index('column_name')
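Note that set_index returns a new DataFrame unless you pass inplace=True, so assign the result back, e.g. (with a hypothetical column name):
df = df.set_index('column_name')
# or: df.set_index('column_name', inplace=True)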
Use the from_dict method after reformatting your data as follows:
data["Date"] = str(datetime.now().day)+","+str(datetime.now().strftime('%B'))+""+str(datetime.now().year)
data = {k:[v] for k,v in data.items()}
df = pd.DataFrame.from_dict(data)
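For illustration, a short sketch of what that reshaping does with two of the sample keys from the question:
data = {'karnatakaCount': 44631, 'bangaloreCount': 20870}  # trimmed sample
data = {k: [v] for k, v in data.items()}  # {'karnatakaCount': [44631], 'bangaloreCount': [20870]}
df = pd.DataFrame.from_dict(data)  # one row, one column per key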
I have data in the form of a set.
import numpy as n , pandas as p
s={12,34,78,100}
print(n.array(s))
print(p.DataFrame(s))
The above code converts the set without a problem into a numpy array.
But when I try to create a DataFrame from it I get the following error:
ValueError: DataFrame constructor not properly called!
So is there any way to convert a python set/nested set into a numpy array/dictionary so I can create a DataFrame from it?
Pandas can't build a DataFrame directly from a set (dicts are fine; for those you can use p.DataFrame.from_dict(s)).
What you need to do is convert your set into a list and then build the DataFrame:
import pandas as pd
s = {12,34,78,100}
s = list(s)
print(pd.DataFrame(s))
You can use list(s):
import pandas as p
s = {12,34,78,100}
df = p.DataFrame(list(s))
print(df)
Why do you want to convert it to a list first? The DataFrame() constructor accepts any iterable, and sets are iterable (although sets are unordered, so the row order is arbitrary, and some newer pandas versions reject sets outright; converting to a list as above is the safer route).
dataFrame = pandas.DataFrame(yourSet)
This will create a column header 0, which you can rename like so:
dataFrame.columns = ['columnName']
import numpy as n , pandas as p
s={12,34,78,100}
#Create DataFrame directly from set
df = p.DataFrame(s)
#You can also build a (key, value) mapping (a dictionary) and then create the DataFrame;
#this is useful because the key is used as the column header and the values as the data
df1 = p.DataFrame({'Values': data} for data in s)
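Because sets are unordered, the resulting row order is arbitrary; sorting first gives a reproducible DataFrame. A small sketch:
import pandas as pd
s = {12, 34, 78, 100}
df = pd.DataFrame(sorted(s), columns=['Values'])  # sort for a deterministic row order
print(df)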
I am working with CSV files where several of the columns have a simple json object (several key value pairs) while other columns are normal. Here is an example:
name,dob,stats
john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"
dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}"
bob roberts,3/3/1982,"{""eye_color"": ""green"", ""height"": 180, ""weight"": 94}"
After using df = pandas.read_csv('file.csv'), what's the most efficient way to parse and split the stats column into additional columns?
After about an hour, the only thing I could come up with was:
import json
stdf = df['stats'].apply(json.loads)
stlst = list(stdf)
stjson = json.dumps(stlst)
df.join(pandas.read_json(stjson))
This seems like I'm doing it wrong, and it's quite a bit of work considering I'll need to do this on three columns regularly.
Desired output is the dataframe object below. Added following lines of code to get there in my (crappy) way:
df = df.join(pandas.read_json(stjson))
del(df['stats'])
In [14]: df
Out[14]:
name dob eye_color height weight
0 john smith 1/1/1980 brown 160 76
1 dave jones 2/2/1981 blue 170 85
2 bob roberts 3/3/1982 green 180 94
I think applying json.loads is a good idea, but from there you can simply convert the result directly to DataFrame columns instead of dumping and re-reading it:
stdf = df['stats'].apply(json.loads)
pd.DataFrame(stdf.tolist()) # or stdf.apply(pd.Series)
or alternatively in one step:
df.join(df['stats'].apply(json.loads).apply(pd.Series))
There is a slightly easier way, but ultimately you'll have to call json.loads. There is a notion of a converter in pandas.read_csv:
converters : dict, optional
Dict of functions for converting values in certain columns. Keys can either be integers or column labels
So first define your custom parser. In this case the below should work:
def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1
In your case you'll have something like:
df = pandas.read_csv(f1, converters={'stats':CustomParser},header=0)
We are telling read_csv to read the data in the standard way, but to use our custom parser for the stats column. This will make each entry in the stats column a dict.
From here, we can use a little hack to directly append these columns in one step with the appropriate column names. This will only work for regular data (every json object needs to have the same 3 keys, or missing values need to be handled in our CustomParser).
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
On the Left Hand Side, we get the new column names from the keys of the first element of the stats column (each element in the stats column is a dictionary), so we are doing a bulk assign. On the Right Hand Side, we break up the 'stats' column using apply to make a data frame out of each key/value pair.
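For reference, a minimal end-to-end sketch of the converter approach on the sample rows from the question (reading from an in-memory string instead of a file):
import io
import json
import pandas as pd
csv_text = '''name,dob,stats
john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"
dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}"
'''
df = pd.read_csv(io.StringIO(csv_text), converters={'stats': json.loads})
df = df.join(pd.DataFrame(df['stats'].tolist()))  # expand the parsed dicts into columns
df = df.drop(columns='stats')
print(df)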
Option 1
If you dumped the column with json.dumps before you wrote it to csv, you can read it back in with:
import json
import pandas as pd
df = pd.read_csv('data/file.csv', converters={'json_column_name': json.loads})
Option 2
If you didn't then you might need to use this:
import json
import pandas as pd
df = pd.read_csv('data/file.csv', converters={'json_column_name': eval})
Option 3
For more complicated situations you can write a custom converter like this:
import json
import pandas as pd
def parse_column(data):
    try:
        return json.loads(data)
    except Exception as e:
        print(e)
        return None
df = pd.read_csv('data/file.csv', converters={'json_column_name': parse_column})
Paul's original answer was very nice but not correct in general, because there is no assurance that the ordering of columns is the same on the left-hand side and the right-hand side of the last line. (In fact, it does not seem to work on the test data in the question, instead erroneously switching the height and weight columns.)
We can fix this by ensuring that the list of dict keys on the LHS is sorted. This works because the apply on the RHS automatically sorts by the index, which in this case is the list of column names.
def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1
df = pandas.read_csv(f1, converters={'stats':CustomParser},header=0)
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
The json_normalize function in the pandas.io.json package helps to do this without writing a custom function.
(assuming you are loading the data from a file)
from pandas.io.json import json_normalize
import json
df = pd.read_csv(file_path, header=0)
stats_df = json_normalize(df['stats'].apply(json.loads).tolist())
stats_df.set_index(df.index, inplace=True)
df = df.join(stats_df)
df.drop(df.columns[2], axis=1, inplace=True)  # drop the original 'stats' column
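In recent pandas versions json_normalize is also available at the top level as pd.json_normalize, so the same idea can be written as follows (a sketch, assuming the file and 'stats' column from the question):
import json
import pandas as pd
df = pd.read_csv('file.csv')
stats_df = pd.json_normalize(df['stats'].apply(json.loads).tolist())
df = df.join(stats_df).drop(columns='stats')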
If you have DateTime values in your .csv file, df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series) will mess up the datetime values.
This link has some tips on how to read a csv file with JSON strings into a DataFrame.
You could do the following to read a csv file with a JSON string column and convert the JSON strings into columns.
Read your csv into the dataframe (read_df)
read_df = pd.read_csv('yourFile.csv', converters={'state':json.loads}, header=0, quotechar="'")
Convert the json string column to a new dataframe
state_df = read_df['state'].apply(pd.Series)
Merge the two DataFrames on the index.
df = pd.merge(read_df, state_df, left_index=True, right_index=True)
I'm issuing the following SQL statement against a tempview
cloudantdata.createOrReplaceTempView("washingflat")
sqlDF = spark.sql("SELECT temperature FROM washingflat")
sqlDF.rdd.map(lambda row : row.temperature).collect()
I'm just interested in the plain (unwrapped) integer values. All I've tried so far with the dataframe API always returned Row-objects wrapping the values I'm interested in.
Is there a way to get the scalar contents without using the RDD api?
You could manually put them in a list after collecting them as below
temps = []
rows = sqlDF.collect()
for r in rows:
    temps.append(r['temperature'])
So given an input DataFrame
import numpy as np
import pandas as pd
test_df = pd.DataFrame({'Age': np.random.uniform(0,100, size = (100,)), 'City': 'LA'})
sqlContext.createDataFrame(test_df).registerTempTable('AgeTable')
There are two (primary) ways of extracting values without using the Row abstraction. The first is to use the .toPandas() method of a DataFrame / SQL query:
print(sqlContext.sql("SELECT Age FROM AgeTable").toPandas()['Age'])
This returns a Pandas DataFrame / Series.
The second is to actually group the data inside of SQL and then extract it from a single Row object
al_qry = sqlContext.sql("SELECT City, COLLECT_SET(Age) as AgeList FROM AgeTable GROUP BY City")
al_qry.first().AgeList
This returns a raw python list.
The toPandas approach is generally the more efficient one, and it will likely be improved further in the future.
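Applied to the original temperature query, the toPandas route looks like this (a sketch):
temps = sqlDF.toPandas()['temperature'].tolist()  # plain Python list of values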
Try:
>>> from itertools import chain
>>>
>>> list(chain.from_iterable(sqlDF.collect()))
temp_list = [str(i.temperature) for i in sqlDF.select("temperature").collect()]
I'm using Pandas with Python 3. I have a dataframe with a bunch of columns, but I only want to change the data type of all the values in one of the columns and leave the others alone. The only way I could find to accomplish this is to edit the column, remove the original column and then merge the edited one back. I would like to edit the column without having to remove and merge, leaving the rest of the dataframe unaffected. Is this possible?
Here is my solution now:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
def make_float(var):
    var = float(var)
    return var
#create a new dataframe with the value types I want
df2 = df1['column'].apply(make_float)
#remove the original column
df3 = df1.drop('column',1)
#merge the dataframes
df1 = pd.concat([df3,df2],axis=1)
It also doesn't work to apply the function to the dataframe directly. For example:
df1['column'].apply(make_float)
print(type(df1.iloc[1]['column']))
yields:
<class 'str'>
df1['column'] = df1['column'].astype(float)
It will raise an error if conversion fails for some row.
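If some rows may fail to convert, pd.to_numeric with errors='coerce' turns them into NaN instead of raising (a sketch):
import pandas as pd
df1['column'] = pd.to_numeric(df1['column'], errors='coerce')  # unparseable values become NaN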
Apply does not work in place; it returns a new Series, which you discard in this line:
df1['column'].apply(make_float)
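Assigning the result back gives the effect you wanted:
df1['column'] = df1['column'].apply(make_float)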
Apart from Yakym's solution, you can also do this (note that it only works if the column already holds numeric values rather than strings):
df['column'] += 0.0