I have the following JSON snippet:
{'search_metadata': {'completed_in': 0.027,
                     'count': 2},
 'statuses': [{'contributors': None,
               'coordinates': None,
               'created_at': 'Wed Mar 31 19:25:16 +0000 2021',
               'text': 'The text',
               'truncated': True,
               'user': {'contributors_enabled': False,
                        'screen_name': 'abcde',
                        'verified': False}},
              {...}]}
The info that interests me is all in the statuses array. With pandas I can turn this into a DataFrame like this:
df = pd.DataFrame(Data['statuses'])
Then I extract a subset out of this dataframe with
dfsub = df[['created_at', 'text']]
display(dfsub) shows exactly what I expect.
But I also want to include [user][screen_name] in the subset.
dfs = df[[ 'user', 'created_at', 'text']]
is syntactically correct, but user contains too much information.
How do I add only the screen_name to the subset?
I have tried things like the following, but none of them work:
[user][screen_name]
user.screen_name
user:screen_name
I would normalize the data before constructing the DataFrame.
Take a look here: https://stackoverflow.com/a/41801708/14596032
Working example as an answer for your question:
df = pd.json_normalize(Data['statuses'], sep='_')
dfs = df[[ 'user_screen_name', 'created_at', 'text']]
print(dfs)
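With the snippet from the question, this prints something like the following (only the first, non-elided status shown):
  user_screen_name                      created_at      text
0            abcde  Wed Mar 31 19:25:16 +0000 2021  The text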
You can try to access the DataFrame, then the Series, then the dict:
df['user'] # user column = Series
df['user'][0] # 1st (only) item of the Series = dict
df['user'][0]['screen_name'] # screen_name in dict
You can use pd.Series.str. The docs don't do justice to all the wonderful things .str can do, such as accessing list and dict items. Case in point, you can access dict elements like this:
df['user'].str['screen_name']
That said, I agree with @VladimirGromes that a better way is to normalize your data into a flat table.
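Putting the pieces together, a minimal sketch (assuming Data is the parsed response from the question):
import pandas as pd

df = pd.DataFrame(Data['statuses'])
dfsub = df[['created_at', 'text']].copy()
# .str['screen_name'] pulls that one key out of each dict in the 'user' column
dfsub['screen_name'] = df['user'].str['screen_name']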
I am using pandas to read a CSV which contains a phone_number field (string), however, I need to convert this field into the below JSON format
[{'phone_number':'+01 373643222'}] and put it under a new column name called phone_numbers, how can I do that?
I searched online, but the examples I found convert all the columns into JSON using to_json(), which apparently cannot solve my case.
Below is an example:
import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'],
                   'phone_number': ['+1 569-483-2388', '+1 555-555-1212', '+1 432-867-5309']})
Use the map function like this:
df["phone_numbers"] = df["phone_number"].map(lambda x: [{"phone_number": x}] )
display(df)
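For the example DataFrame above, this should display something like:
    user     phone_number                          phone_numbers
0    Bob  +1 569-483-2388  [{'phone_number': '+1 569-483-2388'}]
1   Jane  +1 555-555-1212  [{'phone_number': '+1 555-555-1212'}]
2  Alice  +1 432-867-5309  [{'phone_number': '+1 432-867-5309'}]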
So the problem is that it is a csv file and when I open it with pandas, it looks like this:
data=pd.read_csv('test.csv', sep=',', usecols=['properties'])
data.head()
It is like a dictionary in each row; I am just confused about how to open it correctly with gender, document_type, etc. as columns:
{'gender': 'Male', 'nationality': 'IRL', 'document_type': 'passport', 'date_of_expiry': '2019-08-12', 'issuing_country': 'IRL'}
{'gender': 'Female', 'document_type': 'driving_licence', 'date_of_expiry': '2023-02-28', 'issuing_country': 'GBR'}
{'gender': 'Male', 'nationality': 'ITA', 'document_type': 'passport', 'date_of_expiry': '2018-06-09', 'issuing_country': 'ITA'}
It looks like the csv file is not properly formatted to be read by the default function from pandas. The cells are strings, so you will need to parse them and create the columns yourself:
import ast
data['gender'] = data['properties'].map(ast.literal_eval).str['gender']
Do this for each one of the fields in the dictionary. If there are too many columns, you can evaluate the first row once to discover the keys, like this:
my_dict = ast.literal_eval(data.loc[0, 'properties'])
parsed = data['properties'].map(ast.literal_eval)
for key in my_dict:
    data[key] = parsed.str[key]
This should build your DataFrame just fine.
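An alternative sketch that avoids the per-key loop, assuming every cell in 'properties' holds one dict literal: parse the strings with ast.literal_eval and let pd.json_normalize build one column per key (rows missing a key, such as 'nationality' in the second row, simply get NaN):
import ast
import pandas as pd

data = pd.read_csv('test.csv', usecols=['properties'])
# turn each string like "{'gender': 'Male', ...}" back into a dict
parsed = data['properties'].map(ast.literal_eval)
# one column per key, over the union of keys across all rows
flat = pd.json_normalize(parsed.tolist())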
I am trying to create an interactive bokeh plot that holds multiple data and I am not sure why I am getting the error
ValueError: expected an element of ColumnData(String, Seq(Any)),got {'x': 6.794, 'y': 46.8339999999999, 'country': 'Congo, Dem. Rep.', 'pop': 3.5083789999999997, 'region': 'Sub-Saharan Africa'}
source = ColumnDataSource(data={
    'x': data.loc[1970].fertility,
    'y': data.loc[1970].life,
    'pop': (data.loc[1970].population / 20000000) + 2,
    'region': data.loc[1970].region,
})
I have tried two different data sets by importing data from Excel and have not been able to figure out exactly why this is happening.
As the name suggests, the ColumnDataSource is a data structure for storing columns of data. This means that the value of every key in .data must be a column, i.e. a Python list, a NumPy array, or a Pandas series. But you are trying to assign plain numbers as the values, which is what the error message is telling you:
expected an element of ColumnData(String, Seq(Any))
This is saying the acceptable, expected value is a dict that maps strings to sequences. But what you passed is clearly not that:
got {'x': 6.794, 'y': 46.8339999999999, 'country': 'Congo, Dem. Rep.', 'pop': 3.5083789999999997, 'region': 'Sub-Saharan Africa'}
The value for x, for instance, is just the number 6.794, not an array or list, etc.
You can easily do this:
source = ColumnDataSource({str(c): v.values for c, v in df.items()})
This would be a solution. I think the problem is in how the data is pulled out of df:
source = ColumnDataSource(data={
    'x': data[data['Year'] == 1970]['fertility'],
    'y': data[data['Year'] == 1970]['life'],
    'pop': (data[data['Year'] == 1970]['population'] / 20000000) + 2,
    'region': data[data['Year'] == 1970]['region']
})
I had this same problem using this same dataset.
My solution was to import the csv in pandas using "Year" as the index column.
data = pd.read_csv(csv_path, index_col='Year')
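With 'Year' as the index and several rows per year, data.loc[1970] returns a DataFrame rather than a single row, so data.loc[1970].fertility and friends are Series, which is exactly the kind of column ColumnDataSource expects. A minimal check, assuming the column layout from this question:
import pandas as pd

data = pd.read_csv(csv_path, index_col='Year')
# a Series (one value per country in 1970), not a scalar
print(type(data.loc[1970].fertility))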
I stumbled upon a little coding problem. I have to basically read data from a .csv file which looks a lot like this:
2011-06-19 17:29:00.000,72,44,56,0.4772,0.3286,0.8497,31.3587,0.3235,0.9147,28.5751,0.3872,0.2803,0,0.2601,0.2073,0.1172,0,0.0,0,5.8922,1,0,0,0,1.2759
Now, I basically need to read an entire file consisting of rows like this and parse them into numpy arrays. Till now, I have been able to get them into a big string-type object using code similar to this:
order_hist = np.loadtxt(filename_input,delimiter=',',dtype={'names': ('Year', 'Mon', 'Day', 'Stock', 'Action', 'Amount'), 'formats': ('i4', 'i4', 'i4', 'S10', 'S10', 'i4')})
The format for this file consists of a set of S20 data types as of now. I basically need to extract all of the data in the big order_hist structure into a separate array for each column. I do not know how to store the date-time column (I've kept it as a string for now). I need to convert the rest to float, but the code below is giving me an error:
temparr=float[:len(order_hist)]
for x in range(len(order_hist['Stock'])):
temparr[x]=float(order_hist['Stock'][x]);
Can someone show me just how I can convert all the columns to the arrays that I need??? Or possibly direct me to some link to do so?
Boy, have I got a treat for you. numpy.genfromtxt has a converters parameter, which allows you to specify a function for each column as the file is parsed. The function is fed the CSV string value. Its return value becomes the corresponding value in the numpy array.
Moreover, the dtype=None parameter tells genfromtxt to make an intelligent guess as to the type of each column. In particular, numeric columns are automatically cast to an appropriate dtype.
For example, suppose your data file contains
2011-06-19 17:29:00.000,72,44,56
Then
import numpy as np
import datetime as DT
def make_date(datestr):
    return DT.datetime.strptime(datestr, '%Y-%m-%d %H:%M:%S.%f')

arr = np.genfromtxt(filename, delimiter=',',
                    converters={'Date': make_date},
                    names=('Date', 'Stock', 'Action', 'Amount'),
                    dtype=None)
print(arr)
print(arr.dtype)
yields
(datetime.datetime(2011, 6, 19, 17, 29), 72, 44, 56)
[('Date', '|O4'), ('Stock', '<i4'), ('Action', '<i4'), ('Amount', '<i4')]
Your real csv file has more columns, so you'd want to add more items to names, but otherwise, the example should still stand.
If you don't really care about the extra columns, you can assign a fluff-name like this:
arr = np.genfromtxt(filename, delimiter=',',
                    converters={'Date': make_date},
                    names=('Date', 'Stock', 'Action', 'Amount') +
                          tuple('col{i}'.format(i=i) for i in range(22)),
                    dtype=None)
yields
(datetime.datetime(2011, 6, 19, 17, 29), 72, 44, 56, 0.4772, 0.3286, 0.8497, 31.3587, 0.3235, 0.9147, 28.5751, 0.3872, 0.2803, 0, 0.2601, 0.2073, 0.1172, 0, 0.0, 0, 5.8922, 1, 0, 0, 0, 1.2759)
You might also be interested in checking out the pandas module which is built on top of numpy, and which takes parsing CSV to an even higher level of luxury: It has a pandas.read_csv function whose parse_dates = True parameter will automatically parse date strings (using dateutil).
Using pandas, your csv could be parsed with
df = pd.read_csv(filename, parse_dates=[0], header=None,
                 names=('Date', 'Stock', 'Action', 'Amount') +
                       tuple('col{i}'.format(i=i) for i in range(22)))
Note there is no need to specify the make_date function. Just to be clear: pd.read_csv returns a DataFrame, not a numpy array. The DataFrame may actually be more useful for your purpose, but you should be aware it is a different object with a whole new world of methods to exploit and explore.
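If you do still want a plain numpy array per column (the "set of arrays" the question asks for), each DataFrame column is one attribute access away; for example:
stock = df['Stock'].values   # numpy array of ints
dates = df['Date'].values    # numpy array of datetime64 values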