I have a csv file consisting of 5 fields.
Some sample data:
market_name,vendor_name,price,name,ship_from
'Greece',03wel,1.79367196,huhif,Germany
'Greece',le,0.05880975,fdfd,Germany
'Mlkio',dpg,0.11344859,fdfd,Germany
'Greece',gert,0.18655316,,Germany
'Tu',roland,0.52856728,fdfsdv,Germany
'ghuo',andy,0.52856728,jhjhj,Germany
'ghuo',didier,0.02085452,fsdfdf,Germany
'arsen',roch,0.02578377,uykujkj,Germany
'arsen',dpg,0.10010169,wrefrewrf,Germany
'arsen',dpg,0.06415609,jhgjhg,Germany
'arsen',03wel,0.02578377,gfdgb,Germany
'giar',03wel,0.02275039,gfhfbf,Germany
'giar',03wel,0.42751765,sdgfdgfg,Germany
In this file there are multiple records for every vendor. I want to find every unique value of the field vendor_name and also calculate the average price for each vendor. I am using the following script:
import pandas as pd
import numpy as np
import csv
from random import randint
ds = pd.read_csv("sxedonetoimo2.csv",
                 dtype={"vendor_name": object, "name": object,
                        "ship_from": object, "price": object})
ds['ship_from'] = ds.ship_from.str.lower()
print(ds.dtypes)
pd.to_numeric(ds['price'], errors='coerce')
d = {'name': pd.Series.nunique,
     'ship_from': lambda x: randint(1,2) if (x==('eu'or'europe'or'eu'or'europeanunion'or'worldwide'or'us'or'unitedstates'or'usa'or'us'or'ww'or'wweu'or'euww'or'internet')).any() else randint(3,20),
     'price': ds.groupby('vendor_name')['price'].mean()}
result = ds.groupby('vendor_name').agg(d)
result.to_csv("scaled_upd.csv")
But I am getting this error:
raise DataError('No numeric types to aggregate')
pandas.core.base.DataError: No numeric types to aggregate
In my csv file, the values of the field price are only numbers. If I change the type of that field to float, it raises an error that a specific string cannot be parsed. I am really confused. Any help?
Just use read_csv(), groupby() and agg():
import pandas as pd
df = pd.read_csv('test.csv').groupby('vendor_name') \
    .agg({'price': 'mean', 'name': lambda x: x.nunique()})
Yields:
price name
vendor_name
03wel 0.567431 4
andy 0.528567 1
didier 0.020855 1
dpg 0.092569 3
gert 0.186553 0
le 0.058810 1
roch 0.025784 1
roland 0.528567 1
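As for the original error: the script forces price to dtype=object in read_csv, and the result of pd.to_numeric is never assigned back, so the column is still non-numeric when mean() runs. A minimal fix within the original script:
# Assign the coerced values back; strings that can't be parsed become NaN
ds['price'] = pd.to_numeric(ds['price'], errors='coerce')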
Related
I have a csv that I am reading using pandas.
In the csv, I have a column that has the following values:
x<1
1<x<2
2<x<3
3<x<4
x<4
When I convert them to category and then use the category codes, I get something like this:
{
x<1:2
1<x<2:1
2<x<3:3
3<x<4:4
x<4:0
}
but I need the code to be as follow:
{
x<1:0
1<x<2:1
2<x<3:2
3<x<4:3
x<4:4
}
How can I change the category code without changing the dataframe?
I used the following code to convert the column to category:
df['col'] = df['col'].astype('category')
You can use pd.api.types.CategoricalDtype to change the category codes as follows:
Code:
import pandas as pd
# Create a sample dataframe
df = pd.DataFrame({'col': ['x<1', '1<x<2', '2<x<3', '3<x<4', 'x<4']})
df['col'] = df['col'].astype('category')
# Get the current category order
category_order1 = df.col.cat.categories.to_list()
print('category_order_1:', category_order1)
# Invert the category order
co = category_order1[::-1]
df['col'] = df['col'].astype(pd.api.types.CategoricalDtype(categories=co, ordered=True))
# Get the current category order
category_order2 = df.col.cat.categories.to_list()
print('category_order_2:', category_order2)
# Define and apply arbitrary category order
co = ['2<x<3', 'x<4', 'x<1', '3<x<4', '1<x<2']
df['col'] = df['col'].astype(pd.api.types.CategoricalDtype(categories=co, ordered=True))
# Get the current category order
category_order3 = df.col.cat.categories.to_list()
print('category_order_3:', category_order3)
Output:
category_order_1: ['1<x<2', '2<x<3', '3<x<4', 'x<1', 'x<4']
category_order_2: ['x<4', 'x<1', '3<x<4', '2<x<3', '1<x<2']
category_order_3: ['2<x<3', 'x<4', 'x<1', '3<x<4', '1<x<2']
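To get exactly the codes the question asks for, pass the categories in the desired order and read them back through .cat.codes:
import pandas as pd
df = pd.DataFrame({'col': ['x<1', '1<x<2', '2<x<3', '3<x<4', 'x<4']})
desired = ['x<1', '1<x<2', '2<x<3', '3<x<4', 'x<4']
df['col'] = df['col'].astype(pd.api.types.CategoricalDtype(categories=desired, ordered=True))
print(dict(zip(df['col'], df['col'].cat.codes)))
# {'x<1': 0, '1<x<2': 1, '2<x<3': 2, '3<x<4': 3, 'x<4': 4}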
I am currently trying to write a function that extracts the string between 2 specific characters.
My data set contains only emails, which look like this: pstroulgerrn#time.com.
I am trying to extract everything after the # and everything before the ., so that the email listed above would output time.
Here is my code so far:
new = df_personal['email'] # 1000x1 dataframe of emails
def extract_company(x):
y = [ ]
y = x[x.find('#')+1 : x.find('.')]
return y
extract_company(new)
Note: If I change new to df_personal['email'][0], the correct output is displayed for that row.
However, when trying to do it for the entire Series, I get an error saying:
AttributeError: 'Series' object has no attribute 'find'
Your extract_company function receives the whole Series at once, and a Series has no .find method, hence the AttributeError. Use the vectorized .str methods instead; you can extract a series of all matching texts using regex:
import pandas as pd
df = pd.DataFrame(['kabawonga#something.whereever', 'kabawonga#omg.whatever'])
df.columns = ['email']
print(df)
k = df["email"].str.extract(r"#(.+)\.")
print(k)
Output:
# df
email
0 kabawonga#something.whereever
1 kabawonga#omg.whatever
# extraction
0
0 something
1 omg
See pandas.Series.str.extract
Try:
df_personal["domain"]=df_personal["email"].str.extract(r"\#([^\.]+)\.")
Outputs (for the sample data):
import pandas as pd
df_personal=pd.DataFrame({"email": ["abc#yahoo.com", "xyz.abc#gmail.com", "john.doe#aol.co.uk"]})
df_personal["domain"]=df_personal["email"].str.extract(r"\#([^\.]+)\.")
>>> df_personal
email domain
0 abc#yahoo.com yahoo
1 xyz.abc#gmail.com gmail
2 john.doe#aol.co.uk aol
You can do it with an apply function, by first splitting on . and then on # for each row:
Snippet:
import pandas as pd
df = pd.DataFrame( ['abc#xyz.dot','def#qwe.dot','def#ert.dot.dot'])
df.columns = ['email']
df["domain"] = df["email"].apply(lambda x: x.split(".")[0].split("#")[1])
Output:
df
Out[37]:
email domain
0 abc#xyz.dot xyz
1 def#qwe.dot qwe
2 def#ert.dot.dot ert
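The same split logic can also stay vectorized via the .str accessor, avoiding the Python-level apply:
df["domain"] = df["email"].str.split(".").str[0].str.split("#").str[1]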
I am trying to calculate the column values of Cat "recursively".
Every loop should calculate the max Cat value (Catz) of a group of x. If the date range becomes <= 60, the Cat column value should be updated to Catz += 1. I have an arcpy version of this process working; however, I have thousands of other data sets that should not need converting into an arcpy-friendly format, and I am not well familiar with pandas.
I made reference to [1]: Calculate DataFrame values recursively and [2]: python pandas- apply function with two arguments to columns. I still haven't quite understood the Series/DataFrame concept or how to apply either outcome.
import pandas as pd
import numpy as np
from datetime import datetime
from datetime import datetime as dt
from datetime import timedelta
import time
from datetime import date
dict = {'x':["ASPELBJNMI", "JUNRNEXCRG", "ASPELBJNMI", "JUNRNEXCRG"],
'start': ["6/27/2018", "8/4/2018", "8/22/2018", "8/12/2018"],
'finish':["8/11/2018", "10/3/2018", "8/31/2018", "10/26/2018"],
'DateRange':[0,0,0,0],
'Cat':[-1,-1,-1,-1],
'ID':[1,2,3,4]}
df = pd.DataFrame(dict)
df.set_index('ID')
def classd(houp):
Catz = houp.Cat.min()
Catz +=1
houp = houp.groupby('x')
for x, houp2 in houp:
houp.DateRange = (pd.to_datetime(houp.finish.loc[:]).min()- houp.start.loc[:]).astype('timedelta64[D]')
houp.Cat = np.where(houp.DateRange<=60, Catz , -1)
return houp
df['Cat'] = df[['x','DateRange','Cat']].apply(classd, axis=1).Cat
print df
I get the following Traceback when I run my code
Catz = houp.Cat.min()
AttributeError: ("'long' object has no attribute 'min'", u'occurred at index 0')
Desired outcome
OBJECTID_1 * Conc * ID start finish DateRange Cat
1 ASPELBJNMI LAPMT 6/27/2018 8/11/2018 45 0
2 ASPELBJNMI KLKIY 8/22/2018 8/31/2018 9 1
15 JUNRNEXCRG CGCHK 8/4/2018 10/3/2018 60 1
16 JUNRNEXCRG IQYGJ 8/12/2018 10/26/2018 83 -1
Your program is a little bit complicated to comprehend, but I would suggest trying something simple with the apply function:
s.apply(lambda x: x ** 2)
Here s is a Series.
https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html
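For the traceback itself: df.apply(classd, axis=1) passes one row at a time, so houp.Cat inside classd is a single scalar and has no .min(). A per-group computation belongs in groupby(...).apply, where the function receives the whole sub-DataFrame. A minimal sketch of those mechanics (using a simple row-wise finish - start in days, not the full recursion from the question):
import pandas as pd
df = pd.DataFrame({
    'x': ["ASPELBJNMI", "JUNRNEXCRG", "ASPELBJNMI", "JUNRNEXCRG"],
    'start': pd.to_datetime(["6/27/2018", "8/4/2018", "8/22/2018", "8/12/2018"]),
    'finish': pd.to_datetime(["8/11/2018", "10/3/2018", "8/31/2018", "10/26/2018"]),
})
def classd(houp):
    # houp is the entire sub-DataFrame for one value of x,
    # so Series methods like .min() are available here
    houp = houp.copy()
    houp['DateRange'] = (houp['finish'] - houp['start']).dt.days
    houp['Cat'] = (houp['DateRange'] <= 60).astype(int) - 1  # 0 inside the window, -1 outside
    return houp
out = df.groupby('x', group_keys=False).apply(classd)
print(out)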
I have this short version of ADS-B JSON data and would like to convert it into DataFrame columns such as Icao, Alt, Lat, Long, Spd, Cou, ...
After Alperen told me to do this
df = pd.read_json('2016-06-20-2359Z.json', lines=True),
I can load it into a DataFrame. However, df.acList is
[{'Id': 10537990, 'Rcvr': 1, 'HasSig': False, ...
Name: acList, dtype: object
How can I get the Icao, Alt, Lat, Long, Spd, Cou data?
"src":1,
"feeds":[
{
"id":1,
"name":"ADSBexchange.com",
"polarPlot":false
}
],
"srcFeed":1,
"showSil":true,
"showFlg":true,
"showPic":true,
"flgH":20,
"flgW":85,
"acList":[
{
"Id":11281748,
"Rcvr":1,
"HasSig":false,
"Icao":"AC2554",
"Bad":false,
"Reg":"N882AS",
"FSeen":"\/Date(1466467166951)\/",
"TSecs":3,
"CMsgs":1,
"AltT":0,
"Tisb":false,
"TrkH":false,
"Type":"CRJ2",
"Mdl":"2001
BOMBARDIER INC
CL-600-2B19",
"Man":"Bombardier",
"CNum":"7503",
"Op":"EXPRESSJET AIRLINES INC - ATLANTA, GA",
"OpIcao":"ASQ",
"Sqk":"",
"VsiT":0,
"WTC":2,
"Species":1,
"Engines":"2",
"EngType":3,
"EngMount":1,
"Mil":false,
"Cou":"United States",
"HasPic":false,
"Interested":false,
"FlightsCount":0,
"Gnd":false,
"SpdTyp":0,
"CallSus":false,
"TT":"a",
"Trt":1,
"Year":"2001"
},
{
"Id":11402205,
"Rcvr":1,
"HasSig":true,
"Sig":110,
"Icao":"ADFBDD",
"Bad":false,
"FSeen":"\/Date(1466391940977)\/",
"TSecs":75229,
"CMsgs":35445,
"Alt":8025,
"GAlt":8025,
"AltT":0,
"Call":"TEST1234",
"Tisb":false,
"TrkH":false,
"Sqk":"0262",
"Help":false,
"VsiT":0,
"WTC":0,
"Species":0,
"EngType":0,
"EngMount":0,
"Mil":true,
"Cou":"United States",
"HasPic":false,
"Interested":false,
"FlightsCount":0,
"Gnd":true,
"SpdTyp":0,
"CallSus":false,
"TT":"a",
"Trt":1
}
],
"totalAc":4231,
"lastDv":"636019887431643594",
"shtTrlSec":61,
"stm":1466467170029
}
If you already have your data in the acList column of a pandas DataFrame, simply do:
import pandas as pd
pd.io.json.json_normalize(df.acList[0])
Alt AltT Bad CMsgs CNum Call CallSus Cou EngMount EngType ... Sqk TSecs TT Tisb TrkH Trt Type VsiT WTC Year
0 NaN 0 False 1 7503 NaN False United States 1 3 ... 3 a False False 1 CRJ2 0 2 2001
1 8025.0 0 False 35445 NaN TEST1234 False United States 0 0 ... 0262 75229 a False False 1 NaN 0 0 NaN
Since pandas 1.0 the imports should be:
import pandas as pd
pd.json_normalize(df.acList[0])
#Sergey's answer solved the issue for me, but I ran into trouble because the JSON in my data frame column was kept as a string rather than as an object. I had to add the extra step of mapping the column through json.loads:
import json
import pandas as pd
pd.io.json.json_normalize(df.acList.apply(json.loads))
Since pandas 1.0, json_normalize is available in the top-level namespace.
Therefore use:
import pandas as pd
pd.json_normalize(df.acList[0])
I can't comment yet on ThinkBonobo's answer, but in case the JSON in the column isn't exactly a dictionary, you can keep applying .apply until it is. So in my case:
import json
import pandas as pd
pd.json_normalize(
df
.theColumnWithJson
.apply(json.loads)
.apply(lambda x: x[0]) # the inner JSON is list with the dictionary as the only item
)
In my case I had some missing values (None), so I wrote more specific code that also drops the original column after creating the new ones:
for prefix in ['column1', 'column2']:
    df_temp = df[prefix].apply(lambda x: {} if pd.isna(x) else x)
    df_temp = pd.io.json.json_normalize(df_temp)
    df_temp = df_temp.add_prefix(prefix + '_')
    df.drop([prefix], axis=1, inplace=True)
    df = pd.concat([df, df_temp], axis=1, sort=False)
I am using pandas, Jupyter notebooks and python.
I have the following dataset as a DataFrame:
Cars,Country,Type
1564,Australia,Stolen
200,Australia,Stolen
579,Australia,Stolen
156,Japan,Lost
900,Africa,Burnt
2000,USA,Stolen
1000,Indonesia,Stolen
900,Australia,Lost
798,Australia,Lost
128,Australia,Lost
200,Australia,Burnt
56,Australia,Burnt
348,Australia,Burnt
1246,USA,Burnt
I would like to know how I can use a box plot to answer the following question: "Number of cars in Australia that were affected by each type". So basically, I should have 3 boxplots (one for each type) showing the number of cars affected in Australia.
Please keep in mind that this is a subset of the real dataset.
You can select only the rows corresponding to "Australia" from the column "Country" and group them by the column "Type" as shown:
from io import StringIO  # on Python 2: from StringIO import StringIO
import pandas as pd
text_string = StringIO(
"""
Cars,Country,Type,Score
1564,Australia,Stolen,1
200,Australia,Stolen,2
579,Australia,Stolen,3
156,Japan,Lost,4
900,Africa,Burnt,5
2000,USA,Stolen,6
1000,Indonesia,Stolen,7
900,Australia,Lost,8
798,Australia,Lost,9
128,Australia,Lost,10
200,Australia,Burnt,11
56,Australia,Burnt,12
348,Australia,Burnt,13
1246,USA,Burnt,14
""")
df = pd.read_csv(text_string, sep = ",")
# Keep only the Australian rows, then draw one box of 'Cars' per 'Type'
axes = df.loc[df['Country'] == 'Australia'].boxplot(column='Cars', by='Type')
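If you are running this outside a Jupyter notebook, render the figure explicitly (pandas plotting uses matplotlib under the hood):
import matplotlib.pyplot as plt
plt.show()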