Dictionary in Pandas DataFrame, how to split the columns - python

I have a DataFrame with a single column ('Vals') whose entries are dictionaries. The DataFrame looks more or less like this:
In[215]: fff
Out[213]:
Vals
0 {u'TradeId': u'JP32767', u'TradeSourceNam...
1 {u'TradeId': u'UUJ2X16', u'TradeSourceNam...
2 {u'TradeId': u'JJ35A12', u'TradeSourceNam...
When looking at an individual row the dictionary looks like this:
In[220]: fff['Vals'][100]
Out[218]:
{u'BrdsTraderBookCode': u'dffH',
 u'Measures': [{u'AssetName': u'Ie0',
                u'DefinitionId': u'6dbb',
                u'MeasureValues': [{u'Amount': -18.64}],
                u'ReportingCurrency': u'USD',
                u'ValuationId': u'669bb'}],
 u'SnapshotId': 12739,
 u'TradeId': u'17304M',
 u'TradeLegId': u'31827',
 u'TradeSourceName': u'xxxeee',
 u'TradeVersion': 1}
How can I split the columns and create a new DataFrame, so that I get one column with TradeId and another one with MeasureValues?

try this:
l = []
for idx, row in df['Vals'].iteritems():  # on pandas >= 2.0 use .items() instead of .iteritems()
    temp_df = pd.DataFrame(row['Measures'][0]['MeasureValues'])
    temp_df['TradeId'] = row['TradeId']
    l.append(temp_df)
pd.concat(l, axis=0)

Here's a way to get TradeId and MeasureValues (using twice your sample row above to illustrate the iteration):
new_df = pd.DataFrame()
for id, data in fff.iterrows():
    # data.iloc[0] is the 'Vals' dictionary of the current row (.ix was removed in pandas 1.0)
    d = {'TradeId': data.iloc[0]['TradeId']}
    d.update(data.iloc[0]['Measures'][0]['MeasureValues'][0])
    new_df = pd.concat([new_df, pd.DataFrame.from_dict(d, orient='index').T])
Amount TradeId
0 -18.64 17304M
0 -18.64 17304M
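A compact alternative, if your pandas version provides pd.json_normalize: this is only a sketch, assuming every entry in 'Vals' has exactly the nested structure shown in the question.
import pandas as pd

# Flatten the nested dicts directly: one output row per MeasureValue entry,
# carrying the top-level TradeId along as metadata.
flat = pd.json_normalize(
    fff['Vals'].tolist(),
    record_path=['Measures', 'MeasureValues'],
    meta=['TradeId'],
)
# flat now has columns ['Amount', 'TradeId'], one row per measure value.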

Related

divide the row into two rows after several columns

I have a CSV file and I am trying to split each row into multiple rows if it contains more than 4 columns.
Example (provided as an image in the original post):
Expected output (also provided as an image):
Is there a way to do that in pandas or Python? Sorry if this is a simple question.
When there are two columns with the same name in a CSV file, pandas automatically appends an integer suffix to the duplicate column name. For example, this CSV file (shown as an image in the original answer) will be read in like this:
df = pd.read_csv("Book1.csv")
df
Now, to solve your question, let's consider the above dataframe as the input dataframe.
Try this:
cols = df.columns.tolist()
cols.remove('id')
start = 0
end = 4
new_df = []
final_cols = ['id', 'x1', 'y1', 'x2', 'y2']
while start < len(cols):
    if end > len(cols):
        end = len(cols)
    temp = cols[start:end]
    start = end
    end = end + 4
    temp_df = df.loc[:, ['id'] + temp]
    temp_df.columns = final_cols[:1 + len(temp)]
    if len(temp) < 4:
        temp_df[final_cols[1 + len(temp):]] = None
    print(temp_df)
    new_df.append(temp_df)
pd.concat(new_df).reset_index(drop=True)
Result (shown as an image in the original answer):
You can first set the video column as the index, then concatenate every remaining group of 4 columns into a new dataframe. Finally, reset the index to get the video column back.
df.set_index('video', inplace=True)
dfs = []
for i in range(len(df.columns)//4):
    d = df.iloc[:, range(i*4, i*4+4)]
    dfs.append(d.set_axis(['x_center', 'y_center']*2, axis=1))
df_ = pd.concat(dfs).reset_index()
The same thing can be written as a list comprehension; note that the comma after the first `:` inside iloc is required, otherwise iloc receives a single slice instead of separate row and column indexers and raises a positional indexing error:
df_ = pd.concat([df.iloc[:, range(i*4, i*4+4)].set_axis(['x_center', 'y_center']*2, axis=1) for i in range(len(df.columns)//4)])
print(df_)
video x_center y_center x_center y_center
0 1_1 31.510973 22.610222 31.383655 22.488293
1 1_1 31.856295 22.830109 32.016905 22.948702
2 1_1 32.011684 22.990689 31.933356 23.004779
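For completeness, the same reshape can also be done with numpy, avoiding the Python-level loop. This is only a sketch: it assumes 'video' is still a regular column (i.e. before the set_index call above) and that the remaining columns come in repeated groups of 4; the rows come out grouped per original row rather than per block, but the content is the same.
import numpy as np
import pandas as pd

k = (df.shape[1] - 1) // 4                                   # number of 4-column groups per row
values = df.drop(columns='video').to_numpy().reshape(-1, 4)  # each group of 4 becomes its own row
df_long = pd.DataFrame(values, columns=['x_center', 'y_center', 'x_center', 'y_center'])
df_long.insert(0, 'video', np.repeat(df['video'].to_numpy(), k))  # repeat each video id once per group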

What is the correct way to get the first row of a dataframe?

The data in test.csv looks like this:
device_id,upload_time,latitude,longitude,mileage,other_vals,speed,upload_time_add_8hour,upload_time_year_month,car_id,car_type,car_num,marketer_name
1101,2020-09-30 16:03:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:03:41,202010,18,1,,
1101,2020-09-30 16:08:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:08:41,202010,18,1,,
1101,2020-09-30 16:13:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:13:41,202010,18,1,,
1101,2020-09-30 16:18:41+00:00,46.7242,131.140233,0,,0,2020/10/1 0:18:41,202010,18,1,,
1101,2020-10-02 08:19:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:19:41,202010,18,1,,
1101,2020-10-02 08:24:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:24:41,202010,18,1,,
1101,2020-10-02 08:29:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:29:41,202010,18,1,,
1101,2020-10-02 08:34:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:34:41,202010,18,1,,
1101,2020-10-02 08:39:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:39:41,202010,18,1,,
1101,2020-10-02 08:44:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:44:41,202010,18,1,,
1101,2020-10-02 08:49:41+00:00,46.7236,131.1396,0.1,,0,2020/10/2 16:49:41,202010,18,1,,
1101,2020-10-06 11:11:10+00:00,46.7245,131.14015,0.1,,2.1,2020/10/6 19:11:10,202010,18,1,,
1101,2020-10-06 11:16:10+00:00,46.7245,131.14015,0.1,,2.2,2020/10/6 19:16:10,202010,18,1,,
1101,2020-10-06 11:21:10+00:00,46.7245,131.14015,0.1,,3.84,2020/10/6 19:21:10,202010,18,1,,
1101,2020-10-06 16:46:10+00:00,46.7245,131.14015,0,,0,2020/10/7 0:46:10,202010,18,1,,
1101,2020-10-07 04:44:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:44:27,202010,18,1,,
1101,2020-10-07 04:49:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:49:27,202010,18,1,,
1101,2020-10-07 04:54:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:54:27,202010,18,1,,
1101,2020-10-07 04:59:27+00:00,46.724366,131.1402,1,,0,2020/10/7 12:59,202010,18,1,,
1101,2020-10-07 05:04:27+00:00,46.724366,131.1402,1,,0,2020/10/7 13:04:27,202010,18,1,,
I use this code to get the rows where the speed is 0, and then group the dataframe by latitude, longitude, year, month and day.
After grouping, I get the first and the last upload_time_add_8hour of each group. If the difference between them is more than 5 minutes, I keep the first row of the group, and finally save these rows to CSV.
I think my code is not concise enough.
I use df_first_row = sub_df.iloc[0:1,:] to get the first row of the dataframe, and I use upload_time_add_8hour_first = sub_df['upload_time_add_8hour'].iloc[0] and upload_time_add_8hour_last = sub_df['upload_time_add_8hour'].iloc[-1] to get the first and last elements of a specific column.
Is there any more suitable way?
My code:
import pandas as pd

device_csv_name = r'E:/test.csv'
df = pd.read_csv(device_csv_name, parse_dates=[7], encoding='utf-8', low_memory=False)
df['upload_time_year_month_day'] = df['upload_time_add_8hour'].dt.strftime('%Y%m%d')
df['upload_time_year_month_day'] = df['upload_time_year_month_day'].astype(str)
df_speed0 = df[df['speed'].astype(float) == 0.0]  # get data with speed 0.0
gb = df_speed0.groupby(['latitude', 'longitude', 'upload_time_year_month_day'])
sub_dataframe_list = []
for i in gb.indices:
    sub_df = pd.DataFrame(gb.get_group(i))
    sub_df = sub_df.sort_values(by=['upload_time_add_8hour'])
    count_row = sub_df.shape[0]  # get row count
    if count_row > 1:  # each group must have more than 1 row
        upload_time_add_8hour_first = sub_df['upload_time_add_8hour'].iloc[0]  # get first upload_time_add_8hour
        upload_time_add_8hour_last = sub_df['upload_time_add_8hour'].iloc[-1]  # get last upload_time_add_8hour
        minutes_diff = (upload_time_add_8hour_last - upload_time_add_8hour_first).total_seconds() / 60.0
        if minutes_diff >= 5:  # if minutes_diff >= 5, append the first row of the group to sub_dataframe_list
            df_first_row = sub_df.iloc[0:1, :]
            sub_dataframe_list.append(df_first_row)
if sub_dataframe_list:
    result = pd.concat(sub_dataframe_list, ignore_index=True)
    result = result.sort_values(by=['upload_time'])
    result.to_csv(r'E:/for_test.csv', index=False, mode='w', header=True, encoding='utf-8')
To get the first and last element of the column, your approach is already the most efficient/correct way. If you're interested in this topic, I recommend reading this other Stack Overflow answer: https://stackoverflow.com/a/25254087/8294752
To get the first row, I personally prefer to use DataFrame.head(1), therefore for your code something like this:
df_first_row = sub_df.head(1)
I didn't look into how the head() method is defined in Pandas and its performance implications, but in my opinion it improves readability and reduces some potential confusion with indexes.
In other examples you might also find sub_df.iloc[0], but that option returns a pandas.Series whose index is the DataFrame's column names.
sub_df.head(1) will return a 1-row DataFrame instead, which is the same result as sub_df.iloc[0:1,:]
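If the goal is also to get rid of the explicit loop, the same logic can be expressed as a grouped helper. This is only a sketch, reusing the column names and the df_speed0 frame from the question:
def first_if_span_at_least_5_min(g):
    # Sort the group by time, measure the span between the first and last row,
    # and keep the first row only if the span is at least 5 minutes.
    g = g.sort_values('upload_time_add_8hour')
    span = g['upload_time_add_8hour'].iloc[-1] - g['upload_time_add_8hour'].iloc[0]
    return g.head(1) if span.total_seconds() / 60.0 >= 5 else g.iloc[0:0]

result = (
    df_speed0
    .groupby(['latitude', 'longitude', 'upload_time_year_month_day'], group_keys=False)
    .apply(first_if_span_at_least_5_min)
    .sort_values('upload_time')
)
Single-row groups have a span of zero, so they are dropped just like in the original count_row > 1 check.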
Your way out is either groupby().agg or df.agg.
If you need it per device, you can:
# sub_df.groupby('device_id')['upload_time_add_8hour'].agg(['first', 'last'])
sub_df.groupby('device_id')['upload_time_add_8hour'].agg([('upload_time_add_8hour_first', 'first'), ('upload_time_add_8hour_last', 'last')]).reset_index()
device_id upload_time_add_8hour_first upload_time_add_8hour_last
0 1101 10/1/2020 0:03 10/7/2020 13:04
If you do not want it per device, maybe try:
sub_df['upload_time_add_8hour'].agg({'upload_time_add_8hour_first': lambda x: x.head(1),'upload_time_add_8hour_last': lambda x: x.tail(1)})
upload_time_add_8hour_first 0 10/1/2020 0:03
upload_time_add_8hour_last 19 10/7/2020 13:04

Parallel Processing using Multiprocessing in Python

I'm new to parallel processing in Python. I have a large dataframe with names and the list of countries each person lived in; a sample dataframe was shown as an image in the original post.
I have a chunk of code that takes in this dataframe and splits the countries to separate columns. The code is this:
def split_country(data):
    d_list = []
    for index, row in data.iterrows():
        for value in str(row['Country']).split(','):
            d_list.append({'Name': row['Name'],
                           'value': value})
    # note: DataFrame.append was removed in pandas 2.0; pd.concat is the modern replacement
    data = data.append(d_list, ignore_index=True)
    data = data.groupby('Name')['value'].value_counts()
    data = data.unstack(level=-1).fillna(0)
    return data
The final output is something like this (also shown as an image in the original post).
I'm trying to parallelize the above process by passing my dataframe (df) using the following:
import multiprocessing as mp

result = []
pool = mp.Pool(mp.cpu_count())
result.append(pool.map(split_country, [row for row in df]))
But the processing does not stop even with a toy dataset like the one above. I'm completely new to this, so I would appreciate any help.
multiprocessing is probably not required here. Using pandas vectorized methods will be sufficient to quickly produce the desired result.
For a test DataFrame with 1M rows, the following code took 1.54 seconds.
First, use pandas.DataFrame.explode on the column of lists
If the column contains strings, first use ast.literal_eval to convert them to lists:
df.countries = df.countries.apply(ast.literal_eval)
If the data is read from a CSV file, use df = pd.read_csv('test.csv', converters={'countries': literal_eval})
For this question, it's better to use pandas.get_dummies to get a count of each country per name, then pandas.DataFrame.groupby on 'name', and aggregate with .sum
import pandas as pd
from ast import literal_eval
# sample data
data = {'name': ['John', 'Jack', 'James'], 'countries': [['USA', 'UK'], ['China', 'UK'], ['Canada', 'USA']]}
# create the dataframe
df = pd.DataFrame(data)
# if the countries column is strings, evaluate to lists; otherwise skip this line
df.countries = df.countries.apply(literal_eval)
# explode the lists
df = df.explode('countries')
# use get_dummies and groupby name and sum
df_counts = pd.get_dummies(df, columns=['countries'], prefix_sep='', prefix='').groupby('name', as_index=False).sum()
# display(df_counts)
name Canada China UK USA
0 Jack 0 1 1 0
1 James 1 0 0 1
2 John 0 0 1 1
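For what it's worth, pd.crosstab can replace the get_dummies/groupby step; a small sketch using the exploded df from above:
# count each country per name directly from the exploded frame
df_counts = pd.crosstab(df['name'], df['countries']).reset_index()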

Create a dataframe from a dictionary with multiple keys and values

So I have a dictionary with 20 keys, all structured like so (same length):
{'head': X Y Z
0 -0.203363 1.554352 1.102800
1 -0.203410 1.554336 1.103019
2 -0.203449 1.554318 1.103236
3 -0.203475 1.554299 1.103446
4 -0.203484 1.554278 1.103648
... ... ... ...
7441 -0.223008 1.542740 0.598634
7442 -0.222734 1.542608 0.599076
7443 -0.222466 1.542475 0.599520
7444 -0.222207 1.542346 0.599956
7445 -0.221962 1.542225 0.600375
I'm trying to convert this dictionary to a dataframe, but I'm having trouble with getting the output I want. What I want is a dataframe structured like so: columns = [headX, headY, headZ etc.] and rows being the 0-7445 rows.
Is that possible? I've tried:
df = pd.DataFrame.from_dict(mydict, orient="columns")
And different variations of that, but can't get the desired output.
Any help will be great!
EDIT: The output I want has 60 columns in total, i.e. from each of the 20 keys, I want an X, Y, Z for each of them. So columns would be: [key1X, key1Y, key1Z, key2X, key2Y, key2Z, ...]. So the dataframe will be 60 columns x 7446 rows.
Use concat with axis=1 and then flatten the MultiIndex columns with f-strings:
df = pd.concat(mydict, axis=1)  # mydict is the dictionary of DataFrames from the question
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
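A minimal sketch of how that behaves, with two hypothetical keys instead of twenty (the 'hand' key and all values here are invented), and with the underscore dropped from the f-string to get exactly the headX-style names asked for:
import pandas as pd

mydict = {
    'head': pd.DataFrame({'X': [-0.2033, -0.2034], 'Y': [1.5543, 1.5543], 'Z': [1.1028, 1.1030]}),
    'hand': pd.DataFrame({'X': [0.10, 0.11], 'Y': [0.20, 0.21], 'Z': [0.30, 0.31]}),
}

df = pd.concat(mydict, axis=1)                          # columns become a MultiIndex of (key, axis)
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')  # flatten to 'headX', 'headY', ...
print(df.columns.tolist())
# ['headX', 'headY', 'headZ', 'handX', 'handY', 'handZ']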

How to Copy the Matching Columns between CSV Files Using Pandas?

I have two dataframes(f1_df and f2_df):
f1_df looks like:
ID,Name,Gender
1,Smith,M
2,John,M
f2_df looks like:
name,gender,city,id
Problem:
I want the code to compare the header of f1_df with f2_df by itself and copy the data of the matching columns using pandas.
Output:
the output should be like this:
name,gender,city,id # name,gender,and id are the only matching columns btw f1_df and f2_df
Smith,M, ,1 # the data copied for name, gender, and id columns
John,M, ,2
I am new to pandas and not sure how to handle the problem. I have tried to do an inner join on the matching columns, but that did not work.
Here is what I have so far:
import pandas as pd
f1_df = pd.read_csv("file1.csv")
f2_df = pd.read_csv("file2.csv")
for i in f1_df:
    for j in f2_df:
        i = i.lower()
        if i == j:
            joined = f1_df.join(f2_df)
print(joined)
Any idea how to solve this?
try this if you want to merge / join your DFs on common columns:
First, let's convert all column names to lower case:
df1.columns = df1.columns.str.lower()
df2.columns = df2.columns.str.lower()
Now we can join on the common columns:
common_cols = df2.columns.intersection(df1.columns).tolist()
joined = df1.set_index(common_cols).join(df2.set_index(common_cols)).reset_index()
Output:
In [259]: joined
Out[259]:
id name gender city
0 1 Smith M NaN
1 2 John M NaN
export to CSV:
In [262]: joined.to_csv('c:/temp/joined.csv', index=False)
c:/temp/joined.csv:
id,name,gender,city
1,Smith,M,
2,John,M,
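An equivalent sketch with pd.merge, assuming the same lower-cased column names as above; a left join keeps every row of df1 and leaves the columns only present in df2 (here city) as NaN:
common_cols = df2.columns.intersection(df1.columns).tolist()
joined = df1.merge(df2, on=common_cols, how='left')
joined.to_csv('joined.csv', index=False)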
