Nested dictionary with dataframes to dataframe in pandas - python

I know that there are a few questions about nested dictionaries to dataframe but their solutions do not work for me. I have a dataframe, which is contained in a dictionary, which is contained in another dictionary, like this:
import pandas as pd

df1 = pd.DataFrame({'2019-01-01':[38],'2019-01-02':[43]}, index=[1])
df2 = pd.DataFrame({'2019-01-01':[108],'2019-01-02':[313]}, index=[1])
da = {}
da['ES']={}
da['ES']['TV']=df1
da['ES']['WEB']=df2
What I want to obtain is the following:
df_final = pd.DataFrame({'market':['ES','ES','ES','ES'],
                         'device':['TV','TV','WEB','WEB'],
                         'ds':['2019-01-01','2019-01-02','2019-01-01','2019-01-02'],
                         'yhat':[38,43,108,313]})
Adapting code from another SO question, I have tried this:
market_ids = []
frames = []
for market_id, d in da.items():
    market_ids.append(market_id)
    frames.append(pd.DataFrame.from_dict(da, orient='index'))
df = pd.concat(frames, keys=market_ids)
This gives me a dataframe with a MultiIndex and the devices as column names, which is not what I want.
Thank you

The code below works well and gives the desired output:
t1=da['ES']['TV'].melt(var_name='ds', value_name='yhat')
t1['market']='ES'
t1['device']='TV'
t2=da['ES']['WEB'].melt(var_name='ds', value_name='yhat')
t2['market']='ES'
t2['device']='WEB'
m = pd.concat([t1, t2]).reset_index(drop=True)
print(m)
And the output is:
ds yhat market device
0 2019-01-01 38 ES TV
1 2019-01-02 43 ES TV
2 2019-01-01 108 ES WEB
3 2019-01-02 313 ES WEB
The main takeaway here is the melt function; once you read up on it, it isn't hard to see what it is doing here. As I mentioned in the comment above, this can be done iteratively over the whole da dictionary, but to write that out exactly I'd need a replica of your actual data. The idea is to take the first t1 as the initial dataframe and keep concatenating the others to it, which is straightforward; since I don't know what your actual values look like, you can adapt the approach above into a loop yourself.
The pseudocode for the loop I am talking about would look like this:
real = t1
for a in da['ES'].keys():
    if a != 'TV':
        p = da['ES'][a].melt(var_name='ds', value_name='yhat')
        p['market'] = 'ES'
        p['device'] = a
        real = pd.concat([real, p], axis=0, sort=True)
real = real.reset_index(drop=True)
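For completeness, the same melt-and-concat idea wrapped in a loop over the whole da dictionary might look like the sketch below (it assumes every leaf of da is a DataFrame shaped like df1 and df2 above):
frames = []
for market, devices in da.items():          # e.g. 'ES'
    for device, frame in devices.items():   # e.g. 'TV', 'WEB'
        t = frame.melt(var_name='ds', value_name='yhat')
        t['market'] = market
        t['device'] = device
        frames.append(t)
df_final = pd.concat(frames, ignore_index=True)[['market', 'device', 'ds', 'yhat']]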

Related

Efficient way to merge two large dataframes based on a condition

I have two dataframes as shown below. I have already referred to the posts here, here, here and here; please don't mark this as a duplicate.
id,id2,app_date
1,'A',20/3/2017
1,'A',28/8/2017
3,'B',18/10/2017
4,'C',15/2/2017
tf = pd.read_clipboard(sep=',')
tf['app_date'] = pd.to_datetime(tf['app_date'],dayfirst=True)
id,valid_from,valid_to,s_flag
1,20/1/2017,30/4/2017,0
1,28/11/2017,15/2/2018,1
1,18/12/2017,24/2/2018,0
2,15/7/2017,15/11/2017,1
2,2/2/2017,2/6/2017,0
2,11/5/2016,11/6/2016,1
df = pd.read_clipboard(sep=',')
df['valid_from'] = pd.to_datetime(df['valid_from'],dayfirst=True)
df['valid_to'] = pd.to_datetime(df['valid_to'],dayfirst=True)
I would like to do the below
a) check whether tf['app_date'] falls within df['valid_from'] and df['valid_to'] for the matching id
b) if yes, copy the s_flag column to the tf dataframe for that id
I tried the below, but I am not sure whether it is efficient for dataframes with a million-plus records:
t1 = tf.merge(df, how = 'left',on=['id'])
t1 = t1.loc[(t1.app_date >= t1.valid_from) & (t1.app_date <= t1.valid_to),['id','s_flag','app_date']]
tf.merge(t1, how = 'inner',on=['id','app_date'])
While the above works on the sample data, some records in the real data give incorrect results. For example, a 9/1/2017 approval date doesn't meet the condition for the 2nd and 3rd rows, yet it is still returned in the output.
I expect my output to be as shown below:
id app_date s_flag
0 1 2017-03-20 0.0
2 3 2017-10-18 NaN
3 4 2017-02-15 NaN
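For what it's worth, a vectorised version of the same merge-then-filter idea that keeps every tf row (with NaN for ids or dates that never fall inside a valid range) might look like the sketch below; column names follow the question, and whether it is fast enough on million-row data would need to be benchmarked:
t1 = tf.merge(df, how='left', on='id')
in_range = (t1['app_date'] >= t1['valid_from']) & (t1['app_date'] <= t1['valid_to'])
# one flag per (id, app_date) pair; rows whose app_date misses every range drop out here
matched = t1.loc[in_range, ['id', 'app_date', 's_flag']].drop_duplicates(['id', 'app_date'])
result = tf.merge(matched, how='left', on=['id', 'app_date'])
Note that this keeps the 28/8/2017 row for id 1 with an NaN s_flag rather than dropping it, which differs slightly from the expected output shown above.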

Cannot get access to pandas DataFrame in the way expected

I have a strange dataframe that doesn't seem to operate the way I expect. I should have a column heading that I can use.
The code I have produces the following, which is supposed to be used for a histogram.
categories = pd.Series(df['category'])
category_freq = pd.Series(df[df['engine'] == 'u']['category'])
hist = pd.crosstab(category_freq, categories)
counts = pd.DataFrame(np.diag(hist), index=[hist.index])
But the output has a '0' at the very top. I cannot seem to get things behaving as I would want. For example the output looks like the following:
0
category
baby 65
beauty 73
christmas 168
If I access via counts[0], I can remove this "top layer", but I can never find a way to access rows via, say, counts[0]['category']; I get a key-not-found error. How can I get the data in a format that works as a DataFrame?
Make a Series out of it instead:
counts = pd.Series(np.diag(hist), index=[hist.index])
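As a small follow-up: if hist.index is passed directly instead of being wrapped in a list (the list creates a one-level MultiIndex), the counts can then be looked up by label, e.g.:
counts = pd.Series(np.diag(hist), index=hist.index, name='count')
counts['baby']   # -> 65, using the sample output shown above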

Pandas loop over 2 dataframe and drop duplicates

I have 2 csv files with some random numbers, as follows:
csv1.csv
0 906018
1 007559
2 910475
3 915104
4 600393
...
5070 907525
5071 903079
5072 001910
5073 909735
5074 914861
length 5075
csv2.csv
0 5555
1 7859
2 501303
3 912414
4 913257
...
7497 915031
7498 915030
7499 915033
7500 902060
7501 915038
length 7502
Some elements in csv1 are also present in csv2, but I don't know exactly which ones, and I would like to extract the values that are unique. So my idea was to merge the two data frames together and then remove the duplicates.
So I wrote the following code:
import pandas as pd
import csv
unique_users = pd.read_csv('./csv1.csv')
unique_users['id']
identity = pd.read_csv('./csv2.csv')
identityNumber = identity['IDNumber']
identityNumber
df = pd.concat([identityNumber, unique_users])
Up to here everything is fine and the length of df is the sum of the two lengths, but this is where I got stuck:
concat did its job and concatenated based on the index, so now I have tons of NaN.
And when I use the code:
final_result = df.drop_duplicates(keep=False)
The data frame does not drop any values because the df structure now looks like this:
IdentityNumber   ID
5555             NaN
So I guess drop_duplicates is looking for exactly matching rows, and since none exist it just keeps everything.
So what I would like to do is go over both data frames and, if a value in csv1 exists in csv2, drop it.
Can anyone help with this please?
And please if you need more info just let me know.
UPDATE:
I think I found the reason why it is not working, but I am not sure how to solve it.
my csv1 looks like this:
id
906018,
007559,
910475,
915104,
600393,
007992,
502313,
004609,
910017,
007954,
006678,
In a Jupyter notebook, when I open the csv it looks this way:
id
906018 NaN
007559 NaN
910475 NaN
915104 NaN
600393 NaN
... ...
907525 NaN
903079 NaN
001910 NaN
909735 NaN
914861 NaN
and I do not understand why it is reading the ids as NaN.
In fact I tried to add a new column to csv2, passing the ids from csv1 as the values, and I can confirm that they are all NaN.
So I believe this is surely the source of the problem, which then reflects on everything else.
Can anyone help me understand how to solve this issue?
You can achieve this using df.merge():
import pandas as pd

# Data samples
data_1 = {'col_a': [906018,7559,910475,915104,600393,907525,903079,1910,909735,914861]}
data_2 = {'col_b': [5555,7859,914861,912414,913257,915031,1910,915104,7559,915038]}
df1 = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)
# using isin() method
unique_vals = df1.merge(df2, right_on='col_b', left_on='col_a')['col_a']
new_df1 = df1[~df1.col_a.isin(unique_vals)]
# another approach
new_df1 = df1[df1.merge(df2, right_on='col_b', left_on='col_a', how='left')['col_b'].isna()]
print(new_df1)
# col_a
# 0 906018
# 2 910475
# 4 600393
# 5 907525
# 6 903079
# 8 909735
This will remove the duplicates between your two dataframes and keep all the records in one dataframe df.
df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
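If the goal is specifically the values that occur in only one of the two files (as the keep=False in the question suggests), a variant working on the two columns directly could be:
combined = pd.concat([unique_users['id'], identity['IDNumber']])
# keep=False drops every value that occurs more than once, across or within the files
only_unique = combined.drop_duplicates(keep=False).reset_index(drop=True)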
You are getting NaN because when you concatenate, pandas doesn't know what to do with the different column names of your two dataframes. One of them has an IDNumber column and the other has an id column, so pandas puts both columns into the resulting dataframe.
Try this:
pd.concat([identity["IDNumber"], unique_users["id"]]).drop_duplicates().reset_index(drop=True)
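Regarding the UPDATE: the trailing commas in csv1 most likely make pandas parse a second, empty field on every data row, so the numbers end up as the index and the single 'id' column is filled with NaN. A minimal sketch of one way around it (the file content below is reconstructed from the snippet in the question):
import io
import pandas as pd

csv1_text = 'id\n906018,\n007559,\n910475,\n'   # reconstructed sample
# default parsing turns the numbers into the index and leaves the 'id' column as NaN
broken = pd.read_csv(io.StringIO(csv1_text))
# naming both fields explicitly keeps the numbers in the 'id' column
fixed = pd.read_csv(io.StringIO(csv1_text), header=0, names=['id', 'extra'], usecols=['id'])
# pass dtype={'id': str} as well if the leading zeros matter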

Append data to Pandas doing for loop

I have a data frame that looks like this:
print(df)
Text
0|This is a text
1|This is also text
What I wish: I would like to loop over the data frame's Text column and create a new column with the derived information, like this:
Text | Derived_text
0|This is a text | Something
1|This is also text| Something
Code: I have written the following code (I'm using spaCy, by the way):
for i in df['Text'].tolist():
    doc = nlp(i)
    resolved = [(doc._.coref_resolved) for docs in doc.ents]
    df = df.append(pd.Series(resolved), ignore_index=True)
Problem: The problem is that the appended series gets misplaced/mismatched, so it looks like this:
Text | Derived_text
0|This is a text | NaN
1|This is also text| NaN
2|NaN | Something
3|NaN | Something
I have also tried to just save it into a list, but the list does not include the NaN values that can occur during the derivation loop. I need the NaN values to be kept, so I can match the original text with the derived text using the index position.
It appears that you want to add a column, which can be done with pandas' concat method and the axis argument, like pd.concat([df, new_columns], axis=1).
However, I think you shouldn't use for loops when working with pandas. What you should probably do is use pandas' apply function, which would look something like:
# define you DataFrame
df = pd.DataFrame(data = [range(6), range(1, 7)], columns = ['a', 'b'])
# create the new column from one of them
df['a_squared'] = df['a'].apply(lambda x: x ** 2)
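Applied to the question's setup, the same idea might look like the sketch below; it assumes nlp is already loaded and that doc._.coref_resolved behaves as in the question's snippet, and the column name Derived_text is just illustrative:
def derive(text):
    doc = nlp(text)
    # return None (shown as NaN) when there are no entities, so rows stay aligned
    return doc._.coref_resolved if doc.ents else None

df['Derived_text'] = df['Text'].apply(derive)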
Maybe you should also look into lambda expressions.
Also, look into this stackoverflow question.
Hope this helped! Happy coding!

Assign value to dataframe from another dataframe based on two conditions

I am trying to assign values from a column in df2['values'] to a column df1['values']. However values should only be assigned if:
df2['category'] is equal to the df1['category'] (rows are part of the same category)
df1['date'] is in df2['date_range'] (date is in a certain range for a specific category)
So far I have this code, which works, but is far from efficient, since it takes me two days to process the two dfs (df1 has ca. 700k rows).
for i in df1.category.unique():
    for j in df2.category.unique():
        if i == j:  # matching categories
            for ia, ra in df1.loc[df1['category'] == i].iterrows():
                for ib, rb in df2.loc[df2['category'] == j].iterrows():
                    if df1['date'][ia] in df2['date_range'][ib]:
                        df1.loc[ia, 'values'] = rb['values']
                        break
I read that I should try to avoid using for loops when working with dataframes. List comprehensions are great, but since I do not have a lot of experience yet, I struggle to formulate more complicated code.
How can I approach this problem more efficiently? What key aspects should I think about when iterating over dataframes with conditions?
The code above tends to skip some rows or assign them wrongly, so I need to do a cleanup afterwards. And the biggest problem is that it is really slow.
Thank you.
Some df1 insight:
df1.head()
date category
0 2015-01-07 f2
1 2015-01-26 f2
2 2015-01-26 f2
3 2015-04-08 f2
4 2015-04-10 f2
Some df2 insight:
df2.date_range[0]
DatetimeIndex(['2011-11-02', '2011-11-03', '2011-11-04', '2011-11-05',
'2011-11-06', '2011-11-07', '2011-11-08', '2011-11-09',
'2011-11-10', '2011-11-11', '2011-11-12', '2011-11-13',
'2011-11-14', '2011-11-15', '2011-11-16', '2011-11-17',
'2011-11-18'],
dtype='datetime64[ns]', freq='D')
df2 other two columns:
df2[['values','category']].head()
values category
0 01 f1
1 02 f1
2 2.1 f1
3 2.2 f1
4 03 f1
Edit: Corrected erroneous code and added OP input from a comment
Alright, so if you want to join the dataframes on matching categories, you can merge them:
import numpy as np
import pandas as pd

df3 = df1.merge(df2, on="category")
Next, since date is a timestamp and the date_range is actually generated from two columns (per OP's comment), we instead use:
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]
Now we get back to df1 and merge on the common dates while keeping all the values from df1. This will create NaN for the subset values where they didn't match with df1 in the earlier merge.
As such, we set df1["values"] where the entries in common are not NaN and we leave them be otherwise.
common_dates = df1.merge(subset, on = "date", how= "left") # keeping df1 values
df1["values"] = np.where(common_dates["values_y"].notna(),
common_dates["values_y"], df1["values"])
N.B.: If a df1["date"] matches more than one date range, the merge will create duplicate rows and you'll have to drop some of them, otherwise the alignment above breaks.
You could accomplish the first point:
1. df2['category'] is equal to the df1['category']
with the use of a join.
You could then use a for loop to filter out the data points from df1['date'] in the merged dataframe that are not covered by df2['date_range']. Unfortunately, I would need more information about the content of df1['date'] and df2['date_range'] to write code here that does exactly that.
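As a further note on the date_range structure shown above: since each df2['date_range'] entry is already a DatetimeIndex, another option could be to expand every range into one row per date and merge on category and date. This is only a sketch; it assumes df1['date'] is already datetime64 and that df1 does not yet carry a 'values' column:
lookup = (df2.assign(date=df2['date_range'].map(list))
             .explode('date')[['category', 'date', 'values']]
             .drop_duplicates(['category', 'date']))  # guard against overlapping ranges
df1 = df1.merge(lookup, on=['category', 'date'], how='left')
For very long ranges this can use a lot of memory, so the start/end-column approach above may still be preferable.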
