How to append data from different data frames in Python?

I have about 20 data frames, all with the same columns, and I would like to add their data into one empty data frame. But when I use my code
interested_freq
UPC CPC freq
0 136.0 B64G 2
1 136.0 H01L 1
2 136.0 H02S 1
3 244.0 B64G 1
4 244.0 H02S 1
5 257.0 B64G 1
6 257.0 H01L 1
7 312.0 B64G 1
8 312.0 H02S 1
list_of_lists = []
max_freq = df_interested_freq[df_interested_freq['freq'] == df_interested_freq['freq'].max()]
for row, cols in max_freq.iterrows():
    interested_freq = df_interested_freq[df_interested_freq['freq'] != 1]
    interested_freq
    list_of_lists.append(interested_freq)
list_of_lists
to append the first data frame, and then change the names in that code hoping that it will append more data,
list_of_lists = []
for row, cols in max_freq.iterrows():
    interested_freq_1 = df_interested_freq_1[df_interested_freq_1['freq'] != 1]
    interested_freq_1
    list_of_lists.append(interested_freq_1)
list_of_lists
but the first data frame disappears and only the most recently appended data is shown. Have I done something wrong?

One way to create a new DataFrame from an existing DataFrame is to use df.copy():
Here is the detailed documentation.
df.copy() is very relevant here: if the new DataFrame is created as a plain slice or assignment rather than a copy, changing a subset of data within it can also change the initial DataFrame, so you have a fair chance of losing your actual DataFrame; that is why you need the copy.
Suppose Example DataFrame is df1 :
>>> df1
col1 col2
1 11 12
2 21 22
As a solution, you can use the df.copy() method as follows, which carries the data along:
>>> df2 = df1.copy()
>>> df2
col1 col2
1 11 12
2 21 22
In case you need the new DataFrame (df2) to be created with the same structure as df1 but don't want the values carried across, you have the option to use the reindex_like() method:
>>> df2 = pd.DataFrame().reindex_like(df1)
# df2 = pd.DataFrame(data=np.nan,columns=df1.columns, index=df1.index)
>>> df2
col1 col2
1 NaN NaN
2 NaN NaN

Why do you use append here? A DataFrame is not a list. Once you have the first dataframe (df1, for example), try:
new_df = df1
new_df = pd.concat([new_df, df2])
You can do the same thing for all 20 dataframes.
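As a rough sketch of that idea for all 20 frames, collecting them in a list and concatenating once at the end (the frame names and the freq filter are taken from the question; extend the list to your other frames):
import pandas as pd

# Filter each frame, collect the results, and concatenate once at the end.
frames = []
for df in [df_interested_freq, df_interested_freq_1]:  # ... add the remaining frames here
    frames.append(df[df['freq'] != 1])

combined = pd.concat(frames, ignore_index=True)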

Related

How to add the values of one dataframe to another dataframe?

I want to add a row of one dataframe to every row of another dataframe.
df1 = pd.DataFrame({"a": [1, 2],
                    "b": [3, 4]})
df2 = pd.DataFrame({"a": [4], "b": [5]})
I want to add the df2 values to every row of df1.
I use df1 + df2 and get the following result:
a b
0 5.0 8.0
1 NaN NaN
But I want to get the following result
a b
0 5 7
1 7 9
Any help would be dearly appreciated!
If you really need to add the values per column (this works because the number of columns in df2 is the same as the number of rows in df1), use:
df = df1.add(df2.loc[0].to_numpy(), axis=0)
print (df)
a b
0 5 7
1 7 9
If instead you need to add by rows, the values of df2's row are added column by column to each row of df1 (aligned by column labels), so the output is different:
df = df1.add(df2.loc[0], axis=1)
print (df)
a b
0 5 8
1 6 9
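For completeness, a small self-contained sketch comparing the two alignments on the question's data:
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [4], "b": [5]})

# Per column: treat df2's row as a plain array and add it down each column.
per_column = df1.add(df2.loc[0].to_numpy(), axis=0)   # a: [5, 7], b: [7, 9]

# Per row: align df2's row by column labels and add it to every row of df1.
per_row = df1.add(df2.loc[0], axis=1)                 # a: [5, 6], b: [8, 9]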

Way to combine single and multi index dataframes without multi-indexing the single index dataframe

Is there a clean and simple way to combine a multi-index and a single-index dataframe?
There are similar questions here and here, but both are old and have "messy" solutions.
I have a single-index dataframe:
df1 = pd.DataFrame({'single': [10,11,12], 'double': [7,8,9]})
single double
0 10 7
1 11 8
2 12 9
And I want to combine this with a series of multi-index dataframes that have empty columns with different column and sub-column indexes:
df2 = pd.DataFrame(columns = pd.MultiIndex.from_product([['happy'], ['very', 'not_much']]))
Empty DataFrame
Columns: [(happy, very), (happy, not_much)]
Index: []
Then in the next iteration I will add this to the two dataframes combined above, and so on:
df3 =pd.DataFrame(columns = pd.MultiIndex.from_product([['sad'], ['always', 'never']]))
Empty DataFrame
Columns: [(sad, always), (sad, never)]
Index: []
I have tried both append and concatenate but get this error for both:
TypeError: Expected tuple, got str
The end goal would be to get a dataframe looking like this:
happy sad
single double very not_much always never
0 10 7
1 11 8
2 12 9
I would just use concat and then post-process the columns:
result = pd.concat([df1, df2, df3], axis=1, sort=False)
result.columns = pd.MultiIndex.from_tuples(
    [('', i) if isinstance(i, str) else i for i in result.columns])
It gives the expected:
happy sad
single double very not_much always never
0 10 7 NaN NaN NaN NaN
1 11 8 NaN NaN NaN NaN
2 12 9 NaN NaN NaN NaN
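If the extra frames arrive one iteration at a time, as described in the question, the same two steps can simply be repeated per frame; a sketch under that assumption, reusing df1, df2 and df3 from above:
import pandas as pd

result = df1.copy()
for extra in (df2, df3):  # each new multi-index frame as it arrives
    result = pd.concat([result, extra], axis=1, sort=False)
    # Promote any remaining plain string columns into the ('', name) level.
    result.columns = pd.MultiIndex.from_tuples(
        [('', c) if isinstance(c, str) else c for c in result.columns])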

Pandas vectorization for a multiple data frame operation

I am looking to increase the speed of an operation within pandas and I have learned that it is generally best to do so via using vectorization. The problem I am looking for help with is vectorizing the following operation.
Setup:
df1 = a table with a date-time column, and city column
df2 = another (considerably larger) table with a date-time column, and city column
The Operation:
for i, row in df2.iterrows():
    for x, row2 in df1.iterrows():
        if row['date-time'] - row2['date-time'] > pd.Timedelta('8 hours') and row['city'] == row2['city']:
            df2.at[i, 'result'] = True
            break
As you might imagine, this operation is insanely slow on any dataset of a decent size. I am also just beginning to learn pandas vector operations and would like some help in figuring out a more optimal way to solve this problem.
I think what you need is merge() with numpy.where() to achieve the same result.
Since you don't have a reproducible sample in your question, kindly consider this:
>>> df1 = pd.DataFrame({'time':[24,20,15,10,5], 'city':['A','B','C','D','E']})
>>> df2 = pd.DataFrame({'time':[2,4,6,8,10,12,14], 'city':['A','B','C','F','G','H','D']})
>>> df1
time city
0 24 A
1 20 B
2 15 C
3 10 D
4 5 E
>>> df2
time city
0 2 A
1 4 B
2 6 C
3 8 F
4 10 G
5 12 H
6 14 D
From what I understand, you only need to get all the rows in your df2 whose city value also appears in df1, where the difference in the times is at least 9 hours (greater than 8 hours).
To do that, we need to merge on your city column:
>>> new_df = df2.merge(df1, how = 'inner', left_on = 'city', right_on = 'city')
>>> new_df
time_x city time_y
0 2 A 24
1 4 B 20
2 6 C 15
3 14 D 10
time_x basically is the time in your df2 dataframe, and time_y is from your df1.
Now we need to check the difference of those times and retain the rows where the difference is greater than 8, using numpy.where() to flag them so we can filter later:
>>> new_df['flag'] = np.where(new_df['time_y'] - new_df['time_x'] > 8, ['Retain'], ['Remove'])
>>> new_df
time_x city time_y flag
0 2 A 24 Retain
1 4 B 20 Retain
2 6 C 15 Retain
3 14 D 10 Remove
Now that you have that, you can simply filter new_df by the flag column, dropping the flag column from the final output:
>>> final_df = new_df[new_df['flag'].isin(['Retain'])][['time_x', 'city', 'time_y']]
>>> final_df
time_x city time_y
0 2 A 24
1 4 B 20
2 6 C 15
And there you go, no looping needed. Hope this helps :D
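If you don't need the intermediate flag column at all, the same filter can be written directly as a boolean mask on the merged frame:
>>> final_df = new_df[new_df['time_y'] - new_df['time_x'] > 8][['time_x', 'city', 'time_y']]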

How can I create a new column in a pandas pivot table with only matching values of populated columns?

I have a pandas pivot table that lists individuals in rows and data sources across the columns. There are hundreds of individuals going down amongst the rows and hundreds of sources going across along the columns.
Desired_Value Source_1 Source_2 Source_3 ... Source_50
person1 20 20 20 20
person2 5 5 5 5
person3 Review 3 4 4 4
...
person50 1 1 1
What I want to do is create the Desired_Value column above. I want to pull in a value so long as it matches across all values (ignoring blank fields). If values do not match I want to show Review.
Currently I use this pandas command to build the df that I print to Excel (without any Desired_Value column):
df13 = df12.pivot_table(index='person', columns = 'source_name', values = 'actual_data', aggfunc='first')
I'm new to Python so apologies if this is a silly question.
This is one method to do it:
import numpy as np

df = df13.copy()
df = df.astype('Int64')  # So NaN and integer values can coexist

# Create the new column and move it to the front of the data frame
df['Desired_Value'] = np.nan
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]

# Loop over all rows: take the single matching value, or flag the row for review
for idx, row in df.iterrows():
    val = row.dropna().unique()
    if len(val) == 1:
        df.loc[idx, 'Desired_Value'] = val[0]
    else:
        df.loc[idx, 'Desired_Value'] = 'Review'

print(df)
Desired_Value Source_1 Source_2 Source_3 Source_50
person1 20 20 20 NaN 20
person2 5 5 NaN 5 5
person3 Review 3 4 4 4
person50 1 1 NaN NaN 1
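A loop-free variant of the same idea, as a sketch (it assumes df13 from the question, and an all-blank row also ends up as 'Review'):
# Number of distinct non-blank values per row, and the first non-blank value.
nuniq = df13.nunique(axis=1, dropna=True)
first = df13.bfill(axis=1).iloc[:, 0]

# Keep the value where exactly one distinct value exists, otherwise mark 'Review'.
df13.insert(0, 'Desired_Value', first.where(nuniq == 1, 'Review'))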

Pandas dataframe: how to replace a row with one that has additional attributes

I have a method that adds additional attributes to a given pandas series and I want to update a row in the df with the returned series.
Let's say I have a simple dataframe:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
a b
0 1 3
1 2 4
and now I want to replace a row with one that has additional attributes; all other rows will show NaN for that column, for example:
subdf = df.loc[1]
subdf["newVal"] = "foo"
# subdf is created externally and returned. Now it must be updated.
df.loc[1] = subdf #or something
df would look like:
a b newVal
0 1 3 NaN
1 2 4 foo
Without loss of generality, first reindex and then assign with (i)loc:
df = df.reindex(subdf.index, axis=1)
df.iloc[-1] = subdf
df
a b newVal
0 1 3 NaN
1 2 4 foo
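If the returned Series can introduce columns that df doesn't have yet, and you don't want to assume it contains every original column, a slightly more general sketch is to union the columns first (same df and subdf as in the question):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

subdf = df.loc[1].copy()   # row returned from the external method
subdf['newVal'] = 'foo'    # ... with the extra attribute added

# Add any columns the row introduces (filled with NaN), then write the row back.
df = df.reindex(columns=df.columns.union(subdf.index, sort=False))
df.loc[1] = subdf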
