I have 2 dataframes:
df_A
country_codes
0 4
1 8
2 12
3 16
4 24
and df_B
continent_codes
0 4
1 3
2 5
3 6
4 5
Both dataframes have the same length but no common column. I want to concatenate the two, but since not all values are common, I get lots of NaNs. How do I concatenate or zip them up into a combined dataframe?
-- EDIT: the desired output is this:
country_codes continent_codes
0 4 4
1 8 3
2 12 5
3 16 6
4 24 5
The following code will do what you want:
pd.concat([df_A, df_B], axis=1)
Output:
country_codes continent_codes
0 4 4
1 8 3
2 12 5
3 16 6
4 24 5
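Note that concat aligns rows by index, which is the usual source of the NaNs described in the question: if the two dataframes carry different indexes, the rows won't pair up. A minimal defensive sketch, assuming you simply want to pair rows by position:
import pandas as pd

# Resetting both indexes makes the pairing purely positional, so no NaNs
# appear even if the original indexes differ
combined = pd.concat([df_A.reset_index(drop=True),
                      df_B.reset_index(drop=True)], axis=1)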
From the comments:
I feel like this is too simple, but may I suggest:
df_A['continent_codes'] = df_B['continent_codes']
print(df_A)
Output:
country_codes continent_codes
0 4 4
1 8 3
2 12 5
3 16 6
4 24 5
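Keep in mind that column assignment also aligns on the index, so if the two dataframes might carry different indexes, assigning the raw values sidesteps the alignment. A sketch, assuming both frames have the same length:
# .values strips the index, so rows are matched purely by position
df_A['continent_codes'] = df_B['continent_codes'].values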
I have a dataframe that looks like
ID feature
1 2
1 3
1 4
2 3
2 2
3 5
3 8
3 4
3 2
4 4
4 6
and I want to add a new column n_ID that counts the number of times each element occurs in the column ID, so the desired output should look like
ID feature n_ID
1 2 3
1 3 3
1 4 3
2 3 2
2 2 2
3 5 4
3 8 4
3 4 4
3 2 4
4 4 2
4 6 2
I know the .value_counts() method, but I don't know how to use it to build the new column. Thanks in advance.
Using value_counts... I was thinking of this (@sophocles, thanks for the transform tip :)):
df = pd.DataFrame({"ID":[1,1,1,2,2,3,3,3,3,4,4],
"feature":[1,2,3,4,5,6,7,8,9,10,11]})
df1 = pd.DataFrame(df["ID"].value_counts().reset_index())
df1.columns = ["ID","n_ID"]
df = df.merge(df1,how = "left",on="ID")
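The transform mentioned above avoids the intermediate dataframe and the merge entirely; a minimal sketch:
# groupby + transform broadcasts each group's size back onto every row
df['n_ID'] = df.groupby('ID')['ID'].transform('count')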
Just create a new column and count the occurrences using a lambda function:
Code:
# For every row, count how many times its ID appears in the whole column
df['n_id'] = df.apply(lambda x: df['ID'].tolist().count(x.ID), axis=1)
Output:
ID feature n_id
0 1 1 3
1 1 2 3
2 1 3 3
3 2 4 2
4 2 5 2
5 3 6 4
6 3 7 4
7 3 8 4
8 3 9 4
9 4 10 2
10 4 11 2
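Since the question specifically mentions .value_counts(), here is a sketch that uses it directly via map; it also computes the counts once instead of rebuilding the list for every row, which makes the apply version above quadratic in the number of rows:
# value_counts returns a Series indexed by ID, so map can look up each row's count
df['n_id'] = df['ID'].map(df['ID'].value_counts())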
I have a list of dataframes (each dataframe has one timeline, always starting with 0 and ending differently) which I would like to save as .csv:
I want to be able to read the .csv file with its original format as a list of dataframes.
Since I could not figure out how to save a list of dataframes, I concatenated the list and saved everything as one dataframe:
pd.concat(data).to_csv(csvfile)
For reading the .csv I tried this:
df = pd.read_csv(csvfile)
This will give the location of all zeros
zero_indices = list(df.loc[df['Unnamed: 0'] == 0].index)
Append the number of rows to this to get the last dataframe
zero_indices.append(len(df))
Get the ranges - tuples of consecutive entries in the above list
zero_ranges = [(zero_indices[i], zero_indices[i+1]) for i in range(len(zero_indices) - 1)]
Extract the dataframes into a list
X_test = [df.loc[x[0]:x[1] - 1] for x in zero_ranges]
The problem I have is that each dataframe in the final list keeps the running integer index, but what I actually want is for the column "Unnamed: 0" to be set as the index of each dataframe:
I am not entirely sure how you wanted to approach this, but this is what I understood from your problem statement. Let me know if it's what you wanted:
We have two df's :
>>> ee = {"Unnamed : 0" : [0,1,2,3,4,5,6,7,8],"price" : [43,43,14,6,4,2,6,4,2], "time" : [3,4,5,2,5,6,6,3,4], "hour" : [1,1,1,5,4,3,4,5,4]}
>>> one = pd.DataFrame.from_dict(ee)
>>> dd = {"Unnamed : 0" : [0,1,2,3,4,5],"price" : [23,4,32,4,3,234], "time" : [3,2,4,3,2,4], "hour" : [3,4,3,2,4,4]}
>>> two = pd.DataFrame.from_dict(dd)
Which look like this:
print(one)
Unnamed : 0 price time hour
0 0 43 3 1
1 1 43 4 1
2 2 14 5 1
3 3 6 2 5
4 4 4 5 4
5 5 2 6 3
6 6 6 6 4
7 7 4 3 5
8 8 2 4 4
print(two)
Unnamed : 0 price time hour
0 0 23 3 3
1 1 4 2 4
2 2 32 4 3
3 3 4 3 2
4 4 3 2 4
5 5 234 4 4
Now combine the two dataframes into a list:
list_dfs = [one,two]
print(list_dfs)
[ Unnamed : 0 price time hour
0 0 43 3 1
1 1 43 4 1
2 2 14 5 1
3 3 6 2 5
4 4 4 5 4
5 5 2 6 3
6 6 6 6 4
7 7 4 3 5
8 8 2 4 4,
Unnamed : 0 price time hour
0 0 23 3 3
1 1 4 2 4
2 2 32 4 3
3 3 4 3 2
4 4 3 2 4
5 5 234 4 4]
Using the DataFrame method set_index():
list_dfs_index = list(map(lambda x : x.set_index("Unnamed : 0"), list_dfs))
print(list_dfs_index)
[ price time hour
Unnamed : 0
0 43 3 1
1 43 4 1
2 14 5 1
3 6 2 5
4 4 5 4
5 2 6 3
6 6 6 4
7 4 3 5
8 2 4 4,
price time hour
Unnamed : 0
0 23 3 3
1 4 2 4
2 32 4 3
3 4 3 2
4 3 2 4
5 234 4 4]
Alternatively, you can use the same set_index call to set 'Unnamed : 0' as the index before putting the dataframes into the list.
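For the saving/reading round trip itself, a minimal sketch, assuming data is the original list of dataframes and csvfile the path, as in the question: write the concatenated frame with keys so every row remembers which dataframe it came from, then split on that key when reading back:
import pandas as pd

# Tag every row with the position of its source dataframe in the list
pd.concat(data, keys=range(len(data))).to_csv(csvfile)

# Read the two index columns back as a MultiIndex, then rebuild the list
# by grouping on the outer (key) level and dropping it afterwards
df = pd.read_csv(csvfile, index_col=[0, 1])
X_test = [g.droplevel(0) for _, g in df.groupby(level=0)]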
I'm trying to create a historical time series of a number of identifiers for a number of different metrics. As part of that, I'm trying to create a multi-index dataframe and then "fill" it with the individual dataframes.
Multi Index:
ID1 ID2
ITEM1 ITEM2 ITEM1 ITEM2
index
Dataframe to insert
ITEM1 ITEM2
Date
a
b
c
Looking through the official docs and this site, I found the following relevant question:
Add single index data frame to multi index data frame, Pandas, Python
and the associated pandas docs pages:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
https://pandas.pydata.org/pandas-docs/stable/advanced.html
I've managed with something like:
for i in df1.index:
    for j in df2.columns:
        df1.loc[i, (ID, j)] = df2.loc[i, j]
but it seems highly inefficient when I need to do this across circa 100 dataframes.
For some reason a simple
df1.loc[i, (ID)] = df2.loc[i]
doesn't seem to work, and neither does:
df1[ID1] = df1.append(df2)
which returns "Cannot set a frame with no defined index and a value that cannot be converted to a Series".
My understanding from looking around is that this is because I'm effectively leaving half the dataframe empty (a ragged list?).
Any help on how to iteratively populate my multi-index DataFrame would be greatly appreciated.
Let me know if I've missed relevant information,
cheers.
Setup
import pandas as pd

df1 = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6] * 2] * 3,
    columns=pd.MultiIndex.from_product(['ID1 ID2 ID3'.split(), range(4)])
)
df2 = df1.ID1 * 2
df1
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 1 2 3 4 5 6 1 2 3 4 5 6
1 1 2 3 4 5 6 1 2 3 4 5 6
2 1 2 3 4 5 6 1 2 3 4 5 6
df2
0 1 2 3
0 2 4 6 8
1 2 4 6 8
2 2 4 6 8
The problem is that Pandas is trying to line up indices (or columns in this case). We can do some transpose/join trickery but I'd rather avoid that.
Option 1
Take advantage of the fact that we can assign an array via loc as long as the shape matches up. We'd better make sure it does, and that the order of columns and index is correct; I use align with join='right' to do this, then assign the values of the aligned df2:
df1.loc[:, 'ID1'] = df2.align(df1.ID1, 'right')[0].values
df1
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 2 4 6 8 5 6 1 2 3 4 5 6
1 2 4 6 8 5 6 1 2 3 4 5 6
2 2 4 6 8 5 6 1 2 3 4 5 6
Option 2
Or, we can give df2 the additional level of column indexing that we need to line it up, then use update to replace the relevant cells in place:
df1.update(pd.concat({'ID1': df2}, axis=1))
df1
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 2 4 6 8 5 6 1 2 3 4 5 6
1 2 4 6 8 5 6 1 2 3 4 5 6
2 2 4 6 8 5 6 1 2 3 4 5 6
Option 3
A creative way using stack and assign, followed by unstack:
df1.stack().assign(ID1=df2.stack()).unstack()
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 2 4 6 8 5 6 1 2 3 4 5 6
1 2 4 6 8 5 6 1 2 3 4 5 6
2 2 4 6 8 5 6 1 2 3 4 5 6
I am using a pandas DataFrame as a lightweight dataset to maintain some state, and I need to dynamically/continuously merge new DataFrames into the existing table. Say I have the two datasets below:
df1:
a b
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
df2:
b c
0 10 11
1 12 13
2 14 15
3 16 17
4 18 19
I want to merge df2 into df1 (on the index), and for columns in common (in this case, 'b'), simply discard df2's copy of the common column.
a b c
0 0 1 11
1 2 3 13
2 4 5 15
3 6 7 17
4 8 9 19
My code checked the common columns between df1 and df2 using a set, and then manually dropped the common columns from df2. I wonder, is there a more efficient way to do this?
First identify the columns in df2 not in df1
cols = df2.columns.difference(df1.columns)
Then pd.DataFrame.join
df1.join(df2[cols])
a b c
0 0 1 11
1 2 3 13
2 4 5 15
3 6 7 17
4 8 9 19
Or pd.concat will also work
pd.concat([df1, df2[cols]], axis=1)
a b c
0 0 1 11
1 2 3 13
2 4 5 15
3 6 7 17
4 8 9 19
Pandas' merge function will also work, provided you tell it to join on the index. Called with no 'on' argument, merge joins on the columns the two dataframes have in common; here that is 'b', and since df1 and df2 share no 'b' values, an inner merge on it would come back empty. Reusing cols from above:
pd.merge(left=df1, right=df2[cols], left_index=True, right_index=True)
 a b c
0 0 1 11
1 2 3 13
2 4 5 15
3 6 7 17
4 8 9 19
Passing left_index=True and right_index=True makes merge align on the row index, and selecting df2[cols] discards the common column up front.
I have a dataframe that looks like this:
test_data = pd.DataFrame(np.array([np.arange(10)]*3).T, columns =['issuer_id','winner_id','gov'])
issuer_id winner_id gov
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
and a list of two-tuples consisting of a dataframe and a label encoding 'gov' (perhaps a label:dataframe dict would be better). In test_out below the two labels are 2 and 7.
test_out = [(pd.DataFrame(np.array([np.arange(10)]*2).T, columns =['id','partition']),2),(pd.DataFrame(np.array([np.arange(10)]*2).T, columns =['id','partition']),7)]
[( id partition
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9, 2), ( id partition
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9, 7)]
I want to add two columns to the test_data dataframe: issuer_partition and winner_partition
test_data['issuer_partition']=''
test_data['winner_partition']=''
and I would like to fill these in from the test_out list, where the entry in the gov column determines which labeled dataframe in test_out to draw from. I then look up the winner_id and issuer_id in that id/partition dataframe and write the corresponding partitions into test_data.
Put another way: I have a list of labeled dataframes that I would like to loop through to conditionally fill in data in a primary dataframe.
Is there a clever way to use merge in this scenario?
*edit - added another sentence and fixed test_out code
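One merge-based sketch, under the assumption that test_out stays a list of (dataframe, label) pairs and that the empty issuer_partition/winner_partition columns above have not been added yet (the merges create them): stack the labeled partition tables into a single lookup table with an explicit gov column, then merge it onto test_data once per id column.
import pandas as pd

# Stack the labeled partition tables, recording each table's gov label
lookup = pd.concat([d.assign(gov=label) for d, label in test_out],
                   ignore_index=True)

# Merge once for the issuer and once for the winner; rows whose gov
# matches no label simply keep NaN in the new partition columns
test_data = test_data.merge(
    lookup.rename(columns={'id': 'issuer_id', 'partition': 'issuer_partition'}),
    on=['issuer_id', 'gov'], how='left')
test_data = test_data.merge(
    lookup.rename(columns={'id': 'winner_id', 'partition': 'winner_partition'}),
    on=['winner_id', 'gov'], how='left')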