I have two files opened with Pandas. Where the first column of File2 contains a value from the first column of File1 (for example COP in G=COP), I want to paste the second-column data from File2 into the matching row of File1, and write 'NaN' where there is no match. Is there a way to do this?
File1
0 1
0 JCW 574
1 MBM 4212
2 COP 7424
3 KVI 4242
4 ECX 424
File2
0 1
0 G=COP d4ssd5vwe2e2
1 G=DDD dfd23e1rv515j5o
2 G=FEW cwdsuve615cdldl
3 G=JCW io55i5i55j8rrrg5f3r
4 G=RRR c84sdw5e5vwldk455
5 G=ECX j4ut84mnh54t65y
File1 (expected result):
0 1 2
0 JCW 574 io55i5i55j8rrrg5f3r
1 MBM 4212 NaN
2 COP 7424 d4ssd5vwe2e2
3 KVI 4242 NaN
4 ECX 424 j4ut84mnh54t65y
First use Series.str.extract to build a Series of the values in df2[0] that match the values in df1[0], then use it as the join key in a left DataFrame.merge:
import numpy as np
import pandas as pd

# header=None keeps the integer column labels (0, 1) shown above
df1 = pd.read_csv(file1, header=None)
df2 = pd.read_csv(file2, header=None)
s = df2[0].str.extract(f'({"|".join(df1[0])})', expand=False)
df = df1.merge(df2[[1]], how='left', left_on=0, right_on=s)
df.columns = np.arange(len(df.columns))
print (df)
0 1 2
0 JCW 574 io55i5i55j8rrrg5f3r
1 MBM 4212 NaN
2 COP 7424 d4ssd5vwe2e2
3 KVI 4242 NaN
4 ECX 424 j4ut84mnh54t65y
Or, if you need to match only the last 3 characters of df1[0], use:
s = df2[0].str.extract(f'({"|".join(df1[0].str[-3:])})', expand=False)
df = df1.merge(df2[[1]], how='left', left_on=0, right_on=s)
df.columns = np.arange(len(df.columns))
print (df)
Have a look at pandas' concat function with join='outer' (https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html). There is also a related question and answer that can help you.
It involves reindexing each of your data frames to use the column that is currently called "0" as the index, and then joining the two data frames on their indices, as in the sketch below.
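A minimal sketch of that idea, assuming the files are read with header=None (so the integer column names shown above are produced) and that File2's key only needs its "G=" prefix stripped to line up with File1's key; the file names are placeholders:
import pandas as pd

df1 = pd.read_csv('file1.csv', header=None)
df2 = pd.read_csv('file2.csv', header=None)

# Index both frames by the key column; strip the "G=" prefix so the keys match.
left = df1.set_index(0)
right = df2.set_index(df2[0].str.replace('G=', '', regex=False))[[1]].rename(columns={1: 2})

# join='outer' keeps keys from both files; reindex on left.index afterwards
# if only File1's rows are wanted, as in the expected output above.
joined = pd.concat([left, right], axis=1, join='outer')
print(joined)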
Also, may I suggest that you do not paste an image of your dataframes, but instead share the data in a form that other people can use to test their suggestions.
This question might be common, but I am new to Python and would like to learn from the community. I have two map files with mappings like this:
map1 : A --> B
map2 : B --> C,D,E
I want to create a new map file which will be A --> C
What is the most efficient way to achieve this in Python? A generic approach would be very helpful, as I need to apply the same logic to different files and different columns.
Example:
Map1:
1,100
2,453
3,200
Map2:
100,25,30,
200,300,,
250,190,20,1
My map3 should be:
1,25
2,0
3,300
As 453 is not present in map2, our map3 contains value 0 for key 2.
First create DataFrames:
df1 = pd.read_csv(Map1, header=None)
df2 = pd.read_csv(Map2, header=None)
Then use Series.map on the second column, mapping through a Series built from df2 with its first column set as the index, and finally replace missing values with 0 for the unmatched keys:
df1[1] = df1[1].map(df2.set_index(0)[1]).fillna(0, downcast='int')
print (df1)
0 1
0 1 25
1 2 0
2 3 300
EDIT: to map multiple columns, use a left join, drop the all-NaN columns with DataFrame.dropna, drop the join columns b and c, and finally replace the remaining missing values:
df1.columns=['a','b']
df2.columns=['c','d','e','f']
df = (df1.merge(df2, how='left', left_on='b', right_on='c')
.dropna(how='all', axis=1)
.drop(['b','c'], axis=1)
.fillna(0)
.convert_dtypes())
print (df)
a d e
0 1 25 30
1 2 0 0
2 3 300 0
I'm trying to count the number of elements from one group in one data frame and assign the result to a column in another data frame, based on a condition involving a column of that second data frame.
This is my first data frame that I need to update:
node name count
1 aaa-1-1
1 trg-3-4-5
2 bbb-2-2-4
3 ccc-3-3
This is the data frame that I'll use to count the values
node name
1 Empty-1-1-1
1 Empty-1-1-2
1 Empty-1-1-3
2 gbn-2-3-5
3 Empty-3-3-9
For each row in df1 I need to take the trailing part of the name (for example 1-1 from aaa-1-1) and count the rows in df2 that have the same node and a name containing 'EMPTY' plus that part, so the output will look like:
node name count
1 aaa-1-1 3
1 trg-3-4-5 0
2 bbb-2-2-4 0
3 ccc-3-3 1
To do that, I appended both data frames, grouped by node, and looped over each group to get the count:
df = df1.append(df2, ignore_index=True, sort=True)
for _, gdf in df.groupby('node'):
    # rows whose name has exactly two dashes, e.g. aaa-1-1
    cds = gdf[gdf['name'].str.count('-') == 2]
    count_map = {}
    for i, c in cds.iterrows():
        k = c['name'].split('-', 1)[-1] + '-'
        count_map[i] = gdf[gdf['name'].str.contains('EMPTY-' + k)].shape[0]
    for kk, vv in count_map.items():
        df.loc[kk, 'count'] = vv
return df
This function works and gets me the correct results, but it takes a very long time. I tried to merge both data frames and then count one column based on the other, but the merge does not give me the expected records. Is there any way I could optimize this function?
EDIT:
Searching between two data frames is really expensive for huge datasets, so I merged the two data frames into a Dask dataframe grouped by 'node'; now my search will be easier. What I have now is:
df_partitioned:
node  name1      name2        count
1     aaa-1-1    nan
1     trg-3-4-5  nan
1     nan        Empty-1-1-3
1     nan        Empty-1-1-1
1     nan        Empty-1-1-2
Now, in column name1, I'll keep the names that contain only two dashes, take the trailing part (1-1 in this case), and then count the number of elements in name2 that contain this string.
So my expected output will be:
node  name1      name2        count
1     aaa-1-1    nan          3
1     trg-3-4-5  nan          0
1     nan        Empty-1-1-3  nan
1     nan        Empty-1-1-1  nan
1     nan        Empty-1-1-2  nan
I split the 1-1 into a new column, but I'm not sure what I should do next :(
Try this (here df is the first data frame to update and df1 is the second one, whose rows are counted):
df['count'] = df['name'].apply(lambda x: df1['name'].str.contains(pd.Series(x).str.extract(r'(?:(\d-\d.*))$')[0][0]).sum())
output
node name count
0 1 aaa-1-1 3
1 1 trg-3-4-5 0
2 2 bbb-2-2-4 0
3 3 ccc-3-3 1
I want to merge a separate dataframe (df2) with the main dataframe (df1), but if, for a given row, the date in df1 does not exist in df2, then use the most recent date in df2 before the date in df1.
I tried to use pd.merge, but it would remove rows with unmatched dates, and only keep the rows that matched in both df's.
df1 = [['2007-01-01','A'],
['2007-01-02','B'],
['2007-01-03','C'],
['2007-01-04','B'],
['2007-01-06','C']]
df2 = [['2007-01-01','B',3],
['2007-01-02','A',4],
['2007-01-03','B',5],
['2007-01-06','C',3]]
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
df1[0] = pd.to_datetime(df1[0])
df2[0] = pd.to_datetime(df2[0])
Current result of pd.merge(df1, df2):
0 1 2
0 2007-01-06 C 3
It only keeps dates that match exactly in both df's; it does not consider values from earlier dates.
Expected df1:
0 1 2
0 2007-01-01 A NaN
1 2007-01-02 B 3
2 2007-01-03 C NaN
3 2007-01-04 B 3
4 2007-01-06 C 3
The NaNs appear because no data exists on or before that date in df2. For index row 1 it gets the data from one day before, while for index row 4 it gets data from exactly the same day.
Check your output by using merge_asof:
pd.merge_asof(df1,df2,on=0,by=1,allow_exact_matches=True)
Out[15]:
0 1 2
0 2007-01-01 A NaN
1 2007-01-02 B 3.0
2 2007-01-03 C NaN
3 2007-01-04 B 5.0 # this should be 5, since that row's date is closer; also note df2 has two B rows
4 2007-01-06 C 3.0
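A side note, not from the original answer: merge_asof requires both frames to be sorted on the on key. The sample data above is already sorted by date, but otherwise something like this would be needed first:
# sort both frames on the date column used as the asof key
df1 = df1.sort_values(0)
df2 = df2.sort_values(0)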
In your merge code, which I assume you have since it's not shown in your question, add the argument how='left' or how='outer'.
It should look like this:
dfmerged = pd.merge(df1, df2, how='left', left_on=['Date'], right_on=['Date'])
You can then use slicing and renaming to keep the columns you wish.
dfmerged = dfmerged[['Date', 'Letters', 'Numbers']]
Note: I do not know your column names since you haven't shown any code; substitute as necessary.
I have a number of CSV files where the head looks something like:
09/07/2014,26268315,,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,,
15/07/2014,205019,10.8607
I need to read this into a dataframe and remove any rows containing ,, but when I read the CSV data into a dataframe using:
df = pd.read_csv(raw_directory+'\\'+filename, error_bad_lines=False,header=None)
I get:
0 1 2 3
0 09/07/2014 26268315 NaN NaN
1 10/07/2014 6601181 16.3857 NaN
2 11/07/2014 916651 12.5879 NaN
3 14/07/2014 213357 NaN NaN
4 15/07/2014 205019 10.8607 NaN
How can I read the CSV data into a dataframe and get:
0
0 09/07/2014,26268315,,
1 10/07/2014,6601181,16.3857
2 11/07/2014,916651,12.5879
3 14/07/2014,213357,,
4 15/07/2014,205019,10.8607
I need to remove any rows where ,, is present and then resave the adjusted dataframe to a new CSV file. I was going to use:
stringList = [',,']
df = df[~df[0].isin(stringList)]
to remove the rows with ,, present so the resulting .csv head looks like:
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
15/07/2014,205019,10.8607
I guess it is possible here to remove all the columns that are entirely NaN and then drop the rows with any NaNs:
df = df.dropna(axis=1, how='all').dropna()
print (df)
0 1 2
1 10/07/2014 6601181 16.3857
2 11/07/2014 916651 12.5879
4 15/07/2014 205019 10.8607
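To resave the cleaned frame to a new CSV, as the question asks (the output filename here is just an example):
# write the surviving rows back out without the index or a header row
df.to_csv('cleaned.csv', index=False, header=False)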
Another solution is to pass a separator that does not appear in the data, like |, so each line is read as a single string, and then filter with endswith:
df = pd.read_csv(raw_directory+'\\'+filename, error_bad_lines=False,header=None, sep='|')
df = df[~df[0].str.endswith(',')]
#alternative solution - $ is for end of string
#df = df[~df[0].str.contains(',$')]
print (df)
0
1 10/07/2014,6601181,16.3857
2 11/07/2014,916651,12.5879
4 15/07/2014,205019,10.8607
I would like to merge nine Pandas dataframes together into a single dataframe, doing a join on two columns, controlling the column names. Is this possible?
I have nine datasets. All of them have the following columns:
org, name, items,spend
I want to join them into a single dataframe with the following columns:
org, name, items_df1, spend_df1, items_df2, spend_df2, items_df3...
I've been reading the documentation on merging and joining. I can currently merge two datasets together like this:
ad = pd.DataFrame.merge(df_presents, df_trees,
on=['practice', 'name'],
suffixes=['_presents', '_trees'])
This works great; doing print list(ad.columns.values) shows me the following columns:
[u'org', u'name', u'spend_presents', u'items_presents', u'spend_trees', u'items_trees'...]
But how can I do this for nine dataframes? merge only seems to accept two at a time, and if I do it sequentially, my column names are going to end up very messy.
You could use functools.reduce to iteratively apply pd.merge to each of the DataFrames:
result = functools.reduce(merge, dfs)
This is equivalent to
result = dfs[0]
for df in dfs[1:]:
result = merge(result, df)
To pass the on=['org', 'name'] argument, you could use functools.partial to define the merge function:
merge = functools.partial(pd.merge, on=['org', 'name'])
Since specifying the suffixes parameter in functools.partial would only allow
one fixed choice of suffix, and since here we need a different suffix for each
pd.merge call, I think it would be easiest to prepare the DataFrames column
names before calling pd.merge:
for i, df in enumerate(dfs, start=1):
df.rename(columns={col:'{}_df{}'.format(col, i) for col in ('items', 'spend')},
inplace=True)
For example,
import pandas as pd
import numpy as np
import functools
np.random.seed(2015)
N = 50
dfs = [pd.DataFrame(np.random.randint(5, size=(N,4)),
columns=['org', 'name', 'items', 'spend']) for i in range(9)]
for i, df in enumerate(dfs, start=1):
df.rename(columns={col:'{}_df{}'.format(col, i) for col in ('items', 'spend')},
inplace=True)
merge = functools.partial(pd.merge, on=['org', 'name'])
result = functools.reduce(merge, dfs)
print(result.head())
yields
org name items_df1 spend_df1 items_df2 spend_df2 items_df3 \
0 2 4 4 2 3 0 1
1 2 4 4 2 3 0 1
2 2 4 4 2 3 0 1
3 2 4 4 2 3 0 1
4 2 4 4 2 3 0 1
spend_df3 items_df4 spend_df4 items_df5 spend_df5 items_df6 \
0 3 1 0 1 0 4
1 3 1 0 1 0 4
2 3 1 0 1 0 4
3 3 1 0 1 0 4
4 3 1 0 1 0 4
spend_df6 items_df7 spend_df7 items_df8 spend_df8 items_df9 spend_df9
0 3 4 1 3 0 1 2
1 3 4 1 3 0 0 3
2 3 4 1 3 0 0 0
3 3 3 1 3 0 1 2
4 3 3 1 3 0 0 3
Would doing a big pd.concat() and then renaming all the columns work for you? Something like:
desired_columns = ['items', 'spend']
big_df = pd.concat([df1, df2[desired_columns], ..., dfN[desired_columns]], axis=1)
new_columns = ['org', 'name']
for i in range(num_dataframes):
new_columns.extend(['spend_df%i' % i, 'items_df%i' % i])
big_df.columns = new_columns
This should give you columns like:
org, name, spend_df0, items_df0, spend_df1, items_df1, ..., spend_df8, items_df8
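One caveat worth adding, not part of the original answer: pd.concat with axis=1 aligns rows by index, so this sketch assumes every frame lists the same org/name pairs in the same order. Setting ['org', 'name'] as the index first makes the alignment explicit (dfs is an assumed list of the nine frames):
# align on the join keys before concatenating, instead of relying on row order
aligned = [df.set_index(['org', 'name']) for df in dfs]
big_df = pd.concat(aligned, axis=1)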
I've wanted this as well at times but been unable to find a built-in pandas way of doing it. Here is my suggestion (and my plan for the next time I need it):
Create an empty dictionary, merge_dict.
Loop through the index you want for each of your data frames and add the desired values to the dictionary with the index as the key.
Generate a new index as sorted(merge_dict).
Generate a new list of data for each column by looping through merge_dict.items().
Create a new data frame with index=sorted(merge_dict) and columns created in the previous step.
Basically, this is somewhat like a hash join in SQL. It seems like the most efficient way I can think of and shouldn't take too long to code up; a rough sketch follows.
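A minimal sketch of those steps, assuming the frames share the key columns ['org', 'name'] and the items/spend columns get a _dfN suffix; the two tiny frames below are made-up stand-ins for the nine real ones:
import pandas as pd

dfs = [
    pd.DataFrame({'org': [1, 2], 'name': ['a', 'b'], 'items': [3, 4], 'spend': [10, 20]}),
    pd.DataFrame({'org': [1, 2], 'name': ['a', 'b'], 'items': [5, 6], 'spend': [30, 40]}),
]

merge_dict = {}
for i, frame in enumerate(dfs, start=1):
    for _, row in frame.iterrows():
        key = (row['org'], row['name'])            # the join key acts as the dictionary key
        entry = merge_dict.setdefault(key, {})
        entry['items_df{}'.format(i)] = row['items']
        entry['spend_df{}'.format(i)] = row['spend']

# build the new frame from the sorted keys and the collected column values
index = sorted(merge_dict)
result = pd.DataFrame([merge_dict[k] for k in index],
                      index=pd.MultiIndex.from_tuples(index, names=['org', 'name']))
print(result)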
Good luck.