Pandas append DataFrame2 ROW to DataFrame1 ROW - python

I want to append rows from a second DataFrame (df2) to the first DataFrame (df1) wherever df1's column "isValid" is [T]rue.
I know how to iterate over the df1 column and search for True values, but I don't know how to easily append the rows from the second DataFrame. My real data has around 1000 rows and 40 columns, so I need to do this automatically.
import pandas

df1 = pandas.read_csv('df1.csv', sep=';')
df2 = pandas.read_csv('df2.csv', sep=';')
print(df1.to_string(), '\n')
print(df2.to_string(), '\n')

columnSeriesObj = df1.iloc[:, 2]  # the "isValid" column
n = 0  # next row of df2 to take
k = 0  # current row of df1
for i in columnSeriesObj:
    if i == "T":
        print("True in row number", k)
        # APPEND n-th ROW from df2 to k-th ROW from df1
        n += 1
    k += 1
print('\n', df1.to_string())
Here are some test values:
df1.csv
DataA;DataB;isValid
1568;1104;F
1224;1213;F
1676;1246;F
1279;1489;T
1437;1890;T
1705;1007;F
1075;1720;F
1361;1983;F
1966;1751;F
1938;1564;F
1894;1684;F
1189;1803;F
1275;1138;F
1085;1748;T
1337;1775;T
1719;1975;F
1045;1187;F
1426;1757;F
1410;1363;F
1405;1025;F
1699;1873;F
1777;1464;F
1925;1310;T
df2.csv
Nr;X;Y;Z;A ;B;C
1;195;319;18;qwe;hjk;wsx
2;268;284;23;rty;zxc;edc
3;285;277;36;uio;vbn;rfv
4;143;369;34;asd;mlp;tgb
5;290;247;16;fgh;qaz;yhn
I want df1 after appending to look like this (screenshot from Excel):
Thank you for any suggestions! :D

You can filter the index values in df1 where the column isValid equals T, set that filtered index on df2, and finally join it with df1:
m = df1['isValid'].eq('T')               # boolean mask of rows where isValid is 'T'
idx = m[m].index[:len(df2)]              # index labels of the first len(df2) True rows
df1.join(df2.set_index(idx)).fillna('')  # align df2 to those labels and join
DataA DataB isValid Nr X Y Z A B C
0 1568 1104 F
1 1224 1213 F
2 1676 1246 F
3 1279 1489 T 1 195 319 18 qwe hjk wsx
4 1437 1890 T 2 268 284 23 rty zxc edc
5 1705 1007 F
6 1075 1720 F
7 1361 1983 F
8 1966 1751 F
9 1938 1564 F
10 1894 1684 F
11 1189 1803 F
12 1275 1138 F
13 1085 1748 T 3 285 277 36 uio vbn rfv
14 1337 1775 T 4 143 369 34 asd mlp tgb
15 1719 1975 F
16 1045 1187 F
17 1426 1757 F
18 1410 1363 F
19 1405 1025 F
20 1699 1873 F
21 1777 1464 F
22 1925 1310 T 5 290 247 16 fgh qaz yhn

I suggest the following:
I created some dummy data, similar to yours:
import pandas as pd
import random
df = pd.DataFrame({"a": list(range(20)), "b": [random.choice(("T", "F")) for _ in range(20)]})
df2 = pd.DataFrame({"value1": list(range(5)), "nr": list(range(5))})
First you create a new column in the first dataframe that holds the incrementing ID ("Nr"). To do so, use the count generator from itertools.
from itertools import count
counter = count(start=1)
df["id"] = df.apply(lambda row: next(counter) if row["b"] == "T" else None, axis=1)
After that you can perform a join with the merge method.
df.merge(df2, left_on="id", right_on="nr", how="outer")
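Applied to the question's actual data, that would look roughly like this (a sketch; it assumes df1 and df2 are loaded as in the question and that df2's "Nr" column counts 1, 2, 3, ...):
from itertools import count

counter = count(start=1)
# number only the rows where isValid is "T"
df1["id"] = df1.apply(lambda row: next(counter) if row["isValid"] == "T" else None, axis=1)
# a left join keeps every row of df1; rows without an id get NaN for the df2 columns
result = df1.merge(df2, left_on="id", right_on="Nr", how="left").drop(columns="id")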

How about something like this:
(i.e. first find the overlapping index values and then join the dataframes)
import pandas as pd
import numpy as np

df1 = pd.read_csv("df1.csv", sep=';')
df2 = pd.read_csv("df2.csv", sep=';')

# find intersecting indices
useidx = np.intersect1d(df2.index,
                        df1[df1.isValid == 'T'].index)

# join relevant values
df_joined = df1.join(df2.loc[useidx])
df_joined then looks like this:
>>> DataA DataB isValid Nr X Y Z A B C
>>> 0 1568 1104 F NaN NaN NaN NaN NaN NaN NaN
>>> 1 1224 1213 F NaN NaN NaN NaN NaN NaN NaN
>>> 2 1676 1246 F NaN NaN NaN NaN NaN NaN NaN
>>> 3 1279 1489 T 4.0 143.0 369.0 34.0 asd mlp tgb
>>> 4 1437 1890 T 5.0 290.0 247.0 16.0 fgh qaz yhn
>>> 5 1705 1007 F NaN NaN NaN NaN NaN NaN NaN
>>> 6 1075 1720 F NaN NaN NaN NaN NaN NaN NaN
>>> 7 1361 1983 F NaN NaN NaN NaN NaN NaN NaN

Related

Calculate the sum of values replacing NaN

I have a data frame with some NaNs in column B.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [654, 987, 321, 654, 987, 15, 98, 338],
    'B': [987, np.nan, 741, np.nan, 65, 35, 94, np.nan]})
df
A B
0 654 987.0
1 987 NaN
2 321 741.0
3 654 NaN
4 987 65.0
5 15 35.0
6 98 94.0
7 338 NaN
I replace the NaNs in B with the numbers from A:
df.B.fillna(df.A, inplace = True)
df
A B
0 654 987.0
1 987 987.0
2 321 741.0
3 654 654.0
4 987 65.0
5 15 35.0
6 98 94.0
7 338 338.0
What's the easiest way to calculate the sum of the values that have replaced the NaNs in B?
You can use Series.isna() with .loc[] to filter column A where column B is null, and then sum:
df.loc[df['B'].isna(),'A'].sum()
Alternative:
df['B'].fillna(df['A']).sum() - df['B'].sum()
Note: you should do this before the inplace fillna, or preferably create a copy and save it under a different variable for later reference.
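A minimal sketch of that copy-first workflow (the variable names here are illustrative, not from the question):
orig_B = df['B'].copy()                          # keep the original column for later reference
df['B'] = df['B'].fillna(df['A'])                # fill NaNs in B from A
replaced_sum = df.loc[orig_B.isna(), 'A'].sum()  # sum of the values that filled the NaNs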
Try the function math.isnan to check for NaN values.
import numpy as np
import pandas as pd
import math

df = pd.DataFrame({
    'A': [654, 987, 321, 654, 987, 15, 98, 338],
    'B': [987, np.nan, 741, np.nan, 65, 35, 94, np.nan]})

for i in range(0, len(df['B'])):
    if math.isnan(df['B'][i]):
        df['B'][i] = df['A'][i]
print(df)
Output :
A B
0 654 987.0
1 987 987.0
2 321 741.0
3 654 654.0
4 987 65.0
5 15 35.0
6 98 94.0
7 338 338.0

Read CSV into a dataFrame with varying row lengths using Pandas

So I have a CSV that looks a bit like this:
1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454
...
And when I try to use the following code to generate a DataFrame...
df = pd.read_csv('data.csv', header=0, engine='c', error_bad_lines=False)
It only adds rows with 3 columns to the df (rows 1, 3 and 5 from above)
The rest are considered 'bad lines' giving me the following error:
Skipping line 17467: expected 3 fields, saw 9
How do I create a data frame that includes all data in my csv, possibly just filling in the empty cells with null? Or do I have to declare the max row length prior to adding to the df?
Thanks!
If using only pandas, read in the lines first and deal with the separator afterwards.
import pandas as pd
df = pd.read_csv('data.csv', header=None, sep='\n')
df = df[0].str.split(r'\s\|\s', expand=True)
0 1 2 3 4 5 6
0 1 01-01-2019 724 None None None None
1 2 01-01-2019 233 436 None None None
2 3 01-01-2019 345 None None None None
3 4 01-01-2019 803 933 943 923 954
4 5 01-01-2019 454 None None None None
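Note that newer pandas versions may reject sep='\n' as a separator; if that happens, one way to build the same single-column frame is to read the lines yourself (a sketch):
import pandas as pd

with open('data.csv') as f:
    lines = pd.Series(f.read().splitlines())
df = lines.str.split(r'\s\|\s', expand=True)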
If you know that the data contains N columns, you can
tell Pandas in advance how many columns to expect via the names parameter:
import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(7)))
print(df)
yields
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
If you have an upper limit, N, on the number of columns, then you can have Pandas read N columns and then use dropna to drop completely empty columns:
import pandas as pd
df = pd.read_csv('data', delimiter='|', names=list(range(20))).dropna(axis='columns', how='all')
print(df)
yields
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
Note that this could drop columns from the middle of the data set (not just
columns from the right-hand side) if they are completely empty.
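A small illustration of that caveat with made-up data (the file contents below are hypothetical):
import io
import pandas as pd

s = "1|x||9\n2|y||8\n"   # column 2 is empty in every row
df = pd.read_csv(io.StringIO(s), delimiter='|', names=list(range(6)))
print(df.dropna(axis='columns', how='all'))
# the empty middle column (2) is dropped along with the unused trailing columns 4 and 5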
read_fwf (read fixed-width) should work:
from io import StringIO
s = '''1 01-01-2019 724
2 01-01-2019 233 436
3 01-01-2019 345
4 01-01-2019 803 933 943 923 954
5 01-01-2019 454'''
pd.read_fwf(StringIO(s), header=None)
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
or with a delimiter param
s = '''1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454'''
pd.read_fwf(StringIO(s), header=None, delimiter='|')
0 1 2 3 4 5 6
0 1 01-01-2019 724 NaN NaN NaN NaN
1 2 01-01-2019 233 436.0 NaN NaN NaN
2 3 01-01-2019 345 NaN NaN NaN NaN
3 4 01-01-2019 803 933.0 943.0 923.0 954.0
4 5 01-01-2019 454 NaN NaN NaN NaN
Note that for your actual file you will not use StringIO; just replace it with your file path: pd.read_fwf('data.csv', delimiter='|', header=None)
Add extra columns (empty or otherwise) to the top of your csv file. Pandas will take the first row as the default width, and anything below it will have NaN values. Example:
file.csv:
a,b,c,d,e
1,2,3
3
2,3,4
code:
>>> import pandas as pd
>>> pd.read_csv('file.csv')
a b c d e
0 1 2.0 3.0 NaN NaN
1 3 NaN NaN NaN NaN
2 2 3.0 4.0 NaN NaN
Consider using Python csv to do the lifting for importing data and format grooming. You can implement a custom dialect to handle varying csv-ness.
import csv
import pandas as pd

csv_data = """1 | 01-01-2019 | 724
2 | 01-01-2019 | 233 | 436
3 | 01-01-2019 | 345
4 | 01-01-2019 | 803 | 933 | 943 | 923 | 954
5 | 01-01-2019 | 454"""

with open('test1.csv', 'w') as f:
    f.write(csv_data)

csv.register_dialect('PipeDialect', delimiter='|')
with open('test1.csv') as csvfile:
    data = [row for row in csv.reader(csvfile, 'PipeDialect')]
df = pd.DataFrame(data=data)
Gives you a csv import dialect and the following DataFrame:
0 1 2 3 4 5 6
0 1 01-01-2019 724 None None None None
1 2 01-01-2019 233 436 None None None
2 3 01-01-2019 345 None None None None
3 4 01-01-2019 803 933 943 923 954
4 5 01-01-2019 454 None None None None
Left as an exercise is handling the whitespace padding in the input file.
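One way to handle the padding (a sketch that builds on the csv.reader loop above) is to strip each field as it is read:
with open('test1.csv') as csvfile:
    data = [[field.strip() for field in row]
            for row in csv.reader(csvfile, 'PipeDialect')]
df = pd.DataFrame(data=data)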
colnames = [str(i) for i in range(9)]
df = pd.read_table('data.csv', header=None, sep=',', names=colnames)
Change the 9 in colnames to x if the code gives the error:
Skipping line 17467: expected 3 fields, saw x

Extract sub-DataFrames

I have this kind of dataframe in Pandas :
NaN
1
NaN
452
1175
12
NaN
NaN
NaN
145
125
NaN
1259
2178
2514
1
On the other hand I have this other dataframe :
1
2
3
4
5
6
I would like to separate the first one into different sub-dataframes like this:
DataFrame 1:
1
DataFrame 2:
452
1175
12
DataFrame 3:
DataFrame 4:
DataFrame 5:
145
125
DataFrame 6:
1259
2178
2514
1
How can I do that without a loop?
UPDATE: thanks to @piRSquared for pointing out that the original solution below will not work for DFs/Series with non-numeric indexes. Here is a more generic solution:
dfs = [x.dropna()
       for x in np.split(df, np.arange(len(df))[df['column'].isnull().values])]
OLD answer:
IIUC you can do something like this:
Source DF:
In [40]: df
Out[40]:
column
0 NaN
1 1.0
2 NaN
3 452.0
4 1175.0
5 12.0
6 NaN
7 NaN
8 NaN
9 145.0
10 125.0
11 NaN
12 1259.0
13 2178.0
14 2514.0
15 1.0
Solution:
In [31]: dfs = [x.dropna()
for x in np.split(df, df.index[df['column'].isnull()].values+1)]
In [32]: dfs[0]
Out[32]:
Empty DataFrame
Columns: [column]
Index: []
In [33]: dfs[1]
Out[33]:
column
1 1.0
In [34]: dfs[2]
Out[34]:
column
3 452.0
4 1175.0
5 12.0
In [35]: dfs[3]
Out[35]:
Empty DataFrame
Columns: [column]
Index: []
In [36]: dfs[4]
Out[36]:
Empty DataFrame
Columns: [column]
Index: []
In [38]: dfs[5]
Out[38]:
column
9 145.0
10 125.0
In [39]: dfs[6]
Out[39]:
column
12 1259.0
13 2178.0
14 2514.0
15 1.0
# positions of the NaN separators, plus the end of the frame
w = np.append(np.where(np.isnan(df.iloc[:, 0].values))[0], len(df))
# slice out the chunks between consecutive separators
splits = {'DataFrame{}'.format(c): df.iloc[i+1:j]
          for c, (i, j) in enumerate(zip(w, w[1:]))}
Print out splits to demonstrate
for k, v in splits.items():
    print(k)
    print(v)
    print()
DataFrame0
0
1 1.0
DataFrame1
0
3 452.0
4 1175.0
5 12.0
DataFrame2
Empty DataFrame
Columns: [0]
Index: []
DataFrame3
Empty DataFrame
Columns: [0]
Index: []
DataFrame4
0
9 145.0
10 125.0
DataFrame5
0
12 1259.0
13 2178.0
14 2514.0
15 1.0

Remove special characters in pandas dataframe

This seems like an inherently simple task, but I am finding it very difficult to remove the '*' from my entire data frame and return the numeric values in each column, including the numbers that did not have '*'. The dataframe includes hundreds more columns and looks like this in short:
Time A1 A2
2.0002546296 1499 1592
2.0006712963 1252 1459
2.0902546296 1731 2223
2.0906828704 1691 1904
2.1742245370 2364 3121
2.1764699074 2096 1942
2.7654050926 *7639* *8196*
2.7658564815 *7088* *7542*
2.9048958333 *8736* *8459*
2.9053125000 *7778* *7704*
2.9807175926 *6612* *6593*
3.0585763889 *8520* *9122*
I have not written it to iterate over every column in df yet, but as far as the first column goes I have come up with this:
df['A1'].str.replace('*','').astype(float)
which yields
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 7639.0
20 7088.0
21 8736.0
22 7778.0
23 6612.0
24 8520.0
Is there a very easy way to just remove the '*' in the dataframe in pandas?
Use replace, which applies to the whole dataframe:
df
Out[14]:
Time A1 A2
0 2.000255 1499 1592
1 2.176470 2096 1942
2 2.765405 *7639* *8196*
3 2.765856 *7088* *7542*
4 2.904896 *8736* *8459*
5 2.905312 *7778* *7704*
6 2.980718 *6612* *6593*
7 3.058576 *8520* *9122*
df = df.replace(r'\*', '', regex=True).astype(float)
df
Out[16]:
Time A1 A2
0 2.000255 1499 1592
1 2.176470 2096 1942
2 2.765405 7639 8196
3 2.765856 7088 7542
4 2.904896 8736 8459
5 2.905312 7778 7704
6 2.980718 6612 6593
7 3.058576 8520 9122
I found this to be a simple approach: use replace to retain only the digits (plus the dot and minus sign).
This removes letters, symbols, or anything else not matched by the to_replace pattern.
So, the solution is:
df['A1'].replace(regex=True, inplace=True, to_replace=r'[^0-9.\-]', value=r'']
df['A1'] = df['A1'].astype(float64)
There is another solution which uses the map and strip functions.
You can see the link below:
Pandas DataFrame: remove unwanted parts from strings in a column.
df =
Time A1 A2
0 2.0 1258 *1364*
1 2.1 *1254* 2002
2 2.2 1520 3364
3 2.3 *300* *10056*
cols = ['A1', 'A2']
for col in cols:
    df[col] = df[col].map(lambda x: str(x).lstrip('*').rstrip('*')).astype(float)
df =
Time A1 A2
0 2.0 1258 1364
1 2.1 1254 2002
2 2.2 1520 3364
3 2.3 300 10056
The parsing procedure is only applied to the desired columns.
I found the answer of CuriousCoder brief and useful, but there must be a ')' instead of ']'.
So it should be:
df['A1'].replace(regex=True, inplace=True, to_replace=r'[^0-9.\-]', value=r'')
df['A1'] = df['A1'].astype(float)

How to prevent automatic assignment of values to missing data imported from SPSS

Let's say I have an SPSS file named "ab.sav" which looks like this:
gender value value2
F 433 329
. . 787
. . .
M 121 .
F 311 120
. . 899
M 341 .
In SPSS (Variable View) I defined the labels of gender with the values 1 and 2 for M and F respectively.
When I load this in python using the following commands:
>>> from rpy2.robjects.packages import importr
>>> from rpy2.robjects import pandas2ri
>>> foreign=importr("foreign")
>>> data=foreign.read_spss("ab.sav", to_data_frame=True, use_value_labels=True)
>>> pandas2ri.activate()
>>> data2=pandas2ri.ri2py(data)
I get the following dataframe:
>>> data2
gender value value2
0 F 433 329
1 M NaN 787
2 M NaN NaN
3 M 121 NaN
4 F 311 120
5 M NaN 899
6 M 341 NaN
So the missing values in the gender column are replaced by the known value of a subsequent case. Is there a simple way to prevent this?
When I change use_value_labels to False I get the expected result though:
>>> data2
gender value value2
0 2 433 329
1 NaN NaN 787
2 NaN NaN NaN
3 1 121 NaN
4 2 311 120
5 NaN NaN 899
6 1 341 NaN
However I'd like to be able to use the labels instead of numeric values for gender as above. Ideally the output should be:
>>> data2
gender value value2
0 F 433 329
1 NaN NaN 787
2 NaN NaN NaN
3 M 121 NaN
4 F 311 120
5 NaN NaN 899
6 M 341 NaN
Assuming data2 is a pandas DataFrame, and there's a 1-to-1 mapping between nulls in value and gender, you can do the following:
import numpy as np
import pandas

nulls = pandas.isnull(data2['value'])
data2.loc[nulls, 'gender'] = np.nan
And that turns it into:
gender value value2
0 F 433 329
1 NaN NaN 787
2 NaN NaN NaN
3 M 121 NaN
4 F 311 120
5 NaN NaN 899
6 M 341 NaN
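If the 1-to-1 assumption does not hold, another option (a sketch, assuming the 1 = M / 2 = F coding described in the question) is to read the file with use_value_labels=False, which preserves the NaNs, and then map the numeric codes to labels yourself:
data = foreign.read_spss("ab.sav", to_data_frame=True, use_value_labels=False)
data2 = pandas2ri.ri2py(data)
# unmapped codes (the NaNs) stay NaN
data2['gender'] = data2['gender'].map({1.0: 'M', 2.0: 'F'})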
