I have the following file format:
7.2393690416406E+000 1.0690994646755E+001 3.1429089063731E+000
-2.7606309583594E+000 1.0690994646755E+001 1.3142908906373E+001
That is: in the first column, non-negative values are preceded by one space, while negative values have no leading space. Therefore, if you read the file with code like the following:
df = pd.read_csv('example.csv',header=None,engine='python',sep=' ')
You will get something like this:
1 NaN 7.239369 10.690995 3.142909
2 -2.760631 10.690995 13.142909 NaN
This happens because pandas sees the leading space and treats it as a column separator. The dataframe does contain all the values, but every row whose first column is negative is shifted by one column relative to the rest. How can I fix it? How can I get a clean dataframe like the following?
1 7.239369 10.690995 3.142909
2 -2.760631 10.690995 13.142909
Use sep='\s+', which treats any run of whitespace as a single delimiter and ignores leading whitespace:
df = pd.read_csv('test.csv', header=None, sep='\s+')
0 1 2
0 7.239369 10.690995 3.142909
1 -2.760631 10.690995 13.142909
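Note that delim_whitespace=True is an equivalent spelling of the same idea (it has since been deprecated in recent pandas releases in favor of sep='\s+'):
df = pd.read_csv('test.csv', header=None, delim_whitespace=True)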
I am reading a fixed-width file into a pandas dataframe, but I notice that the data is not being properly stored into the dataframe. The cells in the dataframe are being restricted to 127 characters.
Input file:
Column 1 Column 2 Column 3
*see sentence below 18.0 True
this sentence is under 127 characters 12.0 False
For the sentence over 127 characters, imagine the sentence is this:
You think darkness is your ally. But you merely adopted the dark; I was born in it. Moulded by it. I didn't see the light until I was already a man. By then it was nothing to me but blinding!
Code:
df = pd.read_fwf(input_file_path, index_col=False)
df.to_csv('output.csv', index=False, encoding='utf8')
Output CSV:
Column 1,Column 2,Column 3
You think darkness is your ally. But you merely adopted the dark; I was born in it. Moulded by it. I didn't see the light until,18.0,True
this sentence is under 127 characters,12.0,False
Is there an argument I can put into the read_fwf to fix this issue, or is it likely just the autoparsing being problematic and cutting off too soon? Thanks!
Edit: I see that in my own version of the file I am reading, the long lines appear more than 100 rows below some much shorter lines. I believe that because colspecs='infer' by default only inspects the first 100 rows, the column specs are not being determined properly, and hence the longer values farther down get cut off. Does anyone have suggestions for this?
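If that diagnosis is right, one direct option worth trying is read_fwf's infer_nrows parameter (default 100), which controls how many rows are inspected when inferring column widths; a minimal sketch:
import pandas as pd
# let the width inference see far more rows than the default 100
df = pd.read_fwf(input_file_path, index_col=False, infer_nrows=10000)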
As an alternative, you could read the input file with Python's readlines. Then apply pandas' str.rsplit with n=2, which splits on whitespace by default (or on any other pattern passed via the pat parameter) consistent with your last two columns, using expand=True to split the data into separate columns.
import pandas as pd
with open('sample.csv') as f:
    data = f.readlines()
df = pd.DataFrame(data[1:])  # discard the header row
print(df)
# use expand to split strings into separate columns
df_out = df[0].str.rsplit(n=2, expand=True)
# fix column names
df_out.columns = [f'Column_{i+1}' for i in df_out.columns]
df_out['LEN'] = df_out['Column_1'].apply(len)
print(df_out)
Output df_out
Column_1 Column_2 Column_3 LEN
0 You think darkness is your ally. But you merel... 18.0 True 191
1 this sentence is under 127 characters 12.0 False 37
2 Hope. Every man who has rotted here over the c... 847.11 True 498
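One caveat with this approach: after rsplit with expand=True every resulting column is a string, so if you need real numeric or boolean types you have to convert them yourself; a small sketch using the column names created above:
# rsplit yields strings; convert the last two columns explicitly
df_out['Column_2'] = df_out['Column_2'].astype(float)
df_out['Column_3'] = df_out['Column_3'].map({'True': True, 'False': False})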
I have two csv files. csv1 looks like this:
Title,glide gscore,IFDScore
235,-9.01,-1020.18
235,-8.759,-1020.01
235,-7.301,-1019.28
while csv2 looks like this:
ID,smiles,number
28604361,NC(=O)CNC(=O)CC(c(cc1)cc(c12)OCO2)c3ccccc3,102
14492699,COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C,235
16888863,COc1cc(ccc1O)CN2CCN(CC=C(C)C)C(C2)CCO,108
Both are much larger than what I show here. I need some way to match each value in the Title column of csv1 to the corresponding value in the number column of csv2. When a match is found, I need the value in the Title column of csv1 to be replaced with the corresponding value in the ID column of csv2. Thus I would want my desired output to be:
Title,glide gscore,IFDScore
14492699,-9.01,-1020.18
14492699,-8.759,-1020.01
14492699,-7.301,-1019.28
I am looking for a way to do it through pandas, bash or python.
This answer is close but gives me an "ambiguous truth value of a DataFrame" error.
I also tried update in pandas without luck.
I'm not pasting the exact code I've tried yet because it would be overwhelming to see faulty code in pandas, bash and python all at once.
You could map it; then use fillna in case there were any "Titles" that did not have a matching "number":
csv1 = pd.read_csv('first_csv.csv')
csv2 = pd.read_csv('second_csv.csv')
csv1['Title'] = csv1['Title'].map(csv2.set_index('number')['ID']).fillna(csv1['Title']).astype(int)
Output:
Title glide gscore IFDScore
0 14492699 -9.010 -1020.18
1 14492699 -8.759 -1020.01
2 14492699 -7.301 -1019.28
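To see why this works: csv2.set_index('number')['ID'] builds a lookup Series keyed by number, which map then applies element-wise to Title. With the sample csv2 above it looks like this:
lookup = csv2.set_index('number')['ID']
print(lookup)
# number
# 102    28604361
# 235    14492699
# 108    16888863
# Name: ID, dtype: int64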
You can use the pandas module to load your dataframes, and then achieve what you are seeking with the merge function:
import pandas as pd
df1 = pd.read_csv("df1.csv")
df2 = pd.read_csv("df2.csv")
merged = df1.merge(df2, left_on="Title", right_on="number", how="right")
merged["Title"] = merged["ID"]
merged
Output
      Title  glide gscore  IFDScore        ID                                             smiles  number
0  28604361           NaN       NaN  28604361         NC(=O)CNC(=O)CC(c(cc1)cc(c12)OCO2)c3ccccc3     102
1  14492699         -9.01  -1020.18  14492699  COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C     235
2  14492699        -8.759  -1020.01  14492699  COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C     235
3  14492699        -7.301  -1019.28  14492699  COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C     235
4  16888863           NaN       NaN  16888863             COc1cc(ccc1O)CN2CCN(CC=C(C)C)C(C2)CCO     108
Note that the NaN values appear where a number has no matching row in csv1; if your full dataframes cover those parts too, you won't get NaN.
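If you want the output to contain only the three original columns, you could drop the unmatched rows and the helper columns after the merge; a minimal sketch continuing from merged above:
# keep only the rows that matched and restore the original column layout
out = merged.dropna(subset=['glide gscore'])[['Title', 'glide gscore', 'IFDScore']]
print(out)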
I have 2 csv files with some random numbers, as follow:
csv1.csv
0 906018
1 007559
2 910475
3 915104
4 600393
...
5070 907525
5071 903079
5072 001910
5073 909735
5074 914861
length 5075
csv2.csv
0 5555
1 7859
2 501303
3 912414
4 913257
...
7497 915031
7498 915030
7499 915033
7500 902060
7501 915038
length 7502
Some elements of csv1 are also present in csv2, but I don't know exactly which ones, and I would like to extract the values that are unique to one file. So my idea was to merge the two dataframes together and then remove the duplicates.
So I wrote the following code:
import pandas as pd
import csv
unique_users = pd.read_csv('./csv1.csv')
unique_users['id']
identity = pd.read_csv('./csv2.csv')
identityNumber = identity['IDNumber']
identityNumber
df = pd.concat([identityNumber, unique_users])
Up to here everything is fine and the length of df is the sum of the two lengths, but then I reached the part where I got stuck.
concat did its job and concatenated based on the index, so now I have tons of NaN.
and when I use the code:
final_result = df.drop_duplicates(keep=False)
The dataframe does not drop any values because its structure now looks like this:
IDNumber    ID
5555        NaN
so I guess drop_duplicates is looking for exactly identical rows, and since none exist it just keeps everything.
So what I would like to do, is loop over both data frame, and if a value in csv1 exists in csv2, I want them to be dropped.
Can anyone help with this please?
And please if you need more info just let me know.
UPDATE:
I think I found the reason why is not working but I am not sure how to solve this.
my csv1 looks like this:
id
906018,
007559,
910475,
915104,
600393,
007992,
502313,
004609,
910017,
007954,
006678,
In a Jupyter notebook, when I open the csv it looks this way.
id
906018 NaN
007559 NaN
910475 NaN
915104 NaN
600393 NaN
... ...
907525 NaN
903079 NaN
001910 NaN
909735 NaN
914861 NaN
and I do not understand why it shows NaN next to each id.
In fact, I tried to add a new column to csv2 and passed the ids from csv1 as its values, and I can confirm that they are all NaN.
So I believe this is surely the source of the problem, which then affects everything else.
Can anyone help me understand how to solve this issue?
You can achieve this using df.merge():
import pandas as pd
# Data samples
data_1 = {'col_a': [906018,7559,910475,915104,600393,907525,903079,1910,909735,914861]}
data_2 = {'col_b': [5555,7859,914861,912414,913257,915031,1910,915104,7559,915038]}
df1 = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)
# approach 1: inner merge to find the shared values, then filter with isin()
unique_vals = df1.merge(df2, right_on='col_b', left_on='col_a')['col_a']
new_df1 = df1[~df1.col_a.isin(unique_vals)]
# approach 2: left merge, then keep the rows with no match on the right
new_df1 = df1[df1.merge(df2, right_on='col_b', left_on='col_a', how='left')['col_b'].isna()]
print(new_df1)
# col_a
# 0 906018
# 2 910475
# 4 600393
# 5 907525
# 6 903079
# 8 909735
This will remove the duplicates between your two dataframes and keep all the remaining records in one dataframe df:
df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
You are getting NaN because when you concatenate, pandas doesn't know what you want to do with the different column names of your two dataframes. One of your dataframes has an IDNumber column and the other has an id column. Pandas can't figure out what you want, so it puts both columns into the resulting dataframe.
Try this:
pd.concat([df1["IDNumber"], df2["id"]]).drop_duplicates().reset_index(drop=True)
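Regarding the UPDATE: the trailing commas in csv1 make read_csv see a second, empty column (that is where the NaNs come from), and ids like 007559 would also lose their leading zeros if parsed as integers. A sketch of one defensive way to read both files, assuming the column names from the question ('id' and 'IDNumber'):
import pandas as pd
# usecols=[0] drops the phantom column produced by the trailing commas;
# dtype=str preserves leading zeros in ids such as 007559
unique_users = pd.read_csv('./csv1.csv', usecols=[0], dtype=str)
identity = pd.read_csv('./csv2.csv', dtype=str)
# rows of csv1 whose id does not appear anywhere in csv2
only_in_csv1 = unique_users[~unique_users['id'].isin(identity['IDNumber'])]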
I have the following data in a column titled Reference:
ABS052
ABS052/01
ABS052/02
ADA010/00
ADD005
ADD005/01
ADD005/02
ADD005/03
ADD005/04
ADD005/05
...
WOO032
WOO032/01
WOO032/02
WOO032/03
WOO045
WOO045/01
WOO045/02
WOO045/03
WOO045/04
I would like to know how to split the row values to create a Dataframe that contains the single Reference code, plus a Count value, for example:
Reference  Count
ABS052     3
ADA010     0
ADD005     2
...        ...
WOO032     3
WOO045     4
I have the following code:
df['Reference'] = df['Reference'].str.split('/')
Results in:
['ABS052'],
['ABS052','01'],
['ABS052','02'],
['ABS052','03'],
...
But I'm not sure how to ditch the last two digits from the list in each row.
All I want now is to keep element [0] of each row's list, if that makes sense; then I could just take a value_counts of the 'Reference' column.
There seems to be something wrong with the expected result listed in the question (the Count values don't match the data shown).
Let's say you want to ditch the digits and count the prefix occurrences:
df.Reference.str.split("/", expand=True)[0].value_counts()
If instead the suffix means something and you want to keep the highest value this should do
df.Reference.str.split("/", expand=True).fillna("00").astype({0: str, 1: int}).groupby(0).max()
You can just use regex to replace the last two digits like this:
df = pd.DataFrame({'a':['ABS052','ABS052/01','ABS052/02','ADA010/00','ADD005','ADD005/01','ADD005/02','ADD005/03','ADD005/04','ADD005/05']})
df = df['a'].str.replace(r'/\d+$', '', regex=True).value_counts().reset_index()
Output:
    index  a
0  ADD005  6
1  ABS052  3
2  ADA010  1
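A variant on the same idea that skips the replace step is to extract the prefix before the first '/' directly (applied to the original column, before df is reassigned above):
# capture everything before the first '/', then count occurrences
counts = df['a'].str.extract(r'^([^/]+)', expand=False).value_counts()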
You are almost there, you can add expand=True to split and then use groupby:
df['Reference'].str.split("/", expand=True).fillna("--").groupby(0).count()
returns:
        1
0
ABS052  3
ADA010  1
ADD005  6
for the first couple of rows of your data.
The fillna("--") makes sure you also count lines like ABS052 without a "/", i.e. with None in the second column.
To output to a df with proper column names:
df['Reference'] = df['Reference'].str.split('/').str[0]
df_counts = df['Reference'].value_counts().rename_axis('Reference').reset_index(name='Counts')
Output:
Reference Counts
0 ADD005 6
1 ABS052 3
2 ADA010 1
Explanation - The first line gives a clean series called 'Reference'. The second line gives a count of unique items and then resets the index and renames the columns.
I am working with a large df (nearly 2 million rows) and need to create a new column from another one. The task seems easy: the starting column, called "PTCODICEFISCALE", contains a string of either 11 or 16 characters, no other possibilities, no NaN.
The new column I have to create ("COGNOME") must contain the first 3 characters of "PTCODICEFISCALE" ONLY IF the length of the "PTCODICEFISCALE" value in that row is 16; when the length is 11 the new column should contain nothing, which I think means "NaN".
I have tried this:
csv.loc[len(csv['PTCODICEFISCALE']) == 16, 'COGNOME'] = csv.loc[csv.PTCODICEFISCALE.str[:3]]
In the output this error message appears:
ValueError: cannot index with vector containing NA / NaN values
Which I don't understand.
I am sure there are no NA /NaN in "PTCODICEFISCALE" column.
Any help? Thanks!
P.S.: "csv" is the name of the DataFrame
I think you need numpy.where with a str.len condition:
import numpy as np
csv['COGNOME'] = np.where(csv.PTCODICEFISCALE.str.len() == 16, csv.PTCODICEFISCALE.str[:3], np.nan)
Sample:
csv = pd.DataFrame({'PTCODICEFISCALE':['0123456789123456','1','01234567891234']})
print (csv)
PTCODICEFISCALE
0 0123456789123456
1 1
2 01234567891234
csv['COGNOME'] = np.where(csv.PTCODICEFISCALE.str.len() == 16, csv.PTCODICEFISCALE.str[:3], np.nan)
print (csv)
PTCODICEFISCALE COGNOME
0 0123456789123456 012
1 1 NaN
2 01234567891234 NaN
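A pure-pandas alternative, if you prefer to avoid numpy, is Series.where, which keeps values where the condition holds and fills NaN elsewhere:
# slice first, then keep the slice only where the full string has length 16
csv['COGNOME'] = csv.PTCODICEFISCALE.str[:3].where(csv.PTCODICEFISCALE.str.len() == 16)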