Create CSV based on prefix in columns - python

I have a CSV, but the rows have different numbers of columns because some values are missing in some rows, so there is no positional index. The "meaning" of each value is currently encoded by a prefix on the value. I need to clean my CSV to create a new one that only holds the values of certain columns, based on the prefix.
It looks like this:
001234;aA431;cFM33;jJE LE (3);xABCD;421;
004321;aB432;cPD99;433
006543;aC332;cHR31;x4231;499
The new CSV should have a header; each column's name can be the prefix (first letter) of its values:
0;a;c;4
01234;A431;FM33;21
04321;B432;PD99;33
06543;C332;HR31;99
I am starting to work with python pandas, so any hints in that direction would be especially welcome.

You can use
df1 = df.astype(str)                  # work on a string copy (astype already returns one)
cols = df1.iloc[0].str[0].tolist()    # first character of each value in row 0 becomes the header
df1 = df1.apply(lambda x: x.str[1:])  # drop the prefix character from every value
df1.columns = cols
input
A B C D E F
0 1234 aA431 cFM33 jJE LE (3) xABCD 421.0
1 4321 aB432 cPD99 433 NaN NaN
2 6543 aC332 cHR31 x4231 499 NaN
output
print(df1)
1 a c j x 4
0 234 A431 FM33 JE LE (3) ABCD 21.0
1 321 B432 PD99 33 an an
2 543 C332 HR31 4231 99 an
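Note the an values in the last two columns: astype(str) turns NaN into the string 'nan', which then loses its first character. A minimal sketch that restores the real missing values before stripping, assuming the same df as shown under input:
import numpy as np

df1 = df.astype(str).replace('nan', np.nan)  # put real NaN back after the string cast
cols = df1.iloc[0].str[0].tolist()           # safe here: row 0 has no missing values
df1 = df1.apply(lambda x: x.str[1:])         # .str[1:] leaves NaN untouched
df1.columns = cols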


Naming dataframe columns based on the content of one of the row indices

Here is my data frame after reading the csv file and splitting it into columns.
Index 0 1 2
0 Dylos Logger v 1.6.0.0 None None
1 Unit: DC1700 v 2.08 None None
2 Date/Time: 12-07-15 11:11 None None
3 ------------------------- None None
4 Particles per cubic foot None None
5 ------------------------- None None
6 Date/Time Small Large
7 11-27-15 10:08 161200 8300
8 11-27-15 10:09 136500 8700
9 11-27-15 10:10 124000 8400
10 11-27-15 10:11 127300 7900
I would like to name my columns based on the content in the 6th row index, then get rid of the first 6 indices, and reset the index from zero. This means that I wish my data to look like this:
        Date/Time   Small  Large
0  11-27-15 10:08  161200   8300
1  11-27-15 10:09  136500   8700
2  11-27-15 10:10  124000   8400
3  11-27-15 10:11  127300   7900
I know how to remove the first 6 rows and reset the indices. But I do not know how to rename the columns based on row 6 in the first step. Can you please help me?
Thanks
import pandas as pd

df = pd.DataFrame({'0': ['a', 'Date/Time', 'x'],
                   '1': ['b', 'Small', 'y'],
                   '2': ['c', 'Large', 'z']})
row_with_column_names = 1  # would be 6 for you

# rename each column to the value found in that row
df = df.rename(columns={cur_name: new_name for cur_name, new_name
                        in zip(df, df.iloc[row_with_column_names, :])})
df = df.drop(row_with_column_names, axis='index')  # remove the row with the names in it
df = df.reset_index(drop=True)
df
# Produces
#   Date/Time Small Large
# 0         a     b     c
# 1         x     y     z
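Applied directly to the frame in the question, the same idea is shorter (a sketch, assuming the column names sit at row index 6 as shown):
df.columns = df.iloc[6]                  # row 6 holds 'Date/Time', 'Small', 'Large'
df = df.iloc[7:].reset_index(drop=True)  # drop rows 0-6 and renumber from zero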

Pandas categorical series showing duplicate category names. How to find the indexes?

When I run this code:
df19['tipo'] = df19['tipo'].astype('category')
df19.tipo.value_counts()
I'm getting the following output:
CAS 1269
REF 667
QUE 408
CPPP 190
INH 60
COMP 25
EXC 22
REC 14
ACL 4
NUL 3
CAS 3
REP 3
AMICUS 2
AMI 2
RES 1
HON 1
PRE 1
QUE 1
QUE RET 1
ACLA 1
REV 1
Name: tipo, dtype: int64
As you can see, for example, there are 1269 "CAS" rows, but also 3 other "CAS" rows further down the list (the same happens with "QUE"). I'm confident they should all be included in the same category, but there is probably some issue with the cells containing the last 3 values, because of which pandas interprets them as a different category. I tried stripping whitespace but it didn't work. What else could be causing this problem? How could I get the indexes of those 3 "CAS" rows so as to manually correct them, if needed?
Thanks!
Let us try
df19['tipo'].str.strip().value_counts()
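If plain stripping leaves the counts unchanged, the duplicates may contain invisible characters such as non-breaking spaces rather than ordinary whitespace. A sketch for locating and inspecting the offending rows, assuming the visible label is CAS:
s = df19['tipo'].astype(str)

# rows that display as containing "CAS" but are not exactly "CAS"
mask = s.str.contains('CAS', na=False) & (s != 'CAS')
print(df19.index[mask].tolist())  # indexes of the rows to correct

# repr() makes hidden characters (e.g. '\xa0') visible
for v in s[mask].unique():
    print(repr(v))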

Python Pandas - How to substitute a column of DataFrame1 with the values of a column in DataFrame2

Hi all, I have a DataFrame with more than 50000 records. It has a column named "Country" which has duplicate values.
As part of a Machine Learning project I am doing Label Encoding on this column, which will replace the 50000 string values with integer values. (For those who do not know about Label Encoding: it takes the unique values of the column and assigns an integer value to each, mostly based on alphabetical order, though I am not sure.) Say this DataFrame is DF1 and the column is "Country".
Now my requirement is that I have to do the same for another dataframe (DF2) manually i.e without using the Label Encoding function.
What I have tried so far, and where I get stuck, is mentioned below:
I have taken the unique values of the DF1.Country column and kept them in a new dataframe (temp_df).
Tried to do a right join of DF2 and temp_df with on="Country", but I am getting "NaN" in a few records. Not sure why.
Tried to do a find-and-replace using the .isin method, but still not getting the desired output.
So my basic question is: how do I fill a column in one dataframe with the values of a column in another dataframe, by matching the values of the two columns?
UPDATED
Sample code output is given below for better understanding
The Country Column in DF2 has repeatable values like this :
0 us
1 us
2 gb
3 us
4 au
5 fr
6 us
7 us
8 us
9 us
10 us
11 us
12 ca
13 at
14 us
15 us
16 es
17 fi
18 fr
19 us
20 us
The temp_df dataframe will have an integer value for every unique country name, as mentioned below (note: this dataframe will only have unique values, not duplicates):
1 gb 49
2 ca 22
3 au 5
4 de 34
5 fr 48
6 br 17
7 jp 75
8 sv 136
9 no 111
10 se 132
11 es 43
12 nl 110
13 mx 103
14 dk 36
15 ro 127
16 ch 24
17 it 71
18 be 10
19 ru 129
20 kr 78
21 fi 44
22 hk 59
23 ie 65
24 sg 133
25 nz 112
26 ar 3
27 at 4
28 in 68
29 cl 26
30 il 66
Now I have to create a new column in DF2 by taking the integer values from temp_df for each country value in DF2. Hope this helps.
You could use pandas.Series.map to accomplish this:
from io import StringIO
import pandas as pd
# Your data ..
data = """
id,country
0,AT
1,DE
2,UK
3,FR
4,AT
5,UK
6,IT
7,DE
"""
df = pd.read_table(StringIO(data), sep=',', index_col=[0])
# Create a map from your current labels to numeric labels:
country_labels = dict([(c, i) for i, c in enumerate(df.country.unique())])
# Use map() to transform your column and re-assign it
df.country = df.country.map(lambda c: country_labels[c])
print(df)
which will transform the above data to
country
id
0 0
1 1
2 2
3 3
4 0
5 2
6 4
7 1
As suggested in one of the comments to your question, you could also use replace()
df = df.replace({'country': country_labels })
Try this:
import pandas as pd
# dataframe
df = pd.DataFrame({'Country' : ['z','x', 'x', 'a', 'a', 'b', 'c'], 'Something' : [10, 1, 2, 1, 2, 3, 4]})
# create dictionary for mapping `sorted` countries to integer
country_map = dict(zip(sorted(df.Country.unique()), range(len(df.Country.unique()))))
# country_map should look something like:
# {'a': 0, 'b': 1, 'c': 2, 'x': 3, 'z': 4}, where a, b, ... are countries
# replace the `Country` column with the mapping
df.replace({'Country': country_map })
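To answer the question as asked, the same map idea works when the encoding lives in a second frame: turn temp_df into a Series indexed by country and map DF2.Country through it. A sketch, assuming temp_df has the columns 'Country' and 'Code' (the names are hypothetical; the question does not show them):
# build a Series mapping country name -> integer code from temp_df
# ('Country' and 'Code' are hypothetical names; use whatever temp_df actually calls them)
mapping = temp_df.set_index('Country')['Code']

# look every DF2 country up in the mapping; countries missing from temp_df
# come back as NaN, which also explains the NaN seen after the right join attempt
DF2['Country'] = DF2['Country'].map(mapping)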

Pandas read_csv adds unnecessary " " to each row

I have a csv file
(I am showing the first three rows here)
HEIGHT,WEIGHT,AGE,GENDER,SMOKES,ALCOHOL,EXERCISE,TRT,PULSE1,PULSE2,YEAR
173,57,18,2,2,1,2,2,86,88,93
179,58,19,2,2,1,2,1,82,150,93
I am using pandas read_csv to read the file and put them into columns.
Here is my code:
import pandas as pd
import os

path = '~/Desktop/pulse.csv'
path = os.path.expanduser(path)
my_data = pd.read_csv(path, index_col=False, header=None, quoting=3, delimiter=',')
print(my_data)
The problem is that the first and last columns have a " before and after the values.
Additionally I can't get rid of the indexes.
I might be making some silly mistake, but I thank you for your help in advance.
Final solution: use replace to remove the quotes and convert to ints, and to remove " from the column names use strip:
df = pd.read_csv('pulse.csv', quoting=3)
df = df.replace('"','', regex=True).astype(int)
df.columns = df.columns.str.strip('"')
print(df.head())
HEIGHT WEIGHT AGE GENDER SMOKES ALCOHOL EXERCISE TRT PULSE1 \
0 173 57 18 2 2 1 2 2 86
1 179 58 19 2 2 1 2 1 82
2 167 62 18 2 2 1 1 1 96
3 195 84 18 1 2 1 1 2 71
4 173 64 18 2 2 1 3 2 90
PULSE2 YEAR
0 88 93
1 150 93
2 176 93
3 73 93
4 88 93
index_col=False forces pandas not to read the first column as the index, but a DataFrame always needs some index, so a default one (0, 1, 2, ...) is added. The parameter can therefore be omitted here.
header=None should be removed, because it forces pandas not to read the first row (the header of the csv) into the columns of the DataFrame. The header then becomes the first row of data, and the numeric values are converted to strings.
delimiter=',' should be removed too, because it is the same as sep=',', which is the default parameter.
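For reference, here is the whole pipeline as a self-contained sketch; the raw string assumes each line of the file is wrapped in a pair of quotes, which would produce exactly the symptom described:
import pandas as pd
from io import StringIO

# stand-in for pulse.csv, assuming fully quoted lines
raw = ('"HEIGHT,WEIGHT,AGE,GENDER,SMOKES,ALCOHOL,EXERCISE,TRT,PULSE1,PULSE2,YEAR"\n'
       '"173,57,18,2,2,1,2,2,86,88,93"\n'
       '"179,58,19,2,2,1,2,1,82,150,93"\n')

df = pd.read_csv(StringIO(raw), quoting=3)        # QUOTE_NONE: quotes stay in the data
df = df.replace('"', '', regex=True).astype(int)  # remove quotes in values, cast to int
df.columns = df.columns.str.strip('"')            # remove quotes from the column names
print(df)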
#jezrael is right - a pandas dataframe will always add indices. It's necessary.
Try something like df[0] = df[0].str.strip(), replacing 0 with the last column.
Before you do so, read your csv into a dataframe, e.g. df = pd.read_csv(path) (pd.DataFrame.from_csv has since been removed from pandas).

Pandas join issue: columns overlap but no suffix specified

I have the following data frames:
print(df_a)
mukey DI PI
0 100000 35 14
1 1000005 44 14
2 1000006 44 14
3 1000007 43 13
4 1000008 43 13
print(df_b)
mukey niccdcd
0 190236 4
1 190237 6
2 190238 7
3 190239 4
4 190240 7
When I try to join these data frames:
join_df = df_a.join(df_b, on='mukey', how='left')
I get the error:
*** ValueError: columns overlap but no suffix specified: Index([u'mukey'], dtype='object')
Why is this so? The data frames do have common 'mukey' values.
Your error on the snippet of data you posted is a little cryptic. Because there are no common index values, the join finds no overlap, and because both frames carry a mukey column it requires you to supply a suffix for the left and right hand sides:
In [173]:
df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
Out[173]:
mukey_left DI PI mukey_right niccdcd
index
0 100000 35 14 NaN NaN
1 1000005 44 14 NaN NaN
2 1000006 44 14 NaN NaN
3 1000007 43 13 NaN NaN
4 1000008 43 13 NaN NaN
merge works because it doesn't have this restriction:
In [176]:
df_a.merge(df_b, on='mukey', how='left')
Out[176]:
mukey DI PI niccdcd
0 100000 35 14 NaN
1 1000005 44 14 NaN
2 1000006 44 14 NaN
3 1000007 43 13 NaN
4 1000008 43 13 NaN
The .join() function uses the index of the dataset passed as the argument, so you should use set_index on it, or use the .merge function instead.
Please find the two examples that should work in your case:
join_df = LS_sgo.join(MSU_pi.set_index('mukey'), on='mukey', how='left')
or
join_df = df_a.merge(df_b, on='mukey', how='left')
This error indicates that the two tables share one or more column names.
The error message translates to: "I can see the same column in both tables but you haven't told me to rename either one before bringing them into the same table"
You either want to delete one of the columns before bringing it in from the other one, using del df['column name'], or use lsuffix to rename the original column, or rsuffix to rename the one that is being brought in.
df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
The error indicates that the two tables share one or more column names.
Anyone with the same error who doesn't want to provide a suffix can rename the columns instead. Also make sure the indexes of both DataFrames match in type and value if you don't want to provide the on='mukey' setting.
# rename example
df_a = df_a.rename(columns={'a_old': 'a_new', 'a2_old': 'a2_new'})
# set the index
df_a = df_a.set_index(['mukey'])
df_b = df_b.set_index(['mukey'])
df_a.join(df_b)
Mainly, join is used to join based on the index, not on the column names, so change the overlapping column names in the two dataframes and then try the join; it will work, otherwise this error is raised.
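Putting it together, a self-contained sketch with the frames from the question (the mukey values in the two frames do not overlap, so the left join legitimately yields NaN matches):
import pandas as pd

df_a = pd.DataFrame({'mukey': [100000, 1000005, 1000006, 1000007, 1000008],
                     'DI': [35, 44, 44, 43, 43],
                     'PI': [14, 14, 14, 13, 13]})
df_b = pd.DataFrame({'mukey': [190236, 190237, 190238, 190239, 190240],
                     'niccdcd': [4, 6, 7, 4, 7]})

# join matches against the index of the right-hand frame,
# so move mukey into the index first
join_df = df_a.join(df_b.set_index('mukey'), on='mukey', how='left')

# merge matches on columns directly and needs no index handling
merge_df = df_a.merge(df_b, on='mukey', how='left')
print(merge_df)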
