How to split a column when having multiple scenarios (pandas) - python

I have a column that has a combination of longitude and latitude. I'm trying to split them separately. But, I'm facing a problem. Here's what my data looks like:
print(df['location'])
location
0 -10.8544921875-49.8238090851324
1 2.021484375-59.478568831926
2 2.021484375 / 49.823809085
3 -10.8544921875/ 59.478568831926
4 9.61795 19.33163
As you can see, some have no spacing but are separated with a '-'. Some have spacing and are separated with a ' / '. And others have spacing without any character between them.
I've tried to separate it one by one and firstly by doing:
df[['Long','Lat']] = df['location'].str.split(" ",1, expand=True)
Obviously, it didn't separate everything.
My problem is: what do I do next? Or is there a better approach using regular expressions, which I'm not familiar with at all?
Desired Output:
Long Lat
0 -10.8544921875 -49.8238090851324
1 2.021484375 -59.478568831926
2 2.021484375 49.823809085
3 -10.8544921875 59.478568831926
4 9.61795 19.33163

Try:
df[['Long','Lat']] = df['location'].str.extractall(r'([-]?\d+(\.\d+)?)')[0].unstack(level=1)
Outputs:
>>> df[['Long','Lat']]
Long Lat
0 -10.8544921875 -49.8238090851324
1 2.021484375 -59.478568831926
2 2.021484375 49.823809085
3 -10.8544921875 59.478568831926
4 9.61795 19.33163
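If you prefer to avoid the extractall/unstack step, an equivalent sketch using str.findall (assuming each row contains exactly two numbers) is:
nums = df['location'].str.findall(r'-?\d+(?:\.\d+)?')  # every signed number per row
df['Long'] = nums.str[0].astype(float)                 # first number
df['Lat'] = nums.str[1].astype(float)                  # second number; drop astype to keep strings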

Related

Multi-part manipulation post str.split() Pandas

I have a subset of data (single column) we'll call ID:
ID
0 07-1401469
1 07-89556629
2 07-12187595
3 07-381962
4 07-99999085
The current format is (usually) YY-[up to 8-character ID].
The desired output format is a more uniform YYYY-xxxxxxxx:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
Knowing that I've done padding in the past, the thought process was to combine
df['id'].str.split('-').str[0].apply(lambda x: '{0:20>4}'.format(x))
df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x))
However I ran into a few problems:
The fill value in '{0:20>4}' must be a single character; '20' is two characters, so it isn't accepted
Trying to do something like the below just results in df['id'] taking the properties of the last lambda, and any other way I tried to combine multiple apply/lambda calls just didn't work. I started going down the pad-left/right route, but that seemed to be taking me backwards.
df['id'] = (df['id'].str.split('-').str[0].apply(lambda x: '{0:X>4}'.format(x)).str[1].apply(lambda x: '{0:0>8}'.format(x)))
The current solution I have (but HATE because it's long, messy, and just not clean IMO) is:
df['idyear'] = df['id'].str.split('-').str[0].apply(lambda x: '{:X>4}'.format(x)) # Split on '-' and pad with X
df['idyear'] = df['idyear'].str.replace('XX', '20') # Replace XX with 20 to conform to YYYY
df['idnum'] = df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x)) # Pad 0s up to 8 digits
df['id'] = df['idyear'].map(str) + "-" + df['idnum'] # Merge idyear and idnum to remake id
del df['idnum'] # delete extra
del df['idyear'] # delete extra
Which does work
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
But my questions are:
Is there a way to run multiple apply() functions in a single line so I'm not making temp variables?
Is there a better way than replacing 'XX' with '20'?
I feel like this entire code block can be compressed to 1 or 2 lines; I just don't know how. Everything I've seen on SO and in the pandas documentation relates to a single manipulation so far.
One option is to split; then use str.zfill to pad '0's. Also prepend '20's before splitting, since you seem to need it anyway:
tmp = df['ID'].radd('20').str.split('-')
df['ID'] = tmp.str[0] + '-'+ tmp.str[1].str.zfill(8)
Output:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
I'd do it in two steps, using .str.replace:
df["ID"] = df["ID"].str.replace(r"^(\d{2})-", r"20\1-", regex=True)
df["ID"] = df["ID"].str.replace(r"-(\d+)", lambda g: f"-{g[1]:0>8}", regex=True)
print(df)
Prints:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
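If you really want a single statement, here is a sketch that folds both replacements into one regex with a callable (it assumes every value matches the YY-digits pattern):
df["ID"] = df["ID"].str.replace(
    r"^(\d{2})-(\d+)$",                      # group 1: 2-digit year, group 2: the id digits
    lambda m: f"20{m[1]}-{int(m[2]):08d}",   # prepend '20' and zero-pad the id to 8 digits
    regex=True,
)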

Pandas - Extract a string starting with a particular character

It should be fairly simple yet I'm not able to achieve it.
I have a dataframe df1, having a column "name_str". Example below:
name_str
0 alp:ha
1 bra:vo
2 charl:ie
I have to create another column that would contain, say, 5 characters starting after the colon (:). I've written the following code:
import pandas as pd
data = {'name_str':["alp:ha", "bra:vo", "charl:ie"]}
#indx = ["name_1",]
df1 = pd.DataFrame(data=data)
n= df1['name_str'].str.find(":")+1
df1['slize'] = df1['name_str'].str.slice(n,2)
print(df1)
But the output is disappointing: NaN
name_str slize
0 alp:ha NaN
1 bra:vo NaN
2 charl:ie NaN
The output should've been:
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Would anyone please help? Appreciate it.
You can use str.extract to extract everything after the colon with this regular expression: :(.*)
df1['slize'] = df1.name_str.str.extract(':(.*)')
>>> df1
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Edit, based on your updated question
If you'd like to extract up to 5 characters after the colon, then you can use this modification:
df1['slize'] = df1.name_str.str.extract(':(.{,5})')
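A regex-free sketch of the same idea, using str.split (assuming every value contains a colon):
df1['slize'] = df1['name_str'].str.split(':', n=1).str[1].str[:5]  # text after the first ':', at most 5 chars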

Python Pandas - Pivot a csv file into a specific format

It's my first attempt at using pandas. I really need help with pivot_table. None of the combinations I used seem to work.
I have a csv file like this:
Id Param1 Param2
1 -5.00138282776 2.04990620034E-08
1 -4.80147838593 2.01516989762E-08
1 -4.60159301758 1.98263165885E-08
1 -4.40133094788 1.94918392538E-08
1 -4.20143127441 1.91767686175E-08
1 -4.00122880936 1.88457374151E-08
2 -5.00141859055 6.88369405921E-09
2 -4.80152130127 6.77335965094E-09
2 -4.60163593292 6.65415056389E-09
2 -4.40139055252 6.54434062497E-09
3 -5.00138044357 1.16316911658E-08
3 -4.80148792267 1.15515588206E-08
3 -4.60160970688 1.14048361866E-08
3 -4.40137386322 1.12357021465E-08
3 -4.20145988464 1.11049178741E-08
I want my final output to be like this:
Param1_for_Id1 Param2_for_Id1 Param1_for_Id2 Param2_for_Id2 Param1_for_Id3 Param2_for_Id3
-5.00138282776 2.04990620034E-08 -5.00141859055 6.88369405921E-09 -5.00138044357 1.16316911658E-08
-4.80147838593 2.01516989762E-08 -4.80152130127 6.77335965094E-09 -4.80148792267 1.15515588206E-08
-4.60159301758 1.98263165885E-08 -4.60163593292 6.65415056389E-09 -4.60160970688 1.14048361866E-08
-4.40133094788 1.94918392538E-08 -4.40139055252 6.54434062497E-09 -4.40137386322 1.12357021465E-08
-4.20143127441 1.91767686175E-08 -4.20145988464 1.11049178741E-08
-4.00122880936 1.88457374151E-08
I can't figure out how to reshape my data. Any help would be most welcome!
Use set_index twice + unstack:
v = (df.set_index('Id')  # optional, omit if `Id` is the index
       .set_index(df.groupby('Id').cumcount(), append=True)
       .unstack(0)
       .sort_index(level=1, axis=1)
       .fillna('')  # I actually don't recommend adding this step in
)
v.columns = v.columns.map('{0[0]}_for_Id{0[1]}'.format)
And now,
print(v)
Param1_for_Id1 Param2_for_Id1 Param1_for_Id2 Param2_for_Id2 \
0 -5.001383 2.049906e-08 -5.00142 6.88369e-09
1 -4.801478 2.015170e-08 -4.80152 6.77336e-09
2 -4.601593 1.982632e-08 -4.60164 6.65415e-09
3 -4.401331 1.949184e-08 -4.40139 6.54434e-09
4 -4.201431 1.917677e-08
5 -4.001229 1.884574e-08
Param1_for_Id3 Param2_for_Id3
0 -5.00138 1.16317e-08
1 -4.80149 1.15516e-08
2 -4.60161 1.14048e-08
3 -4.40137 1.12357e-08
4 -4.20146 1.11049e-08
5
Note that that last fillna step results in mixed strings and numeric data, so I don't recommend adding that step if you're going to do more with this output.
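For comparison, here is a rough sketch of the same reshape using pivot instead of unstack (it assumes the frame has the columns Id, Param1 and Param2 from the question; the helper column row is introduced here only to act as the new index):
df['row'] = df.groupby('Id').cumcount()                    # position of each row within its Id
wide = df.pivot(index='row', columns='Id', values=['Param1', 'Param2'])
wide = wide.sort_index(axis=1, level='Id')                 # keep Param1/Param2 of each Id together
wide.columns = [f'{param}_for_Id{i}' for param, i in wide.columns]
print(wide)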

How to fill rows automatically in pandas, from the content found in a column?

In Python 3 and pandas I have a dataframe with dozens of columns and rows about food characteristics. Below is a summary:
alimentos = pd.read_csv("alimentos.csv",sep=',',encoding = 'utf-8')
alimentos.reset_index()
index alimento calorias
0 0 iogurte 40
1 1 sardinha 30
2 2 manteiga 50
3 3 maçã 10
4 4 milho 10
The column "alimento" (food) has the lines "iogurte", "sardinha", "manteiga", "maçã" and "milho", which are food names.
I need to create a new column in this dataframe, which will tell what kind of food it is. I gave it the name "classificacao":
alimentos['classificacao'] = ""
alimentos.reset_index()
index alimento calorias classificacao
0 0 iogurte 40
1 1 sardinha 30
2 2 manteiga 50
3 3 maçã 10
4 4 milho 10
Depending on the content found in the "alimento" column, I want to automatically fill the rows of the "classificacao" column.
For example, when finding "iogurte", fill in "laticinio". When finding "sardinha" -> "peixe". When finding "manteiga" -> "gordura animal". When finding "maçã" -> "fruta". And when finding "milho" -> "cereal".
Please, is there a way to automatically fill the rows when I find these strings?
If you have a mapping of all the possible values in the "alimento" column, you can just create a dictionary and use .map(d), as shown below:
df = pd.DataFrame({'alimento': ['iogurte', 'sardinha', 'manteiga', 'maçã', 'milho'],
                   'calorias': range(10, 60, 10)})
d = {"iogurte":"laticinio", "sardinha":"peixe", "manteiga":"gordura animal", "maçã":"fruta", "milho": "cereal"}
df['classificacao'] = df['alimento'].map(d)
However, in real life we often can't map everything in a dict (because of outliers that occur once in a blue moon, faulty inputs, etc.), in which case the above would return NaN in the "classificacao" column. This could cause some issues, so think about setting a default value, like "Other" or "Unknown". To do that, just append .fillna("Other") after map(d).
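For example, the mapping with that default in place would look like this (with "Other" just a placeholder label):
df['classificacao'] = df['alimento'].map(d).fillna("Other")  # unmapped foods get "Other"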

Input file, modify column, output file

I have data in a text file and I would like to be able to modify the file by columns and output the file again. I normally write in C (basic ability) but chose Python for its obvious string benefits. I haven't ever used Python before, so I'm a tad stuck. I have been reading up on similar problems, but they only show how to change whole lines. To be honest, I have no clue what to do.
Say I have the file
1 2 3
4 5 6
7 8 9
and I want to be able to change column two with some function say multiply it by 2 so I get
1 4 3
4 10 6
7 16 9
Ideally I would be able to easily change the program so I apply any function to any column.
For anyone who is interested, it is for modifying lab data for plotting, e.g. taking the log of the first column.
Python is an excellent general-purpose language; however, I might suggest that if you are on a Unix-based system you take a look at awk. The awk language is designed for exactly this kind of text-based transformation. The power of awk is easily seen for your question, as the solution is only a few characters: awk '{$2=$2*2;print}'.
$ cat file
1 2 3
4 5 6
7 8 9
$ awk '{$2=$2*2;print}' file
1 4 3
4 10 6
7 16 9
# Multiply the third column by 10
$ awk '{$3=$3*10;print}' file
1 2 30
4 5 60
7 8 90
In awk each column is referenced by $i, where i is the ith field. So we just set the value of the second field to the value of the second field multiplied by two and print the line. This can be written even more concisely as awk '{$2=$2*2}1' file, but it is best to be clear at the beginning.
Here is a very simple Python solution:
for line in open("myfile.txt"):
    col = line.strip().split(' ')
    print(col[0], int(col[1]) * 2, col[2])
There are plenty of improvements that could be made, but I'll leave that as an exercise for you.
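One such improvement, as a rough sketch: make the column index and the function parameters, and write to an output file instead of printing (transform_column and the file names are just placeholders):
def transform_column(in_path, out_path, col, func):
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            fields = line.split()
            fields[col] = f'{func(float(fields[col])):g}'  # %g writes 4 rather than 4.0
            dst.write(' '.join(fields) + '\n')

transform_column('myfile.txt', 'out.txt', 1, lambda x: x * 2)  # double the second column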
I would use pandas or just numpy. Read your file with:
data = pd.read_csv('file.txt', header=None, delim_whitespace=True)
then work with the data in a spreadsheet like style, ex:
data.values[:,1] *= 2
finally write again to file with:
data.to_csv('output.txt')
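Putting that together as a sketch that round-trips the whitespace-separated layout from the question (iloc is used instead of .values so the change reliably sticks, and sep/header/index are set so the output file matches the input format):
import pandas as pd

data = pd.read_csv('file.txt', header=None, sep=r'\s+')       # whitespace-separated, no header
data.iloc[:, 1] *= 2                                          # double the second column
data.to_csv('output.txt', sep=' ', header=False, index=False)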
As #sudo_O said, there are much more efficient tools than Python for this task. However, here is a possible solution:
from itertools import repeat
import csv

fun = pow
with open('m.in', 'r') as input_file:
    with open('m.out', 'w', newline='') as out_file:
        inpt = csv.reader(input_file, delimiter=' ')
        out = csv.writer(out_file, delimiter=' ')
        for row in inpt:
            row = [int(e) for e in row]   # conversion
            opt = repeat(2, len(row))     # exponent 2 (square) for every value
            # write function(value, exponent) for each value in the row
            out.writerow([str(elem) for elem in map(fun, row, opt)])
Here it multiplies every number by itself, but you can configure it to square only the second column by changing opt: opt = [1 + (col == 1) for col in range(len(row))] (exponent 2 for column 1, 1 otherwise).
