Dataframe getting all data in one 'cell'

Dataframe getting all data in one 'cell' - python

I'm having some problems to put the data I want in a specific df .
When I print the value without df i get
link to image bc if i type the answer it doesn't shows up
Then I try to use pandas dataframe to insert it in a dataframe and I get:
dinner = pd.DataFrame([dinner])
dinner.head()
Home Made - Tuna Poke, 472 gm (4 Ounces) {'cal...
So, basically, everything gets in just one cell. I would like to get something like:
A | Calories | carbohydrates
Home made - tuna poke | 592 | 8
Does anyone know how can I do it?

dinner looks like a string parsed from html text. If it is the case and there is regular pattern for parsing data, then the following code may work.
nutritions = dinner.split('{')[1].split('}')[0].split(', ')
menu = dinner.split('{')[0].strip('<').strip()
dict_dinner = {}
for n in nutritions:
item, qty = n.split(': ')
dict_dinner[item.strip("'")] = qty
df = pd.DataFrame(dict_dinner, index=[menu])
print(df)
This outputs:

Related

Extract data from file.txt text file to csv

I have this .txt file and i would like to convert it to .csv in Python, could you help me?
File.txt
I need to extract the txt data to csv file in this format:
|Case code| Site of injury| If is not fatal,day away from home| Gender | Nationality |...........
| 239 | Head |1 | M | ITALY |...........
And so on for every tag before " : " .
This is the result i'm trying to achieve: Final results
Please let me know how can i solve this problem. I'm a beginner in programming and I don't know where to start. Thank you.

Here is a proposition with pandas.read_fwf and pandas.DataFrame.tranpose :
import pandas as pd
(
pd.read_fwf("input.txt")
.squeeze()
.loc[lambda x: x.str.contains(":", na=False)]
.str.split(":", expand=True)
.set_index(0)
.transpose()
.to_csv("output.csv", index=False)
)
# Output :
First nine columns :
0 Case code Site of injury Type of injury day away from home Gender Nationality Type of work contract Job Seniority of job
1 239 Head Fracture 1 M ITALY Permanent employee Other jobs over 3 years
Shape of the output : (1 row, 17 columns)

Python - Matching and extracting data from excel with pandas

I am working on a python script that automates some phone calls for me. I have a tool to test with that I can interact with REST API. I need to select a specific carrier based on which country code is entered. So let's say my user enters 12145221414 in my excel document, I want to choose AT&T as the carrier. How would I accept input from the first column of the table and then output what's in the 2nd column?
Obviously this can get a little tricky, since I would need to match up to 3-4 digits on the front of a phone number. My plan is to write a function that then takes the initial number and then plugs the carrier that needs to be used for that country.
Any idea how I could extract this data from the table? How would I make it so that if you entered Barbados (1246), then Lime is selected instead of AT&T?
Here's my code thus far and tables. I'm not sure how I can read one table and then pull data from that table to use for my matching function.
testlist.xlsx
| Number |
|:------------|
|8155555555|
|12465555555|
|12135555555|
|96655555555|
|525555555555|
carriers.xlsx
| countryCode | Carrier |
|:------------|:--------|
|1246|LIME|
|1|AT&T|
|81|Softbank|
|52|Telmex|
|966|Zain|
import pandas as pd
import os
FILE_PATH = "C:/temp/testlist.xlsx"
xl_1 = pd.ExcelFile(FILE_PATH)
num_df = xl_1.parse('Numbers')
FILE_PATH = "C:/temp/carriers.xlsx"
xl_2 = pd.ExcelFile(FILE_PATH)
car_df = xl_2.parse('Carriers')
for index, row in num_df.iterrows():

Any idea how I could extract this data from the table? How would I
make it so that if you entered Barbados (1246), then Lime is selected
instead of AT&T?
carriers.xlsx
countryCode
Carrier
1246
LIME
1
AT&T
81
Softbank
52
Telmex
966
Zain
script.py
import pandas as pd
FILE_PATH = "./carriers.xlsx"
df = pd.read_excel(FILE_PATH)
rows_list = df.to_dict('records')
code_carrier_map = {}
for row in rows_list:
code_carrier_map[row["countryCode"]] = row["Carrier"]
print(type(code_carrier_map), code_carrier_map)
print(f"{code_carrier_map.get(1)=}")
print(f"{code_carrier_map.get(1246)=}")
print(f"{code_carrier_map.get(52)=}")
print(f"{code_carrier_map.get(81)=}")
print(f"{code_carrier_map.get(966)=}")
Output
$ python3 script.py
<class 'dict'> {1246: 'LIME', 1: 'AT&T', 81: 'Softbank', 52: 'Telmex', 966: 'Zain'}
code_carrier_map.get(1)='AT&T'
code_carrier_map.get(1246)='LIME'
code_carrier_map.get(52)='Telmex'
code_carrier_map.get(81)='Softbank'
code_carrier_map.get(966)='Zain'
Then if you want to parse phone numbers, don't reinvent the wheel, just use this phonenumbers library.
Code
import phonenumbers
num = "+12145221414"
phone_number = phonenumbers.parse(num)
print(f"{num=}")
print(f"{phone_number.country_code=}")
print(f"{code_carrier_map.get(phone_number.country_code)=}")
Output
num='+12145221414'
phone_number.country_code=1
code_carrier_map.get(phone_number.country_code)='AT&T'

Let's assume the following input:
>>> df1
Number
0 8155555555
1 12465555555
2 12135555555
3 96655555555
4 525555555555
>>> df2
countryCode Carrier
0 1246 LIME
1 1 AT&T
2 81 Softbank
3 52 Telmex
4 966 Zain
First we need to rework a bit df2 to sort the countryCode in descending order, make it as string and set it to index.
The trick for later is to sort countryCode in descending order. This will ensure that a longer country codes, such as "1246" is matched before a shorter one like "1".
>>> df2 = df2.sort_values(by='countryCode', ascending=False).astype(str).set_index('countryCode')
>>> df2
Carrier
countryCode
1246 LIME
966 Zain
81 Softbank
52 Telmex
1 AT&T
Finally, we use a regex (here '1246|966|81|52|1' using '|'.join(df2.index)) made from the country codes in descending order to extract the longest code, and we map it to the carrier:
(df1.astype(str)['Number']
.str.extract('^(%s)'%'|'.join(df2.index))[0]
.map(df2['Carrier'])
)
output:
0 Softbank
1 LIME
2 AT&T
3 Zain
4 Telmex
Name: 0, dtype: object
NB. to add it to the initial dataframe:
df1['carrier'] = (df1.astype(str)['Number']
.str.extract('^(%s)'%'|'.join(df2.index))[0]
.map(df2['Carrier'])
).to_clipboard(0)
output:
Number carrier
0 8155555555 Softbank
1 12465555555 LIME
2 12135555555 AT&T
3 96655555555 Zain
4 525555555555 Telmex

If I understand it correctly, you just want to get the first characters from the input column (Number) and then match this with the second dataframe from carriers.xlsx.
Extract first characters of a Number column. Hint: The nbr_of_chars variable should be based on the maximum character length of the column countryCode in the carriers.xlsx
nbr_of_chars = 4
df.loc[df['Number'].notnull(), 'FirstCharsColumn'] = df['Number'].str[:nbr_of_chars]
Then the matching should be fairly easy with dataframe joins.

I can think only of an inefficient solution.
First, sort the data frame of carriers in the reverse alphabetical order of country codes. That way, longer prefixes will be closer to the beginning.
codes = xl_2.sort_values('countryCode', ascending=False)
Next, define a function that matches a number with each country code in the second data frame and finds the index of the first match, if any (remember, that match is the longest).
def cc2carrier(num):
matches = codes['countryCode'].apply(lambda x: num.startswith(x))
if not matches.any(): #Not found
return np.nan
return codes.loc[matches.idxmax()]['Carrier']
Now, apply the function to the numbers dataframe:
xl_1['Number'].apply(cc2carrier)
#1 Softbank
#2 LIME
#3 AT&T
#4 Zain
#5 Telmex
#Name: Number, dtype: object

Python Panda's row selection

I've tried doing some searching, but I'm having troubles finding what I specifically need. I currently have this.
location = 'Location'
data = pd.read_csv('testbook.csv')
df = pd.DataFrame(data)
search = 'OR' # This will be replaced with an input
row = (df[df.eq(search).any(1)])
print(row)
Location = row.at[0, location]
print(Location)
This outputs this
row print out
Location City Price Etc
0 FL OR 50 123
Location print out
FL
this is the CSV information that it's pull the data from.
My main question and issue is what I'm trying to find out is at this specific line of code
Location = row.at[0, location]
for Location what I'm trying to do and see if possible is in the brackets [0, location].
I want it to automate in the future since for example if I need to find instead of 'OR' I need to find what data is in 'OR1'. The issue is that the [0] is related to the Row # hence this(this is the entire df).
Location City Price Etc
0 FL OR 50 123
1 FL1 OR1 501 1231
2 FL2 OR2 502 1232
I would have to manually change the code every single time which of course is unfeasible with what I'm trying to accomplish.
My main question is, how do I pull specific row numbers all the way on the left and take that output and make it a variable that I can input anywhere?

I'm having a bit of trouble figuring out what you are looking for but this is my best guess
import pandas as pd
data = {'Location':['FL', 'FL1', 'FL2'],
'City': ['OR', 'OR1', 'OR2'],
'Price':[50, 501, 502],
'Etc': [123,1231,1232]}
data = pd.DataFrame(data)
df = pd.DataFrame(data)
# Given search term -> find location
search = 'OR'
# Outputs 'FL'
df['Location'][df['City'] == search].any()

Pandas df loop + merge

Hello guys I need your wisdom,
I'm still new to python and pandas and I'm looking to achieve the following thing.
df = pd.DataFrame({'code': [125, 265, 128,368,4682,12,26,12,36,46,1,2,1,3,6], 'parent': [12,26,12,36,46,1,2,1,3,6,'a','b','a','c','f'], 'name':['unknow','unknow','unknow','unknow','unknow','unknow','unknow','unknow','unknow','unknow','g1','g2','g1','g3','g6']})
ds = pd.DataFrame({'code': [125, 265, 128,368,4682], 'name': ['Eagle','Cat','Koala','Panther','Dophin']})
I would like to add a new column in the ds dataframe with the name of the highest parent.
as an example for the first row :
code | name | category
125 | Eagle | a
"a" is the result of a loop between df.code and df.parent 125 > 12 > 1 > a
Since the last parent is not a number but a letter i think I must use a regex and than .merge from pandas to populate the ds['category'] column. Also maybe use an apply function but it seems a little bit above my current knowledge.
Could anyone help me with this?
Regards,

The following is certainly not the fastest solution but it works if your dataframes are not too big. First create a dictionary from the parent codes of df and then apply this dict recursively until you come to an end.
p = df[['code','parent']].set_index('code').to_dict()['parent']
def get_parent(code):
while par := p.get(code):
code = par
return code
ds['category'] = ds.code.apply(get_parent)
Result:
code name category
0 125 Eagle a
1 265 Cat b
2 128 Koala a
3 368 Panther c
4 4682 Dophin f
PS: get_parent uses an assignment expression (Python >= 3.8), for older versions of Python you could use:
def get_parent(code):
while True:
par = p.get(code)
if par:
code = par
else:
return code

Rename Columns in Python using Regular Expressions

I have a data set that has columns for number of units sold in a given month - the problem being that the monthly units columns are named in MM/yyyy format, meaning that I have 12 columns of units information per record.
So for instance, my data looks like:
ProductID | CustomerID | 04/2018 | 03/2018 | 02/2018 | FileDate |
a1032 | c1576 | 36 | 12 | 19 | 04/20/2018 |
What causes this to be problematic is that a new file comes in every month, with the same file name, but different column headers for the units information based on the last 12 months.
What I would like to do, is rename the monthly units columns to Month1, Month2, Month3... based on a simple regex such as ([0-9]*)/([0-9]*) that will result in the output:
ProductID | CustomerID | Month1 | Month2 | Month3 | FileDate |
a1032 | c1576 | 36 | 12 | 19 | 04/20/2018 |
I know that this should be possible using Python, but as I have never used Python before (I am an old .Net developer) I honestly have no idea how to achieve this.
I have done a bit of research on renaming columns in Python, but none of them mentioned pattern matching to rename a column, eg:
df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
UPDATE: The data that I am showing in my example is only a subset of the columns; total, in my data set I have 120 columns, only 12 of which need to be renamed, this is why I thought that regex might be the simplest way to go.

import re
# regex pattern
pattern = re.compile("([0-9]*)/([0-9]*)")
# get headers as list
headers = list(df)
# apply regex
months = 1
for index, header in enumerate(headers):
if pattern.match(header):
headers[index] = 'Month{}'.format(months)
months += 1
# set new list as column headers
df.columns = headers

If you have some set names that you want to convert to, then rather than using rename, it might easier to just pass a new list to the df.columns attribute
df.columns = ['ProductID','CustomerID']+['Month{}'.format(i) for i in range(12)]+['FileDate']
If you want to use rename, if you can write a function find_new_name that does the conversion you want for a single name, you can rename an entire list old_names with
df.rename(columns = {oldname:find_new_name(old_name) for old_name in old_names})
Or if you have a function that takes a new name and figures out what old name corresponds to it, then it would be
df.rename(columns = {find_old_name(new_name):new_name for new_name in new_names})
You can also do
for new_name in new_names:
old_name = find_new_name(old_name)
df[new_name] = df[old_name]
This will copy the data into new columns with the new names rather than renaming, so you can then subset to just the columns you want.

Since rename could take a function as a mapper, we could define a customized function which returns a new column name in the new format if the old column name matches regex; otherwise, returns the same column name. For example,
import re
def mapper(old_name):
match = re.match(r'([0-9]*)/([0-9]*)', old_name)
if match:
return 'Month{}'.format(int(match.group(1)))
return old_name
df = df.rename(columns=mapper)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Dataframe getting all data in one 'cell' - python

Related

Extract data from file.txt text file to csv

Python - Matching and extracting data from excel with pandas

Python Panda's row selection

Pandas df loop + merge

Rename Columns in Python using Regular Expressions

Categories

Resources