I'm not very experienced with dictionaries in Python. However, I have structured text data (ASCII) which I would like to convert to CSV (for input into a database or spreadsheet). Not all values are available in each line:
name Smith city Boston country USA
name Meier city Berlin ZIP 12345 country Germany
name Grigoriy country Russia
Not all fields appear in each line; however, the field values themselves contain no spaces. How can I convert such a text file into a CSV like
name, city, ZIP, country
Smith, Boston, , USA
Meier, Berlin, 12345, Germany
Grigoriy, , , Russia
Try this:
d = """name Smith city Boston country USA
name Meier city Berlin ZIP 12345 country Germany
name Grigoriy country Russia"""
keys = {}  # will collect all keys (dicts preserve insertion order in Python 3.7+)
objs = []  # will collect one dictionary per line
for line in d.split("\n"):          # split input by linebreak
    words = line.split()
    ks = words[::2]                 # even positions: 0, 2, 4, ... are keys
    vs = words[1::2]                # odd positions: 1, 3, 5, ... are values
    objs.append(dict(zip(ks, vs)))  # turn line into a dictionary
    for key in ks:
        keys[key] = True            # note all keys
print(",".join(keys))               # print header row
for obj in objs:
    print(",".join([obj.get(k, "") for k in keys]))
Output (on Python 3.7+, where dicts keep insertion order):
name,city,country,ZIP
Smith,Boston,USA,
Meier,Berlin,Germany,12345
Grigoriy,,Russia,
Getting the columns into another order is left as an exercise for the reader.
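That said, if you want the exact header from the question, a minimal sketch (continuing from the snippet above, with the column order hard-coded):

cols = ["name", "city", "ZIP", "country"]  # desired column order
print(",".join(cols))
for obj in objs:
    print(",".join(obj.get(k, "") for k in cols))

The csv module would additionally take care of quoting, should any value ever contain a comma.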
I am trying to get the zip code that follows the specific word 'zip_code' within a string.
I have a data frame with a column named "location"; each row of this column contains a string. I want to find the word "zip_code" and get the value after this word for each row.
Input
name location
Bar1 LA Jefferson zip_code 202378 Avenue free
Pizza Avenue 45 zip_code 45623 wichita st
Tacos Las Americas avenue 7 zip_code 67890 nicolas st
Expected output
name location
Bar1 202378
Pizza 45623
Tacos 67890
So far, following an example, I was able to extract the zip code from a single string:
s = "address1 355 Turnpike Ste 4 address3 zip_code 02021 country US "  # 's' rather than 'str', to avoid shadowing the built-in
>>> s.split("zip_code")[1].split()[0]
'02021'
But I do not know how to do the same for each row of my 'location' column.
The best way is to use Series.str.extract(), which accepts a regex and applies the search to each row.
import pandas as pd

df = pd.DataFrame({'name': ['Bar1', 'Pizza', 'Tacos'],
                   'location': ['LA Jefferson zip_code 202378 Avenue free',
                                'Avenue 45 zip_code 45623 wichita st',
                                'Las Americas avenue 7 zip_code 67890 nicolas st']})
# capture the first whitespace-delimited token after "zip_code"
df['location'] = df['location'].str.extract(r'zip_code\s(.*?)\s')
>>> df
name location
0 Bar1 202378
1 Pizza 45623
2 Tacos 67890
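Note that (.*?)\s needs a trailing space after the zip code; if zip_code can appear at the very end of a string, a pattern like r'zip_code\s+(\S+)' is more robust. Rows where the word zip_code never occurs come back as NaN.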
I have an Excel file which includes 5 sheets. I should create 5 graphs, plotting each sheet as x and y, but I should do it in a loop. How can I do that?
You can load all the sheets:
import pandas as pd

f = pd.ExcelFile('users.xlsx')
Then extract sheet names:
>>> f.sheet_names
['User_info', 'purchase', 'compound', 'header_row5']
Now you can loop over the sheet names above. For example, one sheet:
>>> f.parse(sheet_name = 'User_info')
User Name Country City Gender Age
0 Forrest Gump USA New York M 50
1 Mary Jane CANADA Tornoto F 30
2 Harry Porter UK London M 20
3 Jean Grey CHINA Shanghai F 30
The loop looks like this:
for name in f.sheet_names:
    df = f.parse(sheet_name=name)
    # do something here
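If you simply want every sheet in memory at once, a dict comprehension keyed by sheet name also works:

dfs = {name: f.parse(sheet_name=name) for name in f.sheet_names}
dfs['User_info']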
No need to use separate variables; create the output lists and use this simple loop:
data = pd.ExcelFile("DCB_200_new.xlsx")
sheets = ['DCB_200_9', 'DCB_200_15', 'DCB_200_23', 'DCB_200_26', 'DCB_200_28']
x = []
y = []
for e in sheets:
    x.append(pd.read_excel(data, e, usecols=[2], skiprows=[0, 1]))
    y.append(pd.read_excel(data, e, usecols=[1], skiprows=[0, 1]))
But ideally you should be able to load the data only once and loop over the sheets/columns. Please update your question with more info.
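For example, a minimal sketch of reading each sheet once and plotting column 2 against column 1, assuming matplotlib and the column layout implied by the question:

import matplotlib.pyplot as plt
import pandas as pd

data = pd.ExcelFile("DCB_200_new.xlsx")
for sheet in data.sheet_names:
    df = data.parse(sheet, skiprows=[0, 1], header=None)
    plt.figure()
    plt.plot(df.iloc[:, 2], df.iloc[:, 1])  # x from column 2, y from column 1, as above
    plt.title(sheet)
plt.show()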
I have multiple rows in a CSV file that have to be converted to a pandas DataFrame, but in some rows the columns 'name' and 'business' contain several names and businesses that belong in separate rows. These need to be split up while keeping the data from the other columns the same for each row that results from the split.
Here is the example data:
input:
software  name                                 business
abc       Andrew Johnson, Steve Martin         Outsourcing/Offshoring, 201-500 employees,Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones, Rick Paul, Johnny Jones  Banking, 1001-5000 employees,Construction, 51-200 employees,Consumer Goods, 10,001+ employees
def       Tom D., Connie J., Ricky B.          Unspecified, Unspecified, Self-employed
output I need:
software  name            business
abc       Andrew Johnson  Outsourcing/Offshoring, 201-500 employees
abc       Steve Martin    Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones      Banking, 1001-5000 employees
xyz       Rick Paul       Construction, 51-200 employees
xyz       Johnny Jones    Consumer Goods, 10,001+ employees
def       Tom D.          Unspecified
def       Connie J.       Unspecified
def       Ricky B.        Self-employed
There are additional columns similar to 'name' and 'business' that contain multiple pieces of information and need to be split up in the same way. Cells that contain multiple pieces of information are in sequence (ordered).
Here's the code I have so far. It creates new rows, but it only splits up the contents of the name column, which leaves the business column and a few other columns still needing to be split along with it.
name2 = df.name.str.split(',', expand=True).stack()
df = df.join(pd.Series(index=name2.index.droplevel(1), data=name2.values, name='name2'))
records = df.to_dict('records')  # one dict per row; don't shadow the built-in dict
for row in records:
    new_segment = {}
    new_segment['name'] = str(row['name2'])
    #df['name'] = str(row['name2'])
    for col, content in new_segment.items():
        row[col] = content
df = pd.DataFrame.from_dict(records)
df = df.drop(columns='name2')
Here's an alternative solution I was trying as well, but it gives me an error too:
import glob
import io
from pathlib import Path
import pandas as pd

review_path = r'data/base_data'
review_files = glob.glob(review_path + "/test_data.csv")
review_df_list = []
for review_file in review_files:
    df = pd.read_csv(io.StringIO(review_file), sep='\t')
    print(df.head())
    df["business"] = (df["business"].str.extractall(r"(?:[\s,]*)(.*?(?:Unspecified|employees|Self-employed))").groupby(level=0).agg(list))
    df["name"] = df["name"].str.split(r"\s*,\s*")
    print(df.explode(["name", "business"]))
    outPutPath = Path('data/base_data/test_data.csv')
    df.to_csv(outPutPath, index=False)
Error Message for alternative solution:
Read:data/base_data/review_base.csv
Success!
Empty DataFrame
Columns: [data/base_data/test_data.csv]
Index: []
The Empty DataFrame in your alternative attempt comes from pd.read_csv(io.StringIO(review_file), ...): it parses the file name itself as CSV content, which is why the only "column" is the path. Pass the path directly, e.g. pd.read_csv(review_file). For the splitting itself, try:
df["business"] = (
df["business"]
.str.extractall(r"(?:[\s,]*)(.*?(?:Unspecified|employees|Self-employed))")
.groupby(level=0)
.agg(list)
)
df["name"] = df["name"].str.split(r"\s*,\s*")
print(df.explode(["name", "business"]))
Prints:
software name business
0 abc Andrew Johnson Outsourcing/Offshoring, 201-500 employees
0 abc Steve Martin Health, Wellness and Fitness, 5001-10,000 employees
1 xyz Jack Jones Banking, 1001-5000 employees
1 xyz Rick Paul Construction, 51-200 employees
1 xyz Johnny Jones Consumer Goods, 10,001+ employees
2 def Tom D. Unspecified
2 def Connie J. Unspecified
2 def Ricky B. Self-employed
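Note that exploding several columns at once requires pandas 1.3+, and name and business must contain lists of equal length in each row, otherwise explode raises a ValueError.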
I have a dataframe df such that:
df['user_location'].value_counts()
India 3741
United States 2455
New Delhi, India 1721
Mumbai, India 1401
Washington, DC 1354
...
SpaceCoast,Florida 1
stuck in a book. 1
Beirut , Lebanon 1
Royston Vasey - Tralfamadore 1
Langham, Colchester 1
Name: user_location, Length: 26920, dtype: int64
I want to know the frequency of specific countries like USA, India from the user_location column. Then I want to plot the frequencies as USA, India, and Others.
So, I want to apply some operation on that column such that the value_counts() will give the output as:
India (sum of all frequencies of all the locations in India including cities, states, etc.)
USA (sum of all frequencies of all the locations in the USA including cities, states, etc.)
Others (sum of all frequencies of the other locations)
It seems I should merge the frequencies of rows containing the same country name and lump the rest together, but that looks complex once city and state names come into play. What is the most efficient way to do it?
Adding to @Trenton_McKinney's answer in the comments: if you need to map different countries' states/provinces to the country name, you will have to do a little work to make those associations. For example, for India and the USA, you can grab a list of their states from Wikipedia and map them against your own data to relabel them with their respective country names, as follows:
# Get states of India and USA
in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_states = pd.read_html(in_url)[3].iloc[:, 0].tolist()
us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
us_states = pd.read_html(us_url)[0].iloc[:, 0].tolist()
states = in_states + us_states
# Make a sample dataframe
df = pd.DataFrame({'Country': states})
Country
0 Andhra Pradesh
1 Arunachal Pradesh
2 Assam
3 Bihar
4 Chhattisgarh
... ...
73 Virginia[E]
74 Washington
75 West Virginia
76 Wisconsin
77 Wyoming
Map state names to country names:
# Map state names to country name
states_dict = {state: 'India' for state in in_states}
states_dict.update({state: 'USA' for state in us_states})
df['Country'] = df['Country'].map(states_dict)
Country
0 India
1 India
2 India
3 India
4 India
... ...
73 USA
74 USA
75 USA
76 USA
77 USA
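To then get the India/USA/Others counts from the original column, a minimal sketch reusing states_dict from above (this only catches exact state-name matches, not strings like 'New Delhi, India'):

# map exact state names to a country; everything unmatched becomes 'Others'
counts = df['user_location'].map(states_dict).fillna('Others').value_counts()
print(counts)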
But from your data sample it looks like you will have a lot of edge cases to deal with as well.
Using the concept of the previous answer, I first tried to get all the locations, including cities, unions, states, districts, and territories. Then I wrote a function checkl() that checks whether a location belongs to India or the USA and converts it into the country name. Finally, the function is applied to the dataframe column df['user_location']:
# Trying to get all the locations of USA and India
import pandas as pd

us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
us_tables = pd.read_html(us_url)  # fetch and parse the page once
us_states = us_tables[0].iloc[:, 0].tolist()
us_cities = us_tables[0].iloc[:, 1].tolist() + us_tables[0].iloc[:, 2].tolist() + us_tables[0].iloc[:, 3].tolist()
us_Federal_district = us_tables[1].iloc[:, 0].tolist()
us_Inhabited_territories = us_tables[2].iloc[:, 0].tolist()
us_Uninhabited_territories = us_tables[3].iloc[:, 0].tolist()
us_Disputed_territories = us_tables[4].iloc[:, 0].tolist()
us = us_states + us_cities + us_Federal_district + us_Inhabited_territories + us_Uninhabited_territories + us_Disputed_territories

in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_tables = pd.read_html(in_url)
in_states = in_tables[3].iloc[:, 0].tolist() + in_tables[3].iloc[:, 4].tolist() + in_tables[3].iloc[:, 5].tolist()
in_unions = in_tables[4].iloc[:, 0].tolist()
ind = in_states + in_unions

usToStr = ' '.join(str(elem) for elem in us)
indToStr = ' '.join(str(elem) for elem in ind)
# Country name checker function
def checkl(T):
    TSplit_space = [x.lower().strip() for x in T.split()]
    TSplit_comma = [x.lower().strip() for x in T.split(',')]
    TSplit = list(set().union(TSplit_space, TSplit_comma))
    res_ind = [ele for ele in ind if ele in T]
    res_us = [ele for ele in us if ele in T]
    if 'india' in TSplit or 'hindustan' in TSplit or 'bharat' in TSplit or T.lower() in indToStr.lower() or res_ind:
        T = 'India'
    elif 'US' in T or 'USA' in T or 'United States' in T or 'usa' in TSplit or 'united state' in TSplit or T.lower() in usToStr.lower() or res_us:
        T = 'USA'
    elif len(T.split(',')) > 1:
        if T.split(',')[0] in indToStr or T.split(',')[1] in indToStr:
            T = 'India'
        elif T.split(',')[0] in usToStr or T.split(',')[1] in usToStr:
            T = 'USA'
        else:
            T = "Others"
    else:
        T = "Others"
    return T
# Applying the function to the dataframe column
print(df['user_location'].dropna().apply(checkl).value_counts())
Others 74206
USA 47840
India 20291
Name: user_location, dtype: int64
I am quite new to Python coding. I think this code could be written in a better and more compact form, and as mentioned in the previous answer, there are still a lot of edge cases to deal with. So I have also posted it on Code Review Stack Exchange. Any criticisms and suggestions to improve the efficiency and readability of my code would be greatly appreciated.
So basically I need to create a dictionary (or another similar data structure) in which I give an array as one parameter and it maps that array to a keyword. One example of this could be: you have an array
a=['Ohio','California','Colorado']
and
b=['Los Angeles','San Diego','Denver']
Real example:
data['OCC_Desc'] = ['Oncology','Market_ana','67W045','Fret678',etc..]
data['LoB'] = ['7856op','Ran0p','Mkl45',etc..]
parameter:
param['column_name']=['OCC_Desc','LoB','OCC',etc..]
param['parameter'] = ['Oncology','7856op','Fret678',etc...]
param['Offering'] = ['Medicine','Transport','Supplies',etc...]
and the output, if I used this "dictionary", would be (in this brief example):
data['Offering'] = ['Medicine','Transport']
Example of the structure of the dataframe for the parameters
Column Parameter Offering
0 Location Los Angeles City
1 Team Los Angeles Lakers
2 Location Colorado State
3 Food Italy Pizza
4 Location Germany Country
In the first example above, you would link 'a' to the string 'State' and 'b' to the string 'City'. So later on, when I implement it on a dataframe, I can give a column as input, and this "dictionary" will check each row of the dataframe against these arrays and return either 'City' or 'State'.
The example above is just for understanding the problem. In reality I have to do this to a dataset for school, and I have parameters for multiple columns linking to multiple categories (14 columns that act as parameters and 16 different categories as results; each category can be the result of multiple parameters [literally hundreds] and from different columns).
import pandas

a = ['Ohio', 'California', 'Colorado']
b = ['Los Angeles', 'San Diego', 'Denver']

param_dict = {
    'State': a,
    'City': b}

df_dict = {
    'x': ['Ohio', 'ToothBrush', 'San Diego'],
    'y': ['Colorado', 'Hammer', 'Denver'],
    'z': ['California', 'Gun', 'Los Angeles']}

df = pandas.DataFrame(data=df_dict)
print(df)

for index, row in df.iterrows():
    if set(param_dict.get('State')).issubset(set(row)):
        print(f"Row {list(row)} at index {index} is a State")
    if set(param_dict.get('City')).issubset(set(row)):
        print(f"Row {list(row)} at index {index} is a City")
OUTPUT
x y z
0 Ohio Colorado California
1 ToothBrush Hammer Gun
2 San Diego Denver Los Angeles
Row ['Ohio', 'Colorado', 'California'] at index 0 is a State
Row ['San Diego', 'Denver', 'Los Angeles'] at index 2 is a City
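For the parameter table in the question, one minimal sketch of a lookup keyed by (column, value) pairs, using the example's column and category names (treat these as placeholders for your real data):

import pandas as pd

# parameter table from the question: (Column, Parameter) -> Offering
param = pd.DataFrame({
    'Column':    ['Location', 'Team', 'Location', 'Food', 'Location'],
    'Parameter': ['Los Angeles', 'Los Angeles', 'Colorado', 'Italy', 'Germany'],
    'Offering':  ['City', 'Lakers', 'State', 'Pizza', 'Country']})

# build a dict keyed by (column name, cell value)
lookup = {(c, p): o for c, p, o in zip(param['Column'], param['Parameter'], param['Offering'])}

# hypothetical data to classify
data = pd.DataFrame({'Location': ['Los Angeles', 'Colorado', 'Germany']})

col = 'Location'
data['Offering'] = [lookup.get((col, v)) for v in data[col]]
print(data)

The same 'Los Angeles' value then yields 'City' when it appears in a Location column and 'Lakers' when it appears in a Team column.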