I'm learning Python and machine learning, and I'm trying to create a very simple CSV from synthetic data.
Can anyone help me tweak this to get it to work in PyCharm?
I'm trying to insert a random value from the selection into each column.
Much appreciated.
import random
import pandas as pd
marriage_status = {'single', 'married', 'divorced', 'widowed', 'complicated'}
children = {'yes', 'no'}
employment = {'employed', 'self_employed', 'unemployed', 'student'}
income_abroad = {'yes', 'no'}
gender = {'M', 'F'}
response = {'refund', 'payment'}
columns = ['marriage_status', 'children', 'employment',
           'income_abroad', 'age', 'gender', 'income', 'expenses', 'response']
df = pd.DataFrame(columns=columns)
for i in range(1000):
    marriage_status = random.choice(list(marriage_status))
    children = random.choice(list(children))
    employment = random.choice(list(employment))
    income_abroad = random.choice(list(income_abroad))
    gender = random.choice(list(gender))
    response = random.choice(list(response))
    age = random.randint(18, 70)
    income = random.randint(0, 100000)
    expenses = random.randint(0, 10000)
    df = [marriage_status, children, employment, income_abroad, age, gender, income, expenses, response]
df[6].to_csv('taxfix_data.csv', index=False)
If you're going to use pandas, the easiest way is to build the DataFrame directly. Note that every column must have the same length, so draw one value per row from each option list (for example with random.choices):
import random
import pandas as pd

n = 1000
df = pd.DataFrame(
    {"marriage_status": random.choices(['single', 'married', 'divorced', 'widowed', 'complicated'], k=n),
     "children": random.choices(['yes', 'no'], k=n),
     "employment": random.choices(['employed', 'self_employed', 'unemployed', 'student'], k=n),
     "gender": random.choices(['M', 'F'], k=n),
     "response": random.choices(['refund', 'payment'], k=n),
     "income_abroad": random.choices(['yes', 'no'], k=n)})
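If many rows are needed, a vectorized sketch using NumPy's random generator also works and avoids a Python-level loop entirely (the column set here is trimmed for brevity, and the seed is just illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 1000

# draw all n values for each column in one vectorized call
df = pd.DataFrame({
    'marriage_status': rng.choice(['single', 'married', 'divorced', 'widowed', 'complicated'], size=n),
    'children': rng.choice(['yes', 'no'], size=n),
    'gender': rng.choice(['M', 'F'], size=n),
    'age': rng.integers(18, 71, size=n),      # upper bound is exclusive, so this gives 18..70
    'income': rng.integers(0, 100001, size=n),
})
df.to_csv('taxfix_data.csv', index=False)
```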
Also here's a really useful cheatsheet for pandas https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
Related
I have a CSV file with two levels of headers, and I don't know how to describe it, so I've pasted it below. I need to reorder it into a normal CSV file. There is no information for the "age" key. I just want to retrieve the name and age fields; I need to output "first_name", "last_name", "age", and use "first_name", "last_name", "age" as the titles.
"ID","meta_key","meta_data"
1,"nickname","dale ganger"
2,"first_name","ganger"
3,"last_name","dale"
4,"age",
5,"sex","F"
6,"nickname","dale ganger"
7,"first_name","ganger"
8,"last_name","dale"
9,"age",
10,"sex","F"
11,"nickname","dale ganger"
12,"first_name","ganger"
13,"last_name","dale"
14,"age",
15,"sex","F"
I used this code, but it doesn't merge the headers,
import pandas as pd
pd.read_csv('input.csv', header=None).T.to_csv('output.csv', header=False, index=False)
output
ID,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
meta_key,nickname,first_name,last_name,age,sex,nickname,first_name,last_name,age,sex,nickname,first_name,last_name,age,sex
meta_data,dale ganger,ganger,dale,,F,dale ganger,ganger,dale,,F,dale ganger,ganger,dale,,F
The final look I want
nickname,first_name,last_name,age,sex
dale ganger,ganger,dale,,F
dale ganger,ganger,dale,,F
dale ganger,ganger,dale,,F
Try:
df_csv = pd.read_csv('data.csv')
df = df_csv.drop('ID', axis=1).transpose()
df.columns = df.iloc[0]
df = df.iloc[1:, :].reset_index(drop=True)
df = df[['first_name', 'last_name', 'age']]
to replicate everything:
import pandas as pd
data = {'ID': [1, 2, 3, 4, 5],
        'meta_key': ['nickname', 'first_name', 'last_name', 'age', 'sex'],
        'meta_data': ['dale ganger', 'ganger', 'dale', '', 'F']}
df_csv = pd.DataFrame(data)
print(df_csv) # before change
df = df_csv.drop('ID', axis=1).transpose()
df.columns = df.iloc[0] # or use below
# df.columns = ['nickname', 'first_name', 'last_name', 'age', 'sex']
df = df.iloc[1:, :].reset_index(drop=True)
df = df[['first_name', 'last_name', 'age']]
print(df) # after change
output is:
  first_name last_name age
0     ganger      dale
I see you changed your question, now the data is repeating every 5 rows. Then I would do this below:
df = pd.read_csv('unstructured.csv')
# create a dictionary to store the data each iteration
data_dict = {'nickname': [], 'first_name': [], 'last_name': [], 'age': [], 'sex': []}
for i in range(0, len(df), 5):
    data_dict['nickname'].append(df['meta_data'][i])
    data_dict['first_name'].append(df['meta_data'][i+1])
    data_dict['last_name'].append(df['meta_data'][i+2])
    data_dict['age'].append(df['meta_data'][i+3])
    data_dict['sex'].append(df['meta_data'][i+4])
new_df = pd.DataFrame(data_dict)
print(new_df)
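An alternative sketch that avoids hard-coding the group size of 5: number each repetition of a key with groupby().cumcount() and then pivot to wide format (the data below is made up to match the sample in the question):

```python
import pandas as pd

# sample long-format data, as in the question
df = pd.DataFrame({
    'ID': range(1, 11),
    'meta_key': ['nickname', 'first_name', 'last_name', 'age', 'sex'] * 2,
    'meta_data': ['dale ganger', 'ganger', 'dale', '', 'F'] * 2,
})

# number each occurrence of a key: 0 for the first record, 1 for the second, ...
df['record'] = df.groupby('meta_key').cumcount()

# pivot to wide format, then restore the original column order
wide = (df.pivot(index='record', columns='meta_key', values='meta_data')
          [['nickname', 'first_name', 'last_name', 'age', 'sex']])
wide.to_csv('output.csv', index=False)
```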
I'm trying to create a search but I'm facing an error. According to my tests I can search by name, but I would like to search by 'number_order'. Does anyone have a solution? Note that 'number_order' cannot be changed inside the DataFrame, e.g. 'number_order': [202204000001] -> 'number_order': ['202204000001'] is not allowed.
import pandas as pd
import matplotlib.pyplot as plt
d = {'number_order': [202204000001, 202204000002, 202204000003, 202204000004,
                      202204000005, 202204000006],
     'client': ['Roger Nascimento', 'Rodrigo Peixato', 'Pedro',
                'Rafael', 'Maria', 'Emerson'],
     'value': ['120', '187.74', '188.7', '300', '563.2', '198.0']
     }
df = pd.DataFrame(data = d)
src_field_data = '202004'
filtered_data = df['number_order']
filtered_data = df.loc[filtered_data.str.contains(f'^{src_field_data}', case = False)]
print(f'number_order FILTERED {filtered_data}\n')
I want to search like this example below, using only a part of the text:
import pandas as pd
import matplotlib.pyplot as plt
d = {'number_order': [202204000001, 202204000002, 202204000003, 202204000004,
                      202204000005, 202204000006],
     'client': ['Roger Nascimento', 'Rodrigo Peixato', 'Pedro',
                'Rafael', 'Maria', 'Emerson'],
     'value': ['120', '187.74', '188.7', '300', '563.2', '198.0']
     }
df = pd.DataFrame(data = d)
src_field_data = 'R'
filtered_data = df['client']
filtered_data = df.loc[filtered_data.str.contains(f'^{src_field_data}', case = False)]
print(f'number_order FILTERED {filtered_data}\n')
Convert values to strings:
filtered_data = df.loc[filtered_data.astype(str).str.contains(f'^{src_field_data}', case = False)]
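Putting the fix into the original snippet (a sketch with the same sample data, trimmed to three rows; the prefix is shortened to '2022' so it actually matches the sample numbers):

```python
import pandas as pd

d = {'number_order': [202204000001, 202204000002, 202204000003],
     'client': ['Roger Nascimento', 'Rodrigo Peixato', 'Pedro'],
     'value': ['120', '187.74', '188.7']}
df = pd.DataFrame(data=d)

src_field_data = '2022'
# cast the integer column to str so the .str accessor works
mask = df['number_order'].astype(str).str.contains(f'^{src_field_data}', case=False)
filtered_data = df.loc[mask]
print(f'number_order FILTERED\n{filtered_data}\n')
```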
Hello, fellow developers on Stack Overflow.
I have string data in this form:
'key=apple; age=10; key=boy; age=3'
How can I convert it into a pandas DataFrame so that key and age become the headers and all the values fill the columns?
key age
apple 10
boy 3
Try this:
import pandas as pd
data = 'key=apple; age=10; key=boy; age=3'
words = data.split(";")
key = []
age = []
for word in words:
    if "key" in word:
        key.append(word.split("=")[1])
    else:
        age.append(word.split("=")[1])
df = pd.DataFrame(key, columns=["key"])
df["age"] = age
print(df)
You can try this:
import pandas as pd
str_stream = 'key=apple; age=10; key=boy; age=3'
lst_kv = str_stream.split(';')
# lst_kv => ['key=apple', ' age=10', ' key=boy', ' age=3']
res = [{s.split('=')[0].strip(): s.split('=')[1] for s in lst_kv[i:i+2]}
       for i in range(0, len(lst_kv), 2)
       ]
df = pd.DataFrame(res)
df
Output:
key age
0 apple 10
1 boy 3
More explanation for one line res :
res = []
for i in range(0, len(lst_kv), 2):
    dct_tmp = {}
    for s in lst_kv[i:i+2]:
        kv = s.split('=')
        dct_tmp[kv[0].strip()] = kv[1]
    res.append(dct_tmp)
res
Output:
[{'key': 'apple', 'age': '10'}, {'key': 'boy', 'age': '3'}]
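A compact alternative sketch that uses a regular expression to pull out the key=value pairs and then groups every two consecutive pairs into one record:

```python
import re
import pandas as pd

s = 'key=apple; age=10; key=boy; age=3'

# each match is a (name, value) tuple, e.g. ('key', 'apple')
pairs = re.findall(r'(\w+)=([^;]+)', s)

# each record consists of two consecutive pairs: key, then age
records = [dict(pairs[i:i+2]) for i in range(0, len(pairs), 2)]
df = pd.DataFrame(records)
print(df)
```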
I am trying to learn OOP and want to convert some code I have into a class.
My code:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
# assign previous day to variable, then group and sum
prev_day = pd.read_csv('C:/Users/Name/PycharmProjects/Corona Stats/TimeSeries/03-28-2020.csv')
prev_day = prev_day.replace(np.nan, 'Other', regex=True)
prev_day = prev_day.groupby(['Country_Region']).sum()
prev_day = prev_day.reset_index()
# assign current day to variable, removed unwanted columns, then group and sum
stats_reader = pd.read_csv('C:/Users/Name/PycharmProjects/Corona Stats/TimeSeries/03-29-2020.csv')
stats_reader = stats_reader.replace(np.nan, 'Other', regex=True)
stats_clean = stats_reader.drop(['FIPS', 'Last_Update', 'Lat', 'Long_'], axis=1)
stats_clean = stats_clean.rename(columns={
    'Admin2': 'County', 'Province_State': 'State', 'Country_Region': 'Country', 'Combined_Key': 'City'})
stats_clean = stats_clean.groupby(['Country']).sum()
stats_clean = stats_clean.reset_index()
# add in new columns to show difference between days
stats_clean['New Cases'] = stats_clean['Confirmed'] - prev_day['Confirmed']
stats_clean['New Deaths'] = stats_clean['Deaths'] - prev_day['Deaths']
stats_clean['New Recovered'] = stats_clean['Recovered'] - prev_day['Recovered']
stats_clean = stats_clean[[
    'Country', 'Confirmed', 'New Cases',
    'Deaths', 'New Deaths', 'Recovered', 'New Recovered', 'Active']]
stats_clean = stats_clean.replace(np.nan, 0, regex=True)
# calculate for global cases from previous day
prev_sum = prev_day.sum()
prev_sum['Country'] = 'World'
prev_sum = prev_sum[['Country', 'Confirmed', 'Deaths', 'Recovered']]
prev_sum = prev_sum.replace(np.nan, 0, regex=True)
# calculate for global cases for current day
sum_stats = stats_clean.sum()
sum_stats['Country'] = 'World'
sum_stats['New Cases'] = sum_stats['Confirmed'] - prev_sum['Confirmed']
sum_stats = sum_stats.replace(np.nan, 0, regex=True)
sum_stats = sum_stats[[
    'Country', 'Confirmed', 'New Cases', 'Deaths', 'New Deaths', 'Recovered', 'New Recovered', 'Active']]
My first attempt:
class Corona:
    def __init__(self):
        pass

    def country_sum(self, country):
        country = stats_clean['Country'].isin([country])
        print(country)

Corona.country_sum('US')
If I make this a static method, it runs, but I am not using the argument in country_sum. I want to filter for whatever country is passed.
I don't know how to use the argument in a method to filter for values in a column.
Sample rows from the original csv file:
FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
45001,Abbeville,South Carolina,US,2020-03-29 23:08:25,34.22333378,-82.46170658,3,0,0,0,"Abbeville, South Carolina, US"
22001,Acadia,Louisiana,US,2020-03-29 23:08:25,30.295064899999996,-92.41419698,9,1,0,0,"Acadia, Louisiana, US"
51001,Accomack,Virginia,US,2020-03-29 23:08:25,37.76707161,-75.63234615,3,0,0,0,"Accomack, Virginia, US"
16001,Ada,Idaho,US,2020-03-29 23:08:25,43.4526575,-116.24155159999998,92,1,0,0,"Ada, Idaho, US"
19001,Adair,Iowa,US,2020-03-29 23:08:25,41.33075609,-94.47105874,1,0,0,0,"Adair, Iowa, US"
If I am not mistaken, you should not perform all the calculation somewhere else outside the class and then access a variable such as stats_clean defined in the global scope.
You should rather be doing it like this:
import os

class Corona:
    root_dir = "<path-to-your-data-dir>"

    def __init__(self, date):
        # use glob or something if you want to process multiple files etc.
        self.file = os.path.join(self.root_dir, str(date) + ".csv")
        self._calculate_stats()

    def _calculate_stats(self):
        # <do all your dataset reading and calculations here>
        # <....>
        self.stats_clean = ...
        self.prev_sum = ...

    def country_sum(self, country='US'):
        return self.stats_clean['Country'].isin([country])
Then you can simply do:
corona = Corona('03-29-2020')
print(corona.country_sum(<your-country>))
This is just one way of doing it.
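To isolate just the filtering step, here is a minimal runnable sketch (the tiny DataFrame is made up; the real stats_clean would come from the CSV processing above):

```python
import pandas as pd

class Corona:
    def __init__(self, stats_clean):
        self.stats_clean = stats_clean

    def country_sum(self, country='US'):
        # use the argument in a boolean mask to keep only matching rows
        return self.stats_clean[self.stats_clean['Country'] == country]

stats = pd.DataFrame({'Country': ['US', 'Italy', 'US'],
                      'Confirmed': [100, 50, 30]})
corona = Corona(stats)
us = corona.country_sum('US')
print(us)
```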
I'm facing the following error when running this code:
Level None not found
pt = df.pivot_table(index='User Name', values=['Threat Score', 'Score'],
                    aggfunc={
                        'Threat Score': np.mean,
                        'Score': [np.mean, lambda x: len(x.dropna())]
                    },
                    margins=True)
pt = pt.sort_values('Score', ascending=False)
I want to take the average value of Threat Score & Score, plus a count per user name, then sort by Threat Score from high to low.
It's a bug in pandas; there is a GitHub issue for it. The error occurs when you combine multiple aggregations per column with margins=True; it won't occur if you set margins=False. You can add the margin totals back yourself afterwards if you want. This should work:
pt = df.pivot_table(index='User Name', values=['Threat Score', 'Score'],
                    aggfunc={
                        'Threat Score': np.mean,
                        'Score': [np.mean, lambda x: len(x.dropna())]
                    },
                    margins=False)
pt = pt.sort_values('Score', ascending=False)
Let me know if this works for you:
pt = df.pivot_table(index='User Agent', values=['Threat Score', 'Score', 'Source IP'],
                    aggfunc={'Source IP': 'count',
                             'Threat Score': np.mean,
                             'Score': np.mean})
pt = pt.sort_values('Threat Score', ascending=False)
new_cols = ['Avg_Score', 'Count', 'Avg_ThreatScore']
pt.columns = new_cols
pt.to_csv(Path3 + '\\AllUserAgent.csv')
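If the margins row is still needed after setting margins=False, one sketch (with made-up sample data) is to compute the overall aggregates yourself and append them as an "All" row:

```python
import pandas as pd

df = pd.DataFrame({'User Name': ['a', 'a', 'b'],
                   'Threat Score': [10.0, 20.0, 30.0],
                   'Score': [1.0, 2.0, None]})

# a single aggregation per column keeps the pivot simple
pt = df.pivot_table(index='User Name',
                    values=['Threat Score', 'Score'],
                    aggfunc='mean')
pt = pt.sort_values('Score', ascending=False)

# compute the overall row manually and append it under the label 'All'
all_row = df[['Score', 'Threat Score']].mean().rename('All')
pt = pd.concat([pt, all_row.to_frame().T])
print(pt)
```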