[Boostcamp Day-5] Python - Pandas

ju_young 2021. 8. 6. 13:16

Installing pandas

conda install pandas

Series

A Series is an object that holds the data of a single column. The index argument lets you assign a label to each entry.

import pandas as pd
a = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

An entry can be read or assigned through its index label:

a['a']
a['a'] = 2

The following attributes expose information about the Series:

a.values # the values only
a.index # the index only
a.name # name of the Series
a.index.name # name of the index
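As a minimal sketch of the Series operations above, the following builds a labeled Series, reads and assigns by label, and inspects its attributes. The values and the names "count" and "label" are made up for illustration.

```python
import pandas as pd

# Values and names here are illustrative, not from the original post.
s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], name="count")
s.index.name = "label"

first = s["a"]   # read by index label
s["a"] = 10      # assign by index label

values = s.values.tolist()  # the values only
labels = s.index.tolist()   # the index labels only
print(values, labels, s.name, s.index.name)
```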

DataFrame

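A DataFrame is an object that holds an entire data table. Passing the data together with the column names creates the DataFrame. Here data can be, for example, a dict that maps each column name to a list of values.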

df = pd.DataFrame(data, columns=['first_name', 'last_name', 'age', 'city'])

Selecting a column by name returns that column as a Series:

df['first_name']
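The post does not show what data looks like, so here is a small sketch that assumes a dict of lists. The people and cities are invented sample rows.

```python
import pandas as pd

# Hypothetical sample data; the original post does not define `data`.
data = {
    "first_name": ["Jason", "Molly", "Tina"],
    "last_name": ["Miller", "Jacobson", "Ali"],
    "age": [42, 52, 36],
    "city": ["San Francisco", "Baltimore", "Miami"],
}
df = pd.DataFrame(data, columns=["first_name", "last_name", "age", "city"])

names = df["first_name"]  # a single column comes back as a Series
print(type(names).__name__, df.shape)
```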

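indexing

Rows can be indexed with loc or iloc: loc selects by index label, while iloc selects by integer position. The difference is easier to see in code:

df.loc['b'] # the row whose index label is 'b'
df.iloc[1:] # every row from position 1 onward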
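A short sketch of the loc/iloc difference, assuming a DataFrame with string index labels (the ages are made up). It also shows one detail worth knowing: a label slice with loc includes its end point, while a position slice with iloc does not.

```python
import pandas as pd

# Illustrative data with string labels as the index.
df = pd.DataFrame({"age": [42, 52, 36]}, index=["a", "b", "c"])

row_b = df.loc["b"]   # by label: the row labeled 'b' (a Series)
tail = df.iloc[1:]    # by position: rows at position 1 and later

label_slice = df.loc["a":"b"]   # label slices include the end label
position_slice = df.iloc[0:1]   # position slices exclude the end position
print(len(label_slice), len(position_slice))
```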

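Assigning data

The following adds a new column named debt holding a Series of bool values. The column has to be created with bracket notation; assigning to df.debt only sets an attribute on the object and does not add a column.

df['debt'] = df.age > 40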
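A minimal sketch of adding a boolean column, using made-up ages. The comparison produces a bool Series that is stored under the new column name.

```python
import pandas as pd

# Illustrative ages only.
df = pd.DataFrame({"age": [42, 52, 36]})

# Bracket assignment creates the new column from the bool Series.
df["debt"] = df["age"] > 40
print(df)
```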

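transpose, values & CSV conversion

df.T # transpose
df.values # the values as a NumPy array
df.to_csv() # CSV text; pass a file path to write to disk instead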

Deleting a column

del df["debt"]

selection & drop

df['account'] # select a single column
df[['account', 'street', 'state']] # select several columns
df[:3] # slice rows
df['account'][:3] # slice the rows of one column

a = df['account']
a[[0,1,2]] # select several index labels
a[a<250000] # boolean indexing

df.index = df['account'] # use the account column as the index

df.drop(1) # drop the row whose index label is 1 (returns a new DataFrame)
df.drop('city', axis=1) # axis=1 drops a column instead of a row
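The selection and drop calls above can be tried on a small table. This is a sketch with invented account numbers and cities; note that drop returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical rows for illustration.
df = pd.DataFrame({
    "account": [211829, 320563, 648336, 109996],
    "city": ["Albany", "Portland", "Austin", "Salem"],
    "state": ["NY", "OR", "TX", "OR"],
})

subset = df[["account", "state"]]                 # several columns
first_two = df[:2]                                # first two rows
small = df["account"][df["account"] < 250000]     # boolean indexing

no_city = df.drop("city", axis=1)   # drop a column
no_row1 = df.drop(1)                # drop the row labeled 1
print(small.tolist(), no_row1.index.tolist())
```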

operations

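s1 + s2 # s1 and s2 are both Series; values are aligned by index label
df1.add(df2, fill_value=0) # between DataFrames, fill_value stands in for an entry missing on one side, so the result is not NaN there
df.add(s2, axis=0) # between a DataFrame and a Series, the Series is broadcast along the given axis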
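To make the alignment and fill_value behavior concrete, here is a sketch with two Series whose labels only partly overlap, plus a DataFrame/Series addition along axis=0. All names and numbers are illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

plain = s1 + s2                     # labels present on one side only become NaN
filled = s1.add(s2, fill_value=0)   # a missing side is treated as 0

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r0", "r1"])
row_offsets = pd.Series([100, 200], index=["r0", "r1"])

# axis=0 matches the Series against the row index and
# broadcasts it across every column.
shifted = df.add(row_offsets, axis=0)
print(filled.tolist(), shifted.values.tolist())
```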

map

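map applies the given function to each value; replace substitutes values according to a mapping.

s1.map(lambda x: x**2) # on a Series
df.replace({'male':0, 'female':1}) # returns a copy with each sex replaced by 0 or 1
df.replace({'male':0, 'female':1}, inplace=True) # inplace=True modifies df itself instead of returning a copy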
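A small sketch of map and replace on invented data. Series.map also accepts a dict, which is one way to encode categories as numbers; the replace call here swaps strings for strings purely to show that the original DataFrame is unchanged unless inplace=True is given.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
squared = s.map(lambda x: x ** 2)   # function applied to each value

# Illustrative column; the values are made up.
df = pd.DataFrame({"sex": ["male", "female", "male"]})
encoded = df["sex"].map({"male": 0, "female": 1})   # dict lookup per value

# Without inplace=True, replace returns a new DataFrame.
replaced = df.replace({"male": "M", "female": "F"})
print(squared.tolist(), encoded.tolist(), replaced["sex"].tolist())
```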

apply

Applies the given function to each column.

df.apply(lambda x:x.max() - x.min())
df.apply(sum)

# the function may also return a Series
def f(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])
df.apply(f)
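The shape of the result depends on what the function returns, which the sketch below illustrates with made-up numbers: a scalar per column gives a Series, and a Series per column gives a DataFrame. The helper name min_max is hypothetical.

```python
import pandas as pd

# Illustrative numeric data.
df = pd.DataFrame({"earn": [10.0, 30.0, 20.0], "age": [25, 40, 31]})

# One scalar per column -> result is a Series indexed by column name.
ranges = df.apply(lambda col: col.max() - col.min())

def min_max(col):
    # One Series per column -> result is a DataFrame.
    return pd.Series([col.min(), col.max()], index=["min", "max"])

summary = df.apply(min_max)
print(ranges)
print(summary)
```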

applymap

Applies a function element by element rather than Series by Series.

df.applymap(lambda x : -x) # deprecated since pandas 2.1 in favor of df.map

built-in function

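df = pd.read_csv("test.csv") # read a CSV file
df.describe() # summary statistics of the data
df['age'].unique() # array of unique values (a Series method)
df.isnull() # boolean mask marking which values are null
df.sort_values(['age', 'earn'], ascending=True) # sort rows by column values
df.age.corr(df.earn) # correlation coefficient between two columns
df.age.cov(df.earn) # covariance between two columns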
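Since test.csv is not available here, the sketch below builds a tiny DataFrame in memory (invented ages and earnings) and runs the same kind of calls on it.

```python
import pandas as pd

# Stand-in for the CSV contents; values are made up.
df = pd.DataFrame({
    "age": [30, 25, 30, 41],
    "earn": [300.0, 250.0, 310.0, 400.0],
})

summary = df.describe()                  # count, mean, std, quartiles, ...
unique_ages = df["age"].unique()         # unique values in order of appearance
missing = df.isnull()                    # True where a value is null
sorted_df = df.sort_values(["age", "earn"], ascending=True)
corr = df["age"].corr(df["earn"])        # Pearson correlation by default
cov = df["age"].cov(df["earn"])          # sample covariance
print(unique_ages, corr, cov)
```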

groupby

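df.groupby("team")["points"] # team: the column to group by, points: the column the operation applies to
# h_index: assumed here to be a Series with a two-level index, e.g. a sum grouped by two columns
h_index.unstack() # reshape the grouped data into matrix form
h_index.swaplevel().sort_index(level=0) # swap the index levels, then sort
h_index.groupby(level=0).sum() # aggregate per index level
h_index.groupby(level=1).sum()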

# the groups produced by the groupby split can be iterated over
grouped = df.groupby('team')
for name, group in grouped:
    print(name)
    print(group)

grouped.get_group('devils') # only the group with the given key

# aggregation: extract summary statistics (several functions can be applied at once)
grouped.agg(sum)

# transformation: transform the values within each group
score = lambda x : (x - x.mean()) / x.std()
grouped.transform(score)

# filtration: keep only the groups that satisfy a condition
df.groupby('team').filter(lambda x: len(x) >= 3)
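The groupby steps above can be sketched end to end on a small table. The teams, years, and points are invented, and h_index is assumed to be a sum grouped by two columns, which the original post does not show.

```python
import pandas as pd

# Hypothetical league data for illustration.
df = pd.DataFrame({
    "team": ["devils", "devils", "devils", "kings", "kings", "royals"],
    "year": [2014, 2015, 2016, 2014, 2015, 2014],
    "points": [800, 700, 900, 600, 500, 400],
})

totals = df.groupby("team")["points"].sum()

# Assumed definition: grouping by two columns gives a two-level index.
h_index = df.groupby(["team", "year"])["points"].sum()
matrix = h_index.unstack()                    # teams as rows, years as columns
per_team = h_index.groupby(level=0).sum()     # aggregate over the first level

devils = df.groupby("team").get_group("devils")

# transformation: z-score of points within each team
zscores = df.groupby("team")["points"].transform(
    lambda x: (x - x.mean()) / x.std()
)

# filtration: keep only teams with at least 3 rows
big_teams = df.groupby("team").filter(lambda g: len(g) >= 3)
print(totals, matrix, sep="\n")
```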

Pivot Table

It works like the pivot table familiar from Excel.

df_phone.pivot_table(["duration"], index=[df_phone.month, df_phone.item], columns=df_phone.network, aggfunc="sum", fill_value=0)
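A minimal pivot table sketch, assuming a df_phone with month, item, network, and duration columns. The rows are made up, and column names are passed as strings rather than as Series.

```python
import pandas as pd

# Hypothetical phone log for illustration.
df_phone = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-11", "2014-12", "2014-12"],
    "item": ["call", "call", "sms", "call", "sms"],
    "network": ["Vodafone", "Tesco", "Vodafone", "Vodafone", "Tesco"],
    "duration": [10.0, 20.0, 1.0, 30.0, 1.0],
})

# Rows: (month, item); columns: network; cells: summed duration.
table = df_phone.pivot_table(
    values="duration",
    index=["month", "item"],
    columns="network",
    aggfunc="sum",
    fill_value=0,   # combinations with no data become 0 instead of NaN
)
print(table)
```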

crosstab

pd.crosstab(index=df_movie.critic, columns=df_movie.title, values=df_movie.rating, aggfunc="first").fillna(0)
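The crosstab call above can be read as "one row per critic, one column per title, rating in each cell". A sketch with invented critics, titles, and ratings:

```python
import pandas as pd

# Hypothetical ratings for illustration.
df_movie = pd.DataFrame({
    "critic": ["Jack", "Jack", "Lisa", "Lisa"],
    "title": ["Alien", "Brazil", "Alien", "Casino"],
    "rating": [3.0, 4.5, 2.5, 5.0],
})

# aggfunc="first" takes the first rating found for each (critic, title) pair;
# pairs with no rating are NaN until fillna(0) replaces them.
ratings = pd.crosstab(
    index=df_movie["critic"],
    columns=df_movie["title"],
    values=df_movie["rating"],
    aggfunc="first",
).fillna(0)
print(ratings)
```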

merge

Combines two datasets into one.

pd.merge(df_a, df_b, on='subject_id') # on names the key column to join on
pd.merge(df_a, df_b, on='subject_id', how='left') # left join
pd.merge(df_a, df_b, on='subject_id', how='right') # right join
pd.merge(df_a, df_b, on='subject_id', how='outer') # full (outer) join
pd.merge(df_a, df_b, on='subject_id', how='inner') # inner join
pd.merge(df_a, df_b, right_index=True, left_index=True) # index-based join
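To see how the how argument changes the result, here is a sketch with two small tables whose keys only partly overlap. The ids, scores, and names are made up.

```python
import pandas as pd

# Hypothetical tables sharing a subject_id key.
df_a = pd.DataFrame({"subject_id": [1, 2, 3], "test_score": [51, 15, 61]})
df_b = pd.DataFrame({"subject_id": [2, 3, 4],
                     "first_name": ["Amy", "Allen", "Alice"]})

inner = pd.merge(df_a, df_b, on="subject_id")               # default: keys in both
left = pd.merge(df_a, df_b, on="subject_id", how="left")    # all keys of df_a
right = pd.merge(df_a, df_b, on="subject_id", how="right")  # all keys of df_b
outer = pd.merge(df_a, df_b, on="subject_id", how="outer")  # keys from either side
print(len(inner), len(left), len(right), len(outer))
```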

concat

Concatenates data of the same shape.

df = pd.concat([df_a, df_b], axis=1)
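The axis argument decides the direction of the concatenation. A sketch with two invented tables: axis=0 stacks rows, axis=1 places the tables side by side.

```python
import pandas as pd

# Illustrative tables with identical columns.
df_a = pd.DataFrame({"id": [1, 2], "name": ["A", "B"]})
df_b = pd.DataFrame({"id": [3, 4], "name": ["C", "D"]})

stacked = pd.concat([df_a, df_b], axis=0, ignore_index=True)  # 4 rows, 2 columns
side_by_side = pd.concat([df_a, df_b], axis=1)                # 2 rows, 4 columns
print(stacked.shape, side_by_side.shape)
```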

database

A database connection is available through sqlite3.

import sqlite3
conn = sqlite3.connect('./test.db')
cur = conn.cursor()
cur.execute('select * from airlines limit 5;')
results = cur.fetchall()
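Since test.db is not available here, this sketch uses an in-memory SQLite database with a made-up airlines table. It also shows pd.read_sql_query, which loads a query result directly into a DataFrame instead of going through a cursor.

```python
import sqlite3
import pandas as pd

# In-memory database as a stand-in for ./test.db; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("create table airlines (id integer, name text)")
conn.executemany(
    "insert into airlines values (?, ?)",
    [(1, "Alpha Air"), (2, "Beta Jet")],
)

# Query result straight into a DataFrame.
df_airlines = pd.read_sql_query("select * from airlines limit 5;", conn)
conn.close()
print(df_airlines)
```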

XLS

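With an Excel engine such as xlsxwriter, a DataFrame can be exported to an Excel file.

writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df_routes.to_excel(writer, sheet_name='Sheet1')
writer.close() # the file is written when the writer is closed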

Pickle

Pickle is the most common way to persist Python objects to a file; pandas provides the to_pickle and read_pickle functions for it.

df_routes.to_pickle("./test.pickle")
df_routes_pickle = pd.read_pickle("./test.pickle")
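A round-trip sketch, assuming a small stand-in for df_routes (the rows are made up). It writes to a temporary directory so nothing is left on disk.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for df_routes.
df_routes = pd.DataFrame({"airline": ["2B", "2O"], "stops": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.pickle")
    df_routes.to_pickle(path)          # serialize the DataFrame
    restored = pd.read_pickle(path)    # load it back

print(restored.equals(df_routes))
```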
