Python: pandas library overview

Python 2021. 6. 6. 20:41

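In the previous post we looked at the numpy library

and saw that it provides powerful features for data processing.

However, a plain numpy array has a limitation: it carries no row or column labels that describe what the data means.

That is why we use the pandas library.

With pandas, data can be organized into rows and columns and processed efficiently, much like an Excel sheet.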
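
To make the point about labels concrete, here is a minimal sketch. The sample values and the column names `temperature` and `wind` are made up for illustration and are not taken from the post's data files.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values (not from the post's data files).
values = np.array([[28.7, 3.4],
                   [25.2, 3.8]])

# With a plain numpy array, the reader must remember what row 0 / column 1 mean.
wind_numpy = values[0, 1]

# pandas attaches row and column labels to the same values.
frame = pd.DataFrame(values,
                     index=['2010-08-01', '2010-08-02'],
                     columns=['temperature', 'wind'])
wind_pandas = frame.loc['2010-08-01', 'wind']

print(wind_numpy, wind_pandas)
```

Both lookups return the same number; the pandas version simply says what is being looked up.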

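pandas overview

1. Features

- Fast, efficient, and expressive data structures

- Suitable for many kinds of data

- Two-dimensional data represented with the DataFrame

2. Advantages

- Handling of missing data

- Adding and removing data

- Sorting and many other data manipulations

3. What pandas does

- Loading and saving data

- Viewing and inspecting data

- Filtering, sorting, and grouping

- Cleaning data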
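
The tasks in the list above can be sketched in a few lines. This is only a rough illustration: the CSV text, the column names (`city`, `year`, `sales`), and the in-memory buffers are assumptions made so the example runs without any file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """city,year,sales
Seoul,2020,10
Seoul,2021,
Busan,2020,7
Busan,2021,9
"""

# Load
sales = pd.read_csv(io.StringIO(csv_text))

# Inspect
print(sales.head())
print(sales.dtypes)

# Clean: fill the missing sales value with 0
cleaned = sales.fillna({'sales': 0})

# Filter, sort, group
recent = cleaned[cleaned['year'] == 2021].sort_values('sales')
totals = cleaned.groupby('city')['sales'].sum()
print(recent)
print(totals)

# Save (to an in-memory buffer here instead of a file)
buffer = io.StringIO()
totals.to_csv(buffer)
```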

Download the data used in the exercises from the following repository.

dongupak/DataSciPy

This is the repository for the book "따라하며 배우는 파이썬과 데이터 과학" (published 2020) from the publisher "생능 출판사". - dongupak/DataSciPy

github.com

Before looking at pandas itself,

it helps to understand the csv module, so let's take a brief look at the csv library first.

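1. Reading the data (skipping the header)

import csv
# A raw string keeps the backslashes in the Windows path from being read as escapes.
# encoding='cp949' matches the encoding passed to pd.read_csv later in this post.
with open(r'D:\study\data\csv\weather.csv', encoding='cp949') as f:
    data = csv.reader(f)
    header = next(data)  # discard the header row
    # next() returns the current row and moves the cursor to the next one
    for row in data:
        print(row)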

['2010-08-01', '28.7', '8.3', '3.4']
['2010-08-02', '25.2', '8.7', '3.8']
['2010-08-03', '22.1', '6.3', '2.9']
['2010-08-04', '25.3', '6.6', '4.2']
['2010-08-05', '27.2', '9.1', '5.6']
['2010-08-06', '26.8', '9.8', '8']
['2010-08-07', '27.5', '9.1', '5']
['2010-08-08', '26.6', '5.9', '4']
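
As the output above shows, the csv module returns every field as a string. A minimal sketch of reading by column name and converting by hand, assuming a two-row in-memory sample and hypothetical English column names rather than the real weather.csv header:

```python
import csv
import io

# Hypothetical two-row sample in the same layout as the output above.
sample = io.StringIO("date,avg_temp,max_wind,avg_wind\n"
                     "2010-08-01,28.7,8.3,3.4\n"
                     "2010-08-02,25.2,8.7,3.8\n")

rows = []
for record in csv.DictReader(sample):
    # Every field arrives as a string, so numeric columns are converted by hand.
    rows.append({'date': record['date'],
                 'avg_wind': float(record['avg_wind'])})

print(rows)
```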

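2. Extracting the data you want

Let's extract the maximum value from the average wind speed data.

import csv
with open(r'D:\study\data\csv\weather.csv', encoding='cp949') as f:
    data = csv.reader(f)
    header = next(data)  # skip the header row
    max_wind = 0
    for row in data:
        if row[3] == '':  # skip rows where the average wind speed is missing
            continue
        max_wind = max(max_wind, float(row[3]))
print("Maximum average wind speed: " + str(max_wind))

Maximum average wind speed: 14.9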

※ Exercise

Use a graph to find the month with the highest mean of the "average wind speed" column.

import csv
import matplotlib.pyplot as plt
import numpy as np

# np.zeros creates float arrays; an integer array would truncate each added wind speed.
monthly_wind = np.zeros(13)
days_counted = np.zeros(13)

with open(r'D:\study\data\csv\weather.csv', encoding='cp949') as f:
    data = csv.reader(f)
    next(data)  # skip the header row
    for row in data:
        if row[3] != '':
            month = int(row[0][5:7])  # 'YYYY-MM-DD' -> month number
            monthly_wind[month] += float(row[3])  # add the average wind speed to its month
            days_counted[month] += 1  # count the days recorded for that month

plt.plot(range(1, 13), monthly_wind[1:] / days_counted[1:])
plt.show()

Answer: April

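※ Addendum

With pandas, the same thing can be written as follows.

import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv(r'D:\study\data\csv\weather.csv', encoding='CP949')
data['month'] = pd.DatetimeIndex(data['일시']).month  # '일시' is the date column
# Select the column before mean() so the non-numeric date column is not averaged.
average_wind = data.groupby('month')['평균풍속'].mean()  # '평균풍속' is the average wind speed
plt.plot(average_wind.index, average_wind)  # x axis: months 1 to 12
plt.show()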

pandas data structures

- Series

A one-dimensional array that stores data of a single type

# Print a Series
import pandas as pd
import numpy as np
series = pd.Series([1, 3, 4, np.nan, 6, 8])
print(series)
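
A short sketch of what "a single type" means in practice: because the list contains NaN, pandas stores the whole Series as floating point. The labeled Series at the end uses made-up labels (`a`, `b`, `c`) purely for illustration.

```python
import numpy as np
import pandas as pd

series = pd.Series([1, 3, 4, np.nan, 6, 8])
print(series.dtype)         # float64, because NaN is a floating-point value
print(series.isna().sum())  # number of missing entries

# Hypothetical labels to show that a Series can be indexed by name.
labeled = pd.Series([1, 3, 4], index=['a', 'b', 'c'])
print(labeled['b'])
```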

- DataFrame

A two-dimensional structure made up of several Series

import pandas as pd
name = pd.Series(['aaa', 'bbb', 'ccc', 'ddd'])
age = [10, 20, 20, 30]
gender = ['남', '여', '남', '여']  # '남' = male, '여' = female
# A dictionary can be passed in to build a DataFrame.
# Each key becomes a column ('이름' = name, '나이' = age, '성별' = gender).
df = pd.DataFrame({'이름': name, '나이': age, '성별': gender})  # a list and a Series give the same result
print(df)

    이름  나이 성별
0  aaa  10  남
1  bbb  20  여
2  ccc  20  남
3  ddd  30  여

pandas in practice

1. Converting CSV data to a DataFrame

# csv -> dataframe
import pandas as pd
df = pd.read_csv(r'D:\study\data\csv\countries.csv', index_col=0)  # use column 0 as the row index
print(df)
print(df['population'])  # select one column

   country      area     capital  population
KR   Korea     98480       Seoul    51780579
US     USA   9629091  Washington   331002825
JP   Japan    377835       Tokyo   125960000
CN   China   9596960     Beijing  1439323688
RU  Russia  17100000      Moscow   146748600

KR      51780579
US     331002825
JP     125960000
CN    1439323688
RU     146748600
Name: population, dtype: int64

2. Visualizing data

# Visualizing data with pandas
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(r'D:\study\data\csv\countries.csv', index_col=0)  # use column 0 as the row index
df['population'].plot(kind='pie')
plt.show()

import pandas as pd
import matplotlib.pyplot as plt
weather = pd.read_csv(r'D:\study\data\csv\weather.csv', index_col=0, encoding='CP949')
weather['평균풍속'].plot(kind='hist', bins=33)  # histogram of the average wind speed
plt.show()

3. Selecting rows with slicing

import pandas as pd
weather = pd.read_csv(r'D:\study\data\csv\weather.csv', index_col=0, encoding='CP949')
print(weather.head())  # first 5 rows
print(weather.tail())  # last 5 rows
print(weather[:3])  # rows 0, 1, and 2 only

import pandas as pd
countries = pd.read_csv(r'D:\study\data\csv\countries.csv', index_col=0)  # use column 0 as the row index
print(countries.head())
print(countries.loc['KR'])  # select a row by its label; .loc['area'] would fail because 'area' is a column, so use [] for columns
print(countries.loc['KR', 'area'])  # with a row label and a column label together, .loc returns a single value
print(countries['area'])  # select a column; countries['KR'] would fail because 'KR' is a row label
print(countries['area'].loc['KR'])  # equivalent to loc['KR', 'area']
# print(countries.head())
   country      area     capital  population
KR   Korea     98480       Seoul    51780579
US     USA   9629091  Washington   331002825
JP   Japan    377835       Tokyo   125960000
CN   China   9596960     Beijing  1439323688
RU  Russia  17100000      Moscow   146748600

# print(countries.loc['KR'])
country          Korea
area             98480
capital          Seoul
population    51780579
Name: KR, dtype: object

# print(countries.loc['KR', 'area'])
98480

# print(countries['area'])
KR       98480
US     9629091
JP      377835
CN     9596960
RU    17100000
Name: area, dtype: int64

# print(countries['area'].loc['KR'])
98480
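
The label-based lookups above have a position-based counterpart, `.iloc`. A minimal sketch, assuming a hypothetical two-row frame built from values in the countries table so that it runs without the CSV file:

```python
import pandas as pd

# Hypothetical two-row frame using values from the countries table above.
countries = pd.DataFrame({'area': [98480, 377835],
                          'population': [51780579, 125960000]},
                         index=['KR', 'JP'])

by_label = countries.loc['JP', 'area']          # row label, column label
by_position = countries.iloc[1, 0]              # row position, column position
all_rows_one_column = countries.loc[:, 'area']  # .loc can also select a column

print(by_label, by_position)
print(all_rows_one_column)
```

Note that `.loc[:, 'area']` selects a column through `.loc` as well, as long as the row part is given explicitly.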

4. Creating a new column

# Create a new column
import pandas as pd
countries = pd.read_csv(r"D:\study\data\csv\countries.csv", index_col=0)
countries['density'] = countries['population'] / countries['area']  # population density, computed row by row
print(countries)

   country      area     capital  population     density
KR   Korea     98480       Seoul    51780579  525.797918
US     USA   9629091  Washington   331002825   34.375293
JP   Japan    377835       Tokyo   125960000  333.373033
CN   China   9596960     Beijing  1439323688  149.977044
RU  Russia  17100000      Moscow   146748600    8.581789

5. Analyzing data

# Analyzing data
import pandas as pd
weather = pd.read_csv(r'D:\study\data\csv\weather.csv', index_col=0, encoding='CP949')
print(weather.describe())  # summary statistics per column
print(weather.mean())  # mean
print(weather.std())  # standard deviation
print(weather.min())  # minimum
print(weather.sum())  # sum

              평균기온         최대풍속         평균풍속
count  3653.000000  3649.000000  3647.000000
mean     12.942102     7.911099     3.936441
std       8.538507     3.029862     1.888473
min      -9.000000     2.000000     0.200000
25%       5.400000     5.700000     2.500000
50%      13.800000     7.600000     3.600000
75%      20.100000     9.700000     5.000000
max      31.300000    26.000000    14.900000

평균기온    12.942102
최대풍속     7.911099
평균풍속     3.936441
dtype: float64

평균기온    8.538507
최대풍속    3.029862
평균풍속    1.888473
dtype: float64

평균기온   -9.0
최대풍속    2.0
평균풍속    0.2
dtype: float64

평균기온    47277.5
최대풍속    28867.6
평균풍속    14356.2
dtype: float64

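6. Grouping data by a particular value

Group the rows with the groupby function.

import pandas as pd
weather = pd.read_csv(r'D:\study\data\csv\weather.csv', encoding='CP949')
# Take the month from each date in the '일시' (date) column and store it in a new 'month' column.
weather['month'] = pd.DatetimeIndex(weather['일시']).month
# Split the rows into groups by 'month' and take the mean of each group.
# numeric_only=True leaves the text '일시' column out of the average.
means = weather.groupby('month').mean(numeric_only=True)
print(means)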

            평균기온      최대풍속      평균풍속
month                               
1       1.598387  8.158065  3.757419
2       2.136396  8.225357  3.946786
3       6.250323  8.871935  4.390291
4      11.064667  9.305017  4.622483
5      16.564194  8.548710  4.219355
6      19.616667  6.945667  3.461000
7      23.328387  7.322581  3.877419
8      24.748710  6.853226  3.596129
9      20.323667  6.896333  3.661667
10     15.383871  7.766774  3.961613
11      9.889667  8.013333  3.930667
12      3.753548  8.045484  3.817097
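
Grouping is not limited to the mean. A minimal sketch of computing several aggregates at once and picking the group with the largest mean, assuming a hypothetical four-row table with English column names in place of the real weather data:

```python
import pandas as pd

# Hypothetical miniature of the weather table.
weather = pd.DataFrame({
    'date': ['2010-08-01', '2010-08-02', '2010-09-01', '2010-09-02'],
    'avg_wind': [3.4, 3.8, 2.0, 4.0],
})
weather['month'] = pd.to_datetime(weather['date']).dt.month

# One row per month, one column per aggregate.
summary = weather.groupby('month')['avg_wind'].agg(['mean', 'max', 'count'])
print(summary)

# idxmax returns the index label (here, the month) of the largest mean.
windiest_month = summary['mean'].idxmax()
print(windiest_month)
```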

7. Filtering

# Keep only the rows where the maximum wind speed ('최대풍속') is greater than 10
import pandas as pd
weather = pd.read_csv(r'D:\study\data\csv\weather.csv', index_col=0, encoding='CP949')
print(weather[weather['최대풍속'] > 10])

            평균기온  최대풍속  평균풍속
일시                          
2010-08-10  25.6  10.2   5.5
2010-08-13  24.3  10.9   4.6
2010-08-14  25.0  10.8   4.4
2010-08-15  24.5  16.9  10.3
2010-08-30  26.2  10.5   6.2
...          ...   ...   ...
2020-07-01  16.8  19.7   8.7
2020-07-11  20.1  10.3   4.1
2020-07-13  17.8  10.3   4.6
2020-07-14  17.8  12.7   9.4
2020-07-20  23.0  11.2   7.3

[795 rows x 3 columns]
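
Several conditions can be combined in one filter. A minimal sketch, assuming hypothetical rows and English stand-in column names (`avg_temp`, `max_wind`):

```python
import pandas as pd

# Hypothetical rows; column names are English stand-ins for the weather columns.
weather = pd.DataFrame({'avg_temp': [25.6, 24.5, 16.8, 5.0],
                        'max_wind': [10.2, 16.9, 19.7, 12.0]},
                       index=['2010-08-10', '2010-08-15', '2020-07-01', '2020-12-01'])

# Each condition needs its own parentheses; use & for "and", | for "or".
windy_and_warm = weather[(weather['max_wind'] > 10) & (weather['avg_temp'] > 20)]

# query() expresses the same filter as a string.
same_result = weather.query('max_wind > 10 and avg_temp > 20')

print(windy_and_warm)
```

The parentheses matter because `&` binds more tightly than `>` in Python.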

8. Finding and filling missing values

import pandas as pd
weather = pd.read_csv(r'D:\study\data\csv\weather.csv', index_col=0, encoding='CP949')
# Indexing with a boolean DataFrame keeps only the cells where the mask is True,
# i.e. the missing cells, so every cell prints as NaN.
# To list the rows that contain missing data, use weather[weather.isna().any(axis=1)].
print(weather[weather.isna()])

# axis=0 drops the rows that contain NaN;
# axis=1 drops the columns that contain NaN.
print(weather.dropna(axis=1))

# inplace=True modifies the original data and returns None, so the call is not printed.
# Replace NaN with 0.
weather.fillna(0, inplace=True)
print(weather.loc['2012-02-11'])

# print(weather[weather.isna()])
            평균기온  최대풍속  평균풍속
일시                          
2010-08-01   NaN   NaN   NaN
2010-08-02   NaN   NaN   NaN
2010-08-03   NaN   NaN   NaN
2010-08-04   NaN   NaN   NaN
2010-08-05   NaN   NaN   NaN
...          ...   ...   ...
2020-07-27   NaN   NaN   NaN
2020-07-28   NaN   NaN   NaN
2020-07-29   NaN   NaN   NaN
2020-07-30   NaN   NaN   NaN
2020-07-31   NaN   NaN   NaN

[3653 rows x 3 columns]

# print(weather.dropna(axis=1))
            평균기온
일시              
2010-08-01  28.7
2010-08-02  25.2
2010-08-03  22.1
2010-08-04  25.3
2010-08-05  27.2
...          ...
2020-07-27  22.1
2020-07-28  21.9
2020-07-29  21.6
2020-07-30  22.9
2020-07-31  25.7

[3653 rows x 1 columns]

# print(weather.loc['2012-02-11'])
평균기온   -0.7
최대풍속    0.0
평균풍속    0.0
Name: 2012-02-11, dtype: float64
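
Filling with 0 is one option among several. A minimal sketch of counting the gaps per column, listing the affected rows, and filling with each column's mean instead, assuming hypothetical values and English stand-in column names:

```python
import numpy as np
import pandas as pd

# Hypothetical values with gaps; column names are English stand-ins.
weather = pd.DataFrame({'avg_temp': [-0.7, 1.0, 2.0],
                        'max_wind': [np.nan, 8.0, 6.0],
                        'avg_wind': [np.nan, 4.0, np.nan]})

missing_per_column = weather.isna().sum()
rows_with_gaps = weather[weather.isna().any(axis=1)]

# Fill each column's gaps with that column's mean instead of 0.
filled = weather.fillna(weather.mean())

print(missing_per_column)
print(rows_with_gaps)
print(filled)
```

Which fill value is appropriate depends on the analysis; a wind speed of 0 and an unknown wind speed are not the same thing.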

9. Reshaping data

# Create a DataFrame from a dictionary
import pandas as pd
dic = {'item': ['ring0', 'ring0', 'ring1', 'ring2'],
       'type': ['Gold', 'Silver', 'Gold', 'Bronze'],
       'price': [10, 5, 7, 3]}
df = pd.DataFrame(dic)
print(df)

# Reshape with pivot: 'item' becomes the rows, 'type' the columns, 'price' the cell values
df2 = df.pivot(index='item', columns='type', values='price')
print(df2)

    item    type  price
0  ring0    Gold     10
1  ring0  Silver      5
2  ring1    Gold      7
3  ring2  Bronze      3

type   Bronze  Gold  Silver
item                       
ring0     NaN  10.0     5.0
ring1     NaN   7.0     NaN
ring2     3.0   NaN     NaN
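
`pivot` requires each (index, column) pair to appear at most once. When that does not hold, `pivot_table` is one way to aggregate the duplicates. A minimal sketch, assuming hypothetical data in which ('ring0', 'Gold') appears twice:

```python
import pandas as pd

# Hypothetical data: ('ring0', 'Gold') appears twice, which pivot() rejects.
df = pd.DataFrame({'item': ['ring0', 'ring0', 'ring1'],
                   'type': ['Gold', 'Gold', 'Silver'],
                   'price': [10, 20, 5]})

try:
    df.pivot(index='item', columns='type', values='price')
    pivot_failed = False
except ValueError:
    pivot_failed = True  # duplicate index/column pairs cannot be reshaped

# pivot_table aggregates the duplicates (mean here) and can fill the gaps.
table = df.pivot_table(index='item', columns='type', values='price',
                       aggfunc='mean', fill_value=0)
print(table)
```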

10. Combining DataFrames

Use the concat function.

import pandas as pd
dic = {'item': ['ring0', 'ring1', 'ring0'],
       'type': ['Gold', 'Silver', 'Gold'],
       'price': [7, 5, 7]}
df1 = pd.DataFrame(dic, index=['가', '나', '다'])
print(df1)

# A second DataFrame built from the same data with different index labels
df2 = pd.DataFrame(dic, index=['다', '라', '나'])
print(df2)


df3 = pd.concat([df1, df2], axis=0)  # axis=0 appends rows
print(df3)

    item    type  price
가  ring0    Gold      7
나  ring1  Silver      5
다  ring0    Gold      7

    item    type  price
다  ring0    Gold      7
라  ring1  Silver      5
나  ring0    Gold      7

    item    type  price
가  ring0    Gold      7
나  ring1  Silver      5
다  ring0    Gold      7
다  ring0    Gold      7
라  ring1  Silver      5
나  ring0    Gold      7
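
As the last table shows, `concat` keeps the original index labels, so the same label can appear twice. A minimal sketch of renumbering the rows with `ignore_index=True`, assuming hypothetical frames with labels `a`, `b`, `c`:

```python
import pandas as pd

# Hypothetical frames that share the index label 'b'.
df1 = pd.DataFrame({'price': [7, 5]}, index=['a', 'b'])
df2 = pd.DataFrame({'price': [7, 3]}, index=['b', 'c'])

stacked = pd.concat([df1, df2], axis=0)                        # keeps 'b' twice
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)  # new index 0..3

print(stacked.index.is_unique)
print(renumbered)
```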

11. Merging data with a database-style join

# Database-style join
import pandas as pd
dic = {'item': ['ring0', 'ring1', 'ring0'],
       'type': ['Gold', 'Silver', 'Gold'],
       'price': [7, 5, 7]}
dic2 = {'item': ['ring0', 'ring1', 'ring0'],
        'quantity': [3, 3, 10]}
df1 = pd.DataFrame(dic)
print(df1)

df2 = pd.DataFrame(dic2)
print(df2)
print("after merge")
# Each 'ring0' row in df1 is paired with each 'ring0' row in df2, giving 2 x 2 = 4 rows.
df3 = df1.merge(df2, how='outer', on='item')
print(df3)

    item    type  price
0  ring0    Gold      7
1  ring1  Silver      5
2  ring0    Gold      7


    item  quantity
0  ring0         3
1  ring1         3
2  ring0        10


after merge
    item    type  price  quantity
0  ring0    Gold      7         3
1  ring0    Gold      7        10
2  ring0    Gold      7         3
3  ring0    Gold      7        10
4  ring1  Silver      5         3
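
In the example above every key exists in both frames, so the `how` argument makes no visible difference. A minimal sketch of how the join types differ, assuming hypothetical frames whose keys only partly overlap:

```python
import pandas as pd

# Hypothetical frames whose keys only partly overlap.
prices = pd.DataFrame({'item': ['ring0', 'ring1'], 'price': [7, 5]})
stock = pd.DataFrame({'item': ['ring1', 'ring2'], 'quantity': [3, 10]})

inner = prices.merge(stock, how='inner', on='item')  # keys present in both
left = prices.merge(stock, how='left', on='item')    # every row of prices
outer = prices.merge(stock, how='outer', on='item', indicator=True)  # all keys

print(inner)
print(left)
print(outer)
```

With `indicator=True`, the extra `_merge` column records whether each row came from the left frame, the right frame, or both.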

12. Sorting data by size

import pandas as pd
countries = pd.read_csv(r"D:\study\data\csv\countries.csv", index_col=0)
countries.sort_values('population', inplace=True)  # sort by population (ascending by default)
print(countries)

   country      area     capital  population
KR   Korea     98480       Seoul    51780579
JP   Japan    377835       Tokyo   125960000
RU  Russia  17100000      Moscow   146748600
US     USA   9629091  Washington   331002825
CN   China   9596960     Beijing  1439323688

With two or more sort keys

import pandas as pd
countries = pd.read_csv(r"D:\study\data\csv\countries.csv", index_col=0)
# Sort by population first, then by area to break ties, both in descending order.
countries.sort_values(['population', 'area'], ascending=False, inplace=True)
print(countries)

   country      area     capital  population
CN   China   9596960     Beijing  1439323688
US     USA   9629091  Washington   331002825
RU  Russia  17100000      Moscow   146748600
JP   Japan    377835       Tokyo   125960000
KR   Korea     98480       Seoul    51780579
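
In the countries table no two populations are equal, so the second key never comes into play. A minimal sketch with a tie in the first key and a different direction per key, assuming a hypothetical rings table:

```python
import pandas as pd

# Hypothetical rows with a tie in 'type' so the second key matters.
rings = pd.DataFrame({'type': ['Gold', 'Silver', 'Gold'],
                      'price': [7, 5, 10]},
                     index=['a', 'b', 'c'])

# 'type' ascending, then 'price' descending within each type.
ordered = rings.sort_values(['type', 'price'], ascending=[True, False])
print(ordered)
```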
