2018-03-31

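pandas: DataFrame

Introduction

Last time, I explained the Series in the pandas library. This time, I will explain the DataFrame.

DataFrame

A DataFrame is a table-like data structure. Think of it as a two-dimensional, array-like object that, like an Excel sheet, has an index on both its rows and its columns.

Creating a DataFrame

There are many ways to create a DataFrame. The basic way is to use a dictionary whose values are lists. All of the lists in the dictionary must have the same length.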

# sample.py
import pandas as pd

data = {'Language': ['Python', 'C++', 'Java',], # strings
        'Year': [1991, 1983, 1995], # numbers
        'Static': [False, True, True]} # booleans
df = pd.DataFrame(data)

print(df)

# Output
  Language  Static  Year
0   Python   False  1991
1      C++    True  1983
2     Java    True  1995

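The output is shown above. You can see that a DataFrame was created from the dictionary. Note that in this output the columns appear in alphabetical order rather than in the order written in the dictionary; recent pandas versions keep the dictionary's insertion order instead.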
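
To see the same-length rule in action, here is a minimal sketch. The column names 'a' and 'b' are made up for illustration; passing lists of different lengths should raise a ValueError.

```python
import pandas as pd

# Hypothetical data: 'a' has 2 elements, 'b' has 3
bad_data = {'a': [1, 2], 'b': [1, 2, 3]}

try:
    pd.DataFrame(bad_data)
    raised = False
except ValueError as e:
    # pandas refuses to build a table from columns of different lengths
    raised = True
    print("ValueError:", e)
```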

Creating a DataFrame: with an index

Just like a Series, a DataFrame can be created with an index that you specify.

import pandas as pd

data = {'Language': ['Python', 'C++', 'Java',],
        'Year': [1991, 1983, 1995],
        'Static': [False, True, True]}
df = pd.DataFrame(data, index=['one', 'two', 'three']) # specify the index

print(df)

The output is almost the same as before (only the row labels change), so I will omit it.
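
If you want to confirm that the labels were applied without printing the whole table, a small check like the following may help. It is only a sketch that reuses the same data as above.

```python
import pandas as pd

data = {'Language': ['Python', 'C++', 'Java'],
        'Year': [1991, 1983, 1995],
        'Static': [False, True, True]}
df = pd.DataFrame(data, index=['one', 'two', 'three'])

index_labels = list(df.index)       # the row labels we passed in
year_of_two = df.loc['two', 'Year'] # look up a value by its row label

print(index_labels)
print(year_of_two)
```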

Creating a DataFrame: specifying the column order

You can specify the order of a DataFrame's columns.

import pandas as pd

data = {'Language': ['Python', 'C++', 'Java',],
        'Year': [1991, 1983, 1995],
        'Static': [False, True, True]}
df = pd.DataFrame(data, index=['one', 'two', 'three'], columns=['Language', 'Year', 'Static']) # columns sets the column order

print(df)
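
As a small experiment, the sketch below passes a different order to columns and also includes 'Creator', a hypothetical column name that does not exist in the dictionary. As far as I know, a dictionary key left out of columns is dropped and an unknown name becomes a column of missing values (NaN).

```python
import pandas as pd

data = {'Language': ['Python', 'C++', 'Java'],
        'Year': [1991, 1983, 1995],
        'Static': [False, True, True]}

# 'Creator' is a made-up name that is not a key of data
df = pd.DataFrame(data, columns=['Year', 'Language', 'Creator'])

column_names = list(df.columns)
creator_all_missing = bool(df['Creator'].isna().all())

print(column_names)
print(creator_all_missing)
```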

Extracting data

	How rows and columns are specified
loc	row labels and column labels
iloc	row positions and column positions

import pandas as pd

data = {'Language': ['Python', 'C++', 'Java',],
        'Year': [1991, 1983, 1995],
        'Static': [False, True, True]}
df = pd.DataFrame(data, index=['one', 'two', 'three'], columns=['Language', 'Year', 'Static'])

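print(df.loc[['two']]) # extract the row labeled 'two'
print(df.loc[:, 'Year']) # extract the 'Year' column
print(df.loc['one', 'Language']) # extract the value at row 'one', column 'Language'

print(df.iloc[:,1]) # extract the column at position 1 (positions start at 0)
print(df.iloc[[1]]) # extract the row at position 1
print(df.iloc[1, 2]) # extract the value at row position 1, column position 2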

# print(df.loc[['two']])
    Language  Year  Static
two      C++  1983    True
# print(df.loc[:, 'Year']) 
one      1991
two      1983
three    1995
Name: Year, dtype: int64
# print(df.loc['one', 'Language'])
Python
# print(df.iloc[:,1])
one      1991
two      1983
three    1995
Name: Year, dtype: int64
# print(df.iloc[[1]])
    Language  Year  Static
two      C++  1983    True
# print(df.iloc[1, 2])
True

The output is shown above. You can see that the rows and columns were extracted as intended.
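
loc and iloc also accept slices and lists. The following is a minimal sketch on the same data; the point to watch is that a label slice with loc includes its end label, while a position slice with iloc excludes its end position.

```python
import pandas as pd

data = {'Language': ['Python', 'C++', 'Java'],
        'Year': [1991, 1983, 1995],
        'Static': [False, True, True]}
df = pd.DataFrame(data, index=['one', 'two', 'three'],
                  columns=['Language', 'Year', 'Static'])

by_label = df.loc['one':'two']   # label slice: the end label 'two' is included
by_position = df.iloc[0:2]       # position slice: the end position 2 is excluded
subset = df.loc[['one', 'three'], ['Language', 'Year']]  # pick rows and columns by lists of labels

print(by_label)
print(by_position)
print(subset)
```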

Note
- Some of the examples above select elements with "[[]]" and others with "[]". The difference is the type of the extracted data: "[[]]" gives a DataFrame, while "[]" gives a Series.

import pandas as pd

data = {'Language': ['Python', 'C++', 'Java',],
        'Year': [1991, 1983, 1995],
        'Static': [False, True, True]}
df = pd.DataFrame(data, index=['one', 'two', 'three'], columns=['Language', 'Year', 'Static'])

print(type(df.iloc[[1]]))
print(type(df.iloc[1]))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>

The output shows that the types differ. When you want the result to stay a DataFrame, use "[[]]".
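
The same distinction seems to apply when you select columns directly on the DataFrame. This is a small sketch, not something covered in the examples above.

```python
import pandas as pd

data = {'Language': ['Python', 'C++', 'Java'],
        'Year': [1991, 1983, 1995],
        'Static': [False, True, True]}
df = pd.DataFrame(data, index=['one', 'two', 'three'])

year_series = df['Year']    # a single label gives a Series
year_frame = df[['Year']]   # a list of labels gives a DataFrame

print(type(year_series).__name__)
print(type(year_frame).__name__)
```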

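Afterword

Good work! That wraps up the explanation of the pandas data structures. Next time, I will explain how to use pandas in more detail.
By the way, April starts tomorrow. I'd like to make a fresh start.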

2018-03-26

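pandas: Series

Introduction

Starting with this post, I will introduce the pandas library, which is widely used for data analysis.

Installation

First, install the library with pip install pandas.
If you already have it installed, you can skip this step.

What is pandas?

pandas is a library for efficiently processing data stored in rows and columns. Examples of such data include CSV files, which are often opened in Excel, and database tables.

pandas data structures

pandas provides two data structures: the Series and the DataFrame. Getting comfortable with these two is the key to using pandas well. This time, I will explain the Series.

Series

A Series is a one-dimensional, array-like object. What sets it apart from a plain array is that it also holds an index, an array of data labels associated with the data.

Creating a Series

A Series is built from one-dimensional data, such as a list or a NumPy array, passed as an argument. Let's look at a sample.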

# sample.py
import pandas as pd # conventional alias
import numpy as np

obj = pd.Series([1, 2, 3, 5, 8]) # convert a list to a Series

print("object:", obj)
print("values:", obj.values) # the elements of obj
print("index:", obj.index) # the index of obj
print()

obj = pd.Series(np.array([2, 4, 6, 8, 10])) # convert an array to a Series

print("object:", obj)

# Output
object: 
0    1
1    2
2    3
3    5
4    8
dtype: int64
values: [1 2 3 5 8]
index: RangeIndex(start=0, stop=5, step=1)

object:
0     2
1     4
2     6
3     8
4    10
dtype: int64

The output is shown above. We created Series objects from one-dimensional data. You can also see that the values and index attributes give you the data array and its index. The line dtype: int64 shows the type of the elements.
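
The dtype depends on the data you pass in. Here is a short sketch with made-up values showing floats and booleans alongside the default integer index.

```python
import pandas as pd

floats = pd.Series([1.5, 2.0, 3.0])     # floating-point elements
flags = pd.Series([True, False, True])  # boolean elements

float_dtype = str(floats.dtype)
flag_dtype = str(flags.dtype)
default_index = list(floats.index)      # 0, 1, 2 when no index is given

print(float_dtype, flag_dtype)
print(default_index)
```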

Creating a Series: with an index

This time, let's create a Series while specifying the index.

# sample.py
import pandas as pd

obj = pd.Series([1, 2, 3, 5], index=['a', 'b', 'c', 'd'])

print("object:", obj)
print("index:", obj.index)

object: 
a    1
b    2
c    3
d    5
dtype: int64
index: Index(['a', 'b', 'c', 'd'], dtype='object')

As the output shows, we were able to specify the index. Note also that the index returned by obj.index is now displayed differently.
You can also pass dictionary data as the argument. In that case, the dictionary keys become the index.

import pandas as pd

data = {'a': 10, 'b': 20, 'c': 30} # named data so the built-in dict is not shadowed
obj = pd.Series(data)

print("object:", obj)
print("index:", obj.index)

# Output
object:
a    10
b    20
c    30
dtype: int64
index: Index(['a', 'b', 'c'], dtype='object')
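
You can also combine a dictionary with an explicit index. In the sketch below, 'z' is a hypothetical label that is not a key of the dictionary; as far as I know, pandas then picks the values in the order of the given index and fills the unknown label with NaN.

```python
import pandas as pd

data = {'a': 10, 'b': 20, 'c': 30}

# 'z' is a made-up label that does not exist in data; 'b' is left out on purpose
obj = pd.Series(data, index=['c', 'a', 'z'])

labels = list(obj.index)
z_is_missing = bool(pd.isna(obj['z']))

print(obj)
print(labels, z_is_missing)
```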

Operations on a Series

A Series can be manipulated much like a NumPy array.

# run in the interactive interpreter
>>> import pandas as pd
>>> values = [2, 6, 3, 7, 4]
>>> index = ['a', 'd', 'b', 'e', 'b']
>>> obj = pd.Series(values, index) 
>>> obj
a    2
d    6
b    3
e    7
b    4
dtype: int64
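>>> obj[0] # access by position (recent pandas prefers obj.iloc[0] for this)
2
>>> obj['a'] # access by index label
2
>>> obj * 2 # multiplication applied to every element (broadcasting)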
a     4
d    12
b     6
e    14
b     8
dtype: int64
>>> obj > 5 # element-wise comparison (gives a boolean mask for filtering)
a    False
d     True
b    False
e     True
b    False
dtype: bool
>>> 'a' in obj # check whether the label is in the index
True

The output is shown above. These operations resemble the ones we have used on lists, arrays, and dictionaries. There are many other operations, so try them out.
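
Here are a few more operations to try on the same Series, as a small sketch. It uses the boolean mask to actually filter the values, looks up the duplicated label 'b', and uses iloc for access by position.

```python
import pandas as pd

obj = pd.Series([2, 6, 3, 7, 4], index=['a', 'd', 'b', 'e', 'b'])

large = obj[obj > 5]   # keep only the elements where the mask is True
dup = obj['b']         # 'b' appears twice, so this is a Series, not a single value
first = obj.iloc[0]    # explicit access by position

print(large)
print(dup)
print(first)
```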

Afterword

This time I explained the pandas Series. Next time I will explain the DataFrame.

2018-03-26

File Operations [2]

Introduction

This time, I will explain how to read strings from a text file.

Reading a text file with read()

read() reads the entire file at once. Let's look at a sample.

# read.py
f = open('theZenOfPython', 'rt') # open the file created last time
the_zen_of_python = f.read() # read it
f.close() # close the file

print(len(the_zen_of_python))

# Output
857

You can see that the file was read.
You can also pass read() an argument that limits how much data to read. In text mode, this is a number of characters.

# read.py
f = open('theZenOfPython', 'rt')
the_zen_of_python = f.read(100)
f.close()

print(len(the_zen_of_python))

# Output
100
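
Calling read() with a size repeatedly lets you process a file in chunks. The following is only a sketch: it writes a made-up sample file into a temporary directory so that it can run on its own, and it uses the with statement, which closes the file automatically.

```python
import os
import tempfile

text = "Beautiful is better than ugly.\n" * 10  # stand-in for the real file contents

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample.txt')   # hypothetical file name
    with open(path, 'wt') as f:
        f.write(text)

    chunks = []
    with open(path, 'rt') as f:
        while True:
            chunk = f.read(100)   # read at most 100 characters
            if not chunk:         # an empty string means the end of the file
                break
            chunks.append(chunk)

restored = ''.join(chunks)
print(len(chunks), len(restored))
```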

Reading a text file with readline()

readline() reads the file one line at a time. Let's look at a sample.

f = open('theZenOfPython', 'rt')
the_zen_of_python = f.readline()
f.close()

print(len(the_zen_of_python))
print(the_zen_of_python, end="")

# Output
33
The Zen of Python, by Tim Peters

The output is shown above; you can see that the first line was read.
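
Calling readline() again returns the next line, and it returns an empty string once the end of the file is reached. A minimal sketch, using a made-up three-line file in a temporary directory:

```python
import os
import tempfile

text = "first line\nsecond line\nthird line\n"  # made-up sample contents

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample.txt')   # hypothetical file name
    with open(path, 'wt') as f:
        f.write(text)

    lines = []
    with open(path, 'rt') as f:
        while True:
            line = f.readline()   # each call returns the next line
            if line == '':        # an empty string signals the end of the file
                break
            lines.append(line)

print(lines)
```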

Reading a text file with readlines()

readlines() reads all the lines, one line at a time, and returns them as a list. Let's look at a sample.

f = open('theZenOfPython', 'rt')
the_zen_of_python = f.readlines()
f.close()

print(len(the_zen_of_python))
print(the_zen_of_python, end="")

# Output
21
['The Zen of Python, by Tim Peters\n', '\n', 'Beautiful is better than ugly.\n', 'Explicit is better than implicit.\n', 'Simple is better than complex.\n', 'Complex is better than complicated.\n', 'Flat is better than nested.\n', 'Sparse is better than dense.\n', 'Readability counts.\n', "Special cases aren't special enough to break the rules.\n", 'Although practicality beats purity.\n', 'Errors should never pass silently.\n', 'Unless explicitly silenced.\n', 'In the face of ambiguity, refuse the temptation to guess.\n', 'There should be one-- and preferably only one --obvious way to do it.\n', "Although that way may not be obvious at first unless you're Dutch.\n", 'Now is better than never.\n', 'Although never is often better than *right* now.\n', "If the implementation is hard to explain, it's a bad idea.\n", 'If the implementation is easy to explain, it may be a good idea.\n', "Namespaces are one honking great idea -- let's do more of those!"]

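The output is shown above. As you can see, each line that was read became an element of the list.
readlines() is equivalent to reading the file one line at a time and appending each line to a list, as follows.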

f = open('theZenOfPython', 'rt')
the_zen_of_python = []
for line in f: # read the file one line at a time
    the_zen_of_python.append(line) # append the line to the list
f.close()

print(len(the_zen_of_python))

# Output
21
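
The same loop can be written more compactly. The sketch below combines the with statement and a list comprehension, and strips the trailing newline from each line; the file contents are made up so that the example runs on its own.

```python
import os
import tempfile

text = "Simple is better than complex.\nReadability counts.\n"  # made-up sample

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample.txt')   # hypothetical file name
    with open(path, 'wt') as f:
        f.write(text)

    with open(path, 'rt') as f:
        # iterate over the file line by line and drop the newline at the end
        stripped = [line.rstrip('\n') for line in f]

print(stripped)
```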

Afterword

Good work! This was a brief look at handling text files. Please look into other ways of handling files on your own.

2018-03-24

File Operations [1]

Introduction

Starting with this post, I will explain file operations.

Opening a file

Use the open function to open a file. open() is used as follows.

fileobj = open(filename, mode)

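fileobj : the file object returned by open().
filename : the name of the file to open (a string).
mode : a string that indicates the type of file and how you want to operate on it.

Strings for specifying mode

File operation

Symbol	Meaning
r	read
w	write (creates a new file if the file does not exist; overwrites it if it does)
x	write (raises an error if the file already exists)
a	append

File type

Symbol	Meaning
t (or nothing)	text file
b	binary file

Examples
Here are some usage examples.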

fileobj = open('sample.txt', 'wt') # open as text for writing
fileobj = open('sample.txt', 'a')  # open as text for appending
fileobj = open('sample.txt', 'rb') # open as binary for reading
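
To get a feel for how the modes differ, here is a minimal sketch of 'x' and 'a'. It works in a temporary directory so that no real file is touched, and it uses the with statement to close the files automatically.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample.txt')

    with open(path, 'x') as f:   # 'x' creates the file because it does not exist yet
        f.write('first\n')

    try:
        with open(path, 'x') as f:   # the file exists now, so 'x' fails
            f.write('never written\n')
        exists_error = False
    except FileExistsError:
        exists_error = True

    with open(path, 'a') as f:   # 'a' keeps the old contents and adds to the end
        f.write('second\n')

    with open(path, 'r') as f:
        content = f.read()

print(exists_error)
print(content, end='')
```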

Writing with write()

Use write() to write to a text file. Let's look at a sample.

#write.py
the_zen_of_python = '''The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
'''

f = open('theZenOfPython', 'wt') # open the file
f.write(the_zen_of_python) # write to the file
f.close() # close the file

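When you run the code above, a file (theZenOfPython) is created in the current working directory, which is the directory of the script (write.py) if you run it from there. Also, once you have opened a file and finished working with it, always close it with close() at the end.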
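
If you are worried about forgetting close(), the with statement is a common alternative that closes the file for you when the block ends. A minimal sketch, writing a short stand-in string into a temporary directory:

```python
import os
import tempfile

text = "Beautiful is better than ugly.\n"  # short stand-in for the full text

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'theZenOfPython')

    with open(path, 'wt') as f:   # the file is closed automatically when this block ends
        f.write(text)
    closed_after_block = f.closed

    with open(path, 'rt') as g:
        written = g.read()

print(closed_after_block)
print(written, end='')
```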

Writing with print()

The previous example used write(), but you can also write with print().

# print.py
the_zen_of_python = ''' ... ''' # the same string as before

f = open('theZenOfPython', 'wt') # open the file
print(the_zen_of_python, file=f) # set f as the destination of print()
f.close() # close the file

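Running the code above gives almost the same result as before (print() appends one extra newline). print() lets you specify its destination, and passing a file (f) there writes to the file.

Differences between write() and print()

By default, print() inserts a space between its arguments and appends a newline at the very end. This is why write() and print() behave differently. To make print() behave like write(), you need to specify the sep and end arguments.

print() arguments

Argument	Meaning
sep	Separator. Defaults to a space (' ').
end	String appended at the end. Defaults to a newline ('\n').
file	Destination. Defaults to standard output (stdout).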
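
The difference is easy to check with io.StringIO, an in-memory object that behaves like a text file. This is just a sketch with made-up strings; the variable names are only for illustration.

```python
import io

buf_write = io.StringIO()
buf_write.write('abc')                          # write() adds nothing

buf_default = io.StringIO()
print('abc', file=buf_default)                  # default end adds a newline

buf_matched = io.StringIO()
print('abc', file=buf_matched, sep='', end='')  # behaves like write()

buf_two = io.StringIO()
print('a', 'b', file=buf_two)                   # default sep puts a space between arguments

print(repr(buf_write.getvalue()))
print(repr(buf_default.getvalue()))
print(repr(buf_matched.getvalue()))
print(repr(buf_two.getvalue()))
```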

Afterword

This time I explained how to open a file and how to write to it. Next time, I will explain how to read from a file.

2018-03-22

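Book Introductions

Introduction

This time, I would like to introduce some introductory books on Python.

入門Python3

This is a book from the well-known publisher O'Reilly. Besides basic Python syntax, it briefly introduces the web, concurrency, and networking. I think it is a great fit for people who want to do something with Python but do not know what is possible. It packs in a lot of information and some explanations are terse, so working through it takes effort, but it gives you broad foundational knowledge. If you work through it in order from the beginning, getting to the end is not hard.

みんなのPython

This book is also popular with beginners. It starts from setting up the environment and carefully explains the points beginners tend to find difficult. It has fewer pages than 入門Python3 above, so you should be able to read it quickly. I recommend it for people who want to pick up the basics as fast as possible.

Afterword

From now on, I plan to introduce books on Python and machine learning a little at a time.

Python Introduction for Beginners

I write mainly about the places where I stumbled and the places where people tend to stumble. I also introduce useful libraries.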