Introduction¶

1. Introduction in Introduction¶

What is the common base of knowledge for our graduate school?¶

That's the problem.

This was my dilemma in designing this class, and it has not been resolved.

The former lecture, which started in 2016, was titled "Numerical Simulation Methods". There I intended to present examples of numerical models that describe physical systems in a broad sense, together with the methods for solving those models, and to find features common to the systems.

After considering which (numerical) skills are common across the graduate courses, I decided to start this "Practical Data Science" lecture. This also reflects the current boom in data science.

Historical overview of "data science as a tool"¶

The term "Data Science" has attracted a great deal of attention over the last decade.

According to Wikipedia:

an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.

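Loosely paraphrased:

An interdisciplinary field that extracts "knowledge" from various kinds of data by using scientific methods, algorithms, and (concrete) systems.
Related terms: data mining, machine learning, big data.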

Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyze actual phenomena" with data.

In short: understanding and analyzing actual phenomena by using data.

The field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions, and presenting findings to inform high-level decisions in a broad range of application domains.

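That is: preparing (organizing) data for analysis, formulating problems in data science terms, analyzing, developing data-driven solutions, and contributing to decision making and the like.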

Many statisticians, including Nate Silver, have argued that data science is not a new field, but rather another name for statistics.

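That is, in their view data science is not a new field but statistics under another name.

Nevertheless, since the 2000s, several academic journals, and new sections of existing journals, with "data science" in their titles have been launched. With machine learning as one of its techniques, data science has come to be used in a wide variety of fields.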

The following research areas have long used data science methods as essential tools (although there they may be called statistics rather than data science):

Medical statistics: it is said that, in most cases, at least one statistician needs to be among the coauthors for a medical paper to be submitted.
Statistics for decision making in business, politics, and administration
Social research and field work
Meteorology, ecology, and other environmental and earth sciences: analyzing the large amounts of data obtained through social research or field work requires data analysis skills in fields such as sociology, ecology, and biology.

Two ways to analyse complex phenomena¶

Physics, chemistry, astronomy, and related fields

Fundamental idea (assumption; the "belief" of modern sciences such as astronomy and physics): the various phenomena in nature should be understood on the basis of a small number of principles, formulas, or laws that can be described mathematically.

Data Analysis¶

Many objects of study are not necessarily governed by simple principles, especially in biological, medical, and social systems. (The causal logic is not always clear and is often left as a black box.)

The development of computers has facilitated the analysis of such complex systems. The mathematical and computational framework for this analysis can be called data science.

Recently, data science methods have also been applied (exported) to the "exact sciences" such as physics and chemistry.

(The number of papers on the analysis of experimental data using machine learning methods is increasing.)

2. Some aspects of data analysis: a simple example¶

Given the following data, how should you interpret them? It is important to set the right context.

[Figure: simple data]

If there is a theoretical conjecture that $y$ depends linearly on $x$, you may try fitting the data to $y=ax+b$ by the least-squares method (linear regression).
In other cases, other fitting functions can be used, such as polynomials or trigonometric functions.
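As a minimal sketch of the least-squares fit to $y=ax+b$ described above, the closed-form solution can be written in a few lines of plain Python. The five data points and the helper name `fit_line` are made up for illustration and are not taken from the lecture materials.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope that minimizes the squared error
    b = mean_y - a * mean_x  # the fitted line passes through the mean point
    return a, b


# Hypothetical sample data lying roughly on a straight line
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

a, b = fit_line(xs, ys)
print(f"a = {a:.3f}, b = {b:.3f}")
```

The result is an explicit function, so the fitted slope and intercept can be compared directly with what a theory predicts.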

For some intrinsically complex objects, such as an economic trend, the issue is not to find the fitting function but to predict the value of $y$ for a given $x$.

There are powerful prediction (regression) methods such as "Support Vector Regression" (SVR), which uses the "kernel method". SVR does not assume a fitting function $y=f(x)$ of a fixed form, so we cannot obtain an explicit resulting function to compare with theory (a simple principle). Instead, it focuses on obtaining excellent predictions. Note also that some kernel-based regression methods yield an $x$-dependent probability distribution of $y$ as part of the result.
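To illustrate what "not assuming a fitting function" means, here is a rough sketch of a kernel method using NumPy only. Note that it is kernel ridge regression with a Gaussian (RBF) kernel, a simpler relative of SVR chosen here for brevity; SVR itself uses a different (epsilon-insensitive) loss. The synthetic sine data and the values of `gamma` and `lam` are assumptions made for this example.

```python
import numpy as np


def rbf_kernel(x1, x2, gamma):
    """Gaussian (RBF) kernel matrix between two 1-D arrays of inputs."""
    diff = x1[:, None] - x2[None, :]
    return np.exp(-gamma * diff ** 2)


# Synthetic training data: a sine curve plus noise (made up for illustration)
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 2.0 * np.pi, 40)
y_train = np.sin(x_train) + rng.normal(scale=0.1, size=x_train.size)

gamma = 0.5  # kernel width parameter (hypothetical choice)
lam = 0.1    # regularization strength (hypothetical choice)

# One weight per training point, instead of coefficients of a formula
K = rbf_kernel(x_train, x_train, gamma)
alpha = np.linalg.solve(K + lam * np.eye(x_train.size), y_train)


def predict(x_new):
    """Prediction is a kernel-weighted sum over all training points."""
    x_new = np.asarray(x_new, dtype=float)
    return rbf_kernel(x_new, x_train, gamma) @ alpha


x_test = np.linspace(0.5, 5.5, 11)
y_pred = predict(x_test)
mae = np.mean(np.abs(y_pred - np.sin(x_test)))
print(f"mean absolute error against the true curve: {mae:.3f}")
```

The fitted model is stored as the weights `alpha` attached to the training points, so there is no short formula such as $y=ax+b$ to read off and compare with a theory; the model is used only through `predict`.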

Thus, the choice of method is crucial to the conclusions you draw. This class therefore aims at understanding the characteristics of popular computer-aided statistical methods for analyzing your experimental or observational data. Because the lecture term is short, it focuses only on regression methods. (Another main class of methods is "classification", which is not covered in this class.)

Another example¶

Temperature data for Kofu and curve fits to them. (An example from the pre-lecture, plotted with plotly.)

[Figure: temperature_kofu_linear_fitting]

Linear fitting (plotted with seaborn)

cf. the Japan Meteorological Agency page: https://www.data.jma.go.jp/cpdinfo/temp/an_jpn.html

Non-linear fitting (polynomial up to the 3rd power)
Support Vector Regression (fitting a curve without assuming a functional form)
Mixed Linear Model (assuming the data are described by two kinds of curves)

Data before 1920 were truncated.
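The linear and third-power fits listed above can be sketched with `numpy.polyfit`. The series below is synthetic and is NOT the Kofu temperature data; the year range, coefficients, and noise level are assumptions chosen only to make the example runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic yearly series (not real observations): a gentle curve plus noise
years = np.arange(1920, 2021)
# Center and scale the years so that high powers stay well conditioned
t = (years - years.mean()) / 50.0
y = 14.0 + 0.8 * t + 0.3 * t**2 + rng.normal(scale=0.4, size=t.size)


def fit_polynomial(t, y, degree):
    """Least-squares polynomial fit; returns (coefficients, residual sum of squares)."""
    coef = np.polyfit(t, y, degree)      # coefficients, highest power first
    residuals = y - np.polyval(coef, t)
    return coef, float(np.sum(residuals**2))


coef1, rss1 = fit_polynomial(t, y, 1)  # linear fit
coef3, rss3 = fit_polynomial(t, y, 3)  # fit up to the 3rd power

print(f"residual sum of squares, degree 1: {rss1:.2f}")
print(f"residual sum of squares, degree 3: {rss3:.2f}")
```

On the data used for fitting, the cubic can never have a larger residual sum of squares than the line, because the line is a special case of the cubic. A smaller residual alone therefore does not show that the higher-degree model is better, which is where over-fitting and validation come in.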

3. Overview of this class¶

Students from all courses take this class every year, so their skills and academic backgrounds span a wide range. Therefore:

Simple examples and exercises are provided so that each student can learn according to their own present skills.
- An excellent high-level textbook referred to often in the lecture is 「パターン認識と機械学習」 (Pattern Recognition and Machine Learning, abbreviated as PRML), the original version of which can be downloaded from https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/
The behavior of each method is checked and practiced on Jupyter Notebook.
The content focuses on "regression" methods.
The class covers
- polynomial fitting by the least-squares method and the over-fitting problem
- multiple regression and feature selection
- Bayesian approach
- (kernel method)
- Support Vector Machine and its application to regression
- Neural Network and Decision Tree and their application to regression
- Validation methods
- Other topics (hierarchical and mixed models, etc.)

Trends in the number of students¶

[Figure: number of students]

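Because the students have diverse skills and course histories:

Talks are kept short.
As many documents and program examples as possible are provided.
You deepen your understanding by running the provided programs.
- Depending on your own experience and skills, try modifying the programs, writing your own programs, applying them to other data, and so on.
Individual questions are answered.
At the end, you submit what you tried (in Jupyter Notebook or PDF format).
For the time being, individual questions are taken in the Zoom chat. (If the Moodle chat turns out to be more convenient, we will move there.)
- Feel free to ask even small questions.
- If you can answer someone else's question, feel free to do so.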

So far I have written the text in both Japanese and English. At this point, I realized that most web browsers have a translation function, so you can probably read the Japanese text that follows in English by using it.

In the class materials for the following lectures, the text is written in Japanese, except for the comments in the Python code, which are written in English. Please use translation tools to study the contents if necessary.

Topics not covered¶

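Unsupervised learning
Deep learning
Image and speech recognition (classification problems)
Natural language analysis and text mining

This class deals exclusively with the application of machine learning to the analysis of numerical data (regression analysis).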