3 Erste Schritte in ›Data Science‹ [43:12]

In den vergangenen Jahren werden immer mehr Daten publiziert, deren Menge zukünftig noch schneller anwachsen wird. Es ist nicht leicht, all den Publikationen dieser erfreulich großen Mengen immer neuer Informationen zu folgen. Jupyter Notebooks wurden gerade auch dafür entwickelt, auf große Mengen an Daten zuzugreifen, diese zu visualisieren, analysieren und auf Problemstellungen anzuwenden. Natürlich ist das nur möglich, wenn die Daten in entsprechenden Formaten vorliegen. Oftmals ist daher im ersten Schritt ein ›Daten clean-up‹ notwendig. Seit 2020 gibt es nun endlich mehrere Initiativen, darunter so große wie die Nationale Forschungsdatenbank Initiative (NFDI) der DFG, GWK und anderer, die versuchen gemeinsam mit internationalen Partnern Daten besser verfügbar zu machen. Schon bestehende Datenbanken wie GeoROC, EarthChem oder auch MetBase erlauben schon seit langer Zeit solchen Zugang, und damit Data Science in z.B. der Geo- und Kosmochemie. In dieser Einheit steigen wir in das neue Feld der Data Science in der Mineralogie ein.

3.1 Was Datensätze sind und wie diese aufgebaut sein müssen [06:09]

Zunächst müssen wir uns etwas anschauen, wie Datensätze und eine Datenbank überhaupt aufgebaut sind. Wir sind zwar sehr damit vertraut, Tabellen zu erstellen, als Datenbank taugen die meisten jedoch selten etwas. Datenbanken, bzw. einzelne Datensätze erstellt man mit einigen wenigen, einfachen Regeln, die wir nun kennen lernen wollen – danach sehen auch die meisten Tabellen besser aus. Außerdem lernen wir noch ein wenig hilfreiches, und Datenbank bezogenen Vokabular kennen.

3.2 Basic- Introduction to working with data using pandas [11:24]

pandas: shape(), info, columns, set_option(›display.max_rows’, 20), python: type() set_option(‘display.max_rows’, 20), drop_duplicates(), df[›category name‹], >, <, &
Available Notebooks: Lecture, Exercise, Solution

Python selbst stellt nur einen sehr begrenzten Umfang an Befehlen zur Daten-Manipulation zur verfügung. Allerdings gibt es sehr viele zusätzliche Befehlspakete – genannt: libraries oder packages – für alles mögliche, auch eines zur Daten-Manipulation. Diese heißt (warum auch immer) ›pandas‹, und dessen Grundzüge lernen wir nun kennen.

Video nb

pip install mag4

Requirement already satisfied: mag4 in /Users/dominik/anaconda3/lib/python3.11/site-packages (0.0.214)
Note: you may need to restart the kernel to use updated packages.

Import pandas

import mag4 as mg
import pandas as pd

Display the data

df = mg.get_data('Mineral Data Example Dataset')

pd.set_option('display.max_rows', df.shape[0])
pd.set_option('display.max_columns', df.shape[1])

print(df.shape)

print(df.columns)

(25, 28)
Index(['mineral', 'oxygens', 'sample', 'no.', 'SiO2', 'TiO2', 'Al2O3', 'Cr2O3',
       'FeO', 'MnO', 'NiO', 'MgO', 'CaO', 'Na2O', 'K2O', 'total oxides', 'Si',
       'Ti', 'Al', 'Cr', 'Fe', 'Mn', 'Ni', 'Mg', 'Ca', 'Na', 'K',
       'total cations'],
      dtype='object')

Select the data of a single category

df['SiO2']

0     42.51
1     42.16
2     52.33
3     51.54
4     57.09
5     59.19
6     58.33
7     53.87
8     47.10
9     47.72
10    42.72
11    42.83
12    40.43
13    42.46
14    45.09
15    46.72
16    46.36
17    47.78
18     0.59
19    35.13
20    35.52
21     3.24
22     0.16
23    51.72
24    32.79
Name: SiO2, dtype: float64

df['mineral'].drop_duplicates()

0          ol
2         cpx
4         opx
8          fs
16    fsp/mes
18         sp
19      ol/sp
Name: mineral, dtype: object

Using a filter

filter = (df['MgO'] > 50) & (df['MgO'] > 55)

df[filter]

	mineral	oxygens	sample	no.	SiO2	TiO2	Al2O3	Cr2O3	FeO	MnO	NiO	MgO	CaO	total oxides	Si	Ti	Al	Cr	Fe	Mn	Mg	Ca	total cations
0	ol	4	chd#2	1	42.51	0.04	0.07	0.04	0.18	0.00	0.00	55.26	1.58	99.68	1.003	0.001	0.002	0.001	0.004	0.000	1.944	0.040	2.995
11	ol	4	chd#3	17	42.83	0.03	0.11	0.16	0.45	0.00	0.01	56.10	0.46	100.15	1.004	0.001	0.003	0.003	0.009	0.000	1.961	0.012	2.992
13	ol	4	chd#3	21	42.46	0.05	0.27	0.10	0.32	0.03	0.01	55.26	0.57	99.07	1.006	0.001	0.008	0.002	0.006	0.001	1.951	0.014	2.989

Pandas is converting the data into a pandas format. What is this pandas data format called? Check your answer with the function type() with df being the variable in this function.

type(df)

pandas.core.frame.DataFrame

What are the numbers in the first column? Is there soemthing special about the first row?

This is the Pandas DataFrame dataset (=row) numbering, starting with the mathematical 0.

Yes, the first row contains the column category names, sometimes also called attributes, not to be confused with attributes as part of a code.

What is the difference between & (and) and | (or)? Start with an example and explain this.

&

fil = (df['SiO2'] > 50) & (df['CaO'] < 10)
df[fil]

	mineral	oxygens	sample	no.	SiO2	TiO2	Al2O3	Cr2O3	FeO	MnO	NiO	MgO	CaO	Na2O	total oxides	Si	Ti	Al	Cr	Fe	Mn	Ni	Mg	Ca	Na	total cations
4	opx	6	chd#2	8	57.09	0.22	0.76	0.48	1.92	0.08	0.03	36.87	0.67	0.01	98.13	1.971	0.006	0.031	0.013	0.055	0.002	0.001	1.897	0.025	0.001	4.002
5	opx	6	chd#2	10	59.19	0.19	0.63	0.51	0.67	0.20	0.03	38.06	0.67	0.01	100.16	1.988	0.005	0.025	0.014	0.019	0.006	0.001	1.906	0.024	0.001	3.988
6	cpx	6	chd#2	11	58.33	0.30	0.98	0.53	0.33	0.14	0.00	34.70	5.02	0.01	100.34	1.977	0.008	0.039	0.014	0.009	0.004	0.000	1.754	0.182	0.001	3.989

Only those datasets are shown for which the SiO2 values are larger than 50 wt% and the CaO values are larger than 10 wt%.

|

fil = (df['SiO2'] > 50) | (df['CaO'] < 10)
df[fil]

	mineral	oxygens	sample	no.	SiO2	TiO2	Al2O3	Cr2O3	FeO	MnO	NiO	MgO	CaO	Na2O	K2O	total oxides	Si	Ti	Al	Cr	Fe	Mn	Ni	Mg	Ca	Na	K	total cations
0	ol	4	chd#2	1	42.51	0.04	0.07	0.04	0.18	0.00	0.00	55.26	1.58	0.00	0.00	99.68	1.003	0.001	0.002	0.001	0.004	0.000	0.000	1.944	0.040	0.000	0.000	2.995
1	ol	4	chd#2	9	42.16	0.10	0.06	0.20	3.14	0.07	0.03	53.81	0.24	0.00	0.00	99.81	1.004	0.002	0.002	0.004	0.063	0.001	0.001	1.910	0.006	0.000	0.000	2.992
2	cpx	6	chd#2	4	52.33	2.00	4.09	0.81	0.32	0.10	0.00	19.44	21.00	0.01	0.01	100.11	1.872	0.054	0.172	0.023	0.010	0.003	0.000	1.037	0.805	0.001	0.000	3.977
3	cpx	6	chd#2	6	51.54	1.74	5.72	0.56	0.67	0.09	0.00	19.12	20.75	0.04	0.00	100.23	1.842	0.047	0.241	0.016	0.020	0.003	0.000	1.019	0.795	0.003	0.000	3.984
4	opx	6	chd#2	8	57.09	0.22	0.76	0.48	1.92	0.08	0.03	36.87	0.67	0.01	0.00	98.13	1.971	0.006	0.031	0.013	0.055	0.002	0.001	1.897	0.025	0.001	0.000	4.002
5	opx	6	chd#2	10	59.19	0.19	0.63	0.51	0.67	0.20	0.03	38.06	0.67	0.01	0.00	100.16	1.988	0.005	0.025	0.014	0.019	0.006	0.001	1.906	0.024	0.001	0.000	3.988
6	cpx	6	chd#2	11	58.33	0.30	0.98	0.53	0.33	0.14	0.00	34.70	5.02	0.01	0.00	100.34	1.977	0.008	0.039	0.014	0.009	0.004	0.000	1.754	0.182	0.001	0.000	3.989
7	cpx	6	chd#2	12	53.87	1.10	3.09	0.51	0.21	0.08	0.00	22.00	19.24	0.00	0.00	100.10	1.911	0.029	0.129	0.014	0.006	0.002	0.000	1.164	0.731	0.000	0.000	3.988
10	ol	4	chd#3	15	42.72	0.05	1.08	0.11	0.37	0.01	0.00	54.32	0.68	0.53	0.00	99.87	1.005	0.001	0.030	0.002	0.007	0.000	0.000	1.904	0.017	0.024	0.000	2.991
11	ol	4	chd#3	17	42.83	0.03	0.11	0.16	0.45	0.00	0.01	56.10	0.46	0.00	0.00	100.15	1.004	0.001	0.003	0.003	0.009	0.000	0.000	1.961	0.012	0.000	0.000	2.992
12	ol	4	chd#3	18	40.43	0.02	0.03	0.11	11.16	0.00	0.01	46.57	0.33	0.01	0.00	98.67	1.009	0.000	0.001	0.002	0.233	0.000	0.000	1.733	0.009	0.000	0.000	2.989
13	ol	4	chd#3	21	42.46	0.05	0.27	0.10	0.32	0.03	0.01	55.26	0.57	0.00	0.00	99.07	1.006	0.001	0.008	0.002	0.006	0.001	0.000	1.951	0.014	0.000	0.000	2.989
18	sp	4	#24 - CAI	74	0.59	0.27	67.71	0.63	5.54	0.03	0.00	23.43	0.14	0.07	0.00	98.41	0.015	0.005	1.972	0.012	0.115	0.001	0.000	0.863	0.004	0.003	0.000	2.990
19	ol/sp	4	#24 - CAI	76	35.13	0.58	14.39	0.30	8.06	0.05	0.00	33.12	4.09	1.04	0.13	96.89	0.886	0.011	0.428	0.006	0.170	0.001	0.000	1.246	0.111	0.051	0.004	2.913
21	sp	4	#24 - CAI	84	3.24	1.36	65.06	0.08	3.05	0.01	0.01	27.71	0.39	0.02	0.00	100.93	0.077	0.024	1.826	0.002	0.061	0.000	0.000	0.984	0.010	0.001	0.000	2.985
22	sp	4	#24 - CAI	85	0.16	0.51	69.36	0.26	2.41	0.00	0.00	25.71	0.08	0.01	0.00	98.50	0.004	0.009	1.988	0.005	0.049	0.000	0.000	0.932	0.002	0.000	0.000	2.990
23	cpx	6	#24 - CAI	75	51.72	0.20	3.27	0.05	0.27	0.02	0.00	17.69	24.25	0.43	0.05	97.95	1.908	0.006	0.142	0.001	0.008	0.001	0.000	0.973	0.959	0.031	0.002	4.031

All datasets with SiO2 values larger than 50 wt% or with CaO values are larger than 10 wt% are shown.
Alternatively stated: All datasets with SiO2 values larger than 50 wt% are shown and all datasets with CaO values larger than 10 wt% are shown.

How can we get only specific datasets from a dataframe?

✓With a filter

✗With the .sel() method

✗With the sel() function

Practise I

Start with importing the pandas library.

Then we want to get some information about the dataset: First we need to read in some data. Use the ‘Mineral Data Example Dataset’ exercise file.
Store the data in the variable df.
Print some information about this file, such as its number of rows, columns.
In addition, use the .info() method to display the data types of the categories. What data type, for example, is MgO?
Then use the set_option('display.max_xxx') with xxx being rows/columns attribute to automatically display all rows and columns of the data.
Finally, display the category names.

Solution

import mag4 as mg
import pandas as pd

df = mg.get_data('Mineral Data Example Dataset')

print(df.shape)
print(df.info())

pd.set_option('display.max_rows', df.shape[0])
pd.set_option('display.max_columns', df.shape[1])

print(df.columns)

(25, 28)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 28 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   mineral        25 non-null     object 
 1   oxygens        25 non-null     int64  
 2   sample         25 non-null     object 
 3   no.            25 non-null     int64  
 4   SiO2           25 non-null     float64
 5   TiO2           25 non-null     float64
 6   Al2O3          25 non-null     float64
 7   Cr2O3          25 non-null     float64
 8   FeO            25 non-null     float64
 9   MnO            25 non-null     float64
 10  NiO            25 non-null     float64
 11  MgO            25 non-null     float64
 12  CaO            25 non-null     float64
 13  Na2O           25 non-null     float64
 14  K2O            25 non-null     float64
 15  total oxides   25 non-null     float64
 16  Si             25 non-null     float64
 17  Ti             25 non-null     float64
 18  Al             25 non-null     float64
 19  Cr             25 non-null     float64
 20  Fe             25 non-null     float64
 21  Mn             25 non-null     float64
 22  Ni             25 non-null     float64
 23  Mg             25 non-null     float64
 24  Ca             25 non-null     float64
 25  Na             25 non-null     float64
 26  K              25 non-null     float64
 27  total cations  25 non-null     float64
dtypes: float64(24), int64(2), object(2)
memory usage: 5.6+ KB
None
Index(['mineral', 'oxygens', 'sample', 'no.', 'SiO2', 'TiO2', 'Al2O3', 'Cr2O3',
       'FeO', 'MnO', 'NiO', 'MgO', 'CaO', 'Na2O', 'K2O', 'total oxides', 'Si',
       'Ti', 'Al', 'Cr', 'Fe', 'Mn', 'Ni', 'Mg', 'Ca', 'Na', 'K',
       'total cations'],
      dtype='object')

Practise II

Extracting data
Extract a subset of from the df data imported above, e.g., all MnO values. Now extract all MnO values larger the detection limit of 0.02 wt%.
Finally, we would like to know which samples are in the dataset. Therefore display all sample names in the dataset, but each no more than once.

Solution

df['MnO']

0     0.00
1     0.07
2     0.10
3     0.09
4     0.08
5     0.20
6     0.14
7     0.08
8     0.00
9     0.03
10    0.01
11    0.00
12    0.00
13    0.03
14    0.00
15    0.00
16    0.00
17    0.00
18    0.03
19    0.05
20    0.02
21    0.01
22    0.00
23    0.02
24    0.03
Name: MnO, dtype: float64

fil = df['MnO'] > 0.02

df[fil]

	mineral	oxygens	sample	no.	SiO2	TiO2	Al2O3	Cr2O3	FeO	MnO	NiO	MgO	CaO	Na2O	K2O	total oxides	Si	Ti	Al	Cr	Fe	Mn	Ni	Mg	Ca	Na	K	total cations
1	ol	4	chd#2	9	42.16	0.10	0.06	0.20	3.14	0.07	0.03	53.81	0.24	0.00	0.00	99.81	1.004	0.002	0.002	0.004	0.063	0.001	0.001	1.910	0.006	0.000	0.000	2.992
2	cpx	6	chd#2	4	52.33	2.00	4.09	0.81	0.32	0.10	0.00	19.44	21.00	0.01	0.01	100.11	1.872	0.054	0.172	0.023	0.010	0.003	0.000	1.037	0.805	0.001	0.000	3.977
3	cpx	6	chd#2	6	51.54	1.74	5.72	0.56	0.67	0.09	0.00	19.12	20.75	0.04	0.00	100.23	1.842	0.047	0.241	0.016	0.020	0.003	0.000	1.019	0.795	0.003	0.000	3.984
4	opx	6	chd#2	8	57.09	0.22	0.76	0.48	1.92	0.08	0.03	36.87	0.67	0.01	0.00	98.13	1.971	0.006	0.031	0.013	0.055	0.002	0.001	1.897	0.025	0.001	0.000	4.002
5	opx	6	chd#2	10	59.19	0.19	0.63	0.51	0.67	0.20	0.03	38.06	0.67	0.01	0.00	100.16	1.988	0.005	0.025	0.014	0.019	0.006	0.001	1.906	0.024	0.001	0.000	3.988
6	cpx	6	chd#2	11	58.33	0.30	0.98	0.53	0.33	0.14	0.00	34.70	5.02	0.01	0.00	100.34	1.977	0.008	0.039	0.014	0.009	0.004	0.000	1.754	0.182	0.001	0.000	3.989
7	cpx	6	chd#2	12	53.87	1.10	3.09	0.51	0.21	0.08	0.00	22.00	19.24	0.00	0.00	100.10	1.911	0.029	0.129	0.014	0.006	0.002	0.000	1.164	0.731	0.000	0.000	3.988
9	fs	8	chd#2	5	47.72	0.06	31.87	0.01	0.71	0.03	0.00	0.58	16.88	2.12	0.00	99.98	2.200	0.002	1.732	0.000	0.027	0.001	0.000	0.040	0.834	0.190	0.000	5.026
13	ol	4	chd#3	21	42.46	0.05	0.27	0.10	0.32	0.03	0.01	55.26	0.57	0.00	0.00	99.07	1.006	0.001	0.008	0.002	0.006	0.001	0.000	1.951	0.014	0.000	0.000	2.989
18	sp	4	#24 - CAI	74	0.59	0.27	67.71	0.63	5.54	0.03	0.00	23.43	0.14	0.07	0.00	98.41	0.015	0.005	1.972	0.012	0.115	0.001	0.000	0.863	0.004	0.003	0.000	2.990
19	ol/sp	4	#24 - CAI	76	35.13	0.58	14.39	0.30	8.06	0.05	0.00	33.12	4.09	1.04	0.13	96.89	0.886	0.011	0.428	0.006	0.170	0.001	0.000	1.246	0.111	0.051	0.004	2.913
24	cpx	6	#24 - CAI	81	32.79	1.35	26.06	0.46	2.53	0.03	0.00	17.46	15.75	0.80	0.20	97.43	1.228	0.038	1.150	0.014	0.079	0.001	0.000	0.975	0.632	0.058	0.010	4.185

df['sample'].drop_duplicates()

0         chd#2
10        chd#3
18    #24 - CAI
Name: sample, dtype: object

3.3 Basics– Introduction to making a plot using matplotlib [10:49]

read_csv(), maptplotlib.pyplot, plt, plot(), scatter(), show(), xlabel(), legend(), xlim()
Available Notebooks: Lecture, Exercise, Solution

Eine weitere library ist die ›matplotlib‹, die praktische alle nur denkbaren Plots und deren noch so komplizierten Darstellungen ermöglicht. In diese nehmen wir nun einen ersten, kleinen Einblick.

Video nb

import mag4 as mg
import pandas as pd
import matplotlib.pyplot as plt

df_minData = mg.get_data('Mineral Data Example Dataset')
df_mydata = mg.get_data('Mineral Data Example Dataset 2')

A simple line plot

el = 'MnO'

plt.plot(df_minData[el])

plt.show()

A simple scatter plot

el1 = 'MgO'
el2 = 'SiO2'

plt.scatter(df_minData[el1], df_minData[el2])

plt.show()

Formating elements

el1 = 'MnO'
el2 = 'SiO2'

plt.scatter(df_minData[el1], df_minData[el2], label = 'Mineral Example Data')
plt.scatter(df_mydata[el1], df_mydata[el2], label = 'Mineral Example Data 2')
plt.xlabel(el1)
plt.ylabel(el2)
plt.legend(loc = 'lower right')
plt.xlim(0, 0.22)
plt.ylim(0, 60)

plt.show()

Briefly explain what a ‘dataframe’ is.

This is the name in pandas for a table with pandas specific formatting, indices, structure, etc.

Describe the difference between a line and a scatter plot.

A line plot has categories on the x-axis and a corresponding value on the y-axis. A category could simply be a number, but also chemical elements, e.g., the REE. The values on the y-axis would then be e.g., their concentrations.
A scatter plot has values on the x- and y-axis, e.g., Mg- and Si- values of a rock or mineral on the x- and y-axes.

How do you call the plotting library we use?

✗import plt

✓import matplotlib.pyplot as plt

✗import matplotlib as plt

✗import pyplot as plt

All data we want to plot from a database need to be in a single file.

✗True

✓False

Practise

Start with importing the pandas and matplotlib libraries.

Then read the example file Mineral Data Example Dataset 3, and store it a variable.
Display the categories and the data table. Use .tolist() to display the categories in a compact form.
Then plot the FeO data on the y-axis and label it with FeO.
Make the code flexibel, so that FeO can be quickly replaced by another oxide.
Then plot two oxides against each other, again flexible, and add a legend.
Finally, add data from the file Mineral Data Example Dataset to the latter plot.

Solution

import mag4 as mg
import pandas as pd
import matplotlib.pyplot as plt

df = mg.get_data('Mineral Data Example Dataset 3')

print(df.columns.tolist())

df

['No.', 'SiO2', 'TiO2', 'Al2O3', 'FeO', 'MnO', 'MgO', 'CaO', 'Na2O', 'K2O', 'P2O5', 'Cr2O3', 'NiO', 'Total', 'Comment']

	No.	SiO2	TiO2	Al2O3	FeO	MnO	MgO	CaO	Na2O	K2O	P2O5	Cr2O3	NiO	Total	Comment
0	1	57.2700	1.6275	17.9100	6.4200	0.1614	2.2800	3.46	4.4600	5.4500	0.8939	0.0026	0.0000	99.9354	GProbe Mitte Grid 010x011 Grid 020x021
1	2	57.2800	1.5939	18.0100	6.4400	0.1372	2.3100	3.49	4.4300	5.4300	0.8622	0.0000	0.0070	99.9903	GProbe Mitte Grid 010x011 Grid 020x021
2	4	57.1600	1.6335	18.0300	6.4500	0.1535	2.3000	3.48	4.4700	5.4300	0.9481	0.0109	0.0163	100.0823	GProbe Mitte Grid 010x011 Grid 020x021
3	5	57.3100	1.6235	17.9800	6.4300	0.1502	2.2700	3.48	4.3800	5.4300	0.9043	0.0000	0.0000	99.9580	GProbe Mitte Grid 010x011 Grid 020x021
4	7	57.3900	1.6314	18.0800	6.5100	0.1466	2.2900	3.51	4.4000	5.4400	0.8663	0.0161	0.0000	100.2804	GProbe Mitte Grid 010x011 Grid 020x021
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
192	287	1.3845	0.0000	0.0045	0.0010	0.0000	1.3767	49.92	0.0713	0.0000	0.0104	0.0070	0.0000	52.7754	GProbe Mitte Grid 010x011 Grid 020x021
193	289	0.7298	0.0082	0.0000	0.0333	0.0000	1.4190	49.99	0.0640	0.0000	0.0128	0.0000	0.0000	52.2571	GProbe Mitte Grid 010x011 Grid 020x021
194	290	0.6210	0.0072	0.0006	0.0060	0.0000	1.3827	50.26	0.0655	0.0000	0.0000	0.0046	0.0000	52.3476	GProbe Mitte Grid 010x011 Grid 020x021
195	292	0.5870	0.0000	0.0000	0.0072	0.0000	1.3888	50.19	0.0542	0.0026	0.0000	0.0000	0.0000	52.2299	GProbe Mitte Grid 010x011 Grid 020x021
196	293	0.6773	0.0000	0.0084	0.0000	0.0000	1.4192	50.44	0.0713	0.0000	0.0002	0.0216	0.0143	52.6523	GProbe Mitte Grid 010x011 Grid 020x021

197 rows × 15 columns

el = 'FeO'

plt.plot(df[el])
plt.ylabel(el)

plt.show()

el1 = 'MnO'
el2 = 'FeO'

plt.scatter(df[el1], df[el2], label = 'Mineral Example Dataset 3')
plt.xlabel(el1)
plt.ylabel(el2)
plt.legend(loc = 'lower right')

plt.show()

df_minData = mg.get_data('Mineral Data Example Dataset')

el1 = 'MnO'
el2 = 'FeO'

plt.scatter(df[el1], df[el2], label = 'Mineral Example Dataset 3')
plt.scatter(df_minData[el1], df_minData[el2], label = 'Mineral Data')
plt.xlabel(el1)
plt.ylabel(el2)
plt.legend(loc = 'upper left')

plt.show()

3.4 Basics- Selecting data using locations and indices [12:36]

.iloc(), .loc(), .set_index()
Available Notebooks: Lecture, Exercise, Solution

Nun wollen wir lernen, wie leicht mit Pandas jegliche gewünschten Daten aus einer Datenbank, also z.B. einer einfachen Excel-Tabelle extrahiert werden können. Das ist besonders wichtig und hilfreich, wenn die Datenbank sehr groß ist, oder wenn man immer andere Datenbanken verwendet, die aber immer anders aufgebaut sind. Anders aufgebaut meint, dass z.B. die Kategorienamen in anderer Reihenfolge stehen, oder evtl. auch etwas anders lauten. Dann müsste bei einem Code mit Pandas nur der entsprechende Kategoriename geändert werden, der Code selbst bliebe gleich.

Video nb

import mag4 as mg
import pandas as pd

df = mg.get_data("Banda Arc")  #could add: , index_col = 'xxx'

#pd.set_option('display.max_rows', df.shape[0])
#pd.set_option('display.max_columns', df.shape[1])

print(df.columns.tolist())

df

['Citations', 'Tectonic Setting', 'Location', 'Location Comment', 'Latitude (Min)', 'Latitude (Max)', 'Longitude (Min)', 'Longitude (Max)', 'Land or Sea', 'Elevation (Min)', 'Elevation (Max)', 'Sample Name', 'Rock Name', 'Age (a, Min)', 'Age (a, Max)', 'Geol', 'Age', 'Eruption Day', 'Eruption Month', 'Eruption Year', 'Rock Texture', 'Rock Type', 'Drill Depth (Min)', 'Drill Depth (Max)', 'Alteration', 'Mineral', 'Material', 'Si', 'Ti', 'B', 'Al', 'Cr', 'Fe3+', 'Fe2+', 'Fetot(2+)', 'Ca', 'Mg', 'Mn', 'Ni', 'K', 'Na', 'P', 'H2O', 'H2OP', 'H2OM', 'H2Otot', 'CO2', 'CO', 'F', 'Cl', 'Cl2', 'OH', 'CH4', 'SO2', 'SO3', 'SO4', 'S', 'LOI', 'Volatiles', 'O', 'Others', 'HE(CCM/G)', 'HE(CCMSTP/G)', 'HE3(CCMSTP/G)', 'HE3(AT/G)', 'HE4(CCM/G)', 'HE4(CCMSTP/G)', 'HE4(AT/G)', 'HE4(MOLE/G)', 'HE4(NCC/G)', 'HE(NCC/G)', 'Li', 'Be', 'B.1', 'C', 'CO2.1', 'F.1', 'Na.1', 'Mg.1', 'Al.1', 'P.1', 'S.1', 'Cl.1', 'K.1', 'Ca.1', 'Sc', 'Ti.1', 'V', 'Cr.1', 'Mn.1', 'Fe', 'Co', 'Ni.1', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Th', 'U', 'Nd143/Nd144', 'Nd143/Nd144 initial', 'e Nd', 'Sr87/Sr86', 'SR87_SR86_INI', 'PB206_PB204', 'PB206_PB204_INI', 'PB207_PB204', 'PB207_PB204_INI', 'PB208_PB204', 'PB208_PB204_INI', 'OS184_OS188', 'OS186_OS188', 'OS187_OS186', 'OS187_OS188', 'RE187_OS186', 'RE187_OS188', 'HF176_HF177', 'HE3_HE4', 'HE3_HE4(R/R(A))', 'HE4_HE3', 'HE4_HE3(R/R(A))', 'K40_AR40', 'AR40_K40', 'Unique Id', 'Unnamed: 171']

	Citations	Tectonic Setting	Location	Location Comment	Latitude (Min)	Latitude (Max)	Longitude (Min)	Longitude (Max)	Land or Sea	Elevation (Min)	Elevation (Max)	Sample Name	Rock Name	Age (a, Min)	...	OS186_OS188	OS187_OS186	OS187_OS188	RE187_OS186	RE187_OS188	HF176_HF177	HE3_HE4	HE3_HE4(R/R(A))	HE4_HE3	HE4_HE3(R/R(A))	K40_AR40	AR40_K40	Unique Id	Unnamed: 171
0	[3085]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / KISU / KISU / INDIAN O...	NaN	-8.38	-8.38	124.41	124.41	SUBAERIAL	NaN	NaN	s BUA [3085]	NOT GIVEN [3085]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3.57	NaN	NaN	NaN	NaN	104872	NaN
1	[3085]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / ROMANG / PACIFIC OCEAN	NaN	-7.55	-7.55	127.42	127.42	SUBAERIAL	NaN	NaN	s RO7B [3085]	NOT GIVEN [3085]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.99	NaN	NaN	NaN	NaN	104873	NaN
2	[3085]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / BANDA / PACIFIC OCEAN	NaN	-4.52	-4.52	129.87	129.87	SUBAERIAL	NaN	NaN	s BA113A [3085]	NOT GIVEN [3085]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3.01	NaN	NaN	NaN	NaN	104874	NaN
3	[3910]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / TERNATE / TERNATE / PA...	NaN	0.80	0.80	127.33	127.33	SUBAERIAL	NaN	NaN	s T7 [3910]	NOT GIVEN [3910]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	11648-T7	NaN
4	[5178]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / TREWEG / PACIFIC OCEAN	NaN	-8.00	-8.00	124.00	124.00	SUBAERIAL	NaN	NaN	s TR1.2.1 [5178]	ANDESITE, BASALTIC [5178]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	120443	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
465	[2708]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / AMBON / BANDA SEA / PA...	NaN	-3.61	-3.61	128.11	128.11	SUBAERIAL	NaN	NaN	s A-4 [2708]	BASALT [2708]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8829-A-4	NaN
466	[2708]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / AMBON / BANDA SEA / PA...	NaN	-3.61	-3.61	128.11	128.11	SUBAERIAL	NaN	NaN	s A-5 [2708]	BASALT [2708]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8829-A-5	NaN
467	[3147]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / AMBON / BANDA SEA / PA...	NaN	-3.61	-3.61	128.11	128.11	SUBAERIAL	NaN	NaN	s AM104A1 [3147]	ANDESITE [3147]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8829-AM104A1	NaN
468	[3140][2708]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / AMBON / BANDA SEA / PA...	NaN	-3.61	-3.61	128.11	128.11	SUBAERIAL	NaN	NaN	s AS-2 [3140]	ANDESITE, BASALTIC [3140] / BASALT [2708]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8829-AS-2	NaN
469	[2708]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / AMBON / BANDA SEA / PA...	NaN	-3.61	-3.61	128.11	128.11	SUBAERIAL	NaN	NaN	s AS-9 [2708]	BASALT [2708]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8829-AS-9	NaN

470 rows × 172 columns

Using .iloc[] (-> integer location)

# General Syntax:  iloc[row_position, column_position]


# just '.tolist()' everything to obtain the desired values only.

# Select a single dataset/row by its index
df.iloc[5]

# Select a sequence of datasets/row by their indices
df.iloc[4:9]   

# Select a single series by its index
df.iloc[:, 5]   

# Select multiple datasets/rows by their index
df.iloc[[4, 8]]

# Select a sequence of datasets/rows by their index
df.iloc[4:8]
df.iloc[4:]
df.iloc[:8]

# Select only values from one sereies, but of only specific rows by their index
df.iloc[[4, 8], 4]

# Select values from multiple datasets/rows and multiple series by their index
df.iloc[[4, 8], [4, 6]]

# Select values from multiple series and a sequence of datasets/rows by their index
df.iloc[4:8, [4, 6]]

# Select values from a sequence of series and a sequence of datasets/rows by their index
df.iloc[4:8, 4:7]

# Select a single value using its index
df.iloc[120, 100]

3.0

Using .loc[] (-> location)

# General Syntax:  loc[row_label, column_label]


# Single value using index and location
df.loc[3, 'Mg']

# The first part only selects datasets indexed 3 and 5, the latter selects the desired series
df.loc[[3, 5], 'Mg']

# The first part only selects datasets indexed 3 and 5, the latter selects the desired series
df.loc[[3, 5], ['Mg', 'Si']]

# The first part selects a sequence of datasets, the latter selects the desired series
df.loc[3:5, ['Mg', 'Si']]

	Mg	Si
3	27378.016	251106.082
4	20322.448	260174.321
5	21106.400	262885.444

Making a series indices to use it in .loc[]

df['Rock Type'].drop_duplicates()

0         VOLCANIC ROCK
47        PLUTONIC ROCK
50     METAMORPHIC ROCK
358     MANTLE XENOLITH
Name: Rock Type, dtype: object

df.set_index('Rock Type', inplace = True)   # this will move this index series to the left
df

	Citations	Tectonic Setting	Location	Location Comment	Latitude (Min)	Latitude (Max)	Longitude (Min)	Longitude (Max)	Land or Sea	Elevation (Min)	Elevation (Max)	Sample Name	Rock Name	Age (a, Min)	...	OS186_OS188	OS187_OS186	OS187_OS188	RE187_OS186	RE187_OS188	HF176_HF177	HE3_HE4	HE3_HE4(R/R(A))	HE4_HE3	HE4_HE3(R/R(A))	K40_AR40	AR40_K40	Unique Id	Unnamed: 171
Rock Type
VOLCANIC ROCK	[3085]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / KISU / KISU / INDIAN O...	NaN	-8.38	-8.38	124.41	124.41	SUBAERIAL	NaN	NaN	s BUA [3085]	NOT GIVEN [3085]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3.57	NaN	NaN	NaN	NaN	104872	NaN
VOLCANIC ROCK	[3085]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / ROMANG / PACIFIC OCEAN	NaN	-7.55	-7.55	127.42	127.42	SUBAERIAL	NaN	NaN	s RO7B [3085]	NOT GIVEN [3085]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.99	NaN	NaN	NaN	NaN	104873	NaN
VOLCANIC ROCK	[3085]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / BANDA / PACIFIC OCEAN	NaN	-4.52	-4.52	129.87	129.87	SUBAERIAL	NaN	NaN	s BA113A [3085]	NOT GIVEN [3085]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3.01	NaN	NaN	NaN	NaN	104874	NaN
VOLCANIC ROCK	[3910]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / TERNATE / TERNATE / PA...	NaN	0.80	0.80	127.33	127.33	SUBAERIAL	NaN	NaN	s T7 [3910]	NOT GIVEN [3910]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	11648-T7	NaN
VOLCANIC ROCK	[5178]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / TREWEG / PACIFIC OCEAN	NaN	-8.00	-8.00	124.00	124.00	SUBAERIAL	NaN	NaN	s TR1.2.1 [5178]	ANDESITE, BASALTIC [5178]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	120443	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
VOLCANIC ROCK	[2708]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / AMBON / BANDA SEA / PA...	NaN	-3.61	-3.61	128.11	128.11	SUBAERIAL	NaN	NaN	s A-4 [2708]	BASALT [2708]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8829-A-4	NaN
VOLCANIC ROCK	[2708]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / AMBON / BANDA SEA / PA...	NaN	-3.61	-3.61	128.11	128.11	SUBAERIAL	NaN	NaN	s A-5 [2708]	BASALT [2708]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8829-A-5	NaN
VOLCANIC ROCK	[3147]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / AMBON / BANDA SEA / PA...	NaN	-3.61	-3.61	128.11	128.11	SUBAERIAL	NaN	NaN	s AM104A1 [3147]	ANDESITE [3147]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8829-AM104A1	NaN
VOLCANIC ROCK	[3140][2708]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / AMBON / BANDA SEA / PA...	NaN	-3.61	-3.61	128.11	128.11	SUBAERIAL	NaN	NaN	s AS-2 [3140]	ANDESITE, BASALTIC [3140] / BASALT [2708]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8829-AS-2	NaN
VOLCANIC ROCK	[2708]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / AMBON / BANDA SEA / PA...	NaN	-3.61	-3.61	128.11	128.11	SUBAERIAL	NaN	NaN	s AS-9 [2708]	BASALT [2708]	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8829-AS-9	NaN

470 rows × 171 columns

# Select all datasets with a specific characteristic in the index series

df.loc["MANTLE XENOLITH"]

df.loc["MANTLE XENOLITH", ['Mg']]

df.loc[["MANTLE XENOLITH"], ['Mg', 'Si']]

	Mg	Si
Rock Type
MANTLE XENOLITH	275589.280	210392.4935
MANTLE XENOLITH	264372.736	212028.5160
MANTLE XENOLITH	248693.696	206372.5525
MANTLE XENOLITH	252734.064	204923.5040
MANTLE XENOLITH	240854.176	207961.8315
MANTLE XENOLITH	241216.000	209130.4190
MANTLE XENOLITH	236572.592	204456.0690
MANTLE XENOLITH	236813.808	207167.1920
MANTLE XENOLITH	222702.672	213290.5905
MANTLE XENOLITH	212752.512	195855.2650
MANTLE XENOLITH	220411.120	201978.6635

df.reset_index(inplace = True)
df

	Rock Type	Citations	Tectonic Setting	Location	Location Comment	Latitude (Min)	Latitude (Max)	Longitude (Min)	Longitude (Max)	Land or Sea	Elevation (Min)	Elevation (Max)	Sample Name	Rock Name	...	OS186_OS188	OS187_OS186	OS187_OS188	RE187_OS186	RE187_OS188	HF176_HF177	HE3_HE4	HE3_HE4(R/R(A))	HE4_HE3	HE4_HE3(R/R(A))	K40_AR40	AR40_K40	Unique Id	Unnamed: 171
0	VOLCANIC ROCK	[3085]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / KISU / KISU / INDIAN O...	NaN	-8.38	-8.38	124.41	124.41	SUBAERIAL	NaN	NaN	s BUA [3085]	NOT GIVEN [3085]	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3.57	NaN	NaN	NaN	NaN	104872	NaN
1	VOLCANIC ROCK	[3085]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / ROMANG / PACIFIC OCEAN	NaN	-7.55	-7.55	127.42	127.42	SUBAERIAL	NaN	NaN	s RO7B [3085]	NOT GIVEN [3085]	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.99	NaN	NaN	NaN	NaN	104873	NaN
2	VOLCANIC ROCK	[3085]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / BANDA / PACIFIC OCEAN	NaN	-4.52	-4.52	129.87	129.87	SUBAERIAL	NaN	NaN	s BA113A [3085]	NOT GIVEN [3085]	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3.01	NaN	NaN	NaN	NaN	104874	NaN
3	VOLCANIC ROCK	[3910]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / TERNATE / TERNATE / PA...	NaN	0.80	0.80	127.33	127.33	SUBAERIAL	NaN	NaN	s T7 [3910]	NOT GIVEN [3910]	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	11648-T7	NaN
4	VOLCANIC ROCK	[5178]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / TREWEG / PACIFIC OCEAN	NaN	-8.00	-8.00	124.00	124.00	SUBAERIAL	NaN	NaN	s TR1.2.1 [5178]	ANDESITE, BASALTIC [5178]	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	120443	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
465	VOLCANIC ROCK	[2708]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / AMBON / BANDA SEA / PA...	NaN	-3.61	-3.61	128.11	128.11	SUBAERIAL	NaN	NaN	s A-4 [2708]	BASALT [2708]	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8829-A-4	NaN
466	VOLCANIC ROCK	[2708]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / AMBON / BANDA SEA / PA...	NaN	-3.61	-3.61	128.11	128.11	SUBAERIAL	NaN	NaN	s A-5 [2708]	BASALT [2708]	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8829-A-5	NaN
467	VOLCANIC ROCK	[3147]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / AMBON / BANDA SEA / PA...	NaN	-3.61	-3.61	128.11	128.11	SUBAERIAL	NaN	NaN	s AM104A1 [3147]	ANDESITE [3147]	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8829-AM104A1	NaN
468	VOLCANIC ROCK	[3140][2708]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / AMBON / BANDA SEA / PA...	NaN	-3.61	-3.61	128.11	128.11	SUBAERIAL	NaN	NaN	s AS-2 [3140]	ANDESITE, BASALTIC [3140] / BASALT [2708]	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8829-AS-2	NaN
469	VOLCANIC ROCK	[2708]	CONVERGENT MARGIN	BANDA ARC / BANDA ARC / AMBON / BANDA SEA / PA...	NaN	-3.61	-3.61	128.11	128.11	SUBAERIAL	NaN	NaN	s AS-9 [2708]	BASALT [2708]	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8829-AS-9	NaN

470 rows × 172 columns

What is the differnce between index and series name?

Index is an integer number that can be used to refer to a row or column. A series is a pandas object, more specific, a single column with still pandas formating. More than two series togehter then constitute the pandas dataframe.

Q1. Two methods to select data are called ______

✓.iloc[] and .loc[]

✗.integerloc[] and .location[]

✗.intergerlocation[] and .location[]

Pandas is a library.

✓True

✗False

Practise

Start with reading the file West African Craton into a variable and displaying the categories in a compact form.
Display the table, but with only 5 rows and columns.
Select the first 4 dataset.
Select the series number 4, 17, and 99.
What types of rock (-> ‘Rock Type’) are represented in the database?
Select all metamorphick rocks. (This is an already tricky one.)
Again, select all metamorphic rocks, but only display the series ‘Mg’ and ‘Ti’.
Reset the indices first, then select datasets 10 throughout and including 20, but only with the columns MgO and Cr2O3.

Solution

import mag4 as mg
import pandas as pd

df = mg.get_data('West African Craton')

pd.set_option('display.max_rows', 5)
pd.set_option('display.max_columns', 5)

print(df.columns.tolist())
df

['Citations', 'Tectonic Setting', 'Location', 'Location Comment', 'Latitude (Min)', 'Latitude (Max)', 'Longitude (Min)', 'Longitude (Max)', 'Land or Sea', 'Elevation (Min)', 'Elevation (Max)', 'Sample Name', 'Rock Name', 'Age (a, Min)', 'Age (a, Max)', 'Geol', 'Age', 'Eruption Day', 'Eruption Month', 'Eruption Year', 'Rock Texture', 'Rock Type', 'Drill Depth (Min)', 'Drill Depth (Max)', 'Alteration', 'Mineral', 'Material', 'Si', 'Ti', 'B', 'Al', 'Cr', 'Fe3+', 'Fe2+', 'Fetot(2+)', 'Ca', 'Mg', 'Mn', 'Ni', 'K', 'Na', 'P', 'H2O', 'H2OP', 'H2OM', 'H2Otot', 'CO2', 'CO', 'F', 'Cl', 'Cl2', 'OH', 'CH4', 'SO2', 'SO3', 'SO4', 'S', 'LOI', 'Volatiles', 'O', 'Others', 'HE(CCM/G)', 'HE(CCMSTP/G)', 'HE3(CCMSTP/G)', 'HE3(AT/G)', 'HE4(CCM/G)', 'HE4(CCMSTP/G)', 'HE4(AT/G)', 'HE4(MOLE/G)', 'HE4(NCC/G)', 'HE(NCC/G)', 'Li', 'Be', 'B.1', 'C', 'CO2.1', 'F.1', 'Na.1', 'Al.1', 'P.1', 'S.1', 'Cl.1', 'K.1', 'Ca.1', 'Sc', 'Ti.1', 'V', 'Cr.1', 'Mn.1', 'Fe', 'Co', 'Ni.1', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Th', 'U', 'Nd143/Nd144', 'Nd143/Nd144 initial', 'e Nd', 'Sr87/Sr86', 'SR87_SR86_INI', 'PB206_PB204', 'PB206_PB204_INI', 'PB207_PB204', 'PB207_PB204_INI', 'PB208_PB204', 'PB208_PB204_INI', 'OS184_OS188', 'OS186_OS188', 'OS187_OS186', 'OS187_OS188', 'RE187_OS186', 'RE187_OS188', 'HF176_HF177', 'HE3_HE4', 'HE3_HE4(R/R(A))', 'HE4_HE3', 'HE4_HE3(R/R(A))', 'K40_AR40', 'AR40_K40', 'Unique Id']

	Citations	Tectonic Setting	...	AR40_K40	Unique Id
0	[5205]	ARCHEAN CRATON (INCLUDING GREENSTONE BELTS)	...	NaN	184947
1	[5205]	ARCHEAN CRATON (INCLUDING GREENSTONE BELTS)	...	NaN	184948
...	...	...	...	...	...
41	[8624]	ARCHEAN CRATON (INCLUDING GREENSTONE BELTS)	...	NaN	247031
42	[8624]	ARCHEAN CRATON (INCLUDING GREENSTONE BELTS)	...	NaN	247032

43 rows × 170 columns

df.iloc[:4]

	Citations	Tectonic Setting	...	AR40_K40	Unique Id
0	[5205]	ARCHEAN CRATON (INCLUDING GREENSTONE BELTS)	...	NaN	184947
1	[5205]	ARCHEAN CRATON (INCLUDING GREENSTONE BELTS)	...	NaN	184948
2	[5205]	ARCHEAN CRATON (INCLUDING GREENSTONE BELTS)	...	NaN	184949
3	[5205]	ARCHEAN CRATON (INCLUDING GREENSTONE BELTS)	...	NaN	184950

4 rows × 170 columns

df.iloc[:, [4, 17, 99]]

	Latitude (Min)	Eruption Day	Rb
0	8.5	NaN	0.0
1	8.5	NaN	0.0
...	...	...	...
41	9.5	NaN	4.0
42	9.5	NaN	39.0

43 rows × 3 columns

df['Rock Type'].drop_duplicates()

0        VOLCANIC ROCK
23    METAMORPHIC ROCK
38       PLUTONIC ROCK
Name: Rock Type, dtype: object

df.set_index('Rock Type', inplace = True)
# plus the next line

df.loc['METAMORPHIC ROCK']

	Citations	Tectonic Setting	...	AR40_K40	Unique Id
Rock Type
METAMORPHIC ROCK	[8624]	ARCHEAN CRATON (INCLUDING GREENSTONE BELTS)	...	NaN	247013
METAMORPHIC ROCK	[8624]	ARCHEAN CRATON (INCLUDING GREENSTONE BELTS)	...	NaN	247014
...	...	...	...	...	...
METAMORPHIC ROCK	[8624]	ARCHEAN CRATON (INCLUDING GREENSTONE BELTS)	...	NaN	247031
METAMORPHIC ROCK	[8624]	ARCHEAN CRATON (INCLUDING GREENSTONE BELTS)	...	NaN	247032

19 rows × 169 columns

df.loc['METAMORPHIC ROCK', ['Mg', 'Ti']]

	Mg	Ti
Rock Type
METAMORPHIC ROCK	48243.200	6592.740
METAMORPHIC ROCK	34373.280	10788.120
...	...	...
METAMORPHIC ROCK	151845.472	659.274
METAMORPHIC ROCK	246643.360	1618.218

19 rows × 2 columns

df.reset_index(inplace = True)
df
# plus the next one

	Rock Type	Citations	...	AR40_K40	Unique Id
0	VOLCANIC ROCK	[5205]	...	NaN	184947
1	VOLCANIC ROCK	[5205]	...	NaN	184948
...	...	...	...	...	...
41	METAMORPHIC ROCK	[8624]	...	NaN	247031
42	METAMORPHIC ROCK	[8624]	...	NaN	247032

43 rows × 170 columns

df.loc[10:21, ['Si', 'Fe', 'Mn']]

	Si	Fe	Mn
10	226705.9750	NaN	1626.3618
11	234699.1135	NaN	1394.0244
...	...	...	...
20	230866.1465	NaN	1626.3618
21	249703.7770	NaN	1936.1450

12 rows × 3 columns

3.5 Basics- Plotting selected data [03:03]

Available Notebooks: Lecture, Exercise, Solution

Zum Schluss dieser Einheit wollen wir die geballte Kraft von Pandas mit der von maplotlib kombinieren. D.h., wir selektieren zunächst mit Pandas die gewünschten Daten aus einer Datenbank, und visualisieren diese umgehend und einfach mit Hilfe der matplotlib Bibliothek.

Video nb

import mag4 as mg
import pandas as pd
import matplotlib.pyplot as plt

df = mg.get_data('Banda Arc')

data = df.loc[3:9, ['Mg', 'Si', 'Mn']]

el1 = 'Mg'

plt.scatter(data[el1], data['Mn'])
plt.xlabel(el1)

Text(0.5, 0, 'Mg')

Which two libraries need to be imported when we want to plot a database?

pandas and matplotlib.pyplot

Pandas series need to be converted into a plot compatible format before used.

✗True

✓False

Practise

Import the 2 databases we require for plotting, plus the one to load data.
Then, we need a database, which shall be West African Craton, and as usual we want to display some sensible information.
Then display datasets 5 through 25, and flexibly plot two elements against each other that are also shown as lables on the respective axes, as well as a legend indicating what data are shown.
The data are in wt-ppm. Change this to wt%.
If we want to concatenate strings, we can simply do this using + between two strings, e.g.: 'a' + 'b', which will result in 'ab'. Try using this to add units to the axes labelling.

Solution

import mag4 as mg
import pandas as pd
import matplotlib.pyplot as plt

df = mg.get_data('West African Craton')

print(df.columns.tolist())

df

['Citations', 'Tectonic Setting', 'Location', 'Location Comment', 'Latitude (Min)', 'Latitude (Max)', 'Longitude (Min)', 'Longitude (Max)', 'Land or Sea', 'Elevation (Min)', 'Elevation (Max)', 'Sample Name', 'Rock Name', 'Age (a, Min)', 'Age (a, Max)', 'Geol', 'Age', 'Eruption Day', 'Eruption Month', 'Eruption Year', 'Rock Texture', 'Rock Type', 'Drill Depth (Min)', 'Drill Depth (Max)', 'Alteration', 'Mineral', 'Material', 'Si', 'Ti', 'B', 'Al', 'Cr', 'Fe3+', 'Fe2+', 'Fetot(2+)', 'Ca', 'Mg', 'Mn', 'Ni', 'K', 'Na', 'P', 'H2O', 'H2OP', 'H2OM', 'H2Otot', 'CO2', 'CO', 'F', 'Cl', 'Cl2', 'OH', 'CH4', 'SO2', 'SO3', 'SO4', 'S', 'LOI', 'Volatiles', 'O', 'Others', 'HE(CCM/G)', 'HE(CCMSTP/G)', 'HE3(CCMSTP/G)', 'HE3(AT/G)', 'HE4(CCM/G)', 'HE4(CCMSTP/G)', 'HE4(AT/G)', 'HE4(MOLE/G)', 'HE4(NCC/G)', 'HE(NCC/G)', 'Li', 'Be', 'B.1', 'C', 'CO2.1', 'F.1', 'Na.1', 'Al.1', 'P.1', 'S.1', 'Cl.1', 'K.1', 'Ca.1', 'Sc', 'Ti.1', 'V', 'Cr.1', 'Mn.1', 'Fe', 'Co', 'Ni.1', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Th', 'U', 'Nd143/Nd144', 'Nd143/Nd144 initial', 'e Nd', 'Sr87/Sr86', 'SR87_SR86_INI', 'PB206_PB204', 'PB206_PB204_INI', 'PB207_PB204', 'PB207_PB204_INI', 'PB208_PB204', 'PB208_PB204_INI', 'OS184_OS188', 'OS186_OS188', 'OS187_OS186', 'OS187_OS188', 'RE187_OS186', 'RE187_OS188', 'HF176_HF177', 'HE3_HE4', 'HE3_HE4(R/R(A))', 'HE4_HE3', 'HE4_HE3(R/R(A))', 'K40_AR40', 'AR40_K40', 'Unique Id']

	Citations	Tectonic Setting	...	AR40_K40	Unique Id
0	[5205]	ARCHEAN CRATON (INCLUDING GREENSTONE BELTS)	...	NaN	184947
1	[5205]	ARCHEAN CRATON (INCLUDING GREENSTONE BELTS)	...	NaN	184948
...	...	...	...	...	...
41	[8624]	ARCHEAN CRATON (INCLUDING GREENSTONE BELTS)	...	NaN	247031
42	[8624]	ARCHEAN CRATON (INCLUDING GREENSTONE BELTS)	...	NaN	247032

43 rows × 170 columns

el1 = 'Ti'
el2 = 'Al'

xData = df.loc[5:26, el1]
yData = df.loc[5:26, el2]

plt.scatter(xData, yData, label = 'W-Africa Craton')
plt.xlabel(el1)
plt.ylabel(el2)
plt.legend(loc = 'lower right')

plt.show()

el1 = 'Ti'
el2 = 'Al'

xData = df.loc[5:26, el1] / 10000
yData = df.loc[5:26, el2] / 10000

plt.scatter(xData, yData, label = 'W-Africa Craton')
plt.xlabel(el1 + ' (wt%)')
plt.ylabel(el2 + ' (wt%)')
plt.legend(loc = 'lower right')

plt.show()