Pandas Text Processing

Pandas text processing operations with examples

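In this chapter, we discuss string operations using basic Series / Index objects. In later chapters, we will learn how to apply these string functions to a DataFrame.

Pandas provides a set of string functions that make it easy to operate on string data. Most importantly, these functions skip missing (NaN) values rather than failing on them.

Almost all of these methods correspond to Python's built-in string methods (see: https://docs.python.org/3/library/stdtypes.html#string-methods). They are reached through the .str accessor of a Series or Index, which treats each element as a string and applies the operation to it.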
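
As a minimal sketch of the two points above (the Series contents and variable names are illustrative, not taken from the examples in this chapter), the following compares a .str method with the plain Python string method it mirrors, and shows that a missing value stays missing instead of raising an error:

```python
import numpy as np
import pandas as pd

# Illustrative data: two names and one missing value.
names = pd.Series(["Tom", np.nan, "John"])

# The .str accessor applies the string method to every element.
upper_names = names.str.upper()

# The element-wise result matches Python's own str.upper().
first_matches = upper_names.iloc[0] == "Tom".upper()

# The missing entry is skipped: it is still missing in the result.
missing_mask = upper_names.isna().tolist()

print(upper_names)
print(first_matches, missing_mask)
```

The same accessor is available on an Index, so the calls shown in the rest of this chapter apply to index labels as well.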

Let's see how each operation works.

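Method	Description
lower()	Converts the strings in the Series/Index to lowercase.
upper()	Converts the strings in the Series/Index to uppercase.
len()	Computes the length of each string.
strip()	Strips whitespace (including newlines) from both sides of each string in the Series/Index.
split(' ')	Splits each string on the given pattern.
cat(sep=' ')	Concatenates the Series/Index elements with the given separator.
get_dummies()	Returns a DataFrame of one-hot encoded values.
contains(pattern)	Returns True for each element that contains the pattern, otherwise False.
replace(a,b)	Replaces the value a with the value b.
repeat(value)	Repeats each element the specified number of times.
count(pattern)	Returns the number of times the pattern occurs in each element.
startswith(pattern)	Returns True if the element in the Series/Index starts with the pattern.
endswith(pattern)	Returns True if the element in the Series/Index ends with the pattern.
find(pattern)	Returns the position of the first occurrence of the pattern, or -1 if it is not found.
findall(pattern)	Returns a list of all occurrences of the pattern.
swapcase()	Swaps uppercase and lowercase.
islower()	Checks whether all characters in each string in the Series/Index are lowercase. Returns a Boolean.
isupper()	Checks whether all characters in each string in the Series/Index are uppercase. Returns a Boolean.
isnumeric()	Checks whether all characters in each string in the Series/Index are numeric. Returns a Boolean.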

Let's create a Series and see how all of the above functions work.

Example

 import pandas as pd
 import numpy as np
 # A Series of names that also contains a missing value and a digit-only string
 s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
 print(s)

Output:

 0             Tom
 1    William Rick
 2            John
 3         Alber@t
 4             NaN
 5            1234
 6      SteveSmith
 dtype: object

lower()

Example

 import pandas as pd
 import numpy as np
 s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
 print(s.str.lower())

Output:

 0             tom
 1    william rick
 2            john
 3         alber@t
 4             NaN
 5            1234
 6      stevesmith
 dtype: object

upper()

Example

 import pandas as pd
 import numpy as np
 s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
 print(s.str.upper())

Output:

 0             TOM
 1    WILLIAM RICK
 2            JOHN
 3         ALBER@T
 4             NaN
 5            1234
 6      STEVESMITH
 dtype: object

len()

Example

 import pandas as pd
 import numpy as np
 s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
 # The missing entry has no length, so the result is float64 with NaN
 print(s.str.len())

Output:

 0     3.0
 1    12.0
 2     4.0
 3     7.0
 4     NaN
 5     4.0
 6    10.0
 dtype: float64

strip()

Example

 import pandas as pd
 # 'Tom ' has a trailing space and ' William Rick' has a leading space
 s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
 print(s)
 print("After Stripping:")
 print(s.str.strip())

Output:

 0             Tom 
 1     William Rick
 2             John
 3          Alber@t
 dtype: object
 After Stripping:
 0             Tom
 1    William Rick
 2            John
 3         Alber@t
 dtype: object
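
Because leading and trailing spaces are hard to see in printed output, one way to confirm what strip() did is to compare string lengths before and after. This is a small sketch using the same Series as above; the variable names are illustrative:

```python
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

# Lengths including the surrounding spaces.
before = s.str.len().tolist()

# Lengths after stripping whitespace from both ends.
after = s.str.strip().str.len().tolist()

print(before)  # [4, 13, 4, 7]
print(after)   # [3, 12, 4, 7]
```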

split(pattern)

Example

 import pandas as pd
 s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
 print(s)
 print("Split Pattern:")
 # Splitting on a single space leaves empty strings where spaces lead or trail
 print(s.str.split(' '))

Output:

 0             Tom 
 1     William Rick
 2             John
 3          Alber@t
 dtype: object
 Split Pattern:
 0              [Tom, ]
 1    [, William, Rick]
 2               [John]
 3            [Alber@t]
 dtype: object
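
When the strings contain stray spaces, splitting on a single space produces empty strings in the result. Calling split() with no pattern splits on runs of whitespace instead and drops the empty pieces. A short sketch with the same Series (variable names are illustrative):

```python
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

# Splitting on one space keeps the empty strings around stray spaces.
on_space = [list(parts) for parts in s.str.split(' ')]

# With no pattern, split() separates on any run of whitespace.
on_whitespace = [list(parts) for parts in s.str.split()]

print(on_space)
print(on_whitespace)
```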

cat(sep=pattern)

Example

 import pandas as pd
 s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
 # Join all elements into a single string, separated by '_'
 print(s.str.cat(sep='_'))

Output:

 Tom _ William Rick_John_Alber@t
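
How cat() treats missing values is easy to overlook. In the sketch below (illustrative data, not from the example above), a missing element is left out of the joined string by default, and the na_rep parameter supplies a placeholder for it instead:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom', np.nan, 'John'])

# By default, missing values are left out of the joined string.
joined = s.str.cat(sep='_')

# na_rep supplies a placeholder for missing values instead.
joined_with_placeholder = s.str.cat(sep='_', na_rep='?')

print(joined)                   # Tom_John
print(joined_with_placeholder)  # Tom_?_John
```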

get_dummies()

Example

 import pandas as pd
 s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
 # One indicator column per distinct string
 print(s.str.get_dummies())

Output:

     William Rick  Alber@t  John  Tom 
 0              0        0     0     1
 1              1        0     0     0
 2              0        0     1     0
 3              0        1     0     0
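
get_dummies() is most useful when each string holds several labels joined by a separator, which defaults to '|'. The sketch below uses made-up tag strings (not from the example above) to show one indicator column per distinct label:

```python
import pandas as pd

# Illustrative data: each element lists one or more tags separated by '|'.
tags = pd.Series(['a|b', 'a', 'b|c'])

# One column per distinct tag; 1 marks that the tag appears in that row.
dummies = tags.str.get_dummies(sep='|')

print(dummies)
```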

contains(pattern)

Example

 import pandas as pd
 s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
 # True where the string contains a space
 print(s.str.contains(' '))

Output:

 0     True
 1     True
 2    False
 3    False
 dtype: bool
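
Two details of contains() are worth a sketch: the pattern is interpreted as a regular expression by default, and a missing value produces a missing result unless the na parameter supplies a fill value. The data below is illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom', 'Alber@t', np.nan, 'a.b'])

# na=False turns the missing entry into False, giving a clean boolean mask.
has_at = s.str.contains('@', na=False)
selected = s[has_at].tolist()
print(selected)  # ['Alber@t']

# As a regular expression, '.' matches any character.
as_regex = s.str.contains('.', na=False).tolist()

# With regex=False, '.' is matched as a literal dot.
as_literal = s.str.contains('.', regex=False, na=False).tolist()

print(as_regex)    # [True, True, False, True]
print(as_literal)  # [False, False, False, True]
```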

replace(a,b)

Example

 import pandas as pd
 s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
 print(s)
 print("After replacing @ with $:")
 print(s.str.replace('@', '$'))

Output:

 0             Tom 
 1     William Rick
 2             John
 3          Alber@t
 dtype: object
 After replacing @ with $:
 0             Tom 
 1     William Rick
 2             John
 3          Alber$t
 dtype: object
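
Whether the first argument is treated as a literal string or as a regular expression is controlled by the regex parameter. Its default has differed between pandas versions, so passing it explicitly avoids surprises. A small sketch with illustrative data:

```python
import pandas as pd

s = pd.Series(['a.b', 'Alber@t'])

# regex=False: only the literal '.' is replaced.
literal = s.str.replace('.', '-', regex=False).tolist()

# regex=True: the character class matches either '.' or '@'.
pattern = s.str.replace(r'[.@]', '_', regex=True).tolist()

print(literal)  # ['a-b', 'Alber@t']
print(pattern)  # ['a_b', 'Alber_t']
```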

repeat(value)

Example

 import pandas as pd
 s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
 # Repeat every string twice
 print(s.str.repeat(2))

Output:

 0                      Tom Tom 
 1     William Rick William Rick
 2                      JohnJohn
 3                Alber@tAlber@t
 dtype: object

count(pattern)

Example

 import pandas as pd
 s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
 print("Number of 'm' characters in each string:")
 print(s.str.count('m'))

Output:

 Number of 'm' characters in each string:
 0    1
 1    1
 2    0
 3    0
 dtype: int64
startswith(pattern)

Example

 import pandas as pd
 s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
 print("Strings that start with 'T':")
 print(s.str.startswith('T'))

Output:

 Strings that start with 'T':
 0     True
 1    False
 2    False
 3    False
 dtype: bool

endswith(pattern)

Example

 import pandas as pd
 s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
 print("Strings that end with 't':")
 print(s.str.endswith('t'))

Output:

 Strings that end with 't':
 0    False
 1    False
 2    False
 3     True
 dtype: bool

find(pattern)

Example

 import pandas as pd
 s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
 # Position of the first 'e' in each string
 print(s.str.find('e'))

Output:

 0   -1
 1   -1
 2   -1
 3    3
 dtype: int64

A value of -1 means that the pattern was not found in the element.

findall(pattern)

Example

 import pandas as pd
 s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
 # List of every 'e' found in each string
 print(s.str.findall('e'))

Output:

 0     []
 1     []
 2     []
 3    [e]
 dtype: object

An empty list ([]) means that no match was found in the element.

swapcase()

Example

 import pandas as pd
 s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
 print(s.str.swapcase())

Output:

 0             tOM
 1    wILLIAM rICK
 2            jOHN
 3         aLBER@T
 dtype: object

islower()

Example

 import pandas as pd
 s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
 print(s.str.islower())

Output:

 0    False
 1    False
 2    False
 3    False
 dtype: bool

isupper()

Example

 import pandas as pd
 s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
 print(s.str.isupper())

Output:

 0    False
 1    False
 2    False
 3    False
 dtype: bool

isnumeric()

Example

 import pandas as pd
 s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
 print(s.str.isnumeric())

Output:

 0    False
 1    False
 2    False
 3    False
 dtype: bool
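
Every string in the Series used for islower(), isupper(), and isnumeric() is mixed-case and non-numeric, so all three checks return False for every element. The following sketch uses illustrative values chosen so that each check returns True for one element:

```python
import pandas as pd

# Illustrative data: all-lowercase, all-uppercase, mixed-case,
# digits only, and digits with a decimal point.
s = pd.Series(['tom', 'TOM', 'Tom', '1234', '12.5'])

lower_flags = s.str.islower().tolist()
upper_flags = s.str.isupper().tolist()
numeric_flags = s.str.isnumeric().tolist()

print(lower_flags)    # [True, False, False, False, False]
print(upper_flags)    # [False, True, False, False, False]
print(numeric_flags)  # [False, False, False, True, False]
```

Note that '12.5' is not considered numeric, because the decimal point is not a numeric character.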
